Can't make Ant write proper version info with unicode (c) character

Question

After upgrading Ant from 1.6 to 1.8.3, the version-info resources of Windows DLLs built with Ant became corrupted.

Previously this value was properly saved to the version-info resource:

product.copyright=\u00a9 Copyright 20xx-20xx yyyyyyyyyy \u2122 (so (c) and TM symbols were properly displayed).

After the upgrade, Ant's default encoding changed to UTF-8, which is expected, but the copyright string now looks like this:

Â© Copyright 20xx-20xx yyyyyy â„¢

This is not a console issue: I checked with a hex editor and the Windows File Properties dialog, and both show the corrupted string.

Looking at a hex dump of the file, I see the following (obviously incorrect) mapping:

\u00a9 -> 0x00c2 0x00a9
\u2122 -> 0x00e2 0x201e 0x00a2

The problem appears to be that Ant takes the UTF-8 bytes (rather than the Unicode string), turns each byte into a separate 16-bit character, and writes the result to the version-info resource.
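
The mapping above is consistent with the UTF-8 bytes of each character being decoded one byte at a time with a single-byte Windows code page. A minimal Python sketch reproduces it, assuming the intermediate code page is Windows-1252. That is an inference from the 0x201e value in the dump, not something confirmed from Ant's source, and the helper names are hypothetical:

```python
def mojibake(text, wrong_codec="cp1252"):
    """Encode text as UTF-8, then decode those bytes with the wrong single-byte codec.

    wrong_codec defaults to cp1252, which is an assumption about what happened.
    """
    return text.encode("utf-8").decode(wrong_codec)


def code_units(text):
    """Format each character as a 16-bit hex value, like the dump in the question."""
    return [f"0x{ord(ch):04x}" for ch in text]


def repair(garbled, wrong_codec="cp1252"):
    """Reverse the damage: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_codec).decode("utf-8")


if __name__ == "__main__":
    for original in ("\u00a9", "\u2122"):
        garbled = mojibake(original)
        print(f"U+{ord(original):04X} -> {' '.join(code_units(garbled))}")
        # The round trip only works if no byte was lost along the way.
        assert repair(garbled) == original
```

The `repair` round trip only illustrates that the damage is reversible in principle; it is not a fix for the build itself.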

Although this looks like a bug in Ant, has anyone found a workaround for this or a similar problem?

Here are some snippets from the script.

Project properties file:

...
product.copyright=(c) Copyright 2005-2012 Clarabridge
....

Files included in build.xml:

<versioninfo id="current-version" if="is-windows"
    fileversion="${product.version}"
    productversion="${product.version}"
    compatibilityversion="1"
    legalcopyright="${product.copyright}"
    companyname="${product.company}"
    filedescription="${ant.project.name}"
    productname="${ant.project.name}"
/>
...
<cc objdir="${target.dir}/${target.platform}/obj"
    outfile="${target.dir}/${target.platform}/${ant.project.name}"
    subsystem="other"
    failonerror="true"
    incremental="false"
    outtype="shared"
    runtime="dynamic"
>
    <versioninfo refid="current-version" />
    <compiler refid="compiler-shared-${target.platform}" />
    <compiler refid="rc-compiler" />
    <linker extends="linker-${target.platform}">
        <libset dir="${target.dir}/${target.platform}/lib" libs="${lib.list}" />
    </linker>

    <fileset dir="${src.dir}" casesensitive="false">
        <include name="*.cpp"/>
    </fileset>
</cc>

Did you try starting Ant with ant -Dfile.encoding=utf8? Perhaps your console encoding is off. — oers, Commented May 10, 2012 at 7:25
This parameter is set by default for me; also, I'm not looking at the console output, I'm looking at the DLL binary and the Windows file properties. — Alex Z, Commented May 10, 2012 at 7:28
Okay :) So you have an Ant task that creates those DLLs (maybe add it to the question)? — oers, Commented May 10, 2012 at 7:38
@Fahrenheit2539 You should post the script (the snippet that uses the resource). — Alex K, Commented May 10, 2012 at 11:42

tchrist · Accepted Answer · 2012-05-10 12:37:59Z

Your bug is that something is misinterpreting the UTF-8 bytes as 8-bit characters!

BTW, Java doesn’t use 16-bit characters; that would be UCS-2. Java uses UTF-16, which is just as much a variable-width encoding as UTF-8 is. Distressing how many Java programmers screw this up!

UTF-8 has 8-bit code units where UTF-16 has 16-bit code units; neither one supports an “8-bit character” or a “16-bit character”. If you catch yourself writing code that thinks they do, you’ve just written buggy code.
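
The difference between characters and code units is easy to check. A small Python sketch (the function name is made up for illustration) counts how many code units one code point needs in each encoding:

```python
def code_unit_counts(ch):
    """Return (UTF-8 code units, UTF-16 code units) for a single code point."""
    utf8_units = len(ch.encode("utf-8"))            # one unit = 8 bits
    utf16_units = len(ch.encode("utf-16-le")) // 2  # one unit = 16 bits, no BOM
    return utf8_units, utf16_units


if __name__ == "__main__":
    # An ASCII letter, the two symbols from the question, and a non-BMP emoji.
    for ch in ("A", "\u00a9", "\u2122", "\U0001F600"):
        print(f"U+{ord(ch):04X}: {code_unit_counts(ch)}")
```

Both symbols from the question fit in one UTF-16 code unit, but a code point outside the Basic Multilingual Plane needs two, which is why neither encoding can be treated as fixed-width.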

Your output is the result of erroneously displaying UTF-8 as though it were in a single-byte encoding such as Latin1 (the 0x201e in your dump points to Windows-1252 specifically), which does use 8-bit characters. You, however, do not.

The problem is that not a single line of the code involved was written by me, so for now I have filed a bug against Ant. We'll see whether it gets confirmed.
– Alex Z
Commented May 10, 2012 at 14:41
@Fahrenheit2539 Is your terminal program set to display UTF-8? Betcha anything that’s your only problem. Windows is super-bad at Unicode. I imagine your file is perfectly correct. You just don’t know how to look at it.
– tchrist
Commented May 10, 2012 at 14:51
As I mentioned, I checked the DLL file with a binary editor, and it is not correct. I don't trust console output either.
– Alex Z
Commented May 10, 2012 at 15:09
@Fahrenheit2539 What do you mean it’s not correct? I don’t trust these dumb hex dumps. The right answer is that U+00A9 COPYRIGHT SIGN, which should display as ©, is represented by the single byte "\xA9" in Latin1 but by the pair of bytes "\xC2\xA9" in UTF-8. If it is double-encoded due to a bug, then it will be four bytes, "\xC3\x82\xC2\xA9", which is what you just showed. Which of those three byte sequences is it that’s actually in the file? Don’t just look at the A9 at the end, as that’s the same for all three. You have to see how many high-bytes there are in a row right there.
– tchrist
Commented May 10, 2012 at 15:31
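
The three candidate byte sequences from the comment above can be generated and told apart mechanically. A short Python sketch, assuming Latin1 is the codec involved in the double encoding as the comment describes; `classify` is a hypothetical helper, not part of any tool mentioned here:

```python
COPYRIGHT = "\u00a9"

latin1 = COPYRIGHT.encode("latin-1")             # single byte A9
utf8 = COPYRIGHT.encode("utf-8")                 # two bytes C2 A9
double = utf8.decode("latin-1").encode("utf-8")  # four bytes C3 82 C2 A9


def classify(data):
    """Report which encoding of U+00A9 appears in data.

    The longest sequence is checked first because each shorter
    sequence is a suffix of the longer ones.
    """
    if double in data:
        return "double-encoded UTF-8"
    if utf8 in data:
        return "UTF-8"
    if latin1 in data:
        return "Latin1"
    return "not found"


if __name__ == "__main__":
    for sample in (latin1, utf8, double):
        print(sample.hex(" "), "->", classify(sample))
```

A resource stored as UTF-16 would hold a two-byte form of U+00A9 instead, which this helper would not distinguish from Latin1, so treat it as a sketch for 8-bit data only.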
That was a good suggestion to rethink what counts as correct. To check, I looked at another DLL that has correct version information (it ships with Windows). There the (c) symbol is represented as a single two-byte character, \x00\xa9, and is displayed properly.
– Alex Z
Commented May 13, 2012 at 9:33
