Can anyone explain this encoding-related behaviour?

Question

Encoding is not my strong point, despite having read quite a bit.

There's a file I want to edit. Its extension is .tdl, but that doesn't mean anything in particular.

It is an XML file. The first line looks like this:

<?xml version="1.0" encoding="utf-16"?>

When I try to open this file with gedit I get a big message on a yellow background, saying:

"There was a problem opening the file ... The file you opened has some invalid characters. If you continue editing this file you could corrupt this document. You can also choose another character encoding and try again"

The Character Encoding dropdown box under this says "Current Locale (UTF-8)".

I try to set that to "Unicode (UTF-16)" and click "Retry". The nasty message comes back and the dropdown is set back to "Current Locale (UTF-8)".

I've also tried opening the file by going File --> Open --> Character Encoding: change from "Automatically Detected" to "Unicode (UTF-16)". But I get the nasty message again, again with the dropdown set to "Current Locale (UTF-8)".

Programmatically (using Groovy, groovy.xml.XmlParser) I am able to parse this file and produce a seemingly valid groovy.util.Node structure. I haven't yet got to the stage of trying to save this internal Node structure, whether modified or not.

Can someone tell me what's wrong (if anything) with this file, and how I might edit it safely?

Thanks ... that's a helpful solution. I just installed it and it has no problem opening the file. I can save in a variety of encodings, but I can't find something like "properties" which might tell me whether this is a UTF-16-encoded file, or whatever. Investigations continuing... Aha, spoke too soon: down at the bottom right, a little box which says "UTF-16LE". Thanks again. — mike rodent, Commented Mar 18, 2020 at 8:26

xenoid · Accepted Answer · 2020-03-18 10:52:51Z

1

In UTF-16, each character takes two bytes (four for characters outside the Basic Multilingual Plane), and for ASCII characters the high byte is 0x00.

For instance "Something" in UTF-16 is:

00000000  ff fe 53 00 6f 00 6d 00  65 00 74 00 68 00 69 00  |..S.o.m.e.t.h.i.|
00000010  6e 00 67 00 0a 00                                 |n.g...|

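(The bytes ff fe at the start are the Byte Order Mark, U+FEFF, written little-endian. If a file starts with fe ff instead, it is big-endian, and the two bytes of every unit are swapped relative to this dump.)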
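As a rough illustration, the dump above can be reproduced in a few lines of Python (used here only as a convenient scratchpad; nothing in the question depends on it). The helper name `utf16le_with_bom` is made up for this sketch, and little-endian order is assumed because that is what the dump shows.

```python
import codecs

def utf16le_with_bom(text: str) -> bytes:
    """Encode text as little-endian UTF-16 with an explicit leading BOM (bytes ff fe)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

data = utf16le_with_bom("Something\n")

# Print the bytes as space-separated hex pairs, comparable to the hexdump above.
# Each ASCII character shows up as its ASCII byte followed by a 00 byte.
print(data.hex(" "))
```

Decoding with the generic "utf-16" codec reads the BOM to pick the byte order and then drops it, so `data.decode("utf-16")` gives back the original text.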

The NUL characters all over the place do confuse software...
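A small sketch of why byte-oriented tools stumble, again in Python and with invented sample content: a plain byte search for an ASCII word finds nothing in UTF-16 data, because a 00 byte sits between every pair of letters.

```python
import codecs

# Invented sample: a UTF-16LE document with BOM, similar in shape to the file in the question.
text = '<?xml version="1.0" encoding="utf-16"?>\n<root>Something</root>\n'
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# A byte-level search, roughly what a tool unaware of UTF-16 does, misses the word.
found_in_bytes = b"Something" in raw
# Decoding first (the "utf-16" codec consumes the BOM) makes the text searchable.
found_in_text = "Something" in raw.decode("utf-16")

print(found_in_bytes, found_in_text)
# For pure-ASCII content, one NUL byte accompanies every character.
print(raw.count(b"\x00"), "NUL bytes out of", len(raw))
```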

You can convert to a more reasonable UTF-8, using iconv:

iconv -f UTF-16 -t UTF-8 <utf16file >utf8file

And don't forget to change the encoding named in the XML declaration on the first line (encoding="utf-16" becomes encoding="utf-8").
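If you would rather do both steps in one go, here is a minimal Python sketch that performs the same conversion as the iconv command and rewrites the declaration. The function name `utf16_xml_to_utf8` and the sample content are hypothetical, and the regular expression only handles a declaration at the very start of the file.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very start of the text.
_DECL_ENCODING = re.compile(r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2')

def utf16_xml_to_utf8(raw: bytes) -> bytes:
    """Decode UTF-16 XML and re-encode it as UTF-8, updating the declared encoding."""
    # The "utf-16" codec uses the BOM to pick the byte order and strips it.
    # Without a BOM it falls back to the machine's native order.
    text = raw.decode("utf-16")
    # Swap only the encoding name, keeping the original quote style.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")

# Invented sample input, not the file from the question.
sample_text = '<?xml version="1.0" encoding="utf-16"?>\n<root>caf\u00e9</root>\n'
sample = codecs.BOM_UTF16_LE + sample_text.encode("utf-16-le")

converted = utf16_xml_to_utf8(sample)
print(converted.decode("utf-8"))
```

The result carries no BOM and no NUL bytes, so grep and most editors should handle it without complaint.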

answered Mar 18, 2020 at 10:52

xenoid


I didn't ask for general encoding information: my question was quite specific. I have no desire to convert to UTF-8.
– mike rodent
Commented Mar 18, 2020 at 11:24
@mikerodent Your call... But it's not only a matter of editing. The UTF-16 file is opaque to grep, for instance.
– xenoid
Commented Mar 18, 2020 at 12:18


vonbrand · Accepted Answer · 2020-03-18 15:16:59Z

-1

If the file is UTF-16 (the typical Windows encoding), you'll have trouble under Linux (UTF-8 native, militant...). GNU Emacs at least claims to support UTF-16, though I have never used that in anger.

You might try recode(1) to translate into UTF-8 (and fix headers and such to match), but that might break tools expecting UTF-16 horribly.

Update: Just thought about this: recode to UTF-8; mangle, spindle, deface at leisure; recode back to UTF-16. That way you can use familiar tools in the middle. But do fix the encoding announced in the XML declaration at each step; who knows whether tools get confused otherwise. Or perhaps XML mangling tools do heed this...
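For what it's worth, XML parsers generally do work out the encoding from the BOM and the declaration, so a structural edit can skip the intermediate UTF-8 file altogether. Below is a hedged Python sketch using the standard-library ElementTree. The function `edit_utf16_xml`, the callback `rename_item`, and the element and attribute names are all invented for the illustration, and the output is assumed to be little-endian with a BOM, since the file was reported as "UTF-16LE".

```python
import codecs
import xml.etree.ElementTree as ET

def edit_utf16_xml(raw: bytes, edit) -> bytes:
    """Parse UTF-16 XML, apply edit(root) in place, return UTF-16LE bytes with a BOM."""
    root = ET.fromstring(raw)  # the parser detects UTF-16 from the BOM
    edit(root)
    # Serialise to a str (no declaration), then add the declaration back by hand.
    body = ET.tostring(root, encoding="unicode")
    text = '<?xml version="1.0" encoding="utf-16"?>\n' + body
    # Assumption: the consuming application wants little-endian UTF-16 with a BOM.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

def rename_item(root):
    # Hypothetical edit: element and attribute names are made up for this demo.
    root.find("item").set("name", "\u00c9crire")

# Invented sample input, not the file from the question.
sample_text = '<?xml version="1.0" encoding="utf-16"?>\n<root><item name="old"/></root>'
sample = codecs.BOM_UTF16_LE + sample_text.encode("utf-16-le")

result = edit_utf16_xml(sample, rename_item)
print(result.decode("utf-16"))
```

Caveat: ElementTree drops comments and anything outside the root element by default, so try this on a copy and compare before trusting it with the real file.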

edited Mar 18, 2020 at 15:16

answered Mar 17, 2020 at 22:28

vonbrand


1

Yes, this file and the application responsible for it are from the Dark Side (although the app runs OK under Wine). I need to be able to edit the file by manipulating the Node structure programmatically, and then save with an encoding that will not mess things up when running the application, either in Windoze or using Wine.
– mike rodent
Commented Mar 18, 2020 at 8:15
1

Idiot downvote not by me, by the way.
– mike rodent
Commented Mar 18, 2020 at 9:12
