
Here is a string from a text file:

@™TdaŽ®Æ‚êƒ~ƒNƒXƒgƒŒ[ƒgEƒrƒLƒjver1.11d1.d2iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1³Ž®”z•z”Åj

It includes many nonprinting characters and is copied here: https://pastebin.com/TUG4agN4

Using https://2cyr.com/decode/?lang=en, we can confirm that it decodes to the following:

 ☆Tda式照れミクストレート・ビキニver1.11d1.d2(ビキニモデルver.1.1正式配布版)

This is with source encoding = SJIS (shift-jis), displayed as Windows-1252.
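
The same round trip is easy to reproduce in Python. A minimal sketch (the fragment ビキニ is chosen here for illustration because its Shift-JIS bytes all happen to be assigned in Windows-1252):

# Mojibake forward: Shift-JIS bytes misread as Windows-1252.
mojibake = "ビキニ".encode("shift_jis").decode("windows-1252")
print(mojibake)  # ƒrƒLƒj (the same fragment appears in the string above)

# Recovery: re-encode as Windows-1252, then decode as Shift-JIS.
print(mojibake.encode("windows-1252").decode("shift_jis"))  # ビキニ

# Not every fragment survives: the Shift-JIS bytes for ☆ include 0x81,
# which Windows-1252 leaves unassigned, so a strict decode fails.
try:
    "☆".encode("shift_jis").decode("windows-1252")
except UnicodeDecodeError as err:
    print(err)  # 'charmap' codec can't decode byte 0x81 ...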

But how can we obtain the same result without a website? The relevant tool is iconv, but something in the toolchain is broken. If I cat the source text file, or feed it as standard input with '<' in bash, one of the iconv invocations in the chain quickly errors out. If I instead copy the string above from the text editor gedit (which reads the file as utf-16le), or from iconv's own utf16-to-utf8 output, the result is close, but still wrong:

@儺da式ニれミクストレ[トEビキニver1.11d1.d2iビキニモデルver.1.1ウ式配布版j

Some evidence of the tool chain failing:

$ cat 'utf8.txt' |head -1

@™TdaŽ®Æ‚êƒ~ƒNƒXƒgƒŒ[ƒgEƒrƒLƒjver1.11d1.d2iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1³Ž®”z•z”Å

$ cat 'utf8.txt' |head -1| iconv -f utf8 -t utf16

���@�"!Tda}��� ��~�N�X�g�R�[�g�E�r�L�jver1.11d1.d2�i�r�L�j� �f�9 ver.1.1��}� z" z ��j

Note the three invalid characters at the start.

$ cat 'utf8.txt' |head -1| iconv -f utf8 -t utf16|iconv -f utf16 -t windows-1252

iconv: illegal input sequence at position 2

$ echo "@™TdaŽ®Æ‚êƒ~ƒNƒXƒgƒŒ[ƒgEƒrƒLƒjver1.11d1.d2iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1³Ž®”z•z”Åj"| iconv -f utf8 -t utf16

��@"!Tda}�� ��~�N�X�g�R[�gE�r�L�jver1.11d1.d2i�r�L�j� �f�9 ver.1.1�}� z" z �j

Note the two invalid characters at the start, among other differences. The sequence copied from the terminal matches the string displayed in the text editor (confirmed by find, ctrl-F, matching it), and that is the same string that gives the correct result on 2cyr.com.

Extending the last command above with '|iconv -f utf16 -t windows-1252|iconv -f shift-jis -t utf8' gives the close-but-incorrect result quoted above, instead of erroring out the way the direct chain does.

When I tried making a file named after the example string and running the tool convmv on it, convmv said the output filename contained "characters, which are not POSIX filesystem conform! This may result in data loss." Most filenames that are invalid UTF-8 don't trigger this warning.

Is there any byte sequence that piping in bash can't handle? If not, why is the toolchain not working?

Apparently the difference arises because bash won't paste nonprinting characters (the boxes with numbers) onto the command line; maybe 'readline' can't handle them? But the result being close suggests the conversion order in the toolchain is correct, so why isn't it working?

The original file, with its filename scrambled in a different way (expires after 30 days): https://ufile.io/oorcq


1 Answer


Pipes are an OS feature that works with byte buffers and does not interpret their contents in any way. Piped text never goes through bash, and certainly never through 'readline'; text pasted as a command-line argument does. (And yes, both readline and the terminal may filter out control characters as a security measure.)
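
A quick way to convince yourself of this, as a minimal sketch assuming Python 3 on a POSIX system:

import os

# A pipe carries raw bytes; the kernel never inspects or filters them.
r, w = os.pipe()
payload = bytes(range(256))        # every byte value, C0/C1 controls included
os.write(w, payload)
os.close(w)
assert os.read(r, 256) == payload  # all 256 values arrive unchanged

Any filtering you observed therefore happened in the terminal or in readline, not in the pipe.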

Your file is actually a mix of two encodings, windows-1252 and iso8859-1, due to the different ways they use the C1 control character block (0x80..0x9F).

  • ISO 8859-1 uses this entire range for control characters, and bytes 0x80..0x9F correspond to Unicode codepoints U+0080..U+009F.
  • Windows-1252 cannot represent C1 control characters; it uses most of this range for printable characters and has a few "holes" – i.e. byte values which have nothing assigned (0x81, 0x8D, 0x8F, 0x90, 0x9D).
  • The two encodings are otherwise identical in 0x00..0x7F and 0xA0..0xFF ranges.
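
All three points are easy to verify with Python's codecs (a minimal sketch):

# Inside the C1 block the two decodings diverge:
for b in (0x80, 0x81, 0x83, 0x8D):
    latin1 = bytes([b]).decode("iso8859-1")         # always U+0080..U+009F
    try:
        cp1252 = bytes([b]).decode("windows-1252")  # printable: €, ƒ, ...
    except UnicodeDecodeError:
        cp1252 = "<hole>"                           # 0x81, 0x8D, 0x8F, 0x90, 0x9D
    print(f"0x{b:02X}  latin-1: U+{ord(latin1):04X}  cp1252: {cp1252}")

# Outside the C1 block the two decodings agree byte-for-byte:
for b in [*range(0x00, 0x80), *range(0xA0, 0x100)]:
    assert bytes([b]).decode("windows-1252") == bytes([b]).decode("iso8859-1")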

Let's take the first line of your "bad" input file, decoded from UTF-16 to Unicode text and with nonprintable characters escaped:

\u0081@\u0081™TdaŽ®\u008FÆ‚êƒ~ƒNƒXƒgƒŒ\u0081[ƒg\u0081EƒrƒLƒjver1.11d1.d2\u0081iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1\u0090³Ž®”z•z”Å\u0081j\n
  • You can see \u0081 (U+0081), which maps to byte 0x81 in ISO 8859-1 but cannot be encoded in Windows-1252.
  • You can also see the symbol ƒ (U+0192), which maps to 0x83 in Windows-1252 but does not exist at all in ISO 8859-1.
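
This also explains the "close, but still wrong" string in the question: the copy/paste path drops exactly those C1 codepoints, and redoing the Windows-1252-to-Shift-JIS reinterpretation on the stripped line reproduces it (a sketch built from the escaped line above):

# Drop the C1 controls (what the clipboard/readline path loses), then
# reinterpret the remaining Windows-1252 bytes as Shift-JIS:
correct = ("\u0081@\u0081™TdaŽ®\u008FÆ‚êƒ~ƒNƒXƒgƒŒ\u0081[ƒg\u0081EƒrƒLƒj"
           "ver1.11d1.d2\u0081iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1\u0090³Ž®”z•z”Å\u0081j")
stripped = "".join(c for c in correct if not 0x80 <= ord(c) <= 0x9F)
print(stripped.encode("windows-1252").decode("shift_jis"))
# @儺da式ニれミクストレ[トEビキニver1.11d1.d2iビキニモデルver.1.1ウ式配布版j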

So the trick is to use Windows-1252 when possible and ISO 8859-1 as the fallback, deciding individually for each codepoint. (libiconv could do this via 'ICONV_SET_FALLBACKS', but the CLI iconv tool cannot.) It is easy to write your own tool:

#!/usr/bin/env python3
import sys

# Read mangled UTF-16 from stdin; re-encode each codepoint with
# Windows-1252 where possible, falling back to ISO 8859-1 for the
# C1 controls that Windows-1252 cannot represent.
for rune in sys.stdin.buffer.read().decode("utf-16"):
    try:
        raw = rune.encode("windows-1252")
    except UnicodeEncodeError:
        raw = rune.encode("iso8859-1")
    sys.stdout.buffer.write(raw)  # the output byte stream is Shift-JIS
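
Saved as, say, sjisfix.py (a placeholder name), the script slots in where the iconv chain broke: it reads the mangled UTF-16 and emits Shift-JIS bytes, which iconv can then finish converting:

$ python3 sjisfix.py < input.txt | iconv -f shift-jis -t utf8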

Note that only half of your input file is mis-encoded Shift-JIS. The other half (English) is perfectly fine UTF-16; fortunately Shift-JIS will pass it through so no manual splitting is needed:

#!/usr/bin/env python3
with open("éΦé╟é▌üEé╓é╚é┐éσé▒éªéΦé⌐.txt", "r", encoding="utf-16") as infd:
    with open("りどみ・へなちょこえりか.txt", "w", encoding="utf-8") as outfd:
        buf = b""
        for rune in infd.read():
            try:
                # Mangled codepoints map back to their original bytes
                # via Windows-1252...
                buf += rune.encode("windows-1252")
            except UnicodeEncodeError:
                try:
                    # ...or, for the C1 controls, via ISO 8859-1...
                    buf += rune.encode("iso8859-1")
                except UnicodeEncodeError:
                    # ...while codepoints that were never mangled (the
                    # already-correct text) encode as Shift-JIS directly.
                    buf += rune.encode("shift-jis")
        # The reassembled byte string is consistent Shift-JIS throughout.
        outfd.write(buf.decode("shift-jis"))
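
The pass-through works because Shift-JIS is ASCII-compatible: the English parts encode to plain ASCII bytes via Windows-1252, and decoding those bytes as Shift-JIS returns them unchanged. A one-line check of that assumption:

text = "English readme text, ver 1.11d"
assert text.encode("windows-1252").decode("shift_jis") == text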
  • This is a good solution that answers the question of how to retrieve the original text. My questions are these:
    – Misaki
    Commented Mar 30, 2018 at 20:33
  • 1) Is there a way to read the original file that doesn't involve a fallback to a second encoding? My assumption that UTF-16 is involved is partly because I tried opening it as other encodings in gedit and they all failed. 2) Does this method of reading and converting one character/"rune" at a time always work? Could 2-byte characters be improperly decoded as 3-byte or 1-byte characters, resulting in a 'rune' with too much or too little information?
    – Misaki
    Commented Mar 30, 2018 at 20:42
  • 3) Is 2cyr.com forced to use the same fallback? The string is sent to it as UTF-8, as I understand it, and the decoding settings there mention neither UTF-16 nor ISO 8859-1. It seems simple enough to test pairs of encodings, like SJIS+Windows-1252, but detecting that UTF-16 is also involved adds complexity, and my understanding is poor enough that I'm not entirely sure this must be done.
    – Misaki
    Commented Mar 30, 2018 at 20:52
  • Some of these comments might be extraneous and could be deleted. I don't think it's a coincidence that the missing symbol in Windows-1252, 0x81, is U+0081. I think the text editor that originally read the SJIS file as Windows-1252 saw 0x81, was unable to convert it, and just passed it through. 2cyr then did a similar thing when converting from Unicode (any type) to Windows-1252. I guessed at first that U+0081 was not actually 0x0081 in UTF-16, but it is. So instead of the fallback being a second encoding, it would be the raw bit sequence; perhaps values below 255 are assumed to be clean by programs.
    – Misaki
    Commented Mar 30, 2018 at 21:51
  • Or, since U+0081 in UTF-8 is 0xC2 0x81, the fallback bit sequence would be the Unicode codepoint.
    – Misaki
    Commented Mar 30, 2018 at 22:27
