This is a problem that I've encountered a couple of times recently. Here is my most recent experience:

Trying to browse https://www.scape.sc/release.php?id=48, a page that contains Japanese text, I find that the Japanese on the page is completely garbled: it displays as Unicode replacement squares, stray symbols, and assorted accented Latin characters. This is true even in the HTML source, so I don't think it is an issue of font choice.

The site uses what I understand from this webhint.io article to be an out-of-date method of declaring the character set: <META Http-equiv="Content-Type" Content="text/html; charset=utf8">. The article does mention, though, that this shouldn't be a problem nowadays.

This is how the raw HTML looks when I visit the page in my browser:

<TR><TD>2.</TD><TD>記憶ã¨ç©º</TD><TD> <I>(kioku to sora)</I></TD></TR>

In the past, I have found that looking up older captures of sites with this issue on the Internet Archive's Wayback Machine would display the Japanese characters correctly. That is true in my current case as well.

In the following two examples from the Wayback Machine, the first link is from a capture made in 2016; both the page source and the rendered page use valid, uncorrupted Japanese characters. The second is from 2023 and displays the same garbled text that I see on my own machine, which makes me more confident that the problem is not on my end.

  1. page as displayed in 2016

raw HTML from 2016:

<tr><td>2.</td><td>記憶と空</td><td> <i>(kioku to sora)</i></td></tr>

  2. page as displayed in July 2023

raw HTML from 2023:

<tr><td>2.</td><td>記憶ã¨ç©º</td><td> <i>(kioku to sora)</i></td></tr>

My suspicion is that this is an error on the webmaster's part: perhaps there was a charset mismatch when changes were made to the site in a text editor sometime between 2016 and now. Does this sound reasonable? Is there any way to recover the "corrupted" Unicode and avoid having to rely on old captures on the Wayback Machine?

TL;DR: The website used to contain valid Unicode text and no longer does. How can such an issue occur? Can the garbled text be reversed/made legible by the end user?

2 Answers


The text is doubly encoded UTF-8. That is, the UTF-8 data was misinterpreted as being in one of the legacy single-byte encodings (most likely Windows-1252) and then converted from that encoding to UTF-8 again. (For example, the three bytes that represent 記 in UTF-8 also represent è¨˜ in Windows-1252, and those three characters were then stored as UTF-8 again.)
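
As a quick illustration, here is a minimal Python sketch of my own (not part of the original answer) that reproduces the corruption:

# Reproduce the double encoding: 記 encodes to the UTF-8 bytes
# E8 A8 98; misread as Windows-1252, those bytes are the three
# characters è¨˜, which are then stored as UTF-8 again.
original = "記"
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)                  # è¨˜
print(mojibake.encode("utf-8"))  # b'\xc3\xa8\xc2\xa8\xcb\x9c', the doubly encoded bytes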

In other words, it's an error on the webmaster's side. (My guess, in fact, is that they upgraded their MySQL database server: current versions deal in UTF-8 Unicode strings, whereas MySQL 4.x from the website's era dealt in "latin1", which was more or less raw byte values. Some evidence for this is that the sidebar link that says -közi- is not double-encoded, nor are the hand-written artist pages. In MySQL, the encoding can be set on the DB side, on the PHP client side, and even per connection; it is very easy to end up with a mismatch and get double-encoded text, especially with older MySQL configurations.)

Browsers don't usually have any features for dealing with such corruption; as far as they know, the charset declaration is 100% correct and it's the input data that's wrong. An extension or a 'userscript' (GreaseMonkey-style) might work; you might also be able to recover the text from a locally saved copy of the page.

The rough process for recovering the text would be:

  1. Obtain the raw HTML.
  2. Feed it through iconv or another encoding converter, specifying UTF-8 as the input encoding and Windows-1252 (or another candidate legacy codepage) as the output (see the sketch after this list).
  3. The output should now be regular UTF-8.
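
For instance, reversing the example from above, a minimal sketch of my own (assuming Windows-1252 was the intermediate codepage and no "undefined" byte slots are involved):

# Undo the double encoding: the cp1252 characters è¨˜ encode back
# to the bytes E8 A8 98, which decode as UTF-8 to the original 記.
print("è¨˜".encode("cp1252").decode("utf-8"))  # 記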

In this case, regular iconv is a bit too strict, as is Python's cp1252 encoding, as they both refuse to use character slots that are "undefined" in cp1252 (e.g. to translate U+0081 back to byte 0x81), so the encoder needs to be customized a little:

#!/usr/bin/env python3
import argparse
import codecs
import encodings.cp1252

# Patch Python runtime to replace U+FFFE ("undefined" indicator) with
# direct mappings to byte values, e.g. so that U+0081 becomes \x81
# instead of reporting an error.
tab = encodings.cp1252.decoding_table
tab = [tab[i].replace("\uFFFE", chr(i)) for i in range(256)]
tab = "".join(tab)
encodings.cp1252.decoding_table = tab
encodings.cp1252.encoding_table = codecs.charmap_build(tab)

parser = argparse.ArgumentParser()
parser.add_argument("file", nargs="+")
args = parser.parse_args()

for arg in args.file:
    print("Processing:", arg)

    with open(arg, "rb") as fh:
        buf = fh.read()

    # Undo double-encoding; the result of encode(cp1252) will
    # actually be normal UTF-8.
    buf = buf.decode("utf-8").encode("cp1252")

    with open(arg + ".fixed", "wb") as fh:
        fh.write(buf)

Note that this damages the -közi- sidebar link, which wasn't double-encoded at all.
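
To see why, a one-line sketch of my own: correct text survives the UTF-8 decode, but re-encoding it to cp1252 produces raw legacy bytes that are no longer valid UTF-8.

# "közi" was never double-encoded, so "fixing" it corrupts it:
# ö becomes the single cp1252 byte 0xF6, which is invalid UTF-8.
print("közi".encode("cp1252"))  # b'k\xf6zi'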

"The site uses what I understand from this webhint.io article to be an out-of-date method of declaring the character set"

The method used by the site was completely appropriate for the time it was written. It's not exactly "out-of-date"; it's merely no longer the most convenient option (HTML5 allows the shorter <meta charset="utf-8">) but is still 100% supported, just like the rest of the page's HTML 4.01 markup (whereas the article talks about HTML5).

Either way, the declaration is correct; the HTML is indeed encoded in UTF-8. It's what was encoded in UTF-8 that's the actual problem.

  • The sidebar közi was a great catch: doubly-encoded UTF-8, it's gotta be that. And thanks for the clarification on charset declarations. I'll have to work out a userscript for this sort of conversion soon!
    – homework
    Commented Aug 6, 2023 at 9:09

It might be that the current website's text is actually in an older Japanese encoding such as Shift JIS, EUC-JP, or ISO-2022-JP rather than the UTF-8 it claims.

So, if that is the case here, the data isn't corrupted; it is just not in the advertised encoding.
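
As a quick way to see what that kind of mismatch produces, here is a small Python sketch of my own (not part of this answer):

# If the bytes were really Shift JIS but were decoded as UTF-8,
# most of the multi-byte sequences would be invalid and render as
# replacement characters rather than accented Latin letters.
raw = "記憶と空".encode("shift_jis")
print(raw.decode("utf-8", errors="replace"))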

A website with hints for recognizing some Japanese encodings is https://www.sljfaq.org/afaq/encodings.html

There are various tools for converting between encodings, such as iconv and recode; I would try those first.
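
If you want to probe the candidates programmatically instead, a small sketch of my own (assuming you have a locally saved copy of the page; "saved_page.html" is a hypothetical filename):

# Try a few candidate encodings on the raw bytes and report which
# ones decode without errors.
raw = open("saved_page.html", "rb").read()
for enc in ("utf-8", "shift_jis", "euc_jp", "iso2022_jp"):
    try:
        print(enc, "decodes cleanly:", raw.decode(enc)[:40])
    except UnicodeDecodeError as exc:
        print(enc, "fails:", exc)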
