
I have a text file that uses various characters in the 128+ range in ways that are non-standard today. The file command just says "Non-ISO extended-ASCII".

From the context I can recognise these:

Octal 201: u + umlaut
      204: a + umlaut
      216: A + umlaut
      224: o + umlaut
      341: double s

(There are many others, which I suspect are graphical symbols, not characters.)

In addition, an example:

 example:   E0X A ANCIENT.IMG 2 0 C:\DOS\DISKOPT.EXE A: /O /Sa /M2
              ─┬─ ┬ ──┬──────── ┬ ─ ───────┬────────── ───────┬─────
           │  │   │         │          │                  │
     load E0X ─┘  └─────────┐   │          │                  │
                      │     │   │          │                  │
     with ANCIENT.IMG ┘     │   │          │                  │
                            │   │          │                  │
     for drive A: ──────────┘   │          │                  │
                                │          │                  │
     let DISKOPT work ──────────│──────────┴──────────────────┘
                    │
     and write the result back to disk if finished.

(The graphical chars are octal 263, 277, 302, 304, 331.)

And here is the link to the file: e0x.arj. It is the E0X.ENG file, but I guess all the text files use the same encoding.

Which character set is this, and how can I make it readable on a modern computer?

1 Answer


Most probably the character positions you mention are octal numbers: 201 (which is customarily written as 0201 to make it clear it's octal) is decimal 129, or 0x81.
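As a quick sanity check (a sketch of my own, assuming Python 3, whose cp437 codec implements IBM codepage 437; several related DOS codepages share these positions), the octal values from the question decode exactly to the characters described:

```python
# The octal byte values from the question, decoded as code page 437
# (Python's "cp437" codec). This is an illustrative check, not part
# of the original answer.
assert 0o201 == 129 == 0x81  # octal 201 is decimal 129, hex 0x81

letters = bytes([0o201, 0o204, 0o216, 0o224, 0o341]).decode("cp437")
print(letters)  # üäÄöß — u/a/A/o + umlaut and the "double s"

boxes = bytes([0o263, 0o277, 0o302, 0o304, 0o331]).decode("cp437")
print(boxes)    # │┐┬─┘ — the "graphical symbols" are box-drawing chars
```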

Those characters are consistent with several DOS codepages:

  • VGA codepage 437 (VGA ROM charset)
  • Codepage 437 (IBM-PC: default)
  • Codepage 775 (IBM-PC: Baltic)
  • Codepage 850 (IBM-PC: European)
  • Codepage 852 (IBM-PC: East European)
  • Codepage 857 (IBM-PC: Turkish)
  • Codepage 861 (IBM-PC: Icelandic)
  • Codepage 865 (IBM-PC: Nordic European)

If it's German, I'd bet that it's 437 or 850. Any editor should be able to read that text file and write it in a different character set.

For example, you can read it with Notepad++ and write it back out as UTF-8 if you are sure that is what you need.
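If you'd rather script it than use an editor, a minimal sketch in Python does the same conversion: decode the raw bytes as codepage 437 and re-encode as UTF-8. The filenames are placeholders (E0X.ENG is the file from the question):

```python
# Sketch: convert a CP437 text file to UTF-8. Paths are assumptions;
# adjust to your own files.
def cp437_to_utf8(src: str, dst: str) -> None:
    with open(src, "rb") as f:
        text = f.read().decode("cp437")   # raw bytes -> str via CP437
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)                     # str -> UTF-8 on disk

# cp437_to_utf8("E0X.ENG", "E0X.utf8.txt")
```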

P.S.: After reading the file that you attached, I can see that the E0X.ENG charset is MS-DOS codepage 437. You can see it converted to UTF-8 at https://pastebin.com/LdnQCpk4.

If you run Linux, you can automate the conversion with GNU recode. If you run DOS, this recode utility (https://docs.seneca.nl/Smartsite-Docs/Features-Modules/Features/Tools/Recode-commandline-utility.html) should do the same.

  • Ok, octal is possible, but I have tried 850 and Windows 1250 and 1252, and it does not work for all the characters. Yes, using a Windows machine might be easier.
    – Tomas By
    Commented Jan 28, 2021 at 13:29
  • Yes, sure... I was hoping there was some reasonable, practical solution.
    – Tomas By
    Commented Jan 28, 2021 at 13:35
  • And it looks like garbage in all the various editors I have tried. If it had been a standard code page, then I guess file would have recognised it.
    – Tomas By
    Commented Jan 28, 2021 at 13:53
  • @Tomas: The problem is that most of the 8-bit codepages are impossible to distinguish one from another – every value is equally valid in every codepage (except for C1 controls range). There are tools which do some statistical analysis (like chardetect) but none of them can guess it every time; the answer depends on outside knowledge, such as knowing that the files come from MS-DOS. Commented Jan 28, 2021 at 19:10
  • If you look at (for example) ascii-codes.com you can see that the IBM-PC codepage 437 includes extended codes (those with the high bit set). The character ä in that codepage is at position 132, 0x84 or 0204. In the file named E0X.ENG it is found twice, at positions 1766 and 7166: the latter is towards the end, in the last-but-one row. Commented Feb 3, 2021 at 21:51

