5

I'm unifying the encoding of a large bunch of text files, gathered over time on different computers. I'm mainly going from ISO-8859-1 to UTF-8. This nicely converts one file:

recode ISO-8859-1..UTF-8 file.txt

I of course want to do automated batch processing for all the files, but simply running the above on each file has the problem that files already encoded in UTF-8 will have their encoding broken. (For instance, the character 'ä' originally in ISO-8859-1 will appear like this, viewed as UTF-8, if the above recode is done twice: � -> ä -> Ã¤)

My question is, what kind of script would run recode only if needed, i.e. only for files that weren't already in the target encoding (UTF-8 in my case)?

From looking at the recode man page, I couldn't figure out how to do something like this. So I guess this boils down to how to easily check the encoding of a file, or at least whether it's UTF-8 or not. This answer implies you could recognise valid UTF-8 files with recode, but how? Any other tool would be fine too, as long as I can use the result in a conditional in a bash script...


7 Answers

7

This question is quite old, but I think I can contribute a solution to this problem:
First, create a script named recodeifneeded:

#!/bin/bash
# Find the current encoding of the file
encoding=$(file -i "$2" | sed "s/.*charset=\(.*\)$/\1/")

if [ ! "$1" == "${encoding}" ]
then
# Encodings differ, we have to encode
echo "recoding from ${encoding} to $1 file : $2"
recode ${encoding}..$1 $2
fi

You can use it this way:

recodeifneeded utf-8 file.txt

So, if you want to run it recursively and change the encoding of all *.txt files to (let's say) utf-8:

find . -name "*.txt" -exec recodeifneeded utf-8 {} \;

I hope this helps.

  • Only solution that works regardless of the original encoding.
    – Jr. Hames
    Commented Jun 3, 2015 at 19:44
  • Some encodings are just detected as data.
    – mwfearnley
    Commented Feb 26, 2020 at 9:26
3

This script, adapted from harrymc's idea, recodes one file conditionally (based on the presence of certain UTF-8 encoded Scandinavian characters) and seems to work tolerably well for me.

$ cat recode-to-utf8.sh 

#!/bin/sh
# Recodes specified file to UTF-8, except if it seems to be UTF-8 already

result=$(grep -c '[åäöÅÄÖ]' "$1")
if [ "$result" -eq "0" ]
then
    echo "Recoding $1 from ISO-8859-1 to UTF-8"
    recode ISO-8859-1..UTF-8 "$1" # overwrites the file
else
    echo "$1 was already UTF-8 (probably); skipping it"
fi

(Batch processing files is of course a simple matter of e.g. for f in *.txt; do recode-to-utf8.sh "$f"; done.)

NB: this totally depends on the script file itself being UTF-8. And as this is obviously a very limited solution suited to the kind of files I happen to have, feel free to add better answers that solve the problem in a more generic way.

2

UTF-8 has strict rules about which byte sequences are valid. This means that if data can be decoded as UTF-8, you'll rarely get a false positive if you assume that it is UTF-8.

So you can do something like this (in Python):

def convert_to_utf8(data):
    # 'data' is the raw byte string read from the file
    try:
        data.decode('UTF-8')
        return data  # was already UTF-8
    except UnicodeError:
        return data.decode('ISO-8859-1').encode('UTF-8')

In a shell script, you can use iconv to perform the conversion, but you'll need a means of detecting UTF-8. One way is to use iconv with UTF-8 as both the source and destination encodings: if the file was valid UTF-8, the output will be the same as the input.
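
For example, a minimal sketch of that round-trip check as a shell conditional ("$f" here is just a placeholder for the file being tested):

if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1
then
    # iconv succeeded, so the file is already valid UTF-8
    echo "$f is already UTF-8; leaving it alone"
else
    # iconv failed on an invalid byte sequence, so assume ISO-8859-1
    recode ISO-8859-1..UTF-8 "$f"
fi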

  • Thanks, seems useful - I'll try this the next time when batch converting text files
    – Jonik
    Commented Aug 22, 2010 at 16:31
1

Both ISO-8859-1 and UTF-8 are identical for the first 128 characters, so your problem is really how to detect files that contain funny characters, meaning characters numerically encoded above 127.

If the number of funny characters is not excessive, you could use egrep to scan and find out which files need recoding.
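
A sketch of that scan, here using grep's -P mode (as the longer script in the last answer does) so the byte range can be written with \x escapes:

# list files containing at least one byte above 0x7F, i.e. candidates for recoding
grep -P -l '[\x80-\xFF]' *.txt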

  • Indeed, in my case the "funny characters" are mostly just åäö (+ uppercase) used in Finnish. It's not quite that simple, but I could adapt this idea... I'm using a UTF-8 terminal, and grepping for e.g. 'ä' finds it only in files that are already UTF-8 (i.e. the very files I want to skip)! So I should do the opposite: recode files where grep finds none of [äÄöÖåÅ]. Sure, for some of these files (pure ASCII) recoding's not necessary, but it doesn't matter either. Anyway, this way I'd perhaps get all files to be UTF-8 without breaking those that already were. I'll test this some more...
    – Jonik
    Commented Mar 6, 2010 at 17:54
1

I'm a bit late, but I've been struggling with the same question again and again... Now that I've found a great way to do it, I can't help but share it :)

Despite being an Emacs user, I'll recommend you use Vim today.

With this simple command, it will recode your file to the desired encoding, no matter what's inside:

vim +'set nobomb | set fenc=utf8 | x' <filename>

I've never found anything that gives better results than this.
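
If you want to convert many files at once, a simple loop should do (a sketch; it briefly opens each file in the terminal, and assumes every matched file really should end up as UTF-8):

for f in *.txt; do
    vim +'set nobomb | set fenc=utf8 | x' "$f"
done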

I hope it will help some others.

0

You can detect and guess the charset of a file by using

file -bi your_file_with_strange_encoding.txt

This bash one-liner uses the above command as the input for recode and loops over multiple files:

for f in *.txt; do recode -v "$(file -bi "$f" | grep -o 'charset=.*' | cut -f2- -d=)..utf-8" "$f"; done

Don't worry about converting files that are already UTF-8: recode is smart enough not to do anything in that case, and will just print a message:

Request: *mere copy*
0

There are many ways to detect a character set and none is 100% reliable. It helps a lot if the possible languages and character sets are limited, and you have enough text to count specific bytes.

Another approach is to try to recode (using recode) and check the exit value for errors.

To only differentiate between UTF-8 and ISO-8859-X for languages using Latin characters, one trick is to try to recode to UTF-16 first. It will work for UTF-8 or exit with an error for ISO-8859-X.
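
The core of that trick is just checking recode's exit status; a minimal sketch (with "$file" as a placeholder, using recode's strict mode as the script below does):

if recode -s utf8..utf16 < "$file" > /dev/null 2>&1
then
    echo "$file looks like valid UTF-8"
else
    echo "$file is probably ISO-8859-X"
fi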

I sometimes use this in a script:

# UTF-16 or non-text binary ?
if grep -P -q '[\0-\x08\x0B\x0C\x0E-\x1F]' "$file" ; then
    if cat "$file" | recode -s utf16/..utf8 &>/dev/null ; then
        echo "utf-16"
    else
        echo "BINARY?"
    fi
    exit
fi

# plain ASCII ?
if ! grep -P -q '[\x7F-\xFF]' "$file" ; then
    echo "ASCII"
    exit
fi

# UTF-8 or Latin1/CP1252 ?
# order of tests is important!
for charset in utf8 latin1 cp1252 ; do
    if cat "$file" | recode -s $charset/..utf16 &>/dev/null ; then
        found=$charset
        if [ "$found" == "latin1" ]; then
            # checking if latin1 is really cp1252
            if grep -P -q '[\x80-\x9F]' "$file" ; then
                found=cp1252
            fi
        fi
        break
    fi
done

if [ -n "$found" ]; then
    echo "$found"
else
    echo "UNKNOWN"
fi
