20

I was given the task to migrate a PostgreSQL 8.2.x database to another server. To do this I'm using the pgAdmin 1.12.2 (on Ubuntu 11.04 by the way) and using the Backup and Restore using the custom/compress format (.backup) and UTF8 encoding.

The original database is in UTF8, like so:

-- Database: favela

-- DROP DATABASE favela;

CREATE DATABASE favela
  WITH OWNER = favela
       ENCODING = 'UTF8'
       TABLESPACE = favela
       CONNECTION LIMIT = -1;

I'm creating this database exactly like this on the destination server. But when I restore the database from the .backup file using the Restore option it gives me some of these errors:

pg_restore: restoring data for table "arena"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2173; 0 35500 TABLE DATA arena favela
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe3a709
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY arena, line 62

When I check which record triggered this error in fact some vartext fields have diacritical characters like ç (used in Portuguese, for example, "caça"), and when I manually remove them from the text in the records the error passes to the next record that has them - since when copy has an error it stops inserting data on this table. And I don't want to replace them manually one by one to accomplish this.

But it's kinda of strange because with UTF8 there shouldn't be this kind of problems, right?

I don't know how they got there in the first place. I'm only migrating the database, and I supose that somehow the database was like in LATIN1 and then was improperly changed to UTF8.

Is there any way to check if a table/database has invalid UTF8 sequences? Or any way to enforce/reconvert these characters into UFT8 so I don't run into any problems when I execute the restore?

Thanks, in advance.

6 Answers 6

9

Digging around the internet, I've seen that this is a pretty common problem. The common solution is to use the plain text format dump and feed it through iconv to correct the encoding.

Here is more information about that.

1
  • use iconv to convert to UTF-32 discarding invalid symbols and then back to UTF-8, a UTF-8 to UTF-8 conversion won't catch all bad code points. (eg orphan surrogates)
    – Jasen
    Commented May 25, 2015 at 23:04
7

"I don't know how they got there in the first place"

It could have happened as described here - although this generates an error on 8.4:

If you create a table with any text type (i.e. text, varchar(10), etc.), then you can insert an invalid byte sequence into that field using octal escapes.

For instance, if you have a UTF8-encoded database, you can do:

=> CREATE TABLE foo(t TEXT);

=> INSERT INTO foo VALUES(E'\377');

Now, if you COPY the table out, you can't COPY the resulting file back in. That means your pg_dump backups won't be able to restore. The only way to get your data back in is to re-escape that value.

There is a good post on this excellent blog about the general issues and some ways to deal with them

0
3

I don't recommend blindly running iconv on the plain text dump because it may convert valid characters(eg: Chinese characters ) to some other characters. It is better to find the invalid UTF8 character by running below command.

grep -naxv '.*' plain_text_dump.sql

and then run iconv on the particular data. Check this doc for detailed step by step explanation.

1

Its likely with the default encoding used in your Unix/Linux environment. To check which encoding is currently the default one, execute the following:

$ echo $LANG
en_US

In this case, we can clearly see it is not an UTF-8 encoding, the one which the copy command relies on.

So to fix this, we just set the LANG variable in example to the following:

$ export LANG=en_US.UTF-8

Note: This will only be available for the current session. Add it to ~/.bashrc or similar to have it available on startup of any future shell session.

Reference

0

I referenced the following link which gave me clues to determine the source encoding and then convert it into the desired UTF-8 encoding. Linux Check and Change Encoding

$ file -bi cabot.sql
text/plain; charset=utf-16le
$ iconv -f utf-16le -t utf-8 -o converted.sql cabot.sql
$ file -bi converted.sql
text/plain; charset=utf-8
0

First, I exported a plain file in WIN1252 encoding:

$ sudo -u postgres pg_dump "OLD-DB" -F p -E WIN1252 -f /tmp/db.dumplain

I dropped wrong characters to UTF-8:

$ sudo -u postgres iconv -c -t utf-8 /tmp/db.dumplain > /tmp/db.utf8

Then imported with no problems:

$ sudo -u postgres psql "NEW-DB" < /tmp/db.utf8

Not the answer you're looking for? Browse other questions tagged or ask your own question.