How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8?

Question

I was given the task to migrate a PostgreSQL 8.2.x database to another server. To do this I'm using the pgAdmin 1.12.2 (on Ubuntu 11.04 by the way) and using the Backup and Restore using the custom/compress format (.backup) and UTF8 encoding.

The original database is in UTF8, like so:

-- Database: favela

-- DROP DATABASE favela;

CREATE DATABASE favela
  WITH OWNER = favela
       ENCODING = 'UTF8'
       TABLESPACE = favela
       CONNECTION LIMIT = -1;

I'm creating this database exactly like this on the destination server. But when I restore the database from the .backup file using the Restore option it gives me some of these errors:

pg_restore: restoring data for table "arena"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2173; 0 35500 TABLE DATA arena favela
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe3a709
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY arena, line 62

When I check which record triggered this error in fact some vartext fields have diacritical characters like ç (used in Portuguese, for example, "caça"), and when I manually remove them from the text in the records the error passes to the next record that has them - since when copy has an error it stops inserting data on this table. And I don't want to replace them manually one by one to accomplish this.

But it's kinda of strange because with UTF8 there shouldn't be this kind of problems, right?

I don't know how they got there in the first place. I'm only migrating the database, and I supose that somehow the database was like in LATIN1 and then was improperly changed to UTF8.

Is there any way to check if a table/database has invalid UTF8 sequences? Or any way to enforce/reconvert these characters into UFT8 so I don't run into any problems when I execute the restore?

Thanks, in advance.

Colin 't Hart · Accepted Answer · 2018-06-01 12:14:14Z

9

Digging around the internet, I've seen that this is a pretty common problem. The common solution is to use the plain text format dump and feed it through iconv to correct the encoding.

Here is more information about that.

edited Jun 1, 2018 at 12:14

Colin 't Hart

9,34315 gold badges35 silver badges43 bronze badges

answered Aug 18, 2011 at 19:09

Richard

1

use iconv to convert to UTF-32 discarding invalid symbols and then back to UTF-8, a UTF-8 to UTF-8 conversion won't catch all bad code points. (eg orphan surrogates)
– Jasen
Commented May 25, 2015 at 23:04

Add a comment |

Jack Douglas · Accepted Answer · 2011-08-18 21:59:46Z

"I don't know how they got there in the first place"

It could have happened as described here - although this generates an error on 8.4:

If you create a table with any text type (i.e. text, varchar(10), etc.), then you can insert an invalid byte sequence into that field using octal escapes.

For instance, if you have a UTF8-encoded database, you can do:

=> CREATE TABLE foo(t TEXT);

=> INSERT INTO foo VALUES(E'\377');

Now, if you COPY the table out, you can't COPY the resulting file back in. That means your pg_dump backups won't be able to restore. The only way to get your data back in is to re-escape that value.

There is a good post on this excellent blog about the general issues and some ways to deal with them

Nijil · Accepted Answer · 2019-06-05 06:07:05Z

3

I don't recommend blindly running iconv on the plain text dump because it may convert valid characters(eg: Chinese characters ) to some other characters. It is better to find the invalid UTF8 character by running below command.

grep -naxv '.*' plain_text_dump.sql

and then run iconv on the particular data. Check this doc for detailed step by step explanation.

answered Jun 5, 2019 at 6:07

Nijil

1312 bronze badges

Add a comment |

arulraj.net · Accepted Answer · 2014-08-14 00:59:53Z

Its likely with the default encoding used in your Unix/Linux environment. To check which encoding is currently the default one, execute the following:

$ echo $LANG
en_US

In this case, we can clearly see it is not an UTF-8 encoding, the one which the copy command relies on.

So to fix this, we just set the LANG variable in example to the following:

$ export LANG=en_US.UTF-8

Note: This will only be available for the current session. Add it to ~/.bashrc or similar to have it available on startup of any future shell session.

Reference

Biswajit Barman · Accepted Answer · 2020-02-06 13:58:12Z

0

I referenced the following link which gave me clues to determine the source encoding and then convert it into the desired UTF-8 encoding. Linux Check and Change Encoding

$ file -bi cabot.sql
text/plain; charset=utf-16le
$ iconv -f utf-16le -t utf-8 -o converted.sql cabot.sql
$ file -bi converted.sql
text/plain; charset=utf-8

answered Feb 6, 2020 at 13:58

Biswajit Barman

1

Add a comment |

alboforlizo · Accepted Answer · 2024-02-29 18:46:19Z

0

First, I exported a plain file in WIN1252 encoding:

$ sudo -u postgres pg_dump "OLD-DB" -F p -E WIN1252 -f /tmp/db.dumplain

I dropped wrong characters to UTF-8:

$ sudo -u postgres iconv -c -t utf-8 /tmp/db.dumplain > /tmp/db.utf8

Then imported with no problems:

$ sudo -u postgres psql "NEW-DB" < /tmp/db.utf8

edited Feb 29 at 18:46

answered Feb 29 at 18:39

alboforlizo

1011 bronze badge

Add a comment |

Stack Exchange Network

How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8?

6 Answers 6

Not the answer you're looking for? Browse other questions tagged
postgresql
migration
pgadmin
or ask your own question.

Hot Network Questions

How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8?

6 Answers 6

Not the answer you're looking for? Browse other questions tagged postgresqlmigrationpgadmin or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
postgresql
migration
pgadmin
or ask your own question.