6

Background

I'm in a real mess with unicode and Python. It seems to be a common angst and I've tried using other solutions out there but I just can't get my head around it.

Setup

MySQL Database Setup

  • collation_database: utf8_general_ci
  • character_set_database: utf8

SQLAlchemy Model

class Product(Base):
    id = Column('product_id', Integer, primary_key=True)
    name = Column('product_name', String(64)) #Tried using Unicode() but didn't help

Pyramid View

@view_config(renderer='json', route_name='products_search')
def products_search(request):
    json_products = []
    term = "%%%s%%" % request.params['term']

    products = dbsession.query(Product).filter(Product.name.like(term)).all()

    for prod in products:
        json_prod = {'id': prod.id, 'label': prod.name, 'value': prod.name, 'sku': prod.sku, 'price': str(prod.price[0].price)}
        json_products.append(json_prod)

    return json_products

Problem

I get encoding errors reported from the json module (which is called as its the renderer for this route) like so:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 37: invalid start byte

The culprit is a "-" (dash symbol) in the prod.name value. Full stack trace here. If the returned products don't have a "-" in then it all works fine!

Tried

I've tried encoding, decoding with various types before returning the json_products variable.

1
  • 3
    I don't know about the frameworks you're using, but somewhere along the line, the dash (N-dash, , not -) is getting encoded in cp1252 (which gives byte 0x96). JSON always deals with unicode, so it tries to decode it using UTF-8, and fails. So somewhere you will need a .decode("cp1252").
    – Thomas K
    Commented Aug 21, 2011 at 17:19

1 Answer 1

6

The above comment is right, but more specifially, you can replace 'label': prod.name with 'label': prod.name.decode("cp1252"). You should probably also do this for all of the strings in your json_prod dictionary since it's likely you'll see cp1252 encoded characters elsewhere in the real-world use of your application.

On that note, depending on the source of these strings and how widely that source is used in your app, you may run into such problems elsewhere in your app and generally when you least expect it. To investigate further you may want to figure What the source for these strings is and if you can do the decoding/re-encoding at a lower level to correct most future problems with this.

1
  • 3
    Fantastic! Thank you - this has helped solve a few other encoding issues too. I Couldn't understand why it was encoded at cp1252 when the database was utf8. Turns out MySQL clients default to latin1 when not specifically told otherwise. Even if the table is utf8. I added "?charset=utf8" to the end of my database connection URL which ensured all returned data was encoded as UTF-8. Sorted at the lower level now then! Thanks again!
    – RustyFluff
    Commented Aug 22, 2011 at 13:41

Not the answer you're looking for? Browse other questions tagged or ask your own question.