Convert HTML entities to Unicode and vice versa

Question

How do you convert HTML entities to Unicode and vice versa in Python?

@Jarret Hardie: Actually, show-and-tell is perfectly fine on SO. From the first entry on the FAQ (stackoverflow.com/faq) "It's also perfectly fine to ask and answer your own programming question". Although, it's also encouraged to look for duplicates as well. — chauncey, Commented Mar 31, 2009 at 16:13
I am posting questions that I have answered for myself in the past for the benefit of other users searching for similar answers. — hekevintran, Commented Mar 31, 2009 at 16:25
Can also be done without external libraries. See stackoverflow.com/questions/663058/html-entity-codes-to-text/… — bobince, Commented Mar 31, 2009 at 16:31
This question is wider in scope than the one pointed to by the "duplicate" link: this question also asks for "vice versa", i.e., from Unicode to HTML entities. — Vebjorn Ljosa, Commented Sep 24, 2009 at 10:52

Isaac · Accepted Answer · 2010-04-17 06:13:38Z

112

As to the "vice versa" (which I needed myself, leading me to find this question, which didn't help, and subsequently another site which had the answer):

u'some string'.encode('ascii', 'xmlcharrefreplace')

will return a plain string with any non-ASCII characters turned into numeric XML (HTML) character references such as &#8217;.
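The one-liner above is Python 2 syntax. A minimal Python 3 sketch of the same idea follows, assuming you want a str back rather than bytes; the sample text and variable names are only illustrative.

```python
# Python 3: str.encode() returns bytes, so decode back to str if needed.
text = "It\u2019s 5 \u20ac"  # right single quote and euro sign

encoded = text.encode("ascii", "xmlcharrefreplace")  # bytes
as_str = encoded.decode("ascii")                     # back to str

print(as_str)  # It&#8217;s 5 &#8364;
```

Note that this only replaces non-ASCII characters; characters such as & and < are left untouched, so it is not a full HTML escape on its own.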

1

I've forgotten about xmlcharrefreplace and this was very helpful. Any time I need to safely store encoded or non-ascii characters to mysql I find I need to use this method.
– cybertoast
Commented Feb 2, 2012 at 20:36
1

This doesn't work with a string literal containing the Unicode character U+2019 (HTML entity equivalent &#8217;). Isn't this what the question was asking for (this answer converts ASCII, which is a subset of Unicode)? text.decode('utf-8').encode('ascii', 'xmlcharrefreplace')
– Mike S
Commented Jul 7, 2014 at 20:26
1

@MikeS It works without problem; >>> u'\u2019'.encode('utf-8').decode('utf-8').encode('ascii', 'xmlcharrefreplace') gives '&#8217;'
– Piotr Dobrogost
Commented Jun 6, 2016 at 11:46

hekevintran · Accepted Answer · 2009-03-31 15:57:56Z

33

You need to have BeautifulSoup.

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
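The code above targets Python 2 and BeautifulSoup 3. A rough Python 3 equivalent using only the standard library might look like the sketch below; the snake_case function names are my own, and quote=False is an assumption chosen to mirror cgi.escape's default behaviour.

```python
import html

def html_entities_to_unicode(text):
    """Convert HTML entities to Unicode, e.g. '&amp;' becomes '&'."""
    return html.unescape(text)

def unicode_to_html_entities(text):
    """Convert Unicode to HTML entities, e.g. '&' becomes '&amp;'."""
    # Escape &, < and > first, then turn non-ASCII characters into &#NNN;.
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)

print(uni)
print(htmlent)
```

As in the original, the round trip produces numeric references rather than the named entities it started from.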

2

The BeautifulSoup api has changed. Please see the most recent doc.
– scharfmn
Commented Mar 3, 2015 at 6:22
@hekevintran: Is it possible to print '&cent;, &pound;, &yen;, &euro;, &sect;, &copy;' instead of '&#162;, &#163;, &#165;, &#8364;, &#167;, &#169;'. Any idea?
– Jagath
Commented Aug 5, 2016 at 7:49
9

This answer is in desperate need of a Python3 update.
– Routhinator
Commented Sep 25, 2018 at 23:12

scharfmn · Accepted Answer · 2017-02-02 19:51:46Z

Update for Python 2.7 and BeautifulSoup4

Unescape -- Unicode HTML to unicode with HTMLParser (Python 2.7 standard lib):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape -- Unicode HTML to unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's warning
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape -- Unicode to unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
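If you want named entities without BeautifulSoup, one possible standard-library sketch in Python 3 uses html.entities.codepoint2name. The helper name escape_with_named_entities is hypothetical, and the fallback to numeric references for characters without an HTML 4 name is my own choice, not part of the answer above.

```python
from html.entities import codepoint2name

def escape_with_named_entities(text):
    """Replace characters with named entities where HTML 4 defines one."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            out.append("&" + name + ";")      # e.g. é -> &eacute;
        elif ord(ch) > 127:
            out.append("&#%d;" % ord(ch))     # no name: numeric reference
        else:
            out.append(ch)                    # plain ASCII stays as-is
    return "".join(out)

unescaped = "Monsieur le Cur\u00e9 of the \u00abNotre-Dame-de-Gr\u00e2ce\u00bb neighborhood"
escaped = escape_with_named_entities(unescaped)
print(escaped)
```

Because codepoint2name also contains &, <, > and the double quote, those characters are escaped as well.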

upvote for showing a standard library solution with no dependencies — Hartley Brody, Commented Jul 21, 2016 at 15:58
Revisiting I just saw the comment @bobince left on the question pointing to this answer. Since htmlparser is documented now, and since that comment is not prominent, leaving that part of answer. — scharfmn, Commented Jul 21, 2016 at 17:02

Pedro Lobito · Accepted Answer · 2023-06-12 13:42:11Z

18

For python3 use html.unescape():

import html
s = "&amp;"
u = html.unescape(s)
# &

edited Jun 12, 2023 at 13:42

answered May 6, 2020 at 23:22

1

Simple and sweet.
– Mark Ransom
Commented Jun 1, 2023 at 20:10

AXO · Accepted Answer · 2014-07-09 00:13:30Z

As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote escaping is off by default in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, the function won't escape single quotes ("'"). (Because of these issues the function has been deprecated since version 3.2.)

The suggested replacement is html.escape(s) instead of cgi.escape(s). (New in version 3.2)

html.unescape(s) was introduced in version 3.4.

So in Python 3.4 and later you can:

Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
Use html.unescape(text) to convert HTML entities back to their plain-text representations.
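A short sketch of that round trip in Python 3, with an illustrative sample string of my own, shows how html.escape handles both kinds of quotes by default:

```python
import html

raw = 'He said "caf\u00e9" & \'ok\''

# html.escape() escapes &, <, > and, with the default quote=True,
# both double and single quotes.
safe = html.escape(raw)

# Turn the remaining non-ASCII characters into numeric references.
ascii_safe = safe.encode("ascii", "xmlcharrefreplace").decode()

print(ascii_safe)  # He said &quot;caf&#233;&quot; &amp; &#x27;ok&#x27;

# html.unescape() reverses both steps.
restored = html.unescape(ascii_safe)
print(restored == raw)  # True
```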

In Python 2.7 you can use HTMLParser.unescape(text)
– frank
Commented May 16, 2016 at 19:28

Jan Kyu Peblik · Accepted Answer · 2020-05-01 20:17:32Z

9

$ python3 -c "
> import html
> print(
>     html.unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python3 -c "
> import html
> print(
>     html.escape('&©—')
> )"
&amp;©—

$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python2 -c "
> import cgi
> print(
>     cgi.escape('&©—')
> )"
&amp;©—

HTML only strictly requires & (ampersand) and < (left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state
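If you only want that minimal escaping in Python 3, passing quote=False to html.escape leaves quotes alone. The sketch below uses a made-up sample string; keep in mind that quotes still need escaping when the text ends up inside a quoted attribute value.

```python
import html

sample = 'a "quoted" <b> & \'c\''

# quote=False escapes only &, < and >.
minimal = html.escape(sample, quote=False)

# The default (quote=True) also escapes " and '.
full = html.escape(sample)

print(minimal)  # a "quoted" &lt;b&gt; &amp; 'c'
print(full)     # a &quot;quoted&quot; &lt;b&gt; &amp; &#x27;c&#x27;
```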

edited May 1, 2020 at 20:17

answered Oct 8, 2019 at 17:24

brucekaushik · Accepted Answer · 2018-02-08 15:50:26Z

3

If someone like me is out there wondering why some entity numbers (codes) like &#153; (for the trademark symbol) or &#128; (for the euro symbol) are not decoded properly: those numbers come from Windows-1252, and the corresponding characters are not defined in ISO-8859-1 (the two encodings are often confused, but they differ in the 128 to 159 range).

Also note that the default character set as of HTML5 is UTF-8; it was ISO-8859-1 for HTML 4.

So we will have to work around this somehow (find and replace those references first).
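One way to sketch such a find-and-replace step in Python 3 is shown below. The helper name fix_cp1252_refs is hypothetical, and the regular expression only handles decimal references. As far as I can tell, Python 3's html.unescape already applies the HTML5 remapping for these numbers, so the manual step is mainly useful with decoders that do not.

```python
import html
import re

def fix_cp1252_refs(text):
    """Rewrite decimal references 128-159 as the Windows-1252 character."""
    def repl(match):
        number = int(match.group(1))
        if 128 <= number <= 159:
            try:
                return bytes([number]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes in this range are undefined in Windows-1252.
                return match.group(0)
        return match.group(0)
    return re.sub(r"&#(\d+);", repl, text)

fixed = fix_cp1252_refs("&#153; &#128; &#169;")
print(fixed)

# html.unescape performs the same remapping on its own:
print(html.unescape("&#153; &#128;"))
```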

Reference (starting point) from Mozilla's documentation

https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings

edited Feb 8, 2018 at 15:50

answered Feb 8, 2018 at 15:14

Stephen Ellwood · Accepted Answer · 2017-05-17 14:18:29Z

I used the following function to convert Unicode text ripped from an xls file into an HTML file while preserving the special characters found in the xls file:

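import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary mode
        . if dat is empty it is replaced with a non-breaking space
        . non-ascii characters are translated to numeric character references
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure-ASCII data is written unchanged (no escaping of &, < or >)
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # escape markup characters, then turn non-ASCII characters into &#NNN;
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))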

hope this is useful to somebody
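For illustration, here is a hedged usage sketch in Python 3 that repeats the function so it runs on its own and writes to an in-memory io.BytesIO buffer instead of a real file; the sample cell values are invented.

```python
import html
import io

def html_wr(f, dat):
    """Write dat to binary file f, escaping only when non-ASCII is present."""
    if not dat:
        dat = "&nbsp;"
    try:
        f.write(dat.encode("ascii"))
    except UnicodeEncodeError:
        f.write(html.escape(dat).encode("ascii", "xmlcharrefreplace"))

buf = io.BytesIO()  # stands in for a file opened with mode "wb"
for cell in ["plain", "", "caf\u00e9 & cr\u00e8me"]:
    html_wr(buf, cell)
    buf.write(b"\n")

result = buf.getvalue().decode("ascii")
print(result)
```

Note that pure-ASCII input takes the first branch and is written without html.escape, so a cell such as "a < b" would reach the file unescaped.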

HappyFace · Accepted Answer · 2020-04-16 15:17:21Z

0

#!/usr/bin/env python3
import fileinput
import html

for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))
