How can I unescape HTML character entities in Java?

Question

Basically, I would like to decode a given HTML document, and replace all special characters, such as "&nbsp;" → " " and "&gt;" → ">".

In .NET, we can make use of the HttpUtility.HtmlDecode method.

What's the equivalent function in Java?

&nbsp; is called a character entity. Edited the title.
– Eugene Yokota
Commented Jun 15, 2009 at 2:46

Vivien · Accepted Answer · 2019-08-30 09:48:04Z

225

I have used the Apache Commons StringEscapeUtils.unescapeHtml4() for this:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.
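To make the quoted description concrete, here is a minimal sketch of what "unescaping to the actual Unicode characters" means for numeric escapes only. It is not the Commons implementation: the names `NumericEntityDemo` and `decodeNumeric` are made up for illustration, and named entities such as `&nbsp;` are deliberately left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Apache Commons.
class NumericEntityDemo {

    // Matches decimal (&#8211;) and hexadecimal (&#x2013;) character references.
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");

    static String decodeNumeric(String input) {
        Matcher m = NUMERIC.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged
            out.append(input, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                // appendCodePoint emits a surrogate pair for values above U+FFFF
                out.appendCodePoint(codePoint);
            } else {
                // Out-of-range references are left exactly as they were
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeNumeric("Smith &#8211; &#x41;&#61;B"));
    }
}
```

For real documents a maintained library is still the safer choice, since it also covers the named entities this sketch skips.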

edited Aug 30, 2019 at 9:48

Vivien

answered Jun 15, 2009 at 2:43

Kevin Hakanson

23

Sadly I just realized today that it does not decode HTML special characters very well :(
– Sid
Commented Oct 13, 2010 at 20:04
1

a dirty trick is to store the value initially in a hidden field to escape it, then the target field should get the value from the hidden field.
– setzamora
Commented Jun 16, 2011 at 5:19
3

Class StringEscapeUtils is deprecated and moved to Apache commons-text
– Pauli
Commented Dec 3, 2018 at 22:16
2

I want to convert the string <p>üè</p> to <p>üé</p>, with StringEscapeUtils.unescapeHtml4() I get <p>üè</p>. Is there a way to keep existing html tags intact?
– Nickkk
Commented Jan 13, 2020 at 12:10
If I have something like  which escapes to a quotation mark in Windows-1252 but some control character in Unicode, can the escaping encoding be changed?
– ifly6
Commented Dec 11, 2020 at 13:21

Peter Mortensen · Accepted Answer · 2023-05-03 13:33:16Z

69

The libraries mentioned in other answers would be fine solutions, but if you already happen to be digging through real-world HTML content in your project, the Jsoup project has a lot more to offer than just managing "ampersand pound FFFF semicolon" things.

// textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
// becomes this: This is a sample. "Granny" Smith –.
// with one line of code:
// Jsoup.parse(textValue).getText(); // for older versions of Jsoup
Jsoup.parse(textValue).text();

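// Another possibility may be the static unescapeEntities method.
// Its second parameter is named inAttribute in the jsoup API and selects
// the stricter decoding rules used for attribute values.
boolean inAttribute = true;
String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, inAttribute);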

And you also get the convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It's open source under the MIT License.

edited May 3, 2023 at 13:33

Peter Mortensen

answered May 17, 2016 at 13:25

Dale

4

upvote+, but I should point out that newer versions of Jsoup use .text() instead of .getText()
– SourceVisor
Commented Nov 10, 2016 at 16:25
6

Perhaps more direct is to use org.jsoup.parser.Parser.unescapeEntities(String string, boolean inAttribute). API docs: jsoup.org/apidocs/org/jsoup/parser/…
– danneu
Commented Dec 1, 2016 at 21:17
3

This was perfect, since I'm already using Jsoup in my project. Also, @danneu was right - Parser.unescapeEntities works exactly as advertised.
– MandisaW
Commented Aug 29, 2017 at 17:23
why then does the following not return un-escaped html: Parser.unescapeEntities(Jsoup.parse("<div>•</div>").text(), true)
– DavesPlanet
Commented Nov 1, 2023 at 19:15
1

@Dale see question 77405300 for details but what happens is I get a utf16 string with a bullet point out instead of the text
– DavesPlanet
Commented Nov 6, 2023 at 19:31

Peter Mortensen · Accepted Answer · 2023-05-03 14:16:22Z

I tried Apache Commons' StringEscapeUtils.unescapeHtml3() in my project, but I wasn't satisfied with its performance. It turns out, it does a lot of unnecessary operations. For one, it allocates a StringWriter for every call, even if there's nothing to unescape in the string. I've rewritten that code differently, and now it works much faster.

The following code unescapes all HTML 3 symbols and numeric escapes (equivalent to Apache unescapeHtml3). You can just add more entries to the map if you need HTML 4.

package com.example;

import java.io.StringWriter;
import java.util.HashMap;

public class StringUtils {

    public static final String unescapeHtml3(final String input) {
        StringWriter writer = null;
        int len = input.length();
        int i = 1;
        int st = 0;
        while (true) {
            // Look for '&'
            while (i < len && input.charAt(i-1) != '&')
                i++;
            if (i >= len)
                break;

            // Found '&', look for ';'
            int j = i;
            while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';')
                j++;
            if (j == len || j < i + MIN_ESCAPE || j == i + MAX_ESCAPE + 1) {
                i++;
                continue;
            }

            // Found escape
            if (input.charAt(i) == '#') {
                // Numeric escape
                int k = i + 1;
                int radix = 10;

                final char firstChar = input.charAt(k);
                if (firstChar == 'x' || firstChar == 'X') {
                    k++;
                    radix = 16;
                }

                try {
                    int entityValue = Integer.parseInt(input.substring(k, j), radix);

                    if (writer == null)
                        writer = new StringWriter(input.length());
                    writer.append(input.substring(st, i - 1));

                    if (entityValue > 0xFFFF) {
                        final char[] chrs = Character.toChars(entityValue);
                        writer.write(chrs[0]);
                        writer.write(chrs[1]);
                    } else {
                        writer.write(entityValue);
                    }

                } catch (NumberFormatException ex) {
                    i++;
                    continue;
                }
            }
            else {
                // Named escape
                CharSequence value = lookupMap.get(input.substring(i, j));
                if (value == null) {
                    i++;
                    continue;
                }

                if (writer == null)
                    writer = new StringWriter(input.length());
                writer.append(input.substring(st, i - 1));

                writer.append(value);
            }

            // Skip escape
            st = j + 1;
            i = st;
        }

        if (writer != null) {
            writer.append(input.substring(st, len));
            return writer.toString();
        }
        return input;
    }

    private static final String[][] ESCAPES = {
        {"\"",     "quot"}, // " - double-quote
        {"&",      "amp"}, // & - ampersand
        {"<",      "lt"}, // < - less-than
        {">",      "gt"}, // > - greater-than

        // Mapping to escape ISO-8859-1 characters to their named HTML 3.x equivalents.
        {"\u00A0", "nbsp"},   // Non-breaking space
        {"\u00A1", "iexcl"},  // Inverted exclamation mark
        {"\u00A2", "cent"},   // Cent sign
        {"\u00A3", "pound"},  // Pound sign
        {"\u00A4", "curren"}, // Currency sign
        {"\u00A5", "yen"},    // Yen sign = yuan sign
        {"\u00A6", "brvbar"}, // Broken bar = broken vertical bar
        {"\u00A7", "sect"},   // Section sign
        {"\u00A8", "uml"},    // Diaeresis = spacing diaeresis
        {"\u00A9", "copy"},   // © - copyright sign
        {"\u00AA", "ordf"},   // Feminine ordinal indicator
        {"\u00AB", "laquo"},  // Left-pointing double angle quotation mark = left pointing guillemet
        {"\u00AC", "not"},    // Not sign
        {"\u00AD", "shy"},    // Soft hyphen = discretionary hyphen
        {"\u00AE", "reg"},    // ® - registered trademark sign
        {"\u00AF", "macr"},   // Macron = spacing macron = overline = APL overbar
        {"\u00B0", "deg"},    // Degree sign
        {"\u00B1", "plusmn"}, // Plus-minus sign = plus-or-minus sign
        {"\u00B2", "sup2"},   // Superscript two = superscript digit two = squared
        {"\u00B3", "sup3"},   // Superscript three = superscript digit three = cubed
        {"\u00B4", "acute"},  // Acute accent = spacing acute
        {"\u00B5", "micro"},  // Micro sign
        {"\u00B6", "para"},   // Pilcrow sign = paragraph sign
        {"\u00B7", "middot"}, // Middle dot = Georgian comma = Greek middle dot
        {"\u00B8", "cedil"},  // Cedilla = spacing cedilla
        {"\u00B9", "sup1"},   // Superscript one = superscript digit one
        {"\u00BA", "ordm"},   // Masculine ordinal indicator
        {"\u00BB", "raquo"},  // Right-pointing double angle quotation mark = right pointing guillemet
        {"\u00BC", "frac14"}, // Vulgar fraction one quarter = fraction one quarter
        {"\u00BD", "frac12"}, // Vulgar fraction one half = fraction one half
        {"\u00BE", "frac34"}, // Vulgar fraction three quarters = fraction three quarters
        {"\u00BF", "iquest"}, // Inverted question mark = turned question mark
        {"\u00C0", "Agrave"}, // À - uppercase A, grave accent
        {"\u00C1", "Aacute"}, // Á - uppercase A, acute accent
        {"\u00C2", "Acirc"},  // Â - uppercase A, circumflex accent
        {"\u00C3", "Atilde"}, // Ã - uppercase A, tilde
        {"\u00C4", "Auml"},   // Ä - uppercase A, umlaut
        {"\u00C5", "Aring"},  // Å - uppercase A, ring
        {"\u00C6", "AElig"},  // Æ - uppercase AE
        {"\u00C7", "Ccedil"}, // Ç - uppercase C, cedilla
        {"\u00C8", "Egrave"}, // È - uppercase E, grave accent
        {"\u00C9", "Eacute"}, // É - uppercase E, acute accent
        {"\u00CA", "Ecirc"},  // Ê - uppercase E, circumflex accent
        {"\u00CB", "Euml"},   // Ë - uppercase E, umlaut
        {"\u00CC", "Igrave"}, // Ì - uppercase I, grave accent
        {"\u00CD", "Iacute"}, // Í - uppercase I, acute accent
        {"\u00CE", "Icirc"},  // Î - uppercase I, circumflex accent
        {"\u00CF", "Iuml"},   // Ï - uppercase I, umlaut
        {"\u00D0", "ETH"},    // Ð - uppercase Eth, Icelandic
        {"\u00D1", "Ntilde"}, // Ñ - uppercase N, tilde
        {"\u00D2", "Ograve"}, // Ò - uppercase O, grave accent
        {"\u00D3", "Oacute"}, // Ó - uppercase O, acute accent
        {"\u00D4", "Ocirc"},  // Ô - uppercase O, circumflex accent
        {"\u00D5", "Otilde"}, // Õ - uppercase O, tilde
        {"\u00D6", "Ouml"},   // Ö - uppercase O, umlaut
        {"\u00D7", "times"},  // Multiplication sign
        {"\u00D8", "Oslash"}, // Ø - uppercase O, slash
        {"\u00D9", "Ugrave"}, // Ù - uppercase U, grave accent
        {"\u00DA", "Uacute"}, // Ú - uppercase U, acute accent
        {"\u00DB", "Ucirc"},  // Û - uppercase U, circumflex accent
        {"\u00DC", "Uuml"},   // Ü - uppercase U, umlaut
        {"\u00DD", "Yacute"}, // Ý - uppercase Y, acute accent
        {"\u00DE", "THORN"},  // Þ - uppercase THORN, Icelandic
        {"\u00DF", "szlig"},  // ß - lowercase sharp s, German
        {"\u00E0", "agrave"}, // à - lowercase a, grave accent
        {"\u00E1", "aacute"}, // á - lowercase a, acute accent
        {"\u00E2", "acirc"},  // â - lowercase a, circumflex accent
        {"\u00E3", "atilde"}, // ã - lowercase a, tilde
        {"\u00E4", "auml"},   // ä - lowercase a, umlaut
        {"\u00E5", "aring"},  // å - lowercase a, ring
        {"\u00E6", "aelig"},  // æ - lowercase ae
        {"\u00E7", "ccedil"}, // ç - lowercase c, cedilla
        {"\u00E8", "egrave"}, // è - lowercase e, grave accent
        {"\u00E9", "eacute"}, // é - lowercase e, acute accent
        {"\u00EA", "ecirc"},  // ê - lowercase e, circumflex accent
        {"\u00EB", "euml"},   // ë - lowercase e, umlaut
        {"\u00EC", "igrave"}, // ì - lowercase i, grave accent
        {"\u00ED", "iacute"}, // í - lowercase i, acute accent
        {"\u00EE", "icirc"},  // î - lowercase i, circumflex accent
        {"\u00EF", "iuml"},   // ï - lowercase i, umlaut
        {"\u00F0", "eth"},    // ð - lowercase eth, Icelandic
        {"\u00F1", "ntilde"}, // ñ - lowercase n, tilde
        {"\u00F2", "ograve"}, // ò - lowercase o, grave accent
        {"\u00F3", "oacute"}, // ó - lowercase o, acute accent
        {"\u00F4", "ocirc"},  // ô - lowercase o, circumflex accent
        {"\u00F5", "otilde"}, // õ - lowercase o, tilde
        {"\u00F6", "ouml"},   // ö - lowercase o, umlaut
        {"\u00F7", "divide"}, // Division sign
        {"\u00F8", "oslash"}, // ø - lowercase o, slash
        {"\u00F9", "ugrave"}, // ù - lowercase u, grave accent
        {"\u00FA", "uacute"}, // ú - lowercase u, acute accent
        {"\u00FB", "ucirc"},  // û - lowercase u, circumflex accent
        {"\u00FC", "uuml"},   // ü - lowercase u, umlaut
        {"\u00FD", "yacute"}, // ý - lowercase y, acute accent
        {"\u00FE", "thorn"},  // þ - lowercase thorn, Icelandic
        {"\u00FF", "yuml"},   // ÿ - lowercase y, umlaut
    };

    private static final int MIN_ESCAPE = 2;
    private static final int MAX_ESCAPE = 6;

    private static final HashMap<String, CharSequence> lookupMap;
    static {
        lookupMap = new HashMap<String, CharSequence>();
        for (final CharSequence[] seq : ESCAPES)
            lookupMap.put(seq[1].toString(), seq[0]);
    }

}

Recently, I had to optimize a slow Struts project. It turned out that under the cover Struts calls Apache for html string escaping by default (<s:property value="..."/>). Turning off escaping (<s:property value="..." escaping="false"/>) got some pages to run 5% to 20% faster. — Stephan, Commented Jul 13, 2014 at 22:10
A StringWriter uses a StringBuffer internally which uses locking. Using a StringBuilder directly should be faster. — Axel Dörfler, Commented Feb 22, 2016 at 12:40
Found a bug in the above code when encountering "&#61;" aka =. writer.write(entityValue); should be writer.write(Character.toString((char)entityValue)); — Stevko, Commented May 16, 2016 at 23:54
@NickFrolov, your comments seem a bit messed up. auml is for instance ä and not д. — aioobe, Commented Oct 17, 2016 at 1:52
Improved version with all HTML5 characters: gist.github.com/MarkJeronimus/798c452582e64410db769933ec71cfb7 — Mark Jeronimus, Commented Jun 22, 2020 at 12:26
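The two performance points raised in this answer and its comments (skip allocation when there is nothing to unescape, and prefer an unsynchronized `StringBuilder` over `StringWriter`) can be sketched as follows. This is only an illustration under assumptions: it handles just the four entities in `TABLE`, and the names `LazyUnescape`, `TABLE` and `unescapeBasic` are invented here, not taken from the answer.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; covers a tiny, assumed entity table.
class LazyUnescape {

    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("quot", "\"");
        TABLE.put("amp", "&");
        TABLE.put("lt", "<");
        TABLE.put("gt", ">");
    }

    static String unescapeBasic(String input) {
        int amp = input.indexOf('&');
        if (amp < 0) {
            return input; // fast path: no allocation at all
        }
        StringBuilder out = null; // created only on the first real replacement
        int copiedUpTo = 0;
        while (amp >= 0) {
            int semi = input.indexOf(';', amp + 1);
            if (semi < 0) {
                break; // no terminator left, nothing more to decode
            }
            String value = TABLE.get(input.substring(amp + 1, semi));
            if (value != null) {
                if (out == null) {
                    out = new StringBuilder(input.length());
                }
                out.append(input, copiedUpTo, amp).append(value);
                copiedUpTo = semi + 1;
                amp = input.indexOf('&', copiedUpTo);
            } else {
                amp = input.indexOf('&', amp + 1); // not a known entity; move on
            }
        }
        if (out == null) {
            return input; // '&' was present but nothing was decoded
        }
        return out.append(input, copiedUpTo, input.length()).toString();
    }
}
```

When no entity is decoded the method returns the very same `String` instance, so callers pay nothing for text that is already clean.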

herman · Accepted Answer · 2020-05-14 09:10:44Z

19

Spring Framework HtmlUtils

If you're using Spring framework already, use the following method:

import static org.springframework.web.util.HtmlUtils.htmlUnescape;

...

String result = htmlUnescape(source);

answered May 14, 2020 at 9:10

herman

Stephan · Accepted Answer · 2016-07-27 12:02:45Z

17

The following library can also be used for HTML escaping in Java: unbescape.

HTML can be unescaped this way:

final String unescapedText = HtmlEscape.unescapeHtml(escapedText);

edited Jul 27, 2016 at 12:02

answered Jul 13, 2014 at 22:59

Stephan

2

It did nothing to this: %3Chtml%3E%0D%0A%3Chead%3E%0D%0A%3Ctitle%3Etest%3C%2Ftitle%3E%0D%0A%3C%2Fhead%3E%0D%0A%3Cbody%3E%0D%0Atest%0D%0A%3C%2Fbody%3E%0D%0A%3C%2Fhtml%3E
– user1191027
Commented Aug 27, 2015 at 16:33
45

@ThreaT Your text is not html-encoded, it is url-encoded.
– Mikhail Batcer
Commented Oct 28, 2015 at 7:23
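
As the comment above notes, `%3C`-style text is percent-encoded (URL-encoded), not HTML-escaped, so an HTML unescaper rightly leaves it alone. A minimal sketch of decoding such text with the JDK's `java.net.URLDecoder` follows; the class name `UrlDecodeDemo` and the shortened sample string are illustrative.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecodeDemo {

    static String decodeUrlEncoded(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeUrlEncoded("%3Ctitle%3Etest%3C%2Ftitle%3E"));
    }
}
```

If the decoded result still contains entities such as `&amp;`, it can then be passed through an HTML unescaper as a second step.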

Peter Mortensen · Accepted Answer · 2023-05-03 14:21:34Z

12

This did the job for me,

import org.apache.commons.lang.StringEscapeUtils;
...
String decodedXML = StringEscapeUtils.unescapeHtml(encodedXML);

Or

import org.apache.commons.lang3.StringEscapeUtils;
...
String decodedXML = StringEscapeUtils.unescapeHtml4(encodedXML);

I guess it's generally better to use lang3, since it is the newer release line of Commons Lang.

edited May 3, 2023 at 14:21

Peter Mortensen

answered Apr 19, 2017 at 2:31

tk_

Peter Mortensen · Accepted Answer · 2023-05-03 14:22:53Z

4

A very simple, but inefficient solution without any external library is:

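import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static String unescapeHtml3(String str) {
    try {
        HTMLDocument doc = new HTMLDocument();
        // Let Swing's HTML parser decode the entities while reading the fragment
        new HTMLEditorKit().read(new StringReader("<html><body>" + str), doc, 0);
        return doc.getText(1, doc.getLength());
    } catch (Exception ex) {
        // Fall back to the original text if parsing fails
        return str;
    }
}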

Use this only if you have a small number of strings to decode.

edited May 3, 2023 at 14:22

Peter Mortensen

answered Dec 3, 2016 at 22:07

Horcrux7

1

Very close, but not exact - it converted "qwAS12ƷƸǅǚǪǼȌ" to "qwAS12ƷƸǅǚǪǼȌ\n".
– Greg
Commented Jul 16, 2018 at 17:21

Floern · Accepted Answer · 2017-09-12 21:43:21Z

3

The most reliable way is with

String cleanedString = StringEscapeUtils.unescapeHtml4(originalString);

from org.apache.commons.lang3.StringEscapeUtils.

And to trim the surrounding whitespace:

cleanedString = cleanedString.trim();

This ensures that whitespace introduced by copy and paste in web forms does not get persisted in the DB.
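
One caveat worth checking against your own data: a decoded `&nbsp;` becomes U+00A0, and `String.trim()` only strips characters up to U+0020, so it leaves non-breaking spaces in place. A minimal sketch of one way to handle that, assuming you are happy to treat U+00A0 as an ordinary space (the names `TrimDemo` and `trimIncludingNbsp` are illustrative):

```java
class TrimDemo {

    // Replaces non-breaking spaces with regular spaces, then trims.
    static String trimIncludingNbsp(String s) {
        return s.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        // For example, the result of unescaping "&nbsp; value &nbsp;"
        String decoded = "\u00A0 value \u00A0";
        // Plain trim() keeps the non-breaking spaces at both ends
        System.out.println("[" + decoded.trim() + "]");
        System.out.println("[" + trimIncludingNbsp(decoded) + "]");
    }
}
```

Note that this also converts non-breaking spaces inside the text, which may or may not be what you want.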

edited Sep 12, 2017 at 21:43

Floern

answered Sep 12, 2017 at 21:16

mike oganyan

Pramod H G · Accepted Answer · 2021-09-09 12:07:23Z

1

StringEscapeUtils (Apache Commons Lang)
Escapes and unescapes Strings for Java, JavaScript, HTML, and XML.

import org.apache.commons.lang.StringEscapeUtils;
....
StringEscapeUtils.unescapeHtml(comment);

Reference: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html

answered Sep 9, 2021 at 12:07

Pramod H G

Peter Mortensen · Accepted Answer · 2023-05-03 13:45:30Z

0

Consider using the HtmlManipulator Java class. You may need to add some items (not all entities are in the list).

The Apache Commons StringEscapeUtils as suggested by Kevin Hakanson did not work 100% for me; several entities, like &#145 (left single quote) were translated into '222' somehow. I also tried org.jsoup, and had the same problem.

edited May 3, 2023 at 13:45

Peter Mortensen

answered Jun 3, 2014 at 23:25

Joost

222 is likely in octal (hexadecimal 0x92, decimal 146). In Windows-1252 (but not in ISO 8859-1), 0x92 corresponds to U+2019 (RIGHT SINGLE QUOTATION MARK). Are you sure it is not octal 221? Or right single quote?
– Peter Mortensen
Commented May 3, 2023 at 14:04
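
The arithmetic in the comment above can be checked with the JDK alone. The sketch below converts the octal values and decodes the bytes 0x91 and 0x92 as Windows-1252; `Cp1252Demo` and `fromWindows1252` are illustrative names, and the code assumes the running JVM ships the windows-1252 charset.

```java
import java.nio.charset.Charset;

class Cp1252Demo {

    // Interprets a single byte value (0-255) as a Windows-1252 character.
    static String fromWindows1252(int byteValue) {
        return new String(new byte[] {(byte) byteValue}, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Octal 221 and 222 are decimal 145 and 146
        System.out.println(Integer.parseInt("221", 8));
        System.out.println(Integer.parseInt("222", 8));
        // In Windows-1252 these bytes are the curly single quotes
        System.out.println(Integer.toHexString(fromWindows1252(145).charAt(0)));
        System.out.println(Integer.toHexString(fromWindows1252(146).charAt(0)));
        // Taken as Unicode code points, the same numbers are control characters
        System.out.println(Character.isISOControl((char) 146));
    }
}
```

So a decoder that maps `&#145;` straight to the code point 145 yields a control character, while one that applies the Windows-1252 table yields U+2018.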

Peter Mortensen · Accepted Answer · 2023-05-03 14:19:32Z

In my case, I use the replace method by testing every entity in every variable. My code looks like this:

text = text.replace("&Ccedil;", "Ç");
text = text.replace("&ccedil;", "ç");
text = text.replace("&Aacute;", "Á");
text = text.replace("&Acirc;", "Â");
text = text.replace("&Atilde;", "Ã");
text = text.replace("&Eacute;", "É");
text = text.replace("&Ecirc;", "Ê");
text = text.replace("&Iacute;", "Í");
text = text.replace("&Ocirc;", "Ô");
text = text.replace("&Otilde;", "Õ");
text = text.replace("&Oacute;", "Ó");
text = text.replace("&Uacute;", "Ú");
text = text.replace("&aacute;", "á");
text = text.replace("&acirc;", "â");
text = text.replace("&atilde;", "ã");
text = text.replace("&eacute;", "é");
text = text.replace("&ecirc;", "ê");
text = text.replace("&iacute;", "í");
text = text.replace("&ocirc;", "ô");
text = text.replace("&otilde;", "õ");
text = text.replace("&oacute;", "ó");
text = text.replace("&uacute;", "ú");

In my case this worked very well.

This isn't every special entity. Even the two mentioned in the question are missing. — Sandy Gifford, Commented Oct 27, 2016 at 15:27
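
A chain of `replace` calls scans the text once per entity and, once `&amp;` is added to it, can decode text twice (`&amp;lt;` becomes `&lt;` and then `<`). One possible alternative is a single pass driven by a lookup table. The sketch below is an illustration under assumptions, not the answerer's code: `EntityTableDemo`, `ENTITIES` and `decodeNamed` are hypothetical names, and the table holds only a handful of entries that would need extending.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityTableDemo {

    // Small sample table; extend it with whatever entities your data contains.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("nbsp", "\u00A0");
        ENTITIES.put("Ccedil", "\u00C7");
        ENTITIES.put("ccedil", "\u00E7");
        ENTITIES.put("eacute", "\u00E9");
    }

    private static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String decodeNamed(String text) {
        Matcher m = NAMED.matcher(text);
        // StringBuffer keeps this compatible with Java 8; the StringBuilder
        // overload of appendReplacement only exists from Java 9 on.
        StringBuffer out = new StringBuffer(text.length());
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown entity: keep it verbatim
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because each position in the input is visited once, already-decoded output is never scanned again, which is what prevents the double decoding.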

Peter Mortensen · Accepted Answer · 2023-05-03 13:42:18Z

In case you want to mimic what the PHP function htmlspecialchars_decode() does, use the PHP function get_html_translation_table() to dump the table and then use Java code like this:

import java.util.LinkedHashMap;
import java.util.Map;

// Insertion-ordered so that "&amp;" is always decoded last; decoding it first
// would turn "&amp;lt;" into "&lt;" and then, incorrectly, into "<".
static Map<String, String> html_specialchars_table = new LinkedHashMap<String, String>();

static {
    html_specialchars_table.put("&lt;", "<");
    html_specialchars_table.put("&gt;", ">");
    html_specialchars_table.put("&amp;", "&");
}

static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
    for (Map.Entry<String, String> entry : html_specialchars_table.entrySet()) {
        // String.replace treats the key as literal text, not as a regex
        s = s.replace(entry.getKey(), entry.getValue());
    }
    return s;
}
