13

I'm trying to implement a cross-platform (desktop browsers, iOS, & Android) typography system that allows users to input any Unicode string.

What are some strings I should use to stress-test my system and ensure the most nines of users will have a good experience? Is there a standard or de-facto standard list that I can also use?

9
  • If this is off-topic here, please direct me somewhere I can find my answer.
    – Ky -
    Commented Dec 30, 2015 at 22:45
  • Doesn't seem off-topic, just a bit too vague for you to be likely to get much useful feedback.
    – pvg
    Commented Dec 30, 2015 at 22:48
  • @pvg any idea how I could make it more specific?
    – Ky -
    Commented Dec 30, 2015 at 23:16
  • Well, you say input but then talk about rendering, and you mention several different platforms, all of which have their own font rendering and input systems, some with limited end-user control. So it's not really obvious what you're doing, what you're trying to achieve, what specific problems you are encountering or hoping to avoid, etc.
    – pvg
    Commented Dec 30, 2015 at 23:24
  • +1 from me for the samples you already have. Fascinating to see how well modern browsers and VS Code handle this stuff.
    – HeyHeyJC
    Commented Jul 25, 2018 at 21:10

3 Answers

17

Here are some strings that I use in tests like that:

  • Vertically-stacked characters: Z̤͔ͧ̑̓ä͖̭̈̇lͮ̒ͫǧ̗͚̚o̙̔ͮ̇͐̇
  • Right-to-left words: اختبار النص
  • Mixed-direction words: من left اليمين to الى right اليسار
  • Mixed-direction characters: a‭b‮c‭d‮e‭f‮g
  • Very long characters: ﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽
  • Emoji with skintone variations: 👱👱🏻👱🏼👱🏽👱🏾👱🏿
  • Emoji with gender variations: 🧟‍♀️🧟‍♂️
  • Emoji created by combining codepoints: 👨‍❤️‍💋‍👨👩‍👩‍👧‍👦🏳️‍⚧️🇵🇷
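
Strings like these are easier to keep in a test suite when the invisible code points are written as escapes, and it helps to record how each string measures under different length definitions. Here is a minimal Python sketch of that idea. The names `STRESS_STRINGS` and `describe` are hypothetical, the samples are shortened versions of the ones above, and the bidi sample assumes the invisible characters are the override controls U+202D and U+202E.

```python
import unicodedata

# Hypothetical fixture table; escapes make the invisible code points explicit.
STRESS_STRINGS = {
    "combining_stack": "Z\u0324\u0354\u0367\u0311\u0343",
    "rtl_word": "\u0627\u062e\u062a\u0628\u0627\u0631",
    # Assumption: left-to-right / right-to-left override controls.
    "bidi_overrides": "a\u202db\u202ec\u202dd",
    "long_ligature": "\ufdfd",
    "skin_tone": "\U0001F471\U0001F3FD",
    "zwj_family": "\U0001F469\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466",
    "flag": "\U0001F1F5\U0001F1F7",
}


def describe(text):
    """Report the sizes a renderer or length check might disagree about."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
        # Fall back to a U+XXXX label for code points without a name.
        "names": [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text],
    }


if __name__ == "__main__":
    for label, text in STRESS_STRINGS.items():
        info = describe(text)
        print(label, info["code_points"], info["utf16_units"], info["utf8_bytes"])
```

The family emoji is a useful check: it is drawn as one glyph on platforms that support it, but it is 7 code points, 11 UTF-16 code units, and 25 UTF-8 bytes, so any limit or cursor logic based on those counts will behave differently from what the user sees.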
2

There are a lot of good examples in the Big List of Naughty Strings:

https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt

I cannot include the whole file, but here are a few lines:

#   Unicode Subscript/Superscript/Accents
#
#   Strings which contain unicode subscripts/superscripts; can cause rendering issues


ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็
#   Two-Byte Characters
#
#   Strings which contain two-byte characters: can cause rendering issues or character-length issues

田中さんにあげて下さい
#   Strings which contain two-byte letters: can cause issues with naïve UTF-16 capitalizers which think that 16 bits == 1 character

𐐜 𐐔𐐇𐐝𐐀𐐡𐐇𐐓 𐐙𐐊𐐡𐐝𐐓/𐐝𐐇𐐗𐐊𐐤𐐔 𐐒𐐋𐐗 𐐒𐐌 𐐜 𐐡𐐀𐐖𐐇𐐤𐐓𐐝 𐐱𐑂 𐑄 𐐔𐐇𐐝𐐀𐐡𐐇𐐓 𐐏𐐆𐐅𐐤𐐆𐐚𐐊𐐡𐐝𐐆𐐓𐐆
#   Special Unicode Characters Union
#
#   A super string recommended by VMware Inc. Globalization Team: can effectively cause rendering issues or character-length issues to validate product globalization readiness.

表ポあA鷗ŒéB逍Üߪąñ丂㐀𠀀
#   Ogham Text
#
#   The only unicode alphabet to use a space which isn't empty but should still act like a space.

᚛ᚄᚓᚐᚋᚒᚄ ᚑᚄᚂᚑᚏᚅ᚜
᚛                 ᚜

#   iOS Vulnerabilities
#
#   Strings which crashed iMessage in various versions of iOS

Powerلُلُصّبُلُلصّبُررً ॣ ॣh ॣ ॣ冗
🏳0🌈️
జ్ఞ‌ా
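
To feed the file into automated tests, the excerpt above suggests a simple format: lines starting with `#` are comments, blank lines are separators, and every other line is one test string. The following is a rough Python sketch under that assumption. `load_naughty_strings` and `SAMPLE` are hypothetical names, and the sketch assumes LF line endings.

```python
def load_naughty_strings(text):
    """Return the test strings from blns-style text.

    Splits on "\n" only: str.splitlines() would also split on characters
    such as U+2028, which a string under test could contain.
    """
    strings = []
    for line in text.split("\n"):
        # Compare against "" rather than using strip(): a line made only of
        # Unicode spaces (e.g. U+1680 OGHAM SPACE MARK) is a valid test string.
        if line == "" or line.startswith("#"):
            continue
        strings.append(line)
    return strings


# Small stand-in for the real file contents, for illustration only.
SAMPLE = "#\tOgham Text\n#\n\n\u1680\n\n#\tLine separators\n\na\u2028b\n"

if __name__ == "__main__":
    for s in load_naughty_strings(SAMPLE):
        print([f"U+{ord(ch):04X}" for ch in s])
```

With the real file, the text would come from something like `open("blns.txt", encoding="utf-8", newline="").read()`, where the path is wherever a local copy was saved.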
1

Some others:

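  • Mirrored ("reversible") characters in right-to-left scripts. For example, parentheses are mirrored for display in Hebrew. The Unicode Character Database marks these characters with the Bidi_Mirrored property and lists their mirror pairs in BidiMirroring.txt.
  • Scripts with letter shaping: Arabic, Devanagari (used for Hindi), etc.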
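
As a rough way to find which characters in a test string are affected, Python's standard `unicodedata` module exposes both the mirrored flag and the bidirectional class of each character. The helper names below are hypothetical, and this sketch only inspects character properties; it does not run the bidirectional algorithm or do any shaping.

```python
import unicodedata


def mirrored_chars(text):
    """Characters whose glyph should be mirrored in right-to-left runs."""
    return [ch for ch in text if unicodedata.mirrored(ch)]


def bidi_classes(text):
    """Bidirectional class of each character, e.g. 'L', 'R', 'AL', 'ON'."""
    return [unicodedata.bidirectional(ch) for ch in text]


if __name__ == "__main__":
    sample = "a(\u05d0)[\u0627]"  # Latin, Hebrew alef, Arabic alef
    print(mirrored_chars(sample))
    print(bidi_classes(sample))
```

A string that mixes classes `L`, `R`, and `AL` with mirrored punctuation between them is a good candidate for a rendering test, since that is where the bracket direction has to be resolved from context.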
5
  • These sound super fascinating! Do you have any samples?
    – Ky -
    Commented Sep 9, 2020 at 16:26
  • Microsoft font development resources seem to have some good examples of script "shaping". They show examples where multiple Unicode characters get assembled into the proper shape for the script. Sorta like turning "ff" into the 'ff' ligature character, but much much more complicated. Indic: learn.microsoft.com/en-us/typography/script-development/… Arabic: learn.microsoft.com/en-us/typography/script-development/arabic
    Commented Sep 11, 2020 at 19:19
  • Reversible characters are ones that can be tricky when the context of rendering them changes between left-to-right and right-to-left script. For example, in a left-to-right script (ex. English), an opening bracket is rendered '['. But in a right-to-left script the opening bracket is rendered ']'. Within a single text line with a mixture of L2R and R2L text you have to keep track of current direction in order to draw the correct glyphs amongst the characters which can be rendered blindly (i.e. without consideration for current direction).
    Commented Sep 11, 2020 at 19:22
  • Here's an issue with reversible characters in LibreOffice - including some test text strings: ask.libreoffice.org/en/question/18912/…
    Commented Sep 11, 2020 at 19:39
  • Those are very insightful indeed! I tried to edit this answer to include some, but it really didn't want me doing that 😜 - if you ever find a way to make that happen, StackOverflow prefers that, so content isn't lost if links rot away
    – Ky -
    Commented Sep 14, 2020 at 21:48
