24

I recently wrote a regex for my PHP code that allows only letters (including special characters) and spaces: /^[\s\p{L}]+$/u. Now I'm having trouble converting it into a JavaScript-compatible regex. The problem is the /u modifier at the end of the pattern, because JavaScript doesn't allow that flag.

How can I rewrite this so that it works in JavaScript as well?

Is there something to allow only Polish characters: Ł, Ą, Ś, Ć, ...?

3
  • 3
    Perhaps this answer will be helpful here.
    – Lix
    Commented Oct 15, 2012 at 13:49
  • 1
    Are you sure you need the u flag? Have you tried removing it and testing the expression?
    – cammil
    Commented Oct 15, 2012 at 13:52
  • 1
    @cammil "u" is required so the "\p{L}" is recognized as checking for UTF-8 letters.
    – Matt S
    Commented Oct 15, 2012 at 13:55

3 Answers

20

The /u modifier enables Unicode support. Support for it was added to JavaScript in ES2015.

Read http://stackoverflow.com/questions/280712/javascript-unicode to learn more about Unicode in regexes with JavaScript.
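As a rough sketch, assuming a runtime that also implements Unicode property escapes (these arrived in ES2018, later than the /u flag itself), the PHP pattern from the question can be used unchanged. The helper name isLettersOnly is made up for illustration.

```javascript
// Assumes an engine with ES2018 Unicode property escapes
// (for example a current Node.js or browser release).
// \p{L} is only recognized when the u flag is present.
const lettersAndSpaces = /^[\s\p{L}]+$/u;

// Hypothetical helper: true when the text is non-empty and
// consists only of Unicode letters and whitespace.
function isLettersOnly(text) {
  return lettersAndSpaces.test(text);
}

console.log(isLettersOnly("Łódź Kraków")); // true
console.log(isLettersOnly("abc123"));      // false
```

On older engines without property escapes, this regex literal throws a SyntaxError when parsed, so an explicit character class remains the fallback.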


Polish characters:

Ą \u0104
Ć \u0106
Ę \u0118
Ł \u0141
Ń \u0143
Ó \u00D3
Ś \u015A
Ź \u0179
Ż \u017B
ą \u0105
ć \u0107
ę \u0119
ł \u0142
ń \u0144
ó \u00F3
ś \u015B
ź \u017A
ż \u017C

All special Polish characters:

[\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]
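A minimal sketch of how this class could be used for the validation in the question. Combining it with \s and a-zA-Z is an assumption based on the question's requirement (letters plus spaces), and isPolishText is a hypothetical helper name.

```javascript
// ASCII letters, whitespace, and the 18 Polish letters with diacritics,
// written as \u escapes so the source file encoding does not matter.
const polishLetters =
  /^[\sa-zA-Z\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]+$/;

// Hypothetical helper: true when the text is non-empty and contains
// only whitespace, ASCII letters, and Polish letters.
function isPolishText(text) {
  return polishLetters.test(text);
}

console.log(isPolishText("Zażółć gęślą jaźń")); // true
console.log(isPolishText("Grüße"));             // false (ü and ß are not in the class)
```

This form needs no /u flag, so it also works in engines that predate ES2015.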
6
  • 1
    One might argue that the modifier isn't needed in any language/environment that properly handles Unicode instead of a mishmash of binary data and actual Unicode text in strings such as PHP.
    – Joey
    Commented Oct 15, 2012 at 14:02
  • @Joey - The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.
    – Ωmega
    Commented Oct 15, 2012 at 14:04
  • @Scott - The Polish language uses the Latin script, so go with these ranges: [\u0000-\u007F] = Basic Latin; [\u0080-\u00FF] = Latin-1 Supplement; [\u0100-\u017F] = Latin Extended-A; [\u0180-\u024F] = Latin Extended-B; ... which together give [\u0000-\u024F] to include all Latin characters :)
    – Ωmega
    Commented Oct 15, 2012 at 14:07
  • 1
    Ωmega, I know why the flag is needed in PCRE and fundamentally it's the problem that PHP doesn't have a defined character set for strings, leading to some strings being in some legacy character set, some in UTF-8, some storing even non-text binary data. Environments such as Java or .NET have it far easier in that regard, given that text is always Unicode.
    – Joey
    Commented Oct 15, 2012 at 14:15
  • 2
    This answer is one of the first results on Google when searching for "regex u flag", so you might want to update it with a preface stating that it has been defined in ES2015 and is now supported by most recent browsers :)
    – Aaron
    Commented Aug 25, 2016 at 20:44
6

JavaScript doesn't have any notion of UTF-8 strings, so it's unlikely that you need the /u flag. (Your strings are probably already in the usual JavaScript form, one UTF-16 code-unit per "character".)

The bigger problem is that JavaScript doesn't support \p{L}, nor any equivalent notation; JavaScript regexes have no awareness of Unicode character properties. See the answers to this StackOverflow question for some ways to approximate it.


Edited to add: If you only need to support Polish letters, then you can write /^[\sa-zA-ZĄĆĘŁŃÓŚŹŻąćęłńóśźż]+$/. The a-z and A-Z parts cover the ASCII letters, and then the remaining letters are listed out individually.

6
  • Bad news... so maybe there is something to allow only those Polish characters: Ł, Ą, Ś, Ć, Ę instead?
    – Scott
    Commented Oct 15, 2012 at 13:57
  • Scott, if you have a small set of characters you want to allow you can always use a character class.
    – Joey
    Commented Oct 15, 2012 at 14:03
  • @Joey Yeah, in general I would like to additionally allow only the special characters I mentioned above.
    – Scott
    Commented Oct 15, 2012 at 14:09
  • In a JavaScript regexp you can refer to Unicode characters like this: \u0161. For example, this will allow only printable ASCII and ć: var newtxt = txt.replace(/[^\u0107\u0020-\u007e]/g, ''). You can find the Unicode code points for your characters here, for example: fileformat.info/info/unicode/char/107/index.htm
    – DamirR
    Commented Oct 15, 2012 at 14:36
  • 1
    @ruakh: Life is full of bizarre moments. :) For /Ć/ to work you MUST save the JS file in UTF-8. Sometimes other people might use, change, and save your code, and they might use another encoding (e.g. ISO-8859-1). Then /Ć/ will not be saved correctly and the script will not work. If you use /\u0106/ instead, that kind of bug is avoided.
    – DamirR
    Commented Oct 28, 2012 at 13:35
1

As of ES2015, /u is supported in JavaScript. See:

3
  • It's currently not supported by all browsers.
    – Poul Bak
    Commented Dec 3, 2018 at 4:11
  • @PoulBak It says on the Mozilla docs it's supported by all major browsers, unless they got it wrong.
    Commented Dec 8, 2018 at 18:47
  • Some versions of Edge will simply crash if you use it, but I think that has been fixed, so you're probably right (no one uses IE any more).
    – Poul Bak
    Commented Dec 8, 2018 at 20:25
