19

In C# code, I am trying to validate a string that contains Chinese characters: "中文ABC123".

When I use a general alphanumeric pattern, "^[a-zA-Z0-9\s]+$", it doesn't match "中文ABC123" and the regex validation fails.

What do I need to add to the expression in C#?

2 Answers

44

To match any letter character from any language use:

\p{L}

If you also want to match numbers:

[\p{L}\p{Nd}]+

\p{L} ... matches a character in the Unicode category "letter".
                It is the short form for [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}]
                  \p{Ll} ... matches lowercase letters. (abc)
                  \p{Lu} ... matches uppercase letters. (ABC)
                  \p{Lt} ... matches titlecase letters.
                  \p{Lm} ... matches modifier letters.
                  \p{Lo} ... matches letters without case. (中文)

\p{Nd} ... matches a character in the Unicode category "decimal digit".

Just replace: ^[a-zA-Z0-9\s]+$ with ^[\p{L}0-9\s]+$
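
As a minimal sketch of that replacement in C# (written as top-level statements, so it assumes C# 9 or later; the variable names are only illustrative), the two patterns can be compared on the string from the question:

```csharp
using System;
using System.Text.RegularExpressions;

// Original pattern: ASCII letters, ASCII digits, and whitespace only.
var asciiOnly = new Regex(@"^[a-zA-Z0-9\s]+$");

// Pattern from this answer: any Unicode letter, ASCII digits, and whitespace.
var unicodeLetters = new Regex(@"^[\p{L}0-9\s]+$");

string input = "中文ABC123";

// 中 and 文 are not in a-z or A-Z, so the ASCII-only pattern rejects the string.
bool asciiResult = asciiOnly.IsMatch(input);

// 中 and 文 are in category Lo, which \p{L} includes, so this pattern accepts it.
bool unicodeResult = unicodeLetters.IsMatch(input);

Console.WriteLine($"ASCII-only pattern: {asciiResult}");
Console.WriteLine($"Unicode pattern:    {unicodeResult}");
```

The first line should report False and the second True. If digits from other scripts should be accepted as well, 0-9 can be swapped for \p{Nd} as described above.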

6
  • Or, if punctuation is OK, the simpler \w (word character) can be used instead of [\p{L}0-9].
    – bzlm
    Commented Jan 26, 2015 at 19:33
  • 1
    By the way, Andie2302, this conflicts badly with the HTML5 pattern attribute: I used this expression in an HTML5 pattern attribute and it failed to validate. Do you have any idea how to make it work with the HTML5 pattern attribute for all languages? Commented Jan 26, 2015 at 20:57
  • 6
    @user2683269 JavaScript (and hence html5 input patterns) doesn't support \p, and treats \w as "latin word character", so it's trickier there: stackoverflow.com/a/22075070/7724
    – bzlm
    Commented Jan 26, 2015 at 21:17
  • Besides Chinese and Japanese characters, what other languages might \p{Lo} capture? Commented Oct 18, 2017 at 15:06
  • 2
    @bzlm a bit more info on \w in .NET: stackoverflow.com/a/2998550/2246411 (note that \w does not work for all languages if using ECMAScript-compliant behavior) Commented May 19, 2019 at 16:26
3

Thanks to @Andie2302 for pointing to the right way to do it.

In addition, many languages have combining characters ('marks') that need a base character to attach to. For example, if you match the Thai word 'เก็บ' with only \p{L}, you get just 'เกบ'; you can see that a mark is missing from the word.

That's why \p{L} alone will not work for every language.

So, to support almost all languages, allow marks as well as letters in the character class:

[\p{L}\p{M}]

NOTE:

L stands for 'Letter' (all letters from all languages, not including marks)

M stands for 'Mark' (a mark cannot be displayed alone; it needs a letter to attach to)

If you also need numbers, add the code below:

\p{N}

NOTE:

N stands for 'Number' (any kind of numeric character, not only decimal digits)
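
A small C# sketch of the difference, assuming the Thai word in question is 'เก็บ' (spelled out with \u escapes so the combining mark is visible in the source). It uses top-level statements, so it assumes C# 9 or later, and the variable names are only illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

// "เก็บ" as code points: U+0E40 (letter), U+0E01 (letter),
// U+0E47 (combining mark, category Mn), U+0E1A (letter).
string thai = "\u0E40\u0E01\u0E47\u0E1A";

var lettersOnly = new Regex(@"^\p{L}+$");
var lettersAndMarks = new Regex(@"^[\p{L}\p{M}]+$");
var lettersMarksNumbers = new Regex(@"^[\p{L}\p{M}\p{N}\s]+$");

// U+0E47 is a mark, not a letter, so \p{L} alone rejects the word.
bool lettersOnlyOk = lettersOnly.IsMatch(thai);

// Allowing \p{M} in the character class accepts the whole word.
bool lettersAndMarksOk = lettersAndMarks.IsMatch(thai);

// Adding \p{N} (and \s for the space) also accepts digits.
bool withNumbersOk = lettersMarksNumbers.IsMatch(thai + " 123");

// Keeping only \p{L} characters drops the mark, leaving "เกบ".
string stripped = Regex.Replace(thai, @"[^\p{L}]", "");

Console.WriteLine($"\\p{{L}} only:         {lettersOnlyOk}");
Console.WriteLine($"\\p{{L}} and \\p{{M}}:    {lettersAndMarksOk}");
Console.WriteLine($"with \\p{{N}} and \\s:   {withNumbersOk}");
Console.WriteLine($"stripped length:    {stripped.Length}");
```

The character class form matters here: written without brackets, \p{L}\p{M} would match exactly one letter followed by exactly one mark rather than any mix of the two.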


Thanks to this website for the very useful information:

https://www.regular-expressions.info/unicode.html
