Foreign language characters in Regular expression in C#

Question

In C# code, I am trying to validate a string that contains Chinese characters: "中文ABC123".

When I use a general alphanumeric pattern, "^[a-zA-Z0-9\s]+$", it does not match "中文ABC123" and the regex validation fails.

What other expressions do I need to add for C#?

Andie2302 · Accepted Answer · 2015-01-26 21:40:35Z

44

To match any letter character from any language use:

\p{L}

If you also want to match numbers:

[\p{L}\p{Nd}]+

\p{L} ... matches a character of the Unicode category "letter".
          It is the short form for [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}]
            \p{Ll} ... matches lowercase letters. (abc)
            \p{Lu} ... matches uppercase letters. (ABC)
            \p{Lt} ... matches titlecase letters.
            \p{Lm} ... matches modifier letters.
            \p{Lo} ... matches letters without case. (中文)

\p{Nd} ... matches a character of the Unicode category "decimal digit".

Just replace: ^[a-zA-Z0-9\s]+$ with ^[\p{L}0-9\s]+$
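
As a minimal sketch of that replacement, assuming the validation is done with Regex.IsMatch: the class and method names below (InputValidator, IsAsciiAlnum, IsAnyLetterAlnum) are hypothetical and only serve to compare the two patterns side by side.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputValidator
{
    // Pattern from the question: ASCII letters, ASCII digits, whitespace.
    private static readonly Regex AsciiOnly = new Regex(@"^[a-zA-Z0-9\s]+$");

    // Replacement from the answer: any Unicode letter, ASCII digits, whitespace.
    private static readonly Regex AnyLetter = new Regex(@"^[\p{L}0-9\s]+$");

    public static bool IsAsciiAlnum(string input) => AsciiOnly.IsMatch(input);

    public static bool IsAnyLetterAlnum(string input) => AnyLetter.IsMatch(input);
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(InputValidator.IsAsciiAlnum("中文ABC123"));     // False
        Console.WriteLine(InputValidator.IsAnyLetterAlnum("中文ABC123")); // True
        Console.WriteLine(InputValidator.IsAnyLetterAlnum("ABC-123"));   // False: '-' is not in the class
    }
}
```

Swapping 0-9 for \p{Nd} in the second pattern would also accept decimal digits from other scripts, which may or may not be wanted for a given input field.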

edited Jan 26, 2015 at 21:40

answered Jan 26, 2015 at 18:55


Or, if punctuation is OK, the simpler \w (word character) can be used instead of [\p{L}0-9].
– bzlm
Commented Jan 26, 2015 at 19:33
1

By the way Andie2302, this conflicts badly with the HTML5 pattern attribute: I tried this expression there and it failed to validate. Do you have any idea how to make it work with the HTML5 pattern attribute for all languages?
– user2683269
Commented Jan 26, 2015 at 20:57
6

@user2683269 JavaScript (and hence html5 input patterns) doesn't support \p, and treats \w as "latin word character", so it's trickier there: stackoverflow.com/a/22075070/7724
– bzlm
Commented Jan 26, 2015 at 21:17
Besides Chinese and Japanese characters, what other languages might \p{Lo} capture?
– Yoav Feuerstein
Commented Oct 18, 2017 at 15:06
2

@bzlm a bit more info on \w in .NET: stackoverflow.com/a/2998550/2246411 (note that \w does not work for all languages if using ECMAScript-compliant behavior)
– derekantrican
Commented May 19, 2019 at 16:26


Sruit A.Suk · Accepted Answer · 2019-06-14 18:55:04Z

Thanks to @Andie2302 for pointing to the right way to do it.

In addition, many languages have combining characters that can only be written on top of a base character. For example, for the Thai word 'เก็บ', \p{L} alone matches only 'เกบ'; the mark in the middle of the word is left out.

That's why \p{L} alone will not work for all languages.

So, to support almost all languages, put both categories in a character class:

[\p{L}\p{M}]

NOTE:

L stands for 'Letter' (all letters from all languages, but not the marks)

M stands for 'Mark' (a mark cannot be displayed alone; it needs a letter to attach to)
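
A small sketch of the difference, assuming both categories are placed in one character class, [\p{L}\p{M}]. The class and method names (MarkAwareMatcher and its members) are hypothetical. The Thai word is written with escapes so the combining mark U+0E47 stays visible in the source.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkAwareMatcher
{
    // The Thai word from the example above: U+0E40, U+0E01, U+0E47, U+0E1A.
    // U+0E47 is a nonspacing mark (category Mn); the other three are letters (Lo).
    public const string ThaiWord = "\u0E40\u0E01\u0E47\u0E1A";

    // Letters only: fails as soon as the input contains a combining mark.
    public static bool LettersOnly(string s) => Regex.IsMatch(s, @"^\p{L}+$");

    // Letters and marks: accepts base letters together with their combining marks.
    public static bool LettersAndMarks(string s) => Regex.IsMatch(s, @"^[\p{L}\p{M}]+$");

    // Removes everything that is not a letter, which silently drops the mark.
    public static string KeepLettersOnly(string s) => Regex.Replace(s, @"[^\p{L}]", "");
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(MarkAwareMatcher.LettersOnly(MarkAwareMatcher.ThaiWord));     // False
        Console.WriteLine(MarkAwareMatcher.LettersAndMarks(MarkAwareMatcher.ThaiWord)); // True
        // Three characters remain; the mark U+0E47 is gone.
        Console.WriteLine(MarkAwareMatcher.KeepLettersOnly(MarkAwareMatcher.ThaiWord).Length); // 3
    }
}
```

Note that the class accepts a mark anywhere in the input, including at the very start; it does not check that each mark follows a base letter.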

If you also need numbers, use the code below

\p{N}

NOTE:

N stands for 'Number' (any kind of numeric character, not only the decimal digits matched by \p{Nd})
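
To illustrate, here is a sketch that combines letters, marks, numbers and whitespace in one validation pattern and contrasts \p{N} with the narrower \p{Nd}. The names (UnicodeTextValidator and its members) are hypothetical, and allowing whitespace is an assumption carried over from the pattern in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeTextValidator
{
    // Letters, combining marks, any numeric character, and whitespace.
    private static readonly Regex Pattern = new Regex(@"^[\p{L}\p{M}\p{N}\s]+$");

    public static bool IsValid(string input) => Pattern.IsMatch(input);

    // \p{Nd}: decimal digits only (in any script).
    public static bool IsDecimalDigits(string s) => Regex.IsMatch(s, @"^\p{Nd}+$");

    // \p{N}: decimal digits (Nd), letter numbers (Nl) and other numbers (No).
    public static bool IsAnyNumber(string s) => Regex.IsMatch(s, @"^\p{N}+$");
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(UnicodeTextValidator.IsValid("中文ABC123"));               // True
        Console.WriteLine(UnicodeTextValidator.IsValid("\u0E40\u0E01\u0E47\u0E1A 42")); // True
        Console.WriteLine(UnicodeTextValidator.IsValid("ABC_123"));                 // False: '_' is punctuation

        // U+0663 is ARABIC-INDIC DIGIT THREE (Nd); U+00BD is the fraction one half (No).
        Console.WriteLine(UnicodeTextValidator.IsDecimalDigits("\u0663")); // True
        Console.WriteLine(UnicodeTextValidator.IsDecimalDigits("\u00BD")); // False
        Console.WriteLine(UnicodeTextValidator.IsAnyNumber("\u00BD"));     // True
    }
}
```

If fractions, Roman numerals and similar characters should be rejected, \p{Nd} (or plain 0-9 for ASCII digits only) is the stricter choice.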

Thanks to this website for the very useful information:

https://www.regular-expressions.info/unicode.html
