The Unicode Blog: regex

Wednesday, February 9, 2022

Enhancements to Unicode Regular Expressions

A new revision of UTS #18, Unicode Regular Expressions, is now available.

Regular expressions are a key tool in software development. Back in 2000, few regular expression engines supported Unicode, even at a basic level. UTS #18 set out to raise the bar, describing how regular expression engines could be adapted to deal with Unicode correctly and completely. Since that time, major programming languages and libraries have adopted Level 1 features (supporting all Unicode literals, basic character properties, subtraction, intersection, ...), and some have also adopted certain Level 2 features (full character properties, grapheme clusters, ...).

This release focuses on handling the complement of properties of strings. It distinguishes code point complement from full complement, explicitly defines the complement operator [^...] as code point complement, and gives the reasons for doing so in an annex. It also outlines the important difference between [A--B] and [A&&[^B]], setting out why the latter is insufficient to represent set difference.
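The difference between [A--B] and [A&&[^B]] can be illustrated with a toy model. The sketch below is not taken from UTS #18: it represents a character class as a Python set of strings, uses a tiny hypothetical `UNIVERSE` in place of the full set of code points, and the helper names `code_point_complement` and `difference` are made up for illustration. It assumes that code point complement yields only single code points, so intersecting with it discards every multi-code-point string.

```python
# Toy model: a character class is a Python set of strings, where a
# one-character string stands for a single code point.
# UNIVERSE is a tiny stand-in for the set of all code points (an assumption
# made to keep the example small).
UNIVERSE = set("abcdefgh")


def code_point_complement(cls):
    """Model of [^...]: code points of UNIVERSE not in cls; never holds strings."""
    return {cp for cp in UNIVERSE if cp not in cls}


def difference(a, b):
    """Model of [A--B]: every element of A, code point or string, not in B."""
    return a - b


A = {"a", "b", "ch"}  # two code points plus the two-code-point string "ch"
B = {"b"}

set_difference = difference(A, B)              # keeps the string "ch"
via_complement = A & code_point_complement(B)  # loses the string "ch"

print(sorted(set_difference))
print(sorted(via_complement))
```

In this model the intersection form silently drops "ch" even though "ch" is not in B, which is the kind of mismatch that makes it unsuitable as a stand-in for set difference.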

For the EBNF in general, and for character classes with strings in particular, examples were added and the text clarified. A new annex provides examples for how character classes can be parsed.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Wednesday, February 24, 2021

Enhancements to Unicode Regular Expressions

A Proposed Update of UTS #18, Unicode Regular Expressions, is now available for review and feedback.

Regular expressions are a key tool in software development. Back in 2000, few regular expression engines supported Unicode, even at a basic level. UTS #18 set out to raise the bar, describing how regular expression engines could be adapted to deal with Unicode correctly and completely. Since that time, major programming languages and libraries have adopted Level 1 features (supporting all Unicode literals, basic character properties, subtraction, intersection, ...), and some have also adopted certain Level 2 features (full character properties, grapheme clusters, ...).

A major enhancement to UTS #18 in 2020 focused on the addition of Character Classes with strings. The initial impetus for this was to handle emoji effectively in browsers, as most emoji consist of more than one code point. Supporting strings directly in character classes frees programs from having to download large amounts of data or handle complicated syntax. Using a property like RGI_Emoji allows a regular expression to match both individual code points such as "😁" and multi-code-point strings such as "🇫🇷". This extension to strings is also important for internationalization. For example, the alphabets used by many languages contain multi-code-point strings, so this extension allows them to be handled easily.
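For engines without native support, a character class with strings can be approximated by an alternation that tries longer strings before single code points. The following is a minimal sketch using Python's standard `re` module, which has no `\q{...}` or `\p{...}` syntax of its own; the helper name `class_with_strings` and the sample alphabet are hypothetical.

```python
import re


def class_with_strings(code_points, strings):
    """Build a regex that behaves roughly like a class such as [chl\\q{ch|ll}].

    Longer strings are listed first so that "ch" is preferred over "c".
    """
    alternatives = sorted(strings, key=len, reverse=True)
    parts = [re.escape(s) for s in alternatives]
    if code_points:
        # Single code points go into an ordinary character class.
        parts.append("[" + "".join(re.escape(c) for c in code_points) + "]")
    return re.compile("(?:" + "|".join(parts) + ")")


# A toy alphabet in which "ch" and "ll" count as single letters.
pattern = class_with_strings("chl", ["ch", "ll"])
tokens = pattern.findall("chllc")
print(tokens)
```

Here "chllc" splits into three letters ("ch", "ll", "c") rather than five code points, which is the behavior that makes strings in character classes useful for alphabets with multi-code-point letters.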

Additional enhancements are in progress this year, based on working with members of the ECMAScript committee, including more clarifications, better guidance on implementation, and addressing some tricky issues dealing with complementing (inverting) Character Classes. The end goal of all of these enhancements in 2020 and 2021 is to significantly raise the level of Unicode support in programming languages and libraries.

For more information, see https://www.unicode.org/review/pri427/.

Thursday, June 18, 2020

Unicode Regular Expressions v21 Released

Regular expressions are a powerful tool for using patterns to search and modify text, and are vital in many programs, programming languages, databases, and spreadsheets.

Starting in 1999, UTS #18: Unicode Regular Expressions has supplied guidelines and conformance levels for supporting Unicode in regular expressions. The new version 21 broadens the scope of properties for regular expressions (regex) to allow for properties of strings (such as for emoji sequences). For example, the following matches all emoji flags except the French flag:

/[\p{RGI_Emoji_Flag_Sequence}--\q{🇫🇷}]/
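Python's standard `re` module does not support `\p{RGI_Emoji_Flag_Sequence}`, `--`, or `\q{...}`, so the sketch below only approximates the pattern above. It assumes a flag is any pair of regional indicator symbols (U+1F1E6 to U+1F1FF), which is broader than the actual RGI set for pairs and omits the tag-based flag sequences the property also contains. The subtraction is applied as a filter after matching. The function name `flags_except` is hypothetical.

```python
import re

# Approximation: any two consecutive regional indicator symbols.
REGIONAL_INDICATOR_PAIR = re.compile("[\U0001F1E6-\U0001F1FF]{2}")

FRANCE = "\U0001F1EB\U0001F1F7"   # regional indicators F + R
GERMANY = "\U0001F1E9\U0001F1EA"  # regional indicators D + E
JAPAN = "\U0001F1EF\U0001F1F5"    # regional indicators J + P


def flags_except(text, excluded):
    """Return flag-like pairs found in text, minus the excluded strings."""
    return [m for m in REGIONAL_INDICATOR_PAIR.findall(text) if m not in excluded]


sample = FRANCE + " " + GERMANY + " " + JAPAN
result = flags_except(sample, {FRANCE})
print(result)
```

Filtering after the match, rather than using a negative lookahead, keeps the pairs aligned: a lookahead that rejects the French flag at one position could otherwise let the engine pair its second indicator with the first indicator of an adjacent flag.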

Among the improvements are:

Provides a new Annex D: Resolving Character Classes with Strings for handling negations of sets of strings.
Updates the full property list to include the latest UCD properties, plus Emoji properties and UTS #39 properties.
Removes obsolete text passages, and makes editorial changes for clarity.

Thursday, November 21, 2019

Call for feedback on UTS #18: Unicode Regular Expressions

Regular expressions are a powerful tool for using patterns to search and modify text. They are a key component of many programming languages, databases, and spreadsheets.

Starting in 1999, UTS #18: Unicode Regular Expressions has supplied guidelines and conformance levels for supporting Unicode in regular expressions. A proposed update of that specification is now available for public review and comment. The following are the main modifications in this draft:

Broadened the scope of properties to allow for properties of strings (as well as properties of code points).
Added 11 Emoji properties including RGI sets as Full Properties in Level 2.
Added other new properties as Full Properties in Level 2: Equivalent_Unified_Ideograph, Vertical_Orientation, Regional_Indicator, Indic_Positional_Category, Indic_Syllabic_Category.
Provided a draft data file with property metadata for matching and validating non-UCD properties and their values for syntax such as \p{pname=pvalue}, so that such properties can be used in the same way as UCD properties. See Annex D.

There are a number of review notes requesting feedback on these and other possible changes. In particular, the Unicode Technical Committee would appreciate feedback on the discussion of and syntax for properties of strings, and on the recommended properties to be supported at Level 2.

The review period closes on 2020-01-06. For more information on reviewing and supplying feedback, see Proposed Update UTS #18, Unicode Regular Expressions.
