Showing posts with label RGI. Show all posts

Thursday, June 18, 2020

Unicode Regular Expressions v21 Released

Regular expressions are a powerful tool for using patterns to search and modify text, and are vital in many programs, programming languages, databases, and spreadsheets.

Starting in 1999, UTS #18: Unicode Regular Expressions has supplied guidelines and conformance levels for supporting Unicode in regular expressions. Version 21 broadens the scope of properties usable in regular expressions (regex) to include properties of strings, such as emoji sequences, in addition to properties of code points. For example, the following matches all emoji flags except the French flag:

/[\p{RGI_Emoji_Flag_Sequence}--\q{🇫🇷}]/
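
For readers whose regex engine does not support properties of strings or the `--` set-difference operator, the same idea can be approximated by matching a broader set and then subtracting the excluded string. The following is a minimal sketch using only Python's standard `re` module. It assumes that any pair of regional indicator symbols can stand in for `\p{RGI_Emoji_Flag_Sequence}`, which is a superset of the real property, and the function name `flags_except_french` is purely illustrative.

```python
import re

# Assumption: any pair of regional indicator symbols (U+1F1E6..U+1F1FF)
# stands in for \p{RGI_Emoji_Flag_Sequence}. The real property is a
# curated subset of these pairs.
REGIONAL_INDICATOR_PAIR = re.compile("[\U0001F1E6-\U0001F1FF]{2}")

# The string excluded by \q{...} in the example above: the French flag.
FRENCH_FLAG = "\U0001F1EB\U0001F1F7"


def flags_except_french(text: str) -> list[str]:
    """Return flag-like sequences found in text, minus the French flag."""
    # Step 1: match the broad set (all regional indicator pairs).
    # Step 2: subtract the excluded string, mirroring the `--` operator.
    return [
        flag
        for flag in REGIONAL_INDICATOR_PAIR.findall(text)
        if flag != FRENCH_FLAG
    ]
```

Matching whole pairs first and filtering afterwards keeps the pairs aligned. A negative lookahead on its own could skip the first half of an excluded flag and then pair its second half with the next symbol.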

Version 21 also:
  • Provides a new Annex D: Resolving Character Classes with Strings, which covers negations of sets of strings.
  • Updates the full property list to include the latest UCD properties, plus Emoji properties and UTS #39 properties.
  • Removes obsolete text passages and makes editorial changes for clarity.

Thursday, November 21, 2019

Call for feedback on UTS #18: Unicode Regular Expressions

Regular expressions are a powerful tool for using patterns to search and modify text. They are a key component of many programming languages, databases, and spreadsheets.

Starting in 1999, UTS #18: Unicode Regular Expressions has supplied guidelines and conformance levels for supporting Unicode in regular expressions. A proposed update of that specification is now available for public review and comment. The following are the main modifications in this draft:
  • Broadened the scope of properties to allow for properties of strings (as well as properties of code points).
  • Added 11 Emoji properties including RGI sets as Full Properties in Level 2.
  • Added other new properties as Full Properties in Level 2: Equivalent_Unified_Ideograph, Vertical_Orientation, Regional_Indicator, Indic_Positional_Category, Indic_Syllabic_Category.
  • Provided a draft data file with property metadata for matching and validating non-UCD properties and their values for syntax such as \p{pname=pvalue}, so that such properties can be used in the same way as UCD properties. See Annex D.
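
As a rough illustration of what validating `\p{pname=pvalue}` against property metadata might look like, here is a minimal sketch. The `PROPERTY_VALUES` table is a hypothetical stand-in for the draft data file, whose actual format is not shown in this post, and the loose name matching is a simplified version of the usual rule of ignoring case, spaces, underscores, and hyphens.

```python
import re

# Hypothetical stand-in for the draft property-metadata file. Keys and
# values are stored in loose-matched form (lowercase, no separators).
PROPERTY_VALUES = {
    "verticalorientation": {"u", "r", "tu", "tr"},
    "regionalindicator": {"yes", "no"},
}

# Matches the \p{pname=pvalue} syntax and captures name and value.
PROPERTY_EXPR = re.compile(r"\\p\{([^={}]+)=([^={}]+)\}")


def loose(name: str) -> str:
    """Simplified loose matching: ignore case, whitespace, '_' and '-'."""
    return re.sub(r"[\s_-]", "", name).lower()


def is_valid_property_expr(expr: str) -> bool:
    """Check that expr is \\p{name=value} with a known name and value."""
    match = PROPERTY_EXPR.fullmatch(expr)
    if match is None:
        return False
    allowed = PROPERTY_VALUES.get(loose(match.group(1)))
    return allowed is not None and loose(match.group(2)) in allowed
```

With a table like this, an engine could reject `\p{Vertical_Orientation=X}` when the pattern is compiled instead of silently matching nothing.
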
There are a number of review notes requesting feedback on these and other possible changes. In particular, the Unicode Technical Committee would appreciate feedback on the discussion of and syntax for properties of strings, and on the recommended properties to be supported at Level 2.

The review period closes on 2020-01-06. For more information on reviewing and supplying feedback, see Proposed Update UTS #18, Unicode Regular Expressions.