Showing posts with label Collation. Show all posts

Thursday, March 19, 2015

CLDR Version 27 Released

Unicode CLDR 27 has been released, providing an update to the key building blocks for software supporting the world's languages. This data is used by all major software systems for their software internationalization and localization, adapting software to the conventions of different languages for common software tasks.

There was no Survey Tool data collection phase for CLDR 27. Instead, the release focused primarily on stability—cleaning up data inheritance and making specific fixes—as well as improvements to the JSON format of the data. Changes include the following:
  • Cleanup of region locales: A major cleanup effort was undertaken to resolve gratuitous differences between region-specific locales and the parent from which they inherit. Wherever the parent value was an acceptable replacement for a child-specific value, the child value was removed, giving more consistent behavior across the region locales. A special effort was made to clean up country names in certain locales.
  • Changes to English inheritance: As an outcome of the cleanup effort above, the inheritance model for English locales is now simplified, making all en_XX locales inherit either from “en” directly (for current or former U.S. territories) or from the British-influenced “en_001 - World English”. This is also reflected in some changes for measurement systems.
  • Emoji: Data for emoji annotations and an emoji collation were added, to accompany Unicode Technical Report #51, Unicode Emoji.
  • Collation: There are new sort orders for emoji (as noted above), and an Austrian phonebook sort order. Scripts can be reordered individually, rather than only in specific groups. Fractional tertiary weights lower than the common weight are now used, which allows shorter sort keys for normal Hiragana letters.
  • Specification: The LDML specification has descriptions of new or modified structure, plus a number of fixes and clarifications. See Modifications for a list of changes.
    • Improved documentation of locale inheritance and matching, bundle versus item lookup, and parent locale information.
    • Extensive clarifications to the intended use of the language matching data.
    • Explicit new definitions of Unicode identifiers, such as Unicode Calendar Identifier, for use in citations.
  • Charts: The navigation within charts has been improved, and new charts have been added.
  • JSON on GitHub: The JSON form of the data is now available on GitHub, rather than being found through the Data link.
Details are provided in http://cldr.unicode.org/index/downloads/cldr-27, along with a detailed Migration section.
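
The simplified English inheritance described above can be pictured as a parent lookup that first consults an explicit override table and otherwise truncates the locale ID. The following is a minimal sketch, not the CLDR implementation: the `EXPLICIT_PARENTS` table is a small hypothetical subset chosen for illustration (the real parent data ships with CLDR's supplemental data), and the function names are assumptions.

```python
from typing import List, Optional

# Hypothetical, illustrative subset of explicit parent overrides.
# The authoritative table is part of the CLDR supplemental data.
EXPLICIT_PARENTS = {
    "en_GB": "en_001",
    "en_AU": "en_001",
    "en_IN": "en_001",
}


def parent_locale(locale: str) -> Optional[str]:
    """Return the parent of a locale ID, or None for root."""
    if locale == "root":
        return None
    if locale in EXPLICIT_PARENTS:
        # Explicit override, e.g. a British-influenced English locale.
        return EXPLICIT_PARENTS[locale]
    if "_" in locale:
        # Default rule: drop the last subtag (en_PR -> en).
        return locale.rsplit("_", 1)[0]
    return "root"


def inheritance_chain(locale: str) -> List[str]:
    """List the locale followed by each ancestor up to root."""
    chain = [locale]
    parent = parent_locale(locale)
    while parent is not None:
        chain.append(parent)
        parent = parent_locale(parent)
    return chain


print(inheritance_chain("en_GB"))  # en_GB, then en_001, en, root
print(inheritance_chain("en_PR"))  # en_PR, then en, root
```

With this shape, a value missing in a regional locale is looked up in each ancestor in turn, which is why removing redundant child values leaves behavior unchanged.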

Wednesday, September 24, 2014

Proposed Update UAXes for Unicode 8.0

Proposed updates for several of the Unicode Standard Annexes for Version 8.0 of the Unicode Standard have been posted for public review. See http://www.unicode.org/review/ for details and links to the various documents.

UTS #10, Unicode Collation Algorithm has also been posted for public review. In this update, Cyrillic contractions have been removed. See the Modifications section of the draft document for further information.

Review periods for provision of feedback on these proposed updates close on October 20, 2014 for the November UTC meeting, but there will be further opportunities for feedback on the annexes after that November meeting.

To supply feedback on these issues, please see http://www.unicode.org/review/#feedback

Thursday, September 18, 2014

CLDR Version 26 Released

Unicode CLDR 26 has been released, providing an update to the key building blocks for software supporting the world's languages. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for common software tasks. This release focused primarily on Unicode 7.0 compatibility, Survey Tool improvements, increased coverage, new units, and improvements to collation and RBNF. Changes include the following:
  • Data Growth: Major increase in the number of translations, with 77 locales now reaching the 100% modern coverage level, and an overall growth of about 20% in data.
  • Units: Added 72 new units, display names for all units, and a new perUnitPattern (e.g., liters per second).
  • Collation: Updated collation (sorting) to Unicode 7.0, moved Unihan radical-stroke collation into root to avoid duplication, used import to reduce source size by 23% and ease maintenance. Major changes to Arabic collation.
  • Spell-out numbers: improvements for round-trip fidelity; new syntax for use of plural categories.
  • Specification: documented new structure, \x{h...h} syntax for Unicode code points; construction of “unit per unit” formats; clarified BCP47 and Unicode identifiers, and different kinds of locale lookup, matching, and inheritance.
  • Survey Tool: Major improvements to the UI to make it easier and faster to enter and check data.
Details are provided in http://cldr.unicode.org/index/downloads/cldr-26, along with a detailed Migration section.
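
As a rough illustration of how a perUnitPattern could combine with an ordinary unit pattern to build a “unit per unit” format, here is a minimal sketch. The pattern strings and the `format_per` helper are assumptions made for this example, not values copied from the CLDR data, and plural handling is deliberately left out.

```python
# Hypothetical pattern tables; real patterns come from the locale data
# and vary by locale, width, and plural category.
UNIT_PATTERN = {"liter": "{0} liters"}
PER_UNIT_PATTERN = {"second": "{0} per second"}


def format_per(value, unit: str, per_unit: str) -> str:
    """Format `value` of `unit` per `per_unit` using simple {0} substitution."""
    # Step 1: format the numerator with its own unit pattern.
    numerator = UNIT_PATTERN[unit].replace("{0}", str(value))
    # Step 2: wrap the formatted numerator in the per-unit pattern.
    return PER_UNIT_PATTERN[per_unit].replace("{0}", numerator)


print(format_per(3, "liter", "second"))
```

The point of the two-step design is that the denominator unit carries its own pattern, so any numerator unit can be combined with it without storing a separate compound pattern for every pair.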

Friday, December 13, 2013

Unicode 7.0 Annexes Available for Early Review

As technical work gets underway to prepare the publication of Unicode 7.0 (tentatively scheduled for June 2014), the Unicode Technical Committee has posted proposed updates for several important specifications:

PRI #260, Proposed Update UTS #10, Unicode Collation Algorithm
PRI #261, Proposed Update UAX #15, Unicode Normalization Forms
PRI #262, Proposed Update UAX #44, Unicode Character Database

In UTS #10, collation weights are discussed more generically, with fewer references to the 16-bit weights used in the DUCET. Section 6.3.2, Large Values for Secondary or Tertiary Weights was merged into Section 6.2, Large Weight Values. In UAX #44, the derivation of the Alphabetic property has been updated and the discussion of @missing in Section 4.2.10 @missing Conventions has been simplified to reflect the revised conventions in the UCD data files, which eliminated special edge cases.

Review periods for these new public review issues close January 27, 2014. For details about reviewing and commenting, please see the Public Review Issues page.

http://unicode-inc.blogspot.com/2013/12/unicode-70-annexes-available-for-early.html

Monday, December 10, 2012

Unicode Collation Proposed Update

The Unicode Collation Algorithm (UCA) data is being modified to make all digits with the same numeric value sort the same, whether they are European (ASCII), Arabic, Devanagari, or others. In addition, the format of the main data table has changed to omit the (unused) 4th level weight, and some data tables are moved to the Unicode CLDR project.

These and other changes are in the new proposed update: see PRI 235. For the exact list of modifications, see Modifications.
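
To make the digit change concrete, the sketch below builds a toy sort key in which decimal digits from any script are keyed by their numeric value. It is only an illustration of the idea, not the UCA: it models a single comparison level, the `digit_value_key` name is an assumption, and non-digit characters simply fall back to code point order.

```python
import unicodedata


def digit_value_key(text: str):
    """Toy sort key: decimal digits compare by numeric value, whatever the script."""
    key = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        if value is not None:
            # Any decimal digit: keyed by its numeric value only.
            key.append((0, value))
        else:
            # Everything else: code point order (a simplification).
            key.append((1, ord(ch)))
    return tuple(key)


# ASCII 3, Arabic-Indic 3 (U+0663), and Devanagari 3 (U+0969) get the same key.
same = (
    digit_value_key("3")
    == digit_value_key("\u0663")
    == digit_value_key("\u0969")
)
print(same)

# Mixed-script digits sort by value: 2, then 3, then 4.
print(sorted(["\u0969", "2", "\u0664"], key=digit_value_key))
```

A full collator would still distinguish these characters at a later comparison level; the sketch only shows the grouping by numeric value.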

Tuesday, June 26, 2012

Proposed updates for Unicode Collation and IDNA

The proposed update of UTS #10, Unicode Collation Algorithm (UCA), modifies the specification for certain edge cases (overlapping contractions) and tightens the requirements for well-formed collation element tables. The detailed descriptions of parametric tailoring options have been removed, and the text now refers to the corresponding section in LDML. That section adds new explanations and definitions. There are a number of improvements, including additional examples and some rearrangement of text. See PRI #223.

The data has been updated for the Unicode 6.2 beta review, and the associated CollationAuxiliary.txt file in CollationAuxiliary.zip now includes a description of the implicit fractional weight generation and the context syntax. For more details, see Modifications.

There is also a proposed update of UTS #46, Unicode IDNA Compatibility Processing. The data has been updated for the Unicode 6.2 beta review, with minor changes to the text. See PRI #224.