The Unicode Blog: LDML


Tuesday, May 16, 2023

LDML (UTS#35) Part 7: Keyboards

CLDR-TC has authorized a new Public Review Issue, #476, for a major revision of LDML (UTS#35) Part 7: Keyboards. CLDR-TC and CLDR Keyboard-SC would appreciate feedback on whether there are specific changes or enhancements that should be made in the proposed specification.

Today, every platform must independently evaluate, prioritize, and implement all new or updated keyboard layouts, leading to major inconsistencies and delays especially where digitally disadvantaged languages are concerned. Consequently, language communities and other keyboard authors must see their designs developed independently for every platform/operating system, resulting in unnecessary duplication of technical and organizational effort.

“Keyboard 3.0” is designed from the ground up to be usable as a solution to support both hardware and on-screen (touch) layouts for all platforms in a single source file for each language.

With Keyboard 3.0, leading members of the language communities will be able to submit their layout once to CLDR, and it will be available to all platforms as part of the latest version of CLDR, making adoption much easier for platforms. Platform vendors will not need to develop and maintain their own keyboard layout data, especially for languages that they don’t yet support.

This work contributes to the goals of the United Nations International Decade of Indigenous Languages by improving the path for Digitally Disadvantaged Language communities to develop platform support for their languages. Users should see improvements in consistency between platforms, as layouts can be shared.

Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Thursday, October 28, 2021

Unicode CLDR v40 now available!

Unicode CLDR version 40 is now available, with approximately 140,000 new or modified data fields.

In this release, the focus is on:

Grammatical features (gender and case)

In many languages, forming grammatical phrases requires dealing with grammatical gender and case. Without that, it can sound as bad as "on top of 3 hours" instead of "in 3 hours". The overall goal for CLDR is to supply building blocks so that implementations of advanced message formatting can handle gender and case.

Phase 1 (v39) of grammatical features included just 12 locales (da, de, es, fr, hi, it, nl, no, pl, pt, ru, sv) for all units of measurement.
Phase 2 (v40) has expanded the number of locales by 29 (am, ar, bn, ca, cs, el, fi, gu, he, hr, hu, hy, is, kn, lt, lv, ml, mr, nb, pa, ro, si, sk, sl, sr, ta, te, uk, ur), but for a more restricted number of units.
Phase 3 (v41) will further expand the units.

Emoji v14 names and search keywords

CLDR supplies short names and search keywords for the new emoji, so that implementations can build on them to provide, for example, type-ahead in keyboards.
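As a rough illustration of how an implementation might build type-ahead on top of short names and search keywords, here is a minimal Python sketch. The `ANNOTATIONS` table is a hand-written, hypothetical sample (it is not copied from the CLDR data files), and `build_index` and `type_ahead` are illustrative names, not part of any CLDR API.

```python
from collections import defaultdict

# Hypothetical sample in the spirit of CLDR annotations: one short name and a
# list of search keywords per emoji. The entries are illustrative only.
ANNOTATIONS = {
    "🫠": {"name": "melting face", "keywords": ["disappear", "dissolve", "liquid", "melt"]},
    "🫡": {"name": "saluting face", "keywords": ["ok", "salute", "troops", "yes"]},
    "🪩": {"name": "mirror ball", "keywords": ["dance", "disco", "glitter", "party"]},
}


def build_index(annotations):
    """Map every keyword, and every word of the short name, to the emoji using it."""
    index = defaultdict(set)
    for emoji, entry in annotations.items():
        terms = entry["keywords"] + entry["name"].split()
        for term in terms:
            index[term.lower()].add(emoji)
    return index


def type_ahead(index, prefix):
    """Return the emoji that have at least one indexed term starting with prefix."""
    prefix = prefix.lower()
    matches = set()
    for term, emojis in index.items():
        if term.startswith(prefix):
            matches |= emojis
    return sorted(matches)


INDEX = build_index(ANNOTATIONS)
print(type_ahead(INDEX, "di"))     # terms such as "disco" and "dissolve" match
print(type_ahead(INDEX, "salut"))  # matches via "salute" and "saluting"
```

A linear scan over the index is enough for a sketch; a real keyboard would more likely use a trie or a sorted term list so that each keystroke does not rescan every term.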

Modernized Survey Tool front end

The Survey Tool is used to gather all the data for locales. The outmoded Javascript infrastructure was modernized to make it easier to add enhancements (such as the split-screen dashboard) and to fix bugs.

Specification Improvements

The LDML specification has some important fixes and clarifications for Locale Identifiers, Dates, and Units of Measurement.

Please see the CLDR v40 Release Note for details.

Unicode CLDR provides key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Friday, October 9, 2020

Unicode CLDR Locale Data v38 beta available for testing

The beta version of Unicode CLDR version 38 is now available. The data will not be changed except for showstoppers, but the LDML v38 spec can still be changed. The final release of v38 is planned for October 28, 2020. If you find any problems, please file a ticket.

Unicode CLDR provides an update to the key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

CLDR v38 includes:

Enhancements to existing locale data: adding support for units of measurement in inflected languages (phase 1), adding annotations (names and search keywords) for Unicode symbols that are non-emoji (~400), and annotations for Emoji v13.1.
Survey Tool upgrades: substantial performance improvements, plus structured forum entries to improve coordination among translators.

LDML v38 includes:

To make the canonicalization of locale identifiers clear and unambiguous, the part of the specification that covers it was substantially restructured. (This was done in concert with fixes to the alias data to work better with the specification.)
To support inflected units of measurement:
- minimalPairs adds new elements
  caseMinimalPairs and genderMinimalPairs
- unit adds a new element gender
- grammaticalData adds new elements
  grammaticalDerivations, deriveCompound, and deriveComponent
- unitPattern adds a new attribute case
- grammaticalCase, grammaticalGender, grammaticalDefiniteness add a new attribute scope
- compoundUnitPattern1 adds new attributes case and gender
- compoundUnitPattern adds a new attribute case
To allow for overriding dictionary-based segmentation breaks, added the Unicode Dictionary Break Exclusion Identifier, with the new key “dx”.
For picking the correct units of measurement for locales, defined the userPreferences skeleton more precisely.
For accurate plural categories in compact numbers, added the 'c' operand to plural rules to provide formatting for languages such as French.
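To show where the new “dx” key sits in a locale identifier, the following Python sketch pulls the keywords out of the `-u-` extension of a BCP 47 style identifier. It is deliberately simplified: it does no validation and ignores attributes. The function name and the sample identifier (including the choice of script codes as the values of "dx") are illustrative assumptions, not taken from the specification text.

```python
def unicode_extension_keywords(locale_id):
    """Return {key: [values, ...]} for the -u- extension of a locale identifier.

    Simplified sketch: no validation, and attributes (subtags that appear
    before the first key) are ignored.
    """
    subtags = locale_id.lower().replace("_", "-").split("-")
    keywords = {}
    key = None
    in_u = False
    for subtag in subtags[1:]:        # subtags[0] is the language subtag
        if len(subtag) == 1:          # a singleton starts a new extension
            if subtag == "x":         # private use: the rest is opaque
                break
            in_u = subtag == "u"
            key = None
            continue
        if not in_u:
            continue
        if len(subtag) == 2:          # two-character subtags are keys
            key = subtag
            keywords[key] = []
        elif key is not None:         # longer subtags are values of the current key
            keywords[key].append(subtag)
    return keywords


# Illustrative identifier: English with two scripts listed under "dx".
print(unicode_extension_keywords("en-u-ca-gregory-dx-thai-laoo"))
```

An implementation would then hand the values found under "dx" to its segmentation code; how that code uses them is outside the scope of this sketch.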

See additional details in the draft CLDR v38 Release note.

The overall changes to the data items were:

Added	Deleted	Changed	Total
155,131	33,805	45,895	2,175,821

Over 140,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Friday, January 5, 2018

Unicode LDML Keyboard Enhancements

The Unicode CLDR Technical Committee is planning to enhance the Unicode LDML keyboard specifications. The goal is to be able to represent all the keyboard features necessary to support keyboard layouts from all major providers, allowing the CLDR repository of keyboard layouts to support not only languages in widespread use, but also digitally disadvantaged languages. As part of this work, keyboards would gain support for more complex scripts, capabilities for virtual keyboards (especially on mobile phones), features needed on specific platforms, and better layouts overall. Keyboards would also be able to import files, reducing maintenance by allowing common features to be shared. For complex scripts, the transform elements would be made more powerful, and reorder and backspace transforms would be added.

The plan is to incorporate the changes specified in the PRI #367 background document into CLDR v33 (ca. March 2018), and to work thereafter to improve the tooling for the new specification, and streamline the process for submitting new keyboards into CLDR.

The committee is soliciting feedback on the proposal so that it can make any necessary improvements. The closing date for providing feedback is February 1, 2018.

Please see the PRI #367 page for complete details.

Over 130,000 characters are available for adoption, to help the Unicode Consortium’s work on digitally disadvantaged languages.

Wednesday, October 5, 2016

CLDR Version 30 Released

Unicode CLDR 30 provides an update to the key building blocks for software supporting the world’s languages. This data is used by all major software systems for their software internationalization and localization, adapting software to the conventions of different languages for common software tasks. The following summarizes the main improvements in the release.

Unicode support is updated to 9.0, including updated Unihan readings for the pinyin collation and Han-Latin transforms, and support for new script codes and number systems.
The set of language codes for translation has been updated, with a significant increase in the total number of translated language names.
Substantial new data has been added for likely subtags (e.g., to get the main script for each language).
New data items have been added to support relative times such as “3 Fridays ago” or “this hour”.
New draft format and preference structure has been added to support week designations such as “the week of August 10” or “week 3 of March”.
New <characterlabels> data can be used to generate labels for groups of related characters in character pickers.
The structure for emoji annotations has been revised, and the data has been significantly updated. The emoji collation has been updated, and data is added for improved segmentation behavior. Added a specification for synthesizing ZWJ sequence names.
The CLDR 30 Survey Tool data collection resulted in a net increase in data items of about 9.2%, with an additional 5.9% of items changed.
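The likely-subtags item in the list above (getting the main script for each language) can be sketched in Python as follows. The `LIKELY` table is a tiny hand-picked excerpt written for this example, and the lookup order is a simplified version of the "add likely subtags" operation; `parse` and `add_likely_subtags` are illustrative names, and real implementations handle variants, extensions, and many more cases.

```python
# Small hand-picked excerpt in the style of likely-subtags data.
LIKELY = {
    "en": "en_Latn_US",
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "sr": "sr_Cyrl_RS",
    "und_Cyrl": "ru_Cyrl_RU",
}


def parse(locale_id):
    """Split a simple locale ID into (language, script, region); missing parts are ''."""
    parts = locale_id.replace("-", "_").split("_")
    language = parts[0].lower() if parts[0] else "und"
    script, region = "", ""
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part.upper()
    return language, script, region


def add_likely_subtags(locale_id):
    """Fill in a missing script and region from the table (simplified lookup)."""
    language, script, region = parse(locale_id)
    candidates = [
        "_".join(p for p in (language, script, region) if p),
        "_".join(p for p in (language, region) if p),
        "_".join(p for p in (language, script) if p),
        language,
        "_".join(p for p in ("und", script) if p),
    ]
    for key in candidates:
        if key in LIKELY:
            l2, s2, r2 = parse(LIKELY[key])
            # Keep whatever the caller supplied; fill only the missing fields.
            return "_".join((l2 if language == "und" else language, script or s2, region or r2))
    return locale_id  # nothing known about this input


print(add_likely_subtags("zh"))
print(add_likely_subtags("zh-TW"))
print(add_likely_subtags("en_GB"))
```

The main script for a language is then simply the second field of the result.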

For further details and links to documentation, see the CLDR Release Notes

Wednesday, March 19, 2014

CLDR Version 25 Released

Unicode CLDR 25 has been released, providing an update to the key building blocks for software supporting the world's languages. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for common software tasks.

Unicode CLDR 25 focused primarily on improvements to the LDML structure and tools, and on consistency of data. There are many smaller data fixes, but there was no general data submission. Changes include the following:

New rules for plural ranges (1-2 liters) for 72 locales, plurals for 2 locales, and ordinals for 18 locales.
Better locale matching with fallbacks for languages, default languages for continents and subcontinents, and default scripts for more languages.
Two new locales: West Frisian (fy) and Uyghur (ug).
Two new metazones: Mexico_Pacific and Mexico_Northwest
Updated zh pinyin & zhuyin collations and transliterators for Unicode 6.3 kMandarin data.
Updated keyboard layout data for OS X, Windows and others.

This version contains data for 238 languages and 259 territories—740 locales in all.

Details are provided in http://cldr.unicode.org/index/downloads/cldr-25, along with a detailed Migration section.

Wednesday, September 18, 2013

CLDR Version 24 released

Unicode CLDR 24 has been released, providing an update to the key building blocks for software supporting the world's languages. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for common software tasks.
Unicode CLDR 24 focused on additional structure for formatting units, dates, and times, and improving data coverage. This version contains data for 238 languages and 259 territories—740 locales in all. Ten languages were added to the 100%-modern-coverage list for a total of 70 languages. Between the new languages, and the new structure, more data was entered than in any previous release.

The new structure focused primarily on formatting of units and improvements to date and time formatting.

fractional plural forms. major extension to handle fractions (e.g., some languages use the equivalent of “1.2 teaspoons” but “2.1 teaspoon”)
measurement units. many additional unit types (“10.3 kg”), in up to 6 plural forms per language
compound units. video length: "23 hrs, 7 mins", or "23:07"
dates/times. new relative fields such as "last Sunday", and "now"; 12 hour time formats that omit "am/pm"; neutral eras ("405 BCE"); additional timezone fallback regional patterns ("{city} Daylight Time")
number formatting. exponential notation (1.42×10²³), at-least ("99+"), ranges ("3.5-4.5 kg"), narrow currency symbols (both "US$12.23" and "$12.23").
collation. major simplification of rule syntax, updated root files to Unicode 6.3; preliminary version of European Ordering Rules; documentation of the CLDR Collation Algorithm (extending UCA)
JSON. improved support, including new structure and data.
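To make the fractional plural item in the list above more concrete, here is a small Python sketch that derives the plural operands used in LDML plural rules (n, i, v, w, f, t) from a decimal string. The `toy_category` rule is hypothetical: it was invented for this example so that it reproduces the “1.2 teaspoons” versus “2.1 teaspoon” pattern, and it is not the rule of any particular CLDR locale.

```python
def plural_operands(number_text):
    """Compute plural operands from a decimal string such as '1.20'."""
    text = number_text.lstrip("-")
    integer_part, _, fraction = text.partition(".")
    trimmed = fraction.rstrip("0")
    return {
        "n": float(text),                       # absolute value of the number
        "i": int(integer_part),                 # integer digits
        "v": len(fraction),                     # visible fraction digits, with trailing zeros
        "w": len(trimmed),                      # visible fraction digits, without trailing zeros
        "f": int(fraction) if fraction else 0,  # fraction digits as an integer, with trailing zeros
        "t": int(trimmed) if trimmed else 0,    # fraction digits as an integer, without trailing zeros
    }


def toy_category(number_text):
    """Hypothetical rule: 'one' when the last visible digit is 1, else 'other'."""
    ops = plural_operands(number_text)
    if ops["v"] == 0 and ops["i"] % 10 == 1:
        return "one"
    if ops["v"] != 0 and ops["f"] % 10 == 1:
        return "one"
    return "other"


for sample in ("1", "1.2", "2.1", "1.20"):
    print(sample, plural_operands(sample), toy_category(sample))
```

Working from the string rather than from a float matters here, because "1.2" and "1.20" are the same number but have different v and f operands.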

In addition, the data already present from CLDR v23 was reviewed for the supported languages, and many improvements made.

Details of coverage improvements and new features are provided in http://cldr.unicode.org/index/downloads/cldr-24, along with a detailed Migration section.

About the Unicode Consortium
The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards. The membership of the consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry. Members are: Adobe Systems, Apple, Google, Government of Andhra Pradesh, Government of Bangladesh, Government of India, IBM, Microsoft, Monotype Imaging, Sultanate of Oman MARA, Oracle, SAP, Tamil Virtual University, The University of California (Berkeley), Yahoo!, plus well over a hundred Associate, Liaison, and Individual members. For more information, please contact the Unicode Consortium:
http://www.unicode.org/contacts.html.

Friday, March 15, 2013

CLDR Version 23 Released

Unicode CLDR 23 has been released, providing an update to the key building blocks for software supporting the world's languages.

Unicode CLDR 23.0 contains data for 215 languages and 227 territories—654 locales in all. This release focused primarily on improvements to the LDML structure and tools, and on consistency of data. It includes substantially improved support for non-Gregorian calendars (such as the Japanese Imperial calendar used extensively in Japan). The data and structure have also been modified to easily permit changing between 12 and 24 hour formats, and between 2 digit and 4 digit years. The new Unicode character is used for the Turkish Lira, and information is provided for currencies that round to 5 cents (or other subunits) in cash transactions. For most languages that use non-Latin scripts, characters in the language’s script now collate before those in other scripts (including A-Z). Language-specific letter-casing changes (Lower, Upper, Title) have been added for Azerbaijani, Greek, Lithuanian, and Turkish. Keyboard data has also been updated for Android. Also, as of this release, the LDML specification is split into multiple parts, each focusing on a particular area.
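The cash-rounding information mentioned above (currencies that round to 5 cents or other subunits in cash transactions) might be applied along the following lines. This is a minimal Python sketch under stated assumptions: the parameter names `cash_rounding` and `digits` are illustrative, and the half-even rounding mode is a choice made for this example rather than something the release announcement specifies.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_cash(amount, cash_rounding, digits=2):
    """Round amount to the nearest multiple of cash_rounding minor units.

    amount        -- decimal string such as "12.23"
    cash_rounding -- increment in minor units (5 means 0.05 when digits=2);
                     0 means "round to the minor unit only"
    digits        -- number of fraction digits of the currency
    """
    unit = Decimal(10) ** -digits                 # 0.01 when digits=2
    increment = unit * cash_rounding if cash_rounding else unit
    steps = (Decimal(amount) / increment).quantize(Decimal(1), rounding=ROUND_HALF_EVEN)
    return steps * increment


print(round_cash("12.23", 5))   # cash payment, rounded to a multiple of 0.05
print(round_cash("12.23", 0))   # no cash rounding increment
```

Using `Decimal` instead of binary floating point keeps amounts such as 0.05 exact, which matters when deciding which way a value rounds.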

The release had a short cycle so that we could move to the new regular semi-annual schedule. It thus only included a limited data submission phase, for 4 languages only: Armenian (hy), Georgian (ka), Mongolian (mn), and Welsh (cy). For those languages, the data increased by over 100%.

About the Unicode Consortium

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards. The membership of the consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry. Members are: Adobe Systems, Apple, Google, Government of Andhra Pradesh, Government of Bangladesh, Government of India, IBM, Microsoft, Monotype Imaging, Oracle, SAP, Tamil Virtual University, The University of California (Berkeley), Yahoo!, plus well over a hundred Associate, Liaison, and Individual members. For more information, please contact the Unicode Consortium http://www.unicode.org/contacts.html.

Tuesday, June 26, 2012

Proposed updates for Unicode Collation and IDNA

The proposed update of UTS#10 Unicode Collation Algorithm (UCA) modifies the specification for certain edge cases (overlapping contractions), and tightens the requirements for well-formed collation element tables. The detailed descriptions of parametric tailoring options have been removed, and now refer to the corresponding section in LDML. That section adds new explanations and definitions. There are a number of improvements, including additional examples, and some rearrangement of text. See PRI #223

The data has been updated for the Unicode 6.2 beta review, and the associated CollationAuxiliary.txt file in CollationAuxiliary.zip now includes a description of the implicit fractional weight generation and the context syntax. For more details, see Modifications.

There is also a proposed update of UTS #46 Unicode IDNA Compatibility Processing. The data has been updated for the Unicode 6.2 beta review, with minor changes to the text. See PRI #224
