Showing posts with label spoofing. Show all posts

Wednesday, September 13, 2023

Source Code Handling: Preventing Spoofing at the Source

By: Mark Davis, Cofounder and CTO

The Unicode Consortium is providing a new resource to help tooling developers, programming language developers, and programming language users deal with Unicode spoofing.

Background

Unicode encompasses letters and symbols (over 149,000 in Unicode 15.1) from across the world’s writing systems, so it was inevitable that many of them would look similar, and sometimes identical. And of course, there are those who would take advantage of that to swindle. An example of this is “pаypal.com”, where the first ‘а’ is actually a Cyrillic character that is confusable with the Latin letter ‘a’. 😵‍💫
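To make the “pаypal.com” example concrete, here is a minimal Python sketch that lists the non-ASCII letters in a string together with their Unicode names. The helper name `non_ascii_letters` is made up for this illustration; it is not a real confusable check (UTS #39 defines those), just a way to see what a reviewer's eyes cannot.

```python
import unicodedata

def non_ascii_letters(text):
    """Return (index, code point, Unicode name) for each non-ASCII letter."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F and ch.isalpha()
    ]

# The spoofed string is written with an escape so the difference is visible.
spoofed = "p\u0430ypal.com"   # second character is U+0430, not U+0061
genuine = "paypal.com"

for name, value in (("spoofed", spoofed), ("genuine", genuine)):
    print(name, non_ascii_letters(value))
```

The two strings compare as unequal even though most fonts render them identically, which is exactly what makes this kind of spoof effective.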

In 2004, the Unicode Consortium began working to address this issue, focusing on URLs and other identifiers that could be spoofed, and produced a specification and technical report with best practices for detecting such cases. Implementations using those specifications have been widely deployed in operating systems.

In November of 2021, another class of problems was documented. It was demonstrated that malicious agents could write source code that looks secure to human reviewers but actually contains hidden traps. There are three main categories of these spoofs: line-break spoofs, confusable spoofs, and bidirectional ordering spoofs.

Examples

  • Line-break spoofs can cause what appears to be a line of code to be actually commented out, as far as the compiler is concerned. This can happen with C11, for example:
    precondition image
    To a reviewer, this is an active line of code. But when U+2028 Line Separator is at the end of the first line, the C11 compiler will interpret this as one line consisting only of a comment!

  • The “pаypal.com” above is an example of a confusable spoof.

  • As for a bidirectional spoof, take a pair of variables named Aא1 and A1א; these look identical, but the former consists of the letters A and א followed by the digit 1, whereas the latter consists of the letter A, the digit 1, and the letter א, in that order.
Such code might not even be malicious: it is all too easy to accidentally give reviewers (or even the writer!) the wrong impression, leading to hidden software bugs. It can also simply be very hard to understand; here’s an example:

[Image: the text "Error: {0} {1}", message becomes RTL in translation.]
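The line-break spoof from the list above can be caught by a very simple scan. The following Python sketch is an assumption-laden illustration, not a tool from the Consortium: the function name, the sample snippet, and the choice of exactly these three characters are all invented here.

```python
# Characters that some editors render as a line break but that a compiler
# may not treat as one. Choosing these three is an assumption made for
# this illustration.
SUSPICIOUS_BREAKS = {
    "\u2028": "LINE SEPARATOR",
    "\u2029": "PARAGRAPH SEPARATOR",
    "\u0085": "NEXT LINE",
}

def find_suspicious_breaks(source):
    """Return (line, column, name) for each suspicious character.

    Lines are split on LF only, mimicking a compiler that does not
    recognize the characters above as line terminators.
    """
    findings = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPICIOUS_BREAKS:
                findings.append((line_no, col, SUSPICIOUS_BREAKS[ch]))
    return findings

# Hypothetical snippet: an editor may show two lines, but a compiler that
# splits only on LF sees a single comment line.
sample = "// Check the input.\u2028validate(input);\n"
print(find_suspicious_breaks(sample))
```

Note that the scan deliberately avoids `str.splitlines()`, because that method does treat U+2028 as a line boundary and would hide the mismatch being looked for.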

The earlier work on spoofing identifiers was relevant to this work, but did not explicitly deal with the environment surrounding software development. Moreover, the guidance was aimed at internationalization experts, not programming language and software tooling developers.

Process

In response, the Consortium started a project in early 2022 to assemble a cross-functional group of experts in Unicode processing, programming languages, and software development tooling. That project resulted in the Source Code Working Group (SCWG), which worked through the possible problems.

The first results of this group were a number of enhancements to core Unicode specifications in September of 2022. UAX #9 provided an extended example of the use of the important higher-level protocol HL4, and emphasized its use to mitigate misleading bidirectional ordering of source code, including potential spoofing attacks. UAX #31 provided important guidance on profiles for default identifiers, and clarified that the requirement on Pattern_White_Space and Pattern_Syntax characters applies to programming languages and is relevant to issues of bidirectional ordering and potential spoofing attacks.

Impact

The final output of the group is Unicode Technical Standard #55, Source Code Handling. This new specification brings together in one place a description of the problems specific to source code, together with guidance and best practices for programming language and software tooling developers. Many of the APIs necessary for supporting those best practices were already specified and implemented in ICU, Unicode’s software library that is already in all modern operating systems. However, one new useful API has been added to ICU, and will be released in October 2023. This is the new bidiSkeleton function, used to detect identifiers such as Aא1 above.
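To give a feel for the kind of identifier pair mentioned above, here is a rough Python sketch. It is emphatically not the bidiSkeleton algorithm and does not call ICU; it is a crude stand-in heuristic, with invented function names, that flags two distinct identifiers when they consist of the same characters and at least one of them contains a strong right-to-left character. It will both over-report and under-report compared with a real implementation.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes

def has_rtl(identifier):
    """True if any character has a strong right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in identifier)

def may_be_bidi_confusable(a, b):
    """Crude heuristic (an assumption of this sketch, not the ICU logic):
    distinct strings, same characters, and RTL text present."""
    return a != b and sorted(a) == sorted(b) and (has_rtl(a) or has_rtl(b))

first = "A\u05D01"    # A, HEBREW LETTER ALEF, 1
second = "A1\u05D0"   # A, 1, HEBREW LETTER ALEF

print(may_be_bidi_confusable(first, second))
print(may_be_bidi_confusable("ab1", "a1b"))
```

The second call involves no right-to-left characters, so the heuristic leaves that pair alone: reordering only becomes invisible when the bidirectional algorithm is in play.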

Coordinated security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm, and UAX #31, Unicode Identifiers and Syntax, along with updates to UTS #39, Unicode Security Mechanisms.

This work would not have been possible without the set of dedicated and knowledgeable people that made up the SCWG, especially Robin Leroy, the vice chair. Others include Alexei Chimendez, Asmus Freytag, Barry Dorrans, Catherine “whitequark”, Chris Ries, Corentin Jabot, Dante Gagne, Deborah Anderson, Ed Schonberg, Elnar Dakeshov, Jan Lahoda, Julie Allen, Ken Whistler, Liang Hai (梁海), Manish Goregaokar, Mark Davis, Markus Scherer, Michael Fanning, Nathan Lawrence, Ned Holbrook, Peter Constable, Randy Brukardt, Rich Gillam, Richard Smith, Roozbeh Pournader, Steve Dower, and Tom Honermann. For more details on their contributions, see Acknowledgements.

Having completed its main task, the SCWG is formally being retired — but we are keeping the list of participants in case we need to call on their expertise in the future!



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.


Wednesday, December 21, 2022

Unicode in 2022


Hello Everyone!

As we go into the New Year, the Unicode team thought we’d share some highlights from this past year. From source-code spoofing to preserving indigenous languages, the Unicode team has had another full year, including expanding the number of characters that appear on billions of devices around the world.


Nearly 150,000 characters!

On the character side, we reached a total of just shy of 150,000 characters (149,186 to be exact). Of the 4,489 characters added in the 15.0 release, the biggest set was 4,192 ideographs for use in Chinese, Japanese, and Korean. There are also two new scripts, Nag Mundari and Kawi. Nag Mundari is a script used to write the Mundari language of India, a language with 1.1 million speakers. Kawi is an important historic script of insular Southeast Asia, found in inscriptions and on artifacts in several languages dating from the 8th to the 16th centuries — and is undergoing a revival today amongst enthusiasts.

And we can’t forget the 20 new emoji characters. We’re looking forward to seeing which are the most popular: shaking face? Goose? Maracas? Pink heart? If you’re involved in implementing emoji, you’ll also want to look at the latest changes in UTS #51 Unicode Emoji.

See the Unicode 15.0.0 page for more details. We’re also changing how we do releases; for more, see 2023 Release Planning.

The Launch of ICU4X

ICU is used in every major device and operating system; it’s how you see a date or number on your phone, for example. This new project, ICU4X, was created to solve the needs of clients who wish to provide client-side internationalization for their products in resource-constrained environments and across many programming languages. After 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Built from the ground up to be lightweight, portable, and secure, ICU4X learns from decades of experience to bring localized date formatting, number formatting, collation, text segmentation, and more to devices that, until now, did not have a suitable solution. For details, see Announcing ICU4X 1.0.

When does i ≠ і?

Can you tell the difference between i and і? Yeah, most people can’t. The first set of changes to help counter source-code spoofing was included in the 15.0 versions of UAX #9 Unicode Bidirectional Algorithm, UAX #31 Unicode Identifier and Pattern Syntax, and UTS #39 Unicode Security Mechanisms.

For 2023, there is a new draft UTS #55 Unicode Source Code Handling, providing guidance for programming language designers and tooling developers, and specifying mechanisms to avoid usability and security issues arising from improper handling of Unicode. More changes are on their way for UAX #9, UAX #31, and UTS #39 as well.

Åge Møller, Πέτρος Νικόλαος Καρατζής, ராஜேந்திர சோழன்

We’re making great progress on internationalized formatting of people’s names. What does that mean? Software needs to be able to format people's names, such as John Smith or 宮崎駿. The formatting can be surprisingly complicated: for example, people may have a different number of names, depending on their culture — they might have only one name (“Zendaya”), only two (“Albert Einstein”), or three or more. So the software needs to handle missing or extra name fields gracefully.

There are many more complexities — for more details, see Formatting people’s names.

You have 2 unread messages.

Or, you have 3 items in your cart. Whenever a computer needs to construct a sentence using “placeholders” such as 3, it is formatting a message. The current industry standard is ICU’s message formatting; a project started about 3 years ago, with the goal of improving on that to build a more robust and extensible mechanism. There is now a Tech Preview in ICU — we’d urge developers to try it out!
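As a toy illustration of what “formatting a message” involves, the Python sketch below picks an English plural form and then fills in the placeholder. It does not use ICU or the new tech-preview syntax; the function name and the pattern table are invented for this example, and real message formatting has to handle far more than English plurals.

```python
def format_unread(count):
    """Pick the English plural category, then substitute the placeholder."""
    patterns = {
        "one": "You have {count} unread message.",
        "other": "You have {count} unread messages.",
    }
    # English uses the "one" category for exactly 1 and "other" otherwise.
    category = "one" if count == 1 else "other"
    return patterns[category].format(count=count)

for n in (0, 1, 2):
    print(format_unread(n))
```

Keeping the whole sentence in each pattern, rather than gluing fragments together, is what lets translators reorder words freely for their language.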

See message-format-wg for details on the syntax and message2/package-summary.html for the API (note that ICU’s convention is to mark tech-preview APIs as Deprecated), and see the test code in MessageFormat2Test.java for examples of usage.

(There are of course other fixes, upgrades and new features in ICU: see ICU 72 and ICU 71 for more details.)

Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ

In CLDR, we now have 95 languages at the Modern level (suitable for full UI internationalization), 6 at the Moderate level (suitable for “document content” internationalization), and 29 at the Basic level (suitable for locale selection). We added a tech preview of formatting for person names, plus additions for Unicode 15.0 (emoji names and search keywords), names for new scripts, new CJK collation, and so on. For more information, see CLDR v42.

Revitalization and Preservation of Indigenous Languages

The Nattilik language community was unable to use their language reliably for even simple, everyday digital text exchanges such as email or text messaging. The Typotheque Syllabics Project, an initiative based out of Toronto and The Hague, Netherlands, undertook research with language keepers across various Syllabics-using Indigenous communities in Canada. By collaborating with Nattilik language keepers and elders in the community, the project identified key issues faced by the Nattilik community of Western Nunavut, and discovered that 12 syllabic characters were missing from the Unicode Standard. The Consortium worked with the Typotheque Syllabics Project to add 16 characters to the script to support Nattilik and other languages in Unicode version 14.0, and improved the glyphs in Unicode version 15.0. See this blog post from June.

The Past and Future of Flag Emoji

Despite being the largest emoji category with a strong association tied to identity, flags are by far the least used. Flag emoji have always been subject to special criteria due to their open-ended nature, infrequent use, and burden on implementations. The addition of other flags and thousands of valid sequences into the Unicode Standard has not resulted in wider adoption. Flags don’t stand still; they are constantly evolving, and because of their open-ended nature, the addition of one creates exclusivity at the expense of others. Curious to learn more? Read more about the Past and Future of Flag Emoji.

Available Now! New YouTube Playlist and Technical Quick Start Guide

On September 28th, Unicode held a webinar on the “Overview of Internationalization and Unicode Projects” for Unicode enthusiasts. Unicode technical leadership and other experts shared background on our core projects with participants from more than 30 countries. If you missed the webinar, no worries! The recorded sessions are available on this YouTube playlist. And if you are new to Unicode and internationalization or simply want a refresh, you can also check out our Technical Quick Start Guide. This handy guide explains what Unicode is, including answering the question, “What is Internationalization and Why it Matters.” There are also useful links to more detailed information and how you can get involved. Read more here.

Support Unicode 💞💕💌💯✨🌟🤠🛟🎁

Finally, if you are already a contributor to, or member of, Unicode (or your company or organization is!), thank you, Danke, Děkuju, धन्यवाद, merci, 谢谢你, grazie, நன்றி, and gracias! What we have accomplished is only possible because of supporters like you.

And if you want to support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode is a US-based non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Wednesday, March 2, 2022

Avoiding Source Code Spoofing

Unicode has convened a group of experts in programming languages, tooling, and security to provide guidance and recommendations on how to better handle international text in source code, as well as providing code to help implementations.

Recent reports have highlighted problems in the review of source code containing non-ASCII Unicode characters (the so-called “Trojan Source exploit”). A person reviewing a submission of source code could be fooled into thinking that the code was okay, when it was actually malicious. The basic problem occurs when the actual text is different from what the reader perceives it to be, based on what is displayed. This can result either from the presence of characters used in right-to-left scripts (such as Arabic or Hebrew) that can change the visual ordering of text, or from the presence of characters that look like others (also known as “confusables”).
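One simple defensive measure against the reordering side of this problem is to look for explicit bidirectional formatting characters in source text. The Python sketch below is an illustration only: the function name and sample line are invented, and it covers just the embedding, override, and isolate controls rather than everything a full tool would consider.

```python
import unicodedata

# Explicit bidirectional embedding, override, and isolate controls
# (U+202A..U+202E and U+2066..U+2069).
BIDI_FORMATTING = {chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}

def find_bidi_controls(source):
    """Return (line, column, Unicode name) for each control found."""
    findings = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_FORMATTING:
                findings.append((line_no, col, unicodedata.name(ch)))
    return findings

# Hypothetical line of source containing a right-to-left override.
sample = 'access = "user\u202E // admin"\n'
print(find_bidi_controls(sample))
```

A finding is not proof of an attack, since these characters have legitimate uses in right-to-left text; it is a prompt for the reviewer to look at the underlying code points rather than the rendered line.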

The problems here are not solely a security issue: text with different writing directions or confusable characters can be hard to work with. Finding a solution here is important from both security and usability points of view. Developers of source code editors or compilers should not be required to have a deep knowledge of Unicode to provide good user experience and robust security mitigations.

Unicode’s mission is to allow everyone to use their own languages on computers and mobile devices. The above issues are part and parcel of a character set that covers all the writing systems of the world – and have been documented in the Unicode Standard since its very first version in 1991. Unicode’s past efforts have focused on misleading URLs and identifiers, and correct visual ordering of plain text. And while much of this material is relevant to source code, this group of experts will now collect, curate, and supplement that early documentation with concrete recommendations to support source code editors and compilers.

While it may seem that it is easiest to simply go back to limiting source code to only ASCII characters, ASCII-only environments make it much harder to write and maintain software that can be used all over the world – a fundamental requirement for modern software. Moreover, this approach disadvantages software developers who use languages other than English.

More details on the source code spoofing issue, the proposed plan, and formation of this group are found in document L2/22-007R2.


Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages
