AI

Tokens are a big reason today’s generative AI falls short

Comment

LLM word with icons as vector illustration. AI concept of Large Language Models
Image Credits: Getty Images

Generative AI models don’t process text the same way humans do. Understanding their “token”-based internal environments may help explain some of their strange behaviors — and stubborn limitations.

Most models, from small on-device ones like Gemma to OpenAI’s industry-leading GPT-4o, are built on an architecture known as the transformer. Due to the way transformers conjure up associations between text and other types of data, they can’t take in or output raw text — at least not without a massive amount of compute.

So, for reasons both pragmatic and technical, today’s transformer models work with text that’s been broken down into smaller, bite-sized pieces called tokens — a process known as tokenization.

Tokens can be words, like “fantastic.” Or they can be syllables, like “fan,” “tas” and “tic.” Depending on the tokenizer — the model that does the tokenizing — they might even be individual characters in words (e.g., “f,” “a,” “n,” “t,” “a,” “s,” “t,” “i,” “c”).

Using this method, transformers can take in more information (in the semantic sense) before they reach an upper limit known as the context window. But tokenization can also introduce biases.

Some tokens have odd spacing, which can derail a transformer. A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” ” .” Depending on how a model is prompted — with “once upon a” or “once upon a ,” — the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.

Tokenizers treat case differently, too. “Hello” isn’t necessarily the same as “HELLO” to a model; “hello” is usually one token (depending on the tokenizer), while “HELLO” can be as many as three (“HE,” “El,” and “O”). That’s why many transformers fail the capital letter test.

“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”

This “fuzziness” creates even more problems in languages other than English.

Many tokenization methods assume that a space in a sentence denotes a new word. That’s because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don’t — nor do Korean, Thai or Khmer.

A 2023 Oxford study found that, because of differences in the way non-English languages are tokenized, it can take a transformer twice as long to complete a task phrased in a non-English language versus the same task phrased in English. The same study — and another — found that users of less “token-efficient” languages are likely to see worse model performance yet pay more for usage, given that many AI vendors charge per token.

Tokenizers often treat each character in logographic systems of writing — systems in which printed symbols represent words without relating to pronunciation, like Chinese — as a distinct token, leading to high token counts. Similarly, tokenizers processing agglutinative languages — languages where words are made up of small meaningful word elements called morphemes, such as Turkish — tend to turn each morpheme into a token, increasing overall token counts. (The equivalent word for “hello” in Thai, สวัสดี, is six tokens.)

In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed up to 10 times more tokens to capture the same meaning in English.

Beyond language inequities, tokenization might explain why today’s models are bad at math.

Rarely are digits tokenized consistently. Because they don’t really know what numbers are, tokenizers might treat “380” as one token, but represent “381” as a pair (“38” and “1”) — effectively destroying the relationships between digits and results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926).

That’s also the reason models aren’t great at solving anagram problems or reversing words.

So, tokenization clearly presents challenges for generative AI. Can they be solved?

Maybe.

Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling “noise” like words with swapped characters, spacing and capitalized characters.

Models like MambaByte are in the early research stages, however.

“It’s probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers,” Feucht said. “For transformer models in particular, computation scales quadratically with sequence length, and so we really want to use short text representations.”

Barring a tokenization breakthrough, it seems new model architectures will be the key.

More TechCrunch

This is a guide on how to check whether someone compromised your online accounts.

How to tell if your online accounts have been hacked

There is a general consensus today that generative AI is going to transform business in a profound way, and companies and individuals who don’t get on board will be quickly…

The AI financial results paradox

Google’s parent company Alphabet might be on the verge of making its biggest acquisition ever. The Wall Street Journal reports that Alphabet is in advanced talks to acquire Wiz for…

Google reportedly in talks to acquire cloud security company Wiz for $23B

Featured Article

Hank Green reckons with the power — and the powerlessness — of the creator

Hank Green has had a while to think about how social media has changed us. He started making YouTube videos in 2007 with his brother, novelist John Green, at a time when the first iPhone was in development, MySpace was still relevant and Instagram didn’t exist. Seventeen years later, posting…

Hank Green reckons with the power — and the powerlessness — of the creator

Here is a timeline of Synapse’s troubles and the ongoing impact it is having on banking consumers. 

Synapse’s collapse has frozen nearly $160M from fintech users — here’s how it happened

Featured Article

Helixx wants to bring fast-food economics and Netflix pricing to EVs

When Helixx co-founder and CEO Steve Pegg looks at Daisy — the startup’s 3D printed prototype delivery van  — he sees a second chance. And he’s pulling inspiration from McDonald’s to get there.  The prototype, which made its global debut this week at the Goodwood Festival of Speed, is an…

Helixx wants to bring fast-food economics and Netflix pricing to EVs

Featured Article

India clings to cheap feature phones as brands struggle to tap new smartphone buyers

India is struggling to get new smartphone buyers, as millions of Indians don’t go for an upgrade and continue to be on feature phones.

India clings to cheap feature phones as brands struggle to tap new smartphone buyers

Roboticists at The Faboratory at Yale University have developed a way for soft robots to replicate some of the more unsettling things that animals and insects can accomplish — say,…

Meet the soft robots that can amputate limbs and fuse with other robots

Featured Article

If you’re an AT&T customer, your data has likely been stolen

This week, AT&T confirmed it will begin notifying around 110 million AT&T customers about a data breach that allowed cybercriminals to steal the phone records of “nearly all” of its customers. The stolen data contains phone numbers and AT&T records of calls and text messages during a six-month period in…

If you’re an AT&T customer, your data has likely been stolen

In the first half of 2024 alone, more than $35.5 billion was invested into AI startups globally.

Here’s the full list of 28 US AI startups that have raised $100M or more in 2024

Whistleblowers have accused OpenAI of placing illegal restrictions on how employees can communicate with government regulators, according to a letter obtained by The Washington Post. Lawyers representing anonymous whistleblowers sent…

Whistleblowers accuse OpenAI of ‘illegally restrictive’ NDAs

Business email compromise attacks are on the rise. Here’s how you can stay ahead of the hackers.

How to protect your startup from email scams

Featured Article

What exactly is an AI agent?

Regardless of how they’re defined, the agents are for helping complete tasks in an automated way with as little human interaction as possible.

What exactly is an AI agent?

Meta announced former President Donald Trump’s Facebook and Instagram accounts will no longer be subject to heightened suspension penalties, according to an updated blog post on Friday. The company says…

Meta removes special restrictions for Trump’s account ahead of 2024 elections

A Castro Valley resident was charged Thursday for allegedly slashing the tires of 17 Waymo robotaxis in San Francisco between June 24 and June 26, according to the city’s district…

Waymo cameras capture footage of person charged in alleged robotaxi tire slashings

Welcome to Startups Weekly — your weekly recap of everything you can’t miss from the world of startups. Sign up here to get it in your inbox every Friday. This…

Defending Russia’s EU neighbors

Cat-Wells said she started this platform because traditional hiring processes are exclusionary and often overlook skilled, talented disabled people.

A VC told Keely Cat-Wells to get a male, non-disabled co-founder — she balked, nabbed a $2M pre-seed round

A new study examines whether AI could be an automated helpmeet in creative tasks, with mixed results: It appeared to help less naturally creative people write more original short stories…

Experiment finds AI boosts creativity individually — but lowers it collectively

Featured Article

HeadSpin, whose founder is in prison for fraud, sold to PE firm in fire sale, sources say

In total, HeadSpin raised $117 million since its 2015 inception and was last valued at $1.1 billion in 2020.

HeadSpin, whose founder is in prison for fraud, sold to PE firm in fire sale, sources say

A bipartisan group of senators has introduced a new bill that seeks to protect artists, songwriters and journalists from having their content used to train AI models or generate AI…

New Senate bill seeks to protect artists’ and journalists’ content from AI use

When Keith Rabois announced he was leaving Founders Fund to return to Khosla Ventures in January, it came as a shock to many in the venture capital ecosystem — and…

From Ethan Choi to Spencer Peterson, venture capitalists continue to play musical chairs

Archer Aviation and Southwest Airlines are teaming up to figure out what it will take to build out a network of electric air taxis at California airports. Southwest’s customer data…

Archer’s vision of an air taxi network could benefit from Southwest customer data

If you visited the Wikipedia website on mobile this week, you might have seen a pop-up indicating that dark mode is ready for prime time.

Wikipedia’s mobile website finally gets a dark mode — here’s how to turn it on

Featured Article

What the AT&T phone records data breach means for you

The giant U.S. telco lost the information of around 110 million customers. Here’s what you need to know.

What the AT&T phone records data breach means for you

The error brings to a close SpaceX’s incredible streak of 335 flawless launches across the company’s Falcon family of rockets, which also includes the more powerful Falcon Heavy.

SpaceX Falcon 9 suffers rare failure on orbit during Starlink deployment

The AI chatbot has been trained on Amazon’s product catalog, customer reviews, community Q&As, and other public information found around the web.

Amazon AI chatbot Rufus is now live for all US customers

If X continues to violate Europe’s data protection rules, the company is on the hook for fines of up to €4,000 per day.

More bad news for Elon Musk after X user’s legal challenge to shadowban prevails

HERO Software has closed a €40 million Series B financing round, and plans to expand across Europe. 

A startup set out to fight climate change — it did it by helping plumbers

Fusion power may still be a few years away, but one startup is laying the groundwork for what it hopes will become a bustling sector of the economy.

Fusion pioneer Commonwealth Fusion Systems is selling core magnet tech to the University of Wisconsin

For months, rumors persisted that Google, and perhaps others, were interested in buying HubSpot, a Boston-based CRM and marketing software company. HubSpot’s market cap ballooned as the rumors persisted, eventually…

Boston VCs are pleased that HubSpot will remain an independent company