
From the question How long is a "token"? we learn that tokens are commonly around 4 characters. So it seems plausible that LLMs might therefore prefer to have word boundaries coincide with token boundaries. For example, maybe ChatGPT has a bias towards (4n-1)-character words (the -1 accounting for a whitespace character).
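To make the arithmetic behind that guess concrete, here is a minimal illustration (purely hypothetical, just spelling out the formula):

    # Under the hypothesis, a word of 4n-1 characters plus its leading
    # space spans exactly n average-sized (4-character) tokens.
    preferred_lengths = [4 * n - 1 for n in range(1, 5)]
    print(preferred_lengths)  # [3, 7, 11, 15]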

Question: Does the length of a token give LLMs a preference for words of certain lengths?

I didn't find the answer via Google. I asked Koala.sh, which said "Language models do not have a preference for words of certain lengths", and Assistant, which said "Language models like GPT-3.5, which is based on the transformer architecture, do not inherently have a preference for words of certain lengths". However, neither AI explained its reasoning; I wonder whether there's an inherent reason for this, or research into this topic.

(Note this question is not about Google, Koala.sh, or Assistant in particular; I'm just showing my attempts at finding an answer myself, as is generally expected when writing questions.)

  • What do you mean by "prefer"?
    Commented Aug 7, 2023 at 4:01
  • The LLM doesn't "know" the length of a token, so it cannot consider or even prefer it. All it knows and works with is its embedding: a fixed-size vector of floats, encoding lots of information about the token, but not its length.
    Commented Aug 7, 2023 at 5:41
  • Maybe this is a topic for meta, but I'm generally opposed to the trend of "asking the question to various AI tools and trusting what they say", both in questions and especially as answers on this site. I don't ask my tarot cards what they think of SE questions, and on uncommon questions (the type that would require more research than a cursory google) I think an LLM and my tarot deck are ~equally accurate and trustworthy (that is, no better than random chance).
    – Kaia
    Commented Aug 8, 2023 at 18:52

1 Answer


So it seems plausible that LLMs might therefore prefer to have word boundaries coincide with token boundaries. For example, maybe ChatGPT has a bias towards (4n-1)-character words (the -1 accounting for a whitespace character).

Tokens are around 4 characters on average across enough text, but not strictly 4 characters each. Tokenisation will usually give common words their own token, whereas rarer, longer, or more easily split words may be composed of multiple tokens.

Example of the previous sentence in OpenAI's GPT-3 Tokeniser: [screenshot of the tokeniser output, with each token highlighted]
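For anyone who wants to poke at this without the web tool, here is a minimal sketch using OpenAI's tiktoken library (r50k_base is the encoding used by the original GPT-3 models; the sample sentence is just the one above):

    import tiktoken

    # r50k_base corresponds to the GPT-3 tokeniser
    enc = tiktoken.get_encoding("r50k_base")

    sentence = ("Tokenisation will usually give common words their own token, "
                "whereas rarer, longer, or more easily split words may be "
                "composed of multiple tokens.")

    tokens = enc.encode(sentence)
    pieces = [enc.decode([t]) for t in tokens]

    print(pieces)                       # common words come out as single tokens
    print(len(sentence) / len(tokens))  # average characters per token, roughly 4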

The model won't be directly aware of how many characters are in each token, although it may pick this up from context in the training data.


As a rough empirical check, I downloaded a dataset of ShareGPT conversations, filtered it to ASCII-only messages, and compared the word-length distribution of bot messages to that of user messages:

[Bar chart: relative frequency of each word length, bot messages vs. user messages]
(62,060,295 total bot words; 15,193,352 total user words)

User messages won't be a perfect match for the data that GPT-3.5 was trained on, but I don't see any particular pattern in the bot's word-length distribution above.
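For reference, here is a rough sketch of how such a comparison can be done; the file name and JSON layout are assumptions, since ShareGPT dumps vary:

    import json
    import re
    from collections import Counter

    # Assumed layout: a JSON list of conversations, each holding a
    # "conversations" list of {"from": "human"|"gpt", "value": "..."} turns.
    with open("sharegpt.json", encoding="utf-8") as f:
        data = json.load(f)

    counts = {"human": Counter(), "gpt": Counter()}

    for conv in data:
        for turn in conv.get("conversations", []):
            text = turn.get("value", "")
            role = turn.get("from")
            if role not in counts or not text.isascii():
                continue  # keep ASCII-only messages, as above
            for word in re.findall(r"[A-Za-z']+", text):
                counts[role][len(word)] += 1

    # Relative frequency of each word length, user vs. bot
    totals = {role: sum(c.values()) or 1 for role, c in counts.items()}
    for length in range(1, 16):
        user = counts["human"][length] / totals["human"]
        bot = counts["gpt"][length] / totals["gpt"]
        print(f"{length:2d} chars: user {user:6.2%}  bot {bot:6.2%}")

If the (4n-1) hypothesis held, lengths 3, 7, 11, ... should be visibly over-represented in the bot column relative to the user column.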

  • (Note that the GPT-3 Tokeniser isn't the one used by ChatGPT. Someone made an equivalent here: foxabilo.com/tokenizer)
    – endolith
    Commented Aug 8, 2023 at 16:52
