
My understanding of embeddings is this:

  • Given some text string, an embedding function (API) generates an embedding (a fixed-length vector of floats). Semantically similar texts have nearby vectors. Texts may be single words (tokens) or complete documents (or chunks of them when the documents are too long).

  • To compare two embeddings, they must have been generated by the same embedding function.

  • The transformer of an autoregressive LLM like ChatGPT automatically generates one distinguished text embedding for each prompt: the "contextualized" embedding of the last token of the prompt (after having passed through the 96 × 96 self-attention heads, it represents the whole prompt), which is then decoded in the final linear layer to yield the next token.

  • When a database of (precalculated) text embeddings is to be used by an LLM, it might go like this:

    1. A user enters a prompt, and an embedding of the prompt is calculated (which is possibly not the contextualized embedding of the last token).

    2. The precalculated text embeddings which are most similar to the prompt embedding are retrieved.

    3. These embeddings are "decoded" and passed to the LLM in a system message next to the original user prompt - to improve the response.

To make it clear: steps 1 to 3 in particular are just how I believe it might work, and my question is whether this is basically a correct high-level description of what's going on.

1 Answer


You have provided a good high-level overview of how text embeddings can be used by large language models (LLMs). Here are some additional details and clarifications:

  • The embedding generated by the LLM for a given prompt is called a "contextualized" embedding because it takes into account the full context of the prompt. This is in contrast to "static" embeddings, which are precomputed and do not change based on context.

  • The contextualized embedding extracted for a prompt represents the LLM's learned representation of that prompt. This embedding encodes semantic information that captures the meaning of the full prompt.

  • When using a database of pre-computed text embeddings, the process would be as follows (a minimal code sketch of this flow appears after this list):

    1. User enters a prompt
    2. A static embedding is generated for the prompt.
    3. This embedding is compared with the database to find similar embeddings.
    4. The texts associated with the most similar embeddings are retrieved.
    5. These texts are provided to the LLM as context, along with the original prompt.
  • Providing relevant context texts can help the LLM generate a better response by giving it additional information to condition on. The external embeddings act as a basic retrieval mechanism.

  • However, this process relies on static embeddings that may not capture context as well as the LLM's internal contextualized embeddings. Thus, there are trade-offs between the use of pre-computed embeddings and the LLM's own representations.
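
To make steps 1 to 5 concrete, here is a minimal sketch of that retrieval flow in Python. The embed function below is only a toy stand-in for a real embedding API, and the document chunks are made-up examples; the point is the cosine-similarity lookup and the way the retrieved texts are attached to the prompt.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Toy stand-in for a real embedding function/API (letter counts only)."""
        vec = np.zeros(26)
        for ch in text.lower():
            if "a" <= ch <= "z":
                vec[ord(ch) - ord("a")] += 1.0
        return vec

    # Offline: pre-compute embeddings for the document chunks.
    documents = [
        "The warranty covers manufacturing defects for two years.",
        "Shipping usually takes three to five business days.",
        "Returns are accepted within thirty days of purchase.",
    ]
    doc_embeddings = np.stack([embed(d) for d in documents])

    def retrieve_context(prompt: str, k: int = 2) -> list[str]:
        query = embed(prompt)              # steps 1-2: embed the prompt
        sims = doc_embeddings @ query / (  # step 3: cosine similarity vs. the database
            np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query)
        )
        top = np.argsort(-sims)[:k]        # step 4: indices of the most similar chunks
        return [documents[i] for i in top]

    # Step 5: hand the retrieved texts to the LLM together with the original prompt.
    prompt = "What does the warranty cover?"
    context = "\n".join(retrieve_context(prompt))
    augmented_prompt = "Use the following context to answer.\n" + context + "\n\nQuestion: " + prompt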

In general, your high-level understanding is correct. The key distinction is between static embeddings used for retrieval vs. contextualized embeddings used internally by the LLM to represent prompts. Providing external context based on similarity of static embeddings can improve LLM performance.
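
As an illustration of the contextualized side of that distinction, the sketch below reads out the hidden state of a prompt's last token from an openly available causal language model via the Hugging Face transformers library. GPT-2 is used purely as a stand-in (ChatGPT's internal states are not exposed), and taking the final layer's last-token state is one reasonable choice rather than a fixed rule.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # GPT-2 as an openly available stand-in for an autoregressive LLM.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")
    model.eval()

    prompt = "Text embeddings map strings to vectors."
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, sequence_length, hidden_size).
    # The last token's vector serves as a contextualized embedding of the whole
    # prompt, since causal self-attention lets it attend to every earlier token.
    contextual_embedding = outputs.last_hidden_state[0, -1, :]
    print(contextual_embedding.shape)  # torch.Size([768]) for GPT-2 small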

Now, to answer your first question, here are a few ways that text embeddings can be used by large language models (LLMs):

  • Retrieval: As you outlined, pre-computed text embeddings can be used to retrieve relevant context passages for a given query. The embeddings serve as an efficient lookup mechanism to find semantically similar texts to feed the LLM.
  • Initialization: Text embeddings pre-trained on large corpora can be used to initialize the word embeddings in the LLM. This provides a better starting point compared to random initialization.
  • Transfer Learning: LLMs pre-trained with an embedding-related objective (e.g. masked language modeling) can transfer this knowledge to downstream tasks. The text embeddings capture useful language representations.
  • Hybrid approaches: Retrieval systems can combine static embeddings for efficiency with contextual embeddings from the LLM for accuracy. For example, use static embeddings to narrow down candidates and then contextual embeddings for final ranking (a small two-stage sketch follows this list).
  • Self-supervision: Text embeddings derived from self-supervised objectives such as masked language modeling can be used to pretrain LLMs on unlabeled data before fine-tuning on downstream tasks.
  • Multimodal learning: Text embeddings can be combined with image, audio, and video embeddings for multimodal LLMs that understand different data modalities.
  • Evaluation: Text embedding similarity can be used to evaluate LLM performance in semantic similarity tasks. Higher embedding similarity between generated and reference texts indicates better performance.
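
The hybrid approach in particular is easy to picture as a two-stage pipeline: cheap static embeddings prune the corpus, and a more expensive contextual scorer reranks the survivors. In the sketch below, embed_static and contextual_score are placeholders for whatever bi-encoder and cross-encoder (or LLM-based scorer) you actually use; only the prune-then-rerank structure is the point.

    import numpy as np

    def hybrid_retrieve(query, documents, doc_embeddings,
                        embed_static, contextual_score,
                        n_candidates=20, k=3):
        """Stage 1: static embeddings cheaply narrow the corpus to n_candidates.
        Stage 2: a slower contextual scorer reranks those candidates and keeps k."""
        q = embed_static(query)
        sims = doc_embeddings @ q / (
            np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q)
        )
        candidates = np.argsort(-sims)[:n_candidates]
        reranked = sorted(candidates,
                          key=lambda i: contextual_score(query, documents[i]),
                          reverse=True)
        return [documents[i] for i in reranked[:k]]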

In conclusion, text embeddings are a versatile technique for improving LLM capabilities in retrieval, initialization, transfer learning, self-supervision, multimodal learning, and evaluation. They provide useful semantic representations that can enhance LLMs in a variety of ways.
