
LLMs can be prompted to translate content from one language to another. But how do they compare to specialized translation models, which are often much smaller than LLMs?

3 Answers


It's complicated. I'd recommend looking at results from recent WMT competitions. For example, the WMT 2023 findings state:

Large Language Models (LLMs) exhibit strong performance across the majority of language pairs, although this is based only on two LLM-based system submissions. Test suite analysis revealed that although GPT4 excelled in some areas (e.g. UGC translation), it struggled with other aspects such as speaker gender translation and specific domains (e.g. legal), whereas it ranked lower than encoder-decoder systems when translating from English into less-represented languages (e.g. Czech and Russian).

Here, UGC means social/user-generated content.

Specifically, they test the GPT-4 system as described in Hendy et al. (2023), except that instead of retrieving the most relevant few-shot samples, they use "fixed random translation examples" in addition to their "predefined few-shot examples". The other LLM-based system tested was Lan-BridgeMT, although that tended to perform worse on average.
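To make that setup concrete, here is a hypothetical sketch of a fixed few-shot translation prompt. The function name, instruction wording, and example pairs below are my own illustrations, not taken from the paper:

```python
# Sketch of a fixed few-shot translation prompt, loosely following the
# setup described above (fixed examples reused for every input, rather
# than examples retrieved per source sentence). All names and example
# sentences here are illustrative.

def build_translation_prompt(source_sentence, examples,
                             src_lang="English", tgt_lang="German"):
    """Assemble a few-shot prompt from fixed (source, target) example pairs."""
    lines = [f"Translate the following sentences from {src_lang} to {tgt_lang}.", ""]
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
        lines.append("")
    # The model is expected to continue the prompt with the translation.
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)

# The same fixed examples are prepended for every sentence to translate.
FIXED_EXAMPLES = [
    ("Good morning.", "Guten Morgen."),
    ("Where is the station?", "Wo ist der Bahnhof?"),
]

prompt = build_translation_prompt("The weather is nice today.", FIXED_EXAMPLES)
print(prompt)
```

The resulting string would then be sent to the LLM's completion endpoint, with the model's continuation taken as the translation.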

GPT4 performed among the best for, e.g., English -> German:

[Image: WMT2023 English -> German results]

and worse (but still ok) for English -> Czech:

[Image: WMT2023 English -> Czech results]
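Rankings like these come primarily from human evaluation, with automatic metrics (e.g. chrF, COMET) reported alongside. As a toy illustration of what a character n-gram metric measures, here is a simplified chrF-style scorer; this is my own sketch, not the official WMT implementation:

```python
# Toy character n-gram F-score (chrF-style), for illustration only.
# Real WMT evaluation uses calibrated automatic metrics plus human
# judgments; this simplified version just shows the idea of scoring a
# hypothesis translation against a reference.
from collections import Counter

def chrf_like(hypothesis, reference, max_n=6, beta=2.0):
    """Average character n-gram precision/recall, combined as an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# A closer hypothesis scores higher than an unrelated one.
print(chrf_like("Wo ist der Bahnhof?", "Wo ist der Bahnhof?"))  # identical -> 1.0
print(chrf_like("Wo ist die Bahnhof?", "Wo ist der Bahnhof?"))
```

A higher score means more character n-gram overlap with the reference; system-level scores are obtained by aggregating over a whole test set.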

Finally, they also talk about some interesting errors that GPT4 makes:

Additionally, test suites providers noted that GPT4 outputs are not always faithful to the source sentence (Bawden and Sagot, 2023) and that they have some issues with speaker gender translation (Savoldi et al., 2023) and specific domains (Mukherjee and Shrivastava, 2023, e.g. legal).

As an aside, it's somewhat unclear what they mean by "LLM" in their paper: although they say "we received only one submission using LLM methods (Lan-BridgeMT), whereas one dominant commercial LLM (GPT4) was included via our own efforts", most (all?) of the other submissions also use large transformer models, including one with 10.5B parameters. Maybe they only count closed-source models or particularly large models? (Lan-BridgeMT uses GPT-3.5/4.)


The PaLM 2 Technical Report states that PaLM 2 outperforms Google Translate in some settings.


LLMs offer versatility and broad context understanding, but specialized models typically provide higher accuracy and optimization for translation tasks. Use specialized models for precision and LLMs for broader context and adaptability.

  • I don't see how this answers the question. It's not clear what "precision" means in this context, or what you mean by "broader context and adaptability". I encourage you to edit your answer to explain what you mean and provide specific definitions for those phrases. I hope I don't need to say this, but I want to bring the following policy to your attention: genai.stackexchange.com/help/gen-ai-policy
    – D.W.
    Commented Jul 1 at 19:16
