
Huge models are trained on data in multiple languages, so I would like to use such a model to detect the language of the input. I would extract paragraphs from a webpage, then have the AI analyze the text and spit out something like "the majority of the text is in English, small parts are in German and Swedish".

Is this a feasible application for an LLM? Or would a simple frequency analysis for language detection be more accurate and efficient?

  • An LLM probably can do this with reasonable accuracy, but it's like swatting flies with a sledgehammer. You'd get comparable if not better performance from purely statistical methods, while requiring orders of magnitude fewer computing resources.
    – Mark
    Commented Apr 18 at 1:56
  • 1
  • why the downvote?
    – Franck Dernoncourt
    Commented Apr 18 at 15:56
  • I don't personally find downvotes offensive, so feel free to cast any vote :)
    – tpimh
    Commented Apr 18 at 19:00
  • @FranckDernoncourt, I'd guess that it's because this is yet another case of "Hi, I found this cool-looking hammer. How do I use it to install screws?"
    – Mark
    Commented Apr 18 at 23:36
  • The question is not "how do I use it", but rather "is it a good tool for this purpose". And "no" is an acceptable answer if it's backed by something.
    – tpimh
    Commented Apr 22 at 6:56

1 Answer


Simple frequency analysis for language detection is orders of magnitude more efficient from a computational standpoint, and I'd guess at least as accurate as LLMs. You just need to look at character n-gram statistics (which typically yield >99% accuracy in most language detection scenarios), using sliding windows if you suspect code-mixing (code-mixing significantly degrades language detection accuracy).
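
For illustration, here is a minimal sketch of the sliding-window idea using the langdetect library (an n-gram-based detector). The window/step sizes and the `detect_languages` helper are illustrative choices, not a standard API:

```python
# Sliding-window language detection sketch, assuming the langdetect
# library is installed (pip install langdetect).
from collections import Counter

from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is randomized by default; pin the seed


def detect_languages(text: str, window_size: int = 200, step: int = 100) -> dict:
    """Classify overlapping character windows and report each language's share."""
    counts = Counter()
    for start in range(0, max(len(text) - window_size + 1, 1), step):
        window = text[start:start + window_size]
        try:
            counts[detect(window)] += 1
        except LangDetectException:
            pass  # window too short or contains no usable text
    total = sum(counts.values()) or 1
    return {lang: round(n / total, 2) for lang, n in counts.most_common()}


sample = (
    "The quick brown fox jumps over the lazy dog. " * 6
    + "Der schnelle braune Fuchs springt über den faulen Hund. " * 2
)
print(detect_languages(sample))
# e.g. {'en': 0.77, 'de': 0.23} -- majority English, smaller part German
```

Off-the-shelf detectors such as langid, CLD3, or fastText's language identification model implement exactly this kind of character n-gram profiling and run in a fraction of the time an LLM call would take.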
