Sreekanth Madisetty, PhD’s Post


Research Scientist | Ph.D @IIT Hyderabad

IndicGenBench

Google Research India recently released IndicGenBench, a multilingual benchmark for evaluating the generation capabilities of LLMs on 29 Indic languages spanning 13 writing scripts and 4 language families. To extend existing datasets in Cross-lingual Summarization, Machine Translation, Multilingual Question Answering, and Cross-lingual Question Answering, the team collected human translations of English examples into the target Indic languages, broadening the scope and applicability of evaluation in this domain.

One of the key insights from their study is the analysis of token fertility across all Indic languages in IndicGenBench. Token fertility, the average number of sub-words that the tokenizer breaks a word into, varies significantly across languages. Some languages have low fertility (few sub-words per word), while others have high fertility (many sub-words per word).

Now, why does this matter? Well, it affects how well the language models work. A high-fertility language needs more tokens to express the same text, so fewer in-context examples fit into the model's context window, and such languages might struggle because the model can't use as many examples to learn from. They found that languages with lower fertility can use more examples effectively than those with higher fertility.

#LLMs #GenAI #IndicGenBench #AI #IndicLanguages #IndicDatasets #multilingual
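To make the token fertility idea concrete, here is a minimal Python sketch. It is not the benchmark's code: the whitespace word splitting, the `toy_tokenize` function (which just cuts words into fixed-size pieces), and the `max_examples` helper are all hypothetical stand-ins chosen for illustration. A real measurement would plug in the actual tokenizer of the model being evaluated.

```python
from typing import Callable, List


def token_fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of sub-word tokens per whitespace-separated word.

    Assumption: words are split on whitespace, which is only a rough
    approximation for some scripts.
    """
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words


def toy_tokenize(word: str, piece_len: int = 3) -> List[str]:
    """Hypothetical tokenizer: cut a word into pieces of at most piece_len characters."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def max_examples(context_budget: int, words_per_example: int, fertility: float) -> int:
    """Rough count of in-context examples that fit into a token budget."""
    tokens_per_example = words_per_example * fertility
    return int(context_budget // tokens_per_example)


if __name__ == "__main__":
    fertility = token_fertility(["abc defghi"], toy_tokenize)
    print(f"fertility: {fertility}")
    # Same token budget and example length, two different fertilities.
    print(max_examples(1000, 100, 1.5), max_examples(1000, 100, 4.0))
```

With the same token budget and the same example length in words, the higher-fertility setting fits fewer examples, which is the effect the post describes.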
