Sreekanth Madisetty, PhD’s Post

Research Scientist | Ph.D @IIT Hyderabad

3mo

IndicGenBench

Google Research India recently released IndicGenBench, a multilingual benchmark for evaluating the generation capabilities of LLMs on 29 Indic languages spanning 13 writing scripts and 4 language families. To extend existing datasets in Cross-lingual Summarization, Machine Translation, Multilingual Question Answering, and Cross-lingual Question Answering, the team collected human translations of English examples into the target Indic languages, thereby broadening the scope and applicability of evaluation in this domain.

One of the key insights from their study is the analysis of token fertility across all Indic languages in IndicGenBench. Token fertility, the average number of sub-words that the tokenizer breaks a word into, varies significantly across languages. Some languages are split into only a few sub-words per word (low fertility), while others are split into many (high fertility).

Now, why does this matter? Well, it affects how well the language models work. Languages with higher fertility might struggle because the model cannot use as many examples to learn from. The team found that languages with lower fertility can use more examples effectively than those with higher fertility.

#LLMs #GenAI #IndicGenBench #AI #IndicLanguages #IndicDatasets #multilingual
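To make the token fertility definition concrete, here is a minimal Python sketch. It is not the benchmark's own code: `toy_chunk_tokenizer` is a hypothetical stand-in for a real sub-word tokenizer, and `examples_that_fit` rests on the assumption that "examples" means examples placed in a fixed-size model input, with a made-up budget of 2048 tokens and 100 words per example.

```python
from typing import Callable, List


def token_fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of sub-word tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(tokenize(word)) for word in words) / len(words)


def toy_chunk_tokenizer(word: str, size: int = 3) -> List[str]:
    """Hypothetical tokenizer: cuts a word into fixed-size character chunks.

    A real LLM tokenizer (BPE, SentencePiece, ...) would be plugged in here.
    """
    return [word[i:i + size] for i in range(0, len(word), size)]


def examples_that_fit(context_budget: int, words_per_example: int,
                      fertility: float) -> int:
    """How many examples fit in a fixed token budget (assumed scenario)."""
    tokens_per_example = max(1, round(words_per_example * fertility))
    return context_budget // tokens_per_example


if __name__ == "__main__":
    low = token_fertility("the cat sat", toy_chunk_tokenizer)
    high = token_fertility("internationalisation multilingualism",
                           toy_chunk_tokenizer)
    print("fertility:", low, high)
    # Same budget, same example length in words: higher fertility, fewer examples.
    print("examples:", examples_that_fit(2048, 100, low),
          examples_that_fit(2048, 100, high))
```

Under these assumptions, the same token budget holds far fewer examples for the high-fertility text, which is one way to read the effect described in the post.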

1 Comment

Sreekanth Madisetty, PhD

Research Scientist | Ph.D @IIT Hyderabad

3mo

Links:
Paper: https://arxiv.org/pdf/2404.16816
GitHub: https://github.com/google-research-datasets/indic-gen-bench

More Relevant Posts

Dimpy Varshni

AI Scientist @ Accenture | Former ML Engineer @ Qualcomm | Former ML Intern @ IBM | IIIT Delhi | Computer Vision | NLP | Deep Learning | Machine Learning
1mo
India’s First Text-to-Image Model for Multilingual Content

Research has indicated that publicly available text-to-image models often underperform for non-English languages because diverse language data is limited. Studies pointed out this challenge after such models were released (https://lnkd.in/gGuaSUW6, 2022). Enter Kalaido.ai, a pioneering text-to-image diffusion model by home-grown AI unicorn Fractal. Kalaido not only excels at generating high-quality images from English text prompts but also supports 17 Indian languages, including Hindi, Kannada, Tamil, Telugu, and Sanskrit. Internal trials have shown Kalaido to be 40% more efficient at producing detailed images than global competitors. Explore the model at https://kalaido.ai/create and witness the fusion of language and imagery firsthand.

#TextToImage #AI #Multilingual #Technology #Research

Text to Image Generation: Leaving no Language Behind

deepai.org
Rocío del Amo
6mo
A ‘Shocking’ Amount of the Web Is Already AI-Translated Trash, Scientists Determine https://lnkd.in/d8rNMjcS #language #languages #ai #mt #xl8 #translation

A ‘Shocking’ Amount of the Web Is Already AI-Translated Trash, Scientists Determine

vice.com
Emmanuel Dwamena

Research ^ Eng ^ Design @ Leading Technologies
11mo Edited
With the rise of Large Language Models and Natural Language Processing, Paul Azunre and his team at Algorine built a chatbot-like app that translates between languages and dialects spoken in Ghana, e.g. Twi to Ewe, Fante to Ewe, English to Dagomba, and English to Twi. Download it to support his mission of removing language barriers and offering a platform for learning new languages from scratch. Link: https://lnkd.in/gVBp4bJh #LLM #Machinelearning #KhayaApp #Africa #ghana #aistartup ...

Khaya - Apps on Google Play

play.google.com
Siddhartha K Goel

Principal PM, Adobe Experience Platform
2mo Edited
Really impressed to see what's happening at Sarvam_ai.
- Building open models [1] (OpenHathi on HuggingFace) on top of existing open-source LLMs (Llama, Mistral) to tackle Indian-language use cases
- Systematically improving accuracy on tokenization, translation, and conversation
- Doing so via fine-tuning on Indian-language datasets compiled by AI4Bharat [2]
- A very focused and cost-effective way of quickly bringing use cases and workflows that already exist in English, or in the developed world, to the Indian heartland

[1] https://lnkd.in/gbaFMD6w
[2] https://lnkd.in/g8Ymu8Nv

OpenHathi Series: An Approach To Build Bilingual LLMs Frugally

sarvam.ai
Daniel Wilson

Founder and CEO of XRI | Our vision: A world without language barriers
4mo
Today we are releasing into Creative Commons the datasets for three new languages, which will unlock the benefits of AI for their speakers.

Our company has developed a method for guaranteeing complete coverage of a conceptual space based on the linguistic characteristics of individual languages, which reduces the amount of data required to train AI models. After selecting the target domain, we generate ideal sentences with our algorithm using LLMs and then have those sentences translated by native speakers.

In 6 weeks, we collected 8,000 sentences in each language from native speakers (through our data collection app), creating ideal datasets for fine-tuning a machine translation model or LLM. The machine translation models reached Google Translate-level quality in a specific domain. The contributors gave consent, were hired by a local agency, were paid a fair wage, and will now reap the benefits of AI in their local language.

We are collecting similar datasets for 3 more languages right now, with 8 more scheduled in the next 2 months. If you are interested in creating ideal datasets for new languages, reach out to us at contact@xriglobal.ai

https://lnkd.in/eKMPGWqq
https://lnkd.in/e_crhDRN
https://lnkd.in/eUvPvQtK

xri/BatakTobaNMT · Datasets at Hugging Face

huggingface.co

6 Comments
Deon V.

🚀International Editor for Tech Innovation Publications |🏆Award Winning Solution Development | 🤝Brand Ambassador | 📣Founder of Large Communities | 📝Development, Cybersecurity, Data and Automation
6mo
The slippery slope of AI and content. As AI trains on web content, it ingests vast amounts of AI-generated content, which makes the data untrustworthy and full of hallucinations. Additionally, AI translations of this content into other languages should be considered trash. #ai #data #web #tech https://lnkd.in/d_2ZSc44

A ‘Shocking’ Amount of the Web Is Already AI-Translated Trash, Scientists Determine

vice.com
Raj Dabre

Researcher at NICT, Adjunct Faculty at IIT Madras, Visiting Professor at IIT Bombay
8mo
Last week at AACL-IJCNLP 2023, Jay Gala, Pranjal Chitale and I delivered a tutorial on "Massively Multilingual Machine Translation for Related Languages". If you are interested in this but could not attend, we are making everything available: https://lnkd.in/djf46rQa

The GitHub repo contains our slides, the recorded talk, and all the papers we referred to while preparing the tutorial slides. We are happy to present this tutorial again upon request, so please feel free to reach out to us. We hope that this helps motivate further research into language relatedness for massively multilingual machine translation.

A big thanks to Prof Kurohashi for motivating us to submit a tutorial application. Also special thanks to Varun Gumma for their feedback.

This tutorial is part of a series of tutorials on:
a. NMT (https://lnkd.in/dnTbgMPW)
b. Multilingual Machine Translation (https://lnkd.in/d6tepmwu)

Feel free to take a look and reach out if you have any questions.

GitHub - AI4Bharat/aacl23-mnmt-tutorial: Additional resources from our AACL tutorial

github.com

1 Comment
Multiplatform.AI

1,634 followers
6mo
Unveiling the Web's Tower of Babel: Machine Translation's Impact on Low-Resource Languages #AI #AItechnology #artificialintelligence #llm #lowresourcelanguages #machinelearning #MachineTranslationsystems #MultiWayccMatrix #realm #Trainingdata #webscraping

Unveiling the Web's Tower of Babel: Machine Translation's Impact on Low-Resource Languages

https://multiplatform.ai
Toolplate

350 followers
4mo
OLA’s CEO Bhavish Aggarwal introduced India’s first AI unicorn start-up, 𝐊𝐫𝐮𝐭𝐫𝐢𝐦 𝐀𝐈! It is claimed to exceed GPT-4 in Indic language support. Krutrim AI can understand 22 languages and generate results in 10 languages. It lets you select either of 2 languages for conversation.

Some of the extraordinary features of Krutrim AI:
✨ Multilingual Email Writer: Easily write emails in 20 Indian languages.
✨ Tourist Guide: Get a plan with timings for all the places you can travel to.
✨ Creative Content and Entertainment: Put your own spin on poems, stories, songs, jokes, or memes.
✨ Recipe Supplier: Get any recipe, from a cuisine's main dish to dessert.
✨ Language Translator: Translate content into various Indian languages and many others.

Krutrim AI is a Large Language Model for daily communication that produces human-like responses.

Use the free version of Krutrim AI 👉 https://olakrutrim.com/
Learn more about Krutrim AI through Toolplate 👉 https://lnkd.in/ggd9Qexi

#KrutrimAI #OlaAI #AIforGood