Sachin Kumar’s Post

Staff Machine Learning Engineer at Chegg Inc.

RULER: Benchmark to evaluate long-context modeling capabilities of language models

In a recent paper from Nvidia researchers, the authors propose 𝗥𝗨𝗟𝗘𝗥, a new synthetic benchmark to evaluate the long-context modeling capabilities of language models. It contains 𝗳𝗼𝘂𝗿 𝘁𝗮𝘀𝗸 𝗰𝗮𝘁𝗲𝗴𝗼𝗿𝗶𝗲𝘀 that test behaviors beyond simple retrieval from context:

𝗶) 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹: extends the needle-in-a-haystack test to evaluate retrieval capability with diverse types and quantities of needles (see the sketch at the end of this post).

𝗶𝗶) 𝗠𝘂𝗹𝘁𝗶-𝗵𝗼𝗽 𝗧𝗿𝗮𝗰𝗶𝗻𝗴: proposes variable tracking, a novel minimal proxy task for coreference chain resolution, to check the behavior of tracing entities with multi-hop connections.

𝗶𝗶𝗶) 𝗔𝗴𝗴𝗿𝗲𝗴𝗮𝘁𝗶𝗼𝗻: proposes common/frequent words extraction, proxy tasks for summarization, to test the ability to aggregate relevant information that spans long-range context.

𝗶𝘃) 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻 𝗔𝗻𝘀𝘄𝗲𝗿𝗶𝗻𝗴: adds distracting information to the inputs of existing short-context QA datasets to evaluate question answering at various context sizes.

🔒 𝗠𝗼𝘁𝗶𝘃𝗮𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗟𝗶𝗺𝗶𝘁𝗮𝘁𝗶𝗼𝗻𝘀 𝗼𝗳 𝗘𝘅𝗶𝘀𝘁𝗶𝗻𝗴 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀
- A simple retrieval-based test like needle-in-a-haystack indicates only a superficial form of long-context understanding.

🔬 𝗘𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁 𝗦𝗲𝘁𝘂𝗽 𝗳𝗼𝗿 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻
- 10 long-context LLMs were selected: nine open-source models and one closed-source model (GPT-4), covering diverse model sizes (6B up to 8x7B with MoE architecture).
- A weighted average score was used to aggregate model performance across the various context sizes.

⚖️ 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗥𝗲𝘀𝘂𝗹𝘁𝘀
- All models exhibit large degradation on RULER as sequence length increases.
- The best-performing model on RULER is GPT-4: it scores highest at a length of 4K and shows the least, though still non-marginal, degradation (15.4) when the context is extended to 128K.
- The top three ranked open-source models, Command-R, Yi-34B and Mixtral, all use a large base frequency in RoPE and have more parameters than the other models.

🕵️ 𝗠𝗼𝗱𝗲𝗹 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀
- Training with larger context sizes overall leads to better performance, but the ranking can be inconsistent for long sequences.
- Scaling up model size has a positive effect on long-context modeling.
- The non-Transformer architectures evaluated demonstrate significant degradation when the context is extended to 8K, and both underperform the Transformer baseline Llama2-7B by large margins up to a length of 4K.

🏆 For my detailed analysis of this paper, please refer to my blogpost: https://lnkd.in/e5R9B7SC

#largelanguagemodels #generativeai #modelevaluation
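To make the retrieval category concrete, here is a minimal Python sketch of how a multi-key needle-in-a-haystack sample could be constructed. This is my own illustration of the idea, not code from the paper or its repo; the function name, filler text, and "special magic number" template are assumptions for demonstration, and lengths here are counted in words rather than tokens.

import random
import uuid

def make_niah_sample(context_len_words: int, num_needles: int = 4):
    """Hide several key-value 'needles' in filler text and query one of them."""
    # Filler: repeated neutral sentences stand in for real distractor text.
    filler = ("The grass is green. The sky is blue. "
              "The sun is yellow. Here we go.").split()
    haystack = [random.choice(filler) for _ in range(context_len_words)]

    # Needles: unique keys mapped to random values, inserted at random positions.
    needles = {f"key-{uuid.uuid4().hex[:8]}": str(random.randint(100000, 999999))
               for _ in range(num_needles)}
    for key, value in needles.items():
        pos = random.randint(0, len(haystack))
        haystack.insert(pos, f"The special magic number for {key} is {value}.")

    # Query exactly one key; the other needles act as distractors.
    query_key = random.choice(list(needles))
    prompt = (" ".join(haystack) +
              f"\nQuestion: What is the special magic number for {query_key}?")
    return prompt, needles[query_key]

prompt, gold = make_niah_sample(context_len_words=4000, num_needles=4)
# Scoring can then simply check whether `gold` appears in the model's output;
# sweeping context_len_words (e.g., 4K up to 128K) traces the degradation curve.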

RULER: Benchmark to evaluate long-context modeling capabilities of language models

medium.com
