Sachin Kumar’s Post

Staff Machine Learning Engineer at Chegg Inc.

RULER: Benchmark to evaluate long-context modeling capabilities of language models

In a recent paper from Nvidia researchers, the authors propose 𝗥𝗨𝗟𝗘𝗥, a new synthetic benchmark to evaluate the long-context modeling capabilities of language models. It contains 𝗳𝗼𝘂𝗿 𝘁𝗮𝘀𝗸 𝗰𝗮𝘁𝗲𝗴𝗼𝗿𝗶𝗲𝘀 that test behaviors beyond simple retrieval from context:

𝗶) 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹: extends the needle-in-a-haystack test to evaluate retrieval capability with diverse types and quantities of needles (see the sketch at the end of this post).

𝗶𝗶) 𝗠𝘂𝗹𝘁𝗶-𝗵𝗼𝗽 𝗧𝗿𝗮𝗰𝗶𝗻𝗴: proposes variable tracking, a novel minimal proxy task for coreference chain resolution, to check the behavior of tracing entities with multi-hop connections.

𝗶𝗶𝗶) 𝗔𝗴𝗴𝗿𝗲𝗴𝗮𝘁𝗶𝗼𝗻: proposes common/frequent words extraction, proxy tasks for summarization, to test the ability to aggregate relevant information that spans long-range context.

𝗶𝘃) 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻 𝗔𝗻𝘀𝘄𝗲𝗿𝗶𝗻𝗴: adds distracting information to the inputs of existing short-context QA datasets to evaluate question answering at various context sizes.

🔒 𝗠𝗼𝘁𝗶𝘃𝗮𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗟𝗶𝗺𝗶𝘁𝗮𝘁𝗶𝗼𝗻𝘀 𝗼𝗳 𝗘𝘅𝗶𝘀𝘁𝗶𝗻𝗴 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀
- A simple retrieval-based test like needle-in-a-haystack indicates only a superficial form of long-context understanding.

🔬 𝗘𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁 𝗦𝗲𝘁𝘂𝗽 𝗳𝗼𝗿 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻
- 10 long-context LLMs were selected: nine open-source models and one closed-source model (GPT-4), covering diverse model sizes (6B up to 8x7B with MoE architecture).
- A weighted average score was used to aggregate model performance across the various context sizes.

⚖️ 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗥𝗲𝘀𝘂𝗹𝘁𝘀
- All models exhibit large degradation on RULER as sequence length increases.
- The best-performing model on RULER is GPT-4: it scores highest at a length of 4K and shows the least, though still non-marginal, degradation (15.4) when the context is extended to 128K.
- The top three ranked open-source models, Command-R, Yi-34B and Mixtral, all use a large base frequency in RoPE and have more parameters than the other models.

🕵️ 𝗠𝗼𝗱𝗲𝗹 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀
- Training with larger context sizes overall leads to better performance, but the ranking can be inconsistent for long sequences.
- Scaling up model size has a positive effect on long-context modeling.
- The non-Transformer architectures evaluated demonstrate significant degradation when the context is extended to 8K, and both underperform the Transformer baseline Llama2-7B by large margins up to a length of 4K.

🏆 For my detailed analysis of this paper, please refer to my blogpost: https://lnkd.in/e5R9B7SC

#largelanguagemodels #generativeai #modelevaluation
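To make the retrieval category concrete, here is a minimal Python sketch of how a multi-key needle-in-a-haystack sample could be constructed. This is my own illustration of the idea, not code from the paper or its repo; the function name, filler text, and "special magic number" template are assumptions for demonstration, and lengths here are counted in words rather than tokens.

import random
import uuid

def make_niah_sample(context_len_words: int, num_needles: int = 4):
    """Hide several key-value 'needles' in filler text and query one of them."""
    # Filler: repeated neutral sentences stand in for real distractor text.
    filler = ("The grass is green. The sky is blue. "
              "The sun is yellow. Here we go.").split()
    haystack = [random.choice(filler) for _ in range(context_len_words)]

    # Needles: unique keys mapped to random values, inserted at random positions.
    needles = {f"key-{uuid.uuid4().hex[:8]}": str(random.randint(100000, 999999))
               for _ in range(num_needles)}
    for key, value in needles.items():
        pos = random.randint(0, len(haystack))
        haystack.insert(pos, f"The special magic number for {key} is {value}.")

    # Query exactly one key; the other needles act as distractors.
    query_key = random.choice(list(needles))
    prompt = (" ".join(haystack) +
              f"\nQuestion: What is the special magic number for {query_key}?")
    return prompt, needles[query_key]

prompt, gold = make_niah_sample(context_len_words=4000, num_needles=4)
# Scoring can then simply check whether `gold` appears in the model's output;
# sweeping context_len_words (e.g., 4K up to 128K) traces the degradation curve.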

RULER: Benchmark to evaluate long-context modeling capabilities of language models

medium.com
