How Well Can a Long Sequence Model Model Long Sequences? Comparing Architectural Inductive Biases on Long-Context Abilities

Jerry Huang
Mila - Quebec AI Institute
Université de Montréal
jerry.huang@mila.quebec

Abstract

Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous downstream use-cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering and in model design, have enabled the scaling up of models that are purported to support extended context lengths. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect.

1 Introduction

Advances in AI system engineering (Dao et al., 2022; Dao, 2024; Rasley et al., 2020) and design of models (Katharopoulos et al., 2020; Jiang et al., 2023; AI21, 2024) have opened up language models to the broader public for a diverse set of purposes and use-cases. However, Transformer-based architectures (Vaswani et al., 2017) remain bounded in terms of their context windows, as they require fixed-length positional embedding representations (Press et al., 2022; Su et al., 2023; Peng et al., 2024) which cannot be modified a posteriori. With this glaring limitation, linear sequence models (Gu et al., 2022; Gu and Dao, 2024; Orvieto et al., 2023; Qin et al., 2023; Peng et al., 2023; De et al., 2024; Dao and Gu, 2024) have emerged as an alternative that presents a seeming ability to extend to infinite-length contexts in theory while retaining the original benefits of the Transformer related to training-time parallelization.

However, despite the temptation to assert linear sequence models as superior, properly testing for information retention from long-context tasks remains challenging. While some works have attempted to evaluate this ability through long contexts (Shaham et al., 2022; Pang et al., 2022; Dong et al., 2024; Bai et al., 2023; Li et al., 2023; Han et al., 2024), whether or not these tasks truly require the use of long contexts is uncertain, and ascertaining long-context abilities from them is difficult. This has prompted the use of more synthetic tasks (Hsieh et al., 2024), such as needle-in-a-haystack (NIAH) (Kamradt, 2023) and passkey retrieval (Mohtashami and Jaggi, 2023), to better control and evaluate the context sizes of models.

Nevertheless, an outstanding question remains whether or not long-context models can effectively model long contexts. While some works (Gu and Dao, 2024; Fu et al., 2023; Poli et al., 2023; Peng et al., 2024; Team, 2024) purport to be able to extrapolate towards sequences of long length (100k tokens+), further investigation has suggested differently. For example, Hsieh et al. (2024) claim modern LLMs significantly overstate their true context windows on a number of synthetic tasks. Meanwhile, Han et al. (2024) observe models to perform reasonably well on synthetic tasks but struggle on real-world tasks, as do Li et al. (2023). Hence, despite a consistent trend of models behaving underwhelmingly, it remains to be understood why this occurs. Yet one interesting question is whether or not linear sequence models are in fact better suited to such tasks than Transformer-based ones, as has been claimed repeatedly.

To this end, we further analyze the behaviour of sequence models to observe how differently they behave compared to Transformer-based ones. We perform a more extensive study into each type of model, as well as a mixture of both, to better investigate how they perform in principle and how they change in behaviour when extending to longer and longer sequences. On both synthetic and realistic data, we conduct a thorough study and observe:

• All models, whether they use pure sequence layers, attention or a mix, struggle with extrapolating beyond their training context length.
• The ability to extrapolate can vary significantly based on the format of the sequence even if the task remains constant.

These results highlight that long sequence models suffer from significant limitations despite their theoretical soundness, underscoring a need to better understand this striking dissonance between expectation and observation and how to amend it for better long-context understanding and reasoning.

2 Related Work

Efficient Long-Context Models.

Due to the computational bottleneck of attention (Bahdanau et al., 2015) relative to sequence length, significant modifications have been made to overcome this limitation of the Transformer (Child et al., 2019; Katharopoulos et al., 2020; Su et al., 2023), yet such models remain theoretically bounded in terms of their context length. Alternatively, sequence models (Rumelhart et al., 1986; Jordan, 1986; Hochreiter and Schmidhuber, 1997; Cho et al., 2014) originally faced significant issues that limited their application, but recent modifications (Gu et al., 2020, 2021) have led to the prominence of linear sequence models, which are significantly more compute-effective than Transformer-based architectures.

On the Limits of Long Sequence Models.

Due to their more intuitive and interpretable architecture, long/linear sequence models remain easier to analyze when placed in comparison to Transformers. As such, their limitations also become easier to discover and analyze. Vardasbi et al. (2023) first show that SSMs struggle at sequence-to-sequence tasks due to the use of a fixed-size hidden representation which compresses the entire prior context, making it difficult to extract information from the past, a fact further substantiated by Jelassi et al. (2024). Park et al. (2024) additionally demonstrate that these models have difficulty with more complex in-context learning tasks, while Merrill et al. (2024) show them to possess similar limitations in terms of representational power as Transformers (Merrill and Sabharwal, 2023). Waleffe et al. (2024) finally make a comparison between Mamba, Transformers and a hybrid, and observe hybrid models to perform better on long-context tasks, while Mamba2 often trails behind Transformers. These observations thus beg a question: can long sequence models really model long sequences? Given the hints that long sequence models may not always be as they seem, a more formal investigation is necessary. We distinguish ourselves by conducting a more controlled but intricate study which aims to uncover why some of the prior results might occur, which we discuss in the work that follows.

3 Background

Attention and Long Sequences.

Self-attention as used in Transformers is powerful but costly. When provided an embedded text representation as a sequence of tokens $\bm{x}\in\mathbb{R}^{L\times d}$ , each Transformer layer in the network applies a function

$$T_{\ell}(\bm{x})=\text{FF}_{\ell}\big(A_{\ell}(\bm{x})+\bm{x}\big)+A_{\ell}(\bm{x}) \tag{1}$$

where $A_{\ell}$ is the self-attention mechanism of the $\ell$-th layer and $\text{FF}_{\ell}$ is the following feed-forward network (normalization operations are excluded). Self-attention computes, for every position, a weighted average of the feature representations of all other positions with a weight proportional to a similarity score between the representations:

$$\begin{split}&\bm{Q}_{\ell}=\bm{x}\bm{W}_{\ell}^{\bm{Q}}\quad\bm{K}_{\ell}=\bm{x}\bm{W}_{\ell}^{\bm{K}}\quad\bm{V}_{\ell}=\bm{x}\bm{W}_{\ell}^{\bm{V}}\\ &A_{\ell}(\bm{x})=\bm{V}_{\ell}^{\prime}=\text{softmax}\big(\bm{Q}_{\ell}\bm{K}_{\ell}^{T}/\sqrt{d}\big)\bm{V}_{\ell}\end{split} \tag{2}$$

As computing the softmax over all pairwise similarity scores takes $O(L^{2})$ time when applied naively, this limits the ability to process long sequences.
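To make the quadratic cost concrete, the following is a minimal single-head sketch of Equation 2 in plain Python. It is not the implementation used by any model evaluated here: the helper names (`matmul`, `softmax`, `self_attention`) and the toy identity projections are illustrative assumptions, and masking, multiple heads and normalization are left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention (Equation 2), without masking or normalization."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    # The score matrix is L x L: this is the quadratic term in sequence length.
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Toy input: L = 3 tokens, d = 2 features, identity projections (illustrative only).
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
print(len(weights), len(weights[0]))  # one row and one column per token
```

Doubling the number of tokens quadruples the number of entries in `weights`, which is the cost that the sequence models described next aim to avoid.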

Transformers to Sequence Models.

SSMs model a dynamical system, traditionally mapping a 1-D continuous input signal $x(t)\in\mathbb{R}$ to an $n$ -dimensional hidden state $h(t)\in\mathbb{R}^{n}$ that is projected back to a 1-D output $y(t)\in\mathbb{R}$ using:

$$\begin{cases}h^{\prime}(t)&=\bm{A}h(t)+\bm{B}x(t)\\ y(t)&=\bm{C}h(t)+\bm{D}x(t)\end{cases} \tag{3}$$

where $\bm{A}$ , $\bm{B}$ , $\bm{C}$ and $\bm{D}$ are all trainable parameters. Gu et al. (2021) use this paradigm to define a recurrent model to work on discrete signals, in which case the input can be regarded as discretized data sampled from a continuous signal with a step size $\Delta$ , for which the corresponding SSM is defined by:

$$\begin{split}h_{t}&=\overline{\bm{A}}h_{t-1}+\overline{\bm{B}}x_{t}\qquad y_{t}=\overline{\bm{C}}h_{t}+\overline{\bm{D}}x_{t}\\ \overline{\bm{A}}&=\big(I-\Delta\bm{A}/2\big)^{-1}\big(I+\Delta\bm{A}/2\big)\qquad\overline{\bm{B}}=\big(I-\Delta\bm{A}/2\big)^{-1}\Delta\bm{B}\end{split} \tag{4}$$

and $\overline{\bm{C}}=\bm{C}$. (They set $\overline{\bm{D}}=0$, as this term is equivalent to a residual connection.) Thus the output $\bm{y}$ given an input $\bm{x}$ is

$$\begin{split}\overline{\bm{K}}&=\big(\overline{\bm{C}}\,\overline{\bm{B}},\ \overline{\bm{C}}\,\overline{\bm{A}}\,\overline{\bm{B}},\ \dots,\ \overline{\bm{C}}\,\overline{\bm{A}}^{L-1}\overline{\bm{B}}\big)\\ y_{t}&=\sum_{j=0}^{t-1}\overline{\bm{C}}\,\overline{\bm{A}}^{j}\overline{\bm{B}}x_{t-j}=(\overline{\bm{K}}*\bm{x})_{t}\end{split} \tag{5}$$

where $\overline{\bm{K}}$ is the SSM kernel. As $\bm{y}$ can be computed in $O(L\log L)$ with a Fast Fourier Transform (Cormen et al., 2009), the entire output can be computed in parallel from the input, given the matrices that parametrize the system. Gu et al. (2021) use this to overcome issues of parallelization and vanishing gradients (Bengio et al., 1994; Hochreiter et al., 2001; Pascanu et al., 2013) observed in prior recurrent models by:

(1) Removing non-linearities in the recurrence, enabling the efficient pre-computation of $\overline{\bm{K}}$.
(2) Using a special matrix parameterization (Gu et al., 2020) for $\bm{A}$ to memorize the input and eliminate exponential gradient scaling.
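
The equivalence between the recurrence in Equation 4 and the convolution in Equation 5 can be checked numerically. The sketch below assumes a scalar state ($n=1$), a zero initial state, $\overline{\bm{D}}=0$ and arbitrary illustrative parameter values. It uses a direct $O(L^{2})$ summation rather than an FFT, and it is not the parameterization used by any of the models evaluated later.

```python
def discretize_bilinear(a, b, delta):
    """Bilinear discretization of a scalar SSM (scalar case of Equation 4)."""
    denom = 1.0 - delta * a / 2.0
    a_bar = (1.0 + delta * a / 2.0) / denom
    b_bar = delta * b / denom
    return a_bar, b_bar

def run_recurrence(a_bar, b_bar, c, xs):
    """Step-by-step form: h_t = a_bar * h_{t-1} + b_bar * x_t, y_t = c * h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys

def ssm_kernel(a_bar, b_bar, c, length):
    """Kernel entries c * a_bar**j * b_bar for j = 0, ..., length - 1 (Equation 5)."""
    return [c * (a_bar ** j) * b_bar for j in range(length)]

def run_convolution(kernel, xs):
    """Causal convolution: output t sums kernel[j] * xs[t - j] over all past inputs."""
    return [sum(kernel[j] * xs[t - j] for j in range(t + 1)) for t in range(len(xs))]

# Hypothetical parameters chosen only for illustration.
a, b, c, delta = -1.0, 1.0, 0.5, 0.1
xs = [1.0, 0.0, 2.0, -1.0, 0.5]

a_bar, b_bar = discretize_bilinear(a, b, delta)
y_rec = run_recurrence(a_bar, b_bar, c, xs)
y_conv = run_convolution(ssm_kernel(a_bar, b_bar, c, len(xs)), xs)

# Both views of the same linear system agree up to floating-point error.
assert all(abs(u - v) < 1e-9 for u, v in zip(y_rec, y_conv))
```

In this toy setting the discretized transition has magnitude below one, so the kernel entries shrink geometrically with $j$; the whole input history is summarized in a single fixed-size state.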

This has sparked a new wave of recurrent models to compete with Transformers (Orvieto et al., 2023; Qin et al., 2023; De et al., 2024; Beck et al., 2024), with the added benefit of theoretically having longer context sizes that scale more efficiently.

4 Experiments and Results

Datasets.

We conduct an initial evaluation using Ruler (Hsieh et al., 2024), a set of synthetic benchmarks that test long-context information retention, before conducting a more fine-grained evaluation on a general needle-in-the-haystack task. We use this benchmark because it offers more granular control over the exact information that must be retained. Results are measured in terms of accuracy based on exact matching of predicted tokens.
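
A minimal sketch of exact-match scoring follows. It assumes predictions and references are already lists of token IDs and that a prediction counts as correct only when the two lists are identical; the scoring rules actually used by Ruler may differ, and the function name `exact_match_accuracy` is hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of examples whose predicted tokens equal the reference tokens exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(1 for pred, ref in zip(predictions, references) if list(pred) == list(ref))
    return 100.0 * correct / len(references)

# Toy example with made-up token IDs: two of the three predictions match.
preds = [[7, 4, 2], [9, 9], [1, 3]]
refs = [[7, 4, 2], [9, 8], [1, 3]]
print(round(exact_match_accuracy(preds, refs), 2))
```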

Baselines.

Our main objective is to compare how long-sequence models fare on long-context tasks. To this end, we compare models with the same number of parameters that are evenly trained on the same data. Hence we first use Mamba2 (Dao and Gu, 2024), a Transformer variant (Transformer++, abbreviated TPP in the tables) and a hybrid Mamba2Attn (M2A), each with 2.7 billion parameters. We further add Sheared-LLaMA (SL) (Xia et al., 2024) and RecurrentGemma (RG) (Botev et al., 2024) baselines, with and without instruction-tuning (instruction-tuned variants carry the suffix -IT), as same-sized baselines trained under different conditions. We finally add a 3-billion-parameter RWKV (Peng et al., 2023) variant as another sequence model baseline.

Results.

We present initial results on the base set of Ruler tasks (as defined by its original authors) in Table 1. We also present two additional ablation studies. In the first, we use a single needle hidden within a large haystack but modify its relative position within the context. The goal of this ablation, presented in Tables 2 and 3, is to observe how the use of a unified hidden state rather than attention can affect the ability to retain information throughout a long sequence. The second (Table 4) further tests how this information retention may change when the content that is being memorized changes (e.g. numbers versus UUIDs within a haystack of repeated sentences or essays).
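
The position ablation can be pictured with a small haystack builder. This is a sketch under stated assumptions, not the Ruler implementation: the filler sentence, the needle template, the question and the function name `insert_needle` are all hypothetical, and depth is measured in filler sentences rather than tokens.

```python
def insert_needle(filler, needle, num_sentences, depth_percent):
    """Place `needle` among `num_sentences` copies of `filler` at a relative depth.

    depth_percent = 0 puts the needle first, 100 puts it last.
    """
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    index = round(num_sentences * depth_percent / 100)
    sentences = [filler] * num_sentences
    sentences.insert(index, needle)
    return sentences

# Hypothetical filler, needle and question; the key word and value are made up.
filler = "The grass is green."
needle = "The special number for apple is 4821."
question = "What is the special number for apple?"

for depth in (0, 50, 100):
    sentences = insert_needle(filler, needle, num_sentences=10, depth_percent=depth)
    prompt = " ".join(sentences) + "\n" + question
    print(depth, sentences.index(needle), len(prompt))
```

Changing only `filler` (repeated sentence versus essay text) or the value inside `needle` (number versus UUID) while keeping the template fixed corresponds to the kind of variation explored in the second ablation.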

Length	1K	2K	4K	8K	16K	Average
Mamba2	38.52	32.91	12.98	6.51	0.1	18.2
M2A	39.14	30.43	12.89	7.8	3.49	18.75
TPP	46.61	36.74	0.31	0.06	0.03	16.75
RG	78.82	71.72	22.45	11.21	6.29	38.1
SL	84.38	69.89	58.37	0.0	0.0	42.53
RWKV	68.09	55.27	37.47	23.73	13.81	39.67
RG-IT	85.64	79.45	44.33	24.19	14.18	49.56
SL-IT	86.22	77.54	74.25	0.0	0.0	47.6

Table 1: Results on Ruler. Accuracy is aggregated across several tasks for each model and context length. Context length for which each model was trained is underlined. Best performing models are bolded.

Position	0	20	40	50	60	80	100	Avg
Mamba2	59.07	31.47	33.07	39.07	40.0	31.33	66.0	42.63
M2A	40.27	36.53	30.27	29.33	29.33	35.07	37.2	35.26
TPP	53.33	33.47	22.8	26.27	31.33	35.07	55.73	35.64
RG	100.0	100.0	100.0	100.0	100.0	100.0	99.47	99.92
SL	99.6	99.6	100.0	100.0	100.0	100.0	100.0	99.89
RWKV	82.4	100.0	100.0	80.27	100.0	100.0	100.0	94.67
RG-IT	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0
SL-IT	98.27	99.6	100.0	100.0	100.0	100.0	99.73	99.66

Table 2: Results on needle-in-a-haystack task where the position of a single needle is at a fixed depth within the haystack. Context length is set to the maximum on which the models were trained.

Position	0	20	40	50	60	80	100	Avg
Mamba2	26.8	19.6	17.73	18.93	18.93	20.13	21.87	21.03
M2A	38.8	26.27	18.93	28.8	10.13	21.6	66.67	27.07
TPP	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
RG	0.0	0.0	0.0	99.87	100.0	100.0	96.27	56.59
SL	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
RWKV	33.47	99.6	100.0	36.53	100.0	100.0	100.0	81.37
RG-IT	0.0	0.0	0.0	100.0	99.6	100.0	99.73	57.05
SL-IT	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

Table 3: Same setting as Table 2, with context length set to twice the maximum training length.

Model	Context Length	Essay-Word-Num (0)	Essay-Word-Num (50)	Essay-Word-Num (100)	Essay-Word-UUID (0)	Essay-Word-UUID (50)	Essay-Word-UUID (100)	Repeat-Word-Num (0)	Repeat-Word-Num (50)	Repeat-Word-Num (100)
Mamba2	1024	86.0	73.6	82.0	78.0	70.8	80.8	77.6	70.4	55.2
Mamba2	2048	45.6	20.8	65.2	49.6	20.4	66.0	82.0	76.0	66.8
Mamba2	4096	0.0	0.0	0.0	0.0	0.0	0.0	80.4	56.8	65.6
M2A	1024	37.2	28.0	48.0	39.2	26.8	48.0	47.2	44.4	70.0
M2A	2048	41.6	27.6	39.6	42.4	28.4	30.8	36.8	32.0	63.2
M2A	4096	29.2	25.6	59.2	27.6	28.0	58.0	59.6	32.8	82.8
TPP	1024	52.0	36.0	47.6	58.8	34.4	50.4	81.6	33.2	58.4
TPP	2048	51.6	29.6	62.4	44.8	36.0	55.6	63.6	13.2	49.2
TPP	4096	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

Table 4: Results on needle-in-a-haystack task where a single needle is placed at the beginning (0), middle (50) or end (100) of the haystack while the type of each component varies. Context length is set to the maximum on which the models were trained.

5 Discussion

All models have limits.

Our first observation is that regardless of the model, performance drops steeply upon testing with sequences that are longer than what the model was initially trained on. This is made clear in Table 1, where the decline in performance is greatest once the evaluated sequences are longer than the training context (with the mild exception of RWKV, which demonstrates approximately linear degradation as the sequences progressively double in length). However, an important observation is that linear sequence models do appear to extrapolate slightly better than pure-attention models, whose performance drops to near 0 upon the increase, as the former still show non-trivial accuracy even when evaluated on the longer sequences. This distinction is less clear when comparing pure linear sequence models with hybrid models, which alternate between sequence-model layers and attention layers, as there is no explicit pattern as to when one class will perform better on one length or another.
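
The size of this drop can be read directly off Table 1. The snippet below copies three rows from that table and compares each model's accuracy at 2K with its accuracy at 4K. Treating the 2K-to-4K step as the boundary of interest for these three models is an assumption made for illustration, based on where their accuracy falls most sharply; the helper `retained_fraction` is hypothetical.

```python
# Accuracy (%) copied from Table 1 for context lengths 1K, 2K, 4K, 8K, 16K.
lengths = ["1K", "2K", "4K", "8K", "16K"]
table1 = {
    "Mamba2": [38.52, 32.91, 12.98, 6.51, 0.1],
    "M2A": [39.14, 30.43, 12.89, 7.8, 3.49],
    "TPP": [46.61, 36.74, 0.31, 0.06, 0.03],
}

def retained_fraction(scores, before, after):
    """Fraction of accuracy kept when moving from one context length to another."""
    i, j = lengths.index(before), lengths.index(after)
    return scores[j] / scores[i]

for model, scores in table1.items():
    kept = retained_fraction(scores, "2K", "4K")
    print(f"{model}: keeps {kept:.1%} of its 2K accuracy at 4K")
```

Under this reading, Mamba2 and the hybrid M2A keep roughly 40% of their 2K accuracy at 4K, while the pure-attention TPP keeps under 1%.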

Being lost in the middle is a common event.

Being lost in the middle, whereby models have difficulty recalling relevant information positionally located in the middle of long contexts (Liu et al., 2024), has been observed as a common limitation among attention-based models. In Table 2, this appears to be a common feature among all models we test, as all classes of models see increasing drops in performance as the information is more closely located at the center of the sequence. This suggests that despite their long-context modelling ability, recurrent models cannot effectively reason over their entire context window when prompted. However, when extending past the training context length (Table 3), there is less of a consistent pattern. In particular, while Mamba models still appear lost in the middle, other recurrent models such as RecurrentGemma and RWKV have no clear depth-performance trends, further bringing into question their general long-context modelling abilities.
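
One simple way to quantify this effect is to compare accuracy at the edges of the context with accuracy at the center. The sketch below applies that comparison to three rows copied from Table 2; the `edge_minus_middle` statistic is an illustrative choice and not a metric reported in the experiments.

```python
# Accuracy (%) copied from Table 2 for needle depths 0, 20, 40, 50, 60, 80, 100.
depths = [0, 20, 40, 50, 60, 80, 100]
table2 = {
    "Mamba2": [59.07, 31.47, 33.07, 39.07, 40.0, 31.33, 66.0],
    "M2A": [40.27, 36.53, 30.27, 29.33, 29.33, 35.07, 37.2],
    "TPP": [53.33, 33.47, 22.8, 26.27, 31.33, 35.07, 55.73],
}

def edge_minus_middle(scores):
    """Mean accuracy at depths 0 and 100 minus accuracy at depth 50."""
    by_depth = dict(zip(depths, scores))
    edges = (by_depth[0] + by_depth[100]) / 2
    return edges - by_depth[50]

for model, scores in table2.items():
    print(f"{model}: {edge_minus_middle(scores):.1f} points")
```

A positive value means the model retrieves needles near the edges more reliably than needles at the center, which is the case for all three rows used here.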

Extrapolation can be inconsistent.

Furthermore, extrapolation can be inconsistent based on characteristics of the model as well as the data. In Table 4, we can first note that depending on the data format of the haystack, key and value to be retrieved, the performance of each model can vary significantly even if we use the same task template, context length and needle position. Furthermore, extrapolation can vary significantly based on the model as these characteristics change. For example, pure sequence layers (Mamba2) appear to only extrapolate when the haystack is a repeated sequence and the retrieved value is a number related to a key word. Upon changing the haystack to be essays, extrapolation craters and the model fails. An equally trained hybrid model (M2A) can meanwhile always extrapolate to some degree, but its performance on sequences up to the training context length appears much worse in comparison. Pure attention (TPP) meanwhile performs favorably only when evaluated on the exact training context length under specific data formats, but otherwise underwhelms.
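
The dependence on data format can be summarized from the longest context length listed in Table 4. The sketch below copies the 4096-token rows and reports, for each model, the gap between its best and worst mean accuracy across the three formats; the `format_spread` statistic is an illustrative assumption rather than a reported metric.

```python
# Accuracy (%) at context length 4096, copied from Table 4 (needle depths 0, 50, 100).
table4_at_4096 = {
    "Mamba2": {
        "Essay-Word-Num": [0.0, 0.0, 0.0],
        "Essay-Word-UUID": [0.0, 0.0, 0.0],
        "Repeat-Word-Num": [80.4, 56.8, 65.6],
    },
    "M2A": {
        "Essay-Word-Num": [29.2, 25.6, 59.2],
        "Essay-Word-UUID": [27.6, 28.0, 58.0],
        "Repeat-Word-Num": [59.6, 32.8, 82.8],
    },
    "TPP": {
        "Essay-Word-Num": [0.0, 0.0, 0.0],
        "Essay-Word-UUID": [0.0, 0.0, 0.0],
        "Repeat-Word-Num": [0.0, 0.0, 0.0],
    },
}

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def format_spread(results):
    """Gap between the best and worst mean accuracy across data formats."""
    means = [mean(scores) for scores in results.values()]
    return max(means) - min(means)

for model, results in table4_at_4096.items():
    print(f"{model}: spread of {format_spread(results):.1f} points across formats")
```

A large spread (as for Mamba2) indicates that success at this length depends heavily on the haystack format, whereas a spread of zero (as for TPP) here reflects uniform failure rather than robustness.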

6 Conclusion

In this work, we conduct a comprehensive comparison between long-sequence models and attention-based language models, showing that while the long-context abilities of such sequence models may hold from a theoretical perspective, they empirically still struggle in comparison to models that make no such guarantees. This highlights the need to improve long-sequence reasoning abilities not only for Transformer-based LLMs, but also for SSMs and new classes of RNNs, which hopefully can serve as motivation to further analyze this topic.

7 Limitations

We limit ourselves to a model size at which it is easy to compare models of various paradigms. As such, some perhaps more powerful models are not explored, as the analysis between such models can become difficult due to multiple additional changing variables that could lead to incorrect or undersupported claims.

8 Ethical Concerns

This paper discusses how different types of language models behave on long-context data. It follows that mistakes in our methodology (both experimental and analytical) could lead to unsupported confidence or skepticism about LLMs. Though neither is unethical, unsupported confidence can be very dangerous. However, given that the overall claim is that LLMs should not be assumed to support context lengths that extend beyond what they were trained on, regardless of their training data, we do not think this paper in itself could be misinterpreted for particularly dangerous outcomes.

As for model choices, we use publicly available models where the license agreements do not restrict what we can say about the model. This should give the reader confidence that our views are unbiased. This is unlike ChatGPT or GPT4, which include an unrestricted indemnity-clause in their license agreement, which could make us financially liable for damages.

9 Acknowledgements

JH is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Canada Graduate Scholarship, a Fonds de Recherche du Québec Nature et technologies (FRQNT) Training Scholarship and a Hydro-Québec Excellence Scholarship. SC is supported by a Canada CIFAR AI Chair, the Canada Research Chair in Lifelong Machine Learning and an NSERC Discovery Grant. The experiments were in part enabled by computational resources provided by Calcul Québec (calculquebec.ca) and Mila.

References

AI21 (2024) AI21. 2024. Introducing jamba: Ai21’s groundbreaking ssm-transformer model.
Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
Bai et al. (2023) Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2023. Longbench: A bilingual, multitask benchmark for long context understanding. Preprint, arXiv:2308.14508.
Beck et al. (2024) Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. 2024. xlstm: Extended long short-term memory. Preprint, arXiv:2405.04517.
Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
Botev et al. (2024) Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz Gustavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, and Nando de Freitas. 2024. Recurrentgemma: Moving past transformers for efficient open language models. Preprint, arXiv:2404.07839.
Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. Preprint, arXiv:1904.10509.
Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
Cormen et al. (2009) Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms, 3rd edition. The MIT Press.
Dao (2024) Tri Dao. 2024. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations.
Dao et al. (2022) Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems.
Dao and Gu (2024) Tri Dao and Albert Gu. 2024. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning.
De et al. (2024) Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, and Caglar Gulcehre. 2024. Griffin: Mixing gated linear recurrences with local attention for efficient language models. Preprint, arXiv:2402.19427.
Dong et al. (2024) Zican Dong, Tianyi Tang, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen. 2024. BAMBOO: A comprehensive benchmark for evaluating long text modeling capacities of large language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2086–2099, Torino, Italia. ELRA and ICCL.
Fu et al. (2023) Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. 2023. Hungry hungry hippos: Towards language modeling with state space models. In International Conference on Learning Representations.
Gu and Dao (2024) Albert Gu and Tri Dao. 2024. Mamba: Linear-time sequence modeling with selective state spaces.
Gu et al. (2020) Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. 2020. Hippo: Recurrent memory with optimal polynomial projections. In Advances in Neural Information Processing Systems, volume 33, pages 1474–1487. Curran Associates, Inc.
Gu et al. (2022) Albert Gu, Karan Goel, and Christopher Ré. 2022. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations.
Gu et al. (2021) Albert Gu, Isys Johnson, Karan Goel, Khaled Kamal Saab, Tri Dao, Atri Rudra, and Christopher Re. 2021. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In Advances in Neural Information Processing Systems.
Han et al. (2024) Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. 2024. LM-infinite: Simple on-the-fly length generalization for large language models.
Hochreiter et al. (2001) Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
Hsieh et al. (2024) Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. Ruler: What’s the real context size of your long-context language models? Preprint, arXiv:2404.06654.
Jelassi et al. (2024) Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and Eran Malach. 2024. Repeat after me: Transformers are better than state space models at copying. In Forty-first International Conference on Machine Learning.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
Jordan (1986) Michael I. Jordan. 1986. Serial order: a parallel distributed processing approach. Technical report, University of California, San Diego: Institute for Cognitive Science.
Kamradt (2023) Gregory Kamradt. 2023. Needle In A Haystack - pressure testing LLMs. Github.
Katharopoulos et al. (2020) Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR.
Li et al. (2023) Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. 2023. Loogle: Can long-context language models understand long contexts? Preprint, arXiv:2311.04939.
Liu et al. (2024) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12:157–173.
Merrill et al. (2024) William Merrill, Jackson Petty, and Ashish Sabharwal. 2024. The illusion of state in state-space models. In Forty-first International Conference on Machine Learning.
Merrill and Sabharwal (2023) William Merrill and Ashish Sabharwal. 2023. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545.
Mohtashami and Jaggi (2023) Amirkeivan Mohtashami and Martin Jaggi. 2023. Landmark attention: Random-access infinite context length for transformers. In Workshop on Efficient Systems for Foundation Models @ ICML2023.
Orvieto et al. (2023) Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. 2023. Resurrecting recurrent neural networks for long sequences. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 26670–26698. PMLR.
Pang et al. (2022) Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel Bowman. 2022. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5336–5358, Seattle, United States. Association for Computational Linguistics.
Park et al. (2024) Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. 2024. Can mamba learn how to learn? a comparative study on in-context learning tasks. In Forty-first International Conference on Machine Learning.
Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA. PMLR.
Peng et al. (2023) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Johan Wind, Stanisław Woźniak, Zhenyuan Zhang, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. 2023. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore. Association for Computational Linguistics.
Peng et al. (2024) Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2024. YaRN: Efficient context window extension of large language models. In International Conference on Learning Representations.
Poli et al. (2023) Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. 2023. Hyena hierarchy: Towards larger convolutional language models. In Fortieth International Conference on Machine Learning.
Press et al. (2022) Ofir Press, Noah Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations.
Qin et al. (2023) Zhen Qin, Songlin Yang, and Yiran Zhong. 2023. Hierarchically gated recurrent neural network for sequence modeling. In Thirty-seventh Conference on Neural Information Processing Systems.
Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pages 3505–3506, New York, NY, USA. Association for Computing Machinery.
Rumelhart et al. (1986) David E. Rumelhart, James L. McClelland, and PDP Research Group. 1986. Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations. The MIT Press.
Shaham et al. (2022) Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. 2022. SCROLLS: Standardized CompaRison over long language sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 12007–12021, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Su et al. (2023) Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. RoFormer: Enhanced transformer with rotary position embedding. Preprint, arXiv:2104.09864.
Team (2024) Qwen Team. 2024. Qwen2 technical report.
Vardasbi et al. (2023) Ali Vardasbi, Telmo Pessoa Pires, Robin Schmidt, and Stephan Peitz. 2023. State spaces aren’t enough: Machine translation needs attention. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 205–216, Tampere, Finland. European Association for Machine Translation.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Waleffe et al. (2024) Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, and Bryan Catanzaro. 2024. An empirical study of Mamba-based language models. Preprint, arXiv:2406.07887.
Xia et al. (2024) Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. 2024. Sheared LLaMA: Accelerating language model pre-training via structured pruning. In The Twelfth International Conference on Learning Representations.

Appendix A Technical Implementation Details

A.1 Models Used

Model	Public Link	HuggingFace Model
Mamba2	state-spaces/mamba2-2.7b	✘
Mamba2Attention	state-spaces/mamba2attn-2.7b	✘
Transformer++	state-spaces/transformerpp-2.7b	✘
RWKV	RWKV/rwkv-6-world-3b-v2.1	✔
Sheared-LLaMA	princeton-nlp/Sheared-LLaMA-2.7B	✔
Sheared-LLaMA-ShareGPT	princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT	✔
RecurrentGemma-2B	google/recurrentgemma-2b	✔
RecurrentGemma-2B-IT	google/recurrentgemma-2b-it	✔

Table 5: Models used and public links to their weights.
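For convenience, the checkpoints in Table 5 can be collected into a small registry. The sketch below is illustrative rather than the code used for the experiments: the names `MODELS`, `hf_compatible`, and `load_for_generation` are hypothetical, and it assumes that the ✔ flag means the checkpoint can be loaded through the `transformers` Auto classes. Checkpoints marked ✘ are not handled here.

```python
# Repository IDs copied from Table 5, paired with the table's
# "HuggingFace Model" flag (True for a check mark, False for a cross).
MODELS = {
    "Mamba2": ("state-spaces/mamba2-2.7b", False),
    "Mamba2Attention": ("state-spaces/mamba2attn-2.7b", False),
    "Transformer++": ("state-spaces/transformerpp-2.7b", False),
    "RWKV": ("RWKV/rwkv-6-world-3b-v2.1", True),
    "Sheared-LLaMA": ("princeton-nlp/Sheared-LLaMA-2.7B", True),
    "Sheared-LLaMA-ShareGPT": ("princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT", True),
    "RecurrentGemma-2B": ("google/recurrentgemma-2b", True),
    "RecurrentGemma-2B-IT": ("google/recurrentgemma-2b-it", True),
}


def hf_compatible(models=MODELS):
    """Return the repository IDs flagged as HuggingFace models, in table order."""
    return [repo_id for repo_id, is_hf in models.values() if is_hf]


def load_for_generation(repo_id):
    """Hypothetical helper: load a flagged checkpoint with `transformers`.

    Assumption: the default Auto classes are sufficient. Some repositories
    may need extra arguments (for example trust_remote_code=True) or gated
    access; this has not been verified here.
    """
    # Imported lazily so the registry is usable without `transformers`.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


if __name__ == "__main__":
    for repo_id in hf_compatible():
        print(repo_id)
```

Calling `load_for_generation` downloads the weights, so it is left out of the demonstration at the bottom, which only lists the repository IDs.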

A.2 Computing Resources Used

All experiments were conducted on a single NVIDIA A100 80GB SXM GPU with 6 CPU worker cores, using PyTorch 2.2.0 and CUDA 11.8.

Appendix B Ruler Task Results

Length	1K	2K	4K	8K	16K	Average
Mamba2	66.8	71.6	60.0	62.4	0.0	52.16
M2A	58.0	36.4	43.2	18.4	0.0	31.2
TPP	40.4	24.8	0.0	0.0	0.0	13.04
RG	100.0	100.0	52.0	24.8	10.0	57.36
SL	100.0	100.0	100.0	0.0	0.0	60.0
RWKV	100.0	100.0	100.0	100.0	54.4	90.88
RG-IT	100.0	100.0	51.6	28.8	16.4	59.36
SL-IT	100.0	100.0	100.0	0.0	0.0	60.0

Table 6: Results on niah_single_1 task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	62.4	60.4	0.0	0.0	0.0	24.56
M2A	33.2	34.8	9.6	4.8	0.0	16.48
TPP	50.8	48.0	0.0	0.0	0.0	19.76
RG	100.0	100.0	36.4	16.8	2.8	51.2
SL	99.6	99.6	100.0	0.0	0.0	59.84
RWKV	100.0	100.0	53.6	30.4	9.6	58.72
RG-IT	100.0	100.0	55.2	24.4	12.8	58.48
SL-IT	100.0	100.0	100.0	0.0	0.0	60.0

Table 7: Results on niah_single_2 task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	52.0	61.6	0.0	0.0	0.0	22.72
M2A	38.8	32.4	2.8	6.4	0.0	16.08
TPP	64.4	53.2	0.0	0.0	0.0	23.52
RG	100.0	100.0	39.2	16.8	8.4	52.88
SL	100.0	100.0	96.4	0.0	0.0	59.28
RWKV	99.2	96.4	15.2	19.6	4.4	46.96
RG-IT	100.0	100.0	53.6	24.0	13.6	58.24
SL-IT	100.0	99.6	99.6	0.0	0.0	59.84

Table 8: Results on niah_single_3 task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	25.6	23.6	0.0	0.0	0.0	9.84
M2A	21.2	16.4	5.2	1.2	0.0	8.8
TPP	50.0	34.4	0.0	0.0	0.0	16.88
RG	98.8	98.8	23.2	15.6	4.4	48.16
SL	99.2	100.0	94.0	0.0	0.0	58.64
RWKV	81.6	64.0	30.4	18.0	11.2	41.04
RG-IT	99.2	100.0	36.8	17.6	11.2	52.96
SL-IT	99.6	99.2	98.0	0.0	0.0	59.36

Table 9: Results on niah_multikey_1 task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	4.8	2.0	0.0	0.0	0.0	1.36
M2A	17.2	7.6	0.4	0.0	0.0	5.04
TPP	60.0	36.4	0.0	0.0	0.0	19.28
RG	98.0	94.8	8.4	2.4	1.6	41.04
SL	95.2	86.8	53.6	0.0	0.0	47.12
RWKV	20.4	4.0	0.8	0.4	0.0	5.12
RG-IT	100.0	98.0	43.6	27.2	9.6	55.68
SL-IT	97.6	96.0	78.8	0.0	0.0	54.48

Table 10: Results on niah_multikey_2 task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	14.4	2.4	0.0	0.0	0.0	3.36
M2A	17.6	12.4	0.0	0.0	0.0	6.0
TPP	61.2	56.4	0.0	0.0	0.0	23.52
RG	74.8	58.8	7.2	2.8	1.6	29.04
SL	96.4	46.4	38.8	0.0	0.0	36.32
RWKV	14.8	1.6	0.4	0.0	0.0	3.36
RG-IT	88.0	92.0	16.0	14.0	1.6	42.32
SL-IT	85.6	63.2	59.2	0.0	0.0	41.6

Table 11: Results on niah_multikey_3 task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	34.9	26.6	0.0	0.0	0.0	12.3
M2A	48.8	33.5	1.3	0.1	0.0	16.74
TPP	42.3	31.1	0.0	0.0	0.0	14.68
RG	97.4	95.1	14.7	3.3	3.0	42.7
SL	100.0	82.5	44.0	0.0	0.0	45.3
RWKV	96.5	87.0	57.2	10.8	5.2	51.34
RG-IT	96.7	87.6	41.8	22.0	11.3	51.88
SL-IT	100.0	87.5	77.2	0.0	0.0	52.94

Table 12: Results on niah_multivalue task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	39.1	39.2	0.0	0.0	0.0	15.66
M2A	54.4	37.5	1.6	0.0	0.0	18.7
TPP	44.4	34.8	0.0	0.0	0.0	15.84
RG	99.5	99.7	4.7	2.8	2.8	41.9
SL	98.8	80.8	45.6	0.0	0.0	45.04
RWKV	94.3	80.7	38.4	9.3	2.4	45.02
RG-IT	97.8	97.9	48.5	21.1	11.4	55.34
SL-IT	98.4	94.7	85.9	0.0	0.0	55.8

Table 13: Results on niah_multiquery task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	69.12	36.64	35.2	20.72	0.0	32.34
M2A	78.24	56.88	9.6	1.76	0.56	29.41
TPP	40.88	21.12	0.0	0.0	0.0	12.4
RG	98.0	75.52	0.0	0.0	0.0	34.7
SL	98.16	81.68	19.36	0.0	0.0	39.84
RWKV	68.56	47.76	20.08	6.88	10.95	30.85
RG-IT	84.24	79.36	50.4	31.76	19.92	53.14
SL-IT	93.68	76.88	42.32	0.0	0.0	42.58

Table 14: Results on vt task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	28.52	14.72	4.08	0.16	0.12	9.52
M2A	26.48	15.24	3.04	5.92	0.8	10.3
TPP	30.32	17.8	0.64	0.0	0.04	9.76
RG	48.6	21.32	42.88	17.24	4.24	26.86
SL	71.2	25.32	55.24	0.0	0.04	30.36
RWKV	57.08	3.24	45.0	14.84	1.92	24.42
RG-IT	55.4	4.56	17.4	3.24	0.2	16.16
SL-IT	78.96	18.64	57.2	0.0	0.0	30.96

Table 15: Results on cwe task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	57.87	44.67	40.67	0.53	0.0	28.75
M2A	59.73	53.73	58.0	52.0	39.6	52.61
TPP	59.6	56.4	0.13	0.0	0.0	23.23
RG	56.0	53.87	7.6	15.6	17.33	30.08
SL	72.0	38.67	45.07	0.0	0.0	31.15
RWKV	74.67	67.47	68.0	56.67	43.42	62.05
RG-IT	80.8	67.87	69.73	64.8	50.67	66.77
SL-IT	78.67	78.27	73.07	0.0	0.0	46.0

Table 16: Results on fwe task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	25.2	24.4	18.4	0.0	0.4	13.68
M2A	33.6	35.6	18.0	6.8	3.6	19.52
TPP	37.2	36.4	2.8	0.8	0.4	15.52
RG	26.8	15.6	31.2	6.8	8.8	17.84
SL	41.6	37.2	37.2	0.0	0.0	23.2
RWKV	46.4	35.6	30.8	21.2	18.4	30.48
RG-IT	74.0	66.8	58.4	10.0	9.6	43.76
SL-IT	54.4	56.4	55.6	0.0	0.0	33.28

Table 17: Results on qa_1 task of Ruler.

Length	1K	2K	4K	8K	16K	Average
Mamba2	20.0	20.0	10.4	0.8	0.8	10.4
M2A	21.6	23.2	14.8	4.0	0.8	12.88
TPP	24.4	26.8	0.4	0.0	0.0	10.32
RG	26.8	18.8	24.4	20.8	16.8	21.52
SL	24.8	29.6	29.6	0.0	0.0	16.8
RWKV	31.6	30.8	27.2	20.4	17.6	25.52
RG-IT	37.2	38.8	33.2	25.6	16.0	30.16
SL-IT	34.0	37.6	38.4	0.0	0.0	22.0

Table 18: Results on qa_2 task of Ruler.
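The Average column in Tables 6 to 18 is consistent with an unweighted mean of the five per-length scores, rounded to two decimal places. The sketch below checks this reading against three rows of Table 6; the helper `row_average` and the constant names are hypothetical, and the rounding rule is an assumption inferred from the reported values rather than something stated in the text.

```python
# Context lengths evaluated in the Ruler tables, in column order.
LENGTHS = ("1K", "2K", "4K", "8K", "16K")


def row_average(scores, ndigits=2):
    """Unweighted mean of one table row, rounded to `ndigits` decimals."""
    if len(scores) != len(LENGTHS):
        raise ValueError(f"expected {len(LENGTHS)} scores, got {len(scores)}")
    return round(sum(scores) / len(scores), ndigits)


# Per-length scores and reported averages copied from Table 6 (niah_single_1).
NIAH_SINGLE_1 = {
    "Mamba2": ([66.8, 71.6, 60.0, 62.4, 0.0], 52.16),
    "SL": ([100.0, 100.0, 100.0, 0.0, 0.0], 60.0),
    "RWKV": ([100.0, 100.0, 100.0, 100.0, 54.4], 90.88),
}

if __name__ == "__main__":
    for model, (scores, reported) in NIAH_SINGLE_1.items():
        computed = row_average(scores)
        print(f"{model}: computed={computed} reported={reported}")
```

Because every length contributes equally, a model that scores 0.0 beyond its training context (such as SL at 8K and 16K) is capped at an average of 60.0 even with perfect scores at shorter lengths.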