When LLMs Play the Telephone Game:
Cumulative Changes and Attractors
in Iterated Cultural Transmissions

Jérémy Perez (corresponding author: jeremy.perez@inria.fr), Inria, Flowers team, Université de Bordeaux, France
Corentin Léger, Inria, Flowers team, Université de Bordeaux, France
Grgur Kovač, Inria, Flowers team, Université de Bordeaux, France
Cédric Colas, Inria, Flowers team, Université de Bordeaux, France; MIT, Computational Cognitive Science Lab, Cambridge, MA, USA
Gaia Molinaro, Inria, Flowers team, Université de Bordeaux, France; Department of Psychology, University of California, Berkeley, Berkeley, CA, USA
Maxime Derex, Institute for Advanced Study in Toulouse, Toulouse, France
Pierre-Yves Oudeyer, Inria, Flowers team, Université de Bordeaux, France
Clément Moulin-Frier, Inria, Flowers team, Université de Bordeaux, France

Abstract

As large language models (LLMs) start interacting with each other and generating an increasing amount of text online, it becomes crucial to better understand how information is transformed as it passes from one LLM to the next. While significant research has examined individual LLM behaviors, existing studies have largely overlooked the collective behaviors and information distortions arising from iterated LLM interactions. Small biases, negligible at the single output level, risk being amplified in iterated interactions, potentially leading the content to evolve towards attractor states. In a series of telephone game experiments, we apply a transmission chain design borrowed from the human cultural evolution literature: LLM agents iteratively receive, produce, and transmit texts from the previous to the next agent in the chain. By tracking the evolution of text toxicity, positivity, difficulty, and length across transmission chains, we uncover the existence of biases and attractors, and study their dependence on the initial text, the instructions, language model, and model size. For instance, we find that more open-ended instructions lead to stronger attraction effects compared to more constrained tasks. We also find that different text properties display different sensitivity to attraction effects, with toxicity leading to stronger attractors than length. These findings highlight the importance of accounting for multi-step transmission dynamics and represent a first step towards a more comprehensive understanding of LLM cultural dynamics.

1 Introduction

Large language models (LLMs) are playing an increasingly significant role in the production of media content across various domains [10]. They are being used for academic writing [12, 2, 16, 24, 37, 40, 14], journalism [55], story generation [70, 78, 79], as chatbots on social media [61] and in the workplace at large [11, 27]. As LLMs become more performant, controllable, and more widespread, their impact on the creation and dissemination of information content humans consume is expected to grow even further [10].

Given the implications of LLM usage for the production of cultural information, numerous researchers have studied the properties of LLM-generated content. In these studies, LLMs showed several biases with respect to gender [1, 62, 72, 30, 25], race [57], values [5, 66], politics [48, 3, 30], authority, fallacy oversight, and beauty [15]. They were also found to generate texts that are at least as attractive [40] and compelling [68] as those written by humans, and to display similar cognitive biases [26].

Figure 1: The transmission chain experimental design. (a) Single-turn transmission: an LLM agent receives a human-generated input text (e.g. a story) and a task (e.g. “rephrase the text”) and generates an output text. (b) Multi-turn transmission: a chain of LLM agents is given the same task, with the first agent receiving an initial text and subsequent agents receiving the output of the preceding agent. Measures of toxicity, positivity, difficulty, and length are recorded at each step of the chain.

While single-turn behaviors of LLMs prompted with a human-generated prompt are under active scrutiny, little is known about the effect of iterated interactions. Indeed, LLMs often use existing cultural content to generate new ones, for example when writing scientific reviews [37, 2] or generating new stories based on existing examples [78]. As the share of AI-generated content increases, LLMs will be producing outputs using content that is already LLM-generated, making it crucial to study the consequence of this iterative process. Moreover, LLMs are increasingly being used in multi-agent settings [20, 51, 52, 77, 34, 71, 17] and are already interacting with one another as chatbots on social media (https://chirper.ai/).

Nonetheless, little is known about how populations of LLMs might self-organize, as other complex systems do. Research in complex systems traditionally studies how global-level patterns emerge from local interactions [29, 43]. Here, we ask whether multi-turn behaviors conditioned on LLM-generated content cause the appearance of new kinds of biases, undetectable in single-turn behaviors but accumulating across iterated interactions.

To address this question, we take inspiration from the cultural evolution literature. The field of cultural evolution aims to provide causal explanations for the change of culture (defined as socially inherited information) over time. In particular, we draw insights from a research tradition called cultural attraction theory (CAT) [67, 47, 44]. CAT aims to determine how non-random transformations of cultural information during transmission events may lead to the evolution of progressively more stable forms, referred to as attractors. Although the precise conceptualization of attractors varies across authors, an encompassing definition may be "theoretical posits that capture the way in which certain ideational variants are more likely to be the outcome of transformations than others." [13]. For example, CAT has shown how the cognitive appeal of bloodletting — the historically commonplace but often damaging practice of letting blood out to cure a patient — explains why it is found in many unrelated cultures worldwide [45]. Experiments from the CAT literature showed that successive transmissions of a story about a mundane event can lead bloodletting to acquire a causal role that it did not have in the original story. Other experimental work has also revealed how modifying information over transmission chains makes it converge toward inductive biases [36] or how ecological factors may influence the position of attractors [46].

Most experiments in CAT employ a transmission chain design [42], first introduced by [7]. In this setup, chains of participants receive, produce and transmit social information from and to each other in a sequential manner (as in the popular telephone game). This powerful and highly controlled design makes it possible to evaluate the high-level patterns that emerge from the accumulation of directional changes during single-turn transmission events. Here, we adapt this design to study how culture evolves along chains of LLMs, rather than human participants (Figure 1).

We conducted several transmission chain experiments with LLMs, where the first LLM-based agent in the chain receives a human-generated text, elaborates on it, and then passes it to the next agent in the chain. This transmission step is repeated with different instances of the LLM agent until the end of the chain is reached. By introducing several novel evaluation methods, we estimate the extent to which successive transmission events affect the evolution of multiple text properties, namely toxicity, positivity, difficulty, and length. By comparing the properties of the initial (human-generated) and final texts (after several transmissions between LLMs), we illustrate and study the existence of potential attractors in LLM cultural evolution. In particular, we measure the effect of consecutive interactions compared to the effect of the first one. To study how text properties and attractors are affected by the specific context in which culture evolves, we conduct our analyses on five different models (GPT-3.5-turbo-0125, Llama3-8B-Instruct, Mistral-7B-Instruct-v0.2, Llama3-70B-Instruct, and Mixtral-8x7B-Instruct-v0.1), three different tasks (i.e., instructions to either “rephrase”, “take inspiration from”, or “continue” the initial text) and 20 different initial texts. Although our focus is on a better understanding of the cultural dynamics of LLMs, the metrics and evaluation methods introduced here may also be of interest to researchers studying human cultural evolution.

The code for reproducing the simulations, analyses and figures is available on our GitHub (https://github.com/jeremyperez2/TelephoneGameLLM).

Our main contributions are:

  • We introduce an experimental paradigm based on transmission chains to study the biases and attractors introduced by multi-turn LLM interactions (Section 3.1)

  • We introduce a method to statistically assess the additional effects of multi-turn interactions over single-turn ones. (Section 3.2)

  • We introduce a method to estimate the existence, position, and strength of potential cultural attractors (Section 3.4).

  • We study the evolution of text properties (positivity, difficulty, toxicity, and length) across iterated transmissions and uncover systematic biases and attractors that are specific to multi-turn interactions (Section 4.2)

  • We show how the LLM properties (model and size) and the task influence the position and strength of attractors (Section 4.3) as well as the tendency of different chains to converge to semantically similar texts (Section 4.4).

2 Related work

Biases in LLMs outputs

LLM-generated content is known to exhibit a variety of stereotypical biases [8, 73]. In single-turn settings, LLMs perpetuate [49] or even amplify human biases based on gender, nationality, race, and religion [39]. For instance, the GPT model was shown to exhibit cultural values similar to those of WEIRD (Western, Educated, Industrialized, Rich, Democratic) cultures [5]. LLMs trained through reinforcement learning with human feedback (RLHF) were found to overly express left-wing opinions on American politics, a tendency that, once formed, is difficult to avoid even after steering the model toward different demographic groups [63].

Transmission chains featuring artificial agents

Several studies have applied experimental designs used in cultural evolution to study knowledge and skill accumulation in groups of Reinforcement Learning agents [18, 64, 69, 56]. Closer to the current study, populations of LLMs have also been studied [10]. Iterative chains of generative models trained on the preceding model’s output have been shown to sometimes collapse toward the most likely outputs while the tails of the original distribution disappear [65, 54]. This idea of using LLM-generated content to fine-tune the next generation has also been applied to groups of LLMs with various communication structures [33]. Similar to our approach, iterative chains with frozen (i.e., not re-trained) LLMs have been shown to express human-like biases in terms of gender stereotypes, positivity, and social, threat, and biology-related information [1]. Strong, but non-human-like biases for producing factual information have also been observed [17], stressing the importance of understanding the evolution of content in LLMs and the ways it might deviate from human cultural evolution.

Figure 2: Method for estimating attractor strength and position. This figure depicts the method introduced in Section 3.4 to estimate the strength and position of theoretical attractors. Each dot in this figure corresponds to one chain, for a total of 100 chains (20 initial texts * 5 seeds). The position of a dot on the x-axis corresponds to the value of the property (positivity in this example) in the initial text, while the position on the y-axis corresponds to the value of this property in the text produced after 50 generations. We then used these 100 data points to fit a linear regression predicting the relationship between the initial and final values of the property. As visible in the figure, this relationship can be used to predict the value toward which the property would theoretically converge at the limit. This convergence point only exists if the slope of the fitted line is lower than 1. When the slope is greater than 1, the fitted relationship would predict divergence. The position of the potential theoretical attractor corresponds to the intersection between the fitted line and the diagonal, while its strength corresponds to $1-s$, $s$ being the slope of the fitted relationship.

3 Methods

Our telephone game experiments aim to study the possible attractors and biases that may accumulate across multiple turns of interactions between LLMs. This is done with a transmission chain design tracking the evolution of LLM outputs as a function of the number of interactions in the LLM chain. This section introduces our transmission chain design (Section 3.1), the set of metrics used to study the evolution of text properties, semantic similarity and the added effect of multi-turn interactions (Section 3.2), and our method to characterize the properties of attractors (Section 3.4).

3.1 LLM transmission chains

In transmission chains, individual participants are ordered linearly. Each participant receives some information from the previous one, performs a task, and transmits new information to the next participant. Each agent is prompted with a task (instruction on how the text should be processed) and a text, which are concatenated and passed to the user message. The first agent is given a human-generated text and a task, and subsequent agents are given the same task and the text generated by the previous agent in the transmission chain:

$$text_{i+1} = LLM(task, text_{i}), \tag{1}$$

where $text_0$ is the initial human-generated text and $LLM$ generates an output based on the task $task$ and the previous agent’s text $text_i$. We run this process for 50 generations. Examples of texts evolving through generations are provided in Appendix Section A.
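The recurrence in Eq. 1 can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `run_chain`, `toy_llm`, the prompt wording, and the blank-line separator between task and text are all hypothetical, and `toy_llm` merely stands in for a real model call so that the example runs offline.

```python
from typing import Callable, List


def run_chain(
    generate: Callable[[str], str],
    task: str,
    initial_text: str,
    n_generations: int = 50,
) -> List[str]:
    """Return [text_0, text_1, ..., text_n], applying Eq. 1 at each step."""
    texts = [initial_text]
    for _ in range(n_generations):
        # Task and text are concatenated into one user message
        # (the separator used here is an assumption).
        prompt = f"{task}\n\n{texts[-1]}"
        texts.append(generate(prompt))
    return texts


def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: drops the last word of the text."""
    text = prompt.split("\n\n", 1)[1]
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


chain = run_chain(toy_llm, "Rephrase the text.", "the quick brown fox jumps", n_generations=3)
for generation, text in enumerate(chain):
    print(generation, text)
```

Because every agent sees only the task and the previous output, any systematic change introduced by `generate` compounds from one generation to the next, which is the effect the experiments measure.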

Initial texts ($text_0$)

We borrow human-generated texts from various databases to provide the initial input to each transmission chain. Since we were interested in how variation in the initial text would impact the properties of the ensuing chain, human-generated texts spanned various types of content: scientific abstracts (https://huggingface.co/datasets/CCRss/arxiv_papers_cs), news articles (https://huggingface.co/datasets/RealTimeData/bbc_latest), and social media posts (https://huggingface.co/datasets/FredZhang7/toxi-text-3M/blob/e0e5b168b4a7e14e84f07271bfe1c6b42bc91ccd/multilingual-train-deduplicated.csv). As we are interested in the evolution of the toxicity, positivity, difficulty and length of generated texts, we sample from these datasets to obtain a subset of 20 initial texts that cover the range of possible values for these properties. The exact method used to extract these texts is detailed in Appendix Section A.

Tasks

To determine the effects of instructions on the evolution of content over generations of LLMs, we prompt each chain of LLMs with three different tasks encompassing typical uses of LLMs:

  • Rephrase: agents are instructed to paraphrase the received text without modifying its meaning. This task is relevant for applications such as text simplification, or for content summarization.

  • Take inspiration: agents are instructed to take inspiration from the received text to produce a new one. It can be used in creative writing, where the goal is to generate new and original content.

  • Continue: agents are instructed to continue the received text. It is relevant for applications such as dialogue generation, in order to generate coherent and relevant responses to user inputs, or for content generation in storytelling and gaming.

As for models, tasks remained consistent within each chain. The exact prompt used for each task is reported in Appendix Section A.

Models

To assess whether and how cultural evolution dynamics are affected by the model specifications, we run identical experiments using five different models, all commonly used, from three different companies and with varying sizes: GPT-3.5-turbo-0125 (referred to as GPT3.5), Llama3-8B-Instruct (referred to as Llama3-8B), Mistral-7B-Instruct-v0.2 (referred to as Mistral-7B), Llama3-70B-Instruct (referred to as Llama3-70B), and Mixtral-8x7B-Instruct-v0.1 (referred to as Mixtral-8x7B). For inference, we used the OpenAI API (The MIT License; https://openai.com/index/openai-api/) to run GPT3.5 and Hugging Face's Transformers library [76] for the other models (Apache Licence, v2.0). In our setup, transmission chains are always homogeneous with respect to the model, i.e. each chain is composed of a population of agents sharing the same underlying model.

3.2 Metrics

Text properties

Iterated transmissions may affect the generated text in several ways. We focus on four orthogonal properties of each text that can be measured automatically (as opposed to requiring human annotators, which would have been impractical given the size of the output corpus):

  • Toxicity. Companies typically fine-tune LLMs to avoid the generation of toxic (i.e., dangerous or harmful) content. However, to our knowledge, this fine-tuning step focuses on single-turn dynamics, and the evolution of content with respect to its toxicity is currently understudied. We measure the toxicity of texts by quantifying the presence of rude, disrespectful, or unreasonable language, using a probability score that ranges from 0.0 (benign and non-toxic) to 1.0 (highly likely to be toxic), as estimated by the classifier introduced in [31].

  • Positivity. Even when trained to avoid toxic content, LLMs have been shown to express positivity biases similar to those of humans, often favoring negative over positive information in preserving and generating new information [1]. To study whether positivity biases over transmission chains are affected by tasks and models, we measure the positivity of produced content using the SentimentIntensityAnalyzer tool from NLTK [32], which computes a sentiment score for the text ranging from -1.0 (highly negative) to 1.0 (highly positive).

  • Difficulty. While LLMs are argued to benefit society by democratizing knowledge [74], such positive outcomes are conditioned on the LLMs generating output that is accessible and inclusive to all kinds of audiences. However, whether text difficulty is preserved, increased, or reduced over transmission chains is currently unknown. We estimate text difficulty using the Gunning-Fog index [9], which depends on the average sentence length and the percentage of difficult words. A standard interpretation of this index is that it estimates the years of formal education required to fully understand the text.

  • Length. A simple, yet crucial aspect of a piece of text is its length. As more and more content is generated by and from LLM outputs, cultural media may become populated with increasingly short (potentially incomplete) or long (potentially redundant, meaningless, or hard to process) material. We therefore assess the evolution of content length as measured by the character count of generated text.

We provide additional details about metrics in Appendix Section A.
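To make the two simplest metrics concrete, the sketch below computes the character count and a rough Gunning-Fog index using the standard formula, 0.4 × (average sentence length + percentage of words with three or more syllables). It is an illustration under stated assumptions rather than the measurement code used in the paper: the function names are hypothetical, and the vowel-group syllable counter is a crude heuristic that dedicated readability libraries refine.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable count: number of vowel groups (a heuristic assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text: str) -> float:
    """Gunning-Fog index: 0.4 * (words per sentence + 100 * share of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are approximated as words with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))


def text_length(text: str) -> int:
    """Length as the character count of the text."""
    return len(text)


simple = "The cat sat. The dog ran."
hard = "Computational experimentation necessitates considerable deliberation."
print(gunning_fog(simple), gunning_fog(hard), text_length(simple))
```

Short sentences made of short words yield a low index, while long, polysyllabic sentences yield a high one, which matches the interpretation of the index as years of formal education required.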

Semantic similarity

As we are also interested in measuring the extent to which different chains converge (or diverge) towards producing semantically similar (or different) outputs, we measure the similarity between all pairs of texts produced. We do so by computing the cosine similarity between text embeddings, obtaining a similarity value in the range $[-1, 1]$. An average similarity score close to 1 indicates that all texts were similar to each other, whereas scores close to 0 or below indicate that pairs of texts were highly dissimilar. Computing similarity scores for text outputs requires text embeddings, which we obtained with an embedding model from Sentence-Transformers [59].
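The averaging over all pairs can be sketched as follows. The toy two-dimensional vectors stand in for real sentence embeddings, the function names are hypothetical, and the sketch assumes non-zero vectors of equal dimension.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v, a value in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings: two identical texts and one orthogonal text.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(mean_pairwise_similarity(toy_embeddings))
```

In this toy case one pair is identical (similarity 1) and two pairs are orthogonal (similarity 0), so the average is one third; a set of chains converging on the same content would push the average toward 1.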

3.3 Effect of multi-turn transmissions

One of our questions is how content evolves over multi-turn transmissions compared to single-turn settings. To address this point, we compare the distribution of a given property in the generated texts at the first generation to the distributions at subsequent generations. Thus, for each model and task, we look at the properties of each of the 100 generated texts (20 initial texts * 5 seeds) at each generation, which gives us a sample of 100 property values for each generation. Using a Kolmogorov–Smirnov test [41], we then test whether the sample obtained at each generation comes from the same distribution as the sample obtained after the first generation. If we can confidently reject the hypothesis that the sample of property values at the end of the transmission chain comes from the same distribution as the sample obtained after the first generation, this would confirm that looking at outputs after a single-turn transmission is not enough for predicting output properties in a multi-turn setting.
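As an illustration of this comparison, the sketch below computes the two-sample Kolmogorov–Smirnov statistic and an approximate p-value from the asymptotic Kolmogorov distribution. It is not the authors' code: `ks_two_sample` and the toy samples are hypothetical, the asymptotic p-value is only an approximation, and in practice a library routine such as `scipy.stats.ks_2samp` would normally be preferred.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple


def ks_two_sample(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (D, approximate p-value) for the two-sample KS test."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # D is the largest gap between the two empirical CDFs,
    # evaluated at every observed value.
    d = max(abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m) for v in xs + ys)
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series below converges slowly here, and the p-value is
        # indistinguishable from 1 in this regime.
        return d, 1.0
    # Asymptotic Kolmogorov distribution (an approximation for large samples).
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2) for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


# Toy property values: 100 chains after generation 1 and after generation 50.
generation_1 = [i / 100 for i in range(100)]
generation_50 = [v + 10 for v in generation_1]
d_stat, p_value = ks_two_sample(generation_1, generation_50)
print(d_stat, p_value)
```

Here the two toy samples do not overlap at all, so the statistic reaches its maximum of 1 and the null hypothesis is rejected; comparing a sample with itself gives a statistic of 0 and a p-value of 1.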

3.4 Attractor strength and position

Human cultural evolution shows that cultural traits sometimes evolve towards attractor states, i.e., content that invites convergence even with different starting points [36, 45, 46, 13]. Therefore, we were interested in whether transmission chains with LLMs would show similar attractor dynamics, and whether these depend on the model and task used in the chain. The concept of cultural attractor is not consistently formalized in the human cultural evolution literature [13]. Here, we define an attractor as the theoretical equilibrium point to which the iterated generation process (defined in Eq. 1) may eventually converge. We mathematically characterize an attractor by two properties of interest: its position (i.e., the location in output generation space the process converges toward) and its strength (i.e., the intensity with which generated outputs are pulled toward it). The strength takes values in $[0, 1]$, which allows for a continuous notion of an attractor: rather than being a binary concept that either exists or does not, attractors here lie on a spectrum, covering systems without attraction effects (strength=0) to ideal attractors (strength=1). To compute position and strength, we use the simulated data to fit a linear regression predicting the value of a property at the end of the chain as a function of its value in the initial text (Figure 2). Linear regressions were fit using the SciPy library (https://scipy.org/), released under the BSD Licence.

For a given text property, we fit:

$$\textit{property}(generation = 50) = I + s * \textit{property}_{initial}, \tag{2}$$

where $I$ is the estimated intercept and $s$ the estimated slope.

This enables us to estimate the final output of a new chain starting from the final output of the previous chain as:

$$\textit{property}(generation = 100) = I + s * \textit{property}(generation = 50). \tag{3}$$

The fitted linear regression thus defines a recurrence relation that expresses the output of a chain as a function of the output of the previous chain:

$$\textit{property}(generation = n * 50) = I + s * \textit{property}(generation = (n-1) * 50). \tag{4}$$

This relationship is a linear recurrence, which can be rewritten as:

$$\textit{property}(generation = n * 50) = s^{n} * (\textit{property}_{initial} - l) + l, \tag{5}$$

where $l = \frac{I}{1-s}$ is the fixed point of Eq. 4, i.e., the value satisfying $l = I + s * l$.

If $|s| < 1$, then the sequence converges, its limit is $l$, and its convergence rate is $1-s$.

We can therefore use the estimated relationship (Eq. 2) to determine whether an attractor exists ($|s| < 1$) and, if so, to estimate its position $l = \frac{I}{1-s}$ and strength $1-s$.

To validate that these theoretical fixed points correctly capture attraction dynamics, we estimated their positions using only data from the first 10 generations of each chain, and compared the predictions with the actual output after 50 generations. Visual inspection of the results (Appendix Section B) confirmed that our method is suited for estimating the strength and position of attractors.
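The estimation procedure of this section can be sketched in a few lines. This is a minimal illustration with synthetic data, not the authors' code: the regression is written out as plain ordinary least squares rather than calling SciPy, and the names `fit_line` and `attractor` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = I + s * x; returns (I, s)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def attractor(
    initial: Sequence[float], final: Sequence[float]
) -> Optional[Tuple[float, float]]:
    """Return (position, strength), or None when |s| >= 1 (no convergence)."""
    intercept, slope = fit_line(initial, final)
    if abs(slope) >= 1:
        return None
    # Position is the fixed point l = I / (1 - s); strength is 1 - s.
    return intercept / (1 - slope), 1 - slope


# Synthetic data: final values generated with I = 0.5 and s = 0.25.
initial_values = [-1.0, -0.5, 0.0, 0.5, 1.0]
final_values = [0.5 + 0.25 * v for v in initial_values]
position, strength = attractor(initial_values, final_values)
print(position, strength)
```

With these synthetic values the fit recovers a slope of 0.25, so the estimated attractor sits at 0.5 / 0.75 ≈ 0.67 with strength 0.75. A fitted slope of 1 or more, as in the divergence case described in the caption of Figure 2, yields no attractor.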

4 Results

Figure 3: Evolution of properties for one of the 20 initial texts over generations (Mean ± SE). Evolution over generations of text toxicity, positivity, difficulty and length (rows) for the Rephrase, Take inspiration and Continue tasks (columns) for various LLMs (colors). Plots show average and standard deviations over 5 seeds, starting with Initial Text 18. Equivalent curves for all 20 initial stories are provided as Supplementary Material in Appendix Section B. We observe that iterated transmissions shape the properties of generated texts beyond the effect of the first transmission. Evolution can converge quickly (e.g. bottom row) or more gradually (e.g. second row). The value of convergence points varies between tasks and models.

Figure 4: Evolution of the distribution of text properties in generated texts across generations for different models and tasks. We represent the distribution of each of the four properties at each generation, for each model and task. These distributions thus represent the properties observed in the set of 100 transmission chains (20 initial texts * 5 seeds) for each model and task. For each property, task and model, the 50 generations are arranged vertically, with first generations at the top and last generations at the bottom. This representation shows how iterated transmissions shift the distributions toward certain values, and how quickly this shift happens.

For each of the 5 models, 3 tasks, and 20 initial texts, we ran 5 transmission chains with 50 transmission steps. We provide some examples of generated texts in Appendix Section 1, and complete data on the companion website (https://sites.google.com/view/telephone-game-llm). By extracting the properties of generated texts at each generation of each chain, we can study the evolution of these properties through generations, measure how they are affected by interactions beyond single-turn effects, as well as detect and characterize theoretical attractors. By comparing the semantic similarities of texts produced by different chains, we can also evaluate whether sets of chains tend to converge or diverge.

4.1 Qualitative analysis of property evolutions over generations

We first hypothesized that iterated transmissions would affect the properties of generated texts beyond the first transmission. Across many instances, we indeed found that text properties keep evolving after the first generation. In Figure 3, we show the evolution of text difficulty, positivity, length and toxicity for one of the 20 initial texts, for all models and all tasks. Evolution of these properties for each of the 20 initial texts can be found in Appendix Section B. This specific example already reveals important differences in dynamics between models, tasks, and properties. Indeed, we observe that while toxicity converges to values close to zero for all models and tasks, this happens at a slower pace for GPT3.5 in the Rephrase task, and for Llama3-70B in the Take Inspiration task. For positivity, we observe that on the Take Inspiration and Continue tasks, GPT3.5 and Mixtral-8x7B converge to high positivity almost instantly, while evolution is more gradual for other models. The dynamics of difficulty appear to be highly influenced by task and model, as we observe cumulative changes for Llama3-8B and Llama3-70B in the Take Inspiration and Continue tasks, but much less so for other tasks and models. Length appears to exhibit little to no cumulative dynamics in this example. Interestingly, there seems to be a large discontinuity for Mistral-7B toward the end of the chain in the Continue task. Qualitative inspection of the texts suggests that this is an example of collapsing behavior, which we discuss more extensively in Appendix Section B.

Overall, these results suggest that the evolution of text properties across repeated transmission is highly sensitive to both agent models and tasks.

Although we focused here on a specific example, looking at the evolution over generations of the distribution of each property gives a higher-level idea of the dynamics. In Figure 4, we show the evolution of property distributions over generations for each model and task. This reveals important differences depending on the analyzed property, the task and the model. For instance, we observe that toxicity (Figure 4.a) converges very quickly to a very narrow peak centered around 0. This is very different from the evolution of positivity (Figure 4.b), for which the initial distribution appears to be largely preserved for the Rephrase task (Figure 4.b, first row), while less constrained tasks such as Take inspiration (second row) and Continue (third row) lead to more visible changes. Interestingly, we observe that for Llama3-8B (blue) and Llama3-70B (pink), the distribution of positivity values converges to a bimodal distribution, while distributions are unimodal for other models. In some cases, we also observe that different models shift the distributions in opposite directions. For instance, when looking at the evolution of text length (Figure 4.d), using GPT3.5 (green) or Llama3-8B (blue) leads texts to become shorter on average, while using Mixtral-8x7B shifts the distribution towards greater length values.

4.2 To what extent do multi-turn transmissions affect the evolution of properties?

Qualitative analyses from the previous section suggest that multi-turn transmissions lead texts to acquire different properties compared to single-turn settings. To quantitatively evaluate this observation, we use Kolmogorov–Smirnov (KS) tests [41] to compare property distributions after a single interaction and after multiple interactions. In Figure 5, we report for each model, task and property the p-value of the KS test for the null hypothesis $H_0$: "The text properties at generation $i$ are sampled from the same distribution as the text properties after generation 1". Across most instances, we observe that $H_0$ can be confidently rejected, indicating that property distributions become significantly different after multi-turn transmissions compared to single-turn transmissions. We observe that this is more often the case for less constrained tasks (Take Inspiration and Continue, second and third columns) than for the more constrained task (Rephrase, first column). This finding confirms that studying single-turn interactions is in general not sufficient for analyzing the properties of the outputs of interacting LLMs. This warrants a more detailed account of the cultural dynamics across iterated interactions among LLMs.

Figure 5: Text properties are affected by transmissions beyond the first one. p-values of the KS test for the null hypothesis $H_0$: "The text properties at generation $i$ are sampled from the same distribution as the text properties after generation 1", for each task (columns), property (rows) and model (colors). The grey shaded area represents p-values lower than 0.05. Over most instances, p-values decrease over generations and become close to 0. This indicates that multi-turn transmissions lead property distributions to become significantly different from the distributions observed after single-turn interactions.
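The KS comparison described above can be illustrated with a minimal, dependency-free sketch. It assumes the property values of all chains at generation 1 and at generation $i$ are available as two lists; the function names and the toy scores are hypothetical, and the p-value relies on the standard asymptotic Kolmogorov series rather than on whichever implementation was used for the reported results.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov series (with a small-sample correction)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The alternating series does not converge at lam = 0; p is ~1 in this range.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical toxicity scores for 20 chains (illustrative values, not real data).
gen_1 = [0.02 * k for k in range(20)]   # spread out after a single transmission
gen_i = [0.001 * k for k in range(20)]  # concentrated near 0 at generation i
d = ks_statistic(gen_1, gen_i)
p = ks_pvalue(d, len(gen_1), len(gen_i))
print(f"D = {d:.2f}, p = {p:.2g}, reject H0 at 0.05: {p < 0.05}")
```

With real data, a library routine for the two-sample KS test would normally be preferable to this hand-written approximation.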

4.3 What influences the presence, strength, and position of attractors?

Visual inspection of the evolution of text properties, as presented in Figure 4, indicates that multi-turn transmissions lead distributions to become skewed toward certain values, which suggests the presence of attractors. The task assigned to a chain and the model type populating it appear to influence the position of those attractors, as well as their strength (i.e., how quickly shifts in distributions happen). To obtain quantitative measures of attractor strengths and positions, we use the method described in Section 3.4 and Figure 2. By fitting a linear regression predicting the value of a given property at the end of the chain as a function of the property in the initial text, we can represent for each property, model and task the position and strength of the attractors, provided they exist. Figure 6 presents the estimated strengths and positions of attractors, and fitted linear regressions are provided as supplementary material in Figure 8.

For all combinations of property, task and model, we found that the recurrence relation defined by the fitted linear regression converges. This means that all conditions admit a theoretical attractor as defined in Section 3.4. As for the strength and position of these theoretical attractors, Figure 6 already reveals some tendencies. For instance, it seems that less constrained tasks (e.g., Continue) lead to stronger attractors than more constrained tasks (e.g., Rephrase). To better disentangle the respective contributions of model type, task and property to attractors, we fitted Bayesian models predicting attractor Strength as a function of Task, Model and Property, and predicting attractor Position as a function of Task and Model, for each of the four considered properties. Details about statistical analyses are provided in Appendix Section B.
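As a rough illustration of the fixed-point reasoning above, the sketch below fits a line predicting the final property value from the initial one and treats the fit as a recurrence $x_{n+1} = a x_n + b$. The recurrence converges when $|a| < 1$, and its fixed point is $b / (1 - a)$. The function names and toy values are hypothetical, and the strength measure used here ($1 - |a|$) is only an assumption made for the example; Section 3.4 gives the definitions actually used.

```python
def fit_line(x, y):
    """Ordinary least squares fit y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def theoretical_attractor(slope, intercept):
    """Fixed point of x_{n+1} = slope * x_n + intercept, if the recurrence converges."""
    if abs(slope) >= 1:
        return None  # iterating the map does not settle on a single value
    position = intercept / (1 - slope)
    strength = 1 - abs(slope)  # ASSUMPTION: illustrative measure, see Section 3.4
    return position, strength


# Hypothetical initial and final property values for six chains.
initial = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
final = [0.1 + 0.25 * x for x in initial]
slope, intercept = fit_line(initial, final)
position, strength = theoretical_attractor(slope, intercept)
print(f"slope={slope:.2f}, position={position:.3f}, strength={strength:.2f}")
```

A slope close to 0 means the final value barely depends on the initial text (strong pull toward the fixed point), while a slope close to 1 means the initial value is largely preserved.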

We find a strong effect of Task on attractor strength: Continue leads to significantly stronger attraction than Take Inspiration, which in turn leads to significantly stronger attraction than Rephrase. This confirms our observation that less constrained tasks lead to stronger attraction than more constrained tasks. Different properties are also found to display different sensitivity to attraction effects. We detect that toxicity possesses significantly stronger attractors than positivity, difficulty and length. As for the effect of model, we observe significantly weaker attraction for Llama3-70B compared to GPT3.5, Llama3-8B and Mixtral-8x7B.

As for the position of the attractors, we observe that the attractor for toxicity is significantly higher for Llama3-8B than for GPT3.5 and Mixtral-8x7B, and significantly higher for Llama3-70B than for GPT3.5, Mistral-7B and Mixtral-8x7B. It is also higher for Continue than for Rephrase. For positivity, we found that the position of the attractor was significantly lower for Llama3-8B than for GPT3.5, Mistral-7B and Mixtral-8x7B, and that the Take inspiration and Continue tasks both led to significantly higher positivity than the Rephrase task.

Figure 6: Attractor strength and position. The height of the bars represents the position (top row) and strength (bottom row) of theoretical attractors estimated using the method described in Section 3.4, for each property (columns), task, and model. Visual inspection of these plots reveals some tendencies, such as the effect of task on attractor strength: less constrained tasks, such as Continue, appear to produce stronger attractors than more constrained tasks, such as Rephrase. Our definition of attractors also makes it possible to compare attractor strength across properties: attractors appear to be stronger for toxicity (second row, first column) than for length (second row, fourth column). Finally, the position of attractors appears to vary between models. For instance, the attractor for difficulty appears to be higher for Llama3-8B and Llama3-70B than for other models. Statistical analyses quantifying these differences are presented in Section 4.3.

4.4 To what extent do different transmission chains converge on similar content?

Lastly, we investigate the extent to which iterated transmissions lead different chains to diverge or converge as determined by whether the between-chain similarity among generated texts increases or decreases after several generations. This can be measured by assessing the cosine similarity between final text embeddings versus the cosine similarity between initial text embeddings for each possible pair of chains, for each model and task (Figure 7).

Figure 7: Convergence: texts output by transmission chains are often more similar to each other than the texts given as input. We perform pair-wise comparisons of all simulated chains and plot the relationship between the similarity of the two texts given as input to the two chains and the similarity of the two texts produced at the last generation of these two chains. Points above the diagonal indicate that similarity is higher at the end of the chain than at the beginning, revealing convergence.
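The pairwise comparison behind this analysis can be sketched as follows, assuming each chain's initial and final texts have already been mapped to embedding vectors by some sentence-embedding model. The function names and the toy 2-D vectors are hypothetical and only serve to show the computation.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarity_shift(initial_embeddings, final_embeddings):
    """For every pair of chains, return (similarity of inputs, similarity of final outputs)."""
    pairs = []
    for i, j in combinations(range(len(initial_embeddings)), 2):
        before = cosine_similarity(initial_embeddings[i], initial_embeddings[j])
        after = cosine_similarity(final_embeddings[i], final_embeddings[j])
        pairs.append((before, after))
    return pairs


# Toy 2-D "embeddings" for three chains (illustrative values, not real data).
initial = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
final = [[1.0, 0.10], [1.0, 0.20], [1.0, 0.15]]
pairs = pairwise_similarity_shift(initial, final)
converged = sum(after > before for before, after in pairs)
print(f"{converged} of {len(pairs)} pairs are more similar at the end of the chain")
```

Each returned pair corresponds to one point in a plot like Figure 7: pairs whose second value exceeds the first lie above the diagonal.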

As expected given the nature of the task, chains instructed with the Rephrase task maintained close similarity with the initial text (Figure 7). Out of the five models tested, Llama3-8B was the most likely to maintain semantic similarity across generations for this prompt. For Take inspiration and Rephrase, there seems to be a tendency for chains to lead to a specific distance between final texts, as the initial distance between texts has little impact on the final distance. This means that over generations, chains that started with very similar texts diverge while chains that started with very different texts converge. The position of this attractor appears to be influenced by the model, as for example Llama3-8B displays much more convergence than the two other models for the Take inspiration task.

5 Discussion

While current studies analyzing the outputs of LLMs are restricted to a single prompt-output interaction, we borrowed the methodology from studies on human cultural evolution to address how cultural content may evolve over transmission chains with LLMs. This resulted in a series of telephone game experiments assessing the evolution of cultural content in LLMs as a function of models, instructions, and text properties. Our results reveal that several changes in generated content appear after multiple iterations. For example, we observed that the difficulty of a provided text was preserved after an LLM was prompted to elaborate it a single time, but changed dramatically after the text was processed iteratively by a chain of LLMs.

By comparing the properties of input texts to those of texts produced by transmission chains spanning several generations of LLMs, we identified property-specific patterns in the convergence of LLM dynamics toward attractor states. Although the existence of fixed points identified with our method does not prove the existence of an attractor, it allows quantitative evaluation of the dynamics of text evolution across models, tasks and properties. Using this method, we found high convergence rates for some text properties (toxicity and positivity), independent of the chain’s model and task, whereas the evolution of other text characteristics (difficulty and length) was influenced by the task and the model. Differences in dynamics among properties could be a consequence of fine-tuning through reinforcement learning with human feedback, a commonplace practice in LLM training which may target some properties more than others (e.g., specifically avoiding toxic content without addressing its difficulty), creating strong attractors. It is also interesting to note that convergence rates were, on average, higher for more open-ended tasks (Take inspiration and Continue) than for a more constrained one (Rephrase). This suggests that the study of cultural evolution in LLM transmission chains might be particularly relevant to situations in which LLMs are used to simulate artificial societies [20, 51, 52, 77, 34, 71, 17], where they are often granted relatively high freedom in order to witness emergent behaviors.

We also introduced several evaluation metrics for analyzing cultural dynamics, in particular defining a task- and metric-independent notion of theoretical attractor. Although these methods were developed to study transmission chains in LLMs, similar tools may be applied to studies of human cultural evolution, allowing for inferences across tasks and cultural domains, moving beyond domain-specific results.

Limitations and future work

As the study of the cultural dynamics of generative agents is an emerging research area, our setting involved several simplifications. While we focused on linear transmission chains, real-world interactions typically involve networks of senders and receivers. Human studies have shown that network size [28, 60, 4, 6, 21] and structure [58, 38, 22, 19, 23] influence cultural evolution. Following some initial endeavors [50, 53], future work may assess similar effects in machine networks. To investigate model- and task-specific biases, we studied transmission chains in homogeneous settings, where agents belonged to the same model type and received the same instructions, but future studies may address cultural dynamics in heterogeneous populations of LLMs prompted with various instructions. While we focused on LLMs and, therefore, text outputs, similar studies may be run to address the properties of other generative tools (e.g., for image generation). In the future, researchers may also address hybrid networks in which humans and LLMs interact, a scenario that is becoming increasingly relevant as generative tools become more widespread and which may, in turn, shape the future of human cultural evolution [10].

Acknowledgements

This research was partially funded by the French National Research Agency (ANR, project ECOCURL, Grant ANR-20-CE23-0006). This work benefited from access to the Jean Zay (Idris) supercomputer associated with the Genci grant A0151011996. We also thank Chris Foulon, Marcela Ovando-Tellez and Joan Dussauld who participated in the hackathon Hack1Robo during which this project originated.

References

  • [1] A. Acerbi and J. M. Stubbersfield. Large language models show human-like content biases in transmission chain experiments. Proceedings of the National Academy of Sciences, 120(44):e2313790120, 2023.
  • [2] S. Agarwal, I. H. Laradji, L. Charlin, and C. Pal. LitLLM: A Toolkit for Scientific Literature Review, Feb. 2024. arXiv:2402.01788 [cs].
  • [3] A. Agiza, M. Mostagir, and S. Reda. Analyzing the Impact of Data Selection and Fine-Tuning on Economic and Political Biases in LLMs, Apr. 2024. arXiv:2404.08699 [cs].
  • [4] C. Andersson and D. Read. Group size and cultural complexity. Nature, 511(7507):E1–E1, 2014. Publisher: Nature Publishing Group UK London.
  • [5] M. Atari, M. J. Xue, P. S. Park, D. E. Blasi, and J. Henrich. Which humans?, Sep 2023.
  • [6] R. Baldini. Revisiting the effect of population size on cumulative cultural evolution. Journal of Cognition and Culture, 15(3-4):320–336, 2015. Publisher: Brill.
  • [7] P. L. Bartlett. Remembering. Cambridge University Press, 1932.
  • [8] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY, USA, 2021. Association for Computing Machinery.
  • [9] J. Bogert. In defense of the fog index. The Bulletin of the Association for Business Communication, 48(2):9–12, 1985.
  • [10] L. Brinkmann, F. Baumann, J.-F. Bonnefon, M. Derex, T. F. Müller, A.-M. Nussberger, A. Czaplicka, A. Acerbi, T. L. Griffiths, J. Henrich, et al. Machine culture. Nature Human Behaviour, 7(11):1855–1868, 2023.
  • [11] E. Brynjolfsson, D. Li, and L. R. Raymond. Generative AI at Work, Apr. 2023.
  • [12] O. O. Buruk. Academic Writing with GPT-3.5: Reflections on Practices, Efficacy and Transparency. In 26th International Academic Mindtrek Conference, pages 144–153, Oct. 2023. arXiv:2304.11079 [cs].
  • [13] A. Buskell. What are cultural attractors? Biology & Philosophy, 32(3):377–394, 2017.
  • [14] G. Cabanac, C. Labbé, and A. Magazinov. Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals, July 2021. arXiv:2107.06751 [cs].
  • [15] G. H. Chen, S. Chen, Z. Liu, F. Jiang, and B. Wang. Humans or LLMs as the Judge? A Study on Judgement Biases, Apr. 2024. arXiv:2402.10669 [cs].
  • [16] H. Cheng, B. Sheng, A. Lee, V. Chaudary, A. G. Atanasov, N. Liu, Y. Qiu, T. Y. Wong, Y.-C. Tham, and Y. Zheng. Have AI-Generated Texts from LLM Infiltrated the Realm of Scientific Writing? A Large-Scale Analysis of Preprint Platforms, Mar. 2024. Pages: 2024.03.25.586710 Section: New Results.
  • [17] Y.-S. Chuang, A. Goyal, N. Harlalka, S. Suresh, R. Hawkins, S. Yang, D. Shah, J. Hu, and T. T. Rogers. Simulating opinion dynamics with networks of llm-based agents. arXiv preprint arXiv:2311.09618, 2023.
  • [18] J. Cook, C. Lu, E. Hughes, J. Z. Leibo, and J. Foerster. Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning, June 2024. arXiv:2406.00392 [cs].
  • [19] J. F.-L. de Pablo, V. Romano, M. Derex, E. Gjesfjeld, C. Gravel-Miguel, M. J. Hamilton, A. B. Migliano, F. Riede, and S. Lozano. Understanding hunter–gatherer cultural evolution needs network thinking. Trends in Ecology & Evolution, 37(8):632–636, 2022. Publisher: Elsevier.
  • [20] I. de Zarzà, J. de Curtò, G. Roig, P. Manzoni, and C. T. Calafate. Emergent cooperation and strategy adaptation in multi-agent systems: An extended coevolutionary theory with llms. Electronics, 12(12):2722, 2023. Publisher: MDPI.
  • [21] M. Derex, M.-P. Beugin, B. Godelle, and M. Raymond. Experimental evidence for the influence of group size on cultural complexity. Nature, 503(7476):389–391, 2013. Publisher: Nature Publishing Group UK London.
  • [22] M. Derex and R. Boyd. Partial connectivity increases cultural accumulation within groups. Proceedings of the National Academy of Sciences, 113(11):2982–2987, Mar. 2016.
  • [23] M. Derex and A. Mesoudi. Cumulative cultural evolution within evolving population structures. Trends in Cognitive Sciences, 24(8):654–667, 2020. Publisher: Elsevier.
  • [24] I. Dergaa, K. Chamari, P. Zmijewski, and H. Ben Saad. From human writing to artificial intelligence generated text: examining the prospects and potential threats of ChatGPT in academic writing. Biology of Sport, 40(2):615–622, Apr. 2023.
  • [25] X. Dong, Y. Wang, P. S. Yu, and J. Caverlee. Disclosure and Mitigation of Gender Bias in LLMs, Feb. 2024. arXiv:2402.11190 [cs].
  • [26] J. Echterhoff, Y. Liu, A. Alessa, J. McAuley, and Z. He. Cognitive Bias in High-Stakes Decision-Making with LLMs, Feb. 2024. arXiv:2403.00811 [cs].
  • [27] T. Eloundou, S. Manning, P. Mishkin, and D. Rock. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models, Aug. 2023. arXiv:2303.10130 [cs, econ, q-fin].
  • [28] N. Fay, N. De Kleine, B. Walker, and C. A. Caldwell. Increasing population size can inhibit cumulative cultural evolution. Proceedings of the National Academy of Sciences, 116(14):6726–6731, Apr. 2019.
  • [29] J. Gleick and R. C. Hilborn. Chaos, Making a New Science. American Journal of Physics, 56(11):1053–1054, Nov. 1988.
  • [30] P. Haller, A. Aynetdinov, and A. Akbik. OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs, Sept. 2023. arXiv:2309.03876 [cs].
  • [31] L. Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020.
  • [32] N. Hardeniya, J. Perkins, D. Chopra, N. Joshi, and I. Mathur. Natural language processing: python and NLTK. Packt Publishing Ltd, 2016.
  • [33] H. Helm, B. Duderstadt, Y. Park, and C. E. Priebe. Tracking the perspectives of interacting language models, June 2024. arXiv:2406.11938 [cs].
  • [34] W. Hua, L. Fan, L. Li, K. Mei, J. Ji, Y. Ge, L. Hemphill, and Y. Zhang. War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars, Nov. 2023. arXiv:2311.17227 [cs].
  • [35] C. Hutto and E. Gilbert. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media, volume 8, pages 216–225, 2014.
  • [36] M. L. Kalish, T. L. Griffiths, and S. Lewandowsky. Iterated learning: Intergenerational knowledge transmission reveals inductive biases. Psychonomic Bulletin & Review, 14(2):288–294, 2007. Place: US Publisher: Psychonomic Society.
  • [37] Q. Khraisha, S. Put, J. Kappenberg, A. Warraitch, and K. Hadfield. Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Research Synthesis Methods, Mar. 2024.
  • [38] S. Kirby and M. Tamariz. Cumulative cultural evolution, population structure and the origin of combinatoriality in human language. Philosophical Transactions of the Royal Society B: Biological Sciences, 377(1843):20200319, Jan. 2022.
  • [39] H. Kotek, R. Dockum, and D. Sun. Gender bias and stereotypes in large language models. In Proceedings of The ACM Collective Intelligence Conference, pages 12–24, 2023.
  • [40] R. Marlow and D. Wood. Ghost in the machine or monkey with a typewriter—generating titles for Christmas research articles in The BMJ using artificial intelligence: observational study. The BMJ, 375:e067732, Dec. 2021.
  • [41] F. J. Massey. The kolmogorov-smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.
  • [42] A. Mesoudi. Experimental studies of cultural evolution, July 2021.
  • [43] M. Mitchell. Complexity: A Guided Tour. Oxford University Press, Oxford, New York, Apr. 2009.
  • [44] H. Miton. Cultural Attraction, Feb. 2024.
  • [45] H. Miton, N. Claidière, and H. Mercier. Universal cognitive mechanisms explain the cultural success of bloodletting. Evolution and Human Behavior, 36(4):303–312, July 2015.
  • [46] H. Miton, T. Wolf, C. Vesper, G. Knoblich, and D. Sperber. Motor constraints influence cultural evolution of rhythm. Proceedings of the Royal Society B: Biological Sciences, 287(1937):20202001, Oct. 2020. Publisher: Royal Society.
  • [47] O. Morin. How Traditions Live and Die. Oxford University Press, 2016. Google-Books-ID: kSukCgAAQBAJ.
  • [48] F. Motoki, V. Pinho Neto, and V. Rodrigues. More human than human: measuring ChatGPT political bias. Public Choice, Aug. 2023.
  • [49] M. Nadeem, A. Bethke, and S. Reddy. Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456, 2020.
  • [50] E. Nisioti, M. Mahaut, P.-Y. Oudeyer, I. Momennejad, and C. Moulin-Frier. Social Network Structure Shapes Innovation: Experience-sharing in RL with SAPIENS, Nov. 2022. arXiv:2206.05060 [cs].
  • [51] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative Agents: Interactive Simulacra of Human Behavior, Aug. 2023. arXiv:2304.03442 [cs].
  • [52] J. S. Park, L. Popowski, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Social Simulacra: Creating Populated Prototypes for Social Computing Systems, Aug. 2022. arXiv:2208.04024 [cs].
  • [53] J. Perez, C. Léger, M. Ovando-Tellez, C. Foulon, J. Dussauld, P.-Y. Oudeyer, and C. Moulin-Frier. Cultural evolution in populations of Large Language Models, Mar. 2024. arXiv:2403.08882 [cs, q-bio].
  • [54] A. J. Peterson. Ai and the problem of knowledge collapse. arXiv preprint arXiv:2404.03502, 2024.
  • [55] S. Petridis, N. Diakopoulos, K. Crowston, M. Hansen, K. Henderson, S. Jastrzebski, J. V. Nickerson, and L. B. Chilton. AngleKindling: Supporting Journalistic Angle Ideation with Large Language Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, pages 1–16, New York, NY, USA, Apr. 2023. Association for Computing Machinery.
  • [56] B. Prystawski, D. Arumugam, and N. D. Goodman. Cultural reinforcement learning: a framework for modeling cumulative culture on a limited channel, May 2023.
  • [57] C. Raj, A. Mukherjee, A. Caliskan, A. Anastasopoulos, and Z. Zhu. Breaking Bias, Building Bridges: Evaluation and Mitigation of Social Biases in LLMs via Contact Hypothesis, July 2024. arXiv:2407.02030 [cs].
  • [58] L. Raviv, A. Meyer, and S. Lev-Ari. The Role of Social Network Structure in the Emergence of Linguistic Structure. Cognitive Science, 44(8):e12876, Aug. 2020.
  • [59] N. Reimers and I. Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2020.
  • [60] P. Richerson. Group size determines cultural complexity. Nature, 503(7476):351–352, 2013. Publisher: Nature Publishing Group UK London.
  • [61] E. Sadikoğlu, M. Gök, M. Mijwil, and I. Kosesoy. The evolution and impact of large language model chatbots in social media: A comprehensive review of past, present, and future applications, 12 2023.
  • [62] L. Salewski, S. Alaniz, I. Rio-Torto, E. Schulz, and Z. Akata. In-Context Impersonation Reveals Large Language Models’ Strengths and Biases. Advances in Neural Information Processing Systems, 36, 2024.
  • [63] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, and T. Hashimoto. Whose opinions do language models reflect? In International Conference on Machine Learning, pages 29971–30004. PMLR, 2023.
  • [64] S. Schmitt, J. J. Hudson, A. Zidek, S. Osindero, C. Doersch, W. M. Czarnecki, J. Z. Leibo, H. Kuttler, A. Zisserman, K. Simonyan, and S. M. A. Eslami. Kickstarting Deep Reinforcement Learning, Mar. 2018. arXiv:1803.03835 [cs].
  • [65] I. Shumailov, Z. Shumaylov, Y. Zhao, Y. Gal, N. Papernot, and R. Anderson. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493, 2023.
  • [66] S. Sivaprasad, P. Kaushik, S. Abdelnabi, and M. Fritz. Exploring Value Biases: How LLMs Deviate Towards the Ideal, Feb. 2024. arXiv:2402.11005 [cs].
  • [67] D. Sperber. Anthropology and Psychology: Towards an Epidemiology of Representations. Man, 20(1):73–89, 1985. Publisher: [Wiley, Royal Anthropological Institute of Great Britain and Ireland].
  • [68] G. Spitale, N. Biller-Andorno, and F. Germani. AI model GPT-3 (dis)informs us better than humans. Science Advances, 9(26):eadh1850, June 2023. Publisher: American Association for the Advancement of Science.
  • [69] O. E. L. Team, A. Stooke, A. Mahajan, C. Barros, C. Deck, J. Bauer, J. Sygnowski, M. Trebacz, M. Jaderberg, M. Mathieu, N. McAleese, N. Bradley-Schmieg, N. Wong, N. Porcel, R. Raileanu, S. Hughes-Fitt, V. Dalibard, and W. M. Czarnecki. Open-Ended Learning Leads to Generally Capable Agents, July 2021. arXiv:2107.12808 [cs].
  • [70] M. Valentini, J. Weber, J. Salcido, T. Wright, E. Colunga, and K. Kann. On the Automatic Generation and Simplification of Children’s Stories, Oct. 2023.
  • [71] A. S. Vezhnevets, J. P. Agapiou, A. Aharon, R. Ziv, J. Matyas, E. A. Duéñez-Guzmán, W. A. Cunningham, S. Osindero, D. Karmon, and J. Z. Leibo. Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia, Dec. 2023. arXiv:2312.03664 [cs].
  • [72] Y. Wan, G. Pu, J. Sun, A. Garimella, K.-W. Chang, and N. Peng. "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters, Dec. 2023. arXiv:2310.09219 [cs].
  • [73] L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.
  • [74] D. Weiss. Generative AI is the Next Step in Democratizing Knowledge. https://techstrong.ai/articles/generative-ai-is-the-next-step-in-democratizing-knowledge/. [Accessed 22-05-2024].
  • [75] T. Wiecki, R. Vieira, J. Salvatier, M. Kochurov, A. Patil, M. Osthege, B. T. Willard, B. Engels, O. A. Martin, C. Carroll, A. Seyboldt, A. Rochford, L. Paz, rpgoldman, K. Meyer, P. Coyle, O. Abril-Pla, V. Andreani, M. E. Gorelli, R. Kumar, J. Lao, A. Andorra, T. Yoshioka, G. Ho, T. Kluyver, K. Beauchamp, D. Pananos, E. Spaak, and B. Edwards. pymc-devs/pymc: v3.11.6, May 2024.
  • [76] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • [77] B. Xiao, Z. Yin, and Z. Shan. Simulating Public Administration Crisis: A Novel Generative Agent-Based Simulation System to Lower Technology Barriers in Social Science Research, Nov. 2023. arXiv:2311.06957 [cs].
  • [78] Z. Xie, T. Cohn, and J. H. Lau. The Next Chapter: A Study of Large Language Models in Storytelling, July 2023. arXiv:2301.09790 [cs].
  • [79] Z. Zhao, S. Song, B. Duah, J. Macbeth, S. Carter, M. P. Van, N. S. Bravo, M. Klenk, K. Sick, and A. L. S. Filipowicz. More human than human: LLM-generated narratives outperform human-LLM interleaved narratives. In Proceedings of the 15th Conference on Creativity and Cognition, pages 368–370, New York, NY, USA, June 2023. Association for Computing Machinery.

Appendix A Additional details on the methods

Selecting initial texts

We extracted 5 scientific abstracts (https://huggingface.co/datasets/CCRss/arxiv_papers_cs), 10 news articles (https://huggingface.co/datasets/RealTimeData/bbc_latest), and 5 social media comments (https://huggingface.co/datasets/FredZhang7/toxi-text-3M/blob/e0e5b168b4a7e14e84f07271bfe1c6b42bc91ccd/multilingual-train-deduplicated.csv) from online datasets as initial texts. To ensure that those initial texts covered the range of text properties we were interested in, we proceeded as follows: for difficulty, we measured the minimal and maximal difficulty $d_{min}$ and $d_{max}$ of texts from the scientific abstracts dataset, defined a linear space of 5 values $(d_i)_{i=1:5}$ between $d_{min}$ and $d_{max}$, and sampled 5 texts, each having a difficulty value close to one of the $(d_i)_{i=1:5}$. We then followed the same procedure for toxicity, using the dataset of social media comments; for positivity, using the dataset of news articles; and for length, also using the dataset of news articles.
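The selection procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, "close to" is interpreted as picking the nearest remaining text for each target value, and texts are picked without replacement. The actual sampling code may differ.

```python
def select_spread_texts(texts, scores, k=5):
    """Pick k texts whose scores lie closest to k evenly spaced targets in [min, max]."""
    lo, hi = min(scores), max(scores)
    targets = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    available = list(range(len(texts)))
    chosen = []
    for target in targets:
        # ASSUMPTION: nearest-score match, each text used at most once.
        best = min(available, key=lambda idx: abs(scores[idx] - target))
        available.remove(best)
        chosen.append(texts[best])
    return chosen


# Hypothetical candidate texts with precomputed difficulty scores.
texts = [f"text_{i}" for i in range(9)]
scores = [0.0, 1.0, 2.4, 3.0, 5.1, 6.0, 7.6, 9.0, 10.0]
selected = select_spread_texts(texts, scores, k=5)
print(selected)
```

Here the targets are 0, 2.5, 5, 7.5 and 10, so the selected texts span the whole observed range of the property rather than clustering around its typical value.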

Pre-processing outputs

Data analyses revealed that, on the Continue task, when using Mistral-7B, agents in the chains would sometimes start outputting very long texts by filling them with “#some_keyword”. As this behavior created a few outliers, we filtered out those “#some_keyword” strings when performing the main analyses. This behavior is nevertheless an interesting result, reminiscent of the collapsing dynamics found when training LLMs on their own output [65]. We therefore discuss it separately in Appendix B.
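A filter of this kind could look like the sketch below. The exact pattern used for the analyses is not given in the text, so the regular expression here (any "#" followed by word characters) and the function name are assumptions made for illustration.

```python
import re

# ASSUMPTION: the filler tokens look like hashtags ("#word"); the real pattern
# used in the pre-processing is not specified in the text.
HASHTAG = re.compile(r"#\w+")


def strip_hashtag_filler(text):
    """Remove hashtag-like tokens and collapse the whitespace they leave behind."""
    without_tags = HASHTAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


cleaned = strip_hashtag_filler("The story ends here. #story #story #end")
print(cleaned)
```

Note that such a pattern would also remove legitimate hashtags, which matters for inputs such as social media comments.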

Hyperparameters

We use the following hyperparameters for generation in all models. Temperature was set to 0.8 and top_p to 0.95. For all models except GPT3.5, bfloat16 precision was used.

Computational resources

Experiments were conducted with the OpenAI API (less than 5 million tokens) and on a cluster equipped with A100 and V100 GPUs. Running the final experiments for the four models that were run on the cluster required less than 3000 GPU hours. Experiments with Llama3-8B and Mistral-7B were conducted on V100 NVIDIA GPUs with 32GB of VRAM, and experiments with Llama3-70B and Mixtral-8x7B on two A100 NVIDIA GPUs with 80GB of VRAM in parallel.

Prompts used

In our experiments, each task was induced by a specific instruction (prompt), which is given to each agent in the chain. For the Rephrase task, the instruction is: “You will receive a text. Your task is to rephrase this text without modifying its meaning. Just output your new text, nothing else. Here is the text:”. For the Take Inspiration task, the instruction is: “You will receive a text. Your task is to create a new original text by taking inspiration from this text. Just output your new text, nothing else. Here is the text:”. For the Continue task, the instruction is: “You will receive a text. Your task is to continue this text. Just output your new text, nothing else. Here is the text:”.
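To make the chain design concrete, the sketch below wires these three instructions into a simple transmission loop. The `run_chain` function, the `generate` callable standing in for an LLM call, the way the instruction and text are joined, and the toy model are all hypothetical; only the instruction strings come from the text above.

```python
from typing import Callable, List

PROMPTS = {
    "rephrase": ("You will receive a text. Your task is to rephrase this text without "
                 "modifying its meaning. Just output your new text, nothing else. "
                 "Here is the text:"),
    "inspiration": ("You will receive a text. Your task is to create a new original text "
                    "by taking inspiration from this text. Just output your new text, "
                    "nothing else. Here is the text:"),
    "continue": ("You will receive a text. Your task is to continue this text. "
                 "Just output your new text, nothing else. Here is the text:"),
}


def run_chain(initial_text: str, task: str, generate: Callable[[str], str],
              n_generations: int) -> List[str]:
    """Return [initial text, generation 1, ..., generation n]."""
    texts = [initial_text]
    for _ in range(n_generations):
        # ASSUMPTION: instruction and previous text are joined by a single space.
        prompt = f"{PROMPTS[task]} {texts[-1]}"
        texts.append(generate(prompt))
    return texts


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call: returns the received text in upper case."""
    return prompt.split("Here is the text:", 1)[1].strip().upper()


history = run_chain("a short story", "rephrase", toy_model, n_generations=3)
print(history)
```

Each agent only sees the output of the previous agent, so any property measured on `history` can be tracked generation by generation.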

Examples of stories

Here we provide examples of stories that were given as input and stories that were generated in the last iteration of some chains. Table 1 shows one example for each task. Complete data can be found on the companion website (https://sites.google.com/view/telephone-game-llm) using the Data Explorer tool.

Measuring text properties

  • Toxicity. We assess the level of toxicity in generated texts using the Detoxify library, a classifier developed for the Jigsaw Toxic Comment Classification Challenges (see https://github.com/unitaryai/detoxify/tree/master). This classifier defines toxicity as the presence of rude, disrespectful, or unreasonable language in a text and assigns a probability score ranging from 0.0 (benign and non-toxic) to 1.0 (highly likely to be toxic). Trained on a large dataset of human-labeled comments from various online platforms, the classifier uses a transformer-based architecture to analyze the text’s context and meaning, identifying patterns indicative of toxicity.

  • Positivity. We employ the SentimentIntensityAnalyzer tool from the NLTK library to assess the positivity of generated texts. The tool is based on the Valence Aware Dictionary and sEntiment Reasoner (VADER) method [35], which is a lexicon and rule-based sentiment analysis tool specifically designed for social media data. It uses a combination of lexical features, such as words and their semantic orientation, to determine the overall sentiment of a text. In the VADER method, every word in the vocabulary is rated with respect to its positive or negative sentiment and the intensity of that sentiment. The SentimentIntensityAnalyzer uses this information to calculate a sentiment score for the text, ranging from -1.0 (highly negative) to 1.0 (highly positive).

  • Difficulty. We estimate the difficulty of generated texts using the Gunning-Fog index, computed with textstat. In this method, complex words are defined as those with three or more syllables, excluding proper nouns and familiar jargon, and not counting common suffixes as syllables.
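To make the difficulty measure concrete, here is a simplified, dependency-free sketch of the standard Gunning-Fog formula, 0.4 × (words per sentence + 100 × share of complex words). It is not the textstat implementation: the syllable counter is a naive vowel-group heuristic, and the exclusions for proper nouns, jargon and suffixes are not applied, so its scores will differ from those reported in the paper. The function names are illustrative.

```python
import re


def count_syllables(word: str) -> int:
    """Naive heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def gunning_fog(text: str) -> float:
    """Simplified Gunning-Fog index (no proper-noun or suffix handling)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # Complex words: three or more syllables under the naive counter.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    words_per_sentence = len(words) / len(sentences)
    complex_share = 100 * len(complex_words) / len(words)
    return 0.4 * (words_per_sentence + complex_share)
```

Short sentences made of short words score low, while long sentences with many polysyllabic words score high, which is the sense in which the index tracks difficulty.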

Table 1: Examples of starting stories and the corresponding final stories for chains with GPT3.5 on the three tasks
Description Story
Starting story You are certainly no friend of mine. You have accused me of lying about what you had said, and run away when I proved by actual quotation that the accusation was false. We both know that, even though you praise humility, you are not humble enough to apologize. You sneered at my combat experience, saying “A lot of so called “combat veterans” of Vietnam smelled more perfume from call girls than they smelled cordite” and “So when you were an Army officer in Vietnam (even though you spent your time inside of Danang airbase, with at least two battalions of infantry marines on the wire) what was the combat load for M-16 ammo per soldier on a daily patrol?” An actual friend would not have written such things. No, you are no friend of mine, you ignorant, arrogant, hypocritical blow-hard. You should be able to guess that I do not feel friendly towards you
Final story (Continue) Your laughter and light will always be a beacon of hope in our darkest days. Your presence may be gone, but your spirit will forever be alive in our hearts. We will carry on your legacy of kindness and love, spreading your positivity to all those we encounter. Your memory will never fade, as we hold you close in our thoughts and actions. Rest easy, knowing that you are surrounded by our love and gratitude. You may be gone, but you will never be forgotten. Until we meet again, dear friend, know that you are always with us.
Final story (Inspiration) Amidst the chaos of our busy lives, it is essential to pause and prioritize forming genuine connections with those around us. By being fully present and showing empathy, we can cultivate relationships that are deep and fulfilling. Authenticity is the key to creating bonds that bring true joy and contentment into our lives. Engaging in open and sincere dialogues, sharing our vulnerabilities, and striving to forge meaningful connections are all essential in nurturing authentic relationships. Let’s come together to create a community where empathy and mutual respect are the guiding principles, and where every interaction is infused with love and gratitude. These authentic connections serve as a support system rooted in compassion and generosity, uplifting and inspiring each other as we navigate the ups and downs of life. Together, we can weave a network of relationships that exude positivity and light, enriching our lives in meaningful and transformative ways.
Final story (Rephrase) I need to end our friendship since you won’t apologize for accusing me of dishonesty, ignoring evidence, and making disrespectful comments about my time in Vietnam. Your behavior shows a lack of empathy, arrogance, and insincerity, qualities that are not in line with those of a true friend.

Appendix B Additional figures and analyses

Figure 8: Fitted linear regressions used to compute attractor strength and position. For three tasks (Rephrase, Take Inspiration, Continue), five models, and four metrics (toxicity, positivity, difficulty, length), we plot (mean ± SE) the relationship between the metric value of the initial human-written text (input to the first agent) and the value of the final LLM-generated text (output of the last agent). A slope close to zero indicates strong attraction, while the value at the intersection with the diagonal captures the position of the attractor.
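The geometric reading of Figure 8 can be sketched in a few lines: fit a line to (initial value, final value) pairs, then solve for where it crosses the diagonal. This is an illustrative least-squares sketch with hypothetical function names. The exact strength formula from Section 3.4 is not reproduced here; only the slope is returned, since the caption states that a slope near zero indicates strong attraction.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def attractor_position(slope: float, intercept: float) -> float:
    """Intersection of the fitted line with the diagonal y = x."""
    if slope == 1.0:
        raise ValueError("line is parallel to the diagonal; no unique intersection")
    # Solve x = slope * x + intercept for x.
    return intercept / (1.0 - slope)
```

For instance, initial values 0.0, 0.5, 1.0 mapped to final values 0.2, 0.3, 0.4 give a slope of 0.2 and an intersection with the diagonal at 0.25.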

Statistical models

We performed statistical analyses using the Python package pymc [75] to fit Bayesian models.

  • Model 1 We fitted a model predicting the attractor strength (Figure 6) as a function of the Task, Model and Property: $Strength \sim \mathcal{N}(\mu, \sigma^{2})$

    where $\mu = \alpha_{Task} + \beta_{Model} + \gamma_{Property}$

    Priors for parameters $\alpha$, $\beta$ and $\gamma$ were standard normal distributions, and the prior for $\sigma$ was a standard half-normal distribution.

  • Model 2 For each Property, we fitted a model predicting the attractor position (Figure 6) as a function of the Task and Model: $Position_{property} \sim \mathcal{N}(\mu, \sigma^{2})$

    where $\mu = \alpha_{Task} + \beta_{Model}$

To determine the significance of the difference between estimated parameters, we computed the 95% credibility intervals of the difference by sampling from the posteriors. In Figures 9 and 10, we provide the matrices indicating the mean of this difference between estimated parameters. Stars indicate that the credibility interval does not contain 0.
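The comparison step can be illustrated with plain Python, given posterior samples for two parameters (for example, the effects of two tasks). This sketch uses an equal-tailed interval; whether the original analysis used equal-tailed or highest-density intervals is not stated, so that choice is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def credible_interval(samples: Sequence[float], level: float = 0.95) -> Tuple[float, float]:
    """Equal-tailed interval read off the sorted samples."""
    ordered = sorted(samples)
    n = len(ordered)
    lo_idx = int((1 - level) / 2 * (n - 1))
    hi_idx = int((1 + level) / 2 * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]


def compare_posteriors(
    post_a: Sequence[float], post_b: Sequence[float], level: float = 0.95
) -> Tuple[float, Tuple[float, float], bool]:
    """Return (mean difference, interval of the difference, starred flag)."""
    # Pair samples draw by draw, as they come from the same posterior draws.
    diffs: List[float] = [a - b for a, b in zip(post_a, post_b)]
    low, high = credible_interval(diffs, level)
    mean_diff = sum(diffs) / len(diffs)
    starred = not (low <= 0.0 <= high)  # interval excludes 0
    return mean_diff, (low, high), starred
```

The mean difference corresponds to a cell value in the matrices, and the `starred` flag corresponds to the star marking.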

Figure 9: Differences between posterior estimates for the statistical model predicting attractor Strength. Differences between estimates for the effect of Model (left), Measure (center) and Task (right) on attractor Strength. Values and colors represent the mean of the difference between estimated parameters. Stars indicate that the 95% credibility interval of this difference does not contain 0.
Figure 10: Differences between posterior estimates for the statistical model predicting attractor Position. Differences between estimates for the effect of Model (top row) and Task (bottom row) on attractor Position for toxicity (first column), positivity (second column), difficulty (third column) and length (fourth column). Values and colors represent the mean of the difference between estimated parameters. Stars indicate that the 95% credibility interval of this difference does not contain 0.

Discontinuities and collapsing behavior

In the main text, the experiment with the Mistral-7B model on the Continue task was analyzed after first filtering out the hashtags, as discussed in Appendix A. Given that this behavior is interesting in itself, we discuss it here in more detail.

Figure 11 shows the average length of text generated by the Mistral-7B model chain on the Continue task for five different seeds of the same story. We observe several discontinuities in the generated text length, i.e., iterations at which the length drastically increases or decreases. Notably, when the length decreases, it returns to the value it had before the first discontinuity, which suggests the existence of an attractor at this specific length. To better understand the cause of these discontinuities, Table 2 shows examples of stories generated before and after them (for seed number three in Figure 11). At generation 14, the model abruptly starts to generate many hashtags: 283, compared to 12 in the previous generation. At generation 45, the text has degraded into nothing but hashtags and brief descriptions. This reduction in text quality is reminiscent of the collapsing dynamics observed in iterative chains of LLMs in which each model is trained on the output of the previous one [65].
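A rough way to inspect this behavior in one's own chains is to count hashtags per generation and flag abrupt length changes. The sketch below is illustrative only: the regular expression, the helper names and the jump threshold `ratio` are assumptions, and it does not reproduce the filtering procedure of Appendix A.

```python
import re
from typing import List

HASHTAG = re.compile(r"#\w+")


def count_hashtags(text: str) -> int:
    """Number of '#word' tokens in the text."""
    return len(HASHTAG.findall(text))


def strip_hashtags(text: str) -> str:
    """Remove hashtags and collapse the leftover whitespace."""
    return " ".join(HASHTAG.sub("", text).split())


def find_discontinuities(lengths: List[int], ratio: float = 3.0) -> List[int]:
    """Indices t where lengths[t] and lengths[t-1] differ by a factor >= ratio."""
    jumps = []
    for t in range(1, len(lengths)):
        prev, cur = lengths[t - 1], lengths[t]
        if prev > 0 and cur > 0 and max(prev, cur) / min(prev, cur) >= ratio:
            jumps.append(t)
    return jumps
```

Applied to the per-generation lengths of a chain, `find_discontinuities` returns both the upward jumps and the later returns to the original length, since it compares lengths as a ratio in either direction.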

Figure 11: Discontinuities and collapse in the Mistral-7B model chain. The lengths of generated stories are shown (without filtering out the hashtags) for five chains starting with the same initial story. We observe discontinuities, where the length drastically increases or decreases. After decreasing, the length of the story goes back to the original length, suggesting the existence of an attractor.
Table 2: Examples of stories demonstrating the collapse and recovery in Mistral-7B chains on the Continue task (seed 2 from Figure 11)
Gen. Story
13 We are thrilled about the progress we’ve made in making Xiangqi more accessible for all. Let’s keep the conversation going and explore new ways to adapt the game for different abilities. Share your ideas, experiences, and success stories using the hashtags #CXAInclusiveXiangqi #XiangqiForAll #AccessibleXiangqi #XiangqiUnited #DisabilityInclusion #AdaptiveXiangqi #InclusiveGaming #AccessibleCommunity. Let’s continue to learn from each other and create a vibrant, inclusive Xiangqi community that celebrates diversity and welcomes everyone. #UnitedWeGame #AccessibleXiangqiJourney #TogetherWeCan #XiangqiForEveryone
14 Let’s exchange innovative ideas on modifying Xiangqi pieces, boards, and rules to accommodate various disabilities. #AdaptiveXiangqiDesigns #InclusiveXiangqiSolutions #TogetherWeAdapt #XiangqiEmpowerment #AccessibleXiangqiProgress #DisabilityFriendlyXiangqi #XiangqiInclusiveCommunity #BreakingBarriers #XiangqiForAllPlayers #AccessibleXiangqiChampions #XiangqiInclusionSuccessStories #XiangqiUnitedForAll #AccessibleXiangqiFuture #InclusiveXiangqiVision
(omitted 264 hashtags for clarity)
#XiangqiInclusiveGamingCommunityVision #XiangqiAccessibleGamingCommunityGrowth #XiangqiAccessibleGamingCommunityInnovation #XiangqiAccessibleGamingCommunityEmpowerment #XiangqiAccessibleGamingCommunityPassion
45 #DesignWithInclusiveDesignPhilosophyScaling: Embracing diversity and equality in design practices.
#DesignWithUserCenteredDesignPhilosophyScaling: Putting users first in design decisions and experiences.
#DesignWithInclusiveTechnologyPhilosophyScaling: Making technology accessible to all users, regardless of abilities.
#DesignWithDigitalInclusionPhilosophyScaling: Ensuring everyone has equal access to digital resources and services.
(omitted 176 lines for clarity).
#DesignWithUserTestingTrainingScaling: Scaling user testing training opportunities.
#DesignWithAssistiveTechnologyTrainingScaling: Expanding assistive technology training opportunities.
#DesignWithInclusive
49 #DesignWithGlobalAccessibilityInitiativesScaling: Expanding global accessibility initiatives and collaborations.
(omitted 7 lines for clarity)
#DesignWithInclusiveDesignTrendsScaling: Growing trends and innovations in inclusive design and accessibility.
#DesignWithInclusiveDesignResourcesScaling: Expanding resources for inclusive design and accessibility knowledge and tools.
Figure 12: Empirical validation of attractor position and strength estimation. To empirically verify that the method introduced in Section 3.4 makes accurate predictions, we used the first 10 generations of each chain to fit the linear regression between initial and final property values (a). We then used our method to estimate the attractors’ strength and position (b), and compared those predictions with the actual shifts in distribution observed after 50 generations (c). The grey area represents the initial distribution of the corresponding property, and colored lines show the distribution after 50 generations for each model. Crosses indicate the estimated positions of the theoretical attractors, and their size represents the attractor strength. In the fourth row, second column, one attractor was outside the range of represented values and is thus represented with "-> X".

Validation of attractors position and strength estimation

The method introduced in Section 3.4 gives the position and strength of a theoretical attractor (or theoretical fixed point). To validate our method, we verified that this theoretical prediction matches the actual data. To do so, we used the first 10 generations of each simulated chain to predict the strength and position of attractors for each task, model and property. We then compared this prediction with the actual properties of texts obtained after 50 generations. As shown in Figure 12, transmission chains shift the initial distribution of values in the direction of the predicted attractor. Moreover, the variance of the final distribution appears to reflect the predicted strength of the attractor. These results confirm that the method we introduce is indeed suited for estimating the strength and position of attractors.
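One way to see why a fit on early generations could predict later ones is the following idealized derivation. It assumes that the map from a property value to its value one block of generations later is exactly linear and unchanged from block to block, $x \mapsto a x + b$ with $|a| < 1$. This stationarity is a simplifying assumption introduced here for illustration, not a claim made by the analysis above.

```latex
% Fixed point of the linear map x -> a x + b (intersection with the diagonal):
x^{*} = \frac{b}{1 - a}
% Distance to the fixed point shrinks by a factor a at each application:
x_{k+1} - x^{*} = a\,x_{k} + b - \frac{b}{1 - a} = a\,(x_{k} - x^{*})
% Hence, after k applications:
x_{k} - x^{*} = a^{k}\,(x_{0} - x^{*})
```

Under this idealization, every initial value moves toward $x^{*}$, and the spread of values after $k$ blocks is scaled by $|a|^{k}$, so a slope closer to zero yields a narrower final distribution.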

Evolution of text properties for all initial stories

Here we provide figures showing the evolution of each of the four metrics for each model and each of the 20 initial stories. Lines represent the average over 5 seeds, and shaded areas represent the standard errors.

Figure 13: Evolution of text properties starting with Initial Text 1
Figure 14: Evolution of text properties starting with Initial Text 2
Figure 15: Evolution of text properties starting with Initial Text 3
Figure 16: Evolution of text properties starting with Initial Text 4
Figure 17: Evolution of text properties starting with Initial Text 5
Figure 18: Evolution of text properties starting with Initial Text 6
Figure 19: Evolution of text properties starting with Initial Text 7
Figure 20: Evolution of text properties starting with Initial Text 8
Figure 21: Evolution of text properties starting with Initial Text 9
Figure 22: Evolution of text properties starting with Initial Text 10
Figure 23: Evolution of text properties starting with Initial Text 11
Figure 24: Evolution of text properties starting with Initial Text 12
Figure 25: Evolution of text properties starting with Initial Text 13
Figure 26: Evolution of text properties starting with Initial Text 14
Figure 27: Evolution of text properties starting with Initial Text 15
Figure 28: Evolution of text properties starting with Initial Text 16
Figure 29: Evolution of text properties starting with Initial Text 17
Figure 30: Evolution of text properties starting with Initial Text 18
Figure 31: Evolution of text properties starting with Initial Text 19
Figure 32: Evolution of text properties starting with Initial Text 20