
Playing the System: Can Puzzle Players Teach us How to Solve Hard Problems?

Published: 19 April 2023
  Abstract

    With nearly three billion players, video games are more popular than ever. Casual puzzle games are among the most played categories. These games capitalize on the players’ analytical and problem-solving skills. Can we leverage these abilities to teach ourselves how to solve complex combinatorial problems? In this study, we harness the collective wisdom of millions of players to tackle the classical NP-hard problem of multiple sequence alignment, relevant to many areas of biology and medicine. We show that Borderlands Science players propose solutions to multiple sequence alignment tasks that perform as well as or better than standard approaches, while exploring a much larger area of the Pareto-optimal solution space. We also show that the strategies of the players, although highly heterogeneous, follow a collective logic that can be mimicked with behavioural cloning with minimal performance loss, allowing the players’ collective wisdom to be leveraged for the alignment of any sequences.

    1 Introduction

    In 2019, Americans spent an average of over 37 minutes a day playing video games, a figure that has only increased since then [Eddy 2022]. The time invested in gaming is usually seen as pure entertainment, like watching a movie, but what if the efforts of the players could be harnessed to accomplish useful tasks? This is the objective of "games-with-a-purpose", which pair entertainment with a computational task such as image labelling [Von Ahn 2006; von Ahn and Dabbish 2004].
    Nearly every game can be summarized as a series of tasks that the user has to solve. In many situations, players display inventiveness and creativity in solving these tasks [Eiben et al. 2012; Moffat et al. 2017; Osvath and Newport 2020]. Typically, games are designed to optimize the enjoyment of the user [Cristofori et al. 2018; Koepp et al. 1998]. However, when these tasks can be adapted to solve real-world problems [Von Ahn and Dabbish 2004], games become powerful vehicles to access the millions of hours of active problem-solving efforts conducted every week by the players [Waldispühl et al. 2020].
    Citizen science efforts, which aim to involve non-scientists in the solving of scientific problems, often struggle with finding enough participants to address large-scale problems. The potential contribution from the gaming community could bring a major paradigm change in the type and scale of problems that can be solved through this type of approach.
    Enter Science Discovery Games (SDGs). These games are specifically designed to help with analysis of scientific data and assist research projects. They rely on the intuition of humans (i.e., the players) to solve computational problems that are challenging for computers because of their complexity [Cooper et al. 2010; Kawrykow et al. 2012] or because the solution is based on human perception and agreement [Butyaev et al. 2022; Pandey et al. 2017; 2021].
    Among the important problems that classical algorithms struggle to solve are NP-hard problems, for which no polynomial-time algorithms are known. Many classical bioinformatics problems are NP-hard [Wang and Jiang 1994], including the multiple sequence alignment (MSA) problem. This problem, which has applications in fields ranging from biology to linguistics, is particularly difficult due to the lack of a well-defined ground truth, leading to many approaches being parameter-sensitive and difficult to generalize, or too inefficient for use on real-world problems. Many approaches have been proposed to tackle this problem [Chatzou et al. 2016], such as phylogeny-aware methods [Löytynoja and Goldman 2008] or non-coding multiple sequence alignment [Salama and Stekel 2013], and new methods are regularly published, because there is still significant room for improvement. In particular, there is room for human contribution, because the lack of a well-established optimal target state is less of a challenge for citizen scientists than it is for deterministic algorithms.
    The first citizen science game to tap into this potential was Phylo (2010) [Kawrykow et al. 2012], a sequence alignment puzzle game in which players could improve computer-generated alignments in a lightly gamified environment. This project showed that humans could significantly improve alignments, but the limited gamification led to the game being heavily expert-dominated, which reduced the opportunity to learn how the average player plays an alignment game.
    In order to bridge that gap by targeting a much wider group of potential players, Borderlands Science (see Figure 1) was released as a much more gamified mini-game inside Borderlands 3, a shooter-looter game played by millions [Szantner 2016]. To attract a more general audience, more accustomed to fast-paced games and less familiar with puzzles, the methods from Phylo were adapted to simplify and speed up puzzle solving. With over 3 million participants, 100 million games played, and a relatively even contribution across individual players, we now have enough data to thoroughly investigate how human players solve these puzzles, and how this can be leveraged for the broader problem of sequence alignment.

    1.1 Game design

    In Borderlands Science (Figure 2), the player is shown 7 to 12 columns of bricks. Each column represents a homologous DNA sequence fragment (note that this representation is the transposed version of the typical multiple alignment representation, where sequences constitute alignment rows). The number and length of sequences increase with the difficulty level. Each individual brick represents one of the four types of nucleotide bases in DNA, encoded with a specific tile and color. Each puzzle in Borderlands Science is made of fragments of microbial 16S ribosomal RNA gene sequences provided by the American Gut Project [McDonald et al. 2018]. In the initial configuration of the puzzle, the bricks are piled up at the bottom of the screen (see Figure 2), as though they were under the effect of gravity.
    The player is provided with a limited number of gap tokens, which they can insert between bricks to maximize the alignment of bricks against the guides on the left. These guides provide insight about the overall context of the alignment region the sequence fragments come from [Waldispühl et al. 2020]. The cost of adding a gap is conveyed to the player as resistance to the gravity effect. The main challenge of the game is to reach a target score, an integer set just above the nominal score achieved by a naive greedy solver with the same limited number of gap tokens, in order to move on to the next puzzle. We also display the highest score previously submitted to push participants to beat it. (See the official trailer at https://youtu.be/L_mH6Ak_Ny0.)
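    To make the scoring mechanics concrete, the following minimal Python sketch mirrors the reward structure just described: points for bricks matching the guide in each row, plus a bonus for completed rows. The reward constants and data layout are illustrative placeholders, not the game's actual values.

```python
from typing import List, Optional

def game_score(columns: List[List[Optional[str]]], guides: List[str],
               match_reward: int = 1, row_bonus: int = 2) -> int:
    """Score a puzzle state: `columns` holds one aligned sequence per column
    (with None for gap tiles), `guides` holds the target nucleotide per row.
    Reward constants are placeholders, not the game's real values."""
    score = 0
    for row, guide in enumerate(guides):
        hits = sum(1 for col in columns if row < len(col) and col[row] == guide)
        score += hits * match_reward
        if hits == len(columns):  # every column matches this row: full-row bonus
            score += row_bonus
    return score

# Example: two columns, three guide rows
print(game_score([['A', 'C', None], ['A', 'G', 'T']], ['A', 'C', 'T']))
```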
    Figure 1:
    Figure 1: Our concept: We embed a computational problem in a casual puzzle that can be understood and played with minimal training (i.e., gamification). The puzzles are massively distributed to players, through an online game. We collect the solutions to train a machine learning model that will capture the strategies developed by the players to solve the puzzles.
    Figure 2:
    Figure 2: Borderlands Science interface: The sequences to align are presented vertically with tiles of four colors representing the four nucleotides A,C,G,T. On the left, the guides represent the tiles to match in each row to collect points. The user must insert gaps (i.e., yellow tiles) to maximize the reward. The number of gaps is limited and full rows get an extra bonus.

    1.2 Hypotheses

    The strategies employed by the players to achieve this goal appear to be heterogeneous, without an obvious pattern. Are these strategies efficient (i.e., do the solutions simultaneously maximize the score and minimize the number of gaps used)? Can we learn new heuristics for solving the puzzles from the collected solutions?
    This article explores the potential of crowdsourcing human intuition to tackle NP-hard problems, focusing on the typical problem of multiple sequence alignment. We attempt to formalize, extract, and reproduce the players’ puzzle-solving strategies to show they match an effective and reproducible strategy. By establishing the meaningfulness of the solutions submitted by the players, we open an avenue for further work in human-inspired multiple sequence alignment algorithms, especially as new developments in reinforcement learning and Transformers-based methods unlock new possibilities for exploiting crowdsourcing results. We hypothesized that the wisdom of crowds could be an effective solution for the multiple sequence alignment task in particular, and for other types of NP-hard problems more generally. Specifically:
    H1: Human solutions use complex strategies that are not replicating basic heuristics. H1 aims to test whether the strategies used by the participants follow simple and intuitive rules. To address this question, we devise several bots simulating players that use simple explicit rules to solve the puzzles. We compare the solutions returned by the human participants to those from the bots to identify similarities and differences.
    H2: Human solutions perform at least as well as standard algorithms for the MSA problem. To test this hypothesis, we pit human solutions against several well-established and non-trivial algorithms designed to solve the multiple sequence alignment problem embedded in Borderlands Science. Again, we compare the performance of, and similarities between, solutions obtained from these algorithms and from humans.
    H3: We can learn and reproduce player strategies using behavioural cloning. This part of the work considers whether the decisions of the players are reproducible using behavioural cloning. We hypothesize that being able to reproduce the puzzle solutions obtained by the players is an indirect confirmation that the players’ strategies have features in common.
    We test these hypotheses and aim to establish the relevance of small, fast-paced puzzles, when previous work in this field typically relied on very complex puzzles, such as those in Phylo [Kawrykow et al. 2012].

    2 Related Work

    Understanding human abilities related to combinatorics and optimization problems has been of interest for the past few decades. In 1996, MacGregor and Ormerod found that human solutions to the Travelling Salesman Problem significantly outperformed solutions obtained using Nearest Neighbor, Largest Interior Angle or Convex Hull algorithms [MacGregor and Ormerod 1996]. They noticed that human decisions were not related to how state-of-the-art algorithms solve these problems, but were based on human perception. Acuña and Parada similarly observed that human performance surpassed not only a random approach, but also many common heuristics, and that humans can improve on the best existing solutions [Acuña and Parada 2010]. Hidalgo-Herrero et al. provide an analysis of humans solving the Knapsack and Vertex Cover problems [Hidalgo-Herrero et al. 2013], concluding that human solutions to these problems outperform genetic algorithms. They also observed that human performance decreases with increasing task complexity, and that children develop more diverse and interesting strategies than adults.
    Problem-solving tasks in games are good environments to mobilize the human knack for tackling complex problems. Gamification of scientific tasks has gained traction since the release of Foldit in 2008 [Cooper et al. 2010], with many science discovery games reaching hundreds of thousands of players, such as EteRNA [Lee et al. 2014], Galaxy Zoo, Eyewire, Project Discovery and, in particular, Phylo (2010) [Kawrykow et al. 2012], the first multiple sequence alignment SDG, which showed humans could improve existing alignments through participating in a game.
    James Surowiecki’s wisdom of crowds theory [Surowiecki 2005] suggests that a combination of solutions from different sources may be better than any individual solution. This theory has been described and applied in fields as diverse as business, economics and sociology [Surowiecki 2005], but has only recently been applied to computer science. In 2010, Yi et al. framed classic combinatorial problems such as the Minimum Spanning Tree and the Travelling Salesman Problem as puzzles divided into fragments solved by human participants. These fragments were then assembled into a global solution that outperformed the results of standard algorithms.
    The main limitation of this class of crowdsourcing initiatives is the reliance on a large number of participants to solve a single problem. There is thus a natural synergy with artificial intelligence methods that augment crowdsourced data or apply the knowledge gained from humans to other problems. This led to the idea of the wisdom of artificial crowds (WoAC), a strategy successfully applied to general computational problems such as the Travelling Salesman Problem, as well as to real-world games such as Sudoku [Redding et al. 2015; Yampolskiy and El-Barkouky 2011].
    The multiple sequence alignment problem, due to its importance in biology, has been studied for over 50 years, and hundreds of algorithms have been proposed to tackle it [Blanchette et al. 2004; Do et al. 2005; Morgenstern et al. 2005; Paten et al. 2008; Thompson et al. 1994]. While these methods are still very popular, an increasing number of new approaches involving machine learning have been published recently [Andreatta and Nielsen 2016; Kuang et al. 2020], with a small minority involving Reinforcement Learning (RL) [Ramakrishnan et al. 2018].
    In Reinforcement Learning, an agent interacts with an environment by following a policy in order to maximize its reward function [Sutton and Barto 2018]. An example of this class of method is the application of Q-learning to the MSA problem [Jafari et al. 2019]. Their results were later significantly improved with Deep Q-networks (DQN) [Jafari et al. 2020]. Another application of reinforcement learning to MSA is RLALIGN (2018), which is based on Asynchronous Advantage Actor Critic (A3C) [Ramakrishnan et al. 2018].
    The main disadvantage of the RL approach is the need to explicitly define the reward function, which can be quite complex [Song and Geyer 2015]. Alternatively, Behavioural Cloning (BC) can be used as an approach focused on the capture and reproduction of human abilities and sequences of actions [Sammut 2010]. This approach has been successfully applied to various video game tasks [Guss et al. 2019; Hester et al. 2018; Zhang et al. 2020]. A recent benchmarking paper covering 10 modern games showed that, despite low productivity compared to humans, agents can learn the basic strategies and rules, and emphasized that data quality rather than quantity matters for RL [Kanervisto et al. 2020].
    Game-related tasks are well suited for Reinforcement Learning because they typically involve a well-defined action space and reward system: these are key features of game design. Reinforcement Learning has important limitations: it requires a designed environment in which an agent can interact with a game, which can be costly to implement and maintain, and once the environment is in place, agent training can take thousands of years of game time.
    However, we know that humans can learn to perform tasks through imitation, and we can leverage this approach for automated learning. Applying this learning method to an autonomous agent is referred to as learning from demonstration (LfD) or Imitation Learning (IL) [Hussein et al. 2017]. The commonly used approaches within this paradigm can be divided into two broad categories: Behaviour Cloning (BC) and Inverse Reinforcement Learning (IRL). AI players trained on player behavior can solve a large number of different tasks [Hussein et al. 2017]. For example, using player experience as observational data, we can train a generative model to play Atari games, a common benchmark for RL approaches [Chen et al. 2021]. Pfau et al. used this approach to balance in-game parameters and classes in the game Aion. These methods can also be successfully applied in the real world, for example in autonomous vehicle systems [Le Mero et al. 2022].
    In this article, we focus on a Behaviour Cloning approach, because it tends to be both simple and effective in solving policy search problems [Codevilla et al. 2019; Farag 2019], and our main goal is to assess whether the player solutions can be mimicked, a task that does not require a complex model.
    Behavioural cloning [Bain and Sammut 1995; Daftry et al. 2016; Pomerleau 1989; Ross et al. 2011] is one of the main approaches for imitation learning. Rather than learning an optimal policy that maximizes the long-term cumulative rewards like in traditional reinforcement learning, a set of demonstrations from an expert is used as a base for learning.
    The notation is similar to that used in RL. Expert demonstrations are divided into state-action pairs, and supervised learning is applied as a classification or regression model. The loss function depends entirely on the application and the nature of the data. More formally, it can be written as:
    1. Collect demonstrations (trajectories \(\tau^*\)) from the expert.
    2. Treat the demonstrations as i.i.d. state-action pairs \(\left(s_{0}^{*}, a_{0}^{*}\right),\left(s_{1}^{*}, a_{1}^{*}\right), \ldots\)
    3. Learn a policy \(\pi_{\theta}\) using supervised learning by minimizing the loss function \(L(a^{*}, \pi_{\theta}(s))\).
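    In code, this procedure reduces to ordinary supervised learning over the recorded state-action pairs. The following PyTorch sketch illustrates the loop under the assumption of a discrete action space; `PolicyNet`, its dimensions, and the `demos` iterable are illustrative placeholders rather than the architectures used later in this paper.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Illustrative policy: maps a flattened state to action logits."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def train_bc(policy: nn.Module, demos, epochs: int = 10, lr: float = 1e-3):
    """Fit the policy to expert state-action pairs by minimizing
    L(a*, pi_theta(s)); `demos` yields (states, expert_actions) batches."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # plays the role of L in the steps above
    for _ in range(epochs):
        for states, expert_actions in demos:
            optimizer.zero_grad()
            loss = loss_fn(policy(states), expert_actions)
            loss.backward()
            optimizer.step()
    return policy
```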
    This approach has many attractive properties: specialized game modifications and environment creation are not needed, lengthy training is avoided, and the same method can be adapted and reused in different games. To date, this class of approaches has struggled to outperform human experts and requires a significant number of expert examples to achieve similar proficiency, although recent advancements combining simulation and optimization have yielded promising results [Chen et al. 2020]. However, when expert imitation is achieved, the positive outcomes can be significant: expert-level strategies can be computed much faster than human solutions, and capacity scales by adding computational resources rather than through the lengthy process of training new human experts.
    The current state of the art in both human solving of hard computational problems and reinforcement learning methods creates a remarkably favourable context to explore the synergy between these two fields. With Borderlands Science, we are given a unique opportunity to explore this synergy and fully unlock the potential of human contributions.

    3 Methods

    In this section, we describe the multiple sequence alignment problem, and the machine learning techniques we used to capture strategies from players.

    3.1 Data filtering

    The Borderlands Science game data consists of human gut microbe genome fragments, sequenced by the Microsetta Initiative, which were pre-aligned with PASTA [Mirarab et al. 2015]. The puzzles submitted to players are carved out of this pre-alignment. In the game, players optimize a bi-objective function: reaching the maximal number of matches (the score) with the limited number of gaps they can insert. This is the same bi-objective function traditional sequence alignment algorithms optimize, but powered by human intuition rather than a scoring matrix.
    For this paper, we focus on a sample of 1,000,000 randomly selected puzzle solutions, played between April 2020 and June 2021.
    To determine the value of a submitted solution, we approximate the Pareto front for each puzzle from all solutions collected for that puzzle (on average, we collected 45 solutions per puzzle). The Pareto front is the set of solutions to a puzzle that optimize our bi-objective function: a solution is Pareto-optimal if no other solution simultaneously matches or increases the score and matches or reduces the number of gaps. We estimate the quality of a solution from its distance to the Pareto front along the x axis (i.e., the score difference between the two).
    We aggregated puzzle solutions and extracted those close to the Pareto front, in order to only include the best human solutions. We define the proportional horizontal Pareto distance of a single solution to a puzzle from the score improvement of that solution over the worst human solution, divided by the improvement of the best (Pareto-optimal) score for that number of gaps over the worst human solution. Solutions with a distance over 0.7 were excluded from the data. This threshold was established through visual assessment of clustering outcomes by human experts. As a result, 53.4 percent of solutions were filtered out.
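    A minimal sketch of this filtering step, treating every solution to a puzzle as a (score, number of gaps) pair, is shown below. The exact normalization of the proportional distance is our reading of the definition above, and the toy data are illustrative.

```python
def pareto_front(solutions):
    """Solutions not dominated on (higher score, fewer gaps).
    Each solution is a (score, n_gaps) pair."""
    return [(s, g) for s, g in solutions
            if not any(s2 >= s and g2 <= g and (s2, g2) != (s, g)
                       for s2, g2 in solutions)]

def pareto_distance(score, gaps, solutions):
    """Proportional horizontal Pareto distance: 0 for a Pareto-optimal
    solution, 1 for the worst human solution (our reading of the text)."""
    worst = min(s for s, _ in solutions)
    best_at_gaps = max(s for s, g in solutions if g <= gaps)
    if best_at_gaps == worst:
        return 0.0
    return 1.0 - (score - worst) / (best_at_gaps - worst)

solutions = [(14, 6), (12, 5), (10, 4), (8, 5)]  # toy (score, n_gaps) data
print(pareto_front(solutions))
kept = [(s, g) for s, g in solutions
        if pareto_distance(s, g, solutions) <= 0.7]
```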

    3.2 Multiple sequence alignment

    3.2.1 Definition.

    A multiple sequence alignment can be described as a set of n sequences \(S_1, S_2, \ldots, S_n\), represented as the row vectors of a matrix A in which each element \(a_{i,j}\) comes from the alphabet of nucleotides plus the gap character (A, C, G, T, −). The input to the multiple sequence alignment problem is typically a set of ungapped sequences, and the output sequences contain gaps. This formulation allows us to treat the set of sequences as a matrix: the input matrix A is dense, and the output matrix A′ contains gap characters −, inserted in a specific arrangement to maximize some score, such as a phylogenetically-aware scoring scheme [Blanchette et al. 2004] or a simple sum-of-pairs scoring scheme [Sankoff et al. 1973]. The MSA problem is NP-hard for both classes of scoring scheme [Wang and Jiang 1994].
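    As a concrete example, the sum-of-pairs score of a candidate output matrix A′ can be computed column by column, as in the following sketch; the match, mismatch, and gap constants are illustrative, not a recommended parameterization.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as a list of equal-length
    gapped strings over {A, C, G, T, -}. Scoring constants are illustrative."""
    total = 0
    for col in zip(*alignment):          # iterate over alignment columns
        for a, b in combinations(col, 2):
            if a == '-' and b == '-':
                continue                 # gap-gap pairs are conventionally ignored
            elif a == '-' or b == '-':
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total

print(sum_of_pairs(["AC-GT", "ACGGT", "A-GGT"]))
```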
    The many methods proposed to tackle this problem [Blanchette et al. 2004; Do et al. 2005; Morgenstern et al. 2005; Paten et al. 2008; Thompson et al. 1994] tend to have different performance characteristics depending on the type, length and number of input sequences. In particular, PASTA [Mirarab et al. 2015] is specifically designed to handle large alignments and was used as the scaffold for the Borderlands Science puzzles. It should be noted that all these algorithms are heuristics with no guarantee of optimality, and that multiple sequence alignment remains an open problem.

    3.2.2 Algorithms used for comparison.

    The methods described above for solving the problem have different focuses and approaches. We chose several algorithms for comparison with our methods, in two categories corresponding to our hypotheses. The first category, which corresponds to hypothesis H2, contains traditional computational methods developed to tackle the MSA problem, included to benchmark the performance of our methods: dynamic programming (Needleman-Wunsch) [Needleman 1970], PASTA [Nguyen et al. 2015], profile alignment using HMMER [Eddy 1992], and a greedy algorithm. The second category, corresponding to hypothesis H1, provides context to interpret the particularity of player solutions. It includes methods built from basic heuristics that a human could apply to solve the game: the greedy algorithm, a Progressive Profile Strategy (PPS) algorithm, and a random algorithm. We designed the greedy and PPS algorithms ourselves. A summary of these methods is given in Table 1.
    Table 1:
    Name | Short Description | Reference | Hypothesis
    Random | Randomly places gaps | - | H1
    Progressive Profile Strategy (PPS) | Progressive profile alignment | - | H1
    Greedy | Chooses best place to insert a gap iteratively | - | H1 & H2
    Needleman-Wunsch (NW) | Dynamic programming | [Needleman 1970] | H2
    PASTA | Uses alignment and tree estimation, and HMMs | [Nguyen et al. 2015] | H2
    HMMER | Uses HMMs | [Eddy 1992] | H2
    Table 1: Summary of reference algorithms used for comparison.
    Our dynamic programming algorithm is inspired by the Needleman–Wunsch algorithm, a classic sequence alignment algorithm introduced in 1970 [Needleman 1970]. Unfortunately, its complexity is \(O(n^k)\), where n is the number of nucleotides per sequence and k is the number of sequences. Therefore, we use an adapted method that aligns each sequence to the guide while applying a maximum total width to the alignment. This type of algorithm has many equally optimal solutions (for each sequence). In order to satisfy the gap limit constraint of the Borderlands Science game, we take a two-step approach: first, we invert the sequences and the guide and remove the gap insertion penalty at the start (which now corresponds to the top of the puzzle), effectively accounting for the "gravity" effect of the puzzle. This is required because NW always fills the entire grid. After obtaining optimal solutions, we try different combinations of these optimal solutions and count the number of gaps until we find one that respects the gap limit. If none exists, we incrementally increase the gap limit until such a solution exists. However, due to the running time constraints of the dynamic programming algorithm on complex puzzles, a proportion of solutions are skipped.
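    The per-sequence step can be sketched as follows: a Needleman-Wunsch-style dynamic program that aligns one column of bricks against the guide, with free leading gaps to mimic gravity after inversion. Scoring constants are illustrative, and the subsequent search over combinations of per-sequence optima is omitted.

```python
def align_to_guide(seq: str, guide: str, match=1, mismatch=0, gap=-1):
    """DP aligning one brick column to the guide: bricks are matched to
    guide rows in order, with gap tiles inserted between them; gaps before
    the first brick are free, mimicking gravity after inversion."""
    n, m = len(seq), len(guide)
    NEG = float('-inf')
    # dp[i][j]: best score placing the first i bricks within the first j rows
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0] = [0] * (m + 1)          # free leading gaps
    for i in range(1, n + 1):
        for j in range(i, m + 1):  # at least i rows are needed for i bricks
            sub = match if seq[i - 1] == guide[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # place brick i on row j
                           dp[i][j - 1] + gap)       # gap tile on row j
    return dp[n][m]

print(align_to_guide("ACG", "AACGG"))
```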
    PASTA [Nguyen et al. 2015] (Practical Alignments using SATé and TrAnsitivity) allows the computation of MSA alignments for very large nucleotide datasets. Prior to aligning the sequences, PASTA estimates an alignment and a guide tree from a subset of the sequences using a very simple profile Hidden Markov Model (HMM)-based method. A set of HMMs is created from the alignment and the tree; the remaining sequences are aligned to each HMM, and the best one is used to update the alignment with the sequence. The PASTA alignment is taken directly from the uncollapsed alignment used to generate the puzzles, so it does not respect the gap limit imposed on the player solutions and the other algorithms.
    In order to use HMMer to solve our puzzles, we used the pyhmmer package. We first build a Hidden Markov Model (HMM) profile from our guides over the DNA alphabet, then use the HMMer trace aligner method. The main challenge was to fit the solution of the algorithm into the same grid that a player sees in Borderlands Science while using fewer gaps than the maximum that players are allowed. To do this, we varied the consensus impact scale between 0.8 and 0.95 to yield solutions that use fewer gaps; however, many solutions still do not respect the gap limit.
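    For orientation, a rough sketch of this pipeline with pyhmmer is shown below. It is based on the pyhmmer documentation; exact signatures may vary between versions, the file names are placeholders, and the consensus-scale tuning described above is omitted.

```python
import pyhmmer

alphabet = pyhmmer.easel.Alphabet.dna()
background = pyhmmer.plan7.Background(alphabet)
builder = pyhmmer.plan7.Builder(alphabet)

# Build a profile HMM from the guide alignment (file names are illustrative).
with pyhmmer.easel.MSAFile("guides.sto", digital=True, alphabet=alphabet) as f:
    guide_msa = f.read()
guide_msa.name = b"guides"
hmm, _, _ = builder.build_msa(guide_msa, background)

# Align the puzzle's sequence fragments to the profile via trace alignment.
with pyhmmer.easel.SequenceFile("fragments.fa", digital=True, alphabet=alphabet) as f:
    fragments = f.read_block()
aligner = pyhmmer.plan7.TraceAligner()
traces = aligner.compute_traces(hmm, fragments)
alignment = aligner.align_traces(hmm, fragments, traces)
```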
    Greedy algorithms, which optimize for immediate reward, can be effective approaches for solving the MSA problem [Zhang et al. 2000] despite their simplicity. Borderlands Science uses a naive greedy solver to set score targets for the players. Because all solutions submitted by players must outperform the greedy solution, that algorithm is intentionally simplistic. For the purpose of this study, we added stochasticity to explore the limits of its reach in the solution space, and simulate players using a strategy aiming for maximum reward after each move.
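    A sketch of this stochastic greedy player, reusing the illustrative scoring function from the game-design sketch above, might look as follows; the epsilon exploration parameter is a placeholder for the added stochasticity, not the game's actual solver.

```python
import random
from typing import List, Optional

Column = List[Optional[str]]

def score(columns: List[Column], guides: List[str]) -> int:
    # Same illustrative scoring as the earlier game-design sketch.
    total = 0
    for row, guide in enumerate(guides):
        hits = sum(1 for col in columns if row < len(col) and col[row] == guide)
        total += hits + (2 if hits == len(columns) else 0)
    return total

def greedy_solve(columns: List[Column], guides: List[str],
                 n_tokens: int, epsilon: float = 0.1) -> List[Column]:
    """Insert one gap (None) at a time at the position that most increases
    the score; with probability epsilon, a random legal move is played
    instead, giving the stochastic variant described above."""
    columns = [list(c) for c in columns]

    def try_move(move):
        i, j = move
        cand = [list(c) for c in columns]
        cand[i].insert(j, None)
        return score(cand, guides)

    for _ in range(n_tokens):
        moves = [(i, j) for i, col in enumerate(columns)
                 if len(col) < len(guides)          # column must fit the grid
                 for j in range(len(col) + 1)]
        if not moves:
            break
        move = (random.choice(moves) if random.random() < epsilon
                else max(moves, key=try_move))
        columns[move[0]].insert(move[1], None)
    return columns

print(greedy_solve([list("ACG"), list("AGG")], list("AACGG"), n_tokens=2))
```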
    The Progressive Profile Strategy (PPS) algorithm was inspired by Progressive MSA, combined with an effort to reproduce how players are usually observed to solve puzzles (see Supp. Mat.). In this method, first, a profile is created from the guide. We then start from the leftmost sequence and align it to the profile by trying all possible combinations of gaps and choosing the best one. Then we add the solution found for the first sequence to the profile, and move on to aligning the second sequence with it. We continue this until all sequences are aligned with the profile.
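    The following sketch outlines PPS with a simple count-based profile; the profile representation, scoring, and tie-breaking are illustrative simplifications of the procedure described above.

```python
from itertools import combinations_with_replacement
from collections import Counter

def pps_align(sequences, guide, max_gaps=3):
    """PPS sketch: a per-row base-count profile is seeded from the guide;
    each sequence is aligned by trying all gap placements and keeping the
    best, then folded back into the profile."""
    rows = len(guide)
    profile = [Counter({base: 1}) for base in guide]

    def profile_score(gapped):
        return sum(profile[i][c] for i, c in enumerate(gapped) if c != '-')

    aligned = []
    for seq in sequences:
        k = min(max(0, rows - len(seq)), max_gaps)  # gap tokens for this sequence
        candidates = []
        for pos in combinations_with_replacement(range(len(seq) + 1), k):
            chars = list(seq)
            for p in sorted(pos, reverse=True):
                chars.insert(p, '-')
            chars += ['-'] * (rows - len(chars))    # pad to the grid height
            candidates.append(''.join(chars))
        best = max(candidates, key=profile_score)
        aligned.append(best)
        for i, c in enumerate(best):                # fold into the profile
            if c != '-':
                profile[i][c] += 1
    return aligned

print(pps_align(["ACG", "AGG"], "AACGG"))
```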
    We also compared the trained models to a random player that randomly selects gap locations based solely on the size of the puzzle and the number of gaps added by the human players who solved the same problem. We use this algorithm to simulate a player without a strategy.
    We tested whether the strategy used by the players is distinct from these methods, and whether it outperforms them on relevant criteria.

    3.3 Imitation learning

    3.3.1 Data representation.

    Figure 3:
    Figure 3: Different representations of the data for image2image and seq2seq tasks. The main differences in terms of input, output data and architecture of the model are presented.
    We consider two ways of representing our data (see Figure 3). First, we can represent the puzzles as a matrix, where each of the four nucleotides is associated with a channel (much as red, green, and blue are three distinct channels in traditional computer representations of images). The input is thus a 4 × L × N matrix, where L is the number of rows and N the number of columns, and the output is a two-dimensional L × N matrix storing gap positions. We call this an image2image task.
    The second option consists in representing each individual sequence as a string. In this case, the output is the same string in which gaps may be inserted. We call this a seq2seq task.
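    The two representations can be produced from the same puzzle as in the sketch below, where a puzzle is given as a list of column strings; the zero-padding convention for short columns is an illustrative choice.

```python
import numpy as np

NUC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def to_image(columns):
    """image2image input: a 4 x L x N one-hot tensor (four nucleotide
    channels, L rows, N sequence columns); short columns are zero-padded."""
    L = max(len(c) for c in columns)
    x = np.zeros((4, L, len(columns)), dtype=np.float32)
    for j, col in enumerate(columns):
        for i, base in enumerate(col):
            x[NUC[base], i, j] = 1.0
    return x

def to_seqs(columns):
    """seq2seq input: each column as a plain string; the target is the
    same string with gap characters inserted by the model."""
    return list(columns)

print(to_image(["ACGT", "ACG", "TTGA"]).shape)  # (4, 4, 3)
```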

    3.3.2 Predictive models.

    Fully Convolutional Network. FCNs were originally presented as a solution to the segmentation problem [Long et al. 2015]. Unlike classical CNNs, which rely on a fully connected layer to obtain a fixed-length feature vector for classification after the convolutional layer, FCNs can accept input images of any size because of the set of deconvolutional layers. This architecture has been shown to outperform the state-of-the-art without further machinery in several settings [Li et al. 2016; Lu et al. 2019; Papandreou et al. 2015].
    Transformers. Transformers are a state-of-the-art model built on self-attention [Vaswani et al. 2017]. Recent research has shown that self-attention is an effective way to model text sequences [Devlin et al. 2018; Vaswani et al. 2017]. A Transformer consists of an encoder and a decoder: the encoder converts the input information into one or more vectors, and the decoder generates output information from these vectors. The input data passes through the layers of the encoder: some are standard fully-connected layers, and others are residual connections similar to ResNet [He et al. 2016]. The most novel component of the encoder is the multi-head attention layer, a special layer that allows each input vector to interact with the others through the attention mechanism, instead of passing through a hidden state as in an RNN [Sherstinsky 2020] or a CNN [LeCun et al. 1989]. Its inputs are query vectors and several key-value pairs that carry positional information. The resulting embedding is then used by the decoder. Internal attention layers in the decoder work slightly differently from those in the encoder: in the decoder, the internal attention layer can only attend to previous positions in the output sentence.

    3.3.3 Architecture.

    We implemented our Transformer model with 8 layers, an embedding size of 64, and 8 attention heads, using a batch size of 100 puzzles, and trained it for 100 epochs. To optimize the Transformer-based model, we used Adam optimization with a learning rate of 0.004 and a weight decay of 0.004. We also set the attention dropout to 0.1. In order to get the most out of the self-attention mechanism, we selected the "axial attention" architecture proposed in [Ho et al. 2019]. This allows embedding a 2D tensor with an autoregressive model while remaining economical in computation.
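    The sketch below shows how these reported hyperparameters map onto a training setup. Since a full axial-attention implementation is beyond the scope of this example, a standard Transformer encoder stands in for the axial-attention model, and the class is a stub rather than our actual architecture.

```python
import torch
import torch.nn as nn

class AxialTransformerStub(nn.Module):
    """Placeholder for an axial-attention model (Ho et al. 2019); the real
    model factorizes attention over the rows and columns of the puzzle,
    whereas this stub uses standard self-attention."""
    def __init__(self, n_layers=8, embed_dim=64, n_heads=8, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):           # x: (batch, tokens, embed_dim)
        return self.encoder(x)

model = AxialTransformerStub()
optimizer = torch.optim.Adam(model.parameters(), lr=4e-3, weight_decay=4e-3)
BATCH_SIZE, EPOCHS = 100, 100       # 100 puzzles per batch, 100 epochs
```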
    By contrast, our FCN model consists of stacked blocks of 2D convolution layers, Dropout and BatchNormalization regularization, and a ReLU activation function. We have three layers in the encoder and three layers in the decoder, with the following parameters: filters = 4, kernel size = 1, strides = 1. Each position in the output can be associated with one of two classes, indicating whether that position corresponds to a gap. In experimenting with this model, we focused on the more suitable image2image approach, leveraging the encoder-decoder approach presented by Noh et al. The input is a puzzle of variable size, and the output is a label assigned to each pixel from one of two classes, determining whether the pixel moved or stayed in place. To optimize the FCN model, we used a learning rate of 0.0001 and 100 epochs. Optimization was performed with Adam, minimizing the cross-entropy loss, which is computed against the flattened real player solution image.
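    A compact sketch of this architecture in PyTorch is shown below; the dropout rate is not reported above and is a placeholder, and the block layout is our reading of the description rather than the exact training code.

```python
import torch.nn as nn

def block(in_ch: int, out_ch: int) -> nn.Sequential:
    # Conv2d + BatchNorm + Dropout + ReLU; filters=4, kernel=1, stride=1
    # as reported above (the dropout rate is a placeholder).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1),
        nn.BatchNorm2d(out_ch),
        nn.Dropout(0.1),
        nn.ReLU(),
    )

class GapFCN(nn.Module):
    """Encoder-decoder FCN: 4-channel one-hot puzzle in, per-position
    two-class logits (gap / no gap) out; 1x1 convolutions keep the model
    agnostic to the puzzle's height and width."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(block(4, 4), block(4, 4), block(4, 4))
        self.decoder = nn.Sequential(block(4, 4), block(4, 4),
                                     nn.Conv2d(4, 2, kernel_size=1))

    def forward(self, x):                       # x: (batch, 4, L, N)
        return self.decoder(self.encoder(x))    # (batch, 2, L, N)
```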

    3.4 Evaluation

    Choosing a metric that measures the similarity between outputs is challenging, because puzzles have different sizes and numbers of gaps. To obtain a complete picture of the similarity between solutions, we selected three metrics: cosine similarity, Hamming distance, and Jensen-Shannon distance.
    The first metric, cosine similarity [Nguyen and Bai 2010; Xia et al. 2015], measures the angle between two vectors in multidimensional space, and is useful for evaluating sparse data such as ours because it ignores 0-0 matches. Cosine similarity is widely used across ML areas such as text classification [Li and Han 2013] and image analysis [Nguyen and Bai 2010], and also sees frequent use in RL tasks [Rao et al. 2017]. We also use cosine similarity to measure the distance between each algorithm or human solution and the consensus of gaps across human players.
    We also report the Hamming distance [Norouzi et al. 2012] between predicted sequences and player solutions, which counts the positions at which the corresponding characters differ. The Hamming distance is normalized by the string length to generalize to any puzzle size. This metric is useful because it is easily interpreted by humans: visually, it is very easy to spot mismatched elements in two lines of text. Hamming distance is often used for language models [Li et al. 2021; Rao et al. 2020], and these approaches also extend to genetic sequences represented as language-model inputs [Lupo et al. 2022; Pinheiro et al. 2012].
    Finally, we report the Jensen-Shannon distance [Fuglede and Topsoe 2004], which measures the difference between probability distributions; here, we compare the distribution of gaps between predictions and real solutions. If two distributions are identical, the Jensen-Shannon distance between them is 0. This metric is more comprehensive, as it considers the full real and predicted distributions. Jensen-Shannon divergence has been used in many analyses; for example, [Wang and Dunbrack Jr 2004] use it in a scoring scheme for sequence alignment profiles, and it has also been used for DNA analysis [Grosse et al. 2002].
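    All three metrics are available in SciPy and can be applied to the flattened binary gap matrices, as in this sketch (inputs are assumed to be equal-length 0/1 vectors):

```python
import numpy as np
from scipy.spatial.distance import cosine, hamming, jensenshannon

def compare(pred: np.ndarray, player: np.ndarray) -> dict:
    """Compare two flattened binary gap vectors (1 = gap at that position)."""
    return {
        "cosine_similarity": 1.0 - cosine(pred, player),
        # scipy's hamming is already normalized by vector length
        "hamming_distance": hamming(pred, player),
        # jensenshannon normalizes its inputs into probability distributions
        "jensen_shannon": jensenshannon(pred, player),
    }

print(compare(np.array([1, 0, 0, 1, 0]), np.array([1, 0, 1, 0, 0])))
```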
    We also report a similarity to the centroid of player solutions. The centroid is a representative solution, identified as the player solution with the highest cosine similarity to the consensus of gaps across all player solutions for a specific puzzle. We use the Levenshtein distance to compare the predicted solutions to the centroid of all player solutions for that puzzle. The Levenshtein distance measures the similarity between two sequences as the number of insertions, deletions, and substitutions needed to transform one sequence into the other, and is often used in natural language processing tasks to compare two pieces of text. By using the Levenshtein distance to the centroid, we can determine the similarity of the predictions to the most popular player solution, without being heavily influenced by outliers. As this information is not used at all in the training of the agents, it also serves as a sanity check.
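    The Levenshtein distance itself is the standard edit-distance dynamic program, sketched below for completeness:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("AC-GT", "ACG-T"))  # 2
```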
    For the methods that do not adhere to the gap limit provided to the players, such as HMMer and PASTA, we can compare performance by counting the incompatibilities with the player solutions. Incompatibilities are defined as the number of gap deletions required to transform a player’s solution into the algorithm’s solution using only gap insertions afterwards. This shows how similar the player solutions are to a particular algorithm’s strategy, regardless of the number of gaps the solutions use. The assumption is that if there are few incompatibilities, the player solutions are likely incorporating part of the strategy, but may not have enough gap insertions to fully solve the puzzle. Figure 10 shows the mean number of incompatibilities across all puzzle difficulties and highlights the similarities between the various algorithms.
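    Under the simplifying assumption that gap positions can be compared on the fixed puzzle grid (ignoring the coordinate shifts that insertions cause in a full edit model), counting incompatibilities reduces to a set difference, as in this sketch:

```python
def incompatibilities(player_gaps: set, algo_gaps: set) -> int:
    """Count player gaps absent from the algorithm's solution: the gap
    deletions needed before pure insertions can reach the algorithm's
    alignment. Gaps are (column, row) grid positions; comparing positions
    directly is a simplification of the full edit model."""
    return len(player_gaps - algo_gaps)

print(incompatibilities({(0, 2), (1, 4)}, {(0, 2), (1, 5), (2, 1)}))  # 1
```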
    We also measure the mean number of gaps per puzzle and the distribution of gaps across the columns of each puzzle. These values are split into three groups of puzzle difficulties (which correspond to the size and complexity of the puzzle) to allow us to compare the performance of the different methods across different levels of difficulty. Analyzing the distribution of gaps can help us understand the strategies that lead to the results we observe.
    All metrics are measured between the flattened forms of the algorithm or agent-produced and player-solved puzzle matrices.

    4 Results

    4.1 Data from Borderlands Science

    Table 2:
    Difficulty | Mean used gaps | Number of puzzles
    1 | 5.68 | 5443
    2 | 5.62 | 3627
    3 | 5.03 | 2108
    4 | 6.60 | 2001
    5 | 7.95 | 2292
    6 | 9.76 | 3753
    7 | 10.34 | 2833
    8 | 11.37 | 2851
    Table 2: Level-by-level description of the Borderlands Science data the agents were trained on, with the average number of gaps inserted per puzzle for each level.
    Table 3:
    Method | CosSimilarity | HammingDist | JensenShannonDist
    Transformer | 0.79 | 0.07 | 0.40
    FCN | 0.77 | 0.08 | 0.41
    NW | 0.77 | 0.04 | 0.37
    Greedy | 0.64 | 0.04 | 0.30
    PPS | 0.58 | 0.11 | 0.44
    Random | 0.55 | 0.07 | 0.44
    Table 3: The three similarity metrics computed on the final state were obtained with seven reference algorithms. We report the similarity/distance between an algorithm’s solution to a puzzle and the Pareto-optimality-filtered player solutions for this puzzle. The optimal result is 1 for cosine similarity, and 0 for the two distance metrics.
    Figure 4:
    Figure 4: The average game score obtained by the players (real) and models (i.e., reference algorithms) for puzzles at increasing difficulty levels. (*) Results for the Needleman-Wunsch (NW) algorithm are computed for 10,000 puzzles on which our implementation could return a result.
    Figure 5:
    Figure 5: Cosine similarity to averaged player solutions. (*) Results for the Needleman-Wunsch (NW) algorithm are computed for 10,000 puzzles on which our implementation could return a result.
    Figure 6:
    Figure 6: Levenshtein distance from the player solution centroid. (*) Results for the Needleman-Wunsch (NW) algorithm are computed for 10,000 puzzles on which our implementation could return a result.
    Figure 7:
    Figure 7: Solution examples: Greedy, HMMer, PPS, Random, NW, Transformer, FCN, and the average human solution. The average human solution panel indicates, for each position, how often it was chosen as a move.
    Figure 8:
    Figure 8: Comparison of the number of gaps per puzzle difficulty category (1-2, 3-4-5, 6-7-8) to identify solution strategies between the player solutions, ML models and naive algorithm.
    Figure 9:
    Figure 9: Comparison of the number of inserted gaps per column per puzzle for each puzzle difficulty category (1-2, 3-4-5, 6-7-8) to identify solution strategies between the player solutions, ML models and naive algorithm.
    Figure 10:
    Figure 10: The similarity of solutions from reference algorithms to human solutions (i.e., mean player solution) using a customized edit distance. Our distance reports the average number of incompatible gaps in the solutions of the reference algorithms that must be removed to reach the human solutions. We report the statistics for all difficulty levels together.
    The Borderlands Science data was processed as described in section 3.1. Table 2 presents a detailed description of this dataset. It contains a total of 25,000 puzzles and 1,145,001 solutions.

    4.2 Testing Hypothesis 1

    To test whether the player strategies were significantly different from simple heuristics, we solved the puzzles with a greedy player, a Progressive Profile Strategy (PPS) algorithm, and a random player. We compared these algorithmic solutions to human solutions with the methods listed in section 3.4.
    First, when comparing solutions, we observe that the PPS, greedy and random players achieve lower game scores than humans (Figure 4). They also generate final states that are less similar to the averaged human solutions (Figure 5) and further from the human solution centroid (Figure 6) than the Needleman-Wunsch algorithm. Overall, this indicates that the similarity between human solutions and PPS, greedy and random solutions is low.
    Second, we compare strategies. We observe that PPS places a similar number of gaps per puzzle as the human players (near the maximum allowed) for all difficulty levels (Figure 8). However, the column placement patterns differ: humans build more columns with one gap than columns with zero, whereas PPS builds more columns with zero gaps than columns with one (Figure 9). The greedy gap placement behaviour also differs significantly from that of humans. An example of the difference in gap placement patterns can be seen in Figure 7.
    Conclusion: Our results indicate that both in terms of strategy and performance, there is a low similarity between human players and basic heuristics.

    4.3 Testing Hypothesis 2

    To test whether the player strategies provide a satisfying solution to the multiple sequence alignment problem, we added standard algorithms, such as a modified Needleman-Wunsch and a greedy algorithm, to our comparison set.
    Since these two methods could be modified to accommodate the gap limit in the game (see Methods), we can compare them to human performance in terms of game score, a score obtained primarily from correctly matching nucleotides to guides. We observe that Needleman-Wunsch outperforms the average player for the first three difficulty levels (the smallest puzzles), but that as the problem gets more complex, it falls behind human solutions (Figure 4). This suggests that players use a strategy that allows them to quickly identify a near-optimal solution. When the combinatorial complexity of the puzzle grows, an exhaustive search is no longer feasible and players clearly outperform this optimization method. The greedy algorithm is outperformed overall by the average human.
    Because some other standard algorithms, such as HMMer and PASTA, cannot accommodate the gap limit imposed on the players and cannot easily be modified to do so (as we did for Needleman-Wunsch), we also performed a separate investigation of the compatibility of human solutions with the outputs of the advanced and widely-used algorithms HMMer [Eddy 1992] and PASTA [Nguyen et al. 2015] (see section 3.4).
    In this investigation (Figure 10), we estimated the average number of gaps placed by players that are not compatible with the software solution. We observed that the behavior of HMMer is highly compatible with that of players: the gaps inserted by players are overwhelmingly gaps also inserted by HMMer. By contrast, PASTA showed significantly more disagreement with player solutions, in large part because it is very gap-averse (i.e., PASTA tends to avoid opening new gaps in sequences that have none). This is consistent with the role of PASTA in the Borderlands Science pipeline: it builds a very tight scaffold, and players then add missing gaps to improve the alignment. This, however, comes at the cost of a much poorer score for PASTA.
    Conclusion: In the context of the game, humans appear to outperform traditional algorithms for puzzles above the easiest difficulty levels, and the moves made by human players are compatible with gap-friendly methods such as HMMer and complementary to gap-averse methods such as PASTA. Interestingly, the solutions of the Needleman-Wunsch optimization algorithm bear similarities with those of humans. Yet, the player strategy becomes more efficient at identifying a good solution as the difficulty increases.

    4.4 Testing Hypothesis 3

    To test whether it was possible to accurately learn from the human solutions, we leveraged two well-researched behaviour cloning approaches.
    We observed that Transformers and FCN achieve high cosine similarity to the human consensus (Figure 5) and low Levenshtein distance to the human centroid (Figure 6). The cloned models obtain game scores (Figure 4) comparable to humans, but slightly worse, notably because they place fewer gaps (Figure 8), limiting themselves to high-consensus gaps. In other words, after the gaps that most humans agree on have been added, adding more gaps increases the distance to the majority of humans. This can be a strength in the context of sequence alignment, where gap insertions and extensions are usually limited by some form of penalty to avoid having too many gaps. The very small difference in score, in light of the significant difference in gap insertions, indicates that the additional gaps inserted by humans provide diminishing returns, suggesting the FCN and Transformer’s gap-efficiency is a strength.
    It should also be noted that the agents obtained this high game score despite not perfectly copying player behavior. This indicates that even when an agent fails to correctly imitate the players, it still provides a solution that obtains a decent score, suggesting it presents a valid solution to the general problem of multiple sequence alignment.
    In terms of strategy, the cloned models produce nearly exactly as many columns with one gap as humans over the three difficulty categories, and their gap placement patterns become more similar to those of humans as the complexity of the problem increases (and as human solutions progressively outperform Needleman-Wunsch).
    Conclusion: We can successfully capture strategies employed by humans using behaviour cloning techniques. Both behaviour cloning methods we present accurately mimic the strategies of human players, with Transformers slightly out-performing FCN.

    5 Discussion

    We have shown that all three hypotheses presented in this paper are corroborated by experimental results: the player solutions are original, high-quality and mimickable.
    By simultaneously demonstrating that the player strategies are original (H1) and that they produce outputs comparable in quality to state of the art algorithms (H2), we established the relevance of training an agent to learn from these players on Borderlands Science. This confirms our hypothesis based on previous successes in the solving of computational problems by humans [Acuña and Parada 2010; MacGregor and Ormerod 1996; Redding et al. 2015; Yampolskiy and El-Barkouky 2011]. Notably, humans out-performed greedy algorithms on Borderlands Science as they did in [Acuña and Parada 2010], and we confirmed the presence of a perceptible regular pattern in how humans solve these tasks, as in [MacGregor and Ormerod 1996].
    We also demonstrated that accurately learning from players was achievable (H3). Our results show that behavioral cloning can be an effective approach for a puzzle game such as Borderlands Science, confirming our hypothesis based on its performance on Atari [Zhang et al. 2020]. In particular, we discovered that learning from the consensus of players leads to a gap-efficient agent able to achieve scores comparable to humans while excluding gaps that provide diminishing returns, a particularity that is highly relevant for tasks such as multiple sequence alignment.
    A preliminary analysis of the moves played by our model reveals that the strategy captured from the players aims at finding a specific trade-off between the number and length of gaps (where the length of a gap is the number of adjacent gap tiles). Notably, the determination of a gap penalty scheme is one of the most sensitive parameters in biological sequence alignment [Vingron and Waterman 1994]. Hence, this work offers promising information for the design of new sequence alignment algorithms.
    There are two main application paths for the methods we have presented. The most obvious one is to leverage the solutions from the players by applying the strategy they taught us to new sequences, effectively generalizing the results of the Borderlands Science initiative and taking full advantage of the data, limiting the impact of the significant overhead cost of the project. Another promising avenue is to use the reinforcement learning agents as a partner for the players, for example by providing hints, thus creating a fully-fledged Human-in-the-loop (HITL) system to leverage both human and machine intelligence for solving combinatorial problems.
    From a technical perspective, both machine learning frameworks yielded comparable and positive results. Yet, it appears that transformers trained with the seq2seq framing outperform FCNs trained on the full image on two of our three metrics. This could be significant in terms of indicating a better framing of the problem, but it could also be due to the high performance of state-of-the-art transformer-based methods. We leave a full investigation of this question to future work.

    5.1 Limitations

    First, the algorithmic methods presented here as benchmarks are limited. Typical multiple sequence alignment methods do not usually have a hard gap limit and integrating that constraint can be difficult: our modified Needleman-Wunsch was only able to terminate for 40% of the puzzles. We only reported results for the problems it was able to terminate on. Greedy algorithms work well with gap limits since they optimize step by step but they are rarely optimal for MSAs. We were unable to integrate a gap limit to PASTA and HMMer and had to limit ourselves to assessing their compatibility to human solutions. The bottom line here is that humans and algorithms solve the MSA problem with different approaches and a perfect benchmarking of one against the other is not possible.
    Second, while the purpose of the approach we present is to apply the strategies learned from human players to any dataset, its performance on a dataset of non-microbial sequences has not yet been tested. However, we have evidence that, at least in the contexts in which this approach has been tested, it presents a reasonable solution to the MSA problem, which seems to indicate that the agents learn general strategies that should apply to any nucleic acid sequence.
    Third, a limitation of the Borderlands Science game is that the players are not explicitly told to optimize a bi-objective function. They have a limited number of gaps, but they have no incentive not to use all the gaps they are given. We alleviate this limitation by showing players different numbers of gap tokens for the same puzzle, and by filtering player output for solutions near the Pareto front, which results in a large enough set of nearly optimal solutions.
    Also, Pareto-optimality, which we use here as a proxy to identify good solutions to the MSA problem, is not guaranteed to provide the best solution to this problem. It is not impossible that there exists a better way to filter this data and extract the solutions that constitute the best strategies for tackling the MSA problem. However, this limitation is mitigated by the logical relevance of Pareto-optimality to the problem at hand; while we cannot guarantee the player solutions are globally optimal, we have shown strong evidence that they generally outperform computational methods on at least one important evaluation criterion, which is sufficient to support the claim that these solutions provide valuable information that is worth learning to reproduce (see H1).
    Another typical limitation of a large-scale participative science approach is the potential presence of disruptive behavior and its handling [Pfau and Malaka 2019]. In Borderlands Science, several steps were taken to limit the impact of such behavior: the requirement of reaching the par score forces players to at least attempt to solve the puzzle, and solutions that are too distant from the consensus are removed. We are inclined to trust this consensus between players because the main feedback we received from our interactions with the player base was excitement about contributing to science. Finally, the main rewards offered to players, in-game cosmetics, can only be received once per level, which reduces the long-term value of replaying the game and thus disincentivizes botting.
    Additionally, the machine learning methods presented in this article do not cover the entirety of the state of the art for imitation learning. We intentionally selected simpler models that are known to work well for this type of problem, rather than more complex models such as DQN, in order to assess whether player solutions could be mimicked with a simple model. It is possible that alternative methods outperform the ones presented here. Nevertheless, we reserve the investigation of alternative strategies for future work, as recent work has confirmed it is possible to train an agent to play a game at an experienced level with behavioural cloning [Vinyals et al. 2019], and this is the extent of the evidence we needed for the claims presented in this work.
    Finally, the data presented in this study is difficult to acquire. Assembling a million solutions from players requires selecting the right problem, adding an adequate level of gamification, and finding a way to massively distribute the game. The overhead cost is significant. However, the methods presented in this paper represent a promising avenue to make full use of the output by widening its range of applications, further justifying that overhead cost.

    6 Conclusion

    In this work, we showed that Borderlands Science players leveraged heterogeneous and efficient strategies and that these strategies could be mimicked with machine learning. This contribution further supports the relevance of scientific discovery games to tackle scientific data analysis challenges by offering an avenue to apply these results to new data. Yet, the main novelty of this contribution is the successful application of this strategy to an NP-hard computational problem, charting a promising course for solving complex problems with the wisdom of the crowd through massive-scale citizen science games.

    Acknowledgments

    The authors want to recognize the contribution of the entire Borderlands Science (BLS) consortium* to this project, which includes (but is not limited to) the design, implementation, production, promotion of the BLS game and preparation of the data sets. The team is composed of members from Gearbox Studio Québec and Gearbox Software (Gearbox), Massively Multiplayer Online Science (MMOS), The Microsetta Initiative at the University of California at San Diego (UCSD), and the School of Computer Science at McGill University (McGill).
    The BLS consortium is made of (by alphabetical order): David Bélanger (Gearbox), Alexis Belley Duffault (Gearbox), Mathieu Blanchette (McGill), Michael Bouffard (Gearbox), Amélie Brouillette (Gearbox), Alexander Butyaev (McGill), Eddie Cai (McGill), Sébastien Caisse (Gearbox), Joshua Davidson (Gearbox), Chrisostomos Drogaris (McGill), Kornel Erhart (MMOS), Mathieu Falaise (Gearbox), Fanny Pelletier (Gearbox), Vincent Fiset (Gearbox), Olivier Gagné Houle (Gearbox), Parham Ghasemloo (McGill), Steven Hebert (Gearbox), Dan Hewitt (Gearbox), Jonathan Huot (Gearbox), Raphaël Jean (Gearbox), Timothy Keding (McGill), Seung Kim (Gearbox), Rob Knight (UCSD), Daniel McDonald (UCSD), Jonathan Moreau-Genest (Gearbox), Julien Mounthanyvong (McGill), Renata Mutalova (McGill), David Najjab (Gearbox), Elena Nazarova (McGill), Randy Pitchford (Gearbox), Steve Prince (Gearbox), Gabriel Richard (Gearbox), Ludger Saintélien (Gearbox), Roman Sarrazin-Gendron (McGill), Hugues Thibodeau (Gearbox), Attila Szantner (MMOS), Jérôme Waldispühl (McGill), Jiayue Zheng (McGill), and Yuxue Zhu (McGill).
    Additionally, this project would not have been possible without the participation of the millions of Borderlands Science players; the results presented in this paper are the direct product of their contributions.
    Finally, the authors want to thank Genome Canada and Génome Québec for their financial support to this project through the Genomic Applications Partnership Program (GAPP).

    Supplementary Material

    MP4 File (3544548.3581375-video-preview.mp4)
    Video Preview

    References

    [1]
    Daniel E Acuña and Víctor Parada. 2010. People efficiently explore the solution space of the computationally intractable traveling salesman problem to find near-optimal tours. PloS one 5, 7 (2010), e11685.
    [2]
    Massimo Andreatta and Morten Nielsen. 2016. Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics 32, 4 (2016), 511–517.
    [3]
    Michael Bain and Claude Sammut. 1995. A Framework for Behavioural Cloning. In Machine Intelligence 15. 103–129.
    [4]
    Mathieu Blanchette, W James Kent, Cathy Riemer, Laura Elnitski, Arian FA Smit, Krishna M Roskin, Robert Baertsch, Kate Rosenbloom, Hiram Clawson, Eric D Green, 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome research 14, 4 (2004), 708–715.
    [5]
    Alexander Butyaev, Chrisostomos Drogaris, Olivier Tremblay-Savard, and Jérôme Waldispühl. 2022. Human-supervised clustering of multidimensional data using crowdsourcing. Royal Society open science 9, 5 (2022), 211189.
    [6]
    Maria Chatzou, Cedrik Magis, Jia-Ming Chang, Carsten Kemena, Giovanni Bussotti, Ionas Erb, and Cedric Notredame. 2016. Multiple sequence alignment modeling: methods and applications. Briefings in bioinformatics 17, 6 (2016), 1009–1023.
    [7]
    Brian Chen, Siddhant Tandon, David Gorsich, Alex Gorodetsky, and Shravan Veerapaneni. 2021. Behavioral Cloning in Atari Games Using a Combined Variational Autoencoder and Predictor Model. In 2021 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2077–2084.
    [8]
    Xinyue Chen, Zijian Zhou, Zheng Wang, Che Wang, Yanqiu Wu, and Keith Ross. 2020. BAIL: Best-action imitation learning for batch deep reinforcement learning. Advances in Neural Information Processing Systems 33 (2020), 18353–18363.
    [9]
    Felipe Codevilla, Eder Santana, Antonio M López, and Adrien Gaidon. 2019. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9329–9338.
    [10]
    Seth Cooper, Firas Khatib, Adrien Treuille, Janos Barbero, Jeehyung Lee, Michael Beenen, Andrew Leaver-Fay, David Baker, Zoran Popović, 2010. Predicting protein structures with a multiplayer online game. Nature 466, 7307 (2010), 756–760.
    [11]
    Irene Cristofori, Carola Salvi, Mark Beeman, and Jordan Grafman. 2018. The effects of expected reward on creative problem solving. Cognitive, Affective, & Behavioral Neuroscience 18, 5 (2018), 925–931.
    [12]
    Shreyansh Daftry, J Andrew Bagnell, and Martial Hebert. 2016. Learning transferable policies for monocular reactive mav control. In International Symposium on Experimental Robotics. Springer, 3–11.
    [13]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
    [14]
    Chuong B Do, Mahathi SP Mahabhashyam, Michael Brudno, and Serafim Batzoglou. 2005. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome research 15, 2 (2005), 330–340.
    [15]
    Sean Eddy. 1992. HMMER user’s guide. Department of Genetics, Washington University School of Medicine 2, 1 (1992), 13.
    [16]
    Sean Eddy. 2022. 2022 Essential Facts About the Video Game Industry. Entertainment Software Association (2022).
    [17]
    Christopher B Eiben, Justin B Siegel, Jacob B Bale, Seth Cooper, Firas Khatib, Betty W Shen, Barry L Stoddard, Zoran Popovic, and David Baker. 2012. Increased Diels-Alderase activity through backbone remodeling guided by Foldit players. Nature biotechnology 30, 2 (2012), 190–192.
    [18]
    Wael Farag. 2019. Cloning safe driving behavior for self-driving cars using convolutional neural networks. Recent Patents on Computer Science 12, 2 (2019), 120–127.
    [19]
    Bent Fuglede and Flemming Topsoe. 2004. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory (ISIT 2004). Proceedings. IEEE, 31.
    [20]
    Ivo Grosse, Pedro Bernaola-Galván, Pedro Carpena, Ramón Román-Roldán, Jose Oliver, and H Eugene Stanley. 2002. Analysis of symbolic sequences using the Jensen-Shannon divergence. Physical Review E 65, 4 (2002), 041905.
    [21]
    William H Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov. 2019. Minerl: A large-scale dataset of minecraft demonstrations. arXiv preprint arXiv:1907.13440 (2019).
    [22]
    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
    [23]
    Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, 2018. Deep q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
    [24]
    Mercedes Hidalgo-Herrero, Pablo Rabanal, Ismael Rodriguez, and Fernando Rubio. 2013. Comparing problem solving strategies for NP-hard optimization problems. Fundamenta Informaticae 124, 1-2 (2013), 1–25.
    [25]
    Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. 2019. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180 (2019).
    [26]
    Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. 2017. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR) 50, 2 (2017), 1–35.
    [27]
    Reza Jafari, Mohammad Masoud Javidi, and Marjan Kuchaki Rafsanjani. 2019. Using deep reinforcement learning approach for solving the multiple sequence alignment problem. SN Applied Sciences 1, 6 (2019), 1–12.
    [28]
    Reza Jafari, Hamed Tabrizchi, Marjan Kuchaki Rafsanjani, and M.M. Javidi. 2020. A Distributed Deep Q-learning based Model for Solving the Multiple Sequence Alignment Problem.
    [29]
    Anssi Kanervisto, Joonas Pussinen, and Ville Hautamäki. 2020. Benchmarking end-to-end behavioural cloning on video games. In 2020 IEEE conference on games (CoG). IEEE, 558–565.
    [30]
    Alexander Kawrykow, Gary Roumanis, Alfred Kam, Daniel Kwak, Clarence Leung, Chu Wu, Eleyine Zarour, Phylo players, Luis Sarmenta, Mathieu Blanchette, 2012. Phylo: a citizen science approach for improving multiple sequence alignment. PloS one 7, 3 (2012), e31362.
    [31]
    Matthias J Koepp, Roger N Gunn, Andrew D Lawrence, Vincent J Cunningham, Alain Dagher, Tasmin Jones, David J Brooks, Christopher J Bench, and PM Grasby. 1998. Evidence for striatal dopamine release during a video game. Nature 393, 6682 (1998), 266–268.
    [32]
    Mengmeng Kuang, Yong Liu, and Lufei Gao. 2020. DLPAlign: A Deep Learning based Progressive Alignment Method for Multiple Protein Sequences. In CSBio’20: Proceedings of the Eleventh International Conference on Computational Systems-Biology and Bioinformatics. 83–92.
    [33]
    Luc Le Mero, Dewei Yi, Mehrdad Dianati, and Alexandros Mouzakitis. 2022. A survey on imitation learning techniques for end-to-end autonomous vehicles. IEEE Transactions on Intelligent Transportation Systems (2022).
    [34]
    Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1, 4 (1989), 541–551.
    [35]
    Jeehyung Lee, Wipapat Kladwang, Minjae Lee, Daniel Cantu, Martin Azizyan, Hanjoo Kim, Alex Limpaecher, Snehal Gaikwad, Sungroh Yoon, Adrien Treuille, 2014. RNA design rules from a massive open laboratory. Proceedings of the National Academy of Sciences 111, 6 (2014), 2122–2127.
    [36]
    Baoli Li and Liping Han. 2013. Distance weighted cosine similarity measure for text classification. In International conference on intelligent data engineering and automated learning. Springer, 611–618.
    [37]
    Bo Li, Tianlei Zhang, and Tian Xia. 2016. Vehicle detection from 3d lidar using fully convolutional network. arXiv preprint arXiv:1608.07916 (2016).
    [38]
    Hong-Liang Li, Yi-He Pang, and Bin Liu. 2021. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic acids research 49, 22 (2021), e129–e129.
    [39]
    Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3431–3440.
    [40]
    Ari Löytynoja and Nick Goldman. 2008. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320, 5883 (2008), 1632–1635.
    [41]
    Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, and Junjie Yan. 2019. Grid r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7363–7372.
    [42]
    Umberto Lupo, Damiano Sgarbossa, and Anne-Florence Bitbol. 2022. Protein language models trained on multiple sequence alignments learn phylogenetic relationships. bioRxiv (2022).
    [43]
    James N MacGregor and Tom Ormerod. 1996. Human performance on the traveling salesman problem. Perception & psychophysics 58, 4 (1996), 527–539.
    [44]
    Daniel McDonald, Embriette Hyde, Justine W Debelius, James T Morton, Antonio Gonzalez, Gail Ackermann, Alexander A Aksenov, Bahar Behsaz, Caitriona Brennan, Yingfeng Chen, 2018. American gut: an open platform for citizen science microbiome research. Msystems 3, 3 (2018), e00031–18.
    [45]
    Siavash Mirarab, Nam Nguyen, Sheng Guo, Li-San Wang, Junhyong Kim, and Tandy Warnow. 2015. PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. Journal of Computational Biology 22, 5 (2015), 377–386.
    [46]
    David C Moffat, William Crombie, and Olga Shabalina. 2017. Some video games can increase the player’s creativity. International Journal of Game-Based Learning (IJGBL) 7, 2 (2017), 35–46.
    [47]
    Burkhard Morgenstern, Nadine Werner, Sonja J Prohaska, Rasmus Steinkamp, Isabelle Schneider, Amarendran R Subramanian, Peter F Stadler, and Jan Weyer-Menkhoff. 2005. Multiple sequence alignment with user-defined constraints at GOBICS. Bioinformatics 21, 7 (2005), 1271–1273.
    [48]
    Saul B Needleman and Christian D Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 3 (1970), 443–453.
    [49]
    Hieu V Nguyen and Li Bai. 2010. Cosine similarity metric learning for face verification. In Asian conference on computer vision. Springer, 709–720.
    [50]
    Nam-phuong D Nguyen, Siavash Mirarab, Keerthana Kumar, and Tandy Warnow. 2015. Ultra-large alignments using phylogeny-aware profiles. Genome biology 16, 1 (2015), 1–15.
    [51]
    Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. 2015. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision. 1520–1528.
    [52]
    Mohammad Norouzi, David J Fleet, and Russ R Salakhutdinov. 2012. Hamming distance metric learning. In Advances in neural information processing systems. 1061–1069.
    [53]
    Csaba Osvath and Erica Newport. 2020. The Creative Gamer: Creativity and Creative Problem Solving in Video Games. (2020).
    [54]
    Vineet Pandey, Amnon Amir, Justine Debelius, Embriette R Hyde, Tomasz Kosciolek, Rob Knight, and Scott Klemmer. 2017. Gut instinct: Creating scientific theories with online learners. In Proceedings of the 2017 CHI conference on human factors in computing systems. 6825–6836.
    [55]
    Vineet Pandey, Tushar Koul, Chen Yang, Daniel McDonald, Mad Price Ball, Bastian Greshake Tzovaras, Rob Knight, and Scott Klemmer. 2021. Galileo: Citizen-led experimentation using a social computing system. In Proceedings of the 2021 CHI conference on human factors in computing systems. 1–14.
    [56]
    George Papandreou, Liang-Chieh Chen, Kevin P Murphy, and Alan L Yuille. 2015. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE international conference on computer vision. 1742–1750.
    [57]
    Benedict Paten, Javier Herrero, Kathryn Beal, Stephen Fitzgerald, and Ewan Birney. 2008. Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome research 18, 11 (2008), 1814–1828.
    [58]
    Johannes Pfau, Antonios Liapis, Georg Volkmar, Georgios N Yannakakis, and Rainer Malaka. 2020. Dungeons & replicants: automated game balancing via deep player behavior modeling. In 2020 IEEE Conference on Games (CoG). IEEE, 431–438.
    [59]
    Johannes Pfau and Rainer Malaka. 2019. Can You Rely on Human Computation? A Large-scale Analysis of Disruptive Behavior in Games With A Purpose. In Extended Abstracts of the Annual Symposium on Computer-Human Interaction in Play Companion Extended Abstracts. 605–610.
    [60]
    Aluísio Pinheiro, Hildete Prisco Pinheiro, and Pranab Kumar Sen. 2012. The use of Hamming distance in bioinformatics. In Handbook of Statistics. Vol. 28. Elsevier, 129–162.
    [61]
    Dean A Pomerleau. 1989. ALVINN: An autonomous land vehicle in a neural network. Technical Report. Carnegie Mellon University, Pittsburgh, PA.
    [62]
    Ramchalam Kinattinkara Ramakrishnan, Jaspal Singh, and Mathieu Blanchette. 2018. Rlalign: a reinforcement learning approach for multiple sequence alignment. In 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE). IEEE, 61–66.
    [63]
    Roshan Rao, Joshua Meier, Tom Sercu, Sergey Ovchinnikov, and Alexander Rives. 2020. Transformer protein language models are unsupervised structure learners. bioRxiv (2020).
    [64]
    Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Attention-aware deep reinforcement learning for video face recognition. In Proceedings of the IEEE international conference on computer vision. 3931–3940.
    [65]
    J Redding, Jacob Schreiver, C Shrum, A Lauf, and R Yampolskiy. 2015. Solving NP-hard number matrix games with Wisdom of Artificial Crowds. In 2015 Computer Games: AI, Animation, Mobile, Multimedia, Educational and Serious Games (CGAMES). IEEE, 38–43.
    [66]
    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 627–635.
    [67]
    Rafik A Salama and Dov J Stekel. 2013. A non-independent energy-based multiple sequence alignment improves prediction of transcription factor binding sites. Bioinformatics 29, 21 (2013), 2699–2704.
    [68]
    Claude Sammut. 2010. Behavioral Cloning. In Encyclopedia of Machine Learning. Springer, Boston, MA.
    [69]
    David Sankoff, Cristiane Morel, and Robert J Cedergren. 1973. Evolution of 5S RNA and the non-randomness of base replacement. Nature New Biology 245, 147 (1973), 232–234.
    [70]
    Alex Sherstinsky. 2020. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena 404 (2020), 132306.
    [71]
    Seungmoon Song and Hartmut Geyer. 2015. A neural circuitry that emphasizes spinal feedback generates diverse behaviours of human locomotion. The Journal of physiology 593, 16 (2015), 3493–3511.
    [72]
    James Surowiecki. 2005. The wisdom of crowds. Anchor.
    [73]
    Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
    [74]
    Attila Szantner. 2016. Massively Multiplayer Online Science. In Levelling Up: The Cultural Impact of Contemporary Videogames. Brill, 103–110.
    [75]
    Julie D Thompson, Desmond G Higgins, and Toby J Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research 22, 22 (1994), 4673–4680.
    [76]
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
    [77]
    Martin Vingron and Michael S Waterman. 1994. Sequence alignment and penalty choice: Review of concepts, case studies and implications. Journal of molecular biology 235, 1 (1994), 1–12.
    [78]
    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 7782 (2019), 350–354.
    [79]
    Luis Von Ahn. 2006. Games with a purpose. Computer 39, 6 (2006), 92–94.
    [80]
    Luis von Ahn and Laura Dabbish. 2004. Labeling Images with a Computer Game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Vienna, Austria) (CHI ’04). Association for Computing Machinery, New York, NY, USA, 319–326. https://doi.org/10.1145/985692.985733
    [82]
    Jérôme Waldispühl, Attila Szantner, Rob Knight, Sébastien Caisse, and Randy Pitchford. 2020. Leveling up citizen science. Nature Biotechnology 38, 10 (2020), 1124–1126.
    [83]
    Guoli Wang and Roland L Dunbrack Jr. 2004. Scoring profile-to-profile sequence alignments. Protein Science 13, 6 (2004), 1612–1626.
    [84]
    Lusheng Wang and Tao Jiang. 1994. On the complexity of multiple sequence alignment. Journal of computational biology 1, 4 (1994), 337–348.
    [85]
    Peipei Xia, Li Zhang, and Fanzhang Li. 2015. Learning similarity with cosine similarity ensemble. Information Sciences 307 (2015), 39–52.
    [86]
    Roman V Yampolskiy and Ahmed El-Barkouky. 2011. Wisdom of artificial crowds algorithm for solving NP-hard problems. International Journal of Bio-inspired computation 3, 6 (2011), 358–369.
    [87]
    Sheng Kung Michael Yi, Mark Steyvers, Michael D Lee, and Matthew J Dry. 2012. The wisdom of the crowd in combinatorial problems. Cognitive science 36, 3 (2012), 452–470.
    [88]
    Ruohan Zhang, Calen Walshe, Zhuode Liu, Lin Guan, Karl Muller, Jake Whritner, Luxin Zhang, Mary Hayhoe, and Dana Ballard. 2020. Atari-head: Atari human eye-tracking and demonstration dataset. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 6811–6820.
    [89]
    Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller. 2000. A greedy algorithm for aligning DNA sequences. Journal of Computational biology 7, 1-2 (2000), 203–214.

    Cited By

    • (2024) Improving microbial phylogeny with citizen science within a mass-market video game. Nature Biotechnology. https://doi.org/10.1038/s41587-024-02175-6. Online publication date: 15-Apr-2024.
    • (2024) Leadership Through Citizen Science Games. Journal of Leadership Studies 17, 4 (2024), 32–37. https://doi.org/10.1002/jls.21878. Online publication date: 2-Mar-2024.
    • (2023) Player-Guided AI outperforms standard AI in Sequence Alignment Puzzles. In Proceedings of The ACM Collective Intelligence Conference, 53–62. https://doi.org/10.1145/3582269.3615597. Online publication date: 6-Nov-2023.

      Published In

      CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems
      April 2023
      14911 pages
      ISBN:9781450394215
      DOI:10.1145/3544548
      This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 19 April 2023

      Author Tags

      1. behavioural cloning
      2. fully convolutional networks
      3. multiple sequence alignment
      4. reinforcement learning
      5. transformers
      6. video games

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate 6,199 of 26,314 submissions, 24%
