\mdfdefinestyle

MyFramelinecolor=black, outerlinewidth=2pt, roundcorner=20pt, innertopmargin=4pt, innerbottommargin=4pt, innerrightmargin=4pt, innerleftmargin=4pt, leftmargin = 4pt, rightmargin = 4pt backgroundcolor=gray

Combinatorial Reasoning: Selecting Reasons
in Generative AI Pipelines via Combinatorial Optimization

Mert Esencan1    Tarun Advaith Kumar1    Ata Akbari Asanjan2,3,4   
P. Aaron Lott2,4
   Masoud Mohseni5    Can Unlu1    Davide Venturelli2,4    Alan Ho1,6
\affiliations1Icosa Computing Inc., New York, USA
2NASA ARC - Quantum Artificial Intelligence Laboratory (QuAIL), Moffett Field, California, USA
3NASA ARC - Data Sciences Group, Moffett Field, California, USA
4USRA Research Institute for Advanced Computer Science (RIACS), Moffett Field, California, USA
5LSIP, Hewlett Packard Labs, Milpitas, CA, USA
6DataStax, Santa Clara, California, USA
\emails{mert, tarun, can}@icosacomputing.com
{aakbariasanjan, plott, dventurelli}@usra.edu
alan.h@datastax.com
Abstract

Recent Large Language Models (LLMs) have demonstrated impressive capabilities at tasks that require human intelligence and are a significant step towards human-like artificial intelligence (AI). Yet the performance of LLMs at reasoning tasks have been subpar and the reasoning capability of LLMs is a matter of significant debate. While it has been shown that the choice of the prompting technique to the LLM can alter its performance on a multitude of tasks, including reasoning, the best performing techniques require human-made prompts with the knowledge of the tasks at hand. We introduce a framework for what we call Combinatorial Reasoning (CR), a fully-automated prompting method, where reasons are sampled from an LLM pipeline and mapped into a Quadratic Unconstrained Binary Optimization (QUBO) problem. The framework investigates whether QUBO solutions can be profitably used to select a useful subset of the reasons to construct a Chain-of-Thought style prompt. We explore the acceleration of CR with specialized solvers. We also investigate the performance of simpler zero-shot strategies such as linear majority rule or random selection of reasons. Our preliminary study indicates that coupling a combinatorial solver to generative AI pipelines is an interesting avenue for AI reasoning and elucidates design principles for future CR methods.

1 Introduction

The advent of auto-regressive architectures, notably modern LLMs (?), mark a significant development in the pursuit of human-like AI. These models exhibit a profound capacity for generating human-like responses, positioning them as impressive tools for information processing. However, despite their extensive training and large parameter counts, LLMs inherently lack robust mechanisms for deep reasoning and strategic planning (??), which are necessary for applications demanding high-level cognitive functions. Moreover, the same architecture makes LLMs prone to hallucination—defined as generation that is nonsensical or unfaithful to source material (?).

One approach to handling these limitations is to provide additional retrieved context to reduce incorrect generations. Techniques such as retrieval augmented generation (RAG) query a vector database to retrieve source material before generating the LLM response (?). This approach is particularly suitable for knowledge intensive tasks but does not generalize well to reasoning-intensive tasks.

A parallel area of research is improvements to prompt engineering and response decoding. A new method dubbed Chain of Thought (CoT) (?) concatenates hand-annotated example responses with reasoning to the query to create the prompt. The responses from the LLM mimic the examples and contain a ‘reasoning path’ followed by the answer. Another method developed as a complementary decoding approach is Self-Consistency, with the idea that marginalizing over several reasoning paths provides the best possible response (?). However, these approaches rely heavily on human annotations and the same static examples may not be relevant to different queries.

To address these limitations and inspired by the conjecture that the human brain performs gradient-free optimization for reasoning and decision-making (?), we propose that integrating combinatorial optimization strategies within LLM frameworks could advance their reasoning capabilities and make them more adept at handling tasks requiring strategic thought.

Our research proposes the integration of an external reasoning engine that interfaces with existing LLM pipelines to fully automate the creation of CoT style prompts. As the reasoning engine sits outside the LLM black box, our work is not an attempt to change the foundational auto-regressive architecture of LLMs but a proposed tool to analyze and possibly augment their reasoning faculties through automated prompt engineering. By employing combinatorial optimization, the engine generates structured prompts that guide the LLM towards the correct answer. Our work intersects two distinct fields - generative AI and probabilistic combinatorial optimization - to tackle human level reasoning tasks. We construct a first of a kind LLM pipeline with physics-inspired solvers and benchmark the pipeline across a variety of well known Natural Language Processing (NLP) reasoning benchmarks.

Refer to caption
Figure 1: Workflow for Combinatorial Reasoning. The initial prompt is processed by the LLM N𝑁Nitalic_N times and the answers are filtered through a semantic matching procedure to produce answers with distinct reasons. The ensemble is mapped into a QUBO problem solved by an Ising machine. The final solution determines a set of reasons to be added to the prompt for a final LLM call that determines the final answer.

In the following sections, we review the state of art and then present our Combinatorial Reasoning (CR) framework as a versatile technique that leverages a probabilistic combinatorial optimizer to construct a Chain-of-Though style prompt with no human intervention. Our proof-of-concept results demonstrate that in some cases CR achieves improvements over other zero-shot prompting strategies on a few reasoning tasks from BigBench-Hard, and human-level reasoning performance on several reasoning tasks.

In the conclusions, we will discuss the lessons learned, how the CR framework can be further optimized beyond our preliminary baseline experiments and outline multiple promising research avenues.

2 Preliminaries and Prior Work

2.1 Large Language Models

Large Language Models (LLMs) are machine learning models that are trained to receive and recognize text as input and produce text as output. Distinguished from simpler language models by their immense parameter count, these models are capable of general purpose language processing tasks. GPT 3.5-turbo, the LLM used in our experiments, is part of a series of models developed by OpenAI for generating human-like natural language text (?). Many LLMs including GPT 3.5-turbo are capable of receiving and following a set of system instructions while responding to a query. System instructions differ from the user content that queries an LLM for a response, and we refer to the latter as a ‘prompt’ throughout this work. System instructions define the characteristics of the language model’s output, and constrain it to fit requested behavior (?). For our experiments, we use temperature sampling - a popular method for decoding LLMs in text generation tasks. Adjusting the temperature causes the probability distribution of each subsequent token to be modified, affecting the variability of the responses (?). High sampling temperatures result in diversity while lower temperatures provide more reproducible generations.

2.2 Reasoning in Large Language Models

There have been many papers that suggest that LLMs can indeed reason (??). For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results. However, these results are not a good indicator for the autonomous reasoning capabilities of the model. In each case, the benchmarks are performed using in-context learning, with few-shot (specific examplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.

The latest language models do not report the zero-shot performance on these benchmark as in seen Table 1 since the performance is likely poorer than those with manual prompts. Thus we believe the next milestone for LLMs is automatic prompt generation with correct reasoning.

Gemini Ultra GPT-4 LLama3 70B
MMLU 90.04% CoT@32 86.4% 5-shot 79.5% 5-shot
GSM8K 94.4% Maj1@32 92% 5-Shot CoT 93.0 8-shot
MATH 53.2% 4-shot 50.4 4-shot
BIG-Bench-Hard 83.6% 3-shot 81.3 3-shot, CoT
DROP 82.4% Variable shot 80.9 3-shot 79.7 3-shot,F1
HellaSwag 87.8% 10-shot 95.3% 10-shot
WinoGrande 87.5% 5-shot 81.3% 5-shot
AI2 Reasoning 96.3% 25-shot 93% 25-shot
Table 1: Summary of recent reasoning benchmarks on LLMs (?). Note that reported results are all dependent on some form of In Context Learning or Chain of Thought (CoT)

The main inspiration for our work comes from Yan LeCun’s review (?) which suggests multiple models need to work together to emulate general intelligence and that human brain possibly calculates a “cost function” for reasoning in a gradient-free manner - similar to combinatorial optimization.

Chain of Thought

As a natural extension of the in-context few shot learning to facilitate reasoning, the Google Brain team developed few-shot CoT (?). The aim is to augment language models with the ability to ‘reason’ in intermediate steps before providing an answer. This approach requires manual demos (or exemplars): labeled question answer pairs containing intermediate reasoning steps that lead to the final correct answer. Inspired by this, (?) developed “zero-shot CoT” as an approach to induce LLMs to answer with intermediate reasoning step without manual demos. Simply put, appending the phrase “Let’s think step by step” to a particular query improved performance.

Self-Consistency

Self-Consistency was introduced by Google’s Deepmind Team as an improved decoding approach for CoT style prompts (?). Instead of greedy decoding, the authors suggest collecting samples at non-zero temperatures and selecting the most occurring answer. This approach lends itself to an intuitive interpretation - reasoning problems admit multiple correct reasoning paths that lead to the unique right answer but incorrect reasoning paths lead to different incorrect answers. Self-consistency can also be viewed as marginalization over latent tokens to produce a better answer.

Universal Self-Adaptive Prompting (USP)

The Universal Self-Adaptive Prompting approach introduces a novel way of generating automatically designed prompts to improve decoding efficiency (?). This approach involves the use of unlabeled questions to generate a set of prompt and response pairs. These prompt and response pairs are concatenated to form a collection of pseudo-demos similar in style to the manual demos used in Chain of Thought. For a given question, a selection algorithm picks a subset of this collection - based on computed metrics such as diversity and confidence. These selected demos are then prepended to the beginning of the question in standard CoT style to form the final prompt.

2.3 Combinatorial Optimization and Ising Machines

It is well known that challenging combinatorial optimization problems arise in multiple industrial domains, such as finance, logistics, manufacturing, drug design  (????). State-of-art solution methods often consist of a patchwork of heuristic techniques tuned to the problem class of interest. Interestingly, for many of these problems efficient mappings of the cost-function H𝐻Hitalic_H to a binary, quadratic, unconstrained formulation exist (QUBOs (??)). Equivalently, the problems can be framed in “physics terms” as approximating as much as possible the ground state of an interacting, disordered classical Ising spin energy function:

H=i,jQijxixjihisi+i,jJijsisj,𝐻subscript𝑖𝑗subscript𝑄𝑖𝑗subscript𝑥𝑖subscript𝑥𝑗subscript𝑖subscript𝑖subscript𝑠𝑖subscript𝑖𝑗subscript𝐽𝑖𝑗subscript𝑠𝑖subscript𝑠𝑗H=\sum_{\begin{subarray}{c}i,j\end{subarray}}Q_{ij}x_{i}x_{j}\equiv\sum_{i}h_{% i}s_{i}+\sum_{\begin{subarray}{c}i,j\end{subarray}}J_{ij}s_{i}s_{j},italic_H = ∑ start_POSTSUBSCRIPT start_ARG start_ROW start_CELL italic_i , italic_j end_CELL end_ROW end_ARG end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≡ ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT start_ARG start_ROW start_CELL italic_i , italic_j end_CELL end_ROW end_ARG end_POSTSUBSCRIPT italic_J start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , (1)

where xi{0,1}subscript𝑥𝑖01x_{i}\in\{0,1\}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 0 , 1 }; si{1,+1}subscript𝑠𝑖11s_{i}\in\{-1,+1\}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { - 1 , + 1 } and Jijsubscript𝐽𝑖𝑗J_{ij}italic_J start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT, hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, Qijsubscript𝑄𝑖𝑗Q_{ij}italic_Q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT are real-valued coefficients that specify the problem instance. While the QUBO and Ising forms are equivalent, in this paper we will formulate everything in terms of the QUBO form. Clearly, the search space for the minimum of Eq. 1 scales as 2Nsuperscript2𝑁2^{N}2 start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT as the number of variables N𝑁Nitalic_N increases.

The existence of these mappings has ignited a lively research community that, in the last ten years, has devised hardware implementation of the Ising model as well as physics-inspired algorithmic strategies meant to cool down interacting spins close to their least energetic configuration. Collectively, these methods are often referred to as “Ising Machines”  (???) or sometimes “quantum-inspired” solvers - considering the fact that the most popular and visible methods have a connection to quantum mechanics (????).

Digital implementation of Ising machines are currently the most scalable approach to tackle large problems. This includes GPU/FPGA emulations of solution principles inspired by oscillator synchronization (such as Kuramoto models (?), coherent optics with or without dissipation (e.g Coherent Ising Machines (?) and Bifurcation Machines (?)) and thermal relaxation (e.g. probabilistic bits (“p-bits”) variations (??)). Benchmarks of these systems have consistently performed well in paradigmatic benchmarks of combinatorial optimization, especially in absence of hard constraints. Indeed, there is accumulated empirical evidence that on NP-Hard problems such as fully-connected Spin Glasses the time-to-solution scales as a stretched exponential with the increase of the number of binary variables, on typical instances (i.e. O(N)exp(N)proportional-to𝑂𝑁𝑁O(N)\propto\exp(\sqrt{N})italic_O ( italic_N ) ∝ roman_exp ( square-root start_ARG italic_N end_ARG )), while other methods seem to struggle (??).

Simulated Annealing and Parallel Tempering

Simulated Annealing (SA) (?) is an optimization technique built on searching for low energy solutions using a Markov chain parameterized by a temperature such that high temperature samples correspond to random samples and low temperature samples reflect low energy locally-optimal configurations of the target system. A temperature schedule is formulated to effectively explore or exploit regions of the search space for low energy configurations. Parallel Tempering (PT) (?) is a similar optimization technique built on multiple Markov chains running at different temperatures that swap configurations between the chains in order to explore or exploit the temperatures in finding low energy configurations while avoiding stagnation in local minima. The hyperparameters for the parallel tempering scheme including the temperature schedule and acceptance rate could also be determined adaptively in a fashion to enable a constant swap rate between chains, a scheme dubbed Adaptive Parallel Tempering (APT) (??). The adaptive process is based on a physics-informed procedure to efficiently explore complex energy landscapes.

3 Combinatorial Reasoning

While LLMs cannot reliably reason on their own, with the assistance of an auxiliary system - namely a discrete probabilistic optimizer - we could conceivably select reasons that could create a useful CoT passed into the LLM. The main conceptual challenge is whether one can design a reason-to-variable mapping and a related cost function with the following properties:

  • universality: works across a large variety of reasoning tasks

  • accuracy: its optimized solutions correspond to selecting good reasons when a variety of reasons exist for a given answer

  • practicality: its complexity is such that it returns useful reasons within the time allowed for the optimizer to do the minimization

With reference to Fig. 2, we investigate these challenges by drafting a QUBO cost-function inspired by the problem of portfolio optimization, and designing a sequential procedure of interaction between LLMs and an Ising machine. We call this generic framework Combinatorial Reasoning (CR). It consists of four stages which we now describe in detail.

3.1 Sampling of Reasons

Given a question from the dataset, we prepare N𝑁Nitalic_N identical input prompts (see appendix B) and query an LLM at a fixed temperature. Following the system instructions, each of the N𝑁Nitalic_N outputs will contain a set of reasons . Among these, there are duplicate reasons that are semantically equivalent. We use a Sentence Transformer from HuggingFace (all-mpnet-base-v2) to embed each reason into a normalized 768 dimensional vector. Defining a similarity metric between two reasons as the dot product of the corresponding embedded vectors, we count two reasons as the same if this metric is greater than ζ𝜁\bf{\zeta}italic_ζ. Using this procedure as our method for counting, we can reduce the set of all sampled reasons into a smaller set of distinct reasons and a collection, {γ𝐢}subscript𝛾𝐢\{\bf{\gamma}_{i}\}{ italic_γ start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT }, of embedded vectors. We define:

  • {s\{s{ italic_s} : Set of samples each with an answer and set of reasons

  • {rtotal\{r_{total}{ italic_r start_POSTSUBSCRIPT italic_t italic_o italic_t italic_a italic_l end_POSTSUBSCRIPT} : Set of all reasons sampled from the LLM

  • {rdistinct\{r_{distinct}{ italic_r start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t italic_i italic_n italic_c italic_t end_POSTSUBSCRIPT}: Set of independent reasons selected by Sentence Transformer

  • nisubscript𝑛𝑖n_{i}italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT: The number of times each independent reason, indexed by i𝑖iitalic_i, appears in our N𝑁Nitalic_N samples

  • nijsubscript𝑛𝑖𝑗n_{ij}italic_n start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT: The number of times a pair of independent reasons, indexed by i𝑖iitalic_i and j𝑗jitalic_j, appear together within any one of our N𝑁Nitalic_N samples

These counts are the basis of combinatorial reasoning, and we use these to compute quantities essential in the QUBO mapping. From here on, we refer to independent reasons as reasons for the sake of brevity. Using these counts as well as the acquired embeddings, we denote misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as the average similarity that each reason shares with every reason, i.e.

mi=1kj=1kγ𝐢γ𝐣subscript𝑚𝑖1𝑘superscriptsubscript𝑗1𝑘subscript𝛾𝐢subscript𝛾𝐣m_{i}=\frac{1}{k}\sum_{j=1}^{k}\bf{\gamma_{i}}\cdot\bf{\gamma_{j}}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_k end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT italic_γ start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT ⋅ italic_γ start_POSTSUBSCRIPT bold_j end_POSTSUBSCRIPT (2)

Finally, to clarify our notation, for a given collection {ξμ}subscript𝜉𝜇\{\xi_{\mu}\}{ italic_ξ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT }, we use ξ¯¯𝜉\bar{\xi}over¯ start_ARG italic_ξ end_ARG, ξ¯¯¯¯𝜉\bar{\bar{\xi}}over¯ start_ARG over¯ start_ARG italic_ξ end_ARG end_ARG, and δξ𝛿𝜉\delta\xiitalic_δ italic_ξ to denote the mean, median, and standard deviation.

3.2 QUBO Mapping

This stage processes deterministically the collection of answers and their distinct reasons to formulate a quadratic unconstrained integer optimization problem. The procedure that we decide to investigate is inspired by the QUBO mappings to Markowitz portfolio optimization (?) where the goal is to select the optimal assets (i.e. reasons, in our case) out of a finite universe (all distinct reasons) maximizing some metric. Importantly, this is just one of the many possible cost-functions that could be designed to try to capture the correlations between good and consistent reasons outputted by an LLM after an ensemble of queries, as it will be discussed in Section 5. Each distinct reason is associated to an integer variable zisubscript𝑧𝑖z_{i}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The integer bound for the variables is a parameter to be chosen as the maximum power of two in order to leverage the binary encoding zi=w=0W12wxiwsubscript𝑧𝑖superscriptsubscript𝑤0𝑊1superscript2𝑤subscript𝑥𝑖𝑤z_{i}=\sum_{w=0}^{W-1}2^{w}x_{iw}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_w = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W - 1 end_POSTSUPERSCRIPT 2 start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i italic_w end_POSTSUBSCRIPT where xiwsubscript𝑥𝑖𝑤x_{iw}italic_x start_POSTSUBSCRIPT italic_i italic_w end_POSTSUBSCRIPT are binary variables. We now construct two functions that will compose a total objective H=(L+Q)𝐻𝐿𝑄H=-(L+Q)italic_H = - ( italic_L + italic_Q ). The first term is meant to select reasons based on their frequency of appearance ni/Nsubscript𝑛𝑖𝑁n_{i}/Nitalic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / italic_N:

L=ili(μ,α)zi=i[μpiαri]zi𝐿subscript𝑖subscript𝑙𝑖𝜇𝛼subscript𝑧𝑖subscript𝑖delimited-[]𝜇subscriptp𝑖𝛼subscriptr𝑖subscript𝑧𝑖L=\sum_{i}l_{i}(\mu,\alpha)z_{i}=\sum_{i}\left[\mu\,\mathrm{p}_{i}-\alpha% \mathrm{r}_{i}\right]z_{i}italic_L = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_μ , italic_α ) italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ italic_μ roman_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_α roman_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (3)

where pisubscriptp𝑖\mathrm{p}_{i}roman_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a measure of “popularity”, i.e. of the relative deviation with respect to the mean frequency of appearance of a reason. risubscriptr𝑖\mathrm{r}_{i}roman_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a measure of the standard deviation module around the frequency (in analogy to the concept of risk in portfolio optimization):

pi=nin¯¯Nsubscriptp𝑖subscript𝑛𝑖¯¯𝑛𝑁\displaystyle\mathrm{p}_{i}=\frac{n_{i}-\bar{\bar{n}}}{N}roman_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over¯ start_ARG over¯ start_ARG italic_n end_ARG end_ARG end_ARG start_ARG italic_N end_ARG ri2=niN(1niN).superscriptsubscriptr𝑖2subscript𝑛𝑖𝑁1subscript𝑛𝑖𝑁\displaystyle\mathrm{r}_{i}^{2}=\frac{n_{i}}{N}\left(1-\frac{n_{i}}{N}\right).roman_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_N end_ARG ( 1 - divide start_ARG italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_N end_ARG ) . (4)

The real parameters μ𝜇\muitalic_μ and α𝛼\alphaitalic_α needs to be chosen empirically. If this was the only objective function (i.e. H=L𝐻𝐿H=-Litalic_H = - italic_L) then zisubscript𝑧𝑖z_{i}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT will either be 00 or maxed out depending on whether li(μ,α)subscript𝑙𝑖𝜇𝛼l_{i}(\mu,\alpha)italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_μ , italic_α ) is positive or negative.

We measure the potential relationship between two reasons by evaluating cijsubscript𝑐𝑖𝑗c_{ij}italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT as the connected correlation between two reasons defined as:

cij=nijNninjN2subscript𝑐𝑖𝑗subscript𝑛𝑖𝑗𝑁subscript𝑛𝑖subscript𝑛𝑗superscript𝑁2c_{ij}=\frac{n_{ij}}{N}-\frac{n_{i}n_{j}}{N^{2}}italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = divide start_ARG italic_n start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_N end_ARG - divide start_ARG italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (5)

This function assumes values between 1+N11superscript𝑁1-1+N^{-1}- 1 + italic_N start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT and 1/4141/41 / 4. It is zero if one of the two reasons appears all the time (meaning it is trivially right - its selection should be handled by linear terms). It is negative when reasons i𝑖iitalic_i, j𝑗jitalic_j are frequent but rarely appearing in the same answer s𝑠sitalic_s. It hits the maximum positive value when reasons i𝑖iitalic_i and j𝑗jitalic_j appears all the time together but are not necessarily obvious (ni=nj=N/2subscript𝑛𝑖subscript𝑛𝑗𝑁2n_{i}=n_{j}=N/2italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_N / 2).

Using the connected correlation functions, we construct the second term in H𝐻Hitalic_H to be sensitive to the correlation between reasons that appear jointly more or less frequently than the average:

Q=ijqij(β)zizj=ij[cijc¯βδc]zizj𝑄subscript𝑖𝑗subscript𝑞𝑖𝑗𝛽subscript𝑧𝑖subscript𝑧𝑗subscript𝑖𝑗delimited-[]subscript𝑐𝑖𝑗¯𝑐𝛽𝛿𝑐subscript𝑧𝑖subscript𝑧𝑗Q=\sum_{i\neq j}q_{ij}(\beta)z_{i}z_{j}=\sum_{i\neq j}\left[c_{ij}-\bar{c}-% \beta\,\delta c\right]z_{i}z_{j}italic_Q = ∑ start_POSTSUBSCRIPT italic_i ≠ italic_j end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( italic_β ) italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ≠ italic_j end_POSTSUBSCRIPT [ italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT - over¯ start_ARG italic_c end_ARG - italic_β italic_δ italic_c ] italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (6)

The parameter β𝛽\betaitalic_β is a real valued hyperparameter of the mapping, appearing as a prefactor to the standard deviation of cijsubscript𝑐𝑖𝑗c_{ij}italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT. For β=0𝛽0\beta=0italic_β = 0, if cijsubscript𝑐𝑖𝑗c_{ij}italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT is positive then qi,jsubscript𝑞𝑖𝑗q_{i,j}italic_q start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT will be positive and increasing with the number of times we observe reasons i𝑖iitalic_i and j𝑗jitalic_j occur together when this number is greater than the observed average. In a symmetric way, it will be negative and decreasing if cijsubscript𝑐𝑖𝑗c_{ij}italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT is negative and the joint appearance is less frequent than the average. The L+U𝐿𝑈L+Uitalic_L + italic_U function is straightforwardly converted to QUBO form Eq. 1 by means of the binary encoding formula:

H𝐻\displaystyle Hitalic_H =\displaystyle== ili(μ,α)w12w1xiw1subscript𝑖subscript𝑙𝑖𝜇𝛼subscriptsubscript𝑤1superscript2subscript𝑤1subscript𝑥𝑖subscript𝑤1\displaystyle-\sum_{i}l_{i}(\mu,\alpha)\sum_{w_{1}}2^{w_{1}}x_{iw_{1}}- ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_μ , italic_α ) ∑ start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT (7)
\displaystyle-- ijqij(β)w1w22w12w2xiw1xjw2.subscript𝑖𝑗subscript𝑞𝑖𝑗𝛽subscriptsubscript𝑤1subscriptsubscript𝑤2superscript2subscript𝑤1superscript2subscript𝑤2subscript𝑥𝑖subscript𝑤1subscript𝑥𝑗subscript𝑤2\displaystyle\sum_{i\neq j}q_{ij}(\beta)\sum_{w_{1}}\sum_{w_{2}}2^{w_{1}}2^{w_% {2}}x_{iw_{1}}x_{jw_{2}}.∑ start_POSTSUBSCRIPT italic_i ≠ italic_j end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( italic_β ) ∑ start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT 2 start_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_j italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT .

Despite the simplicity of L𝐿Litalic_L and Q𝑄Qitalic_Q, it was empirically determined in our study that it is beneficial to use a slightly modified version of the function L~~𝐿\tilde{L}over~ start_ARG italic_L end_ARG introducing flexibility on the mapping of the single non-interacting reasons. Per result of trial and error in our early investigations, the popularity portion was modulated favoring “crucial” reasons with low semantic similarity, and the risk portion of the mapping was modified with an additional tuning parameter κ𝜅\kappaitalic_κ, as well as other quadratic terms in xiwsubscript𝑥𝑖𝑤x_{iw}italic_x start_POSTSUBSCRIPT italic_i italic_w end_POSTSUBSCRIPT to modify the integer encodings:

L~~𝐿\displaystyle\tilde{L}over~ start_ARG italic_L end_ARG =\displaystyle== il~i(μ,mi,α,κ,xi)subscript𝑖subscript~𝑙𝑖𝜇subscript𝑚𝑖𝛼𝜅subscript𝑥𝑖\displaystyle{\sum_{i}\tilde{l}_{i}(\mu,m_{i},\alpha,\kappa,x_{i})}∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT over~ start_ARG italic_l end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_μ , italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_α , italic_κ , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
=\displaystyle== i[μsgn(pi)|pi|miw1W12w1xiw1\displaystyle\sum_{i}\left[\mu\,\mathrm{sgn}(\mathrm{p}_{i})|\mathrm{p}_{i}|^{% m_{i}}\sum_{w_{1}}^{W-1}2^{w_{1}}x_{iw_{1}}\right.∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ italic_μ roman_sgn ( roman_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) | roman_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W - 1 end_POSTSUPERSCRIPT 2 start_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT
\displaystyle-- αriw1W1(2κw1xiw1+w2=w1+1W12w12w2xiw1xiw2)].\displaystyle\left.\alpha\mathrm{r}_{i}\sum_{w_{1}}^{W-1}\left(2^{\kappa w_{1}% }x_{iw_{1}}+\sum_{w_{2}=w_{1}+1}^{W-1}2^{w_{1}}2^{w_{2}}x_{iw_{1}}x_{iw_{2}}% \right)\right].italic_α roman_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W - 1 end_POSTSUPERSCRIPT ( 2 start_POSTSUPERSCRIPT italic_κ italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W - 1 end_POSTSUPERSCRIPT 2 start_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT 2 start_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ] .

The final QUBO that we send to the solver is then tackling the objective function H~=(L~+Q)~𝐻~𝐿𝑄\tilde{H}=-(\tilde{L}+Q)over~ start_ARG italic_H end_ARG = - ( over~ start_ARG italic_L end_ARG + italic_Q ).

3.3 Combinatorial Optimization Solver

The QUBO instance is processed by an Ising machine configured with a pre-defined parameter setting strategy (?) aimed to find the most appropriate solution to the xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT variables. Ideally, the solver identifies the global optimum of H~~𝐻\tilde{H}over~ start_ARG italic_H end_ARG, i.e.

x=argmin{xi}[H~],superscript𝑥subscriptargminsubscript𝑥𝑖~𝐻x^{\star}=\operatorname*{arg\,min}_{\{x_{i}\}}\left[\tilde{H}\right],italic_x start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT { italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } end_POSTSUBSCRIPT [ over~ start_ARG italic_H end_ARG ] , (9)

however, an approximate solution might be sufficient as discussed in Section 6.1. We denote as {rselected\{r_{selected}{ italic_r start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT} the set of reasons selected by QUBO solver. From the binary encoding formula we obtain the zisubscript𝑧𝑖z_{i}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT variables. We then compose a list including all reasons such that zi>0subscript𝑧𝑖0z_{i}>0italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > 0 in the returned solution, and we associate to each of them a real weight w𝑤witalic_w-value=zi/Z𝑣𝑎𝑙𝑢𝑒subscript𝑧𝑖𝑍value=z_{i}/Zitalic_v italic_a italic_l italic_u italic_e = italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / italic_Z where Z=izi𝑍subscript𝑖subscript𝑧𝑖Z=\sum_{i}z_{i}italic_Z = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

3.4 Final Prompt Creation

Once the best candidate solution to the QUBO problem has been found, it is mapped back to set of reasons {rselected\{r_{selected}{ italic_r start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT} each prepended with their w𝑤witalic_w-value𝑣𝑎𝑙𝑢𝑒valueitalic_v italic_a italic_l italic_u italic_e. The LLM is instructed to treat the w𝑤witalic_w-value𝑣𝑎𝑙𝑢𝑒valueitalic_v italic_a italic_l italic_u italic_e as a level of relative importance for each reason. These reasons are sorted according to their w𝑤witalic_w-value𝑣𝑎𝑙𝑢𝑒valueitalic_v italic_a italic_l italic_u italic_es (highest first) and alphabetically. The concatenated string is used to form a prompt (see appendix B). This final prompt inherit the benefits of CoT thanks to these additional reasons, and is used to query the LLM in a zero-shot fashion at temperature = 0 (greedy decoding).

4 Experimental Results

Refer to caption
Figure 2: The performance of combinatorial reasoning (CR) against other methods. Human and USP results are reported from the publications for BBH and USP respectively (?) (?). USP is evaluated on a different, but comparable, LLM PaLM 2-M. Table 3 presents the cumulative results across BBH for these various tasks. Tasks marked with ΛΛ\Lambdaroman_Λ are algorithmic tasks while the others are NLP tasks.

We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens. This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.

We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?). To save on inference time and cost, we sample 50 questions from each of the subtasks111Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness. On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting. Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).

For the Sampling of Reasons step, we sampled the LLM N=210𝑁210N=210italic_N = 210 times at T=1𝑇1T=1italic_T = 1 to collect sufficient distinct reasons, and calculate their distribution and correlations matrices. N𝑁Nitalic_N was determined empirically on test questions. To map to distinct reason, the similarity threshold is held to ζ𝜁\bf{\zeta}italic_ζ=0.90, again determined empirically. Prior to running the QUBO mapper, we tune the mapping parameters μ𝜇\muitalic_μ, α𝛼\alphaitalic_α, β𝛽\betaitalic_β, W𝑊Witalic_W and (κ𝜅\kappaitalic_κ is fixed) using 5 questions from across all of BBH to form a 135 question tuning set. On this, we set the ranges for the tuning (see Table 2) and use Optuna - a gradient free hyperparameter optimization framework (?) - to select the optimal values for the other four parameters. We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.

Parameter μ𝜇\muitalic_μ α𝛼\alphaitalic_α β𝛽\betaitalic_β W𝑊Witalic_W κ𝜅\kappaitalic_κ
Tuning Range [1E-3, 70] [1E-4, 20] [-2, 10] [1, 4] 2
Table 2: Parameter Tuning Ranges

For the Ising solver, we utilized an open-source implementation of simulated annealing (?) featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.

Setting Zero-Shot Few-Shot
Method 0-Shot USP CR 3-Shot
(Ours) CoT
Average (%) \uparrow 47.68 55.89 59.88 74.20
Gain over zero-shot 0 +8.21 +12.20 +26.52
Average rank \downarrow 3.22 2.78 2.57 1.35
Table 3: Cumulative statistics on the performance of different methods across BBH. CR outperforms the other zero-shot methods both by rank and on average.

Figure 2 and Table 3 displays our results for BBH tasks. We manually evaluated the results for CR and zero-shot. The USP results are taken from (?). While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (??).

We performed a human evaluation at each stage of the CR pipeline. In Table 4 we report the number of sampled reasons before and after the stages depicted in Fig. 2. It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons. More results of the human evaluation can be found in the Appendix.

All Reasons % of {rdistinct}subscript𝑟𝑑𝑖𝑠𝑡𝑖𝑛𝑐𝑡\{r_{distinct}\}{ italic_r start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t italic_i italic_n italic_c italic_t end_POSTSUBSCRIPT }
Dataset {rtotal}subscript𝑟𝑡𝑜𝑡𝑎𝑙\{r_{total}\}{ italic_r start_POSTSUBSCRIPT italic_t italic_o italic_t italic_a italic_l end_POSTSUBSCRIPT } {rdistinct}subscript𝑟𝑑𝑖𝑠𝑡𝑖𝑛𝑐𝑡\{r_{distinct}\}{ italic_r start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t italic_i italic_n italic_c italic_t end_POSTSUBSCRIPT } {rselected}subscript𝑟𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑\{r_{selected}\}{ italic_r start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT }
Causal Judgement 709 204 87.2
Reasoning About Colored Objects 525 100 82.0
Navigate 1100 572 100.0
Penguins In A Table 589 123 77.2
Geometric Shapes 630 331 100.0
Disambiguation QA 373 45 68.9
Tracking Shuffled Objects Five Objects 1020 298 95.0
Word Sorting 385 107 99.1
Tracking Shuffled Objects Three Objects 743 147 64.6
Tracking Shuffled Objects Seven Objects 1164 400 98.5
Multistep Arithmetic Two 621 253 99.6
Web Of Lies 885 113 84.1
Logical Deduction Three Objects 540 100 72.0
Sports Understanding 449 160 96.3
Snarks 396 109 91.7
Logical Deduction Five Objects 680 199 92.0
Salient Translation Error Detection 389 90 98.9
Hyperbaton 432 57 65.0
Movie Recommendation 730 457 100.0
Object Counting 397 48 62.5
Logical Deduction Seven Objects 730 309 100.0
Temporal Sequences 533 76 97.3
Formal Fallacies 579 251 100.0
Dyck Languages 1112 558 100.0
Date Understanding 587 162 98.1
Boolean Expressions 493 160 93.8
Ruin Names 622 421 100.0
Table 4: Reason filtering and selection percentages

5 Conclusion

We propose CR as a zero-shot automatic prompting pipeline applicable to reasoning tasks. We believe CR could be advantageous in the scenario that one needs multiple reasons to elicit the correct answer, and the reasons cannot be obtained via a single-shot from the LLM.

6 Future Work

In this section, we point out some details on the upcoming research directions to improve CR.

6.1 Improving time and accuracy

We list the following straightforward improvement ideas to the baseline framework:

Semantic matching

Human evaluation on a few samples (N=10𝑁10N=10italic_N = 10) on the causal judgement, movie recommendation, and sports understanding datasets reveals that a decent fraction of the reasons that are identified as distinct via our automated procedure are actually semantically the same in the eyes of a human. This speaks to the relative simpleness of BBH for LLM reasoning and clearly negatively affects the effectiveness to the QUBO mapping and of the entire CR pipeline. It is therefore a priority to improve the semantic matching procedure via threashold adjustment or other more sophisticated filtering.

QUBO Mapping

First and foremost, we re-emphasize the fact that the QUBO construction is just a first attempt to identifying a good objective function. The H~~𝐻\tilde{H}over~ start_ARG italic_H end_ARG construction can be refined and studied carefully to maximize the correlation between the quality of the xsuperscript𝑥x^{\star}italic_x start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT and the accuracy of the final answer at the end of the CR procedure. Approximate solutions of these hard problems might end up being good enough and optimal with respect to the end-to-end result. In Appendix C we study some properties of the current cost-function choice, learning valuable lessons for the design of future improved CR pipelines. It will be also clearly beneficial to study the property of the graphs (size, weight distribution) and correlate characteristics from the physics of spin-glasses (such as the presence of a phase transition (??)) with the final answer.

It is immediately notable that for a few task categories (Table 4) our the QUBO mapping does not result in a sub-selection of reasons because all distinct reasons are selected to be part of the final prompt. This indicates that either the problem does not need combinatorial optimization, or that the QUBO mapping needs to be improved. Moreover, the current construction was inspired by a basic portfolio optimization formulation and neglects possibility of negative zisubscript𝑧𝑖z_{i}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and , budget constraints (which would fix the size of {rselected}subscript𝑟𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑\{r_{selected}\}{ italic_r start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT }), and higher-than-quadratic correlations between the reasons, which can be inserted and “gadgetized” into quadratic terms (?) (adding ancillary variables sparsely connected) while still allowing the use of Ising machines. The advantage of using higher locality solvers than quadratic ones for certain optimization problems has recently been demonstrated (?).

Combinatorial Optimization Solvers Selection

With QUBO instances being NP-Hard in general, it is expected that the combinatorial optimization solver might take more time to find a quality solution than an acceptable user experience might mandate. The solver used for our baseline results is far from being the optimal choice or from being used in the optimal way, hence we can expect great room for improvement in speed and accuracy of the results in Fig. 2.

To show that enhancements are possible, we proceeded performing a few numerical experiments on a subset of instances of BBH for which we substituted the simulated annealing baseline solver either with a hardware-efficient digital implementation (the Fujitsu Digital Annealer (?)), and with an Adaptive Parallel Tempering (APT) solver from USRA (?).

  • Speedup potential For the digital annealer experiments, we selected the Logical Deduction - 7 objects from BBH as the dataset of interest and sampled 20 questions from it. We consider these instances to be typical, resulting in QUBO problems with an average of 900 variables. We observed consistently a difference of at least an order of magnitude (and often two orders of magnitude) in speed to obtain the same highest quality solution between the baseline method and the specialized hardware. In particular, the time-to-solution given by the Fujitsu Digital Annealer, not considering the latency due to cloud access, were all under 2 seconds, which would enable a turnaround time of the entire CR pipeline under 5 seconds as demonstrated in the relevant technical report  (?).

  • Accuracy potential As clear from Fig. 2, the Formal Fallacies category of the BBH dataset has proven to be particularly difficult for our CR method. Investigations on these QUBO instances indicated that they were also difficult for the baseline simulated annealing solver, which did not converge to a stable minimum by the allotted time. We then opted to employ the APT solver (which has shown to outperform simulated annealing on spin-glass problems) run on an HPC cluster extensively until convergence to access higher quality solutions.

    Out of the 19 questions we considered for analysis, for 14 questions the answers matched. In 4 questions, the solutions obtained by APT resulted in a correct answer, while the solution obtained by the baseline resulted in an incorrect answer. The opposite is true only for 1 question. Solutions obtained by APT tended to contain a larger number of reasons, with APT picking approximately 49.7 reasons on average while simulated annealing picked only approximately 12.7 reasons. Approximately 86.7%percent\%% of reasons picked as part of APT’s solution was not part of the reasons picked as part of the simulated annealing solution. This seems to indicate that having a high-quality solution to the QUBO corresponds to final prompts contain a CoT with a substantial number of reasons.

It should be noted that simulated annealing, (adaptive) parallel tempering and the digital annealers used in our studies are all driving the optimization through a Monte-Carlo procedure employing thermal relaxation dynamics as the engine. Besides further sophistication of the algorithms (?) we could also consider other types of Ising machines, including quantum solvers that might access unmatchable sources of speedup, in principle. Based on recent results, particularly promising options in the near-term for time-sensitive applications are superconducting devices implementing quantum annealing (?) as well as gate-model algorithms that exploit the noise as a drive (?).

Post-processing of the reasons for the final prompt

When reviewing the reasons sampled from the LLM, there were still quite a few reasons that were semantically the same, but identified as different after the semantic matching. This suggests that we could further tune the semantic similarity threshold or potentially leverage an LLM call to help disambiguate reasons. After selecting the “best” reasons from the combinatorial optimizer, we often find that the prompt does not flow in a logical manner. This is because reasons are often “consequences” of other reasons and are directional. When the consequences are beyond 2 levels (Reason A causes Reason B which causes Reason C), the combinatorial optimizer doesn’t work so well, possibly because it naturally only looks at the correlation of 2 reasons. Futhermore, the output of the combinatorial optimizer is a flat list of reasons with no specific ordering or relationship between the reasons within the prompt. An intermediate step that could be experimented with is to create a CoT or a “Tree-of-Thought” using an ordered set of reasons. There are recent advances in prompt engineering methods employing similar sophisticated strategies (??).

6.2 Generalizations of the framework

Understanding performance characteristics of sampling reasons from an LLM

CR is based on the assumption that sampling the LLM for reasons will create a distribution of reasons that can be be mapped to an optimization procedure. Exactly how the distribution is created and the accuracy of such distribution needs to be quantified. Factors such as the temperature of the LLM and what type of LLM is being used can impact the distribution of the reasons sampled. The distribution of the reasons are likely to also indicate if the CR procedure is even needed, which could lead to further optimizations.

Integration with Theorem Provers

Manual inspection of the reasons that are selected by the combinatorial optimizer reveals that we sometimes find reasons that conflict with each other. We conjecture for harder problems, the number of conflicting reasons will increase and can be removed by leveraging a theorem prover such as (?) to further improve the performance of combinatorial reasoning. One of the challenges of using a theorem prover is that they do not scale to thousands of variables. However, using the theorem prover as a post-processing step after the combinatorial optimizer narrows down the reasons to several dozen reasons is practical. The combination of a probabilistic solver combined with a deterministic solver allows for reasoning on open domain problems.

Integration with Retrieval Augmented Generation

Retrieval augmented generation (RAG) is used for knowledge intensive tasks. In the real world human problem solving, a combination of knowledge retrieval and reasons is used. A potential avenue for integrating RAG could be:

  • Use input query to perform a semantic search on a knowledge base to create a knowledge context

  • Include the knowledge context the prompt when performing sampling of reasons

With context windows of 1M tokens (?), the “reasons” sampled from the LLM could be derived from very long form documents.

Acknowledgments

The work was funded by Icosa Computing Inc. A.A.A., P.A.L. and D.V. were funded by NSF CCF (grant #1918549) and NASA Academic Mission Services contract NNA16BD14C – funded under SAA2-403506. We thank David Fellah for helpful discussions. We acknowledge the generous access provided by Fujitsu Limited to their Digital Annealer. We thank Michiyuki Tanaka, Yasuyuki Tanaka, Hirofumi Ukita, and Atsushi Kasugai from the Fujitsu team for their support. We also thank Google Deepmind and Google Cloud teams for their comments.

Appendix A Data Availability

The LLM samples, generated prompts, human analysis and relevant code are available at https://github.com/Icosa-Computing/cr-paper.

Appendix B End-to-End Example

Here we provide a full walkthrough of combinatorial reasoning for a particular question from BBH task snarks. The original question is presented below:

{mdframed}

[style=MyFrame,nobreak=true] Q: Which statement is sarcastic?
Options:
(A) I’ll just bring my keyboard and mouse to the computer and plug it all in
(B) I’ll just bring my keyboard and mouse to the bus and plug it all in

System Instruction

{mdframed}

[style=MyFrame,nobreak=true] Let’s think step by step. After each step, condense the reasoning in the step into a sentence and put it in curly braces.

Initial Prompt

{mdframed}

[style=MyFrame,nobreak=true] Output template:
Step 1: reasoning
{condensed reason}
Step 2: reasoning
{condensed reason} ….
Q: Which statement is sarcastic?
Options:
(A) I’ll just bring my keyboard and mouse to the computer and plug it all in
(B) I’ll just bring my keyboard and mouse to the bus and plug it all in

Reason Extraction

Each of the N𝑁Nitalic_N responses will contain a reasoning path, with condensed reasons placed inside curly braces. We extract these condensed reasons and embed them to perform the QUBO mapping. Here, we display two of the LLM responses from the initial sampling phase and demonstrate an example of reason embedding.

Sampled Response #1

{mdframed}

[style=MyFrame,nobreak=true] Step 1: Plugging in a keyboard and mouse to a computer is a typical and expected action.
{Plugging in a keyboard and mouse to a computer is a normal occurrence.}
Step 2: Plugging in a keyboard and mouse to a bus is an illogical and sarcastic statement.
{Plugging in a keyboard and mouse to a bus is absurd and sarcastic.}
Conclusion: Option (B) I’ll just bring my keyboard and mouse to the bus and plug it all in is sarcastic.
{Option (B) is sarcastic.}

Refer to caption
Figure 3: Baseline analysis for Quadratic CR (same as main text) with Linear CR and Random Reasons. Overall performance across the ten datasets were Quadratic CR: 65.2%percent65.265.2\%65.2 %, Linear CR: 68.2%percent68.268.2\%68.2 %, Random: 57.4%percent57.457.4\%57.4 %. 0-shot and 0-shot CoT results are included for reference. The individual tasks are ordered according to the performance of 0-shot CoT.

Sampled Response #2

{mdframed}

[style=MyFrame,nobreak=true] Step 1: Plugging a keyboard and mouse into a computer is a common and expected action, so the statement using this scenario is not sarcastic.
{Plugging a keyboard and mouse into a computer is a common and expected action.}
Step 2: However, plugging a keyboard and mouse into a bus doesn’t make sense as buses do not have such connections, making this statement sarcastic. {Plugging a keyboard and mouse into a bus is not a logical or realistic scenario, indicating sarcasm.}

The sarcastic statement is: (B) I’ll just bring my keyboard and mouse to the bus and plug it all in. We focus on the following two reasons appearing in the two different LLM outputs (only curly brackets are extracted).

  • Plugging a keyboard and mouse into a computer is a common and expected action.

  • Plugging a keyboard and mouse into a computer is a normal occurence.

The Sentence Transformer embeds them into vectors γ𝟏subscript𝛾1\bf{\gamma}_{1}italic_γ start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT and γ𝟐subscript𝛾2\bf{\gamma}_{2}italic_γ start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT. They are semantically identical, as determined by their inner product γ𝟏γ𝟐=0.95subscript𝛾1subscript𝛾20.95\bf{\gamma}_{1}\cdot\bf{\gamma}_{2}=0.95italic_γ start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT ⋅ italic_γ start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT = bold_0.95, which is well above the threshold of 0.90. These reasons will be collapsed into one single distinct reason in the CR pipeline.

Ising Mapping and Solving

The extracted reasons will be mapped to a quadratic optimization problem according to the procedure outlined in 3.2. For this example question, the QUBO instance contains 174174174174 variables. The best solution found by simulated annealing has an energy of 806.3806.3-806.3- 806.3. We map the best solution back to a set of 45 selected reasons and corresponding w𝑤witalic_w-value𝑣𝑎𝑙𝑢𝑒valueitalic_v italic_a italic_l italic_u italic_es (which end up being only two different values).

Final Prompt

Using the selected reasons and their w-values, we construct the final prompt. The LLM is queried with the final prompt and system instructions at temperature T=0𝑇0T=0italic_T = 0. We have omitted the mid-section of the prompt due to space constraints. The full prompt for the example and all other prompts can be found in the Supplementary Material.

System Instruction

{mdframed}

[style=MyFrame,nobreak=true] Each W-Statement starts with the substring (w), where (w) is a number called the W-Value. Identify and state each W-Value. W-Statements with higher W-Values have more reliable information. You may not provide multiple possible answers, you must narrow your final solution down to a single answer. Refer to each W-Statement and their W-Values in your reasoning. Your final answer must be of the form SOLUTION: (option).

Prompt

{mdframed}

[style=MyFrame,nobreak=False] Q: Which statement is sarcastic?
Options:
(A) I’ll just bring my keyboard and mouse to the computer and plug it all in
(B) I’ll just bring my keyboard and mouse to the bus and plug it all in
W-Statements:
(0.028) Using a keyboard and mouse on a bus is an absurd idea.
(0.028) The statement ”I’ll just bring my keyboard and mouse to the bus and plug it in” is sarcastic because it suggests an action that is illogical.

(0.009) Option (B) is sarcastic as it involves an absurd suggestion.
(0.009) Option (B) is sarcastic as it contains an unrealistic scenario, making it an unlikely action.
(0.009) Option (B) ”I’ll just bring my keyboard and mouse to the bus and plug it all in” is sarcastic.
(0.009) B is the sarcastic statement

LLM Response

{mdframed}

[style=MyFrame,nobreak=False] The W-Value for all statements is 0.028 except for the last two, which have a W-Value of 0.009. The majority of the statements with a W-Value of 0.028 clearly identify option (B) as the sarcastic statement, while the statements with a W-Value of 0.009 do not provide a definitive answer. Therefore, the final solution is:

SOLUTION: (B) I’ll just bring my keyboard and mouse to the bus and plug it all in

Appendix C Evaluation of QUBO mapping and solving

In Figure 3, we investigated the performance differences between CR as per main text (“Quadratic CR”) versus selecting a set {rselected}subscript𝑟𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑\{r_{selected}\}{ italic_r start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT } of reasons either randomly (“Random Reasons”) among all the {rdistinct}subscript𝑟𝑑𝑖𝑠𝑡𝑖𝑛𝑐𝑡\{r_{distinct}\}{ italic_r start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t italic_i italic_n italic_c italic_t end_POSTSUBSCRIPT } possible or the set that optimize the cost function i{rselected}l~i(μ,mi,α,κ,xi)superscriptsubscript𝑖subscript𝑟𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑subscript~𝑙𝑖𝜇subscript𝑚𝑖𝛼𝜅subscript𝑥𝑖\sum_{i}^{\{r_{selected}\}}\tilde{l}_{i}(\mu,m_{i},\alpha,\kappa,x_{i})∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT { italic_r start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT } end_POSTSUPERSCRIPT over~ start_ARG italic_l end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_μ , italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_α , italic_κ , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) from Eq. 3.2 (“Linear CR”). We restrict the analysis to the top 10 categories with the minimum value of the ratio between the cardinality of {rselected}subscript𝑟𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑\{r_{selected}\}{ italic_r start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT } and {rdistinct}subscript𝑟𝑑𝑖𝑠𝑡𝑖𝑛𝑐𝑡\{r_{distinct}\}{ italic_r start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t italic_i italic_n italic_c italic_t end_POSTSUBSCRIPT }, which we take as a proxy for “reasoning difficulty” (see Table 4). Results indicate that random selection is generally performing worse and that quadratic terms seem to be useful only for 2/10 categories, which happen to be also among the most difficult (in terms of accuracy) for zero shot methods. We believe this correlation between difficulty and importance of quadratic terms in the QUBO deserves more investigation.

We performed a human evaluation in order to understand the QUBO model effect on correlated reasons to verify that cijsubscript𝑐𝑖𝑗c_{ij}italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT and qijsubscript𝑞𝑖𝑗q_{ij}italic_q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT could be seen as measures of strength of relationship between two sampled reasons. Restricting our analysis to the causal judgement, movie recommendation, and sports understanding datasets, we assigned a value (-1,0,+1) to each pair of reasons appearing in 10 samples depending on whether the pair appears respectively inconsistent, consistent or independent in the eyes of a human. A first observation is that, within each sample, reasons that are inconsistent are very rare (they appeared only in one sample), which reassure that the LLM does not output contradictory answers for the dataset examined. Across samples, in causal judgement, we observed that the 50 pairs with higher values of cijsubscript𝑐𝑖𝑗c_{ij}italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT and qijsubscript𝑞𝑖𝑗q_{ij}italic_q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT did contain a large fraction of correlated and consistent pairs. However, on other datasets such as movie recommendation or sports judgement, there were a significant fraction of correlated and consistent reasons even in the lowest 50 pairs. Moreover, many of the reason pairs with the lowest 50 values seem semantically identical - indicating a failure of the similarity measure described in 3.1. Based on this and other anecdotal evidence, we hypothesize that for easy questions, the LLM might be producing correlated reasons consistent with the correct answer nearly all the time. Results call to look at the pairs resulting in negative cijsubscript𝑐𝑖𝑗c_{ij}italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT much more closely in future studies.

We performed human evaluation of the consistency of the final reasons selected after the QUBO Solver {rselected}subscript𝑟𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑\{r_{selected}\}{ italic_r start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT } for the same restricted dataset, observing that inconsistency is very rare (for casual judgment) or difficult to ascertain for the movie recommendation, and sports understanding datasets. A preliminary observation indicates for NLP tasks we match or slightly outperform the simple zero-shot CoT method (without human-in-the-loop), while for algorithmic tasks CR usually has slightly lower accuracy.

References

  • Aadit, Lott, and Mohseni 2023 Aadit, N.; Lott, P. A.; and Mohseni, M. 2023. APT: Adaptive Parallel Tempering.
  • Acebrón et al. 2005 Acebrón, J. A.; Bonilla, L. L.; Vicente, C. J. P.; Ritort, F.; and Spigler, R. 2005. The kuramoto model: A simple paradigm for synchronization phenomena. Reviews of modern physics 77(1):137.
  • Akiba et al. 2019 Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. 2019. Optuna: A next-generation hyperparameter optimization framework. In The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631.
  • Angelini and Ricci-Tersenghi 2023 Angelini, M. C., and Ricci-Tersenghi, F. 2023. Limits and performances of algorithms based on simulated annealing in solving sparse hard inference problems. Physical Review X 13(2):021011.
  • Babbush, O’Gorman, and Aspuru-Guzik 2013 Babbush, R.; O’Gorman, B.; and Aspuru-Guzik, A. 2013. Resource efficient gadgets for compiling adiabatic quantum optimization problems. Annalen der Physik 525(10-11):877–888.
  • Besta et al. 2024 Besta, M.; Blach, N.; Kubicek, A.; Gerstenberger, R.; Podstawski, M.; Gianinazzi, L.; Gajda, J.; Lehmann, T.; Niewiadomski, H.; Nyczyk, P.; and Hoefler, T. 2024. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence 38(16):17682–17690. arXiv:2308.09687 [cs].
  • Camsari, Salahuddin, and Datta 2017 Camsari, K. Y.; Salahuddin, S.; and Datta, S. 2017. Implementing p-bits with embedded mtj. IEEE Electron Device Letters 38(12):1767–1770.
  • Chermoshentsev et al. 2022 Chermoshentsev, D. A.; Malyshev, A. O.; Esencan, M.; Tiunov, E. S.; Mendoza, D.; Aspuru-Guzik, A.; Fedorov, A. K.; and Lvovsky, A. I. 2022. Polynomial unconstrained binary optimisation inspired by optical simulation. arXiv:2106.13167 [nlin, physics:quant-ph].
  • Chowdhury et al. 2023 Chowdhury, S.; Grimaldi, A.; Aadit, N. A.; Niazi, S.; Mohseni, M.; Kanai, S.; Ohno, H.; Fukami, S.; Theogarajan, L.; Finocchio, G.; et al. 2023. A full-stack view of probabilistic computing with p-bits: devices, architectures and algorithms. IEEE Journal on Exploratory Solid-State Computational Devices and Circuits.
  • Esencan et al. 2024 Esencan, M.; Kumar, T.; Unlu, C.; and Ho, A. 2024. Improving Large Language Models with Combinatorial Optimization. Technical report, Icosa Computing Inc.
  • Fan, Lewis, and Dauphin 2018 Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical neural story generation. In Gurevych, I., and Miyao, Y., eds., Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 889–898. Melbourne, Australia: Association for Computational Linguistics.
  • Gannouni et al. 2020 Gannouni, A.; Samsonov, V.; Behery, M.; Meisen, T.; and Lakemeyer, G. 2020. Neural combinatorial optimization for production scheduling with sequence-dependent setup waste. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2640–2647.
  • Gemini Team 2024 Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530 [cs].
  • Google Gemini Team 2023 Google Gemini Team. 2023. Gemini: A Family of Highly Capable Multimodal Models.
  • Google 2023 Google. 2023. Palm 2 technical report.
  • Grant, Humble, and Stump 2021 Grant, E.; Humble, T. S.; and Stump, B. 2021. Benchmarking quantum annealing controls with portfolio optimization. Phys. Rev. Appl. 15:014012.
  • Hidaka et al. 2023 Hidaka, R.; Hamakawa, Y.; Nakayama, J.; and Tatsumura, K. 2023. Correlation-diversified portfolio construction by finding maximum independent set in large-scale market graph. IEEE Access 11:142979–142991.
  • Inagaki et al. 2016 Inagaki, T.; Haribara, Y.; Igarashi, K.; Sonobe, T.; Tamate, S.; Honjo, T.; Marandi, A.; McMahon, P. L.; Umeki, T.; Enbutsu, K.; et al. 2016. A coherent ising machine for 2000-node optimization problems. Science 354(6312):603–606.
  • Irbäck et al. 2022 Irbäck, A.; Knuthson, L.; Mohanty, S.; and Peterson, C. 2022. Folding lattice proteins with quantum annealing. Phys. Rev. Res. 4:043013.
  • Kiciman et al. 2023 Kiciman, E.; Ness, R. O.; Sharma, A.; and Tan, C. 2023. Causal reasoning and large language models: Opening a new frontier for causality.
  • Kirkpatrick, Gelatt, and Vecchi 1983 Kirkpatrick, S.; Gelatt, C.; and Vecchi, M. 1983. Optimization by simulated annealing. Science (New York, N.Y.) 220:671–80.
  • Kochenberger et al. 2014 Kochenberger, G.; Hao, J.-K.; Glover, F.; Lewis, M.; Lü, Z.; Wang, H.; and Wang, Y. 2014. The unconstrained binary quadratic programming problem: A survey. Journal of Combinatorial Optimization 28.
  • Kojima et al. 2022 Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2022. Large Language Models are Zero-Shot Reasoners.
  • Komatsu et al. 2018 Komatsu, K.; Momose, S.; Isobe, Y.; Watanabe, O.; Musa, A.; Yokokawa, M.; Aoyama, T.; Sato, M.; and Kobayashi, H. 2018. Performance evaluation of a vector supercomputer sx-aurora tsubasa. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, 685–696.
  • LeCun 2022 LeCun, Y. 2022. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review 62(1).
  • Lewis et al. 2020 Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33:9459–9474.
  • Lucas 2014 Lucas, A. 2014. Ising formulations of many np problems. Frontiers in physics 2:74887.
  • Maciejewski et al. 2024 Maciejewski, F. B.; Biamonte, J.; Hadfield, S.; and Venturelli, D. 2024. Improving quantum approximate optimization by noise-directed adaptive remapping. arXiv preprint arXiv:2404.01412.
  • Mandra et al. 2023 Mandra, S.; Akbari Asanjan, A.; Brady, L.; Lott, A.; and Bernal Neira, D. E. 2023. PySA: Fast Simulated Annealing in Native Python.
  • Mohseni et al. 2021 Mohseni, M.; Eppens, D.; Strumpfer, J.; Marino, R.; Denchev, V.; Ho, A. K.; Isakov, S. V.; Boixo, S.; Ricci-Tersenghi, F.; and Neven, H. 2021. Nonequilibrium monte carlo for unfreezing variables in hard combinatorial optimization.
  • Mohseni, McMahon, and Byrnes 2022 Mohseni, N.; McMahon, P. L.; and Byrnes, T. 2022. Ising machines as hardware solvers of combinatorial optimization problems. Nature Reviews Physics 4(6):363–379.
  • Neal 2021 Neal, D.-W. S. I. 2021. dwave-neal. https://github.com/dwavesystems/dwave-neal.
  • Neira et al. 2024 Neira, D. E. B.; Brown, R.; Sathe, P.; Wudarski, F.; Pavone, M.; Rieffel, E. G.; and Venturelli, D. 2024. Benchmarking the operation of quantum heuristics and ising machines: Scoring parameter setting strategies on optimization applications. arXiv preprint arXiv:2402.10255.
  • Nye et al. 2021 Nye, M.; Andreassen, A. J.; Gur-Ari, G.; Michalewski, H.; Austin, J.; Bieber, D.; Dohan, D.; Lewkowycz, A.; Bosma, M.; Luan, D.; et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.
  • OpenAI 2023 OpenAI. 2023. Gpt-4 technical report.
  • Rieffel et al. 2014 Rieffel, E.; Venturelli, D.; Do, M.; Hen, I.; and Frank, J. 2014. Parametrized families of hard planning problems from phase transitions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28.
  • Sankar et al. 2021 Sankar, K.; Scherer, A.; Kako, S.; Reifenstein, S.; Ghadermarzy, N.; Krayenhoff, W. B.; Inui, Y.; Ng, E.; Onodera, T.; Ronagh, P.; et al. 2021. Benchmark study of quantum algorithms for combinatorial optimization: Unitary versus dissipative. arXiv preprint arXiv:2105.03528.
  • Sao et al. 2019 Sao, M.; Watanabe, H.; Musha, Y.; and Utsunomiya, A. 2019. Application of digital annealer for faster combinatorial optimization. Fujitsu Scientific and Technical Journal 55(2):45–51.
  • Suzgun et al. 2022 Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H. W.; Chowdhery, A.; Le, Q. V.; Chi, E. H.; Zhou, D.; ; and Wei, J. 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
  • Takemoto et al. 2020 Takemoto, T.; Hayashi, M.; Yoshimura, C.; and Yamaoka, M. 2020. A 2 ×\times× 30k-spin multi-chip scalable cmos annealing processor based on a processing-in-memory approach for solving large-scale combinatorial optimization problems. IEEE Journal of Solid-State Circuits 55(1):145–156.
  • Tanahashi et al. 2019 Tanahashi, K.; Takayanagi, S.; Motohashi, T.; and Tanaka, S. 2019. Application of ising machines and a software development for ising machines. Journal of the Physical Society of Japan 88(6):061010.
  • Tasseff et al. 2022 Tasseff, B.; Albash, T.; Morrell, Z.; Vuffray, M.; Lokhov, A. Y.; Misra, S.; and Coffrin, C. 2022. On the emerging potential of quantum annealing hardware for combinatorial optimization. arXiv preprint arXiv:2210.04291.
  • Tatsumura, Dixon, and Goto 2019 Tatsumura, K.; Dixon, A. R.; and Goto, H. 2019. Fpga-based simulated bifurcation machine. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL), 59–66. IEEE.
  • Tiunov, Ulanov, and Lvovsky 2019 Tiunov, E. S.; Ulanov, A. E.; and Lvovsky, A. I. 2019. Annealing by simulating the coherent Ising machine. Optics Express 27(7):10288.
  • Toshiba 2024 Toshiba, S. 2024. About SQBM+ | Quantum-Inspired Optimization Solutions SQBM+ | TOSHIBA DIGITAL SOLUTIONS CORPORATION.
  • Valmeekam et al. 2023 Valmeekam, K.; Marquez, M.; Sreedharan, S.; and Kambhampati, S. 2023. On the Planning Abilities of Large Language Models - A Critical Investigation. Advances in Neural Information Processing Systems 36:75993–76005.
  • Vaswani et al. 2017 Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Venturelli and Kondratyev 2019 Venturelli, D., and Kondratyev, A. 2019. Reverse quantum annealing approach to portfolio optimization problems. Quantum Machine Intelligence 1(1):17–30.
  • Wan et al. 2023 Wan, X.; Sun, R.; Nakhost, H.; Dai, H.; Eisenschlos, J. M.; Arik, S. O.; and Pfister, T. 2023. Universal Self-Adaptive Prompting.
  • Wang et al. 2022 Wang, X.; Wei, J.; Schuurmans, D.; Le, Q. V.; Chi, E. H.; Narang, S.; Chowdhery, A.; and Zhou, D. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models.
  • Webb, Holyoak, and Lu 2022 Webb, T.; Holyoak, K. J.; and Lu, H. 2022. Emergent analogical reasoning in large language models.
  • Wei et al. 2022 Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q. V.; and Zhou, D. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35:24824–24837.
  • Xu, Jain, and Kankanhalli 2024 Xu, Z.; Jain, S.; and Kankanhalli, M. 2024. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817.
  • Yao et al. 2023 Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; and Narasimhan, K. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems 36:11809–11822.
  • Ye et al. 2023 Ye, J.; Chen, X.; Xu, N.; Zu, C.; Shao, Z.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y.; Zhou, J.; Chen, S.; Gui, T.; Zhang, Q.; and Huang, X. 2023. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models.
  • Z3 2024 2024. Z3 Theorem Prover. Page Version ID: 1219986304.