\mdfdefinestyle

MyFramelinecolor=black, outerlinewidth=2pt, roundcorner=20pt, innertopmargin=4pt, innerbottommargin=4pt, innerrightmargin=4pt, innerleftmargin=4pt, leftmargin = 4pt, rightmargin = 4pt backgroundcolor=gray

Combinatorial Reasoning: Selecting Reasons
in Generative AI Pipelines via Combinatorial Optimization

Mert Esencan¹ Tarun Advaith Kumar¹ Ata Akbari Asanjan^2,3,4
P. Aaron Lott^2,4 Masoud Mohseni⁵ Can Unlu¹ Davide Venturelli^2,4 Alan Ho^1,6
\affiliations¹Icosa Computing Inc., New York, USA
²NASA ARC - Quantum Artificial Intelligence Laboratory (QuAIL), Moffett Field, California, USA
³NASA ARC - Data Sciences Group, Moffett Field, California, USA
⁴USRA Research Institute for Advanced Computer Science (RIACS), Moffett Field, California, USA
⁵LSIP, Hewlett Packard Labs, Milpitas, CA, USA
⁶DataStax, Santa Clara, California, USA
\emails{mert, tarun, can}@icosacomputing.com
{aakbariasanjan, plott, dventurelli}@usra.edu
alan.h@datastax.com

Abstract

Recent Large Language Models (LLMs) have demonstrated impressive capabilities at tasks that require human intelligence and are a significant step towards human-like artificial intelligence (AI). Yet the performance of LLMs at reasoning tasks have been subpar and the reasoning capability of LLMs is a matter of significant debate. While it has been shown that the choice of the prompting technique to the LLM can alter its performance on a multitude of tasks, including reasoning, the best performing techniques require human-made prompts with the knowledge of the tasks at hand. We introduce a framework for what we call Combinatorial Reasoning (CR), a fully-automated prompting method, where reasons are sampled from an LLM pipeline and mapped into a Quadratic Unconstrained Binary Optimization (QUBO) problem. The framework investigates whether QUBO solutions can be profitably used to select a useful subset of the reasons to construct a Chain-of-Thought style prompt. We explore the acceleration of CR with specialized solvers. We also investigate the performance of simpler zero-shot strategies such as linear majority rule or random selection of reasons. Our preliminary study indicates that coupling a combinatorial solver to generative AI pipelines is an interesting avenue for AI reasoning and elucidates design principles for future CR methods.

1 Introduction

The advent of auto-regressive architectures, notably modern LLMs (?), mark a significant development in the pursuit of human-like AI. These models exhibit a profound capacity for generating human-like responses, positioning them as impressive tools for information processing. However, despite their extensive training and large parameter counts, LLMs inherently lack robust mechanisms for deep reasoning and strategic planning (?; ?), which are necessary for applications demanding high-level cognitive functions. Moreover, the same architecture makes LLMs prone to hallucination—defined as generation that is nonsensical or unfaithful to source material (?).

One approach to handling these limitations is to provide additional retrieved context to reduce incorrect generations. Techniques such as retrieval augmented generation (RAG) query a vector database to retrieve source material before generating the LLM response (?). This approach is particularly suitable for knowledge intensive tasks but does not generalize well to reasoning-intensive tasks.

A parallel area of research is improvements to prompt engineering and response decoding. A new method dubbed Chain of Thought (CoT) (?) concatenates hand-annotated example responses with reasoning to the query to create the prompt. The responses from the LLM mimic the examples and contain a ‘reasoning path’ followed by the answer. Another method developed as a complementary decoding approach is Self-Consistency, with the idea that marginalizing over several reasoning paths provides the best possible response (?). However, these approaches rely heavily on human annotations and the same static examples may not be relevant to different queries.

To address these limitations and inspired by the conjecture that the human brain performs gradient-free optimization for reasoning and decision-making (?), we propose that integrating combinatorial optimization strategies within LLM frameworks could advance their reasoning capabilities and make them more adept at handling tasks requiring strategic thought.

Our research proposes the integration of an external reasoning engine that interfaces with existing LLM pipelines to fully automate the creation of CoT style prompts. As the reasoning engine sits outside the LLM black box, our work is not an attempt to change the foundational auto-regressive architecture of LLMs but a proposed tool to analyze and possibly augment their reasoning faculties through automated prompt engineering. By employing combinatorial optimization, the engine generates structured prompts that guide the LLM towards the correct answer. Our work intersects two distinct fields - generative AI and probabilistic combinatorial optimization - to tackle human level reasoning tasks. We construct a first of a kind LLM pipeline with physics-inspired solvers and benchmark the pipeline across a variety of well known Natural Language Processing (NLP) reasoning benchmarks.

Refer to caption — Figure 1: Workflow for Combinatorial Reasoning. The initial prompt is processed by the LLM $N$ times and the answers are filtered through a semantic matching procedure to produce answers with distinct reasons. The ensemble is mapped into a QUBO problem solved by an Ising machine. The final solution determines a set of reasons to be added to the prompt for a final LLM call that determines the final answer.

In the following sections, we review the state of art and then present our Combinatorial Reasoning (CR) framework as a versatile technique that leverages a probabilistic combinatorial optimizer to construct a Chain-of-Though style prompt with no human intervention. Our proof-of-concept results demonstrate that in some cases CR achieves improvements over other zero-shot prompting strategies on a few reasoning tasks from BigBench-Hard, and human-level reasoning performance on several reasoning tasks.

In the conclusions, we will discuss the lessons learned, how the CR framework can be further optimized beyond our preliminary baseline experiments and outline multiple promising research avenues.

2 Preliminaries and Prior Work

2.1 Large Language Models

Large Language Models (LLMs) are machine learning models that are trained to receive and recognize text as input and produce text as output. Distinguished from simpler language models by their immense parameter count, these models are capable of general purpose language processing tasks. GPT 3.5-turbo, the LLM used in our experiments, is part of a series of models developed by OpenAI for generating human-like natural language text (?). Many LLMs including GPT 3.5-turbo are capable of receiving and following a set of system instructions while responding to a query. System instructions differ from the user content that queries an LLM for a response, and we refer to the latter as a ‘prompt’ throughout this work. System instructions define the characteristics of the language model’s output, and constrain it to fit requested behavior (?). For our experiments, we use temperature sampling - a popular method for decoding LLMs in text generation tasks. Adjusting the temperature causes the probability distribution of each subsequent token to be modified, affecting the variability of the responses (?). High sampling temperatures result in diversity while lower temperatures provide more reproducible generations.

2.2 Reasoning in Large Language Models

There have been many papers that suggest that LLMs can indeed reason (?; ?). For each subsequent revision of LLMs - GPT4 / Gemini / and Llama3, reasoning benchmarks such as BIG-Bench-Hard, HellaSwag, and MMLU show ever improving results. However, these results are not a good indicator for the autonomous reasoning capabilities of the model. In each case, the benchmarks are performed using in-context learning, with few-shot (specific examplars) or Chain of Thought (CoT), for which humans manually develop exemplars using labeled datasets to improve performance.

The latest language models do not report the zero-shot performance on these benchmark as in seen Table 1 since the performance is likely poorer than those with manual prompts. Thus we believe the next milestone for LLMs is automatic prompt generation with correct reasoning.

	Gemini Ultra	GPT-4	LLama3 70B
MMLU	90.04% CoT@32	86.4% 5-shot	79.5% 5-shot
GSM8K	94.4% Maj1@32	92% 5-Shot CoT	93.0 8-shot
MATH	53.2% 4-shot		50.4 4-shot
BIG-Bench-Hard	83.6% 3-shot		81.3 3-shot, CoT
DROP	82.4% Variable shot	80.9 3-shot	79.7 3-shot,F1
HellaSwag	87.8% 10-shot	95.3% 10-shot
WinoGrande		87.5% 5-shot	81.3% 5-shot
AI2 Reasoning		96.3% 25-shot	93% 25-shot

Table 1: Summary of recent reasoning benchmarks on LLMs (?). Note that reported results are all dependent on some form of In Context Learning or Chain of Thought (CoT)

The main inspiration for our work comes from Yan LeCun’s review (?) which suggests multiple models need to work together to emulate general intelligence and that human brain possibly calculates a “cost function” for reasoning in a gradient-free manner - similar to combinatorial optimization.

Chain of Thought

As a natural extension of the in-context few shot learning to facilitate reasoning, the Google Brain team developed few-shot CoT (?). The aim is to augment language models with the ability to ‘reason’ in intermediate steps before providing an answer. This approach requires manual demos (or exemplars): labeled question answer pairs containing intermediate reasoning steps that lead to the final correct answer. Inspired by this, (?) developed “zero-shot CoT” as an approach to induce LLMs to answer with intermediate reasoning step without manual demos. Simply put, appending the phrase “Let’s think step by step” to a particular query improved performance.

Self-Consistency

Self-Consistency was introduced by Google’s Deepmind Team as an improved decoding approach for CoT style prompts (?). Instead of greedy decoding, the authors suggest collecting samples at non-zero temperatures and selecting the most occurring answer. This approach lends itself to an intuitive interpretation - reasoning problems admit multiple correct reasoning paths that lead to the unique right answer but incorrect reasoning paths lead to different incorrect answers. Self-consistency can also be viewed as marginalization over latent tokens to produce a better answer.

Universal Self-Adaptive Prompting (USP)

The Universal Self-Adaptive Prompting approach introduces a novel way of generating automatically designed prompts to improve decoding efficiency (?). This approach involves the use of unlabeled questions to generate a set of prompt and response pairs. These prompt and response pairs are concatenated to form a collection of pseudo-demos similar in style to the manual demos used in Chain of Thought. For a given question, a selection algorithm picks a subset of this collection - based on computed metrics such as diversity and confidence. These selected demos are then prepended to the beginning of the question in standard CoT style to form the final prompt.

2.3 Combinatorial Optimization and Ising Machines

It is well known that challenging combinatorial optimization problems arise in multiple industrial domains, such as finance, logistics, manufacturing, drug design (?; ?; ?; ?). State-of-art solution methods often consist of a patchwork of heuristic techniques tuned to the problem class of interest. Interestingly, for many of these problems efficient mappings of the cost-function $H$ to a binary, quadratic, unconstrained formulation exist (QUBOs (?; ?)). Equivalently, the problems can be framed in “physics terms” as approximating as much as possible the ground state of an interacting, disordered classical Ising spin energy function:

H=\sum_{\begin{subarray}{c}i,j\end{subarray}}Q_{ij}x_{i}x_{j}\equiv\sum_{i}h_{% i}s_{i}+\sum_{\begin{subarray}{c}i,j\end{subarray}}J_{ij}s_{i}s_{j},

(1)

where $x_{i}\in\{0,1\}$ ; $s_{i}\in\{-1,+1\}$ and $J_{ij}$ , $h_{i}$ , $Q_{ij}$ are real-valued coefficients that specify the problem instance. While the QUBO and Ising forms are equivalent, in this paper we will formulate everything in terms of the QUBO form. Clearly, the search space for the minimum of Eq. 1 scales as $2^{N}$ as the number of variables $N$ increases.

The existence of these mappings has ignited a lively research community that, in the last ten years, has devised hardware implementation of the Ising model as well as physics-inspired algorithmic strategies meant to cool down interacting spins close to their least energetic configuration. Collectively, these methods are often referred to as “Ising Machines” (?; ?; ?) or sometimes “quantum-inspired” solvers - considering the fact that the most popular and visible methods have a connection to quantum mechanics (?; ?; ?; ?).

Digital implementation of Ising machines are currently the most scalable approach to tackle large problems. This includes GPU/FPGA emulations of solution principles inspired by oscillator synchronization (such as Kuramoto models (?), coherent optics with or without dissipation (e.g Coherent Ising Machines (?) and Bifurcation Machines (?)) and thermal relaxation (e.g. probabilistic bits (“p-bits”) variations (?; ?)). Benchmarks of these systems have consistently performed well in paradigmatic benchmarks of combinatorial optimization, especially in absence of hard constraints. Indeed, there is accumulated empirical evidence that on NP-Hard problems such as fully-connected Spin Glasses the time-to-solution scales as a stretched exponential with the increase of the number of binary variables, on typical instances (i.e. $O(N)\propto\exp(\sqrt{N})$ ), while other methods seem to struggle (?; ?).

Simulated Annealing and Parallel Tempering

Simulated Annealing (SA) (?) is an optimization technique built on searching for low energy solutions using a Markov chain parameterized by a temperature such that high temperature samples correspond to random samples and low temperature samples reflect low energy locally-optimal configurations of the target system. A temperature schedule is formulated to effectively explore or exploit regions of the search space for low energy configurations. Parallel Tempering (PT) (?) is a similar optimization technique built on multiple Markov chains running at different temperatures that swap configurations between the chains in order to explore or exploit the temperatures in finding low energy configurations while avoiding stagnation in local minima. The hyperparameters for the parallel tempering scheme including the temperature schedule and acceptance rate could also be determined adaptively in a fashion to enable a constant swap rate between chains, a scheme dubbed Adaptive Parallel Tempering (APT) (?; ?). The adaptive process is based on a physics-informed procedure to efficiently explore complex energy landscapes.

3 Combinatorial Reasoning

While LLMs cannot reliably reason on their own, with the assistance of an auxiliary system - namely a discrete probabilistic optimizer - we could conceivably select reasons that could create a useful CoT passed into the LLM. The main conceptual challenge is whether one can design a reason-to-variable mapping and a related cost function with the following properties:

•

universality: works across a large variety of reasoning tasks
•

accuracy: its optimized solutions correspond to selecting good reasons when a variety of reasons exist for a given answer
•

practicality: its complexity is such that it returns useful reasons within the time allowed for the optimizer to do the minimization

With reference to Fig. 2, we investigate these challenges by drafting a QUBO cost-function inspired by the problem of portfolio optimization, and designing a sequential procedure of interaction between LLMs and an Ising machine. We call this generic framework Combinatorial Reasoning (CR). It consists of four stages which we now describe in detail.

3.1 Sampling of Reasons

Given a question from the dataset, we prepare $N$ identical input prompts (see appendix B) and query an LLM at a fixed temperature. Following the system instructions, each of the $N$ outputs will contain a set of reasons . Among these, there are duplicate reasons that are semantically equivalent. We use a Sentence Transformer from HuggingFace (all-mpnet-base-v2) to embed each reason into a normalized 768 dimensional vector. Defining a similarity metric between two reasons as the dot product of the corresponding embedded vectors, we count two reasons as the same if this metric is greater than $\bf{\zeta}$ . Using this procedure as our method for counting, we can reduce the set of all sampled reasons into a smaller set of distinct reasons and a collection, $\{\bf{\gamma}_{i}\}$ , of embedded vectors. We define:

•

$\{s$ } : Set of samples each with an answer and set of reasons
•

$\{r_{total}$ } : Set of all reasons sampled from the LLM
•

$\{r_{distinct}$ }: Set of independent reasons selected by Sentence Transformer
•

$n_{i}$ : The number of times each independent reason, indexed by $i$ , appears in our $N$ samples
•

$n_{ij}$ : The number of times a pair of independent reasons, indexed by $i$ and $j$ , appear together within any one of our $N$ samples

These counts are the basis of combinatorial reasoning, and we use these to compute quantities essential in the QUBO mapping. From here on, we refer to independent reasons as reasons for the sake of brevity. Using these counts as well as the acquired embeddings, we denote $m_{i}$ as the average similarity that each reason shares with every reason, i.e.

m_{i}=\frac{1}{k}\sum_{j=1}^{k}\bf{\gamma_{i}}\cdot\bf{\gamma_{j}}

(2)

Finally, to clarify our notation, for a given collection $\{\xi_{\mu}\}$ , we use $\bar{\xi}$ , $\bar{\bar{\xi}}$ , and $\delta\xi$ to denote the mean, median, and standard deviation.

3.2 QUBO Mapping

This stage processes deterministically the collection of answers and their distinct reasons to formulate a quadratic unconstrained integer optimization problem. The procedure that we decide to investigate is inspired by the QUBO mappings to Markowitz portfolio optimization (?) where the goal is to select the optimal assets (i.e. reasons, in our case) out of a finite universe (all distinct reasons) maximizing some metric. Importantly, this is just one of the many possible cost-functions that could be designed to try to capture the correlations between good and consistent reasons outputted by an LLM after an ensemble of queries, as it will be discussed in Section 5. Each distinct reason is associated to an integer variable $z_{i}$ . The integer bound for the variables is a parameter to be chosen as the maximum power of two in order to leverage the binary encoding $z_{i}=\sum_{w=0}^{W-1}2^{w}x_{iw}$ where $x_{iw}$ are binary variables. We now construct two functions that will compose a total objective $H=-(L+Q)$ . The first term is meant to select reasons based on their frequency of appearance $n_{i}/N$ :

L=\sum_{i}l_{i}(\mu,\alpha)z_{i}=\sum_{i}\left[\mu\,\mathrm{p}_{i}-\alpha% \mathrm{r}_{i}\right]z_{i}

(3)

where $\mathrm{p}_{i}$ is a measure of “popularity”, i.e. of the relative deviation with respect to the mean frequency of appearance of a reason. $\mathrm{r}_{i}$ is a measure of the standard deviation module around the frequency (in analogy to the concept of risk in portfolio optimization):

\displaystyle\mathrm{p}_{i}=\frac{n_{i}-\bar{\bar{n}}}{N}

\displaystyle\mathrm{r}_{i}^{2}=\frac{n_{i}}{N}\left(1-\frac{n_{i}}{N}\right).

(4)

The real parameters $\mu$ and $\alpha$ needs to be chosen empirically. If this was the only objective function (i.e. $H=-L$ ) then $z_{i}$ will either be $0$ or maxed out depending on whether $l_{i}(\mu,\alpha)$ is positive or negative.

We measure the potential relationship between two reasons by evaluating $c_{ij}$ as the connected correlation between two reasons defined as:

c_{ij}=\frac{n_{ij}}{N}-\frac{n_{i}n_{j}}{N^{2}}

(5)

This function assumes values between $-1+N^{-1}$ and $1/4$ . It is zero if one of the two reasons appears all the time (meaning it is trivially right - its selection should be handled by linear terms). It is negative when reasons $i$ , $j$ are frequent but rarely appearing in the same answer $s$ . It hits the maximum positive value when reasons $i$ and $j$ appears all the time together but are not necessarily obvious ( $n_{i}=n_{j}=N/2$ ).

Using the connected correlation functions, we construct the second term in $H$ to be sensitive to the correlation between reasons that appear jointly more or less frequently than the average:

Q=\sum_{i\neq j}q_{ij}(\beta)z_{i}z_{j}=\sum_{i\neq j}\left[c_{ij}-\bar{c}-% \beta\,\delta c\right]z_{i}z_{j}

(6)

The parameter $\beta$ is a real valued hyperparameter of the mapping, appearing as a prefactor to the standard deviation of $c_{ij}$ . For $\beta=0$ , if $c_{ij}$ is positive then $q_{i,j}$ will be positive and increasing with the number of times we observe reasons $i$ and $j$ occur together when this number is greater than the observed average. In a symmetric way, it will be negative and decreasing if $c_{ij}$ is negative and the joint appearance is less frequent than the average. The $L+U$ function is straightforwardly converted to QUBO form Eq. 1 by means of the binary encoding formula:

	$\displaystyle H$	$\displaystyle=$	$\displaystyle-\sum_{i}l_{i}(\mu,\alpha)\sum_{w_{1}}2^{w_{1}}x_{iw_{1}}$		(7)
		$\displaystyle-$	$\displaystyle\sum_{i\neq j}q_{ij}(\beta)\sum_{w_{1}}\sum_{w_{2}}2^{w_{1}}2^{w_% {2}}x_{iw_{1}}x_{jw_{2}}.$		(7)

Despite the simplicity of $L$ and $Q$ , it was empirically determined in our study that it is beneficial to use a slightly modified version of the function $\tilde{L}$ introducing flexibility on the mapping of the single non-interacting reasons. Per result of trial and error in our early investigations, the popularity portion was modulated favoring “crucial” reasons with low semantic similarity, and the risk portion of the mapping was modified with an additional tuning parameter $\kappa$ , as well as other quadratic terms in $x_{iw}$ to modify the integer encodings:

$\displaystyle\tilde{L}$	$\displaystyle=$	$\displaystyle{\sum_{i}\tilde{l}_{i}(\mu,m_{i},\alpha,\kappa,x_{i})}$
	$\displaystyle=$	$\displaystyle\sum_{i}\left[\mu\,\mathrm{sgn}(\mathrm{p}_{i})\|\mathrm{p}_{i}\|^{% m_{i}}\sum_{w_{1}}^{W-1}2^{w_{1}}x_{iw_{1}}\right.$
	$\displaystyle-$	$\displaystyle\left.\alpha\mathrm{r}_{i}\sum_{w_{1}}^{W-1}\left(2^{\kappa w_{1}% }x_{iw_{1}}+\sum_{w_{2}=w_{1}+1}^{W-1}2^{w_{1}}2^{w_{2}}x_{iw_{1}}x_{iw_{2}}% \right)\right].$

The final QUBO that we send to the solver is then tackling the objective function $\tilde{H}=-(\tilde{L}+Q)$ .

3.3 Combinatorial Optimization Solver

The QUBO instance is processed by an Ising machine configured with a pre-defined parameter setting strategy (?) aimed to find the most appropriate solution to the $x_{i}$ variables. Ideally, the solver identifies the global optimum of $\tilde{H}$ , i.e.

x^{\star}=\operatorname*{arg\,min}_{\{x_{i}\}}\left[\tilde{H}\right],

(9)

however, an approximate solution might be sufficient as discussed in Section 6.1. We denote as $\{r_{selected}$ } the set of reasons selected by QUBO solver. From the binary encoding formula we obtain the $z_{i}$ variables. We then compose a list including all reasons such that $z_{i}>0$ in the returned solution, and we associate to each of them a real weight $w$ - $value=z_{i}/Z$ where $Z=\sum_{i}z_{i}$ .

3.4 Final Prompt Creation

Once the best candidate solution to the QUBO problem has been found, it is mapped back to set of reasons $\{r_{selected}$ } each prepended with their $w$ - $value$ . The LLM is instructed to treat the $w$ - $value$ as a level of relative importance for each reason. These reasons are sorted according to their $w$ - $value$ s (highest first) and alphabetically. The concatenated string is used to form a prompt (see appendix B). This final prompt inherit the benefits of CoT thanks to these additional reasons, and is used to query the LLM in a zero-shot fashion at temperature = 0 (greedy decoding).

4 Experimental Results

We conduct all of our experiments using the gpt-3.5-turbo-0125 LLM which has a context window of 16,385 tokens and returns a maximum of 4,096 tokens. This language model is a variant of GPT-3.5-Turbo3 produced by OpenAI, and was trained with data available until September 2021.

We selected the suite of BIG-bench Hard (BBH) tasks - a datasets consisting of reasoning oriented questions that have proven challenging for LLMs in the past (?). To save on inference time and cost, we sample 50 questions from each of the subtasks¹¹1Subtasks Logical Deduction and Tracking Shuffled Objects are split up into three further subtasks, we sample 50 questions from each of these., combining them into a 1350 question evaluation set without the subset labels to ensure robustness. On this set, we compare CR against (i) a modified version of zero-shot prompting, (ii) Universal Self-Adaptive Prompting (USP), and (iii) standard three-shot CoT prompting. Our modification to zero-shot consists of an added system-instruction very similar to the one used for CR (see Appendix B for the exact format).

For the Sampling of Reasons step, we sampled the LLM $N=210$ times at $T=1$ to collect sufficient distinct reasons, and calculate their distribution and correlations matrices. $N$ was determined empirically on test questions. To map to distinct reason, the similarity threshold is held to $\bf{\zeta}$ =0.90, again determined empirically. Prior to running the QUBO mapper, we tune the mapping parameters $\mu$ , $\alpha$ , $\beta$ , $W$ and ( $\kappa$ is fixed) using 5 questions from across all of BBH to form a 135 question tuning set. On this, we set the ranges for the tuning (see Table 2) and use Optuna - a gradient free hyperparameter optimization framework (?) - to select the optimal values for the other four parameters. We note that none of the 135 questions in the tuning set appear in the 1350 question evaluation set.

Parameter	$\mu$	$\alpha$	$\beta$	$W$	$\kappa$
Tuning Range	[1E-3, 70]	[1E-4, 20]	[-2, 10]	[1, 4]	2

Table 2: Parameter Tuning Ranges

For the Ising solver, we utilized an open-source implementation of simulated annealing (?) featuring default settings on temperature, linear annealing schedule, and a fixed parameter setting strategy employing 1000 sweeps, run identically 100 times.

Setting		Zero-Shot		Few-Shot
Method	0-Shot	USP	CR	3-Shot
			(Ours)	CoT
Average (%) $\uparrow$	47.68	55.89	59.88	74.20
Gain over zero-shot	0	+8.21	+12.20	+26.52
Average rank $\downarrow$	3.22	2.78	2.57	1.35

Table 3: Cumulative statistics on the performance of different methods across BBH. CR outperforms the other zero-shot methods both by rank and on average.

Figure 2 and Table 3 displays our results for BBH tasks. We manually evaluated the results for CR and zero-shot. The USP results are taken from (?). While USP was evaluated on PaLM 2-M, we report it here anyway due to its recreation complexity and the superior performance of PaLM 2-M to GPT 3.5 Turbo (?; ?).

We performed a human evaluation at each stage of the CR pipeline. In Table 4 we report the number of sampled reasons before and after the stages depicted in Fig. 2. It should be noted that the effect of optimization is visible as the mechanism that reduces the number of distinct reasons to a subset of reasons. More results of the human evaluation can be found in the Appendix.

	All Reasons		% of $\{r_{distinct}\}$
Dataset	$\{r_{total}\}$	$\{r_{distinct}\}$	$\{r_{selected}\}$
Causal Judgement	709	204	87.2
Reasoning About Colored Objects	525	100	82.0
Navigate	1100	572	100.0
Penguins In A Table	589	123	77.2
Geometric Shapes	630	331	100.0
Disambiguation QA	373	45	68.9
Tracking Shuffled Objects Five Objects	1020	298	95.0
Word Sorting	385	107	99.1
Tracking Shuffled Objects Three Objects	743	147	64.6
Tracking Shuffled Objects Seven Objects	1164	400	98.5
Multistep Arithmetic Two	621	253	99.6
Web Of Lies	885	113	84.1
Logical Deduction Three Objects	540	100	72.0
Sports Understanding	449	160	96.3
Snarks	396	109	91.7
Logical Deduction Five Objects	680	199	92.0
Salient Translation Error Detection	389	90	98.9
Hyperbaton	432	57	65.0
Movie Recommendation	730	457	100.0
Object Counting	397	48	62.5
Logical Deduction Seven Objects	730	309	100.0
Temporal Sequences	533	76	97.3
Formal Fallacies	579	251	100.0
Dyck Languages	1112	558	100.0
Date Understanding	587	162	98.1
Boolean Expressions	493	160	93.8
Ruin Names	622	421	100.0

Table 4: Reason filtering and selection percentages

5 Conclusion

We propose CR as a zero-shot automatic prompting pipeline applicable to reasoning tasks. We believe CR could be advantageous in the scenario that one needs multiple reasons to elicit the correct answer, and the reasons cannot be obtained via a single-shot from the LLM.

6 Future Work

In this section, we point out some details on the upcoming research directions to improve CR.

6.1 Improving time and accuracy

We list the following straightforward improvement ideas to the baseline framework:

Semantic matching

Human evaluation on a few samples ( $N=10$ ) on the causal judgement, movie recommendation, and sports understanding datasets reveals that a decent fraction of the reasons that are identified as distinct via our automated procedure are actually semantically the same in the eyes of a human. This speaks to the relative simpleness of BBH for LLM reasoning and clearly negatively affects the effectiveness to the QUBO mapping and of the entire CR pipeline. It is therefore a priority to improve the semantic matching procedure via threashold adjustment or other more sophisticated filtering.

QUBO Mapping

First and foremost, we re-emphasize the fact that the QUBO construction is just a first attempt to identifying a good objective function. The $\tilde{H}$ construction can be refined and studied carefully to maximize the correlation between the quality of the $x^{\star}$ and the accuracy of the final answer at the end of the CR procedure. Approximate solutions of these hard problems might end up being good enough and optimal with respect to the end-to-end result. In Appendix C we study some properties of the current cost-function choice, learning valuable lessons for the design of future improved CR pipelines. It will be also clearly beneficial to study the property of the graphs (size, weight distribution) and correlate characteristics from the physics of spin-glasses (such as the presence of a phase transition (?; ?)) with the final answer.

It is immediately notable that for a few task categories (Table 4) our the QUBO mapping does not result in a sub-selection of reasons because all distinct reasons are selected to be part of the final prompt. This indicates that either the problem does not need combinatorial optimization, or that the QUBO mapping needs to be improved. Moreover, the current construction was inspired by a basic portfolio optimization formulation and neglects possibility of negative $z_{i}$ and , budget constraints (which would fix the size of $\{r_{selected}\}$ ), and higher-than-quadratic correlations between the reasons, which can be inserted and “gadgetized” into quadratic terms (?) (adding ancillary variables sparsely connected) while still allowing the use of Ising machines. The advantage of using higher locality solvers than quadratic ones for certain optimization problems has recently been demonstrated (?).

Combinatorial Optimization Solvers Selection

With QUBO instances being NP-Hard in general, it is expected that the combinatorial optimization solver might take more time to find a quality solution than an acceptable user experience might mandate. The solver used for our baseline results is far from being the optimal choice or from being used in the optimal way, hence we can expect great room for improvement in speed and accuracy of the results in Fig. 2.

To show that enhancements are possible, we proceeded performing a few numerical experiments on a subset of instances of BBH for which we substituted the simulated annealing baseline solver either with a hardware-efficient digital implementation (the Fujitsu Digital Annealer (?)), and with an Adaptive Parallel Tempering (APT) solver from USRA (?).

•

Speedup potential For the digital annealer experiments, we selected the Logical Deduction - 7 objects from BBH as the dataset of interest and sampled 20 questions from it. We consider these instances to be typical, resulting in QUBO problems with an average of 900 variables. We observed consistently a difference of at least an order of magnitude (and often two orders of magnitude) in speed to obtain the same highest quality solution between the baseline method and the specialized hardware. In particular, the time-to-solution given by the Fujitsu Digital Annealer, not considering the latency due to cloud access, were all under 2 seconds, which would enable a turnaround time of the entire CR pipeline under 5 seconds as demonstrated in the relevant technical report (?).
•

Accuracy potential As clear from Fig. 2, the Formal Fallacies category of the BBH dataset has proven to be particularly difficult for our CR method. Investigations on these QUBO instances indicated that they were also difficult for the baseline simulated annealing solver, which did not converge to a stable minimum by the allotted time. We then opted to employ the APT solver (which has shown to outperform simulated annealing on spin-glass problems) run on an HPC cluster extensively until convergence to access higher quality solutions.

Out of the 19 questions we considered for analysis, for 14 questions the answers matched. In 4 questions, the solutions obtained by APT resulted in a correct answer, while the solution obtained by the baseline resulted in an incorrect answer. The opposite is true only for 1 question. Solutions obtained by APT tended to contain a larger number of reasons, with APT picking approximately 49.7 reasons on average while simulated annealing picked only approximately 12.7 reasons. Approximately 86.7 $\%$ of reasons picked as part of APT’s solution was not part of the reasons picked as part of the simulated annealing solution. This seems to indicate that having a high-quality solution to the QUBO corresponds to final prompts contain a CoT with a substantial number of reasons.

It should be noted that simulated annealing, (adaptive) parallel tempering and the digital annealers used in our studies are all driving the optimization through a Monte-Carlo procedure employing thermal relaxation dynamics as the engine. Besides further sophistication of the algorithms (?) we could also consider other types of Ising machines, including quantum solvers that might access unmatchable sources of speedup, in principle. Based on recent results, particularly promising options in the near-term for time-sensitive applications are superconducting devices implementing quantum annealing (?) as well as gate-model algorithms that exploit the noise as a drive (?).

Post-processing of the reasons for the final prompt

When reviewing the reasons sampled from the LLM, there were still quite a few reasons that were semantically the same, but identified as different after the semantic matching. This suggests that we could further tune the semantic similarity threshold or potentially leverage an LLM call to help disambiguate reasons. After selecting the “best” reasons from the combinatorial optimizer, we often find that the prompt does not flow in a logical manner. This is because reasons are often “consequences” of other reasons and are directional. When the consequences are beyond 2 levels (Reason A causes Reason B which causes Reason C), the combinatorial optimizer doesn’t work so well, possibly because it naturally only looks at the correlation of 2 reasons. Futhermore, the output of the combinatorial optimizer is a flat list of reasons with no specific ordering or relationship between the reasons within the prompt. An intermediate step that could be experimented with is to create a CoT or a “Tree-of-Thought” using an ordered set of reasons. There are recent advances in prompt engineering methods employing similar sophisticated strategies (?; ?).

6.2 Generalizations of the framework

Understanding performance characteristics of sampling reasons from an LLM

CR is based on the assumption that sampling the LLM for reasons will create a distribution of reasons that can be be mapped to an optimization procedure. Exactly how the distribution is created and the accuracy of such distribution needs to be quantified. Factors such as the temperature of the LLM and what type of LLM is being used can impact the distribution of the reasons sampled. The distribution of the reasons are likely to also indicate if the CR procedure is even needed, which could lead to further optimizations.

Integration with Theorem Provers

Manual inspection of the reasons that are selected by the combinatorial optimizer reveals that we sometimes find reasons that conflict with each other. We conjecture for harder problems, the number of conflicting reasons will increase and can be removed by leveraging a theorem prover such as (?) to further improve the performance of combinatorial reasoning. One of the challenges of using a theorem prover is that they do not scale to thousands of variables. However, using the theorem prover as a post-processing step after the combinatorial optimizer narrows down the reasons to several dozen reasons is practical. The combination of a probabilistic solver combined with a deterministic solver allows for reasoning on open domain problems.

Integration with Retrieval Augmented Generation

Retrieval augmented generation (RAG) is used for knowledge intensive tasks. In the real world human problem solving, a combination of knowledge retrieval and reasons is used. A potential avenue for integrating RAG could be:

•

Use input query to perform a semantic search on a knowledge base to create a knowledge context
•

Include the knowledge context the prompt when performing sampling of reasons

With context windows of 1M tokens (?), the “reasons” sampled from the LLM could be derived from very long form documents.

Acknowledgments

The work was funded by Icosa Computing Inc. A.A.A., P.A.L. and D.V. were funded by NSF CCF (grant #1918549) and NASA Academic Mission Services contract NNA16BD14C – funded under SAA2-403506. We thank David Fellah for helpful discussions. We acknowledge the generous access provided by Fujitsu Limited to their Digital Annealer. We thank Michiyuki Tanaka, Yasuyuki Tanaka, Hirofumi Ukita, and Atsushi Kasugai from the Fujitsu team for their support. We also thank Google Deepmind and Google Cloud teams for their comments.

Appendix A Data Availability

The LLM samples, generated prompts, human analysis and relevant code are available at https://github.com/Icosa-Computing/cr-paper.

Appendix B End-to-End Example

Here we provide a full walkthrough of combinatorial reasoning for a particular question from BBH task snarks. The original question is presented below:

{mdframed}

[style=MyFrame,nobreak=true] Q: Which statement is sarcastic?
Options:
(A) I’ll just bring my keyboard and mouse to the computer and plug it all in
(B) I’ll just bring my keyboard and mouse to the bus and plug it all in

System Instruction

{mdframed}

[style=MyFrame,nobreak=true] Let’s think step by step. After each step, condense the reasoning in the step into a sentence and put it in curly braces.

Initial Prompt

{mdframed}

[style=MyFrame,nobreak=true] Output template:
Step 1: reasoning
{condensed reason}
Step 2: reasoning
{condensed reason} ….
Q: Which statement is sarcastic?
Options:
(A) I’ll just bring my keyboard and mouse to the computer and plug it all in
(B) I’ll just bring my keyboard and mouse to the bus and plug it all in

Reason Extraction

Each of the $N$ responses will contain a reasoning path, with condensed reasons placed inside curly braces. We extract these condensed reasons and embed them to perform the QUBO mapping. Here, we display two of the LLM responses from the initial sampling phase and demonstrate an example of reason embedding.

Sampled Response #1

{mdframed}

[style=MyFrame,nobreak=true] Step 1: Plugging in a keyboard and mouse to a computer is a typical and expected action.
{Plugging in a keyboard and mouse to a computer is a normal occurrence.}
Step 2: Plugging in a keyboard and mouse to a bus is an illogical and sarcastic statement.
{Plugging in a keyboard and mouse to a bus is absurd and sarcastic.}
Conclusion: Option (B) I’ll just bring my keyboard and mouse to the bus and plug it all in is sarcastic.
{Option (B) is sarcastic.}

Sampled Response #2

{mdframed}

[style=MyFrame,nobreak=true] Step 1: Plugging a keyboard and mouse into a computer is a common and expected action, so the statement using this scenario is not sarcastic.
{Plugging a keyboard and mouse into a computer is a common and expected action.}
Step 2: However, plugging a keyboard and mouse into a bus doesn’t make sense as buses do not have such connections, making this statement sarcastic. {Plugging a keyboard and mouse into a bus is not a logical or realistic scenario, indicating sarcasm.}

The sarcastic statement is: (B) I’ll just bring my keyboard and mouse to the bus and plug it all in. We focus on the following two reasons appearing in the two different LLM outputs (only curly brackets are extracted).

•

Plugging a keyboard and mouse into a computer is a common and expected action.
•

Plugging a keyboard and mouse into a computer is a normal occurence.

The Sentence Transformer embeds them into vectors $\bf{\gamma}_{1}$ and $\bf{\gamma}_{2}$ . They are semantically identical, as determined by their inner product $\bf{\gamma}_{1}\cdot\bf{\gamma}_{2}=0.95$ , which is well above the threshold of 0.90. These reasons will be collapsed into one single distinct reason in the CR pipeline.

Ising Mapping and Solving

The extracted reasons will be mapped to a quadratic optimization problem according to the procedure outlined in 3.2. For this example question, the QUBO instance contains $174$ variables. The best solution found by simulated annealing has an energy of $-806.3$ . We map the best solution back to a set of 45 selected reasons and corresponding $w$ - $value$ s (which end up being only two different values).

Final Prompt

Using the selected reasons and their w-values, we construct the final prompt. The LLM is queried with the final prompt and system instructions at temperature $T=0$ . We have omitted the mid-section of the prompt due to space constraints. The full prompt for the example and all other prompts can be found in the Supplementary Material.

System Instruction

{mdframed}

[style=MyFrame,nobreak=true] Each W-Statement starts with the substring (w), where (w) is a number called the W-Value. Identify and state each W-Value. W-Statements with higher W-Values have more reliable information. You may not provide multiple possible answers, you must narrow your final solution down to a single answer. Refer to each W-Statement and their W-Values in your reasoning. Your final answer must be of the form SOLUTION: (option).

Prompt

{mdframed}

[style=MyFrame,nobreak=False] Q: Which statement is sarcastic?
Options:
(A) I’ll just bring my keyboard and mouse to the computer and plug it all in
(B) I’ll just bring my keyboard and mouse to the bus and plug it all in
W-Statements:
(0.028) Using a keyboard and mouse on a bus is an absurd idea.
(0.028) The statement ”I’ll just bring my keyboard and mouse to the bus and plug it in” is sarcastic because it suggests an action that is illogical.
…
(0.009) Option (B) is sarcastic as it involves an absurd suggestion.
(0.009) Option (B) is sarcastic as it contains an unrealistic scenario, making it an unlikely action.
(0.009) Option (B) ”I’ll just bring my keyboard and mouse to the bus and plug it all in” is sarcastic.
(0.009) B is the sarcastic statement

LLM Response

{mdframed}

[style=MyFrame,nobreak=False] The W-Value for all statements is 0.028 except for the last two, which have a W-Value of 0.009. The majority of the statements with a W-Value of 0.028 clearly identify option (B) as the sarcastic statement, while the statements with a W-Value of 0.009 do not provide a definitive answer. Therefore, the final solution is:

SOLUTION: (B) I’ll just bring my keyboard and mouse to the bus and plug it all in

Appendix C Evaluation of QUBO mapping and solving

In Figure 3, we investigated the performance differences between CR as per main text (“Quadratic CR”) versus selecting a set $\{r_{selected}\}$ of reasons either randomly (“Random Reasons”) among all the $\{r_{distinct}\}$ possible or the set that optimize the cost function $\sum_{i}^{\{r_{selected}\}}\tilde{l}_{i}(\mu,m_{i},\alpha,\kappa,x_{i})$ from Eq. 3.2 (“Linear CR”). We restrict the analysis to the top 10 categories with the minimum value of the ratio between the cardinality of $\{r_{selected}\}$ and $\{r_{distinct}\}$ , which we take as a proxy for “reasoning difficulty” (see Table 4). Results indicate that random selection is generally performing worse and that quadratic terms seem to be useful only for 2/10 categories, which happen to be also among the most difficult (in terms of accuracy) for zero shot methods. We believe this correlation between difficulty and importance of quadratic terms in the QUBO deserves more investigation.

We performed a human evaluation in order to understand the QUBO model effect on correlated reasons to verify that $c_{ij}$ and $q_{ij}$ could be seen as measures of strength of relationship between two sampled reasons. Restricting our analysis to the causal judgement, movie recommendation, and sports understanding datasets, we assigned a value (-1,0,+1) to each pair of reasons appearing in 10 samples depending on whether the pair appears respectively inconsistent, consistent or independent in the eyes of a human. A first observation is that, within each sample, reasons that are inconsistent are very rare (they appeared only in one sample), which reassure that the LLM does not output contradictory answers for the dataset examined. Across samples, in causal judgement, we observed that the 50 pairs with higher values of $c_{ij}$ and $q_{ij}$ did contain a large fraction of correlated and consistent pairs. However, on other datasets such as movie recommendation or sports judgement, there were a significant fraction of correlated and consistent reasons even in the lowest 50 pairs. Moreover, many of the reason pairs with the lowest 50 values seem semantically identical - indicating a failure of the similarity measure described in 3.1. Based on this and other anecdotal evidence, we hypothesize that for easy questions, the LLM might be producing correlated reasons consistent with the correct answer nearly all the time. Results call to look at the pairs resulting in negative $c_{ij}$ much more closely in future studies.

We performed human evaluation of the consistency of the final reasons selected after the QUBO Solver $\{r_{selected}\}$ for the same restricted dataset, observing that inconsistency is very rare (for casual judgment) or difficult to ascertain for the movie recommendation, and sports understanding datasets. A preliminary observation indicates for NLP tasks we match or slightly outperform the simple zero-shot CoT method (without human-in-the-loop), while for algorithmic tasks CR usually has slightly lower accuracy.

References

Aadit, Lott, and Mohseni 2023 Aadit, N.; Lott, P. A.; and Mohseni, M. 2023. APT: Adaptive Parallel Tempering.
Acebrón et al. 2005 Acebrón, J. A.; Bonilla, L. L.; Vicente, C. J. P.; Ritort, F.; and Spigler, R. 2005. The kuramoto model: A simple paradigm for synchronization phenomena. Reviews of modern physics 77(1):137.
Akiba et al. 2019 Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. 2019. Optuna: A next-generation hyperparameter optimization framework. In The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631.
Angelini and Ricci-Tersenghi 2023 Angelini, M. C., and Ricci-Tersenghi, F. 2023. Limits and performances of algorithms based on simulated annealing in solving sparse hard inference problems. Physical Review X 13(2):021011.
Babbush, O’Gorman, and Aspuru-Guzik 2013 Babbush, R.; O’Gorman, B.; and Aspuru-Guzik, A. 2013. Resource efficient gadgets for compiling adiabatic quantum optimization problems. Annalen der Physik 525(10-11):877–888.
Besta et al. 2024 Besta, M.; Blach, N.; Kubicek, A.; Gerstenberger, R.; Podstawski, M.; Gianinazzi, L.; Gajda, J.; Lehmann, T.; Niewiadomski, H.; Nyczyk, P.; and Hoefler, T. 2024. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence 38(16):17682–17690. arXiv:2308.09687 [cs].
Camsari, Salahuddin, and Datta 2017 Camsari, K. Y.; Salahuddin, S.; and Datta, S. 2017. Implementing p-bits with embedded mtj. IEEE Electron Device Letters 38(12):1767–1770.
Chermoshentsev et al. 2022 Chermoshentsev, D. A.; Malyshev, A. O.; Esencan, M.; Tiunov, E. S.; Mendoza, D.; Aspuru-Guzik, A.; Fedorov, A. K.; and Lvovsky, A. I. 2022. Polynomial unconstrained binary optimisation inspired by optical simulation. arXiv:2106.13167 [nlin, physics:quant-ph].
Chowdhury et al. 2023 Chowdhury, S.; Grimaldi, A.; Aadit, N. A.; Niazi, S.; Mohseni, M.; Kanai, S.; Ohno, H.; Fukami, S.; Theogarajan, L.; Finocchio, G.; et al. 2023. A full-stack view of probabilistic computing with p-bits: devices, architectures and algorithms. IEEE Journal on Exploratory Solid-State Computational Devices and Circuits.
Esencan et al. 2024 Esencan, M.; Kumar, T.; Unlu, C.; and Ho, A. 2024. Improving Large Language Models with Combinatorial Optimization. Technical report, Icosa Computing Inc.
Fan, Lewis, and Dauphin 2018 Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical neural story generation. In Gurevych, I., and Miyao, Y., eds., Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 889–898. Melbourne, Australia: Association for Computational Linguistics.
Gannouni et al. 2020 Gannouni, A.; Samsonov, V.; Behery, M.; Meisen, T.; and Lakemeyer, G. 2020. Neural combinatorial optimization for production scheduling with sequence-dependent setup waste. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2640–2647.
Gemini Team 2024 Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530 [cs].
Google Gemini Team 2023 Google Gemini Team. 2023. Gemini: A Family of Highly Capable Multimodal Models.
Google 2023 Google. 2023. Palm 2 technical report.
Grant, Humble, and Stump 2021 Grant, E.; Humble, T. S.; and Stump, B. 2021. Benchmarking quantum annealing controls with portfolio optimization. Phys. Rev. Appl. 15:014012.
Hidaka et al. 2023 Hidaka, R.; Hamakawa, Y.; Nakayama, J.; and Tatsumura, K. 2023. Correlation-diversified portfolio construction by finding maximum independent set in large-scale market graph. IEEE Access 11:142979–142991.
Inagaki et al. 2016 Inagaki, T.; Haribara, Y.; Igarashi, K.; Sonobe, T.; Tamate, S.; Honjo, T.; Marandi, A.; McMahon, P. L.; Umeki, T.; Enbutsu, K.; et al. 2016. A coherent ising machine for 2000-node optimization problems. Science 354(6312):603–606.
Irbäck et al. 2022 Irbäck, A.; Knuthson, L.; Mohanty, S.; and Peterson, C. 2022. Folding lattice proteins with quantum annealing. Phys. Rev. Res. 4:043013.
Kiciman et al. 2023 Kiciman, E.; Ness, R. O.; Sharma, A.; and Tan, C. 2023. Causal reasoning and large language models: Opening a new frontier for causality.
Kirkpatrick, Gelatt, and Vecchi 1983 Kirkpatrick, S.; Gelatt, C.; and Vecchi, M. 1983. Optimization by simulated annealing. Science (New York, N.Y.) 220:671–80.
Kochenberger et al. 2014 Kochenberger, G.; Hao, J.-K.; Glover, F.; Lewis, M.; Lü, Z.; Wang, H.; and Wang, Y. 2014. The unconstrained binary quadratic programming problem: A survey. Journal of Combinatorial Optimization 28.
Kojima et al. 2022 Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2022. Large Language Models are Zero-Shot Reasoners.
Komatsu et al. 2018 Komatsu, K.; Momose, S.; Isobe, Y.; Watanabe, O.; Musa, A.; Yokokawa, M.; Aoyama, T.; Sato, M.; and Kobayashi, H. 2018. Performance evaluation of a vector supercomputer sx-aurora tsubasa. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, 685–696.
LeCun 2022 LeCun, Y. 2022. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review 62(1).
Lewis et al. 2020 Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33:9459–9474.
Lucas 2014 Lucas, A. 2014. Ising formulations of many np problems. Frontiers in physics 2:74887.
Maciejewski et al. 2024 Maciejewski, F. B.; Biamonte, J.; Hadfield, S.; and Venturelli, D. 2024. Improving quantum approximate optimization by noise-directed adaptive remapping. arXiv preprint arXiv:2404.01412.
Mandra et al. 2023 Mandra, S.; Akbari Asanjan, A.; Brady, L.; Lott, A.; and Bernal Neira, D. E. 2023. PySA: Fast Simulated Annealing in Native Python.
Mohseni et al. 2021 Mohseni, M.; Eppens, D.; Strumpfer, J.; Marino, R.; Denchev, V.; Ho, A. K.; Isakov, S. V.; Boixo, S.; Ricci-Tersenghi, F.; and Neven, H. 2021. Nonequilibrium monte carlo for unfreezing variables in hard combinatorial optimization.
Mohseni, McMahon, and Byrnes 2022 Mohseni, N.; McMahon, P. L.; and Byrnes, T. 2022. Ising machines as hardware solvers of combinatorial optimization problems. Nature Reviews Physics 4(6):363–379.
Neal 2021 Neal, D.-W. S. I. 2021. dwave-neal. https://github.com/dwavesystems/dwave-neal.
Neira et al. 2024 Neira, D. E. B.; Brown, R.; Sathe, P.; Wudarski, F.; Pavone, M.; Rieffel, E. G.; and Venturelli, D. 2024. Benchmarking the operation of quantum heuristics and ising machines: Scoring parameter setting strategies on optimization applications. arXiv preprint arXiv:2402.10255.
Nye et al. 2021 Nye, M.; Andreassen, A. J.; Gur-Ari, G.; Michalewski, H.; Austin, J.; Bieber, D.; Dohan, D.; Lewkowycz, A.; Bosma, M.; Luan, D.; et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.
OpenAI 2023 OpenAI. 2023. Gpt-4 technical report.
Rieffel et al. 2014 Rieffel, E.; Venturelli, D.; Do, M.; Hen, I.; and Frank, J. 2014. Parametrized families of hard planning problems from phase transitions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28.
Sankar et al. 2021 Sankar, K.; Scherer, A.; Kako, S.; Reifenstein, S.; Ghadermarzy, N.; Krayenhoff, W. B.; Inui, Y.; Ng, E.; Onodera, T.; Ronagh, P.; et al. 2021. Benchmark study of quantum algorithms for combinatorial optimization: Unitary versus dissipative. arXiv preprint arXiv:2105.03528.
Sao et al. 2019 Sao, M.; Watanabe, H.; Musha, Y.; and Utsunomiya, A. 2019. Application of digital annealer for faster combinatorial optimization. Fujitsu Scientific and Technical Journal 55(2):45–51.
Suzgun et al. 2022 Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H. W.; Chowdhery, A.; Le, Q. V.; Chi, E. H.; Zhou, D.; ; and Wei, J. 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
Takemoto et al. 2020 Takemoto, T.; Hayashi, M.; Yoshimura, C.; and Yamaoka, M. 2020. A 2 $\times$ 30k-spin multi-chip scalable cmos annealing processor based on a processing-in-memory approach for solving large-scale combinatorial optimization problems. IEEE Journal of Solid-State Circuits 55(1):145–156.
Tanahashi et al. 2019 Tanahashi, K.; Takayanagi, S.; Motohashi, T.; and Tanaka, S. 2019. Application of ising machines and a software development for ising machines. Journal of the Physical Society of Japan 88(6):061010.
Tasseff et al. 2022 Tasseff, B.; Albash, T.; Morrell, Z.; Vuffray, M.; Lokhov, A. Y.; Misra, S.; and Coffrin, C. 2022. On the emerging potential of quantum annealing hardware for combinatorial optimization. arXiv preprint arXiv:2210.04291.
Tatsumura, Dixon, and Goto 2019 Tatsumura, K.; Dixon, A. R.; and Goto, H. 2019. Fpga-based simulated bifurcation machine. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL), 59–66. IEEE.
Tiunov, Ulanov, and Lvovsky 2019 Tiunov, E. S.; Ulanov, A. E.; and Lvovsky, A. I. 2019. Annealing by simulating the coherent Ising machine. Optics Express 27(7):10288.
Toshiba 2024 Toshiba, S. 2024. About SQBM+ | Quantum-Inspired Optimization Solutions SQBM+ | TOSHIBA DIGITAL SOLUTIONS CORPORATION.
Valmeekam et al. 2023 Valmeekam, K.; Marquez, M.; Sreedharan, S.; and Kambhampati, S. 2023. On the Planning Abilities of Large Language Models - A Critical Investigation. Advances in Neural Information Processing Systems 36:75993–76005.
Vaswani et al. 2017 Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Venturelli and Kondratyev 2019 Venturelli, D., and Kondratyev, A. 2019. Reverse quantum annealing approach to portfolio optimization problems. Quantum Machine Intelligence 1(1):17–30.
Wan et al. 2023 Wan, X.; Sun, R.; Nakhost, H.; Dai, H.; Eisenschlos, J. M.; Arik, S. O.; and Pfister, T. 2023. Universal Self-Adaptive Prompting.
Wang et al. 2022 Wang, X.; Wei, J.; Schuurmans, D.; Le, Q. V.; Chi, E. H.; Narang, S.; Chowdhery, A.; and Zhou, D. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models.
Webb, Holyoak, and Lu 2022 Webb, T.; Holyoak, K. J.; and Lu, H. 2022. Emergent analogical reasoning in large language models.
Wei et al. 2022 Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q. V.; and Zhou, D. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35:24824–24837.
Xu, Jain, and Kankanhalli 2024 Xu, Z.; Jain, S.; and Kankanhalli, M. 2024. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817.
Yao et al. 2023 Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; and Narasimhan, K. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems 36:11809–11822.
Ye et al. 2023 Ye, J.; Chen, X.; Xu, N.; Zu, C.; Shao, Z.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y.; Zhou, J.; Chen, S.; Gui, T.; Zhang, Q.; and Huang, X. 2023. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models.
Z3 2024 2024. Z3 Theorem Prover. Page Version ID: 1219986304.

Combinatorial Reasoning: Selecting Reasons in Generative AI Pipelines via Combinatorial Optimization

Abstract

1 Introduction

2 Preliminaries and Prior Work

2.1 Large Language Models

2.2 Reasoning in Large Language Models

Chain of Thought

Self-Consistency

Universal Self-Adaptive Prompting (USP)

2.3 Combinatorial Optimization and Ising Machines

Simulated Annealing and Parallel Tempering

3 Combinatorial Reasoning

3.1 Sampling of Reasons

3.2 QUBO Mapping

3.3 Combinatorial Optimization Solver

3.4 Final Prompt Creation

4 Experimental Results

5 Conclusion

6 Future Work

6.1 Improving time and accuracy

Semantic matching

QUBO Mapping

Combinatorial Optimization Solvers Selection

Post-processing of the reasons for the final prompt

6.2 Generalizations of the framework

Understanding performance characteristics of sampling reasons from an LLM

Integration with Theorem Provers

Integration with Retrieval Augmented Generation

Acknowledgments

Appendix A Data Availability

Appendix B End-to-End Example

System Instruction

Initial Prompt

Reason Extraction

Sampled Response #1

Sampled Response #2

Ising Mapping and Solving

Final Prompt

System Instruction

Prompt

LLM Response

Appendix C Evaluation of QUBO mapping and solving

References

Combinatorial Reasoning: Selecting Reasons
in Generative AI Pipelines via Combinatorial Optimization