CigaR: Cost-efficient Program Repair with LLMs

Dávid Hidvégi, Khashayar Etemadi, Sofia Bobadilla, Martin Monperrus D. Hidvégi, K. Etemadi, S. Bobadilla, and M. Monperrus are with the KTH Royal Institute of Technology, Stockholm, Sweden
Email: {dhidvegi, khaes, sofbob, monperrus}@kth.se
Abstract

Large language models (LLM) have proven to be effective at automated program repair (APR). However, using LLMs can be costly, with companies invoicing users by the number of tokens. In this paper, we propose CigaR, the first LLM-based APR tool that focuses on minimizing the repair cost. CigaR works in two major steps: generating a first plausible patch and multiplying plausible patches. CigaR optimizes the prompts and the prompt setting to maximize the information given to LLMs using the smallest possible number of tokens. Our experiments on 429 bugs from the widely used Defects4J and HumanEval-Java datasets shows that CigaR reduces the token cost by 73%. On average, CigaR spends 127k tokens per bug while the baseline uses 467k tokens per bug. On the subset of bugs that are fixed by both, CigaR spends 20k per bug while the baseline uses 608k tokens, a cost saving of 96%. Our extensive experiments show that CigaR is a cost-effective LLM-based program repair tool that uses a low number of tokens to automatically generate patches.

Index Terms:
Program repair, cost minimization, large language model
**footnotetext: These authors contributed equally to this work.

I Introduction

Automated program repair (APR) [1] aims to reduce the huge cost of software maintenance [2] by automatically fixing bugs once they are detected. Automatizing software tasks is computationally costly, and it is no exception for APR research. Hence, it is important to make our APR systems as efficient as possible, and indeed it is a long standing problem in program repair to be able to fix bugs at a low cost [3, 4]. Recently, large language models (LLMs) have shown highly promising state-of-the-art results in APR [5]. However, both training LLMs and using them for inference is costly [6]. For example, ChatGPT, which is arguably the most powerful existing LLM, charge their users for every single token that is either sent to the model or generated from it [7]. In this paper, we study the problem of minimizing the computational cost of LLM-based program repair.

In the context of LLMs, a cost-effective LLM-based APR tool should steer the model towards the correct patch with the lowest possible cost, i.e. with the lowest possible number of tokens. For this purpose, two main components should be optimized: 1) the input to the LLM, i.e. the prompt, and 2) the configuration used for calling the model. First, a good prompt should maximize the trade-off between conciseness and information. While APR researchers have proposed prompt engineering techniques to better guide the model to find the correct patch [8], none of them have been optimized. Second, the settings under which the LLM works should lead to maximize the likelihood to output a correct patch, given a fixed token cost. Previous works have demonstrated that LLM settings, such as number of samples and temperature, are key to obtain acceptable results with a low cost [6].

In this paper, we propose CigaR, a novel LLM-based program repair system that concentrates on token cost minimization. CigaR achieves cost-effectiveness with the help of its three delicately designed prompts working in concert: an ‘initiation prompt’, an ‘improvement prompt’, and a ‘multiplication prompt’. The initiation prompt initializes the repair process. The improvement prompt improves partial patches until a plausible patch is generated, avoiding throwing away potentially useful patches, hence avoiding wasting tokens. Finally, the multiplication prompt builds upon the already generated plausible patches to synthesize more plausible patches with diversity maximization. All these prompts are designed to be concise while staying informative, minimizing the overall token cost. The prompts help the LLM avoid unnecessary token cost by building upon its previous responses. CigaR also uses a reboot strategy to allow the model to look into various parts of its search space, instead of spending tokens in dead-ends.

We evaluate CigaR on two widely used bug datasets: Defects4J [9] and HumanEval-Java. CigaR is highly effective, and possibly achieves the highest performance ever: CigaR is the first tool that fixes 171/429 (39.8%) of the bugs in the considered dataset. This shows that CigaR is properly taking advantage of its advanced prompt techniques for program repair. We demonstrate that CigaR reduces the token cost of the repair process by 73%. The significant cost reduction in CigaR proves the feasibility and effectiveness of the token cost minimization methods used by CigaR. Given that energy consumption by LLMs is significant [10], saving 73% of resources matters beyond program repair: it contributes to engineering frugal software tools in our era of climate crisis.

To sum up, CigaR is the first LLM-based APR tool that aims at minimizing the computational cost, as measured by the number of tokens employed. CigaR is a sophisticated system that enables the LLM to find diverse plausible patches by: 1) summarizing the feedback given to the LLM as a part of an iterative approach; 2) rebooting the repair process after a few failing LLM invocations to allow the LLM look at various parts of its search space; 3) employing the LLM to multiply the already generated patches in order to maximum diversity.

To summarize, we make the following contributions:

  • We introduce CigaR, a new LLM-based program repair approach. CigaR uses advanced prompting incl. iterative prompting, search rebooting and patch multiplication. CigaR has been designed and engineered to explore the patch search space effectively (more repaired bugs), while minimizing the repair token cost (lower invoice).

  • We evaluate CigaR on the widely used Defects4J dataset, as well on HumanEval-Java for sake of external validity. CigaR outperforms the state-of-the-art APR tools, incl. recent LLM-based APR tools, by fixing 171/429 (39.8%) of the considered bugs.

  • We study the amount of token cost reduction by CigaR. Compared to the state-of-the-art LLM-based APR, CigaR reduces the token cost by 73% on average. This shows the significance of token cost reduction by CigaR.

  • We make CigaR publicly available for future research at https://github.com/ASSERT-KTH/cigar.

The rest of this paper is organized as follows. In Section II, we review the core concepts related to our research. In Section III, we explain how CigaR works. Section IV and Section V present the methodology that we use for our experiments and their results. Section VI discusses the threats to the validity of our results. In Section VII, we review the related work. Finally, in Section VIII, we conclude this paper.

II Background

In this section, we review the fundamental considerations and techniques that CigaR is built upon.

II-A Large Language Models

Large language models (LLMs) are machine learning models with tens of billions of parameters trained on vast amounts of data. These models usually use the Transformer architecture [11] and consist of multiple fully connected layers of attention. LLMs can generate human-like text and code, they perform various natural language processing tasks such as translation, summarization, and programming code generation [12, 13].

There are three major types of Transformer models: encoder-only models, like BERT [14], decoder-only models, such as GPT-3 [15] [16], and finally encoder-decoder models like T5 [17]. Recent studies show that sizable decoder-only models outperform state-of-the-art in various software engineering tasks, including automatic program repair [4].

A decoder-only LLM’s procedure is as follows [18]. The user enters a textual prompt. Next, the prompt is tokenized and fed into the model. The model predicts the next token that should follow the tokenized input. This output token is then concatenated with the input tokens and sent again into the LLM in a loop until a stop condition is reached. The stop condition is a specific stop token or having exceeded the context length, which is the maximum number of tokens the model can be fed into. Tokens are the fundamental components based on which the model receives the input and generates the output. It is also the primary unit for counting usage and billing.

II-B LLM Token Cost

LLMs take an input prompt made of a certain number of tokens, and then infer what tokens should follow the given input. All token in the input as well as the tokens generated by inference induce computational and financial costs for users [19]. For example, the GPT-3.5 Turbo model, which is one of the most advanced LLMs, charges $0.0015 per 1k input tokens and $0.002 per 1k generated token [20] at the time of writing. Overall, the computational cost of using these models is a growing concern in society due to environmental concerns and in research due to the pressure on open, academic research [21].

This motivates the study of the token-efficiency of LLM-based techniques. The state-of-the-art metric for capturing token efficiency is the pass@t metric [22]. This metric shows the likelihood that a model produces a correct answer with a token cost of t tokens. Techniques are employed to reduce the token costs, such as using less expensive models [23], hyperparameter optimization [21], and in-context learning [19].

III CigaR Approach

III-A Overview

We design and implement CigaR, an LLM-based program repair tool that aims at minimizing the token cost. CigaR can be used with any LLM that has an API. CigaR employs three kinds of search: 1) search for a plausible patch from the buggy program, see Section III-B1; 2) search for a plausible patch from a partial patch, see Section III-B2; 3) search for more plausible patches from a single plausible patch, see Section III-C.

Figure 1 represents an overview of how CigaR works. CigaR takes as input a buggy program that fails on a test and generates a set of plausible patches that pass all tests as the output. CigaR works in two major steps, “First Plausible Patch Search” and “Plausible Patch Multiplication”. In the first plausible patch search step, CigaR tries to generate a plausible patch that passes all the tests. For this, CigaR uses so-called ‘initiation prompt’. If the LLM fails to suggest a plausible patch in its first invocation, CigaR proceeds by using improvement prompts. In the plausible patch multiplication step, after the first plausible patch is generated, CigaR asks the LLM to produce alternative plausible patches.

The token minimization approach of CigaR is based on three novel concepts: 1) the concept of patch multiplication, steering the LLM to generate maximally diverse plausible patches, 2) the design of three types of informative prompts which provide the LLM with a summarization of its responses in previous iterations, 3) the reboot of the repair process.

Refer to caption
Figure 1: Overview of CigaR: CigaR combines original prompting techniques, incl. partial patch improvement, search rebooting and patch multiplication to reach high effectiveness at a low token cost.

III-B Step 1: First Plausible Patch Search

III-B1 Plausible Patch Search Initiation

CigaR starts with an initiation prompt, whose goal is to identify a first plausible patch. To get the information for the initiation prompt, CigaR first runs the tests on the buggy program and saves the test results. Next, a prompt is generated that consists of three core sections.

  1. 1.

    A one-shot example of fixing a bug, to ensure that CigaR takes advantage of the in-context learning to understand the expected format.

  2. 2.

    The buggy code, which is the main part of the input, per a fault localization algorithm.

  3. 3.

    The test failure details, with valuable extra information about the bug under repair.

Refer to caption
Figure 2: P1: initiation prompt structure used for plausible patch search initiation.

Figure 2 illustrates the structure of the initiation prompt employed by CigaR. At the beginning, there is a system message (A) to put the LLM in the role of an APR tool. The second part of the prompt is a one-shot example (B), which includes a small bug and its fix in the history of the same software project. We select the smallest bug for the one-shot example in line with our token minimization strategy.

The third part of the input is the buggy code (C). The buggy part of the code is replaced with an “[INFILL]” token, and the original buggy part of the code is separately presented to the model. Note that CigaR fixes bugs that are localized inside a single function.

The fourth part of the prompt presents the test failure details (D). This includes the failing test method, the assertion that triggers the failure, and the runtime error message. This gives the model extra information regarding the execution of the buggy code that leads to the failure.

Finally, there is a call to action section (E) in the prompt that specifies what we exactly expect the LLM to generate.

The initiation prompt is sent to the model and the LLM is asked to generate 50 samples. This is because we want to generate as many samples as possible with a single input token upfront cost. Per our preliminary experiments, generating 50 samples per request is the sweet spot that gives us the most diverse set of patches without generating too many repetitive responses which cost output tokens for nothing.

CigaR extracts the generated code from the response and substitutes the original buggy code for it. The result is a set of ‘initial patches’. These initial patches are tested. A patch that passes all the tests is a plausible patch, and a patch that does not pass all the tests is a partial patch. If there is any plausible patch among the initial patches, it is directly sent to the plausible patch multiplication step, explained in Section III-C. Otherwise, the best partial patch together with its failure error message is sent to the partial patch improvement phase, explained in Section III-B2. The best partial patch is the first patch without compilation errors. If all generated patches have compilation errors, the first generated patch is considered as the best partial patch.

III-B2 Partial Patch Improvement

The goal of partial patch improvement is to get a plausible patch by improving partial patches found during the plausible patch search initiation phase. In this phase, CigaR takes an iterative approach for generating a plausible patch: until the model generates a plausible patch, CigaR invokes the LLM with improvement prompts. The improvement prompt contains a system message, the buggy code, test failure details, and a call to action similar to the initiation prompt as explained in Section III-B1.

In addition to the sections common with the initiation prompt, the improvement prompt also gives the LLM a concise feedback on its previous responses. This ensures that the model looks for the plausible patch at different parts of the search space with a low token cost. The information regarding the past generated partial patches comes right before the call to action. The partial patches generated in previous iterations are grouped by their failure message. Patches in each of these groups are added next to each other, and their common failure message comes after them. Using this method, we summarize previously generated patches without repeating test failure messages. CigaR considers as many previous patches as it is possible in a single prompt without going beyond the LLM’s prompt token limit.

Note that compared to the initiation prompt, the one-shot example is removed in the subsequent prompt. The reason is that the one-shot example is mainly used to help the LLM grasp the style of output it should generate. In the improvement prompt, the previous implausible patches play the same role and hint the model in what format it should generate the output.

The LLM is reinvoked with improvement prompts as long as no plausible patch is generated, and we have not exceeded max_invoke𝑚𝑎𝑥_𝑖𝑛𝑣𝑜𝑘𝑒max\_invokeitalic_m italic_a italic_x _ italic_i italic_n italic_v italic_o italic_k italic_e invocations. max_invoke𝑚𝑎𝑥_𝑖𝑛𝑣𝑜𝑘𝑒max\_invokeitalic_m italic_a italic_x _ italic_i italic_n italic_v italic_o italic_k italic_e is a configurable parameter in CigaR, set to ten by default per our pilot experiment. As soon as an improvement prompt generates a plausible patch, CigaR sends it to the third phase (Section III-C). Otherwise, CigaR has failed to fix the given buggy code.

III-B3 Rebooting the Plausible Patch Search

In case the LLM fails to synthesize a plausible patch after max_invoke𝑚𝑎𝑥_𝑖𝑛𝑣𝑜𝑘𝑒max\_invokeitalic_m italic_a italic_x _ italic_i italic_n italic_v italic_o italic_k italic_e invocations, CigaR reboots the whole repair process. This means ignoring all the patches generated before and starting from the initiation prompt. Rebooting the repair process enables CigaR to explore an even larger repair space by starting from a different initial seed, according to the model temperature. For this, CigaR uses a high temperature (see Section III-D), which maximize exploration. The rebooting strategy with high temperature leads to initiating the repair process from a radically different hidden state, meaning exploring an undiscovered part of the repair space. As a part of our token minimization approach, this helps CigaR avoid using too many tokens for exploring a limited and fruitless part of the search space.

Every time CigaR reboots the repair process, we say it has started a new round. Round_1 is when CigaR starts the repair and round_i is the repair process after the (i1)𝑖1(i-1)( italic_i - 1 )th reboot.

III-C Step 2: Plausible Patch Multiplication

Refer to caption
Figure 3: P2: Prompt used for plausible patch multiplication.

Plausible patch multiplication is the final step, its goal is to generate more distinct plausible patches. After the first plausible patch is generated with an initiation or an improvement prompt, CigaR tries to generate new plausible patches. The reason is that the program repair literature has shown that even a plausible patch can be semantically incorrect [24]. By generating multiple plausible patches, CigaR has a better chance of producing a correct patch. CigaR can employ any patch ranking technique [25] to select the most likely patch out of the set of plausible patches generated.

The patch multiplication process also works iteratively. In this step, the prompt contains the following parts: a system message and the buggy code similar to the initial prompt, as well as a summary of the generated plausible patches and a call to action to generate different patches. Figure 3 shows an example of these latter two parts. CigaR includes as many of the recent plausible patches as possible without exceeding the LLM’s prompt token limit. This makes the LLM amplify the set of existing patches with patches that are distinct from the previous ones, with a reduced token budget. CigaR invokes the LLM with multiplication prompts and collecting the responses five times. We aim for five multiplication tries to explore a significant part of the plausible patch search space without generating many repetitive patches.

Finally, the full list of all generated plausible patches is reported to the user.

III-D Implementation

CigaR is implemented in Python and uses OpenAI “gpt-3.5-turbo
-0301
” as the underlying LLM with sampling temperature of 1. This high temperature adds a notable level of randomness to LLM’s output. This is essential for exploring different parts of the repair space with CigaR’s reboots and patch multiplication techniques. We also note that CigaR implements a thorough caching system, which stores all the prompts sent to the LLM, all responses received from the LLM as well as the test execution results. This caching system is important for reproducing and investigating past repairs, as invoking the LLM can be highly costly. CigaR and all the data related to our experiments are made publicly available111https://github.com/ASSERT-KTH/cigar.

IV Experimental Methodology

IV-A Research Questions

In this paper, we study the following research questions.

  • RQ1 (effectiveness): How effective is CigaR in terms of generating plausible and correct patches? We count the number of bugs for which CigaR generates a plausible and correct patch, respectively. We compare CigaR against the state-of-the-art APR tools based on OpenAI’s LLMs.

  • RQ2 (efficiency): How efficient is CigaR in terms of token cost compared to the state-of-the-art? We run CigaR on 429 real-world bugs from the widely used datasets, Defects4J and HumanEval-Java. We compare CigaR’s token efficiency against ChatRepair [4].

  • RQ3 (patch exploration): To what extent does patch exploration advance over reboots and multiplications? We count the number of distinct patches that are the result of CigaR’s iterative approach. As repeating the LLM invocations is the major source of CigaR’s token cost, this studies whether our reboot and multiplication strategies are worthwhile.

IV-B Study Subjects

To evaluate CigaR, we need a benchmark. Therefore, we use Defects4J and HumanEval-Java [5] as the benchmarks of our study. Defects4J is the most commonly used benchmark for APR. It enables us to compare against performance metrics available in the published papers or their replication packages. We also consider HumanEval-Java [5] to address the potential data leakage concern. HumanEval-Java is a benchmark designed fore being more recent than the training data of the gpt-3.5-turbo-0301 model. Therefore, a good performance on HumanEval-Java indicates our results are not susceptible to data leakage.

Table I: Statistics for the considered bugs from the Defects4J dataset.
Identifier Project Name All Bugs Single-Function
Chart jfreechart 26 17
Closure closure-compiler 174 102
Lang commons-lang 64 42
Math commons-math 106 73
Mockito mockito 38 17
Time joda-time 26 16
HumanEval-Java 162 162
Total 596 429

As CigaR fixes single-function bugs, we consider only bugs of this type. Table I shows the characteristics of the projects in our dataset. The “All Bugs” column presents the total number of bugs from that project in the most recent version of Defects4J and HumanEval-Java. The “Single-Function” column shows the number of single-function bugs. Overall, our experiments consider 429 single-function bugs. Note that each project contains at least 16 single-function bugs, which makes a diverse and appropriate dataset for our study.

IV-C Protocol for RQ1 (Effectiveness)

To answer RQ1, we run CigaR on all 429 bugs in our dataset. After each round of executing CigaR ends, we identify the bugs for which a plausible patch is generated, and the bugs with no plausible patch. We continue rebooting CigaR until one plausible patch is found or the maximum number of round is reached.

We assess the effectiveness of CigaR by computing the number and ratio of bugs for which it generates a plausible or correct patch. We consider a patch to be plausible when it passes all the provided tests. We consider a patch to be correct if its abstract syntax tree (AST) exactly matches the AST of the ground-truth fix for the bug in our dataset. This allows for being oblivious to formatting changes. When a tool generates a correct patch for a bug, we say the tool fixes the bug. The more correct/plausible patches generated by CigaR, the more effective it is. Note that generating a maximum number of distinct plausible patches is good, because it maximizes the underlying of having the correct one. Techniques such as patch ranking [24] are then responsible for finding the correct one.

We also compare CigaR against LLM-based APR tools that employ OpenAI’s GPT models and were evaluated on Defects4J bugs. To obtain a full list of such tools, we search for “GPT” and related LLM names in the Google Scholar academic search engine. We then manually check all the papers and identify the ones that run their proposed tool on all applicable Defects4J bugs and report the number of correct patches. The resulting set includes nl2fix [26], STEAM [27] and ChatRepair. nl2fix augments the prompt with issue descriptions. STEAM takes an interactive approach with multiple steps to fix Defects4J bugs. ChatRepair [4] has an iterative approach and gives test results to the LLM as feedback. For the sake of comparison, we extract the numbers reported in the original paper of nl2fix and STEAM.

Per our literature review, ChatRepair is the most advanced APR tool that uses OpenAI’s models. Therefore, having the actual patches generated by ChatRepair and its prompts is essential for the study of CigaR’s effectiveness and efficiency against the state-of-the-art. As ChatRepair and its patches for Defects4J are not publicly available, by October 4, 2023, we carefully reimplement it according to the paper, incl. private communication with the authors. Our reimplementation of ChatRepair is publicly available in our repository for researchers and practitioners.

In our comparison between CigaR and the state-of-the-art APR tools, we measure performance in terms of the ratio of bugs with a correct and plausible generated patch. Specifically for CigaR, we also track how the number of correct/plausible patches increases as CigaR enters new rounds. A significant increase shows that the effectiveness of the rebooting strategy adopted by CigaR.

We also check the overlap between the bugs that each of CigaR and ChatRepair fix. This experiment shows us whether some tools explore a different part of the repair search space. We only consider CigaR and ChatRepair for this study, because the patches for nl2fix and STEAM are not publicly available.

IV-D Protocol for RQ2 (Efficiency)

In answer to RQ2, we compare CigaR against ChatRepair only, because it is the only tool for which we can precisely measure token usage. We compare CigaR’s total token cost against the total token cost of ChatRepair. Moreover, we study the token cost of CigaR and ChatRepair on bugs that are fixed vs not-fixed by each tools per our RQ1 experimental results. This sheds light on a different aspect of the efficiency of each tool, as it splits the token cost to fruitful ones leading to a correct patch and unfruitful ones, where the token price to pay is a pure loss. We want to see if these two tools spend a significant number of unnecessary tokens on bugs that are not fixed at the end.

IV-E Protocol for RQ3 (Exploration)

To answer RQ3, we measure how the number of distinct patches generated by CigaR increases, as it makes more LLM invocations. A patch p for a buggy program b is distinct if p is textually different from all patches generated for b in previous LLM invocations. Note that a distinct patch may be correct, plausible, partially plausible, or even uncompilable. Synthesizing a large number of distinct patches means CigaR is exploring different parts of the search space without repeating itself. Such patch synthesis is essential for making the exploration cost-effective and finding a correct patch with a low cost.

We count the total number of distinct generated patches as well as the number of distinct plausible patches for “plausible patch multiplication”, which is in line with the actual target of this step. A consistent increase in the number of distinct generated (plausible) patches shows the patch exploration capabilities in CigaR’s repair process.

We conduct this experiment on a set of selected bugs in our dataset. For studying the first plausible patch search step, we select the bugs that take the maximum number of LLM invocations to be plausibly repaired per each Defects4J project in the dataset, which means they are very hard to fix, and much exploration is needed.

For the plausible patch multiplication step, we make the set of selected bugs more diverse by also considering the bugs for which the first plausible patch is generated with the first LLM invocation. Note that at the plausible patch multiplication step, the number of LLM invocations is always set to five, which provides the same amount of data for all bugs.

By studying the number of distinct generated (plausible) patches relative to the number of LLM invocations, we better understand how CigaR explores the patches in the repair space for different types of bugs. In particular, we study whether CigaR cause unnecessary token cost by repeating patches that have been synthesized in previous LLM invocations.

V Experimental Results

V-A Results for RQ1 (Effectiveness)

Table II: Effectiveness of CigaR vs recent LLM-based program repair tools on Defects4J.
Defects4J HumanEval-Java ALL
Tool #Bugs #Plausible #Correct (EM) #Bugs #Plausible #Correct (EM) #Bugs #Plausible #Correct (EM)
nl2fix 283 53.3% (151/283) 11.3% (32/283)
STEAM 260 25.0% (65/260)
ChatRepair 267 59.9% (160/267) 19.8% (53/267) 162 93.2% (151/162) 52.4% (85/162) 429 72.4% (311/429) 32.1% (138/429)
CigaR _R1 267 56.5% (151/267) 23.9% (64/267) 162 83.9% (136/162) 61.1% (99/162) 429 66.8% (287/429) 33.7% (145/429)
CigaR _R3 267 61.4% (164/267) 25.4% (68/267) 162 90.1% (146/162) 62.9% (102/162) 429 72.2% (310/429) 39.6% (170/429)
CigaR _R12 267 69.2% (185/267) 25.8% (69/267) 162 93.8% (152/162) 62.9% (102/162) 429 78.5% (337/429) 39.8% (171/429)

Table II shows the results of running CigaR on our dataset and compares it with the state-of-the-art tools. The table presents the results for Defects4J and HumanEval-Java bugs both separately and combined. The “Tool” column shows the name of the APR tool. As explained in Section IV-C, we reboot CigaR several times until it generates a plausible patch for one of the remaining unfixed bugs. This strategy leads to running CigaR for 12 rounds. The results for three different rounds is presented in Table II. “#Bugs” represents the total number of bugs on which the APR tool is run.

Note that the “#Bugs” column that shows the number of considered Defects4J bugs slightly varies between the tools. This is due to the differences between the expectations of each tool. However, the overall number of considered bugs is similar between the considered tools, which enables us to have a meaningful comparison.

The “#Plausible” and “#Correct (EM)” columns show the number of bugs for which the APR has generated a plausible and a correct patch, respectively. For example, row CigaR _R3 shows that after running CigaR for three rounds, it generates a plausible patch for 61.4% of Defects4J bugs and 90.1% of HumanEval-Java bugs. In total, it generates a plausible patch for 72.2% or 310 of 429 considered bugs in our experiment. For 39.6% (170/429) of the bugs, one of the generated plausible patches is correct as it has an AST exactly matching the AST of the ground-truth fix. Note that the performance of nl2fix and STEAM are not known for HumanEval-Java and their tool is not publicly available.

CigaR is able to generate a plausible patch for 69.2% of Defects4J bugs and 93.8% of HumanEval-Java bugs, which is arguably high. In total, CigaR produces plausible patches for 78.5% (337/429) of the bugs in our dataset. This is higher than the best performance by state-of-the-art, ChatRepair, which generates plausible patches for 59.9% (160/267) of Defects4J bugs and 93.2% (151/162) of HumanEval-Java. In the first round R1, CigaR generates a plausible patch for 66.8% (287/429) of the bugs, and this number increases at each round until the twelfth round. This shows the effectiveness of the reboot strategy adopted by CigaR, which maximizes exploration using a new random seed thanks to the temperature mechanism.

Generating a correct patch is the ultimate goal of any APR tool. CigaR outperforms all other tools by generating a correct patch for 25.8% (69/267) of Defects4J bugs and 93.8% (152/162) of HumanEval-Java bugs. This proves the overall end-to-end effectiveness of CigaR. The number of bugs with a correct patch also increases over rounds, again demonstrating the importance of rebooting. CigaR surpasses all existing tools on both Defects4J and HumanEval-Java in terms of correct patches at round three (R3).

Refer to caption
Figure 4: The overlap between bugs fixed by CigaR and ChatRepair in our merge dataset (Defects4J and HumanEval).

Figure 4 illustrates the overlap between bugs that ChatRepair and CigaR fix (we cannot do it for the other tools due to the absence of detailed information in the respective papers). There are 125 bugs that both tools generate a correct patch for. Moreover, 26% (46/171) of the bugs fixed by CigaR are not fixed by ChatRepair, which shows CigaR’s patch exploration strategy looks into parts of LLM’s repair space that was not covered previously.

{mdframed}

Answer to RQ1: How effective is CigaR in terms of generating plausible and correct patches?
Our experiment on Defects4J shows that CigaR is effective at automatically fixing real-world bugs. CigaR generates plausible patches for 78.5% (337/429) bugs. It generates 39.8% (171/429) correct patches with the strictest definition of correctness – exact match with the developer ground truth patch. This outperforms the related APR tools built on OpenAI’s GPT models, thus showing the potency of CigaR’s prompting techniques. Notably, 46 bugs in the dataset are only fixed by CigaR. The validity of those results is fully confirmed by our run on a new benchmark HumanEval-Java, which is strong evidence of external validity. Overall, these experiments demonstrate that CigaR’s novel iterative prompting pipeline effectively explore the search space of LLMs to find correct patches.

V-B Results for RQ2 (Efficiency)

Table III: Token cost of CigaR and ChatRepair on various types of bugs. Lower cost indicates better token cost efficiency.
Bugs_Fixed_By CHC CIC CHC_AVG CIC_AVG Saving
ChatRepair (13) 8.6M 1.5M 661K 115K
CigaR (46) 24.4M 1.5M 530K 32K
Both (125) 76M 2.6M 608K 20K 96%
Neither (245) 95.3M 49.1M 388K 200K 48%
Total (429) 204.3M 54.9M 467K 127K 73%

Next, we focus on the token cost. We only compare with ChatRepair as the implementations of STEAM and nl2fix are not publicly available, preventing us for collecting their respective token cost.

We compute each tools’ cost on bugs of different types. Table III shows the result of this experiment. The “Bugs_Fixed_By” column indicates the type of bugs according to the tool(s) that fix them. The number in brackets in this column is the number of bugs of the respective type. “CHC” and “CIC” show the token cost of ChatRepair and CigaR on each type of bug, respectively. Also, “CHC_AVG” and “CIC_AVG” represent ChatRepair’s and CigaR’s average token cost per bug for each type. Finally, the “Saving” column indicates the percentage of saved token cost by CigaR compared to ChatRepair.

In total, ChatRepair spends 467K tokens on average, while CigaR spends 127K. This means CigaR improves the token cost by 73% (149.4M/204.3M), while, per RQ1, it outperforms ChatRepair in terms of the number of plausible/correct generated patches, demonstrating a win-win situation. This is explained by of our token minimization strategies: asking for as many samples as possible at each LLM invocation, timely reboots, and summarizing the previous patches in the prompt. To sum up, CigaR improves on both aspects: 1) it enables us to have more distinct patches and 2) for fewer tokens.

In Table III, we see that CigaR saves 96% (73.4M/76M) of token cost on bugs that are fixed by both. We also see that CigaR spends 32K tokens on average for the 46 bugs that are only fixed by CigaR, while ChatRepair spends 661K tokens on average for the 13 bugs only fixed by ChatRepair. All this data shows, when the tool is able to find a correct patch, CigaR works much more efficiently than ChatRepair. By manually checking our experimental data, we see the reason is that ChatRepair uses up to 199 LLM invocations after generating the first plausible patch to produce alternative plausible patches one by one. However, CigaR only uses five LLM invocations for plausible patch multiplication and generates up to 50 patches at each invocation. We see that CigaR’s summarization of previous patches in each invocation and using a high temperature is effective: it leads the LLM to synthesize novel patches without using too many invocations and causing unnecessary cost.

1 Type actual = ((ParameterizedType) generic).getActualTypeArguments()[0];
2 + if (actual instanceof Class) {
3 return (Class) actual;
4 + } else if (actual instanceof ParameterizedType) {
5 + return (Class) ((ParameterizedType) actual).getRawType();
6 + }
Listing 1: The correct patch for the bug “Mockito_12”, which is fixed by both ChatRepair and CigaR. ChatRepair spends 275K tokens on this bug, while CigaR spends 22K saving 92% of the token cost.

LABEL:lst:cost_ex shows the bug “Mockito_12” and its fix by both CigaR and ChatRepair. In the buggy version, the program always casts the variable actual to a Class and returns it. In the fixed version generated by CigaR and ChatRepair, it is first checked whether actual is an instance of Class; if not, the raw type of actual is returned. On this bug, CigaR saves 92% (253K/275K) of the token cost, compared with ChatRepair. Our manual analysis on this bug confirms our claim regarding cost saving on bugs fixed by both tools: ChatRepair generates alternative plausible patches one-by-one with up to 199 LLM invocations, while CigaR uses only 5 LLM invocations with a sample size of 50.

Finally, we consider the 245 bugs that are fixed by neither of the tools. ChatRepair’s token cost on these bugs is 95.3 million and CigaR’s token cost is 49.1 million. On average, ChatRepair spends 388K (95.3M/245) tokens on each of these bugs and CigaR spends 200K (49.1/245). This means CigaR saves 48% (188K/388K) of the token cost for unfixable bugs. The token cost difference on unfixable bugs is because CigaR summarizes previously generated partial patches, while ChatRepair includes the whole previously generated patches in its new invocations.

Moreover, CigaR spends 115K tokens on average for bugs that are only fixed by ChatRepair, while ChatRepair spends 530K tokens on average for bugs that are only fixed by CigaR. This shows the cost difference is smaller when the bug is not fixed, compared to the bugs that are fixed. The reason is that both tools make the maximum number of LLM invocations that they are allowed to as long as they have not generated a plausible patch.

{mdframed}

Answer to RQ2: How efficient is CigaR in terms of token cost compared to the state-of-the-art?
CigaR improves the token cost by 73% (149.4M/204.3M), compared to the baseline ChatRepair. This is because CigaR has been designed and engineered to explore the LLM repair space with as few LLM invocations as possible. Also, we show that CigaR is particularly cost-efficient when it is successful at generating a correct patch for a bug, with a token total price of only 4% on average compared to the baseline, representing a saving of 96%. This means that future companies which will use a program repair product such as CigaR would see their program repair operational costs dramatically reduced.

V-C Results for RQ3 (Exploration)

Table IV: Selected bugs for studying patch exploration.
BID #LLM_Invocations LEN FPPS PPM
Chart_5 1 1,452
Closure_58 91 2,568
Closure_128 1 263
Lang_19 119 1,652
Lang_53 101 5,218
Math_13 103 165
Math_23 1 5,380
Math_89 1 95
Mockito_33 1 410
Time_20 1 402
Time_22 85 108
Refer to caption
Figure 5: The progress during the plausible patch search step. The number of distinct patches increases with LLM invocations, with significant jumps at each reboot, validating the concept.

Table IV shows the bugs that are selected for investigating the patch exploration progress. The table shows the bug identifier (“BID”), the number of LLM invocations in the plausible search phase, and the length of the buggy function in characters (“LEN”). The “FPPS” and “PPM” columns show if the bug is considered for studying the first plausible patch phase and/or the plausible patch multiplication steps, respectively. For the first plausible patch search, we want to show how patch exploration discovers new distinct patches with reboots. Therefore, the selected bugs for studying this step are the ones with most number of reboots during the repair process. Note that no bug from “Mockito” and “Chart” is selected for this step, because all the bugs in these projects are either not fixed at all or fixed in a single round without reboots. To diversify the type of considered bugs, we also select bugs for which the first plausible patch is found with one LLM invocation. This leads to including “Chart_5”, “Closure_128”, “Math_23”, “Math_89”, “Mockito_33”, and “Time_20” in our study of the plausible patch multiplication step.

Figure 5 shows the number of total distinct patches generated up to a certain number of LLM invocations in the first plausible patch search phase. We observe two major points in these plots. First, the number of generated distinct patches jumps after every reboot, i.e. for new rounds which are made every ten requests. After the jump at the beginning of a round, the number of new distinct patches has a slow increase until the end of that round and then jumps back when the next round starts. This indicates that the reboots are more effective than continuing a single conversation for too long. The patch exploration power of reboots with new random seed is higher.

The second observation is that the overall progress of patch exploration seems to be linear. This means at each round CigaR identifies new distinct patches and this number does not decline rapidly. Again, this corroborates the effectiveness of reboots, as they effectively lead the model to explore a different part of patch search space. For the sake of experimental costs, we were not able to discover when a potential plateau effect starts.

Figure 6 presents the number of total distinct plausible patches at different LLM invocation steps during plausible patch multiplication. We notice that the number of new distinct patches gradually increases. Per our experiment, five is an appropriate number of LLM invocations for patch multiplication, with this setting, CigaR takes the most advantage of patch multiplication without having invocations that yield no new distinct patches while incurring a token cost. We also see that the decline in the number of new distinct patches is generally the same over bugs. For the two longest bugs, namely “Math_23” and “Lang_53”, the number of new distinct plausible patches is very low during patch multiplication. The reason is that CigaR generates very few patches for these two at each LLM invocation because of the token limit per LLM invocation: one cannot fit many large samples in this fixed window. The low number of distinct plausible patches for such long bugs also confirms our claim: having a limited number of invocations for plausible patch multiplication is a reasonable choice regarding the trade-off of token cost / new patches.

We also note that in both figures, the number of distinct patches is higher for bugs that are smaller, such as “Math_13” and “Time_22” with 165 and 85 characters, respectively. The reason is that because of the token limit per LLM invocation, CigaR can ask for more patches in each invocation given a fixed token budget.

{mdframed}

Answer to RQ3: To what extent does patch exploration advance over reboots and multiplications?
The reboot strategy in the first plausible patch search step of CigaR, enables it to efficiently explore different parts of the search space, avoiding wasting tokens in dead ends. This finding holds for various bug types: simple, complex, short, and long. In the plausible patch multiplication phase, our experiments show a real exploration of distinct plausible patches in the search space via prompting technique of patch multiplication. Both rebooting and patch multiplication are important, and put in conjunction, enable CigaR to efficiently explore the patch search space. This is significant because those original, largely unknown, prompting techniques could be used beyond program repair, in many of the software engineering tasks based on LLMs.

Refer to caption
Figure 6: Progress during the plausible patch multiplication in # distinct plausible patches. This validates the concept of plausible patch multiplication.

VI Threats to Validity

VI-A Construct Validity

We consider the number of tokens used in the prompts and LLM responses to be the cost of CigaR. This is the main threat to the construct validity of this paper, since the cost can also be defined in many different ways. For example, the cost can be attributed to the number of LLM invocations, GPU usage 222https://huggingface.co/pricing#spaces, or fine-tuning cost 333https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset. We believe the token count is an appropriate metric for measuring the cost for two reasons. First, as explained in Section II, analyzing and producing tokens are the building blocks of how LLMs work. Therefore, it is the most natural and meaningful way to measure the cost in terms of spent tokens. Second, OpenAI’s models, which are the most widely used API-based LLMs, charge their users per token usage. Based on these two reasons, researchers and practitioners can safely rely on the results of our analysis of LLM-based APR cost effectiveness.

Another threat to the construct validity of our work is data leakage. The gpt-3.5-0301 model used by CigaR is likely to have parts of the Defects4J dataset in its training data [28]. This can lead to wrong measurement of patch exploration capabilities by LLMs. To address this threat, we follow the method used in recent related work [5] and use another benchmark HumanEval-Java. As this dataset is more recent than the training data used in gpt-3.5-0301, it is not prone to the data leakage problem. Our results in Section V-A and Section V-B show that CigaR outperforms the state-of-the-art in terms of bug-fixing performance and cost efficiency on HumanEval-Java, providing evidence of external validity.

VI-B Internal Validity

In our RQ1 experiment, for nl2fix and STEAM, the set of considered bugs is not exactly the same as bugs considered because of the absence of publicly available reproduction packages. This poses a threat to the internal validity of our assessment of the effectiveness CigaR against state-of-the-art. However, we have reimplemented and rerun the ChatRepair on the exact same set of bugs considered in our RQ1 experiment. ChatRepair is a one of the most recent LLM-based APR tools, and it is the most relevant tool to our approach. Hence, CigaR’s significant improvement over ChatRepair strongly verifies CigaR’s effectiveness against the state-of-the-art.

VI-C External Validity

Our experiments are limited to Java projects in the Defects4J dataset. This is a threat to the external validity of our study, as our results may not generalize to other programming languages or projects. However, Defects4J is the most widely used and respected dataset in the area of automated program repair. Also, we have considered six different projects and various types of bugs with different lengths and complexities for our experiments. This strengthens the validity of our findings, though a more comprehensive study of the token cost of LLM-based APR in the future will certainly be valuable.

VII Related Work

We review the previous work in three areas of research: automatic program repair with LLMs, LLM cost management, and effective repair space exploration.

VII-A Automatic Program Repair with LLMs

Automatic program repair is an important software engineering use case for LLMs [13]. Xia and Zhang [29] introduce AlphaRepair based on the CodeBert LLM [30] and show that with an infill-style zero-shot prompt [31] they outperform all existing learning-based APR tools. Zhang et al. [32] also use infill-prompts for APR, but mask the tokens based on common template-based APR patterns. Zero-shot prompts have also proven to be effective at fixing security vulnerabilities [33].

Recent LLM-based APR approaches, have improved the LLM’s bug fixing effectiveness with prompt engineering [8] and fine-tuning [34]. Cao et al. [35] show that adding the code intention to the prompt and using a dialogue based approach improves the LLM’s performance. Xia et al. [36] add multiple short bug-fix examples to the prompt and conduct an experiment with nine different LLMs. They conclude that the Codex LLM [13] outperforms other LLMs. Nashit et al. [37] use an embedding-based technique, SRoBERTa[38], and a frequency-based technique, BM-25[39], to retrieve relevant examples for a given repair task. Similar to [36], they show how Codex with a few-shot prompt outperforms the state-of-the-art APR tools.

Ahmed and Devanbu propose using the self-consistency approach for program repair [40]. In this approach, the prompt contains the buggy code and the model is asked to generate multiple pairs of explanation-solutions. The prompt also contains a few examples of buggy code, explanation, and solutions. Then the solution that appears most frequently is taken as the final answer of the model. The authors show their approach outperforms the state-of-the-art on the MODIT dataset [41].

Iterative prompting strategies for APR are employed by other tools as well [42, 43, 44, 45, 27]. Most recently, Xia and Zhang introduce ChatRepair [4], an automated iterative APR approach. ChatRepair gives the test results as feedback to the LLM to improve its patches. Once a plausible patch is generated, ChatRepair asks the LLM to generate alternative plausible patches. In contrast with CigaR, ChatRepair does not reduce the LLM costs. Namely, CigaR uses patch summarization, high sample size, and multiplication rounds to minimize the cost, while keeping strong fixing capabilities.

Researchers have also used fine-tuning to enhance LLM-based APR effectiveness [46, 47, 48]. In one early work, Mashhadi and Hemmati [49] fine-tune CodeBERT to automatically generated fixes for ManySStuBs4J. Zirak and Hemmati show that fine-tuning a LLMs on a dataset related to the application domain is significantly more effective than fine-tuning on an irrelevant dataset [50].

Jiang et al. [5] conduct a large-scale study on using LLMs for APR. They show a fine-tuned version of InCoder [51] that fixes 164% more bugs compared to the state-of-the-art learning-based techniques. In another study, Huang et al. [52] observe that fine-tuning is effective for fixing different types of bugs (test failure, vulnerability, and errors) in various programming languages (Java, C/C++, and JavaScript).

In contrast with existing LLM-based APR approaches, CigaR is the first to focus and demonstrate cost minimization without losing effectiveness. For this purpose, it uses three novel prompt strategies that have been used in previous work in the way CigaR works: 1) plausible patch multiplication; 2) summarizing feedback on previous patches; and 3) using large sample size with a high temperature.

VII-B LLM Cost Management

Wang et al. propose EcoOptiGen [21] to optimize the utility-cost ratio of LLMs. EcoOptiGen searches for optimal hyperparameters to generate the best solutions given a limited inference dollar budget. The considered hyperparameters include number of responses, temperature, and max tokens. The authors later introduce EcoAssistant [53], which uses an iterative approach to gradually improve the LLMs response. EcoAssistant first attempts to solve the problem using a cheap and weak LLM, and moves to more expensive and strong LLMs only if it is required. The authors show that EcoAssistant surpasses GPT-4’s effectiveness, while reducing the cost by 50%.

Arefeen et al. introduce LeanContext [54]. This tool reduces the LLM cost by summarizing the prompt. For this, it only includes the k most relevant sentences from the context in the prompt. The number k is determined using a reinforcement learning technique. In the same line of research, Mu et al. [55] propose the gisting strategy to compress the prompt. This approach compresses the prompt into smaller ‘gist tokens’. Lin et al. also propose the BatchPrompt strategy [56], which reduces the token cost by putting more data points in a single prompt.

Chen et al. propose FrugalGPT [6], which uses the LLM cascade strategy for cost minimization. This strategy finds the most cost-effective LLM API that can be used to solve a given problem. The authors also review two other cost minimization strategies: prompt adaptation and LLM approximation. Prompt adaptation reduces the cost by using shorter prompts. LLM approximation creates simpler and cheaper LLMs to match more expensive LLMs on a specific task.

Contrarily to other tools focused on LLM cost management, CigaR is the first system frugal LLM technique for automated program repair. CigaR contains token minimization strategies designed for and particularly suitable to APR. For example, summarizing the test results for previous patches is a unique, domain specific technique invented by CigaR.

VIII Conclusion

In this paper, we have introduced CigaR, a novel LLM-based APR tool that focuses on minimizing the token cost. CigaR reduces the token cost by minimizing the repetitive data sent or received to LLMs and rapidly exploring various parts of the search space. More specifically, CigaR employs three strategies. 1) It summarizes previous LLM responses in new prompts and asks the LLM to generate different responses, 2) it uses reboots and patch multiplication with a high temperature setting to make sure the model looks into different parts of the search space, 3) it asks for the maximum number of responses per LLM invocation so that it explores a large part of the search space with every single prompt. Our experiments on 429 bugs from Defects4J and HumanEval-Java shows that CigaR outperforms the state-of-the-art by fixing 33 more bugs, while reducing the token cost by 73% (149.4M/204.3M). For future research, our paper proposes original strategies to reduce the cost of LLMs, which can be expanded to other software engineering tasks, such as code generation.

References

  • [1] M. Monperrus, “Automatic software repair: A bibliography,” ACM Computing Surveys (CSUR), vol. 51, no. 1, pp. 1–24, 2018.
  • [2] R. C. Seacord, D. Plakosh, and G. A. Lewis, Modernizing legacy systems: software technologies, engineering processes, and business practices.   Addison-Wesley Professional, 2003.
  • [3] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer, “A Systematic Study of Automated Program Repair: Fixing 55 Out of 105 Bugs for $8 Each,” in Proceedings of the International Conference on Software Engineering, 2012, pp. 3–13.
  • [4] C. S. Xia and L. Zhang, “Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt,” arXiv preprint arXiv:2304.00385, 2023.
  • [5] N. Jiang, K. Liu, T. Lutellier, and L. Tan, “Impact of code language models on automated program repair,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE).   Los Alamitos, CA, USA: IEEE Computer Society, may 2023, pp. 1430–1442. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICSE48619.2023.00125
  • [6] L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023.
  • [7] OpenAI. (2023) Pricing. [Online]. Available: https://openai.com/pricing
  • [8] Q. Zhang, T. Zhang, J. Zhai, C. Fang, B. Yu, W. Sun, and Z. Chen, “A critical review of large language model on software engineering: An example from chatgpt and automated program repair,” arXiv preprint arXiv:2310.08879, 2023.
  • [9] R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” in Proceedings of the 2014 international symposium on software testing and analysis, 2014, pp. 437–440.
  • [10] A. S. Luccioni, S. Viguier, and A.-L. Ligozat, “Estimating the carbon footprint of bloom, a 176b parameter language model,” Journal of Machine Learning Research, vol. 24, no. 253, pp. 1–15, 2023.
  • [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [12] A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting, “Large language models in medicine,” Nature medicine, vol. 29, no. 8, pp. 1930–1940, 2023.
  • [13] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
  • [14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [15] M. Zhang and J. Li, “A commentary of gpt-3 in mit technology review 2021,” Fundamental Research, vol. 1, no. 6, pp. 831–833, 2021.
  • [16] L. K. Umapathi, A. Pal, and M. Sankarasubbu, “Med-halt: Medical domain hallucination test for large language models,” arXiv preprint arXiv:2307.15343, 2023.
  • [17] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
  • [18] A. Roberts, C. Raffel, and N. Shazeer, “How much knowledge can you pack into the parameters of a language model?” arXiv preprint arXiv:2002.08910, 2020.
  • [19] Z. Cheng, J. Kasai, and T. Yu, “Batch prompting: Efficient inference with large language model apis,” arXiv preprint arXiv:2301.08721, 2023.
  • [20] “OpenAI pricing,” https://openai.com/pricing, accessed: 2024-04-05.
  • [21] C. Wang, S. X. Liu, and A. H. Awadallah, “Cost-effective hyperparameter optimization for large language model generation inference,” arXiv preprint arXiv:2303.04673, 2023.
  • [22] T. X. Olausson, J. P. Inala, C. Wang, J. Gao, and A. Solar-Lezama, “Demystifying gpt self-repair for code generation,” arXiv preprint arXiv:2306.09896, 2023.
  • [23] J. Oppenlaender and J. Hämäläinen, “Mapping the challenges of hci: An application and evaluation of chatgpt and gpt-4 for cost-efficient question answering,” arXiv preprint arXiv:2306.05036, 2023.
  • [24] H. Ye, J. Gu, M. Martinez, T. Durieux, and M. Monperrus, “Automated classification of overfitting patches with statically extracted code features,” IEEE Transactions on Software Engineering, vol. 48, no. 8, pp. 2920–2938, 2021.
  • [25] A. Ghanbari and A. Marcus, “Patch correctness assessment in automated program repair based on the impact of patches on production and test code,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 654–665.
  • [26] S. Fakhoury, S. Chakraborty, M. Musuvathi, and S. K. Lahiri, “Towards generating functionally correct code edits from natural language issue descriptions,” arXiv preprint arXiv:2304.03816, 2023.
  • [27] Y. Zhang, Z. Jin, Y. Xing, and G. Li, “Steam: Simulating the interactive behavior of programmers for automatic bug fixing,” arXiv preprint arXiv:2308.14460, 2023.
  • [28] K. Huang, Z. Xu, S. Yang, H. Sun, X. Li, Z. Yan, and Y. Zhang, “A survey on automated program repair techniques,” arXiv preprint arXiv:2303.18184, 2023.
  • [29] C. S. Xia and L. Zhang, “Less training, more repairing please: revisiting automated program repair via zero-shot learning,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 959–971.
  • [30] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “Codebert: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155, 2020.
  • [31] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in 2009 IEEE conference on computer vision and pattern recognition.   IEEE, 2009, pp. 951–958.
  • [32] Q. Zhang, C. Fang, T. Zhang, B. Yu, W. Sun, and Z. Chen, “Gamma: Revisiting template-based automated program repair via mask prediction,” arXiv preprint arXiv:2309.09308, 2023.
  • [33] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Examining zero-shot vulnerability repair with large language models,” in 2023 IEEE Symposium on Security and Privacy (SP).   IEEE, 2023, pp. 2339–2356.
  • [34] R. Paul, M. M. Hossain, M. L. Siddiq, M. Hasan, A. Iqbal, and J. C. S. Santos, “Enhancing automated program repair through fine-tuning and prompt engineering,” 2023.
  • [35] J. Cao, M. Li, M. Wen, and S.-c. Cheung, “A study on prompt design, advantages and limitations of chatgpt for deep learning program repair,” arXiv preprint arXiv:2304.08191, 2023.
  • [36] C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” in Proceedings of the 45th International Conference on Software Engineering (ICSE 2023). Association for Computing Machinery, 2023.
  • [37] N. Nashid, M. Sintaha, and A. Mesbah, “Retrieval-based prompt selection for code-related few-shot learning,” in Proceedings of the 45th International Conference on Software Engineering (ICSE’23), 2023.
  • [38] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019.
  • [39] S. Robertson, H. Zaragoza et al., “The probabilistic relevance framework: Bm25 and beyond,” Foundations and Trends® in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.
  • [40] T. Ahmed and P. Devanbu, “Better patching using llm prompting, via self-consistency,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE).   IEEE Computer Society, 2023, pp. 1742–1746.
  • [41] S. Chakraborty and B. Ray, “On multi-modal learning of editing source code,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE).   IEEE, 2021, pp. 443–455.
  • [42] W. Yuan, Q. Zhang, T. He, C. Fang, N. Q. V. Hung, X. Hao, and H. Yin, “Circle: Continual repair across programming languages,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 678–690.
  • [43] D. Sobania, M. Briesch, C. Hanna, and J. Petke, “An analysis of the automatic bug fixing performance of chatgpt,” arXiv preprint arXiv:2301.08653, 2023.
  • [44] Z. Fan, X. Gao, M. Mirchev, A. Roychoudhury, and S. H. Tan, “Automated repair of programs from large language models,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE).   IEEE, 2023, pp. 1469–1481.
  • [45] C. Liu, P. Cetin, Y. Patodia, S. Chakraborty, Y. Ding, and B. Ray, “Automated code editing with search-generate-modify,” arXiv preprint arXiv:2306.06490, 2023.
  • [46] M. M. A. Haque, W. U. Ahmad, I. Lourentzou, and C. Brown, “Fixeval: Execution-based evaluation of program fixes for programming problems,” in 2023 IEEE/ACM International Workshop on Automated Program Repair (APR).   IEEE, 2023, pp. 11–18.
  • [47] Y. Wu, N. Jiang, H. V. Pham, T. Lutellier, J. Davis, L. Tan, P. Babkin, and S. Shah, “How effective are neural networks for fixing security vulnerabilities,” arXiv preprint arXiv:2305.18607, 2023.
  • [48] Q. Zhang, C. Fang, B. Yu, W. Sun, T. Zhang, and Z. Chen, “Pre-trained model-based automated software vulnerability repair: How far are we?” IEEE Transactions on Dependable and Secure Computing, 2023.
  • [49] E. Mashhadi and H. Hemmati, “Applying codebert for automated program repair of java simple bugs,” in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR).   IEEE, 2021, pp. 505–509.
  • [50] A. Zirak and H. Hemmati, “Improving automated program repair with domain adaptation,” ACM Transactions on Software Engineering and Methodology, 2022.
  • [51] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis, “Incoder: A generative model for code infilling and synthesis,” arXiv preprint arXiv:2204.05999, 2022.
  • [52] K. Huang, X. Meng, J. Zhang, Y. Liu, W. Wang, S. Li, and Y. Zhang, “An empirical study on fine-tuning large language models of code for automated program repair,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE).   IEEE Computer Society, 2023, pp. 1162–1174.
  • [53] J. Zhang, R. Krishna, A. H. Awadallah, and C. Wang, “Ecoassistant: Using llm assistant more affordably and accurately,” arXiv preprint arXiv:2310.03046, 2023.
  • [54] M. A. Arefeen, B. Debnath, and S. Chakradhar, “Leancontext: Cost-efficient domain-specific question answering using llms,” arXiv preprint arXiv:2309.00841, 2023.
  • [55] J. Mu, X. L. Li, and N. Goodman, “Learning to compress prompts with gist tokens,” arXiv preprint arXiv:2304.08467, 2023.
  • [56] J. Lin, M. Diesendruck, L. Du, and R. Abraham, “Batchprompt: Accomplish more with less,” arXiv preprint arXiv:2309.00384, 2023.