File format for generating error-free structured data with LLMs

Question

I'm working on a project that requires generating structured data with Large Language Models (LLMs) like GPT-4.

I've encountered issues with the output, as the LLM sometimes produces invalid JSON files. For example, there might be missing closing braces or extra commas:

{
  "key": "value",

JSON appears to be error-prone, and fixing invalid JSON files can be non-trivial. This has led to challenges in working with LLMs, especially when complex JSON output is required.

Given this context, I'm looking for a file format that is less likely to become invalid due to small mistakes made by the LLM. Among TOML, YAML, JSON, XML, or others, which would be the most robust choice? Objective insights or evidence, such as benchmarks, would be particularly valuable.

All of the autonomous GPT projects use JSON, but I don't know if that's actually best. — endolith, Commented Aug 2, 2023 at 15:27
Please describe what you consider the "most robust file format"? It would be greatly appreciated if you could share your personal experiences using various file formats. Add more context to help to understand why this question is of particular interest to this community instead of other communities like Stack Overflow, Software Engineering, Computer Science, Software Quality Assurance & Testing, etc. — Wicket, Commented Aug 2, 2023 at 17:25
@Wicket I gave more context to my question but I feel it is in the right community. The best file format for LLM is very specific to how LLM work generating token by token. — Léonard Henriquez, Commented Aug 2, 2023 at 19:25
Please add usage guidance to structured-data and json. (see What should a tag wiki excerpt contain? — Wicket, Commented Aug 3, 2023 at 23:00

Elijas Dapšauskas · Accepted Answer · 2024-06-18 19:58:50Z

Use an LLM Completion Generation Framework

Take a look at Completion Generation Frameworks such as Guidance (created by Microsoft).

@guidance
def character_maker(lm, id, description, valid_weapons):
    lm += f"""\
    The following is a character profile for an RPG game in JSON format.
    ```json
    {{
        "id": "{id}",
        "description": "{description}",
        "name": "{gen('name', stop='"')}",
        "age": {gen('age', regex='[0-9]+', stop=',')},
        "armor": "{select(options=['leather', 'chainmail', 'plate'], name='armor')}",
        "weapon": "{select(options=valid_weapons, name='weapon')}",
        "class": "{gen('class', stop='"')}",
        "mantra": "{gen('mantra', stop='"')}",
        "strength": {gen('strength', regex='[0-9]+', stop=',')},
        "items": ["{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}"]
    }}```"""
    return lm
a = time.time()
lm = llama2 + character_maker(1, 'A nimble fighter', ['axe', 'sword', 'bow'])
time.time() - a

A side bonus is that the framework will not ask LLM to generate the tokens related to JSON structure itself, this guarantees a valid JSON output every time. Note that generated tokens are marked in green.

Frameworks available

I'll give a list of what I consider to be the top three LLM Completion Generation Frameworks as of today. Note that they have different strengths and tradeoffs.

All of these should support both fixed and open-ended ontologies.

There are some frameworks as well, but they may or may not be suited to this use case and have variable levels of community support:

howlger · Accepted Answer · 2023-08-03 08:53:10Z

Depending on which API you use or how you run your model, there are currently at least the following ways to generate structured output as valid JSON:

The OpenAI API provides the so-called Function calling for that. See also:
- Abhinav Upadhyay's blog post A Tutorial Guide to Using The Function Call Feature of OpenAI's ChatGPT API
- Sam Witteveen's video OpenAI Functions + LangChain : Building a Multi Tool Agent
When running your model with llama.cpp you can use --grammar-file path/to/your_grammar_file.gbnf to specify a grammar file in a Backus-Naur form (BNF)-like syntax:
- For details see pull request #1773: llama : add grammar-based sampling
- For JSON see this JSON .gbnf sample grammar file.

There are also strategies for what to do if the generated result is not in the right format:

Use of a more robust, fault-tolerant parser that works even when there are minor syntax errors
Redoing the generation, or better a follow-up prompt telling that the format is wrong and asking to correct the format or redo it in the correct format, ideally with one or more (additional) examples of what the format should look like

You might also try to improve your prompt, e.g. by providing more examples to avoid an output in the wrong format.

Or, instead of generating a complex JSON in one step, you could generate parts of it separately, and then assemble them into the complex JSON you need.

Franck Dernoncourt · Accepted Answer · 2023-11-06 19:41:23Z

1

I'm working on a project that requires generating structured data with Large Language Models (LLMs) like GPT-4.

OpenAI GPT-4 now has a JSON mode that ensures the model outputs a valid JSON. You can find more info here.

answered Nov 6, 2023 at 19:41

Franck Dernoncourt

2,8333 silver badges23 bronze badges

Although GPT-4 now offer a specific feature for the JSON format, it doesn't really answer the question which is more "meta". Which format is the most robust one for generation with any LLM ?
– Léonard Henriquez
Commented Nov 7, 2023 at 16:03

Add a comment |

Stack Exchange Network

File format for generating error-free structured data with LLMs

3 Answers 3

Use an LLM Completion Generation Framework

Frameworks available

Not the answer you're looking for? Browse other questions tagged
chatgpt
prompt-design
llm
structured-data
json
or ask your own question.

Linked

Hot Network Questions

File format for generating error-free structured data with LLMs

3 Answers 3

Use an LLM Completion Generation Framework

Frameworks available

Not the answer you're looking for? Browse other questions tagged chatgptprompt-designllmstructured-datajson or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
chatgpt
prompt-design
llm
structured-data
json
or ask your own question.