7

I'm working on a project that requires generating structured data with Large Language Models (LLMs) like GPT-4.

I've encountered issues with the output, as the LLM sometimes produces invalid JSON files. For example, there might be missing closing braces or extra commas:

{
  "key": "value",

JSON appears to be error-prone, and fixing invalid JSON files can be non-trivial. This has led to challenges in working with LLMs, especially when complex JSON output is required.

Given this context, I'm looking for a file format that is less likely to become invalid due to small mistakes made by the LLM. Among TOML, YAML, JSON, XML, or others, which would be the most robust choice? Objective insights or evidence, such as benchmarks, would be particularly valuable.

4
  • 1
    All of the autonomous GPT projects use JSON, but I don't know if that's actually best.
    – endolith
    Commented Aug 2, 2023 at 15:27
  • 1
    Please describe what you consider the "most robust file format"? It would be greatly appreciated if you could share your personal experiences using various file formats. Add more context to help to understand why this question is of particular interest to this community instead of other communities like Stack Overflow, Software Engineering, Computer Science, Software Quality Assurance & Testing, etc.
    – Wicket
    Commented Aug 2, 2023 at 17:25
  • 1
    @Wicket I gave more context to my question but I feel it is in the right community. The best file format for LLM is very specific to how LLM work generating token by token. Commented Aug 2, 2023 at 19:25
  • Please add usage guidance to structured-data and json. (see What should a tag wiki excerpt contain?
    – Wicket
    Commented Aug 3, 2023 at 23:00

3 Answers 3

2

Use an LLM Completion Generation Framework

Take a look at Completion Generation Frameworks such as Guidance (created by Microsoft).

@guidance
def character_maker(lm, id, description, valid_weapons):
    lm += f"""\
    The following is a character profile for an RPG game in JSON format.
    ```json
    {{
        "id": "{id}",
        "description": "{description}",
        "name": "{gen('name', stop='"')}",
        "age": {gen('age', regex='[0-9]+', stop=',')},
        "armor": "{select(options=['leather', 'chainmail', 'plate'], name='armor')}",
        "weapon": "{select(options=valid_weapons, name='weapon')}",
        "class": "{gen('class', stop='"')}",
        "mantra": "{gen('mantra', stop='"')}",
        "strength": {gen('strength', regex='[0-9]+', stop=',')},
        "items": ["{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}"]
    }}```"""
    return lm
a = time.time()
lm = llama2 + character_maker(1, 'A nimble fighter', ['axe', 'sword', 'bow'])
time.time() - a

A side bonus is that the framework will not ask LLM to generate the tokens related to JSON structure itself, this guarantees a valid JSON output every time. Note that generated tokens are marked in green.

enter image description here

Frameworks available

I'll give a list of what I consider to be the top three LLM Completion Generation Frameworks as of today. Note that they have different strengths and tradeoffs.

All of these should support both fixed and open-ended ontologies.

There are some frameworks as well, but they may or may not be suited to this use case and have variable levels of community support:

3

Depending on which API you use or how you run your model, there are currently at least the following ways to generate structured output as valid JSON:

There are also strategies for what to do if the generated result is not in the right format:

  • Use of a more robust, fault-tolerant parser that works even when there are minor syntax errors
  • Redoing the generation, or better a follow-up prompt telling that the format is wrong and asking to correct the format or redo it in the correct format, ideally with one or more (additional) examples of what the format should look like

You might also try to improve your prompt, e.g. by providing more examples to avoid an output in the wrong format.

Or, instead of generating a complex JSON in one step, you could generate parts of it separately, and then assemble them into the complex JSON you need.

1

I'm working on a project that requires generating structured data with Large Language Models (LLMs) like GPT-4.

OpenAI GPT-4 now has a JSON mode that ensures the model outputs a valid JSON. You can find more info here.

1
  • Although GPT-4 now offer a specific feature for the JSON format, it doesn't really answer the question which is more "meta". Which format is the most robust one for generation with any LLM ? Commented Nov 7, 2023 at 16:03

Not the answer you're looking for? Browse other questions tagged or ask your own question.