9

I'm currently working on a project that requires the generation of complex JSON documents using Large Language Models (LLMs). These documents, representing rich text content, can contain various nested structures such as bullet points, multiple columns, images, videos, and embeds.

To illustrate, below is a simplified version of the JSON document that I am trying to generate:

{
  "type": "doc",
  "content": [
    {
      "type": "heading",
      "content": [
        {
          "text": "Lorem ipsum"
        }
      ]
    },
    {
      "type": "columns",
      "content": [
        {
          "type": "bulletList",
          "content": [
            {
              "type": "paragraph",
              "content": [
                {
                  "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Notably, the schema of the JSON document can be complex due to indefinite nesting (e.g., bullet lists within bullet lists). While a schema definition is challenging, I have a Node.js function available that validates whether a document is correct or not.

I've tried straightforward approaches, such as asking GPT-4 to generate the JSON directly and few-shot prompt engineering, but neither produced output conforming to my schema. I also explored solutions like zod-gpt (https://github.com/dzhng/zod-gpt), but it struggled with nested structures and failed to produce a valid JSON document.

How can I effectively use LLMs to generate these complex and nested JSON documents? Are there any strategies, libraries, or methodologies that are particularly effective for this kind of task?

1
  • 3
    Don't. You've got a JavaScript function that validates the input? Then it almost certainly contains a parser. Modify that parser to output the JSON. That's the right tool for the job.
    – Ray
    Commented Aug 1, 2023 at 18:43

10 Answers

5

Here are some strategies for generating complex and nested JSON documents using large language models:

  1. Fine-tune the model on a dataset of valid JSON examples: Train the LLM on a diverse dataset of JSON documents that match your target schema. This allows the model to learn the syntactic patterns and valid nesting structures. You can generate the training data synthetically or use real-world examples.

  2. Context-aware prompt: Create a prompt that provides the model with context about the JSON structure and schema. For example:

    "Generate a JSON document representing a blog post that conforms to this schema.

    {
      "title": string,
      "content": array of
      {
        "type": "paragraph"|"image"|"embed",
        "text": string
      }
    }
    

    JSON:[Your_JSON_Example_Format_goes_here]"

  3. Hierarchical decoding: Break generation into multiple steps - first generate the top-level blocks, then recursively generate content for each block. This allows you to validate at each step.

In addition to points 2 and 3, consider constrained decoding: at each generation step, validate the output and reject samples that don't match the expected schema. Only allow the model to continue generating from valid prefixes.


Expanding on context-aware prompts and hierarchical decoding for generating complex nested JSON with large language models:

Here is an example prompt that provides more context:

"Generate a JSON document representing a blog post with the following structure:

  • It should have a 'title' key with a string value
  • The 'content' key should be an array
  • Each item in the 'content' array should be an object with a 'type' key that can be either 'paragraph', 'image', or 'embed'
  • Paragraph items should have a 'text' key with a string value
  • Image items should have 'src' and 'alt' keys with string values
  • Embed items should have 'url' and 'height' keys with integer values

The generated JSON should conform to this structure. Do not include any additional or invalid keys.

JSON: [Your_JSON_Example_Format_goes_here]"

This prompt provides clear constraints and examples that the LLM can follow to generate a valid nested structure.

For hierarchical decoding, you can split it into two stages:

  1. Generate the top-level keys and array/object structures:
{
  "title": "",
  "content": [] 
}
  2. Recursively generate the content of each item in the "content" array using the type-specific schemas.

This allows you to validate at each stage that the structure is correct before generating nested content.

You can continue prompting the model recursively for each sub-object, providing the schema for that level. This divide and conquer approach helps improve correctness.
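
Below is a minimal sketch of that divide-and-conquer loop in Python, assuming an OpenAI-style chat call; the generate_block helper and the validate callback are illustrative names, not part of any library:

import json
from openai import OpenAI

client = OpenAI()

def generate_block(schema_hint, validate, max_retries=3):
    """Ask the model for a single level of the document; re-prompt until it validates."""
    prompt = f"Return only a JSON object for this level of the document: {schema_hint}"
    for _ in range(max_retries):
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        try:
            block = json.loads(reply)
        except json.JSONDecodeError:
            continue
        if validate(block):  # caller-supplied structural check for this level
            return block
    raise ValueError("No valid block produced within the retry budget")

# Stage 1: top-level skeleton; stage 2: fill each content item with its own schema hint.
doc = generate_block('{"title": "", "content": []}', validate=lambda b: "title" in b and "content" in b)
doc["content"].append(
    generate_block('{"type": "paragraph", "text": ""}', validate=lambda b: b.get("type") == "paragraph")
)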

4

There are several ways to constrain LLM generations. This is often done for text-to-code models, as there exists a concrete syntax/grammar to constrain your generations to.

The llama.cpp repo, for example, added an option to constrain generations to a specified grammar, limiting samples to ones that could result in a valid parse. They include examples for constraining to only valid JSON outputs.
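
As a rough illustration, here is a hedged sketch using the llama-cpp-python bindings (an assumption on my part, since the answer only mentions the C++ repo) together with the json.gbnf grammar file shipped in the llama.cpp repository; the model path is a placeholder:

from llama_cpp import Llama, LlamaGrammar

# Only token sequences that keep the output parseable as JSON are allowed during sampling.
grammar = LlamaGrammar.from_file("grammars/json.gbnf")
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")

output = llm(
    "Generate a JSON document with a heading and a bullet list:\n",
    grammar=grammar,
    max_tokens=512,
)
print(output["choices"][0]["text"])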

You could also look at PICARD, which constrains beam search to only those generations that could result in a valid parse. They specifically apply this to text-to-SQL generation.

A similar method is detailed in this paper.

4

If you're using OpenAI's chat completion APIs (GPT-3.5 and GPT-4), you can rely on function calling to have the model format its reply as specific JSON. You can prompt the model to give you a function call and use the params (and never actually make the call).

Here's more detail: https://dev.to/maximsaplin/openai-function-calling-to-enforce-reply-formatschema-23gi

UPD, March 2024

By the end of 2023 OpenAI introduced a few changes:

  1. Function calls have been deprecated in favour of tools/tool_choice (similar concept, different naming)
  2. Newer versions of GPT3.5 and GPT4 now support JSON mode
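
Here is a minimal sketch of that flow with the newer tools/tool_choice naming and the OpenAI Python SDK; the save_document function and its schema are illustrative and are never actually executed:

import json
from openai import OpenAI

client = OpenAI()

doc_schema = {  # JSON Schema describing the reply we want back as "function arguments"
    "type": "object",
    "properties": {
        "type": {"type": "string", "enum": ["doc"]},
        "content": {"type": "array", "items": {"type": "object"}},
    },
    "required": ["type", "content"],
}

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Generate a short document with one heading."}],
    tools=[{
        "type": "function",
        "function": {"name": "save_document", "parameters": doc_schema},
    }],
    # Force the model to "call" our function so its arguments follow doc_schema.
    tool_choice={"type": "function", "function": {"name": "save_document"}},
)

# We never execute save_document; its arguments string is the JSON we wanted.
doc = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
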
3

While a schema definition is challenging, I have a Node.js function available that validates whether a document is correct or not.

If it's feasible for you to define a function that determines what text would be valid next, you can use it to restrict the model to only being able to predict valid continuations.

With models runnable locally through HuggingFace Transformers, such as Meta's LLaMA 2, prefix_allowed_tokens_fn will determine (given the batch number and the tokens so far) which tokens should be allowed next. Something similar to the following sample should suffice:

def allowed_fn(b, ts):
    # b: batch index; ts: the token IDs generated so far
    if ts[-1] in whitespace_ids:
        # example constraint: after whitespace, only allow tokens whose text starts with "s"
        return begin_with_s_ids
    else:
        return all_allowed_ids

...

model.generate(input_ids=input_ids, prefix_allowed_tokens_fn=allowed_fn, num_beams=3, ...)

The tokenizer provides the functions you need for mapping between token IDs and their textual representations (e.g. tokenizer.convert_ids_to_tokens).
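
For example, a rough sketch (assuming a HuggingFace tokenizer already loaded as tokenizer) of building the ID sets used by allowed_fn above, by scanning the vocabulary once:

# Bucket token IDs by their textual form; "▁" (SentencePiece) and "Ġ" (BPE) mark a leading space.
all_allowed_ids = list(range(len(tokenizer)))
token_strings = tokenizer.convert_ids_to_tokens(all_allowed_ids)

whitespace_ids = [i for i, t in enumerate(token_strings) if t.strip("▁Ġ ") == ""]
begin_with_s_ids = [i for i, t in enumerate(token_strings) if t.lstrip("▁Ġ").startswith("s")]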

Through OpenAI's API the equivalent would be logit_bias, but I wouldn't recommend using a remote model like this, as it would require repeated back-and-forth between your code and the OpenAI API with one-token completions.


Alternatively, it may help to simplify the format to something the model can more easily produce, then parse it into your JSON format manually afterwards.

For example, markdown supports many of the features you want, is common enough for the model to have seen plenty of it, and extensions would likely be feasible with few shot prompting examples.
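
As a rough sketch of that approach, the following uses markdown-it-py (an assumption; any markdown parser with a token stream would do) to map markdown onto the nested doc format from the question:

from markdown_it import MarkdownIt

def markdown_to_doc(md_text):
    """Very rough sketch: map a markdown token stream onto the doc/heading/bulletList shape."""
    tokens = MarkdownIt().parse(md_text)
    content = []
    for i, tok in enumerate(tokens):
        if tok.type == "heading_open":
            # the inline token right after an *_open token holds the text
            content.append({"type": "heading", "content": [{"text": tokens[i + 1].content}]})
        elif tok.type == "bullet_list_open":
            content.append({"type": "bulletList", "content": []})
        elif tok.type == "inline" and content and content[-1]["type"] == "bulletList":
            content[-1]["content"].append({"type": "paragraph", "content": [{"text": tok.content}]})
    return {"type": "doc", "content": content}

print(markdown_to_doc("# Lorem ipsum\n\n- Lorem ipsum dolor sit amet."))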

2
  • This approach with prefix_allowed_tokens_fn seems really interesting, but I'm not sure I understand how to use it. What do b and ts contain? In the context of generating JSON, are {, ", and } considered tokens? Commented Aug 2, 2023 at 14:09
  • @LéonardHenriquez b is the batch number, ts is a list of the token IDs generated so far. You'd likely want to use the model's tokeniser to decode ts into plaintext. // "In the context of generating JSON, are {, ", and } considered tokens?" - Tokens are around 3-4 characters long (though less for special characters), so there'll be many tokens that contain {. You could list all tokens with something like tokenizer.convert_ids_to_tokens(list(range(32000))), then filter.
    – SirBenet
    Commented Aug 2, 2023 at 14:34
3

If one uses OpenAI GPT-4, it now has a JSON mode that ensures the model outputs valid JSON:

GPT-4 Turbo performs better than our previous models on tasks that require the careful following of instructions, such as generating specific formats (e.g., “always respond in XML”). It also supports our new JSON mode, which ensures the model will respond with valid JSON. The new API parameter response_format enables the model to constrain its output to generate a syntactically correct JSON object. JSON mode is useful for developers generating JSON in the Chat Completions API outside of function calling.

Documentation:

JSON mode


A common way to use Chat Completions is to instruct the model to always return JSON in some format that makes sense for your use case, by providing a system message. This works well, but occasionally the models may generate output that does not parse to valid JSON.

To prevent these errors and improve model performance, when calling gpt-4-1106-preview or gpt-3.5-turbo-1106, you can set response_format to { type: "json_object" } to enable JSON mode. When JSON mode is enabled, the model is constrained to only generate strings that parse into valid JSON.

Important notes:

  • To use JSON mode, your system message must instruct the model to produce JSON. To help ensure you don't forget, the API will throw an error if the string "JSON" does not appear in your system message.

  • The message the model returns may be partial (i.e. cut off) if finish_reason is length, which indicates the generation exceeded max_tokens or the conversation exceeded the token limit. To guard against this, check finish_reason before parsing the response.

  • JSON mode will not guarantee the output matches any specific schema, only that it is valid and parses without errors.
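
Putting these notes together, here is a minimal sketch of enabling JSON mode with the OpenAI Python SDK (the model name is just one example of a model that supports it):

import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_format={"type": "json_object"},
    messages=[
        # The system message must mention "JSON", otherwise the API raises an error.
        {"role": "system", "content": "You produce documents as a single JSON object."},
        {"role": "user", "content": "Generate a short document with a heading and a bullet list."},
    ],
)

choice = resp.choices[0]
if choice.finish_reason == "stop":  # guard against truncated, unparseable replies
    doc = json.loads(choice.message.content)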

2
  • 1
    Thanks for the update, but it doesn't really solve the problem: JSON mode ensures that the response is valid JSON, but it doesn't ensure that the response respects a given validation schema Commented Nov 7, 2023 at 16:00
  • @LéonardHenriquez I think the answer about function calling will do that. Maybe JSON mode can be used with function calling, but I haven't checked Commented Nov 22, 2023 at 22:44
2

Use an LLM Completion Generation Framework

Take a look at Completion Generation Frameworks such as Guidance (created by Microsoft).

import time
import guidance
from guidance import gen, select

# llama2 below is a guidance model object loaded beforehand,
# e.g. guidance.models.LlamaCpp("<path to weights>")

@guidance
def character_maker(lm, id, description, valid_weapons):
    lm += f"""\
    The following is a character profile for an RPG game in JSON format.
    ```json
    {{
        "id": "{id}",
        "description": "{description}",
        "name": "{gen('name', stop='"')}",
        "age": {gen('age', regex='[0-9]+', stop=',')},
        "armor": "{select(options=['leather', 'chainmail', 'plate'], name='armor')}",
        "weapon": "{select(options=valid_weapons, name='weapon')}",
        "class": "{gen('class', stop='"')}",
        "mantra": "{gen('mantra', stop='"')}",
        "strength": {gen('strength', regex='[0-9]+', stop=',')},
        "items": ["{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}"]
    }}```"""
    return lm
a = time.time()
lm = llama2 + character_maker(1, 'A nimble fighter', ['axe', 'sword', 'bow'])
time.time() - a

A side bonus is that the framework does not ask the LLM to generate the tokens for the JSON structure itself, which guarantees valid JSON output every time. Note that generated tokens are marked in green.

[Screenshot: notebook output with the model-generated tokens highlighted in green]

EDIT

There are other frameworks as well, but they may or may not be suited to this use case, and they have varying levels of community support.

1

I recently had this issue on a project I was working on. Even though I was using GPT-3.5 and other GPT-X models, it really came down to the prompt. The approach I took was, as mentioned by @PriNova, to be context-aware; however, I also used JSON Schema to keep the structure of my JSON intact. You can read or learn more here -> What is JSON Schema?

Based on your needs, scenarios such as indefinite nesting can be expressed in the schema definition; elaborating on those scenarios further in the constraints section makes the generated output more specific.

I also drew on the OpenAI GPT Best Practices Guide, which explores prompt-engineering recommendations that, in my opinion, carry over to other LLMs, as it is purely about prompt engineering even though it is specific to the GPT-X family.

You can use a prompt that takes the following format:

TEMPLATE

[CONTEXT]
<THE ASK OR INSTRUCTIONS HERE> JSON format based on the SCHEMA defined below: 

[CONSTRAINTS]

1. Constraint one 
2. Constraint two
3. <data or model name that> should be returned in json format based on the following schema definition:


[SCHEMA]:
<JSON Schema definition rules for how the output should look>
{
 type: "object",
 properties: {
    content: {
      type: string,
      required: true,
    },
    instructions: {
      type: string,
      required: true,
    },
    questionType: {
      type: string,
      required: true,
      enum: [
        "multipleChoice",
        "trueFalse",
        "shortAnswer",
        "selectMany",
        "selectOne",
        "essay",
      ],
      default: "multipleChoice",
    },
    options: {
      type: "array",
      items: {
        type: "string"
      },
      required: false,
    },
    tags: {
      type: "array",
      items: {
        type: "string"
      },
      required: true,
    },
 }
}

[EXAMPLE]:
<example of the output of the final schema scenarios here>
    [{
      content: "Will these need to be optimized for search?",
      instructions: "Select one of the answers from the options given",
      questionType: "trueFalse",
      options: ["yes", "no"],
      tags: ["scope", "specs", "fixed", "seo"],
   }]


[AI ANSWER]:

WORKING EXAMPLE

[CONTEXT]
Brainstorm 100 questions for Shopping Cart requirements gathering in JSON format based on the SCHEMA defined below: 

[CONSTRAINTS]

1. Questions should be phrased as they would be asked of a potential client during requirements gathering in a software project.
2. DO NOT STOP UNTIL ALL 100 QUESTIONS ARE COMPLETE.
3. Questions should be varied given the questionType listed in the schema.
4. Questions should be returned in json format based on the following schema definition:


[SCHEMA]:

{
 type: "object",
 properties: {
    content: {
      type: string,
      required: true,
    },
    instructions: {
      type: string,
      required: true,
    },
    questionType: {
      type: string,
      required: true,
      enum: [
        "multipleChoice",
        "trueFalse",
        "shortAnswer",
        "selectMany",
        "selectOne",
        "essay",
      ],
      default: "multipleChoice",
    },
    options: {
      type: "array",
      items: {
        type: "string"
      },
      required: false,
    },
    tags: {
      type: "array",
      items: {
        type: "string"
      },
      required: true,
    },
 }
}

[EXAMPLE]:
    [{
      content: "Will these need to be optimized for search?",
      instructions: "Select one of the answers from the options given",
      questionType: "trueFalse",
      options: ["yes", "no"],
      tags: ["scope", "specs", "fixed", "seo"],
   }]


[AI ANSWER]:

Hope this helps as much as it helped me; I just thought I would share the approach, and it might give an additional point of reference alongside the answers already given.
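
As a final note on the validation side: since prompting alone doesn't guarantee schema conformance, here is a minimal sketch (assuming the jsonschema Python package and a strict JSON Schema rendering of the definition above) that checks each generated question before accepting it:

import json
from jsonschema import validate, ValidationError

question_schema = {
    "type": "object",
    "properties": {
        "content": {"type": "string"},
        "instructions": {"type": "string"},
        "questionType": {
            "type": "string",
            "enum": ["multipleChoice", "trueFalse", "shortAnswer", "selectMany", "selectOne", "essay"],
        },
        "options": {"type": "array", "items": {"type": "string"}},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["content", "instructions", "questionType", "tags"],
}

def check_reply(reply_text):
    """Return the parsed list of questions if every item matches the schema, else None."""
    try:
        questions = json.loads(reply_text)
        for q in questions:
            validate(instance=q, schema=question_schema)
        return questions
    except (json.JSONDecodeError, ValidationError):
        return None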

0

I think you should try few-shot prompt engineering: instruct the model with a few examples in a "Question, Instruction, Answer" format and experiment a bit to evaluate what works best.

1
  • Tried it but it doesn't work very well because there are too many possibilities Commented Aug 1, 2023 at 11:25
0

If you are willing to use open-source models, you can use the Outlines library to force the model to generate valid JSON. You define the expected structure of the output using a Pydantic model, and the library guarantees that the output will follow this structure, as in the following example:

from enum import Enum
from pydantic import BaseModel, constr

import outlines.models as models
import outlines.text.generate as generate


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"


class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    weapon: Weapon
    strength: int


model = models.transformers("gpt2", device="cuda")  # load any HuggingFace Transformers model
generator = generate.json(model, Character, max_tokens=100)  # constrain generation to the Character schema
sequence = generator("Give me a character description")

There is another example in the documentation. From experience you don't need to prompt models with the exact structure, only with the field names, but your mileage may vary.

0

I would also add TypeChat to the list of tools that help generate structured output such as JSON.

TypeChat also supports validating the LLM responses against the defined Schema.

Read more about TypeChat here
