Multimodal Language Models in Action

Invoice or Bill Custom Parsing using Instructor (Pydantic Extension), OpenAI’s GPT-4o & Prompt Engineering

Structured Data Extraction from Semi-Structured Images

Kaushik Shakkari
GoPenAI


Picture by the author: A pretty sunset near Golden Gate Park, Seattle, Washington (April 2023)

In my last article, I demonstrated how OpenAI’s GPT-3.5 model with the Kor library can be utilized for structured data extraction from unstructured or semi-structured data like PDFs. While this approach is powerful, it is limited to processing textual data alone. Recently, multimodal models, such as OpenAI’s GPT-4o, have emerged, capable of processing both text and visual data. These models, trained on large datasets of image-text pairs, understand the correlations between visual cues and textual descriptions. This allows them to comprehend semantic content, interpret visual layouts, and reason about the relationships between visual elements and text, thereby broadening the scope of data extraction tasks. In this article, I will showcase how GPT-4o can be used to extract structured data from images and documents, effectively handling complex layouts, handwritten text, and low-quality images without the need for explicit optical character recognition (OCR).

Code in Action:

Image invoice sample created by the author: download here

Let’s use the sample invoice provided. We aim to extract information such as the invoice number, item details (including description, quantity, unit price, and total), the billing address, and the total bill breakdown.

Link to execute the below code: colab notebook

Step 1: Load the invoice image file and encode it as a base64 string

Base64 is a widely used encoding scheme that allows binary data like images to be represented as text so the language model can take it as input.

import base64

resume_image_path = 'sample_invoice.png'

with open(resume_image_path, "rb") as image_file:
    encoded_resume_image_path = base64.b64encode(image_file.read()).decode('utf-8')
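Base64 encoding is lossless, so no image information is altered in transit. A quick stdlib sketch illustrating the round trip (the PNG magic bytes are used here as a stand-in for the real file contents):

```python
import base64

# A tiny in-memory payload standing in for the real invoice image bytes
# (in practice these come from reading sample_invoice.png).
image_bytes = b"\x89PNG\r\n\x1a\n"  # PNG magic header, used as a stand-in

encoded = base64.b64encode(image_bytes).decode("utf-8")
data_url = f"data:image/png;base64,{encoded}"

# base64 is lossless: decoding recovers the original bytes exactly
assert base64.b64decode(encoded) == image_bytes
```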

Step 2: Install and import relevant packages for extraction

We mainly use the OpenAI and Instructor packages. Instructor is a Python library designed to simplify working with structured outputs from large language models (LLMs). It is built on Pydantic and offers a straightforward, transparent, and user-friendly API for handling validation, retries, and streaming responses, making it easy to build robust LLM workflows.

# the v1 OpenAI SDK is required for the OpenAI client class used below;
# the legacy 0.x releases (e.g., 0.27.8) will not work with instructor 1.x
!pip install "openai>=1.1,<2"

# instructor wraps the OpenAI client so responses are validated Pydantic models
!pip install instructor==1.3.2

# replace '' with your OpenAI API key
# you can create and view your API keys at https://platform.openai.com/account/api-keys
openai_api_key = ''
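As a safer alternative to pasting the key directly into the notebook, you can read it from an environment variable. A minimal sketch, assuming the key has been exported as `OPENAI_API_KEY` beforehand:

```python
import os

# Prefer reading the key from an environment variable over hardcoding it
openai_api_key = os.environ.get("OPENAI_API_KEY", "")

# Fail early with a clear message instead of a confusing API error later
if not openai_api_key:
    print("Warning: OPENAI_API_KEY is not set; API calls will fail.")
```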

from typing import List
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

Step 3: Define Pydantic classes for extraction

These classes will help in structuring the extracted data from the invoice.

class Address(BaseModel):
    name: str = Field(description="the name of the person or organization")
    address_line: str = Field(description="the local delivery information such as street, building number, PO box, or apartment portion of a postal address")
    city: str = Field(description="the city portion of the address")
    state_province_code: str = Field(description="the two-letter code for the US state or province")
    postal_code: int = Field(description="the postal code portion of the address")

class Product(BaseModel):
    product_description: str = Field(description="the description of the product or service")
    count: int = Field(description="number of units bought for the product")
    unit_item_price: float = Field(description="price per unit")
    product_total_price: float = Field(description="the total price, which is number of units * unit_price")

class TotalBill(BaseModel):
    total: float = Field(description="the total amount before discount, tax, and delivery charges")
    discount_amount: float = Field(description="discount amount, computed as total * discount %")
    tax_amount: float = Field(description="tax amount, computed as tax_percentage * (total - discount_amount); if discount_amount is 0, then tax_percentage * total")
    delivery_charges: float = Field(description="the cost of shipping the products")
    final_total: float = Field(description="the final balance: total minus discount, plus tax and delivery charges")

class Invoice(BaseModel):
    invoice_number: str = Field(description="the unique identifier or number of the invoice")
    billing_address: Address = Field(description="the address where the bill for a product or service is sent so it can be paid by the recipient")
    product: List[Product] = Field(description="the list of line items on the bill")
    total_bill: TotalBill = Field(description="the details of the total amount, discounts, and tax")
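Because these are ordinary Pydantic models, they validate and coerce data on their own, which is exactly the mechanism Instructor relies on to retry when the model returns malformed output. A self-contained sketch, re-declaring `Product` so it runs standalone:

```python
from pydantic import BaseModel, Field, ValidationError

# Re-declared here so the sketch is self-contained
class Product(BaseModel):
    product_description: str = Field(description="the description of the product or service")
    count: int = Field(description="number of units bought for the product")
    unit_item_price: float = Field(description="price per unit")
    product_total_price: float = Field(description="the total price, which is number of units * unit_price")

# Pydantic coerces compatible values ("2" -> 2) and rejects incompatible ones
item = Product(
    product_description="16-inch MacBook Pro - Space Gray",
    count="2",
    unit_item_price=3500,
    product_total_price=7000.0,
)
assert item.count == 2 and item.unit_item_price == 3500.0

# An incompatible value raises a ValidationError, which Instructor
# turns into a retry prompt for the model
try:
    Product(product_description="iPad", count="two",
            unit_item_price=1200.0, product_total_price=2400.0)
except ValidationError:
    print("invalid count rejected")
```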

Step 4: Define chat message and call GPT-4o

Here, we define the chat messages and call the GPT-4o model to extract data from the invoice image.

messages = [
    {
        "role": "user",
        "content": "Your goal is to extract structured information from the provided invoice"
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{encoded_resume_image_path}"
                }
            }
        ]
    }
]
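OpenAI’s vision input also accepts an optional `detail` field inside `image_url`. A sketch of the same image block requesting high-detail processing, which can help with dense or small-print invoices (the `encoded` string here is a placeholder for the real base64 payload):

```python
encoded = "iVBORw0KGgo="  # stand-in base64 string for the encoded invoice

# "high" asks the vision model to process the image at higher resolution;
# "auto" is the default, and "low" trades accuracy for fewer tokens.
image_message = {
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{encoded}",
                "detail": "high",
            },
        }
    ],
}

assert image_message["content"][0]["image_url"]["detail"] == "high"
```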

client = instructor.from_openai(OpenAI(api_key=openai_api_key))

response = client.chat.completions.create(
    model='gpt-4o',
    response_model=Invoice,
    messages=messages
)

result = response.model_dump_json(indent=2)

print(result)

Sample Output

{
  "invoice_number": "INV-28913",
  "billing_address": {
    "name": "Pepper Potts",
    "address_line": "Robo Street",
    "city": "Malibu",
    "state_province_code": "CA",
    "postal_code": 90265
  },
  "product": [
    {
      "product_description": "Lambda Scalar 4U AMD GLU",
      "count": 1,
      "unit_item_price": 160090.0,
      "product_total_price": 160090.0
    },
    {
      "product_description": "16-inch MacBook Pro - Space Gray",
      "count": 2,
      "unit_item_price": 3500.0,
      "product_total_price": 7000.0
    },
    {
      "product_description": "12.99 inch iPad Pro",
      "count": 2,
      "unit_item_price": 1200.0,
      "product_total_price": 2400.0
    },
    {
      "product_description": "2nd generation Apple Pencil",
      "count": 1,
      "unit_item_price": 130.0,
      "product_total_price": 130.0
    },
    {
      "product_description": "Space Gray AirPods Max",
      "count": 1,
      "unit_item_price": 550.0,
      "product_total_price": 550.0
    },
    {
      "product_description": "Service Fee",
      "count": 1,
      "unit_item_price": 250.0,
      "product_total_price": 250.0
    }
  ],
  "total_bill": {
    "total": 170420.0,
    "discount_amount": 0.0,
    "tax_amount": 15378.0,
    "delivery_charges": 100.0,
    "final_total": 153478.0
  }
}
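LLMs can make arithmetic slips, so it is prudent to verify the extracted numbers programmatically rather than trust them blindly. A stdlib-only sketch that cross-checks line-item arithmetic over the parsed output (a trimmed stand-in dict is used here in place of `json.loads(result)`):

```python
# A trimmed stand-in for json.loads(result); in practice, parse the model output
invoice = {
    "product": [
        {"count": 2, "unit_item_price": 3500.0, "product_total_price": 7000.0},
        {"count": 1, "unit_item_price": 250.0, "product_total_price": 250.0},
    ],
}

def inconsistent_line_items(inv: dict, tol: float = 0.01) -> list:
    """Return line items whose count * unit price disagrees with the stated total."""
    return [
        p
        for p in inv["product"]
        if abs(p["count"] * p["unit_item_price"] - p["product_total_price"]) > tol
    ]

assert inconsistent_line_items(invoice) == []  # all line items check out
```

Any item returned by this check can be flagged for human review or fed back to the model for a corrective pass.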

Conclusion:

In this article, we explored how multimodal language models like GPT-4o can be effectively used to extract structured data from images and documents. The integration of text and visual data processing capabilities enables these models to handle complex layouts and low-quality images without the need for OCR. By leveraging tools like OpenAI and the Instructor library, we can streamline the extraction process, enhancing efficiency and accuracy. This approach opens up new possibilities for automating data extraction tasks across various industries, providing a powerful solution for handling semi-structured data.

Kudos, you have learned how to custom parse!
Stay tuned for more articles on generative language modeling!!!

Add me on LinkedIn. Thank you!
