Extracting Structured Data from Images Using OpenAI’s gpt-4-vision and Jason Liu’s Instructor… (2024)

Michael Fox

As you are probably aware, OpenAI has just released a new model that can extract information from image data. The model is still under development and far from perfect, but even in its current form it is clearly a powerful capability with many uses. Given an image and a simple prompt like ‘What’s in this image?’, passed to chat completions, the gpt-4-vision-preview model can extract a wealth of detail about the image in text form. However, that text is not consistently formatted; in other words, it is “unstructured”. To make gpt-4-vision more useful, I’ve combined it with the Instructor patch to the OpenAI API. With it, an OpenAI model can be used a second time to turn this unstructured text into orderly Pydantic models.

Here are the imports we will need to make this work.

import base64
import os
from openai import OpenAI # openai version 1.1.1
import instructor
import requests
from pydantic import BaseModel

Here is a PNG of an invoice I made up in LibreOffice Calc. You can try this with the same image or substitute an image of your own.

In order to pass the image to the OpenAI chat completions API, we will encode our image using base64.

def encode_image(image_path):
    # Read the file as bytes and return its base64 encoding as a string
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "sample_invoice_1.png"
base64_image = encode_image(image_path)
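The request that follows embeds the image as a data URL, and a data URL carries a MIME type. Rather than hard-coding that type, it can be derived from the file name. Here is a minimal sketch using only the standard library; `to_data_url` is a hypothetical helper name of my own, not part of the OpenAI client.

```python
import base64
import mimetypes


def to_data_url(image_path):
    """Return a base64 data URL for the image file at image_path."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png"
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could be used directly as the "url" value of an "image_url" content part.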

The encoded image is then posted to the chat/completions endpoint along with our OPENAI_API_KEY and the prompt ‘What’s in this image?’.

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"
}

payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What’s in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # The sample invoice is a PNG, so label it as one
                        "url": f"data:image/png;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

The API responds with the following JSON. The content field holds a detailed description of the image.

{'id': 'chatcmpl-8JnuAvFSmERvLF8L1kQ5869nV41OE', 'object': 'chat.completion', 'created': 1699730758, 'model': 'gpt-4-1106-vision-preview', 'usage': {'prompt_tokens': 778, 'completion_tokens': 296, 'total_tokens': 1074}, 'choices': [{'message': {'role': 'assistant', 'content': 'The image shows a template or sample of an invoice document. The top section of the invoice has a placeholder text "LOGO GOES HERE" indicating where a company\'s logo should be placed. Below that, the company\'s address and contact information, including a phone number, email, and website, are listed.\n\nThe invoice is dated "November 11, 2023," and has an invoice number "# 9999999." It includes billing information addressed to "Fred Davis" at a specific address, as well as service information provided to "Tampa Aquarium" at a different address.\n\nThe table in the center of the invoice lists three line items with "Product ID," "Description," "Qty/Hours," "Unit Price," and "Line Total." The products or services listed include "Service," "Parts AAA," and "Parts BBB," with their respective quantities, prices, and totals.\n\nAt the bottom right, there is a financial summary including a subtotal, a discount percentage (which is 0% in this example), the discount amount, the total amount, the paid amount (which is 0.00 indicating that nothing has been paid yet), and the total due, which is 400.00.\n\nThe bottom of the invoice has a notes section for additional information and a statement of gratitude that reads "THANK YOU FOR YOUR BUSINESS!"\n\nThis image is typically used for illustrative purposes to show what an invoice might look like and how it is structured.'}, 'finish_details': {'type': 'stop', 'stop': '<|fim_suffix|>'}, 'index': 0}]}
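Before relying on the reply, it is worth pulling the text out defensively, since a failed request returns an error object instead of choices. The following is a minimal sketch, assuming a response shaped like the one above; `extract_content` is a hypothetical helper of my own, and the sample dict is hand-written for illustration.

```python
def extract_content(response_json):
    """Return the assistant's text from a chat completions response dict."""
    if "error" in response_json:
        # Failed requests come back as {"error": {"message": ...}}
        message = response_json["error"].get("message", "unknown API error")
        raise RuntimeError(message)
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample in the same shape as the real response
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "The image shows an invoice."},
            "index": 0,
        }
    ]
}
print(extract_content(sample))
```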

Now the task is to convert this unstructured content into structured data we can use. For this, we will use the jxnl/instructor library, which you will find on GitHub at https://github.com/jxnl/instructor

This library patches the OpenAI ChatCompletions class to allow prompt-guided mapping of unstructured text to a Pydantic model. For this to work, it is important to use a compatible version of the OpenAI library. I downgraded slightly to 1.1.1 and everything worked fine.

pip install --force-reinstall -v "openai==1.1.1"

Next, patch the library with this statement.

client = instructor.patch(OpenAI())

Define the Pydantic model

class OrderDetail(BaseModel):
    InvoiceNumber: str
    CustomerName: str
    Amount: float

and pass the content returned before to gpt-4 along with a prompt detailing the information we want to extract.

response_json = response.json()
content = response_json['choices'][0]['message']['content']

order_detail = client.chat.completions.create(
    model="gpt-4",
    response_model=OrderDetail,
    messages=[
        {"role": "user", "content": "Extract invoice number, customer name (The entity billed for the product or service) and order total amount from the following purchase order description json:" + content},
    ]
)

Using the prompt above, gpt-4 is able to extract the information we need to populate the fields of the model. Note the extra hint I provided about the customer name (‘The entity billed for the product or service’), as there was more than one named entity on the invoice. The prompt successfully guided the model to select the correct one.

Finally, we print the model that is returned.

print(order_detail)

InvoiceNumber='# 9999999' CustomerName='Fred Davis' Amount=400.0

Not bad.
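Part of why this works so cleanly is that the Pydantic model validates whatever comes back. The sketch below shows that step in isolation, with no API call. The input dict is hand-written, hypothetical data mirroring the sample invoice, not actual model output, and the example assumes Pydantic is installed.

```python
from pydantic import BaseModel, ValidationError


class OrderDetail(BaseModel):
    InvoiceNumber: str
    CustomerName: str
    Amount: float


# Hypothetical values, written by hand to mirror the sample invoice
raw = {"InvoiceNumber": "# 9999999", "CustomerName": "Fred Davis", "Amount": "400.00"}

detail = OrderDetail(**raw)
print(detail.Amount)  # the string "400.00" has been coerced to a float

# A missing field is rejected instead of silently passing through
try:
    OrderDetail(InvoiceNumber="# 9999999", CustomerName="Fred Davis")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

The same checks apply to any data used to build the model, which is what makes the typed result safer to pass on than the free-text description.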