Automating Invoice Management with OCR + AI

 Managing operational expenses such as utility bills for water, electricity, and gas is often a time-consuming and costly task. Companies that handle multiple projects or properties receive thousands of invoices on a regular basis. The data from these documents needs to be extracted, processed, and recorded in internal systems. 

Traditionally, this process has been done manually, creating a bottleneck that consumes valuable time and resources.

Beyond cost and inefficiency, manual data entry, especially from scanned invoices, introduces a significant risk: human error. Visual fatigue, repetitive tasks, and variability in document format and quality all increase the likelihood of transcription mistakes, which can lead to accounting discrepancies, incorrect payments, and audit issues.

The challenge

In our case, the situation is even more complex because most invoices are scanned documents. Unlike native digital files, scanned invoices pose additional challenges for Optical Character Recognition (OCR) technologies:

  • Variable image quality: low resolution, compression, noise, smudges, folds.
  • Distortions: skewed text, perspective issues, cut-off edges.
  • Contrast and lighting problems: poorly legible text.
  • Format and font variability: different vendors use diverse designs and typography.
  • Artifacts: stamps, watermarks, or handwritten notes interfering with the text.

These issues make it significantly harder to accurately extract data using automated methods.

OCR + AI

OCR model research and selection

Recognizing the scope and technical complexity of the problem, we conducted an in-depth review of the latest OCR and AI-based document processing models. Our goal was to identify the most robust and accurate solution capable of handling complex scanned invoices.

We evaluated leading vision-language models, including:

  • Claude 3.5 Sonnet (Anthropic)
  • Google AI’s Gemini 1.5 Pro
  • Google AI’s Gemini 2.0 Flash
  • Various Mistral model variants

 

We tested these models against a representative dataset of scanned invoices from different providers. Some models, such as certain Mistral variants, produced an unacceptably high error rate, confusing similar digits (e.g., ‘5’ and ‘9’) or letters (‘R’ and ‘F’), which made them unsuitable for sensitive financial data extraction.

When we compared their performance on extracting key information (vendor names, account numbers, dates, amounts, and addresses) from complex scanned documents, Gemini 1.5 Pro and Claude 3.5 Sonnet emerged as the most promising options. Both demonstrated strong contextual understanding and reliability, even under suboptimal conditions.
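A comparison like this can be scored per field. The sketch below is only an illustration of one way to do it, not the scoring we describe above: the function name `field_accuracy`, the field names, and the sample records are all hypothetical. It counts exact matches per field, so a single misread digit counts as a miss.

```python
from collections import defaultdict


def field_accuracy(predictions, ground_truth, fields):
    """Return the fraction of exact matches per field across all invoices."""
    total = len(ground_truth)
    if total == 0:
        return {field: 0.0 for field in fields}
    correct = defaultdict(int)
    for pred, truth in zip(predictions, ground_truth):
        for field in fields:
            # Compare after trimming whitespace; one wrong digit is a miss.
            if str(pred.get(field, "")).strip() == str(truth.get(field, "")).strip():
                correct[field] += 1
    return {field: correct[field] / total for field in fields}


# Hypothetical hand-labelled invoices and one model's output for them.
truth = [
    {"account_number": "5501-229", "amount": "145.90"},
    {"account_number": "7730-118", "amount": "89.15"},
]
model_output = [
    {"account_number": "5501-229", "amount": "145.90"},
    {"account_number": "7730-118", "amount": "85.19"},  # digits misread
]

scores = field_accuracy(model_output, truth, ["account_number", "amount"])
print(scores)  # {'account_number': 1.0, 'amount': 0.5}
```

Exact matching is deliberately strict: for financial fields, a value that is almost right is still wrong.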

Why Gemini 1.5 Pro for invoice analysis?

We ultimately selected Gemini 1.5 Pro for its balance of performance and cost. While Claude 3.5 Sonnet delivered comparable results, its token-based pricing was significantly more expensive at the processing scale we required. Gemini 1.5 Pro, with its large context window, enabled efficient processing of long documents in a single pass and offered more favorable economics. (Note: Pricing models should be verified with the provider’s current documentation, but historically Gemini 1.5 Pro has offered better cost-efficiency for high-volume, large-context tasks.)

The solution: Architecture and workflow

To address this challenge, we built a robust automated solution using a combination of Python for low-level processing and n8n for workflow orchestration. The process includes the following stages:

1. Document ingestion and storage:

  • Entry point: Invoices in PDF format are submitted via a Telegram bot, providing users with a simple and accessible interface.
  • Centralized storage: The Telegram bot uploads these PDFs to a structured folder system in OneDrive, which not only serves as input for automated processing but also creates a searchable, digitized archive for audits.

 

2. Pre-processing with Python:

  • A Python script monitors the OneDrive folder and processes incoming PDFs.
  • It identifies individual documents (in case one file contains multiple invoices) and assigns a unique ID to each invoice for tracking across the workflow.
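The folder-scanning and ID-assignment part of this stage can be sketched as follows. This is a simplified illustration under stated assumptions: it reads a locally synced folder rather than calling the OneDrive API, it assigns one ID per file (splitting a file into several invoices is left out), and the names `file_id` and `find_new_pdfs` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_id(pdf_path: Path) -> str:
    """Derive a stable ID from the file contents, so a re-upload maps to the same ID."""
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    return f"inv-{digest[:12]}"


def find_new_pdfs(folder: Path, seen_ids: set) -> dict:
    """Return {id: path} for PDFs in `folder` that have not been processed yet."""
    new_files = {}
    for pdf_path in sorted(folder.glob("*.pdf")):
        current_id = file_id(pdf_path)
        if current_id not in seen_ids:
            seen_ids.add(current_id)  # remember it so the next scan skips this file
            new_files[current_id] = pdf_path
    return new_files
```

Calling `find_new_pdfs(Path("invoices"), seen)` on a schedule, with `seen` persisted between runs, would give a basic polling monitor. A content hash is used here instead of a random UUID so that the same PDF uploaded twice is not processed twice; that is a design choice of this sketch, not something the pipeline requires.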

 

3. Preparing for multimodal processing (Base64):

  • To leverage Gemini 1.5 Pro’s multimodal capabilities (handling both text and images), image data must be sent along with the prompt via API.
  • The Python script (or an earlier step in n8n) therefore converts each invoice page image to its Base64 representation, a standard way of encoding binary data as ASCII text so that it can be embedded in an HTTP request.
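The encoding step itself needs only Python's standard library. In this minimal sketch the function name `encode_page` is hypothetical, and the sample bytes are just the 8-byte PNG file signature standing in for a real page image.

```python
import base64


def encode_page(image_bytes: bytes) -> str:
    """Encode raw image bytes as a Base64 ASCII string."""
    return base64.b64encode(image_bytes).decode("ascii")


# Stand-in data: the 8-byte PNG signature rather than a full page image.
page_bytes = b"\x89PNG\r\n\x1a\n"
encoded = encode_page(page_bytes)

# Decoding must give back exactly the original bytes.
assert base64.b64decode(encoded) == page_bytes
print(encoded)  # iVBORw0KGgo=
```

Base64 turns every 3 bytes into 4 characters, so the encoded payload is roughly a third larger than the original image. That is worth keeping in mind when sending many high-resolution pages in one request.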

 

4. Data extraction with Gemini 1.5 Pro:

  • The Base64-encoded images, along with a carefully crafted prompt, are sent to the Gemini API.
  • Success in this stage hinges on prompt engineering: the model is not simply asked to extract fields, but is guided to understand the document structure, handle multi-page invoices, and normalize and validate the data.
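As a rough illustration, a request body that combines the prompt with the encoded pages might be assembled like this. The `contents` / `parts` / `inline_data` layout follows our reading of the public Gemini REST documentation and should be checked against the current reference; the prompt text, the PNG MIME type, and the function name `build_request_body` are placeholders, and the sketch stops short of actually sending the request.

```python
import base64
import json

# Placeholder prompt; the real one is far more detailed.
PROMPT = "Extract every invoice in these pages as a JSON array."


def build_request_body(prompt: str, page_images: list) -> dict:
    """Build a request body with one text part plus one inline image per page."""
    parts = [{"text": prompt}]
    for image_bytes in page_images:
        parts.append({
            "inline_data": {
                "mime_type": "image/png",  # assumes pages were rendered as PNG
                "data": base64.b64encode(image_bytes).decode("ascii"),
            }
        })
    return {"contents": [{"parts": parts}]}


# Two stand-in "pages"; real input would be the rendered page images.
body = build_request_body(PROMPT, [b"page-1-bytes", b"page-2-bytes"])
payload = json.dumps(body)  # the string that would be sent as the HTTP body
print(len(body["contents"][0]["parts"]))  # 3
```

Sending all pages of a file in one request gives the model the context it needs to decide which pages belong to the same invoice.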

 

The prompt

  • The prompt is a key component and reflects best practices for guiding a multimodal model in structured extraction. It includes:

    • Multi-page document handling: Instructions for grouping pages logically into single invoices based on account numbers, dates, etc.
    • Data normalization: For example, standardizing account numbers by removing spaces before comparison.
    • New invoice criteria: Clear logic for when a new invoice begins (change in account number, date, or vendor).
    • Context-based classification: Detecting service type (water, electricity, gas) from document cues (units, section headers, logos).
    • Accurate vendor identification: Clarifying that the vendor is the issuer, not the customer, with guidance on where to find this information.
    • Structured address extraction: Splitting address into line, city, ZIP code, etc., and including unit/apartment details.
    • Special formatting: Handling of negative values or uncommon amount formats.
    • Avoiding false splits: Preventing mis-segmentation due to multiple meters or service addresses under one account.
    • Strict output format: Requiring valid JSON arrays with one object per invoice to facilitate downstream automation.
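Some of these rules can also be written as deterministic code, for example to double-check the model's output after the fact. The sketch below is illustrative only and is not the prompt itself: the function names are hypothetical, and the amount parser assumes US-style formatting (comma as thousands separator, with negatives written as "(45.10)", "45.10-", "-45.10" or "45.10 CR").

```python
import json
import re
from decimal import Decimal


def normalize_account(raw: str) -> str:
    """Remove all whitespace so '1234 5678' and '12345678' compare equal."""
    return re.sub(r"\s+", "", raw)


def parse_amount(raw: str) -> Decimal:
    """Parse an amount string, treating parentheses, a minus sign or 'CR' as negative."""
    text = raw.strip().replace("$", "").replace(",", "")
    negative = False
    if text.startswith("(") and text.endswith(")"):
        negative, text = True, text[1:-1]
    elif text.upper().endswith("CR"):
        negative, text = True, text[:-2]
    elif text.endswith("-"):
        negative, text = True, text[:-1]
    elif text.startswith("-"):
        negative, text = True, text[1:]
    value = Decimal(text.strip())
    return -value if negative else value


def parse_model_output(raw_json: str) -> list:
    """Accept only a JSON array of objects, one object per invoice."""
    data = json.loads(raw_json)
    if not isinstance(data, list) or not all(isinstance(item, dict) for item in data):
        raise ValueError("expected a JSON array with one object per invoice")
    return data
```

For instance, `normalize_account("1234 5678 90")` returns `"1234567890"` and `parse_amount("(45.10)")` returns `Decimal("-45.10")`. `Decimal` is used instead of `float` so that monetary values are not affected by binary rounding.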

 

5. Advanced processing with n8n and AI agents:

  • Gemini’s structured JSON output is passed into an n8n workflow.

  • At this stage, a custom AI agent powered by GPT-4.1 (OpenAI) comes into play. Integrated directly within n8n, this agent acts as the operational brain of the system.

 

The importance of data validation

Validation is a critical integrity layer. Even with highly accurate extraction, there’s always a risk of anomalies. Validating data before database insertion ensures:

  • Compliance with database rules (data types, required fields).
  • Logical consistency for reporting, analysis, and operations.
  • Prevention of errors that could corrupt records or require manual fixes.
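A pre-insertion check along these lines could look like the following sketch. The required fields, the ISO date format, and the function name `validate_invoice` are assumptions made for illustration; the real rules depend on the target database schema.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Hypothetical schema: adjust to the columns the database actually requires.
REQUIRED_FIELDS = ("vendor", "account_number", "invoice_date", "amount")


def validate_invoice(record: dict) -> list:
    """Return a list of problems; an empty list means the record can be inserted."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing field: {field}")
    if record.get("invoice_date"):
        try:
            date.fromisoformat(str(record["invoice_date"]))
        except ValueError:
            problems.append("invoice_date is not an ISO date (YYYY-MM-DD)")
    if record.get("amount") not in (None, ""):
        try:
            # Reject NaN and Infinity as well as non-numeric text.
            if not Decimal(str(record["amount"])).is_finite():
                problems.append("amount is not a number")
        except InvalidOperation:
            problems.append("amount is not a number")
    return problems


record = {
    "vendor": "",
    "account_number": "123",
    "invoice_date": "2024-13-01",  # month 13 does not exist
    "amount": "abc",
}
print(validate_invoice(record))
```

Returning every problem at once, rather than stopping at the first, makes it easier to route a rejected invoice to manual review with a complete explanation.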

Project outcomes and benefits

This project showcases the transformative impact of combining advanced OCR, large language models (LLMs), and workflow automation. The construction company achieved:

  • Efficiency gains: Tasks that took days or weeks now run automatically in minutes or hours.
  • Significant cost savings: Human resources are freed from repetitive work for higher-value tasks.
  • Improved accuracy: Automation reduces human error, yielding more reliable financial data.
  • Structured digital archiving: OneDrive becomes a centralized, easily accessible invoice repository.
  • Scalability: The solution handles increasing invoice volumes without proportional resource growth.

 

The project's success stems not only from using cutting-edge AI models (Gemini 1.5 Pro for OCR, GPT-4.1 as an intelligent agent), but also from meticulous prompt engineering and a solid workflow architecture in n8n. Strategic use of techniques like Chain of Thought ("THINK") reasoning enhances the agent’s ability to handle complex scenarios, interact with external systems (via API), and validate critical data before proceeding.

In summary, we’ve transformed an inefficient, error-prone manual process into a smart, scalable digital workflow, demonstrating how AI can solve real-world operational challenges and drive tangible business value.

What's next?

If you have a similar idea you'd like to implement in your company, feel free to reach out. You can contact me at pablo@ideasforge.io
