OCR Backends¶

Kreuzberg supports multiple OCR (Optical Character Recognition) backends, giving you flexibility to choose the best option for your specific needs. Each backend has different strengths, language support, and installation requirements.

Supported Backends¶

1. Tesseract OCR¶

Tesseract OCR is the default OCR backend in Kreuzberg. It's a mature, open-source OCR engine with support for over 100 languages.

Installation Requirements:

Requires system-level installation
Minimum required version: Tesseract 5.0

Installation Instructions:

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
choco install -y tesseract

Language Support:

For languages other than English, install additional language packs:
- Ubuntu: sudo apt-get install tesseract-ocr-deu (for German)
- macOS: brew install tesseract-lang

Configuration:

from kreuzberg import extract_file, ExtractionConfig, TesseractConfig, PSMMode

result = await extract_file(
    "document.pdf",
    config=ExtractionConfig(
        ocr_backend="tesseract",  # This is the default
        ocr_config=TesseractConfig(language="eng+deu", psm=PSMMode.AUTO),  # English and German  # Page segmentation mode
    ),
)

2. EasyOCR¶

EasyOCR is a Python library that uses deep learning models for OCR. It supports over 80 languages and can be more accurate for certain scripts.

Installation Requirements:

Requires the easyocr optional dependency
Install with: pip install "kreuzberg[easyocr]"

GPU Support:

Experimental Feature

GPU support is not considered an official feature and might be subject to change or removal in future versions.

EasyOCR can use GPU acceleration when PyTorch with CUDA is available
To enable GPU, set use_gpu=True in the configuration
Kreuzberg will automatically check if CUDA is available via PyTorch

Language Support:

Uses different language codes than Tesseract
Examples: en (English), de (German), zh (Chinese), etc.
See the EasyOCR documentation for the full list

Configuration:

from kreuzberg import extract_file, ExtractionConfig, EasyOCRConfig

result = await extract_file(
    "document.jpg",
    config=ExtractionConfig(
        ocr_backend="easyocr",
        ocr_config=EasyOCRConfig(
            language_list=["en", "de"], use_gpu=True  # English and German  # Enable GPU acceleration if available (experimental)
        ),
    ),
)

3. PaddleOCR¶

PaddleOCR is an OCR toolkit developed by Baidu. It's particularly strong for Chinese and other Asian languages.

Installation Requirements:

Requires the paddleocr optional dependency
Install with: pip install "kreuzberg[paddleocr]"

GPU Support:

Experimental Feature

GPU support is not considered an official feature and might be subject to change or removal in future versions.

PaddleOCR can utilize GPU acceleration if the paddlepaddle-gpu package is installed
Kreuzberg automatically detects if paddlepaddle-gpu is available
To explicitly enable GPU, set use_gpu=True in the configuration
For GPU usage, install: pip install paddlepaddle-gpu instead of the standard paddlepaddle package

Language Support:

Limited language support compared to other backends
Supported languages: ch (Chinese), en (English), french, german, japan, korean

Configuration:

from kreuzberg import extract_file, ExtractionConfig, PaddleOCRConfig

result = await extract_file(
    "chinese_document.jpg",
    config=ExtractionConfig(
        ocr_backend="paddleocr",
        ocr_config=PaddleOCRConfig(
            language="ch",  # Chinese
            use_gpu=True,  # Enable GPU acceleration if paddlepaddle-gpu is available (experimental)
            gpu_mem=4000,  # Set GPU memory limit in MB (experimental)
        ),
    ),
)

4. No OCR¶

You can also disable OCR completely, which is useful for documents that already contain searchable text.

Configuration:

from kreuzberg import extract_file, ExtractionConfig

result = await extract_file("searchable_pdf.pdf", config=ExtractionConfig(ocr_backend=None))

Choosing the Right Backend¶

Here are some guidelines for choosing the appropriate OCR backend:

Tesseract OCR (Default)¶

Advantages:

Lightweight and CPU-optimized
No model downloads required (faster startup)
Mature and widely used
Lower memory usage
Good for general-purpose OCR across many languages
Good balance of accuracy and performance

Considerations:

Requires system-level installation
May have lower accuracy for some languages or complex layouts
More configuration may be needed for optimal results

EasyOCR¶

Advantages:

Good accuracy across multiple languages
No system dependencies required (pure Python)
Simple configuration
Better for complex scripts and languages like Arabic, Thai, or Hindi
Can be more accurate for handwritten text

Considerations:

Larger memory footprint (requires PyTorch)
Slower first-run due to model downloads
Heavier resource usage
Model files are downloaded on first use, causing initial delay

PaddleOCR¶

Advantages:

Excellent accuracy, especially for Asian languages
No system dependencies required
Modern deep learning architecture
Fast processing once models are loaded

Considerations:

Largest memory footprint of the three options (requires PaddlePaddle)
Slower first-run due to model downloads
More resource-intensive
Model files are downloaded on first use, causing initial delay

No OCR (Setting ocr_backend=None)¶

Use when:

Processing searchable PDFs or documents with embedded text
You want to extract embedded text only
You want to avoid the overhead of OCR processing

Behavior:

For searchable PDFs, embedded text will still be extracted
For images and non-searchable PDFs, an empty string will be returned for content
Fastest option as it skips OCR processing entirely

Installation Summary¶

To install Kreuzberg with different OCR backends:

# Basic installation (Tesseract requires separate system installation)
pip install kreuzberg

# With EasyOCR support
pip install "kreuzberg[easyocr]"

# With PaddleOCR support
pip install "kreuzberg[paddleocr]"

# With chunking support
pip install "kreuzberg[chunking]"

# With all optional dependencies (OCR backends and chunking)
pip install "kreuzberg[all]"

System Dependencies

Remember that Pandoc and Tesseract are system dependencies that must be installed separately from the Python package.

For Tesseract, you must install version 5.0 or higher, and you'll need to install additional language data files for languages other than English.