Quick Start

Get started with Kreuzberg for text extraction from documents and images.

Basic Usage

Kreuzberg provides both asynchronous and synchronous APIs for text extraction.

Asynchronous API

import asyncio
from kreuzberg import extract_file

async def main():
    # Extract text from a PDF file
    result = await extract_file("document.pdf")
    print(result.content)

    # The result also contains metadata
    print(f"Mime type: {result.mime_type}")
    print(f"Extraction method: {result.extraction_method}")

asyncio.run(main())

Synchronous API

from kreuzberg import extract_file_sync

# Extract text from a PDF file
result = extract_file_sync("document.pdf")
print(result.content)
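
The relationship between the two call styles can be sketched with the standard library alone. The example below is a minimal illustration, not Kreuzberg's implementation: `fake_extract` and `fake_extract_sync` are hypothetical stand-ins introduced here, and the sketch only shows how a blocking wrapper can drive a coroutine to completion with `asyncio.run`.

```python
import asyncio


async def fake_extract(path: str) -> str:
    """Hypothetical stand-in for an asynchronous extraction call."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"text from {path}"


def fake_extract_sync(path: str) -> str:
    """Blocking wrapper: runs the coroutine to completion on a fresh event loop."""
    return asyncio.run(fake_extract(path))


print(fake_extract_sync("document.pdf"))
```

The synchronous form is convenient in scripts and notebooks without a running event loop. The asynchronous form fits code that is already inside one, where `asyncio.run` cannot be called.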

OCR Configuration

Kreuzberg supports OCR for images and scanned PDFs with configurable language and page segmentation mode:

import asyncio
from kreuzberg import extract_file, ExtractionConfig, TesseractConfig, PSMMode

async def main():
    # Extract text from an image with German language model
    result = await extract_file(
        "german_document.jpg",
        config=ExtractionConfig(
            ocr_config=TesseractConfig(
                language="deu",  # German language model
                psm=PSMMode.SINGLE_BLOCK,  # Treat the page as a single text block
            )
        ),
    )
    print(result.content)

asyncio.run(main())

Batch Processing

Process multiple files concurrently:

import asyncio
from pathlib import Path
from kreuzberg import batch_extract_file

async def process_documents():
    file_paths = [Path("document1.pdf"), Path("document2.docx"), Path("image.jpg")]

    # Process all files concurrently
    results = await batch_extract_file(file_paths)

    # Results are returned in the same order as inputs
    for path, result in zip(file_paths, results):
        print(f"File: {path}")
        print(f"Content: {result.content[:100]}...")  # First 100 chars
        print(f"Mime type: {result.mime_type}")
        print(f"Method: {result.extraction_method}")
        print("---")

asyncio.run(process_documents())
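
The ordering guarantee mentioned in the comment above can be illustrated with plain `asyncio`. This is a sketch under stated assumptions, not a description of how `batch_extract_file` works internally: `fake_extract` and `fake_batch` are hypothetical helpers, and `asyncio.gather` is used only as one common way to run coroutines concurrently while keeping results in input order.

```python
import asyncio


async def fake_extract(path: str, delay: float) -> str:
    """Hypothetical stand-in for an extraction call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"text from {path}"


async def fake_batch(jobs: list[tuple[str, float]]) -> list[str]:
    # gather() runs the coroutines concurrently and returns results in the
    # order the awaitables were passed in, not the order they finished.
    return await asyncio.gather(*(fake_extract(path, delay) for path, delay in jobs))


# The first job finishes last, yet its result still comes first.
jobs = [("slow.pdf", 0.02), ("fast.docx", 0.0)]
results = asyncio.run(fake_batch(jobs))
print(results)
```

Because results line up with inputs, pairing them with `zip(file_paths, results)` as in the example above is safe.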

Error Handling

Kreuzberg provides specific exceptions for different error cases:

from kreuzberg import extract_file
from kreuzberg import KreuzbergError, MissingDependencyError, OCRError, ParsingError

async def safe_extract(path):
    try:
        result = await extract_file(path)
        return result.content
    except ParsingError:
        print(f"Unsupported or invalid file format: {path}")
    except MissingDependencyError as e:
        print(f"Missing dependency: {e}")
    except OCRError as e:
        print(f"OCR processing failed: {e}")
    except KreuzbergError as e:
        print(f"Extraction failed: {e}")
    return None
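
The order of the `except` clauses above matters if `KreuzbergError` is the common base class of the other three, which the ordering suggests but this page does not state explicitly. The sketch below uses hypothetical classes (`BaseExtractionError`, `FormatError`) rather than Kreuzberg's own, to show why the specific handlers come before the general one.

```python
class BaseExtractionError(Exception):
    """Hypothetical base class, playing the role of KreuzbergError."""


class FormatError(BaseExtractionError):
    """Hypothetical subclass, playing the role of ParsingError."""


def classify(exc: Exception) -> str:
    """Report which handler catches `exc`."""
    try:
        raise exc
    except FormatError:
        # The specific handler must come first; otherwise the base-class
        # handler below would catch subclass instances as well.
        return "specific"
    except BaseExtractionError:
        return "general"


print(classify(FormatError("bad file")))
print(classify(BaseExtractionError("something else")))
```

Python tries `except` clauses from top to bottom and uses the first match, so a base-class handler placed first would hide the more specific messages.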

Next Steps