Quick Start¶

Get started with Kreuzberg for text extraction from documents and images.

Basic Usage¶

Kreuzberg provides both asynchronous and synchronous APIs for text extraction.

Async API (Recommended)¶

import asyncio
from kreuzberg import extract_file

async def main():
    # Extract text from a PDF file
    result = await extract_file("document.pdf")
    print(result.content)

    # The result also contains metadata
    print(f"Mime type: {result.mime_type}")
    print(f"Extraction method: {result.extraction_method}")

asyncio.run(main())

Synchronous API¶

from kreuzberg import extract_file_sync

# Extract text from a PDF file
result = extract_file_sync("document.pdf")
print(result.content)

OCR Configuration¶

Kreuzberg supports OCR for images and scanned PDFs with configurable language and page segmentation mode:

from kreuzberg import extract_file, ExtractionConfig, TesseractConfig, PSMMode

async def main():
    # Extract text from an image with German language model
    result = await extract_file(
        "german_document.jpg",
        config=ExtractionConfig(
            ocr_config=TesseractConfig(
                language="deu", psm=PSMMode.SINGLE_BLOCK  # German language model  # Treat as a single text block
            )
        ),
    )
    print(result.content)

asyncio.run(main())

Batch Processing¶

Process multiple files concurrently:

from pathlib import Path
from kreuzberg import batch_extract_file

async def process_documents():
    file_paths = [Path("document1.pdf"), Path("document2.docx"), Path("image.jpg")]

    # Process all files concurrently
    results = await batch_extract_file(file_paths)

    # Results are returned in the same order as inputs
    for path, result in zip(file_paths, results):
        print(f"File: {path}")
        print(f"Content: {result.content[:100]}...")  # First 100 chars
        print(f"Mime type: {result.mime_type}")
        print(f"Method: {result.extraction_method}")
        print("---")

asyncio.run(process_documents())

Error Handling¶

Kreuzberg provides specific exceptions for different error cases:

from kreuzberg import extract_file
from kreuzberg import KreuzbergError, MissingDependencyError, OCRError, ParsingError

async def safe_extract(path):
    try:
        result = await extract_file(path)
        return result.content
    except ParsingError:
        print(f"Unsupported or invalid file format: {path}")
    except MissingDependencyError as e:
        print(f"Missing dependency: {e}")
    except OCRError as e:
        print(f"OCR processing failed: {e}")
    except KreuzbergError as e:
        print(f"Extraction failed: {e}")
    return None

Next Steps¶

Check the OCR Configuration guide for detailed OCR options
See Supported Formats for all file types Kreuzberg can process
Explore the API Reference for detailed function documentation