Basic Usage

Kreuzberg offers a simple API for text extraction from documents and images.

Core Functions

Kreuzberg exports the following main functions:

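Single Item Processing

  • extract_file(): Async function that extracts text from a file path
  • extract_bytes(): Async function that extracts text from in-memory bytes
  • extract_file_sync(): Synchronous counterpart of extract_file()

Batch Processing

  • batch_extract_file(): Async function that processes multiple files concurrently
  • batch_extract_file_sync(): Synchronous counterpart of batch_extract_file()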

Async Examples

Extract Text from a File

import asyncio
from kreuzberg import extract_file

async def main():
    result = await extract_file("document.pdf")
    print(result.content)
    print(f"MIME type: {result.mime_type}")
    print(f"Metadata: {result.metadata}")

asyncio.run(main())

Process Multiple Files Concurrently

import asyncio
from pathlib import Path
from kreuzberg import batch_extract_file

async def process_documents():
    file_paths = [Path("document1.pdf"), Path("document2.docx"), Path("image.jpg")]

    # Process all files concurrently
    results = await batch_extract_file(file_paths)

    # Results are returned in the same order as inputs
    for path, result in zip(file_paths, results):
        print(f"File: {path}")
        print(f"Content: {result.content[:100]}...")  # First 100 chars
        print(f"MIME type: {result.mime_type}")
        print("---")

asyncio.run(process_documents())
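
The ordering guarantee noted in the comment above can be illustrated without Kreuzberg itself. The following is a minimal sketch using only the standard library: `fake_extract` and `fake_batch` are hypothetical stand-ins, not Kreuzberg functions, and the sketch does not claim to show how `batch_extract_file` is implemented. It only shows that `asyncio.gather` returns results in input order even when a later job finishes first.

```python
import asyncio


async def fake_extract(name: str, delay: float) -> str:
    """Hypothetical stand-in for an extraction call: sleeps, then returns a label."""
    await asyncio.sleep(delay)
    return f"text from {name}"


async def fake_batch(jobs: list[tuple[str, float]]) -> list[str]:
    # asyncio.gather runs the coroutines concurrently and returns their results
    # in the order the awaitables were passed in, not in completion order.
    return await asyncio.gather(*(fake_extract(name, delay) for name, delay in jobs))


if __name__ == "__main__":
    # The first job is slower, yet its result still comes first.
    jobs = [("slow.pdf", 0.02), ("fast.jpg", 0.0)]
    print(asyncio.run(fake_batch(jobs)))
```

Because the output order matches the input order, pairing inputs and results with `zip`, as in the example above, is safe.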

Synchronous Examples

Extract Text from a File

from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)

Process Multiple Files

from kreuzberg import batch_extract_file_sync

file_paths = ["document1.pdf", "document2.docx", "image.jpg"]
results = batch_extract_file_sync(file_paths)

for path, result in zip(file_paths, results):
    print(f"File: {path}")
    print(f"Content: {result.content[:100]}...")
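
In practice the list of paths is often built from a directory rather than written by hand. The sketch below is one possible way to do that with `pathlib`; `collect_paths` and `EXAMPLE_SUFFIXES` are hypothetical helpers, and the suffix set is only an assumption drawn from the file names used in the examples above, not a list of formats Kreuzberg supports.

```python
from pathlib import Path

# Assumption: suffixes taken from the example file names in this guide.
EXAMPLE_SUFFIXES = {".pdf", ".docx", ".jpg"}


def collect_paths(directory, suffixes=EXAMPLE_SUFFIXES) -> list[Path]:
    """Return the files directly inside `directory` whose suffix is in `suffixes`, sorted."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in suffixes  # case-insensitive match
    )
```

The returned list could then be passed to `batch_extract_file_sync` in place of the hard-coded `file_paths`.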

Working with Byte Content

If you already have the file content in memory, you can use the bytes extraction functions:

import asyncio
from kreuzberg import extract_bytes

async def extract_from_memory():
    with open("document.pdf", "rb") as f:
        content = f.read()

    result = await extract_bytes(content, mime_type="application/pdf")
    print(result.content)

asyncio.run(extract_from_memory())
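
The example above hard-codes the MIME type. When the bytes come from a file whose name is known, the type can be guessed from the extension with the standard library instead. This is a minimal sketch: `guess_mime_type` is a hypothetical helper, not part of Kreuzberg, and extension-based guessing is only a heuristic.

```python
import mimetypes


def guess_mime_type(filename: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the file extension, falling back to `default`."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default


if __name__ == "__main__":
    print(guess_mime_type("document.pdf"))
    print(guess_mime_type("image.jpg"))
```

The returned string could then be passed as the `mime_type` argument to `extract_bytes`.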

Extraction Result

All extraction functions return an ExtractionResult object containing:

  • content: Extracted text
  • mime_type: Document MIME type
  • metadata: Document metadata (see Metadata Extraction)

import asyncio
from kreuzberg import extract_file, ExtractionResult  # Import types directly from kreuzberg

async def show_metadata():
    result: ExtractionResult = await extract_file("document.pdf")

    # Access the content
    print(result.content)

    # Access metadata (if available)
    if "title" in result.metadata:
        print(f"Title: {result.metadata['title']}")

    if "authors" in result.metadata:
        print(f"Authors: {', '.join(result.metadata['authors'])}")

    if "created_at" in result.metadata:
        print(f"Created: {result.metadata['created_at']}")

asyncio.run(show_metadata())
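
The repeated membership checks above can be folded into a small helper. The sketch below assumes only what the example shows: that the metadata behaves like a mapping and that "title", "authors" and "created_at" are optional keys. `describe_metadata` is a hypothetical name, not part of Kreuzberg.

```python
from collections.abc import Mapping
from typing import Any


def describe_metadata(metadata: Mapping[str, Any]) -> list[str]:
    """Build printable lines for whichever of the optional keys are present."""
    lines = []
    title = metadata.get("title")
    if title:
        lines.append(f"Title: {title}")
    authors = metadata.get("authors")
    if authors:
        # "authors" is treated as a list of strings, as in the example above.
        lines.append(f"Authors: {', '.join(authors)}")
    created_at = metadata.get("created_at")
    if created_at:
        lines.append(f"Created: {created_at}")
    return lines
```

Calling `print("\n".join(describe_metadata(result.metadata)))` would print only the fields that are actually present.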