Extraction Configuration

Kreuzberg provides extensive configuration options for the extraction process through the ExtractionConfig class. This guide covers common configuration scenarios and examples.

Basic Configuration

All extraction functions accept an optional config parameter of type ExtractionConfig. This object allows you to:

  • Control OCR behavior with force_ocr and ocr_backend
  • Provide engine-specific OCR configuration via ocr_config
  • Enable table extraction with extract_tables and configure it via gmft_config
  • Enable automatic language detection with auto_detect_language
  • Add validation and post-processing hooks
  • Configure custom extractors

Examples

Basic Usage

from kreuzberg import extract_file, ExtractionConfig

# Simple extraction with default configuration
result = await extract_file("document.pdf")

# Extraction with custom configuration
result = await extract_file("document.pdf", config=ExtractionConfig(force_ocr=True))

OCR Configuration

from kreuzberg import extract_file, ExtractionConfig, TesseractConfig, PSMMode

# Configure Tesseract OCR with specific language and page segmentation mode
result = await extract_file(
    "document.pdf",
    config=ExtractionConfig(force_ocr=True, ocr_config=TesseractConfig(language="eng+deu", psm=PSMMode.SINGLE_BLOCK)),
)

The language parameter specifies which language model Tesseract should use. You can specify multiple languages by joining them with a plus sign (e.g., "eng+deu" for English and German).

The psm (Page Segmentation Mode) parameter controls how Tesseract analyzes page layout. Different modes are suitable for different types of documents:

  • PSMMode.AUTO: Automatic page segmentation (default)
  • PSMMode.SINGLE_BLOCK: Treat the image as a single text block
  • PSMMode.SINGLE_LINE: Treat the image as a single text line
  • PSMMode.SINGLE_WORD: Treat the image as a single word
  • PSMMode.SINGLE_CHAR: Treat the image as a single character
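
For example, a single-line receipt scan is usually read more reliably with a narrower segmentation mode. A minimal sketch (the file name is illustrative):

from kreuzberg import extract_file, ExtractionConfig, TesseractConfig, PSMMode

# Treat the whole image as one line of text, e.g. a receipt strip
result = await extract_file(
    "receipt_line.png",
    config=ExtractionConfig(force_ocr=True, ocr_config=TesseractConfig(language="eng", psm=PSMMode.SINGLE_LINE)),
)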

Alternative OCR Engines

from kreuzberg import extract_file, ExtractionConfig, EasyOCRConfig, PaddleOCRConfig

# Use EasyOCR backend
result = await extract_file(
    "document.jpg", config=ExtractionConfig(ocr_backend="easyocr", ocr_config=EasyOCRConfig(language_list=["en", "de"]))
)

# Use PaddleOCR backend
result = await extract_file(
    "chinese_document.jpg", config=ExtractionConfig(ocr_backend="paddleocr", ocr_config=PaddleOCRConfig(language="ch"))
)

Table Extraction

Kreuzberg can extract tables from PDF documents using the GMFT package:

from kreuzberg import extract_file, ExtractionConfig, GMFTConfig

# Extract tables with default configuration
result = await extract_file("document_with_tables.pdf", config=ExtractionConfig(extract_tables=True))

# Extract tables with custom configuration
config = ExtractionConfig(
    extract_tables=True,
    gmft_config=GMFTConfig(
        detector_base_threshold=0.85,  # Minimum confidence score required for a table
        remove_null_rows=True,  # Remove rows with no text
        enable_multi_header=True,  # Enable multi-indices in the dataframe
    ),
)
result = await extract_file("document_with_tables.pdf", config=config)

# Access extracted tables
for i, table in enumerate(result.tables):
    print(f"Table {i+1} on page {table.page_number}:")
    print(table.text)  # Markdown formatted table text
    # You can also access the pandas DataFrame directly
    df = table.df
    print(df.shape)  # (rows, columns)
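
Because each table exposes a pandas DataFrame, you can persist extracted tables with the usual pandas I/O methods. A short sketch (the output file names are illustrative):

# Save every extracted table to its own CSV file
for i, table in enumerate(result.tables):
    table.df.to_csv(f"table_{i+1}_page_{table.page_number}.csv", index=False)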

Note that table extraction requires the gmft dependency. You can install it with:

pip install "kreuzberg[gmft]"

Language Detection

Kreuzberg can automatically detect the language of extracted text using fast-langdetect:

from kreuzberg import extract_file, ExtractionConfig, LanguageDetectionConfig

# Simple automatic language detection
result = await extract_file("multilingual_document.pdf", config=ExtractionConfig(auto_detect_language=True))

# Access detected languages (lowercase ISO 639-1 codes)
if result.detected_languages:
    print(f"Detected languages: {', '.join(result.detected_languages)}")
    # Example output: "Detected languages: en, de, fr"

# Advanced configuration with multilingual detection
lang_config = LanguageDetectionConfig(
    multilingual=True,  # Enable mixed-language detection
    top_k=5,  # Return top 5 languages
    low_memory=False,  # Use high accuracy mode
    cache_dir="/tmp/lang_models",  # Custom model cache directory
)

result = await extract_file(
    "multilingual_document.pdf", config=ExtractionConfig(auto_detect_language=True, language_detection_config=lang_config)
)

# Use detected languages for OCR
if result.detected_languages:
    # Re-extract with OCR using the primary detected language
    from kreuzberg import TesseractConfig

    result_with_ocr = await extract_file(
        "multilingual_document.pdf",
        config=ExtractionConfig(force_ocr=True, ocr_config=TesseractConfig(language=result.detected_languages[0])),
    )

Language Detection Configuration Options

  • low_memory (default: True): Use smaller model (~200MB) vs larger, more accurate model
  • multilingual (default: False): Enable detection of multiple languages in mixed text
  • top_k (default: 3): Maximum number of languages to return
  • cache_dir: Custom directory for language model storage
  • allow_fallback (default: True): Fall back to small model if large model fails
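
If you want to make sure the larger, more accurate model is always used and never silently replaced by the small one, you can disable the fallback. A small sketch combining these options:

from kreuzberg import extract_file, ExtractionConfig, LanguageDetectionConfig

# Require the high-accuracy model and do not fall back to the small one
strict_config = LanguageDetectionConfig(low_memory=False, allow_fallback=False)

result = await extract_file(
    "document.pdf",
    config=ExtractionConfig(auto_detect_language=True, language_detection_config=strict_config),
)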

Language detection requires the optional fast-langdetect dependency, available via the langdetect extra:

pip install "kreuzberg[langdetect]"

Entity and Keyword Extraction

Kreuzberg can extract named entities and keywords from documents using spaCy for entity recognition and KeyBERT for keyword extraction:

from kreuzberg import extract_file, ExtractionConfig, SpacyEntityExtractionConfig

# Basic entity and keyword extraction
result = await extract_file(
    "document.pdf",
    config=ExtractionConfig(
        extract_entities=True,
        extract_keywords=True,
        keyword_count=10,  # Number of keywords to extract (default: 10)
    ),
)

# Access extracted entities and keywords
if result.entities:
    for entity in result.entities:
        print(f"{entity.type}: {entity.text} (position {entity.start}-{entity.end})")
        # Example: "PERSON: John Doe (position 0-8)"

if result.keywords:
    for keyword, score in result.keywords:
        print(f"{keyword}: {score:.3f}")
        # Example: "artificial intelligence: 0.845"

Entity Extraction with Language Support

spaCy supports entity extraction in multiple languages. You can configure language-specific models:

from kreuzberg import extract_file, ExtractionConfig, SpacyEntityExtractionConfig

# Configure spaCy for specific languages
spacy_config = SpacyEntityExtractionConfig(
    language_models={
        "en": "en_core_web_sm",  # English
        "de": "de_core_news_sm",  # German
        "fr": "fr_core_news_sm",  # French
        "es": "es_core_news_sm",  # Spanish
    },
    model_cache_dir="/tmp/spacy_models",  # Custom model cache directory
    fallback_to_multilingual=True,  # Use multilingual model if language-specific model fails
)

# Extract with language detection to automatically choose the right model
result = await extract_file(
    "multilingual_document.pdf",
    config=ExtractionConfig(
        auto_detect_language=True,  # Enable language detection
        extract_entities=True,
        spacy_entity_extraction_config=spacy_config,
    ),
)

# The system will automatically use the appropriate spaCy model based on detected languages
if result.detected_languages and result.entities:
    print(f"Detected languages: {result.detected_languages}")
    print(f"Extracted {len(result.entities)} entities")

Custom Entity Patterns

You can define custom entity patterns using regular expressions:

result = await extract_file(
    "invoice.pdf",
    config=ExtractionConfig(
        extract_entities=True,
        custom_entity_patterns={
            "INVOICE_ID": r"INV-\d{4,}",  # Invoice numbers
            "PHONE": r"\+?\d{1,3}[-.\s]?\d{3,4}[-.\s]?\d{3,4}[-.\s]?\d{3,4}",  # Phone numbers
            "EMAIL": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",  # Email addresses
        },
    ),
)

# Custom patterns are combined with spaCy's standard entity types
for entity in result.entities:
    if entity.type in ["INVOICE_ID", "PHONE", "EMAIL"]:
        print(f"Custom entity - {entity.type}: {entity.text}")
    else:
        print(f"Standard entity - {entity.type}: {entity.text}")

Supported Entity Types

spaCy automatically detects these standard entity types:

  • PERSON: People's names
  • ORG: Organizations, companies, agencies
  • GPE: Countries, cities, states (Geopolitical entities)
  • MONEY: Monetary values
  • DATE: Date expressions
  • TIME: Time expressions
  • PERCENT: Percentage values
  • CARDINAL: Numerals that do not fall under another type

Language-specific models may support additional entity types relevant to that language.
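
For a quick overview of which entity types occur in a document, the extracted entities can be grouped by their type attribute. A minimal sketch built on the fields shown above:

from collections import Counter

# Count how often each entity type appears in the extraction result
type_counts = Counter(entity.type for entity in result.entities or [])
for entity_type, count in type_counts.most_common():
    print(f"{entity_type}: {count}")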

spaCy Configuration Options

  • language_models: Dict mapping language codes to spaCy model names
  • model_cache_dir: Custom directory for caching spaCy models
  • fallback_to_multilingual: Whether to use multilingual model (xx_ent_wiki_sm) as fallback
  • max_doc_length: Maximum document length for spaCy processing (default: 1,000,000 characters)
  • batch_size: Batch size for processing multiple texts (default: 1,000)
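
These options can be combined in a single configuration object. For instance, a sketch that raises the processing limits for long documents (the values and file name are illustrative):

from kreuzberg import extract_file, ExtractionConfig, SpacyEntityExtractionConfig

# Allow longer documents and larger batches during entity extraction
spacy_config = SpacyEntityExtractionConfig(
    max_doc_length=2_000_000,  # characters; default is 1,000,000
    batch_size=2_000,  # default is 1,000
    fallback_to_multilingual=True,
)

result = await extract_file(
    "long_report.pdf",
    config=ExtractionConfig(extract_entities=True, spacy_entity_extraction_config=spacy_config),
)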

Installation Requirements

Entity and keyword extraction require additional dependencies:

# For entity extraction with spaCy
pip install "kreuzberg[entity-extraction]"

# Install specific spaCy language models as needed
python -m spacy download en_core_web_sm    # English
python -m spacy download de_core_news_sm   # German
python -m spacy download fr_core_news_sm   # French

Available spaCy models include: en_core_web_sm, de_core_news_sm, fr_core_news_sm, es_core_news_sm, pt_core_news_sm, it_core_news_sm, nl_core_news_sm, zh_core_web_sm, ja_core_news_sm, ko_core_news_sm, ru_core_news_sm, and many others.

Batch Processing

from kreuzberg import batch_extract_file, ExtractionConfig

# Process multiple files with the same configuration
file_paths = ["document1.pdf", "document2.docx", "image.jpg"]
config = ExtractionConfig(force_ocr=True)
results = await batch_extract_file(file_paths, config=config)
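
The results come back as a list you can pair with the input paths. A short sketch (this assumes each result exposes its extracted text via a content attribute):

# Pair each input path with its extraction result
for path, result in zip(file_paths, results):
    print(f"{path}: {len(result.content)} characters extracted")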

Synchronous API

from kreuzberg import extract_file_sync, ExtractionConfig, TesseractConfig

# Synchronous extraction with configuration
result = extract_file_sync("document.pdf", config=ExtractionConfig(ocr_config=TesseractConfig(language="eng")))

Using Custom Extractors

You can register custom extractors to handle specific file formats:

from kreuzberg import ExtractorRegistry, extract_file, ExtractionConfig
from my_module import CustomExtractor

# Register a custom extractor
ExtractorRegistry.add_extractor(CustomExtractor)

# Now extraction functions will use your custom extractor for supported MIME types
result = await extract_file("custom_document.xyz")

# Later, remove the extractor if needed
ExtractorRegistry.remove_extractor(CustomExtractor)

See the Custom Extractors guide for more details on creating and registering custom extractors.

OCR Best Practices

When configuring OCR for your documents, consider these best practices:

  1. Language Selection: Choose the appropriate language model for your documents. Using the wrong language model can significantly reduce OCR accuracy.

  2. Page Segmentation Mode: Select the appropriate PSM based on your document layout:

    • Use PSMMode.AUTO for general documents with mixed content
    • Use PSMMode.SINGLE_BLOCK for documents with a single column of text
    • Use PSMMode.SINGLE_LINE for receipts or single-line text
    • Use PSMMode.SINGLE_WORD or PSMMode.SINGLE_CHAR for specialized cases
  3. OCR Engine Selection: Choose the appropriate OCR engine based on your needs:

    • Tesseract: Good general-purpose OCR with support for many languages
    • EasyOCR: Better for some non-Latin scripts and natural scene text
    • PaddleOCR: Excellent for Chinese and other Asian languages
  4. Preprocessing: For better OCR results, consider using validation and post-processing hooks to clean up the extracted text.
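
As a concrete illustration of point 4, a post-processing hook can normalize whitespace in the extracted text before you consume it. A minimal sketch, assuming ExtractionConfig accepts hooks through a post_processing_hooks parameter and that each hook receives the extraction result, may modify its content attribute, and returns the result:

import re

from kreuzberg import extract_file, ExtractionConfig


async def normalize_whitespace(result):
    # Collapse runs of spaces and tabs in the extracted text (content attribute assumed)
    result.content = re.sub(r"[ \t]+", " ", result.content)
    return result


result = await extract_file(
    "scanned_document.pdf",
    config=ExtractionConfig(force_ocr=True, post_processing_hooks=[normalize_whitespace]),
)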