Installation

Kreuzberg is composed of a core package and several optional dependencies, which users can install at their discretion.

System Dependencies

Pandoc

Kreuzberg relies on pandoc, which is a required system dependency. To install it, follow the instructions below:

Ubuntu/Debian

sudo apt-get install pandoc

macOS

brew install pandoc

Windows

choco install -y pandoc
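After installing, it can help to confirm that the pandoc executable is visible to the Python environment that will run Kreuzberg. The following is a minimal sketch using only the standard library; the helper name `tool_version` is hypothetical and is not part of Kreuzberg.

```python
from __future__ import annotations

import shutil
import subprocess


def tool_version(executable: str) -> str | None:
    """Return the first line of `<executable> --version`, or None if unavailable.

    Hypothetical helper for illustration; not part of Kreuzberg.
    """
    path = shutil.which(executable)  # None when the tool is not on PATH
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tool_version("pandoc") or "pandoc not found on PATH")
```

If the script reports that pandoc was not found, the usual cause is that the install location is missing from PATH in the shell or virtual environment in use.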

Kreuzberg Core Package

The Kreuzberg core package can be installed using pip with:

pip install kreuzberg
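To check that the package landed in the active environment, the installed version can be read from the package metadata. This is a small sketch using the standard library's `importlib.metadata`; the function name `installed_version` is an illustrative assumption.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str) -> str | None:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of Kreuzberg.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("kreuzberg")
    print(f"kreuzberg {found}" if found else "kreuzberg is not installed here")
```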

Optional Features

OCR

OCR is an optional feature for extracting text from images and non-searchable PDFs. Kreuzberg supports multiple OCR backends. To understand the differences between these backends, please read the OCR Backends documentation.

Tesseract OCR

Support for Tesseract OCR is built into Kreuzberg and requires no additional Python packages. However, Tesseract 5.0 or higher must be installed on your system:

Ubuntu/Debian

sudo apt-get install tesseract-ocr

macOS

brew install tesseract

Windows

choco install -y tesseract
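Because a 5.0 minimum is required, it may be worth checking the version that the package manager actually installed. The sketch below parses the output of `tesseract --version`; it assumes the first line looks like `tesseract 5.3.0` (some Windows builds print `tesseract v5.3.0...`), and the function names are hypothetical.

```python
from __future__ import annotations

import re
import shutil
import subprocess

MINIMUM_MAJOR = 5  # minimum major version stated in the requirement above


def parse_tesseract_version(version_output: str) -> tuple[int, int] | None:
    """Extract (major, minor) from `tesseract --version` output, if present."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def tesseract_is_supported() -> bool:
    """Return True when a Tesseract binary with major version >= 5 is on PATH."""
    path = shutil.which("tesseract")
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return False
    # Older releases wrote the version banner to stderr, so check both streams.
    parsed = parse_tesseract_version(result.stdout + "\n" + result.stderr)
    return parsed is not None and parsed[0] >= MINIMUM_MAJOR


if __name__ == "__main__":
    print("Tesseract 5+ available:", tesseract_is_supported())
```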

Language Support

Tesseract includes English language support by default. If you need to process documents in other languages, you must install the appropriate language data files:

Ubuntu/Debian: sudo apt-get install tesseract-ocr-deu (for German)
macOS: brew install tesseract-lang
Windows: See the Tesseract documentation

For more details on language installation and configuration, refer to the Tesseract documentation.
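To see which language data files are currently installed, Tesseract's `--list-langs` option can be queried. The following is a rough sketch that assumes the output is a header line beginning with `List of` followed by one language code per line; the helper names are illustrative only.

```python
from __future__ import annotations

import shutil
import subprocess


def parse_language_list(output: str) -> list[str]:
    """Return language codes from `tesseract --list-langs` output.

    Assumes a header line starting with "List of", then one code per line.
    """
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("List of")
    ]


def installed_languages() -> list[str]:
    """Return installed Tesseract language codes, or [] if Tesseract is missing."""
    path = shutil.which("tesseract")
    if path is None:
        return []
    try:
        result = subprocess.run(
            [path, "--list-langs"], capture_output=True, text=True, check=False
        )
    except OSError:
        return []
    # Older releases printed the list to stderr, so read both streams.
    return parse_language_list(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    print("German available:", "deu" in langs)
```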

EasyOCR

EasyOCR is a Python-based OCR backend with wide language support and strong performance.

pip install "kreuzberg[easyocr]"

PaddleOCR

pip install "kreuzberg[paddleocr]"

Chunking

Chunking is an optional feature that is useful for RAG applications, among others. Kreuzberg uses the excellent semantic-text-splitter package for chunking. To install Kreuzberg with chunking support, you can use:

pip install "kreuzberg[chunking]"

Table Extraction

Table extraction is an optional feature that allows Kreuzberg to extract tables from PDFs. It uses the GMFT package. To install Kreuzberg with table extraction support, you can use:

pip install "kreuzberg[gmft]"

Language Detection

Language detection is an optional feature that automatically detects the language of extracted text. It uses the fast-langdetect package. To install Kreuzberg with language detection support, you can use:

pip install "kreuzberg[langdetect]"

Entity and Keyword Extraction

Entity and keyword extraction are optional features. Entity extraction uses spaCy for multilingual named entity recognition, and keyword extraction uses KeyBERT for semantic keyword extraction. To install Kreuzberg with both, you can use:

pip install "kreuzberg[entity-extraction]"

After installation, you'll need to download the spaCy language models you plan to use:

# Download English model (most common)
python -m spacy download en_core_web_sm

# Download other language models as needed
python -m spacy download de_core_news_sm  # German
python -m spacy download fr_core_news_sm  # French
python -m spacy download es_core_news_sm  # Spanish

Language Model Requirements

spaCy language models are large (50-500MB each) and are downloaded separately. Only download the models for languages you actually need to process. See the spaCy models documentation for a complete list of available models.
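Since spaCy models are installed as ordinary Python packages, their presence can be checked without loading them. The sketch below uses `importlib.util.find_spec` for that; the function name `missing_models` is hypothetical, and the model names are the ones from the download commands above.

```python
from __future__ import annotations

from importlib.util import find_spec


def missing_models(model_names: list[str]) -> list[str]:
    """Return the model packages that cannot be found in this environment.

    Hypothetical helper for illustration; it only checks that each package is
    importable and does not load the (potentially large) model itself.
    """
    return [name for name in model_names if find_spec(name) is None]


if __name__ == "__main__":
    wanted = ["en_core_web_sm", "de_core_news_sm"]
    for name in missing_models(wanted):
        print(f"Missing: run `python -m spacy download {name}`")
```

Checking for the package instead of calling `spacy.load` keeps the check fast and avoids pulling a model into memory just to confirm it exists.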

All Optional Dependencies

To install Kreuzberg with all optional dependencies, you can use the all extra group:

pip install "kreuzberg[all]"

This is equivalent to:

pip install "kreuzberg[chunking,easyocr,entity-extraction,gmft,langdetect,paddleocr]"
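The set of extras can differ between releases, so one way to confirm what an installed version offers is to read the `Provides-Extra` entries from its package metadata. This is a sketch using the standard library only; the function name `declared_extras` is an assumption made for illustration.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, metadata


def declared_extras(distribution: str) -> list[str]:
    """Return the extras a distribution declares, or [] if it is not installed.

    Hypothetical helper for illustration; not part of Kreuzberg.
    """
    try:
        meta = metadata(distribution)
    except PackageNotFoundError:
        return []
    # Each declared extra appears as a separate "Provides-Extra" metadata field.
    return sorted(meta.get_all("Provides-Extra") or [])


if __name__ == "__main__":
    print(declared_extras("kreuzberg"))
```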