OCR Configuration¶
Kreuzberg offers simple configuration options for OCR to extract text from images and scanned documents.
All extraction functions in Kreuzberg accept an `ExtractionConfig` object that can contain OCR configuration:
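A minimal sketch of passing OCR settings through `ExtractionConfig`. The `ocr_backend`, `language`, `ExtractionConfig`, and `TesseractConfig` names come from this page; the `ocr_config` keyword, the `"tesseract"` backend string, the `extract_file_sync` function, and the `content` attribute are assumptions about the Python API, so check them against your installed version.

```python
def build_ocr_config(language: str = "eng"):
    """Return an ExtractionConfig that carries Tesseract settings."""
    # Imported inside the function so this module still loads when
    # Kreuzberg is not installed.
    from kreuzberg import ExtractionConfig, TesseractConfig

    return ExtractionConfig(
        ocr_backend="tesseract",  # assumed backend identifier
        ocr_config=TesseractConfig(language=language),  # assumed keyword name
    )


def extract_text(path: str, language: str = "eng") -> str:
    """Extract text from `path` using the configuration built above."""
    from kreuzberg import extract_file_sync  # assumed synchronous entry point

    result = extract_file_sync(path, config=build_ocr_config(language))
    return result.content  # assumed attribute holding the extracted text
```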
Language Configuration¶
The `language` parameter in a `TesseractConfig` object specifies which language model Tesseract should use for OCR:
Supported Language Codes¶
Language | Code | Language | Code |
---|---|---|---|
English | eng | German | deu |
French | fra | Spanish | spa |
Italian | ita | Japanese | jpn |
Korean | kor | Simplified Chinese | chi_sim |
Traditional Chinese | chi_tra | Russian | rus |
Arabic | ara | Hindi | hin |
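The table above can be turned into a small lookup, sketched here under the assumption that you want to select a language by its English name. The helper names `code_for` and `tesseract_config_for` are hypothetical; only `TesseractConfig` and its `language` parameter come from this page.

```python
# Language names and Tesseract codes copied from the table above.
TESSERACT_CODES = {
    "English": "eng", "German": "deu",
    "French": "fra", "Spanish": "spa",
    "Italian": "ita", "Japanese": "jpn",
    "Korean": "kor", "Simplified Chinese": "chi_sim",
    "Traditional Chinese": "chi_tra", "Russian": "rus",
    "Arabic": "ara", "Hindi": "hin",
}


def code_for(language_name: str) -> str:
    """Look up the Tesseract code for a language name from the table."""
    try:
        return TESSERACT_CODES[language_name]
    except KeyError:
        raise ValueError(f"No Tesseract code listed for {language_name!r}") from None


def tesseract_config_for(language_name: str):
    """Build a TesseractConfig for the named language (hypothetical helper)."""
    from kreuzberg import TesseractConfig  # lazy import: needs Kreuzberg installed

    return TesseractConfig(language=code_for(language_name))
```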
Multi-Language Support¶
You can specify multiple languages by joining codes with a plus sign:
Note
The order of languages affects processing time and accuracy. The first language is treated as the primary language.
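As a rough illustration of the plus-sign syntax, the sketch below joins codes in priority order so that the first one is the primary language. `join_languages` and `multi_language_config` are hypothetical helpers, not part of Kreuzberg.

```python
def join_languages(*codes: str) -> str:
    """Join Tesseract codes with '+', primary language first.

    Duplicates are dropped while the original order is preserved.
    """
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(dict.fromkeys(codes))


def multi_language_config(*codes: str):
    """Build a TesseractConfig for several languages (hypothetical helper)."""
    from kreuzberg import TesseractConfig  # lazy import: needs Kreuzberg installed

    return TesseractConfig(language=join_languages(*codes))
```

For example, `join_languages("eng", "deu")` produces `"eng+deu"`, with English as the primary language.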
Language Installation¶
For Tesseract to recognize languages other than English, you need to install the corresponding language data:
- Ubuntu/Debian: `sudo apt-get install tesseract-ocr-<lang-code>`
- macOS: `brew install tesseract-lang` (installs all languages)
- Windows: Download language data from GitHub
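To confirm that the language data is actually installed, one option is to ask Tesseract itself. The sketch below assumes the `tesseract` binary is on your `PATH` and relies on its `--list-langs` flag; the helper names are hypothetical, and the exact output format can differ between Tesseract versions.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.startswith("List of available languages"):
            continue
        langs.add(line)
    return langs


def installed_languages() -> set[str]:
    """Return installed language codes, or an empty set if Tesseract is missing."""
    if shutil.which("tesseract") is None:
        return set()
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some versions print the list to stderr, so fall back to it.
    return parse_list_langs(proc.stdout or proc.stderr)


def missing_languages(language: str, installed: set[str]) -> list[str]:
    """Return the codes in a 'eng+deu' style string that are not installed."""
    return [code for code in language.split("+") if code not in installed]
```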
Page Segmentation Mode (PSM)¶
The `psm` parameter in a `TesseractConfig` object controls how Tesseract analyzes the layout of the page:
Available PSM Modes¶
Mode | Enum Value | Description | Best For |
---|---|---|---|
Auto Only | PSMMode.AUTO_ONLY | Automatic segmentation without orientation detection | Modern documents (default - fastest) |
Automatic | PSMMode.AUTO | Automatic page segmentation with orientation detection | Rotated/skewed documents |
Single Block | PSMMode.SINGLE_BLOCK | Treat the image as a single text block | Simple layouts, preserving paragraph structure |
Single Column | PSMMode.SINGLE_COLUMN | Assume a single column of text | Books, articles, single-column documents |
Single Line | PSMMode.SINGLE_LINE | Treat the image as a single text line | Receipts, labels, single-line text |
Single Word | PSMMode.SINGLE_WORD | Treat the image as a single word | Word recognition tasks |
Sparse Text | PSMMode.SPARSE_TEXT | Find as much text as possible without assuming structure | Forms, tables, scattered text |
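One way to apply the table is to map a coarse layout hint to a `PSMMode` member. The hint strings and helper names below are hypothetical; the enum member names are the ones listed in the table, and unknown hints fall back to the documented default, `AUTO_ONLY`.

```python
# Layout hint -> PSMMode member name, following the "Best For" column above.
PSM_BY_LAYOUT = {
    "modern": "AUTO_ONLY",
    "rotated": "AUTO",
    "simple": "SINGLE_BLOCK",
    "single_column": "SINGLE_COLUMN",
    "single_line": "SINGLE_LINE",
    "single_word": "SINGLE_WORD",
    "sparse": "SPARSE_TEXT",
}


def psm_name_for(layout: str) -> str:
    """Return the PSMMode member name for a layout hint (default: AUTO_ONLY)."""
    return PSM_BY_LAYOUT.get(layout, "AUTO_ONLY")


def tesseract_config_for_layout(layout: str, language: str = "eng"):
    """Build a TesseractConfig with the PSM mode chosen for `layout`."""
    from kreuzberg import PSMMode, TesseractConfig  # lazy import: needs Kreuzberg

    return TesseractConfig(language=language, psm=getattr(PSMMode, psm_name_for(layout)))
```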
Forcing OCR¶
By default, Kreuzberg uses OCR only for images and scanned PDFs; for searchable PDFs, it extracts text directly. You can override this behavior with the `force_ocr` parameter in the `ExtractionConfig` object:
This is useful when:
- The PDF contains both searchable text and images with text
- The embedded text in the PDF has encoding or extraction issues
- You want consistent processing across all documents
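The three cases above can be expressed as a simple decision helper. This is only a sketch: `should_force_ocr` and its flag names are hypothetical, and only `ExtractionConfig` and `force_ocr` come from this page.

```python
def should_force_ocr(
    *,
    mixed_text_and_images: bool = False,
    broken_text_layer: bool = False,
    uniform_processing: bool = False,
) -> bool:
    """Return True when any of the listed reasons for forcing OCR applies."""
    return mixed_text_and_images or broken_text_layer or uniform_processing


def forced_ocr_config(**reasons: bool):
    """Build an ExtractionConfig whose force_ocr flag reflects the reasons given."""
    from kreuzberg import ExtractionConfig  # lazy import: needs Kreuzberg installed

    return ExtractionConfig(force_ocr=should_force_ocr(**reasons))
```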
OCR Engine Selection¶
Kreuzberg supports multiple OCR engines:
Tesseract (Default)¶
Tesseract is the default OCR engine and requires no additional installation beyond the system dependency.
EasyOCR (Optional)¶
To use EasyOCR:
- Install with the extra: `pip install "kreuzberg[easyocr]"`
- Use the `ocr_backend` parameter in the `ExtractionConfig` object:
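Because EasyOCR is an optional extra, it can be useful to fall back to Tesseract when the package is not installed. The sketch below assumes the backend string matches the importable package name (`"easyocr"`) and that `"tesseract"` is the identifier for the default engine; both strings and the helper names are assumptions rather than documented values.

```python
import importlib.util


def choose_backend(preferred: str = "easyocr") -> str:
    """Return `preferred` if its Python package is importable, else 'tesseract'."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return "tesseract"


def optional_backend_config(preferred: str = "easyocr"):
    """Build an ExtractionConfig using the preferred backend when available."""
    from kreuzberg import ExtractionConfig  # lazy import: needs Kreuzberg installed

    return ExtractionConfig(ocr_backend=choose_backend(preferred))
```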
PaddleOCR (Optional)¶
To use PaddleOCR:
- Install with the extra: `pip install "kreuzberg[paddleocr]"`
- Use the `ocr_backend` parameter in the `ExtractionConfig` object:
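PaddleOCR uses its own language codes, so code that already works with Tesseract codes needs a translation step. The mapping below is a sketch limited to the PaddleOCR codes listed on this page; treating `chi_sim` as `ch` is an assumption, and the `PaddleOCRConfig` class, the `ocr_config` keyword, and the `"paddleocr"` backend string are assumed names to verify against your installed version.

```python
# Tesseract code -> PaddleOCR code, restricted to the codes this page lists.
TESSERACT_TO_PADDLE = {
    "chi_sim": "ch",  # assumption: the page only says "ch (Chinese)"
    "eng": "en",
    "fra": "french",
    "deu": "german",
    "jpn": "japan",
    "kor": "korean",
}


def paddle_language(tesseract_code: str) -> str:
    """Translate a Tesseract language code into a PaddleOCR one."""
    try:
        return TESSERACT_TO_PADDLE[tesseract_code]
    except KeyError:
        raise ValueError(f"No PaddleOCR code known for {tesseract_code!r}") from None


def paddleocr_config(tesseract_code: str = "eng"):
    """Build an ExtractionConfig that selects PaddleOCR (assumed API names)."""
    from kreuzberg import ExtractionConfig, PaddleOCRConfig  # lazy import

    return ExtractionConfig(
        ocr_backend="paddleocr",
        ocr_config=PaddleOCRConfig(language=paddle_language(tesseract_code)),
    )
```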
Note
For PaddleOCR, the supported language codes are different: `ch` (Chinese), `en` (English), `french`, `german`, `japan`, and `korean`.
Performance Optimization¶
Default Configuration¶
Kreuzberg's defaults are optimized out-of-the-box for modern PDFs and standard documents:
- PSM Mode: `AUTO_ONLY`, which is faster than `AUTO` because it skips orientation detection
- Language Model: Disabled by default for optimal performance on modern documents
- Dictionary Correction: Enabled for accuracy
The default configuration provides excellent extraction quality for:
- Modern PDFs with embedded text
- Scanned documents with clear printing
- Office documents (DOCX, PPTX, XLSX)
- Standard business documents
Speed vs Quality Trade-offs¶
Language Model N-gram Settings¶
The `language_model_ngram_on` parameter controls Tesseract's use of n-gram language models:
- Default (False): Optimized for modern documents with clear text
- When to enable: Historical documents, degraded scans, handwritten text, or noisy images
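The guidance above can be sketched as a small rule that turns the language model on only for difficult input. The document-kind labels and helper names are hypothetical; `language_model_ngram_on` is the parameter described above.

```python
# Document kinds for which the page recommends enabling the n-gram model.
# The label strings are illustrative, not values Kreuzberg knows about.
DIFFICULT_KINDS = {"historical", "degraded_scan", "handwritten", "noisy"}


def needs_language_model(document_kind: str) -> bool:
    """Return True when the n-gram language model is worth its extra cost."""
    return document_kind in DIFFICULT_KINDS


def tesseract_config_for_kind(document_kind: str):
    """Build a TesseractConfig with the language model toggled by document kind."""
    from kreuzberg import TesseractConfig  # lazy import: needs Kreuzberg installed

    return TesseractConfig(language_model_ngram_on=needs_language_model(document_kind))
```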
When to Disable OCR¶
For documents with text layers (searchable PDFs, Office docs), disable OCR entirely:
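A minimal sketch of choosing whether to run OCR at all, assuming you already know whether a given PDF has a text layer. `ocr_backend=None` is the setting this page names for disabling OCR; the helper names, the suffix list, and the `"tesseract"` string are illustrative assumptions.

```python
from pathlib import Path

# Office formats mentioned on this page; they carry their own text.
OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def backend_for(path: str, pdf_has_text_layer: bool = False):
    """Return None (OCR disabled) for text-bearing files, else 'tesseract'."""
    suffix = Path(path).suffix.lower()
    if suffix in OFFICE_SUFFIXES:
        return None
    if suffix == ".pdf" and pdf_has_text_layer:
        return None
    return "tesseract"  # assumed identifier for the default engine


def config_for(path: str, pdf_has_text_layer: bool = False):
    """Build an ExtractionConfig with OCR disabled where it is not needed."""
    from kreuzberg import ExtractionConfig  # lazy import: needs Kreuzberg installed

    return ExtractionConfig(ocr_backend=backend_for(path, pdf_has_text_layer))
```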
Disabling OCR provides a significant speedup: 78% of PDFs have text layers and extract in <0.01s.
Best Practices¶
- Language Selection: Always specify the correct language for your documents to improve OCR accuracy
- PSM Mode Selection: Choose the appropriate PSM mode based on your document layout:
    - Use `PSMMode.AUTO_ONLY` (default) for modern, well-formatted documents
    - Use `PSMMode.SINGLE_BLOCK` for simple layouts with faster processing
    - Use `PSMMode.SPARSE_TEXT` for forms or documents with tables
    - Use `PSMMode.AUTO` only when orientation detection is needed
- Performance Optimization:
    - Disable OCR (`ocr_backend=None`) for documents with text layers
    - Disable language model for clean documents (`language_model_ngram_on=False`)
    - Disable dictionary correction for technical documents
- Image Quality: For best results, ensure images are:
    - High resolution (at least 300 DPI)
    - Well-lit with good contrast
    - Not skewed or rotated (unless using `PSMMode.AUTO`)