Extraction Configuration
Kreuzberg provides extensive configuration options for the extraction process through the ExtractionConfig class. This guide covers common configuration scenarios and examples.
Basic Configuration
All extraction functions accept an optional config parameter of type ExtractionConfig. This object allows you to:
- Control OCR behavior with force_ocr and ocr_backend
- Provide engine-specific OCR configuration via ocr_config
- Add validation and post-processing hooks
- Configure custom extractors
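As a rough sketch of how these options fit together, the snippet below builds a config that forces OCR through Tesseract. The field names force_ocr, ocr_backend, and ocr_config come from this guide. The TesseractConfig class, the "tesseract" backend string, the extract_file coroutine, and the content attribute on its result are assumptions about the installed Kreuzberg version, so check them against your release before relying on them.

```python
import asyncio


def build_ocr_config():
    """Build an ExtractionConfig that forces OCR (sketch; some names assumed)."""
    # Imported lazily so this module still loads where Kreuzberg is absent.
    # TesseractConfig and PSMMode as top-level exports are assumptions.
    from kreuzberg import ExtractionConfig, PSMMode, TesseractConfig

    return ExtractionConfig(
        force_ocr=True,  # run OCR even if the file already has a text layer
        ocr_backend="tesseract",  # assumed backend identifier
        ocr_config=TesseractConfig(language="eng+deu", psm=PSMMode.SINGLE_BLOCK),
    )


async def extract_text(path: str) -> str:
    """Extract text from `path` using the config built above."""
    from kreuzberg import extract_file  # assumed async entry point

    result = await extract_file(path, config=build_ocr_config())
    return result.content  # assumed attribute holding the extracted text


def main(path: str) -> str:
    """Synchronous wrapper; needs Kreuzberg and Tesseract installed."""
    return asyncio.run(extract_text(path))
```

Calling main("scanned.pdf") would run the extraction for a hypothetical file of that name.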
Examples
Basic Usage
OCR Configuration
The language parameter specifies which language model Tesseract should use. You can specify multiple languages by joining them with a plus sign (e.g., "eng+deu" for English and German).
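If the language list comes from user input or application settings, it can help to build the plus-joined string in one place. The helper below is a hypothetical convenience, not part of Kreuzberg. It only checks that each code is non-empty and free of separators, and does not verify that the matching Tesseract language data is installed.

```python
def join_languages(codes):
    """Join Tesseract language codes into a 'eng+deu' style string."""
    unique = []
    for code in codes:
        code = code.strip()
        # Reject values that would corrupt the plus-separated format.
        if not code or "+" in code or " " in code:
            raise ValueError(f"invalid language code: {code!r}")
        # Drop duplicates while keeping the caller's order.
        if code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)
```

The result can be passed as the language value, for example join_languages(["eng", "deu"]).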
The psm (Page Segmentation Mode) parameter controls how Tesseract analyzes page layout. Different modes are suitable for different types of documents:
- PSMMode.AUTO: Automatic page segmentation (default)
- PSMMode.SINGLE_BLOCK: Treat the image as a single text block
- PSMMode.SINGLE_LINE: Treat the image as a single text line
- PSMMode.SINGLE_WORD: Treat the image as a single word
- PSMMode.SINGLE_CHAR: Treat the image as a single character
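For orientation, the modes listed above correspond to numeric values of Tesseract's own --psm command-line option. The sketch below uses a stand-in enum named Psm (hypothetical, defined here rather than imported from Kreuzberg) to show the equivalent tesseract invocation. It illustrates what the settings mean at the Tesseract level and makes no claim about how Kreuzberg calls Tesseract internally.

```python
from enum import IntEnum


class Psm(IntEnum):
    """Stand-in for the modes above, using Tesseract's --psm numbers."""
    AUTO = 3           # fully automatic page segmentation (Tesseract default)
    SINGLE_BLOCK = 6   # a single uniform block of text
    SINGLE_LINE = 7    # a single text line
    SINGLE_WORD = 8    # a single word
    SINGLE_CHAR = 10   # a single character


def tesseract_args(image_path, language="eng", psm=Psm.AUTO):
    """Return the tesseract command line for the given settings."""
    # "stdout" as the output base makes tesseract print the text.
    return ["tesseract", image_path, "stdout", "-l", language, "--psm", str(int(psm))]
```

The returned list can be handed to subprocess.run when experimenting with modes directly on a sample image.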
Alternative OCR Engines
Batch Processing
Synchronous API
Using Custom Extractors
You can register custom extractors to handle specific file formats.
See the Custom Extractors guide for more details on creating and registering custom extractors.
OCR Best Practices
When configuring OCR for your documents, consider these best practices:
- Language Selection: Choose the appropriate language model for your documents. Using the wrong language model can significantly reduce OCR accuracy.
- Page Segmentation Mode: Select the appropriate PSM based on your document layout:
  - Use PSMMode.AUTO for general documents with mixed content
  - Use PSMMode.SINGLE_BLOCK for documents with a single column of text
  - Use PSMMode.SINGLE_LINE for receipts or single-line text
  - Use PSMMode.SINGLE_WORD or PSMMode.SINGLE_CHAR for specialized cases
- OCR Engine Selection: Choose the appropriate OCR engine based on your needs:
  - Tesseract: Good general-purpose OCR with support for many languages
  - EasyOCR: Better for some non-Latin scripts and natural scene text
  - PaddleOCR: Excellent for Chinese and other Asian languages
- Preprocessing: For better OCR results, consider using validation and post-processing hooks to clean up the extracted text.
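As an illustration of the kind of cleanup a post-processing hook might perform, the function below normalizes common OCR noise in plain text. It is a minimal sketch that works on a string only. How it is wrapped into a hook depends on the hook signature in your Kreuzberg version, which this guide does not show, so that wiring is left out.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Tidy whitespace and line-break hyphenation in OCR output."""
    # Rejoin words split across a line break ("recog-\nnition").
    # Caveat: this also merges genuine hyphenated compounds that
    # happen to break at the end of a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove spaces on either side of a line break.
    text = re.sub(r" *\n *", "\n", text)
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

A hook built around it would typically read the extracted text from the result, pass it through clean_ocr_text, and return a result carrying the cleaned text.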