Types¶
Core data structures for extraction results, configuration, and metadata.
ExtractionResult¶
The result of a file extraction, containing the extracted text, MIME type, and metadata:
kreuzberg.ExtractionResult
dataclass
¶
The result of a file extraction.
Source code in kreuzberg/_types.py
Attributes¶
chunks: list[str]
instance-attribute
¶
The extracted content chunks. This is an empty list if 'chunk_content' is not set to True in the ExtractionConfig.
content: str
instance-attribute
¶
The extracted content.
metadata: Metadata
instance-attribute
¶
The metadata of the content.
mime_type: str
instance-attribute
¶
The MIME type of the extracted content, either text/plain or text/markdown.
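As a minimal sketch of how the four attributes above might be consumed, the following uses a hypothetical stand-in dataclass (`ResultSketch`) that mirrors the documented fields rather than importing the library. The `describe` helper and the `"title"` lookup are illustrative assumptions, not part of kreuzberg.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ResultSketch:
    """Hypothetical stand-in mirroring the documented ExtractionResult fields."""

    content: str
    mime_type: str
    metadata: dict[str, Any] = field(default_factory=dict)
    chunks: list[str] = field(default_factory=list)


def describe(result: ResultSketch) -> str:
    """Summarize a result without assuming any metadata key is present."""
    title = result.metadata.get("title", "<untitled>")
    kind = "Markdown" if result.mime_type == "text/markdown" else "plain text"
    return f"{title}: {len(result.content)} chars of {kind}, {len(result.chunks)} chunks"


example = ResultSketch(content="# Report\n\nHello.", mime_type="text/markdown")
summary = describe(example)
```

Because `chunks` is empty unless chunking is enabled, code that iterates over it should be prepared for an empty list.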
ExtractionConfig¶
Configuration options for extraction functions:
kreuzberg.ExtractionConfig
dataclass
¶
Represents configuration settings for an extraction process.
This class encapsulates the configuration options for extracting text from images or documents using Optical Character Recognition (OCR). It provides options to customize the OCR behavior, select the backend engine, and configure engine-specific parameters.
Source code in kreuzberg/_types.py
Attributes¶
chunk_content: bool = False
class-attribute
instance-attribute
¶
Whether to split the extracted content into smaller chunks.
force_ocr: bool = False
class-attribute
instance-attribute
¶
Whether to force OCR, even when text could be extracted without it.
max_chars: int = DEFAULT_MAX_CHARACTERS
class-attribute
instance-attribute
¶
The maximum size of each chunk in characters.
max_overlap: int = DEFAULT_MAX_OVERLAP
class-attribute
instance-attribute
¶
The overlap between chunks in characters.
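To illustrate how a chunk size and an overlap interact, here is a naive sliding-window splitter. It is only a sketch of the idea: the library's actual splitter may respect word or sentence boundaries, and `chunk_text` is a hypothetical helper. The values of DEFAULT_MAX_CHARACTERS and DEFAULT_MAX_OVERLAP are not shown on this page, so the example passes explicit numbers.

```python
def chunk_text(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars, sharing max_overlap characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each new window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reached the end of the text
        start += step
    return chunks


windows = chunk_text("abcdefghij", max_chars=4, max_overlap=1)
```

Each window after the first repeats the last `max_overlap` characters of the previous one, which helps downstream consumers keep context across chunk boundaries.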
ocr_backend: OcrBackendType | None = 'tesseract'
class-attribute
instance-attribute
¶
The OCR backend to use.
ocr_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None
class-attribute
instance-attribute
¶
Configuration to pass to the OCR backend.
post_processing_hooks: list[PostProcessingHook] | None = None
class-attribute
instance-attribute
¶
Post processing hooks to call after processing is done and before the final result is returned.
validators: list[ValidationHook] | None = None
class-attribute
instance-attribute
¶
Validation hooks to call after processing is done and before post-processing and result return.
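The ordering described above (validators first, then post-processing hooks, then the result is returned) can be sketched as follows. The hook signatures are assumptions made for illustration: a validator takes the result and raises on failure, and a post-processing hook takes the result and returns a possibly modified one. `ResultSketch` and `finalize` are hypothetical names, not library API.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class ResultSketch:
    """Hypothetical stand-in for an extraction result."""

    content: str
    mime_type: str = "text/plain"


Validator = Callable[[ResultSketch], None]              # assumed shape
PostProcessor = Callable[[ResultSketch], ResultSketch]  # assumed shape


def finalize(
    result: ResultSketch,
    validators: list[Validator] | None = None,
    post_processing_hooks: list[PostProcessor] | None = None,
) -> ResultSketch:
    """Run validators first, then post-processing hooks, in list order."""
    for validate in validators or []:
        validate(result)  # expected to raise if the result is unacceptable
    for hook in post_processing_hooks or []:
        result = hook(result)
    return result


def require_content(result: ResultSketch) -> None:
    if not result.content.strip():
        raise ValueError("extraction produced no text")


def strip_whitespace(result: ResultSketch) -> ResultSketch:
    return replace(result, content=result.content.strip())


final = finalize(ResultSketch(content="  hello  "), [require_content], [strip_whitespace])
```

Running validators before hooks means a validator sees the raw extraction output, not the post-processed text.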
Functions¶
get_config_dict() -> dict[str, Any]
¶
Returns the OCR configuration as a dict, based on the backend specified.
RETURNS | DESCRIPTION |
---|---|
dict[str, Any] | A dict of the OCR configuration or an empty dict if no backend is provided. |
Source code in kreuzberg/_types.py
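One plausible reading of this contract, sketched with hypothetical stand-in classes: return an empty dict when no backend is set, otherwise convert the backend configuration to a dict. Falling back to a default configuration when `ocr_config` is None is an assumption of this sketch, not something this page states.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class TesseractConfigSketch:
    """Hypothetical, heavily reduced stand-in for an OCR backend config."""

    language: str = "eng"
    psm: int = 3


@dataclass
class ConfigSketch:
    """Hypothetical stand-in with only the two OCR-related fields."""

    ocr_backend: str | None = "tesseract"
    ocr_config: TesseractConfigSketch | None = None

    def get_config_dict(self) -> dict[str, Any]:
        if self.ocr_backend is None:
            return {}  # documented: empty dict if no backend is provided
        if self.ocr_config is not None:
            return asdict(self.ocr_config)
        # Assumption: fall back to the backend's defaults when no config is given.
        return asdict(TesseractConfigSketch())


defaults = ConfigSketch().get_config_dict()
disabled = ConfigSketch(ocr_backend=None).get_config_dict()
```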
OCR Configuration¶
TesseractConfig¶
kreuzberg.TesseractConfig
dataclass
¶
Configuration options for Tesseract OCR engine.
Source code in kreuzberg/_ocr/_tesseract.py
Attributes¶
classify_use_pre_adapted_templates: bool = True
class-attribute
instance-attribute
¶
Whether to use pre-adapted templates during classification to improve recognition accuracy.
language: str = 'eng'
class-attribute
instance-attribute
¶
Language code to use for OCR. Examples:
- 'eng' for English
- 'deu' for German
- multiple languages combined with '+', e.g. 'eng+deu'
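When the languages are held in a list, the combined form can be built with a small helper. `combine_languages` is a hypothetical convenience function, not part of the library.

```python
def combine_languages(codes: list[str]) -> str:
    """Join Tesseract language codes with '+', skipping blank entries."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


language = combine_languages(["eng", "deu"])
```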
language_model_ngram_on: bool = True
class-attribute
instance-attribute
¶
Enable or disable the use of n-gram-based language models for improved text recognition.
psm: PSMMode = PSMMode.AUTO
class-attribute
instance-attribute
¶
Page segmentation mode (PSM) to guide Tesseract on how to segment the image (e.g., single block, single line).
tessedit_dont_blkrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents block rejection of words identified as good, improving text output quality.
tessedit_dont_rowrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents row rejection of words identified as good, avoiding unnecessary omissions.
tessedit_enable_dict_correction: bool = True
class-attribute
instance-attribute
¶
Enable or disable dictionary-based correction for recognized text to improve word accuracy.
tessedit_use_primary_params_model: bool = True
class-attribute
instance-attribute
¶
If True, forces the use of the primary parameters model for text recognition.
textord_space_size_is_variable: bool = True
class-attribute
instance-attribute
¶
Allow variable spacing between words, useful for text with irregular spacing.
thresholding_method: bool = True
class-attribute
instance-attribute
¶
Enable or disable specific thresholding methods during image preprocessing for better OCR accuracy.
EasyOCRConfig¶
kreuzberg.EasyOCRConfig
dataclass
¶
Configuration options for EasyOCR.
Source code in kreuzberg/_ocr/_easyocr.py
Attributes¶
add_margin: float = 0.1
class-attribute
instance-attribute
¶
Extend bounding boxes in all directions.
adjust_contrast: float = 0.5
class-attribute
instance-attribute
¶
Target contrast level for low contrast text.
beam_width: int = 5
class-attribute
instance-attribute
¶
Beam width for beam search in recognition.
canvas_size: int = 2560
class-attribute
instance-attribute
¶
Maximum image dimension for detection.
contrast_ths: float = 0.1
class-attribute
instance-attribute
¶
Contrast threshold for preprocessing.
decoder: Literal['greedy', 'beamsearch', 'wordbeamsearch'] = 'greedy'
class-attribute
instance-attribute
¶
Decoder method. Options: 'greedy', 'beamsearch', 'wordbeamsearch'.
height_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum difference in box height for merging.
language: str | list[str] = 'en'
class-attribute
instance-attribute
¶
Language or languages to use for OCR.
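Since the field accepts either a single string or a list of strings, code that consumes it may want to normalize it to a list first. The helper below is a hypothetical sketch of that normalization, not library code.

```python
from __future__ import annotations


def normalize_languages(language: str | list[str]) -> list[str]:
    """Return the language setting as a list, whichever form was given."""
    if isinstance(language, str):
        return [language]
    return list(language)  # copy so the caller's list is not shared


single = normalize_languages("en")
several = normalize_languages(["en", "de"])
```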
link_threshold: float = 0.4
class-attribute
instance-attribute
¶
Link confidence threshold.
low_text: float = 0.4
class-attribute
instance-attribute
¶
Text low-bound score.
mag_ratio: float = 1.0
class-attribute
instance-attribute
¶
Image magnification ratio.
min_size: int = 10
class-attribute
instance-attribute
¶
Minimum text box size in pixels.
rotation_info: list[int] | None = None
class-attribute
instance-attribute
¶
List of angles to try for detection.
slope_ths: float = 0.1
class-attribute
instance-attribute
¶
Maximum slope for merging text boxes.
text_threshold: float = 0.7
class-attribute
instance-attribute
¶
Text confidence threshold.
use_gpu: bool = False
class-attribute
instance-attribute
¶
Whether to use GPU for inference.
width_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum horizontal distance for merging boxes.
x_ths: float = 1.0
class-attribute
instance-attribute
¶
Maximum horizontal distance for paragraph merging.
y_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum vertical distance for paragraph merging.
ycenter_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum shift in y direction for merging.
PaddleOCRConfig¶
kreuzberg.PaddleOCRConfig
dataclass
¶
Configuration options for PaddleOCR.
This dataclass provides type hints and documentation for all PaddleOCR parameters.
Source code in kreuzberg/_ocr/_paddleocr.py
Attributes¶
cls_image_shape: str = '3,48,192'
class-attribute
instance-attribute
¶
Image shape for classification algorithm in format 'channels,height,width'.
det_algorithm: Literal['DB', 'EAST', 'SAST', 'PSE', 'FCE', 'PAN', 'CT', 'DB++', 'Layout'] = 'DB'
class-attribute
instance-attribute
¶
Detection algorithm.
det_db_box_thresh: float = 0.5
class-attribute
instance-attribute
¶
Score threshold for detected boxes. Boxes below this value are discarded.
det_db_thresh: float = 0.3
class-attribute
instance-attribute
¶
Binarization threshold for DB output map.
det_db_unclip_ratio: float = 2.0
class-attribute
instance-attribute
¶
Expansion ratio for detected text boxes.
det_east_cover_thresh: float = 0.1
class-attribute
instance-attribute
¶
Score threshold for EAST output boxes.
det_east_nms_thresh: float = 0.2
class-attribute
instance-attribute
¶
NMS threshold for EAST model output boxes.
det_east_score_thresh: float = 0.8
class-attribute
instance-attribute
¶
Binarization threshold for EAST output map.
det_max_side_len: int = 960
class-attribute
instance-attribute
¶
Maximum size of image long side. Images exceeding this will be proportionally resized.
drop_score: float = 0.5
class-attribute
instance-attribute
¶
Filter recognition results by confidence score. Results below this are discarded.
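The effect of this threshold can be sketched on plain (text, score) pairs. The data shape and the `filter_by_score` helper are assumptions for illustration; the backend's real result objects may look different.

```python
def filter_by_score(
    results: list[tuple[str, float]], drop_score: float = 0.5
) -> list[tuple[str, float]]:
    """Keep results whose confidence is at least drop_score."""
    return [(text, score) for text, score in results if score >= drop_score]


recognized = [("Invoice", 0.98), ("t0tal", 0.42), ("42.00", 0.5)]
kept = filter_by_score(recognized, drop_score=0.5)
```

A result scoring exactly `drop_score` is kept here, since only results below the threshold are described as discarded.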
enable_mkldnn: bool = False
class-attribute
instance-attribute
¶
Whether to enable MKL-DNN acceleration (Intel CPU only).
gpu_mem: int = 8000
class-attribute
instance-attribute
¶
GPU memory size (in MB) to use for initialization.
language: str = 'en'
class-attribute
instance-attribute
¶
Language to use for OCR.
max_text_length: int = 25
class-attribute
instance-attribute
¶
Maximum text length that the recognition algorithm can recognize.
rec: bool = True
class-attribute
instance-attribute
¶
Enable text recognition when using the ocr() function.
rec_algorithm: Literal['CRNN', 'SRN', 'NRTR', 'SAR', 'SEED', 'SVTR', 'SVTR_LCNet', 'ViTSTR', 'ABINet', 'VisionLAN', 'SPIN', 'RobustScanner', 'RFL'] = 'CRNN'
class-attribute
instance-attribute
¶
Recognition algorithm.
rec_image_shape: str = '3,32,320'
class-attribute
instance-attribute
¶
Image shape for recognition algorithm in format 'channels,height,width'.
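Both `cls_image_shape` and `rec_image_shape` use the same 'channels,height,width' string format. A hypothetical helper for validating and unpacking such a string might look like this:

```python
def parse_image_shape(shape: str) -> tuple[int, int, int]:
    """Parse a 'channels,height,width' string into a tuple of positive ints."""
    parts = [part.strip() for part in shape.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 'channels,height,width', got {shape!r}")
    channels, height, width = (int(part) for part in parts)
    if min(channels, height, width) <= 0:
        raise ValueError("all dimensions must be positive")
    return channels, height, width


rec_shape = parse_image_shape("3,32,320")
```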
table: bool = True
class-attribute
instance-attribute
¶
Whether to enable table recognition.
use_angle_cls: bool = True
class-attribute
instance-attribute
¶
Whether to use text orientation classification model.
use_gpu: bool = False
class-attribute
instance-attribute
¶
Whether to use GPU for inference. Requires installing the paddlepaddle-gpu package.
use_space_char: bool = True
class-attribute
instance-attribute
¶
Whether to recognize spaces.
use_zero_copy_run: bool = False
class-attribute
instance-attribute
¶
Whether to enable zero_copy_run for inference optimization.
PSMMode (Page Segmentation Mode)¶
kreuzberg.PSMMode
¶
Bases: Enum
Enum for Tesseract Page Segmentation Modes (PSM) with human-readable values.
Source code in kreuzberg/_ocr/_tesseract.py
Attributes¶
AUTO = 3
class-attribute
instance-attribute
¶
Fully automatic page segmentation (default).
AUTO_ONLY = 2
class-attribute
instance-attribute
¶
Automatic page segmentation without OSD.
AUTO_OSD = 1
class-attribute
instance-attribute
¶
Automatic page segmentation with orientation and script detection.
CIRCLE_WORD = 9
class-attribute
instance-attribute
¶
Treat the image as a single word in a circle.
OSD_ONLY = 0
class-attribute
instance-attribute
¶
Orientation and script detection only.
SINGLE_BLOCK = 6
class-attribute
instance-attribute
¶
Assume a single uniform block of text.
SINGLE_BLOCK_VERTICAL = 5
class-attribute
instance-attribute
¶
Assume a single uniform block of vertically aligned text.
SINGLE_CHAR = 10
class-attribute
instance-attribute
¶
Treat the image as a single character.
SINGLE_COLUMN = 4
class-attribute
instance-attribute
¶
Assume a single column of text.
SINGLE_LINE = 7
class-attribute
instance-attribute
¶
Treat the image as a single text line.
SINGLE_WORD = 8
class-attribute
instance-attribute
¶
Treat the image as a single word.
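The numeric values above correspond to the numbers accepted by Tesseract's `--psm` command-line option. As a sketch, the stand-in enum below mirrors the documented members, and a hypothetical `psm_args` helper turns a mode into command-line arguments; how the library itself passes the mode to Tesseract is not shown on this page.

```python
from enum import Enum


class PSMModeSketch(Enum):
    """Stand-in mirroring the documented PSMMode members, in numeric order."""

    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERTICAL = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10


def psm_args(mode: PSMModeSketch) -> list[str]:
    """Build the Tesseract CLI arguments that select a page segmentation mode."""
    return ["--psm", str(mode.value)]


args = psm_args(PSMModeSketch.SINGLE_LINE)
```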
Metadata¶
A TypedDict that contains optional metadata fields extracted from documents:
kreuzberg.Metadata
¶
Bases: TypedDict
Base metadata common to all document types.
All fields will only be included if they contain non-empty values. Any field that would be empty or None is omitted from the dictionary.
Source code in kreuzberg/_types.py
Attributes¶
authors: NotRequired[list[str]]
instance-attribute
¶
List of document authors.
categories: NotRequired[list[str]]
instance-attribute
¶
Categories or classifications.
citations: NotRequired[list[str]]
instance-attribute
¶
Citation identifiers.
comments: NotRequired[str]
instance-attribute
¶
General comments.
copyright: NotRequired[str]
instance-attribute
¶
Copyright information.
created_at: NotRequired[str]
instance-attribute
¶
Creation timestamp in ISO format.
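Timestamps such as `created_at` and `modified_at` arrive as strings, so callers who need date arithmetic have to parse them. A minimal sketch, assuming the value is in a form that Python's `datetime.fromisoformat` accepts; `parse_timestamp` is a hypothetical helper.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any


def parse_timestamp(metadata: dict[str, Any], key: str) -> datetime | None:
    """Return the parsed timestamp, or None if the key is absent or unparseable."""
    raw = metadata.get(key)
    if raw is None:
        return None
    try:
        return datetime.fromisoformat(raw)
    except ValueError:
        return None


created = parse_timestamp({"created_at": "2024-01-15T10:30:00"}, "created_at")
```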
created_by: NotRequired[str]
instance-attribute
¶
Document creator.
description: NotRequired[str]
instance-attribute
¶
Document description.
fonts: NotRequired[list[str]]
instance-attribute
¶
List of fonts used in the document.
height: NotRequired[int]
instance-attribute
¶
Height of the document page/slide/image, if applicable.
identifier: NotRequired[str]
instance-attribute
¶
Unique document identifier.
keywords: NotRequired[list[str]]
instance-attribute
¶
Keywords or tags.
languages: NotRequired[list[str]]
instance-attribute
¶
Document language codes.
license: NotRequired[str]
instance-attribute
¶
License information.
modified_at: NotRequired[str]
instance-attribute
¶
Last modification timestamp in ISO format.
modified_by: NotRequired[str]
instance-attribute
¶
Username of last modifier.
organization: NotRequired[str | list[str]]
instance-attribute
¶
Organizational affiliation.
publisher: NotRequired[str]
instance-attribute
¶
Publisher or organization name.
references: NotRequired[list[str]]
instance-attribute
¶
Reference entries.
status: NotRequired[str]
instance-attribute
¶
Document status (e.g., draft, final).
subject: NotRequired[str]
instance-attribute
¶
Document subject or topic.
subtitle: NotRequired[str]
instance-attribute
¶
Document subtitle.
summary: NotRequired[str]
instance-attribute
¶
Document summary.
title: NotRequired[str]
instance-attribute
¶
Document title.
version: NotRequired[str]
instance-attribute
¶
Version identifier or revision number.
width: NotRequired[int]
instance-attribute
¶
Width of the document page/slide/image, if applicable.
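Because empty fields are omitted rather than set to None, every key should be treated as optional when reading. The sketch below shows the omission rule on a plain dict and the matching read pattern; `compact_metadata` is a hypothetical helper that imitates the documented behavior, not the library's implementation.

```python
from __future__ import annotations

from typing import Any


def compact_metadata(raw: dict[str, Any]) -> dict[str, Any]:
    """Drop entries whose value is None, an empty string, or an empty list."""
    return {
        key: value
        for key, value in raw.items()
        if value is not None and value != "" and value != []
    }


metadata = compact_metadata(
    {"title": "Q3 Report", "authors": [], "subject": "", "width": 800, "comments": None}
)

# Read optional keys with .get() and a fallback instead of indexing directly.
subject = metadata.get("subject", "n/a")
```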