Types¶
Core data structures for extraction results, configuration, and metadata.
ExtractionResult¶
The result of a file extraction, containing the extracted text, MIME type, metadata, and table data:
kreuzberg.ExtractionResult
dataclass
¶
The result of a file extraction.
Source code in kreuzberg/_types.py
Attributes¶
chunks: list[str] = field(default_factory=list)
class-attribute
instance-attribute
¶
The extracted content chunks. This is an empty list if 'chunk_content' is not set to True in the ExtractionConfig.
content: str
instance-attribute
¶
The extracted content.
metadata: Metadata
instance-attribute
¶
The metadata of the content.
mime_type: str
instance-attribute
¶
The MIME type of the extracted content, either text/plain or text/markdown.
tables: list[TableData] = field(default_factory=list)
class-attribute
instance-attribute
¶
Extracted tables. This is an empty list if 'extract_tables' is not set to True in the ExtractionConfig.
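A minimal sketch of how the documented fields fit together. To stay runnable without the library installed, `ResultSketch` is a hypothetical stand-in that mirrors the attribute names and defaults listed above; it is not the library's class, and `metadata` and `tables` are simplified to plain dicts.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ResultSketch:
    """Hypothetical stand-in mirroring the documented ExtractionResult fields."""

    content: str
    mime_type: str
    metadata: dict[str, Any]
    # Both lists default to empty, matching the documented defaults.
    chunks: list[str] = field(default_factory=list)
    tables: list[dict[str, Any]] = field(default_factory=list)


def summarize(result: ResultSketch) -> str:
    """Build a one-line summary, skipping chunks/tables when they are empty."""
    parts = [f"{result.mime_type}, {len(result.content)} chars"]
    if result.chunks:
        parts.append(f"{len(result.chunks)} chunks")
    if result.tables:
        parts.append(f"{len(result.tables)} tables")
    return "; ".join(parts)


result = ResultSketch(content="Hello world", mime_type="text/plain", metadata={})
summary = summarize(result)
print(summary)
```

Because `chunks` and `tables` are empty lists rather than None when the corresponding option is off, callers can iterate over them without a None check.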
ExtractionConfig¶
Configuration options for extraction functions:
kreuzberg.ExtractionConfig
dataclass
¶
Represents configuration settings for an extraction process.
This class encapsulates the configuration options for extracting text from images or documents using Optical Character Recognition (OCR). It provides options to customize the OCR behavior, select the backend engine, and configure engine-specific parameters.
Source code in kreuzberg/_types.py
Attributes¶
chunk_content: bool = False
class-attribute
instance-attribute
¶
Whether to chunk the content into smaller chunks.
extract_tables: bool = False
class-attribute
instance-attribute
¶
Whether to extract tables from the content. This requires the 'gmft' dependency.
force_ocr: bool = False
class-attribute
instance-attribute
¶
Whether to force OCR.
gmft_config: GMFTConfig | None = None
class-attribute
instance-attribute
¶
GMFT configuration.
max_chars: int = DEFAULT_MAX_CHARACTERS
class-attribute
instance-attribute
¶
The size of each chunk in characters.
max_overlap: int = DEFAULT_MAX_OVERLAP
class-attribute
instance-attribute
¶
The overlap between chunks in characters.
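To illustrate how `max_chars` and `max_overlap` relate, the sketch below uses a plain sliding window over characters. This is an assumption made for illustration only: `sliding_chunks` is a hypothetical helper, the library's actual chunking strategy may differ, and the values of DEFAULT_MAX_CHARACTERS and DEFAULT_MAX_OVERLAP are not shown here, so the example passes explicit numbers.

```python
from __future__ import annotations


def sliding_chunks(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of max_chars that share max_overlap characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each new window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = sliding_chunks("abcdefghij", max_chars=4, max_overlap=1)
print(chunks)
```

Each chunk after the first repeats the last `max_overlap` characters of the previous one, which is the role the overlap setting plays in this simplified model.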
ocr_backend: OcrBackendType | None = 'tesseract'
class-attribute
instance-attribute
¶
The OCR backend to use.
Notes
- If set to None, OCR will not be performed.
ocr_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None
class-attribute
instance-attribute
¶
Configuration to pass to the OCR backend.
post_processing_hooks: list[PostProcessingHook] | None = None
class-attribute
instance-attribute
¶
Post processing hooks to call after processing is done and before the final result is returned.
validators: list[ValidationHook] | None = None
class-attribute
instance-attribute
¶
Validation hooks to call after processing is done, before the post-processing hooks run and the result is returned.
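The ordering described for `validators` and `post_processing_hooks` can be sketched as follows. The hook signatures are assumptions, not taken from the library: here a validator receives the result and raises to reject it, and a post-processing hook receives the result and returns a (possibly modified) one. The result is simplified to a plain string, and `finalize` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Callable

Validator = Callable[[str], None]  # assumed shape: raises to reject a result
PostHook = Callable[[str], str]    # assumed shape: returns the next result


def finalize(
    content: str,
    validators: list[Validator] | None = None,
    post_processing_hooks: list[PostHook] | None = None,
) -> str:
    """Run validators first, then post-processing hooks, then return."""
    for validate in validators or []:
        validate(content)  # sees the result before any post-processing
    for hook in post_processing_hooks or []:
        content = hook(content)
    return content


def require_non_empty(content: str) -> None:
    """Example validator: reject results that contain only whitespace."""
    if not content.strip():
        raise ValueError("extraction produced no text")


final = finalize(
    "  Hello  ",
    validators=[require_non_empty],
    post_processing_hooks=[str.strip, str.upper],
)
print(final)
```

Treating None as "no hooks" mirrors the documented default of None for both attributes.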
Functions¶
get_config_dict() -> dict[str, Any]
¶
Returns the OCR configuration object based on the backend specified.
RETURNS | DESCRIPTION |
---|---|
dict[str, Any] | A dict of the OCR configuration or an empty dict if no backend is provided. |
Source code in kreuzberg/_types.py
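A rough sketch of the behavior described for `get_config_dict()`, using hypothetical stand-in classes (`TesseractSketch`, `ConfigSketch`) with a single field each so that it runs without the library. Falling back to the backend's default configuration when `ocr_config` is None is an assumption of this sketch; only the empty-dict case for a missing backend is stated above.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class TesseractSketch:
    """Hypothetical stand-in with one of the documented Tesseract options."""

    language: str = "eng"


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for the OCR-related part of ExtractionConfig."""

    ocr_backend: str | None = "tesseract"
    ocr_config: TesseractSketch | None = None

    def get_config_dict(self) -> dict[str, Any]:
        if self.ocr_backend is None:
            return {}  # documented: empty dict when no backend is provided
        # Assumption: use backend defaults when no explicit config is given.
        config = self.ocr_config if self.ocr_config is not None else TesseractSketch()
        return asdict(config)


default_dict = ConfigSketch().get_config_dict()
disabled_dict = ConfigSketch(ocr_backend=None).get_config_dict()
custom_dict = ConfigSketch(ocr_config=TesseractSketch(language="deu")).get_config_dict()
print(default_dict, disabled_dict, custom_dict)
```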
TableData¶
A TypedDict that contains data extracted from tables in documents:
kreuzberg.TableData
¶
Bases: TypedDict
Table data, returned from table extraction.
Source code in kreuzberg/_types.py
OCR Configuration¶
TesseractConfig¶
kreuzberg.TesseractConfig
dataclass
¶
Configuration options for Tesseract OCR engine.
Source code in kreuzberg/_ocr/_tesseract.py
Attributes¶
classify_use_pre_adapted_templates: bool = True
class-attribute
instance-attribute
¶
Whether to use pre-adapted templates during classification to improve recognition accuracy.
language: str = 'eng'
class-attribute
instance-attribute
¶
Language code to use for OCR. Examples:
- 'eng' for English
- 'deu' for German
- multiple languages combined with '+', e.g. 'eng+deu'
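Since several languages are passed as one '+'-joined string, a small helper can build that value from a list. `combine_languages` is a hypothetical convenience function for this illustration, not part of the library.

```python
from __future__ import annotations


def combine_languages(codes: list[str]) -> str:
    """Join language codes with '+', dropping blanks and duplicates in order."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    # dict.fromkeys keeps the first occurrence of each code, preserving order.
    return "+".join(dict.fromkeys(cleaned))


language = combine_languages(["eng", "deu", "eng"])
print(language)
```

The resulting string is what would be assigned to the `language` attribute.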
language_model_ngram_on: bool = True
class-attribute
instance-attribute
¶
Enable or disable the use of n-gram-based language models for improved text recognition.
psm: PSMMode = PSMMode.AUTO
class-attribute
instance-attribute
¶
Page segmentation mode (PSM) to guide Tesseract on how to segment the image (e.g., single block, single line).
tessedit_dont_blkrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents block rejection of words identified as good, improving text output quality.
tessedit_dont_rowrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents row rejection of words identified as good, avoiding unnecessary omissions.
tessedit_enable_dict_correction: bool = True
class-attribute
instance-attribute
¶
Enable or disable dictionary-based correction for recognized text to improve word accuracy.
tessedit_use_primary_params_model: bool = True
class-attribute
instance-attribute
¶
If True, forces the use of the primary parameters model for text recognition.
textord_space_size_is_variable: bool = True
class-attribute
instance-attribute
¶
Allow variable spacing between words, useful for text with irregular spacing.
thresholding_method: bool = False
class-attribute
instance-attribute
¶
Enable or disable specific thresholding methods during image preprocessing for better OCR accuracy.
EasyOCRConfig¶
kreuzberg.EasyOCRConfig
dataclass
¶
Configuration options for EasyOCR.
Source code in kreuzberg/_ocr/_easyocr.py
Attributes¶
add_margin: float = 0.1
class-attribute
instance-attribute
¶
Extend bounding boxes in all directions.
adjust_contrast: float = 0.5
class-attribute
instance-attribute
¶
Target contrast level for low contrast text.
beam_width: int = 5
class-attribute
instance-attribute
¶
Beam width for beam search in recognition.
canvas_size: int = 2560
class-attribute
instance-attribute
¶
Maximum image dimension for detection.
contrast_ths: float = 0.1
class-attribute
instance-attribute
¶
Contrast threshold for preprocessing.
decoder: Literal['greedy', 'beamsearch', 'wordbeamsearch'] = 'greedy'
class-attribute
instance-attribute
¶
Decoder method. Options: 'greedy', 'beamsearch', 'wordbeamsearch'.
height_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum difference in box height for merging.
language: str | list[str] = 'en'
class-attribute
instance-attribute
¶
Language or languages to use for OCR.
link_threshold: float = 0.4
class-attribute
instance-attribute
¶
Link confidence threshold.
low_text: float = 0.4
class-attribute
instance-attribute
¶
Text low-bound score.
mag_ratio: float = 1.0
class-attribute
instance-attribute
¶
Image magnification ratio.
min_size: int = 10
class-attribute
instance-attribute
¶
Minimum text box size in pixels.
rotation_info: list[int] | None = None
class-attribute
instance-attribute
¶
List of angles to try for detection.
slope_ths: float = 0.1
class-attribute
instance-attribute
¶
Maximum slope for merging text boxes.
text_threshold: float = 0.7
class-attribute
instance-attribute
¶
Text confidence threshold.
use_gpu: bool = False
class-attribute
instance-attribute
¶
Whether to use GPU for inference.
width_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum horizontal distance for merging boxes.
x_ths: float = 1.0
class-attribute
instance-attribute
¶
Maximum horizontal distance for paragraph merging.
y_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum vertical distance for paragraph merging.
ycenter_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum shift in y direction for merging.
PaddleOCRConfig¶
kreuzberg.PaddleOCRConfig
dataclass
¶
Configuration options for PaddleOCR.
This dataclass provides type hints and documentation for all PaddleOCR parameters.
Source code in kreuzberg/_ocr/_paddleocr.py
Attributes¶
cls_image_shape: str = '3,48,192'
class-attribute
instance-attribute
¶
Image shape for classification algorithm in format 'channels,height,width'.
det_algorithm: Literal['DB', 'EAST', 'SAST', 'PSE', 'FCE', 'PAN', 'CT', 'DB++', 'Layout'] = 'DB'
class-attribute
instance-attribute
¶
Detection algorithm.
det_db_box_thresh: float = 0.5
class-attribute
instance-attribute
¶
Score threshold for detected boxes. Boxes below this value are discarded.
det_db_thresh: float = 0.3
class-attribute
instance-attribute
¶
Binarization threshold for DB output map.
det_db_unclip_ratio: float = 2.0
class-attribute
instance-attribute
¶
Expansion ratio for detected text boxes.
det_east_cover_thresh: float = 0.1
class-attribute
instance-attribute
¶
Score threshold for EAST output boxes.
det_east_nms_thresh: float = 0.2
class-attribute
instance-attribute
¶
NMS threshold for EAST model output boxes.
det_east_score_thresh: float = 0.8
class-attribute
instance-attribute
¶
Binarization threshold for EAST output map.
det_max_side_len: int = 960
class-attribute
instance-attribute
¶
Maximum size of image long side. Images exceeding this will be proportionally resized.
drop_score: float = 0.5
class-attribute
instance-attribute
¶
Filter recognition results by confidence score. Results below this are discarded.
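The effect of `drop_score` can be pictured as a simple filter over (text, confidence) pairs. This is an illustrative sketch under the assumption that results are available as such pairs; `filter_by_score` is a hypothetical helper, and the filtering itself is performed by the OCR backend.

```python
from __future__ import annotations


def filter_by_score(
    results: list[tuple[str, float]], drop_score: float = 0.5
) -> list[tuple[str, float]]:
    """Keep results whose confidence is not below drop_score."""
    return [(text, score) for text, score in results if score >= drop_score]


kept = filter_by_score([("Invoice", 0.98), ("t0ta1", 0.31), ("Total", 0.5)])
print(kept)
```

A result whose score equals the threshold is kept here, following the wording that only results below the value are discarded.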
enable_mkldnn: bool = False
class-attribute
instance-attribute
¶
Whether to enable MKL-DNN acceleration (Intel CPU only).
gpu_mem: int = 8000
class-attribute
instance-attribute
¶
GPU memory size (in MB) to use for initialization.
language: str = 'en'
class-attribute
instance-attribute
¶
Language to use for OCR.
max_text_length: int = 25
class-attribute
instance-attribute
¶
Maximum text length that the recognition algorithm can recognize.
rec: bool = True
class-attribute
instance-attribute
¶
Enable text recognition when using the ocr() function.
rec_algorithm: Literal['CRNN', 'SRN', 'NRTR', 'SAR', 'SEED', 'SVTR', 'SVTR_LCNet', 'ViTSTR', 'ABINet', 'VisionLAN', 'SPIN', 'RobustScanner', 'RFL'] = 'CRNN'
class-attribute
instance-attribute
¶
Recognition algorithm.
rec_image_shape: str = '3,32,320'
class-attribute
instance-attribute
¶
Image shape for recognition algorithm in format 'channels,height,width'.
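The 'channels,height,width' string format used by `cls_image_shape` and `rec_image_shape` can be parsed and validated as in the sketch below. `parse_image_shape` is a hypothetical helper for checking a value before passing it on; the library does not necessarily expose anything like it.

```python
from __future__ import annotations


def parse_image_shape(shape: str) -> tuple[int, int, int]:
    """Parse 'channels,height,width' into a tuple of three positive ints."""
    parts = [part.strip() for part in shape.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 'channels,height,width', got {shape!r}")
    channels, height, width = (int(part) for part in parts)
    if min(channels, height, width) <= 0:
        raise ValueError("all dimensions must be positive")
    return channels, height, width


rec_shape = parse_image_shape("3,32,320")
cls_shape = parse_image_shape("3,48,192")
print(rec_shape, cls_shape)
```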
table: bool = True
class-attribute
instance-attribute
¶
Whether to enable table recognition.
use_angle_cls: bool = True
class-attribute
instance-attribute
¶
Whether to use text orientation classification model.
use_gpu: bool = False
class-attribute
instance-attribute
¶
Whether to use GPU for inference. Requires installing the paddlepaddle-gpu package.
use_space_char: bool = True
class-attribute
instance-attribute
¶
Whether to recognize spaces.
use_zero_copy_run: bool = False
class-attribute
instance-attribute
¶
Whether to enable zero_copy_run for inference optimization.
GMFT Configuration¶
Configuration options for the GMFT table extraction engine:
kreuzberg.GMFTConfig
dataclass
¶
Configuration options for GMFT.
This class encapsulates the configuration options for GMFT, providing a way to customize its behavior.
Source code in kreuzberg/_gmft.py
Attributes¶
cell_required_confidence: dict[Literal[0, 1, 2, 3, 4, 5, 6], float] = field(default_factory=lambda: {0: 0.3, 1: 0.3, 2: 0.3, 3: 0.3, 4: 0.5, 5: 0.5, 6: 99}, hash=False)
class-attribute
instance-attribute
¶
Confidence required (>=) for a row/column feature to be considered good, keyed by feature label (see TATRFormattedTable.id2label).
Note that a low required confidence may work better than one set too high (see formatter_base_threshold).
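The ">=" comparison against the per-label thresholds can be sketched as below, using the documented default mapping. `is_confident` is a hypothetical helper; the meaning of each label id is defined by TATRFormattedTable.id2label and is not reproduced here.

```python
from __future__ import annotations

# Documented defaults; keys are feature label ids, values are thresholds.
DEFAULT_REQUIRED = {0: 0.3, 1: 0.3, 2: 0.3, 3: 0.3, 4: 0.5, 5: 0.5, 6: 99}


def is_confident(
    label_id: int,
    confidence: float,
    required: dict[int, float] = DEFAULT_REQUIRED,
) -> bool:
    """Return True when the confidence meets the threshold for this label."""
    return confidence >= required[label_id]


checks = [is_confident(0, 0.3), is_confident(4, 0.4), is_confident(6, 1.0)]
print(checks)
```

Assuming confidences lie between 0 and 1, the default threshold of 99 for label 6 can never be met, so features with that label are never accepted under this reading.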
detector_base_threshold: float = 0.9
class-attribute
instance-attribute
¶
Minimum confidence score required for a table.
enable_multi_header: bool = False
class-attribute
instance-attribute
¶
Enable multi-indices in the dataframe.
If False, multiple headers are merged column-wise.
force_large_table_assumption: bool | None = None
class-attribute
instance-attribute
¶
Force the large table assumption to be applied, regardless of the number of rows and overlap.
formatter_base_threshold: float = 0.3
class-attribute
instance-attribute
¶
Base threshold for the confidence demanded of a table feature (row/column).
Note that a low threshold is usually the better choice: detecting too many rows generally leaves numbers aligned and only adds empty rows, whereas detecting fewer rows than expected merges cells, which is worse.
large_table_if_n_rows_removed: int = 8
class-attribute
instance-attribute
¶
If >= n rows are removed due to non-maxima suppression (NMS), then this table is classified as a large table.
large_table_maximum_rows: int = 1000
class-attribute
instance-attribute
¶
Maximum number of rows allowed for a large table.
large_table_row_overlap_threshold: float = 0.2
class-attribute
instance-attribute
¶
With large tables, the Table Transformer tends to place too many overlapping rows. With more rows, however, there is more information about the usual text size, which can be used to estimate a row height such that no rows are merged or overlap.
The large table assumption is only applied when (number of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold).
large_table_threshold: int = 10
class-attribute
instance-attribute
¶
With large tables, the Table Transformer tends to place too many overlapping rows. With more rows, however, there is more information about the usual text size, which can be used to estimate a row height such that no rows are merged or overlap.
The large table assumption is only applied when (number of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold). Set to 9999 to disable; set to 0 to force the large table assumption to run every time.
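The two-part condition stated above can be written as a small predicate. This sketch covers only that condition with the documented default thresholds; it deliberately leaves out `force_large_table_assumption` and `large_table_if_n_rows_removed`, and `applies_large_table_assumption` is a hypothetical name.

```python
def applies_large_table_assumption(
    n_rows: int,
    total_overlap: float,
    large_table_threshold: int = 10,
    large_table_row_overlap_threshold: float = 0.2,
) -> bool:
    """Both conditions must hold: enough rows AND enough total overlap."""
    return (
        n_rows > large_table_threshold
        and total_overlap > large_table_row_overlap_threshold
    )


decisions = [
    applies_large_table_assumption(12, 0.3),  # many rows, high overlap
    applies_large_table_assumption(12, 0.1),  # many rows, low overlap
    applies_large_table_assumption(5, 0.9),   # few rows, high overlap
]
print(decisions)
```

With a threshold of 9999 the row-count test practically never passes, which is why that value is suggested for disabling the assumption.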
remove_null_rows: bool = True
class-attribute
instance-attribute
¶
Flag to remove rows with no text.
semantic_hierarchical_left_fill: str | None = 'algorithm'
class-attribute
instance-attribute
¶
[Experimental] When semantic spanning cells are enabled and a left header is detected that might represent a group of rows, that header value is repeated for each row in the group.
Possible values: 'algorithm', 'deep', None.
- 'algorithm': assumes that the higher-level header is always the first row, followed by several empty rows.
- 'deep': merges headers according to the spanning cells detected by the Table Transformer.
- None: headers are not duplicated.
semantic_spanning_cells: bool = False
class-attribute
instance-attribute
¶
[Experimental] Enable semantic spanning cells, which often encode hierarchical multi-level indices.
verbosity: int = 0
class-attribute
instance-attribute
¶
Verbosity level for logging.
- 0: errors only
- 1: print warnings
- 2: print warnings and info
- 3: print warnings, info, and debug
PSMMode (Page Segmentation Mode)¶
kreuzberg.PSMMode
¶
Bases: Enum
Enum for Tesseract Page Segmentation Modes (PSM) with human-readable values.
Source code in kreuzberg/_ocr/_tesseract.py
Attributes¶
AUTO = 3
class-attribute
instance-attribute
¶
Fully automatic page segmentation (default).
AUTO_ONLY = 2
class-attribute
instance-attribute
¶
Automatic page segmentation without OSD.
AUTO_OSD = 1
class-attribute
instance-attribute
¶
Automatic page segmentation with orientation and script detection.
CIRCLE_WORD = 9
class-attribute
instance-attribute
¶
Treat the image as a single word in a circle.
OSD_ONLY = 0
class-attribute
instance-attribute
¶
Orientation and script detection only.
SINGLE_BLOCK = 6
class-attribute
instance-attribute
¶
Assume a single uniform block of text.
SINGLE_BLOCK_VERTICAL = 5
class-attribute
instance-attribute
¶
Assume a single uniform block of vertically aligned text.
SINGLE_CHAR = 10
class-attribute
instance-attribute
¶
Treat the image as a single character.
SINGLE_COLUMN = 4
class-attribute
instance-attribute
¶
Assume a single column of text.
SINGLE_LINE = 7
class-attribute
instance-attribute
¶
Treat the image as a single text line.
SINGLE_WORD = 8
class-attribute
instance-attribute
¶
Treat the image as a single word.
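The enum values listed above correspond to the numbers accepted by Tesseract's `--psm` command-line option. The sketch below mirrors them in a hypothetical stand-in enum (`PSMSketch`) and builds the matching argument pair; how the library itself passes the mode to Tesseract is not shown here, so `psm_args` is an assumption for illustration.

```python
from __future__ import annotations

from enum import Enum


class PSMSketch(Enum):
    """Hypothetical stand-in mirroring the documented PSMMode values."""

    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERTICAL = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10


def psm_args(mode: PSMSketch) -> list[str]:
    """Build the '--psm N' argument pair for a Tesseract command line."""
    return ["--psm", str(mode.value)]


args = psm_args(PSMSketch.SINGLE_LINE)
print(args)
```

Using named members instead of bare integers makes a configuration such as "single text line" readable without looking up the mode number.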
Metadata¶
A TypedDict that contains optional metadata fields extracted from documents:
kreuzberg.Metadata
¶
Bases: TypedDict
Base metadata common to all document types.
All fields will only be included if they contain non-empty values. Any field that would be empty or None is omitted from the dictionary.
Source code in kreuzberg/_types.py
Attributes¶
authors: NotRequired[list[str]]
instance-attribute
¶
List of document authors.
categories: NotRequired[list[str]]
instance-attribute
¶
Categories or classifications.
citations: NotRequired[list[str]]
instance-attribute
¶
Citation identifiers.
comments: NotRequired[str]
instance-attribute
¶
General comments.
copyright: NotRequired[str]
instance-attribute
¶
Copyright information.
created_at: NotRequired[str]
instance-attribute
¶
Creation timestamp in ISO format.
created_by: NotRequired[str]
instance-attribute
¶
Document creator.
description: NotRequired[str]
instance-attribute
¶
Document description.
fonts: NotRequired[list[str]]
instance-attribute
¶
List of fonts used in the document.
height: NotRequired[int]
instance-attribute
¶
Height of the document page/slide/image, if applicable.
identifier: NotRequired[str]
instance-attribute
¶
Unique document identifier.
keywords: NotRequired[list[str]]
instance-attribute
¶
Keywords or tags.
languages: NotRequired[list[str]]
instance-attribute
¶
Document language codes.
license: NotRequired[str]
instance-attribute
¶
License information.
modified_at: NotRequired[str]
instance-attribute
¶
Last modification timestamp in ISO format.
modified_by: NotRequired[str]
instance-attribute
¶
Username of last modifier.
organization: NotRequired[str | list[str]]
instance-attribute
¶
Organizational affiliation.
publisher: NotRequired[str]
instance-attribute
¶
Publisher or organization name.
references: NotRequired[list[str]]
instance-attribute
¶
Reference entries.
status: NotRequired[str]
instance-attribute
¶
Document status (e.g., draft, final).
subject: NotRequired[str]
instance-attribute
¶
Document subject or topic.
subtitle: NotRequired[str]
instance-attribute
¶
Document subtitle.
summary: NotRequired[str]
instance-attribute
¶
Document summary.
title: NotRequired[str]
instance-attribute
¶
Document title.
version: NotRequired[str]
instance-attribute
¶
Version identifier or revision number.
width: NotRequired[int]
instance-attribute
¶
Width of the document page/slide/image, if applicable.
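Because empty or None fields are omitted from the metadata dictionary, consumers should read keys defensively. The sketch below shows both sides under stated assumptions: `drop_empty` is a hypothetical helper illustrating the omission rule (the library's own implementation is not shown here), and the metadata is modeled as a plain dict.

```python
from __future__ import annotations

from typing import Any


def is_empty(value: Any) -> bool:
    """Treat None and zero-length strings, lists, and dicts as empty."""
    return value is None or (
        isinstance(value, (str, list, dict)) and len(value) == 0
    )


def drop_empty(fields: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of fields without empty or None values."""
    return {key: value for key, value in fields.items() if not is_empty(value)}


raw = {"title": "Report", "authors": [], "subject": None, "comments": "", "width": 800}
metadata = drop_empty(raw)

# Keys may be absent, so use .get() with a fallback instead of indexing.
authors = metadata.get("authors", [])
print(metadata, authors)
```

Reading with `.get()` avoids a KeyError for any of the NotRequired fields listed above.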