Types¶
Core data structures for extraction results, configuration, and metadata.
ExtractionResult¶
The result of a file extraction, containing the extracted text, MIME type, metadata, and table data:
kreuzberg.ExtractionResult
dataclass
¶
The result of a file extraction.
Source code in kreuzberg/_types.py
Attributes¶
chunks: list[str] = field(default_factory=list)
class-attribute
instance-attribute
¶
The extracted content chunks. This is an empty list if 'chunk_content' is not set to True in the ExtractionConfig.
content: str
instance-attribute
¶
The extracted content.
detected_languages: list[str] | None = None
class-attribute
instance-attribute
¶
Languages detected in the extracted content, if language detection is enabled.
entities: list[Entity] | None = None
class-attribute
instance-attribute
¶
Extracted entities, if entity extraction is enabled.
keywords: list[tuple[str, float]] | None = None
class-attribute
instance-attribute
¶
Extracted keywords and their scores, if keyword extraction is enabled.
metadata: Metadata
instance-attribute
¶
The metadata of the content.
mime_type: str
instance-attribute
¶
The MIME type of the extracted content, either text/plain or text/markdown.
tables: list[TableData] = field(default_factory=list)
class-attribute
instance-attribute
¶
Extracted tables. This is an empty list if 'extract_tables' is not set to True in the ExtractionConfig.
ExtractionConfig¶
Configuration options for extraction functions:
kreuzberg.ExtractionConfig
dataclass
¶
Represents configuration settings for an extraction process.
This class encapsulates the configuration options for extracting text from images or documents using Optical Character Recognition (OCR). It provides options to customize the OCR behavior, select the backend engine, and configure engine-specific parameters.
Source code in kreuzberg/_types.py
Attributes¶
auto_detect_language: bool = False
class-attribute
instance-attribute
¶
Whether to automatically detect language and configure OCR accordingly.
chunk_content: bool = False
class-attribute
instance-attribute
¶
Whether to chunk the content into smaller chunks.
custom_entity_patterns: frozenset[tuple[str, str]] | None = None
class-attribute
instance-attribute
¶
Custom entity patterns as a frozenset of (entity_type, regex_pattern) tuples.
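As a rough illustration of the (entity_type, regex_pattern) shape, the sketch below applies such a frozenset to a string with the standard `re` module. The pattern names, the sample text, and the `find_custom_entities` helper are hypothetical, and this is not kreuzberg's own matching code.

```python
import re

# Hypothetical patterns in the documented (entity_type, regex_pattern) shape.
custom_entity_patterns = frozenset({
    ("INVOICE_ID", r"INV-\d{4}"),
    ("EMAIL", r"[\w.+-]+@[\w-]+\.[\w.]+"),
})


def find_custom_entities(
    text: str, patterns: frozenset[tuple[str, str]]
) -> list[tuple[str, str, int, int]]:
    """Return (entity_type, matched_text, start, end) for every regex match."""
    found = []
    for entity_type, pattern in patterns:
        for match in re.finditer(pattern, text):
            found.append((entity_type, match.group(), match.start(), match.end()))
    # Sort by position so the result does not depend on frozenset iteration order.
    return sorted(found, key=lambda item: item[2])


matches = find_custom_entities("Send INV-2024 to billing@example.com", custom_entity_patterns)
```

A frozenset keeps the value hashable, which is presumably why the attribute is not typed as a plain dict or list.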
extract_entities: bool = False
class-attribute
instance-attribute
¶
Whether to extract named entities from the content.
extract_keywords: bool = False
class-attribute
instance-attribute
¶
Whether to extract keywords from the content.
extract_tables: bool = False
class-attribute
instance-attribute
¶
Whether to extract tables from the content. This requires the 'gmft' dependency.
force_ocr: bool = False
class-attribute
instance-attribute
¶
Whether to force OCR.
gmft_config: GMFTConfig | None = None
class-attribute
instance-attribute
¶
GMFT configuration.
keyword_count: int = 10
class-attribute
instance-attribute
¶
Number of keywords to extract if extract_keywords is True.
language_detection_config: LanguageDetectionConfig | None = None
class-attribute
instance-attribute
¶
Configuration for language detection. If None, uses default settings.
max_chars: int = DEFAULT_MAX_CHARACTERS
class-attribute
instance-attribute
¶
The maximum size of each chunk in characters.
max_overlap: int = DEFAULT_MAX_OVERLAP
class-attribute
instance-attribute
¶
The overlap between chunks in characters.
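To make the interaction between max_chars and max_overlap concrete, here is a minimal fixed-window sketch. The `chunk_text` function is hypothetical and only illustrates the two parameters; the splitter kreuzberg actually uses may choose chunk boundaries differently.

```python
def chunk_text(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars that share max_overlap characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")
    step = max_chars - max_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", max_chars=4, max_overlap=1)
```

With these toy values each chunk repeats the last character of the previous one, which is the role max_overlap plays at realistic sizes.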
ocr_backend: OcrBackendType | None = 'tesseract'
class-attribute
instance-attribute
¶
The OCR backend to use.
Notes
- If set to None, OCR will not be performed.
ocr_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None
class-attribute
instance-attribute
¶
Configuration to pass to the OCR backend.
post_processing_hooks: list[PostProcessingHook] | None = None
class-attribute
instance-attribute
¶
Post processing hooks to call after processing is done and before the final result is returned.
spacy_entity_extraction_config: SpacyEntityExtractionConfig | None = None
class-attribute
instance-attribute
¶
Configuration for spaCy entity extraction. If None, uses default settings.
validators: list[ValidationHook] | None = None
class-attribute
instance-attribute
¶
Validation hooks to call after processing is done, before the post-processing hooks run and the result is returned.
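The ordering of the two hook lists can be sketched as follows. Everything here is a stand-in: `ResultStandIn` replaces ExtractionResult, and the hook signatures (validators return nothing and raise on failure, post-processing hooks return a result) are assumptions made for the example rather than the documented types.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ResultStandIn:
    """Hypothetical stand-in for ExtractionResult, reduced to one field."""
    content: str


def require_content(result: ResultStandIn) -> None:
    """Validation hook (assumed shape): raise if nothing was extracted."""
    if not result.content.strip():
        raise ValueError("empty extraction result")


def strip_whitespace(result: ResultStandIn) -> ResultStandIn:
    """Post-processing hook (assumed shape): return a cleaned copy."""
    return replace(result, content=result.content.strip())


def finalize(result, validators=None, post_processing_hooks=None):
    # Validators run first, then post-processing hooks, then the result is returned.
    for validate in validators or []:
        validate(result)
    for hook in post_processing_hooks or []:
        result = hook(result)
    return result


final = finalize(ResultStandIn("  hello  "), [require_content], [strip_whitespace])
```

Because validation happens before post-processing in this sketch, a validator sees the raw extraction output, not the cleaned one.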
Functions¶
get_config_dict() -> dict[str, Any]
¶
Returns the OCR configuration as a dict, based on the backend specified.
RETURNS | DESCRIPTION |
---|---|
dict[str, Any] | A dict of the OCR configuration or an empty dict if no backend is provided. |
Source code in kreuzberg/_types.py
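The documented behaviour (a dict of the OCR configuration, or an empty dict without a backend) might look roughly like the stand-in below. `ConfigStandIn` and `BackendConfigStandIn` are hypothetical classes, and falling back to the backend defaults when ocr_config is None is an assumption, not something this page states.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class BackendConfigStandIn:
    """Hypothetical stand-in for a backend config such as TesseractConfig."""
    language: str = "eng"


@dataclass
class ConfigStandIn:
    """Hypothetical stand-in for ExtractionConfig, reduced to two fields."""
    ocr_backend: Optional[str] = "tesseract"
    ocr_config: Optional[BackendConfigStandIn] = None

    def get_config_dict(self) -> dict[str, Any]:
        if self.ocr_backend is None:
            return {}  # no backend, so there is no OCR configuration to return
        # Assumption: use the backend's default config when none was supplied.
        return asdict(self.ocr_config or BackendConfigStandIn())


without_backend = ConfigStandIn(ocr_backend=None).get_config_dict()
with_defaults = ConfigStandIn().get_config_dict()
```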
TableData¶
A TypedDict that contains data extracted from tables in documents:
kreuzberg.TableData
¶
Bases: TypedDict
Table data, returned from table extraction.
Source code in kreuzberg/_types.py
OCR Configuration¶
TesseractConfig¶
kreuzberg.TesseractConfig
dataclass
¶
Configuration options for Tesseract OCR engine.
Source code in kreuzberg/_ocr/_tesseract.py
Attributes¶
classify_use_pre_adapted_templates: bool = True
class-attribute
instance-attribute
¶
Whether to use pre-adapted templates during classification to improve recognition accuracy.
language: str = 'eng'
class-attribute
instance-attribute
¶
Language code to use for OCR. Examples:
- 'eng' for English
- 'deu' for German
- multiple languages combined with '+', e.g. 'eng+deu'
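When the languages are only known at runtime, the combined value can be built from a list. `join_tesseract_languages` is a hypothetical helper written for this example, not part of kreuzberg.

```python
def join_tesseract_languages(codes: list[str]) -> str:
    """Join language codes such as ['eng', 'deu'] into 'eng+deu'."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


language = join_tesseract_languages(["eng", "deu"])
```

The resulting string is what would be passed as the language value of TesseractConfig.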
language_model_ngram_on: bool = False
class-attribute
instance-attribute
¶
Enable or disable the use of n-gram-based language models for improved text recognition.
Default is False for optimal performance on modern documents. Enable for degraded or historical text.
psm: PSMMode = PSMMode.AUTO_ONLY
class-attribute
instance-attribute
¶
Page segmentation mode (PSM) to guide Tesseract on how to segment the image (e.g., single block, single line).
tessedit_char_whitelist: str = ''
class-attribute
instance-attribute
¶
Whitelist of characters that Tesseract is allowed to recognize. Empty string means no restriction.
tessedit_dont_blkrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents block rejection of words identified as good, improving text output quality.
tessedit_dont_rowrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents row rejection of words identified as good, avoiding unnecessary omissions.
tessedit_enable_dict_correction: bool = True
class-attribute
instance-attribute
¶
Enable or disable dictionary-based correction for recognized text to improve word accuracy.
tessedit_use_primary_params_model: bool = True
class-attribute
instance-attribute
¶
If True, forces the use of the primary parameters model for text recognition.
textord_space_size_is_variable: bool = True
class-attribute
instance-attribute
¶
Allow variable spacing between words, useful for text with irregular spacing.
thresholding_method: bool = False
class-attribute
instance-attribute
¶
Enable or disable specific thresholding methods during image preprocessing for better OCR accuracy.
EasyOCRConfig¶
kreuzberg.EasyOCRConfig
dataclass
¶
Configuration options for EasyOCR.
Source code in kreuzberg/_ocr/_easyocr.py
Attributes¶
add_margin: float = 0.1
class-attribute
instance-attribute
¶
Extend bounding boxes in all directions.
adjust_contrast: float = 0.5
class-attribute
instance-attribute
¶
Target contrast level for low contrast text.
beam_width: int = 5
class-attribute
instance-attribute
¶
Beam width for beam search in recognition.
canvas_size: int = 2560
class-attribute
instance-attribute
¶
Maximum image dimension for detection.
contrast_ths: float = 0.1
class-attribute
instance-attribute
¶
Contrast threshold for preprocessing.
decoder: Literal['greedy', 'beamsearch', 'wordbeamsearch'] = 'greedy'
class-attribute
instance-attribute
¶
Decoder method. Options: 'greedy', 'beamsearch', 'wordbeamsearch'.
device: DeviceType = 'auto'
class-attribute
instance-attribute
¶
Device to use for inference. Options: 'cpu', 'cuda', 'mps', 'auto'.
fallback_to_cpu: bool = True
class-attribute
instance-attribute
¶
Whether to fall back to CPU if the requested device is unavailable.
gpu_memory_limit: float | None = None
class-attribute
instance-attribute
¶
Maximum GPU memory to use in GB. None for no limit.
height_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum difference in box height for merging.
language: str | list[str] = 'en'
class-attribute
instance-attribute
¶
Language or languages to use for OCR. Can be a single language code (e.g., 'en'), a comma-separated string of language codes (e.g., 'en,ch_sim'), or a list of language codes.
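The three accepted forms can be reduced to a single list before use. The sketch below shows one way to do that; `normalize_languages` is a hypothetical helper and does not claim to mirror how kreuzberg parses the value internally.

```python
from typing import Union


def normalize_languages(language: Union[str, list[str]]) -> list[str]:
    """Turn 'en', 'en,ch_sim', or ['en', 'ch_sim'] into a list of language codes."""
    if isinstance(language, str):
        parts = language.split(",")
    else:
        parts = list(language)
    # Drop surrounding whitespace and empty entries such as a trailing comma.
    return [part.strip() for part in parts if part.strip()]


single = normalize_languages("en")
combined = normalize_languages("en,ch_sim")
```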
link_threshold: float = 0.4
class-attribute
instance-attribute
¶
Link confidence threshold.
low_text: float = 0.4
class-attribute
instance-attribute
¶
Text low-bound score.
mag_ratio: float = 1.0
class-attribute
instance-attribute
¶
Image magnification ratio.
min_size: int = 10
class-attribute
instance-attribute
¶
Minimum text box size in pixels.
rotation_info: list[int] | None = None
class-attribute
instance-attribute
¶
List of angles to try for detection.
slope_ths: float = 0.1
class-attribute
instance-attribute
¶
Maximum slope for merging text boxes.
text_threshold: float = 0.7
class-attribute
instance-attribute
¶
Text confidence threshold.
use_gpu: bool = False
class-attribute
instance-attribute
¶
Whether to use GPU for inference. DEPRECATED: Use 'device' parameter instead.
width_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum horizontal distance for merging boxes.
x_ths: float = 1.0
class-attribute
instance-attribute
¶
Maximum horizontal distance for paragraph merging.
y_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum vertical distance for paragraph merging.
ycenter_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum shift in y direction for merging.
PaddleOCRConfig¶
kreuzberg.PaddleOCRConfig
dataclass
¶
Configuration options for PaddleOCR.
This dataclass provides type hints and documentation for all PaddleOCR parameters.
Source code in kreuzberg/_ocr/_paddleocr.py
Attributes¶
cls_image_shape: str = '3,48,192'
class-attribute
instance-attribute
¶
Image shape for classification algorithm in format 'channels,height,width'.
det_algorithm: Literal['DB', 'EAST', 'SAST', 'PSE', 'FCE', 'PAN', 'CT', 'DB++', 'Layout'] = 'DB'
class-attribute
instance-attribute
¶
Detection algorithm.
det_db_box_thresh: float = 0.5
class-attribute
instance-attribute
¶
Score threshold for detected boxes. Boxes below this value are discarded.
det_db_thresh: float = 0.3
class-attribute
instance-attribute
¶
Binarization threshold for DB output map.
det_db_unclip_ratio: float = 2.0
class-attribute
instance-attribute
¶
Expansion ratio for detected text boxes.
det_east_cover_thresh: float = 0.1
class-attribute
instance-attribute
¶
Score threshold for EAST output boxes.
det_east_nms_thresh: float = 0.2
class-attribute
instance-attribute
¶
NMS threshold for EAST model output boxes.
det_east_score_thresh: float = 0.8
class-attribute
instance-attribute
¶
Binarization threshold for EAST output map.
det_max_side_len: int = 960
class-attribute
instance-attribute
¶
Maximum size of image long side. Images exceeding this will be proportionally resized.
det_model_dir: str | None = None
class-attribute
instance-attribute
¶
Directory for detection model. If None, uses default model location.
device: DeviceType = 'auto'
class-attribute
instance-attribute
¶
Device to use for inference. Options: 'cpu', 'cuda', 'auto'. Note: MPS is not supported by PaddlePaddle.
drop_score: float = 0.5
class-attribute
instance-attribute
¶
Filter recognition results by confidence score. Results below this are discarded.
enable_mkldnn: bool = False
class-attribute
instance-attribute
¶
Whether to enable MKL-DNN acceleration (Intel CPU only).
fallback_to_cpu: bool = True
class-attribute
instance-attribute
¶
Whether to fall back to CPU if the requested device is unavailable.
gpu_mem: int = 8000
class-attribute
instance-attribute
¶
GPU memory size (in MB) to use for initialization.
gpu_memory_limit: float | None = None
class-attribute
instance-attribute
¶
Maximum GPU memory to use in GB. None for no limit.
language: str = 'en'
class-attribute
instance-attribute
¶
Language to use for OCR.
max_text_length: int = 25
class-attribute
instance-attribute
¶
Maximum text length that the recognition algorithm can recognize.
rec: bool = True
class-attribute
instance-attribute
¶
Enable text recognition when using the ocr() function.
rec_algorithm: Literal['CRNN', 'SRN', 'NRTR', 'SAR', 'SEED', 'SVTR', 'SVTR_LCNet', 'ViTSTR', 'ABINet', 'VisionLAN', 'SPIN', 'RobustScanner', 'RFL'] = 'CRNN'
class-attribute
instance-attribute
¶
Recognition algorithm.
rec_image_shape: str = '3,32,320'
class-attribute
instance-attribute
¶
Image shape for recognition algorithm in format 'channels,height,width'.
rec_model_dir: str | None = None
class-attribute
instance-attribute
¶
Directory for recognition model. If None, uses default model location.
table: bool = True
class-attribute
instance-attribute
¶
Whether to enable table recognition.
use_angle_cls: bool = True
class-attribute
instance-attribute
¶
Whether to use text orientation classification model.
use_gpu: bool = False
class-attribute
instance-attribute
¶
Whether to use GPU for inference. DEPRECATED: Use 'device' parameter instead.
use_space_char: bool = True
class-attribute
instance-attribute
¶
Whether to recognize spaces.
use_zero_copy_run: bool = False
class-attribute
instance-attribute
¶
Whether to enable zero_copy_run for inference optimization.
GMFT Configuration¶
Configuration options for the GMFT table extraction engine:
kreuzberg.GMFTConfig
dataclass
¶
Configuration options for GMFT.
This class encapsulates the configuration options for GMFT, providing a way to customize its behavior.
Source code in kreuzberg/_gmft.py
Attributes¶
cell_required_confidence: dict[Literal[0, 1, 2, 3, 4, 5, 6], float] = field(default_factory=lambda: {0: 0.3, 1: 0.3, 2: 0.3, 3: 0.3, 4: 0.5, 5: 0.5, 6: 99}, hash=False)
class-attribute
instance-attribute
¶
Confidences required (>=) for a row/column feature to be considered good. See TATRFormattedTable.id2label.
However, low required confidences may work better than overly high ones (see formatter_base_threshold).
detector_base_threshold: float = 0.9
class-attribute
instance-attribute
¶
Minimum confidence score required for a table.
enable_multi_header: bool = False
class-attribute
instance-attribute
¶
Enable multi-indices in the dataframe.
If false, then multiple headers will be merged column-wise.
force_large_table_assumption: bool | None = None
class-attribute
instance-attribute
¶
Force the large table assumption to be applied, regardless of the number of rows and overlap.
formatter_base_threshold: float = 0.3
class-attribute
instance-attribute
¶
Base threshold for the confidence demanded of a table feature (row/column).
Note that a low threshold is actually better: overzealous row detection generally leaves numbers aligned and merely adds many empty rows, whereas detecting fewer rows than expected merges cells, which is bad.
iob_reject_threshold: float = 0.05
class-attribute
instance-attribute
¶
Reject if iob between textbox and cell is < 5%.
iob_warn_threshold: float = 0.5
class-attribute
instance-attribute
¶
Warn if iob between textbox and cell is < 50%.
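The two iob thresholds split values into three bands. The sketch below restates that with the documented defaults; `classify_iob` and its return labels are hypothetical and only illustrate the comparison, not GMFT's internal handling.

```python
def classify_iob(
    iob: float,
    iob_reject_threshold: float = 0.05,
    iob_warn_threshold: float = 0.5,
) -> str:
    """Map an iob value to 'reject', 'warn', or 'ok' using the two thresholds."""
    if iob < iob_reject_threshold:
        return "reject"
    if iob < iob_warn_threshold:
        return "warn"
    return "ok"


labels = [classify_iob(value) for value in (0.01, 0.2, 0.8)]
```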
large_table_if_n_rows_removed: int = 8
class-attribute
instance-attribute
¶
If >= n rows are removed due to non-maxima suppression (NMS), then this table is classified as a large table.
large_table_maximum_rows: int = 1000
class-attribute
instance-attribute
¶
Maximum number of rows allowed for a large table.
large_table_row_overlap_threshold: float = 0.2
class-attribute
instance-attribute
¶
With large tables, table transformer struggles with placing too many overlapping rows. Luckily, with more rows, we have more info on the usual size of text, which we can use to make a guess on the height such that no rows are merged or overlapping.
Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold).
large_table_threshold: int = 10
class-attribute
instance-attribute
¶
With large tables, table transformer struggles with placing too many overlapping rows. Luckily, with more rows, we have more info on the usual size of text, which we can use to make a guess on the height such that no rows are merged or overlapping.
Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold). Set 9999 to disable; set 0 to force large table assumption to run every time.
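The condition above can be written as a small predicate. `use_large_table_assumption` is a hypothetical function using the documented defaults; the `force` flag models only the documented True case of force_large_table_assumption and makes no claim about how GMFT treats False or None.

```python
def use_large_table_assumption(
    n_rows: int,
    total_overlap: float,
    large_table_threshold: int = 10,
    large_table_row_overlap_threshold: float = 0.2,
    force: bool = False,
) -> bool:
    """Return True when the large table assumption would apply."""
    if force:
        return True  # forced regardless of the number of rows and overlap
    return (
        n_rows > large_table_threshold
        and total_overlap > large_table_row_overlap_threshold
    )


applies = use_large_table_assumption(n_rows=12, total_overlap=0.3)
```

Both conditions must hold, so a 12-row table with only 10% overlap would not trigger the assumption under the defaults.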
nms_warn_threshold: int = 5
class-attribute
instance-attribute
¶
Warn if non-maxima suppression (NMS) removes > 5 rows.
remove_null_rows: bool = True
class-attribute
instance-attribute
¶
Flag to remove rows with no text.
semantic_hierarchical_left_fill: Literal['algorithm', 'deep'] | None = 'algorithm'
class-attribute
instance-attribute
¶
[Experimental] When semantic spanning cells are enabled and a left header is detected that might represent a group of rows, that value is duplicated for each row.
Possible values: 'algorithm', 'deep', None.
- 'algorithm': assumes that the higher-level header is always the first row, followed by several empty rows.
- 'deep': merges headers according to the spanning cells detected by the Table Transformer.
- None: headers are not duplicated.
semantic_spanning_cells: bool = False
class-attribute
instance-attribute
¶
[Experimental] Enable semantic spanning cells, which often encode hierarchical multi-level indices.
total_overlap_reject_threshold: float = 0.9
class-attribute
instance-attribute
¶
Reject if total overlap is > 90% of table area.
total_overlap_warn_threshold: float = 0.1
class-attribute
instance-attribute
¶
Warn if total overlap is > 10% of table area.
verbosity: int = 0
class-attribute
instance-attribute
¶
Verbosity level for logging.
- 0: errors only
- 1: print warnings
- 2: print warnings and info
- 3: print warnings, info, and debug
Entity Extraction Configuration¶
Configuration options for spaCy-based entity extraction:
kreuzberg.SpacyEntityExtractionConfig
dataclass
¶
Configuration for spaCy-based entity extraction.
Source code in kreuzberg/_entity_extraction.py
Attributes¶
batch_size: int = 1000
class-attribute
instance-attribute
¶
Batch size for processing multiple texts.
fallback_to_multilingual: bool = True
class-attribute
instance-attribute
¶
If True and language-specific model fails, try xx_ent_wiki_sm (multilingual).
language_models: dict[str, str] | tuple[tuple[str, str], ...] | None = None
class-attribute
instance-attribute
¶
Mapping of language codes to spaCy model names.
If None, uses default mappings:
- en: en_core_web_sm
- de: de_core_news_sm
- fr: fr_core_news_sm
- es: es_core_news_sm
- pt: pt_core_news_sm
- it: it_core_news_sm
- nl: nl_core_news_sm
- zh: zh_core_web_sm
- ja: ja_core_news_sm
max_doc_length: int = 1000000
class-attribute
instance-attribute
¶
Maximum document length for spaCy processing.
model_cache_dir: str | Path | None = None
class-attribute
instance-attribute
¶
Directory to cache spaCy models. If None, uses spaCy's default.
Functions¶
get_fallback_model() -> str | None
¶
get_model_for_language(language_code: str) -> str | None
¶
Get the appropriate spaCy model for a language code.
Source code in kreuzberg/_entity_extraction.py
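A minimal sketch of how the model lookup could behave, using the default mappings listed for language_models. `pick_model` is a hypothetical function; returning the multilingual model whenever no mapping exists is an assumption, since the page only says the fallback is tried when the language-specific model fails.

```python
DEFAULT_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
    "pt": "pt_core_news_sm",
    "it": "it_core_news_sm",
    "nl": "nl_core_news_sm",
    "zh": "zh_core_web_sm",
    "ja": "ja_core_news_sm",
}
FALLBACK_MODEL = "xx_ent_wiki_sm"


def pick_model(language_code, language_models=None, fallback_to_multilingual=True):
    """Return a spaCy model name for language_code, or None if nothing applies."""
    # dict() accepts both a mapping and a tuple of (code, model) pairs.
    models = dict(language_models) if language_models is not None else DEFAULT_MODELS
    model = models.get(language_code)
    if model is None and fallback_to_multilingual:
        return FALLBACK_MODEL  # assumption: fall back when no mapping exists
    return model


german = pick_model("de")
unmapped = pick_model("sv")
```

Passing the mapping as a tuple of pairs, as the attribute type allows, works the same way: `pick_model("sv", language_models=(("sv", "sv_core_news_sm"),))`.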
Language Detection Configuration¶
Configuration options for automatic language detection:
kreuzberg.LanguageDetectionConfig
dataclass
¶
Configuration for language detection.
ATTRIBUTE | DESCRIPTION |
---|---|
low_memory | If True, uses a smaller model (~200MB). If False, uses a larger, more accurate model. Defaults to True for better memory efficiency. |
top_k | Maximum number of languages to return for multilingual detection. Defaults to 3. |
multilingual | If True, uses multilingual detection to handle mixed-language text. If False, uses single language detection. Defaults to False. |
cache_dir | Custom directory for model cache. If None, uses system default. |
allow_fallback | If True, falls back to small model if large model fails. Defaults to True. |
Source code in kreuzberg/_language_detection.py
PSMMode (Page Segmentation Mode)¶
kreuzberg.PSMMode
¶
Bases: Enum
Enum for Tesseract Page Segmentation Modes (PSM) with human-readable values.
Source code in kreuzberg/_ocr/_tesseract.py
Attributes¶
AUTO = 3
class-attribute
instance-attribute
¶
Fully automatic page segmentation (Tesseract's own default; note that TesseractConfig.psm defaults to AUTO_ONLY).
AUTO_ONLY = 2
class-attribute
instance-attribute
¶
Automatic page segmentation without OSD.
AUTO_OSD = 1
class-attribute
instance-attribute
¶
Automatic page segmentation with orientation and script detection.
CIRCLE_WORD = 9
class-attribute
instance-attribute
¶
Treat the image as a single word in a circle.
OSD_ONLY = 0
class-attribute
instance-attribute
¶
Orientation and script detection only.
SINGLE_BLOCK = 6
class-attribute
instance-attribute
¶
Assume a single uniform block of text.
SINGLE_BLOCK_VERTICAL = 5
class-attribute
instance-attribute
¶
Assume a single uniform block of vertically aligned text.
SINGLE_CHAR = 10
class-attribute
instance-attribute
¶
Treat the image as a single character.
SINGLE_COLUMN = 4
class-attribute
instance-attribute
¶
Assume a single column of text.
SINGLE_LINE = 7
class-attribute
instance-attribute
¶
Treat the image as a single text line.
SINGLE_WORD = 8
class-attribute
instance-attribute
¶
Treat the image as a single word.
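The enum values correspond to the integers that Tesseract's command line accepts after `--psm`. The sketch below mirrors the members in a local stand-in enum so it runs without kreuzberg installed; `PSMModeStandIn` and `psm_args` are hypothetical names, and whether kreuzberg builds its arguments this way is not stated here.

```python
from enum import Enum


class PSMModeStandIn(Enum):
    """Local mirror of the PSMMode members and values listed above."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERTICAL = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10


def psm_args(mode: PSMModeStandIn) -> list[str]:
    """Build the command-line fragment that selects a page segmentation mode."""
    return ["--psm", str(mode.value)]


args = psm_args(PSMModeStandIn.SINGLE_LINE)
```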
Entity¶
Represents an extracted named entity:
kreuzberg.Entity
dataclass
¶
Represents an extracted entity with type, text, and position.
Source code in kreuzberg/_types.py
Metadata¶
A TypedDict that contains optional metadata fields extracted from documents:
kreuzberg.Metadata
¶
Bases: TypedDict
Base metadata common to all document types.
All fields will only be included if they contain non-empty values. Any field that would be empty or None is omitted from the dictionary.
Source code in kreuzberg/_types.py
Attributes¶
authors: NotRequired[list[str]]
instance-attribute
¶
List of document authors.
categories: NotRequired[list[str]]
instance-attribute
¶
Categories or classifications.
citations: NotRequired[list[str]]
instance-attribute
¶
Citation identifiers.
comments: NotRequired[str]
instance-attribute
¶
General comments.
copyright: NotRequired[str]
instance-attribute
¶
Copyright information.
created_at: NotRequired[str]
instance-attribute
¶
Creation timestamp in ISO format.
created_by: NotRequired[str]
instance-attribute
¶
Document creator.
description: NotRequired[str]
instance-attribute
¶
Document description.
fonts: NotRequired[list[str]]
instance-attribute
¶
List of fonts used in the document.
height: NotRequired[int]
instance-attribute
¶
Height of the document page/slide/image, if applicable.
identifier: NotRequired[str]
instance-attribute
¶
Unique document identifier.
keywords: NotRequired[list[str]]
instance-attribute
¶
Keywords or tags.
languages: NotRequired[list[str]]
instance-attribute
¶
Document language codes.
license: NotRequired[str]
instance-attribute
¶
License information.
modified_at: NotRequired[str]
instance-attribute
¶
Last modification timestamp in ISO format.
modified_by: NotRequired[str]
instance-attribute
¶
Username of last modifier.
organization: NotRequired[str | list[str]]
instance-attribute
¶
Organizational affiliation.
publisher: NotRequired[str]
instance-attribute
¶
Publisher or organization name.
references: NotRequired[list[str]]
instance-attribute
¶
Reference entries.
status: NotRequired[str]
instance-attribute
¶
Document status (e.g., draft, final).
subject: NotRequired[str]
instance-attribute
¶
Document subject or topic.
subtitle: NotRequired[str]
instance-attribute
¶
Document subtitle.
summary: NotRequired[str]
instance-attribute
¶
Document summary.
title: NotRequired[str]
instance-attribute
¶
Document title.
version: NotRequired[str]
instance-attribute
¶
Version identifier or revision number.
width: NotRequired[int]
instance-attribute
¶
Width of the document page/slide/image, if applicable.
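Because empty fields are omitted from Metadata rather than set to None, callers should test for key presence. The sketch below illustrates the omission rule with a reduced stand-in; `MetadataStandIn` and `drop_empty` are hypothetical, and `total=False` is used in place of NotRequired only to keep the example runnable on older Python versions.

```python
from typing import Any, TypedDict


class MetadataStandIn(TypedDict, total=False):
    """Hypothetical subset of Metadata; every key is optional."""
    title: str
    authors: list[str]
    created_at: str


def drop_empty(fields: dict[str, Any]) -> MetadataStandIn:
    """Keep only fields whose values are neither None nor empty."""
    kept = {
        key: value
        for key, value in fields.items()
        if value is not None and value != "" and value != [] and value != {}
    }
    return MetadataStandIn(**kept)


metadata = drop_empty({"title": "Report", "authors": [], "created_at": None})
title = metadata.get("title")
has_authors = "authors" in metadata
```

Using `.get()` or an `in` check avoids a KeyError for fields that were omitted.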