Types¶

Core data structures for extraction results, configuration, and metadata.

ExtractionResult¶

The result of a file extraction, containing the extracted text, MIME type, and metadata, plus optional tables, chunks, entities, keywords, and detected languages:

`kreuzberg.ExtractionResult` `dataclass` ¶

The result of a file extraction.

Source code in kreuzberg/_types.py

@dataclass
class ExtractionResult:
    """The result of a file extraction."""

    content: str
    """The extracted content."""
    mime_type: str
    """The MIME type of the extracted content: either text/plain or text/markdown."""
    metadata: Metadata
    """The metadata of the content."""
    tables: list[TableData] = field(default_factory=list)
    """Extracted tables. Is an empty list if 'extract_tables' is not set to True in the ExtractionConfig."""
    chunks: list[str] = field(default_factory=list)
    """The extracted content chunks. This is an empty list if 'chunk_content' is not set to True in the ExtractionConfig."""
    entities: list[Entity] | None = None
    """Extracted entities, if entity extraction is enabled."""
    keywords: list[tuple[str, float]] | None = None
    """Extracted keywords and their scores, if keyword extraction is enabled."""
    detected_languages: list[str] | None = None
    """Languages detected in the extracted content, if language detection is enabled."""

    def to_dict(self) -> dict[str, Any]:
        """Converts the ExtractionResult to a dictionary."""
        return asdict(self)

Attributes¶

`chunks: list[str] = field(default_factory=list)` `class-attribute` `instance-attribute` ¶

The extracted content chunks. This is an empty list if 'chunk_content' is not set to True in the ExtractionConfig.

`content: str` `instance-attribute` ¶

The extracted content.

`detected_languages: list[str] | None = None` `class-attribute` `instance-attribute` ¶

Languages detected in the extracted content, if language detection is enabled.

`entities: list[Entity] | None = None` `class-attribute` `instance-attribute` ¶

Extracted entities, if entity extraction is enabled.

`keywords: list[tuple[str, float]] | None = None` `class-attribute` `instance-attribute` ¶

Extracted keywords and their scores, if keyword extraction is enabled.

`metadata: Metadata` `instance-attribute` ¶

The metadata of the content.

`mime_type: str` `instance-attribute` ¶

The MIME type of the extracted content: either text/plain or text/markdown.

`tables: list[TableData] = field(default_factory=list)` `class-attribute` `instance-attribute` ¶

Extracted tables. Is an empty list if 'extract_tables' is not set to True in the ExtractionConfig.

Functions¶

`to_dict() -> dict[str, Any]` ¶

Converts the ExtractionResult to a dictionary.

Source code in kreuzberg/_types.py

def to_dict(self) -> dict[str, Any]:
    """Converts the ExtractionResult to a dictionary."""
    return asdict(self)
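As a rough illustration of what `to_dict()` produces, the sketch below uses a simplified stand-in class rather than the real `ExtractionResult`. `ResultSketch` and its plain-dict `metadata` field are assumptions made for the example; only the use of `dataclasses.asdict` is taken from the source above.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class ResultSketch:
    """Simplified stand-in for ExtractionResult (hypothetical, not the real class)."""

    content: str
    mime_type: str
    metadata: dict[str, Any]
    chunks: list[str] = field(default_factory=list)
    keywords: list[tuple[str, float]] | None = None

    def to_dict(self) -> dict[str, Any]:
        # asdict() recurses into nested dataclasses, lists, tuples, and dicts,
        # so the returned dictionary shares no containers with the instance.
        return asdict(self)


result = ResultSketch(content="Hello", mime_type="text/plain", metadata={"title": "Demo"})
as_dict = result.to_dict()
```

Note that fields left at their defaults still appear in the dictionary: list fields as empty lists, optional fields as `None`.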

ExtractionConfig¶

Configuration options for extraction functions:

`kreuzberg.ExtractionConfig` `dataclass` ¶

Represents configuration settings for an extraction process.

This class encapsulates the configuration options for extracting text from images or documents using Optical Character Recognition (OCR). It provides options to customize the OCR behavior, select the backend engine, and configure engine-specific parameters.

Source code in kreuzberg/_types.py

@dataclass(unsafe_hash=True)
class ExtractionConfig:
    """Represents configuration settings for an extraction process.

    This class encapsulates the configuration options for extracting text
    from images or documents using Optical Character Recognition (OCR). It
    provides options to customize the OCR behavior, select the backend
    engine, and configure engine-specific parameters.
    """

    force_ocr: bool = False
    """Whether to force OCR."""
    chunk_content: bool = False
    """Whether to chunk the content into smaller chunks."""
    extract_tables: bool = False
    """Whether to extract tables from the content. This requires the 'gmft' dependency."""
    max_chars: int = DEFAULT_MAX_CHARACTERS
    """The size of each chunk in characters."""
    max_overlap: int = DEFAULT_MAX_OVERLAP
    """The overlap between chunks in characters."""
    ocr_backend: OcrBackendType | None = "tesseract"
    """The OCR backend to use.

    Notes:
        - If set to 'None', OCR will not be performed.
    """
    ocr_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None
    """Configuration to pass to the OCR backend."""
    gmft_config: GMFTConfig | None = None
    """GMFT configuration."""
    post_processing_hooks: list[PostProcessingHook] | None = None
    """Post processing hooks to call after processing is done and before the final result is returned."""
    validators: list[ValidationHook] | None = None
    """Validation hooks to call after processing is done and before post-processing and result return."""
    extract_entities: bool = False
    """Whether to extract named entities from the content."""
    extract_keywords: bool = False
    """Whether to extract keywords from the content."""
    keyword_count: int = 10
    """Number of keywords to extract if extract_keywords is True."""
    custom_entity_patterns: frozenset[tuple[str, str]] | None = None
    """Custom entity patterns as a frozenset of (entity_type, regex_pattern) tuples."""
    auto_detect_language: bool = False
    """Whether to automatically detect language and configure OCR accordingly."""
    language_detection_config: LanguageDetectionConfig | None = None
    """Configuration for language detection. If None, uses default settings."""
    spacy_entity_extraction_config: SpacyEntityExtractionConfig | None = None
    """Configuration for spaCy entity extraction. If None, uses default settings."""

    def __post_init__(self) -> None:
        if self.custom_entity_patterns is not None and isinstance(self.custom_entity_patterns, dict):
            object.__setattr__(self, "custom_entity_patterns", frozenset(self.custom_entity_patterns.items()))
        if self.post_processing_hooks is not None and isinstance(self.post_processing_hooks, list):
            object.__setattr__(self, "post_processing_hooks", tuple(self.post_processing_hooks))
        if self.validators is not None and isinstance(self.validators, list):
            object.__setattr__(self, "validators", tuple(self.validators))
        from kreuzberg._ocr._easyocr import EasyOCRConfig
        from kreuzberg._ocr._paddleocr import PaddleOCRConfig
        from kreuzberg._ocr._tesseract import TesseractConfig

        if self.ocr_backend is None and self.ocr_config is not None:
            raise ValidationError("'ocr_backend' is None but 'ocr_config' is provided")

        if self.ocr_config is not None and (
            (self.ocr_backend == "tesseract" and not isinstance(self.ocr_config, TesseractConfig))
            or (self.ocr_backend == "easyocr" and not isinstance(self.ocr_config, EasyOCRConfig))
            or (self.ocr_backend == "paddleocr" and not isinstance(self.ocr_config, PaddleOCRConfig))
        ):
            raise ValidationError(
                "incompatible 'ocr_config' value provided for 'ocr_backend'",
                context={"ocr_backend": self.ocr_backend, "ocr_config": type(self.ocr_config).__name__},
            )

    def get_config_dict(self) -> dict[str, Any]:
        """Returns the OCR configuration object based on the backend specified.

        Returns:
            A dict of the OCR configuration or an empty dict if no backend is provided.
        """
        if self.ocr_backend is not None:
            if self.ocr_config is not None:
                return asdict(self.ocr_config)
            if self.ocr_backend == "tesseract":
                from kreuzberg._ocr._tesseract import TesseractConfig

                return asdict(TesseractConfig())
            if self.ocr_backend == "easyocr":
                from kreuzberg._ocr._easyocr import EasyOCRConfig

                return asdict(EasyOCRConfig())
            from kreuzberg._ocr._paddleocr import PaddleOCRConfig

            return asdict(PaddleOCRConfig())
        return {}
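The `__post_init__` method above converts unhashable values (a dict of entity patterns, lists of hooks) into hashable ones, which matters because the class is declared with `unsafe_hash=True`. The sketch below shows that idea on a hypothetical two-field stand-in, `ConfigSketch`; it is not the library's class and omits the OCR validation step.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(unsafe_hash=True)
class ConfigSketch:
    """Hypothetical stand-in mirroring two ExtractionConfig fields."""

    ocr_backend: str | None = "tesseract"
    custom_entity_patterns: frozenset[tuple[str, str]] | dict[str, str] | None = None

    def __post_init__(self) -> None:
        # A dict cannot be hashed, so store it as a frozenset of
        # (entity_type, regex_pattern) pairs instead.
        if isinstance(self.custom_entity_patterns, dict):
            self.custom_entity_patterns = frozenset(self.custom_entity_patterns.items())


config = ConfigSketch(custom_entity_patterns={"INVOICE_ID": r"INV-\d+"})
same = ConfigSketch(custom_entity_patterns={"INVOICE_ID": r"INV-\d+"})

# Equal configurations hash equally, so a config can serve as a cache key.
cache = {config: "cached result"}
```

With the conversion in place, two configurations built from equal dicts compare equal and hash identically, so `cache[same]` finds the entry stored under `config`.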

Attributes¶

`auto_detect_language: bool = False` `class-attribute` `instance-attribute` ¶

Whether to automatically detect language and configure OCR accordingly.

`chunk_content: bool = False` `class-attribute` `instance-attribute` ¶

Whether to chunk the content into smaller chunks.

`custom_entity_patterns: frozenset[tuple[str, str]] | None = None` `class-attribute` `instance-attribute` ¶

Custom entity patterns as a frozenset of (entity_type, regex_pattern) tuples.

`extract_entities: bool = False` `class-attribute` `instance-attribute` ¶

Whether to extract named entities from the content.

`extract_keywords: bool = False` `class-attribute` `instance-attribute` ¶

Whether to extract keywords from the content.

`extract_tables: bool = False` `class-attribute` `instance-attribute` ¶

Whether to extract tables from the content. This requires the 'gmft' dependency.

`force_ocr: bool = False` `class-attribute` `instance-attribute` ¶

Whether to force OCR.

`gmft_config: GMFTConfig | None = None` `class-attribute` `instance-attribute` ¶

GMFT configuration.

`keyword_count: int = 10` `class-attribute` `instance-attribute` ¶

Number of keywords to extract if extract_keywords is True.

`language_detection_config: LanguageDetectionConfig | None = None` `class-attribute` `instance-attribute` ¶

Configuration for language detection. If None, uses default settings.

`max_chars: int = DEFAULT_MAX_CHARACTERS` `class-attribute` `instance-attribute` ¶

The size of each chunk in characters.

`max_overlap: int = DEFAULT_MAX_OVERLAP` `class-attribute` `instance-attribute` ¶

The overlap between chunks in characters.

`ocr_backend: OcrBackendType | None = 'tesseract'` `class-attribute` `instance-attribute` ¶

The OCR backend to use.

Notes

If set to None, OCR will not be performed.

`ocr_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None` `class-attribute` `instance-attribute` ¶

Configuration to pass to the OCR backend.

`post_processing_hooks: list[PostProcessingHook] | None = None` `class-attribute` `instance-attribute` ¶

Post processing hooks to call after processing is done and before the final result is returned.

`spacy_entity_extraction_config: SpacyEntityExtractionConfig | None = None` `class-attribute` `instance-attribute` ¶

Configuration for spaCy entity extraction. If None, uses default settings.

`validators: list[ValidationHook] | None = None` `class-attribute` `instance-attribute` ¶

Validation hooks to call after processing is done and before post-processing and result return.
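To make the relationship between `max_chars` and `max_overlap` concrete, here is a naive fixed-window sketch. The function `chunk_text` is hypothetical and is not how the library splits content; the actual chunker may, for example, prefer to break at semantic boundaries. It only illustrates what "chunk size" and "overlap" mean in characters.

```python
def chunk_text(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars that share max_overlap characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must satisfy 0 <= max_overlap < max_chars")

    step = max_chars - max_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", max_chars=4, max_overlap=1)
# chunks == ["abcd", "defg", "ghij"]: each chunk repeats the last character of the previous one
```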

Functions¶

`get_config_dict() -> dict[str, Any]` ¶

Returns the OCR configuration as a dict, based on the backend specified.

RETURNS	DESCRIPTION
`dict[str, Any]`	A dict of the OCR configuration or an empty dict if no backend is provided.

Source code in kreuzberg/_types.py

def get_config_dict(self) -> dict[str, Any]:
    """Returns the OCR configuration object based on the backend specified.

    Returns:
        A dict of the OCR configuration or an empty dict if no backend is provided.
    """
    if self.ocr_backend is not None:
        if self.ocr_config is not None:
            return asdict(self.ocr_config)
        if self.ocr_backend == "tesseract":
            from kreuzberg._ocr._tesseract import TesseractConfig

            return asdict(TesseractConfig())
        if self.ocr_backend == "easyocr":
            from kreuzberg._ocr._easyocr import EasyOCRConfig

            return asdict(EasyOCRConfig())
        from kreuzberg._ocr._paddleocr import PaddleOCRConfig

        return asdict(PaddleOCRConfig())
    return {}
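The fallback order in `get_config_dict` (no backend, then explicit config, then backend defaults) can be sketched with stand-in classes. `TesseractSketch`, `EasyOCRSketch`, the `DEFAULTS` registry, and `config_dict` are all hypothetical names for this example. Unlike the source above, which falls through to `PaddleOCRConfig` for any remaining backend, this sketch raises `KeyError` for an unknown backend name.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class TesseractSketch:
    language: str = "eng"


@dataclass(frozen=True)
class EasyOCRSketch:
    language: str = "en"


# Hypothetical registry mapping backend names to their default config classes.
DEFAULTS: dict[str, type] = {"tesseract": TesseractSketch, "easyocr": EasyOCRSketch}


def config_dict(ocr_backend: str | None, ocr_config: Any = None) -> dict[str, Any]:
    if ocr_backend is None:
        return {}  # OCR disabled: nothing to pass to a backend
    if ocr_config is not None:
        return asdict(ocr_config)  # an explicit config takes precedence
    return asdict(DEFAULTS[ocr_backend]())  # otherwise use the backend's defaults


defaults = config_dict("tesseract")
explicit = config_dict("easyocr", EasyOCRSketch(language="de"))
```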

TableData¶

A TypedDict that contains data extracted from tables in documents:

`kreuzberg.TableData` ¶

Bases: TypedDict

Table data, returned from table extraction.

Source code in kreuzberg/_types.py

class TableData(TypedDict):
    """Table data, returned from table extraction."""

    cropped_image: Image
    """The cropped image of the table."""
    df: DataFrame
    """The table data as a pandas DataFrame."""
    page_number: int
    """The page number of the table."""
    text: str
    """The table text as a markdown string."""

Attributes¶

`cropped_image: Image` `instance-attribute` ¶

The cropped image of the table.

`df: DataFrame` `instance-attribute` ¶

The table data as a pandas DataFrame.

`page_number: int` `instance-attribute` ¶

The page number of the table.

`text: str` `instance-attribute` ¶

The table text as a markdown string.
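Because `TableData` is a `TypedDict`, its fields are read with key access (`table["text"]`) rather than attribute access. The sketch below groups table markdown by page using a reduced stand-in, `TableSketch`, that leaves out the image and DataFrame fields so it runs without pandas or Pillow; `tables_by_page` is a hypothetical helper.

```python
from __future__ import annotations

from typing import TypedDict


class TableSketch(TypedDict):
    """Stand-in for TableData without the cropped_image and df fields."""

    page_number: int
    text: str


def tables_by_page(tables: list[TableSketch]) -> dict[int, list[str]]:
    """Collect the markdown text of each table under its page number."""
    pages: dict[int, list[str]] = {}
    for table in tables:
        pages.setdefault(table["page_number"], []).append(table["text"])
    return pages


tables: list[TableSketch] = [
    {"page_number": 1, "text": "| a | b |"},
    {"page_number": 2, "text": "| c | d |"},
    {"page_number": 1, "text": "| e | f |"},
]
grouped = tables_by_page(tables)
```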

OCR Configuration¶

TesseractConfig¶

`kreuzberg.TesseractConfig` `dataclass` ¶

Configuration options for Tesseract OCR engine.

Source code in kreuzberg/_ocr/_tesseract.py

@dataclass(unsafe_hash=True, frozen=True)
class TesseractConfig:
    """Configuration options for Tesseract OCR engine."""

    classify_use_pre_adapted_templates: bool = True
    """Whether to use pre-adapted templates during classification to improve recognition accuracy."""
    language: str = "eng"
    """Language code to use for OCR.
    Examples:
            -   'eng' for English
            -   'deu' for German
            -   multiple languages combined with '+' (e.g. 'eng+deu')
    """
    language_model_ngram_on: bool = False
    """Enable or disable the use of n-gram-based language models for improved text recognition.

    Default is False for optimal performance on modern documents. Enable for degraded or historical text."""
    psm: PSMMode = PSMMode.AUTO_ONLY
    """Page segmentation mode (PSM) to guide Tesseract on how to segment the image (e.g., single block, single line)."""
    tessedit_dont_blkrej_good_wds: bool = True
    """If True, prevents block rejection of words identified as good, improving text output quality."""
    tessedit_dont_rowrej_good_wds: bool = True
    """If True, prevents row rejection of words identified as good, avoiding unnecessary omissions."""
    tessedit_enable_dict_correction: bool = True
    """Enable or disable dictionary-based correction for recognized text to improve word accuracy."""
    tessedit_char_whitelist: str = ""
    """Whitelist of characters that Tesseract is allowed to recognize. Empty string means no restriction."""
    tessedit_use_primary_params_model: bool = True
    """If True, forces the use of the primary parameters model for text recognition."""
    textord_space_size_is_variable: bool = True
    """Allow variable spacing between words, useful for text with irregular spacing."""
    thresholding_method: bool = False
    """Enable or disable specific thresholding methods during image preprocessing for better OCR accuracy."""

Attributes¶

`classify_use_pre_adapted_templates: bool = True` `class-attribute` `instance-attribute` ¶

Whether to use pre-adapted templates during classification to improve recognition accuracy.

`language: str = 'eng'` `class-attribute` `instance-attribute` ¶

Language code to use for OCR. Examples: 'eng' for English, 'deu' for German, or multiple languages combined with '+' (e.g. 'eng+deu').

`language_model_ngram_on: bool = False` `class-attribute` `instance-attribute` ¶

Enable or disable the use of n-gram-based language models for improved text recognition.

Default is False for optimal performance on modern documents. Enable for degraded or historical text.

`psm: PSMMode = PSMMode.AUTO_ONLY` `class-attribute` `instance-attribute` ¶

Page segmentation mode (PSM) to guide Tesseract on how to segment the image (e.g., single block, single line).

`tessedit_char_whitelist: str = ''` `class-attribute` `instance-attribute` ¶

Whitelist of characters that Tesseract is allowed to recognize. Empty string means no restriction.

`tessedit_dont_blkrej_good_wds: bool = True` `class-attribute` `instance-attribute` ¶

If True, prevents block rejection of words identified as good, improving text output quality.

`tessedit_dont_rowrej_good_wds: bool = True` `class-attribute` `instance-attribute` ¶

If True, prevents row rejection of words identified as good, avoiding unnecessary omissions.

`tessedit_enable_dict_correction: bool = True` `class-attribute` `instance-attribute` ¶

Enable or disable dictionary-based correction for recognized text to improve word accuracy.

`tessedit_use_primary_params_model: bool = True` `class-attribute` `instance-attribute` ¶

If True, forces the use of the primary parameters model for text recognition.

`textord_space_size_is_variable: bool = True` `class-attribute` `instance-attribute` ¶

Allow variable spacing between words, useful for text with irregular spacing.

`thresholding_method: bool = False` `class-attribute` `instance-attribute` ¶

Enable or disable specific thresholding methods during image preprocessing for better OCR accuracy.
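Since the class is declared with `frozen=True`, an instance cannot be modified after creation; a variant is obtained by copying. The sketch below demonstrates this with `dataclasses.replace` on a two-field stand-in, `TesseractSketch`. `join_languages` is a hypothetical helper that builds the `'+'`-separated language string described above.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(unsafe_hash=True, frozen=True)
class TesseractSketch:
    """Stand-in with two of the TesseractConfig fields (hypothetical class)."""

    language: str = "eng"
    tessedit_char_whitelist: str = ""


def join_languages(*codes: str) -> str:
    """Combine language codes with '+', e.g. 'eng+deu'."""
    return "+".join(codes)


base = TesseractSketch()
# replace() builds a new instance; the original stays untouched.
bilingual = replace(base, language=join_languages("eng", "deu"))

try:
    base.language = "deu"  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```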

EasyOCRConfig¶

`kreuzberg.EasyOCRConfig` `dataclass` ¶

Configuration options for EasyOCR.

Source code in kreuzberg/_ocr/_easyocr.py

@dataclass(unsafe_hash=True, frozen=True)
class EasyOCRConfig:
    """Configuration options for EasyOCR."""

    add_margin: float = 0.1
    """Extend bounding boxes in all directions."""
    adjust_contrast: float = 0.5
    """Target contrast level for low contrast text."""
    beam_width: int = 5
    """Beam width for beam search in recognition."""
    canvas_size: int = 2560
    """Maximum image dimension for detection."""
    contrast_ths: float = 0.1
    """Contrast threshold for preprocessing."""
    decoder: Literal["greedy", "beamsearch", "wordbeamsearch"] = "greedy"
    """Decoder method. Options: 'greedy', 'beamsearch', 'wordbeamsearch'."""
    height_ths: float = 0.5
    """Maximum difference in box height for merging."""
    language: str | list[str] = "en"
    """Language or languages to use for OCR. Can be a single language code (e.g., 'en'),
    a comma-separated string of language codes (e.g., 'en,ch_sim'), or a list of language codes."""
    link_threshold: float = 0.4
    """Link confidence threshold."""
    low_text: float = 0.4
    """Text low-bound score."""
    mag_ratio: float = 1.0
    """Image magnification ratio."""
    min_size: int = 10
    """Minimum text box size in pixels."""
    rotation_info: list[int] | None = None
    """List of angles to try for detection."""
    slope_ths: float = 0.1
    """Maximum slope for merging text boxes."""
    text_threshold: float = 0.7
    """Text confidence threshold."""
    use_gpu: bool = False
    """Whether to use GPU for inference. DEPRECATED: Use 'device' parameter instead."""
    device: DeviceType = "auto"
    """Device to use for inference. Options: 'cpu', 'cuda', 'mps', 'auto'."""
    gpu_memory_limit: float | None = None
    """Maximum GPU memory to use in GB. None for no limit."""
    fallback_to_cpu: bool = True
    """Whether to fall back to CPU if the requested device is unavailable."""
    width_ths: float = 0.5
    """Maximum horizontal distance for merging boxes."""
    x_ths: float = 1.0
    """Maximum horizontal distance for paragraph merging."""
    y_ths: float = 0.5
    """Maximum vertical distance for paragraph merging."""
    ycenter_ths: float = 0.5
    """Maximum shift in y direction for merging."""

Attributes¶

`add_margin: float = 0.1` `class-attribute` `instance-attribute` ¶

Extend bounding boxes in all directions.

`adjust_contrast: float = 0.5` `class-attribute` `instance-attribute` ¶

Target contrast level for low contrast text.

`beam_width: int = 5` `class-attribute` `instance-attribute` ¶

Beam width for beam search in recognition.

`canvas_size: int = 2560` `class-attribute` `instance-attribute` ¶

Maximum image dimension for detection.

`contrast_ths: float = 0.1` `class-attribute` `instance-attribute` ¶

Contrast threshold for preprocessing.

`decoder: Literal['greedy', 'beamsearch', 'wordbeamsearch'] = 'greedy'` `class-attribute` `instance-attribute` ¶

Decoder method. Options: 'greedy', 'beamsearch', 'wordbeamsearch'.

`device: DeviceType = 'auto'` `class-attribute` `instance-attribute` ¶

Device to use for inference. Options: 'cpu', 'cuda', 'mps', 'auto'.

`fallback_to_cpu: bool = True` `class-attribute` `instance-attribute` ¶

Whether to fall back to CPU if the requested device is unavailable.

`gpu_memory_limit: float | None = None` `class-attribute` `instance-attribute` ¶

Maximum GPU memory to use in GB. None for no limit.

`height_ths: float = 0.5` `class-attribute` `instance-attribute` ¶

Maximum difference in box height for merging.

`language: str | list[str] = 'en'` `class-attribute` `instance-attribute` ¶

Language or languages to use for OCR. Can be a single language code (e.g., 'en'), a comma-separated string of language codes (e.g., 'en,ch_sim'), or a list of language codes.

`link_threshold: float = 0.4` `class-attribute` `instance-attribute` ¶

Link confidence threshold.

`low_text: float = 0.4` `class-attribute` `instance-attribute` ¶

Text low-bound score.

`mag_ratio: float = 1.0` `class-attribute` `instance-attribute` ¶

Image magnification ratio.

`min_size: int = 10` `class-attribute` `instance-attribute` ¶

Minimum text box size in pixels.

`rotation_info: list[int] | None = None` `class-attribute` `instance-attribute` ¶

List of angles to try for detection.

`slope_ths: float = 0.1` `class-attribute` `instance-attribute` ¶

Maximum slope for merging text boxes.

`text_threshold: float = 0.7` `class-attribute` `instance-attribute` ¶

Text confidence threshold.

`use_gpu: bool = False` `class-attribute` `instance-attribute` ¶

Whether to use GPU for inference. DEPRECATED: Use 'device' parameter instead.

`width_ths: float = 0.5` `class-attribute` `instance-attribute` ¶

Maximum horizontal distance for merging boxes.

`x_ths: float = 1.0` `class-attribute` `instance-attribute` ¶

Maximum horizontal distance for paragraph merging.

`y_ths: float = 0.5` `class-attribute` `instance-attribute` ¶

Maximum vertical distance for paragraph merging.

`ycenter_ths: float = 0.5` `class-attribute` `instance-attribute` ¶

Maximum shift in y direction for merging.
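The `language` field accepts three shapes: a single code, a comma-separated string, or a list. One way to reduce these to a single form is sketched below. `normalize_languages` is a hypothetical helper written for this example, and nothing here asserts that the library normalizes the value in the same way.

```python
from __future__ import annotations


def normalize_languages(language: str | list[str]) -> list[str]:
    """Return the language setting as a list of codes."""
    if isinstance(language, str):
        # Covers both a single code ('en') and a comma-separated string ('en,ch_sim').
        return [code.strip() for code in language.split(",") if code.strip()]
    return list(language)  # copy so the caller's list is not shared


single = normalize_languages("en")
combined = normalize_languages("en,ch_sim")
listed = normalize_languages(["en", "de"])
```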

PaddleOCRConfig¶

`kreuzberg.PaddleOCRConfig` `dataclass` ¶

Configuration options for PaddleOCR.

This dataclass provides type hints and documentation for all PaddleOCR parameters.

Source code in kreuzberg/_ocr/_paddleocr.py

@dataclass(unsafe_hash=True, frozen=True)
class PaddleOCRConfig:
    """Configuration options for PaddleOCR.

    This dataclass provides type hints and documentation for all PaddleOCR parameters.
    """

    cls_image_shape: str = "3,48,192"
    """Image shape for classification algorithm in format 'channels,height,width'."""
    det_algorithm: Literal["DB", "EAST", "SAST", "PSE", "FCE", "PAN", "CT", "DB++", "Layout"] = "DB"
    """Detection algorithm."""
    det_db_box_thresh: float = 0.5
    """Score threshold for detected boxes. Boxes below this value are discarded."""
    det_db_thresh: float = 0.3
    """Binarization threshold for DB output map."""
    det_db_unclip_ratio: float = 2.0
    """Expansion ratio for detected text boxes."""
    det_east_cover_thresh: float = 0.1
    """Score threshold for EAST output boxes."""
    det_east_nms_thresh: float = 0.2
    """NMS threshold for EAST model output boxes."""
    det_east_score_thresh: float = 0.8
    """Binarization threshold for EAST output map."""
    det_max_side_len: int = 960
    """Maximum size of image long side. Images exceeding this will be proportionally resized."""
    det_model_dir: str | None = None
    """Directory for detection model. If None, uses default model location."""
    drop_score: float = 0.5
    """Filter recognition results by confidence score. Results below this are discarded."""
    enable_mkldnn: bool = False
    """Whether to enable MKL-DNN acceleration (Intel CPU only)."""
    gpu_mem: int = 8000
    """GPU memory size (in MB) to use for initialization."""
    language: str = "en"
    """Language to use for OCR."""
    max_text_length: int = 25
    """Maximum text length that the recognition algorithm can recognize."""
    rec: bool = True
    """Enable text recognition when using the ocr() function."""
    rec_algorithm: Literal[
        "CRNN",
        "SRN",
        "NRTR",
        "SAR",
        "SEED",
        "SVTR",
        "SVTR_LCNet",
        "ViTSTR",
        "ABINet",
        "VisionLAN",
        "SPIN",
        "RobustScanner",
        "RFL",
    ] = "CRNN"
    """Recognition algorithm."""
    rec_image_shape: str = "3,32,320"
    """Image shape for recognition algorithm in format 'channels,height,width'."""
    rec_model_dir: str | None = None
    """Directory for recognition model. If None, uses default model location."""
    table: bool = True
    """Whether to enable table recognition."""
    use_angle_cls: bool = True
    """Whether to use text orientation classification model."""
    use_gpu: bool = False
    """Whether to use GPU for inference. DEPRECATED: Use 'device' parameter instead."""
    device: DeviceType = "auto"
    """Device to use for inference. Options: 'cpu', 'cuda', 'auto'. Note: MPS not supported by PaddlePaddle."""
    gpu_memory_limit: float | None = None
    """Maximum GPU memory to use in GB. None for no limit."""
    fallback_to_cpu: bool = True
    """Whether to fall back to CPU if the requested device is unavailable."""
    use_space_char: bool = True
    """Whether to recognize spaces."""
    use_zero_copy_run: bool = False
    """Whether to enable zero_copy_run for inference optimization."""

Attributes¶

`cls_image_shape: str = '3,48,192'` `class-attribute` `instance-attribute` ¶

Image shape for classification algorithm in format 'channels,height,width'.

`det_algorithm: Literal['DB', 'EAST', 'SAST', 'PSE', 'FCE', 'PAN', 'CT', 'DB++', 'Layout'] = 'DB'` `class-attribute` `instance-attribute` ¶

Detection algorithm.

`det_db_box_thresh: float = 0.5` `class-attribute` `instance-attribute` ¶

Score threshold for detected boxes. Boxes below this value are discarded.

`det_db_thresh: float = 0.3` `class-attribute` `instance-attribute` ¶

Binarization threshold for DB output map.

`det_db_unclip_ratio: float = 2.0` `class-attribute` `instance-attribute` ¶

Expansion ratio for detected text boxes.

`det_east_cover_thresh: float = 0.1` `class-attribute` `instance-attribute` ¶

Score threshold for EAST output boxes.

`det_east_nms_thresh: float = 0.2` `class-attribute` `instance-attribute` ¶

NMS threshold for EAST model output boxes.

`det_east_score_thresh: float = 0.8` `class-attribute` `instance-attribute` ¶

Binarization threshold for EAST output map.

`det_max_side_len: int = 960` `class-attribute` `instance-attribute` ¶

Maximum size of image long side. Images exceeding this will be proportionally resized.

`det_model_dir: str | None = None` `class-attribute` `instance-attribute` ¶

Directory for detection model. If None, uses default model location.

`device: DeviceType = 'auto'` `class-attribute` `instance-attribute` ¶

Device to use for inference. Options: 'cpu', 'cuda', 'auto'. Note: MPS not supported by PaddlePaddle.

`drop_score: float = 0.5` `class-attribute` `instance-attribute` ¶

Filter recognition results by confidence score. Results below this are discarded.

`enable_mkldnn: bool = False` `class-attribute` `instance-attribute` ¶

Whether to enable MKL-DNN acceleration (Intel CPU only).

`fallback_to_cpu: bool = True` `class-attribute` `instance-attribute` ¶

Whether to fall back to CPU if the requested device is unavailable.

`gpu_mem: int = 8000` `class-attribute` `instance-attribute` ¶

GPU memory size (in MB) to use for initialization.

`gpu_memory_limit: float | None = None` `class-attribute` `instance-attribute` ¶

Maximum GPU memory to use in GB. None for no limit.

`language: str = 'en'` `class-attribute` `instance-attribute` ¶

Language to use for OCR.

`max_text_length: int = 25` `class-attribute` `instance-attribute` ¶

Maximum text length that the recognition algorithm can recognize.

`rec: bool = True` `class-attribute` `instance-attribute` ¶

Enable text recognition when using the ocr() function.

`rec_algorithm: Literal['CRNN', 'SRN', 'NRTR', 'SAR', 'SEED', 'SVTR', 'SVTR_LCNet', 'ViTSTR', 'ABINet', 'VisionLAN', 'SPIN', 'RobustScanner', 'RFL'] = 'CRNN'` `class-attribute` `instance-attribute` ¶

Recognition algorithm.

`rec_image_shape: str = '3,32,320'` `class-attribute` `instance-attribute` ¶

Image shape for recognition algorithm in format 'channels,height,width'.

`rec_model_dir: str | None = None` `class-attribute` `instance-attribute` ¶

Directory for recognition model. If None, uses default model location.

`table: bool = True` `class-attribute` `instance-attribute` ¶

Whether to enable table recognition.

`use_angle_cls: bool = True` `class-attribute` `instance-attribute` ¶

Whether to use text orientation classification model.

`use_gpu: bool = False` `class-attribute` `instance-attribute` ¶

Whether to use GPU for inference. DEPRECATED: Use 'device' parameter instead.

`use_space_char: bool = True` `class-attribute` `instance-attribute` ¶

Whether to recognize spaces.

`use_zero_copy_run: bool = False` `class-attribute` `instance-attribute` ¶

Whether to enable zero_copy_run for inference optimization.
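The `cls_image_shape` and `rec_image_shape` fields encode three integers in a `'channels,height,width'` string. A small parser such as the hypothetical `parse_image_shape` below can validate a value before it is placed in a config; it is an illustration only and not part of the library.

```python
def parse_image_shape(shape: str) -> tuple[int, int, int]:
    """Parse a 'channels,height,width' string into three integers."""
    parts = [int(part) for part in shape.split(",")]  # int() tolerates surrounding spaces
    if len(parts) != 3:
        raise ValueError(f"expected 'channels,height,width', got {shape!r}")
    channels, height, width = parts
    return channels, height, width


cls_shape = parse_image_shape("3,48,192")
rec_shape = parse_image_shape("3,32,320")
```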

GMFT Configuration¶

Configuration options for the GMFT table extraction engine:

`kreuzberg.GMFTConfig` `dataclass` ¶

Configuration options for GMFT.

This class encapsulates the configuration options for GMFT, providing a way to customize its behavior.

Source code in kreuzberg/_gmft.py

@dataclass(unsafe_hash=True)
class GMFTConfig:
    """Configuration options for GMFT.

    This class encapsulates the configuration options for GMFT, providing a way to customize its behavior.
    """

    verbosity: int = 0
    """
    Verbosity level for logging.

    0: errors only
    1: print warnings
    2: print warnings and info
    3: print warnings, info, and debug
    """
    formatter_base_threshold: float = 0.3
    """
    Base threshold for the confidence demanded of a table feature (row/column).

    Note that a low threshold is actually better: detecting too many rows generally leaves numbers aligned and only adds empty rows, whereas detecting fewer rows than expected merges cells, which is worse.
    """
    cell_required_confidence: dict[Literal[0, 1, 2, 3, 4, 5, 6], float] = field(
        default_factory=lambda: {
            0: 0.3,
            1: 0.3,
            2: 0.3,
            3: 0.3,
            4: 0.5,
            5: 0.5,
            6: 99,
        },
        hash=False,
    )
    """
    Confidences required (>=) for a row/column feature to be considered good. See TATRFormattedTable.id2label

    But low confidences may be better than too high confidence (see formatter_base_threshold)
    """
    detector_base_threshold: float = 0.9
    """Minimum confidence score required for a table"""
    remove_null_rows: bool = True
    """
    Flag to remove rows with no text.
    """
    enable_multi_header: bool = False
    """
    Enable multi-indices in the dataframe.

    If false, then multiple headers will be merged column-wise.
    """
    semantic_spanning_cells: bool = False
    """
    [Experimental] Enable semantic spanning cells, which often encode hierarchical multi-level indices.
    """
    semantic_hierarchical_left_fill: Literal["algorithm", "deep"] | None = "algorithm"
    """
    [Experimental] When semantic spanning cells is enabled and a left header is detected that might represent a group of rows, that same value is duplicated for each row.

    Possible values: 'algorithm', 'deep', None.

    'algorithm': assumes that the higher-level header is always the first row followed by several empty rows.
    'deep': merges headers according to the spanning cells detected by the Table Transformer.
    None: headers are not duplicated.
    """
    large_table_if_n_rows_removed: int = 8
    """
    If >= n rows are removed due to non-maxima suppression (NMS), then this table is classified as a large table.
    """
    large_table_threshold: int = 10
    """
    With large tables, table transformer struggles with placing too many overlapping rows. Luckily, with more rows, we have more info on the usual size of text, which we can use to make a guess on the height such that no rows are merged or overlapping.

    Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold). Set 9999 to disable; set 0 to force large table assumption to run every time.
    """
    large_table_row_overlap_threshold: float = 0.2
    """
    With large tables, table transformer struggles with placing too many overlapping rows. Luckily, with more rows, we have more info on the usual size of text, which we can use to make a guess on the height such that no rows are merged or overlapping.

    Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold).
    """
    large_table_maximum_rows: int = 1000
    """
    Maximum number of rows allowed for a large table.
    """
    force_large_table_assumption: bool | None = None
    """
    Force the large table assumption to be applied, regardless of the number of rows and overlap.
    """
    total_overlap_reject_threshold: float = 0.9
    """
    Reject if total overlap is > 90% of table area.
    """
    total_overlap_warn_threshold: float = 0.1
    """
    Warn if total overlap is > 10% of table area.
    """
    nms_warn_threshold: int = 5
    """
    Warn if non maxima suppression removes > 5 rows.
    """
    iob_reject_threshold: float = 0.05
    """
    Reject if iob between textbox and cell is < 5%.
    """
    iob_warn_threshold: float = 0.5
    """
    Warn if iob between textbox and cell is < 50%.
    """

Attributes¶

`cell_required_confidence: dict[Literal[0, 1, 2, 3, 4, 5, 6], float] = field(default_factory=lambda: {0: 0.3, 1: 0.3, 2: 0.3, 3: 0.3, 4: 0.5, 5: 0.5, 6: 99}, hash=False)` `class-attribute` `instance-attribute` ¶

Minimum confidence (>=) required for a row/column feature to be considered good, keyed by label id. See TATRFormattedTable.id2label for the label ids.

Lower confidence thresholds may nevertheless work better than very high ones (see formatter_base_threshold).
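As a rough illustration of the `>=` rule, the sketch below checks one feature against the default mapping. The helper `is_feature_accepted` is hypothetical and not part of kreuzberg or GMFT; it also assumes confidences are probabilities in the range 0 to 1, in which case the default of 99 for label 6 can never be met.

```python
# Default thresholds copied from the signature above, keyed by label id.
DEFAULT_CELL_REQUIRED_CONFIDENCE = {0: 0.3, 1: 0.3, 2: 0.3, 3: 0.3, 4: 0.5, 5: 0.5, 6: 99}


def is_feature_accepted(label_id, confidence, required=None):
    """Hypothetical helper: apply the documented '>=' comparison for one feature."""
    thresholds = DEFAULT_CELL_REQUIRED_CONFIDENCE if required is None else required
    return confidence >= thresholds[label_id]


accepted_low = is_feature_accepted(0, 0.3)     # meets the 0.3 threshold exactly
rejected_mid = is_feature_accepted(4, 0.49)    # below the 0.5 threshold
rejected_six = is_feature_accepted(6, 1.0)     # 1.0 is still below 99
```

Passing a custom `required` mapping mirrors overriding `cell_required_confidence` in the configuration.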

`detector_base_threshold: float = 0.9` `class-attribute` `instance-attribute` ¶

Minimum confidence score required for a table.

`enable_multi_header: bool = False` `class-attribute` `instance-attribute` ¶

Enable multi-indices in the dataframe.

If false, then multiple headers will be merged column-wise.

`force_large_table_assumption: bool | None = None` `class-attribute` `instance-attribute` ¶

Force the large table assumption to be applied, regardless of the number of rows and overlap.

`formatter_base_threshold: float = 0.3` `class-attribute` `instance-attribute` ¶

Base threshold for the confidence demanded of a table feature (row/column).

Note that a low threshold is actually better: with too many rows, numbers generally stay aligned and the result simply contains many empty rows, whereas too few rows causes cells to be merged, which is worse.

`iob_reject_threshold: float = 0.05` `class-attribute` `instance-attribute` ¶

Reject if iob between textbox and cell is < 5%.

`iob_warn_threshold: float = 0.5` `class-attribute` `instance-attribute` ¶

Warn if iob between textbox and cell is < 50%.

`large_table_if_n_rows_removed: int = 8` `class-attribute` `instance-attribute` ¶

If >= n rows are removed due to non-maxima suppression (NMS), then this table is classified as a large table.

`large_table_maximum_rows: int = 1000` `class-attribute` `instance-attribute` ¶

Maximum number of rows allowed for a large table.

`large_table_row_overlap_threshold: float = 0.2` `class-attribute` `instance-attribute` ¶

With large tables, the Table Transformer tends to place too many overlapping rows. With more rows, however, there is more information about the usual size of the text, which can be used to estimate a row height such that no rows are merged or overlapping.

Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold).

`large_table_threshold: int = 10` `class-attribute` `instance-attribute` ¶

With large tables, the Table Transformer tends to place too many overlapping rows. With more rows, however, there is more information about the usual size of the text, which can be used to estimate a row height such that no rows are merged or overlapping.

Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold). Set 9999 to disable; set 0 to force large table assumption to run every time.
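The two-part condition can be sketched as a small predicate. This is a minimal illustration of the rule as worded here, not GMFT's implementation: the function name and its arguments are hypothetical, and the effects of `force_large_table_assumption` and `large_table_if_n_rows_removed` are deliberately left out.

```python
def use_large_table_assumption(
    n_rows,
    total_overlap,
    large_table_threshold=10,
    large_table_row_overlap_threshold=0.2,
):
    """Hypothetical predicate: both comparisons are strict, as in the description."""
    return (
        n_rows > large_table_threshold
        and total_overlap > large_table_row_overlap_threshold
    )


eleven_rows = use_large_table_assumption(11, 0.25)   # both conditions hold
ten_rows = use_large_table_assumption(10, 0.25)      # 10 is not > 10
disabled = use_large_table_assumption(50, 0.5, large_table_threshold=9999)
```

With `large_table_threshold=9999` the row-count condition is effectively never satisfied, which matches the "set 9999 to disable" advice.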

`nms_warn_threshold: int = 5` `class-attribute` `instance-attribute` ¶

Warn if non-maxima suppression (NMS) removes > 5 rows.

`remove_null_rows: bool = True` `class-attribute` `instance-attribute` ¶

Flag to remove rows with no text.

`semantic_hierarchical_left_fill: Literal['algorithm', 'deep'] | None = 'algorithm'` `class-attribute` `instance-attribute` ¶

[Experimental] When semantic spanning cells is enabled and a left header is detected that might represent a group of rows, that header value is repeated for each row in the group.

Possible values: 'algorithm', 'deep', None.

- 'algorithm': assumes that the higher-level header is always the first row, followed by several empty rows.
- 'deep': merges headers according to the spanning cells detected by the Table Transformer.
- None: headers are not duplicated.

`semantic_spanning_cells: bool = False` `class-attribute` `instance-attribute` ¶

[Experimental] Enable semantic spanning cells, which often encode hierarchical multi-level indices.

`total_overlap_reject_threshold: float = 0.9` `class-attribute` `instance-attribute` ¶

Reject if total overlap is > 90% of table area.

`total_overlap_warn_threshold: float = 0.1` `class-attribute` `instance-attribute` ¶

Warn if total overlap is > 10% of table area.
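Taken together, the warn and reject thresholds for total overlap split results into three bands. The sketch below shows that banding under the assumption that both comparisons are strict (`>`), as the descriptions suggest; `classify_total_overlap` is a hypothetical name, not a library function.

```python
def classify_total_overlap(total_overlap, warn_threshold=0.1, reject_threshold=0.9):
    """Hypothetical helper: map an overlap fraction of the table area to an outcome."""
    if total_overlap > reject_threshold:
        return "reject"
    if total_overlap > warn_threshold:
        return "warn"
    return "ok"


outcomes = [classify_total_overlap(value) for value in (0.05, 0.5, 0.95)]
```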

`verbosity: int = 0` `class-attribute` `instance-attribute` ¶

Verbosity level for logging.

- 0: errors only
- 1: print warnings
- 2: print warnings and info
- 3: print warnings, info, and debug

Entity Extraction Configuration¶

Configuration options for spaCy-based entity extraction:

`kreuzberg.SpacyEntityExtractionConfig` `dataclass` ¶

Configuration for spaCy-based entity extraction.

Source code in kreuzberg/_entity_extraction.py

@dataclass(unsafe_hash=True, frozen=True)
class SpacyEntityExtractionConfig:
    """Configuration for spaCy-based entity extraction."""

    model_cache_dir: str | Path | None = None
    """Directory to cache spaCy models. If None, uses spaCy's default."""

    language_models: dict[str, str] | tuple[tuple[str, str], ...] | None = None
    """Mapping of language codes to spaCy model names.

    If None, uses default mappings:
    - en: en_core_web_sm
    - de: de_core_news_sm
    - fr: fr_core_news_sm
    - es: es_core_news_sm
    - pt: pt_core_news_sm
    - it: it_core_news_sm
    - nl: nl_core_news_sm
    - zh: zh_core_web_sm
    - ja: ja_core_news_sm
    """

    fallback_to_multilingual: bool = True
    """If True and language-specific model fails, try xx_ent_wiki_sm (multilingual)."""

    max_doc_length: int = 1000000
    """Maximum document length for spaCy processing."""

    batch_size: int = 1000
    """Batch size for processing multiple texts."""

    def __post_init__(self) -> None:
        if self.language_models is None:
            object.__setattr__(self, "language_models", self._get_default_language_models())

        if isinstance(self.language_models, dict):
            object.__setattr__(self, "language_models", tuple(sorted(self.language_models.items())))

    @staticmethod
    def _get_default_language_models() -> dict[str, str]:
        """Get default language model mappings based on available spaCy models."""
        return {
            "en": "en_core_web_sm",
            "de": "de_core_news_sm",
            "fr": "fr_core_news_sm",
            "es": "es_core_news_sm",
            "pt": "pt_core_news_sm",
            "it": "it_core_news_sm",
            "nl": "nl_core_news_sm",
            "zh": "zh_core_web_sm",
            "ja": "ja_core_news_sm",
            "ko": "ko_core_news_sm",
            "ru": "ru_core_news_sm",
            "pl": "pl_core_news_sm",
            "ro": "ro_core_news_sm",
            "el": "el_core_news_sm",
            "da": "da_core_news_sm",
            "fi": "fi_core_news_sm",
            "nb": "nb_core_news_sm",
            "sv": "sv_core_news_sm",
            "ca": "ca_core_news_sm",
            "hr": "hr_core_news_sm",
            "lt": "lt_core_news_sm",
            "mk": "mk_core_news_sm",
            "sl": "sl_core_news_sm",
            "uk": "uk_core_news_sm",
        }

    def get_model_for_language(self, language_code: str) -> str | None:
        """Get the appropriate spaCy model for a language code."""
        if not self.language_models:
            return None

        models_dict = dict(self.language_models) if isinstance(self.language_models, tuple) else self.language_models

        if language_code in models_dict:
            return models_dict[language_code]

        base_lang = language_code.split("-")[0].lower()
        if base_lang in models_dict:
            return models_dict[base_lang]

        return None

    def get_fallback_model(self) -> str | None:
        """Get fallback multilingual model if enabled."""
        return "xx_ent_wiki_sm" if self.fallback_to_multilingual else None

Attributes¶

`batch_size: int = 1000` `class-attribute` `instance-attribute` ¶

Batch size for processing multiple texts.

`fallback_to_multilingual: bool = True` `class-attribute` `instance-attribute` ¶

If True and language-specific model fails, try xx_ent_wiki_sm (multilingual).

`language_models: dict[str, str] | tuple[tuple[str, str], ...] | None = None` `class-attribute` `instance-attribute` ¶

Mapping of language codes to spaCy model names.

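If None, uses default mappings:

- en: en_core_web_sm
- de: de_core_news_sm
- fr: fr_core_news_sm
- es: es_core_news_sm
- pt: pt_core_news_sm
- it: it_core_news_sm
- nl: nl_core_news_sm
- zh: zh_core_web_sm
- ja: ja_core_news_sm

The `_get_default_language_models` source shown above additionally maps ko, ru, pl, ro, el, da, fi, nb, sv, ca, hr, lt, mk, sl, and uk to their respective `*_core_news_sm` models.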

`max_doc_length: int = 1000000` `class-attribute` `instance-attribute` ¶

Maximum document length for spaCy processing.

`model_cache_dir: str | Path | None = None` `class-attribute` `instance-attribute` ¶

Directory to cache spaCy models. If None, uses spaCy's default.
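The `__post_init__` shown in the source converts a `dict` of language models into a sorted tuple of pairs. A plausible reason is that the dataclass is declared with `unsafe_hash=True`, and a `dict` field would make hashing fail. The sketch below reproduces only that conversion in a stand-in class (`HashableModelsConfig` is a hypothetical name) to show the effect.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(unsafe_hash=True, frozen=True)
class HashableModelsConfig:
    """Stand-in with a single field; mirrors the dict-to-tuple step from the source."""

    language_models: Optional[Union[dict, tuple]] = None

    def __post_init__(self):
        if isinstance(self.language_models, dict):
            # frozen=True blocks normal assignment, so go through object.__setattr__.
            object.__setattr__(
                self, "language_models", tuple(sorted(self.language_models.items()))
            )


a = HashableModelsConfig({"en": "en_core_web_sm", "de": "de_core_news_sm"})
b = HashableModelsConfig({"de": "de_core_news_sm", "en": "en_core_web_sm"})
unique_configs = {a, b}  # usable in a set because the stored value is a tuple
```

Because the pairs are sorted, two configurations built from dicts with different insertion order compare equal and hash identically.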

Functions¶

`get_fallback_model() -> str | None` ¶

Get fallback multilingual model if enabled.

Source code in kreuzberg/_entity_extraction.py

def get_fallback_model(self) -> str | None:
    """Get fallback multilingual model if enabled."""
    return "xx_ent_wiki_sm" if self.fallback_to_multilingual else None

`get_model_for_language(language_code: str) -> str | None` ¶

Get the appropriate spaCy model for a language code.

Source code in kreuzberg/_entity_extraction.py

def get_model_for_language(self, language_code: str) -> str | None:
    """Get the appropriate spaCy model for a language code."""
    if not self.language_models:
        return None

    models_dict = dict(self.language_models) if isinstance(self.language_models, tuple) else self.language_models

    if language_code in models_dict:
        return models_dict[language_code]

    base_lang = language_code.split("-")[0].lower()
    if base_lang in models_dict:
        return models_dict[base_lang]

    return None
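To make the lookup order concrete, the sketch below re-implements the same two steps as a free function over a plain dict, so it runs without kreuzberg installed. The function name `lookup_model` and the two-entry mapping are illustrative only.

```python
from typing import Optional

MODELS = {"en": "en_core_web_sm", "de": "de_core_news_sm"}


def lookup_model(language_code: str, models: dict = MODELS) -> Optional[str]:
    """Standalone copy of the lookup: exact match first, then the base language."""
    if language_code in models:
        return models[language_code]
    # "en-US" -> "en"; only "-" is treated as a separator.
    base_lang = language_code.split("-")[0].lower()
    return models.get(base_lang)


regional = lookup_model("en-US")     # falls back to the "en" entry
uppercase = lookup_model("EN")       # lower-cased by the fallback step
underscore = lookup_model("en_US")   # "_" is not split, so nothing matches
```

One consequence of splitting on `-` only is that locale strings written with an underscore, such as `en_US`, do not resolve to the base language.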

Language Detection Configuration¶

Configuration options for automatic language detection:

`kreuzberg.LanguageDetectionConfig` `dataclass` ¶

Configuration for language detection.

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `low_memory` | If True, uses a smaller model (~200MB). If False, uses a larger, more accurate model. Defaults to True for better memory efficiency. TYPE: `bool` |
| `top_k` | Maximum number of languages to return for multilingual detection. Defaults to 3. TYPE: `int` |
| `multilingual` | If True, uses multilingual detection to handle mixed-language text. If False, uses single language detection. Defaults to False. TYPE: `bool` |
| `cache_dir` | Custom directory for model cache. If None, uses system default. TYPE: `str \| None` |
| `allow_fallback` | If True, falls back to small model if large model fails. Defaults to True. TYPE: `bool` |

Source code in kreuzberg/_language_detection.py

@dataclass(frozen=True)
class LanguageDetectionConfig:
    """Configuration for language detection.

    Attributes:
        low_memory: If True, uses a smaller model (~200MB). If False, uses a larger, more accurate model.
            Defaults to True for better memory efficiency.
        top_k: Maximum number of languages to return for multilingual detection. Defaults to 3.
        multilingual: If True, uses multilingual detection to handle mixed-language text.
            If False, uses single language detection. Defaults to False.
        cache_dir: Custom directory for model cache. If None, uses system default.
        allow_fallback: If True, falls back to small model if large model fails. Defaults to True.
    """

    low_memory: bool = True
    top_k: int = 3
    multilingual: bool = False
    cache_dir: str | None = None
    allow_fallback: bool = True
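Since the dataclass is frozen, a variant configuration has to be created rather than mutated. The sketch below uses a local copy of the class (same fields and defaults as above, redefined here so the example runs without kreuzberg installed) together with the standard library's `dataclasses.replace`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class LanguageDetectionConfig:
    """Local copy of the fields shown above, for illustration only."""

    low_memory: bool = True
    top_k: int = 3
    multilingual: bool = False
    cache_dir: Optional[str] = None
    allow_fallback: bool = True


base = LanguageDetectionConfig()
# replace() builds a new instance; the original stays untouched.
mixed = replace(base, multilingual=True, top_k=5)

try:
    base.top_k = 5  # direct assignment is rejected on a frozen dataclass
    mutated = True
except FrozenInstanceError:
    mutated = False
```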

PSMMode (Page Segmentation Mode)¶

`kreuzberg.PSMMode` ¶

Bases: Enum

Enum for Tesseract Page Segmentation Modes (PSM) with human-readable values.

Source code in kreuzberg/_ocr/_tesseract.py

class PSMMode(Enum):
    """Enum for Tesseract Page Segmentation Modes (PSM) with human-readable values."""

    OSD_ONLY = 0
    """Orientation and script detection only."""
    AUTO_OSD = 1
    """Automatic page segmentation with orientation and script detection."""
    AUTO_ONLY = 2
    """Automatic page segmentation without OSD."""
    AUTO = 3
    """Fully automatic page segmentation (default)."""
    SINGLE_COLUMN = 4
    """Assume a single column of text."""
    SINGLE_BLOCK_VERTICAL = 5
    """Assume a single uniform block of vertically aligned text."""
    SINGLE_BLOCK = 6
    """Assume a single uniform block of text."""
    SINGLE_LINE = 7
    """Treat the image as a single text line."""
    SINGLE_WORD = 8
    """Treat the image as a single word."""
    CIRCLE_WORD = 9
    """Treat the image as a single word in a circle."""
    SINGLE_CHAR = 10
    """Treat the image as a single character."""

Attributes¶

`AUTO = 3` `class-attribute` `instance-attribute` ¶

Fully automatic page segmentation (default).

`AUTO_ONLY = 2` `class-attribute` `instance-attribute` ¶

Automatic page segmentation without OSD.

`AUTO_OSD = 1` `class-attribute` `instance-attribute` ¶

Automatic page segmentation with orientation and script detection.

`CIRCLE_WORD = 9` `class-attribute` `instance-attribute` ¶

Treat the image as a single word in a circle.

`OSD_ONLY = 0` `class-attribute` `instance-attribute` ¶

Orientation and script detection only.

`SINGLE_BLOCK = 6` `class-attribute` `instance-attribute` ¶

Assume a single uniform block of text.

`SINGLE_BLOCK_VERTICAL = 5` `class-attribute` `instance-attribute` ¶

Assume a single uniform block of vertically aligned text.

`SINGLE_CHAR = 10` `class-attribute` `instance-attribute` ¶

Treat the image as a single character.

`SINGLE_COLUMN = 4` `class-attribute` `instance-attribute` ¶

Assume a single column of text.

`SINGLE_LINE = 7` `class-attribute` `instance-attribute` ¶

Treat the image as a single text line.

`SINGLE_WORD = 8` `class-attribute` `instance-attribute` ¶

Treat the image as a single word.
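The enum values correspond to the integers accepted by the Tesseract command line's `--psm` option. As a sketch of that correspondence, the snippet below redefines a few members locally and builds the argument pair; `psm_cli_args` is a hypothetical helper, and how kreuzberg passes the mode to Tesseract internally is not shown on this page.

```python
from enum import Enum


class PSMMode(Enum):
    """Local subset of the enum above, for illustration only."""

    AUTO = 3
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7


def psm_cli_args(mode: PSMMode) -> list:
    """Hypothetical helper: render a mode as Tesseract's '--psm N' arguments."""
    return ["--psm", str(mode.value)]


single_line_args = psm_cli_args(PSMMode.SINGLE_LINE)
from_number = PSMMode(3)  # look a member up by its numeric value
```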

Entity¶

Represents an extracted named entity:

`kreuzberg.Entity` `dataclass` ¶

Represents an extracted entity with type, text, and position.

Source code in kreuzberg/_types.py

@dataclass(frozen=True)
class Entity:
    """Represents an extracted entity with type, text, and position."""

    type: str
    """e.g., PERSON, ORGANIZATION, LOCATION, DATE, EMAIL, PHONE, or custom"""
    text: str
    """Extracted text"""
    start: int
    """Start character offset in the content"""
    end: int
    """End character offset in the content"""

Attributes¶

`end: int` `instance-attribute` ¶

End character offset in the content

`start: int` `instance-attribute` ¶

Start character offset in the content

`text: str` `instance-attribute` ¶

Extracted text

`type: str` `instance-attribute` ¶

e.g., PERSON, ORGANIZATION, LOCATION, DATE, EMAIL, PHONE, or custom
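The `start` and `end` offsets can be used to recover an entity's text from the extracted content. The sketch below assumes `end` is exclusive, following Python slice conventions; that is an assumption rather than something stated here. It uses a local copy of the dataclass and a hypothetical checker, `entity_matches_content`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Local copy of the fields shown above, for illustration only."""

    type: str
    text: str
    start: int
    end: int


def entity_matches_content(content: str, entity: Entity) -> bool:
    """Hypothetical check: slicing with the offsets should reproduce the text."""
    return content[entity.start:entity.end] == entity.text


content = "Ada Lovelace worked with Charles Babbage."
entities = []
for name in ("Ada Lovelace", "Charles Babbage"):
    start = content.index(name)
    entities.append(Entity("PERSON", name, start, start + len(name)))

all_consistent = all(entity_matches_content(content, e) for e in entities)
```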

Metadata¶

A TypedDict that contains optional metadata fields extracted from documents:

`kreuzberg.Metadata` ¶

Bases: TypedDict

Base metadata common to all document types.

All fields will only be included if they contain non-empty values. Any field that would be empty or None is omitted from the dictionary.

Source code in kreuzberg/_types.py

class Metadata(TypedDict, total=False):
    """Base metadata common to all document types.

    All fields will only be included if they contain non-empty values.
    Any field that would be empty or None is omitted from the dictionary.
    """

    authors: NotRequired[list[str]]
    """List of document authors."""
    categories: NotRequired[list[str]]
    """Categories or classifications."""
    citations: NotRequired[list[str]]
    """Citation identifiers."""
    comments: NotRequired[str]
    """General comments."""
    copyright: NotRequired[str]
    """Copyright information."""
    created_at: NotRequired[str]
    """Creation timestamp in ISO format."""
    created_by: NotRequired[str]
    """Document creator."""
    description: NotRequired[str]
    """Document description."""
    fonts: NotRequired[list[str]]
    """List of fonts used in the document."""
    height: NotRequired[int]
    """Height of the document page/slide/image, if applicable."""
    identifier: NotRequired[str]
    """Unique document identifier."""
    keywords: NotRequired[list[str]]
    """Keywords or tags."""
    languages: NotRequired[list[str]]
    """Document language code."""
    license: NotRequired[str]
    """License information."""
    modified_at: NotRequired[str]
    """Last modification timestamp in ISO format."""
    modified_by: NotRequired[str]
    """Username of last modifier."""
    organization: NotRequired[str | list[str]]
    """Organizational affiliation."""
    publisher: NotRequired[str]
    """Publisher or organization name."""
    references: NotRequired[list[str]]
    """Reference entries."""
    status: NotRequired[str]
    """Document status (e.g., draft, final)."""
    subject: NotRequired[str]
    """Document subject or topic."""
    subtitle: NotRequired[str]
    """Document subtitle."""
    summary: NotRequired[str]
    """Document Summary"""
    title: NotRequired[str]
    """Document title."""
    version: NotRequired[str]
    """Version identifier or revision number."""
    width: NotRequired[int]
    """Width of the document page/slide/image, if applicable."""

Attributes¶

`authors: NotRequired[list[str]]` `instance-attribute` ¶

List of document authors.

`categories: NotRequired[list[str]]` `instance-attribute` ¶

Categories or classifications.

`citations: NotRequired[list[str]]` `instance-attribute` ¶

Citation identifiers.

`comments: NotRequired[str]` `instance-attribute` ¶

General comments.

`copyright: NotRequired[str]` `instance-attribute` ¶

Copyright information.

`created_at: NotRequired[str]` `instance-attribute` ¶

Creation timestamp in ISO format.

`created_by: NotRequired[str]` `instance-attribute` ¶

Document creator.

`description: NotRequired[str]` `instance-attribute` ¶

Document description.

`fonts: NotRequired[list[str]]` `instance-attribute` ¶

List of fonts used in the document.

`height: NotRequired[int]` `instance-attribute` ¶

Height of the document page/slide/image, if applicable.

`identifier: NotRequired[str]` `instance-attribute` ¶

Unique document identifier.

`keywords: NotRequired[list[str]]` `instance-attribute` ¶

Keywords or tags.

`languages: NotRequired[list[str]]` `instance-attribute` ¶

Document language codes.

`license: NotRequired[str]` `instance-attribute` ¶

License information.

`modified_at: NotRequired[str]` `instance-attribute` ¶

Last modification timestamp in ISO format.

`modified_by: NotRequired[str]` `instance-attribute` ¶

Username of last modifier.

`organization: NotRequired[str | list[str]]` `instance-attribute` ¶

Organizational affiliation.

`publisher: NotRequired[str]` `instance-attribute` ¶

Publisher or organization name.

`references: NotRequired[list[str]]` `instance-attribute` ¶

Reference entries.

`status: NotRequired[str]` `instance-attribute` ¶

Document status (e.g., draft, final).

`subject: NotRequired[str]` `instance-attribute` ¶

Document subject or topic.

`subtitle: NotRequired[str]` `instance-attribute` ¶

Document subtitle.

`summary: NotRequired[str]` `instance-attribute` ¶

Document summary.

`title: NotRequired[str]` `instance-attribute` ¶

Document title.

`version: NotRequired[str]` `instance-attribute` ¶

Version identifier or revision number.

`width: NotRequired[int]` `instance-attribute` ¶

Width of the document page/slide/image, if applicable.
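Because empty fields are omitted from the dictionary, code that reads metadata should not index keys directly. The sketch below uses a small local `TypedDict` with `total=False` (a hypothetical subset named `PartialMetadata`, not the real class) and reads values with `.get` and explicit fallbacks.

```python
from typing import TypedDict


class PartialMetadata(TypedDict, total=False):
    """Illustrative subset of the fields above; every key is optional."""

    title: str
    authors: list
    created_at: str


def describe(metadata: PartialMetadata) -> str:
    """Hypothetical helper: build a label without assuming any key is present."""
    title = metadata.get("title", "(untitled)")
    authors = ", ".join(metadata.get("authors", [])) or "(unknown author)"
    return f"{title} by {authors}"


full = describe({"title": "Report", "authors": ["A. Author", "B. Writer"]})
empty = describe({})
```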

`kreuzberg.ExtractionResult` `dataclass` ¶

`chunks: list[str] = field(default_factory=list)` `class-attribute` `instance-attribute` ¶

`content: str` `instance-attribute` ¶

`detected_languages: list[str] | None = None` `class-attribute` `instance-attribute` ¶

`entities: list[Entity] | None = None` `class-attribute` `instance-attribute` ¶

`keywords: list[tuple[str, float]] | None = None` `class-attribute` `instance-attribute` ¶

`metadata: Metadata` `instance-attribute` ¶

`mime_type: str` `instance-attribute` ¶

`tables: list[TableData] = field(default_factory=list)` `class-attribute` `instance-attribute` ¶

`to_dict() -> dict[str, Any]` ¶

`kreuzberg.ExtractionConfig` `dataclass` ¶

`auto_detect_language: bool = False` `class-attribute` `instance-attribute` ¶

`chunk_content: bool = False` `class-attribute` `instance-attribute` ¶

`custom_entity_patterns: frozenset[tuple[str, str]] | None = None` `class-attribute` `instance-attribute` ¶

`extract_entities: bool = False` `class-attribute` `instance-attribute` ¶

`extract_keywords: bool = False` `class-attribute` `instance-attribute` ¶

`extract_tables: bool = False` `class-attribute` `instance-attribute` ¶

`force_ocr: bool = False` `class-attribute` `instance-attribute` ¶

`gmft_config: GMFTConfig | None = None` `class-attribute` `instance-attribute` ¶

`keyword_count: int = 10` `class-attribute` `instance-attribute` ¶

`language_detection_config: LanguageDetectionConfig | None = None` `class-attribute` `instance-attribute` ¶

`max_chars: int = DEFAULT_MAX_CHARACTERS` `class-attribute` `instance-attribute` ¶

`max_overlap: int = DEFAULT_MAX_OVERLAP` `class-attribute` `instance-attribute` ¶

`ocr_backend: OcrBackendType | None = 'tesseract'` `class-attribute` `instance-attribute` ¶

`ocr_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None` `class-attribute` `instance-attribute` ¶

`post_processing_hooks: list[PostProcessingHook] | None = None` `class-attribute` `instance-attribute` ¶

`spacy_entity_extraction_config: SpacyEntityExtractionConfig | None = None` `class-attribute` `instance-attribute` ¶

`validators: list[ValidationHook] | None = None` `class-attribute` `instance-attribute` ¶

`get_config_dict() -> dict[str, Any]` ¶

`kreuzberg.TableData` ¶

`cropped_image: Image` `instance-attribute` ¶

`df: DataFrame` `instance-attribute` ¶

`page_number: int` `instance-attribute` ¶

`text: str` `instance-attribute` ¶

`kreuzberg.TesseractConfig` `dataclass` ¶

`classify_use_pre_adapted_templates: bool = True` `class-attribute` `instance-attribute` ¶

`language: str = 'eng'` `class-attribute` `instance-attribute` ¶

`language_model_ngram_on: bool = False` `class-attribute` `instance-attribute` ¶

`psm: PSMMode = PSMMode.AUTO_ONLY` `class-attribute` `instance-attribute` ¶

`tessedit_char_whitelist: str = ''` `class-attribute` `instance-attribute` ¶

`tessedit_dont_blkrej_good_wds: bool = True` `class-attribute` `instance-attribute` ¶

`tessedit_dont_rowrej_good_wds: bool = True` `class-attribute` `instance-attribute` ¶

`tessedit_enable_dict_correction: bool = True` `class-attribute` `instance-attribute` ¶

`tessedit_use_primary_params_model: bool = True` `class-attribute` `instance-attribute` ¶

`textord_space_size_is_variable: bool = True` `class-attribute` `instance-attribute` ¶

`thresholding_method: bool = False` `class-attribute` `instance-attribute` ¶

`kreuzberg.EasyOCRConfig` `dataclass` ¶

`add_margin: float = 0.1` `class-attribute` `instance-attribute` ¶

`adjust_contrast: float = 0.5` `class-attribute` `instance-attribute` ¶

`beam_width: int = 5` `class-attribute` `instance-attribute` ¶

`canvas_size: int = 2560` `class-attribute` `instance-attribute` ¶

`contrast_ths: float = 0.1` `class-attribute` `instance-attribute` ¶

`decoder: Literal['greedy', 'beamsearch', 'wordbeamsearch'] = 'greedy'` `class-attribute` `instance-attribute` ¶

`device: DeviceType = 'auto'` `class-attribute` `instance-attribute` ¶

`fallback_to_cpu: bool = True` `class-attribute` `instance-attribute` ¶

`gpu_memory_limit: float | None = None` `class-attribute` `instance-attribute` ¶

`height_ths: float = 0.5` `class-attribute` `instance-attribute` ¶

`language: str | list[str] = 'en'` `class-attribute` `instance-attribute` ¶

`link_threshold: float = 0.4` `class-attribute` `instance-attribute` ¶

`low_text: float = 0.4` `class-attribute` `instance-attribute` ¶

`mag_ratio: float = 1.0` `class-attribute` `instance-attribute` ¶

`min_size: int = 10` `class-attribute` `instance-attribute` ¶

`rotation_info: list[int] | None = None` `class-attribute` `instance-attribute` ¶

`slope_ths: float = 0.1` `class-attribute` `instance-attribute` ¶

`text_threshold: float = 0.7` `class-attribute` `instance-attribute` ¶

`use_gpu: bool = False` `class-attribute` `instance-attribute` ¶

`width_ths: float = 0.5` `class-attribute` `instance-attribute` ¶

`x_ths: float = 1.0` `class-attribute` `instance-attribute` ¶

`y_ths: float = 0.5` `class-attribute` `instance-attribute` ¶

`ycenter_ths: float = 0.5` `class-attribute` `instance-attribute` ¶

`kreuzberg.PaddleOCRConfig` `dataclass` ¶

`cls_image_shape: str = '3,48,192'` `class-attribute` `instance-attribute` ¶

`det_algorithm: Literal['DB', 'EAST', 'SAST', 'PSE', 'FCE', 'PAN', 'CT', 'DB++', 'Layout'] = 'DB'` `class-attribute` `instance-attribute` ¶

`det_db_box_thresh: float = 0.5` `class-attribute` `instance-attribute` ¶

`det_db_thresh: float = 0.3` `class-attribute` `instance-attribute` ¶

`det_db_unclip_ratio: float = 2.0` `class-attribute` `instance-attribute` ¶

`det_east_cover_thresh: float = 0.1` `class-attribute` `instance-attribute` ¶

`det_east_nms_thresh: float = 0.2` `class-attribute` `instance-attribute` ¶

`det_east_score_thresh: float = 0.8` `class-attribute` `instance-attribute` ¶

`det_max_side_len: int = 960` `class-attribute` `instance-attribute` ¶