Types

Core data structures for extraction results, configuration, and metadata.

ExtractionResult

The result of a file extraction, containing the extracted text, MIME type, and metadata:

kreuzberg.ExtractionResult dataclass

The result of a file extraction.

Source code in kreuzberg/_types.py
@dataclass
class ExtractionResult:
    """The result of a file extraction."""

    content: str
    """The extracted content."""
    chunks: list[str]
    """The extracted content chunks. This is an empty list if 'chunk_content' is not set to True in the ExtractionConfig."""
    mime_type: str
    """The MIME type of the extracted content, either text/plain or text/markdown."""
    metadata: Metadata
    """The metadata of the content."""

Attributes

chunks: list[str] instance-attribute

The extracted content chunks. This is an empty list if 'chunk_content' is not set to True in the ExtractionConfig.

content: str instance-attribute

The extracted content.

metadata: Metadata instance-attribute

The metadata of the content.

mime_type: str instance-attribute

The MIME type of the extracted content, either text/plain or text/markdown.
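
As a minimal sketch of how these four fields might be consumed, the snippet below uses a local stand-in dataclass rather than the kreuzberg import, so it runs without the library installed. The `describe` helper is hypothetical and not part of the documented API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ExtractionResult:
    """Local stand-in mirroring the documented fields (not the kreuzberg class)."""

    content: str
    chunks: list[str]
    mime_type: str
    metadata: dict[str, Any]


def describe(result: ExtractionResult) -> str:
    """Hypothetical helper: summarize a result in one line."""
    kind = "markdown" if result.mime_type == "text/markdown" else "plain text"
    # chunks stays an empty list unless chunk_content=True was set in the config.
    chunk_note = f"{len(result.chunks)} chunks" if result.chunks else "not chunked"
    return f"{len(result.content)} chars of {kind}, {chunk_note}"


result = ExtractionResult(
    content="# Title\n\nBody",
    chunks=[],
    mime_type="text/markdown",
    metadata={},
)
summary = describe(result)
```

Checking `chunks` for emptiness, rather than assuming it is populated, matches the documented default where chunking is off.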

ExtractionConfig

Configuration options for extraction functions:

kreuzberg.ExtractionConfig dataclass

Represents configuration settings for an extraction process.

This class encapsulates the configuration options for extracting text from images or documents using Optical Character Recognition (OCR). It provides options to customize the OCR behavior, select the backend engine, and configure engine-specific parameters.

Source code in kreuzberg/_types.py
@dataclass(unsafe_hash=True)
class ExtractionConfig:
    """Represents configuration settings for an extraction process.

    This class encapsulates the configuration options for extracting text
    from images or documents using Optical Character Recognition (OCR). It
    provides options to customize the OCR behavior, select the backend
    engine, and configure engine-specific parameters.
    """

    force_ocr: bool = False
    """Whether to force OCR."""
    chunk_content: bool = False
    """Whether to chunk the content into smaller chunks."""
    max_chars: int = DEFAULT_MAX_CHARACTERS
    """The size of each chunk in characters."""
    max_overlap: int = DEFAULT_MAX_OVERLAP
    """The overlap between chunks in characters."""
    ocr_backend: OcrBackendType | None = "tesseract"
    """The OCR backend to use."""
    ocr_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None
    """Configuration to pass to the OCR backend."""
    post_processing_hooks: list[PostProcessingHook] | None = None
    """Post processing hooks to call after processing is done and before the final result is returned."""
    validators: list[ValidationHook] | None = None
    """Validation hooks to call after processing is done and before post-processing and result return."""

    def __post_init__(self) -> None:
        from kreuzberg._ocr._easyocr import EasyOCRConfig
        from kreuzberg._ocr._paddleocr import PaddleOCRConfig
        from kreuzberg._ocr._tesseract import TesseractConfig

        if self.ocr_backend is None and self.ocr_config is not None:
            raise ValidationError("'ocr_backend' is None but 'ocr_config' is provided")

        if self.ocr_config is not None and (
            (self.ocr_backend == "tesseract" and not isinstance(self.ocr_config, TesseractConfig))
            or (self.ocr_backend == "easyocr" and not isinstance(self.ocr_config, EasyOCRConfig))
            or (self.ocr_backend == "paddleocr" and not isinstance(self.ocr_config, PaddleOCRConfig))
        ):
            raise ValidationError(
                "incompatible 'ocr_config' value provided for 'ocr_backend'",
                context={"ocr_backend": self.ocr_backend, "ocr_config": type(self.ocr_config).__name__},
            )

    def get_config_dict(self) -> dict[str, Any]:
        """Returns the OCR configuration object based on the backend specified.

        Returns:
            A dict of the OCR configuration or an empty dict if no backend is provided.
        """
        if self.ocr_backend is not None:
            if self.ocr_config is not None:
                return asdict(self.ocr_config)
            if self.ocr_backend == "tesseract":
                from kreuzberg._ocr._tesseract import TesseractConfig

                return asdict(TesseractConfig())
            if self.ocr_backend == "easyocr":
                from kreuzberg._ocr._easyocr import EasyOCRConfig

                return asdict(EasyOCRConfig())
            from kreuzberg._ocr._paddleocr import PaddleOCRConfig

            return asdict(PaddleOCRConfig())
        return {}

Attributes

chunk_content: bool = False class-attribute instance-attribute

Whether to chunk the content into smaller chunks.

force_ocr: bool = False class-attribute instance-attribute

Whether to force OCR.

max_chars: int = DEFAULT_MAX_CHARACTERS class-attribute instance-attribute

The size of each chunk in characters.

max_overlap: int = DEFAULT_MAX_OVERLAP class-attribute instance-attribute

The overlap between chunks in characters.

ocr_backend: OcrBackendType | None = 'tesseract' class-attribute instance-attribute

The OCR backend to use.

ocr_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None class-attribute instance-attribute

Configuration to pass to the OCR backend.

post_processing_hooks: list[PostProcessingHook] | None = None class-attribute instance-attribute

Post processing hooks to call after processing is done and before the final result is returned.

validators: list[ValidationHook] | None = None class-attribute instance-attribute

Validation hooks to call after processing is done and before post-processing and result return.
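
To illustrate how max_chars and max_overlap relate, here is a naive fixed-window sketch. It is an assumption for illustration only: this page does not show the library's chunking strategy, which may well split on different boundaries, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Each window after the first repeats the last max_overlap characters
    of the previous window.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in the range [0, max_chars)")

    step = max_chars - max_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij", max_chars=4, max_overlap=1)
```

With these inputs each chunk begins with the final character of the one before it, which is what a one-character overlap means in this sketch.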

Functions

get_config_dict() -> dict[str, Any]

Returns the OCR configuration object based on the backend specified.

Returns:
dict[str, Any]: A dict of the OCR configuration or an empty dict if no backend is provided.

Source code in kreuzberg/_types.py
def get_config_dict(self) -> dict[str, Any]:
    """Returns the OCR configuration object based on the backend specified.

    Returns:
        A dict of the OCR configuration or an empty dict if no backend is provided.
    """
    if self.ocr_backend is not None:
        if self.ocr_config is not None:
            return asdict(self.ocr_config)
        if self.ocr_backend == "tesseract":
            from kreuzberg._ocr._tesseract import TesseractConfig

            return asdict(TesseractConfig())
        if self.ocr_backend == "easyocr":
            from kreuzberg._ocr._easyocr import EasyOCRConfig

            return asdict(EasyOCRConfig())
        from kreuzberg._ocr._paddleocr import PaddleOCRConfig

        return asdict(PaddleOCRConfig())
    return {}
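
The interplay between the `__post_init__` checks and the fallback in `get_config_dict` can be sketched in isolation. The classes below are one-field stand-ins, `resolve_config` is a hypothetical function, and `ValueError` replaces the library's `ValidationError`. Only two backends are covered, whereas the source above falls through to PaddleOCRConfig for any other backend.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TesseractConfig:
    """Stand-in with a single field."""

    language: str = "eng"


@dataclass(frozen=True)
class EasyOCRConfig:
    """Stand-in with a single field."""

    language: str = "en"


# The config class that each backend name accepts in this sketch.
EXPECTED = {"tesseract": TesseractConfig, "easyocr": EasyOCRConfig}


def resolve_config(backend: Optional[str], config: Optional[object]) -> dict[str, Any]:
    """Validate the backend/config pair, then return the effective config dict."""
    if backend is None:
        if config is not None:
            raise ValueError("'ocr_backend' is None but 'ocr_config' is provided")
        return {}  # no backend means no OCR settings
    expected = EXPECTED[backend]
    if config is None:
        return asdict(expected())  # fall back to the backend's defaults
    if not isinstance(config, expected):
        raise ValueError("incompatible 'ocr_config' value provided for 'ocr_backend'")
    return asdict(config)


defaults = resolve_config("tesseract", None)
custom = resolve_config("easyocr", EasyOCRConfig(language="de"))
```

Passing a config of the wrong class for the chosen backend raises in this sketch, mirroring the mismatch check in the source above.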

OCR Configuration

TesseractConfig

kreuzberg.TesseractConfig dataclass

Configuration options for Tesseract OCR engine.

Source code in kreuzberg/_ocr/_tesseract.py
@dataclass(unsafe_hash=True, frozen=True)
class TesseractConfig:
    """Configuration options for Tesseract OCR engine."""

    classify_use_pre_adapted_templates: bool = True
    """Whether to use pre-adapted templates during classification to improve recognition accuracy."""
    language: str = "eng"
    """Language code to use for OCR.
    Examples:
            -   'eng' for English
            -   'deu' for German
            -   multiple languages combined with '+', e.g. 'eng+deu'
    """
    language_model_ngram_on: bool = True
    """Enable or disable the use of n-gram-based language models for improved text recognition."""
    psm: PSMMode = PSMMode.AUTO
    """Page segmentation mode (PSM) to guide Tesseract on how to segment the image (e.g., single block, single line)."""
    tessedit_dont_blkrej_good_wds: bool = True
    """If True, prevents block rejection of words identified as good, improving text output quality."""
    tessedit_dont_rowrej_good_wds: bool = True
    """If True, prevents row rejection of words identified as good, avoiding unnecessary omissions."""
    tessedit_enable_dict_correction: bool = True
    """Enable or disable dictionary-based correction for recognized text to improve word accuracy."""
    tessedit_use_primary_params_model: bool = True
    """If True, forces the use of the primary parameters model for text recognition."""
    textord_space_size_is_variable: bool = True
    """Allow variable spacing between words, useful for text with irregular spacing."""
    thresholding_method: bool = True
    """Enable or disable specific thresholding methods during image preprocessing for better OCR accuracy."""

Attributes

classify_use_pre_adapted_templates: bool = True class-attribute instance-attribute

Whether to use pre-adapted templates during classification to improve recognition accuracy.

language: str = 'eng' class-attribute instance-attribute

Language code to use for OCR. Examples: 'eng' for English, 'deu' for German, or multiple languages combined with '+' (e.g. 'eng+deu').

language_model_ngram_on: bool = True class-attribute instance-attribute

Enable or disable the use of n-gram-based language models for improved text recognition.

psm: PSMMode = PSMMode.AUTO class-attribute instance-attribute

Page segmentation mode (PSM) to guide Tesseract on how to segment the image (e.g., single block, single line).

tessedit_dont_blkrej_good_wds: bool = True class-attribute instance-attribute

If True, prevents block rejection of words identified as good, improving text output quality.

tessedit_dont_rowrej_good_wds: bool = True class-attribute instance-attribute

If True, prevents row rejection of words identified as good, avoiding unnecessary omissions.

tessedit_enable_dict_correction: bool = True class-attribute instance-attribute

Enable or disable dictionary-based correction for recognized text to improve word accuracy.

tessedit_use_primary_params_model: bool = True class-attribute instance-attribute

If True, forces the use of the primary parameters model for text recognition.

textord_space_size_is_variable: bool = True class-attribute instance-attribute

Allow variable spacing between words, useful for text with irregular spacing.

thresholding_method: bool = True class-attribute instance-attribute

Enable or disable specific thresholding methods during image preprocessing for better OCR accuracy.
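
Because the dataclass is declared with frozen=True, a variant has to be built as a new instance rather than by assigning to a field. The sketch below shows this with a one-field stand-in and the standard library's `dataclasses.replace`. The `join_languages` helper is hypothetical and only demonstrates the '+' convention described above.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(unsafe_hash=True, frozen=True)
class TesseractConfig:
    """Stand-in with one of the documented fields."""

    language: str = "eng"


def join_languages(*codes: str) -> str:
    """Hypothetical helper: combine language codes with '+'."""
    return "+".join(codes)


base = TesseractConfig()
# replace() returns a new instance; the original is left untouched.
bilingual = replace(base, language=join_languages("eng", "deu"))

try:
    base.language = "deu"  # assignment on a frozen dataclass raises
    mutated = True
except FrozenInstanceError:
    mutated = False
```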

EasyOCRConfig

kreuzberg.EasyOCRConfig dataclass

Configuration options for EasyOCR.

Source code in kreuzberg/_ocr/_easyocr.py
@dataclass(unsafe_hash=True, frozen=True)
class EasyOCRConfig:
    """Configuration options for EasyOCR."""

    add_margin: float = 0.1
    """Extend bounding boxes in all directions."""
    adjust_contrast: float = 0.5
    """Target contrast level for low contrast text."""
    beam_width: int = 5
    """Beam width for beam search in recognition."""
    canvas_size: int = 2560
    """Maximum image dimension for detection."""
    contrast_ths: float = 0.1
    """Contrast threshold for preprocessing."""
    decoder: Literal["greedy", "beamsearch", "wordbeamsearch"] = "greedy"
    """Decoder method. Options: 'greedy', 'beamsearch', 'wordbeamsearch'."""
    height_ths: float = 0.5
    """Maximum difference in box height for merging."""
    language: str | list[str] = "en"
    """Language or languages to use for OCR."""
    link_threshold: float = 0.4
    """Link confidence threshold."""
    low_text: float = 0.4
    """Text low-bound score."""
    mag_ratio: float = 1.0
    """Image magnification ratio."""
    min_size: int = 10
    """Minimum text box size in pixels."""
    rotation_info: list[int] | None = None
    """List of angles to try for detection."""
    slope_ths: float = 0.1
    """Maximum slope for merging text boxes."""
    text_threshold: float = 0.7
    """Text confidence threshold."""
    use_gpu: bool = False
    """Whether to use GPU for inference."""
    width_ths: float = 0.5
    """Maximum horizontal distance for merging boxes."""
    x_ths: float = 1.0
    """Maximum horizontal distance for paragraph merging."""
    y_ths: float = 0.5
    """Maximum vertical distance for paragraph merging."""
    ycenter_ths: float = 0.5
    """Maximum shift in y direction for merging."""

Attributes

add_margin: float = 0.1 class-attribute instance-attribute

Extend bounding boxes in all directions.

adjust_contrast: float = 0.5 class-attribute instance-attribute

Target contrast level for low contrast text.

beam_width: int = 5 class-attribute instance-attribute

Beam width for beam search in recognition.

canvas_size: int = 2560 class-attribute instance-attribute

Maximum image dimension for detection.

contrast_ths: float = 0.1 class-attribute instance-attribute

Contrast threshold for preprocessing.

decoder: Literal['greedy', 'beamsearch', 'wordbeamsearch'] = 'greedy' class-attribute instance-attribute

Decoder method. Options: 'greedy', 'beamsearch', 'wordbeamsearch'.

height_ths: float = 0.5 class-attribute instance-attribute

Maximum difference in box height for merging.

language: str | list[str] = 'en' class-attribute instance-attribute

Language or languages to use for OCR.

link_threshold: float = 0.4 class-attribute instance-attribute

Link confidence threshold.

low_text: float = 0.4 class-attribute instance-attribute

Text low-bound score.

mag_ratio: float = 1.0 class-attribute instance-attribute

Image magnification ratio.

min_size: int = 10 class-attribute instance-attribute

Minimum text box size in pixels.

rotation_info: list[int] | None = None class-attribute instance-attribute

List of angles to try for detection.

slope_ths: float = 0.1 class-attribute instance-attribute

Maximum slope for merging text boxes.

text_threshold: float = 0.7 class-attribute instance-attribute

Text confidence threshold.

use_gpu: bool = False class-attribute instance-attribute

Whether to use GPU for inference.

width_ths: float = 0.5 class-attribute instance-attribute

Maximum horizontal distance for merging boxes.

x_ths: float = 1.0 class-attribute instance-attribute

Maximum horizontal distance for paragraph merging.

y_ths: float = 0.5 class-attribute instance-attribute

Maximum vertical distance for paragraph merging.

ycenter_ths: float = 0.5 class-attribute instance-attribute

Maximum shift in y direction for merging.
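
Since language accepts either a single string or a list, calling code may want one uniform shape. The sketch below uses a one-field stand-in and a hypothetical `language_list` helper. It also shows a general Python behavior that follows from unsafe_hash=True: the generated hash covers the field values, so an instance holding a list cannot be hashed.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(unsafe_hash=True, frozen=True)
class EasyOCRConfig:
    """Stand-in with one of the documented fields."""

    language: Union[str, List[str]] = "en"


def language_list(config: EasyOCRConfig) -> List[str]:
    """Hypothetical helper: always return a list of language codes."""
    if isinstance(config.language, str):
        return [config.language]
    return list(config.language)


single = EasyOCRConfig()
multi = EasyOCRConfig(language=["en", "de"])

# The generated __hash__ hashes a tuple of the fields, and a list is unhashable.
try:
    hash(multi)
    multi_hashable = True
except TypeError:
    multi_hashable = False
```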

PaddleOCRConfig

kreuzberg.PaddleOCRConfig dataclass

Configuration options for PaddleOCR.

This dataclass provides type hints and documentation for all PaddleOCR parameters.

Source code in kreuzberg/_ocr/_paddleocr.py
@dataclass(unsafe_hash=True, frozen=True)
class PaddleOCRConfig:
    """Configuration options for PaddleOCR.

    This dataclass provides type hints and documentation for all PaddleOCR parameters.
    """

    cls_image_shape: str = "3,48,192"
    """Image shape for classification algorithm in format 'channels,height,width'."""
    det_algorithm: Literal["DB", "EAST", "SAST", "PSE", "FCE", "PAN", "CT", "DB++", "Layout"] = "DB"
    """Detection algorithm."""
    det_db_box_thresh: float = 0.5
    """Score threshold for detected boxes. Boxes below this value are discarded."""
    det_db_thresh: float = 0.3
    """Binarization threshold for DB output map."""
    det_db_unclip_ratio: float = 2.0
    """Expansion ratio for detected text boxes."""
    det_east_cover_thresh: float = 0.1
    """Score threshold for EAST output boxes."""
    det_east_nms_thresh: float = 0.2
    """NMS threshold for EAST model output boxes."""
    det_east_score_thresh: float = 0.8
    """Binarization threshold for EAST output map."""
    det_max_side_len: int = 960
    """Maximum size of image long side. Images exceeding this will be proportionally resized."""
    drop_score: float = 0.5
    """Filter recognition results by confidence score. Results below this are discarded."""
    enable_mkldnn: bool = False
    """Whether to enable MKL-DNN acceleration (Intel CPU only)."""
    gpu_mem: int = 8000
    """GPU memory size (in MB) to use for initialization."""
    language: str = "en"
    """Language to use for OCR."""
    max_text_length: int = 25
    """Maximum text length that the recognition algorithm can recognize."""
    rec: bool = True
    """Enable text recognition when using the ocr() function."""
    rec_algorithm: Literal[
        "CRNN",
        "SRN",
        "NRTR",
        "SAR",
        "SEED",
        "SVTR",
        "SVTR_LCNet",
        "ViTSTR",
        "ABINet",
        "VisionLAN",
        "SPIN",
        "RobustScanner",
        "RFL",
    ] = "CRNN"
    """Recognition algorithm."""
    rec_image_shape: str = "3,32,320"
    """Image shape for recognition algorithm in format 'channels,height,width'."""
    table: bool = True
    """Whether to enable table recognition."""
    use_angle_cls: bool = True
    """Whether to use text orientation classification model."""
    use_gpu: bool = False
    """Whether to use GPU for inference. Requires installing the paddlepaddle-gpu package."""
    use_space_char: bool = True
    """Whether to recognize spaces."""
    use_zero_copy_run: bool = False
    """Whether to enable zero_copy_run for inference optimization."""

Attributes

cls_image_shape: str = '3,48,192' class-attribute instance-attribute

Image shape for classification algorithm in format 'channels,height,width'.

det_algorithm: Literal['DB', 'EAST', 'SAST', 'PSE', 'FCE', 'PAN', 'CT', 'DB++', 'Layout'] = 'DB' class-attribute instance-attribute

Detection algorithm.

det_db_box_thresh: float = 0.5 class-attribute instance-attribute

Score threshold for detected boxes. Boxes below this value are discarded.

det_db_thresh: float = 0.3 class-attribute instance-attribute

Binarization threshold for DB output map.

det_db_unclip_ratio: float = 2.0 class-attribute instance-attribute

Expansion ratio for detected text boxes.

det_east_cover_thresh: float = 0.1 class-attribute instance-attribute

Score threshold for EAST output boxes.

det_east_nms_thresh: float = 0.2 class-attribute instance-attribute

NMS threshold for EAST model output boxes.

det_east_score_thresh: float = 0.8 class-attribute instance-attribute

Binarization threshold for EAST output map.

det_max_side_len: int = 960 class-attribute instance-attribute

Maximum size of image long side. Images exceeding this will be proportionally resized.

drop_score: float = 0.5 class-attribute instance-attribute

Filter recognition results by confidence score. Results below this are discarded.

enable_mkldnn: bool = False class-attribute instance-attribute

Whether to enable MKL-DNN acceleration (Intel CPU only).

gpu_mem: int = 8000 class-attribute instance-attribute

GPU memory size (in MB) to use for initialization.

language: str = 'en' class-attribute instance-attribute

Language to use for OCR.

max_text_length: int = 25 class-attribute instance-attribute

Maximum text length that the recognition algorithm can recognize.

rec: bool = True class-attribute instance-attribute

Enable text recognition when using the ocr() function.

rec_algorithm: Literal['CRNN', 'SRN', 'NRTR', 'SAR', 'SEED', 'SVTR', 'SVTR_LCNet', 'ViTSTR', 'ABINet', 'VisionLAN', 'SPIN', 'RobustScanner', 'RFL'] = 'CRNN' class-attribute instance-attribute

Recognition algorithm.

rec_image_shape: str = '3,32,320' class-attribute instance-attribute

Image shape for recognition algorithm in format 'channels,height,width'.

table: bool = True class-attribute instance-attribute

Whether to enable table recognition.

use_angle_cls: bool = True class-attribute instance-attribute

Whether to use text orientation classification model.

use_gpu: bool = False class-attribute instance-attribute

Whether to use GPU for inference. Requires installing the paddlepaddle-gpu package.

use_space_char: bool = True class-attribute instance-attribute

Whether to recognize spaces.

use_zero_copy_run: bool = False class-attribute instance-attribute

Whether to enable zero_copy_run for inference optimization.
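
The cls_image_shape and rec_image_shape fields encode three integers in a single 'channels,height,width' string. A small parser makes the format concrete. The function name `parse_image_shape` is hypothetical and not part of the library.

```python
def parse_image_shape(shape: str) -> tuple[int, int, int]:
    """Parse a 'channels,height,width' string into three integers."""
    parts = shape.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'channels,height,width', got {shape!r}")
    channels, height, width = (int(part) for part in parts)
    return channels, height, width


# Default values taken from the fields documented above.
cls_shape = parse_image_shape("3,48,192")
rec_shape = parse_image_shape("3,32,320")
```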

PSMMode (Page Segmentation Mode)

kreuzberg.PSMMode

Bases: Enum

Enum for Tesseract Page Segmentation Modes (PSM) with human-readable values.

Source code in kreuzberg/_ocr/_tesseract.py
class PSMMode(Enum):
    """Enum for Tesseract Page Segmentation Modes (PSM) with human-readable values."""

    OSD_ONLY = 0
    """Orientation and script detection only."""
    AUTO_OSD = 1
    """Automatic page segmentation with orientation and script detection."""
    AUTO_ONLY = 2
    """Automatic page segmentation without OSD."""
    AUTO = 3
    """Fully automatic page segmentation (default)."""
    SINGLE_COLUMN = 4
    """Assume a single column of text."""
    SINGLE_BLOCK_VERTICAL = 5
    """Assume a single uniform block of vertically aligned text."""
    SINGLE_BLOCK = 6
    """Assume a single uniform block of text."""
    SINGLE_LINE = 7
    """Treat the image as a single text line."""
    SINGLE_WORD = 8
    """Treat the image as a single word."""
    CIRCLE_WORD = 9
    """Treat the image as a single word in a circle."""
    SINGLE_CHAR = 10
    """Treat the image as a single character."""

Attributes

AUTO = 3 class-attribute instance-attribute

Fully automatic page segmentation (default).

AUTO_ONLY = 2 class-attribute instance-attribute

Automatic page segmentation without OSD.

AUTO_OSD = 1 class-attribute instance-attribute

Automatic page segmentation with orientation and script detection.

CIRCLE_WORD = 9 class-attribute instance-attribute

Treat the image as a single word in a circle.

OSD_ONLY = 0 class-attribute instance-attribute

Orientation and script detection only.

SINGLE_BLOCK = 6 class-attribute instance-attribute

Assume a single uniform block of text.

SINGLE_BLOCK_VERTICAL = 5 class-attribute instance-attribute

Assume a single uniform block of vertically aligned text.

SINGLE_CHAR = 10 class-attribute instance-attribute

Treat the image as a single character.

SINGLE_COLUMN = 4 class-attribute instance-attribute

Assume a single column of text.

SINGLE_LINE = 7 class-attribute instance-attribute

Treat the image as a single text line.

SINGLE_WORD = 8 class-attribute instance-attribute

Treat the image as a single word.
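
Each member's integer value is the number the Tesseract command line accepts after its `--psm` option. The sketch below redefines a subset of the enum locally and adds a hypothetical `psm_args` helper. How the library itself passes the mode to Tesseract is not shown on this page.

```python
from enum import Enum


class PSMMode(Enum):
    """Local subset of the documented members."""

    AUTO = 3
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7


def psm_args(mode: PSMMode) -> list[str]:
    """Hypothetical helper: build the command-line arguments for a mode."""
    return ["--psm", str(mode.value)]


args = psm_args(PSMMode.SINGLE_LINE)
# Looking up by value returns the matching member.
auto = PSMMode(3)
```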

Metadata

A TypedDict that contains optional metadata fields extracted from documents:

kreuzberg.Metadata

Bases: TypedDict

Base metadata common to all document types.

All fields will only be included if they contain non-empty values. Any field that would be empty or None is omitted from the dictionary.

Source code in kreuzberg/_types.py
class Metadata(TypedDict, total=False):
    """Base metadata common to all document types.

    All fields will only be included if they contain non-empty values.
    Any field that would be empty or None is omitted from the dictionary.
    """

    authors: NotRequired[list[str]]
    """List of document authors."""
    categories: NotRequired[list[str]]
    """Categories or classifications."""
    citations: NotRequired[list[str]]
    """Citation identifiers."""
    comments: NotRequired[str]
    """General comments."""
    copyright: NotRequired[str]
    """Copyright information."""
    created_at: NotRequired[str]
    """Creation timestamp in ISO format."""
    created_by: NotRequired[str]
    """Document creator."""
    description: NotRequired[str]
    """Document description."""
    fonts: NotRequired[list[str]]
    """List of fonts used in the document."""
    height: NotRequired[int]
    """Height of the document page/slide/image, if applicable."""
    identifier: NotRequired[str]
    """Unique document identifier."""
    keywords: NotRequired[list[str]]
    """Keywords or tags."""
    languages: NotRequired[list[str]]
    """Document language codes."""
    license: NotRequired[str]
    """License information."""
    modified_at: NotRequired[str]
    """Last modification timestamp in ISO format."""
    modified_by: NotRequired[str]
    """Username of last modifier."""
    organization: NotRequired[str | list[str]]
    """Organizational affiliation."""
    publisher: NotRequired[str]
    """Publisher or organization name."""
    references: NotRequired[list[str]]
    """Reference entries."""
    status: NotRequired[str]
    """Document status (e.g., draft, final)."""
    subject: NotRequired[str]
    """Document subject or topic."""
    subtitle: NotRequired[str]
    """Document subtitle."""
    summary: NotRequired[str]
    """Document summary."""
    title: NotRequired[str]
    """Document title."""
    version: NotRequired[str]
    """Version identifier or revision number."""
    width: NotRequired[int]
    """Width of the document page/slide/image, if applicable."""

Attributes

authors: NotRequired[list[str]] instance-attribute

List of document authors.

categories: NotRequired[list[str]] instance-attribute

Categories or classifications.

citations: NotRequired[list[str]] instance-attribute

Citation identifiers.

comments: NotRequired[str] instance-attribute

General comments.

copyright: NotRequired[str] instance-attribute

Copyright information.

created_at: NotRequired[str] instance-attribute

Creation timestamp in ISO format.

created_by: NotRequired[str] instance-attribute

Document creator.

description: NotRequired[str] instance-attribute

Document description.

fonts: NotRequired[list[str]] instance-attribute

List of fonts used in the document.

height: NotRequired[int] instance-attribute

Height of the document page/slide/image, if applicable.

identifier: NotRequired[str] instance-attribute

Unique document identifier.

keywords: NotRequired[list[str]] instance-attribute

Keywords or tags.

languages: NotRequired[list[str]] instance-attribute

Document language codes.

license: NotRequired[str] instance-attribute

License information.

modified_at: NotRequired[str] instance-attribute

Last modification timestamp in ISO format.

modified_by: NotRequired[str] instance-attribute

Username of last modifier.

organization: NotRequired[str | list[str]] instance-attribute

Organizational affiliation.

publisher: NotRequired[str] instance-attribute

Publisher or organization name.

references: NotRequired[list[str]] instance-attribute

Reference entries.

status: NotRequired[str] instance-attribute

Document status (e.g., draft, final).

subject: NotRequired[str] instance-attribute

Document subject or topic.

subtitle: NotRequired[str] instance-attribute

Document subtitle.

summary: NotRequired[str] instance-attribute

Document summary.

title: NotRequired[str] instance-attribute

Document title.

version: NotRequired[str] instance-attribute

Version identifier or revision number.

width: NotRequired[int] instance-attribute

Width of the document page/slide/image, if applicable.
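
Because empty fields are omitted rather than set to None, reading a key directly can raise KeyError. A minimal sketch of defensive access follows, using a three-field local stand-in for the TypedDict and a hypothetical `byline` helper.

```python
from typing import List, TypedDict


class Metadata(TypedDict, total=False):
    """Local stand-in with three of the documented fields."""

    title: str
    authors: List[str]
    created_at: str


def byline(metadata: Metadata) -> str:
    """Hypothetical helper: format a title with optional authors."""
    # .get() with a default covers keys that were omitted as empty.
    title = metadata.get("title", "Untitled")
    authors = metadata.get("authors", [])
    return f"{title} by {', '.join(authors)}" if authors else title


meta: Metadata = {"title": "Report"}
line = byline(meta)
```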