Extraction Functions
Kreuzberg provides both async and sync functions for text extraction. All functions accept an optional ExtractionConfig
parameter for configuring the extraction process.
Asynchronous Functions
These functions are coroutines. Await them from async code, or run them from synchronous code in an asyncio event loop (for example with asyncio.run).
extract_file
Extract text from a file path:
async kreuzberg.extract_file(file_path: PathLike[str] | str, mime_type: str | None = None, config: ExtractionConfig = DEFAULT_CONFIG) -> ExtractionResult
Extract the textual content from a given file.
PARAMETER | DESCRIPTION |
---|---|
file_path | The path to the file. TYPE: PathLike[str] or str |
mime_type | The MIME type of the content. TYPE: str or None. DEFAULT: None |
config | Extraction options; defaults to DEFAULT_CONFIG. TYPE: ExtractionConfig |
RETURNS | DESCRIPTION |
---|---|
ExtractionResult | The extracted content and the mime type of the content. |
Source code in kreuzberg/extraction.py
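As a minimal sketch of calling the coroutine documented above: the helper name `read_document` and the file name `report.pdf` are hypothetical, and the code assumes that ExtractionResult exposes `content` and `mime_type` attributes, as the return description suggests.

```python
import asyncio
from pathlib import Path


async def read_document(path: Path) -> tuple[str, str]:
    """Return (text, mime_type) for one file via kreuzberg.extract_file."""
    # Imported lazily so this module loads even when kreuzberg is not installed.
    from kreuzberg import extract_file

    # mime_type is left at its default of None, as the signature allows.
    result = await extract_file(path)
    # Assumption: ExtractionResult has `content` and `mime_type` attributes.
    return result.content, result.mime_type


def main() -> None:
    # "report.pdf" is a hypothetical file name; asyncio.run drives the coroutine
    # from synchronous code.
    text, mime_type = asyncio.run(read_document(Path("report.pdf")))
    print(mime_type, len(text))
```

Calling `main()` requires kreuzberg to be installed and the file to exist; nothing runs at import time.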
extract_bytes
Extract text from raw bytes:
async kreuzberg.extract_bytes(content: bytes, mime_type: str, config: ExtractionConfig = DEFAULT_CONFIG) -> ExtractionResult
Extract the textual content from a given byte string representing a file's contents.
PARAMETER | DESCRIPTION |
---|---|
content | The content to extract. TYPE: bytes |
mime_type | The MIME type of the content. TYPE: str |
config | Extraction options; defaults to DEFAULT_CONFIG. TYPE: ExtractionConfig |
RETURNS | DESCRIPTION |
---|---|
ExtractionResult | The extracted content and the mime type of the content. |
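Because `mime_type` has no default in this signature, the caller has to supply it. The sketch below guesses it from a file name with the standard library's `mimetypes` module; the helper names `guess_mime` and `extract_upload` are hypothetical, and whether kreuzberg supports a particular guessed MIME type is not something this example verifies.

```python
import mimetypes


def guess_mime(filename: str) -> str:
    """Guess a MIME type from a file name using the standard library."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"cannot determine MIME type for {filename!r}")
    return mime_type


async def extract_upload(filename: str, payload: bytes) -> str:
    """Extract text from an in-memory payload, such as an uploaded file."""
    # Imported lazily so this module loads even when kreuzberg is not installed.
    from kreuzberg import extract_bytes

    # mime_type is a required argument for extract_bytes.
    result = await extract_bytes(payload, guess_mime(filename))
    # Assumption: ExtractionResult has a `content` attribute.
    return result.content
```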
batch_extract_file
Process multiple files concurrently:
async kreuzberg.batch_extract_file(file_paths: Sequence[PathLike[str] | str], config: ExtractionConfig = DEFAULT_CONFIG) -> list[ExtractionResult]
Extract text from multiple files concurrently.
PARAMETER | DESCRIPTION |
---|---|
file_paths | A sequence of paths to the files to extract text from. TYPE: Sequence[PathLike[str] or str] |
config | Extraction options; defaults to DEFAULT_CONFIG. TYPE: ExtractionConfig |
RETURNS | DESCRIPTION |
---|---|
list[ExtractionResult] | A list of extraction results in the same order as the input paths. |
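Since results come back in the same order as the input paths, they can be paired with `zip`. The following is a sketch under that documented guarantee; `pair_in_order` and `extract_directory` are hypothetical helper names, and `content` is assumed to be an attribute of ExtractionResult.

```python
from pathlib import Path


def pair_in_order(paths, results):
    """Map each path to its result, relying on the same-order guarantee."""
    if len(paths) != len(results):
        raise ValueError("expected exactly one result per input path")
    return {str(path): result for path, result in zip(paths, results)}


async def extract_directory(directory: Path, pattern: str = "*.pdf") -> dict[str, str]:
    """Extract every matching file in a directory and key the text by path."""
    # Imported lazily so this module loads even when kreuzberg is not installed.
    from kreuzberg import batch_extract_file

    paths = sorted(directory.glob(pattern))
    results = await batch_extract_file(paths)
    # Assumption: ExtractionResult has a `content` attribute.
    return {path: result.content for path, result in pair_in_order(paths, results).items()}
```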
batch_extract_bytes
Process multiple byte contents concurrently:
async kreuzberg.batch_extract_bytes(contents: Sequence[tuple[bytes, str]], config: ExtractionConfig = DEFAULT_CONFIG) -> list[ExtractionResult]
Extract text from multiple byte contents concurrently.
PARAMETER | DESCRIPTION |
---|---|
contents | A sequence of (content, mime_type) tuples. TYPE: Sequence[tuple[bytes, str]] |
config | Extraction options; defaults to DEFAULT_CONFIG. TYPE: ExtractionConfig |
RETURNS | DESCRIPTION |
---|---|
list[ExtractionResult] | A list of extraction results in the same order as the input contents. |
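The `contents` argument puts the bytes first and the MIME type second in each tuple. A small sketch of that shape follows; the payloads and the name `extract_payloads` are hypothetical, and the two MIME types are only illustrative (their support by the library is assumed, not confirmed here).

```python
import asyncio

# Hypothetical in-memory payloads; each tuple is (content, mime_type).
PAYLOADS: list[tuple[bytes, str]] = [
    (b"Plain text body", "text/plain"),
    (b"# Heading\n\nSome markdown.", "text/markdown"),
]


async def extract_payloads(payloads: list[tuple[bytes, str]]) -> list[str]:
    """Extract text from several in-memory payloads in one call."""
    # Imported lazily so this module loads even when kreuzberg is not installed.
    from kreuzberg import batch_extract_bytes

    results = await batch_extract_bytes(payloads)
    # Assumption: ExtractionResult has a `content` attribute.
    return [result.content for result in results]
```

With kreuzberg installed, `asyncio.run(extract_payloads(PAYLOADS))` would run the batch from synchronous code.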
Synchronous Functions
These functions block until extraction is complete and are suitable for non-async contexts.
extract_file_sync
Synchronous version of extract_file:
kreuzberg.extract_file_sync(file_path: Path | str, mime_type: str | None = None, config: ExtractionConfig = DEFAULT_CONFIG) -> ExtractionResult
PARAMETER | DESCRIPTION |
---|---|
file_path | The path to the file. TYPE: Path or str |
mime_type | The MIME type of the content. TYPE: str or None. DEFAULT: None |
config | Extraction options; defaults to DEFAULT_CONFIG. TYPE: ExtractionConfig |
RETURNS | DESCRIPTION |
---|---|
ExtractionResult | The extracted content and the mime type of the content. |
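In a script or other non-async context, the blocking variant can be called directly with no event loop. A minimal sketch, assuming ExtractionResult exposes a `content` attribute; `read_text_sync` and `report.pdf` are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def read_text_sync(path: Path, mime_type: Optional[str] = None) -> str:
    """Blocking extraction of a single file; no event loop is involved."""
    # Imported lazily so this module loads even when kreuzberg is not installed.
    from kreuzberg import extract_file_sync

    result = extract_file_sync(path, mime_type=mime_type)
    # Assumption: ExtractionResult has a `content` attribute.
    return result.content


def main() -> None:
    # "report.pdf" is a hypothetical file name.
    print(read_text_sync(Path("report.pdf")))
```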
extract_bytes_sync
Synchronous version of extract_bytes:
kreuzberg.extract_bytes_sync(content: bytes, mime_type: str, config: ExtractionConfig = DEFAULT_CONFIG) -> ExtractionResult
PARAMETER | DESCRIPTION |
---|---|
content | The content to extract. TYPE: bytes |
mime_type | The MIME type of the content. TYPE: str |
config | Extraction options; defaults to DEFAULT_CONFIG. TYPE: ExtractionConfig |
RETURNS | DESCRIPTION |
---|---|
ExtractionResult | The extracted content and the mime type of the content. |
batch_extract_file_sync
Synchronous version of batch_extract_file:
kreuzberg.batch_extract_file_sync(file_paths: Sequence[PathLike[str] | str], config: ExtractionConfig = DEFAULT_CONFIG) -> list[ExtractionResult]
PARAMETER | DESCRIPTION |
---|---|
file_paths | A sequence of paths to the files to extract text from. TYPE: Sequence[PathLike[str] or str] |
config | Extraction options; defaults to DEFAULT_CONFIG. TYPE: ExtractionConfig |
RETURNS | DESCRIPTION |
---|---|
list[ExtractionResult] | A list of extraction results in the same order as the input paths. |
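For very long path lists, one option is to feed the blocking batch function in fixed-size chunks so that only a bounded number of files is in flight at once. This is a design choice made for illustration, not something the library is known to require; `chunked`, `extract_in_chunks`, and the chunk size of 16 are all hypothetical.

```python
from collections.abc import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def extract_in_chunks(paths: Sequence[str], size: int = 16) -> list[str]:
    """Run batch_extract_file_sync chunk by chunk, preserving input order."""
    # Imported lazily so this module loads even when kreuzberg is not installed.
    from kreuzberg import batch_extract_file_sync

    texts: list[str] = []
    for chunk in chunked(paths, size):
        results = batch_extract_file_sync(chunk)
        # Assumption: ExtractionResult has a `content` attribute.
        texts.extend(result.content for result in results)
    return texts
```

Because each batch returns results in input order and chunks are processed sequentially, the combined list keeps the order of `paths`.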
batch_extract_bytes_sync
Synchronous version of batch_extract_bytes:
kreuzberg.batch_extract_bytes_sync(contents: Sequence[tuple[bytes, str]], config: ExtractionConfig = DEFAULT_CONFIG) -> list[ExtractionResult]
PARAMETER | DESCRIPTION |
---|---|
contents | A sequence of (content, mime_type) tuples. TYPE: Sequence[tuple[bytes, str]] |
config | Extraction options; defaults to DEFAULT_CONFIG. TYPE: ExtractionConfig |
RETURNS | DESCRIPTION |
---|---|
list[ExtractionResult] | A list of extraction results in the same order as the input contents. |