Content Chunking¶
Kreuzberg provides a powerful content chunking capability that allows you to split extracted text into smaller, more manageable chunks. This feature is particularly useful for processing large documents, working with language models that have token limits, or implementing semantic search functionality.
Overview¶
Content chunking divides the extracted text into smaller segments while maintaining semantic coherence. Kreuzberg uses the semantic-text-splitter library to intelligently split text based on content type (plain text or markdown), respecting the document's structure.
Configuration¶
Chunking is controlled through the ExtractionConfig class with these parameters:

- chunk_content: Boolean flag to enable/disable chunking (default: False)
- max_chars: Maximum number of characters per chunk (default: 4000)
- max_overlap: Number of characters to overlap between chunks (default: 200)
Basic Usage¶
To enable chunking in your extraction process:
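A minimal sketch, assuming Kreuzberg's async extract_file entry point and a local document.pdf (the file name is illustrative):

```python
import asyncio

from kreuzberg import ExtractionConfig, extract_file


async def main() -> None:
    # Enable chunking with the default limits (4000 chars, 200 overlap).
    config = ExtractionConfig(chunk_content=True)
    result = await extract_file("document.pdf", config=config)

    # Each chunk is a plain string in the result's chunks field.
    for i, chunk in enumerate(result.chunks):
        print(f"chunk {i}: {len(chunk)} characters")


asyncio.run(main())
```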
Customizing Chunk Size and Overlap¶
You can customize the chunk size and overlap to suit your specific needs:
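For example, here is a sketch tuned toward retrieval-sized chunks; extract_file_sync is assumed as the synchronous counterpart to extract_file, and the document name is illustrative:

```python
from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig(
    chunk_content=True,
    max_chars=1500,   # smaller chunks for more precise retrieval
    max_overlap=100,  # roughly 5-10% of the chunk size
)
result = extract_file_sync("report.docx", config=config)
print(f"produced {len(result.chunks)} chunks")
```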
Format-Aware Chunking¶
Kreuzberg's chunking system is format-aware, meaning it handles different content types appropriately (see the sketch after this list):
- Markdown: When extracting from formats that produce markdown output (like DOCX, PPTX), the chunker preserves markdown structure, avoiding breaks in the middle of headings, lists, or code blocks.
- Plain Text: For plain text output, the chunker attempts to split on natural boundaries like paragraph breaks and sentences.
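To see the difference, here is a sketch that calls the semantic-text-splitter Python bindings directly. Kreuzberg performs this splitter selection for you, so this is purely illustrative:

```python
from semantic_text_splitter import MarkdownSplitter, TextSplitter

markdown = "# Title\n\nFirst paragraph.\n\n## Section\n\n- item one\n- item two\n"

# The markdown-aware splitter prefers boundaries between structural
# elements (headings, paragraphs, list items) over mid-element breaks.
print(MarkdownSplitter(60).chunks(markdown))

# The plain-text splitter only sees paragraph and sentence boundaries.
print(TextSplitter(60).chunks(markdown))
```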
Use Cases¶
Working with Large Language Models¶
When using LLMs with token limits, chunking allows you to process documents that would otherwise exceed those limits:
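One common pattern is a map-style pass that processes each chunk independently and then combines the partial results. In this sketch, summarize is a hypothetical stand-in for a real LLM call, and the character budget and file name are illustrative:

```python
import asyncio

from kreuzberg import ExtractionConfig, extract_file

CONTEXT_BUDGET_CHARS = 8000  # illustrative per-request character budget


async def summarize(text: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return text[:100] + "..."


async def main() -> None:
    config = ExtractionConfig(
        chunk_content=True,
        max_chars=CONTEXT_BUDGET_CHARS,
        max_overlap=400,
    )
    result = await extract_file("long-report.pdf", config=config)

    # Summarize each chunk independently, then stitch the partials together.
    partials = [await summarize(chunk) for chunk in result.chunks]
    print("\n".join(partials))


asyncio.run(main())
```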
Semantic Search Implementation¶
Chunking is essential for implementing effective semantic search:
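Here is a sketch of one possible setup, using sentence-transformers as the embedding backend; the model name, file path, and query are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

from kreuzberg import ExtractionConfig, extract_file_sync

# Smaller chunks tend to give more precise search hits (see Best Practices).
config = ExtractionConfig(chunk_content=True, max_chars=1500, max_overlap=100)
result = extract_file_sync("handbook.pdf", config=config)

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(result.chunks, normalize_embeddings=True)

query = "What is the vacation policy?"
query_embedding = model.encode([query], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product on normalized vectors.
scores = chunk_embeddings @ query_embedding
best = int(np.argmax(scores))
print(f"best match (score {scores[best]:.3f}):\n{result.chunks[best]}")
```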
Technical Details¶
Under the hood, Kreuzberg uses the semantic-text-splitter library, which intelligently splits text while preserving semantic structure. The chunking process (mirrored in the sketch after this list):
- Identifies the content type (markdown or plain text)
- Creates an appropriate splitter based on the content type
- Splits the content according to the specified maximum size and overlap
- Returns the chunks as a list of strings in the ExtractionResult.chunks field
The chunker is cached for performance, so creating multiple extraction results with the same chunking parameters is efficient.
Best Practices¶
- Choose appropriate chunk sizes: Smaller chunks (1000-2000 characters) work well for precise semantic search, while larger chunks (4000-8000 characters) may be better for context-aware processing; both are shown in the sketch after this list.
- Set meaningful overlap: Overlap ensures that context isn't lost between chunks. A good rule of thumb is 5-10% of your chunk size.
- Consider content type: Markdown content may require larger chunk sizes to preserve structure.
- Test with your specific use case: Optimal chunking parameters depend on your specific documents and use case.
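As a starting point, the size and overlap guidance above might map onto ExtractionConfig like this (illustrative numbers; tune against your own documents):

```python
from kreuzberg import ExtractionConfig

# Precise semantic search: small chunks, overlap at ~5-10% of chunk size.
search_config = ExtractionConfig(chunk_content=True, max_chars=1500, max_overlap=100)

# Context-aware processing (e.g. LLM summarization): larger chunks.
context_config = ExtractionConfig(chunk_content=True, max_chars=8000, max_overlap=400)
```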