Metadata Extraction
Kreuzberg automatically extracts rich metadata from documents including titles, authors, dates, and format-specific properties.
How It Works
The ExtractionResult includes a metadata dictionary with the available metadata properties. Each format uses a specialized extractor:
- PDF: Uses Playa to extract document information, structure, and properties
- Office: Uses Pandoc for Word, PowerPoint, and other formats
- Images: Extracts dimensions and format properties
- HTML: Extracts meta tags and structured data
Metadata Fields
The Metadata dictionary is a TypedDict with optional fields. Only the fields that are available for a given document are included in the result.
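As a rough illustration of what "a TypedDict with optional fields" means in practice, the sketch below defines a small stand-in type. The class name DocMetadata and its three fields are assumptions chosen for the example; they are not Kreuzberg's actual Metadata definition.

```python
from typing import TypedDict


class DocMetadata(TypedDict, total=False):
    """Illustrative stand-in only; field names are assumed, not Kreuzberg's."""

    title: str
    authors: list[str]
    created_at: str


# total=False makes every key optional, so a document that only
# exposes a title yields a dictionary with just that one key.
meta: DocMetadata = {"title": "Quarterly Report"}

has_title = "title" in meta      # the key is present
has_authors = "authors" in meta  # the key is simply absent, not set to None
```

Because missing fields are absent rather than set to None, membership tests and .get() are the safe ways to read them.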
Usage Example
Accessing document metadata:
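A minimal sketch of reading a few fields from a metadata dictionary follows. The helper summarize, the sample values, and the field names title and authors are assumptions for illustration. With Kreuzberg installed, the dictionary would presumably come from an extraction result (for example result.metadata after calling extract_file), as noted in the comments.

```python
from typing import Any, Mapping


def summarize(metadata: Mapping[str, Any]) -> str:
    """Build a one-line summary from whatever fields are present."""
    title = metadata.get("title", "Untitled")
    authors = metadata.get("authors") or []
    byline = ", ".join(authors) if authors else "unknown author"
    return f"{title} by {byline}"


# Assumed usage with Kreuzberg (not executed here):
#   result = await extract_file("document.pdf")
#   metadata = result.metadata
# A hand-written sample stands in for real extraction output.
sample = {"title": "Annual Report", "authors": ["A. Writer", "B. Editor"]}
summary = summarize(sample)
print(summary)
```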
PDF-Specific Metadata
For PDF documents, Kreuzberg extracts a rich set of metadata including:
- Document information dictionary values (title, author, subject, keywords, etc.)
- Document structure information including page count, dimensions, and outline
- Font information from document pages
- PDF/A compliance status
- Encryption status and permissions
- Language information when available in document structure
If a PDF document contains UTF-16BE encoded strings (common in PDF metadata, where they start with the byte order mark \xfe\xff), Kreuzberg automatically detects and decodes them.
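To make the byte order mark check concrete, here is a small sketch of how such a string could be detected and decoded. This is not Kreuzberg's internal code; the function name decode_pdf_string is hypothetical, and the Latin-1 fallback is a simplification of PDFDocEncoding.

```python
import codecs


def decode_pdf_string(raw: bytes) -> str:
    """Decode a PDF text string, honoring a UTF-16BE byte order mark."""
    bom = codecs.BOM_UTF16_BE  # b"\xfe\xff"
    if raw.startswith(bom):
        # Drop the BOM, then decode the remaining bytes as big-endian UTF-16.
        return raw[len(bom):].decode("utf-16-be")
    # Simplified fallback: real PDFs use PDFDocEncoding here, which
    # overlaps with Latin-1 for most printable characters.
    return raw.decode("latin-1")


title = decode_pdf_string(b"\xfe\xff\x00H\x00i")
```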
Working with Multiple Document Types
Different document formats may provide different metadata fields. When working with multiple document types, access fields defensively, for example by using .get() with a default value:
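The sketch below applies that pattern to two hand-written metadata dictionaries standing in for a PDF and an image. The file names, field names, and values are assumptions for illustration; real output depends on the document and the extractor.

```python
from typing import Any, Mapping

# Stand-ins for result.metadata from two different formats.
documents: dict[str, Mapping[str, Any]] = {
    "report.pdf": {"title": "Report", "authors": ["A. Writer"]},
    "scan.png": {"width": 800, "height": 600},
}

rows = []
for name, metadata in documents.items():
    # .get() with a default avoids a KeyError when a field is absent.
    title = metadata.get("title", "(no title)")
    authors = ", ".join(metadata.get("authors", [])) or "(no authors)"
    rows.append(f"{name}: {title} / {authors}")

for row in rows:
    print(row)
```

The same loop handles both documents without branching on file type, which is the main benefit of defaulting missing fields.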
Viewing Available Metadata
To view all available metadata for a document:
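One possible way to do this, sketched with the standard library only, is shown here. The helper format_metadata and the sample dictionary are assumptions; in real use the dictionary would be the metadata attribute of an extraction result.

```python
import json
from typing import Any, Mapping


def format_metadata(metadata: Mapping[str, Any]) -> str:
    """Render a metadata dictionary as indented, key-sorted JSON."""
    # default=str keeps the dump from failing on non-JSON values such as dates.
    return json.dumps(dict(metadata), indent=2, sort_keys=True, default=str)


# Hand-written sample standing in for result.metadata.
sample = {"title": "Annual Report", "authors": ["A. Writer"]}
formatted = format_metadata(sample)
print(formatted)
```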
This will print all available metadata fields for the document in a readable format.