MCP Server¶
The Kreuzberg MCP (Model Context Protocol) server integrates Kreuzberg with AI tools such as Claude Desktop, Cursor, and other MCP-compatible applications. AI assistants can then extract text from documents directly, without separate API calls or manual file processing.
What is MCP?¶
The Model Context Protocol (MCP) is an open standard developed by Anthropic that allows AI applications to securely connect with external tools and data sources. It provides a standardized way for AI models to:
- Execute tools and functions
- Access resources and data
- Use pre-built prompt templates
Quick Start¶
Installation¶
The MCP server is included with the base Kreuzberg installation. For the best experience, install with all features:
pip install "kreuzberg[all]"
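Running the MCP Server¶
Start the server with the kreuzberg-mcp command, either directly from a pip installation or through uvx:
kreuzberg-mcp
uvx --with "kreuzberg[all]" kreuzberg-mcp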
Claude Desktop Configuration¶
Add Kreuzberg to your Claude Desktop configuration file:
On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
On Windows: %APPDATA%\Claude\claude_desktop_config.json
Recommended Configuration (All Features)¶
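As a minimal sketch, the script below prints a configuration entry that launches the server through uvx with every optional feature. It assumes Claude Desktop's top-level `mcpServers` object and reuses the `uvx --with "kreuzberg[...]" kreuzberg-mcp` invocation shown in the troubleshooting section; the server label `kreuzberg` is an arbitrary choice for this example.

```python
import json

# Assumption: Claude Desktop reads servers from a top-level "mcpServers" object.
# The "kreuzberg" key is only a label picked for this example.
config = {
    "mcpServers": {
        "kreuzberg": {
            "command": "uvx",
            "args": ["--with", "kreuzberg[all]", "kreuzberg-mcp"],
        }
    }
}

# Print JSON that can be pasted into claude_desktop_config.json.
config_json = json.dumps(config, indent=2)
print(config_json)
```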
Basic Configuration (Core Features Only)¶
Alternative: Using Pre-installed Kreuzberg¶
If you have Kreuzberg installed with pip:
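A hedged sketch of such an entry follows. It assumes the `kreuzberg-mcp` command is on your PATH after the pip installation; resolving it to an absolute path is a precaution added here (desktop applications may not inherit your shell's PATH), not something Kreuzberg requires. The helper name `preinstalled_entry` is hypothetical.

```python
import json
import shutil


def preinstalled_entry(executable="kreuzberg-mcp"):
    """Build a server entry that launches an already installed kreuzberg-mcp."""
    # Prefer the absolute path when the command is found; otherwise keep the bare name.
    resolved = shutil.which(executable)
    return {"command": resolved or executable, "args": []}


# Assumption: same "mcpServers" layout as the uvx-based configuration.
config = {"mcpServers": {"kreuzberg": preinstalled_entry()}}
print(json.dumps(config, indent=2))
```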
Optional Dependencies¶
The Kreuzberg MCP server supports enhanced functionality through optional dependencies:
Available Feature Sets¶
| Feature | Package Extra | Description |
|---|---|---|
| Content Chunking | chunking | Split documents into chunks for RAG applications |
| Language Detection | langdetect | Automatically detect document languages |
| Entity Extraction | entity-extraction | Extract named entities and keywords |
| Advanced OCR | easyocr, paddleocr | Alternative OCR engines |
| Table Extraction | gmft | Extract structured tables from PDFs |
| All Features | all | Install all optional dependencies |
Installation Examples¶
Using with uvx¶
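The uvx invocation follows one pattern regardless of which extras you pick. The helper below is an illustrative sketch (the function name `uvx_command` is hypothetical) that assembles the command from the extras in the feature table. Running without extras as `uvx --with kreuzberg kreuzberg-mcp` is an assumption extrapolated from that pattern.

```python
import shlex

# Extras listed in the feature table.
KNOWN_EXTRAS = {
    "chunking", "langdetect", "entity-extraction",
    "easyocr", "paddleocr", "gmft", "all",
}


def uvx_command(extras):
    """Return the argv list for running kreuzberg-mcp with the given extras."""
    unknown = set(extras) - KNOWN_EXTRAS
    if unknown:
        raise ValueError(f"Unknown extras: {sorted(unknown)}")
    package = f"kreuzberg[{','.join(extras)}]" if extras else "kreuzberg"
    return ["uvx", "--with", package, "kreuzberg-mcp"]


# shlex.join quotes the bracketed package spec so the shell does not expand it.
print(shlex.join(uvx_command(["chunking", "langdetect"])))
# uvx --with 'kreuzberg[chunking,langdetect]' kreuzberg-mcp
```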
Claude Desktop Configuration Examples¶
For different use cases:
RAG Application Setup¶
Document Analysis Setup¶
Advanced OCR Setup¶
Multiple MCP Servers (Recommended Setup)¶
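When other MCP servers are already configured, Kreuzberg is added as one more key under `mcpServers` rather than replacing the file. The sketch below shows that merge; `other-server` and its command are placeholders invented for the example, and `add_server` is a hypothetical helper.

```python
import copy
import json


def add_server(config, name, entry):
    """Return a copy of config with one more server under "mcpServers"."""
    merged = copy.deepcopy(config)
    merged.setdefault("mcpServers", {})[name] = entry
    return merged


# Hypothetical existing configuration with an unrelated server already present.
existing = {"mcpServers": {"other-server": {"command": "other-mcp", "args": []}}}

kreuzberg = {"command": "uvx", "args": ["--with", "kreuzberg[all]", "kreuzberg-mcp"]}
updated = add_server(existing, "kreuzberg", kreuzberg)
print(json.dumps(updated, indent=2))
```

Working on a copy keeps the original configuration untouched until you decide to write the result back to disk.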
Feature Availability¶
Without optional dependencies, certain features will be disabled:
- Chunking: chunk_content=True will raise an error
- Language Detection: auto_detect_language=True will be ignored
- Entity Extraction: extract_entities=True will be ignored
- Keyword Extraction: extract_keywords=True will be ignored
- Advanced OCR: Only Tesseract will be available
- Table Extraction: extract_tables=True will raise an error
The MCP server will inform you when features are unavailable due to missing dependencies.
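The rules in this section can be written down as a small lookup, which is handy for checking a request before sending it. This is a sketch of the documented behaviour only, not code from Kreuzberg; `predict_issues` and `FEATURE_RULES` are hypothetical names.

```python
# Option -> (required extra, behaviour when that extra is missing), per this section.
FEATURE_RULES = {
    "chunk_content": ("chunking", "error"),
    "extract_tables": ("gmft", "error"),
    "auto_detect_language": ("langdetect", "ignored"),
    "extract_entities": ("entity-extraction", "ignored"),
    "extract_keywords": ("entity-extraction", "ignored"),
}


def predict_issues(arguments, installed_extras):
    """Map each enabled but unavailable option to 'error' or 'ignored'."""
    installed = set(installed_extras)
    if "all" in installed:
        return {}
    issues = {}
    for option, (extra, outcome) in FEATURE_RULES.items():
        if arguments.get(option) and extra not in installed:
            issues[option] = outcome
    return issues


print(predict_issues({"chunk_content": True, "extract_keywords": True}, ["langdetect"]))
# {'chunk_content': 'error', 'extract_keywords': 'ignored'}
```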
Available Capabilities¶
Tools¶
The MCP server exposes three main extraction tools:
extract_document¶
Comprehensive document extraction with full configuration options.
Parameters:
- file_path (required): Path to the document file
- mime_type (optional): MIME type of the document
- force_ocr (optional): Force OCR even for text-based documents
- chunk_content (optional): Split content into chunks
- extract_tables (optional): Extract tables from the document
- extract_entities (optional): Extract named entities
- extract_keywords (optional): Extract keywords
- ocr_backend (optional): OCR backend to use (tesseract, easyocr, paddleocr)
- max_chars (optional): Maximum characters per chunk
- max_overlap (optional): Character overlap between chunks
- keyword_count (optional): Number of keywords to extract
- auto_detect_language (optional): Auto-detect document language
Returns: Dictionary with extracted content, metadata, tables, chunks, entities, and keywords.
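On the wire, an MCP client invokes a tool with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `extract_document` using the parameters listed above. The file path and the numeric values are placeholders, and most clients construct this message for you.

```python
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build an MCP "tools/call" request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = tool_call_request(1, "extract_document", {
    "file_path": "/path/to/report.pdf",  # hypothetical path
    "chunk_content": True,
    "max_chars": 1000,    # illustrative value
    "max_overlap": 200,   # illustrative value
    "extract_tables": True,
})
print(json.dumps(request, indent=2))
```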
extract_bytes¶
Extract text from document bytes (base64-encoded).
Parameters:
- content_base64 (required): Base64-encoded document content
- mime_type (required): MIME type of the document
- All other parameters same as extract_document
Returns: Dictionary with extracted content, metadata, and optional features.
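Because the content travels as text, raw bytes must be base64-encoded before they are passed to `extract_bytes`. A minimal sketch, with a hypothetical helper name and sample bytes standing in for a real document:

```python
import base64


def extract_bytes_arguments(data, mime_type, **options):
    """Build the arguments for extract_bytes from raw document bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"content_base64": encoded, "mime_type": mime_type, **options}


# Sample bytes; in practice this would be the content of a PDF, image, etc.
arguments = extract_bytes_arguments(b"Hello, Kreuzberg!", "text/plain", force_ocr=False)
print(arguments["content_base64"])
```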
extract_simple¶
Simple text extraction with minimal configuration.
Parameters:
- file_path (required): Path to the document file
- mime_type (optional): MIME type of the document
Returns: Extracted text content as a string.
Resources¶
Access configuration and system information:
config://default¶
Returns the default extraction configuration as a string.
config://available-backends¶
Lists available OCR backends (tesseract, easyocr, paddleocr).
extractors://supported-formats¶
Returns information about supported document formats.
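Resources are fetched with the MCP `resources/read` method, addressed by URI. As a sketch, the loop below prints one JSON-RPC request per resource listed in this section; the request IDs are arbitrary.

```python
import json

# Resource URIs documented in this section.
RESOURCE_URIS = [
    "config://default",
    "config://available-backends",
    "extractors://supported-formats",
]


def read_resource_request(request_id, uri):
    """Build an MCP "resources/read" request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "resources/read",
        "params": {"uri": uri},
    }


for request_id, uri in enumerate(RESOURCE_URIS, start=1):
    print(json.dumps(read_resource_request(request_id, uri)))
```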
Prompts¶
Pre-built prompt templates for common workflows:
extract_and_summarize¶
Extracts text from a document and provides a prompt for summarization.
Parameters:
- file_path (required): Path to the document file
Returns: Extracted content with summarization prompt.
extract_structured¶
Extracts text with structured analysis including entities, keywords, and tables.
Parameters:
- file_path (required): Path to the document file
Returns: Extracted content with structured analysis prompt.
Usage Examples¶
Basic Text Extraction¶
Advanced Document Processing (with all features)¶
Comprehensive Document Analysis¶
Structured Analysis with Prompts¶
Integration with Other AI Tools¶
Cursor IDE¶
Configure Kreuzberg MCP server in Cursor's settings:
Custom MCP Clients¶
You can also integrate with custom MCP clients using the standard MCP protocol:
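One way to do this from Python is the official `mcp` SDK, which can start the server as a subprocess and talk to it over stdio. The sketch below assumes that package is installed separately (`pip install mcp`) and that the server is launched through uvx as elsewhere in this guide. The function name `extract_text` is hypothetical, and the shape of the returned content may vary by server version.

```python
import asyncio

# uvx invocation used elsewhere in this guide; adjust the extras to your setup.
SERVER_COMMAND = "uvx"
SERVER_ARGS = ["--with", "kreuzberg[all]", "kreuzberg-mcp"]


async def extract_text(file_path):
    """Start the server over stdio, call extract_simple, and return the text parts."""
    # Imported lazily so this module loads even when the `mcp` package is absent.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command=SERVER_COMMAND, args=SERVER_ARGS)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("extract_simple", {"file_path": file_path})
            # Tool results carry a list of content items; keep the textual ones.
            return [item.text for item in result.content if hasattr(item, "text")]


# Example call (needs the `mcp` package, uvx, and a real file):
# print(asyncio.run(extract_text("/absolute/path/to/document.pdf")))
```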
Configuration¶
OCR Backend Selection¶
The MCP server supports all three OCR backends. You can specify which one to use:
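As an illustration, the helper below builds `extract_document` arguments and rejects backend names outside the three documented ones. Defaulting to `tesseract` is an assumption made for this sketch (it is the only backend available without extras), and `ocr_arguments` is a hypothetical name.

```python
# Backends named in this guide.
OCR_BACKENDS = ("tesseract", "easyocr", "paddleocr")


def ocr_arguments(file_path, backend="tesseract", force_ocr=False):
    """Build extract_document arguments with a validated ocr_backend."""
    if backend not in OCR_BACKENDS:
        raise ValueError(f"ocr_backend must be one of {OCR_BACKENDS}, got {backend!r}")
    return {"file_path": file_path, "ocr_backend": backend, "force_ocr": force_ocr}


# Hypothetical scanned image processed with EasyOCR.
print(ocr_arguments("/path/to/scan.png", backend="easyocr", force_ocr=True))
```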
Chunking for RAG Applications¶
For RAG (Retrieval-Augmented Generation) applications, you can chunk content:
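To make `max_chars` and `max_overlap` concrete, the sketch below splits a string into fixed-size character windows that share an overlap. It is a deliberately naive illustration of the two parameters, not Kreuzberg's actual splitter, and the argument values at the end are examples only.

```python
def chunk_text(text, max_chars, max_overlap):
    """Split text into windows of max_chars that share max_overlap characters."""
    if max_overlap >= max_chars:
        raise ValueError("max_overlap must be smaller than max_chars")
    step = max_chars - max_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once a window reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", max_chars=4, max_overlap=1))
# ['abcd', 'defg', 'ghij']

# Corresponding extract_document arguments (path and sizes are illustrative).
arguments = {
    "file_path": "/path/to/handbook.pdf",
    "chunk_content": True,
    "max_chars": 1000,
    "max_overlap": 200,
}
```

The overlap repeats the tail of each chunk at the head of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.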
Advanced Features¶
Language Detection¶
The MCP server can automatically detect document languages:
Table Extraction¶
Extract structured tables from documents:
Troubleshooting¶
Common Issues¶
- MCP Server Not Starting
  - Ensure Kreuzberg is properly installed: pip install kreuzberg
  - Check that the command is available: which kreuzberg-mcp
- Claude Desktop Not Connecting
  - Verify the configuration file path is correct
  - Check that uvx is installed and available
  - Restart Claude Desktop after configuration changes
- OCR Not Working
  - Ensure system dependencies are installed (tesseract, etc.)
  - Check OCR backend availability using the config://available-backends resource
- File Access Issues
  - Verify file paths are absolute and accessible
  - Check file permissions
  - Ensure the document format is supported
- Missing Optional Dependencies
  - Chunking Error: MissingDependencyError: The package 'semantic-text-splitter' is required
    - Solution: uvx --with "kreuzberg[chunking]" kreuzberg-mcp
  - Language Detection Ignored: No error, but auto_detect_language=True has no effect
    - Solution: uvx --with "kreuzberg[langdetect]" kreuzberg-mcp
  - Entity/Keyword Extraction Ignored: No error, but features return None
    - Solution: uvx --with "kreuzberg[entity-extraction]" kreuzberg-mcp
  - Advanced OCR Unavailable: easyocr or paddleocr backend not found
    - Solution: uvx --with "kreuzberg[easyocr,paddleocr]" kreuzberg-mcp
  - Table Extraction Error: MissingDependencyError: The package 'gmft' is required
    - Solution: uvx --with "kreuzberg[gmft]" kreuzberg-mcp
- uvx Command Not Found
  - Install uv, which provides the uvx command: pip install uv
  - Or use pip installation: pip install "kreuzberg[all]" then kreuzberg-mcp
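Several of these checks can be automated. The script below is a rough sketch covering three of them: whether a launcher is on PATH, whether the configuration file exists and parses as JSON, and whether the document path is absolute and readable. The file names passed at the bottom are placeholders, and `diagnose` is a hypothetical helper.

```python
import json
import os
import shutil


def diagnose(config_path, document_path):
    """Return a list of human-readable problems found by basic local checks."""
    problems = []
    if shutil.which("uvx") is None and shutil.which("kreuzberg-mcp") is None:
        problems.append("neither uvx nor kreuzberg-mcp is on PATH")
    if not os.path.isfile(config_path):
        problems.append(f"config file not found: {config_path}")
    else:
        try:
            with open(config_path, encoding="utf-8") as handle:
                json.load(handle)
        except json.JSONDecodeError as error:
            problems.append(f"config file is not valid JSON: {error}")
    if not os.path.isabs(document_path):
        problems.append(f"document path is not absolute: {document_path}")
    elif not os.access(document_path, os.R_OK):
        problems.append(f"document is missing or unreadable: {document_path}")
    return problems


# Placeholder paths; substitute your real configuration file and document.
for problem in diagnose("claude_desktop_config.json", "report.pdf"):
    print("-", problem)
```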
Debug Mode¶
Run the MCP server with debug logging:
Security Considerations¶
The MCP server operates locally and does not send data to external services:
- All document processing happens on your machine
- No cloud dependencies or external API calls
- File access is limited to what you explicitly request
- No data is stored or cached beyond the session
Performance Tips¶
For optimal performance when using the MCP server:
- Use appropriate tools: extract_simple for basic text, extract_document for advanced features
- Chunk large documents: Enable chunking for documents over 10 MB
- Select OCR backend: Choose the most appropriate OCR backend for your use case
- Batch processing: For multiple documents, consider using the CLI or API instead
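The first two tips can be combined into a simple selection rule. The sketch below is one possible reading of them, with the 10 MB guideline interpreted as 10 MiB; `plan_extraction` is a hypothetical helper, not part of Kreuzberg.

```python
# The 10 MB guideline from the tips, interpreted here as 10 MiB.
CHUNK_THRESHOLD_BYTES = 10 * 1024 * 1024


def plan_extraction(file_path, size_bytes, needs_advanced_features=False):
    """Pick a tool name and arguments following the performance tips."""
    large = size_bytes > CHUNK_THRESHOLD_BYTES
    if not needs_advanced_features and not large:
        return "extract_simple", {"file_path": file_path}
    arguments = {"file_path": file_path}
    if large:
        # Large documents get chunking enabled.
        arguments["chunk_content"] = True
    return "extract_document", arguments


print(plan_extraction("/path/to/notes.txt", 2_000))
# ('extract_simple', {'file_path': '/path/to/notes.txt'})
```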
Next Steps¶
- API Reference - Complete API documentation
- CLI Guide - Command-line interface
- Docker Guide - Container deployment
- OCR Configuration - OCR engine setup