Docker
Kreuzberg provides official Docker images for easy deployment and containerized usage.
Available Images
Docker images are available on Docker Hub:
- goldziher/kreuzberg:latest - Core image with API server and Tesseract OCR
- goldziher/kreuzberg:latest-easyocr - With EasyOCR support
- goldziher/kreuzberg:latest-paddle - With PaddleOCR support
- goldziher/kreuzberg:latest-gmft - With GMFT table extraction
- goldziher/kreuzberg:latest-all - With all optional dependencies
Note: Specific version tags are also available (e.g., v3.4.0, v3.4.0-easyocr).
Quick Start
Running the API Server
The API server will be available at http://localhost:8000.
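A minimal sketch of starting the server, assuming Docker is installed and using the core image tag listed under Available Images. The function name start_kreuzberg, the container name kreuzberg, and the KREUZBERG_IMAGE and KREUZBERG_PORT variables are illustrative and not defined by Kreuzberg.

```shell
# Illustrative defaults (not defined by Kreuzberg); override via environment.
IMAGE="${KREUZBERG_IMAGE:-goldziher/kreuzberg:latest}"
HOST_PORT="${KREUZBERG_PORT:-8000}"

start_kreuzberg() {
  # Fetch the image, then run it detached. The container listens on
  # port 8000, which is published on the chosen host port.
  docker pull "$IMAGE" &&
    docker run -d --name kreuzberg -p "${HOST_PORT}:8000" "$IMAGE"
}
```

Calling start_kreuzberg with the defaults publishes the API on host port 8000; docker stop kreuzberg stops the container again.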
Extract Files
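With the server running, a file can be uploaded over HTTP. The sketch below assumes the API accepts multipart uploads at POST /extract in a form field named data; neither name is stated in this document, so check them against the API reference. The function name extract_file and the KREUZBERG_URL variable are illustrative.

```shell
# Assumption: uploads go to POST /extract in a multipart field named
# "data". Neither name is confirmed by this document.
API_URL="${KREUZBERG_URL:-http://localhost:8000}"

extract_file() {
  # -sS hides the progress meter but still prints errors.
  curl -sS -X POST "${API_URL}/extract" -F "data=@$1"
}
```

For example, extract_file document.pdf sends document.pdf from the current directory and prints the server's response.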
Docker Compose
Create a docker-compose.yml in your project directory, then start the service with Docker Compose.
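A minimal sketch of such a Compose file, written from the shell. The service name kreuzberg and the restart policy are illustrative choices; only the image tag and port 8000 come from this page.

```shell
# Write a minimal Compose file. The service name and restart policy are
# illustrative choices, not taken from the Kreuzberg documentation.
cat > docker-compose.yml <<'EOF'
services:
  kreuzberg:
    image: goldziher/kreuzberg:latest
    ports:
      - "8000:8000"
    restart: unless-stopped
EOF
```

With the file in place, docker compose up -d starts the service in the background and docker compose down stops it.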
Using Different OCR Engines
EasyOCR
PaddleOCR
All Features
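Each variant is selected purely by image tag. As a sketch, the helper below maps a variant name to the tags listed under Available Images; the function name kreuzberg_image is illustrative.

```shell
# Map a variant name to its image tag, following the tag list under
# "Available Images". The function name is illustrative.
kreuzberg_image() {
  case "${1:-core}" in
    core) echo "goldziher/kreuzberg:latest" ;;
    easyocr | paddle | gmft | all) echo "goldziher/kreuzberg:latest-$1" ;;
    *)
      echo "unknown variant: $1" >&2
      return 1
      ;;
  esac
}

# Example: run the EasyOCR variant on port 8000.
# docker run -p 8000:8000 "$(kreuzberg_image easyocr)"
```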
Building Custom Images
If you need a custom configuration, you can build your own image:
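One possible approach, sketched here, is to extend an official image rather than build from scratch. The example adds a German Tesseract language pack and assumes the final image is Debian-based with apt available, as the python:3.13-bookworm base suggests. The file name Dockerfile.custom and the choice of language pack are illustrative.

```shell
# Write a Dockerfile that extends the official image with an extra
# Tesseract language pack (German, as an example).
cat > Dockerfile.custom <<'EOF'
FROM goldziher/kreuzberg:latest

# Package installation needs root; the image otherwise runs as appuser.
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends tesseract-ocr-deu \
    && rm -rf /var/lib/apt/lists/*
USER appuser
EOF
```

Running docker build -f Dockerfile.custom -t kreuzberg-custom . would then build it, where kreuzberg-custom is a placeholder tag.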
Image Details
Base Image
- Based on python:3.13-bookworm (requires Python 3.10+)
- Includes system dependencies: pandoc, tesseract-ocr
- Runs as non-root user appuser
- Exposes port 8000
Included Dependencies
All images include:
- Kreuzberg core library
- Litestar API framework
- Tesseract OCR
- Pandoc for document conversion
Additional dependencies by variant:
- easyocr: EasyOCR deep learning models
- paddle: PaddleOCR and PaddlePaddle
- gmft: GMFT for table extraction
- all: All optional dependencies
Health Check
All Docker images include a health check endpoint:
Returns a JSON response with service status and version information.
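A sketch of polling the health check until the service is ready, for example in a deployment script. It assumes the endpoint is served at GET /health, which this document does not state. The function name wait_for_health and the KREUZBERG_URL variable are illustrative.

```shell
# Assumption: the health check is served at GET /health; the path is not
# stated in this document.
HEALTH_URL="${KREUZBERG_URL:-http://localhost:8000}/health"

wait_for_health() {
  # Poll until the endpoint answers without an HTTP error, up to $1
  # tries (default 30), sleeping one second between attempts.
  tries="${1:-30}"
  while [ "$tries" -gt 0 ]; do
    if curl -fs "$HEALTH_URL"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "service did not become healthy" >&2
  return 1
}
```

On success the function prints the JSON body returned by the endpoint and exits with status 0.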
Observability
The Docker images include built-in OpenTelemetry instrumentation via Litestar:
- Tracing: Automatic request/response tracing
- Metrics: Performance and usage metrics
- Logging: Structured JSON logging
Configure via standard OpenTelemetry environment variables:
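As a sketch, the variables can be collected in an env file. The names below are standard OpenTelemetry SDK variables; the service name and collector address are placeholders, and which variables the image actually honors depends on how it configures the SDK.

```shell
# Standard OpenTelemetry SDK variables. The service name and collector
# address are placeholders for your own setup.
cat > otel.env <<'EOF'
OTEL_SERVICE_NAME=kreuzberg-api
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
EOF
```

Passing --env-file otel.env to docker run hands these settings to the container.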
Environment Variables
- PYTHONUNBUFFERED=1 - Disables Python output buffering so log lines appear immediately
- PYTHONDONTWRITEBYTECODE=1 - Prevents .pyc file creation
- UV_LINK_MODE=copy - Makes uv copy package files instead of hard-linking them
Production Deployment
With nginx Reverse Proxy
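A minimal sketch of an nginx site configuration that forwards traffic to a container published on local port 8000. The server name, upstream address, upload limit, and timeout are placeholders to adapt, not values from the Kreuzberg documentation.

```shell
# Write a minimal nginx site config. The server name, upstream address,
# and 100 MB upload limit are placeholders.
cat > kreuzberg.conf <<'EOF'
upstream kreuzberg {
    server 127.0.0.1:8000;
}

server {
    listen 80;
    server_name kreuzberg.example.com;

    # Documents can be large; nginx defaults to a 1 MB request body.
    client_max_body_size 100m;

    location / {
        proxy_pass http://kreuzberg;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # OCR on long documents may exceed the default 60 s read timeout.
        proxy_read_timeout 300s;
    }
}
EOF
```

The quoted heredoc delimiter keeps the shell from expanding nginx variables such as $host while writing the file.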
Kubernetes Deployment
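A sketch of a Deployment and Service manifest, written from the shell. The resource requests and limits mirror the minimum and recommended figures under Resource Requirements; the object names, replica count, and the /health probe path are assumptions.

```shell
# Write a Deployment plus Service manifest. Names, replica count, and
# the probe path are illustrative assumptions.
cat > kreuzberg-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kreuzberg
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kreuzberg
  template:
    metadata:
      labels:
        app: kreuzberg
    spec:
      containers:
        - name: kreuzberg
          image: goldziher/kreuzberg:latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "1"
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 2Gi
          # Assumption: the health check path is /health.
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: kreuzberg
spec:
  selector:
    app: kreuzberg
  ports:
    - port: 80
      targetPort: 8000
EOF
```

The manifest can then be applied with kubectl apply -f kreuzberg-deployment.yaml.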
Resource Requirements
Minimum Requirements
- CPU: 1 core
- Memory: 512MB
- Storage: 1GB (more for OCR models)
Recommended for Production
- CPU: 2+ cores
- Memory: 2GB+ (4GB+ for EasyOCR/PaddleOCR)
- Storage: 5GB+ (depends on OCR models)
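These recommendations can be enforced as container limits. The helper below is a sketch that picks a memory limit per variant from the figures above; the function name is illustrative, and treating the all variant like the heavier OCR engines is an assumption.

```shell
# Pick a container memory limit from the recommendations above:
# 4 GB for the EasyOCR and PaddleOCR variants, 2 GB otherwise.
# Treating the "all" variant like the heavier engines is an assumption.
kreuzberg_memory_limit() {
  case "${1:-core}" in
    easyocr | paddle | all) echo "4g" ;;
    *) echo "2g" ;;
  esac
}

# Example: cap the PaddleOCR variant at 2 CPUs and 4 GB of memory.
# docker run --cpus 2 --memory "$(kreuzberg_memory_limit paddle)" \
#   -p 8000:8000 goldziher/kreuzberg:latest-paddle
```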
OCR Model Sizes
- Tesseract: ~100MB
- EasyOCR: ~64MB-2.5GB per language
- PaddleOCR: ~400MB
Troubleshooting
Container Logs
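A small sketch for following logs, assuming the container was started with --name kreuzberg. The function name and the default of 100 lines are illustrative.

```shell
# Follow a container's logs, starting from the last N lines.
# $1 = container name (default: kreuzberg), $2 = line count (default: 100).
kreuzberg_logs() {
  docker logs --follow --tail "${2:-100}" "${1:-kreuzberg}"
}
```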
Shell Access
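A sketch for opening a shell inside a running container. It assumes the container name kreuzberg and that /bin/bash exists in the image, which the Debian bookworm base suggests but this page does not confirm.

```shell
# Open an interactive shell in a running container.
# $1 = container name (default: kreuzberg). Assumes /bin/bash is present.
kreuzberg_shell() {
  docker exec -it "${1:-kreuzberg}" /bin/bash
}
```

Because the image runs as the non-root appuser, the shell has that user's permissions.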
Common Issues
- Out of Memory: Increase Docker memory allocation or use a smaller OCR engine
- Slow Startup: First run downloads OCR models; subsequent runs are faster
- Permission Denied: Ensure mounted volumes have correct permissions
Security Considerations
- Runs as non-root user by default
- No external API calls or cloud dependencies
- Processes files locally within the container
- Consider adding authentication for production use
- Use volume mounts carefully to limit file system access
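As a sketch of the last two points, the helper below publishes the API on the loopback interface only and mounts a single input directory read-only. The function name and the /data mount point are arbitrary choices for illustration.

```shell
# Run with a read-only bind mount of one input directory, reachable only
# from the local machine. /data is an arbitrary mount point.
run_readonly() {
  input_dir="${1:?usage: run_readonly <absolute-input-dir>}"
  docker run --rm -p 127.0.0.1:8000:8000 \
    -v "${input_dir}:/data:ro" \
    goldziher/kreuzberg:latest
}
```

Pass an absolute host path; Docker treats a bare name in -v as a named volume rather than a directory.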