Error Handling
Kreuzberg provides specific exception types to help you handle different error scenarios during text extraction.
Exception Hierarchy
All Kreuzberg exceptions inherit from the base KreuzbergError
class:
| KreuzbergError
├── MissingDependencyError
├── OCRError
├── ParsingError
└── ValidationError
|
Handling Specific Exceptions
Comprehensive Error Handling
| from kreuzberg import extract_file
from kreuzberg import KreuzbergError, MissingDependencyError, OCRError, ParsingError, ValidationError
async def safe_extract(path):
try:
result = await extract_file(path)
return result.content
except MissingDependencyError as e:
# Handle missing system dependencies (Tesseract, Pandoc)
print(f"Missing dependency: {e}")
print("Please install the required dependencies.")
# You might want to provide installation instructions here
except OCRError as e:
# Handle OCR processing failures
print(f"OCR processing failed: {e}")
# You might want to retry with different OCR settings
except ParsingError as e:
# Handle document parsing failures
print(f"Document parsing failed: {e}")
# You might want to try a different approach or format
except ValidationError as e:
# Handle validation errors in configuration
print(f"Validation error: {e}")
# Fix the configuration issue
except KreuzbergError as e:
# Catch-all for any other Kreuzberg-specific errors
print(f"Extraction error: {e}")
except Exception as e:
# Handle unexpected errors
print(f"Unexpected error: {e}")
return None
|
Simplified Error Handling
For simpler applications, you can catch just the base KreuzbergError
:
| from kreuzberg import extract_file, KreuzbergError
async def simple_safe_extract(path):
try:
result = await extract_file(path)
return result.content
except KreuzbergError as e:
print(f"Extraction failed: {e}")
return None
|
Common Error Scenarios
Missing Dependencies
| try:
result = await extract_file("document.pdf")
except MissingDependencyError as e:
if "tesseract" in str(e).lower():
print("Tesseract OCR is not installed. Please install it:")
print(" - Ubuntu: sudo apt-get install tesseract-ocr")
print(" - macOS: brew install tesseract")
print(" - Windows: choco install tesseract")
elif "pandoc" in str(e).lower():
print("Pandoc is not installed. Please install it:")
print(" - Ubuntu: sudo apt-get install pandoc")
print(" - macOS: brew install pandoc")
print(" - Windows: choco install pandoc")
|
OCR Errors
| from kreuzberg import extract_file, OCRError, TesseractConfig, PSMMode
async def extract_with_fallback(path):
# Try with default settings
try:
result = await extract_file(path)
return result.content
except OCRError:
# Try with different OCR settings
try:
result = await extract_file(
path, force_ocr=True, ocr_config=TesseractConfig(psm=PSMMode.SINGLE_BLOCK, language="eng")
)
return result.content
except OCRError as e:
print(f"OCR failed with all attempts: {e}")
return None
|
Validation Errors
| from kreuzberg import extract_file, ValidationError, TesseractConfig
async def extract_with_validation_handling():
try:
# This will raise a ValidationError - incompatible config
result = await extract_file(
"document.pdf", ocr_backend="easyocr", ocr_config=TesseractConfig(language="eng") # Wrong config type for easyocr
)
except ValidationError as e:
print(f"Configuration error: {e}")
# Fix the configuration
from kreuzberg import EasyOCRConfig
result = await extract_file(
"document.pdf", ocr_backend="easyocr", ocr_config=EasyOCRConfig(language="en") # Correct config type
)
return result.content
|
Best Practices
- Always use try/except: Wrap extraction calls in try/except blocks to handle potential errors gracefully
- Provide helpful error messages: Give users clear information about what went wrong and how to fix it
- Implement fallbacks: When possible, try alternative approaches when the primary method fails
- Log detailed error information: Include error details in logs for debugging
- Check dependencies upfront: Verify that required dependencies are installed before attempting extraction
| import subprocess
from kreuzberg import extract_file
def check_dependencies():
"""Check if required dependencies are installed."""
missing = []
# Check for Tesseract
try:
subprocess.run(["tesseract", "--version"], capture_output=True, check=True)
except (subprocess.SubprocessError, FileNotFoundError):
missing.append("tesseract")
# Check for Pandoc
try:
subprocess.run(["pandoc", "--version"], capture_output=True, check=True)
except (subprocess.SubprocessError, FileNotFoundError):
missing.append("pandoc")
return missing
async def main():
# Check dependencies before extraction
missing_deps = check_dependencies()
if missing_deps:
print(f"Missing dependencies: {', '.join(missing_deps)}")
print("Please install them before continuing.")
return
# Proceed with extraction
try:
result = await extract_file("document.pdf")
print(result.content)
except Exception as e:
print(f"Extraction failed: {e}")
|