Multi-Language OCR Guide
Readur can run OCR on a document with several language models at once, which improves text extraction for documents that contain more than one language.
🌍 Overview
The multi-language OCR system allows you to:
- Process documents in up to 4 languages simultaneously for best results
- Set preferred languages that apply to all your document uploads
- Retry failed OCR with different language combinations
- Automatically optimize text extraction by using multiple language models
🚀 Getting Started
Setting Your Language Preferences
- Navigate to Settings in your account
- Select the OCR Languages section
- Choose up to 4 preferred languages - these will be used for all new uploads
- Set a primary language - this language gets processing priority
- Save your preferences
Example preferred language setup:
- Primary: English (eng)
- Additional: Spanish (spa), French (fra)
- Result: Documents processed with English priority, plus Spanish and French recognition
Language Selection During Upload
When uploading documents, you can:
- Use your default preferences - no action needed
- Override for specific documents:
- Click the language selector in the upload area
- Choose different languages for this upload session
- These languages will be applied to all files in the current upload
📋 Available Languages
Readur supports 67+ languages including:
Major World Languages
- English (eng) - Default and most reliable
- Spanish (spa) - Excellent accuracy
- French (fra) - High quality results
- German (deu) - Strong performance
- Italian (ita) - Good accuracy
- Portuguese (por) - Reliable processing
- Russian (rus) - Solid results
Asian Languages
- Chinese Simplified (chi_sim)
- Chinese Traditional (chi_tra)
- Japanese (jpn)
- Korean (kor)
- Hindi (hin)
- Thai (tha)
- Vietnamese (vie)
European Languages
- Dutch (nld)
- Swedish (swe)
- Norwegian (nor)
- Danish (dan)
- Finnish (fin)
- Polish (pol)
- Czech (ces)
And Many More
Including Arabic (ara), Hebrew (heb), Turkish (tur), and dozens of other languages.
Tip: For the complete list of available languages, visit the OCR Languages page in your settings or call the API endpoint:
GET /api/ocr/languages
🛠️ Using the API
Get Available Languages
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://your-readur-instance.com/api/ocr/languages
Response:
{
  "available_languages": [
    {
      "code": "eng",
      "name": "English",
      "installed": true
    },
    {
      "code": "spa",
      "name": "Spanish",
      "installed": true
    }
  ],
  "current_user_language": "eng"
}
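The response above can be filtered on the client side to find which languages are actually usable. The following is a minimal sketch that assumes the response has exactly the shape shown in this guide; the helper name `installed_codes` is hypothetical and not part of Readur.

```python
import json
from typing import List

# Sample body copied from the response shown in this guide.
SAMPLE_RESPONSE = """
{
  "available_languages": [
    {"code": "eng", "name": "English", "installed": true},
    {"code": "spa", "name": "Spanish", "installed": true}
  ],
  "current_user_language": "eng"
}
"""


def installed_codes(response_text: str) -> List[str]:
    """Return the codes of all languages marked as installed."""
    payload = json.loads(response_text)
    return [
        lang["code"]
        for lang in payload.get("available_languages", [])
        if lang.get("installed")
    ]


codes = installed_codes(SAMPLE_RESPONSE)
print(codes)
```

Checking a requested language against this list before uploading is one way to avoid a "Language not available" error later.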
Update Language Preferences
curl -X PUT \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "preferred_languages": ["eng", "spa", "fra"],
    "primary_language": "eng"
  }' \
  https://your-readur-instance.com/api/settings
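If you generate this request body from code, it can help to enforce the 4-language limit before sending. The sketch below only builds the JSON body shown in the curl example; the function name `build_language_preferences` is hypothetical, and whether the server requires the primary language to appear first in `preferred_languages` is an assumption.

```python
from typing import Dict, List, Union

MAX_LANGUAGES = 4  # limit stated in this guide


def build_language_preferences(
    primary: str, additional: List[str]
) -> Dict[str, Union[str, List[str]]]:
    """Return a settings body with the primary language listed first."""
    # Put the primary language first and drop duplicates, keeping order.
    ordered = list(dict.fromkeys([primary, *additional]))
    if len(ordered) > MAX_LANGUAGES:
        raise ValueError(
            f"at most {MAX_LANGUAGES} languages allowed, got {len(ordered)}"
        )
    return {"preferred_languages": ordered, "primary_language": primary}


prefs = build_language_preferences("eng", ["spa", "fra"])
print(prefs)
```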
Retry OCR with Different Languages
curl -X POST \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "languages": ["eng", "deu"]
  }' \
  https://your-readur-instance.com/api/documents/DOCUMENT_ID/ocr/retry
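The same retry call can be prepared from Python using only the standard library. This is a sketch, not an official client: `build_retry_request` is a hypothetical helper, the host is the placeholder used throughout this guide, and the request is built but deliberately not sent.

```python
import json
import urllib.request
from typing import List

BASE_URL = "https://your-readur-instance.com"  # placeholder host from this guide


def build_retry_request(
    document_id: str, languages: List[str], token: str
) -> urllib.request.Request:
    """Build (but do not send) the OCR retry request."""
    body = json.dumps({"languages": languages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/documents/{document_id}/ocr/retry",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_retry_request("DOCUMENT_ID", ["eng", "deu"], "YOUR_TOKEN")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Separating request construction from sending keeps the URL and body easy to inspect or log before any network traffic occurs.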
🎯 Best Practices
Language Selection Strategy
For Mixed-Language Documents:
- Choose 2-3 languages that appear in your document
- Always include English as a fallback (most reliable)
- Put the dominant language first as your primary language
Examples:
- Business document with English/Spanish: ["eng", "spa"]
- European legal document: ["eng", "fra", "deu"]
- Academic paper with multiple references: ["eng", "spa", "ita"]
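The selection strategy above (dominant language first, English as a fallback, no more than 4 languages) can be written as a small helper. This is an illustrative sketch; `select_languages` is a hypothetical name, and raising an error when the limit is exceeded is one possible design choice rather than documented Readur behavior.

```python
from typing import List

MAX_LANGUAGES = 4  # limit stated in this guide
FALLBACK = "eng"   # English, recommended as the fallback


def select_languages(dominant: str, others: List[str]) -> List[str]:
    """Order languages: dominant first, then the others, then the fallback."""
    # Drop duplicates while keeping the dominant language in first position.
    ordered = list(dict.fromkeys([dominant, *others]))
    if FALLBACK not in ordered:
        ordered.append(FALLBACK)
    if len(ordered) > MAX_LANGUAGES:
        raise ValueError(
            f"{len(ordered)} languages selected, the limit is {MAX_LANGUAGES}"
        )
    return ordered


print(select_languages("spa", []))
print(select_languages("eng", ["fra", "deu"]))
```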
Performance Optimization
Do:
- ✅ Limit to 2-4 languages for best performance
- ✅ Include English when processing mixed content
- ✅ Use specific language combinations for consistent document types
- ✅ Set realistic expectations for complex multilingual documents
Don't:
- ❌ Select languages not present in your documents
- ❌ Use more than 4 languages simultaneously
- ❌ Expect perfect results with very low-quality scans
- ❌ Mix completely unrelated language families unnecessarily
🔄 Retrying OCR Processing
If OCR results are poor, you can retry with different languages:
Via Web Interface
- Navigate to the document with poor OCR results
- Click the "Retry OCR" button
- Select different languages that better match your document
- Start the retry process
Common Retry Scenarios
Scenario 1: Wrong Language Selected
- Original: English-only processing of Spanish document
- Solution: Retry with ["spa", "eng"]
Scenario 2: Mixed Language Document
- Original: Single language processing
- Solution: Add 2-3 relevant languages
Scenario 3: Poor Quality Scan
- Original: Fast processing with limited languages
- Solution: Try with primary language + English fallback
📊 Monitoring OCR Results
Understanding OCR Confidence
- 90%+ - Excellent results, high accuracy
- 70-89% - Good results, minor errors possible
- 50-69% - Moderate results, review recommended
- Below 50% - Poor results, consider retry with different languages
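These bands translate directly into a lookup that a script could use to flag documents for review or retry. The sketch below assumes confidence is available as a number from 0 to 100; how Readur exposes that value is not specified in this guide, and `classify_confidence` is a hypothetical helper. Fractional values such as 89.5 fall into the lower band, since each threshold is treated as a lower bound.

```python
def classify_confidence(percent: float) -> str:
    """Map an OCR confidence percentage to the bands described in this guide."""
    if not 0 <= percent <= 100:
        raise ValueError(f"confidence must be between 0 and 100, got {percent}")
    if percent >= 90:
        return "excellent"
    if percent >= 70:
        return "good"
    if percent >= 50:
        return "moderate"
    return "poor"  # consider retrying with different languages


for value in (95, 72.5, 50, 31):
    print(value, classify_confidence(value))
```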
Language-Specific Performance
Different languages have varying accuracy rates:
- English and Romance languages (Spanish, French): Highest accuracy
- Other Germanic languages (German, Dutch): Very good accuracy
- Asian languages (Chinese, Japanese): Good accuracy with proper font recognition
- Arabic/Hebrew scripts: Moderate accuracy, depends on text quality
🐛 Troubleshooting
Common Issues
Problem: "Language not available" error
Solution:
- Check language code spelling (e.g., eng not english)
- Verify language is installed on the server
- Contact administrator if language should be available
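A common cause of this error is passing a language name where a code is expected. The following sketch shows one way a client could normalize input before calling the API; `resolve_language_code` is a hypothetical helper, and the `AVAILABLE` mapping is a hand-written subset of the codes listed in this guide rather than live server data.

```python
from typing import Dict, Optional

# Hypothetical subset of the codes and names listed in this guide.
AVAILABLE: Dict[str, str] = {"eng": "English", "spa": "Spanish", "fra": "French"}


def resolve_language_code(value: str, available: Dict[str, str]) -> Optional[str]:
    """Return a valid code for `value`, or None if nothing matches."""
    candidate = value.strip().lower()
    if candidate in available:
        return candidate  # already a valid code
    for code, name in available.items():
        if name.lower() == candidate:
            return code  # a full name was given, map it to its code
    return None


print(resolve_language_code("english", AVAILABLE))
print(resolve_language_code("deu", AVAILABLE))
```

In this sample mapping, "deu" resolves to nothing because German is not in the subset, which mirrors the case where a language is valid but not installed on the server.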
Problem: Poor OCR results despite correct language
Solutions:
- Ensure document scan quality is sufficient (300+ DPI recommended)
- Try adding English as a fallback language
- Consider document preprocessing (contrast, rotation correction)
- Retry with fewer languages for better performance
Problem: Slow processing with multiple languages
Solutions:
- Reduce number of selected languages to 2-3
- Use only languages present in your document
- Consider processing during off-peak hours
Getting Help
If you're experiencing issues:
- Check the OCR Health page - GET /api/ocr/health
- Review your language selection - ensure languages match document content
- Try with English fallback - adds reliability to processing
- Contact support with document ID and language combination used
🔮 Advanced Features
Planned Enhancements
- Auto-language detection: Automatic suggestion of optimal language combinations
- Custom language models: Upload your own specialized language data
- Batch language updates: Change languages for multiple documents at once
- Language-specific confidence thresholds: Fine-tune accuracy requirements per language
Integration Options
The multi-language OCR system integrates with:
- Document management workflows
- Automated processing pipelines
- Third-party applications via REST API
- Webhook notifications for completion
📚 Additional Resources
- API Documentation: Complete endpoint reference
- Language Codes Reference: Full list of supported language codes
- Performance Guidelines: Optimization recommendations
- Migration Guide: Upgrading from single-language setup
Need Help? Contact support or check the system health dashboard for real-time OCR capability status.