mirror of https://github.com/readur/readur.git synced 2026-01-01 20:10:29 -06:00

perf3ct 2e6c1ef238 feat(docs): add docs about multiple ocr languages

2025-07-21 23:34:57 +00:00

Multi-Language OCR Guide

Readur can run OCR with several languages at once, which improves text extraction on documents that mix languages.

🌍 Overview

The multi-language OCR system allows you to:

Process documents in up to 4 languages simultaneously for best results
Set preferred languages that apply to all your document uploads
Retry failed OCR with different language combinations
Automatically optimize text extraction by using multiple language models

🚀 Getting Started

Setting Your Language Preferences

Navigate to Settings in your account
Select the OCR Languages section
Choose up to 4 preferred languages - these will be used for all new uploads
Set a primary language - this language gets processing priority
Save your preferences

Example preferred language setup:

Primary: English (eng)
Additional: Spanish (spa), French (fra)
Result: Documents processed with English priority, plus Spanish and French recognition
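
The limits described above (at most four languages, one of them primary) can be checked before preferences are saved. The following is a minimal sketch in Python, not Readur's own validation: the function name `normalize_preferences` is hypothetical, and it assumes the primary language must be one of the preferred languages and should be listed first.

```python
MAX_LANGUAGES = 4  # limit stated in this guide


def normalize_preferences(preferred, primary):
    """Return the preferred codes with the primary first, or raise ValueError.

    Hypothetical helper for illustration; not part of Readur.
    """
    # Drop duplicates while keeping the order the user chose.
    unique = list(dict.fromkeys(preferred))
    if not unique:
        raise ValueError("select at least one language")
    if len(unique) > MAX_LANGUAGES:
        raise ValueError(f"select at most {MAX_LANGUAGES} languages")
    if primary not in unique:
        raise ValueError("primary language must be one of the preferred languages")
    # Primary first, the rest in their original order.
    return [primary] + [code for code in unique if code != primary]


print(normalize_preferences(["spa", "eng", "fra"], "eng"))
# prints ['eng', 'spa', 'fra']
```

Rejecting an oversized selection early gives the user a clear message instead of a failed save.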

Language Selection During Upload

When uploading documents, you can:

Use your default preferences - no action needed
Override for specific documents:
- Click the language selector in the upload area
- Choose different languages for this upload session
- These languages will be applied to all files in the current upload

📋 Available Languages

Readur supports 67+ languages including:

Major World Languages

English (eng) - Default and most reliable
Spanish (spa) - Excellent accuracy
French (fra) - High quality results
German (deu) - Strong performance
Italian (ita) - Good accuracy
Portuguese (por) - Reliable processing
Russian (rus) - Solid results

Asian Languages

Chinese Simplified (chi_sim)
Chinese Traditional (chi_tra)
Japanese (jpn)
Korean (kor)
Hindi (hin)
Thai (tha)
Vietnamese (vie)

Other European Languages

Dutch (nld)
Swedish (swe)
Norwegian (nor)
Danish (dan)
Finnish (fin)
Polish (pol)
Czech (ces)

And Many More

Including Arabic (ara), Hebrew (heb), Turkish (tur), and dozens of other languages.

Tip: For the complete list of available languages, visit the OCR Languages page in your settings or call the API endpoint: GET /api/ocr/languages

🛠️ Using the API

Get Available Languages

curl -H "Authorization: Bearer YOUR_TOKEN" \
     https://your-readur-instance.com/api/ocr/languages

Response:

{
  "available_languages": [
    {
      "code": "eng",
      "name": "English",
      "installed": true
    },
    {
      "code": "spa",
      "name": "Spanish",
      "installed": true
    }
  ],
  "current_user_language": "eng"
}
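
A client will usually want only the languages that are actually installed. The sketch below parses a response of the shape shown above and filters on the `installed` flag. The helper name `installed_codes` is hypothetical, and the sample payload is copied from this guide rather than from a live server.

```python
import json

# Sample response copied from the example above.
SAMPLE_RESPONSE = """
{
  "available_languages": [
    {"code": "eng", "name": "English", "installed": true},
    {"code": "spa", "name": "Spanish", "installed": true}
  ],
  "current_user_language": "eng"
}
"""


def installed_codes(payload):
    """Return the codes of all languages marked as installed."""
    return [
        language["code"]
        for language in payload["available_languages"]
        if language.get("installed")
    ]


payload = json.loads(SAMPLE_RESPONSE)
print(installed_codes(payload))
# prints ['eng', 'spa']
```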

Update Language Preferences

curl -X PUT \
     -H "Authorization: Bearer YOUR_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "preferred_languages": ["eng", "spa", "fra"],
       "primary_language": "eng"
     }' \
     https://your-readur-instance.com/api/settings
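
The same request can be built from a script. This is a sketch using only Python's standard library. The function name `build_settings_request` is hypothetical, and the host and token are the placeholders from the curl example. The request is constructed but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_settings_request(base_url, token, preferred, primary):
    """Build (but do not send) the PUT request shown in the curl example."""
    body = json.dumps(
        {"preferred_languages": preferred, "primary_language": primary}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/settings",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_settings_request(
    "https://your-readur-instance.com", "YOUR_TOKEN", ["eng", "spa", "fra"], "eng"
)
print(request.get_method(), request.full_url)
# To send it against a real instance: urllib.request.urlopen(request)
```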

Retry OCR with Different Languages

curl -X POST \
     -H "Authorization: Bearer YOUR_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "languages": ["eng", "deu"]
     }' \
     https://your-readur-instance.com/api/documents/DOCUMENT_ID/ocr/retry

🎯 Best Practices

Language Selection Strategy

For Mixed-Language Documents:

Choose 2-3 languages that appear in your document
Always include English as a fallback (most reliable)
Put the dominant language first as your primary language

Examples:

Business document with English/Spanish: ["eng", "spa"]
European legal document: ["eng", "fra", "deu"]
Academic paper with multiple references: ["eng", "spa", "ita"]

Performance Optimization

Do:

✅ Limit to 2-4 languages for best performance
✅ Include English when processing mixed content
✅ Use specific language combinations for consistent document types
✅ Set realistic expectations for complex multilingual documents

Don't:

❌ Select languages not present in your documents (other than the English fallback)
❌ Use more than 4 languages simultaneously
❌ Expect perfect results with very low-quality scans
❌ Mix completely unrelated language families unnecessarily

🔄 Retrying OCR Processing

If OCR results are poor, you can retry with different languages:

Via Web Interface

Navigate to the document with poor OCR results
Click the "Retry OCR" button
Select different languages that better match your document
Start the retry process

Common Retry Scenarios

Scenario 1: Wrong Language Selected

Original: English-only processing of Spanish document
Solution: Retry with ["spa", "eng"]

Scenario 2: Mixed Language Document

Original: Single language processing
Solution: Add 2-3 relevant languages

Scenario 3: Poor Quality Scan

Original: Fast processing with limited languages
Solution: Try with primary language + English fallback
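
Scenarios 1 and 3 both come down to "document language first, English as fallback". A small sketch of building the JSON body for the retry endpoint in that shape follows. The helper name `retry_body` is hypothetical; the `languages` field matches the retry example in the API section.

```python
import json


def retry_body(document_language, fallback="eng"):
    """Return the JSON body for an OCR retry: document language, then fallback."""
    languages = [document_language]
    # Avoid listing English twice when the document itself is English.
    if fallback != document_language:
        languages.append(fallback)
    return json.dumps({"languages": languages})


print(retry_body("spa"))
# prints {"languages": ["spa", "eng"]}
```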

📊 Monitoring OCR Results

Understanding OCR Confidence

90%+ - Excellent results, high accuracy
70-89% - Good results, minor errors possible
50-69% - Moderate results, review recommended
Below 50% - Poor results, consider retry with different languages
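
These bands translate directly into a threshold check, for example to flag documents for retry. The sketch below assumes confidence is reported as a percentage between 0 and 100. The function name `confidence_band` and the band labels are hypothetical.

```python
def confidence_band(confidence):
    """Map an OCR confidence percentage to the bands listed in this guide."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if confidence >= 90:
        return "excellent"
    if confidence >= 70:
        return "good"
    if confidence >= 50:
        return "moderate"
    return "poor"  # consider a retry with different languages


for value in (95, 72.5, 50, 31):
    print(value, confidence_band(value))
```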

Language-Specific Performance

Different languages have varying accuracy rates:

Latin-script languages such as English, Spanish, and French: Highest accuracy
Other Latin-script languages such as German and Dutch: Very good accuracy
Asian languages (Chinese, Japanese): Good accuracy with proper font recognition
Arabic/Hebrew scripts: Moderate accuracy, depends on text quality

🐛 Troubleshooting

Common Issues

Problem: "Language not available" error
Solution:

Check language code spelling (e.g., eng not english)
Verify language is installed on the server
Contact administrator if language should be available
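
The checks above can be run against the language list returned by GET /api/ocr/languages. This is a sketch only: `diagnose_language` is a hypothetical helper, and the entries follow the response shape shown in the API section.

```python
def diagnose_language(code, available_languages):
    """Explain why a language code might be rejected."""
    by_code = {lang["code"]: lang for lang in available_languages}
    if code in by_code:
        if by_code[code].get("installed"):
            return "ok"
        return f"'{code}' is known but not installed on the server; contact your administrator"
    # A common mistake is sending the language name instead of its code.
    for lang in available_languages:
        if lang["name"].lower() == code.lower():
            return f"unknown code '{code}'; did you mean '{lang['code']}'?"
    return f"unknown code '{code}'; check the spelling"


languages = [
    {"code": "eng", "name": "English", "installed": True},
    {"code": "deu", "name": "German", "installed": False},
]
for candidate in ("eng", "english", "deu"):
    print(candidate, "->", diagnose_language(candidate, languages))
```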

Problem: Poor OCR results despite correct language
Solutions:

Ensure document scan quality is sufficient (300+ DPI recommended)
Try adding English as a fallback language
Consider document preprocessing (contrast, rotation correction)
Retry with fewer languages for better performance

Problem: Slow processing with multiple languages
Solutions:

Reduce number of selected languages to 2-3
Use only languages present in your document
Consider processing during off-peak hours

Getting Help

If you're experiencing issues:

Check the OCR Health page - GET /api/ocr/health
Review your language selection - ensure languages match document content
Try with English fallback - adds reliability to processing
Contact support with document ID and language combination used

🔮 Advanced Features

Planned Enhancements

Auto-language detection: Automatic suggestion of optimal language combinations
Custom language models: Upload your own specialized language data
Batch language updates: Change languages for multiple documents at once
Language-specific confidence thresholds: Fine-tune accuracy requirements per language

Integration Options

The multi-language OCR system integrates with:

Document management workflows
Automated processing pipelines
Third-party applications via REST API
Webhook notifications for completion

📚 Additional Resources

API Documentation: Complete endpoint reference
Language Codes Reference: Full list of supported language codes
Performance Guidelines: Optimization recommendations
Migration Guide: Upgrading from single-language setup

Need Help? Contact support or check the system health dashboard for real-time OCR capability status.
