Multilingual OCR Test Files
This directory contains test files for validating Readur's multi-language OCR support.
Test Files
Spanish Test Files
- spanish_test.pdf - Basic Spanish document with common words, accents, and phrases
- spanish_complex.pdf - Complex Spanish document with special characters (ñ, ü, ¿, ¡)
English Test Files
- english_test.pdf - Basic English document with common words and technical terms
- english_complex.pdf - Complex English document with contractions, hyphens, and abbreviations
Mixed Language Test Files
- mixed_language_test.pdf - Document containing both Spanish and English text sections
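A quick way to confirm that all five files are present before running the suite is sketched below. The helper name `missing_test_files` and its `directory` argument are illustrative assumptions, not part of Readur; only the file names come from the list above.

```python
from pathlib import Path

# File names taken from the test file list in this README.
EXPECTED_FILES = [
    "spanish_test.pdf",
    "spanish_complex.pdf",
    "english_test.pdf",
    "english_complex.pdf",
    "mixed_language_test.pdf",
]


def missing_test_files(directory):
    """Return the expected file names that are absent from `directory`.

    `directory` is whatever path holds the fixtures in your checkout
    (hypothetical parameter; this README does not state the path).
    """
    base = Path(directory)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]
```

An empty list means every fixture was found; any names returned are the files to regenerate.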
Expected OCR Content
Spanish Content Keywords
- español, documento, reconocimiento
- café, niño, comunicación, corazón
- también, habitación, compañía
- informática, educación, investigación
English Content Keywords
- English, document, recognition
- technology, computer, software, hardware
- testing, validation, verification, quality
Mixed Content
Both Spanish and English keywords should be recognized in the mixed language document.
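The keyword expectations above could be checked with a small helper like the following. This is a minimal sketch, not the assertion logic used by the actual test suite: the function names are hypothetical, and the matching rule (case-insensitive substring match after Unicode NFC normalization) is an assumption chosen so that accented characters compare equal whether the OCR engine emits them precomposed or as base letter plus combining mark.

```python
import unicodedata

# Keywords copied from the lists in this README.
SPANISH_KEYWORDS = [
    "español", "documento", "reconocimiento",
    "café", "niño", "comunicación", "corazón",
    "también", "habitación", "compañía",
    "informática", "educación", "investigación",
]
ENGLISH_KEYWORDS = [
    "English", "document", "recognition",
    "technology", "computer", "software", "hardware",
    "testing", "validation", "verification", "quality",
]


def normalize(text):
    """Normalize to NFC and casefold so comparisons ignore case and encoding form."""
    return unicodedata.normalize("NFC", text).casefold()


def missing_keywords(ocr_text, keywords):
    """Return the keywords that do not occur anywhere in the OCR text."""
    haystack = normalize(ocr_text)
    return [kw for kw in keywords if normalize(kw) not in haystack]
```

For the mixed language document, passing `SPANISH_KEYWORDS + ENGLISH_KEYWORDS` checks both sets in one call.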
Usage in E2E Tests
These files are used by the ocr-multiple-languages.spec.ts test suite to validate:
- Language Selection: Testing the OCR language selector component
- Document Upload: Uploading documents with specific language preferences
- OCR Processing: Validating OCR results contain expected language-specific content
- Language Persistence: Ensuring language preferences are saved across sessions
- Retry Functionality: Testing OCR retry with different languages
- Error Handling: Testing graceful fallback behavior
Test Languages
- Spanish (spa): Primary test language with accents and special characters
- English (eng): Secondary test language with technical terminology
- Auto-detect: Testing automatic language detection
File Creation
These files were created using the create_multilingual_test_pdfs.py script in the repository root.
To regenerate the test files:
python3 create_multilingual_test_pdfs.py
OCR Language Testing Workflow
- Set language preference in Settings page
- Upload test document with specific language content
- Wait for OCR processing to complete
- Validate OCR results contain expected keywords
- Test retry functionality with different languages
- Verify bulk operations work with multiple languages
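The "wait for OCR processing to complete" step is usually implemented as a poll with a deadline. The sketch below shows one way to do that, under the assumption that the caller supplies an `is_done` callable wrapping whatever status check is available; no Readur endpoint is assumed here. The 120-second default mirrors the timeout mentioned under Troubleshooting.

```python
import time


def wait_for_ocr(is_done, timeout_s=120.0, interval_s=2.0):
    """Poll `is_done()` until it returns True or `timeout_s` seconds elapse.

    `is_done` is a caller-supplied callable (hypothetical; for example a
    function that queries the document's OCR status). Returns True if
    processing finished in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Using `time.monotonic()` rather than wall-clock time keeps the deadline stable if the system clock is adjusted during a long test run.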
Expected Test Results
When OCR is configured correctly for Spanish (spa):
- Spanish documents should have high recognition accuracy for accented characters
- Phrases like "Hola mundo", "este es un documento", "en español" should be recognized
- Special characters (ñ, ü, ¿, ¡) should be preserved
When OCR is configured correctly for English (eng):
- English documents should have high recognition accuracy
- Technical terms and abbreviations should be recognized
- Phrases like "Hello world", "this is an English", "document" should be recognized
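When a Spanish document is processed with the wrong language, a common symptom is that words come back with their accents stripped ("nino" instead of "niño"). The following sketch flags that case; the helper names are hypothetical and the heuristic is an assumption, not something the test suite is known to do. It covers ñ, ü, and accented vowels only, since ¿ and ¡ have no unaccented form.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, so 'niño' becomes 'nino' and 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def degraded_keywords(ocr_text, keywords):
    """Return keywords found only in accent-stripped form in the OCR text."""
    text = unicodedata.normalize("NFC", ocr_text).casefold()
    hits = []
    for kw in keywords:
        exact = unicodedata.normalize("NFC", kw).casefold()
        plain = strip_accents(exact)
        # Skip keywords without accents; report those whose exact form is
        # absent but whose stripped form is present.
        if plain != exact and exact not in text and plain in text:
            hits.append(kw)
    return hits
```

A non-empty result suggests the document was recognized without the Spanish language data, which is worth checking before treating the failure as an accuracy problem.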
Troubleshooting
If tests fail:
- Check Tesseract Installation: Ensure the Spanish language pack is installed
    # Ubuntu/Debian
    sudo apt-get install tesseract-ocr-spa
    # macOS
    brew install tesseract-lang
- Verify Language Availability: Check that the /api/ocr/languages endpoint returns Spanish and English
- File Paths: Ensure test files exist in the correct directory structure
- OCR Processing Time: Allow sufficient timeout (120s) for OCR processing to complete
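The Tesseract installation check can also be scripted by parsing the output of `tesseract --list-langs`. The sketch below assumes `tesseract` is on the PATH of the machine running the check; the function names are illustrative, and the header-line handling is based on the usual "List of available languages ..." first line, which may differ between versions.

```python
import subprocess

# Language codes this README expects to be installed.
REQUIRED_LANGS = {"spa", "eng"}


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Drop blank lines and the header line that precedes the codes.
    return {ln for ln in lines if ln and not ln.lower().startswith("list of")}


def missing_tesseract_langs(required=REQUIRED_LANGS):
    """Run Tesseract and return the required language codes it does not list."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some builds may write the list to stderr, so both streams are inspected.
    installed = parse_list_langs(result.stdout + "\n" + result.stderr)
    return set(required) - installed
```

If `missing_tesseract_langs()` returns `{"spa"}`, installing the Spanish language pack as described above should resolve it.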