Office Document Support
Readur extracts text from Microsoft Office documents (DOCX, XLSX, and legacy DOC), enabling full-text search and content analysis across your document library.
Supported Formats
Modern Office Formats (Native Support)
These formats are fully supported without any additional dependencies:
- DOCX - Word documents (Office 2007+)
  - Full text extraction from document body
  - Section and paragraph structure preservation
  - Header and footer content extraction
- XLSX - Excel spreadsheets (Office 2007+)
  - Text extraction from all worksheets
  - Cell content with proper formatting
  - Sheet names and structure preservation
Legacy Office Formats (External Tools Required)
These older formats are not parsed natively. DOC requires external tools for text extraction, and XLS is not yet supported:
- DOC - Legacy Word documents (Office 97-2003)
  - Requires antiword, catdoc, or wvText
  - Binary format parsing via external tools
- XLS - Legacy Excel spreadsheets (Office 97-2003)
  - Currently returns an error suggesting conversion to XLSX
Installation
Docker Installation
The official Docker image includes all necessary dependencies:
docker pull readur/readur:latest
The Docker image includes antiword and catdoc pre-installed for legacy DOC support.
Manual Installation
For Modern Formats (DOCX, XLSX)
No additional dependencies required - these formats are parsed using built-in XML processing.
For Legacy DOC Files
Install one of the following tools:
Ubuntu/Debian:
# Option 1: antiword (recommended, lightweight)
sudo apt-get install antiword
# Option 2: catdoc (good alternative)
sudo apt-get install catdoc
# Option 3: wv (includes wvText)
sudo apt-get install wv
macOS:
# Option 1: antiword
brew install antiword
# Option 2: catdoc
brew install catdoc
# Option 3: wv
brew install wv
Alpine Linux:
# Option 1: antiword
apk add antiword
# Option 2: catdoc
apk add catdoc
How It Works
Modern Office Format Processing (DOCX/XLSX)
- ZIP Extraction: Modern Office files are ZIP archives containing XML files
- XML Parsing: Secure XML parser extracts text content
- Content Assembly: Text from different document parts is assembled
- Cleaning: Excessive whitespace and formatting artifacts are removed
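The four steps above can be sketched in a few lines of Python. This is an illustration only, not Readur's implementation: the function name `extract_docx_text` is hypothetical, it reads only the main body part (`word/document.xml`), and it relies on the standard-library XML parser without the extra hardening a production parser would need.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data):
    """Return the body text of a DOCX file, one line per paragraph.

    Hypothetical helper for illustration; headers, footers, and
    nested text boxes are not handled.
    """
    # Step 1 (ZIP extraction): a DOCX file is a ZIP archive of XML parts.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")

    # Step 2 (XML parsing): parse the main document part.
    root = ET.fromstring(xml_bytes)

    # Step 3 (content assembly): join the text runs (<w:t>) of each
    # paragraph (<w:p>) in document order.
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        paragraphs.append(text)

    # Step 4 (cleaning): collapse whitespace runs, drop empty paragraphs.
    cleaned = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    return "\n".join(p for p in cleaned if p)
```

Calling `extract_docx_text(open("report.docx", "rb").read())` on a typical document would return its paragraphs separated by newlines. XLSX files follow the same ZIP-plus-XML pattern with different part names.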
Legacy DOC Processing
- Tool Detection: System checks for available tools (antiword, catdoc, wvText)
- External Processing: Selected tool converts DOC to plain text
- Security Validation: File paths are validated to prevent injection attacks
- Timeout Protection: 30-second timeout prevents hanging processes
- Text Cleaning: Output is sanitized and normalized
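A rough Python sketch of this pipeline follows. It is not Readur's code: the helper names `find_doc_tool` and `extract_doc_text` are hypothetical, the path check shown is only one simple form of validation, and wvText is left out to keep the example short.

```python
import os
import shutil
import subprocess

# Tools that write extracted text to standard output, in order of
# preference. wvText is omitted to keep the sketch short.
DOC_TOOLS = ("antiword", "catdoc")


def find_doc_tool(which=shutil.which):
    """Return the name of the first installed DOC tool, or None."""
    for tool in DOC_TOOLS:
        if which(tool):
            return tool
    return None


def extract_doc_text(path, timeout_seconds=30):
    """Convert a DOC file to plain text with an external tool."""
    tool = find_doc_tool()
    if tool is None:
        raise RuntimeError("No DOC extraction tools available")

    # An absolute path cannot start with "-", so the tool never
    # mistakes it for a command-line option.
    safe_path = os.path.abspath(path)
    if not os.path.isfile(safe_path):
        raise FileNotFoundError(safe_path)

    # Passing an argument list (no shell) keeps the path from being
    # interpreted as shell syntax. subprocess.TimeoutExpired is raised
    # if the tool runs longer than timeout_seconds.
    result = subprocess.run(
        [tool, safe_path],
        capture_output=True,
        text=True,
        errors="replace",
        timeout=timeout_seconds,
        check=True,
    )

    # Normalize the output: strip each line and drop blank ones.
    lines = (line.strip() for line in result.stdout.splitlines())
    return "\n".join(line for line in lines if line)
```

The `which` parameter on `find_doc_tool` exists only so the detection order can be exercised without the tools installed.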
Configuration
Timeout Settings
Office document extraction timeout can be configured in user settings:
- Default: 120 seconds
- Range: 1-600 seconds
- Applies to: DOCX and XLSX processing
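As an illustration of the default and range above, the following sketch resolves a user-supplied timeout. The function name `resolve_timeout` is hypothetical, and clamping out-of-range values is an assumption; Readur may instead reject them.

```python
DEFAULT_TIMEOUT_SECONDS = 120
MIN_TIMEOUT_SECONDS = 1
MAX_TIMEOUT_SECONDS = 600


def resolve_timeout(user_value=None):
    """Return the effective timeout in seconds.

    Assumption: values outside 1-600 are clamped into range rather
    than rejected.
    """
    if user_value is None:
        return DEFAULT_TIMEOUT_SECONDS
    return max(MIN_TIMEOUT_SECONDS, min(MAX_TIMEOUT_SECONDS, int(user_value)))
```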
Error Handling
When processing fails, Readur provides helpful error messages:
- Missing Tools: Instructions for installing required tools
- File Too Large: Suggestions for file size reduction
- Corrupted Files: Guidance on file repair options
- Unsupported Formats: Conversion recommendations
Security Features
Built-in Protections
- ZIP Bomb Protection: Limits decompressed size to prevent resource exhaustion
- Path Validation: Prevents directory traversal and injection attacks
- XML Security: Entity expansion and external entity attacks prevented
- Process Isolation: External tools run with limited permissions
- Timeout Enforcement: Prevents infinite processing loops
File Size Limits
- Maximum Office Document Size: 50MB
- Maximum Decompressed Size: 500MB (ZIP bomb protection)
- Compression Ratio Limit: 100:1
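The limits above could be enforced with a pre-flight check along these lines. This is a sketch under stated assumptions, not Readur's implementation: `check_office_archive` is a hypothetical name, MB is taken to mean 1024 * 1024 bytes, and the ratio is computed over the whole archive rather than per entry.

```python
import io
import zipfile

MAX_FILE_BYTES = 50 * 1024 * 1024           # 50MB document limit
MAX_DECOMPRESSED_BYTES = 500 * 1024 * 1024  # 500MB decompressed limit
MAX_COMPRESSION_RATIO = 100                 # 100:1


def check_office_archive(data):
    """Raise ValueError if the archive exceeds any documented limit."""
    if len(data) > MAX_FILE_BYTES:
        raise ValueError("file exceeds maximum Office document size")

    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # These are the sizes declared in the ZIP directory. A hardened
        # implementation would also count bytes while decompressing,
        # because declared sizes can be forged.
        declared = sum(info.file_size for info in archive.infolist())

    if declared > MAX_DECOMPRESSED_BYTES:
        raise ValueError("decompressed size exceeds limit")
    if declared > len(data) * MAX_COMPRESSION_RATIO:
        raise ValueError("compression ratio exceeds limit")
```

Running the check before any XML parsing means an oversized or suspiciously compressible upload is rejected without being decompressed.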
Performance Considerations
Processing Speed
Typical extraction times:
- DOCX (1-10 pages): 50-200ms
- DOCX (100+ pages): 500-2000ms
- XLSX (small): 100-300ms
- XLSX (large): 1000-5000ms
- DOC (via antiword): 100-500ms
Resource Usage
- Memory: ~10-50MB per document during processing
- CPU: Single-threaded extraction, minimal impact
- Disk: Temporary files cleaned automatically
Troubleshooting
Common Issues
"No DOC extraction tools available"
Solution: Install antiword or catdoc as described in the Installation section.
"Document processing timed out"
Possible causes:
- Very large or complex document
- Corrupted file structure
- System resource constraints
Solutions:
- Increase timeout in settings
- Convert to PDF format
- Split large documents
"Document format not supported"
Affected formats: PPT, PPTX, and other Office formats
Solution: Convert to a supported format (PDF, DOCX, TXT)
Verification
To verify Office document support:
# Check for DOC support
which antiword || which catdoc || echo "No DOC tools installed"
# Test extraction (Docker)
docker exec readur-container antiword -v
# Test extraction (Manual)
antiword test.doc
Best Practices
- Prefer Modern Formats: Use DOCX over DOC when possible
- Convert Legacy Files: Batch convert DOC to DOCX for better performance
- Monitor File Sizes: Large Office files may need splitting
- Regular Updates: Keep external tools updated for security
- Test Extraction: Verify text extraction quality after setup
Migration from DOC to DOCX
For better performance and reliability, consider converting legacy DOC files:
Using LibreOffice (Batch Conversion)
libreoffice --headless --convert-to docx *.doc
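When the files live in another directory, or the converted copies should go elsewhere, the same command can be assembled from a script. The sketch below only builds the argument list; `build_convert_command` is a hypothetical helper, and `--outdir` is LibreOffice's option for choosing the output directory.

```python
from pathlib import Path


def build_convert_command(source_dir, output_dir):
    """Build a LibreOffice command converting every .doc in source_dir.

    Returns None when there is nothing to convert.
    """
    docs = sorted(Path(source_dir).glob("*.doc"))
    if not docs:
        return None
    return [
        "libreoffice",
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(output_dir),
        *(str(doc) for doc in docs),
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)`.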
Using Microsoft Word (Windows)
A PowerShell script for batch conversion is available in /scripts/convert-doc-to-docx.ps1.
API Usage
Upload Office Document
curl -X POST http://localhost:8000/api/documents/upload \
-H "Authorization: Bearer YOUR_TOKEN" \
-F "file=@document.docx"
Check Processing Status
curl http://localhost:8000/api/documents/{id}/status \
-H "Authorization: Bearer YOUR_TOKEN"
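The status call can also be made from Python using only the standard library. This is a minimal sketch: `build_status_request` is a hypothetical helper, and because the response format is not documented here, the usage below simply reads the raw body.

```python
import urllib.request


def build_status_request(base_url, document_id, token):
    """Build a GET request for a document's processing status."""
    url = f"{base_url.rstrip('/')}/api/documents/{document_id}/status"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_status(base_url, document_id, token, timeout_seconds=10):
    """Send the request and return the raw response body as bytes."""
    request = build_status_request(base_url, document_id, token)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read()
```

For example, `fetch_status("http://localhost:8000", 42, "YOUR_TOKEN")` mirrors the curl command above, with `42` standing in for the document ID.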
Future Enhancements
Planned improvements for Office document support:
- Native DOC parsing (without external tools)
- PowerPoint (PPTX/PPT) support
- Table structure preservation
- Embedded image extraction
- Style and formatting metadata
- Track changes and comments extraction