# Office Document Support

Readur provides comprehensive support for extracting text from Microsoft Office documents, enabling full-text search and content analysis across your document library.

## Supported Formats

### Modern Office Formats (Native Support)

These formats are fully supported without any additional dependencies:

- **DOCX** - Word documents (Office 2007+)
  - Full text extraction from document body
  - Section and paragraph structure preservation
  - Header and footer content extraction

- **XLSX** - Excel spreadsheets (Office 2007+)
  - Text extraction from all worksheets
  - Cell content with proper formatting
  - Sheet names and structure preservation

### Legacy Office Formats (External Tools Required)

These older formats require external tools for text extraction:

- **DOC** - Legacy Word documents (Office 97-2003)
  - Requires `antiword`, `catdoc`, or `wvText`
  - Binary format parsing via external tools

- **XLS** - Legacy Excel spreadsheets (Office 97-2003)
  - Currently returns an error suggesting conversion to XLSX

## Installation

### Docker Installation

The official Docker image includes all necessary dependencies:

```bash
docker pull readur/readur:latest
```

The Docker image includes `antiword` and `catdoc` pre-installed for legacy DOC support.

### Manual Installation

#### For Modern Formats (DOCX, XLSX)

No additional dependencies required - these formats are parsed using built-in XML processing.

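
To see why no extra tooling is needed, it can help to look inside a DOCX file by hand. The sketch below is only an illustration and is not how Readur parses documents internally: `strip_xml_tags` is a hypothetical helper name, and the usage line assumes `unzip` is installed and that a file named `document.docx` exists. It relies on the fact that the main body of a DOCX file normally lives in `word/document.xml` inside the ZIP archive.

```bash
# Hypothetical helper (not part of Readur): read XML on stdin, replace every
# tag with a space, collapse runs of whitespace, and trim both ends.
# This is a rough approximation: it does not decode XML entities such as
# &amp;, and it will split a word whose letters span several text runs.
strip_xml_tags() {
  sed -e 's/<[^>]*>/ /g' \
    | tr -s '[:space:]' ' ' \
    | sed -e 's/^ //' -e 's/ $//'
}

# Usage (assumes unzip is available and document.docx exists):
#   unzip -p document.docx word/document.xml | strip_xml_tags
```

A real parser has to handle entities, split runs, headers, and footers, which is why this is useful only as a quick manual check that a file contains extractable text.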
#### For Legacy DOC Files

Install one of the following tools:

**Ubuntu/Debian:**

```bash
# Option 1: antiword (recommended, lightweight)
sudo apt-get install antiword

# Option 2: catdoc (good alternative)
sudo apt-get install catdoc

# Option 3: wv (includes wvText)
sudo apt-get install wv
```

**macOS:**

```bash
# Option 1: antiword
brew install antiword

# Option 2: catdoc
brew install catdoc

# Option 3: wv
brew install wv
```

**Alpine Linux:**

```bash
# Option 1: antiword
apk add antiword

# Option 2: catdoc
apk add catdoc
```

## How It Works

### Modern Office Format Processing (DOCX/XLSX)

1. **ZIP Extraction**: Modern Office files are ZIP archives containing XML files
2. **XML Parsing**: Secure XML parser extracts text content
3. **Content Assembly**: Text from different document parts is assembled
4. **Cleaning**: Excessive whitespace and formatting artifacts are removed

### Legacy DOC Processing

1. **Tool Detection**: System checks for available tools (antiword, catdoc, wvText)
2. **External Processing**: Selected tool converts DOC to plain text
3. **Security Validation**: File paths are validated to prevent injection attacks
4. **Timeout Protection**: 30-second timeout prevents hanging processes
5. **Text Cleaning**: Output is sanitized and normalized

## Configuration

### Timeout Settings

Office document extraction timeout can be configured in user settings:

- **Default**: 120 seconds
- **Range**: 1-600 seconds
- **Applies to**: DOCX and XLSX processing

### Error Handling

When processing fails, Readur reports an error message tailored to the cause:

- **Missing Tools**: Instructions for installing required tools
- **File Too Large**: Suggestions for file size reduction
- **Corrupted Files**: Guidance on file repair options
- **Unsupported Formats**: Conversion recommendations

## Security Features

### Built-in Protections

1. **ZIP Bomb Protection**: Limits decompressed size to prevent resource exhaustion
2. **Path Validation**: Prevents directory traversal and injection attacks
3. **XML Security**: Entity expansion and external entity attacks prevented
4. **Process Isolation**: External tools run with limited permissions
5. **Timeout Enforcement**: Prevents infinite processing loops

### File Size Limits

- **Maximum Office Document Size**: 50MB
- **Maximum Decompressed Size**: 500MB (ZIP bomb protection)
- **Compression Ratio Limit**: 100:1

## Performance Considerations

### Processing Speed

Typical extraction times:

- **DOCX (1-10 pages)**: 50-200ms
- **DOCX (100+ pages)**: 500-2000ms
- **XLSX (small)**: 100-300ms
- **XLSX (large)**: 1000-5000ms
- **DOC (via antiword)**: 100-500ms

### Resource Usage

- **Memory**: ~10-50MB per document during processing
- **CPU**: Single-threaded extraction, minimal impact
- **Disk**: Temporary files cleaned automatically

## Troubleshooting

### Common Issues

#### "No DOC extraction tools available"

**Solution**: Install `antiword`, `catdoc`, or `wv` as described under Manual Installation.

#### "Document processing timed out"

**Possible causes**:

- Very large or complex document
- Corrupted file structure
- System resource constraints

**Solutions**:

1. Increase timeout in settings
2. Convert to PDF format
3. Split large documents

#### "Document format not supported"

**Affected formats**: PPT, PPTX, and other Office formats

**Solution**: Convert to supported format (PDF, DOCX, TXT)

### Verification

To verify Office document support:

```bash
# Check for DOC support
which antiword || which catdoc || which wvText || echo "No DOC tools installed"

# Check that antiword runs inside the container (Docker)
docker exec readur-container antiword -v

# Test extraction (Manual)
antiword test.doc
```

## Best Practices

1. **Prefer Modern Formats**: Use DOCX over DOC when possible
2. **Convert Legacy Files**: Batch convert DOC to DOCX for better performance
3. **Monitor File Sizes**: Large Office files may need splitting
4. **Regular Updates**: Keep external tools updated for security
5. **Test Extraction**: Verify text extraction quality after setup

## Migration from DOC to DOCX

For better performance and reliability, consider converting legacy DOC files:

### Using LibreOffice (Batch Conversion)

```bash
libreoffice --headless --convert-to docx *.doc
```

### Using Microsoft Word (Windows)

A PowerShell script for batch conversion is available in `/scripts/convert-doc-to-docx.ps1`.

## API Usage

### Upload Office Document

```bash
curl -X POST http://localhost:8000/api/documents/upload \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@document.docx"
```

### Check Processing Status

```bash
curl http://localhost:8000/api/documents/{id}/status \
  -H "Authorization: Bearer YOUR_TOKEN"
```

## Future Enhancements

Planned improvements for Office document support:

- [ ] Native DOC parsing (without external tools)
- [ ] PowerPoint (PPTX/PPT) support
- [ ] Table structure preservation
- [ ] Embedded image extraction
- [ ] Style and formatting metadata
- [ ] Track changes and comments extraction

## Related Documentation

- [File Upload Guide](./file-upload-guide.md)
- [OCR Optimization Guide](./dev/OCR_OPTIMIZATION_GUIDE.md)
- [Advanced Search](./advanced-search.md)
- [Configuration Reference](./configuration-reference.md)