Office Document Support

Readur extracts text from Microsoft Word and Excel documents, enabling full-text search and content analysis across your document library.

Supported Formats

Modern Office Formats (Native Support)

These formats are fully supported without any additional dependencies:

  • DOCX - Word documents (Office 2007+)

    • Full text extraction from document body
    • Section and paragraph structure preservation
    • Header and footer content extraction
  • XLSX - Excel spreadsheets (Office 2007+)

    • Text extraction from all worksheets
    • Cell content with proper formatting
    • Sheet names and structure preservation

Legacy Office Formats (External Tools Required)

These older formats require external tools for text extraction:

  • DOC - Legacy Word documents (Office 97-2003)

    • Requires antiword, catdoc, or wvText
    • Binary format parsing via external tools
  • XLS - Legacy Excel spreadsheets (Office 97-2003)

    • Not yet supported: extraction currently returns an error suggesting conversion to XLSX

Installation

Docker Installation

The official Docker image includes all necessary dependencies:

docker pull readur/readur:latest

The Docker image includes antiword and catdoc pre-installed for legacy DOC support.

Manual Installation

For Modern Formats (DOCX, XLSX)

No additional dependencies are required; these formats are parsed using built-in XML processing.

For Legacy DOC Files

Install one of the following tools:

Ubuntu/Debian:

# Option 1: antiword (recommended, lightweight)
sudo apt-get install antiword

# Option 2: catdoc (good alternative)
sudo apt-get install catdoc

# Option 3: wv (includes wvText)
sudo apt-get install wv

macOS:

# Option 1: antiword
brew install antiword

# Option 2: catdoc
brew install catdoc

# Option 3: wv
brew install wv

Alpine Linux:

# Option 1: antiword
apk add antiword

# Option 2: catdoc
apk add catdoc

How It Works

Modern Office Format Processing (DOCX/XLSX)

  1. ZIP Extraction: The file is opened as a ZIP archive (modern Office files are ZIP archives containing XML files)
  2. XML Parsing: A secure XML parser extracts the text content
  3. Content Assembly: Text from the different document parts is assembled
  4. Cleaning: Excessive whitespace and formatting artifacts are removed
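
The steps above can be approximated from the command line, which is useful for checking what text a DOCX file actually contains. This is a rough sketch, not Readur's implementation. It assumes `unzip`, `sed`, and `tr` are installed, the function names are hypothetical, and the regex-based tag stripping stands in for a real XML parser (XML entities such as `&amp;` are left undecoded).

```shell
# Hypothetical helpers for inspecting a DOCX by hand; not Readur's code.

# Steps 2-4: turn WordprocessingML read from stdin into plain text,
# one paragraph per line. Regex tag stripping replaces a real XML parser.
docx_xml_to_text() {
  tr '\n' ' ' |
    sed 's/<\/w:p>/\
/g' |
    sed -e 's/<[^>]*>//g' \
        -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/^ //' -e 's/ $//' \
        -e '/^$/d'
}

# Step 1: read the main document part straight out of the ZIP archive.
docx_to_text() {
  unzip -p "$1" word/document.xml | docx_xml_to_text
}
```

Running `docx_to_text report.docx` prints the body text. Headers and footers are stored in separate parts of the archive (typically `word/header1.xml` and similar), so this sketch does not cover them.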

Legacy DOC Processing

  1. Tool Detection: The system checks for available tools (antiword, catdoc, wvText)
  2. External Processing: The selected tool converts the DOC file to plain text
  3. Security Validation: File paths are validated to prevent injection attacks
  4. Timeout Protection: A 30-second timeout prevents hanging processes
  5. Text Cleaning: The output is sanitized and normalized
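
Tool detection and timeout protection can be sketched in a few lines of shell. This is an illustration under stated assumptions, not Readur's code: the function names are hypothetical, GNU coreutils `timeout` is assumed to be installed, and path validation (step 3) is left out. Only antiword and catdoc are tried because both write text to standard output, while wvText expects an output file argument.

```shell
# Hypothetical helpers mirroring the steps above; not Readur's code.

# Step 1: print the first command from the argument list found on PATH.
first_available_tool() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      return 0
    fi
  done
  return 1
}

# Steps 2 and 4: run the detected tool under a 30-second limit.
# Assumes GNU coreutils `timeout`.
extract_doc_text() {
  doc_tool=$(first_available_tool antiword catdoc) || {
    echo "No DOC extraction tools available" >&2
    return 1
  }
  timeout 30 "$doc_tool" "$1"
}
```

With these definitions, `extract_doc_text legacy.doc` prints the extracted text. GNU `timeout` exits with status 124 when the limit is hit, which a caller could map to a "timed out" error.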

Configuration

Timeout Settings

The Office document extraction timeout can be configured in user settings:

  • Default: 120 seconds
  • Range: 1-600 seconds
  • Applies to: DOCX and XLSX processing
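
To make the default and range concrete, here is a small sketch of how a supplied value could be resolved. The function name is hypothetical, and clamping out-of-range values is an assumption: Readur may reject them instead.

```shell
# Hypothetical helper: resolve a user-supplied timeout in seconds.
# Clamping is an assumption; the real setting may reject bad values.
resolve_office_timeout() {
  value=${1:-120}                      # unset or empty: documented default
  case $value in
    *[!0-9]*) echo 120; return 0 ;;    # not a whole number: use the default
  esac
  if [ "$value" -lt 1 ]; then
    echo 1                             # below the documented minimum
  elif [ "$value" -gt 600 ]; then
    echo 600                           # above the documented maximum
  else
    echo "$value"
  fi
}
```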

Error Handling

When processing fails, Readur provides helpful error messages:

  • Missing Tools: Instructions for installing required tools
  • File Too Large: Suggestions for file size reduction
  • Corrupted Files: Guidance on file repair options
  • Unsupported Formats: Conversion recommendations

Security Features

Built-in Protections

  1. ZIP Bomb Protection: Limits decompressed size to prevent resource exhaustion
  2. Path Validation: Prevents directory traversal and injection attacks
  3. XML Security: Entity expansion and external entity attacks prevented
  4. Process Isolation: External tools run with limited permissions
  5. Timeout Enforcement: Prevents infinite processing loops
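
As an illustration of path validation, the following sketch rejects paths that could be misread by an external tool. The function name and the exact rules are assumptions, not Readur's actual checks. The allowlist is deliberately conservative and also rejects spaces.

```shell
# Hypothetical check run before handing a path to an external tool.
# Returns 0 when the path looks safe, 1 otherwise.
is_safe_doc_path() {
  case $1 in
    '')                  return 1 ;;  # empty path
    -*)                  return 1 ;;  # would be parsed as a command-line option
    ..|../*|*/../*|*/..) return 1 ;;  # parent-directory traversal
    *[!A-Za-z0-9._/-]*)  return 1 ;;  # character outside the allowlist
  esac
  return 0
}
```

A caller might use it as `is_safe_doc_path "$file" && antiword "$file"`, so that a name such as `-f` or `../secret.doc` never reaches the tool.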

File Size Limits

  • Maximum Office Document Size: 50 MB
  • Maximum Decompressed Size: 500 MB (ZIP bomb protection)
  • Compression Ratio Limit: 100:1
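
The three limits combine into a simple check. The sketch below is hypothetical and makes two assumptions: 1 MB is treated as 1024 × 1024 bytes, and the ratio is computed over the whole archive rather than per entry.

```shell
# Hypothetical helper applying the three limits listed above.
# Arguments: compressed size and decompressed size, both in bytes.
check_office_limits() {
  compressed=$1
  decompressed=$2
  max_file=$((50 * 1024 * 1024))          # 50 MB upload limit
  max_decompressed=$((500 * 1024 * 1024)) # 500 MB decompressed limit
  max_ratio=100                           # 100:1 compression ratio limit

  if [ "$compressed" -le 0 ]; then
    echo "empty file"; return 1
  fi
  if [ "$compressed" -gt "$max_file" ]; then
    echo "file too large"; return 1
  fi
  if [ "$decompressed" -gt "$max_decompressed" ]; then
    echo "decompressed size too large"; return 1
  fi
  # Integer form of: decompressed / compressed > 100
  if [ "$decompressed" -gt $((compressed * max_ratio)) ]; then
    echo "compression ratio too high"; return 1
  fi
  echo "ok"
}
```

For a manual check, the compressed size can be read with `wc -c < file.docx` and the declared decompressed total appears on the last line of `unzip -l file.docx`. Declared sizes can be forged, so a robust implementation would count bytes while decompressing.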

Performance Considerations

Processing Speed

Typical extraction times:

  • DOCX (1-10 pages): 50-200 ms
  • DOCX (100+ pages): 500-2000 ms
  • XLSX (small): 100-300 ms
  • XLSX (large): 1000-5000 ms
  • DOC (via antiword): 100-500 ms

Resource Usage

  • Memory: ~10-50 MB per document during processing
  • CPU: Single-threaded extraction, minimal impact
  • Disk: Temporary files cleaned automatically

Troubleshooting

Common Issues

"No DOC extraction tools available"

Solution: Install antiword, catdoc, or wv (which provides wvText) as described under Installation.

"Document processing timed out"

Possible causes:

  • Very large or complex document
  • Corrupted file structure
  • System resource constraints

Solutions:

  1. Increase timeout in settings
  2. Convert to PDF format
  3. Split large documents

"Document format not supported"

Affected formats: PPT, PPTX, and other Office formats

Solution: Convert the file to a supported format (PDF, DOCX, TXT).

Verification

To verify Office document support:

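# Check for DOC support
which antiword || which catdoc || which wvText || echo "No DOC tools installed"

# Check that antiword is available in the container (Docker);
# replace readur-container with the name of your container
docker exec readur-container antiword -v

# Test extraction (Manual)
antiword test.doc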

Best Practices

  1. Prefer Modern Formats: Use DOCX over DOC when possible
  2. Convert Legacy Files: Batch convert DOC to DOCX for better performance
  3. Monitor File Sizes: Large Office files may need splitting
  4. Regular Updates: Keep external tools updated for security
  5. Test Extraction: Verify text extraction quality after setup

Migration from DOC to DOCX

For better performance and reliability, consider converting legacy DOC files:

Using LibreOffice (Batch Conversion)

libreoffice --headless --convert-to docx *.doc

Using Microsoft Word (Windows)

A PowerShell script for batch conversion is available at /scripts/convert-doc-to-docx.ps1.

API Usage

Upload Office Document

curl -X POST http://localhost:8000/api/documents/upload \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@document.docx"

Check Processing Status

curl http://localhost:8000/api/documents/{id}/status \
  -H "Authorization: Bearer YOUR_TOKEN"

Future Enhancements

Planned improvements for Office document support:

  • Native DOC parsing (without external tools)
  • PowerPoint (PPTX/PPT) support
  • Table structure preservation
  • Embedded image extraction
  • Style and formatting metadata
  • Track changes and comments extraction