mirror of https://github.com/readur/readur.git synced 2026-01-06 14:30:28 -06:00

perf3ct 483d89132f feat(office): add documentation around using antiword/catdoc for doc functionality

2025-09-02 20:29:17 +00:00

6.5 KiB

Office Document Support

Readur extracts text from Microsoft Word and Excel documents, enabling full-text search and content analysis across your document library.

Supported Formats

Modern Office Formats (Native Support)

These formats are fully supported without any additional dependencies:

DOCX - Word documents (Office 2007+)
- Full text extraction from document body
- Section and paragraph structure preservation
- Header and footer content extraction
XLSX - Excel spreadsheets (Office 2007+)
- Text extraction from all worksheets
- Cell content with proper formatting
- Sheet names and structure preservation

Legacy Office Formats (External Tools Required)

These older formats require external tools for text extraction:

DOC - Legacy Word documents (Office 97-2003)
- Requires antiword, catdoc, or wvText
- Binary format parsing via external tools
XLS - Legacy Excel spreadsheets (Office 97-2003)
- Not currently supported: extraction returns an error suggesting conversion to XLSX

Installation

Docker Installation

The official Docker image includes all necessary dependencies:

docker pull readur/readur:latest

The Docker image includes antiword and catdoc pre-installed for legacy DOC support.

Manual Installation

For Modern Formats (DOCX, XLSX)

No additional dependencies required - these formats are parsed using built-in XML processing.

For Legacy DOC Files

Install one of the following tools:

Ubuntu/Debian:

# Option 1: antiword (recommended, lightweight)
sudo apt-get install antiword

# Option 2: catdoc (good alternative)
sudo apt-get install catdoc

# Option 3: wv (includes wvText)
sudo apt-get install wv

macOS:

# Option 1: antiword
brew install antiword

# Option 2: catdoc
brew install catdoc

# Option 3: wv
brew install wv

Alpine Linux:

# Option 1: antiword
apk add antiword

# Option 2: catdoc
apk add catdoc

How It Works

Modern Office Format Processing (DOCX/XLSX)

ZIP Extraction: Modern Office files are ZIP archives containing XML files
XML Parsing: Secure XML parser extracts text content
Content Assembly: Text from different document parts is assembled
Cleaning: Excessive whitespace and formatting artifacts are removed
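
The four steps above can be sketched in a few lines of Python. This is a minimal illustration and not Readur's implementation: the function name `extract_docx_text` is hypothetical, only the document body (`word/document.xml`) is read, and the standard-library XML parser stands in for the hardened parser the steps describe.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return the body text of a DOCX file, one line per paragraph."""
    # Step 1: a DOCX file is a ZIP archive; the body lives in word/document.xml
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")

    # Step 2: parse the XML (assumption: input is trusted in this sketch)
    root = ET.fromstring(xml_bytes)

    # Step 3: assemble text runs (<w:t>) paragraph by paragraph (<w:p>)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))

    # Step 4: collapse runs of whitespace and drop empty paragraphs
    cleaned = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    return "\n".join(p for p in cleaned if p)
```

For untrusted uploads, a parser hardened against entity expansion (for example the `defusedxml` package) would be a better fit than `xml.etree.ElementTree`.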

Legacy DOC Processing

Tool Detection: System checks for available tools (antiword, catdoc, wvText)
External Processing: Selected tool converts DOC to plain text
Security Validation: File paths are validated to prevent injection attacks
Timeout Protection: 30-second timeout prevents hanging processes
Text Cleaning: Output is sanitized and normalized
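
The flow above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not Readur's code: the function names and the preference order are hypothetical, and only antiword and catdoc are tried because both print the extracted text to standard output (wvText is omitted for brevity since its invocation differs).

```python
import shutil
import subprocess
from pathlib import Path

# Tools that print extracted text to stdout; this preference order is an assumption.
DOC_TOOLS = ("antiword", "catdoc")


def find_doc_tool(which=shutil.which):
    """Return the first DOC tool found on PATH, or None if none is installed."""
    for tool in DOC_TOOLS:
        if which(tool):
            return tool
    return None


def extract_doc_text(path, timeout_s=30.0):
    """Convert a legacy DOC file to plain text with an external tool."""
    tool = find_doc_tool()
    if tool is None:
        raise RuntimeError("No DOC extraction tools available")

    # Resolve to an absolute path and insist on an existing regular file.
    doc = Path(path).resolve(strict=True)
    if not doc.is_file():
        raise ValueError(f"not a regular file: {doc}")

    # An argument list (no shell) keeps the file name from being interpreted
    # as shell syntax; the timeout stops a tool that hangs.
    result = subprocess.run(
        [tool, str(doc)],
        capture_output=True,
        text=True,
        errors="replace",
        timeout=timeout_s,
        check=True,
    )

    # Normalize the output: trim trailing spaces and surrounding blank lines.
    lines = (line.rstrip() for line in result.stdout.splitlines())
    return "\n".join(lines).strip()
```

Passing `which` as a parameter keeps tool detection easy to test without installing anything.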

Configuration

Timeout Settings

Office document extraction timeout can be configured in user settings:

Default: 120 seconds
Range: 1-600 seconds
Applies to: DOCX and XLSX processing (legacy DOC extraction uses the separate 30-second tool timeout)
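
As a small illustration of the default and range above, the sketch below clamps a user-supplied value into the allowed window. Whether Readur clamps out-of-range values or rejects them is not stated here, so the clamping behavior and the name `effective_timeout` are assumptions.

```python
DEFAULT_TIMEOUT_S = 120
MIN_TIMEOUT_S = 1
MAX_TIMEOUT_S = 600


def effective_timeout(user_value=None):
    """Return the timeout in seconds, using the default when unset.

    Assumption: out-of-range values are clamped rather than rejected.
    """
    if user_value is None:
        return DEFAULT_TIMEOUT_S
    return max(MIN_TIMEOUT_S, min(MAX_TIMEOUT_S, int(user_value)))
```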

Error Handling

When processing fails, Readur provides helpful error messages:

Missing Tools: Instructions for installing required tools
File Too Large: Suggestions for file size reduction
Corrupted Files: Guidance on file repair options
Unsupported Formats: Conversion recommendations

Security Features

Built-in Protections

ZIP Bomb Protection: Limits decompressed size to prevent resource exhaustion
Path Validation: Prevents directory traversal and injection attacks
XML Security: Entity expansion and external entity attacks prevented
Process Isolation: External tools run with limited permissions
Timeout Enforcement: Prevents infinite processing loops
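
To make the path validation item concrete, here is one common way to reject directory traversal: resolve the requested path and confirm it still lies inside an allowed base directory. The helper name `resolve_inside` and the idea of a single base directory are assumptions for illustration, not Readur's actual checks.

```python
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base, so that case is
    # caught by the containment check below as well.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path}")
    return candidate
```

Resolving before comparing means `..` segments and symlinks are expanded first, so the check is applied to the real location.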

File Size Limits

Maximum Office Document Size: 50MB
Maximum Decompressed Size: 500MB (ZIP bomb protection)
Compression Ratio Limit: 100:1
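
The three limits above can be checked before any content is decompressed by reading the sizes recorded in the ZIP directory. The following is a sketch under assumptions: the function name is hypothetical, "MB" is taken to mean 1024 × 1024 bytes, and the ratio is computed over the whole archive rather than per entry.

```python
import io
import zipfile

MAX_FILE_BYTES = 50 * 1024 * 1024           # maximum Office document size
MAX_DECOMPRESSED_BYTES = 500 * 1024 * 1024  # maximum total decompressed size
MAX_COMPRESSION_RATIO = 100                 # decompressed : compressed


def check_office_archive(
    data,
    max_file_bytes=MAX_FILE_BYTES,
    max_decompressed_bytes=MAX_DECOMPRESSED_BYTES,
    max_ratio=MAX_COMPRESSION_RATIO,
):
    """Raise ValueError if the archive exceeds any of the size limits."""
    if len(data) > max_file_bytes:
        raise ValueError("file exceeds maximum Office document size")

    # Read only the archive directory; nothing is decompressed here.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        entries = archive.infolist()

    total_uncompressed = sum(entry.file_size for entry in entries)
    total_compressed = sum(entry.compress_size for entry in entries)

    if total_uncompressed > max_decompressed_bytes:
        raise ValueError("decompressed size exceeds limit")
    if total_compressed > 0 and total_uncompressed / total_compressed > max_ratio:
        raise ValueError("compression ratio exceeds limit")
```

The sizes in a ZIP directory are declared by whoever created the file, so a robust implementation would also count bytes while decompressing and stop once a limit is reached.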

Performance Considerations

Processing Speed

Typical extraction times:

DOCX (1-10 pages): 50-200ms
DOCX (100+ pages): 500-2000ms
XLSX (small): 100-300ms
XLSX (large): 1000-5000ms
DOC (via antiword): 100-500ms

Resource Usage

Memory: ~10-50MB per document during processing
CPU: Single-threaded extraction, minimal impact
Disk: Temporary files cleaned automatically

Troubleshooting

Common Issues

"No DOC extraction tools available"

Solution: Install antiword, catdoc, or wv as described in the Installation section.

"Document processing timed out"

Possible causes:

Very large or complex document
Corrupted file structure
System resource constraints

Solutions:

Increase timeout in settings
Convert to PDF format
Split large documents

"Document format not supported"

Affected formats: PPT, PPTX, and other Office formats

Solution: Convert to supported format (PDF, DOCX, TXT)

Verification

To verify Office document support:

# Check for DOC support
which antiword || which catdoc || echo "No DOC tools installed"

# Check tool availability (Docker)
docker exec readur-container which antiword

# Test extraction (Manual)
antiword test.doc

Best Practices

Prefer Modern Formats: Use DOCX over DOC when possible
Convert Legacy Files: Batch convert DOC to DOCX for better performance
Monitor File Sizes: Large Office files may need splitting
Regular Updates: Keep external tools updated for security
Test Extraction: Verify text extraction quality after setup

Migration from DOC to DOCX

For better performance and reliability, consider converting legacy DOC files:

Using LibreOffice (Batch Conversion)

libreoffice --headless --convert-to docx *.doc

Using Microsoft Word (Windows)

A PowerShell script for batch conversion is available at /scripts/convert-doc-to-docx.ps1.

API Usage

Upload Office Document

curl -X POST http://localhost:8000/api/documents/upload \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@document.docx"

Check Processing Status

curl http://localhost:8000/api/documents/{id}/status \
  -H "Authorization: Bearer YOUR_TOKEN"
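
The status call above lends itself to polling from a script. The sketch below uses only the Python standard library. The response format is not documented here, so the caller supplies the completion test; the helper names, the polling interval, and the attempt limit are all assumptions.

```python
import json
import time
import urllib.request


def fetch_status(base_url, document_id, token):
    """GET /api/documents/{id}/status and decode the JSON body."""
    request = urllib.request.Request(
        f"{base_url}/api/documents/{document_id}/status",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def wait_until_done(fetch, is_done, interval_s=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch() until is_done(status) is true or attempts run out."""
    for attempt in range(max_attempts):
        status = fetch()
        if is_done(status):
            return status
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError("document was not processed in time")
```

A call might look like `wait_until_done(lambda: fetch_status("http://localhost:8000", doc_id, token), lambda s: s.get("status") == "completed")`, where the `status` field and the `"completed"` value are hypothetical and should be replaced with whatever the API actually returns.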

Future Enhancements

Planned improvements for Office document support:

Native DOC parsing (without external tools)
PowerPoint (PPTX/PPT) support
Table structure preservation
Embedded image extraction
Style and formatting metadata
Track changes and comments extraction
