# OCR Queue System Improvements

This document describes the major improvements made to handle large-scale OCR processing of 100k+ files.

## Key Improvements

### 1. Database-Backed Queue System
- Replaced direct processing with a persistent queue table (job claiming is sketched below)
- Added retry mechanisms and failure tracking
- Implemented priority-based processing
- Added recovery for crashed workers
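Concretely, each worker pulls one row at a time from the queue table (defined under Database Schema below). A minimal sketch of a claim query, assuming sqlx and Postgres; `claim_next_job` is an illustrative name, and `FOR UPDATE SKIP LOCKED` is one standard way to let many workers poll without double-claiming:

```rust
use sqlx::PgPool;
use uuid::Uuid;

// Atomically claim the highest-priority pending job.
// SKIP LOCKED means concurrent workers never block or grab the same row.
async fn claim_next_job(pool: &PgPool, worker_id: &str) -> Result<Option<Uuid>, sqlx::Error> {
    let row: Option<(Uuid,)> = sqlx::query_as(
        "UPDATE ocr_queue
         SET status = 'processing', started_at = NOW(),
             worker_id = $1, attempts = attempts + 1
         WHERE id = (
             SELECT id FROM ocr_queue
             WHERE status = 'pending' AND attempts < max_attempts
             ORDER BY priority DESC, created_at
             FOR UPDATE SKIP LOCKED
             LIMIT 1
         )
         RETURNING id",
    )
    .bind(worker_id)
    .fetch_optional(pool)
    .await?;
    Ok(row.map(|(id,)| id))
}
```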
### 2. Worker Pool Architecture

- Dedicated OCR worker processes with concurrency control (sketched below)
- Configurable number of concurrent jobs
- Graceful shutdown and error handling
- Automatic stale job recovery
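A minimal sketch of the concurrency control, assuming tokio; the permit pattern caps in-flight OCR jobs at the configured limit, and the job body itself is elided:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Run at most `max_jobs` OCR tasks at once; acquiring a permit
// blocks the dispatch loop whenever all slots are busy.
async fn run_worker_pool(max_jobs: usize) {
    let slots = Arc::new(Semaphore::new(max_jobs));
    loop {
        let permit = slots.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            // claim a job, run OCR, record the result (elided)
            drop(permit); // free the slot when this job finishes
        });
    }
}
```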
### 3. Batch Processing Support

- Dedicated CLI tool for bulk ingestion
- Processes files in configurable batches (default: 1000)
- Concurrent file I/O with semaphore limiting (sketched below)
- Progress monitoring and statistics
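The I/O limiting could look like this sketch, which uses the futures crate's `buffer_unordered` in place of an explicit semaphore to keep at most `max_io` reads in flight; the enqueue step is elided:

```rust
use futures::stream::{self, StreamExt};
use std::path::PathBuf;

// Read files concurrently, but never more than `max_io` at a time.
async fn ingest_batch(paths: Vec<PathBuf>, max_io: usize) {
    stream::iter(paths)
        .map(|path| async move {
            let bytes = tokio::fs::read(&path).await?;
            Ok::<_, std::io::Error>((path, bytes.len()))
        })
        .buffer_unordered(max_io)
        .for_each(|res| async move {
            if let Ok((path, len)) = res {
                // insert an ocr_queue row for this file (elided)
                println!("queued {} ({} bytes)", path.display(), len);
            }
        })
        .await;
}
```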
### 4. Priority-Based Processing

Priority levels are assigned by file size (see the mapping sketch after the list):

- Priority 10: ≤ 1MB files (highest)
- Priority 8: 1-5MB files
- Priority 6: 5-10MB files
- Priority 4: 10-50MB files
- Priority 2: > 50MB files (lowest)
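Expressed as code, the mapping is a single match on file size; a minimal sketch mirroring the table above (the function name is illustrative):

```rust
// Smaller files get higher priority so quick wins
// don't queue behind large scans.
fn priority_for(size_bytes: u64) -> i32 {
    const MB: u64 = 1024 * 1024;
    match size_bytes {
        s if s <= MB => 10,
        s if s <= 5 * MB => 8,
        s if s <= 10 * MB => 6,
        s if s <= 50 * MB => 4,
        _ => 2,
    }
}
```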
### 5. Monitoring & Observability
- Real-time queue statistics API
- Progress tracking and ETAs
- Failed job requeuing
- Automatic cleanup of old completed jobs
## Database Schema

### OCR Queue Table

```sql
CREATE TABLE ocr_queue (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
    status VARCHAR(20) DEFAULT 'pending',
    priority INT DEFAULT 5,
    attempts INT DEFAULT 0,
    max_attempts INT DEFAULT 3,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    error_message TEXT,
    worker_id VARCHAR(100),
    processing_time_ms INT,
    file_size BIGINT
);
```
### Document Status Tracking

- `ocr_status`: Current OCR processing status
- `ocr_error`: Error message if OCR failed
- `ocr_completed_at`: Timestamp when OCR completed
## API Endpoints

### Queue Status

```
GET /api/queue/stats
```
Returns:
```json
{
  "pending": 1500,
  "processing": 8,
  "failed": 12,
  "completed_today": 5420,
  "avg_wait_time_minutes": 3.2,
  "oldest_pending_minutes": 15.7
}
```
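On the Rust side, this payload maps directly onto a serde struct; a sketch in which the type name and field widths are assumptions:

```rust
use serde::Deserialize;

// Mirror of the JSON stats payload above.
#[derive(Debug, Deserialize)]
struct QueueStats {
    pending: i64,
    processing: i64,
    failed: i64,
    completed_today: i64,
    avg_wait_time_minutes: f64,
    oldest_pending_minutes: f64,
}
```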
### Requeue Failed Jobs

```
POST /api/queue/requeue-failed
```
Requeues all failed jobs that haven't exceeded max attempts.
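Under the hood this can be a single `UPDATE`; a sketch assuming sqlx, with `requeue_failed` as an illustrative name:

```rust
use sqlx::PgPool;

// Reset failed jobs that still have retry budget left.
async fn requeue_failed(pool: &PgPool) -> Result<u64, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE ocr_queue
         SET status = 'pending', error_message = NULL, started_at = NULL
         WHERE status = 'failed' AND attempts < max_attempts",
    )
    .execute(pool)
    .await?;
    Ok(result.rows_affected())
}
```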
## CLI Tools

### Batch Ingestion

```bash
# Ingest all files from a directory
cargo run --bin batch_ingest /path/to/files --user-id 00000000-0000-0000-0000-000000000000

# Ingest and monitor progress
cargo run --bin batch_ingest /path/to/files --user-id USER_ID --monitor
```
## Configuration

### Environment Variables

- `OCR_CONCURRENT_JOBS`: Number of concurrent OCR workers (default: 4)
- `OCR_TIMEOUT_SECONDS`: OCR processing timeout (default: 300)
- `QUEUE_BATCH_SIZE`: Batch size for processing (default: 1000)
- `MAX_CONCURRENT_IO`: Max concurrent file operations (default: 50)
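Reading these with their documented defaults is a one-liner per variable; a sketch (the helper name is illustrative):

```rust
use std::env;

// Parse an env var, falling back to its documented default.
fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    let jobs: usize = env_or("OCR_CONCURRENT_JOBS", 4);
    let timeout: u64 = env_or("OCR_TIMEOUT_SECONDS", 300);
    let batch: usize = env_or("QUEUE_BATCH_SIZE", 1000);
    let io: usize = env_or("MAX_CONCURRENT_IO", 50);
    println!("jobs={jobs} timeout={timeout}s batch={batch} io={io}");
}
```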
### User Settings

Users can configure:

- `concurrent_ocr_jobs`: Max concurrent jobs for their documents
- `ocr_timeout_seconds`: Processing timeout
- `enable_background_ocr`: Enable/disable automatic OCR
## Performance Optimizations

### 1. Memory Management

- Streaming file reads for large files (sketched below)
- Configurable memory limits per worker
- Automatic cleanup of temporary data
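A sketch of the streaming read, assuming tokio: the file is consumed in fixed-size chunks, so memory stays flat no matter how large the file is:

```rust
use tokio::fs::File;
use tokio::io::AsyncReadExt;

// Feed a large file downstream in 64 KiB chunks
// instead of one whole-file read.
async fn stream_file(path: &str) -> std::io::Result<u64> {
    let mut file = File::open(path).await?;
    let mut buf = vec![0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf).await?;
        if n == 0 {
            break; // EOF
        }
        total += n as u64; // hand &buf[..n] to the OCR pipeline here
    }
    Ok(total)
}
```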
### 2. I/O Optimization

- Batch database operations
- Connection pooling (sketched below)
- Concurrent file processing with limits
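Connection pooling with sqlx takes a few lines; the pool size here is illustrative, not readur's actual setting:

```rust
use sqlx::postgres::PgPoolOptions;

// One shared pool; workers borrow connections instead of opening their own.
async fn make_pool(database_url: &str) -> Result<sqlx::PgPool, sqlx::Error> {
    PgPoolOptions::new()
        .max_connections(20)
        .connect(database_url)
        .await
}
```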
### 3. Resource Control
- CPU priority settings
- Memory limit enforcement
- Configurable worker counts
### 4. Failure Handling

- Exponential backoff for retries (sketched below)
- Separate failed job recovery
- Automatic stale job detection
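The backoff itself can be a pure function; a minimal sketch with an assumed 5-minute cap:

```rust
use std::time::Duration;

// Exponential backoff: 2^attempt seconds, capped at 5 minutes.
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs(2u64.saturating_pow(attempt).min(300))
}
```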
## Monitoring & Maintenance

### Automatic Tasks

- **Stale Recovery**: Every 5 minutes, recover jobs stuck in processing (sketched below)
- **Cleanup**: Daily cleanup of completed jobs older than 7 days
- **Health Checks**: Worker health monitoring and restart
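A sketch of the stale-recovery loop, assuming tokio and sqlx; the 10-minute staleness threshold is an assumption, not readur's actual cutoff:

```rust
use std::time::Duration;
use tokio::time;

// Every 5 minutes, return jobs stuck in 'processing' to the queue.
async fn stale_recovery_loop(pool: sqlx::PgPool) {
    let mut tick = time::interval(Duration::from_secs(300));
    loop {
        tick.tick().await;
        let _ = sqlx::query(
            "UPDATE ocr_queue
             SET status = 'pending', worker_id = NULL
             WHERE status = 'processing'
               AND started_at < NOW() - INTERVAL '10 minutes'",
        )
        .execute(&pool)
        .await;
    }
}
```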
### Manual Operations

```sql
-- Check queue health
SELECT * FROM get_ocr_queue_stats();

-- Find problematic jobs
SELECT * FROM ocr_queue WHERE status = 'failed' ORDER BY created_at;

-- Requeue a specific job
UPDATE ocr_queue SET status = 'pending', attempts = 0 WHERE id = 'job-id';
```
## Scalability Improvements

### For 100k+ Files

- **Horizontal Scaling**: Multiple worker instances across servers
- **Database Optimization**: Partitioned queue tables by date
- **Caching**: Redis cache for frequently accessed metadata
- **Load Balancing**: Distribute workers across multiple machines

### Performance Metrics

- **Throughput**: ~500-1000 files/hour per worker (depends on file size)
- **Memory Usage**: ~100MB per worker + file size
- **Database Load**: Optimized with proper indexing and batching
## Migration Guide

### From the Old System

- Run the database migration: `migrations/001_add_ocr_queue.sql`
- Update application code to use the queue endpoints
- Monitor existing processing and let the queue drain
- Start new workers with the queue system
### Zero-Downtime Migration
- Deploy new code with feature flag disabled
- Run migration scripts
- Enable queue processing gradually
- Monitor and adjust worker counts as needed