Monitoring and Observability Guide
Overview
This guide covers setting up comprehensive monitoring for Readur, including metrics collection, log aggregation, alerting, and dashboard creation.
Monitoring Stack Components
Core Components
- Metrics Collection: Prometheus + Node Exporter
- Visualization: Grafana
- Log Aggregation: Loki or ELK Stack
- Alerting: AlertManager
- Application Monitoring: Custom metrics and health checks
- Uptime Monitoring: Uptime Kuma or Pingdom
Health Monitoring
Built-in Health Endpoints
# Basic health check
curl http://localhost:8000/health
# Detailed health status
curl http://localhost:8000/health/detailed
# Response format
{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "storage": "accessible",
  "ocr_queue": 45,
  "version": "2.5.4",
  "uptime": 345600
}
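For automated checks outside the monitoring stack (cron jobs, deployment gates), a small script can poll the endpoint and fail on anything other than a healthy status. A minimal sketch, assuming the response format shown above:
# check_health.py - minimal sketch; assumes the /health/detailed format shown above
import sys
import requests

resp = requests.get("http://localhost:8000/health/detailed", timeout=5)
data = resp.json()

if resp.status_code != 200 or data.get("status") != "healthy":
    print(f"Readur unhealthy: {data}")
    sys.exit(1)

print(f"Readur healthy, OCR queue length: {data.get('ocr_queue')}")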
Custom Health Checks
# health_checks.py
import time
from typing import Any, Dict

# db, storage, celery, and redis are the application's existing database,
# storage, task-queue, and cache handles.
class HealthMonitor:
    @staticmethod
    def check_database() -> Dict[str, Any]:
        try:
            start = time.time()
            db.session.execute("SELECT 1")
            return {"status": "healthy", "response_time": time.time() - start}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_storage() -> Dict[str, Any]:
        try:
            # Check that storage is accessible
            storage.list_files(limit=1)
            return {"status": "healthy", "available_space": storage.get_free_space()}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_ocr_workers() -> Dict[str, Any]:
        active = celery.control.inspect().active()
        return {
            "status": "healthy" if active else "degraded",
            "active_workers": len(active or {}),
            "queue_length": redis.llen("ocr_queue"),
        }
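A detailed health endpoint can then aggregate these checks. The sketch below shows one way the pieces might be wired together; the Flask app object and the nested response shape are assumptions, not Readur's actual implementation:
# Hypothetical wiring of HealthMonitor into a detailed health endpoint
@app.route('/health/detailed')
def health_detailed():
    checks = {
        "database": HealthMonitor.check_database(),
        "storage": HealthMonitor.check_storage(),
        "ocr": HealthMonitor.check_ocr_workers(),
    }
    healthy = all(c["status"] == "healthy" for c in checks.values())
    status_code = 200 if healthy else 503
    return {"status": "healthy" if healthy else "degraded", "checks": checks}, status_code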
Prometheus Setup
Installation and Configuration
# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    container_name: postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://readur:password@postgres:5432/readur?sslmode=disable"
    ports:
      - "9187:9187"
    networks:
      - monitoring

  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    environment:
      REDIS_ADDR: "redis://redis:6379"
    ports:
      - "9121:9121"
    networks:
      - monitoring

networks:
  monitoring:
    external: true

volumes:
  prometheus_data:
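Note that the monitoring network is declared as external, so create it once with docker network create monitoring before bringing the stack up with docker-compose -f docker-compose.monitoring.yml up -d.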
Prometheus Configuration
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'readur-monitor'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - '/etc/prometheus/alerts/*.yml'

scrape_configs:
  - job_name: 'readur'
    static_configs:
      - targets: ['readur:8000']
    metrics_path: '/metrics'

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
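Before starting or reloading Prometheus, validate the file with promtool check config prometheus/prometheus.yml; a malformed scrape or rule entry will otherwise keep Prometheus from starting.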
Grafana Dashboards
Setup Grafana
# Add to docker-compose.monitoring.yml under services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_SERVER_ROOT_URL=https://grafana.readur.company.com
      - GF_INSTALL_PLUGINS=redis-datasource
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    networks:
      - monitoring
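Remember to declare grafana_data under the top-level volumes: block. With the provisioning directory mounted, Prometheus (and later Loki) can be registered as data sources automatically by dropping YAML definitions into grafana/provisioning/datasources/ instead of configuring them through the UI.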
Dashboard Configuration
# grafana/provisioning/dashboards/readur.json
{
  "dashboard": {
    "title": "Readur Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(readur_requests_total[5m])"
        }]
      },
      {
        "title": "Response Time",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "OCR Queue",
        "targets": [{
          "expr": "readur_ocr_queue_length"
        }]
      },
      {
        "title": "Database Connections",
        "targets": [{
          "expr": "pg_stat_database_numbackends{datname='readur'}"
        }]
      }
    ]
  }
}
Application Metrics
Custom Metrics Implementation
# metrics.py
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# Define metrics
request_count = Counter('readur_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('readur_request_duration_seconds', 'Request duration')
ocr_queue_length = Gauge('readur_ocr_queue_length', 'OCR queue length')
active_users = Gauge('readur_active_users', 'Active users in last 5 minutes')
document_count = Gauge('readur_documents_total', 'Total documents', ['status'])

# WSGI middleware to track requests
class MetricsMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        path = environ.get('PATH_INFO', '/')
        method = environ.get('REQUEST_METHOD', 'GET')
        with request_duration.time():
            request_count.labels(method=method, endpoint=path).inc()
            return self.app(environ, start_response)

# Metrics endpoint (app is the Flask application; redis and the get_* helpers
# are the application's own)
@app.route('/metrics')
def metrics():
    # Update gauges before each scrape
    ocr_queue_length.set(redis.llen('ocr_queue'))
    active_users.set(get_active_user_count())
    document_count.labels(status='processed').set(get_document_count('processed'))
    return generate_latest(), 200, {'Content-Type': 'text/plain'}
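For the middleware to record anything, it must wrap the WSGI application at startup; with Flask this is typically a one-liner (a sketch, assuming the same app object as above):
# Wrap the Flask WSGI callable so every request is counted and timed
app.wsgi_app = MetricsMiddleware(app.wsgi_app)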
Log Aggregation
Loki Setup
# Add to docker-compose.monitoring.yml under services:
  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml
      - loki_data:/loki
    command: -config.file=/etc/loki/loki-config.yml
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro  # needed for docker_sd_configs below
      - ./promtail/promtail-config.yml:/etc/promtail/promtail-config.yml
    command: -config.file=/etc/promtail/promtail-config.yml
    networks:
      - monitoring
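As with grafana_data, declare loki_data under the top-level volumes: block. Once Promtail is shipping container logs, add Loki as a Grafana data source and query Readur's logs from the Explore view with LogQL, for example (assuming the Readur container is named readur):
{container="readur"} |= "error"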
Log Configuration
# promtail/promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: readur
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          - name: label
            values: ["com.docker.compose.project=readur"]
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'logstream'
Alerting
AlertManager Configuration
# alertmanager/config.yml
global:
  smtp_from: 'alertmanager@readur.company.com'
  smtp_smarthost: 'smtp.company.com:587'
  smtp_auth_username: 'alertmanager@readur.company.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-admins'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'team-admins'

receivers:
  - name: 'team-admins'
    email_configs:
      - to: 'admin-team@company.com'
        headers:
          Subject: 'Readur Alert: {{ .GroupLabels.alertname }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
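AlertManager itself runs as one more service in docker-compose.monitoring.yml (for example the prom/alertmanager image on port 9093, with this file passed via --config.file), matching the alertmanager:9093 target configured in Prometheus above. The file can be validated with amtool check-config alertmanager/config.yml before restarting.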
Alert Rules
# prometheus/alerts/readur.yml
groups:
  - name: readur
    rules:
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.instance }}"
          description: "95th percentile response time is {{ $value }}s"

      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"
          description: "PostgreSQL database is not responding"

      - alert: HighOCRQueue
        expr: readur_ocr_queue_length > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OCR queue backlog"
          description: "OCR queue has {{ $value }} pending items"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
Performance Monitoring
APM Integration
# apm_config.py
from elasticapm import Client
from elasticapm.contrib.flask import ElasticAPM

# Configure APM
apm_client = Client({
    'SERVICE_NAME': 'readur',
    'SERVER_URL': 'http://apm-server:8200',
    'ENVIRONMENT': 'production',
    'SECRET_TOKEN': 'your-secret-token',
})

# Instrument the Flask app
apm = ElasticAPM(app, client=apm_client)
Custom Performance Metrics
# performance_metrics.py
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def track_performance(operation_name):
    start_time = time.time()
    try:
        yield
    finally:
        duration = time.time() - start_time
        metrics.record_operation_time(operation_name, duration)  # application metrics helper
        if duration > 1.0:  # Log slow operations
            logger.warning(f"Slow operation: {operation_name} took {duration:.2f}s")

# Usage
with track_performance("document_processing"):
    process_document(doc_id)
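The metrics.record_operation_time call above refers to an application-specific helper. If these timings should also be exposed to Prometheus, the same pattern can wrap a labelled Histogram alongside the metrics defined earlier; a sketch, with the metric name being an assumption:
# Hypothetical Prometheus-backed variant of track_performance
from contextlib import contextmanager
from prometheus_client import Histogram

operation_duration = Histogram(
    'readur_operation_duration_seconds', 'Operation duration', ['operation']
)

@contextmanager
def track_operation(operation_name):
    # Histogram.time() observes the duration of the wrapped block
    with operation_duration.labels(operation=operation_name).time():
        yield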
Uptime Monitoring
External Monitoring
# uptime-kuma/docker-compose.yml
version: '3.8'

services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    container_name: uptime-kuma
    volumes:
      - uptime-kuma_data:/app/data
    ports:
      - "3001:3001"
    restart: unless-stopped

volumes:
  uptime-kuma_data:
Status Page Configuration
# Public status page
server {
    listen 443 ssl;
    server_name status.readur.company.com;

    # Paths to your TLS certificate and key (placeholders - adjust for your environment)
    ssl_certificate     /etc/nginx/ssl/status.readur.company.com.crt;
    ssl_certificate_key /etc/nginx/ssl/status.readur.company.com.key;

    location / {
        proxy_pass http://localhost:3001;
        proxy_set_header Host $host;
    }
}
Dashboard Examples
Key Metrics Dashboard
-- Query for document processing stats
SELECT
    DATE(created_at) AS date,
    COUNT(*) AS documents_processed,
    AVG(processing_time) AS avg_processing_time,
    MAX(processing_time) AS max_processing_time
FROM documents
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
Real-time Monitoring
// WebSocket monitoring dashboard
const ws = new WebSocket('wss://readur.company.com/ws/metrics');

ws.onmessage = (event) => {
  const metrics = JSON.parse(event.data);
  updateDashboard({
    activeUsers: metrics.active_users,
    queueLength: metrics.queue_length,
    responseTime: metrics.response_time,
    errorRate: metrics.error_rate
  });
};
Troubleshooting Monitoring Issues
Prometheus Not Scraping
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Verify metrics endpoint
curl http://localhost:8000/metrics
# Check network connectivity
docker network inspect monitoring
Missing Metrics
# Debug metric collection
docker-compose exec readur python -c "
from prometheus_client import REGISTRY
for collector in REGISTRY._collector_to_names:
    print(collector)
"
High Memory Usage
# Check Prometheus storage usage (the prometheus_data volume is mounted at /prometheus)
docker-compose exec prometheus du -sh /prometheus
# Analyze the TSDB for high-cardinality metrics
docker-compose exec prometheus promtool tsdb analyze /prometheus
# Reduce retention by lowering --storage.tsdb.retention.time in docker-compose.monitoring.yml,
# then recreate the container
docker-compose -f docker-compose.monitoring.yml up -d prometheus
Best Practices
Monitoring Strategy
- Start Simple: Begin with basic health checks and expand over time
- Avoid Alert Fatigue: Only alert on actionable issues
- SLI/SLO Definition: Define and track service level indicators and objectives
- Dashboard Organization: Create role-specific dashboards
- Log Retention: Balance storage costs against debugging needs
- Security: Protect monitoring endpoints and dashboards
- Documentation: Document alert runbooks and response procedures
Maintenance
#!/bin/bash
# Weekly maintenance tasks

# Rotate logs
docker-compose exec readur logrotate -f /etc/logrotate.conf

# Clean up old metrics (requires Prometheus to run with --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# Backup Grafana data (dashboards and settings live in /var/lib/grafana)
docker-compose exec grafana tar czf /tmp/grafana-backup.tar.gz -C /var/lib/grafana .
docker cp grafana:/tmp/grafana-backup.tar.gz ./grafana-backup-$(date +%F).tar.gz

# Update monitoring stack
docker-compose -f docker-compose.monitoring.yml pull
docker-compose -f docker-compose.monitoring.yml up -d