Harden full-database restore and document operational behaviour

Admin restore runs in a background thread; the finally block must not use current_app.logger outside an application context. Use the captured Flask app instance for safe_file_remove logging instead.

While restore_backup runs (extract through Alembic upgrade), set a per-app _database_restore_in_progress flag and expose is_database_restore_in_progress().

The client portal blueprint registers a global app_context_processor; get_current_client() now skips database access during restore and catches SQLAlchemy errors with session rollback so error pages and login can still render when the schema is briefly torn on PostgreSQL.

Documentation: add docs/admin/BACKUP_AND_RESTORE.md, link it from the admin index and import/export docs, cross-reference from DATABASE_RECOVERY.md, and extend IMPORT_EXPORT_GUIDE.md with concurrent-restore guidance.
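
The flag lifecycle described above can be sketched without Flask or a database. This is a minimal illustration, not the code in the commit: `restore_with_flag`, `fake_restore`, `failing_restore`, the `load_client` callable, and the `SimpleNamespace` stand-in for the app are all hypothetical. Only the attribute name and `is_database_restore_in_progress` come from the diff.

```python
from types import SimpleNamespace

FLAG = "_database_restore_in_progress"  # attribute name taken from the commit


def is_database_restore_in_progress(app) -> bool:
    """True while a restore is running on this app object."""
    return bool(getattr(app, FLAG, False))


def restore_with_flag(app, do_restore):
    """Hypothetical wrapper: set the flag, run the work, always clear the flag."""
    try:
        setattr(app, FLAG, True)
        return do_restore()
    finally:
        try:
            delattr(app, FLAG)
        except AttributeError:
            pass


def get_current_client(app, load_client):
    """Skip the (stand-in) database lookup while a restore is running."""
    if is_database_restore_in_progress(app):
        return None
    return load_client()


app = SimpleNamespace()  # stand-in for the Flask app instance
seen_during_restore = []


def fake_restore():
    # A request served mid-restore gets None instead of touching the database.
    seen_during_restore.append(get_current_client(app, lambda: "client-1"))
    return "restored"


def failing_restore():
    raise RuntimeError("pg_restore failed")


result = restore_with_flag(app, fake_restore)
after_restore = get_current_client(app, lambda: "client-1")

try:
    restore_with_flag(app, failing_restore)
except RuntimeError:
    pass
flag_after_failure = is_database_restore_in_progress(app)
```

Clearing the flag in `finally` means a failed restore does not leave the portal permanently skipping lookups. The flag lives on one in-memory object, so it is only visible inside the process that set it.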
2026-05-17 01:49:35 -05:00 · 2026-05-11 07:13:04 +02:00
parent 2f838adeee
commit 1ddea89d40
8 changed files with 121 additions and 11 deletions
@@ -4225,7 +4225,7 @@ def restore(filename=None):
                    "message": str(e),
                }
            finally:
-                safe_file_remove(temp_path, current_app.logger)
+                safe_file_remove(temp_path, app_obj.logger)

        # Run restore in background to keep request responsive
        t = threading.Thread(target=_do_restore, daemon=True)
@@ -22,6 +22,9 @@ from flask import (
 )
 from flask_babel import gettext as _
 from sqlalchemy import func
+from sqlalchemy.exc import SQLAlchemyError
+
+from app.utils.backup import is_database_restore_in_progress

 from app import db
 from app.models import (
@@ -136,17 +139,27 @@ def handle_internal_error(error):

 def get_current_client():
    """Get the currently logged-in client from session (either Client or User portal access)"""
-    # Check for Client portal authentication
-    client_id = session.get("client_portal_id")
-    if client_id:
-        return Client.query.get(client_id)
+    from flask import has_app_context

-    # Check for User portal authentication
-    user_id = session.get("_user_id")
-    if user_id:
-        user = User.query.get(user_id)
-        if user and user.is_client_portal_user:
-            return user.client  # Return the Client object linked to the user
+    if has_app_context() and is_database_restore_in_progress(current_app._get_current_object()):
+        return None
+
+    try:
+        client_id = session.get("client_portal_id")
+        if client_id:
+            return Client.query.get(client_id)
+
+        user_id = session.get("_user_id")
+        if user_id:
+            user = User.query.get(user_id)
+            if user and user.is_client_portal_user:
+                return user.client
+    except SQLAlchemyError:
+        try:
+            db.session.rollback()
+        except Exception:
+            pass
+        return None

    return None

@@ -12,6 +12,15 @@ from zipfile import ZIP_DEFLATED, ZipFile
 logger = logging.getLogger(__name__)


+def is_database_restore_in_progress(app) -> bool:
+    """True while ``restore_backup`` is running (extract → DB restore → migrations).
+
+    Other request greenlets in the same Gunicorn worker can still run; they should
+    avoid non-essential database reads (see client portal context processor).
+    """
+    return bool(getattr(app, "_database_restore_in_progress", False))
+
+
 def get_backup_root_dir(app) -> str:
    """Return the directory where backup archives should be stored.

@@ -245,6 +254,8 @@ def restore_backup(app, archive_path: str, progress_callback=None) -> tuple[bool
            logger.debug(f"Progress callback failed: {e}")

    try:
+        setattr(app, "_database_restore_in_progress", True)
+
        # Extract archive
        with ZipFile(archive_path, mode="r") as zf:
            zf.extractall(tmp_dir)
@@ -385,6 +396,10 @@ def restore_backup(app, archive_path: str, progress_callback=None) -> tuple[bool
        _progress("Restore completed successfully", 100)
        return True, "Restore completed successfully"
    finally:
+        try:
+            delattr(app, "_database_restore_in_progress")
+        except AttributeError:
+            pass
        try:
            shutil.rmtree(tmp_dir, ignore_errors=True)
        except Exception as e:
@@ -41,6 +41,12 @@ services:
      - TT_SKIP_DB_CLEANUP=true
 ```

+## Full archive restore (intentional rollback to a backup)
+
+If you need to **replace the live database** with a previously created **full system ZIP backup** (Admin → Backups, or `flask backup_create`), use the dedicated restore flow rather than deleting volumes alone. That path runs `pg_restore` or replaces the SQLite file, restores bundled uploads, and runs migrations to the current code version.
+
+See **[Backup and full archive restore](admin/BACKUP_AND_RESTORE.md)** for step-by-step behaviour, concurrency caveats during restore, and multi-process notes.
+
 ## Manual Recovery

 If automatic cleanup doesn't resolve the issue, you can manually reset the database:
@@ -266,6 +266,8 @@ Admins can create full database backups:
 4. Wait for restore to complete
 5. Review Import History for results (JSON restores only; ZIP restores do not create import history entries)

+**Full ZIP restore and concurrent use**: A ZIP restore uses the same pipeline as **Admin → Backups → Restore** (`pg_restore --clean` on PostgreSQL). While restore runs, the database schema may be temporarily missing or inconsistent; other browser tabs or API clients can see database errors until the job finishes. The app sets an internal “restore in progress” flag so some global template logic avoids extra queries during that window, but you should still treat restore as **maintenance** (quiet period or stop traffic) for predictable behaviour. Details: [Backup and full archive restore](admin/BACKUP_AND_RESTORE.md).
+
 #### Best Practices

 - Create backups regularly (daily or weekly)
@@ -0,0 +1,70 @@
+# Backup and full archive restore
+
+This guide describes **full system backups** (ZIP archives with PostgreSQL `pg_dump` / SQLite file, settings snapshot, and static uploads) created from **Admin → Backups** or the `flask backup_create` CLI command, and how **restore** behaves in production-like deployments (including Docker).
+
+## Creating backups
+
+- **Web UI**: Admin → Backups → Create backup (downloads a `.zip` archive).
+- **CLI** (inside the app container or venv with app context):
+
+  ```bash
+  flask backup_create
+  ```
+
+Archive layout is implemented in `app/utils/backup.py` (`create_backup`). The manifest lists database type and Alembic revision at backup time.
+
+## Restoring a backup
+
+Paths that run the same restore pipeline:
+
+- **Admin → Backups → Restore** (upload or pick an existing archive; restore may run in a **background thread** so the browser can poll progress).
+- **Import/Export → Restore** with a **ZIP** full-system archive (same `restore_backup` implementation as admin).
+- **CLI**:
+
+  ```bash
+  flask backup_restore /path/to/backup_YYYYMMDD_HHMMSS.zip
+  ```
+
+Restore steps (see `restore_backup` in `app/utils/backup.py`):
+
+1. Extract the ZIP to a temporary directory.
+2. Close and dispose SQLAlchemy connections for the **current worker** (best-effort).
+3. Replace the database: **PostgreSQL** uses `pg_restore --clean --if-exists` against the configured database; **SQLite** replaces the database file (with a timestamped safety copy when possible).
+4. Merge `uploads/` from the archive into the app static uploads tree.
+5. Run **Alembic migrations to head** (`flask db upgrade` equivalent) so the restored data matches the running application version.
+
+## Behaviour during restore (important)
+
+### Schema is replaced while the app keeps running
+
+`pg_restore --clean` drops and recreates objects. Until the restore and migrations finish, the database can be **empty or inconsistent**. Any HTTP request that hits the database may see errors (for example `relation "users" does not exist` or `current transaction is aborted`).
+
+The application sets an internal flag **`_database_restore_in_progress`** on the Flask app object for the duration of `restore_backup` (from archive extract through migrations). Code that must stay safe—such as the **client portal** template context processor—uses `is_database_restore_in_progress()` to **skip non-essential database reads** during that window.
+
+### Single worker, concurrent greenlets
+
+Typical Docker images start Gunicorn with **one worker** and an async worker class (for example Eventlet). Restore may run in a **background thread** while **other requests on the same worker** are still served. The in-progress flag reduces failures for global template injection; it does **not** guarantee that every route or API call will succeed mid-restore.
+
+**Operational recommendation**: treat restore like maintenance—have users log out or pause use, perform restore in a quiet window, or temporarily stop routing traffic to the app (for example maintenance mode at the reverse proxy) if you need a hard guarantee.
+
+### Multi-process deployments
+
+The restore flag is **per process**. If you run **multiple Gunicorn workers or multiple app containers**, only the process executing `restore_backup` sets the flag. Other processes can still serve traffic against a database being rewritten. For multi-replica setups, coordinate maintenance (scale to one replica, or stop traffic) before restore.
+
+### Admin restore thread and Flask context
+
+The admin UI starts restore in a daemon thread. Cleanup must not use `current_app` in that thread; it uses the **captured application instance** (for example `app_obj.logger`) so file cleanup and logging do not raise “working outside of application context”.
+
+## Troubleshooting
+
+| Symptom | Likely cause |
+|--------|----------------|
+| `pg_restore failed` in UI or CLI message | Wrong archive, wrong DB credentials, or PostgreSQL version mismatch. Read the stderr fragment returned with the error. |
+| Errors mentioning missing `users` during restore | Concurrent requests while schema is being dropped/recreated; reduce traffic or retry after progress shows completion. |
+| `Working outside of application context` in logs after restore | Should be resolved for admin restore cleanup; if it reappears, check any new `current_app` usage inside threads. |
+
+## Related documentation
+
+- [Database recovery and automatic cleanup](../DATABASE_RECOVERY.md) — corrupted or partial startup states (not the same as intentional full archive restore).
+- [Import/Export guide](../IMPORT_EXPORT_GUIDE.md) — JSON export/import vs full ZIP restore from this app.
+- [Docker Compose setup](configuration/DOCKER_COMPOSE_SETUP.md) — volumes and database service layout.
@@ -22,6 +22,9 @@ Complete guides for TimeTracker administrators.
 ### Monitoring
 - See [monitoring/](monitoring/) for monitoring and analytics setup

+### Backup and disaster recovery
+- **[Backup and full archive restore](BACKUP_AND_RESTORE.md)** — ZIP backups, `pg_restore` / SQLite restore, behaviour during restore, and operational notes for Docker
+
 ## 🔧 Common Tasks

 1. **Initial Setup**: Start with [Docker Compose Setup](configuration/DOCKER_COMPOSE_SETUP.md)
@@ -18,6 +18,7 @@ The TimeTracker Import/Export system enables seamless data migration, GDPR-compl
 ## Quick Links

 - **User Guide**: [IMPORT_EXPORT_GUIDE.md](../IMPORT_EXPORT_GUIDE.md)
+- **Full system ZIP backup / restore (operations)**: [BACKUP_AND_RESTORE](../admin/BACKUP_AND_RESTORE.md) — same restore pipeline as Admin → Backups; concurrency and Docker notes
 - **Implementation Summary**: [IMPORT_EXPORT_IMPLEMENTATION_SUMMARY.md](../../IMPORT_EXPORT_IMPLEMENTATION_SUMMARY.md)
 - **API Documentation**: See User Guide → API Documentation section
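
The reason the restore cleanup switches from `current_app.logger` to `app_obj.logger` can be illustrated with a small analogy. The sketch below does not use Flask: it assumes a `threading.local` stand-in (`_ctx`, `get_current_app`, `FakeApp`, `start_restore` are all hypothetical names) to mimic a context-bound lookup that is unavailable in a freshly started thread, and shows that a reference captured before the thread starts keeps working.

```python
import logging
import threading

_ctx = threading.local()  # stand-in for a context-bound "current app" lookup


def get_current_app():
    """Return the app bound to this thread, or fail like an unbound proxy would."""
    try:
        return _ctx.app
    except AttributeError:
        raise RuntimeError("working outside of application context") from None


class FakeApp:
    """Hypothetical app object that only carries a logger."""

    def __init__(self):
        self.logger = logging.getLogger("fake_app")


def start_restore(results):
    # Capture the real object while the context is still available.
    app_obj = get_current_app()

    def _do_restore():
        try:
            get_current_app()  # the worker thread has no bound context
            results["proxy"] = "ok"
        except RuntimeError as exc:
            results["proxy"] = str(exc)
        # The captured reference is a plain object and works in any thread.
        results["captured"] = app_obj.logger.name

    t = threading.Thread(target=_do_restore, daemon=True)
    t.start()
    t.join()


_ctx.app = FakeApp()  # bind the app in the "request" thread only
results = {}
start_restore(results)
```

In this sketch the worker thread records the context error for the lookup but still reaches the logger through `app_obj`, which mirrors the one-line change in the `finally` block of `_do_restore`.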