# Disaster Recovery & Backup Plan

This document outlines the backup and restore procedures, as well as the Disaster Recovery (DR) plan, for the Local Food AI stack.

## 1. Backup Procedures

Given the 3 GB+ size of the OpenFoodFacts dataset, backing up the entire MySQL data volume on a live system is resource-intensive. The strategy is therefore split into **Code Backup** and **Data Backup**.

### 1.1 Source Code & App Configuration

The entire application infrastructure (Dockerfiles, Python scripts, configuration) is tracked in the Git repository.

**Backup Command:** `git push origin main`

*Frequency: Performed by developers at the end of every sprint.*

### 1.2 Database (MySQL) Backup

We use `mysqldump` to create a logical backup of the user data and dietary profiles, while skipping the massive, immutable OpenFoodFacts tables (which can be re-ingested from the source CSV).

**Backup Command:**

```bash
sudo docker exec food_project-mysql-1 mysqldump -u root -proot_pass food_db \
  users user_health_profiles plate_items \
  > /backup/food_db_users_$(date +%F).sql
```

Note that the `>` redirection is evaluated by the host shell, so `/backup` must exist on the host, not inside the container.

*Frequency: Daily via a server-side cron job (`0 3 * * *`).*
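As a minimal sketch of the scheduled job, the wrapper script below is an assumption (the path `/usr/local/bin/backup_food_db.sh`, the log file, and the 14-day retention are all illustrative, not part of the current setup):

```bash
#!/bin/bash
# /usr/local/bin/backup_food_db.sh -- hypothetical wrapper for the nightly dump.
# Installed in root's crontab, so no sudo/tty is needed:
#   0 3 * * * /usr/local/bin/backup_food_db.sh >> /var/log/food_db_backup.log 2>&1
set -euo pipefail

# Dump only the user-owned tables, matching the backup command above.
docker exec food_project-mysql-1 mysqldump -u root -proot_pass food_db \
  users user_health_profiles plate_items \
  > "/backup/food_db_users_$(date +%F).sql"

# Retention policy is an assumption: prune dumps older than 14 days.
find /backup -name 'food_db_users_*.sql' -mtime +14 -delete
```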
## 2. Restore Procedures

### 2.1 Database Restore (Warm Recovery)

If the database container crashes or the volumes are corrupted:

1. Stop the application container to prevent write conflicts: `sudo docker-compose stop app`
2. Wipe and re-initialize the MySQL container (a sketch of this step follows the list).
3. Restore the user tables from the SQL dump:

   ```bash
   cat /backup/food_db_users_2026-05-12.sql | sudo docker exec -i food_project-mysql-1 mysql -u root -proot_pass food_db
   ```

4. Re-run the background ingestion script (`./data_sync.sh`) to rebuild the 3 GB OpenFoodFacts `products_core` tables.
5. Restart the application: `sudo docker-compose start app`
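A minimal sketch of step 2, assuming the Compose service is named `mysql` and the data volume is named `food_project_mysql_data` (both are assumptions; check `docker-compose ps` and `docker volume ls` for the actual names):

```bash
# Remove the corrupted MySQL container and its data volume, then recreate it.
sudo docker-compose stop mysql
sudo docker-compose rm -f mysql
sudo docker volume rm food_project_mysql_data   # assumed volume name; verify first
sudo docker-compose up -d mysql                 # re-initializes an empty food_db
```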
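After step 3, a quick sanity check confirms the user tables came back (table names are taken from the backup command in section 1.2):

```bash
# Row counts should be non-zero and match pre-incident expectations.
sudo docker exec food_project-mysql-1 mysql -u root -proot_pass \
  -e "SELECT COUNT(*) FROM food_db.users; SELECT COUNT(*) FROM food_db.user_health_profiles;"
```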
## 3. Disaster Recovery (DR) Plan

### 3.1 Recovery Objectives

- **Recovery Time Objective (RTO):** 4 hours (primarily bottlenecked by the roughly 3-hour re-ingestion of the CSV dataset if the core tables are lost).
- **Recovery Point Objective (RPO):** 24 hours (user profiles and plates are backed up nightly).

### 3.2 High Availability & Failover Strategy

If deploying in the distributed multi-hypervisor PoC environment (Hyper-V / VirtualBox / WSL):

- **Ollama Node Failure**: The `app` is engineered to catch LLM connection timeouts gracefully. If the VirtualBox Ollama node dies, the Streamlit app continues to serve standard database lookups and returns a safe fallback message for AI evaluations. A manual health probe is sketched after this list.
- **Zabbix Node Failure**: The SNMP daemons run autonomously in each container. If the Zabbix telemetry server goes offline, the containers simply drop the UDP traps without blocking the application.
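A minimal sketch of a manual health probe for the Ollama node, assuming the default Ollama port (11434) and a placeholder hostname `ollama-node`:

```bash
# Probe the Ollama HTTP API with a short timeout; the /api/tags endpoint
# lists the installed models when the daemon is healthy.
if curl -sf -m 5 http://ollama-node:11434/api/tags > /dev/null; then
  echo "Ollama node healthy"
else
  echo "Ollama node unreachable; app will serve DB lookups with AI fallback"
fi
```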