# Data Ingestion Guide

The Local Food AI relies on the OpenFoodFacts dataset. Because this dataset is massive (~24GB), a dedicated ingestion pipeline was built to work around MySQL's InnoDB row-size limit.

## The Architecture

The database uses **Grouped Vertical Partitioning**. Instead of a single monolithic table with 200+ columns, the data is split across 5 tables:

1. `products_core` (names, text, ingredients)
2. `products_allergens` (allergy data)
3. `products_macros` (fats, proteins, carbs, etc. as `DOUBLE`)
4. `products_vitamins` (vitamin traces)
5. `products_minerals` (mineral traces)

A MySQL `VIEW` named `products` joins these tables back together so the frontend can query them as if they were a single table.

## How to Ingest

1. Download the CSV using `download_csv.sh`. It will fetch `en.openfoodfacts.org.products.csv`.
2. Do **not** run the ingestion script directly in the terminal, as an SSH disconnect will kill the process.
3. Use the `nohup` wrapper:

   ```bash
   nohup bash ./start_batch_ingest.sh > remote_ingest.log 2>&1 &
   ```

4. Monitor ingestion progress by tailing the log:

   ```bash
   tail -f ingestion_process.log
   ```

## Script Internals

`ingest_csv.py` reads the CSV with `pandas` chunking (`chunksize=10000`). For every chunk, it slices the DataFrame into the 5 partitions and executes an `INSERT IGNORE` into the MySQL database. Because `INSERT IGNORE` skips rows that already exist, the script can be safely interrupted and restarted without duplicating data.
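
The skeleton below illustrates this loop. It is a minimal sketch assuming SQLAlchemy with the PyMySQL driver; the connection string, the `PARTITIONS` column mapping, and the `insert_ignore` helper are illustrative assumptions rather than the contents of the real `ingest_csv.py`.

```python
# Minimal sketch of a chunked, restart-safe ingestion loop.
# Connection details and partition column lists are assumptions.
import pandas as pd
from sqlalchemy import create_engine, text

CSV_PATH = "en.openfoodfacts.org.products.csv"
ENGINE = create_engine("mysql+pymysql://user:password@localhost/food_db")  # assumed credentials

# Hypothetical mapping of partition tables to the CSV columns they own.
PARTITIONS = {
    "products_core": ["code", "product_name", "ingredients_text"],
    "products_macros": ["code", "fat_100g", "proteins_100g", "carbohydrates_100g"],
}


def insert_ignore(table, conn, keys, data_iter):
    """pandas `to_sql` method callable that emits INSERT IGNORE.

    Duplicate primary keys are silently skipped, which is what makes the
    ingest safe to interrupt and restart.
    """
    cols = ", ".join(f"`{k}`" for k in keys)
    params = ", ".join(f":{k}" for k in keys)
    stmt = text(f"INSERT IGNORE INTO `{table.name}` ({cols}) VALUES ({params})")
    # Replace pandas NaN with None so MySQL receives NULLs.
    rows = [{k: (None if pd.isna(v) else v) for k, v in zip(keys, row)} for row in data_iter]
    conn.execute(stmt, rows)


def ingest(csv_path: str, chunksize: int = 10000) -> None:
    # The OpenFoodFacts export is tab-separated despite the .csv extension;
    # reading it in fixed-size chunks keeps memory usage flat for a ~24GB file.
    reader = pd.read_csv(csv_path, sep="\t", chunksize=chunksize)
    for chunk in reader:
        for table_name, columns in PARTITIONS.items():
            # Slice the chunk down to the columns owned by this partition.
            part = chunk[[c for c in columns if c in chunk.columns]]
            part.to_sql(table_name, ENGINE, if_exists="append", index=False,
                        method=insert_ignore)


if __name__ == "__main__":
    ingest(CSV_PATH)
```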
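
For reference, the `products` view described under *The Architecture* could be (re)created with DDL along the lines of the sketch below. The join key (`code`) and the selected columns are assumptions; the authoritative definition lives in the project's schema.

```python
# Sketch of creating the `products` view that stitches the partitions back
# together; the join key and column names are assumptions.
from sqlalchemy import create_engine, text

ENGINE = create_engine("mysql+pymysql://user:password@localhost/food_db")  # assumed credentials

CREATE_PRODUCTS_VIEW = """
CREATE OR REPLACE VIEW products AS
SELECT c.*,
       a.allergens,
       m.fat_100g, m.proteins_100g, m.carbohydrates_100g,
       v.vitamin_c_100g,
       mi.calcium_100g
FROM products_core c
LEFT JOIN products_allergens a  ON a.code  = c.code
LEFT JOIN products_macros    m  ON m.code  = c.code
LEFT JOIN products_vitamins  v  ON v.code  = c.code
LEFT JOIN products_minerals  mi ON mi.code = c.code
"""

with ENGINE.begin() as conn:
    conn.execute(text(CREATE_PRODUCTS_VIEW))
```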