
Data Ingestion Guide

The Local Food AI relies on the OpenFoodFacts dataset. Because this dataset is massive (~24 GB) and extremely wide, a dedicated ingestion pipeline was built to work around MySQL's InnoDB row-size limit.

The Architecture

The database is structured using Grouped Vertical Partitioning. Instead of a single monolithic table with 200+ columns, data is sliced into 5 distinct tables:

  1. products_core (Names, text, ingredients)
  2. products_allergens (Allergy data)
  3. products_macros (Fats, proteins, carbs, etc., stored as DOUBLE)
  4. products_vitamins (Vitamin traces)
  5. products_minerals (Mineral traces)

A MySQL VIEW named products joins these tables back together, so the frontend can query them as a single logical table.
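
As a rough sketch of how such a view could be defined (the product_code join key, the column names, and the connection details below are illustrative assumptions, not the actual schema), the partitions can be recombined along these lines:

    import mysql.connector

    # Hypothetical connection settings; substitute the real credentials.
    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="food")
    cur = conn.cursor()

    # Recombine the five partitions into one logical table by joining on the shared key.
    cur.execute("""
        CREATE OR REPLACE VIEW products AS
        SELECT c.product_code, c.product_name, c.ingredients_text,
               a.allergens,
               m.fat_100g, m.proteins_100g, m.carbohydrates_100g,
               v.vitamin_c_100g,
               mi.iron_100g
        FROM products_core c
        LEFT JOIN products_allergens a  ON a.product_code  = c.product_code
        LEFT JOIN products_macros    m  ON m.product_code  = c.product_code
        LEFT JOIN products_vitamins  v  ON v.product_code  = c.product_code
        LEFT JOIN products_minerals  mi ON mi.product_code = c.product_code
    """)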

How to Ingest

  1. Download the CSV using download_csv.sh. It will fetch en.openfoodfacts.org.products.csv.
  2. Do not run the ingestion script directly in the terminal, as SSH disconnects will kill the process.
  3. Use the nohup wrapper:

    nohup bash ./start_batch_ingest.sh > remote_ingest.log 2>&1 &
    
  4. You can monitor the ingestion progress by tailing the logs (a database-level sanity check is also sketched after this list):

    tail -f ingestion_process.log
    
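If you want a sanity check beyond the log file, counting rows in one of the partition tables shows how far ingestion has progressed. The table name and connection details below are assumptions for illustration:

    import mysql.connector

    # Hypothetical connection settings; substitute the real credentials.
    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="food")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM products_core")
    print(f"Rows ingested so far: {cur.fetchone()[0]:,}")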

Script Internals

The ingest_csv.py script reads the CSV with pandas chunking (chunksize=10000). For every chunk, it slices the DataFrame into the five partitions and executes INSERT IGNORE statements against the MySQL database. Because rows that already exist are silently skipped, the script can be safely interrupted and restarted without creating duplicates.
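
A minimal sketch of that loop follows. It is not the actual script: the column split, key column, separator handling, and connection details are assumptions made for illustration.

    import pandas as pd
    import mysql.connector

    CSV_PATH = "en.openfoodfacts.org.products.csv"

    # Illustrative column split; the real partitioning covers 200+ columns.
    PARTITIONS = {
        "products_core":      ["code", "product_name", "ingredients_text"],
        "products_allergens": ["code", "allergens"],
        "products_macros":    ["code", "fat_100g", "proteins_100g", "carbohydrates_100g"],
        "products_vitamins":  ["code", "vitamin_c_100g"],
        "products_minerals":  ["code", "iron_100g"],
    }

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="food")
    cur = conn.cursor()

    # Read the ~24 GB export in 10,000-row chunks so it never has to fit in memory.
    for chunk in pd.read_csv(CSV_PATH, sep="\t", chunksize=10000, low_memory=False):
        for table, cols in PARTITIONS.items():
            # Replace NaN with None so MySQL receives proper NULLs.
            part = chunk[cols].where(chunk[cols].notnull(), None)
            col_list = ", ".join(f"`{c}`" for c in cols)
            placeholders = ", ".join(["%s"] * len(cols))
            sql = f"INSERT IGNORE INTO {table} ({col_list}) VALUES ({placeholders})"
            # INSERT IGNORE silently skips rows whose key already exists,
            # which is what makes an interrupted run safe to restart.
            cur.executemany(sql, [tuple(row) for row in part.itertuples(index=False)])
        conn.commit()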