
Data Ingestion Guide

The Local Food AI relies on the OpenFoodFacts dataset. Because this dataset is massive (~24 GB) and extremely wide, a dedicated ingestion pipeline was built to work around MySQL's InnoDB row-size limit.

The Architecture

The database is structured using Grouped Vertical Partitioning. Instead of a single monolithic table with 200+ columns, data is sliced into 5 distinct tables:

  1. products_core (Names, text, ingredients)
  2. products_allergens (Allergy data)
  3. products_macros (Fats, proteins, carbs, etc., stored as DOUBLE)
  4. products_vitamins (Vitamin traces)
  5. products_minerals (Mineral traces)

A MySQL VIEW named products joins these tables back together, so the frontend can query them as a single logical table.
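
As a rough sketch of how such a view could be defined (the product_code join key, the column names, and the connection details below are illustrative assumptions, not the actual schema), the partitions can be recombined along these lines:

    import mysql.connector

    # Hypothetical connection settings; substitute the real credentials.
    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="food")
    cur = conn.cursor()

    # Recombine the five partitions into one logical table by joining on the shared key.
    cur.execute("""
        CREATE OR REPLACE VIEW products AS
        SELECT c.product_code, c.product_name, c.ingredients_text,
               a.allergens,
               m.fat_100g, m.proteins_100g, m.carbohydrates_100g,
               v.vitamin_c_100g,
               mi.iron_100g
        FROM products_core c
        LEFT JOIN products_allergens a  ON a.product_code  = c.product_code
        LEFT JOIN products_macros    m  ON m.product_code  = c.product_code
        LEFT JOIN products_vitamins  v  ON v.product_code  = c.product_code
        LEFT JOIN products_minerals  mi ON mi.product_code = c.product_code
    """)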

How to Ingest

  1. Download the CSV using download_csv.sh. It will fetch en.openfoodfacts.org.products.csv.
  2. Do not run the ingestion script directly in the terminal, as SSH disconnects will kill the process.
  3. Use the nohup wrapper:

    nohup bash ./start_batch_ingest.sh > remote_ingest.log 2>&1 &
    
  4. You can monitor the ingestion progress by tailing the logs (a database-level sanity check is also sketched after this list):

    tail -f ingestion_process.log
    
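If you want a sanity check beyond the log file, counting rows in one of the partition tables shows how far ingestion has progressed. The table name and connection details below are assumptions for illustration:

    import mysql.connector

    # Hypothetical connection settings; substitute the real credentials.
    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="food")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM products_core")
    print(f"Rows ingested so far: {cur.fetchone()[0]:,}")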

Script Internals

The ingest_csv.py script reads the CSV with pandas chunking (chunksize=10000). For every chunk, it slices the DataFrame into the five partitions and executes INSERT IGNORE statements against the MySQL database. Because rows that already exist are silently skipped, the script can be safely interrupted and restarted without creating duplicates.
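
A minimal sketch of that loop follows. It is not the actual script: the column split, key column, separator handling, and connection details are assumptions made for illustration.

    import pandas as pd
    import mysql.connector

    CSV_PATH = "en.openfoodfacts.org.products.csv"

    # Illustrative column split; the real partitioning covers 200+ columns.
    PARTITIONS = {
        "products_core":      ["code", "product_name", "ingredients_text"],
        "products_allergens": ["code", "allergens"],
        "products_macros":    ["code", "fat_100g", "proteins_100g", "carbohydrates_100g"],
        "products_vitamins":  ["code", "vitamin_c_100g"],
        "products_minerals":  ["code", "iron_100g"],
    }

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="food")
    cur = conn.cursor()

    # Read the ~24 GB export in 10,000-row chunks so it never has to fit in memory.
    for chunk in pd.read_csv(CSV_PATH, sep="\t", chunksize=10000, low_memory=False):
        for table, cols in PARTITIONS.items():
            # Replace NaN with None so MySQL receives proper NULLs.
            part = chunk[cols].where(chunk[cols].notnull(), None)
            col_list = ", ".join(f"`{c}`" for c in cols)
            placeholders = ", ".join(["%s"] * len(cols))
            sql = f"INSERT IGNORE INTO {table} ({col_list}) VALUES ({placeholders})"
            # INSERT IGNORE silently skips rows whose key already exists,
            # which is what makes an interrupted run safe to restart.
            cur.executemany(sql, [tuple(row) for row in part.itertuples(index=False)])
        conn.commit()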