Sprint 6: Dataset & Chat History Decisions

Overview

This document records the decisions made during Sprint 6: the transition to a much larger nutritional dataset and the implementation of chat history persistence.

1. Dataset Expansion (USDA SR Legacy)

To improve the breadth and accuracy of the application's local search and RAG (Retrieval-Augmented Generation) capabilities, the system transitioned from a small, manually curated dataset (142 items) to the comprehensive USDA National Nutrient Database for Standard Reference, Legacy Release (SR Legacy).

Specifications

  • Original Source: USDA FoodData Central (SR Legacy Release, April 2018)
  • Methodology (The "MyFoodData" Approach): MyFoodData.com provides a pre-cleaned, spreadsheet-friendly version of the complex USDA database. Because downloading third-party spreadsheets directly to a Linux server is unreliable, we downloaded the raw, official USDA data and built a custom script (mega_seed_usda.py) to clean and flatten it locally. This yields the same clean, flat structure of roughly 7,700 items that MyFoodData offers, but sourced directly from the government release.
  • Total Items Ingested: 7,793 unique food items
  • Included Metrics: Calories, Protein, Total Fat, Carbohydrates, Fiber, Sugar, Sodium.
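
The flattening step can be illustrated with a minimal sketch. This is not the actual mega_seed_usda.py: the function name, the output column names, and the row layout (fdc_id, description, nutrient_id, amount, as in the FoodData Central CSV export) are assumptions, and the nutrient IDs in the mapping should be verified against the release's nutrient.csv before use.

```python
from collections import defaultdict

# ASSUMPTION: FoodData Central nutrient IDs for the seven included metrics.
# Verify each ID against nutrient.csv in the downloaded release.
NUTRIENT_COLUMNS = {
    1008: "calories",
    1003: "protein_g",
    1004: "fat_g",
    1005: "carbs_g",
    1079: "fiber_g",
    2000: "sugar_g",
    1093: "sodium_mg",
}


def flatten_foods(foods, food_nutrients):
    """Turn long-format nutrient rows into one flat record per food.

    Both arguments are iterables of dicts, e.g. from csv.DictReader.
    (Hypothetical helper, not the project's real script.)
    """
    # Collect the wanted nutrient amounts per food ID.
    amounts = defaultdict(dict)
    for row in food_nutrients:
        column = NUTRIENT_COLUMNS.get(int(row["nutrient_id"]))
        if column is not None and row["amount"] not in ("", None):
            amounts[row["fdc_id"]][column] = float(row["amount"])

    flat = []
    seen = set()
    for food in foods:
        fdc_id = food["fdc_id"]
        if fdc_id in seen:  # keep only the first row per food ID
            continue
        seen.add(fdc_id)
        record = {"fdc_id": fdc_id, "name": food["description"].strip()}
        # Missing nutrients stay None rather than being guessed as 0.
        for column in NUTRIENT_COLUMNS.values():
            record[column] = amounts[fdc_id].get(column)
        flat.append(record)
    return flat
```

Each returned record then maps onto one row of the flattened table; leaving missing values as None (NULL in SQLite) keeps "not reported" distinct from a true zero.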

Technical Implementation & Search Optimization

  • Storage: Flattened data stored in a high-performance SQLite database (localfood.db).
  • Performance: Food names are indexed using COLLATE NOCASE. The search algorithm ranks names that start with the query first, penalizes the "Baby Foods" category (which tends to crowd out general searches like "Chicken"), and sorts by name length so that base ingredients appear before obscure variants.
  • LLM Context Limit (Performance Fix): To maintain responsiveness with the local Llama 3.1 8B model running on a CPU, context injection is strictly limited. Only the top 3 most relevant items (names truncated to 100 characters, core macros only) are injected into the AI's prompt.
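
The ranking and context-limit rules above could be sketched roughly as follows. The table name (foods), its columns, the function names, and the relative order of the three sort keys are assumptions for illustration; the real schema of localfood.db may differ.

```python
import sqlite3


def create_schema(conn):
    # ASSUMPTION: hypothetical table layout, not the real localfood.db schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS foods ("
        " id INTEGER PRIMARY KEY, name TEXT NOT NULL, category TEXT,"
        " calories REAL, protein REAL, fat REAL, carbs REAL)"
    )
    # Case-insensitive index on the food name.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_foods_name"
        " ON foods (name COLLATE NOCASE)"
    )


def search_foods(conn, query, limit=3):
    """Return the top matches: prefix hits first, baby foods last, short names first."""
    # Note: % and _ in the query act as LIKE wildcards in this sketch.
    sql = """
        SELECT name, calories, protein, fat, carbs
        FROM foods
        WHERE name LIKE '%' || :q || '%'
        ORDER BY
            CASE WHEN name LIKE :q || '%' THEN 0 ELSE 1 END,
            CASE WHEN category = 'Baby Foods' THEN 1 ELSE 0 END,
            LENGTH(name)
        LIMIT :limit
    """
    return conn.execute(sql, {"q": query, "limit": limit}).fetchall()


def build_context(rows, max_name_len=100):
    """Format the matched rows as compact prompt lines (core macros only)."""
    lines = []
    for name, calories, protein, fat, carbs in rows:
        lines.append(
            f"{name[:max_name_len]}: {calories} kcal, "
            f"{protein} g protein, {fat} g fat, {carbs} g carbs"
        )
    return "\n".join(lines)
```

A call such as build_context(search_foods(conn, "chicken")) would then produce at most three short lines for injection into the prompt. SQLite's LIKE is case-insensitive for ASCII characters by default, which is why the sketch does not lowercase the query.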

2. Chat History Persistence

To enhance the user experience, chat history was moved from volatile browser memory to persistent storage in the local database.

Technical Choices

  • Database Table: A new chat_messages table was added, linked to the users table via user_id.
  • Transaction Safety: Concurrent RAG reads and history writes caused "database is locked" errors. These were resolved by wrapping all database connections in strict try...finally blocks, so that connections are always closed and no longer leak when an operation fails.
  • Context Window Limitation: The backend now strictly limits the conversational memory sent to the LLM to the last 6 messages. This substantially reduces initial generation time on the CPU while preserving the immediate conversational context.
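
The three choices above can be combined in a short sketch. Only chat_messages, users, and user_id come from this document; the remaining column names, the function names, and the explicit timeout value are assumptions for illustration.

```python
import sqlite3


def init_chat_schema(db_path):
    conn = sqlite3.connect(db_path)
    try:
        # ASSUMPTION: minimal stand-in for the existing users table.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS users ("
            " id INTEGER PRIMARY KEY, username TEXT)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS chat_messages ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " user_id INTEGER NOT NULL REFERENCES users(id),"
            " role TEXT NOT NULL,"
            " content TEXT NOT NULL,"
            " created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )
        conn.commit()
    finally:
        conn.close()  # always released, even if a statement fails


def save_message(db_path, user_id, role, content):
    # ASSUMPTION: the timeout (seconds to wait on a lock) is not from the source.
    conn = sqlite3.connect(db_path, timeout=5.0)
    try:
        conn.execute(
            "INSERT INTO chat_messages (user_id, role, content)"
            " VALUES (?, ?, ?)",
            (user_id, role, content),
        )
        conn.commit()
    finally:
        conn.close()


def recent_messages(db_path, user_id, limit=6):
    """Return the last `limit` messages for a user, oldest first."""
    conn = sqlite3.connect(db_path, timeout=5.0)
    try:
        rows = conn.execute(
            "SELECT role, content FROM chat_messages"
            " WHERE user_id = ? ORDER BY id DESC LIMIT ?",
            (user_id, limit),
        ).fetchall()
    finally:
        conn.close()
    # The query returns newest first; reverse for chronological order.
    return rows[::-1]
```

Opening a short-lived connection per operation and closing it in finally means a failed write cannot leave a connection holding a lock, which is the failure mode described under Transaction Safety.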