Inside the fdsrch allergen database: 739K products, one search
How we built the food allergen screening engine behind fdsrch.com: data sources, index design, and why the USDA database is both excellent and frustrating.
The USDA maintains a food product database called FoodData Central. If you haven’t worked with it before, the short version is: it’s enormous, reasonably accurate, and formatted in a way that will make you appreciate good API design.
Here’s how we turned it into something you can search in under a second.
The raw data
FoodData Central has several datasets. For fdsrch we use the Branded Foods dataset — these are the packaged, labeled products you actually find on shelves. As of our last update, that’s roughly 739,000 records.
Each record contains:
- Product name and brand
- GTIN (the barcode)
- Ingredient list (free text)
- Labeled nutrients
- Modified date
The ingredient list is the interesting part. It’s unstructured text, sometimes in all caps, sometimes with nested parenthetical declarations, occasionally with spelling errors. Most of the real work goes into parsing it.
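To make the all-caps and nested-parenthetical problem concrete, here is a minimal sketch of a normalization step. The function names `normalize` and `split_ingredients` are hypothetical, not taken from the fdsrch codebase, and the sketch does nothing about spelling errors. It lowercases the text and splits on commas only when they sit outside parentheses or brackets, so a sub-ingredient list stays attached to its parent ingredient.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_ingredients(text: str) -> list[str]:
    """Split an ingredient string on top-level commas only.

    Commas inside parentheses or brackets belong to a nested
    sub-ingredient declaration and are left alone.
    """
    parts, current, depth = [], [], 0
    for ch in normalize(text):
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth = max(depth - 1, 0)  # tolerate unbalanced closers
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    # Trim stray spaces and trailing periods, then drop empty pieces.
    cleaned = [p.strip(" .") for p in parts]
    return [p for p in cleaned if p]
```

Calling `split_ingredients` on a string such as "ENRICHED FLOUR (WHEAT FLOUR, NIACIN), SUGAR" yields two items rather than three, because the comma inside the parentheses is not treated as a separator.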
Parsing ingredient lists
Allergen detection is not as simple as searching for “peanut” in the ingredient string. Consider:
- “peanut oil (fully refined)” — may not trigger reactions in peanut-allergic individuals
- “may contain traces of peanut” — cross-contamination warning, not an ingredient
- “CONTAINS: WHEAT, SOY” — major-allergen declaration block (the US list grew from eight to nine when sesame was added in 2023), different format
- “modified food starch” — could be corn, wheat, or potato depending on country of origin
We handle these cases with a tiered matching system:
- Direct ingredient match — exact allergen name in the ingredient list
- CONTAINS declaration match — the standardized major-allergen block
- Cross-contamination warning — “may contain” language, flagged separately
- Ambiguous starch/protein — flagged as uncertain, user decides
The first two match types both count as confirmed presence, so a search for “wheat” returns products in three tiers: definitely contains (a direct ingredient match or a CONTAINS declaration), may contain, and uncertain. Users can filter to their comfort level.
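The tiering can be sketched roughly as follows. This is an illustration under stated assumptions, not the production matcher: the regular expressions, the `AMBIGUOUS_TERMS` table, and the `classify` function are all hypothetical, and each statement is assumed to end at a period or semicolon. The sketch also makes no attempt to treat refined oils differently from other ingredients.

```python
import re
from enum import Enum


class Tier(Enum):
    CONTAINS = "definitely contains"
    MAY_CONTAIN = "may contain"
    UNCERTAIN = "uncertain"
    NO_MATCH = "no match"


# Hypothetical patterns: a statement runs until the next period or semicolon.
MAY_CONTAIN_RE = re.compile(r"\bmay contain\b[^.;]*")
CONTAINS_RE = re.compile(r"\bcontains\b:?[^.;]*")
# Hypothetical table of ingredients whose source is ambiguous.
AMBIGUOUS_TERMS = {"wheat": ("modified food starch",)}


def classify(ingredients: str, allergen: str) -> Tier:
    text = ingredients.lower()
    allergen = allergen.lower()
    # Whole-word match, so "wheat" does not fire on "buckwheat".
    word = re.compile(rf"\b{re.escape(allergen)}s?\b")

    # Separate the three kinds of text before matching.
    warnings = " ".join(MAY_CONTAIN_RE.findall(text))
    body = MAY_CONTAIN_RE.sub(" ", text)
    declared = " ".join(CONTAINS_RE.findall(body))
    listed = CONTAINS_RE.sub(" ", body)

    # Direct ingredient match or CONTAINS declaration: confirmed presence.
    if word.search(listed) or word.search(declared):
        return Tier.CONTAINS
    # Cross-contamination warning, reported separately.
    if word.search(warnings):
        return Tier.MAY_CONTAIN
    # Ambiguous starch or protein: leave the decision to the user.
    if any(term in listed for term in AMBIGUOUS_TERMS.get(allergen, ())):
        return Tier.UNCERTAIN
    return Tier.NO_MATCH
```

Pulling the “may contain” text out before looking for direct matches is the important ordering choice here: without it, a cross-contamination warning would be reported as a confirmed ingredient.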
Search architecture
The search index is SQLite with FTS5. At this scale (739K records), SQLite handles the query load comfortably on a single small VPS. We build the index from the branded foods CSV on a weekly refresh schedule.
The full-text index covers product names and ingredient lists. Allergen screening is a secondary pass: we pre-compute flags for 14 allergens (the EU list, which is a superset of the US major allergens) and store them as indexed boolean columns.
A typical query: full-text search on product name, filter by allergen flags, return matching products sorted by brand relevance. Wall time: 40–80ms on a cold cache, under 10ms warm.
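A minimal sketch of that shape using Python's built-in `sqlite3` module, assuming the underlying SQLite build includes FTS5. The table names, column names, and the `build_index` and `search` helpers are hypothetical, only two of the 14 flag columns are shown, and ordering by FTS5's built-in `rank` stands in for whatever relevance sort the real service uses.

```python
import sqlite3

# Hypothetical mapping from allergen name to its pre-computed flag column.
ALLERGEN_COLUMNS = {"wheat": "has_wheat", "peanut": "has_peanut"}


def build_index(conn, rows):
    """Create the product table, flag indexes, and FTS5 index, then load rows."""
    conn.executescript("""
        CREATE TABLE products (
            id INTEGER PRIMARY KEY,
            name TEXT, brand TEXT, ingredients TEXT,
            has_wheat INTEGER NOT NULL DEFAULT 0,
            has_peanut INTEGER NOT NULL DEFAULT 0
        );
        CREATE INDEX idx_has_wheat ON products(has_wheat);
        CREATE INDEX idx_has_peanut ON products(has_peanut);
        -- External-content FTS5 table: text lives in products, not here.
        CREATE VIRTUAL TABLE products_fts USING fts5(
            name, ingredients, content='products', content_rowid='id'
        );
    """)
    conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?, ?)", rows)
    # Populate the full-text index from the content table.
    conn.execute("INSERT INTO products_fts(products_fts) VALUES ('rebuild')")
    conn.commit()


def search(conn, term, exclude=()):
    """Full-text search on product name, excluding flagged allergens."""
    # Quote the term as an FTS5 phrase and restrict it to the name column.
    phrase = '"' + term.replace('"', '""') + '"'
    sql = ("SELECT p.name, p.brand FROM products_fts "
           "JOIN products AS p ON p.id = products_fts.rowid "
           "WHERE products_fts MATCH ?")
    for allergen in exclude:
        # Column names come from the fixed mapping, never from user input.
        sql += f" AND p.{ALLERGEN_COLUMNS[allergen]} = 0"
    sql += " ORDER BY products_fts.rank, p.name"
    return conn.execute(sql, ("name : " + phrase,)).fetchall()
```

With this layout, a call like `search(conn, "crackers", exclude=("wheat",))` runs one FTS5 match plus a filter on pre-computed integer columns, which is why no ingredient text has to be re-parsed at query time.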
What the database doesn’t cover
The USDA Branded Foods dataset has real gaps:
- Restaurant food — not included. No McDonald’s menu items, no Chipotle bowls.
- Small local brands — the database skews toward national distribution.
- Private label — store brands are inconsistently represented.
- Supplements — spotty coverage.
- Freshness — some records are years stale. The GTIN helps cross-reference, but reformulated products are a known problem.
We’re working on supplement coverage and restaurant data as Phase 2 additions.
Open questions
The hardest problem in allergen screening isn’t the technology — it’s the liability question. We’re very clear in the UI that fdsrch is a screening tool, not medical advice. The “may contain” tier exists precisely because cross-contamination risk is real and we don’t want to give false confidence.
If you’re anaphylactic to something, you already know to read the actual label. fdsrch makes the shortlist shorter, not zero.