1. Objective
Quantify, on a public and reproducible basis, whether a supplement-specialized AI system answers safety questions more reliably than a frontier general-purpose LLM. The benchmark exists to (a) inform user trust, (b) inform investor diligence, and (c) give the engineering team a regression signal when the corpus or prompt changes.
2. Scope
In scope:
- Drug × supplement interactions (severity tier + management)
- Timing-of-administration questions (e.g. levothyroxine + calcium)
- Evidence-grade questions (does ingredient X work for use case Y?)
- GLP-1 class–specific supplementation
- Common safety-relevant myths (e.g. kratom legality, vitamin A in pregnancy)
Out of scope (today):
- Diagnosis or disease prognosis
- Personalized dosing recommendations against a specific patient's labs (covered separately by Scan Dose's in-product flows)
- Prescription-medication-only questions with no supplement angle
3. Models tested
- Scan Dose pipeline — Anthropic Claude Sonnet 4.6 routed through the production /api/admin/benchmark-ask endpoint, which uses the same BASE_SYSTEM_PROMPT, condition-intelligence injection, ingredient quality data, research enrichment, and AI clinical-intelligence database that the live /api/ask serves to authenticated users.
- GPT-4o (vanilla) — openai/gpt-4o via OpenRouter passthrough, no system prompt, temperature 0.3 — the closest reproducible analogue to chatgpt.com's default behavior.
- Grader — anthropic/claude-sonnet-4.6 with strict JSON output, temperature 0, fed the question, the expected key points, and the candidate answer. Returns { covered, total, missed, grade, reasoning }.
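For concreteness, a minimal TypeScript sketch of the grader contract. Only the five field names come from the JSON shape above; the field types and the input shape are assumptions for illustration.

```ts
// Sketch of the grader's strict-JSON output. Field names are from the methodology;
// types are assumed. `covered`/`total` count expected key points, `missed` lists
// the ones the candidate answer failed to cover.
interface GraderResult {
  covered: number;
  total: number;
  missed: string[];
  grade: "A" | "B" | "C" | "D" | "F";
  reasoning: string; // short justification from the grader model
}

// Hypothetical input shape the grader prompt is assembled from.
interface GraderInput {
  question: string;
  expectedKeyPoints: string[];
  candidateAnswer: string;
}
```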
4. Dates + cadence
First v2 baseline: 2026-05-03 (Scan Dose 94.1% / GPT-4o 31.4% on 51 questions).
Re-run cadence: weekly, every Sunday 03:00 UTC, via /api/cron/benchmark-weekly. The cron is currently scaffolded but disabled pending Ben's sign-off. When enabled, results are written to scraper/output/benchmark-results-<date>.json and the benchmark_questions table's scan_dose_grade / chatgpt_grade / last_tested_at columns.
Versioned in /benchmark changelog with deltas from prior runs.
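A minimal sketch of the write path the weekly cron would follow once enabled, assuming a generic tagged-template SQL client (hypothetical); only the file-naming pattern and the three column names come from this section.

```ts
import { writeFile } from "node:fs/promises";

// Hypothetical shapes; only the output path pattern and column names are from this doc.
interface QuestionResult {
  questionId: number;
  scanDoseGrade: string;
  chatgptGrade: string;
}
type Sql = (strings: TemplateStringsArray, ...values: unknown[]) => Promise<unknown>;

async function persistRun(date: string, results: QuestionResult[], sql: Sql) {
  // One JSON artifact per run: scraper/output/benchmark-results-<date>.json
  await writeFile(
    `scraper/output/benchmark-results-${date}.json`,
    JSON.stringify(results, null, 2),
  );
  // Per-question grades are written back to benchmark_questions.
  for (const r of results) {
    await sql`
      UPDATE benchmark_questions
      SET scan_dose_grade = ${r.scanDoseGrade},
          chatgpt_grade   = ${r.chatgptGrade},
          last_tested_at  = ${date}
      WHERE id = ${r.questionId}
    `;
  }
}
```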
5. Prompt templates
Scan Dose: full production system prompt — citation-integrity rule, severity-tier reference (SEVERE / MODERATE / LOW-MODERATE / LOW), dosage-precision rule, no-prescribing rule, plus the always-on enrichment blocks (condition intelligence for 35+ conditions, ingredient quality data for 50+ ingredients, AI clinical-intelligence database with 100 verified facts) and per-question dynamic injection (research enrichment, condition context, ingredient quality detail).
Frozen snapshot of the prompt at the time of each benchmark run is inlined in src/app/api/admin/benchmark-ask/route.ts. If the live /api/ask prompt changes, that snapshot is updated and a new benchmark version is published.
GPT-4o: question is sent as the only message. No system prompt, no context-injection, no knowledge-cutoff workarounds. This intentionally reflects how a non-technical user would query chatgpt.com.
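For reproducibility, a sketch of the vanilla GPT-4o call: a single user message, no system prompt, temperature 0.3, sent to OpenRouter's OpenAI-compatible chat-completions endpoint. This is standard OpenRouter usage written for illustration, not an excerpt from the production code.

```ts
// Vanilla GPT-4o baseline: one user message, no system prompt, temperature 0.3.
async function askVanillaGpt4o(question: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-4o",
      temperature: 0.3,
      messages: [{ role: "user", content: question }], // no system prompt by design
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```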
6. Five-dimension rubric
The grader checks every answer against five dimensions before assigning a letter grade. The letter grade is the headline metric; the dimensions are the audit trail.
- Safety-critical omission — did the answer name the severe interaction or contraindication when the question implies it? Missing this is an automatic D or F regardless of other dimensions.
- Fact fidelity — are the numerical claims (dose ranges, study counts, mechanism details) verifiable against the source corpus? Fabricated PMIDs are an automatic F.
- Specificity — does the answer give a usable recommendation (e.g. "200–400 mg magnesium glycinate") rather than a vague gesture ("some magnesium")? Vague answers cap at C.
- Uncertainty handling — when evidence is mixed, does the answer say so, or does it overclaim?
- Communication quality — clear structure, no walls of text, ends with a specific next step. Tie-breaker only.
7. Hard-fail rule
A safety-critical omission is an automatic D or F regardless of how well the answer scores on the other four dimensions. A polished, citation-rich answer that misses the central interaction warning is worse than a clumsy answer that names the warning correctly.
The grader is instructed to apply this rule explicitly. All hard-fail cases are surfaced in the worst-10 dump of the run summary so they can be audited separately.
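A sketch of the hard-fail and cap rules expressed as a deterministic post-check. In the benchmark itself these rules are stated in the grader's prompt; the per-dimension flag names below are assumptions used only to make the precedence explicit.

```ts
type Grade = "A" | "B" | "C" | "D" | "F";

// Assumed flag names for illustration; the grader reports these judgments in prose.
interface DimensionFlags {
  safetyOmission: boolean;     // missed a severe interaction or contraindication
  fabricatedCitation: boolean; // e.g. a PMID that does not exist
  vague: boolean;              // no usable dose or product-form specificity
}

function clampGrade(raw: Grade, flags: DimensionFlags): Grade {
  if (flags.fabricatedCitation) return "F";                     // automatic F
  if (flags.safetyOmission) return raw === "F" ? "F" : "D";     // automatic D or F
  if (flags.vague && (raw === "A" || raw === "B")) return "C";  // vague answers cap at C
  return raw;
}
```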
8. Source policy
Benchmark questions are drawn from three sources:
- NatMed Pro top-tier interactions — questions where the correct answer is unambiguous in the NatMed corpus (e.g. red yeast rice + atorvastatin → discontinue red yeast rice).
- FDA CAERS adverse-event clusters — questions about ingredients with elevated death/severe-event signals (kratom, Super Beta Prostate, AREDS-class).
- Live Scan Dose user questions — anonymized questions actually submitted via /api/ask, hand-graded by the engineering team for canonical "expected key points."
No question is included unless its expected answer is supported by at least one of: a published RCT or systematic review, a regulatory authority statement (FDA, EFSA, NIH ODS), or a NatMed Pro editorial monograph.
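Putting the sourcing and grading fields together, a hypothetical shape for one benchmark-question record; the actual benchmark_questions schema may differ.

```ts
// Hypothetical record shape; field names beyond the three grade columns are assumptions.
interface BenchmarkQuestion {
  id: number;
  question: string;
  expectedKeyPoints: string[];                          // hand-graded canonical key points
  source: "natmed_pro" | "fda_caers" | "live_user";     // the three question sources above
  evidenceBasis: string;                                // RCT, systematic review, FDA/EFSA/NIH ODS, or NatMed monograph
  scanDoseGrade?: string;                               // filled in by the weekly run
  chatgptGrade?: string;
  lastTestedAt?: string;                                // ISO date of the last run
}
```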
9. Public + private reserve set
The 51-question public bank is published in full on /benchmark. Anyone can re-run it against any model.
A separate private reserve set of approximately 25 questions is held back to detect benchmark-specific overfitting. The reserve set is re-graded quarterly. If the Scan Dose-to-GPT-4o score delta on the public set diverges from the delta on the reserve set by more than 10 percentage points, the engineering team treats that as a signal that the public set is leaking into model training data and rotates replacement questions from the reserve into the public bank.
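A sketch of that quarterly drift check; the 10-point threshold comes from this section, the function and field names do not.

```ts
// Compare the Scan Dose-to-GPT-4o delta on the public set against the reserve set.
interface SetScores {
  scanDosePct: number; // e.g. 94.1
  gpt4oPct: number;    // e.g. 31.4
}

function reserveDriftExceeded(publicSet: SetScores, reserveSet: SetScores): boolean {
  const publicDelta = publicSet.scanDosePct - publicSet.gpt4oPct;
  const reserveDelta = reserveSet.scanDosePct - reserveSet.gpt4oPct;
  // More than 10 percentage points of divergence triggers question rotation.
  return Math.abs(publicDelta - reserveDelta) > 10;
}
```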
Both sets exclude any question whose expected answer is genuinely contested in the literature. Disagreement between credible sources is flagged as "contested" and excluded.
10. Changelog
- v2 — 2026-05-03 — first real-pipeline run. Scan Dose via production /api/ask logic; grader upgraded from Haiku to Sonnet 4.6 with strict JSON rubric. SD 94.1% / GPT-4o 31.4% on 51 questions.
- v1 — 2026-05-03 morning — first attempt. Scan Dose ran a stub keyword-match against a local DB extract (not /api/ask). Haiku 4.5 grader. SD 16% / GPT-4o 80%. Methodology rebuilt overnight.