What does the benchmark measure?
On a versioned supplement-safety bank of 51 questions: how often each tested system gives an answer that covers the published clinical key points for the question, with the correct severity tier when the question implies a safety risk.
The bank spans five categories: drug-supplement interactions, timing-of-administration, evidence-grading, GLP-1-class supplementation, and common safety-relevant myths. Every question has a hand-built list of expected key points that an answer must substantively cover to earn an A or B (passing). C and below are failing. Full rubric at /methodology.
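For readers who want to see the shape of the data, here is a minimal sketch of what one question record and the pass/fail rule might look like in code. The field names and the example key points are illustrative only; the real schema and rubric live at /methodology.

```python
# Illustrative sketch only: field names and example key points are invented,
# not the live bank's schema (see /methodology for the real rubric).
from dataclasses import dataclass

@dataclass
class BenchmarkQuestion:
    question: str
    category: str                    # one of the five categories above
    expected_key_points: list[str]   # hand-built; answers must substantively cover these

PASSING_GRADES = {"A", "B"}          # C and below fail

example = BenchmarkQuestion(
    question="Is it safe to take St. John's wort alongside an SSRI?",
    category="drug-supplement interactions",
    expected_key_points=[
        "risk of serotonin syndrome when combined with SSRIs",
        "St. John's wort induces drug-metabolizing enzymes and can lower levels of other medications",
        "advise checking with the prescriber or pharmacist before combining",
    ],
)

def is_passing(grade: str) -> bool:
    return grade in PASSING_GRADES
```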
This benchmark measures a narrow task family. It does not establish general medical accuracy.
What's the comparison set?
Two leaderboards, each with five systems answering the same 51-question bank in the same week with the same grader.
Default tier (the everyday vanilla experience): ChatGPT 5.3 (logged-in default), Gemini 3 Flash (default), Perplexity Sonar (default), Claude Sonnet 4.6 (claude.ai default), and Scan Dose's production pipeline.
Best-named-mode tier (the highest paid mode each platform exposes): ChatGPT 5.5 (thinking), Gemini 3.1 Pro, Perplexity Sonar Pro, Claude Opus 4.7, and Scan Dose (same production pipeline; we don't have a separate "pro" mode).
Scan Dose appears in both leaderboards because we ship one production pipeline; users always get the same Sonnet 4.6 + RAG + system-prompt experience whether they pay or not. Exact model identifiers are listed on each score card and in the public methodology repo.
How are answers graded?
An independent Claude Sonnet 4.6 instance reads the question, the hand-built list of expected key points, and the candidate answer. It returns strict JSON with five fields: covered, total, missed, grade (A–F), and reasoning (one sentence).
A safety-critical omission is an automatic D or F regardless of how well the answer scores on other dimensions. A polished citation-rich answer that misses the central interaction warning is worse than a clumsy answer that names the warning correctly. Full rubric + hard-fail rule at /methodology.
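As an illustration of the output contract, here is a sketch of how a downstream script might validate a grader response and sanity-check the hard-fail rule. Only the five field names come from the description above; the function name and the set of safety-critical key points passed in are assumptions, and in the published methodology the grader itself applies the hard-fail rule.

```python
# Sketch only: validates the grader's strict-JSON output and re-applies the
# hard-fail floor as a downstream sanity check. The five field names are from
# the methodology; everything else here is an assumption for illustration.
import json

REQUIRED_FIELDS = {"covered", "total", "missed", "grade", "reasoning"}

def check_grader_output(raw: str, safety_critical_points: set[str]) -> dict:
    result = json.loads(raw)                          # strict JSON, no prose around it
    if set(result) != REQUIRED_FIELDS:
        raise ValueError(f"unexpected fields: {sorted(result)}")
    if result["grade"] not in {"A", "B", "C", "D", "F"}:
        raise ValueError(f"invalid grade: {result['grade']!r}")
    # Hard-fail rule: a missed safety-critical key point caps the grade at D,
    # no matter how polished the rest of the answer is.
    if safety_critical_points & set(result["missed"]) and result["grade"] in {"A", "B", "C"}:
        result["grade"] = "D"
    return result
```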
Same-family-grader caveat: the grader is in the same model family (Sonnet 4.6) as one of the systems being tested. We mitigate this by publishing the prompt, the rubric, and the calibration set in the public repo so anyone can re-run with a different grader. Cross-grader sanity runs are part of the planned monthly cadence.
Why does the score change between runs?
Three reasons. One: we re-run the bank weekly. New ingestion (new interactions, refreshed NIH ODS fact sheets, new RCTs) shifts what each system has access to.
Two: at temperature 0 the grader is mostly deterministic but not perfectly so. The pass-rate noise floor is roughly ±2 percentage points per run. Individual letter grades drift more (A↔B within the pass tier), but the overall pass rate is reproducible; a sketch of how such a floor can be estimated follows below.
Three: model providers ship updates. ChatGPT default may be 5.3 today and 5.4 next month. The benchmark records the exact model identifier of every system at run time so version-to-version deltas are interpretable. Changelog at /benchmark.
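To make the second point concrete, here is a rough sketch of how a run-to-run noise floor like ±2 points could be estimated: regrade the same answer set several times and look at the spread of the pass rate. The function names and run count below are hypothetical, not part of the published pipeline.

```python
# Hypothetical sketch: estimate the grader's run-to-run noise floor by regrading
# the same 51 answers several times and measuring the spread of the pass rate.
from statistics import mean, pstdev

def pass_rate(grades: list[str]) -> float:
    return sum(g in ("A", "B") for g in grades) / len(grades)

def noise_floor_pp(grade_runs: list[list[str]]) -> float:
    """Standard deviation of the pass rate across repeat runs, in percentage points."""
    return pstdev(pass_rate(run) for run in grade_runs) * 100

# e.g. five regrades of the same answer set (grade_all is a placeholder):
# runs = [grade_all(answers) for _ in range(5)]
# print(f"pass rate ~{mean(pass_rate(r) for r in runs):.0%} ± {noise_floor_pp(runs):.1f} pp")
```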
Why isn't the live 51-question set public?
To prevent benchmark-specific overfitting. If the active set is public, model providers can train on it (deliberately or not), and the number stops measuring what it's supposed to measure. This precedent comes from HealthBench, MedQA, and similar evaluations that rotate held-out reserves rather than publishing in full.
What IS public: a calibration set of 10–15 retired examples (NOT in the active bank) that show the rubric in action. Same severity mix, same structure, same grader prompt, published at github.com/satialaunch/scandose-benchmark. You can run the calibration set against any model in any wrapper and verify the grading is consistent with the public scores.
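If you want to run that verification yourself, the loop is roughly the following. The file name, record fields, and the answer/grader callables are guesses for illustration, not the repo's actual interface.

```python
# Hypothetical re-run of the public calibration set against any model and grader.
# File name, record fields, and the answer_fn/grade_fn callables are assumptions;
# see github.com/satialaunch/scandose-benchmark for the real format and prompts.
import json
from typing import Callable

def run_calibration(path: str,
                    answer_fn: Callable[[str], str],
                    grade_fn: Callable[[str, list[str], str], dict]) -> float:
    with open(path) as f:
        questions = json.load(f)                     # retired questions + expected key points
    grades = []
    for q in questions:
        answer = answer_fn(q["question"])            # any model, any wrapper
        graded = grade_fn(q["question"], q["expected_key_points"], answer)
        grades.append(graded["grade"])
    return sum(g in ("A", "B") for g in grades) / len(grades)

# calibration_pass_rate = run_calibration("calibration_set.json", my_model, my_grader)
```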
Retired questions are released publicly on a rolling schedule, and fresh held-out questions take their place in the active bank. Every active-set question eventually retires to the public calibration set.
Is this doctor-grade? Should I use it for medical decisions?
No.
This benchmark measures a narrow task family: answering published supplement-safety questions for which we have a hand-built rubric of expected key points. It does not establish general medical accuracy. It does not validate diagnosis, dosing for individual cases, or any clinical decision. It is not doctor-grade and we do not claim it is.
A high benchmark score means the system reliably surfaces the published clinical key points on common safety questions. It does not mean the system can replace your prescriber, your pharmacist, or your clinician. Scan Dose is designed to surface evidence; medical decisions remain with you and your providers.
The full scoring methodology, including what we deliberately don't do, is at /how-scoring-works.