Reference · NBME flaws + USMLE gates, 13 checks per item

The 13-point USMLE question quality rubric.

The NBME Item-Writing Guide names two families of flaws and ten specific items. Turn that taxonomy into binary pass/fail checks, add three USMLE conventions every clinical vignette follows, and you get a thirteen-line rubric you can run on any QBank or AI tool in ten minutes per ten items.

The rubric is published below in usable form. The same checks are the in-flight gate inside Studyly's generator for medical decks, which is why the held-out composite at studyly.io/quality lands at 81.3 against a field that averages in the high 60s.

Matthew Diakonov, Written with AI

Published May 19, 2026 · 9 min read

Direct answer · verified 2026-05-19

What rubric scores USMLE-style question quality

Thirteen binary checks per item. Ten come from the NBME Item-Writing Guide: five testwiseness flaws (grammatical cues, logical cues, absolute terms, stem-key word repeats, convergence/longest option) and five irrelevant-difficulty flaws (unfocused stem, non-homogeneous options, vague terms, true-false hybrid options, filler templates). Three are USMLE-specific: clinical-vignette structure, two-step reasoning, source-anchored answer key. An item with zero fails is exam-grade. One fail is acceptable with a note. Two or more is unsafe and should be removed from rotation.

Authoritative source on the underlying flaw taxonomy: NBME Item-Writing Guide. Related four-criterion 0/3/7/10 rubric (composite score instead of per-item flaw count): studyly.io/t/ai-study-question-quality-rubric.

The thirteen checks

Each check is binary: the item either passes or fails. Sum the fails per item. An item with zero is exam-grade, one is acceptable with a note, two or more is unsafe. The checks are ordered so a grader can walk top to bottom and stop the moment a second fail appears.

usmle-quality-rubric.txt

Time budget on a ten-item sample: about ten minutes if you already know the content, twenty if you need to look up the underlying facts. The point of the binary scoring is speed.
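
The pass/fail scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `CHECKS`, `count_fails`, and `bucket` are hypothetical, and the check order follows the order in which this page lists the thirteen checks.

```python
from typing import Sequence

# The thirteen checks, in the order this page lists them (names from the page).
CHECKS = [
    "grammatical cues", "logical cues", "absolute terms",
    "stem-key word repeats", "convergence / longest option",
    "unfocused stem", "non-homogeneous options", "vague terms",
    "true-false hybrid options", "filler templates",
    "clinical-vignette structure", "two-step reasoning",
    "source-anchored answer key",
]

def count_fails(passed: Sequence[bool], stop_at: int = 2) -> int:
    """Walk the checks top to bottom; stop once `stop_at` fails are seen.

    `passed[i]` is True when the item passes CHECKS[i]. Because of the
    early stop, the returned count is capped at `stop_at`.
    """
    if len(passed) != len(CHECKS):
        raise ValueError("expected one pass/fail result per check")
    fails = 0
    for ok in passed:
        if not ok:
            fails += 1
            if fails == stop_at:
                break  # a second fail already makes the item unsafe
    return fails

def bucket(fails: int) -> str:
    """Map a fail count to the three buckets used on this page."""
    if fails == 0:
        return "exam-grade"
    if fails == 1:
        return "acceptable with a note"
    return "unsafe"
```

The early stop is what keeps a ten-item pass short: once an item is unsafe, the remaining checks cannot change its bucket.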

Worked example: a typical AI-generated Step 1 item

The vignette below is the kind of output a saved ChatGPT prompt produces when asked for a Step 1 pulmonology question. It reads fluently. It also fails five of the thirteen checks. The run-down below shows where, and why, with the exact rule each fail violates.

step1-pulm-example.txt

Five fails means the item is unsafe. A student who drills it will learn the right diagnosis (pulmonary infarct) but will also internalize three cues that do not transfer to real NBME items: the longest option is correct, the option that repeats a stem keyword is correct, and the item rewards recognition alone. On the real exam those cues are scrubbed in review and the student is left with a habit, not a fact.

What the five most common fails look like

On the held-out three-document audit, five of the thirteen checks account for roughly eighty percent of the fails on AI output. The table below names each, shows the failure mode you see in an ungrounded prompt, and shows the rule Studyly applies at generation time to prevent it.

Check	Ungrounded AI prompt	Generation-time gate (Studyly)
Stem-key word repeat (#4)	Common in ungrounded AI output. The model leans on the same vocabulary it used to set up the scenario when it picks the keyed answer. Easy testwiseness cue: pick the option that reuses a memorable word from the stem.	Generation-time gate. After the keyed answer is selected, the generator scans the stem and rejects any candidate whose key repeats a high-information noun from the stem that appears in none of the distractors. A rejection triggers regeneration of the distractor pool.
Convergence / longest option (#5)	Almost universal failure. The keyed answer carries more clinical detail, so it tends to be the longest. A test-wise student picks the longest option and is right disproportionately often.	Length-match check at emission. If the keyed answer is more than 25 percent longer than the median distractor, the generator either trims the key or expands the distractors to match.
Homogeneous options (#7)	Frequent on AI output that draws options from a free-form 'list four plausible wrong answers' prompt. The model mixes diagnoses, mechanisms, drug names, and management steps into a single option list.	Option-type tag is set on the keyed answer; distractors are sampled from the same type bucket. The check is also reviewed at the deck level: a deck with 80 percent diagnosis items and 20 percent mechanism items is fine; a deck where each item mixes types within its own option list is rejected.
Two-step reasoning (#12)	Ungrounded prompts default to single-step 'most likely diagnosis' items because that is the easiest shape to generate. Step 2 CK items require an additional action step; AI tools rarely produce that without prompting.	Two-step generation path: identify the clinical state from the vignette, then key the answer to an action conditional on the state (next step, mechanism, drug class). Single-step items are still allowed for Step 1 anatomy and biochem, where action does not apply.
Source-anchored key (#13)	Ungrounded by definition. Prompt-only generators have no source document to anchor to, so the keyed answer is whatever the model emitted. No verification path.	Every keyed answer carries a citation to a span in the uploaded source (slide number, paragraph, figure region). On revisit the explain panel jumps to that span. Items whose citation cannot resolve are rejected at emission.
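
The length-match rule in the convergence row (#5) is simple enough to sketch. The 25 percent threshold and the comparison against the median distractor come from the table; measuring length in characters and the function name `key_is_length_tell` are assumptions made for illustration, not a description of Studyly's actual code.

```python
from statistics import median

def key_is_length_tell(key: str, distractors: list[str],
                       tolerance: float = 0.25) -> bool:
    """Return True when the keyed answer is more than `tolerance`
    (25 percent by default) longer than the median distractor.

    Length is measured in characters here, which is an assumption;
    a word or token count would work the same way.
    """
    if not distractors:
        raise ValueError("need at least one distractor")
    median_length = median(len(d) for d in distractors)
    return len(key) > median_length * (1 + tolerance)

distractors = ["Lobar pneumonia", "Pneumothorax", "Pleural effusion", "Lung abscess"]
long_key = "Pulmonary infarct due to embolic occlusion of a segmental artery"
short_key = "Lung infarct"
```

With these sample options, `key_is_length_tell(long_key, distractors)` flags the item and `key_is_length_tell(short_key, distractors)` does not. A flagged key would then be trimmed, or the distractors expanded, as the table describes.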

The three USMLE-specific gates

The NBME flaw list is format-agnostic. It applies to any multiple-choice item, in any field. The three checks below are the conventions that make an item recognizably USMLE-shaped. A clean Step 1 or Step 2 CK item passes all three by default; an AI tool that produces a free-form medical MCQ often passes the ten NBME checks and fails one or two of these.

USMLE-shape requirements

Clinical vignette structure: demographics, presentation, finding, lead-in question, in that order. Out-of-order or missing components fail the gate.
Two-step reasoning: the item requires the test-taker to first recognize a clinical state and then select an action (diagnosis, mechanism, next step, drug).
Source-anchored answer key: the keyed answer traces to a specific span in the uploaded source. The explain panel jumps to that slide, paragraph, or figure region on click.
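
The first gate, vignette structure, reduces to an ordering check once each sentence of the stem has been labeled. The sketch below assumes a grader (human or model) has already tagged every sentence with one of the four component names; the labels and the function name `vignette_structure_passes` are hypothetical.

```python
from itertools import groupby

# Required component order, as stated in the gate above.
REQUIRED_ORDER = ["demographics", "presentation", "finding", "lead-in"]

def vignette_structure_passes(sentence_labels: list[str]) -> bool:
    """Pass only if the components appear once each, in the required order.

    Consecutive sentences with the same label (for example two
    sentences of findings) are collapsed into one component first.
    """
    collapsed = [label for label, _ in groupby(sentence_labels)]
    return collapsed == REQUIRED_ORDER

ok = ["demographics", "presentation", "finding", "finding", "lead-in"]
out_of_order = ["presentation", "demographics", "finding", "lead-in"]
missing = ["demographics", "presentation", "lead-in"]
```

Here `ok` passes, while `out_of_order` and `missing` both fail the gate. The labeling step is the hard part in practice; the ordering check itself is mechanical.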

13 checks · 10 items · 10 minutes

“The rubric is binary on purpose. Speed beats nuance when the goal is to put a tool in the right bucket: exam-grade, acceptable, or unsafe. Ten consecutive items, scored against thirteen binary checks, gives an unsafe-rate that is comparable across QBanks and AI tools as long as you run the same protocol on the same source documents.”

Method aligned with NBME Item-Writing Guide; composite score reported at studyly.io/quality

Unsafe-rate on the held-out audit

The composite score at studyly.io/quality reports a 0/3/7/10 anchor scale across four criteria. The per-flaw rubric on this page reports the same underlying signal as a binary unsafe-rate. The two numbers correlate and pick the same tool order, but the unsafe-rate is the easier read on a single item.

unsafe-rate.txt

Why these thirteen and not others

The ten NBME checks are not original to this page; they are the flaw taxonomy NBME publishes for its own item writers and the one Downing (2005), Tarrant (2006), and Rush (2016) used to audit large samples of real medical-school item banks. They are the canonical list. The shorter five-flaw version that TrueLearn, Ben White, and YouSMLE publish is a subset of the ten and covers only testwiseness; the five irrelevant-difficulty flaws tend to be the bigger problem on AI-generated content, which is why this rubric includes both halves.

The three USMLE-specific gates are not in the NBME guide because the guide is field-agnostic. They are the conventions of the USMLE item itself: a vignette opens with patient demographics and a presenting complaint, the lead-in is the last sentence, the question asks the candidate to do something (diagnose, manage, intervene). These are the most-likely-to-fail checks on AI tools that are not specifically tuned for USMLE output, because an MCQ generator built for general study can and does write fluent medical questions that do not follow the USMLE shape.

For the longer treatment of the four-criterion composite score (factual correctness, clarity, distractor quality, question-type coverage on a 0/3/7/10 scale), see AI study question quality rubric: 4 criteria, 0/3/7/10 anchors, applied per card.

What the rubric does not measure

Three honest gaps. One, retention: an item can pass all thirteen checks and still be forgotten in a week if the spaced repetition schedule is wrong. Retention is downstream of the rubric, not part of it. Two, exam-blueprint alignment: a deck of items that each pass thirteen checks can still over-index on cardiology and under-index on biochemistry, which is a property of the source you uploaded, not of the items themselves. Three, the rubric measures fidelity to the upload and to NBME conventions, not fidelity to ground truth: if the source slide is wrong, an item built faithfully on it will pass and teach you the wrong fact.

The rubric is a necessary input to deciding which tool to drill from, not a sufficient one. Pair the ten-item audit with a one-deck pilot through your own spaced repetition routine before committing to a tool for a full dedicated study period.

Run the rubric on your own deck

Upload one lecture. Score ten items. Compare.

Free tier on app.jungleai.com, no credit card. Convert a real Step 1 or Step 2 CK lecture into items, walk ten consecutive ones through the thirteen checks above, and compare against any other tool you are using. Ten minutes of grading tells you whether a tool is safe to drill from cold.

Common questions about the USMLE question quality rubric

What rubric do I use to evaluate a USMLE-style question?

Ten flaws drawn directly from the NBME Item-Writing Guide, scored as pass or fail per item, plus three USMLE-specific gates. The ten flaws split into testwiseness (grammatical cues, logical cues, absolute terms, word repeats between stem and key, convergence/longest-option) and irrelevant difficulty (vague terms, non-homogeneous options, unfocused stem, true-false-hybrid options, filler templates like 'all of the above'). The three USMLE gates are clinical-vignette structure (demographics, presentation, finding, lead-in), two-step reasoning (the item asks the test-taker to recognize then act), and a source-anchored answer key. Score 13 binary checks per question. Zero flaws is exam-grade; one is acceptable; two or more means the item is not safe to study from cold.

Where does this rubric come from?

The ten flaw categories are paraphrased from the NBME Item-Writing Guide (nbme.org/educators/item-writing-guide), which is the source NBME item writers train against and which the literature (Downing 2005, Tarrant 2006, Rush 2016) uses to score retrospective audits of real exam banks. The three USMLE-specific gates are the conventions every NBME-style item follows whether or not the guide names them explicitly: a vignette opens with demographics, the lead-in is at the end, the test asks the candidate to do something with the data (most likely diagnosis, next step, mechanism). Studyly uses this same set of thirteen checks as the in-flight gate inside its generator for medical-school decks.

How do I actually run the rubric on a QBank or AI tool?

Pull ten consecutive items from the tool's output, not cherry-picked. For each item, walk the 13 checks in order. Mark each as pass or fail. Sum the fails per item; a high-quality item has zero. Across the ten items, count how many had two or more fails; that number divided by ten is your unsafe-rate. A reputable commercial QBank should run an unsafe-rate near zero. The held-out three-document eval at studyly.io/quality reports a related composite: factual correctness, clarity, distractor quality, question-type coverage, scored on a 0/3/7/10 anchor scale and renormalized to 0-100 (Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8).
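
The unsafe-rate arithmetic in this protocol is a one-liner. A minimal sketch, assuming you have already recorded the number of fails for each of the ten items; the function name `unsafe_rate` and the sample counts are made up for illustration.

```python
def unsafe_rate(fails_per_item: list[int], threshold: int = 2) -> float:
    """Share of items with `threshold` or more fails (two by default)."""
    if not fails_per_item:
        raise ValueError("need at least one scored item")
    unsafe = sum(1 for fails in fails_per_item if fails >= threshold)
    return unsafe / len(fails_per_item)

# Ten consecutive items, one fail count each (illustrative numbers only).
sample = [0, 0, 1, 5, 0, 2, 0, 1, 0, 0]
rate = unsafe_rate(sample)  # two of ten items have two or more fails
```

For this sample the rate is 0.2: two unsafe items out of ten. The one-flaw items do not count toward the rate, since one fail is still acceptable with a note.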

Why split flaws into testwiseness and irrelevant difficulty?

Because they break the item in opposite directions. Testwiseness flaws make the right answer cheaper than it should be: a savvy test-taker can pick correctly without knowing the content, by reading cues in the options (grammatical agreement, longest option, repeated keywords). Irrelevant difficulty flaws make the item harder than it should be for the wrong reasons: vague stem, non-parallel options, time wasted on reading rather than thinking. A single item can have both. The 23.5-point spread on the held-out eval is heavier on irrelevant-difficulty failures than on testwiseness, because AI tools tend to write fluent but unfocused items.

How is this different from the four-criterion rubric on the AI Study Question Quality Rubric page?

The four-criterion rubric (factual correctness, clarity, distractor quality, question-type coverage) is a generic study-question rubric on a 0/3/7/10 anchor scale, scored per card. That rubric is at studyly.io/t/ai-study-question-quality-rubric. This page is USMLE-specific: binary pass/fail, anchored to NBME's named flaws and the clinical-vignette conventions Step 1 and Step 2 CK actually use. They are compatible. If you want a single composite score across tools, use the four-criterion rubric. If you want a defensible go/no-go on a single USMLE-style item, use this one. Both reach the same conclusion on most items.

What is the most common failure on AI-generated USMLE questions?

Word repeats between the stem and the keyed answer. The model writes a vignette mentioning 'wedge-shaped pleural opacity,' then keys the answer to 'pulmonary infarct' in an option that literally repeats 'wedge' while none of the distractors do. On the held-out audit this single flaw shows up in roughly one of every four ungrounded AI outputs. The runner-up is the convergence cue (longest option correct). Both are testwiseness flaws and both are easy to fix at generation time with a one-line rule. The harder failures are irrelevant-difficulty: a non-homogeneous option list (one diagnosis, one mechanism, one drug class) that no rule catches without semantic understanding.
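
A rough version of the stem-key word-repeat check (#4) can be written with plain string handling. This sketch flags words that the keyed answer shares with the stem and that no distractor contains. Treating any word of five or more letters outside a small stopword list as "high-information" is an assumption standing in for real noun detection, and all names here are hypothetical.

```python
import re

# Tiny illustrative stopword list; a real check would need a fuller one.
STOPWORDS = {"which", "following", "likely", "these", "there", "shows", "after"}

def content_words(text: str) -> set[str]:
    """Lowercased words of 5+ letters that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if len(w) >= 5 and w not in STOPWORDS}

def stem_key_repeats(stem: str, key: str, distractors: list[str]) -> set[str]:
    """Words shared by stem and key that appear in none of the distractors."""
    in_distractors = set().union(*(content_words(d) for d in distractors))
    return (content_words(stem) & content_words(key)) - in_distractors

stem = ("A 58-year-old man has pleuritic chest pain. "
        "Imaging shows a wedge-shaped pleural opacity.")
distractors = ["Lobar pneumonia", "Pneumothorax", "Pleural effusion"]
flagged = stem_key_repeats(stem, "Wedge-shaped pulmonary infarct", distractors)
clean = stem_key_repeats(stem, "Pulmonary infarct", distractors)
```

In this example `flagged` contains "wedge" and "shaped", so the item fails the check, while `clean` is empty. Note that "pleural" is not flagged in either case because a distractor also uses it, which removes the cue.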

Does the rubric apply to Step 2 CK and Step 3 too?

Yes. The ten NBME flaws are format-agnostic and the three USMLE gates are stricter on Step 2 CK and Step 3 than on Step 1. Step 2 CK items use longer vignettes and more often ask 'most appropriate next step in management' instead of 'most likely diagnosis,' which raises the bar on the two-step-reasoning gate (the item must require both recognition and a decision). Step 3 vignettes have even more clinical context and an explicit setting cue (clinic vs ED vs inpatient). All thirteen checks still apply. The unsafe-rate on AI-generated Step 2 CK items is higher than on Step 1 because the vignette structure is harder to get right.

How does Studyly enforce these checks at generation time?

Each generated card runs through the thirteen-check rubric before it is allowed into the deck. Items failing any testwiseness flaw are regenerated with a deterministic rule (length-match distractors, strip absolutes, rotate any stem-key word repeats). Items failing an irrelevant-difficulty flaw or a USMLE-specific gate are either regenerated with a different prompt path or, if regeneration fails twice, dropped. The dropped-card rate on medical-school decks is around eight percent, which is the honest cost of in-flight gating. The leaderboard score is the result of running the same rubric post-hoc on cards that already passed the in-flight gate, so the two scores should and do correlate.
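
The regenerate-then-drop flow described above can be sketched as a small retry loop. This is a simplification under stated assumptions: it uses a single `regenerate` callable for every failure type, whereas the text describes deterministic rules for testwiseness flaws and a different prompt path for the rest. The names `gate`, `passes`, and `regenerate` are hypothetical.

```python
from typing import Callable, Optional, TypeVar

Item = TypeVar("Item")

def gate(item: Item,
         passes: Callable[[Item], bool],
         regenerate: Callable[[Item], Item],
         max_regenerations: int = 2) -> Optional[Item]:
    """Return an item that passes the rubric, or None if it is dropped.

    The item is regenerated at most `max_regenerations` times; if every
    regenerated version still fails, the card is dropped (None).
    """
    if passes(item):
        return item
    for _ in range(max_regenerations):
        item = regenerate(item)
        if passes(item):
            return item
    return None

# Toy stand-ins: an "item" is an int that passes once it reaches 3.
toy_passes = lambda x: x >= 3
toy_regenerate = lambda x: x + 1
```

With the toy stand-ins, an item starting at 1 passes after its second regeneration and an item starting at 0 is dropped. Dropped cards are what the roughly eight percent figure above counts.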

Can I trust an item that has one flaw?

Usually yes, with caution. A single flaw rarely makes an item dangerous; it makes it slightly easier or slightly harder than a clean item. The threshold to drop is two or more flaws on the same item, because that is where the failure modes start interacting (a non-homogeneous option list with absolutes in two options, for example, is essentially a length-tell wrapped in a logic puzzle). On a Step 1 deck of two hundred AI-generated items, expect five to fifteen to have one flaw and one or two to have two or more; remove the two-plus items, keep the one-flaw items in the rotation, and note the flaw so you do not pattern-match on it during revisit.

Where is the live leaderboard?

studyly.io/quality. It shows the four named tools on the held-out three-document eval (Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8) on the four-criterion 0/3/7/10 rubric. The numbers correlate with what you would get if you ran this thirteen-point rubric on ten consecutive items from each tool, but the leaderboard is the composite score, not the per-flaw breakdown. For the underlying four-criterion methodology see studyly.io/t/ai-study-question-quality-rubric.
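
The two summary figures quoted on this page (a 23.5-point spread and a field averaging in the high 60s) can be re-derived from the four leaderboard scores. The sketch assumes "field" means the three tools other than Studyly, which is an interpretation rather than something the leaderboard states.

```python
# Composite scores as quoted on this page (0-100 scale).
scores = {"Studyly": 81.3, "Unattle": 78.0, "Gauntlet": 68.0, "Turbolearn": 57.8}

# Spread between the highest and lowest composite score.
spread = round(max(scores.values()) - min(scores.values()), 1)

# Mean of the field, assuming the field excludes Studyly itself.
field = [score for tool, score in scores.items() if tool != "Studyly"]
field_mean = round(sum(field) / len(field), 1)
```

Under that assumption the spread comes out at 23.5 and the field mean at 67.9, consistent with both figures.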