Reference · what makes an Anki distractor good or bad

Distractor quality is where 57.8 and 81.3 live, on the same eval.

On a four-criterion held-out eval scoring AI Anki card generators, the factual-correctness, clarity, and type-coverage scores cluster in a narrow band. Distractor quality is the one criterion where the field spreads out from 57.8 (Turbolearn) to 81.3 (Studyly), with the field average dragged down to 67.9 by tools that emit filler text and length tells. If you are choosing between AI Anki tools, this is the axis the decision actually rides on.

What follows: a taxonomy of the five distractor failure modes you actually see in the wild, a side-by-side card example for each, a 90-second-per-card rubric you can run by hand on any deck, and the structural reason source-grounded generation avoids the whole class.

Matthew Diakonov · 8 min read

Direct answer · verified 2026-05-06

What makes an Anki distractor good

A good distractor is plausible (a real student misconception or a same-topic neighbor of the correct answer), length-matched within roughly 25 percent of the correct option, grammatically parallel to the stem, and never one of the filler shapes ("all of the above", "none of the above", "both A and B", "it depends"). The job of the distractor is to look like the correct answer to a student who has not learned the slide; if the wrong options can be eliminated by length, grammar, or pattern-matching, the card is testing test-wiseness, not recall.

Authoritative source: the NBME Item-Writing Guide. The published held-out leaderboard for AI Anki tools, with full methodology and per-criterion breakdown, is at studyly.io/quality.

The five distractor failure modes

Most discussion of AI flashcard quality stops at the criterion list. What is missing is the typology: what does a bad distractor actually look like in card form? These five shapes cover roughly 95 percent of the failures I have hand-graded across Studyly, Turbolearn, Gauntlet, and a few generic ChatGPT-wrapper tools on the same renal lecture.

Five shapes of bad distractor

1. Filler-text distractor

A distractor that is not a wrong answer to the question, just a placeholder phrase. 'None of the above', 'All of the above', 'Both A and B', 'It depends'. The card stops testing recall and starts testing test-wiseness.

2. Length tell

Three short distractors and one detailed correct answer (or vice versa). The student does not need to know the topic to pick the option that stands out by length. NBME explicitly flags this; AI tools that do not enforce length-matching emit it constantly.

3. Grammatical mismatch

Stem ends in 'a ___' and one option starts with a vowel; or stem is plural and three options are singular. The student narrows from four options to two on grammar alone, before reading any of them. Generic chat models do this often when the stem and option list are generated in two separate passes.

4. Pretrained drift

The correct answer is grounded in your professor's slide, but the distractors are pulled from the model's pretrained knowledge of the topic, not from your upload. Result: distractors that contradict your slide deck on borderline material. Worst case, your professor said one thing in lecture and the card scores you wrong for repeating it.

5. Same-as-correct decoy

A distractor that is technically a synonym, near-paraphrase, or special case of the correct answer. The student picks it, gets marked wrong, and either loses confidence or learns to second-guess the right intuition. Common in tools that generate distractors by perturbing the correct answer rather than by sampling neighbors from the source.

Same fact, two distractor pools

The card below asks about the loop of Henle. The fact under test: the thick ascending limb is impermeable to water. Everything wrong with it lives in the distractor pool; the source-grounded section further down shows where a real pool for the same fact would come from.

Q: Which segment of the loop of Henle is impermeable to water?
A) Thick ascending limb
B) Something else
C) Both A and C
D) None of the above
Answer: A

  • The "Something else" option is filler text, not a real anatomical structure
  • "Both A and C" with only options A through D is a structural cop-out
  • "None of the above" gives away that A is correct because three of four options refer to A
  • Lengths are wildly mismatched: 4-word right answer next to 2-word filler

The 90-second-per-card rubric

Five checks per card, run with the source slide open in another tab. Twenty cards graded is enough to tell two tools apart with high confidence. Tally pass rates per check across the sample; the worst column tells you which failure mode is dragging the deck down. Checks two and three are mechanical enough to script, and a short sketch follows the checklist.

Distractor rubric · five checks

  • Open the source slide alongside the card. Confirm the correct answer appears verbatim or in close paraphrase. If the answer is in the textbook but NOT in the slide, mark this as pretrained drift and fail the card.
  • Measure the option lengths in characters. If the longest is more than ~25 percent longer than the shortest, the deck has a length tell. Two examples in twenty cards is enough to flag the tool.
  • Scan for filler phrases. 'All of the above', 'None of the above', 'Both A and B', 'It depends', 'Other'. One per card is too many. The correct count for a well-designed deck is zero.
  • Read the stem and each option together. Does the grammar parse cleanly for all four pairings? A mismatch ('a' vs 'an', singular vs plural, present vs past tense) is a structural giveaway and fails the card.
  • Check that each distractor is a plausible alternative the student could believe is correct: a same-topic neighbor, a real misconception, a structurally similar concept. If a distractor is a synonym of the correct answer or an unrelated filler, fail it.
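Those two mechanical checks (option lengths and filler phrases) can be pre-screened by a script before the hand pass. Below is a minimal sketch in Python, assuming each card arrives as a plain dict with an option list; the field names and the filler-phrase list are illustrative, not any tool's export format.

distractor_checks.py

# Sketch of the two mechanical rubric checks: length tell and filler text.
# Assumes a card is a dict with an "options" list of answer strings (illustrative shape).
FILLER_PREFIXES = (
    "all of the above", "none of the above", "both a and",
    "it depends", "other",
)

def has_length_tell(options, tolerance=0.25):
    """True if the longest option is more than ~25 percent longer than the shortest."""
    lengths = [len(o) for o in options]
    return max(lengths) > min(lengths) * (1 + tolerance)

def filler_options(options):
    """Return every option that is a filler shape rather than a real wrong answer."""
    return [o for o in options if o.strip().lower().startswith(FILLER_PREFIXES)]

card = {
    "stem": "Which segment of the loop of Henle is impermeable to water?",
    "options": ["Thick ascending limb", "Something else",
                "Both A and C", "None of the above"],
}
print(has_length_tell(card["options"]))   # True: 20 characters vs 12
print(filler_options(card["options"]))    # ['Both A and C', 'None of the above']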

A real run on a renal-physiology generation. Tool A and Tool B are anonymized to keep the focus on the methodology. The 56 vs 93 split maps cleanly onto the published leaderboard: a tool that fails three of the five distractor checks lands at the field average; a tool that passes four of five lands near the published 81.3 score.

Why source-grounded generation avoids the whole class

Four of the five failure modes share a single root cause: the model is asked to invent distractors without a constraint on where the wrong options come from. Filler text, length tells, and synonym-decoy distractors all show up because the model is sampling from a large pretrained distribution of plausible-sounding wrong answers. The structural fix is to force the distractor pool to come from the same source as the correct answer.

Concretely: if the correct answer for a renal MCQ is grounded on slide 14 of your professor's deck, the distractor candidates are slides 11, 14, 17, and 22 of that same deck (the other nephron segments named in the lecture). The model picks three of those, runs a length and grammar check, and emits the option list. There is no point in the pipeline where "None of the above" can show up as a candidate; the source does not contain that string. There is no point where a textbook fact your professor never taught can leak in; the source is the gate.
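A minimal sketch of that gate, assuming the lecture's labeled terms have already been extracted per slide. The slide numbers, segment names, and function shape below are illustrative placeholders, not Studyly's actual pipeline.

grounded_distractors.py

import random

# Assumed input: terms extracted from the uploaded slide deck, keyed by slide number.
# These slide numbers and labels are placeholders for whatever the upload contains.
source_terms = {
    11: "Thin descending limb",
    14: "Thick ascending limb",      # the slide that grounds the correct answer
    17: "Proximal tubule",
    22: "Cortical collecting duct",
}

def build_options(correct, candidates, tolerance=0.25, k=3):
    """Pick k distractors from same-source neighbors, then enforce length matching."""
    pool = [t for t in candidates if t != correct]
    lo, hi = len(correct) * (1 - tolerance), len(correct) * (1 + tolerance)
    matched = [t for t in pool if lo <= len(t) <= hi] or pool  # fall back if the filter empties the pool
    options = [correct] + random.sample(matched, min(k, len(matched)))
    random.shuffle(options)
    return options

# Every option is a term the professor actually taught; "None of the above"
# can never appear, because the source never contained that string.
print(build_options("Thick ascending limb", list(source_terms.values())))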

This is why distractor quality is the criterion with the largest tool-to-tool spread. The mechanism the tool chooses for sourcing distractors decides almost the entire score. Tools that generate distractors by free-associating from the correct answer end up at 57.8. Tools that pin distractors to same-source neighbors end up at 81.3. The pretrained drift, length tell, and filler-text failure modes are not LLM bugs; they are consequences of skipping the grounding step.

81.3

Studyly scored 81.3 on a four-criterion held-out three-document eval (factual correctness, clarity, distractor quality, question-type coverage). Distractor quality was the criterion with the largest tool-to-tool spread.

Held-out three-document eval, May 2026 · methodology at studyly.io/quality

What the rubric does not catch

Two things the five-check rubric undercounts, both worth running a separate pass on:

Distractor staleness on revisit. A card that passes the rubric on first generation can still fail as an active-recall card if the same option set comes back identical on review #5. The mitigation is distractor rotation on revisit (a different three of the four same-source neighbors get sampled each time), which is a runtime property of the tool, not something the static rubric measures.
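What rotation can look like, sketched under the assumption that each card stores its four same-source neighbors at generation time; seeding the sampler with the review number is an illustrative choice, not a documented feature of any tool scored here.

distractor_rotation.py

import random

# Assumed per-card state: four same-source neighbor terms stored at generation time.
NEIGHBORS = ["Thin descending limb", "Proximal tubule",
             "Distal convoluted tubule", "Cortical collecting duct"]

def options_for_review(correct, neighbors, review_number):
    """Sample three of the four stored neighbors, varying with the review number."""
    rng = random.Random(review_number)   # deterministic for a given review, varies across reviews
    options = [correct] + rng.sample(neighbors, 3)
    rng.shuffle(options)
    return options

# Different review numbers draw different subsets and orderings of the same
# source-grounded pool, so the option pattern itself is not memorizable.
print(options_for_review("Thick ascending limb", NEIGHBORS, 1))
print(options_for_review("Thick ascending limb", NEIGHBORS, 5))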

Image-occlusion distractors. The five checks above are written for text-MCQ cards. On image-occlusion cards (anatomy diagrams, biochem pathway figures), the "distractor" is the set of unmasked labels visible on the same image, plus whatever the student is asked to remember about the masked region. Length and grammar checks do not apply. Source-grounded generation still helps (the unmasked labels are always real anatomical terms from the same diagram), but the rubric needs a separate pass.

Run the rubric on a real deck

Upload one lecture. Hand-grade twenty cards.

Free tier on app.jungleai.com, no credit card. Generate against a real lecture deck, sample twenty cards, and run the five distractor checks above. Thirty minutes of work tells you whether the tool ships cards you can study from or cards you have to rewrite.

Common questions about Anki distractor quality

What is a distractor on an Anki card?

On a multiple-choice Anki card, the distractors are the wrong answer options sitting next to the correct one. Their job is to look plausible to a student who has not actually learned the material, so the card tests retrieval of the fact rather than recognition of which option looks the most polished. A free-response or basic two-sided card has no distractors at all; the question only arises on MCQ-style cards (which is what most AI Anki tools generate now).

Why is distractor quality the criterion that varies most across AI tools?

Factual correctness improved across the board once tools started grounding answers in the upload. Clarity is a property of the prompt, and most tools have settled into similar phrasing rules. Question-type coverage is a knob: either the tool emits MCQ + free-response + case-style + image-occlusion or it does not. Distractor quality is the one criterion where the actual mechanism (where the wrong answers are sourced from) decides the score, and that mechanism is different in every tool. On the held-out three-document eval at studyly.io/quality, the field spans 57.8 to 81.3, almost entirely on this dimension.

What does NBME's published guidance say about distractors?

The NBME Item-Writing Guide (linked under Educators on nbme.org) says distractors should be plausible, parallel in grammar and length, and drawn from common student misconceptions or close-topic neighbors. It explicitly flags 'all of the above', 'none of the above', tortured length variance, and grammatical mismatch between stem and option as item-writing flaws. Most of the rubric on this page is just NBME's guide compressed into a checklist you can run on a 200-card deck in a few hours.

How long does it take to grade a card on the rubric?

Roughly 90 seconds per card if you have the source slide open in another tab. Five clicks: is the right answer in the source (yes/no), are the distractor lengths within 25% of the correct answer (yes/no), are any of the distractors filler text or 'all/none of the above' (yes/no), is there a grammar mismatch between stem and one option (yes/no), and is each distractor a real misconception or same-topic neighbor rather than a synonym of the correct answer (yes/no). Twenty cards graded is enough to tell two tools apart.

If I edit the bad distractors by hand, is the card fine?

Yes, that is the manual fix and lots of med students do it. The cost is the time. A 200-card deck with two bad distractors per card is 400 manual rewrites. The structural fix is to start from a tool whose distractor pool is sourced from the same lecture you uploaded, so the rewrite is unnecessary on most cards. The math is roughly: if 70 percent of cards out of the box need no edit, the deck is shippable; if 70 percent of cards need an edit, the tool is a starting point for hand-tuning, not an active recall pipeline.

Why do model-generated distractors so often include filler text?

When the model is asked to produce three wrong answers without being grounded in a source, it falls back on 'sounds wrong' templates: 'None of the above', 'It depends on the context', 'Both A and B', or a clearly unrelated phrase. These are statistically common in the training data because they show up on bad real-world quizzes. A model that is forced to pull distractors from the same source document as the correct answer simply cannot emit those filler shapes; the source does not contain 'None of the above' as a candidate label.

How does Studyly handle distractors specifically?

On a generated MCQ, the correct answer is anchored to a span in the source upload (slide N, line M, or transcript timestamp). Distractors are pulled from same-topic neighbors in the same source, length-matched and stripped of filler templates. On a card from a cardiology lecture about the AV node, the distractor pool is other conduction-pathway structures that appeared in the same deck (SA node, bundle of His, Purkinje fibers), not arbitrary cardiac terms from a generic web question bank. This is why the four-criterion held-out eval scored Studyly 81.3 against a field average of 67.9.

Does the rubric work on cards I made myself by hand?

Yes. The five failure modes are tool-agnostic. Hand-written cards trip the length-mismatch and grammatical-mismatch checks more often than the filler-text check, because humans rarely write 'None of the above' but routinely write a four-word distractor next to a fifteen-word correct answer. Running the rubric on your own homemade deck is the fastest way to find your own bias as a card writer.

What about cards that are only two-sided (front/back), no distractors?

Those skip the distractor question entirely; you have nothing to grade. The downside is that two-sided cards train recall on a single phrasing of the cue, so on revisit five they become recognition of the wording rather than retrieval of the concept. The upside is that there is no distractor pool to get wrong. If your deck is mostly two-sided cards, the rubric above collapses to two checks (factual correctness and clarity) and the differences between tools shrink considerably.

Where can I see the eval methodology in detail?

The leaderboard, the four criteria, the held-out documents, and the per-tool scores are at studyly.io/quality. The methodology page is also linked from the homepage. The tools scored in the May 2026 cut are Studyly (81.3), Unattle (78.0), Gauntlet (68.0), and Turbolearn (57.8); field average 67.9. The full breakdown by criterion is on the longer write-up at studyly.io/t/ai-anki-card-generator-quality.