Argument · where the rubric should live
Most Anki rubrics run too late. Move them upstream of emission.
A rubric is a set of criteria. The criteria for a good Anki MCQ are well-known and not in dispute: source-grounded correct answer, length-matched options, no filler templates, grammar parallelism. What is in dispute is when the rubric runs. Almost every "AI Anki rubric" article online describes a checklist a student runs after the cards are generated. By then the bad cards are already written, and the cost of fixing them lands on the student at midnight before an exam.
The honest version of the rubric runs as a set of gates inside the generator, before each card is emitted. Same criteria, different location. The four-check version below is what produces the 23.5-point spread between Studyly (81.3) and Turbolearn (57.8) on the same three held-out source documents at studyly.io/quality.
Direct answer · verified 2026-05-07
What rubric should govern Anki question generation
Four checks gated per card before emission: (1) the correct answer traces to a cited span in the source upload, (2) distractor option lengths fall within 25 percent of the correct answer's length, (3) zero filler templates ("all of the above", "none of the above", "both A and B", "it depends"), (4) every option parses grammatically with the stem (article, number, tense agreement). These map onto NBME's item-writing principles and onto the four-criterion held-out eval Studyly publishes.
Authoritative source on the underlying principles: NBME Item-Writing Guide. Anki's own template documentation: docs.ankiweb.net/templates/intro.html.
The four in-flight checks
A generation-time rubric is the same criteria as a grading rubric, rearranged. Each criterion becomes a gate the model must pass before a card is allowed to leave the pipeline. If a card fails a gate, the slot is regenerated up to a small retry budget. After the budget, the slot is dropped rather than emitted as a bad card. The student sees survivors, not deletions.
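A minimal sketch of that loop, assuming hypothetical `draft_card()` and check functions rather than Studyly's actual internals:

```python
# Minimal sketch of the gate-per-card loop. draft_card() and the check
# functions are hypothetical stand-ins, not Studyly's actual internals.
RETRY_BUDGET = 3

def emit_card(slot, source_text, checks, draft_card):
    """Regenerate a slot until every gate passes, or drop it."""
    for attempt in range(RETRY_BUDGET):
        card = draft_card(slot, attempt)  # fresh stem/distractors each retry
        if all(check(card, source_text) for check in checks):
            return card   # survivor: enters the deck
    return None           # budget exhausted: slot dropped, never emitted
```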
Four gates · run per card
1. Source-anchoring · the correct answer must trace to a span in the upload
Before a card is emitted, the generator records the slide number (or PDF page, or transcript timestamp) that the correct answer was lifted from. If no span can be cited, the card is rejected and the slot is regenerated with a different stem. This is the gate that prevents pretrained drift, the failure mode where the model writes a card that contradicts your professor's slide because it leans on textbook knowledge instead of the upload.
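A hedged sketch of what this gate can look like, assuming you have already extracted (slide_number, slide_text) pairs from the upload; a production version would match spans more fuzzily than a normalized substring search:

```python
def source_anchor(card, slides):
    """Return the slide number the correct answer traces to, or None.

    `slides` is a list of (slide_number, slide_text) pairs; a None result
    means reject the card and regenerate the slot with a different stem.
    """
    answer = " ".join(card["answer"].lower().split())   # normalize case/whitespace
    for number, text in slides:
        if answer in " ".join(text.lower().split()):
            return number   # citation recorded on the card
    return None
```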
2. Length-matching · option lengths within 25 percent of the correct answer
After a candidate option set is drafted, the generator measures characters per option. If the longest option is more than 1.25 times the shortest, the option list is rebalanced or regenerated. This is the gate that kills the length-tell failure mode: three short distractors next to a detailed correct answer, where the student picks the long one without reading the question.
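The length gate is the easiest to make concrete. A sketch using the longest-versus-shortest ratio described above:

```python
def length_gate(options):
    """Pass iff the longest option is at most 1.25x the shortest."""
    lengths = [len(opt) for opt in options]
    return max(lengths) <= 1.25 * min(lengths)

# A length-tell the gate catches: the detailed correct answer stands out.
assert length_gate(["atrial fibrillation", "ventricular rhythm"])
assert not length_gate(["yes", "a markedly prolonged PR interval"])
```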
3. Filler-template ban · no all/none/both/depends candidates allowed
A regex match against a fixed forbidden list ('all of the above', 'none of the above', 'both A and B', 'it depends', 'other') runs before emission. Any option matching the list disqualifies the card. The forbidden list is pinned to the NBME item-writing flaws. This gate exists because the model's pretrained distribution contains a lot of bad-quiz examples and will fall back on filler templates if not constrained.
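A sketch of the regex gate, using the forbidden list quoted above:

```python
import re

# Pinned forbidden list from the text above; 'other' must match exactly.
FORBIDDEN = re.compile(
    r"all of the above|none of the above|both [a-d] and [a-d]|it depends|^other$",
    re.IGNORECASE,
)

def filler_gate(options):
    """Pass iff no option matches the forbidden list."""
    return not any(FORBIDDEN.search(opt) for opt in options)
```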
4. Grammar parallelism · stem and every option parse together
Each option is concatenated with the stem and run through a quick parse check. Singular-plural mismatch, article disagreement ('a' left dangling before a vowel sound, 'an' before a consonant), and tense drift all fail the gate. NBME flags this as a structural giveaway: the student can narrow four options to two on grammar alone before reading any of them.
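Full grammar parallelism needs a real parser; the sketch below covers only the article-agreement slice, as a deliberately crude heuristic (the vowel-letter test is defeated by words like 'hour' or 'user'):

```python
def article_gate(stem, option):
    """Crude a/an agreement check at the stem-option boundary.

    Vowel-letter heuristic only; a production gate would parse
    stem + option with a real parser instead.
    """
    words = option.strip().split()
    if not words:
        return False
    starts_vowel = words[0][0].lower() in "aeiou"
    stem_l = stem.rstrip().lower()
    if stem_l.endswith(" a"):
        return not starts_vowel   # 'a' must precede a consonant sound
    if stem_l.endswith(" an"):
        return starts_vowel       # 'an' must precede a vowel sound
    return True                   # article not at the boundary: pass
```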
The same criteria, two pipelines
Same four criteria, two placements: one pipeline runs them after generation and lands the cost on the student; the other runs them inside the generator and lands the cost on compute. The post-hoc version, step by step:
1. Generate 200 cards from the lecture.
2. Open each card.
3. Read the stem. Is it clear?
4. Read the options. Are they plausible?
5. Read the answer. Does it match the slide?
6. If not, edit or delete the card.
7. Repeat for 200 cards.

Time cost: 30 to 60 seconds per card. Total: 100 to 200 minutes per lecture. Effective when: the student is fresh and knows the material. Failure mode: a tired student rubber-stamps cards at midnight.
- Bad cards exist; student deletes or edits them
- Cost lands on a tired human, often at midnight
- Source-anchoring degrades to self-report by the model
- Field-average score on the held-out eval: 67.9
A prompt template that approximates the gates in ChatGPT
If you are stuck with a generic chat model and want to push the quality up, the cleanest move is to put the four checks in the system prompt and tell the model to abort emission on failure. Three of the four gates transfer cleanly. The fourth (source-anchoring) degrades to a self-report because the model cannot verify against your file the way a tool that ingested the upload can. Treat the citation in the output as a hint, not as ground truth.
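A sketch of such a system prompt follows; this is an approximation, not Studyly's production prompt, and the wording should be adapted to your model:

```text
SYSTEM PROMPT (sketch; adapt the wording to your model):

Generate Anki MCQs from the attached lecture. Before emitting any card,
run four checks and silently discard the card if any check fails:

1. SOURCE: name the specific slide or page the correct answer comes from.
   If you cannot name one, do not emit the card.
2. LENGTH: count characters per option. If any option is more than 25
   percent longer than the correct answer, rewrite the options first.
3. FILLER: never emit "all of the above", "none of the above",
   "both A and B", "it depends", or "other" as an option.
4. GRAMMAR: read the stem plus each option as one sentence. Fix any
   article, number, or tense mismatch before emitting.

Emit only cards that pass all four checks, each with its slide citation.
```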
On a 90-slide cardiology deck, this prompt produces cards that score noticeably better than the default ChatGPT output on the length, filler, and grammar criteria. Source-anchoring still drifts on roughly one card in eight, mostly on borderline facts that the model could pull from training data.
Auditing a deck someone else generated
If you cannot see the generator, you can still infer whether a rubric ran in-flight by inspecting twenty cards from the output. Five quick checks. If the deck passes all five, an in-flight rubric is the most likely explanation. If the deck fails two or more, the rubric (if any) was post-hoc and was probably not run rigorously. A script for the two mechanical checks follows the list.
Five-check deck audit
- Each card has a cited source span (slide number, PDF page, or transcript timestamp). If a card has no citation, the rubric was not applied at emission time.
- Distractor option lengths are within 25 percent of the correct answer's length on every card you sample. Length-tell is a generation-time issue, not a review-time issue.
- Zero cards in the sample contain 'all of the above', 'none of the above', 'both A and B', or 'it depends' as an option. One filler template per 200 cards is too many.
- Every option, concatenated with the stem, parses grammatically. Singular-plural agreement, article agreement, tense agreement all hold.
- Question-type mix across the deck is balanced (MCQ + free-response + case-style + image-occlusion). If the deck is 200 single-best-answer MCQs and nothing else, the type-coverage gate was not applied.
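Checks two and three are mechanical and can be scripted over an exported deck; checks one, four, and five still need eyes on the cards. A sketch, assuming each exported card is a dict with 'answer' and 'distractors' keys from your own export step:

```python
import random
import re

FILLER = re.compile(
    r"all of the above|none of the above|both [a-d] and [a-d]|it depends",
    re.IGNORECASE,
)

def audit(deck, sample_size=20):
    """Flag length-tells and filler templates in a random sample of cards."""
    for card in random.sample(deck, min(sample_size, len(deck))):
        options = [card["answer"], *card["distractors"]]
        lengths = [len(o) for o in options]
        if max(lengths) > 1.25 * min(lengths):
            print("length-tell:", card["answer"])
        if any(FILLER.search(o) for o in options):
            print("filler template:", card["answer"])
```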
Where the cost lands
The two pipelines in the table below differ on every row. The criteria are identical. The location of the work is what matters.
| Feature | Post-hoc rubric | In-flight rubric |
|---|---|---|
| When the rubric runs | Post-hoc, on already-generated cards. The student is the gate. | In-flight, before each card is emitted. The generator is the gate; the student reviews the survivors. |
| What happens to a failing card | Card exists; student must edit or delete it. Time cost lands on the student. | Card never enters the deck. Slot is regenerated against different distractor pool or different stem until gates pass. |
| Source-anchoring enforcement | Self-report inside the generated card (often hallucinated). Student must verify against the source slide. | Substring or span search against the actual upload at generation time. No span found, no card emitted. |
| Distractor length tells | Reviewer notices long correct answer next to three short wrong answers. Card edited or deleted. | Length comparison gate runs before emission. Options outside the 25 percent band trigger a rewrite. |
| Filler templates | Reviewer scans for 'all of the above' etc. and removes them card by card. | Regex match against forbidden list disqualifies the card before it leaves the generator. |
| Field-average outcome on held-out eval | 67.9 across post-hoc-rubric tools (mean of Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8). | 81.3 (Studyly) on the same three held-out source documents and the same four named criteria. |
“On a held-out three-document eval scored on factual correctness, clarity, distractor quality, and question-type coverage, Studyly scored 81.3 against a field average of 67.9. The structural reason is that the four named criteria run as gates per card during generation, not as a checklist after the fact.”
Held-out three-document eval, May 2026 · methodology at studyly.io/quality
What the rubric is not
The four checks are necessary but not sufficient. A card can pass all four and still be a bad card: a question on a fact your professor never emphasized, a stem that asks for memorization where the test asks for application, a card that becomes recognition rather than recall on revisit because the wording is identical every time. Those failures sit above the per-card rubric and need their own mechanisms (auto-rephrasing on revisit, question-type mixing across the deck, optional human review per contributor on group decks).
The four checks are also not a substitute for spaced repetition. A card that passes every gate but is studied once and never reviewed is forgotten within a week. The rubric is upstream of scheduling, not a replacement for it. The Studyly pipeline runs the rubric at generation time and hands the surviving cards to a spaced repetition algorithm; both pieces are needed.
Related reading
The longer leaderboard write-up with the four-criterion breakdown per tool: AI Anki card generator quality: the measurement most tools quietly skip.
The post-hoc grading-time companion to this page, with the five distractor failure modes and the 90-second-per-card rubric: Anki card distractor quality: the five failure modes.
The published methodology and held-out leaderboard: studyly.io/quality.
Run a deck through the gates
Upload one lecture. Audit twenty cards.
Free tier on app.jungleai.com, no credit card. Generate cards from a real lecture, then run the five-check audit above on twenty random cards. Thirty minutes of work tells you whether the rubric ran in-flight or in your head.
Common questions about Anki rubrics for question generation
What is a 'rubric' in the context of Anki question generation?
A rubric is a small set of named criteria that decides whether a question is good. The same word covers two very different uses. The first is post-hoc review: a checklist a student runs on already-generated cards (factual correctness, clarity, distractor quality, grammar parallelism). The second is generation-time gating: the same criteria, but applied as constraints inside the generator before each card is emitted. The two share criteria but differ in where the work happens. Post-hoc means the bad cards exist and you delete them; in-flight means the bad cards are never written.
Why does it matter where the rubric runs?
Three reasons. One, deletion is more expensive than non-emission, especially at deck scale (200 cards per lecture). Two, a post-hoc rubric assumes the reviewer is fluent in the source material; a tired med student grading their own cards at midnight misses pretrained-drift errors that an in-flight check would have caught against the actual upload. Three, post-hoc rubrics implicitly accept the field-average hit rate (roughly 67.9 on the four-criterion held-out eval) because they cannot create cards that did not get generated; in-flight gates create the option of a card that simply does not get written until the constraint is satisfied.
What are the four in-flight rubric checks Studyly runs?
Source-anchoring (the correct answer must trace to a span in the uploaded slide deck, PDF, or transcript). Length-matching (distractor option lengths must be within roughly 25 percent of the correct answer's length). Filler-template ban (no card can ship with 'all of the above', 'none of the above', 'both A and B', 'it depends' as a candidate option). Grammar parallelism (the stem and each option must parse together: singular-singular, plural-plural, 'a' before consonants and 'an' before vowels). The four map onto the four-criterion held-out eval at studyly.io/quality.
Can I approximate these checks in a ChatGPT prompt?
Mostly, with caveats. The grammar and length checks transfer cleanly: tell the model to count characters and abort if any option is more than 25 percent longer than the correct answer, and to verify singular/plural and article agreement before emitting. The filler ban transfers cleanly: instruct the model to never emit 'all of the above', 'none of the above', etc. Source-anchoring is the one that does not transfer cleanly: ChatGPT cannot easily verify that the cited span exists in your upload, so the gate degrades to a self-report and the model will sometimes lie about which slide a fact came from. A prompt template that gets you roughly 70 percent of the way there is in the prompt-template section above.
How is this different from the distractor-quality rubric you already published?
The distractor rubric at studyly.io/t/anki-card-distractor-quality is a post-hoc grading instrument: five checks you run on twenty existing cards, with the source slide open in another tab, taking 90 seconds per card. This page is the upstream version: the same dimensions reorganized as gating constraints applied DURING generation, and a discussion of why moving the rubric upstream changes the field-average outcome. Distractor quality is the single criterion with the widest tool-to-tool spread; generation-time gating is the structural reason the spread exists.
How many cards get rejected by these gates in practice?
On a typical 90-slide cardiology deck, roughly 15 to 20 percent of the model's first-draft cards do not pass all four in-flight checks. The retry budget regenerates the failing cards (different distractor pool, different stem phrasing) until the gates pass or the budget exhausts. The cards a user actually sees are the ones that cleared the gates. The deck arrives with about 200 surviving cards, not 240 cards, 40 of which are quietly bad.
Does the NBME publish a rubric this granular?
The NBME Item-Writing Guide is the authoritative source for the underlying principles: distractors should be plausible and parallel, filler shapes are item-writing flaws, grammar mismatches give away the answer. The guide is written for human item-writers preparing licensure exams. It does not specify ±25 percent length tolerances or 'gate per card before emission' wording because those are implementation choices for an automated generator. This page ports the NBME principles into operational gates that a generator can actually run.
What about question-type coverage? Does that get its own gate?
Yes, but at the deck level, not the card level. A type-coverage gate looks at the running tally of MCQ vs free-response vs case-style vs image-occlusion across the deck and biases the next emission toward whichever bucket is underrepresented relative to the target mix. This is one level above the per-card rubric, which is why it is not in the four-checks list. It is the reason a Studyly export of a single lecture has a balanced question-type mix instead of 200 single-best-answer MCQs.
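A sketch of what such a deck-level gate can look like; the target mix values here are illustrative, not Studyly's published numbers:

```python
from collections import Counter

# Illustrative target mix, not Studyly's published targets.
TARGET = {"mcq": 0.5, "free_response": 0.2, "case": 0.2, "image_occlusion": 0.1}

def next_type(emitted_types):
    """Pick the question type furthest below its target share."""
    tally = Counter(emitted_types)
    total = max(1, len(emitted_types))
    return min(TARGET, key=lambda t: tally[t] / total - TARGET[t])
```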
Does running the rubric in-flight slow generation down?
Marginally. Each gate adds a small amount of work per card: a substring search for source-anchoring, a length comparison for length-matching, a regex for filler templates, and a quick grammar parse for parallelism. The 60-second-per-90-slide-deck number on Studyly is measured with all four gates active. The throughput cost is real but small enough that it is dominated by the generation step itself.
Can I see the leaderboard with the per-criterion scores?
Yes, at studyly.io/quality. The leaderboard breaks the held-out three-document eval down by criterion: factual correctness, clarity, distractor quality, question-type coverage. Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8. Distractor quality is the criterion with the largest tool-to-tool spread, which is consistent with the in-flight-vs-post-hoc framing on this page: tools that gate on distractor quality at emission time pull ahead of tools that don't.