Reference · the rubric, ready to apply

The rubric behind the 23.5-point spread.

When Studyly scored 81.3 and Turbolearn scored 57.8 on the same three held-out source documents, the difference is not a vibe. It is a rubric, four named criteria, and a 0/3/7/10 anchor scale applied per card. The rubric is published below in usable form so you can run it against any tool yourself in about 25 minutes.

The four criteria are the same ones on the leaderboard at studyly.io/quality: factual correctness, clarity, distractor quality, question-type coverage. The anchor scale is the working detail most write-ups leave out.

Matthew Diakonov, Written with AI

Published May 18, 2026 · 8 min read

Direct answer · verified 2026-05-18

What rubric scores AI-generated study questions?

Four criteria, each scored on a 0/3/7/10 anchor scale: factual correctness (does the keyed answer trace to the source), clarity (is the stem unambiguous on first read), distractor quality (are wrong options plausible and length-matched, with no filler templates), and question-type coverage (does the deck mix MCQ, free-response, case-style, image-occlusion). Sum the three per-card scores for each card, take the mean across cards, add the deck-level type-coverage score, and renormalize the resulting 0-40 total to 0-100. Results on the held-out three-document eval: Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8.

Authoritative source on the underlying item-writing principles: NBME Item-Writing Guide. Live leaderboard with the same four criteria: studyly.io/quality.

The anchor scale

Three of the four criteria are scored per card. The fourth (question-type coverage) is scored once across the deck. Every score lands on one of four anchors (0, 3, 7, or 10) because mid-scale Likert points are where grader disagreement concentrates. The anchor prose below tells you what each number looks like in practice.

What 0, 3, 7, and 10 look like on each criterion

The table below is the working version of the rubric. Left column is the criterion at a given anchor; middle column is the failure mode you tend to see when a generic AI tool runs without any quality gate; right column is what a card looks like when the same criterion is enforced at generation time (Studyly's pipeline). Read it as one row per (criterion, anchor) pair.

Criterion · anchor	No gate (generic AI tool)	Generation-time gate (Studyly)
Factual correctness · 0/10	Keyed answer contradicts the source slide. Common when the model leans on its pretrained knowledge of the topic and ignores a recent update or a professor's simplification.	Source-anchored: the correct answer must trace to a cited span in the upload. If no span can be cited, the card is rejected at emission, not scored 0 later.
Factual correctness · 7/10	Keyed answer is right but the cited span is approximate (cites slide 14 when the fact is on slide 13). Card still teaches the right concept.	Same. Approximate citation is a 7 because verification still works, just one slide off.
Clarity · 0/10	Stem is ambiguous in a way that survives re-reading: pronoun with no antecedent, two plausible interpretations, units missing where they matter.	Same definition; the gate is the same definition applied at generation time, so the rate is lower but not zero.
Clarity · 7/10	Stem reads cleanly on first pass but uses a technical term where a more common synonym would be clearer. Working card.	Same.
Distractor quality · 0/10	Filler template present ('all of the above', 'it depends'), or one distractor is more than ~25% longer than the others (length tell), or two distractors are paraphrases of each other.	Each of those is a generation-time gate. A card that fails the length check at emission is regenerated; a filler-template option disqualifies the card.
Distractor quality · 7/10	Length-matched, no filler, but one of the wrong options is from a different topic on the same slide deck and would be eliminated by a student who studied any of the material.	Same. Adjacent-topic distractors are a 7 because they reward studied students without breaking the card.
Question-type coverage · 0/10	Output is 100% single-best-answer MCQ. No free-response, no case-style, no image-occlusion.	Four formats are produced from the same source; the deck-level mix toggles between balanced and MCQ-heavy depending on the source material's shape.
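Two of the distractor checks in the table are mechanical enough to sketch in code. The Python below is an illustration, not Studyly's pipeline: the filler list contains only the two examples named in the table, the 25% threshold is read as "longer than the mean length of the other distractors", and every function name is hypothetical. The paraphrase check is left out because it needs semantic comparison.

```python
import re
from statistics import mean

# Assumption: only the two filler examples the table names are listed here.
FILLER_TEMPLATES = ("all of the above", "it depends")
# Assumption: "~25% longer" is measured against the mean length of the others.
LENGTH_TOLERANCE = 0.25


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()


def has_filler(options: list[str]) -> bool:
    """True if any option is one of the known filler templates."""
    return any(_normalize(opt) in FILLER_TEMPLATES for opt in options)


def has_length_tell(distractors: list[str]) -> bool:
    """True if any distractor is over 25% longer than the mean of the others."""
    for i, opt in enumerate(distractors):
        others = [len(o) for j, o in enumerate(distractors) if j != i]
        if others and len(opt) > (1 + LENGTH_TOLERANCE) * mean(others):
            return True
    return False


def fails_distractor_gate(distractors: list[str]) -> bool:
    """Either failure puts the card at 0 on distractor quality."""
    return has_filler(distractors) or has_length_tell(distractors)
```

Character counts are a crude proxy for length; word counts or token counts would be an equally defensible choice.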

The 10-card scoring sheet

Pull ten consecutive cards from any tool's output (not cherry-picked). Open the source document the cards were generated from. Score each card on the three per-card criteria, citing the specific failure mode in the notes if a score is below 10. Then score question-type coverage once across the ten cards. The total is comparable across tools as long as you run the same protocol on the same source.

Time budget: about 25 minutes if you know the material, 50 minutes if you do not. A 200-card deck does not need a 200-card audit; a representative ten is enough to put the tool in the right bucket on the leaderboard.
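If you keep the sheet in code rather than a spreadsheet, the arithmetic can be sketched as below. This is a minimal Python version under stated assumptions: the dictionary keys and the function name are made up for illustration, and the scores in the usage example are hypothetical, not results from the eval.

```python
from statistics import mean

ANCHORS = {0, 3, 7, 10}
# Hypothetical key names for the three per-card criteria.
PER_CARD = ("factual_correctness", "clarity", "distractor_quality")


def deck_score(cards: list[dict[str, int]], type_coverage: int) -> float:
    """Return the renormalized 0-100 score for a sample of cards.

    cards: one dict per card holding the three per-card anchor scores.
    type_coverage: the single deck-level anchor score.
    """
    all_scores = [type_coverage, *(c[k] for c in cards for k in PER_CARD)]
    for value in all_scores:
        if value not in ANCHORS:
            raise ValueError(f"{value!r} is not one of the anchors 0, 3, 7, 10")
    per_card_mean = mean(sum(c[k] for k in PER_CARD) for c in cards)  # 0-30
    raw = per_card_mean + type_coverage                               # 0-40
    return round(raw / 40 * 100, 1)


# Hypothetical sample: ten cards that each score 10 / 7 / 7, coverage 7.
sample = [{"factual_correctness": 10, "clarity": 7, "distractor_quality": 7}] * 10
print(deck_score(sample, type_coverage=7))  # prints 77.5
```

Rejecting off-anchor values such as 5 keeps the sheet honest: the grader has to pick a category instead of splitting the difference.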

81.3 vs 57.8

“Same three held-out source documents. Same four named criteria. Same 0/3/7/10 anchor scale per card. The 23.5-point spread between Studyly and Turbolearn lands almost entirely on factual correctness and distractor quality. Clarity and type-coverage spread less.”

Held-out three-document eval, May 2026 · methodology at studyly.io/quality

Why factual correctness is the load-bearing criterion

Of the four, factual correctness is the only one a student cannot honestly self-grade. Clarity is visible: re-read the stem. Distractor quality is visible: look at four options, measure lengths, scan for filler. Question-type coverage is visible: count formats across the deck. Factual correctness requires knowing the right answer, which is the exact thing the student is studying to learn. If you could already grade it, you would not need the practice question.

The only honest fix is to require the generator to cite the source span the keyed answer was lifted from, then make the rubric reject any card whose citation does not resolve. A card without a verifiable source span scores 0 on factual correctness even if the answer happens to be right, because you have no way to confirm it without already knowing the material. This is the gate that separates tools that score near 80 from tools that score near 60 on the leaderboard.
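A rough sketch of that gate in Python follows. It assumes the generator returns the cited span as quoted text and that you have the source as plain text; both are assumptions, and the function names are hypothetical. The check only establishes that the span exists in the upload. Whether the span actually supports the keyed answer is still the grader's call.

```python
import re
from typing import Optional


def _squash(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_resolves(cited_span: Optional[str], source_text: str) -> bool:
    """True if the quoted span appears in the source, ignoring case and spacing."""
    if not cited_span or not cited_span.strip():
        return False
    return _squash(cited_span) in _squash(source_text)


def factual_correctness(cited_span: Optional[str], source_text: str,
                        graded_score: int) -> int:
    """Force 0 when the citation does not resolve; otherwise keep the grade."""
    return graded_score if citation_resolves(cited_span, source_text) else 0


source = "The loop of Henle\nconcentrates urine."
print(factual_correctness("loop of Henle concentrates", source, 10))  # prints 10
print(factual_correctness(None, source, 10))                          # prints 0
```

Exact substring matching is deliberately strict. A span that was paraphrased by the generator will fail here, which matches the rule that an unverifiable citation scores 0.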

For the longer treatment of this specific failure mode, see AI-generated practice question quality: the part you cannot grade yourself.

Where the rubric should run

A rubric is a set of criteria; it is independent of when those criteria get applied. A post-hoc rubric runs on already-generated cards: the student deletes or edits the bad ones. An in-flight rubric runs as a gate inside the generator: bad cards never enter the deck. Same criteria, different location of the work. The 23.5-point spread on the leaderboard is mostly the result of moving the rubric from post-hoc to in-flight, not of changing the criteria.
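The distinction can be shown with one predicate used in two places. The Python sketch below is illustrative only: `passes` stands in for whatever rubric checks you run, `draft` stands in for a call to the generator, and neither reflects a real tool's API.

```python
from typing import Callable, Iterable

Card = dict                      # hypothetical card record
Gate = Callable[[Card], bool]    # the rubric, expressed as a pass/fail check


def post_hoc(cards: Iterable[Card], passes: Gate) -> list[Card]:
    """Everything is emitted first; the filtering happens afterwards."""
    deck = list(cards)                       # bad cards enter the deck...
    return [c for c in deck if passes(c)]    # ...and someone removes them later


def in_flight(draft: Callable[[], Card], passes: Gate,
              want: int, max_tries: int = 50) -> list[Card]:
    """Drafts are checked before emission; failures are redrafted, never emitted."""
    deck: list[Card] = []
    for _ in range(max_tries):
        if len(deck) == want:
            break
        card = draft()
        if passes(card):
            deck.append(card)
    return deck
```

The `max_tries` cap is a design choice for the sketch: without it, a gate the generator can never satisfy would loop forever.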

The companion argument for moving the rubric upstream of emission, with the four in-flight checks and a ChatGPT prompt template that approximates them, is at Most Anki rubrics run too late. Move them upstream of emission.

What this rubric does not measure

Three honest gaps. One, retention: a card can score 10 on every criterion and still be forgotten in a week if the spaced repetition schedule is wrong. Retention is downstream of the rubric, not part of it. Two, professor fidelity: factual correctness anchors to the source you uploaded, so if your professor's slide is itself wrong, a card built faithfully on it scores 10 here while teaching you a wrong fact. The rubric measures fidelity to the upload, not fidelity to ground truth. Three, exam alignment: a deck that scores 90 on this rubric can still be the wrong shape for your exam (too much recall, not enough application). Question-type coverage helps but does not solve this.

The rubric is a necessary input to deciding which tool to drill from, not a sufficient one. Pair the 25-minute audit with a one-deck pilot through your own spaced repetition routine before committing to a tool for a board-prep cycle.

Run the rubric on your next deck

Upload one lecture. Score ten cards. Compare.

Free tier on app.jungleai.com, no credit card. Generate cards from a real lecture you already studied once, score ten with the sheet above, and put the tool on the same scale as the leaderboard. Twenty-five minutes of work tells you whether a tool is in the 80s or in the 50s.

Common questions about the AI study question quality rubric

What four criteria does the rubric score?

Factual correctness, clarity, distractor quality, and question-type coverage. The first three are scored per card on a 0/3/7/10 anchor scale. The fourth is scored per deck (the mix of MCQ vs free-response vs case-style vs image-occlusion across the whole output, not per individual card). Adding the mean per-card total (0-30) to the deck-level score (0-10) gives a 0-40 raw score that is then renormalized to 0-100. Studyly publishes the leaderboard at studyly.io/quality, where 81.3 is the renormalized score on a held-out three-document eval.

Why 0/3/7/10 instead of a 1-5 Likert scale?

Likert collapses the difference between a question that is mildly wrong and a question that teaches you the wrong fact, because the middle of the scale is overloaded. Anchors at 0, 3, 7, and 10 force the grader to pick a category: broken, weak, solid, exemplary. The gap from 3 to 7 is where most disagreement happens between graders, so the rubric prose for each criterion specifies what 3 and 7 look like as concretely as possible. In practice this halves inter-rater drift on the held-out eval compared to a five-point scale, which is consistent with what the published literature on rubric design (Snorkel AI, others) finds.

Why is factual correctness the load-bearing criterion?

Because it is the one a student cannot self-grade. Clarity is visible (re-read the stem; is it ambiguous?). Distractor quality is visible (look at the four options; is one obviously wrong, are the lengths matched?). Question-type coverage is visible (count MCQ vs free-response across the deck). Factual correctness requires knowing the right answer, which is the exact thing the student is studying to learn. The rubric weights it the same as the other criteria, but the only way to verify it honestly is to require the generator to cite the source span the keyed answer was lifted from. No citation, the criterion fails at 0.

Can I run this rubric on my own AI-generated questions?

Yes, and it takes about 25 minutes per tool to do honestly. Pull ten consecutive cards from the output (not cherry-picked; consecutive). Open the source document. For each card, score the three per-card criteria on the 0/3/7/10 scale, citing the specific failure mode (the rubric body above tells you what each anchor looks like). Then score type-coverage once over all ten. Sum the three scores for each card, average those sums across the ten cards, add the type-coverage score, divide by 40, and multiply by 100. That number is comparable across tools as long as you run the same protocol. The 25-minute pace is realistic if you already know the material; double that if you don't.

How is this different from the NBME Item-Writing Guide?

The NBME Item-Writing Guide at nbme.org/educators/item-writing-guide is the authoritative source for the underlying principles: distractors should be plausible and parallel, filler templates are item-writing flaws, grammar mismatches give the answer away. The guide is written for human item-writers preparing licensure exams. It does not specify a per-criterion 0-10 anchor scale because that is an implementation choice for an automated grader. This rubric ports the NBME principles into a four-criterion 0/3/7/10 scoring sheet that you can paste into a spreadsheet and run against any AI tool.

What does the 81.3 vs 57.8 spread actually mean for me?

On the same three held-out source documents, the same four criteria, and the same per-card scoring sheet, Studyly scored 81.3 and Turbolearn scored 57.8. The 23.5-point spread is mostly on factual correctness and distractor quality; clarity and question-type coverage spread less. In a 200-question deck, the practical translation is: a tool scoring 57.8 hands you roughly forty cards that are wrong, ambiguous, or test-takeable on length tells; a tool scoring 81.3 hands you maybe twelve. You still spot-check, but the floor is higher.

Why is question-type coverage scored at the deck level instead of per card?

A single card cannot have a 'type mix.' Type coverage is a property of the output as a whole: does the deck contain MCQ, free-response, case-style, and image-occlusion in something resembling a balanced ratio, or is it 200 single-best-answer MCQs and nothing else? Per-card grading would either make every card score the same (the tool's choice of mix is fixed) or make the criterion meaningless. So the rubric grades it once across the full sample, on the same 0/3/7/10 scale, and adds the result to the mean per-card total.
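Counting the mix is the mechanical part of this criterion. The Python sketch below tallies format shares across a sample. The format labels are assumed names for the four formats listed above, and the only anchor it encodes is the one the rubric table spells out (100% MCQ is a 0); choosing between 3, 7, and 10 is left to the grader.

```python
from collections import Counter

# Assumed labels for the four formats the rubric names.
FORMATS = ("mcq", "free_response", "case_style", "image_occlusion")


def format_mix(card_formats: list[str]) -> dict[str, float]:
    """Share of each format in the sample, including formats that never appear."""
    if not card_formats:
        raise ValueError("need at least one card to compute a mix")
    counts = Counter(card_formats)
    total = len(card_formats)
    return {fmt: counts[fmt] / total for fmt in FORMATS}


def is_mcq_only(card_formats: list[str]) -> bool:
    """True for the 0-anchor case: every card is multiple choice."""
    return bool(card_formats) and all(fmt == "mcq" for fmt in card_formats)


sample = ["mcq"] * 8 + ["free_response"] * 2
print(format_mix(sample))
print(is_mcq_only(sample))  # prints False
```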

Does the rubric handle image-based questions and free-response?

Yes, on the per-card criteria, with one wrinkle. Factual correctness still anchors to a cited span (a slide region for image-occlusion, the source paragraph for free-response). Clarity translates: an image-occlusion card with three overlapping masks scores low on clarity in the same way an ambiguous stem does. Distractor quality only applies to multiple-choice formats; for free-response there is no distractor, so the criterion is replaced by 'rubric-of-the-rubric' (does a model answer exist that the student can grade against). Question-type coverage rewards the deck for containing all four formats.

Where is the live leaderboard?

studyly.io/quality. It shows the four named tools (Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8), the four criteria, and the held-out methodology. The number you see is the renormalized 0-100 score from the same rubric on this page. The leaderboard refreshes when new tools are added to the eval or when the held-out documents change.