Argument · cram-time arithmetic
Rubric beats prompt for cram flashcards. The reason is arithmetic, not pedagogy.
The pedagogy argument has been made. Karpicke and Roediger 2008, retrieval beats rereading, about 80 percent retained one week later vs about 34 percent. That gap is real and it is in every guide on cramming with practice questions. What no guide does is the next step: figure out which kind of generated practice question survives contact with a 1 a.m. cram session. The answer is not the kind a better prompt produces. The answer is the kind a rubric gate would have refused to emit.
Direct answer · verified 2026-05-07
A rubric-gated generator beats a raw ChatGPT prompt for cram flashcards because cramming rewards drilled cards, not authored cards. Prompt-only output produces roughly 30 to 40 cards per 200-card deck that need editing before drilling, which is 15 to 40 minutes of dead time per lecture. A rubric-gated generator (one that runs source-anchoring, length-matching, filler-template, and grammar-parallelism checks before emission) drops those cards at generation time, so the deck arrives drillable. On a held-out three-document eval the gated approach scored 81.3 vs a 67.9 field average. Methodology and per-criterion scores at studyly.io/quality.
The cost equation flips when you are out of time
In a calm study session, a bad practice question is a card you edit. The editing is not wasted: you read the stem, you check the slide, you fix the option list, and you have engaged with the material in the process. The card was a bad card and you turned it into a good card and you learned something. This is the model that every "use AI to make flashcards" guide assumes without saying so. It is also the model that breaks the moment you are cramming.
In a cram session, editing is dead time. You have six hours and thirty lectures and the only thing that closes a knowledge gap is retrieving the answer wrong, looking at the slide, and retrieving it again right. Every minute spent rewriting a generator's mistakes is a minute not spent inside the retrieval loop. The card you spend two minutes fixing did not teach you anything you could not have learned faster from a different card.
That inversion is the whole argument. Once a card has to be edited, it is a worse card than no card at all, because no card costs zero minutes and an edited card costs whatever the editing took. The question stops being "which approach produces better cards on average" and starts being "which approach produces fewer cards I have to edit". A rubric beats a prompt at the second question, even on the days the prompt would have won the first.
The editing tax, in numbers, on one 90-slide deck
A typical 90-slide lecture converts to about 200 cards across MCQ, free-response, and case-style formats. The held-out eval rate for the field average (67.9 across the four named criteria) translates to roughly 15 to 20 percent of those cards needing edits before they are drillable; we use 18 percent here as the rounded midpoint. Eighteen percent of 200 cards is 36 cards, and at 45 seconds per edit (the midpoint of the 30-to-60-second range used later on this page) that is 27 minutes of editing before the first card gets drilled.
27 minutes is one full deck of cramming time. A student with thirty decks left and six hours has not budgeted thirty editing taxes.
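For readers who want the arithmetic in one place, here is a minimal sketch of the editing-tax math. The deck size, flag rate, and per-card edit cost are the assumptions stated above, not new measurements.

```python
# Editing-tax arithmetic, using the assumptions stated on this page.
CARDS_PER_DECK = 200      # cards from one 90-slide lecture
FLAG_RATE = 0.18          # share of prompt-only cards needing edits (rounded midpoint of 15-20%)
EDIT_SECONDS = 45         # per-card edit cost (midpoint of 30-60 s)

flagged_cards = CARDS_PER_DECK * FLAG_RATE        # 36 cards
tax_minutes = flagged_cards * EDIT_SECONDS / 60   # 27 minutes per deck

decks_left = 30
total_tax_hours = decks_left * tax_minutes / 60   # 13.5 hours of editing across 30 decks
print(f"{flagged_cards:.0f} cards to edit, {tax_minutes:.0f} min per deck, "
      f"{total_tax_hours:.1f} h total against a 6-hour budget")
```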
Why a smarter prompt does not close the gap
The natural objection is that the four checks the rubric runs (source-anchoring, length-matching, filler-template ban, grammar parallelism) are just instructions, and instructions can be written into a prompt. Ask ChatGPT to count characters in each option and refuse to emit cards where the longest is more than 1.25 times the shortest. Tell it to verify article agreement before printing the option set. List the forbidden filler templates. Done.
Three of the four gates transfer cleanly. Length-matching is a character count and the model can do it. Filler-template ban is a forbidden list and the model can respect it. Grammar parallelism is a parse and the model is good at parses. Source-anchoring is the gate that does not transfer, and it is the most expensive failure mode.
Source-anchoring asks the generator to cite the slide number or paragraph the correct answer was lifted from, and, crucially, to verify that citation against the upload. The model cannot reliably do the second part inside its own context. It will produce a citation that looks right (slide 23, page 47); sometimes the citation is correct, and sometimes the model confabulated it because slide 23 of your deck is about something else entirely. The card looks fine on its face. You only catch the error if you open slide 23 and notice it does not say what the card claims it says. At cram speed you do not open slide 23. You drill the card, you internalize the wrong fact, and you find out it was wrong on the exam.
A rubric-gated generator runs the source-anchoring check outside the model: a substring or span search against the actual upload, deterministic, no confabulation surface area. Cards that cannot anchor get rejected and the slot is regenerated. This is why a careful prompt still leaves a residual 5 to 10 percent bad-card rate (mostly drift) where a gated generator hits zero on the same dimension. On 200 cards that is 10 to 20 cards of difference; at 45 seconds per edit, that is 8 to 15 minutes per deck you do not get back.
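To illustrate what running the check outside the model can look like, here is a minimal sketch of a deterministic span-anchoring check. The function names, the normalization, and the card and slide shapes are hypothetical; this is not Studyly's implementation, just the kind of substring search the paragraph describes.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def anchors(card_answer: str, cited_slide_text: str, min_span_words: int = 4) -> bool:
    """Return True only if a contiguous span of the answer appears verbatim in the cited slide.

    Deterministic substring search outside the model: no confabulation surface area.
    """
    answer_words = normalize(card_answer).split()
    slide = normalize(cited_slide_text)
    if len(answer_words) < min_span_words:
        return " ".join(answer_words) in slide
    for start in range(len(answer_words) - min_span_words + 1):
        span = " ".join(answer_words[start:start + min_span_words])
        if span in slide:
            return True
    return False

# Gate: if the cited slide cannot anchor the answer, reject the card and regenerate the slot.
card = {"answer": "Beta blockers reduce myocardial oxygen demand", "cited_slide": 23}
slides = {23: "Slide 23: statin dosing and monitoring guidelines"}  # citation points at the wrong slide
if not anchors(card["answer"], slides[card["cited_slide"]]):
    print("reject: answer span not found on cited slide; regenerate this card")
```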
Side by side, on one 200-card deck
The same lecture, the same student, the same six-hour budget. The only thing that changes is which generator emitted the deck.
| Feature | Prompt-only (ChatGPT et al.) | Rubric-gated |
|---|---|---|
| Where the quality work happens | After the cards exist. Student reads each, decides keep or edit. | Before the cards exist. Generator emits only cards that pass the four gates. |
| Cards a 200-card deck contains that need editing | Roughly 30 to 40 (15 to 20 percent rate measured on held-out three-document eval) | Zero. Bad-shape cards are regenerated against a different stem until gates pass or the slot is dropped. |
| Editing tax before you drill the first card | 15 to 40 minutes per deck at 30 to 60 seconds per card | Zero. The deck arrives drillable. |
| Pretrained drift (card contradicts your professor's slide) | Self-report citation in the stem; sometimes hallucinated; student must verify | Span search against the actual upload at generation time; no span found, no card emitted |
| Length-tell (long correct option vs three short distractors) | Reviewer notices and rewrites or deletes | Length comparison gate runs before emission; options outside ±25 percent trigger a rewrite |
| Filler templates ('all of the above', 'none of the above', 'it depends') | Reviewer scans and removes card by card | Regex match against forbidden list disqualifies the card before it leaves the generator |
| Held-out three-document eval score | Field average 67.9 (Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8) | 81.3 (Studyly), measured on the same three documents and four named criteria |
On a held-out three-document eval covering factual correctness, clarity, distractor quality, and question-type coverage. Distractor quality shows the widest tool-to-tool spread and is the criterion most directly affected by an in-flight gate.
Held-out eval, May 2026 · methodology and per-criterion scores at studyly.io/quality
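As a concrete reading of the gate-side column, here is a minimal sketch of length-matching and filler-template gates running at emission time. The 1.25x length ratio and the forbidden templates come from this page; the function names, card shape, and regenerate loop are assumptions, not Studyly's code.

```python
import re

# Forbidden filler templates, matched case-insensitively (list from the table above).
FILLER_PATTERNS = re.compile(
    r"\b(all of the above|none of the above|both a and b|it depends)\b",
    re.IGNORECASE,
)

def passes_length_gate(options: list[str], ratio: float = 1.25) -> bool:
    """Longest option may not exceed the shortest by more than the 1.25x ratio cited earlier."""
    lengths = [len(o) for o in options]
    return max(lengths) <= ratio * min(lengths)

def passes_filler_gate(options: list[str]) -> bool:
    """Reject the card if any option matches a forbidden filler template."""
    return not any(FILLER_PATTERNS.search(o) for o in options)

def emit_or_drop(generate_card, max_attempts: int = 3):
    """Run the gates before emission: regenerate the slot on failure, drop it if it never passes."""
    for _ in range(max_attempts):
        card = generate_card()  # hypothetical generator callable returning {"stem", "options", ...}
        if passes_length_gate(card["options"]) and passes_filler_gate(card["options"]):
            return card          # only gated survivors reach the deck
    return None                  # slot dropped rather than shipping a card that needs editing
```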
The honest counterargument
The case where the prompt approach actually wins: a single short PDF, a calm afternoon, and a student who is reading every card with the source open. In that scenario the editing time is not dead time. It is study time wearing a different hat. The bad cards become teachable moments and the student finishes the session with a clean deck and a deeper understanding of the material than they would have gotten by drilling a pre-curated deck. A careful prompt, in this scenario, is not worse than a rubric. It might be better.
Two conditions have to hold for that case to apply. The student has to have time to edit. The student has to be fluent enough in the material to catch pretrained-drift errors against the source. Neither condition holds at 1 a.m. before an exam. The student is not fluent (that is why they are cramming) and the student does not have time (that is why they are cramming). The failure mode of the prompt approach in cram conditions is invisible: the student rubber-stamps cards that look right, internalizes the bad ones, and feels productive while doing it.
So the argument is not that prompts are bad. It is that prompts require a reviewer, and the reviewer the prompt assumes is exactly the reviewer the cramming student cannot be.
How to spot whether your generator is actually rubric-gated
Some tools market a rubric and run it post-hoc as a UI checklist in the review pane, which just shifts the editing tax from the generator to the student. Here are five concrete things to look for in a 20-card spot-check; if any of them fails, the rubric is not gating emission. A scripted version of the mechanical checks follows the list.
Spot-check, 20 cards
- Each card cites the slide number, PDF page, or transcript timestamp the answer was lifted from. If a card does not cite a span, the source-anchoring gate did not run.
- Distractor option lengths are within roughly 25 percent of the correct answer length on every card you spot-check. Length-tell is a generation-time problem, not a review-time problem.
- Zero cards in a 20-card sample contain 'all of the above', 'none of the above', 'both A and B', or 'it depends' as an option.
- Every option, when concatenated with the stem, parses cleanly. Singular-plural agreement, article agreement, tense agreement all hold.
- Question-type mix is balanced across the deck (MCQ, free-response, case-style, image-occlusion). 200 single-best-answer MCQs and nothing else means the type-coverage gate did not run.
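The scripted version of the mechanical checks might look like this. The deck file format and field names are assumptions, and the grammar-parallelism and type-coverage checks in the list above still need a human or a parser.

```python
import json
import random

FILLERS = ("all of the above", "none of the above", "both a and b", "it depends")

def spot_check(deck_path: str, sample_size: int = 20) -> dict:
    """Audit a random sample of cards for the three mechanically checkable signs that gating ran."""
    with open(deck_path) as f:
        cards = json.load(f)  # assumed: list of {"stem", "options", "answer", "citation"} dicts
    sample = random.sample(cards, min(sample_size, len(cards)))

    failures = {"missing_citation": 0, "length_tell": 0, "filler_option": 0}
    for card in sample:
        if not card.get("citation"):  # no slide number, PDF page, or transcript timestamp cited
            failures["missing_citation"] += 1
        correct_len = len(card["answer"])
        if any(abs(len(o) - correct_len) > 0.25 * correct_len for o in card["options"]):
            failures["length_tell"] += 1  # some option strays more than 25% from the answer's length
        if any(f in o.lower() for o in card["options"] for f in FILLERS):
            failures["filler_option"] += 1
    return failures  # any nonzero count means the rubric was applied after emission, not before
```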
What this means for tonight
If the exam is tomorrow, do not optimize the generator. Pick a rubric-gated tool, drop your remaining decks, and start drilling. The 27 minutes per deck you save by not editing is the budget you spend on retrieval, which is the only intervention with measured retention gains in the cram window. If you have already started a ChatGPT session, finish it on this deck (sunk cost is sunk) but run the next deck through a gated tool and compare the editing time. The arithmetic shows up immediately.
Studyly is the gated tool we built. The deck-by-deck conversion runs in roughly 60 seconds per 90-slide upload, the four-gate rubric runs at emission, and the held-out eval scores are at studyly.io/quality. Free tier on app.jungleai.com, no credit card. The argument above applies to any honest rubric-gated generator; we only know ours.
Frequently asked questions
What is the actual claim here? Why does rubric beat prompt for cram flashcards?
The claim is narrower than 'rubrics produce better cards'. The claim is that cramming changes which kind of bad card matters. In a calm study session, a bad card is a card you edit, and editing time is fine because you are learning while you edit. In a cram session, editing time is dead time: you are not retrieving, not closing knowledge gaps, not progressing through the deck. A prompt-only output that contains 30 to 40 bad cards in a 200-card deck (the rate measured against held-out evals at the field's average) is a 15 to 40 minute editing tax before drilling starts. A rubric-gated output drops the bad cards before emission, so the deck arrives drillable.
What counts as a 'bad' card in this context?
Four shapes specifically. (1) Stem references a fact the upload does not contain (pretrained drift). (2) Three short distractors next to one long correct answer (length-tell, you guess without reading the stem). (3) An option that says 'all of the above', 'none of the above', 'both A and B', or 'it depends' (filler-template, NBME flags this as an item-writing flaw). (4) The stem and the correct option do not parse together (singular-plural mismatch, article disagreement). Each one is solvable by a quality rubric run before emission. A better prompt alone is less reliable: the first is out of reach entirely, because the model cannot verify your upload against its own citation, and the other three become suggestions the model follows most of the time rather than gates that bad cards cannot get past.
Can I just write a smarter ChatGPT prompt that includes all four checks?
You can, and on a single short PDF on a calm afternoon it gets you most of the way. The failure shows up at scale and under time pressure. The grammar and length checks transfer cleanly to a prompt: instruct the model to count characters and verify article agreement before emitting. The filler-template ban transfers cleanly: list the forbidden options. Source-anchoring is the gate that does not transfer: ChatGPT cannot reliably verify that a cited slide span exists in your upload, so the source-anchoring check degrades to a self-report, and the model will sometimes confabulate a citation. On a 200-card deck this means a residual ~5 to 10 percent bad-card rate even with a careful prompt, which is 10 to 20 cards of editing vs zero for a rubric-gated generator. At 1 a.m. that is the difference between drilling 4 decks and drilling 5.
Where do you get the editing-time-per-card numbers?
30 to 60 seconds per card is the post-hoc-review cost we use internally for the same four-criterion rubric, applied as a checklist rather than as in-flight gates. It is in line with the published cost of human item review on NBME-style materials (a careful reviewer with the source open in a second window finishes one MCQ per 45 to 90 seconds, depending on item complexity). The math on this page uses 45 seconds as the midpoint. A 30-card editing tax at 45 seconds per card is 22.5 minutes; a 40-card tax is 30 minutes. Both numbers round to 'a deck of cramming time you do not get back'.
Doesn't ChatGPT also produce some good cards? Why throw out the deck?
Nobody is throwing out the deck. The argument is about the tax to make the deck drillable, not about whether to use the cards that survive. The problem is that the bad cards and the good cards are interleaved, and there is no reliable way to tell them apart without reading every card with the source open. Skipping the editing pass and drilling the raw output means you internalize wrong facts (pretrained-drift cards that contradict your professor's slide), and you waste retrieval attempts on length-tell cards that you got right by reading the option list, not the stem. The interleaving is the cost. A rubric-gated generator removes the interleaving by emitting only the survivors.
How does a rubric-gated generator actually score on a held-out evaluation?
Studyly scored 81.3 on a held-out three-document eval covering factual correctness, clarity, distractor quality, and question-type coverage. Unattle scored 78.0, Gauntlet 68.0, Turbolearn 57.8. Distractor quality is the criterion with the widest tool-to-tool spread, and it is the criterion most directly affected by an in-flight gate. The full leaderboard and methodology are at studyly.io/quality. The eval is not what makes the argument on this page true; it is the upper bound on how much editing a generator's output saves you. The 23.5-point spread between the highest and lowest scores on the same documents corresponds to roughly 23 percent fewer flagged cards on average from the gated generator, which is the gap the editing-tax math rests on.
Is this just a Studyly pitch dressed up as an argument?
Partly. The argument is true regardless of which rubric-gated generator you use. Any tool that runs the four in-flight gates honestly will save you the editing tax compared to a prompt-only output. Studyly is the version we built and benchmarked, so the numbers are ours. If you find another tool that scores comparably on the same held-out eval and runs the same gates, the argument applies to it equally. The argument breaks for tools that claim a rubric but apply it post-hoc as a UI checklist, because that just shifts the editing tax onto the student.
What about retention? Does the rubric change retention or just editing time?
Both, but the channels are different. Editing time is the explicit channel measured here. Retention is the implicit channel: a deck without length-tell cards forces you to read the stem, which means you are practicing retrieval rather than recognition. The Karpicke and Roediger 2008 retrieval-vs-rereading retention gap (about 80 percent vs 34 percent one week later) is the upper bound on what retrieval practice can give you over passive review; bad cards that let you guess without retrieving leak the gap back. The cleaner the cards, the closer your effective retention sits to the retrieval ceiling.
What if my source is YouTube lecture videos or scanned PDFs, not slide decks?
The four-gate argument applies to any intake format. Source-anchoring degrades a little for scanned PDFs (OCR errors can prevent a clean span match) and for YouTube transcripts (timestamps replace slide numbers as the citation unit). The other three gates do not depend on the upload format. The editing-tax arithmetic is the same. A prompt-only output on a YouTube transcript still produces ~15 to 20 percent bad cards; a rubric-gated output still removes them.
I am cramming tonight. Where do I start?
Drop your remaining lecture decks and PDFs into a rubric-gated generator (Studyly is the one we built; the freemium tier on app.jungleai.com has no credit card gate). Skip the comparison shopping until the exam is over. Drill MCQs end-to-end on the first deck to get a baseline of what you do not know. Switch to free-response or case-style on the cards you missed, which forces cold recall instead of recognition. When you miss a card, click the source slide citation, read just that slide, then close it. Move to the next deck. The retrieval-with-feedback loop is the thing that produces the retention gap; everything else is overhead you should be removing.
Related reading on this site
- Anki rubric for question generation: in-flight checks beat post-hoc review — the deeper architectural argument: where the rubric should run.
- Cramming with retrieval practice questions: yes it beats rereading, here is the catch — the upstream argument that this page builds on (Karpicke and Roediger 2008).
- Anki card distractor quality: the post-hoc grading rubric — what to do if you have already generated the deck and need to grade it.