Reference · what AI Anki card quality actually measures

The spread is 57.8 to 81.3. Most reviews skip the number entirely.

Reddit threads on AI Anki tools talk time savings and workflow. The one thing they almost never name is a measurement of card quality on a held-out source. The result is that you read fifteen anecdotes and still cannot tell which tool actually writes cards close to what you would write by hand.

This page does the opposite: the published held-out eval, the four criteria it scores on, why scores diverge across tools, and a five-point checklist you can run on one of your own lectures in about ten minutes to verify any claim, including this one.

Matthew Diakonov · 9 min read

Direct answer · verified 2026-05-06

Quality varies enormously. The field average is 67.9. The top score is 81.3.

On a held-out three-document eval scored on factual correctness, clarity, distractor quality, and question-type coverage: Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8. The full methodology, the four criteria in detail, and the leaderboard live at studyly.io/quality.

If you want to verify the spread on your own subject before paying anything, the five-point checklist further down works on a single lecture in about ten minutes.

Leaderboard · 3 documents · 4 criteria · held-out

#    Tool         Score
01   Studyly      81.3
02   Unattle      78.0
03   Gauntlet     68.0
04   Turbolearn   57.8

Field average across the other three tools (Unattle, Gauntlet, Turbolearn): 67.9. The 23.5-point gap between first and last is large enough that the tool you pick decides whether your cards are useful, not whether you can shave an hour.

The four criteria, in detail

Most generic articles on AI evaluation list ten dimensions and call it a framework. For flashcards, four decide whether a card is studyable.

01 · Factual correctness

The correct answer is in the source

Every card's correct answer must be grounded in the actual upload. Not the textbook's version of the topic, not the model's pretrained knowledge, not a Wikipedia summary. If your professor's slide says cardiac output is 5 L/min and the card answers 4 L/min because that is what First Aid says, the card has failed factual correctness on the test that matters, which is your class exam.
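
As a rough sketch of what "grounded in the source" means in practice, the crudest possible check is whether the answer string appears anywhere in the uploaded slide text. The helper below is hypothetical (no real grader is this naive), but it catches the cardiac-output case above:

def looks_grounded(answer: str, slide_text: str) -> bool:
    # crude check: does the answer string appear in the uploaded slide text?
    return answer.lower() in slide_text.lower()

slide = "Cardiac output at rest is approximately 5 L/min."
print(looks_grounded("5 L/min", slide))   # True: grounded in the slide
print(looks_grounded("4 L/min", slide))   # False: pretrained or textbook drift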

02 · Clarity

The stem reads unambiguously on the first pass

A well-prepared student should be able to identify which option matches the stem without re-reading the question three times. Double-barreled stems ("Which is true and best describes ..."), ambiguous pronouns ("it refers to ..." with two candidate antecedents), and tortured phrasing all kill clarity. Generic chat-model output fails this often because it optimizes for sounding academic, not being scannable under exam pressure.

03 · Distractor quality

The wrong answers test understanding, not pattern-matching

Distractors must be plausible and length-matched to the correct answer (within roughly 25%). A stem with three short wrong answers and one long correct one is solvable by reading nothing but the option panel. "All of the above" and "none of the above" are distractor cop-outs that turn the question into a guessing game. The reason Turbolearn scored 57.8 is overwhelmingly distractor quality.
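
If you want the length-matching rule as something you can run yourself, here is a minimal sketch (an illustration, not any tool's actual rubric code; the helper name and the 25% default are assumptions for the example):

# Hypothetical distractor check: flags cop-out options and options whose
# length differs from the correct answer by more than ~25%.
COP_OUTS = {"all of the above", "none of the above"}

def distractor_flags(correct: str, distractors: list[str], tolerance: float = 0.25) -> list[str]:
    flags = []
    target = len(correct)
    for d in distractors:
        if d.strip().lower() in COP_OUTS:
            flags.append(f"cop-out option: {d!r}")
        elif abs(len(d) - target) > tolerance * target:
            flags.append(f"length mismatch ({len(d)} vs {target} chars): {d!r}")
    return flags

print(distractor_flags(
    "Decreases heart rate via the vagus nerve",
    ["Increases it", "All of the above", "Slows conduction through the AV node"],
))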

04 · Question-type coverage

One source produces multiple question shapes

A good deck mixes recall (what is the medial epicondyle), application (a 28-year-old runner with a resting HR of 42, which structure ...), comparison (which beta-blocker is most cardioselective among ...), and case-style stems. Most tools generate one shape per upload, so the same fact comes back as the same MCQ on Monday, Wednesday, and Friday. By Saturday you are pattern-matching the stem text, not retrieving the fact. Type coverage is the criterion that separates a generator from an actual study tool.

Where the scores actually come from

The leaderboard is one column of a richer table. The breakdown below is the one I would walk through on a whiteboard with a roommate deciding which tool to pay for, dimension by dimension, with the honest version of why each tool ranks where it does.

Held-out three-document eval, May 2026.

Criterion · how the field handles it (Unattle 78.0 · Gauntlet 68.0 · Turbolearn 57.8) · how Studyly (81.3) handles it

Factual correctness · grounded in the source PDF, not pretrained model facts
  Field: Turbolearn 57.8: distractors regularly invent terms not on the slide. Gauntlet 68.0: factual errors on borderline material. Unattle 78.0: largely accurate, occasional drift.
  Studyly: 81.3. Distractors pulled from neighbors of the correct answer inside the same lecture context, not the open web.

Clarity · the stem reads unambiguously the first time
  Field: generic chat-model output reads like a homework assignment dictated by ChatGPT. Re-read three times to figure out what is being asked. Length variance gives away the answer before the student finishes the question.
  Studyly: stems pass a clarity rubric that flags ambiguous referents, double-barreled questions, and answer-leaking length cues.

Distractor quality · wrong answers within ~25% of correct-answer length
  Field: Turbolearn ships throwaway distractors regularly. 'All of the above', 'none of the above', and one obviously wrong filler option per stem are common across the field. A test-taking heuristic beats the question.
  Studyly: distractors lifted from same-lecture neighbors, length-matched. The question rewards understanding the material, not pattern-matching the answer slot.

Question-type coverage · MCQ, free-response, case-style, image-occlusion
  Field: most tools generate one format from one source. ChatGPT will give you 200 MCQs that all look identical. Free-response is missing. Case-style is missing. Image-occlusion is universally absent unless the tool specifically built it.
  Studyly: four formats from a single upload. The same fact can surface as MCQ on Monday, free-response on Wednesday, case-style on Friday. Diagrams produce image-occlusion cards.

Methodology · is the eval public and held-out?
  Field: most competitors publish marketing claims (saves you 10 hours, generates 100 cards in a minute). No held-out eval, no rubric, no sample documents.
  Studyly: public methodology page at studyly.io/quality. Three held-out source documents, four named criteria, leaderboard with all four scores.

Why scores diverge: the source matters more than the model

The thing every Reddit thread misses: a flashcard tool is not evaluated on how clever its model is; it is evaluated on whether the generator stays inside the source you uploaded. The first failure mode on every low-scoring tool is the same: the model wanders off the slide deck and pulls in textbook facts your professor did not teach. That is fine if you are studying for the boards. It is wrong if your test is the lecture your professor wrote.

The structural fix is to pull distractors from the same lecture, not from a generic web question bank. When the wrong answers come from slides 14, 18, and 22 of the same deck as the correct answer on slide 16, the question rewards understanding what the lecture taught. When they come from a generic web question bank, the question rewards having read the textbook chapter the question bank was scraped from, which is a different test.
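
A minimal sketch of that design choice, closer to a whiteboard explanation than to Studyly's actual pipeline (the helper, the six-slide window, and the sample slide contents are all invented for illustration):

import random

def grounded_distractors(slides: dict[int, list[str]], answer_slide: int,
                         correct: str, k: int = 3) -> list[str]:
    # candidate pool = terms from nearby slides in the same lecture,
    # never from a generic web question bank
    pool = [
        term
        for idx, terms in slides.items()
        if idx != answer_slide and abs(idx - answer_slide) <= 6
        for term in terms
        if term != correct
    ]
    return random.sample(pool, min(k, len(pool)))

slides = {
    14: ["SA node"],
    16: ["AV node"],
    18: ["Bundle of His"],
    22: ["Purkinje fibers"],
}
print(grounded_distractors(slides, answer_slide=16, correct="AV node"))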

That single design choice (source-grounded distractors versus web-scraped distractors) is most of the gap between 81.3 and 57.8. The rest is rubric pressure on clarity and length-matching, plus coverage of formats beyond MCQ.

Same lecture topic, generic output versus grounded output

// ChatGPT card on the same lecture topic
// (atrioventricular node conduction)

Q: Where is the AV node located?

A) In the right atrium near the coronary sinus
B) Somewhere in the heart
C) The atrium
D) None of the above

Answer: A

// Distractors are throwaway. "Somewhere in the
// heart" gives the answer away by being obviously
// useless. Stem ignores the slide's actual emphasis
// on conduction delay timing.

The card above is what the typical chat-model output produces from a lecture upload: distractors that give the answer away by being obviously useless, and a stem that ignores the lecture's actual emphasis on conduction delay timing. The source-grounded version of the same card anchors on slide 22 of the original deck, with distractors lifted from anatomically adjacent structures in the same conduction-pathway diagram.

A ten-minute checklist you can run on one lecture

Do not take any benchmark on faith, including this one. The differences between tools show up cleanly on a single 20-card sample, which is small enough to grade by hand in about ten minutes. Here is the rubric, applied:

The five-point quality check

  • Pick ONE lecture you already know well. A 60-90 slide deck where you can spot a wrong answer at a glance. This is your held-out source.
  • Generate cards from that deck with two or three tools (Studyly, Turbolearn, ChatGPT, whatever). Same input, no edits before review.
  • On 20 cards per tool, hand-grade each on the four criteria: correct answer grounded in slide (yes / no), stem clear on first read (yes / no), distractors plausible (yes / no), question type varied across the 20 (count formats).
  • Watch for 'pretrained drift' specifically. If a card asks about a fact your slide DOES NOT cover but the textbook does, the tool is leaning on its model, not your source. Mark those; they fail factual correctness.
  • Ten minutes after starting, total the four columns per tool. The differences between tools show up on a single 20-card sample. You do not need a full study session.

In a sample ten-minute rubric run of this kind (with Tool A and Tool B anonymized to keep the focus on the methodology, not the names), the composite numbers from a 20-card hand grade landed within a few points of the full eval. The exercise is identical regardless of which tools you pick, and you do not need to grade hundreds of cards to tell which tool is closer to usable.
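
If you log the 20 grades in a small table, totaling them is a few lines. The sketch below is one way to do it; the column names are invented for the example:

cards = [
    # one tuple per graded card:
    # (grounded_in_slide, stem_clear, distractors_plausible, question_type)
    (True,  True,  False, "mcq"),
    (True,  False, True,  "free-response"),
    # ... 20 rows per tool in a real run
]

grounded  = sum(c[0] for c in cards)
clear     = sum(c[1] for c in cards)
plausible = sum(c[2] for c in cards)
formats   = len({c[3] for c in cards})

print(f"grounded {grounded}/{len(cards)}, clear {clear}/{len(cards)}, "
      f"plausible {plausible}/{len(cards)}, question formats seen: {formats}")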

What the eval does not capture

A four-criterion held-out eval is a useful first cut, not the whole story. Two things it does not score, both of which matter for long-running study habits:

Surface-form variation on revisit. A card with a 95 quality score fails as an active-recall tool if the stem is identical on review #1 and review #5. Spaced repetition decides when to show the card; it does nothing about what the card looks like when it shows. Without rephrasing, the Friday review becomes a recognition test on Monday's wording. Studyly's auto-rephrase happens inside the app on revisit; in Anki, the canonical card set runs through Anki's scheduler, which is fine, just a different study path.

Image-occlusion handling. Anatomy, histology, and biochem-pathway subjects rely on labeled diagrams. The four-criterion text eval underweights this: a tool that is excellent on prose questions and drops the figures entirely will score well on the leaderboard and be unusable for a gross-anatomy practical. If your subject is diagram-heavy, weight image-occlusion handling separately when you run your own checklist.

Run the rubric on tomorrow's lecture

One PDF in. Twenty cards graded by hand. Now you know.

Free tier on app.jungleai.com, no credit card. Upload a real lecture, generate the deck, hand-grade twenty cards on the four criteria. Ten minutes. The numbers in this page are reproducible on your own subject if you want to verify them.

Common questions about AI Anki card quality

Is there a published quality benchmark for AI Anki card generators?

Yes, one. Studyly publishes a four-criterion held-out three-document eval at studyly.io/quality with a leaderboard that includes Unattle, Gauntlet, and Turbolearn. The eval is run by Jungle, the company behind Studyly. Studyly scored 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8. Field average 67.9. Other tools (medankigen, Ankify, anki-decks, ChatGPT-as-flashcard-maker) have not published a comparable benchmark as of May 2026.

What are the four criteria the eval scores on?

Factual correctness (the correct answer is grounded in the source document, not the model's pretrained knowledge), clarity (the stem is unambiguous), distractor quality (wrong answers are plausible and similar in length, no 'all of the above'), and question-type coverage (the deck mixes recall, application, comparison, and case-style questions). Each card is graded on the four axes, the deck score is the average, the tool score is the average across three held-out source documents.
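
In code form, that aggregation is just nested averaging. The sketch below assumes a plain unweighted mean at every level, which is the simplest reading of the description; the methodology page is the authority on the exact weighting:

def card_score(criteria):
    # average of the four axis scores for one card
    return sum(criteria) / len(criteria)

def deck_score(cards):
    # average of the card scores for one source document
    return sum(card_score(c) for c in cards) / len(cards)

def tool_score(decks):
    # average of the deck scores across the three held-out documents
    return sum(deck_score(d) for d in decks) / len(decks)

example_card = [90, 80, 70, 85]   # correctness, clarity, distractors, coverage
print(card_score(example_card))    # 81.25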

Why does ChatGPT score so low on flashcard quality even when the prose looks fine?

Two reasons. First, distractor quality: ChatGPT generates wrong answers that are obviously wrong (filler text, unrelated terms, drastically different lengths), so an MCQ becomes a recognition test on length and plausibility rather than understanding. Second, factual grounding: the model leans on pretrained knowledge instead of the actual lecture, so a slide that contradicts the textbook gets cards that match the textbook, which is the wrong test for a class exam your professor wrote.

How can I evaluate quality on my own lecture before paying for anything?

Run the five-point checklist from earlier on this page on a single deck. It takes about ten minutes. Generate cards from one of your professor's lectures with two or three different tools, hand-check 20 cards from each output against the original slides, and tally factual errors, ambiguous stems, throwaway distractors, and missing question types. The differences between tools show up fast on a single lecture; you do not need a full study session to tell which one is closer to usable.

Do these scores translate to your own subject area?

Mostly. The held-out eval uses three documents (the methodology page lists them generically as 'three documents'), so the scores reflect average behavior across those sources. If your lectures are heavy on labeled diagrams (gross anatomy, histology, biochem pathways), image-occlusion handling matters more than the four-criterion eval captures, and you should weight your own checklist accordingly. For prose-heavy subjects (pharmacology, microbiology, immunology), the eval scores transfer cleanly.

Does Studyly export to a real .apkg or do I have to study inside the app?

Both work. The .apkg export carries multiple-choice, cloze, case-style, and image-occlusion cards with masks intact, on Studyly-namespaced note types so importing does not collide with AnKing or Zanki fields. The auto-rephrasing on revisit (which prevents pattern-matching) only happens inside Studyly; if you study from the .apkg in Anki, you get the canonical card set and Anki's scheduler takes over.

What does 'held-out' mean and why does it matter?

Held-out means the documents used to score were not used to develop or tune the generator. A tool that posts a quality number on the same lectures it was trained on is reporting how well it memorized the training set, not how well it generalizes. The three-document eval is held-out: the source PDFs were chosen after the system was frozen. This is the same logic ML benchmarks use to keep numbers honest.

Where do the cards come from, the lecture or the open web?

Studyly generates against the source you upload. The professor's slide deck, your textbook chapter, your YouTube lecture transcript. Distractors on multiple-choice are pulled from neighbors of the correct answer in the same source, not from a generic question bank. This is the structural reason factual-correctness scores stay high: there is no public-web fact pulling in a wrong answer your professor never said.

Is there a free way to try this?

Yes. Free tier on app.jungleai.com, no credit card. Generate from a real lecture deck, export an .apkg, study it. The check that actually matters is the one you run yourself: do the cards feel like cards you would have made by hand if you had three hours? If yes, the tool is doing its job.