Alternative · two layers of a flashcard tool

Dashboards measure how you review. They cannot measure what you are reviewing.

Anki ships a mature-card counter and a seven-day retention forecast. Brainscape ships a confidence-based mastery percentage. Quizlet ships a streak. None of those signals can see whether the card you just answered correctly is testing a fact your professor did not teach, or whether the distractor you ruled out was a filler template the model emitted because the string is statistically common on real-world quizzes. The dashboard is reading the review log. The card layer is one floor down, and that is where exam outcomes are actually decided.

This page argues for spending the engineering budget on the card layer first, the dashboard second. The argument has a concrete anchor: a held-out four-criterion eval where the spread across consumer AI generators is 23.5 points (Studyly 81.3, Turbolearn 57.8) on the same three documents under the same rubric.

Direct answer · verified 2026-05-20

Question quality or flashcard dashboards, which actually matters?

Question quality is upstream of every other consideration. Dashboards measure how diligently you reviewed whatever cards happened to be in the deck. They cannot tell you whether the cards test the right facts. A 91 percent retention rate on cards with textbook drift, length-tell distractors, or compound stems looks great in the dashboard and still misses the exam question the cards were supposed to prepare you for. The card layer is where exam outcomes live; the dashboard is where review behavior lives. You need the card layer to be right before the dashboard can read a clean signal.

Concretely: no mainstream flashcard tool publishes a card-quality leaderboard. Anki, Brainscape, Quizlet, Mochi, RemNote, and SuperMemo all publish review metrics only. Studyly publishes a held-out four-criterion eval at studyly.io/quality (Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8). The rubric matches the NBME Item-Writing Guide, applied to AI output as if it were a human-written item bank.

What a flashcard dashboard actually measures

Pick any major flashcard app and the dashboard signals are the same family. Anki: mature-card count, young-card count, learning steps, retention rate over a chosen window, ease factor distribution, review forecast, heatmap of reviews per day. Brainscape: confidence-based mastery percentage per deck, aggregate mastery, time-to-mastery. Quizlet: streaks, study set progress, mastered/learning/not-started counts. RemNote: retention chart, scheduling backlog. SuperMemo: a more elaborate version of the Anki suite.

All of these are functions of the review LOG, the timestamped history of which cards you answered correctly and how long it took you. The dashboard layer is downstream of the cards. It reads whatever you put in. The implicit assumption is that the cards are fixed, the only variable is your behavior, and the dashboard's job is to give you visibility into your behavior so you can adjust it.

The assumption holds for community-curated decks. AnKing has been edited for fifteen years; the card-quality work has been done by thousands of contributors; what is left for the dashboard to measure really is just behavior. The assumption breaks the moment the cards in the deck are self-generated by an AI tool that does not run a quality gate. Then the card layer is the dominant source of variance and the dashboard is reading mostly noise.

What a question-quality eval actually measures

The eval at studyly.io/quality grades the card itself, before any review behavior happens. The rubric is four criteria, each scored on a held-out set of three source documents that no generator in the comparison was trained or prompted on.

Factual correctness asks whether the answer key matches the source. A card whose right answer contradicts the slide it was generated from fails this criterion regardless of how plausible the question reads. Clarity asks whether the stem is unambiguous: no double negatives and no compound stems hiding two facts. Distractor quality asks whether the wrong options are plausible and parallel, free of filler templates, and drawn from a real distractor pool rather than the model's pretrained free-association. Question-type coverage asks whether the output mixes recall, comprehension, and application items in proportions appropriate for the source rather than collapsing to one shape.
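To make the four criteria concrete, here is a minimal sketch of how a score sheet for one source document could be represented and averaged. The class name, the 0 to 100 scale, and the unweighted mean are all assumptions for illustration; the published methodology at studyly.io/quality may aggregate differently.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScore:
    """One generator's scores on one source document (0-100 scale is an assumption)."""
    factual_correctness: float
    clarity: float
    distractor_quality: float
    question_type_coverage: float

    def overall(self) -> float:
        # Assumption: unweighted mean of the four criteria.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


def eval_score(per_document: list[RubricScore]) -> float:
    """Average the per-document overall scores across the held-out set."""
    if not per_document:
        raise ValueError("need at least one scored document")
    return sum(s.overall() for s in per_document) / len(per_document)


# Made-up numbers, not results from the eval.
example = [
    RubricScore(100, 80, 60, 80),
    RubricScore(100, 100, 40, 80),
]
print(eval_score(example))
```

Keeping the four criteria as separate fields means a weak distractor score stays visible even when the overall number looks acceptable.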

The rubric is the one the NBME Item-Writing Guide describes for human item-writers. Applied to AI output, it surfaces a 23.5-point spread between the best and worst consumer generator. The dashboard layer cannot generate that spread because every tool's dashboard shows roughly the same metrics; the divergence lives in the card layer.
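The 23.5-point figure is simple arithmetic on the four published leaderboard scores. A short sketch, using only the numbers quoted on this page (the variable names are illustrative):

```python
# Scores as published at studyly.io/quality and quoted on this page.
scores = {
    "Studyly": 81.3,
    "Unattle": 78.0,
    "Gauntlet": 68.0,
    "Turbolearn": 57.8,
}

best = max(scores, key=scores.get)
worst = min(scores, key=scores.get)

# Round to one decimal place to avoid floating-point noise in the difference.
spread = round(scores[best] - scores[worst], 1)
field_mean = round(sum(scores.values()) / len(scores), 1)

print(f"best={best} worst={worst} spread={spread} field_mean={field_mean}")
```

The field mean is included because it gives a reference point for judging whether a generator scores above or below the rest of the field.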

The two layers, side by side

A flashcard tool is two layers stacked. The dashboard layer reads the review log. The card layer is whatever produced the cards in the first place. The two layers can be developed independently, which is why most flashcard apps have invested almost everything in the dashboard layer and almost nothing in the card layer.

The dashboard layer (what every flashcard app ships) vs the card-quality layer (what almost no flashcard app publishes).

Feature	Flashcard dashboard layer	Card-quality eval layer
What the layer measures	Your review behavior: retention rate, mature cards, due cards, streak, ease history.	The card itself: factual correctness, clarity, distractor quality, question-type coverage.
What the layer cannot detect	Whether the card tests the right fact, whether the distractors are real, whether the stem is unambiguous, whether the answer matches the source.	Whether you actually opened the app today, whether your review backlog is manageable, whether your retention is dropping.
Reads a clean signal when	The deck is community-curated and the card-quality work has already been done (AnKing, Zanki, the major board prep decks).	Always, because the eval runs on a held-out source document with a fixed rubric independent of who imports the cards.
Reads mostly noise when	The deck is self-generated by an AI tool without a quality gate; the 23.5-point spread in card quality flows through as noise in retention metrics.	Never in this sense; the rubric is the same regardless of source. The eval can read noise on a too-small source document, which is why the held-out set is three documents.
Published per-tool leaderboard exists?	No. Anki, Brainscape, Quizlet, Mochi, RemNote, SuperMemo publish review metrics only.	Yes. studyly.io/quality publishes Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8.
Where the engineering budget goes	UI for the review log, scheduler tuning, heatmaps, streaks, leaderboards, mobile parity.	The pre-output rubric gate that rejects and regenerates candidate cards before they ship into your deck.
What you actually need on exam day	The dashboard cannot help on exam day; it only reports what happened up to then. The cards you reviewed are what get tested.	Cards that test the same facts the exam tests, in the same shape the exam asks them. The rubric is upstream of every other consideration.
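The "rejects and regenerates" gate in the table can be sketched as a small loop. This is a hypothetical illustration, not Studyly's implementation: the card shape, the function names, the attempt limit, and the stub rubric check are all assumptions.

```python
from typing import Callable, List

Card = dict  # hypothetical shape: {"stem": str, "options": list, "answer": str}


def gated_cards(
    generate: Callable[[], Card],
    passes_rubric: Callable[[Card], bool],
    wanted: int,
    max_attempts_per_card: int = 3,
) -> List[Card]:
    """Collect up to `wanted` cards, regenerating any candidate the rubric rejects."""
    accepted: List[Card] = []
    for _ in range(wanted):
        for _ in range(max_attempts_per_card):
            candidate = generate()
            if passes_rubric(candidate):
                accepted.append(candidate)
                break  # this slot is filled; move on to the next one
        # If every attempt failed, the slot stays empty rather than shipping a bad card.
    return accepted


# Stub generator and stub check, for illustration only.
_candidates = iter([
    {"stem": "Q1", "options": ["A", "B", "All of the above"], "answer": "A"},
    {"stem": "Q1", "options": ["A", "B", "C"], "answer": "A"},
    {"stem": "Q2", "options": ["D", "E", "F"], "answer": "E"},
])


def stub_generate() -> Card:
    return next(_candidates)


def stub_check(card: Card) -> bool:
    return all(o.lower() != "all of the above" for o in card["options"])


deck = gated_cards(stub_generate, stub_check, wanted=2)
print([c["stem"] for c in deck])
```

The design point is that rejection happens before a card reaches the deck, so the review log never records answers to cards that failed the rubric.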

23.5

“The spread between the best and worst consumer AI MCQ generator on the held-out three-document eval is 23.5 points on the same rubric. Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8. The dashboard layer cannot generate a spread this wide because every tool's dashboard reports roughly the same review-side metrics. The divergence lives entirely in the card layer.”

Held-out three-document eval, methodology at studyly.io/quality

Five things a good dashboard cannot rescue

The honest way to test the question-quality argument is to list what a dashboard, no matter how well-designed, structurally cannot detect. If the dashboard cannot see these problems, the only place they can be caught is the card layer at generation time.

What dashboards do not see

A 91% retention rate on cards with textbook drift still misses the class exam, because the dashboard never saw the source-vs-card mismatch.
An ease graph that looks healthy on length-tell distractors is reading test-wiseness, not knowledge.
A mature-card count of 178 on a 200-card deck reports nothing about whether the 178 cards cover the material the exam will cover.
A weekly streak measures whether you opened the app, not whether the app showed you the right cards while you were open.
A 'confidence-based mastery' score is a function of your self-ratings on whatever cards happened to be in the deck. It cannot detect a filler distractor or a compound stem.

Each of these is a real failure mode I have seen in the wild on self-generated AI flashcard decks. None of them shows up in the retention rate, the mature-card count, the confidence score, or the streak. The dashboard reports the student as on track. The exam reports the student as not.

The counterargument: dashboards are how habits actually form

The strongest case for the dashboard layer is not retention prediction; it is daily-review behavior. A student who opens the app every day and does five minutes of review will outperform a student with perfectly graded cards who opens the app once a week. The dashboard's job, on this reading, is not to measure the cards. It is to drive the open-the-app-today behavior that makes any cards in the deck do their work.

This is correct, and it is exactly why Studyly ships a dashboard at all. The dashboard is a tree per deck plus a weekly league. The tree grows when you do the daily reviews; the league is one metric, not twelve. The choice of one-tree-per-deck over a retention-forecast UI is deliberate: a tree drives the open-the-app behavior without competing with the card layer for attention, and without giving the student a misleadingly clean signal that their cards are working when the cards have not been graded.

The realistic configuration for a student who wants the best of both is to run a source-grounded AI generator that publishes a card-quality eval (Studyly) for the card layer, and Anki itself for the dashboard layer. Studyly exports to .apkg with a non-colliding note type so the cards land in your Anki collection alongside AnKing. Anki's dashboard then reads a clean signal, because the cards have already cleared the rubric at generation time. You get both layers, one tool per layer.

A clean line on which layer to fix first

If your deck is a community-curated boards deck (AnKing, Zanki, Pepper, the major dental and nursing equivalents), the card-quality work has already been done. The dashboard is reading a clean signal. Spend your time on the dashboard side: tune the scheduler, watch the retention curve, hit the daily-review streak. The card layer is fine.

If your deck is self-generated from your professor's slides (no community deck exists for class content), the card layer is the dominant source of variance and the dashboard is reading mostly noise. Pick a generator that publishes a card-quality eval and scores above the field average on it, then let your existing dashboard handle the review-side layer on cards that have already cleared the rubric. The two-tool split is the answer that the dashboard-vs-quality framing usually implies.

If you are using Quizlet or any user-generated card source for graduate study, neither layer is reliable. The card layer has wide variance because the source population is millions of high school and undergrad authors. The dashboard layer cannot rescue cards it cannot see into. That configuration is the one to leave behind.

Test the card layer on one of your lectures

Upload one slide deck. Sample twenty cards. Run the rubric yourself.

Free tier on app.jungleai.com, no credit card. Generate 200 MCQs from a real lecture in 60 seconds, score twenty random cards on the four-criterion rubric, and see whether the card layer is actually clean before you ask your dashboard to read a signal off of it. Thirty minutes of work tells you whether the deck is study-ready or needs an edit pass.
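Drawing the twenty cards at random matters, because hand-picking tends to favor the cards that already look good. A minimal sketch of the sampling step, assuming the deck is available as a Python list; the function names are illustrative, and the 90 seconds per card is the hand-grading pace referenced at studyly.io/t/anki-card-distractor-quality.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_for_grading(cards: Sequence[T], k: int = 20, seed: Optional[int] = None) -> List[T]:
    """Pick k distinct cards uniformly at random (fewer if the deck is smaller)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(list(cards), min(k, len(cards)))


def grading_minutes(n_cards: int, seconds_per_card: int = 90) -> float:
    """Estimated hand-grading time in minutes."""
    return n_cards * seconds_per_card / 60


deck = [f"card-{i:03d}" for i in range(200)]  # stand-in for 200 generated MCQs
picked = sample_for_grading(deck, k=20, seed=7)
print(len(picked), grading_minutes(len(picked)))
```

Twenty cards at 90 seconds each comes to the thirty minutes quoted above.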

Common questions about question quality vs flashcard dashboards

What is the actual difference between question quality and a flashcard dashboard?

Question quality is a property of each individual card: is the stem correctly stated, are the distractors plausible and parallel, does the answer key match a real authority, does the card type match what the underlying material can support. A flashcard dashboard is a property of the review LOG: how many cards are due today, what your seven-day retention rate is, how many cards are mature versus young, what your weekly streak looks like. The two layers do not see each other. A dashboard cannot tell you that the card you just answered correctly is testing a fact your professor did not teach, and a card-quality eval cannot tell you whether you actually reviewed the deck.

If retention dashboards in Anki and Brainscape are well known to predict exam scores, why is question quality a separate concern?

Retention prediction in those tools is conditional on the cards. The published correlations between Anki mature-card count and Step 1 score (the AnKing study, Bonner et al.) hold because AnKing is a community-curated, fifteen-year-old deck where the card quality work has already been done by thousands of edits. The dashboard is reading a clean signal. The same dashboard on a deck of self-generated AI cards, drawn from a field with a 23.5-point spread in card quality, is reading mostly noise, because the underlying cards are not uniform on the quality axis. Retention metrics presume card quality; they do not measure it.

Does any mainstream flashcard app publish a card-quality leaderboard?

No. Anki, Brainscape, Quizlet, Cram, Tinycards (sunset), StudyBlue (sunset), Mochi, RemNote, and SuperMemo all publish review-side metrics: retention, due cards, mature counts, ease graphs, heatmaps, streaks. None of them publish a held-out evaluation of card content against an item-writing rubric. The only consumer tool with a published per-tool card-quality leaderboard is Studyly's eval at studyly.io/quality, scored on factual correctness, clarity, distractor quality, and question-type coverage on a held-out three-document set.

What does a 'good dashboard, bad cards' situation actually look like in practice?

A student uploads a 90-slide cardiology deck into a generic AI quiz generator that does not run a quality gate. The tool produces 200 multiple-choice questions. About a quarter of those have textbook drift (a fact from the model's training that contradicts the slide), about a fifth have length tells (the correct option is visibly longer than the distractors), and about half have at least one filler distractor (a near-duplicate of the correct answer). The student imports the deck into Anki. After three weeks of reviews the Anki dashboard reports 91% retention and a mature-card count of 178. The dashboard says the student is in great shape. On the class exam the student misses every question whose right answer was on a card with textbook drift. The dashboard never had access to the signal that mattered.

Why does Studyly publish a question-quality eval but keep the dashboard minimal?

Because the engineering budget at a small team has to land on one of the two layers, and dashboards are deeply solved while card quality is not. Anki and SuperMemo have shipped retention forecasting and scheduling for two decades. Replicating that with a heavier UI would not change any student's exam score. The card-quality layer is where the room for improvement still exists: the spread between the best and worst AI generator on the held-out eval is 23.5 points (Studyly 81.3 to Turbolearn 57.8), and the mechanism that decides the score (source-grounded distractor pool versus pretrained free-association) is replicable. So the bet is to spend the budget there and keep the dashboard to one tree per deck, which is enough to drive daily-review behavior without competing with the card layer for attention.

What card-quality criteria are actually on the eval rubric?

Four. Factual correctness: does the answer key match the source document the card was generated from. Clarity: is the stem unambiguous, no double negatives, no compound questions hiding two facts under one stem. Distractor quality: are the wrong options plausible, parallel in grammar and length, free of filler templates ('all of the above', 'none of the above'), drawn from a real distractor pool rather than pretrained free-association. Question-type coverage: does the output mix recall, comprehension, and application question types in proportions appropriate for the source. The rubric is the same one the NBME Item-Writing Guide describes, applied to the AI output as if it were a human-written item bank.
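Two of the distractor checks are mechanical enough to automate as a first pass before hand-grading. The sketch below is a rough heuristic, not the eval's actual scoring: the function names and the 1.5x length threshold are assumptions, and the filler list contains only the two templates named above.

```python
from typing import Iterable, List

# The two filler templates named in the rubric description; extend as needed.
FILLER_TEMPLATES = {"all of the above", "none of the above"}


def has_filler(options: Iterable[str]) -> bool:
    """True if any option is a known filler template (case-insensitive)."""
    return any(o.strip().lower().rstrip(".") in FILLER_TEMPLATES for o in options)


def has_length_tell(correct: str, distractors: List[str], ratio: float = 1.5) -> bool:
    """True if the correct option is much longer than every distractor.

    The ratio is a hypothetical threshold; tune it against hand-graded cards.
    """
    if not distractors:
        raise ValueError("need at least one distractor")
    longest = max(len(d) for d in distractors)
    return len(correct) > ratio * longest


flagged = {
    "correct": "Inhibits the Na+/K+-ATPase pump, increasing intracellular calcium",
    "distractors": ["Blocks beta-1 receptors", "Opens potassium channels", "None of the above"],
}
clean = {
    "correct": "Blocks beta-1 receptors",
    "distractors": ["Opens potassium channels", "Inhibits ACE in the lungs"],
}

for card in (flagged, clean):
    print(has_length_tell(card["correct"], card["distractors"]), has_filler(card["distractors"]))
```

Plausibility and grammatical parallelism still need a human reader; these checks only catch the failures that are visible from the option strings alone.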

Can I get both, a quality-graded card layer AND a useful dashboard?

Yes, but the realistic configuration is two tools, not one. Use a source-grounded AI generator that publishes a card-quality eval for the card layer (your class lectures, where no community deck exists). Use Anki itself for the dashboard layer (mature-cards count and retention forecasting on the cards you import). Studyly exports to .apkg with a non-colliding note type so the cards land in your Anki collection alongside AnKing without breaking either. Anki's dashboard then reads a clean signal because the imported cards have already cleared the quality rubric at generation time.

What about Brainscape's confidence-based repetition score, is that not a card-quality signal?

It is a difficulty signal, not a quality signal. Confidence-based repetition (CBR) asks you to rate how confident you were in your answer on a 1 to 5 scale, then schedules reviews based on the rating. The score the dashboard surfaces is a function of YOUR ratings, not of whether the card itself is well-written. A card with a length-tell distractor will get high confidence ratings because the test-wise student picks the right option without knowing the material; the CBR score will move the card to a long-interval queue; the dashboard will report mastery. The card is still bad, the dashboard still reads it as mastered, and the exam still tests the underlying fact.

Is the question-quality criterion just a way to sell against Quizlet?

No. Quizlet's UI runs on user-generated cards, and the variance there is wider than on any AI generator because the source population is millions of high school and undergrad students. The relevant comparison for graduate study is between AI generators with different quality gates (or no gate), and the leaderboard at studyly.io/quality is what that comparison looks like under a held-out rubric. Quizlet is a different category. The question here is about the dashboard and card-quality layers of any flashcard tool, not Quizlet specifically.

Where can I see the eval methodology and the per-tool scores?

Methodology, the four criteria, the three held-out documents, the per-criterion breakdowns, and the per-tool scores are at studyly.io/quality. A worked example of why source-grounded distractor pools land where they do on the eval, plus a 90-second-per-card hand-grading rubric you can run on any deck yourself, is at studyly.io/t/anki-card-distractor-quality.