Guide · question-quality eval

NotebookLM flashcard quality MCQ eval: which feature you actually score, and what the rubric says.

NotebookLM ships two separate generators from the same uploaded source. Flashcards are 2-sided cards. Quizzes are 4-option multiple-choice questions. An MCQ eval scores the quiz output, not the flashcard output, and on a four-axis rubric it lands clean on two axes and hits a ceiling on the other two. This page disambiguates the two surfaces, walks the rubric, and explains why NotebookLM was not on the published 81.3 leaderboard.

Matthew Diakonov, Written with AI

Published May 20, 2026 · 6 min read

Direct answer · verified May 20, 2026

On a four-axis rubric (factual correctness, clarity, distractor quality, question-type coverage), NotebookLM's MCQ output scores strong on factual correctness and clarity (because every question is grounded only in the source you upload and the wrong-answer explanation cites the line), mixed on distractor quality (one wrong option is often eliminable on length or phrasing alone), and capped on question-type coverage (only multiple-choice questions, no free-response, no case-style vignettes, no image-occlusion). The 81.3 leaderboard at studyly.io/quality did not include NotebookLM because the bake-off compared dedicated study-question generators, not general-purpose research tools.

Verified against Google's official NotebookLM Help page on generating flashcards and quizzes and the Google product announcement.

Flashcards and quizzes are not the same generator

The keyword combines what the product separates. NotebookLM added flashcards and quizzes together on September 8, 2025, and brought them to the Android and iOS apps in November 2025. They share an upload pipeline and a customization panel (difficulty easy / medium / hard, count fewer / standard / more, topic prompt), but they produce different outputs. The panel below shows what the flashcard generator emits from a source paragraph.

Same upload, two different outputs

Front: Pulmonary veins. Back: The blood vessels that carry oxygenated blood from the lungs to the left atrium of the heart. (Explain button on the back surfaces a short explanation citing the uploaded source.)

2-sided card, term on front, definition on back
Single cognitive task: recall the definition from the term
No distractors to score, no option list, no MCQ rubric applies

If you typed “NotebookLM flashcard quality MCQ eval” into the search bar, you almost certainly meant the quiz output. The rest of this page scores that surface. Flashcards get a brief section at the bottom because they are also worth being honest about, but they live in a different cell of the matrix.

The four-axis rubric, applied to NotebookLM's MCQs

The rubric Studyly publishes at /quality grades a practice question on four axes, weighted equally. Here is what each axis scores on NotebookLM's quiz output, with the mechanism that produces the score.

Axis 1 · Factual correctness — strong

The keyed answer is lifted from your uploaded sources, not from the model's pretrained knowledge of the topic. The Explain surface on a wrong answer cites the source line the answer came from. This closes the most common failure mode of generic AI question generators, the quietly-wrong answer key, by structural construction. Mechanism: source-grounded retrieval before generation.

Axis 2 · Clarity — strong

Fluent, readable stems are what language models are best at. NotebookLM's questions rarely read as ambiguous or double-barreled. The lift here is the easy lift; almost every tool in this category clears it. Mechanism: large-model generation with a clean prompt template.

Axis 3 · Distractor quality — mixed

The wrong options are on-topic, which is the floor. Above the floor, NotebookLM does not appear to enforce option-length matching: a one-word distractor next to a detailed correct answer is a length tell that lets a student pick the long one without reading the stem. Setting difficulty to hard reduces the frequency of length tells; it does not eliminate them. This is the criterion with the largest spread on the public leaderboard between tools that gate on distractor quality at emission time and tools that do not.
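A length tell can be checked mechanically. The sketch below is a minimal illustration, not NotebookLM's or Studyly's actual logic: it assumes character count as the length measure and borrows the roughly 25 percent tolerance used in the grading steps on this page. The function names are hypothetical.

```python
def length_ratio_ok(option: str, correct: str, tolerance: float = 0.25) -> bool:
    """True if the option's character length is within `tolerance` of the correct answer's."""
    if not correct:
        return False
    return abs(len(option) - len(correct)) / len(correct) <= tolerance


def has_length_tell(correct: str, distractors: list[str], tolerance: float = 0.25) -> bool:
    """A question has a length tell if any distractor fails the length check."""
    return any(not length_ratio_ok(d, correct, tolerance) for d in distractors)


# Hypothetical example: a one-word distractor next to a detailed correct answer.
correct = "Vessels that carry oxygenated blood from the lungs to the left atrium"
distractors = [
    "Aorta",
    "Vessels that carry deoxygenated blood from the heart to the lungs",
    "Vessels that return deoxygenated blood from the body to the heart",
]
print(has_length_tell(correct, distractors))  # True: "Aorta" is far shorter than the key
```

Character count is a crude proxy; a word-count or token-count measure would work the same way with a different `len`.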

Axis 4 · Question-type coverage — capped

The quiz feature produces multiple-choice questions and nothing else. There is no free-response slot, no clinical-vignette case question, no image-occlusion card. Every question is the same cognitive task: recognize the right option among four. On a four-axis rubric weighted equally, this single axis caps the aggregate score for any MCQ-only tool below the ceiling a multi-format generator can reach.
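The capping effect is simple arithmetic under equal weighting. The sketch below assumes each axis is scored on a 0 to 100 scale and uses a made-up cap of 50 for type coverage; the page does not publish how the type-coverage axis is scored numerically, so both the scale and the cap are illustrative assumptions.

```python
AXES = ("factual", "clarity", "distractors", "type_coverage")


def aggregate(scores: dict[str, float]) -> float:
    """Equal-weighted mean of the four axis scores (assumed 0-100 scale)."""
    return sum(scores[axis] for axis in AXES) / len(AXES)


def ceiling(type_coverage_cap: float) -> float:
    """Best possible aggregate when every other axis is perfect."""
    return aggregate({
        "factual": 100,
        "clarity": 100,
        "distractors": 100,
        "type_coverage": type_coverage_cap,
    })


# Hypothetical cap of 50 on type coverage for an MCQ-only tool.
print(ceiling(50))   # 87.5
print(ceiling(100))  # 100.0
```

Whatever the real cap is, each point lost on one axis costs a quarter of a point on the aggregate, and no improvement on the other three axes can recover it.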

Why NotebookLM was not on the published leaderboard

The numbers on the public leaderboard come from a held-out three-document eval graded on the four axes above: Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8, field average 67.9. NotebookLM is not on that list. Two reasons, neither of which is a quality verdict.
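For reference, the top-to-bottom spread among the four named tools can be recomputed from the scores quoted above. This is only a small arithmetic check on the published figures; the dictionary layout and function name are assumptions for illustration.

```python
# Scores as quoted on this page for the four named tools.
leaderboard = {
    "Studyly": 81.3,
    "Unattle": 78.0,
    "Gauntlet": 68.0,
    "Turbolearn": 57.8,
}


def spread(scores: dict[str, float]) -> float:
    """Difference between the highest and lowest score, rounded to one decimal."""
    return round(max(scores.values()) - min(scores.values()), 1)


print(spread(leaderboard))  # 23.5
```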

First, category. The bake-off compared dedicated study-question generators built around a freemium funnel and a question-bank workflow. NotebookLM is a general-purpose research and document-Q&A tool that added a quiz feature in September 2025; it sits next to Gemini, Claude, and ChatGPT in any reasonable taxonomy, not next to Brainscape or Anki. A leaderboard comparing apples is not an indictment of the orange.

Second, recency. The original three-document eval ran before NotebookLM's quiz feature was a year old, and the field has moved since: Google's April 23, 2026 update added progress saving and replay-missed, while the free-tier quiz limit is still 10 per day. A fair head-to-head would re-run the eval on the current product. The four-axis rubric works for that re-run; the section above is a per-axis projection, not an aggregate score. The structural ceiling at Axis 4 will not move regardless of updates for as long as the feature emits only one question type.

How to actually run the eval yourself in 20 minutes

If you want a number you trust, the four-axis rubric is small enough to run by hand on a single document. The point of running it yourself is to see which axis the gap lives on for the specific material you study.

The 20-minute MCQ eval, end to end

Pick one source document you know well: a 30-slide lecture deck, a chapter PDF, a 20-minute lecture video transcript.
Generate one quiz at standard size and hard difficulty. Skim it, and discard it and regenerate if a topic prompt has biased the questions; you want the default generator behavior.
For each question, score the four axes against the rubric. Three are yes/no checks: (1) does the keyed answer match the source? (2) is the stem unambiguous? (3) are all three distractors plausible and length-matched within roughly 25 percent of the correct answer? The fourth is a label: (4) is this a recognition question, an application question, or a case-style vignette?
Tally percentages. Factual and clarity should clear 90 percent. Distractor quality will land lower and is the axis where length tells dominate. Type coverage is binary at the feature level: all questions are MCQ, so the type axis sits at one cell of the matrix regardless of difficulty setting.
Compare against your exam. If the exam is mostly recognition, the MCQ-only ceiling is fine. If the exam is mostly application or case-based, the type-coverage gap is the part that affects you most.
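The scoring and tallying steps above can be sketched in a few lines if you prefer a script to a spreadsheet. This is a minimal sketch under stated assumptions: the `GradedQuestion` record and `tally` function are hypothetical names, the grades are entered by hand after reading each question against the source, and the sample grades below are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class GradedQuestion:
    factual: bool         # keyed answer matches the source
    clear: bool           # stem is unambiguous
    distractors_ok: bool  # all three distractors plausible and length-matched
    qtype: str            # "recognition", "application", or "case"


def tally(graded: list[GradedQuestion]) -> dict:
    """Per-axis pass percentages plus the mix of question types."""
    n = len(graded)
    if n == 0:
        raise ValueError("no graded questions")

    def pct(passed: int) -> float:
        return round(100 * passed / n, 1)

    return {
        "factual_pct": pct(sum(q.factual for q in graded)),
        "clarity_pct": pct(sum(q.clear for q in graded)),
        "distractor_pct": pct(sum(q.distractors_ok for q in graded)),
        "type_mix": dict(Counter(q.qtype for q in graded)),
    }


# Invented grades for four questions, for illustration only.
sample = [
    GradedQuestion(True, True, True, "recognition"),
    GradedQuestion(True, True, False, "recognition"),
    GradedQuestion(True, False, False, "recognition"),
    GradedQuestion(False, True, True, "application"),
]
print(tally(sample))
```

Keeping the type axis as a label rather than a pass/fail flag makes the final comparison against your exam format a matter of reading the `type_mix` counts.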

The grading itself takes about 90 seconds per question with the source open in another tab. A 15-question quiz is about 22 minutes of grading (15 × 90 seconds). The result is a number you can defend, broken out by axis, on material you actually have to learn.
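To budget time for a different quiz length, the same arithmetic can be written as a one-line helper. The function name is hypothetical, and the 90-second default is simply the per-question estimate quoted above.

```python
def grading_minutes(num_questions: int, seconds_per_question: float = 90) -> float:
    """Estimated grading time in minutes at a fixed pace per question."""
    return num_questions * seconds_per_question / 60


print(grading_minutes(15))  # 22.5
print(grading_minutes(10))  # 15.0
```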

NotebookLM's MCQ surface vs a multi-format generator

Both produce questions grounded in your uploaded material. The differences are upstream (format coverage and distractor gating) and downstream (what happens on retake).

Feature	NotebookLM quiz	Studyly
Questions grounded only in your uploaded material	Yes, quizzes draw only from your sources	Yes, generated against the exact slide deck, PDF, or transcript you upload
Question formats from one source	Multiple-choice only	MCQ, free-response, case-style, and image-occlusion flashcards
Distractor quality gating at emission	Not publicly documented; option-length matching not enforced	Four checks per card before emission (source-anchoring, length within ~25 percent, filler-template ban, grammar parallelism)
Question stem on retake	Card shuffle only; the wording of the stem stays the same	Stem auto-rephrased on revisit so position and phrasing-pattern matching break down
Free-tier daily ceiling	10 quizzes per day	Free tier with a generous daily cap, no credit card
Published question-quality score	No public eval (not in the studyly.io/quality leaderboard, see section above)	81.3 on a held-out three-document eval

NotebookLM details reflect the September 8, 2025 launch and the April 23, 2026 update. Studyly's 81.3 is an internal eval run by Jungle, the company behind Studyly, comparing dedicated study-question generators (Studyly, Unattle, Gauntlet, Turbolearn); NotebookLM was not a tool in that bake-off, so the comparison here is per-axis, not aggregate.

Free tier on app.jungleai.com, no credit card. Upload the same deck you would put into NotebookLM and see MCQ, free-response, case-style, and image-occlusion come out of one source.

Frequently asked

Are NotebookLM flashcards multiple-choice?

No. NotebookLM ships flashcards and quizzes as two separate generators from the same uploaded source. Flashcards are 2-sided cards (term on the front, definition on the back) with an Explain button on the back. Quizzes are 4-option multiple-choice questions with an Explain button on the wrong answer. If you are running an MCQ eval against NotebookLM, you are scoring the quiz feature, not the flashcards. The Google announcement and the in-product UI both treat them as distinct surfaces.

What does NotebookLM's MCQ output score on a quality eval?

On a 4-axis rubric (factual correctness, clarity, distractor quality, question-type coverage), NotebookLM's quiz output scores well on the first two and falls short on the other two. Factual correctness is strong because every question is grounded only in the sources you upload and the wrong-answer explanation cites the line it came from. Clarity is strong because language models are genuinely good at fluent stems. Distractor quality is mixed: the wrong options are on-topic but one is often eliminable on length or phrasing alone. Question-type coverage is the structural ceiling: the feature produces multiple-choice questions and nothing else, so it tests recognition but not application or case-based reasoning.

Why wasn't NotebookLM in the Studyly 81.3 leaderboard?

The studyly.io/quality leaderboard compared dedicated study-question generators on a held-out three-document eval: Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8. NotebookLM is a general-purpose research and document-Q&A tool that added quizzes and flashcards as one feature among many in September 2025; it is in a different product category. That does not mean its MCQ output cannot be scored on the same 4-axis rubric, only that the direct head-to-head number would carry a category caveat.

What's the cleanest way to actually run an MCQ eval on NotebookLM output?

Upload one document, generate one quiz at standard size on hard difficulty, export the questions (screenshot or copy the stem and the four options into a sheet), and record the keyed answer for each. Then score each question on four axes. (1) Factual: does the keyed answer match the source? (2) Clarity: is the stem unambiguous? (3) Distractors: are all three wrong options plausible and length-matched, or can one be eliminated without knowing the fact? (4) Type: is this a recognition question, an application question, or a case-style vignette? Tally the percentages. The factual and clarity scores will be high; the distractor and type scores will reveal the gap.

Does the April 23, 2026 update change the eval result?

It changes the user experience but not the per-question quality scores. The April update added progress saving, got-it and missed-it marking, replay-only-missed, and card shuffling. Those are retention and review affordances, not generation-time changes. The MCQ that comes out of the generator is the same MCQ. Shuffle reorders cards so position-memorization breaks down, but it does not reword the stem on retake, which is the separate failure mode that turns a quiz into a memory test for the question instead of the material.
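The difference between shuffling and rewording is easy to see in a toy model. The sketch below is not NotebookLM's implementation; it is a hypothetical `shuffle_quiz` function that reorders a list of question records, which is the behavior described above, and shows that every stem comes back word for word.

```python
import random
from typing import Optional


def shuffle_quiz(questions: list[dict], seed: Optional[int] = None) -> list[dict]:
    """Return the questions in a new order; stems and options are left untouched."""
    rng = random.Random(seed)
    shuffled = list(questions)  # copy so the original order is preserved
    rng.shuffle(shuffled)
    return shuffled


# Invented question records, for illustration only.
quiz = [
    {"stem": "Which vessels carry oxygenated blood to the left atrium?"},
    {"stem": "Which chamber pumps blood into the aorta?"},
    {"stem": "Which valve separates the left atrium and left ventricle?"},
]
retake = shuffle_quiz(quiz, seed=1)

# Order may differ, but the set of stems is identical on retake.
print(sorted(q["stem"] for q in retake) == sorted(q["stem"] for q in quiz))  # True
```

A student who has memorized a stem's wording still recognizes it after a shuffle, which is why reordering alone does not address the memory-test failure mode.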

How does NotebookLM's MCQ output compare to a tool that does gate on distractor quality?

On the studyly.io/quality leaderboard, distractor quality is the single criterion with the largest tool-to-tool spread, because that is where gating at emission time pays off over generating without a gate. Studyly's generator runs four checks per card before emission (source-anchoring, length-matching within ~25 percent, filler-template ban, grammar parallelism), and the 23.5-point spread between the top and bottom of the leaderboard collapses substantially when you isolate the factual correctness axis. NotebookLM does not publish whether its quiz generator applies any of these gates. Inspection of its output suggests it does not enforce option-length matching, which is the distractor failure mode most students notice.

If NotebookLM's facts are reliable, is the MCQ eval ceiling actually a problem?

Depends on what your exam tests. For a vocabulary or definitions-style test (anatomy structures, drug names, dates), recognition practice is what you need and NotebookLM's MCQ output is enough. For a clinical, case-based, or application-heavy exam (USMLE Step 1 and 2 CK, NBME shelf exams, NCLEX next-gen items, bar exam essay-style fact patterns), recognition practice trains the easier half of the skill. The eval gap is structural, not a quality bug.

What about NotebookLM's flashcards specifically, not the quizzes?

The flashcards score well on factual correctness for the same source-grounding reason, and on clarity because they are short by construction. The distractor-quality axis does not apply (there are no distractors on a 2-sided card). The question-type axis is collapsed: every flashcard is a single cognitive task, recall a definition from a term. If you are looking at NotebookLM's flashcards as a category, you are looking at one good cell in the type-coverage matrix, not at MCQs.

Is there a published NotebookLM question-quality eval score?

No. As of May 2026, Google publishes a Help page and a product-announcement blog post describing how the quizzes and flashcards are generated, but it does not publish any numeric quality benchmark for the quiz output: no factual-accuracy percentage, no distractor-quality score, no head-to-head against other tools. The only public numbered study-question eval is the four-axis leaderboard at studyly.io/quality, and NotebookLM was excluded from it because the bake-off compared dedicated question generators, not general-purpose research tools. So the honest answer to 'what does NotebookLM score on a question-quality eval' is qualitative (two axes clean, two axes capped), not a single published number. To get a number for your own material, run the four-axis rubric on ten consecutive NotebookLM quiz questions as described above.
