Guide · question quality
AI-generated practice question quality: the part you cannot grade yourself.
Every guide on picking an AI question generator tells you the same thing: generate the questions, then review them and edit what is wrong. That advice quietly assumes you can tell what is wrong. For three of the four things that make a practice question good, you can. For the fourth, the one that actually costs you marks, you cannot. This page is about that fourth thing.
Direct answer · verified May 16, 2026
Per question, AI-generated practice questions read fine. The published evidence is narrower than that sounds: in a 2025 BMC Medical Education quality-assurance study, only 22.2 percent of AI-written questions were usable with no edits, 46.8 percent needed minor fixes, and 30.9 percent were rejected, some for outright factual errors. The quality risk that matters is a question whose keyed correct answer is wrong, because that is the one thing you cannot catch by reading the question. An AI practice question is trustworthy only when its keyed answer traces back to a span in the source you uploaded.
Sources: the BMC Medical Education 2025 quality-assurance study and the held-out eval at studyly.io/quality.
Quality is four things. You can see three of them.
A practice question is good or bad on four axes. Studyly's held-out eval names them, and they are the same four any item-writer would use: clarity of the stem, plausibility of the distractors, the mix of question types across the whole set, and factual correctness of the keyed answer. Three of those four are things you can judge by looking. The fourth is not.
What you can verify on sight, and the one thing you cannot
- Clarity. Read the stem once. If it is ambiguous or double-barreled, you can feel it immediately.
- Distractor quality. Look at the four options. A throwaway like 'all of the above' or one obviously short filler answer is visible at a glance.
- Question-type coverage. Tag twenty questions as recall or application and count. The skew, if there is one, takes a minute to see.
- Factual correctness of the answer key. To know the keyed answer is wrong, you must already know the right answer, which is the thing you are studying to learn.
That last line is the whole problem. The other three failures announce themselves. A wrong answer key does not. It sits in your deck looking exactly like a correct one, and the person reviewing it is, by definition, the person who does not yet know the material. You drill it, you trust it, and the first time you discover the key was wrong is when the real exam marks you down for the answer your practice deck taught you.
What a wrong answer key is, and where it comes from
A wrong answer key is not a typo. It is a question where the stem is clean, the distractors are plausible, and the option marked correct is the wrong one. It happens because most generators write the question from your material but answer it from somewhere else. The model has read thousands of textbooks. When it decides which option is correct, it leans on that general knowledge instead of on the file you uploaded.
Most of the time the textbook and your lecture agree, so nothing breaks. The damage is on the borderline facts: the points where your professor simplified, updated, or deliberately contradicted the textbook consensus. There, the model keys the textbook answer, and the question now scores you wrong for repeating what your lecture actually taught. The panel below walks through where that key comes from in a generator that does not ground it.
Where the keyed answer comes from
The generator writes the question against your slide, then keys the correct option using the model's pretrained knowledge of the topic. The answer reflects the textbook consensus, not the specific thing your professor put on the slide. Nothing in the question records where the answer came from.
- Keyed answer reflects the textbook, not your lecture
- No way to trace which slide or page the answer is from
- You only discover the key was wrong on exam day
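A minimal sketch of that ungrounded path, written as Python pseudocode. The function name, the prompt wording, and the ask_model callable are all hypothetical, not any specific tool's pipeline; the point is only that nothing in the flow ties the keyed option back to the slide text.

```python
# Hypothetical sketch of an ungrounded generator: the stem is written
# from the slide, but the key is chosen from the model's own knowledge.

def generate_question(slide_text: str, ask_model) -> dict:
    # Step 1: the stem and options are written against the upload.
    stem_and_options = ask_model(
        f"Write a single-best-answer question with four options about:\n{slide_text}"
    )

    # Step 2: the model picks the correct option from whatever it already
    # "knows" about the topic. The slide is not re-consulted, and no slide
    # number, page, or sentence is recorded alongside the key.
    keyed_answer = ask_model(
        f"Which option is correct?\n{stem_and_options}"
    )

    return {
        "question": stem_and_options,
        "keyed_answer": keyed_answer,
        "source_span": None,  # nothing to trace the key back to
    }
```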
What the quality-assurance research actually found
The clearest published data point is a 2025 study in BMC Medical Education titled “Quality assurance and validity of AI-generated single best answer questions.” Researchers had an AI write single-best-answer questions, then put every one through a formal review.
Only about one in five questions was usable untouched. The rejected third included questions the reviewers flagged as not sensible: factual inaccuracies, missing reference ranges, answers that did not hold up. Notably, the same study found that once the questions were quality-assured, they performed no differently from human-authored questions on a real exam. The review is the load-bearing step. The numbers are from the BMC Medical Education study.
That study had trained medical educators doing the review. They knew the material, so they could catch the factual errors. A student drilling their own AI-generated deck at midnight does not have that advantage. The 22.2 percent that were clean is the number to keep in mind: roughly four out of five AI-written questions needed a human who already knew the answer to touch them before they were safe to study from.
The 60-second answer-key check
You cannot audit the answer key against your memory, so audit it against your source. This takes about a minute on ten questions and tells you more than re-reading the questions ever will.
1. Pull 10 questions. Take any ten from the set, ideally on material you have already studied once so you have a fighting chance of spotting a bad key.
2. Find each answer's source. For every question, locate the exact slide, page, or sentence the keyed answer should come from. If the tool cannot show you, that is your answer.
3. Read the source, then the key. Open your own material. Does the keyed answer match what your professor actually said, not what a textbook says? Borderline facts are where keys drift.
4. Count the mismatches. One wrong key in ten implies roughly twenty in a 200-question set. That is the deck quietly deciding which twenty facts you memorize wrong.
The read is binary. If every keyed answer traces cleanly to your material, the set is safe and you should go drill it. If you find mismatches, you have two options. Edit each one by hand, which requires you to already know the right answer for every question, or move to a generator that anchors the key to your source before the question is ever shown to you. The second option is the only one that scales past a single lecture.
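If your generator can export the deck, the sampling in step 1 and the arithmetic in step 4 take a few lines to script; the reading in steps 2 and 3 is still on you. A minimal sketch, assuming a hypothetical JSON export where each item carries question, keyed_answer, and source fields:

```python
import json
import random

# Hypothetical export format: a list of objects with "question",
# "keyed_answer", and (if the tool records one) "source" fields.
with open("deck.json") as f:
    deck = json.load(f)

sample = random.sample(deck, k=min(10, len(deck)))

for i, item in enumerate(sample, 1):
    print(f"{i}. {item['question']}")
    print(f"   keyed answer:   {item['keyed_answer']}")
    print(f"   claimed source: {item.get('source') or 'NO SOURCE RECORDED'}")
    print()

# After checking each keyed answer against your own material, extrapolate:
# one wrong key in ten implies roughly twenty in a 200-question deck.
mismatches = int(input("Keyed answers that did not match your material: "))
estimated_bad = round(mismatches / len(sample) * len(deck))
print(f"Estimated wrong keys across all {len(deck)} questions: ~{estimated_bad}")
```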
Why source-grounding is the only structural fix
Studyly generates questions against the file you upload (your professor's actual slide deck, your PDF, your textbook chapter, your lecture transcript), not against a generic web question bank. The rule that matters for this page is the one applied to the answer key: a generated question is kept only if its correct answer traces to a span in your upload. If the model cannot cite the slide, page, or timestamp the answer came from, the question is rejected and the slot is regenerated. There is no textbook fact pulling the key away from what your lecture said, because the key is not allowed to come from anywhere except your lecture.
That same anchoring is what makes the uncheckable dimension checkable. When you miss a question, the explanation references the exact slide or PDF page the correct answer was lifted from. You are not asked to trust the answer key. You are shown the line in your own material that produced it. The one quality dimension you could never self-grade becomes a one-click jump to your own source.
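The gate itself is simple to state in code. The sketch below is not Studyly's implementation, only the shape of the rule the paragraph describes, under the assumption that the generator returns a claimed source span with each question: keep the question only if that span actually appears in the uploaded text, otherwise regenerate the slot.

```python
# Sketch of a source-grounding gate; the generate callable and the
# "source_span" field are assumptions, not a documented API.

def keep_question(question: dict, upload_text: str) -> bool:
    """Keep a generated question only if its keyed answer cites a span
    that really exists in the uploaded material."""
    span = question.get("source_span")
    if not span:
        return False  # no citation at all: reject
    # The cited span must appear in the upload. A real system would
    # normalize whitespace or match fuzzily, but the rule is the same.
    return span.strip().lower() in upload_text.lower()


def fill_slot(generate, upload_text: str, max_attempts: int = 5):
    """Regenerate until a candidate passes the gate or attempts run out."""
    for _ in range(max_attempts):
        candidate = generate(upload_text)
        if keep_question(candidate, upload_text):
            return candidate
    return None  # leave the slot empty rather than ship an untraceable key
```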
“On a held-out three-document eval scored on factual correctness, clarity, distractor quality, and question-type coverage, Studyly scored 81.3. Unattle scored 78.0, Gauntlet 68.0, Turbolearn 57.8. Field average 67.9.”
Internal eval run by Jungle, the company behind Studyly. Methodology and per-criterion scores at studyly.io/quality.
That is an internal measurement, not an independent audit, and worth reading as one. But factual correctness is the criterion the eval scores first, and source-grounding is the mechanism behind the score. It is the difference between a deck of questions that look right and a deck of questions whose answers you can trace, without already knowing the material, back to the page you are studying from.
Free tier on app.jungleai.com, no credit card. Drop one lecture, generate the questions, and run the 60-second check above on the output.
Frequently asked
Are AI-generated practice questions good enough to study from?
Usually readable, not reliably correct. In a 2025 quality-assurance study published in BMC Medical Education, single-best-answer questions written by an AI were reviewed one by one: 22.2 percent were usable with no edits, 46.8 percent needed minor fixes, and 30.9 percent were rejected outright, some of them for plain factual errors. So a single AI practice question tends to look like a real exam item, because fluent question writing is the thing language models are genuinely good at. The catch is that 'looks like a real question' and 'has the right answer keyed' are two separate properties, and only the first one is visible to you.
What is the one quality dimension you cannot grade yourself?
Whether the keyed correct answer is actually correct. A practice question is graded on four things: clarity of the stem, plausibility of the distractors, the mix of question types across the set, and factual correctness of the answer key. The first three are visible. You can read a stem and tell if it is ambiguous. You can look at four options and see if one is a throwaway. You can count recall versus application across twenty questions. Factual correctness is different: to know the keyed answer is wrong, you have to already know the right answer, which is the exact thing you are studying to learn. You are the worst-positioned person to audit the answer key, because if you could, you would not need the practice question.
How do I check if an AI practice question has a wrong answer key?
You cannot check it against your own memory, so check it against your source. For each question, find the exact slide, PDF page, or sentence the correct answer should trace to, then read that source and confirm the keyed answer matches what your professor actually said. If the tool cannot show you where the answer came from, that inability is itself the finding. The 60-second check on this page walks through it on ten questions: one wrong key in ten implies roughly twenty wrong keys in a 200-question set, and a wrong key does not just cost you that question, it teaches you the wrong fact.
Why do AI tools key the wrong answer in the first place?
Because most generators write the question from your material but answer it from their pretrained knowledge of the topic. The model has read thousands of textbooks. When it picks the correct option, it leans on that general knowledge rather than on the specific file you uploaded. If your professor's slide simplifies, updates, or contradicts the textbook consensus, the model keys the textbook answer, and the practice question now scores you wrong for repeating what your lecture actually taught. The failure is invisible at generation time and only surfaces on exam day.
What does source-grounded mean, and why does it matter for the answer key?
Source-grounded means the keyed correct answer must trace to a specific span in the document you uploaded, not to the model's general knowledge. Studyly keeps a generated question only if its answer can be anchored to a slide, PDF page, or transcript timestamp in your upload; if no span can be cited, the question is not kept. This is the structural fix for the wrong-answer-key problem, because there is no textbook fact pulling the key away from your lecture. It also makes the one dimension you could not self-grade checkable: every question links back to the exact slide its answer came from, so verifying the key is a one-click jump instead of a memory test.
Does Studyly's question quality hold up on a benchmark?
On a held-out three-document eval run by Jungle, the company behind Studyly, Studyly scored 81.3 across four criteria: factual correctness, clarity, distractor quality, and question-type coverage. Unattle scored 78.0, Gauntlet 68.0, Turbolearn 57.8, with a field average of 67.9. Held-out means the source documents were chosen after the system was frozen, so the number reflects generalization rather than memorization of a tuning set. The full methodology and per-criterion breakdown are at studyly.io/quality. Treat it as the company's own measurement on a consistent rubric, not an independent audit.
Can a sharper prompt fix a wrong answer key?
Only partly. You can tell a model 'use only the attached file' and it will mostly comply on the question text. The answer key is harder, because the model still scores the options using its internal knowledge and a prompt cannot verify that the keyed answer was actually lifted from a span in your upload. The instruction degrades to a self-report, and a model will sometimes claim a fact came from slide 12 when it did not. Prompt craft moves the average, but the only thing that removes the wrong-key failure mode is a generator that rejects a question when the answer cannot be anchored to your source.
Related reading on this site
- AI Anki card generator quality: the measurement most tools quietly skip. The full leaderboard and the per-criterion breakdown behind the 81.3.
- Anki card distractor quality: the five failure modes. The same scrutiny, applied to the wrong answers instead of the right one.
- Prompt-generated exam question quality is a property of the set. The coverage failure you cannot see one question at a time.
- Anki rubric for question generation: in-flight checks beat post-hoc review. Where the per-question gates run, and why placement changes the outcome.