Guide · question quality

Prompt-generated exam question quality is a property of the set, not the question.

Every guide on this topic tells you how to write a better prompt: be specific, name the topic, ask for plausible distractors, then review the output. That advice grades the wrong unit. Prompt-generated exam questions look fine one question at a time and fall apart when you look at the whole batch. This page is about the failure that a per-question read structurally cannot catch.

Matthew Diakonov · 8 min read

Direct answer · verified 2026-05-15

Prompt-generated exam questions are usually fluent and plausible at the level of any single question, and unreliable as a set. A prompt cannot verify the correct answer against your specific course material, and it cannot control the mix of cognitive levels across the batch, so prompt-generated sets skew toward lower-order recall. In a study published in Academic Pathology, 43 percent of ChatGPT-generated multiple-choice questions needed substantial modification before they were exam-suitable (PMC10753050). The fix is not a better prompt. It is grading the set, not the question. Studyly does that with a four-criterion held-out eval, methodology at studyly.io/quality.

Every guide grades the prompt. None grades the output.

Search "how to make practice questions with AI" and you get the same page forty times: a list of prompts, a note that detail matters ("generate algebra questions for high school" beats "math questions"), and a closing line telling you to review the output and treat the model as a co-creator. Every one of those guides assumes the same thing: that quality is a function of how well you phrase the request.

It is a reasonable assumption for the parts of a question a prompt can actually touch. A clearer prompt does produce a clearer stem. A prompt that bans filler options does drop "all of the above". A prompt that asks for matched-length distractors does get them most of the time. Three of the things that make a question good are wording instructions, and wording instructions transfer to a prompt cleanly.

The problem is the closing line. "Review the output" means read each question and decide if it is good. That review runs one question at a time, and the failure that decides whether a prompt-generated set is worth studying does not live inside any one question. It lives in the distribution of the set. You cannot read a distribution. You have to count it.

What a prompt controls, and what it does not

Six things decide whether an exam question is good. The first three are wording. A prompt handles wording. The last three are not wording, and that is where prompt-generated quality quietly falls apart.

| Feature | A well-written prompt | What controlling it actually requires |
| --- | --- | --- |
| Stem reads like a real exam question | Yes. Fluent stems are the thing language models are genuinely best at. | Yes. Same starting point. |
| Distractors plausible and similar in length | Usually, if you ask for it explicitly. This is a wording instruction and it transfers to a prompt cleanly. | Yes, enforced as a length check before the question is kept, not left to a wording request. |
| No 'all of the above' or filler options | Yes, if you ban them in the prompt. Another instruction that transfers. | Yes, a fixed forbidden list is matched before the question is emitted. |
| Correct answer matches your professor's slide, not a textbook | No reliable way. A prompt can say 'use the source' but cannot verify that the answer traces to a span in your actual upload. | Verified. The answer must trace to a span in the file you uploaded, or the question is not kept. |
| Cognitive-level mix across the whole set | No. A prompt acts one question at a time and defaults to recall. Nothing in the loop counts the running mix. | Scored. Question-type coverage is one of four named eval criteria; generation biases toward the underrepresented type. |
| Wrong answers come with an explanation | Inconsistent. In one study, 19 of 60 questions shipped with no explanation for the incorrect options at all. | Every miss links back to the slide or PDF page the correct answer was lifted from. |

Three of these transfer to a prompt cleanly. The other three do not, and they are the three that decide whether the set is the right test for your exam.
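To make the split concrete, here is a minimal sketch of what the non-wording rows look like when they are enforced as gates on the output rather than as instructions in the prompt. The function names, the 2x length threshold, and the substring-based source check are illustrative assumptions, not a description of Studyly's actual pipeline.

```python
# Illustrative sketch only: the helper names, the 2x length threshold, and the
# substring source check are assumptions, not any tool's real pipeline.

FORBIDDEN_OPTIONS = {"all of the above", "none of the above", "both a and b"}

def distractor_lengths_ok(options, max_ratio=2.0):
    # Enforce "similar in length" as a check, not a wording request.
    lengths = [len(o) for o in options]
    return max(lengths) <= max_ratio * min(lengths)

def no_filler_options(options):
    # Match a fixed forbidden list before the question is emitted.
    return not any(o.strip().lower() in FORBIDDEN_OPTIONS for o in options)

def answer_traces_to_source(answer, source_text):
    # Crude stand-in for grounding: the correct answer must appear as a span
    # in the uploaded material, or the question is dropped.
    return answer.strip().lower() in source_text.lower()

def keep_question(question, source_text):
    # A question survives only if every gate passes; failures are dropped,
    # not patched.
    return (
        distractor_lengths_ok(question["options"])
        and no_filler_options(question["options"])
        and answer_traces_to_source(question["answer"], source_text)
    )
```

The shape is the point: each check runs after generation and drops failures, which is something a wording instruction can only request.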

The recall skew, and why it is invisible per question

A prompt generates one question at a time. Pointed at a slide, the lowest-effort thing it can produce is a recall question: name the structure, define the term, pick the true statement. An application question needs an invented scenario, and a case-based question needs a coherent clinical or worked situation. That is harder, and it is exactly where studies found language models weakest. A multinational study in PLoS One reported that ChatGPT "performed poorly when instructed to generate clinical scenario, possibly due to high complexity" (PMC10464959).

So a prompt asked plainly for "50 exam questions" drifts to the floor of Bloom's taxonomy. Each question still looks fine. The stem is clean, the distractors are plausible, nothing is obviously wrong. You read ten of them, they all check out, and you conclude the batch is good. It is not. You read ten recall questions and confirmed that recall questions are well-formed. You never measured the one thing that was broken.

Here is the shape researchers keep finding. The proportions below illustrate the direction, not measured percentages: what matters is that the two columns are different shapes.

A "generate 50 exam questions" batch

One question at a time, no running count

Recall: most of the set
Application: a thin slice
Analysis / case: almost none

The exam you actually sit

Written by someone counting the blueprint

Recall: a foundation
Application: the bulk of the points
Analysis / case: where grades separate

Drill the left column for a week and you will feel prepared. The exam is the right column. The gap between the two is the part of the quality story no single question can show you.

The trap inside the trap: questions that fake their own level

There is a worse version. When you do ask a prompt for application questions, it will often give you a recall question with a scenario sentence bolted on the front. A patient, some lab values, a setting, and then a stem that still reduces to a single factual lookup. The question reads like application. It is graded, in your head, as application. It is recall.

This is a documented effect. Reviewers of AI-generated items have noted that a scenario can make an item "look like it has a cognitive level of application when it is really just a recall question". It is the single hardest failure to catch by reading, because the surface signal (a scenario is present) is the exact signal you use to tag the question as application.

The test that defeats it is subtraction. Delete the scenario sentence. A real application question collapses without its setup: no patient, no values, no answer. A fake one survives, because the scenario was decoration. If the stem still has one obvious answer after you strip the story, count it as recall, no matter how clinical it looked. For what a genuine case-style item should demand of you, see the vignette-drilling guide.
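If you want to run the subtraction test across a whole batch rather than in your head, a minimal helper might look like the sketch below. It assumes you have already separated each item's scenario from its bare stem; the judgment call stays with you, the script only keeps count.

```python
# Manual-review aid for the subtraction test. It assumes each item already has
# its scenario and bare stem stored separately; the verdict is a human read,
# the script only keeps the tally.

def subtraction_test(items):
    fakes = 0
    for item in items:
        print("\nStem with the scenario deleted:")
        print(item["stem"])
        verdict = input("Still one obvious factual answer? [y/n] ").strip().lower()
        if verdict == "y":
            fakes += 1  # the scenario was decoration: count it as recall
    print(f"\n{fakes} of {len(items)} 'application' items failed the subtraction test.")
    return fakes
```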

What the research actually measured

The clearest single data point is from a 2023 study in Academic Pathology that built 60 multiple-choice questions for a graduate immunology class with ChatGPT and reviewed every one.

60 ChatGPT MCQs reviewed in the Academic Pathology study
43% needed substantial rework before they were exam-suitable
19 of those 60 gave no explanation for the wrong answers
4 named criteria the Studyly quality eval scores

43 percent needing substantial rework is not a wording problem you fix with a sharper prompt. The most common single defect was no explanation for the wrong answers, which means a student drilling the set learns the right answer but not why the others fail. The numbers are from the Academic Pathology study.

The 90-second count you run on your own batch

You do not need a benchmark to grade your own prompt-generated set. You need a tally. This takes about a minute and a half and tells you more than re-reading the questions ever will.

1. Pull 20 questions. Take any 20 from the batch your prompt produced. One lecture's worth is enough to read the shape.

2. Tag each one. Recall (name it, define it, list it), application (use it in a situation), or analysis (compare, predict, interpret a case).

3. Count the tags. If 15 or more of 20 are recall, the batch is a recall test wearing an exam's clothes.

4. Compare to your real exam. Pull a past paper or the course blueprint. If it is half application and case-based, the batch trains the wrong half.

The read is binary. If your count roughly matches your exam's mix, the batch is usable and you should go drill it. If it is recall-heavy against an application-heavy exam, do not patch it question by question. Regenerate the whole set with an explicit type quota, recount, or move to a source-anchored generator that scores coverage before you ever see the output. Editing a recall-skewed set into a balanced one by hand costs more than regenerating it.
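If you would rather tally in code than on paper, the same count fits in a few lines. The tag names and the 15-of-20 threshold come straight from the steps above; the example tags and the data shape are made up for illustration.

```python
from collections import Counter

# Tags you assigned by hand to 20 sampled questions; this particular list is
# invented to show what a recall-skewed batch looks like.
tags = [
    "recall", "recall", "application", "recall", "recall",
    "recall", "analysis", "recall", "recall", "application",
    "recall", "recall", "recall", "recall", "recall",
    "recall", "application", "recall", "recall", "recall",
]

counts = Counter(tags)
print(counts)  # Counter({'recall': 16, 'application': 3, 'analysis': 1})

# The binary read: 15 or more recall out of 20 means the batch is a recall
# test, whatever each individual question looks like.
if counts["recall"] >= 15:
    print("Recall-skewed: regenerate with a type quota or switch generators.")
else:
    print("Mix looks plausible: now compare it against your exam blueprint.")
```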

Why Studyly treats coverage as a score, not a feature

Studyly generates questions against the file you upload, your professor's actual slide deck or PDF, not a generic web question bank. That handles the factual-grounding column from the table above: a question is kept only if its answer traces to a span in your upload. The part that matters for this page is what happens at the level of the set.

The held-out eval scores four criteria, and question-type coverage is one of them, sitting next to factual correctness, clarity, and distractor quality. Treating coverage as a scored criterion is the structural reason a single Studyly upload comes back as a mix: multiple-choice, free-response, case-style, and image-occlusion questions from the same source, rather than 200 single-best-answer recall items. Coverage is graded at the level of the deck because that is the only level it exists at. For the per-question gates that run alongside it, see the in-flight rubric guide.
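As a rough illustration of why coverage has to live in the generation loop rather than in the prompt text, here is a sketch of biasing each new question toward the most underrepresented type using a running count. The quota values and the generate_one call are assumptions for the example, not Studyly's actual mechanism.

```python
from collections import Counter

# Target mix for the set; the numbers are illustrative, not a published blueprint.
QUOTA = {"recall": 0.30, "application": 0.45, "case": 0.25}

def next_question_type(counts, total_so_far):
    # Pick the type furthest below its share of the quota. This running count
    # is exactly the state a one-question-at-a-time prompt never keeps.
    def surplus(qtype):
        target = QUOTA[qtype] * max(total_so_far, 1)
        return counts.get(qtype, 0) - target
    return min(QUOTA, key=surplus)

counts = Counter()
for i in range(20):
    qtype = next_question_type(counts, i)
    # question = generate_one(source_text, qtype)  # hypothetical generation call
    counts[qtype] += 1

print(counts)  # the mix converges on the quota instead of drifting to recall
```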

81.3: Studyly's score on a held-out three-document eval across factual correctness, clarity, distractor quality, and question-type coverage. Unattle scored 78.0, Gauntlet 68.0, Turbolearn 57.8. Field average 67.9.

Internal eval run by Jungle, the company behind Studyly. Methodology and per-criterion scores at studyly.io/quality.

That is an internal measurement, not an independent audit, and it should be read as one. But the framing is the point a raw prompt cannot replicate: quality is something you score on the whole set against named criteria, before the questions reach the student, not something the student reconstructs one item at a time at midnight.

What to do tonight

If you already have a prompt-generated batch, run the count first. It takes a minute and a half. If the mix matches your exam, the batch is fine; study it. If it does not, the fastest honest fix is to put the source through a generator that scores coverage rather than to edit the batch you have. Studyly turns a lecture deck, PDF, textbook, or YouTube lecture into roughly 200 questions across four formats in about 60 seconds, and the four-criterion eval is the public record of how that output is graded.

The free tier on app.jungleai.com has no credit card gate. Drop one deck, count the type mix it gives you back, and compare it to the batch your prompt produced. The difference is the part of question quality you were never going to see one question at a time.

Frequently asked

Are AI prompt-generated exam questions good enough to study from?

Per question, usually. Per set, not reliably. A language model writes a fluent stem and plausible distractors well, so any single prompt-generated question tends to read like a real exam item. The quality problem is two levels up. A prompt cannot verify that the correct answer matches your specific course material rather than a textbook, and it cannot control the mix of cognitive levels across the whole set. In one published study of ChatGPT-generated multiple-choice questions, 26 of 60 (43 percent) needed substantial modification before they were suitable for a practice exam. The questions that looked fine were not all fine, and the ones that were fine still skewed toward recall.

Why do prompt-generated questions skew toward recall?

Because a prompt generates one question at a time, and recall is the cheapest question to write from a chunk of text. Point a model at a slide and the lowest-effort output is 'name this', 'define this', 'which of these is true'. An application or case-based question needs a scenario the model has to invent, and that is exactly where studies found it weakest: a multinational study in PLoS One reported ChatGPT 'performed poorly when instructed to generate clinical scenario, possibly due to high complexity'. So without a deliberate counterweight, a set of prompt-generated questions drifts to the floor of Bloom's taxonomy, even when each item is individually clean.

Can a better prompt fix the recall skew?

Partly. You can ask for '8 application questions, 8 analysis questions, 4 recall questions' and the model will try. Two things still leak. First, the prompt acts one question at a time with no running count, so the mix it actually produces drifts. Second, a scenario wrapper on a recall core passes the instruction without being a real application item: the question looks like application and is graded as recall. Researchers have asked this directly, in a paper titled 'ChatGPT's ability or prompt quality: what determines the success of generating multiple-choice questions'. The honest answer is that prompt craft moves the needle but does not remove the ceiling.

How do I tell a real application question from a fake one?

Strip the scenario sentence. A genuine application question falls apart without its setup: remove the patient, the lab values, the situation, and the question no longer has an answer. A fake one survives. If you delete the scenario and the stem still reduces to one obvious factual lookup, it was a recall question wearing a costume. This is the single most useful check to run on a prompt-generated batch, because pseudo-application is the failure that a per-question read misses most often.

What is question-type coverage and why is it a quality score?

Question-type coverage is whether a set of questions mixes recall, application, comparison, and case-based items, or collapses onto one type. Studyly's held-out eval scores it as one of four named criteria, alongside factual correctness, clarity, and distractor quality. It is a quality score because a set can be perfect on the other three and still be the wrong test: 200 factually correct, clearly worded, well-distractored recall questions will not prepare you for an exam that is half application. Coverage is the criterion that only exists at the level of the set, which is exactly why reading one question at a time cannot grade it.

Does Studyly's question quality hold up on a benchmark?

On a held-out three-document eval run by Jungle (the company behind Studyly), Studyly scored 81.3 across the four criteria, ahead of Unattle at 78.0, Gauntlet at 68.0, and Turbolearn at 57.8. The field average is 67.9. The eval is held out, meaning the source documents were chosen after the system was frozen, so the number reflects generalization rather than memorization of a tuning set. The full methodology and per-criterion breakdown are at studyly.io/quality. It is an internal eval; treat it as the company's own measurement, run on a consistent rubric, not an independent audit.

I have 20 minutes before I study. What do I actually do?

Run the count described on this page: pull 20 questions, tag each as recall, application, or analysis, and compare the mix to your real exam. If it is recall-heavy, you have two moves. Regenerate with an explicit type quota and then re-count, or upload the source to a tool that scores coverage so the mix is handled before you ever see the questions. Either way, do not spend the 20 minutes editing individual questions. The editing tax is real, and a recall-skewed set does not get fixed one question at a time.

