Multiple choice question generator · the four-criterion rubric
A multiple choice question generator is only as good as the questions it makes.
Most tools in this category compete on inputs (PDFs, slides, YouTube) and counts (200 in 60 seconds). Almost none publish what their output actually scores when you grade it. Studyly does. The held-out three-document eval, the four criteria, and the rephrase mechanism are all laid out below.
The four criteria a generated MCQ has to pass
Most articles on this topic spend their word count on input formats and question counts. Almost none answer the actual question: when a model spits out a multiple-choice question, what tells you it is usable and not just noise?
Here is the rubric Studyly uses internally on every generated deck. It is the same rubric the held-out eval below scores tools against.
Factual correctness
The correct answer must be a sentence (or paraphrase of a sentence) found in the source document. We grade by retrieving the supporting passage from the original PDF or slide deck. Pretrained model knowledge does not count; if the answer isn't in your lecture, it isn't on your test.
Clarity
The stem reads clean on the first pass. No double negatives, no two questions stitched into one, no nesting that forces you to re-read. A prepared student should know what is being asked from the first sentence.
Distractor quality
Wrong answers are plausible. They're the same length as the correct one (within a few words), they're drawn from related concepts in the same chapter, and there is no 'all of the above' or obviously throwaway option. The question rewards understanding, not test-taking heuristics.
Question type coverage
A good output deck mixes recall (what is X), application (which patient gets X), comparison (X vs Y), and case-based (a 32-year-old presents with…) questions. We score how varied the output is across one source. Decks that only test recall score lower.
From a folder of sources to a graded deck
The diagram below is the path a single source takes through Studyly: anything testable on the left, the rubric in the middle, four downstream question types on the right. What's unusual is that the rubric is a hard pre-output gate, not a post-hoc scorecard. Questions that fall below threshold are regenerated before they reach you.
Studyly question pipeline
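A pre-output gate of this kind can be sketched in a few lines of Python. This is a minimal illustration, not Studyly's implementation: `generate_with_gate`, the `generate` and `passes_rubric` callables, and the `max_attempts` limit are all hypothetical names chosen for the example.

```python
from typing import Callable, Optional


def generate_with_gate(
    generate: Callable[[], dict],
    passes_rubric: Callable[[dict], bool],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Return the first candidate that passes the gate, or None if none does."""
    for _ in range(max_attempts):
        candidate = generate()
        if passes_rubric(candidate):
            return candidate  # only passing questions are ever returned
    return None  # the caller decides what to do when every attempt fails


# Stand-in callables: the first two drafts fail the gate, the third passes.
drafts = iter([
    {"stem": "Which of these is not untrue?", "ok": False},
    {"stem": "What is X, and why does Y happen?", "ok": False},
    {"stem": "Which enzyme unwinds DNA at the replication fork?", "ok": True},
])
question = generate_with_gate(lambda: next(drafts), lambda q: q["ok"])
print(question["stem"])
```

The point of the shape is that a failing draft never leaves the function; the reader only ever sees what `passes_rubric` accepted.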
The held-out eval, in numbers
Three source documents were held out from any tool's training. Each tool was given the same three documents and asked to generate MCQs. Every output was graded on the rubric above. Same documents, same rubric, same graders, four tools.
Scores are out of 100 and higher is better: Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8. The gap between Studyly and Turbolearn (23.5 points) is the difference between a deck where most questions are usable and one where you spend half your study time editing the model's mistakes. The gap between Studyly and Unattle (3.3 points) is smaller and shows up mostly in distractor quality and type coverage.
Anchor fact · how Studyly stops you from pattern-matching
Every revisit, the wording changes.
The first time you see "which enzyme unwinds DNA at the replication fork," you might know the answer because you read the lecture last night. The second time you see the exact same sentence, you might answer it because you remember the shape of the words, not the biology. That is a real problem with most generators: their decks are static lists.
Studyly rewords every question on revisit and reshuffles the four options. The underlying fact stays the same. The surface form doesn't. After a week of drilling you have seen the same fact tested four different ways, never the identical sentence twice.
Below is one question, generated once and surfaced twice across a three-day study session. Notice that the correct answer is still helicase; the words around it are not.
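As a rough illustration of the mechanism (not Studyly's actual output or code), the sketch below surfaces one question twice with a different stem and a reshuffled option order. The `Question` class and `present` function are hypothetical, the second wording and the three distractors were written for this example, and rotating through a fixed pool of wordings is a simplification of generating a fresh rewording on each revisit.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    stem_variants: tuple[str, ...]  # different wordings of the same fact
    correct: str
    distractors: tuple[str, ...]


def present(q: Question, visit: int, rng: random.Random) -> tuple[str, list[str], int]:
    """Return (stem, shuffled options, index of the correct option) for one visit."""
    stem = q.stem_variants[visit % len(q.stem_variants)]  # rotate wording per visit
    options = [q.correct, *q.distractors]
    rng.shuffle(options)  # new option order on every visit
    return stem, options, options.index(q.correct)


q = Question(
    stem_variants=(
        "Which enzyme unwinds DNA at the replication fork?",
        # Hypothetical second wording, written for this example.
        "At the replication fork, which enzyme separates the two DNA strands?",
    ),
    correct="Helicase",
    distractors=("Ligase", "Primase", "Topoisomerase"),  # illustrative only
)

rng = random.Random(0)
for visit in range(2):
    stem, options, answer_idx = present(q, visit, rng)
    print(stem, options, "->", options[answer_idx])
```

Because the correct answer is tracked by value and its index is recomputed after each shuffle, grading stays correct no matter where the option lands.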
What we mean by "a bad multiple-choice question"
Every item below is a check Studyly runs before a question is shown to you. If a question fails any of them, it is regenerated. You never see the broken intermediate output.
What every generated MCQ has to pass
- Stem is one sentence with no double negative.
- Correct answer maps to a passage in the source PDF.
- All four options are within ~25% length of each other.
- Distractors are drawn from related concepts in the same chapter.
- No 'all of the above' or 'none of the above'.
- No obviously wrong throwaway option.
- No clue word in the stem that gives away the answer.
- Mixes recall, application, comparison, and case-style across the deck.
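Some of the checks above are mechanical enough to sketch in code. The function below is a hypothetical example covering only three of them: option count, catch-all options, and length balance. It assumes that "within ~25%" means the shortest option has at least 75% as many words as the longest, which is one possible reading and not a documented definition.

```python
BANNED_OPTIONS = {"all of the above", "none of the above"}


def failed_checks(options: list[str], max_length_spread: float = 0.25) -> list[str]:
    """Return the names of the mechanical checks this option set fails."""
    failures = []
    if len(options) != 4:
        failures.append("expected four options")
    if any(o.strip().lower().rstrip(".") in BANNED_OPTIONS for o in options):
        failures.append("catch-all option")
    lengths = [len(o.split()) for o in options]
    # Assumption: compare word counts of the shortest and longest option.
    if lengths and min(lengths) < (1 - max_length_spread) * max(lengths):
        failures.append("option lengths differ by more than 25%")
    return failures


print(failed_checks(["Helicase", "Ligase", "Primase", "Topoisomerase"]))
print(failed_checks(["Helicase", "Ligase", "Primase", "All of the above"]))
```

Checks such as distractor plausibility or clue words in the stem need semantic judgment and are not covered by a sketch like this.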
Studyly vs. a typical AI MCQ generator
The comparison below is conservative. We're using the median of the field, not the worst tool we tested.
| Feature | Typical AI MCQ tool | Studyly |
|---|---|---|
| Quality benchmark, published | No score, no methodology. | 81.3 on a held-out three-document eval, methodology public. |
| Anti-pattern-matching | Same stem and same option order every revisit. | Stem reworded, distractors reshuffled on every revisit. |
| Wrong-answer explanation | Generic explanation written by the model. | Quote pulled from the original PDF or slide that proves why the right answer is right. |
| Question-type mix | Mostly recall, occasionally application. | Recall + application + comparison + case-based, scored for variety. |
| Spaced repetition loop | Static list of questions; you decide when to revisit. | Weak topics resurface automatically; mastered topics decay out. |
| Anki export | Often missing; sometimes paid add-on. | One-click .apkg export including image-occlusion. |
How a question gets graded, end to end
When the held-out eval runs, this is the procedure. It's the same procedure Studyly applies to every deck a student generates. The published 81.3 isn't a marketing number; it's the average of running the steps below across three documents.
Five steps from raw output to a published score
Pull every question into the rubric
For a 100-question deck, each item is graded on all four criteria, 0 to 5 per criterion. That's 400 individual judgments, not one global thumbs up.
Verify factual correctness against the source
We retrieve the supporting passage from the original PDF or slide. If the model can't cite it, the question loses points. If the cited passage doesn't actually answer the stem, the question loses points.
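To make the idea of grounding concrete, here is a deliberately naive stand-in for passage retrieval: it picks the passage that shares the largest fraction of the claim's words. This is not how the eval retrieves passages (the method is not described here); the function names and the two sample passages are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_supporting_passage(claim: str, passages: list[str]) -> tuple[str, float]:
    """Return the passage covering the largest share of the claim's words."""
    claim_tokens = tokens(claim)
    if not claim_tokens or not passages:
        return "", 0.0

    def coverage(passage: str) -> float:
        return len(claim_tokens & tokens(passage)) / len(claim_tokens)

    best = max(passages, key=coverage)
    return best, coverage(best)


# Hypothetical source passages, written for this example.
passages = [
    "Helicase unwinds the DNA double helix at the replication fork.",
    "DNA ligase joins Okazaki fragments on the lagging strand.",
]
passage, score = best_supporting_passage(
    "Helicase unwinds DNA at the replication fork", passages
)
print(passage, score)
```

A word-overlap score only shows that a passage mentions the same terms; deciding whether the passage actually answers the stem still needs a grader.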
Check distractors for plausibility
Each wrong answer is checked for length match, topical proximity, and absence of clue words. A distractor that's three words long when the right answer is fifteen words long is a giveaway and gets penalized.
Score the deck for type coverage
We sample question types across the full deck. If 90 of 100 are 'what is X', the deck loses coverage points even if every individual question is fine.
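One way to put a number on variety is the normalized Shannon entropy of the type mix, sketched below. Entropy is our choice for the illustration, not a formula stated for the eval, and the type labels in `QUESTION_TYPES` are assumed names for the four question types.

```python
import math
from collections import Counter

QUESTION_TYPES = ("recall", "application", "comparison", "case")


def type_coverage(labels: list[str]) -> float:
    """Normalized entropy of the type mix: 0.0 is one type only, 1.0 is an even split."""
    counts = Counter(labels)
    total = sum(counts[t] for t in QUESTION_TYPES)
    if total == 0:
        return 0.0
    entropy = 0.0
    for t in QUESTION_TYPES:
        p = counts[t] / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy / math.log(len(QUESTION_TYPES))


skewed = ["recall"] * 90 + ["application"] * 10
even = [t for t in QUESTION_TYPES for _ in range(25)]
print(type_coverage(skewed), type_coverage(even))
```

Under this measure a deck that is 90% recall scores far below an evenly mixed deck, even if every individual question is fine.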
Normalize and publish
Per-criterion scores roll up to a 0-100 number. We run the same documents through Unattle, Gauntlet, and Turbolearn and publish the leaderboard. Last run: Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8.
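The roll-up from 0 to 5 grades to a 0-100 number can be sketched as follows. Equal weighting of the four criteria is an assumption made for this example, since the weighting is not stated here, and the criterion keys are hypothetical names.

```python
from statistics import mean

CRITERIA = ("factual", "clarity", "distractors", "coverage")
MAX_PER_CRITERION = 5


def deck_score(grades: list[dict[str, int]]) -> float:
    """Average each criterion over the deck, then scale the mean to 0-100."""
    # Assumption: all four criteria carry equal weight.
    per_criterion = [mean(g[c] for g in grades) for c in CRITERIA]
    return round(100 * mean(per_criterion) / MAX_PER_CRITERION, 1)


# One grade record per question; values are illustrative, not eval data.
grades = [
    {"factual": 5, "clarity": 4, "distractors": 3, "coverage": 4},
    {"factual": 5, "clarity": 4, "distractors": 3, "coverage": 4},
]
print(deck_score(grades))
```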
What sources work
Anything that came out of your professor's hands, basically. Drop in the lecture deck, the textbook chapter, the YouTube recording, the slide PDF, the handwritten notes you photographed. Studyly reads them all and grades the output the same way regardless of input.
The headline numbers, counted up
A summary of what's been said above, at a glance. The bottom three are product-level (1M+ active students, ~7,000 weekly signups, four question types). The top one is the only one that matters for this page: the held-out score.
81.3 · held-out eval score
1M+ · students using the tool
4 · question formats per source
60 seconds · from deck to drillable
Try it on your next deck
Drop a lecture in. Drill the output in 60 seconds.
Free tier, no credit card. After you submit your email we send a one-click access link and route you straight in.
Common questions about generating MCQs from your sources
What actually makes a multiple choice question 'good'?
Four things, in this order: (1) the correct answer is grounded in the source document, not the model's pretrained knowledge, (2) the stem is unambiguous so a prepared student doesn't have to read it three times, (3) the wrong answers are plausible and similar in length so the question rewards understanding instead of test-taking heuristics, and (4) the deck mixes recall, application, comparison, and case-style questions across a single source. Studyly scores generated decks on all four and publishes the score (81.3 out of 100 on a held-out three-document eval).
How is Studyly different from generating questions in ChatGPT or Gemini?
ChatGPT and Gemini will produce a list of questions from a prompt, but they don't enforce a quality rubric, they don't track which questions you've gotten right, they don't reword the stem when a question reappears, and they don't run spaced repetition. On the same held-out eval Studyly scores 81.3, well above generic chat output. The bigger gap is the loop around the questions: every revisit rephrases the wording, every wrong answer surfaces a quote from the original PDF, and weak topics get drilled more often.
What inputs does Studyly accept?
Lecture slides (PowerPoint or Keynote), PDFs, scanned textbook chapters, study guides, handwritten notes (via OCR), and YouTube lecture videos. One source or thirty in a folder. The output is the same: multiple choice questions with realistic distractors, plus three other question types from the same source.
Why does Studyly rephrase questions on revisit?
If you see the same MCQ twice in a row, you start memorizing the shape of the stem instead of the underlying concept. Studyly rewords the stem and reshuffles the distractors every time a question comes back. A student in a Studyly session sees the same fact tested four different ways across a week, never the identical sentence twice. That's the mechanism that stops 'I knew the answer because I recognized the question' from leaking into your real exam.
How were the eval numbers (81.3 / 78.0 / 68.0 / 57.8) produced?
Three source documents were held out from training. Each tool generated MCQs from those documents. Every generated question was scored 0 to 5 on each of the four criteria (factual correctness, clarity, distractor quality, question-type coverage). The scores were normalized to 0 to 100 per tool. Studyly: 81.3. Unattle: 78.0. Gauntlet: 68.0. Turbolearn: 57.8. The same three documents and the same rubric were used for every tool, with no cherry-picking.
Are the generated questions accurate enough for med school?
The product is most heavily used in medical, dental, nursing, pharmacy, PA, vet, and pre-med programs. Memorization-heavy science (anatomy, immunology, microbiology, pharmacology) is the strongest fit because every fact tied to the question maps to a sentence in the source PDF. Computational problem-solving (long-form math derivations, linear algebra) is weaker; concept questions about those subjects work, mechanical solving does not.
Can I export to Anki?
Yes. Every generated question is one-click exportable to .apkg, including image-occlusion cards for anatomy. If you have an existing Anki workflow you can keep it; Studyly is the front end that turns sources into cards faster than typing them.
Is there a free tier?
Yes. Drop in a deck, generate questions, and drill them, all without paying. The free tier limits how many decks you can keep active at once; the paid tier removes the cap. No credit card to start.