Matthew Diakonov
11 min read

Multiple choice question generator · the four-criterion rubric

A multiple choice question generator is only as good as the questions it makes.

Most tools in this category compete on inputs (PDFs, slides, YouTube) and counts (200 questions in 60 seconds). Almost none publish how their output scores when you grade it. Studyly does. The held-out three-document eval, the four criteria, and the rephrase mechanism are all described below.

4.8 from 11,400 reviews
81.3 on the held-out three-document quality eval
Used by 1M+ students across med, dental, nursing, PA, pharmacy, vet
Free tier, no credit card

The four criteria a generated MCQ has to pass

Most articles about this topic spend their word count on input formats and counts. Almost none answer the actual question: when a model spits out a multiple-choice question, what tells you it's a usable question and not a noise-grade one?

Here is the rubric Studyly uses internally on every generated deck. It is the same rubric the held-out eval below scores tools against.

Factual correctness

The correct answer must be a sentence (or paraphrase of a sentence) found in the source document. We grade by retrieving the supporting passage from the original PDF or slide deck. Pretrained model knowledge does not count; if the answer isn't in your lecture, it isn't on your test.

Clarity

The stem reads cleanly on the first pass. No double negatives, no two questions stitched into one, no nesting that forces you to re-read. A prepared student should know what is being asked from the first sentence.

Distractor quality

Wrong answers are plausible. They're the same length as the correct one (within a few words), they're drawn from related concepts in the same chapter, and there is no 'all of the above' or obviously throwaway option. The question rewards understanding, not test-taking heuristics.

Question type coverage

A good output deck mixes recall (what is X), application (which patient gets X), comparison (X vs Y), and case-based (a 32-year-old presents with…) questions. We score how varied the output is across one source. Decks that only test recall score lower.

From a folder of sources to a graded deck

The diagram below is the path a single source takes through Studyly. Anything testable on the left, the rubric in the middle, four downstream question types on the right. The thing that's unusual is that the rubric is a hard pre-output gate, not a post-hoc scorecard. Questions that fall below threshold are regenerated before they reach you.

Studyly question pipeline (diagram): lecture slides, source PDFs, YouTube lectures, and notes feed into the four-criterion rubric, which outputs MCQ, free-response, case-style, and image-occlusion questions.

The held-out eval, in numbers

Three source documents were held out from any tool's training. Each tool was given the same three documents and asked to generate MCQs. Every output was graded on the rubric above. Same documents, same rubric, same graders, four tools.

81.3 Studyly
78.0 Unattle
68.0 Gauntlet
57.8 Turbolearn

Higher is better. The gap between Studyly and Turbolearn (23.5 points) is the difference between a deck where most questions are usable and one where you spend half your study time editing the model's mistakes. The gap between Studyly and Unattle (3.3 points) is smaller and shows up mostly in distractor quality and type coverage.

Anchor fact · how Studyly stops you from pattern-matching

Every revisit, the wording changes.

The first time you see "which enzyme unwinds DNA at the replication fork," you might know the answer because you read the lecture last night. The second time you see the exact same sentence, you might answer it because you remember the shape of the words, not the biology. That is a real problem with most generators: their decks are static lists.

Studyly rewords every question on revisit and reshuffles the four options. The underlying fact stays the same. The surface form doesn't. After a week of drilling you have seen the same fact tested four different ways, never the identical sentence twice.

Below is one question, generated once and surfaced twice across a three-day study session. Notice that the correct answer is still helicase; the words around it are not.

auto-rephrase across two revisits
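A minimal sketch of the reword-and-reshuffle idea follows. It is not Studyly's implementation: the `Fact` class and `present` function are hypothetical, the second stem is a hand-written rewording for illustration, and a real system would presumably generate fresh rewordings rather than rotate through a fixed list.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fact:
    stems: Tuple[str, ...]        # alternative wordings of the same question
    correct: str
    distractors: Tuple[str, ...]


def present(fact: Fact, revisit: int, rng: random.Random) -> Tuple[str, List[str], int]:
    """Pick a wording for this revisit, shuffle the options, and return
    (stem, options, index_of_correct_option)."""
    stem = fact.stems[revisit % len(fact.stems)]
    options = [fact.correct, *fact.distractors]
    rng.shuffle(options)
    return stem, options, options.index(fact.correct)


helicase = Fact(
    stems=(
        "Which enzyme unwinds DNA at the replication fork?",
        # Hypothetical rewording, written by hand for this example.
        "At the replication fork, the two DNA strands are separated by which enzyme?",
    ),
    correct="Helicase",
    distractors=("Ligase", "Primase", "Topoisomerase"),
)

rng = random.Random(0)  # seeded only to make the example repeatable
first = present(helicase, 0, rng)
second = present(helicase, 1, rng)
```

The correct answer is tracked by value and its index is recomputed after each shuffle, so the surface form can change freely while the underlying fact stays fixed.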

What we mean by "a bad multiple-choice question"

Every checked item below is a check Studyly runs before a question is shown to you. If the question fails any of them, it gets regenerated. You never see the broken intermediate output.

What every generated MCQ has to pass

  • Stem is one sentence and not a double negative.
  • Correct answer maps to a passage in the source PDF.
  • All four options are within ~25% length of each other.
  • Distractors are drawn from related concepts in the same chapter.
  • No 'all of the above' or 'none of the above'.
  • No obviously wrong throwaway option.
  • No clue word in the stem that gives away the answer.
  • Mixes recall, application, comparison, and case-style across the deck.
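Two of the checks in this list are mechanical enough to sketch in code. The version below is an assumption-laden illustration: the function names are hypothetical, "within ~25% length" is read as "the shortest option is at least 75% as long as the longest, in characters", and the other checks (grounding, clue words, type mix) are left out.

```python
from typing import List

BANNED_OPTIONS = {"all of the above", "none of the above"}


def has_banned_option(options: List[str]) -> bool:
    """True if any option is 'all of the above' or 'none of the above'."""
    return any(o.strip().lower().rstrip(".") in BANNED_OPTIONS for o in options)


def lengths_balanced(options: List[str], tolerance: float = 0.25) -> bool:
    """True if the spread between the shortest and longest option is within
    `tolerance` of the longest option's length (one reading of '~25%')."""
    lengths = [len(o.strip()) for o in options]
    longest = max(lengths)
    if longest == 0:
        return True
    return (longest - min(lengths)) / longest <= tolerance


def failed_checks(options: List[str]) -> List[str]:
    """Names of the checks this option set fails; empty means it passes."""
    failures = []
    if has_banned_option(options):
        failures.append("banned_option")
    if not lengths_balanced(options):
        failures.append("length_mismatch")
    return failures


balanced = ["DNA helicase", "DNA ligase", "DNA primase", "DNA gyrase"]
throwaway = ["Helicase", "Ligase", "Primase", "All of the above"]
```

Under this reading, `failed_checks(balanced)` is empty, while `throwaway` fails both checks: it contains a banned option, and that option is far longer than the rest.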

Studyly vs. a typical AI MCQ generator

The comparison below is conservative. We're using the median of the field, not the worst tool we tested.

| Feature | Typical AI MCQ tool | Studyly |
| --- | --- | --- |
| Quality benchmark, published | No score, no methodology. | 81.3 on a held-out three-document eval, methodology public. |
| Anti-pattern-matching | Same stem and same option order every revisit. | Stem reworded, distractors reshuffled on every revisit. |
| Wrong-answer explanation | Generic explanation written by the model. | Quote pulled from the original PDF or slide that proves why the right answer is right. |
| Question-type mix | Mostly recall, occasionally application. | Recall + application + comparison + case-based, scored for variety. |
| Spaced repetition loop | Static list of questions; you decide when to revisit. | Weak topics resurface automatically; mastered topics decay out. |
| Anki export | Often missing; sometimes paid add-on. | One-click .apkg export including image-occlusion. |

How a question gets graded, end to end

When the held-out eval runs, this is the procedure. It's the same procedure Studyly applies to every deck a student generates. The published 81.3 isn't a marketing number; it's the average of running the steps below across three documents.

Five steps from raw output to a published score

1. Pull every question into the rubric

For a 100-question deck, each item is graded on all four criteria, 0 to 5 per criterion. That's 400 individual judgments, not one global thumbs up.
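One way to picture the per-question record is a small data structure with one 0-to-5 score per criterion. The `RubricScore` class and `judgment_count` helper below are hypothetical names, and the example scores are made up; the sketch only shows the shape of the data and the 100 × 4 = 400 arithmetic.

```python
from dataclasses import dataclass
from typing import List

CRITERIA = ("factual_correctness", "clarity", "distractor_quality", "type_coverage")


@dataclass(frozen=True)
class RubricScore:
    factual_correctness: int
    clarity: int
    distractor_quality: int
    type_coverage: int

    def __post_init__(self) -> None:
        # Every criterion is graded on the same 0-5 scale.
        for name in CRITERIA:
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be between 0 and 5, got {value}")


def judgment_count(deck: List[RubricScore]) -> int:
    """Number of individual judgments: one per question per criterion."""
    return len(deck) * len(CRITERIA)


# A made-up 100-question deck where every question got the same scores.
deck = [RubricScore(5, 4, 3, 4)] * 100
total_judgments = judgment_count(deck)
```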

2. Verify factual correctness against the source

We retrieve the supporting passage from the original PDF or slide. If the model can't cite it, the question loses points. If the cited passage doesn't actually answer the stem, the question loses points.
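The two failure modes above can be approximated with very crude string checks, sketched below. This is an assumption, not the grading method used in the eval: exact matching after whitespace and case normalization will miss paraphrases, and "the answer appears in the passage" is only a weak proxy for "the passage answers the stem". All function names are hypothetical.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase, so line breaks in a PDF extract
    do not defeat a simple substring match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def grounding_failures(source_text: str, cited_passage: str, answer: str) -> List[str]:
    """Return the grounding checks that fail for one question."""
    passage = normalize(cited_passage)
    if not passage or passage not in normalize(source_text):
        return ["citation_not_in_source"]
    if normalize(answer) not in passage:
        return ["answer_not_in_passage"]
    return []


source = (
    "Helicase unwinds the DNA double helix\n"
    "at the replication fork. Ligase joins Okazaki fragments."
)

ok = grounding_failures(
    source, "helicase unwinds the DNA double helix at the replication fork", "Helicase")
uncited = grounding_failures(source, "Primase unwinds DNA", "Primase")
off_target = grounding_failures(source, "Ligase joins Okazaki fragments.", "Helicase")
```

A production grader would more plausibly use fuzzy or semantic matching for the first check and a human or model judgment for the second.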

3. Check distractors for plausibility

Each wrong answer is checked for length match, topical proximity, and absence of clue words. A distractor that's three words long when the right answer is fifteen words long is a giveaway and gets penalized.
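The clue-word part of this step can be sketched as a set comparison: a word is suspicious if it appears in the stem and in the correct option but in none of the distractors. The stop-word list, the three-letter minimum, and the function names are assumptions for illustration, and the example question is made up.

```python
import re
from typing import Iterable, Set

# Small, assumed stop-word list; a real check would use a fuller one.
STOPWORDS = {"the", "which", "what", "and", "that", "with", "for", "from", "are", "was"}


def content_words(text: str) -> Set[str]:
    """Lowercased words of three or more letters, minus stop words."""
    return {
        w for w in re.findall(r"[a-z]+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    }


def clue_words(stem: str, correct: str, distractors: Iterable[str]) -> Set[str]:
    """Words shared by the stem and the correct option but absent from
    every distractor, which can give the answer away."""
    shared = content_words(stem) & content_words(correct)
    in_distractors: Set[str] = set()
    for d in distractors:
        in_distractors |= content_words(d)
    return shared - in_distractors


leaky = clue_words(
    "Which enzyme relieves supercoiling ahead of the replication fork?",
    "Topoisomerase, which relieves torsional strain",
    [
        "Helicase, which separates the strands",
        "Ligase, which seals nicks",
        "Primase, which lays down RNA primers",
    ],
)
clean = clue_words(
    "Which enzyme unwinds DNA at the replication fork?",
    "Helicase",
    ["Ligase", "Primase", "Topoisomerase"],
)
```

In the first question the word "relieves" appears in the stem and only in the correct option, so a student can match on the verb without knowing the biology.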

4. Score the deck for type coverage

We sample question types across the full deck. If 90 of 100 are 'what is X', the deck loses coverage points even if every individual question is fine.
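The page does not give the coverage formula, so the sketch below substitutes one plausible choice: normalized Shannon entropy over the four question types, scaled to the 0-to-5 range. The formula, the function name, and the type labels are assumptions, not the published method.

```python
import math
from collections import Counter
from typing import Iterable

QUESTION_TYPES = ("recall", "application", "comparison", "case_based")


def coverage_score(types: Iterable[str], max_points: float = 5.0) -> float:
    """Scale the normalized entropy of the type distribution to 0..max_points.
    An even four-way split scores max_points; a single-type deck scores 0."""
    counts = Counter(types)
    total = sum(counts[t] for t in QUESTION_TYPES)
    if total == 0:
        return 0.0
    entropy = 0.0
    for t in QUESTION_TYPES:
        if counts[t]:
            p = counts[t] / total
            entropy -= p * math.log(p)
    return max_points * entropy / math.log(len(QUESTION_TYPES))


balanced = ["recall", "application", "comparison", "case_based"] * 25
# 90 of 100 are recall, as in the example above; the split of the
# remaining 10 is made up.
skewed = ["recall"] * 90 + ["application"] * 4 + ["comparison"] * 3 + ["case_based"] * 3

balanced_score = coverage_score(balanced)
skewed_score = coverage_score(skewed)
```

With this choice, the 90-recall deck lands well under half the available coverage points even though no individual question is penalized.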

5. Normalize and publish

Per-criterion scores roll up to a 0-100 number. We run the same documents through Unattle, Gauntlet, and Turbolearn and publish the leaderboard. Last run: Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8.
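The exact roll-up is not spelled out, so the sketch below assumes the simplest one: average each criterion across the deck, average the four criterion means with equal weight, and rescale from 0-5 to 0-100. The `deck_score` function and the two example questions are hypothetical; the leaderboard numbers are the published ones from the last run.

```python
from statistics import mean
from typing import Dict, Sequence, Tuple

QuestionScores = Tuple[int, int, int, int]  # one 0-5 score per criterion


def deck_score(scores: Sequence[QuestionScores], max_points: int = 5) -> float:
    """Equal-weight roll-up (assumed): mean per criterion, then mean of
    those means, rescaled to 0-100 and rounded to one decimal."""
    per_criterion = [mean(column) for column in zip(*scores)]
    return round(100 * mean(per_criterion) / max_points, 1)


# Made-up two-question deck: criterion means are 4.5, 4, 3.5, 3.
example = deck_score([(5, 4, 3, 2), (4, 4, 4, 4)])

# Published leaderboard from the last run.
published: Dict[str, float] = {
    "Studyly": 81.3, "Unattle": 78.0, "Gauntlet": 68.0, "Turbolearn": 57.8,
}
gaps = {
    tool: round(published["Studyly"] - value, 1)
    for tool, value in published.items()
    if tool != "Studyly"
}
```

The `gaps` dictionary reproduces the differences quoted earlier on this page: 3.3 points to Unattle and 23.5 points to Turbolearn.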

What sources work

Anything that came out of your professor's hands, basically. Drop in the lecture deck, the textbook chapter, the YouTube recording, the slide PDF, the handwritten notes you photographed. Studyly reads them all and grades the output the same way regardless of input.

  • Lecture slides
  • .pptx
  • .pdf
  • Textbook scans
  • Study guides
  • Handwritten notes
  • YouTube lectures
  • .docx
  • Class recordings
  • Anki decks

The headline numbers, counted up

A summary of what's been said above, at a glance. The bottom three are product-level (1M+ active students, ~7,000 weekly signups, four question types). The top one is the only one that matters for this page: the held-out score.

81.3 held-out eval score
1M+ students using the tool
4 question formats per source
60s from deck to drillable

Try it on your next deck

Drop a lecture in. Drill the output in 60 seconds.

Free tier, no credit card. After you submit your email we send a one-click access link and route you straight in.

Common questions about generating MCQs from your sources

What actually makes a multiple choice question 'good'?

Four things, in this order: (1) the correct answer is grounded in the source document, not the model's pretrained knowledge, (2) the stem is unambiguous so a prepared student doesn't have to read it three times, (3) the wrong answers are plausible and similar in length so the question rewards understanding instead of test-taking heuristics, and (4) the deck mixes recall, application, comparison, and case-style questions across a single source. Studyly scores generated decks on all four and publishes the score (81.3 out of 100 on a held-out three-document eval).

How is Studyly different from generating questions in ChatGPT or Gemini?

ChatGPT and Gemini will produce a list of questions from a prompt, but they don't enforce a quality rubric, they don't track which questions you've gotten right, they don't reword the stem when a question reappears, and they don't run spaced repetition. On the same held-out eval Studyly scores 81.3, well above generic chat output. The bigger gap is the loop around the questions: every revisit rephrases the wording, every wrong answer surfaces a quote from the original PDF, and weak topics get drilled more often.

What inputs does Studyly accept?

Lecture slides (PowerPoint or Keynote), PDFs, scanned textbook chapters, study guides, handwritten notes (via OCR), and YouTube lecture videos. One source or thirty in a folder. The output is the same: multiple choice questions with realistic distractors, plus three other question types from the same source.

Why does Studyly rephrase questions on revisit?

If you see the same MCQ twice in a row, you start memorizing the shape of the stem instead of the underlying concept. Studyly rewords the stem and reshuffles the distractors every time a question comes back. A student in a Studyly session sees the same fact tested four different ways across a week, never the identical sentence twice. That's the mechanism that stops 'I knew the answer because I recognized the question' from leaking into your real exam.

How were the eval numbers (81.3 / 78.0 / 68.0 / 57.8) produced?

Three source documents were held out from training. Each tool generated MCQs from those documents. Every generated question was scored 0 to 5 on each of the four criteria (factual correctness, clarity, distractor quality, question-type coverage). The scores were normalized to 0 to 100 per tool. Studyly: 81.3. Unattle: 78.0. Gauntlet: 68.0. Turbolearn: 57.8. The same three documents and rubric were used for every tool, no cherry-picking.

Are the generated questions accurate enough for med school?

The product is most heavily used in medical, dental, nursing, pharmacy, PA, vet, and pre-med programs. Memorization-heavy science (anatomy, immunology, microbiology, pharmacology) is the strongest fit because every fact tied to the question maps to a sentence in the source PDF. Computational problem-solving (long-form math derivations, linear algebra) is weaker; concept questions about those subjects work, mechanical solving does not.

Can I export to Anki?

Yes. Every generated question is one-click exportable to .apkg, including image-occlusion cards for anatomy. If you have an existing Anki workflow you can keep it; Studyly is the front end that turns sources into cards faster than typing them.

Is there a free tier?

Yes. Drop in a deck, generate questions, drill them, all without paying. Free tier limits how many decks you can keep alive at once; paid removes the cap. No credit card to start.
