MCQ generator from PDF · page-grounded extraction
A PDF isn't one input. It's four very different ones.
A lecture-slide PDF, a scanned textbook chapter, a two-column research paper, and a stack of handwritten notes scanned to PDF each fail in their own way. Generic generators flatten them into one stream of text and lose half the structure. Studyly handles each flavor differently, then grounds every generated multiple-choice question to a specific page in the source. When you answer wrong, the explain response is a quote from that page.
The four flavors of PDF, treated four different ways
Every guide on this topic talks about "upload your PDF and get questions." Almost none acknowledge that PDFs come in shapes that break the naive pipeline. Below are the four shapes Studyly sees most often and what we do with each.
Lecture-slide PDF
One slide per page, sparse text, lots of bullet fragments. We pull the slide title as a topic anchor, glue together the bullets in their printed order, and treat each slide as a single source span. The most common input by far for med, dental, nursing, and pharmacy programs.
Scanned textbook chapter
Image-only PDF (no text layer) from a copier, library scan, or textbook ripped to PDF. Goes through an OCR pass page-by-page before the extractor sees it. Roughly 2x slower than a born-digital PDF; output quality matches.
Two-column research paper
The PDF flavor that breaks the most generic tools. We preserve column reading order, treat figure captions as their own blocks, and exclude reference lists from the question pool. You won't get an MCQ generated from the bibliography.
Handwritten notes, scanned
Same OCR pipeline as a textbook scan, but with a tighter clarity threshold because handwriting is harder to read. If a passage scores below the threshold, we skip it instead of generating a question that's likely garbled.
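To make the flavor-specific handling concrete, here is a minimal Python sketch of the parse-and-OCR decision, assuming a pdfplumber/pytesseract stack and made-up threshold values; it illustrates the shape of the logic, not Studyly's internals. Pages with a text layer pass straight through, image-only pages get an OCR pass, and passages below the clarity threshold are skipped rather than turned into questions.

```python
# Sketch: per-flavor parse decision. Illustrative only; the threshold values
# and the pdfplumber/pytesseract stack are assumptions, not Studyly's real pipeline.
from dataclasses import dataclass

import pdfplumber                      # born-digital text extraction
import pytesseract                     # OCR for image-only pages
from pdf2image import convert_from_path

MIN_OCR_CONF = 60        # assumed clarity threshold for printed scans
HANDWRITING_CONF = 80    # assumed tighter threshold for handwritten notes

@dataclass
class Span:
    page: int            # page index, kept so citations can point back later
    text: str

def extract_spans(path: str, handwritten: bool = False) -> list[Span]:
    """Return one text span per page, OCRing pages that have no text layer."""
    spans: list[Span] = []
    threshold = HANDWRITING_CONF if handwritten else MIN_OCR_CONF
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if text.strip():                       # born-digital page: no OCR needed
                spans.append(Span(page=i, text=text))
                continue
            # Image-only page: rasterize just this page and OCR it.
            image = convert_from_path(path, first_page=i + 1, last_page=i + 1)[0]
            data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
            words = [w for w, c in zip(data["text"], data["conf"])
                     if w.strip() and float(c) >= threshold]
            if words:                              # below-threshold passages are skipped
                spans.append(Span(page=i, text=" ".join(words)))
    return spans
```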
Anchor fact · the thing no other tool in this category does
Every generated MCQ carries a page citation back to your PDF.
When the extractor produces a question, it stores the page index of the passage the answer came from. The page index travels with the question for the rest of its life: through revisits, through Anki export, through the explain response.
When you answer wrong and tap explain, the response is built from a verbatim quote on that page. Not paraphrased. Not retrieved from the model's training data. Pulled directly from the bytes of the PDF you uploaded an hour ago. If the model can't produce a passage from your source that supports the right answer, the question never ships to you in the first place.
That is the part of this product that you can't fake with a general-purpose chatbot, and it is the part most articles about this topic skip.
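As a concrete illustration of that claim, here is a minimal sketch of the data shape it implies. The field names are assumptions rather than Studyly's actual schema; the point is that the citation lives on the question object itself, so nothing downstream can drop it.

```python
# Sketch of a question that carries its page citation for life.
# Field names are illustrative, not Studyly's schema.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PageCitation:
    source_pdf: str        # the file the student uploaded
    page_index: int        # page the supporting passage came from
    quote: str             # verbatim passage that supports the correct answer

@dataclass
class GroundedMCQ:
    stem: str
    options: list[str]
    correct_index: int
    citation: PageCitation                      # travels through revisits, export, explain
    distractor_rationale: dict[str, str] = field(default_factory=dict)
```

Because the citation is a required field rather than an afterthought, revisit rewording, Anki export, and the explain response all receive it automatically.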
What happens when you get a PDF MCQ wrong
The diagram below is the call flow that runs when you tap explain my mistake. The student app asks the question store for the citation; the store returns a page reference; the citation gets resolved against the original PDF you uploaded; the response comes back with a quote.
explain my mistake — call flow
Below is what the student actually sees. Notice the explain block quotes the page exactly, and notice the page reference is real (page 47, line 18 of Anatomy I — Lecture 4.pdf).
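In code terms, that flow reduces to a single resolution step against the stored PDF. The sketch below reuses the GroundedMCQ shape from the earlier sketch and is illustrative only; the failure branch is the important part, because a citation that no longer resolves means the question should be regenerated, not explained from memory.

```python
# Sketch of the explain-my-mistake flow. Names are illustrative; the point is
# that the explanation is assembled from the stored citation, never from the
# model's prior knowledge.
import pdfplumber

def explain_mistake(question: "GroundedMCQ", picked_index: int, pdf_path: str) -> str:
    cite = question.citation
    with pdfplumber.open(pdf_path) as pdf:
        page_text = pdf.pages[cite.page_index].extract_text() or ""
    if cite.quote not in page_text:            # quote must still resolve verbatim
        raise ValueError("citation no longer resolves; regenerate the question")
    picked = question.options[picked_index]
    return (
        f"Correct answer: {question.options[question.correct_index]}\n"
        f'Source (page {cite.page_index + 1} of {cite.source_pdf}): "{cite.quote}"\n'
        f"Why {picked} is wrong: {question.distractor_rationale.get(picked, '')}"
    )
```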
The pipeline a PDF takes through Studyly
Inputs on the left, page-grounded extractor in the middle, four downstream question types on the right. The extractor is the same regardless of which PDF flavor came in; the parse stage in front of it is what differs.
PDF in, four question types out
What changes between a generic generator and a grounded one
The toggle below shows the same student getting the same wrong answer. On the left, a generic AI MCQ generator. On the right, the explain response after Studyly produced the question.
Same wrong answer, two very different explanations
The right answer is C, interventricular septum. The septum is what divides the heart's ventricles. The other options are also heart structures but don't separate the chambers.
- No reference to the PDF you uploaded
- Could be true for any anatomy textbook in the world
- If the question is wrong, the explanation is wrong with it
Preflight: every check that runs before a PDF MCQ ships
A question that fails any of these checks is regenerated. You don't see broken intermediate output. The bottom three checks are what separate a generator built for PDFs from an MCQ generator that happens to read them. A sketch of the gate follows the checklist.
What every PDF-derived MCQ has to pass
- PDF is parsed with column-aware reading order, not left-to-right blind.
- Every generated MCQ stores a page index back into the source PDF.
- Reference lists, bibliographies, and copyright pages are excluded from the question pool.
- Figure captions are extracted as their own blocks, not pulled into nearby paragraphs.
- Image-only pages run through OCR before the extractor sees them.
- Each MCQ passes the four-criterion rubric (factual, clarity, distractor, coverage).
- If the model can't cite a passage from your PDF, the question is regenerated, not shipped.
- On revisit, the stem is reworded and the distractors are reshuffled.
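Here is a minimal sketch of that gate, assuming the GroundedMCQ shape from earlier and a handful of illustrative checks standing in for the full checklist above: every check must pass, and a failure triggers regeneration instead of shipping.

```python
# Sketch of the preflight gate. Check names and the retry loop are illustrative,
# not the production rule set.
from typing import Callable

Check = Callable[["GroundedMCQ"], bool]

PREFLIGHT_CHECKS: dict[str, Check] = {
    "has_page_index": lambda q: q.citation.page_index >= 0,
    "cites_source":   lambda q: bool(q.citation.quote.strip()),
    "four_options":   lambda q: len(q.options) == 4,
    "single_answer":  lambda q: 0 <= q.correct_index < len(q.options),
}

def preflight(question: "GroundedMCQ", regenerate: Callable[[], "GroundedMCQ"],
              max_attempts: int = 3) -> "GroundedMCQ":
    """Return a question that passes every check, regenerating on failure."""
    for _ in range(max_attempts):
        failed = [name for name, check in PREFLIGHT_CHECKS.items() if not check(question)]
        if not failed:
            return question
        question = regenerate()          # broken intermediates never reach the student
    raise RuntimeError(f"could not produce a passing question: {failed}")
```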
The held-out eval, in numbers
Three source documents (PDFs, including a slide deck and a textbook chapter) were held out. Each tool generated MCQs from the same three documents. Every output was graded on the four-criterion rubric. Same documents, same rubric, same graders.
Higher is better. The 23.5-point gap between Studyly and Turbolearn is the difference between a PDF deck where most questions are usable and one where you spend half your study time editing the model's mistakes.
The five-stage pipeline, named
From the moment you drop a PDF in until the deck is drillable. Stage one is the one that changes per flavor; the rest is the same path for every PDF. A sketch of the stages chained together follows the list.
What runs between drop and drillable
PDF parse
Born-digital text, or OCR for scans.
Page index
Every passage tagged with its page number.
Coverage scan
Topic map across all pages.
Rubric gate
Four-criterion check before output.
Drillable deck
MCQs + free response + case + image-occlusion.
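Chained together, the five stages look roughly like the sketch below. extract_spans and preflight come from the earlier sketches; map_topics and draft_question are hypothetical stand-ins for the coverage scan and the generation model call, not real functions.

```python
# Sketch: the five stages end-to-end. map_topics and draft_question are
# hypothetical names standing in for the coverage scan and the model call.
def build_deck(pdf_path: str, handwritten: bool = False) -> list["GroundedMCQ"]:
    spans = extract_spans(pdf_path, handwritten)     # 1. PDF parse (text layer or OCR)
                                                     # 2. Page index: every span is already tagged
    topics = map_topics(spans)                       # 3. Coverage scan across all pages
    deck: list["GroundedMCQ"] = []
    for topic, topic_spans in topics.items():
        draft = draft_question(topic, topic_spans)
        deck.append(preflight(draft, lambda: draft_question(topic, topic_spans)))  # 4. Rubric gate
    return deck                                      # 5. Drillable deck
```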
Studyly vs. a typical AI MCQ tool, on PDF-specific behavior
Comparison rows below are PDF-flavor specific. We're using the median behavior of the field, not the worst tool we tested.
| Feature | Typical AI MCQ tool | Studyly |
|---|---|---|
| Page-level citations | Output is a list of questions; no link back to the PDF. | Every MCQ stores a page reference; explain pulls a quote from that page. |
| Multi-column PDF handling | Reads left-to-right across both columns; output is mangled. | Column reading order preserved; figure captions treated as their own block. |
| Scanned image-only PDF | Empty output, or generic questions written from the model's prior knowledge. | OCR pass per page, then the same extractor with a 2x slowdown. |
| Long PDFs (300+ pages) | Truncates at the model's context limit; tail of the document is ignored. | Page-by-page extraction, then a coverage pass; questions spread across chapters. |
| Reference list / bibliography | Sometimes generates MCQs from author names and DOIs. | Excluded from the question pool; you'll never see an MCQ asking about a citation. |
| Wrong-answer explanation | Generic explanation written by the model. | Verbatim quote from the source PDF page that proves the right answer. |
PDF formats this works on
Anything that came off a copier, a slide export, a paper download, or a phone camera. Drop in one PDF or a folder of thirty.
The PDF numbers, at a glance
The held-out eval score is the only one that's about output quality. The others describe what the pipeline actually accepts as input.
- Held-out eval score: 81.3 / 100
- Pages per PDF supported: no hard cap; 500-page textbooks work
- PDF flavors handled separately: 4
- From drop to drillable: about 60 seconds
Try it on tomorrow's lecture PDF
Drop a PDF in. Drill the output in 60 seconds.
Free tier, no credit card. After you submit your email we send a one-click access link and route you straight in.
Common questions about generating MCQs from a PDF
Will it work on a scanned PDF where the text is just an image?
Yes. If the PDF is text-layer-free (a scan from a copier, or a textbook chapter saved as an image-only PDF), Studyly runs OCR over each page before the extractor sees it. You'll know it ran because the resulting questions still cite a page number, just like a normal text PDF. The slowdown vs a born-digital PDF is roughly 2x because the OCR pass has to run.
What does 'page-grounded' actually mean in a generated question?
Every generated MCQ carries an internal pointer back to the page of the source PDF the answer came from. When you answer wrong and tap explain, the response is built from a quote on that page, not from the model's pretrained knowledge. If the model can't surface a passage from your PDF that supports the right answer, the question never ships in the first place.
Can it handle a 500-page textbook chapter?
Yes. The extractor runs page-by-page, then does a pass for question-type coverage across the whole document. There is no 10K-token truncation; long sources don't lose their tail. For very large textbooks (300+ pages) you typically get a deck where the questions are spread evenly across chapters, not bunched at the front.
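One way to picture the coverage pass on a long document: allocate the question budget per chunk of pages up front, rather than drafting front to back until a context limit cuts things off. The chunk size and the even split below are illustrative assumptions, not the actual allocation policy.

```python
# Sketch: spread a question budget evenly across page chunks of a long PDF so
# the tail chapters get questions too. Chunk size is an illustrative choice.
def allocate_quota(num_pages: int, total_questions: int, chunk_pages: int = 25) -> dict[range, int]:
    chunks = [range(start, min(start + chunk_pages, num_pages))
              for start in range(0, num_pages, chunk_pages)]
    base, remainder = divmod(total_questions, len(chunks))
    return {chunk: base + (1 if i < remainder else 0) for i, chunk in enumerate(chunks)}

# e.g. a 500-page textbook with a 60-question budget: 20 chunks of 25 pages,
# 3 questions each, instead of 60 questions about the first chapter.
```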
What about a two-column research paper or a paper with sidebars and figure captions?
Multi-column PDFs are where most generic generators break: they read left-to-right across both columns and produce nonsense. Studyly preserves column reading order and treats figure captions as their own block. Equations and figures don't get pulled into question stems because the system can tell they're a different content type.
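A toy version of the column-ordering idea, using pdfplumber word boxes: split words at the page midline, read each column top to bottom, then concatenate. Real layouts need more than a midline split (captions, sidebars, unequal columns), so treat this strictly as an illustration of why position-aware ordering beats reading straight across.

```python
# Sketch: column-aware reading order for a simple two-column page.
# Not a full layout model; captions and sidebars need their own handling.
import pdfplumber

def two_column_text(page: "pdfplumber.page.Page") -> str:
    words = page.extract_words()                 # each word carries x0, top, text
    mid = page.width / 2
    left = [w for w in words if w["x0"] < mid]
    right = [w for w in words if w["x0"] >= mid]
    ordered = (sorted(left, key=lambda w: (w["top"], w["x0"]))
               + sorted(right, key=lambda w: (w["top"], w["x0"])))
    return " ".join(w["text"] for w in ordered)
```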
Are handwritten notes scanned to PDF supported?
Yes, with caveats. Clean scans of legible handwriting work well; the OCR pass is the same one used on scanned textbook chapters. If your handwriting is hard to read or the scan is faint, the resulting questions tend to be lower-quality on the clarity criterion (one of the four scored). You'll see the score on the deck before drilling.
How does this compare to dropping a PDF into ChatGPT or Gemini and asking for MCQs?
ChatGPT and Gemini will both read the PDF and produce questions. They won't enforce a four-criterion rubric, they don't track which questions you've gotten right across sessions, they don't reword the stem on revisit so you stop pattern-matching, and they don't run page-level grounding when you ask why an answer was right. On the same held-out three-document eval Studyly scores 81.3 / 100; generic chat output scores noticeably lower on distractor quality and type coverage.
What's actually shown when I get an MCQ wrong?
A short response that names the right answer, quotes the supporting passage from your PDF (with the page number), and explains in one or two sentences why the distractor you picked is wrong. The quote is verbatim from your source. You can click through to open the PDF on that page if you want to read the full context.
Can I export the generated questions to Anki?
Yes. Every generated question is one-click exportable to .apkg, including image-occlusion cards for anatomy figures pulled out of the PDF. The page-citation metadata travels with the export, so even in Anki you'll see which page of the source the card came from.
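For readers who want to see what "the citation travels with the export" can look like mechanically, here is a sketch using the open-source genanki library. The model ID, field layout, and deck name are illustrative, not Studyly's export schema; the point is that the source page rides along as a field on every card.

```python
# Sketch: a page-cited .apkg export with genanki. Field layout is illustrative.
import genanki

MODEL = genanki.Model(
    1607392319,
    "Page-cited MCQ",
    fields=[{"name": "Question"}, {"name": "Answer"}, {"name": "SourcePage"}],
    templates=[{
        "name": "Card 1",
        "qfmt": "{{Question}}",
        "afmt": "{{FrontSide}}<hr id=answer>{{Answer}}<br><i>{{SourcePage}}</i>",
    }],
)

def export_deck(questions: list["GroundedMCQ"], out_path: str = "deck.apkg") -> None:
    deck = genanki.Deck(2059400110, "Anatomy I - Lecture 4")
    for q in questions:
        cite = q.citation
        deck.add_note(genanki.Note(model=MODEL, fields=[
            q.stem,
            q.options[q.correct_index],
            f"p. {cite.page_index + 1} of {cite.source_pdf}",   # citation travels with the card
        ]))
    genanki.Package(deck).write_to_file(out_path)
```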