Active recall question generator · the diagnostic test
A question generator is active recall only if you cannot pattern-match the question on the second revisit.
Most tools that call themselves an active recall question generator print a static deck and rely on a spaced repetition timer to make it feel productive. By review #3 you remember the wording, not the biology. The diagnostic below distinguishes a recognition test in disguise from a generator that delivers actual retrieval practice.
Direct answer · verified 2026-05-04
An active recall question generator delivers what the label promises only when the same fact reappears in different surface forms across revisits, not just at different times. Studyly does this with two layers: an auto-rephrase pass on every revisit (the stem is reworded, the distractor pool rotated) and four formats generated from one source (MCQ, free-response, case-style, image-occlusion). On a held-out three-document quality eval, Studyly scored 81.3 against Unattle's 78.0, Gauntlet's 68.0, and Turbolearn's 57.8. Methodology and the leaderboard are public on /quality.
The cognitive science that the label is borrowing from
The phrase "active recall" is a marketing rephrasing of three findings from the memory literature. A tool either implements all three or it is selling the term without the mechanism. Reading the three together is the simplest way to spot which side of the line a given product sits on.
Encoding vs retrieval
Reading and highlighting are encoding behaviors. They feel like learning because the page lights up with familiar words. Retrieval is the opposite: the page is gone and you have to reproduce the fact. The testing effect (Roediger and Karpicke, 2006) is the body of work establishing that testing produces stronger long-term retention than re-reading the same material for the same total time. A question generator has to deliver retrieval, not just produce questions.
Recognition vs recall
Recognition is when the cue (the question stem, the option list) carries enough information that you can identify the answer without reproducing it. Recall is when the cue is minimal and the answer has to be reproduced from memory. Active recall is the recall side of the line. A generator that always shows you the same MCQ wording is on the recognition side; you are matching the question to a memorized answer key.
Surface-form variability
If the same fact reaches you on review #1 as 'which enzyme unwinds DNA at the replication fork' and on review #5 as 'at the replication fork, which protein opens the double helix so synthesis can start,' your brain has to retrieve the underlying biology each time, not the sentence shape. The literature on transfer-appropriate processing (Morris, Bransford, and Franks, 1977) is the academic frame for why this matters: tested under varied retrieval cues, knowledge transfers to new situations, including the ones on the exam.
The seven-question diagnostic
Run a tool through these seven checks before you spend a study cycle on it. Anything below five out of seven is producing a recognition test in disguise; the spaced repetition timer is the only thing keeping it from being obvious. Studyly checks every box on this list, which is the work the rest of the page substantiates.
Is this an active recall question generator?
- When the same fact comes back on the next review, is the stem worded differently?
- Are the wrong answers in an MCQ the same length as the correct one (within ~25%)?
- Does the explanation on a wrong answer quote a passage from your own source, not a generic model summary?
- Can the same fact appear as a free-response prompt sometimes and an MCQ other times?
- Is the deck graded on a rubric that distinguishes recall, application, comparison, and case-style?
- Do weak topics resurface more often, and mastered topics decay out?
- Is there a quality score published on a held-out source the tool was not trained on?
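The seven checks above can be run as a literal scorecard. A minimal sketch, with the check names paraphrased from the list and the 5/7 pass bar taken from the text; `diagnose` and `CHECKS` are illustrative names, not any tool's API:

```python
# The seven diagnostic checks, paraphrased from the list above.
CHECKS = [
    "stem reworded on revisit",
    "distractor length within ~25% of correct answer",
    "wrong-answer explanation quotes your source",
    "same fact rotates between free-response and MCQ",
    "deck graded on recall/application/comparison/case rubric",
    "weak topics resurface, mastered topics decay out",
    "quality score published on a held-out source",
]

def diagnose(results: dict[str, bool]) -> str:
    """Score a tool against the checklist; 5/7 is the pass bar."""
    passed = sum(results.get(check, False) for check in CHECKS)
    verdict = ("active recall generator" if passed >= 5
               else "recognition test in disguise")
    return f"{passed}/7 · {verdict}"
```

A tool that only reschedules identical cards scores on the timer question and nothing else, which is exactly the failure mode the checklist is built to expose.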
Anchor fact · the same fact, five surface forms
Watch one fact get tested across a week.
The terminal below is one underlying fact (the interventricular septum separates the right and left ventricles) tracked across five review sessions. The wording rotates every session. Review #5 switches into a case-style stem. The correct answer never moves; the path your brain takes to find it always does. The component that drives this lives in the codebase at src/components/RephraseCarousel.tsx.
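The data shape behind this is simple: one fixed answer, several interchangeable stems, and a rule that never repeats last session's wording. A minimal sketch (the `Fact` class and stem wordings are illustrative, not Studyly's actual component):

```python
import random

class Fact:
    """One underlying fact; the surface form rotates, the answer never moves."""

    def __init__(self, answer: str, stems: list[str]):
        self.answer = answer      # fixed across every review
        self.stems = stems        # wording rotates per session
        self.last_stem = None

    def next_stem(self) -> str:
        # Exclude last session's wording so review #2 never
        # repeats review #1 verbatim.
        choices = [s for s in self.stems if s != self.last_stem]
        self.last_stem = random.choice(choices)
        return self.last_stem

septum = Fact(
    answer="interventricular septum",
    stems=[
        "Which structure separates the right and left ventricles?",
        "A ventricular septal defect is a hole in which wall?",
        "Name the muscular wall between the two ventricles.",
    ],
)
```

Each call to `next_stem()` hands you a different path to the same answer, which is the whole point of the carousel.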
From a folder of sources to a real retrieval loop
The pipeline below is the path one source takes through Studyly. The interesting line is step four: the rephrase pass is what makes review #2 onward different from review #1. Without that step you have a static deck plus a scheduler, which is what most spaced-repetition apps actually are.
Six steps from a lecture deck to active recall
Drop the source in
Lecture slide deck, textbook chapter PDF, study guide, YouTube lecture link, or a folder with all of the above. About 60 seconds for a 90-slide deck.
Studyly grades the output before you see it
Every generated question is scored against the four-criterion rubric (factual correctness against the source, stem clarity, distractor plausibility, type coverage across the deck). Questions below threshold are regenerated. You never see the broken intermediate output.
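The grade-then-regenerate gate described above can be sketched in a few lines. The four rubric dimensions come from the text; the 0.8 threshold, the retry cap, and the function names are assumptions for illustration, not Studyly's real pipeline:

```python
RUBRIC = ("factual_correctness", "stem_clarity",
          "distractor_plausibility", "type_coverage")
THRESHOLD = 0.8  # assumed pass bar, per dimension

def passes_rubric(scores: dict[str, float]) -> bool:
    """Every rubric dimension must clear the threshold."""
    return all(scores.get(dim, 0.0) >= THRESHOLD for dim in RUBRIC)

def gate(generate, grade, max_retries: int = 3):
    """Regenerate until a question clears the rubric, or give up."""
    for _ in range(max_retries):
        question = generate()
        if passes_rubric(grade(question)):
            return question
    return None  # broken intermediate output never reaches the student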
First pass: recognition is fine
On review #1, recognition is OK. You are encoding. The MCQ format gives you four options and you pick the one that matches your source. This is the easy pass, and here the 'active recall' label is mostly accurate.
Subsequent passes: the rephrase happens
Review #2 reaches you with the stem reworded and the distractor pool rotated. Review #3 may rotate the format (MCQ → free-response → case-style). Review #5 might surface as image-occlusion if your source had a labeled diagram. Same fact, five surface forms.
Wrong answers get explained from your source
When you miss a question, the explain panel quotes the supporting passage from the original PDF or slide deck. The reasoning is not a model paraphrase; it is the sentence on the page in front of your professor when they wrote the lecture.
Spaced repetition + variation work together
Weak topics resurface more often (spaced repetition does this). Each resurfacing arrives in a different surface form (rephrase + format rotation do this). Both layers run on every deck without configuration.
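The two layers are independent mechanisms: spaced repetition decides *when* a fact resurfaces, rotation decides *what it looks like*. A minimal sketch of each; the double-on-success interval rule is a common Leitner-style simplification and the fixed rotation order is illustrative, neither is Studyly's actual scheduler:

```python
def next_interval(days: int, correct: bool) -> int:
    """Spaced repetition layer: weak topics resurface sooner."""
    return days * 2 if correct else 1

def next_form(last_form: str) -> str:
    """Variation layer: each resurfacing arrives in a new format."""
    rotation = ["mcq", "free-response", "case-style", "image-occlusion"]
    return rotation[(rotation.index(last_form) + 1) % len(rotation)]
```

Run only the first function and you have a scheduler re-showing a static card; run only the second and you have variety with no memory of what you got wrong. Both together are the loop.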
The held-out eval, in numbers
Three source documents (a slide deck, a textbook chapter, and a paper) were held out from any tool's training. Each tool generated questions from the same three documents. Every output was graded on four dimensions (factual correctness, clarity, distractor quality, type coverage) with the same rubric and the same graders.
Higher is better. The 23.5-point gap between Studyly and Turbolearn is the difference between a deck where most questions are usable and one where you spend half your study time editing the model's mistakes. Full methodology and the leaderboard are public on studyly.io/quality.
Active recall behavior, side by side
What a typical AI question generator does on revisit vs what active recall actually requires from the question.
| Feature | Typical AI question generator | Studyly |
|---|---|---|
| Wording on revisit | Identical stem, identical option order. | Stem reworded, distractors reshuffled, format may rotate (MCQ → case → free-response). |
| Source grounding | Question text comes from the model's pretrained knowledge. | Correct answer maps to a specific passage in your uploaded PDF or slide. |
| Wrong-answer explanation | Generic explanation written by the model. | Quote pulled from your source that proves why the right answer is right. |
| Question-type mix | Mostly recall ("what is X"). | Recall + application + comparison + case-based, scored for variety on every deck. |
| Quality benchmark | No score, no methodology. | 81.3 on a held-out three-document eval, methodology public on /quality. |
| Spaced repetition loop | Static list. You decide when to revisit. | Weak topics resurface automatically; mastered topics decay out. |
| Anki export | Often missing or paid add-on. | One-click .apkg export including image-occlusion cards. |
When this is not the right tool
Three honest cases where another approach fits better. Naming them here is more useful than pretending one tool covers every kind of studying.
- Computational problem sets. Worked-equation cards (calculus, physics, quantitative pharmacology) need a math problem-solver, not a question generator. Concept questions about those subjects work; mechanical solving does not.
- The fact is in your head, not in any document. Personal mnemonics, attending pearls from rounds, the way your preceptor phrased a clinical heuristic. There is no source to upload because the source is you. Type the card by hand.
- You only need a single pass. The 60-second conversion still wins on time, but the rephrase pass and four-format spread don't get to do their work on a deck you take once and never see again. The unique value is in the cross-revisit loop.
Try it on tomorrow's lecture
Drop a deck in. Watch the same fact get tested four ways.
Free tier on studyly.io, no credit card. The email gate sends a one-click access link.
Common questions about active recall question generators
What is an active recall question generator, in one sentence?
A tool that turns your own source material (lecture slides, PDF, textbook chapter, YouTube lecture) into questions that force you to retrieve the answer from memory, then keeps the surface form of those questions varied across revisits so you cannot answer them by remembering the wording. The 'generator' part is the easy half. The 'active recall' part is the half most tools quietly skip.
Why does it matter whether the wording changes on revisit?
Active recall is retrieval practice. Retrieval is the act of pulling a fact out of memory under conditions different from the ones you encoded it in. If review #5 of the same card looks identical to review #1, your brain optimizes for the sentence shape, not the underlying fact. By the morning of the exam you know the question; you do not necessarily know the biology. This is why a generator that prints a static deck and a spaced-repetition app that re-shows that deck unchanged are both technically a 'question generator' and neither of them is an active-recall question generator. The label is honest only if the wording rotates.
How is this different from making questions in ChatGPT?
ChatGPT will produce a list of questions from a prompt. It will not run a quality rubric across the four criteria (factual correctness, clarity, distractor plausibility, type coverage), it will not track which questions you got right, it will not reword the stem on revisit, and it will not pull the supporting passage from your PDF when you get one wrong. On the same held-out three-document eval that scored Studyly 81.3, the generic field average sat at 67.9. Where ChatGPT helps is the first generation pass; the loop around the questions is what active recall actually requires, and that loop is not in a chat window.
Does spaced repetition by itself count as active recall?
No, and this is the trap. Spaced repetition decides when a card reappears. It does nothing about what the card looks like when it reappears. If the card is identical to the one you saw on Monday, the Friday review is a recognition test on Monday's wording, scheduled at a clever interval. Active recall needs spaced repetition AND surface-form variation; one without the other is half the work.
What inputs work?
Lecture slide decks (PowerPoint, Keynote, PDF), textbook chapters (PDF including image-only scans, OCR happens automatically), study guides, your own typed notes, handwritten notes captured by camera, YouTube lectures (transcript with timestamps preserved). One source or thirty in a folder. The output is the same: roughly 200 multiple-choice questions per 90-slide deck in about 60 seconds, plus three other question formats from the same source.
What are the four question formats from one source?
Multiple-choice (recognition under distractor pressure), free-response (pure recall, no options to lean on), case-style stems (apply the fact in a clinical scenario), and image-occlusion (recall a masked label on a diagram, for anatomy and pathway figures). The same fact in your source can surface as any of the four formats, which is the second axis of variation on top of the rephrase pass.
Can I export the questions to Anki?
Yes. The .apkg export carries MCQ, free-response, case-style, and image-occlusion cards intact. The auto-rephrasing happens inside Studyly during a review session, so when you export to Anki you get the canonical card set and Anki's scheduler takes over. If you want the rephrasing to keep happening on revisit, the work needs to stay in Studyly.
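For a sense of what a "canonical card set" amounts to once the rephrasing stops: each fact collapses to one fixed front/back pair. The sketch below writes a plain tab-separated file that Anki can import via File → Import; the real one-click export produces a richer .apkg with media and all four card types, but the canonical-card principle is the same:

```python
import csv

def export_tsv(cards: list[tuple[str, str]], path: str) -> None:
    """Write (front, back) pairs as a tab-separated Anki import file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for front, back in cards:
            writer.writerow([front, back])

export_tsv(
    [("Which structure separates the right and left ventricles?",
      "The interventricular septum")],
    "deck.tsv",
)
```

After import, Anki's scheduler controls the intervals; the wording stays frozen, which is exactly the trade-off the answer above describes.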
Does this work for math or computational problems?
Concept questions yes, step-by-step worked solutions no. Studyly is built for memorization-heavy programs (medical, dental, nursing, pharmacy, PA, vet, pre-med, biology, anatomy, immunology, microbiology). If you need a tool that walks through a derivation or shows the work on a calculus problem, a math problem-solver is the right tool, not this one.
Is there a free tier?
Yes. Drop a deck in, generate questions, drill them, all without paying. Free tier limits how many active decks you keep alive at once; paid removes the cap. No credit card to start.