Active recall · ChatGPT · medical school · the structural comparison
ChatGPT writes questions. It does not run the loop.
Direct answer, verified 2026-05-14. ChatGPT can generate a one-pass quiz on a lecture PDF in 40 seconds. It cannot do active recall for med school, because active recall is a closed loop and a chat window is not. Four structural gaps: no per-fact retention state across sessions, no rubric gate on distractors, no rephrase of the stem on revisit, no source-anchored explain on a miss. By day four of the week the loop collapses back into rereading the slides. The honest comparison is below.
The four things a chat window structurally cannot do
Active recall is not the act of seeing a question. It is a system with four moving parts. Every one of them has to keep moving for the loop to do its job. If any of them stalls, the session degrades into rereading with extra steps. ChatGPT does the first frame of all four and then stalls on every one.
Gap 1 · persistent miss-tracking
The loop needs to know which specific facts you got wrong on Monday so it can re-test them on Wednesday. A chat does not carry per-fact state across sessions. A new chat is a new transcript and the transcript is a wording, not a fact identifier. There is no row that says ‘fact 0427: loop of Henle counter-current, missed 2 of 3 attempts, due in 36 hours’.
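To make the gap concrete, here is a minimal sketch of the kind of per-fact row a retrieval loop has to own. The field names and the interval rule are illustrative assumptions, not Studyly's actual schema; the structural point is that some record like this has to outlive the session, and a chat transcript has no equivalent of it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class FactState:
    """Per-fact retention state that must survive across sessions."""
    fact_id: str            # stable identifier, e.g. "fact-0427"
    topic_pin: str          # the source slide line the fact is pinned to
    attempts: int = 0
    misses: int = 0
    due_at: datetime = field(default_factory=datetime.utcnow)

    def record_attempt(self, correct: bool) -> None:
        """Update state after a revisit and schedule the next one.
        The intervals here are placeholders, not a real spaced-repetition
        schedule."""
        self.attempts += 1
        if correct:
            self.due_at = datetime.utcnow() + timedelta(days=2 ** min(self.attempts, 5))
        else:
            self.misses += 1
            self.due_at = datetime.utcnow() + timedelta(hours=36)

# A chat transcript has no row like this, which is why Monday's misses
# cannot be re-surfaced on Wednesday.
state = FactState(fact_id="fact-0427",
                  topic_pin="Loop of Henle: counter-current multiplication")
state.record_attempt(correct=False)   # missed, so it comes due again in ~36 hours
```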
Gap 2 · rubric gate on distractors
A question only tests recall if the wrong answers are discriminating. Synonyms of the right answer do not discriminate. Obviously wrong options do not discriminate. A chat ships whichever distractors fit the sentence shape it generated, without a separate gate that re-scores them before you see the question. The rubric in Studyly scores every candidate on factual correctness, clarity, distractor quality, and question-type coverage, and regenerates the ones that fail. Held-out blind eval: 81.3 vs 57.8 on the same three documents.
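Structurally, a rubric gate is a loop that sits between generation and surfacing. A rough sketch is below; the scoring function, threshold, and retry count are stand-ins rather than the real pipeline, but the mechanism is the point: every candidate is re-scored on the four axes and regenerated if any axis fails, before a student ever sees it.

```python
AXES = ("factual_correctness", "clarity", "distractor_quality", "type_coverage")

def gate(candidate, score_axis, regenerate, threshold=0.7, max_retries=3):
    """Re-score a candidate question on every axis and regenerate it until
    all four clear the threshold (or retries run out).

    `score_axis` and `regenerate` are assumed callables standing in for the
    scoring model and generator the real pipeline would use."""
    for _ in range(max_retries):
        failing = [axis for axis in AXES if score_axis(candidate, axis) < threshold]
        if not failing:
            return candidate                         # passes all four axes: ship it
        candidate = regenerate(candidate, failing)   # rewrite the weak axes
    return None                                      # never surfaced to the student
```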
Gap 3 · auto-rephrase on revisit
The whole point of repeated retrieval is repeated retrieval. If the stem on revisit #5 is byte-for-byte the stem from revisit #1, your brain is matching the wording, not pulling the fact. Studyly re-rolls stem text on every revisit while the underlying topic-pin stays stable, so five revisits of the same fact are five different sentences. A chat window can be re-prompted to do this for one document; it cannot be made to do it consistently across thirty lectures over a semester.
Gap 4 · source-anchored explain
When you miss a question, the only useful next step is to look at the source line your professor actually wrote. Not a paraphrase from training data. Not a confident hallucination of a citation. The bullet on the slide. Studyly pins the source line on every card and the explain panel quotes it verbatim on a miss, with the slide number, so you can re-open the deck for context if you want. A chat cannot reliably do this because the chat does not own the slide; it owns a representation of the slide.
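A sketch of what source-anchoring means at the card level. The card carries the verbatim slide line and slide number with it, so the explain step is a quote rather than a paraphrase. Field names and the example values below are illustrative, not the real card schema.

```python
from dataclasses import dataclass

@dataclass
class Card:
    question: str
    answer: str
    source_quote: str   # the bullet line exactly as it appears on the slide
    slide_number: int

def explain_on_miss(card: Card) -> str:
    # Quote the slide verbatim instead of paraphrasing it, so the student
    # updates against the professor's wording.
    return f'Slide {card.slide_number}: "{card.source_quote}"'

# Example values made up for illustration.
card = Card(
    question="Which mechanism builds the medullary osmotic gradient?",
    answer="Counter-current multiplication in the loop of Henle",
    source_quote="Counter-current multiplication in the loop of Henle "
                 "generates the medullary osmotic gradient",
    slide_number=42,
)
print(explain_on_miss(card))
```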
The blind eval, on the same three lecture documents
Three held-out documents the generators had not been trained on (a microbiology lecture, an internal medicine deck, a pharmacology PDF). Every tool received the same files. Every output card was scored blind on factual correctness, clarity, distractor quality, and question-type coverage. The 57.8 line is what an un-gated generator looks like, which is the closest reference for what ChatGPT-style output without a rubric layer would produce on the same task. Source: Jungle internal admin Quality Comparison panel, 2026-04-24.
- Studyly (rubric-gated): 81.3 / 100
- Unattle: 78.0 / 100
- Gauntlet: 68.0 / 100
- Un-gated baseline: 57.8 / 100
Most of the gap shows up on distractor quality and on question-type coverage. A chat will write five recall MCQs in a row before it writes a case stem, and three of its five wrong answers will be synonyms of the right answer. The rubric gate is what closes that. The eval cannot test the loop axes (miss-tracking, rephrase, source anchor) on a single document; those failures are visible only after a few sessions, which is why they are easy to miss when you are one-pass quizzing on a Sunday night.
The same renal lecture, a week of drilling, two workflows
A 90-slide renal physiology lecture. Exam in eight days. Left side is what a week of drilling looks like in a chat window. Right side is the same week running through the retrieval loop. The collapse on the left happens around day three.
One lecture, one week, two ways to try active recall
Monday night. Paste the renal physiology PDF into ChatGPT. Ask for 30 MCQs. Get 30 MCQs in 40 seconds, half of them with distractors that are synonyms of the right answer. Drill them on the screen. Get 7 wrong. Close the tab.

Tuesday night. Open a new chat. Paste the same PDF. Ask for 30 MCQs. You get 30 different MCQs because the model rerolled. None of them are the 7 you got wrong yesterday. Drill these new 30. Get 9 wrong. Close the tab.

Wednesday night. You realize the loop is not closing. You try to paste yesterday's wrong list into the new chat and ask it to re-test only those. It writes seven new questions that ask roughly the same facts in new wording, which is good, but you have no way to tell whether today's question 4 is a revisit of yesterday's question 12, because there is no fact identifier connecting them. By Friday you are rereading the slides.
- Each chat is a fresh transcript with no per-fact memory
- No rubric gate so distractors are often non-discriminating
- By day four the loop has collapsed back into rereading
The contrast is not about model quality. It is about whether the system carries state between sessions. The chat does not. The loop does. That is the whole comparison and everything else on this page is downstream of it.
Anchor fact · how rephrase-on-revisit actually works
The topic-pin survives. The stem text does not.
Each fact in the system has a stable topic-pin tied to the source slide line. The stem wording is generated against that pin and re-rolled on every revisit. Same fact on Monday and Friday: two different sentences asking the same question. Options are reshuffled. The case-stem variant arrives as a clinical scenario that embeds the fact rather than asking it directly. By revisit #5 you have done five genuine retrievals under five different retrieval conditions, which is the regime the testing-effect literature finds the most retention gain in.
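A minimal sketch of the mechanic, under assumptions: `generate_stem` stands in for whatever model call actually rewrites the wording, and the field names are illustrative. The structural point is that the topic-pin is the stable key and the stem is regenerated against it on every pass.

```python
import random

def next_revisit(fact, revisit_number, generate_stem):
    """Build the next presentation of a fact: same topic-pin, fresh wording.

    `fact` is assumed to carry a stable `topic_pin` and a list of `options`;
    `generate_stem` is a stand-in for the model call that rewrites the stem."""
    stem = generate_stem(fact.topic_pin, variant=revisit_number)   # new sentence every pass
    options = random.sample(fact.options, k=len(fact.options))     # reshuffle answer order
    return {"stem": stem, "options": options, "topic_pin": fact.topic_pin}
```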
ChatGPT cannot run this without a layer above it that owns the pin. The chat owns a transcript, and the transcript is the wording itself, so ‘reword question 7 on revisit’ is a re-prompt, not a structural property. Across thirty lectures and twelve weeks, the re-prompt approach silently degrades.
This is the part the other pages on this comparison do not print, because they treat ‘ChatGPT for active recall’ as a yes-or-no question instead of a system-design question. It is a system-design question and the answer is ‘no on the four axes, yes for the explain step’.
Side by side, axis by axis
The four structural gaps plus the eval. The cost row is at the bottom because cost is not the bottleneck on cramming season; the loop is.
ChatGPT chat-window workflow vs Studyly retrieval loop, drilling the same med school lecture deck.
| Feature | ChatGPT chat window | Studyly |
|---|---|---|
| Tracks which specific facts you missed across sessions | No persistent per-fact state. Each new chat starts from zero. Conversation memory does not key on fact identifiers. | Per-fact retention state survives sessions. Missed cards re-enter the queue on a spaced-retrieval schedule. |
| Rubric gate on distractor quality | No enforced rubric. Distractors are often synonyms of the right answer or obviously wrong, so the question does not actually test recall. | Four-axis rubric (factual correctness, clarity, distractor quality, question-type coverage). Candidates failing any axis are regenerated before output. |
| Stem reworded on revisit | Same stem text on every recall of the same fact unless you manually re-prompt. By pass #3 you pattern-match the first three words. | Topic-pin survives, stem text re-rolls on every revisit, options reshuffle. Pass #5 is five genuine retrievals, not one retrieval and four matches. |
| Source anchor on the explain panel | Paraphrases from training data. Occasionally invents a citation. Cannot point at the bullet your professor wrote. | Explain quotes the bullet line on the slide it came from. Slide number and source line carry through, also into the Anki .apkg export. |
| Question-type coverage on one source | Tends toward one or two shapes (mostly recall MCQ). Coverage flips with prompt phrasing. | Four formats from the same source: MCQ, free-response, case-style, image-occlusion. Type mix is part of the rubric. |
| Image-occlusion on labeled anatomy or histology figures | Cannot natively produce a labeled figure with masks. Will describe one in text. | Extracts the figure, identifies labeled structures, generates an image-occlusion card per label. Exports as native Anki image-occlusion notes. |
| Held-out three-document rubric score (blind, four-axis) | Un-gated generator output scored 57.8 of 100 on the same three documents. | 81.3 of 100. Source: Jungle internal admin Quality Comparison panel, 2026-04-24. |
| Free tier, no credit card | ChatGPT Plus is $20/mo. Free tier exists but rate-limited on the strong models. | Free tier on app.jungleai.com, no card on file. Paid is opt-in and removes a per-account deck cap. |
When ChatGPT is still the right tool
Three honest cases, because pretending the chat window is useless is the kind of marketing copy that costs you the reader. Use ChatGPT for explain. Drop a confusing slide in and ask for a one-paragraph explanation of why the loop of Henle gradient is counter-current and not co-current. That is the explain step, not the recall step, and the chat is fine at it.
Use ChatGPT for cold free-response drills on a single topic the night before an exam. ‘Write me ten free-response questions on glomerulonephritis, do not give the answers’. There is no recognition to pattern-match against because there are no options. The miss-tracking and rephrase axes do not bind on a single-pass one-night drill.
Use ChatGPT for sanity-checking the wording of a question you wrote yourself. The chat is a reasonable critic of a stem. It is not a reasonable owner of a queue of stems across a semester. That is the whole distinction.
Run the loop on tomorrow's lecture
Drop a PDF, get ~200 rubric-gated questions in 60 seconds, drill them across the week.
Free tier on app.jungleai.com, no card on file. Four formats from the same source. Misses pinned across sessions. Stems re-roll on revisit. Explain quotes the slide.
Common questions when comparing ChatGPT to a retrieval loop for med school
Wait, can ChatGPT actually do active recall for med school or not?
It can do the first 10 minutes of it and then it falls apart. Drop a lecture PDF into a chat, ask for 20 MCQs on it, and you get 20 MCQs. That is a one-pass quiz, not active recall. Active recall is a closed loop: every fact you miss has to come back, the stem cannot be the same wording on revisit (or you pattern-match the wording instead of pulling the fact), the distractors have to be discriminating (or the wrong answers are synonyms or obviously wrong and the question does not actually test recall), and on a miss you need the original slide quoted at you so you can update the memory against the source. A chat window does not carry per-fact retention state across sessions, does not run a rubric gate on distractors, does not rephrase the stem on revisit, and does not anchor the explain to a specific slide line. Those are four structural gaps and they all matter by week two.
But I have ChatGPT Plus and Custom Instructions. Can I prompt around it?
Partially, and partially does not save you. You can paste the same lecture and ask for the same 20 cards reworded, which fixes the rephrase axis on one document. You cannot ask ChatGPT to remember that you got question 7 wrong yesterday and surface it back today, because per-conversation memory is fragile and chat-spanning memory does not key on a fact-level identifier. You cannot reliably ask it to score its own distractors against a four-axis rubric before showing them to you, because it will say it did and then ship the same kind of synonym-of-the-right-answer distractor it would have shipped without the rubric. The blind eval data is unambiguous on this: on a held-out three-document set, a generic generator scored 57.8 of 100. A rubric-gated pipeline scored 81.3 on the same three documents.
Is there any active-recall task ChatGPT is actually good at?
Yes, two. First, single-pass concept explanations: ask it to explain why the loop of Henle counter-current multiplier works the way it does, and you get a coherent explanation you can then test yourself against. That is not active recall; it is the explain step. Use it freely. Second, generating a free-response prompt cold ('write me ten questions on glomerulonephritis, do not include the answers') is a good cold-retrieval drill because you cannot accidentally recognition-match anything. Those two uses are fine. The thing it cannot do is run the loop over weeks across thirty lectures.
What about Claude or Gemini, are they the same story?
Same story on the structural axes. Any general-purpose chat assistant is a one-pass writer of questions, not a retrieval loop, because retrieval requires persistent per-fact state and a rubric gate that lives between generation and surfacing. The model is not the bottleneck; the loop is. Switching to a smarter model improves question quality on a single shot and does not fix miss-tracking, rephrase, or distractor discipline.
How does the rephrase on revisit actually work, and why does it matter that much?
Each fact in the system has a topic-pin that survives across encounters. The stem text does not. On revisit #2 of the same underlying fact the system re-rolls the stem wording and reshuffles the option order. By revisit #5 you have seen five different sentences asking the same question, which forces five genuine retrievals instead of one retrieval and four pattern matches. The cognitive-science version of this is the testing effect under varied retrieval conditions. The product version is that you cannot memorize the first three words of the question and call it studying. ChatGPT cannot do this because the chat does not store a stable topic-pin across sessions; it stores a transcript, and a transcript is a wording, not a fact.
Does the source anchor actually help on wrong answers, or is it window dressing?
It is the single most useful thing on a miss. When you get a card wrong, the explain panel quotes the bullet line on the slide it came from. Not a paraphrase, the bullet your professor wrote. That gives you something concrete to update against and lets you re-pull the slide for context if you want. ChatGPT will paraphrase the source from training data and occasionally invent a citation. The difference shows up most on the slides where the professor's wording is the wording your TA will mark against on the exam.
What does the four-axis rubric actually score against?
Factual correctness (is the answer the right answer for this slide), clarity (would a peer reading the stem cold understand what is being asked), distractor quality (do wrong answers discriminate from the right answer, or are they synonyms or obviously wrong), and question-type coverage (does the set include recall, application, case stems, and image-occlusion rather than only one shape). Candidates failing any axis are regenerated before output. On a held-out three-document eval (a microbiology lecture, an internal medicine deck, a pharmacology PDF) the gated output scored 81.3 of 100. The next best generator scored 78.0, the one after that 68.0, and an un-gated generator scored 57.8. Source: Jungle internal admin Quality Comparison panel, 2026-04-24.
Will this work for USMLE, NCLEX, MCAT, or COMLEX style stems?
Yes. The case-style generator runs on every fact slide regardless of source, so a 90-slide cardiology deck produces around 50 case-style stems alongside the MCQs and free-response cards. Case stems are the highest-leverage format for board-style exams because they mirror the question shape on the actual test. You can filter the queue to case-style only when you want to drill clinical reasoning instead of recognition. AnKing, Zanki, and other community decks remain the right choice for the boards content itself. The class-deck side is where the chat-vs-loop comparison matters most.
What about cost, is the chat-window approach cheaper?
Marginally and not in a way that matters. ChatGPT Plus is $20 a month. Studyly has a free tier, no card on file, and a paid tier. The cost question is the wrong one because the bottleneck on cramming season is not dollars per month, it is whether you finish the loop. The chat-window approach loses on whether you actually drill the lectures more than once, which is where retention is paid for.