Comparison · prep score vs clinical retention
A high MCAT prep score and clinical retention are not the same number.
You drill practice questions, your score climbs, and you read the climb as knowledge. Some of it is. Some of it is the question getting familiar. This page is about telling those two apart, because only one of them follows you into a clerkship.
Direct answer · verified 2026-05-16
Does a high MCAT prep score mean real retention?
Not on its own. In a study of premedical students, full-length practice-exam scores tracked the actual MCAT score at about r 0.92, almost lockstep. But the MCAT itself predicts clinical-years performance, clerkship grades and Step 2 CK, at only about r 0.42 to 0.61 in AAMC validity research. So a prep score is a near-perfect predictor of the next test and a moderate predictor of what you keep. Which one your number actually is depends on how you earned it: by recognizing questions you already drilled, or by retrieving concepts from varied, spaced practice. Re-verified 2026-05-16 against the sources listed at the bottom.
Two correlations, two timescales
The phrase “prep score vs clinical retention” has a numeric answer. It is the distance between these two figures.
Practice score predicting MCAT day
r 0.92
Median full-length practice score against the real MCAT score in a study of premedical students. Same instrument, weeks apart, near lockstep.
MCAT predicting clinical years
r 0.42 to 0.61
Median correlation of the MCAT total with clerkship grades and Step 2 CK in AAMC validity research. Years later, applying the knowledge. Moderate.
Put the two correlations next to each other. A 2021 study of premedical students at Khalifa University tracked their full-length practice exams against their real MCAT score. The median practice score correlated with the MCAT at roughly r 0.92. That is about as tight as educational measurement gets, and it should be. The practice test and the real test are the same instrument, taken weeks apart, measuring the same skill in roughly the same state. Of course they agree.
Now stretch the timescale. AAMC validity research follows medical students from matriculation through their clinical years. The median correlation of the MCAT total score with clinical-years outcomes, clerkship grades and Step 2 CK, lands in the range of roughly 0.42 to 0.61. The AAMC calls that medium to large, and it is a real signal. It is also a much softer one. A correlation near 0.5 means about a quarter of the variance is shared. Three-quarters of how much you can apply two years later is explained by something other than your MCAT total.
That is the gap the question names. “Prep score vs clinical retention” is not a rhetorical contrast. It is the measured difference between r 0.92 and r 0.5. Your prep score is an honest report on a 30-day question. It is a much weaker report on a two-year question. Nobody is lying to you. The number is just answering something narrower than you assume it is.
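The variance arithmetic behind that contrast is one line: the fraction of variance two measures share is the correlation squared. A quick sketch makes the gap concrete:

```python
def shared_variance(r: float) -> float:
    """Fraction of variance two measures share, given their correlation r."""
    return r ** 2

# Practice score vs MCAT day: near-lockstep.
print(round(shared_variance(0.92), 2))  # -> 0.85, about 85% of variance shared

# MCAT vs clinical-years outcomes: moderate.
print(round(shared_variance(0.50), 2))  # -> 0.25, three-quarters left unexplained
```

The same r that reads as "strong" on a headline shrinks quickly once squared, which is why r 0.5 leaves most of clinical-years performance to other factors.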
Why the gap opens: recognition rides along with retrieval
Here is the mechanism, and it is not subtle. Every time you face the exact same practice question, two different things can carry you to the right answer. One is retrieval: you reconstruct the concept and derive the answer. The other is recognition: you remember this specific question, the shape of its stem, the position of the correct option. Recognition is real memory. It is memory of the question, though, not of the biology, and it does not transfer to a reworded stem, a different vignette, or a patient.
On a static bank, recognition compounds. Pass one is mostly retrieval. By pass four the stem is an old friend. Your accuracy on that bank goes up, while your accuracy on a question you have never seen moves much less. The score rises faster than the retention does. That divergence is the entire prep-score-versus-clinical-retention problem in one sentence.
The testing-effect literature adds a second blade. Karpicke and Roediger reported in 2007 that repeatedly retrieving items a student had already gotten right boosted one-week retention by more than 100 percent, while repeatedly studying those same items did almost nothing. And the effect runs larger for formats that make you generate an answer than for formats where you select one. A multiple-choice bank is selection-heavy by design. Drilled to the point of recognition, it is the weaker half of the testing effect.
“Auto-rephrasing means I can't lazy-pattern-match the first three words. Eight days into spaced repetition I actually retain the renal stuff.”
Studyly user, on the auto-rephrasing loop
Studyly vs a static question bank, line by line
If a static question bank lets recognition inflate the number, the fix is a study loop where it cannot. That is the design difference between a fixed bank and Studyly, dimension by dimension.
Where the two study loops diverge
| Feature | Static question bank | Studyly |
|---|---|---|
| What a rising score measures | Retrieval and recognition mixed together. Each pass over the same item makes the question familiar, so accuracy climbs even when concept memory does not. You cannot separate the two from the number. | Retrieval, as far as the format allows. Because the stem is reworded and the options reshuffled on every revisit, a correct answer on pass five means you reached the concept, not the sentence. |
| The second time you see a question | Identical wording. By the third exposure your brain indexes on the first few words of the stem and the position of the right answer. | New wording, same fact, same correct option. Auto-rephrasing rewrites the stem so revisit three is still a cold retrieval. |
| Where the questions come from | A fixed bank written for a generic curriculum. It may or may not match the chapters and notes you are actually reviewing. | Generated from the material you upload: a textbook chapter, your class notes, a content-review PDF. About 200 questions per source in 60 seconds. |
| Question formats | Almost always multiple choice, which leans on recognition. Selecting an answer produces a smaller testing effect than generating one. | Four formats from one source: multiple choice, free response, case-style, and image-occlusion. Free response forces you to generate an answer, not pick one. |
| When you revisit | You decide. Most people re-drill what they already know and quietly skip what they missed. | A spaced-repetition schedule resurfaces missed items days later, automatically. The schedule sets the interval, not your mood that night. |
| What the score predicts | Your score on that exact bank. Transfer to a reworded question, a later exam, or clinical recall is unmeasured. | A held-out accuracy. Every revisit is a question you have not drilled in that form, so the number is closer to what you would score cold. |
Held-out is the word that matters
Studyly publishes a quality benchmark: 81.3 out of 100, scored against Turbolearn at 57.8, Gauntlet at 68.0, and Unattle at 78.0 on the same rubric. The figure people quote is the 81.3. The word that actually carries the weight sits in front of it: held-out.
A held-out evaluation scores a system on material it never saw while it was being built. Studyly's 81.3 comes from a three-document held-out eval, graded blind on factual correctness, stem clarity, distractor quality, and question-type coverage, on lecture documents the generator was not tuned against. It is a cold-transfer score, not a score on the practice set. That is the whole reason to hold documents out: a score on material a system trained on tells you almost nothing about new material.
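The mechanics of a held-out split are simple enough to sketch. This is a generic illustration of the idea, not Studyly's actual eval harness; the document names and counts here are invented:

```python
import random

def held_out_split(documents, n_held_out=3, seed=0):
    """Set aside documents the generator never sees during tuning.

    Scoring only on the held-out set measures cold transfer,
    not performance on material the system was built against.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    return docs[n_held_out:], docs[:n_held_out]  # (tuning set, held-out set)

# Hypothetical lecture documents, for illustration only.
docs = [f"lecture_{i}.pdf" for i in range(10)]
tuning, held_out = held_out_split(docs)
assert not set(tuning) & set(held_out)  # no overlap: held-out scores are cold-transfer
```

The single invariant that makes the number meaningful is that last assertion: the scored set and the tuning set never overlap.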
The reason this matters for your studying is that Studyly runs the same discipline on you. Auto-rephrasing rewrites the stem of every question on every revisit while the underlying fact and the correct option stay fixed. Revisit three is a stem you have not drilled in that form. So the accuracy you watch inside the app is a held-out accuracy by construction. It cannot be inflated by recognition, because there is no fixed wording left to recognize. The number tracks the concept, which is the number that follows you past test day.
You can read the full eval methodology, the rubric and the leaderboard, on the Studyly quality page. The mechanic itself is covered in detail in the write-up on auto-rephrasing practice questions.
How Studyly keeps every retrieval held-out
The loop has four moving parts. None of them leaves a fixed question for recognition to latch onto.
Upload the material you actually review
A textbook chapter, your class notes, a content-review PDF, even a YouTube lecture. The MCAT is standardized, so your content review already happens in material like this; Studyly works from whatever you upload.
Get about 200 questions in 60 seconds
From one source, across four formats: multiple choice, free response, case-style, and image-occlusion flashcards. Free response makes you generate an answer rather than select one.
Revisit on a reworded stem
Auto-rephrasing rewrites the question every time it resurfaces. Same fact, same correct option, new sentence. There is no fixed wording left to memorize.
Miss it, and the spacing brings it back
A spaced-repetition schedule resurfaces the items you got wrong days later, on its own. The interval is set by the schedule, not by what you feel like reviewing.
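The scheduling step above can be sketched as a Leitner-style interval ladder. This is an illustrative simplification, not Studyly's actual algorithm, and the interval values are assumptions:

```python
from datetime import date, timedelta

# Assumed interval ladder in days; a miss resets to the shortest interval,
# a hit promotes the item to the next longer one.
INTERVALS_DAYS = [1, 3, 7, 14, 30]

def next_review(box: int, correct: bool, today: date):
    """Return the item's new box and the date it resurfaces."""
    box = min(box + 1, len(INTERVALS_DAYS) - 1) if correct else 0
    return box, today + timedelta(days=INTERVALS_DAYS[box])

# A missed item comes back on the shortest interval, regardless of mood.
box, due = next_review(box=3, correct=False, today=date(2026, 5, 16))
print(box, due)  # -> 0 2026-05-17
```

The point of putting the interval in a schedule rather than in your hands is the one the section makes: missed items resurface on the ladder's clock, not on whatever you feel like re-drilling.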
The honest case for a static bank
A comparison page that concedes nothing is not worth reading. A fixed question bank has one genuine advantage, and for the MCAT it is decisive: the exam is standardized, and the AAMC's own official practice material is the closest thing to the real test you will ever drill. Use it. Nothing here argues for skipping official practice. The argument is narrower. Official practice is a measurement instrument, best spent a handful of times to check where you stand, not a learning loop to grind for content retention. Grinding the same bank for content is exactly where recognition starts inflating the number. Drill content on a loop that stays held-out, and save your scarce official full-lengths for honest checkpoints.
If your bottleneck is generating questions from your own content rather than calibrating to the exam, the related write-up on MCAT practice questions from a textbook walks through that side of the workflow.
Common follow-ups
Does a high MCAT practice score mean I will retain the material?
Not by itself. Practice-exam scores predict your MCAT-day score very tightly, around r 0.92 in one study of premedical students, because they measure the same skill on the same timescale. But the MCAT predicts clinical-years performance only moderately, roughly r 0.42 to 0.61 in AAMC validity research. A high practice score is strong evidence you will do well on the next test and weaker evidence about what you will still know two years later. How you earned the score decides which of the two it really is.
Why does my practice score keep rising while it does not feel like it sticks?
Because re-drilling the same questions lets recognition do part of the work. The first time you see a question you retrieve the concept; by the fourth time you partly remember the question itself, its wording and the position of the right answer. Your accuracy on that bank climbs, but your accuracy on a question you have never seen climbs more slowly. The feeling that it is not sticking is accurate: the bank score and the retention have quietly come apart.
Is the MCAT a good predictor of how I will do in clinical years?
It is a moderate one. AAMC validity research reports median correlations of MCAT total scores with clerkship and Step 2 CK performance in roughly the 0.42 to 0.61 range, which the AAMC describes as medium to large. That is a real signal rather than noise, but it leaves most of the variance in clinical-years performance to other factors. The MCAT was built to predict readiness, not to cap how much you can retain.
How is Studyly different from a regular MCAT question bank?
A fixed bank shows you the same question with the same wording every time, so by the third pass you can pattern-match the stem. Studyly generates questions from material you upload and rewrites the stem on every revisit through auto-rephrasing, so each pass is a cold retrieval. It also produces four formats from one source, including free response, which forces you to generate an answer rather than select one. The accuracy you see is a held-out accuracy, not a recognition score.
Can Studyly write questions for the MCAT itself?
Studyly does not reproduce the official MCAT. The MCAT is standardized, and for exam calibration you should drill the AAMC's own official practice material. What Studyly does is turn the content you review for the MCAT, your textbook chapters, class notes, and content-review PDFs, into about 200 practice questions per source in 60 seconds. It is a content-retention loop, not an official practice exam.
What does held-out mean, and why does it matter here?
Held-out describes an evaluation run on material a system was not built or tuned against. Studyly's 81.3 out of 100 quality score comes from a three-document held-out eval, so it reflects cold transfer rather than performance on its own practice set. The same idea applies to your studying: because auto-rephrasing means every revisit is a stem you have not drilled in that form, the accuracy you build inside the app is a held-out accuracy, the kind that transfers.
Is there a free tier, and do I need a credit card?
Yes. There is a free tier on app.jungleai.com and no credit card is required to start. Upload a chapter or a set of notes, generate questions, and drill them without entering a card. Paid is opt-in.
Sources
- Albadr, N. et al. (2021). The Predictive Value of Full-length Practice Exams for the New MCAT Exam for Premedical Students. PMC7780194. Median practice score to MCAT score correlation r 0.92.
- AAMC. MCAT Validity Research. Median correlations of MCAT total scores with clerkship and Step 2 CK performance, roughly 0.42 to 0.61.
- Karpicke, J.D. and Roediger, H.L. (2007). Repeated retrieval during learning is the key to long-term retention. Journal of Memory and Language, open PDF.
- Studyly internal Quality Comparison panel, 2026-04-24. Held-out three-document eval, scored blind on factual correctness, stem clarity, distractor quality, and question-type coverage: Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8.
Make the number you watch the number that counts
Upload a chapter or a set of notes and get about 200 questions in 60 seconds. Auto-rephrasing keeps every revisit held-out, so the accuracy you build is the kind that transfers. Free tier on app.jungleai.com, no credit card.