Method comparison · active recall vs hours-based study
Hours is the wrong unit. Count retrievals per slide instead.
Two students put four hours into the same deck. One rereads, one runs 200 retrievals. One week later their recall scores diverge by a factor of roughly two. The number that predicted the outcome was retrieval attempts, not minutes elapsed.
Direct answer · verified 2026-05-16
Track active recall, not study hours. Karpicke and Roediger's 2008 paper in Science found about 80 percent one-week recall from repeated retrieval against about 36 percent from repeated study, and extra study after the first correct retrieval added almost nothing. The unit worth counting is distinct retrieval attempts per slide, not minutes elapsed. Source re-verified 2026-05-16 against the open PDF hosted at psychnet.wustl.edu.
The two numbers from the 2008 paper
Same students, same vocabulary list, same one-week delay. The only thing that varied was whether the items kept getting tested after the first correct retrieval, or kept getting studied. The result was not subtle.
Repeated retrieval
80%
One-week recall in the two repeated-testing conditions, even when the items were dropped from further study after first correct retrieval.
Repeated study
36%
One-week recall in the two no-more-testing conditions, even when the items kept getting restudied after first correct retrieval.
Karpicke, J.D. and Roediger, H.L. (2008). The Critical Importance of Retrieval for Learning. Science, 319(5865), 966 to 968.
Hours measures input. Recall measures output.
Hours is a measurement of input. Recall is a measurement of output. Two students sit down with the same cardiology deck on the same evening. One spends four hours rereading and highlighting. The other spends 60 seconds converting the deck into about 200 questions, then 90 minutes failing them, fixing the gaps, and re-attempting. On a one-week recall test the second student is at roughly 80 percent and the first is at roughly 36 percent. Same lecture, same evening, well under half the wall-clock time, double the retention. The number that predicted the outcome was retrieval attempts, not minutes elapsed.
That ratio is not vibes. It comes from Karpicke and Roediger's 2008 paper in Science (link in the sources at the bottom). Students learned foreign-language vocabulary across four conditions: keep studying and keep testing, keep studying but drop tested items, keep testing but drop studied items, or drop both once correct. One week later the two repeated-testing conditions recalled about 80 percent of the items. The two no-more-testing conditions recalled about 36 percent. Extra study after the first correct retrieval did almost nothing. The unit of work that moved retention was a retrieval attempt, not an additional minute of looking at the page.
So the question is not really "active recall vs study hours". It is "what unit should I track if I want to predict my exam score". The honest answer is retrieval attempts per concept, or, if you study from a slide deck, retrieval attempts per slide. Two students with the same hour count and very different retrieval counts will get very different grades. Two students with the same retrieval count and very different hour counts will get roughly the same grade, with the lower-hours student winning on time efficiency.
The unit that replaces hours
The reason every study-skills article tells you to "do active recall" and then never tells you how many is that the unit was never operationalized. You cannot count something you cannot see. If you reread a chapter, you have nothing to count except minutes. If you self-quiz on a flashcard once, you have one retrieval to count. If you put a 90-slide cardiology lecture through a question generator that emits roughly two questions per slide across four formats, you now have 200 countable units of work attached to that lecture. The deck stops being a thing you spend time on. It starts being a thing you have a known number of retrievals against.
Once retrievals are countable, you can do the obvious things. You can ask whether you have completed at least one retrieval against every slide in the deck. You can ask whether your hit rate is above some threshold per slide, say four-out-of-five for first-pass mastery. You can ask which slides have a hit rate of zero, which is where your next 30 minutes should go. None of that is possible when the unit is hours, because hours do not attach to slides. An hour spent rereading a 90-slide deck distributes invisibly. A retrieval attached to slide 47 does not.
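The bookkeeping in the last two paragraphs is small enough to sketch. The class below is an illustrative stand-in, not studyly's implementation: the 0.8 threshold is the four-out-of-five first-pass bar mentioned above, and slides are assumed to be numbered from 1.

```python
from collections import defaultdict

class RetrievalTracker:
    """Per-slide retrieval log: attempts, hits, and derived coverage stats.
    Illustrative sketch only, not studyly's actual data model."""

    def __init__(self, n_slides):
        self.n_slides = n_slides
        self.attempts = defaultdict(int)  # slide number -> retrieval attempts
        self.hits = defaultdict(int)      # slide number -> correct retrievals

    def record(self, slide, correct):
        self.attempts[slide] += 1
        if correct:
            self.hits[slide] += 1

    def unattempted(self):
        """Slides with zero retrievals: where the next 30 minutes should go."""
        return [s for s in range(1, self.n_slides + 1) if self.attempts[s] == 0]

    def hit_rate(self, slide):
        a = self.attempts[slide]
        return self.hits[slide] / a if a else 0.0

    def mastered(self, threshold=0.8):
        """Slides at or above the four-out-of-five first-pass bar."""
        return [s for s in range(1, self.n_slides + 1)
                if self.attempts[s] > 0 and self.hit_rate(s) >= threshold]

tracker = RetrievalTracker(n_slides=90)
tracker.record(47, correct=False)
tracker.record(47, correct=True)
print(len(tracker.unattempted()))  # 89 slides still have zero retrievals
print(tracker.hit_rate(47))        # 0.5
```

None of this is expressible when the unit is hours: there is no `record` call to make, because an hour does not attach to slide 47.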
How one evening looks under each metric
Same cardiology deck, same student, two ways of carving up the evening. Here is how the hours-tracked version reads the next morning.
4 hours logged on the cardio deck tonight. Reread slides 1 to 90 twice. Highlighted the parts that felt important. Made a one-page summary at the end. Closed the laptop at 1 a.m. feeling like I covered it.
- Total attached retrievals against any specific slide: 0.
- Slides with a known hit rate: 0 of 90.
- Time spent in the activity the exam actually tests: 0 minutes.
- Felt productive. Was not.
The part the other guides skip: where the questions come from
Here is the part no other "active recall vs study hours" page addresses, because it is a product problem and they only have advice to give: where do the questions come from. Karpicke and Roediger handed their participants the vocabulary list and the test. In real life, the test does not exist. You are a second-year med student with thirty PDFs from the renal block and no question bank that maps to your professor's actual slides. You cannot "do retrieval practice" against questions that do not exist. So you reread, because rereading is the only action available, and you log four hours, because hours are the only countable thing.
Writing the questions yourself is technically the right answer. It is also a one-to-two-hour task per 90-slide deck. If you have thirty decks left before the exam, that is 30 to 60 hours of authoring before you have run a single retrieval attempt. The crammer does not have those hours. So the crammer reads, gets the 36 percent, and writes off active recall as something that works for other people.
studyly's whole reason to exist is to make that authoring step a non-event. Drop in a 90-slide PDF or PPTX. Roughly 60 seconds later you have on the order of 200 retrieval-shaped items against that one deck, across four formats: multiple-choice, free-response, case-style vignettes, and image-occlusion flashcards over labeled diagrams. The questions are written against your professor's actual slides, not pulled from a generic web question bank for a different curriculum. On a held-out three-document eval scored blind on factual correctness, stem clarity, distractor quality, and question-type coverage, studyly scored 81.3 of 100. Turbolearn scored 57.8 on the same documents and rubric; Unattle and Gauntlet landed at 78.0 and 68.0. The retrievals are real retrievals, not noise.
“From spending an hour or two making 100 flashcards to doing that in 60 seconds.”
Hours-tracking vs retrieval-tracking, line by line
| Feature | Study hours | Retrieval per slide |
|---|---|---|
| What you are measuring | Minutes elapsed since you opened the deck. | Distinct retrieval attempts per slide, hit rate per slide, slides not yet attempted. |
| What the measurement predicts | Almost nothing. Hours correlate weakly with retention because most of the time inside the hour is rereading, which Karpicke and Roediger 2008 showed does not move one-week recall. | Retention. The two repeated-testing conditions in Karpicke and Roediger 2008 hit about 80 percent one-week recall, the no-more-testing conditions about 36 percent. |
| Resolution of the metric | Whole-deck. An hour spent on a 90-slide deck does not attach to any one slide. | Per-slide. A retrieval attempt is attached to the slide that authored the question, so a zero-hit slide is a visible gap. |
| Where the questions come from | They do not. Hours-tracked study is rereading or highlighting, because authoring 200 questions yourself takes one to two hours per deck. | Generated from your professor's actual slide deck in about 60 seconds, roughly 200 items across MCQ, free-response, case-style, and image-occlusion. |
| What happens when you take the same deck twice | Rereading the second time feels easier. That is the illusion of fluency, not retention. | Stems are auto-rephrased and distractors reshuffled on revisit, so take number five is still measuring retrieval, not memorized wording. |
| Adherence past week two | Hour-targets are the goal nobody is happy to meet, which is why they get skipped. | Each deck grows a tree, decks chain into a river, weekly leagues turn five minutes a night into a visible streak. |
Why hours-targets also fail at adherence
The second reason hours-based tracking quietly fails is adherence. "Study for two hours tonight" is a goal nobody is happy to meet. "Run the queue on the cardio deck until the tree finishes growing" takes about five minutes and has a visible end state. The product mechanic is small: each deck grows its own tree, and a fixed-size block of correct retrievals moves the tree from sapling through to full canopy. By exam morning the screen looks like a forest, one tree per deck. People who fall off five-minutes-a-day spaced repetition tools usually fall off around week two; the tree-and-river loop is the thing that keeps the daily action small enough to actually do.
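The growth mechanic described above reduces to a counter-to-stage mapping. A minimal sketch, with an assumed block size of 20 correct retrievals per stage and invented stage names; studyly's actual constants are not published here.

```python
def tree_stage(correct_retrievals, block_size=20):
    """Map a deck's correct-retrieval count to a growth stage.
    block_size and the stage names are illustrative guesses,
    not studyly's actual constants."""
    stages = ["seed", "sapling", "young tree", "full canopy"]
    idx = min(correct_retrievals // block_size, len(stages) - 1)
    return stages[idx]

print(tree_stage(0))    # seed
print(tree_stage(25))   # sapling
print(tree_stage(200))  # full canopy, however long past it you drill
```

The point of the mapping is the visible end state: "run the queue until the tree finishes growing" terminates, while "study for two hours" only elapses.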
That is the same reason the cramming version of the loop works. A 1 a.m. session dies from boredom long before it dies from a question bank running dry. Visible per-deck progress is one of the most reliable interventions against quitting a study session early. The pedagogy is the rubric; the staying power is the tree.
Common follow-ups
Should I track active recall or study hours?
Track active recall, specifically distinct retrieval attempts per slide or per concept. Karpicke and Roediger 2008 (Science) found about 80 percent one-week retention from repeated retrieval against about 36 percent from repeated study, and extra study after the first correct retrieval added almost nothing. Hours is a measurement of input, retrieval is a measurement of output, and only the output predicts the exam score. The honest target is at least one retrieval against every slide in the deck and a hit rate above some threshold (four-out-of-five is a reasonable first-pass bar).
How many active-recall attempts per slide is enough?
There is no single right number, but a practical floor is one retrieval per slide for first-pass coverage and three to five for mastery, with the spacing widening once you start hitting the item correctly. The studyly default of roughly two questions per slide across four formats (MCQ, free-response, case-style, image-occlusion) means a 90-slide deck gives you about 200 retrievable items, which is enough for a first pass with headroom for the cards you miss to come back several times under the spacing algorithm.
Is 30 minutes of active recall really better than 3 hours of rereading?
For exam-shaped retention, yes, in the direction the research keeps reproducing. The 2008 Karpicke and Roediger paradigm is the cleanest version: repeated retrieval beat repeated study by roughly 80 percent to 36 percent on a one-week test, and the gap held with substantially less total study time on the retrieval side. The caveat is that you need actual questions to retrieve from. 30 minutes of staring at a deck wishing you had practice questions is not 30 minutes of retrieval practice.
Why do my study hours go up but my grades do not?
Almost always because the hours are spent on activities that feel like studying but do not exercise retrieval: rereading the slides, highlighting, recopying notes, watching the lecture at 1.5x for a second time. Those produce the illusion of fluency: the material feels familiar on the page, your brain reads familiarity as mastery, and on the exam you cannot retrieve it because retrieving is a separate skill from recognizing. Re-budget your hours toward attempts that force cold recall. The hour count will go down. The grades will go up.
Does ChatGPT solve the authoring bottleneck?
It moves the floor up, but not all the way. ChatGPT will write you questions on demand, which is better than the zero questions you had before. The problems show up at scale: it does not enforce a quality rubric on what it emits, so distractors are often implausible and stems often paraphrase the answer; it does not track which items you got wrong, so the deck does not adapt; it returns the same wording on the second take, so by your third pass you are pattern-matching the question, not the fact. On the held-out three-document eval studyly scored 81.3, Turbolearn scored 57.8, and a generic prompt-only baseline scored worse than both. A weak distractor is not a neutral event on a cardiology question; it teaches you the wrong direction.
How does spaced repetition fit on top of all this?
Spaced repetition is the schedule for how often a given retrieval attempt resurfaces. Active recall is the unit, spaced repetition is the cadence. You need both. A spaced-repetition algorithm with no questions to schedule has nothing to do; a wall of questions with no schedule means you drill the same items into the ground while ignoring the ones you missed. studyly runs spaced repetition on top of the generated questions, so each retrieval feeds the schedule and items you miss come back sooner.
What if I genuinely have to study for a specific number of hours, like a tutor billing block?
Then use the hours as the container, not the metric. Inside the two-hour block, the thing you are counting is still retrieval attempts and hit rate per slide, and the goal at the end of the block is something like 'every slide in this deck has at least one retrieval and the lowest-hit-rate slides have been re-attempted twice'. The hours are the time you are paid to be in the chair. The retrievals are the work that happens in the chair.
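That container-versus-metric split can be made concrete. The sketch below is hypothetical: it replaces the interactive drill with a pre-recorded list of (slide, correct) outcomes and a fixed per-attempt cost, so the arithmetic is deterministic.

```python
def run_block(outcomes, budget_minutes, minutes_per_attempt=1.5):
    """Hours as the container, retrievals as the metric: attempt items
    until the time budget runs out, then report per-slide coverage.
    `outcomes` is a list of (slide, correct) pairs, a deterministic
    stand-in for an interactive drill session."""
    elapsed = 0.0
    attempted, hits = {}, {}
    for slide, correct in outcomes:
        if elapsed + minutes_per_attempt > budget_minutes:
            break  # the billing block is over; the container is full
        elapsed += minutes_per_attempt
        attempted[slide] = attempted.get(slide, 0) + 1
        hits[slide] = hits.get(slide, 0) + (1 if correct else 0)
    coverage = len(attempted)
    return coverage, {s: hits[s] / attempted[s] for s in attempted}

# A 120-minute block over a mocked outcome stream (slides 1..90, twice)
outcomes = [(s, s % 3 != 0) for s in range(1, 91)] * 2
coverage, rates = run_block(outcomes, budget_minutes=120)
print(coverage)  # 80 slides reached inside the budget
```

The block ends when the clock says so, but the thing worth writing down afterward is `coverage` and the low entries in `rates`, not the 120 itself.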
Is there a free tier and do I need a credit card?
Yes, free tier on app.jungleai.com, no credit card required to start. Upload a lecture, generate questions, and drill them without entering a card. Paid is opt-in.
Sources
- Karpicke, J.D. and Roediger, H.L. (2008). The Critical Importance of Retrieval for Learning. Science, 319(5865), 966 to 968. Open PDF at psychnet.wustl.edu.
- studyly internal Quality Comparison panel, 2026-04-24. Held-out three-document eval, scored blind on factual correctness, stem clarity, distractor quality, and question-type coverage: studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8.
Switch the metric.
Drop one lecture deck and watch it convert in about 60 seconds. Drill the queue, see your hit rate per slide, find the gaps that hours never could. Free tier on app.jungleai.com, no credit card.