Argument · scaling math vs rating honesty

Anki's algorithm scales fine. The rating signal does not.

People say Anki "breaks" at 30,000 cards. The algorithm does not break. It does the same FSRS gradient descent on a 30k-card collection that it does on a 200-card one, and it does it in seconds. What collapses at scale is your time-per-card budget, and once that budget falls below about 8 seconds, the rating you press is no longer measuring recall. It is measuring how fast your eye can find the familiar wording.

FSRS cannot tell the difference. It updates Stability, extends the interval, and reports rising retention while you forget the concept the day the wording rotates. The honest scaling problem in spaced repetition is upstream of the algorithm. It lives in what the card looks like when it surfaces.

Matthew Diakonov, Written with AI

Published May 22, 2026 · 10 min read

Direct answer · verified 2026-05-22

Does Anki's spaced repetition algorithm scale to large decks?

Yes. FSRS is well-behaved on 30,000-card or 100,000-card collections; optimization runs in seconds and per-card scheduling is O(1). Daily review volume grows logarithmically in time and linearly in new-cards-per-day rate, settling near the often-quoted 10x rule (roughly 10x your new-cards-per-day average). The user-facing limit at scale is not algorithmic. It is the time-per-card budget. At 800 reviews in a 90-minute session you have around 6 seconds per card, which is enough to pattern-match a static stem but not to retrieve a concept. The Good rating you press then becomes a cache hit, and FSRS schedules on that.

Authoritative reference for FSRS in Anki: docs.ankiweb.net/deck-options.html. The logarithmic review-load model: issarice.com/long-run-anki-review-load.

0Cards in a typical med-school AnKing-style collection

~200 daily reviews at 20 new cards/day, steady state

~6 seconds per card at 800 reviews in a 90-minute session

81.3 Studyly card-quality eval (vs field average 67.9)

The scaling math, written out

Issa Rice published the cleanest closed form for SM-2-shaped review growth. With cards added at rate c per day, ease factor a around 2.5, and day number D, the total reviews due on day D equal c·log_a((a-1)D+a). The function is unbounded in D and grows logarithmically, which means the per-year increment shrinks fast but never quite stops.

Concretely, with a = 2.5 the closed form puts c = 20 cards a day at about 140 reviews a day by the end of the first year and about 200 a day twenty years later. The 10x rule of thumb (steady-state reviews per day equal roughly 10 times your new-cards-per-day rate) is where this curve sits once D is large; the closed form assumes every review succeeds, so lapses add reviews on top of the curve. FSRS is more efficient than SM-2 at the same retention target, so the real multiplier is usually a bit lower, but the logarithmic shape is identical.

review-load.formula
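Written as a display equation, the closed form looks like this. The label R(D) for "reviews due on day D" is introduced here for convenience and is not notation from the original post; the second line is a worked rearrangement under the same assumptions (fixed ease factor, no lapses).

```latex
% Closed-form review load: c new cards per day, ease factor a, day number D
\[
  R(D) = c \,\log_a\!\bigl((a-1)D + a\bigr)
\]
% Sanity check at D = 0: R(0) = c \log_a a = c, i.e. only the new cards are due.
% Solving R(D) = 10c for the day on which the 10x multiplier is reached:
\[
  (a-1)D + a = a^{10}
  \quad\Longrightarrow\quad
  D = \frac{a^{10} - a}{a - 1}
  \approx 6356 \text{ days} \approx 17.4 \text{ years at } a = 2.5
\]
```

Taken literally, the closed form alone reaches a 10x multiplier only after well over a decade, which is a reminder that the 10x figure is a round planning number rather than an exact property of the curve.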

Pre-clinical med students who add 80 to 120 cards a day land at 800 to 1,200 reviews a day within months. That is the regime where the algorithmic story stops being the interesting one.

Where the second axis lives: time per card

The first axis is the c·log_a((a-1)D+a) curve. The second axis, the one most algorithm explainers skip, is how much actual answering time you have per surfacing once that curve has stretched out. The arithmetic is unromantic: one 90-minute session, ten percent overhead, divide by the review count.

time-budget.txt
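A minimal sketch of that arithmetic in Python. The function name and the `overhead` parameter are illustrative, not part of Anki; the 10 percent default is the overhead figure used in the paragraph above.

```python
def seconds_per_card(session_minutes: float, reviews: int, overhead: float = 0.10) -> float:
    """Answering time available per card once session overhead is removed."""
    if reviews <= 0:
        raise ValueError("reviews must be positive")
    usable_seconds = session_minutes * 60 * (1 - overhead)
    return usable_seconds / reviews


if __name__ == "__main__":
    # One 90-minute session at three queue sizes.
    for reviews in (200, 800, 1200):
        print(f"{reviews:>5} reviews: {seconds_per_card(90, reviews):.1f} s per card")
```

With the 10 percent overhead, 800 reviews come out at about 6.1 seconds per card; with `overhead=0`, the same session gives 6.75.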

A med MCQ written against a real lecture slide takes about 8 to 14 seconds of honest retrieval if the stem is fresh and the distractors are length-matched. A rephrased clinical vignette of the same concept can take 10 to 20 seconds. At 200 reviews a day you can afford either. At 800 reviews a day you can afford neither; there is only time to pattern-match.

Why pattern-match poisons the rating signal

FSRS sees two inputs per review: the button you press and the time since the last review. From those, it updates the DSR triple per card and predicts the next interval that maintains your desired retention. The model has no opinion on the cognitive route you took to the answer. A correct answer reached by pattern-matching the first ten words of the stem is identical, to FSRS, to a correct answer reached by running the differential.

When the same MCQ has surfaced three or four times with the same opening words, those words become a unique key into your mental cache. You see them, your brain returns the cached answer in around 1.2 seconds, you press Good. FSRS updates Stability upward. The interval extends. The deck statistics show 92 percent predicted retention. The cards look mastered.

The exam stem is a paraphrase. New patient age, new presentation order, same underlying concept. The first ten words do not match the cache key. Your brain runs the actual differential for the first time in three months and either finds the concept or does not. The result is bimodal and roughly half the cards you thought were mastered land on the wrong side of it.

The same card, two scaling outcomes

Identical concept (inferior wall MI, RCA territory). Identical FSRS parameters. Identical desired retention of 0.90. The only variable is whether the verbal surface rotates between surfacings. The trace below follows the static-wording outcome.

# Day 1, original wording
Q: A 58-year-old man presents with crushing substernal chest pain radiating to the jaw. ECG shows ST elevation in II, III, aVF. Which artery is occluded?
You: 9 seconds. Pick C (RCA). Rate Good.

# Day 8, identical wording
Q: A 58-year-old man presents with crushing substernal chest pain radiating to the jaw. ECG shows ST elevation in II, III, aVF. Which artery is occluded?
You: 1.2 seconds. Cache hit. Pick C. Rate Good (or Easy).
FSRS: extends to day 23.

# Day 23, identical wording
You: 0.9 seconds. Rate Easy.
FSRS: extends to day 60.

# Day 60, exam day, rephrased
Q: A 61-year-old smoker has new ST elevation in II, III, aVF with reciprocal changes in I and aVL. Which vessel?
You: cache miss. Stare. Guess. Fail.

Same opening 10 words become a unique cache key after exposure 2
Good ratings at 1 second reflect cache hits, not retrieval
FSRS stretches the interval based on the cache-hit ratings
Predicted retention reads 92 percent; exam-day retrieval ~50 percent

What scales, what does not, and where the rubric layer fits

The table below reads as a column-by-column inventory of what FSRS handles, what it does not, and which side of the pipeline the upstream fix lives on. The point is not that FSRS is bad at large decks. It is the opposite: FSRS is surgical about the one job it does, and silent about the jobs it does not.

Feature	Static cards into FSRS at scale	Studyly cards into FSRS at scale
What scales at deck size 30,000+	Daily review volume (logarithmic in D, linear in new-cards-per-day rate c), FSRS optimizer cost (still seconds), parameter fit quality (improves with more reviews).	Same. The algorithm does not stop working. The thing that stops working is the user's per-card time budget under that review volume.
What FSRS actually sees per review	One rating (Again/Hard/Good/Easy), one timestamp, one card ID. Nothing about why the rating was what it was.	Same input. The MCQ rubric layer changes the conditions under which the rating gets pressed, so the same input arrives as a more honest signal.
Failure mode at scale	Static stems become unique cache keys after 3 to 5 exposures. A Good at 1.2 seconds is a cache hit, not a recall. FSRS extends intervals on cache hits; predicted retention drifts above true retention.	Auto-rephrasing rotates the verbal surface on each surfacing. The cache key never matches twice in a row. Retrieval has to reach the concept; the Good rating actually reflects concept recall.
Effect of raising desired retention from 0.90 to 0.95	Shortens intervals, multiplies review burden 1.5x to 3x. Reduces forgetting probability per card. Does NOT make any individual rating more honest.	Same arithmetic effect on intervals. If the rating signal is already honest (Studyly auto-rephrasing engaged), the retention number is achievable at a lower review cost.
Capping new cards per day	Keeps review queue manageable at the cost of coverage. A 20-new-cards/day cap puts a pre-clinical med student permanently behind the lecture stream.	Studyly generates the full deck from one upload (200+ cards from a 90-slide lecture in 60 seconds). The cap moves from card creation to whatever subset you choose to drill that night.
When the FSRS optimizer should run	After ~1,000 new reviews or after any large import. Monthly cadence at med-school scale is plenty.	Same. Importing a Studyly .apkg counts as a large import. Let the optimizer re-fit before reading the predicted intervals on the new deck.

Diagnostic: is your rating signal honest yet?

If predicted retention is consistently above true retention in Anki Stats, the algorithm is faithfully scheduling something other than recall. The five checks below isolate where the dishonesty enters. Each takes roughly five minutes.

Five-check rating-signal audit

Pull 20 cards FSRS marked Good in the last 7 days. Mentally erase the first 10 words of each stem. Can you still answer? If fewer than 16 of 20 hold up, your ratings are measuring stem-cache, not retrieval.
Open Anki Stats, then FSRS Stats. Compare predicted retention with true retention over the last 30 days. A gap wider than 3 percentage points means the schedule has drifted away from your real recall, almost always because Goods are too cheap.
Sample 10 cards that have been Good for at least three intervals in a row. Check whether all four distractors are within 25 percent of each other in character length. A length tell turns the MCQ into a 1-second cache hit even on a fresh deck.
Time yourself on one full Anki session, just clock-in to clock-out. Divide by review count. If you are under 8 seconds per card on a non-cloze deck, you are pattern-matching by physics, not by choice.
If you import .apkg files from a generator, check whether the same concept surfaces with different stem wording across cards in the same deck. If 200 cards all open with 'The patient presents with', the cache key is the same for every card and the algorithm cannot tell you apart from a recognition test.
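Three of the checks above are mechanical enough to script. The sketch below is one possible reading of them, not an Anki or Studyly API: all function names are hypothetical, "within 25 percent" is interpreted as the shortest option being no more than 25 percent shorter than the longest, and a stem's "opening" is taken to be its first four words.

```python
from collections import Counter


def options_length_matched(options, tolerance=0.25):
    """Check 3: True if the shortest option is within `tolerance` of the longest."""
    lengths = [len(option.strip()) for option in options]
    longest = max(lengths)
    if longest == 0:
        return True
    return (longest - min(lengths)) / longest <= tolerance


def retention_gap_points(predicted, true):
    """Check 2: predicted minus true retention, in percentage points."""
    return (predicted - true) * 100


def most_common_opening(stems, n_words=4):
    """Check 5: the most frequent stem opening and the share of cards using it."""
    openings = Counter(" ".join(stem.lower().split()[:n_words]) for stem in stems)
    opening, count = openings.most_common(1)[0]
    return opening, count / len(stems)
```

A gap above 3 points from `retention_gap_points`, a `False` from `options_length_matched`, or a single opening shared by most of a deck each points at the same thing: Goods that are cheaper than the recall they are supposed to record.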

81.3

“On the held-out three-document eval (factual correctness, clarity, distractor quality, question-type coverage), Studyly cards score 81.3 against a field average of 67.9. The eval measures the upstream layer, the one FSRS does not see. If the cards FSRS will be scheduling are pattern-match-able, retention numbers drift up while real recall stays flat. Quality of the card matters more at large scale than at small scale because the time budget is tighter.”

Held-out three-document eval, April 2026 · methodology at studyly.io/quality

What to actually do if your deck is past 10,000 cards

Leave FSRS settings at default. Desired retention 0.90. Run Optimize from the FSRS section of the deck options once a month or after a large import. None of the deck-options sliders fix the rating-signal problem; the desired-retention slider only changes the price of forgetting.

Audit twenty Goods. If most pass the first-ten-words-erased test, you are fine, just keep going. If they do not, the fix is upstream. Generate cards against your real source (slides, PDFs, lectures, not generic web question banks), let the in-flight rubric reject cards that fail source-anchoring or distractor-length checks, and review either inside Studyly (where the stem auto-rephrases on each surfacing) or in Anki with periodic re-exports so the wording rotates that way instead.

Cards where the rote wording is the point (drug names, anatomical labels, Latin terms) do not need rephrasing. Static .apkg works fine for them. Cards where the underlying concept matters more than the surface (mechanism, vignette, differential) are the ones the rephrasing layer is for. Most med, dental, nursing, pharmacy, and PA decks are roughly 30 percent rote and 70 percent concept; splitting them across two presets gives the optimizer a cleaner fit too.

Common questions about Anki's algorithm at scale

Does Anki's spaced repetition algorithm actually break at 30,000 cards?

No. The algorithm is well-behaved at that scale. FSRS fits a per-card DSR triple (Difficulty, Stability, Retrievability) and re-fits parameters when you run the optimizer; the cost per scheduling decision is O(1) per card and O(N) over the deck during optimization. A 30,000-card collection on a modern laptop optimizes in seconds. The user-facing problem at scale is review volume, not algorithmic cost. Daily reviews settle at roughly 10x your new-cards-per-day rate (Issa Rice's c·log_a((a-1)D+a) formula), so a med student adding 80 to 120 new cards a day during pre-clinical years lands at 800 to 1,200 reviews a day by year 2.

Where does the 'reviews settle at 10x new-cards-per-day' rule of thumb come from?

It comes from integrating the review schedule under SM-2-like spacing. If you add c cards per day and each card is reviewed at intervals of a, a^2, a^3 days (with ease factor a around 2.5), the total cards due on day D works out to c·log_a((a-1)D+a). The function grows logarithmically without bound: 5 cards/day at a=2.5 produces 5 reviews on day 1, 42 by year 5, 50 by year 20, 56 by year 50. The practical 10x rule falls out of that math once D is large. FSRS is more efficient than SM-2 at the same retention target, so the multiplier is usually a bit lower than 10x in practice, but the logarithmic shape is the same.
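The closed form is easy to evaluate directly. This is a sketch under stated assumptions: `review_load` is a hypothetical helper name, and "year N" is taken to mean day N × 365, so the printed values land close to, but not necessarily exactly on, the rounded figures quoted above.

```python
import math


def review_load(c: float, day: float, a: float = 2.5) -> float:
    """Closed-form reviews due on `day`: c * log_a((a - 1) * day + a)."""
    return c * math.log((a - 1) * day + a, a)


if __name__ == "__main__":
    # 5 new cards per day, evaluated at the horizons mentioned in the text.
    for years in (0, 5, 20, 50):
        print(f"year {years:>2}: {review_load(5, years * 365):.1f} reviews/day")
```

Because the load is linear in c, multiplying the new-card rate by any factor multiplies every value on the curve by the same factor.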

Why does my time per card collapse as the deck grows?

Available study time per day does not grow with deck size. If you have 90 minutes to do Anki in the morning and your due queue is 200, you have 27 seconds per card. If the queue is 800, you have 6.75 seconds. At 6 seconds you can scan a stem, recognize wording you have seen before, and rate Good. You cannot read a clinical vignette, run a differential, eliminate two distractors, and decide between the remaining two. The algorithm does not know which kind of retrieval just happened; it only sees the rating button you pressed.

If the algorithm only sees the rating, why does that matter at scale?

Because at low time-per-card budgets, the only retrieval the brain has time to run is verbal pattern-matching. Pattern-matching succeeds when the stem is static, so you rate Good in 1 to 2 seconds. FSRS updates Stability upward, extends the interval, and the deck statistics report higher and higher retention. The exam stem reads differently from the deck stem (different patient age, different presentation, different distractors), the pattern-match key does not fire, and you fail. The algorithm did exactly what you told it to do. You just told it the wrong thing for several thousand reviews in a row.

Doesn't raising desired retention from 0.90 to 0.95 fix this?

No. Raising desired retention shortens intervals and gives you more reviews per day on the same deck, roughly 1.5x to 3x more depending on the deck. If your rating signal is dishonest (you are pattern-matching static stems), more reviews of the same dishonest cards just compound the false-mastery feedback faster. The Anki docs flag this in deck-options.html: the retention slider trades off review burden against forgetting probability, it does not change what the rating means. The fix has to happen upstream of FSRS, where the card surface lives.
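To see why the slider shortens intervals, here is a sketch using the power forgetting curve published for FSRS-4.5 (decay of -0.5, factor of 19/81). That choice is an assumption: newer FSRS versions fit the decay per user, so treat the numbers as illustrative rather than as what your Anki build computes. The comparison also holds stability fixed, which real scheduling does not.

```python
DECAY = -0.5
FACTOR = 19 / 81  # chosen so that retrievability is 0.9 when elapsed == stability


def retrievability(elapsed_days: float, stability: float) -> float:
    """Predicted recall probability after `elapsed_days` for a given stability."""
    return (1 + FACTOR * elapsed_days / stability) ** DECAY


def next_interval(stability: float, desired_retention: float) -> float:
    """Days until predicted recall falls to `desired_retention`."""
    return stability / FACTOR * (desired_retention ** (1 / DECAY) - 1)


if __name__ == "__main__":
    for target in (0.90, 0.95):
        print(f"desired retention {target:.2f}: {next_interval(10, target):.1f} days")
```

For a card with a stability of 10 days, this gives a 10-day interval at 0.90 and roughly 4.6 days at 0.95, about 2.2x as many surfacings of that card if stability stayed the same. Nothing in either function takes the quality of the recall as an input, which is the point of the answer above.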

What's the upstream fix that survives at 800 reviews a day?

Two parts. (1) Card-quality controls before emission: source-anchored stems (every card traces to a slide or page), length-matched distractors (within ~25 percent character length), grammar parallelism (no a/an tells), no filler templates (no 'all of the above', no 'both A and C'). (2) Stem rephrasing on each surfacing: the same underlying concept appears in different verbal surfaces (original wording, paraphrase, clinical vignette, analogy) so the pattern-match cache never fires twice on the same key. Studyly handles both as part of the generation and review pipeline; static .apkg files imported into Anki get only the first part, which is why a periodic re-export is the workaround when reviewing in Anki rather than in Studyly.

What does FSRS Optimize actually do, and how often should I run it at scale?

FSRS Optimize re-fits the weights of the DSR model (17 in FSRS-4.5, 19 in FSRS-5, 21 in FSRS-6) against your review log via gradient descent. It runs in seconds even on a collection with hundreds of thousands of reviews. Anki recommends running it after roughly 1,000 reviews of new history and after any large import. At med-school scale a monthly cadence is plenty; running it after every study session is wasted effort. The optimizer cannot fix the rating-signal problem either; it just fits the DSR triple better to whatever signal you have given it.

Can I just cap new cards per day to keep reviews under 200?

You can, but the math is unforgiving. The 10x rule (reviews per day at steady state ≈ 10x new-cards-per-day) means 200 reviews/day requires you to cap at 20 new cards/day. A pre-clinical block with 30 lectures a week and 50 to 100 testable facts per lecture produces around 1,500 to 3,000 facts a week. Twenty new cards a day is 140 a week. You will be permanently behind. Capping new cards solves the volume problem by trading away the coverage problem; that is a different failure mode, not a solution.
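The coverage arithmetic above, as a short sketch. Both helper names are hypothetical, and the 10x multiplier is the rule of thumb from the text rather than a measured value.

```python
def new_card_cap(review_budget: float, multiplier: float = 10) -> float:
    """New cards per day that keep steady-state reviews within `review_budget`."""
    return review_budget / multiplier


def weekly_coverage_gap(lectures_per_week: int, facts_per_lecture: int, new_cards_per_day: float) -> float:
    """Testable facts produced per week minus new cards absorbed per week."""
    produced = lectures_per_week * facts_per_lecture
    absorbed = new_cards_per_day * 7
    return produced - absorbed


if __name__ == "__main__":
    cap = new_card_cap(200)
    for facts in (50, 100):
        gap = weekly_coverage_gap(30, facts, cap)
        print(f"{facts} facts/lecture: {gap:.0f} facts per week left uncarded")
```

At a 20-card cap the weekly shortfall is 1,360 facts at the low end and 2,860 at the high end, and it accumulates every week of the block.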

How does Studyly's auto-rephrasing actually change the FSRS rating signal?

When the same concept surfaces with rotating verbal surfaces, your retrieval has to traverse the concept on each surfacing instead of hitting a cached stem-to-answer mapping. The wall-clock time per card goes up (8 to 12 seconds instead of 2 to 3) but the rating you give back actually reflects concept retention. FSRS then sees a more honest stream of ratings: Good means 'I retrieved the concept', not 'I recognized the wording'. The interval growth tracks real retention. On the held-out three-document eval, Studyly's cards score 81.3 on factual correctness, clarity, distractor quality, and question-type coverage (vs Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8). The eval is orthogonal to FSRS but it measures the upstream layer.

Is there a deck size at which the algorithm itself stops working well?

Not in any practical sense. People run 100,000-card collections in Anki without algorithmic issues; the FSRS model is happy with that. The user runs out before the algorithm does. The two soft ceilings in practice are (1) time-per-card collapsing below the retrieval-honesty threshold (around 8 to 10 seconds per card for non-trivial MCQs) and (2) deck heterogeneity making the DSR fit noisier (mixing 5,000 anatomy cards with 5,000 pharmacology cards in one preset gives a less personal fit than two presets). Both are organizational problems, not algorithmic ones.
