Argument · two halves of voice study

The ASR is solved. The grader is not.

Every article currently online about an Anki voice rehearsal grader is really a comparison of speech-to-text engines (Vosk, Whisper, browser native). Speech-to-text is no longer where the workflow breaks. The grader is. As of May 2026, no Anki add-on publishes a held-out evaluation of how often it gets the verdict right on a spoken free-response answer. This page maps the landscape: what exists, the four things a real grader has to do, a worked example of where a string-match grader fails, and a hands-free workflow you can run today.

Matthew Diakonov
11 min read

Direct answer · verified 2026-05-12

Is there an Anki add-on that grades my spoken answer?

No native AnkiWeb add-on does this today. The open-source anki-voice (williamknows, Vosk-based) is voice navigation only: it routes spoken commands to the "again / good / easy" review buttons and never compares your spoken answer to the card back. The closest external tools that do compare are Audio Flash (web, imports .apkg, FSRS, 30+ languages) and an AI Voice Tutor for Anki shared on forums.ankiweb.net in late 2025. Neither publishes a grader-accuracy number.

Studyly does not ship voice I/O. It is the typed counterpart with a high-quality grader: 81.3 on a held-out three-document eval, against a field average of 67.9, with per-criterion scores at studyly.io/quality. The honest hands-free workflow today is OS dictation feeding Studyly's free-response field; the recipe is below.

The two halves of voice rehearsal

A voice rehearsal grader has to do two things in series. First, turn your audio into a string (the ASR step). Second, decide whether the string matches what the card considers a correct answer (the grading step). The two halves have completely different difficulty profiles.

The ASR step in 2026 is essentially done. Whisper-large-v3 hits sub-five percent word-error-rate on most lecture audio. Vosk is accurate enough that the offline open-source workflow ships without ever calling a cloud API. macOS, iOS, and Windows all have system dictation good enough for short medical-vocabulary answers. Pretending the workflow is bottlenecked here is what most existing write-ups do, because comparing engines is easy to write about.

The grading step is unsolved. Substring match on the card back fails on every elaborated correct answer. LLM-as-judge with only the card back as the reference fails on every answer that adds verifiable context the back field does not mention. Both failure modes are systematic, not edge cases. They are why the existing voice-grader tools have no published eval.
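The two-stage shape is worth making concrete. A minimal sketch of the pipeline (the function names and the `Verdict` schema here are placeholders for illustration, not any tool's published API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    label: str         # "correct" / "partial" / "incorrect"
    confidence: float  # 0.0-1.0, what the scheduler eventually consumes
    rationale: str

def review_spoken_answer(
    audio: bytes,
    card_back: str,
    source_span: str,
    asr: Callable[[bytes], str],
    grader: Callable[[str, str, str], Verdict],
) -> Verdict:
    # Stage 1, the solved half: any engine works (Whisper, Vosk, OS dictation).
    transcript = asr(audio)
    # Stage 2, the unsolved half: the grader should see the source span,
    # not just the card back, to tell elaboration from hallucination.
    return grader(transcript, card_back, source_span)
```

Everything interesting on this page lives inside the second callable; swapping ASR engines changes almost nothing downstream.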

Where the bad grader actually breaks

Concrete example: a renal-physiology card, and a student giving a full-sentence verbal answer that is correct and goes beyond the card back. Below, what a string-match or back-only LLM judge produces, against what a source-grounded rubric judge produces from the same source slide.

Same spoken answer, two graders, different FSRS verdicts

Card front: What does NKCC2 do in the nephron?
Card back: Reabsorbs Na+, K+, Cl- in the thick ascending limb of the loop of Henle.
Student spoken answer (transcribed): "NKCC2 is the cotransporter in the thick ascending limb of the loop of Henle that brings sodium, potassium, and chloride from the lumen back into the cell. It's the molecular target of loop diuretics like furosemide."

String-match or back-only grader:
  • String-match grader: INCORRECT (no exact substring overlap with "Reabsorbs Na+, K+, Cl-")
  • Card-back LLM judge: UNCERTAIN (the student's mention of furosemide is unverifiable from the back field alone)
  • Verdict surfaced to FSRS: AGAIN (closest mapping for an uncertain / incorrect result)
  • Next-review interval: reset to ~1 day
  • Time cost over the next month: card re-appears ~6x more often than it should

Source-grounded rubric judge:
  • Verifies the elaboration (furosemide as a loop-diuretic target) against the source slide
  • Verdict surfaced to FSRS: GOOD
  • Next-review interval: preserved

  • Substring match fails on the elaborated correct answer
  • Back-only LLM cannot verify the furosemide context
  • Verdict collapses to AGAIN; interval resets
  • Card re-appears ~6x more often than it should

The difference is not in the model. Both graders can use the same underlying LLM. The difference is what reference material the grader is allowed to see. A back-only judge cannot verify that "furosemide" is on-slide; a source-grounded judge can. The interval penalty from a false-negative verdict (AGAIN instead of GOOD) compounds over a month of review.
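The string-match failure is mechanical and easy to reproduce on exactly this card. A minimal demonstration (illustrative code, not any shipping tool's grader):

```python
def substring_grader(transcript: str, card_back: str) -> bool:
    # The naive check: the card back must appear verbatim in the answer.
    return card_back.lower() in transcript.lower()

back = "Reabsorbs Na+, K+, Cl- in the thick ascending limb of the loop of Henle."
spoken = ("NKCC2 is the cotransporter in the thick ascending limb of the loop "
          "of Henle that brings sodium, potassium, and chloride from the lumen "
          "back into the cell. It's the molecular target of loop diuretics "
          "like furosemide.")

# A fully correct, elaborated answer is marked wrong:
print(substring_grader(spoken, back))                # False
# Only a near-verbatim read-back of the card passes:
print(substring_grader("It " + back, back))          # True
```

The grader rewards recitation and punishes exactly the elaboration that indicates real understanding.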

What a real grader has to do

Four requirements. They are necessary and roughly sufficient. A tool that hits three of four is still useful; a tool that hits one of four (which is the modal state of the current voice-grader landscape) is not appreciably better than typing your answer and self-grading.

Four grader requirements


1. Source-grounded reference, not just the card back

The grader needs access to the material the card was generated from (slide, PDF page, transcript). The card back is shorthand; a correct verbalized answer often expands the shorthand and a back-only judge cannot tell elaboration from hallucination. With the source attached, the judge can verify that the elaboration is on-slide instead of guessing.


2. Rubric-driven verdict, not a single similarity score

A real grader emits a structured verdict across named criteria (factual correctness, completeness, distinguishing key concepts from filler) rather than a single number. The structured verdict is what lets the student see why they were marked wrong on a partial answer, and what lets the spaced-repetition scheduler weight partial credit instead of binary right/wrong.
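The difference in output shape is easiest to see in code. The schema below is a sketch; the criterion names are the ones this page uses, not a published format:

```python
from dataclasses import dataclass

@dataclass
class RubricVerdict:
    factual_correctness: float       # 0.0-1.0
    completeness: float              # 0.0-1.0
    key_concepts_over_filler: float  # 0.0-1.0
    rationale: str                   # why the student was marked down, in words

    def overall(self) -> float:
        # Unweighted mean, purely illustrative; a real grader could
        # weight factual correctness more heavily.
        return (self.factual_correctness + self.completeness
                + self.key_concepts_over_filler) / 3

# A partial answer gets a legible breakdown instead of a bare 0.6:
v = RubricVerdict(1.0, 0.5, 0.9, "correct but missed the loop-diuretic clue")
```

The named fields are what let a student see the why, and what give the scheduler something finer than right/wrong.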


3. Calibration against a held-out eval, with publication

If a grader scores 95 percent on the same cards it was tuned on, that number is meaningless. The relevant measurement is the held-out three-document eval: cards the grader has never seen, scored by a separate rubric run by humans, with the spread between automated grade and human grade reported as a single number. Without a published number, students cannot tell a 90th-percentile grader from a 50th-percentile one.
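The number in question is simple to define. A sketch of the measurement on hypothetical data (the real eval's aggregation at studyly.io/quality may differ):

```python
def heldout_spread(auto_grades: list[float], human_grades: list[float]) -> float:
    """Mean absolute gap between the automated grade and the human grade,
    on cards the grader has never seen. The single number worth publishing."""
    if len(auto_grades) != len(human_grades):
        raise ValueError("one human grade per automated grade")
    return sum(abs(a - h) for a, h in zip(auto_grades, human_grades)) \
        / len(auto_grades)

# Hypothetical held-out cards: automated vs. human verdicts on a 0-1 scale.
spread = heldout_spread([1.0, 0.5, 0.75, 0.0], [1.0, 0.75, 0.75, 0.5])
```

A tool that reports this on unseen cards can be compared to another; a tool that reports accuracy on its tuning set cannot.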


4. Partial-credit signal back to the scheduler

FSRS and SM-2 both accept granular feedback: 'again', 'hard', 'good', 'easy'. A binary grader collapses to two of those four buttons and throws information away. A graded confidence score (0.0 to 1.0 along factual correctness) maps cleanly to the full scale, which is what makes the next-review interval reflect the actual quality of the recall, not a coin flip.
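The mapping itself is a few lines. The thresholds below are illustrative guesses, not values any shipping tool publishes:

```python
def to_scheduler_button(correctness: float) -> str:
    """Map a 0.0-1.0 factual-correctness score onto the four FSRS / SM-2
    buttons instead of collapsing to a binary again/good."""
    if correctness >= 0.9:
        return "easy"
    if correctness >= 0.7:
        return "good"
    if correctness >= 0.4:
        return "hard"
    return "again"

# The elaborated-but-correct NKCC2 answer lands on "good" or "easy"
# instead of resetting the interval to ~1 day.
```

A binary grader forfeits the middle two buttons; a graded score uses all four, and the interval tracks the quality of the recall.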

The current landscape, row by row

The tools commonly returned for "anki voice rehearsal grader" on the open web fall into three buckets: voice navigation (anki-voice), web apps with .apkg ingestion (Audio Flash, AI Voice Tutor for Anki), and TTS-only add-ons (AwesomeTTS, HyperTTS). The TTS-only tools are excluded from the table below because they do not grade at all. The table compares the three options that come closest, against the underlying assumption of this page (Studyly as the typed equivalent of the harder grading half).

Feature · Voice tools that ingest .apkg · Studyly (typed)

What it grades
  • Voice tools: anki-voice (open-source, AnkiConnect-driven) handles voice COMMANDS only. It maps 'again/good/easy' to the review buttons; you grade yourself.
  • Studyly: rubric-driven judge that reads your typed free-response, the card source span, and the four named criteria. Returns a structured per-criterion verdict, not a single score.

ASR engine
  • Voice tools: anki-voice uses Vosk (offline, vosk-model-en-us-daanzu-lgraph). Audio Flash: cloud ASR, 30+ languages. AI Voice Tutor: cloud ASR, model undisclosed.
  • Studyly: does not currently ship voice I/O. OS dictation (macOS / iOS / Win+H) handles spoken input at ~95 percent word accuracy in a quiet room.

Source-anchoring
  • Voice tools: Audio Flash and AI Voice Tutor grade against the card back only. anki-voice: no grading.
  • Studyly: every card was generated from a cited slide / PDF page / transcript timestamp; the grader can verify elaborated answers against that span.

Published held-out eval
  • Voice tools: none of the three. Audio Flash describes grading as 'advanced voice recognition technology' without an accuracy number; the AI Voice Tutor announcement post links no eval.
  • Studyly: 81.3 on a held-out three-document eval (factual correctness, clarity, distractor quality, question-type coverage), methodology at studyly.io/quality.

Spaced-repetition scheduler
  • Voice tools: anki-voice defers to Anki / FSRS. Audio Flash: built-in FSRS. AI Voice Tutor: preserves FSRS state on the exported .apkg.
  • Studyly: built-in spaced repetition with auto-rephrasing on revisit (prevents pattern-matching the question wording).

Card source
  • Voice tools: whatever .apkg you upload. Card quality is upstream of the voice loop.
  • Studyly: generates the cards from your professor's actual slide deck, PDF, textbook, or YouTube lecture in ~60 seconds, then runs a four-gate rubric before emission.

What you get out
  • Voice tools: anki-voice gives hands-free deck control. Audio Flash: web app review session, .apkg round trip. AI Voice Tutor: web app, .apkg with updated FSRS state.
  • Studyly: the cards (web app), Anki-compatible .apkg export including image-occlusion, free-response grading with explain-my-mistake referencing the original PDF or slide.
Held-out grader scores (same eval):
  • Studyly: 81.3
  • Unattle: 78.0
  • Gauntlet: 68.0
  • Turbolearn: 57.8

Held-out three-document eval, scored on factual correctness, clarity, distractor quality, and question-type coverage. Methodology and source documents at studyly.io/quality. The voice-grader tools (anki-voice, Audio Flash, AI Voice Tutor for Anki) are absent because none publish a comparable number; they are invited to.

A hands-free workflow you can run today

Until someone ships an Anki add-on that bundles a real grader with an ASR, the pragmatic path is to compose the two halves yourself. The ASR is the OS dictation already on your laptop or phone, which is free and good enough. The grader is whatever web tool actually publishes a held-out eval. The deck is your professor's actual slide content, generated upstream so the cards are gated on the four-check rubric before they enter rotation. Recipe:

hands-free-studyly.md
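The recipe file's step list did not survive extraction into this page. A reconstructed sketch, assembled from this page's own description of the workflow (OS dictation feeding Studyly's free-response field) — the exact steps and menu paths are assumptions:

```markdown
# hands-free-studyly.md (reconstructed sketch)

1. Generate the deck in Studyly from the lecture's slide deck, PDF,
   textbook chapter, or YouTube link (~60 seconds).
2. Start a free-response review session in the web app.
3. Put the cursor in the answer field and trigger OS dictation
   (macOS: Dictation shortcut · Windows: Win+H · iOS: mic key).
4. Speak the answer once; let dictation fill the field. Fix any
   transcription typos on niche terms before submitting.
5. Submit. The rubric grader returns a per-criterion verdict with an
   explain-my-mistake reference to the source slide or PDF page.
6. Accept the suggested spaced-repetition grade and move to the next
   card. A USB headset keeps the ASR error rate low over a long session.
```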

Caveats. The OS dictation engines are weakest on niche medical vocabulary in noisy rooms; a USB headset cuts the error rate enough that it is worth the twenty bucks for a long study session. The Studyly grader is text-input today, not voice conversational; the recipe above takes one dictation pass per card, not a back-and-forth probe. Conversational voice with a source-grounded grader is still an open product.

What to ask any voice grader before you trust your study time to it

Five questions to send the maintainer (or check the README for) before adopting a voice rehearsal tool. If the answer to two or more is "we do not publish that", the tool is a black box and the schedule it produces for you will reflect its grader noise rather than your actual recall.

Five-question grader audit

  • Did the tool say what it grades against? Card back only, or card back plus the source slide / PDF page? Source-grounded is materially better on free-response.
  • Has the tool published a held-out eval with a single number? Not a marketing claim; an evaluation document with cards the grader has never seen.
  • Does the grader emit per-criterion scores, or one binary verdict? Per-criterion is what makes 'almost right but missed the loop diuretic clue' visible.
  • Does the verdict feed spaced repetition with granularity ('hard' / 'good' / 'easy'), or collapse to 'again' / 'good'? Granular signals carry into FSRS cleanly.
  • What language and accent coverage does the ASR support? Whisper-large-v3 and the major cloud ASRs are fine for English medical vocabulary; smaller offline models drift on niche terms.
81.3 · Studyly held-out grader score

On the same three held-out source documents (a cardiology lecture, a renal lecture, a USMLE-style passage), Studyly's grader scored 81.3 across factual correctness, clarity, distractor quality, and question-type coverage. Field average across the other published tools on the leaderboard is 67.9. The voice-grader tools listed elsewhere on this page are not on the leaderboard because none have published a comparable number.

Held-out three-document eval, May 2026 · methodology at studyly.io/quality

Honest summary

If you specifically need conversational voice review of an existing Anki deck and you do not want to compose anything, the two candidates are Audio Flash and the AI Voice Tutor for Anki. Pick one, keep your expectations modest until a published eval lands, and re-check periodically.

If you are open to the typed-with-dictation workflow above, Studyly is a better grader and the cards are built from your professor's actual slide deck instead of a generic question bank. The voice piece is OS dictation; the grading piece is the part with the published eval.

And if you are an Anki add-on developer reading this: the interesting unsolved problem is not the ASR. It is wiring a source-grounded rubric judge into the review flow so the spoken answer can be verified against the slide the card was drawn from. The market wants this.

Drop a lecture deck and try the free-response grader

Upload a slide deck, PDF, textbook chapter, or YouTube lecture. You get MCQ, free-response, case-style, and image-occlusion cards in about 60 seconds. Free tier, no credit card.

Start with one lecture

Frequently asked questions

Is there a native AnkiWeb add-on that grades my spoken answer?

As of May 2026, no. The closest add-ons in the AwesomeTTS / HyperTTS family only add text-to-speech to the card front. The open-source anki-voice (github.com/williamknows/anki-voice) is voice navigation: it routes spoken commands like 'again', 'good', and 'easy' to the Anki review buttons via AnkiConnect, but never compares your spoken answer to the back of the card. A long-running feature request on Anki-Android (issue #1754) asks for spoken-answer grading and is still open. If you want spoken-answer grading today, the working options are outside Anki: Audio Flash (audioflash.app, imports .apkg, FSRS) and the AI Voice Tutor for Anki shared on forums.ankiweb.net in late 2025 by user albertjoseph.

What does 'grader' mean in this context, and why is it harder than speech-to-text?

The grader is the component that decides whether your answer was correct. For a multiple-choice card the grader is trivial: did you pick the right letter. For a free-response card the grader has to read what you said, read what the card lists as the correct answer, and decide if those match in substance. Substring overlap is not the right metric. 'The renal artery branches off the abdominal aorta below the SMA' and 'it branches from the abdominal aorta, below the SMA, into segmental arteries' should both be marked correct. 'The renal artery branches off the inferior vena cava' should be marked wrong. The hard part is that the model needs the actual reference material to make that call, not just the card back, because card backs are usually shorthand.

Why is the ASR (speech-to-text) the easy half?

Because as of 2026 you have several production-grade options that do better than human transcription on most domain audio. Whisper-large-v3 hits sub-5 percent word-error-rate on medical lecture audio. Vosk runs offline on the laptop and handles command-style utterances well enough that the open-source anki-voice add-on never needs cloud APIs. Browser-native SpeechRecognition is fine for short answers in quiet rooms. The transcription step is no longer where the workflow breaks. The grading step is.

How does Audio Flash grade answers, and how good is it?

Audio Flash describes its grading as 'advanced voice recognition technology' that 'compares them to the card's correct answer' and is 'designed to understand natural speech patterns.' That phrasing is consistent with an LLM-as-judge approach (read the user transcript, read the card back, output a verdict), but the company does not publish a held-out evaluation, a confusion matrix, or a sample of cards where the grader gets it right and wrong. The app does import .apkg and use FSRS, which is the right scheduling primitive. The grader quality is just opaque, which is why it does not show up in a leaderboard.

What about the AI Voice Tutor for Anki posted on forums.ankiweb.net?

That tool (shared in late 2025 by user albertjoseph) ingests an .apkg, runs a conversational voice session where the AI asks questions and probes follow-ups, and downloads an updated deck preserving FSRS state. It is more ambitious than Audio Flash in that it allows back-and-forth instead of a single utterance per card. The grader is still undocumented. Without a held-out eval there is no way to know whether 'probes deeper on close answers' means the model has a rubric or is just trained to keep talking.

Why does grader quality matter for spaced repetition?

Because the grader is the input to the scheduler. FSRS and SM-2 both decide when to show you a card next based on whether you said you got it right. If the grader marks correct answers as wrong, you re-see those cards days earlier than needed (false positives on the 'forgot' side, study time wasted). If the grader marks wrong answers as right, the card disappears from your queue and you never relearn the fact (the worse failure mode, you carry a wrong fact into the exam). Voice rehearsal with a 60-percent grader is worse than typed rehearsal with a 90-percent grader, even when the ASR is perfect, because the bad grader poisons the schedule.

Does Studyly have voice input?

Not as a first-class feature. The product is built around four written question formats (MCQ, free-response, case-style, image-occlusion) generated from your professor's actual slide deck, PDF, textbook, or YouTube lecture. The grader is what is differentiated: 81.3 on a held-out three-document eval scored on factual correctness, clarity, distractor quality, and question-type coverage, versus a field average of 67.9 (methodology and per-criterion scores at studyly.io/quality). If you want voice rehearsal today the honest workflow is OS dictation into the free-response field, which the recipe section below walks through.

Can I export Studyly cards into Anki and use a separate voice grader on them?

Yes. Studyly exports .apkg files including image-occlusion cards. From there you can review them in Anki proper, or import the .apkg into Audio Flash or the AI Voice Tutor for the voice loop. The reason to generate in Studyly first is the in-flight rubric: cards are gated on source-anchoring, length-matching, filler-template ban, and grammar parallelism before they are emitted (full breakdown at studyly.io/t/anki-rubric-mcq-quality). A voice grader downstream is only as good as the cards it grades; bad cards stay bad regardless of the I/O modality.

What hardware do I need for a hands-free workflow?

Anything modern. Built-in macOS dictation (System Settings -> Keyboard -> Dictation) handles short medical-school answers at about 95 percent word accuracy in a quiet room. iOS dictation is comparable. On Windows, Win+H opens the system dictation overlay; accuracy is similar. None of these need a third-party microphone for short answers. If you are reviewing a long stack of cards in one session, a USB headset cuts background noise enough to push the ASR error rate below 2 percent, which materially reduces friction because you stop having to fix transcription typos before the grader runs.

Why not just type? What is the point of voice study?

Three reasons people reach for voice rehearsal. One, oral exam prep: med, dental, and PA programs run viva-style assessments where you have to verbalize a differential diagnosis, and typing trains the wrong muscle. Two, hand fatigue or RSI: a student who already does six hours of clinical writing per day cannot also do two hours of card typing at night. Three, dual-task tolerance: cooking, walking, driving (audio only, eyes on the road). For oral exam prep the verbalization itself is the point and any grader at all is better than none. For RSI it is an accessibility need. For dual-tasking the grader can be lighter because the user is willing to forgive false negatives.

Does grading on the card back really fail that often?

Yes, on free-response cards specifically. A typical Anki back field is a few words to a single sentence. If a med student says 'the loop of Henle reabsorbs sodium chloride via the NKCC2 cotransporter in the thick ascending limb' and the card back reads 'NKCC2', a string-match grader fails. A back-only LLM-as-judge does better but still confuses correct elaborations with hallucinations because it has no other source to verify the elaboration against. A grader that has the original lecture slide can mark the elaboration correct if the slide supports it, wrong if the slide contradicts it. The reference material does most of the work.

Where can I see Studyly's published grader eval?

studyly.io/quality. The page lists the three held-out documents (a cardiology lecture, a renal lecture, a USMLE-style passage), the four criteria (factual correctness, clarity, distractor quality, question-type coverage), and the per-tool scores: Studyly 81.3, Unattle 78.0, Gauntlet 68.0, Turbolearn 57.8. Audio Flash and the AI Voice Tutor for Anki are not on the leaderboard because we have no public eval to compare against; both are welcome to publish one.
