Claude Opus 4.7 · medical school · the lecture-deck workflow

Claude Opus 4.7 is a real upgrade for med school. The chat session is not the system.

Anthropic released Claude Opus 4.7 on April 16, 2026. For medical students working from lecture decks, three things actually matter: vision resolution jumped to 3.75 megapixels (so labeled anatomy figures and histology slides are finally legible), the new xhigh effort level sharpens mechanism explanations, and the model is honest about uncertainty in clinical scenarios in a way 4.6 was not. Pricing is unchanged at $5 per million input tokens and $25 per million output.

What did not change is the part that decides whether you actually learn the slide deck. A raw chat session still cannot enforce a quality rubric on every output, cannot reword stems on revisit gated by your miss data, and cannot run spaced repetition. This page is the runnable prompt template, the four 4.7 changes that move output, and the gap between raw 4.7 and a rubric-gated pipeline measured on the same three documents.

Jump to the prompt →
Matthew Diakonov · 11 min read

Direct answer · verified 2026-05-08

How do I use Claude Opus 4.7 for med school?

Drop the lecture PDF or PowerPoint into Claude Opus 4.7 (set effort to xhigh), run the rubric-encoded prompt below to generate a first-pass batch of 50 source-grounded multiple-choice questions with slide-number citations, then layer a rephrasing and spaced-repetition system on top so the questions you missed today come back tomorrow with reworded stems. The chat is the generator; retention is the wrapped system. Specs and release notes are on anthropic.com.
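If you prefer the API to claude.ai, the same workflow is a short script. The sketch below uses the Anthropic Python SDK with a PDF document block; the model id, file names, and token budget are placeholders, and it does not set the xhigh effort level, since the exact API parameter for it is not covered here (check the release notes).

```python
import base64
import json

import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

# Placeholder file names: your lecture deck and the prompt template below.
with open("renal_physiology_deck.pdf", "rb") as f:
    deck_b64 = base64.standard_b64encode(f.read()).decode()
with open("claude-opus-4-7-medschool-prompt.md") as f:
    prompt = f.read()

response = client.messages.create(
    model="claude-opus-4-7",  # placeholder id; use the exact model string from anthropic.com
    max_tokens=16000,         # a 50-question JSON batch is long; budget output generously
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": deck_b64}},
            {"type": "text", "text": prompt},
        ],
    }],
)

# The prompt asks for strict JSON; if the model wraps it in a code fence, strip the fence first.
questions = json.loads(response.content[0].text)
print(f"{len(questions)} questions generated")
```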

What changed in 4.7 that actually moves med-study output

Most of the public Opus 4.7 coverage is about coding benchmarks and agentic use cases. For medical students, the changes that matter sit in vision, effort, and calibration. Four specifics:

Four 4.7 changes that move med-study output

  • Vision lifted to 3.75 megapixels (from 1.15 in 4.6), so labeled anatomy figures, biochem pathway diagrams, and histology slides are now legible end to end. This is the single biggest unlock for image-heavy lectures.
  • New xhigh effort level produces tighter mechanism explanations on multi-step pathways. Pyruvate to ATP, the renin-angiotensin cascade, and the coagulation cascade are explained with fewer hand-wave steps and more correct intermediates.
  • Calibrated uncertainty on clinical-scenario items. The model is more willing to say 'the slide does not specify' instead of fabricating a confident wrong answer, which lowers the rate of fluent-but-incorrect distractors.
  • Same pricing as 4.6: $5 per million input tokens, $25 per million output. Roughly $0.50 to $1.50 of API cost per 90-slide deck for a full 200-question generation, depending on figure count.

The prompt, copy-pasteable

The block below encodes the four-criterion rubric directly into a single Claude Opus 4.7 turn: source grounding, distractor quality, type coverage, and stem rewording. Drop your lecture deck into the chat, set effort to xhigh, paste the prompt, and you get back strict JSON with slide-number citations on every item. The explanation_quote constraint is the part that makes the output checkable: any question without a verbatim line from a real slide gets thrown out before you study it.

claude-opus-4-7-medschool-prompt.md

This prompt is the starting point, not the finished system. Run it on a single 90-slide deck and check the first 20 outputs against the deck before you study from any of them. The model is consistent enough at xhigh that a clean run keeps ungrounded questions to a low single-digit percentage, but you should verify on your own material before betting an exam on it.
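Checking the first 20 by hand is the minimum; a few lines of Python make the same check exhaustive. A minimal sketch, assuming a one-page-per-slide PDF and the explanation_quote / explanation_quote_slide fields described in the FAQ below; the file names are placeholders.

```python
import json
from pypdf import PdfReader  # pip install pypdf

# Placeholder file names: the model's JSON output and the deck it came from.
questions = json.loads(open("renal_deck_questions.json").read())
reader = PdfReader("renal_physiology_deck.pdf")

# Assumes one PDF page per slide, which holds for decks exported from PowerPoint or Keynote.
slide_text = {i + 1: (page.extract_text() or "") for i, page in enumerate(reader.pages)}

def normalize(s: str) -> str:
    """Collapse whitespace and case so line-wrap differences don't cause false misses."""
    return " ".join(s.split()).lower()

ungrounded = []
for q in questions:
    slide = int(q.get("explanation_quote_slide") or 0)
    quote = q.get("explanation_quote") or ""
    # Grounded means the quoted line really appears on the cited slide.
    if not quote or slide not in slide_text or normalize(quote) not in normalize(slide_text[slide]):
        ungrounded.append(q)

print(f"{len(ungrounded)} of {len(questions)} questions failed the grounding check")
```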

Anchor fact · why a rubric in the prompt is not the same as a rubric in the system

The prompt encodes the rubric. The pipeline enforces it.

A rubric written into a prompt is a request. A rubric run as a post-generation gate is a guarantee. In a single Claude turn, question 1 may pass factual correctness, distractor quality, clarity, and type coverage; question 47 may quietly fail distractor quality (the right answer is twice as long as the wrong ones) and you will not notice until you have already reviewed it. A wrapped pipeline runs the same four criteria against each generated item and drops the ones below threshold. That is what a held-out eval actually measures: not the model's ceiling, but the floor of what reaches the student.
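To make the distinction concrete, here is a miniature version of such a gate. These are cheap illustrative heuristics, not Studyly's actual graders, and the item fields (stem, options, answer_index) are assumed rather than taken from the template.

```python
from statistics import mean

def passes_gate(q: dict, grounded: bool) -> bool:
    """Cheap per-item checks standing in for a real four-criterion grader."""
    options = q["options"]                     # assumes a standard 4-5 option MCQ
    correct_len = len(options[q["answer_index"]])
    wrong_lens = [len(o) for i, o in enumerate(options) if i != q["answer_index"]]

    # Distractor-quality check from the example above: flag items where the
    # right answer is far longer than the distractors.
    if correct_len > 2 * mean(wrong_lens):
        return False
    # Clarity proxy: stems that balloon past ~120 words tend to be muddled.
    if len(q["stem"].split()) > 120:
        return False
    # Factual-correctness proxy: reuse the grounding check result.
    return grounded

# kept = [q for q in questions if passes_gate(q, grounded=q not in ungrounded)]
```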

On the held-out three-document eval (a slide deck, a textbook chapter, and a paper), Studyly scored 81.3 versus the field average of 67.9. The 13.4-point gap is almost entirely the post-generation gate, the auto-rephrasing on revisit, and the slide-number grounding, sitting on top of whatever frontier model is generating. Swap the model and the gap stays roughly the same; remove the gate and the gap disappears.

What 4.7 still cannot do, even with a perfect prompt

The model is the generator. The retention is the system. Three gaps that no single chat turn can close, no matter how tight the prompt.

The three gaps a chat session cannot close

  • No rubric on the second question. Question 1 may pass factual correctness, distractor quality, clarity, and type coverage; question 47 may quietly fail one of them. A chat session does not run a quality gate per item.
  • No stem rewording on revisit gated by your miss data. You can ask for a reword, but the model does not know which underlying fact you have already mastered and which is still soft, so the reword loop does not drive retention.
  • No spaced repetition. The questions you got wrong are not surfaced more often than the ones you got right. By revisit five from the same deck, you are mostly seeing the same questions in the same order, which trains recognition rather than recall.
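For a sense of what closing the third gap takes mechanically, here is a deliberately minimal Leitner-style scheduler: a miss resets the card to tomorrow, a hit stretches the interval out. It is a sketch of the idea, not any product's actual algorithm.

```python
import datetime as dt
from dataclasses import dataclass, field

@dataclass
class CardState:
    interval_days: float = 1.0
    due: dt.date = field(default_factory=dt.date.today)

def review(state: CardState, got_it_right: bool) -> CardState:
    """Hard misses come back sooner; easy hits drift further out."""
    if got_it_right:
        state.interval_days = min(state.interval_days * 2.5, 60.0)
    else:
        state.interval_days = 1.0   # a miss resets the card to tomorrow
    state.due = dt.date.today() + dt.timedelta(days=round(state.interval_days))
    # A wrapped pipeline would also request a reworded stem each time the card
    # resurfaces, so the same fact comes back in a different surface form.
    return state
```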

Raw 4.7 chat session vs a rubric-gated pipeline

Side by side, on the dimensions that matter for actually learning a lecture deck. Both sides assume the same model is doing the generation; the difference is the system around it.

  • Question generation from a lecture deck. Raw chat: drop the file in; ~30-90 seconds for 50 questions; quality varies across the batch. Studyly: drop the file in; ~60 seconds for 200 questions across MCQ, free-response, case-style, and image-occlusion.
  • Per-question quality gate (rubric). Raw chat: none; the four criteria live in the prompt and nothing checks the output. Studyly: the same four criteria run as a gate on every item; anything below threshold gets dropped or rolled back.
  • Stem rewording on revisit. Raw chat: manual; you ask the chat to reword each item one at a time. Studyly: automatic; different opening words, different sentence shape, rotated distractor pool, and the right-answer index moves.
  • Slide-number citation on every wrong answer. Raw chat: possible if you encode it in the prompt, but not enforced by default. Studyly: always; the explain panel quotes the slide line and the slide number when you miss.
  • Spaced repetition driven by your miss data. Raw chat: none; the chat has no memory of which fact you got wrong yesterday. Studyly: built in; hard misses surface sooner with a different surface form, easy hits drift further out.
  • Image-occlusion cards from labeled figures. Raw chat: not natively; the vision pass reads the figure but does not produce a mask card. Studyly: yes; the labeled structure is masked and you recall the term cold, with export to Anki .apkg.
  • Cost for one 90-slide deck. Raw chat: roughly $0.50 to $1.50 in API tokens, plus your time to wrangle the prompt. Studyly: the free tier covers a real lecture deck end to end on app.jungleai.com.

The hour of prompt-wrangling vs the 60 seconds

Even with the prompt template above, the workflow has friction: you upload the deck, paste the prompt, wait, parse the JSON, sift the ungrounded questions, run the next deck the same way, and then build your own study loop on top. Most students do this once, decide it is too much, and revert to highlighting. Two paths to consider:

Running 4.7 on a med school deck

Drop the lecture deck. Paste the prompt. Wait for 50 questions. Parse the JSON in your head, copy the questions you trust into a doc, throw out the rest, then rebuild a study loop manually for spaced repetition and rephrasing. Repeat per deck. Most students stop after the first one.

  • $0.50 to $1.50 per deck in API tokens
  • 10 to 20 minutes per deck of human review
  • No spaced repetition unless you build one yourself

The held-out eval, in numbers

Three source documents (a slide deck, a textbook chapter, a paper) were held out. Each tool generated questions from the same three documents. Every output was graded on factual correctness, clarity, distractor quality, and question-type coverage. Same documents, same rubric, same graders.

Score chart: Studyly vs Unattle, Gauntlet, and Turbolearn. Full per-tool numbers are on /quality.

Higher is better. Methodology and the rubric definitions are on /quality. The eval is model-agnostic: the score reflects the rubric and the system, not which frontier model the generation is wrapped around. That is why the gap is stable across model releases.

Studyly: 81.3 / 100

Auto-rephrasing means I can't lazy-pattern-match the first three words. Eight days into spaced repetition I actually retain the renal stuff.

Held-out three-document eval (Studyly), April 2026

When using 4.7 directly is the right answer

Two cases where the chat session beats a wrapped pipeline, even for med school work.

  • One-off concept clarification. You misunderstood Starling forces in the lecture and want a 5-minute mechanism walkthrough with a re-derived equation. A chat session is a better tool for this than any quiz wrapper; you do not need spaced repetition on a one-off explanation.
  • Custom prompt experiments. You want a question shape no wrapped tool generates, like extended matching questions in the AMBOSS style or counterfactual stems where the slide's premise is reversed. Iterating on the prompt is what 4.7 is good for. Once you find a shape that works, encode it and run it at scale.

Honest verdict

Claude Opus 4.7 is the best model in the Anthropic family for medical reading at the time of writing. The vision lift to 3.75 megapixels matters. The xhigh effort level matters on mechanism stems. Pricing is unchanged. If you want to run questions directly off your professor's slide deck with full control over the prompt, the template above is the starting point and the $0.50 to $1.50 per deck of API cost is reasonable.

What 4.7 does not change is the part of medical study that actually decides whether you remember the slide on Friday. A generation pass is not a study system. The gating, the rephrasing, and the spaced repetition are what move the eval number from 67.9 (the field) to 81.3 (Studyly), and they sit around the model rather than inside it.

The right move for most med students is to use 4.7 directly when you want one-off explanations or custom prompt experiments, and to use a wrapped pipeline (free tier on app.jungleai.com is the honest first stop) when you want the rubric, the rephrasing, and the slide-grounded retention loop without rebuilding all of that yourself.

Frequently asked questions

What actually changed in Claude Opus 4.7 that matters for medical school study?

Four things. First, vision resolution rose from 1.15 to 3.75 megapixels, which is the difference between a labeled anatomy figure being unreadable and being parseable, and is the single biggest unlock for slide decks built around microscopy and histology images. Second, the new xhigh effort level produces sharper mechanism explanations on the kind of multi-step pathway questions you see in biochemistry and physiology. Third, the model is more honest about uncertainty in clinical-scenario items, which means fewer confidently wrong distractors. Fourth, the new tokenizer uses roughly 1.0 to 1.35 times as many tokens per page, so a long lecture deck that fit in 4.6 may need to be chunked. The pricing is unchanged at $5 per million input tokens and $25 per million output tokens.
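If a deck does overrun the window at the new tokenizer rate, splitting it by slide range is the simplest workaround. A sketch using pypdf; the 40-slides-per-chunk default is an arbitrary starting point and should shrink for figure-heavy decks.

```python
from pypdf import PdfReader, PdfWriter  # pip install pypdf

def split_deck(path: str, slides_per_chunk: int = 40) -> list[str]:
    """Split a long one-slide-per-page deck into smaller PDFs that each fit in one request."""
    reader = PdfReader(path)
    out_paths = []
    for start in range(0, len(reader.pages), slides_per_chunk):
        writer = PdfWriter()
        for page in reader.pages[start:start + slides_per_chunk]:
            writer.add_page(page)
        end = min(start + slides_per_chunk, len(reader.pages))
        out = f"{path.rsplit('.', 1)[0]}_slides_{start + 1}-{end}.pdf"
        with open(out, "wb") as f:
            writer.write(f)
        out_paths.append(out)
    return out_paths
```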

Can Claude Opus 4.7 read my scanned PDF lecture deck with labeled diagrams?

Yes for born-digital PDFs, conditionally yes for scans. A born-digital deck (exported from PowerPoint or Keynote) parses cleanly: the model reads the text layer and the slide images at the new 3.75MP cap. Image-only scans go through Anthropic's vision pipeline as well, which now resolves small caption text on histology and microscopy slides that 4.6 routinely missed. The remaining failure mode is photocopied-three-times handout scans where the OCR is hostile and the labels are ghosted; for those, run them through a separate OCR pass first, or upload the original PowerPoint file if you have access to it.
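For the hostile-scan case, one workable pre-processing pass is Tesseract over rendered page images. The tools below (pdf2image plus pytesseract) are one suggestion for that separate OCR pass, not part of the Anthropic pipeline.

```python
from pdf2image import convert_from_path  # pip install pdf2image (requires poppler)
import pytesseract                        # pip install pytesseract (requires tesseract)

def ocr_scanned_deck(path: str) -> str:
    """Return a plain-text transcript of a scanned deck, one labelled block per slide."""
    pages = convert_from_path(path, dpi=300)  # higher dpi helps ghosted labels
    blocks = []
    for i, img in enumerate(pages, start=1):
        text = pytesseract.image_to_string(img)
        blocks.append(f"--- slide {i} ---\n{text.strip()}")
    return "\n\n".join(blocks)
```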

Why doesn't a raw Claude Opus 4.7 chat session work as well as a study tool wrapped around the same model?

Three reasons, all about the system around the model rather than the model itself. First, no rubric is enforced at output time, so factual correctness, distractor quality, type coverage, and clarity vary across the 200 questions in a single response. Second, no stem rewording on revisit; by the time you have answered the same MCQ twice, you are memorizing the wording, not the slide. Third, no spaced repetition, so the questions you got wrong are not surfaced more often than the questions you got right. On the held-out three-document eval, Studyly scored 81.3 versus Turbolearn 57.8 and the field average 67.9, where Turbolearn and the field are roughly the level of a generic LLM-quiz wrapper. The model is the generator. The retention is the system.

What's the cheapest way for a student to actually run Claude Opus 4.7 on their lecture decks?

Three honest options. First, the Claude Pro consumer subscription at $20 a month gives you generous usage in claude.ai, which is enough for a few decks a week if you avoid long agent sessions. Second, the API at $5 input / $25 output per million tokens is metered; a 90-slide PDF at the new tokenizer rate costs maybe $0.20 to $0.60 input and $0.30 to $1.00 output for a full 200-question generation, so roughly $0.50 to $1.50 per deck end to end. Third, an agentic wrapper like Studyly's free tier on app.jungleai.com lets you drop in real lecture decks at no cost (no credit card) to test the workflow before you decide whether the rubric and the spaced repetition are worth the paid tier.
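The per-deck arithmetic is simple enough to sanity-check yourself; the token counts in this sketch are back-calculated from the dollar ranges above rather than measured.

```python
def deck_cost_usd(input_tokens: int, output_tokens: int,
                  in_price: float = 5.00, out_price: float = 25.00) -> float:
    """Per-deck API cost at the quoted $5 / $25 per million token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Token counts back-calculated from the dollar ranges above, not measured:
low  = deck_cost_usd(40_000, 12_000)    # = $0.50
high = deck_cost_usd(120_000, 40_000)   # = $1.60, roughly the top of the quoted range
```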

Does Studyly use Claude Opus 4.7 under the hood?

Studyly does not publish the model behind the question pipeline, and the eval on /quality is model-agnostic by design: the 81.3 score is a property of the generation system and the rubric, not of any one frontier model. Practically, this means the eval result holds even when the underlying model rotates between releases (4.6, 4.7, and whatever ships next), because the rubric is the gate. If you want to run Claude Opus 4.7 directly on your own decks with full transparency, the prompt template above is the starting point. If you want the rubric, the auto-rephrasing, the slide-number citations, and the spaced repetition to come built in, that is what app.jungleai.com adds.

How do I handle hallucinations when Claude writes a question that contradicts the slide?

Build the slide-number citation into the prompt and do not accept any question that does not include it. The prompt template above adds an explanation_quote and explanation_quote_slide field; any output without those gets thrown out before you study it. Spot-check the first 20 questions against the deck. If the citation slide actually contains the explanation_quote string, the question is grounded. If it does not, the model invented the fact, and you discard it. In practice, with the rubric encoded in the prompt and xhigh effort, the rate of ungrounded questions is low single digits on a clean PDF deck, and is higher (roughly 8 to 12 percent) on heavily abbreviated bullet decks where the model has to expand the bullets into stems.

What about USMLE-style questions; can I generate Step 1 practice items this way?

You can, but it is not the highest-leverage use of Claude Opus 4.7. The Step exams are item-writer-vetted by NBME, and UWorld and AMBOSS already publish first-exposure question banks at a level no LLM consistently matches on clinical-vignette stems. Where 4.7 actually wins is class exams, where the source material is your professor's slide deck and there is no pre-existing question bank. Use UWorld for breadth on Step content. Use 4.7 (or a system wrapped around 4.7) for depth on class material. The mistake students make is trying to replace UWorld with a chat session; the right framing is to use the chat session for the part UWorld doesn't cover.

Will the model rewrite the stem on revisit if I just keep asking?

Sort of, and not in a useful way. If you paste the same MCQ back into a chat and ask for a reworded version, you get a reworded version. The problem is that a chat session has no memory of which underlying fact you have already mastered and which you are still missing, so you cannot drive the reword loop from your own miss data. The reword loop only produces retention when it is gated by spaced repetition: hard misses come back sooner with a different surface form, easy hits drift further out. That gating happens in the system around the model, not in any single Claude turn, which is why the same model produces different retention outcomes inside vs outside a wrapped study app.