Lessons from the bench

What 20+ annotation sessions taught us about how Claude perceives images, where it consistently fails, and what we changed in the annomate MCP server to make those failures fewer or smaller.

Each lesson is a recurring failure mode we saw across multiple examples, plus the tool or skill change it drove. The thesis is set out in The perception loop; this essay is the empirical follow-up.

1. Priors fire before the pixels are checked

The most common error class. Claude has strong cultural and statistical priors about what kind of image it’s looking at, and those priors fire before the model checks the pixels.

  • Hokusai, Great Wave (example 14). Pre-look prior placed Mount Fuji on the central white foam mass — the same iconographic confusion first-time human viewers make. The actual Fuji is the small snow-capped peak right of centre, ~240 px to the right.
  • GRACE-FO launch (08). The timestamp “12:48 UTC = 05:48 local” was read as “pre-dawn marine layer” using a winter intuition. At 34.6 °N in late May, sunrise is ~04:55 local — 05:48 is well after dawn. The shadows in the image were sharp, the sky was blue, none of it matched “pre-dawn.”
  • Avercamp, Winter Landscape (11). The pre-look prior included a defecating peasant (a common but not universal Avercamp motif) and a V-formation of geese in the upper-right corner. Neither was actually present in those positions.
  • Vermeer, The Milkmaid (12). A strong “directly below” spatial prior placed the brass milk pail directly under the wicker basket on the wall. The pail is actually below-and-right by ~440 px Euclidean — the worst single placement in the corpus.

Mitigation. The skill now asks Claude to state its prior in writing before calling via_get_image, so the prior is recorded as a falsifiable prediction rather than silently shaping placement. The via_classify_scene and via_read_metadata tools were added to ground priors in pixel statistics and EXIF/GPS data before they get loaded onto features.
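
To make "falsifiable prediction" concrete, here is a minimal sketch of recording a prior and then scoring it against the corrected placement. The `PriorPrediction` class and `prior_error_px` function are hypothetical names for illustration, not part of the annomate server, and the coordinates are invented rather than taken from the corpus.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class PriorPrediction:
    """A pre-look guess about where a feature sits (hypothetical record shape)."""
    label: str
    x: float  # predicted centre, original-image pixels
    y: float


def prior_error_px(prior: PriorPrediction, actual_x: float, actual_y: float) -> float:
    """Euclidean distance between the predicted centre and the observed one."""
    return hypot(actual_x - prior.x, actual_y - prior.y)


# Illustrative numbers only, not measurements from any session.
prior = PriorPrediction(label="brass pail", x=300.0, y=400.0)
error = prior_error_px(prior, actual_x=600.0, actual_y=800.0)
print(f"{prior.label}: prior missed by {error:.0f} px")
```

Writing the prior down first means the miss distance is a number that can go in the lab notebook, rather than an impression reconstructed after the fact.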

2. Sub-canvas-tenth features can’t be placed from a thumbnail

A feature smaller than ~10% of the canvas extent cannot be reliably placed from the full-image view, even at 2048 px. The eye reads “there’s a bird there” or “the pail is on the wall” without internalising the exact coordinates.

  • Jaguar eyes (01). Three loops at 1024 px to get the eye boxes onto the iris row — each correction shifted the boxes ~50 px down. Fixed in the re-annotation pass by starting at 2048 px and cropping the head before placing.
  • Avercamp bird circles (11). Five circles placed around individual flying birds were each 30–100 px off in inconsistent directions on the first overlay pass. Three refinement rounds to centre them.
  • Vermeer brass pail (12). Cropping the wall area at session start showed the pail clearly, but by the time Claude placed the wall-mounted regions (~10 tool calls later), the crop had fallen out of working memory and placement reverted to thumbnail-eyeball precision.

Mitigation. via_get_image_crop was added to the MCP server. The skill now recommends re-cropping immediately before placing small features, not just at the start of the session — orientation crops decay after about five unrelated tool calls.
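
The two rules of thumb in this section (the ~10% size threshold and the roughly five-call decay) can be sketched as a single check. This is an illustration under assumptions: the function names are hypothetical, and "canvas extent" is read here as the feature's larger side relative to the matching canvas side, which the text does not spell out.

```python
from typing import Optional

CANVAS_FRACTION_THRESHOLD = 0.10  # "~10% of the canvas extent" from the text
CROP_STALE_AFTER_CALLS = 5        # "about five unrelated tool calls" from the text


def is_small_feature(feature_w: float, feature_h: float,
                     canvas_w: float, canvas_h: float) -> bool:
    """True when the feature's larger relative side is under the threshold."""
    relative_extent = max(feature_w / canvas_w, feature_h / canvas_h)
    return relative_extent < CANVAS_FRACTION_THRESHOLD


def should_recrop(feature_w: float, feature_h: float,
                  canvas_w: float, canvas_h: float,
                  calls_since_crop: Optional[int]) -> bool:
    """Recommend a fresh crop for small features when no recent crop exists.

    calls_since_crop is None when the area has never been cropped.
    """
    small = is_small_feature(feature_w, feature_h, canvas_w, canvas_h)
    stale = calls_since_crop is None or calls_since_crop >= CROP_STALE_AFTER_CALLS
    return small and stale


# An 80x60 px feature on a 2048x1536 canvas, cropped ten tool calls ago.
print(should_recrop(80, 60, 2048, 1536, calls_since_crop=10))
```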

3. Box the whole named object, not its most prominent part

A box labelled “church” should cover nave + tower + spire, not just the tower. A box labelled “foot warmer” should cover the wooden housing + base + cast shadow, not just the pierced cube on top. Claude consistently boxes the most prominent or most-identifiable part of a multi-part object rather than the whole named thing.

  • Avercamp church (11). First-pass box covered only the tower and spire. User enlarged it ~3× to include the nave.
  • Vermeer foot-warmer (12). First-pass box was small and offset right; user shifted it ~225 px left and enlarged it to capture the base and shadow.
  • Vermeer window (12). Conversely, the leaded-glass-window box over-extended into the wall and floor; user trimmed ~189 px off the right and ~419 px off the bottom.

Mitigation. Added to the skill’s “Perception gotchas”: “Box the whole named object, not its most-prominent part.” A symmetric reminder for over-extension was added the same way.
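
One way to apply the "whole named object" rule mechanically is to box each part first and then take the union. The sketch below assumes boxes are `(x, y, w, h)` tuples in original-image pixels; the `Box` type, the `union_box` helper, and the part coordinates are all illustrative, not taken from the server or the corpus.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float


def union_box(parts: Iterable[Box]) -> Box:
    """Smallest axis-aligned box that contains every part box."""
    parts = list(parts)
    if not parts:
        raise ValueError("need at least one part box")
    left = min(p.x for p in parts)
    top = min(p.y for p in parts)
    right = max(p.x + p.w for p in parts)
    bottom = max(p.y + p.h for p in parts)
    return Box(left, top, right - left, bottom - top)


# Invented part boxes for a "church": nave, tower, spire.
nave = Box(100, 300, 400, 200)
tower = Box(420, 150, 80, 350)
spire = Box(440, 50, 40, 100)
church = union_box([nave, tower, spire])
print(church)
```

The union only guards against under-coverage; the over-extension case (the Vermeer window) still needs a visual check against the overlay.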

4. Upright-person priors misplace faces on leaning figures

The “upright-person” heuristic places the face roughly at the top of the body box. Leaning, seated, or bent figures break the heuristic, and the resulting face boxes are offset by 50–100 px.

  • Camus at the Nobel banquet (18). All three face boxes were corrected by the user — Camus’s was 56 px low because he was hunched over the autograph book; the two women’s were 79–103 px laterally off because they were leaning inward, not standing upright.
  • Camus crowning Lucia (19). Profile-face boxes drifted left when anchored at the visible ear — fix is to locate the nose/front cheek first and work back to the ear.
  • Avercamp skater with stick (11). Box landed on the wrong figure in a cluster of two visually similar stick-holders; correction shifted ~135 px left and ~113 px down.

Mitigation. Two reusable rules went into Claude’s memory: “initial y-placements run ~15% too high on vertically-dense compositions” and “profile-face boxes drift left when anchored at the ear.” Neither is a tool change; both are placement heuristics learned from repeated failure.

5. Specialist vocabulary breaks the VLM verifier

via_verify_region runs a small vision-language model (Florence-2 / Qwen-3B) on a crop and asks whether the box matches the label. It works well for generic visual categories (“is this a face?”, “is this a cloth?”) but fails for specialist vocabulary.

  • Howe sewing machine (20). Qwen verified “no” on both the overview and close-up crops of the “curved eye-pointed needle” — describing the close-up as “wooden surface… light blue background” while completely missing the needle. CLIP-cosine grading agreed with Claude’s label_match at 0.758 and 0.879 (strong); the verifier was the unreliable voice.
  • Camus identity boxes (18, 19). Qwen-3B cannot verify “Camus face” or any other named-person label — same pattern.

Mitigation. The skill now flags via_verify_region as reliable only for generic visual-category checks; specialist vocabulary should rely on CLIP-rubric grading (via_grade_annotations) plus the user’s review. The corpus also revealed a bug — Florence-2 fails to load on some installs because of a transformers version skew — which is tracked upstream.
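
The routing rule can be sketched as a small lookup. The tool names come from the text, but `GENERIC_CATEGORIES`, `checks_for_label`, and the `"user_review"` marker are hypothetical, and the category list is a placeholder rather than the skill's actual list. Exact matching is deliberate: "Camus face" contains a generic word but is a named-person label, so it must not be routed to the verifier.

```python
# Placeholder set of labels the VLM verifier is assumed to handle well.
GENERIC_CATEGORIES = frozenset({"face", "cloth", "person", "window", "bird"})


def checks_for_label(label: str) -> list[str]:
    """Pick which checks to trust for a region label.

    Generic visual categories go to the VLM verifier; anything else
    (specialist terms, named people) goes to CLIP-rubric grading plus
    the user's own review.
    """
    normalised = label.strip().lower()
    if normalised in GENERIC_CATEGORIES:
        return ["via_verify_region"]
    return ["via_grade_annotations", "user_review"]


for label in ("face", "Camus face", "curved eye-pointed needle"):
    print(label, "->", checks_for_label(label))
```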

6. Coordinate-space arithmetic is a footgun

Early sessions had to compute x_returned × original_dim / returned_dim for every box. Twenty boxes per session × two axes = forty multiplications, each a chance to introduce error. This was the single biggest source of placement-error variance in sessions 1–4.

Mitigation. The MCP server gained an xy_space parameter on every region-write tool: "original" (default), "returned_<dim>", or "fraction". The fraction encoding (0.0–1.0 of the original image’s extent) is now the default-recommended path and has zero arithmetic errors across the corpus once adopted. It also reads naturally — humans describing positions tend to default to fractions of the frame (“a third from the top, just left of centre”).
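
The conversion that each encoding implies can be sketched as follows. This is not the server's implementation: `to_original_px` is a hypothetical helper, and because the text does not say how the `<dim>` in `"returned_<dim>"` is interpreted, the sketch takes the returned image size as an explicit argument instead of parsing it from the string.

```python
from typing import Optional, Tuple

Size = Tuple[float, float]  # (width, height)


def to_original_px(x: float, y: float, xy_space: str,
                   original_size: Size,
                   returned_size: Optional[Size] = None) -> Tuple[float, float]:
    """Convert a point from the given coordinate space to original-image pixels."""
    orig_w, orig_h = original_size
    if xy_space == "original":
        return x, y
    if xy_space == "fraction":
        # 0.0-1.0 of the original image's extent on each axis.
        return x * orig_w, y * orig_h
    if xy_space.startswith("returned_"):
        if returned_size is None:
            raise ValueError("returned_size is required for returned-space points")
        ret_w, ret_h = returned_size
        # The manual formula from early sessions: x_returned * original_dim / returned_dim.
        return x * orig_w / ret_w, y * orig_h / ret_h
    raise ValueError(f"unknown xy_space: {xy_space!r}")


original = (4096, 3072)
print(to_original_px(0.5, 0.5, "fraction", original))
print(to_original_px(512, 384, "returned_1024", original, returned_size=(1024, 768)))
```

Both calls describe the same point. The fraction form needs no knowledge of the returned size, which is why it removes the per-box arithmetic altogether.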

7. Claude couldn’t see its own annotations

Before the tools evolved, via_get_image(fid) returned the raw image with no annotations laid on top. To verify whether a box had landed where intended, Claude had to mentally project the box coordinates back into the returned image. The browser showed the overlay; Claude couldn’t see the browser.

Mitigation. via_get_image and via_get_image_crop both gained an overlay=true flag (and via_get_image_with_regions was added as a sibling). The model can now see its own annotations laid on the image — the perception-loop equivalent of being able to run code you just wrote. This was the single largest reduction in friction the corpus surfaced.

The MCP server grew in response to the lab notebook. Roughly:

| Added | Replaces (or fixes) | Friction it solved |
| --- | --- | --- |
| xy_space="fraction" on region writes | Manual arithmetic from returned-px to original-px | Forty multiplications per session, each a chance to err |
| via_get_image_crop (with overlay=true) | Re-fetching the whole image at higher res just to inspect a small feature | Crops are token-cheap and information-dense; also a go/no-go tool for uncertain features |
| overlay=true on via_get_image | Dead-reckoning on box positions in the returned image | Lets Claude see its own annotations the way the user sees them |
| [ai] extra (the “third voice”) | Two-voice loop (Claude + user), no automated check between draft and STOP | Pixel-statistic verdicts on placements before user review |
| via_classify_scene | Hand-rolled scene priors | Sub-second CLIP classification that auto-routes subsequent detection calls |
| via_read_metadata | Inferring time/place from filename and caption | EXIF + GPS catch a whole class of prior errors before they get placed (the GRACE-FO “pre-dawn” reading would have failed against this immediately) |
| via_grade_annotations | “Eyeball every box” | A CLIP-cosine rubric across the project: flags egregious misplacements and surfaces shape-encoding suggestions |
| via_verify_region | Trusting Claude’s identity claims | A VLM crop-caption pass per region. Useful for generic categories; unreliable for specialist vocabulary |
| via_tighten_region | Manual box-edge fiddling | SAM-derived tight bbox and IoU score per rectangle |
| via_suggest_regions | “What did I miss?” | Find-missing pass that excludes existing regions and proposes new ones with broad prompts |
| via_run_ocr | Hand-placing text labels | Tesseract pass for label-rich images (the Vincent signature, the Howe patent stamp) |

Recurring asks from the lab notebook that haven’t yet landed:

  • via_version / via_capabilities. No way to record which server build a session was produced against. Six months from now, a .training/ entry won’t tell us which version of the server produced it — which makes regression attribution impossible.
  • via_get_annotation_changes(since=<marker>). After the user corrects six boxes in the browser, Claude has to diff via_get_annotations output by hand to figure out what was learned. The most valuable signal in the session (what the human caught that Claude missed) currently requires manual diffing to extract.
  • Polygon SAM-snap. via_tighten_region exists for rectangles. The analogous “snap polygon vertices to the SAM mask of the same class” would close the loop for the hand-trace + user-edit pattern.
  • Close-crop overlay scoping. When rendering a close-crop overlay, the server currently renders labels for regions whose bounding box intersects the crop window — even if the region is mostly off-crop. This produced phantom-label confusion on the Vermeer foot-warmer crop and erodes trust in the overlay tool.
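
As a rough illustration of what the requested via_get_annotation_changes could report, the sketch below diffs two snapshots of region boxes. The snapshot shape (a dict from region id to an `(x, y, w, h)` tuple) is an assumption, since the actual format of via_get_annotations output is not described here, and `annotation_changes` is a hypothetical helper with invented data.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h) in original-image pixels


def annotation_changes(before: Dict[str, Box], after: Dict[str, Box],
                       tolerance_px: float = 1.0) -> dict:
    """Report regions added, removed, or edited between two snapshots.

    Edited regions map to per-field deltas (dx, dy, dw, dh); changes at or
    below tolerance_px on every field are ignored as noise.
    """
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    edited = {}
    for region_id in sorted(before.keys() & after.keys()):
        deltas = tuple(a - b for a, b in zip(after[region_id], before[region_id]))
        if any(abs(d) > tolerance_px for d in deltas):
            edited[region_id] = deltas
    return {"added": added, "removed": removed, "edited": edited}


# Invented snapshots: Claude's draft, then the state after user corrections.
before = {"window": (100, 100, 500, 900), "foot warmer": (800, 900, 60, 50)}
after = {"window": (100, 100, 500, 900),
         "foot warmer": (600, 900, 120, 90),
         "pail": (700, 300, 80, 80)}
changes = annotation_changes(before, after)
print(changes)
```

The per-field deltas are the part worth keeping: they are the corrections the human made, in a form that can go straight into a lab-notebook entry.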

The pattern that emerged after ~20 sessions:

  1. Run an annotation session on a deliberately-chosen image.
  2. Record what happened — prior, corrections, loops, tool friction — in a structured lab-notebook entry.
  3. Read the prior entries before the next session, so failure modes are visible and don’t silently repeat.
  4. Promote durable patterns out of the lab notebook into either the skill (placement heuristics like “crop immediately before placing small features”) or the MCP server (capability gaps like “need a fraction-coordinate parameter”).
  5. Test the change against new examples. If the pattern is genuinely fixed, it stops appearing in subsequent reviews; if it isn’t, the next review flags it again, more loudly.

The corpus is the controlled variable: each example was chosen for what it tests, and the test stays valid across tool versions. The lab-notebook entries are the dependent variable. The tool changes are the independent variable.

The result, after 20-odd sessions, is that the most-cited recurring failures from the first ten reviews — coordinate arithmetic, dead-reckoning without overlays, blind small-feature placement — no longer appear in recent reviews. The current open failures are mostly language, not mechanics: how to express uncertainty in a region’s identity, how to route specialist vocabulary around verifier blind spots, how to preserve “the user wasn’t sure either” as first-class signal in the project JSON.

Those are the next-generation problems. The mechanical loop works.

