
The perception loop

It is surprisingly hard to tell an agent what to look at in an image — and even harder to know whether the agent is actually looking.

A vision-language model will fluently say “there’s a person near the centre, in conversation” about a photograph that actually shows three people leaning over a book; it will say “Mount Fuji is the small peak right of centre” in The Great Wave while initially pointing at a foam mass 240 px to the left; it will describe a Falcon 9 launch as “pre-dawn marine layer” because the timestamp reads “5 a.m. local,” skipping over the season and latitude.

Each of those errors is invisible if the only output is prose. The model speaks confidently, the human nods, the wrong claim travels downstream.

annomate exists to make those errors visible by forcing the model to commit to its claims at specific pixel coordinates.

A claim like “the milkmaid’s brass pail hangs directly below the wicker basket” costs nothing to assert. A bounding box at [2, 1850, 2400, 380, 410] is either on the pail or it isn’t. The model has to look before it can place the box, and the user can see immediately whether it placed it correctly. The first time we ran example 12 the brass-pail box landed 440 pixels off — the prior was real, the placement was wrong, and the moment the box rendered, both we and the model knew.
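To make "either on the pail or it isn't" concrete, here is a minimal Python sketch. It assumes the five-number array follows the VIA v3 rectangle layout `[shape_id, x, y, width, height]`, with shape id 2 meaning a rectangle. The `Box` class and the helper names are hypothetical illustrations, not part of annomate.

```python
from dataclasses import dataclass
from math import hypot

VIA_RECTANGLE = 2  # assumed VIA v3 shape id for a rectangle


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image pixel coordinates."""
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)

    def contains(self, px, py):
        """True if the pixel (px, py) lies inside or on the edge of the box."""
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


def box_from_via(xy):
    """Decode an assumed VIA v3 rectangle array [2, x, y, w, h] into a Box."""
    shape_id, x, y, w, h = xy
    if shape_id != VIA_RECTANGLE:
        raise ValueError(f"expected rectangle shape id {VIA_RECTANGLE}, got {shape_id}")
    return Box(x, y, w, h)


def center_offset(placed, reference):
    """Euclidean distance in pixels between the centers of two boxes."""
    (ax, ay), (bx, by) = placed.center, reference.center
    return hypot(bx - ax, by - ay)
```

With `placed = box_from_via([2, 1850, 2400, 380, 410])`, calling `placed.contains(px, py)` on a pixel the user knows to be on the pail answers the question directly, and `center_offset(placed, corrected)` puts a number on how far a placement missed once a corrected box exists.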

This is the core thesis: vague language hides perception errors; precise coordinates expose them.

annomate gives Claude eyes on a VIA v3 image annotator running in the user’s browser. The loop has three voices:

| Voice | What it knows | How it speaks |
| --- | --- | --- |
| Claude | Cultural priors, language, the whole-image gestalt | Places boxes, polygons, polylines, circles; writes evidence-based labels |
| Local model ([ai] extra) | Pixel statistics, SAM masks, CLIP embeddings, VLM captions | Suggests candidates, tightens boxes, verifies labels, grades placements |
| User | The image at full resolution in their browser, plus actual knowledge of the subject | Adjudicates, corrects, deletes |

The loop is:

  1. Claude calls via_get_image to look at the image at a working resolution.
  2. Claude places a region with via_add_region using either pixel or fraction coordinates.
  3. Claude re-fetches the image with the region overlaid and looks at where the box actually landed.
  4. Optional: Claude calls via_grade_annotations / via_verify_region to ask a local model whether the box is well-placed.
  5. The user reviews in the browser and corrects anything the loop missed.
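Steps 1 to 3 of the loop can be sketched as a small driver. Only the tool names `via_get_image` and `via_add_region` come from the text above. The method signatures, the `overlay` argument, the returned dictionaries, and the `FakeClient` stand-in are all hypothetical, chosen so the sketch runs without a browser; annomate's real tool parameters may differ.

```python
def fraction_to_pixels(fx, fy, fw, fh, width, height):
    """Convert fraction coordinates (0..1) to integer pixel x, y, w, h."""
    return [round(fx * width), round(fy * height),
            round(fw * width), round(fh * height)]


def place_and_check(client, label, frac_box):
    """Look, commit to a box, then look again at where it landed."""
    # Step 1: fetch the image (here only its size) at a working resolution.
    image = client.via_get_image()
    # Step 2: commit to a rectangle; 2 is the assumed VIA v3 rectangle id.
    x, y, w, h = fraction_to_pixels(*frac_box, image["width"], image["height"])
    region_id = client.via_add_region(label, [2, x, y, w, h])
    # Step 3: re-fetch with the region overlaid and read back what was placed.
    overlaid = client.via_get_image(overlay=True)
    # Steps 4 and 5 (local-model grading, user review) are left out here.
    return region_id, overlaid["regions"][region_id]


class FakeClient:
    """Hypothetical in-memory stand-in for the annotator tools."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.regions = []

    def via_get_image(self, overlay=False):
        info = {"width": self.width, "height": self.height}
        if overlay:
            info["regions"] = list(self.regions)
        return info

    def via_add_region(self, label, xy):
        self.regions.append({"label": label, "xy": xy})
        return len(self.regions) - 1
```

For example, `place_and_check(FakeClient(4000, 3000), "brass pail", (0.5, 0.8, 0.1, 0.1))` places one rectangle and returns it as read back from the overlaid fetch. In a real session the third step returns pixels for Claude to inspect, not a dictionary.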

Steps 1–3 are the part that distinguishes annomate from a naive “ask the VLM what it sees” workflow. The model isn’t only describing; it is committing to a falsifiable claim, then checking that claim against the pixels. When the claim and the pixels disagree, the disagreement is structural — a wrong rectangle is wrong in a way that can be measured.
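One common way to measure how wrong a rectangle is: intersection over union (IoU) against a reference box, such as the one the user corrects it to. The text does not say which metric annomate's grading tools use, so IoU here is a swapped-in illustration, and the function name and the `(x, y, w, h)` tuple layout are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes; 0.0 when disjoint."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlap, clamped at zero when boxes do not meet.
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0 and boxes that do not overlap score 0.0, so a placement that misses its target entirely is flagged without anyone having to argue about the label.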

jscad-mcp is the sibling tool: Claude writes OpenJSCAD geometry, renders it, and looks at what came out. annomate is the inverse: Claude receives an image it did not create, then commits to claims about its content. Together they cover both halves of giving Claude eyes on the physical world.

| Loop | Origin of the artifact | The forcing function | What gets exposed |
| --- | --- | --- | --- |
| jscad-mcp | Claude wrote it | Render and look | Code-vs-intent mismatch |
| annomate | Someone else made it | Commit coordinates and look | Prior-vs-reality mismatch |

The structural insight is the same in both: an artifact that can be inspected at pixel level turns a fluent description into a verifiable claim. In jscad-mcp the artifact is a rendered model; in annomate it is a bounding box overlaid on a photograph. Both tools refuse to let the model get away with sounding right.

The two loops also feed each other. Sessions on annotation make a model more disciplined about the screenshots it takes when building 3D models — and vice versa. Both demand the same thing: don’t reason from priors; look, commit, look again.

The examples in annomate-examples aren’t demos. They are a research bench. Each image was chosen because it tests something specific:

  • Contemporary photographs (jaguar, mangrove, GRACE-FO launch) — where vision models do well, so the interesting failures are subtle: small features, scale, perspective.
  • Historical art (Vermeer, Rembrandt, Hokusai, Van Gogh, Avercamp) — where the model’s cultural priors are strong and have to be checked against the actual pixels, not the canonical reading.
  • Multi-view objects (Howe sewing machine) — where the question is whether the same model identifies the same named part consistently across two photographs of the same artifact.
  • Mark-vocabulary targets (Van Gogh’s Montmajour reed-pen drawing) — where the annotation isn’t “what’s the object” but “what’s the technique”: stippling vs. hatching vs. contour vs. wash.

Every annotation session produces both an artifact (the VIA project JSON, shipped in the example directory) and a private lab-notebook entry recording what happened: prior, what the image actually showed, every correction, every loop. Those notebook entries are the raw material for Lessons from the bench — the recurring failure modes and the tool changes they drove.

The corpus is not a benchmark. There is no leaderboard, no score, no claim that annomate “solves” image grounding. It exists to expose how Claude perceives, where it fails, and what changes to the tool make those failures fewer or smaller. The output is an evolving tool, not a number.

It is also not a replacement for human review. The user’s browser is the ground truth. Every session ends with a STOP for user adjudication before save — the three-voice loop is structurally asymmetric, and the human’s full-resolution view stays authoritative.

