
The 440-pixel pail

The worst single placement in the annomate-examples corpus is a brass milk pail.

It hangs on the back wall of Vermeer’s The Milkmaid — a small dark vessel suspended from a nail, partly in shadow, partly catching the same window light that pours onto the maid’s white cap. There is also a small wicker basket on the same wall, hung from a separate nail, and there is a famous nail-and-cast-shadow at the upper right that Vermeer fans usually want to talk about first.

The session that annotated the painting placed nineteen regions. Eighteen of them ended up within shoving distance of where they should be. The brass pail landed 440 pixels off.

Figure: Vermeer's The Milkmaid, annotated with 19 regions via annomate.

Before the image loaded, Claude wrote down what it expected to find: yellow bodice over blue under-sleeves, ultramarine apron, white linen cap, brown pouring jug into a wide redware bowl, leaded window with a cracked pane, a wicker basket and a brass pail hung on the back wall, foot-warmer with a Cupid Delft tile.

Almost all of that is right. Vermeer is a canonical painter; the Milkmaid is one of the most-described canvases in Western art; there is no shortage of pixel-free language about it in any large language model’s training data. The prior was strong because the painting is famous, not because Claude had looked.

Inside the prior, however, was a piece of spatial language that the painting does not actually support. The standard description — repeated in essays, museum captions, and undergraduate lectures — places the basket and the pail together on the back wall: “a wicker basket hung from a nail, with a brass pail directly below it.”

The pail is not directly below the basket. The pail is below-and-right. The basket sits high on the wall near the corner; the pail hangs lower and offset, closer to the window than to the maid. In words, the difference is rounding error. In coordinates, it is 440 pixels of Euclidean distance.
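To make the arithmetic concrete, here is a minimal sketch of how a sideways-and-downward offset collapses into a single Euclidean distance. The coordinates are hypothetical, chosen only so the numbers come out round; they are not the session's actual boxes. The `Box` type and `center_distance` helper are illustrative names, not part of annomate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image pixels (hypothetical type, not annomate's)."""
    x: float       # left edge
    y: float       # top edge; image y grows downward
    width: float
    height: float

    @property
    def center(self) -> tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the centres of two boxes, in pixels."""
    (ax, ay), (bx, by) = a.center, b.center
    return math.hypot(bx - ax, by - ay)


# Illustrative values only: a box placed "directly below" the basket,
# and a corrected box that sits below-and-right of that placement.
placed = Box(x=1000, y=600, width=120, height=160)
corrected = Box(x=1264, y=952, width=120, height=160)

# Offset is 264 px right and 352 px down, which is 440 px apart.
print(center_distance(placed, corrected))
```

Neither axis alone looks dramatic in this example; the distance only becomes a headline number once both offsets are combined.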

When Claude placed the pail box from the full-image view, the rectangle landed almost exactly under the basket, right where the language predicted. The placement had been loaded from the prior, not from the pixels.

Then Claude re-fetched the image with the regions overlaid.

This is the part of the loop that is easy to skip and easy to underestimate. Before via_get_image_crop had an overlay=true flag, a wrong rectangle was wrong only in numerical space: the model had to mentally project box coordinates back into the returned image and decide whether they made sense. With the overlay, the wrongness becomes pictorial. The pail's rectangle in the overlay was suddenly, visibly hanging in mid-air, several inches of canvas above the actual vessel, and the actual vessel was sitting unannotated to the lower right of where the box had been placed.
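The difference between a numerical and a pictorial error can be shown with a toy rendering. The sketch below does not reproduce via_get_image_crop or its overlay output; `render_overlay` is a hypothetical helper that draws a box outline onto a tiny text "image", with made-up cell coordinates.

```python
def render_overlay(width, height, object_pixels, box):
    """Render a tiny text 'image' with a box outline drawn over it.

    object_pixels: set of (x, y) cells occupied by the real object.
    box: (left, top, right, bottom) in inclusive cell coordinates.
    """
    left, top, right, bottom = box
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            inside = left <= x <= right and top <= y <= bottom
            on_edge = inside and (x in (left, right) or y in (top, bottom))
            if on_edge:
                row.append("+")   # box outline
            elif (x, y) in object_pixels:
                row.append("#")   # the actual object
            else:
                row.append(".")   # background
        rows.append("".join(row))
    return "\n".join(rows)


# Made-up layout: the object sits lower right, the box was placed upper left.
pail = {(x, y) for x in range(8, 11) for y in range(3, 6)}
wrong_box = (2, 0, 4, 2)

print(render_overlay(12, 6, pail, wrong_box))
# ..+++.......
# ..+.+.......
# ..+++.......
# ........###.
# ........###.
# ........###.
```

As a tuple, `(2, 0, 4, 2)` gives no hint that anything is wrong. Drawn over the cells, the empty box and the unboxed object are both obvious at a glance.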

The argument was over the moment the overlay rendered. Both Claude and the user could see the same thing at the same time. No description in language could have produced the same effect — the gap was spatial, and the only honest representation of a spatial gap is a picture with the wrong rectangle in it.

It is tempting to file the brass pail under “a tool worked correctly.” The overlay revealed the mistake; the mistake got fixed; the regions are now correct.

The reason the story is worth telling is that the same prior-shaped misplacement had been happening, quietly, for weeks. Earlier sessions had cropped regions of the image at session start, named priors, and corrected boxes after user review — but the moment of overlay was missing. Claude’s verification step before user review was “does the coordinate look reasonable on the screen-relative axes.” That step cannot catch a placement that is wrong by a consistent prior; the placement will look reasonable, because the prior is what makes things look reasonable.
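A small sketch of why a coordinate-only check misses this class of error. The function names and all numbers here are hypothetical, and the "true" box stands in for what a person or model would see once the overlay is rendered; nothing below is taken from annomate itself.

```python
def looks_reasonable(box, image_size):
    """Coordinate-only sanity check: non-empty box that lies inside the image."""
    left, top, right, bottom = box
    img_w, img_h = image_size
    return 0 <= left < right <= img_w and 0 <= top < bottom <= img_h


def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# Hypothetical values for illustration only.
image_size = (3000, 3400)
prior_box = (1000, 600, 1120, 760)    # where the description says the pail is
true_box = (1264, 952, 1384, 1112)    # where the pixels say it is

print(looks_reasonable(prior_box, image_size))  # True: passes the sanity check
print(looks_reasonable(true_box, image_size))   # True: indistinguishable by this test
print(iou(prior_box, true_box))                 # 0.0: the boxes do not overlap at all
```

The sanity check accepts both boxes equally, so it cannot rank them. Only a comparison against the image itself separates the two, and that comparison is what the overlay makes possible.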

The 440-pixel pail forced the addition of overlay=true. It also forced the rule, now embedded in the skill, that Claude states its prior in writing before fetching the image — so the prior is a falsifiable prediction the user can read after the fact, rather than a silent influence on every placement.

In retrospect, both of those changes were the cheapest possible fixes. Neither required a new model. Neither required more compute. Neither required Claude to “be better” at perception. They required only that the perception loop close once: that the model, after committing a claim to pixel coordinates, look at the claim against the pixels.

There is a structural argument that runs through the perception-loop essay and the broader bench lessons: vague language hides perception errors; precise coordinates expose them. The 440-pixel pail is the moment that argument stops being abstract.

A vague description (“the pail hangs below the basket”) cannot be falsified by another vague description. A bounding box at specific pixels can be falsified by overlaying it on the image. Once the claim is falsifiable, the error is no longer a matter of interpretation; it is visible.

That visibility is the whole point of the tool. It is what annomate exists to do.

