"uses_retry_backoff" — Bucket B (tile gap): both baseline and with-context score 0/4. The skill never mentions the retry backoff pattern the rubric checks for.
"auth_url_capture" — Bucket D (regression): baseline was 3/5, with-context is 1/5. Something in the tile is actively confusing the agent on auth URL handling.

The user says: "Go ahead and fix these. Show me what you're going to change before you make the edits."

Fix both issues, commit the changes, and re-run the evals.