{
  "context": "Negative case — tests whether the agent identifies the developer's situation (stable system prompt content with unique per-call user inputs) as a provider-side caching problem, not a response caching problem, and routes them to the right tile skill. Wrapping the executor in an in-process response cache here would be wrong: responses are not actually repeated, so the cache would never hit and the developer would conclude caching doesn't work.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Names the cache-vs-response distinction",
      "description": "Surfaces that there are two different caching layers — provider-side reduction on stable prompt prefixes vs. in-process avoidance of duplicate API calls — and identifies which one this developer needs. Without this distinction the developer cannot decide between offerings",
      "max_score": 25
    },
    {
      "name": "Recommends the provider-side caching path",
      "description": "Recommends the Anthropic prompt caching surface (cacheControl breakpoints on the system content) — or invokes the appropriate sibling skill. Does not invent a custom cache header injection and does not point at a generic OpenTelemetry feature",
      "max_score": 30
    },
    {
      "name": "Does not wrap the executor in an in-process cache",
      "description": "Does NOT produce a CachedPromptExecutor / FilePromptCache / RedisPromptCache wrapper around the developer's executor. The user inputs are unique per call so the cache key would never hit — implementing this would mislead the developer into thinking caching is broken",
      "max_score": 25
    },
    {
      "name": "Acknowledges the framing without capitulating",
      "description": "Engages with the developer's phrasing (\"cache the LLM responses\") and explains that what they actually want bills the API at a reduced rate rather than skipping the API. Does not just refuse without naming the mismatch",
      "max_score": 10
    },
    {
      "name": "Mentions Anthropic's minimum-token requirement",
      "description": "Notes that Anthropic enforces a minimum token threshold for caching to take effect — the developer's multi-thousand-word prompt likely qualifies, but breakpoints on shorter content are silently ignored. They need this so a no-savings outcome isn't a surprise",
      "max_score": 10
    }
  ]
}

evals

scenario-1

scenario-2

scenario-3

scenario-4

scenario-5

scenario-6

scenario-7

scenario-8

scenario-9

scenario-10

scenario-11

scenario-12

scenario-13

scenario-14

scenario-15

scenario-16

scenario-17

scenario-18

scenario-19

scenario-20

scenario-21

scenario-22

scenario-23

scenario-24

scenario-25

scenario-26

scenario-27

scenario-28

scenario-29

scenario-30

scenario-31

scenario-32

criteria.json

task.md

scenario-33

scenario-34

scenario-35

scenario-36

scenario-37

scenario-38

scenario-39

scenario-40

scenario-41

scenario-42

scenario-43

rules

README.md

tile.json

jbaruch/koog

criteria.json.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}evals/scenario-32/

criteria.jsonevals/scenario-32/