{
  "context": "Tests whether the agent integrates the full pattern — tools sliced by access, typed @LLMDescription handoff classes between phases, subgraphWithTask per phase with per-phase model selection, subgraphWithVerification + CriticResult for the verify/adjust loop. The developer named all the requirements; the question is whether the agent reaches for the integrated pattern rather than picking one piece and missing the others.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Tools sliced into communication / read / write ToolSets",
      "description": "Produces three distinct ToolSet classes — one for communication (asking the customer), one for reading account data, one for mutating state — not a single monolithic ToolSet and not a feature-axis split. The grouping is by access pattern as the developer's phase constraints demand",
      "max_score": 17
    },
    {
      "name": "Each phase grants only the matching tool subset",
      "description": "The understanding phase's subgraphWithTask receives communication + read tools (no write); the action phase receives read + write (no communication); the verification phase receives communication + read (no write); the adjustment phase receives read + write (the task says it re-runs the action — communication-only adjustment cannot apply the corrected fix). NOT every phase getting every tool",
      "max_score": 13
    },
    {
      "name": "Typed @LLMDescription'd data classes for inter-phase handoffs",
      "description": "Defines @Serializable data classes (e.g., IssueSummary, Resolution, etc.) with @LLMDescription on the class and on every property. The phases consume and produce these typed shapes — not String handoffs and not Map<String, Any>. The contract is type-checked between phases",
      "max_score": 25
    },
    {
      "name": "Uses subgraphWithTask<In, Out> for each phase",
      "description": "Each phase is wired with subgraphWithTask<InputType, OutputType> (or subgraphWithVerification for the verifier). The Input/Output type parameters match the handoff classes from the previous criterion. Does NOT inline each phase as a plain nodeLLMRequest in a single strategy",
      "max_score": 15
    },
    {
      "name": "Verification + adjust loop with CriticResult branching",
      "description": "Verification is a subgraphWithVerification<T> whose CriticResult<T> drives two edges: success goes to nodeFinish (transformed { it.input }); failure goes to the adjust phase, coercing the nullable critic feedback to a non-null String via transformed { it.feedback.orEmpty() } and into a subgraphWithTask<String, _> adjust subgraph. The adjust phase has a back-edge to verification. NOT a fire-and-forget linear chain",
      "max_score": 15
    },
    {
      "name": "Mixed model selection per phase",
      "description": "Picks distinct models per phase honoring the developer's stated preferences — a cheap model for understanding, a mid-tier model for action, a reasoning-tier model for verification. NOT the same model on all four phases",
      "max_score": 10
    },
    {
      "name": "Declares the required Gradle dependencies",
      "description": "Names the Gradle dependency lines needed for the pipeline — at minimum the umbrella `ai.koog:koog-agents:1.0.0` (which transitively provides `subgraphWithTask` / `subgraphWithVerification` / `CriticResult` via `agents-core` — they do NOT require the standalone `agents-ext` beta artifact) plus whatever provider modules (`prompt-executor-openai-client`, `prompt-executor-anthropic-client`) the chosen models require. Penalize answers that add an unnecessary `ai.koog:agents-ext` line — those APIs ship with `agents-core`",
      "max_score": 5
    }
  ]
}
