{
  "context": "Tests whether the agent selects ACP (not A2A, which is agent-to-agent RPC, or MCP, which is a tool host) for a tooling client that drives an agent's run lifecycle with cancellation and progress streaming, and then wires the ACP feature correctly. A baseline with no Koog 1.0 knowledge does not know that Koog exposes an ACP feature at all; it tends to reach for a generic HTTP/RPC layer or to conflate ACP with MCP.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Chooses ACP for the tooling-client case",
      "description": "Selects ACP for a tooling client that needs lifecycle control. Does NOT pick A2A (agent-to-agent RPC, where the remote is another agent doing its own planning) or MCP (a tool-host protocol that exposes tools, not agents). Failure: wires A2A, MCP, or a hand-rolled HTTP endpoint instead",
      "max_score": 25
    },
    {
      "name": "Adds the ACP dependency",
      "description": "Adds ai.koog:agents-features-acp at 1.0.0 or later to build.gradle.kts (exact group and artifact). Failure: wrong artifact name, missing dependency, or an a2a-*/mcp module",
      "max_score": 15
    },
    {
      "name": "Correct ACP import",
      "description": "Imports ai.koog.agents.features.acp.ACP. Failure: invented package path or missing import",
      "max_score": 10
    },
    {
      "name": "install(ACP) in the AIAgent trailing lambda",
      "description": "Calls install(ACP) { ... } inside the AIAgent(...) trailing lambda — where features are installed during construction. Failure: installs at the HTTP layer, as standalone middleware, or mutates the agent after construction",
      "max_score": 20
    },
    {
      "name": "Configures the endpoint inside install(ACP)",
      "description": "Sets the endpoint property inside the install(ACP) block. Failure: no endpoint configured, or configured outside the install block",
      "max_score": 10
    },
    {
      "name": "Notes progress events surface to the client",
      "description": "Notes that ACP surfaces the agent's outgoing events (LLM round-trips, tool calls, planner steps) as progress notifications the client can subscribe to. Failure: no mention of how the dashboard observes progress",
      "max_score": 10
    },
    {
      "name": "Rationale for ACP over A2A",
      "description": "Explains in notes.md that A2A is RPC-shaped (the caller invokes the remote agent and gets back a single result), whereas ACP is bidirectional and gives the tooling client cancellation plus progress streaming. Failure: no rationale, or a rationale that confuses the two protocols' roles",
      "max_score": 10
    }
  ]
}