
jbaruch/koog

Koog 1.0 idioms, gotchas, and scaffolding skills for Kotlin agents on the JVM

Quality: 88% (Does it follow best practices?)

Impact: 88%, 1.95x (average score across 43 eval scenarios)

Security (by Snyk): Passed, no known issues


evals/scenario-42/criteria.json

{
  "context": "Tests whether the agent integrates the full pattern — tools sliced by access, typed @LLMDescription handoff classes between phases, subgraphWithTask per phase with per-phase model selection, subgraphWithVerification + CriticResult for the verify/adjust loop. The developer named all the requirements; the question is whether the agent reaches for the integrated pattern rather than picking one piece and missing the others.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Tools sliced into communication / read / write ToolSets",
      "description": "Produces three distinct ToolSet classes — one for communication (asking the customer), one for reading account data, one for mutating state — not a single monolithic ToolSet and not a feature-axis split. The grouping is by access pattern as the developer's phase constraints demand",
      "max_score": 17
    },
    {
      "name": "Each phase grants only the matching tool subset",
      "description": "The understanding phase's subgraphWithTask receives communication + read tools (no write); the action phase receives read + write (no communication); the verification phase receives communication + read (no write); the adjustment phase receives read + write (the task says it re-runs the action — communication-only adjustment cannot apply the corrected fix). NOT every phase getting every tool",
      "max_score": 13
    },
    {
      "name": "Typed @LLMDescription'd data classes for inter-phase handoffs",
      "description": "Defines @Serializable data classes (e.g., IssueSummary, Resolution, etc.) with @LLMDescription on the class and on every property. The phases consume and produce these typed shapes — not String handoffs and not Map<String, Any>. The contract is type-checked between phases",
      "max_score": 25
    },
    {
      "name": "Uses subgraphWithTask<In, Out> for each phase",
      "description": "Each phase is wired with subgraphWithTask<InputType, OutputType> (or subgraphWithVerification for the verifier). The Input/Output type parameters match the handoff classes from the previous criterion. Does NOT inline each phase as a plain nodeLLMRequest in a single strategy",
      "max_score": 15
    },
    {
      "name": "Verification + adjust loop with CriticResult branching",
      "description": "Verification is a subgraphWithVerification<T> whose CriticResult<T> drives two edges: success goes to nodeFinish (transformed { it.input }); failure goes to the adjust phase, coercing the nullable critic feedback to a non-null String via transformed { it.feedback.orEmpty() } and into a subgraphWithTask<String, _> adjust subgraph. The adjust phase has a back-edge to verification. NOT a fire-and-forget linear chain",
      "max_score": 15
    },
    {
      "name": "Mixed model selection per phase",
      "description": "Picks distinct models per phase honoring the developer's stated preferences — a cheap model for understanding, a mid-tier model for action, a reasoning-tier model for verification. NOT the same model on all four phases",
      "max_score": 10
    },
    {
      "name": "Declares the required Gradle dependencies",
      "description": "Names the Gradle dependency lines needed for the pipeline — at minimum the umbrella `ai.koog:koog-agents:1.0.0` (which transitively provides `subgraphWithTask` / `subgraphWithVerification` / `CriticResult` via `agents-core` — they do NOT require the standalone `agents-ext` beta artifact) plus whatever provider modules (`prompt-executor-openai-client`, `prompt-executor-anthropic-client`) the chosen models require. Penalize answers that add an unnecessary `ai.koog:agents-ext` line — those APIs ship with `agents-core`",
      "max_score": 5
    }
  ]
}
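
The per-phase tool grants described in the second checklist item can be sketched as a small grant matrix. This is a plain-Kotlin illustration only: `ToolSlice`, `Phase`, and `phasesWithEveryTool` are hypothetical names invented for this sketch, and nothing here calls the Koog API. In a real pipeline the slices would be ToolSet classes passed to each phase's subgraph.

```kotlin
// Hypothetical model of the rubric's tool-grant matrix.
// None of these names come from Koog; they only illustrate the access split.
enum class ToolSlice { COMMUNICATION, READ, WRITE }

// Each phase is granted only the slices the rubric allows.
enum class Phase(val grants: Set<ToolSlice>) {
    UNDERSTANDING(setOf(ToolSlice.COMMUNICATION, ToolSlice.READ)), // no write
    ACTION(setOf(ToolSlice.READ, ToolSlice.WRITE)),                // no communication
    VERIFICATION(setOf(ToolSlice.COMMUNICATION, ToolSlice.READ)),  // no write
    ADJUSTMENT(setOf(ToolSlice.READ, ToolSlice.WRITE)),            // re-runs the action
}

// Returns the phases that were (incorrectly) granted every slice.
fun phasesWithEveryTool(): List<Phase> =
    Phase.values().filter { it.grants.containsAll(ToolSlice.values().toList()) }

fun main() {
    for (phase in Phase.values()) {
        println("${phase.name}: ${phase.grants.joinToString()}")
    }
    // The rubric penalizes any phase that receives all three slices.
    check(phasesWithEveryTool().isEmpty()) { "a phase was granted every tool slice" }
}
```

Writing the matrix down this way makes the "NOT every phase getting every tool" condition a checkable property rather than a convention.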
