Evaluating Kimi 2.5 vs Kimi 2.6: What happens to agent skills when the model gets smarter?

21 Apr 2026 · 7 minute read

Rob Willoughby

When a stronger model ships, there are two questions every skill author should want answered, and evals are the only honest way to answer either:

  1. Which skills just got absorbed? A model that now knows how to do X natively does not need a skill telling it to do X. Fewer skills to maintain, leaner context, lower cost.
  2. Which skills still matter? Behaviour-level guidance (conventions, preferences, project-specific workflows) is not something pretraining will fill in for you. Those skills should keep paying.

Moonshot gave us early access to Kimi K2.6. We ran the Tessl agent skill evaluation harness on the same 21 skills and 100 paired scenarios against three solvers: Kimi K2.5, Kimi K2.6, and Claude Sonnet 4.5.

A solver is the model whose output the grader scores; a paired scenario is the same task run twice per solver, once without the skill installed and once with it. These are early signals from one pre-release on one skill set. A deeper cross-model analysis with clean baselines across the board is in progress and will be its own piece.

What does our setup look like?

Scenarios and rubrics are held fixed across the two Moonshot runs. The only variable is the solver.

  • Solver A: Kimi K2.5
  • Solver B: Kimi K2.6
  • Scenario generator: Claude Sonnet 4.5, up to 5 scenarios per skill, derived from each skill's SKILL.md
  • Grader: Claude Sonnet 4.5, weighted-checklist rubric derived from the same SKILL.md
  • Per skill × per solver: every scenario solved twice, baseline (no skill installed) and with-skill

Per-skill n=5 is noisy; the aggregate over 100 scenarios is where the signal lives.
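To make the setup concrete, here is a minimal sketch of the two numbers the harness produces per skill × solver: a weighted-checklist grade per scenario, and the aggregate uplift from the paired baseline/with-skill runs. The function names, rubric weights, and data shapes are illustrative, not the actual Tessl harness API.

```python
from statistics import mean

def rubric_score(checks):
    """Weighted-checklist grade: percentage of total weight earned.

    checks: list of (weight, passed) pairs, with weights and checklist
    items derived from the skill's SKILL.md.
    """
    total = sum(weight for weight, _ in checks)
    earned = sum(weight for weight, passed in checks if passed)
    return 100.0 * earned / total

def aggregate_uplift(pairs):
    """Mean with-skill grade minus mean baseline grade, in percentage points.

    pairs: per-scenario dicts, each scenario solved twice by the same
    solver: once without the skill installed ('baseline'), once with it.
    """
    baseline = mean(p["baseline"] for p in pairs)
    with_skill = mean(p["with_skill"] for p in pairs)
    return with_skill - baseline

# Illustrative numbers only:
score = rubric_score([(3, True), (2, True), (1, False)])
uplift = aggregate_uplift([
    {"baseline": 73.2, "with_skill": 90.2},
    {"baseline": 75.0, "with_skill": 92.2},
])
```

The per-skill uplift is just this delta computed over that skill's (up to 5) paired scenarios; the aggregate is the same delta over all 100.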

Three findings:

  1. Kimi K2.6 is a better model than K2.5. Without skills, K2.6 sits ~2 pp (percentage points) above K2.5 in aggregate, with double-digit moves on specific skills.
  2. Kimi K2.6 holds its own against Sonnet 4.5. We picked Sonnet 4.5 as a competitive baseline, and on this evaluation set K2.6 scored higher both without skills (+11.8 pp) and with them (+7.7 pp).
  3. Skills remain a durable lever as models improve. The uplift skills buy stays roughly constant as Kimi improves (+17.05 pp on K2.5, +17.20 pp on K2.6).

1. Kimi K2.6’s baseline performance is superior

Solver      Baseline (no skill)   With skill   Uplift
Kimi K2.5   73.2%                 90.2%        +17.05 pp
Kimi K2.6   75.0%                 92.2%        +17.20 pp

Kimi K2.6 is a better model than K2.5 on this skill set. Two findings to back this up:

  • Four skills are now redundant on K2.6. In the 21-skill set, 4 skills have K2.6 baselines ≥ 95%, up from 2 under K2.5. agent-gossip-coordinator is the clearest example: K2.5 needed the skill (+8.0 pp uplift), K2.6 already solves it at 96.4%, and the skill now hurts by 4.8 pp. These skills are no longer earning their context budget now that the stronger model covers the behaviour natively.
  • Both K2.5 regressions cleaned up. Two skills that made K2.5 worse (3d-molecule-ray-tracer: −7.0 pp; agent-base-template-generator: −2.6 pp) both resolve on K2.6. The skills were not wrong; the weaker model was just interpreting them awkwardly.
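The triage implied by these two findings can be sketched as a small pass over per-skill aggregates. The function name, dict shape, and 95% threshold mirror the description above but are illustrative; the with-skill figure for agent-gossip-coordinator is derived from the article's baseline (96.4%) and stated −4.8 pp uplift.

```python
def triage_skills(results, absorbed_threshold=95.0):
    """Sort skills into absorbed / harmful / still-earning buckets.

    results: {skill_name: {"baseline": pct, "with_skill": pct}}
    A skill whose baseline already meets the threshold has likely been
    absorbed by the model; a negative uplift means the skill now hurts.
    """
    buckets = {"absorbed": [], "harmful": [], "earning": []}
    for name, r in results.items():
        uplift = r["with_skill"] - r["baseline"]
        if r["baseline"] >= absorbed_threshold:
            buckets["absorbed"].append(name)
        elif uplift < 0:
            buckets["harmful"].append(name)
        else:
            buckets["earning"].append(name)
    return buckets

# Figures from the article's K2.6 run:
example = triage_skills({
    "agent-gossip-coordinator": {"baseline": 96.4, "with_skill": 91.6},
    "agent-agent": {"baseline": 33.9, "with_skill": 88.8},
})
```

Skills in the "absorbed" bucket are candidates for retirement; "earning" skills keep paying their context budget on the new model.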

2. Kimi K2.6 holds its own against Sonnet 4.5

Putting K2.6 next to Sonnet 4.5 on the same 21 skills and same rubric, the early picture is this:

Solver       Baseline (no skill)   With skill   Uplift
Kimi K2.6    75.0%                 92.2%        +17.20 pp
Sonnet 4.5   63.2%                 84.5%        +21.3 pp

On these early signals, Kimi K2.6 appears competitive with Sonnet 4.5 for the task categories these skills cover. A deeper cross-model study with clean baselines across all three solvers is in progress, but this is an early sign that Kimi K2.6 is comparable to some of the world's leading providers.

3. Skills remain a durable lever as models improve

With vs without the skill installed, on Kimi:

  • K2.5: +17.05 pp.
  • K2.6: +17.20 pp.

The uplift the skill buys does not shrink as the solver gets stronger. The baseline moves, the with-skill score moves with it, and the delta the skill contributes stays in the same range.

Two illustrative cases, both Kimi versions, same rubric:

  • agent-agent. K2.5: 17.7% → 79.9%. K2.6: 33.9% → 88.8%. The baseline closed 16 pp of the gap. The skill still buys roughly 55 pp on top.
  • agent-development. K2.5: 41.2% → 100.0%. K2.6: 55.0% → 100.0%. The baseline closed 14 pp of the gap. The skill covers the rest.
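The arithmetic behind "baseline closed X pp, skill still buys Y pp" in the agent-agent case, spelled out with the figures above:

```python
# agent-agent, figures from the article (baseline / with-skill grades)
k25_baseline, k25_with_skill = 17.7, 79.9
k26_baseline, k26_with_skill = 33.9, 88.8

# How much of the task the stronger model absorbed on its own:
gap_closed = k26_baseline - k25_baseline          # ~16 pp

# What the skill still contributes on the newer model:
remaining_uplift = k26_with_skill - k26_baseline  # ~55 pp
```

The same two subtractions, run per skill, are what separate "absorbed" skills (remaining uplift near zero or negative) from durable ones like this.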

One nuance worth flagging here and reserving for a dedicated follow-up: not every uplift is equal. An initial pass comparing the same skills on Sonnet 4.5 suggests that skills prescribing ecosystem-specific tool calls or conventions lose the most in the cross-family handoff, while skills graded against real, verifiable behaviour (actual CLI flags, actual API shapes) transfer more readily. We view this as the most actionable signal for skill authors, but a broader sample and matched baselines across models are needed before we publish a complete analysis.

What this means for skill authors

  • Kimi K2.6 is a stronger solver than K2.5 on the task categories in this skill set, and competitive with Sonnet 4.5.
  • Rerun your evals when the model changes. Baselines move unevenly; some skills become redundant, some keep paying. You cannot tell which is which without running the evaluation.
  • If you want to run this kind of comparison on your own skills, the harness used here is the Tessl skill evaluation framework. Same structured scenarios, same weighted-checklist grading, pointed at whichever solver and skill set you give it. You can also spin up your agent and ask it to evaluate your skill with Tessl (and you can pick Kimi as your model).

Closing

Kimi K2.6 is a better model than K2.5 on this skill set: a +1.9 pp baseline gain, four skills now solved without any skill installed, and both K2.5 regressions cleaned up.

Skills still matter as models get better: the +17 pp uplift we saw on K2.5 held on K2.6, and uplift in a similar range appears on Sonnet. All of this comes from a single pre-release evaluation on 21 skills, and on these early signals Kimi K2.6 appears competitive with Sonnet 4.5. A deeper study across more models, with clean baselines and a balanced skill sample, is in progress and will be published separately.

Thanks to Moonshot for early access to K2.6! Head over to Tessl to evaluate and optimize your skills.