Blog

Stop guessing whether your Skill works: skill-optimizer measures and improves it

Skill-optimizer evaluates and enhances AI skills by running them through a judge-scored eval pipeline, providing measurable improvements and insights into skill performance.

Simon Maple · 14 min read · 30 Apr 2026

Explore all

AI Native DevCon Day 2: From Agent Demos to Operating Models

AI Native DevCon Day 2 explored operating models for AI-native delivery, focusing on context pipelines, agent behavior metrics, and organizational ownership.

Rohan Sharma · 12 min read · 3 Jun 2026

The model's solved, now comes the hard part: Reviewability as the bottleneck

AI engineering shifts focus from model development to ensuring system reviewability, emphasizing manageable task sizes for reliable and governable outputs.

Paul Sawers · 9 min read · 2 Jun 2026

AI Native DevCon Day 1: Making AI Agents Ready for Enterprise

AI Native DevCon Day 1 focused on making AI agents enterprise-ready, emphasizing reliability, skills as code, and adapting platforms for agent integration.

Rohan Sharma · 12 min read · 2 Jun 2026

AI Coding Agent Accuracy: Opus 4.7 vs 4.8

Opus 4.8 matches Opus 4.7 in accuracy but is more efficient, solving tasks in fewer turns and at lower cost, which highlights differences that headline metrics miss.

Rob Willoughby · 9 min read · 29 May 2026

Opus 4.8 tops the LLM leaderboard with 95% on skill evals

Opus 4.8 leads the LLM leaderboard with a 95% skill evaluation score, surpassing Opus 4.7 and Composer 2.5 Fast, despite being the slowest model tested.

Simon Maple · 8 min read · 29 May 2026

Why We're Changing Our Default Eval Model

The default eval model is changing from Claude Sonnet 4.6 to GLM 5.1 to reduce costs without losing signal quality, focusing on skill evaluation over model specificity.

Rob Willoughby · 9 min read · 29 May 2026

We ran Composer 2.5 and 2.5 Fast across 11 skills. Surprisingly, Fast won.

Composer 2.5 Fast outperformed Composer 2.5 across 11 skills, scoring higher and running 32% quicker at the same cost, which challenges the usual speed-quality trade-off.

Simon Maple · 6 min read · 28 May 2026

Don't Make Your Agent Guess

The article discusses the importance of tool design in agent systems, emphasizing that prompts alone are insufficient for ensuring agent safety and reliability.

Matthias Lübken · 11 min read · 27 May 2026

The Reinvention Of The Dev Team

Explore how AI is reshaping dev teams, challenging traditional roles, and introducing new dynamics in software development, focusing on speed, safety, and value.

Hannah Foxwell · 11 min read · 26 May 2026

Securing the Coder, Not the Code: Notes on Agentic Development and Security

Agentic development shifts security focus from code to coder, requiring new tools and metrics as AI agents rapidly create and modify software.

Guy Podjarny · 16 min read · 21 May 2026

AI Native DevCon’26: The London conference for developers building with AI

AI Native DevCon'26 in London focuses on challenges of deploying AI agents in production, featuring four tracks on engineering, orchestration, enablement, and governance.

Rohan Sharma · 7 min read · 20 May 2026

OpenAI is shutting down self-serve fine-tuning – what this signals for enterprise AI

OpenAI is phasing out self-serve fine-tuning, saying that more capable models have made it less necessary; the move signals a shift in enterprise AI toward infrastructure challenges.

Paul Sawers · 7 min read · 20 May 2026
