Articles

Article
AI Coding Agent Accuracy: Opus 4.7 vs 4.8
Opus 4.8 matches Opus 4.7 on accuracy but is more efficient, solving tasks in fewer turns and at lower cost. The comparison shows that models can differ in ways headline metrics do not capture.

Article
Why We're Changing Our Default Eval Model
We are changing the default eval model from Claude Sonnet 4.6 to GLM 5.1 to reduce costs without losing signal quality. The change keeps the focus on skill evaluation rather than model specificity.

Article
Evaluating Kimi 2.5 vs Kimi 2.6: What happens to agent skills when the model gets smarter?
Early signals from benchmarking Kimi K2.5, Kimi K2.6, and Sonnet 4.5 on 21 agent skills: Kimi K2.6 is a better model than K2.5, and skills still matter as models improve.

Article
A Proposed Evaluation Framework for Coding Agents: Tiles Enhance Proper Use of Public APIs by ~35%
This article proposes an evaluation framework that shows how specifications help coding agents use public APIs effectively, improving code quality and efficiency by approximately 35% as software interfaces evolve.
