| Version: 5.0.0 | Agent success vs baseline (average score across 8 eval scenarios): 0.57x | Does it follow best practices? — | |
| Version: 2.10.0 | Agent success vs baseline (average score across 10 eval scenarios): 0.97x | Does it follow best practices? — | |