| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| Agent success vs baseline; average score across 10 eval scenarios; version 3.13.0 | 1.05x | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| No change in agent success vs baseline; average score across 10 eval scenarios; version 4.4.0 | 1.00x | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |
| | Pending | — | |