Eval Run: <unlabelled>

Eval details

Run

Status: Completed

Agent: Claude

Model: Claude Sonnet 4.6

Injected context

Type

Plugin directory

Path

Plugin

cli

Skills

All skills

Eval Results

Score: 81% (agent success rate when using this plugin)

Improvement: 1.8x (agent success rate improvement when using this plugin compared to baseline)

Baseline: 45% (agent success rate without this plugin)
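
The 1.8x improvement figure appears consistent with dividing the with-plugin success rate (81%) by the baseline rate (45%). The page does not state the formula, so the ratio below is an assumption, and `improvement_factor` is a hypothetical helper name used only for illustration:

```python
def improvement_factor(with_plugin: float, baseline: float) -> float:
    """Return the with-plugin success rate divided by the baseline rate.

    Assumption: the dashboard's "Improvement" value is this simple ratio.
    """
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return with_plugin / baseline


# Success rates reported on this page, in percent.
factor = improvement_factor(81, 45)
print(f"{factor:.1f}x")  # prints "1.8x"
```

If the dashboard rounds the underlying rates before display, the ratio of the displayed values may differ slightly from the ratio of the raw values.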

82%

Automated Meeting Notes Updater

Plain text vs rich text appending

Criteria

Baseline

With context

+write for plain text

100%

batchUpdate for table

100%

Correct +write flags

100%

batchUpdate uses --json

100%

Reason for split documented

100%

Confirmation before write

No hardcoded credentials

100%

70%

Document Content Retrieval Script

Schema-first API inspection

Criteria

Baseline

With context

Schema inspection step

15%

100%

Help command reference

100%

Correct CLI resource syntax

100%

Params flag used

100%

Schema-driven flag construction

13%

100%

No secret output

100%

82%

67%

Document Inventory Export Tool

Output formatting and pagination

Criteria

Baseline

With context

--format flag used

100%

--page-all flag present

100%

--output flag for file saving

FORMAT argument controls --format

100%

Pagination flags documented

100%

Output filename matches format

100%

61%

Legal Document Processing Pipeline

Service account auth and PII screening

Criteria

Baseline

With context

GOOGLE_APPLICATION_CREDENTIALS env var

72%

22%

No credential values output

90%

100%

--sanitize flag on get

Credential path as variable

100%

Setup guide covers no-secret rule

100%

Sanitize explained in setup guide

16%

100%

80%

40%

Policy Document Revision Rollout

Batch update safety with dry-run

Criteria

Baseline

With context

Dry-run preview pass

22%

100%

Dry-run is separate from live run

100%

User confirmation before live run

Correct batchUpdate syntax

100%

Loops over all IDs

100%

Runbook dry-run explanation

27%

100%

No secrets exposed

100%
