phoenix-cli

Debug LLM applications using the Phoenix CLI. Fetch traces, analyze errors, structure trace review with open coding and axial coding, inspect datasets, review experiments, query annotation configs, and use the GraphQL API. Use whenever the user is analyzing traces or spans, investigating LLM/agent failures, deciding what to do after instrumenting an app, building failure taxonomies, choosing what evals to write, or asking "what's going wrong", "what kinds of mistakes", or "where do I focus" — even without naming a technique.

68

Quality

82%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Advisory

Suggest reviewing before use

Quality

Content

64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a comprehensive CLI reference with excellent actionability — nearly every command is copy-paste ready with realistic jq pipelines and filtering patterns. The main weaknesses are repetitive patterns across resource types (traces/spans/sessions share nearly identical annotation workflows that could be factored out) and the core analytical workflow (open-coding → axial-coding) being deferred to reference files that aren't available in the bundle. The skill serves well as a command reference but the higher-level debugging workflow could be more explicit inline.

Suggestions

Factor out the repeated annotation/add-note/delete patterns (identical across trace, span, session) into a shared section to reduce token count by ~30%

Add a brief inline workflow with explicit validation checkpoints for the open-coding → axial-coding flow, rather than fully deferring to missing reference files

Include the referenced bundle files (references/open-coding.md, references/axial-coding.md) or provide a minimal inline summary of each stage's key steps

Dimension | Reasoning | Score

Conciseness

The skill is largely efficient with concrete commands and JSON shapes, but there's significant repetition across traces/spans/sessions sections (nearly identical annotation, add-note, and delete patterns repeated three times). The JSON shape sections for Trace and Span also overlap heavily. Some trimming via a shared pattern section would save tokens.

2 / 3

Actionability

Excellent actionability — nearly every section provides copy-paste-ready CLI commands with jq pipelines, concrete flag combinations, and real filtering examples. The JSON shape documentation gives exact field names and types. Commands cover the full lifecycle from listing to annotating to deleting.

3 / 3

Workflow Clarity

The high-level workflow ('open-coding → axial-coding → build evals') is mentioned but the actual multi-step process is deferred to reference files that aren't provided in the bundle. The coding annotation identifier lifecycle (pick → use → revert) is described but the validation/verification steps are only hinted at. The revert process mentions 'three identifier-bound DELETEs only after explicit user confirmation' but doesn't spell out the sequence with checkpoints.

2 / 3

Progressive Disclosure

The skill references open-coding.md and axial-coding.md appropriately, and the reference table is well-organized. However, no bundle files were provided, so the referenced files don't actually exist in the evaluated context. Additionally, the main SKILL.md is quite long (300+ lines), with extensive inline command references that could be split into separate reference files for different resource types (traces, spans, sessions, datasets).

2 / 3

Total: 9 / 12

Passed
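
The Workflow Clarity row notes that the revert step ("three identifier-bound DELETEs only after explicit user confirmation") is not spelled out as a sequence with checkpoints. One way such a sequence might look is sketched below. This is only an illustration under stated assumptions: the function `revert_with_checkpoints`, the `delete` and `confirm` callables, and the example identifiers are all hypothetical, and nothing here is a Phoenix CLI command or Phoenix API call.

```python
from typing import Callable, Iterable, List


def revert_with_checkpoints(
    identifiers: Iterable[str],
    delete: Callable[[str], bool],
    confirm: Callable[[str], bool],
) -> List[str]:
    """Delete identifier-bound items only after explicit confirmation.

    `delete` and `confirm` are hypothetical callables supplied by the caller;
    they stand in for whatever the real skill uses to remove annotations and
    to ask the user. Returns the identifiers that were actually deleted.
    """
    targets = list(identifiers)

    # Checkpoint 1: show exactly what will be removed and ask once, up front.
    if not confirm(f"Delete annotations for: {', '.join(targets)}?"):
        return []

    deleted: List[str] = []
    for identifier in targets:
        # Checkpoint 2: stop at the first failed delete rather than continuing.
        if not delete(identifier):
            break
        deleted.append(identifier)
    return deleted


if __name__ == "__main__":
    # Stand-in callables and placeholder identifiers, for demonstration only.
    removed = revert_with_checkpoints(
        ["trace-annotation", "span-annotation", "session-annotation"],
        delete=lambda identifier: True,
        confirm=lambda prompt: True,
    )
    print(removed)
```

Passing the confirmation and delete steps in as callables keeps the checkpoint logic separate from any particular CLI invocation, so the same sequence could be described once and reused for each resource type.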

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that covers all dimensions well. It provides specific concrete actions, includes a comprehensive 'Use when' clause with both technical triggers and natural language phrases users would say, and occupies a clearly distinct niche around Phoenix CLI-based LLM debugging. The description is thorough without being padded, and uses proper third-person voice throughout.

Dimension | Reasoning | Score

Specificity

Lists multiple specific concrete actions: fetch traces, analyze errors, structure trace review with open coding and axial coding, inspect datasets, review experiments, query annotation configs, and use the GraphQL API. Very detailed and actionable.

3 / 3

Completeness

Clearly answers both 'what' (debug LLM applications using Phoenix CLI with specific actions listed) and 'when' (explicit 'Use whenever...' clause covering multiple trigger scenarios including natural user phrases like 'what's going wrong').

3 / 3

Trigger Term Quality

Excellent coverage of natural terms users would say: 'traces', 'spans', 'LLM failures', 'agent failures', 'what's going wrong', 'what kinds of mistakes', 'where do I focus', 'evals', 'failure taxonomies'. Includes both technical terms and natural language phrases users would actually type.

3 / 3

Distinctiveness / Conflict Risk

Highly distinctive with a clear niche: Phoenix CLI for LLM application debugging with specific techniques like open coding and axial coding. The combination of Phoenix, traces/spans, and LLM debugging creates a unique fingerprint unlikely to conflict with other skills.

3 / 3

Total: 12 / 12

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository
Arize-ai/phoenix
Reviewed
