
golikovichev/phoenix2pytest

Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite from real production failures, when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug), when bridging Phoenix-labeled traces into pytest-based suites for CI, when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.

88

1.63x
Quality: 94% (Does it follow best practices?)

Impact: 98%, 1.63x (average score across 2 eval scenarios)

Security by Snyk: Advisory. Suggest reviewing before use.


Quality

Content

85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured skill with strong actionability and workflow clarity. The quick start provides a clear, validated sequence from installation through test generation and CI integration, with a concrete real-world example. Minor verbosity in some descriptions prevents a perfect conciseness score, but overall the content is efficient and well-organized with appropriate progressive disclosure to REFERENCE.md.

Dimension (score): reasoning

Conciseness (2 / 3): Generally efficient, but includes some unnecessary context, such as explaining what the demo dataset contains ('51 traces across 6 failure modes'), and some verbose descriptions. The common errors section is useful but could be tighter. The example section earns its place, but the surrounding prose has minor padding.

Actionability (3 / 3): Provides fully executable commands at every step: pip install, env vars, ingestion script, pytest validation, uvicorn launch. The example shows concrete input/output/label and the resulting pytest code with real assertions. Copy-paste ready throughout.

Workflow Clarity (3 / 3): Clear 7-step sequence with explicit validation checkpoints: step 4 sanity-checks the dataset before generation, and step 6 explicitly validates the generated test by confirming it fails against bad output and passes after a fix. This is a proper feedback loop for a generative pipeline.

Progressive Disclosure (3 / 3): Clean overview in SKILL.md with detailed content explicitly delegated to REFERENCE.md (pipeline architecture, schema reference, ingestion flags, web UI internals, Cloud Run deploy). References are one level deep and clearly signaled at both the top and bottom of the file.

Total: 11 / 12

Passed
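
The feedback loop praised under Workflow Clarity (a generated test should fail against the recorded bad output and pass once the output is fixed) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code phoenix2pytest actually generates: the trace fields, the `check_json_object` assertion, and the `fails` helper are all hypothetical names chosen for the example.

```python
import json

# Hypothetical labeled trace; field names are illustrative,
# not the pipeline's real schema.
TRACE = {
    "input": "Return the order status as JSON.",
    "bad_output": "Sure! Your order has shipped.",
    "label": "format_break",
}


def check_json_object(output: str) -> None:
    """Regression assertion for a format break: output must be a JSON object."""
    parsed = json.loads(output)  # raises ValueError on non-JSON text
    assert isinstance(parsed, dict)


def fails(check, output: str) -> bool:
    """Return True if the check rejects the given output."""
    try:
        check(output)
    except (AssertionError, ValueError):
        return True
    return False


# Assumed example of what a corrected model response might look like.
fixed_output = '{"status": "shipped"}'

# A useful regression test rejects the recorded failure and accepts the fix.
test_is_valid = (
    fails(check_json_object, TRACE["bad_output"])
    and not fails(check_json_object, fixed_output)
)
```

A check that passes against the recorded bad output would not guard against the regression, which is why both directions are verified before the test is trusted in CI.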

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a strong skill description that clearly defines a specific pipeline (phoenix2pytest), enumerates concrete actions and failure types, and provides comprehensive trigger terms spanning the Arize Phoenix and LLM testing ecosystem. The explicit 'Use when' clause with multiple scenarios ensures Claude can accurately select this skill. The only minor weakness is that the description is somewhat dense as a single long sentence, but the content quality is excellent.

Dimension (score): reasoning

Specificity (3 / 3): Lists multiple concrete actions: turning labeled LLM failure traces into runnable pytest regression tests, extracting test cases from observed LLM bugs (with specific bug types enumerated), and bridging Phoenix-labeled traces into pytest-based suites for CI.

Completeness (3 / 3): Clearly answers both 'what' (turn labeled LLM failure traces into runnable pytest regression tests using the phoenix2pytest pipeline) and 'when', with an explicit 'Use when' clause covering multiple detailed trigger scenarios.

Trigger Term Quality (3 / 3): Excellent coverage of natural trigger terms users would say: 'Arize Phoenix', 'OpenInference', 'pytest', 'regression tests', 'hallucination', 'format break', 'off-topic drift', 'LLM observability', 'Gemini test synthesis', 'Vertex AI agent evaluation', 'Phoenix MCP', 'production failures'.

Distinctiveness / Conflict Risk (3 / 3): Highly distinctive niche combining Arize Phoenix traces, OpenInference spans, and pytest regression test generation. The specific pipeline name 'phoenix2pytest' and the narrow domain of LLM failure trace conversion make it very unlikely to conflict with other skills.

Total: 12 / 12

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Reviewed
