
golikovichev/phoenix2pytest

Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite from real production failures, when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug), when bridging Phoenix-labeled traces into pytest-based suites for CI, when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.

Overall: 88 · 1.63x
Quality: 94% (Does it follow best practices?)
Impact: 98% · 1.63x (Average score across 2 eval scenarios)
Security by Snyk: Advisory (Suggest reviewing before use)


Evaluation results

Regression Tests for LLM Failure Traces

pytest template compliance and naming conventions

Scenario score: 100% with context, 39% without context

| Criteria | Without context | With context |
| --- | --- | --- |
| Required imports present | 37% | 100% |
| google-genai import | 0% | 100% |
| VERTEXAI env var set | 0% | 100% |
| _ask_gemini helper defined | 0% | 100% |
| Test naming convention | 20% | 100% |
| Hallucination assertion strategy | 100% | 100% |
| Format_break assertion strategy | 100% | 100% |
| No LLM-as-judge | 100% | 100% |
| Output file paths | 100% | 100% |
| Markdown fence stripping present | 100% | 100% |
| synthesis_notes.md produced | 100% | 100% |
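The "Markdown fence stripping present" criterion implies that model output is cleaned of code fences before it is written to a test file. A minimal sketch of such a step follows, assuming the model wraps its reply in a single fenced block. The name `strip_markdown_fences` is hypothetical and is not taken from phoenix2pytest.

```python
import re

# Matches a reply that is one fenced block, with an optional language tag.
_FENCE_RE = re.compile(r"^\s*```[\w+-]*[ \t]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Return the code inside a single fenced block, or the text unchanged.

    Hypothetical helper: illustrates the idea only, not the pipeline's code.
    """
    match = _FENCE_RE.match(text)
    if match:
        return match.group(1).strip()
    # No surrounding fence found: only trim outer whitespace.
    return text.strip()
```

Replies that are already plain code pass through untouched apart from whitespace trimming, so the step is safe to apply unconditionally.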

Regression Suite for Repeated Fabrication Pattern

Multi-trace parametrized pytest synthesis for shared failure mode

Scenario score: 97% with context, 37% without context

| Criteria | Without context | With context |
| --- | --- | --- |
| Output file path | 0% | 100% |
| Parametrize decorator | 100% | 100% |
| All three prompts covered | 100% | 100% |
| Test function name pattern | 25% | 100% |
| Required imports | 28% | 57% |
| VERTEXAI env var | 0% | 100% |
| _ask_gemini helper | 0% | 100% |
| Fabricated strings excluded | 100% | 100% |
| Concrete assertions only | 100% | 100% |
| Grouping notes file | 100% | 100% |
| Grouping notes content | 100% | 100% |
| synthesise_many reference | 0% | 100% |

Evaluated
Agent: Claude Code
Model: Claude Sonnet 4.6
