Name: golikovichev/phoenix2pytest
Rating: 88.11 (1 review)
Author: golikovichev

Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite built from real production failures; when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug); when bridging Phoenix-labeled traces into pytest-based suites for CI; or when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, or Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.
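As a rough illustration of the first step the description implies (collecting labeled failures by failure mode before writing tests), here is a minimal sketch. The record shape (`label` and `prompt` keys) and the function name `group_prompts_by_label` are hypothetical, chosen for this example only; they are not the Phoenix trace schema or the phoenix2pytest API.

```python
from collections import defaultdict


def group_prompts_by_label(traces):
    """Map each failure label to the prompts observed with that label.

    `traces` is assumed to be a list of dicts with hypothetical keys
    "label" and "prompt"; unlabeled traces are skipped.
    """
    grouped = defaultdict(list)
    for trace in traces:
        label = trace.get("label")
        if label:
            grouped[label].append(trace["prompt"])
    return dict(grouped)


# Hypothetical sample records, for illustration only.
sample_traces = [
    {"label": "hallucination", "prompt": "Who wrote the 2019 report?"},
    {"label": "format_break", "prompt": "Return the result as JSON."},
    {"label": "hallucination", "prompt": "Cite the source for this figure."},
    {"label": None, "prompt": "Unlabeled trace"},
]

groups = group_prompts_by_label(sample_traces)
```

Each group could then feed one parametrized test, with one case per prompt.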

Quality: 94% (does it follow best practices?)
Impact: 98%, 1.63x (average score across 2 eval scenarios)

Security: Advisory. Suggest reviewing before use.

Evaluation results

97%

37%

Regression Suite for Repeated Fabrication Pattern

Multi-trace parametrized pytest synthesis for shared failure mode

Criteria: Baseline / With context

Output file path: 100%
Parametrize decorator: 100%
All three prompts covered: 100%
Test function name pattern: 25% / 100%
Required imports: 28% / 57%
VERTEXAI env var: 100%
_ask_gemini helper: 100%
Fabricated strings excluded: 100%
Concrete assertions only: 100%
Grouping notes file: 100%
Grouping notes content: 100%

synthesise_many reference

100%

39%

Regression Tests for LLM Failure Traces

pytest template compliance and naming conventions

Criteria: Baseline / With context

Required imports present: 37% / 100%
google-genai import: 100%
VERTEXAI env var set: 100%
_ask_gemini helper defined: 100%
Test naming convention: 20% / 100%
Hallucination assertion strategy: 100%
Format_break assertion strategy: 100%
No LLM-as-judge: 100%
Output file paths: 100%
Markdown fence stripping present: 100%
synthesis_notes.md produced: 100%
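
The "Markdown fence stripping present" criterion above suggests that synthesized test code is cleaned of wrapping code fences before it is written to disk. The listing does not show how phoenix2pytest does this, so the following is only a sketch of one way to do it with the standard library; the helper name `strip_markdown_fences` is an assumption made for this example.

```python
import re

# Matches text wrapped in one Markdown code fence, with an optional
# language tag after the opening fence (for example "python").
_FENCE_RE = re.compile(r"^\s*```[\w+-]*[ \t]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Return `text` without a single wrapping code fence, if one is present.

    Hypothetical helper: unfenced input is returned with surrounding
    whitespace trimmed.
    """
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()
```

Run on model output before saving the generated test file, a helper like this would keep stray fence lines from turning into Python syntax errors.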

Evaluated: about 1 month ago
Agent: Claude Code
Model: Claude Sonnet 4.6
