Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite from real production failures, when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug), when bridging Phoenix-labeled traces into pytest-based suites for CI, when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.
Turn production LLM failures into regression tests. Automatically.

Built for: Google Cloud Rapid Agent Hackathon, Arize track
Live demo: phoenix2pytest-etm7pvfo3a-nw.a.run.app - the example is pre-filled, just click Generate. Single trace at /, batch at /batch.
Stack: Google Cloud, Vertex AI, Gemini 2.5, Agent Builder, Arize Phoenix MCP, FastAPI, pytest
Status: Alpha (v0.1.0). Built for the Devpost submission cycle ending June 2026.
You ship an LLM feature. Three weeks later, a Slack thread mentions a customer got a weird response. You dig. The prompt has been edited twice since release. The model has been quietly re-quantised by the provider. Nobody added a test that would have caught it.
Existing eval frameworks ask you to predict failures up front. You write evals against your LLM, run them, get scores. That works for known failure modes you can imagine. It does not work for the failure that just got escalated to your phone.
phoenix2pytest goes the other direction. It reads traces from your Arize Phoenix project, picks the ones flagged as failures, and synthesises pytest cases that would have caught them. Production traffic feeds your regression suite without manual translation.
| Existing tools | phoenix2pytest |
|---|---|
| Direction: spec to eval to run | Direction: trace to failure to test |
| You predict what to test | You react to what broke |
| Eval scores | Concrete pytest assertions |
| Catches what you imagined | Catches what actually happened |
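To make "concrete pytest assertions" tangible, here is a rough sketch of the kind of test that could come out of a single hallucination trace. Everything in it is illustrative rather than actual phoenix2pytest output: `call_llm_app`, the trace ID, the prompt, and the forbidden phrase are all hypothetical placeholders.

```python
# Hypothetical evidence pulled from one failed trace (all values are placeholders).
TRACE_ID = "trace-0001"
PROMPT = "What is the refund window for annual plans?"
FORBIDDEN_CLAIM = "90 days"  # the claim the model invented in production


def call_llm_app(prompt: str) -> str:
    """Stand-in for your application's entry point; swap in the real call."""
    return "Annual plans can be refunded within 30 days of purchase."


def test_no_hallucinated_refund_window():
    # Replays the production prompt and fails if the bad claim comes back.
    answer = call_llm_app(PROMPT)
    assert FORBIDDEN_CLAIM not in answer, f"regression of {TRACE_ID}"
```

The point of the shape is that the assertion encodes one observed failure, not a score threshold, so a red run maps directly back to a trace.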
The pipeline runs end-to-end on a single trace (/) or on many annotated traces in one request (/batch), both in the web UI and on Cloud Run. Batch mode groups traces by failure mode and folds shared modes into one parametrised test.
flowchart LR
A[Phoenix project<br/>annotated traces] -->|MCP| B[Agent Builder<br/>orchestrator]
B --> C[Gemini Flash<br/>extractor]
B --> D[Gemini Pro<br/>synthesiser]
C --> D
D --> E[Generated<br/>pytest file]
E --> F[CI / dev<br/>runs pytest]

The orchestrator runs on Cloud Run, fetches traces through the Arize Phoenix MCP server, calls Gemini twice per trace (Flash for evidence extraction, Pro for code generation), and writes the synthesised test file.
The web UI is the primary entry point during the hackathon. A console-script CLI is on the roadmap (see below).
Local web UI:
git clone https://github.com/golikovichev/phoenix2pytest
cd phoenix2pytest
pip install -e ".[dev]"

Create a .env file in the repo root:
PHOENIX_BASE_URL=https://app.phoenix.arize.com/s/your-space
PHOENIX_API_KEY=your-phoenix-api-key
GOOGLE_CLOUD_PROJECT=your-gcp-project
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_GENAI_USE_VERTEXAI=True

Application Default Credentials are picked up automatically for Vertex AI, so no API key is needed if you have run gcloud auth application-default login. If you prefer the direct Gemini API, set GEMINI_API_KEY instead of the Vertex variables.
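The either/or choice between the Vertex AI variables and GEMINI_API_KEY can be checked before starting the server. The helper below is a hypothetical sanity check, not part of phoenix2pytest, and the way it parses `GOOGLE_GENAI_USE_VERTEXAI` is an assumption; the underlying SDK may interpret the value differently.

```python
import os


def resolve_backend(env=os.environ):
    """Report which Gemini backend the settings point at (hypothetical helper)."""
    use_vertex = env.get("GOOGLE_GENAI_USE_VERTEXAI", "").lower() == "true"
    if use_vertex:
        required = ("GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION")
        missing = [key for key in required if not env.get(key)]
        if missing:
            raise ValueError(f"Vertex AI selected but missing: {', '.join(missing)}")
        return "vertex"
    if env.get("GEMINI_API_KEY"):
        return "gemini_api"
    raise ValueError("Set the Vertex AI variables or GEMINI_API_KEY")


# Example with an explicit mapping instead of the process environment.
example = {
    "GOOGLE_GENAI_USE_VERTEXAI": "True",
    "GOOGLE_CLOUD_PROJECT": "your-gcp-project",
    "GOOGLE_CLOUD_LOCATION": "us-central1",
}
print(resolve_backend(example))
```

Accepting a mapping rather than reading `os.environ` directly makes the check easy to exercise in a unit test.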
Run the FastAPI web UI and open it in a browser:
uvicorn phoenix2pytest.web:app --reload --port 8000
# http://localhost:8000

Cloud Run deploy (see cloudbuild.yaml for the full pipeline):
gcloud builds submit --config cloudbuild.yaml

A 3-minute walkthrough video accompanies the Devpost submission. The demo shows a real Phoenix trace with a hallucination, phoenix2pytest extracting the failure, generating a pytest file, the run showing red, a prompt fix, and the run showing green.
Short answer: eval tools such as DeepEval, Opik, pytest-evals, and Langfuse are about running evals you wrote. phoenix2pytest is about generating tests from failures you saw. Different direction, different mental model.
| Tool | What it is | When to use |
|---|---|---|
| DeepEval | pytest-style framework for writing LLM evals | You know the failure modes you care about and want to define metrics |
| Opik | LLM observability with pytest integration | You want eval scores in CI |
| pytest-evals | Minimal pytest plugin for running evals at scale | You want parametrised eval runs |
| Langfuse | LLM tracing platform with evals | You want production tracing plus scoring |
| phoenix2pytest | Generates pytest tests from observed failures | You want your regression suite to keep up with production reality |
You can use phoenix2pytest alongside the others. It does not compete with eval frameworks; it feeds them. The output of phoenix2pytest is a pytest file you can run via DeepEval, Opik, pytest-evals, or plain pytest. Your choice.
Catches:
Does not catch yet:
The roadmap covers paraphrase tolerance via embedding-similarity assertions (post-hackathon).
Also planned: a phoenix2pytest console-script CLI and broader documentation.

MIT. See LICENSE.
Built on Arize Phoenix, Google Cloud Vertex AI, OpenTelemetry, and OpenInference semantic conventions. Thanks to the maintainers of all four projects.