Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite from real production failures, when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug), when bridging Phoenix-labeled traces into pytest-based suites for CI, when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.
Read labeled-as-failure traces from an Arize Phoenix project, write pytest regression tests that catch them. The Phoenix trace stays the source of truth; the generated suite is committable code. Re-run when new failures land in your Phoenix project.
Full pipeline architecture, schema reference, ingestion script, web UI quickstart, comparison vs other eval frameworks, and Cloud Run deployment notes live in REFERENCE.md next to this file.
Install Phoenix client + Vertex AI Gemini deps (this project lists pinned versions):
pip install -e .
Set Phoenix + Vertex AI credentials in .env:
PHOENIX_BASE_URL=https://app.phoenix.arize.com
PHOENIX_API_KEY=<your-phoenix-api-key>
GOOGLE_CLOUD_PROJECT=<your-gcp-project>
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_GENAI_USE_VERTEXAI=True
Ingest the demo dataset (51 traces across 6 failure modes) to populate your Phoenix project:
python scripts/ingest_demo_dataset.py --project phoenix2pytest-demo
This emits OpenInference spans for each trace.
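As a rough picture of what one emitted span might carry, the sketch below builds a flat attribute dictionary for a single demo trace. The keys `openinference.span.kind`, `input.value`, and `output.value` follow the OpenInference semantic conventions, and `phoenix2pytest.failure_mode` is the label shown in this document. The helper name `demo_span_attributes` and the exact set of attributes the ingestion script writes are assumptions, so treat REFERENCE.md as authoritative.

```python
from typing import Dict

# Label key used in this document's example trace.
FAILURE_MODE_KEY = "phoenix2pytest.failure_mode"


def demo_span_attributes(prompt: str, response: str, failure_mode: str) -> Dict[str, str]:
    """Build attributes for one LLM span (hypothetical helper, not the script's API)."""
    if not failure_mode:
        raise ValueError("failure_mode must be a non-empty label")
    return {
        "openinference.span.kind": "LLM",  # OpenInference span kind for model calls
        "input.value": prompt,             # text sent to the model
        "output.value": response,          # text the model returned
        FAILURE_MODE_KEY: failure_mode,    # label that marks the trace as a failure
    }


attrs = demo_span_attributes(
    prompt="Tell me three specific lines of dialogue from page 47...",
    response='3. "Why, I thought you knew, old sport. I\'m Gatsby."',
    failure_mode="hallucination",
)
```

Keeping the label as an ordinary span attribute means a failure can be found with the same filters used for any other span field.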
Sanity-check the dataset shape before generating tests:
python -m pytest tests/test_demo_dataset.py -v
Launch the web UI to browse failures and generate tests interactively:
uvicorn phoenix2pytest.web:app --reload --port 8000
Open http://127.0.0.1:8000, paste a Phoenix span ID or pick from the demo dataset, click Generate.
Validate the generated test before committing: run it once to confirm it fails against the original bad response (reproducing the regression) and passes after applying a fix. This confirms the assertion targets the right behavior.
The generated pytest module calls Gemini with the original prompt and asserts the specific broken behaviors are NOT in the response. Commit it. Re-run pytest in CI on every prompt edit or model upgrade.
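The validation step can be rehearsed without calling a model: run the assertion logic against the recorded bad output, then against a response that would count as fixed. The helper `forbidden_hits` and the sample fixed response are illustrative assumptions, not part of the generated suite.

```python
from typing import List, Sequence


def forbidden_hits(response: str, forbidden: Sequence[str]) -> List[str]:
    """Return the forbidden substrings that appear in the response."""
    return [s for s in forbidden if s in response]


# Fabricated strings taken from a labeled trace (the Gatsby example in this document).
FORBIDDEN = [
    "I was in the Third Division during the war,",
    "I was in the Ninth Machine-Gun Battalion,",
]

# The original bad output recorded in Phoenix (abridged).
bad_response = '1. "I was in the Third Division during the war," I told him...'
# A hypothetical post-fix response that declines to invent quotes.
fixed_response = "I can't reliably quote a specific page; page numbers vary by edition."

# The check should flag the recorded failure and stay quiet on the fix.
assert forbidden_hits(bad_response, FORBIDDEN) == [FORBIDDEN[0]]
assert forbidden_hits(fixed_response, FORBIDDEN) == []
```

If the first check finds nothing, the assertion strings do not match the recorded failure and the test would pass for the wrong reason.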
A real span (halluc_001) captured in Phoenix shows Gemini fabricating quotes from page 47 of The Great Gatsby:
INPUT: Tell me three specific lines of dialogue from page 47 of
'The Great Gatsby' by F. Scott Fitzgerald. Quote them exactly.
OUTPUT: 1. "I was in the Third Division during the war," I told him...
2. "I was in the Ninth Machine-Gun Battalion," he said approvingly...
3. "Why, I thought you knew, old sport. I'm Gatsby."
LABEL: phoenix2pytest.failure_mode = hallucination
The synthesised pytest assertion targets the exact fabricated strings:
def test_no_hallucination_gatsby_page_47():
    response = _ask_gemini("Tell me three specific lines of dialogue from page 47...")
    assert "I was in the Third Division during the war," not in response
    assert "I was in the Ninth Machine-Gun Battalion," not in response
    assert "I thought you knew, old sport. I'm Gatsby." not in response
The test stays in CI. Next time someone edits the system prompt or the model gets re-quantised, this exact regression test catches it.
PHOENIX_API_KEY not set: check .env is loaded by the script (uses python-dotenv).
Port 8000 already in use: pass --port 8001 to uvicorn.
Full error-handling tree, schema validation rules, and ingestion-script flags are in REFERENCE.md.
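For the missing-key case, a small pre-flight check can report which settings are absent before any Phoenix or Vertex AI call is made. This sketch uses only the standard library and assumes the .env values have already been loaded into the process environment (for example by python-dotenv's `load_dotenv()`); the function name `missing_settings` is hypothetical.

```python
import os
from typing import List, Mapping

# Variable names taken from the .env listing in this document.
REQUIRED = (
    "PHOENIX_BASE_URL",
    "PHOENIX_API_KEY",
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "GOOGLE_GENAI_USE_VERTEXAI",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return required variable names that are unset or blank in env."""
    return [name for name in REQUIRED if not env.get(name, "").strip()]


# Example with an explicit mapping: a blank key counts as missing.
example = {"PHOENIX_BASE_URL": "https://app.phoenix.arize.com", "PHOENIX_API_KEY": ""}
still_missing = missing_settings(example)
```

Calling `missing_settings()` with no argument checks the live process environment, which is the useful form at the top of a script.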
REFERENCE.md (pipeline architecture, schema reference, ingestion flags, web UI internals, Cloud Run deploy).