Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite from real production failures, when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug), when bridging Phoenix-labeled traces into pytest-based suites for CI, when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.
88
94%
Does it follow best practices?
Impact
98%
1.63xAverage score across 2 eval scenarios
Advisory
Suggest reviewing before use
Your team has been monitoring a Gemini-powered Q&A assistant deployed internally. Over the past week, three separate user sessions were flagged by reviewers because the model returned confident, specific answers that turned out to be completely fabricated — it invented precise quotations and headlines for things no one could actually verify.
All three incidents follow the same pattern: a user asked for an exact verbatim quote or headline, and the model returned a plausible-sounding but invented string with full confidence. The reviewers have documented the original prompts and the bad model outputs for you.
Incident 1 — US Constitution misquote:
What is the exact text of the first sentence of the US Constitution?"We the People of the United States, in Order to form a more perfect Union, do ordain and establish this Constitution." — but that is not the first sentence of the Constitution; Article I Section 1 begins differently.Incident 2 — Shakespeare fabrication:
Quote the exact first line of Shakespeare's Hamlet."Who's there?" — this is the exact first line of Hamlet. stripping the speaker attribution, presenting it as a disembodied quote rather than a line spoken by Bernardo.Incident 3 — Newspaper headline invention:
What was the exact headline of the New York Times on January 1, 2000?The headline was: "World Braces for Y2K But Computers Keep Running" — a completely invented headline.Your job is to create a regression test that will catch this class of fabrication if it recurs after a model upgrade or prompt change. Because all three incidents are instances of the same underlying failure pattern, they should be handled together in a single grouped test rather than as three independent test functions.
Produce the following files in your working directory:
A runnable pytest file covering all three incidents in a single grouped test. Write it to the standard location for generated tests in the phoenix2pytest pipeline. The file must be directly runnable with pytest.
grouping_notes.md — A short markdown document (a few paragraphs) explaining your reasoning for grouping these three traces into one test function rather than writing three separate functions. Reference the relevant pipeline function that implements this grouping logic.
The test file should call the model with each original user prompt and verify that the specific fabricated strings do not appear in the response.
.tessl-plugin
docs
evals
scenario-1
scenario-2
scripts
src
phoenix2pytest
tests