CtrlK
BlogDocsLog inGet started
Tessl Logo

golikovichev/phoenix2pytest

Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite from real production failures, when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug), when bridging Phoenix-labeled traces into pytest-based suites for CI, when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.

88

1.63x
Quality

94%

Does it follow best practices?

Impact

98%

1.63x

Average score across 2 eval scenarios

SecuritybySnyk

Advisory

Suggest reviewing before use

Overview
Quality
Evals
Security
Files

task.mdevals/scenario-2/

Regression Suite for Repeated Fabrication Pattern

Problem Description

Your team has been monitoring a Gemini-powered Q&A assistant deployed internally. Over the past week, three separate user sessions were flagged by reviewers because the model returned confident, specific answers that turned out to be completely fabricated — it invented precise quotations and headlines for things no one could actually verify.

All three incidents follow the same pattern: a user asked for an exact verbatim quote or headline, and the model returned a plausible-sounding but invented string with full confidence. The reviewers have documented the original prompts and the bad model outputs for you.

Incident 1 — US Constitution misquote:

  • User prompt: What is the exact text of the first sentence of the US Constitution?
  • Fabricated output: The model claimed the first sentence is the Preamble opener — "We the People of the United States, in Order to form a more perfect Union, do ordain and establish this Constitution." — but that is not the first sentence of the Constitution; Article I Section 1 begins differently.

Incident 2 — Shakespeare fabrication:

  • User prompt: Quote the exact first line of Shakespeare's Hamlet.
  • Fabricated output: The model returned "Who's there?" — this is the exact first line of Hamlet. stripping the speaker attribution, presenting it as a disembodied quote rather than a line spoken by Bernardo.

Incident 3 — Newspaper headline invention:

  • User prompt: What was the exact headline of the New York Times on January 1, 2000?
  • Fabricated output: The model stated The headline was: "World Braces for Y2K But Computers Keep Running" — a completely invented headline.

Your job is to create a regression test that will catch this class of fabrication if it recurs after a model upgrade or prompt change. Because all three incidents are instances of the same underlying failure pattern, they should be handled together in a single grouped test rather than as three independent test functions.

Output Specification

Produce the following files in your working directory:

  1. A runnable pytest file covering all three incidents in a single grouped test. Write it to the standard location for generated tests in the phoenix2pytest pipeline. The file must be directly runnable with pytest.

  2. grouping_notes.md — A short markdown document (a few paragraphs) explaining your reasoning for grouping these three traces into one test function rather than writing three separate functions. Reference the relevant pipeline function that implements this grouping logic.

The test file should call the model with each original user prompt and verify that the specific fabricated strings do not appear in the response.

CHANGELOG.md

CONTRIBUTING.md

README.md

REFERENCE.md

SECURITY.md

SKILL.md

tessl.json

tile.json