
golikovichev/phoenix2pytest

Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite from real production failures, when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug), when bridging Phoenix-labeled traces into pytest-based suites for CI, when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.

88

1.63x
Quality: 94% (Does it follow best practices?)
Impact: 98%, 1.63x (average score across 2 eval scenarios)
Security by Snyk: Advisory (suggest reviewing before use)


evals/scenario-1/task.md

Regression Tests for LLM Failure Traces

Problem/Feature Description

Your team maintains an LLM-powered Q&A service and has been collecting traces of observed failures. Two recent incidents have been flagged for regression coverage: one where the model fabricated a literary quote it presented as fact, and another where the model ignored explicit formatting instructions and wrapped its response in markdown fences instead of returning bare data.

The phoenix2pytest project provides a pipeline for turning these failure traces into committable pytest regression tests. Your job is to manually synthesize the two test files — one per failure trace — following the project's conventions for file layout, template structure, naming, and assertion style. Once committed, these tests will run in CI on every prompt edit or model upgrade to catch regressions.

You have been provided the two traces below. Synthesize the corresponding pytest files and write a brief synthesis_notes.md explaining the assertion strategy you chose for each test and why.

Trace 1

  • Failure mode: hallucination
  • User prompt: Quote the opening line of chapter 3 of Moby Dick verbatim.
  • LLM output: The opening line of chapter 3 is: "It was the whiteness of the whale that above all things appalled me."
  • Evidence: The quoted line does not appear as the chapter 3 opener — it is fabricated in this context.
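One possible assertion strategy for this trace is a negative check: the test fails whenever the output repeats the quote flagged as fabricated. The sketch below is illustrative only and is not the project's template; the helper names `normalize` and `contains_fabricated_quote` are hypothetical, and the output is passed in as a string because the way the application under test is invoked is not described here.

```python
import re

# Fabricated quote copied verbatim from the LLM output in Trace 1.
FABRICATED_QUOTE = "It was the whiteness of the whale that above all things appalled me."


def normalize(text: str) -> str:
    """Lowercase, unify curly quotes, and collapse whitespace for robust matching."""
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_fabricated_quote(output: str) -> bool:
    """Return True when the output repeats the quote flagged in the trace."""
    return normalize(FABRICATED_QUOTE) in normalize(output)


def test_recorded_failure_is_detected():
    # Sanity check: the detector must flag the output captured in the trace.
    recorded = (
        'The opening line of chapter 3 is: "It was the whiteness of the whale '
        'that above all things appalled me."'
    )
    assert contains_fabricated_quote(recorded)


def test_clean_output_passes():
    # A response that does not repeat the fabricated quote must pass.
    assert not contains_fabricated_quote("I cannot quote that line from memory.")
```

In a real regression test the string under inspection would come from a fresh call to the application rather than from a literal. A substring check like this only catches a repeat of the same fabrication, so it guards against this specific regression and not against hallucination in general.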

Trace 2

  • Failure mode: format_break
  • User prompt: Return a JSON object with keys name and age for a fictional person. Return raw JSON only, no prose, no markdown fences.
  • LLM output:
    Here is the JSON:
    ```json
    {"name": "Alice", "age": 30}
    ```
  • Evidence: The output wraps the JSON in markdown fences and adds prose, violating the formatting instruction.
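For a format failure like this one, a strict structural check tends to work well: the whole output must parse as JSON on its own and contain the requested keys. The following is a minimal sketch under that assumption; `is_raw_json_object` is a hypothetical helper, not part of phoenix2pytest, and the recorded string reproduces only the prose line, the opening fence, and the JSON line shown in the trace.

```python
import json

FENCE = "`" * 3  # markdown code fence marker, built indirectly to keep this example readable


def is_raw_json_object(output: str, required_keys=("name", "age")) -> bool:
    """Return True only if the whole output is a bare JSON object with the required keys."""
    if FENCE in output:
        return False  # a markdown fence is present
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False  # surrounding prose or malformed JSON
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


def test_recorded_failure_is_detected():
    # The output captured in Trace 2 must be rejected.
    recorded = "Here is the JSON:\n" + FENCE + 'json\n{"name": "Alice", "age": 30}'
    assert not is_raw_json_object(recorded)


def test_bare_json_passes():
    # Bare JSON with both keys satisfies the formatting instruction.
    assert is_raw_json_object('{"name": "Alice", "age": 30}')
```

Parsing the entire output with `json.loads` rejects leading or trailing prose without any extra pattern matching, since anything other than a single JSON value raises `json.JSONDecodeError`. The explicit fence check is redundant for this trace but makes the failure reason easier to read in a test report.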

Output Specification

Produce the following files:

  • generated_tests/test_hallucination.py — pytest regression test for the hallucination trace
  • generated_tests/test_format_break.py — pytest regression test for the format_break trace
  • synthesis_notes.md — a brief explanation of your assertion strategy choice for each test (2–4 sentences per test)

Each test file must be a complete, runnable Python file. Do not leave placeholder comments — write real assertions that would catch the described failure.
