
golikovichev/phoenix2pytest

Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite built from real production failures; when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug); when bridging Phoenix-labeled traces into pytest-based suites for CI; or when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, or Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.

Score: 88 (1.63x)
Quality: 94% (Does it follow best practices?)
Impact: 98%, 1.63x (average score across 2 eval scenarios)
Security by Snyk: Advisory (suggest reviewing before use)


evals/scenario-2/criteria.json

{
  "context": "The agent must produce a single parametrized pytest file covering three hallucination traces, following the phoenix2pytest template and grouping conventions. The file must be placed at generated_tests/test_hallucination.py and use @pytest.mark.parametrize to cover all three prompts in one test function with concrete string-level assertions.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Output file path",
      "description": "Generated test file is written to generated_tests/test_hallucination.py (directory named generated_tests, file named test_hallucination.py)",
      "max_score": 8
    },
    {
      "name": "Parametrize decorator",
      "description": "Test function uses @pytest.mark.parametrize to cover all three user prompts in a single test function (not three separate functions)",
      "max_score": 12
    },
    {
      "name": "All three prompts covered",
      "description": "The parametrize values include all three prompts: the US Constitution first sentence prompt, the Hamlet first line prompt, and the New York Times Y2K headline prompt",
      "max_score": 10
    },
    {
      "name": "Test function name pattern",
      "description": "Test function name begins with test_no_hallucination_ (matching the test_no_<failure_mode>_<short_context> pattern)",
      "max_score": 8
    },
    {
      "name": "Required imports",
      "description": "File imports os, re, and pytest at the top (all three must be present)",
      "max_score": 7
    },
    {
      "name": "VERTEXAI env var",
      "description": "File sets os.environ[\"GOOGLE_GENAI_USE_VERTEXAI\"] = \"True\"",
      "max_score": 8
    },
    {
      "name": "_ask_gemini helper",
      "description": "File defines a _ask_gemini(prompt: str) -> str helper function",
      "max_score": 8
    },
    {
      "name": "Fabricated strings excluded",
      "description": "Assertions check that the fabricated strings are NOT present in the response (substring_excluded strategy: assert ... not in response, or equivalent negative check)",
      "max_score": 12
    },
    {
      "name": "Concrete assertions only",
      "description": "Assertions use direct string comparisons or membership checks — does NOT call an LLM or use any judge/eval function to verify the response",
      "max_score": 10
    },
    {
      "name": "Grouping notes file",
      "description": "grouping_notes.md exists in the workspace",
      "max_score": 5
    },
    {
      "name": "Grouping notes content",
      "description": "grouping_notes.md states that all three traces share the hallucination failure mode as the reason they are grouped into one parametrized function",
      "max_score": 7
    },
    {
      "name": "synthesise_many reference",
      "description": "grouping_notes.md mentions synthesise_many (the function responsible for grouping same-failure-mode traces)",
      "max_score": 5
    }
  ]
}
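
Several of the checklist items above are structural and can be checked without calling a model. The sketch below is one way to do that with the standard library. It is not the grader used for these evals: `SAMPLE`, `check_source`, and the all-or-nothing scoring are assumptions made for illustration. The prompts and fabricated strings in `SAMPLE` are placeholders, not the real trace data, and the function name suffix `quotes` is hypothetical. The `max_score` values are copied from the checklist.

```python
import ast

# Hypothetical generated test file, held as a string so it can be inspected
# without pytest or a model client installed. Prompts and fabricated strings
# are placeholders.
SAMPLE = '''
import os
import re

import pytest

os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "True"


def _ask_gemini(prompt: str) -> str:
    raise NotImplementedError("call the model here")


@pytest.mark.parametrize(
    "prompt, fabricated",
    [
        ("<US Constitution first sentence prompt>", "<fabricated string 1>"),
        ("<Hamlet first line prompt>", "<fabricated string 2>"),
        ("<New York Times Y2K headline prompt>", "<fabricated string 3>"),
    ],
)
def test_no_hallucination_quotes(prompt, fabricated):
    response = _ask_gemini(prompt)
    assert fabricated not in response
'''

# max_score values copied from the checklist for the items checked here.
MAX_SCORES = {
    "Required imports": 7,
    "VERTEXAI env var": 8,
    "_ask_gemini helper": 8,
    "Test function name pattern": 8,
    "Parametrize decorator": 12,
    "Fabricated strings excluded": 12,
}


def check_source(source: str) -> dict[str, bool]:
    """Run static checks for a subset of the checklist items."""
    tree = ast.parse(source)

    # Collect names from top-level `import x` statements.
    imported = set()
    for node in tree.body:
        if isinstance(node, ast.Import):
            imported.update(alias.name for alias in node.names)

    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    tests = [f for f in funcs if f.name.startswith("test_no_hallucination_")]
    parametrized = [
        f for f in tests
        if any("pytest.mark.parametrize" in ast.unparse(d) for d in f.decorator_list)
    ]
    # Look for a `not in` comparison anywhere inside the test functions.
    has_not_in = any(
        isinstance(n, ast.Compare) and any(isinstance(op, ast.NotIn) for op in n.ops)
        for t in tests
        for n in ast.walk(t)
    )

    return {
        "Required imports": {"os", "re", "pytest"} <= imported,
        "VERTEXAI env var": 'os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "True"' in source,
        "_ask_gemini helper": any(f.name == "_ask_gemini" for f in funcs),
        "Test function name pattern": bool(tests),
        "Parametrize decorator": len(tests) == 1 and len(parametrized) == 1,
        "Fabricated strings excluded": has_not_in,
    }


results = check_source(SAMPLE)
# Assumed scoring: an item earns its full max_score if its check passes, else 0.
score = sum(MAX_SCORES[name] for name, ok in results.items() if ok)
print(results)
print(score)
```

The remaining items (output path, prompt coverage, absence of judge calls, and the `grouping_notes.md` checks) depend on the workspace and the actual trace content, so they are left out of this sketch.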
