Name: golikovichev/phoenix2pytest
Rating: 88.11 (1 review)
Author: golikovichev

golikovichev/phoenix2pytest

Turn labeled LLM failure traces from an Arize Phoenix project into runnable pytest regression tests using the phoenix2pytest pipeline. Use it when the user has an LLM application emitting OpenInference spans to Phoenix and wants a regression suite built from real production failures; when extracting test cases from observed LLM bugs (hallucination, format break, off-topic drift, stale data, wrong reasoning, refusal bug); or when bridging Phoenix-labeled traces into pytest-based suites for CI. It also applies when the user mentions Arize Phoenix MCP, OpenInference instrumentation, LLM observability, Gemini test synthesis, or Vertex AI agent evaluation, or wants to react to LLM failures rather than predict them upfront.
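The failure modes named in the description map onto test names and file paths through the convention this page's eval criteria describe (test_no_<failure_mode>_<short_context> and generated_tests/test_<slug>.py). The following is a minimal sketch of that mapping. The helper names and the exact sanitization rule (lowercase, non-alphanumeric runs collapsed to underscores) are assumptions for illustration, not the pipeline's own code.

```python
import re


def sanitize_slug(label: str) -> str:
    """Lowercase a label and collapse non-alphanumeric runs into underscores.

    Assumed rule: the source only says filenames use a "sanitized failure
    mode slug" without defining the sanitization.
    """
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def build_test_name(failure_mode: str, short_context: str) -> str:
    """Hypothetical helper: test_no_<failure_mode>_<short_context>."""
    return f"test_no_{sanitize_slug(failure_mode)}_{sanitize_slug(short_context)}"


def output_path(failure_mode: str) -> str:
    """Hypothetical helper: generated_tests/test_<slug>.py."""
    return f"generated_tests/test_{sanitize_slug(failure_mode)}.py"


# Failure modes as listed in the description above.
FAILURE_MODES = [
    "hallucination",
    "format break",
    "off-topic drift",
    "stale data",
    "wrong reasoning",
    "refusal bug",
]

if __name__ == "__main__":
    for mode in FAILURE_MODES:
        print(output_path(mode))
```

Keeping the slug logic in one function means the test name and the file name cannot drift apart for the same failure mode.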

Quality: 94% (Does it follow best practices?)
Impact: 98% (1.63x). Average score across 2 eval scenarios.
Security: Advisory. Suggest reviewing before use.

{
  "context": "Tests whether the agent produces phoenix2pytest-compliant pytest regression files: correct template structure, standard imports, required environment setup, _ask_gemini helper, naming convention, concrete string-level assertions, and correct output file paths. Both the hallucination and format_break test files are evaluated.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Required imports present",
      "description": "Both test files import os, re, and pytest at the top level",
      "max_score": 8
    },
    {
      "name": "google-genai import",
      "description": "Both test files import from google.genai and google.genai.types (specifically HttpOptions)",
      "max_score": 8
    },
    {
      "name": "VERTEXAI env var set",
      "description": "Both test files set os.environ[\"GOOGLE_GENAI_USE_VERTEXAI\"] = \"True\" at module level",
      "max_score": 8
    },
    {
      "name": "_ask_gemini helper defined",
      "description": "Both test files define a _ask_gemini(prompt: str) -> str function that calls client.models.generate_content with model=\"gemini-2.5-flash\"",
      "max_score": 10
    },
    {
      "name": "Test naming convention",
      "description": "Both test functions follow the pattern test_no_<failure_mode>_<short_context> (e.g. test_no_hallucination_..., test_no_format_break_...)",
      "max_score": 10
    },
    {
      "name": "Hallucination assertion strategy",
      "description": "test_hallucination.py uses a substring_excluded assertion — asserts that the fabricated quote string is NOT in the response (e.g. assert \"...\" not in response)",
      "max_score": 10
    },
    {
      "name": "Format_break assertion strategy",
      "description": "test_format_break.py checks the output is valid bare JSON — either uses json.loads() without raising, or asserts that markdown fence markers (```) are NOT in the response",
      "max_score": 10
    },
    {
      "name": "No LLM-as-judge",
      "description": "Neither test file calls any external model or LLM to evaluate the response — all assertions are concrete string checks, regex checks, or json.loads() calls",
      "max_score": 9
    },
    {
      "name": "Output file paths",
      "description": "Files are written to generated_tests/test_hallucination.py and generated_tests/test_format_break.py (directory name is generated_tests, filenames use sanitized failure mode slug)",
      "max_score": 9
    },
    {
      "name": "Markdown fence stripping present",
      "description": "At least one of the test files or synthesis_notes.md references stripping markdown fences from Gemini output, OR the format_break test explicitly checks for absence of ``` markers",
      "max_score": 9
    },
    {
      "name": "synthesis_notes.md produced",
      "description": "A synthesis_notes.md file is present and covers assertion strategy rationale for both the hallucination and format_break tests",
      "max_score": 9
    }
  ]
}


criteria.json (evals/scenario-1/)