databricks-mlflow-evaluation

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.


MLflow 3 GenAI Evaluation

Before Writing Any Code

  1. Read GOTCHAS.md - 15+ common mistakes that cause failures
  2. Read CRITICAL-interfaces.md - Exact API signatures and data schemas

End-to-End Workflows

Follow these workflows based on your goal. Each step indicates which reference files to read.

Workflow 1: First-Time Evaluation Setup

For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | patterns-datasets.md (Patterns 1-4) |
| 4 | Choose/create scorers | patterns-scorers.md + CRITICAL-interfaces.md (built-in list) |
| 5 | Run evaluation | patterns-evaluation.md (Patterns 1-3) |
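The workflow above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `my_agent`, the dataset contents, and the expectation fields are hypothetical, and running `run_first_eval()` assumes MLflow 3.x with Databricks authentication configured.

```python
# Minimal first-time evaluation sketch. Records use the nested
# {"inputs": {...}} structure that mlflow.genai.evaluate() requires.
eval_data = [
    {
        "inputs": {"query": "How do I reset my password?"},
        "expectations": {"expected_facts": ["settings page", "reset link"]},
    },
]

def my_agent(query: str) -> str:
    # Placeholder agent; evaluate() calls predict_fn with *unpacked* kwargs,
    # so the parameter name must match the "inputs" key ("query" here).
    return f"Answer to: {query}"

def run_first_eval():
    # Imported lazily so the data shapes above stand on their own.
    import mlflow
    from mlflow.genai.scorers import Safety  # one of the built-in scorers

    return mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=my_agent,
        scorers=[Safety()],
    )
```

The `evaluate()` call returns a result object whose metrics you can compare across runs (see Workflow 4).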

Workflow 2: Production Trace -> Evaluation Dataset

For building evaluation datasets from production traces.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Search and filter traces | patterns-trace-analysis.md (MCP tools section) |
| 2 | Analyze trace quality | patterns-trace-analysis.md (Patterns 1-7) |
| 3 | Tag traces for inclusion | patterns-datasets.md (Patterns 16-17) |
| 4 | Build dataset from traces | patterns-datasets.md (Patterns 6-7) |
| 5 | Add expectations/ground truth | patterns-datasets.md (Pattern 2) |
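Steps 4-5 can be sketched roughly as below. Treat this as an assumption-laden outline: the `request` column name and its shape depend on your MLflow version's `search_traces()` schema, so verify against patterns-datasets.md before relying on it.

```python
def traces_to_records(max_results=50):
    # Sketch: pull recent traces and reshape them into evaluate()-style
    # records. The "request" column name is an assumption; check the
    # DataFrame schema that your MLflow version actually returns.
    import mlflow

    df = mlflow.search_traces(max_results=max_results)
    return [
        # Expectations start empty; domain experts fill them in later
        # (Pattern 2 in patterns-datasets.md).
        {"inputs": {"query": req}, "expectations": {}}
        for req in df["request"]
    ]
```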

Workflow 3: Performance Optimization

For debugging slow or expensive agent execution.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Profile latency by span | patterns-trace-analysis.md (Patterns 4-6) |
| 2 | Analyze token usage | patterns-trace-analysis.md (Pattern 9) |
| 3 | Detect context issues | patterns-context-optimization.md (Section 5) |
| 4 | Apply optimizations | patterns-context-optimization.md (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | patterns-evaluation.md (Patterns 6-7) |

Workflow 4: Regression Detection

For comparing agent versions and finding regressions.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Establish baseline | patterns-evaluation.md (Pattern 4: named runs) |
| 2 | Run current version | patterns-evaluation.md (Pattern 1) |
| 3 | Compare metrics | patterns-evaluation.md (Patterns 6-7) |
| 4 | Analyze failing traces | patterns-trace-analysis.md (Pattern 7) |
| 5 | Debug specific failures | patterns-trace-analysis.md (Patterns 8-9) |

Workflow 5: Custom Scorer Development

For creating project-specific evaluation metrics.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Understand scorer interface | CRITICAL-interfaces.md (Scorer section) |
| 2 | Choose scorer pattern | patterns-scorers.md (Patterns 4-11) |
| 3 | For multi-agent scorers | patterns-scorers.md (Patterns 13-16) |
| 4 | Test with evaluation | patterns-evaluation.md (Pattern 1) |
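A custom scorer might look like the sketch below. The metric itself (keyword coverage) is a made-up example; the `@scorer` wrapper follows the MLflow 3 pattern where a scorer function accepts any subset of `inputs`, `outputs`, `expectations`, and `trace`.

```python
def keyword_hits(text, keywords):
    # Pure scoring logic: fraction of expected keywords present in the output.
    # Kept separate from the MLflow wrapper so it is easy to unit-test.
    text = text.lower()
    found = [k for k in keywords if k.lower() in text]
    return len(found) / len(keywords) if keywords else 1.0

def build_scorer():
    # Wrap the pure function for mlflow.genai.evaluate(). Imported lazily
    # so the scoring logic above stands on its own.
    from mlflow.genai.scorers import scorer

    @scorer
    def keyword_coverage(outputs, expectations):
        # "keywords" is a hypothetical expectations field for this example.
        return keyword_hits(str(outputs), expectations.get("keywords", []))

    return keyword_coverage
```

Pass the built scorer in `scorers=[build_scorer()]` when running `mlflow.genai.evaluate()` (Workflow 1, step 5).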

Workflow 6: Unity Catalog Trace Ingestion & Production Monitoring

For storing traces in Unity Catalog, instrumenting applications, and enabling continuous production monitoring.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Link UC schema to experiment | patterns-trace-ingestion.md (Patterns 1-2) |
| 2 | Set trace destination | patterns-trace-ingestion.md (Patterns 3-4) |
| 3 | Instrument your application | patterns-trace-ingestion.md (Patterns 5-8) |
| 4 | Configure trace sources (Apps/Serving/OTEL) | patterns-trace-ingestion.md (Patterns 9-11) |
| 5 | Enable production monitoring | patterns-trace-ingestion.md (Patterns 12-13) |
| 6 | Query and analyze UC traces | patterns-trace-ingestion.md (Pattern 14) |

Workflow 7: Judge Alignment with MemAlign

For aligning an LLM judge to match domain expert preferences. A well-aligned judge improves every downstream use: evaluation accuracy, production monitoring signal, and prompt optimization quality. This workflow is valuable on its own, independent of prompt optimization.

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Design base judge with make_judge (any feedback type) | patterns-judge-alignment.md (Pattern 1) |
| 2 | Run evaluate(), tag successful traces | patterns-judge-alignment.md (Pattern 2) |
| 3 | Build UC dataset + create SME labeling session | patterns-judge-alignment.md (Pattern 3) |
| 4 | Align judge with MemAlign after labeling completes | patterns-judge-alignment.md (Pattern 4) |
| 5 | Register aligned judge to experiment | patterns-judge-alignment.md (Pattern 5) |
| 6 | Re-evaluate with aligned judge (baseline) | patterns-judge-alignment.md (Pattern 6) |
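Steps 1 and 4 might be sketched as below. Hedge accordingly: the judge name, instructions, and model URI are hypothetical, and the bare `align()` call is an assumption about the MemAlign entry point — confirm the exact signature against Pattern 4 before use.

```python
def build_base_judge():
    # Step 1: a base judge built with make_judge. Instructions reference
    # template variables like {{ inputs }} and {{ outputs }}.
    from mlflow.genai.judges import make_judge

    return make_judge(
        # Must match the label schema name in the SME labeling session,
        # or align() cannot pair judge scores with expert labels
        # (see Critical API Facts below).
        name="formality",
        instructions=(
            "Given the request in {{ inputs }}, decide whether the response "
            "in {{ outputs }} uses a formal, professional tone. Answer yes or no."
        ),
        model="databricks:/databricks-gpt-oss-120b",  # hypothetical endpoint
    )

def align_with_expert_labels(judge, labeled_traces):
    # Step 4 (sketch): fold completed SME labels back into the judge.
    # The align() signature here is an assumption; see Pattern 4.
    return judge.align(labeled_traces)
```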

Workflow 8: Automated Prompt Optimization with GEPA

For automatically improving a registered system prompt using optimize_prompts(). It works with any scorer, but pairing it with an aligned judge (Workflow 7) yields the most domain-accurate signal. For the full end-to-end loop combining alignment and optimization, see user-journeys.md (Journey 10).

| Step | Action | Reference Files |
| --- | --- | --- |
| 1 | Build optimization dataset (inputs + expectations) | patterns-prompt-optimization.md (Pattern 1) |
| 2 | Run optimize_prompts() with GEPA + scorer | patterns-prompt-optimization.md (Pattern 2) |
| 3 | Register new version, promote conditionally | patterns-prompt-optimization.md (Pattern 3) |
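Step 1 hinges on a dataset-shape requirement that differs from plain evaluation: every optimization record needs both inputs and expectations. A minimal sketch of that shape (field names and contents are illustrative, not prescribed):

```python
# Optimization records need BOTH "inputs" and "expectations" per record,
# unlike an eval dataset where expectations are optional.
optimization_data = [
    {
        "inputs": {"query": "Summarize our refund policy."},
        "expectations": {"expected_response": "Refunds are issued within 30 days."},
    },
    {
        "inputs": {"query": "Which plans include SSO?"},
        "expectations": {"expected_response": "SSO is included in the Enterprise plan."},
    },
]

# Guard worth running before handing data to optimize_prompts():
missing = [
    r for r in optimization_data
    if not (r.get("inputs") and r.get("expectations"))
]
assert not missing, "GEPA records must carry inputs AND expectations"
```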

Reference Files Quick Lookup

| Reference | Purpose | When to Read |
| --- | --- | --- |
| GOTCHAS.md | Common mistakes | Always read first before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| patterns-evaluation.md | Running evals, comparing | When executing evaluations |
| patterns-scorers.md | Custom scorer creation | When built-in scorers aren't enough |
| patterns-datasets.md | Dataset building | When preparing evaluation data |
| patterns-trace-analysis.md | Trace debugging | When analyzing agent behavior |
| patterns-context-optimization.md | Token/latency fixes | When agent is slow or expensive |
| patterns-trace-ingestion.md | UC trace setup, monitoring | When setting up trace storage or production monitoring |
| patterns-judge-alignment.md | MemAlign judge alignment, labeling sessions, SME feedback | When aligning judges to domain expert preferences |
| patterns-prompt-optimization.md | GEPA optimization: build dataset, optimize_prompts(), promote | When running automated prompt improvement |
| user-journeys.md | High-level workflows, full domain-expert optimization loop | When starting a new evaluation project or running the full align + optimize cycle |

Critical API Facts

  • Use: mlflow.genai.evaluate() (NOT mlflow.evaluate())
  • Data format: {"inputs": {"query": "..."}} (nested structure required)
  • predict_fn: Receives **unpacked kwargs (not a dict)
  • MemAlign: Scorer-agnostic (works with any feedback_value_type -- float, bool, categorical); token-heavy on the embedding model, so set embedding_model explicitly
  • Label schema name matching: The label schema name in the labeling session MUST match the judge name used in evaluate() for align() to pair scores
  • Aligned judge scores: May be lower than unaligned judge scores -- this is expected and means the judge is now more accurate, not that the agent regressed
  • GEPA optimization dataset: Must have both inputs AND expectations per record (different from eval dataset)
  • Episodic memory: Lazily loaded -- get_scorer() results won't show episodic memory on print until the judge is first used
  • optimize_prompts: Requires MLflow >= 3.5.0

See GOTCHAS.md for complete list.
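The first three facts above can be demonstrated in plain Python, since the kwargs-unpacking behavior is just dictionary expansion. The record contents and parameter names here are illustrative:

```python
# evaluate() unpacks each record's "inputs" dict into keyword arguments
# for predict_fn, so parameter names must match the inputs keys exactly.
record = {"inputs": {"query": "ping", "history": []}}

def predict_fn(query, history):
    # A predict_fn receives unpacked kwargs (query=..., history=...),
    # NOT a single dict argument.
    return {"response": f"echo: {query}", "turns": len(history)}

# What mlflow.genai.evaluate() effectively does per record:
out = predict_fn(**record["inputs"])
```

If the parameter names drift from the `inputs` keys (say, `question` vs `query`), the unpacking raises a `TypeError` — one of the failures GOTCHAS.md catalogs.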

Related Skills

  • databricks-docs - General Databricks documentation reference
  • databricks-model-serving - Deploying models and agents to serving endpoints
  • databricks-agent-bricks - Building agents that can be evaluated with this skill
  • databricks-python-sdk - SDK patterns used alongside MLflow APIs
  • databricks-unity-catalog - Unity Catalog tables for managed evaluation datasets
Repository: databricks-solutions/ai-dev-kit