databricks-mlflow-evaluation

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.

Quality

86%

Does it follow best practices?

Security

Low

Low-risk findings worth noting

Quality

Content

72%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is an exemplar of progressive disclosure: a lean routing layer of sequenced workflow tables and a quick-lookup index pointing to real, well-organized reference files, plus a concise Critical API Facts list. It loses points on Actionability and Workflow Clarity because executable code and explicit validation checkpoints are delegated to the reference files rather than appearing in the body itself.

Suggestions

Add one small executable code snippet (e.g. a minimal mlflow.genai.evaluate() call with the required {"inputs": {"query": ...}} data shape) to the body so the skill is copy-paste ready before diving into references.
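A minimal sketch of what such a snippet might look like, assuming MLflow 3 is installed and that `mlflow.genai.evaluate()` takes `data`, `predict_fn`, and `scorers` arguments, with `predict_fn` receiving the keys of `inputs` as keyword arguments. The names `EVAL_DATA`, `my_agent`, `check_shape`, `non_empty_answer`, and `run_eval` are hypothetical and are not taken from the skill or its reference files.

```python
# Evaluation records in the shape quoted from the skill's Critical API Facts:
# each record nests the agent's arguments under "inputs".
EVAL_DATA = [
    {"inputs": {"query": "What is MLflow Tracing?"}},
    {"inputs": {"query": "How do I write a custom scorer?"}},
]


def my_agent(query: str) -> str:
    """Hypothetical stand-in for the agent under test."""
    return f"Echo: {query}"


def check_shape(records) -> bool:
    """Return True if every record has a dict under "inputs" containing "query"."""
    return all(
        isinstance(r.get("inputs"), dict) and "query" in r["inputs"]
        for r in records
    )


def run_eval():
    """Run the evaluation. mlflow is imported lazily so the shape check
    above can be used even where mlflow is not installed."""
    import mlflow
    from mlflow.genai.scorers import scorer

    @scorer
    def non_empty_answer(outputs) -> bool:
        # Hypothetical custom scorer: passes when the agent returned non-blank text.
        return bool(str(outputs).strip())

    # Use mlflow.genai.evaluate(), not mlflow.evaluate().
    return mlflow.genai.evaluate(
        data=EVAL_DATA,
        predict_fn=my_agent,
        scorers=[non_empty_answer],
    )
```

Calling `check_shape(EVAL_DATA)` before `run_eval()` catches the most common mistake, a flat `{"query": ...}` record, before any evaluation run starts.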

Insert an explicit validation/checkpoint step into at least the regression and optimization workflows (e.g. "Confirm metrics improved before promoting the new prompt version") to close the feedback loop.
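One way such a checkpoint could be expressed in code is a small promotion gate that compares aggregate metrics from a baseline run and a candidate run. This is a sketch under assumptions: the function name `should_promote`, the metric names, and the rule that higher values are better are all hypothetical, and how the metric dictionaries are obtained from an evaluation run is left out.

```python
def should_promote(baseline: dict, candidate: dict, min_gain: float = 0.0):
    """Return (ok, regressions).

    ok is True only if no baseline metric got worse (or went missing) in the
    candidate and at least one metric improved by more than min_gain.
    Assumes every metric is "higher is better".
    """
    regressions = {
        name: (baseline[name], candidate.get(name))
        for name in baseline
        if candidate.get(name) is None or candidate[name] < baseline[name]
    }
    improved = any(
        candidate[name] - baseline[name] > min_gain
        for name in baseline
        if candidate.get(name) is not None
    )
    return (not regressions and improved), regressions


# Hypothetical aggregate scores from two evaluation runs.
baseline_metrics = {"correctness/mean": 0.70, "safety/mean": 1.0}
candidate_metrics = {"correctness/mean": 0.78, "safety/mean": 1.0}

ok, regressions = should_promote(baseline_metrics, candidate_metrics)
if not ok:
    raise SystemExit(f"Do not promote; regressions: {regressions}")
```

Returning the regressions alongside the verdict keeps the checkpoint explicit: the workflow either proceeds to promotion or stops with the specific metrics to debug.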

Tighten the mild redundancy between the per-workflow Reference Files columns and the standalone Reference Files Quick Lookup table so each reference's purpose is stated once.

Dimension	Reasoning	Score
Conciseness	The body is a lean navigation layer of tables and a bullet list of API facts; it explains no concepts Claude already knows (no "what is MLflow" preamble) and every section routes to a reference; not a 2 because there is no padded explanation, only mild cross-table reference repetition that still earns its place as a lookup aid.	3 / 3
Actionability	Routing is concrete (e.g. "patterns-evaluation.md (Pattern 1)") and "Critical API Facts" gives specific details like "Use: mlflow.genai.evaluate() (NOT mlflow.evaluate())" and "Data format: {"inputs": {"query": "..."}}", but the body contains no executable code blocks - the actual copy-paste code lives in the reference files; not a 3 because nothing in the body is copy-paste ready, and not a 1 because the API-facts and file+pattern routing are specific rather than vague.	2 / 3
Workflow Clarity	Eight workflows are each laid out as clearly numbered, sequenced step tables, but checkpoints are implicit (e.g. Workflow 4 ends at "Debug specific failures" with no validate-and-confirm step); not a 3 because explicit validation/feedback-loop steps are absent, and not a 1 because the sequences themselves are unambiguous.	2 / 3
Progressive Disclosure	The body is an overview that points to 11 one-level-deep reference files, all of which exist in ./references/, with a dedicated "Reference Files Quick Lookup" table signaling when to read each; not a 2 because content is appropriately split into references rather than inlined and navigation is explicit.	3 / 3
	Total	10 / 12 Passed

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is third-person, concise, and pairs a clear capability statement with an explicit, multi-trigger "Use when" clause. It lists concrete actions and domain-specific trigger terms that a developer would naturally invoke, with no vague fluff or over-claims.

Dimension	Reasoning	Score
Specificity	Lists many concrete actions - "writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign... running optimize_prompts() with GEPA" - matching the multiple-specific-actions anchor; not a 2 because coverage is comprehensive rather than partial.	3 / 3
Completeness	Explicitly states what ("MLflow 3 GenAI agent evaluation") and when ("Use when writing... creating... building... or running...") with explicit triggers, satisfying both halves; not a 2 because the when-clause is present and explicit rather than implied.	3 / 3
Trigger Term Quality	Covers the natural terms a developer in this niche would say - "evaluate", "scorer", "built-in scorers", "eval datasets from traces", "trace ingestion", "production monitoring", "MemAlign", "optimize_prompts", "GEPA" - giving good coverage; not a 2 because the spread of relevant terms is broad rather than missing common variations.	3 / 3
Distinctiveness Conflict Risk	Occupies a clear niche (MLflow 3 GenAI agent evaluation) with distinct, API-specific triggers unlikely to fire for unrelated skills; not a 2 because the skill is confined to GenAI evaluation rather than overlapping with general ML skills.	3 / 3
	Total	12 / 12 Passed

Validation

93%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 15 / 16 Passed

Validation for skill structure

Criteria	Description	Result
relative_links	Relative link issues: 5 suspicious	Warning
	Total	15 / 16 Passed
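The report does not say what makes a relative link "suspicious". As a rough local check, assuming the concern is relative Markdown links whose targets do not exist on disk, a sketch like the following lists them. The function name `broken_relative_links` and the link regex are hypothetical and do not reproduce the validator's actual rule.

```python
import re
from pathlib import Path

# Matches inline Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme such as https: or mailto:.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(markdown_text: str, base_dir: Path) -> list[str]:
    """Return relative link targets that do not resolve to a path under base_dir."""
    broken = []
    for target in LINK_RE.findall(markdown_text):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # skip absolute URLs and in-page anchors
        path_part = target.split("#", 1)[0]  # drop any trailing #fragment
        if not (base_dir / path_part).exists():
            broken.append(target)
    return broken
```

Run against the text of SKILL.md with `base_dir` set to the skill's directory, an empty result would mean every relative link resolves. It would not prove the warning is resolved, since the validator may flag links for other reasons.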

Repository: databricks-solutions/ai-dev-kit
Path: databricks-skills/databricks-mlflow-evaluation/SKILL.md
Commit: a7e1d51

Reviewed: about 5 hours ago
