
databricks-mlflow-evaluation

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.

77

Quality

71%

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

Security by Snyk

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./databricks-skills/databricks-mlflow-evaluation/SKILL.md

Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description: it clearly defines its scope (MLflow 3 GenAI agent evaluation), lists numerous concrete actions and API elements, and provides explicit trigger guidance via a comprehensive 'Use when...' clause. The description is concise yet thorough, with distinctive terminology that minimizes conflict risk with other skills.

Dimension | Reasoning | Score

Specificity

Lists multiple concrete actions: writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (with specific names), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign, and running optimize_prompts() with GEPA.

3 / 3

Completeness

Clearly answers both 'what' (MLflow 3 GenAI agent evaluation) and 'when' with an explicit 'Use when...' clause listing seven distinct trigger scenarios covering the full scope of the skill.

3 / 3

Trigger Term Quality

Excellent coverage of natural terms a user would say: 'mlflow.genai.evaluate()', '@scorer', specific scorer names like 'Guidelines', 'Correctness', 'Safety', 'RetrievalGroundedness', 'eval datasets', 'traces', 'production monitoring', 'MemAlign', 'optimize_prompts()', 'GEPA'. These are highly specific terms that users working with MLflow 3 GenAI evaluation would naturally use.

3 / 3

Distinctiveness / Conflict Risk

Highly distinctive with very specific triggers tied to MLflow 3's GenAI evaluation API, including named functions, decorators, and proprietary techniques (MemAlign, GEPA). Extremely unlikely to conflict with other skills.

3 / 3

Total: 12 / 12

Passed

Implementation

42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is essentially a well-organized navigation hub that routes users to external reference files, but it provides almost no actionable content on its own. The progressive disclosure and structural organization are excellent, with clear workflow sequences and a comprehensive reference lookup table. However, the near-complete absence of executable code, concrete examples, or inline guidance means Claude would need to read multiple external files before being able to write any evaluation code.

Suggestions

Add at least one minimal executable code example for the most common use case (e.g., a complete mlflow.genai.evaluate() call with a built-in scorer) so the skill provides immediate actionable value without requiring external file reads.
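A hedged sketch of what such a minimal example could look like. It assumes MLflow 3's documented GenAI API (mlflow.genai.evaluate() and the built-in Guidelines scorer); the dataset rows, the scorer name, and the guideline text are hypothetical, and actually running the evaluation requires MLflow installed plus an LLM judge configured, so the call is wrapped in a function rather than executed at import time.

```python
# Hypothetical quick-start evaluation sketch (assumes MLflow 3's
# mlflow.genai.evaluate() and Guidelines scorer; dataset and guideline
# text are made up for illustration).

eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracing?"},
        "outputs": "MLflow Tracing captures the steps an agent takes.",
    },
]

def run_quickstart_eval(data):
    # Imported lazily so the sketch can be read without MLflow installed;
    # running this requires an MLflow environment and an LLM judge.
    import mlflow
    from mlflow.genai.scorers import Guidelines

    return mlflow.genai.evaluate(
        data=data,
        scorers=[
            Guidelines(
                name="conciseness",
                guidelines="The answer must be one or two sentences.",
            ),
        ],
    )

print(len(eval_data))  # 1 dataset row defined above
```

Even a single row like this, placed before the workflow tables, would give an agent something executable before it reads any external reference file.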

Include a 'Quick Start' section with a copy-paste ready 3-5 line evaluation snippet before the workflow tables, similar to the rubric's good example pattern.

Add validation/verification checkpoints to workflows involving production operations (Workflows 2, 6) — e.g., 'Verify traces are being stored: run mlflow.search_traces()' or 'Confirm dataset row count before running evaluation'.
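Such a checkpoint could be sketched as follows. The mlflow.search_traces() call follows MLflow's documented API and is assumed to return a DataFrame-like collection of traces; the experiment ID and minimum-count threshold are hypothetical, and the pure row-count check is separated out so the gating logic is clear.

```python
# Sketch of a verification checkpoint for trace ingestion (assumes
# mlflow.search_traces() per MLflow docs; experiment ID and threshold
# are hypothetical).

def enough_rows(row_count: int, minimum: int) -> bool:
    """Pure checkpoint logic: were at least `minimum` traces collected?"""
    return row_count >= minimum

def verify_traces(experiment_id: str, minimum: int = 1) -> bool:
    # Imported lazily so the helper can be read without MLflow installed.
    import mlflow

    traces = mlflow.search_traces(experiment_ids=[experiment_id])
    return enough_rows(len(traces), minimum)

print(enough_rows(5, 1))  # True
print(enough_rows(0, 1))  # False
```

A workflow step would then read: run verify_traces(...) before proceeding, and stop with a diagnostic if it returns False.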

Consider condensing the eight workflow tables into a more compact format — perhaps a decision tree or shorter bullet lists — since the tables are mostly pointers to external files rather than substantive instructions.

Dimension | Reasoning | Score

Conciseness

The content is mostly efficient and well-structured with tables, but the eight detailed workflow tables with step-by-step references to external files add significant bulk. The 'Critical API Facts' section is lean and valuable, but the workflow tables could be more condensed since they primarily serve as navigation indexes rather than actionable content.

2 / 3

Actionability

The skill contains zero executable code, no concrete commands, and no copy-paste ready examples. Every workflow step delegates to an external reference file rather than providing any inline code or specific API calls. The 'Critical API Facts' section provides some concrete details but they are declarative facts, not executable guidance.

1 / 3
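To illustrate the gap this dimension flags, a copy-paste-ready custom scorer could be as small as the following sketch. The citation heuristic is hypothetical; the @scorer registration shown in the comment assumes MLflow 3's mlflow.genai.scorers API and is not executed here.

```python
# Hypothetical heuristic scorer logic: pass if the answer contains a
# bracketed citation marker like [1].
import re

def cites_source(outputs: str) -> bool:
    return bool(re.search(r"\[\d+\]", outputs))

# Assumption: in MLflow 3 this heuristic would be registered roughly as
#
#   from mlflow.genai.scorers import scorer
#
#   @scorer
#   def citation_check(outputs: str) -> bool:
#       return cites_source(outputs)
#
# and passed via mlflow.genai.evaluate(data=..., scorers=[citation_check]).

print(cites_source("Traces are stored in the experiment [1]."))  # True
print(cites_source("No citation here."))                          # False
```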

Workflow Clarity

The eight workflows are clearly sequenced with numbered steps and organized in tables, which is good. However, there are no validation checkpoints, no error recovery steps, and no feedback loops within any workflow. For operations involving evaluation and production monitoring, the absence of verification steps is notable.

2 / 3

Progressive Disclosure

The skill excels at progressive disclosure with a clear overview structure, well-signaled one-level-deep references to specific pattern files, a quick lookup table mapping references to purposes and when to read them, and logical organization from 'before writing code' through specific workflows to reference lookup.

3 / 3

Total: 8 / 12

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 passed

Validation for skill structure

No warnings or errors.

Repository: databricks-solutions/ai-dev-kit (Reviewed)
