MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.
Quality: 71%. Does it follow best practices?

Impact: Pending (no eval scenarios have been run).

Advisory: Suggest reviewing before use.
Optimize this skill with Tessl: `npx tessl skill review --optimize ./databricks-skills/databricks-mlflow-evaluation/SKILL.md`

## Quality
### Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that clearly defines its scope (MLflow 3 GenAI agent evaluation), lists numerous specific concrete actions and API elements, and provides explicit trigger guidance via a comprehensive 'Use when...' clause. The description is concise yet thorough, with highly distinctive terminology that minimizes conflict risk with other skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (with specific names), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign, and running optimize_prompts() with GEPA. | 3 / 3 |
| Completeness | Clearly answers both 'what' (MLflow 3 GenAI agent evaluation) and 'when' with an explicit 'Use when...' clause listing seven distinct trigger scenarios covering the full scope of the skill. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms a user would say: 'mlflow.genai.evaluate()', '@scorer', specific scorer names like 'Guidelines', 'Correctness', 'Safety', 'RetrievalGroundedness', 'eval datasets', 'traces', 'production monitoring', 'MemAlign', 'optimize_prompts()', 'GEPA'. These are highly specific terms that users working with MLflow 3 GenAI evaluation would naturally use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive with very specific triggers tied to MLflow 3's GenAI evaluation API, including named functions, decorators, and proprietary techniques (MemAlign, GEPA). Extremely unlikely to conflict with other skills. | 3 / 3 |
| **Total** | | **12 / 12 Passed** |
### Implementation: 42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is essentially a well-organized navigation hub that routes users to external reference files, but it provides almost no actionable content on its own. The progressive disclosure and structural organization are excellent, with clear workflow sequences and a comprehensive reference lookup table. However, the near-complete absence of executable code, concrete examples, or inline guidance means Claude would need to read multiple external files before being able to write any evaluation code.
#### Suggestions

- Add at least one minimal executable code example for the most common use case (e.g., a complete `mlflow.genai.evaluate()` call with a built-in scorer) so the skill provides immediate actionable value without requiring external file reads.
- Include a 'Quick Start' section with a copy-paste-ready 3–5 line evaluation snippet before the workflow tables, similar to the rubric's good-example pattern.
- Add validation/verification checkpoints to workflows involving production operations (Workflows 2 and 6) — e.g., 'Verify traces are being stored: run `mlflow.search_traces()`' or 'Confirm dataset row count before running evaluation'.
- Consider condensing the eight workflow tables into a more compact format — perhaps a decision tree or shorter bullet lists — since the tables are mostly pointers to external files rather than substantive instructions.
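The verification-checkpoint suggestion above can be sketched as a small pre-flight check run before evaluation. This is a hedged illustration, not part of any MLflow API: the helper name `preflight_check` and its error messages are invented here, and the row shape with an `inputs` key simply mirrors the kind of record an eval dataset would hold.

```python
def preflight_check(eval_rows, required_keys=("inputs",)):
    """Verify an eval dataset is non-empty and each row has the required keys.

    Returns the row count so callers can log it before invoking the
    evaluation; raises ValueError with a descriptive message otherwise.
    (Illustrative helper; not an MLflow function.)
    """
    if not eval_rows:
        raise ValueError("Eval dataset is empty; nothing to evaluate.")
    for i, row in enumerate(eval_rows):
        missing = [k for k in required_keys if k not in row]
        if missing:
            raise ValueError(f"Row {i} is missing required keys: {missing}")
    return len(eval_rows)
```

A workflow step could then read "run `preflight_check(rows)` and confirm the returned count matches expectations" before the evaluation call, giving the agent a concrete checkpoint instead of an unverified hand-off.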
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is mostly efficient and well-structured with tables, but the eight detailed workflow tables with step-by-step references to external files add significant bulk. The 'Critical API Facts' section is lean and valuable, but the workflow tables could be more condensed since they primarily serve as navigation indexes rather than actionable content. | 2 / 3 |
| Actionability | The skill contains zero executable code, no concrete commands, and no copy-paste ready examples. Every workflow step delegates to an external reference file rather than providing any inline code or specific API calls. The 'Critical API Facts' section provides some concrete details but they are declarative facts, not executable guidance. | 1 / 3 |
| Workflow Clarity | The eight workflows are clearly sequenced with numbered steps and organized in tables, which is good. However, there are no validation checkpoints, no error recovery steps, and no feedback loops within any workflow. For operations involving evaluation and production monitoring, the absence of verification steps is notable. | 2 / 3 |
| Progressive Disclosure | The skill excels at progressive disclosure with a clear overview structure, well-signaled one-level-deep references to specific pattern files, a quick lookup table mapping references to purposes and when to read them, and logical organization from 'before writing code' through specific workflows to reference lookup. | 3 / 3 |
| **Total** | | **8 / 12 Passed** |
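To make the Actionability critique concrete, here is the kind of inline custom-scorer logic the skill could carry. In MLflow 3 such a function would be wrapped with the `@scorer` decorator from `mlflow.genai.scorers`; to keep this sketch dependency-free the decorator is shown only in a comment, and the function name and the grounding heuristic are illustrative assumptions, not anything defined by MLflow or the reviewed skill.

```python
# In MLflow 3 this would typically be registered as a custom scorer:
#
#     from mlflow.genai.scorers import scorer
#
#     @scorer
#     def response_is_grounded(outputs, expectations): ...
#
# Shown undecorated here so the heuristic itself runs without mlflow.
def response_is_grounded(outputs: str, expectations: dict) -> bool:
    """Toy grounding heuristic: every expected fact substring must appear
    in the agent's output, case-insensitively."""
    text = outputs.lower()
    return all(fact.lower() in text for fact in expectations.get("facts", []))
```

Even a toy like this, inlined in the skill with a one-line usage note, would move the Actionability score past "declarative facts only" without requiring an external file read.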
### Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation of the skill structure produced no warnings or errors.