MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.
66
78% Does it follow best practices?
Impact: no eval scenarios have been run
Advisory: Suggest reviewing before use
Optimize this skill with Tessl:
npx tessl skill review --optimize ./databricks-skills/databricks-mlflow-evaluation/SKILL.md
Quality
Discovery: 100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that clearly defines its scope (MLflow 3 GenAI agent evaluation), lists specific, concrete actions and capabilities, and provides explicit trigger guidance through a comprehensive 'Use when...' clause. The description uses third-person voice appropriately and includes highly specific technical terms that serve as strong discriminators for skill selection.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (with specific names), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign, and running optimize_prompts() with GEPA. | 3 / 3 |
| Completeness | Clearly answers both 'what' (MLflow 3 GenAI agent evaluation) and 'when' with an explicit 'Use when...' clause listing seven distinct trigger scenarios covering the full scope of the skill. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms a user would say: 'mlflow.genai.evaluate()', '@scorer', 'Guidelines', 'Correctness', 'Safety', 'RetrievalGroundedness', 'eval datasets', 'traces', 'production monitoring', 'MemAlign', 'optimize_prompts()', 'GEPA'. These are highly specific technical terms that users working with MLflow 3 GenAI evaluation would naturally use. | 3 / 3 |
| Distinctiveness Conflict Risk | Highly distinctive with very specific triggers like 'mlflow.genai.evaluate()', '@scorer functions', 'MemAlign', 'GEPA', and named built-in scorers. This is unlikely to conflict with other skills due to the precise domain and tool-specific terminology. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation: 57%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This SKILL.md functions well as a routing/navigation document, directing users to the right reference files for their specific goal. Its main weakness is that it contains almost no executable code or concrete examples itself — it's entirely dependent on external files for actionability. The workflows, while well-organized, lack validation checkpoints and error recovery steps that would be important for operations like trace ingestion setup or prompt optimization.
Suggestions
Add at least one minimal executable code example (e.g., a basic mlflow.genai.evaluate() call) directly in the SKILL.md so it provides immediate actionable value without requiring file lookups.
Add validation/verification checkpoints to workflows involving production systems (Workflows 6, 7, 8) — e.g., 'Verify traces appear in UC table before proceeding' or 'Check alignment score before registering judge'.
Consider condensing the eight workflow tables into a more compact format — the table structure with three columns adds visual weight; a numbered list with inline references would be more token-efficient.
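As an illustration of the first suggestion, the following is a minimal sketch of the kind of self-contained example the SKILL.md could carry. It assumes MLflow 3, where `mlflow.genai.evaluate()` accepts `data`, `predict_fn`, and `scorers`, and where each record's `inputs` dict is passed to `predict_fn` as keyword arguments. The names `EVAL_DATA`, `predict_fn`, `exact_match`, and `run_evaluation` are hypothetical and do not come from the skill itself. The MLflow import is deferred into `run_evaluation()` so the helper functions can be checked without MLflow installed.

```python
# Illustrative evaluation records. Each record has "inputs" (passed to
# predict_fn as keyword arguments) and "expectations" (ground truth that
# scorers can read). The questions and answers are made up for this sketch.
EVAL_DATA = [
    {
        "inputs": {"question": "What is 2 + 2?"},
        "expectations": {"expected_response": "4"},
    },
    {
        "inputs": {"question": "What is the capital of France?"},
        "expectations": {"expected_response": "Paris"},
    },
]

# Canned answers standing in for a real agent or model call.
_CANNED_ANSWERS = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "Paris",
}


def predict_fn(question: str) -> str:
    """Hypothetical stand-in for the agent under test."""
    return _CANNED_ANSWERS.get(question, "I don't know")


def exact_match(outputs: str, expectations: dict) -> bool:
    """Scorer logic: True when the output equals the expected response."""
    return outputs.strip() == expectations["expected_response"]


def run_evaluation() -> None:
    """Run the evaluation. Requires MLflow 3 in the environment."""
    import mlflow
    from mlflow.genai.scorers import scorer

    results = mlflow.genai.evaluate(
        data=EVAL_DATA,
        predict_fn=predict_fn,
        # scorer(exact_match) is equivalent to decorating with @scorer.
        scorers=[scorer(exact_match)],
    )
    print(results.metrics)
```

Calling `run_evaluation()` in an environment with MLflow 3 installed should log an evaluation run; the built-in scorers named in the description (for example `Safety` or `Guidelines`) could replace the custom scorer, though they need a configured judge model.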
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is mostly efficient and avoids explaining basic concepts, but the eight workflow tables are quite verbose and repetitive in structure. The reference lookup table and workflow tables could be more compact, though each element does serve a purpose. | 2 / 3 |
| Actionability | The Critical API Facts section provides some concrete, actionable details (exact function names, data format, kwargs behavior), but the vast majority of the skill delegates all executable guidance to external reference files. There are no code examples or copy-paste-ready snippets in the SKILL.md itself. | 2 / 3 |
| Workflow Clarity | The eight workflows are clearly sequenced with numbered steps and reference files, which is good. However, there are no validation checkpoints, error recovery steps, or feedback loops in any workflow, which is particularly concerning for destructive/batch operations like production monitoring setup or prompt optimization. | 2 / 3 |
| Progressive Disclosure | The skill excels at progressive disclosure: it serves as a clear overview/routing document with well-signaled one-level-deep references to specific pattern files. The reference lookup table provides clear navigation, and content is appropriately split across files by concern. | 3 / 3 |
| Total | | 9 / 12 Passed |
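The validation checkpoints that the Workflow Clarity finding asks for could take the shape of a small polling gate placed between workflow steps. The sketch below is generic Python rather than anything taken from the skill: `require`, `CheckpointFailed`, and the fake check are hypothetical names, and a real check (for example, one that queries whether traces have reached a UC table) would have to be supplied by the workflow author.

```python
import time
from typing import Callable


class CheckpointFailed(RuntimeError):
    """Raised when a workflow checkpoint never passes."""


def require(
    check: Callable[[], bool],
    description: str,
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `check` until it returns True.

    Returns the 1-based attempt number on which the check passed, and
    raises CheckpointFailed if it never passes within `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # wait before the next poll, not after the last
    raise CheckpointFailed(
        f"{description}: not satisfied after {attempts} attempts"
    )


# Hypothetical check: pretend the traces show up on the third poll.
polls = iter([False, False, True])
passed_on = require(
    lambda: next(polls),
    "traces visible in UC table",
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(passed_on)  # prints 3
```

Raising an exception rather than returning `False` stops the workflow at the failed step, so later steps such as registering a judge or enabling monitoring do not run against an unverified state.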
Validation: 100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
93cb4e3