MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.
Impact: Pending. No eval scenarios have been run.
Advisory: Suggest reviewing before use.
Optimize this skill with Tessl: `npx tessl skill review --optimize ./databricks-skills/databricks-mlflow-evaluation/SKILL.md`

Quality
Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that clearly defines a specific niche (MLflow 3 GenAI agent evaluation), lists numerous concrete actions and API-level details, and includes an explicit 'Use when...' clause with rich trigger terms. The description is concise yet comprehensive, uses proper third-person voice, and would be easily distinguishable from other skills in a large skill library.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: writing `mlflow.genai.evaluate()` code, creating `@scorer` functions, using built-in scorers (with specific names), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign, and running `optimize_prompts()` with GEPA. | 3 / 3 |
| Completeness | Clearly answers both 'what' (MLflow 3 GenAI agent evaluation with specific capabilities listed) and 'when' (explicit 'Use when...' clause with multiple concrete trigger scenarios covering the full range of use cases). | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms a user would say: `mlflow.genai.evaluate()`, `@scorer`, `Guidelines`, `Correctness`, `Safety`, `RetrievalGroundedness`, 'eval datasets', 'traces', 'production monitoring', 'MemAlign', `optimize_prompts()`, 'GEPA'. These are highly specific terms that users working with MLflow 3 GenAI evaluation would naturally use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive with very specific triggers like `mlflow.genai.evaluate()`, `@scorer` functions, 'MemAlign', 'GEPA', and named built-in scorers. This is unlikely to conflict with any other skill due to the highly specialized MLflow 3 GenAI evaluation domain and specific API/function references. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation: 42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is essentially a well-organized navigation hub that routes users to external reference files, but it provides almost no actionable content on its own. The progressive disclosure and organization are excellent, but the complete absence of executable code examples or inline concrete guidance means a reader cannot accomplish anything without consulting multiple external files. The skill would benefit greatly from including at least minimal executable examples for the most common operations.
Suggestions
Add at least 2-3 minimal executable code examples inline (e.g., a basic mlflow.genai.evaluate() call, a simple @scorer function, and a dataset creation snippet) so the skill provides some actionable content without requiring external file reads.
Add validation/verification checkpoints to workflows, especially Workflow 1 and Workflow 5 (e.g., 'Verify scorer output format before running full evaluation', 'Check dataset schema matches expected format').
Include a 'Quick Start' section at the top with a single copy-paste-ready end-to-end example that demonstrates the most common use case (running a basic evaluation with a built-in scorer).
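The inline-example suggestion above could be satisfied with something as small as the following sketch. It is deliberately dependency-free: the record shape and the `(inputs, outputs, expectations)` scorer signature mirror the contract that `mlflow.genai`'s `@scorer` decorator wraps, but nothing here imports MLflow, and names like `contains_expected_facts` and `expected_facts` are illustrative, not MLflow API.

```python
def contains_expected_facts(inputs, outputs, expectations):
    """Toy scorer: pass only if every expected fact appears in the output."""
    facts = expectations.get("expected_facts", [])
    return all(fact.lower() in outputs.lower() for fact in facts)

# A tiny eval dataset in the record shape mlflow.genai.evaluate() consumes:
# each record pairs inputs with outputs and expectations.
eval_data = [
    {
        "inputs": {"question": "What port does MLflow serve on by default?"},
        "outputs": "MLflow's tracking server listens on port 5000 by default.",
        "expectations": {"expected_facts": ["5000"]},
    },
    {
        "inputs": {"question": "What decorator defines a custom scorer?"},
        "outputs": "Use the @scorer decorator from mlflow.genai.scorers.",
        "expectations": {"expected_facts": ["@scorer"]},
    },
]

# Apply the scorer to each record, as an evaluation harness would.
results = [
    contains_expected_facts(r["inputs"], r["outputs"], r["expectations"])
    for r in eval_data
]
print(results)  # → [True, True]
```

In the real skill the function body would sit under `@scorer` and be passed to `mlflow.genai.evaluate(data=..., scorers=[...])`; the point is only that a copy-paste-ready example this size is enough to make the skill actionable without external file reads.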
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is mostly efficient and well-structured with tables, but the eight detailed workflow tables with step-by-step references to external files add significant bulk. The 'Critical API Facts' section is lean and valuable, but the workflow tables could be more condensed since they primarily serve as navigation indexes rather than actionable content. | 2 / 3 |
| Actionability | The skill contains zero executable code, no concrete commands, and no copy-paste-ready examples. Every workflow step delegates to an external reference file rather than providing any inline code or specific API calls. The 'Critical API Facts' section provides some concrete details but no executable examples. | 1 / 3 |
| Workflow Clarity | The workflows are clearly sequenced with numbered steps and organized by goal, which is good. However, there are no validation checkpoints, no error recovery steps, and no feedback loops within any workflow. For operations involving evaluation and data manipulation, the absence of verification steps is notable. | 2 / 3 |
| Progressive Disclosure | The content excels at progressive disclosure with a clear overview structure, well-signaled one-level-deep references to specific pattern files, a quick lookup table, and organized workflows that point to exact sections within reference files. Navigation is excellent. | 3 / 3 |
| Total | | 8 / 12 Passed |
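The validation checkpoints whose absence the Workflow Clarity row flags could be as lightweight as a one-record smoke test run before a full evaluation. A minimal sketch, assuming nothing beyond the scorer contract (all names here are hypothetical, not MLflow API):

```python
def length_scorer(inputs, outputs, expectations):
    """Toy scorer returning a numeric score in [0, 1] based on output length."""
    return min(len(outputs) / 100.0, 1.0)

def smoke_test_scorer(scorer_fn, record):
    """Run the scorer on one record and verify it returns a bool or number
    before committing to a full evaluation run."""
    result = scorer_fn(record["inputs"], record["outputs"], record["expectations"])
    if not isinstance(result, (bool, int, float)):
        raise TypeError(
            f"Scorer returned {type(result).__name__}, expected bool or number"
        )
    return result

record = {
    "inputs": {"question": "ping"},
    "outputs": "pong " * 10,  # 50 characters, so the toy score is 0.5
    "expectations": {},
}
print(smoke_test_scorer(length_scorer, record))  # → 0.5
```

A checkpoint like this catches malformed scorer output on one record instead of surfacing it halfway through an evaluation over a full dataset.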
Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
11 / 11 checks passed. Validation for skill structure: no warnings or errors.
Commit: `02aac8c`