
databricks-mlflow-evaluation

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.

77

Quality

71%

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

Security by Snyk

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./databricks-skills/databricks-mlflow-evaluation/SKILL.md

Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description: it clearly defines its scope (MLflow 3 GenAI agent evaluation), lists numerous concrete actions and API elements, and provides explicit trigger guidance via a comprehensive 'Use when...' clause. The description is concise yet thorough, with distinctive terminology that minimizes conflict risk with other skills.

Dimension | Reasoning | Score

Specificity

Lists multiple concrete actions: writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (with specific names), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign, and running optimize_prompts() with GEPA.

3 / 3

Completeness

Clearly answers both 'what' (MLflow 3 GenAI agent evaluation) and 'when' with an explicit 'Use when...' clause listing seven distinct trigger scenarios covering the full scope of the skill.

3 / 3

Trigger Term Quality

Excellent coverage of natural terms a user would say: 'mlflow.genai.evaluate()', '@scorer', specific scorer names like 'Guidelines', 'Correctness', 'Safety', 'RetrievalGroundedness', 'eval datasets', 'traces', 'production monitoring', 'MemAlign', 'optimize_prompts()', 'GEPA'. These are highly specific terms that users working with MLflow 3 GenAI evaluation would naturally use.

3 / 3

Distinctiveness / Conflict Risk

Highly distinctive with very specific triggers tied to MLflow 3's GenAI evaluation API, including named functions, decorators, and proprietary techniques (MemAlign, GEPA). Extremely unlikely to conflict with other skills.

3 / 3

Total: 12 / 12

Passed

Implementation

42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is essentially a well-organized navigation hub that routes users to external reference files, but it provides almost no actionable content on its own. The progressive disclosure and structural organization are excellent, with clear workflow sequences and a comprehensive reference lookup table. However, the near-complete absence of executable code, concrete examples, or inline guidance means Claude would need to read multiple external files before being able to write any evaluation code.

Suggestions

Add at least one minimal executable code example for the most common use case (e.g., a complete mlflow.genai.evaluate() call with a built-in scorer) so the skill provides immediate actionable value without requiring external file reads.
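A hedged sketch of what such a minimal example could look like. It assumes MLflow 3's documented GenAI API (mlflow.genai.evaluate() and the built-in Guidelines scorer); the dataset rows, the scorer name, and the guideline text are hypothetical, and actually running the evaluation requires MLflow installed plus an LLM judge configured, so the call is wrapped in a function rather than executed at import time.

```python
# Hypothetical quick-start evaluation sketch (assumes MLflow 3's
# mlflow.genai.evaluate() and Guidelines scorer; dataset and guideline
# text are made up for illustration).

eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracing?"},
        "outputs": "MLflow Tracing captures the steps an agent takes.",
    },
]

def run_quickstart_eval(data):
    # Imported lazily so the sketch can be read without MLflow installed;
    # running this requires an MLflow environment and an LLM judge.
    import mlflow
    from mlflow.genai.scorers import Guidelines

    return mlflow.genai.evaluate(
        data=data,
        scorers=[
            Guidelines(
                name="conciseness",
                guidelines="The answer must be one or two sentences.",
            ),
        ],
    )

print(len(eval_data))  # 1 dataset row defined above
```

Even a single row like this, placed before the workflow tables, would give an agent something executable before it reads any external reference file.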

Include a 'Quick Start' section with a copy-paste ready 3-5 line evaluation snippet before the workflow tables, similar to the rubric's good example pattern.

Add validation/verification checkpoints to workflows involving production operations (Workflows 2, 6) — e.g., 'Verify traces are being stored: run mlflow.search_traces()' or 'Confirm dataset row count before running evaluation'.
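Such a checkpoint could be sketched as follows. The mlflow.search_traces() call follows MLflow's documented API and is assumed to return a DataFrame-like collection of traces; the experiment ID and minimum-count threshold are hypothetical, and the pure row-count check is separated out so the gating logic is clear.

```python
# Sketch of a verification checkpoint for trace ingestion (assumes
# mlflow.search_traces() per MLflow docs; experiment ID and threshold
# are hypothetical).

def enough_rows(row_count: int, minimum: int) -> bool:
    """Pure checkpoint logic: were at least `minimum` traces collected?"""
    return row_count >= minimum

def verify_traces(experiment_id: str, minimum: int = 1) -> bool:
    # Imported lazily so the helper can be read without MLflow installed.
    import mlflow

    traces = mlflow.search_traces(experiment_ids=[experiment_id])
    return enough_rows(len(traces), minimum)

print(enough_rows(5, 1))  # True
print(enough_rows(0, 1))  # False
```

A workflow step would then read: run verify_traces(...) before proceeding, and stop with a diagnostic if it returns False.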

Consider condensing the eight workflow tables into a more compact format — perhaps a decision tree or shorter bullet lists — since the tables are mostly pointers to external files rather than substantive instructions.

Dimension | Reasoning | Score

Conciseness

The content is mostly efficient and well-structured with tables, but the eight detailed workflow tables with step-by-step references to external files add significant bulk. The 'Critical API Facts' section is lean and valuable, but the workflow tables could be more condensed since they primarily serve as navigation indexes rather than actionable content.

2 / 3

Actionability

The skill contains zero executable code, no concrete commands, and no copy-paste ready examples. Every workflow step delegates to an external reference file rather than providing any inline code or specific API calls. The 'Critical API Facts' section provides some concrete details but they are declarative facts, not executable guidance.

1 / 3
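To illustrate the gap this dimension flags, a copy-paste-ready custom scorer could be as small as the following sketch. The citation heuristic is hypothetical; the @scorer registration shown in the comment assumes MLflow 3's mlflow.genai.scorers API and is not executed here.

```python
# Hypothetical heuristic scorer logic: pass if the answer contains a
# bracketed citation marker like [1].
import re

def cites_source(outputs: str) -> bool:
    return bool(re.search(r"\[\d+\]", outputs))

# Assumption: in MLflow 3 this heuristic would be registered roughly as
#
#   from mlflow.genai.scorers import scorer
#
#   @scorer
#   def citation_check(outputs: str) -> bool:
#       return cites_source(outputs)
#
# and passed via mlflow.genai.evaluate(data=..., scorers=[citation_check]).

print(cites_source("Traces are stored in the experiment [1]."))  # True
print(cites_source("No citation here."))                          # False
```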

Workflow Clarity

The eight workflows are clearly sequenced with numbered steps and organized in tables, which is good. However, there are no validation checkpoints, no error recovery steps, and no feedback loops within any workflow. For operations involving evaluation and production monitoring, the absence of verification steps is notable.

2 / 3

Progressive Disclosure

The skill excels at progressive disclosure with a clear overview structure, well-signaled one-level-deep references to specific pattern files, a quick lookup table mapping references to purposes and when to read them, and logical organization from 'before writing code' through specific workflows to reference lookup.

3 / 3

Total: 8 / 12

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 passed

Validation for skill structure

No warnings or errors.

Repository: databricks-solutions/ai-dev-kit (Reviewed)
