
databricks-mlflow-evaluation

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.
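For orientation, a minimal sketch of the kind of code this skill targets. It assumes the MLflow 3 mlflow.genai API shapes named in the description (mlflow.search_traces(), mlflow.genai.evaluate(), the Safety built-in scorer); the experiment ID is a placeholder.

```python
# Minimal sketch (assumptions noted above): build an eval set from recent
# production traces, then score it with a built-in scorer.
import mlflow
from mlflow.genai.scorers import Safety

traces = mlflow.search_traces(experiment_ids=["<experiment-id>"])  # placeholder ID
mlflow.genai.evaluate(data=traces, scorers=[Safety()])  # scores existing trace outputs
```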

Overall score: 77

Quality: 71% (Does it follow best practices?)
Impact: Pending (No eval scenarios have been run)
Security (by Snyk): Advisory (Suggest reviewing before use)

Optimize this skill with Tessl:
npx tessl skill review --optimize ./databricks-skills/databricks-mlflow-evaluation/SKILL.md

Quality

Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that clearly defines a specific niche (MLflow 3 GenAI agent evaluation), lists numerous concrete actions and API-level details, and includes an explicit 'Use when...' clause with rich trigger terms. The description is concise yet comprehensive, uses proper third-person voice, and would be easily distinguishable from other skills in a large skill library.

Dimension scores:

Specificity: 3 / 3
Lists multiple specific concrete actions: writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (with specific names), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign, and running optimize_prompts() with GEPA.

Completeness: 3 / 3
Clearly answers both 'what' (MLflow 3 GenAI agent evaluation with specific capabilities listed) and 'when' (explicit 'Use when...' clause with multiple concrete trigger scenarios covering the full range of use cases).

Trigger Term Quality: 3 / 3
Excellent coverage of natural terms a user would say: 'mlflow.genai.evaluate()', '@scorer', 'Guidelines', 'Correctness', 'Safety', 'RetrievalGroundedness', 'eval datasets', 'traces', 'production monitoring', 'MemAlign', 'optimize_prompts()', 'GEPA'. These are highly specific terms that users working with MLflow 3 GenAI evaluation would naturally use.

Distinctiveness / Conflict Risk: 3 / 3
Highly distinctive, with very specific triggers like 'mlflow.genai.evaluate()', '@scorer functions', 'MemAlign', 'GEPA', and named built-in scorers. This is unlikely to conflict with any other skill due to the highly specialized MLflow 3 GenAI evaluation domain and specific API/function references.

Total: 12 / 12 (Passed)

Implementation: 42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is essentially a well-organized navigation hub that routes users to external reference files, but it provides almost no actionable content on its own. The progressive disclosure and organization are excellent, but the complete absence of executable code examples or inline concrete guidance means a reader cannot accomplish anything without consulting multiple external files. The skill would benefit greatly from including at least minimal executable examples for the most common operations.

Suggestions

- Add at least 2-3 minimal executable code examples inline (e.g., a basic mlflow.genai.evaluate() call, a simple @scorer function, and a dataset creation snippet) so the skill provides some actionable content without requiring external file reads.
- Add validation/verification checkpoints to workflows, especially Workflow 1 and Workflow 5 (e.g., 'Verify scorer output format before running full evaluation', 'Check dataset schema matches expected format'); a sketch of such a checkpoint appears after the dimension table below.
- Include a 'Quick Start' section at the top with a single copy-paste-ready end-to-end example that demonstrates the most common use case (running a basic evaluation with a built-in scorer), along the lines of the sketch after this list.
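To make the first and third suggestions concrete, here is a minimal sketch of what such a Quick Start example might look like. It assumes the MLflow 3 mlflow.genai API shapes named in the skill description (mlflow.genai.evaluate() with the Safety and Guidelines built-in scorers); the app function and the inline dataset are hypothetical stand-ins.

```python
# A minimal sketch, not taken from the skill itself: run a basic evaluation
# with built-in scorers, assuming MLflow 3's mlflow.genai API.
import mlflow
from mlflow.genai.scorers import Guidelines, Safety

# Hypothetical app under evaluation; in practice this calls your agent.
def my_app(question: str) -> str:
    return f"You asked about: {question}"

# Tiny inline eval dataset; each record's 'inputs' are passed to predict_fn.
eval_data = [
    {"inputs": {"question": "How do I reset my password?"}},
    {"inputs": {"question": "What is your refund policy?"}},
]

# Built-in LLM-judge scorers; these require a configured judge model.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[
        Safety(),
        Guidelines(name="tone", guidelines="Responses must be polite and concise."),
    ],
)
# Per-record assessments and aggregate metrics land in the MLflow run/UI.
```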

Dimension scores:

Conciseness: 2 / 3
The content is mostly efficient and well-structured with tables, but the eight detailed workflow tables with step-by-step references to external files add significant bulk. The 'Critical API Facts' section is lean and valuable, but the workflow tables could be more condensed, since they primarily serve as navigation indexes rather than actionable content.

Actionability: 1 / 3
The skill contains zero executable code, no concrete commands, and no copy-paste-ready examples. Every workflow step delegates to an external reference file rather than providing any inline code or specific API calls. The 'Critical API Facts' section provides some concrete details but no executable examples.

Workflow Clarity: 2 / 3
The workflows are clearly sequenced with numbered steps and organized by goal, which is good. However, there are no validation checkpoints, no error recovery steps, and no feedback loops within any workflow. For operations involving evaluation and data manipulation, the absence of verification steps is notable.

Progressive Disclosure: 3 / 3
The content excels at progressive disclosure, with a clear overview structure, well-signaled one-level-deep references to specific pattern files, a quick lookup table, and organized workflows that point to exact sections within reference files. Navigation is excellent.

Total: 8 / 12 (Passed)
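As a concrete illustration of the missing validation checkpoints noted above (and in the second suggestion), here is a hypothetical sketch: smoke-test a custom scorer on a single record and check its output format before launching the full evaluation. It assumes the @scorer decorator from MLflow 3's mlflow.genai.scorers; the record schema is an assumption.

```python
# Hypothetical pre-flight checkpoint: verify a custom scorer's output format
# on one record before running the full evaluation.
from mlflow.genai.scorers import scorer

def exact_match_fn(outputs, expectations):
    # Pass/fail comparison against the expected response (assumed schema).
    return outputs == expectations["expected_response"]

# Checkpoint: exercise the plain function first, so nothing is assumed about
# whether the wrapped Scorer object is directly callable.
sample_outputs = "Use the reset link on the login page."
sample_expectations = {"expected_response": "Use the reset link on the login page."}
result = exact_match_fn(sample_outputs, sample_expectations)
assert isinstance(result, (bool, int, float, str)), f"Unexpected scorer output: {result!r}"

# Only after the smoke test, wrap it for use in mlflow.genai.evaluate(scorers=[...]).
exact_match = scorer(exact_match_fn)
```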

Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation checks: 11 / 11 passed

Validation for skill structure

No warnings or errors.

Repository: databricks-solutions/ai-dev-kit (Reviewed)
