databricks-mlflow-evaluation

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.

79

Quality

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Advisory

Suggest reviewing before use

SKILL.md

Quality

Content

100%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

A well-structured progressive-disclosure hub that keeps the body lean and delegates detail to eleven real reference files via explicit, navigable workflows and a quick-lookup table. Inline 'Critical API Facts' add concrete, copy-paste-ready guidance without bloating the overview.

Dimension | Reasoning | Score

Conciseness

The body is a lean navigation hub: it points to reference files instead of re-explaining concepts Claude already knows, and the dense 'Critical API Facts' section earns every token without padding.

3 / 3

Actionability

Concrete, specific guidance throughout: exact file names to read, exact APIs ('mlflow.genai.evaluate() NOT mlflow.evaluate()'), the exact data format ('{"inputs": {"query": "..."}}'), and the version requirement ('MLflow >= 3.5.0'). For an instruction-only overview skill, this is actionable rather than abstract.

3 / 3

Workflow Clarity

Eight workflows each present a clearly numbered step sequence with per-step reference files and goals; these are informational navigation workflows (not destructive/batch operations), so the absence of validation checkpoints does not cap the score.

3 / 3

Progressive Disclosure

A clear overview with well-signaled one-level-deep references: the 'Reference Files Quick Lookup' table and per-step citations map to real bundle files (all 11 referenced files verified present in ./references/), with no nested reference chains.

3 / 3

Total: 12 / 12

Passed
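
To illustrate the data format quoted in the Actionability row above, here is a minimal sketch of a shape check for evaluation records. It assumes only what the review quotes, namely that each record wraps its fields in an "inputs" dictionary. The helper name `is_valid_eval_record`, the non-empty requirement, and the sample queries are hypothetical and are not part of MLflow or of the reviewed skill.

```python
from typing import Any


def is_valid_eval_record(record: Any) -> bool:
    """Hypothetical helper: True if record has the shape {"inputs": {...}}.

    Assumption for this sketch: "inputs" must be a non-empty dict.
    """
    if not isinstance(record, dict):
        return False
    inputs = record.get("inputs")
    return isinstance(inputs, dict) and len(inputs) > 0


# Sample records in the quoted format; the query strings are made up.
eval_data = [
    {"inputs": {"query": "What is MLflow?"}},
    {"inputs": {"query": "How do I log a trace?"}},
]

# A record that omits the "inputs" wrapper fails the check.
bad_record = {"query": "missing the inputs wrapper"}

assert all(is_valid_eval_record(r) for r in eval_data)
assert not is_valid_eval_record(bad_record)
```

A check like this could run before the records are handed to an evaluation call, so that shape mistakes surface early rather than during evaluation.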

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

A strong, specific, third-person description that pairs a concrete capability list with an explicit 'Use when...' trigger clause covering the skill's full surface area. It avoids vagueness and over-claims while remaining concise.

Dimension | Reasoning | Score

Specificity

Lists multiple concrete actions ('writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign... or running optimize_prompts() with GEPA'), matching the anchor for several specific concrete actions.

3 / 3

Completeness

Clearly answers both 'what' ('MLflow 3 GenAI agent evaluation' plus the capability list) and 'when' (explicit 'Use when...' triggers), and uses third-person voice, so it is not capped at 2.

3 / 3

Trigger Term Quality

An explicit 'Use when...' clause enumerates natural domain terms a practitioner would actually say (mlflow.genai.evaluate(), @scorer, MemAlign, GEPA, optimize_prompts(), trace ingestion), giving good coverage rather than generic jargon.

3 / 3

Distinctiveness / Conflict Risk

The MLflow 3 GenAI evaluation niche with its specific API and scorer triggers is clearly distinguishable from sibling Databricks skills and unlikely to fire for the wrong skill.

3 / 3

Total: 12 / 12

Passed

Validation

93%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 15 / 16 Passed

Validation for skill structure

Criteria | Description | Result

relative_links

Relative link issues: 5 suspicious

Warning

Total: 15 / 16

Passed

Repository
databricks-solutions/ai-dev-kit
Reviewed
