Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
91%
Does it follow best practices?
Impact
92%
1.10x
Average score across 25 eval scenarios
Passed
No known issues
The ML platform team at Crescendo has finished the initial eval setup for their pull-request-reviewer tile and has received the first batch of results. Before jumping into improvements, the engineering lead wants a structured triage document that analyzes the results and recommends what to do next.
The team is time-constrained — they don't want to spend hours writing tile improvements only to discover the real problem is something structural. They need someone to read the results critically and flag any issues that should be addressed before diving into content edits.
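Produce a triage-report.md that analyzes the eval results, flags any structural issues that should be addressed before content edits, and recommends what to do next.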
The following files are provided as inputs. Extract them before beginning.
=============== FILE: inputs/eval-results.json ===============
{
  "tile": "pull-request-reviewer",
  "eval_run_id": "eval-run-3391",
  "scenarios": [
    {
      "name": "security-review",
      "description": "Review a pull request for security vulnerabilities and flag risky patterns",
      "baseline_score_pct": 84,
      "with_context_score_pct": 79,
      "delta": -5,
      "criteria": [
        { "name": "flags_injection_risks", "baseline": 9, "with_context": 7, "max": 10 },
        { "name": "checks_auth_bypass", "baseline": 8, "with_context": 6, "max": 10 },
        { "name": "identifies_hardcoded_secrets", "baseline": 9, "with_context": 9, "max": 10 },
        { "name": "structured_summary", "baseline": 10, "with_context": 10, "max": 10 },
        { "name": "severity_labels", "baseline": 10, "with_context": 9, "max": 10 },
        { "name": "references_cwe", "baseline": 7, "with_context": 5, "max": 10 },
        { "name": "pr_scope_respected", "baseline": 8, "with_context": 7, "max": 10 },
        { "name": "no_false_positives", "baseline": 7, "with_context": 6, "max": 10 }
      ]
    },
    {
      "name": "style-and-clarity",
      "description": "Review a PR for code style issues, naming clarity, and documentation completeness",
      "baseline_score_pct": 87,
      "with_context_score_pct": 91,
      "delta": 4,
      "criteria": [
        { "name": "naming_conventions", "baseline": 9, "with_context": 10, "max": 10 },
        { "name": "doc_completeness", "baseline": 8, "with_context": 9, "max": 10 },
        { "name": "inline_comment_quality", "baseline": 8, "with_context": 9, "max": 10 },
        { "name": "consistent_formatting", "baseline": 9, "with_context": 9, "max": 10 },
        { "name": "unused_imports_flagged", "baseline": 9, "with_context": 9, "max": 10 }
      ]
    },
    {
      "name": "performance-analysis",
      "description": "Identify performance bottlenecks, unnecessary allocations, and inefficient patterns in a PR",
      "baseline_score_pct": 82,
      "with_context_score_pct": 85,
      "delta": 3,
      "criteria": [
        { "name": "loop_complexity", "baseline": 9, "with_context": 9, "max": 10 },
        { "name": "memory_allocation", "baseline": 8, "with_context": 9, "max": 10 },
        { "name": "db_query_patterns", "baseline": 7, "with_context": 8, "max": 10 },
        { "name": "caching_opportunities", "baseline": 9, "with_context": 9, "max": 10 },
        { "name": "algorithm_choice", "baseline": 7, "with_context": 8, "max": 10 }
      ]
    }
  ]
}
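A first triage pass over results shaped like inputs/eval-results.json could look roughly like the sketch below. It is only an illustration: the `triage` function, the `HIGH_BASELINE_PCT` threshold of 80, and the output field names are assumptions made for this example and do not come from the tile or its tooling. The sample data is trimmed from the input file to two scenarios with two criteria each.

```python
from typing import Any

# Hypothetical threshold: a baseline at or above this value is treated as
# leaving little headroom for the tile to improve on. Tune to taste.
HIGH_BASELINE_PCT = 80


def triage(results: dict[str, Any],
           high_baseline_pct: int = HIGH_BASELINE_PCT) -> list[dict[str, Any]]:
    """Summarize each scenario: delta, regression flag, headroom flag,
    and the criteria whose score dropped when context was added."""
    rows = []
    for scenario in results["scenarios"]:
        baseline = scenario["baseline_score_pct"]
        with_context = scenario["with_context_score_pct"]
        regressed = [
            c["name"]
            for c in scenario["criteria"]
            if c["with_context"] < c["baseline"]
        ]
        rows.append({
            "name": scenario["name"],
            "delta": with_context - baseline,
            "regression": with_context < baseline,
            "low_headroom": baseline >= high_baseline_pct,
            "regressed_criteria": regressed,
        })
    return rows


# Trimmed sample copied from inputs/eval-results.json.
sample = {
    "tile": "pull-request-reviewer",
    "scenarios": [
        {
            "name": "security-review",
            "baseline_score_pct": 84,
            "with_context_score_pct": 79,
            "criteria": [
                {"name": "flags_injection_risks", "baseline": 9, "with_context": 7, "max": 10},
                {"name": "identifies_hardcoded_secrets", "baseline": 9, "with_context": 9, "max": 10},
            ],
        },
        {
            "name": "style-and-clarity",
            "baseline_score_pct": 87,
            "with_context_score_pct": 91,
            "criteria": [
                {"name": "naming_conventions", "baseline": 9, "with_context": 10, "max": 10},
                {"name": "consistent_formatting", "baseline": 9, "with_context": 9, "max": 10},
            ],
        },
    ],
}

for row in triage(sample):
    print(row)
```

Scenarios flagged with `regression` or `low_headroom` are candidates for the structural review the team asked for before any content edits; what counts as "low headroom" is a judgment call, so the threshold is a parameter rather than a fixed rule.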