Build robust, production-grade backtesting systems that avoid common pitfalls and produce reliable strategy performance estimates.
Optimize this skill with Tessl:

    npx tessl skill review --optimize ./skills/backtesting-frameworks/SKILL.md

Quality
Discovery — 32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description identifies a clear domain (backtesting) but relies on qualitative adjectives ('robust', 'production-grade', 'reliable') rather than listing concrete actions. It completely lacks a 'Use when...' clause, making it harder for Claude to know when to select this skill. The description would benefit significantly from specific actions and explicit trigger conditions.
Suggestions
Add a 'Use when...' clause with trigger terms like 'backtest', 'trading strategy', 'historical simulation', 'strategy evaluation', 'portfolio backtest'.
Replace vague qualifiers ('robust', 'production-grade') with specific concrete actions such as 'simulate trades against historical data, model slippage and transaction costs, calculate risk-adjusted returns, detect lookahead bias'.
Include common file types or tool references users might mention, such as 'OHLCV data', 'equity curves', 'Sharpe ratio', or 'drawdown analysis'.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (backtesting systems) and mentions some qualities (robust, production-grade, avoid pitfalls, reliable performance estimates), but doesn't list specific concrete actions like 'simulate trades', 'calculate Sharpe ratios', 'handle slippage modeling', etc. | 2 / 3 |
| Completeness | Describes what it does at a high level (build backtesting systems) but completely lacks a 'Use when...' clause or any explicit trigger guidance for when Claude should select this skill. Per the rubric, a missing 'Use when...' clause caps completeness at 2, and the 'what' is also somewhat vague, warranting a score of 1. | 1 / 3 |
| Trigger Term Quality | Includes 'backtesting', which is a strong trigger term, and 'strategy performance' is relevant. However, it misses common variations users might say, such as 'backtest', 'trading strategy', 'historical simulation', 'portfolio testing', 'quantitative finance', or 'strategy evaluation'. | 2 / 3 |
| Distinctiveness / Conflict Risk | 'Backtesting systems' is a fairly specific niche that wouldn't overlap with most skills, but the vague phrasing around 'production-grade systems' and 'strategy performance' could overlap with general software engineering or quantitative analysis skills. | 2 / 3 |
| **Total** | | **7 / 12 — Passed** |
Implementation — 50%

Reviews the quality of the instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill is concise and well-structured at a high level, but critically lacks actionability — the core instructions read like a table of contents rather than executable guidance. Without any concrete code examples, specific commands, or detailed steps in the main file, Claude would struggle to act on this skill without immediately needing the external resource. The workflow sequence is present but lacks validation checkpoints important for a complex multi-step process like backtesting.
Suggestions
Add at least one concrete, executable code example in the main SKILL.md (e.g., a minimal backtest loop skeleton with realistic cost modeling) so the skill is actionable without requiring the external resource.
Make the workflow steps more specific with explicit validation checkpoints — e.g., 'Verify no future data leakage by checking that all features use only data available at signal time' and 'Validate results by comparing in-sample vs out-of-sample Sharpe ratios.'
Expand the instructions section to include specific patterns for common pitfalls (look-ahead bias detection, survivorship bias handling) with concrete checks rather than abstract bullet points.
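To make the first suggestion concrete, a minimal backtest loop skeleton with cost and slippage modeling might look like the sketch below. This is a hypothetical illustration in Python/NumPy, not code from the skill itself; the one-bar signal shift is exactly the kind of look-ahead guard the suggestions call for, and the basis-point cost figures are placeholder assumptions.

```python
import numpy as np

def backtest(prices, signals, cost_bps=10.0, slippage_bps=5.0):
    """Minimal vectorized backtest sketch (illustrative only).

    prices:  1-D array of close prices
    signals: 1-D array of target positions in {-1, 0, 1}, computed
             only from data available at each bar's close
    """
    prices = np.asarray(prices, dtype=float)
    signals = np.asarray(signals, dtype=float)
    assert prices.shape == signals.shape

    # Lag signals by one bar: a position decided at bar t's close can
    # only earn the return from t to t+1. Skipping this shift is the
    # classic look-ahead bug.
    positions = np.roll(signals, 1)
    positions[0] = 0.0

    returns = np.diff(prices) / prices[:-1]
    gross = positions[1:] * returns

    # Charge transaction costs plus slippage on every position change.
    turnover = np.abs(np.diff(positions))
    costs = turnover * (cost_bps + slippage_bps) / 1e4

    net = gross - costs
    equity = (1.0 + net).cumprod()
    return net, equity
```

Even a skeleton this small gives Claude something executable to adapt, rather than an abstract bullet like 'implement event-driven simulation'.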
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is lean and efficient. It avoids explaining what backtesting is or how trading works, assumes Claude's competence, and every section serves a clear purpose without padding. | 3 / 3 |
| Actionability | The instructions are entirely abstract and vague — 'Build point-in-time data pipelines,' 'Implement event-driven simulation' — with no concrete code, commands, specific examples, or executable guidance. Everything actionable is deferred to an external resource file. | 1 / 3 |
| Workflow Clarity | There is a rough sequence implied (define hypothesis → build pipelines → implement simulation → use splits), but steps lack specificity, there are no validation checkpoints, and no feedback loops for error recovery in what is inherently a multi-step, error-prone process. | 2 / 3 |
| Progressive Disclosure | There is a reference to an external resource file for detailed patterns, which is good structure. However, the SKILL.md itself provides almost no substantive quick-start content — it's essentially just a pointer with bullet-point abstractions, making the overview too thin to be useful on its own. | 2 / 3 |
| **Total** | | **8 / 12 — Passed** |
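The validation checkpoint suggested earlier — comparing in-sample vs out-of-sample Sharpe ratios — could be sketched as a small helper like the one below. This is a hypothetical example; the 70/30 split and the 50% degradation threshold are illustrative assumptions, not standard values.

```python
import numpy as np

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series (risk-free rate of 0)."""
    r = np.asarray(returns, dtype=float)
    sd = r.std(ddof=1)
    if sd == 0:
        return 0.0
    return np.sqrt(periods_per_year) * r.mean() / sd

def oos_degradation_check(returns, split=0.7, max_ratio_drop=0.5):
    """Flag strategies whose out-of-sample Sharpe collapses relative
    to in-sample — a common symptom of overfitting. Thresholds are
    illustrative placeholders."""
    r = np.asarray(returns, dtype=float)
    k = int(len(r) * split)
    s_in, s_out = sharpe(r[:k]), sharpe(r[k:])
    degraded = s_in > 0 and s_out < s_in * max_ratio_drop
    return s_in, s_out, degraded
```

Embedding a check like this directly in the workflow would give the skill the explicit validation checkpoint the review found missing.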
Validation — 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| **Total** | | **10 / 11 — Passed** |