Comprehensive developer toolkit providing reusable skills for Java/Spring Boot, TypeScript/NestJS/React/Next.js, Python, PHP, AWS CloudFormation, AI/RAG, DevOps, and more.
The KPI Evaluation system provides objective, quantitative quality metrics for task implementations. It removes evaluator bias by using pre-calculated data instead of subjective assessments.
```
┌─────────────────────────────────────────────────────────────┐
│ Step 1: HOOK (automatic)                                    │
│ Trigger: PostToolUse on TASK-*.md files                     │
│ Script: hooks/task-kpi-analyzer.py                          │
│ Output: tasks/TASK-XXX--kpi.json                            │
├─────────────────────────────────────────────────────────────┤
│ Step 2: EVALUATOR AGENT                                     │
│ Input: tasks/TASK-XXX--kpi.json                             │
│ Action: Data-driven pass/fail decision                      │
│ Output: tasks/TASK-XXX--evaluation.md                       │
└─────────────────────────────────────────────────────────────┘
```

You don't run KPI analysis manually. The hook fires automatically every time a task file is saved, and the Evaluator Agent reads the results during task review.
Spec Compliance (weight 30%) measures how well the implementation matches the specification.
| Metric | Calculation | Max Score |
|---|---|---|
| Acceptance Criteria Met | (checked / total) × 10 | 10 |
| Requirements Coverage | Count of REQ-IDs covered by implementation | 10 |
| No Scope Creep | (implemented_files / expected_files) × 10 | 10 |
Example:
Task: Implement JWT token service
Acceptance Criteria: 4 defined, 4 checked → score: 10.0
Requirements: REQ-05, REQ-06 covered → score: 10.0
Scope: 3 files expected, 3 implemented → score: 10.0
Weighted: 10.0 × 0.30 = 3.0

Code Quality (weight 25%) measures implementation quality against language-specific standards.
| Metric | Calculation | Max Score |
|---|---|---|
| Static Analysis | Tool results (ESLint, Checkstyle, etc.) | 10 |
| Complexity | Function length analysis | 10 |
| Patterns Alignment | Knowledge Graph pattern matching | 10 |
Example:
Static Analysis: 0 errors, 2 warnings → score: 9.0
Complexity: All functions <20 lines → score: 10.0
Patterns: Follows project conventions → score: 9.5
Weighted: 9.5 × 0.25 = 2.375

Test Coverage (weight 25%) measures testing completeness.
| Metric | Calculation | Max Score |
|---|---|---|
| Unit Tests | min(10, test_files × 5) | 10 |
| Test/Code Ratio | (test_count / code_count) × 10 | 10 |
| Coverage Percentage | Coverage report / 10 | 10 |
Example:
Unit Tests: 2 test files for 1 source file → score: 10.0
Ratio: 120 test LOC / 95 code LOC = 1.26 → score: 10.0
Coverage: 87% → score: 8.7
Weighted: 9.57 × 0.25 = 2.39

Contract Fulfillment (weight 20%) measures whether task contracts (provides/expects) are satisfied.
| Metric | Calculation | Max Score |
|---|---|---|
| Provides Verified | Symbols from provides found in code | 10 |
| Expects Satisfied | Dependencies from expects exist | 10 |
Example:
Provides: [JwtTokenService, TokenPair] — both found → score: 10.0
Expects: [UserDetails, SecretKey] — both exist → score: 10.0
Weighted: 10.0 × 0.20 = 2.0

The overall score is the weighted sum of the four category scores:

Overall = (Spec Compliance × 0.30) +
          (Code Quality × 0.25) +
          (Test Coverage × 0.25) +
          (Contract Fulfillment × 0.20)

| Category | Raw Score | Weight | Weighted |
|---|---|---|---|
| Spec Compliance | 10.0 | 30% | 3.00 |
| Code Quality | 9.5 | 25% | 2.38 |
| Test Coverage | 9.57 | 25% | 2.39 |
| Contract Fulfillment | 10.0 | 20% | 2.00 |
| **Overall** | | | **9.77/10** |
| Decision | Condition |
|---|---|
| APPROVE | score ≥ threshold AND critical_issues == 0 |
| CONDITIONAL APPROVE | score ≥ threshold - 0.5 AND critical_issues == 0 |
| REQUEST FIXES | score < threshold OR critical_issues > 0 |
The default threshold is 7.5/10, configurable per project type.
| Project Type | Threshold | Rationale |
|---|---|---|
| Production MVP | 8.0 | High quality required for production |
| Internal Tool | 7.0 | Good enough for internal use |
| Prototype | 6.0 | Functional over perfect |
| Critical System | 8.5 | No compromises (payments, security, medical) |
| Score Range | Level | Action |
|---|---|---|
| 9.0 - 10.0 | Exceptional | Approve, document best practices |
| 8.0 - 8.9 | Good | Approve with minor notes |
| 7.0 - 7.9 | Acceptable | Approve (if meets threshold) |
| 6.0 - 6.9 | Below Standard | Request specific improvements |
| < 6.0 | Poor | Significant rework required |
Auto-generated at `tasks/TASK-XXX--kpi.json`:

```json
{
  "task_id": "TASK-002",
  "spec_id": "001-user-auth",
  "evaluated_at": "2026-04-10T14:30:00Z",
  "overall_score": 8.2,
  "passed_threshold": true,
  "threshold": 7.5,
  "kpi_scores": [
    {
      "category": "spec_compliance",
      "weight": 0.30,
      "score": 9.0,
      "weighted_score": 2.70,
      "metrics": {
        "acceptance_criteria_met": 10.0,
        "requirements_coverage": 8.0,
        "no_scope_creep": 9.0
      }
    },
    {
      "category": "code_quality",
      "weight": 0.25,
      "score": 8.0,
      "weighted_score": 2.00,
      "metrics": {
        "static_analysis": 8.5,
        "complexity": 8.0,
        "patterns_alignment": 7.5
      }
    },
    {
      "category": "test_coverage",
      "weight": 0.25,
      "score": 7.5,
      "weighted_score": 1.875,
      "metrics": {
        "unit_tests": 8.0,
        "test_code_ratio": 7.0,
        "coverage_percentage": 7.5
      }
    },
    {
      "category": "contract_fulfillment",
      "weight": 0.20,
      "score": 8.1,
      "weighted_score": 1.62,
      "metrics": {
        "provides_verified": 10.0,
        "expects_satisfied": 6.2
      }
    }
  ],
  "critical_issues": [],
  "recommendations": [
    "Add more integration tests for expects contracts",
    "Consider reducing function complexity in JwtTokenService.validateToken()"
  ],
  "summary": "Score: 8.2/10 - PASSED"
}
```

The Evaluator Agent (evaluator-agent) is a specialized subagent that reads KPI files and makes data-driven decisions.
"Don't trust your gut — trust the data."
The Evaluator Agent counters LLM leniency bias by grounding decisions in quantitative metrics.
The Evaluator can only lower scores, never raise them, and only with documented justification.

Output at `tasks/TASK-XXX--evaluation.md`:
```markdown
# Task Evaluation: TASK-002

## Decision: APPROVED

## Scores

| Category | Score | Weight | Weighted |
|----------|-------|--------|----------|
| Spec Compliance | 9.0 | 30% | 2.70 |
| Code Quality | 8.0 | 25% | 2.00 |
| Test Coverage | 7.5 | 25% | 1.88 |
| Contract Fulfillment | 8.1 | 20% | 1.62 |
| **Overall** | | | **8.2/10** |

## Critical Issues

None

## Recommendations

1. Add integration tests for expects contracts (score: 6.2/10)
2. Reduce complexity in JwtTokenService.validateToken()

## Evidence

- KPI data source: TASK-002--kpi.json (evaluated 2026-04-10T14:30:00Z)
- Code review: passed with minor notes
- Spec compliance: 4/4 acceptance criteria verified
```

The Ralph Loop uses KPI evaluation to decide whether to retry failed tasks:
```
implementation → review → (read KPI score)
  ├─ score ≥ threshold → cleanup → sync
  └─ score < threshold → fix → implementation (retry)
       └─ max 3 retries
```

```bash
# View KPI scores for a task
cat docs/specs/001-user-auth/tasks/TASK-002--kpi.json | python3 -m json.tool

# Quick summary
cat docs/specs/001-user-auth/tasks/TASK-002--kpi.json | python3 -c "
import json, sys
data = json.load(sys.stdin)
print(f\"Score: {data['overall_score']}/10\")
print(f\"Status: {'PASSED' if data['passed_threshold'] else 'FAILED'}\")
print(f\"Threshold: {data['threshold']}\")
for kpi in data['kpi_scores']:
    print(f\"  {kpi['category']}: {kpi['score']}/10 (weight: {kpi['weight']})\")
"
```

Repository structure:

```
docs
plugins
  developer-kit-ai
  developer-kit-aws
    agents
    docs
    skills
      aws
        aws-cli-beast
        aws-cost-optimization
        aws-drawio-architecture-diagrams
        aws-sam-bootstrap
      aws-cloudformation
      aws-cloudformation-auto-scaling
      aws-cloudformation-bedrock
      aws-cloudformation-cloudfront
      aws-cloudformation-cloudwatch
      aws-cloudformation-dynamodb
      aws-cloudformation-ec2
      aws-cloudformation-ecs
      aws-cloudformation-elasticache
        references
      aws-cloudformation-iam
        references
      aws-cloudformation-lambda
      aws-cloudformation-rds
      aws-cloudformation-s3
      aws-cloudformation-security
      aws-cloudformation-task-ecs-deploy-gh
      aws-cloudformation-vpc
        references
  developer-kit-core
    agents
    commands
    skills
  developer-kit-devops
  developer-kit-java
    agents
    commands
    docs
    skills
      aws-lambda-java-integration
      aws-rds-spring-boot-integration
      aws-sdk-java-v2-bedrock
      aws-sdk-java-v2-core
      aws-sdk-java-v2-dynamodb
      aws-sdk-java-v2-kms
      aws-sdk-java-v2-lambda
      aws-sdk-java-v2-messaging
      aws-sdk-java-v2-rds
      aws-sdk-java-v2-s3
      aws-sdk-java-v2-secrets-manager
      clean-architecture
      graalvm-native-image
      langchain4j-ai-services-patterns
        references
      langchain4j-mcp-server-patterns
        references
      langchain4j-rag-implementation-patterns
        references
      langchain4j-spring-boot-integration
      langchain4j-testing-strategies
      langchain4j-tool-function-calling-patterns
      langchain4j-vector-stores-configuration
        references
      qdrant
        references
      spring-ai-mcp-server-patterns
      spring-boot-actuator
      spring-boot-cache
      spring-boot-crud-patterns
      spring-boot-dependency-injection
      spring-boot-event-driven-patterns
      spring-boot-openapi-documentation
      spring-boot-project-creator
      spring-boot-resilience4j
      spring-boot-rest-api-standards
      spring-boot-saga-pattern
      spring-boot-security-jwt
        assets
        references
        scripts
      spring-boot-test-patterns
      spring-data-jpa
        references
      spring-data-neo4j
        references
      unit-test-application-events
      unit-test-bean-validation
      unit-test-boundary-conditions
      unit-test-caching
      unit-test-config-properties
        references
      unit-test-controller-layer
      unit-test-exception-handler
        references
      unit-test-json-serialization
      unit-test-mapper-converter
        references
      unit-test-parameterized
      unit-test-scheduled-async
        references
      unit-test-service-layer
        references
      unit-test-utility-methods
      unit-test-wiremock-rest-api
        references
  developer-kit-php
  developer-kit-project-management
  developer-kit-python
  developer-kit-specs
    commands
    docs
    hooks
      test-templates
      tests
    skills
  developer-kit-tools
  developer-kit-typescript
    agents
    docs
    hooks
    rules
    skills
      aws-cdk
      aws-lambda-typescript-integration
      better-auth
      clean-architecture
      drizzle-orm-patterns
      dynamodb-toolbox-patterns
        references
      nestjs
      nestjs-best-practices
      nestjs-code-review
      nestjs-drizzle-crud-generator
      nextjs-app-router
      nextjs-authentication
      nextjs-code-review
      nextjs-data-fetching
      nextjs-deployment
      nextjs-performance
      nx-monorepo
      react-code-review
      react-patterns
      shadcn-ui
      tailwind-css-patterns
      tailwind-design-system
        references
      turborepo-monorepo
      typescript-docs
      typescript-security-review
      zod-validation-utilities
        references
  github-spec-kit
```