This skill should be used when the user asks to "start an LLM project", "design batch pipeline", "evaluate task-model fit", "structure agent project", or mentions pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches.
This skill covers the principles for identifying tasks suited to LLM processing, designing effective project architectures, and iterating rapidly using agent-assisted development. The methodology applies whether building a batch processing pipeline, a multi-agent research system, or an interactive agent application.
Activate this skill when the user asks to start an LLM project, design a batch pipeline, evaluate task-model fit, or structure an agent project, or when the conversation involves pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches.
Evaluate task-model fit before writing any code, because building automation on a fundamentally mismatched task wastes days of effort. Run every proposed task through the two tables below to reach a proceed-or-stop decision.
Proceed when the task has these characteristics:
| Characteristic | Rationale |
|---|---|
| Synthesis across sources | LLMs combine information from multiple inputs better than rule-based alternatives |
| Subjective judgment with rubrics | Grading, evaluation, and classification with criteria map naturally to language reasoning |
| Natural language output | When the goal is human-readable text, LLMs deliver it natively |
| Error tolerance | Individual failures do not break the overall system, so LLM non-determinism is acceptable |
| Batch processing | No conversational state required between items, which keeps context clean |
| Domain knowledge in training | The model already has relevant context, reducing prompt engineering overhead |
Stop when the task has these characteristics:
| Characteristic | Rationale |
|---|---|
| Precise computation | Math, counting, and exact algorithms are unreliable in language models |
| Real-time requirements | LLM latency is too high for sub-second responses |
| Perfect accuracy requirements | Hallucination risk makes 100% accuracy impossible |
| Proprietary data dependence | The model lacks necessary context and cannot acquire it from prompts alone |
| Sequential dependencies | Each step depends heavily on the previous result, compounding errors |
| Deterministic output requirements | Same input must produce identical output, which LLMs cannot guarantee |
Always validate task-model fit with a manual test before investing in automation. Copy one representative input into the model interface, evaluate the output quality, and use the result to decide whether to proceed.
Do this because a failed manual prototype predicts a failed automated system, while a successful one provides both a quality baseline and a prompt-design template. The test takes minutes and prevents hours of wasted development.
Structure LLM projects as staged pipelines, because separating deterministic from non-deterministic stages enables fast iteration and cost control. Design each stage to be independently runnable, idempotent, and debuggable from its on-disk outputs.
Use this canonical pipeline structure:
```
acquire -> prepare -> process -> parse -> render
```

Stages 1, 2, 4, and 5 are deterministic. Stage 3 is non-deterministic and expensive. Maintain this separation because it allows re-running the expensive LLM stage only when necessary, while iterating quickly on parsing and rendering.
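As an illustrative sketch of this separation (stage names follow the pipeline above; the function bodies, including the stubbed LLM call, are assumptions, not a prescribed implementation):

```python
# Hypothetical staged pipeline: only process() would call the LLM.
def acquire(item_id: str) -> dict:
    # Deterministic: fetch raw data (stubbed here).
    return {"id": item_id, "text": f"raw text for {item_id}"}

def prepare(raw: dict) -> str:
    # Deterministic: build the prompt from raw data.
    return f"Analyze the following item:\n\n{raw['text']}"

def process(prompt: str) -> str:
    # Non-deterministic and expensive: the LLM call (stubbed here).
    return f"RESPONSE to: {prompt[:30]}..."

def parse(response: str) -> dict:
    # Deterministic: extract structure from the raw response.
    return {"summary": response}

def render(parsed: dict) -> str:
    # Deterministic: produce final human-readable output.
    return f"<p>{parsed['summary']}</p>"

def run(item_id: str) -> str:
    return render(parse(process(prepare(acquire(item_id)))))
```

Because only `process` is expensive, the other four functions can be re-run freely while iterating on parsing and rendering.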
Use the file system to track pipeline state rather than databases or in-memory structures, because file existence provides natural idempotency and human-readable debugging.
```
data/{id}/
  raw.json       # acquire stage complete
  prompt.md      # prepare stage complete
  response.md    # process stage complete
  parsed.json    # parse stage complete
```

Check whether an item needs processing by testing whether its output file exists. Re-run a stage by deleting its output file and all downstream files. Debug by reading the intermediate files directly. This pattern works because each directory is independent, enabling simple parallelization and trivial caching.
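A minimal Python sketch of this file-existence pattern (the helper names are hypothetical):

```python
from pathlib import Path
from typing import Optional

# Output files in pipeline order; a file's existence marks its stage done.
STAGES = ["raw.json", "prompt.md", "response.md", "parsed.json"]

def next_stage(item_dir: Path) -> Optional[str]:
    """Return the first stage whose output file is missing, or None if done."""
    for name in STAGES:
        if not (item_dir / name).exists():
            return name
    return None

def invalidate_from(item_dir: Path, stage: str) -> None:
    """Re-run a stage by deleting its output file and all downstream files."""
    for name in STAGES[STAGES.index(stage):]:
        (item_dir / name).unlink(missing_ok=True)
```

Resuming after a crash then reduces to calling `next_stage` on each item directory and running only what is missing.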
Design prompts for structured, parseable outputs, because prompt design directly determines parsing reliability. In every structured prompt, delimit the input clearly, specify the exact output format, and state how the model should handle missing or undeterminable information.
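For illustration only (the section names and wording below are assumptions, not a prescribed format), a prompt combining these elements might be built as:

```python
# Hypothetical prompt builder: the delimiters, output sections, and
# fallback instruction are illustrative choices.
def build_prompt(document: str) -> str:
    return (
        "Analyze the document between the <doc> tags.\n\n"
        f"<doc>\n{document}\n</doc>\n\n"
        "Respond using exactly these markdown sections:\n"
        "## Summary\n## Grade\n## Rationale\n\n"
        "If a section cannot be determined, write 'UNKNOWN' under it."
    )
```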
Build parsers that handle LLM output variations gracefully, because LLMs do not follow instructions perfectly. Use regex patterns flexible enough for minor formatting variations, provide sensible defaults when sections are missing, and log parsing failures for review rather than crashing.
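A sketch of such a tolerant parser in Python, assuming a markdown-section response format (the section names are illustrative):

```python
import re
import logging

logger = logging.getLogger(__name__)

def parse_response(text: str, sections=("summary", "grade", "rationale")) -> dict:
    """Extract named markdown sections, defaulting instead of crashing."""
    result = {}
    for name in sections:
        # Tolerate "## Grade", "##grade (A-F)", extra whitespace, etc.
        match = re.search(
            rf"^##\s*{name}\b.*?\n(.*?)(?=^##\s|\Z)",
            text,
            re.IGNORECASE | re.MULTILINE | re.DOTALL,
        )
        if match:
            result[name] = match.group(1).strip()
        else:
            # Log for later review rather than raising.
            logger.warning("missing section %r; using default", name)
            result[name] = "UNKNOWN"
    return result
```

The flexible heading regex and per-section defaults mean one malformed response degrades a single field instead of aborting the batch.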
Use agent-capable models to accelerate development through rapid iteration: describe the project goal and constraints, let the agent generate initial implementation, test and iterate on specific failures, then refine prompts and architecture based on results.
Keep agent output focused and high-quality by working in small, reviewable increments and anchoring each iteration to a concrete failing case.
Estimate LLM processing costs before starting, because token costs compound quickly at scale and late discovery of budget overruns forces costly rework. Use this formula:
```
total_cost = (items x tokens_per_item x price_per_token) + API_overhead
```

For batch processing, estimate input tokens per item (prompt + context), estimate output tokens per item (typical response length), multiply by item count, and add a 20-30% buffer for retries and failures.
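The estimate can be sketched in Python; the token counts and per-million-token prices in the example are placeholder assumptions, not real pricing:

```python
def estimate_cost(items: int, in_tokens: int, out_tokens: int,
                  in_price_per_mtok: float, out_price_per_mtok: float,
                  buffer: float = 0.25) -> float:
    """Estimated total USD cost for a batch run, with a retry buffer."""
    base = items * (in_tokens * in_price_per_mtok +
                    out_tokens * out_price_per_mtok) / 1_000_000
    return base * (1 + buffer)  # 20-30% buffer for retries and failures

# e.g. 930 items, 8k input / 1k output tokens, at $3 / $15 per Mtok:
cost = estimate_cost(930, 8_000, 1_000, 3.0, 15.0)
```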
Track actual costs during development. If costs exceed estimates significantly, reduce context length through truncation, use smaller models for simpler items, cache and reuse partial results, or add parallel processing to reduce wall-clock time.
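One way to track actual spend against the up-front estimate (a hypothetical sketch; the prices passed in are placeholders):

```python
class CostTracker:
    """Accumulate per-item token spend and flag budget overruns."""

    def __init__(self, estimate_usd: float):
        self.estimate_usd = estimate_usd
        self.spent_usd = 0.0
        self.items = 0

    def record(self, in_tokens: int, out_tokens: int,
               in_price_per_mtok: float, out_price_per_mtok: float) -> None:
        self.spent_usd += (in_tokens * in_price_per_mtok +
                           out_tokens * out_price_per_mtok) / 1_000_000
        self.items += 1

    def over_budget(self, threshold: float = 1.2) -> bool:
        # Flag when actual spend exceeds the estimate by 20%.
        return self.spent_usd > self.estimate_usd * threshold
```

Checking `over_budget()` every few hundred items surfaces overruns early enough to switch to truncation, smaller models, or caching.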
Default to single-agent pipelines for batch processing with independent items, because they are simpler to manage, cheaper to run, and easier to debug. Escalate to multi-agent architectures only when the task genuinely demands it.
Choose multi-agent for context isolation, not role anthropomorphization. Sub-agents get fresh context windows for focused subtasks, which prevents context degradation on long-running tasks.
See multi-agent-patterns skill for detailed architecture guidance.
Start with minimal architecture and add complexity only when production evidence proves it necessary, because over-engineered scaffolding often constrains rather than enables model performance.
Vercel's d0 agent achieved 100% success rate (up from 80%) by reducing from 17 specialized tools to 2 primitives: bash command execution and SQL. The file system agent pattern uses standard Unix utilities (grep, cat, find, ls) instead of custom exploration tools.
Reduce tooling when specialized tools underperform general primitives; add complexity only when production evidence shows the minimal design failing.
See tool-design skill for detailed tool architecture guidance.
Plan for multiple architectural iterations from the start, because production agent systems at scale always require refactoring. Manus has refactored its agent framework five times since launch. The Bitter Lesson suggests that structures added to work around current model limitations become constraints as models improve.
Build for change: keep stages loosely coupled, make prompts and tools easy to swap, and treat the current architecture as provisional.
Follow this template in order, because each step validates assumptions before the next step invests effort.
1. Task Analysis
2. Manual Validation
3. Architecture Selection
4. Cost Estimation
5. Development Plan
Example 1: Batch Analysis Pipeline (Karpathy's HN Time Capsule)
Task: Analyze 930 HN discussions from 10 years ago with hindsight grading.
Architecture: a single-agent batch pipeline following the acquire -> prepare -> process -> parse -> render structure.
Results: $58 total cost, ~1 hour execution, static HTML output.
Example 2: Architectural Reduction (Vercel d0)
Task: Text-to-SQL agent for internal analytics.
Before: 17 specialized tools, 80% success rate, 274s average execution.
After: 2 tools (bash + SQL), 100% success rate, 77s average execution.
Key insight: The semantic layer was already good documentation. Claude just needed access to read files directly.
See Case Studies for detailed analysis.
This skill connects to the multi-agent-patterns and tool-design skills in this collection, as well as the case studies referenced above.
Created: 2025-12-25 | Last Updated: 2026-03-17 | Author: Agent Skills for Context Engineering Contributors | Version: 1.1.0