

Yaniv Aknin
Founding engineer, Tessl

Building an AI Agent in 100 Lines of Code

with Yaniv Aknin


Chapters

Introduction [00:01:01]
Research on AI Agents and Context [00:04:06]
Deep Dive into System Prompts and Tools [00:05:38]
Defensive Tasks and Authorization [00:21:41]
Tool Usage in Language Models [00:24:26]
Planning and Agentic Harness [00:32:43]

In this episode

In this episode of AI Native Dev, host Simon Maple and guest Yaniv Aknin explore the balance between built-in system contexts and developer-added instructions in coding agents. Yaniv demonstrates how a simple 100-line "nano agent" can effectively generate code, highlighting the importance of minimal system prompts and well-chosen tools. The discussion sheds light on how developers can optimise agent performance by designing complementary contexts and leveraging benchmarks alongside real-world scenarios.

In this episode of AI Native Dev, host Simon Maple welcomes Yaniv Aknin, a software engineer at Tessl, to unpack a deceptively simple question with big implications: how much of an agent’s power comes from its built-in system context and tools, and how much comes from the context we add as developers? Yaniv walks through hands-on research—from a 100-line “nano agent” to observations about flagship agents like Claude Code and Gemini—to show how system prompts, tool descriptions, and evaluation benchmarks shape agent performance.

From Zero to Agent: The Nano-Agent Baseline

Yaniv starts with a live, minimal example: a working coding agent implemented in under 100 lines of Python. Built with Simon Willison’s LLM Python library, the agent uses a tiny system prompt (about 250 bytes) that essentially says “you’re a coding agent—do well,” and exposes a few tools: execute (shell commands), read_file, and write_file. The core is a simple run loop that submits the conversation (system prompt + user goal) to an LLM, processes tool calls returned by the model, executes them, appends results, and repeats until the model says it’s done or the step budget is reached.
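To make that loop concrete, here is a minimal sketch of the structure Yaniv describes. It uses the raw Anthropic SDK rather than the LLM Python library from the episode (whose tool-calling calls are not quoted in the conversation), so the exact client calls, schemas, and model alias are assumptions; the tiny system prompt, the execute/read_file/write_file tools, and the step budget mirror the description above.

```python
# A hedged sketch of the nano-agent run loop, written against the Anthropic SDK.
# Intended to run inside a container, as in the episode; the schemas and model
# alias are illustrative assumptions rather than the episode's actual code.
import subprocess

import anthropic

SYSTEM = "You are a coding agent. Use your tools to accomplish the user's goal, then stop."

TOOLS = [
    {"name": "execute",
     "description": "Run a shell command in the project root.",
     "input_schema": {"type": "object",
                      "properties": {"command": {"type": "string"}},
                      "required": ["command"]}},
    {"name": "read_file",
     "description": "Read a text file and return its contents.",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
    {"name": "write_file",
     "description": "Write text to a file, creating it if needed.",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"},
                                     "contents": {"type": "string"}},
                      "required": ["path", "contents"]}},
]

def run_tool(name: str, args: dict) -> str:
    """Execute one tool call and return its result as text for the model."""
    if name == "execute":
        proc = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
        return proc.stdout + proc.stderr
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    if name == "write_file":
        with open(args["path"], "w") as f:
            f.write(args["contents"])
        return "ok"
    return f"unknown tool: {name}"

def run_agent(goal: str, max_steps: int = 20) -> None:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):  # the step budget
        response = client.messages.create(
            model="claude-sonnet-4-5",  # assumed alias; pin whichever model you actually run
            max_tokens=4096,
            system=SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            break  # no more tool calls: the model considers the goal done
        tool_results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": run_tool(block.name, block.input)}
            for block in response.content if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})

if __name__ == "__main__":
    run_agent("Build a minimal to-do web app with a test, then run the test.")
```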

Despite its simplicity, this nano agent reliably creates usable artifacts (like a functional to-do app) with a modern model (Yaniv ran it with Claude Sonnet 4.5). Running the agent inside a container gives it a predictable environment and a safe place to execute commands. The code structure is intentionally lean: a short system prompt, a minimal tool surface, and a straightforward loop. In practice, that’s enough to bootstrap code generation, execute tests, and iterate toward a goal in a few tool-invocation turns.

For developers, the lesson is profound: you don’t need a complex framework to get started. A baseline agent can be a simple message loop with 2–3 well-chosen tools. Start there to understand how your model reasons about goals and tools before layering on fancy orchestration, multi-agent handoffs, or retrieval pipelines. Keep the system prompt minimal so your added task context gets more attention in the token budget.

What Benchmarks Reveal—and Miss

Yaniv anchors the discussion in TerminalBench, a respected agent evaluation used internally at Tessl. A mini agent from the SWE-bench team—Mini SWE Agent—places 15th on TerminalBench with a similarly small codebase (roughly 100 lines for the loop and a small amount more for utilities). It’s not top of the leaderboard, but it’s in a very credible cohort, often close to much heavier-weight agents.

This suggests two things. First, baseline agents with little context and a few tools can be surprisingly capable on structured coding tasks. Second, benchmark success doesn’t fully capture what richer system prompts, bespoke tools, and domain-specific context buy you in real-world scenarios. Benchmarks are essential—but they’re an imperfect proxy for production work where repos are messy, dependencies are unclear, and the “definition of done” is nuanced.

Developers should use benchmarks for regression checks and model comparisons, then validate on in-the-wild tasks. A good approach is dual-track evaluation: keep a TerminalBench (or similar) run for quantitative signal and pair it with a realistic scenario suite (e.g., “add feature X to repo Y with tests Z”) that exercises your actual workflow, dependencies, and runtime. Instrument tool calls and outcomes so you can see where the model stalls, loops, or misuses tools.
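One way to set up the scenario half of that dual track is sketched below; the Scenario fields, repo paths, check commands, and the run_agent entry point are placeholders rather than Tessl's actual suite.

```python
# A hedged sketch of a realistic-scenario suite to pair with a TerminalBench run.
# Scenario names, repos, goals, and check commands are illustrative placeholders.
import subprocess
import time
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    repo: str   # path to a checked-out copy of the target repo
    goal: str   # the prompt handed to the agent
    check: str  # shell command that exits 0 when the task is genuinely done

SCENARIOS = [
    Scenario("add-endpoint", "repos/api-service",
             "Add a /healthz endpoint and a test for it.", "make test"),
    Scenario("fix-flaky-test", "repos/worker",
             "Make test_retry deterministic.", "pytest -q tests/test_retry.py"),
]

def run_suite(run_agent) -> None:
    """run_agent(goal, cwd=...) is whatever agent entry point you are evaluating;
    here it is assumed to return a dict of counters such as tool calls made."""
    for s in SCENARIOS:
        start = time.time()
        stats = run_agent(s.goal, cwd=s.repo)
        passed = subprocess.run(s.check, shell=True, cwd=s.repo).returncode == 0
        print(f"{s.name}: passed={passed} "
              f"tool_calls={stats.get('tool_calls', '?')} "
              f"seconds={time.time() - start:.1f}")
```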

Inside the Black Box: System Context and Tooling

The episode’s central investigation is: what system context do flagship coding agents include by default, and how does that interact with the context you add? Without claiming insider knowledge, Yaniv observes that agents like Claude Code or Gemini send sizeable system contexts up front: high-level behavior (“you are a helpful coding assistant”), persona and safety instructions, and detailed tool descriptions including schemas and usage guidelines. Those tool descriptions often specify when to use a tool, what the parameters mean, and constraints or safeguards.

Crucially, all of this context—system prompt, tool manifests, your custom instructions, and the user’s latest request—arrives as a single input to the LLM. The model doesn’t “know” which parts came from you versus the platform; it just predicts the next token conditioned on everything. Order, phrasing, and relative length matter. If the built-in system prompt is long and prescriptive, your domain-specific guidance might be diluted unless you keep it concise and structured.

For developers, this has two practical implications: your custom context must coexist with strong built-in priors, and your tool design must be easy for the LLM to reason about amid many other available tools. Think clearly named tools, concise one-line descriptions followed by an explicit “use when…” directive, and schemas that reduce ambiguity. The simpler and more discriminative your tool interface, the more consistently the model will call it.

Designing Context That Plays Nicely with Built-ins

Yaniv frames context in two layers: the system context (shipper-provided defaults inside the agent) and the task/domain context you add (repo details, objectives, run commands). Because everything merges at inference time, your goal is not to override the system but to complement it. Keep instructions narrowly scoped to the task: define the goal, constraints, and the environment’s “rules of the road” (e.g., “run tests with make test; code lives in src/; prefer FastAPI; follow PEP8”).

Tool descriptions benefit from being explicit about intent and side effects. For example: “execute: run a shell command in the project root. Use to install dependencies, run tests, or scaffold. Side effects: changes filesystem and environment.” Pair this with read_file and write_file tools that clarify default paths, allowed file sizes, and expected encodings. The model is better at planning when tools declare both capabilities and boundaries.
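In the tool-manifest form most chat APIs accept, a description along those lines might look like the sketch below; the wording and schema are illustrative, not quoted from the episode.

```python
# A sketch of an execute tool whose description states intent, "use when" guidance,
# and side effects; the exact wording is an illustrative assumption.
EXECUTE_TOOL = {
    "name": "execute",
    "description": (
        "Run a shell command in the project root. "
        "Use when you need to install dependencies, run tests, or scaffold files. "
        "Side effects: changes the filesystem and environment."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "The shell command to run."}
        },
        "required": ["command"],
    },
}
```

Applying the same pattern to read_file and write_file (default paths, size limits, encodings) keeps the whole manifest easy for the model to discriminate between.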

Also, constrain the step budget and encourage summarization. A short system prompt like “Operate in as few tool calls as possible. Write small, verifiable changes. After each tool result, summarise the new state” can reduce thrashing and make logs easier to inspect. Finally, avoid burying key details in long prose. Use concise bullet points and explicit labels (Goal, Constraints, Commands, Repo Layout) so the model can “pattern match” the structure and retrieve the right facts at the right time.
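Spelled out as a template, that structured task context might look like the sketch below; the commands, paths, and framework choices are placeholders for your own repo's rules of the road.

```python
# A hedged sketch of a structured task context with explicit labels.
# Everything under the labels is a placeholder for your own project's conventions.
TASK_CONTEXT = """\
Goal: add a /healthz endpoint that returns {"status": "ok"}, with a test.
Constraints:
- Operate in as few tool calls as possible; make small, verifiable changes.
- After each tool result, summarise the new state in one or two sentences.
Commands:
- Run tests with `make test`; lint with `make lint`.
Repo Layout:
- Application code lives in src/; tests live in tests/; prefer FastAPI and follow PEP8.
"""
```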

A Practical Evaluation Loop for Teams

Yaniv’s workflow at Tessl is pragmatic: start with a minimal, inspectable agent, then iteratively add context and observe deltas. Begin with the nano baseline (short system prompt, execute/read/write tools, containerised runtime) and run a small suite of tasks. Add one change at a time—e.g., expand tool descriptions, include repo layout hints, or add a “test-first” directive—and measure completion rates, tool call counts, and time to solution.

Use TerminalBench (or similar) for repeatable checks, but pair it with an internal scenario bank that mirrors your customers’ realities. Log every model message, tool call, and return value so you can replay failure cases. Track where the model hesitates: missing context (e.g., how to run tests), tool confusion (e.g., wrong working directory), or environment issues (e.g., missing dependencies). Each failure class suggests a targeted context or tooling fix.
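A minimal sketch of that instrumentation, assuming one JSONL trace file per run (the record fields and paths are illustrative):

```python
# Append every model message, tool call, and tool result to a JSONL trace so
# failure cases can be replayed later. The record fields are an assumed schema.
import json
import time

def log_event(trace_path: str, kind: str, payload: dict) -> None:
    record = {"ts": time.time(), "kind": kind, **payload}
    with open(trace_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example calls from inside the run loop:
# log_event("runs/todo-app.jsonl", "tool_call", {"name": "execute", "args": {"command": "make test"}})
# log_event("runs/todo-app.jsonl", "tool_result", {"name": "execute", "output_tail": "2 passed"})
```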

Finally, sandbox execution. Yaniv runs the agent inside a container so execute is powerful but contained. In production, adopt even tighter controls: non-root users, network restrictions, resource limits, and an allowlist of commands. Consider adding a “dry-run” option or a plan-then-execute pattern for risky operations. Human-in-the-loop checkpoints can also be valuable during initial rollouts, especially when agents touch customer repos.
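One way to express the allowlist and dry-run ideas is sketched below; the allowed prefixes and the policy itself are illustrative choices, not recommendations from the episode.

```python
# A hedged sketch of a command allowlist and dry-run guard in front of the execute tool.
import shlex

ALLOWED_PREFIXES = ("pytest", "make test", "pip install", "ls", "cat")  # illustrative policy

def guarded_execute(command: str, dry_run: bool = False) -> str:
    first_word = shlex.split(command)[0] if command.strip() else ""
    if not command.strip().startswith(ALLOWED_PREFIXES):
        return f"refused: '{first_word}' is not on the allowlist"
    if dry_run:
        return f"dry-run: would execute '{command}'"
    # Delegate to the real execute tool here: a subprocess inside the container,
    # running as a non-root user with network and resource limits applied.
    raise NotImplementedError("wire this to your sandboxed executor")
```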

Key Takeaways

  • Start small: a 100-line agent with a tiny system prompt and 2–3 tools (execute, read_file, write_file) is enough to ship working results, especially with a modern LLM like Claude Sonnet 4.5.
  • Benchmarks are helpful but incomplete: minimal agents like Mini SWE Agent rank respectably on TerminalBench. Use benchmarks plus a curated set of real tasks from your environment to measure what actually matters.
  • System context matters—and it mixes with yours: flagship agents include long prompts and detailed tool descriptions. Keep your added context concise, structured, and complementary so it doesn’t get drowned out.
  • Design tools for discriminability: clear names, short “use when…” guidance, explicit schemas, and side-effect notes help the model choose the right tool at the right time.
  • Constrain and guide the loop: set a step budget, encourage small changes and summaries, and log everything. This reduces thrash and makes debugging tractable.
  • Sandbox execution: run agents in containers with limited privileges and explicit allowlists. Add human checkpoints for sensitive operations.
  • Iterate with intent: add one context or tooling change at a time and measure its impact on success rate, tool calls, and runtime. Treat agent design as an engineering feedback loop, not a one-shot prompt.

As Yaniv underscores, none of this depends on insider knowledge of proprietary agents. It’s about understanding how the model consumes context and designing your prompts, tools, and evaluations so the agent—and your developers—can do their best work.