How Small Can an Agent Model Get? The Nemotron Floor
Explore the Nemotron Floor to find the smallest viable agent model for your needs. Discover which model clears the capability threshold for tasks.

Most model comparisons ask which model is best. This one starts with a model that never even produced a single result.
We tested NVIDIA's open-weight Nemotron family, from the 30B Nano to the 120B Super, on a benchmark of real-world coding tasks: the kind of models an indie developer on a tight budget, or an enterprise cutting inference cost and keeping data in-house, would run.
The main finding is that model size is not a dial you turn for a little more quality; it is a threshold. Below a certain capability floor a model cannot drive an agent loop at all, which is why the smallest variant we tried, Nano 12B, produced nothing to score.
Above the floor, the question stops being which model is cheapest and becomes which one clears the bar your work actually needs: Nano 30B is an extremely cheap workhorse for narrow, well-scoped jobs, while Super 120B is the size that holds up on demanding multi-step agent work.
An agent size floor is the minimum model capacity below which a model cannot reliably complete the act-observe-decide loop an agent depends on. Below it you don't get a slower or sloppier agent. You get a non-agent: a model that reads the task, takes a few steps, and never converges. For anyone choosing a model, this changes the question from "which is cheaper" to "which clears the floor for my work", and that is the question to answer first.
Where the numbers come from
Every scenario in the evaluation is a real-world agent task tied to a published skill, scored on two axes: instruction-following (does the agent do what it was told, in the way it was told) and task-completion (does it reach the goal). The overall score weights instruction-following at 4 and task-completion at 3, then divides by 7. Each task runs with and without the skill, so the lift from the skill is visible directly. The tasks and skills are public, in the task-evals-for-skills dataset, so you can inspect any scenario yourself.
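The weighting can be written out as a small function. This is a minimal sketch of the arithmetic described above, not the evaluation's actual scoring code; the function name `overall_score` is our own, and the sample inputs are the Super 120B baseline figures from the results table in this article.

```python
def overall_score(instruction_following: float, task_completion: float) -> float:
    """Weighted mean: instruction-following counts 4, task-completion counts 3."""
    return (4 * instruction_following + 3 * task_completion) / 7


# Super 120B baseline figures as reported in the results table.
super_baseline = overall_score(instruction_following=31.3, task_completion=68.4)
print(round(super_baseline, 1))
```

Because instruction-following carries the heavier weight, a model that finishes tasks but ignores how it was told to do them is pulled down more than its completion rate alone would suggest.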
This design is deliberate. The tasks are derived from published skills, so they mirror the work teams write skills for, not contrived benchmark puzzles. That changes what a low score means. For a model that can do the work, the gap that remains is instruction-following: doing the job the way it was asked. For a model that cannot reach the goal even on ordinary work, the problem is more fundamental than guidance.
Both models were served the same way, OpenHands on Bedrock, and graded by the same judges, which leaves close to a thousand paired scenarios for each model. Every comparison below is apples-to-apples within NVIDIA, with no cross-harness confound and no provider pricing to reconcile. Cost is solve-only dollars per task, taken from each run's measured token usage. Neither model triggered a single rubric-gaming flag.
Two sizes, two different walls
Here are the headline results, baseline → with-skill.
| Model | Goal completion | Instruction following | Overall | $/task (solve-only) | Near-zero solves (overall < 25) |
|---|---|---|---|---|---|
| Super 120B | 68.4 → 69.3 | 31.3 → 49.2 | 47.2 → 57.8 | 0.083 | 19% → 22% |
| Nano 30B | 46.6 → 51.3 | 19.0 → 26.0 | 30.8 → 36.8 | 0.040 | 43% → 38% |
The two sizes hit their limits for different reasons. Super 120B can mostly finish. Its goal completion sits near 69, and the skill barely moves it, adding only 0.9 points. What it struggles with is doing the task the prescribed way: the skill adds 17.9 points of instruction-following. Super has the capability, and is helped by the guidance the skill provides.
Nano 30B, the smaller model, has the opposite problem. Reliable completion is where it wavers. Goal completion is 46.6, and 43% of its baseline attempts come back near-zero. It is close enough to the floor that the loop itself is the bottleneck, not the formatting of the answer.
There is a pattern hiding in those averages, and it matters as much as the averages do. With these agents you rarely get a mediocre run. You mostly get a near-finished result or a near-total miss. With the skill, Super scores 75 or above on 40% of tasks and misses badly on 22%. Nano flips that shape: it reaches 75 or above on only 11% of tasks and misses badly on 38%. Scale does not make the agent gently better. It changes which of the two outcomes you get most of the time. This is why the average is only a rough guide to any single run: averaging "mostly great" with "mostly broken" yields a number that few individual runs actually land on.
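One way to see this shape in your own results is to report the share of runs at each extreme next to the mean. The sketch below is illustrative only: the function name `outcome_shape` is hypothetical, the scores are made up and are not data from the evaluation, and the cut-offs reuse the article's thresholds (75 or above for a strong result, below 25 for a near-zero solve).

```python
from statistics import mean


def outcome_shape(scores, high=75, low=25):
    """Summarise a score list by its mean and by the share of runs at each extreme."""
    n = len(scores)
    return {
        "mean": mean(scores),
        "share_high": sum(s >= high for s in scores) / n,  # strong results
        "share_low": sum(s < low for s in scores) / n,     # near-zero solves
    }


# Made-up scores for illustration only, NOT taken from the evaluation.
illustrative_scores = [100, 95, 90, 85, 10, 5, 0, 0, 60, 55]
shape = outcome_shape(illustrative_scores)
print(shape)
```

In this made-up sample the mean is 50, yet no run scores 50, and eight of the ten runs sit at one extreme or the other.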
It also means Nano is not uniformly weak. On well-scoped tasks, like calling a documented API or following a focused doc-retrieval skill, it clears the usable bar often enough to be worth a look. Its trouble is concentrated in the longer, multi-step work.
Where scale helps, and where skills help
Scale and skills are not competing answers to the same question. They do different jobs, and the eval shows where each one pays off. The familiar story, one we have told ourselves, is that a relevant skill can let a cheaper model catch a pricier one. That holds, with one condition: the model has to be capable enough to act on the skill in the first place.
Start with the job scale does. Going from 30B to 120B, a 4x jump in parameters, buys 16.4 overall points at baseline. That is scale carrying a model over the floor, to where it can complete the task at all. Adding a skill to Nano 30B buys 6.0 points, but it still sits below Super with no skill at all (47.2). Below the floor, there was not yet enough capability for a skill to build on.
A skill is a multiplier, and on a model above the floor the multiplier can be large. The same skill lifts Super by 17.9 points on instruction following, while barely touching its goal completion (up 0.9). This reflects where Super had room to grow. It could already finish most tasks, so the skill's gain showed up in instruction-following, not completion. A skill can help a model finish too; Super simply had little completion headroom left. The two are a sequence, not a contest. Get a model over the floor, then a skill delivers outsized returns.
The effect is sharpest skill by skill, and it shows how much a skill can do for a capable model. A Brave Search location skill adds 76 points of instruction-following for Super. A Neon auth skill adds 68. On Nano those same skills add 1 point and nothing, because there is no capability yet for the guidance to land on. Match a skill to a model that can act on it and the payoff is substantial.
Single tasks tell the same story. On the stripe_ai_upgrade-stripe scenario, the skill takes Super from a complete miss to a perfect 100, while the same skill on the same task leaves Nano at 0. The skill is doing the work in the first case and has nothing to build on in the second. Across the set there are 163 tasks where Super clears the usable bar and Nano comes back near-zero, the kind of gap a skill alone will not close.
The same pattern emerges in effort. Nano 30B takes more turns than the larger model (29.9 with skill, against Super's 24.5) for about two-thirds of the score (36.8 overall against 57.8). Its turns split into two habits: when it fails outright it gives up fast, in around ten turns, and when it engages it grinds for thirty or more to reach a middling result. Below the floor the extra guidance adds turns and cost (turns rise from 24.7 to 29.9, cost up 25%) without a matching gain, because the model cannot act on it efficiently yet. Above the floor, a model puts a skill's instructions to work; below it, capability has to catch up first.
When the cheaper model is not necessarily the better value
Here is where the intuition most teams carry into a self-hosting decision breaks. Nano costs about half what Super does per task, $0.040 against $0.083, so the natural conclusion is that Nano is the better value and Super is the option you reach for only when you must.
The per-task price leaves out one thing: failures. With the skill, Nano comes back with a near-zero result on 38% of tasks, against Super's 22%. Every one of those is a retry, and retries cost money the per-task price never shows. Count them and the model that looked cheaper per task can end up costing more for each result you can actually use.
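A rough way to put a number on this is to divide the per-task price by the share of attempts that clear your bar. The sketch below rests on assumptions that are ours, not the study's: every failed attempt is retried until one succeeds, attempts are independent, and each attempt has the same chance of success. Under those assumptions the expected number of attempts is 1 / success rate. The function name is hypothetical; the inputs are the with-skill figures quoted in this article.

```python
def cost_per_usable_result(cost_per_task: float, success_rate: float) -> float:
    """Expected spend per usable result, assuming independent retries until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_task / success_rate


# Lenient bar: anything that is not a near-zero solve counts as usable.
nano_lenient = cost_per_usable_result(0.040, 1 - 0.38)
super_lenient = cost_per_usable_result(0.083, 1 - 0.22)

# Strict bar: only scores of 75 or above count as usable.
nano_strict = cost_per_usable_result(0.040, 0.11)
super_strict = cost_per_usable_result(0.083, 0.40)

print(f"lenient bar: Nano ${nano_lenient:.3f}, Super ${super_lenient:.3f}")
print(f"strict bar:  Nano ${nano_strict:.3f}, Super ${super_strict:.3f}")
```

With these inputs Nano stays cheaper under the lenient bar, and Super becomes cheaper once only scores of 75 or above count. Which model costs less per usable result depends on where you set the bar.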
Points-per-dollar makes Nano look like the bargain, 928 against Super's 694. But that number only rewards cheapness, not quality: a model that regularly does the wrong thing but is extremely cheap will still score well on it. So decide the quality you need first, then compare pricing.
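The bias in that metric is easy to reproduce. The sketch below recomputes points-per-dollar from the rounded with-skill figures in the results table, so it lands near the numbers quoted above but not exactly on them, presumably because the table values are rounded. The third model is invented purely for illustration and is not something we tested.

```python
def points_per_dollar(overall_score: float, cost_per_task: float) -> float:
    """Overall score divided by per-task cost; rewards low cost regardless of quality."""
    if cost_per_task <= 0:
        raise ValueError("cost_per_task must be positive")
    return overall_score / cost_per_task


# With-skill overall scores and per-task costs, rounded as in the results table.
nano_ppd = points_per_dollar(36.8, 0.040)
super_ppd = points_per_dollar(57.8, 0.083)

# Hypothetical model, invented for illustration: mostly wrong, but very cheap.
cheap_and_wrong_ppd = points_per_dollar(10.0, 0.004)

print(round(nano_ppd), round(super_ppd), round(cheap_and_wrong_ppd))
```

The invented model scores 10 overall and still tops the ranking, which is why this metric only makes sense among models that already clear your quality bar.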
Cost is also only half the decision. The other half is fit. Nano earns its low price on the well-scoped tasks it does reliably, while on the longer, multi-step work, Super is worth paying for. The value is in matching each model to the work it can do, not in naming one model the cheapest.
Which size fits your work?
The findings turn into a simple rule of thumb. Reach for Nano 30B when the task is narrow and well-scoped: a documented API call, a focused doc-retrieval job, or a single-file change, run at high volume where a passable result or a cheap retry is acceptable. It costs about half as much per task and is small enough to self-host on consumer hardware, which makes it a genuine workhorse.
Reach for Super 120B when the work is multi-step or longer-horizon, when the result has to be usable on the first try, or when you cannot predict the shape of the tasks coming in. In this evaluation it is the first size on the ladder that reliably clears the floor for real agent work, and the place to start for anything headed to production.
Finding your floor
This study only exists because NVIDIA ships an open-weight size ladder you can self-host. This lets you match the model to the job, and step up only when the smaller one cannot clear your quality floor. The framing to carry away is the smallest usable agent, not the fastest or cheapest on paper.
So when it comes to choosing a model, don't start with price or parameter count. The model that looks like a bargain on the invoice could be the one quietly costing you. Take the work you actually need done, set the quality bar it has to clear, and measure which models successfully clear it. That is the comparison that will predict what actually works for you, and it is worth running before you commit to a model. Once the decision is made, the Tessl Registry is where you find the skills that take it the rest of the way.
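That procedure can be sketched in a few lines: measure each candidate's pass rate against your own bar, drop the ones that fall short of the reliability you need, and take the cheapest of what remains. Everything here is an assumption for illustration, including the `Candidate` type, the `smallest_usable` helper, and the choice of "not a near-zero solve" as the bar. The pass rates are derived from the with-skill near-zero rates in this article, and your own measurements should replace them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_task: float  # dollars per task
    pass_rate: float      # share of YOUR tasks that clear YOUR quality bar


def smallest_usable(candidates: Sequence[Candidate],
                    required_pass_rate: float) -> Optional[Candidate]:
    """Return the cheapest candidate meeting the required pass rate, or None."""
    eligible = [c for c in candidates if c.pass_rate >= required_pass_rate]
    return min(eligible, key=lambda c: c.cost_per_task, default=None)


# Pass rate here means "not a near-zero solve", using the with-skill figures above.
candidates = [
    Candidate("Nano 30B", cost_per_task=0.040, pass_rate=0.62),
    Candidate("Super 120B", cost_per_task=0.083, pass_rate=0.78),
]

choice = smallest_usable(candidates, required_pass_rate=0.75)
print(choice.name if choice else "no candidate clears the bar")
```

Quality filters first and price only breaks ties among the models that pass, which is the order this article argues for. If nothing clears the bar, the function returns None instead of falling back to the cheapest option.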
Nicolas Fortuin, Baptiste Fernandez



