8 Apr 2026 · 6 minute read

GitHub adds 'Rubber Duck' to Copilot CLI for second opinions on AI code

AI coding agents are getting faster at writing software, but they still make mistakes with confident abandon. And when those errors show up early in a workflow, they can shape everything that follows.
That’s the problem GitHub is targeting with a new experimental feature in its Copilot CLI. “Rubber Duck,” as it’s called, brings in a second model from a different AI family to review an agent’s plan and output before it moves ahead – think of it like asking a colleague to sense-check your approach before you commit.
A second model, not self-reflection
Coding agents typically follow a loop: plan, implement, test, and iterate. While effective, that process has a critical weakness. Early assumptions – whether about structure, dependencies, or edge cases – can become embedded in the final result.
Developers have tried to address this with self-reflection, prompting models to critique their own output. But that approach has limits, such as reinforcing flawed assumptions or missing structural issues that were baked in from the start.
“Using self-reflection and having the agent review its own output before moving forward is a proven technique,” GitHub said in a blog post. “However, a model reviewing its own work is still bounded by its own training biases.”
Rubber Duck takes a different approach. Instead of asking a model to check its own work, Copilot calls on a second model from a different family, with different training data and architecture. For now, the feature only works when a Claude model is selected as the primary agent; Rubber Duck then runs on GPT-5.4 as an independent reviewer.
Its role is narrow: flag assumptions, highlight edge cases, and surface issues that may not be obvious to the primary model.
“The job of Rubber Duck is to check the agent’s work and surface a short, focused list of high-value concerns,” GitHub said.
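The cross-model review described above can be sketched in a few lines of Python. This is an illustration only: the function names, prompts, and control flow are assumptions for the sketch, not GitHub's actual implementation.

```python
# Illustrative sketch of a cross-model review step. The prompts, names,
# and flow here are assumptions, not GitHub's implementation.

def cross_model_review(plan: str, reviewer) -> list[str]:
    """Ask an independent model for a short list of high-value concerns."""
    prompt = (
        "Review this coding plan. List only high-value concerns: "
        "hidden assumptions, missed edge cases, structural risks.\n\n" + plan
    )
    return reviewer(prompt).splitlines()

def agent_step(task: str, primary, reviewer) -> str:
    """Draft a plan with the primary model, then revise it against
    the second model's concerns before implementation begins."""
    plan = primary(f"Draft a plan for: {task}")
    concerns = cross_model_review(plan, reviewer)
    if concerns:
        # Feed the second opinion back before moving forward, so early
        # assumptions don't get baked into the final result.
        plan = primary(
            "Revise the plan to address:\n" + "\n".join(concerns) + "\n\n" + plan
        )
    return plan
```

The key design point is that `reviewer` is a different model from `primary`, so the critique is not bounded by the primary model's own training biases.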
GitHub gets its ducks in a row
GitHub tested the feature on SWE-Bench Pro, a benchmark built around complex, real-world coding problems. The results suggest the second opinion can meaningfully improve performance, particularly on longer and more involved tasks.
According to GitHub, pairing Claude Sonnet 4.6 with Rubber Duck running GPT-5.4 closed 74.7% of the performance gap between Sonnet and the more capable Claude Opus 4.6 model.
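"Closing 74.7% of the gap" is a relative measure, and it helps to see the arithmetic. GitHub did not publish the underlying benchmark scores in this excerpt, so the numbers below are hypothetical, used only to show what the phrase means.

```python
# What "closing 74.7% of the performance gap" means. The Sonnet and Opus
# scores below are hypothetical; only the 74.7% figure comes from GitHub.

def gap_closed_score(base: float, target: float, fraction: float) -> float:
    """Score after closing `fraction` of the gap between base and target."""
    return base + fraction * (target - base)

# Hypothetical: suppose Sonnet alone resolves 40.0% of tasks and Opus 50.0%.
sonnet, opus = 40.0, 50.0
paired = gap_closed_score(sonnet, opus, 0.747)
print(round(paired, 2))  # 47.47 with these made-up scores
```

In other words, the Sonnet-plus-Rubber-Duck pairing lands most of the way toward Opus-level results without running the larger model as the primary agent.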
The gains were most noticeable on tasks spanning multiple files and requiring dozens of steps. In some cases, the review agent caught structural flaws that would have otherwise gone unnoticed – such as schedulers that never ran, loops that overwrote data, or cross-file dependencies that broke downstream functionality.
Copilot doesn’t call Rubber Duck on every step. Instead, it targets specific points in the workflow where feedback is likely to matter most.
That includes after drafting a plan, after complex implementations, and after writing tests but before execution.
The agent can also invoke the review if it gets stuck, or users can request a critique directly.
GitHub said the system is designed to be selective, focusing on moments where the signal is highest without slowing down the overall workflow.
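The trigger logic above amounts to gating the review on a small set of checkpoints plus two escape hatches. GitHub has not published its internal trigger code, so the checkpoint names and predicate below are assumptions sketching the described behavior.

```python
# Sketch of checkpoint-gated review. Checkpoint names are assumptions;
# GitHub has not published the actual trigger logic.

# Points in the workflow where feedback is likely to matter most.
REVIEW_CHECKPOINTS = {"plan_drafted", "complex_implementation", "tests_written"}

def should_invoke_review(checkpoint: str,
                         agent_stuck: bool,
                         user_requested: bool) -> bool:
    """Invoke the second model selectively: at high-signal checkpoints,
    when the agent is stuck, or when the user asks for a critique."""
    return user_requested or agent_stuck or checkpoint in REVIEW_CHECKPOINTS
```

Gating on a handful of checkpoints rather than reviewing every step is what keeps the second model from slowing down the overall workflow.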
Community reacts: Quality and consumption
Rubber Duck is available now in experimental mode through Copilot CLI. Users can enable it via the /experimental slash command, after which it becomes available when a supported Claude model with access to GPT-5.4 is selected. Once enabled, Copilot can invoke Rubber Duck automatically at key checkpoints – such as after planning or testing – or users can request a critique on demand by asking Copilot to review its work.
Early feedback has generally been positive, with the usual array of developers asking when it will be available in other environments, such as VS Code.
Some discussions also focused on how the feature affects usage limits. One user noted that it appeared not to consume additional requests despite involving a second model, suggesting the underlying task-based architecture may be handling the interaction efficiently.
An improvement in output quality was a recurring theme. “Actually this is insanely good, I had deeply flawed plans from Claude and after using ducky with GPT-5.4 it corrected itself… 10 prompts became one prompt,” one Reddit commenter reported.
Elsewhere, a cloud advocate from Microsoft pointed to a more subtle effect: the presence of a second reviewing agent may push the primary model to be more rigorous in its own reasoning. They shared a screenshot in which the Rubber Duck agent was still running while the system had already flagged a potential risk in the migration plan.
Early reactions suggest that while questions remain around how the feature is priced and deployed, the idea of pairing models to review each other is resonating with developers. Whether that holds up beyond initial experiments will depend on how reliably it improves outcomes without adding friction.




