benchmark-models

Cross-model benchmark for gstack skills. Runs the same prompt through Claude, GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost, and optionally quality via LLM judge. Answers "which model is actually best for this skill?" with data instead of vibes. Separate from /benchmark, which measures web page performance. Use when: "benchmark models", "compare models", "which model is best for X", "cross-model comparison", "model shootout". (gstack) Voice triggers (speech-to-text aliases): "compare models", "model shootout", "which model is best".
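
As a rough illustration of the comparison described above, here is a minimal sketch of how per-model results could be tabulated and ranked by latency, tokens, or cost. It is not gstack's implementation. The `RunResult` type, the `best_by` and `format_table` helpers, the model names, and every number are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One model's result for a single prompt (hypothetical shape)."""
    model: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


def best_by(results, metric):
    """Return the result with the lowest value of `metric` (lower is better)."""
    if not results:
        raise ValueError("no results to compare")
    return min(results, key=lambda r: getattr(r, metric))


def format_table(results):
    """Render the results as a fixed-width, side-by-side text table."""
    header = f"{'model':<10}{'latency_s':>10}{'tokens':>8}{'cost_usd':>10}"
    rows = [
        f"{r.model:<10}{r.latency_s:>10.2f}"
        f"{r.input_tokens + r.output_tokens:>8}{r.cost_usd:>10.4f}"
        for r in results
    ]
    return "\n".join([header, *rows])


# Placeholder values for illustration only; these are not real benchmark data.
results = [
    RunResult("model-a", latency_s=2.0, input_tokens=100, output_tokens=300, cost_usd=0.0100),
    RunResult("model-b", latency_s=1.5, input_tokens=100, output_tokens=450, cost_usd=0.0200),
    RunResult("model-c", latency_s=3.0, input_tokens=100, output_tokens=250, cost_usd=0.0050),
]

print(format_table(results))
print("fastest: ", best_by(results, "latency_s").model)
print("cheapest:", best_by(results, "cost_usd").model)
```

Ranking each metric separately keeps trade-offs visible: in the placeholder data the fastest model is not the cheapest, which is the kind of tension a side-by-side comparison is meant to surface.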

Invalid
This skill can't be scored yet
Validation errors are blocking scoring. Review and fix them to unlock Quality, Impact and Security scores.

Security

2 findings — 1 critical severity, 1 medium severity. Installing this skill is not recommended: please review these findings carefully if you do intend to do so.

Critical

E004: Prompt injection detected in skill instructions

What this means

Detected a prompt injection in the skill instructions. The skill contains hidden or deceptive instructions that fall outside its stated purpose and attempt to override the agent’s safety guidelines or intended behavior.

Why it was flagged

Potential prompt injection detected (high risk: 0.90). The skill embeds numerous side-effecting instructions unrelated to benchmarking: automatic telemetry writes, config changes (proactive/routing/telemetry), possible git commits to CLAUDE.md, and GBrain sync/publish flows that could publish session memory. It also instructs the agent to prioritize or auto-run these steps, including plan-mode exceptions. These are hidden, outsized behaviors outside the skill's stated benchmarking purpose.

Medium

W011: Third-party content exposure detected (indirect prompt injection risk)

What this means

The skill exposes the agent to untrusted, user-generated content from public third-party sources, creating a risk of indirect prompt injection. This includes browsing arbitrary URLs, reading social media posts or forum comments, and analyzing content from unknown websites.

Why it was flagged

Third-party content exposure detected (high risk: 0.80). The skill explicitly runs external providers via gstack-model-benchmark (Step 4/5), which streams and interprets remote model outputs to decide the "best" model. Its GBrain Sync (at skill start) and optional Lake intro flow perform git fetch / gstack-brain-sync and may open a public URL (https://garryslist.org/...). All of these ingest third-party web, repo, or model content that can materially influence the agent's actions.

Repository: garrytan/gstack
Audited
Security analysis
Snyk