
bapfernandez/article-creator

Content creator for tessl.io — generates publish-ready blog articles with SEO metadata, Tessl house style, and technical authority.

90

Quality: 79% (Does it follow best practices?)

Impact: 92%, 1.26x (average score across 10 eval scenarios)

Security by Snyk: Passed (no known issues)


evals/scenario-1/task.md

Task: Comparison Article on Agent Evaluation Frameworks

Background

The Tessl content team wants to publish a comparison article that helps developers choose between the leading agent evaluation frameworks currently available. The audience is engineers who are actively building AI agents and have reached the point where they need a structured approach to measuring agent accuracy, but haven't yet committed to a framework.

The content lead's brief:

"This needs to be genuinely useful, not a feature matrix that tells people nothing. Pick the dimensions that actually matter for this decision and give us your honest take on when to use each. Don't just describe features — tell us who each framework is right for. Be specific. Opinionated is fine; wishy-washy is not. Publish-ready with full metadata."

Frameworks to Compare

Compare the following three agent evaluation frameworks (invented but realistic):

EvalKit

  • Open source, MIT license
  • Focuses on deterministic scoring: regex match, JSON schema validation, tool call verification
  • Supports 12 built-in scorers; custom scorers via Python plugins
  • Runs locally or in CI; no hosted service
  • Average setup time reported by users: ~45 minutes to first eval run
  • Best known for its speed and CI integration
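To make "deterministic scoring" concrete, and to contrast it with LLM-as-judge scoring, here is a minimal Python sketch of the three check types listed above. It is not EvalKit's API (EvalKit is invented for this scenario). The function names, the shape of the tool-call records, and the key-presence check standing in for full JSON Schema validation are all hypothetical assumptions.

```python
import json
import re


def regex_match(output: str, pattern: str) -> bool:
    """Pass if the pattern is found anywhere in the agent's output."""
    return re.search(pattern, output) is not None


def json_has_required_keys(output: str, required: list[str]) -> bool:
    """Simplified stand-in for JSON Schema validation (hypothetical):
    the output must parse as a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required)


def tool_called(tool_calls: list[dict], name: str, **expected_args) -> bool:
    """Pass if some recorded tool call has this name and these argument values.
    The {"name": ..., "args": {...}} record shape is an assumption."""
    return any(
        call.get("name") == name
        and all(call.get("args", {}).get(k) == v for k, v in expected_args.items())
        for call in tool_calls
    )


if __name__ == "__main__":
    output = '{"status": "ok", "order_id": "A-123"}'
    calls = [{"name": "lookup_order", "args": {"order_id": "A-123"}}]
    print(regex_match(output, r"A-\d{3}"))                          # True
    print(json_has_required_keys(output, ["status", "order_id"]))   # True
    print(tool_called(calls, "lookup_order", order_id="A-123"))     # True
```

Every check returns the same verdict on every run for the same input, which is what makes this style of scoring fast and CI-friendly. The trade-off is that it cannot judge open-ended qualities such as tone or helpfulness.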

Orion Evals

  • Commercial SaaS, free tier available (500 eval runs/month)
  • LLM-as-judge scoring with a configurable rubric builder
  • Hosted dashboard with trend graphs and regression alerts
  • Integrates with GitHub Actions, CircleCI, and GitLab CI out of the box
  • Pricing: $49/month for 5,000 runs; $199/month for 25,000 runs
  • Best known for its rubric builder and visual reporting
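The two paid tiers imply different per-run prices, which may feed into a cost dimension in the comparison. Below is a small Python sketch that rests on two assumptions. First, the full monthly allowance is used. Second, volumes above 25,000 runs are left unpriced because the brief gives no overage pricing. The function names are hypothetical.

```python
from typing import Optional, Tuple

# Orion Evals tiers as listed in the brief: (USD per month, eval runs per month).
TIERS = [(0, 500), (49, 5_000), (199, 25_000)]


def cost_per_run(price_usd: float, runs: int) -> float:
    """Effective price of one run, assuming the whole monthly allowance is used."""
    return price_usd / runs


def cheapest_tier(monthly_runs: int) -> Optional[Tuple[int, int]]:
    """Return the cheapest listed tier whose allowance covers monthly_runs.

    Returns None above 25,000 runs, since the brief gives no pricing beyond that.
    """
    for price, runs in sorted(TIERS):
        if monthly_runs <= runs:
            return (price, runs)
    return None


if __name__ == "__main__":
    for price, runs in TIERS[1:]:
        print(f"${price}/month -> ${cost_per_run(price, runs):.4f} per run")
    print(cheapest_tier(3_000))   # (49, 5000)
    print(cheapest_tier(30_000))  # None
```

At full utilization, $49 / 5,000 works out to $0.0098 per run, and $199 / 25,000 works out to $0.00796 per run. The larger tier is therefore roughly 19% cheaper per run, but only for teams that actually use most of their allowance.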

spec-eval

  • Open source, Apache 2.0 license
  • Eval scenarios defined in YAML; scoring criteria derived from the spec
  • Designed to align evals with a written specification document
  • No hosted service; runs locally or in any CI environment
  • Smaller community; roughly 1,200 GitHub stars as of the comparison date
  • Best known for its spec-driven workflow

What to Write

Write a publish-ready comparison article for the tessl.io blog. The article should:

  • Briefly introduce each framework so a reader unfamiliar with them can follow along
  • Compare them across 4-6 meaningful dimensions relevant to choosing an eval framework (you decide which dimensions matter most)
  • Include a comparison table summarizing the dimensions
  • Follow the table with expanded analysis for each dimension
  • End with specific, opinionated guidance on when to choose each framework ("If you're X doing Y, choose Z")
  • Close with something that gives the reader a clear next action or leaves them with a useful provocation

Use suggestive language when describing claimed benefits from any of the three frameworks.

Output Specification

Save the completed article as article.md in the current working directory.

The file must include a metadata block at the top (title, type, primary keyword, meta description, URL slug, internal links, estimated read time) followed by the full article body in markdown.
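The specification above does not fix a format for the metadata block, so any automated check has to assume one. The following hypothetical Python sketch assumes the block is every line before the first top-level `# ` heading. It also assumes each field appears as a `Label:` line, which may be plain, bulleted, bold, or a YAML-style key that uses underscores or hyphens. The function name and these parsing rules are illustrative assumptions, not part of the eval criteria.

```python
from pathlib import Path

# Field names taken from the output specification.
REQUIRED_FIELDS = [
    "title", "type", "primary keyword", "meta description",
    "url slug", "internal links", "estimated read time",
]


def missing_metadata_fields(markdown: str) -> list[str]:
    """Return required fields not found in the (assumed) metadata block."""
    header = []
    for line in markdown.splitlines():
        if line.startswith("# "):  # assumed: the article body starts at the first H1
            break
        # Drop bullet/bold/quote markers, normalize case and key separators.
        cleaned = line.strip().lstrip("-*> ").lower()
        header.append(cleaned.replace("_", " ").replace("-", " "))
    return [f for f in REQUIRED_FIELDS
            if not any(h.startswith(f + ":") for h in header)]


if __name__ == "__main__":
    path = Path("article.md")
    if path.exists():
        missing = missing_metadata_fields(path.read_text(encoding="utf-8"))
        print(missing or "all metadata fields present")
```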
