
roboflow-inference

Deployment option comparison (serverless, dedicated, self-hosted, batch) and Workflow execution patterns. For raw API URL patterns, auth, and request/response formats, see roboflow-api-reference.

For agents — source-of-truth: This skill is authored in roboflow/computer-vision-skills and shipped with the Roboflow plugin. If your client has loaded the plugin (you'll see roboflow:<name> skills in your available skills list), use those local skills — they're read fresh from disk every session. The same content served as MCP resources at roboflow://skills/<name>/... is a fallback for clients without the plugin and may lag this repo. Don't call ReadMcpResourceTool for roboflow://skills/... URIs when a local roboflow:<name> skill is available.

Tip: If you're connected to the Roboflow MCP server, prefer its inference tools over raw HTTP — auth is handled. For workflows the headline tool is workflows_run (run a saved workflow by workflow_id — the workflow URL slug; workspace is inferred from the API key — see Finding your workspace slug). For single-model calls use models_infer. workflow_specs_run and workflow_specs_validate exist for narrow inline-spec exceptions described under "Authoring Workflows" below.

Inference & Deployment

Prefer Workflows over direct model inference. Workflows let you chain model + visualization + logic blocks in one call. Direct models_infer returns JSON only — no annotated images, and instance segmentation responses can be very large. See workflows and workflow-templates.

Authoring Workflows — don't paste JSON into chat or scripts. Workflows are authored on the Roboflow platform (storage, versioning, and retrieval go through the platform) and run from code by identifier. Two authoring modes — propose / infer the right one from session context, never silently pick:

  • Mode A — Agent-driven (MCP, in-session) — for demos, previews, or when the user is committed to in-session "vibe coding". Agent designs the blocks, uses MCP authoring tools to create+save the workflow on the platform during the session (ground the design with workflow_blocks_list / workflow_blocks_get_schema; validate with workflow_specs_validate), then runs it.
  • Mode B — Platform-driven (Roboflow app + in-app agent) — better default for non-trivial / sophisticated cases, when the user prefers visual iteration, when they aren't committed to agent-driven authoring this session, or as the fallback when Mode A hits an issue. Agent proposes the block design and hands the user a link to the Workflows builder; the user builds (manually or with the more context-grounded in-app agent), tests in the preview, saves, and shares the workspace + workflow URL slugs back (both visible in the builder URL: app.roboflow.com/<workspace-slug>/workflows/<workflow-slug>).

Either mode lands at the same run path: workflows_run (MCP) or client.run_workflow(workspace_name=..., workflow_id=...) (SDK). Inline specs (workflow_specs_run) are an exception, not a default — only when the user explicitly asks for a throwaway run, and validate the spec first with workflow_specs_validate. See workflows "Authoring & Deployment" for the full flow.
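A minimal sketch of the SDK run path named above, assuming the inference-sdk package is installed. The helper names (`make_client`, `run_saved_workflow`), the `ROBOFLOW_API_KEY` environment variable, the serverless URL, and the `image` input name are illustrative assumptions; use the input name your saved workflow actually declares.

```python
import os


def make_client(api_url="https://serverless.roboflow.com"):
    """Build an SDK client. The default URL is an assumed serverless endpoint."""
    # Imported lazily so this module still loads where inference-sdk is absent.
    from inference_sdk import InferenceHTTPClient

    return InferenceHTTPClient(
        api_url=api_url,
        api_key=os.environ["ROBOFLOW_API_KEY"],  # assumed env var name
    )


def run_saved_workflow(client, workspace_slug, workflow_slug, image_path):
    """Run a platform-saved workflow by identifier, with no inline JSON spec."""
    return client.run_workflow(
        workspace_name=workspace_slug,
        workflow_id=workflow_slug,
        # "image" is an assumed input name; match the workflow's declared input.
        images={"image": image_path},
    )
```

A call might look like `run_saved_workflow(make_client(), "my-workspace", "my-workflow", "photo.jpg")`, where both slugs come from the builder URL.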

For live video (webcam, RTSP, file): the MCP workflows_run tool only handles single static images. For live video, present the user with three options (don't pick one silently): (A) WebRTC → serverless GPU, (B) WebRTC → local inference server, or (C) in-process InferencePipeline. They differ in setup cost, dependency size, and latency, so give a one-line summary of each and let the user choose. See roboflow://skills/inference/workflows ("Video Stream" section) for full code and the comparison table.

Deployment Options

| Option | Best For | Latency | Scaling | Cost Model | GPU |
| --- | --- | --- | --- | --- | --- |
| Serverless | Getting started, variable traffic | Low | Auto | Per-inference credit | Yes |
| Dedicated | Predictable workloads, low latency | Very low | Manual/autoscale | Per-hour credits | Optional |
| Self-hosted | Full control, edge | Hardware-dependent | Manual | Metered + infra cost | Optional |
| Batch Processing | Large offline datasets, videos | Async (minutes-hours) | Auto-provisioned | Per-job | Optional |

When to Use Which

  • Serverless -- default choice. Zero setup, auto-scales, 20MB upload limit. Use models_infer or workflows_run MCP tools.

  • Dedicated -- need consistent latency, large models (Florence 2), or high throughput. Development and production tiers available. Subdomain: <name>.roboflow.cloud.

  • Self-hosted -- deploy Roboflow Inference via Docker on your own hardware (Jetson, cloud VMs, RPi). Same API surface as serverless -- just change api_url.

  • Batch Processing -- runs a Workflow on uploaded images/videos asynchronously. No real-time requirement. Results delivered as JSON.

  • Real-time video (webcam/RTSP/file) -- three deployment options; ask the user which one before writing code:

    • (A) Serverless GPU + WebRTC -- zero setup, just an API key; per-minute credits, plan-tiered (webrtc-gpu-small/medium/large).
    • (B) Local inference server + WebRTC -- pip install inference-cli && inference server start (Docker recommended); lowest latency, isolates the heavy CV/model deps inside the server.
    • (C) InferencePipeline in-process -- pip install inference in a venv (prefer uv); runs the workflow loop directly in the user's Python process, no separate server. Heavy deps (torch, opencv, onnxruntime) install locally.

    All three have a slower first run (model download / warmup) before subsequent runs hit cached state — tell the user this so they don't think the script is hung.

    • See roboflow://skills/inference/workflows ("Video Stream" section) for full code and a comparison table.
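For the Serverless, Dedicated, and Self-hosted options above, switching targets comes down to the base URL handed to the client. The sketch below shows one way to centralize that choice. The function name `resolve_api_url` is hypothetical, and the serverless endpoint and the `localhost:9001` default are assumptions to verify against your setup; only the `<name>.roboflow.cloud` pattern comes from the text above.

```python
def resolve_api_url(option, name=None, host="localhost", port=9001):
    """Map a deployment option to the base URL an SDK client would target."""
    if option == "serverless":
        return "https://serverless.roboflow.com"  # assumed serverless endpoint
    if option == "dedicated":
        if not name:
            raise ValueError("dedicated deployments need their subdomain name")
        return f"https://{name}.roboflow.cloud"  # <name>.roboflow.cloud pattern
    if option == "self-hosted":
        return f"http://{host}:{port}"  # assumed local inference server address
    raise ValueError(f"unknown deployment option: {option!r}")
```

Keeping the URL in one place means the rest of an integration script can stay identical across deployment options.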

MCP Tools for Inference

| Tool | Purpose |
| --- | --- |
| models_list | List trained models for a project |
| models_get | Get details for a trained model |
| models_infer | Run single-model inference on one image via serverless API |
| models_train | Start training a model on a dataset version |
| models_get_training_status | Check training progress and metrics |
| workflows_run | Preferred. Run a saved workflow by workflow_id (the workflow URL slug; workspace is inferred from the API key; see Finding your workspace slug). Optional parameters. |
| workflow_specs_validate | Validate an inline workflow spec without running it; use before any inline run. |
| workflow_specs_run | Exception only. Run an inline workflow spec, for explicit throwaway runs the user asked for. |

Local tooling: when MCP isn't enough

For most operations, prefer the Roboflow MCP tools above — they handle auth and need nothing installed locally. Reach for local Python packages only for the gaps: integration scripts (inference-sdk), Batch Processing / Data Staging (inference-cli), the self-hosted server (inference-cli), and asset scripts that need typed Python objects.

See local-tooling for what to install for which use case, the recommended uv-based env setup, conda / venv fallbacks, and common pitfalls.

Response Shapes by Task

For canonical response shapes (object detection, classification, segmentation, keypoint) with all fields including class_id, detection_id, class_confidence, see roboflow://skills/api-reference/inference.

Large Response Handling

Instance segmentation points arrays are the main culprit for bloated responses. Each detection includes a polygon with potentially hundreds of coordinate pairs. A single image with many detections can return megabytes of JSON.

Mitigation strategies:

  1. Use Workflows instead of direct inference -- add a polygon simplification or property extraction block to reduce output before it reaches the client
  2. Filter classes -- use class_filter to only return classes you need
  3. Raise confidence threshold -- fewer detections = smaller response
  4. Post-process -- if consuming raw responses, drop or simplify the points array when you only need bounding boxes
  5. Avoid returning raw segmentation results through LLM context -- extract only the fields you need (class counts, bounding boxes) and discard polygon data
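Strategies 3 to 5 above can be combined in a small client-side filter. This is a sketch under an assumed response shape (a top-level `predictions` list whose items carry `class`, `confidence`, and `points`); check the canonical shapes in the API reference before relying on it. The function name and the default threshold are illustrative.

```python
from collections import Counter


def summarize_segmentation(response, min_confidence=0.5):
    """Keep class counts and box fields; drop the bulky polygon `points` arrays."""
    kept = [
        {k: v for k, v in pred.items() if k != "points"}
        for pred in response.get("predictions", [])
        if pred.get("confidence", 0.0) >= min_confidence
    ]
    return {
        "class_counts": dict(Counter(p.get("class") for p in kept)),
        "predictions": kept,
    }
```

Passing only the returned summary into agent context keeps the payload proportional to the number of detections rather than the number of polygon vertices.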

Workflow image outputs are a second culprit. Visualization blocks (bounding box, polygon, mask, label, halo, …) emit rendered images as base64-encoded blobs inside the response — a 720p annotated frame is hundreds of KB of JSON-escaped string. When you call workflows_run / workflow_specs_run via MCP, this routinely overflows the tool-result token budget. Decode every image-shaped output ({"type": "base64", "value": "..."}) and write it to disk instead of carrying it through agent context. Don't hard-code field names — the output keys are whatever the workflow author declared via JsonField; iterate output.keys() and shape-check.
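The decode-and-write step described above could look roughly like this. It is a sketch that assumes `output` is one dict of declared output names to values (if your client returns a list with one dict per input image, apply it to each element). The helper names and the `.jpg` extension are assumptions, not part of any Roboflow API.

```python
import base64
from pathlib import Path


def is_image_output(value):
    """Shape-check for {"type": "base64", "value": "..."} outputs."""
    return (
        isinstance(value, dict)
        and value.get("type") == "base64"
        and isinstance(value.get("value"), str)
    )


def offload_images(output, out_dir, prefix="frame"):
    """Write image-shaped outputs to disk and replace them with file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    slim = {}
    for key in output.keys():  # keys are whatever the workflow author declared
        value = output[key]
        if is_image_output(value):
            path = out_dir / f"{prefix}_{key}.jpg"  # .jpg extension is an assumption
            path.write_bytes(base64.b64decode(value["value"]))
            slim[key] = str(path)
        else:
            slim[key] = value
    return slim
```

The returned dict keeps non-image fields untouched and carries only short file paths in place of the blobs, which is what should travel through agent context.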

Batch Processing

What it is. A Roboflow-managed cloud service that runs a Workflow over a batch of images or videos asynchronously, provisioning the infrastructure for you. "Ideal for asynchronously processing large amounts of data." (Roboflow docs)

Problem it solves. Bulk inference over thousands to millions of files without standing up your own GPUs, queues, or autoscaler. You hand Roboflow a Workflow plus a batch of inputs, pay per job, and get JSON results back when the job finishes.

Pick it when the data is stored (not live), per-file cost matters more than per-file latency, and minutes-to-hours per job is acceptable. Pick something else when you need real-time per-request results (use Serverless or Dedicated) or air-gapped/on-prem processing (use Self-hosted).

Surfaces: Roboflow web UI, inference rf-cloud CLI, and REST API.

Flow

  1. Have a saved Workflow in your workspace.
  2. Stage inputs as a Data Staging batch (local directory, JSONL of signed URLs, or cloud-storage path on S3 / GCS / Azure).
  3. Submit a job referencing the Workflow + input batch; choose CPU or GPU.
  4. Monitor — poll job status or register a webhook.
  5. Export the output batch as JSON.

CLI

The inference rf-cloud CLI exposes two subcommand groups: data-staging (manage input/output batches) and batch-processing (submit and monitor jobs). Run any command with --help for the full option list.

Minimal end-to-end:

# Stage images
inference rf-cloud data-staging create-batch-of-images \
  --images-dir ./my-images --batch-id my-batch

# Submit
inference rf-cloud batch-processing process-images-with-workflow \
  --workflow-id my-workflow --batch-id my-batch
# -> prints JOB_ID

# Monitor
inference rf-cloud batch-processing show-job-details --job-id JOB_ID

# Export results
inference rf-cloud data-staging export-batch \
  --target-dir ./results --batch-id OUTPUT_BATCH_ID
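
When the same four steps need to run from an integration script, one option is to build the argument lists in Python and hand them to `subprocess`. This is a sketch only: the helper names are hypothetical, the subcommands and flags are copied from the commands above, and the format of the CLI's printed output (for example, how the job ID appears) is not assumed here.

```python
import subprocess

BASE = ["inference", "rf-cloud"]


def stage_images_cmd(images_dir, batch_id):
    """argv for staging a local directory as an input batch."""
    return BASE + ["data-staging", "create-batch-of-images",
                   "--images-dir", images_dir, "--batch-id", batch_id]


def submit_job_cmd(workflow_id, batch_id):
    """argv for submitting an image-batch job against a saved workflow."""
    return BASE + ["batch-processing", "process-images-with-workflow",
                   "--workflow-id", workflow_id, "--batch-id", batch_id]


def job_details_cmd(job_id):
    """argv for checking the stages and status of one job."""
    return BASE + ["batch-processing", "show-job-details", "--job-id", job_id]


def export_batch_cmd(target_dir, batch_id):
    """argv for downloading a batch (e.g. job outputs) to a local directory."""
    return BASE + ["data-staging", "export-batch",
                   "--target-dir", target_dir, "--batch-id", batch_id]


def run(cmd):
    """Run one CLI step and return its stdout; raises if the exit code is non-zero."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

For example, `print(run(stage_images_cmd("./my-images", "my-batch")))` stages the batch; the job ID and output batch ID still have to be read from the CLI's output by hand or parsed once you have confirmed its format.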

Data Staging commands — see batch-staging for nuances (data sources, JSONL reference format, multipart batches, webhook notifications):

| Command | Purpose |
| --- | --- |
| data-staging list-batches | List staging batches in the workspace |
| data-staging create-batch-of-images | Create an input batch from a local directory, signed-URL JSONL, or cloud-storage path |
| data-staging create-batch-of-videos | Same as above, but for video files |
| data-staging show-batch-details | Show metadata for a single batch |
| data-staging list-batch-content | List file URLs in a batch (filter by part, write JSONL) |
| data-staging list-ingest-details | Per-shard ingest status for debugging URL ingests |
| data-staging export-batch | Download all files from a batch (e.g. job outputs) to a local directory |

Batch Processing (job) commands — see batch-jobs for nuances (compute configuration, workflow parameters, image-output persistence, aggregation format, video FPS, restarts, TRT compilation):

| Command | Purpose |
| --- | --- |
| batch-processing list-jobs | List jobs in the workspace |
| batch-processing show-job-details | Show stages and current status of a single job |
| batch-processing process-images-with-workflow | Submit an image-batch job |
| batch-processing process-videos-with-workflow | Submit a video-batch job |
| batch-processing fetch-logs | Fetch job logs (filter by severity, write JSONL) |
| batch-processing abort-job | Terminate a running job |
| batch-processing restart-job | Restart a failed job (optionally with new compute settings) |
| batch-processing trt-compile | Compile a model to TensorRT for one or more NVIDIA devices |

Notes and constraints

  • Async only — minutes-to-hours latency depending on volume and hardware. Not for real-time.
  • Pricing — per job; GPU jobs cost more than CPU. See plans-and-pricing.
  • Image-references ingest requires signed URLs from trusted sources; arbitrary public URLs are rejected — stage to a local directory or cloud-storage path instead.

Full reference: Roboflow Batch Processing docs.

Repository
roboflow/computer-vision-skills