deployment-pipeline-design

Design multi-stage CI/CD pipelines with approval gates, security checks, and deployment orchestration. Use this skill when designing zero-downtime deployment pipelines, implementing canary rollout strategies, setting up multi-environment promotion workflows, or debugging failed deployment gates in CI/CD.

Quality

92%

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

Securityby

Passed

No known issues

Deployment Pipeline Design

Architecture patterns for multi-stage CI/CD pipelines with approval gates, deployment strategies, and environment promotion workflows.

Purpose

Design robust, secure deployment pipelines that balance speed with safety through proper stage organization, automated quality gates, and progressive delivery strategies. This skill covers both the structural design of pipeline architecture and the operational patterns for reliable production deployments.

Input / Output

What You Provide

Application type: Language/runtime, containerized or bare-metal, monolith or microservices
Deployment target: Kubernetes, ECS, VMs, serverless, or platform-as-a-service
Environment topology: Number of environments (dev/staging/prod), region layout, air-gap requirements
Rollout requirements: Acceptable downtime, rollback SLA, traffic splitting needs, canary vs blue-green preference
Gate constraints: Approval teams, required test coverage thresholds, compliance scans (SAST, DAST, SCA)
Monitoring stack: Prometheus, Datadog, CloudWatch, or other metrics sources used for automated promotion decisions

What This Skill Produces

Pipeline configuration: Stage definitions, job dependencies, parallelism, and caching strategy
Deployment strategy: Chosen rollout pattern with annotated configuration (canary weights, blue-green switchover, rolling parameters)
Health check setup: Shallow vs deep readiness probes, post-deployment smoke test scripts
Gate definitions: Automated metric thresholds and manual approval workflows
Rollback plan: Automated rollback triggers and manual runbook steps

When to Use

Design CI/CD architecture for a new service or platform migration
Implement deployment gates between environments
Configure multi-environment pipelines with mandatory security scanning
Establish progressive delivery with canary or blue-green strategies
Debug pipelines where stages succeed but production behavior is wrong
Reduce mean time to recovery by automating rollback on metric degradation

Pipeline Stages

Standard Pipeline Flow

┌─────────┐   ┌──────┐   ┌─────────┐   ┌────────┐   ┌──────────┐
│  Build  │ → │ Test │ → │ Staging │ → │ Approve│ → │Production│
└─────────┘   └──────┘   └─────────┘   └────────┘   └──────────┘

Detailed Stage Breakdown

Source - Code checkout, dependency graph resolution
Build - Compile, package, containerize, sign artifacts
Test - Unit, integration, SAST/SCA security scans
Staging Deploy - Deploy to staging environment with smoke tests
Integration Tests - E2E, contract tests, performance baselines
Approval Gate - Manual or automated metric-based gate
Production Deploy - Canary, blue-green, or rolling strategy
Verification - Deep health checks, synthetic monitoring
Rollback - Automated rollback on failure signals

Approval Gate Patterns

Pattern 1: Manual Approval (GitHub Actions)

production-deploy:
  needs: staging-deploy
  environment:
    name: production
    url: https://app.example.com
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to production
      run: kubectl apply -f k8s/production/

Environment protection rules in GitHub enforce required reviewers before this job starts. Configure reviewers at Settings → Environments → production → Required reviewers.

Pattern 2: Time-Based Approval (GitLab CI)

deploy:production:
  stage: deploy
  script:
    - deploy.sh production
  environment:
    name: production
  when: delayed
  start_in: 30 minutes
  only:
    - main

Pattern 3: Multi-Approver (Azure Pipelines)

stages:
  - stage: Production
    dependsOn: Staging
    jobs:
      - deployment: Deploy
        environment:
          name: production
          resourceType: Kubernetes
        strategy:
          runOnce:
            preDeploy:
              steps:
                - task: ManualValidation@0
                  inputs:
                    notifyUsers: "team-leads@example.com"
                    instructions: "Review staging metrics before approving"

Pattern 4: Automated Metric Gate

Use an AnalysisTemplate (Argo Rollouts) or a custom gate script to block promotion when error rates exceed a threshold:

# Argo Rollouts AnalysisTemplate — blocks canary promotion automatically
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 60s
    successCondition: "result[0] >= 0.95"
    failureCondition: "result[0] < 0.90"
    inconclusiveLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{status!~"5..",job="my-app"}[2m]))
          / sum(rate(http_requests_total{job="my-app"}[2m]))

Deployment Strategies

Decision Table

Strategy	Downtime	Rollback Speed	Cost Impact	Best For
Rolling	None	~minutes	None	Most stateless services
Blue-Green	None	Instant	2x infra (temp)	High-risk or database migrations
Canary	None	Instant	Minimal	High-traffic, metric-driven
Recreate	Yes	Fast	None	Dev/test, batch jobs
Feature Flag	None	Instant	None	Gradual feature exposure

1. Rolling Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2         # at most 12 pods during rollout
      maxUnavailable: 1   # at least 9 pods always serving

Characteristics: gradual rollout, zero downtime, easy rollback, best for most applications.

2. Blue-Green Deployment

# Switch traffic from blue to green
kubectl apply -f k8s/green-deployment.yaml
kubectl rollout status deployment/my-app-green

# Flip the service selector
kubectl patch service my-app -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback instantly if needed
kubectl patch service my-app -p '{"spec":{"selector":{"version":"blue"}}}'

Characteristics: instant switchover, easy rollback, doubles infrastructure cost temporarily, good for high-risk deployments with long warm-up times.

3. Canary Deployment (Argo Rollouts)

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100

Characteristics: gradual traffic shift, real-user metric validation, automated promotion or rollback, requires Argo Rollouts or a service mesh.

4. Feature Flags

from flagsmith import Flagsmith

flagsmith = Flagsmith(environment_key="API_KEY")

if flagsmith.has_feature("new_checkout_flow"):
    process_checkout_v2()
else:
    process_checkout_v1()

Characteristics: deploy without releasing, A/B testing, instant rollback per user segment, granular control independent of deployment.

Pipeline Orchestration

Multi-Stage Pipeline Example (GitHub Actions)

name: Production Pipeline

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image: ${{ steps.build.outputs.image }}
    steps:
      - uses: actions/checkout@v4
      - name: Build and push Docker image
        id: build
        run: |
          IMAGE=myapp:${{ github.sha }}
          docker build -t $IMAGE .
          docker push $IMAGE
          echo "image=$IMAGE" >> $GITHUB_OUTPUT

  test:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Unit tests
        run: make test
      - name: Security scan
        run: trivy image ${{ needs.build.outputs.image }}

  deploy-staging:
    needs: test
    environment:
      name: staging
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: kubectl apply -f k8s/staging/

  integration-test:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - name: Run E2E tests
        run: npm run test:e2e

  deploy-production:
    needs: integration-test
    environment:
      name: production        # blocks here until required reviewers approve
    runs-on: ubuntu-latest
    steps:
      - name: Canary deployment
        run: |
          kubectl apply -f k8s/production/
          kubectl argo rollouts promote my-app

  verify:
    needs: deploy-production
    runs-on: ubuntu-latest
    steps:
      - name: Deep health check
        run: |
          for i in {1..12}; do
            STATUS=$(curl -sf https://app.example.com/health/ready | jq -r '.status')
            [ "$STATUS" = "ok" ] && exit 0
            sleep 10
          done
          exit 1
      - name: Notify on success
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text":"Production deployment successful: ${{ github.sha }}"}'

Health Checks

Shallow vs Deep Health Endpoints

A shallow /ping returns 200 even when downstream dependencies are broken. Use a deep readiness endpoint that verifies actual dependencies before promoting traffic.

# /health/ready — checks real dependencies, used by pipeline gate
@app.get("/health/ready")
async def readiness():
    checks = {
        "database": await check_db_connection(),
        "cache":    await check_redis_connection(),
        "queue":    await check_queue_connection(),
    }
    status = "ok" if all(checks.values()) else "degraded"
    code = 200 if status == "ok" else 503
    return JSONResponse({"status": status, "checks": checks}, status_code=code)

Post-Deployment Verification Script

#!/usr/bin/env bash
# verify-deployment.sh — run after every production deploy
set -euo pipefail

ENDPOINT="${1:?usage: verify-deployment.sh <base-url>}"
MAX_ATTEMPTS=12
SLEEP_SECONDS=10

for i in $(seq 1 $MAX_ATTEMPTS); do
  STATUS=$(curl -sf "$ENDPOINT/health/ready" | jq -r '.status' 2>/dev/null || echo "unreachable")
  if [ "$STATUS" = "ok" ]; then
    echo "Health check passed after $((i * SLEEP_SECONDS))s"
    exit 0
  fi
  echo "Attempt $i/$MAX_ATTEMPTS: status=$STATUS — retrying in ${SLEEP_SECONDS}s"
  sleep "$SLEEP_SECONDS"
done

echo "Health check failed after $((MAX_ATTEMPTS * SLEEP_SECONDS))s"
exit 1

Rollback Strategies

Automated Rollback in Pipeline

deploy-and-verify:
  steps:
    - name: Deploy new version
      run: kubectl apply -f k8s/

    - name: Wait for rollout
      run: kubectl rollout status deployment/my-app --timeout=5m

    - name: Post-deployment health check
      id: health
      run: ./scripts/verify-deployment.sh https://app.example.com

    - name: Rollback on failure
      if: failure()
      run: |
        kubectl rollout undo deployment/my-app
        echo "Rolled back to previous revision"

Manual Rollback Commands

# List revision history with change-cause annotations
kubectl rollout history deployment/my-app

# Rollback to previous version
kubectl rollout undo deployment/my-app

# Rollback to a specific revision
kubectl rollout undo deployment/my-app --to-revision=3

# Verify rollback completed
kubectl rollout status deployment/my-app

For advanced rollback strategies including database migration rollbacks and Argo Rollouts abort flows, see references/advanced-strategies.md.

Monitoring and Metrics

Key DORA Metrics to Track

Metric	Target (Elite)	How to Measure
Deployment Frequency	Multiple/day	Pipeline run count per day
Lead Time for Changes	< 1 hour	Commit timestamp → production deploy
Change Failure Rate	< 5%	Failed deploys / total deploys
Mean Time to Recovery	< 1 hour	Incident open → service restored

Post-Deployment Metric Verification

- name: Verify error rate post-deployment
  run: |
    sleep 60  # allow metrics to accumulate

    ERROR_RATE=$(curl -sf "$PROMETHEUS_URL/api/v1/query" \
      --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
      | jq '.data.result[0].value[1]')

    echo "Current error rate: $ERROR_RATE"
    if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
      echo "Error rate $ERROR_RATE exceeds 1% threshold — triggering rollback"
      exit 1
    fi

Pipeline Best Practices

Fail fast — Run quick checks (lint, unit tests) before slow ones (E2E, security scans)
Parallel execution — Run independent jobs concurrently to minimize total pipeline time
Caching — Cache dependency layers and build artifacts between runs
Artifact promotion — Build once, promote the same artifact through all environments
Environment parity — Keep staging infrastructure as close to production as possible
Secrets management — Use secret stores (Vault, AWS Secrets Manager, GitHub encrypted secrets) — never hardcode
Deployment windows — Prefer low-traffic windows; enforce change freeze periods via gate policies
Idempotent deploys — Ensure re-running a deploy produces the same result
Rollback automation — Trigger rollback automatically on health check or metric threshold failure
Annotate deployments — Send deployment markers to monitoring tools (Datadog, Grafana) for correlation

Troubleshooting

Health check passes in pipeline but service is unhealthy in production

The pipeline health check is hitting a shallow /ping endpoint that returns 200 even when the database is unreachable. Use a deep readiness check that verifies actual dependencies (see Health Checks section above).

Canary deployment never promotes to 100%

Argo Rollouts requires a valid AnalysisTemplate to auto-promote. If the Prometheus query returns no data (e.g., metric name changed), the analysis stays inconclusive and promotion stalls. Add inconclusiveLimit so the rollout fails fast rather than hanging:

spec:
  metrics:
  - name: error-rate
    failureCondition: "result[0] > 0.05"
    inconclusiveLimit: 2   # fail after 2 inconclusive results, not hang indefinitely
    provider:
      prometheus:
        query: |
          sum(rate(http_requests_total{status=~"5.."}[2m]))
          / sum(rate(http_requests_total[2m]))

Staging deploy succeeds but production job never starts

Check that production environment protection rules are configured — a missing reviewer assignment means the approval gate waits indefinitely with no notification. In GitHub Actions, ensure Required reviewers is set to an existing user or team in Settings → Environments → production.

Docker layer cache busted on every run causing slow builds

If COPY . . appears before dependency installation, any source file change invalidates the dependency layer. Reorder to copy dependency manifests first:

# Good: dependencies cached separately from source code
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

Rollback leaves database migrations applied to old code

A service rollback without a migration rollback causes schema/code mismatch errors. Always make migrations backward-compatible (additive only) for at least one release cycle, and keep undo scripts versioned alongside the migration:

# migrations/V20240315__add_nullable_column.sql       (forward)
# migrations/V20240315__add_nullable_column.undo.sql  (backward)

Never run destructive migrations (DROP COLUMN, ALTER NOT NULL) until the old code version is fully retired from all environments.

Advanced Topics

For platform-specific pipeline configurations, multi-region promotion workflows, and advanced Argo Rollouts patterns, see:

references/advanced-strategies.md — Extended YAML examples, platform-specific configs (GitHub Actions, GitLab CI, Azure Pipelines), multi-region canary patterns, and database migration rollback strategies

Related Skills

github-actions-templates - For GitHub Actions implementation patterns and reusable workflows
gitlab-ci-patterns - For GitLab CI/CD pipeline implementation
secrets-management - For secrets handling in CI/CD pipelines

Repository: wshobson/agents
Commit: 91fe43e

Last updated: 4 days ago
Created: 4 days ago

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.