Design multi-stage CI/CD pipelines with approval gates, security checks, and deployment orchestration. Use this skill when designing zero-downtime deployment pipelines, implementing canary rollout strategies, setting up multi-environment promotion workflows, or debugging failed deployment gates in CI/CD.
Architecture patterns for multi-stage CI/CD pipelines with approval gates, deployment strategies, and environment promotion workflows.
Design robust, secure deployment pipelines that balance speed with safety through proper stage organization, automated quality gates, and progressive delivery strategies. This skill covers both the structural design of pipeline architecture and the operational patterns for reliable production deployments.
    ┌─────────┐   ┌──────┐   ┌─────────┐   ┌────────┐   ┌──────────┐
    │  Build  │ → │ Test │ → │ Staging │ → │ Approve│ → │Production│
    └─────────┘   └──────┘   └─────────┘   └────────┘   └──────────┘

    # GitHub Actions: the production environment gates this job on approval
    production-deploy:
      needs: staging-deploy
      environment:
        name: production
        url: https://app.example.com
      runs-on: ubuntu-latest
      steps:
        - name: Deploy to production
          run: kubectl apply -f k8s/production/

Environment protection rules in GitHub enforce required reviewers before this job starts. Configure reviewers at Settings → Environments → production → Required reviewers.
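    # GitLab CI: delayed job, starts 30 minutes after the pipeline reaches it
    deploy:production:
      stage: deploy
      script:
        - deploy.sh production
      environment:
        name: production
      when: delayed
      start_in: 30 minutes
      only:
        - main

    # Azure Pipelines: manual validation before the production deployment job
    stages:
      - stage: Production
        dependsOn: Staging
        jobs:
          - job: WaitForApproval
            pool: server  # ManualValidation@0 runs only in agentless jobs
            steps:
              - task: ManualValidation@0
                inputs:
                  notifyUsers: "team-leads@example.com"
                  instructions: "Review staging metrics before approving"
          - deployment: Deploy
            dependsOn: WaitForApproval
            environment:
              name: production
              resourceType: Kubernetes
            strategy:
              runOnce:
                deploy:
                  steps:
                    - script: deploy.sh production  # placeholder deploy step

Use an AnalysisTemplate (Argo Rollouts) or a custom gate script to block promotion when error rates exceed a threshold: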
    # Argo Rollouts AnalysisTemplate: blocks canary promotion automatically
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: success-rate
    spec:
      metrics:
        - name: success-rate
          interval: 60s
          successCondition: "result[0] >= 0.95"
          failureCondition: "result[0] < 0.90"
          inconclusiveLimit: 3
          provider:
            prometheus:
              address: http://prometheus:9090
              query: |
                sum(rate(http_requests_total{status!~"5..",job="my-app"}[2m]))
                / sum(rate(http_requests_total{job="my-app"}[2m]))

| Strategy | Downtime | Rollback Speed | Cost Impact | Best For |
|---|---|---|---|---|
| Rolling | None | ~minutes | None | Most stateless services |
| Blue-Green | None | Instant | 2x infra (temp) | High-risk or database migrations |
| Canary | None | Instant | Minimal | High-traffic, metric-driven |
| Recreate | Yes | Fast | None | Dev/test, batch jobs |
| Feature Flag | None | Instant | None | Gradual feature exposure |
    # Rolling update (fragment: selector and pod template omitted)
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 10
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 2        # at most 12 pods during rollout
          maxUnavailable: 1  # at least 9 pods always serving

Characteristics: gradual rollout, zero downtime, easy rollback, best for most applications.
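The pod-count bounds noted in the rolling update comments follow directly from replicas, maxSurge, and maxUnavailable. The sketch below reproduces that arithmetic in Python; the helper names `rollout_bounds` and `_resolve` are hypothetical, and the percentage handling assumes the documented Kubernetes rounding (maxSurge rounds up, maxUnavailable rounds down).

```python
import math


def _resolve(value, replicas, round_up):
    """Turn an absolute count or a percentage string like "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        pods = int(value[:-1]) / 100 * replicas
        return math.ceil(pods) if round_up else math.floor(pods)
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


# Values from the Deployment fragment: 10 replicas, maxSurge 2, maxUnavailable 1
print(rollout_bounds(10, 2, 1))          # (9, 12)
print(rollout_bounds(10, "25%", "25%"))  # (8, 13)
```

Running this kind of check before changing maxSurge or maxUnavailable makes it easier to see whether the cluster has headroom for the surge and whether the remaining pods can carry peak traffic.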
    # Switch traffic from blue to green
    kubectl apply -f k8s/green-deployment.yaml
    kubectl rollout status deployment/my-app-green

    # Flip the service selector
    kubectl patch service my-app -p '{"spec":{"selector":{"version":"green"}}}'

    # Rollback instantly if needed
    kubectl patch service my-app -p '{"spec":{"selector":{"version":"blue"}}}'

Characteristics: instant switchover, easy rollback, doubles infrastructure cost temporarily, good for high-risk deployments with long warm-up times.
    # Argo Rollouts canary: background analysis starts at step index 2
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: my-app
    spec:
      replicas: 10
      strategy:
        canary:
          analysis:
            templates:
              - templateName: success-rate
            startingStep: 2
          steps:
            - setWeight: 10
            - pause: { duration: 5m }
            - setWeight: 25
            - pause: { duration: 5m }
            - setWeight: 50
            - pause: { duration: 10m }
            - setWeight: 100

Characteristics: gradual traffic shift, real-user metric validation, automated promotion or rollback, requires Argo Rollouts or a service mesh.
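To make the promotion logic concrete, here is a deliberately simplified Python model of how a canary might advance under the success-rate thresholds used earlier (success at 0.95 or above, failure below 0.90, anything in between inconclusive). It is a sketch, not Argo Rollouts' implementation: the function names are hypothetical, it takes one measurement per weight, and it ignores startingStep and the failure and inconclusive limits that the real controller applies across repeated measurements.

```python
CANARY_WEIGHTS = [10, 25, 50, 100]  # setWeight values from the Rollout


def classify_measurement(success_rate, success_at=0.95, failure_below=0.90):
    """Classify one measurement against the two threshold conditions."""
    if success_rate >= success_at:
        return "Successful"
    if success_rate < failure_below:
        return "Failed"
    return "Inconclusive"  # neither condition matched


def simulate(rates):
    """Walk the weights in order; stop at the first measurement that is not Successful.

    `rates` must hold one measured success rate per entry in CANARY_WEIGHTS.
    Returns (weight reached, verdict).
    """
    if len(rates) != len(CANARY_WEIGHTS):
        raise ValueError("expected one rate per canary weight")
    for weight, rate in zip(CANARY_WEIGHTS, rates):
        verdict = classify_measurement(rate)
        if verdict != "Successful":
            return weight, verdict
    return CANARY_WEIGHTS[-1], "Successful"


print(simulate([0.99, 0.98, 0.97, 0.99]))  # (100, 'Successful')
print(simulate([0.99, 0.93, 0.99, 0.99]))  # (25, 'Inconclusive')
print(simulate([0.85, 0.99, 0.99, 0.99]))  # (10, 'Failed')
```

The gap between the two thresholds is a design choice: it leaves a band where the system neither promotes nor rolls back on its own, which is where a human decision is usually wanted.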
    from flagsmith import Flagsmith

    # Assumes the Flagsmith Python SDK v3+, which fetches flags as a collection
    flagsmith = Flagsmith(environment_key="API_KEY")
    flags = flagsmith.get_environment_flags()

    if flags.is_feature_enabled("new_checkout_flow"):
        process_checkout_v2()
    else:
        process_checkout_v1()

Characteristics: deploy without releasing, A/B testing, instant rollback per user segment, granular control independent of deployment.
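Gradual exposure usually relies on assigning each user to a stable bucket so the same user keeps seeing the same variant as the percentage grows. The sketch below shows one common way to do that with a hash; it is an illustration only, not Flagsmith's algorithm, and `in_rollout` is a hypothetical helper.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for a feature.

    Hashing feature and user together keeps buckets independent across features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


users = [f"user-{n}" for n in range(10_000)]
exposed = sum(in_rollout(u, "new_checkout_flow", 25) for u in users)
print(f"{exposed} of {len(users)} users see the new flow at 25%")
```

Because the bucket is fixed per user, raising the percentage only adds users; nobody who already had the feature loses it, and setting the percentage to 0 acts as the instant rollback.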
    name: Production Pipeline
    on:
      push:
        branches: [main]
    jobs:
      build:
        runs-on: ubuntu-latest
        outputs:
          image: ${{ steps.build.outputs.image }}
        steps:
          - uses: actions/checkout@v4
          - name: Build and push Docker image
            id: build
            run: |
              IMAGE=myapp:${{ github.sha }}
              docker build -t "$IMAGE" .
              docker push "$IMAGE"
              echo "image=$IMAGE" >> "$GITHUB_OUTPUT"
      test:
        needs: build
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4  # every job starts on a fresh runner
          - name: Unit tests
            run: make test
          - name: Security scan
            # Non-zero exit on HIGH/CRITICAL findings so the scan gates the pipeline
            run: trivy image --exit-code 1 --severity HIGH,CRITICAL ${{ needs.build.outputs.image }}
      deploy-staging:
        needs: test
        environment:
          name: staging
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Deploy to staging
            run: kubectl apply -f k8s/staging/
      integration-test:
        needs: deploy-staging
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Run E2E tests
            run: npm run test:e2e
      deploy-production:
        needs: integration-test
        environment:
          name: production  # blocks here until required reviewers approve
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Canary deployment
            run: |
              kubectl apply -f k8s/production/
              kubectl argo rollouts promote my-app
      verify:
        needs: deploy-production
        runs-on: ubuntu-latest
        steps:
          - name: Deep health check
            run: |
              for i in {1..12}; do
                STATUS=$(curl -sf https://app.example.com/health/ready | jq -r '.status')
                [ "$STATUS" = "ok" ] && exit 0
                sleep 10
              done
              exit 1
          - name: Notify on success
            run: |
              curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
                -d '{"text":"Production deployment successful: ${{ github.sha }}"}'

A shallow /ping returns 200 even when downstream dependencies are broken. Use a deep readiness endpoint that verifies actual dependencies before promoting traffic.
    # /health/ready: checks real dependencies, used by the pipeline gate
    from fastapi import FastAPI
    from fastapi.responses import JSONResponse

    app = FastAPI()

    # check_db_connection, check_redis_connection, and check_queue_connection
    # are application-specific async helpers that return True when healthy.
    @app.get("/health/ready")
    async def readiness():
        checks = {
            "database": await check_db_connection(),
            "cache": await check_redis_connection(),
            "queue": await check_queue_connection(),
        }
        status = "ok" if all(checks.values()) else "degraded"
        code = 200 if status == "ok" else 503
        return JSONResponse({"status": status, "checks": checks}, status_code=code)

    #!/usr/bin/env bash
    # verify-deployment.sh: run after every production deploy
    set -euo pipefail

    ENDPOINT="${1:?usage: verify-deployment.sh <base-url>}"
    MAX_ATTEMPTS=12
    SLEEP_SECONDS=10

    for i in $(seq 1 "$MAX_ATTEMPTS"); do
      # curl -f fails on the 503 returned for a degraded service, so that case
      # is reported as "unreachable" here
      STATUS=$(curl -sf "$ENDPOINT/health/ready" | jq -r '.status' 2>/dev/null || echo "unreachable")
      if [ "$STATUS" = "ok" ]; then
        echo "Health check passed after $(( (i - 1) * SLEEP_SECONDS ))s"
        exit 0
      fi
      echo "Attempt $i/$MAX_ATTEMPTS: status=$STATUS, retrying in ${SLEEP_SECONDS}s"
      sleep "$SLEEP_SECONDS"
    done

    echo "Health check failed after $((MAX_ATTEMPTS * SLEEP_SECONDS))s"
    exit 1

    # GitHub Actions job: deploy, verify, and roll back if any step fails
    deploy-and-verify:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - name: Deploy new version
          run: kubectl apply -f k8s/
        - name: Wait for rollout
          run: kubectl rollout status deployment/my-app --timeout=5m
        - name: Post-deployment health check
          id: health
          run: ./scripts/verify-deployment.sh https://app.example.com
        - name: Rollback on failure
          if: failure()
          run: |
            kubectl rollout undo deployment/my-app
            echo "Rolled back to previous revision"

    # List revision history with change-cause annotations
    kubectl rollout history deployment/my-app

    # Rollback to previous version
    kubectl rollout undo deployment/my-app

    # Rollback to a specific revision
    kubectl rollout undo deployment/my-app --to-revision=3

    # Verify rollback completed
    kubectl rollout status deployment/my-app

For advanced rollback strategies including database migration rollbacks and Argo Rollouts abort flows, see references/advanced-strategies.md.
| Metric | Target (Elite) | How to Measure |
|---|---|---|
| Deployment Frequency | Multiple/day | Pipeline run count per day |
| Lead Time for Changes | < 1 hour | Commit timestamp → production deploy |
| Change Failure Rate | < 5% | Failed deploys / total deploys |
| Mean Time to Recovery | < 1 hour | Incident open → service restored |
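As an illustration of the "How to Measure" column, the sketch below derives three of the four metrics from a list of deploy records. The record layout and the sample data are hypothetical; a real pipeline would pull these from its CI system's API. Mean Time to Recovery is left out because it needs incident timestamps rather than deploy records.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deploy records: (commit time, production deploy time, deploy failed?)
deploys = [
    (datetime(2024, 3, 15, 9, 0), datetime(2024, 3, 15, 9, 40), False),
    (datetime(2024, 3, 15, 11, 0), datetime(2024, 3, 15, 11, 30), True),
    (datetime(2024, 3, 15, 14, 0), datetime(2024, 3, 15, 14, 50), False),
    (datetime(2024, 3, 16, 10, 0), datetime(2024, 3, 16, 10, 20), False),
]


def dora_summary(records):
    """Summarize deployment frequency, lead time, and change failure rate."""
    # Only days with at least one deploy are counted, which overstates frequency
    # for teams that skip days.
    days = {deployed.date() for _, deployed, _ in records}
    lead_times = [deployed - committed for committed, deployed, _ in records]
    failures = sum(1 for _, _, failed in records if failed)
    return {
        "deploys_per_day": len(records) / len(days),
        "median_lead_time": median(lead_times),
        "change_failure_rate": failures / len(records),
    }


summary = dora_summary(deploys)
print(summary["deploys_per_day"])      # 2.0
print(summary["median_lead_time"])     # 0:35:00
print(summary["change_failure_rate"])  # 0.25
```

The median is used for lead time instead of the mean so that a single slow change does not dominate the figure.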
    - name: Verify error rate post-deployment
      run: |
        sleep 60  # allow metrics to accumulate
        # jq -r strips the JSON quotes so bc receives a bare number
        ERROR_RATE=$(curl -sf "$PROMETHEUS_URL/api/v1/query" \
          --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
          | jq -r '.data.result[0].value[1]')
        echo "Current error rate: $ERROR_RATE"
        if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
          echo "Error rate $ERROR_RATE exceeds 1% threshold: failing the step to trigger rollback"
          exit 1
        fi

When a deployment passes the pipeline health check but the service is actually unhealthy, the check is probably hitting a shallow /ping endpoint that returns 200 even when the database is unreachable. Use a deep readiness check that verifies actual dependencies (see the /health/ready example above).
Argo Rollouts requires a valid AnalysisTemplate to auto-promote. If the Prometheus query returns no data (e.g., metric name changed), the analysis stays inconclusive and promotion stalls. Add inconclusiveLimit so the rollout fails fast rather than hanging:
    spec:
      metrics:
        - name: error-rate
          failureCondition: "result[0] > 0.05"
          inconclusiveLimit: 2  # fail after 2 inconclusive results, not hang indefinitely
          provider:
            prometheus:
              query: |
                sum(rate(http_requests_total{status=~"5.."}[2m]))
                / sum(rate(http_requests_total[2m]))

Check that production environment protection rules are configured: a missing reviewer assignment means the approval gate waits indefinitely with no notification. In GitHub Actions, ensure Required reviewers is set to an existing user or team in Settings → Environments → production.
If COPY . . appears before dependency installation, any source file change invalidates the dependency layer. Reorder to copy dependency manifests first:
    # Good: dependencies cached separately from source code
    COPY package*.json ./
    RUN npm ci
    COPY . .
    RUN npm run build

A service rollback without a migration rollback causes schema/code mismatch errors. Always make migrations backward-compatible (additive only) for at least one release cycle, and keep undo scripts versioned alongside the migration:

    # migrations/V20240315__add_nullable_column.sql       (forward)
    # migrations/V20240315__add_nullable_column.undo.sql  (backward)

Never run destructive migrations (DROP COLUMN, ALTER NOT NULL) until the old code version is fully retired from all environments.
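One way to enforce the additive-only rule is a small check in the pipeline that flags destructive statements before a migration is merged. The sketch below is a minimal illustration: the function name and the pattern list are hypothetical, the patterns cover only the cases named above plus DROP TABLE, and splitting on semicolons is naive (it would misread semicolons inside string literals).

```python
import re

# Hypothetical patterns; extend them for your database dialect.
DESTRUCTIVE = [
    re.compile(r"\bDROP\s+(TABLE|COLUMN)\b", re.IGNORECASE),
    re.compile(r"\bSET\s+NOT\s+NULL\b", re.IGNORECASE),
]


def destructive_statements(sql: str) -> list[str]:
    """Return the statements in a migration that match a destructive pattern."""
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return [s for s in statements if any(p.search(s) for p in DESTRUCTIVE)]


additive = "ALTER TABLE orders ADD COLUMN note TEXT NULL;"
mixed = (
    "ALTER TABLE orders ADD COLUMN note TEXT NULL;"
    "ALTER TABLE orders DROP COLUMN legacy_id;"
    "ALTER TABLE orders ALTER COLUMN note SET NOT NULL;"
)

print(destructive_statements(additive))  # []
print(destructive_statements(mixed))
```

A check like this works best as a warning that requires an explicit override, since destructive migrations are legitimate once the old code version has been retired.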
For platform-specific pipeline configurations, multi-region promotion workflows, and advanced Argo Rollouts patterns, see:
references/advanced-strategies.md - Extended YAML examples, platform-specific configs (GitHub Actions, GitLab CI, Azure Pipelines), multi-region canary patterns, and database migration rollback strategies

Related skills:
github-actions-templates - For GitHub Actions implementation patterns and reusable workflows
gitlab-ci-patterns - For GitLab CI/CD pipeline implementation
secrets-management - For secrets handling in CI/CD pipelines