Deploy and Pray
Our deployment strategy used to be: merge to main, deploy to all pods, watch Slack for complaints. Professional? No. Common? Absolutely.
After a particularly bad deploy took down checkout for 23 minutes, we implemented canary deployments.
What Canary Deployments Actually Mean
A canary deployment routes a small percentage of traffic to the new version while monitoring for problems:
Traffic flow:

```
Users ──→ Load Balancer ──→ 95% → v1.2.3 (current)
                        └──→  5% → v1.2.4 (canary)
```
If the canary looks healthy after N minutes, gradually increase traffic. If it looks bad, kill it. Zero impact on the other 95% of users.
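Conceptually, the control loop is small. Here's a rough Python sketch of that promote-or-kill logic; `set_canary_weight`, `canary_is_healthy`, and `roll_back` are hypothetical helpers standing in for whatever your load balancer and metrics stack actually expose:

```python
import time

# Hypothetical promotion schedule and observation window (the "N minutes" above).
WEIGHT_STEPS = [5, 25, 50, 100]   # percent of traffic routed to the canary
OBSERVATION_WINDOW = 600          # seconds to watch metrics at each step

def run_canary(set_canary_weight, canary_is_healthy, roll_back):
    """Promote the canary step by step, or kill it at the first unhealthy window."""
    for weight in WEIGHT_STEPS:
        set_canary_weight(weight)        # shift more traffic to the new version
        time.sleep(OBSERVATION_WINDOW)   # let error/latency metrics accumulate
        if not canary_is_healthy():
            roll_back()                  # kill the canary; stable traffic is untouched
            return False
    return True                          # canary now serves 100% of traffic
```

The whole value of the pattern is in that early return: a bad release never gets past the first small slice of traffic.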
Our Canary Pipeline
```yaml
# .github/workflows/canary-deploy.yml
canary_deploy:
  steps:
    - name: Deploy canary (5%)
      run: |
        kubectl set image deployment/api-canary api=api:${{ github.sha }}
        kubectl scale deployment/api-canary --replicas=1
        # Configure traffic split
        kubectl apply -f - <<EOF
        apiVersion: split.smi-spec.io/v1alpha1
        kind: TrafficSplit
        metadata:
          name: api-canary
        spec:
          service: api
          backends:
          - service: api-stable
            weight: 95
          - service: api-canary
            weight: 5
        EOF

    - name: Wait and analyze (10 minutes)
      run: |
        sleep 600
        # Check canary health via the Prometheus HTTP API
        ERROR_RATE=$(curl -sG 'prometheus/api/v1/query' \
          --data-urlencode 'query=rate(http_errors{version="canary"}[5m])' \
          | jq -r '.data.result[0].value[1]')
        LATENCY=$(curl -sG 'prometheus/api/v1/query' \
          --data-urlencode 'query=histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))' \
          | jq -r '.data.result[0].value[1]')
        echo "Canary error rate: $ERROR_RATE"
        echo "Canary p99 latency: $LATENCY"
        if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
          echo "CANARY FAILED: Error rate too high"
          exit 1
        fi

    - name: Promote to 50%
      run: |
        kubectl apply -f traffic-split-50.yaml  # same TrafficSplit, 50/50 weights
        sleep 600  # Wait another 10 min

    - name: Full rollout
      run: |
        kubectl set image deployment/api-stable api=api:${{ github.sha }}
        kubectl delete deployment api-canary
        kubectl delete trafficsplit api-canary
```
The Canary Checklist
What we check during the canary window:
```python
CANARY_CHECKS = {
    'error_rate': {
        'query': 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.01,   # Max 1% errors
        'comparison': 'less_than'
    },
    'latency_p99': {
        'query': 'histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))',
        'threshold': 0.5,    # Max 500ms
        'comparison': 'less_than'
    },
    'success_rate': {
        'query': 'rate(http_2xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.99,   # Min 99% success
        'comparison': 'greater_than'
    },
    'memory_usage': {
        'query': 'container_memory_working_set_bytes{version="canary"}',
        'threshold': 512 * 1024 * 1024,  # Max 512MB
        'comparison': 'less_than'
    }
}
```
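The analysis step is just a loop over that dict. Below is a hedged sketch of how it could be wired to Prometheus; the endpoint URL and the `evaluate_checks` helper are illustrative assumptions, not a copy of our production code:

```python
import requests

# Placeholder Prometheus endpoint; point this at your own instance.
PROMETHEUS_URL = "http://prometheus:9090/api/v1/query"

def evaluate_checks(checks=CANARY_CHECKS):
    """Run every canary check and return a list of human-readable failures."""
    failures = []
    for name, check in checks.items():
        resp = requests.get(PROMETHEUS_URL, params={"query": check["query"]}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        if not result:
            # No samples for the canary yet: treat missing data as a failure.
            failures.append(f"{name}: no data returned")
            continue
        value = float(result[0]["value"][1])
        if check["comparison"] == "less_than":
            passed = value < check["threshold"]
        else:
            passed = value > check["threshold"]
        if not passed:
            failures.append(f"{name}: {value:.4f} violates threshold {check['threshold']}")
    return failures  # empty list means the canary passed every check
```

An empty return value maps to "promote"; anything else maps to "roll back".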
Results After 6 Months
| Metric | Before | After |
|---|---|---|
| Rollback rate | 15% of deploys | 3% of deploys |
| Mean time to detect bad deploy | 25 min | 8 min |
| Customer-facing incidents from deploys | 4/month | 0.5/month |
| Deploy frequency | 1x/day (afraid) | 5x/day (confident) |
The counterintuitive result: we deploy MORE often now because we're less afraid. And because each deploy is smaller, issues are easier to find.
Start Simple
You don't need Istio or a service mesh for canary deploys. Start with:
- Two deployment objects (stable + canary)
- A load balancer that supports weighted routing
- A script that checks error rates after deploy (a minimal sketch follows this list)
- A human who decides whether to promote or roll back
Automate from there.
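For the "script that checks error rates" piece, something this small is enough to start; the Prometheus URL and metric names below are placeholders for whatever your monitoring stack exposes:

```python
import requests

# Placeholder query: 5xx rate divided by total request rate for the canary pods.
QUERY = 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])'

resp = requests.get("http://prometheus:9090/api/v1/query", params={"query": QUERY}, timeout=10)
result = resp.json()["data"]["result"]
error_rate = float(result[0]["value"][1]) if result else 0.0

print(f"Canary error rate over the last 5 minutes: {error_rate:.2%}")
print("Suggested action:", "ROLL BACK" if error_rate > 0.01 else "promote")
```

A person reads the output and makes the call; the automation can come later.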
If you want AI-powered canary analysis that automatically promotes or rolls back, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com