Deploy and Pray
Our deployment strategy used to be: merge to main, deploy to all pods, watch Slack for complaints. Professional? No. Common? Absolutely.
After a particularly bad deploy took down checkout for 23 minutes, we implemented canary deployments.
What Canary Deployments Actually Mean
A canary deployment routes a small percentage of traffic to the new version while monitoring for problems:
Traffic flow:

```
Users ──→ Load Balancer ──→ 95% → v1.2.3 (current)
                        └──→  5% → v1.2.4 (canary)
```
If the canary looks healthy after N minutes, gradually increase traffic. If it looks bad, kill it. Zero impact on the other 95% of users.
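Conceptually, the control loop is small. Here's a rough Python sketch of that promote-or-kill logic; `set_canary_weight`, `canary_is_healthy`, and `roll_back` are hypothetical helpers standing in for whatever your load balancer and metrics stack actually expose:

```python
import time

# Hypothetical promotion schedule and observation window (the "N minutes" above).
WEIGHT_STEPS = [5, 25, 50, 100]   # percent of traffic routed to the canary
OBSERVATION_WINDOW = 600          # seconds to watch metrics at each step

def run_canary(set_canary_weight, canary_is_healthy, roll_back):
    """Promote the canary step by step, or kill it at the first unhealthy window."""
    for weight in WEIGHT_STEPS:
        set_canary_weight(weight)        # shift more traffic to the new version
        time.sleep(OBSERVATION_WINDOW)   # let error/latency metrics accumulate
        if not canary_is_healthy():
            roll_back()                  # kill the canary; stable traffic is untouched
            return False
    return True                          # canary now serves 100% of traffic
```

The whole value of the pattern is in that early return: a bad release never gets past the first small slice of traffic.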
Our Canary Pipeline
```yaml
# .github/workflows/canary-deploy.yml
canary_deploy:
  steps:
    - name: Deploy canary (5%)
      run: |
        kubectl set image deployment/api-canary api=api:${{ github.sha }}
        kubectl scale deployment/api-canary --replicas=1
        # Configure traffic split
        kubectl apply -f - <<EOF
        apiVersion: split.smi-spec.io/v1alpha1
        kind: TrafficSplit
        metadata:
          name: api-canary
        spec:
          service: api
          backends:
          - service: api-stable
            weight: 95
          - service: api-canary
            weight: 5
        EOF

    - name: Wait and analyze (10 minutes)
      run: |
        sleep 600
        # Check canary health via the Prometheus HTTP API
        ERROR_RATE=$(curl -sG 'prometheus/api/v1/query' \
          --data-urlencode 'query=rate(http_errors{version="canary"}[5m])' \
          | jq -r '.data.result[0].value[1]')
        LATENCY=$(curl -sG 'prometheus/api/v1/query' \
          --data-urlencode 'query=histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))' \
          | jq -r '.data.result[0].value[1]')
        echo "Canary error rate: $ERROR_RATE"
        echo "Canary p99 latency: $LATENCY"
        if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
          echo "CANARY FAILED: Error rate too high"
          exit 1
        fi

    - name: Promote to 50%
      run: |
        kubectl apply -f traffic-split-50.yaml  # same TrafficSplit, 50/50 weights
        sleep 600  # Wait another 10 min

    - name: Full rollout
      run: |
        kubectl set image deployment/api-stable api=api:${{ github.sha }}
        kubectl delete deployment api-canary
        kubectl delete trafficsplit api-canary
```
The Canary Checklist
What we check during the canary window:
```python
CANARY_CHECKS = {
    'error_rate': {
        'query': 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.01,   # Max 1% errors
        'comparison': 'less_than'
    },
    'latency_p99': {
        'query': 'histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))',
        'threshold': 0.5,    # Max 500ms
        'comparison': 'less_than'
    },
    'success_rate': {
        'query': 'rate(http_2xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.99,   # Min 99% success
        'comparison': 'greater_than'
    },
    'memory_usage': {
        'query': 'container_memory_working_set_bytes{version="canary"}',
        'threshold': 512 * 1024 * 1024,  # Max 512MB
        'comparison': 'less_than'
    }
}
```
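The analysis step is just a loop over that dict. Below is a hedged sketch of how it could be wired to Prometheus; the endpoint URL and the `evaluate_checks` helper are illustrative assumptions, not a copy of our production code:

```python
import requests

# Placeholder Prometheus endpoint; point this at your own instance.
PROMETHEUS_URL = "http://prometheus:9090/api/v1/query"

def evaluate_checks(checks=CANARY_CHECKS):
    """Run every canary check and return a list of human-readable failures."""
    failures = []
    for name, check in checks.items():
        resp = requests.get(PROMETHEUS_URL, params={"query": check["query"]}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        if not result:
            # No samples for the canary yet: treat missing data as a failure.
            failures.append(f"{name}: no data returned")
            continue
        value = float(result[0]["value"][1])
        if check["comparison"] == "less_than":
            passed = value < check["threshold"]
        else:
            passed = value > check["threshold"]
        if not passed:
            failures.append(f"{name}: {value:.4f} violates threshold {check['threshold']}")
    return failures  # empty list means the canary passed every check
```

An empty return value maps to "promote"; anything else maps to "roll back".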
Results After 6 Months
| Metric | Before | After |
|---|---|---|
| Rollback rate | 15% of deploys | 3% of deploys |
| Mean time to detect bad deploy | 25 min | 8 min |
| Customer-facing incidents from deploys | 4/month | 0.5/month |
| Deploy frequency | 1x/day (afraid) | 5x/day (confident) |
The counterintuitive result: we deploy MORE often now because we're less afraid. And because each deploy is smaller, issues are easier to find.
Start Simple
You don't need Istio or a service mesh for canary deploys. Start with:
- Two deployment objects (stable + canary)
- A load balancer that supports weighted routing
- A script that checks error rates after deploy (a minimal sketch follows this list)
- A human who decides whether to promote or roll back
Automate from there.
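For the "script that checks error rates" piece, something this small is enough to start; the Prometheus URL and metric names below are placeholders for whatever your monitoring stack exposes:

```python
import requests

# Placeholder query: 5xx rate divided by total request rate for the canary pods.
QUERY = 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])'

resp = requests.get("http://prometheus:9090/api/v1/query", params={"query": QUERY}, timeout=10)
result = resp.json()["data"]["result"]
error_rate = float(result[0]["value"][1]) if result else 0.0

print(f"Canary error rate over the last 5 minutes: {error_rate:.2%}")
print("Suggested action:", "ROLL BACK" if error_rate > 0.01 else "promote")
```

A person reads the output and makes the call; the automation can come later.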
If you want AI-powered canary analysis that automatically promotes or rolls back, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com