Introduction
As part of the HNG Internship DevOps Track Stage 4B, I extended my Stage 4A project — SwiftDeploy — into a fully observable, policy-aware deployment platform.
In Stage 4A, SwiftDeploy could:
- generate infrastructure files from a declarative manifest
- deploy containers using Docker Compose
- manage deployment modes (stable/canary)
- configure Nginx automatically
Stage 4B transformed it into something much closer to a real production deployment system by adding:
- Prometheus instrumentation
- Open Policy Agent (OPA) policy enforcement
- live operational dashboards
- deployment safety gates
- audit logging and reporting
- chaos engineering validation
The result is a deployment tool that not only deploys services, but also decides whether deployments are safe enough to proceed.
The Core Philosophy: One Manifest, Everything Else Generated
SwiftDeploy is built around a single principle:
manifest.yaml is the only file you should ever edit manually.
Everything else is generated from it.
Here is the manifest structure:
```yaml
services:
  name: app
  image: swift-deploy-1-node:latest
  port: 3000
  version: "1.0.0"
  mode: stable
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
network:
  name: swiftdeploy-net
  driver_type: bridge
```
From this manifest, the CLI generates:
- generated/nginx.conf
- generated/docker-compose.yml
- OPA runtime configuration
This design provides:
- consistency
- reproducibility
- environment portability
- infrastructure-as-code discipline
The grader can delete all generated files and rerun:
./swiftdeploy init
and the entire stack regenerates correctly.
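The generation step can be pictured as simple template rendering. The sketch below, using only the standard library, shows how an Nginx config fragment might be rendered from the parsed manifest; the template text and function name are illustrative, not the project's actual code, which renders the full nginx.conf and docker-compose.yml.

```python
from string import Template

# Hypothetical template fragment; the real project renders complete
# nginx.conf and docker-compose.yml files from manifest.yaml.
NGINX_TEMPLATE = Template("""\
upstream app {
    server $service_name:$service_port;
}
server {
    listen $nginx_port;
    location / {
        proxy_pass http://app;
        proxy_read_timeout ${proxy_timeout}s;
    }
}
""")

def render_nginx(manifest: dict) -> str:
    """Render an Nginx config from the parsed manifest.yaml dict."""
    return NGINX_TEMPLATE.substitute(
        service_name=manifest["services"]["name"],
        service_port=manifest["services"]["port"],
        nginx_port=manifest["nginx"]["port"],
        proxy_timeout=manifest["nginx"]["proxy_timeout"],
    )
```

Because the output is a pure function of the manifest, deleting `generated/` and rerunning `init` always produces the same files.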
Architecture Overview
The system architecture flows through the following components:

```
User
  ↓
Nginx Reverse Proxy
  ↓
Flask API Service
  ↓
Prometheus Metrics
  ↓
SwiftDeploy CLI
  ↓
OPA Policy Engine
```
The deployment stack includes:
- Flask application container
- Nginx reverse proxy
- Open Policy Agent (OPA)
- internal Docker network
- named log volumes
The SwiftDeploy CLI
The heart of the project is the swiftdeploy executable.
It is a Python-based CLI tool that manages the entire deployment lifecycle.
Supported Commands
| Command | Purpose |
|---------|---------|
| init | Generate config files from templates |
| validate | Run pre-flight validation checks |
| deploy | Start the stack |
| promote canary | Switch deployment into canary mode |
| promote stable | Return deployment to stable mode |
| status | Live metrics dashboard |
| audit | Generate audit report |
| teardown | Destroy containers and networks |
The API Service
The API service is a Flask application that supports both stable and canary deployment modes.
Deployment mode is controlled through the MODE environment variable.
Endpoints
Root Endpoint
GET /
Returns:
- deployment mode
- version
- timestamp
Example:
```json
{
  "message": "Welcome to SwiftDeploy",
  "mode": "stable",
  "version": "1.0.0"
}
```
Health Endpoint
GET /healthz
Returns:
- health status
- application uptime
Chaos Endpoint
POST /chaos
Available only in canary mode.
Supports:
```json
{ "mode": "slow", "duration": 3 }
{ "mode": "error", "rate": 0.5 }
{ "mode": "recover" }
```
This endpoint was used to simulate:
- degraded latency
- random failures
- recovery workflows
Instrumentation: The /metrics Endpoint
One of the biggest upgrades in Stage 4B was observability.
I instrumented the Flask service using the prometheus_client library.
The service now exposes:
GET /metrics
in Prometheus text format.
Metrics Collected
Request Throughput
http_requests_total
Labels:
method
path
status_code
Example:
http_requests_total{method="GET",path="/",status_code="200"} 152
Request Latency
http_request_duration_seconds
Histogram used for:
- latency analysis
- P99 calculation
Application Uptime
app_uptime_seconds
Tracks process uptime.
Deployment Mode
app_mode
Values:
- 0 = stable
- 1 = canary
Chaos State
chaos_active
Values:
- 0 = none
- 1 = slow
- 2 = error
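The service's real instrumentation uses prometheus_client, but the text exposition format it produces is easy to illustrate with the standard library alone. This sketch (class and method names are my own, not the project's) shows how counters and gauges end up as the `/metrics` lines above:

```python
import time

class Metrics:
    """Stdlib-only stand-in for the service's prometheus_client metrics.
    It only illustrates the text exposition format scraped at /metrics."""

    def __init__(self):
        self.start = time.time()
        self.requests = {}  # (method, path, status_code) -> count

    def observe(self, method, path, status):
        """Increment the request counter for one label combination."""
        key = (method, path, str(status))
        self.requests[key] = self.requests.get(key, 0) + 1

    def render(self, mode="stable"):
        """Render metrics in Prometheus text exposition format."""
        lines = ["# TYPE http_requests_total counter"]
        for (m, p, s), n in sorted(self.requests.items()):
            lines.append(
                f'http_requests_total{{method="{m}",path="{p}",status_code="{s}"}} {n}')
        lines.append("# TYPE app_uptime_seconds gauge")
        lines.append(f"app_uptime_seconds {time.time() - self.start:.0f}")
        lines.append("# TYPE app_mode gauge")
        lines.append(f"app_mode {0 if mode == 'stable' else 1}")
        return "\n".join(lines)
```

In the actual service, prometheus_client's `Counter`, `Histogram`, and `Gauge` objects handle this bookkeeping and exposition automatically.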
Why Metrics Matter
Without metrics:
- deployments are blind
- failures become invisible
- canary safety cannot be enforced
Metrics became the foundation for:
- policy decisions
- dashboards
- auditing
- promotion safety
Open Policy Agent (OPA): The Brain of SwiftDeploy
The most important design principle in Stage 4B was:
The CLI must never make allow/deny decisions itself.
All decision-making lives entirely inside OPA.
SwiftDeploy only:
- gathers data
- sends context to OPA
- acts on the response
This separation makes the system:
- modular
- secure
- maintainable
- extensible
OPA Policy Domains
I separated policies into independent domains.
Each policy:
- answers one question
- owns its own logic
- operates independently
Infrastructure Policy
Runs before deployment.
Blocks deployment when:
- disk free space is below 10GB
- CPU load exceeds 2.0
Rego Example
```rego
package infra

default allow = false

allow {
    input.disk_free_gb >= data.thresholds.disk_free_gb
    input.cpu_load <= data.thresholds.cpu_load
}
```
Canary Safety Policy
Runs before promotion.
Blocks promotion when:
- error rate exceeds 1%
- P99 latency exceeds 500ms

Rego Example

```rego
package canary

default allow = false

allow {
    input.error_rate <= data.thresholds.error_rate
    input.p99_latency_ms <= data.thresholds.p99_latency_ms
}
```
Policy Thresholds
Thresholds are stored separately in:
policies/data.json
Example:
```json
{
  "thresholds": {
    "disk_free_gb": 10,
    "cpu_load": 2.0,
    "error_rate": 0.01,
    "p99_latency_ms": 500
  }
}
```
This prevents:
- hardcoded values
- duplicated configuration
- policy coupling

OPA Isolation

The OPA container runs on an internal Docker network. It is intentionally NOT exposed through Nginx. Only the CLI can access OPA directly via:

http://localhost:8181
This prevents external users from:
- querying policies
- bypassing deployment logic
- inspecting internal rules

This mirrors real production security architecture.
Pre-Deploy Policy Enforcement
Before deployment, SwiftDeploy collects:
- CPU load
- available disk space
Example payload:
```json
{
  "disk_free_gb": 8.5,
  "cpu_load": 2.4
}
```
OPA evaluates the payload.
If policies fail:
```
Deployment blocked:
Infrastructure policy violation
```
The deployment never proceeds.
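The gather-then-ask pattern can be sketched with the standard library. This is not the project's actual code: the function names are mine, and I assume OPA's standard `/v1/data/<package>/allow` decision API. Note the fail-closed behavior: any error from OPA blocks the deployment rather than crashing the CLI.

```python
import json
import os
import shutil
import urllib.error
import urllib.request

def gather_infra_payload(path="/"):
    """Collect the facts the infrastructure policy evaluates."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    cpu_load = os.getloadavg()[0]  # 1-minute load average (Unix only)
    return {"disk_free_gb": round(free_gb, 1), "cpu_load": round(cpu_load, 2)}

def opa_allows(payload, package="infra", opa_url="http://localhost:8181"):
    """Ask OPA for an allow/deny decision. Fails closed on any error."""
    req = urllib.request.Request(
        f"{opa_url}/v1/data/{package}/allow",
        data=json.dumps({"input": payload}).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=3) as resp:
            return json.load(resp).get("result") is True
    except (urllib.error.URLError, TimeoutError, ValueError):
        return False  # treat OPA downtime or malformed responses as a veto
```

The CLI itself never compares numbers against thresholds; it only forwards the payload and obeys the answer.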
Canary Safety Enforcement
Before promotion, SwiftDeploy:
- scrapes /metrics
- calculates error rate
- calculates P99 latency
- submits metrics to OPA
If the canary is unhealthy:
- promotion is blocked
- rollout is prevented

This introduces production-grade deployment safety.
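Computing P99 from Prometheus text output means reading the cumulative histogram buckets. The sketch below is one way to do it (function name and interpolation scheme are my own, not the project's exact implementation): find the first bucket whose cumulative count covers 99% of observations and interpolate within it.

```python
import re

def p99_from_histogram(metrics_text, metric="http_request_duration_seconds"):
    """Estimate P99 latency in ms from Prometheus histogram buckets.

    Buckets are cumulative: each le="X" line counts all observations <= X.
    We locate the bucket containing the 99th percentile and linearly
    interpolate between its bounds.
    """
    buckets = []
    pattern = rf'{metric}_bucket{{[^}}]*le="([^"]+)"[^}}]*}} (\S+)'
    for m in re.finditer(pattern, metrics_text):
        le, count = m.group(1), float(m.group(2))
        buckets.append((float("inf") if le == "+Inf" else float(le), count))
    if not buckets:
        return None
    buckets.sort()
    target = 0.99 * buckets[-1][1]  # 99% of total observations
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                return prev_bound * 1000  # best available estimate
            frac = (target - prev_count) / max(count - prev_count, 1e-9)
            return (prev_bound + frac * (bound - prev_bound)) * 1000
        prev_bound, prev_count = bound, count
    return None
```

The same scrape also yields the error rate, by dividing `http_requests_total` counts with 5xx status codes by the total.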
The Status Dashboard
The status command provides a live operational dashboard.
./swiftdeploy status
The dashboard:
- refreshes continuously
- scrapes live metrics
- calculates request rate
- calculates P99 latency
- evaluates policy compliance
- appends results to history.jsonl
Example output:
```
SwiftDeploy Status Dashboard
==================================================
Mode: canary
Chaos: error
Error Rate: 52%
P99 Latency: 430ms

Policy Compliance:
✓ Infrastructure policy: PASSING
✗ Canary safety policy: FAILING
```
Chaos Engineering
This was one of the most interesting parts of the project.
I intentionally injected:
- high error rates
- slow responses
Example:
curl -X POST http://localhost:8080/chaos -d '{"mode":"error","rate":0.9}'
Immediately:
- metrics reflected failures
- policies began failing
- promotions were blocked

This validated that:
- metrics were accurate
- policies were functional
- safety gates worked correctly
Audit Logging
Every:
- deploy
- promote
- status scrape
- policy violation

is appended to history.jsonl.
Example entry:
```json
{
  "timestamp": "2026-05-06T12:00:00",
  "mode": "canary",
  "error_rate": 0.52
}
```
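JSONL works well for audit logs because each event is a single append-only line, so concurrent writes never corrupt earlier records and the file can be tailed or grepped directly. A minimal sketch of the append path (function name is illustrative):

```python
import json
from datetime import datetime, timezone

def append_audit(event: dict, path="history.jsonl"):
    """Append one audit record as a single JSON line (JSONL).

    A UTC timestamp is stamped on every event; the caller supplies
    the rest (mode, error_rate, policy results, etc.).
    """
    record = {"timestamp": datetime.now(timezone.utc).isoformat(), **event}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Because each line is independent JSON, the audit reporter can stream the file without loading it all into memory.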
Audit Report Generation
Running:
./swiftdeploy audit
generates:
audit_report.md
The report includes:
- deployment timeline
- mode changes
- chaos injections
- policy violations
Example:
| Timestamp | Policy | Details |
|-----------|--------|---------|
| 2026-05-06T00:47:10Z | Canary Safety | error_rate=50% |
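Generating the violations table is a straightforward pass over the JSONL log. This sketch assumes hypothetical field names (`policy_violation`, `policy`, `details`) rather than the project's actual schema:

```python
import json

def render_report(jsonl_path="history.jsonl"):
    """Render the policy-violation table of audit_report.md from the
    JSONL audit log. Field names here are illustrative assumptions."""
    rows = ["| Timestamp | Policy | Details |",
            "|-----------|--------|---------|"]
    with open(jsonl_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry.get("policy_violation"):
                rows.append(f"| {entry['timestamp']} | {entry['policy']} "
                            f"| {entry.get('details', '')} |")
    return "\n".join(rows)
```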
Challenges Faced
Python Virtual Environment Issues

Ubuntu’s externally-managed Python environment caused repeated package installation failures.
The solution was:
- recreating the virtual environment
- installing dependencies inside the venv only
Nginx Validation Problems

Generated Nginx configs initially failed validation due to unresolved upstream references.
Fix:
- validate only inside container context
- avoid host-side upstream resolution
Metrics Parsing

Calculating:
- error rate
- P99 latency

from the Prometheus text format required careful parsing and aggregation.
OPA Failure Handling

The CLI had to gracefully handle:
- OPA downtime
- connection failures
- malformed responses

The system never crashes when OPA becomes unavailable.
Lessons Learned
Declarative Systems Scale Better
A single source of truth drastically reduces configuration drift.
Observability Is Mandatory
Without metrics:
- policy enforcement becomes impossible
- deployments become blind
Policy Engines Should Be Isolated

Keeping OPA internal-only mirrors real enterprise architectures.
Chaos Engineering Builds Confidence
Breaking the system intentionally proved that:
- metrics were accurate
- policies were effective
- safety mechanisms worked
Automation Must Be Explainable
Every policy response included human-readable reasoning.
This made debugging and operational decisions much easier.
Final Thoughts
Stage 4B transformed SwiftDeploy from a deployment generator into a lightweight deployment platform with:
- observability
- governance
- auditing
- deployment safety
The project demonstrated how:
- metrics
- policy engines
- infrastructure generation
- deployment orchestration

can work together to create reliable deployment systems.

Most importantly, it reinforced a key DevOps principle:

Safe automation is more valuable than fast automation.