Most DevOps tasks start with a manual checklist: Is the disk full? Is the latency too high? Should we promote this Canary? In my latest project for the HNG Internship, I decided that "manual" wasn't fast enough. I didn't just want to deploy code; I wanted to build a tool that protects itself.

I upgraded my CLI tool, swiftdeploy, from a simple script to a policy-driven engine with its own "Eyes" (Metrics) and "Brain" (Open Policy Agent). Here is how I did it.

The Architecture: A Single Source of Truth
The core of the project is manifest.yaml. I wanted to follow the declarative infrastructure philosophy, where I describe what I want and the tool figures out how to build it.

My tool takes this manifest and programmatically generates the docker-compose.yml and nginx.conf. No more hand-writing configs or fixing typos in Nginx blocks.
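As a rough illustration of that generation step, here is a minimal Python sketch. It assumes the manifest has already been parsed into a dict. The field names (service, image, app_port, nginx_port) and the template are hypothetical and are not taken from swiftdeploy itself.

```python
from string import Template

# Hypothetical manifest, as it might look after parsing manifest.yaml.
manifest = {
    "service": "engine",
    "image": "swiftdeploy/engine:1.0",
    "app_port": 8000,
    "nginx_port": 8080,
}

# Assumed Nginx template; only the $placeholders are substituted.
NGINX_TEMPLATE = Template("""\
server {
    listen $nginx_port;
    location / {
        proxy_pass http://$service:$app_port;
    }
}
""")


def render_nginx(m: dict) -> str:
    """Fill the Nginx template from the manifest values."""
    return NGINX_TEMPLATE.substitute(m)


def render_compose(m: dict) -> dict:
    """Build the docker-compose structure as a plain dict (serialize it to YAML separately)."""
    return {
        "services": {
            m["service"]: {"image": m["image"], "expose": [str(m["app_port"])]},
            "nginx": {
                "image": "nginx:alpine",
                "ports": [f'{m["nginx_port"]}:{m["nginx_port"]}'],
                "volumes": ["./nginx.conf:/etc/nginx/conf.d/default.conf:ro"],
            },
        }
    }


if __name__ == "__main__":
    print(render_nginx(manifest))
    print(render_compose(manifest))
```

Because both outputs are derived from the same dict, changing a port in the manifest changes it everywhere at once.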

Giving it "Eyes" (Instrumentation)
You can't manage what you can't see. I instrumented my API service (the engine) to expose a /metrics endpoint in Prometheus format.
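For readers who have not seen the Prometheus text format, the sketch below shows what a /metrics response body could contain for a simple request counter. The metric name http_requests_total and the helper functions are assumptions for illustration. A real service would more likely use a Prometheus client library than hand-render the text.

```python
from collections import Counter

# In-memory request counter keyed by HTTP status code (illustrative only).
requests_total = Counter()


def record_request(status: int) -> None:
    """Count one handled request under its status code."""
    requests_total[str(status)] += 1


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests by status code.",
        "# TYPE http_requests_total counter",
    ]
    for status, count in sorted(requests_total.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for code in (200, 200, 500):
        record_request(code)
    print(render_metrics())
```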

I focused on three signals, loosely modeled on the Golden Signals:

Throughput: Tracking every request and status code.

Latency: Using histograms to calculate P99 latency, because if 1% of your users are waiting 5 seconds, your app is broken even if the average is fine.

Health: Tracking uptime and whether Chaos Mode was active.
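To make the histogram-to-P99 step concrete, here is a minimal sketch of estimating a quantile from cumulative histogram buckets. It interpolates linearly inside the bucket that contains the target rank, which is the same general idea Prometheus's histogram_quantile uses. The bucket boundaries and counts are made up, and the function name is hypothetical.

```python
import math


def quantile_from_buckets(q: float, buckets: list) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound, with a final +Inf bucket
    holding the total observation count.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Cannot interpolate here; report the last known bound.
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Made-up data: 90 requests under 0.1s, 98 under 0.5s, all 100 under 1s.
    buckets = [(0.1, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
    # The P99 lands in the 0.5s-1.0s bucket, although most requests were fast.
    print(quantile_from_buckets(0.99, buckets))
```

The estimate is only as precise as the bucket boundaries, so it helps to place a boundary near any threshold you plan to gate on.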

Giving it a "Brain" (The OPA Sidecar)
This was the biggest challenge. I integrated Open Policy Agent (OPA) as a sidecar container.

Instead of hardcoding "if" statements in my Python/Bash script, I moved all the decision-making logic into Rego files.

Why decoupling matters:
If I want to change the "Safety Standard" (e.g., changing the allowed error rate from 1% to 0.5%), I don't touch my CLI code. I just update the .rego policy.

I implemented two core policies:

Infra Policy: Denies deployment if the host has less than 10GB of free disk space.

Canary Safety Policy: Denies promotion if the Canary's P99 latency is over 500ms or its error rate exceeds the allowed threshold.
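The actual rules live in Rego files that are not shown in this post. The Python below is only a stand-in that illustrates the shape of such policies: thresholds kept as data, and each rule returning a list of human-readable deny reasons. The input field names (free_disk_gb, p99_ms, error_rate) and the 1% error-rate limit are assumptions.

```python
# Thresholds kept as data, separate from the evaluation logic.
# The error-rate limit is an assumed value for illustration.
THRESHOLDS = {"min_free_disk_gb": 10, "max_p99_ms": 500, "max_error_rate": 0.01}


def infra_deny(inp: dict, t: dict = THRESHOLDS) -> list:
    """Return deny reasons for the host; an empty list means allowed."""
    reasons = []
    if inp["free_disk_gb"] < t["min_free_disk_gb"]:
        reasons.append(
            f"Free disk is {inp['free_disk_gb']}GB (Minimum: {t['min_free_disk_gb']}GB)"
        )
    return reasons


def canary_deny(inp: dict, t: dict = THRESHOLDS) -> list:
    """Return deny reasons for the Canary; an empty list means allowed."""
    reasons = []
    if inp["p99_ms"] > t["max_p99_ms"]:
        reasons.append(
            f"P99 Latency is {inp['p99_ms']}ms (Threshold: {t['max_p99_ms']}ms)"
        )
    if inp["error_rate"] > t["max_error_rate"]:
        reasons.append("Error rate too high")
    return reasons


if __name__ == "__main__":
    print(infra_deny({"free_disk_gb": 4}))
    print(canary_deny({"p99_ms": 2000, "error_rate": 0.0}))
```

Collecting reasons instead of returning a bare boolean is what lets the CLI explain why a request was denied.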

The "Gated" Lifecycle: Look Before You Leap
I updated the swiftdeploy CLI to be "Gated."

Before the promote command actually switches traffic from Canary to Stable, it does a Pre-Promote Check:

It scrapes the /metrics from the running Canary.

It sends that data to OPA.

OPA evaluates the data against the Rego policies.

If OPA says "Deny," the CLI stops the deployment and explains exactly why (e.g., "Error rate too high").
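The steps above can be sketched in Python as follows. OPA's Data API accepts a POST to /v1/data/<policy path> with a JSON body of the form {"input": ...} and answers with {"result": ...}. The policy path, port, and function names here are assumptions, and the sketch treats an unreachable OPA as a denial, which is a design choice rather than something the post confirms.

```python
import json
import urllib.request

# Hypothetical address and policy path for the OPA sidecar.
OPA_URL = "http://localhost:8181/v1/data/swiftdeploy/canary/deny"


def query_opa(input_doc: dict, url: str = OPA_URL) -> list:
    """POST the input document to OPA and return its list of deny reasons."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        # OPA omits "result" when the rule is undefined; treat that as no denials.
        return json.load(resp).get("result", [])


def pre_promote_check(metrics: dict, ask=query_opa) -> tuple:
    """Return (allowed, message). Blocks promotion if OPA denies or is unreachable."""
    try:
        reasons = ask(metrics)
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        return False, f"Promotion Blocked: could not reach OPA ({exc})"
    if reasons:
        return False, "Promotion Blocked: " + "; ".join(reasons)
    return True, "Promotion allowed"


if __name__ == "__main__":
    # Stand-in for OPA so the example runs without a sidecar.
    fake_opa = lambda m: ["P99 Latency is 2000ms (Threshold: 500ms)"]
    print(pre_promote_check({"p99_ms": 2000}, ask=fake_opa))
```

Passing the OPA call in as a parameter keeps the gate easy to test without a running sidecar.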

Testing with Chaos
To prove it worked, I had to break things. I used a /chaos endpoint to inject a "slow" state into the Canary.

When I ran swiftdeploy status, my real-time dashboard showed the P99 latency shooting up. When I tried to promote that "sick" Canary to production, the CLI refused.

CLI Output: Promotion Blocked: P99 Latency is 2000ms (Threshold: 500ms).

That is the moment I knew the "Brain" was working.

Lessons Learned
Fail Fast: Pre-flight validation is a lifesaver. My tool checks if the Nginx port is already taken before it even tries to start a container.

Observability is not optional: Without the /metrics endpoint, I would have been flying blind.

Policy as Code: OPA makes infrastructure audit-friendly and incredibly flexible.
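The pre-flight port check mentioned under Fail Fast can be approximated with a few lines of standard-library Python. This is a sketch of one common approach (try to bind the port and see whether the OS refuses), not necessarily how swiftdeploy does it. The function name is hypothetical.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


if __name__ == "__main__":
    if not port_is_free(8080):
        raise SystemExit("Pre-flight failed: port 8080 is already taken")
    print("Port 8080 is free")
```

There is a small race between this check and the container actually starting, so it works best as an early, friendly error rather than a guarantee.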

Final Thought
Most DevOps tasks ask you to configure infrastructure. This one asked me to build the tool that manages the infrastructure. It’s been an intense journey from writing basic PHP/MySQL apps to building self-healing DevOps CLI tools, but the control you gain is worth every line of code.
What’s your favorite tool for enforcing deployment policies? Let me know in the comments!

#DevOps #CloudEngineering #OpenPolicyAgent #Docker #HNG

I Built a DevOps Tool That Thinks: Adding "Eyes" and a "Brain" to SwiftDeploy
