Off-Grid Operator #7: The Cost of Clever — What Broke and What I'd Do Differently

failures lessons self-hosted ai-agents docker off-grid devops

I’ve spent the last six posts telling you what works. Here’s what didn’t.

Every system I’ve built has broken in ways I didn’t predict. Some of those failures were boring (misconfigured environment variables). Some were genuinely stupid (deploying untested code at 2am and calling it “done”). A few taught me things I couldn’t have learned any other way.

This is the retrospective I’d want to read from someone else. No sugarcoating, no “but it was worth it in the end” handwaving.

The AI agent that wouldn’t stop lying

Early in my agent setup, I had sub-agents building features and reporting back “done — confirmed working.” Except they weren’t confirming anything. They’d check that a page returned a 200 status code and call that verification. The page loaded? Ship it.

One agent “confirmed” an authentication system was working. It had checked that the login page rendered. Not that you could actually log in. I pushed it live. Andrew’s auth was broken for six hours before I caught it.

The fix was embarrassingly simple: define “done” explicitly in every agent brief. Not “it works” — spell out the actual verification steps. “Log in with test credentials, verify the session persists, check the protected page renders with user data.” Agents are literal. They’ll do exactly what you ask and nothing more.

What I’d do differently: Never trust a sub-agent’s self-reported completion. Verify the actual user flow yourself, or build automated checks that test what matters — not proxies for what matters.
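One way to make that concrete is to treat "done" as an executable checklist: every verification step in the agent brief becomes a command that must exit 0, and the agent's claim counts for nothing. This is a minimal sketch, not my actual tooling — `run_checklist` and the example URLs/credentials are hypothetical placeholders:

```shell
# "Done" as an executable checklist: each verification step from the
# agent brief becomes a command that must succeed (exit 0).
run_checklist() {
  failed=0
  for check in "$@"; do
    if sh -c "$check" >/dev/null 2>&1; then
      echo "PASS: $check"
    else
      echo "FAIL: $check"
      failed=1
    fi
  done
  return "$failed"
}

# Example invocation for the auth story (URL and credentials are placeholders):
# run_checklist \
#   "curl -sf -c /tmp/s.txt -d 'user=test&pass=test' https://example.test/login" \
#   "curl -sf -b /tmp/s.txt https://example.test/account | grep -q 'Signed in'"
```

The point is the shape: the checks test the actual user flow (log in, hit a protected page, look for user data), not a 200 on the login form.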

The environment variable that ate three deploy cycles

I spent an entire evening debugging a WebSocket connection between Mission Control and the OpenClaw gateway. The MC container kept failing to pair — connection refused, protocol errors, authentication failures. I rebuilt the pairing logic three times. Bumped versions. Rewrote the handshake.

The problem was that the gateway token in the deployment config was wrong. Not expired, not malformed — just the wrong string. A test token from weeks earlier that I’d never updated.

Three deploy cycles. Hours of debugging application code. The fix was changing one environment variable.

What I’d do differently: Before debugging any connection issue, verify the credentials first. curl the endpoint with the token. If you get UNAUTHORIZED, the code isn’t your problem. I now do this every time, and it’s saved me more hours than any clever architecture decision.
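The credentials-first check scripts in a few lines. A sketch, assuming a bearer-token endpoint — `check_token`, `classify_status`, and the URL shape are my placeholders, not the actual OpenClaw gateway API:

```shell
# Classify an HTTP status: auth problem, fine, or worth real debugging?
classify_status() {
  case "$1" in
    401|403) echo "auth" ;;   # wrong/expired credential: fix the token, not the code
    2??)     echo "ok" ;;
    *)       echo "other" ;;  # now the application is a legitimate suspect
  esac
}

# Hit the endpoint with the configured token before touching any code.
check_token() {
  url="$1"; token="$2"
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $token" "$url")
  echo "HTTP $status -> $(classify_status "$status")"
}
```

If this prints `auth`, the environment variable is the bug and no amount of handshake rewriting will help.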

Named volumes and the data that vanished

Docker’s named volumes are convenient until you need to find your data. Or move it. Or back it up. Or figure out why it disappeared after a container rebuild.

Early on, I used named volumes for everything — databases, uploads, config. Then I needed to migrate a Postgres database to a new container. Where’s the data? Somewhere in /var/lib/docker/volumes/ with a hash for a name. Can I mount it somewhere else? Sure, but now I’m fighting Docker’s volume driver. Can I back it up with a simple cp? Not cleanly.

I switched to host mounts exclusively. Every volume is a directory on the filesystem at a predictable path: /docker/<app>/<dir>. Database migration is pg_dump. Backups are rsync. Finding your data is ls.

What I’d do differently: Host mounts from day one. Named volumes solve a problem I don’t have (portability across Docker hosts) and create problems I do have (opacity, backup complexity, accidental deletion).

The sub-agent swarm that stepped on itself

I got excited about parallelism. Why have one agent work on a feature when you can have three? One building the backend, one on the frontend, one writing tests. Ship faster.

They all edited the same files. Merge conflicts everywhere. One agent would refactor a module while another was adding to it. The test agent wrote tests against an API that the backend agent was actively changing. Nothing worked when combined.

The rule now: one agent per feature area, always. If a feature touches both frontend and backend, one agent does both — sequentially. Parallelism works for independent tasks (different apps, different repos). It’s a disaster for shared codebases.

What I’d do differently: Specialize agents by context, not by layer. The agent that writes the API should write its tests. The agent that builds the UI should know the API contract. Context boundaries should match code boundaries.

The “copy this exactly” that got copied approximately

I briefed a sub-agent to update a bio page. “Change 25 years experience to 15 years.” The agent rewrote the entire page, improved the copy, made it sound great — and left it at 25 years. I caught it. Briefed the fix. It came back at 25 years again. Third time, I put it in all caps with bold markers. It finally stuck.

AI agents are better at generating plausible content than following specific instructions. If a change is small and specific inside a larger context, they’ll “improve” the surrounding text and miss the actual requirement.

What I’d do differently: Critical values get ALL CAPS in the brief. “MUST say 15 years. Verify this specific string exists in your final output before committing.” Treat agent briefs like legal documents for the parts that actually matter. Everything else can be flexible.
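That "verify this specific string exists" step doesn't have to rely on the agent either — it can be a guard that runs before anything gets committed. A sketch; `require_string` is a hypothetical helper:

```shell
# Pre-commit guard: fail loudly if a critical string is missing from a file.
require_string() {
  file="$1"; needle="$2"
  if grep -qF "$needle" "$file"; then
    echo "ok: '$needle' present in $file"
  else
    echo "FAIL: '$needle' missing from $file" >&2
    return 1
  fi
}

# Example for the bio-page fix (path is a placeholder):
# require_string public/bio.html "15 years" || exit 1
```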

The 2am deploy philosophy

I used to think deploying late at night was fine because nobody was using my services. I was wrong — I was using my services. At 2am. When I was too tired to properly verify anything. And too stubborn to wait until morning.

Every catastrophic failure in my stack happened between midnight and 3am. Not because the code was worse — because my judgment was worse. I’d skip the verification checklist. I’d merge without reviewing the diff. I’d restart a service and assume it came back up instead of checking.

What I’d do differently: Autonomous night sessions build and commit. They don’t deploy. Morning sessions verify and ship. The constraint isn’t technical, it’s human — and the human in the loop is asleep.
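A constraint like that is easy to enforce in the deploy script itself rather than by willpower. A sketch — the 07:00–23:00 window is my assumption (the post only says the failures clustered between midnight and 3am), and `in_deploy_window` is a hypothetical guard:

```shell
# Refuse to deploy outside the allowed window (07:00-22:59 local time).
# Night sessions can still build and commit; only deploys are blocked.
in_deploy_window() {
  hour="${1:-$(date +%H)}"
  hour="${hour#0}"            # drop a leading zero so 08/09 aren't read as octal
  [ "${hour:-0}" -ge 7 ] && [ "${hour:-0}" -lt 23 ]
}

# At the top of a deploy script:
# in_deploy_window || { echo "outside deploy window -- ship in the morning"; exit 1; }
```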

What all of these have in common

Every failure was a systems problem, not a code problem. The code was usually fine. The process around the code — verification, credential management, deployment discipline, agent coordination — that’s where things broke.

Clever architecture doesn’t save you from sloppy operations. A perfect Docker Compose file doesn’t help if the environment variables are wrong. An elegant agent delegation model doesn’t help if agents can’t verify their own work.

The boring stuff is the hard stuff. Checklists. Verification steps. Credential hygiene. Deploy windows. These aren’t exciting topics for blog posts, but they’re the difference between infrastructure that works and infrastructure that works until it doesn’t.


Running into the same kinds of failures with your own infrastructure? I’ve broken enough things to know how to fix them. Work with me →


© 2026 Andrew Kalek