How HADES Monitor tracked an outage
The site went down in the night. My laptop was closed. The other machine was on, so the site should have stayed up. It did not.
The cause took hours to find. A second project on the same Mac had taken the port my proxy uses, and every request that landed on it came back as a 404 from the wrong server. The logs had said so the whole time. Nobody was reading them.
So I gave the host an agent. Once a minute it reads the daemon log and every app's log and writes down anything wrong: a dropped connector, a flood of 404s, memory running out. It keeps the list in a small database on the machine, and it only speaks up when something is actually broken.
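A rough sketch of what one such pass might look like, in Python. Everything named here is an assumption made for illustration: the log patterns, the flood threshold, and the table layout are hypothetical and are not the monitor's actual code.

```python
import re
import sqlite3

# Hypothetical patterns; real log formats will differ.
RULES = [
    ("connector_dropped", re.compile(r"connector .* (dropped|disconnected)", re.I)),
    ("http_404", re.compile(r"\s404\s")),
    ("memory_pressure", re.compile(r"memory pressure|out of memory", re.I)),
]
FLOOD_THRESHOLD = 20  # assumed: this many 404s in one pass counts as a flood


def scan(lines):
    """Count how many log lines match each rule in a single pass."""
    counts = {name: 0 for name, _ in RULES}
    for line in lines:
        for name, pattern in RULES:
            if pattern.search(line):
                counts[name] += 1
                break  # one finding per line is enough
    return counts


def record(db, counts):
    """Write every non-zero finding to a small local table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS findings ("
        "ts TEXT DEFAULT CURRENT_TIMESTAMP, kind TEXT, count INTEGER)"
    )
    rows = [(kind, n) for kind, n in counts.items() if n > 0]
    db.executemany("INSERT INTO findings (kind, count) VALUES (?, ?)", rows)
    db.commit()
    return rows


def should_alert(kind, count):
    """Stay quiet about a stray 404; speak up for a flood or anything else found."""
    if kind == "http_404":
        return count >= FLOOD_THRESHOLD
    return count > 0
```

Run once a minute from a scheduler, `scan` reads the pass, `record` keeps the list, and `should_alert` decides whether anyone hears about it. Keeping the recording and the alerting separate is what lets the agent write everything down and still stay quiet most of the time.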
When I turned it on, it found the memory pressure first. The host had been low on memory for an hour, pausing apps and resuming them seconds later, again and again. I had not noticed. It noticed in one pass.
Then I broke a connector on purpose. It caught that and fixed it. It has a short list of things it is allowed to do, all of them reversible, and restarting a connector is one of them. Two seconds, and a line in the log saying what it had done.
It is careful about the rest. The memory problem had no safe button to press, so it left that one for me. That is the part I trust. It knows the difference between a thing it can fix and a thing it should hand back.
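The split between fixing and handing back can be sketched as a plain lookup table. This is a minimal illustration under stated assumptions: `restart_connector` is a hypothetical stand-in that only returns a message, and the finding names are made up for the example.

```python
audit_log = []  # every decision leaves a line here


def restart_connector(name):
    # Hypothetical placeholder: a real version would ask the daemon to restart it.
    return f"restarted connector {name}"


# The short list: finding kind -> the one reversible action allowed for it.
ALLOWED = {
    "connector_dropped": restart_connector,
}


def handle(kind, target):
    """Run the allowed action if there is one; otherwise hand the problem back."""
    action = ALLOWED.get(kind)
    if action is None:
        audit_log.append(f"{kind} on {target}: no safe action, left for the operator")
        return False
    audit_log.append(action(target))
    return True
```

Here `handle("connector_dropped", "tunnel-1")` acts and logs what it did, while `handle("memory_pressure", "host")` only logs and returns `False`. Anything not on the list falls through to a person by default, so adding a new kind of finding never grants the agent a new power by accident.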
The point was never to watch a dashboard. The point is to never have to.