Today's outage

Warhorn was offline for about 4 hours this morning. The outage was caused by the server’s disk filling up as a result of 1) last night’s automated backup and 2) a proliferation of old log files. Because we don’t have any real alerting systems in place, we didn’t notice the outage until users started emailing us complaints. Once we were aware of the situation, we made some space on the disk and got the service running again.

Takeaways from this incident:

  1. We need real alerting systems in place to let us know as soon as some part of the system fails. It is unacceptable for hours to go by before we discover that something’s wrong. We have two options: set up some basic monitoring and alerting tools ourselves, or pay a third party to handle monitoring and alerting for us. Since I’m short on both free time and experience managing that kind of thing by hand, I’d prefer the second option, but that depends on you folks giving a little more money a little more often than happens now.
  2. Backups need to be stored somewhere other than on the server. This is already in place; I set it up just after this morning’s incident. About 10 GB of space has been reclaimed, and our hosting provider’s automated backup system has now replaced our old system (at an additional monthly cost, which we’re crossing our fingers will be covered by donations).
  3. Log files need to be pruned after some amount of time. I gotta just roll up my sleeves and make that happen; nobody’s gonna do it for me.

I think it’s pretty obvious that we’re running Warhorn on a shoestring budget. We get very little in the way of donations and therefore can’t afford to pay other folks to help us with the system-level stuff that I don’t have the time or expertise to deal with. So if you value the service we’re providing, please consider donating a few bucks to help us get Warhorn working better. If we could bring in even just $100 a month we could afford to run this outfit in a much more professional manner 🙂 Thanks very much to those of you who do support us financially; we sure do appreciate the help. And sorry for the outage – we’ll keep doing the best we can to keep this creaky old ship afloat.
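
For the curious, here is a rough sketch of what the do-it-yourself version of takeaways 1 and 3 might look like: a disk-usage check and a log pruner. This is not what Warhorn actually runs. The 90% threshold, the 14-day retention window, the `*.log*` filename pattern, and all the function names are assumptions made up for illustration, and actually sending an alert (email, SMS, whatever) is left out entirely.

```python
import os
import shutil
import time
from pathlib import Path


def disk_usage_percent(path="/"):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path="/", threshold_percent=90.0):
    """True when usage is at or above the (assumed) alert threshold."""
    return disk_usage_percent(path) >= threshold_percent


def prune_old_logs(log_dir, max_age_days=14, pattern="*.log*", now=None):
    """Delete files in `log_dir` matching `pattern` that were last modified
    more than `max_age_days` ago. Returns the sorted names of removed files.

    `now` can be passed in to make the cutoff deterministic (handy for tests).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for entry in Path(log_dir).glob(pattern):
        # Only touch regular files; skip directories that happen to match.
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

Something like this could be run on a schedule (from cron, say), calling `prune_old_logs` on the app's log directory and raising a flag whenever `is_disk_nearly_full` returns true. A standard tool such as logrotate would likely be a sturdier choice for the pruning half; the sketch is only meant to show how small the core logic is.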
