Warhorn was offline for about 4 hours this morning. The outage was caused by the server’s disk filling up, the combined result of 1) last night’s automated backup and 2) a proliferation of old log files. Because we don’t have any real alerting in place, we didn’t notice the outage until users started emailing complaints. Once we were aware of the situation, we freed up some disk space and got the service running again.
Takeaways from this incident:
- We need real alerting in place to tell us as soon as some part of the system fails. It is unacceptable for hours to go by before we discover that something’s wrong. We have two options: set up basic monitoring and alerting tools ourselves, or pay a third party to handle monitoring and alerting for us. Given my lack of free time and of experience managing that kind of thing by hand, we’d prefer the second option, but that depends on you folks giving a little more money a little more often than happens now.
- Backups need to be stored somewhere other than on the server itself. This is already done; I set it up just after this morning’s incident. About 10 GB of space has been reclaimed, and our hosting provider’s automated backup system has replaced our old one (at an additional monthly cost, which we’re crossing our fingers will be covered by donations).
- Log files need to be pruned automatically after some retention period. I just gotta roll up my sleeves and make that happen; nobody’s gonna do it for me.
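To give a sense of the monitoring gap, here is a minimal sketch of the kind of check a basic setup (or a third-party service) would run for us: watch disk usage and shout when it crosses a threshold. The 90% threshold and the `check_disk` helper are illustrative assumptions, not anything currently deployed.

```shell
#!/bin/sh
# Hypothetical disk-usage check. check_disk and the 90% threshold are
# assumptions for illustration; a real setup would page or email instead
# of just printing.

# check_disk USAGE_PERCENT THRESHOLD: print an alert and return 0 when
# usage is at or above the threshold; otherwise return 1.
check_disk() {
  if [ "$1" -ge "$2" ]; then
    echo "ALERT: disk at $1% (threshold $2%)"
    return 0
  fi
  return 1
}

# Read current usage of / from df's second output line (e.g. "42%" -> 42).
usage=$(df -P / | awk 'NR==2 {sub(/%/, "", $5); print $5}')
check_disk "$usage" 90 || echo "disk OK at ${usage}%"
```

Run from cron every few minutes, even a crude check like this would have caught this morning’s full disk hours before users did.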
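The log-pruning step could look something like the sketch below: delete anything in a log directory older than a retention window. The `prune_logs` name and the 30-day window are assumptions; the demo runs against a throwaway directory so it’s safe to try anywhere.

```shell
#!/bin/sh
# Hypothetical log-pruning sketch. prune_logs and the 30-day retention
# period are illustrative assumptions, not Warhorn's actual configuration.

# prune_logs DIR DAYS: delete regular files under DIR last modified more
# than DAYS days ago.
prune_logs() {
  find "$1" -type f -mtime "+$2" -delete
}

# Demo against a temporary directory: one fresh file, one backdated file.
demo=$(mktemp -d)
touch "$demo/current.log"
touch -t 202301010000 "$demo/ancient.log"   # backdate well past retention
prune_logs "$demo" 30
ls "$demo"   # → current.log
```

In practice a tool like logrotate handles this (rotation, compression, and retention together); the point is just that the retention policy has to exist and run on a schedule.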