RallyX & US10 Post-Mortem
As some of you may be aware, we recently had issues with one of the host servers for Reclaim Cloud that significantly impacted one of our Shared Hosting servers (rallyx.reclaimhosting.com) and resulted in some downtime.
TL;DR: MySQL crashed and we had to restore backups to bring the server and the sites it hosts back online. This resulted in a few hours of lost content (between ~11am on 27 April and early morning 28 April), for which we sincerely apologize.
We are working with our hosting partner to optimize backups to ensure quicker resolution should something like this happen again, and will work to distribute load across all of Reclaim Cloud's host servers so that downtime on one host impacts as few users as possible.
Below is a more in-depth breakdown of the cause of the incident and resolution if you need more information. And please, if you do have an issue, don't hesitate to reach out.
27 April 2024
On Saturday, 27 April 2024, around 9pm EDT (0100 UTC on 28 April 2024), we saw brief downtime on one of Reclaim Cloud’s hardware nodes (hn10.us.reclaim.cloud). The node recovered on its own, and our initial investigation suggested that the crash was caused by heavy I/O during backup processes.
28 April 2024
The previous crash of the aforementioned hardware node corrupted MySQL data on one of the virtual servers it hosted, rallyx.reclaimhosting.com. This was not immediately clear, however: the downtime initially appeared to be caused by a stuck process related to the heavy I/O. Once the container was successfully restarted, the MySQL corruption was discovered.
On Sunday, 28 April 2024, around 10am EDT (1400 UTC), work began to repair the MySQL corruption and restore the server and the sites hosted on it to working order. All attempts failed, and around 2pm EDT (1800 UTC) we decided to restore service from a backup.
The server itself was rolled back to a backup taken on 21 April 2024, and this process completed around 5:40pm EDT (2140 UTC). User accounts were then restored from a backup taken on 27 April 2024, prior to the hardware node crash that caused the downtime. This restoration completed around 11:30pm EDT (0330 UTC on 29 April 2024). Given that these backups were a day old, some data loss is expected.
29 April 2024
On Monday, 29 April 2024, around 9:30am EDT (1330 UTC), the same hardware node crashed again. Because backups were not running at the time, the cause was not immediately clear. We asked our hardware vendor, OVH, to investigate.
Around 11am EDT (1500 UTC), OVH determined that the cause was a faulty power supply on the server and replaced it. However, rebooting the server after the replacement caused further issues, and we continued to work with OVH on a solution.
By 12:30pm EDT (1630 UTC), we were able to reboot the server to stabilize the connection. The server remains online at this point.