Recently we have been testing our new hypervisor clusters, backed by Ceph storage, before putting them into production. This way we can ensure the platform remains stable and reliable. Time is usually the best test of stability, and this incident definitely confirmed that view.
What went wrong?
Currently our test cluster consists of 3 hypervisors, which also act as Ceph monitors and managers and run Ceph OSD daemons. With data replicated 3x, the cluster can survive the failure of 1 hypervisor at a time. Unfortunately, the cause of the issue we'll discuss today was that 2 hypervisors ceased functioning.
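To make the failure tolerance concrete, here is a minimal sketch of the math, assuming Ceph's common defaults of pool size=3 and min_size=2 and one monitor per node (our actual pool settings aren't shown in this post, so treat these values as illustrative):

```python
# Why a 3-node cluster tolerates only 1 node failure, assuming:
#   - pool size=3, min_size=2 (common Ceph defaults)
#   - one monitor per node, so 3 monitors total

def mons_have_quorum(total_mons: int, failed_mons: int) -> bool:
    """Ceph monitors need a strict majority to keep quorum."""
    return (total_mons - failed_mons) > total_mons // 2

def pgs_can_serve_io(size: int, min_size: int, failed_replicas: int) -> bool:
    """A PG keeps accepting I/O only while at least min_size replicas survive."""
    return (size - failed_replicas) >= min_size

for failed in range(4):
    ok = mons_have_quorum(3, failed) and pgs_can_serve_io(3, 2, failed)
    print(f"{failed} node(s) down -> I/O {'continues' if ok else 'halts'}")
```

With 1 node down both conditions still hold, but with 2 of 3 nodes down the monitors lose quorum and PGs drop below min_size, which is exactly the total I/O halt described below.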
Yesterday on 7/16 we noticed that I/O was locked up for VMs running on the hypervisors. After investigating, we found that the RAID 1 OS drives had completely disappeared from 2 of the hypervisors; essentially, they had been kicked out of the RAID. No I/O was being read from or written to the RAID devices or Ceph OSDs, which caused a total halt.
We had a few Ceph OSD drives per node that were set to JBOD (just a bunch of disks) mode. This exposes the drives to the OS as if they were directly connected SATA drives rather than sitting behind a RAID controller. This is essential for Ceph (and ZFS), as exposing the drives through RAID devices can cause a whole host of issues. To our surprise, we learned that JBOD mode on the LSI 2208 controller chip is very buggy. It's generally recommended to use only controllers flashed with IT-mode firmware. As you can tell, we found out the hard way.
Attempting to fix the damage
Carrying on, once we knew the issue in full, we started attempting fixes. First we powered down one of the non-functional hypervisors (hv2) and booted into the RAID BIOS. We noticed that neither of the OS drives was showing up. Rebooting and having the controller rescan was unsuccessful. We eventually reset the controller to factory defaults, which disabled JBOD mode. This was a major mistake: one of the OS drives showed up (but not the other), and the controller started rebuilding the RAID onto one of the Ceph OSDs.
Taking the controller out of JBOD mode marked the former Ceph OSDs as unconfigured-good. The controller took action, initialized the drive (wiping the Ceph data), and put it into the RAID array. When we stopped that rebuild, it grabbed the 2nd Ceph OSD, initialized that one too, and started another rebuild. We quickly realized things were going south and had to change our strategy.
After some thinking, we decided to abandon recovery of hv2 for the time being and move on to hv3. We hard-powered hv3 down, waited a bit, then powered it back up. Strangely, this caused both of the OS drives to show up. Since JBOD mode never got disabled on it, hv3 didn't suffer the same fate as hv2. Everything looked good at this point: it booted into the OS just fine and the Ceph OSDs were intact. Since hv1 hadn't suffered any fault, Ceph recovery started as soon as hv3 came back up. The recovery eventually finished without issue (aside from some active+undersized+degraded PGs in Ceph, which we'll get to later), and it was time to look at the problematic hv2 again.
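For readers unfamiliar with those PG states: "active+undersized+degraded" means a placement group is still serving I/O (active) but has fewer replicas than the pool's configured size (undersized) and some objects are missing replicas (degraded). A tiny sketch of how we eyeball cluster health, using made-up example numbers in place of real `ceph pg stat` output (the dict below is illustrative, not the exact Ceph JSON schema):

```python
# Made-up PG state counts standing in for what a real cluster would report.
pg_states = {
    "active+clean": 118,               # fully replicated, healthy
    "active+undersized+degraded": 10,  # serving I/O, but missing replicas
}

def unhealthy_pgs(states: dict) -> int:
    """Count PGs in any state other than active+clean."""
    return sum(n for state, n in states.items() if state != "active+clean")

print(f"{unhealthy_pgs(pg_states)} PGs need attention")
```

In our case those degraded PGs are the ones whose third replica lived on the hv2 OSDs that got wiped; once replacement OSDs are back in service, Ceph backfills them automatically.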
When we tried to boot hv2, it wouldn't. At first it looked like the OS was simply gone from the RAID, until we took a closer look. We booted a recovery environment and found that the OS was intact and the issue was likely a corrupted bootloader. We reinstalled GRUB onto the RAID device and it booted! We re-enabled JBOD mode on the controller and the former Ceph OSD drives showed up. Sadly, those OSDs were completely wiped by the controller's initialization. One of the OS drives didn't show up then and still isn't showing up either. It's possible that the RAID metadata on that OS drive is corrupt and the controller refuses to read it (to be continued).
What can we do to prevent this?
To prevent such a catastrophe from happening again, we will need to flash our LSI 2208 controller with IT-mode firmware (turning it into a simple HBA). Alternatively, we could rewire the chassis with a dedicated PCIe HBA card and route everything except the OS drives through it. Either way, it's clear that our design needs to change. We're glad this happened in testing, and NOT in production. Although the data is relatively intact at this point, we still have degraded Ceph PGs that need to be repaired.
Transparency and honesty
We tend to favor transparency and honesty, taking after companies like Cloudflare that publish detailed descriptions of their infrastructure and postmortems of outages. We aren't worried about whether honesty makes us look bad; everyone makes mistakes, and everything runs into issues at some point. We believe staying quiet about these kinds of incidents is harmful to customers and to the IT community, who might find posts like this helpful when troubleshooting similar issues. It's better to be transparent and honest, and that's what we're about!