At long last, I'm fairly confident that I've found the culprit in my Hyper-V Linux virtual machine struggles. Recently I wrote about a new set of tweaks I was making to my Linux VMs in the hope that it would resolve a recurring file system crash / journal abort issue that's been ruining my life.
While it turns out I was occasionally on the right trail, the complexity of a Hyper-V cluster over an iSCSI SAN left so many possibilities that I never quite saw the real problem. I knew that the problem began when we implemented a cluster-aware backup solution. What I obsessed over was the performance of the SAN and how it related to the virtual machine's operating system. My strong suspicion was that the SAN was being overloaded to the point where the guest OS was experiencing an I/O logjam while the SAN couldn't be written to. My thinking was that if a bunch of writes came in at a time when the SAN could not accept them, the writes would pile up in the Linux kernel network request queue until it became overloaded - at which point the kernel would protect itself by switching the file system into read-only mode. While this was kind of true, the real culprit was VSS.
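If you want to catch that read-only flip as it happens rather than after something falls over, here's a minimal sketch of one way to do it. It is not what I ran in production. It assumes the standard `/proc/mounts` layout (device, mount point, file system type, options), and the list of file system types worth watching is my own guess. The function and constant names are made up for illustration.

```python
from typing import List, Tuple

# Assumption: only these journaled file systems are worth watching.
# Plenty of other mounts (tmpfs, squashfs, ...) are legitimately read-only.
WATCHED_FS_TYPES = {"ext3", "ext4", "xfs"}


def find_read_only_mounts(mounts_text: str) -> List[Tuple[str, str]]:
    """Return (device, mount_point) pairs for watched file systems mounted read-only."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        # Split the option string so "errors=remount-ro" is not mistaken for "ro".
        if fs_type in WATCHED_FS_TYPES and "ro" in options.split(","):
            hits.append((device, mount_point))
    return hits


def scan_proc_mounts(path: str = "/proc/mounts") -> List[Tuple[str, str]]:
    """Read the live mount table (Linux only) and report read-only mounts."""
    with open(path) as f:
        return find_read_only_mounts(f.read())
```

Calling `scan_proc_mounts()` from a cron job or monitoring check on each VM would return an empty list on a healthy machine, and the affected device and mount point once the kernel has remounted a file system read-only.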
I did infer that this had something to do with the VSS (Volume Shadow Copy Service) but I wasn't sure what or why exactly. I noticed strange behavior and performance issues whenever the shadow copy was being taken, but rather than try to eliminate it (the true solution) I tried to compensate for it.
Our backup solution offers a couple types of backup:
- Application Consistent
- Crash Consistent
With Application Consistent backup, the operating system is made aware that a backup is about to take place. This triggers a flush of in-memory data to disk before a shadow copy is taken so that any application data currently held in memory will become part of the backup. This is usually important for databases. The triggering is handled using VSS and a VSS writer service.
With Crash Consistent backup, a backup is taken at a precise moment in time. Whatever is on the disk at that time is what gets backed up. The applications are unaware that a backup is occurring, and you get the equivalent of the system state you'd have if someone had just yanked the power cord.
Obviously, given those two descriptions, you'd want the Application Consistent backup, and that's the default setting in the backup software for every virtual machine. It turns out, however, that Linux doesn't know how to deal with VSS. While I knew that to be the case, I also knew that Hyper-V installs guest services tools which include libraries to make the OS backup-aware. This allows you to take checkpoints of a Linux VM and even export them while the guest OS is still running - but it turns out these VMs will crash and burn in a full backup scenario when set to Application Consistent.
The solution, of course, is to use Crash Consistent backups for the Linux virtual machines. In Altaro Hyper-V Backup, this is a per-VM setting under the Advanced Settings section.
I'm happy to report that since we've made this change not a single VM has crashed. I almost want to be mad that it was so simple after all of this time, but I'm too happy for that :)