Hyper-V tuning: Linux virtual machines over iSCSI

A job that's never done

Image credit: Cypress North

I've written many posts about my adventures in virtualized server building and management over the past couple of years, mostly to document what I've learned and to see what others are doing. Things have come a long way since we started out, but new problems are constantly arising, and achieving lasting stability seems perpetually just out of our grasp.

Recently we migrated our systems from a couple of servers with direct storage and no failover to a High Availability Cluster backed by an iSCSI SAN. While this allowed us to consolidate our storage, ease capacity expansion, and provide VM failover, it also created some new challenges for our environment. The problems all stem from the same place: the SAN.

Currently we have a cluster of 3 host servers, each running roughly 6 virtual machines of varying size and resource utilization. Every virtual machine, including its root file system (VHD), is stored on the SAN on a Cluster Shared Volume. Our SAN is a single appliance with 4 disks in RAID 10, exposed as a block-level LUN over a dedicated 1Gb network. The appliance is connected via 4 NICs using link aggregation, and each cluster host is connected via 2 NICs using MPIO. While the 1Gb link doesn't sound like much, we see surprisingly good results in most cases: typical disk utilization doesn't exceed 20%, and network traffic averages just 5MB/s with the occasional 20MB/s spike.
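As an aside, if you want to watch what an individual Linux guest is doing to its own disk, iostat from the sysstat package works well. This is just a sketch: it assumes sysstat is installed and that sda is the guest's root disk, so adjust for your setup.

$ sudo apt-get install sysstat          # or: yum install sysstat
$ iostat -dxm sda 5                     # extended stats for sda in MB/s, refreshed every 5 seconds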

Now, if you've just read those specs over, you'll quickly come to the conclusion that we have too many VMs for our little SAN. With just 4 disks in the storage appliance, I/O can quickly become a problem if several VMs decide to get busy at the same time. Fortunately we've loaded our host servers with RAM and allocate generous amounts to our VMs, which, along with careful tuning, results in low disk I/O under normal conditions. What we're seeing, however, is that under less normal conditions an increase in I/O can have crippling side effects on some of our virtual machines, namely the Linux ones.

Specifically, some Linux VMs will hit a disk timeout on the root file system, causing it to be remounted read-only, along with a pile of journal abort and file system errors. Obviously a read-only root file system isn't going to work out, so I'd wake up in the morning to a locked server or two. The only way to bring the server back online is to access the VM console directly, reboot it with a forced reset, run fsck to repair the file system, and reboot it again. Not great. What's odd is that Windows VMs are never affected, and other Linux VMs, even on the same host, may not be affected either.
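For the record, the recovery from the console looks roughly like this. It's only a sketch: it assumes an ext4 root on /dev/sda1 and that the forced reset drops you into a maintenance shell where the root isn't yet mounted read-write, so adjust for your own layout.

dmesg | grep -iE "journal|i/o error"   # confirm the journal aborts and timeouts
fsck -y /dev/sda1                      # repair the root file system while it's unmounted
reboot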

We'd already upped the disk timeout inside all Linux VMs from the default of 30 seconds to 180 seconds and reduced swappiness, as outlined here: Running a virtual machine over iSCSI SAN? Check your swappiness. That helped for a time but lately hasn't been enough. The frequency of the read-only file system issue has increased dramatically since we implemented a Hyper-V backup solution on the cluster. What we've noticed is that when the backup executes, the software triggers a Volume Shadow Copy on a couple of VMs at a time in preparation for the snapshot transfer. During that time, I/O on the storage server increases considerably, causing some VMs to wait to write out to disk. Our suspicion is that write operations on the VM pile up while waiting to be flushed to a disk that, at that moment, may be too busy.
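For reference, those two tweaks boil down to something like the following. The device name and the swappiness value are placeholders (the swappiness post above covers picking a value); the sysctl setting can be made permanent in /etc/sysctl.conf, and the timeout has to be re-applied at boot as well.

$ echo 180 | sudo tee /sys/block/sda/device/timeout   # raise the SCSI command timeout from 30s
$ sudo sysctl -w vm.swappiness=10                     # make the kernel far less eager to swap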

Researching this topic for months has led me to some solutions involving the I/O queue size and the I/O scheduler. I've recently switched the scheduler to 'noop', the simplest I/O scheduler, which is basically just first in, first out; it has proven to be the most performant option for use over an iSCSI SAN. I've also increased the size of the I/O queue from the default of 128 up to 1024.

$ echo noop | sudo tee /sys/block/sda/queue/scheduler
$ echo 1024 | sudo tee /sys/block/sda/queue/nr_requests
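One caveat: these sysfs writes don't survive a reboot. A simple approach (one of several) is to re-apply them at boot, for example from /etc/rc.local; alternatively, booting with elevator=noop on the kernel command line makes noop the default scheduler for every block device. The device name below is again just an example.

# /etc/rc.local (sketch) - re-apply the queue settings at boot
echo noop > /sys/block/sda/queue/scheduler
echo 1024 > /sys/block/sda/queue/nr_requests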

My hope is that the larger queue will allow the VM to persevere through a couple-minute span where disk utilization is too high. So far so good on this front, but I'll need to report back after more time. If this doesn't work, my next thought is to move the VM Checkpoint storage location to a different disk, either a host server's non-system disk or a NAS, which would increase network traffic but decrease write I/O on the SAN.

Now I know there are piles of system and network administrators out there who are graced with the budget to do things properly and actually see long-term stability and high availability, but that's not us. We've made the best of what we can afford and aren't able to keep throwing money at the problem. I'll continue to tune what we have until we've squeezed out its maximum potential. If you've got any ideas or related experience to share, I'm all ears.
