One of the downsides to virtualization is the relative complexity of system monitoring. Traditional monitoring systems are well tuned for physical servers, but they can run into issues in the virtual world. When servers move freely between physical hosts, it can be a challenge to keep track of them. Also, visibility at every layer of the stack is a tough nut to crack. Nevertheless, the ability to access trend and alert data from each layer is crucial to maintaining the overall health of the environment, and it's critical to future planning.
VMware's vCenter Operations Manager was designed to address these problems -- and it does so quite well. VMware didn't develop the Operations Manager tools internally, but acquired them with Integrien back in 2010. Now at release 5.7.0, vCenter Operations Manager is available in four different editions across the VMware line. I tested the Standard edition as part of the new VMware vSphere with Operations Management (VSOM) offering.
Smooth setup
The framework for vCenter Operations Manager is a vApp, downloaded and deployed on a VMware vSphere cluster. It consists of two Linux VMs, and the initial installation is as simple as browsing to the OVF file and deploying it through the vSphere management tools. Once deployed, Operations Manager requires very little initial configuration. It needs to hook into one or more vCenter servers and requires credentials to access every tracked object that's visible to the vCenter server. All told, getting the solution up and running took about 15 minutes.
The vApp is configured by default with two vCPUs on each of the two VMs, with 9GB of RAM presented to the Analytics VM and 7GB to the UI VM. Larger implementations will require more vCPUs and RAM; because these instances are self-tuning, the VMs will take advantage of the added resources when they are detected.
There are further configuration steps to take once Operations Manager is running, such as defining SMTP and SNMP servers, along with uploading SSL certificates (if desired). However, when linked to one or more vCenter servers, Operations Manager begins its data collection tasks and can be left alone to digest the wealth of data it collects and analyzes.
Operations Manager monitors every aspect of a vSphere build, from the hypervisor to the individual VM, including CPU, RAM, network I/O, and storage I/O. It constructs a profile of each monitored object over time, then uses that profile to determine normal and abnormal behavior. Thus, if a particular set of VMs spikes CPU utilization every night at 11 p.m. over a week or two, Operations Manager will determine this is a normal occurrence, then factor it into determining when to trigger an alert or flag a subsystem for abnormal behavior. This type of profiling is very useful, as it reduces false-positive alerts and smooths out trends.
Naturally, the learning process initially takes many weeks. During this time, Operations Manager shows data for every monitored aspect, but refrains from making determinations on normal/abnormal conditions, and the overall health scoring will not be completely accurate. Once enough data has been collected and analyzed, Operations Manager can begin making accurate determinations on the health of the monitored objects.
Navigating the UI
Operations Manager's UI is Flash-based and well appointed. The left sidebar is a hierarchical tree view of all monitored vCenter servers and child objects, with the center area displaying whatever element is currently in focus. The basic vSphere Datacenter/Cluster/Host/VM tree is present, but you can define your own groups and display data relevant to those group members alone. This is very handy, as it lets you collect certain objects relevant to a particular application or framework together, as well as get analysis and monitoring data displays for just those objects, rather than a whole cluster or a single host or VM.
These groups can be configured manually or dynamically, with manual selection allowing the addition of specific objects, and dynamic selection appending objects to the group based on defined criteria, such as workload, child/parent status, and name. If you wanted to group together VMs named web01, web02, web03, and so on, you could create a dynamic group with a rule that the name contains "Web." From there, all VMs with "Web" in the name will automatically become part of that group. Note that whenever you define a group, Operations Manager needs to gather data for the new group before it can report on it. That is, it will not be able to provide Health, Risk, and Efficiency scores right away, and some badges and values can take up to 24 hours to calculate.
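To make the dynamic-membership idea concrete, here is a minimal Python sketch of a "name contains" rule applied to an inventory list. It is an illustration only, not Operations Manager's API: the `VM` class and the `dynamic_group` function are hypothetical names, and the case-insensitive match is an assumption inferred from the web01/"Web" example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical stand-in for an inventory object; not a VMware type."""
    name: str


def dynamic_group(inventory, name_contains):
    """Return the VMs whose name contains the given text, ignoring case (assumed)."""
    needle = name_contains.lower()
    return [vm for vm in inventory if needle in vm.name.lower()]


# Made-up inventory for illustration.
inventory = [VM("web01"), VM("web02"), VM("db01"), VM("Web03")]
web_group = dynamic_group(inventory, "Web")
print([vm.name for vm in web_group])
```

Because the rule is re-evaluated against the inventory rather than stored as a fixed list, a newly created VM whose name matches would join the group the next time the rule runs.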
The UI is reasonably fast and responsive, and the graphs and data displays are clean and easily digested. Given the vast amounts of data on display, it's somewhat of a challenge to absorb every element at first, but after some time working with the UI, you begin to know where to look for certain information, and you can access it quite quickly.
Monitoring clusters, hosts, and VMs
The overview of this information is displayed through a series of grades and symbols. There is a dashboard view for every monitored object, from the World, which is inclusive of all linked vCenter instances, down to the physical host and data store level, as well as every VM on the system. Clicking a cluster header and selecting Dashboard will show a series of three columns: Health, Risk, and Efficiency. Each column will be graded from 1 to 100, with a badge reflective of the graded status.
For instance, a physical host or cluster might show a Health of 84, a Risk of 27, and an Efficiency of 20. At a high level, this means the selected object is in good shape in terms of available resources and workload, and Risk is fairly low because there are no expected conditions that should upset proper operations. Efficiency, however, is quite low in this example, perhaps due to a number of powered-on but dormant or low-utilization VMs, and several that are oversized.
vCenter Operations Manager's Dashboard views display Health, Risk, and Efficiency assessments for every VM, host, and cluster in your virtual infrastructure.
Clicking on the Efficiency score will give you a drill-down view into the reasoning behind the score, where you might see that perhaps 63% of the VMs running on a particular host are overspec'd. This drill-down will include a list of those VMs and recommendations on how to better utilize existing resources. You may see that several VMs that were assigned four vCPUs are really only using one or two, or a VM with 4GB of RAM never uses more than 1GB. Based on this data, you can then make adjustments to their resource allocations to free up resources for other VMs.
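The right-sizing logic can be pictured as a comparison of what each VM was given against the most it has actually used. The Python sketch below is a rough illustration under stated assumptions, not VMware's method: the `oversized` function, the 25% headroom factor, and the sample figures are all hypothetical.

```python
def oversized(allocated, peak_used, headroom=1.25):
    """Flag a resource as oversized when its allocation exceeds peak use plus headroom.

    The 25% headroom is an assumed value chosen for illustration.
    """
    return allocated > peak_used * headroom


# Hypothetical sample data:
# (name, vCPUs allocated, peak vCPUs used, RAM GB allocated, peak RAM GB used)
vms = [
    ("app01", 4, 2, 4, 1),
    ("app02", 2, 2, 8, 7),
    ("app03", 4, 1, 4, 3.5),
]

report = {}
for name, cpu_alloc, cpu_peak, ram_alloc, ram_peak in vms:
    flags = []
    if oversized(cpu_alloc, cpu_peak):
        flags.append("cpu")
    if oversized(ram_alloc, ram_peak):
        flags.append("ram")
    if flags:
        report[name] = flags

# Share of VMs with at least one oversized resource.
share = 100 * len(report) / len(vms)
print(report, f"{share:.0f}% oversized")
```

A real tool would look at demand over weeks rather than a single peak figure, but the comparison at the heart of the recommendation is the same.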
This level of granularity extends to every monitored object in Operations Manager. You can bring up a dashboard for a single VM; view its calculated Health, Risk, and Efficiency scores; and dig into the metrics that produced those scores. By selecting the Operations tab, you can view a series of graphs depicting the VM's key metrics and resource assignments, as well as an overall chart showing its health. Each chart has a bar above or next to it that shows the expected normal range for that metric, so you might see a current CPU workload of 10%, while the range bar shows that anything between 4 and 25% is normal for that VM. Anything outside of the calculated normal boundary can generate an alert.
It's important to note that the normal ranges are not merely calculated on an overall basis, but also balanced against day of week and time of day. A VM that pegs the CPU every night during a batch run does not necessarily generate an alarm during those times, because that's been observed to be normal operation. If the CPU pegs in the early afternoon, however, it may eclipse the calculated normal value, and an alert could then be generated.
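One way to picture a normal range that varies by day of week and time of day is to bucket historical samples by (weekday, hour) and compute a band for each bucket. The Python sketch below does exactly that. It is a simplified illustration, not the analytics Operations Manager actually uses: the function names, the three-standard-deviation band, and the 5-point minimum margin are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def build_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour) and return a range per bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    baseline = {}
    for key, values in buckets.items():
        mu, sigma = mean(values), pstdev(values)
        # Assumed rule: mean +/- 3 standard deviations, with a 5-point minimum margin.
        margin = max(3 * sigma, 5.0)
        baseline[key] = (max(0.0, mu - margin), min(100.0, mu + margin))
    return baseline


def is_abnormal(baseline, ts, value):
    """Return True when the value falls outside the range learned for that weekday and hour."""
    low, high = baseline[(ts.weekday(), ts.hour)]
    return not (low <= value <= high)


# Four weeks of made-up hourly CPU samples: 95% during the 11 p.m. batch run, 10% otherwise.
start = datetime(2013, 6, 3)
samples = []
for h in range(24 * 28):
    ts = start + timedelta(hours=h)
    samples.append((ts, 95.0 if ts.hour == 23 else 10.0))

baseline = build_baseline(samples)
night = datetime(2013, 7, 1, 23)      # 11 p.m., during the usual batch window
afternoon = datetime(2013, 7, 1, 14)  # 2 p.m., normally quiet
print(is_abnormal(baseline, night, 97.0))
print(is_abnormal(baseline, afternoon, 97.0))
```

The same 97% reading is treated as normal at 11 p.m. and abnormal at 2 p.m., which is the behavior a single fixed threshold cannot reproduce.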
The Operations view makes myriad data points available, and selecting All Metrics will bring up a display that allows you to specify multiple metrics across all resources, even including vCenter Server operations. The resulting graphs can be manipulated to change the date ranges for the display, and several other controls can be used to modify the graphical display itself, such as showing or hiding an axis or displaying a trend line. The scope of detail presented in this view is exemplary.
As with the Dashboard and Operations tabs, there is an Analysis tab for each monitored object. When looking at the cluster level, this shows various data sets that can be called up to reference the cluster. One example: "Which hosts have the most free capacity and the least stress?" Clicking on it brings up a heat map and host list that answers the question. Another example: "Which hosts have the most abnormal workload?"
These views are also available for different monitored objects, but differ in focus. For instance, when looking at a host, you might select "Which VMs have the highest CPU demand and contention?" This brings up another heat map and a list of VMs on that host, with their current CPU utilization. There are other views for data stores as well. This makes it very easy to get a solid look at the performance and capacity of the existing infrastructure, as well as to pinpoint problem areas.
The Operations view shows overall health and key metrics for each monitored object. The Operations tab for a VM, for example, details the overall workload, along with CPU, memory, storage I/O, and network I/O metrics.
Capacity planning
Operations Manager also incorporates VMware's forecasting and planning tools. When looking at a cluster and clicking the Planning tab, you get a view that shows the expected time remaining for the resources currently available in the cluster. This information is based on the number of hosts and their resources, as well as the currently running VM loads and the number of new VMs introduced over time. These inputs are then used to forecast how much time remains before cluster resources will be at their limits. The Planning tab also shows the total number of VMs that this cluster is likely to be able to handle, based on current workloads.
For example, if you've been adding many VMs to a certain cluster recently, the forecasting calculation might show that if that process continues, the cluster will exhaust current resources within the next two months. This data is broken out into each resource, such as CPU, RAM, data store space, disk I/O, and so forth.
In addition, by selecting "New what-if scenario," you can see how those numbers change as you add more hosts to the cluster, or add CPU and RAM resources to existing hosts. Operations Manager will then calculate how the cluster resources will be used within that scenario, helping to plan cluster upgrades. You can also add and remove hosts and data stores in a what-if scenario.
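The arithmetic behind a time-remaining forecast and a what-if scenario can be sketched with a simple linear projection. The Python example below is a back-of-the-envelope illustration, not the forecasting model Operations Manager uses: the function names, the straight-line growth assumption, and the cluster figures are all hypothetical.

```python
def months_remaining(capacity, used, growth_per_month):
    """Linear estimate of the months left until `used` reaches `capacity`."""
    if growth_per_month <= 0:
        return float("inf")  # no growth means no projected exhaustion
    return max(0.0, (capacity - used) / growth_per_month)


# Hypothetical cluster figures: capacity, current use, and monthly growth per resource.
cluster = {
    "cpu_ghz": {"capacity": 200.0, "used": 150.0, "growth": 25.0},
    "ram_gb": {"capacity": 1024.0, "used": 640.0, "growth": 64.0},
}


def forecast(resources, extra=None):
    """Return months remaining per resource, optionally adding what-if capacity."""
    extra = extra or {}
    return {
        name: months_remaining(r["capacity"] + extra.get(name, 0.0), r["used"], r["growth"])
        for name, r in resources.items()
    }


current = forecast(cluster)
# What-if: add one more host's worth of CPU and RAM (assumed sizes).
what_if = forecast(cluster, extra={"cpu_ghz": 50.0, "ram_gb": 256.0})
print(current)
print(what_if)
```

In this made-up case CPU is the limiting resource, running out in two months, and the what-if host pushes that out to four. Breaking the estimate out per resource is what makes the limiting factor visible.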
Alerting and reporting
One of the more significant problems with alerting is false positives. For instance, there are many times when a VM spiking to 100% CPU utilization is not a cause for alarm, but basic threshold triggers do not know this. As a result, alarms are generated during normal operation that do not actually reflect a problem. Because Operations Manager has historical data on each object in the infrastructure, it will only trigger an alarm if previously unseen activity is taking place, such as a VM spiking to 100% at a time and on a day when it had previously been idle. This reduces the number of false positives and provides a better idea of what's actually happening on the cluster, host, or VM.
Alerts themselves are controlled by the Notification settings, which allow for fairly granular selections of objects, alert types, and criticality levels. In addition to being sent via email to one or more addresses and optionally via SNMP, alerts are shown on the right-hand sidebar of the UI, clickable from any place within the tool. This sidebar also expands to show at-a-glance overall Health, Risk, and Efficiency scores and graphs over time.
Operations Manager has a selection of predefined reports that can be called up on demand or run on a scheduled basis. These are exportable in PDF or CSV format, and they include data on under/oversized VMs, host utilization, capacity overviews, and idle VMs, among others.
The Analysis display is a quick way to reveal trouble spots or slack resources based on performance characteristics. We can see the analysis being performed, and the resulting heat map and detail below.
The reports are well designed and attractive, but there is no capacity for custom report generation in the UI. VMware states that custom reports can be created from XML files via the command-line interface, but there's no UI control over this function.
Further, if a cluster, host, VM, or other object is removed from vCenter, the historical data for that object becomes inaccessible from within Operations Manager. VMware states that the data is actually present, but cannot currently be viewed through the UI.