What's happening with your infrastructure right now? Is the network healthy? Are all of your servers online and responding? Gaining insight into these things is an important part of a system administrator's life. Implementing a monitoring solution can be extremely complex and time-consuming, though, and the road begins with choosing a monitoring platform.
There are a lot of choices for infrastructure monitoring, ranging from free to costly and from self-hosted to cloud-based. If you need a highly flexible solution with enterprise-grade features, and you want it for free, the choice usually comes down to Nagios or one of several lesser-known tools.
Nagios is the standard for free, self-hosted server monitoring. It's been around a long time and the platform is very mature. It's also extremely extensible, and there are thousands of plugins out there to help you monitor just about anything you can think of on your servers. Configuration and setup can be a bear, however, and the management and overview interface is antiquated. Adding each server and service that you want to monitor is a time-consuming and error-prone process. If this sounds like more than you're willing to tackle, one of the alternatives, Zabbix, is worth your consideration.
Zabbix is a similarly flexible monitoring platform that is also free and relatively mature. Zabbix is capable of monitoring anything you can think of as well, albeit with some custom work. Its main advantage over Nagios is the included templates, which can get you up and running with a variety of standard monitors with very little setup. There are preconfigured monitors for things like CPU data, memory, disk space, network bandwidth, service uptime, and more, all of which must be configured manually with Nagios. As a bonus, graphs are available out of the box with Zabbix, whereas on Nagios they are another plugin and setup task.
Zabbix also eases the configuration of triggers and alerts by including common and desirable settings with the templates, which is another manual task with Nagios. Reasonable thresholds for monitors come preset so that you can quickly begin monitoring your systems and simply tweak the parameters as you learn more. The system takes care of alerting the proper staff based on your on-call schedule via email, SMS, and Jabber out of the box.
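To make the idea of preset thresholds that you tweak later a little more concrete, here is a minimal Python sketch. It is not Zabbix's template format or API; the metric names, the default percentages, and the `fired_triggers` helper are all made up for illustration.

```python
# Hypothetical default thresholds, expressed as a percentage of capacity.
# These numbers are illustrative assumptions, not values shipped by Zabbix.
DEFAULT_THRESHOLDS = {"cpu": 90.0, "memory": 85.0, "disk": 80.0}


def fired_triggers(readings, overrides=None):
    """Return the sorted names of metrics whose reading exceeds its threshold.

    `readings` maps a metric name to its current value. `overrides` lets you
    replace individual defaults without redefining the whole set.
    """
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    return sorted(
        name
        for name, value in readings.items()
        # Metrics without a configured threshold never fire.
        if name in thresholds and value > thresholds[name]
    )
```

Calling `fired_triggers({"cpu": 95.0, "disk": 82.0})` flags both metrics with the defaults, while passing `overrides={"disk": 90.0}` quiets the disk alert once you decide 80% was too cautious for that server.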
When we began setting up server monitoring, we started with Nagios. It quickly became obvious that the learning curve was very steep and that the time and effort involved in getting even basic monitoring off the ground would be significant. It's clear that Nagios is a powerful tool that can accommodate just about every need you can think of; however, it proved to be overkill for ours. At that point we began exploring alternative solutions and settled on Zabbix, since it seemed to have just as much potential but was more approachable and faster to get off the ground. It does a lot of things very well, but it still might be overly complicated for our basic server monitoring needs.
If you've got a dedicated team to support and maintain your infrastructure, using a platform like Nagios or Zabbix will make a lot of sense. The bulk of the effort will be up front, and once things are configured properly you'll rarely have to adjust the configuration. You will have to spend some time learning the ropes, though. If you've got more basic needs, you might consider a simpler solution that can at least tell you when things go offline or response times become a problem. Even though we've been running Zabbix for a few months now and it has been pretty helpful, we're considering a switch to New Relic. It's unclear at this point how and when this quest will end for us.
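For a sense of what the "simpler solution" end of the spectrum can look like, here is a rough Python sketch of a check that reports whether a service is offline or slow to respond. It only measures how long a TCP connection takes to open. The `check_tcp` name and the default timeout and slowness cutoff are assumptions for illustration, and this is nowhere near a replacement for a real monitoring platform.

```python
import socket
import time


def check_tcp(host, port, timeout=3.0, slow_after=1.0):
    """Return (status, seconds), where status is "ok", "slow", or "down".

    The check opens a TCP connection and times it. `timeout` and
    `slow_after` are in seconds and are illustrative defaults only.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError:
        return "down", None
    return ("slow" if elapsed > slow_after else "ok"), elapsed
```

You could run something like `check_tcp("example.com", 443)` from cron every few minutes and send yourself an email whenever the status is not `"ok"`. Note that an open port only proves the service accepts connections, not that it is answering requests correctly.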