InfoWorld review: Data deduplication appliances

Data deduplication appliances from FalconStor, NetApp, and Spectra Logic provide excellent data reduction for production storage, disk-based backups, and virtual tape

Ever wonder why hard drive capacities continue to get bigger? Do you think IT has ever told management that they will need less storage capacity over the next three years? In fact, three years from now, your company will likely have four times as much data to store as it's storing today. The gigabytes will continue to turn into terabytes, and the terabytes will soon give way to petabytes.

Fortunately, there is a way to slow the inevitable data sprawl: Use data deduplication on your storage system. Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year. If admins can reduce storage usage by 20, 40, or 60% by removing duplicate data, that allows current storage investments to go that much further.
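The block-level approach described above can be sketched in a few lines of Python. This is only an illustration of the concept: the fixed 4KB block size and SHA-256 fingerprints are assumptions for the example, and commercial engines use their own (often variable) block sizes and hashing schemes.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size; real appliances vary

def dedupe(data: bytes):
    """Split data into fixed-size blocks, store one copy of each unique
    block, and represent the stream as an ordered list of block hashes."""
    store = {}    # hash -> unique block contents (single-instance store)
    recipe = []   # ordered hashes (the "placeholders") to rebuild the stream
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe):
    """Reassemble the original stream from the recipe of hashes."""
    return b"".join(store[h] for h in recipe)

# Ten identical 4KB blocks dedupe down to a single stored block.
data = (b"x" * BLOCK_SIZE) * 10
store, recipe = dedupe(data)
print(len(recipe), "blocks referenced,", len(store), "actually stored")
assert rebuild(store, recipe) == data  # lossless: the data comes back intact
```

Running this prints "10 blocks referenced, 1 actually stored": a 90% reduction on perfectly duplicate data, which is the same order of savings the appliances reviewed here reported on highly duplicative file sets.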

[ How does data deduplication work? What's the difference between block-level and file-level approaches? How to choose among source, target, and inline methods? Find out by downloading InfoWorld's Data Deduplication Deep Dive Report. ]

To see what data deduplication can do, I reviewed four storage appliances that use the technology: the FalconStor FDS 304, the NetApp FAS2040, and the Spectra Logic nTier v80 and nTier vX. All four appliances provided excellent scalability, performance, and data deduplication functionality. Each solution has a bit of its own personality -- one looks like a rack of tape drives, another a large network-attached storage system, and a third a direct-connect Fibre Channel appliance.

FalconStor's FDS 304 is a 2U NAS (network attached storage) appliance utilizing SATA hard drives and gigabit and 10-gigabit Ethernet network interfaces. IT will typically deploy the FDS 304 as a disk-to-disk backup partner or as a target for disk-based backups, but it can also serve as main-line storage. NetApp's FAS2040, also in a 2U form factor, can be deployed as a Gigabit NAS, Fibre Channel, or IP SAN, as well as a Fibre Channel over Ethernet device. It too is going to see action as a target for disk-based backups and data replication; it can also be used as a general-purpose storage medium. For enterprises that have a large investment in physical tape libraries or that will be virtualizing their tape farms, Spectra Logic's nTier family is a great choice. A "drop in" virtual tape library (VTL) appliance that uses FalconStor's data deduplication engine, the nTier can replace physical tape systems or run in parallel with a physical tape library while deduplicating stored data.

All of these appliances offer an easy-to-implement, easy-to-manage, and effective data deduplication system that any enterprise network could take advantage of. Based on my tests with a highly duplicative set of Windows and Office files and their backups, you can expect similar levels of data deduplication from all of them. Note: If you plan to deduplicate system backup sets as well as raw files, you'll want to make sure that the deduplication engine works with your backup software.

FalconStor FDS 300

Intended to be the target of disk-to-disk backup and archiving applications, the FalconStor FDS (File-interface Deduplication System) appliances provide a solid mix of performance, data replication, and deduplication while integrating seamlessly into the data center. Serving as a target for CIFS, NFS, and Symantec OpenStorage (OST), FDS doesn't require any network reengineering in order to integrate it into an existing environment. It has a flexible deduplication policy engine to allow IT to control if and when deduplication can occur and can even exclude folders from deduplication if necessary. Overall deduplication performance was statistically identical to Spectra Logic's, which is no surprise considering that Spectra Logic licenses FalconStor's deduplication engine.

For this roundup, I received the FDS 304 2U chassis loaded up with 4TB of hot-plug SATA RAID 6 disk storage, expandable to a maximum of 32TB through additional storage enclosures. It comes standard with four Gigabit Ethernet interfaces and (via two expansion slots) can add more gigabit interfaces (four-port expansion card) or a single-port 10Gb Ethernet interface. This chassis, like all chassis in the FDS family, will connect to the LAN via gigabit and 10Gb Ethernet interfaces, as well as act as an iSCSI target. Like the other appliances, it also includes dual hot-plug power supplies. There are three other models of the FDS 300, scaling to a capacity of 18TB of in-chassis storage and topping out at a maximum of 32TB using external enclosures.

There is a preconfigured virtual version of FalconStor's FDS appliance that runs on VMware ESX/ESXi 3.5 update 4 and vSphere 4; it also provides remote offices with a way to utilize data deduplication without requiring an additional piece of hardware. The virtual FDS is available in both 1TB and 2TB versions and makes it easy to bring deduplication to remote or branch offices.

The core use-case scenario for the FDS 304 is as a target for disk-based storage and backup systems. While FalconStor does offer VTL appliances, the FDS family is intended to be used as a file share for CIFS and NFS clients on the network. It is also meant to take the place of traditional tape-based backup systems. The FDS family comes with Symantec NetBackup OST support to allow tight integration between NetBackup -- or other OST-aware products -- and the appliance. While I did not test using NetBackup, FalconStor claims up to 500MBps maximum inbound speed using OST over 10Gb Ethernet.

I integrated the FDS 304 into my test bed as both a backup destination and CIFS file share. While I could have mounted data shares on the FDS as local storage using iSCSI, I decided to map a drive letter to the various shares from a Windows Server 2008 R2 box and my four virtual Windows 2008 R2 servers. I had no trouble manipulating files on the various shares from any of my servers -- each share behaved just as a typical Windows share would. I also used another share as a backup target for Symantec Backup Exec.

Using FalconStor's FDS management utility, IT can quickly assess how much data is stored on the chassis and how much data has been deduplicated.

Over the course of my tests, I ran multiple daily backups of the five Windows servers without any issues. Unlike the NetApp FAS2040, the FDS had no problems deduping my Backup Exec backup sets. Typical file and folder deduplication was very efficient, providing nearly a 90% reduction on highly duplicated data. The backup sets were "full system backups," including Windows, installed applications, and Microsoft Exchange data stores. A mix of Microsoft Word and Excel files rounded out the set. I saw virtually no difference in deduplication performance whether the FDS was working on a plain collection of files or on the Backup Exec archives.

There are two choices for when to dedupe the data: on a scheduled basis or in real time (as the data is written to the disk). I set up a nightly scheduled deduplication pass, and it ran without any issues. I also was able to run manual dedupe passes when I wanted to check deduplication results immediately. The real-time deduplication policy, which analyzes the data as it is written to the device, serves to keep the data shares as deduped as possible. There is a small performance penalty when deduping in real time, but in my tests it was negligible. No matter what the deduplication needs, FalconStor will let you define a policy that fits.

I tried to fool the deduplication engine by renaming files and folders and changing extensions, but as with the other appliances, regardless of what I tried, the deduplication engine always found the duplicate blocks, recording new blocks in its hash table and removing the duplicates to reduce overall data size. Because the deduplication engine works at the block level, it looks past details such as file name and type and analyzes the underlying data for duplicate blocks. Regardless of the type of file -- PDF, Word document, ZIP archive, and so on -- the dedupe engine ferreted out the duplicate blocks like a champ.
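A toy demonstration shows why renaming can't fool a block-level engine: blocks are fingerprinted by content alone, and the file name never enters the hash. The 4KB blocks and SHA-256 hashes here are illustrative assumptions, not the appliances' actual parameters.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

def block_hashes(content: bytes):
    """Fingerprint a byte stream as a list of per-block hashes.
    Note that only the content is hashed -- never the file name."""
    return [hashlib.sha256(content[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(content), BLOCK_SIZE)]

# Two "files" with different names and extensions but identical content.
payload = b"quarterly numbers " * 500
files = {"report.docx": payload,
         "renamed.bak": payload}

hashes = {name: block_hashes(data) for name, data in files.items()}

# The block fingerprints match exactly, so a block-level dedupe engine
# stores the data once no matter what the files are called.
assert hashes["report.docx"] == hashes["renamed.bak"]
```

Because the engine only ever sees content fingerprints, a renamed copy deduplicates just as completely as an untouched one, which matches the behavior observed in testing.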

FalconStor's management interface, which is virtually identical to Spectra Logic's, was easy to navigate once I became familiar with the organization of the UI. While it's not as intuitive as NetApp's System Manager, I had little trouble creating file shares, defining deduplication policy, and monitoring the health and performance of the system. I was able to easily view reports on storage usage, amount of deduped data, and percentage of storage reclaimed by deduplication. These reports will help IT keep tabs on overall storage usage and deduplication performance.

The FalconStor FDS 304 is a solid piece of network engineering and proved more than effective at storing data and detecting duplicate blocks of information. It makes an excellent target for disk-based backups and general file sharing. I liked the ease of creating CIFS shares, and the ability to serve as an iSCSI target and export NFS shares gives the appliance a good deal of flexibility. While the reporting system isn't anything to get excited about, it provides enough feedback on the health of the appliance to keep IT well informed.

The dashboard in the FalconStor FDS console provides an at-a-glance overview of disk usage trends.

NetApp FAS2040

Another appliance geared toward disk-based storage and deduplication is NetApp's FAS2040. This appliance allows multiple installation options for the data center, including as a SAN or NAS target, or direct via Fibre Channel. Like the FalconStor appliance, the NetApp can serve as production storage, as a backup device, or as both simultaneously.

The FAS2040 comes with up to two independent storage controllers and scales far beyond the FalconStor and Spectra Logic appliances. In addition to CIFS and NFS protocols, the FAS2040 can also automatically export an NFS datastore to a VMware ESX server, a nice time saver for adding online disk space to an existing VMware environment. NetApp's deduplication policy engine didn't have the same level of flexibility as FalconStor's, but it did a good job of reducing disk usage on volumes with a standard file/folder structure. However, on backup sets created by Symantec Backup Exec 2010, it didn't fare as well.

My NetApp-provided FAS2040 2U chassis was populated with a dozen 300GB SATA drives, two hot-swap storage controllers, each with four Gigabit Ethernet interfaces and two 4Gb Fibre Channel ports, and dual power supplies. My chassis was configured with two aggregates (RAID arrays) -- one for each controller -- in a dual-parity RAID configuration. To fit most any need, there are a variety of hard drives -- Fibre Channel, SAS, or SATA -- available for the FAS2040. By way of additional external drive chassis, the FAS2040 can access a maximum of 136TB of raw space, far more than the other chassis reviewed here.

I installed the FAS2040 on my test network via Gigabit Ethernet, connecting independently to both controllers in the chassis. I carved both aggregates into multiple volumes and shares, defining some as CIFS file shares while setting others up as iSCSI targets. (Like the other systems reviewed, the NetApp also allows you to create NFS shares for Linux/Unix clients.) As with the FalconStor and Spectra Logic appliances, I used the NetApp's various CIFS shares as NAS file storage and as a backup destination for my physical and virtual Windows Server 2008 machines. I had no trouble using both mapped drives and UNC (Universal Naming Convention) connections to the NetApp from all of my servers, physical and virtual. I also had no trouble mounting iSCSI shares as local storage using Microsoft's iSCSI initiator in Windows Server 2008. Each mounted volume behaved exactly like local storage.

One feature I really liked in the FAS2040 was the dual storage controllers. Depending on your needs and the configuration of the appliance, one chassis can serve as its own Active/Passive failover device. In case one controller should suffer a catastrophic failure, the other controller can take over transparently. Or, as in my case, you can use both controllers in an Active/Active configuration, if you want both controllers online and providing independent storage to your network.

Part of my testing involved simple file copies to the shares on the NetApp, while the other was based on using the NetApp as a destination for multiple Backup Exec jobs. The NetApp's deduplication of files and folders was impressive, showing excellent detection and elimination of duplicate or partially duplicate data. Like the FalconStor and Spectra Logic appliances, data reduction of highly duplicative file shares easily passed 90%. However, I was surprised at the trouble the NetApp had with the Backup Exec backup files.

Using the NetApp System Manager, it is an easy process to define a deduplication policy for each disk share on the appliance.

During my tests, I stored multiple daily backup sets to a NetApp CIFS volume from each server. Regardless of how or when the deduplication engine analyzed the stored backup files, I never got better than about 8% data reduction on the volume. Exchange message stores fared better, showing on average a reduction of 12% in disk usage.

I asked NetApp for a possible reason for this and was told the deduplication engine works on 4KB blocks. It seems the Backup Exec family of software inserts metadata into the backup files, throwing off the alignment of the 4KB boundaries and making it much harder for NetApp to locate duplicate byte segments. Symantec has made a change in its Enterprise Vault 8.0 to block-align with the NetApp engine, so not all Symantec products suffer from this misalignment. Backup software from some other vendors, including CommVault and VMware, keeps the 4KB block boundaries intact.
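The alignment problem NetApp described is easy to reproduce with a fixed-block sketch: insert a few bytes of metadata at the front of a stream, and every 4KB boundary behind the insertion shifts, so a fixed-block engine finds almost nothing in common between the two copies. The synthetic data and 10-byte insertion below are assumptions for the demonstration, not Backup Exec's actual format.

```python
import hashlib
import os

BLOCK = 4096  # the fixed block size NetApp cited

def chunk_hashes(data: bytes):
    """Return the set of fingerprints for the stream's fixed 4KB blocks."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

# 256KB of synthetic "backup" data.
original = os.urandom(BLOCK * 64)

# The same data with 10 bytes of metadata prepended: every 4KB boundary
# after the insertion point shifts by 10 bytes.
shifted = b"0123456789" + original

a, b = chunk_hashes(original), chunk_hashes(shifted)
overlap = len(a & b) / len(a)
print(f"matching blocks after a 10-byte insert: {overlap:.0%}")  # 0%
```

Even though the two streams are more than 99.99% identical byte for byte, the fixed-block engine sees no matching blocks at all, which is consistent with the roughly 8% reduction observed on Backup Exec sets versus 90%-plus on plain files.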

Admins can define a deduplication policy on a per-volume basis. The deduplication policy engine doesn't provide an overwhelming number of options, but it gets the job done. IT can define a policy to run dedupes manually on demand, or automatically when a specific amount of new data lands on the volume, or based on time of day or day of week. I was able to create a daily dedupe policy for a volume that started at 9 a.m. and stopped at 10 p.m. and ran every hour. Apart from the most extreme cases, this is overkill, but it is available if needed and it worked flawlessly.

IT has two options for managing the FAS2040: Web browser and the stand-alone management console, the NetApp System Manager. While the browser-based management portal was straightforward, I found System Manager much more user-friendly and intuitive, even more so than FalconStor's UI. Both storage controllers were represented in the management utility with each major function broken into separate grouped tasks, making it very easy to locate specific items.

As with FalconStor and Spectra Logic, there isn't a fancy reporting engine. There are, however, useful graphs and data points, such as volume details and space saved, scattered throughout System Manager. NetApp did a good job of organizing System Manager so that the amount of information presented in it is applicable and useful, without going overboard and inundating you with too much data.

I was really impressed with the FAS2040 from NetApp, both in terms of hardware options and manageability. I found the appliance very easy to integrate into my network and very easy to use. Deduplication was easy to manage, and files and folders that typically reside on a file system deduped with great success. My only complaint is with the poor results when deduping Backup Exec backup sets. Of course, no matter which deduplication solution you choose, you'll want to make sure that it works with your backup software.

On this particular iSCSI volume, I was able to achieve 92% disk space savings due to the highly redundant nature of the file data.

Spectra Logic nTier v80 and nTier vX

The nTier family of deduplication appliances from Spectra Logic has a slightly different focus than the FalconStor and NetApp units. Its target market is IT shops that still primarily use tape and tape libraries for backup. The nTier line of appliances is a set of VTL chassis that look like tape drives to the outside world and allow for the easy addition of deduplication during the backup process. The appliances are highly scalable and modular, allowing for in-place upgrades and a long life span. Making use of FalconStor's deduplication engine, the Spectra Logic appliances do a very good job of reducing the overall size of backup data.

To see Spectra Logic's solution in action, I received the nTier v80 and nTier vX VTL deduplication appliances. The nTier v80 is a 3U rack mount appliance that has a storage capacity of 8TB to 16TB (RAID 6) using SATA drives. The nTier vX is a massive 4U chassis nearly twice as deep as a standard rack mount server. Its storage capacity runs from 10TB to 60TB (RAID 6), upgradable in 10TB increments. Both chassis come with SCSI, Fibre Channel, and iSCSI (Gigabit Ethernet) interfaces for host connectivity and dual Intel Xeon multicore CPUs. Redundant power supplies and lots of fans round out the hardware.

The key to the Spectra Logic solution is its close ties to existing tape backup systems. Each VTL dedupe appliance can emulate a wide range of physical tape drives and libraries. In my lab, I chose the IBM Ultrium TD-3 (LTO3) format and defined 21 virtual tapes for six virtual tape drives. Spectra Logic supports eight different types of tape drives and ten different types of tape drive libraries.

As with FalconStor, I used iSCSI to connect my Windows Server 2008 server to the nTier v80 and Symantec Backup Exec 2010 to handle backup chores. The nTier vX was set up as a replication partner to the nTier v80, with deduplication taking place on the nTier v80 as soon as the backup completed and replication running at midnight. Both appliances worked flawlessly during my tests, with the nTier v80's VTL system "swapping out" tapes exactly as directed.

Spectra Logic licenses both the deduplication engine as well as part of the FDS management console in its nTier appliances. Here we see the VTL definition and an at-a-glance view of storage system usage.

I ran Backup Exec agents on four Windows Server 2008 R2 virtual machines running on Microsoft Hyper-V as well as a physical Windows Server 2008 R2 server. After the initial backups, the deduplication engine did an excellent job of detecting redundant data in each subsequent backup. I even tried to fool it by renaming groups of folders without changing the contents. In each case, the deduplication engine recognized the data and greatly reduced my backup size and replication footprint. In my test configuration, deduplication was done post-backup, but I could have easily run it in parallel with the backup. There is a slight performance penalty to deduping in real time, so for most users, post-processing is the way to go.

When a backup is made to one of the virtual tape drives, the data is written to the virtual tape in the same format as if it were a physical tape. This allows a copy to be saved and deduped on the nTier appliance and then streamed to a physical tape for off-site archival. One big advantage to using VTL is that IT staff already using physical tape drives and libraries don't have to learn a new backup system. They continue to use the same backup programs and schedules they are used to. Also, because each backup contains a catalog of the data, it's very easy to locate and restore files from the VTL.

Spectra Logic licenses FalconStor's deduplication engine and includes FalconStor's UI in its appliances for management purposes. All other aspects of the appliances are managed through Spectra Logic's own BlueScale management platform. BlueScale provides a common user interface across nTier and other Spectra Logic storage systems. From the BlueScale UI, I was able to see how effective the deduplication engine was, manage and maintain my virtual tape libraries, and define my replication schedule. I found it to be pretty intuitive to use after a few initial minutes of exploration.

The Spectra Logic nTier family of VTL appliances does an excellent job of standing in for physical tape drives and libraries, supporting either a migration away from physical tape or service as an intermediary that makes physical tape more efficient. Through iSCSI, each virtual drive looked like a physical drive to my backup software, and the deduplication engine worked well in all scenarios. The management UI was easy to navigate, although defining the virtual tape libraries was a little daunting. Nevertheless, for enterprises that want to keep the look and feel of tape but migrate to disk-based deduplication, the nTier family is a perfect fit.

With the exception of the VTL components, Spectra Logic's management console looks and functions exactly like FalconStor's, providing an excellent quick view of disk usage and deduplication statistics.

FalconStor FDS 304

Pros:
  • Good all-around deduplication performance
  • 10Gb Ethernet support
  • Native Symantec OST support

Cons:
  • Limited scalability (32TB)
  • No Fibre Channel

NetApp FAS2040

Pros:
  • Highly scalable (136TB)
  • Dual storage controllers support active/passive and active/active configurations
  • Supports NAS, SAN, and Fibre Channel in one chassis

Cons:
  • Poor dedupe results with Backup Exec backup sets

Spectra Logic nTier v80 and nTier vX

Pros:
  • Excellent "drop in" VTL appliance
  • Dedupes to both virtual and physical tape libraries
  • Multiple connectivity options

Cons:
  • Defining VTL libraries was a little difficult

This article, "InfoWorld review: Data deduplication appliances," was originally published at InfoWorld.com. Follow the latest developments in storage and enterprise data management at InfoWorld.com.

