Continuous delivery (CD) of software releases offers huge efficiency gains for companies that can implement it. But is continuous delivery even possible when the application’s backbone is a massive relational database? How can one spin up database copies for developers, QA, integration testing, and delivery testing for the rapid flow of features from development to operations to production? It’s not like Chef or Puppet can spin up a 10TB database copy in a few minutes, the way one can spin up a Linux VM.
There is a way to deliver these database copies, and it is called copy data virtualization. Data virtualization allows one to spin up that 10TB database in minutes, branch a new copy of it from Dev to QA in minutes, or even branch several copies at once, all for very little storage.
Continuous Delivery, Continuous Integration, and Agile development offer an opportunity to hit deadlines on budget, with efficiency gains over waterfall methods. With waterfall, we try to design all the requirements, specifications, and architecture of a release up front, then set the development teams working, and only tackle integration and deployment near the end of the cycle. It’s impossible to precisely target dates when the project will be completed and sufficiently QA’d. Even worse, previously undetected problems and bugs start to pour in during late-stage integration, further jeopardizing release dates. Continuous delivery solves these problems: every few changes are passed through QA, run through integration, and tested for deployment in a continuous delivery environment.
The benefits of continuous delivery are driving more and more shops toward continuous integration and continuous delivery. With tools like Jenkins, TeamCity, and Travis CI to run continuous integration tests; virtualization technologies such as VMware, AWS, OpenStack, Vagrant, and Docker; and tools like Chef, Puppet, and Ansible to handle setup and configuration, many shops have moved closer to full continuous integration and delivery.
But there is one huge roadblock: getting the right data, especially when the data is large, into the Agile, Continuous Integration, and Continuous Delivery lifecycle and flowing through that lifecycle. Gene Kim voices this problem when he identifies the top two bottlenecks he sees in IT:
- Provisioning environments for development
- Setting up test and QA environments
and he goes on to say:
One of the most powerful things that organizations can do is to enable development and testing to get the environments they need when they need it. - Gene Kim
A recent white paper from Contino says:
Having worked with many enterprise organisations on their DevOps initiatives, the biggest pain point and source of wastage that we see across the software development lifecycle is around environment provisioning, access and management. - Contino
An article published in Computing [UK] voices this problem:
“From day one our first goal was to have more testing around the system, then it moves on from testing to continuous delivery,” [said Mike Lear, the CIO at City Group]
But to achieve this, while at the same time maintaining the integrity of datasets, required a major change in the way Lear’s team managed its data.
“It’s a big part of the system, the database, and we wanted developers to self-serve and base their own development in their own controlled area,” he says.
Lear was determined to speed up this process, and began looking for a solution – although he wasn’t really sure whether such a thing actually existed. - Computing [UK]
Lear eventually did find a solution to the roadblock of provisioning data, and that solution was data virtualization.
The data roadblock is discussed more and more by industry experts as everyone moves toward continuous integration, but not everyone has found a solution. Many teams still resort to full backups of production data for testing, a problematic practice cited by Jez Humble and David Farley in their book “Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation”:
"When performing acceptance testing or capacity testing (or even, sometimes, unit testing), the default option for many teams is to take a dump of the production data. This is problematic for many reasons (not least the size of the dataset)"
- Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation - Humble, Jez; Farley, David
What can we do about this enormous obstacle to Continuous Integration—the requirement to readily provide large complex data set copies for development, QA and continuous integration?
Fortunately for us, there is data virtualization technology. As OS virtualization opened the door to continuous integration at the OS level, data virtualization swings it wide open for enterprise-level application development, which depends on large databases.
Data virtualization is an architecture that connects to a source database or dataset, takes an initial copy, and then forever collects the changes from the source. Some high-profile tools in this space are EMC SRDF, NetApp SMO, Oracle standby database, Actifio, and Delphix. The data is saved either on storage with native snapshot capabilities (as in NetApp and ZFS) or via software like Delphix that maps a snapshot file system onto any storage, even JBODs. The data is managed as a timeline on the snapshot storage. For example, Delphix saves 30 days of changes by default; changes older than 30 days are purged, meaning that a copy can be made, down to the second, anywhere within this 30-day window. From the timeline of snapshots, versions of data or databases can be spun up in minutes for almost no extra storage overhead.
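To make the block-sharing idea concrete, here is a toy Python sketch (not any vendor’s implementation) of copy-on-write snapshots and clones. Blocks are stored once by content hash; a clone is just a new block map that shares every unchanged block with its parent, which is why a virtual copy costs almost nothing until it diverges.

```python
import hashlib

class CopyOnWriteStore:
    """Toy copy-on-write block store: snapshots and clones share
    unchanged blocks, so each virtual copy costs only its deltas."""

    def __init__(self):
        self.blocks = {}  # content hash -> block bytes (stored once)

    def _put(self, data):
        h = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(h, data)  # dedupe: identical blocks stored once
        return h

    def snapshot(self, image):
        """Capture an image (a list of block bytes) as a block map."""
        return [self._put(b) for b in image]

    def clone(self, snap):
        """A clone is just a copy of the block map -- no data is copied."""
        return list(snap)

    def write(self, snap, index, data):
        """Overwrite one block in a clone; only the new block is stored."""
        snap[index] = self._put(data)

store = CopyOnWriteStore()
prod = store.snapshot([b"block-%d" % i for i in range(1000)])  # "production"
dev = store.clone(prod)           # instant dev copy, zero new blocks
store.write(dev, 0, b"patched")   # dev diverges by a single block

print(len(store.blocks))  # 1001 unique blocks backing two 1000-block copies
```

Real systems operate at the storage layer on fixed-size blocks rather than Python lists, but the economics are the same: ten clones of a 10TB source consume roughly the space of their combined changes, not 100TB.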
Virtual data improves the business’s bottom line by eliminating the enormous infrastructure, bureaucracy, and time drag involved in provisioning databases and data for development environments. Development environments depend on having copies of production data and databases, and data virtualization allows provisioning in a few minutes with almost no storage overhead by sharing duplicate blocks among all the copies.
As a side note, but an important one, development and QA often require that data be masked to hide sensitive information such as credit card numbers or patient records; thus, it’s important that a solution come integrated with masking technology. Data virtualization combined with masking can vastly reduce the surface area that must be secured (the amount of potentially exposed data) by eliminating full copies. Data virtualization can also include chain-of-authority record keeping, which tracks who had access to which data at what time.
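As a rough illustration of the masking idea, here is a hypothetical Python sketch of one simple rule: hide all but the last four digits of anything that looks like a 16-digit card number. Production masking tools apply far richer, format-preserving, and referentially consistent transformations; this only shows the shape of the problem.

```python
import re

def mask_card_numbers(text):
    """Toy masking rule: keep the last four digits of a 16-digit
    card-like number (with optional space/hyphen separators),
    replace the rest with asterisks."""
    def repl(match):
        digits = re.sub(r"[ -]", "", match.group())
        return "****-****-****-" + digits[-4:]
    # 15 digits each optionally followed by a separator, then a final digit
    return re.sub(r"\b(?:\d[ -]?){15}\d\b", repl, text)

print(mask_card_numbers("Card 4111 1111 1111 1111 on file"))
# Card ****-****-****-1111 on file
```

Because virtual copies are provisioned from a central point, masking can be applied once at the virtualization layer so that every downstream dev and QA copy is born masked.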
The typical architecture before data virtualization requires that a production database be physically copied to development. From development, the copies are further propagated to QA and UAT. Because provisioning each environment takes multiple teams and people (DBA, storage, system, network, and backup), and because each copy requires additional resources, the number of provisioned environments is usually limited, and the data is often old and unrepresentative.
With data virtualization, there is a time flow of data states stored efficiently on the “data virtualization appliance” and provisioning a copy only takes a few minutes, uses minimal storage, and can be run by a single administrator or even as self-service by the end users such as developers, QA and business analysts.
With large data environments provisioned in minutes and with little resource consumption, it becomes easy to spin up copies for development and to branch those copies into multiple parallel QA lanes, enabling continuous integration.
This article is published as part of the IDG Contributor Network.