Data cloning technology has advanced significantly with data virtualization solutions such as Copy Data Management (CDM) and Database Virtualization Appliances (DVA). These technologies have seen rapid adoption over the past four years, during which 100 of the Fortune 500 companies have adopted them. Adopters have seen massive reductions in storage requirements and shorter development projects, because data virtualization can produce clones of data in minutes with almost no storage overhead: rather than copying data, it shares an initial copy among all clones while still giving each clone read/write access by storing that clone's changes privately.
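The sharing mechanism described above is essentially copy-on-write. Here is a minimal sketch in Python; the `BaseImage` and `VirtualClone` classes are purely illustrative (real CDM/DVA products implement this at the storage-block level, not in application code):

```python
# Illustrative sketch of copy-on-write cloning, the idea behind
# data virtualization. Hypothetical classes, not any vendor's API.

class BaseImage:
    """A single shared copy of the production data (blocks keyed by address)."""
    def __init__(self, blocks):
        self.blocks = dict(blocks)

class VirtualClone:
    """A read/write clone: reads fall through to the shared base image,
    while writes land in a small private delta. Cloning is therefore
    near-instant and consumes almost no extra storage."""
    def __init__(self, base):
        self.base = base
        self.delta = {}           # only this clone's changes live here

    def read(self, addr):
        return self.delta.get(addr, self.base.blocks.get(addr))

    def write(self, addr, value):
        self.delta[addr] = value  # private change; base stays untouched

base = BaseImage({0: "orders", 1: "customers"})
dev = VirtualClone(base)
qa = VirtualClone(base)
dev.write(1, "customers-masked")

print(dev.read(1))  # dev sees its own private change
print(qa.read(1))   # qa still sees the shared base copy
```

Because each clone holds only its delta, ten clones of a 5 TB source cost little more than the deltas themselves.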
Why are copy data management (CDM) and data virtualization important to businesses? Data virtualization solutions reduce the data sprawl created by copies of production data. The average enterprise creates 8-10 copies of every production data source for application development, QA, user acceptance, production support, reporting, and backup. Thus, a 5 TB production database creates 40-50 TB of downstream copies, and a Fortune 500 firm might have more than 1,000 production databases generating petabytes of copy data. The amount of storage required to effectively manage copy data is stunning. Data virtualization eliminates these redundant copies and, more importantly, reduces the time required to make copies, which shortens application development, QA, and recovery times.
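The storage arithmetic above is easy to check. A quick calculation using the article's figures (the 5% per-clone delta used for the virtualized case is an illustrative assumption, not a measured number):

```python
# Back-of-the-envelope copy-data math from the figures above.
prod_size_tb = 5          # one production database
copies = 10               # dev, QA, UAT, support, reporting, backup, ...

physical = prod_size_tb * copies
print(physical)           # 50 TB of full physical copies

# With virtualization, each clone stores only its changes.
# A 5% delta per clone is an illustrative assumption.
delta_ratio = 0.05
virtual = prod_size_tb + prod_size_tb * delta_ratio * copies
print(virtual)            # 7.5 TB: one shared copy plus small deltas
```

Even with generous per-clone deltas, the virtualized footprint stays within a small multiple of the single production copy.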
Today only a few companies focus on this space, with more entering the market, and it is becoming difficult to separate vendor marketing hype from reality.
Savvy IT execs are realizing that “virtualizing” data will soon be baseline functionality required in IT departments. The time has already come where you should be asking questions like:
- What unique features does each vendor provide?
- Does the solution scale?
- Does the solution work in the datacenter and across private and public clouds?
- How much automation, self-service and application integration is pre-built into the solution?
- How complete is the solution, and how much custom work (scripting, manual management, or additional products) will be required to use it?
- Does the vendor have customers of my size and in my industry using their solution today?
- How fast can the solution be tested and run in a proof of concept (POC)?
- How rapidly can the solution be installed and rolled out?
So let’s now review the top 10 POC considerations from companies already using these solutions and break them down into two groups:
- Top 5 questions to ask before the POC
- Top 5 tests to run during the POC.
Top 5 questions
Top 5 questions to ask are: Does the solution...
- Support my environment?
- The first and most obvious goal is to find a solution that will easily integrate with your company’s infrastructure, data sources, and application stacks. This includes both on-premises environments and scaling into remote or cloud environments. You also want to make sure that you’re not being locked into a solution that supports only a single source environment. Do you have a requirement for more than one type of database, such as Oracle, SQL Server, Sybase, or MySQL? Do you have a requirement for application support, such as Oracle EBS or SAP? Do you have requirements for multiple host operating systems, such as Linux, AIX, HP-UX, Solaris, and Windows? Does the solution require specialized hardware, or can it run on your existing system resources?
- Have the required features?
- Does the solution have specific built-in features for your intended business goals and requirements such as
- Accelerating application release cycles
- Are there specific interfaces and features to support application developers?
- Ensuring data privacy and security
- Does the solution include masking, auditing, and chain of custody?
- Accelerating integration testing
- Does the solution include support for fast QA environments and QA-specific features, such as rollback for destructive testing?
- Migrating data to cloud environments
- Does the solution support cloud infrastructure? Does the solution support replication from in-house data sources to the cloud and vice versa?
- Enhancing backup and DR strategies
- Does the solution support a long-range, fine-grained RPO (recovery point objective) and a fast RTO (recovery time objective)?
- Reducing TCO by enabling additional use cases
- How well does the solution drive down TCO and provide greater ROI by enabling other use cases throughout the company? For example, if I have virtualized data that is synchronized with production for an integration-testing use case, can I now mask that data and migrate it to a public cloud so that my analytics team can perform business intelligence on the same datasets?
- Does the vendor have customers similar to my size and business requirements? Will I have to go through the growing pains of helping them break new ground, or do they have leaders in finance, retail, manufacturing, government, high-tech, and other industry verticals using their solution today?
- Is the vendor willing to show me all of these features during an actual POC? Are they able to back up all of their sales and marketing claims during an onsite POC with clearly defined success criteria?
Top 5 best practices for the POC
- Point-in-time provisioning
- Provision an environment to an exact point in time (in between “snapshots”). What is the process for finding an exact point in time? How easily does the solution allow me to provision data environments down to the minute, second, or transaction? Can provisioning be accomplished at the push of a button by an end user, such as a developer or business analyst, or does it require custom scripts and multiple people, such as a storage administrator, database administrator, and system administrator?
- Reset, branch and rollback of environments
- Now that I’ve provisioned a parent environment (a replica of a production environment), I’d like to make some changes and provision a branch (or child) of that environment. After making some changes to the child environment, now I would like to rewind the child environment back an hour, or 6 hours. How is this accomplished? Finally, I would like to reset both environments to their original state.
- Refresh parent and children environments with the latest data
- I’ve created a number of parent environments and I’ve spawned a number of child environments off of those parents. What is the process for getting the latest data from production into my parent and child environments? Could this process be performed by a developer or analyst? What is the impact to production (if any)?
- Provision multiple source environments to the same point in time
- I have a number of use cases (business intelligence, integration, DR, etc.) where I will need to align and provision multiple different data sources to a particular point in time (for example, align all of my source datasets to 5 p.m. local time). How is that achieved?
- Automation / self-service / auditing capabilities
- Can I perform all of the above tasks via a self-service GUI console? Is the GUI intuitive enough to provide self-service for my developers, analysts, and data owners? How robust is the CLI, and is there a full set of RESTful APIs for integration with DevOps tools? Finally, are all these tasks captured to provide a source of record for access to my data?
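The five tests above can be sketched as a toy snapshot engine. Everything here, from the `TimeFlow` and `Engine` class names down to each method, is hypothetical and meant only to make the operations concrete, not to mirror any vendor's actual API:

```python
# Toy sketch of the five POC operations: point-in-time provisioning,
# branching, reset/rewind, refresh, and an audit trail.
# All names are hypothetical, not a vendor API.
import bisect

class TimeFlow:
    """A source's change history as (timestamp, state) snapshots."""
    def __init__(self):
        self.history = []                      # kept sorted by timestamp

    def record(self, ts, state):
        self.history.append((ts, state))

    def state_at(self, ts):
        """Latest state at or before ts (point-in-time lookup)."""
        i = bisect.bisect_right([t for t, _ in self.history], ts) - 1
        if i < 0:
            raise ValueError("no snapshot at or before %r" % ts)
        return self.history[i][1]

class Engine:
    def __init__(self, source):
        self.source = source
        self.audit = []                        # source of record for access

    def provision(self, ts, name):
        """Provision a parent environment to an exact point in time."""
        state = self.source.state_at(ts)
        self.audit.append(("provision", name, ts))
        return {"name": name, "origin": state, "state": state}

    def branch(self, parent, name):
        """Spawn a child clone from a parent's current state."""
        self.audit.append(("branch", name, parent["name"]))
        return {"name": name, "origin": parent["state"],
                "state": parent["state"]}

    def reset(self, env):
        """Rewind an environment to the state it was provisioned with."""
        env["state"] = env["origin"]
        self.audit.append(("reset", env["name"]))

    def refresh(self, env, ts):
        """Pull newer data from the source into an environment."""
        env["origin"] = env["state"] = self.source.state_at(ts)
        self.audit.append(("refresh", env["name"], ts))

# Walk through the tests:
src = TimeFlow()
src.record(100, "v1")
src.record(200, "v2")
eng = Engine(src)
parent = eng.provision(150, "dev")     # point-in-time: lands on "v1"
child = eng.branch(parent, "qa")       # branch a child environment
child["state"] = "v1+destructive"      # destructive testing in the child
eng.reset(child)                       # rewind: child is "v1" again
eng.refresh(parent, 200)               # refresh parent to latest ("v2")
print(parent["state"], child["state"], len(eng.audit))
```

A real product would expose these operations through the self-service GUI, CLI, and REST APIs the last question asks about, with the audit list playing the role of the access record.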
In summary, the number one takeaway should be to make sure the vendors prove their solution and feature set to you in the POC. Data virtualization solutions represent significant potential for delivering massive advancements in data agility and data center utilization – advancements at a level not seen since VMware popularized server virtualization over ten years ago. For this reason, there will soon be many vendors attaching to this wave and claiming that their solution provides all the capabilities outlined in this document. Your task of separating the contenders from the pretenders is potentially a simple one – just make them prove it!
This article is published as part of the IDG Contributor Network.