Storage Tip: Intelligence for unstructured data

storage.itworld.com | Storage

Send your Storage question to David Hill today! | See other Storage tips from David



What seems to be the problem? A recent newspaper article stated that the latest job of surveillance cameras is to interpret the threats they see. This is a real-time, software-intelligence-enabled use of unstructured data, as opposed to applying intelligent forensic analytic tools to that data after the fact (see the previous storage tip on digital surveillance). It is just the latest example of how unstructured data is being used in organizations, and IT organizations are likely to acquire custodial responsibility for such applications. And therein lies your challenge. Not only will there be more data to manage and store, but the data protection strategies are likely to differ from those used for structured information.


What do you need to know? A great deal of confusion surrounds the discussion of the structure of data. General agreement exists that database information is structured, because the data and its associated metadata are tightly coupled. The way to determine whether data is structured is to ask whether it can be sorted. If the answer is yes, the data is structured.


The disagreement is over what counts as semi-structured and what counts as unstructured data. General agreement exists that e-mail is semi-structured and that videos, pictures, audio files, and medical images are unstructured. Word processing documents and presentations, however, are often considered unstructured, but they are really semi-structured documents.


The difference between a semi-structured file and an unstructured file is simple. Both have file metadata, but you can search the content of a semi-structured file, such as an HTML document, using standard tools (think Google). You cannot do that natively with an unstructured file; you can only sense it, for example by viewing a video or listening to an audio file. (You can sense a word processing document too, but you can also search it.)
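To make the "searchable with standard tools" point concrete, here is a minimal sketch in Python. It assumes a small made-up HTML page (`PAGE`) and uses only the standard library's `html.parser`; the class and function names are illustrative, not part of any particular search product. The search runs against the document's text, not its markup, which is roughly what a web search engine does on a much larger scale.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.chunks.append(data)


def html_contains(html, keyword):
    """Return True if the keyword appears in the document's text (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return keyword.lower() in text.lower()


# Hypothetical sample page, invented for this illustration.
PAGE = "<html><body><h1>Storage Tip</h1><p>Managing unstructured data</p></body></html>"

if __name__ == "__main__":
    print(html_contains(PAGE, "unstructured"))  # keyword is in the text
    print(html_contains(PAGE, "body"))          # tag names are not searched
```

Nothing comparable works on the raw bytes of a video or an audio file; without an added layer of intelligence, only the file metadata (name, size, dates) can be queried.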


The reason for distinguishing the types is that each of the three is managed differently. However, there is a movement afoot to add intelligence to unstructured data and thereby make it more manageable.


Let's use an example. An HR department receives a paper resume in the mail. At this stage the data is unstructured, because all that can be done is to sense it, i.e., read it visually. However, put the resume through a scanner and apply optical character recognition. Voila! The data is now semi-structured and can be searched by keyword. For example, you can find all the candidates who have an electrical engineering degree. But wait, there's more. More intelligence can be applied, and data such as education and work experience can be put into a relational database. Now a query can be issued against multiple criteria to find the set of candidates who meet all of the minimum requirements, say a college degree for education and five or more years for work experience. Note that all three versions may be preserved. The structured version helps determine whether a candidate has certain basic qualifications, such as education and work experience. The semi-structured version can be searched for keywords that may not be in the structured database. And the unstructured version can show how the candidate presents himself or herself as a whole on paper.
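The resume example can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the candidate names, the OCR text, the extracted fields, and the `candidate` table layout are all invented, and an in-memory SQLite database stands in for whatever relational database an HR system would actually use. The first function is the keyword search that the semi-structured (OCR) version allows; the second is the multi-criteria query that the structured version allows.

```python
import sqlite3

# Hypothetical OCR output for three scanned resumes (semi-structured text).
RESUME_TEXT = {
    "Ana Ruiz": "BS Electrical Engineering. Seven years designing power supplies.",
    "Ben Cho": "High school diploma. Ten years as a network technician.",
    "Cy Dale": "BA History. Two years in technical writing.",
}

# Hypothetical fields extracted from those resumes (structured rows):
# (name, has_degree as 0/1, years_experience)
CANDIDATES = [
    ("Ana Ruiz", 1, 7),
    ("Ben Cho", 0, 10),
    ("Cy Dale", 1, 2),
]


def keyword_search(texts, keyword):
    """Return names whose resume text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return sorted(name for name, text in texts.items() if needle in text.lower())


def qualified_candidates(rows, min_years):
    """Return names of candidates with a degree and at least min_years of experience."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE candidate "
            "(name TEXT, has_degree INTEGER, years_experience INTEGER)"
        )
        conn.executemany("INSERT INTO candidate VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT name FROM candidate "
            "WHERE has_degree = 1 AND years_experience >= ? ORDER BY name",
            (min_years,),
        )
        return [row[0] for row in cursor]
    finally:
        conn.close()


if __name__ == "__main__":
    print(keyword_search(RESUME_TEXT, "electrical engineering"))
    print(qualified_candidates(CANDIDATES, 5))
```

In this made-up data set, the keyword search can find a phrase such as "power supplies" that was never extracted into a database column, while the query can combine the degree and experience criteria in a single step.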


Another example: voice recognition can transcribe the audio track of a video into a semi-structured format so that you can search on keywords. Identify the speaker as well, and you now know who said what and when. A third example is medical images. A pulmonary diagnostic intelligence tool can help physicians diagnose a diseased lung from a medical image. So increasing the intelligence of unstructured information is not confined to one particular area, such as surveillance, but spreads across a much broader range of applications.
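The "who said what when" idea can be illustrated with a short Python sketch. It assumes that a speech-recognition and speaker-identification pass has already produced timestamped segments; the `Segment` type, the sample transcript, and the `who_said` function are hypothetical and do not correspond to any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_seconds: float  # offset of the segment within the recording
    speaker: str          # label assigned by speaker identification
    text: str             # words recognized in the segment


# Hypothetical transcript, invented for this illustration.
TRANSCRIPT = [
    Segment(0.0, "Alice", "Welcome to the quarterly storage review."),
    Segment(4.5, "Bob", "Backup windows grew by a third this quarter."),
    Segment(9.0, "Alice", "Then we should revisit the backup schedule."),
]


def who_said(transcript, keyword):
    """Return (start_seconds, speaker) for every segment mentioning the keyword."""
    needle = keyword.lower()
    return [
        (segment.start_seconds, segment.speaker)
        for segment in transcript
        if needle in segment.text.lower()
    ]


if __name__ == "__main__":
    for start, speaker in who_said(TRANSCRIPT, "backup"):
        print(f"{start:>5.1f}s  {speaker}")
```

The original recording stays unstructured; the transcript is the semi-structured layer that makes it searchable.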
