
Storage Tip: Intelligence for unstructured data

storage.itworld.com | Storage

Send your Storage question to David Hill today! | See other Storage tips from David



What seems to be the problem? A recent newspaper article stated that the latest job of surveillance cameras is to interpret the threats they see. This is a real-time use of unstructured data, enabled by software intelligence, rather than the use of intelligent forensic analysis tools on unstructured data after the fact (see previous storage tip on digital surveillance). It is just the latest example of how unstructured data is being used in organizations, and IT organizations are likely to acquire custodial responsibility for such applications. And therein lies your challenge. Not only will there be more data to manage and store, but the data protection strategies are likely to differ from those used for structured information.


What do you need to know? A great deal of confusion surrounds the discussion of the structure of data. General agreement exists that database information is structured, because the data and its associated metadata are tightly coupled. The way to determine whether data is structured is to ask whether it can be sorted. If the answer is yes, the data is structured.


The disagreement is over what counts as semi-structured and what counts as unstructured data. General agreement exists that e-mail is semi-structured and that videos, pictures, audio files, and medical images are unstructured. Word processing documents and presentations are often considered unstructured, but they are really semi-structured.


The difference between a semi-structured file and an unstructured file is simple. Both have file metadata, but you can search the content of a semi-structured file, such as an HTML document, using standard tools (think Google). You cannot do that natively with an unstructured file; you can only sense it, such as by viewing a video or listening to an audio file. (You can also sense a word processing document, but you can search it as well.)
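To make the distinction concrete, here is a minimal sketch of what "searching with standard tools" might look like for an HTML document, using only Python's standard library. The names `TextExtractor` and `contains_keyword` and the sample page are assumptions made up for this illustration; they are not from any particular product.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.chunks.append(data)


def contains_keyword(html, keyword):
    """Return True if the keyword appears in the document text (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return keyword.lower() in text


# Hypothetical sample page, used only for this illustration.
page = (
    "<html><body><h1>Storage Tip</h1>"
    "<p>Intelligence for unstructured data</p></body></html>"
)

print(contains_keyword(page, "unstructured"))
print(contains_keyword(page, "body"))
```

The markup gives the tool something to work with: the words are stored as text, so a keyword match is straightforward. A raw video or audio file offers no such text to match against until some form of recognition is applied to it first.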


The reason for distinguishing the three types is that each is managed differently. However, there is a movement afoot to add intelligence to unstructured data and thereby make it more manageable.


Let's use an example. An HR department receives a paper resume in the mail. At this stage the data is unstructured because all that can be done is to sense it, i.e., read it visually. However, put the resume through a scanner and apply some optical character recognition. Voila! The data now becomes semi-structured and can be searched. For example, you can find all the candidates who have an electrical engineering degree. But wait, there's more. More intelligence can be applied, and data such as education and work experience can now be put in a relational database. Now a query can be issued against multiple criteria to find the set of candidates who meet all of the minimum criteria, say a college degree for education and five or more years for work experience. Note that all three versions may be preserved. The structured version helps determine whether a candidate has certain basic qualifications, such as education and work experience. The semi-structured version can be searched for keywords that may not be in the structured database. And the unstructured version can show how the candidate presents himself or herself as a whole on paper.
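The semi-structured and structured stages of the resume example can be sketched as follows. This is only an illustration under stated assumptions: the candidate names, the OCR text, the table layout (`candidates`, `has_degree`, `years_experience`), and the helper functions are all made up here, and the OCR step itself is assumed to have already run.

```python
import sqlite3

# Semi-structured stage: text as it might come out of OCR (made-up sample data).
ocr_text = {
    "Candidate A": "BS in electrical engineering. Six years designing power supplies.",
    "Candidate B": "Coursework in history, no degree completed. Eight years in technical support.",
    "Candidate C": "MS in electrical engineering. Three years writing firmware.",
}


def keyword_search(documents, phrase):
    """Return the names of documents whose text contains the phrase."""
    phrase = phrase.lower()
    return sorted(name for name, text in documents.items() if phrase in text.lower())


# Structured stage: the same facts extracted into a relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE candidates ("
    "name TEXT PRIMARY KEY, has_degree INTEGER, years_experience INTEGER)"
)
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?)",
    [("Candidate A", 1, 6), ("Candidate B", 0, 8), ("Candidate C", 1, 3)],
)


def qualified(connection, min_years):
    """Return candidates who have a degree AND at least min_years of experience."""
    rows = connection.execute(
        "SELECT name FROM candidates "
        "WHERE has_degree = 1 AND years_experience >= ? ORDER BY name",
        (min_years,),
    ).fetchall()
    return [row[0] for row in rows]


print(keyword_search(ocr_text, "electrical engineering"))
print(qualified(conn, 5))
```

The keyword search can find a phrase that nobody thought to model as a database column, while the SQL query can combine several criteria at once. That is one reason for keeping both versions alongside the original paper.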


Another example. Voice recognition software can transcribe the audio track of a video into a semi-structured format so that you can search it for keywords. Identify the speaker as well, and you now know who said what and when. A third example is medical images. A pulmonary diagnostic intelligence tool can help physicians interpret a medical image of a diseased lung. So increasing the intelligence of unstructured information is not confined to one particular area, such as surveillance, but applies across many areas.
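As a rough sketch of the "who said what and when" idea, suppose the recognition step has already produced transcript segments tagged with a speaker and a start time. The `Segment` record, the `who_said` helper, and the sample transcript below are hypothetical and stand in for whatever a real recognition tool would emit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One recognized stretch of speech from a video's audio track."""
    speaker: str
    start_seconds: float
    text: str


def who_said(segments, keyword):
    """Return (speaker, start time) for each segment that mentions the keyword."""
    keyword = keyword.lower()
    return [
        (segment.speaker, segment.start_seconds)
        for segment in segments
        if keyword in segment.text.lower()
    ]


# Made-up transcript, used only for this illustration.
transcript = [
    Segment("Speaker 1", 0.0, "Welcome to the quarterly storage review."),
    Segment("Speaker 2", 12.5, "Backup windows are growing with the video archive."),
    Segment("Speaker 1", 31.0, "Then the backup policy needs a second look."),
]

for speaker, start in who_said(transcript, "backup"):
    print(f"{speaker} at {start} seconds")
```

The video itself remains unstructured; what becomes searchable is the text derived from it, and each match points back to a position in the original recording.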
