Storage Tip: Intelligence for unstructured data
Send your Storage question to David Hill today! | See other Storage tips from David
What seems to be the problem? A recent newspaper article stated that the latest job of surveillance cameras is to interpret the threats they see. This is a real-time, software-driven use of unstructured data, as opposed to applying intelligent forensic analysis tools to that data after the fact (see previous storage tip on digital surveillance). It is just the latest example of how organizations are putting unstructured data to work, and IT organizations are likely to acquire custodial responsibility for such applications. Therein lies your challenge. Not only will there be more data to manage and store, but the data protection strategies are likely to differ from those used for structured information.
What do you need to know? A great deal of confusion surrounds the discussion of the structure of data. General agreement exists that database information is structured, because the data and its associated metadata are tightly coupled. A practical test for whether data is structured is to ask whether it can be sorted. If the answer is yes, the data is structured.
The disagreement is over what counts as semi-structured data and what counts as unstructured data. General agreement exists that e-mail is semi-structured and that videos, pictures, audio files, and medical images are unstructured. Word processing documents and presentations, however, are often labeled unstructured when they are really semi-structured.
The difference between a semi-structured file and an unstructured file is simple. Both have file metadata, but you can search the content of a semi-structured file, such as an HTML document, using standard tools (think Google). You cannot do that natively with an unstructured file; you can only sense it, for example by viewing a video or listening to an audio file. (You can also sense a word processing document, but you can search it as well.)
The reason for distinguishing the three types is that each is managed differently. However, there is a movement afoot to add intelligence to unstructured data and thereby make it more manageable.
Let's use an example. An HR department receives a paper resume in the mail. At this stage the data is unstructured, because all that can be done with it is to sense it, i.e., read it visually. However, put the resume through a scanner and apply optical character recognition. Voila! The data is now semi-structured, and you can search it by keyword, for example to find all the candidates who have an electrical engineering degree. But wait, there's more. More intelligence can be applied, and data such as education and work experience can be extracted and put in a relational database. Now a query can be issued against multiple criteria to find the set of candidates who meet all of the minimum requirements, say a college degree for education and five or more years for work experience. Note that all three versions may be preserved. The structured version helps determine whether a candidate has certain basic qualifications, such as education and work experience. The semi-structured version can be searched for keywords that may not be in the structured database. And the unstructured version can show how the candidate presents himself or herself as a whole on paper.
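The semi-structured and structured stages of the resume example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the OCR text, the resume IDs, the table layout, and the function names (keyword_search, build_candidate_db, qualified) are all hypothetical, and the step that extracts fields from the text is assumed to have happened elsewhere.

```python
import sqlite3

# Hypothetical OCR output for two scanned resumes (the semi-structured version).
ocr_text = {
    "resume_001": "Jane Doe. B.S. Electrical Engineering. Seven years in power systems design.",
    "resume_002": "John Roe. High school diploma. Three years as a freelance illustrator.",
}


def keyword_search(docs, phrase):
    """Return sorted ids of documents whose text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return sorted(doc_id for doc_id, text in docs.items() if needle in text.lower())


def build_candidate_db(rows):
    """Load already-extracted fields into an in-memory table (the structured version)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE candidates ("
        "resume_id TEXT PRIMARY KEY, has_degree INTEGER, years_experience INTEGER)"
    )
    conn.executemany("INSERT INTO candidates VALUES (?, ?, ?)", rows)
    return conn


def qualified(conn, min_years):
    """Return ids of candidates with a college degree and at least min_years of experience."""
    cur = conn.execute(
        "SELECT resume_id FROM candidates "
        "WHERE has_degree = 1 AND years_experience >= ? ORDER BY resume_id",
        (min_years,),
    )
    return [row[0] for row in cur.fetchall()]


# Hypothetical extracted fields: (resume_id, has_degree, years_experience).
rows = [("resume_001", 1, 7), ("resume_002", 0, 3), ("resume_003", 1, 2)]

if __name__ == "__main__":
    print(keyword_search(ocr_text, "electrical engineering"))
    print(qualified(build_candidate_db(rows), 5))
```

The keyword search works on any text, but it cannot answer "five or more years" reliably; the relational query can, because experience is stored as a number rather than as words on a page.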
Another example: voice recognition software can transcribe the audio track of a video into a semi-structured format so that you can search on keywords. Identify the speaker as well and you now know who said what, and when. A third example is medical images. A pulmonary diagnostic intelligence tool can help physicians diagnose a medical image of a diseased lung. So increasing the intelligence of unstructured information is not confined to one particular area, such as surveillance, but spans a much broader range of applications.
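To make the "who said what, and when" idea concrete, here is a minimal sketch that assumes a recognition tool has already produced transcript segments tagged with a speaker label and a start time. The Segment type, the sample transcript, and the who_said_when function are hypothetical and do not represent the output format of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str          # label assigned by speaker identification (assumed)
    start_seconds: float  # offset of the segment within the video
    text: str             # transcribed words for this segment


# Hypothetical transcript of a short recorded meeting.
transcript = [
    Segment("Speaker A", 0.0, "Let's review the quarterly storage budget."),
    Segment("Speaker B", 6.5, "Archive growth is driving most of the cost."),
    Segment("Speaker A", 14.2, "Then the archive tier needs a new budget line."),
]


def who_said_when(segments, keyword):
    """Return (speaker, start_seconds) for each segment that mentions the keyword."""
    needle = keyword.lower()
    return [(s.speaker, s.start_seconds) for s in segments if needle in s.text.lower()]


if __name__ == "__main__":
    for speaker, start in who_said_when(transcript, "archive"):
        print(f"{speaker} at {start:.1f}s")
```

The original video remains unstructured; the transcript is a searchable companion that points back into it by time offset.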
What can you do about it? The age of putting increased intelligence in unstructured data is upon us. The resume example has been around for years, but the medical and surveillance examples are fairly recent. Since unstructured data takes up a lot of storage space, IT administrators are going to have more data to manage and storage administrators are going to have more data to store. Moreover, since unstructured data is typically fixed content that rarely changes once written, it may be stored on high-capacity SATA disk arrays rather than on higher-performance FC or SAS disk arrays. Typical data protection, such as backup/restore software designed for changeable structured information, may not be the right fit for unstructured data; replication technologies may fit the bill instead.
Overall, IT organizations have the skills to deal with the data growth caused by unstructured data - and deal with the growth you must. But just as with previous data revolutions (first to a focus on online transaction processing systems, which use structured data, and then to semi-structured information, such as e-mail, office productivity, and HTML documents), the heightened awareness and utility of unstructured information can only make the place of data even more central to an enterprise.
storage.itworld.com












