Storage Tip: Intelligence for unstructured data

March 9, 2007, 03:07 PM —  storage.itworld.com — 

Send your Storage question to David Hill today! | See other Storage tips from David



What seems to be the problem? A recent newspaper article stated that the latest job of surveillance cameras is to interpret the threats they see. This is a real-time use of unstructured data enabled by software intelligence, as opposed to applying intelligent analytic forensic tools to that data after the fact (see previous storage tip on digital surveillance). It is just the latest example of how unstructured data is being used in organizations, and IT organizations are likely to acquire custodial responsibility for such applications. And therein lies your challenge. Not only will there be more data to manage and store, but the data protection strategies are likely to differ from those used for structured information.


What do you need to know? A great deal of confusion surrounds the discussion of the structure of data. General agreement exists that database information is structured, because the data and its associated metadata are tightly coupled. The way to determine whether data is structured is to ask whether it can be sorted. If the answer is yes, the data is structured.


The disagreement is over what counts as semi-structured and what counts as unstructured data. General agreement exists that e-mail is semi-structured and that videos, pictures, audio files, and medical images are unstructured. Word processing documents and presentations, however, are often considered unstructured when they are really semi-structured.


The difference between a semi-structured file and an unstructured file is simple. Both have file metadata, but you can search the contents of a semi-structured file, such as an HTML document, using standard tools (think Google). You cannot do that natively with an unstructured file; you can only sense it, such as by viewing a video or listening to an audio file. (You can sense a word processing document too, but you can also search it.)


The reason for distinguishing the three types is that each is managed differently. However, there is a movement afoot to add intelligence to unstructured data, thereby making it more manageable.


Let's use an example. An HR department receives a paper resume in the mail. At this stage the data is unstructured because all that can be done is to sense it, i.e., read it visually. However, put the resume through a scanner and apply some optical character recognition. Voila! The data now becomes semi-structured and can be searched by keyword. For example, you can find all the candidates who have an electrical engineering degree. But wait, there's more. More intelligence can be applied, and data such as education and work experience can be put in a relational database. Now a query can be issued against multiple criteria to find the set of candidates who meet all of the minimum requirements, say a college degree for education and five or more years for work experience. Note that all three versions may be preserved. The structured version helps determine whether a candidate has certain basic qualifications, such as education and work experience. The semi-structured version can be searched for keywords that may not be in the structured database. And the unstructured version can show how the candidate presents himself or herself as a whole on paper.
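The difference between the semi-structured and structured versions of the resume can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate records, field names, and function names below are hypothetical, the OCR and field-extraction steps are assumed to have already happened, and an in-memory SQLite table stands in for whatever relational database an HR system would actually use.

```python
import sqlite3

# Hypothetical resume records (illustrative only). "text" stands in for OCR
# output; "degree" and "years" stand in for fields extracted from that text.
RESUMES = [
    {"name": "A. Rivera", "degree": "BS Electrical Engineering", "years": 7,
     "text": "Electrical engineering graduate. Designed FPGA boards."},
    {"name": "B. Chen", "degree": "BA History", "years": 9,
     "text": "History graduate with nine years in technical recruiting."},
    {"name": "C. Okafor", "degree": None, "years": 12,
     "text": "Self-taught technician; twelve years repairing FPGA test rigs."},
]

def keyword_search(resumes, keyword):
    """Semi-structured use: match a keyword anywhere in the OCR text."""
    needle = keyword.lower()
    return [r["name"] for r in resumes if needle in r["text"].lower()]

def structured_query(resumes, min_years):
    """Structured use: load extracted fields into a table, then query on
    several criteria at once (has a degree AND enough experience)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE candidates (name TEXT, degree TEXT, years INTEGER)")
    conn.executemany(
        "INSERT INTO candidates VALUES (?, ?, ?)",
        [(r["name"], r["degree"], r["years"]) for r in resumes],
    )
    rows = conn.execute(
        "SELECT name FROM candidates "
        "WHERE degree IS NOT NULL AND years >= ? ORDER BY name",
        (min_years,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]

print(keyword_search(RESUMES, "FPGA"))   # keyword found only in the free text
print(structured_query(RESUMES, 5))      # degree plus five or more years
```

The keyword search surfaces a candidate whose skill appears only in the free text, while the structured query filters on the extracted fields. That is why keeping both versions alongside the original paper resume can be worthwhile.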


Another example. Voice recognition can transcribe the audio track of a video and put it in a semi-structured format so that you can search on keywords. Identify the speaker as well and you now know who said what and when. A third example is medical images. A pulmonary diagnostic intelligence tool can help physicians diagnose a medical image of a diseased lung. So increasing the intelligence of unstructured information is not limited to one particular area, such as surveillance, but spans a much broader range of applications.
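As a rough sketch of what "who said what and when" could look like once a video's audio track has been transcribed, the Python below searches a list of transcript segments by keyword. The `Segment` type, the sample transcript, and the `who_said` function are all hypothetical; real voice recognition and speaker identification tools produce their own formats.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One transcribed stretch of speech (hypothetical format)."""
    speaker: str
    start_seconds: float
    text: str

# Illustrative transcript; in practice this would come from a speech tool.
TRANSCRIPT = [
    Segment("Speaker 1", 0.0, "Welcome to the quarterly storage review."),
    Segment("Speaker 2", 12.5, "Backup windows are growing with video data."),
    Segment("Speaker 1", 31.0, "Then replication may suit the video archive better."),
]

def who_said(transcript, keyword):
    """Return (speaker, start time) for every segment mentioning the keyword."""
    needle = keyword.lower()
    return [(s.speaker, s.start_seconds)
            for s in transcript if needle in s.text.lower()]

print(who_said(TRANSCRIPT, "video"))
```

The video itself remains unstructured; only the derived transcript becomes searchable, and each hit points back to a time offset in the original file.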


What can you do about it? The age of putting increased intelligence in unstructured data is upon us. The resume example has been around for years, but the medical and surveillance examples are fairly recent. Since unstructured data takes up a lot of storage space, IT administrators are going to have more data to manage and storage administrators are going to have more data to store. Moreover, since unstructured data is typically fixed content that does not change once written, it may be stored on high-capacity SATA disk arrays rather than on higher-performance FC or SAS disk arrays. Typical data protection, such as backup/restore software that was designed for changeable structured information, may not be needed for unstructured data, since replication technologies may fit the bill.


Overall, IT organizations have the skills to deal with the data growth caused by unstructured data - and deal with the growth you must. But just as with previous data revolutions (first to a focus on online transaction processing systems, which use structured data, and then to semi-structured information, such as e-mail, office productivity, and HTML documents), the heightened awareness and utility of unstructured information can only make the place of data even more central to an enterprise.

 
