"The Library of Congress has an interesting model" in that part of the information stored is metadata -- or data about what is stored -- while the rest is the actual content, says Greg Schulz, an analyst at consultancy StorageIO. Although plenty of organizations use metadata, Schulz explains that what makes the Library of Congress unique is the sheer size of its data store and the fact that it tags absolutely everything in its collection, including vintage audio recordings, videos, photos and files on other types of media.
The actual content -- which is seldom accessed -- is ideally kept offline and on tape, with perhaps a thumbnail or low-resolution copy kept on disk, Schulz explains. The metadata can reside in a different repository for searching.
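As a rough illustration of the split Schulz describes, the sketch below keeps a small searchable catalog on disk while the master copies stay offline. It is not the library's actual schema: the class name, field names, tape locations and sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Searchable metadata record; every field name here is hypothetical."""
    object_id: str
    title: str
    media_type: str
    tags: set = field(default_factory=set)
    tape_location: str = ""    # pointer to the offline master copy on tape
    thumbnail_path: str = ""   # optional low-resolution copy kept on disk


def search(catalog, tag):
    """Return entries carrying `tag`; only metadata is read, never the tape."""
    return [entry for entry in catalog if tag in entry.tags]


# Made-up sample records, for illustration only.
catalog = [
    CatalogEntry("a1", "Field recording", "audio",
                 {"vintage", "audio"}, "tape-0042/slot-17"),
    CatalogEntry("p7", "Street photograph", "photo",
                 {"photo"}, "tape-0108/slot-03", "thumbs/p7.jpg"),
]

hits = search(catalog, "vintage")
for entry in hits:
    print(entry.object_id, "->", entry.tape_location)
```

The search touches only the catalog. The tape location is followed only when someone actually needs the full-resolution content.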
The library uses two separate systems as a best practice for preserving data. One is a massive tape library that has 6,000 tape drive slots and uses the IBM General Parallel File System (GPFS). This file system uses a concept similar to metatagging photos at Flickr.com: files are encoded with algorithms that make the data easier to process and retrieve quickly.
A second archive, with about 9,500 tape drive slots, consists of Oracle SL8500 tape libraries that use the Sun Quick File System (QFS).
Another best practice: Every archive is sent to long-term storage, then immediately retrieved to validate the data, then stored again.
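The store, retrieve, validate, store-again cycle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's actual tooling: the in-memory archive stands in for tape hardware, SHA-256 is an assumed choice of checksum, and all names are hypothetical.

```python
import hashlib


class InMemoryArchive:
    """Hypothetical stand-in for a tape archive; real systems talk to hardware."""

    def __init__(self):
        self._store = {}

    def put(self, key, data):
        self._store[key] = bytes(data)

    def get(self, key):
        return self._store[key]


def sha256_hex(data):
    """Checksum used for validation (SHA-256 is an assumption, not from the source)."""
    return hashlib.sha256(data).hexdigest()


def archive_with_validation(archive, key, data):
    """Write, read back immediately, compare checksums, then store again."""
    expected = sha256_hex(data)
    archive.put(key, data)              # first trip to long-term storage
    retrieved = archive.get(key)        # immediate retrieval
    if sha256_hex(retrieved) != expected:
        raise IOError(f"validation failed for {key!r}")
    archive.put(key, retrieved)         # store the validated copy again
    return expected


archive = InMemoryArchive()
digest = archive_with_validation(archive, "recording-001", b"example payload")
print(digest)
```

Raising an error on a mismatch means a bad write surfaces at ingest time, while the original is still on hand, rather than years later when the object is finally requested.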
Today the library holds around 500 million objects per database, but Youkel expects this number to grow to as many as 5 billion objects. To prepare for this growth, Youkel's team has started rethinking the namespace system. "We are looking at new file systems that can handle that many objects," he says.
Gene Ruth, a storage analyst at Gartner, says that scaling up and out correctly is critical. When a data store grows beyond 10PB, the time and expense of backing up and otherwise handling all of the files climb steeply. One approach: Have one infrastructure in a primary location that handles the ingestion of most of the data, and a secondary facility for long-term archival storage.
Splitting files into manageable chunks
Amazon.com, the e-commerce giant that has ventured into cloud services, is quickly becoming one of the largest holders of data in the world, with around 450 billion objects stored in its cloud for its own storage needs and those of its customers. Alyssa Henry, vice president of storage services at Amazon Web Services, says that works out to about 1,500 objects for each person in the United States, or one object for every star in the Milky Way galaxy.
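The per-person figure is easy to sanity-check. The quick calculation below assumes a U.S. population of roughly 310 million, a figure that does not come from the article.

```python
# Figure from the article.
objects_stored = 450_000_000_000

# Assumption, not from the article: approximate U.S. population.
us_population = 310_000_000

per_person = objects_stored / us_population
rounded = round(per_person, -2)  # nearest hundred

print(f"{per_person:.0f} objects per person, roughly {rounded:.0f}")
```

Under that assumption the result lands a little under 1,500, consistent with the rounded number Henry cites.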