The Internet Archive's Wayback Machine gets new data center

By Lucas Mearian, IDG News Service

The Internet Archive announced Wednesday that it has a new computer behind its library of 151 billion archived Web pages. The machine fits in a 20-foot-long outdoor metal cargo container filled with 63 server clusters that offer 4.5 million gigabytes of data storage capacity and 1TB of memory.

The Internet Archive has been taking a snapshot of the World Wide Web every two months since 1997, and the images are made available through the Wayback Machine, a Web site that gets about 200,000 visitors a day or about 500 hits per second on the 4.5 petabyte database.

"It may be the single largest database in the world, and it's all in a shipping container. I think of the shipping container as a single machine or expression made up of many smaller machines," said Brewster Kahle, digital librarian and co-founder of the Internet Archive, the nonprofit organization that runs the Wayback Machine site.

For the past 13 years, the Internet Archive has been growing rapidly, most recently by about 100TB of data per month. Until last year, the site had been using a more traditional data center filled with 800 standard Linux servers, each with four hard drives. The new Sun Modular Datacenter that powers it now is on Sun's campus in Santa Clara, Calif., and houses eight racks filled with 63 Sun Fire x4500 servers with dual- or quad-core x86 processors running Solaris 10 with ZFS. Each Sun server is combined with an array of 48 1TB hard drives. The server unit is referred to as a "Thumper."

"The only thing needed besides [the shipping container] are the network connections, a chilled water supply and electricity," said Dave Douglas, Sun's chief sustainability officer. "Customers using this tend to be people running out of data center space and need something quickly or need a data center in [a] remote area where mobility is key."

The nonprofit Internet Archive, which is based in the Presidio in San Francisco, uses an algorithm that repeats a Web crawl every two months in order to add new Web page images to its database. The algorithm first performs a broad crawl that starts with a few "seed sites," such as Yahoo's directory. After capturing the home page, it moves to any referable pages within the site until there are no more pages to capture. If there are any links on those pages, the algorithm automatically opens them and archives that content as well.
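
The broad crawl described above resembles a breadth-first traversal of a link graph. The following is a minimal sketch of that idea, not the Internet Archive's actual crawler: the function names, the toy link graph and the made-up URLs are all assumptions for illustration, and the sketch does not distinguish in-site pages from external links or fetch anything over the network.

```python
from collections import deque


def broad_crawl(seeds, get_links):
    """Breadth-first crawl starting from a list of seed URLs.

    `get_links` is a hypothetical callback that returns the URLs a page
    links to; a real crawler would fetch and parse the page here.
    """
    queue = deque(dict.fromkeys(seeds))  # de-duplicated seeds, order kept
    seen = set(queue)                    # pages already queued or captured
    archived = []                        # pages in the order captured
    while queue:
        url = queue.popleft()
        archived.append(url)             # stand-in for saving a snapshot
        for link in get_links(url):
            if link not in seen:         # follow each link only once
                seen.add(link)
                queue.append(link)
    return archived


# Toy link graph standing in for the Web (all URLs are made up).
TOY_WEB = {
    "seed.example/": ["seed.example/a", "seed.example/b"],
    "seed.example/a": ["other.example/", "seed.example/"],
    "seed.example/b": [],
    "other.example/": ["seed.example/b"],
}


def toy_links(url):
    """Look up outgoing links in the toy graph; unknown pages have none."""
    return TOY_WEB.get(url, [])


if __name__ == "__main__":
    for page in broad_crawl(["seed.example/"], toy_links):
        print(page)
```

The `seen` set is what lets the crawl terminate "until there are no more pages to capture": each page is queued at most once, even when pages link back to one another.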

Previously, a typical Web crawl was supported by 10 or 20 clustered Linux servers, Kahle said. The new crawls are supported by the entire data center, as all 63 Sun Fire servers act as a single machine.

In addition to Web pages, the Archive also keeps software, books and a moving image collection that has 150,000 items in 100 different subcollections, as well as audio clips -- to the tune of 200,000 items in over 100 collections.

"We see this scale of machine, and the idea of putting machines outdoors is a potential long-term trend for organizations like us," Kahle said.

The Internet Archive also works with about 100 physical libraries around the world whose curators help guide deep Internet crawls. The Internet Archive's massive database is mirrored to the Bibliotheca Alexandrina, the new Library of Alexandria in Egypt, for disaster recovery purposes.
