If they develop it, will it work? Exploring Gmail

Google Inc.'s free e-mail service, Gmail, has received a huge amount of interest in the past week thanks mostly to its claim that it will offer 1GB of storage to each user.

It seems safe to assume that within a few days of the service going live, it will have several million people apply for an account. One gigabyte multiplied by several million could represent the world's largest-ever storage order. It could also represent the world's single largest privacy problem due to Google's business model where content-related ads will pop up as you read your incoming mail.

Microsoft Corp.'s Hotmail and Yahoo Inc. offer just a few megabytes of free e-mail storage each. Users pay for additional storage. Google is comprehensively disrupting this model of Web-hosted e-mail. And already a Mac web-hosting service, Spymac Network Inc., is also offering a free gigabyte of e-mail storage to its members.

If one million users, say, take up Gmail then, on the face of it, one petabyte (1PB, or one million gigabytes) of hard disk would be needed. Double that for redundancy, add in more for indexing, and some lucky supplier could find a 2.5PB HDD order in the in-tray.
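The arithmetic behind that worst-case figure can be sketched in a few lines of Python. This is a rough illustration only: the function name and parameters are ours, and the 25 per cent indexing overhead is an assumption chosen purely so that 2PB of mirrored mail grows to the 2.5PB quoted above.

```python
# Back-of-the-envelope worst case: every user fills the whole quota.
# Decimal units are used throughout: 1PB = 1,000,000GB.
GB_PER_PB = 1_000_000


def raw_capacity_pb(users, quota_gb=1.0, redundancy=2.0, index_overhead=0.25):
    """Return the disk needed, in PB, if every user filled the quota.

    redundancy=2.0 models "double that for redundancy".
    index_overhead=0.25 is an assumed fraction of the mirrored total,
    picked to match the article's 2.5PB figure; it is not a Google number.
    """
    mail_pb = users * quota_gb / GB_PER_PB      # 1M users x 1GB = 1PB
    mirrored_pb = mail_pb * redundancy          # 2PB once duplicated
    return mirrored_pb * (1 + index_overhead)   # plus room for the index


if __name__ == "__main__":
    print(f"{raw_capacity_pb(1_000_000):.1f}PB for one million users")
```

Changing `users` to 10 million or 100 million shows how quickly the worst case grows if nothing else about the model changes.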

But Google doesn't work like this.

As we described, Google operates a massively distributed server and storage design using clustered Linux X86 server nodes with one or two hard drives each. The servers store Google's web page index separately from the web documents themselves.

A Google spokeswoman confirmed: "Gmail is built on existing Google search technology, letting people quickly search over the large amount of information in their emails. Using keywords or the advanced search feature, Gmail users can find what they need, when they need it." The Gmail service, incidentally, is already up and running and all Google employees have their own "gmail.com" address.

But such a system architecture is unusual in a world where storage networking is the norm. It may also be a gamble for the search engine giant, with storage experts telling us that alternative methods are better when dealing with so much data.

Google's system can be defined as direct-attached storage (DAS), where, oddly enough, storage is attached directly to a computer. The vast majority of big storage networks in use are either network-attached storage (NAS), where a data server on a network provides storage accessed via the network, or storage area networks (SANs), which are high-speed subnetworks of shared storage devices.

Tom Clark, director for SAN technology at McData, thinks Google may have it wrong. "With individual servers with separate, direct-attached storage, there are inherent scaling problems over time and I would think increased administrative overhead as more servers are added," he told us. "The success of SANs to date is based on the ability to reduce administrative overhead through centralized sharing of storage assets, streamlining backup operations, gaining performance via SAN-based RAID, plus five-nines availability (meaning, 99.999 per cent availability) through enterprise-class storage."

He continued: "I would think Google would see a significant benefit from implementing a high-performance SAN, which would also scale more readily over time compared to NAS. Even global file systems such as Sistina benefit from having SANs as the shared storage infrastructure."

Paul Ligget, sales and business development director for 3Pardata Inc. in Europe, is also sceptical. "There are massive benefits of centralized storage rather than DAS. These have been well documented. In terms of our systems, the major benefit to Google would be ease of provisioning new storage, ease of backup, snapshots to protect against corruption and allow rapid recovery, DR planning would be easier with our replication, rather than having to replicate each server."

Our understanding is that Google is proposing to treat an e-mail as a quasi-web page. It will be indexed and this index data added to the Gmail overall index. The e-mails themselves plus attachments will be stored as quasi-web documents.

Google will use its existing search technology to enable users to find their e-mails using keywords or other search features. Its website states: "Each message is grouped with all its replies and displayed as a conversation." This is similar to a newsgroup thread.

The infrastructure needs will be massive, but Google currently operates more than 15,000 Linux servers in clusters of over a thousand machines. Wayne Rosing, Google VP engineering, said in a report: "It will take many petabytes. The infrastructure is quite amazing ... and we don't even flinch at the thought of 10 million or even 100 million users."

Mail storage

Users will not use up their 1GB entitlement. They'll add e-mails over time. Google will compress the data and, it's expected, store single copies of duplicated attachments. It will extract and discard spam, unless users want to keep it. Hotmail experience is that users may take up a tenth of their entitlement. So we suggest that a million users in a year would need 0.25PB, not 2.5PB. That's a more manageable 250TB.
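As a rough check on that estimate, the sketch below scales the 2.5PB worst case by the one-tenth take-up rate attributed to Hotmail users. The function and parameter names are hypothetical, and it assumes redundancy and indexing overhead shrink in proportion to the mail actually stored, which the figures above imply but nobody has confirmed.

```python
# Scale the worst-case figure by how much of the quota users really fill.
TB_PER_PB = 1000  # decimal units


def expected_storage_tb(users, worst_case_pb_per_million=2.5, take_up=0.10):
    """Return the expected disk requirement in TB.

    worst_case_pb_per_million: the article's 2.5PB per million users.
    take_up: fraction of the 1GB entitlement actually used; 0.10 is the
             Hotmail experience cited above, not a measured Gmail figure.
    """
    worst_case_pb = worst_case_pb_per_million * users / 1_000_000
    return worst_case_pb * take_up * TB_PER_PB


if __name__ == "__main__":
    print(f"{expected_storage_tb(1_000_000):.0f}TB for one million users")
```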

Just hold that incredulity at these numbers a moment more. Google has already scaled out a web indexing and retrieval infrastructure from a couple of machines in 1998 to 2,000 in January, 2000, 4,000 in June 2000 and on up to 15,000 plus today. It's done it once. It can do it again, this time with an e-mail indexing and retrieval infrastructure.

Once again, there will not be any SANs involved, no Fibre Channel, no Shark or Symmetrix arrays, just commodity IDE drives in commodity X86 servers running Red Hat Linux. Scaling out is straightforward with new server/drive combos being added to a 10/100Mbit/s Ethernet network infrastructure.

Paying for it

Google reckons that a gigabyte of storage costs under US$2 to operate per annum, we understand. It might bring in $1 to $10 a year in ad revenue per Gmail user.
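To see how those two numbers interact, here is a minimal sketch of the per-user sums. It treats the US$2 figure as an upper bound, uses the $1 to $10 revenue range quoted above, and compares a user who fills the whole gigabyte with one who uses a tenth of it, as Hotmail users are said to. Whether the $2 already covers redundancy and indexing is not known, so the results are indicative at best; all names in the code are our own.

```python
# Per-user annual margin: ad revenue minus the cost of the storage used.
COST_PER_GB_YEAR = 2.00          # US$, upper bound quoted in the article
REVENUE_RANGE = (1.00, 10.00)    # US$ per user per year, as quoted


def annual_margin_per_user(stored_gb, revenue):
    """Return revenue minus storage cost for one user, in US$ per year."""
    return revenue - stored_gb * COST_PER_GB_YEAR


if __name__ == "__main__":
    # 0.1GB reflects the one-tenth take-up assumption; 1.0GB is a full quota.
    for stored_gb in (0.1, 1.0):
        low = annual_margin_per_user(stored_gb, REVENUE_RANGE[0])
        high = annual_margin_per_user(stored_gb, REVENUE_RANGE[1])
        print(f"{stored_gb}GB stored: margin ${low:.2f} to ${high:.2f} a year")
```

On these assumptions the service only loses money on a user who fills the quota while bringing in ad revenue at the bottom of the range.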

Attachments and spam

Attachments can be large, related to a basic e-mail message, and duplicated. It is likely that Gmail will detect duplicate attachments and use the indexing system to enable several users' e-mail content indices to point to just one attachment. Compression may also be used. The overall Gmail service is in test mode at present while kinks are being ironed out.
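One common way to store a single copy of a duplicated attachment is to key each file by a hash of its contents, so that several users' messages point at the same stored bytes. The sketch below illustrates that general technique; it is not a description of how Gmail works, and the `AttachmentStore` class and its methods are hypothetical.

```python
import hashlib


class AttachmentStore:
    """Toy single-instance store: identical attachments are kept once."""

    def __init__(self):
        self._blobs = {}  # content digest -> attachment bytes
        self._refs = {}   # (user, message_id) -> content digest

    def add(self, user, message_id, data):
        """Record an attachment and return the digest it is stored under."""
        digest = hashlib.sha256(data).hexdigest()
        # Keep the bytes only the first time this content is seen.
        self._blobs.setdefault(digest, data)
        self._refs[(user, message_id)] = digest
        return digest

    def get(self, user, message_id):
        """Return the attachment bytes for one user's message."""
        return self._blobs[self._refs[(user, message_id)]]

    def stored_bytes(self):
        """Total bytes actually held, after de-duplication."""
        return sum(len(blob) for blob in self._blobs.values())


if __name__ == "__main__":
    store = AttachmentStore()
    report = b"quarterly report" * 1000
    store.add("alice", "msg-1", report)
    store.add("bob", "msg-7", report)  # same file, forwarded to a second user
    print(store.stored_bytes(), "bytes held for", 2 * len(report), "bytes received")
```

A real service would also have to count references so that a shared attachment is only removed once the last message pointing to it is deleted, which this sketch leaves out.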

Google's website explains: "Gmail includes a sophisticated spam filter that we're continuing to improve. The Report Spam link in Gmail is a way for users to help with this effort. It removes spam from the inbox and sends valuable data to the Gmail team working on spam blocking."

This filtering out of spam by an ISP is becoming more common. When the BT Yahoo! OpenWorld broadband service recently updated its spam filters, users saw a huge reduction in spam in their in-boxes.

Privacy

How will Gmail ensure that only the entitled e-mail owner gets to see their e-mail? After all, in this world of shared PCs it cannot assume that a PC is being used by the same person all the time. There will be a username/password scheme as for Hotmail and Yahoo!. Google has published the terms of its privacy policy.

Some people think that a person's e-mail is sacrosanct. Google stresses that no humans, no Google staff, read e-mails, unless specifically requested by users. The placing of ads is done by the same computer systems that place ads on search result pages. Advertisers receive a record of the total number of impressions and clicks for each ad. They do not receive any personal information about the person who viewed the ad from Google.

Notwithstanding this, the Privacy International organization has been reported as saying that a "vast violation of European law is occurring". The problem appears to be that Google can read people's e-mails or retain e-mails after users have deleted them.

The Gmail privacy policy says that residual e-mails may remain on the system after a user has deleted them. Campaigners appear not to realize that Yahoo!'s privacy policy states pretty much the same: "If you ask Yahoo! to delete your Yahoo! account, in most cases your account will be deactivated and then deleted from our user registration database in approximately 90 days. This delay is necessary to discourage users from engaging in fraudulent activity. Please note that any information that we have copied may remain in back-up storage for some period of time after your deletion request. This may be the case even though no information about your account remains in our active user databases."

It seems that computer scanning of e-mails constitutes "reading" as far as privacy campaigners are concerned. They say European law is stronger in this regard than U.S. law. Possibly the Google developers have not realized this.

Microsoft's take on Gmail is this: "The offering appears to be a very limited beta that we have not yet seen; therefore we cannot provide specific comment. It will be interesting to see how Google's trial develops and what they ultimately will deliver broadly to consumers. We are very focused on ensuring that our 170 million active MSN Hotmail customers are increasingly satisfied with the world

This story, "If they develop it, will it work? Exploring Gmail" was originally published by Techworld.com.
