Dmitri Alperovitch talks about reputation-based spam protection
With about 90 percent of all emails today being spam, it's hard for even the best anti-spam program to keep up. And what's worse is that spammers are constantly developing new techniques, such as image-based spam, to sidestep the filters. What if you could determine, ahead of time, the intentions of everyone who sends you an email? Wouldn't it be wonderful to know, without a doubt, who the bad guys are? What if there were a central authority that knew the reputations of everyone who has ever sent an email? As it turns out, you don't have to be a mind-reader. Reputation-based security is very similar to what the financial services industry has created with credit agencies. Every person who has ever paid a bill or used a credit card has a credit score (a credit reputation, if you will). When you want to buy a new car, the finance company looks at your reputation, and then decides whether or not to let you past the gates and give you some money. We now have the same thing for people who send emails.
Dmitri Alperovitch, Chief Research Scientist at Secure Computing and developer of reputation-based security, talks about the evolution of spam, the next big thing in spam prevention, and how to identify the culprits before they bombard your email server.
Where did reputation-based security come from?
It was an invention of CipherTrust, and since then, a variety of other companies have applied it to the email security area as well. When we were working on spam detection at CipherTrust, which was bought by Secure Computing, we realized early on that it made sense to aggregate information on a global level and to collect data from all of the customers that we had deployed at that time, and apply behavioral techniques not on an individual box, but on the cloud where you have a much broader view into email traffic.
You're the point man on reputation system development. What led to your development efforts in this direction? Was there an "aha" moment when you knew that you had to create a reputation system?
Spam was starting to take off in the early 2000s and we were developing antispam technology fairly early on. We realized that a lot of this analysis that we were trying to put on the spam gateway itself would really work much better if we had a view into more traffic. The only way to do that is to put it in the cloud, and have all of these devices talk to the cloud, report what they are seeing, and get information back from the cloud. So that was the "eureka" moment that we had, where we said, "hey, let's try to get as much data as possible." And the only way to do that is through this centralized authority. It's very much akin to a credit agency. If you're a store and someone comes in to apply for credit, you can look at the local history that you may have on that person, but that can only be so effective if this is a new customer or a customer that's purchased only one or two things from you. But if you aggregate together with all the stores in the nation, which is what credit agencies do, you can build a much more accurate profile. That was the approach we took with this system.
And at that point nobody else had ever done that yet.
Exactly right. The blacklists were out there but they were not really doing the analysis globally; they were distributing the information globally, and that was the difference.
How does reputation technology differ from a real-time blacklist?
It differs in a couple of ways. On the most basic level, a blacklist is just that: a list of malicious hosts, so it has no view into legitimate traffic. Blacklists usually have a pretty high false positive rate. I'll give you an example. Hotmail is one of the top spam senders out there. They send out a lot of mail and a decent percentage of it is malicious. Spammers relay frequently through Hotmail accounts that they're able to register automatically. Of course you don't want to block Hotmail because you know that there's a lot of legitimate content that is originating from it. Without manual intervention, a blacklist would have no way of knowing that Hotmail is legitimate. A reputation system, through the analysis of both legitimate and malicious traffic, would know that, and would know that it needs to assign a neutral reputation to a host like Hotmail. The second difference is how a blacklist is generated. Most of the blacklists out there work in a very simple fashion: people get spam, they submit it manually to a blacklist operator, and the operator puts the sender of that particular spam message on the list. The reaction time is fairly slow, so it's not real-time analysis of the traffic. And secondly, most blacklists suffer from the problem of delisting. Once you list a spam sender on the blacklist, how do you know when to delist them? Again you have no view into the sender's legitimate traffic, so you don't know when they actually stop sending the spam. If it's a compromised machine maybe it's already been cleaned up and it's now sending legitimate mail. And a reputation system would know that, because it's seen that traffic. A blacklist has no concept of that.
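To make the contrast concrete, here is a minimal sketch of the two approaches. It is only an illustration: the class, function names, traffic counts, and thresholds are all assumptions chosen for the example, not details of any real blacklist or of the system discussed in the interview.

```python
from dataclasses import dataclass


@dataclass
class SenderStats:
    """Hypothetical per-sender message counts seen across many gateways."""
    legit: int = 0
    spam: int = 0


def blacklist_verdict(stats: SenderStats) -> str:
    # A blacklist only records spam reports; legitimate volume is invisible to it.
    return "block" if stats.spam > 0 else "allow"


def reputation_verdict(stats: SenderStats,
                       block_above: float = 0.9,
                       trust_below: float = 0.1) -> str:
    # A reputation system weighs spam against legitimate traffic.
    # The two thresholds are illustrative assumptions.
    total = stats.legit + stats.spam
    if total == 0:
        return "neutral"  # no history yet
    spam_ratio = stats.spam / total
    if spam_ratio >= block_above:
        return "block"
    if spam_ratio <= trust_below:
        return "trusted"
    return "neutral"


# A large webmail provider: lots of spam, but far more legitimate mail.
webmail = SenderStats(legit=700_000, spam=300_000)
# A freshly compromised machine that has only ever sent spam.
bot = SenderStats(legit=0, spam=5_000)

print(blacklist_verdict(webmail), reputation_verdict(webmail))
print(blacklist_verdict(bot), reputation_verdict(bot))
```

In this toy model the blacklist blocks the webmail provider outright, while the ratio-based check lands on a neutral verdict for it and still blocks the bot.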
What were some of the biggest challenges in the early stages of development that you came across in creating the reputation system?
The challenges were really scale and real-time analysis. You're processing billions and billions of messages daily, responding to those queries in real time and doing all of this very intensive analysis on it. One of the problems that we've had to solve is how to do that in a redundant fashion, so that we had 100% uptime for the system, which I'm happy to say that we've had. Another is how to store all this data. A lot of the storage providers out there are very happy to have us as customers because we do spend quite a bit of time and money on lots and lots of storage, and software to analyze all those records in real time.
Is reputation-based security the next big thing in spam prevention?
I think it is the next big thing in security in general. If you look at how spam detection has evolved, reputation technology is certainly one of the breakthroughs in that field in that it allows you to very quickly and effectively identify most of the spam outbreaks due to what is known as the network effect. This is the ability to not just deploy technology on individual endpoints, but to aggregate that information on a global level, have a view into billions and billions of transactions that are happening on the Internet on a real-time basis, analyze those transactions and develop a reputation for all of the mail senders that are out in the wild. You can also apply the same principles to other types of network traffic. We will be announcing a new launch of the Sidewinder firewall next month that will include it on a network level. So we will now be assigning reputations to all of the machines out on the Internet that are trying to connect to our customers, or that our customers are trying to connect to, on a variety of different protocols.
How granular does it get? For example, is there a database floating around out there somewhere that says, if you get an email from Dan Blacharski it's okay to look at?
Yes. Initially when we developed this it was based on IP addresses that sent mail. Since then we've expanded it to a variety of different identities. So for example we are now assigning it based on the message content itself, which has a reputation. We're doing it on email addresses, so your email address will have a certain reputation. We're doing it on domains, we're doing it on URLs in the context of web security, and we're doing it on IP addresses in combinations with protocols for the network security we're doing at the firewall level. So an IP address may have one reputation as it's trying to send mail, it may have another reputation as it's trying to host a web site, and it may have a totally different reputation as it's trying to run a DNS server or an FTP server. The context in which the reputation has been calculated now matters a great deal. For example, a malicious host may be part of a botnet that's used for spamming but nothing else. And if you are the owner of that particular machine that's been compromised, you still want to go to web sites, and you still want to be allowed to go to Google or Amazon, even though you have this malicious malware that's present on your machine that's doing these nefarious things. So we want to make sure we block those bad things that are emanating from your machine, but still allow you to do other legitimate tasks out on the Internet.
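The idea that one IP address can carry different reputations in different contexts can be sketched by keying the score on the pair of address and protocol. This is a toy illustration only: the class name, the neutral default of 0.5, and the simple bad-to-total ratio are assumptions made for the example, not a description of how the real service computes its scores.

```python
from collections import defaultdict


class ContextualReputation:
    """Toy store that tracks behavior per (ip, protocol) pair."""

    def __init__(self) -> None:
        # (ip, protocol) -> [benign_count, malicious_count]
        self._counts = defaultdict(lambda: [0, 0])

    def observe(self, ip: str, protocol: str, malicious: bool) -> None:
        # Record one observed transaction for this identity in this context.
        self._counts[(ip, protocol)][1 if malicious else 0] += 1

    def risk(self, ip: str, protocol: str) -> float:
        # Fraction of malicious observations; 0.5 (neutral) when nothing is known.
        benign, malicious = self._counts.get((ip, protocol), (0, 0))
        total = benign + malicious
        return 0.5 if total == 0 else malicious / total


rep = ContextualReputation()
# A home PC (documentation address) that is part of a spam botnet...
for _ in range(50):
    rep.observe("192.0.2.10", "smtp", malicious=True)
# ...but whose owner still browses the web normally.
for _ in range(20):
    rep.observe("192.0.2.10", "http", malicious=False)

print(rep.risk("192.0.2.10", "smtp"))  # high risk as a mail sender
print(rep.risk("192.0.2.10", "http"))  # low risk as a web client
print(rep.risk("192.0.2.10", "dns"))   # unknown context, neutral
```

Keeping the contexts separate is what lets the mail from that machine be blocked while its owner's ordinary web traffic is left alone.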
How does the reputation system go about determining the reputation of so many millions of different senders and entities?
It's really based on real-time analysis of the traffic. Any time one of our devices receives a connection or generates a connection to an Internet host, it sends a query to the TrustedSource database. This database is distributed around the world. We have eight data centers around the world that are hosting the service and are synchronized with each other, so when you query one you'll get the same answer as when you query any other. All these queries essentially tell the system the activity of various hosts that are out on the Internet. So for example, if you send an email to me, a query will be performed on TrustedSource, and TrustedSource will immediately know that you are a mail sender. It has access to the historical database, going back to the beginning of the system, of how you send mail, who you send mail to, and whether you send other types of network traffic. Based on that historical data and the real-time information it's getting, it is essentially calculating a risk score: a profile for you of whether we can expect malicious or legitimate content to originate from you.
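The interview does not say how the historical data and the real-time observations are combined into a risk score. As one plausible illustration only, the sketch below blends a sender's previous score with the spam fraction seen in the latest window using an exponentially weighted average. The function name, the weight of 0.3, and the 0-to-1 scale are all assumptions for the example.

```python
def update_risk(previous_risk: float,
                spam_in_window: int,
                total_in_window: int,
                weight: float = 0.3) -> float:
    """Blend historical risk with the latest observations.

    previous_risk: score carried over from history, between 0 and 1.
    weight: how strongly the newest window counts (assumed value).
    """
    if total_in_window == 0:
        # No new traffic seen, so the historical score stands.
        return previous_risk
    observed = spam_in_window / total_in_window
    return (1 - weight) * previous_risk + weight * observed


# A sender with a good history suddenly starts sending mostly spam.
risk = 0.1
risk = update_risk(risk, spam_in_window=90, total_in_window=100)
print(round(risk, 3))
risk = update_risk(risk, spam_in_window=100, total_in_window=100)
print(round(risk, 3))
```

With this kind of update, a long clean history slows the climb but does not prevent it, and a cleaned-up machine drifts back down as legitimate traffic is observed again.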
So you have a huge database and collection points all around the world. Are some entities still slipping through the cracks?
Absolutely. No system is 100 percent effective, so certainly it will not prevent all spam from coming through, but we do pride ourselves in extremely high effectiveness levels. Our average effectiveness across the customer base is 99.8 or 99.9 percent, so very very little gets through. And one of the advantages of the system is not just in the high levels of effectiveness that it can provide but also the fact that you can reject a lot of this content at the connection level, so you can save the resources of your email gateway by rejecting those connections without having to accept the mail.
What about the possibility of false positives?
One of the unique things about a reputation system is not just its ability to identify the malicious content, but also its ability to identify the good senders and the good website hosts. And that can dramatically reduce your false positive rate because we know, for example, if you, Dan, have sent legitimate traffic previously. That basically lowers your risk profile because we know that we can expect to see legitimate content from you in the future. And because that behavioral analysis can be applied to both malicious and legitimate entities that are out on the Internet, we can provide extremely high levels of accuracy and greatly reduce the false positive rate that most antispam systems out there suffer from.
How has spam changed between those early days, when it first started to pop up, and today?
It has changed in a couple ways. In the early days we didn't really have to worry about botnets for example, and now they're a major plague on the Internet. It used to be that spammers were renting servers and using them continuously to send spam, and you didn't have to spend a lot of time and effort to detect those servers, it was fairly obvious what they were. Nowadays they're infecting anywhere on the order of 250,000 machines every single day around the world, using them for very short periods of time to send spam, and then allowing those machines to stay dormant for months before starting to use them again. So you have to react much more quickly, you have to worry about the fact that there are all those compromised machines out there that may be sending legitimate mail as well, so you have to treat false positives in a much more careful fashion than you used to. And of course the content has changed so much more as well. It used to be that spam was just text messages, and now we're seeing images, we're seeing videos, audio files being sent as attachments, so that it's not just the propagation method, but the delivery mechanism itself has changed drastically.
What is image based spam and does that present any special challenges to the reputation system?
Image based spam appeared on the scene about two years ago. That's when spammers realized that instead of sending a text based promotion message they can encode it as an image and send you that image as an attachment. They noticed that the effectiveness from the standpoint of the user still clicking and reacting to that spam hasn't really degraded. A lot of the filters that are out there trying to analyze that message content have failed miserably in trying to understand what that image is, and whether it represents malicious content or not. So on the reputation system front that really hasn't impacted us much because we're not worried about text, we're not worried about particular formats, we're really looking at various parameters of the message as a whole and applying the reputation to the patterns that are within that message. So whether it's image based spam or video based spam or audio based spam we don't really care. Now if you are a text based analysis filter that's running on the local gateway then you probably have seen your effectiveness go down dramatically because of this technique.
What's the most dangerous type of spam attack?
The most dangerous ones are the scams that are trying to steal your identity or steal your financial records. Nowadays we see the phishing attacks a lot that pretend to come from your bank but in reality are just trying to steal your credentials so that they can empty your bank account. Spam that sends out links to malware, so that they can compromise a system and steal all of the passwords located on your machine, is now very popular. These are the most dangerous. But really all the spam that's out there is creating a huge headache for a lot of organizations because over 90% of email now is spam, so they're able to saturate this very important and critical channel for communication, with all this junk. And unless you have good filters in place, a lot of it goes through and causes you to lose productivity as you're trying to delete the stuff, and that's in the best case scenario when you're not clicking on the links and getting compromised. And in the course of going through that mailbox and trying to find the 10 legitimate messages in 100 that you would see, you would misplace the legitimate message that may be out there that might be drowned out by all this junk.
Who are the biggest perpetrators? Is spam a big business, or is it just small timers trying to make a quick buck?
It varies, but it is believed that most spam is sent by about 200 top spammers. And they're present all over the world. Some of these individuals are located in the United States, and they've been prosecuted successfully, and some of them have even been forced to close up shop. A lot of the spammers are now operating out of Eastern Europe where law enforcement has not yet been able to reach them. One of the other things that has changed is that it is now an affiliate-based business. For example, there are affiliate networks for drugs where they provide you with order forms and an order processing system, and all they ask is that you send these emails on their behalf and draw customers in, and then they give you a percentage of the sales. So it is very easy for you now to set up your own business as a spammer with one of these affiliates. And all you have to do is compromise a couple thousand machines, deploy your own spam sending software, and you're in business. So the barriers to entry have been lowered dramatically.
I would think there would be easier ways to sell Viagra. Why do they keep doing it?
People do react and people do buy the stuff. One thing to keep in mind is that they're not actually selling Viagra. There have been a number of investigations to actually find out what you get, and typically what you get is a package from a factory in India, and when you do the analysis of the composition of that drug you find that it's nothing like Viagra. God knows what components they've actually used to produce that blue pill. So it's incredibly dangerous to buy that sort of stuff and consume it. Most of these things are completely fraudulent, so they're not just violating laws in terms of sending unsolicited mail, they are actually violating quite a few other laws as well as far as the delivery of the product goes.
We have these botnets with huge networks of zombie computers. How does a reputation system work within those?
I think that's where it really shines and can really provide the best protection because the way these botnets work is that a machine gets compromised and instantly used for malicious purposes, whether it's for sending spam or hosting malicious websites. They use it for a few hours literally, until these blacklists out there react and people report that there is abuse associated with that machine and they get shut down. Or they simply turn it off because it becomes less effective to use it. But really in those first few hours of attack, reputation systems are the only ones that can protect you and block that content that is originating from that machine quickly enough because they are able to react in real time.
What happens for example, if my computer gets hijacked into a botnet, and it gets used to send out spam? But I'm not a spammer, I'm a victim. Does the reputation system label me a spammer?
It depends. If you are a legitimate organization that is actually sending out email from that machine, so it's a legitimate mail server, the reputation system would be able to see that legitimate content and would not automatically lower your reputation down to the level of spammer. It will raise your risk level enough to make sure that all the email from this point forward gets scrutinized, but it will not block it outright. A blacklist may very well do that because it has no view into that legitimate mail. Now if, on the other hand, you're not sending any email from that machine, and are relaying email through your ISP for example, as is the common case, then your reputation will get adjusted to the spammer level and all the email traffic will get blocked from that machine.
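The two cases described in this answer can be written down as a small decision rule. This is a deliberately simplified sketch: the function name, the three action labels, and the use of raw message counts are assumptions for illustration, and a real system would presumably weigh far more signals.

```python
def action_for_host(legit_mail_seen: int, spam_seen: int) -> str:
    """Choose how to treat direct mail connections from a host."""
    if spam_seen == 0:
        # Nothing malicious observed from this host.
        return "accept"
    if legit_mail_seen > 0:
        # A real mail server that got compromised: do not block outright,
        # but send everything through deeper inspection.
        return "scrutinize"
    # A machine that never sent direct mail before and now emits spam,
    # e.g. a home PC that normally relays through its ISP.
    return "block"


print(action_for_host(legit_mail_seen=12_000, spam_seen=400))  # scrutinize
print(action_for_host(legit_mail_seen=0, spam_seen=400))       # block
```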
What are the spammers doing to try to get around the reputation system?
They really haven't been able to figure out how to do that. Their answer has been to try to get more machines, to try to send more mail through the ISP networks that are out there. They are trying to relay more mail through Gmail and Hotmail, which can't be blocked by the reputation system on the IP level because of course these systems send a lot of legitimate mail as well. But really the content-based reputation that we apply, the reputation of the links that are within that spam message, provides a great level of effectiveness even if the IP address that they're sending from is neutral.
Are there any privacy concerns?
Not really because we're selling the reputation service as part of our overall solution to the customer, so they don't have to buy it. By virtue of selecting us they allow us to do this. And also we're not reading their email. We're not looking at the content, we're only looking at this meta data about the email, about where it's coming from and where it's originating and how it is being sent, and there are no privacy concerns associated with that level of data.
What about remediation, if a site gets incorrectly labeled? Is there a process to get back on the good side of the reputation system?
Absolutely. We provide a variety of methods for customers to report false positives to us through automated means. We deliver for example, a desktop client, a toolbar that integrates directly into your Outlook or other email client, and with the click of a button, you can report to us either a spam that got through or a message that got misclassified. We also have a web site called trustedsource.org which is kind of unique in the industry, because it provides a free view into the reputation system, extending its reach beyond just our customer base so anyone can go onto that web site and put in an IP address or a URL and view the current and past history and reputation that we've assigned to that particular entity. No one else in the industry does that. And right then and there from that website, you can also send us an email to request a change in reputation if you think we've got it wrong.
What are some of the shortcomings and potential limitations of reputation security?
Just like any method, it is not foolproof, so it's not going to block all of the spam for you. On their own, reputation systems are usually about 90 percent effective, so they can reduce the amount of junk that your server has to process. You really want to apply, as with any security solution, the defense-in-depth approach, so you want to layer your security and apply various technologies in order to get the maximum degree of effectiveness. So the reputation system can block about 90 percent of the inbound spam and for the other 10 percent, you want to apply some of the local analysis technology that can get you to that 99.8 or 99.9 percent effectiveness.
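The arithmetic behind that layering is worth spelling out. Assuming, as a simplification, that the local analysis layer only sees what the reputation layer lets through and catches a fixed share of it, the sketch below works out how good that second layer has to be. The function names are made up for the example; the 90, 99.8, and 99.9 percent figures are the ones quoted above.

```python
def combined_effectiveness(reputation_rate: float, local_rate: float) -> float:
    # Spam gets through only if it slips past both layers.
    return 1 - (1 - reputation_rate) * (1 - local_rate)


def required_local_rate(reputation_rate: float, target: float) -> float:
    # Share of the leftover spam the local layer must catch to hit the target.
    return 1 - (1 - target) / (1 - reputation_rate)


for target in (0.998, 0.999):
    needed = required_local_rate(0.90, target)
    print(f"target {target:.1%}: local layer must catch {needed:.1%} of the remainder")
```

Under these assumptions, reaching 99.8 percent overall means the local filters must catch roughly 98 percent of the spam that the reputation layer did not reject, and reaching 99.9 percent means roughly 99 percent.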