Thursday, March 13, 2008

Using log data to help keep you safe



(Cross-posted from Official Google Blog)

We recently began two new series of posts. The first, which explains how we harness data for our users, started with this post. The second, focusing on how we secure information and how users can protect themselves online, began here. This post is the second installment in both series. - Ed.

We sometimes get questions about what Google does with server log data, which records how users interact with our services. We take great care in protecting this data, and while we've talked previously about some of the ways it can be useful, something we haven't covered yet is how it can help us make Google products safer for our users.

While the Internet on the whole is a safe place, and most of us will never fall victim to an attack, there are more than a few threats out there, and we do everything we can to help you stay a step ahead of them. Any information we can gather on how attacks are launched and propagated helps us do so.

That's where server log data comes in. We analyze logs for anomalies or other clues that might suggest malware or phishing attacks in our search results, attacks on our products and services, and other threats to our users. And because we have a reasonably significant data sample, with logs stretching back several months, we're able to perform aggregate, long-term analyses that can uncover new security threats, provide greater understanding of how previous threats impacted our users, and help us ensure that our threat detection and prevention measures are properly tuned.
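The aggregate analysis described above can be illustrated with a toy sketch. This is not Google's actual pipeline; the function, its parameters, and the threshold logic are all hypothetical, meant only to show the general idea of flagging clients whose query volume deviates sharply from a historical baseline:

```python
from collections import Counter

def flag_anomalous_clients(log_entries, baseline_rate, factor=10):
    """Flag clients whose query volume far exceeds a historical baseline.

    log_entries:   iterable of (client_id, query) pairs from one time window
    baseline_rate: typical per-client query count for a window of this size
    factor:        how many multiples of the baseline count as anomalous

    Purely illustrative; real systems use far richer signals than volume.
    """
    counts = Counter(client for client, _ in log_entries)
    return {client for client, n in counts.items()
            if n > factor * baseline_rate}

# Example: one automated client issues far more queries than the others.
logs = [("bot", "some query") for _ in range(500)]
logs += [("alice", "weather"), ("bob", "news")]
print(flag_anomalous_clients(logs, baseline_rate=5))  # {'bot'}
```

A worm like Santy, which issues thousands of near-identical queries from each infected host, stands out starkly under even this crude kind of aggregation.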

We can't share too much detail (we need to be careful not to provide too many clues on what we look for), but we can use historical examples to give you a better idea of how this kind of data can be useful. One good example is the Santy search worm, which first appeared in late 2004. Santy used combinations of search terms on Google to identify and then infect vulnerable web servers. Once a web server was infected, it became part of a botnet and started searching Google for more vulnerable servers. Spreading in this way, Santy quickly infected many thousands of web servers across the Internet.

As soon as Google recognized the attack, we began developing a series of tools to automatically generate "regular expressions" that could identify potential Santy queries and then block them from accessing Google.com or flag them for further attention. But because regular expressions like these can sometimes snag legitimate user queries too, we designed the tools so they'd test new expressions in our server log databases first, in order to determine how each one would affect actual user queries. If it turned out that a regular expression affected too many legitimate user queries, the tools would automatically adjust the expression, analyze its performance against the log data again, and then repeat the process as many times as necessary.
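The test-and-adjust loop described above can be sketched as follows. This is an illustrative reconstruction, not Google's actual tooling: the function name, the "tightening" strategy (simply anchoring the pattern to the start of the query), and the false-positive threshold are all assumptions made for the example. The `viewtopic.php` strings echo the phpBB queries Santy actually issued:

```python
import re

def refine_pattern(pattern, worm_queries, legit_queries, max_fp_rate=0.001):
    """Iteratively tighten a candidate blocking pattern until it still
    matches every known worm query but almost no legitimate traffic.

    Returns the refined pattern string, or None if no acceptable
    refinement is found. Hypothetical sketch of the general technique.
    """
    while True:
        regex = re.compile(pattern)
        # The pattern must keep catching every known worm query.
        if not all(regex.search(q) for q in worm_queries):
            return None
        # Measure collateral damage against a sample of legitimate queries.
        false_positives = sum(1 for q in legit_queries if regex.search(q))
        if false_positives / max(len(legit_queries), 1) <= max_fp_rate:
            return pattern  # acceptable false-positive rate
        # Too many legitimate queries hit: tighten by anchoring the match
        # to the start of the query, then re-test against the logs.
        if pattern.startswith("^"):
            return None  # no further tightening available in this toy
        pattern = "^" + pattern

# Worm queries lead with the exploit string; a legitimate query
# mentioning viewtopic.php mid-string should not be blocked.
worm = ["viewtopic.php?t=123", "viewtopic.php?t=999"]
legit = ["forum viewtopic.php tutorial", "cat pictures", "weather"]
print(refine_pattern(r"viewtopic\.php", worm, legit, max_fp_rate=0.1))
# '^viewtopic\\.php'
```

The key design point is the same one the post makes: each candidate expression is validated against real query logs before deployment, so the automation can trade off worm coverage against harm to legitimate users without a human in the loop.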

In this instance, having access to a good sample of log data meant we were able to refine one of our automated security processes, and the result was a more effective resolution of the problem. In other instances, the data has proven useful in minimizing certain security threats, or in preventing others completely. In the end, what this means is that whenever you use Google search, or Google Apps, or any of our other services, your interactions with those products help us learn more about security threats that could impact your online experience. And the better the data we have, the more effectively we can protect all our users.

3 comments:

Christopher Soghoian said...

Dr Provos,

You've made a reasonable argument for keeping the log data on the actual content requested from Google's servers.
(i.e. google.com/q=Free&Screensavers)

However, what you have failed to argue for is Google's shameful policy of keeping full IP addresses in the logs for 18 months.

I've read through your blog post, and some of your past academic papers, and I can come up with no solid argument for keeping such data. Your complex regex system does not need user IP addresses in order to do its job.

Personally, I think it's pretty sneaky for Google to get its engineers to take on a PR role and post about how great Google's log retention policies are. Neither you nor Dr. Whitten has yet covered the most important issue: Why do you need my IP for 18 months?

While I'm not surprised that lawyers or PR lackeys would spout this information, I am frankly quite disappointed that someone so respected in the security field would so willingly toe the company line.

NRI said...

I am glad that Google is keeping this log data. Unauthorized users may be able to sniff your password and log in to your account. Keeping a log file can help in legal matters involving privacy.

Ramiz said...

It is very comforting that Google is keeping the log data.

What I would like to hear from Google is that it will provide me a facility to get a report (whenever I request one) of all accesses to my documents, with timestamps: backup applications, administrator checks, my own access, and any unauthorized access, along with whether each access was authorized and, if so, by whom. Since this is my data, I want to know who accessed my documents, when, and under what authority.

I believe this is the ultimate compliance with privacy.