I'm a grad student working with David Patterson at the RAD Lab at the CS
department at UC Berkeley. Our lab is interested in building more reliable
internet services. You can read more about our lab here:
* our wiki page: http://radlab.cs.berkeley.edu
* slashdot article: http://slashdot.org/articles/05/12/15/1426223.shtml
To do interesting research, it's extremely useful to have traces/logs from
real-world internet sites that we can use in our experiments. For example,
here's a paper where we used data from Ebates.com to evaluate new algorithms
for detecting and localizing failures in Internet services:
I'm wondering whether you have any logs from the Applefritter website that
you could make available to our research group. For example, web server
access logs are very useful for understanding the traffic to a site.
Of course, we could help you to anonymize the data in case it contains any
If you cannot provide such data, could you recommend other people or
websites that could make their logs available or simply forward this email
I'd like to help them out, but obviously need to ensure first that we can anonymize the logs. So far, we're looking at replacing IP addresses with a hash, and doing the same for User IDs in /user accesses and message IDs in /privatemessage. Anything else you'd like to see anonymized? BDub and I will be taking a closer look at the logs for other potentially identifying information, but we'd also like to get a few more opinions.
I don't know what kind of logs you're talking about, but the kind of logs that my Apache server gives don't have a whole lot of information in them. What I think would be better than hashing the IPs is to resolve them to a hostname, then just omit everything but the last part of the hostname. For example, 220.127.116.11 might resolve to 24-14-2-91.hr.hr.cox.net, but you would only put cox.net. That is a fair amount of work for the nameservers, though, and it's probably not feasible.
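For illustration, here is a rough Python sketch of the resolve-and-trim idea. The function names, the "unresolved" placeholder, and the choice to keep exactly two labels are assumptions of mine, not something anyone has settled on; two labels would be wrong for suffixes like .co.uk, and each lookup is a reverse DNS query, which is the nameserver load mentioned above.

```python
import socket


def trailing_domain(hostname, labels=2):
    """Keep only the last `labels` dot-separated parts of a hostname."""
    parts = hostname.rstrip(".").lower().split(".")
    return ".".join(parts[-labels:])


def anonymize_ip_by_domain(ip, placeholder="unresolved"):
    """Reverse-resolve an IP and return only its trailing domain.

    Falls back to `placeholder` (a hypothetical marker) when the
    address has no reverse DNS entry or the lookup fails.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return placeholder
    return trailing_domain(hostname)
```

With the hostname from the example, trailing_domain("24-14-2-91.hr.hr.cox.net") gives "cox.net".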
Yes - Apache logs. Could you explain your reservations about hashing the IPs?
MD5 hashes can be easily cracked; with the proper tools, someone could brute-force an MD5 hash in about a minute. That, and it would probably make formatting the data a little harder. Replacing the IPs with some repeated letter like x would work better from a security standpoint.
I wouldn't give them the IPs in any form, but I don't think there would be a problem with giving them the domains. You could resolve them to comcast.net, cox.net, etc., or just truncate the IPs to xxx.xxx.---.--- (truncating the IP at the subnet level).
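A minimal sketch of the truncation option in Python, assuming IPv4 addresses in dotted form. The function name and the "---" mask are just for illustration, and keeping two octets is one reading of "subnet level".

```python
def truncate_ip(ip, keep_octets=2, mask="---"):
    """Keep the first `keep_octets` octets of an IPv4 address, mask the rest."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address, got %r" % ip)
    if not 0 <= keep_octets <= 4:
        raise ValueError("keep_octets must be between 0 and 4")
    return ".".join(octets[:keep_octets] + [mask] * (4 - keep_octets))
```

For example, truncate_ip("24.14.2.91") returns "24.14.---.---". Note that every visitor in the same range collapses to the same value, so individual sessions can no longer be told apart.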
I'm against xxx.xxx'ing out the IPs because they likely want to trace how users navigate around a website.
Another possibility: we could give each IP that shows up its own unique identifier (the first IP in the logs is replaced with "1", the second with "2"), with repeat occurrences of the same IP reusing the same identifier. This would allow them to trace a user around the site without giving any clues to the user's identity.
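Roughly, the numbering scheme could look like this in Python. The class name is made up for the example, and it assumes the whole set of logs is processed in one pass so the mapping stays consistent.

```python
class SequentialPseudonymizer:
    """Replace each distinct IP with "1", "2", ... in order of first appearance."""

    def __init__(self):
        self._ids = {}  # original IP -> assigned identifier

    def pseudonym(self, ip):
        # A repeated IP gets the identifier it was assigned the first time.
        if ip not in self._ids:
            self._ids[ip] = str(len(self._ids) + 1)
        return self._ids[ip]


mapper = SequentialPseudonymizer()
ids = [mapper.pseudonym(ip) for ip in ["10.0.0.5", "10.0.0.9", "10.0.0.5"]]
# ids == ["1", "2", "1"]
```

The mapping table itself identifies everyone, so it would have to be thrown away (or kept private) once the logs are rewritten.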
First off, we'd obviously salt the hash. That's not even a question. Brute forcing becomes quite a bit harder given that situation. Could you quote a source for the 'in about a minute' comment?
We could also use a hash scheme other than MD5.
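As a sketch of what a salted, non-MD5 hash might look like, here is one way to do it in Python with HMAC-SHA256 from the standard library. That particular choice is my assumption, not something decided in this thread, and the function names are hypothetical. The salt matters because there are only about 4 billion IPv4 addresses, so an unsalted hash of any kind can be reversed by hashing every possible address.

```python
import hashlib
import hmac
import secrets


def make_salt(nbytes=32):
    """Generate a random secret salt; keep it private or discard it afterwards."""
    return secrets.token_bytes(nbytes)


def hash_ip(ip, salt):
    """Return a keyed SHA-256 digest of the IP as a hex string.

    The same IP with the same salt always gives the same digest, so a
    visitor can still be followed through the logs.
    """
    return hmac.new(salt, ip.encode("ascii"), hashlib.sha256).hexdigest()
```

The same function could be applied to the User IDs and message IDs, as long as one salt is used for the whole data set.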