Anonymizing the logs

Tom Owad
Joined: Dec 16 2003
Posts: 2460

Quote:

Hi,
I'm a grad student working with David Patterson at the RAD Lab at the CS
department at UC Berkeley. Our lab is interested in building more reliable
internet services. You can read more about our lab here:

* our wiki page: http://radlab.cs.berkeley.edu
* slashdot article: http://slashdot.org/articles/05/12/15/1426223.shtml

To do interesting research, it's extremely useful to have traces/logs from
real-world internet sites that we can use in our experiments. For example,
here's a paper where we used data from Ebates.com to evaluate new algorithms
for detecting and localizing failures in Internet services:
http://www.cs.berkeley.edu/~bodikp/publications/icac05.pdf

I'm wondering whether you have any logs from the Applefritter website that
you could make available to our research group. For example, web server
access logs are very useful for understanding the traffic to a site.

Of course, we could help you to anonymize the data in case it contains any
confidential information.

If you cannot provide such data, could you recommend other people or
websites that could make their logs available or simply forward this email
to them?

Thank you!
Peter

I'd like to help them out, but we obviously need to ensure first that we can anonymize the logs. So far, we're looking at replacing IP addresses with a hash, as well as user IDs in /user accesses and message IDs in /privatemessage. Anything else you'd like to see anonymized? BDub and I will be taking a closer look at the logs for other potentially identifying information, but we'd also like to get a few more opinions.
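
To make the proposal concrete, here is a rough sketch of what scrubbing one log line might look like. It is only an illustration, not what we will necessarily run: it assumes the client address is the first space-separated field of each line (as in Apache's common log format), and that the IDs appear as digits at the end of paths shaped like /user/42 or /privatemessage/view/7. The names SECRET_KEY, pseudonym and scrub_line are made up for this example.

```python
import hashlib
import hmac
import re

# Hypothetical secret; in practice it would be random and never published.
SECRET_KEY = b"replace-with-a-random-secret"

# Assumed URL shapes: /user/<digits> and /privatemessage/<words>/<digits>.
ID_IN_PATH = re.compile(r"(/(?:user|privatemessage)(?:/[a-z]+)*/)(\d+)")


def pseudonym(kind: str, value: str) -> str:
    """Return a short keyed hash so equal inputs map to equal tokens."""
    message = f"{kind}:{value}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:12]


def scrub_line(line: str) -> str:
    """Replace the leading client address and any numeric IDs in matching paths."""
    ip, sep, rest = line.partition(" ")
    rest = ID_IN_PATH.sub(lambda m: m.group(1) + pseudonym("id", m.group(2)), rest)
    return pseudonym("ip", ip) + sep + rest
```

Running scrub_line over every line of an access log would leave timestamps, status codes and the rest of each request untouched, so anything identifying in query strings or referrers would still need a separate look.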

__________________

Admin

moosemanmoo
Joined: Aug 17 2004
Posts: 686

I don't know what kind of logs you're talking about, but the kind of logs that my Apache server gives don't have a whole lot of information in them. What I think would be better than hashing the IPs is to resolve them to a hostname, then just omit everything but the last part of the hostname. For example, 24.14.2.91 might resolve to 24-14-2-91.hr.hr.cox.net, but you would only put cox.net. That is a fair amount of work for the nameservers, though, and it's probably not feasible.
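
A minimal sketch of that idea, in case it helps. The function names are made up, and keeping only the last two labels is a simplification that would give the wrong answer for domains like example.co.uk. The lookup itself uses a reverse DNS query per address, which is where the load on the nameservers would come from.

```python
import socket


def base_domain(hostname: str) -> str:
    """Keep only the last two labels, e.g. 'a.b.cox.net' -> 'cox.net'."""
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def ip_to_domain(ip: str) -> str:
    """Reverse-resolve an address and reduce it to its base domain."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        # No reverse record or lookup failure: record a placeholder instead.
        return "unresolved"
    return base_domain(hostname)
```

Caching the result for each distinct IP would cut the number of lookups to one per address rather than one per log line.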

__________________

Join the chat on irc.freenode.com, channel #applefritter.

Tom Owad
Joined: Dec 16 2003
Posts: 2460

Yes - Apache logs. Could you explain your reservations about hashing the IPs?

__________________

Admin

moosemanmoo
Joined: Aug 17 2004
Posts: 686

MD5 hashes can be easily cracked-- with the proper tools, someone could brute-force an MD5 hash in about a minute. That, and it would probably make formatting the data a little harder. Replacing the IPs with some repeated letter like x would work better from a security standpoint.

__________________

Join the chat on irc.freenode.com, channel #applefritter.

Dr. Webster
Joined: Dec 19 2003
Posts: 1687

I wouldn't give them the IPs in any form, but I don't think there would be a problem with giving them the domains. You could resolve them to comcast.net, cox.net, etc., or just truncate the IPs to xxx.xxx.---.--- (truncating the IP at the subnet level).
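
For the truncation option, something along these lines would do it. This is just a sketch: it only handles dotted IPv4 addresses, and truncate_ip and its keep_octets parameter are names invented for the example.

```python
def truncate_ip(ip: str, keep_octets: int = 2) -> str:
    """Keep the leading octets of an IPv4 address and mask the rest."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError(f"not a dotted IPv4 address: {ip!r}")
    masked = ["---"] * (4 - keep_octets)
    return ".".join(octets[:keep_octets] + masked)
```

With the default of two octets, 24.14.2.91 becomes 24.14.---.---, so every visitor in the same /16 block ends up looking identical in the logs.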

__________________

Applefritter Admin

BDub
Joined: Dec 20 2003
Posts: 706
If a salted hash is insufficient

I'm against xxx.xxx'ing out the IPs because they likely want to trace how users navigate around a website.

Alternatively, we could give each IP that shows up its own unique identifier (the first IP in the logs is replaced with "1", the second with "2"), with identical IPs sharing the same identifier. This would allow them to trace a user around the site without giving any clues to the user's identity.
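
Roughly, the numbering scheme could be as simple as the sketch below. The class name is made up, and it assumes the whole log is processed in one pass so the table of seen IPs stays in memory.

```python
class SequentialPseudonymizer:
    """Hand out "1", "2", ... in order of first appearance."""

    def __init__(self) -> None:
        self._ids: dict[str, str] = {}

    def id_for(self, ip: str) -> str:
        # A repeated IP gets the identifier it was given the first time.
        if ip not in self._ids:
            self._ids[ip] = str(len(self._ids) + 1)
        return self._ids[ip]
```

The identifiers carry nothing about the address itself, so the only thing linking them back to real IPs is the lookup table, which would never leave our hands.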

__________________

"There is going to be a future: let's chase it until it kills us." - Spider Robinson

BDub
Joined: Dec 20 2003
Posts: 706
Re: MD5 hashes can be easily cracked

moosemanmoo wrote:

MD5 hashes can be easily cracked-- with the proper tools, someone could brute-force an MD5 hash in about a minute. That, and it would probably make formatting the data a little harder. Replacing the IPs with some repeated letter like x would work better from a security standpoint.

First off, we'd obviously salt the hash. That's not even a question. Brute forcing becomes quite a bit harder given that situation. Could you quote a source for the 'in about a minute' comment?

We could also use a hash scheme other than MD5.
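
As a sketch of what I mean by salting, here is one way it could look using HMAC with SHA-256 instead of MD5. This is an assumption about how we might do it, not a settled plan, and make_ip_hasher is a name invented for the example. Since there are only about four billion IPv4 addresses, the whole scheme depends on the salt being random and staying secret.

```python
import hashlib
import hmac
import secrets


def make_ip_hasher(salt: bytes):
    """Return a function that maps an IP to a keyed SHA-256 digest."""

    def hash_ip(ip: str) -> str:
        return hmac.new(salt, ip.encode("ascii"), hashlib.sha256).hexdigest()

    return hash_ip


# One random salt per log release; it is never written into the output.
salt = secrets.token_bytes(32)
hash_ip = make_ip_hasher(salt)
```

The same IP always produces the same digest within one release, so navigation can still be traced, while someone without the salt cannot simply hash every possible address and compare.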

__________________

"There is going to be a future: let's chase it until it kills us." - Spider Robinson