What we think is reasonable, commonplace, or even possible in terms of protecting or violating online privacy shifts constantly. Recent developments in tools and techniques for tracking online behavior and identifying individuals from supposedly anonymized data sets should cause us to reevaluate what is possible.
Katherine McKinley of iSEC Partners published a detailed analysis of how popular browsers and browser extensions handle cookies and other methods of local data storage used for tracking users in her December 2008 paper Cleaning Up after Cookies (PDF). McKinley tested the ability of browsers and extensions to clear private data, as well as their “private browsing” features. She found that most browsers attempted to clear previously stored private data but often left some of it accessible. Adobe Flash did not even attempt to remove this data and in fact stored it in such a way that it circumvented most of the privacy protections browsers offer. iSEC Partners created an online version of the test used in the paper so that individuals can check their own configurations; it is available at Breadcrumbs Tracker.
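One way to see why Flash storage escapes browser cleanup is to look at where it lives on disk. The short Python sketch below scans the directories Flash Player has typically used for local shared objects (LSOs); the paths are approximations that varied across platforms and Flash versions, so treat this as illustrative rather than definitive:

```python
import os
from pathlib import Path

# Directories where Flash Player has typically stored local shared
# objects (LSOs). These paths are approximate and varied by platform
# and Flash version.
CANDIDATE_DIRS = [
    Path.home() / ".macromedia/Flash_Player/#SharedObjects",                         # Linux
    Path.home() / "Library/Preferences/Macromedia/Flash Player/#SharedObjects",      # macOS
    Path(os.environ.get("APPDATA", "")) / "Macromedia/Flash Player/#SharedObjects",  # Windows
]

def find_flash_cookies() -> None:
    """List every .sol file (Flash cookie) under the candidate directories."""
    for base in CANDIDATE_DIRS:
        if not base.is_dir():
            continue
        for sol in base.rglob("*.sol"):
            # The parent directory name usually identifies the site
            # that set the Flash cookie.
            print(f"{sol.parent.name}: {sol.name} ({sol.stat().st_size} bytes)")

if __name__ == "__main__":
    find_flash_cookies()
```

Because none of these files sit under the browser’s own profile directory, clearing the browser’s cookies never touches them.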
The August 2009 paper Flash Cookies and Privacy by Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle at UC Berkeley focuses directly on the privacy issues raised by Flash cookies. The authors surveyed the top 100 websites (as ranked by QuantCast in July 2009) and found that more than half of them used Flash cookies. Unlike standard HTTP cookies, Flash cookies have no expiration date and are stored in a separate, harder-to-find location on the file system. Most cookie management tools will not delete this type of cookie, and Flash cookies remain in place even when private browsing mode is enabled. The authors found that Flash cookies were frequently employed to track users who had explicitly attempted to prevent tracking: the Flash cookie was used to regenerate, or “respawn,” an HTTP cookie the user had deleted.
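The respawning behavior the authors observed boils down to a simple asymmetry: clearing browser cookies leaves the Flash LSO intact, so the old identifier can be restored on the next visit. The following Python sketch is a hypothetical reconstruction of that decision logic, not code from the paper:

```python
import uuid
from typing import Optional, Tuple

def decide_tracking_id(http_cookie_id: Optional[str],
                       flash_lso_id: Optional[str]) -> Tuple[str, bool]:
    """Return the tracking ID to use and whether the HTTP cookie must be reset.

    Clearing browser cookies leaves the Flash LSO intact, so the old
    identifier simply comes back ("respawns") on the next visit.
    """
    if http_cookie_id is not None:
        return http_cookie_id, False       # cookie survived: nothing to do
    if flash_lso_id is not None:
        return flash_lso_id, True          # cookie deleted: respawn the old ID
    return uuid.uuid4().hex, True          # genuinely new visitor: mint an ID

# A user who deleted cookies but still carries the Flash LSO gets the
# old identifier right back:
print(decide_tracking_id(None, "a81f3c9e"))  # ('a81f3c9e', True)
```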
Most online services use multiple tracking services for analytics, performance monitoring, and usability analysis. A mixture of JavaScript-based tracking code and cookies is the most common method of user tracking. The paper On the Leakage of Personally Identifiable Information Via Online Social Networks (PDF), presented at the ACM Workshop on Online Social Networks by Balachander Krishnamurthy and Craig Wills, describes the techniques advertising firms and social networking services use to track users and the types of information they release. The authors studied information leakage from twelve online social networks and found that the bulk of user information is released through HTTP headers and third-party cookies.
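To make the leakage mechanism concrete, consider what a third-party tracker embedded in a profile page receives: its own cookie plus a Referer header naming the profile being viewed. The Python sketch below uses hypothetical header values and parameter names (uid, id), not examples from the paper, to show how the two can be joined:

```python
from urllib.parse import urlparse, parse_qs

def link_tracker_to_identity(headers: dict) -> tuple:
    """Join a third-party tracking ID to a social network user ID."""
    # The tracker's own third-party cookie (hypothetical "uid" format).
    cookie = headers.get("Cookie", "")
    pairs = dict(p.split("=", 1) for p in cookie.split("; ") if "=" in p)
    tracker_id = pairs.get("uid")

    # The Referer header leaks which profile page triggered the request.
    referer = urlparse(headers.get("Referer", ""))
    profile_id = parse_qs(referer.query).get("id", [None])[0]

    return tracker_id, profile_id

# A request a tracker might receive when a browser renders a profile page:
headers = {
    "Referer": "http://social.example/profile.php?id=12345",
    "Cookie": "uid=a81f3c9e",
}
print(link_tracker_to_identity(headers))  # ('a81f3c9e', '12345')
```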
In his post Netflix’s Impending (But Still Avoidable) Multi-Million Dollar Privacy Blunder on the Freedom to Tinker blog, Paul Ohm discusses his 2009 publication Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization in the context of the announcement of the second Netflix Prize for improving the accuracy of Netflix’s predictions. Ohm argues that it is not possible to anonymize the data and that it is irresponsible, and possibly illegal, to release it. For the original contest, Netflix released half a million anonymized subscriber records for analysis, and the one million dollar prize drew large numbers of researchers into the competition.
Soon after Netflix released the records, researchers Arvind Narayanan and Vitaly Shmatikov showed that they could identify individual subscribers by combining the Netflix data with other databases. Their paper Robust De-anonymization of Large Sparse Datasets (PDF), presented at the 2008 IEEE Symposium on Security and Privacy, describes how to break the anonymity of the Netflix Prize dataset. Narayanan and Shmatikov continued their research and demonstrated de-anonymizing social networks such as Twitter in De-Anonymizing Social Networks (PDF) (paper FAQ), a paper presented at the 2009 IEEE Symposium on Security and Privacy. Ohm reminds readers of the scandal that occurred in 2006 when AOL researchers Greg Pass, Abdur Chowdhury, and Cayley Torgeson presented their paper A Picture of Search (PDF) at the first International Conference on Scalable Information Systems. The authors released the anonymized dataset they analyzed in the paper, which covered more than six hundred thousand AOL users; some of those users were subsequently identified.
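The core intuition behind the Narayanan–Shmatikov attack is that a handful of (movie, rating) pairs gleaned from public sources is enough to single out one record in a sparse dataset, especially when the movies are obscure. The sketch below is a drastically simplified Python rendering of that intuition, not their actual algorithm (which uses richer weighted scoring and tolerates noisy dates and ratings); the popularity weighting and eccentricity threshold here are illustrative assumptions:

```python
import math

def similarity(aux: dict, record: dict, popularity: dict) -> float:
    """Score how well auxiliary ratings match one candidate record.

    Matching on an obscure title is much stronger evidence than matching
    on a blockbuster, so rare movies are weighted more heavily.
    """
    return sum(1.0 / math.log(1 + popularity[movie])
               for movie, rating in aux.items()
               if record.get(movie) == rating)

def deanonymize(aux: dict, dataset: dict, popularity: dict,
                eccentricity: float = 1.5):
    """Return the best-matching record ID, or None when the best match
    does not stand out clearly enough from the runner-up."""
    scores = {rid: similarity(aux, rec, popularity) for rid, rec in dataset.items()}
    best, second = sorted(scores, key=scores.get, reverse=True)[:2]
    return best if scores[best] >= eccentricity * scores[second] else None

popularity = {"Obscure Film": 12, "Blockbuster": 90_000}  # viewer counts (made up)
dataset = {
    "user_17": {"Obscure Film": 4, "Blockbuster": 5},
    "user_42": {"Blockbuster": 5},
}
aux = {"Obscure Film": 4, "Blockbuster": 5}   # e.g., scraped from public reviews
print(deanonymize(aux, dataset, popularity))  # user_17
```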
Carnegie Mellon University professor Latanya Sweeney developed the foundation for much of the current work on de-anonymizing data sets. Her paper All the Data on All The People (only the abstract is publicly available), published in 2000, showed that it was possible to identify individuals in US Census data using only a few variables. The paper argues that almost 90% of the US population can be identified using only full date of birth, gender, and ZIP code.
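A quick back-of-envelope calculation shows why those three fields are so identifying. The figures below are rough approximations, but the conclusion is robust:

```python
# Rough estimate: how many (birth date, gender, ZIP) combinations exist
# relative to the US population? All figures are approximations.
zip_codes   = 42_000          # approximate number of US ZIP codes
birth_dates = 365 * 80        # roughly 80 years of plausible birth dates
genders     = 2

combinations = zip_codes * birth_dates * genders   # ~2.45 billion
population   = 280_000_000                         # US population circa 2000

print(f"{combinations:,} combinations for {population:,} people")
print(f"~{combinations / population:.0f} combinations per person")
```

With roughly nine combinations available per person, most combinations are occupied by at most one individual, so the triple usually pins a person down uniquely.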
Alessandro Acquisti and Ralph Gross (no relation) presented their research on Predicting Social Security Numbers from Public Data. The authors demonstrate that, for large portions of the population, it is possible to automate the prediction of an individual’s Social Security Number (SSN) using public information. The information used to make the predictions is easily harvested from social networking sites, voter registration records, and commercial databases. Acquisti and Gross argue that we must reconsider our policies around the use of SSNs, which are commonly used for authentication and frequently abused by identity thieves.
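The attack works because, before the Social Security Administration randomized assignment in 2011, the first five digits of an SSN followed a predictable, state-by-state sequence over time, and Enumeration at Birth tied assignment dates to birth dates. The authors calibrated that sequence using the public Death Master File; the Python sketch below is only a toy linear interpolation over invented anchor values, far cruder than the paper’s statistical method:

```python
from bisect import bisect_left
from datetime import date

# Hypothetical (birth date, first five SSN digits) anchor points for one
# state, of the kind that could be harvested from the public Death Master
# File. The values are invented for illustration.
anchors = [
    (date(1988, 1, 15), 21_340),
    (date(1988, 6, 2),  21_355),
    (date(1989, 2, 20), 21_372),
    (date(1989, 11, 8), 21_391),
]

def predict_first_five(birth: date) -> int:
    """Estimate the first five SSN digits by interpolating between anchors.

    Pre-2011 assignment was sequential within a state, and Enumeration at
    Birth tied assignment dates to birth dates, so nearby birth dates
    imply nearby numbers.
    """
    dates = [d for d, _ in anchors]
    i = min(max(bisect_left(dates, birth), 1), len(anchors) - 1)
    (d0, n0), (d1, n1) = anchors[i - 1], anchors[i]
    frac = (birth - d0).days / (d1 - d0).days
    return round(n0 + frac * (n1 - n0))

print(predict_first_five(date(1988, 9, 10)))  # estimated first five digits
```

Once the first five digits are pinned down, only the four-digit serial number remains, a search space small enough to worry about, which is part of the paper’s warning.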
* This article originally appeared as The State of User Tracking and the Impossibility of Anonymizing Data in my Messaging News “On Message” column.