In the current Age of Big Data, companies are constantly striving to figure out how to better use data at their disposal. And it seems that the only thing better than big data is more data. However, the data used is often personal in nature and thus linked to specific individuals and their personal details, traits, or preferences. In such cases, sharing and use of the data conflict with privacy laws and interests. A popular remedy applied to sidestep privacy-based concerns is to render the data no longer “private” by anonymizing it. Anonymization is achieved through a variety of statistical measures. Anonymized data, so it seems, can be sold, shared with researchers, or even possibly released to the general public.
Yet, the Age of Big Data has turned anonymization into a difficult task, as the risk of re-identification seems to be constantly looming. Re-identification is achieved by “attacking” the anonymous dataset, aided by the existence of vast datasets (or “auxiliary information”) from various other sources available to the potential attacker. It is, therefore, difficult to establish whether anonymization was achieved, whether privacy laws pertain to the dataset at hand, and if so, how. In a recent paper, Ira Rubinstein and Woodrow Hartzog examine this issue’s pressing policy and legal aspects. The paper does an excellent job in summarizing the way that the current academic debate in this field is unfolding. It describes recent failed and successful re-identification attempts and provides the reader with a crash course on the complicated statistical methods of de-identification and re-identification. Beyond that, it provides both theoretical insights and a clear roadmap for confronting challenges to properly releasing data.
The discussion on anonymization, or de-identification (the more precise term which the authors choose to apply, as it does not imply full anonymization), was once mostly of academic interest: Statisticians introduced ways to anonymize data, while mathematicians and computer scientists strove to prove re-identification “attacks” were nonetheless possible. Several successful re-identification attacks (perhaps the most famous one involved Netflix and IMDb) also led legal scholars to debate proper policy practices, as well as the broader implications of re-identification. However, this academic discussion is quickly crossing over into the world of practitioners. Recent policy papers published by regulators in the U.S., U.K., and the E.U. strive to create legal and normative guidelines for the manner in which personal information can be shared and released. In addition, corporations are turning to legal counsel for advice on using anonymization to mitigate potential liability.
In an age in which legal scholarship seems to be drifting away from legal practice, this paper demonstrates how both can be brought together. To a great extent, the knowledge conveyed in this paper is now essential for all legal practitioners advising clients with large databases. To demonstrate the relevance of this discussion, note a recent debate regarding the practices of Yodlee, an online financial tools provider, which has also emerged as a powerful financial-data aggregator. As recently reported by the Wall Street Journal, Yodlee sells information, gathered by facilitating consumer transactions, to investors and research firms. The WSJ claimed that Yodlee clients’ privacy is being compromised, and Yodlee responded by arguing that all personal information was properly handled and de-identified. It is safe to assume that similar stories involving other companies’ collecting, marketing, or de-identifying personal data are just around the corner.
Perhaps the central point that Rubinstein and Hartzog’s paper strives to articulate is that classifying personal data as either anonymous or identifiable is both incorrect and useless. With regard to anonymization, the authors further note that: “[a]lmost all uses of the term to describe the safety of data sets are misleading, and often they are deceptive. Focusing on the language of process and risk will better set expectations” (p. 4). In other words, anonymity (or rather, de-identification) is not an absolute term, but one indicating degrees on a scale, one that should be measured by the effort required to reveal the personal data and the likelihood that such a disclosure will occur. As the authors note, this latter notion has already been introduced (perhaps most famously by Paul Ohm). Rubinstein and Hartzog’s important contribution is to break this notion down into practical steps: formulating a proper data release policy as well as providing a full toolbox of measures to be applied in the process.
Beyond this important observation, the paper’s most substantial analytical contribution is to link appropriate data release policies with the notion of data security. The relationship, as explained by the authors, is based on these concepts’ mutual need to meet a specific standard of care in the process, rather than to be judged by the outcome. The authors also explain that context matters, and list various parameters and attributes of the data release process that should be considered when formulating a release policy (p. 32). In addition, they demonstrate that an integral part of a release policy is the technical measures applied when distributing and sharing the information. In doing so, they note that the Release-and-Forget Model of data sharing (in which, for example, a de-identified database is merely made available over the internet) is most likely obsolete (p. 36); all data release schemes must include unique measures (technological, contractual, or both) which strive to limit re-identification by potential attackers.
Beyond the rich policy discussion the authors provide in comparing and equating security policy to data release policy, several additional theoretical questions (with practical implications) come to mind and are worthy of future discussion: Is a regulatory response similarly necessary in the security and data release contexts? While companies usually under-invest in security (given, among other factors, the negative externalities of security breaches), there have been instances in which corporate motivation to enhance security was close to sufficient, especially in view of market pressures and the reputational costs of breaches. In many cases, companies’ and clients’ interests in maintaining security are aligned. More often, though, corporations’ and clients’ interests regarding data releases directly conflict. Corporations are interested in capitalizing on their data, whereas consumers do not necessarily share corporate enthusiasm for sharing their de-identified personal information, as they are not likely to benefit from or be compensated for this additional revenue stream. For this and other reasons, the security-release policy comparison has its limits; data release policies might call for stricter rules and enforcement mechanisms.
In addition, it would be interesting to consider the role insurance could play in the process of data release, an issue also currently emerging in the context of data security. An active insurance market might indeed facilitate the shift from outcome- to process-based liability without the need to change the regulatory framework. Therefore, the change the authors advocate here might be just around the corner. Insurers could, for instance, limit indemnification to those companies that follow acceptable data-release policies (yet nonetheless cause harms to third parties). Yet, relying on insurance markets may not be a safe bet. In this specific context, insurance markets face several difficulties, which merit further discussion. The comparison to data security can prove illuminating here as well.