Data anonymization in Big Data scenarios: an open challenge to become GDPR compliant

If we think about the information that large companies such as Facebook or Amazon handle daily, one concept will probably come to mind: Big Data. With the increase in computational power, the decrease in storage costs, and recent developments in fields such as cloud computing and the Internet of Things, Big Data has acquired special importance in recent years.

Big Data refers to large volumes of complex data that have to be processed at high speed. From this definition, we can extract the three main properties of Big Data, also known as the 3 Vs: volume, variety and velocity. Big Data has enabled the development of fields such as Machine Learning, Deep Learning and Data Science. These technologies have a direct impact on everyday services in areas like healthcare, self-driving cars and finance.

However, even though Big Data has had a great impact on the world and brought multiple benefits, it has also opened new challenges. The introduction of Big Data played an important role in the growth of concerns regarding data privacy, that is, how personal data is collected and handled, and for which purposes. These concerns led to new regulations on data protection and privacy, such as the General Data Protection Regulation (GDPR), which became applicable in 2018. The GDPR introduced new requirements for processing personal data, and companies and researchers had to adapt to become compliant.

One of the mechanisms to improve data privacy for individuals is data anonymization. When performed correctly, anonymization makes it impossible to single out a particular entry in a dataset. Anonymized data is no longer considered personal data and therefore falls outside the scope of the GDPR. However, the GDPR sets a high standard for data to be considered truly anonymous. An effective anonymization solution must fulfil the following criteria, preventing a third party from:

Singling out an individual in a dataset (i.e., isolating some records that could point out the identity of the person).
Linking two records within a dataset that belong to the same person, even though the person's identity remains unknown.
Inferring any information in such a dataset by deducing the value of an attribute from another set of attributes.

Challenges of Big Data anonymization

Designing an effective anonymization solution becomes especially complex in Big Data scenarios because of the three properties described above. First of all, in order to choose the right anonymization approach, it is important to analyze the data thoroughly. This prior analysis aims to identify the personal information, as well as other types of information that could be used to identify an individual. As the data volume increases, this analysis becomes more costly and challenging.

In addition, anonymous data has to guarantee that an individual cannot be singled out when the dataset is linked with other available information. It has been shown that just removing the direct identifiers (name, ID, etc.) from a dataset is not enough to preserve privacy, as an attacker could have access to additional information that allows the re-identification of individuals in the dataset. As an example, the combination of a 5-digit zip code, birthdate and gender can uniquely identify around 80 per cent of the population in the United States. Attributes like these, which are not identifiers on their own but become identifying in combination, are known as quasi-identifiers. Therefore, it is crucial to identify all the data that could potentially be used to identify an individual, taking into account the external information that an attacker might have from other sources.
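To make the point about indirect identifiers concrete, the following minimal Python sketch measures how many records in a small table are unique on a chosen combination of attributes. The records, the field names and the helper unique_fraction are all hypothetical and serve only as an illustration; a real analysis would run on the actual dataset and on a much larger scale.

```python
from collections import Counter

# Hypothetical toy records: direct identifiers are already removed,
# but zip code, birthdate and gender are still present.
RECORDS = [
    {"zip": "36310", "birthdate": "1985-02-11", "gender": "F", "diagnosis": "asthma"},
    {"zip": "36310", "birthdate": "1985-02-11", "gender": "F", "diagnosis": "flu"},
    {"zip": "36310", "birthdate": "1990-07-30", "gender": "M", "diagnosis": "diabetes"},
    {"zip": "15001", "birthdate": "1972-12-05", "gender": "F", "diagnosis": "flu"},
]


def unique_fraction(records, attributes):
    """Return the share of records whose combination of `attributes` is unique."""
    if not records:
        return 0.0
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


if __name__ == "__main__":
    # Two of the four records are unique on the full combination.
    print(unique_fraction(RECORDS, ["zip", "birthdate", "gender"]))  # 0.5
    # Gender alone singles out only one record.
    print(unique_fraction(RECORDS, ["gender"]))  # 0.25
```

A high value for a given combination of attributes suggests that those attributes need to be generalized or suppressed before release, even though none of them is a direct identifier.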

Furthermore, the growth of Big Data has increased the amount of publicly available information that can be cross-referenced to re-identify users. One example is the Netflix case: Arvind Narayanan and Vitaly Shmatikov of the University of Texas at Austin were able to re-identify users in a supposedly anonymous Netflix dataset of movie ratings by cross-referencing it with publicly available IMDb data.
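The idea behind this kind of cross-referencing can be sketched as a simple join between a released dataset and an auxiliary public one. The sketch below is a deliberately simplified, exact-match illustration with made-up data and hypothetical field names (pseudonym, name, zip, birth_year); it is not the method used in the Netflix study, which relied on approximate matching of ratings and dates.

```python
def link_records(released, auxiliary, keys):
    """Return (pseudonym, name) pairs for released records that match
    exactly one auxiliary record on all attributes in `keys`."""
    # Index the auxiliary (public) data by the shared attributes.
    index = {}
    for aux in auxiliary:
        index.setdefault(tuple(aux[k] for k in keys), []).append(aux)

    matches = []
    for record in released:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Only an unambiguous match counts as a re-identification here.
        if len(candidates) == 1:
            matches.append((record["pseudonym"], candidates[0]["name"]))
    return matches


# Hypothetical "anonymized" release: names replaced by pseudonyms.
RELEASED = [
    {"pseudonym": "u1", "zip": "36310", "birth_year": 1985},
    {"pseudonym": "u2", "zip": "15001", "birth_year": 1972},
    {"pseudonym": "u3", "zip": "28001", "birth_year": 1990},
]

# Hypothetical public dataset that still contains names.
AUXILIARY = [
    {"name": "Alice", "zip": "36310", "birth_year": 1985},
    {"name": "Bob", "zip": "15001", "birth_year": 1972},
    {"name": "Carol", "zip": "15001", "birth_year": 1972},
    {"name": "Dave", "zip": "08001", "birth_year": 2000},
]

if __name__ == "__main__":
    print(link_records(RELEASED, AUXILIARY, ["zip", "birth_year"]))
    # [('u1', 'Alice')]
```

In this toy case only u1 is re-identified: u2 matches two people and u3 matches nobody. The more auxiliary data is available, the more records tend to fall into the first category.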

Moreover, determining the re-identification risk of a dataset (the probability of identifying a particular individual) is crucial to verify whether the dataset is properly anonymized. In Big Data scenarios, evaluating this risk is computationally complex, again because of the volume, variety and velocity of the data. The same problem arises when calculating the utility of the dataset, that is, how useful the data remains for analysis after anonymization. In addition, most of the existing anonymization solutions are designed for homogeneous data, whereas the variety property of Big Data implies that the data is usually heterogeneous.
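One simple way to estimate re-identification risk, shown here only as an assumption-laden sketch, is to group records by their quasi-identifier values and treat the risk of each record as one divided by the size of its group. The smallest group size is the k of k-anonymity. The function name, the field names and the generalized values below are hypothetical, and other risk models exist.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Estimate risk from equivalence classes (records sharing the same
    quasi-identifier values), assuming the attacker knows those values."""
    classes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    if not classes:
        raise ValueError("cannot estimate risk for an empty dataset")
    total = sum(classes.values())
    k = min(classes.values())  # size of the smallest equivalence class
    return {
        "k": k,
        "max_risk": 1 / k,  # worst-case record
        "avg_risk": len(classes) / total,  # mean of 1/class_size over records
    }


# Hypothetical records after generalizing zip code and age.
GENERALIZED = [
    {"zip": "363**", "age": "30-39"},
    {"zip": "363**", "age": "30-39"},
    {"zip": "363**", "age": "30-39"},
    {"zip": "150**", "age": "40-49"},
    {"zip": "150**", "age": "40-49"},
]

if __name__ == "__main__":
    print(reidentification_risk(GENERALIZED, ["zip", "age"]))
    # {'k': 2, 'max_risk': 0.5, 'avg_risk': 0.4}
```

Even this basic metric needs a full pass over the data and one counter per distinct combination of quasi-identifiers, which hints at why the computation becomes expensive as the volume and the number of attributes grow.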

Finally, the velocity property implies that the data needs to be processed at high speed, so it is usually processed in real time. This makes it even more difficult to analyze the data in order to select the best anonymization strategy or to calculate privacy and utility metrics, as the data is incomplete at processing time. Some solutions have been developed to address this issue, but there are still open challenges, such as reducing the time and space complexity of the existing algorithms.
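The difficulty of working with incomplete data can be illustrated with a small streaming sketch: records are held back until at least k of them share the same quasi-identifier values, and only then released. This is a hypothetical, minimal example written for this article's argument; the class name StreamingKBuffer and its interface are assumptions, and it does not describe any existing tool or published algorithm.

```python
from collections import defaultdict


class StreamingKBuffer:
    """Hold back streamed records until k of them share the same
    quasi-identifier values, then release the whole group."""

    def __init__(self, k, quasi_identifiers):
        self.k = k
        self.quasi_identifiers = quasi_identifiers
        self.pending = defaultdict(list)  # groups still smaller than k
        self.released_keys = set()  # groups that already reached k

    def push(self, record):
        """Add one record and return the list of records safe to emit now."""
        key = tuple(record[q] for q in self.quasi_identifiers)
        if key in self.released_keys:
            # The group already has at least k members: emit immediately.
            return [record]
        bucket = self.pending[key]
        bucket.append(record)
        if len(bucket) >= self.k:
            self.released_keys.add(key)
            return self.pending.pop(key)
        return []  # not enough similar records seen yet


if __name__ == "__main__":
    buffer = StreamingKBuffer(k=2, quasi_identifiers=["zip"])
    for rec in [{"zip": "363**", "id": 1}, {"zip": "150**", "id": 2},
                {"zip": "363**", "id": 3}, {"zip": "363**", "id": 4}]:
        print(rec["id"], "->", [r["id"] for r in buffer.push(rec)])
```

The sketch also exposes the trade-offs mentioned above: records with rare quasi-identifier values may wait indefinitely, and the buffer of pending groups grows without bound unless records are expired or further generalized.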

In conclusion, the growth of Big Data has had a big impact on data privacy, as its inherent properties make the preservation of privacy difficult. There is a clear need for new, more efficient and scalable anonymization algorithms and solutions.

Author: Sara El Kortbi Martínez, researcher-engineer in the Security & Privacy department at Gradiant

Gradiant is currently taking part in the H2020 project INFINITECH (Grant Agreement 856632), researching different anonymization algorithms. Our goal is to develop an anonymization tool that would help automate the anonymization process, as well as facilitate Big Data streaming anonymization.