Against the backdrop of a growing need to safely share and handle personal data both within a company and across organizations, companies are increasingly turning to data anonymization and data pseudonymization techniques.
It is often recommended to perform basic data anonymization or pseudonymization before or after collecting sensitive data or moving it to a cloud environment where the data is distributed or analyzed. A variety of questions arise at this point. Should I anonymize or pseudonymize? Which technique better suits my use case? And what data anonymization techniques can I choose from? Whatever your choice, your primary goal is protecting sensitive information and safeguarding privacy.
Data pseudonymization techniques can reduce restrictions in the handling of personal data under the General Data Protection Regulation (GDPR). Data pseudonymization is somewhat easier to perform than its counterpart, data anonymization, and can alleviate some of your obligations as set out in GDPR. Data anonymization, by contrast, is more difficult to perform but adds an extra layer of security for your collected data. And sometimes it still makes sense to use personal data in its original form. So which strategy is best for you?
In what follows, you will find an overview of basic definitions of data pseudonymization and data anonymization, as well as a breakdown of essential data anonymization techniques.
Anonymized or pseudonymized data: which is better?
Typically, how you want to handle personal data or personally identifiable information (PII) depends on your use case.
What is personal data?
As laid out in Article 4(1) GDPR, “personal data” pertains to “any information relating to an identified or identifiable natural person (‘data subject’)”. According to this definition, an identifiable natural person is a person who can directly or indirectly be identified with the help of an identifier such as a name, location, an ID number, or an online identifier, as well as physical, physiological, economic, cultural or social features of that natural person.
Defining data pseudonymization
Pseudonymization is a procedure in which most identifying fields within a data record are replaced by one or more artificial identifiers, the so-called pseudonyms. There can be a single pseudonym for a cluster of replaced fields or one pseudonym per replaced field. In either case, the aim is to ensure non-attribution to an identifiable person.
Article 4(5) GDPR defines data pseudonymization as follows:
“… the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.”
Pseudonymization techniques help organizations meet some of their data protection obligations, especially with regard to the principle of “data minimization” and the principle of “storage limitation” as laid out in Article 5(1)(c) and Article 5(1)(e) GDPR.
However, according to the definition of personal data after pseudonymization as per Recital 26 GDPR,
“Personal data which have undergone pseudonymization, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.”
This means that personal data that has been pseudonymized still falls within the scope of the GDPR. Pseudonymized data falls under the definition of “personal data” because it still allows for indirect or remote re-identification. Companies that use data pseudonymization techniques may therefore still have to meet their obligations under GDPR.
Defining data anonymization
According to Recital 26 GDPR, anonymized data is outside of the scope of GDPR. Anonymized data does not fall under the definition of “personal data”.
Recital 26 GDPR defines anonymous data as follows:
“…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
Following the logic set out in this definition, data anonymization refers to the process of removing direct and indirect personal identifiers. This includes but is not limited to any units of information or information clusters that may lead to the identification of an individual, such as an address, telephone number, image, date of birth, etc. The process of data anonymization uses various de-identification techniques to make sure that individuals are no longer identifiable.
The goal of de-identification is to safeguard the confidentiality of the original data and to make sure that the identity of a person cannot be inferred from the anonymized data. Once this is achieved, the anonymized data does not fall within the scope of GDPR as it no longer counts as “personal data”.
One of the challenges here is to thoroughly assess the adequacy and scope of the many de-identification techniques available now. Another hurdle is the quality of the data itself. Some anonymization techniques may lead to a devaluation of the data and make it difficult for the data to be used in some scenarios. A tradeoff between data quality and level of de-identification is to be sought in such cases.
Choosing between pseudonymization and anonymization
If you handle personal data, the rule of thumb is to implement at least one data pseudonymization or anonymization technique to minimize both potential risks and the cost of compliance.
As implied above, there is a clear legal distinction between anonymized data and pseudonymized data based on their categorization as personal data. Anonymized data cannot be re-identified, at least not without disproportionate effort, whereas pseudonymized data may still allow for some direct, indirect, or remote forms of re-identification. Pseudonymization techniques do not remove all of the personally identifiable information. They only significantly reduce the number of relations by virtue of which the data can serve to identify a given data subject. Anonymized data, on the other hand, is stripped of any trace of the identity of a given data subject.
There is a wealth of advantages associated with anonymizing data. If your data has been anonymized, you will no longer be under the obligation to inform your anonymized users if a data breach occurs that in some way exposes their information. This is simply because there is no longer an actual link between the individual and the breached data.
Also, if you have opted for anonymizing your data, the obligation to comply with user rights and demands (the right to erasure, also known as the right to be forgotten, or requests for a full record of user data) will no longer apply. Further, applying certain pseudonymization or anonymization techniques may make it possible for you to transfer the data to other countries.
While there is no universal way to deal with anonymization, several simple considerations might help in deciding which approach may suit you best. Given the wealth of techniques that are available at the moment, it is recommended to seek a balance between the degree of risk involved in re-identification or exposing the personal data and the purpose for which you are using the data. Other factors that may influence your decision are the very nature of the data you are protecting and an analysis of your recipients.
What are the typical techniques for pseudonymizing and anonymizing data?
In all their forms, data pseudonymization and anonymization techniques seek to reduce the identifiability of the personal data in a given original data set to a level that does not exceed a pre-established risk threshold. Today, there is a variety of tools and programs, including open-source ones, for data pseudonymization and data anonymization. Below you will find a summary of the classic techniques:
Directory replacement
The directory replacement method involves making changes to the names of individuals within the data but maintaining consistent relations between other values. For example, you can use a postcode and an ID to identify an individual. In a separate location, you store the information that directly identifies the individual. The data is pseudonymized in this way. To anonymize, you delete the separately stored information that identifies the individual.
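As a rough illustration, directory replacement could be sketched in Python as follows. The function name `pseudonymize`, the `P-` prefix, and the sample records are hypothetical and chosen only for this example. The sketch replaces each name with a random pseudonym, reuses the same pseudonym for repeated names so that relations stay consistent, and returns the lookup directory separately.

```python
import secrets

def pseudonymize(records, id_field="name"):
    """Replace the identifying field with a random pseudonym.

    Returns the pseudonymized records and the lookup directory,
    which is meant to be stored separately from the data.
    """
    directory = {}  # pseudonym -> original value
    assigned = {}   # original value -> pseudonym (keeps relations consistent)
    output = []
    for record in records:
        original = record[id_field]
        if original not in assigned:
            pseudonym = "P-" + secrets.token_hex(4)
            while pseudonym in directory:  # retry on the unlikely collision
                pseudonym = "P-" + secrets.token_hex(4)
            assigned[original] = pseudonym
            directory[pseudonym] = original
        output.append({**record, id_field: assigned[original]})
    return output, directory

# Hypothetical sample data
records = [
    {"name": "Alice Example", "postcode": "10115", "amount": 40},
    {"name": "Bob Example", "postcode": "20095", "amount": 15},
    {"name": "Alice Example", "postcode": "10115", "amount": 25},
]
pseudonymized, directory = pseudonymize(records)
```

Keeping `directory` in a separate, protected location corresponds to the pseudonymized state described above; discarding it corresponds to the deletion step.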
Masking out
This technique allows you to hide part of the data by replacing it with random characters or other data. You can pseudonymize by masking identities or important identifiers, so that the data remains recognizable without exposing the actual identities. This technique is typical for billing scenarios; the most common example is the masking of credit card information, which is then displayed in the form XXXX XXXX XXXX 4321.
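The credit card example can be sketched in a few lines of Python. The function name `mask_card_number` and the sample number are assumptions made for this illustration; the sketch masks every digit except the last four and leaves separators untouched.

```python
def mask_card_number(card_number, visible=4, mask_char="X"):
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)

print(mask_card_number("1234 5678 8765 4321"))  # prints XXXX XXXX XXXX 4321
```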
Scrambling/Shuffling
Scrambling or shuffling simply entails the mixing of letters or digits in the personal data. For example, #458912 may become #298514. Ideally, the process is irreversible so that the original data cannot be retrieved from the scrambled data. The most popular methods here include cryptographic data scrambling, network security data scrambling, as well as the so-called nearest neighbor data substitution (NeNDS).
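A minimal Python sketch of simple digit shuffling might look like this. It is not an implementation of cryptographic scrambling or NeNDS; the function name `scramble_digits` is hypothetical, and the sketch only permutes the digits of a value while leaving other characters in place.

```python
import random

def scramble_digits(value, rng=None):
    """Shuffle the digits of `value`, leaving non-digit characters in place."""
    rng = rng or random.SystemRandom()
    digits = [ch for ch in value if ch.isdigit()]
    rng.shuffle(digits)
    shuffled = iter(digits)
    return "".join(next(shuffled) if ch.isdigit() else ch for ch in value)

scrambled = scramble_digits("#458912")
print(scrambled)  # the same digits in a random order, for example #298514
```

Note that a random shuffle can occasionally return the digits in their original order, so a real implementation may want to check for that case.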
Generalization
This technique has the purpose of reducing the granularity of the data. As a result, the data that is disclosed is less precise than the original data, which makes it difficult if not impossible to retrieve the exact values associated with an individual. For example, if you have a database containing the ages of certain types of patients, the exact ages of the individuals would be replaced with non-overlapping age groups, e.g. 55-64, 65-74, etc.
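The age example could be sketched in Python as follows. The function name `generalize_age`, the bin width of ten years, and the offset of five are assumptions for illustration; the bins are chosen so that no age falls into two groups.

```python
def generalize_age(age, width=10, offset=5):
    """Map an exact age to a non-overlapping range such as '55-64'."""
    lower = ((age - offset) // width) * width + offset
    return f"{lower}-{lower + width - 1}"

ages = [57, 64, 65, 70]
print([generalize_age(a) for a in ages])
# prints ['55-64', '55-64', '65-74', '65-74']
```

Wider bins give stronger protection but less precise data, which is the tradeoff between data quality and de-identification mentioned earlier.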
Blurring
Similar to generalization, this technique reduces the precision of the disclosed data to minimize the possibility of identification. As the name suggests, blurring uses an approximation of data values instead of the original identifiers, making it difficult to identify individuals with absolute certainty. For example, a natural person might be identifiable by an exact account balance at a point in time. Adding small random values to this balance does not introduce significant error into the data but provides anonymity for the affected person.
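The account balance example might be sketched like this in Python. The function name `blur_amount`, the noise bound of 5.0, and the sample balance are hypothetical; how much noise is acceptable depends on your data and use case.

```python
import random

def blur_amount(amount, max_noise=5.0, rng=None):
    """Add uniform random noise in [-max_noise, +max_noise] to an amount."""
    rng = rng or random.SystemRandom()
    return round(amount + rng.uniform(-max_noise, max_noise), 2)

balance = 10_482.37  # hypothetical exact balance
blurred = blur_amount(balance)
print(blurred)  # a value within 5.0 of the original balance
```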
Data encryption
This technique translates the personal data into another form or code so that the data that is deemed sensitive is replaced with data in an unreadable format. Authorized users have access to a secret key or a password that makes it possible for them to retrieve the data in its original form. Data encryption brings a variety of benefits when moving to the cloud. These include helping you meet regulatory requirements and providing safe harbor from breach notifications. It allows you to secure your remote locations and sets the ground for secure outsourcing and licensing. It can also prevent service providers from accessing or inadvertently exposing your data.
Substitution
As the name suggests, this technique allows you to replace the contents of a database column with data from a predefined list of fake data so that the data cannot be traced back to an identifiable individual. This technique has the advantage of keeping the integrity of the original information intact.
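As a small sketch, substitution of one column could look as follows in Python. The list `FAKE_NAMES`, the function name `substitute_column`, and the sample rows are invented for this example; in practice the replacement list would be much larger.

```python
import random

# Hypothetical predefined list of fake values
FAKE_NAMES = ["Alex Doe", "Sam Roe", "Kim Poe", "Pat Moe"]

def substitute_column(rows, column, replacements, rng=None):
    """Replace the values of one column with entries from a fake list."""
    rng = rng or random.SystemRandom()
    return [{**row, column: rng.choice(replacements)} for row in rows]

rows = [
    {"name": "Alice Example", "city": "Berlin"},
    {"name": "Bob Example", "city": "Hamburg"},
]
substituted = substitute_column(rows, "name", FAKE_NAMES)
```

The other columns and the shape of the data stay as they were, so the result can still be used for testing or analysis.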
Nulling out
In this case, the sensitive data is simply removed from the data set. All pieces of sensitive information, such as customer name, address, or age, become null values.
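In Python, nulling out can be sketched as replacing the sensitive fields with `None`. The function name `null_out` and the sample row are assumptions for illustration.

```python
def null_out(rows, sensitive_columns):
    """Set every sensitive column to None and keep all other values."""
    return [
        {key: (None if key in sensitive_columns else value)
         for key, value in row.items()}
        for row in rows
    ]

rows = [{"name": "Alice Example", "address": "1 Sample St", "age": 57, "plan": "basic"}]
print(null_out(rows, {"name", "address", "age"}))
# prints [{'name': None, 'address': None, 'age': None, 'plan': 'basic'}]
```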
Number and date variance
If you are dealing specifically with numeric and date columns, this anonymization technique may be an option: each value in a column is modified by a random percentage of its real value. In this way, the data is altered to such an extent that it can no longer be traced back to its original form.
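A possible sketch in Python is shown below. The function names `vary_number` and `vary_date` and the default bounds are hypothetical. For dates, this sketch shifts the value by a random number of days rather than by a percentage, which is an assumption about how the idea is usually applied to date columns.

```python
import random
from datetime import date, timedelta

def vary_number(value, max_percent=10.0, rng=None):
    """Change a number by a random percentage within +/- max_percent."""
    rng = rng or random.SystemRandom()
    percent = rng.uniform(-max_percent, max_percent)
    return value * (1 + percent / 100)

def vary_date(value, max_days=30, rng=None):
    """Shift a date by a random number of days within +/- max_days."""
    rng = rng or random.SystemRandom()
    return value + timedelta(days=rng.randint(-max_days, max_days))

salary = vary_number(52_000)             # stays within 10 % of 52,000
admitted = vary_date(date(2020, 1, 15))  # stays within 30 days of the original
```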
Custom anonymization
The method of custom or personalized anonymization is simply about creating and implementing your own anonymization technique or a combination of several techniques. You can do this by using scripts or an application.
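One way such a script could look is sketched below in Python. The function name `anonymize`, the rule set, and the sample row are invented for illustration; the idea is simply to assign one technique per column and leave columns without a rule unchanged.

```python
def anonymize(rows, rules):
    """Apply a per-column rule to each row; columns without a rule are kept."""
    keep = lambda value: value
    return [
        {column: rules.get(column, keep)(value) for column, value in row.items()}
        for row in rows
    ]

# Hypothetical combination of nulling out, masking, and generalization
rules = {
    "name": lambda value: None,
    "card": lambda value: "XXXX XXXX XXXX " + value[-4:],
    "age": lambda value: f"{(value // 10) * 10}-{(value // 10) * 10 + 9}",
}

rows = [{"name": "Alice Example", "card": "1234 5678 8765 4321", "age": 57, "city": "Berlin"}]
print(anonymize(rows, rules))
# prints [{'name': None, 'card': 'XXXX XXXX XXXX 4321', 'age': '50-59', 'city': 'Berlin'}]
```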
Depending on your use case, it may become necessary for you to pseudonymize or anonymize your data before moving it to the cloud for further processing and analysis. Whenever your data contains personally identifiable information (PII), you may have to take extra measures to ensure the safe handling of that data and protect the identity of the data subjects. This is especially the case when you are dealing with healthcare data. But transactional data, including customer or employee data, may also be affected.