What is Pseudonymization | Safeguarding Data with Fictional IDs | Imperva

Pseudonymization

What is Pseudonymization?

Pseudonymization is a security technique that protects sensitive data by replacing identifying values with fictional ones. As Article 4(5) of the GDPR outlines, the result is that the information can no longer be attributed to a specific individual without additional data.

The primary purpose of pseudonymization is to protect personal data while preserving referential integrity and statistical accuracy, so that business processes, development and testing systems, training programs, and data analysis can continue to operate normally.

This technique is applicable in scenarios requiring realistic data, such as application development and testing environments, data warehousing, analytical data stores, training programs, and other business processes. Pseudonymization can also be used when exporting data to non-EU/EEA countries.

By implementing pseudonymization, organizations can enhance data security and privacy, ensuring compliance with relevant regulations and protecting the confidentiality of sensitive information.

What is a Pseudonym?

A pseudonym, often referred to as an alias or pen name, is a fictitious name used instead of a person’s real name for various purposes. In the context of pseudonymization, a pseudonym serves as a stand-in for the identifiable data of an individual.

This pseudonym is generated so that it cannot be associated with a specific individual without the use of additional information. While pseudonyms help mask individuals’ identities, they do not provide complete anonymity. If the additional data, often called “the master key,” is accessed, the pseudonym can be traced back to the original individual, thereby revealing their identity.

Using pseudonyms allows organizations to safeguard user privacy while maintaining their data’s functional usability. For example, in a dataset, a user’s name could be replaced with a pseudonym, preventing direct identification while still allowing the data to be used meaningfully.

How Does Pseudonymization Work?

Pseudonymization works by replacing identifiable data with artificial identifiers, or pseudonyms. The process does not erase all identifiable information, but it ensures that the resulting data cannot be linked to individuals without additional information. The key information that maps pseudonyms to original data is kept separate and secure, often protected by strong encryption.

The pseudonymization process typically involves the following steps:

  1. Identification of the data fields containing personally identifiable information (PII).
  2. Application of pseudonymization algorithms to replace the PII with fictitious, but realistic, data.
  3. Storage of the mapping between original data and pseudonyms in a secure location.
  4. Use of pseudonymized data in place of original data for regular operations.

This process ensures that the data remains valid for analysis and business processes while significantly reducing the risk of data theft or misuse. It is important to note that pseudonymization is reversible, provided separate key information is available, distinguishing it from irreversible anonymization.
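
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the function names (`new_pseudonym`, `pseudonymize`, `reidentify`), the field names, and the six-character random codes are all hypothetical, and the mapping is held in an in-memory dict where a real system would use a separate, access-controlled store. The sketch also uses random codes rather than realistic-looking replacement values.

```python
import secrets

PII_FIELDS = ("name", "address")  # step 1: fields treated as PII (assumed for this sketch)


def new_pseudonym(length=6):
    """Return a random uppercase-alphanumeric identifier."""
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return "".join(secrets.choice(alphabet) for _ in range(length))


def pseudonymize(records, pii_fields=PII_FIELDS):
    """Steps 2 and 3: replace PII values and build the mapping.

    Returns (pseudonymized_records, mapping). The mapping is the part that
    must be stored separately; here it is simply a second dict.
    """
    forward = {}   # original value -> pseudonym (keeps repeated values consistent)
    mapping = {}   # pseudonym -> original value (the "additional information")
    output = []
    for record in records:
        masked = dict(record)
        for field in pii_fields:
            original = record[field]
            if original not in forward:
                pseudonym = new_pseudonym()
                while pseudonym in mapping:  # retry on the rare collision
                    pseudonym = new_pseudonym()
                forward[original] = pseudonym
                mapping[pseudonym] = original
            masked[field] = forward[original]
        output.append(masked)
    return output, mapping


def reidentify(record, mapping, pii_fields=PII_FIELDS):
    """Reverse the process for one record; only possible with the mapping."""
    restored = dict(record)
    for field in pii_fields:
        restored[field] = mapping[record[field]]
    return restored


patients = [
    {"id": 1, "name": "John Doe", "address": "123 Main Street", "diagnosis": "Hypertension"},
    {"id": 2, "name": "Jane Smith", "address": "456 Maple Avenue", "diagnosis": "Diabetes"},
]

# Step 4: downstream systems work with masked_patients, never with patients.
masked_patients, mapping = pseudonymize(patients)
```

Because the same original value always receives the same pseudonym, records that refer to the same person still line up after masking, which is what keeps referential integrity intact. Calling `reidentify(masked_patients[0], mapping)` returns the original record, which illustrates why the mapping has to be guarded.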

Pseudonymization Example

Let’s consider the following example to illustrate how pseudonymization works in a database. For this example, we will use a hypothetical database of a healthcare organization.

Original Database (Table 1)

| Patient ID | Name | Address | Diagnosis |
| --- | --- | --- | --- |
| 1 | John Doe | 123 Main Street | Hypertension |
| 2 | Jane Smith | 456 Maple Avenue | Diabetes |
| 3 | Sam Lee | 789 Elm Drive | Asthma |

First, the fields containing personally identifiable information (PII) are identified. In this example, these are the ‘Name’ and ‘Address’ fields. Their values are then replaced using a pseudonymization algorithm.

Pseudonymized Database (Table 2)

| Patient ID | Name | Address | Diagnosis |
| --- | --- | --- | --- |
| 1 | XH54K1 | AD34Z9 | Hypertension |
| 2 | RG78P2 | FG16B7 | Diabetes |
| 3 | UI23N6 | KO89V5 | Asthma |

The mapping between the original data and pseudonyms is stored separately and securely.

Mapping Database (Table 3)

| Pseudonym | Original Data |
| --- | --- |
| XH54K1 | John Doe |
| RG78P2 | Jane Smith |
| UI23N6 | Sam Lee |
| AD34Z9 | 123 Main Street |
| FG16B7 | 456 Maple Avenue |
| KO89V5 | 789 Elm Drive |

This way, the data remains useful for health analysis and other operations while ensuring data privacy and mitigating the risk of data theft or misuse.
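
As a rough illustration of that last point, the snippet below loads Tables 2 and 3 as plain Python structures. The values are copied from the tables; the variable names and the `lookup` helper are hypothetical. It shows that an analysis of diagnoses needs only the pseudonymized table, while recovering a name requires the separately held mapping.

```python
from collections import Counter

# Table 2: the pseudonymized records that analysts work with.
pseudonymized = [
    {"patient_id": 1, "name": "XH54K1", "address": "AD34Z9", "diagnosis": "Hypertension"},
    {"patient_id": 2, "name": "RG78P2", "address": "FG16B7", "diagnosis": "Diabetes"},
    {"patient_id": 3, "name": "UI23N6", "address": "KO89V5", "diagnosis": "Asthma"},
]

# Table 3: the mapping, which would be stored separately and securely.
mapping = {
    "XH54K1": "John Doe",
    "RG78P2": "Jane Smith",
    "UI23N6": "Sam Lee",
    "AD34Z9": "123 Main Street",
    "FG16B7": "456 Maple Avenue",
    "KO89V5": "789 Elm Drive",
}

# Analysis works on the pseudonymized table alone.
diagnosis_counts = Counter(row["diagnosis"] for row in pseudonymized)


def lookup(pseudonym, mapping_table):
    """Resolve a pseudonym, or return None if the mapping does not contain it."""
    return mapping_table.get(pseudonym)


# Re-identification succeeds only for a caller that holds the mapping.
patient_name = lookup(pseudonymized[0]["name"], mapping)
without_mapping = lookup(pseudonymized[0]["name"], {})
```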

Does the GDPR Require Pseudonymization?

Under the General Data Protection Regulation (GDPR), pseudonymization is not strictly required but highly recommended. The GDPR encourages the implementation of pseudonymization as a method of data protection. Article 4(5) of the GDPR defines pseudonymization as the processing of personal data so that the data can no longer be attributed to a specific subject without additional information.

Moreover, Recital 28 of the GDPR states that applying pseudonymization to personal data can reduce risks to data subjects and help controllers and processors meet their data protection obligations. Therefore, while pseudonymization is not mandatory, its use is incentivized as a proactive measure to enhance data security, minimize risks, and ensure compliance with GDPR’s data processing principles.

Pseudonymization vs. Anonymization

While pseudonymization and anonymization are both data protection techniques, they serve different purposes and offer varying degrees of security.

Understanding the Concept

Pseudonymization is a data protection technique in which the fields containing PII in a data record are replaced by one or more artificial identifiers, or pseudonyms. A single pseudonym may stand in for one field or for a group of related fields in the same record. The process is reversible for anyone who holds the mapping. The tables below show data before and after pseudonymization.

Original Data

| Employee ID | Name | Email |
| --- | --- | --- |
| 1 | John Doe | john.doe@example.com |
| 2 | Jane Smith | jane.smith@example.com |

Pseudonymized Data

| Employee ID | Name | Email |
| --- | --- | --- |
| 1 | X9T1 | X9T1@mail |
| 2 | Y4G2 | Y4G2@mail |

Anonymization, on the other hand, involves removing or irreversibly altering personally identifiable information so that individuals can no longer be identified. Unlike pseudonymization, anonymization is irreversible: once data is anonymized, it cannot be traced back to the original data.

Original Data

| Employee ID | Name | Email |
| --- | --- | --- |
| 1 | John Doe | john.doe@example.com |
| 2 | Jane Smith | jane.smith@example.com |

Anonymized Data

| Employee ID | Name | Email |
| --- | --- | --- |
| 1 | null | null |
| 2 | null | null |
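
The contrast between the two techniques can be sketched in Python using the employee rows from the tables above. The function names (`anonymize`, `pseudonymize`, `restore`) are hypothetical, and the pseudonyms X9T1 and Y4G2 are taken from the example rather than generated. Treat this as an illustration of reversibility only; it follows the tables as shown, although in practice a retained field such as the employee ID can itself be identifying.

```python
employees = [
    {"employee_id": 1, "name": "John Doe", "email": "john.doe@example.com"},
    {"employee_id": 2, "name": "Jane Smith", "email": "jane.smith@example.com"},
]


def anonymize(rows, fields=("name", "email")):
    """Blank out the identifying fields; nothing is kept that could restore them."""
    return [{k: (None if k in fields else v) for k, v in row.items()} for row in rows]


def pseudonymize(rows, pseudonyms):
    """Replace name and email with a per-employee pseudonym.

    Returns (masked_rows, mapping). The mapping must be stored separately.
    """
    masked, mapping = [], {}
    for row in rows:
        alias = pseudonyms[row["employee_id"]]
        mapping[alias] = {"name": row["name"], "email": row["email"]}
        masked.append(
            {"employee_id": row["employee_id"], "name": alias, "email": f"{alias}@mail"}
        )
    return masked, mapping


def restore(masked_rows, mapping):
    """Rebuild the original rows; requires the mapping produced by pseudonymize()."""
    return [{"employee_id": r["employee_id"], **mapping[r["name"]]} for r in masked_rows]


anonymized = anonymize(employees)
masked, mapping = pseudonymize(employees, {1: "X9T1", 2: "Y4G2"})
restored = restore(masked, mapping)
```

Here `restored` equals the original `employees` list, whereas `anonymized` has no counterpart to `restore`: the names and email addresses are gone.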

Level of Data Protection

Pseudonymization provides moderate data protection. Since pseudonymization is reversible, it may only partially prevent the possibility of re-identification. However, it significantly reduces linkage to original identities.

Anonymization provides a high level of data protection. The risk of re-identification is minimal because anonymization techniques are designed to be irreversible.

GDPR Perspective

According to the GDPR, pseudonymization is a recommended data protection measure. It falls within the scope of the regulation because pseudonymized data can still be linked to individuals.

Anonymization, however, is not regulated under GDPR since anonymized data cannot be linked back to individuals and therefore does not constitute personal data.

In summary, the choice between pseudonymization and anonymization will depend on the specific use case, the sensitivity of the data, and the desired level of security and privacy.

Pseudonymization vs. Tokenization

While pseudonymization and tokenization are both data obfuscation techniques, they differ in method and application.

Understanding the Concept

Pseudonymization replaces identifiable data with fictitious data but keeps a link to the original data. This link is stored in a separate and secure mapping table, as shown below:

Pseudonymized Data

| Employee ID | Name |
| --- | --- |
| 1 | X9T1 |
| 2 | Y4G2 |

Mapping Table

| Pseudonym | Original Data |
| --- | --- |
| X9T1 | John Doe |
| Y4G2 | Jane Smith |

Tokenization, on the other hand, replaces sensitive data with a non-sensitive equivalent, referred to as a token. These tokens have no meaning or value and cannot be reversed without access to the tokenization system. A tokenization example is shown below:

Original Data

Credit Card Number
1234 5678 9012 3456

Tokenized Data

Token
GT56 PL89
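
A tokenization system can be sketched as a small vault that issues random tokens and is the only place where they can be resolved. The `TokenVault` class below is a hypothetical, in-memory stand-in for such a system, and the 16-character hexadecimal token format is an assumption made for this example (the token shown above uses a different format). A real deployment would add persistence, access control, and auditing.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a tokenization system (hypothetical)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return a random token for value, reusing the token if one exists."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # 16 hex characters, unrelated to the value
        while token in self._token_to_value:  # retry on the rare collision
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Resolve a token; raises KeyError if the vault never issued it."""
        return self._token_to_value[token]


vault = TokenVault()
card_number = "1234 5678 9012 3456"
token = vault.tokenize(card_number)
```

The token is drawn at random rather than derived from the card number, so it carries no information about the original value. Only a call to `vault.detokenize(token)` recovers it.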

Level of Data Protection

Pseudonymization provides moderate data protection. It reduces the linkage to original identities but might only partially prevent re-identification since it’s reversible.

Tokenization provides a high level of data protection. The process is irreversible without the tokenization system, significantly reducing the risk of re-identification.

Use Cases

Pseudonymization is often used in data processing activities where the ability to reverse the process might be needed, such as medical research or customer relationship management.

Tokenization is predominantly used in the financial sector, where it’s crucial to protect sensitive data like credit card numbers without losing the ability to process transactions.

In brief, the choice between pseudonymization and tokenization depends on the specific use case and the desired data security and privacy level.