Data Privacy in AI: PII versus Personal Information
Abstract
The article discusses the types of personal data covered by data privacy laws and how to handle that data responsibly in analysis and machine learning. It distinguishes personally identifiable information (PII) from the broader category of personal information (PI), and explains how context and the combination of data points determine whether data is protected. It also offers strategies for minimizing risk when working with personal data, such as storing less data, aggregating it, and de-identifying it with techniques like hashing. Finally, it emphasizes understanding the legal and ethical implications of using personal data, obtaining proper consent, and having data security measures in place.
Q&A
[01] Distinguishing Protected Data
1. What are some examples of personally identifiable information (PII)?
- Examples of PII include ID numbers (e.g. social security, driver's license), full name, full street address, photograph, and telephone number.
2. How is "personal information" (PI) defined more broadly under data privacy laws?
- PI includes data that can be linked back to an individual, such as gender, age/birthdate, profession, race/ethnicity, etc. The more data points that can be combined, the higher the risk of identifying a specific individual.
3. When is data considered not protected under privacy laws?
- Data that is already publicly available, such as public government records, is generally not protected under privacy laws.
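The combination risk described above can be made concrete: individually harmless fields (age, gender, ZIP code) can single out a person once combined. A minimal sketch, using a hypothetical in-memory dataset (the field names and values are illustrative, not from the article):

```python
from collections import Counter

# Hypothetical records: no single field names anyone, but combined
# quasi-identifiers (age, gender, ZIP) can single out individuals.
records = [
    {"age": 34, "gender": "F", "zip": "94110"},
    {"age": 34, "gender": "F", "zip": "94110"},
    {"age": 51, "gender": "M", "zip": "10001"},
    {"age": 29, "gender": "F", "zip": "60614"},
]

def unique_fraction(rows, keys):
    """Fraction of rows whose key combination appears exactly once,
    i.e. rows that are uniquely re-identifiable on those fields."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return sum(
        1 for r in rows if counts[tuple(r[k] for k in keys)] == 1
    ) / len(rows)

print(unique_fraction(records, ["gender"]))                 # coarse field: low risk
print(unique_fraction(records, ["age", "gender", "zip"]))   # combined fields: higher risk
```

Each added key raises the fraction of uniquely identifiable rows, which is exactly why storing fewer data points lowers risk.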
[02] Strategies for Minimizing Risk
1. What are some tips for reducing the amount of personal data stored?
- Only store the minimum amount of data needed to achieve the objective, as each additional data point increases the risk of identifying individuals.
2. How can aggregating data help reduce risk?
- Analyzing data at a group level rather than the individual level can significantly reduce the risk, as long as the individual data is deleted after aggregation.
3. What techniques can be used to de-identify personal data?
- Techniques like irreversible hashing preserve the relationships between data points without leaving the data human-interpretable, enabling modeling while protecting individual identities.
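The three strategies above can be sketched together. This is a minimal illustration, not the article's implementation: the dataset, field names, and salt are hypothetical, and a salted SHA-256 hash is used as one example of a one-way transform (for low-entropy identifiers, a salted hash is pseudonymization rather than full de-identification, so the salt must be kept secret and stored separately):

```python
import hashlib
from collections import defaultdict

# Hypothetical per-user records.
rows = [
    {"email": "ana@example.com", "region": "west", "purchases": 3},
    {"email": "ben@example.com", "region": "west", "purchases": 1},
    {"email": "ana@example.com", "region": "west", "purchases": 2},
    {"email": "cho@example.com", "region": "east", "purchases": 5},
]

SALT = b"illustrative-secret-salt"  # assumption: kept out of the dataset

def pseudonymize(identifier: str) -> str:
    """One-way salted hash: rows stay linkable per person,
    but the identifier is no longer human-readable."""
    return hashlib.sha256(SALT + identifier.encode()).hexdigest()[:16]

# 1) Minimize + de-identify: keep only the needed fields and
#    replace the direct identifier with its hash.
deidentified = [
    {"user": pseudonymize(r["email"]),
     "region": r["region"],
     "purchases": r["purchases"]}
    for r in rows
]

# 2) Aggregate: analyze at the group level, then delete the
#    row-level data once the aggregates are computed.
totals = defaultdict(int)
for r in deidentified:
    totals[r["region"]] += r["purchases"]

print(dict(totals))
```

The hashed key lets the model still recognize that two purchases belong to the same (unnamed) person, while the aggregation step removes individual-level data entirely.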
[03] Considerations for Using Personal Data
1. What is the key risk to consider when deciding to use personal data?
- The risk must be carefully weighed against the expected benefits of the project, and appropriate consent and data security measures must be in place.
2. Why is it important for ML practitioners to be involved in consent form language?
- To ensure the organization has the necessary authorization to use the personal data for the intended purposes.
3. What security measures should be implemented when using personal data?
- Limiting access to only those who need it, and consulting with IT/security teams to implement best practices for protecting the data from breaches or unauthorized access.