Learning to Unlearn: Why Data Scientists and AI Practitioners Should Understand Machine Unlearning
Abstract
The article discusses the challenges and considerations around privacy in the context of AI, surveys the machine unlearning field, and presents the SISA training approach as a practical way to honor the "Right to be Forgotten" principle.
Q&A
[01] The Right to Privacy and Contextual Integrity
1. What is the traditional approach to privacy, and how has it been challenged by disruptive technologies? The traditional approach to privacy has been associated with securing and protecting what we care about behind closed curtains, keeping it out of the public eye, and controlling its access and use. However, disruptive technologies like photography, video, and the exponential growth of data have tested the boundaries of privacy over time.
2. What is the "Contextual Integrity" model of informational privacy, and how does it differ from the traditional approach? The "Contextual Integrity" model holds that privacy should be judged by the appropriateness of the flow of information, taking into account the context and the norms governing it. This differs from the traditional rigid concept of the "right to be let alone": privacy is not a binary state but a matter of properly regulating the exchange of information.
3. How does the article suggest we should approach the concept of privacy as AI continues to shape the future? The article suggests that as AI continues to shape the future, we may need to adapt existing rights or introduce new digital rights, considering whether the rigid concept of privacy should remain unchanged or if we should first understand the social rules governing information flows.
[02] Machine Unlearning and the SISA Framework
1. What are the key reasons why we should be interested in machine unlearning techniques? The key reasons include:
- The "Right to be Forgotten" (RTBF) and the need for companies using AI to adjust processes to meet regulations and user requests to remove personal data from pre-trained models.
- The "Non-Zero Influence" problem: differential privacy only bounds the influence of a single data point rather than removing it entirely, so it may not be enough to satisfy a strict erasure requirement.
- Performance optimization, as retraining a complete model from scratch to remove a single data point may not be the most efficient approach.
- Cybersecurity, as machine unlearning can help remove harmful data points and protect the sensitive information used to train the model.
2. What are the two main lines of thought in the machine unlearning landscape, and how do they differ? The two main lines of thought are:
- Exact Machine Unlearning, which guarantees that the resulting model is equivalent to one retrained from scratch without the removed data points, eliminating their influence completely.
- Approximate Machine Unlearning, which aims to efficiently reduce the influence of specific data points in a trained model, without that strict equivalence guarantee.
3. How does the SISA framework (Sharded, Isolated, Sliced, and Aggregated) address the problem of unlearning data from ML models? The SISA framework replicates the model several times, with each replica trained on a different subset of the dataset (shards). Within each shard, the data is further divided into "slices", and incremental learning is applied with parameters archived accordingly. When data needs to be unlearned, only the constituent models whose shards contain the point to be unlearned are retrained, avoiding the need to retrain the entire model from scratch.
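The retraining-saving logic described above can be sketched with toy stand-ins. The helper names (`make_shards`, `train_sisa`, `unlearn`) are illustrative, not the SISA paper's code, and the "model" simply records its training points and archives a checkpoint after each slice:

```python
import random

def make_shards(dataset, n_shards, n_slices, seed=0):
    """Shuffle the dataset, split it into shards, and each shard into slices."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    shards = [data[i::n_shards] for i in range(n_shards)]
    return [[shard[j::n_slices] for j in range(n_slices)] for shard in shards]

class ConstituentModel:
    """Toy stand-in for a real learner: it remembers its training points
    and archives a snapshot ("parameters") after each slice, as SISA prescribes."""
    def __init__(self):
        self.seen, self.checkpoints = [], []

    def train_slice(self, slice_):
        self.seen.extend(slice_)
        self.checkpoints.append(list(self.seen))  # incremental checkpoint

def train_sisa(shards):
    """Train one isolated constituent model per shard, slice by slice."""
    models = []
    for slices in shards:
        model = ConstituentModel()
        for sl in slices:
            model.train_slice(sl)
        models.append(model)
    return models

def unlearn(shards, models, point):
    """Remove `point` and retrain ONLY the constituent model of its shard;
    the other shards' models are left untouched."""
    for k, slices in enumerate(shards):
        for sl in slices:
            if point in sl:
                sl.remove(point)
                fresh = ConstituentModel()  # retrain just this shard
                for sl2 in slices:
                    fresh.train_slice(sl2)
                models[k] = fresh
                return k  # index of the retrained shard
    return None  # point was not in the training data
```

At inference time, SISA aggregates the constituent models' predictions (e.g., by majority vote); the sketch omits aggregation to focus on the unlearning path.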
[03] Applying SISA to a Computer Vision Use Case
1. How did the article adapt the initial CNN model to include the SISA technique? The article adapted the initial CNN model by:
- Dividing the dataset into shards, with each shard containing a representative number of samples and a balanced class distribution.
- Creating overlapping slices within each shard, as the small dataset size did not guarantee sufficient balance in exclusive slices.
- Defining functions to isolate and remove specific data points from the slices, preparing the model for future erasure requests.
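The first two adaptations might be sketched as follows; the helper names (`stratified_shards`, `overlapping_slices`) and the 50% overlap are hypothetical illustrations, not the article's actual code:

```python
import random
from collections import defaultdict

def stratified_shards(filenames, labels, n_shards, seed=0):
    """Round-robin each class across shards so every shard keeps a
    roughly balanced class distribution."""
    by_class = defaultdict(list)
    for f, y in zip(filenames, labels):
        by_class[y].append(f)
    rng = random.Random(seed)
    shards = [[] for _ in range(n_shards)]
    for y, files in by_class.items():
        rng.shuffle(files)
        for i, f in enumerate(files):
            shards[i % n_shards].append((f, y))
    return shards

def overlapping_slices(shard, n_slices, overlap=0.5):
    """Build slices that share a fraction of their samples -- useful when
    exclusive slices would be too small to stay balanced."""
    size = len(shard)
    step = max(1, int(size / n_slices * (1 - overlap)))
    width = max(1, size - step * (n_slices - 1))
    return [shard[i * step : i * step + width] for i in range(n_slices)]
```

Stratifying before sharding is what keeps each constituent model's training set representative; without it, random sharding of a small dataset can easily starve a shard of an entire class.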
2. What were the key considerations in the sharding and slicing process for the small dataset used in the example? The article initially used 10 shards, but each shard then contained only a few sample images and failed to represent the full dataset's class distribution, causing a significant drop in the model's performance metrics. Reducing the number of shards to 4 proved the wiser choice, preserving the model's predictive capabilities.
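A quick way to sanity-check a candidate shard count before training is to measure the smallest per-class count any shard would receive; more shards means fewer samples of each class per shard. This is a hypothetical check (the `split` and `min_class_count` helpers and the label layout are illustrative, not from the article):

```python
from collections import Counter

def split(labels, n_shards):
    """Contiguous even split -- a stand-in for whatever sharding is used."""
    size = len(labels) // n_shards
    return [labels[i * size : (i + 1) * size] for i in range(n_shards)]

def min_class_count(shards_labels):
    """Smallest per-class count across all shards; low values signal shards
    that no longer represent the full class distribution."""
    classes = {c for s in shards_labels for c in s}
    return min(Counter(s).get(c, 0) for s in shards_labels for c in classes)
```

With, say, 80 labeled samples over 4 classes, 10 shards leave only 2 samples of each class per shard, while 4 shards leave 5 -- the same samples-per-shard squeeze that hurt the 10-shard configuration above.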
3. How did the article demonstrate the process of removing specific images from the training data using the SISA approach? The article provided an example scenario where a hockey player requested the removal of three images from the training data. The article showed how to specify the image filenames to be removed, identify the corresponding indices in the dataset, and update the slices to exclude those data points, without the need to retrain the entire model from scratch.
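The filename-to-index-to-slices flow could look like the minimal sketch below; the helper names and filenames are hypothetical, and slices are represented as lists of dataset indices:

```python
def indices_to_remove(filenames, requested):
    """Map requested image filenames to their dataset indices."""
    wanted = set(requested)
    return [i for i, f in enumerate(filenames) if f in wanted]

def update_slices(slices, remove_idx):
    """Drop the requested indices from every slice; returns which slices
    changed, so only those checkpoints need retraining."""
    drop = set(remove_idx)
    changed = []
    for j, sl in enumerate(slices):
        kept = [i for i in sl if i not in drop]
        if len(kept) != len(sl):
            slices[j] = kept
            changed.append(j)
    return changed
```

Only the constituent models whose slices appear in `changed` would then be retrained from the checkpoint preceding the earliest changed slice, rather than retraining the entire model from scratch.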