Latent Distillation for Continual Object Detection at the Edge
Abstract
The article addresses data distribution shifts in object detection models, a challenge particularly relevant for edge devices operating in dynamic environments such as automotive and robotics. The authors propose a novel Continual Learning (CL) method called Latent Distillation (LD) that reduces the number of operations and the memory required by state-of-the-art CL approaches without significantly compromising detection performance. They validate their approach on the PASCAL VOC and COCO benchmarks, reducing the distillation parameter overhead by 74% and the Floating Point Operations (FLOPs) by 56% per model update compared to other distillation methods.
Q&A
[01] Introduction
1. What are the key challenges addressed in the article? The article tackles data distribution shifts in object detection models, a challenge particularly relevant for edge devices operating in dynamic environments such as automotive and robotics. Specifically, the authors focus on catastrophic forgetting, the phenomenon whereby a model forgets previously acquired knowledge when fine-tuned on new data.
2. How does the article propose to address these challenges? The article proposes a novel Continual Learning (CL) method called Latent Distillation (LD) that reduces the number of operations and memory required by state-of-the-art CL approaches without significantly compromising detection performance.
3. What are the contributions of the article? The article makes two main contributions:
- It investigates and experimentally evaluates the feasibility of an efficient open-source object detector, NanoDet, for Continual Learning for Object Detection (CLOD) applications at the edge.
- It proposes and benchmarks a new method, Latent Distillation (LD), that reduces the computational and memory burden required when updating the model in CLOD applications.
[02] Related Works
1. What are the three main CL scenarios explored in the literature? The literature explores three main CL scenarios: Task-Incremental Learning (TIL), Domain-Incremental Learning (DIL), and Class-Incremental Learning (CIL).
2. What are the three main clusters of CL strategies? The CL strategies can be grouped into three clusters: (i) rehearsal-based, (ii) regularization-based, and (iii) architecture-based.
3. What are the key challenges of CL techniques for edge applications? The time needed to train the model on a new task and the resources required to do so are two key constraints for CLOD at the edge. Moreover, most works in the CLOD literature rely on two-stage detectors such as Faster R-CNN, which are sub-optimal for edge applications.
[03] CLOD at the Edge
1. Why is NanoDet a good choice for CLOD at the edge? NanoDet is an efficient, open-source, anchor-free object detector developed for real-time inference on edge devices. It offers a good performance-complexity tradeoff, making it well suited for edge deployment in CLOD applications.
2. What is the key idea behind Latent Distillation (LD)? The key idea behind Latent Distillation (LD) is to reduce the computation and memory overhead required by classic distillation approaches. LD feeds the training samples to a frozen subset of initial layers to generate representations in the embedding space, and then passes these latent representations to both the teacher and student upper layers to compute the distillation loss and the task loss.
3. What are the main advantages of LD compared to other distillation methods? The main advantages of LD are:
- Fewer computations in the forward pass, since the teacher and student share a common frozen part.
- A smaller set of weights to update in the backward pass, as only the upper layers are trained.
- The weights of the frozen lower layers need to be stored only once, reducing the memory overhead of standard distillation.
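To make the mechanism concrete, here is a minimal PyTorch-style sketch of one LD training step. The module names (frozen_lower, teacher_upper, student_upper) and the simple MSE distillation term are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def latent_distillation_step(frozen_lower, teacher_upper, student_upper,
                             images, targets, detection_loss_fn,
                             optimizer, distill_weight=1.0):
    """One Latent Distillation training step (illustrative sketch).

    frozen_lower:  shared lower layers, frozen and identical for teacher and student
    teacher_upper: copy of the old-task upper layers, kept frozen
    student_upper: upper layers being trained on the new task
    """
    # 1) Single forward pass through the shared frozen lower layers.
    with torch.no_grad():
        latent = frozen_lower(images)

    # 2) Teacher upper layers process the latent features (no gradients).
    with torch.no_grad():
        teacher_out = teacher_upper(latent)

    # 3) Student upper layers process the same latent features.
    student_out = student_upper(latent)

    # 4) Task loss on the new data plus a distillation term that keeps the
    #    student close to the teacher's outputs (placeholder loss; real
    #    detector heads return structured class/box predictions).
    task_loss = detection_loss_fn(student_out, targets)
    distill_loss = F.mse_loss(student_out, teacher_out)
    loss = task_loss + distill_weight * distill_loss

    # 5) The backward pass updates only the student's upper layers.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the lower layers are shared and frozen, they are executed once per sample and stored once, which is where the parameter and FLOP savings reported in the abstract come from.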
[04] Experimental Setting
1. What are the datasets and CL scenarios considered in the experiments? The authors validate their method on the PASCAL VOC 2007 and COCO datasets, considering five widely recognized benchmarks from the literature: 10p10, 15p5, 19p1, and 15p1 on VOC, and 40p40 on COCO. Here, e.g., 10p10 denotes a first task with 10 classes followed by a second task with the remaining 10, while 15p1 adds one class per task after an initial 15-class task.
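As an illustration of how such class-incremental splits are typically built, the helper below is a hypothetical sketch (the exact task composition and class ordering follow the CLOD literature, not this snippet):

```python
# Hypothetical sketch: class-incremental task splits over the 20 PASCAL VOC
# classes (alphabetical ordering assumed here).
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def make_tasks(classes, first, step):
    """Split a class list into a first task of `first` classes followed by
    tasks of `step` classes each (e.g. 15p1 -> first=15, step=1)."""
    tasks = [classes[:first]]
    for i in range(first, len(classes), step):
        tasks.append(classes[i:i + step])
    return tasks

print(make_tasks(VOC_CLASSES, 10, 10))  # 10p10: two tasks of 10 classes
print(make_tasks(VOC_CLASSES, 19, 1))   # 19p1: 19 classes, then 1 more
print(make_tasks(VOC_CLASSES, 15, 1))   # 15p1: 15 classes, then 1 per task
```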
2. What are the evaluation metrics used? The authors evaluate their approach using mean Average Precision (mAP) averaged over Intersection over Union (IoU) thresholds from 0.5 to 0.95. They also report the Multiply-Accumulate (MAC) operations and the memory overhead required by the CL method.
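For reference, mAP@[0.5:0.95] can be computed with the standard COCO evaluation tooling; the sketch below assumes ground-truth annotations and detections are available as COCO-format JSON files (the file names are placeholders).

```python
# Sketch: mAP averaged over IoU thresholds 0.50:0.95 with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")            # ground-truth annotations (placeholder path)
coco_dt = coco_gt.loadRes("detections.json")  # model detections (placeholder path)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] is AP averaged over IoU thresholds from 0.50 to 0.95 (step 0.05).
print(f"mAP@[0.5:0.95] = {evaluator.stats[0]:.3f}")
```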
3. What are the CL methods compared against LD? The authors compare LD against the following CL methods: Joint Training, Fine-tuning, Replay, Latent Replay, LwF, and SID.
[05] Results and Discussion
1. How does LD perform compared to other methods in the Multiple Classes scenarios? In the Multiple Classes scenarios (10p10, 15p5, 40p40), SID is confirmed as the best overall strategy for the FCOS-based NanoDet. Latent Distillation performs comparably to SID, making it suitable for edge applications where a small memory footprint and fast training are important requirements.
2. How does LD perform in the One Class scenario? In the One Class (19p1) scenario, the fully frozen backbone of LD yields the best stability, but at the cost of some learning ability compared to other methods. The authors show that unfreezing more layers of the backbone improves learning ability, at the cost of longer training times and a larger parameter overhead.
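A common way to trade stability for plasticity in this setting is to unfreeze only the last backbone stages; a minimal PyTorch sketch (the attribute and stage names are hypothetical and depend on the detector) could look like this:

```python
import torch

def partially_unfreeze(model, trainable_stage_names=("stage4",)):
    """Freeze the whole backbone, then re-enable gradients only for the
    listed stages (names are hypothetical and detector-dependent)."""
    for param in model.backbone.parameters():
        param.requires_grad = False
    for name, param in model.backbone.named_parameters():
        if any(name.startswith(stage) for stage in trainable_stage_names):
            param.requires_grad = True
    # Hand the optimizer only the parameters that will actually be updated.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```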
3. How does LD perform in the Sequential One Class scenario? In the Sequential One Class (15p1) scenario, SID and LD offer the best compromise between stability and plasticity. LD shows limited plasticity on some task datasets, which can be mitigated by unfreezing a small portion of the network if the constraints of the edge application permit it.