We introduce the problem of Continual Distillation Learning (CDL), which uses knowledge distillation (KD) to improve prompt-based continual learning (CL) models. CDL is worth studying because larger vision transformers (ViTs) yield better performance in prompt-based continual learning, so distilling knowledge from a large ViT to a small ViT can improve the inference efficiency of prompt-based CL models. We empirically find that existing KD methods, such as logit distillation and feature distillation, cannot effectively improve the student model in the CDL setup. To address this, we introduce a novel method named Knowledge Distillation based on Prompts (KDP), in which globally accessible prompts, specifically designed for knowledge distillation, are inserted into the frozen ViT backbone of the student model. We demonstrate that KDP outperforms existing KD methods in the CDL setup.
CDL considers different KD methods in the prompt-based continual learning (CL) setup. The figure below illustrates the overall workflow of our experiments; the blue dashed area highlights our proposed KDP method, which addresses the limitations of the other KD approaches and achieves state-of-the-art (SOTA) performance.
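To make the prompt-insertion idea concrete, below is a minimal sketch of how distillation prompts can be prepended to a frozen student ViT. It assumes a timm-style ViT backbone (`patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`); the names `KDPStudent`, `num_kd_prompts`, and `kdp_loss` are illustrative, and the loss shown is the classic softened-logit KD objective, not necessarily the exact objective used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KDPStudent(nn.Module):
    """Sketch of prompt-based distillation: learnable, globally accessible
    KD prompts are prepended to the token sequence of a frozen student ViT.
    Names and shapes are illustrative, not the repository's exact API."""

    def __init__(self, vit, embed_dim=384, num_kd_prompts=8, num_classes=100):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():      # the backbone stays frozen
            p.requires_grad = False
        # Distillation prompts shared across all tasks ("globally accessible").
        self.kd_prompts = nn.Parameter(0.02 * torch.randn(num_kd_prompts, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]
        tokens = self.vit.patch_embed(x) + self.vit.pos_embed[:, 1:, :]
        cls = (self.vit.cls_token + self.vit.pos_embed[:, :1, :]).expand(B, -1, -1)
        prompts = self.kd_prompts.unsqueeze(0).expand(B, -1, -1)
        # Prompts carry no positional embedding of their own.
        tokens = torch.cat([cls, prompts, tokens], dim=1)
        for blk in self.vit.blocks:
            tokens = blk(tokens)
        return self.head(self.vit.norm(tokens)[:, 0])  # classify from the CLS token

def kdp_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Task cross-entropy plus softened logit distillation (Hinton-style KD)."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return (1.0 - alpha) * ce + alpha * kd
```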
In the tables below, "#" in the Teacher and KD-Method columns denotes the student trained without a teacher, i.e., the baseline CL model without distillation. Results under the 10-task protocol:

# | Teacher | Student | Baseline | KD-Method | Task-Number | Accuracy (%) | Forgetting (%) |
---|---|---|---|---|---|---|---|
1 | # | ViT-Small | CODA [3] | # | 10 | 82.18 | 6.48 |
2 | ViT-Base | ViT-Small | CODA [3] | KD [4] | 10 | 83.03 | 7.24 |
3 | ViT-Base | ViT-Small | CODA [3] | DKD [5] | 10 | 82.27 | 7.81 |
4 | ViT-Base | ViT-Small | CODA [3] | FitNets [6] | 10 | 81.83 | 8.83 |
5 | ViT-Base | ViT-Small | CODA [3] | ReviewKD [7] | 10 | 82.20 | 7.54 |
6 | ViT-Base | ViT-Small | CODA [3] | DeiT [8] | 10 | 83.79 | 6.58 |
7 | ViT-Base | ViT-Small | CODA [3] | KDP(ours) | 10 | 84.31 | 5.63 |
8 | # | ViT-Base | CODA [3] | # | 10 | 86.16 | 5.63 |
9 | ViT-Large | ViT-Base | CODA [3] | KD [4] | 10 | 86.27 | 5.45 |
10 | ViT-Large | ViT-Base | CODA [3] | DKD [5] | 10 | 85.42 | 6.55 |
11 | ViT-Large | ViT-Base | CODA [3] | FitNets [6] | 10 | 85.95 | 6.56 |
12 | ViT-Large | ViT-Base | CODA [3] | ReviewKD [7] | 10 | 86.21 | 5.64 |
13 | ViT-Large | ViT-Base | CODA [3] | DeiT [8] | 10 | 86.78 | 5.43 |
14 | ViT-Large | ViT-Base | CODA [3] | KDP(ours) | 10 | 87.13 | 5.30 |
15 | # | ViT-Small | Dual [2] | # | 10 | 79.85 | 6.12 |
16 | ViT-Base | ViT-Small | Dual [2] | KD [4] | 10 | 80.16 | 5.76 |
17 | ViT-Base | ViT-Small | Dual [2] | DKD [5] | 10 | 80.44 | 6.96 |
18 | ViT-Base | ViT-Small | Dual [2] | FitNets [6] | 10 | 80.70 | 5.73 |
19 | ViT-Base | ViT-Small | Dual [2] | ReviewKD [7] | 10 | 80.33 | 5.86 |
20 | ViT-Base | ViT-Small | Dual [2] | DeiT [8] | 10 | 80.64 | 5.67 |
21 | ViT-Base | ViT-Small | Dual [2] | KDP(ours) | 10 | 81.78 | 3.63 |
22 | # | ViT-Base | Dual [2] | # | 10 | 84.66 | 5.91 |
23 | ViT-Large | ViT-Base | Dual [2] | KD [4] | 10 | 84.67 | 4.52 |
24 | ViT-Large | ViT-Base | Dual [2] | DKD [5] | 10 | 84.93 | 4.95 |
25 | ViT-Large | ViT-Base | Dual [2] | FitNets [6] | 10 | 83.12 | 8.33 |
26 | ViT-Large | ViT-Base | Dual [2] | ReviewKD [7] | 10 | 84.11 | 5.19 |
27 | ViT-Large | ViT-Base | Dual [2] | DeiT [8] | 10 | 85.73 | 5.05 |
28 | ViT-Large | ViT-Base | Dual [2] | KDP(ours) | 10 | 86.92 | 4.77 |
29 | # | ViT-Small | L2P [1] | # | 10 | 77.71 | 7.12 |
30 | ViT-Base | ViT-Small | L2P [1] | KD [4] | 10 | 79.64 | 6.35 |
31 | ViT-Base | ViT-Small | L2P [1] | DKD [5] | 10 | 78.21 | 9.13 |
32 | ViT-Base | ViT-Small | L2P [1] | FitNets [6] | 10 | 79.56 | 5.89 |
33 | ViT-Base | ViT-Small | L2P [1] | ReviewKD [7] | 10 | 78.50 | 8.04 |
34 | ViT-Base | ViT-Small | L2P [1] | DeiT [8] | 10 | 79.56 | 6.71 |
35 | ViT-Base | ViT-Small | L2P [1] | KDP(ours) | 10 | 81.79 | 4.31 |
36 | # | ViT-Base | L2P [1] | # | 10 | 83.02 | 6.06 |
37 | ViT-Large | ViT-Base | L2P [1] | KD [4] | 10 | 85.00 | 4.48 |
38 | ViT-Large | ViT-Base | L2P [1] | DKD [5] | 10 | 83.29 | 4.99 |
39 | ViT-Large | ViT-Base | L2P [1] | FitNets [6] | 10 | 83.60 | 5.21 |
40 | ViT-Large | ViT-Base | L2P [1] | ReviewKD [7] | 10 | 83.12 | 7.97 |
41 | ViT-Large | ViT-Base | L2P [1] | DeiT [8] | 10 | 84.21 | 6.06 |
42 | ViT-Large | ViT-Base | L2P [1] | KDP(ours) | 10 | 86.56 | 4.97 |
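For reference, the Accuracy and Forgetting columns follow the standard continual-learning definitions. A minimal sketch of how they can be computed, assuming the evaluation produces a task-accuracy matrix `acc[i][j]` (accuracy on task j after training on task i):

```python
import numpy as np

def cl_metrics(acc):
    """acc[i, j] = accuracy on task j after training on task i (T x T array;
    entries with j > i are unused and assumed zero). Returns the two columns
    reported above: final average accuracy and average forgetting."""
    acc = np.asarray(acc)
    T = acc.shape[0]
    avg_accuracy = acc[-1].mean()
    # Forgetting: for each old task, the best accuracy ever achieved on it
    # minus its accuracy after the final task, averaged over old tasks.
    forgetting = np.mean([acc[:-1, j].max() - acc[-1, j] for j in range(T - 1)])
    return avg_accuracy, forgetting
```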
A second set of results under the same 10-task protocol:

# | Teacher | Student | Baseline | KD-Method | Task-Number | Accuracy (%) | Forgetting (%) |
---|---|---|---|---|---|---|---|
1 | # | ViT-Small | CODA [3] | # | 10 | 67.44 | 8.52 |
2 | ViT-Base | ViT-Small | CODA [3] | KD [4] | 10 | 69.91 | 7.64 |
3 | ViT-Base | ViT-Small | CODA [3] | DKD [5] | 10 | 68.92 | 8.39 |
4 | ViT-Base | ViT-Small | CODA [3] | FitNets [6] | 10 | 69.87 | 7.38 |
5 | ViT-Base | ViT-Small | CODA [3] | ReviewKD [7] | 10 | 70.19 | 7.68 |
6 | ViT-Base | ViT-Small | CODA [3] | DeiT [8] | 10 | 70.74 | 6.66 |
7 | ViT-Base | ViT-Small | CODA [3] | KDP(ours) | 10 | 71.92 | 5.61 |
8 | # | ViT-Base | CODA [3] | # | 10 | 76.42 | 4.31 |
9 | ViT-Large | ViT-Base | CODA [3] | KD [4] | 10 | 76.99 | 3.81 |
10 | ViT-Large | ViT-Base | CODA [3] | DKD [5] | 10 | 76.70 | 4.84 |
11 | ViT-Large | ViT-Base | CODA [3] | FitNets [6] | 10 | 74.55 | 6.81 |
12 | ViT-Large | ViT-Base | CODA [3] | ReviewKD [7] | 10 | 75.72 | 4.14 |
13 | ViT-Large | ViT-Base | CODA [3] | DeiT [8] | 10 | 77.83 | 4.51 |
14 | ViT-Large | ViT-Base | CODA [3] | KDP(ours) | 10 | 78.62 | 3.46 |
15 | # | ViT-Small | Dual [2] | # | 10 | 65.51 | 5.93 |
16 | ViT-Base | ViT-Small | Dual [2] | KD [4] | 10 | 65.68 | 7.26 |
17 | ViT-Base | ViT-Small | Dual [2] | DKD [5] | 10 | 65.44 | 7.27 |
18 | ViT-Base | ViT-Small | Dual [2] | FitNets [6] | 10 | 66.20 | 5.93 |
19 | ViT-Base | ViT-Small | Dual [2] | ReviewKD [7] | 10 | 65.69 | 6.56 |
20 | ViT-Base | ViT-Small | Dual [2] | DeiT [8] | 10 | 65.82 | 4.00 |
21 | ViT-Base | ViT-Small | Dual [2] | KDP(ours) | 10 | 68.77 | 3.13 |
22 | # | ViT-Base | Dual [2] | # | 10 | 73.18 | 3.45 |
23 | ViT-Large | ViT-Base | Dual [2] | KD [4] | 10 | 73.90 | 3.31 |
24 | ViT-Large | ViT-Base | Dual [2] | DKD [5] | 10 | 75.24 | 4.15 |
25 | ViT-Large | ViT-Base | Dual [2] | FitNets [6] | 10 | 71.23 | 5.71 |
26 | ViT-Large | ViT-Base | Dual [2] | ReviewKD [7] | 10 | 72.19 | 5.72 |
27 | ViT-Large | ViT-Base | Dual [2] | DeiT [8] | 10 | 76.03 | 3.90 |
28 | ViT-Large | ViT-Base | Dual [2] | KDP(ours) | 10 | 76.06 | 3.77 |
29 | # | ViT-Small | L2P [1] | # | 10 | 63.82 | 6.52 |
30 | ViT-Base | ViT-Small | L2P [1] | KD [4] | 10 | 63.97 | 6.51 |
31 | ViT-Base | ViT-Small | L2P [1] | DKD [5] | 10 | 62.91 | 6.55 |
32 | ViT-Base | ViT-Small | L2P [1] | FitNets [6] | 10 | 64.29 | 6.37 |
33 | ViT-Base | ViT-Small | L2P [1] | ReviewKD [7] | 10 | 63.64 | 6.36 |
34 | ViT-Base | ViT-Small | L2P [1] | DeiT [8] | 10 | 64.99 | 3.83 |
35 | ViT-Base | ViT-Small | L2P [1] | KDP(ours) | 10 | 68.18 | 2.08 |
36 | # | ViT-Base | L2P [1] | # | 10 | 73.94 | 4.41 |
37 | ViT-Large | ViT-Base | L2P [1] | KD [4] | 10 | 74.12 | 4.60 |
38 | ViT-Large | ViT-Base | L2P [1] | DKD [5] | 10 | 74.58 | 4.69 |
39 | ViT-Large | ViT-Base | L2P [1] | FitNets [6] | 10 | 70.39 | 5.84 |
40 | ViT-Large | ViT-Base | L2P [1] | ReviewKD [7] | 10 | 72.17 | 6.11 |
41 | ViT-Large | ViT-Base | L2P [1] | DeiT [8] | 10 | 73.99 | 5.09 |
42 | ViT-Large | ViT-Base | L2P [1] | KDP(ours) | 10 | 76.91 | 3.15 |
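Every KD row in the tables follows the same pattern: a larger, already-trained prompt-based CL model serves as the frozen teacher while the student is updated. A minimal sketch of one distillation step, reusing the hypothetical `kdp_loss` from above (the optimizer setup is an assumption):

```python
import torch

def distillation_step(student, teacher, optimizer, images, labels):
    """One CDL step: the larger teacher is frozen, so only the student's
    prompts and classifier head receive gradients."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images)   # soft targets from the larger ViT
    s_logits = student(images)
    loss = kdp_loss(s_logits, t_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```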
This repository contains the code for CDL. If you find our work useful, please cite:
@misc{2024CDL,
  title={Continual Distillation Learning: An Empirical Study of Knowledge Distillation in Prompt-based Continual Learning},
  author={Qifan Zhang and Yunhui Guo and Yu Xiang},
  year={2024},
  eprint={2407.13911},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
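References (works cited in the tables above):

[1] Z. Wang et al. "Learning to Prompt for Continual Learning." CVPR, 2022.
[2] Z. Wang et al. "DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning." ECCV, 2022.
[3] J. S. Smith et al. "CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning." CVPR, 2023.
[4] G. Hinton, O. Vinyals, and J. Dean. "Distilling the Knowledge in a Neural Network." arXiv:1503.02531, 2015.
[5] B. Zhao et al. "Decoupled Knowledge Distillation." CVPR, 2022.
[6] A. Romero et al. "FitNets: Hints for Thin Deep Nets." ICLR, 2015.
[7] P. Chen et al. "Distilling Knowledge via Knowledge Review." CVPR, 2021.
[8] H. Touvron et al. "Training Data-efficient Image Transformers & Distillation through Attention." ICML, 2021.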
Send any comments or questions to Qifan Zhang: qifan.zhang@utdallas.edu
This work was supported in part by the DARPA Perceptually-enabled Task Guidance (PTG) Program under contract number HR00112220005, the Sony Research Award Program, and the National Science Foundation (NSF) under Grant No. 2346528.