Continual Distillation Learning: Knowledge Distillation in Prompt-based Continual Learning

Abstract

We introduce the problem of continual distillation learning (CDL), which uses knowledge distillation (KD) to improve prompt-based continual learning (CL) models. The CDL problem is worth studying because larger vision transformers (ViTs) lead to better performance in prompt-based continual learning, so distilling knowledge from a large ViT to a small ViT can improve the inference efficiency of prompt-based CL models. We empirically find that existing KD methods, such as logit distillation and feature distillation, cannot effectively improve the student model in the CDL setup. To this end, we introduce a novel method named Knowledge Distillation based on Prompts (KDP), in which globally accessible prompts specifically designed for knowledge distillation are inserted into the frozen ViT backbone of the student model. We demonstrate that KDP effectively enhances distillation performance compared to existing KD methods in the CDL setup.
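
To make the KDP idea concrete, the following is a minimal, hedged sketch (not the official implementation) of how globally accessible KD prompts could be prepended to the token sequence of a frozen ViT student. It assumes a timm-style ViT exposing patch_embed, blocks, and norm; names such as KDPromptedStudent and num_kd_prompts are illustrative, and the task prompts of the underlying prompt-based CL baseline (L2P, DualPrompt, or CODA-Prompt) are omitted for brevity.

import torch
import torch.nn as nn

class KDPromptedStudent(nn.Module):
    """Illustrative student: a frozen ViT backbone with learnable KD prompts."""

    def __init__(self, frozen_vit: nn.Module, embed_dim: int = 384,
                 num_kd_prompts: int = 8, num_classes: int = 100):
        super().__init__()
        self.vit = frozen_vit
        for p in self.vit.parameters():
            p.requires_grad_(False)  # the ViT backbone stays frozen
        # KD prompts are shared across all tasks ("globally accessible") and
        # receive gradients only through the distillation objective.
        self.kd_prompts = nn.Parameter(0.02 * torch.randn(num_kd_prompts, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.vit.patch_embed(x)                          # (B, N, D) patch tokens
        prompts = self.kd_prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        tokens = torch.cat([prompts, tokens], dim=1)              # prepend KD prompts
        for blk in self.vit.blocks:                               # frozen transformer blocks
            tokens = blk(tokens)
        feat = self.vit.norm(tokens).mean(dim=1)                  # pooled feature (cls token and
        return self.head(feat)                                    # positional embeddings omitted)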

CDL

We introduce the problem of Continual Distillation Learning (CDL), which considers different KD methods in the prompt-based continual learning (CL) setup. The figure below illustrates the overall workflow of our experiments. The blue dashed area highlights our proposed KDP method, which addresses the limitations of other KD approaches and achieves state-of-the-art (SOTA) performance.

[Figure: Overview of the CDL workflow, with the proposed KDP method highlighted in the blue dashed area.]
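
At training time, the CDL setup pairs the prompt-based student with a larger teacher that is trained on the same task sequence. Below is a hedged sketch of a single training step, using vanilla logit distillation [4] purely as an example objective; cdl_step, alpha, and T are illustrative placeholders rather than values from the paper.

import torch
import torch.nn.functional as F

def cdl_step(student, teacher, images, labels, optimizer, alpha=0.5, T=2.0):
    """One illustrative CDL training step: current-task loss + logit distillation."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images)        # the teacher is itself a prompt-based CL model
    s_logits = student(images)

    ce = F.cross_entropy(s_logits, labels)                        # current-task classification loss
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)                # Hinton-style KD [4]

    loss = (1.0 - alpha) * ce + alpha * kd
    optimizer.zero_grad()
    loss.backward()                       # only the prompts and the head receive gradients
    optimizer.step()
    return loss.item()

Swapping the kd term for DKD, FitNets, ReviewKD, or DeiT-style losses gives the other KD baselines compared in the tables below.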

Experiment Results

We conducted knowledge distillation experiments on three prompt-based continual learning methods: L2P [1], DualPrompt [2], and CODA-Prompt [3]. We hope these results provide inspiration for future research on CDL, which can be extended to more continual learning methods. In the tables below, "--" in the Teacher and KD Method columns denotes the student baseline trained without a teacher; Accuracy and Forgetting are computed as outlined in the sketch after the tables.
CIFAR-100

| #  | Teacher   | Student   | Baseline | KD Method    | Tasks | Accuracy (%) | Forgetting (%) |
|----|-----------|-----------|----------|--------------|-------|--------------|----------------|
| 1  | --        | ViT-Small | CODA [3] | --           | 10    | 82.18        | 6.48           |
| 2  | ViT-Base  | ViT-Small | CODA [3] | KD [4]       | 10    | 83.03        | 7.24           |
| 3  | ViT-Base  | ViT-Small | CODA [3] | DKD [5]      | 10    | 82.27        | 7.81           |
| 4  | ViT-Base  | ViT-Small | CODA [3] | FitNets [6]  | 10    | 81.83        | 8.83           |
| 5  | ViT-Base  | ViT-Small | CODA [3] | ReviewKD [7] | 10    | 82.20        | 7.54           |
| 6  | ViT-Base  | ViT-Small | CODA [3] | DeiT [8]     | 10    | 83.79        | 6.58           |
| 7  | ViT-Base  | ViT-Small | CODA [3] | KDP (ours)   | 10    | 84.31        | 5.63           |
| 8  | --        | ViT-Base  | CODA [3] | --           | 10    | 86.16        | 5.63           |
| 9  | ViT-Large | ViT-Base  | CODA [3] | KD [4]       | 10    | 86.27        | 5.45           |
| 10 | ViT-Large | ViT-Base  | CODA [3] | DKD [5]      | 10    | 85.42        | 6.55           |
| 11 | ViT-Large | ViT-Base  | CODA [3] | FitNets [6]  | 10    | 85.95        | 6.56           |
| 12 | ViT-Large | ViT-Base  | CODA [3] | ReviewKD [7] | 10    | 86.21        | 5.64           |
| 13 | ViT-Large | ViT-Base  | CODA [3] | DeiT [8]     | 10    | 86.78        | 5.43           |
| 14 | ViT-Large | ViT-Base  | CODA [3] | KDP (ours)   | 10    | 87.13        | 5.30           |
| 15 | --        | ViT-Small | Dual [2] | --           | 10    | 79.85        | 6.12           |
| 16 | ViT-Base  | ViT-Small | Dual [2] | KD [4]       | 10    | 80.16        | 5.76           |
| 17 | ViT-Base  | ViT-Small | Dual [2] | DKD [5]      | 10    | 80.44        | 6.96           |
| 18 | ViT-Base  | ViT-Small | Dual [2] | FitNets [6]  | 10    | 80.70        | 5.73           |
| 19 | ViT-Base  | ViT-Small | Dual [2] | ReviewKD [7] | 10    | 80.33        | 5.86           |
| 20 | ViT-Base  | ViT-Small | Dual [2] | DeiT [8]     | 10    | 80.64        | 5.67           |
| 21 | ViT-Base  | ViT-Small | Dual [2] | KDP (ours)   | 10    | 81.78        | 3.63           |
| 22 | --        | ViT-Base  | Dual [2] | --           | 10    | 84.66        | 5.91           |
| 23 | ViT-Large | ViT-Base  | Dual [2] | KD [4]       | 10    | 84.67        | 4.52           |
| 24 | ViT-Large | ViT-Base  | Dual [2] | DKD [5]      | 10    | 84.93        | 4.95           |
| 25 | ViT-Large | ViT-Base  | Dual [2] | FitNets [6]  | 10    | 83.12        | 8.33           |
| 26 | ViT-Large | ViT-Base  | Dual [2] | ReviewKD [7] | 10    | 84.11        | 5.19           |
| 27 | ViT-Large | ViT-Base  | Dual [2] | DeiT [8]     | 10    | 85.73        | 5.05           |
| 28 | ViT-Large | ViT-Base  | Dual [2] | KDP (ours)   | 10    | 86.92        | 4.77           |
| 29 | --        | ViT-Small | L2P [1]  | --           | 10    | 77.71        | 7.12           |
| 30 | ViT-Base  | ViT-Small | L2P [1]  | KD [4]       | 10    | 79.64        | 6.35           |
| 31 | ViT-Base  | ViT-Small | L2P [1]  | DKD [5]      | 10    | 78.21        | 9.13           |
| 32 | ViT-Base  | ViT-Small | L2P [1]  | FitNets [6]  | 10    | 79.56        | 5.89           |
| 33 | ViT-Base  | ViT-Small | L2P [1]  | ReviewKD [7] | 10    | 78.50        | 8.04           |
| 34 | ViT-Base  | ViT-Small | L2P [1]  | DeiT [8]     | 10    | 79.56        | 6.71           |
| 35 | ViT-Base  | ViT-Small | L2P [1]  | KDP (ours)   | 10    | 81.79        | 4.31           |
| 36 | --        | ViT-Base  | L2P [1]  | --           | 10    | 83.02        | 6.06           |
| 37 | ViT-Large | ViT-Base  | L2P [1]  | KD [4]       | 10    | 85.00        | 4.48           |
| 38 | ViT-Large | ViT-Base  | L2P [1]  | DKD [5]      | 10    | 83.29        | 4.99           |
| 39 | ViT-Large | ViT-Base  | L2P [1]  | FitNets [6]  | 10    | 83.60        | 5.21           |
| 40 | ViT-Large | ViT-Base  | L2P [1]  | ReviewKD [7] | 10    | 83.12        | 7.97           |
| 41 | ViT-Large | ViT-Base  | L2P [1]  | DeiT [8]     | 10    | 84.21        | 6.06           |
| 42 | ViT-Large | ViT-Base  | L2P [1]  | KDP (ours)   | 10    | 86.56        | 4.97           |
ImageNet-R

| #  | Teacher   | Student   | Baseline | KD Method    | Tasks | Accuracy (%) | Forgetting (%) |
|----|-----------|-----------|----------|--------------|-------|--------------|----------------|
| 1  | --        | ViT-Small | CODA [3] | --           | 10    | 67.44        | 8.52           |
| 2  | ViT-Base  | ViT-Small | CODA [3] | KD [4]       | 10    | 69.91        | 7.64           |
| 3  | ViT-Base  | ViT-Small | CODA [3] | DKD [5]      | 10    | 68.92        | 8.39           |
| 4  | ViT-Base  | ViT-Small | CODA [3] | FitNets [6]  | 10    | 69.87        | 7.38           |
| 5  | ViT-Base  | ViT-Small | CODA [3] | ReviewKD [7] | 10    | 70.19        | 7.68           |
| 6  | ViT-Base  | ViT-Small | CODA [3] | DeiT [8]     | 10    | 70.74        | 6.66           |
| 7  | ViT-Base  | ViT-Small | CODA [3] | KDP (ours)   | 10    | 71.92        | 5.61           |
| 8  | --        | ViT-Base  | CODA [3] | --           | 10    | 76.42        | 4.31           |
| 9  | ViT-Large | ViT-Base  | CODA [3] | KD [4]       | 10    | 76.99        | 3.81           |
| 10 | ViT-Large | ViT-Base  | CODA [3] | DKD [5]      | 10    | 76.70        | 4.84           |
| 11 | ViT-Large | ViT-Base  | CODA [3] | FitNets [6]  | 10    | 74.55        | 6.81           |
| 12 | ViT-Large | ViT-Base  | CODA [3] | ReviewKD [7] | 10    | 75.72        | 4.14           |
| 13 | ViT-Large | ViT-Base  | CODA [3] | DeiT [8]     | 10    | 77.83        | 4.51           |
| 14 | ViT-Large | ViT-Base  | CODA [3] | KDP (ours)   | 10    | 78.62        | 3.46           |
| 15 | --        | ViT-Small | Dual [2] | --           | 10    | 65.51        | 5.93           |
| 16 | ViT-Base  | ViT-Small | Dual [2] | KD [4]       | 10    | 65.68        | 7.26           |
| 17 | ViT-Base  | ViT-Small | Dual [2] | DKD [5]      | 10    | 65.44        | 7.27           |
| 18 | ViT-Base  | ViT-Small | Dual [2] | FitNets [6]  | 10    | 66.20        | 5.93           |
| 19 | ViT-Base  | ViT-Small | Dual [2] | ReviewKD [7] | 10    | 65.69        | 6.56           |
| 20 | ViT-Base  | ViT-Small | Dual [2] | DeiT [8]     | 10    | 65.82        | 4.00           |
| 21 | ViT-Base  | ViT-Small | Dual [2] | KDP (ours)   | 10    | 68.77        | 3.13           |
| 22 | --        | ViT-Base  | Dual [2] | --           | 10    | 73.18        | 3.45           |
| 23 | ViT-Large | ViT-Base  | Dual [2] | KD [4]       | 10    | 73.90        | 3.31           |
| 24 | ViT-Large | ViT-Base  | Dual [2] | DKD [5]      | 10    | 75.24        | 4.15           |
| 25 | ViT-Large | ViT-Base  | Dual [2] | FitNets [6]  | 10    | 71.23        | 5.71           |
| 26 | ViT-Large | ViT-Base  | Dual [2] | ReviewKD [7] | 10    | 72.19        | 5.72           |
| 27 | ViT-Large | ViT-Base  | Dual [2] | DeiT [8]     | 10    | 76.03        | 3.90           |
| 28 | ViT-Large | ViT-Base  | Dual [2] | KDP (ours)   | 10    | 76.06        | 3.77           |
| 29 | --        | ViT-Small | L2P [1]  | --           | 10    | 63.82        | 6.52           |
| 30 | ViT-Base  | ViT-Small | L2P [1]  | KD [4]       | 10    | 63.97        | 6.51           |
| 31 | ViT-Base  | ViT-Small | L2P [1]  | DKD [5]      | 10    | 62.91        | 6.55           |
| 32 | ViT-Base  | ViT-Small | L2P [1]  | FitNets [6]  | 10    | 64.29        | 6.37           |
| 33 | ViT-Base  | ViT-Small | L2P [1]  | ReviewKD [7] | 10    | 63.64        | 6.36           |
| 34 | ViT-Base  | ViT-Small | L2P [1]  | DeiT [8]     | 10    | 64.99        | 3.83           |
| 35 | ViT-Base  | ViT-Small | L2P [1]  | KDP (ours)   | 10    | 68.18        | 2.08           |
| 36 | --        | ViT-Base  | L2P [1]  | --           | 10    | 73.94        | 4.41           |
| 37 | ViT-Large | ViT-Base  | L2P [1]  | KD [4]       | 10    | 74.12        | 4.60           |
| 38 | ViT-Large | ViT-Base  | L2P [1]  | DKD [5]      | 10    | 74.58        | 4.69           |
| 39 | ViT-Large | ViT-Base  | L2P [1]  | FitNets [6]  | 10    | 70.39        | 5.84           |
| 40 | ViT-Large | ViT-Base  | L2P [1]  | ReviewKD [7] | 10    | 72.17        | 6.11           |
| 41 | ViT-Large | ViT-Base  | L2P [1]  | DeiT [8]     | 10    | 73.99        | 5.09           |
| 42 | ViT-Large | ViT-Base  | L2P [1]  | KDP (ours)   | 10    | 76.91        | 3.15           |
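
For completeness, the sketch below shows one common way to compute the two reported metrics from a per-task accuracy matrix; the array acc and the helper name are illustrative, and the exact evaluation protocol follows the underlying continual learning baselines.

import numpy as np

def final_accuracy_and_forgetting(acc: np.ndarray):
    """acc[t, i]: accuracy on task i after training on task t (defined for i <= t,
    zero elsewhere). Returns the final average accuracy and average forgetting."""
    T = acc.shape[0]
    final_acc = acc[T - 1, :].mean()                              # Accuracy (%) column
    forgetting = np.mean([acc[:T - 1, i].max() - acc[T - 1, i]    # drop from best earlier accuracy
                          for i in range(T - 1)])                 # Forgetting (%) column
    return final_acc, forgetting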

References

  1. Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022. [ Paper ]
  2. Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, pages 631–648. Springer, 2022. [ Paper ]
  3. James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11909–11919, June 2023. [ Paper ]
  4. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. [ Paper ]
  5. Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11953–11962, 2022. [ Paper ]
  6. Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014. [ Paper ]
  7. Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5008–5017, 2021. [ Paper ]
  8. Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020. [ Paper ]

Code

CDL

The code for CDL.

BibTeX

Please cite CDL if it helps your research:
@misc{2024CDL,
  title={Continual Distillation Learning: An Empirical Study of Knowledge Distillation in Prompt-based Continual Learning},
  author={Qifan Zhang and Yunhui Guo and Yu Xiang},
  year={2024},
  eprint={2407.13911},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Contact

Send any comments or questions to Qifan Zhang: qifan.zhang@utdallas.edu

Acknowledgements

This work was supported in part by the DARPA Perceptually-enabled Task Guidance (PTG) Program under contract number HR00112220005, the Sony Research Award Program, and the National Science Foundation (NSF) under Grant No. 2346528.