Partial to Whole Knowledge Distillation: Progressive Distilling Decomposed Knowledge Boosts Student Better

09/26/2021
by   Xuanyang Zhang, et al.

The knowledge distillation field delicately designs various types of knowledge to shrink the performance gap between a compact student and a large-scale teacher. These existing distillation approaches focus solely on improving knowledge quality, but ignore the significant influence of knowledge quantity on the distillation procedure. In contrast to conventional distillation approaches, which extract knowledge from a fixed teacher computation graph, this paper explores a non-negligible research direction from the novel perspective of knowledge quantity to further improve the efficacy of knowledge distillation. We introduce a new concept of knowledge decomposition, and further put forward the Partial to Whole Knowledge Distillation (PWKD) paradigm. Specifically, we reconstruct the teacher into weight-sharing sub-networks with the same depth but increasing channel width, and train the sub-networks jointly to obtain decomposed knowledge (sub-networks with more channels represent more knowledge). The student then extracts partial to whole knowledge from the pre-trained teacher over multiple training stages, where a cyclic learning rate is leveraged to accelerate convergence. In general, PWKD can be regarded as a plugin compatible with existing offline knowledge distillation approaches. To verify the effectiveness of PWKD, we conduct experiments on two benchmark datasets, CIFAR-100 and ImageNet, and comprehensive evaluation results reveal that PWKD consistently improves existing knowledge distillation approaches without bells and whistles.
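To make the partial-to-whole procedure concrete, the following is a minimal sketch of the staged training loop implied by the abstract, not the authors' released implementation. It assumes a hypothetical teacher interface `teacher(x, width_ratio=...)` that returns the logits of the weight-sharing sub-network at a given channel-width fraction, and uses a standard soft-label KL distillation loss as the base objective; all hyper-parameters are illustrative.

```python
# Minimal PWKD-style sketch (assumptions noted above; not the paper's code).
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard soft-label distillation loss."""
    p_t = F.softmax(teacher_logits / temperature, dim=1)
    log_p_s = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2


def train_pwkd(student, teacher, loader, width_ratios=(0.25, 0.5, 0.75, 1.0),
               epochs_per_stage=60, max_lr=0.1, alpha=0.9):
    """Distill progressively from partial to whole teacher knowledge.

    One training stage per sub-network width; the cyclic learning rate is
    restarted in every stage to accelerate convergence.
    """
    teacher.eval()
    for ratio in width_ratios:  # partial -> whole knowledge
        optimizer = torch.optim.SGD(student.parameters(), lr=max_lr,
                                    momentum=0.9, weight_decay=5e-4)
        total_steps = epochs_per_stage * len(loader)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=max_lr, total_steps=total_steps)  # cyclic LR per stage
        for _ in range(epochs_per_stage):
            for x, y in loader:
                with torch.no_grad():
                    # Decomposed knowledge: logits of the sub-network at this width.
                    t_logits = teacher(x, width_ratio=ratio)
                s_logits = student(x)
                loss = (alpha * kd_loss(s_logits, t_logits)
                        + (1 - alpha) * F.cross_entropy(s_logits, y))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                scheduler.step()
    return student
```

Because the stages only change which teacher sub-network supplies the targets, the same loop can wrap other offline distillation losses in place of `kd_loss`, which is what makes PWKD usable as a plugin.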
