Moonshine: Distilling with Cheap Convolutions

11/07/2017
by Elliot J. Crowley, et al.

Model distillation compresses a trained machine learning model, such as a neural network, into a smaller alternative so that it can be easily deployed in a resource-limited setting. Unfortunately, this requires engineering two architectures: a large teacher architecture and a smaller student architecture trained to emulate it. In this paper, we present a distillation strategy that produces a student architecture as a simple transformation of the teacher architecture. Recent model distillation methods allow us to preserve most of the teacher's performance after replacing its convolutional blocks with a cheap alternative. In addition, distillation by attention transfer gives student network performance that is better than training that student architecture directly on the data.
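Below is a minimal sketch of the two ideas the abstract mentions, assuming PyTorch; the block and function names (standard_block, cheap_block, attention_transfer_loss) and the beta weight are illustrative placeholders, not taken from the paper's code. It shows how a standard 3x3 convolutional block might be swapped for a cheaper grouped-plus-pointwise alternative, and an attention-transfer loss of the kind introduced by Zagoruyko and Komodakis, which the paper uses for distillation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def standard_block(channels):
    """Ordinary 3x3 conv block, as found in the teacher network."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

def cheap_block(channels, groups=4):
    """Cheaper substitute: a grouped 3x3 conv followed by a 1x1 (pointwise) conv.
    With g groups the 3x3 conv uses roughly 1/g of the parameters
    (channels must be divisible by groups)."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                  groups=groups, bias=False),
        nn.Conv2d(channels, channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

def attention_map(feats):
    """Spatial attention map: average of squared activations over channels,
    flattened and L2-normalised per example."""
    a = feats.pow(2).mean(dim=1)             # (N, H, W)
    return F.normalize(a.flatten(1), dim=1)  # (N, H*W)

def attention_transfer_loss(student_feats, teacher_feats, beta=1e3):
    """Penalise the distance between student and teacher attention maps
    at matching depths in the two networks (beta is an illustrative weight)."""
    return beta * sum(
        (attention_map(s) - attention_map(t)).pow(2).mean()
        for s, t in zip(student_feats, teacher_feats)
    )
```

In this sketch, the student would be built by using cheap_block wherever the teacher uses standard_block, and trained with attention_transfer_loss added to the usual classification objective.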
