Pruning Ternary Quantization
We propose pruning ternary quantization (PTQ), a simple yet effective symmetric ternary quantization method. The method significantly compresses neural network weights to sparse ternary values {-1, 0, 1} and thus reduces computational, storage, and memory footprints. We show that PTQ can convert regular weights to ternary orthonormal bases by simply using pruning and L2 projection. In addition, we introduce a refined straight-through estimator to finalize and stabilize the quantized weights. Our method can provide at most a 46x compression ratio on the ResNet-18 structure, with an acceptable accuracy of 65.36%. It compresses a ResNet-18 model from 46 MB to 955 KB (~48x) and a ResNet-50 model from 99 MB to 3.3 MB (~30x), while the top-1 accuracy on ImageNet drops slightly from 69.7% to 65.3%. Our method unifies pruning and quantization and thus provides a range of size-accuracy trade-offs.
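The core operation can be illustrated with a small sketch: prune the smallest-magnitude weights to zero, map the survivors to a scaled sign, and pass gradients through the quantizer with a straight-through estimator. This is a minimal illustration, not the authors' implementation; the pruning ratio, the L2-derived scale, and the plain (unrefined) STE used here are assumptions chosen for clarity.

```python
# Minimal sketch of pruning-based ternary quantization with a plain
# straight-through estimator. NOT the paper's implementation; pruning
# ratio, scale derivation, and STE details are illustrative assumptions.
import torch


class TernaryQuant(torch.autograd.Function):
    """Forward: prune small weights to 0, map the rest to +/-alpha.
    Backward: pass gradients straight through (identity)."""

    @staticmethod
    def forward(ctx, w, prune_ratio=0.5):
        flat = w.abs().flatten()
        k = int(prune_ratio * flat.numel())
        # Magnitude threshold: the k smallest |w| values are pruned to zero.
        threshold = flat.kthvalue(max(k, 1)).values
        mask = (w.abs() > threshold).float()
        # L2-optimal scale for the surviving weights projected onto {-1, +1}.
        alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
        return alpha * torch.sign(w) * mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: treat the quantizer as identity for gradients.
        return grad_output, None


if __name__ == "__main__":
    w = torch.randn(4, 4, requires_grad=True)
    q = TernaryQuant.apply(w, 0.5)  # sparse ternary weights alpha * {-1, 0, 1}
    q.sum().backward()              # gradients flow back to the latent weights
    print(q)
    print(w.grad)
```

In this sketch the scale alpha is the mean absolute value of the unpruned weights, which minimizes the L2 error of projecting them onto a scaled sign pattern; the paper's refined estimator and orthonormal-basis construction go beyond this simple version.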