Bigger Faster: Two-stage Neural Architecture Search for Quantized Transformer Models

09/25/2022
by Yuji Chai, et al.

Neural architecture search (NAS) for transformers has been used to create state-of-the-art models that target certain latency constraints. In this work we present Bigger Faster, a novel quantization-aware parameter-sharing NAS that finds architectures for 8-bit integer (int8) quantized transformers. Our results show that our method produces BERT models that outperform the current state-of-the-art technique, AutoTinyBERT, at all latency targets we tested, achieving up to a 2.68% gain. Furthermore, although the models found by our technique have a larger number of parameters than their float32 counterparts, because their parameters are int8 they have significantly smaller memory footprints.
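To make the final claim concrete, here is a minimal sketch of the footprint arithmetic: each int8 parameter occupies 1 byte versus 4 bytes per float32 parameter, so a quantized model can carry more parameters and still store its weights in less memory. The parameter counts below are hypothetical, chosen for illustration, not figures from the paper.

```python
# Bytes per parameter for the two storage formats.
FLOAT32_BYTES = 4
INT8_BYTES = 1

def footprint_mib(num_params: int, bytes_per_param: int) -> float:
    """Raw weight-storage footprint in MiB (ignores activations and overhead)."""
    return num_params * bytes_per_param / 2**20

# Hypothetical parameter counts: the int8 model has 2x the parameters.
baseline_params = 30_000_000   # float32 baseline
quantized_params = 60_000_000  # larger int8 model

fp32 = footprint_mib(baseline_params, FLOAT32_BYTES)
int8 = footprint_mib(quantized_params, INT8_BYTES)

print(f"float32 baseline: {fp32:.1f} MiB")  # ~114.4 MiB
print(f"int8 model:       {int8:.1f} MiB")  # ~57.2 MiB, half the footprint despite 2x params
```

Under these assumed numbers, doubling the parameter count while dropping from 4 bytes to 1 byte per weight still halves the storage footprint, which is the trade-off the abstract describes.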
