Memory Planning for Deep Neural Networks

02/23/2022
by Maksim Levental, et al.

We study memory allocation patterns in DNNs during inference, in the context of large-scale systems. We observe that, under multi-threading, such allocation patterns are subject to high latencies due to contention in the system memory allocator, and that these latencies produce undesirable bottlenecks in user-facing services. Thus, we propose a "memoization"-based technique, MemoMalloc, for optimizing overall latency, with only moderate increases in peak memory usage. Specifically, our technique consists of a runtime component, which captures all allocations and uniquely associates them with their high-level source operation, and a static analysis component, which constructs an efficient allocation "plan". We present an implementation of MemoMalloc in the PyTorch deep learning framework and evaluate memory consumption and execution performance on a wide range of DNN architectures. We find that MemoMalloc outperforms state-of-the-art general-purpose memory allocators, with respect to DNN inference latency, by as much as 40%.
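
As a purely illustrative aid (none of this code appears in the abstract), the following minimal Python sketch captures the general idea behind such a two-component design: a profiling pass memoizes each operation's allocation (its size and lifetime), and a static planner packs those buffers into a single pre-allocated arena so that steady-state inference never contends for the system allocator. The names Alloc and plan_offsets, the one-buffer-per-op simplification, and the greedy first-fit packing heuristic are all assumptions made here for illustration, not the paper's actual implementation.

# A minimal, hypothetical sketch of memoization-based memory planning.
# Alloc, plan_offsets, and the first-fit heuristic are illustrative
# assumptions, not the paper's actual API or algorithm.
from dataclasses import dataclass

@dataclass
class Alloc:
    op_id: int   # high-level source operation that requested the buffer
    size: int    # bytes requested
    start: int   # first op index at which the buffer is live
    end: int     # last op index at which the buffer is live

def plan_offsets(allocs):
    """Assign each buffer a fixed offset in one shared arena, letting
    buffers with disjoint lifetimes reuse the same bytes."""
    plan, placed = {}, []
    for a in sorted(allocs, key=lambda x: x.size, reverse=True):
        # Buffers whose lifetimes intersect a's cannot share bytes with it.
        conflicts = sorted(
            ((off, b) for off, b in placed
             if not (a.end < b.start or b.end < a.start)),
            key=lambda p: p[0],
        )
        offset = 0
        for off, b in conflicts:
            if offset + a.size <= off:
                break  # the gap below this conflicting buffer is large enough
            offset = max(offset, off + b.size)
        plan[a.op_id] = offset
        placed.append((offset, a))
    arena_size = max((off + b.size for off, b in placed), default=0)
    return plan, arena_size

# Three tensors from a tiny graph: ops 0 and 2 never coexist, so the
# planner overlaps them and the arena is 1536 bytes rather than 2560.
plan, arena_size = plan_offsets([
    Alloc(op_id=0, size=1024, start=0, end=1),
    Alloc(op_id=1, size=512,  start=1, end=2),
    Alloc(op_id=2, size=1024, start=2, end=3),
])
print(plan, arena_size)  # {0: 0, 2: 0, 1: 1024} 1536

At inference time, the memoized plan would be replayed by serving each operation's request at arena_base + plan[op_id], replacing per-call trips to the system allocator with a table lookup.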
