Elastic deep learning in multi-tenant GPU cluster

09/26/2019
by Yidi Wu, et al.

Multi-tenant GPU clusters are common nowadays due to the huge success of deep learning, and training jobs are usually conducted with multiple distributed GPUs. These GPU clusters are managed with various goals, including short job completion time (JCT), high resource utilization, and quick response to small jobs. In this paper, we show that elasticity, which is the ability to adjust the parallelism (number of GPUs) of a job with low overhead, helps to achieve the goals of GPU cluster management. With elasticity, we can adjust the trade-off between throughput and efficiency, adapt to cluster load variations, utilize transient idle resources, and so on. Motivated by the benefits of elasticity, we designed Amoeba, which requires minimal changes to user code and provides a simple API for the scheduler to control the parallelism of jobs. Amoeba is general in that it delegates single-machine execution to existing deep learning frameworks and uses a lightweight control layer for coordination and management. As it is crucial to reduce the overhead of parallelism adjustment, Amoeba adopts key designs including automatic job management, background scaling, and a dynamic data pipeline. Experimental results show that Amoeba introduces negligible overhead to normal training without parallelism adjustment and pays a significantly lower cost (around 95% lower) for parallelism adjustment. Results also show that a state-of-the-art GPU cluster scheduler can leverage elasticity with simple modifications and reduce the average JCT by as much as 29% compared with the case without elasticity.
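
The abstract does not show Amoeba's actual interface, so the following is only a minimal, hypothetical sketch of the kind of parallelism-control API it describes: an elastic job exposes a single resize call, and a toy scheduler uses it to redistribute GPUs as cluster load changes. All names (ElasticJob, scale_to, Scheduler, rebalance) are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch (not Amoeba's actual API): an elastic training job that a
# cluster scheduler can resize at runtime with a single call.

import threading


class ElasticJob:
    """A training job whose GPU parallelism can be adjusted with low overhead."""

    def __init__(self, job_id: str, num_gpus: int):
        self.job_id = job_id
        self.num_gpus = num_gpus
        self._lock = threading.Lock()

    def scale_to(self, target_gpus: int) -> None:
        """Request a new parallelism level; in a real system the change would be
        applied in the background (cf. background scaling) and the input
        pipeline re-partitioned (cf. dynamic data pipeline)."""
        with self._lock:
            print(f"[{self.job_id}] scaling {self.num_gpus} -> {target_gpus} GPUs")
            self.num_gpus = target_gpus


class Scheduler:
    """Toy scheduler that reassigns GPUs across running elastic jobs."""

    def __init__(self, total_gpus: int):
        self.total_gpus = total_gpus
        self.jobs: list[ElasticJob] = []

    def submit(self, job: ElasticJob) -> None:
        self.jobs.append(job)

    def rebalance(self) -> None:
        # Evenly split the cluster's GPUs across jobs; a real policy would also
        # weigh throughput vs. efficiency and average JCT, as the paper discusses.
        if not self.jobs:
            return
        share = max(1, self.total_gpus // len(self.jobs))
        for job in self.jobs:
            job.scale_to(share)


if __name__ == "__main__":
    sched = Scheduler(total_gpus=16)
    sched.submit(ElasticJob("job-a", num_gpus=16))
    sched.rebalance()  # job-a keeps all 16 GPUs
    sched.submit(ElasticJob("job-b", num_gpus=0))
    sched.rebalance()  # load changes: each job now gets 8 GPUs
```

The point of the sketch is the shape of the interface: the scheduler only needs one low-overhead resize primitive per job, which is what makes elasticity-aware scheduling policies (e.g., reclaiming transient idle GPUs) simple to add.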
