Improved Latency-Communication Trade-Off for Map-Shuffle-Reduce Systems with Stragglers
In a distributed computing system operating according to the map-shuffle-reduce framework, coding data prior to storage can be useful both to reduce the latency caused by straggling servers and to decrease the inter-server communication load in the shuffling phase. In prior work, a concatenated coding scheme was proposed for a matrix multiplication task. In this scheme, the outer Maximum Distance Separable (MDS) code is leveraged to correct erasures caused by stragglers, while the inner repetition code is used to improve the communication efficiency in the shuffling phase by means of coded multicasting. In this work, it is demonstrated that it is possible to leverage the redundancy created by repetition coding in order to increase the rate of the outer MDS code and hence to increase the multicasting opportunities in the shuffling phase. As a result, the proposed approach is shown to improve over the best known latency-communication overhead trade-off.
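As a minimal sketch of the outer-code idea only, the following Python example shows how an MDS code lets a matrix-vector product be recovered from any k of n servers, so that up to n - k stragglers can be ignored. It is not the scheme proposed in this work: the inner repetition code and the coded multicasting of the shuffling phase are omitted. The Vandermonde generator over the rationals, the evaluation points 1, ..., n, and all function names (encode_blocks, matvec, solve, decode) are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def encode_blocks(blocks, n):
    """Encode k row-blocks into n coded blocks using a Vandermonde generator.

    Coded block i is sum_j alpha_i**j * blocks[j], with alpha_i = i + 1.
    Distinct alpha_i make every k x k submatrix invertible (MDS property).
    """
    k = len(blocks)
    rows, cols = len(blocks[0]), len(blocks[0][0])
    coded = []
    for i in range(n):
        alpha = Fraction(i + 1)
        coded.append([
            [sum(alpha ** j * blocks[j][r][c] for j in range(k))
             for c in range(cols)]
            for r in range(rows)
        ])
    return coded


def matvec(block, x):
    """Map task of one server: multiply its stored block by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in block]


def solve(matrix, rhs):
    """Solve matrix * unknowns = rhs by Gauss-Jordan elimination (exact)."""
    k = len(matrix)
    aug = [[Fraction(v) for v in matrix[i]] + list(rhs[i]) for i in range(k)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def decode(results, k):
    """Recover A @ x from the outputs of any k non-straggling servers.

    `results` maps a server index to that server's coded output vector.
    """
    servers = sorted(results)[:k]
    vandermonde = [[Fraction(s + 1) ** j for j in range(k)] for s in servers]
    parts = solve(vandermonde, [results[s] for s in servers])
    return [v for part in parts for v in part]


# Toy instance (assumed values): k = 2 data blocks, n = 4 servers.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
k, n = 2, 4
coded = encode_blocks([A[:2], A[2:]], n)
worker_outputs = {i: matvec(block, x) for i, block in enumerate(coded)}

# Any k of the n servers suffice, whichever n - k are stragglers.
expected = matvec(A, x)
for survivors in combinations(range(n), k):
    partial = {s: worker_outputs[s] for s in survivors}
    assert decode(partial, k) == expected
```

Exact rational arithmetic is used here to keep the decoding step free of rounding error. A practical system would more likely work over a finite field or in floating point with a better-conditioned generator matrix.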