DynaSOAr: A Parallel Memory Allocator for Object-oriented Programming on GPUs with Efficient Memory Access
Object-oriented programming has long been regarded as too inefficient for SIMD high-performance computing, despite the fact that many important HPC applications have an inherent object structure. On SIMD accelerators, including GPUs, this is mainly due to performance problems with memory allocation: a few libraries support parallel memory allocation directly on accelerator devices, but all of them suffer from uncoalesced memory accesses. In this work, we present DynaSOAr, a C++/CUDA data layout DSL for object-oriented programming, combined with a parallel dynamic object allocator. DynaSOAr was designed for a class of object-oriented programs that we call Single-Method Multiple-Objects (SMMO), in which parallelism is expressed over a set of objects. DynaSOAr is the first GPU object allocator that provides a parallel do-all operation, which is the foundation of SMMO applications. DynaSOAr improves the usage of allocated memory with a Structure of Arrays (SOA) data layout and achieves low memory fragmentation through efficient management of free and allocated memory blocks with lock-free, hierarchical bitmaps. In our benchmarks, DynaSOAr speeds up application code by up to 3x over state-of-the-art allocators. Moreover, DynaSOAr manages heap memory more efficiently than other allocators, allowing programmers to run up to 2x larger problem sizes with the same amount of memory.
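To make the ideas in the abstract concrete, the following is a minimal host-side C++ sketch of a single SOA block whose slots are tracked by a lock-free bitmap, together with a sequential stand-in for a do-all over the allocated objects. It is an illustration under stated assumptions, not DynaSOAr's API: the names BodyBlock, allocate_slot, free_slot and for_each_allocated are hypothetical, the block holds a fixed 64 objects, only one bitmap level is modeled (the hierarchy described above is omitted), and on a GPU each allocated slot would be processed by its own thread rather than by a loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// All names here are hypothetical and chosen for illustration only.
constexpr int kSlots = 64;  // objects per block, one bitmap bit per slot

// One block of "Body" objects in Structure-of-Arrays form: each field lives
// in its own array, so threads i and i+1 reading the same field touch
// adjacent memory locations.
struct BodyBlock {
  std::atomic<std::uint64_t> allocated{0};  // bit i set => slot i is in use
  float pos[kSlots] = {};
  float vel[kSlots] = {};
};

// Index of the lowest zero bit, or -1 if every bit is set.
inline int lowest_clear_bit(std::uint64_t bits) {
  for (int i = 0; i < kSlots; ++i) {
    if (((bits >> i) & 1ull) == 0) return i;
  }
  return -1;
}

// Lock-free allocation: pick a slot that looks free and try to claim it with
// an atomic fetch_or. If another thread claimed it first, retry using the
// bitmap value that fetch_or returned. Returns -1 when the block is full.
inline int allocate_slot(BodyBlock& b) {
  std::uint64_t seen = b.allocated.load(std::memory_order_relaxed);
  while (~seen != 0) {
    const int slot = lowest_clear_bit(seen);
    const std::uint64_t mask = 1ull << slot;
    const std::uint64_t before =
        b.allocated.fetch_or(mask, std::memory_order_acq_rel);
    if ((before & mask) == 0) return slot;  // the bit was clear: we own it
    seen = before;                          // lost the race; try another bit
  }
  return -1;
}

// Release a slot by clearing its bit.
inline void free_slot(BodyBlock& b, int slot) {
  assert(slot >= 0 && slot < kSlots);
  b.allocated.fetch_and(~(1ull << slot), std::memory_order_acq_rel);
}

// Sequential stand-in for a parallel do-all: call f(block, slot) for every
// slot that was allocated when the bitmap was read.
template <typename F>
void for_each_allocated(BodyBlock& b, F f) {
  const std::uint64_t bits = b.allocated.load(std::memory_order_acquire);
  for (int i = 0; i < kSlots; ++i) {
    if ((bits >> i) & 1ull) f(b, i);
  }
}
```

A caller would obtain a slot with allocate_slot, write the object's fields at that index in pos and vel, and later apply one method to all live objects by passing a lambda such as [](BodyBlock& blk, int i) { blk.pos[i] += blk.vel[i]; } to for_each_allocated. Because the bitmap records exactly which slots are live, freed slots can be reused without moving other objects.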