Computer Science Colloquia
Friday, December 2, 2011
Advisor: Andrew Grimshaw
Attending Faculty: Kevin Skadron (Chair); Kim Hazelwood; David Luebke; and Hossein Haj-Hariri.
Olsson Hall, Room 236D, 3:00 pm
Ph.D. Dissertation Presentation
Allocation-oriented Algorithm Design for GPU Computing
The wide data-parallelism of GPU processor design facilitates the execution of large quantities of concurrent, fine-grained tasks with unprecedented performance and efficiency. However, contemporary opinion regards GPU architecture as being poorly suited for many computations whose parallelizations would impose dynamic dependences between processing elements. This dissertation specifically focuses on such problems whose efficient solutions require cooperative allocation, i.e., their operation entails dynamic data placement within shared structures. The contribution of this thesis is a set of design patterns and reusable tunable software primitives that foster the construction of cooperative parallelizations that are significantly faster, more efficient, more flexible, and more performance-portable than the state-of-the-art. Whereas concurrent CPU programs typically leverage fine-grained atomic operations for coordinating data placement, we demonstrate the advantages of parallel prefix sum as a high performance, bulk-synchronous alternati for cooperative allocation. We construct very efficient algorithms for global and local prefix sum by employing flexible granularity coarsening techniques for properly balancing serial and parallel workloads. The large efficiency gains permit kernel fusion techniques whereby application-specific logic can be incorporated within our prefix sum constructs at significantly reduced (or negligible) overhead. To demonstrate the viability of our methods, we construct cooperative GPU implementations for a variety of parallel list-processing primitives (including reduction, prefix scan, duplicate removal, histogram, and reduce-by-key), evaluating their performance across a wide spectrum of problem sizes, types, and target architectures. To achieve such performance-portability, we present a policy-based autotuning design idiom where we leave unbound many of the program details having opaque performance consequences. Finally, we showcase high performance implementations for two archetypical problems in Computer Science: sorting and breadth-first search (BFS). We achieve excellent performance, demonstrating multiple factors of speedup for both problem genres over prior work for GPU and conventional multi-core platforms.