Computer Science Colloquia
Wednesday, April 24, 2013
PhD Dissertation Presentation
Rate Matching to Improve the Performance and Energy Efficiency of GPU Systems
Advisor: Kevin Skadron
Attending Faculty: Sudhanva Gurumurthi (Chair), Jack Davidson, Mircea Stan, and Westley Weimer
2:00 PM, Rice Hall, Rm. 242
Graphics processing units (GPUs) have attracted enormous interest over the past decade due to substantial increases in both performance and programmability. Programmers can leverage GPUs for large performance gains, but only at the cost of significant software engineering effort. In practice, most GPU applications do not effectively utilize all of the available resources in a system: they either fail to use a resource at all or use it to less than its full potential. This underutilization can hurt both performance and energy efficiency. In this dissertation, we frame this underutilization as a rate-matching problem: the rate at which requests arrive at a resource does not match the rate at which that resource completes them. This problem is challenging because the completion rate of a given resource is highly application dependent and may also change significantly at run time in response to system- or application-level changes. We present novel techniques for automatically solving this rate-matching problem to improve performance and energy efficiency.
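The rate-matching framing can be made concrete with a toy model (the function and numbers below are purely illustrative, not drawn from the dissertation): a resource is underutilized whenever requests arrive more slowly than it can complete them.

```python
# Toy illustration of the rate-matching problem: a resource that can
# complete `service_rate` requests per second is underutilized whenever
# requests arrive more slowly than that (hypothetical numbers).

def utilization(arrival_rate, service_rate):
    """Fraction of the resource's capacity actually used."""
    return min(arrival_rate, service_rate) / service_rate

# A device able to retire 100 requests/s but fed only 40/s runs at 40% utilization.
print(utilization(40.0, 100.0))   # 0.4
print(utilization(120.0, 100.0))  # 1.0 (resource saturated; excess requests queue)
```

Matching the arrival rate to the completion rate, or vice versa, is what drives both the scheduling and frequency-scaling techniques described below.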
First, to better understand the challenges of leveraging a GPU, we present a case study of our experience accelerating a computationally intensive video tracking application from systems biology. Although we improved performance by 26x relative to the best CPU implementation, we encountered significant challenges along the way and had to apply non-obvious optimization strategies. Based on the lessons learned, we present general guidelines for optimizing GPU applications as well as recommendations for system-level changes that would simplify the development of high-performance GPU applications.
Next, we address underutilization at the system level by using load balancing to improve performance. We propose a dynamic scheduling algorithm that automatically and efficiently divides the execution of a data-parallel kernel across multiple, possibly heterogeneous GPUs. We show that our scheduler can nearly match the performance of an idealized static scheduler with perfect advance knowledge of device performance when that performance is fixed, and can provide better performance when device performance varies.
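One common way to realize such dynamic scheduling is chunk-based self-scheduling, in which each device claims the next chunk of work whenever it becomes idle, so faster devices naturally process more chunks. The sketch below simulates this scheme; the device names, speeds, and chunk size are hypothetical, and this is a simplified illustration rather than the dissertation's actual scheduler.

```python
import heapq

def schedule(total_items, device_speeds, chunk_size=64):
    """Simulate chunk-based self-scheduling of `total_items` work items.

    `device_speeds` maps a device name to its throughput (items/second).
    Whenever a device becomes idle it claims the next chunk, so faster
    devices automatically receive more of the work.
    """
    done = {d: 0 for d in device_speeds}
    # Min-heap of (time when device becomes free, device name).
    heap = [(0.0, d) for d in device_speeds]
    heapq.heapify(heap)
    remaining = total_items
    while remaining > 0:
        free_at, dev = heapq.heappop(heap)
        n = min(chunk_size, remaining)      # claim the next chunk
        remaining -= n
        done[dev] += n
        # Device is busy for n / speed seconds before claiming again.
        heapq.heappush(heap, (free_at + n / device_speeds[dev], dev))
    return done

# A device 3x faster ends up with roughly 3x the work, with no static split.
work = schedule(1024, {"fast_gpu": 300.0, "slow_gpu": 100.0})
print(work)  # fast_gpu processes 3x the items of slow_gpu
```

Because the division of work emerges from run-time behavior rather than a precomputed split, the same mechanism adapts automatically when a device's effective speed changes mid-execution.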
Finally, we address underutilization within the GPU by using frequency scaling to improve energy efficiency. We propose a novel algorithm for predicting the energy-optimal GPU clock frequencies for an arbitrary kernel. Using power measurements from real systems, we demonstrate that our algorithm improves significantly on the state of the art across multiple generations of GPUs. We also propose and evaluate techniques for decreasing the CPU's energy consumption during GPU computation.
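The intuition behind energy-optimal frequency selection can be sketched with simple analytical models (the models and constants below are invented for illustration, not the dissertation's measured models): a kernel's runtime splits into a frequency-scaled compute part and a frequency-independent memory part, while power has a static term plus a dynamic term that grows with frequency.

```python
def predict_energy(f_ghz, compute_cycles, mem_seconds,
                   p_static=25.0, c_dyn=20.0):
    """Toy energy model: energy = power(f) * time(f).

    Runtime is compute cycles scaled by frequency plus a fixed
    memory-bound component; power is static plus a dynamic term
    growing with f^3 (all constants are made up).
    """
    time = compute_cycles / (f_ghz * 1e9) + mem_seconds
    power = p_static + c_dyn * f_ghz ** 3   # watts (toy model)
    return power * time                     # joules

def best_frequency(freqs, compute_cycles, mem_seconds):
    """Pick the candidate frequency minimizing predicted energy."""
    return min(freqs, key=lambda f: predict_energy(f, compute_cycles, mem_seconds))

freqs = [0.5, 0.7, 0.9, 1.1, 1.3]  # candidate core clocks in GHz
# A compute-bound kernel favors a mid-to-high clock; a memory-bound
# kernel wastes dynamic power at high clocks and favors the lowest.
print(best_frequency(freqs, compute_cycles=2e9, mem_seconds=0.0))  # 0.9
print(best_frequency(freqs, compute_cycles=2e8, mem_seconds=1.0))  # 0.5
```

The trade-off this captures is that running faster than the memory system can feed the cores burns dynamic power without reducing runtime, which is why the energy-optimal frequency is application dependent.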
The techniques presented in this dissertation can be used to improve the performance and energy efficiency of GPU applications with no programmer effort or software modifications required. As the diversity of available hardware systems continues to increase, automatic techniques such as these will become critical for software to fully realize the benefits of future hardware improvements.