Computer Science Colloquia
Tuesday, December 20, 2011
Advisor: Sudhanva Gurumurthi
Attending Faculty: Kevin Skadron, Chair; Mary Lou Soffa; Joanne Bechta Dugan; and Mircea Stan
Olsson Hall, Room 236D, 1:00 PM
Ph.D. Proposal Presentation
A Multi-Level Approach to Processor and Memory Reliability
We are in the era of multicore processors, and the number of processing cores on a chip is expected to increase steadily over the next decade, driven by Moore's Law. While technology scaling paves the way for high-performance multicore processors, it has a dark side too: silicon reliability. Silicon reliability problems affect the processor cores, the caches, and main memory. Processors and their memory systems have to be designed to provide adequate protection against these problems.
Designing a fault-tolerant computing system is a three-step process: i) understanding the underlying reliability problem through measurement studies and field-data analysis of deployed systems, ii) abstracting the insights gained in the first step into models for each reliability phenomenon, and iii) developing protection techniques to mitigate or tolerate the reliability problems. While this three-step measurement-modeling-optimization process may appear straightforward, applying it raises several challenges. Designing a reliable computer system is a large and complex multi-dimensional, multi-level problem, comprising different hardware blocks, reliability phenomena, design layers, metrics, and optimization techniques.
This dissertation proposes to target a subset of this large problem space, considering both the processor and main memory. For the processor, it will consider two key reliability phenomena: Bias Temperature Instability (BTI) and Process Variations (PV). The proposed research will involve measurement studies from chips that contain PMOS and NMOS devices, the development of models that are suitable for architecture analysis, and the development of mitigation techniques for both logic and memory structures.
For main memory, this dissertation will present an analysis of field data collected from 30,000 systems deployed in data centers and will develop a model for main memory reliability based on that analysis.