Computer Science Colloquia

Monday, April 21, 2014
Sriram Sankar
Advisor: Sudhanva Gurumurthi
Attending Faculty: Kevin Skadron, chair; Paul Reynolds, Marty Humphrey, Kamin Whitehouse, and Mircea Stan

9:00 AM, Rice Hall, Rm. 242

PhD Dissertation Defense Presentation
Impact of Data Center Infrastructure on Server Availability – Characterization, Management and Optimization


As cloud computing, online services, and user storage needs grow, large companies continue to build data center facilities to meet end-user demand. These large facilities are warehouse-scale computers in their own right, and their cost efficiency is critical for both cloud and enterprise businesses. Data center infrastructure can be partitioned logically into the IT infrastructure (server and network) and the critical environment infrastructure (power, cooling and management). Although the IT component of the data center is crucial for applications to run, almost one-third of the total cost of ownership of a data center is spent on building and operating the critical environment infrastructure. Data center operators strive to reduce the cost of the critical environment infrastructure so that a larger share of the capital expense can go to servers. However, reducing this cost usually comes at the price of more failures or greater unavailability of the server infrastructure. In this work, we explore the impact of data center critical infrastructure on server availability. Using data from production data centers at Microsoft, we first characterize server component failures with respect to temperature, evaluating the relationship between server hard disk drive failures and temperature in detail. We then evaluate power availability events and their impact on data center power provisioning. Next, we focus on the critical management infrastructure that coordinates the rest of the infrastructure, and propose a novel, low-cost, wireless-based solution for data center management. We also present a new class of failures in data centers, which we call "soft failures": failures that result in service unavailability but do not require actual hardware replacement.