Computer Science Colloquia
Monday, December 12, 2011
Advisor: Sudhanva Gurumurthi
Attending Faculty: Kevin Skadron (Chair), Paul Reynolds, Marty Humphrey
Olsson Hall, Room 236D, 2:00 pm
Ph.D. Qualifying Exam Presentation
Impact of Temperature on Hard Disk Drive Reliability in Large Datacenters
With the advent of cloud computing and online services, large enterprises rely heavily on their datacenters to serve end users. When failures increase, a large datacenter facility incurs higher maintenance costs as well as service unavailability. However, the major determinants of server failures in datacenters are poorly understood. Hard disk drives are known to contribute significantly to these failures. In this work, I focus on the relationship between temperature and hard disk drive failures in a real datacenter. I present a case study of failures in a dense storage design, drawn from a large population of servers housing close to 80,000 disk drives and hosting a large-scale online service at Microsoft. In our preliminary DSN 2011 work, we established a correlation between temperature and failures observed at three location granularities: a) across drive locations inside a server chassis, b) across server locations in a rack, and c) across multiple racks in a datacenter. In this presentation, we extend the previous study and show that temperature exhibits a stronger correlation with failures than disk utilization or workload characteristics do. Additionally, I explore the impact of variations in temperature on hard disk drive failures using data collected from the datacenter deployment. With data from real drives and experimental evaluation under lab conditions, we show that workload changes contribute minimally to temperature changes or failures in the storage system under study. We also explore chassis design parameters that can influence the temperature experienced by hard disk drives, including the placement of disk drives within the chassis and the impact of varying fan speeds.
Finally, using a datacenter cost model together with reliability estimates from an Arrhenius model, I show the cost benefit of proposed temperature optimizations that increase hard disk drive reliability, and motivate the need for datacenter architects to consider the impact of temperature at the design phase.
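
For readers unfamiliar with the Arrhenius model mentioned in the abstract, the following is a minimal sketch of the standard Arrhenius acceleration factor, which relates the failure rate at one temperature to the failure rate at a reference temperature. It is not the model used in the talk: the function name, the activation energy, and the temperatures are hypothetical placeholders chosen only for illustration.

```python
import math

# Boltzmann constant in electronvolts per kelvin.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_acceleration_factor(temp_c, ref_temp_c, activation_energy_ev):
    """Return the ratio of failure rate at temp_c to failure rate at ref_temp_c.

    Standard Arrhenius form: AF = exp((Ea / k) * (1 / T_ref - 1 / T)),
    with both temperatures converted to kelvin.
    """
    temp_k = temp_c + 273.15
    ref_temp_k = ref_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / ref_temp_k - 1.0 / temp_k
    )
    return math.exp(exponent)


if __name__ == "__main__":
    # Assumed values for illustration only; they are not taken from the study.
    assumed_activation_energy_ev = 0.5
    factor = arrhenius_acceleration_factor(
        temp_c=50.0, ref_temp_c=40.0, activation_energy_ev=assumed_activation_energy_ev
    )
    print(f"Acceleration factor at 50 C relative to 40 C: {factor:.2f}")
```

A factor greater than 1 means the model predicts a higher failure rate at the operating temperature than at the reference temperature. Feeding such a factor into a cost model is one way to compare cooling or placement options, under whatever activation energy the analysis assumes.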