Computer Science Colloquia
Monday, April 16, 2012
Arkaitz Ruiz Alvarez
Advisor: Marty Humphrey
Attending Faculty: Kim Hazelwood, Chair; Wes Weimer, Brian Smith, and Diego Lopez de Ipina Gonzalez de Artaza
12:00 PM, Rice Hall, Rm. 504
Ph.D. Dissertation Presentation
Automated Data Management in Cloud Computing
Scientists increasingly rely on computational resources, both compute and storage, to expand scientific knowledge. For example, the data deluge is quickly outgrowing the capacity of storage systems, and the increasing use of simulation requires large compute capabilities. Thus, scientists need to expand their local resources with highly available and scalable systems. We consider cloud computing to be the solution that provides scientific applications with the computational resources they need. However, the services offered by cloud providers leave several important issues unaddressed: how to meet the data requirements with the storage systems available, and how to optimize cost and other performance metrics. The variety of storage and compute choices with different characteristics and prices, the growth of the stored data in both size and number, and the data management requirements make these tasks overwhelmingly complex for individual users.
To address these challenges, we focus on four key elements of data management: the analysis of current storage services, the expression of data requirements and storage capabilities, data management algorithms, and data-aware scheduling algorithms. We combine the information from our analysis of the storage services with their capabilities in a machine-readable format that our implementation of the user's data requirements can process. Thus, within a few milliseconds we can obtain, for each application dataset, a list of storage services that meet the user's requirements, along with cost and performance estimates. Our unique approach to data management generates an integer linear programming problem from this list. The solution to this problem is an optimal assignment of the application's data to cloud services. Our implementation can provide optimal solutions for our use cases in less than one second. We have also created new scheduling algorithms for two types of cloud applications (MapReduce and watershed model calibration) that balance cost and execution time. The scheduling decisions are Pareto optimal and, therefore, superior to other strategies. We believe that these four elements can provide users with a comprehensive solution to the data management problem and allow them to take advantage of the new opportunities that cloud computing offers.
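As a rough illustration of the two steps described above (filtering storage services per dataset, then choosing a minimum-cost assignment), the following Python sketch may help. It is not the dissertation's implementation: the service names, capabilities, prices, and dataset requirements are all hypothetical, and exhaustive enumeration stands in for an integer linear programming solver, which is only practical for tiny inputs like this one.

```python
from itertools import product

# Hypothetical storage services (all values are invented for illustration).
# Prices are in cents per GB-month so that costs stay exact integers.
SERVICES = {
    "blob_standard": {"durability_nines": 11, "capacity_gb": 1000, "cents_per_gb": 10},
    "blob_reduced": {"durability_nines": 4, "capacity_gb": 1000, "cents_per_gb": 6},
    "local_disk": {"durability_nines": 3, "capacity_gb": 200, "cents_per_gb": 3},
}

# Hypothetical application datasets with one requirement each.
DATASETS = {
    "raw_input": {"size_gb": 300, "min_durability_nines": 9},
    "intermediate": {"size_gb": 150, "min_durability_nines": 3},
    "results": {"size_gb": 100, "min_durability_nines": 4},
}


def feasible_services(dataset):
    """Return the services whose capabilities meet the dataset's requirements."""
    return [
        name
        for name, svc in SERVICES.items()
        if svc["durability_nines"] >= dataset["min_durability_nines"]
        and svc["capacity_gb"] >= dataset["size_gb"]
    ]


def cheapest_assignment():
    """Enumerate every feasible assignment and keep the cheapest one.

    This brute-force search replaces an ILP solver for the sake of a
    dependency-free example; it returns (assignment, cost_in_cents),
    or (None, inf) if no assignment satisfies the capacity limits.
    """
    names = list(DATASETS)
    candidates = [feasible_services(DATASETS[n]) for n in names]
    best, best_cost = None, float("inf")
    for choice in product(*candidates):
        # Total space each service would hold under this assignment.
        used = {}
        for name, svc in zip(names, choice):
            used[svc] = used.get(svc, 0) + DATASETS[name]["size_gb"]
        if any(gb > SERVICES[svc]["capacity_gb"] for svc, gb in used.items()):
            continue  # Violates a capacity constraint.
        cost = sum(
            DATASETS[name]["size_gb"] * SERVICES[svc]["cents_per_gb"]
            for name, svc in zip(names, choice)
        )
        if cost < best_cost:
            best, best_cost = dict(zip(names, choice)), cost
    return best, best_cost


if __name__ == "__main__":
    assignment, cost = cheapest_assignment()
    print(assignment, cost)
```

With these made-up inputs, the only service durable enough for `raw_input` is `blob_standard`, while the other two datasets can go to cheaper services. A real solver would express the same choice with one binary variable per dataset and service pair.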