Computer Science Colloquia
Tuesday, February 12, 2013
Advisor: Andrew Grimshaw
Attending Faculty: John Knight (Chair), Jack Stankovic, Westley Weimer, and Donald Brown (Minor Representative)
1:00 PM, Rice Hall, Rm. 242
PhD Proposal Presentation
Transaction Logging for Automated and Proactive Fault Management
A Grid System is a collection of heterogeneous computer resources, managed by different resource owners, connected across a wide area, providing services under a single abstract interface. When Grid components fail, the current approach to managing failures is for a human system administrator to read through the logs of the failed components, find relevant entries in the midst of otherwise unrelated text, cross-reference these with related entries (which may reside in log files on other components) to determine the root cause, and then finally apply some fix based on the result. This approach is time-consuming and can lead to substantial loss of service for Grid clients.
Recent work has shown that failures can be automatically detected, categorized, and, in some cases, predicted based on information available in system log files. However, a common theme in the literature is a centralized decision engine based on a single input source. Such an approach is unacceptable for large-scale distributed systems, as any centralized component will eventually become a bottleneck for the system.
The goal of the proposed dissertation is to apply a technique known as transaction logging to allow system logs from multiple distributed sites and heterogeneous platforms to be used as input for automated fault management mechanisms. By allowing each resource to maintain its own local logs and request information from other resources only as needed, we will enable automated fault management while avoiding the scalability problems present in centralized solutions.
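As a rough illustration of the idea of local logs queried only as needed, the following minimal Python sketch tags each log entry with a transaction identifier and follows a single transaction from resource to resource. It is not taken from the proposal: the names `LogEntry`, `Resource`, `trace_failure`, the `callee` field, and the sample resource names are all hypothetical, and a real Grid deployment would replace the in-memory dictionary lookup with a remote request.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class LogEntry:
    txn_id: str                   # hypothetical transaction identifier shared across sites
    level: str                    # e.g. "INFO" or "ERROR"
    message: str
    callee: Optional[str] = None  # resource invoked on behalf of this transaction, if any


@dataclass
class Resource:
    """A resource that keeps its own local log (no central collector)."""
    name: str
    log: List[LogEntry] = field(default_factory=list)

    def record(self, txn_id: str, level: str, message: str,
               callee: Optional[str] = None) -> None:
        self.log.append(LogEntry(txn_id, level, message, callee))

    def query(self, txn_id: str) -> List[LogEntry]:
        # Return only the local entries that belong to one transaction.
        return [e for e in self.log if e.txn_id == txn_id]


def trace_failure(txn_id: str, start: str,
                  resources: Dict[str, Resource]) -> List[Tuple[str, LogEntry]]:
    """Follow one transaction across resources, contacting a peer only
    when a local entry shows that the transaction reached it."""
    trail: List[Tuple[str, LogEntry]] = []
    visited = set()
    pending = [start]
    while pending:
        name = pending.pop()
        if name in visited:
            continue
        visited.add(name)
        for entry in resources[name].query(txn_id):
            trail.append((name, entry))
            if entry.callee is not None:
                pending.append(entry.callee)
    return trail


# Hypothetical example: a request passes through three resources and fails at the last.
resources = {n: Resource(n) for n in ("gateway", "scheduler", "storage", "unrelated")}
resources["gateway"].record("txn-42", "INFO", "job submitted", callee="scheduler")
resources["gateway"].record("txn-7", "INFO", "status poll")
resources["scheduler"].record("txn-42", "INFO", "staging input", callee="storage")
resources["storage"].record("txn-42", "ERROR", "disk quota exceeded")
resources["unrelated"].record("txn-99", "INFO", "heartbeat")

trail = trace_failure("txn-42", "gateway", resources)
for name, entry in trail:
    print(name, entry.level, entry.message)
```

In this sketch the resource named `unrelated` is never contacted, because no entry for the transaction points to it; that is the property that lets the lookup avoid a central log store.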