Resource-intensive applications, such as scientific applications, require the architecture or system on which they execute to display a very high level of dependability to reduce the impact of faults. Typically, the state of the underlying system is captured through messages that are recorded in a log file, which has proven useful to system administrators in understanding the root causes of system failures (and for their subsequent debugging). However, the time window between when the first error message is detected in the log file and the time of the ensuing failure may not be large enough to allow the administrators to save the state of the running application, which will result in lost execution time. We thus address this fundamental ques...
For today's servers that run mission-critical workloads, downtime is not an option and any outage of ...
As engineering and computer systems become larger and more complex, additional challenges around the...
Distributed software systems have become the backbone of Internet services. Failures in production ...
Virtual execution environments and middleware are required to be extremely reliable because applicat...
Software faults are recognized to be among the main causes of system failures in many applicat...
On-line timing error detection entails gathering and analyzing monitoring data to pinpoint deviation...
Failure diagnosis for large compute clusters using only message logs is known to be incomplete. Rece...
Part 4: Applications of Parallel and Distributed Computing. In modern computer s...
Today's system monitoring tools are capable of detecting system failures such as host failures, OS ...
Today's system monitoring tools are capable of detecting system failures such as host ...
As software is growing in size and complexity, accompanied by vendors' increased time-to-market pre...
Checking the execution behaviour of continuously running software systems is a critical task, to valid...
System logs are an important tool in studying the conditions (e.g., environment misconfig...
The era of petascale computing brought machines with hundreds of thousands of processors. The next g...
Log data, produced by every computer system and program, are widely used as a source of valuable inf...