Cascading failures can severely affect the correct functioning of large enterprise applications consisting of hundreds of interacting microservices. As a consequence, the ability to effectively analyse the causes of cascading failures that have occurred is crucial for managing complex applications. In this paper, we present a model-based methodology to automate the analysis of application logs in order to identify the failures that may have occurred and the causal relations among them. Our methodology employs topology graphs to represent the structure of microservice-based applications and finite state machines to model their expected replica- and failure-aware behaviour. We also present a proof-of-concept implementation of our methodology, which we exploi...
System restoration from cascading failures is an integral part of the overall defense against catast...
System logs are an important tool in studying the conditions (e.g., environment misconfig...
This article proposes a new approach for the analysis of functional failure id...
Cascading failures can severely affect the correct functioning of large enterprise applications cons...
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become ...
Complex and unforeseen failures in distributed systems must be diagnosed and replicated so developer...
An increasing number of Internet applications are applying microservice architecture due to its flex...
Computer applications, such as servers, databases and middleware, ubiquitously emit execution traces...
Virtual execution environments and middleware are required to be extremely reliable because applicat...
In this work, we present Graph Based Liability Analysis Framework (GRALAF) for root cause analysis (...
We apply the machinery of interventional causal learning with programmable interventions to the doma...
Monolithic applications are gradually getting replaced by systems built after the emerging microserv...
Failure analysis is valuable to dependability engineers because it supports designing effective miti...
Microservices are popular for web applications as they offer better scalability and reliability than...
A large percentage of computing capacity in today's large high-performance computing systems is waste...