Abstract

Embedded parallel and distributed computing systems for real-time applications are becoming commonplace. Many of these applications are life-critical and require extensive fault-tolerance capabilities to ensure very high reliability. At the same time, cost, power, weight, and volume constraints require that any redundancy introduced be used efficiently. Failure-recovery strategies must therefore allow the system to manage its resources as efficiently as possible in the presence of one or more failures, while continuing to execute the current tasks so that no deadlines are missed. We have developed such a resource-management algorithm, which selects the optimal failure recovery procedure to ...