Navigating the Performance-Resilience Trilemma: HYDRA, a Budget-Aware Antifragile Approach for the Mission-Critical Application Placement in Computing Continuum

Student name

Wesley Oliveira Souza

Thesis title

Navigating the Performance-Resilience Trilemma: HYDRA, a Budget-Aware Antifragile Approach for the Mission-Critical Application Placement in Computing Continuum

Abstract

Application placement in the Computing Continuum must satisfy the increasingly complex demands of modern services. While this paradigm integrates edge and cloud resources, it also introduces a critical trilemma: the simultaneous optimization of low latency, energy efficiency, and high resilience. These objectives often conflict; for instance, distributing replicas to enhance resilience typically increases network latency and energy consumption, whereas consolidating them for energy efficiency compromises fault tolerance. Conventional approaches frequently prioritize performance and efficiency and treat resilience as a static, secondary constraint, which is insufficient for mission-critical and latency-sensitive applications. To address this challenge, this research proposes Hydra, a novel placement approach designed to resolve this trilemma. Hydra uses a hybrid architecture that combines Deep Reinforcement Learning (DRL) with heuristics to dynamically manage application replicas. It also introduces an adaptive resilience mechanism that responds to failures, enhancing service robustness while still optimizing the conflicting objectives. A cornerstone of Hydra is its ability to treat an application's Service Level Agreement (SLA) error budget as a dynamic and governable resource. This shift lets the DRL agent make strategic trade-offs: it can intentionally consume a controlled fraction of the error budget to proactively provision additional service replicas, increasing fault tolerance and scalability to preemptively mitigate cascading failures. To validate the feasibility of the approach, we conducted preliminary simulation-based experiments with a partial implementation of Hydra. The results indicate that our method significantly outperforms a state-of-the-art baseline, particularly in maintaining high reliability and performance under heavy load. These initial findings provide evidence that, by strategically managing the SLA error budget, Hydra can achieve superior resilience without compromising predictable, SLA-compliant performance. This research will build on this foundation to develop a complete and robust solution for next-generation distributed systems.
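
To make the idea of an error budget as a governable resource concrete, here is a minimal sketch in Python. It is not Hydra's implementation: the abstract does not specify the policy, and in Hydra the decision is made by a DRL agent, not a fixed rule. The class `ErrorBudget`, the function `extra_replicas`, and every parameter and default value (`risk_threshold`, `spend_cap`, `cost_per_replica`, `max_extra`) are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Request-based SLA error budget for one window (illustrative model)."""
    slo_target: float     # e.g. 0.999 -> 99.9% of requests must meet the SLA
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that violated the SLA so far

    @property
    def allowed_failures(self) -> float:
        # Number of violations the SLA tolerates in this window.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # Share of the budget still unspent, clamped to [0, 1].
        if self.allowed_failures <= 0:
            return 0.0
        return max(0.0, 1.0 - self.failed_requests / self.allowed_failures)


def extra_replicas(budget: ErrorBudget, failure_risk: float,
                   risk_threshold: float = 0.5, spend_cap: float = 0.2,
                   cost_per_replica: float = 0.05, max_extra: int = 3) -> int:
    """Hypothetical rule: number of proactive replicas to provision.

    Each extra replica is assumed to cost `cost_per_replica` of the budget
    (for example, through added latency), and at most `spend_cap` of the
    budget may be spent on one decision.
    """
    if failure_risk < risk_threshold:
        return 0  # risk too low to justify spending any budget
    spendable = min(spend_cap, budget.remaining_fraction)
    # Small epsilon guards against floating-point rounding just below an integer.
    affordable = int(spendable / cost_per_replica + 1e-9)
    return min(max_extra, affordable)


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=100_000, failed_requests=40)
    print(budget.remaining_fraction, extra_replicas(budget, failure_risk=0.8))
```

The rule spends budget only when the estimated failure risk is high, and it never spends more than the remaining budget allows, so an exhausted budget always yields zero extra replicas. A learned policy would replace the fixed threshold and per-replica cost with values inferred from the environment.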

 

Advisor

Maycon Leone Maciel Peixoto - Universidade Federal da Bahia (UFBA)

External member 1 (with affiliation)

Helder May Nunes da Silva Oliveira - Instituto de Matemática e Estatística da Universidade de São Paulo (IME-USP)

Link to Lattes CV

http://lattes.cnpq.br/1468872219964148

Internal member 1 (with affiliation)

Cássio Vinicius Serafim Prazeres - Universidade Federal da Bahia (UFBA)

Link to Lattes CV

http://lattes.cnpq.br/5075736089100544

Alternate for the external member (with affiliation)

Geraldo Pereira Rocha Filho - Universidade Estadual do Sudoeste da Bahia (UESB)

Link to Lattes CV

http://lattes.cnpq.br/7417585446064168

Alternate for the internal member (with affiliation)

Gustavo Bittencourt Figueiredo - Universidade Federal da Bahia (UFBA)

Link to Lattes CV

http://lattes.cnpq.br/2204147669620762

Exam date

27 Nov, 2025

Exam time

1:00 PM

Defense date: 27 Nov, 2025 - 13:00
Defense type: Doctoral Qualifying Exam