Machine Learning-Based Anomaly Detection for Hybrid Cloud Infrastructure and Mission-Critical Database Workloads
Abstract
Hybrid cloud has become the default operating model for enterprises that must combine public cloud elasticity, private cloud control, and locally governed data platforms. In this setting, mission-critical database workloads are often the most fragile layer because small deviations in wait events, query plans, replication delay, storage latency, or authentication behaviour can propagate into customer-facing outages and security exposure. This review synthesises research and industry evidence published between 2020 and 2025 on machine learning-based anomaly detection for hybrid cloud infrastructure and critical database operations. The aim is to define a practical evidence-informed framework that links telemetry collection, feature engineering, model selection, explainability, and guarded response. The review shows that no single model family is sufficient. Statistical baselines and rules remain valuable for stable service-level indicators, while tree-based, probabilistic, deep sequence, transformer, graph, and knowledge-based methods address non-linear workloads, cross-service dependencies, and root-cause reasoning. For database workloads, anomaly detection must integrate infrastructure metrics with database-native signals, including active sessions, wait classes, lock contention, query execution plans, buffer usage, redo or write-ahead logging behaviour, and replication health. The paper proposes an architecture in which observability, topology context, model governance, and operational runbooks are treated as one system rather than separate tools. It therefore argues that successful deployment should be measured not only by precision and recall, but also by mean time to detect, mean time to recover, false-alert cost, safety of automated actions, and compliance readiness. The review concludes that hybrid cloud anomaly detection is most reliable when machine learning is constrained by domain knowledge, continuous validation, human approval for high-risk interventions, and clear accountability across platform, security, and database engineering teams.