Attention-Based Graph Transformers for Fault-Tolerant Task Migration in Heterogeneous Data Centers

Bo Xu; Chenhao Li; Lars Pettersson

doi:10.54097/n85zvm48

Keywords:

Graph Transformers, Fault Tolerance, Task Migration, Heterogeneous Computing, Data Centers, Attention Mechanisms, Resource Allocation

Abstract

Heterogeneous data centers struggle to maintain service continuity during hardware failures and resource contention. Traditional task migration strategies often cannot cope with the complexity of modern distributed systems, which combine diverse processor architectures, varying network topologies, and dynamic workload patterns. This paper proposes a novel attention-based graph transformer architecture designed for fault-tolerant task migration in heterogeneous data center environments. The framework applies graph neural network principles to model the interdependencies among computational nodes, network links, and task requirements, and uses attention mechanisms to dynamically prioritize critical migration paths and resource allocations. Our approach constructs a heterogeneous graph representation in which nodes represent computing resources with different capabilities and edges encode communication costs and reliability metrics. The attention mechanism learns to focus on the most relevant subgraphs and identifies optimal migration strategies that minimize service disruption while maintaining quality-of-service guarantees. An analysis of attention weight distributions across node categories shows that the model learns to co-locate related tasks and to prioritize reliable migration destinations. Experimental results show that our method outperforms traditional heuristic approaches, reducing migration time by an average of 34% and improving fault recovery success rates by 41% across diverse failure scenarios. The graph transformer architecture also generalizes well, handling previously unseen fault patterns and adapting to changes in resource availability in real-time operational environments.
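
As a rough illustration of the idea summarized above, the following minimal sketch scores candidate migration destinations with a single attention step over a toy heterogeneous graph. It is not the architecture evaluated in the paper: the `Node` and `Edge` types, the two-dimensional capability features, the node names, and the edge bias (log-reliability minus communication cost) are all assumptions made for illustration, and no learned projections are included.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str        # hypothetical resource category, e.g. "cpu" or "gpu"
    features: tuple  # toy capability embedding (assumed, not learned)


@dataclass(frozen=True)
class Edge:
    cost: float         # communication cost, arbitrary units
    reliability: float  # link reliability in (0, 1]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def migration_attention(source, candidates, edges):
    """Return attention weights over candidate destinations for `source`.

    Score = scaled dot product of node features plus an edge bias that
    rewards reliable links and penalizes costly ones (an assumed form).
    """
    d = len(source.features)
    scores = []
    for c in candidates:
        e = edges[(source.name, c.name)]
        dot = sum(q * k for q, k in zip(source.features, c.features))
        bias = math.log(e.reliability) - e.cost
        scores.append(dot / math.sqrt(d) + bias)
    return dict(zip((c.name for c in candidates), softmax(scores)))


# Toy scenario: tasks on a failing GPU node need a new home.
failed = Node("gpu-0", "gpu", (1.0, 0.5))
candidates = [
    Node("gpu-1", "gpu", (1.0, 0.5)),
    Node("cpu-7", "cpu", (0.2, 0.9)),
]
edges = {
    ("gpu-0", "gpu-1"): Edge(cost=0.3, reliability=0.99),
    ("gpu-0", "cpu-7"): Edge(cost=0.1, reliability=0.80),
}

weights = migration_attention(failed, candidates, edges)
best = max(weights, key=weights.get)
print(weights, best)
```

In this toy setting the compatible, more reliable node receives the larger weight even though its link is more expensive. A trained model would replace the fixed features and hand-written bias with learned query, key, and edge encodings.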
