Achieving Resource Isolation in Multi-Tenant Cloud Platforms Without Sacrificing Performance
DOI: https://doi.org/10.54097/b5qr1a18

Keywords: Multi-tenant cloud, Resource isolation, Performance, Virtualization, Containers, Quality of service, Scheduling, Cloud computing

Abstract
Multi-tenant cloud (MTC) platforms have become the cornerstone of modern distributed computing infrastructure, enabling diverse workloads from independent clients to coexist on shared physical hardware while each tenant perceives a logically dedicated environment. The central engineering challenge in these environments is achieving robust resource isolation (RI) without incurring performance penalties that undermine the economic and operational advantages of consolidation. This paper provides a comprehensive review of mechanisms, scheduling policies, and architectural innovations that address the isolation-performance trade-off across compute, memory, network, and storage domains. We survey hardware-assisted virtualization techniques, container-based sandboxing, software-defined networking (SDN) overlays, and disaggregated storage architectures, analyzing how each approach positions itself on the spectrum between isolation fidelity and execution efficiency. We further examine the growing role of machine learning (ML) in dynamic resource management, the unique demands imposed by emerging workload classes such as serverless functions and artificial intelligence (AI) training pipelines, and the microarchitectural threat landscape shaped by speculative execution vulnerabilities.
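The isolation-performance trade-off at the heart of the abstract can be made concrete with a toy scheduler: each tenant holds a guaranteed share, but capacity a tenant leaves idle is redistributed to others so consolidation stays efficient (a "work-conserving" policy). The sketch below is illustrative only; the `allocate` function, its max-min redistribution loop, and the assumption that guarantees sum to at most the total capacity are ours, not a mechanism from the paper.

```python
def allocate(capacity, guarantees, demands):
    """Work-conserving allocation with per-tenant guarantees.

    Each tenant first receives min(demand, guarantee), so isolation
    holds: no tenant's guaranteed share can be taken by a noisy
    neighbor. Capacity left idle by under-demanding tenants is then
    redistributed max-min fairly among still-unsatisfied tenants,
    so reservations do not strand hardware.
    Assumes sum(guarantees) <= capacity.
    """
    # Phase 1: honor guarantees (isolation).
    alloc = {t: min(demands[t], guarantees[t]) for t in demands}
    spare = capacity - sum(alloc.values())

    # Phase 2: redistribute idle capacity (work conservation).
    pending = {t for t in demands if alloc[t] < demands[t]}
    while spare > 1e-9 and pending:
        share = spare / len(pending)
        for t in list(pending):
            extra = min(share, demands[t] - alloc[t])
            alloc[t] += extra
            spare -= extra
            if demands[t] - alloc[t] < 1e-9:
                pending.discard(t)
    return alloc
```

With capacity 100 and equal guarantees of 50, a tenant demanding only 20 keeps its full isolation guarantee while the other tenant's allocation grows from 50 to 80 using the idle reservation; under full contention, allocations fall back to exactly the guarantees.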
Copyright (c) 2026 Journal of Computing and Electronic Information Management

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.