Latency-Bounded Embedding Table Partitioning Across Heterogeneous Accelerators for Large-Scale Recommendation Serving

Authors

  • Jingyi Fang, School of Informatics, University of California, Berkeley, California, USA

DOI:

https://doi.org/10.54097/jd5aet13

Keywords:

Embedding tables, Heterogeneous accelerators, Recommendation serving, Latency optimization, Model parallelism, Table partitioning, SLO constraints

Abstract

Large-scale recommendation systems (RS) are among the most computationally demanding workloads in modern data centers because they combine memory-intensive embedding table lookups with compute-intensive dense neural network operations. As models grow to hundreds of embedding tables with billions of parameters, serving them under strict service-level objective (SLO) constraints becomes increasingly difficult. This paper proposes LatEmbed, a latency-bounded embedding table partitioning framework that distributes embedding tables across heterogeneous accelerator pools comprising graphics processing units (GPUs), central processing units (CPUs), and field-programmable gate arrays (FPGAs). The approach builds a cost model from offline profiling that captures per-table access latency, memory bandwidth consumption, and interconnect transfer overhead, and then formulates partitioning as a constrained optimization problem that minimizes tail latency subject to memory capacity and bandwidth limits. A dynamic rebalancing mechanism adapts placements to shifts in the runtime access distribution without violating end-to-end SLO bounds. Experiments on industrial-scale workloads show that LatEmbed reduces P99 inference latency by up to 43% compared with GPU-only baselines and improves memory efficiency by 2.1× over CPU-only configurations, while keeping SLO compliance above 99.5% under peak traffic.
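
To make the constrained placement problem in the abstract concrete, the following is a minimal Python sketch, not LatEmbed's actual algorithm, which the abstract does not specify. It swaps in a simple greedy heuristic for the paper's optimizer. The names `Table`, `Device`, and `place`, the linear latency model (`ms_per_gbps`, `xfer_ms`), and all numbers are hypothetical. The sketch also assumes that devices serve lookups in parallel, so the slowest device bounds request latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    size_gb: float   # memory footprint of the table
    bw_gbps: float   # memory bandwidth the table consumes at peak load


@dataclass(frozen=True)
class Device:
    name: str
    mem_gb: float       # memory capacity
    bw_gbps: float      # memory bandwidth limit
    ms_per_gbps: float  # assumed profiled lookup cost per unit of demand
    xfer_ms: float      # assumed interconnect overhead per hosted table


def place(tables, devices):
    """Greedy placement; returns ({table: device}, bottleneck latency in ms)."""
    mem = {d.name: 0.0 for d in devices}
    bw = {d.name: 0.0 for d in devices}
    lat = {d.name: 0.0 for d in devices}
    placement = {}
    # Place the most bandwidth-hungry tables first, while devices are empty.
    for t in sorted(tables, key=lambda t: t.bw_gbps, reverse=True):
        best = None
        for d in devices:
            # Hard constraints: memory capacity and bandwidth limit.
            if mem[d.name] + t.size_gb > d.mem_gb:
                continue
            if bw[d.name] + t.bw_gbps > d.bw_gbps:
                continue
            new_lat = lat[d.name] + t.bw_gbps * d.ms_per_gbps + d.xfer_ms
            # Devices run in parallel, so the slowest one bounds the request.
            others = [v for k, v in lat.items() if k != d.name]
            key = (max([new_lat] + others), new_lat)
            if best is None or key < best[0]:
                best = (key, d)
        if best is None:
            raise ValueError(f"no device can host table {t.name!r}")
        (_, new_lat), d = best
        mem[d.name] += t.size_gb
        bw[d.name] += t.bw_gbps
        lat[d.name] = new_lat
        placement[t.name] = d.name
    return placement, max(lat.values(), default=0.0)


# Hypothetical profile: a fast but small GPU and a slow but large CPU.
tables = [Table("a", 10, 30), Table("b", 10, 20), Table("c", 4, 5)]
devices = [Device("gpu", 16, 100, 0.01, 0.05), Device("cpu", 64, 40, 0.05, 0.0)]
placement, bottleneck_ms = place(tables, devices)
print(placement, bottleneck_ms)
```

In this toy profile the GPU lacks the memory to hold both 10 GB tables, so the two bandwidth-heavy tables are split across the GPU and the CPU, and the small table joins the GPU because that leaves the bottleneck unchanged. A production system would replace the greedy loop with a proper solver and model tail latency rather than a sum of per-table costs.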



Published

29-06-2026

Section

Articles

How to Cite

Fang, J. (2026). Latency-Bounded Embedding Table Partitioning Across Heterogeneous Accelerators for Large-Scale Recommendation Serving. Journal of Computing and Electronic Information Management, 21(3), 8-14. https://doi.org/10.54097/jd5aet13