Exclusive

Research progress on technologies of high-performance network in artificial intelligence data center

  • Jie REN,
  • Chang LIU,
  • Bowen HAN,
  • Chenyang WEN,
  • Bohua XU,
  • Chang CAO
  • Research Institute of China United Network Communications Co., Ltd., Beijing 100176, China

Received date: 2024-08-21

Online published: 2025-06-13

Copyright

All rights reserved. Unauthorized reproduction is prohibited.

Cite this article

Jie REN, Chang LIU, Bowen HAN, Chenyang WEN, Bohua XU, Chang CAO. Research progress on technologies of high-performance network in artificial intelligence data center[J]. Science & Technology Review, 2025, 43(9): 62-75. DOI: 10.3981/j.issn.1000-7857.2024.08.01038

