Research progress on technologies of high-performance network in artificial intelligence data center
Received date: 2024-08-21
Online published: 2025-06-13
Jie REN, Chang LIU, Bowen HAN, Chenyang WEN, Bohua XU, Chang CAO. Research progress on technologies of high-performance network in artificial intelligence data center[J]. Science & Technology Review, 2025, 43(9): 62-75. DOI: 10.3981/j.issn.1000-7857.2024.08.01038