On developing wide-area long-distance high-performance transport techniques in computer networks
Liang Teng (梁腾), distinguished associate research fellow; research interest: novel network architectures; e-mail: liangt@pcl.ac.cn
Received date: 2024-08-21
Online published: 2025-06-13
Funding
National Science and Technology Major Project on New Generation Artificial Intelligence (2022ZD0115303)
Major Key Project of Peng Cheng Laboratory (PCL2023A06)
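Wide-area long-distance high-performance transport technology has strategic value as China builds a nationally integrated computing power network under the "East Data, West Computing" project. Three trends create new demands for a paradigm of wide-area distributed computing power collaboration: the rise of large artificial intelligence (AI) model applications with extremely high computing requirements; the embargo on high-end, high-performance graphics processing unit (GPU) chips, which limits the computing resources available in any single center; and the geographically dispersed distribution of the computing clusters being built across China. Wide-area long-distance high-performance transport is a key enabling technology for this paradigm. This paper discusses the technology from five aspects: its support for wide-area distributed computing power collaboration, technical routes, bearer networks, research challenges, and cost. Based on this discussion and on results from a 2100 km real-network experiment between Shenzhen and Zhongwei, Ningxia, the paper argues that optimizing existing remote direct memory access (RDMA) technology for long distances over a wide-area all-optical network is one of the best options in the short term: it is highly feasible, low in cost, and convenient for research. By optimizing RDMA over Converged Ethernet (RoCE), "wide-area optical-data direct access" can be achieved over a wide-area all-optical network, with performance approaching the physical-layer communication limits.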
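As a rough illustration of why distance matters for RDMA-style transports, the sketch below estimates the propagation delay and the bandwidth-delay product (the amount of data that must be in flight to keep a link full) for a 2100 km path such as the Shenzhen to Zhongwei link mentioned in the abstract. The fiber propagation speed (about 2×10^8 m/s) and the 100 Gbit/s link rate are assumptions chosen for illustration, not values reported in the article; a real fiber route is also longer than the nominal distance.

```python
# Back-of-the-envelope estimate of delay and bandwidth-delay product (BDP).
# Both constants below are illustrative assumptions, not figures from the article.
FIBER_SPEED_M_PER_S = 2.0e8    # light in silica fiber, roughly 2/3 of c (assumed)
LINK_RATE_BIT_PER_S = 100e9    # hypothetical 100 Gbit/s link rate (assumed)


def one_way_delay_s(distance_km, speed_m_per_s=FIBER_SPEED_M_PER_S):
    """Propagation delay in seconds over a fiber of the given length."""
    return distance_km * 1000.0 / speed_m_per_s


def bdp_bytes(distance_km, rate_bit_per_s=LINK_RATE_BIT_PER_S):
    """Bytes that must be in flight to fill the link for one round trip."""
    rtt_s = 2 * one_way_delay_s(distance_km)
    return rate_bit_per_s * rtt_s / 8


if __name__ == "__main__":
    distance_km = 2100
    delay_ms = one_way_delay_s(distance_km) * 1e3
    bdp_mb = bdp_bytes(distance_km) / 1e6
    print(f"one-way delay: {delay_ms:.1f} ms, RTT: {2 * delay_ms:.1f} ms")
    print(f"BDP at the assumed link rate: {bdp_mb:.1f} MB")
```

Under these assumptions the round trip takes about 21 ms, so a sender has to keep a few hundred megabytes outstanding to saturate the link, which is far more than the in-flight data typical inside a single data center.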
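Keywords: wide-area long-distance high-performance transport; wide-area remote direct memory access (WRDMA); computing power network; East Data, West Computing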
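Liang Teng, Yang Jian, Yang Jiayu, Zhang Yu, Zhang Weizhe. On developing wide-area long-distance high-performance transport techniques in computer networks[J]. Science & Technology Review (科技导报), 2025, 43(9): 31-37. DOI: 10.3981/j.issn.1000-7857.2024.08.01032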