Special Topic

On developing wide-area long-distance high-performance transport techniques in computer networks

  • Teng LIANG (梁腾) 1
  • Jian YANG (杨健) 1, 2
  • Jiayu YANG (杨佳宇) 1
  • Yu ZHANG (张宇) 1, 3
  • Weizhe ZHANG (张伟哲) 1, 2, 3, *
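  • 1. Pengcheng Laboratory, Shenzhen 518000, China
  • 2. Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
  • 3. Harbin Institute of Technology, Harbin 150006, China
Weizhe ZHANG (corresponding author), professor; research interests: parallel computing, distributed computing, cloud computing, and computer networks; e-mail:

Teng LIANG, specially appointed associate researcher; research interests: novel network architectures; e-mail: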

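Received date: 2024-08-21

Online published: 2025-06-13

Funding

National Science and Technology Major Project on New Generation Artificial Intelligence (2022ZD0115303)

Major Key Project of Pengcheng National Laboratory (PCL2023A06)

Copyright

All rights reserved. Unauthorized reproduction is prohibited.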

On developing wide-area long-distance high-performance transport techniques in computer networks

  • Teng LIANG 1
  • Jian YANG 1, 2
  • Jiayu YANG 1
  • Yu ZHANG 1, 3
  • Weizhe ZHANG 1, 2, 3, *
  • 1. Pengcheng Laboratory, Shenzhen 518000, China
  • 2. Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
  • 3. Harbin Institute of Technology, Harbin 150006, China

Received date: 2024-08-21

  Online published: 2025-06-13

Copyright

All rights reserved. Unauthorized reproduction is prohibited.

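Abstract

Wide-area long-distance high-performance transport is of strategic value as China builds a nationally integrated computing power network under the "East Data, West Computing" project. Three trends place new demands on a paradigm of wide-area distributed computing power collaboration: the rise of large artificial intelligence (AI) models and applications with extremely high compute requirements; export embargoes on high-end graphics processing unit (GPU) chips, which limit the computing resources available in any single center; and the geographically dispersed distribution of the computing clusters being built across China. Wide-area long-distance high-performance transport is a key enabling technology for this paradigm. The paper discusses it from five perspectives: support for the new paradigm, technical routes, the bearer network, research challenges, and cost. Drawing on results from a real-network experiment over 2100 km between Shenzhen and Zhongwei, Ningxia, it argues that optimizing existing remote direct memory access (RDMA) technology for long distances over a wide-area all-optical network is one of the best options in the short term, being highly feasible, low in cost, and convenient for research. By optimizing RDMA over Converged Ethernet (RoCE), "wide-area direct optical data delivery" (广域光数直达) can be realized on a wide-area all-optical network, with performance approaching the communication limits of the physical layer.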
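To give a sense of why a 2100 km path is demanding for RDMA-style transport, the following minimal sketch estimates the propagation delay and the bandwidth-delay product (the amount of data that must be in flight to keep the link full). Only the 2100 km distance comes from the abstract. The fiber refractive index, the 100 Gbit/s link rate, the function names, and the simplification that the fiber path length equals 2100 km are all assumptions made for illustration, not values reported in the paper.

```python
# Illustrative sketch: propagation delay and bandwidth-delay product (BDP)
# for a long-haul fiber path. Everything except the 2100 km distance is an
# assumption chosen for illustration, not a figure from the paper.

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum, km/s
FIBER_REFRACTIVE_INDEX = 1.468     # assumed typical value for single-mode fiber


def one_way_delay_s(distance_km: float, n: float = FIBER_REFRACTIVE_INDEX) -> float:
    """Propagation delay over a fiber of the given length, in seconds."""
    return distance_km * n / SPEED_OF_LIGHT_KM_S


def round_trip_time_s(distance_km: float, n: float = FIBER_REFRACTIVE_INDEX) -> float:
    """Round-trip propagation time, ignoring switching and host latency."""
    return 2.0 * one_way_delay_s(distance_km, n)


def bdp_bytes(distance_km: float, link_gbps: float) -> float:
    """Bytes that must be in flight to keep a link of link_gbps Gbit/s busy."""
    bits_in_flight = round_trip_time_s(distance_km) * link_gbps * 1e9
    return bits_in_flight / 8.0


if __name__ == "__main__":
    distance_km = 2100.0  # distance quoted in the abstract
    link_gbps = 100.0     # assumed link rate
    print(f"one-way delay: {one_way_delay_s(distance_km) * 1e3:.2f} ms")
    print(f"round trip:    {round_trip_time_s(distance_km) * 1e3:.2f} ms")
    print(f"BDP:           {bdp_bytes(distance_km, link_gbps) / 1e6:.0f} MB")
```

Under these assumptions the one-way delay is roughly 10 ms, the round trip about 21 ms, and the in-flight data on the order of 250 MB, several orders of magnitude above what a transport tuned for microsecond-scale data center round trips typically has to keep outstanding.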

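Cite this article

Teng LIANG, Jian YANG, Jiayu YANG, Yu ZHANG, Weizhe ZHANG. On developing wide-area long-distance high-performance transport techniques in computer networks[J]. Science & Technology Review (科技导报), 2025, 43(9): 31-37. DOI: 10.3981/j.issn.1000-7857.2024.08.01032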

