REN Jie, LIU Chang, HAN Bowen, WEN Chenyang, XU Bohua, CAO Chang
Amid the rapid expansion of large-scale models and the AI sector, epitomized by advances such as ChatGPT, demand is growing for intelligent computing power to support large-scale distributed computing applications. Traditional data centers struggle to meet the performance requirements of these scenarios. The Chinese government has issued numerous policies to expedite the development of artificial intelligence data centers (AIDCs), providing clear policy guidance and accelerating their planning and construction. A high-performance network is pivotal within an AIDC, serving as the backbone for computational tasks and enabling inter-data-center connectivity and efficient data transmission. This paper aims to establish a robust technical framework for the continued development of high-performance networks by investigating the key technologies for such networks in AIDCs. Core requirements in transport protocols, networking, and operation, administration and maintenance (OAM) for large-scale AI tasks are studied. Based on these requirements, this paper further investigates the evolving demands on the different layers of the AI network and examines core technologies, e.g., network architecture, congestion control policy, load balancing policy, and OAM. Subsequently, from the two perspectives of network protocol development and all-optical networks, this paper analyzes the future development trends of AIDC networks. To establish a robust high-performance network framework within AIDCs, this paper concludes that sufficient network performance must be provided, including a near-lossless network environment, adequate interconnectivity, and solutions to storage performance bottlenecks in distributed storage scenarios.
Furthermore, the development of high-performance networks in AIDC requires the integration and synergy of key technologies, including standardized networking schemes, innovative load balancing and congestion control protocols, and advanced OAM mechanisms, to enhance operational efficiency. High-performance AIDC networks must also offer comprehensive, network-wide awareness, allocation, and scheduling of devices and resources, together with network-wide OAM, to provide high-performance lossless transmission capabilities.