Review

Principle and methodology of AlphaGo family algorithms

  • ZHANG Sheng,
  • LONG Qiang,
  • KONG Yinan,
  • WANG Yu
  • 1. Aerospace Technology Institute, China Aerodynamic Research and Development Center, Mianyang 621000, China
    2. School of Mathematics and Physics, Southwest University of Science and Technology, Mianyang 621000, China
    3. Computational Aerodynamics Institute, China Aerodynamic Research and Development Center, Mianyang 621000, China

ZHANG Sheng, associate research fellow; research interests: system optimization and artificial intelligence. E-mail: zszhangshengzs1@outlook.com

Received date: 2022-08-03

Revised date: 2022-11-03

Online published: 2023-04-27

Funding

National Natural Science Foundation of China (11902332)


Abstract

The AlphaGo family of algorithms marks an important milestone in the history of artificial intelligence. These algorithms not only solved complete-information game problems typified by Go, but are also applicable to a much wider range of problems. Following their line of development, this paper summarizes the fundamental principles and technical characteristics of the series of algorithms from AlphaGo Fan to MuZero and explains how the AlphaGo family selects its moves. The key technologies employed, namely Monte Carlo tree search and the modeling and training of deep neural networks, are surveyed and compared. The AlphaGo family is of significant instructive value for addressing many practical problems, from algorithm design and neural network modeling to model utilization. This paper helps readers quickly grasp the basic principles of these algorithms and provides a useful reference for further research on and extension of related algorithms.
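As an illustrative aside to the abstract (a sketch, not code from the paper or from any DeepMind release): in the AlphaGo Zero line of algorithms the move-selection step of the Monte Carlo tree search named above follows the publicly described PUCT rule, and a few lines of Python are enough to show its shape. The names Node, select_child and the exploration constant c_puct below are assumptions of this sketch.

    # Minimal sketch of the PUCT selection rule used by AlphaGo Zero-style
    # Monte Carlo tree search. All names (Node, select_child, c_puct) are
    # illustrative; this is not code from the paper or from DeepMind.
    import math
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        prior: float                                  # P(s, a) from the policy network
        visit_count: int = 0                          # N(s, a)
        value_sum: float = 0.0                        # sum of values backed up through this node
        children: dict = field(default_factory=dict)  # action -> Node

        def q_value(self) -> float:
            # Mean action value Q(s, a); defined as 0 for unvisited nodes.
            return self.value_sum / self.visit_count if self.visit_count else 0.0

    def select_child(node: Node, c_puct: float = 1.5):
        # Return the (action, child) pair maximising Q(s, a) + U(s, a), with
        # U(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)).
        total_visits = sum(child.visit_count for child in node.children.values())

        def puct(child: Node) -> float:
            u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
            return child.q_value() + u

        return max(node.children.items(), key=lambda item: puct(item[1]))

    # Toy usage: two candidate moves whose priors would come from the policy network.
    root = Node(prior=1.0)
    root.children = {"move_a": Node(prior=0.6), "move_b": Node(prior=0.4)}
    root.children["move_a"].visit_count = 10
    root.children["move_a"].value_sum = 4.0
    action, child = select_child(root)
    print(action)  # "move_b": the unvisited move wins on the exploration term U

In these algorithms the prior P(s, a) is produced by the policy head of the deep neural network and the backed-up values by its value head, which is one reason the paper surveys tree search and neural network modeling and training together.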

Cite this article

ZHANG Sheng, LONG Qiang, KONG Yinan, WANG Yu. Principle and methodology of AlphaGo family algorithms[J]. 科技导报 (Science & Technology Review), 2023, 41(7): 79-97. DOI: 10.3981/j.issn.1000-7857.2023.07.009

