A large scale social networking community detection prototype system based on Spark

  • YE Xiaorong,
  • SHAO Qing
  • 1. Institute of Scientific and Technical Information of China, Beijing 100038, China;
    2. KNET Co., Ltd., Beijing 100190, China

Received date: 2018-10-09

Revised date: 2018-11-20

Online published: 2018-12-18

Abstract

To mine user information in large-scale social networks effectively and to better understand the relationships between users, a community detection prototype system based on Spark is designed and developed. ActiveMQ is used to collect large volumes of user data. The naive Bayes algorithm provided by Spark's MLlib cleans the user data, and the PageRank algorithm provided by Spark's GraphX, together with the Z-Score algorithm provided by MLlib, is used to compute user rankings. Finally, the prototype applies an optimized label propagation algorithm (LPA) to quickly group users with similar features and close ties into the same community, laying a foundation for further analysis and use of community user data.
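The ranking and grouping steps named in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than Spark so that it is self-contained, and it is not the authors' implementation: the abstract does not describe how LPA was optimized, so the code below is the textbook synchronous variant, and the smallest-label tie-break, the damping factor of 0.85, the function names, and the toy edge list are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed edge list (src -> dst)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def z_scores(values):
    """Standardize a {user: score} mapping to zero mean and unit variance."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    return {k: (v - mu) / sigma if sigma else 0.0 for k, v in values.items()}


def label_propagation(edges, max_rounds=20):
    """Synchronous LPA on the undirected view of the graph.

    Every node starts with its own id as label and repeatedly adopts the
    most frequent label among its neighbours. Ties go to the smallest
    label (an assumption) so that the result is deterministic.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    labels = {n: n for n in neighbours}
    for _ in range(max_rounds):
        new_labels = {}
        for node, nbrs in neighbours.items():
            counts = Counter(labels[m] for m in nbrs)
            top = max(counts.values())
            new_labels[node] = min(l for l, c in counts.items() if c == top)
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical toy network: two triangles of users joined by one link (3 -> 4).
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]

ranks = pagerank(edges)
scores = z_scores(ranks)                 # standardized user ranking
communities = label_propagation(edges)   # {user: community label}

print(sorted(scores, key=scores.get, reverse=True))
print(communities)
```

On this toy graph the two triangles end up in separate communities. Synchronous updates are used because they map naturally onto a bulk, round-by-round computation of the kind a distributed graph engine performs. A production system would additionally need to handle the label oscillation that synchronous LPA can show on bipartite-like structures.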

Cite this article

YE Xiaorong, SHAO Qing. A large scale social networking community detection prototype system based on Spark[J]. Science & Technology Review, 2018, 36(23): 93-101. DOI: 10.3981/j.issn.1000-7857.2018.23.012
