Research Article

Log Mining and Personalization Improvement for Mobile Search System of Government Websites

  • YE Xiaorong ,
  • SHAO Qing
  • 1. Institute of Scientific and Technical Information of China, Beijing 100038, China;
    2. KNET Co., Ltd., Beijing 100190, China
YE Xiaorong, senior engineer; research interests: computer software and digital libraries; e-mail: yeelfine@sina.com

Received date: 2014-10-22

  Revised date: 2014-11-20

  Online published: 2015-01-09

Cite this article

YE Xiaorong, SHAO Qing. Log mining and personalization improvement for mobile search system of government websites[J]. Science & Technology Review (科技导报), 2014, 32(36): 110-116. DOI: 10.3981/j.issn.1000-7857.2014.36.018

Abstract

A log mining and personalization system is designed and developed to exploit the characteristics of mobile search and government websites together with the strengths of Hadoop in big-data processing. First, Flume and HDFS are used to aggregate and store massive logs, providing the data source and the calling interface for log mining. Second, MapReduce is used to analyze the logs efficiently, and the tags and navigation bars of search result pages are used to build a vector space model of the pages and a user interest model. Third, based on the user interest model and again combined with MapReduce, the K-means clustering algorithm groups users with similar interests into interest groups. Finally, the distance from a search result page to the user's interest group is calculated to judge whether the user is interested in that page; the system then adjusts the order of the search results and pushes new pages to the user accordingly. In this way, personalized search and push functions are implemented.
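The pipeline summarized in the abstract (page vectors built from tags, user interest profiles, K-means interest groups, and distance-based re-ranking) can be illustrated with a minimal single-process sketch. It is not the authors' Hadoop/MapReduce implementation. The tag vocabulary, the click logs, the result URLs, and all function names below are hypothetical, and sorting by Euclidean distance to the group centroid is only one assumed way of turning the "distance to the interest group" into a ranking.

```python
import math
from collections import Counter


def tag_vector(tags, vocabulary):
    """Map a page's tags onto a unit-length vector over a fixed tag vocabulary."""
    counts = Counter(tags)
    vec = [float(counts[t]) for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain K-means with caller-supplied initial centroids (deterministic)."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the previous centroid if a group ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def rerank(results, group_centroid, vocabulary):
    """Order search results by distance from the user's interest-group centroid."""
    return sorted(
        results,
        key=lambda r: distance(tag_vector(r["tags"], vocabulary), group_centroid),
    )


# Hypothetical tag vocabulary and click logs (user -> tags of opened result pages).
VOCAB = ["tax", "health", "transport"]
clicks = {
    "u1": [["tax"], ["tax", "health"]],
    "u2": [["tax"]],
    "u3": [["transport"], ["transport", "health"]],
}

# User interest model: mean of the vectors of the pages each user opened.
profiles = {u: mean([tag_vector(t, VOCAB) for t in pages]) for u, pages in clicks.items()}

# Cluster users into two interest groups, seeded with two of the profiles.
centroids = kmeans(list(profiles.values()), [profiles["u1"], profiles["u3"]])

# Re-rank hypothetical search results for user u2 using that user's group.
results = [
    {"url": "/transport/bus", "tags": ["transport"]},
    {"url": "/tax/forms", "tags": ["tax"]},
    {"url": "/health/tax-relief", "tags": ["health", "tax"]},
]
group = centroids[nearest(profiles["u2"], centroids)]
ranked = rerank(results, group, VOCAB)
print([r["url"] for r in ranked])
```

In this toy data, users u1 and u2 fall into the same tax-oriented group, so the tax-related pages move ahead of the transport page for u2. A production system along the lines of the abstract would run the assignment and centroid-update steps as MapReduce jobs over logs stored in HDFS rather than in memory.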
