Article

Entity mapping technology for the State Grid public data model integrating BERT and blocking filtering

  • LI Yufei,
  • HAO Baocong,
  • LOU Yiwei,
  • YANG Shiyu,
  • GAO Shijie,
  • ZHANG Pengyu
  • 1. Big Data Center of State Grid Corporation of China, Beijing 100053, China
    2. School of Computer Science, Peking University, Beijing 100871, China
    3. Beijing Zhongdian Puhua Information Technology Co., Ltd., Beijing 100085, China
LI Yufei is a senior engineer whose research interests include big data application technology. E-mail: 15101537383@126.com

Received date: 2023-02-06

  Revised date: 2023-03-09

  Online published: 2023-08-30

Funding

Science and Technology Project of the Big Data Center of State Grid Corporation of China (SGSJ0000SJJS2200040)


Abstract

To address the difficulty of automatically updating and iterating the current State Grid public data model SG-CIM (state grid-common information model) and the low efficiency of mining new elements, an automatic mapping technology for the SG-CIM based on knowledge graphs and the BERT (bidirectional encoder representations from transformers) model is proposed. On the basis of the existing SG-CIM, an SG-CIM knowledge graph and a data-table knowledge graph are constructed. An entity mapping technique based on the BERT model and blocking filtering is then studied to establish mapping relationships between the two knowledge graphs. Experimental analysis of the text mapping results shows that, after fine-tuning on a self-built dataset, the BERT model achieves a precision above 88%.

Cite this article

LI Yufei, HAO Baocong, LOU Yiwei, YANG Shiyu, GAO Shijie, ZHANG Pengyu. Entity mapping technology for the State Grid public data model integrating BERT and blocking filtering[J]. Science & Technology Review, 2023, 41(15): 113-123. DOI: 10.3981/j.issn.1000-7857.2023.15.012

Abstract

Aiming at the problems of the current SG-CIM (state grid-common information model), such as the difficulty of automatic update and iteration and the low efficiency of mining new elements, an automatic mapping technology for the SG-CIM based on the BERT model and blocking filtering is proposed. On the basis of the existing SG-CIM, an SG-CIM knowledge graph and a data-table knowledge graph are first constructed. Secondly, by studying an entity alignment method based on the BERT model and blocking filtering, the mapping relationship between the two knowledge graphs is established. Finally, the effectiveness of the proposed method is verified by experimental analysis of the text mapping effect. Results show that the accuracy of the BERT model after fine-tuning on a self-made dataset is more than 95%. This method lays a foundation for the subsequent mining of new elements and the automatic update and iteration of the SG-CIM.
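The two-stage pipeline the abstract describes — a cheap blocking filter that discards obviously non-matching entity pairs, followed by a BERT-based matcher that scores only the surviving candidates — can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the character n-gram blocking key, the thresholds, and the `bert_score` stub (which stands in for a fine-tuned BERT sentence-pair classifier) are all placeholders, and the field names in the example are invented.

```python
from itertools import product

def ngrams(text, n=2):
    """Character n-grams of an entity name, used as blocking keys."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def block(left, right, threshold=0.3):
    """Blocking filter: keep only pairs whose n-gram Jaccard overlap
    clears the threshold, so the expensive matcher never sees the
    vast majority of non-matching pairs."""
    candidates = []
    for a, b in product(left, right):
        ga, gb = ngrams(a), ngrams(b)
        union = ga | gb
        if union and len(ga & gb) / len(union) >= threshold:
            candidates.append((a, b))
    return candidates

def bert_score(a, b):
    """Placeholder for a fine-tuned BERT sentence-pair classifier;
    a trivial token-overlap score stands in here."""
    ta, tb = set(a.split("_")), set(b.split("_"))
    return len(ta & tb) / max(len(ta | tb), 1)

def align(left, right, block_threshold=0.3, match_threshold=0.5):
    """Two-stage entity mapping: blocking first, matcher scoring second."""
    return [(a, b) for a, b in block(left, right, block_threshold)
            if bert_score(a, b) >= match_threshold]

# Example: map fields of a source data table onto SG-CIM entity names.
print(align(["user_id", "meter_reading"], ["user_id", "voltage_level"]))
# → [('user_id', 'user_id')]
```

Replacing `bert_score` with a fine-tuned Chinese BERT that scores the concatenated descriptions of the two entities leaves the pipeline structure unchanged; the blocking stage is what keeps the number of BERT inferences tractable when the two knowledge graphs are large.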
