Research on ancient Chinese information processing seldom uses unearthed documents as its corpus. The Liye Qin bamboo slips number ten times all previously unearthed Qin slips combined and can fill many gaps in the historical record of the Qin Dynasty. In this paper, we used them as the experimental corpus and explored automatic sentence segmentation and word segmentation of unearthed documents based on the CRF model. Drawing on the actual characteristics of the corpus, we set up different feature templates to test how well the model's sequence labeling generalizes across tasks. We also set up a joint approach to sentence segmentation and word segmentation as a comparative experiment, in order to select the better-performing processing scheme. In addition, we designed comparative experiments involving deep learning methods and pretrained models. The results showed that the joint approach improved overall performance on each task, with the F1-scores of automatic sentence segmentation and word segmentation reaching 75.79% and 94.44%, respectively. Because it is also faster, this approach is better suited to the Liye Qin bamboo slips. The results can support the proofreading of the last three volumes of the Liye Qin bamboo slips as well as the in-depth processing and construction of the corpus.
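As a rough illustration of what a joint approach to sentence and word segmentation can look like at the labeling level, the sketch below turns pre-segmented text into one combined tag per character and back again. The tag set (`B`/`M`/`E`/`S` for word position plus a `-Y`/`-N` suffix for "ends a sentence") and the function names `joint_tags` and `decode` are assumptions made for this example; the abstract does not state the paper's actual tag scheme or feature templates, and the input is a made-up toy string, not text from the slips.

```python
from typing import List, Tuple

def joint_tags(sentences: List[List[str]]) -> List[Tuple[str, str]]:
    """Map pre-segmented sentences (lists of words) to per-character joint tags.

    Hypothetical tag scheme: word position (B/M/E/S) plus "-Y" if the
    character is the last one in its sentence, otherwise "-N".
    """
    pairs = []
    for words in sentences:
        for wi, word in enumerate(words):
            for ci, ch in enumerate(word):
                if len(word) == 1:
                    pos = "S"          # single-character word
                elif ci == 0:
                    pos = "B"          # word-initial character
                elif ci == len(word) - 1:
                    pos = "E"          # word-final character
                else:
                    pos = "M"          # word-internal character
                ends_sentence = wi == len(words) - 1 and ci == len(word) - 1
                pairs.append((ch, pos + ("-Y" if ends_sentence else "-N")))
    return pairs

def decode(pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Rebuild sentences and words from (character, joint tag) pairs."""
    sentences, words, buf = [], [], ""
    for ch, tag in pairs:
        pos, brk = tag.split("-")
        buf += ch
        if pos in ("E", "S"):          # a word closes here
            words.append(buf)
            buf = ""
        if brk == "Y":                 # a sentence closes here
            sentences.append(words)
            words = []
    if buf:                            # tolerate an unfinished trailing word
        words.append(buf)
    if words:                          # tolerate an unfinished trailing sentence
        sentences.append(words)
    return sentences

# Toy input invented for illustration only.
toy = [["天下", "大", "定"], ["民", "安"]]
tagged = joint_tags(toy)
print(tagged)
print(decode(tagged) == toy)
```

Under this kind of scheme a single sequence labeler, such as a CRF, predicts both boundaries in one pass, which is one way a joint setup can differ from running two separate models.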
FENG Huimin, GUO Shuaishuai, LIU Ming. Automatic sentence segmentation and word segmentation for Liye Qin Bamboo manuscripts based on CRF model[J]. Science & Technology Review, 2024, 42(23): 135-144.
DOI: 10.3981/j.issn.1000-7857.2023.05.00812