Piao Yong (朴勇)

Personal Information

Associate Professor

Master's Supervisor

Gender: Male

Alma mater: Dalian University of Technology

Degree: Ph.D.

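Affiliation: School of Software; International School of Information and Software

Office: School of Software, Dalian University of Technology, Dalian Economic Development Zone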

Contact: 15641190702

Email: piaoy@dlut.edu.cn

Publications

XML structure extraction from plain texts based on conditional random fields

Paper type: Journal article

Publication date: 2010-08-01

Journal: Journal of Computational Information Systems

Indexed in: EI, Scopus

Volume: 6

Issue: 8

Pages: 2683-2690

ISSN: 1553-9105

Abstract: Information extraction is an effective way to convert information in unstructured text into structured records. Although there has been considerable previous research, most of it is devoted to extracting atomic entities and flat records, so the resulting structure is usually flat and lacks richer structural information. This paper proposes a novel approach that uses Conditional Random Fields (CRFs) to extract higher-order structures from unstructured texts: the Tree Structure Extraction system based on CRFs (TSECRF). TSECRF uses the path information in XML documents and suitable CRF feature sets as training sources, automatically obtains tree structures from the texts, and generates target XML documents with certain structures. Experiments on real-life data sets showed that this method achieves higher precision and recall than Hidden Markov Models. TSECRF can be applied to structured storage, text information retrieval, and data integration on the Internet. Copyright © 2010 Binary Information Press.