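Personal Information
Professor
Doctoral Supervisor
Master's Supervisor
Gender: Male
Alma Mater: Dalian University of Technology
Degree: Ph.D.
Affiliation: School of Computer Science and Technology
Email: yangzh@dlut.edu.cn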
Incorporating dictionary features into conditional random fields for gene/protein named entity recognition
Paper Type: Conference paper
Publication Date: 2007-05-22
Indexed By: EI, CPCI-S
Volume: 4819
Pages: 162-173
Keywords: BioNER; dictionary feature; CRF
Abstract: Biomedical Named Entity Recognition (BioNER) is an important preliminary step for biomedical text mining. Previous researchers built dictionaries of gene/protein names from online databases and incorporated them into machine learning models as features, but the effects were very limited. This paper gives a quality assessment of four dictionaries derived from online resources and investigates the impact of two factors (i.e., dictionary coverage and noisy terms) that may lead to the poor performance of dictionary features. Experiments compare the performance of the external dictionaries with that of a dictionary derived from the GENETAG corpus, using Conditional Random Fields (CRFs) with dictionary features. We also examine the impact on long names and short names separately. The results show that low coverage of long names and noise in short names are the main problems of current online resources, and that a high-quality dictionary could substantially improve the accuracy of BioNER.
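To illustrate what a "dictionary feature" for a CRF can look like, here is a minimal sketch in Python. It is not the paper's implementation: the greedy longest-match strategy, the `B-DICT`/`I-DICT`/`O` tag scheme, the function names `dictionary_tags` and `token_features`, and the toy dictionary are all assumptions made for illustration. The resulting per-token feature dictionaries are in the general shape that many CRF toolkits accept as input.

```python
from typing import Dict, List, Set, Tuple


def dictionary_tags(
    tokens: List[str],
    dictionary: Set[Tuple[str, ...]],
    max_len: int = 5,
) -> List[str]:
    """Tag tokens by greedy longest-match lookup against a name dictionary.

    Returns one tag per token: B-DICT (start of a match), I-DICT (inside
    a match), or O (no match). Matching is case-insensitive.
    """
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            tags[i] = "B-DICT"
            for j in range(i + 1, i + matched):
                tags[j] = "I-DICT"
            i += matched
        else:
            i += 1
    return tags


def token_features(
    tokens: List[str],
    dictionary: Set[Tuple[str, ...]],
) -> List[Dict[str, str]]:
    """Build one feature dict per token, including the dictionary tag."""
    tags = dictionary_tags(tokens, dictionary)
    return [
        {"word.lower": tok.lower(), "dict_tag": tag}
        for tok, tag in zip(tokens, tags)
    ]


# Hypothetical toy dictionary: each entry is a tuple of lowercased tokens.
toy_dictionary = {("tumor", "necrosis", "factor"), ("p53",)}
sentence = ["Tumor", "necrosis", "factor", "activates", "p53", "."]
features = token_features(sentence, toy_dictionary)
```

In this sketch, a long name missing from the dictionary yields only `O` tags (the coverage problem), while a short, ambiguous entry can fire `B-DICT` on tokens that are not gene or protein mentions (the noise problem), which are the two factors the abstract examines.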