Using context and semantic resources for cross-domain word segmentation

Indexed by:Conference paper

Date of Publication:2011-11-27

Included Journals:EI, Scopus

Page Number:227-232

Abstract:Chinese word segmentation (CWS) plays a fundamental role in Chinese language processing, because almost all Chinese language processing tasks assume segmented input. After many years of active research, most reports from evaluation tasks give impressive results, but most of them are limited to test corpora from a specific area. Once a system is applied to a different domain, its accuracy plummets. Domain-adaptive word segmentation was therefore introduced into the Bakeoffs. In this paper, we propose a new joint decoding strategy that combines character-based and word-based conditional random field models and takes the part-of-speech of dictionary words as important features of a segmentation path. Moreover, in keeping with the characteristics of cross-domain segmentation, context information is used to guide CWS. In addition, because synonyms occur in similar contexts, semantic information can be used to recall some out-of-vocabulary words (OOVs). Several experiments on the simplified Chinese test data from SIGHAN Bakeoff 2010 show that the method is effective. Except for the literature domain, the F-scores are higher than the best performance in the corresponding open test. The OOV recall rates reach 70.7%, 84.3%, 79.0% and 86.2%, respectively. © 2011 IEEE.
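
The abstract reports F-scores and OOV recall without defining them. The sketch below shows how these word segmentation metrics are commonly computed by comparing character spans; it is not the paper's evaluation code or the official SIGHAN scorer. The function names (`to_spans`, `evaluate`), the toy sentence, and the toy vocabulary are assumptions made for illustration only.

```python
def to_spans(words):
    """Convert a word sequence into (start, end) character spans."""
    spans = []
    start = 0
    for word in words:
        end = start + len(word)
        spans.append((start, end))
        start = end
    return spans


def evaluate(gold, pred, vocab):
    """Return (precision, recall, f_score, oov_recall) for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a gold word. OOV recall is the share of gold words missing
    from `vocab` (a hypothetical training vocabulary) that were
    segmented correctly.
    """
    gold_spans = to_spans(gold)
    pred_spans = set(to_spans(pred))

    correct = sum(1 for span in gold_spans if span in pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0

    oov_spans = [s for w, s in zip(gold, gold_spans) if w not in vocab]
    oov_hits = sum(1 for span in oov_spans if span in pred_spans)
    oov_recall = oov_hits / len(oov_spans) if oov_spans else 0.0

    return precision, recall, f_score, oov_recall


# Toy example (invented data): the OOV word "蛋白质" is split wrongly.
gold = ["我们", "研究", "蛋白质", "相互作用"]
pred = ["我们", "研究", "蛋白", "质", "相互作用"]
vocab = {"我们", "研究", "相互作用"}

precision, recall, f_score, oov_recall = evaluate(gold, pred, vocab)
```

In this toy case three of the five predicted words match the gold standard, so precision is 0.6 and recall is 0.75, while the only OOV word is missed, giving an OOV recall of 0.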
