Pragmatic Chinese lexical analysis based on word-character hybrid model

Indexed by:Journal Article

Date of Publication:2010-04-01

Journal:Journal of Information and Computational Science

Included Journals:EI, Scopus

Volume:7

Issue:4

Page Number:827-832

ISSN No.:1548-7741

Abstract:In information processing and natural language processing, lexical analysis is an important basic step for Chinese, Japanese, and other Asian languages. This paper presents a Chinese lexical analysis method that integrates word-level and character-level information through a hybrid model combining a word-based CRF model with a latent semi-CRF model. A word lattice, which represents all candidate outputs, is built using the system lexicon. A linear-chain CRF selects the final token sequence from the word lattice using rich and flexible predefined features. The latent semi-CRF model, which is character-based, handles unknown words and is invoked when no matching word can be found in the lexicon while building the lattice. This pragmatic method based on hybrid CRF models offers a solution to long-standing problems in corpus-based or statistical, word-based or character-based Chinese lexical analysis. First, flexible feature designs for hierarchical tag sets become possible. Second, the influences of label bias and length bias are minimized. Third, word-level information for known words and character-level information for unknown words can be combined and fully used. © 2010 Binary Information Press.
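
The lattice-then-select pipeline described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the fixed per-word scores stand in for the linear-chain CRF scores, the single-character fallback with a constant penalty stands in for the latent semi-CRF unknown-word model, and the names `LEXICON`, `UNKNOWN_SCORE`, `build_lattice`, `best_path`, and `segment` are all hypothetical. It only shows how a lexicon-driven word lattice might be built and how a best-scoring token sequence could be chosen from it by dynamic programming.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: word -> score (higher is better).
# In the paper these scores would come from a trained CRF, not a table.
LEXICON: Dict[str, float] = {
    "中文": -1.0, "分词": -1.0,
    "中": -3.0, "文": -3.0, "分": -3.0, "词": -3.0,
}

# Assumed constant penalty for a character not covered by the lexicon.
UNKNOWN_SCORE = -5.0


def build_lattice(text: str, lexicon: Dict[str, float],
                  max_len: int = 4) -> List[List[Tuple[int, float]]]:
    """Return edges[i] = list of (end, score) for candidate words text[i:end]."""
    n = len(text)
    edges: List[List[Tuple[int, float]]] = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            if word in lexicon:
                edges[i].append((j, lexicon[word]))
        if not edges[i]:
            # No lexicon word starts here: fall back to a character-level edge.
            edges[i].append((i + 1, UNKNOWN_SCORE))
    return edges


def best_path(text: str, edges: List[List[Tuple[int, float]]]) -> List[str]:
    """Viterbi-style search for the highest-scoring path through the lattice."""
    n = len(text)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("-inf"):
            continue
        for end, score in edges[i]:
            if best[i] + score > best[end]:
                best[end] = best[i] + score
                back[end] = i
    # Follow back-pointers from the end of the text to recover the tokens.
    tokens: List[str] = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


def segment(text: str, lexicon: Dict[str, float]) -> List[str]:
    """Build the lattice, then select the best token sequence from it."""
    return best_path(text, build_lattice(text, lexicon))
```

For example, `segment("中文分词器", LEXICON)` keeps the two lexicon words and emits the uncovered character "器" through the fallback edge. A real system would replace both the score table and the fallback with trained models, and would likely trigger unknown-word handling under broader conditions than this sketch does.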
