Context Information and Fragments Based Cross-Domain Word Segmentation

Indexed by:Journal Article

Date of Publication:2012-03-01

Journal:CHINA COMMUNICATIONS

Included Journals:SCIE, CSCD, Scopus

Volume:9

Issue:3

Page Number:49-57

ISSN No.:1673-5447

Key Words:cross-domain CWS; Conditional Random Fields (CRFs); joint decoding; context variables; segmentation fragments

Abstract:A new joint decoding strategy is proposed that combines character-based and word-based conditional random field (CRF) models. In this segmentation framework, fragments are used to generate candidate out-of-vocabulary (OOV) words. After the initial segmentation, the segmentation fragments are divided into two classes: "combination" (combining several fragments into one unknown word) and "segregation" (separating them into several words), so that more OOV words can be recalled. In addition, to suit the characteristics of cross-domain segmentation, context information is used to guide Chinese Word Segmentation (CWS). Experiments on test data from SIGHAN Bakeoff 2007 and Bakeoff 2010 show that the method is effective: OOV recall improves, and overall segmentation performance is good.
