Dalian University of Technology Homepage Platform | Huang Degen | Creating Chinese-English Comparable Corpora | Natural Language Processing

Publications


Creating Chinese-English Comparable Corpora

Posted: 2019-03-09

Paper type: Journal article
Journal: IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS
Indexed in: Scopus, EI, SCIE, A&HCI
Volume: E96D
Issue: 8
Pages: 1853-1861
ISSN: 0916-8532
Keywords: Comparable Corpora; cross language information retrieval; keyword extraction; document alignment
Abstract: Comparable corpora are valuable resources for many NLP applications, and extensive research has been done in recent years on information mining based on them. Because few large-scale comparable corpora are publicly available at present, this paper presents a bi-directional CLIR-based method for creating comparable corpora from two independent news collections in different languages. The original Chinese and English document collections are each crawled from XinHuaNet and formatted in a consistent manner. For each document in the two collections, the best query keywords are extracted to represent the essential content of the document, and the keywords are then translated into the language of the other collection. The translated queries are run against the collection in the same language as the queries to retrieve candidate documents in the other language, and the candidates are aligned based on their publication dates and similarity scores. Results show that our approach significantly outperforms previous approaches to the construction of Chinese-English comparable corpora.
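
The pipeline described in the abstract (keyword extraction, query translation, retrieval against the other collection, then alignment by publication date and similarity) might be sketched roughly as follows. This is a toy illustration only, not the paper's implementation: the TF-IDF keyword scoring, the word-by-word lexicon lookup, the overlap-based similarity, the `max_days` and `min_score` thresholds, the rule of keeping only pairs found in both directions, and all documents and names below are assumptions made for the example.

```python
import math
from collections import Counter
from datetime import date

# Toy, pre-tokenized documents: (doc_id, publication date, tokens). Hypothetical data.
ZH_DOCS = [
    ("zh1", date(2012, 5, 1), "经济 增长 政策 经济".split()),
    ("zh2", date(2012, 5, 3), "足球 比赛 冠军".split()),
]
EN_DOCS = [
    ("en1", date(2012, 5, 2), "economy growth policy report".split()),
    ("en2", date(2012, 5, 3), "football match champion final".split()),
    ("en3", date(2012, 8, 9), "economy growth forecast".split()),
]
# Hypothetical bilingual lexicon standing in for a real translation resource.
ZH_EN = {"经济": "economy", "增长": "growth", "政策": "policy",
         "足球": "football", "比赛": "match", "冠军": "champion"}
EN_ZH = {en: zh for zh, en in ZH_EN.items()}


def top_keywords(tokens, collection, k=3):
    """Pick the k highest TF-IDF tokens as the document's query (assumed scoring)."""
    n = len(collection)
    tf = Counter(tokens)

    def idf(term):
        df = sum(1 for _, _, toks in collection if term in toks)
        return math.log((n + 1) / (df + 1)) + 1.0

    # Ties are broken alphabetically so the result is deterministic.
    return sorted(tf, key=lambda t: (-tf[t] * idf(t), t))[:k]


def translate(keywords, lexicon):
    """Word-by-word lookup; keywords missing from the lexicon are dropped."""
    return [lexicon[w] for w in keywords if w in lexicon]


def overlap_score(query, tokens):
    """Fraction of query terms found in the target document (assumed similarity)."""
    return sum(1 for q in query if q in tokens) / len(query) if query else 0.0


def align(src_docs, tgt_docs, lexicon, max_days=2, min_score=0.5):
    """One retrieval direction: translated source queries run against the target side."""
    pairs = {}
    for src_id, src_date, src_tokens in src_docs:
        query = translate(top_keywords(src_tokens, src_docs), lexicon)
        for tgt_id, tgt_date, tgt_tokens in tgt_docs:
            score = overlap_score(query, tgt_tokens)
            close_in_time = abs((src_date - tgt_date).days) <= max_days
            if close_in_time and score >= min_score:
                pairs[(src_id, tgt_id)] = score
    return pairs


def bidirectional_align(zh_docs, en_docs):
    """Keep pairs found in both directions, averaging the two scores (an assumption)."""
    zh_to_en = align(zh_docs, en_docs, ZH_EN)
    en_to_zh = {(zh, en): s for (en, zh), s in align(en_docs, zh_docs, EN_ZH).items()}
    return {p: (zh_to_en[p] + en_to_zh[p]) / 2 for p in zh_to_en if p in en_to_zh}


aligned = bidirectional_align(ZH_DOCS, EN_DOCS)
for (zh_id, en_id), score in sorted(aligned.items()):
    print(zh_id, en_id, score)
```

In this toy data, `en3` shares vocabulary with `zh1` but was published months later, so the date filter rejects it; that is the role the abstract assigns to publication dates during alignment. A realistic system would replace each assumed component (word segmentation, keyword ranking, translation, retrieval engine) with a proper one.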


Basic information

Huang Degen (黄德根)

