XML structure extraction from plain texts based on conditional random fields

Indexed by:Journal Article

Date of Publication:2010-08-01

Journal:Journal of Computational Information Systems

Included Journals:EI, Scopus

Volume:6

Issue:8

Page Number:2683-2690

ISSN No.:1553-9105

Abstract:Information extraction is an effective way to convert information in unstructured text into structured records. Although there is a substantial body of previous work, most of it is devoted to extracting atomic entities and flat records, so the resulting structure is usually flat and lacks richer structural information. This paper proposes a novel approach that uses Conditional Random Fields (CRFs) to extract higher-order structures from unstructured texts: the Tree Structure Extraction system based on CRFs (TSECRF). TSECRF uses the path information in XML documents and suitable feature sets for CRFs as training sources, automatically obtains tree structures from the texts, and generates target XML documents with certain structures. Experiments on real-life data sets showed that this method achieves higher precision and recall than Hidden Markov Models. TSECRF can be applied to problems of structured storage, text information retrieval, and data integration on the Internet. Copyright © 2010 Binary Information Press.
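The abstract describes using XML path information as labels and then generating an XML document from the labeled text. As a rough illustration of that final step only, the sketch below assembles an XML tree from tokens that have already been tagged with slash-separated paths. The function name `build_xml`, the path format, and the sample labels are all hypothetical; the abstract does not specify TSECRF's label scheme, feature sets, or tree-building procedure, and the CRF sequence labeler itself is not shown.

```python
import xml.etree.ElementTree as ET


def build_xml(tokens, paths):
    """Assemble an XML tree from tokens labeled with paths (hypothetical helper).

    Each path is a slash-separated string such as "/bib/author". In a real
    system these labels would be predicted by a sequence model; here they
    are supplied by hand. Consecutive tokens with the same path are merged
    into one element, which is a simplifying assumption of this sketch.
    """
    root = None
    open_path = []   # tags of the currently open elements, root first
    open_elems = []  # the Element objects matching open_path
    for token, path in zip(tokens, paths):
        parts = path.strip("/").split("/")
        if root is None:
            root = ET.Element(parts[0])
            open_path, open_elems = [parts[0]], [root]
        elif parts[0] != root.tag:
            raise ValueError("all paths must share the same root element")
        # Length of the prefix shared by the open path and the new path.
        common = 0
        while (common < min(len(open_path), len(parts))
               and open_path[common] == parts[common]):
            common += 1
        # Close elements that are not on the new path.
        del open_path[common:]
        del open_elems[common:]
        # Open the elements that the new path still needs.
        for tag in parts[common:]:
            elem = ET.SubElement(open_elems[-1], tag)
            open_path.append(tag)
            open_elems.append(elem)
        # Append the token to the text of the innermost element.
        leaf = open_elems[-1]
        leaf.text = token if not leaf.text else leaf.text + " " + token
    return root


# Hand-written example labels (illustrative, not taken from the paper).
tokens = ["J.", "Smith", "Tree", "Extraction", "2010"]
paths = ["/bib/author", "/bib/author", "/bib/title", "/bib/title", "/bib/year"]
xml_text = ET.tostring(build_xml(tokens, paths), encoding="unicode")
print(xml_text)
```

Because adjacent tokens with identical paths are merged, two neighboring elements of the same type (for example two authors in a row) would collapse into one; a fuller label scheme would need begin/inside markers or a similar device to separate them.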
