Indexed by:Journal article
Date of Publication:2010-08-01
Journal:Journal of Computational Information Systems
Included Journals:EI, Scopus
Volume:6
Issue:8
Page Number:2683-2690
ISSN No.:1553-9105
Abstract:Information extraction is an effective way to convert information in unstructured text into structured records. Although there is a substantial body of previous work, most of it is devoted to extracting atomic entities and flat records, so the resulting structure is usually flat and lacks richer structural information. This paper proposes a novel approach that uses Conditional Random Fields (CRFs) to extract higher-order structures from unstructured text: the Tree Structure Extraction system based on CRFs (TSECRF). TSECRF uses the path information in XML documents, together with suitable feature sets for CRFs, as training sources; it automatically obtains tree structures from the text and generates target XML documents with a given structure. Experiments on real-life data sets show that this method achieves higher precision and recall than Hidden Markov Models. TSECRF can be applied to structured storage, text information retrieval, and data integration on the Internet. Copyright © 2010 Binary Information Press.