Leveraging unlabeled data to scale blocking for record linkage

Indexed by: Conference paper

Date of Publication: 2011-07-16

Included Journals: EI, Scopus

Page Number: 2211-2217

Abstract: Record linkage is the process of matching records across two or more data sets that represent the same real-world entity. An exhaustive record linkage process computes the similarity between every pair of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and comparing only records within the same block. To adapt from domain to domain, one category of blocking techniques formalizes the construction of a blocking scheme as a machine learning problem. When learning the best blocking scheme, previous learning-based techniques use only a set of labeled data. However, because the labeled set is usually not large enough to characterize the unseen (unlabeled) data well, the resulting blocking scheme may perform poorly on the unseen data by generating too many candidate matches. To address this, we propose using unlabeled data, in addition to labeled data, for learning blocking schemes. Our experimental results show that using unlabeled data in learning can remarkably reduce the number of candidate matches while keeping the same level of coverage for true matches.
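
The basic idea of blocking described in the abstract can be illustrated with a minimal Python sketch. It is not the learning method proposed in the paper: the records, the `zip` field, and the helper names `candidate_pairs` and `zip_key` are all hypothetical, and the blocking key is fixed by hand rather than learned. For simplicity it also pairs records within a single data set instead of across two.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key):
    """Group record ids by blocking key and return all pairs that share a block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records, invented for illustration only.
records = {
    1: {"name": "John Smith", "zip": "10001"},
    2: {"name": "Jon Smith", "zip": "10001"},
    3: {"name": "Mary Jones", "zip": "94105"},
    4: {"name": "M. Jones", "zip": "94105"},
    5: {"name": "Alan Turing", "zip": "10001"},
}


def zip_key(rec):
    """A hand-picked blocking key (assumption): records must share a zip code."""
    return rec["zip"]


# Exhaustive linkage compares every pair; blocking compares far fewer.
exhaustive = set(combinations(sorted(records), 2))
blocked = candidate_pairs(records, zip_key)

print(len(exhaustive), "pairs without blocking")
print(len(blocked), "candidate pairs with blocking")
```

In this toy setting the likely matches (records 1 and 2, records 3 and 4) remain candidates while most non-matching pairs are never compared. A poorly chosen key would either keep too many candidate pairs or separate true matches, which is the trade-off a learned blocking scheme tries to balance.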
