Leveraging unlabeled data to scale blocking for record linkage

Release Time: 2019-03-11

Indexed by: Conference Paper

Date of Publication: 2011-07-16

Included Journals: Scopus, EI

Page Number: 2211-2217

Abstract: Record linkage is the process of matching records across two or more data sets that represent the same real-world entity. An exhaustive record linkage process computes the similarity between all pairs of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and comparing only records within the same block. To adapt from domain to domain, one category of blocking techniques formalizes the construction of a blocking scheme as a machine learning problem. When learning the best blocking scheme, previous learning-based techniques use only a set of labeled data. However, since the labeled set is usually not large enough to characterize the unseen (unlabeled) data well, the resulting blocking scheme may perform poorly on the unseen data by generating too many candidate matches. To address this, we propose using unlabeled data, in addition to labeled data, for learning blocking schemes. Our experimental results show that using unlabeled data in learning can substantially reduce the number of candidate matches while maintaining the same level of coverage for true matches.
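
The trade-off described in the abstract, fewer candidate matches at the same coverage of true matches, can be illustrated with a minimal sketch. It is not the paper's learning method: the blocking key (ZIP code) is fixed by hand rather than learned, the records and true matches are made-up toy data, the records sit in a single data set for brevity, and the names `block_records` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, blocking_key):
    """Group record ids by the value of a blocking key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Return the set of record-id pairs that share a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy records and ground truth (assumed for illustration only).
records = {
    1: {"name": "John Smith", "zip": "10001"},
    2: {"name": "Jon Smith", "zip": "10001"},
    3: {"name": "Mary Jones", "zip": "94105"},
    4: {"name": "Mary Jones", "zip": "94105"},
    5: {"name": "Alan Turing", "zip": "10001"},
}
true_matches = {(1, 2), (3, 4)}

# A hand-picked blocking scheme: records are compared only if ZIP codes agree.
blocks = block_records(records, lambda rec: rec["zip"])
candidates = candidate_pairs(blocks)

# Exhaustive linkage would compare every pair of records.
all_pairs = set(combinations(sorted(records), 2))

# Coverage: fraction of true matches that survive blocking.
coverage = len(candidates & true_matches) / len(true_matches)
# Reduction: fraction of exhaustive comparisons avoided by blocking.
reduction = 1 - len(candidates) / len(all_pairs)

print(f"candidate pairs: {len(candidates)} of {len(all_pairs)}")
print(f"coverage of true matches: {coverage:.2f}")
print(f"reduction in comparisons: {reduction:.2f}")
```

In this toy setting, a blocking scheme is better when it keeps coverage high while producing fewer candidate pairs; a learning-based approach would search over possible keys for one that does so.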
