Zhikui Chen (陈志奎)

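Personal Information

Professor

Doctoral Supervisor

Master's Supervisor

Main position: Teaching

Gender: Male

Alma mater: Chongqing University

Degree: Doctorate

Affiliation: School of Software; International School of Information and Software

Disciplines: Software Engineering; Computer Software and Theory

Office: Room 405, Comprehensive Building, Development Zone

Contact: Email: zkchen@dlut.edu.cn Mobile: 13478461921 WeChat: 13478461921 QQ: 1062258606

Email: zkchen@dlut.edu.cn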

Publications

Leveraging unlabeled data to scale blocking for record linkage

Paper type: Conference paper

Publication date: 2011-07-16

Indexed in: EI, Scopus

Pages: 2211-2217

Abstract: Record linkage is the process of matching records between two (or multiple) data sets that represent the same real-world entity. An exhaustive record linkage process involves computing the similarities between all pairs of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and only comparing records within the same block. To be adaptive from domain to domain, one category of blocking technique formalizes 'construction of blocking scheme' as a machine learning problem. In the process of learning the best blocking scheme, previous learning-based techniques utilize only a set of labeled data. However, since the set of labeled data is usually not large enough to characterize the unseen (unlabeled) data well, the resulting blocking scheme may perform poorly on the unseen data by generating too many candidate matches. To address this, in this paper we propose to utilize unlabeled data (in addition to labeled data) for learning blocking schemes. Our experimental results show that using unlabeled data in learning can remarkably reduce the number of candidate matches while keeping the same level of coverage for true matches.
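
The idea in the abstract can be illustrated with a small sketch. It is not the paper's algorithm, which the abstract does not spell out. It only shows, under assumed toy data, how a blocking scheme restricts comparisons to records that share a block, and how unlabeled records can be used to count the candidate pairs a scheme would generate while labeled records check coverage of known matches. The function names (`candidate_pairs`, `evaluate_scheme`), the record fields, and the three candidate schemes are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key):
    """Group records by a blocking key; return the id pairs that share a block."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def evaluate_scheme(key, labeled, true_matches, unlabeled):
    """Return (coverage of known matches, candidate count on unlabeled data)."""
    covered = candidate_pairs(labeled, key) & true_matches
    coverage = len(covered) / len(true_matches)
    n_candidates = len(candidate_pairs(unlabeled, key))
    return coverage, n_candidates


# Toy data, invented for illustration only.
labeled = {
    1: {"name": "john smith", "zip": "10001"},
    2: {"name": "jon smith", "zip": "10001"},
    3: {"name": "mary jones", "zip": "20002"},
    4: {"name": "mary jones", "zip": "20002"},
}
true_matches = {(1, 2), (3, 4)}
unlabeled = {
    10: {"name": "james brown", "zip": "10001"},
    11: {"name": "jane black", "zip": "10001"},
    12: {"name": "mark green", "zip": "10001"},
    13: {"name": "mike gray", "zip": "10001"},
    14: {"name": "jack white", "zip": "20002"},
}

# Hypothetical candidate blocking schemes (each maps a record to a block key).
schemes = {
    "first_letter": lambda r: r["name"][0],
    "zip": lambda r: r["zip"],
    "first_letter_and_zip": lambda r: (r["name"][0], r["zip"]),
}

results = {
    name: evaluate_scheme(key, labeled, true_matches, unlabeled)
    for name, key in schemes.items()
}

# Among schemes that cover every labeled match, prefer the one that
# generates the fewest candidate pairs on the unlabeled records.
full_coverage = [name for name, (cov, _) in results.items() if cov == 1.0]
best = min(full_coverage, key=lambda name: results[name][1])

for name, (cov, n) in results.items():
    print(f"{name}: coverage={cov:.2f}, candidates on unlabeled={n}")
print("selected:", best)
```

In this toy setting all three schemes cover both labeled matches, so the labeled data alone cannot tell them apart. The candidate counts on the unlabeled records differ, and that is the kind of signal the abstract proposes to exploit.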