大连理工大学主页平台管理系统申彦明--中文主页-- An efficient MapReduce algorithm for similarity join in metric spaces

申彦明

教授博士生导师硕士生导师
性别：男
毕业院校：纽约理工大学
学位：博士
所在单位：计算机科学与技术学院
办公地点：海山楼B0813
联系方式：shen@dlut.edu.cn
电子邮箱：shen@dlut.edu.cn

访问量：

开通时间：..

最后更新时间：..

个人学术主页

当前位置: 中文主页 >> 科学研究 >> 论文成果

An efficient MapReduce algorithm for similarity join in metric spaces

点击次数：

论文类型：期刊论文

发表时间：2016-03-01

发表刊物：JOURNAL OF SUPERCOMPUTING

收录刊物：SCIE、EI

卷号：72

期号：3

页面范围：1179-1200

ISSN号：0920-8542

关键字：MapReduce; Similarity join; Data set partition; Redundancy

摘要：Given a massive set of records, similarity join is to find pairs of records with similarity score greater than a threshold. In this paper, we address the problem of scaling up similarity join for general metric distance functions using MapReduce. First, we propose a novel index structure, Similarity Join Tree (SJT), which partitions data based on the underlying data distribution, and distributes similar records to the same group. Different from existing approaches, SJT can prune a large number of comparisons within reduce tasks by utilizing the by-product results generated in partitioning data. Then, to avoid the straggler reduce tasks, we design a graph partition algorithm by extending the well known Fiduccia-Mattheyses algorithm which can ensure load balancing while minimizing communication cost and redundancy in all reduce tasks. Experimental results using real data sets show that our approach is more effective and scalable compared to state-of-the-art algorithms.

上一条：An Efficient Approach of Processing Multiple Continuous Queries

下一条：FleCube: A flexibly-connected architecture of data center networks on multi-port servers