Hongwei Ge (葛宏伟)

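Personal Information

Professor

Doctoral Supervisor

Master's Supervisor

Main Position: Party Committee Secretary, School of Computer Science and Technology

Gender: Male

Alma Mater: Jilin University

Degree: Doctorate

Affiliation: School of Computer Science and Technology

Discipline: Computer Application Technology

Office: Haishan Building (海山楼) A1022

Contact: hwge@dlut.edu.cn

Email: gehw@dlut.edu.cn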

Publications

Fast batch searching for protein homology based on compression and clustering

Paper Type: Journal article

Publication Date: 2017-11-21

Journal: BMC BIOINFORMATICS

Indexed In: SCIE, EI, PubMed

Volume: 18

Issue: 1

Pages: 508

ISSN: 1471-2105

Keywords: Protein homology; Batch searching; Compression; Clustering

Abstract: Background: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To conduct multiple queries in a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries.
   Results: We propose a compression and cluster based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. First, the queries and the database are compressed in turn by procedures of redundancy analysis, redundancy removal and distinction recording. Second, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. Following this, the hit-finding operator is applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the objective of mitigating the effect of the increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch searching of homology in a sequence database. The results are evaluated in terms of homology accuracy, search speed and memory usage.
   Conclusions: C2-BLASTP achieves competitive results compared with some state-of-the-art methods.
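
The clustering step described in the abstract, grouping subsequences by Hamming distance over a reduced amino acid alphabet, can be illustrated with a minimal sketch. This is not the authors' C2-BLASTP implementation. The alphabet grouping in `GROUPS`, the greedy first-fit strategy, and the names `reduce_sequence`, `hamming` and `cluster_words` are all assumptions made for illustration. The abstract does not list the ten alphabets the paper uses.

```python
from typing import Dict, List

# Hypothetical reduced alphabet, chosen for illustration only.
# It is NOT one of the ten groupings used in the paper.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "G", "P"]
# Map every residue to the first letter of its group.
REDUCE: Dict[str, str] = {aa: g[0] for g in GROUPS for aa in g}


def reduce_sequence(seq: str) -> str:
    """Map each residue to its group representative; unknown residues become 'X'."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def cluster_words(words: List[str], max_dist: int) -> List[List[str]]:
    """Greedy first-fit clustering (an assumed strategy).

    A word joins the first cluster whose representative (its first member)
    is within max_dist in the reduced alphabet; otherwise it starts a new
    cluster.
    """
    clusters: List[List[str]] = []
    for w in words:
        for c in clusters:
            if hamming(reduce_sequence(w), reduce_sequence(c[0])) <= max_dist:
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters
```

For example, `cluster_words(["AVKR", "LIKK", "DEDE", "AVKD"], 1)` puts `AVKR`, `LIKK` and `AVKD` in one cluster, because their reduced forms differ in at most one position, and leaves `DEDE` in a cluster of its own. A hit-finding step could then compare each query word against cluster representatives only, rather than against every subsequence.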