An Efficient Mining for Approximate Frequent Items in Protein Sequence Database

J. Jeyabharathi1 and D. Shanthi2

1. Department of Computer Science and Engineering, C.R. Engineering College, Madurai, TamilNadu, India
2. Department of Computer Science and Engineering, PSNA College of Engineering and Technology, Dindigul, TamilNadu, India

Abstract—The rapid growth in available protein, DNA, and other biological sequences has made discovering meaningful patterns in sequences a major task in bioinformatics research. Mining protein sequence databases poses special challenges because several protein databases are non-relational, whereas most data mining and machine learning techniques assume that the input is a relational database. Existing sequence mining algorithms mainly focus on mining subsequences. However, a wide range of applications, such as DNA and protein motif mining, need an effective way to identify approximate frequent patterns. Existing approximate frequent pattern mining algorithms have limitations, including a lack of knowledge for finding the patterns, poor scalability, and difficulty in adapting to other applications. In this paper, a Generalized Approximate Pattern Algorithm (GAPA) is proposed to efficiently mine approximate frequent patterns in a protein sequence database. Pearson's correlation coefficient is computed among the items of the protein sequence database to analyze the approximate frequent patterns. The performance of the proposed GAPA is analyzed and tested with the FASTA protein sequence database, whose files hold the protein translations of Ensembl gene predictions. GAPA is compared with existing methods such as the Approximate Frequent Itemsets (AFI) tree and Approximate Closed Frequent Itemsets (ACFIM) in terms of support, accuracy, memory usage, and time consumption. The experimental results show that GAPA is scalable and outperforms the existing algorithms.

Index Terms—Approximate Frequent Patterns, Bioinformatics, Data Mining, Generalized Approximate Pattern Algorithm, Pearson's Correlation Coefficient, Protein Motif, Protein Sequence Database, Relational Database

Cite: J. Jeyabharathi and D. Shanthi, "An Efficient Mining for Approximate Frequent Items in Protein Sequence Database," Journal of Emerging Technologies in Web Intelligence, Vol. 6, No. 3, pp. 324-330, August 2014. doi:10.4304/jetwi.6.3.324-330
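
The abstract states that Pearson's correlation coefficient is computed among the items of the protein sequence database, but it does not say how items are encoded. The following is a minimal sketch under one assumed encoding: each item is a short residue substring, represented by a 0/1 vector recording whether it occurs in each sequence. The helper names `pearson` and `occurrence_vector`, the toy sequences, and the zero-variance convention are all illustrative assumptions, not part of GAPA as published.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption for this sketch: the coefficient is undefined for a
        # constant vector, so 0.0 is reported instead of raising.
        return 0.0
    return cov / sqrt(var_x * var_y)


def occurrence_vector(item, sequences):
    """Assumed encoding: 1 if `item` occurs in a sequence, else 0."""
    return [1 if item in seq else 0 for seq in sequences]


# Toy sequences for illustration only; not taken from the FASTA database.
sequences = ["MKVLAT", "MKVGGT", "AAVLAT", "GGSSPP"]
x = occurrence_vector("MKV", sequences)
y = occurrence_vector("LAT", sequences)
r = pearson(x, y)
print(f"correlation between 'MKV' and 'LAT': {r:.3f}")
```

A coefficient near 1 would indicate that two items tend to occur in the same sequences, and a coefficient near 0 that their occurrences are unrelated. How GAPA turns such coefficients into approximate frequent patterns is described in the full paper, not in the abstract.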
