I need help creating a thesis and an outline on The Efficiency of Clustering Algorithms for Mining Large Databases. Prepare this assignment according to the guidelines found in the APA Style Guide. An abstract is required. The problem is that these methods cannot be used to compare and cluster data sets that have been provided by users (Fayech et al., 2009).
This study evaluates the efficiency of several sequence data mining algorithms on protein sequence data sets and, on the basis of their shortcomings, designs and develops an efficient clustering algorithm based on the partitioning method. The rationale is that partitioning techniques have not been exploited in protein sequence clustering for comparing and clustering large data sets. Millions of protein sequences are available (Mount, 2002). Additionally, as noted earlier, alignment methods are extremely expensive and inefficient for clustering large protein data sets.
To meet the objectives of this study, the partitioning-based algorithms proposed by Fayech et al. (2009) are implemented on the protein sequence data sets to produce C clusters from a data set D of n protein sequences, such that the objective function f(X) is optimized. These algorithms are Pro-LEADER, Pro-CLARANS, Pro-Kmeans and Pro-CLARA. Their performance and efficiency are measured and evaluated. Then, their results are compared with the results of the newly proposed algorithm, which is intended to address their shortcomings and inefficiencies.
Here, Rj represents the centroid of the class j to which the object Oi belongs. The alignment score of the protein sequences Rj and Oi is calculated using the equation stated below.
This algorithm begins by randomly partitioning the protein sequence data set D into patterns. It uses the local Smith-Waterman algorithm to compare the proteins of each pattern and computes the Sum Score of each protein.
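The two steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study. It assumes a simple match/mismatch scoring scheme with a linear gap penalty (practical protein alignment typically relies on a substitution matrix and affine gap penalties), and it assumes that the Sum Score of a protein is the sum of its local alignment scores against the other proteins in the same pattern. The function names, the scoring parameters and the example sequences are hypothetical.

```python
import random


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of sequences a and b.

    Assumed scoring: fixed match/mismatch values and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def random_partition(sequences, n_patterns, seed=None):
    """Shuffle the data set and split it into n_patterns groups."""
    rng = random.Random(seed)
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    return [shuffled[k::n_patterns] for k in range(n_patterns)]


def sum_scores(pattern):
    """Assumed Sum Score: total alignment score of each protein
    against every other protein in the same pattern."""
    return [
        sum(smith_waterman_score(s, t) for j, t in enumerate(pattern) if j != i)
        for i, s in enumerate(pattern)
    ]


if __name__ == "__main__":
    data_set = ["MKTAYIAK", "MKTAYLAK", "GAVLIPFW", "GAVLIPFY", "QQRSTNDE"]
    for pattern in random_partition(data_set, n_patterns=2, seed=0):
        print(pattern, sum_scores(pattern))
```

Under these assumptions, a protein with a high Sum Score is similar to many members of its pattern. Each pattern requires a number of pairwise comparisons that grows quadratically with its size, which is why partitioning the data set before alignment keeps the cost lower than aligning every pair in D.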