The Pro-Kmeans algorithm begins by randomly partitioning the data set into clusters. It then uses the Smith-Waterman algorithm to compare the proteins within each cluster and to compute each protein's SumScore. The sequence with the highest SumScore in a cluster is taken as that cluster's centroid (Tatusov, 2003). The Smith-Waterman algorithm is then applied again to compare every protein in the data set with the centroids found, and each protein is assigned to the cluster whose centroid yields the maximum similarity score. The algorithm repeats these steps iteratively in order to maximize its objective function. The number of clusters is the input parameter, and the output is the best partition found for the data set (Sasson and Linial, 2002).
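The Pro-Kmeans steps described above can be illustrated with a minimal Python sketch. It is not the published implementation and rests on several assumptions: the Smith-Waterman scoring uses toy match, mismatch and linear gap values instead of a protein substitution matrix; a protein's SumScore is taken to be the sum of its alignment scores against the other members of its cluster; and iteration stops when the assignments no longer change. The names `smith_waterman`, `centroid` and `pro_kmeans` are hypothetical.

```python
import random

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def centroid(cluster):
    """Return the member with the highest SumScore (assumed definition:
    sum of alignment scores against all other members of the cluster)."""
    def sum_score(i):
        return sum(smith_waterman(cluster[i], cluster[j])
                   for j in range(len(cluster)) if j != i)
    return cluster[max(range(len(cluster)), key=sum_score)]

def pro_kmeans(seqs, k, max_iter=20, seed=0):
    """Sketch of Pro-Kmeans; requires len(seqs) >= k."""
    rng = random.Random(seed)
    # Random initial partition in which every cluster is non-empty.
    labels = [i % k for i in range(len(seqs))]
    rng.shuffle(labels)
    centroids = []
    for _ in range(max_iter):
        clusters = [[s for s, lab in zip(seqs, labels) if lab == c]
                    for c in range(k)]
        # Assumption: an emptied cluster is reseeded with a random sequence.
        centroids = [centroid(c) if c else rng.choice(seqs) for c in clusters]
        # Assign each sequence to the centroid with the maximum score.
        new_labels = [max(range(k), key=lambda c: smith_waterman(s, centroids[c]))
                      for s in seqs]
        if new_labels == labels:  # assumed stopping rule: partition is stable
            break
        labels = new_labels
    return labels, centroids
```

Calling `pro_kmeans(sequences, k=2)` returns one cluster label per sequence together with the centroid sequence of each cluster. In practice the toy scoring would be replaced by a substitution matrix such as BLOSUM62 with affine gap penalties.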
The Pro-LEADER algorithm selects the first sequence in the data set as the first leader and then uses the Smith-Waterman algorithm to calculate the similarity score between every sequence in the data set and all current leaders. For each sequence, it identifies the nearest leader and compares that similarity score with a pre-fixed threshold. If the similarity score with the nearest leader is lower than the threshold, the sequence is not similar enough to any existing cluster and becomes a new leader; otherwise it is assigned to the cluster defined by its nearest leader (Herger and Holm, 2001). Pro-LEADER is therefore an incremental algorithm in which each cluster is represented by a leader, and the clusters obtained depend on choosing a suitable threshold value. As with Pro-Kmeans, the aim is to maximize the objective function. The input parameter is the threshold similarity score that decides whether an object becomes a new leader, and the output is the best partition of the training data set together with the number of cluster leaders obtained.
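The incremental nature of Pro-LEADER can likewise be shown in a short Python sketch. It is an illustration under stated assumptions, not the published implementation: the Smith-Waterman scoring is a toy version with linear gaps, and a sequence is assumed to found a new cluster when its best similarity score against the existing leaders falls below the threshold. The names `smith_waterman` and `pro_leader` are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def pro_leader(seqs, threshold, score=smith_waterman):
    """Single pass over seqs; returns (leaders, clusters)."""
    leaders = [seqs[0]]        # the first sequence is the first leader
    clusters = [[seqs[0]]]
    for s in seqs[1:]:
        scores = [score(s, leader) for leader in leaders]
        nearest = max(range(len(leaders)), key=scores.__getitem__)
        if scores[nearest] < threshold:
            # Assumption: too dissimilar to every leader, so s starts a cluster.
            leaders.append(s)
            clusters.append([s])
        else:
            clusters[nearest].append(s)
    return leaders, clusters
```

Because each sequence is examined once, the result depends on the order of the data set and on the threshold: a higher threshold produces more leaders and therefore more, smaller clusters.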