Date of Award
Master of Science (MS)
Department of Mathematics
Data classification is a crucial preprocessing step in the analysis and understanding of numerical data. Cluster analysis, in particular, provides insight into the inherent patterns in the data, which makes the interpretation of any follow-up analysis more meaningful. A clustering algorithm groups data points according to a predefined similarity criterion. This allows the data set to be broken into segments, which in turn makes way for a more targeted statistical analysis. Cluster analysis has applications in numerous fields of study and, as a result, countless algorithms have been developed. However, the sheer number of options makes it difficult to choose an appropriate algorithm. Additionally, the more commonly used algorithms, while precise, require a familiarity with the data structure that may be resource-intensive to attain. Here, we address this concern by developing a novel clustering algorithm, the sieve method, for the preliminary cluster analysis of high-dimensional data. We evaluate its performance by comparing it to three well-known clustering algorithms for numerical data: k-means, single-linkage hierarchical clustering, and self-organizing maps. To compare the algorithms, we measure accuracy using the misclassification (error) rate of each algorithm. Additionally, we compare the within- and between-cluster variation of each clustering result through multivariate analysis of variance. We use each algorithm to cluster Fisher's Iris flower data set, which consists of 150 observations in 3 "true" clusters, each observation made up of four numerical measurements. When the optimal clustering structure is known, we find that k-means and self-organizing maps are the more efficient algorithms in terms of speed and accuracy. When this structure is not known, we find that the sieve algorithm, despite higher misclassification rates, is able to recover the optimal clustering structure through a truly blind clustering.
Thus, the sieve algorithm functions as an informative, blind preliminary clustering method that can then be followed up by a more refined algorithm. The existence of a reliably efficient clustering process for numerical data means that more time, effort, and computational resources can be spent on a more rigorous and targeted statistical analysis.
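The misclassification rate used above to compare the algorithms can be illustrated with a minimal sketch. Because cluster labels are arbitrary, each cluster must first be matched to a true class before errors are counted. The function name `misclassification_rate`, the brute-force matching over permutations, and the toy labels are assumptions made for illustration; they are not taken from the thesis and are not its results.

```python
from itertools import permutations


def misclassification_rate(true_labels, cluster_labels):
    """Fraction of points misassigned under the best cluster-to-class matching.

    Illustrative sketch: tries every one-to-one matching of clusters to
    classes, which is only practical for a handful of clusters (e.g. 3).
    Assumes None is not used as a true class label.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("label sequences must be non-empty")

    # Distinct labels in order of first appearance (no sorting required).
    classes = list(dict.fromkeys(true_labels))
    clusters = list(dict.fromkeys(cluster_labels))

    # Pad with None so that surplus clusters can be left unmatched;
    # every point in an unmatched cluster counts as an error.
    candidates = classes + [None] * max(0, len(clusters) - len(classes))

    best_correct = 0
    for assignment in permutations(candidates, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(
            mapping[c] == t for t, c in zip(true_labels, cluster_labels)
        )
        best_correct = max(best_correct, correct)

    return 1 - best_correct / len(true_labels)


# Toy example (made-up labels, not the Iris results reported in the thesis).
true = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
found = [2, 2, 2, 0, 0, 1, 1, 1, 1]
rate = misclassification_rate(true, found)
print(f"misclassification rate: {rate:.3f}")
```

In this toy case one versicolor observation falls into the cluster that is matched to virginica, so one of nine points is counted as misclassified. For larger numbers of clusters, an assignment solver would be a more practical choice than enumerating permutations.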
Gonzalez, Josselyn, "Clustering Biological Data with Self-Adjusting High-Dimensional Sieve" (2018). Theses and Dissertations. 857.