Graduation Term

4-10-2018

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Department of Mathematics

Committee Chair

Olcay Akman

Abstract

Data classification as a preprocessing technique is a crucial step in the analysis and understanding of numerical data. Cluster analysis, in particular, provides insight into the inherent patterns found in data, which makes the interpretation of any follow-up analyses more meaningful. A clustering algorithm groups together data points according to a predefined similarity criterion. This allows the data set to be broken up into segments, which, in turn, makes way for a more targeted statistical analysis. Cluster analysis has applications in numerous fields of study and, as a result, countless algorithms have been developed. However, the quantity of options makes it difficult to find an appropriate algorithm to use. Additionally, the more commonly used algorithms, while precise, require a familiarity with the data structure that may be resource-consuming to attain. Here, we address this concern by developing a novel clustering algorithm, the sieve method, for the preliminary cluster analysis of high-dimensional data. We evaluate its performance by comparing it to three well-known clustering algorithms for numerical data: k-means, single-linkage hierarchical clustering, and self-organizing maps. To compare the algorithms, we measure accuracy by using the misclassification (error) rate of each algorithm. Additionally, we compare the within- and between-cluster variation of each clustering result through multivariate analysis of variance. We use each algorithm to cluster Fisher's Iris Flower data set, which consists of 3 "true" clusters and 150 total observations, each made up of four numerical measurements. When the optimal clustering structure is known, we find that k-means and self-organizing maps are the more efficient algorithms in terms of speed and accuracy. When this structure is not known, we find that the sieve algorithm, despite higher misclassification rates, is able to obtain the optimal clustering structure through a truly blind clustering.
Thus, the sieve algorithm functions as an informative and blind preliminary clustering method that can then be followed up by a more refined algorithm. The existence of a reliably efficient clustering process for numerical data means that more time, effort, and computational resources can be spent on a more rigorous and targeted statistical analysis.
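
The misclassification rate mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `misclassification_rate`, the brute-force matching of cluster ids to true labels, and the nine-point toy data are all assumptions made for illustration. Because cluster ids are arbitrary, the sketch tries every one-to-one assignment of clusters to classes and reports the error under the best one, which is practical only for a handful of clusters.

```python
from itertools import permutations


def misclassification_rate(true_labels, cluster_ids):
    """Fraction of points in the wrong group under the best one-to-one
    matching of cluster ids to true class labels (hypothetical helper)."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")

    # Distinct labels, in order of first appearance.
    classes = list(dict.fromkeys(true_labels))
    clusters = list(dict.fromkeys(cluster_ids))

    # If there are more clusters than classes, pad with placeholders that
    # match no true label, so surplus clusters always count as errors.
    n_extra = max(0, len(clusters) - len(classes))
    candidates = classes + [object() for _ in range(n_extra)]

    best_correct = 0
    for assignment in permutations(candidates, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best_correct = max(best_correct, correct)

    return 1 - best_correct / len(true_labels)


# Toy data, invented for illustration (not results from the thesis):
# three species with three observations each, one point misplaced.
true_labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
rate = misclassification_rate(true_labels, cluster_ids)
print(f"misclassification rate: {rate:.3f}")
```

For larger numbers of clusters, the exhaustive search over permutations would typically be replaced by an assignment solver; the brute-force version is kept here only because it is easy to verify by hand.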

Recommended Citation

Gonzalez, Josselyn, "Clustering Biological Data with Self-Adjusting High-Dimensional Sieve" (2018). Theses and Dissertations. 857.
https://ir.library.illinoisstate.edu/etd/857

DOI

http://doi.org/10.30707/ETD2018.Gonzalez.J

Included in

Biostatistics Commons, Mathematics Commons
