Date of Award

11-15-2021

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

School of Information Technology: Information Systems

First Advisor

Yongning Tang

Abstract

Machine learning has shown strong potential for improving the performance of Intrusion Detection Systems (IDS). In a machine learning based IDS, detection is commonly formulated as a supervised classification problem: training datasets are used to train a selected model to learn how network features relate to different types of network traffic (i.e., benign traffic or a specific type of network attack). Each training dataset usually includes a large number of data samples, and each sample contains many network features together with its associated traffic type, called the label. Most recent studies focus on developing better machine learning models to achieve higher IDS performance. Very little research has examined the quality of training datasets, especially how mislabeling affects the performance of a machine learning based IDS.

In this thesis, we focus on the mislabeling issue in a machine learning based IDS. We first show the impact of mislabeling on the performance of such an IDS. We then propose a new algorithm, Heuristic Mislabel Identification (HMI), based on Data Shapley [6], to identify mislabels in training datasets. Based on different mislabeling scenarios, HMI heuristically and iteratively divides a training dataset into multiple groups to narrow down the location or range of mislabels. We evaluated our method on a widely adopted IDS training dataset, CICIDS2017. The results show that HMI can identify 84% of random mislabels and 78% of mislabels from a single data source. The precision in both experiments is 100%, which means that every group flagged as suspect contains mislabeled samples.
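The idea of valuing groups of training data can be illustrated with a minimal sketch. This is not the HMI algorithm from the thesis: it is a single round of exact group-level Shapley valuation on hypothetical 1-D toy data, with a nearest-centroid classifier standing in for the IDS model. The group names, the data, the `utility` and `group_shapley` helpers, and the chance-level utility of 0.5 for an empty training set are all assumptions made for illustration.

```python
from itertools import permutations

# Toy 1-D data (hypothetical, not CICIDS2017): (feature, label) pairs,
# where label 0 = benign and label 1 = attack.
CLEAN = [(0, 0), (1, 0), (2, 0), (8, 1), (9, 1), (10, 1)]
FLIPPED = [(0, 1), (1, 1), (9, 0), (10, 0)]  # every label in this group is wrong
GROUPS = {"g0": CLEAN, "g1": CLEAN, "g2": FLIPPED, "g3": CLEAN}
VALIDATION = [(0.5, 0), (1.5, 0), (8.5, 1), (9.5, 1)]  # assumed trusted labels


def utility(group_names):
    """Validation accuracy of a nearest-centroid classifier trained on the groups."""
    samples = [s for name in group_names for s in GROUPS[name]]
    if not samples:
        return 0.5  # assumption: chance level for an untrained binary classifier
    centroids = {}
    for label in {lab for _, lab in samples}:
        xs = [x for x, lab in samples if lab == label]
        centroids[label] = sum(xs) / len(xs)
    correct = 0
    for x, y in VALIDATION:
        # Pick the label with the nearest centroid; break ties by label value.
        pred = min(centroids, key=lambda lab: (abs(x - centroids[lab]), lab))
        correct += pred == y
    return correct / len(VALIDATION)


def group_shapley(names):
    """Exact Shapley value of each group: mean marginal gain over all orderings."""
    totals = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for order in orderings:
        seen = []
        before = utility(seen)
        for name in order:
            seen.append(name)
            after = utility(seen)
            totals[name] += after - before
            before = after
    return {name: total / len(orderings) for name, total in totals.items()}


values = group_shapley(list(GROUPS))
suspect = min(values, key=values.get)  # lowest-valued group is the suspect
```

In this toy setup the fully mislabeled group `g2` receives the lowest value and is flagged as the suspect. Enumerating all orderings is only feasible for a handful of groups; larger settings would need a sampling-based approximation, and the iterative subdivision of suspect groups described in the abstract is not reproduced here.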

Comments

Imported from Li_ilstu_0092N_12066.pdf

DOI

https://doi.org/10.30707/ETD2021.20220215070317590203.999987

Page Count

48
