Machine Learning for Classifying Malware in Closed-set and Open-set Scenarios
Hassen, Mehadi Seid
Anti-malware vendors regularly receive large numbers of suspected malware files to be examined. The sheer number of files makes manual analysis time-consuming, so it is important to automate this process. Two of the main automation approaches are malware classification and clustering, in which similar malware samples are grouped into malware families. Grouping malware into families allows malware analysts to examine fewer representative samples from each family, streamlining the malware defense process. In this dissertation, we focus on two aspects of automated malware defense. In the first part of our work, we focus on malware classification in a closed set scenario, which assumes that instances seen during testing come from the same set of classes seen in training. We explore ways to improve the scalability of feature extraction while retaining discriminative information about the malware sample, and we propose a method for extracting features from function call graphs (FCGs). Our approach achieves linear time complexity, whereas previous FCG-based approaches have quadratic time complexity in both the size of the dataset and the size of the graph. Experimental results also indicate that our proposed feature improves classification accuracy compared to past research. In the second part of our work, we propose supervised and unsupervised approaches for handling open set scenarios. Unlike closed set scenarios, where the training data distribution is the same as the test data distribution, in open set scenarios the test data contains instances from data distributions that were not seen during training. First, we present an approach that extracts features from the output of an existing malware classifier to perform open set recognition.
Then, we present a supervised neural-network-based representation in which instances from the same class are close to each other while instances from different classes are farther apart. We evaluate this approach on two malware datasets: one Windows malware dataset and one Android malware dataset. We also evaluate the approach on an image dataset to show that it is applicable to other domains. The evaluation shows that our representation yields a statistically significant improvement in open set recognition performance over a state-of-the-art approach on the three datasets. Finally, we extend our neural-network-based representation by combining it with Adversarial Autoencoders (AAE) to address the unsupervised open set recognition problem. Our evaluations on three datasets (two malware datasets and one image dataset) show that our proposed approach gives improved open set recognition performance.
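
To make the open set setting concrete, the following minimal sketch shows one generic way a representation with well-separated classes could be used to reject unknown instances: compute a mean vector per known class, then flag any test instance whose distance to the nearest class mean exceeds a threshold. This is an illustrative assumption, not the method of the dissertation; the function names, the toy two-dimensional embeddings, the family labels, and the threshold value are all hypothetical.

```python
import math


def class_means(embeddings, labels):
    """Average the embedding vectors of each known class."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict_open_set(vec, means, threshold):
    """Return the nearest known class, or None if vec is far from every class mean."""
    label, dist = min(
        ((lbl, math.dist(vec, mean)) for lbl, mean in means.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= threshold else None


# Toy 2-D embeddings standing in for a learned representation (made-up values).
train_vecs = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2]]
train_labels = ["family_a", "family_a", "family_b", "family_b"]
means = class_means(train_vecs, train_labels)

# An instance near a known family is assigned to it; a distant one is rejected.
known = predict_open_set([0.1, 0.1], means, threshold=1.0)
unknown = predict_open_set([10.0, -3.0], means, threshold=1.0)
```

In this sketch the threshold is a hand-picked constant; in practice it would have to be chosen from held-out data, and the quality of the rejection depends on how well the representation separates the known classes.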