Ensemble Correlation Coefficient for Variable Association Detection
Subjects in a population are represented by their characteristics, and the characteristics are represented by variables. Identifying the relationships between these variables is essential for prediction, hypothesis testing, and decision making. The relationship between two variables is often quantified using a correlation factor. Once the correlations between the response and independent variables are known, they can be used to make predictions about the response variables. That is, if two variables are correlated, observing one allows us to make predictions about the other, and the stronger the relationship, the more accurate the prediction. Several correlation factors have been introduced. Among them, Pearson’s Correlation Coefficient has been the most commonly used, while Distance Correlation and the Maximal Information Coefficient have been introduced more recently to address its shortcomings. These coefficients differ in how well they detect a given type of relationship and in how they behave under different noise conditions. This makes it very challenging to choose the right correlation factor for a specific dataset when the underlying relationship is unknown. In this dissertation, we first compare these factors through a set of Monte Carlo simulations covering different relationships and a variety of noise conditions. We then propose a method that aggregates them into an ensemble, yielding a more robust factor that can be used with a variety of relationship types under different noise conditions. Next, we apply the proposed ensemble method to DNA copy numbers obtained from patients with non-small cell lung cancer to identify genes associated with lung cancer. Finally, we introduce our Robust Distance Correlation, a method that we developed to improve Distance Correlation and to make it robust with regard to the relationship type as well as the noise environment.
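To illustrate why the choice of correlation factor matters, the following is a minimal sketch of a Monte Carlo comparison between Pearson’s Correlation Coefficient and Distance Correlation. It is not the simulation procedure used in the dissertation: the function names (`pearson`, `distance_correlation`, `simulate`), the sample size, the number of trials, the uniform sampling of x, and the Gaussian noise model are all assumptions chosen for illustration. The Maximal Information Coefficient is omitted because it needs an external library.

```python
import numpy as np


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])


def _centered_distances(v):
    """Pairwise absolute distances, double-centered (rows, columns, grand mean)."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation for two 1-D samples of equal length."""
    a = _centered_distances(np.asarray(x, dtype=float))
    b = _centered_distances(np.asarray(y, dtype=float))
    dcov2 = (a * b).mean()                              # squared distance covariance
    dvar = np.sqrt((a * a).mean() * (b * b).mean())     # product of distance std. devs.
    if dvar == 0.0:
        return 0.0                                      # a constant sample carries no association
    return float(np.sqrt(max(dcov2, 0.0) / dvar))


def simulate(relationship, noise_sd, n=200, trials=50, seed=0):
    """Average |Pearson| and distance correlation over repeated noisy samples.

    Illustrative assumptions: x ~ Uniform(-1, 1), additive Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    p_vals, d_vals = [], []
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, n)
        y = relationship(x) + rng.normal(0.0, noise_sd, n)
        p_vals.append(abs(pearson(x, y)))
        d_vals.append(distance_correlation(x, y))
    return float(np.mean(p_vals)), float(np.mean(d_vals))


if __name__ == "__main__":
    for name, f in [("linear", lambda x: x), ("quadratic", lambda x: x ** 2)]:
        for sd in (0.1, 0.5):
            p, d = simulate(f, sd)
            print(f"{name:9s} noise_sd={sd}: |Pearson|={p:.2f}  dCor={d:.2f}")
```

For a symmetric quadratic relationship, the Pearson coefficient is expected to stay close to zero while Distance Correlation remains clearly positive, whereas for a linear relationship both detect the association. Differences of this kind are what motivate combining several factors rather than relying on one.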