The influence of distribution characteristics and data balancing on classification bias in highly unbalanced data sets
Class imbalance is an inherent feature of most real-world data sets, albeit with differing degrees of skewness. Because it is the norm rather than the exception, many modern Machine Learning (ML) applications need to deal with it in some way, both with regard to model training and with regard to the evaluation and interpretation of results. While in most circumstances this is not a problem and can be disregarded, it becomes a vital point of interest in high-stakes decision-making scenarios, such as policy-making or clinical diagnosis. Since the minority class is typically the subject of interest and its reliable classification is important, biasing the model towards it (for example by balancing the training data) is a natural response. However, this bias will often translate into an over-estimation of this class at deployment time, which in turn produces false-positive predictions that might have severe consequences.
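The over-estimation effect described above can be illustrated with a minimal sketch. It is a toy setting chosen for illustration, not the project's pipeline: two one-dimensional Gaussian classes with assumed means 0 and 2, unit variance, and an assumed deployment prevalence of 1%. All function and variable names are hypothetical. The code compares the decision threshold that is optimal under balanced (50/50) class priors with the one that is optimal under the true prevalence, and evaluates both at deployment.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_threshold(mu_neg, mu_pos, sigma, prior_pos):
    """Decision threshold that minimises error for two equal-variance
    Gaussian classes when the positive class has prior `prior_pos`."""
    midpoint = 0.5 * (mu_neg + mu_pos)
    shift = sigma ** 2 / (mu_pos - mu_neg) * math.log((1.0 - prior_pos) / prior_pos)
    return midpoint + shift


def deployment_metrics(threshold, mu_neg, mu_pos, sigma, prevalence):
    """False-positive rate, true-positive rate and precision when the
    rule 'x > threshold' is applied at the given deployment prevalence."""
    fpr = 1.0 - normal_cdf((threshold - mu_neg) / sigma)
    tpr = 1.0 - normal_cdf((threshold - mu_pos) / sigma)
    true_pos = prevalence * tpr
    false_pos = (1.0 - prevalence) * fpr
    return fpr, tpr, true_pos / (true_pos + false_pos)


# Assumed toy parameters (not taken from the project).
MU_NEG, MU_POS, SIGMA, PREVALENCE = 0.0, 2.0, 1.0, 0.01

# Threshold implied by training on perfectly balanced data.
balanced_threshold = bayes_threshold(MU_NEG, MU_POS, SIGMA, 0.5)
# Threshold implied by the true deployment prevalence.
aware_threshold = bayes_threshold(MU_NEG, MU_POS, SIGMA, PREVALENCE)

balanced = deployment_metrics(balanced_threshold, MU_NEG, MU_POS, SIGMA, PREVALENCE)
aware = deployment_metrics(aware_threshold, MU_NEG, MU_POS, SIGMA, PREVALENCE)

for name, (fpr, tpr, precision) in (("balanced", balanced), ("prior-aware", aware)):
    print(f"{name:12s} FPR={fpr:.4f} TPR={tpr:.4f} precision={precision:.4f}")
```

Under these assumed parameters, the balanced threshold flags roughly 16% of all negatives and reaches a precision of only about 5% at deployment, whereas the prior-aware threshold is far more precise but misses most positives. This is the compromise between minority-class detection and balancing-introduced bias in its simplest form.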
In this project, we investigate the effect that such a balancing bias has on predictions. We explore whether and to what extent characteristics of the underlying distributions (such as dimensionality, class overlap/separation, shape and type of distribution, dispersion, (multi-)modality, sparsity and heterogeneity) govern the strength of the introduced bias, and how this bias can be accounted for and corrected. To this end, we employ a prototypical, generic ML pipeline that combines a multitude of different balancing and classification techniques and analyses the problem systematically. The aim is to discover patterns that help us understand the compromise between better class separation under imbalance and balancing-introduced classification bias. This knowledge can in turn be used to understand, estimate and correct the artificial skewness of results. We believe this to be an important yet missing aspect of decision-making in high-stakes environments in general and in clinical applications in particular.
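As a point of reference for what "correcting" such a bias can look like, the following sketch applies the standard prior-shift adjustment of predicted probabilities. It is a textbook correction, not necessarily the method developed in this project, and it rests on two assumptions: the classifier outputs calibrated posterior probabilities under the (balanced) training prior, and only the class priors differ between training and deployment. The function name and the example values are hypothetical.

```python
def correct_for_prior_shift(p_train, prior_train, prior_deploy):
    """Rescale a positive-class posterior from the training prior to the
    deployment prior, assuming unchanged class-conditional distributions."""
    pos = p_train * prior_deploy / prior_train
    neg = (1.0 - p_train) * (1.0 - prior_deploy) / (1.0 - prior_train)
    return pos / (pos + neg)


# Assumed example: model trained on balanced data (prior 0.5),
# deployed where the positive class has a prevalence of 1%.
scores = [0.5, 0.9, 0.99]
corrected = [correct_for_prior_shift(p, 0.5, 0.01) for p in scores]

for raw, adj in zip(scores, corrected):
    print(f"balanced-model score {raw:.2f} -> deployment posterior {adj:.4f}")
```

A score of 0.5 from the balanced model corresponds to a deployment posterior equal to the prevalence itself, which shows how strongly balanced training can inflate apparent minority-class probabilities. The correction is only as good as the two assumptions above, and distribution characteristics such as class overlap may affect how well they hold.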
Involved scientists
- Dr Friedemann Uschner
Funding
IMB budget