Jan 12, 2026
Publication on the NAUS method
Less can be more: A new open-access paper on imbalanced medical data in machine learning
Machine learning can support medical decision-making, but heavily imbalanced data make this much harder.
Many clinical datasets contain a large number of frequent cases (e.g., normal findings) and only a few rare ones (e.g., rare diseases or complications). These rare cases, however, are often the most important ones, and they are exactly the cases models are most likely to miss.
International collaboration: Dresden × Almaty (Kazakhstan)
Together with our collaborators from Kazakhstan, Zholdas Buribayev, Ainur Yerkos, and Zhibek Zhetpisbay, and within the ScaDS.AI environment in Dresden, Markus Wolfien has published a new open-access article in Elsevier’s Informatics in Medicine Unlocked. The paper presents NAUS (Noise-Aware Undersampling with Subsampling): a method that cleans and reduces medical datasets in a targeted way to decrease redundancy and make rare, clinically important cases more visible during training. The work was enabled by close exchange between Dresden and Almaty, including time spent collaborating on site in Dresden.
Link to the article: https://doi.org/10.1016/j.imu.2026.101731
Why “more data” is not always better
The idea “the more data, the better” is not always true in practice, especially when:
- There are many frequent cases that look very similar (a lot of repetition),
- Some frequent cases are noisy or contain errors (measurement issues, outliers, unclear labels),
- The rare cases are truly scarce.
In that situation, the large number of frequent cases can “drown out” the rare ones: the model learns very well what is common, but not what is rare and important.
Our idea: not just less data, but a smarter selection
NAUS follows a clear principle: first clean the data, then reduce them in a meaningful way.
Instead of deleting data randomly, NAUS aims to remove those frequent cases that are more likely to harm than help, while keeping the informative ones. This is done in several steps, including:
- Detecting and removing noise (e.g., contradictory or suspicious data points),
- Considering borderline cases (cases close to the decision boundary, which are often crucial for learning),
- Reducing redundancy (very similar frequent cases are reduced so rare cases do not “disappear” during training).
The goal is not to “throw away” data, but to strengthen the signal.
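The paper describes the exact procedure in detail. As a rough illustration of what "clean first, then reduce" can look like in code, here is a minimal sketch built from standard components of scikit-learn and imbalanced-learn. The specific estimators, parameters, and the synthetic dataset are our own illustrative choices, not the NAUS algorithm itself:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import (
    EditedNearestNeighbours, TomekLinks, RandomUnderSampler,
)

# Synthetic stand-in for a clinical table: ~99% frequent cases, ~1% rare cases
X, y = make_classification(
    n_samples=10_100, weights=[0.99, 0.01],
    n_features=20, n_informative=5, random_state=42,
)
print("before:", Counter(y))

# Step 1: drop noisy frequent cases that disagree with their nearest neighbours
enn = EditedNearestNeighbours(sampling_strategy="majority", n_neighbors=3)
X_res, y_res = enn.fit_resample(X, y)

# Step 2: remove frequent cases that form Tomek links near the decision boundary
tomek = TomekLinks(sampling_strategy="majority")
X_res, y_res = tomek.fit_resample(X_res, y_res)

# Step 3: thin out the remaining, largely redundant frequent cases
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = rus.fit_resample(X_res, y_res)

print("after:", Counter(y_res))
```

The order matters: cleaning noisy and borderline frequent cases first means the final reduction step works on data that already carry a clearer signal.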
An example: why less can make sense
Imagine a medical screening scenario:
- 10,000 patients are normal → frequent cases
- 100 patients have a rare disease → rare cases
Here is the problem: many of the 10,000 frequent cases are very similar. This repetition creates “a lot of data volume” but little new information. At the same time, the 100 rare cases are so few that they can easily be overlooked during training. The model then becomes very confident about the frequent situation, but detects rare cases less reliably.
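A quick back-of-the-envelope calculation shows why plain accuracy hides this problem (purely illustrative numbers, not results from the paper):

```python
# 10,000 normal patients, 100 patients with the rare disease
normal, rare = 10_000, 100
total = normal + rare

# A model that simply predicts "normal" for everyone...
accuracy = normal / total   # ~0.99 -> looks excellent
rare_recall = 0 / rare      # 0.0   -> misses every single rare case

print(f"accuracy: {accuracy:.3f}, recall on rare cases: {rare_recall:.3f}")
```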
What NAUS does:
NAUS reduces exactly these problematic frequent cases (noise, very similar examples, difficult borderline cases). As a result, rare cases become more visible during training and the model learns to recognize them more reliably.
Where was NAUS tested?
We evaluated NAUS on medical datasets, including chronic kidney disease, liver disease, and heart disease, and also on established benchmark datasets for imbalanced learning. For comparison, NAUS was tested against common approaches—methods that “add” data (oversampling methods such as SMOTE/ADASYN/LoRAS) as well as classical methods that reduce data (undersampling). To validate the cleaned/rebalanced datasets, we used machine learning models such as Random Forest, LightGBM, and Multilayer Perceptron.
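For readers who want to run a comparison of this kind on their own data, a typical evaluation setup with scikit-learn and imbalanced-learn might look like the sketch below. The dataset, samplers, and parameters are placeholders for illustration, not the exact experimental protocol of the paper:

```python
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data; in practice this would be a clinical table
X, y = make_classification(
    n_samples=5_000, weights=[0.95, 0.05],
    n_features=30, n_informative=8, random_state=0,
)

samplers = {
    "no resampling": None,
    "SMOTE": SMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "random undersampling": RandomUnderSampler(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, sampler in samplers.items():
    steps = ([("resample", sampler)] if sampler is not None else []) + [
        ("clf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ]
    pipe = Pipeline(steps)
    # F1 on the rare class is a more honest metric here than accuracy;
    # the pipeline resamples only the training folds, never the test fold
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
    print(f"{name:>22}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")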