Employing Random Sampling Techniques on Machine Learning Models: Performance Comparison

Team members: Jinghao Chen, Roxanne Alvarez, Ananta Arora, Teshani Jayasinghe
Data Analysis and Statistical Inference Course (DANA 4800), Teamwork Assessment
Instructor: Dr. Monica Nguyen
April 2024

Table of Contents

Background
Introduction
Dataset
    Data Preprocessing
Random Sampling Methods
    Undersampling
    Oversampling
Performance Evaluation
    Confusion Matrix
    Classification Report
    ROC-AUC
Conclusion
Bibliography

Background

Data imbalance is one of the most common issues in machine learning (ML) classification tasks. It refers to the unequal representation of classes in the training dataset: a minority class has significantly fewer instances than the majority class, whereas ideally all classes would be equally distributed. Because of this imbalance, the minority class becomes more difficult to predict, as there is less information for the machine learning model to learn from during training.[1] A good example of an imbalanced dataset is the proportion of emails that are spam and not spam: a model trained on such a dataset may exhibit bias towards predicting incoming emails as not spam. Data imbalance can lead to problems such as bias towards the majority class,[2] poor generalization to unseen data, and misleading evaluation of the ML model's accuracy. In this context, two random sampling techniques, oversampling and undersampling, are explored to address the issue.

[1] edX, "What Is Undersampling?," Master's in Data Science, April 2022, https://www.mastersindatascience.org/learning/statistics-data-science/undersampling/.
[2] Priyanka Dave, "From Bias to Balance: Solving Imbalanced Data Issues," Medium, September 20, 2023, https://priyanka-ddit.medium.com/how-to-deal-with-imbalanced-dataset-86de86c49#:~:text=Bias%20Toward%20Majority%20Class%3A%20The.
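The degree of imbalance is easy to quantify before any modelling. The short Python sketch below counts the instances per class and reports the majority-to-minority ratio; the file name and label column are hypothetical placeholders, not taken from the report.

```python
import pandas as pd

# Hypothetical file name and label column; the report does not specify these.
df = pd.read_csv("mimic_subset.csv")
counts = df["outcome"].value_counts()
print(counts)

# Majority-to-minority ratio: values well above 1 signal imbalance.
print(f"Imbalance ratio: {counts.max() / counts.min():.1f} : 1")
```

For the MIMIC-III subset described below (10,331 survivors versus 2,158 deaths), this ratio is roughly 4.8 : 1.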
Introduction

This study presents three ML models (Support Vector Machine – SVM, Random Forest – RF, eXtreme Gradient Boosting – XGBoost) tasked with predicting the survival outcome of mechanically ventilated patients admitted to an Intensive Care Unit (ICU). In order to accurately predict patient outcomes, a preprocessing step is necessary to handle missing and invalid values. Additionally, two random sampling techniques, undersampling and oversampling, are employed to address the heavily imbalanced data. Subsequently, the performances of the three ML models under both undersampling and oversampling are evaluated and compared to determine which technique results in more accurate predictions of patient survival outcomes.

Dataset

The Medical Information Mart for Intensive Care (MIMIC-III) dataset is a comprehensive health-related dataset that primarily focuses on patients admitted to the Intensive Care Unit at the Beth Israel Deaconess Medical Center (BIDMC) in Boston, Massachusetts, USA. It consists of 18,883 observations and 70 variables. After removing missing values and invalid ranges, a subset of 12,489 patients and 68 variables (including the response variable) is used to train and test the ML models. During the data exploration process, it is observed that the dataset is heavily skewed, with a significant disparity between the number of survived and deceased patients. Specifically, there are 10,331 survivors, while a substantial 2,158 patients did not survive (Fig. 1).

Figure 1. Class distribution of the dataset (MIMIC-III)

Data Preprocessing

Dealing with missing values is a crucial task in the exploratory data analysis process. Removing missing values without proper evaluation can lead to issues that significantly impact the results, including loss of information, analysis bias, and reduced statistical power.

Figure 2. Heatmap for the visualization of the missing values

The heatmap in Fig. 2 illustrates the proportion of missing values, represented by yellow bars, in the dataset. The vital signs of forty-one patients and the laboratory results of fifty patients are missing entirely. In total, 6,084 rows contain at least one missing value across all features and are removed. Additionally, the information on renal replacement therapy and ventilation duration is not available on the first day of patient admission; therefore, these two columns are excluded from the analysis. Furthermore, M. A. Papadakis et al. (1993) provide valuable reference information about the physiologically valid ranges for vital signs and laboratory results.[3] Using this reference, 310 observations are removed because at least one of their values falls outside the valid range. The remaining dataset contains 12,489 entries and 68 variables after removing all missing and invalid values.

[3] M. A. Papadakis et al., "Prognosis of Mechanically Ventilated Patients," The Western Journal of Medicine 159, no. 6 (1993): 659–64, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1022451/.
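A rough sketch of this preprocessing is shown below. It is a minimal illustration only: the file name, column names, and the entries in VALID_RANGES are placeholder assumptions, since the actual variable names and the Papadakis et al. ranges are not reproduced here.

```python
import pandas as pd

# Placeholder valid ranges; the real ones come from Papadakis et al. (1993).
VALID_RANGES = {"heart_rate": (20, 250), "sodium_mmol_l": (110, 170)}

df = pd.read_csv("mimic_subset.csv")  # hypothetical file name

# Drop the two columns unavailable on the first day of admission.
df = df.drop(columns=["renal_replacement_therapy", "ventilation_duration"])

# Remove rows containing at least one missing value.
df = df.dropna()

# Remove observations with any value outside its physiologically valid range.
for col, (low, high) in VALID_RANGES.items():
    df = df[df[col].between(low, high)]

print(df.shape)  # for the subset described above: (12489, 68)
```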
Random Sampling Methods

Two random sampling methods are employed to balance the training data and optimize the ML models' ability to learn the patterns of both classes, survivors and non-survivors.

Undersampling

After splitting the dataset into training and test subsets, the undersampling process is performed. The training subset, which includes 3,016 observations, is used to train the models, while the test subset (1,300 observations) is used to verify the models' predictions. The undersampling technique, as illustrated in Fig. 3, randomly removes entries from the majority class (survivors) until a balanced dataset is achieved, ensuring equal representation of both classes. The two most prevalent disadvantages of undersampling are the loss of potentially crucial information and an inaccurate representation of the real-world class distribution.

Figure 3. Illustration of the undersampling and oversampling process.

Oversampling

Oversampling is another random sampling technique used to address class imbalance. Here the training data (14,468 observations) is rebalanced by increasing the number of instances in the minority class (non-survivors) through replication of existing instances until a balance between survivors and non-survivors is achieved. The test set is likewise increased to 6,194 observations. This is illustrated in Fig. 3. Similar to undersampling, oversampling is a straightforward process that does not require complex algorithms. However, it is prone to overfitting: because it only replicates existing samples, it cannot give the model new observations that might provide additional information about the minority class.
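A minimal sketch of the two resampling steps, using the imbalanced-learn package as one possible implementation, is shown below. The report does not name its resampling tooling, so the library choice, the toy data, and the random_state values are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Toy stand-in for the MIMIC-III subset: an imbalanced binary outcome
# (0 = survived, 1 = died), roughly 5:1 like the real data.
df = pd.DataFrame({"feature": range(1200),
                   "outcome": [0] * 1000 + [1] * 200})
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature"]], df["outcome"],
    test_size=0.2, stratify=df["outcome"], random_state=42)

# Undersampling: randomly drop majority-class (survivor) rows.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# Oversampling: randomly replicate minority-class (non-survivor) rows.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

print(y_under.value_counts())  # balanced, smaller than the original training set
print(y_over.value_counts())   # balanced, larger than the original training set
```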
Performance Evaluation

This section discusses the evaluation metrics used to assess the effectiveness of the two random sampling techniques.

Confusion Matrix

A confusion matrix is a 2×2 table that visualizes the performance of an ML model on a classification problem. It contains four counts: true positives, true negatives, false negatives, and false positives. True positives (TP) are instances correctly classified by the model as positive. Likewise, true negatives (TN) are instances correctly classified as negative. False negatives (FN) are instances incorrectly classified as negative, while false positives (FP) are instances incorrectly classified as positive. Table 1 shows an example of a confusion matrix.

Table 1. Sample confusion matrix.

                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN

Figure 4. Confusion matrices of the three models fit with the undersampled dataset.
Figure 5. Confusion matrices of the three models fit with the oversampled dataset.

In a confusion matrix, the columns represent the distribution of the predicted classes while the rows represent the distribution of the actual classes. This provides a graphical representation of the model's accuracy in predicting classes by measuring true positives, true negatives, false positives, and false negatives. The performance comparison between undersampling and oversampling using the confusion matrix is illustrated in Fig. 4 and Fig. 5.

Based on the two groups of confusion matrices, certain models are better suited to either the undersampled or the oversampled dataset. In particular, the SVM models performed equally well with both techniques, as evidenced by the nearly identical confusion matrices produced. This suggests that the choice of technique does not significantly impact the performance of some models. On the other hand, the RF and XGBoost models show a clear preference for the undersampling technique. The TP and TN values in both groups support this finding. For instance, in the RF model, the TP and TN rates are almost equal for the undersampled dataset, while for the oversampled dataset the TP rate is 0.96 and the TN rate is 0.36. This suggests that the model is biased towards the positive (survival) class and does not accurately predict patient deaths.

Classification Report

Another approach to evaluating the models is the classification report method provided by the sklearn.metrics module. It generates a tabular summary of the primary classification metrics for each class, including precision, recall, F1 score, and accuracy. These metrics are all derived from the confusion matrix.

Precision is the proportion of correct positive predictions out of all positive predictions made by the model. It ranges from 0 to 1, with 1 meaning that the model produces zero false positives (FP). The formula is

$\text{Precision} = \frac{TP}{TP + FP}$

Recall measures the proportion of actual positives that are correctly identified by the model, again ranging from 0 to 1. A score of 1 indicates that the model produces no false negatives (FN). The formula is

$\text{Recall} = \frac{TP}{TP + FN}$

The F1 score is a metric that considers both precision and recall. It is calculated as the harmonic mean of the two metrics, assigning them equal importance. The score ranges from 0 to 1, where 1 indicates perfect precision and recall, meaning the model produces zero errors, while 0 indicates that either precision or recall is 0. The formula is

$F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

The F1 score comes in handy when the user has difficulty choosing between high precision and low recall or vice versa.[4]

Support is the number of samples belonging to each class.

Accuracy is the proportion of correct predictions out of all predictions. However, it should not be treated as the only metric for measuring a model's performance. The formula is

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

The macro average is the unweighted average of a metric over all classes, without considering the proportion of each class in the dataset; it treats every class equally. The weighted average averages the metric over the classes, weighting each class in proportion to its size.

[4] Vaibhav Jayaswal, "Performance Metrics: Confusion Matrix, Precision, Recall, and F1 Score," Medium, September 15, 2020, https://towardsdatascience.com/performance-metrics-confusion-matrix-precision-recall-and-f1-score-a8fe076a2262.
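Because the report names sklearn.metrics as the source of these tables, a typical call looks like the sketch below, continuing the toy variables from the resampling sketch; the model choice and the label names are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Fit one of the three models on the undersampled training data
# (X_under, y_under, X_test, y_test come from the resampling sketch).
model = RandomForestClassifier(random_state=42).fit(X_under, y_under)
y_pred = model.predict(X_test)

# Rows: actual classes; columns: predicted classes (cf. Table 1).
print(confusion_matrix(y_test, y_pred))

# Precision, recall, F1 score, and support per class, plus accuracy
# and the macro and weighted averages.
print(classification_report(y_test, y_pred, target_names=["Survival", "Death"]))
```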
Table 2. Classification reports of the three models fit with the undersampled and oversampled datasets.

Classification Report: SVM - Undersampling
               Precision   Recall   F1 Score   Support
Survival       0.73        0.76     0.74       650
Death          0.75        0.72     0.74       650
Accuracy                            0.74       1300
Macro avg.     0.74        0.74     0.74       1300
Weighted avg.  0.74        0.74     0.74       1300

Classification Report: SVM - Oversampling
               Precision   Recall   F1 Score   Support
Survival       0.74        0.76     0.75       3097
Death          0.76        0.73     0.75       3097
Accuracy                            0.75       6194
Macro avg.     0.75        0.75     0.75       6194
Weighted avg.  0.75        0.75     0.75       6194

Classification Report: RF Classifier - Undersampling
               Precision   Recall   F1 Score   Support
Survival       0.80        0.77     0.78       650
Death          0.78        0.81     0.79       650
Accuracy                            0.79       1300
Macro avg.     0.79        0.79     0.79       1300
Weighted avg.  0.79        0.79     0.79       1300

Classification Report: RF Classifier - Oversampling
               Precision   Recall   F1 Score   Support
Survival       0.60        0.96     0.74       3097
Death          0.89        0.36     0.52       3097
Accuracy                            0.66       6194
Macro avg.     0.75        0.66     0.63       6194
Weighted avg.  0.75        0.66     0.63       6194

Classification Report: XGBoost Classifier - Undersampling
               Precision   Recall   F1 Score   Support
Survival       0.79        0.77     0.78       650
Death          0.77        0.80     0.79       650
Accuracy                            0.78       1300
Macro avg.     0.78        0.78     0.78       1300
Weighted avg.  0.78        0.78     0.78       1300

Classification Report: XGBoost Classifier - Oversampling
               Precision   Recall   F1 Score   Support
Survival       0.65        0.91     0.76       3097
Death          0.85        0.50     0.63       3097
Accuracy                            0.71       6194
Macro avg.     0.75        0.71     0.69       6194
Weighted avg.  0.75        0.71     0.69       6194

It is important to note that all the metrics in the classification report are calculated from the confusion matrix. Accordingly, the SVM model achieves a similar accuracy rate under both techniques, just as it produced similar confusion matrices, whereas the other two models show different results (Table 2). For instance, when the RF classifier is trained with the undersampled dataset, its accuracy is 0.79; when trained with the oversampled dataset, its accuracy drops to 0.66. This means that the undersampled model outperforms the oversampled model, as the confusion matrices also suggested.

When the RF classifier is trained with the oversampled dataset, the report indicates that the model is good at identifying survival cases (high recall) but struggles to accurately identify death cases (low recall). The model appears to be more cautious in predicting death, resulting in high precision but low recall for the death class.

ROC-AUC

The ROC curve is a graphical representation of a binary classifier's performance as the discrimination threshold changes. The curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold settings to show how well the classifier distinguishes between positive and negative samples. The AUC (Area Under the ROC Curve) measures the entire area under the curve. It summarizes the classifier's performance across all possible classification thresholds, providing a comprehensive performance measure. The AUC ranges from 0 to 1, with higher values indicating better performance.

Figure 6. ROC curves, undersampling.
Figure 7. ROC curves, oversampling.

The true positive rate (y-axis) and false positive rate (x-axis) are calculated from the confusion matrix, so the ROC results are consistent with the earlier metrics. The SVM model again achieves similar AUC values under both techniques, while the RF and XGBoost classifiers fit with the undersampled dataset outperform the same models fit with the oversampled dataset, as illustrated in Fig. 6 and Fig. 7.
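Curves like those in Fig. 6 and 7 can be produced with sklearn.metrics along the lines of the sketch below; the probability-based scoring and the plotting details are assumptions, since the report does not show its plotting code. (For an SVM, probability estimates require probability=True at fit time, or the decision_function scores can be used instead.)

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Score each test patient with the predicted probability of the positive
# (death) class; `model`, X_test, y_test continue the earlier toy sketches.
y_score = model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"RF (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance level")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```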
Conclusion

This study assesses the impact of undersampling and oversampling techniques on ML model performance in predicting patient survival outcomes. The results indicate that both methods are effective in addressing imbalanced data. However, undersampling is more efficient at achieving a balanced dataset and improves precision and recall for certain models, such as the RF and XGBoost classifiers. In general, it is advisable to use undersampling, as it reduces the size of the data, which results in shorter training times and lower computing power consumption.

When it comes to selecting an evaluation method, a confusion matrix provides a detailed breakdown of where a classifier makes mistakes, whereas a classification report gives key metrics for quickly assessing a classifier's performance. Moreover, the ROC-AUC curve visualizes the balance between the true positive rate and the false positive rate of a classifier, providing a more direct visual way to evaluate model performance.

Data Availability

The data used in this study are available from the MIMIC repository: https://mimic.physionet.org. The certification ID obtained for this study is 13273317.

Bibliography

Ashraf, Abdallah. "Oversampling — Handling Imbalanced Data." Medium, December 23, 2023. https://medium.com/@abdallahashraf90x/oversampling-for-better-machine-learning-with-imbalanced-data-68f9b5ac2696#:~:text=Oversampling%20is%20a%20data%20augmentation.

Brownlee, Jason. "8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset." Machine Learning Mastery, June 7, 2016. https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/.

Dave, Priyanka. "From Bias to Balance: Solving Imbalanced Data Issues." Medium, September 20, 2023. https://priyanka-ddit.medium.com/how-to-deal-with-imbalanced-dataset-86de86c49#:~:text=Bias%20Toward%20Majority%20Class%3A%20The.

edX. "What Is Undersampling?" Master's in Data Science, April 2022. https://www.mastersindatascience.org/learning/statistics-data-science/undersampling/.

Jayaswal, Vaibhav. "Performance Metrics: Confusion Matrix, Precision, Recall, and F1 Score." Medium, September 15, 2020. https://towardsdatascience.com/performance-metrics-confusion-matrix-precision-recall-and-f1-score-a8fe076a2262.

Papadakis, M. A., K. K. Lee, W. S. Browner, D. L. Kent, D. B. Matchar, M. K. Kagawa, J. Hallenbeck, D. Lee, R. Onishi, and G. Charles. "Prognosis of Mechanically Ventilated Patients." The Western Journal of Medicine 159, no. 6 (1993): 659–64. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1022451/.