California housing II - Balancing imbalanced data and its consequences
Welcome to the second installment in this series of articles! Today we will see some very useful ways to deal with imbalanced data and, furthermore, measure the impact these techniques have on different algorithms trained on each resampled dataset. These algorithms are some of the most widely used in competitions, so it is great to test them in combination with the “balancing” preprocessing. All of this will be done, still, with Python and a little help from its friends. On this occasion, they are Pandas, imbalanced-learn, Scikit-learn, Keras and XGBoost.
Introducing today’s subject: balancing an imbalanced dataset is essential for classification problems, as one wants to learn how to classify all categories equally well. This is of the utmost importance in many real cases for preventing bias. For instance, consider a task where a model has to decide whether a candidate should be considered for a specific job, and suppose that in our database the majority of people who currently hold similar positions are male. The model might learn to discard all female candidates just because the data was imbalanced.
The focus throughout the article will be on these resampling techniques, as they are called, and their effects on the models. No emphasis will be placed on dealing with bias, variance, structural risk or any other problem that may arise, so as not to bite off more than we can chew.
Recap
Last time we saw, among other things, how to make the dataset smaller by modifying the variables. We are not working with data of huge dimensionality, so perhaps doing PCA is not vital in this case, unless we were considering using some algorithm that requires axis-parallel data (we saw in the last article that PCA projects the observations onto a new, smaller, orthogonal set of axes), which today we are not.
We also saw some other methods that analyze the importance of each variable for the task at hand with a few hypothesis tests, we standardized the data and we did some plotting to check whether we could see the reasons behind the outputs of those tests.
Resampling
As our target variable was originally continuous (an uncountably infinite set), in order to classify it we have to discretize it: choose some arbitrary number of bins over ranges of its value. This is not a great decision but, for the purposes of this article, it will do. Let us make the first three bins the quartile ranges (from 0 to 0.25, 0.25 to 0.5 and 0.5 to 0.75 in sample probability) and two more bins the ranges from 0.75 to 0.9 and from 0.9 to 1. This way, we are guaranteed to have some imbalance in the data.
It is important to observe that this division will induce bias in all models: the quantiles were arbitrarily chosen to split a variable that was continuous; we are not partitioning the data according to some observed clustering in it. This will likely affect the models’ ability to discern one class from another.
df_normal["quantile"] = pd.qcut(home_data["median_house_value"],
[0., .25, .5, .75, .9, 1.],
labels=False)
After also standardizing the data, we can check how many data points there are for each category given by the quantiles.
Category count:
0 4253
1 4255
2 4245
3 2548
4 1699
Name: quantile, dtype: int64
Category frequency:
0 0.250176
1 0.250294
2 0.249706
3 0.149882
4 0.099941
Name: quantile, dtype: float64
As intended, we now have an imbalanced dataset. There are many ways to deal with this problem. One way to classify these methods is into undersampling and oversampling. The first group of techniques addresses how to take a further sample from the dataset so that a specific category ends up with fewer observations than it currently has. The latter does the opposite: it tries to make a category have more observations.
We will use the imbalanced-learn library for Python to show these techniques.
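For reference, the snippets below assume roughly the following imports (the exact module layout may vary slightly between imbalanced-learn versions):
import numpy as np
import pandas as pd
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek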
Tomek links
Tomek links exist between pairs of nearest neighbors in the dataset: if two given examples are each other’s nearest neighbor (NN), then a Tomek link exists between them. Notice that the condition is mutual: each point must be the other’s nearest neighbor.
If we look for Tomek links in our dataset, we can go ahead and check which class each example in the pair belongs to. It is possible that they both belong to the same one but if they do not, we can examine what the balance between those classes is in the whole dataset. If the two classes are imbalanced, then this sampling algorithm allows us to delete the observation from the majority class, in order to balance the amount of examples of the two classes.
Thinking about linear separation and SVMs, we can intuit that it might help us get a larger margin and a simpler decision function. We will see if it has any effects on the SVM’s ability to generalize.
tl = TomekLinks(return_indices=True,
sampling_strategy="majority") # sampling_strategy='majority' by default
X = df_normal.drop(['quantile'], axis=1).copy(deep=True)
y = np.array(df_normal["quantile"])
df_normal_tl, tl_quantile, id_tl = tl.fit_sample(X, y)
df_normal_tl = pd.DataFrame(data={col:df_normal_tl[:, i] \
for i, col in enumerate(df_normal.columns[:-1])})
df_normal_tl['quantile'] = tl_quantile
print(df_normal_tl['quantile'].value_counts(normalize=True),
f"\nTotal remaining observations: {df_normal_tl['quantile'].value_counts().sum()}.")
Category count:
0 4253
1 3895
2 4245
3 2548
4 1699
Name: quantile, dtype: int64
Category frequency:
0 0.255589
1 0.234075
2 0.255108
3 0.153125
4 0.102103
Name: quantile, dtype: float64
Total remaining observations: 16640.
In the output we can see that 360 rows were deleted from the original total of 17000; more specifically, from the second category, which was the majority class by just a couple of observations. This makes the frequency of all the other categories increase. Recall that the first, second and third bins were all supposed to have a 0.25 frequency because they were picked as quartiles. However, the frequencies of the first and third bins increased when the algorithm deleted the points in the Tomek links that belonged to the second bin. We still see a noticeable imbalance between the classes.
If you look at the code, the implementation was called with sampling_strategy equal to “majority”. This makes the algorithm undersample just the majority class. The implementation works one-vs-all and not pairwise (one-vs-one), which would be worth trying.
Other techniques, such as edited nearest neighbors (ENN), are less restrictive in their NN condition —recall that Tomek links only exist if both points are each other’s NN. With a more flexible condition, ENN may allow for further balancing, albeit at the expense of deleting more points, which may not be desirable. Today we are sticking with this algorithm.
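For the curious, swapping in ENN with imbalanced-learn would look roughly like this (a sketch, not used in the rest of the article):
from imblearn.under_sampling import EditedNearestNeighbours

# ENN removes any sample whose class disagrees with the majority of its
# n_neighbors nearest neighbours, a looser criterion than Tomek links.
# fit_sample was renamed fit_resample in recent imbalanced-learn versions.
enn = EditedNearestNeighbours(n_neighbors=3)
X_enn, y_enn = enn.fit_sample(X, y)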
SMOTE
Let us then try to address the issue of imbalance in a different fashion: by performing oversampling instead. We said that oversampling is the opposite of undersampling and that the latter essentially implies getting rid of data. This can be problematic for training if we do not have a considerable amount of data, as many algorithms need vast amounts of data to thrive. But how do we oversample? How can we “see” more examples from a class if we already have a fixed dataset? We could reuse the same data points many times, but that would not help the model see more of the variable space covered, as it only places more points in the same place.
What needs to be done is synthesizing new data. There are many techniques for doing this, and all of them are fun. We will see the synthetic minority oversampling technique (SMOTE). As its name implies, it creates new, synthetic data points for minority classes.
How does it work? For a given minority-class point, SMOTE finds its nearest neighbors within the same class (with KNN), picks one of them and synthesizes a new point on the line segment that connects the two. There are other algorithms that work similarly and even some combinations of SMOTE with other methods that develop this further, but we will stick with the vanilla one for now.
smote = SMOTE(sampling_strategy="minority") # default
X_sm, y_sm = smote.fit_sample(X, y)
df_normal_sm = pd.DataFrame(data={col:X_sm[:, i] \
for i, col in enumerate(df_normal.columns[:-1])})
df_normal_sm['quantile'] = y_sm
print(df_normal_sm['quantile'].value_counts(normalize=True),
f"\nTotal remaining observations: {df_normal_sm['quantile'].value_counts().sum()}.")
Category count:
0 4253
1 4255
2 4245
3 2548
4 4255
Name: quantile, dtype: int64
Category frequency:
0 0.217478
1 0.217580
2 0.217069
3 0.130292
4 0.217580
Name: quantile, dtype: float64
Total resulting observations: 19556.
We now have 2556 new samples in addition to the original 17k. Compare this and the change in frequencies against the result of the undersampling. Just beautiful. We still see that the 4th category frequency is quite below the other ones’, though.
Again, the library works with one-versus-all, so we can oversample only the minority class, all of them, all but the majority, et cetera. The implementation allows for a specific amount of new synthesized observations for each class, but let us stick with this result.
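For illustration, requesting a specific number of synthetic observations per class is done by passing a dictionary as sampling_strategy (the targets below are hypothetical and not used further):
# Ask for roughly 4000 samples in the two smallest classes; the rest stay as they are.
smote_dict = SMOTE(sampling_strategy={3: 4000, 4: 4000})
X_sm2, y_sm2 = smote_dict.fit_sample(X, y)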
As we are dealing with more-than-three-dimensional data, we humans are not able to plot it unless we employ some manifold technique, which is not the purpose of this article —but it was part of the previous one, in which we actually used a nice manifold technique. It is illuminating to see what these methods do, though, so I encourage you to visit the imbalanced-learn documentation and see their plots, which are as beautiful as they are thorough.
Combined
We are not limited to applying one of these techniques alone: we can apply both undersampling and oversampling, so let us take it a step further.
st = SMOTETomek(sampling_strategy='all')
X_uo, y_uo = st.fit_sample(X, y)
df_normal_uo = pd.DataFrame(data={col:X_uo[:, i] \
for i, col in enumerate(df_normal.columns[:-1])})
df_normal_uo['quantile'] = y_uo
print(df_normal_uo['quantile'].value_counts(normalize=True),
f"\nTotal remaining observations: {df_normal_uo['quantile'].value_counts().sum()}.")
Category count:
0 4069
1 3895
2 4014
3 4183
4 4246
Name: quantile, dtype: int64
Category frequency:
0 0.199392
1 0.190866
2 0.196697
3 0.204979
4 0.208066
Name: quantile, dtype: float64
Total remaining observations: 20407.
Almost perfect fifths. We can now say we have a balanced dataset. Notice we set sampling_strategy to “all”.
Models
Now that we have standardized the data and pushed towards more balanced datasets, we will see what the impact of the different methods is on the training and prediction stages. We will be addressing the task as a classification one, not as a regression, because resampling based on the target variable distribution is not recommended in a regression problem. Here is a good explanation of why that is.
We will consider three different learners: support vector machines (SVM), artificial neural networks (ANN) and XGBoost (extreme gradient boosting).
SVM
First, we will try to tackle the classification task with a support vector machine. SVMs find a subset of points from the dataset that allows the machine to create a surface (or hypersurface, with more than three variables) that splits the clouds of data points according to their category with a large margin.
Their parameter C controls the complexity of the model by weighting the importance of correctly classifying every observation. Varying it gives more or fewer support vectors as a result: more vectors mean a less smooth decision function. It is a kind of regularization. A high C will therefore force complexity and might even incur overfitting, while a low C makes for a sparser model, possibly allowing higher training error. Another parameter, given that we will use kernels here, is γ. In the case of the radial-basis function (RBF) kernel, γ controls how “far” each point extends its (and its class’s) influence.
from sklearn import svm

svc_rbf_uo = svm.SVC(kernel="rbf",
                     gamma="scale",
                     C=C,  # C chosen beforehand; no hyperparameter tuning today
                     decision_function_shape="ovo",
                     probability=True,
                     verbose=True)
svc_rbf_uo.fit(df_normal_uo.drop(["median_house_value",
                                  "quantile"], axis=1),
               df_normal_uo["quantile"])
The code shows the fit on the under- and oversampled data, but the process is the same for all sets. We will not be doing any hyperparameter tuning today.
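To obtain the predictions analyzed in the results section, a call along these lines suffices (a sketch; predict_data is assumed to hold the held-out evaluation observations with the same columns as the training frame):
# Predict the class of each held-out observation with the fitted SVM.
predicted_class_f = svc_rbf_uo.predict(predict_data.drop(["median_house_value",
                                                           "quantile"], axis=1))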
ANN
We will try a vanilla, feed-forward artificial neural network (ANN). We can think of a neuron as an element that performs two operations in series. Each neuron has a vector that describes a certain feature or characteristic and the neuron gets activated when its input shows that particular feature. How does it happen? The two operations each neuron performs are: first it computes the inner product of its vector and the vector that comprises all the inputs, which can be thought of as a measure of similarity; and then the output of this inner product is usually fed to a non-linear function, dubbed the “activation”. The latter allows the ANN to learn non-linear relations between the variables, which would not be possible otherwise. An ANN is a set of layers, connected in series, comprised of many neurons each. We will see then how well we can predict median_house_value
using a hypothesis space (the set of all possible neural networks with our chosen architecture) capable of fitting non-linear functions.
The last layer has as many neurons as there are categories to classify and these neurons have to learn to activate according to the category that has the highest probability of being correct. The architecture is as follows:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

ann = Sequential()
for block in range(len(ann_hp['n_neurons'])):
    ann.add(Dense(ann_hp['n_neurons'][block], input_shape=(8,)))  # 8 input features
    ann.add(Dropout(ann_hp['dropout_rate']))
    ann.add(Activation(ann_hp['intermediate_activation']))
ann.add(Dense(5))               # one output neuron per category
ann.add(Activation('softmax'))
where ann_hp is a dictionary with the set of hyperparameters. We will use a four-layer-deep network with 512, 1024, 1024 and 512 neurons in its hidden layers, respectively.
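For concreteness, the dictionary could look like the following sketch (only the layer sizes and the dropout rate are stated in the text; the intermediate activation is an assumption):
ann_hp = {'n_neurons': [512, 1024, 1024, 512],   # hidden layer widths
          'dropout_rate': 0.33,                  # see the training setup below
          'intermediate_activation': 'relu'}     # assumed; any non-linearity works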
The structural complexity or VC dimension of a neural network is not given by just one parameter like the C in the SVM. As we add more neurons and layers, it is more likely that the model will be able to overfit the training data. One way to regularize them is to force the neuron weights to be small, close to zero, by adding a term to the loss function that is proportional to their norms. Another way is to make each neuron “unreliable” by activating and deactivating it with some random probability during training, which is called “dropout”; you can see it implemented in the last bit of code above.
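As an illustration of the first approach (weight penalties, not used in this article’s network), Keras lets you attach a regularizer to a layer:
from keras import regularizers
from keras.layers import Dense

# Adds a term proportional to the squared weight norm (here with an arbitrary
# factor of 1e-4) to the loss, pushing this layer's weights towards zero.
penalized_layer = Dense(512, kernel_regularizer=regularizers.l2(1e-4))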
To proceed, we one-hot encode our target variable in each of the resampled datasets, use half the test set as a validation one and put the neural network to train for 100 epochs with the rectified Adam optimizer, categorical cross-entropy loss, a dropout rate of 0.33 and the shown architecture.
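A minimal sketch of that setup, with hypothetical names for the splits (the rectified Adam optimizer is not part of stock Keras, so an add-on implementation would be passed in its place):
from keras.utils import to_categorical

# One-hot encode the five categories, as required by categorical cross-entropy.
y_train_oh = to_categorical(y_train, num_classes=5)
y_val_oh = to_categorical(y_val, num_classes=5)

# Replace 'adam' with a rectified Adam implementation to match the article's setup.
ann.compile(optimizer='adam',
            loss='categorical_crossentropy',
            metrics=['accuracy'])
history = ann.fit(X_train, y_train_oh,
                  validation_data=(X_val, y_val_oh),
                  epochs=100)   # batch size left at its default here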
XGBoost
Lastly, we will see how XGBoost performs on the classification task. It is a relatively young gradient-boosted decision-tree algorithm that has gained quite a lot of popularity very fast. It performs boosting (sequentially adding weak trees) guided by the loss function, so that each added tree must help decrease it.
ddev = xgb.DMatrix(X_dev, label=y_dev)
dtest = xgb.DMatrix(X_test, label=y_test)
num_rounds = 5
params = {'max_depth':6,
'gamma':.1,
'eta':.3,
'objective':'multi:softmax',
'num_class':5}
dtrain_orig = xgb.DMatrix(df_normal.drop(['median_house_value', 'quantile'],
axis=1),
label=df_normal['quantile'])
watchlist = [(dtrain_orig, 'train'), (ddev, 'dev')]
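The boosting step itself is then just the following (a minimal sketch; the same call was repeated for each resampled DMatrix):
# Train for num_rounds boosting rounds, reporting the multiclass error (merror)
# on both sets in the watchlist, as shown in the results below.
bst_orig = xgb.train(params, dtrain_orig, num_rounds, evals=watchlist)

# With 'multi:softmax', predict() returns hard class labels directly.
predicted_class = bst_orig.predict(dtest)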
Results
We will now see the results for each specific model-dataset combination. If you just want to see the summary metric tables, you can skip the following and go directly to the “Discussion” section.
SVM results
Non-resampled data results
Let us see a small sample of the prediction output:
prediction class truth
787 2 2
2066 4 4
2649 1 1
481 4 4
2545 2 1
95 0 0
2000 0 0
195 1 1
2415 1 1
2374 0 0
2181 1 2
1592 2 4
801 0 0
2030 0 0
258 0 0
In the sample we can see many correctly-classified observations, as well as some incorrect ones, e.g. index 2545.
A confusion matrix lets us see how many points were correctly classified and how many were not by simultaneously showing us which class they were assigned to and which class they really belong to. With the argument order used below, Scikit-learn places the true labels in the rows and the labels given by the model in the columns. Therefore, diagonal elements correspond to accurate classifications by the model and off-diagonal ones to inaccurate classifications. We will use the Scikit-learn implementation.
conf_mat = confusion_matrix(predict_data['quantile'],
predicted_class_f)
print(f'Confusion matrix:\n{conf_mat}')
Confusion matrix:
[[584 122 21 2 2]
[144 505 137 8 1]
[ 20 145 490 71 2]
[ 8 25 172 223 38]
[ 4 4 26 91 155]]
We see that the diagonal elements, which correspond to predictions matching the ground truth, are larger than the off-diagonal elements, but it is not very clear how good or bad this is. In order to get a better idea, we have to normalize it:
Normalized confusion matrix
[[0.79890561 0.16689466 0.02872777 0.00273598 0.00273598]
[0.18113208 0.63522013 0.17232704 0.01006289 0.00125786]
[0.02747253 0.19917582 0.67307692 0.09752747 0.00274725]
[0.01716738 0.05364807 0.36909871 0.47854077 0.08154506]
[0.01428571 0.01428571 0.09285714 0.325 0.55357143]]
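The normalized matrix above simply divides each entry by its row total, for instance:
# Each row is scaled by the number of observations of that true class,
# so each diagonal entry becomes the fraction of the class classified correctly.
norm_conf_mat = conf_mat / conf_mat.sum(axis=1, keepdims=True)
print(f'Normalized confusion matrix:\n{norm_conf_mat}')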
Some classes, like the first one, are classified with great accuracy, correctly approximately four out of five times. But other classes fare rather poorly: the fourth one is classified correctly less than half of the time. The off-diagonal elements are false positives or false negatives, depending on how one looks at it.
As much information as the confusion matrix gives, it would be much easier if we had just one number to score the model’s performance. That is where metrics like F1 come in. F1 is the harmonic mean of the model’s precision and recall, so it weights the α and β risks (false positives and false negatives) the same.
print(f'SVM prediction accuracy without resampling: {svm_accuracy*100:.4}%.')
print(f'Average F1 score: {F1_multiclass(conf_mat):.4}.')
Here I am using my own implementation of the F1 score that calculates it from the confusion matrix, but it is also available in Scikit-learn.
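For reference, a minimal sketch of such a helper (a macro-averaged F1 computed directly from the confusion matrix; the exact averaging in the original implementation may differ):
import numpy as np

def F1_multiclass(conf_mat):
    """Macro-averaged F1 from a confusion matrix (true classes in rows, predictions in columns)."""
    tp = np.diag(conf_mat).astype(float)
    precision = tp / conf_mat.sum(axis=0)   # column sums: all samples predicted as each class
    recall = tp / conf_mat.sum(axis=1)      # row sums: all samples truly in each class
    f1 = 2 * precision * recall / (precision + recall)
    return np.mean(f1)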
SVM prediction accuracy without resampling: 65.23%.
Average F1 score: 0.6411.
We can see that F1 is a little lower than accuracy. That is normal, as accuracy only counts overall hits and does not account for how false positives and false negatives are distributed across classes.
Now that we have met F1, let us see what the same metrics were on the training set:
SVM training accuracy, no resampling: 73.18%.
SVM training F1 score, no resampling: 0.7249.
The gap between the performance on the training and evaluation sets is a sign of overfitting. It means the model was able to “memorize” the training data pretty well; so much so that, when it comes to predicting on a new set —validation or evaluation—, it does not classify it (generalize) as well as it did on the original data. The techniques to prevent overfitting will not be covered in this article. We will settle for this model and use it across the different datasets to see what their effect is.
Undersampled data results
Performance on the training set was:
SVM training accuracy, undersampled data: 73.89%.
SVM training F1 score, undersampled data: 0.731.
Let us take a look at the prediction results of the same kind of SVM trained with the undersampled data by Tomek links deletion.
Confusion matrix with undersampled data:
[[578 124 24 3 2]
[147 491 144 11 2]
[ 23 138 480 84 3]
[ 9 25 162 233 37]
[ 4 4 26 90 156]]
Normalized confusion matrix
[[0.79069767 0.16963064 0.03283174 0.00410397 0.00273598]
[0.18490566 0.61761006 0.18113208 0.01383648 0.00251572]
[0.03159341 0.18956044 0.65934066 0.11538462 0.00412088]
[0.0193133 0.05364807 0.34763948 0.5 0.07939914]
[0.01428571 0.01428571 0.09285714 0.32142857 0.55714286]]
SVM prediction accuracy with undersampled data: 64.6%.
Average F1 score: 0.6373.
Given that SVMs work by finding wide regions that separate classes, it can be surprising that the overall performance dropped. Nevertheless, we have to remember that the undersampling was not performed one-versus-one for all categories, so “no wedge was driven” between all classes and therefore we do not have a large, safe margin between all of them.
Oversampled data results
Printing the same kind of results, for training:
SVM training accuracy, oversampled data: 72.09%.
SVM training F1 score, oversampled data: 0.7106.
And for prediction:
Confusion matrix:
[[578 121 24 1 7]
[146 489 146 6 8]
[ 20 146 475 58 29]
[ 8 22 165 157 114]
[ 5 4 24 30 217]]
Normalized confusion matrix:
[[0.79069767 0.16552668 0.03283174 0.00136799 0.00957592]
[0.1836478 0.61509434 0.1836478 0.00754717 0.01006289]
[0.02747253 0.20054945 0.65247253 0.07967033 0.03983516]
[0.01716738 0.0472103 0.35407725 0.33690987 0.24463519]
[0.01785714 0.01428571 0.08571429 0.10714286 0.775 ]]
SVM classification with oversampled dataset: 63.87%.
Average F1 score: 0.621.
Knowing how SVMs work, synthesizing new, fake data might just add noise and make the algorithm’s job harder. That might explain the drop in performance from the original dataset.
Under- & oversampled data results
Training:
SVM training accuracy, under- and oversampled data: 71.45%.
SVM training F1 score, under- and oversampled data: 0.7176.
Prediction:
Confusion matrix:
[[578 122 21 4 6]
[149 483 145 12 6]
[ 22 133 425 128 20]
[ 9 20 106 251 80]
[ 5 4 17 52 202]]
Normalized confusion matrix:
[[0.79069767 0.16689466 0.02872777 0.00547196 0.00820793]
[0.18742138 0.60754717 0.18238994 0.01509434 0.00754717]
[0.03021978 0.18269231 0.58379121 0.17582418 0.02747253]
[0.0193133 0.04291845 0.22746781 0.53862661 0.17167382]
[0.01785714 0.01428571 0.06071429 0.18571429 0.72142857]]
SVM classification with under and oversampled dataset: 64.63%.
Average F1 score: 0.6427.
With both techniques combined, we see a small improvement in F1 over the original data, the best among the SVMs, but a slightly lower accuracy than with the original data.
ANN results
Non-resampled data results
Let us see the last five training epochs details.
Epoch 96/100
17000/17000 [==============================] - 5s 316us/step - loss: 0.7381 - acc: 0.6895 - val_loss: 0.8578 - val_acc: 0.6760
Epoch 97/100
17000/17000 [==============================] - 5s 318us/step - loss: 0.7382 - acc: 0.6848 - val_loss: 0.8575 - val_acc: 0.6740
Epoch 98/100
17000/17000 [==============================] - 6s 330us/step - loss: 0.7364 - acc: 0.6898 - val_loss: 0.8574 - val_acc: 0.6753
Epoch 99/100
17000/17000 [==============================] - 5s 320us/step - loss: 0.7343 - acc: 0.6895 - val_loss: 0.8581 - val_acc: 0.6727
Epoch 100/100
17000/17000 [==============================] - 6s 324us/step - loss: 0.7348 - acc: 0.6931 - val_loss: 0.8568 - val_acc: 0.6760
Average validation F1 score: 0.6668.
The loss, categorical cross-entropy, might not be very intuitive for everyone, but the accuracy tells us that the model “hits the mark” 67.6% of the time on the validation set. The difference between the accuracy during training and validation (around 1.7%) shows a little overfitting, but we will dwell on that in another post.
Let us now see the performance on the evaluation:
Confusion matrix - ANN, non-resampled data:
[[277 58 10 0 0]
[ 64 259 83 6 0]
[ 6 63 258 38 0]
[ 3 16 64 128 22]
[ 2 2 9 41 91]]
Normalized ANN confusion matrix, non-resampled data:
[[0.7869318 0.14572865 0.02358491 0. 0. ]
[0.18181819 0.6507538 0.19575472 0.02816901 0. ]
[0.01704546 0.15829146 0.6084906 0.17840375 0. ]
[0.00852273 0.040201 0.1509434 0.600939 0.19469027]
[0.00568182 0.00502513 0.02122642 0.19248827 0.8053097 ]]
ANN classification accuracy with non-resampled dataset: 67.53%.
Average F1 score: 0.6736.
Accuracy went from 69.31% on the training set to 67.6% on the validation set and 67.53% on the evaluation set. The F1 score was 0.6668 on the validation set and 0.6736 on the evaluation set. This kind of “bounce back” in performance (both in terms of accuracy and F1 scores) during evaluation could mean that the validation set turned out, by chance, a little harder after its sampling, but the difference is small. The difference in accuracy between training and evaluation amounts to 1.8%, which, again, shows a little overfitting that we will cover later on.
Undersampled data results
Let us see the same results for the undersampled data.
Epoch 96/100
16640/16640 [==============================] - 5s 330us/step - loss: 0.7312 - acc: 0.6927 - val_loss: 0.8637 - val_acc: 0.6673
Epoch 97/100
16640/16640 [==============================] - 6s 332us/step - loss: 0.7305 - acc: 0.6963 - val_loss: 0.8636 - val_acc: 0.6707
Epoch 98/100
16640/16640 [==============================] - 6s 331us/step - loss: 0.7281 - acc: 0.6919 - val_loss: 0.8667 - val_acc: 0.6660
Epoch 99/100
16640/16640 [==============================] - 5s 328us/step - loss: 0.7335 - acc: 0.6925 - val_loss: 0.8627 - val_acc: 0.6693
Epoch 100/100
16640/16640 [==============================] - 5s 329us/step - loss: 0.7315 - acc: 0.6971 - val_loss: 0.8632 - val_acc: 0.6707
Average validation F1 score: 0.6583.
Confusion matrix - ANN, undersampled data (Tomek links):
[[289 46 10 0 0]
[ 76 250 79 7 0]
[ 7 56 269 33 0]
[ 5 15 64 128 21]
[ 3 1 10 42 89]]
Normalized ANN confusion matrix, undersampled data:
[[0.7605263 0.125 0.02314815 0. 0. ]
[0.2 0.6793478 0.18287037 0.03333334 0. ]
[0.01842105 0.1521739 0.6226852 0.15714286 0. ]
[0.01315789 0.04076087 0.14814815 0.60952383 0.19090909]
[0.00789474 0.00271739 0.02314815 0.2 0.8090909 ]]
ANN classification accuracy with undersampled dataset: 68.33%.
Average F1 score: 0.6778.
Now we have an accuracy of 69.71% on the training set, 67.07% on the validation set and 68.33% during evaluation. The F1 went from 0.6583 in validation to 0.6778 in evaluation. Again, this bounce back implies that the distribution of data in the validation set might, by chance, be “harder” to classify.
Oversampled data results
The results of the training process for the dataset oversampled with SMOTE are:
Epoch 96/100
20405/20405 [==============================] - 7s 339us/step - loss: 0.7017 - acc: 0.7024 - val_loss: 0.8727 - val_acc: 0.6713
Epoch 97/100
20405/20405 [==============================] - 7s 342us/step - loss: 0.7009 - acc: 0.7037 - val_loss: 0.8766 - val_acc: 0.6687
Epoch 98/100
20405/20405 [==============================] - 7s 340us/step - loss: 0.6993 - acc: 0.7077 - val_loss: 0.8761 - val_acc: 0.6647
Epoch 99/100
20405/20405 [==============================] - 7s 341us/step - loss: 0.7004 - acc: 0.7053 - val_loss: 0.8734 - val_acc: 0.6673
Epoch 100/100
20405/20405 [==============================] - 7s 339us/step - loss: 0.6985 - acc: 0.7049 - val_loss: 0.8738 - val_acc: 0.6713
Average validation F1 score: 0.6668.
And the evaluation process:
Confusion matrix - ANN, oversampled data (SMOTE):
[[274 60 9 1 1]
[ 72 252 76 11 1]
[ 6 64 223 69 3]
[ 3 16 38 145 31]
[ 3 1 7 31 103]]
Normalized ANN confusion matrix, oversampled data:
[[0.76536316 0.15267175 0.02549575 0.00389105 0.00719424]
[0.20111732 0.64122134 0.21529745 0.04280156 0.00719424]
[0.01675978 0.16284987 0.63172805 0.26848248 0.02158273]
[0.00837989 0.04071247 0.10764872 0.5642023 0.22302158]
[0.00837989 0.00254453 0.01983003 0.12062257 0.7410072 ]]
ANN classification accuracy with oversampled dataset: 66.47%.
Average F1 score: 0.6688.
Accuracy went from 70.49% to 67.13% to 66.47% and the F1 score from 0.6668 on the validation set to 0.6688 on the evaluation one. This model shows a little more overfitting and performs a little worse than the same network trained on the original data or the undersampled ones.
Under- & oversampled data results
Printing the results of the training and evaluation process for the data that underwent both the under- and oversampling techniques, we see:
Epoch 96/100
20405/20405 [==============================] - 7s 329us/step - loss: 0.6943 - acc: 0.7068 - val_loss: 0.8843 - val_acc: 0.6600
Epoch 97/100
20405/20405 [==============================] - 7s 330us/step - loss: 0.6922 - acc: 0.7082 - val_loss: 0.8826 - val_acc: 0.6607
Epoch 98/100
20405/20405 [==============================] - 7s 330us/step - loss: 0.6972 - acc: 0.7027 - val_loss: 0.8860 - val_acc: 0.6587
Epoch 99/100
20405/20405 [==============================] - 7s 333us/step - loss: 0.6955 - acc: 0.7062 - val_loss: 0.8845 - val_acc: 0.6573
Epoch 100/100
20405/20405 [==============================] - 7s 329us/step - loss: 0.6936 - acc: 0.7105 - val_loss: 0.8838 - val_acc: 0.6573
Average validation F1 score: 0.6487.
Confusion matrix - ANN, under- and oversampled data (SMOTE+Tomek):
[[275 58 9 2 1]
[ 74 250 78 9 1]
[ 7 61 228 61 8]
[ 3 13 42 144 31]
[ 3 1 7 30 104]]
Normalized ANN confusion matrix, under- and oversampled data:
[[0.7596685 0.15143603 0.02472528 0.00813008 0.00689655]
[0.2044199 0.6527415 0.21428572 0.03658536 0.00689655]
[0.01933702 0.15926893 0.62637365 0.24796748 0.05517241]
[0.00828729 0.03394256 0.11538462 0.58536583 0.2137931 ]
[0.00828729 0.00261097 0.01923077 0.12195122 0.7172414 ]]
ANN classification accuracy with under- and oversampled dataset: 66.73%.
Average F1 score: 0.6702.
Again, some overfitting and bounce back during evaluation. Even though validation performance is the lowest out of all four datasets, evaluation shows it is still in the ballpark.
XGBoost results
Non-resampled data results
Training output:
[0] train-merror:0.388294 dev-merror:0.423329
[1] train-merror:0.359941 dev-merror:0.390564
[2] train-merror:0.350176 dev-merror:0.383355
[3] train-merror:0.335235 dev-merror:0.38401
[4] train-merror:0.325471 dev-merror:0.380079
XGBoost train accuracy, no resampling: 67.45%.
XGBoost train F1 score, no resampling: 0.6679.
Evaluation output:
XGBoost confusion matrix, non-resampled data:
[[300 37 9 0 0]
[ 90 178 95 4 0]
[ 18 68 243 39 4]
[ 9 17 95 109 14]
[ 3 4 22 42 74]]
Normalized XGBoost confusion matrix, non-resampled data:
[[0.71428573 0.12171052 0.01939655 0. 0. ]
[0.21428572 0.5855263 0.20474137 0.02061856 0. ]
[0.04285714 0.2236842 0.5237069 0.20103092 0.04347826]
[0.02142857 0.05592105 0.20474137 0.5618557 0.1521739 ]
[0.00714286 0.01315789 0.04741379 0.21649484 0.8043478 ]]
XGBoost test accuracy, no resampling: 61.33%.
XGBoost test F1 score, no resampling: 0.6035.
So far, we can say that the confusion matrix could look worse, but it could look better, too. Evaluation accuracy and F1 were better for the other algorithms, but we will do a more thorough cross-model comparison in the Discussion section. Again, there is considerable overfitting, but we will not dig into it today.
Undersampled data results
Training output:
[0] train-merror:0.373777 dev-merror:0.40498
[1] train-merror:0.349251 dev-merror:0.382045
[2] train-merror:0.334696 dev-merror:0.378768
[3] train-merror:0.320203 dev-merror:0.377457
[4] train-merror:0.30986 dev-merror:0.379423
XGBoost train accuracy, undersampled data: 69.01%.
XGBoost train F1 score, undersampled data: 0.6821.
Evaluation output:
XGBoost confusion matrix, undersampled data:
[[300 36 7 2 1]
[ 84 193 80 10 0]
[ 15 71 232 52 2]
[ 7 23 73 126 15]
[ 2 5 19 47 72]]
Normalized XGBoost confusion matrix, undersampled data:
[[0.7352941 0.1097561 0.01703163 0.00843882 0.01111111]
[0.20588236 0.5884146 0.19464721 0.04219409 0. ]
[0.03676471 0.21646342 0.5644769 0.21940929 0.02222222]
[0.01715686 0.07012195 0.17761557 0.5316456 0.16666667]
[0.00490196 0.0152439 0.04622871 0.19831224 0.8 ]]
XGBoost test accuracy, undersampled data: 62.62%.
XGBoost test F1 score, undersampled data: 0.6161.
Even though undersampling was performed on only one class, it intuitively makes sense that evaluation performance improves a bit if we “make room” by deleting points in Tomek links, because XGBoost splits the data in each tree in a way that minimizes the loss. Boosted trees do not need a margin the way SVMs do; they just split the data points.
Oversampled data results
Training output:
[0] train-merror:0.363674 dev-merror:0.404325
[1] train-merror:0.341992 dev-merror:0.386632
[2] train-merror:0.333913 dev-merror:0.386632
[3] train-merror:0.322561 dev-merror:0.388598
[4] train-merror:0.314226 dev-merror:0.381389
XGBoost train accuracy, oversampled data: 68.58%.
XGBoost train F1 score, oversampled data: 0.6608.
Evaluation output:
XGBoost confusion matrix, oversampled data:
[[297 36 12 1 0]
[ 88 197 76 4 2]
[ 18 81 231 26 16]
[ 8 24 86 72 54]
[ 2 4 16 24 99]]
Normalized XGBoost confusion matrix, oversampled data:
[[0.7191283 0.10526316 0.02850356 0.00787402 0. ]
[0.21307506 0.5760234 0.18052256 0.03149606 0.01169591]
[0.04358353 0.23684211 0.5486936 0.20472442 0.09356725]
[0.01937046 0.07017544 0.20427553 0.56692916 0.31578946]
[0.00484262 0.01169591 0.03800475 0.18897638 0.57894737]]
XGBoost test accuracy, oversampled data: 60.79%.
XGBoost test F1 score, oversampled: 0.5871.
Again, the same observation as with the SVMs applies: if XGBoost prospers by splitting data, synthesizing new examples might make its job harder.
Compared with the original, non-resampled data, the per-class rate of correct classifications did not change much, except for the fifth category: it lost more than 22 percentage points of correct classifications, and that could be what dragged down the F1 score.
Under- and oversampled data results
Training output:
[0] train-merror:0.384706 dev-merror:0.407602
[1] train-merror:0.364294 dev-merror:0.404325
[2] train-merror:0.357235 dev-merror:0.392529
[3] train-merror:0.351882 dev-merror:0.388598
[4] train-merror:0.347059 dev-merror:0.387287
XGBoost train accuracy, under- and oversampled data: 67.31%.
XGBoost train F1 score, under- and oversampled data: 0.6698.
Evaluation output:
XGBoost confusion matrix, under- and oversampled data:
[[301 36 3 5 1]
[ 86 195 62 23 1]
[ 18 71 171 102 10]
[ 8 20 48 133 35]
[ 2 5 6 38 94]]
Normalized XGBoost confusion matrix, under- and oversampled data:
[[0.7253012 0.11009175 0.01034483 0.01661129 0.0070922 ]
[0.20722891 0.5963303 0.2137931 0.07641196 0.0070922 ]
[0.0433735 0.21712539 0.58965516 0.33887044 0.07092199]
[0.01927711 0.06116208 0.16551724 0.44186047 0.24822696]
[0.00481928 0.01529052 0.02068966 0.12624584 0.6666667 ]]
XGBoost test accuracy, under- and oversampled data: 60.65%.
XGBoost test F1 score, under- and oversampled data: 0.603.
Looking at the confusion matrix, we can see that some categories’ correct classifications improved, but others’, such as those of the fourth and fifth categories, decreased drastically. This explains the drop in accuracy and F1.
Discussion
ANN
ANN                   No resampling   Undersampled   Oversampled   Under+over
Training accuracy     69.31           69.71          70.49         *71.05
Validation accuracy   *67.60          67.07          67.13         65.73
Validation F1 score   *0.6668         0.6583         *0.6668       0.6487
Evaluation accuracy   67.53           *68.33         66.47         66.73
Evaluation F1 score   0.6736          *0.6778        0.6688        0.6702
First and foremost, neural networks train their parameters with stochastic gradient descent (SGD). This means that training the same neural network architecture with the same data twice will likely not yield the same neural network as a result. This must be noted before all the qualitative (and not quantitative) analysis I will do for didactic purposes.
We can see that while performing both undersampling and oversampling appears to allow the neural network to fit the training data “better” in terms of training accuracy, its predictions on the validation set come out more than 5% worse. Using the two techniques could make overfitting easier. Moreover, the lowest F1 score on the validation set corresponds to this training dataset, which tells us false positives and negatives were worse with it. Intuitively, one could think that the double resampling distorted the decision boundary too much, but we can see that the F1 score on the evaluation set was not far from the rest. This may imply that the validation/test split, by chance, ended up giving a complicated validation set for the neural network trained with this dataset, as they both came from the same set. The higher evaluation F1 in all trained neural networks supports this idea.
Using just oversampling increases training accuracy, too, but yields a smaller degree of overfitting (3.36% drop in accuracy from training set to validation set against 5.32%). The neural network trained with the oversampled dataset shares the best validation F1 score with the one trained on the original dataset. Accuracy drops another 0.66% in the evaluation set but the F1 score does not change much, it increases a bit.
Undersampling, compared to the original, non-resampled set, also allows for more overfitting, but to a lesser degree than oversampling does. The difference in accuracies between training and validation is now 2.64%. Validation F1 is considerably low and, looking at the similar scores on the neural nets trained on the non-resampled and oversampled datasets, perhaps the low F1 in the one trained on the double-resampled data is more due to the undersampling than it is to the oversampling, but this is not well-founded. In spite of this, evaluation accuracy and F1 scores were the highest. The validation set might, by chance, have ended up with many data points in the region where majority-class points in Tomek links were deleted, while the evaluation set might have not. Again, this has not been quantitatively analyzed.
Lastly, the original dataset yielded a network comparable to the ones trained on the other sets. Performance on the validation set was the highest both in terms of accuracy and F1 scores. On the evaluation set, it was second only to the net trained on the undersampled set.
SVM
SVM             Training accuracy   Training F1 score   Evaluation accuracy   Evaluation F1 score
No resampling   73.18               0.7249              *65.23                0.6411
Undersampled    *73.89              *0.731              64.6                  0.6373
Oversampled     72.09               0.7106              63.87                 0.621
Under+over      71.45               0.7176              64.63                 *0.6427
We can see that the undersampling allowed the algorithm to fit the training data better, at the expense of higher overfitting than on the original set. The difference in accuracy from training to evaluation was 9.29% for the undersampled set against 7.95% for the original one, whereas the F1 score difference was 0.0937 with undersampling and 0.0838 without. This means more overfitting and worse false positive and false negative rates (through precision and recall) with the undersampled data.
Performing oversampling with SMOTE made fitting the training data harder, but it also negatively impacted the evaluation accuracy and F1 scores. Differences from training to evaluation in accuracy and F1 are 8.22% and 0.0896. This is less overfitting than with undersampling, but more than with the original set. Furthermore, the SVM trained with the oversampled data performed worse on the evaluation set as given by both metrics.
By combining the two resampling techniques we get the best evaluation F1 score and the second best evaluation accuracy, bested only by the SVM trained on the original set. Moreover, the overfitting reaches its lowest point if measured with the accuracy and F1 scores: the difference in them is 6.82% and 0.0749, respectively.
XGBoost
XGBoost         Training accuracy   Training F1 score   Test accuracy   Test F1 score
No resampling   67.45               0.6679              61.33           0.6035
Undersampled    *69.01              *0.6821             *62.62          *0.6161
Oversampled     68.58               0.6608              60.79           0.5871
Under+over      67.31               0.6698              58.71           0.603
On the original set we see a difference in accuracy of 6.12% from training to evaluation and 0.0644 in F1. Moving to the trees trained on the undersampled set, these differences become 6.39% and 0.066, respectively. Even though they increase, the data was fit sufficiently better to produce better evaluation results. All metrics were at their best with the model trained on the undersampled set.
When trained on the oversampled set, the model appears to overfit even more, resulting in an accuracy difference of 7.79% and an F1 difference of 0.0737. Additionally, the F1 on the evaluation set is the lowest for this algorithm. The same interpretation as with the support vector machines applies here: if boosting trees need to split the data in a way that minimizes the loss, perhaps the oversampling distorts the decision boundaries.
The double-resampled dataset proved the hardest to fit for the algorithm, as evidenced by its accuracy and the error output during training in the corresponding section. The training false positive and false negative rates were nonetheless the second best, which we can see through the F1 score. Overfitting was 8.6% in terms of accuracy, the highest, but it was the lowest if we measure it as the difference between the multiclass error (merror) on the dev set and the training set.
All models
Model                   Training accuracy   Evaluation accuracy   Evaluation F1 score
SVM no resampling       73.18               65.23                 0.6411
SVM undersampled        *73.89              64.6                  0.6373
SVM oversampled         72.09               63.87                 0.621
SVM under+over          71.45               64.63                 0.6427
ANN no resampling       69.31               67.53                 0.6736
ANN undersampled        69.71               *68.33                *0.6778
ANN oversampled         70.49               66.47                 0.6688
ANN under+over          71.05               66.73                 0.6702
XGBoost no resampling   67.45               61.33                 0.6035
XGBoost undersampled    69.01               62.62                 0.6161
XGBoost oversampled     68.58               60.79                 0.5871
XGBoost under+over      67.31               58.71                 0.603
Looking at the results from all models, the one that achieved the best fit to the training data was the SVM trained on the undersampled dataset. Despite this, it was not able to generalize properly and was not even the best amongst its brethren, both in terms of accuracy and F1 but chiefly for the latter. The best evaluation F1 among the support vector machines was produced by the one trained on the data both under- and oversampled, while the one trained on the original data retained the best evaluation accuracy.
XGBoost results were consistent through all stages: the undersampled dataset favored both better fitting and generalization. With these hyperparameter settings it was not able to reach the other algorithms’ performance.
On the neural network side, the two resampling methods applied simultaneously allowed the net to fit the training data better, but the interpolation did not translate well to the evaluation set. The undersampled set delivered the all-model winning bias/variance tradeoff.
Heads up: the following is almost purely conjectural. Noticeably, the three algorithms’ evaluation metric ranges do not overlap: XGBoost performed worse than the SVMs, which in turn performed worse than the ANNs. Nevertheless, in training this does not hold, as the SVMs were able to overfit the most, followed by the neural nets, with XGBoost right behind. Structural risk (roughly, the bias induced by the complexity —again, roughly, the size— of the model space) was easier to handle with the dropout rate being limited to the range from 0 to 1, as opposed to the C and γ parameters of the SVMs, which only need to be positive real numbers. Nevertheless, the complexity carried by the number of neurons with all their connections, set against the simplicity of the SVMs and their “sparseness” in terms of the number of support vectors they yield, makes the difference in overfitting quite surprising. In the case of XGBoost, its lower ability to overfit the training set does not come as a surprise, as it comprises a combination of many simple splitting nodes, while these SVMs used non-linear kernels to project the data into more dimensions and the ANNs we used had thousands of neurons with non-linear activation functions and thousands of connections between them. It should be noted, however, that variance was not fully optimized here for any of the models, as the main focus was on the effects of the resampling methods, so a more serious and quantitative study could and should be conducted.
Conclusion
We introduced a couple of very useful ways to balance the data and discussed the importance of using them, we presented three of the most widely used algorithms today, we did some extensive, qualitative analysis of each technique’s effects on training, validation —when applicable— and evaluation performance for each algorithm and, last but not least, we reflected on the possible causes of these effects, trying to bring all these abstract ideas down to earth.
A constant across all models is that oversampling on its own does not help them generalize and can even hinder it. Another takeaway is that, even when undersampling on its own did not improve evaluation performance, a combination of both under- and oversampling could, as we saw for the SVMs.
Congratulations and thank you if you made it this far!