import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn')
import sklearn
import warnings
warnings.filterwarnings("ignore")
print('Sklearn Ver: ' + str(sklearn.__version__))
It is important to take a look at the data before starting the analysis: check the number of records and columns. This helps us see how the data is organized and whether there are any abnormalities.
path = 'diabetes.csv'
df = pd.read_csv(path)
df_original = df.copy()  # keep an untouched copy for later comparison
df.head()
# (n_rows, n_cols)
df.shape
It seems that all features are numeric.
Checking for missing values. Most machine learning algorithms do not support NaN values.
df.isna().sum()
Even though all features appear to be numeric, we should confirm the data types in our DataFrame.
df.info()
Checking basic statistics.
df.describe()
There are some suspicious values: 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin' and 'BMI' equal to 0 (zero). I will deal with that later. First, let's check whether our target variable ('Outcome') is balanced.
df['Outcome'].value_counts()
Our target is imbalanced, which may affect our predictions. I will deal with that in the next sections.
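As a quick check of the degree of imbalance, we can also look at the class proportions rather than the raw counts; a minimal addition to the value_counts() above:
# Proportion of each class in the target
df['Outcome'].value_counts(normalize=True)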
Let's plot histograms to see the features' distribution.
df.hist(figsize=(10,5), bins=20);
plt.tight_layout()
Plotting density (KDE) lines gives a smoothed view of each distribution, which helps to check for Gaussian shapes.
df.plot(kind='kde', subplots=True, layout=(3,3), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
Boxplots show data dispersion and potential outliers.
df.plot(kind='box', subplots=True, layout=(2,5), figsize=(10,5), showmeans=True,
sharex=False, sharey=False);
plt.tight_layout()
After exploring each feature independently, let's look at the correlations between them. Scatter plots are useful for this task, revealing patterns and linear relationships between variables.
sns.pairplot(df);
There are many methods for quantifying correlations. One of the most popular is Pearson's correlation, which assumes normally distributed data and measures the linear relationship between variables.
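As a quick illustration of what the heatmap below summarizes, Pearson's r can be computed for a single pair of columns; a minimal sketch using scipy (which ships as a scikit-learn dependency):
from scipy.stats import pearsonr
# Linear correlation between one feature and the target
r, p_value = pearsonr(df['Glucose'], df['Outcome'])
print('Pearson r (Glucose vs Outcome): %.3f (p = %.3g)' % (r, p_value))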
I am going to plot a heatmap and use it as a correlation matrix.
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(method='pearson'), annot=True)
plt.tight_layout()
Ranking the correlation between features and target.
df.corr().iloc[:-1,-1].sort_values(ascending=False) * 100
I don't know much about health data (what counts as too high or too low), but some extreme values are clearly unreasonable.
def MinMaxValues(data, top, remove):
    """
    Returns the 5 lowest/highest distinct values from each dataframe column
    data: dataframe
    top: 'low' or 'high'
    remove: column to drop from the result (e.g. the target)
    """
    if top == 'low':
        sort = True
    elif top == 'high':
        sort = False
    else:
        raise ValueError("top must be 'low' or 'high'")
    values = {}
    for col in data.columns:
        values[col] = data[col].value_counts().sort_index(ascending=sort).iloc[:5].index
    values.pop(remove)
    return pd.DataFrame(values)
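For comparison, the 'low' case can also be written as a one-liner; a rough sketch that is equivalent only when every column has at least five unique values:
# 5 lowest distinct values per feature (illustration only)
pd.DataFrame({col: np.sort(df[col].unique())[:5] for col in df.columns if col != 'Outcome'})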
# Checking the 5 lowest values
MinMaxValues(df, 'low', 'Outcome')
Well, there are some values equal to 0 (zero) in 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin' and 'BMI'. I will treat these records as incorrect and fill them in with the respective column's median.
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
    df.loc[df[col] == 0, col] = df[col].median()
# Checking the 5 lowest values
MinMaxValues(df, 'low', 'Outcome')
Now they seem reasonable.
# Checking the 5 highest values
MinMaxValues(df, 'high', 'Outcome')
The number of pregnancies is very high, but 17 pregnancies is not impossible. Due to my lack of knowledge in the area, I will accept the other values as correct.
After these operations, let's take another look at the data distribution.
df.hist(figsize=(10,5), bins=20);
plt.tight_layout()
df.plot(kind='kde', subplots=True, layout=(3,3), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
df.plot(kind='box', subplots=True, layout=(2,5), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
It is not easy to determine which features are important to our model. Some features only contribute noise, hurting the predictions. One way to select features is Univariate Selection, which chooses variables based on univariate statistical tests.
There are 8 features; as a test, let's select the 4 most relevant. The statistical test we will use is chi-squared.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Splitting the data into input and output (the 'Outcome' column is our target)
X = df.values[:,:-1]
y = df.values[:,-1]
# chi2: Chi-Squared
# k = 4: top 4 features
test = SelectKBest(score_func = chi2, k = 4)
fit = test.fit(X,y)
# Summarizing Scores
print("Scores:")
print(fit.scores_)
print()
# Summarizing Selected features
features = fit.transform(X)
print('Selected features:')
print(features[0:5,:])
# Renaming selected features dataframe columns
features = pd.DataFrame(features)
features.columns = df.columns[:-1][fit.get_support()]  # names of the selected columns, in their original order
features.head()
# Joining our target
features['Outcome'] = df['Outcome']
features.head()
Each machine learning algorithm has its peculiarities and expects the input data in a certain form. I will try three techniques: data normalization, data standardization and power transformation.
The Normalizer rescales each input row (sample) to unit norm. This process does not depend on the distribution of the samples and is useful for algorithms that rely on distance measures.
from sklearn.preprocessing import Normalizer
# Splitting the data into input and output
X = features.values[:,:-1]
y = features.values[:,-1]
# Normalizing
X_normalized = Normalizer().fit_transform(X)
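As a sanity check, each row of the normalized array should have (approximately) unit L2 norm; a minimal sketch:
# L2 norm of the first few normalized rows (should all be ~1.0)
print(np.linalg.norm(X_normalized, axis=1)[:5])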
This process transforms a normally distributed random variable with mean μ and standard deviation σ into a standard normal variable with mean 0 and standard deviation 1, i.e. z = (x - μ) / σ.
# Checking features' normality
features.hist(figsize=(10,5), bins=20);
plt.tight_layout()
features.plot(kind='kde', subplots=True, layout=(3,2), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
We can see that the data is not exactly normally distributed, but we will continue anyway.
from sklearn.preprocessing import StandardScaler
# Splitting the data into input and output
X = features.values[:,:-1]
y = features.values[:,-1]
# Standardizing the data
X_standard = StandardScaler().fit_transform(X)
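As a quick check, each standardized column should now have mean close to 0 and standard deviation close to 1; a minimal sketch:
# Column-wise mean and std after standardization
print(np.round(X_standard.mean(axis=0), 4))
print(np.round(X_standard.std(axis=0), 4))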
The previous transformation assumes the data is normally distributed. To improve on this, we will use a power transform, which applies a transformation that makes the distributions normal (or close to it).
from sklearn.preprocessing import PowerTransformer
# PowerTransformer applies a transformation to make the data more Gaussian-like
# Setting standardize=True also applies zero-mean, unit-variance scaling afterwards
power = PowerTransformer(method='yeo-johnson', standardize=True)
X = features.values[:,:-1]
y = features.values[:,-1]
X_power = power.fit_transform(X)
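For reference, the fitted Yeo-Johnson exponents can be inspected through the lambdas_ attribute (one value per feature); values close to 1 mean little transformation was needed:
# Estimated lambda for each input feature
print(power.lambdas_)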
# Checking the transformation results
pd.DataFrame(X_power).hist(figsize=(10,5), bins=20);
plt.tight_layout()
pd.DataFrame(X_power).plot(kind='kde', subplots=True, layout=(2,2), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
Testing different models to find the one that makes the best predictions.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
# Processed data dictionary
X = {}
# All features, original data (no transformations)
X['All_Original'] = df_original.values[:,:-1]
# All features, cleaned data (no transformations)
X['All_Cleanned'] = df.values[:,:-1]
# Selected features without transformations
X['Selected_Cleanned'] = features.values[:,:-1]
# Selected features Normalized
X['Normalized'] = X_normalized
# Selected features Standardized
X['Standardized'] = X_standard
# Selected features Power Transformed
X['Power_Transformed'] = X_power
y = df.values[:,-1]
Defining the algorithms to evaluate. Due to the class imbalance, I will also include some penalized (class-weighted) versions.
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier(random_state=101)))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto', random_state=101)))
models.append(('SGD', SGDClassifier(random_state=101)))
# Penalized algorithms
models.append(('LR_Penalized', LogisticRegression(class_weight='balanced')))
models.append(('SVM_Penalized', SVC(kernel='linear',
class_weight='balanced',
probability=True, random_state=101)))
models.append(('CART_Penalized', DecisionTreeClassifier(class_weight='balanced',
random_state=101)))
models.append(('SGD_Penalized', SGDClassifier(class_weight='balanced',
random_state=101)))
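For reference, class_weight='balanced' weights each class inversely proportional to its frequency, i.e. n_samples / (n_classes * n_samples_in_class). A minimal sketch using sklearn's helper, assuming y as defined above:
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), np.round(weights, 3))))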
# Store all results
all_results = {}
all_resumed = {}
# Evaluate each model in turn
for feat in X.keys():
    all_results[feat] = {}
    all_resumed[feat] = {}
    X_train, X_validation, y_train, y_validation = train_test_split(X[feat], y, test_size=0.20, random_state=101)
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        all_results[feat].update({name: cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')})
        all_resumed[feat].update({name: {'Mean': np.round(all_results[feat][name].mean(), 4)}})
        all_resumed[feat][name].update({'Std': np.round(all_results[feat][name].std(), 4)})
# Algorithm comparison
for feat in X.keys():
    print(feat + ':')
    print(pd.DataFrame(all_resumed[feat]).T)
    print()
# Plotting algorithm comparison
fig, axes = plt.subplots(3,2, figsize=(14,12))
i, j = 0, 0
for feat in X.keys():
    axes[i,j].boxplot(all_results[feat].values(),
                      labels=all_results[feat].keys(), showmeans=True)
    axes[i,j].set_title(feat + ': Algorithm Comparison')
    j += 1
    if j > 1:
        i += 1
        j = 0
plt.tight_layout()
The best result comes from LDA on the All_Original and All_Cleanned data.
Although accuracy is a good indicator, our classes are imbalanced, so we need other metrics to compare the models.
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
# Evaluate each model in turn
for feat in X.keys():
    all_results[feat] = {}
    all_resumed[feat] = {}
    X_train, X_validation, y_train, y_validation = train_test_split(X[feat], y, test_size=0.20, random_state=101)
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        all_results[feat].update({name: {}})
        all_resumed[feat].update({name: {}})
        for score in scores:
            all_results[feat][name].update({score: cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)})
            all_resumed[feat][name].update({score: str(np.round(all_results[feat][name][score].mean(), 4)) + ' (' +
                                                   str(np.round(all_results[feat][name][score].std(), 4)) + ')'})
# list(X.keys())[0] means the first data case: All_Original
pd.DataFrame(all_resumed[list(X.keys())[0]]).T
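As a side note, sklearn's cross_validate can compute several metrics in a single pass, instead of re-running cross-validation once per metric as the loop above does. A minimal sketch with one model on the All_Original data:
from sklearn.model_selection import cross_validate
X_tr, X_val, y_tr, y_val = train_test_split(X['All_Original'], y, test_size=0.20, random_state=101)
cv_res = cross_validate(LogisticRegression(class_weight='balanced'), X_tr, y_tr,
                        cv=StratifiedKFold(n_splits=10, random_state=101, shuffle=True),
                        scoring=['accuracy', 'f1_macro'])
# Results come back under 'test_<metric>' keys
print(np.round(cv_res['test_accuracy'].mean(), 4), np.round(cv_res['test_f1_macro'].mean(), 4))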
When we look at precision, recall and F1, we realize that it is not a good idea to rely on accuracy alone.
Let's make some predictions with the SVM model and print a confusion matrix.
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20, random_state=101)
model = SVC(gamma='auto', random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
This example shows that the model is biased, predicting only the majority class.
As we cannot acquire more data, we need to use other methods, such as undersampling the majority class, oversampling the minority class, and generating synthetic samples with SMOTE. After applying each technique, we will evaluate which one is most appropriate.
Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset. A drawback is that we remove information that may be valuable, which can lead to underfitting and poor generalization on the test set.
# Check version number
import imblearn
print('Imblearn Ver: ' + str(imblearn.__version__))
# Random undersampling to balance the class distribution
from imblearn.under_sampling import RandomUnderSampler
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)
# Evaluate each model in turn
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
stratify=y, random_state=101)
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy='majority')
# fit and apply the transform
X_train, y_train = undersample.fit_resample(X_train, y_train)
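As a quick check, the undersampled training set should now have the same number of examples in each class; a minimal sketch:
from collections import Counter
# Class distribution after random undersampling
print(Counter(y_train))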
for score in scores:
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
        cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
cv_df
The best model is LR_Penalized. Let's make predictions.
model = LogisticRegression(class_weight='balanced') # Penalized
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset.
# Random oversampling to balance the class distribution
from imblearn.over_sampling import RandomOverSampler
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)
# Evaluate each model in turn
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
stratify=y, random_state=101)
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_train, y_train = oversample.fit_resample(X_train, y_train)
for score in scores:
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
        cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
cv_df
The best model is SVM. Let's make predictions.
model = SVC(gamma='auto', random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
The validation results are inconsistent with the cross-validation scores. The high CV scores are an artifact of simple oversampling: because we randomly duplicate minority-class examples before cross-validation, copies of the same samples end up in both the training and validation folds, so the model is scored on samples it has effectively already seen. The effect is similar to data leakage.
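One way to avoid this optimistic bias is to resample inside each cross-validation fold rather than before splitting, for example with imblearn's Pipeline, which applies the sampler only to the training portion of each split. A minimal sketch, not the approach followed in the rest of this notebook:
from imblearn.pipeline import Pipeline as ImbPipeline
# Re-split so the training data is not already resampled
X_tr, X_val, y_tr, y_val = train_test_split(X['All_Original'], y, test_size=0.20,
                                            stratify=y, random_state=101)
pipe = ImbPipeline([('over', RandomOverSampler(sampling_strategy='minority', random_state=101)),
                    ('model', DecisionTreeClassifier(random_state=101))])
kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
print(np.round(cross_val_score(pipe, X_tr, y_tr, cv=kfold, scoring='f1_macro').mean(), 4))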
Back to the simple-oversampling experiment: applying CART, which had the second-best cross-validation result, to our validation data.
model = DecisionTreeClassifier(random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
SMOTE is another oversampling technique: it selects minority-class examples that are close in feature space, draws a line between a pair of them, and creates a new synthetic sample at a point along that line.
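Conceptually, each synthetic sample lies on the line segment between a minority-class example and one of its nearest neighbours; a toy illustration with made-up points (not real records from the dataset):
# x_new = x_i + gap * (x_neighbor - x_i), with gap drawn uniformly from [0, 1)
x_i = np.array([120.0, 30.0])         # hypothetical minority example (e.g. Glucose, BMI)
x_neighbor = np.array([140.0, 34.0])  # one of its nearest minority neighbours
gap = np.random.rand()
x_new = x_i + gap * (x_neighbor - x_i)
print(x_new)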
Splitting the data into train and validation sets.
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
stratify=y, random_state=101)
Summarizing the number of examples in each class.
from collections import Counter
# summarize class distribution
counter = Counter(y_train)
print(counter)
Creating a scatter plot of the dataset, coloring the examples of each class differently, to show the spatial nature of the class imbalance.
plt.figure(figsize=(10,6))
color = ['blue', 'orange']
# Scatter plot of examples by class label (Glucose on x, Age on y)
for label, _ in counter.items():
    row_ix = np.where(y_train == label)[0]
    # The label value 0/1 doubles as the colour index
    plt.scatter(X_train[row_ix, 1], X_train[row_ix, 7],
                alpha=0.5, c=color[int(label)],
                label='Diabetic' if int(label) == 1 else 'Non-Diabetic')
plt.legend(frameon=True, facecolor='w', fontsize=12);
from imblearn.over_sampling import SMOTE
# Transform the dataset
oversample = SMOTE(random_state=101)
X_train, y_train = oversample.fit_resample(X_train, y_train)
# Summarize the new class distribution
counter = Counter(y_train)
print(counter)
plt.figure(figsize=(10,6))
color = ['blue', 'orange']
# Scatter plot of examples by class label (Glucose on x, Age on y)
for label, _ in counter.items():
    row_ix = np.where(y_train == label)[0]
    # The label value 0/1 doubles as the colour index
    plt.scatter(X_train[row_ix, 1], X_train[row_ix, 7],
                alpha=0.5, c=color[int(label)],
                label='Diabetic' if int(label) == 1 else 'Non-Diabetic')
plt.legend(frameon=True, facecolor='w', fontsize=12);
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)
# Evaluate each model in turn
for score in scores:
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
        cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
cv_df
Making predictions with KNN on the validation data.
model = KNeighborsClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
We can also combine oversampling and undersampling. SMOTEENN applies SMOTE and then Edited Nearest Neighbours (ENN), which removes samples whose class disagrees with the majority of their nearest neighbours.
# Combining oversampling (SMOTE) with undersampling (ENN)
from imblearn.combine import SMOTEENN
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)
# Evaluate each model in turn
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
stratify=y, random_state=101)
# define oversampling followed by undersampling strategy
overundersample = SMOTEENN(sampling_strategy=0.81, random_state=101)
# fit and apply the transform
X_train, y_train = overundersample.fit_resample(X_train, y_train)
for score in scores:
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
        cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
cv_df
Making predictions with CART (the decision tree) on the validation data.
model = DecisionTreeClassifier(random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
After applying different methods to compensate for the imbalance, we can see some gains.
The first time we predicted on the validation data we had:
The best result was obtained by combining SMOTE and undersampling (SMOTEENN):