Pima Indians Diabetes Prediction

Feature Information

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 = non-diabetic, 1 = diabetic)

Exploratory Data Analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn')

import sklearn
import warnings
warnings.filterwarnings("ignore")
print('Sklearn Ver: ' + str(sklearn.__version__))
Sklearn Ver: 0.22.2.post1

It is important to take a look at the data before starting the analysis and to check the number of records and columns. This helps us see how the data is organized and whether there are any abnormalities.

In [2]:
path = 'diabetes.csv'

df = pd.read_csv(path)
# Keep a reference to the original data. Note: this is NOT a copy,
# so the zero-replacement done later in the notebook also affects df_original
df_original = df
df.head()
Out[2]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
In [3]:
# (n_rows, n_cols)
df.shape
Out[3]:
(768, 9)

It seems that all features are numeric.

Checking for missing values: most machine learning algorithms do not support NaN values.

In [4]:
df.isna().sum()
Out[4]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Even though all features appear to be numeric, we should confirm the data types in our DataFrame.

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Checking basic statistics.

In [6]:
df.describe()
Out[6]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

There are some strange values: 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin' and 'BMI' equal to 0 (zero), which is not physiologically plausible. I will deal with that later. First, let's check whether our target variable ('Outcome') is balanced.

In [7]:
df['Outcome'].value_counts()
Out[7]:
0    500
1    268
Name: Outcome, dtype: int64

Our target is imbalanced, which may affect our predictions. I will deal with that in the next sections.
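
As a quick sketch (not one of the original cells), the imbalance can also be expressed as relative frequencies: roughly 65% of the records belong to class 0 and 35% to class 1.

# Relative class frequencies (sketch): ~0.65 for class 0 vs ~0.35 for class 1
print(df['Outcome'].value_counts(normalize=True))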

Let's plot histograms to see the features' distribution.

In [8]:
df.hist(figsize=(10,5), bins=20);
plt.tight_layout()

Plotting density (KDE) curves gives a "smoothed" view of each distribution, which helps to check for Gaussian shapes.

In [9]:
df.plot(kind='kde', subplots=True, layout=(3,3), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()

Boxplots show data dispersion and potential outliers.

In [10]:
df.plot(kind='box', subplots=True, layout=(2,5), figsize=(10,5), showmeans=True,
        sharex=False, sharey=False);
plt.tight_layout()

After exploring each feature independently, let's look at the correlations between them. Scatter plots are useful for this task, revealing patterns and linear relationships between variables.

In [11]:
sns.pairplot(df);

There are many methods for quantifying correlations. One of the most popular is Pearson's correlation, which assumes normally distributed data and measures the linear relationship between variables.
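
As a small sketch (not part of the original cells), Pearson's r for a single pair of columns can be computed by hand and compared with pandas' built-in .corr(); the Glucose/Age pair is an arbitrary choice for illustration.

# Pearson's r = cov(x, y) / (std(x) * std(y)); ddof=0 keeps the n factors consistent
x_col, y_col = df['Glucose'], df['Age']
r_manual = ((x_col - x_col.mean()) * (y_col - y_col.mean())).mean() / (x_col.std(ddof=0) * y_col.std(ddof=0))
print(r_manual, df['Glucose'].corr(df['Age']))   # the two values should match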

I am going to plot a heatmap and use it as a correlation matrix.

In [12]:
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(method='pearson'), annot=True)
plt.tight_layout()

Ranking the correlation between features and target.

In [13]:
df.corr().iloc[:-1,-1].sort_values(ascending=False) * 100
Out[13]:
Glucose                     46.658140
BMI                         29.269466
Age                         23.835598
Pregnancies                 22.189815
DiabetesPedigreeFunction    17.384407
Insulin                     13.054795
SkinThickness                7.475223
BloodPressure                6.506836
Name: Outcome, dtype: float64

Handling Outliers

I don't know much about health data (whether a given value is too high or too low), but some extreme values are clearly unreasonable.

In [14]:
def MinMaxValues(data, top, remove):
    """
    Returns the 5 lowest/highest unique values from each dataframe column

    data: dataframe
    top: 'low' or 'high'
    remove: column name to drop from the result (e.g. the target)
    """

    if(top == 'low'):
        sort = True
    elif(top == 'high'):
        sort = False

    values = {}

    for col in data.columns:
        # value_counts().sort_index() lists the unique values in order;
        # keep the first 5 (lowest or highest, depending on 'sort')
        values[col] = data[col].value_counts().sort_index(ascending=sort).iloc[:5].index

    values.pop(remove)

    return pd.DataFrame(values)
In [15]:
# Checking the 5 lowest values
MinMaxValues(df, 'low', 'Outcome')
Out[15]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 0 0 0 0 0 0.0 0.078 21
1 1 44 24 7 14 18.2 0.084 22
2 2 56 30 8 15 18.4 0.085 23
3 3 57 38 10 16 19.1 0.088 24
4 4 61 40 11 18 19.3 0.089 25

Well, there are some values equal to 0 (zero) in 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin' and 'BMI'. I will treat these records as erroneous and fill them in with the respective column median.

In [16]:
# Replace zeros with the column median (computed before the replacement, so it still includes the zeros)
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
    df.loc[df[col] == 0, col] = df[col].median()
In [17]:
# Checking the 5 lowest values
MinMaxValues(df, 'low', 'Outcome')
Out[17]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 0 44 24 7 14.0 18.2 0.078 21
1 1 56 30 8 15.0 18.4 0.084 22
2 2 57 38 10 16.0 19.1 0.085 23
3 3 61 40 11 18.0 19.3 0.088 24
4 4 62 44 12 22.0 19.4 0.089 25

Now they seem reasonable.

In [18]:
# Checking the 5 highest values
MinMaxValues(df, 'high', 'Outcome')
Out[18]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 17 199 122 99 846.0 67.1 2.420 81
1 15 198 114 63 744.0 59.4 2.329 72
2 14 197 110 60 680.0 57.3 2.288 70
3 13 196 108 56 600.0 55.0 2.137 69
4 12 195 106 54 579.0 53.2 1.893 68

The number of pregnancies is very high, but it is not impossible to have been pregnant 17 times. Due to my lack of knowledge in the area, I will accept the other values as correct.

After these operations, I will take another look at data distribution.

In [19]:
df.hist(figsize=(10,5), bins=20);
plt.tight_layout()
In [20]:
df.plot(kind='kde', subplots=True, layout=(3,3), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
In [21]:
df.plot(kind='box', subplots=True, layout=(2,5), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()

Feature Selection

It is not easy to determine which features are important to our model. Some features contribute mostly noise, impairing the predictions. One way to choose features is the Univariate Selection approach, which selects variables based on univariate statistical tests.

There are 8 features; as a test, we will select the 4 most relevant ones using the Chi-Squared statistical test.

In [22]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Splitting the data into input and output (the 'Outcome' column is our target)
X = df.values[:,:-1]
y = df.values[:,-1]

# chi2: Chi-Squared
# k = 4: top 4 features
test = SelectKBest(score_func = chi2, k = 4)
fit = test.fit(X,y)

# Summarizing Scores 
print("Scores:")
print(fit.scores_)
print()

# Summarizing Selected features
features = fit.transform(X)
print('Selected features:')
print(features[0:5,:])
Scores:
[ 111.51969064 1418.44239729   42.58250709   85.43128164 1989.58939433
  108.93372502    5.39268155  181.30368904]

Selected features:
[[  6.  148.   30.5  50. ]
 [  1.   85.   30.5  31. ]
 [  8.  183.   30.5  32. ]
 [  1.   89.   94.   21. ]
 [  0.  137.  168.   33. ]]
In [23]:
# Renaming the selected-features dataframe columns: take the indices of the 4 highest
# chi2 scores and map them back (in their original order) to the column names
features = pd.DataFrame(features)
features.columns = df.columns[sorted(pd.Series(fit.scores_).sort_values(ascending=False).iloc[:4].index)]
features.head()
Out[23]:
Pregnancies Glucose Insulin Age
0 6.0 148.0 30.5 50.0
1 1.0 85.0 30.5 31.0
2 8.0 183.0 30.5 32.0
3 1.0 89.0 94.0 21.0
4 0.0 137.0 168.0 33.0
In [24]:
# Joining our target
features['Outcome'] = df['Outcome']
features.head()
Out[24]:
Pregnancies Glucose Insulin Age Outcome
0 6.0 148.0 30.5 50.0 1
1 1.0 85.0 30.5 31.0 0
2 8.0 183.0 30.5 32.0 1
3 1.0 89.0 94.0 21.0 0
4 0.0 137.0 168.0 33.0 1

Data Preprocessing

Each machine learning algorithm has its own peculiarities and expects the input data in a certain form. I will try three techniques: Data Normalization, Data Standardization and Power Transformation.

Data Normalization

The Normalizer rescales each input row (sample) to unit norm. This process does not depend on the distribution of the samples and is useful for algorithms that use distance measures.

In [25]:
from sklearn.preprocessing import Normalizer

# Splitting the data into input and output
X = features.values[:,:-1]
y = features.values[:,-1]

# Normalizing
X_normalized = Normalizer().fit_transform(X)
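
As a quick sanity check (a sketch, not in the original notebook), every row of the normalized array should now have an L2 norm of 1:

# Each sample (row) is rescaled to unit length by the Normalizer (default norm='l2')
print(np.linalg.norm(X_normalized, axis=1)[:5])   # expected: array of 1.0s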

Data Standardization

This process transforms a (roughly) normally distributed random variable with mean μ and standard deviation σ into a standard normal variable with mean 0 and standard deviation 1, via z = (x - μ) / σ.

In [26]:
# Checking features' normality
features.hist(figsize=(10,5), bins=20);
plt.tight_layout()
In [27]:
features.plot(kind='kde', subplots=True, layout=(3,2), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()

We can see that the data is not exactly normally distributed, but we will continue anyway.

In [28]:
from sklearn.preprocessing import StandardScaler

# Splitting the data into input and output
X = features.values[:,:-1]
y = features.values[:,-1]

# Standardizing the data
X_standard = StandardScaler().fit_transform(X)
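
A small verification sketch (not in the original cells): standardizing by hand with z = (x - mean) / std should reproduce StandardScaler's output, since the scaler uses the population standard deviation.

# Manual standardization; np.std defaults to ddof=0, matching StandardScaler
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_manual, X_standard))   # expected: True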

Power Transformation

The previous transformation assumes the data is normally distributed. To improve on this, we will use a Power Transform, which applies a transformation that makes each distribution normal (or close to it).

In [29]:
from sklearn.preprocessing import PowerTransformer

# PowerTransformer applies a transformation that makes the data more normally distributed
# standardize=True additionally applies zero-mean, unit-variance scaling afterwards
power = PowerTransformer(method='yeo-johnson', standardize=True)
X = features.values[:,:-1]
y = features.values[:,-1]

X_power = power.fit_transform(X)
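
As a rough check of the effect (a sketch, assuming scipy is available in the environment), the skewness of each column should move closer to 0 after the Yeo-Johnson transform:

from scipy.stats import skew

# Column-wise skewness before and after the power transform
print(skew(X, axis=0))
print(skew(X_power, axis=0))
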
In [30]:
# Checking the transformation results
pd.DataFrame(X_power).hist(figsize=(10,5), bins=20);
plt.tight_layout()
In [31]:
pd.DataFrame(X_power).plot(kind='kde', subplots=True, layout=(2,2), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()

Model Selection

Testing different models to find the one that makes the best predictions.

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
In [33]:
# Processed data dictionary
X = {}

# All features, original dataframe reference (no transformations)
X['All_Original'] = df_original.values[:,:-1]

# All features, cleaned data (zeros replaced by medians)
X['All_Cleanned'] = df.values[:,:-1]

# Selected features, cleaned data (no further transformations)
X['Selected_Cleanned'] = features.values[:,:-1]

# Selected features Normalized
X['Normalized'] = X_normalized

# Selected features Standardized
X['Standardized'] = X_standard

# Selected features Power Transformed
X['Power_Transformed'] = X_power

y = df.values[:,-1]

Defining the algorithms to spot-check. Due to the class imbalance, I will also include some penalized (class-weighted) variants.
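
For reference, class_weight='balanced' reweights classes inversely to their frequencies, i.e. n_samples / (n_classes * bincount(y)). A small sketch (not an original cell) of roughly the weights this implies for the full dataset:

from sklearn.utils.class_weight import compute_class_weight

# With 500 negatives and 268 positives this gives roughly [0.77, 1.43]
print(compute_class_weight('balanced', classes=np.unique(y), y=y))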

In [34]:
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier(random_state=101)))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto', random_state=101)))
models.append(('SGD', SGDClassifier(random_state=101)))

# Penalized algorithms
models.append(('LR_Penalized', LogisticRegression(class_weight='balanced')))
models.append(('SVM_Penalized', SVC(kernel='linear', 
                                    class_weight='balanced',
                                    probability=True, random_state=101)))
models.append(('CART_Penalized', DecisionTreeClassifier(class_weight='balanced',
                                                        random_state=101)))
models.append(('SGD_Penalized', SGDClassifier(class_weight='balanced',
                                              random_state=101)))

# Store all results
all_results = {}
all_resumed = {}

# Evaluate each model in turn
for feat in X.keys():
    all_results[feat] = {}
    all_resumed[feat] = {}
    X_train, X_validation, y_train, y_validation = train_test_split(X[feat], y, test_size=0.20, random_state=101)
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        all_results[feat].update({name: cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')})
        all_resumed[feat].update({name: {'Mean': np.round(all_results[feat][name].mean(), 4)}})
        all_resumed[feat][name].update({'Std': np.round(all_results[feat][name].std(), 4)})
In [35]:
# Algorithm comparison
for feat in X.keys():
    print(feat + ':')
    print(pd.DataFrame(all_resumed[feat]).T)
    print()
All_Original:
                  Mean     Std
LR              0.7510  0.0848
LDA             0.7672  0.0681
KNN             0.6841  0.0705
CART            0.6806  0.0493
NB              0.7410  0.0412
SVM             0.6466  0.0065
SGD             0.5865  0.1029
LR_Penalized    0.7428  0.0631
SVM_Penalized   0.7492  0.0570
CART_Penalized  0.6743  0.0440
SGD_Penalized   0.5989  0.1241

All_Cleanned:
                  Mean     Std
LR              0.7510  0.0848
LDA             0.7672  0.0681
KNN             0.6841  0.0705
CART            0.6806  0.0493
NB              0.7410  0.0412
SVM             0.6466  0.0065
SGD             0.5865  0.1029
LR_Penalized    0.7428  0.0631
SVM_Penalized   0.7492  0.0570
CART_Penalized  0.6743  0.0440
SGD_Penalized   0.5989  0.1241

Selected_Cleanned:
                  Mean     Std
LR              0.7459  0.0587
LDA             0.7459  0.0604
KNN             0.6969  0.0800
CART            0.6935  0.0694
NB              0.7279  0.0719
SVM             0.6417  0.0170
SGD             0.4772  0.1272
LR_Penalized    0.7280  0.0705
SVM_Penalized   0.7345  0.0623
CART_Penalized  0.7018  0.0546
SGD_Penalized   0.5130  0.1228

Normalized:
                  Mean     Std
LR              0.6466  0.0065
LDA             0.6712  0.0417
KNN             0.6660  0.0528
CART            0.5913  0.0712
NB              0.6270  0.0426
SVM             0.6466  0.0065
SGD             0.6435  0.0293
LR_Penalized    0.5587  0.0476
SVM_Penalized   0.5783  0.0739
CART_Penalized  0.6289  0.0507
SGD_Penalized   0.5275  0.1354

Standardized:
                  Mean     Std
LR              0.7459  0.0604
LDA             0.7459  0.0604
KNN             0.7229  0.0534
CART            0.6936  0.0700
NB              0.7279  0.0719
SVM             0.7313  0.0682
SGD             0.6940  0.0697
LR_Penalized    0.7280  0.0705
SVM_Penalized   0.7345  0.0623
CART_Penalized  0.7018  0.0588
SGD_Penalized   0.6515  0.0617

Power_Transformed:
                  Mean     Std
LR              0.7476  0.0587
LDA             0.7410  0.0557
KNN             0.7277  0.0608
CART            0.6903  0.0800
NB              0.7295  0.0667
SVM             0.7410  0.0710
SGD             0.6871  0.0595
LR_Penalized    0.7197  0.0697
SVM_Penalized   0.7229  0.0702
CART_Penalized  0.7002  0.0578
SGD_Penalized   0.6578  0.0843

In [36]:
# Plotting algorithm comparison
fig, axes = plt.subplots(3,2, figsize=(14,12))

i, j = 0, 0
for feat in X.keys():
    axes[i,j].boxplot(all_results[feat].values(),
                      labels=all_results[feat].keys(), showmeans=True)
    axes[i,j].set_title(feat + ': Algorithm Comparison')
    j += 1
    if(j > 1):
        i += 1
        j = 0

plt.tight_layout()

The best result is from LDA on All_Original/All_Cleanned; the two are identical because df_original is a reference to df rather than a copy, so the zero-replacement affected both.

Although accuracy is a useful indicator, our class is imbalanced, so we need other measures to compare the models.
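
To see why macro-averaged metrics help here, consider a toy sketch (made-up labels, not part of the notebook's data): a classifier that always predicts the majority class still scores high accuracy (0.90 below), but its macro F1 collapses because the minority class contributes an F1 of 0.

from sklearn.metrics import f1_score

# Hypothetical 90/10 imbalanced labels and an "always predict 0" classifier
y_true = np.array([0]*90 + [1]*10)
y_pred = np.zeros(100, dtype=int)

print(f1_score(y_true, y_pred, average=None))     # per-class F1: ~[0.95, 0.0]
print(f1_score(y_true, y_pred, average='macro'))  # unweighted mean: ~0.47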

In [37]:
# Macro metrics treat all classes as equal
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

# Evaluate each model in turn
for feat in X.keys():
    all_results[feat] = {}
    all_resumed[feat] = {}
    X_train, X_validation, y_train, y_validation = train_test_split(X[feat], y, test_size=0.20, random_state=101)
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        all_results[feat].update({name: {}})
        all_resumed[feat].update({name: {}})
        for score in scores:
          all_results[feat][name].update({score: cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)})
          all_resumed[feat][name].update({score: str(np.round(all_results[feat][name][score].mean(), 4)) + ' (' +
                                                 str(np.round(all_results[feat][name][score].std(), 4)) + ')'})
In [38]:
# list(X.keys())[0] means the first data case: All_Original
pd.DataFrame(all_resumed[list(X.keys())[0]]).T
Out[38]:
accuracy precision_macro recall_macro f1_macro
LR 0.751 (0.0848) 0.7339 (0.1049) 0.6987 (0.0986) 0.7059 (0.102)
LDA 0.7672 (0.0681) 0.7516 (0.0837) 0.72 (0.085) 0.7269 (0.0856)
KNN 0.6841 (0.0705) 0.6518 (0.0864) 0.6417 (0.0813) 0.6422 (0.0836)
CART 0.6806 (0.0493) 0.6579 (0.0482) 0.6485 (0.0403) 0.6463 (0.0463)
NB 0.741 (0.0412) 0.7163 (0.0493) 0.7059 (0.0532) 0.7085 (0.0517)
SVM 0.6466 (0.0065) 0.3233 (0.0032) 0.5 (0.0) 0.3927 (0.0024)
SGD 0.5865 (0.1029) 0.5759 (0.1582) 0.5469 (0.064) 0.4576 (0.0917)
LR_Penalized 0.7428 (0.0631) 0.7285 (0.065) 0.7435 (0.0693) 0.7297 (0.0663)
SVM_Penalized 0.7492 (0.057) 0.7298 (0.0636) 0.741 (0.0714) 0.7316 (0.0649)
CART_Penalized 0.6743 (0.044) 0.6458 (0.0489) 0.6435 (0.0509) 0.6415 (0.0489)
SGD_Penalized 0.5989 (0.1241) 0.5867 (0.2257) 0.57 (0.0923) 0.4847 (0.1371)

When we look at precision, recall and f1, we realize that it is not a good idea to rely on accuracy alone.

Let's make some predictions with the SVM model and print a confusion matrix.

In [39]:
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20, random_state=101)

model = SVC(gamma='auto', random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)

print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
Confusion Matrix:
[[103   0]
 [ 51   0]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.67      1.00      0.80       103
         1.0       0.00      0.00      0.00        51

    accuracy                           0.67       154
   macro avg       0.33      0.50      0.40       154
weighted avg       0.45      0.67      0.54       154

This example shows that the model is biased, predicting only the majority class.

Handling Imbalanced Classes

As we cannot acquire more data, we need to use other methods such as undersampling the majority class, oversampling the minority class, and generating synthetic samples with SMOTE. After applying each technique, we will evaluate which one is most appropriate.

Undersampling Majority Class

Random undersampling involves randomly selecting examples from the majority class to delete from the training dataset. A drawback is that we are removing information that may be valuable. This could lead to underfitting and poor generalization to the test set.

In [40]:
# Check version number
import imblearn
print('Imblearn Ver: ' + str(imblearn.__version__))
Imblearn Ver: 0.4.3
In [41]:
# Random undersampling to balance the class distribution
from imblearn.under_sampling import RandomUnderSampler
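
Before plugging it into the modeling loop below, here is a small sketch (run on the full dataset purely for illustration) of how RandomUnderSampler changes the class counts:

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# The majority class (0) is randomly downsampled to match the minority class
X_tmp, y_tmp = RandomUnderSampler(sampling_strategy='majority').fit_resample(df.values[:, :-1], df.values[:, -1])
print(Counter(df.values[:, -1]), '->', Counter(y_tmp))   # expect ~{0.0: 500, 1.0: 268} -> {0.0: 268, 1.0: 268}
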
In [42]:
# Macro metrics treat all classes as equal
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)

# Evaluate each model in turn
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
                                                                stratify=y, random_state=101)
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy='majority')
# fit and apply the transform
X_train, y_train = undersample.fit_resample(X_train, y_train)

for score in scores:
  for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
    cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
In [43]:
cv_df
Out[43]:
accuracy precision_macro recall_macro f1_macro
LR 0.7265 (0.0664) 0.7287 (0.066) 0.7265 (0.0665) 0.7255 (0.0671)
LDA 0.7402 (0.085) 0.7467 (0.0833) 0.7404 (0.0846) 0.7377 (0.0871)
KNN 0.7008 (0.0753) 0.7074 (0.0765) 0.7008 (0.0747) 0.6982 (0.076)
CART 0.6961 (0.0748) 0.7017 (0.0763) 0.6959 (0.0741) 0.694 (0.0748)
NB 0.724 (0.0946) 0.7303 (0.0964) 0.724 (0.095) 0.7216 (0.0962)
SVM 0.5001 (0.0257) 0.481 (0.2164) 0.5089 (0.0238) 0.3665 (0.0411)
SGD 0.5774 (0.0871) 0.6066 (0.1976) 0.5793 (0.0828) 0.4982 (0.1332)
LR_Penalized 0.7172 (0.0784) 0.7196 (0.0786) 0.7171 (0.0787) 0.7159 (0.0793)
SVM_Penalized 0.7402 (0.0824) 0.7478 (0.0796) 0.7403 (0.082) 0.7368 (0.086)
CART_Penalized 0.7007 (0.0823) 0.705 (0.0835) 0.7003 (0.0821) 0.6988 (0.0828)
SGD_Penalized 0.6146 (0.0675) 0.6524 (0.1448) 0.6175 (0.0668) 0.5661 (0.1081)

Let's make some predictions with LR_Penalized.

In [44]:
model = LogisticRegression(class_weight='balanced') # Penalized
model.fit(X_train, y_train)
predictions = model.predict(X_validation)

print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
Confusion Matrix:
[[74 26]
 [20 34]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.79      0.74      0.76       100
         1.0       0.57      0.63      0.60        54

    accuracy                           0.70       154
   macro avg       0.68      0.68      0.68       154
weighted avg       0.71      0.70      0.70       154

Oversampling Minority Class

Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset.

In [45]:
# Random oversampling to balance the class distribution
from imblearn.over_sampling import RandomOverSampler
In [46]:
# Macro metrics treat all classes as equal
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)

# Evaluate each model in turn
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
                                                                stratify=y, random_state=101)
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_train, y_train = oversample.fit_resample(X_train, y_train)

for score in scores:
  for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
    cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
In [47]:
cv_df
Out[47]:
accuracy precision_macro recall_macro f1_macro
LR 0.7425 (0.0376) 0.7461 (0.0353) 0.7425 (0.0376) 0.7413 (0.0386)
LDA 0.7488 (0.0347) 0.7538 (0.0308) 0.7488 (0.0347) 0.7471 (0.0367)
KNN 0.7488 (0.0508) 0.7518 (0.0518) 0.7488 (0.0508) 0.7481 (0.0507)
CART 0.8212 (0.0379) 0.8259 (0.0397) 0.8212 (0.0379) 0.8207 (0.038)
NB 0.7263 (0.0351) 0.7309 (0.0324) 0.7263 (0.0351) 0.7245 (0.0373)
SVM 0.8662 (0.043) 0.8962 (0.0262) 0.8662 (0.043) 0.863 (0.0461)
SGD 0.585 (0.057) 0.6433 (0.0646) 0.585 (0.057) 0.5401 (0.0918)
LR_Penalized 0.7425 (0.0376) 0.7461 (0.0353) 0.7425 (0.0376) 0.7413 (0.0386)
SVM_Penalized 0.7463 (0.0362) 0.7525 (0.0323) 0.7463 (0.0362) 0.7443 (0.0382)
CART_Penalized 0.8212 (0.0379) 0.8259 (0.0397) 0.8212 (0.0379) 0.8207 (0.038)
SGD_Penalized 0.585 (0.057) 0.6433 (0.0646) 0.585 (0.057) 0.5401 (0.0918)

The best model is SVM. Let's make predictions.

In [48]:
model = SVC(gamma='auto', random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)

print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
Confusion Matrix:
[[100   0]
 [ 54   0]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.65      1.00      0.79       100
         1.0       0.00      0.00      0.00        54

    accuracy                           0.65       154
   macro avg       0.32      0.50      0.39       154
weighted avg       0.42      0.65      0.51       154

The prediction results are inconsistent with the cross-validation scores. The high CV scores are an artifact of (simple) random oversampling: since we duplicated minority-class examples before cross-validating, copies of the same samples end up in both the training and evaluation folds, so the model is effectively scored on samples it has already seen. It is the same effect as data leakage.
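
One way to avoid this kind of leakage (a sketch, not something the notebook runs) is to resample inside each cross-validation fold using imblearn's Pipeline, so duplicated minority samples never appear in the evaluation fold:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler

# Oversampling happens only on the training folds; the evaluation fold stays untouched
pipe = Pipeline(steps=[('over', RandomOverSampler(sampling_strategy='minority', random_state=101)),
                       ('cart', DecisionTreeClassifier(random_state=101))])

X_tr, X_val, y_tr, y_val = train_test_split(X['All_Original'], y, test_size=0.20,
                                            stratify=y, random_state=101)
kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
print(cross_val_score(pipe, X_tr, y_tr, cv=kfold, scoring='f1_macro').mean())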

Applying CART, the second best result in cross-validation, to our validation data.

In [49]:
model = DecisionTreeClassifier(random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)

print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
Confusion Matrix:
[[84 16]
 [22 32]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.79      0.84      0.82       100
         1.0       0.67      0.59      0.63        54

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.72       154
weighted avg       0.75      0.75      0.75       154

Synthetic Samples (SMOTE)

SMOTE is another oversampling technique: it selects minority-class examples that are close in feature space, draws a line between them, and creates a new synthetic sample at a point along that line (x_new = x_i + λ·(x_nn − x_i), with λ between 0 and 1).
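
A tiny sketch of that interpolation step (with two made-up minority points, just to illustrate the formula; it is not part of the original cells):

rng = np.random.RandomState(101)

# Two hypothetical minority-class points (e.g. in a Glucose/Age plane)
x_i  = np.array([148., 50.])
x_nn = np.array([183., 32.])

lam = rng.rand()                      # lambda drawn uniformly from [0, 1]
x_new = x_i + lam * (x_nn - x_i)      # synthetic sample on the segment between them
print(x_new)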

Splitting data into train and test.

In [50]:
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
                                                                stratify=y, random_state=101)

Summarizing the number of examples in each class.

In [51]:
from collections import Counter

# summarize class distribution
counter = Counter(y_train)
print(counter)
Counter({0.0: 400, 1.0: 214})

Creating a scatter plot of the dataset, coloring the examples of each class differently to clearly see the spatial nature of the class imbalance.

In [52]:
plt.figure(figsize=(10,6))

color = ['blue', 'orange']
# Scatter plot of examples by class label (columns 1 and 7 are Glucose and Age)
for label, _ in counter.items():
  row_ix = np.where(y_train == label)[0]
  # The class label (0/1) doubles as the index into the colour list;
  # the legend entry is set explicitly so it does not depend on iteration order
  plt.scatter(X_train[row_ix, 1], X_train[row_ix, 7],
              alpha=0.5, c=color[int(label)],
              label='Diabetic' if int(label) == 1 else 'Non-Diabetic')
plt.legend(frameon=True, facecolor='w', fontsize=12);
In [53]:
from imblearn.over_sampling import SMOTE

# Transform the dataset
oversample = SMOTE(random_state=101)
X_train, y_train = oversample.fit_resample(X_train, y_train)
In [54]:
# Summarize the new class distribution
counter = Counter(y_train)
print(counter)
Counter({0.0: 400, 1.0: 400})
In [55]:
plt.figure(figsize=(10,6))

color = ['blue', 'orange']
# Scatter plot after SMOTE (columns 1 and 7 are Glucose and Age)
for label, _ in counter.items():
  row_ix = np.where(y_train == label)[0]
  # The class label (0/1) doubles as the index into the colour list;
  # the legend entry is set explicitly so it does not depend on iteration order
  plt.scatter(X_train[row_ix, 1], X_train[row_ix, 7],
              alpha=0.5, c=color[int(label)],
              label='Diabetic' if int(label) == 1 else 'Non-Diabetic')
plt.legend(frameon=True, facecolor='w', fontsize=12);
In [56]:
# Macro metrics treat all classes as equal
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)

# Evaluate each model in turn

for score in scores:
  for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
    cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
In [57]:
cv_df
Out[57]:
accuracy precision_macro recall_macro f1_macro
LR 0.7725 (0.0348) 0.774 (0.0339) 0.7725 (0.0348) 0.7721 (0.0351)
LDA 0.7612 (0.0323) 0.7652 (0.031) 0.7612 (0.0323) 0.7602 (0.0329)
KNN 0.7662 (0.0454) 0.7749 (0.0473) 0.7662 (0.0454) 0.7644 (0.0459)
CART 0.7588 (0.0454) 0.7613 (0.0454) 0.7588 (0.0454) 0.7581 (0.0458)
NB 0.7437 (0.0187) 0.7465 (0.0204) 0.7437 (0.0187) 0.7431 (0.0185)
SVM 0.6938 (0.0392) 0.8108 (0.0157) 0.6938 (0.0392) 0.6602 (0.0511)
SGD 0.6325 (0.066) 0.6737 (0.0787) 0.6325 (0.066) 0.5957 (0.1043)
LR_Penalized 0.7725 (0.0348) 0.774 (0.0339) 0.7725 (0.0348) 0.7721 (0.0351)
SVM_Penalized 0.7612 (0.0377) 0.765 (0.0366) 0.7612 (0.0377) 0.7603 (0.0382)
CART_Penalized 0.7588 (0.0454) 0.7613 (0.0454) 0.7588 (0.0454) 0.7581 (0.0458)
SGD_Penalized 0.6325 (0.066) 0.6737 (0.0787) 0.6325 (0.066) 0.5957 (0.1043)
In [58]:
model = KNeighborsClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_validation)

print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
Confusion Matrix:
[[71 29]
 [20 34]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.78      0.71      0.74       100
         1.0       0.54      0.63      0.58        54

    accuracy                           0.68       154
   macro avg       0.66      0.67      0.66       154
weighted avg       0.70      0.68      0.69       154

Combining SMOTE and Undersampling

SMOTEENN combines SMOTE oversampling with Edited Nearest Neighbours (ENN) undersampling: synthetic minority samples are generated first, and ENN then removes samples that disagree with the majority of their nearest neighbours, cleaning up the class boundary.

In [59]:
# example of both undersampling and oversampling
from imblearn.combine import SMOTEENN
In [60]:
# Macro metrics treat all classes as equal
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)

# Evaluate each model in turn
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
                                                                stratify=y, random_state=101)
# define oversampling followed by undersampling strategy
overundersample = SMOTEENN(sampling_strategy=0.81, random_state=101)
# fit and apply the transform
X_train, y_train = overundersample.fit_resample(X_train, y_train)

for score in scores:
  for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
    cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
In [61]:
cv_df
Out[61]:
accuracy precision_macro recall_macro f1_macro
LR 0.9221 (0.0492) 0.9234 (0.0472) 0.9222 (0.0518) 0.921 (0.0501)
LDA 0.919 (0.0455) 0.9246 (0.0408) 0.9147 (0.0484) 0.9168 (0.0474)
KNN 0.9565 (0.0375) 0.9568 (0.038) 0.9579 (0.0356) 0.9562 (0.0376)
CART 0.9624 (0.0289) 0.9641 (0.0267) 0.9629 (0.0297) 0.9619 (0.0295)
NB 0.9334 (0.0364) 0.9335 (0.035) 0.9345 (0.0374) 0.9327 (0.0368)
SVM 0.6953 (0.0668) 0.825 (0.0281) 0.6573 (0.0763) 0.6228 (0.1034)
SGD 0.8083 (0.0567) 0.8364 (0.0415) 0.8037 (0.0555) 0.7994 (0.0627)
LR_Penalized 0.9161 (0.0366) 0.918 (0.0322) 0.9169 (0.0411) 0.9148 (0.0379)
SVM_Penalized 0.9306 (0.0314) 0.9334 (0.0271) 0.9326 (0.0349) 0.9297 (0.0324)
CART_Penalized 0.9507 (0.0227) 0.9513 (0.0233) 0.9514 (0.0223) 0.9502 (0.0229)
SGD_Penalized 0.7818 (0.1038) 0.8298 (0.0641) 0.7834 (0.0913) 0.7684 (0.1162)
In [62]:
model = DecisionTreeClassifier(random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)

print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
Confusion Matrix:
[[81 19]
 [14 40]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.85      0.81      0.83       100
         1.0       0.68      0.74      0.71        54

    accuracy                           0.79       154
   macro avg       0.77      0.78      0.77       154
weighted avg       0.79      0.79      0.79       154

Conclusion

After applying different methods to compensate for the imbalance, we can see there are some gains.

Our first predictions on the validation data gave:

  • accuracy = 0.74
  • f1_macro = 0.72

The best result was obtained by combining SMOTE and Undersampling:

  • accuracy = 0.79
  • f1_macro = 0.77