import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn')
import sklearn
import warnings
warnings.filterwarnings("ignore")
print('Sklearn Ver: ' + str(sklearn.__version__))
It is important to take a look at the data before starting the analysis: check the number of records and columns. This helps us see how the data is organized and whether there are any abnormalities.
path = 'diabetes.csv'
df = pd.read_csv(path)
df_original = df.copy()  # keep an untouched copy for later comparison
df.head()
# (n_rows, n_cols)
df.shape
It seems that all features are numeric.
Checking for missing values. Most machine learning algorithms do not support NaN values.
df.isna().sum()
Even though all features appear to be numeric, we should confirm the data types in our DataFrame.
df.info()
Checking basic statistics.
df.describe()
There are some suspicious values: 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin' and 'BMI' equal to 0 (zero). I will deal with that later. First, let's check whether our target variable ('Outcome') is balanced.
df['Outcome'].value_counts()
Our target is imbalanced, which may affect our predictions. I will deal with that in the next sections.
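As a quick check of the degree of imbalance, we can also look at the class proportions rather than the raw counts; a minimal addition to the value_counts() above:
# Proportion of each class in the target
df['Outcome'].value_counts(normalize=True)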
Let's plot histograms to see the features' distribution.
df.hist(figsize=(10,5), bins=20);
plt.tight_layout()
Plotting density (KDE) lines gives a smoothed view of each distribution, which helps to check for Gaussian shapes.
df.plot(kind='kde', subplots=True, layout=(3,3), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
Boxplots show data dispersion and potential outliers.
df.plot(kind='box', subplots=True, layout=(2,5), figsize=(10,5), showmeans=True,
sharex=False, sharey=False);
plt.tight_layout()
After exploring each feature independently, let's look at the correlations between them. Scatter plots are useful for this task, revealing patterns and linear relationships between variables.
sns.pairplot(df);
There are many methods for quantifying correlations. One of the most popular is Pearson's correlation, which assumes normally distributed data and measures the linear relationship between variables.
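As a quick illustration of what the heatmap below summarizes, Pearson's r can be computed for a single pair of columns; a minimal sketch using scipy (which ships as a scikit-learn dependency):
from scipy.stats import pearsonr
# Linear correlation between one feature and the target
r, p_value = pearsonr(df['Glucose'], df['Outcome'])
print('Pearson r (Glucose vs Outcome): %.3f (p = %.3g)' % (r, p_value))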
I am going to plot a heatmap and use it as a correlation matrix.
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(method='pearson'), annot=True)
plt.tight_layout()
Ranking the correlation between features and target.
df.corr().iloc[:-1,-1].sort_values(ascending=False) * 100
I don't know much about health data (what counts as too high or too low), but some extreme values are clearly unreasonable.
def MinMaxValues(data, top, remove):
    """
    Returns the 5 lowest/highest distinct values from each dataframe column
    data: dataframe
    top: 'low' or 'high'
    remove: column to drop from the result (e.g. the target)
    """
    if top == 'low':
        sort = True
    elif top == 'high':
        sort = False
    else:
        raise ValueError("top must be 'low' or 'high'")
    values = {}
    for col in data.columns:
        values[col] = data[col].value_counts().sort_index(ascending=sort).iloc[:5].index
    values.pop(remove)
    return pd.DataFrame(values)
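For comparison, the 'low' case can also be written as a one-liner; a rough sketch that is equivalent only when every column has at least five unique values:
# 5 lowest distinct values per feature (illustration only)
pd.DataFrame({col: np.sort(df[col].unique())[:5] for col in df.columns if col != 'Outcome'})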
# Checking the 5 lowest values
MinMaxValues(df, 'low', 'Outcome')
Well, there are some values equal to 0 (zero) in 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin' and 'BMI'. I will treat these records as incorrect and fill them in with the respective column's median.
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
    df.loc[df[col] == 0, col] = df[col].median()
# Checking the 5 lowest values
MinMaxValues(df, 'low', 'Outcome')
Now they seem reasonable.
# Checking the 5 highest values
MinMaxValues(df, 'high', 'Outcome')
The number of pregnancies is very high, but 17 pregnancies is not impossible. Due to my lack of knowledge in the area, I will accept the other values as correct.
After these operations, let's take another look at the data distribution.
df.hist(figsize=(10,5), bins=20);
plt.tight_layout()
df.plot(kind='kde', subplots=True, layout=(3,3), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
df.plot(kind='box', subplots=True, layout=(2,5), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
It is not easy to determine which features are important to our model. Some features only contribute noise, hurting the predictions. One way to select features is Univariate Selection, which chooses variables based on univariate statistical tests.
There are 8 features; as a test, let's select the 4 most relevant. The statistical test we will use is chi-squared.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Splitting the data into input and output (the 'Outcome' column is our target)
X = df.values[:,:-1]
y = df.values[:,-1]
# chi2: Chi-Squared
# k = 4: top 4 features
test = SelectKBest(score_func = chi2, k = 4)
fit = test.fit(X,y)
# Summarizing Scores
print("Scores:")
print(fit.scores_)
print()
# Summarizing Selected features
features = fit.transform(X)
print('Selected features:')
print(features[0:5,:])
# Renaming selected features dataframe columns
features = pd.DataFrame(features)
features.columns = df.columns[:-1][fit.get_support()]  # names of the selected columns, in their original order
features.head()
# Joining our target
features['Outcome'] = df['Outcome']
features.head()
Each machine learning algorithm has its peculiarities and expects the input data in a certain form. I will try three techniques: data normalization, data standardization and power transformation.
The Normalizer rescales each input row (sample) to unit norm. This process does not depend on the distribution of the samples and is useful for algorithms that rely on distance measures.
from sklearn.preprocessing import Normalizer
# Splitting the data into input and output
X = features.values[:,:-1]
y = features.values[:,-1]
# Normalizing
X_normalized = Normalizer().fit_transform(X)
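As a sanity check, each row of the normalized array should have (approximately) unit L2 norm; a minimal sketch:
# L2 norm of the first few normalized rows (should all be ~1.0)
print(np.linalg.norm(X_normalized, axis=1)[:5])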
This process transforms a normally distributed random variable with mean μ and standard deviation σ into a standard normal variable with mean 0 and standard deviation 1, i.e. z = (x - μ) / σ.
# Checking features' normality
features.hist(figsize=(10,5), bins=20);
plt.tight_layout()
features.plot(kind='kde', subplots=True, layout=(3,2), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
We can see that the data is not exactly normally distributed, but we will continue anyway.
from sklearn.preprocessing import StandardScaler
# Splitting the data into input and output
X = features.values[:,:-1]
y = features.values[:,-1]
# Standardizing the data
X_standard = StandardScaler().fit_transform(X)
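As a quick check, each standardized column should now have mean close to 0 and standard deviation close to 1; a minimal sketch:
# Column-wise mean and std after standardization
print(np.round(X_standard.mean(axis=0), 4))
print(np.round(X_standard.std(axis=0), 4))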
The previous transformation assumes the data is normally distributed. To improve on this, we will use a power transform, which applies a transformation that makes the distributions normal (or close to it).
from sklearn.preprocessing import PowerTransformer
# PowerTransformer applies a transformation to make the data more Gaussian-like
# Setting standardize=True also applies zero-mean, unit-variance scaling afterwards
power = PowerTransformer(method='yeo-johnson', standardize=True)
X = features.values[:,:-1]
y = features.values[:,-1]
X_power = power.fit_transform(X)
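For reference, the fitted Yeo-Johnson exponents can be inspected through the lambdas_ attribute (one value per feature); values close to 1 mean little transformation was needed:
# Estimated lambda for each input feature
print(power.lambdas_)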
# Checking the transformation results
pd.DataFrame(X_power).hist(figsize=(10,5), bins=20);
plt.tight_layout()
pd.DataFrame(X_power).plot(kind='kde', subplots=True, layout=(2,2), figsize=(10,5), sharex=False, sharey=False);
plt.tight_layout()
Testing different models to find the one that makes the best predictions.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
# Processed data dictionary
X = {}
# All features, original data (no transformations)
X['All_Original'] = df_original.values[:,:-1]
# All features, cleaned data (no transformations)
X['All_Cleanned'] = df.values[:,:-1]
# Selected features without transformations
X['Selected_Cleanned'] = features.values[:,:-1]
# Selected features Normalized
X['Normalized'] = X_normalized
# Selected features Standardized
X['Standardized'] = X_standard
# Selected features Power Transformed
X['Power_Transformed'] = X_power
y = df.values[:,-1]
Defining the algorithms to evaluate. Due to the class imbalance, I will also include some penalized (class-weighted) versions.
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier(random_state=101)))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto', random_state=101)))
models.append(('SGD', SGDClassifier(random_state=101)))
# Penalized algorithms
models.append(('LR_Penalized', LogisticRegression(class_weight='balanced')))
models.append(('SVM_Penalized', SVC(kernel='linear',
class_weight='balanced',
probability=True, random_state=101)))
models.append(('CART_Penalized', DecisionTreeClassifier(class_weight='balanced',
random_state=101)))
models.append(('SGD_Penalized', SGDClassifier(class_weight='balanced',
random_state=101)))
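For reference, class_weight='balanced' weights each class inversely proportional to its frequency, i.e. n_samples / (n_classes * n_samples_in_class). A minimal sketch using sklearn's helper, assuming y as defined above:
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), np.round(weights, 3))))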
# Store all results
all_results = {}
all_resumed = {}
# Evaluate each model in turn
for feat in X.keys():
    all_results[feat] = {}
    all_resumed[feat] = {}
    X_train, X_validation, y_train, y_validation = train_test_split(X[feat], y, test_size=0.20, random_state=101)
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        all_results[feat].update({name: cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')})
        all_resumed[feat].update({name: {'Mean': np.round(all_results[feat][name].mean(), 4)}})
        all_resumed[feat][name].update({'Std': np.round(all_results[feat][name].std(), 4)})
# Algorithm comparison
for feat in X.keys():
    print(feat + ':')
    print(pd.DataFrame(all_resumed[feat]).T)
    print()
# Plotting algorithm comparison
fig, axes = plt.subplots(3,2, figsize=(14,12))
i, j = 0, 0
for feat in X.keys():
    axes[i,j].boxplot(all_results[feat].values(),
                      labels=all_results[feat].keys(), showmeans=True)
    axes[i,j].set_title(feat + ': Algorithm Comparison')
    j += 1
    if j > 1:
        i += 1
        j = 0
plt.tight_layout()
The best result comes from LDA on the All_Original and All_Cleanned data.
Although accuracy is a good indicator, our classes are imbalanced, so we need other metrics to compare the models.
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
# Evaluate each model in turn
for feat in X.keys():
    all_results[feat] = {}
    all_resumed[feat] = {}
    X_train, X_validation, y_train, y_validation = train_test_split(X[feat], y, test_size=0.20, random_state=101)
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        all_results[feat].update({name: {}})
        all_resumed[feat].update({name: {}})
        for score in scores:
            all_results[feat][name].update({score: cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)})
            all_resumed[feat][name].update({score: str(np.round(all_results[feat][name][score].mean(), 4)) + ' (' +
                                                   str(np.round(all_results[feat][name][score].std(), 4)) + ')'})
# list(X.keys())[0] means the first data case: All_Original
pd.DataFrame(all_resumed[list(X.keys())[0]]).T
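As a side note, sklearn's cross_validate can compute several metrics in a single pass, instead of re-running cross-validation once per metric as the loop above does. A minimal sketch with one model on the All_Original data:
from sklearn.model_selection import cross_validate
X_tr, X_val, y_tr, y_val = train_test_split(X['All_Original'], y, test_size=0.20, random_state=101)
cv_res = cross_validate(LogisticRegression(class_weight='balanced'), X_tr, y_tr,
                        cv=StratifiedKFold(n_splits=10, random_state=101, shuffle=True),
                        scoring=['accuracy', 'f1_macro'])
# Results come back under 'test_<metric>' keys
print(np.round(cv_res['test_accuracy'].mean(), 4), np.round(cv_res['test_f1_macro'].mean(), 4))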
When we look at precision, recall and F1, we realize that it is not a good idea to rely on accuracy alone.
Let's make some predictions with the SVM model and print a confusion matrix.
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20, random_state=101)
model = SVC(gamma='auto', random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
This example shows that the model is biased, predicting only the majority class.
As we cannot acquire more data, we need to use other methods, such as undersampling the majority class, oversampling the minority class, and generating synthetic samples with SMOTE. After applying each technique, we will evaluate which one is most appropriate.
Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset. A drawback is that we remove information that may be valuable, which can lead to underfitting and poor generalization on the test set.
# Check version number
import imblearn
print('Imblearn Ver: ' + str(imblearn.__version__))
# Random undersampling to balance the class distribution
from imblearn.under_sampling import RandomUnderSampler
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)
# Evaluate each model in turn
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
stratify=y, random_state=101)
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy='majority')
# fit and apply the transform
X_train, y_train = undersample.fit_resample(X_train, y_train)
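As a quick check, the undersampled training set should now have the same number of examples in each class; a minimal sketch:
from collections import Counter
# Class distribution after random undersampling
print(Counter(y_train))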
for score in scores:
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
        cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
cv_df
The best model is LR_Penalized. Let's make predictions.
model = LogisticRegression(class_weight='balanced') # Penalized
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset.
# Random oversampling to balance the class distribution
from imblearn.over_sampling import RandomOverSampler
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)
# Evaluate each model in turn
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
stratify=y, random_state=101)
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_train, y_train = oversample.fit_resample(X_train, y_train)
for score in scores:
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
        cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
cv_df
The best model is SVM. Let's make predictions.
model = SVC(gamma='auto', random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
The validation results are inconsistent with the cross-validation scores. The high CV scores are an artifact of simple oversampling: because we randomly duplicate minority-class examples before cross-validation, copies of the same samples end up in both the training and validation folds, so the model is scored on samples it has effectively already seen. The effect is similar to data leakage.
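One way to avoid this optimistic bias is to resample inside each cross-validation fold rather than before splitting, for example with imblearn's Pipeline, which applies the sampler only to the training portion of each split. A minimal sketch, not the approach followed in the rest of this notebook:
from imblearn.pipeline import Pipeline as ImbPipeline
# Re-split so the training data is not already resampled
X_tr, X_val, y_tr, y_val = train_test_split(X['All_Original'], y, test_size=0.20,
                                            stratify=y, random_state=101)
pipe = ImbPipeline([('over', RandomOverSampler(sampling_strategy='minority', random_state=101)),
                    ('model', DecisionTreeClassifier(random_state=101))])
kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
print(np.round(cross_val_score(pipe, X_tr, y_tr, cv=kfold, scoring='f1_macro').mean(), 4))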
Back to the simple-oversampling experiment: applying CART, which had the second-best cross-validation result, to our validation data.
model = DecisionTreeClassifier(random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
SMOTE is another oversampling technique: it selects minority-class examples that are close in feature space, draws a line between a pair of them, and creates a new synthetic sample at a point along that line.
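Conceptually, each synthetic sample lies on the line segment between a minority-class example and one of its nearest neighbours; a toy illustration with made-up points (not real records from the dataset):
# x_new = x_i + gap * (x_neighbor - x_i), with gap drawn uniformly from [0, 1)
x_i = np.array([120.0, 30.0])         # hypothetical minority example (e.g. Glucose, BMI)
x_neighbor = np.array([140.0, 34.0])  # one of its nearest minority neighbours
gap = np.random.rand()
x_new = x_i + gap * (x_neighbor - x_i)
print(x_new)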
Splitting the data into train and validation sets.
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
stratify=y, random_state=101)
Summarizing the number of examples in each class.
from collections import Counter
# summarize class distribution
counter = Counter(y_train)
print(counter)
Creating a scatter plot of the dataset, coloring the examples of each class differently, to show the spatial nature of the class imbalance.
plt.figure(figsize=(10,6))
color = ['blue', 'orange']
# Scatter plot of examples by class label (Glucose on x, Age on y)
for label, _ in counter.items():
    row_ix = np.where(y_train == label)[0]
    # The label value 0/1 doubles as the colour index
    plt.scatter(X_train[row_ix, 1], X_train[row_ix, 7],
                alpha=0.5, c=color[int(label)],
                label='Diabetic' if int(label) == 1 else 'Non-Diabetic')
plt.legend(frameon=True, facecolor='w', fontsize=12);
from imblearn.over_sampling import SMOTE
# Transform the dataset
oversample = SMOTE(random_state=101)
X_train, y_train = oversample.fit_resample(X_train, y_train)
# Summarize the new class distribution
counter = Counter(y_train)
print(counter)
plt.figure(figsize=(10,6))
color = ['blue', 'orange']
# Scatter plot of examples by class label (Glucose on x, Age on y)
for label, _ in counter.items():
    row_ix = np.where(y_train == label)[0]
    # The label value 0/1 doubles as the colour index
    plt.scatter(X_train[row_ix, 1], X_train[row_ix, 7],
                alpha=0.5, c=color[int(label)],
                label='Diabetic' if int(label) == 1 else 'Non-Diabetic')
plt.legend(frameon=True, facecolor='w', fontsize=12);
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)
# Evaluate each model in turn
for score in scores:
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
        cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
cv_df
Making predictions with KNN on the validation data.
model = KNeighborsClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
We can also combine oversampling and undersampling. SMOTEENN applies SMOTE and then Edited Nearest Neighbours (ENN), which removes samples whose class disagrees with the majority of their nearest neighbours.
# Combining oversampling (SMOTE) with undersampling (ENN)
from imblearn.combine import SMOTEENN
# Macro-averaged metrics treat all classes as equally important
scores = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
model_name = [m[0] for m in models]
cv_df = pd.DataFrame(columns=scores, index=model_name)
# Evaluate each model in turn
X_train, X_validation, y_train, y_validation = train_test_split(X['All_Original'], y, test_size=0.20,
stratify=y, random_state=101)
# define oversampling followed by undersampling strategy
overundersample = SMOTEENN(sampling_strategy=0.81, random_state=101)
# fit and apply the transform
X_train, y_train = overundersample.fit_resample(X_train, y_train)
for score in scores:
    for name, model in models:
        kfold = StratifiedKFold(n_splits=10, random_state=101, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
        cv_df.loc[name, score] = str(np.round(cv_results.mean(), 4))+' ('+str(np.round(cv_results.std(), 4))+')'
cv_df
Making predictions with CART (the decision tree) on the validation data.
model = DecisionTreeClassifier(random_state=101)
model.fit(X_train, y_train)
predictions = model.predict(X_validation)
print('Confusion Matrix:')
print(confusion_matrix(y_validation, predictions))
print()
print('Classification Report:')
print(classification_report(y_validation, predictions))
After applying different methods to compensate for the imbalance, we can see some gains.
The first time we predicted on the validation data we had:
The best result was obtained by combining SMOTE and undersampling (SMOTEENN):