In the United States, the number of people with diabetes (too much glucose, or sugar, in the blood) has been increasing at an alarming rate, with an estimated 34.2 million Americans living with the disease. There are two types of diabetes (type 1 and type 2), both of which ultimately leave the body unable to make enough insulin to lower blood glucose. Type 1 is largely a genetic condition, while type 2 (the most common form) develops over time as a result of lifestyle habits. Diabetes affects many physiological systems and is linked to major complications including, but not limited to, heart disease, stroke, blindness, and renal failure. In the U.S., diabetes is the seventh leading cause of death, highlighting how dangerous the disease is. To learn more, click on the hyperlinks above.
In this tutorial, we will look at a dataset from Kaggle (which can be found here), originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases, where the objective "is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset." These diagnostic measurements are:
Pregnancies = number of pregnancies
Glucose = glucose level in the blood (plasma glucose concentration)
BloodPressure = diastolic blood pressure (mm Hg)
SkinThickness = triceps skin fold thickness (mm)
Insulin = insulin level in the blood / 2-Hour serum insulin (mu U/ml)
BMI = body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction = function which scores likelihood of diabetes based on family history
Age = age in years
Outcome = 0 if the patient does not have diabetes, 1 if the patient does
All the patients in the dataset are females of Pima Indian heritage who are at least 21 years of age.
We will load the data, perform exploratory analysis to gain insights about the data, do hypothesis testing to draw conclusions through statistical testing, and then use classification algorithms to build a model that accurately predicts whether a patient in the dataset has diabetes.
We will need to import the following packages for this tutorial:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC,SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,StackingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from scipy.stats import normaltest, ttest_ind, mannwhitneyu, spearmanr
Since the diabetes data is contained in a CSV file, which can be exported from Kaggle, we use the pandas read_csv function to create a dataframe containing all of the data from the file. After loading the data, we examine the dataframe with the info and describe functions.
# Extracting data from the file
data = pd.read_csv('diabetes.csv')
# Getting the details of the dataframe
data.info()
data.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
We can see that there are 768 non-null, numeric entries in every column, so there are seemingly no missing values. However, the describe function reveals a minimum value of zero in several columns, which is extremely questionable: zeroes in the Glucose, BloodPressure, SkinThickness, Insulin, and BMI columns would be biologically impossible. These zeroes may be meant to represent missing values, but the creators of the dataset make no note of that, so we can't be sure.
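To get a sense of how widespread these questionable zeroes are before deciding how to handle them, we can run a quick check on the dataframe loaded above (a small sketch of our own, not part of the original analysis):

# Count the zero entries in each column where a value of zero is biologically implausible
questionable_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
zero_counts = (data[questionable_cols] == 0).sum()
print(zero_counts)
print("Share of rows with at least one zero:",
      round((data[questionable_cols] == 0).any(axis=1).mean(), 3))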
Let's do some data visualization on this dataset to get a clearer idea of how the data actually looks and how the features relate to the outcomes.
We will now use a count plot to see how many patients in the dataset have each outcome.
# Creating the count plot
sns.countplot(data = data, x = 'Outcome', palette = 'Greens')
plt.title("Number of Patients in the Dataset with Outcome 0 or Outcome 1")
plt.show()
We can see that most of the patients in the dataset do not have diabetes: roughly 65% (500/768) of the patients do not have diabetes, while about 35% do.
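If we want the exact proportions rather than reading them off the plot, value_counts gives them directly (a one-line check):

# Counts and proportions of patients with each outcome
print(data['Outcome'].value_counts())
print(data['Outcome'].value_counts(normalize=True).round(3))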
We will now create box plots to view the statistical relationship between each feature and the diabetes outcome.
# Iterating through the columns and creating a boxplot for each feature in regard to outcome
for col in data.columns:
    if col != 'Outcome':
        sns.boxplot(data = data, x = 'Outcome', y = col, palette = "Paired")
        titlestring = "Box plot of " + col + " and Diabetes Outcomes \n"
        plt.title(titlestring)
        plt.show()
The box plots show that, for every feature, both the median and the upper quartile are higher for outcome 1 (patients with diabetes) than for outcome 0. The lower quartile is the same for both outcomes for Insulin and SkinThickness, and higher for outcome 1 in the rest of the features. This makes sense since people with diabetes tend to have higher glucose levels and often need to be administered insulin to lower them, which may contribute to higher insulin levels. Diabetes usually occurs in middle-aged and older adults, which is also reflected in the box plot for Age, where the median for outcome 1 lies between 30 and 40 years. Overall, the dataset seems to reflect that the diabetes outcome is associated with higher levels of all the features.
The box plots also show that many of the features have outliers. One approach to cleaning the data would be to remove these outliers or replace them with the mean/median value, but since there are so many of them, it may be better to leave them in rather than modify too much of the data. Furthermore, they may represent natural variation rather than measurement error, so they can be treated as true outliers and left in the dataset, which is how we choose to treat them.
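To roughly quantify how many points the box plots flag, we can count values outside the usual 1.5 × IQR whiskers (the same rule Seaborn uses to draw its whiskers); this is a quick sketch rather than part of our cleaning pipeline:

# Count values outside the 1.5 * IQR whiskers for each feature
for col in data.columns.drop('Outcome'):
    q1, q3 = data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((data[col] < lower) | (data[col] > upper)).sum()
    print(f"{col}: {n_outliers} potential outliers")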
Now we will create a heatmap of the correlation matrix to see the relationship between each pair of features/variables (how correlated they are).
# Computes pairwise correlation (corr()) of columns and then uses it to make a heatmap
corr = data.corr()
plt.figure(figsize = (8,6))
sns.heatmap(corr,annot = True,cmap = 'Purples')
plt.title("Correlation Matrix")
plt.show()
Lighter colors indicate little correlation between a pair of variables, while darker colors indicate strong correlation. According to the heatmap above, the feature most correlated with the target variable 'Outcome' is 'Glucose' at 0.47, and the second most correlated is 'BMI' at 0.29. Overall, however, we don't see much correlation among the variables, except between 'Pregnancies' and 'Age', which is intuitive since age affects fertility, and between 'Insulin' and 'SkinThickness'.
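To make that ranking easier to read, the correlations with Outcome can be pulled straight out of the matrix and sorted (a small convenience step, not part of the original analysis):

# Correlation of each feature with the Outcome column, strongest first
print(corr['Outcome'].drop('Outcome').sort_values(ascending=False))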
We can look at the correlations more closely with scatterplots.
# Creating scatterplots between specific columns
sns.scatterplot(x = 'Pregnancies', y = 'Age', data = data)
plt.title("Correlation Between Pregnancies and Age")
plt.show()
sns.scatterplot(x = 'Insulin', y = 'SkinThickness', data = data)
plt.title("Correlation Between Insulin and SkinThickness")
plt.show()
sns.scatterplot(x = 'Outcome', y = 'Glucose', data = data)
plt.title("Correlation Between Outcome and Glucose")
plt.show()
sns.scatterplot(x = 'Outcome', y = 'BMI', data = data)
plt.title("Correlation Between Outcome and BMI")
plt.show()
The scatterplots don't really show a linear trend. In the Pregnancies vs Age plot we can see that lower numbers of pregnancies mostly occur in younger patients and that the density of data points drops as the number of pregnancies increases. In the Insulin vs SkinThickness plot, a large cluster of points in the lower-left portion shows that it is common in this dataset for a patient to have an insulin level between 0 and 200 together with a skin thickness between 0 and 40. The last two scatterplots involving Outcome do not tell us much, since the data can only lie in two places, 0 and 1. However, a distribution plot will let us look at the distribution of the data for each outcome, which will give us more information.
Now we will look at the data distribution of each feature with respect to the diabetes outcome using Seaborn's displot. We use kernel density estimation as a non-parametric way (a statistical method where the data doesn't have to fit a normal distribution) to estimate the probability distribution of a variable. It's important to look at the distribution of the data to see whether variables are normally distributed or skewed, so that we can make more informed choices about which statistical tests to use, since some tests require normally distributed data. To learn more, visit this article.
# Iterating through the columns and plotting a distribution plot using Seaborn, differentiating
# the distribution in each plot by having two colors (hue) correspond to the two outcomes and
# specifying we want to use kernel density estimation 'kde'
for col in data.drop(columns = 'Outcome').columns:
    sns.displot(data = data, x = col, hue = 'Outcome', kind = 'kde', palette = 'Reds')
    titlestring = "Distribution of " + col + " and Diabetes Outcomes \n"
    plt.title(titlestring)
    plt.show()
Many of the feature distributions, such as Pregnancies, DiabetesPedigreeFunction, and Age, appear positively (right) skewed (skewness being a statistic describing the shape, or asymmetry, of a distribution), due to the presence of a tail on the right side of the distribution, which means the outliers lie mostly toward the right.
We can also see the influence of the zero values in some of the graphs. In the Glucose, BloodPressure, SkinThickness, Insulin, and BMI graphs there is a bimodal distribution (two curves) for both outcomes, where one of the curves sits over the density of the zero values. If these zero values were ignored, making most of the distributions unimodal, then the two features most correlated with Outcome according to the heatmap would look close to normal: Glucose for outcome 0 and BMI for both outcomes almost follow a normal distribution (though they may not actually be normally distributed). BloodPressure for both outcomes also seems close to normal, while SkinThickness may be slightly positively skewed and Insulin is clearly positively skewed.
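A quick way to check this "ignoring the zeroes" reading is to redraw one of the distributions with the zero entries filtered out; here is a sketch for Glucose (any of the affected columns could be substituted):

# Re-plot a distribution with the implausible zero values filtered out
nonzero = data[data['Glucose'] != 0]
sns.displot(data = nonzero, x = 'Glucose', hue = 'Outcome', kind = 'kde', palette = 'Reds')
plt.title("Distribution of Glucose (zero values removed) and Diabetes Outcomes \n")
plt.show()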
We will do some data preprocessing to look at alternative ways we can represent and modify the data as a result of these zero values that have an effect on the distribution of the data.
We will make another dataframe with the same data, except that we drop rows with an entry of zero in the columns where a zero is a questionable finding, especially since these zero values are not acknowledged by the creators of the dataset. For instance, nearly half of the Insulin column has entries of zero, which is concerning: an insulin level of zero might occur in a person with untreated type 1 diabetes, but in this dataset the patients with insulin levels of zero have mixed diabetes outcomes. As a result, we will have a total of three dataframes to train machine learning models on: the original data, the data without the rows that contain zero entries in the questionable columns, and a dataframe in which those zero entries are replaced with the median value of their column.
The reason we want to use all three representations of the data, rather than choosing one, is that a large number of zero values are present in some of the columns. If we only used the data without the zero values, we would have a significantly smaller dataset to work with; if we only used the data that replaces these zeroes with median values, we would have modified a large fraction of the data, which reduces its authenticity. We therefore think it's better to build models on all three dataframes and compare model accuracy across them, which allows for more fruitful results.
Since we already have our original dataset, we will now create the remaining two.
# Removing the rows that have an entry of zero in the columns 'Glucose', 'BloodPressure',
# 'SkinThickness', 'Insulin', and 'BMI' column
remove_zero_data = data
remove_zero_data = remove_zero_data[(remove_zero_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] != 0).all(axis=1)]
remove_zero_data.describe()
|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 |
| mean | 3.301020 | 122.627551 | 70.663265 | 29.145408 | 156.056122 | 33.086224 | 0.523046 | 30.864796 | 0.331633 |
| std | 3.211424 | 30.860781 | 12.496092 | 10.516424 | 118.841690 | 7.027659 | 0.345488 | 10.200777 | 0.471401 |
| min | 0.000000 | 56.000000 | 24.000000 | 7.000000 | 14.000000 | 18.200000 | 0.085000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 21.000000 | 76.750000 | 28.400000 | 0.269750 | 23.000000 | 0.000000 |
| 50% | 2.000000 | 119.000000 | 70.000000 | 29.000000 | 125.500000 | 33.200000 | 0.449500 | 27.000000 | 0.000000 |
| 75% | 5.000000 | 143.000000 | 78.000000 | 37.000000 | 190.000000 | 37.100000 | 0.687000 | 36.000000 | 1.000000 |
| max | 17.000000 | 198.000000 | 110.000000 | 63.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
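As the count row shows, this filtering roughly halves the dataset; a quick comparison (just a sanity check) makes the loss explicit:

# How many rows were lost by removing the zero entries
print("Original rows:", len(data))
print("Rows after removing zero entries:", len(remove_zero_data))
print("Rows dropped:", len(data) - len(remove_zero_data))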
# Replacing the zero values in 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', and 'BMI'
# columns with the median of its column
replace_zero_data = data.copy()  # use a copy so the original dataframe is not modified in place
replace_zero_data['Glucose']=replace_zero_data['Glucose'].replace(0,replace_zero_data['Glucose'].median())
replace_zero_data['BloodPressure']=replace_zero_data['BloodPressure'].replace(0,replace_zero_data['BloodPressure'].median())
replace_zero_data['SkinThickness']=replace_zero_data['SkinThickness'].replace(0,replace_zero_data['SkinThickness'].median())
replace_zero_data['Insulin']=replace_zero_data['Insulin'].replace(0,replace_zero_data['Insulin'].median())
replace_zero_data['BMI']=replace_zero_data['BMI'].replace(0,replace_zero_data['BMI'].median())
replace_zero_data.describe()
|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 121.656250 | 72.386719 | 27.334635 | 94.652344 | 32.450911 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 30.438286 | 12.096642 | 9.229014 | 105.547598 | 6.875366 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 44.000000 | 24.000000 | 7.000000 | 14.000000 | 18.200000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.750000 | 64.000000 | 23.000000 | 30.500000 | 27.500000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 31.250000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
Now that we have our three dataframes, we will do some hypothesis testing.
Hypothesis testing is used to support "the certainty of findings": a null and an alternative hypothesis, essentially two mutually exclusive statements, are evaluated to see which one the sample data best supports. There are parametric and non-parametric hypothesis tests, and the difference lies in the distribution of the data: parametric tests are used on normally distributed data, while non-parametric tests are used on data that is not.
Although we made some inferences about the distributions using the distribution plots earlier, we will use Scipy's normaltest function to check whether each feature is normally distributed for each outcome. We do this to determine whether we should use a parametric or a non-parametric test on our data when deciding to reject or fail to reject our null hypothesis. Whether the null hypothesis is rejected is decided by the p-value returned by the test. The p-value is "the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct". The lower the p-value, the stronger the evidence against the null hypothesis. Statistical significance is commonly defined as a p-value of 0.05 or less, so we will use this as our significance level for rejecting the null hypothesis.
# Splitting the data into two dataframes based on outcome
outcome_0_data = data[data['Outcome'] == 0]
outcome_1_data = data[data['Outcome'] == 1]
# Arrays that will store the column names + outcome depending on their distribution
normaldist = []
notnormaldist = []
# Iterating through all of the columns and using the normaltest function to test the
# distribution of data for each column/feature and outcome by getting the pvalue and
# placing the name of the column and outcome in the appropriate array
for col in data.columns:
    if col != 'Outcome':
        pval0 = normaltest(outcome_0_data[col])[1]
        pval1 = normaltest(outcome_1_data[col])[1]
        if pval0 <= 0.05:
            notnormaldist.append(col + ' Outcome 0')
        else:
            normaldist.append(col + ' Outcome 0')
        if pval1 <= 0.05:
            notnormaldist.append(col + ' Outcome 1')
        else:
            normaldist.append(col + ' Outcome 1')
print("Normal Distribution:\n", normaldist)
print("Not Normal Distribution:\n", notnormaldist)
Normal Distribution:
 []
Not Normal Distribution:
 ['Pregnancies Outcome 0', 'Pregnancies Outcome 1', 'Glucose Outcome 0', 'Glucose Outcome 1', 'BloodPressure Outcome 0', 'BloodPressure Outcome 1', 'SkinThickness Outcome 0', 'SkinThickness Outcome 1', 'Insulin Outcome 0', 'Insulin Outcome 1', 'BMI Outcome 0', 'BMI Outcome 1', 'DiabetesPedigreeFunction Outcome 0', 'DiabetesPedigreeFunction Outcome 1', 'Age Outcome 0', 'Age Outcome 1']
We can see that all of the features for both outcomes in the original data have a p-value less than or equal to 0.05, meaning none of them follows a normal distribution.
Now let's repeat the process for the other two dataframes.
# Repeating process for the dataframe with zeroes removed
outcome_0_remove_zero_data = remove_zero_data[remove_zero_data['Outcome'] == 0]
outcome_1_remove_zero_data = remove_zero_data[remove_zero_data['Outcome'] == 1]
normaldist = []
notnormaldist = []
for col in remove_zero_data.columns:
    if col != 'Outcome':
        pval0 = normaltest(outcome_0_remove_zero_data[col])[1]
        pval1 = normaltest(outcome_1_remove_zero_data[col])[1]
        if pval0 <= 0.05:
            notnormaldist.append(col + ' Outcome 0')
        else:
            normaldist.append(col + ' Outcome 0')
        if pval1 <= 0.05:
            notnormaldist.append(col + ' Outcome 1')
        else:
            normaldist.append(col + ' Outcome 1')
print("Dataframe with Zeroes Removed\n")
print("Normal Distribution:\n", normaldist)
print("Not Normal Distribution:\n", notnormaldist, "\n")
# Repeating process for the dataframe with zeroes replaced by median values
outcome_0_replace_zero_data = replace_zero_data[replace_zero_data['Outcome'] == 0]
outcome_1_replace_zero_data = replace_zero_data[replace_zero_data['Outcome'] == 1]
normaldist = []
notnormaldist = []
for col in replace_zero_data.columns:
    if col != 'Outcome':
        pval0 = normaltest(outcome_0_replace_zero_data[col])[1]
        pval1 = normaltest(outcome_1_replace_zero_data[col])[1]
        if pval0 <= 0.05:
            notnormaldist.append(col + ' Outcome 0')
        else:
            normaldist.append(col + ' Outcome 0')
        if pval1 <= 0.05:
            notnormaldist.append(col + ' Outcome 1')
        else:
            normaldist.append(col + ' Outcome 1')
print("Dataframe with Zeroes Replaced by Median\n")
print("Normal Distribution:\n", normaldist)
print("Not Normal Distribution:\n", notnormaldist)
Dataframe with Zeroes Removed

Normal Distribution:
 ['BloodPressure Outcome 0', 'BloodPressure Outcome 1', 'SkinThickness Outcome 1']
Not Normal Distribution:
 ['Pregnancies Outcome 0', 'Pregnancies Outcome 1', 'Glucose Outcome 0', 'Glucose Outcome 1', 'SkinThickness Outcome 0', 'Insulin Outcome 0', 'Insulin Outcome 1', 'BMI Outcome 0', 'BMI Outcome 1', 'DiabetesPedigreeFunction Outcome 0', 'DiabetesPedigreeFunction Outcome 1', 'Age Outcome 0', 'Age Outcome 1']

Dataframe with Zeroes Replaced by Median

Normal Distribution:
 []
Not Normal Distribution:
 ['Pregnancies Outcome 0', 'Pregnancies Outcome 1', 'Glucose Outcome 0', 'Glucose Outcome 1', 'BloodPressure Outcome 0', 'BloodPressure Outcome 1', 'SkinThickness Outcome 0', 'SkinThickness Outcome 1', 'Insulin Outcome 0', 'Insulin Outcome 1', 'BMI Outcome 0', 'BMI Outcome 1', 'DiabetesPedigreeFunction Outcome 0', 'DiabetesPedigreeFunction Outcome 1', 'Age Outcome 0', 'Age Outcome 1']
We can see that in the dataframe where the zero values in certain columns were replaced by the median, none of the features is normally distributed with respect to outcome. However, the dataframe with the zeroes removed instead of replaced has three normal distributions: BloodPressure for both outcomes and SkinThickness for outcome 1. Earlier, when looking at the distribution plots, we correctly suspected that BloodPressure for both outcomes would be normally distributed without the zero values, and incorrectly suspected SkinThickness to be skewed. This highlights the importance of revisiting claims made in earlier parts of the data science process and of using statistical tests to support judgments about data, since they are more reliable than educated guesses without proof.
We will now use the Mann-Whitney U test, a non-parametric test, to compare the two outcomes for the features that aren't normally distributed, and the independent t-test, the parametric equivalent of the Mann-Whitney test, to compare the BloodPressure data in the removed-zeroes dataframe. For each feature, the hypotheses we test are below.
H0 denotes the null hypothesis (the feature's values for patients with diabetes are not greater than those for patients without diabetes), while H1 denotes the alternative hypothesis (the feature's values for patients with diabetes are stochastically greater than those for patients without diabetes). After conducting the statistical tests, we will compare each p-value with 0.05 to decide whether to reject the null hypothesis (reject if p-value <= 0.05, fail to reject otherwise).
print("Original Data \n")
# Arrays that will store the names of features that either reject or fail to reject H0
reject = []
fail_to_reject = []
# Iterating through the columns to do the Mann-Whitney test and get the p-value to see if
# it's <= 0.05
for col in data.columns:
    if col != 'Outcome':
        # alternative = 'greater' means that the alternative hypothesis is that the
        # distribution underlying x is stochastically greater than the distribution underlying y;
        # x being outcome_1_data and y being outcome_0_data
        manwhit_pvalue = mannwhitneyu(outcome_1_data[col], outcome_0_data[col],
                                      alternative = 'greater')[1]
        if manwhit_pvalue <= 0.05:
            reject.append(col)
        else:
            fail_to_reject.append(col)
print("Reject:", reject)
print("Fail to Reject:", fail_to_reject, "\n")
# Repeat above process for median data
print("Median Data \n")
reject = []
fail_to_reject = []
for col in replace_zero_data.columns:
    if col != 'Outcome':
        manwhit_pvalue = mannwhitneyu(outcome_1_replace_zero_data[col],
                                      outcome_0_replace_zero_data[col], alternative = 'greater')[1]
        if manwhit_pvalue <= 0.05:
            reject.append(col)
        else:
            fail_to_reject.append(col)
print("Reject:", reject)
print("Fail to Reject:", fail_to_reject, "\n")
# Repeat above process for removed zero data with the addition of conducting Independent T test
print("Removed Zero Data \n")
reject = []
fail_to_reject = []
for col in remove_zero_data.columns:
    if col != 'Outcome' and col != 'BloodPressure':
        manwhit_pvalue = mannwhitneyu(outcome_1_remove_zero_data[col],
                                      outcome_0_remove_zero_data[col], alternative = 'greater')[1]
        if manwhit_pvalue <= 0.05:
            reject.append(col)
        else:
            fail_to_reject.append(col)
    # Independent T test for BloodPressure
    elif col == 'BloodPressure':
        ttest_pvalue = ttest_ind(outcome_1_remove_zero_data[col],
                                 outcome_0_remove_zero_data[col], alternative = 'greater')[1]
        if ttest_pvalue <= 0.05:
            reject.append(col)
        else:
            fail_to_reject.append(col)
print("Reject:", reject)
print("Fail to Reject:", fail_to_reject)
Original Data

Reject: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Fail to Reject: []

Median Data

Reject: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Fail to Reject: []

Removed Zero Data

Reject: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Fail to Reject: []
After performing the Mann-Whitney U and independent t-tests, we can see that all of the p-values were statistically significant (<= 0.05), meaning that we reject the null hypothesis for every feature and can conclude that Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age all tend to be higher for patients with diabetes than for patients without diabetes, across all three dataframes. This is consistent with what we saw in the box plots of the original data, where the median and upper quartile were higher for outcome 1 than for outcome 0 for all of the features. Consequently, there seems to be a trend between the diabetes outcome and higher levels of all the features.
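A simple way to see the same pattern numerically is to compare the per-outcome medians, which is roughly what the rank-based Mann-Whitney test is comparing (a quick sketch on the original dataframe):

# Median of each feature for patients without (0) and with (1) diabetes
print(data.groupby('Outcome').median())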
Let's revisit the idea of correlation between the features and the outcome across the three dataframes, to see whether this apparent trend hints at correlation, using the Spearman rank correlation coefficient, a non-parametric measure of the strength and direction of the association between two variables.
print("Original Data \n")
reject = []
fail_to_reject = []
# Iterating through the columns to do the Spearman test and get the p-value to see if
# it's <= 0.05
for col in data.columns:
    if col != 'Outcome':
        # alternative = 'greater' means that the alternative hypothesis is that
        # the correlation is positive (greater than zero)
        spearman_results = spearmanr(data[col], data['Outcome'], alternative = 'greater')
        pval = spearman_results[1]
        print(col, ":", spearman_results)
        if pval <= 0.05:
            reject.append(col)
        else:
            fail_to_reject.append(col)
print("\nReject:", reject)
print("Fail to Reject:", fail_to_reject, "\n")
# Repeat above process for median data
print("Median Data \n")
reject = []
fail_to_reject = []
for col in replace_zero_data.columns:
    if col != 'Outcome':
        spearman_results = spearmanr(replace_zero_data[col], replace_zero_data['Outcome'],
                                     alternative = 'greater')
        pval = spearman_results[1]
        print(col, ":", spearman_results)
        if pval <= 0.05:
            reject.append(col)
        else:
            fail_to_reject.append(col)
print("\nReject:", reject)
print("Fail to Reject:", fail_to_reject, "\n")
# Repeat above process for removed zeroes data
print("Removed Zero Data \n")
reject = []
fail_to_reject = []
for col in remove_zero_data.columns:
    if col != 'Outcome':
        spearman_results = spearmanr(remove_zero_data[col], remove_zero_data['Outcome'],
                                     alternative = 'greater')
        pval = spearman_results[1]
        print(col, ":", spearman_results)
        if pval <= 0.05:
            reject.append(col)
        else:
            fail_to_reject.append(col)
print("\nReject:", reject)
print("Fail to Reject:", fail_to_reject)
Original Data

Pregnancies : SpearmanrResult(correlation=0.19868874913189663, pvalue=1.4063922395649354e-08)
Glucose : SpearmanrResult(correlation=0.4813968963350598, pvalue=4.20050321489302e-46)
BloodPressure : SpearmanrResult(correlation=0.1708324018917003, pvalue=9.60417975088519e-07)
SkinThickness : SpearmanrResult(correlation=0.1966084477604776, pvalue=1.971318139766496e-08)
Insulin : SpearmanrResult(correlation=0.0747519433225385, pvalue=0.019174226987829963)
BMI : SpearmanrResult(correlation=0.3073382077773797, pvalue=1.4520901823689255e-18)
DiabetesPedigreeFunction : SpearmanrResult(correlation=0.17535346981239172, pvalue=5.053170547826717e-07)
Age : SpearmanrResult(correlation=0.30904026356718634, pvalue=9.264872122688206e-19)

Reject: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Fail to Reject: []

Median Data

Pregnancies : SpearmanrResult(correlation=0.19868874913189663, pvalue=1.4063922395649354e-08)
Glucose : SpearmanrResult(correlation=0.4813968963350598, pvalue=4.20050321489302e-46)
BloodPressure : SpearmanrResult(correlation=0.1708324018917003, pvalue=9.60417975088519e-07)
SkinThickness : SpearmanrResult(correlation=0.1966084477604776, pvalue=1.971318139766496e-08)
Insulin : SpearmanrResult(correlation=0.0747519433225385, pvalue=0.019174226987829963)
BMI : SpearmanrResult(correlation=0.3073382077773797, pvalue=1.4520901823689255e-18)
DiabetesPedigreeFunction : SpearmanrResult(correlation=0.17535346981239172, pvalue=5.053170547826717e-07)
Age : SpearmanrResult(correlation=0.30904026356718634, pvalue=9.264872122688206e-19)

Reject: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Fail to Reject: []

Removed Zero Data

Pregnancies : SpearmanrResult(correlation=0.20023445259481654, pvalue=3.272010288408068e-05)
Glucose : SpearmanrResult(correlation=0.498728899671441, pvalue=2.421705565130313e-26)
BloodPressure : SpearmanrResult(correlation=0.1981265676706596, pvalue=3.917397274194511e-05)
SkinThickness : SpearmanrResult(correlation=0.2601534347999191, pvalue=8.728669666748573e-08)
Insulin : SpearmanrResult(correlation=0.37501949347790997, pvalue=7.739426816506896e-15)
BMI : SpearmanrResult(correlation=0.26707717071700626, pvalue=3.963032815867018e-08)
DiabetesPedigreeFunction : SpearmanrResult(correlation=0.19821353543782874, pvalue=3.888552617588149e-05)
Age : SpearmanrResult(correlation=0.39735473573446217, pvalue=1.4029475132681835e-16)

Reject: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Fail to Reject: []
The results of the Spearman rank correlation coefficient show not only that the features are correlated with the outcome but that they are positively correlated, across all three dataframes. None of the features was added to the Fail to Reject array because all of the p-values were less than or equal to 0.05, meaning the positive correlation between each feature and the outcome is statistically significant, so we reject the null hypothesis.
This concludes our hypothesis testing, which has allowed us to make supported claims about our findings; we now move on to building models for our data.
We will now develop machine learning models to predict the diabetes outcome of patients in the dataset. Using preexisting algorithms, we will produce a model that will make an outcome prediction when given information. There are many algorithms to choose from; however, we decided to use the classification models Random Forest, K-Nearest Neighbor (kNN), and Support Vector Machine (SVM).
Random Forest is a more versatile version of the decision tree, reducing overfitting and increasing accuracy. Both decision trees and random forests can handle missing data better than many other algorithms and are generally fast at prediction due to the logarithmic depth of the trees. However, the model takes more resources to build because it constructs many trees, and it is far less interpretable than a single decision tree.
We are expecting Random Forest to produce the best results for our dataset due to its versatility.
kNN is an intuitive and simple algorithm to visualize, with many variations arising from the distance metric used. However, it is sensitive to poorly matched features, has difficulty handling large amounts of data at once, and can end up favoring one class over the other. It also handles high-dimensional data poorly, with a risk of overfitting.
kNN is a common algorithm used on many datasets, so we want to see how accurate it can be in a situation where it may struggle.
SVMs are "the go-to method for a high performing algorithm with little tuning." They are more effective in datasets with high dimensions and clear margins between classes, and are generally more computationally friendly. It struggles with overlapping classes and becomes resource demanding in large datasets. We want to see how SVM would perform in a medium dimension and large dataset.
In producing a model with an algorithm, there are several parameters, called hyperparameters, that can be adjusted to produce a different model; adjusting them to produce the best model for our purpose is called hyperparameter tuning. Since manually testing each combination of hyperparameters is time consuming, we use a grid search to test each combination and return the best result.
Hyperparameter tuning will be scored with an emphasis on recall (specifically an F-beta score with beta = 2, which weights recall more heavily than precision). Recall is the fraction of actual positives that the model marks as positive. In this situation, we are trying to decrease false negatives: if a diabetic individual is not detected as diabetic (a false negative), they could be harmed by not being given care. Thus we care most about the recall for class 1.
We will not prioritize the recall of class 1 exclusively, since a model that answers 1 for every input would technically achieve 100% recall for class 1. We will also look at overall accuracy.
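The scorer we will pass to the grid searches below is that F-beta score with beta = 2, which sits between plain recall and the F1 score but leans toward recall. A tiny worked example with made-up labels (not taken from the dataset) shows the effect:

# Toy example: with beta=2, the F-beta score is pulled toward recall
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 1]  # 2 false negatives, 1 false positive
print("Precision:", precision_score(y_true_toy, y_pred_toy))
print("Recall:   ", recall_score(y_true_toy, y_pred_toy))
print("F1:       ", f1_score(y_true_toy, y_pred_toy))
print("F2:       ", fbeta_score(y_true_toy, y_pred_toy, beta=2))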
We will run each algorithm on the three differently preprocessed datasets: the replaced-zeroes dataset, the original dataset, and the removed-zeroes dataset. Each algorithm may have a preprocessing method that suits it best, so we will compare the results of our models on each dataset.
First we will split our data into training and testing data for the replaced zeroes with median dataframe. 25% of the data will be reserved for verifying the quality of our model.
# Initialize X and Y test and train groups
X = replace_zero_data.loc[:, replace_zero_data.columns != 'Outcome']
y = replace_zero_data.loc[:,'Outcome'].values
f2_score = make_scorer(fbeta_score, beta=2, pos_label=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Random Forest Classifier for replaced zero dataset
replace_zero_forest = RandomForestClassifier(random_state=0)
# Fit model to training data
replace_zero_forest.fit(X_train, y_train)
# See how model does on testing data
rzf_pred = replace_zero_forest.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, rzf_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, rzf_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, rzf_pred))
print ('Classification Report:\n', classification_report(y_test, rzf_pred))
ROC AUC: 0.7317073170731707 Accuracy: 0.75 Confusion Matrix: [[98 25] [23 46]] Classification Report: precision recall f1-score support 0 0.81 0.80 0.80 123 1 0.65 0.67 0.66 69 accuracy 0.75 192 macro avg 0.73 0.73 0.73 192 weighted avg 0.75 0.75 0.75 192
Random Forest has the best results of any of the algorithms we try on this dataset.
# Specifying the hyperparameters
rzf_param_grid = {'max_samples':[0.1, 0.2, 0.3, 0.4],
'max_features': [1, 2],
'n_estimators':[5, 10, 50, 100],
'max_depth':[8, 9, 10]
}
# Splitting data into 5 folds, or 5 sections to do 5-fold cross validation
rzf_cv = KFold(n_splits=5)
# Using grid search
rzf_grid = GridSearchCV(RandomForestClassifier(), rzf_param_grid, cv=rzf_cv, scoring = f2_score)
rzf_grid.fit(X_train, y_train)
# This should take half a minute
print('Random Forest best Params:', rzf_grid.best_params_)
print('Random Forest best Score:', rzf_grid.best_score_)
hrzf_pred = rzf_grid.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, hrzf_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, hrzf_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, hrzf_pred))
print ('Classification Report:\n', classification_report(y_test, hrzf_pred))
Random Forest best Params: {'max_depth': 9, 'max_features': 2, 'max_samples': 0.3, 'n_estimators': 100} Random Forest best Score: 0.6155776749158334 ROC AUC: 0.6891127606928242 Accuracy: 0.7239583333333334 Confusion Matrix: [[100 23] [ 30 39]] Classification Report: precision recall f1-score support 0 0.77 0.81 0.79 123 1 0.63 0.57 0.60 69 accuracy 0.72 192 macro avg 0.70 0.69 0.69 192 weighted avg 0.72 0.72 0.72 192
Notably, the best hyperparameters found for our dataset change from run to run, although the best score rarely changes much.
Curiously, while the scorer prioritizes recall for class 1 and succeeds in increasing class 1 recall for the later algorithms, in this case both the class 1 recall and the overall accuracy actually decrease after tuning.
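The run-to-run variation comes from the randomness inside RandomForestClassifier itself; fixing its random_state makes the search reproducible. A variant of the cell above with a fixed seed (our own tweak, not part of the original tuning) would look like this:

# Same grid search, but with a fixed random_state so the "best" hyperparameters
# are reproducible between runs
rzf_grid_seeded = GridSearchCV(RandomForestClassifier(random_state=0),
                               rzf_param_grid, cv=rzf_cv, scoring=f2_score)
rzf_grid_seeded.fit(X_train, y_train)
print('Seeded Random Forest best Params:', rzf_grid_seeded.best_params_)
print('Seeded Random Forest best Score:', rzf_grid_seeded.best_score_)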
rep_zero_kNN = KNeighborsClassifier(n_neighbors = 10)
# Fitting model to training data
rep_zero_kNN.fit(X_train, y_train)
# See how model does on testing data
rzk_pred = rep_zero_kNN.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, rzk_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, rzk_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, rzk_pred))
print ('Classification Report:\n', classification_report(y_test, rzk_pred))
ROC AUC: 0.6633085896076353 Accuracy: 0.703125 Confusion Matrix: [[99 24] [33 36]] Classification Report: precision recall f1-score support 0 0.75 0.80 0.78 123 1 0.60 0.52 0.56 69 accuracy 0.70 192 macro avg 0.68 0.66 0.67 192 weighted avg 0.70 0.70 0.70 192
# Specifying the hyperparameters
kNN_param_grid = {'n_neighbors':[2,5,10,15,20,30,50],
'weights':['uniform', 'distance'],
'algorithm': ['auto', 'ball_tree', 'kd_tree'],
'metric':['minkowski','euclidean','manhattan']}
# Splitting data into 5 folds, or 5 sections to do 5-fold cross validation
kNN_cv = KFold(n_splits=5)
# Using grid search
kNN_grid = GridSearchCV(KNeighborsClassifier(), kNN_param_grid, cv=kNN_cv, scoring = f2_score)
kNN_grid.fit(X_train, y_train)
# This runs the smoothest
print('k-NN best Params:', kNN_grid.best_params_)
print('k-NN best Score:', kNN_grid.best_score_)
kNN_pred = kNN_grid.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, kNN_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, kNN_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, kNN_pred))
print ('Classification Report:\n', classification_report(y_test, kNN_pred))
k-NN best Params: {'algorithm': 'auto', 'metric': 'minkowski', 'n_neighbors': 5, 'weights': 'distance'} k-NN best Score: 0.5820232318924576 ROC AUC: 0.6240721102863201 Accuracy: 0.640625 Confusion Matrix: [[84 39] [30 39]] Classification Report: precision recall f1-score support 0 0.74 0.68 0.71 123 1 0.50 0.57 0.53 69 accuracy 0.64 192 macro avg 0.62 0.62 0.62 192 weighted avg 0.65 0.64 0.64 192
The kNN classifier performs the worst of our algorithms, and the increase in class 1 recall after tuning comes with the biggest drop in accuracy we've seen.
data_SVC = SVC()
# Fitting model to training data
data_SVC.fit(X_train, y_train)
# See how model does on testing data
svc_pred = data_SVC.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, svc_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, svc_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, svc_pred))
print ('Classification Report:\n', classification_report(y_test, svc_pred))
ROC AUC: 0.6772711205372923 Accuracy: 0.7291666666666666 Confusion Matrix: [[106 17] [ 35 34]] Classification Report: precision recall f1-score support 0 0.75 0.86 0.80 123 1 0.67 0.49 0.57 69 accuracy 0.73 192 macro avg 0.71 0.68 0.68 192 weighted avg 0.72 0.73 0.72 192
The recall for class 1 is the weakest for this algorithm, although it otherwise performs reasonably well; its recall for class 0 is actually the highest of all the models so far.
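One thing we have not done here is scale the features, even though SVMs are sensitive to feature scale and the columns span very different ranges (Insulin in the hundreds versus DiabetesPedigreeFunction below 2.5). Below is a sketch of an alternative that puts the StandardScaler we imported earlier in front of the SVC via a pipeline; we did not pursue it further in this tutorial:

from sklearn.pipeline import make_pipeline

# Standardize the features before fitting the SVM; scaling often helps SVC
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_train, y_train)
scaled_pred = scaled_svc.predict(X_test)
print('Accuracy (scaled SVC):', accuracy_score(y_test, scaled_pred))
print(classification_report(y_test, scaled_pred))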
# Specifying the hyperparameters
svm_param_grid = {'C':[0.01, 0.1, 1, 10, 100], \
'kernel':['rbf'], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]}
# do not use linear or poly for kernel, it will softlock your kernel
# Splitting data into 5 folds, or 5 sections to do 5-fold cross validation
svm_cv = KFold(n_splits=5)
# Using grid search
svm_grid = GridSearchCV(SVC(), svm_param_grid, cv=svm_cv, scoring = f2_score)
svm_grid.fit(X_train, y_train)
print('SVM best Params:', svm_grid.best_params_)
print('SVM best Score:', svm_grid.best_score_)
svm_pred = svm_grid.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, svm_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, svm_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, svm_pred))
print ('Classification Report:\n', classification_report(y_test, svm_pred))
SVM best Params: {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'} SVM best Score: 0.5796992942249825 ROC AUC: 0.7108518911276069 Accuracy: 0.7395833333333334 Confusion Matrix: [[100 23] [ 27 42]] Classification Report: precision recall f1-score support 0 0.79 0.81 0.80 123 1 0.65 0.61 0.63 69 accuracy 0.74 192 macro avg 0.72 0.71 0.71 192 weighted avg 0.74 0.74 0.74 192
Note that while SVM has more kernel options we could have included, such as 'linear' and 'poly', SVMs are better suited to smaller datasets and those kernels will not run to completion in a reasonable time here, so they have been omitted from hyperparameter tuning.
Our random forest remains the most accurate of our algorithms on this dataset.
The accuracy and recall rates for all of our algorithms are relatively low, which is consistent with significant overlap between the measured health indicators of non-diabetic and diabetic patients. Notably, the precision, recall, and f1-score for diabetic patients are lower than those for non-diabetic patients in all of our models, indicating a generally lower chance of recognizing a diabetic patient.
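Since only about a third of the patients are diabetic, another option we did not explore further is to weight the minority class more heavily; both RandomForestClassifier and SVC accept class_weight='balanced'. A sketch on the baseline random forest:

# Re-fit the baseline random forest with class weights inversely
# proportional to class frequencies
balanced_forest = RandomForestClassifier(random_state=0, class_weight='balanced')
balanced_forest.fit(X_train, y_train)
balanced_pred = balanced_forest.predict(X_test)
print('Recall (class 1):', recall_score(y_test, balanced_pred))
print('Accuracy:', accuracy_score(y_test, balanced_pred))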
Now we will repeat the same process with the original data, leaving the zero values as they are.
# Initialize X and Y test and train groups
X = data.loc[:, data.columns != 'Outcome']
y = data.loc[:,'Outcome'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Random Forest Classifier for original dataset
original_forest = RandomForestClassifier(random_state=0)
original_forest.fit(X_train, y_train)
rzf_pred = original_forest.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, rzf_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, rzf_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, rzf_pred))
print ('Classification Report:\n', classification_report(y_test, rzf_pred))
ROC AUC: 0.7353535353535354 Accuracy: 0.7532467532467533 Confusion Matrix: [[79 20] [18 37]] Classification Report: precision recall f1-score support 0 0.81 0.80 0.81 99 1 0.65 0.67 0.66 55 accuracy 0.75 154 macro avg 0.73 0.74 0.73 154 weighted avg 0.76 0.75 0.75 154
The overall accuracy and the recall for class 1 have actually held up or improved slightly.
rzf_param_grid = {'max_samples':[0.1, 0.2, 0.3, 0.4],
'max_features': [1, 2],
'n_estimators':[5, 10, 50, 100],
'max_depth':[8, 9, 10]
}
rzf_cv = KFold(n_splits=5)
rzf_grid = GridSearchCV(RandomForestClassifier(), rzf_param_grid, cv=rzf_cv, scoring = f2_score)
rzf_grid.fit(X_train, y_train)
# This should take half a minute
print('Random Forest best Params:', rzf_grid.best_params_)
print('Random Forest best Score:', rzf_grid.best_score_)
rzf_pred = rzf_grid.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, rzf_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, rzf_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, rzf_pred))
print ('Classification Report:\n', classification_report(y_test, rzf_pred))
Random Forest best Params: {'max_depth': 9, 'max_features': 2, 'max_samples': 0.4, 'n_estimators': 5} Random Forest best Score: 0.6235982694211986 ROC AUC: 0.7161616161616162 Accuracy: 0.7337662337662337 Confusion Matrix: [[77 22] [19 36]] Classification Report: precision recall f1-score support 0 0.80 0.78 0.79 99 1 0.62 0.65 0.64 55 accuracy 0.73 154 macro avg 0.71 0.72 0.71 154 weighted avg 0.74 0.73 0.74 154
This is our best performing model so far.
# k-NN Classifier for original dataset
original_kNN = KNeighborsClassifier()
original_kNN.fit(X_train, y_train)
rzk_pred = original_kNN.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, rzk_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, rzk_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, rzk_pred))
print ('Classification Report:\n', classification_report(y_test, rzk_pred))
ROC AUC: 0.6474747474747474 Accuracy: 0.6558441558441559 Confusion Matrix: [[67 32] [21 34]] Classification Report: precision recall f1-score support 0 0.76 0.68 0.72 99 1 0.52 0.62 0.56 55 accuracy 0.66 154 macro avg 0.64 0.65 0.64 154 weighted avg 0.67 0.66 0.66 154
kNN_param_grid = {'n_neighbors':[2,5,10,15,20,30,50],
'weights':['uniform', 'distance'],
'metric':['minkowski','euclidean','manhattan']}
kNN_cv = KFold(n_splits=5)
kNN_grid = GridSearchCV(KNeighborsClassifier(), kNN_param_grid, cv=kNN_cv, scoring = f2_score)
kNN_grid.fit(X_train, y_train)
# This runs the smoothest
print('k-NN best Params:', kNN_grid.best_params_)
print('k-NN best Score:', kNN_grid.best_score_)
kNN_pred = kNN_grid.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, kNN_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, kNN_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, kNN_pred))
print ('Classification Report:\n', classification_report(y_test, kNN_pred))
k-NN best Params: {'metric': 'minkowski', 'n_neighbors': 15, 'weights': 'distance'} k-NN best Score: 0.5816168275146515 ROC AUC: 0.6828282828282828 Accuracy: 0.7012987012987013 Confusion Matrix: [[74 25] [21 34]] Classification Report: precision recall f1-score support 0 0.78 0.75 0.76 99 1 0.58 0.62 0.60 55 accuracy 0.70 154 macro avg 0.68 0.68 0.68 154 weighted avg 0.71 0.70 0.70 154
# SVM for original dataset
data_SVC = SVC()
data_SVC.fit(X_train, y_train)
svc_pred = data_SVC.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, svc_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, svc_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, svc_pred))
print ('Classification Report:\n', classification_report(y_test, svc_pred))
ROC AUC: 0.7212121212121211 Accuracy: 0.7662337662337663 Confusion Matrix: [[87 12] [24 31]] Classification Report: precision recall f1-score support 0 0.78 0.88 0.83 99 1 0.72 0.56 0.63 55 accuracy 0.77 154 macro avg 0.75 0.72 0.73 154 weighted avg 0.76 0.77 0.76 154
svm_param_grid = {'C':[0.01, 0.1, 1, 10, 100], \
'kernel':['rbf'], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]}
# do not use linear or poly for kernel, it will softlock your kernel
svm_cv = KFold(n_splits=5)
svm_grid = GridSearchCV(SVC(), svm_param_grid, cv=svm_cv, scoring = f2_score)
svm_grid.fit(X_train, y_train)
print('SVM best Params:', svm_grid.best_params_)
print('SVM best Score:', svm_grid.best_score_)
svm_pred = svm_grid.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, svm_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, svm_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, svm_pred))
print ('Classification Report:\n', classification_report(y_test, svm_pred))
SVM best Params: {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'} SVM best Score: 0.5481925824111564 ROC AUC: 0.6818181818181819 Accuracy: 0.6948051948051948 Confusion Matrix: [[72 27] [20 35]] Classification Report: precision recall f1-score support 0 0.78 0.73 0.75 99 1 0.56 0.64 0.60 55 accuracy 0.69 154 macro avg 0.67 0.68 0.68 154 weighted avg 0.70 0.69 0.70 154
Overall, the original dataset performed worse for most algorithms. The random forest, however, produced the highest performing model so far, with roughly 65-67% recall for class 1 and 73-75% accuracy. While we expected random forest to handle missing data better than most algorithms, we did not expect an improvement in results.
Now we will repeat the same process with the dataframe that has the zeroes removed. Because this dataset is much smaller than before, we will reserve a smaller percentage of it as the test group. The results should also be taken with more caution, since with such a small test set the reported percentages can swing more from run to run.
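Because the classes are imbalanced and the dataframe is small, a stratified split (which preserves the outcome proportions in both the training and test sets) may be a safer alternative to the plain random split used below; this is a sketch of that variant, not what we use for the results that follow:

# Stratified variant of the split used below: preserves the 0/1 ratio
# in both the training and test sets
X_strat = remove_zero_data.loc[:, remove_zero_data.columns != 'Outcome']
y_strat = remove_zero_data.loc[:, 'Outcome'].values
X_tr, X_te, y_tr, y_te = train_test_split(X_strat, y_strat, test_size=0.15,
                                          random_state=42, stratify=y_strat)
print("Test set class balance:",
      pd.Series(y_te).value_counts(normalize=True).round(2).to_dict())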
# Initialize X and Y test and train groups
X = remove_zero_data.loc[:, remove_zero_data.columns != 'Outcome']
y = remove_zero_data.loc[:,'Outcome'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
# Random Forest Classifier for removed zeroes dataset
remove_zero_data_forest = RandomForestClassifier(random_state=0)
remove_zero_data_forest.fit(X_train, y_train)
rzf_pred = remove_zero_data_forest.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, rzf_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, rzf_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, rzf_pred))
print ('Classification Report:\n', classification_report(y_test, rzf_pred))
ROC AUC: 0.7330827067669173 Accuracy: 0.7796610169491526 Confusion Matrix: [[34 4] [ 9 12]] Classification Report: precision recall f1-score support 0 0.79 0.89 0.84 38 1 0.75 0.57 0.65 21 accuracy 0.78 59 macro avg 0.77 0.73 0.74 59 weighted avg 0.78 0.78 0.77 59
The accuracy of this model is the highest across all datasets and models so far, although it is important to keep the much smaller size of the dataset in mind.
rzf_param_grid = {'max_samples':[0.1, 0.2, 0.3, 0.4],
'max_features': [1, 2],
'n_estimators':[5, 10, 50, 100],
'max_depth':[8, 9, 10]
}
rzf_cv = KFold(n_splits=5)
rzf_grid = GridSearchCV(RandomForestClassifier(), rzf_param_grid, cv=rzf_cv, scoring = f2_score)
rzf_grid.fit(X_train, y_train)
# This should take half a minute
print('Random Forest best Params:', rzf_grid.best_params_)
print('Random Forest best Score:', rzf_grid.best_score_)
rzf_pred = rzf_grid.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, rzf_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, rzf_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, rzf_pred))
print ('Classification Report:\n', classification_report(y_test, rzf_pred))
Random Forest best Params: {'max_depth': 9, 'max_features': 2, 'max_samples': 0.4, 'n_estimators': 5} Random Forest best Score: 0.6024608091655231 ROC AUC: 0.7199248120300752 Accuracy: 0.7627118644067796 Confusion Matrix: [[33 5] [ 9 12]] Classification Report: precision recall f1-score support 0 0.79 0.87 0.82 38 1 0.71 0.57 0.63 21 accuracy 0.76 59 macro avg 0.75 0.72 0.73 59 weighted avg 0.76 0.76 0.76 59
Just as with the replaced-zeroes dataset, hyperparameter tuning has actually decreased the quality of our model. Random forest also does not appear to benefit from the preprocessed datasets.
# k-NN Classifier for removed zero dataset
remove_zero_kNN = KNeighborsClassifier()
remove_zero_kNN.fit(X_train, y_train)
rzk_pred = remove_zero_kNN.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, rzk_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, rzk_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, rzk_pred))
print ('Classification Report:\n', classification_report(y_test, rzk_pred))
ROC AUC: 0.6672932330827068 Accuracy: 0.6949152542372882 Confusion Matrix: [[29 9] [ 9 12]] Classification Report: precision recall f1-score support 0 0.76 0.76 0.76 38 1 0.57 0.57 0.57 21 accuracy 0.69 59 macro avg 0.67 0.67 0.67 59 weighted avg 0.69 0.69 0.69 59
kNN_param_grid = {'n_neighbors':[2,5,10,15,20,30,50],
'weights':['uniform', 'distance'],
'metric':['minkowski','euclidean','manhattan']}
kNN_cv = KFold(n_splits=5)
kNN_grid = GridSearchCV(KNeighborsClassifier(), kNN_param_grid, cv=kNN_cv, scoring=f2_score)
kNN_grid.fit(X_train, y_train)
# This runs the smoothest
print('k-NN best Params:', kNN_grid.best_params_)
print('k-NN best Score:', kNN_grid.best_score_)
kNN_pred = kNN_grid.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, kNN_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, kNN_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, kNN_pred))
print ('Classification Report:\n', classification_report(y_test, kNN_pred))
k-NN best Params: {'metric': 'minkowski', 'n_neighbors': 5, 'weights': 'uniform'} k-NN best Score: 0.5848404497630284 ROC AUC: 0.6672932330827068 Accuracy: 0.6949152542372882 Confusion Matrix: [[29 9] [ 9 12]] Classification Report: precision recall f1-score support 0 0.76 0.76 0.76 38 1 0.57 0.57 0.57 21 accuracy 0.69 59 macro avg 0.67 0.67 0.67 59 weighted avg 0.69 0.69 0.69 59
There was no change between the hyperparameter tuned k-NN and the default k-NN, possibly indicating that any further increase in class 1 recall would disproportionately hurt accuracy.
# SVM for removed zeroes dataset
data_SVC = SVC()
data_SVC.fit(X_train, y_train)
svc_pred = data_SVC.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, svc_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, svc_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, svc_pred))
print ('Classification Report:\n', classification_report(y_test, svc_pred))
ROC AUC: 0.6879699248120301 Accuracy: 0.7627118644067796 Confusion Matrix: [[36 2] [12 9]] Classification Report: precision recall f1-score support 0 0.75 0.95 0.84 38 1 0.82 0.43 0.56 21 accuracy 0.76 59 macro avg 0.78 0.69 0.70 59 weighted avg 0.77 0.76 0.74 59
The recall for class 0 is the highest out of all datasets and models, but the recall for class 1 is the lowest out of all datasets and models.
svm_param_grid = {'C':[0.01, 0.1, 1, 10, 100], \
'kernel':['rbf'], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]}
# do not use linear or poly for kernel, it will softlock your kernel
svm_cv = KFold(n_splits=5)
svm_grid = GridSearchCV(SVC(), svm_param_grid, cv=svm_cv, scoring=f2_score)
svm_grid.fit(X_train, y_train)
print('SVM best Params:', svm_grid.best_params_)
print('SVM best Score:', svm_grid.best_score_)
svm_pred = svm_grid.predict(X_test)
print ('ROC AUC:', roc_auc_score(y_test, svm_pred, average=None))
print ('Accuracy:', accuracy_score(y_test, svm_pred))
print ('Confusion Matrix:\n', confusion_matrix(y_test, svm_pred))
print ('Classification Report:\n', classification_report(y_test, svm_pred))
SVM best Params: {'C': 10, 'gamma': 1e-05, 'kernel': 'rbf'} SVM best Score: 0.5797285359397277 ROC AUC: 0.6854636591478697 Accuracy: 0.7457627118644068 Confusion Matrix: [[34 4] [11 10]] Classification Report: precision recall f1-score support 0 0.76 0.89 0.82 38 1 0.71 0.48 0.57 21 accuracy 0.75 59 macro avg 0.73 0.69 0.70 59 weighted avg 0.74 0.75 0.73 59
Removing the zero rows can also shift the balance of the class labels, which can lead to overfitting toward the majority class; that is, predictions become biased toward the class with more samples, which reduces the generalization ability of the model. While a few performance metrics are higher than for models trained on the other preprocessed datasets, the overall performance was worse, and the recall for class 1 is lower than for any other model.
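With only 392 rows, a single 15% test split (59 patients) also gives fairly noisy estimates; cross-validating over the whole removed-zeroes dataframe is one way to get a steadier read. A sketch using scikit-learn's cross_val_score, which we did not use elsewhere in this tutorial:

from sklearn.model_selection import cross_val_score, StratifiedKFold

# 5-fold stratified cross-validation of the baseline random forest on the
# removed-zeroes data (X and y as defined above), scored by class 1 recall
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring='recall')
print("Recall per fold:", scores.round(2))
print("Mean recall:", scores.mean().round(3))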
Our best performing model and dataset was the hyperparameter tuned random forest in the original dataset. While preprocessing is an important element in machine learning, removing or adjusting missing information could be detrimental to the quality of prediction models in certain situations and datasets.
In this tutorial, we developed Support Vector Machine, Random Forest, and kNN classification models for the diabetes dataset, and tuned the hyperparameters of each model to optimize its classification performance.
There isn’t a known cure for diabetes at this time, which highlights the importance of researching potential risk factors and diagnostics that relate to the disease. Being able to predict the diabetes outcome of patients in a dataset helps researchers better understand the disease by showing which features affect the outcome and to what extent. A model like this could also be adapted to flag warning signs in someone who is on the path toward developing diabetes by looking at their health measurements. We utilized three dataframes to represent the common ways data scientists modify (or don't modify) datasets that have many zero entries, and we wanted to explore how each could lead to similar or different results in hypothesis testing and model building. We felt this was important for learning the influence that modifications of the data can have on findings.
We created this tutorial to emulate the data science pipeline, which lets us take data, test and analyze it, and communicate the results in a way others can understand. We were able not only to take a dataset and build visualizations from it but also to make inferences and educated guesses about relationships between variables, supported by testing. Since this process is not linear, we revisited claims and earlier parts of the process to build on new findings, such as the correlation between the features and diabetes outcomes. Being able to use the data science pipeline is crucial to producing and observing replicable results that advance research. Although we do not have a completely accurate model for predicting diabetes outcome, we have gained many new insights and skills that will serve as a foundation for future projects.