I was asked to fit several ML models to the well-known Iris Species dataset. I’ll walk through the process and explain each step, starting from preprocessing the data and ending with k-fold cross validation.
Data Preprocessing
The iris dataset contains information about iris flowers and the species of each flower, in the following format:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class, one of the following values:
  - Iris Setosa
  - Iris Versicolour
  - Iris Virginica
We’ll start by importing the data and splitting it into `X`, for the features, and `y`, for the labels.
```python
import pandas as pd

data = pd.read_csv(
    'iris.data',
    header=None,
    names=['sepal length', 'sepal width', 'petal length', 'petal width', 'class'],
)

# Split data
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
```
First, we check if our data is balanced or not, as imbalanced data can lead to biased models that perform well on the majority class but poorly on the minority class.
```python
data['class'].value_counts()
```
```
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
```
Next, we check for feature correlation. This step is especially important in our case: since we are dealing with width and length data, it is expected that some of those measurements will scale linearly with each other, and thus have high correlation.
From the below correlation matrix, we can see that petal dimensions are highly correlated.
```python
X.corr()
```
|              | sepal length | sepal width | petal length | petal width |
|--------------|--------------|-------------|--------------|-------------|
| sepal length | 1.000000     | -0.109369   | 0.871754     | 0.817954    |
| sepal width  | -0.109369    | 1.000000    | -0.420516    | -0.356544   |
| petal length | 0.871754     | -0.420516   | 1.000000     | 0.962757    |
| petal width  | 0.817954     | -0.356544   | 0.962757     | 1.000000    |
The reason we’re interested in correlated features is that they are essentially repeated information: if I give you the petal width, you can reliably guess the petal length, so telling you the length explicitly adds almost nothing. This suggests that we can drop one of the petal dimensions without hurting model performance.
In fact, keeping both features may lead to over-fitting in some models, especially those that give equal weight to all features, like k-nearest neighbours. The reason is that by including correlated features we are telling the model to focus on the same information twice: samples with close petal widths will also have close petal lengths, and will therefore appear closer to each other than they should.
Another reason to drop correlated features is that they increase the model’s complexity without much benefit to the model’s predictive ability.
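We can check the redundancy claim directly. The following is a minimal sketch (using the `X` defined above, before any column is dropped) that predicts petal length from petal width alone; an R² close to 1 confirms the two features carry largely the same information.

```python
from sklearn.linear_model import LinearRegression

# Fit petal length as a linear function of petal width;
# R^2 close to 1 means one feature almost fully determines the other.
reg = LinearRegression().fit(X[['petal width']], X['petal length'])
print('R^2: %.3f' % reg.score(X[['petal width']], X['petal length']))  # ~0.93
```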
Note: sepal length and petal length are also highly correlated (0.87), but dropping sepal length as well gave me worse model performance, so I only drop petal length.
```python
X = X.drop(['petal length'], axis=1)
```
Next, we’ll normalize our data and split it into testing and training sets.
Normalizing the data helps our models since all features end up on the same scale. Similar to the reasoning behind removing correlated features, if features are on different scales, a model may give some of them more weight just because they have bigger values, which can bias the model.
```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Normalize data
X_ = MinMaxScaler().fit_transform(X)

# Split into training and testing sets
X_train, X_test, y_train, y_test = \
    train_test_split(X_, y, test_size=0.33)
# train_test_split(X_, y, test_size=0.33, random_state=42)  # use random_state for reproducible results
```
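As a quick sanity check (a minimal sketch using the `X_` array from above), we can confirm that every scaled feature now lies in [0, 1]:

```python
# After MinMaxScaler, each column's minimum is 0 and maximum is 1
print(X_.min(axis=0))  # [0. 0. 0.]
print(X_.max(axis=0))  # [1. 1. 1.]
```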
Fitting learners
Now that we’re done with data preprocessing, here comes the easier part of fitting the models and testing them. I say easier because most of the heavy lifting is done by `sklearn`. Below we initialize various models and store them in a dictionary.
```python
from sklearn.naive_bayes import ComplementNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Create models
models = {
    'Complement Naive Bayes': ComplementNB(),
    'Gaussian Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    '5-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    '10-Nearest Neighbors': KNeighborsClassifier(n_neighbors=10),
    'Neural Network': MLPClassifier(
        hidden_layer_sizes=(12, 12, 12),
        activation='relu', solver='adam', random_state=42,
        batch_size=30, max_iter=500, alpha=0.01,
    ),
}
```
Now we can simply loop over the models, since they share almost the same interface, and print the performance metrics for each one: accuracy, precision, recall, and F1.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

for model_name, model in models.items():
    print(f'>>> Fitting {model_name} model')

    # Train model
    model.fit(X_train, y_train)

    # Test the model
    y_pred = model.predict(X_test)

    # Measure model metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=1)
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    print('accuracy %.3f' % accuracy)
    print('precision %.3f' % precision)
    print('recall %.3f' % recall)
    print('f1 %.3f' % f1)
    print()
```
```
>>> Fitting Complement Naive Bayes model
accuracy 0.800
precision 0.863
recall 0.800
f1 0.719

>>> Fitting Gaussian Naive Bayes model
accuracy 0.967
precision 0.971
recall 0.967
f1 0.967

>>> Fitting Logistic Regression model
accuracy 0.967
precision 0.971
recall 0.967
f1 0.967

>>> Fitting 5-Nearest Neighbors model
accuracy 0.933
precision 0.950
recall 0.933
f1 0.935

>>> Fitting 10-Nearest Neighbors model
accuracy 0.933
precision 0.950
recall 0.933
f1 0.935

>>> Fitting Neural Network model
accuracy 0.967
precision 0.971
recall 0.967
f1 0.967
```
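As an aside, if you’d rather see per-class numbers than the weighted averages above, `sklearn` can print precision, recall, and F1 for every class at once. A minimal sketch, using the predictions of the last fitted model:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1, plus overall averages
print(classification_report(y_test, y_pred))
```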
We also apply k-fold cross validation and print the results for each model.
```python
from sklearn.model_selection import KFold

# Create a KFold object
kf = KFold(n_splits=5, shuffle=True)

print('\n\nKFold scores:')
for model_name, model in models.items():
    print(f'>>> Applying KFold on {model_name} model')

    # Create lists for train and test scores
    train_scores = []
    test_scores = []

    # Loop over folds
    for train_index, test_index in kf.split(X_):
        # Split the data
        X_train, X_test = X_[train_index], X_[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Train the model
        model.fit(X_train, y_train)

        # Evaluate the model
        train_scores.append(model.score(X_train, y_train))
        test_scores.append(model.score(X_test, y_test))

    train_scores = [round(x, 4) for x in train_scores]
    test_scores = [round(x, 4) for x in test_scores]

    # Print summary of the model
    print('Train scores:', train_scores)
    print('Test scores:', test_scores)
    print('Average train score:', round(sum(train_scores) / len(train_scores), 4))
    print('Average test score:', round(sum(test_scores) / len(test_scores), 4))
    print()
```
```
KFold scores:
>>> Applying KFold on Complement Naive Bayes model
Train scores: [0.6833, 0.65, 0.6417, 0.6833, 0.675]
Test scores: [0.6, 0.7333, 0.7667, 0.6, 0.6333]
Average train score: 0.6667
Average test score: 0.6667

>>> Applying KFold on Gaussian Naive Bayes model
Train scores: [0.95, 0.9417, 0.9583, 0.95, 0.9417]
Test scores: [0.9333, 1.0, 0.9, 0.9, 0.9667]
Average train score: 0.9483
Average test score: 0.94

>>> Applying KFold on Logistic Regression model
Train scores: [0.9, 0.8333, 0.8667, 0.7917, 0.8417]
Test scores: [0.9333, 0.8667, 0.8333, 0.8, 0.7667]
Average train score: 0.8467
Average test score: 0.84

>>> Applying KFold on 5-Nearest Neighbors model
Train scores: [0.9667, 0.9667, 0.95, 0.95, 0.9583]
Test scores: [0.9, 0.9333, 1.0, 0.9667, 0.9333]
Average train score: 0.9583
Average test score: 0.9467

>>> Applying KFold on 10-Nearest Neighbors model
Train scores: [0.9833, 0.9417, 0.9583, 0.9667, 0.9583]
Test scores: [0.9, 1.0, 1.0, 0.9667, 0.9667]
Average train score: 0.9617
Average test score: 0.9667

>>> Applying KFold on Neural Network model
Train scores: [0.95, 0.9667, 0.975, 0.9667, 0.9583]
Test scores: [1.0, 0.9333, 0.9333, 0.9, 0.9667]
Average train score: 0.9633
Average test score: 0.9467
```
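As a closing note, the manual fold loop above can be written more concisely with `sklearn`’s `cross_val_score` helper. A minimal sketch (it reports only the test-fold scores, using the `X_` and `y` from earlier):

```python
from sklearn.model_selection import cross_val_score

# cross_val_score handles the splitting, fitting, and scoring internally
for model_name, model in models.items():
    scores = cross_val_score(model, X_, y, cv=5)
    print(model_name, scores.round(4), 'mean:', round(scores.mean(), 4))
```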