I was asked to fit several ML models to the well-known Iris Species dataset. I’ll walk through the process and explain each step, from preprocessing the data to running k-fold cross-validation.

Data Preprocessing

The Iris dataset contains measurements of iris flowers along with the species of each flower, in the following format:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class, one of the following values:
    • Iris Setosa
    • Iris Versicolour
    • Iris Virginica

We’ll start by importing the data and splitting it into X, for the features, and y, for the labels.

import pandas as pd

data = pd.read_csv(
  'iris.data',
  header=None,
  names=['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
  )


# Split data
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

First, we check if our data is balanced or not, as imbalanced data can lead to biased models that perform well on the majority class but poorly on the minority class.

data['class'].value_counts()
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

Next, we check for feature correlation. This step is especially important in our case: since we are dealing with width and length measurements, it is expected that some of them will scale together, and thus be highly correlated.

From the correlation matrix below, we can see that the petal dimensions are highly correlated.

X.corr()
              sepal length  sepal width  petal length  petal width
sepal length      1.000000    -0.109369      0.871754     0.817954
sepal width      -0.109369     1.000000     -0.420516    -0.356544
petal length      0.871754    -0.420516      1.000000     0.962757
petal width       0.817954    -0.356544      0.962757     1.000000

The reason we’re interested in correlated features is that they essentially carry repeated information: if I give you the petal width, you can reliably guess the petal length, so stating the length explicitly would be unnecessary. This suggests that we can drop one of the petal dimensions without affecting our model’s performance.

In fact, keeping both features may lead to over-fitting in some models, especially those that give equal weight to all features, like k-nearest neighbors. By including correlated features we are telling the model to focus on the same information twice: samples with close petal widths will also have close petal lengths, and will therefore appear closer to each other than they should be.
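
To make that concrete, here is a small sketch (with made-up, already-scaled values) showing how a duplicated measurement inflates the distance between two samples:

import numpy as np

# Two samples that differ only in their petal measurement
a = np.array([0.2, 0.5])  # [petal width, sepal width]
b = np.array([0.4, 0.5])
print(np.linalg.norm(a - b))    # 0.2

# Duplicate the petal information, as keeping both petal width and length would
a2 = np.array([0.2, 0.2, 0.5])
b2 = np.array([0.4, 0.4, 0.5])
print(np.linalg.norm(a2 - b2))  # ~0.283: the same petal difference now counts twice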

Another reason to drop correlated features is that they increase the model’s complexity without much benefit to its predictive ability.

Note: even though sepal length and petal length are also highly correlated (0.87), dropping sepal length gave me worse model performance.
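
One way to sanity-check which feature to drop is to compare cross-validated scores with each candidate removed; a quick sketch (the choice of k-nearest neighbors here is arbitrary):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Score a model with each candidate feature removed, one at a time
for col in ['sepal length', 'petal length', 'petal width']:
  scores = cross_val_score(KNeighborsClassifier(), X.drop(columns=[col]), y, cv=5)
  print(f'without {col}: {scores.mean():.4f}')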

X = X.drop(['petal length'], axis=1)

Next, we’ll normalize our data and split it into testing and training sets.

Normalizing the data helps our models since all features end up on the same scale. The reasoning is similar to that behind removing correlated features: if features are on different scales, a model may give some of them more weight simply because they have bigger values, which can bias the model.
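
As a quick illustration, MinMaxScaler rescales each feature to the [0, 1] range using x' = (x - min) / (max - min), computed per column:

import numpy as np

col = np.array([4.3, 5.8, 7.9])  # e.g. raw sepal lengths in cm
print((col - col.min()) / (col.max() - col.min()))  # [0.         0.41666667 1.        ]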

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Normalize data
X_ = MinMaxScaler().fit_transform(X)

# Split into training and testing sets
X_train, X_test, y_train, y_test = \
  train_test_split(X_, y, test_size=0.33)
  # train_test_split(X_, y, test_size=0.33, random_state=42)  # pass a random state for reproducible results
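
One caveat worth flagging: fitting the scaler on the full dataset before splitting lets the test set’s min and max leak into training. A leakage-free variant (a sketch using the same imports) fits the scaler on the training split only:

# Split first, then fit the scaler on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
scaler = MinMaxScaler().fit(X_train)  # learns min/max from the training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)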

Fitting Learners

Now that we’re done with data preprocessing, here comes the easier part: fitting the models and testing them. I say easier because most of the heavy lifting is done by sklearn. Below we initialize various models and store them in a dictionary.

from sklearn.naive_bayes import ComplementNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Create models
models = {
  'Complement Naive Bayes': ComplementNB(),
  'Gaussian Naive Bayes':   GaussianNB(),
  'Logistic Regression':    LogisticRegression(solver='liblinear'),
  '5-Nearest Neighbors':    KNeighborsClassifier(n_neighbors=5),
  '10-Nearest Neighbors':   KNeighborsClassifier(n_neighbors=10),
  'Neural Network':         MLPClassifier(
    hidden_layer_sizes=(12, 12, 12),
    activation='relu', solver='adam', random_state=42,
    batch_size=30, max_iter=500, alpha=0.01,
    ),
}

Now we can simply loop over the models, since they all share the same interface, and print the performance metrics for each one: accuracy, precision, recall, and F1.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

for model_name, model in models.items():
  print(f'>>> Fitting {model_name} model')

  # Train model
  model.fit(X_train, y_train)
  # Test the model
  y_pred = model.predict(X_test)

  # Measure model metrics
  accuracy = accuracy_score(y_test, y_pred)
  precision = precision_score(y_test, y_pred, average='weighted', zero_division=1)
  recall = recall_score(y_test, y_pred, average='weighted')
  f1 = f1_score(y_test, y_pred, average='weighted')

  print('accuracy    %.3f' % accuracy)
  print('precision   %.3f' % precision)
  print('recall      %.3f' % recall)
  print('f1          %.3f' % f1)
  print()
>>> Fitting Complement Naive Bayes model
accuracy    0.800
precision   0.863
recall      0.800
f1          0.719

>>> Fitting Gaussian Naive Bayes model
accuracy    0.967
precision   0.971
recall      0.967
f1          0.967

>>> Fitting Logistic Regression model
accuracy    0.967
precision   0.971
recall      0.967
f1          0.967

>>> Fitting 5-Nearest Neighbors model
accuracy    0.933
precision   0.950
recall      0.933
f1          0.935

>>> Fitting 10-Nearest Neighbors model
accuracy    0.933
precision   0.950
recall      0.933
f1          0.935

>>> Fitting Neural Network model
accuracy    0.967
precision   0.971
recall      0.967
f1          0.967

We also apply k-fold cross-validation and print the results for each model.

from sklearn.model_selection import KFold

# Create a KFold object
kf = KFold(n_splits=5, shuffle=True)

print('\n\nKFold scores:')
for model_name, model in models.items():
  print(f'>>> Applying KFold on {model_name} model')

  # Create lists for train and test scores
  train_scores = []
  test_scores = []

  # Loop over folds
  for train_index, test_index in kf.split(X_):
      # Split the data
      X_train, X_test = X_[train_index], X_[test_index]
      y_train, y_test = y.iloc[train_index], y.iloc[test_index]

      # Train the model
      model.fit(X_train, y_train)

      # Evaluate the model
      train_scores.append(model.score(X_train, y_train))
      test_scores.append(model.score(X_test, y_test))

  train_scores = [round(x, 4) for x in train_scores]
  test_scores = [round(x, 4) for x in test_scores]
  # Print summary of the model
  print('Train scores:', train_scores)
  print('Test scores:', test_scores)
  print('Average train score:', round(sum(train_scores) / len(train_scores), 4))
  print('Average test score:', round(sum(test_scores) / len(test_scores), 4))
  print()
KFold scores:
>>> Applying KFold on Complement Naive Bayes model
Train scores: [0.6833, 0.65, 0.6417, 0.6833, 0.675]
Test scores: [0.6, 0.7333, 0.7667, 0.6, 0.6333]
Average train score: 0.6667
Average test score: 0.6667

>>> Applying KFold on Gaussian Naive Bayes model
Train scores: [0.95, 0.9417, 0.9583, 0.95, 0.9417]
Test scores: [0.9333, 1.0, 0.9, 0.9, 0.9667]
Average train score: 0.9483
Average test score: 0.94

>>> Applying KFold on Logistic Regression model
Train scores: [0.9, 0.8333, 0.8667, 0.7917, 0.8417]
Test scores: [0.9333, 0.8667, 0.8333, 0.8, 0.7667]
Average train score: 0.8467
Average test score: 0.84

>>> Applying KFold on 5-Nearest Neighbors model
Train scores: [0.9667, 0.9667, 0.95, 0.95, 0.9583]
Test scores: [0.9, 0.9333, 1.0, 0.9667, 0.9333]
Average train score: 0.9583
Average test score: 0.9467

>>> Applying KFold on 10-Nearest Neighbors model
Train scores: [0.9833, 0.9417, 0.9583, 0.9667, 0.9583]
Test scores: [0.9, 1.0, 1.0, 0.9667, 0.9667]
Average train score: 0.9617
Average test score: 0.9667

>>> Applying KFold on Neural Network model
Train scores: [0.95, 0.9667, 0.975, 0.9667, 0.9583]
Test scores: [1.0, 0.9333, 0.9333, 0.9, 0.9667]
Average train score: 0.9633
Average test score: 0.9467
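
As an aside, sklearn can handle this bookkeeping for us. A minimal equivalent sketch using cross_validate, reusing the models dict and the kf object from above:

from sklearn.model_selection import cross_validate

for model_name, model in models.items():
  cv_results = cross_validate(model, X_, y, cv=kf, return_train_score=True)
  print(model_name,
        '| avg train:', round(cv_results['train_score'].mean(), 4),
        '| avg test:', round(cv_results['test_score'].mean(), 4))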