K-fold cross-validation has a single parameter, k, that refers to the number of groups (folds) a given dataset is split into. First, split the dataset into k groups, then use one group as the test set and the remaining groups as the training set. In this tutorial, we create a simple Keras classification model and train and evaluate it using k-fold cross-validation.

Download Dataset

This guide uses the Iris dataset to categorize flowers by species. It is a popular dataset for beginners working on machine learning classification problems. Download the training dataset file using the tf.keras.utils.get_file function.

import os

import pandas as pd
import tensorflow as tf

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv"

# Download the CSV file and cache it locally.
data_csv = tf.keras.utils.get_file(fname=os.path.basename(dataset_url),
                                   origin=dataset_url)

# The first row of the file is metadata, not column names, so skip it.
df = pd.read_csv(data_csv, skiprows=1, header=None)

X = df.iloc[:, 0:4].values   # four feature columns
Y = df.iloc[:, 4:5].values   # integer species label

There are 120 total examples. Each example has four features and one of three possible label names.
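If you want to sanity-check the loaded arrays, a quick inspection like the following works (this snippet is an illustrative addition and assumes numpy is imported as np):

import numpy as np

print(X.shape)        # (120, 4) -- four features per example
print(Y.shape)        # (120, 1) -- one label per example
print(np.unique(Y))   # [0 1 2]  -- the three species ids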

Create a Model

The TensorFlow Keras API makes it easy to build models and run experiments while Keras handles the complexity of connecting everything together. The tf.keras.Sequential model is a linear stack of layers. In this case, we use two Dense layers with 10 nodes each and an output layer with 3 nodes representing our label predictions. The first layer’s input_shape parameter corresponds to the number of features in the dataset and is required.

def create_model():
  model = tf.keras.models.Sequential()
  model.add(tf.keras.layers.Dense(10, input_shape=(4,), activation='relu'))
  model.add(tf.keras.layers.Dense(10, activation='relu'))
  model.add(tf.keras.layers.Dense(3, activation='softmax'))

  model.compile(loss='sparse_categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

  return model

The model will calculate its loss using the sparse_categorical_crossentropy function.
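As a rough standalone illustration (not part of the tutorial code), sparse_categorical_crossentropy compares integer class ids directly against predicted probabilities, so the labels in Y can be used without one-hot encoding; one-hot labels would instead require categorical_crossentropy:

import numpy as np
import tensorflow as tf

y_true = np.array([0, 1, 2])                 # integer class ids, no one-hot encoding
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.2, 0.2, 0.6]])         # predicted class probabilities

# Per-example loss: -log(probability assigned to the true class)
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
print(loss.numpy())                          # roughly [0.22 0.36 0.51]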

Fold Dataset and Train Model

We use the scikit-learn library to implement the k-fold splitting of the data. KFold takes as arguments the number of splits (folds) and whether or not to shuffle the samples.

from sklearn.model_selection import KFold

n_split = 3

for train_index, test_index in KFold(n_split).split(X):
  x_train, x_test = X[train_index], X[test_index]
  y_train, y_test = Y[train_index], Y[test_index]

  # Build a fresh model for each fold so no weights leak between folds.
  model = create_model()
  model.fit(x_train, y_train, epochs=20)

  print('Model evaluation ', model.evaluate(x_test, y_test))

The split() method yields the train and test groups for each fold. The returned arrays contain the indices of the original observations to use as the train and test sets on each iteration.
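A tiny standalone example (using a toy array rather than the Iris data) makes this concrete: split() yields index arrays, not the data itself.

import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(6)
for train_index, test_index in KFold(n_splits=3).split(toy):
  print(train_index, test_index)
# [2 3 4 5] [0 1]
# [0 1 4 5] [2 3]
# [0 1 2 3] [4 5]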

K-Fold Validation

With k = 3, you define folds 1, 2, and 3 and conduct three experiments. In each experiment, fold i is held out for evaluation and the other folds are merged together for training. In experiment 1, fold 1 is held out for testing, the model is trained on folds 2 and 3, and you get a score.

In the second experiment, fold 2 is held out and the model is trained on folds 1 and 3, giving another score. The same goes for the third experiment, where fold 3 is the test set and folds 1 and 2 are used for training.

After running all three experiments you have covered every fold and collected three scores. You can average them, and with more folds you could even report something like a confidence interval around that mean.
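As a minimal sketch of this (reusing X, Y, and create_model from above; the shuffle and random_state arguments are additions, not part of the original loop), the training loop can collect one accuracy per fold and report the mean and spread:

import numpy as np
from sklearn.model_selection import KFold

scores = []
for train_index, test_index in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
  x_train, x_test = X[train_index], X[test_index]
  y_train, y_test = Y[train_index], Y[test_index]

  model = create_model()
  model.fit(x_train, y_train, epochs=20, verbose=0)

  loss, acc = model.evaluate(x_test, y_test, verbose=0)
  scores.append(acc)                       # one accuracy score per fold

print('Mean accuracy: %.3f (+/- %.3f)' % (np.mean(scores), np.std(scores)))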

Advantages and Disadvantages 

Every example appears in a training set exactly k − 1 times and in a test set exactly once. That is a nice property: you have some guarantees about how you have gone through the data.

The disadvantage is that the value of k determines the size of the train/test splits. With three-fold cross-validation, you train on 67% of the data and test on 33%, but with 10-fold you train on 90% and test on 10%.
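A quick back-of-the-envelope check of those fractions (purely illustrative):

# Train/test fractions implied by the number of folds k.
for k in (3, 5, 10):
  print('%2d folds -> train %.0f%% / test %.0f%%' % (k, 100 * (k - 1) / k, 100.0 / k))
#  3 folds -> train 67% / test 33%
#  5 folds -> train 80% / test 20%
# 10 folds -> train 90% / test 10%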

Those are very likely to be different experimental scenarios, and two things tend to get muddled together when you do this.

On the one hand, you want many runs to get a real sense of system performance across different settings. On the other hand, increasing the number of folds also changes the size of the training and test sets. That is simply a consequence of how the method works, so it is not something to blame the method for.
