K-fold cross-validation has a single parameter, k, which is the number of groups (folds) that a given dataset is split into. First split the dataset into k groups; then, for each group in turn, use that group as the test set and the remaining groups as the training set. In this tutorial, we create a simple Keras classification model and train and evaluate it using K-fold cross-validation.
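To make the splitting step concrete, here is a minimal sketch that builds the folds by hand with NumPy. The helper name manual_k_fold and the ten-element toy index range are assumptions made for illustration; they are not part of this tutorial's code, which relies on scikit-learn for the same job.

```python
import numpy as np

def manual_k_fold(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds (k >= 2)."""
    indices = np.arange(n_samples)
    # np.array_split also copes with n_samples not being divisible by k
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        # Every fold except fold i is merged into the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(manual_k_fold(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each of the five iterations holds out a different pair of indexes and trains on the other eight.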
This guide uses the Iris dataset to categorize flowers by species. It is a popular dataset for beginners working on machine learning classification problems. Download the training dataset file using the tf.keras.utils.get_file function.
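import os
import pandas as pd
import tensorflow as tf

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv"
# Download the file (cached after the first call) and get its local path
data_csv = tf.keras.utils.get_file(fname=os.path.basename(dataset_url), origin=dataset_url)
# The first row is a header line, not an example, so skip it
df = pd.read_csv(data_csv, skiprows=1, header=None)
X = df.iloc[:, 0:4].values  # four feature columns
Y = df.iloc[:, 4:5].values  # label column, kept two-dimensional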
There are 120 total examples. Each example has four features and one of three possible label names.
Create a Model
The TensorFlow Keras API makes it easy to build models and experiment, while Keras handles the complexity of connecting everything together. The tf.keras.Sequential model is a linear stack of layers. In this case, it has two Dense layers with 10 nodes each and an output layer with 3 nodes representing our label predictions. The first layer’s input_shape parameter corresponds to the number of features from the dataset and is required.
def create_model():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(10, input_shape=(4,), activation='relu'))
    model.add(tf.keras.layers.Dense(10, activation='relu'))
    model.add(tf.keras.layers.Dense(3, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
The model will calculate its loss using the sparse_categorical_crossentropy loss passed to model.compile, which expects integer class labels.
Fold Dataset and Train Model
We use the scikit-learn library's KFold class to split the data into folds. It takes as arguments the number of splits (folds) and whether to shuffle the samples before splitting.
from sklearn.model_selection import KFold

n_split = 3
for train_index, test_index in KFold(n_split).split(X):
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    # Build a fresh model for every fold so no weights carry over
    model = create_model()
    model.fit(x_train, y_train, epochs=20)
    print('Model evaluation ', model.evaluate(x_test, y_test))
The split() method yields one pair of index arrays per fold. The arrays hold the indexes into the original data to use for the training set and the test set on that iteration.
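To see what those index arrays look like, the following sketch runs KFold on a six-element toy array. The toy data and the variable names are assumptions made for illustration only; the KFold arguments used (n_splits, shuffle, random_state) are the ones scikit-learn documents.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(6) * 10  # toy data: [0, 10, 20, 30, 40, 50]

folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in KFold(n_splits=3).split(toy)]

for train_idx, test_idx in folds:
    print("train indexes:", train_idx, "test indexes:", test_idx)
# Without shuffling, each test fold is a consecutive block:
# train indexes: [2, 3, 4, 5] test indexes: [0, 1]
# train indexes: [0, 1, 4, 5] test indexes: [2, 3]
# train indexes: [0, 1, 2, 3] test indexes: [4, 5]

# shuffle=True assigns samples to folds at random;
# random_state makes that assignment repeatable.
shuffled_folds = list(KFold(n_splits=3, shuffle=True, random_state=0).split(toy))
```

Note that split() returns positions, not values, which is why the tutorial code indexes X and Y with them.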
Label the folds 1, 2 and 3 and conduct three experiments. In each experiment, fold i is used for assessment and the other folds are merged together for training. For experiment 1, I hold out fold 1 for testing, train on folds 2 and 3, and get a number.
In the second experiment, I hold out fold 2, train on folds 1 and 3, and get a second number. The third works the same way: I test on fold 3 and train on folds 1 and 2.
Now I have seen every combination of these three folds and have three numbers. I can average them and, if I run more folds, perhaps report confidence intervals around the mean.
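One way to turn the per-fold numbers into a single report is to take their mean and spread. The sketch below assumes a hypothetical helper, summarize_scores, and uses placeholder values that stand in for three fold accuracies; they are not results from this tutorial's model.

```python
import numpy as np

def summarize_scores(scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    scores = np.asarray(scores, dtype=float)
    # ddof=1 gives the sample standard deviation (needs at least 2 folds)
    return scores.mean(), scores.std(ddof=1)

# Placeholder values only, standing in for three per-fold accuracies.
placeholder_scores = [0.90, 0.95, 1.00]
mean_acc, std_acc = summarize_scores(placeholder_scores)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

In the training loop, model.evaluate returns the loss followed by the metrics given to compile, so the accuracy to collect for each fold would be the second element of that list.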
Advantages and Disadvantages
Every example appears in a training set exactly K-1 times and in a test set exactly once. That gives you a guarantee about how you have gone through the data.
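This guarantee is easy to check by counting. The following sketch tallies how often each sample index lands in a training or test fold; the sample count of 12 and k of 4 are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 12, 4  # arbitrary illustrative sizes
train_counts = np.zeros(n_samples, dtype=int)
test_counts = np.zeros(n_samples, dtype=int)

# KFold only needs the number of rows, so a dummy array is enough here
for train_idx, test_idx in KFold(n_splits=k).split(np.zeros((n_samples, 1))):
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1

print("train appearances:", train_counts)  # every entry is k - 1
print("test appearances: ", test_counts)   # every entry is 1
```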
The disadvantage is that the size of K determines the size of the train and test splits. With three-fold cross-validation, you train on 67% of the data and test on 33%; with 10-fold, you train on 90% and test on 10%.
Those are very likely to be different experimental scenarios, and I feel that two things get muddled together when you do this.
On the one hand, you want a lot of runs to get a real sense of the system's performance across different settings. On the other hand, getting more runs changes the size of the training and test data. That is simply a consequence of how the method works, so it is not something we can blame on the method.
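The split percentages quoted above follow directly from k: each test fold holds 1/k of the data and the training set holds the rest. As a small sketch, using a hypothetical helper named split_sizes and the 120 examples of this guide's training file:

```python
def split_sizes(n_samples, k):
    """Return (train_size, test_size) for one fold when k divides n_samples."""
    test_size = n_samples // k
    return n_samples - test_size, test_size

n_samples = 120  # number of examples in the training file used in this guide
for k in (3, 10):
    train_size, test_size = split_sizes(n_samples, k)
    print(f"k={k}: train on {train_size} ({train_size / n_samples:.0%}), "
          f"test on {test_size} ({test_size / n_samples:.0%})")
# k=3: train on 80 (67%), test on 40 (33%)
# k=10: train on 108 (90%), test on 12 (10%)
```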