Batch size is how many samples you load at once. Let’s take a closer look at it, because it’s not just how many images you process at a time; there are certain implications, so let me quickly cover them.

The batch size defines the number of samples that are propagated through the network before the model parameters are updated. In other words, each batch of samples goes through one full forward and backward propagation in your neural network, and then the parameters are updated. That’s why batch size is very important.

A very small or a very large batch size each has certain implications. For example:

- Total training samples = 5000
- Batch size = 32
- Epochs = 100

One epoch means all of your data, here all 5000 samples, has gone through the forward and backward pass once. Then:

- 32 samples are taken at a time to train the network.
- Going through all 5000 samples takes 157 iterations per epoch (5000/32 = 156.25, rounded up because the last batch is smaller).
- This process repeats 100 times, once per epoch.
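
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are mine, and it assumes the final, partial batch is still used for an update, which is why the division is rounded up.

```python
import math

total_samples = 5000
batch_size = 32
epochs = 100

# 5000 / 32 = 156.25, so the last batch of each epoch is a partial one
iterations_per_epoch = math.ceil(total_samples / batch_size)
last_batch_size = total_samples - (iterations_per_epoch - 1) * batch_size
total_updates = iterations_per_epoch * epochs

print(iterations_per_epoch)  # 157
print(last_batch_size)       # 8
print(total_updates)         # 15700
```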

## What is a good batch size?

First of all, you may not have a choice, because you probably have a crappy system like most of us. You may not have the luxury of working with large batch sizes like 128, 256, or 512.

You may be limited to small batch sizes by the hardware you have, which includes RAM and GPU memory. If your GPU has 4GB and you’re trying to load 64 images of 512×512 at once, you will fill up your GPU memory or your RAM. If you work with the MNIST dataset, where the images are 28×28, you can use a batch size of 256. **It’s completely dependent on your local hardware.**
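
To get a feel for the numbers, here is a rough back-of-the-envelope sketch. The helper name is hypothetical, and it assumes float32 inputs and RGB channels for the 512×512 images (the text doesn't say either). It only counts the input batch itself; the activations and gradients the network stores during training usually take far more memory, so treat this as a lower bound.

```python
def batch_input_mebibytes(batch_size, height, width, channels, bytes_per_value=4):
    """Size of one input batch in MiB, assuming float32 values (4 bytes each)."""
    return batch_size * height * width * channels * bytes_per_value / 1024**2

big = batch_input_mebibytes(64, 512, 512, 3)    # 64 RGB images at 512x512 (assumed RGB)
small = batch_input_mebibytes(256, 28, 28, 1)   # 256 grayscale MNIST digits

print(f"{big:.1f} MiB")    # 192.0 MiB
print(f"{small:.2f} MiB")  # 0.77 MiB
```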

## Smaller Batch Size

Smaller batches mean each gradient step may be less accurate. Remember that in gradient descent you’re trying to find the minimum. A small batch represents only a small portion of your data, so the gradient computed from it is a noisy estimate of the true gradient. Instead of stepping in the right direction, some steps will go in the wrong direction.
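
You can see this noise on a toy problem. The sketch below is my own illustration, not part of the Keras experiment later on: it assumes a made-up 1-D linear regression and measures how much the mini-batch gradient scatters around the full-data gradient for a few batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): y = 3x + noise
x = rng.normal(size=5000)
y = 3.0 * x + rng.normal(size=5000)
w = 0.0  # current weight, far from the true value of 3

def grad(xb, yb, w):
    # Derivative of mean((w * x - y)^2) with respect to w
    return np.mean(2.0 * (w * xb - yb) * xb)

full_grad = grad(x, y, w)  # gradient computed on all 5000 samples

def gradient_spread(batch_size, n_trials=1000):
    """Standard deviation of the gradient over many random mini-batches."""
    grads = []
    for _ in range(n_trials):
        idx = rng.choice(len(x), size=batch_size, replace=False)
        grads.append(grad(x[idx], y[idx], w))
    return float(np.std(grads))

spread = {b: gradient_spread(b) for b in (4, 32, 256)}
for b, s in spread.items():
    print(f"batch={b:>3}  gradient std={s:.2f}  (full gradient={full_grad:.2f})")
```

The spread shrinks as the batch grows, which is the sense in which small-batch steps are "less accurate".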

You’ll eventually find the minimum, but it takes a little more time because the path there is less direct. The algorithm needs more steps to converge. Since each step is cheap, the overall process may **still be faster, even though it generally takes more steps to converge**.

## Larger Batch Size

For larger batch sizes, the model’s ability to generalize seems to decrease. It’s been observed that with large batches, say a batch size of 512, 1k, or 2k, there can be a significant **degradation in the quality** of the model.

A smaller batch size is fine, but preferably not too small and not too large. So what is a good value? 32 is appropriate, and luckily that is possible on most systems.

If you’re using Google Colab with its free RAM and GPU, then 32 is a good starting point for most images. If you plan on working with larger image sizes you can go down to 16, and if you have the luxury of going to 64, that’s also fine.

Let’s quickly walk through a Keras example so you can test this yourself on your own data if you want.

We are going to generate a bunch of random data points. To do that I’m using scikit-learn’s *make_blobs*, which creates 1000 data points grouped around 3 centers, which means we have three clusters.

```
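from sklearn.datasets import make_blobs
from tensorflow import keras  # standalone Keras also works: import keras

# 1000 two-dimensional points drawn around 3 cluster centers
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
y = keras.utils.to_categorical(y)  # integer labels -> one-hot vectors
# First half for training, second half for testing
n_train = 500
X_train, X_test = X[:n_train, :], X[n_train:, :]
y_train, y_test = y[:n_train], y[n_train:]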
```

y is one-hot encoded using *to_categorical*, which converts the integer labels into binary class vectors. I am splitting the data in half: one half is the training dataset and the other half is the testing dataset.
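
If one-hot encoding is new to you, here is a tiny NumPy-only sketch of the same idea. It is an equivalent I wrote for illustration, not how *to_categorical* is implemented, and the label values are made up.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])  # integer class ids, like the ones make_blobs returns
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Each row has a single 1 in the column of its class, which is the format categorical cross-entropy expects.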

Now let’s create the model, again keeping it very simple with just two dense layers: an initial dense layer with 50 nodes, then a dense layer with three nodes because we have three clusters, with softmax as its activation.

```
def build_and_compile_model():
    model = keras.Sequential()
    # Hidden layer: 50 units on the 2 input features
    model.add(keras.layers.Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
    # Output layer: one softmax unit per class
    model.add(keras.layers.Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
```

I’m using Adam as my optimizer here, but you can test this with SGD and see how things work out. I’m using accuracy as the metric and categorical_crossentropy as the loss because this is a multi-class classification problem.

Let’s fit the model for 100 epochs. We run the entire fit for batch sizes of 4, 8, 16, 32, 64, 128, 256, and 512. For 512 there is only one iteration per epoch, because the training half holds just 500 of the 1000 samples. This type of code makes it pretty easy to study the effect of various parameters on this kind of data.

```
import matplotlib.pyplot as plt
from time import time

batch_sizes = [4, 8, 16, 32, 64, 128, 256, 512]
plt.figure(figsize=(16, 15))
for i, batch_size in enumerate(batch_sizes):
    print('\nBatch size ' + str(batch_size))
    start = time()
    # Fresh model for every batch size so the runs are comparable
    model = build_and_compile_model()
    history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
                        epochs=100, verbose=1, batch_size=batch_size)
    elapsed = time() - start
    # 4 rows x 2 columns of subplots, one per batch size
    plt.subplot(4, 2, i + 1)
    plt.plot(history.history['accuracy'], label='train')
    plt.plot(history.history['val_accuracy'], label='test')
    plt.title('batch=' + str(batch_size) + '  time: ' + str(round(elapsed, 1)) + 's')
    plt.legend()
# show learning curves
plt.show()
```

Finally, we plotted everything on a 4×2 grid, so we have 8 plots here. It started off with batch sizes 4 and 8, which didn’t take forever but still took some time.

Once it got to 64, 128, and 512, it was super fast in terms of updating. So training was very fast, but look at **how fast these things converge**. When I say converge, I mean actually getting to a stable value.

For batch sizes of 4 and 8 it was almost immediate, about 30 epochs to reach a stable value. For batch sizes of 64 and up, it took roughly 50 epochs or slightly more to stabilize.
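
Reading "stable" off a plot is a bit subjective. One way to put a number on it is sketched below. The helper `epochs_to_stabilize` and its tolerance are my own hypothetical choices, not something Keras provides, and the accuracy curve is made up purely for illustration.

```python
def epochs_to_stabilize(values, tolerance=0.01):
    """Return the first (1-based) epoch from which every value stays
    within `tolerance` of the final value."""
    final = values[-1]
    stable_from = len(values)
    # Walk backwards until a value falls outside the tolerance band
    for epoch in range(len(values), 0, -1):
        if abs(values[epoch - 1] - final) > tolerance:
            break
        stable_from = epoch
    return stable_from

# Made-up accuracy curve for illustration only, not a result from the run above
curve = [0.50, 0.65, 0.74, 0.80, 0.825, 0.828, 0.83, 0.83]
print(epochs_to_stabilize(curve))  # 5
```

Inside the loop you could call it as `epochs_to_stabilize(history.history['val_accuracy'])` to compare batch sizes on the same footing.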

For the larger batch sizes, execution was pretty fast because the number of iterations per epoch was smaller.
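
Counting the weight updates makes the difference concrete. This is a small sketch using the 500 training samples and 100 epochs from the code above, and it assumes the partial last batch still counts as an update. Update counts are only part of the story for wall-clock time, since each large-batch update does more work.

```python
import math

n_train = 500   # training samples after the 50/50 split
epochs = 100    # value passed to model.fit in the loop
batch_sizes = [4, 8, 16, 32, 64, 128, 256, 512]

# Number of weight updates in one epoch for each batch size
updates_per_epoch = {b: math.ceil(n_train / b) for b in batch_sizes}

for b, per_epoch in updates_per_epoch.items():
    print(f"batch={b:>3}  updates/epoch={per_epoch:>3}  total={per_epoch * epochs}")
```

A batch size of 4 needs 125 updates per epoch, while a batch size of 512 needs just one.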

These are the kinds of studies you can do by taking the code and changing different parameters to get a better understanding. No one is going to tell you all the answers for your dataset; experimenting like this is the only real way to learn the batch size hyperparameter.

To summarize: a larger batch size gets through each epoch faster, but it doesn’t always converge in as few epochs. A small batch size trains more slowly per epoch but converges in fewer epochs.