When training a neural network, you repeatedly update the weights of your model using calculations on the data. When the dataset is large, training can take a long time and consume a lot of resources.

To save time and computational resources, these calculations are performed iteratively on a portion of the data at a time. This portion is called a batch, and the process is called batch processing. That’s especially important if you are not able to fit the whole dataset in your machine’s memory.

Batch size is the number of samples the model processes before its weights are updated. If you use a batch size of one, you update the weights after every sample. If you use a batch size of 32, you calculate the average error over 32 samples and then update the weights once.

For instance, let’s say you have 24000 training samples and you set the batch size to 32. The algorithm takes the first 32 samples from the training dataset and trains the network. Next, it takes the second 32 samples and trains the network again. This procedure continues until all samples have been propagated through the network, which takes 24000 / 32 = 750 batches.
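The procedure above can be sketched in a few lines of plain Python. This is only an illustration, not a real training loop: `iterate_batches` is a hypothetical helper, and a list of integers stands in for the training samples.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each holding at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Stand-in data: 24000 integers play the role of training samples.
samples = list(range(24000))
batches = list(iterate_batches(samples, batch_size=32))

print(len(batches))     # 750 batches are needed to see every sample once
print(len(batches[0]))  # 32 samples in the first batch
```

In a real training loop, the weights would be updated once inside the loop for each batch that is yielded.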

Networks typically train faster with mini-batches than with the full dataset. That’s because we update the weights after each batch instead of once per pass over the data.

How to choose the batch size

When you put m examples in a mini-batch, you need to do O(m) computation and use O(m) memory, and you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)).
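The square-root relationship can be checked with a small simulation. The sketch below assumes that each per-example gradient is the true value 1.0 plus Gaussian noise with unit variance, which is a simplification; `std_of_batch_mean` is a hypothetical helper name.

```python
import random
import statistics


def std_of_batch_mean(batch_size, trials=2000, seed=0):
    """Estimate the standard deviation of the mean of `batch_size` noisy values."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Each value stands in for a per-example gradient: true value 1.0 plus unit noise.
        batch = [rng.gauss(1.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.stdev(means)


for m in (1, 4, 16, 64):
    # The spread of the batch mean should shrink roughly like 1 / sqrt(m).
    print(m, round(std_of_batch_mean(m), 3))
```

Going from a batch of 1 to a batch of 64 costs 64 times the computation but reduces the noise only by a factor of about 8.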

In practice, using a very large batch has been observed to degrade the quality of the model, as measured by its ability to generalize.

In contrast, small-batch methods tend to converge to flat minimizers, which is attributed to the inherent noise in the gradient estimation. In terms of computational cost, single-sample stochastic gradient descent takes more iterations, but it often reaches a solution at a lower total cost than full-batch mode.

A batch size that is too small risks making learning too stochastic: each update is fast, but training may converge to an unreliable model. A batch size that is too large may not fit into memory and can still take ages to train. The higher the batch size, the more memory space you’ll need.

What is an epoch

The number of epochs is the number of times the algorithm sees the entire dataset. So, each time the algorithm has seen all samples in the dataset, one epoch has been completed.

What is an iteration

Every time you pass a batch of data through the neural network, you complete one iteration. In the case of neural networks, that means one forward pass and one backward pass. So, batch size * number of iterations per epoch = number of training samples (assuming the batch size divides the dataset evenly).

Epoch vs iteration

One epoch includes all the training examples whereas one iteration includes only one batch of training examples.

Steps vs Epoch in TensorFlow

The important difference is that one step processes one batch of data, while you have to process all batches to complete one epoch. The steps parameter indicates the number of steps to run over the data.

A training step is one gradient update. In one step, batch_size examples are processed.

An epoch consists of one full cycle through the training data. This is usually many steps. As an example, if you have 2,000 images and use a batch size of 10, an epoch consists of 2,000 images / (10 images / step) = 200 steps.
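The arithmetic can be written as a small helper. This is a sketch, and `steps_per_epoch` here is a hypothetical function, not a framework call. Rounding up is an assumption: it counts a final, smaller batch as one more step, whereas some input pipelines are configured to drop that remainder instead.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of gradient updates needed to see every example once."""
    # Round up so that a final, smaller batch still counts as a step.
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(2000, 10))   # 200
print(steps_per_epoch(24000, 32))  # 750
print(steps_per_epoch(1000, 32))   # 32 (31 full batches plus one batch of 8)
```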

Online Learning

Typically when people say online learning they mean batch_size=1. The idea behind online learning is that you update your model as soon as you see the example.
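As a rough illustration, here is online learning for a one-parameter model y ≈ w * x. The data stream, the learning rate, and the function name `online_fit` are all assumptions made for this sketch.

```python
import random


def online_fit(stream, learning_rate=0.05):
    """Fit y ≈ w * x by updating w after every single example (batch_size=1)."""
    w = 0.0
    for x, y in stream:
        error = w * x - y
        # Gradient of the squared error 0.5 * error**2 with respect to w.
        w -= learning_rate * error * x
    return w


rng = random.Random(0)
# Hypothetical stream: y = 3x plus a little noise, examples arriving one at a time.
stream = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    stream.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

print(round(online_fit(stream), 2))  # should land close to 3.0
```

The model never needs the whole dataset at once: each example is used for one update and can then be discarded.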

How does batch size affect the performance of the model

Computing the gradient of a batch generally involves computing some function over each training example in the batch and summing over the functions. In particular, gradient computation is roughly linear in the batch size. So it’s going to take about 100x longer to compute the gradient of a 10,000-batch than a 100-batch. 
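To make the linear cost concrete, the sketch below computes a batch gradient for a one-parameter model y ≈ w * x with squared error. The model, the data, and the name `batch_gradient` are illustrative assumptions. The loop body runs once per example, so the work grows in proportion to the batch size.

```python
def batch_gradient(w, batch):
    """Mean gradient of 0.5 * (w * x - y)**2 over a batch of (x, y) pairs."""
    total = 0.0
    for x, y in batch:  # one gradient evaluation per example: O(batch size) work
        total += (w * x - y) * x
    return total / len(batch)


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical data following y = 2x
print(batch_gradient(0.0, batch))  # negative: increasing w reduces the error
print(batch_gradient(2.0, batch))  # 0.0 at the exact solution w = 2
```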

The gradient of a single data point is going to be a lot noisier than the gradient of a 100-batch. This means that we won’t necessarily be moving down the error function in the direction of steepest descent. 

If we used the entire training set to compute each gradient, our model could get stuck in the first valley (local minimum) it reaches, because it would register a gradient of 0 at that point. If we use smaller mini-batches, on the other hand, we’ll get more noise in our estimate of the gradient. This noise might be enough to push us out of some of the shallow valleys in the error function.

In general, a batch size of 32 is a good starting point, and you should also try 64, 128, and 256. Other values may be fine for some datasets, but this range is generally the best to start experimenting with. Below 32, however, training may become too slow because vectorization is not exploited to its full extent. If you get an “out of memory” error, you should try reducing the batch size.
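The advice about memory errors can be turned into a simple fallback loop. Everything here is hypothetical: `train_one_epoch` stands for whatever training function you use, and `MemoryError` is only an assumed failure signal, since a real framework may raise its own exception type.

```python
def first_batch_size_that_fits(train_one_epoch, candidates=(256, 128, 64, 32)):
    """Try candidate batch sizes from largest to smallest; return the first that runs."""
    for batch_size in candidates:
        try:
            train_one_epoch(batch_size)
            return batch_size
        except MemoryError:
            # Assumed failure signal; a real framework may raise its own exception type.
            continue
    raise RuntimeError("no candidate batch size fits in memory")


def fake_train_one_epoch(batch_size):
    """Hypothetical stand-in that pretends anything above 64 runs out of memory."""
    if batch_size > 64:
        raise MemoryError(f"batch size {batch_size} does not fit")


print(first_batch_size_that_fits(fake_train_one_epoch))  # 64
```

This only addresses the memory constraint. Which batch size gives the best model still has to be measured, for example on validation data.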