The MNIST (Modified National Institute of Standards and Technology) dataset consists of 60,000 training images and 10,000 test images. Each image is a crude 28 x 28 (784-pixel) picture of a handwritten digit from “0” to “9.” Each pixel value is a grayscale integer between 0 and 255.
The source MNIST data files are stored in a custom binary format. This article explains how to fetch the MNIST data, describes the .idx3-ubyte file format, and shows how to load the dataset from byte form into NumPy arrays.
Download Source Data Files
The primary storage site for the binary MNIST data files is The MNIST Database of handwritten digits, publicly available at https://yann.lecun.com/exdb/mnist/, but you can also find the files at many other locations. The site links to four gzip-compressed files:
- Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, and 60,000 samples)
- Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, and 60,000 labels)
- Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB unzipped, and 10,000 samples)
- Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, and 10,000 labels)
The first two files hold the pixel values and the associated labels for the 60,000-item training data. The last two files hold the pixel values and labels for the 10,000-item test data. Each file can be downloaded from its link on that page.
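The download step can also be scripted. The following is a minimal sketch, assuming the base URL given above is reachable (it may not be from every network, in which case a mirror URL can be passed instead). The helper names `build_urls` and `download_mnist` are hypothetical and are not part of any library.

```python
import urllib.request
from pathlib import Path

# Base URL taken from the article; pass a mirror if this host is unreachable.
BASE_URL = "https://yann.lecun.com/exdb/mnist/"

# The four gzip-compressed archives listed above.
MNIST_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]

def build_urls(base_url=BASE_URL):
    """Return a {file_name: url} mapping for the four MNIST archives."""
    return {name: base_url.rstrip("/") + "/" + name for name in MNIST_FILES}

def download_mnist(dest_dir, base_url=BASE_URL):
    """Download every archive that is not already present in dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, url in build_urls(base_url).items():
        target = dest / name
        if not target.exists():
            # Network access happens only here.
            urllib.request.urlretrieve(url, str(target))
        paths.append(target)
    return paths
```

Calling `download_mnist('/content')` would place the archives in the directory that the conversion code later in this article reads from. Files that already exist are skipped, so the call is safe to repeat.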
The organization of the source binary MNIST files is somewhat unusual. Storing features (the pixel predictor values) and labels (the digit to predict) in separate files, rather than together in one file, was common in the 1990s when computers had limited memory.
IDX File Format
This file format is designed for storing vectors and multidimensional matrices of various numerical types, and it is very simple. All the integers in the files are stored in MSB-first (big-endian) format, as used by most non-Intel processors. The basic layout is a magic number, followed by the size of each dimension, followed by the data.
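Pixels are organized row-wise. Pixel values are 0 to 255, where 0 means background (white) and 255 means foreground (black). The magic number is a 4-byte integer (MSB first). Its first 2 bytes are always 0. The third byte codes the data type (0x08, unsigned byte, for the MNIST files). The fourth byte codes the number of dimensions of the vector or matrix: 1 for vectors, 2 for matrices, and so on. The MNIST label files have 1 dimension and the image files have 3 (image count, rows, columns).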
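As an illustration of the magic number layout, the sketch below splits the four bytes apart with Python's `struct` module. The function name `decode_magic` is hypothetical. The type codes in the table come from the IDX format description, where the third byte of the magic number identifies the data type. The values 2051 and 2049 are the magic numbers of the MNIST image and label files.

```python
import struct

# Type codes defined by the IDX format (third byte of the magic number).
IDX_TYPE_NAMES = {
    0x08: "unsigned byte",
    0x09: "signed byte",
    0x0B: "short (2 bytes)",
    0x0C: "int (4 bytes)",
    0x0D: "float (4 bytes)",
    0x0E: "double (8 bytes)",
}

def decode_magic(first_four_bytes):
    """Split the 4-byte magic number into (type_code, number_of_dimensions)."""
    zero_hi, zero_lo, type_code, ndim = struct.unpack(">BBBB", first_four_bytes)
    if zero_hi != 0 or zero_lo != 0:
        raise ValueError("not an IDX file: the first two bytes must be 0")
    return type_code, ndim

# 2051 (0x00000803) is the magic number of the MNIST image files,
# 2049 (0x00000801) is the magic number of the MNIST label files.
image_magic = struct.pack(">I", 2051)
label_magic = struct.pack(">I", 2049)

print(decode_magic(image_magic))  # (8, 3): unsigned bytes, 3 dimensions
print(decode_magic(label_magic))  # (8, 1): unsigned bytes, 1 dimension
```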
The sizes in each dimension are 4-byte integers (MSB first, big-endian, as in most non-Intel processors). The data is stored as in a C array, i.e. the index in the last dimension changes the fastest.
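Putting the header rules together, a generic reader might look like the sketch below. It is an illustration under stated assumptions rather than a reference implementation: `parse_idx` is a hypothetical name, the input is the full content of an uncompressed IDX file as `bytes`, and the dtype table follows the type codes of the IDX format. Because NumPy reshapes in C order by default, the last index changes fastest, which matches the layout described above.

```python
import struct
import numpy as np

# NumPy dtypes for the IDX type codes; ">" marks big-endian (MSB first).
IDX_DTYPES = {
    0x08: np.dtype("u1"),
    0x09: np.dtype("i1"),
    0x0B: np.dtype(">i2"),
    0x0C: np.dtype(">i4"),
    0x0D: np.dtype(">f4"),
    0x0E: np.dtype(">f8"),
}

def parse_idx(raw):
    """Parse the bytes of an uncompressed IDX file into a NumPy array."""
    zeros, type_code, ndim = struct.unpack(">HBB", raw[:4])
    if zeros != 0:
        raise ValueError("not an IDX file: the first two bytes must be 0")
    # One 4-byte big-endian size per dimension follows the magic number.
    header_size = 4 + 4 * ndim
    shape = struct.unpack(">" + "I" * ndim, raw[4:header_size])
    data = np.frombuffer(raw, dtype=IDX_DTYPES[type_code], offset=header_size)
    return data.reshape(shape)

# Synthetic example: two 2 x 3 "images" of unsigned bytes.
header = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 3)
payload = bytes(range(12))
arr = parse_idx(header + payload)
print(arr.shape)   # (2, 2, 3)
print(arr[1, 0])   # [6 7 8]
```

For a real MNIST archive, the bytes could be obtained with `gzip.open(path, 'rb').read()` before being passed to `parse_idx`.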
Converting Binary MNIST to NumPy Array
```python
import gzip
from os.path import join

import matplotlib.pyplot as plt
import numpy as np

input_path = '/content'
training_images_path = join(input_path, 'train-images-idx3-ubyte.gz')
training_labels_path = join(input_path, 'train-labels-idx1-ubyte.gz')

image_size = 28    # MNIST images are 28 x 28 pixels
sample_size = 10   # number of images and labels to read

# Image file: skip the 16-byte header (magic number, count, rows, columns).
with gzip.open(training_images_path, 'rb') as train_images_byte:
    train_images_byte.read(16)
    buf = train_images_byte.read(image_size * image_size * sample_size)
data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
images = data.reshape(sample_size, image_size, image_size, 1)

# Label file: skip the 8-byte header (magic number, count).
with gzip.open(training_labels_path, 'rb') as train_labels_byte:
    train_labels_byte.read(8)
    buf = train_labels_byte.read(sample_size)
labels = np.frombuffer(buf, dtype=np.uint8)
```
Displaying MNIST Data
After the MNIST data has been stored as a NumPy array, it’s useful to display it to verify the data has been converted and saved correctly:
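```python
def show_images(images, labels):
    """Plot each image in a grid with its label as the title."""
    cols = 5
    rows = int(len(images) / cols) + 1
    plt.figure(figsize=(10, 7))
    index = 1
    for x in zip(images, labels):
        image = x[0]
        label = x[1]
        plt.subplot(rows, cols, index)
        # Drop the trailing channel axis so imshow receives a 28 x 28 array.
        plt.imshow(np.squeeze(image), cmap=plt.cm.gray)
        plt.title(label)
        index += 1
    plt.show()

show_images(images, labels)
```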
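A visual check can be complemented by a few programmatic checks and a save-and-reload round trip. The sketch below is one possible way to do that and is not part of the original workflow: `check_mnist_arrays` is a hypothetical helper, the `.npz` archive format is an assumption (the article does not say how the arrays are saved), and random stand-in arrays are used so the example runs without the MNIST files.

```python
import io
import numpy as np

def check_mnist_arrays(images, labels):
    """Raise AssertionError if the arrays do not look like MNIST data."""
    assert images.shape[0] == labels.shape[0], "image/label count mismatch"
    assert images.shape[1:3] == (28, 28), "images must be 28 x 28"
    assert images.min() >= 0 and images.max() <= 255, "pixels out of range"
    assert labels.min() >= 0 and labels.max() <= 9, "labels must be 0 to 9"

# Stand-in data with the same shapes and dtypes as the arrays built earlier.
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(10, 28, 28, 1)).astype(np.float32)
demo_labels = rng.integers(0, 10, size=10).astype(np.uint8)
check_mnist_arrays(demo_images, demo_labels)

# Round trip through a compressed .npz archive.
# An in-memory buffer is used here; a file path works the same way.
buffer = io.BytesIO()
np.savez_compressed(buffer, images=demo_images, labels=demo_labels)
buffer.seek(0)
restored = np.load(buffer)
assert np.array_equal(restored["images"], demo_images)
assert np.array_equal(restored["labels"], demo_labels)
```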
Most popular neural network libraries, including PyTorch, scikit-learn, and Keras, have some form of built-in MNIST dataset designed to work with the library. But using a built-in dataset has drawbacks: data access becomes a magic black box and important information is hidden.