The MNIST (Modified National Institute of Standards and Technology) data consists of 60,000 training images and 10,000 test images. Each image is a crude 28 x 28 (784 pixels) handwritten digit from “0” to “9.” Each pixel value is a grayscale integer between 0 and 255.


The source MNIST data files are stored in a custom binary format called IDX. This article explains how to fetch MNIST data in the idx3-ubyte file format and presents code that loads the MNIST dataset from byte form into NumPy arrays.

Download Source Data Files

The primary storage site for the binary MNIST data files is The MNIST Database of handwritten digits at https://yann.lecun.com/exdb/mnist/, but you can also find the files at many other locations. The site links to four GNU zip-compressed files:

train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz

The first two files hold the pixel values and the associated labels for the 60,000-item training data. The second two files are the 10,000-item test data. If you click on a link you can download the associated file. 
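The download step can also be scripted. The following is a minimal sketch, not a definitive recipe: `mnist_urls` and `download_mnist` are hypothetical helper names, the test-set file names (`t10k-...`) are the ones commonly used for the MNIST distribution, and it assumes the host at the URL above still serves the files (mirrors can be substituted through `base_url`).

```python
import urllib.request
from pathlib import Path

# Base URL quoted in the article; whether it still serves the files is an assumption.
BASE_URL = "https://yann.lecun.com/exdb/mnist/"

MNIST_FILES = [
    "train-images-idx3-ubyte.gz",  # training pixels
    "train-labels-idx1-ubyte.gz",  # training labels
    "t10k-images-idx3-ubyte.gz",   # test pixels
    "t10k-labels-idx1-ubyte.gz",   # test labels
]


def mnist_urls(base_url=BASE_URL):
    """Map each file name to its full download URL."""
    return {name: base_url.rstrip("/") + "/" + name for name in MNIST_FILES}


def download_mnist(target_dir, base_url=BASE_URL):
    """Download any missing files into target_dir and return their paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, url in mnist_urls(base_url).items():
        path = target / name
        if not path.exists():  # skip files that are already present
            urllib.request.urlretrieve(url, str(path))
        paths.append(path)
    return paths
```

Calling `download_mnist('/content')` would place the four files in the directory used by the loading code later in this article.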

The organization of the source binary MNIST files is somewhat unusual. Storing features (the pixel predictor values) and labels (the digit to predict) in separate files, rather than together in one file, was common in the 1990s when computers had limited memory.

IDX File Format

The IDX file format is a very simple format for storing vectors and multidimensional matrices of various numerical types. All the integers in the files are stored in the MSB-first (big-endian) format used by most non-Intel processors. The basic format is:

[Figure: MNIST idx file format: a magic number, followed by one 4-byte size per dimension, followed by the data]

Pixels are organized row-wise. Pixel values are 0 to 255, where 0 means background (white) and 255 means foreground (black). The magic number is a 4-byte integer (MSB first). Its first two bytes are always 0. The third byte codes the data type (0x08, unsigned byte, for the MNIST files). The fourth byte codes the number of dimensions: 1 for vectors, 2 for matrices, and so on.
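To make the magic number concrete, here is a small sketch that splits it into its four bytes. `decode_magic` is a hypothetical helper name. The values 2051 and 2049 (the magic numbers of the MNIST image and label files) and the type code 0x08 for unsigned bytes are taken from the IDX description on the MNIST page rather than from the text above.

```python
import struct


def decode_magic(header4):
    """Split the 4-byte IDX magic number into (type_code, num_dims)."""
    zero1, zero2, type_code, num_dims = struct.unpack(">BBBB", header4)
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file: first two bytes must be 0")
    return type_code, num_dims


# 2051 and 2049 are the magic numbers of the MNIST image and label files.
image_magic = struct.pack(">I", 2051)  # bytes 00 00 08 03
label_magic = struct.pack(">I", 2049)  # bytes 00 00 08 01
print(decode_magic(image_magic))  # (8, 3)
print(decode_magic(label_magic))  # (8, 1)
```

The image file reports three dimensions (image index, row, column), while the label file is a plain vector.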

The sizes in each dimension are 4-byte integers (MSB first, big-endian, as on most non-Intel processors). The data is stored as in a C array, i.e. the index in the last dimension changes the fastest.
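The C-array layout means the position of any pixel can be computed directly from its indices. The sketch below illustrates this on a tiny in-memory IDX buffer with made-up sizes (2 images of 3 x 4 pixels), so it runs without the real files. `pixel_offset` is a hypothetical helper, and the 16-byte header size assumes a three-dimensional file (4 bytes of magic number plus three 4-byte sizes).

```python
import struct

import numpy as np

# Build a tiny in-memory IDX "file": 2 images of 3 x 4 pixels (made-up sizes).
dims = (2, 3, 4)
header = struct.pack(">BBBB", 0, 0, 0x08, len(dims))  # magic number
header += struct.pack(">" + "I" * len(dims), *dims)   # big-endian sizes
pixels = bytes(range(2 * 3 * 4))                      # values 0..23
raw = header + pixels


def pixel_offset(image, row, col, rows, cols, header_size):
    """Byte offset of one pixel when the last index varies fastest."""
    return header_size + (image * rows + row) * cols + col


# Read the three sizes back as big-endian 4-byte integers.
n, rows, cols = struct.unpack(">III", raw[4:16])
array = np.frombuffer(raw, dtype=np.uint8, offset=16).reshape(n, rows, cols)

offset = pixel_offset(1, 2, 3, rows, cols, header_size=16)
print((n, rows, cols))              # (2, 3, 4)
print(raw[offset], array[1, 2, 3])  # 23 23
```

Because NumPy's default reshape also uses C order, the reshaped array and the manual offset calculation agree.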

Converting Binary MNIST to NumPy Array

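import gzip
from os.path import join

import numpy as np
import matplotlib.pyplot as plt

input_path = '/content'
training_images_path = join(input_path, 'train-images-idx3-ubyte.gz')
training_labels_path = join(input_path, 'train-labels-idx1-ubyte.gz')

image_size = 28    # each image is 28 x 28 pixels
sample_size = 10   # number of images (and labels) to read

# Image file: 16-byte header (magic number, image count, rows, columns), then pixels.
with gzip.open(training_images_path, 'rb') as train_images_byte:
    train_images_byte.read(16)  # skip the header
    buf = train_images_byte.read(image_size * image_size * sample_size)

# One unsigned byte per pixel; convert to float32 and reshape to (N, 28, 28, 1).
data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
images = data.reshape(sample_size, image_size, image_size, 1)

# Label file: 8-byte header (magic number, label count), then one byte per label.
with gzip.open(training_labels_path, 'rb') as train_labels_byte:
    train_labels_byte.read(8)  # skip the header
    buf = train_labels_byte.read(sample_size)

labels = np.frombuffer(buf, dtype=np.uint8)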
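The code above hard-codes the header sizes and reads only the first few items. As an alternative, the sizes can be taken from the header itself. The following is a minimal sketch of such a utility, assuming a gzip-compressed IDX file of unsigned bytes. `load_idx_ubyte` is a hypothetical name, not a function from any library.

```python
import gzip
import struct

import numpy as np


def load_idx_ubyte(path):
    """Read a gzip-compressed IDX file of unsigned bytes into a NumPy array."""
    with gzip.open(path, "rb") as f:
        # Magic number: two zero bytes, a type code, and the number of dimensions.
        zero, type_code, num_dims = struct.unpack(">HBB", f.read(4))
        if zero != 0 or type_code != 0x08:
            raise ValueError("expected an IDX file of unsigned bytes")
        # One big-endian 4-byte size per dimension.
        dims = struct.unpack(">" + "I" * num_dims, f.read(4 * num_dims))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    # reshape raises an error if the payload does not match the header sizes.
    return data.reshape(dims)
```

With the training files in place, `load_idx_ubyte(training_images_path)` should return an array of shape (60000, 28, 28), and the same function applied to the label file should return a vector of 60,000 labels.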

Displaying MNIST Data

After the MNIST data has been stored as a NumPy array, it’s useful to display it to verify that the data has been converted and stored correctly:

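def show_images(images, labels):
    """Plot each image in a grid, with its label as the title."""
    cols = 5
    rows = (len(images) + cols - 1) // cols  # round up; no empty extra row

    plt.figure(figsize=(10, 7))

    for index, (image, label) in enumerate(zip(images, labels), start=1):
        plt.subplot(rows, cols, index)
        # Drop the trailing channel axis: (28, 28, 1) -> (28, 28)
        plt.imshow(np.squeeze(image), cmap=plt.cm.gray)
        plt.title(label)

    plt.show()

show_images(images, labels)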
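A visual check can be complemented by a few programmatic checks on shapes and value ranges. The sketch below is one possible approach. `check_mnist_arrays` is a hypothetical helper, and the stand-in arrays only mimic the shapes of `images` and `labels` so that the example runs on its own.

```python
import numpy as np


def check_mnist_arrays(images, labels):
    """Raise ValueError if the arrays do not look like MNIST data."""
    if len(images) != len(labels):
        raise ValueError("expected one label per image")
    if images.shape[1:3] != (28, 28):
        raise ValueError("expected 28 x 28 images")
    if images.min() < 0 or images.max() > 255:
        raise ValueError("pixel values should lie in 0..255")
    if labels.min() < 0 or labels.max() > 9:
        raise ValueError("labels should be digits 0..9")
    return True


# Stand-in arrays with the same shapes as `images` and `labels` in the article.
demo_images = np.zeros((10, 28, 28, 1), dtype=np.float32)
demo_labels = np.arange(10, dtype=np.uint8)
print(check_mnist_arrays(demo_images, demo_labels))  # True
```

Running `check_mnist_arrays(images, labels)` on the real arrays would catch mistakes such as a wrong header size, which shifts every pixel and usually produces labels outside 0 to 9.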

Most popular neural network libraries, including PyTorch, scikit-learn, and Keras, have some form of built-in MNIST dataset designed to work with the library. But there are problems with using a built-in dataset: data access becomes a magic black box and important information is hidden.
