PyTorch provides many tools that make data loading easy and your code more readable. In this tutorial, we will see how to load and preprocess a Pandas DataFrame. We use the California Census Data, which has metrics such as the population, median income, and median housing price for each block group in California. This is the dataset we want to use.

[Image: PyTorch Pandas dataset]

The CSV file I have contains 17,000 entries, each with 8 features and one label.
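To get a quick look at the file, we can peek at the first few rows with Pandas. The file name below is only a placeholder; point it at wherever your copy of the CSV lives.

import pandas as pd

# Hypothetical path to the 17,000-row California housing CSV.
csv_path = "california_housing_train.csv"
print(pd.read_csv(csv_path).head())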

Now that we have the data, we can go to the next step: creating a custom Dataset and a DataLoader that turn the tabular data into the tensors our model expects.

Custom Dataset

We first have to create a Dataset class, which we can then pass to the DataLoader. Every Dataset class must implement the __len__ method, which returns the number of items in the dataset, and the __getitem__ method, which returns the item at a given index. In our case, an item is one processed row of the data. The following Dataset class takes the CSV file name as its argument.

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):

  def __init__(self, file_name):
    # Read the CSV file into a DataFrame.
    price_df = pd.read_csv(file_name)

    # The first 8 columns are the features; the 9th column is the label.
    x = price_df.iloc[:, 0:8].values
    y = price_df.iloc[:, 8].values

    # Convert the NumPy arrays into float32 tensors.
    self.x_train = torch.tensor(x, dtype=torch.float32)
    self.y_train = torch.tensor(y, dtype=torch.float32)

  def __len__(self):
    return len(self.y_train)

  def __getitem__(self, idx):
    return self.x_train[idx], self.y_train[idx]

The ‘MyDataset’ class is our custom class for preparing the data for our torch model. It inherits from the abstract parent class Dataset, which is the object type that the DataLoader expects.

First, we build the constructor. The argument passed to the constructor is file_name (the path to the CSV file). Pandas is used to read the CSV file, the result is split into features and labels, and finally both are converted into torch tensors.
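As a quick sanity check of __len__ and __getitem__, the dataset can be indexed like a list. This sketch reuses the csv_path placeholder from above.

ds = MyDataset(csv_path)
print(len(ds))   # calls __len__, the number of rows (17,000 here)
print(ds[0])     # calls __getitem__, the (features, label) tensors of the first row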

Create DataLoader

To train a deep learning model, we need to create a DataLoader from the Dataset. DataLoaders offer multi-worker, multi-process data loading without requiring us to write that code ourselves. So let's first create a DataLoader from the Dataset.

myDs = MyDataset(csv_path)
train_loader = DataLoader(myDs, batch_size=10, shuffle=False)
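The multi-worker loading mentioned above is enabled through the DataLoader's constructor arguments. The variant below is only a sketch and is not used in the rest of the post; num_workers=2 and shuffle=True are arbitrary choices.

# Sketch of a multi-process variant; background workers load batches in parallel.
parallel_loader = DataLoader(myDs, batch_size=10, shuffle=True, num_workers=2)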

Now we will check whether the DataLoader works as intended. We set batch_size to 10 and print the first batch.

for i, (data, labels) in enumerate(train_loader):
  print(data.shape, labels.shape)
  print(data, labels)
  break  # inspect only the first batch

torch.Size([10, 8]) torch.Size([10])
tensor([[-1.1431e+02,  3.4190e+01,  1.5000e+01,  5.6120e+03,  1.2830e+03,
          1.0150e+03,  4.7200e+02,  1.4936e+00],
        [-1.1447e+02,  3.4400e+01,  1.9000e+01,  7.6500e+03,  1.9010e+03,
          1.1290e+03,  4.6300e+02,  1.8200e+00],
        [-1.1456e+02,  3.3690e+01,  1.7000e+01,  7.2000e+02,  1.7400e+02,
          3.3300e+02,  1.1700e+02,  1.6509e+00],
        [-1.1457e+02,  3.3640e+01,  1.4000e+01,  1.5010e+03,  3.3700e+02,
          5.1500e+02,  2.2600e+02,  3.1917e+00],
        [-1.1457e+02,  3.3570e+01,  2.0000e+01,  1.4540e+03,  3.2600e+02,
          6.2400e+02,  2.6200e+02,  1.9250e+00],
        [-1.1458e+02,  3.3630e+01,  2.9000e+01,  1.3870e+03,  2.3600e+02,
          6.7100e+02,  2.3900e+02,  3.3438e+00],
        [-1.1458e+02,  3.3610e+01,  2.5000e+01,  2.9070e+03,  6.8000e+02,
          1.8410e+03,  6.3300e+02,  2.6768e+00],
        [-1.1459e+02,  3.4830e+01,  4.1000e+01,  8.1200e+02,  1.6800e+02,
          3.7500e+02,  1.5800e+02,  1.7083e+00],
        [-1.1459e+02,  3.3610e+01,  3.4000e+01,  4.7890e+03,  1.1750e+03,
          3.1340e+03,  1.0560e+03,  2.1782e+00],
        [-1.1460e+02,  3.4830e+01,  4.6000e+01,  1.4970e+03,  3.0900e+02,
          7.8700e+02,  2.7100e+02,  2.1908e+00]]) tensor([66900., 80100., 85700., 73400., 65500., 74000., 82400., 48500., 58400.,
        48100.])

Now that the DataLoader works, we can use it to train a simple deep-learning model; the focus of this post is not on the model. If you want to use a different DataFrame, you can do so by just adapting our Dataset class.
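To illustrate where the DataLoader fits, here is a minimal training-loop sketch. The model architecture, loss, optimizer, learning rate, and epoch count are all placeholder choices, not prescriptions from this post.

import torch
import torch.nn as nn

# Placeholder model: a single linear layer mapping the 8 features to 1 output.
model = nn.Sequential(nn.Linear(8, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-7)  # arbitrary learning rate

for epoch in range(5):  # arbitrary number of epochs
  for data, labels in train_loader:
    optimizer.zero_grad()
    outputs = model(data).squeeze(1)   # shape [batch] to match the labels
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()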
