Text classification is one of the most common tasks in machine learning: given a piece of text, assign it a class. It is a core task in natural language processing.

Text classification has many applications, such as spam filtering, sentiment analysis, part-of-speech tagging, language detection, and many more. In this tutorial, we will build a text classifier model using PyTorch in Python.

We will work on classifying a large number of Wikipedia comments as either toxic or non-toxic. The dataset we will use comes from the Toxic Comment Classification Challenge on Kaggle.

By the end of this project, you will be able to apply word embeddings for text classification, use an LSTM as a feature extractor in natural language processing (NLP), and perform binary text classification using PyTorch.

Let us first import all the necessary libraries required to build a model.

import pandas as pd
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

from torchtext import data

import random
import re

import spacy

SEED=42
torch.manual_seed(SEED)

Preprocessing Text

Processing text is the first step in NLP. TorchText is incredibly convenient as it allows you to rapidly tokenize your data. TorchText is a PyTorch package that contains different data processing methods as well as popular NLP datasets.

TorchText has four main modules: data, datasets, vocab, and utils. The data module provides the Field, Dataset, and Iterator classes used to build custom datasets and batch samples; datasets contains ready-made NLP datasets ranging from sentiment analysis to question answering; vocab handles vocabularies and pre-trained word vectors; and utils provides additional helper functions.
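
For orientation, here is a rough sketch of how those pieces are typically used. The dataset and vector names below are only examples, and the lines that would trigger large downloads are left commented out.

from torchtext import data, datasets
from torchtext.vocab import GloVe

# data: declarative Field / Dataset / Iterator classes used throughout this tutorial
text_field = data.Field(lower=True)
label_field = data.LabelField()

# datasets: ready-made corpora, e.g. the IMDB sentiment dataset
# train_split, test_split = datasets.IMDB.splits(text_field, label_field)

# vocab: pre-trained word vectors such as GloVe
# vectors = GloVe(name='6B', dim=300)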

Tokenize data

The code below shows how to tokenize the text using TorchText and spaCy together. spaCy is a library built specifically to take sentences in various languages and split them into tokens. Without spaCy, TorchText defaults to a simple split on whitespace, which is much less nuanced than spaCy's approach: spaCy will also split words like "don't" into "do" and "n't", and much more.

# Load the spaCy English model (with spaCy v3+, use spacy.load('en_core_web_sm'))
tokenizer = spacy.load('en')

# Extend spaCy's default English stop word list with tokens that are
# frequent in this dataset but carry little signal
stop_words = spacy.lang.en.STOP_WORDS
stop_words.update(['nt', 'm', 's', 'wikipedia', 'article', 'articles', 'im', 'page'])


def spacy_token(x):
    # Keep only letters and whitespace, then replace newlines with spaces
    x = re.sub(r'[^a-zA-Z\s]', '', x)
    x = re.sub(r'[\n]', ' ', x)
    return [tok.text for tok in tokenizer.tokenizer(x)]
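
A quick way to see what the tokenizer does is to run it on a made-up comment. The regular expressions strip punctuation and apostrophes before spaCy splits the text into tokens; lowercasing and stop word removal happen later, in the Field.

sample = "This article is great!! Please don't revert it."  # hypothetical comment
print(spacy_token(sample))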

Create Field

TorchText takes a declarative approach to loading data: you declare a Field for each column, and the Field specifies how that column should be processed. Let's look at an example.

TEXT = data.Field(sequential=True, lower=True, tokenize=spacy_token, eos_token='EOS',
                  stop_words=stop_words, include_lengths=True)

LABEL = data.Field(dtype=torch.float, sequential=False, use_vocab=False,
                   pad_token=None, unk_token=None)

# Map the CSV columns, in order, to fields; (None, None) skips the id column
dataField = [(None, None), ("comment_text", TEXT), ("toxic", LABEL)]

In the toxic comment classification dataset, there are two fields we care about: the comment text (comment_text) and the label (toxic); the id column is skipped.

Create Dataset

TorchText takes raw data in the form of text files, CSV, JSON, and directories and converts them to Datasets. Datasets are simply preprocessed blocks of data read into memory with various fields. They are a canonical form of processed data that other data structures can use.

dataset = data.TabularDataset(path='train.csv', format='csv', fields=dataField, skip_header=True)
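
Once the dataset is built, it helps to peek at one processed example to confirm that the field mapping is correct:

# Each Example stores the preprocessed comment tokens and the raw label string
example = dataset.examples[0]
print(example.comment_text[:10])  # first ten tokens (lowercased, stop words removed)
print(example.toxic)              # label as read from the CSV, e.g. '0' or '1'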

Split Dataset

Calling the split() method on the dataset then returns a train and a validation dataset, with the respective data loaded into them and processed (tokenized) according to the fields we defined earlier.

train_data, val_data = dataset.split(split_ratio=0.8, random_state=random.seed(SEED))
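
With a split_ratio of 0.8, roughly 80% of the examples go to training. A quick check of the resulting sizes:

print(f"training examples:   {len(train_data)}")
print(f"validation examples: {len(val_data)}")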

Load Pre-Trained Embeddings

Using the TEXT field you have created, call the build_vocab() method and pass in the training data so that it learns the full vocabulary of the training set. The vocab object can also build an embedding matrix from pre-trained word vectors, here the 300-dimensional GloVe vectors.

TEXT.build_vocab(train_data)
TEXT.vocab.load_vectors('glove.6B.300d')
embedding = TEXT.vocab.vectors
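
Before wiring the embedding matrix into the model, it is worth checking that the vocabulary and vectors look sensible:

print(f"vocabulary size: {len(TEXT.vocab)}")
print(f"most common tokens: {TEXT.vocab.freqs.most_common(5)}")
print(f"embedding matrix shape: {embedding.shape}")  # (vocabulary size, 300)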

Dataset to Iterator

Iterators handle numericalizing, batching, packaging, and moving the data to the GPU. Basically, they do all the heavy lifting necessary to pass the data to a neural network.

BATCH_SIZE = 16
nlabel = 1        # a single output unit for binary classification
hidden_dim = 25   # size of the LSTM hidden state

# Use the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, val_data), 
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.comment_text),
    sort_within_batch=True,
    device = device)

The BucketIterator automatically shuffles the training data and groups input sequences of similar length into the same batch. This is very useful because the amount of padding is determined by the longest sequence in the batch, so padding is most efficient when sequences are of similar lengths.
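
Pulling one batch from the iterator shows exactly what the model will receive. Because include_lengths=True was set on the TEXT field, the comment_text attribute is a (tokens, lengths) pair:

batch = next(iter(train_iterator))
text, lengths = batch.comment_text
print(text.shape)         # [longest sequence in the batch, batch size]
print(lengths)            # true (unpadded) length of each comment
print(batch.toxic.shape)  # [batch size]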

Build PyTorch Model

It's time to define the architecture to solve the binary classification problem. nn.Module is the base class for all neural network modules in PyTorch, which means every model must be a subclass of nn.Module.

class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, label_size, batch_size, embedding_weights, bidirectional=True):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        # Embedding layer initialized from the pre-trained GloVe vectors
        self.word_embeddings = nn.Embedding.from_pretrained(embedding_weights)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=bidirectional)
        # A bidirectional LSTM produces both a forward and a backward hidden state
        if bidirectional:
            self.fc = nn.Linear(hidden_dim * 2, label_size)
        else:
            self.fc = nn.Linear(hidden_dim, label_size)
        self.act = nn.Sigmoid()

    def forward(self, sentence, src_len, train=True):
        # sentence: [seq len, batch size] -> embeds: [seq len, batch size, embedding dim]
        embeds = self.word_embeddings(sentence)

        # Pack the padded batch so the LSTM skips padding tokens
        # (pack_padded_sequence expects the lengths on the CPU)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embeds, src_len.cpu())
        packed_outputs, (hidden, cell) = self.lstm(packed_embedded)

        # Concatenate the final forward and backward hidden states: [batch size, hidden dim * 2]
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)

        dense_outputs = self.fc(hidden)
        outputs = self.act(dense_outputs)
        return outputs

Here we instantiate the model and define the optimizer, the loss function, and the accuracy metric:

model = LSTMClassifier(embedding_dim=embedding.shape[1], hidden_dim=hidden_dim, label_size=nlabel,
                       batch_size=BATCH_SIZE, embedding_weights=embedding)
model = model.to(device)

optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_function = nn.BCELoss()


def model_accuracy(predict, y):
    # Round the sigmoid outputs to 0/1 before comparing them with the labels
    true_predict = (torch.round(predict) == y).float()
    acc = true_predict.sum() / len(true_predict)
    return acc
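
As a quick sanity check with made-up values, the metric compares the rounded predictions against the labels:

# Hypothetical predictions and labels, just to illustrate the metric
preds = torch.tensor([0.9, 0.2, 0.7, 0.1])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(model_accuracy(preds, labels))  # tensor(0.7500)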

Calling model.train() puts the model in training mode, which enables training-only behavior such as dropout in models that use it. Below is the training loop; we will train the model for a fixed number of epochs.

epochs = 10
for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    total_acc = 0.0
    for batch in train_iterator:
        # include_lengths=True: comment_text yields (token ids, sequence lengths)
        (feature, batch_length), label = batch.comment_text, batch.toxic

        optimizer.zero_grad()

        output = model(feature, batch_length).squeeze()

        loss = loss_function(output, label)
        acc = model_accuracy(output, label)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_acc += acc.item()

    print(f"loss on epoch {epoch} = {total_loss/len(train_iterator)}")
    print(f"accuracy on epoch {epoch} = {total_acc/len(train_iterator)}")


Run this code in Google Colab