Word embeddings give you a dense representation of words in which similar words have a similar encoding. An embedding is a dense vector of floating-point values. Instead of specifying the values manually, they are trainable parameters (weights learned by the model during training), or you can use pre-trained word embeddings such as word2vec, GloVe, or fastText.

PyTorch Word Embedding

Each word is represented as an N-dimensional vector of floating-point values. Another way to think of an embedding is as a “lookup table”. After these weights have been trained, we can encode each word by looking up the dense vector it corresponds to in the table.

This tutorial shows you how to use pre-trained word embeddings to train an RNN model for text classification. We seed the PyTorch Embedding layer with weights from a pre-trained embedding for the words in the training dataset.

Download Word Embedding

It is common in Natural Language Processing to train word embeddings, save them, and make them freely available. For example, GloVe provides a suite of pre-trained word embeddings.

The smallest package of embeddings is the 822 MB glove.6B.zip. It was trained on a dataset of six billion tokens (words) with a vocabulary of 400 thousand words. It comes in several embedding vector sizes: 50, 100, 200, and 300 dimensions.

!wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
!unzip glove.6B.zip

After downloading and unzipping, you will see a few files, one of which is glove.6B.100d.txt, which contains a 100-dimensional version of the embedding.

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt    
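
Each line of the file holds a word followed by the components of its vector, separated by spaces. As a quick sanity check (this peek is not part of the original workflow), you can read the first line:

with open('glove.6B.100d.txt', encoding='utf-8') as f:
  first_line = f.readline().split()
print(first_line[0], len(first_line) - 1)  # a word followed by 100 floats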

Next, we need to load the GloVe word embedding file into memory as a dictionary that maps each word to its embedding array.

import pandas as pd

glove = pd.read_csv('glove.6B.100d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove_embedding = {key: val.values for key, val in glove.T.items()}
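
Every value in the resulting dictionary should be a 100-dimensional NumPy array, so a lookup such as the following (a quick check, not part of the original code) works:

print(glove_embedding['cat'].shape)  # (100,)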

Next, we need to create a matrix with one embedding for each word in the training dataset. We can do that by enumerating all unique words in Tokenizer.word_index and looking up each word's embedding vector in the loaded GloVe embedding.

import numpy as np

def create_embedding_matrix(word_index, embedding_dict, dimension):
  # Row i holds the embedding of the word with Tokenizer index i. Indices start
  # at 1, so row 0 stays all zeros, as do words that are missing from GloVe.
  embedding_matrix = np.zeros((len(word_index) + 1, dimension))

  for word, index in word_index.items():
    if word in embedding_dict:
      embedding_matrix[index] = embedding_dict[word]
  return embedding_matrix

text=["The cat sat on mat","we can play with model"]

tokenizer=tf.keras.preprocessing.text.Tokenizer(split=" ")
tokenizer.fit_on_texts(text)

text_token=tokenizer.texts_to_sequences(text)

embedding_matrix=create_embedding_matrix(tokenizer.word_index,embedding_dict=glove_embedding,dimension=100)
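
Each row of embedding_matrix now holds the GloVe vector for the word with that Tokenizer index, while row 0 and any out-of-vocabulary words stay all zeros. A short check (not in the original code) might look like:

print(embedding_matrix.shape)  # (11, 100): 10 unique words plus the reserved index 0
print(np.array_equal(embedding_matrix[tokenizer.word_index['cat']], glove_embedding['cat']))  # True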

The result is a matrix of weights only for the words we will see during training. The key difference is that the embedding layer can be seeded with the GloVe word embedding weights. We chose the 100-dimensional version, therefore the embedding layer must be defined with embedding_dim set to 100.

Create Embedding Layer

PyTorch makes it easy to use word embeddings through its Embedding layer. The Embedding layer is a lookup table that maps integer indices to dense vectors (their embeddings). Before using it, you specify the size of the lookup table and initialize the word vectors.

import torch
import torch.nn as nn

vocab_size = embedding_matrix.shape[0]
vector_size = embedding_matrix.shape[1]

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=vector_size)

vocab_size is the number of rows in the embedding matrix (the number of words in the vocabulary plus one for the reserved index 0), and vector_size is the dimension of the word vectors you are using.

Initialize the embedding layer with the pre-trained weights. embedding_matrix is a NumPy array of shape (vocab_size, vector_size).

embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
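
A quick check (not part of the original code) confirms that the layer's weights now match the pre-trained matrix:

print(embedding.weight.shape)  # torch.Size([11, 100]) for this toy vocabulary
print(torch.allclose(embedding.weight, torch.tensor(embedding_matrix, dtype=torch.float32)))  # True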

If you pass integer indices to the embedding layer, each integer is replaced with the corresponding vector from the embedding table.

embedding(torch.LongTensor([1]))
tensor([[-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
         -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
          0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
          0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
          0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
         -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
         -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
          0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
          1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
         -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
          0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
          0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
         -0.5203, -0.1459,  0.8278,  0.2706]])

nn.Embedding is a parameter layer and is trainable by default. If you want to fine-tune the word vectors during training, they are treated as model parameters and updated by backpropagation. You can also make the layer untrainable by freezing its gradient.

embedding.weight.requires_grad=False

We do not want to update the learned word weights in this model, therefore we set the requires_grad attribute of the embedding weights to False.
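
Alternatively, PyTorch provides nn.Embedding.from_pretrained, which creates the layer, copies the weights, and freezes them in a single call; a minimal equivalent of the steps above:

# Equivalent shortcut: build the layer from the pre-trained matrix and freeze it.
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(embedding_matrix), freeze=True)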

For text or sequence problems, the Embedding layer takes a 2D tensor of integers of shape (samples, sequence_length), where each entry is a sequence of integers. For example, you could feed the embedding layer batches of shape (32, 10), a batch of 32 sequences of length 10.

embedding_vec = embedding(torch.LongTensor(text_token))
print(embedding)
print(embedding_vec.shape)

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis.

Embedding(11, 100)
torch.Size([2, 5, 100])

When given a batch of sequences as input, an embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality). To convert this variable-length sequence to a fixed representation, there are a variety of standard approaches: you could use an RNN, attention, or a pooling layer before passing the result to a linear layer.

lstm = nn.LSTM(vector_size, 128, bidirectional=True, batch_first=True)
lstm_out, (hidden, cell) = lstm(embedding_vec)
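
Putting the pieces together, one possible way to finish the text-classification model is to wrap the frozen embedding, the bidirectional LSTM, and a final linear layer in a single module. The following is only a sketch; the class name TextClassifier, the hidden size of 128, and the two output classes are assumptions, not part of the original code.

class TextClassifier(nn.Module):
  def __init__(self, embedding_matrix, hidden_size=128, num_classes=2):
    super().__init__()
    # Frozen embedding layer seeded with the GloVe weights.
    self.embedding = nn.Embedding.from_pretrained(
        torch.FloatTensor(embedding_matrix), freeze=True)
    self.lstm = nn.LSTM(embedding_matrix.shape[1], hidden_size,
                        bidirectional=True, batch_first=True)
    # Concatenated forward and backward final hidden states -> class scores.
    self.fc = nn.Linear(2 * hidden_size, num_classes)

  def forward(self, x):
    embedded = self.embedding(x)          # (batch, seq_len, 100)
    _, (hidden, _) = self.lstm(embedded)  # hidden: (2, batch, hidden_size)
    hidden = torch.cat((hidden[0], hidden[1]), dim=1)
    return self.fc(hidden)

model = TextClassifier(embedding_matrix)
logits = model(torch.LongTensor(text_token))  # shape (2, 2) for this toy batch

From here you would train the LSTM and linear layers as usual (for example with nn.CrossEntropyLoss), while the embedding weights stay fixed.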
