Any time you train a deep learning model on natural language, the model learns embeddings as a by-product, and these embeddings can be useful for other problems. Some widely used embeddings, such as Word2Vec from Google and GloVe from Stanford, were trained on tasks specifically constructed so that the resulting word embeddings would be maximally useful across a wide range of problems.

In this tutorial, you will learn how to use pre-trained word embeddings in a TensorFlow RNN. We use GloVe (Global Vectors) as the embedding layer.

Encoding Words


You can encode words using one-hot encoding: with a vocabulary of 100,000 words, you create a vector of 100,000 zeros and set the position of the word you are encoding to 1. This is highly impractical because the vectors are enormous and almost entirely empty.

one-hot vector

Rather than doing this, we tend to represent words with much shorter vectors of continuous values. This is called an embedding.

word embedding matrix

You can reuse the embedding matrix you generate here for other problems, as long as they involve the same language.

Word embedding

Words that are similar to one another, like Man and Woman, appear close to each other in the embedding space. This means that if your network learns something about Man, such as how it can be used in a sentence, it gets the same or similar knowledge about Woman for free. Likewise, words like Apple and Orange appear close to each other.
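
To make the difference concrete, here is a small NumPy sketch (the numbers and names are purely illustrative, not taken from the tutorial's data) contrasting a one-hot vector with a dense embedding lookup:

import numpy as np

n_words = 100000   # size of the vocabulary
embed_dim = 50     # length of each dense embedding vector

# One-hot: a 100,000-long vector that is all zeros except for a single 1.
word_id = 42
one_hot = np.zeros(n_words)
one_hot[word_id] = 1.0

# Embedding: each word is a row of 50 continuous values in a matrix,
# so looking a word up is just indexing into that matrix.
embedding_matrix = np.random.uniform(-1, 1, size=(n_words, embed_dim))
word_vector = embedding_matrix[word_id]

print(one_hot.shape)      # (100000,)
print(word_vector.shape)  # (50,)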

Load Data


In this tutorial, we use the Amazon Fine Food Reviews dataset, which consists of reviews of fine foods from Amazon.
Download the data from Kaggle, then clean it and split it into training and test sets.

import os

import numpy as np
import pandas as pd
import tensorflow as tf

# Read the reviews file into a pandas DataFrame
df = pd.read_csv('reviews.csv')


# Split into 80% for training and 20% for testing
m = np.random.rand(len(df)) < 0.8
train, test = df[m].copy(deep=True), df[~m].copy(deep=True)


x_train = train.Text
y_train = train.Score

x_test = test.Text
y_test = test.Score

print(len(train))
print(len(test))


# Hyperparameters
max_document_len = 200
embedding_size = 50
num_class = 5

The key columns are ‘Text’, which contains the text of a review, and ‘Score’, which ranges from 1 to 5.
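
You can take a quick look at those two columns before going further:

print(df[['Text', 'Score']].head())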

Data Preprocessing using VocabularyProcessor


Before you build a neural network on review strings, you need to tokenize each string, splitting it into an array of tokens. Each token is a word in the sentence; tokens are separated by spaces and punctuation.

Once you have tokenized the sentences, each word is replaced with an integer.

TensorFlow provides a VocabularyProcessor that takes care of both the tokenization and the integer mapping. You only have to give it the max_document_length argument, which determines the length of the output arrays.

If sentences are shorter than this length, they are padded; if they are longer, they are trimmed. The VocabularyProcessor is then fitted on the training set to build the vocabulary and map the words to integers.
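
As a small illustration (toy sentences rather than the review data; the exact ids depend on the order in which words are first seen), this is what the VocabularyProcessor produces:

toy_sentences = ['the food was great', 'terrible taste']
toy_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length=6)
print(np.array(list(toy_processor.fit_transform(toy_sentences))))
# e.g. [[1 2 3 4 0 0]    <- words mapped to integer ids, padded with 0 to length 6
#       [5 6 0 0 0 0]]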

vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_len)


x_train = vocab_processor.fit_transform(x_train)
   
x_test = vocab_processor.transform(x_test)

    
x_train = np.array(list(x_train))
x_test = np.array(list(x_test))
y_train = np.array(y_train).astype(int) - 1  # shift 1-5 scores to class ids 0-4
y_test = np.array(y_test).astype(int) - 1    # so they fit num_class = 5


vocab_size = len(vocab_processor.vocabulary_)
print('Total words: %d' % vocab_size)


vocab_dict = vocab_processor.vocabulary_._mapping
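
This mapping is a plain Python dict from token to integer id, so you can inspect it directly; for example:

# Index 0 is reserved for the unknown/padding token.
print(vocab_dict.get('<UNK>'))
# Any word seen during fitting maps to a positive integer id (None if unseen).
print(vocab_dict.get('good'))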

Load Embedding Vectors into a NumPy Array


We will take the pre-trained vectors from GloVe and use them in an Estimator.

Download the pre-trained vectors and load them into a numpy.array.

if not os.path.exists('glove.6B.zip'):
    ! wget http://nlp.stanford.edu/data/glove.6B.zip
    ! unzip glove.6B.zip


def load_glove_embeddings(path):
    # Parse the GloVe text file: each line holds a token followed by its weights.
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.strip().split()
            w = values[0]
            vectors = np.asarray(values[1:], dtype='float32')
            embeddings[w] = vectors

If you open the file, you will see a token (word) followed by its embedding weights on each line. Below is the beginning of the first line of the embedding text file, the entry for “the”.

the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353

Next, you need to create a matrix with one embedding per word in the training vocabulary and initialize it with random float values. Then enumerate all the unique words in vocab_processor.vocabulary_._mapping and copy in the corresponding weight vector from the loaded GloVe embeddings; words that do not appear in GloVe keep their random initialization.

    # ...continuing load_glove_embeddings: build one row per vocabulary word,
    # starting from random values and overwriting them with the GloVe weights
    # wherever the word is found in the pre-trained vectors.
    vocab_dict = vocab_processor.vocabulary_._mapping
    embedding_matrix = np.random.uniform(-1, 1, size=(vocab_size, embedding_size))
    num_loaded = 0
    for w, i in vocab_dict.items():
        v = embeddings.get(w)
        if v is not None and i < vocab_size:
            embedding_matrix[i] = v
            num_loaded += 1

    embedding_matrix = embedding_matrix.astype(np.float32)
    return embedding_matrix
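
Since the embedding size is 50, call the loader on the 50-dimensional file from the unzipped archive:

embedding_matrix = load_glove_embeddings('glove.6B.50d.txt')
print(embedding_matrix.shape)  # (vocab_size, 50)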

Create Embedding Layer in TensorFlow


Seed the TensorFlow embedding layer with weights from the pre-trained embedding (the GloVe word embedding weights) for the words in your training dataset. We chose the 50-dimensional version, so the embedding size is 50.

def initializer(shape=None, dtype=tf.float32, partition_info=None):
    # Custom initializer that ignores the requested shape and simply returns
    # the pre-loaded GloVe embedding matrix.
    assert dtype is tf.float32
    return embedding_matrix

params = {'initializer': initializer}
model = tf.estimator.Estimator(model_fn=rnn_model_fn,
                               model_dir=model_dir,
                               params=params)
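
The article does not show how the Estimator is fed training data; one possible way, sketched here with tf.estimator.inputs.numpy_input_fn (the batch size and step count are arbitrary choices, not the original author's), is:

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': x_train}, y=y_train,
    batch_size=128, num_epochs=None, shuffle=True)
model.train(input_fn=train_input_fn, steps=1000)

test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': x_test}, y=y_test,
    num_epochs=1, shuffle=False)
print(model.evaluate(input_fn=test_input_fn))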


....

def rnn_model_fn(features, labels, mode, params):
    # Look up the GloVe-seeded embedding vector for every word id in the batch.
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['initializer'])


    .....
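
The article leaves the rest of rnn_model_fn out. As one hedged possibility, the sketch below completes it with a GRU cell and a dense softmax layer over the five classes; the cell size, optimizer, and learning rate are arbitrary choices, not the original author's, and the embedding lookup from above is repeated so the function is complete. Note that tf.losses.sparse_softmax_cross_entropy expects class ids in [0, num_class), hence the 1-5 scores being shifted to 0-4 during preprocessing.

def rnn_model_fn(features, labels, mode, params):
    # Embedding lookup, as shown above (GloVe-seeded via the custom initializer).
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['initializer'])

    # Run a GRU over the embedded sequence and keep its final state.
    cell = tf.nn.rnn_cell.GRUCell(64)
    _, final_state = tf.nn.dynamic_rnn(cell, input_layer, dtype=tf.float32)

    # Project the final state onto the five review-score classes.
    logits = tf.layers.dense(final_state, num_class)
    predicted_classes = tf.argmax(logits, axis=1)

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode, predictions={'class': predicted_classes})

    # Labels are the review scores mapped to class ids 0-4.
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    eval_metric_ops = {'accuracy': tf.metrics.accuracy(labels, predicted_classes)}
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)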

You can experiment with continuing to learn the word embedding on top of your training dataset, using the pre-trained vectors only as a starting point.
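
One concrete knob for that experiment is the trainable argument of embed_sequence: its default of True lets the network fine-tune the GloVe vectors on the review data, while False keeps them frozen. For example:

# Inside rnn_model_fn: keep the pre-trained GloVe weights fixed during training.
input_layer = tf.contrib.layers.embed_sequence(
    features['x'], vocab_size, embedding_size,
    initializer=params['initializer'], trainable=False)
# Leave trainable=True (the default) to fine-tune the vectors instead.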