Any time you train a deep learning model on natural language, you generate embeddings, and these embeddings can be useful in other problems. Some widely used embeddings, like Word2Vec from Google or GloVe from Stanford, were trained on tasks designed specifically to produce word embeddings that are maximally useful across a wide range of problems.

We have already discussed how to train word embeddings. In this tutorial, you will learn how to use pre-trained word embeddings in a TensorFlow RNN. We use Global Vectors (GloVe) as the embedding layer.

Encoding Words

You can encode words using one-hot encoding. If you have a vocabulary of 100,000 words, one option is to create a vector of 100,000 zeroes and mark the word you are encoding with a 1. This is highly impractical because the vectors are very large and sparse.

one-hot vector
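As a concrete sketch, here is one-hot encoding over a toy five-word vocabulary (the words and vocabulary size are made up for illustration; a real vocabulary might have 100,000 entries):

```python
import numpy as np

# Toy vocabulary; a real one might have 100,000 entries.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]
```

With 100,000 words, every such vector would hold 99,999 zeroes, which is the impracticality described above.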

Rather than doing this, we represent each word with a shorter vector of continuous values; such a vector is called an embedding.

word embedding matrix

Word embedding matrix with embedding size 50.
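A minimal numpy sketch of such a matrix and of how a word's embedding is looked up (the vocabulary and the random weights here are made up for illustration):

```python
import numpy as np

vocab = ["the", "cat", "sat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
embed_size = 50

rng = np.random.RandomState(0)
# One row per word, 50 columns: the embedding matrix.
embedding_matrix = rng.uniform(-1.0, 1.0, size=(len(vocab), embed_size))

# Looking up a word's embedding is just a row index.
cat_vector = embedding_matrix[word_to_id["cat"]]
print(cat_vector.shape)  # (50,)
```

Instead of a 100,000-long sparse vector, each word is now a dense 50-dimensional row.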

You can reuse this embedding matrix for other problems, as long as they use the same language.

Words that are similar to one another, like "man" and "woman", appear close to each other in the embedding space. This means that if your network learns something about "man", such as how it can be used in a sentence, it gets the same or similar knowledge about "woman" for free. Likewise, words like "apple" and "orange" will appear close to each other.
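Closeness in the embedding space is usually measured with cosine similarity. A toy sketch with hand-made 3-dimensional vectors (real GloVe vectors have 50+ dimensions, and these numbers are invented for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-crafted toy vectors: "man" and "woman" point in similar directions,
# "orange" points elsewhere.
man    = np.array([0.9, 0.1, 0.0])
woman  = np.array([0.8, 0.2, 0.1])
orange = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(man, woman))   # close to 1
print(cosine_similarity(man, orange))  # close to 0
```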

Load Data

In this tutorial we use the Amazon Fine Food Reviews dataset, which consists of reviews of fine foods from Amazon.
Download the data from Kaggle, then clean and split it for use in training.

The key columns are 'Text', which contains the text of a review, and 'Score', which ranges from 1 to 5.
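A sketch of the loading and splitting step. The in-memory DataFrame below (with made-up review rows) stands in for `pd.read_csv("Reviews.csv")` on the real Kaggle file; the 80/20 split fraction is also an assumption:

```python
import pandas as pd

# Stand-in for pd.read_csv("Reviews.csv"); same key columns as the
# Kaggle Amazon Fine Food Reviews file. Rows are invented examples.
df = pd.DataFrame({
    "Text":  ["Great taffy at a great price.",
              "Product arrived labeled as jumbo salted peanuts.",
              "This is my favorite cocoa."],
    "Score": [5, 1, 5],
})

# Keep only the columns we need, drop missing rows, and split 80/20.
df = df[["Text", "Score"]].dropna()
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
print(len(train), len(test))
```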

Data Preprocessing using VocabularyProcessor

Before you build a neural network on review strings, you need to tokenize each string, splitting it into an array of tokens. Each token is a word in the sentence; tokens are separated by spaces and punctuation.

Once you have tokenized the sentences, each word is replaced with an integer.

TensorFlow provides a VocabularyProcessor, which takes care of both the tokenization and the integer mapping. You only have to give it the max_document_length argument, which determines the length of the output arrays.

If sentences are shorter than this length, they will be padded; if they are longer, they will be trimmed. The VocabularyProcessor is then fit on the training set to build the vocabulary and map the words to integers.
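In TensorFlow 1.x the real call is roughly `tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length)` followed by `fit_transform` on the training texts. Below is a pure-Python sketch of what that does under the hood (tokenize, map to integers, pad or trim); the tokenization rule and the reserved padding id are assumptions for illustration:

```python
import re

max_document_length = 6  # pad/trim every sentence to this length

def tokenize(text):
    # Split on anything that is not a letter or digit.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def fit_transform(sentences):
    vocab = {"<PAD>": 0}           # id 0 reserved for padding
    encoded = []
    for sentence in sentences:
        ids = []
        for token in tokenize(sentence):
            if token not in vocab:
                vocab[token] = len(vocab)
            ids.append(vocab[token])
        ids = ids[:max_document_length]                # trim long sentences
        ids += [0] * (max_document_length - len(ids))  # pad short sentences
        encoded.append(ids)
    return encoded, vocab

encoded, vocab = fit_transform(["I love this taffy", "Not good"])
print(encoded[0])  # [1, 2, 3, 4, 0, 0]
print(encoded[1])  # [5, 6, 0, 0, 0, 0]
```

Every output array has exactly max_document_length entries, which is what lets the arrays be batched into a single tensor later.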

Load Embedding Vectors into a NumPy Array

We will use the pre-trained vectors from GloVe and use them in an Estimator.

Download the pre-trained vectors and load them into a numpy.array.

If you open the file, you will see a token (word) followed by the weights (50 numbers) on each line. Below is the first line of the embedding ASCII text file showing the embedding for “the”.
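A sketch of the parsing, using a fake two-line string in the GloVe file format instead of the real downloaded file, and only 4 weights per word instead of 50 (the weight values here are abridged for illustration):

```python
import numpy as np

# Two fake lines in GloVe's format: a token followed by its weights.
fake_glove = """the 0.418 0.24968 -0.41242 0.1217
cat 0.45281 -0.50108 -0.53714 -0.015697"""

embeddings = {}
for line in fake_glove.splitlines():
    parts = line.split()
    word = parts[0]
    embeddings[word] = np.array(parts[1:], dtype=np.float32)

print(embeddings["the"].shape)  # (4,)
```

For the real file, replace `fake_glove.splitlines()` with iterating over the opened file; each line parses the same way.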

Next, you need to create a matrix with one embedding for each word in the training dataset, initialized with random float values. Then enumerate all unique words in vocab_processor.vocabulary_._mapping and, for each, copy in the corresponding weight vector from the loaded GloVe embedding.
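A sketch of this step with toy data: 4-dimensional vectors instead of 50, a small dict standing in for the loaded GloVe embeddings, and a made-up word-to-id dict standing in for vocab_processor.vocabulary_._mapping:

```python
import numpy as np

embed_size = 4  # 50 for the real glove.6B.50d.txt vectors

# Pretend these were loaded from the GloVe file (toy values).
glove = {
    "the": np.array([0.418, 0.24968, -0.41242, 0.1217], dtype=np.float32),
    "cat": np.array([0.45281, -0.50108, -0.53714, -0.015697], dtype=np.float32),
}

# Stand-in for vocab_processor.vocabulary_._mapping: word -> integer id.
vocab_mapping = {"<UNK>": 0, "the": 1, "cat": 2, "taffy": 3}

rng = np.random.RandomState(0)
# Random init, so words missing from GloVe still get some vector.
embedding_matrix = rng.uniform(
    -1.0, 1.0, size=(len(vocab_mapping), embed_size)).astype(np.float32)

# Overwrite rows for words that do have a pre-trained GloVe vector.
for word, idx in vocab_mapping.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]

print(embedding_matrix.shape)  # (4, 4)
```

Words absent from GloVe (like "taffy" and "<UNK>" here) keep their random rows, which the network can still learn during training.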

Create Embedding Layer in TensorFlow

Seed the TensorFlow embedding layer with weights from the pre-trained GloVe embedding for the words in your training dataset. We chose the 50-dimensional version, so the embedding size is 50.
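In TensorFlow 1.x this is typically done by creating a variable initialized from the GloVe-seeded matrix and calling tf.nn.embedding_lookup on it. Here is a numpy sketch of what that layer computes for a batch of padded sentences (the shapes and random stand-in weights are assumptions for illustration):

```python
import numpy as np

vocab_size, embed_size = 10, 50
rng = np.random.RandomState(0)
# Stand-in for the GloVe-seeded embedding matrix built above.
pretrained = rng.uniform(
    -1.0, 1.0, size=(vocab_size, embed_size)).astype(np.float32)

# A batch of 2 sentences, each padded to 6 token ids.
batch = np.array([[1, 2, 3, 4, 0, 0],
                  [5, 6, 0, 0, 0, 0]])

# tf.nn.embedding_lookup(W, batch) computes exactly this row indexing:
embedded = pretrained[batch]
print(embedded.shape)  # (2, 6, 50)
```

The resulting (batch, sequence length, embedding size) tensor is what the RNN consumes.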

You can also experiment with fine-tuning: start from the pre-trained embedding and continue learning it on top of your training dataset.