In this post, you will learn how to implement a skip gram model in TensorFlow to generate word vectors and then use TensorBoard to visualize them.

TensorBoard embedding visualization

The accompanying Jupyter notebook can be run locally or on Colaboratory.

One of the key ideas of word embeddings is representing words in a way that lets a model automatically understand analogies such as “man is to woman as king is to queen”. With word embeddings you’ll be able to build NLP applications even with relatively small training sets.
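To see what that analogy arithmetic means concretely, here is a minimal sketch. It assumes a hypothetical dictionary embeddings that maps words to NumPy vectors; it is not the model trained later in this post.

import numpy as np

def most_similar(query, embeddings, exclude=()):
    # Return the word whose vector has the highest cosine similarity to `query`.
    best_word, best_score = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        score = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# "man is to woman as king is to ?" -- ideally this returns 'queen'.
# query = embeddings['king'] - embeddings['man'] + embeddings['woman']
# most_similar(query, embeddings, exclude={'king', 'man', 'woman'})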

Word2Vec


Word2Vec builds on the distributional theory of meaning: a word is characterized by the words that appear around it, so the model learns to predict between every word and its context words. There are two methods for producing word vectors: 1. the Continuous Bag-of-Words model (CBOW) and 2. the Skip-Gram model. In this post we implement the Skip-Gram method.

Skip-Gram

TensorFlow Skip-gram model

The idea of the skip-gram model is that, at each estimation step, you take one word as the center word. In the example above, “for” is the center word, and you try to predict the words in its context out to some window size. The model defines a probability distribution: the probability of a word appearing in the context given the center word.
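As an illustration of how the (center, context) training pairs are generated, here is a small standalone sketch with a made-up sentence and a window size of 2; it is separate from the Estimator pipeline built below.

sentence = "the quick brown fox jumps over the lazy dog".split()
window_size = 2

pairs = []
for i, center in enumerate(sentence):
    # Every word within `window_size` positions of the center word is a context word.
    for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox'), ('brown', 'the')]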

1. Setup


This code requires Python 3 and TensorFlow v1.4+.

import os
import string
import tempfile
import tensorflow as tf
import numpy as np

from tensorflow.python.keras.datasets import imdb
from tensorflow.python.keras.preprocessing import sequence
from tensorflow.contrib.tensorboard.plugins import projector
from tensorboard import summary as summary_lib

print(tf.__version__)
tf.logging.set_verbosity(tf.logging.INFO)

model_dir = '/tmp/log'  # base directory for checkpoints, summaries and embedding metadata

2. Download Dataset


The “IMDB dataset” is a set of around 50,000 positive or negative reviews for movies from the Internet Movie Database. Each review consists of a series of word indexes: index 1 marks the beginning of a review, index 2 stands for unknown (out-of-vocabulary) words, and with the settings below the actual word indexes run from 3 (the most frequent word in the dataset, “the”) to 4999. TensorFlow provides a Keras API for downloading the dataset.

vocab_size = 5000
sentence_size = 200
embedding_size = 50

pad_id = 0
start_id = 1
oov_id = 2
index_offset = 2

(x_train_variable, y_train), (x_test_variable, y_test) = tf.keras.datasets.imdb.load_data(
    num_words=vocab_size, start_char=start_id, oov_char=oov_id,
    index_from=index_offset)

word_index = tf.keras.datasets.imdb.get_word_index()
word_inverted_index = {v + index_offset: k for k, v in word_index.items()}


word_inverted_index[pad_id] = '<PAD>'
word_inverted_index[start_id] = '<START>'
word_inverted_index[oov_id] = '<OOV>'
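As a quick, optional sanity check, you can rebuild the text of a review from its indexes with word_inverted_index:

# Print the first 20 tokens of the first training review as words.
print(' '.join(word_inverted_index[index] for index in x_train_variable[0][:20]))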

3. Pad Sequences


After loading the data into memory, pad each of the sentences with 0 up to a fixed size (200). This gives you two 2-dimensional arrays of shape 25000 x 200, for training and testing respectively.

x_train = sequence.pad_sequences(x_train_variable, 
                                 maxlen=sentence_size,
                                 truncating='post',
                                 padding='post',
                                 value=pad_id)
x_test = sequence.pad_sequences(x_test_variable, 
                                maxlen=sentence_size,
                                truncating='post',
                                padding='post', 
                                value=pad_id)

4. Input Function


You need to convert the data from NumPy arrays into tensors. The tf.data API provides classes that allow you to easily load data, manipulate it, and pipe it into your model.

To construct a Dataset from some tensors in memory, you can use tf.data.Dataset.from_tensor_slices().
You might create one function to import the training set and another function to import the test set.

x_len_train = np.array([min(len(x), sentence_size) for x in x_train_variable])
x_len_test = np.array([min(len(x), sentence_size) for x in x_test_variable])

def parser(x, length, y):
    features = {"x": x, "len": length}
    return features, y

def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

These input functions build input pipelines that yield batches of (features, labels) pairs, where features is a dictionary of features.
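If you want to sanity-check the pipeline, one quick and purely optional way (TF 1.x graph mode) is to pull a single batch and inspect its shapes:

with tf.Graph().as_default():
    features, labels = train_input_fn()
    with tf.Session() as sess:
        batch_features, batch_labels = sess.run([features, labels])
        # Expect (100, 200) for 'x', (100,) for 'len' and (100,) for the labels.
        print(batch_features['x'].shape, batch_features['len'].shape, batch_labels.shape)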

5. Create Feature Columns


When you build an Estimator model, you pass a list of feature columns that describe each of the features you want the model to use.

categorical_column_with_identity is the right choice for this text input. You then need to wrap the categorical column in an embedding_column; the representation seen by the model is the mean of the embeddings for each token. The embedded features can then be plugged into a pre-canned DNNClassifier.

column = tf.feature_column.categorical_column_with_identity('x', vocab_size)
word_embedding_column = tf.feature_column.embedding_column(column, dimension=embedding_size)
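By default, embedding_column averages the embeddings of the tokens in each review; the combiner argument makes this explicit. The line below is an optional, equivalent variant of the one above:

word_embedding_column = tf.feature_column.embedding_column(
    column, dimension=embedding_size, combiner='mean')  # 'mean' is the default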

6. Estimator


You don’t have to worry about creating the computational graph or sessions, since pre-made Estimators handle all the “plumbing” for you. DNNClassifier is a pre-made Estimator class that trains classification models through dense, feed-forward neural networks.

classifier = tf.estimator.DNNClassifier(
    hidden_units=[100],
    feature_columns=[word_embedding_column], 
    model_dir=os.path.join(model_dir, 'embedd'))

7. Train Model


Estimators expect an input_fn to take no arguments.

classifier.train(input_fn=train_input_fn, steps=25000)
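If you ever need an input function that takes parameters (for example, a configurable batch size), a common pattern is to wrap it in a lambda so the Estimator still receives a zero-argument callable. The batch_size parameter below is a hypothetical addition, not part of the pipeline above:

def train_input_fn_with_params(batch_size):
    # Same pipeline as train_input_fn, but with a configurable batch size.
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    return dataset.make_one_shot_iterator().get_next()

# classifier.train(input_fn=lambda: train_input_fn_with_params(batch_size=100), steps=25000)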

Evaluate Model


The following code block evaluates the accuracy of the trained model on the test data:

eval_results = classifier.evaluate(input_fn=eval_input_fn)
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_results))

Running this code yields the following output (or something similar):

Test set accuracy: 0.821

8. Visualizing Embeddings


TensorBoard includes the Embedding Projector, a tool for interactively visualizing embeddings. It can read embeddings from your model and render them in two or three dimensions.

Metadata

You need to attach labels to the data points. You can do this by generating a metadata file containing the labels for each point and clicking “Load data” in the data panel of the Embedding Projector.

with open(os.path.join(model_dir, 'metadata.tsv'), 'w', encoding="utf-8") as f:
    # One label per line; line N of the file labels row N of the embedding variable.
    for index in range(vocab_size):
        f.write(word_inverted_index[index] + '\n')
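If you prefer to wire the metadata up programmatically instead of loading it by hand, the projector module imported in the setup step can write a projector config next to your checkpoints. This is a minimal sketch only: the embedding variable name below is an assumption about the canned DNNClassifier’s internal naming, so verify the actual name in your checkpoint (for example with tf.train.list_variables) before relying on it.

embedding_log_dir = os.path.join(model_dir, 'embedd')

config = projector.ProjectorConfig()
embedding = config.embeddings.add()
# Assumed variable name created by DNNClassifier for the 'x' embedding column;
# check it with tf.train.list_variables(embedding_log_dir).
embedding.tensor_name = 'dnn/input_from_feature_columns/input_layer/x_embedding/embedding_weights'
embedding.metadata_path = os.path.join(model_dir, 'metadata.tsv')

summary_writer = tf.summary.FileWriter(embedding_log_dir)
projector.visualize_embeddings(summary_writer, config)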