In this post, you will learn how to implement a skip-gram model in TensorFlow to generate word vectors and then use TensorBoard to visualize them. The accompanying code is a Jupyter notebook that can run locally or on Colaboratory.
One of the key ideas of word embeddings is a way of representing words such that a model can automatically understand analogies like "Man is to Woman as King is to Queen". With embeddings, you'll be able to build NLP applications even with relatively small training sets.
Word2Vec
Word2Vec builds on a distributional theory of meaning: it learns word vectors by predicting, for every word, the words that appear in its context. There are two methods for producing word vectors: 1. the Continuous Bag-of-Words model (CBOW) and 2. the Skip-Gram model. In this post we implement the Skip-Gram method.
Skip-Gram
The idea of the skip-gram model is that, at each estimation step, you take one word as the center word. In the example above, "for" is the center word, and you try to predict the words in its context out to some window size. The model defines a probability distribution: the probability of a word appearing in the context given the center word.
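To make this concrete, here is a minimal sketch (illustrative only, not part of the model built below) of how (center word, context word) training pairs fall out of a window around each center word:

# Illustration only: enumerate (center, context) pairs for a skip-gram window.
def skip_gram_pairs(tokens, window_size=2):
    pairs = []
    for i, center in enumerate(tokens):
        # look at up to `window_size` positions to the left and right of the center
        for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skip_gram_pairs(['the', 'man', 'loves', 'his', 'son'], window_size=2))

Each pair becomes one training example: given the center word, the model should assign high probability to the paired context word.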
1. Setup
This code requires Python 3 and TensorFlow v1.4+.
import os
import string
import tempfile

import numpy as np
import tensorflow as tf
from tensorflow.python.keras.datasets import imdb
from tensorflow.python.keras.preprocessing import sequence
from tensorflow.contrib.tensorboard.plugins import projector
from tensorboard import summary as summary_lib

print(tf.__version__)
tf.logging.set_verbosity(tf.logging.INFO)

model_dir = '/tmp/log'
2. Download Dataset
The IMDB dataset is a set of around 50,000 positive or negative reviews of movies from the Internet Movie Database. Each review consists of a series of word indexes that, with the offsets used below, run from 3 (the most frequent word in the dataset, "the") to 4999. Index 1 marks the beginning of a sentence and index 2 stands for unknown words. TensorFlow provides a Keras API for downloading the dataset.
vocab_size = 5000
sentence_size = 200
embedding_size = 50
pad_id = 0
start_id = 1
oov_id = 2
index_offset = 2

(x_train_variable, y_train), (x_test_variable, y_test) = tf.keras.datasets.imdb.load_data(
    num_words=vocab_size,
    start_char=start_id,
    oov_char=oov_id,
    index_from=index_offset)

word_index = tf.keras.datasets.imdb.get_word_index()
word_inverted_index = {v + index_offset: k for k, v in word_index.items()}
word_inverted_index[pad_id] = '<PAD>'
word_inverted_index[start_id] = '<START>'
word_inverted_index[oov_id] = '<OOV>'
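As a quick sanity check (illustrative only, not part of the pipeline), you can map a review's indexes back to words with word_inverted_index:

# Illustration only: decode the first few indexes of the first training review.
def index_to_text(indexes):
    return ' '.join(word_inverted_index[i] for i in indexes)

print(index_to_text(x_train_variable[0][:10]))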
3. Pad Sequences
After loading the data into memory, pad each sentence with 0 to a fixed length of 200. This gives you two two-dimensional arrays of shape (25000, 200), for training and testing respectively.
x_train = sequence.pad_sequences(x_train_variable,
                                 maxlen=sentence_size,
                                 truncating='post',
                                 padding='post',
                                 value=pad_id)
x_test = sequence.pad_sequences(x_test_variable,
                                maxlen=sentence_size,
                                truncating='post',
                                padding='post',
                                value=pad_id)
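A quick check of the resulting shapes (illustrative only):

# Illustration only: both arrays should have shape (25000, 200) after padding.
print(x_train.shape)
print(x_test.shape)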
4. Input Function
You need to convert the data from NumPy arrays into Tensors. The tf.data module provides classes that let you easily load data, manipulate it, and pipe it into your model.
To construct a Dataset from in-memory tensors, you can use tf.data.Dataset.from_tensor_slices().
You might create one function to import the training set and another function to import the test set.
x_len_train = np.array([min(len(x), sentence_size) for x in x_train_variable])
x_len_test = np.array([min(len(x), sentence_size) for x in x_test_variable])

def parser(x, length, y):
    features = {"x": x, "len": length}
    return features, y

def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()
These input functions build input pipelines that yield batches of (features, labels) pairs, where features is a dictionary of features.
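If you want to see what the pipeline yields, you can pull a single batch outside the Estimator (illustrative only; the Estimator normally calls the input function for you):

# Illustration only: run the training input function once and inspect one batch.
with tf.Graph().as_default():
    features, labels = train_input_fn()
    with tf.Session() as sess:
        batch_features, batch_labels = sess.run([features, labels])
        print(batch_features['x'].shape,    # (100, 200)
              batch_features['len'].shape,  # (100,)
              batch_labels.shape)           # (100,)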
5. Create Feature Columns
When you build an Estimator model, you pass a list of feature columns that describe each of the features you want the model to use.
categorical_column_with_identity is the right choice for this text input, because the reviews are already encoded as integer indexes. You then need to wrap that feature column in an embedding_column. The representation seen by the model is the mean of the embeddings of all the tokens in a review, and you can plug the embedded features into a pre-canned DNNClassifier.
column = tf.feature_column.categorical_column_with_identity('x', vocab_size)
word_embedding_column = tf.feature_column.embedding_column(
    column, dimension=embedding_size)
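To see what the model actually receives from the embedding column, you can build the input layer on a couple of padded reviews (illustrative only; DNNClassifier does this step internally):

# Illustration only: the embedding column maps each review to the mean of its
# token embeddings, i.e. one 50-dimensional vector per review.
with tf.Graph().as_default():
    features = {'x': tf.constant(x_train[:2], dtype=tf.int64)}
    embedded = tf.feature_column.input_layer(features, [word_embedding_column])
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(embedded).shape)  # (2, 50)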
6. Estimator
You don't have to worry about creating the computational graph or sessions, since pre-made Estimators handle all the "plumbing" for you. DNNClassifier is a pre-made Estimator class that trains classification models through dense, feed-forward neural networks.
classifier = tf.estimator.DNNClassifier(
    hidden_units=[100],
    feature_columns=[word_embedding_column],
    model_dir=os.path.join(model_dir, 'embedd'))
7. Train Model
Estimators expect an input_fn to take no arguments.
classifier.train(input_fn=train_input_fn, steps=25000)
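If your input function did take parameters (the function and parameter below are hypothetical), you could still meet that requirement by wrapping it in a zero-argument callable:

# Illustration only: wrap a hypothetical parameterized input function so the
# Estimator still receives a callable that takes no arguments.
import functools

def train_input_fn_with_batch_size(batch_size):
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable)).batch(batch_size)
    dataset = dataset.map(parser).repeat()
    return dataset.make_one_shot_iterator().get_next()

# Either wrapper works:
# classifier.train(input_fn=lambda: train_input_fn_with_batch_size(100), steps=25000)
# classifier.train(input_fn=functools.partial(train_input_fn_with_batch_size, 100), steps=25000)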
Evaluate Model
The following code block evaluates the accuracy of the trained model on the test data.
eval_results = classifier.evaluate(input_fn=eval_input_fn)
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_results))
Running this code yields the following output (or something similar):
Test set accuracy: 0.821
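Beyond aggregate accuracy, the trained Estimator can also give you per-review predictions (illustrative only):

# Illustration only: print the predicted class and probabilities for the
# first few test reviews.
predictions = classifier.predict(input_fn=eval_input_fn)
for i, prediction in enumerate(predictions):
    print(prediction['class_ids'], prediction['probabilities'])
    if i >= 4:
        break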
8. Visualizing Embeddings
TensorBoard includes the Embedding Projector, a tool for interactively visualizing embeddings. It can read embeddings from your model and render them in two or three dimensions.
Metadata
You need to attach labels to the data points. You can do this by generating a metadata file containing the labels for each point and clicking “Load data” in the data panel of the Embedding Projector.
with open(os.path.join(model_dir, 'metadata.tsv'), 'w', encoding="utf-8") as f:
    for index in range(0, vocab_size):
        f.write(word_inverted_index[index] + '\n')
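If you prefer to point the Projector at the learned embedding variable automatically instead of loading the metadata by hand, you can write a projector config next to the checkpoints. This is only a sketch: the tensor name below is an assumption, so look up the real one with classifier.get_variable_names() after training.

# Sketch only: associate metadata.tsv with the embedding variable so the
# Projector tab can label each point. The tensor name is an assumption;
# inspect classifier.get_variable_names() to find the actual name.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = 'dnn/input_from_feature_columns/input_layer/x_embedding/embedding_weights'  # assumed
embedding.metadata_path = os.path.join(model_dir, 'metadata.tsv')
writer = tf.summary.FileWriter(os.path.join(model_dir, 'embedd'))
projector.visualize_embeddings(writer, config)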