In this tutorial, we will build a basic seq2seq model in TensorFlow for a chatbot application. This tutorial gives you a basic understanding of seq2seq models, shows how to build a competitive seq2seq model from scratch, and walks through preparing the input pipeline using the TensorFlow Dataset API. Seq2seq models have had great success in tasks such as machine translation, speech recognition, and text summarization.

Domain-based assistants can answer customer questions and act as the first line of contact between a company and a customer. They can also serve as a Q&A tool for employees looking for answers to questions in a particular domain.

Dataset


In this tutorial, we will be using conversations from Reddit comments to build a simple chatbot. The dataset can be found on Kaggle. It contains about 54 million comments, totaling roughly 30 GB, made on reddit.com during May 2015. The dataset comes in the form of an SQLite database with a single table, May2015. You can also download the preprocessed data directly from here.

Create Data Input Pipeline


In this tutorial we will use the TensorFlow Dataset API to feed data into the model. We will also use
lookup_ops to construct a lookup table that converts tensors of strings into int64 IDs. The mapping can be initialized from a vocabulary file specified in vocabulary_file, where each whole line is a key and its zero-based line number is the ID.

The tf.data.TextLineDataset provides an easy way to extract lines from text files. Given filenames, a TextLineDataset will produce one string-valued element per line of those files.

All datasets can be treated similarly via input processing. This includes reading and cleaning the data, filtering, and batching.

First, convert each sentence into a vector of word strings.

You can perform a vocabulary lookup on each sentence. Given a lookup table object table, this map converts the first tuple elements from a vector of strings to a vector of integers.

Batching of variable-length sentences is straightforward. The following transformation batches batch_size elements from source_target_dataset, and respectively pads the source and target vectors to the length of the longest source and target vector in each batch.

Once the iterator is initialized, every session.run call that accesses the source or target tensors will request the next minibatch from the underlying dataset.

Create Embedding


Embeddings are important as inputs to machine learning models. An embedding is a mapping from discrete objects, such as words, to vectors of real numbers. Neural networks work on vectors of real numbers, and embedding functions are the standard and effective way to transform such discrete input objects into useful continuous vectors.

Embeddings are also valuable as outputs of machine learning. Because embeddings map objects to vectors, applications can use similarity in vector space as a robust and flexible measure of object similarity.

Create Encoder


Once you have created the word embeddings, you can feed them as input into the main network, which consists of two RNNs: an encoder for the source (question) and a decoder for the target (answer). The two RNNs can in principle share weights, though in practice separate parameters are usually used. The encoder RNN uses zero vectors as its starting state and is built as follows:

We build only a single-layer GRU, encoder_cell. Every sentence has a different length, so to avoid wasting computation on padding we use dynamic_rnn and pass it the exact source sentence lengths.

Create Decoder


The decoder needs to have access to the source information. One simple way to achieve that is to initialize it with the last hidden state of the encoder, encoder_state.

Projection Layer
The projection layer is a dense matrix that turns the top hidden states into logit vectors of dimension vocab_size.

Loss Function


Now that we have the logits, we are ready to compute our training loss:

target_weights is a zero-one matrix of the same size as decoder_outputs. It masks padding positions outside the target sequence lengths with the value 0.
Gradient Clipping
One of the important steps in training RNNs is gradient clipping. Here, we clip by the global norm. The max value, max_gradient_norm, is often set to a value like 5 or 1.

Optimizer
The last step is selecting the optimizer.

Train Model


Instead of using feed_dicts to feed data at each session.run call, we use stateful iterator objects. These iterators make the input pipeline much easier in both the single-machine and distributed setting.

Conclusion

To produce the best results, I encourage you to train the model with bidirectional RNNs and an attention mechanism. There are many ways this model can be altered and improved. One cool thing you could do is try a few different methods and compare the results.
