In this tutorial, we will build a basic seq2seq model in TensorFlow for a chatbot application. The tutorial gives you a basic understanding of seq2seq models, shows how to build a competitive seq2seq model from scratch, and covers the bit of work needed to prepare an input pipeline with the TensorFlow Dataset API. Seq2seq models have had great success in tasks such as machine translation, speech recognition, and text summarization.

Domain-based assistants can answer customers' questions and act as the first line of contact between a company and a customer. They can also serve as a Q&A tool for employees looking for answers to questions in a particular domain.


In this tutorial, we will use conversations from Reddit comments to build a simple chatbot. The dataset can be found on Kaggle. It contains about 54 million comments made during May 2015, which add up to roughly 30GB of data. The dataset comes in the form of an SQLite database with one table, May 2015. You can download the processed data directly from here.

Create Data Input Pipeline

In this tutorial, we will use the TensorFlow Dataset API to feed data into the model. We will also use lookup_ops to construct a lookup table that converts a tensor of strings into int64 IDs. The mapping is initialized from a vocabulary file specified in vocabulary_file, where each whole line is a key and its zero-based line number is the ID.

import tensorflow as tf
from tensorflow.python.ops import lookup_ops

UNK = "<unk>"
SOS = "<s>"
EOS = "</s>"
UNK_ID = 0

vocab_table = lookup_ops.index_table_from_file(
    src_vocab_file, default_value=UNK_ID)
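
To make the lookup behavior concrete, here is a minimal plain-Python sketch of what such a table does. It is an illustration only, not the TensorFlow implementation; the helper names build_vocab_table and lookup and the sample vocabulary are hypothetical.

```python
# Plain-Python sketch of a vocabulary lookup table (illustrative only).
# Each line of the vocabulary is a key; its zero-based line number is the ID.

UNK_ID = 0


def build_vocab_table(vocab_lines):
    """Map each vocabulary line to its zero-based line number."""
    return {token: idx for idx, token in enumerate(vocab_lines)}


def lookup(table, tokens, default_value=UNK_ID):
    """Convert tokens to IDs; unknown tokens get default_value."""
    return [table.get(token, default_value) for token in tokens]


# Hypothetical vocabulary file contents, one token per line.
vocab_lines = ["<unk>", "<s>", "</s>", "hello", "world"]
vocab_table = build_vocab_table(vocab_lines)

ids = lookup(vocab_table, ["hello", "there", "world"])
print(ids)  # [3, 0, 4]
```

The word "there" is not in the vocabulary, so it falls back to UNK_ID, which is why the unknown token sits on the first line of the file.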

The tf.data.TextLineDataset provides an easy way to extract lines from text files. Given filenames, a TextLineDataset will produce one string-valued element per line of those files.

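# src_file and tgt_file are placeholder names for the paths to the
# source (question) and target (answer) text files.
src_dataset = tf.data.TextLineDataset(src_file)
tgt_dataset = tf.data.TextLineDataset(tgt_file)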
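
The "one element per line" behavior can be sketched in plain Python with a generator. This is only an analogy for how a line-based dataset reads files; the function text_lines and the sample file are hypothetical.

```python
import os
import tempfile


def text_lines(filenames):
    """Yield one string per line across the given files, newline stripped."""
    for name in filenames:
        with open(name, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


# Write a tiny hypothetical source file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "src_sample.txt")
with open(src_path, "w", encoding="utf-8") as handle:
    handle.write("how are you\nwhat is your name\n")

lines = list(text_lines([src_path]))
print(lines)  # ['how are you', 'what is your name']
```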

All datasets can be treated similarly via input processing. This includes reading and cleaning the data, filtering, and batching.

import collections


class BatchedInput(
        collections.namedtuple("BatchedInput",
                               ("initializer", "source", "target_input",
                                "target_output", "source_sequence_length",
                                "target_sequence_length"))):
    pass

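def get_iterator(src_dataset, tgt_dataset, src_vocab_table, tgt_vocab_table,
                 batch_size, sos, eos, random_seed,
                 num_shards=1, shard_index=0):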

    output_buffer_size = batch_size * 1000

    src_eos_id = tf.cast(src_vocab_table.lookup(tf.constant(eos)), tf.int32)

    tgt_sos_id = tf.cast(tgt_vocab_table.lookup(tf.constant(sos)), tf.int32)

    tgt_eos_id = tf.cast(tgt_vocab_table.lookup(tf.constant(eos)), tf.int32)

    src_tgt_dataset = tf.data.Dataset.zip((src_dataset, tgt_dataset))

    src_tgt_dataset = src_tgt_dataset.shard(num_shards, shard_index)

    src_tgt_dataset = src_tgt_dataset.shuffle(output_buffer_size, random_seed)

Convert each sentence into vectors of word strings.

    src_tgt_dataset = src_tgt_dataset.map(
        lambda src, tgt: (
            tf.string_split([src]).values, tf.string_split([tgt]).values))

Next, perform a vocabulary lookup on each sentence. Given the lookup table objects, this map converts both tuple elements from vectors of strings to vectors of integer IDs.

    src_tgt_dataset = src_tgt_dataset.map(
        lambda src, tgt: (tf.cast(src_vocab_table.lookup(src), tf.int32),
                          tf.cast(tgt_vocab_table.lookup(tgt), tf.int32)))

    src_tgt_dataset = src_tgt_dataset.prefetch(output_buffer_size)

    # Create a tgt_input prefixed with <sos> and a tgt_output suffixed with <eos>.
    src_tgt_dataset = src_tgt_dataset.map(
        lambda src, tgt: (src,
                          tf.concat(([tgt_sos_id], tgt), 0),
                          tf.concat((tgt, [tgt_eos_id]), 0)))

    # Add the source and target sequence lengths.
    src_tgt_dataset = src_tgt_dataset.map(
        lambda src, tgt_in, tgt_out: (
            src, tgt_in, tgt_out,
            tf.size(src), tf.size(tgt_in)))

    src_tgt_dataset = src_tgt_dataset.prefetch(output_buffer_size)

Batching of variable-length sentences is straightforward. The following transformation batches batch_size elements from src_tgt_dataset and pads the source and target vectors to the length of the longest source and target vector, respectively, in each batch.

    def batching_func(x):
        return x.padded_batch(
            batch_size,
            # The first three entries are the source and target line rows;
            # these have unknown-length vectors.  The last two entries are
            # the source and target row sizes; these are scalars.
            padded_shapes=(
                tf.TensorShape([None]),  # src
                tf.TensorShape([None]),  # tgt_input
                tf.TensorShape([None]),  # tgt_output
                tf.TensorShape([]),  # src_len
                tf.TensorShape([])),  # tgt_len
            # Pad the source and target sequences with eos tokens.
            # (Though notice we don't generally need to do this since
            # later on we will be masking out calculations past the true
            # sequence.)
            padding_values=(
                src_eos_id,  # src
                tgt_eos_id,  # tgt_input
                tgt_eos_id,  # tgt_output
                0,  # src_len -- unused
                0))  # tgt_len -- unused

    batched_dataset = batching_func(src_tgt_dataset)

    batched_iter = batched_dataset.make_initializable_iterator()

    (src_ids, tgt_input_ids, tgt_output_ids, src_seq_len,
     tgt_seq_len) = (batched_iter.get_next())

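    return BatchedInput(
        initializer=batched_iter.initializer,
        source=src_ids,
        target_input=tgt_input_ids,
        target_output=tgt_output_ids,
        source_sequence_length=src_seq_len,
        target_sequence_length=tgt_seq_len)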

Once the iterator is initialized, every call that accesses the source or target tensors will request the next minibatch from the underlying dataset.
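
The transformations above can be traced on a toy batch in plain Python. This is a rough sketch of the data flow only, not the Dataset API; the helper names make_example, pad, and padded_batch and the token IDs are hypothetical.

```python
def make_example(src_ids, tgt_ids, sos_id, eos_id):
    """Build (src, tgt_input, tgt_output, src_len, tgt_len) for one pair."""
    tgt_input = [sos_id] + tgt_ids      # decoder input starts with <s>
    tgt_output = tgt_ids + [eos_id]     # decoder target ends with </s>
    return src_ids, tgt_input, tgt_output, len(src_ids), len(tgt_input)


def pad(seq, length, pad_id):
    """Right-pad seq with pad_id up to length."""
    return seq + [pad_id] * (length - len(seq))


def padded_batch(examples, eos_id):
    """Pad every sequence to the longest one of its kind in the batch."""
    src_max = max(ex[3] for ex in examples)
    tgt_max = max(ex[4] for ex in examples)
    return {
        "source": [pad(ex[0], src_max, eos_id) for ex in examples],
        "target_input": [pad(ex[1], tgt_max, eos_id) for ex in examples],
        "target_output": [pad(ex[2], tgt_max, eos_id) for ex in examples],
        "source_sequence_length": [ex[3] for ex in examples],
        "target_sequence_length": [ex[4] for ex in examples],
    }


SOS_ID, EOS_ID = 1, 2  # hypothetical IDs for <s> and </s>
examples = [
    make_example([5, 6, 7], [8, 9], SOS_ID, EOS_ID),
    make_example([5], [8, 9, 10, 11], SOS_ID, EOS_ID),
]
batch = padded_batch(examples, EOS_ID)
print(batch["source"])         # [[5, 6, 7], [5, 2, 2]]
print(batch["target_input"])   # [[1, 8, 9, 2, 2], [1, 8, 9, 10, 11]]
print(batch["target_output"])  # [[8, 9, 2, 2, 2], [8, 9, 10, 11, 2]]
```

Note that the recorded lengths stay at their true values ([3, 1] for the sources), which is what later allows the padded positions to be masked out.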

Create Embedding

Embeddings are important for input to machine learning. An embedding is a mapping from discrete objects, such as words, to vectors of real numbers. Neural networks work on vectors of real numbers. Embedding functions are the standard and effective way to transform such discrete input objects into useful continuous vectors.

with tf.variable_scope("encoder"):
    embedding_encoder = tf.get_variable(
        "embedding_encoder", [src_vocab_size, num_units])

with tf.variable_scope("decoder"):
    embedding_decoder = tf.get_variable(
        "embedding_decoder", [tgt_vocab_size, num_units])

Embeddings are also valuable as outputs of machine learning. Because embeddings map objects to vectors, applications can use similarity in vector space as a robust and flexible measure of object similarity.
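
As a small illustration of both ideas, the plain-Python sketch below looks up rows of a randomly initialized embedding matrix and compares two vectors with cosine similarity. The sizes, the random initialization, and the helper names are assumptions for illustration; a trained model would learn the matrix values.

```python
import math
import random

random.seed(0)
vocab_size, num_units = 5, 4  # hypothetical sizes

# One row of num_units real numbers per vocabulary entry.
embedding = [[random.uniform(-1.0, 1.0) for _ in range(num_units)]
             for _ in range(vocab_size)]


def embedding_lookup(table, ids):
    """Return the embedding row for each ID."""
    return [table[i] for i in ids]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vectors = embedding_lookup(embedding, [3, 0, 4])
print(len(vectors), len(vectors[0]))  # 3 4
print(cosine_similarity(vectors[0], vectors[0]))  # close to 1.0
```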

Create Encoder

Once the word embeddings are created, they are fed as input into the main network, which consists of two RNNs: an encoder for the source (question) and a decoder for the target (answer). In principle the two RNNs can share the same weights; here they use separate parameters, since each cell is created in its own variable scope. The encoder RNN uses zero vectors as its starting state and is built as follows:

with tf.variable_scope("encoder"):
    encoder_emb_inputs = tf.nn.embedding_lookup(embedding_encoder, input_sequence)

    encoder_cell = tf.nn.rnn_cell.GRUCell(num_units)

    encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
        encoder_cell, encoder_emb_inputs,
        time_major=time_major, dtype=tf.float32)

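We build only a single-layer GRU, encoder_cell. Sentences have different lengths, so we use dynamic_rnn, which unrolls the network dynamically at run time. Passing the true sentence lengths through its sequence_length argument additionally avoids wasting computation on padding.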
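
The idea of skipping padded time steps can be sketched in plain Python. The toy scalar cell below is a stand-in for a GRU cell, not an implementation of one, and rnn_step, run_rnn, and the weights are hypothetical. The point is only that a state stops changing once the true sequence length is reached.

```python
import math


def rnn_step(state, x, w_state=0.5, w_input=1.0):
    """A toy scalar recurrent cell (a stand-in for a GRU cell)."""
    return math.tanh(w_state * state + w_input * x)


def run_rnn(batch, lengths):
    """Run the cell over a padded batch, freezing each state past its length."""
    max_time = len(batch[0])
    states = [0.0] * len(batch)  # zero starting state
    for t in range(max_time):
        for b, sequence in enumerate(batch):
            if t < lengths[b]:          # only update inside the real sequence
                states[b] = rnn_step(states[b], sequence[t])
    return states


# Two sequences padded to length 3; the second has true length 1.
padded = [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0]]
final_states = run_rnn(padded, lengths=[3, 1])

# The padded steps of the second sequence do not change its final state.
unpadded_state = run_rnn([[0.4]], lengths=[1])[0]
print(final_states[1] == unpadded_state)  # True
```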

Create Decoder

The decoder needs to have access to the source information. One simple way to achieve that is to initialize it with the last hidden state of the encoder, encoder_state.

target_input = iterator.target_input
if time_major:
    target_input = tf.transpose(target_input)
with tf.variable_scope("decoder"):
    decoder_emb_inputs = tf.nn.embedding_lookup(embedding_decoder, target_input)

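    # Feed the ground-truth target tokens to the decoder during training.
    helper = tf.contrib.seq2seq.TrainingHelper(
        decoder_emb_inputs, iterator.target_sequence_length,
        time_major=time_major)

    decoder_cell = tf.nn.rnn_cell.GRUCell(num_units)

    # Initialize the decoder with the encoder's final state.
    my_decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state)

    outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
        my_decoder, output_time_major=time_major)

    sample_id = outputs.sample_id
    decoder_cell_outputs = outputs.rnn_output
    # output_layer is the projection layer defined in the next section.
    logits = output_layer(outputs.rnn_output)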

Projection Layer

The projection layer is a dense matrix that turns the top hidden states into logit vectors whose dimension is the vocabulary size.

output_layer = tf.layers.Dense(tgt_vocab_size, use_bias=False, name="output_projection")
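
For intuition, a bias-free dense projection is just a matrix product. The plain-Python sketch below applies a hypothetical 2 x 3 weight matrix (num_units = 2, tgt_vocab_size = 3) to one hidden state; the function name project and all values are made up for illustration.

```python
def project(hidden, weights):
    """Multiply a hidden state (num_units) by a num_units x vocab_size matrix."""
    vocab_size = len(weights[0])
    return [sum(h * row[j] for h, row in zip(hidden, weights))
            for j in range(vocab_size)]


# Hypothetical sizes: num_units = 2, tgt_vocab_size = 3.
weights = [[1.0, 0.0, -1.0],
           [0.5, 2.0, 0.0]]
hidden_state = [2.0, 1.0]

logits = project(hidden_state, weights)
print(logits)  # [2.5, 2.0, -2.0]
```

Each output entry is an unnormalized score for one vocabulary token; a softmax over these scores gives the predicted distribution.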

Loss Function

Now that we have the logits, we are ready to compute our training loss:

if time_major:
    target_output = tf.transpose(target_output)

time_axis = 0 if time_major else 1
max_time = target_output.shape[time_axis].value or tf.shape(target_output)[time_axis]

crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target_output, logits=logits)

target_weights = tf.sequence_mask(
    iterator.target_sequence_length, max_time, dtype=logits.dtype)
if time_major:
    target_weights = tf.transpose(target_weights)

loss = tf.reduce_sum(
    crossent * target_weights) / tf.to_float(hparams.batch_size)

target_weights is a zero-one matrix of the same size as decoder_outputs. It masks padding positions outside of the target sequence lengths with values 0.
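
The masked loss can be worked through by hand in plain Python. The sketch below mirrors the computation above (cross-entropy per position, multiplied by a zero-one mask, summed, and divided by the batch size) on a made-up batch; the function names and all numbers are hypothetical.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Cross-entropy of one softmax distribution against an integer label."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


def sequence_mask(lengths, max_time):
    """1.0 for positions inside each sequence, 0.0 for padding."""
    return [[1.0 if t < n else 0.0 for t in range(max_time)] for n in lengths]


def masked_loss(batch_logits, batch_labels, lengths):
    """Sum cross-entropy over unpadded positions, divided by batch size."""
    max_time = len(batch_labels[0])
    weights = sequence_mask(lengths, max_time)
    total = 0.0
    for logits_seq, labels_seq, weight_seq in zip(
            batch_logits, batch_labels, weights):
        for logits, label, weight in zip(logits_seq, labels_seq, weight_seq):
            total += weight * sparse_softmax_cross_entropy(logits, label)
    return total / len(batch_labels)


# Hypothetical batch: 2 sequences, max_time 2, vocabulary of 3 tokens.
batch_logits = [[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
                [[0.0, 0.0, 2.0], [9.0, 9.0, 9.0]]]  # last step is padding
batch_labels = [[0, 1], [2, 2]]
lengths = [2, 1]

print(sequence_mask(lengths, 2))  # [[1.0, 1.0], [1.0, 0.0]]
loss = masked_loss(batch_logits, batch_labels, lengths)
print(loss)
```

The padded position of the second sequence has weight 0, so its logits contribute nothing to the loss regardless of their values.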

Gradient Clipping

One of the important steps in training RNNs is gradient clipping. Here, we clip by the global norm. The max value, max_gradient_norm, is often set to a value like 5 or 1.

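params = tf.trainable_variables()
gradients = tf.gradients(loss, params)

# Number of target tokens predicted in this batch.
predict_count = tf.reduce_sum(iterator.target_sequence_length)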

clipped_gradients, gradient_norm = tf.clip_by_global_norm(
    gradients, hparams.max_gradient_norm)

gradient_norm_summary = [tf.summary.scalar("grad_norm", gradient_norm)]

gradient_norm_summary.append(
    tf.summary.scalar("clipped_gradient", tf.global_norm(clipped_gradients)))

global_step = tf.Variable(0, name='global_step', trainable=False)
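
Clipping by global norm rescales all gradients together so that their combined norm does not exceed the chosen maximum. A minimal plain-Python sketch of that rule follows, using made-up gradient values; the function names mirror the TensorFlow ones, but this is an illustration and not the library code.

```python
import math


def global_norm(gradients):
    """Square root of the sum of squares of all gradient entries."""
    return math.sqrt(sum(g * g for grad in gradients for g in grad))


def clip_by_global_norm(gradients, clip_norm):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    norm = global_norm(gradients)
    scale = clip_norm / max(norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, norm


# Two hypothetical gradient vectors with global norm sqrt(9 + 16 + 144) = 13.
gradients = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(gradients, clip_norm=5.0)
print(norm)                  # 13.0
print(global_norm(clipped))  # approximately 5.0
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved; gradients whose global norm is already below the maximum are left unchanged.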

The last step is selecting the optimizer.

learning_rate = tf.constant(hparams.learning_rate)

opt = tf.train.AdamOptimizer(learning_rate)

update = opt.apply_gradients(
    zip(clipped_gradients, params), global_step=global_step)

train_summary = tf.summary.merge(
    [tf.summary.scalar("lr", learning_rate),
     tf.summary.scalar("train_loss", loss)] +
    gradient_norm_summary)

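# TrainOutputTuple is not defined in this tutorial; it is assumed to be a
# namedtuple with at least the fields used below.
output_tuple = TrainOutputTuple(train_loss=loss,
                                train_summary=train_summary,
                                global_step=global_step)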

Train Model

Instead of using feed_dicts to feed data at each call, we use stateful iterator objects. These iterators make the input pipeline much easier in both the single-machine and distributed setting.

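train_sess = tf.Session()
# Initialize variables, the vocabulary lookup tables, and the input iterator.
train_sess.run(tf.global_variables_initializer())
train_sess.run(tf.tables_initializer())
train_sess.run(iterator.initializer)

steps_per_stats = 10
last_stats_step = 0
for i in range(100):
    step_result = train_sess.run([update, output_tuple])
    if i - last_stats_step >= steps_per_stats:
        last_stats_step = i
        # Report the training loss every steps_per_stats steps.
        print("step %d, train loss %.4f" % (i, step_result[1].train_loss))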


To produce the best results, I encourage you to train the model with bidirectional RNNs and an attention mechanism. There are many ways this model can be altered and improved upon. One cool thing you could do is try a few different methods and compare the results.