In this tutorial, we are going to build a machine translation seq2seq (encoder-decoder) model in TensorFlow. The objective of this seq2seq model is to translate English sentences into German sentences.

After training the model, you will be able to input an English sentence, such as “I am a student”, and get back the German translation: “Ich bin ein Student”.

Prepare Translation DataSet

In this tutorial, we will use an English-to-German dataset from the website. Download the dataset and decompress it. You will have a file deu.txt that contains pairs of English and German phrases, one pair per line, with a tab separating the two languages.
After downloading the dataset, here are the steps we’ll take to prepare the data:

  • First, we must load the data in a way that preserves the Unicode German characters.
  • We must split the loaded text by line and then by phrase.
  • Clean the sentences by removing special characters.
  • Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
  • After cleaning, remove every sentence pair that has fewer than 2 words or more than 30 words.
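The steps above can be sketched roughly as follows. This is only an illustration: the helper names clean_sentence and load_pairs, the cleaning regex and the sample lines are assumptions made for this example, not the code from the GitHub repo.

```python
import re
import unicodedata


def clean_sentence(text):
    """Lowercase, normalize Unicode and replace punctuation with spaces."""
    text = unicodedata.normalize("NFC", text.strip().lower())
    # \w is Unicode-aware in Python 3, so German letters such as ä, ö, ü, ß survive.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def load_pairs(lines, min_words=2, max_words=30):
    """Split each line on the tab, clean both sides, drop pairs outside the length limits."""
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip malformed lines
        english, german = clean_sentence(parts[0]), clean_sentence(parts[1])
        n_en, n_de = len(english.split()), len(german.split())
        if min_words <= n_en <= max_words and min_words <= n_de <= max_words:
            pairs.append((english, german))
    return pairs


# Hypothetical sample lines in the "English<TAB>German" layout of deu.txt.
sample = [
    "I am a student.\tIch bin ein Student.\n",
    "Hi.\tHallo!\n",
    "Good night!\tSchöne Nacht!\n",
]
pairs = load_pairs(sample)
```

With the real file, the same function can be given an open file handle, for example open('deu.txt', encoding='utf-8'), so that the German characters are decoded correctly.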

Any word that is used at least once in the English or German text is added to the vocabulary. The new English and German texts are then created using only words from that vocabulary; any word not found in the vocabulary is replaced by <UNK> in both files. This set acts as the training set. Each word in the vocabulary is also mapped to an integer id.
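A minimal sketch of such a vocabulary is shown below. The function names, the choice of special tokens and their order (<GO> = 0, <EOS> = 1, <UNK> = 2) are assumptions made for illustration; the repo may assign the ids differently.

```python
from collections import Counter

UNK = "<UNK>"


def build_vocab(sentences, min_count=1, specials=("<GO>", "<EOS>", UNK)):
    """Return (word -> id, id -> word) dictionaries; special tokens get the lowest ids."""
    counts = Counter(word for s in sentences for word in s.split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    word2id = {w: i for i, w in enumerate(list(specials) + words)}
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word


def encode(sentence, word2id):
    """Map a sentence to ids, replacing out-of-vocabulary words with <UNK>."""
    unk_id = word2id[UNK]
    return [word2id.get(w, unk_id) for w in sentence.split()]


word2id, id2word = build_vocab(["i am a student", "i am happy"])
ids = encode("i am a teacher", word2id)  # "teacher" is not in the vocabulary
```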

The dataset is then sorted by the number of words in each English sentence, which reduces the amount of padding needed in each training batch. Please visit the GitHub repo for a more detailed implementation and code.
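To see why sorting helps, the small sketch below counts how many padding tokens are needed when batches are cut from unsorted versus sorted sentence lengths. The lengths and the batch size are made-up numbers for illustration.

```python
def padding_needed(lengths, batch_size):
    """Total number of padding tokens when each batch is padded to its longest sentence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


lengths = [3, 12, 4, 11, 2, 10]  # hypothetical sentence lengths in words
unsorted_padding = padding_needed(lengths, batch_size=2)
sorted_padding = padding_needed(sorted(lengths), batch_size=2)
```

In this toy case the sorted order needs a third of the padding, because sentences of similar length end up in the same batch.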

Data Input Pipeline

The Estimator’s input_fn function creates and returns the TF placeholders that feed the model.

def input_fn():
    source = tf.placeholder(tf.int64, shape=[None, None], name='input')
    target = tf.placeholder(tf.int64, shape=[None, None], name='output')
    tf.identity(source[0], 'input_0')
    tf.identity(target[0], 'output_0')
    return {
        'input': source,
        'output': target,
    }, None

The source placeholder is fed with English sentence data, and its shape is [None, None]. The first None is the batch size, which is unknown because the user can set it. The second None is the sentence length: the maximum sentence length differs from batch to batch, so it cannot be fixed in advance. The target placeholder is similar to the source placeholder, except that it is fed with German sentence data.

Estimator Feed Data Function

def sampler():
    while True:
        with open(input_filename) as finput:
            with open(output_filename) as foutput:
                for in_line in finput:
                    out_line = foutput.readline()
                    yield {
                        'input': input_process(in_line, vocab)[:input_max_length - 1] + [END_TOKEN],
                        'output': output_process(out_line, vocab)[:output_max_length - 1] + [END_TOKEN]
                    }

sample_me = sampler()

def feed_fn():
    inputs, outputs = [], []
    input_length, output_length = 0, 0
    for i in range(batch_size):
        rec = next(sample_me)
        # Collect the record before measuring it; without this the lists stay empty.
        inputs.append(rec['input'])
        outputs.append(rec['output'])
        input_length = max(input_length, len(inputs[-1]))
        output_length = max(output_length, len(outputs[-1]))
    # Pad me right with </S> token.
    for i in range(batch_size):
        inputs[i] += [END_TOKEN] * (input_length - len(inputs[i]))
        outputs[i] += [END_TOKEN] * (output_length - len(outputs[i]))
    # The keys are the tensor names of the placeholders created in input_fn.
    return {
        'input:0': inputs,
        'output:0': outputs
    }

Every sentence in a batch is padded to the length of the longest sentence in that batch. Padding requires a special token such as <pad>; the feed_fn above reuses END_TOKEN (</S>) for this purpose.
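The padding step, and the way sentence lengths can later be recovered from a padded batch, can be sketched in NumPy. The value END_TOKEN = 1 is an assumption here, inferred from the tf.not_equal(inp, 1) call in the model code; pad_batch is a hypothetical helper, not part of the repo.

```python
import numpy as np

END_TOKEN = 1  # assumed id of the </S> token


def pad_batch(sequences, pad_id=END_TOKEN):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in sequences)
    return np.array([s + [pad_id] * (max_len - len(s)) for s in sequences],
                    dtype=np.int64)


batch = pad_batch([[5, 6, 7, END_TOKEN], [8, END_TOKEN]])
# Counting the tokens that differ from END_TOKEN gives the number of real words per sentence.
lengths = (batch != END_TOKEN).sum(axis=1)
```

Because the padding token and the end token share one id, the length count excludes the end token itself.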

Build NMT Seq2Seq Model

An encoder converts a source sentence into a “meaning” vector, which is passed through a decoder to produce a translation. You have two recurrent neural networks that you chain back to back: one is called the encoder and the other the decoder. You feed an English sentence to the encoder, then feed the encoder’s final state into the decoder, and the decoder generates a German sentence.


Let’s first embed our words using embedding lookups. Then we need a GRU cell for our encoder; to show that these cells can be wrapped to implement regularization techniques, each cell is wrapped in a dropout wrapper. Finally, a bidirectional dynamic RNN unrolls the encoder cells.

def seq2seq_model(features, labels, mode, params):
    vocab_size = params['vocab_size']
    embed_dim = params['embed_dim']
    num_units = params['num_units']
    output_max_length = params['output_max_length']
    dropout = params['dropout']
    beam_width = params['beam_width']

    inp = features['input']
    batch_size = tf.shape(inp)[0]
    start_tokens = tf.zeros([batch_size], dtype=tf.int64)
    input_lengths = tf.reduce_sum(tf.to_int32(tf.not_equal(inp, 1)), 1)

    input_embed = layers.embed_sequence(
        inp, vocab_size=vocab_size, embed_dim=embed_dim, scope='embed')

    with tf.variable_scope('embed', reuse=True):
        embeddings = tf.get_variable('embeddings')

    fw_cell = tf.contrib.rnn.GRUCell(num_units=num_units)
    bw_cell = tf.contrib.rnn.GRUCell(num_units=num_units)

    if dropout > 0.0:
        print("  %s, dropout=%g " % (type(fw_cell).__name__, dropout))
        fw_cell = tf.contrib.rnn.DropoutWrapper(
            cell=fw_cell, input_keep_prob=(1.0 - dropout))
        bw_cell = tf.contrib.rnn.DropoutWrapper(
            cell=bw_cell, input_keep_prob=(1.0 - dropout))

    bd_encoder_outputs, bd_encoder_final_state = \
        tf.nn.bidirectional_dynamic_rnn(cell_fw=fw_cell, cell_bw=bw_cell,
                                        inputs=input_embed, dtype=tf.float32)

    encoder_outputs = tf.concat(bd_encoder_outputs, -1)
    encoder_final_state = tf.concat(bd_encoder_final_state, -1)

Bidirectionality on the encoder side generally gives better performance. Here, we give a simplified example of how to build an encoder with a single bidirectional layer. encoder_outputs is the set of all source hidden states at the top layer. Because the RNN is batch-major by default and the forward and backward outputs are concatenated, it has the shape [batch_size, max_len, 2 * num_units].


The decoder is again a GRU cell. We use beam search to produce the most probable sequence of words from the unrolled decoder, instead of just picking the most probable word at each step. The seq2seq API also has a dynamic_decode function: I feed it my decoder, and it unrolls the sequence and builds the decoding graph.

def setting_decoder(helper, scope, num_units, encoder_outputs, encoder_final_state, input_lengths,
                    vocab_size, batch_size, output_max_length, embeddings, start_tokens, END_TOKEN, beam_width,
                    reuse):
    # The bidirectional encoder concatenates two states, so the decoder needs twice the units.
    num_units = num_units * 2

    with tf.variable_scope(scope, reuse=reuse):

        if beam_width > 0:
            encoder_outputs = tf.contrib.seq2seq.tile_batch(encoder_outputs, multiplier=beam_width)
            encoder_final_state = tf.contrib.seq2seq.tile_batch(encoder_final_state, multiplier=beam_width)
            input_lengths = tf.contrib.seq2seq.tile_batch(input_lengths, multiplier=beam_width)

        # Selecting the Attention Mechanism
        attention_mechanism = tf.contrib.seq2seq.LuongAttention(
            num_units=num_units, memory=encoder_outputs,
            # Assumed completion (this argument is cut off in the source): mask padding via the source lengths.
            memory_sequence_length=input_lengths)

        # Selecting the Cell Type to use
        cell = tf.contrib.rnn.GRUCell(num_units=num_units)

        # Wrapping attention to the cell
        attn_cell = tf.contrib.seq2seq.AttentionWrapper(
            cell, attention_mechanism, attention_layer_size=num_units)
        out_cell = tf.contrib.rnn.OutputProjectionWrapper(
            attn_cell, vocab_size, reuse=reuse)

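        if beam_width > 0:
            # Beam search decoding; the helper argument is ignored in this branch.
            # Arguments marked "assumed" are cut off in the source and filled in from context.
            encoder_state = out_cell.zero_state(dtype=tf.float32,
                                                batch_size=batch_size * beam_width).clone(
                cell_state=encoder_final_state)  # assumed

            decoder = tf.contrib.seq2seq.BeamSearchDecoder(
                cell=out_cell, embedding=embeddings,
                start_tokens=tf.to_int32(start_tokens), end_token=END_TOKEN,
                initial_state=encoder_state, beam_width=beam_width)  # assumed

            outputs = tf.contrib.seq2seq.dynamic_decode(
                decoder=decoder, output_time_major=False,
                impute_finished=False, maximum_iterations=output_max_length)
            return outputs[0]
        else:
            # Helper-driven decoding, e.g. with a TrainingHelper during training.
            decoder = tf.contrib.seq2seq.BasicDecoder(
                cell=out_cell, helper=helper,
                initial_state=out_cell.zero_state(
                    dtype=tf.float32, batch_size=batch_size).clone(
                    cell_state=encoder_final_state))  # assumed
            outputs = tf.contrib.seq2seq.dynamic_decode(
                decoder=decoder, output_time_major=False,
                impute_finished=True, maximum_iterations=output_max_length)
            return outputs[0]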
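To illustrate what beam search does independently of TensorFlow, here is a small pure-Python sketch. It is not the BeamSearchDecoder implementation: the step function, the toy probability table and the lack of length normalization are simplifying assumptions made for this example.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_width, max_len):
    """Return the highest-scoring token sequence found with a beam of size beam_width.

    step_fn(prefix) must return a dict {token: probability} for the next token.
    """
    beams = [([start_token], 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the beam_width best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]


# Toy next-token model: the locally best first word ("a") leads to a worse sentence.
TABLE = {
    "<GO>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_step(prefix):
    return TABLE.get(prefix[-1], {"<EOS>": 1.0})


greedy = beam_search(toy_step, "<GO>", "<EOS>", beam_width=1, max_len=5)
wide = beam_search(toy_step, "<GO>", "<EOS>", beam_width=2, max_len=5)
```

With a beam width of 1 the search is greedy and commits to "a" (sequence probability 0.3), while a beam width of 2 keeps "b" alive long enough to find the more probable sequence (0.36).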

Attention Mechanism

In the encoder, encoder_outputs is the set of all source hidden states at the top layer. The attention mechanism expects its “memory” to be batch-major, that is, of shape [batch_size, max_len, depth]. The encoder above is not run in time-major mode, so encoder_outputs already has this layout and no transpose is needed; a time-major output of shape [max_len, batch_size, depth] would have to be transposed first. We pass the source sequence lengths (input_lengths) to the attention mechanism so that padding positions are masked out and the attention weights are properly normalized.
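The idea of Luong-style (multiplicative) attention with length masking can be sketched in NumPy as follows. This is an illustration under assumptions, not the tf.contrib code: the function name, the weight matrix W and the random inputs are made up for the example.

```python
import numpy as np


def luong_attention(query, memory, memory_lengths, W):
    """query: [batch, units]; memory: [batch, max_len, units]; W: [units, units].

    Returns attention weights [batch, max_len] and context vectors [batch, units].
    """
    # Multiplicative score: score(h_t, h_s) = h_t^T W h_s
    scores = np.einsum("bu,uv,blv->bl", query, W, memory)
    # Mask positions beyond each sentence's true length (lengths are assumed >= 1).
    positions = np.arange(memory.shape[1])[None, :]
    mask = positions < np.asarray(memory_lengths)[:, None]
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the source positions.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Context vector: attention-weighted sum of the encoder states.
    context = np.einsum("bl,blu->bu", weights, memory)
    return weights, context


rng = np.random.RandomState(0)
query = rng.randn(2, 4)
memory = rng.randn(2, 3, 4)  # batch-major: [batch_size, max_len, units]
weights, context = luong_attention(query, memory, [3, 1], np.eye(4))
```

The second sentence has length 1, so all of its attention weight falls on the first position and its context vector equals that encoder state.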

Decoder Input
One obvious question is what to put on the inputs of the decoder network. During training it is quite simple, a little like in a language model. Each decoder step is supposed to produce a word and an output state that feeds into the next step. You also feed the previous word as the input to the next step; during training, that previous word is taken from the target sentence rather than from the decoder’s own prediction.

    if mode == tf.estimator.ModeKeys.TRAIN:
        # Specific For Training
        output = features['output']
        train_output = tf.concat([tf.expand_dims(start_tokens, 1), output], 1)
        output_lengths = tf.reduce_sum(tf.to_int32(tf.not_equal(train_output, 1)), 1)

        output_embed = layers.embed_sequence(
            train_output, vocab_size=vocab_size, embed_dim=embed_dim, scope='embed', reuse=True)

        train_helper = tf.contrib.seq2seq.TrainingHelper(output_embed, output_lengths)

        train_outputs = decoder.setting_decoder(train_helper, 'decode', num_units, encoder_outputs,
                                                encoder_final_state, input_lengths,
                                                vocab_size, batch_size, output_max_length, embeddings,
                                                start_tokens, END_TOKEN, beam_width, reuse=None)

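        # pred_helper is not defined in this excerpt; it is assumed to be an inference helper
        # (for example a GreedyEmbeddingHelper built from embeddings, start_tokens and END_TOKEN).
        pred_outputs = decoder.setting_decoder(pred_helper, 'decode', num_units, encoder_outputs,
                                               encoder_final_state, input_lengths,
                                               vocab_size, batch_size, output_max_length, embeddings,
                                               start_tokens, END_TOKEN, beam_width, reuse=True)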

        tf.identity(train_outputs.sample_id[0], name='train_pred')
        weights = tf.to_float(tf.not_equal(train_output[:, :-1], 1))

        logits = tf.identity(train_outputs.rnn_output, 'logits')
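The shift between decoder inputs and targets in the training branch can be seen in a small NumPy sketch. The token ids (GO = 0, END = 1) and the example batch are assumptions chosen to match the tf.zeros start tokens and the tf.not_equal(..., 1) calls in the code.

```python
import numpy as np

GO_TOKEN, END_TOKEN = 0, 1  # assumed ids

# Hypothetical padded target batch: two German sentences as word ids.
output = np.array([[7, 8, 9, END_TOKEN],
                   [5, END_TOKEN, END_TOKEN, END_TOKEN]])

# Prepend the start token, as tf.concat does for train_output.
go_column = np.full((output.shape[0], 1), GO_TOKEN)
train_output = np.concatenate([go_column, output], axis=1)

# At step t the decoder reads decoder_inputs[:, t] and should predict output[:, t].
decoder_inputs = train_output[:, :-1]
weights = (decoder_inputs != END_TOKEN).astype(np.float32)
```

Because the mask is computed on the shifted inputs, the step that predicts the first end token still counts towards the loss, while the padding steps after it do not.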


Given the logits above, we are now ready to compute our training loss.

        loss = tf.contrib.seq2seq.sequence_loss(
            logits, output, weights=weights)

        train_op = layers.optimize_loss(
            loss, tf.train.get_global_step(),
            optimizer=params.get('optimizer', 'Adam'),
            learning_rate=params.get('learning_rate', 0.001),
            summaries=['loss', 'learning_rate'])

        tf.identity(pred_outputs.sample_id[0], name='predictions')
        # Assumed completion: the arguments of this call are cut off in the source.
        return tf.estimator.EstimatorSpec(
            mode=mode, predictions=pred_outputs.sample_id,
            loss=loss, train_op=train_op)

weights is a zero-one matrix with the same shape as the target batch (output). It masks padding positions beyond each target sequence’s length with the value 0, so they do not contribute to the loss.
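A weighted sequence loss of this kind can be sketched in NumPy. This only illustrates the masking idea (a weighted average of per-token cross-entropy); it is not the library’s code, and the function name and toy tensors are assumptions.

```python
import numpy as np


def masked_sequence_loss(logits, targets, weights):
    """logits: [batch, time, vocab]; targets and weights: [batch, time]."""
    # Log-softmax over the vocabulary, computed in a numerically stable way.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct word at every position.
    batch_idx, time_idx = np.indices(targets.shape)
    token_loss = -log_probs[batch_idx, time_idx, targets]
    # Positions with weight 0 (padding) are ignored in the average.
    return (token_loss * weights).sum() / weights.sum()


logits = np.zeros((1, 2, 4))
logits[0, 1, 0] = 100.0  # a confident wrong prediction at the padded step
targets = np.array([[2, 3]])
weights = np.array([[1.0, 0.0]])
loss = masked_sequence_loss(logits, targets, weights)
```

The wrong prediction at the second step has no effect, because its weight is 0; the loss is the cross-entropy of a uniform prediction over 4 words.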

During Inference

When you are actually predicting something, it is a bit more complicated. Once the network is trained, you feed the sentence to translate, such as “I am a student”, into the encoder and get an output vector. That vector is fed into the first decoder step. The first step also needs something on its input: this is the GO token, which is treated like a word and therefore has to be embedded, which is why I use an embedding lookup. Then I run this through the dynamic decoder, where each predicted word is fed back as the next input.

if mode == tf.estimator.ModeKeys.PREDICT:
    # Specific for Prediction
    pred_outputs = decoder.setting_decoder(pred_helper, 'decode', num_units, encoder_outputs,
                                           encoder_final_state, input_lengths,
                                           vocab_size, batch_size, output_max_length,
                                           embeddings, start_tokens, END_TOKEN, beam_width,
                                           reuse=False)  # assumed: this argument is cut off in the source

    if beam_width > 0:
        # Beam search returns the ids of all beams.
        tf.identity(pred_outputs.predicted_ids, name='predictions')
        return tf.estimator.EstimatorSpec(mode=mode, predictions=pred_outputs.predicted_ids)
    else:
        tf.identity(pred_outputs.sample_id[0], name='predictions')
        return tf.estimator.EstimatorSpec(mode=mode, predictions=pred_outputs.sample_id)

Train Model

Create Estimators

An Estimator is TensorFlow’s high-level representation of a complete model. It handles the details of initialization, logging, saving and restoring, and many other features so you can concentrate on your model.

vocab = input_helper.load_vocab(vocab_file)

params = {
    'vocab_size': len(vocab),
    'batch_size': 64,
    'input_max_length': 20,
    'output_max_length': 20,
    'embed_dim': 100,
    'num_units': 256,
    'dropout': 0.2,
    'beam_width': 0
}

input_fn, feed_fn = input_helper.make_input_fn(
    vocab, params['input_max_length'], params['output_max_length'])

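# The RunConfig arguments are cut off in the source; the defaults are used here as an assumption.
run_config = tf.estimator.RunConfig()

# Assumed completion (cut off in the source): wire up the model function defined above.
seq2seq_esti = tf.estimator.Estimator(
    model_fn=seq2seq_model, config=run_config, params=params)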

TensorFlow has written a ton of boilerplate code for you that is not interesting to write, such as regularly writing checkpoints. If your training crashes after 24 hours, you can restart from where you were. The Estimator can also export the model at the end, so that you have something ready to deploy to a serving infrastructure. Support for distributed training is baked into the Estimator as well.

Train Model

Train the model by calling the Estimator’s train method with the input_fn created above; feed_fn supplies the placeholder values for each batch.



Predictions (inferring) from Trained Model

We now have a trained model and can use it to translate English sentences. As with training, we run inference with a single function call.

def predict_input_fn(input_filename, vocab, input_process=tokenize_and_map):
    max_len = 0  # must be an int: it is used as an array size below

    with open(input_filename) as finput:
        for in_line in finput:
            max_len = max(len(in_line.split(" ")), max_len)

    predict_lines = np.empty(max_len + 1, int)

    with open(input_filename) as finput:
        for in_line in finput:
            in_line = in_line.lower()
            new_line_tmp = np.array(input_process(in_line, vocab), dtype=int)
            new_line = np.append(new_line_tmp, np.array([UNK_TOKEN for _ in range(max_len - len(new_line_tmp))] +
                                                        [int(END_TOKEN)], dtype=int))
            predict_lines = np.vstack((predict_lines, new_line))

    pred_line_tmp = np.delete(predict_lines, 0, 0)

    pred_lines = np.array(pred_line_tmp)
    return {'input': pred_lines}

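# Assumed completion: the remaining arguments are cut off in the source.
pred_input_fn = tf.estimator.inputs.numpy_input_fn(x=inputs_with_tokens,
                                                   shuffle=False, num_epochs=1)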

predictions_obj = model.predict(input_fn=pred_input_fn)
if params['beam_width'] > 0:
    final_answer = p_helper.get_out_put_from_tokens_beam_search(predictions_obj, vocab)
else:
    final_answer = p_helper.get_out_put_from_tokens(predictions_obj, vocab)

with open(input_file) as finput:
    for each_answer in final_answer:
        question = finput.readline()
        print('Source: ', question.replace('\n', '').replace('<EOS>', ''))
        print('Target: ', str(each_answer).replace('<EOS>', '').replace('<GO>', ''))


Please visit the GitHub repo for more detailed information and the actual code. It covers a few more topics, such as how to preprocess the dataset, how to define inputs, and how to train the model and get predictions.