NLP models use text to produce a basic form of natural language understanding, powering applications that range from document classification and sentiment analysis to question answering. In this tutorial we will learn how to use TensorFlow’s Dataset API to build input pipelines for text.

Download Dataset

First, head to https://www.kaggle.com/quora/question-pairs-dataset/kernels, download the raw Question Pairs Dataset, and uncompress it. The dataset is built around the problem of identifying duplicate questions: each row contains a pair of questions and a label indicating whether they are duplicates.

import pandas as pd
import tensorflow as tf
from tensorflow.python.ops import lookup_ops

# Special tokens, and the id reserved for out-of-vocabulary words.
UNK = "<unk>"
SOS = "<s>"
EOS = "</s>"
UNK_ID = 0

tf.enable_eager_execution()

questions = pd.read_csv('download/questions.csv')
vocab_file = "download/vocab.txt"

# Work with the first 100,000 question pairs.
question1 = questions.question1[:100000]
question2 = questions.question2[:100000]
labels = questions.is_duplicate[:100000]

# Questions longer than this (in tokens) will be truncated.
max_question_len = 200

Reading input data

If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().

dataset = tf.data.Dataset.from_tensor_slices((question1, question2, labels))

# Shuffle with an in-memory buffer; a larger buffer gives a more uniform shuffle.
buffer_size = 50000
dataset = dataset.shuffle(buffer_size)

Tokenize Data

Natural language processing is pattern recognition applied to words, sentences, and paragraphs. NLP models don’t take raw text as input: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors.

# Split each question into whitespace-separated word tokens.
dataset = dataset.map(
    lambda qus1, qus2, labels: (
        tf.string_split([qus1]).values, tf.string_split([qus2]).values, labels))

# Filter zero length input sequences.
dataset = dataset.filter(
    lambda qus1, qus2, labels: tf.logical_and(tf.size(qus1) > 0, tf.size(qus2) > 0))

# Truncate each question to at most max_question_len tokens.
dataset = dataset.map(
    lambda qus1, qus2, labels: (qus1[:max_question_len], qus2[:max_question_len], labels))

The units into which you can break down text (words, characters, or N-grams) are called “tokens”, and breaking text down into such tokens is called “tokenization”. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are what get fed into deep neural networks.
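
To make this concrete, here is a minimal sketch (relying on the eager execution we enabled above) of what tf.string_split, used in the map step above, produces for a single sentence:

# Tokenize one example sentence; .values holds the dense list of tokens.
sentence = tf.constant("How do I learn TensorFlow")
tokens = tf.string_split([sentence]).values
print(tokens)  # tf.Tensor([b'How' b'do' b'I' b'learn' b'TensorFlow'], shape=(5,), dtype=string)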

NLP models rely on integer ids as input rather than words, which means each sentence must be converted from a sequence of words into a sequence of ids.

TensorFlow has a built-in tool to take care of this mapping: we simply define a lookup table.


string_to_index = lookup_ops.index_table_from_file(vocab_file, default_value=UNK_ID)
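
The vocab_file is assumed to be a plain-text file with one token per line, with the special tokens listed first so that <unk> receives UNK_ID = 0 (index_table_from_file assigns each line its line number as the id). If your download does not include such a file, a hypothetical sketch for building one from the question columns could look like this:

from collections import Counter

# Count whitespace-separated tokens across both question columns.
counts = Counter()
for text in pd.concat([question1, question2]).dropna():
    counts.update(str(text).split())

# Write the special tokens first (so <unk> gets id 0), then the most frequent words.
with open(vocab_file, "w") as f:
    for token in [UNK, SOS, EOS]:
        f.write(token + "\n")
    for token, _ in counts.most_common(30000):
        f.write(token + "\n")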

Now that we have initialized this lookup table, we can map every token in the dataset to its integer id.

# Replace each word token with its integer id from the vocabulary.
dataset = dataset.map(
    lambda qus1, qus2, labels: (tf.cast(string_to_index.lookup(qus1), tf.int32),
                                tf.cast(string_to_index.lookup(qus2), tf.int32), labels))
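
As a quick sanity check (again relying on eager execution), you can pull a single element from the pipeline and confirm that both questions are now vectors of ids:

# Each element is now (question1 ids, question2 ids, label).
for qus1_ids, qus2_ids, label in dataset.take(1):
    print(qus1_ids, qus2_ids, label)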

Batching with padding

Sequence models work with input data that can vary in size (e.g. sequences of different lengths). To handle this case, the Dataset.padded_batch() transformation lets you batch tensors of different shapes by specifying one or more dimensions in which they may be padded.

# Look up the ids of the sentence-start and sentence-end tokens to use as padding.
sos_id = tf.cast(string_to_index.lookup(tf.constant(SOS)), tf.int32)
eos_id = tf.cast(string_to_index.lookup(tf.constant(EOS)), tf.int32)

dataset = dataset.padded_batch(
    32,
    padded_shapes=(
        tf.TensorShape([None]),   # question1: pad to the longest sequence in the batch
        tf.TensorShape([None]),   # question2: pad to the longest sequence in the batch
        tf.TensorShape([])),      # labels are scalars, so no padding is applied
    padding_values=(
        sos_id,
        eos_id,
        tf.constant(0, dtype=tf.int64)))  # dtype must match the labels component

# Iterate over the pipeline and print each padded batch.
for data in dataset:
    tf.print(data)

The padded_batch() transformation lets you set different padding for each dimension of each component, and each dimension may be variable-length (signified by None in the example above) or constant-length. It is also possible to override the padding value, which defaults to 0.
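
For instance, to pad every question to the constant length max_question_len instead of the longest sequence in each batch, the padded_batch call could be written as follows (a sketch reusing the names defined earlier; it replaces, rather than follows, the earlier padded_batch, since it applies to the un-batched dataset):

# Constant-length padding: every question tensor has shape [max_question_len].
dataset = dataset.padded_batch(
    32,
    padded_shapes=(
        tf.TensorShape([max_question_len]),
        tf.TensorShape([max_question_len]),
        tf.TensorShape([])),
    padding_values=(
        sos_id,
        eos_id,
        tf.constant(0, dtype=tf.int64)))

Because the questions were already truncated to max_question_len, every sequence fits in this fixed shape; padded_batch raises an error if any element is longer than the given constant length.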