NLP models use text to produce a basic form of natural language understanding, powering applications that range from document classification and sentiment analysis to question answering. In this tutorial we will learn how to use TensorFlow’s Dataset API to build input pipelines for text.
Download Dataset
First, head to https://www.kaggle.com/quora/question-pairs-dataset/kernels and download the raw Question Pairs Dataset, then uncompress it. The dataset is related to the problem of identifying duplicate questions.
import pandas as pd
import tensorflow as tf
from tensorflow.python.ops import lookup_ops
UNK = "<unk>"
SOS = "<s>"
EOS = "</s>"
UNK_ID = 0
tf.enable_eager_execution()
questions = pd.read_csv('download/questions.csv')
vocab_file = "download/vocab.txt"
question1 = questions.question1[:100000]
question2 = questions.question2[:100000]
labels = questions.is_duplicate[:100000]
max_question_len = 200
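The code above assumes a vocab.txt file: a newline-delimited list of tokens, one per line, with the special tokens <unk>, <s>, and </s> in the first three positions so that <unk> receives id 0 (matching UNK_ID above). The Kaggle download does not ship such a file, so here is a minimal pure-Python sketch of how one could be built from the question text (the function name and the 50,000-token cap are illustrative assumptions, not part of the dataset):

```python
from collections import Counter

def build_vocab(sentences, max_size=50000):
    """Count whitespace-separated tokens across all sentences and return
    the vocabulary as a list, with the special tokens first so that
    "<unk>" ends up at index 0."""
    counts = Counter()
    for sentence in sentences:
        counts.update(str(sentence).split())
    most_common = [tok for tok, _ in counts.most_common(max_size)]
    return ["<unk>", "<s>", "</s>"] + most_common

# Example: build a tiny vocabulary and write it to disk, one token per line.
vocab = build_vocab(["how do magnets work", "how do planes fly"])
# with open("download/vocab.txt", "w") as f:
#     f.write("\n".join(vocab))
```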
Reading input data
If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().
dataset = tf.data.Dataset.from_tensor_slices((question1, question2, labels))
# Shuffle with a buffer large enough to mix the examples well.
dataset = dataset.shuffle(buffer_size=50000)
Tokenize Data
Natural language processing is pattern recognition applied to words, sentences, and paragraphs. NLP models don’t take raw text as input: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors.
dataset = dataset.map(
lambda qus1, qus2, labels: (
tf.string_split([qus1]).values, tf.string_split([qus2]).values, labels))
# Filter zero length input sequences.
dataset = dataset.filter(
lambda qus1, qus2, labels: tf.logical_and(tf.size(qus1) > 0, tf.size(qus2) > 0))
dataset = dataset.map(
lambda qus1, qus2, labels: (qus1[:max_question_len], qus2[:max_question_len], labels))
The units into which you break down text (words, characters, or N-grams) are called “tokens”, and breaking text down into such tokens is called “tokenization”. All text vectorization processes consist in applying some tokenization scheme, then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are what get fed into deep neural networks.
NLP models rely on ids as input for the words, which means you need to convert each sentence into a sequence of ids. TensorFlow has a built-in tool to take care of this mapping: we simply define a lookup table.
string_to_index = lookup_ops.index_table_from_file(vocab_file, default_value=UNK_ID)
Now that we have initialized this lookup table, we can transform each sequence of tokens into a sequence of ids.
dataset = dataset.map(
lambda qus1, qus2, labels: (tf.cast(string_to_index.lookup(qus1), tf.int32),
tf.cast(string_to_index.lookup(qus2), tf.int32), labels))
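To see what this lookup does, the same mapping can be sketched with a plain Python dict, where any out-of-vocabulary token falls back to UNK_ID — the same behaviour as index_table_from_file(vocab_file, default_value=UNK_ID). The helper name and the tiny vocabulary are illustrative:

```python
UNK_ID = 0

def tokens_to_ids(tokens, vocab):
    """Map each token to its row index in the vocabulary list, falling
    back to UNK_ID for tokens that are not in the vocabulary."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, UNK_ID) for tok in tokens]

vocab = ["<unk>", "<s>", "</s>", "how", "do", "magnets", "work"]
print(tokens_to_ids(["how", "do", "magnets", "levitate"], vocab))
# -> [3, 4, 5, 0]  ("levitate" is out of vocabulary)
```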
Batching with padding
Sequence models work with input data of varying size (e.g. sequences of different lengths). To handle this case, the Dataset.padded_batch() transformation enables you to batch tensors of different shapes by specifying one or more dimensions in which they may be padded.
# Look up the ids of the sentence-boundary tokens to use as padding
# values for the two question sequences. The label component is a
# scalar and is never actually padded, but a padding value with a
# matching dtype (int64) must still be supplied.
sos_id = tf.cast(string_to_index.lookup(tf.constant(SOS)), tf.int32)
eos_id = tf.cast(string_to_index.lookup(tf.constant(EOS)), tf.int32)
dataset = dataset.padded_batch(
    32,
    padded_shapes=(
        tf.TensorShape([None]),
        tf.TensorShape([None]),
        tf.TensorShape([])),
    padding_values=(
        sos_id, eos_id, tf.constant(0, tf.int64)))
for data in dataset:
tf.print(data)
It allows you to set different padding for each dimension of each component; each dimension may be variable-length (signified by None
in the example above) or constant-length. It is also possible to override the padding value, which defaults to 0.
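The effect of padded_batch on a variable-length component can be sketched without TensorFlow: every sequence in a batch is padded with the padding value up to the length of the longest sequence in that batch. The helper below is an illustrative stand-in, not the library implementation:

```python
def pad_batch(sequences, padding_value=0):
    """Pad each sequence with padding_value up to the length of the
    longest sequence in the batch, mirroring what padded_batch does
    for a component whose padded_shape is [None]."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[3, 4, 5], [7], [8, 9]], padding_value=2)
print(batch)
# -> [[3, 4, 5], [7, 2, 2], [8, 9, 2]]
```

Because padding is computed per batch, batches whose longest sequences differ will have different widths, which is why the padded_shapes entry for these components is [None] rather than a fixed length.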