In sequence data, individual samples have different lengths. Consider the following example of text tokenized as words.

[
  ["The", "weather", "will", "be", "nice", "tomorrow"],
  ["Hello", "world", "!"],
  ["How", "are", "you", "today"]
  
]

The data is a 2D list where the individual samples have lengths 6, 3, and 4 respectively. However, the input to a deep learning model must be a single rectangular tensor, with a shape such as (batch_size, seq_len, vocab_size). Samples that are shorter than the longest item therefore need to be padded with some placeholder value, and/or long samples truncated before the short ones are padded.
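
For comparison, here is a minimal sketch of the traditional padding approach with pad_sequences; the integer token IDs below are made up purely for illustration:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical integer token IDs for the three sentences above
sequences = [
    [11, 27, 33, 4, 56, 78],   # 6 tokens
    [5, 9, 2],                 # 3 tokens
    [7, 12, 3, 40],            # 4 tokens
]

# Every sample is padded with zeros to the length of the longest one (6)
padded = pad_sequences(sequences, padding="post", value=0)
print(padded)
# [[11 27 33  4 56 78]
#  [ 5  9  2  0  0  0]
#  [ 7 12  3 40  0  0]]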

In this tutorial, you will discover the RaggedTensor, which you can use to prepare variable-length sequence data for NLP in Python with Keras, without any additional padding or user-facing padding logic.

RaggedTensor is a type of tensor that efficiently represents sequence data. It is designed to handle text and other variable-length sequences, and it is the native representation for sequences of varying shape.

Figure: A Keras RaggedTensor containing three batch items.
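
For example, the three sentences from the introduction can be stored in a single RaggedTensor with three batch items (a minimal sketch):

import tensorflow as tf

sentences = tf.ragged.constant([
    ["The", "weather", "will", "be", "nice", "tomorrow"],
    ["Hello", "world", "!"],
    ["How", "are", "you", "today"],
])
print(sentences.shape)          # (3, None) -- the second dimension is ragged
print(sentences.row_lengths())  # tf.Tensor([6 3 4], shape=(3,), dtype=int64)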

Difference Between RaggedTensor and SparseTensor

SparseTensors assume that the underlying dense tensor is regularly shaped and that the unmentioned values are missing. RaggedTensors, on the other hand, make no such assumption.

Figure: (1) a RaggedTensor and (2) a SparseTensor.

Here, the SparseTensor interprets its first batch element as John, null, null, while the RaggedTensor interprets it as simply John.
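
A small sketch (with made-up words) makes the difference concrete:

import tensorflow as tf

# RaggedTensor: each row simply has its own length
ragged = tf.ragged.constant([["John"], ["a", "big", "dog"], ["my", "cat"]])
print(ragged[0])  # tf.Tensor([b'John'], shape=(1,), dtype=string)

# SparseTensor: the dense shape is fixed at (3, 3); unmentioned positions are
# treated as missing and are filled with a default value when densified
sparse = tf.sparse.SparseTensor(
    indices=[[0, 0], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1]],
    values=["John", "a", "big", "dog", "my", "cat"],
    dense_shape=[3, 3])
print(tf.sparse.to_dense(sparse, default_value=""))
# [[b'John' b'' b'']
#  [b'a' b'big' b'dog']
#  [b'my' b'cat' b'']]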

Create RaggedTensor

The simplest way to construct a ragged tensor is using tf.ragged.constant, which builds the RaggedTensor corresponding to a given nested Python list or NumPy array. Here it is applied to the IMDB movie-review dataset, where each review is a list of word indices with its own length.

import tensorflow as tf

max_features = 20000   # vocabulary size
batch_size = 32
BUFFER_SIZE = 1000

# Load the IMDB reviews as lists of word indices; every review has a different length
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
    path="imdb.npz",
    num_words=max_features,
    skip_top=0,
    maxlen=None,
    seed=113,
    start_char=1,
    oov_char=2,
    index_from=3)

# Convert the variable-length reviews into RaggedTensors -- no padding required
r_train_x = tf.ragged.constant(x_train)
r_test_x = tf.ragged.constant(x_test)

Shape

A RaggedTensor can contain any number of irregular dimensions. The RaggedTensor.shape attribute returns a tf.TensorShape for a ragged tensor, where ragged dimensions have size None.

r_train_x.shape             # (25000, None) -- the ragged dimension has size None
r_train_x.bounding_shape()  # [25000, length of the longest review]

The tf.RaggedTensor.bounding_shape method used above returns a tight bounding shape for a given RaggedTensor, i.e. the smallest rectangular shape that would contain every row.
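
A toy example (not the IMDB data) illustrates the difference between the two:

rt = tf.ragged.constant([[1, 2, 3, 4], [5], [6, 7]])
print(rt.shape)             # (3, None) -- the ragged dimension is reported as None
print(rt.bounding_shape())  # tf.Tensor([3 4], shape=(2,), dtype=int64) -- the longest row has 4 items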

But with RaggedTensors, you don't need to worry about maximum sizes, padding, or anything else.

Create Model

RaggedTensors are supported by many TensorFlow APIs, including Keras, tf.data Datasets, and SavedModels.

RaggedTensors can be passed as inputs to a Keras model by setting ragged=True on tf.keras.Input. RaggedTensors may also be passed between Keras layers and returned by Keras models. The following LSTM model is trained using the ragged tensors created above.

keras_model = tf.keras.Sequential([
    # Accept variable-length sequences of word indices as a ragged input
    tf.keras.layers.Input(shape=[None], dtype=tf.int32, ragged=True),
    tf.keras.layers.Embedding(max_features, 128),
    tf.keras.layers.LSTM(32, use_bias=False),
    tf.keras.layers.Dense(32),
    tf.keras.layers.Activation(tf.nn.relu),
    tf.keras.layers.Dense(1)   # one raw score (logit) per review
])

NumEpochs = 10
BatchSize = 32

# The model emits raw logits, so let the loss apply the sigmoid internally
keras_model.compile(optimizer='rmsprop',
                    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                    metrics=['acc'])

history = keras_model.fit(r_train_x, y_train, epochs=NumEpochs, batch_size=BatchSize, validation_data=(r_test_x, y_test))

Create tf.data from RaggedTensor

tf.data is the TensorFlow API for building input pipelines, and it works with RaggedTensors as well. Datasets can be built from RaggedTensors using the same methods that are used to build them from tf.Tensors or NumPy arrays, such as Dataset.from_tensor_slices.

train_data = tf.data.Dataset.from_tensor_slices((r_train_x, y_train)).shuffle(BUFFER_SIZE).batch(32)
test_data = tf.data.Dataset.from_tensor_slices((r_test_x, y_test)).batch(32)
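
As a quick check, you can inspect the structure of the batched dataset; the exact component specs that are printed depend on your TensorFlow version:

# Each dataset element is a (reviews, labels) pair
print(train_data.element_spec)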

...

keras_model.compile(optimizer='rmsprop',
                    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                    metrics=['acc'])

history = keras_model.fit(train_data, epochs=5, validation_data=test_data)
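
After training, the model can be evaluated on the batched test dataset in the same way (a minimal sketch):

# Evaluate on the batched test dataset; returns the loss and the accuracy
loss, acc = keras_model.evaluate(test_data)
print(f"Test accuracy: {acc:.3f}")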

Save Model

RaggedTensors can be used transparently with the functions and methods defined by a SavedModel.

import tempfile

# Export the trained model to a temporary SavedModel directory and load it back
keras_model_path = tempfile.mkdtemp()
tf.saved_model.save(keras_model, keras_model_path)
imported_model = tf.saved_model.load(keras_model_path)

# Predict on the first 10 (ragged) training reviews using the imported model
imported_model(r_train_x[:10])
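
Since the final Dense(1) layer has no activation, the imported model returns raw logits; applying a sigmoid turns them into probabilities (a sketch):

# Convert the raw logits into probabilities in [0, 1]
logits = imported_model(r_train_x[:10])
probs = tf.sigmoid(logits)
print(probs.numpy().round(3))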

Run this code in Google Colab.