In sequence data, individual samples have different lengths. Consider the following example text tokenized as words.
    [
        ["The", "weather", "will", "be", "nice", "tomorrow"],
        ["Hello", "world", "!"],
        ["How", "are", "you", "today"]
    ]
The data is a 2D list whose samples have lengths 6, 3, and 4 respectively. The input to a deep learning model, however, must be a single tensor with a fixed shape such as (batch_size, seq_len, vocab_size). To fit that shape, samples shorter than the longest one must be padded with some placeholder value, and long samples are often truncated before the short ones are padded.
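The manual truncate-then-pad step described above can be sketched in plain Python. This is only an illustration: the helper name `pad_or_truncate`, the `"<pad>"` placeholder, and the target length of 4 are assumptions chosen for the example, not part of any library.

```python
from typing import List, Sequence


def pad_or_truncate(samples: Sequence[Sequence[str]],
                    max_len: int,
                    pad_value: str = "<pad>") -> List[List[str]]:
    """Return a rectangular batch: truncate long rows, then pad short ones."""
    batch = []
    for tokens in samples:
        row = list(tokens[:max_len])                  # truncate long samples
        row += [pad_value] * (max_len - len(row))     # pad short samples
        batch.append(row)
    return batch


samples = [
    ["The", "weather", "will", "be", "nice", "tomorrow"],
    ["Hello", "world", "!"],
    ["How", "are", "you", "today"],
]

batch = pad_or_truncate(samples, max_len=4)
for row in batch:
    print(row)
```

Every row now has exactly four entries, at the cost of dropping "nice" and "tomorrow" from the first sample and adding a placeholder to the second.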
In this tutorial, you will discover RaggedTensor, which you can use to prepare variable-length sequence data for NLP in Python with Keras without any additional padding or user-facing logic.
RaggedTensor is a new type of tensor that efficiently represents sequence data. It is designed to handle text and other variable-length sequences, and it is a native representation for sequences of varying shapes.
Difference Between RaggedTensor and SparseTensor
SparseTensors assume that the underlying dense tensor is regularly shaped and that any unmentioned values are missing. RaggedTensors, on the other hand, make no such assumption.
For example, a SparseTensor interprets a first batch element that contains only John as John, null, null, while a RaggedTensor interprets it as simply John.
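The difference can be made concrete with a small sketch. The three-row batch below is made up for illustration; only its first row, John, comes from the example above. Converting the ragged value to a sparse one and then densifying it shows the empty slots that the sparse interpretation implies.

```python
import tensorflow as tf

# A made-up batch whose first row holds a single token.
ragged = tf.ragged.constant([["John"], ["a", "big", "dog"], ["my", "cat"]])

# The ragged view: each row keeps only its own tokens.
print(ragged.row_lengths().numpy())   # number of tokens per row

# The sparse view assumes a regular (3, 3) grid with some cells missing.
sparse = ragged.to_sparse()
print(sparse.dense_shape.numpy())

# Densifying makes the missing cells visible as empty strings.
dense = tf.sparse.to_dense(sparse, default_value="")
print(dense.numpy())
```

In the dense view the first row becomes John followed by two empty cells, whereas the ragged value reports a row length of 1 for it.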
The simplest way to construct a ragged tensor is using tf.ragged.constant, which builds the RaggedTensor corresponding to a given nested Python list or NumPy array.
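    import tensorflow as tf

    max_features = 20000   # vocabulary size: keep only the most frequent words
    batch_size = 32
    BUFFER_SIZE = 1000     # shuffle buffer size used later with tf.data

    # Load the IMDB reviews as lists of word indices of varying length.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
        path="imdb.npz", num_words=max_features, skip_top=0, maxlen=None,
        seed=113, start_char=1, oov_char=2, index_from=3)

    # Convert the variable-length reviews to RaggedTensors.
    r_train_x = tf.ragged.constant(x_train)
    r_test_x = tf.ragged.constant(x_test)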
The method tf.RaggedTensor.bounding_shape can be used to find a tight bounding shape for a given RaggedTensor:
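As a minimal sketch, the three tokenized sentences from the start of this tutorial are used here in place of the IMDB data so the result is easy to check by eye. The variable name `sentences` is an assumption made for the example.

```python
import tensorflow as tf

sentences = tf.ragged.constant([
    ["The", "weather", "will", "be", "nice", "tomorrow"],
    ["Hello", "world", "!"],
    ["How", "are", "you", "today"],
])

# The static shape leaves the ragged dimension unknown.
print(sentences.shape)

# Per-row lengths of the ragged dimension.
print(sentences.row_lengths().numpy())

# The tightest rectangular shape that could hold every row:
# 3 rows, and the longest row has 6 tokens.
print(sentences.bounding_shape().numpy())
```

The bounding shape is what a padded dense tensor would need to be, which is the size you would otherwise have to compute and pad to by hand.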
But with RaggedTensors you don’t need to worry about maximum sizes, padding, or anything else.
RaggedTensors can be passed as inputs to a Keras model by setting ragged=True on tf.keras.Input. RaggedTensors may also be passed between Keras layers and returned by Keras models. The following LSTM model is trained using ragged tensors.
    keras_model = tf.keras.Sequential([
        # ragged=True lets the model accept variable-length rows directly.
        tf.keras.layers.Input(shape=[None], dtype=tf.int32, ragged=True),
        tf.keras.layers.Embedding(max_features, 128),
        tf.keras.layers.LSTM(32, use_bias=False),
        tf.keras.layers.Dense(32),
        tf.keras.layers.Activation(tf.nn.relu),
        # Sigmoid output so that 'binary_crossentropy' receives probabilities.
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    NumEpochs = 10
    BatchSize = 32

    keras_model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    history = keras_model.fit(r_train_x, y_train,
                              epochs=NumEpochs, batch_size=BatchSize,
                              validation_data=(r_test_x, y_test))
Create tf.data from RaggedTensor
tf.data is an API that enables you to build input pipelines from RaggedTensor. Datasets can be built from RaggedTensors using the same methods that are used to build them from tf.Tensors or NumPy arrays, such as Dataset.from_tensor_slices.
    train_data = (tf.data.Dataset.from_tensor_slices((r_train_x, y_train))
                  .shuffle(BUFFER_SIZE)
                  .batch(32))
    test_data = tf.data.Dataset.from_tensor_slices((r_test_x, y_test)).batch(32)
    ...
    keras_model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    history = keras_model.fit(train_data, epochs=5, validation_data=test_data)
RaggedTensors can be used transparently with the functions and methods defined by a SavedModel.
    import tempfile

    keras_model_path = tempfile.mkdtemp()
    tf.saved_model.save(keras_model, keras_model_path)
    imported_model = tf.saved_model.load(keras_model_path)

    # Predict on the first 10 training samples with the reloaded model.
    imported_model(r_train_x[:10])