Once you train a deep learning model in Keras, you can use it to make predictions on new data. In this tutorial, we train an RNN model for text classification, save the model and tokenizer, and load them later to make predictions.

Download Data

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years and include all ~500,000 reviews up to October 2012. You can download the dataset directly from the Kaggle website.

Prepare Dataset

The Keras Tokenizer API removes punctuation, splits strings into lists of individual words, and converts those words into integers.

import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

df = pd.read_csv(data_file_path)
df = df[['Text', 'Score']]

# Drop duplicate reviews
df.drop_duplicates(subset=['Score', 'Text'], keep='first', inplace=True)

# Shuffle the rows
df = df.sample(frac=1).reset_index(drop=True)

review = df['Text']
rating = df['Score']

num_of_words = 80000
max_len = 250

tokenizer = Tokenizer(num_words=num_of_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)

tokenizer.fit_on_texts(review)
review_seq = tokenizer.texts_to_sequences(review)

# Pad (or truncate) every sequence to the same length
review_seq_pad = pad_sequences(review_seq, maxlen=max_len)

# Shift 1-5 star scores to class labels 0-4 for sparse_categorical_crossentropy
rating_1 = rating - 1

train_x, test_x, train_y, test_y = train_test_split(review_seq_pad, rating_1, test_size=0.20, random_state=42)

By default, the Tokenizer removes all punctuation, lowercases words, and then converts words to sequences of integers. It is first fit on a list of strings and then converts that list into a list of lists of integers.
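
To make the fit-then-convert flow concrete, here is a toy illustration (the example sentences are made up, not from the dataset):

from tensorflow.keras.preprocessing.text import Tokenizer

toy_tokenizer = Tokenizer(num_words=100)
toy_tokenizer.fit_on_texts(['Great coffee, great price!', 'The coffee was stale.'])
print(toy_tokenizer.word_index)
# e.g. {'great': 1, 'coffee': 2, 'price': 3, 'the': 4, 'was': 5, 'stale': 6}
print(toy_tokenizer.texts_to_sequences(['Great coffee, great price!']))
# [[1, 2, 1, 3]]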

Create the Model

Keras allows us to build state-of-the-art models in a few lines of Python code. Here we use the Keras Sequential API, building the network up one layer at a time.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

epochs = 10
emb_dim = 128
batch_size = 256

model = Sequential()
# Map each word index to a dense 128-dimensional vector
model.add(Embedding(num_of_words, emb_dim, input_length=train_x.shape[1]))
# Drop entire embedding channels to reduce overfitting
model.add(SpatialDropout1D(0.2))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
# One output unit per star-rating class
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
print(model.summary())

The model is compiled with the Adam optimizer and trained using the sparse_categorical_crossentropy loss, which expects integer class labels (here 0-4) rather than one-hot vectors.
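
As a side note (not part of the original pipeline): if the labels were one-hot encoded instead of integers, we would use categorical_crossentropy. A tiny illustration of the difference:

import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 4, 2])  # integer class ids, as sparse_categorical_crossentropy expects
one_hot = to_categorical(labels, num_classes=5)  # what plain categorical_crossentropy expects
print(one_hot)
# [[1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1.]
#  [0. 0. 1. 0. 0.]]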

Train Model

We are now ready to train the model. When training neural networks, it is good practice to use EarlyStopping in the form of a Keras callback:

from tensorflow.keras.callbacks import EarlyStopping

history = model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size,
                    validation_split=0.2, callbacks=[EarlyStopping(monitor='val_loss')])
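
Once training finishes, you can run a quick sanity check against the held-out test split (an optional step, not in the original code):

loss, acc = model.evaluate(test_x, test_y, batch_size=batch_size)
print(f'Test loss: {loss:.3f}, test accuracy: {acc:.3f}')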

Save Keras Model

Saving the trained model takes a single line of Python:

model.save('Food_Reviews.h5')

This saves the entire state of the model (architecture, weights, and optimizer state) in a single HDF5 file.

Save Keras Tokenizer

The tokenizer transforms text into sequences of integers, so it is important that training and prediction share the same word-to-index mapping. The most common approach is to save the tokenizer with pickle and load the same tokenizer at prediction time.

import pickle

# Serialize the fitted tokenizer to disk
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

Load Keras Model for Prediction

Saved models can be re-instantiated via keras.models.load_model().

import tensorflow as tf

loaded_model = tf.keras.models.load_model('Food_Reviews.h5')

The model returned by load_model() is a compiled model ready to be used.

You have to load both the model and the tokenizer in order to predict on new data.

with open('tokenizer.pickle', 'rb') as handle:
    loaded_tokenizer = pickle.load(handle)

You must use the same Tokenizer you used to build the model; otherwise, words may map to different integers at prediction time. The tokenizer loaded from pickle is ready to use.
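
If the original tokenizer is still in memory, you can optionally verify that the loaded copy carries the identical vocabulary (a hypothetical sanity check, not in the original):

# Both tokenizers should map every word to the same integer
assert loaded_tokenizer.word_index == tokenizer.word_index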

Predict New Text

Now for scoring we use the loaded model and tokenizer. Because both were saved to files, we do not have to reprocess the entire corpus every time we need to score even a single sentence.

txt = review[10]
seq = loaded_tokenizer.texts_to_sequences([txt])
padded = pad_sequences(seq, maxlen=max_len)
# predict_classes was removed in newer Keras versions; take the argmax of the probabilities instead
pred = loaded_model.predict(padded).argmax(axis=-1)
print(pred)
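
Because the 1-5 star scores were shifted down by one before training, add one back to recover the predicted star rating (a small follow-up, assuming the rating_1 = rating - 1 shift shown earlier):

# Map the class label 0-4 back to a 1-5 star rating
predicted_stars = int(pred[0]) + 1
print(f'Predicted rating: {predicted_stars} stars')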

Note that calling fit_on_texts again could change the word index, which is exactly why we reuse the saved tokenizer instead of refitting one.