In this tutorial, we’re going to talk about multi-layer RNNs, in which we stack multiple RNNs on top of each other. You could already regard an RNN as deep in one sense, because it is unrolled over potentially very many timesteps, and you could regard that as a kind of **depth**. But there’s another way an RNN can be deep.

If you apply multiple RNNs one after another, then this would be a different way to make your RNN deep. This is the idea behind a multi-layer RNN.

The reason you would want to do this is that it might allow the network to compute more complex representations. This is the logic behind deep networks in general: if you’re familiar with why deeper is better for, say, convolutional networks, the same reasoning applies here.

The idea is that the lower RNN layers might compute lower-level features, such as syntax, while the higher RNN layers compute higher-level features, such as semantics. These are sometimes called stacked RNNs.

Here’s an example of how a multi-layer RNN might work.

```
import tensorflow as tf

vocab_size = 10000  # set this to the size of your vocabulary

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, 16))
model.add(tf.keras.layers.LSTM(32, return_sequences=True))
model.add(tf.keras.layers.LSTM(32, return_sequences=True))
model.add(tf.keras.layers.LSTM(32))
model.add(tf.keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])
```

This is three layers of unidirectional RNN, but it could be bidirectional if you have access to the entire input sequence. The main thing is that the hidden states from one RNN layer are used as the inputs to the next RNN layer, which is why every layer except the last one sets `return_sequences=True`.
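To see what `return_sequences=True` actually changes, here is a small sketch (the input shapes are made up for illustration). With it, an LSTM emits a hidden state for every timestep, which is exactly what the next stacked layer needs as input; without it, only the final hidden state comes out.

```python
import numpy as np
import tensorflow as tf

# A toy batch: 2 sequences, 10 timesteps, 8 features per timestep.
x = np.random.rand(2, 10, 8).astype("float32")

# return_sequences=True -> one hidden state per timestep: (batch, timesteps, units)
seq_out = tf.keras.layers.LSTM(32, return_sequences=True)(x)

# default (return_sequences=False) -> only the last hidden state: (batch, units)
last_out = tf.keras.layers.LSTM(32)(x)

print(seq_out.shape)   # (2, 10, 32)
print(last_out.shape)  # (2, 32)
```

The full `(batch, timesteps, units)` tensor from one layer becomes the input sequence of the layer above it, which is how the stacking works.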

Multi-layer RNNs certainly aren’t as deep as the deep convolutional or feed-forward networks you might have seen in, for example, image tasks. In neural machine translation research, for instance, two to four layers were found to work best for the encoder RNN, and four layers for the decoder RNN.

If you have depth in two dimensions, over the timesteps and over the stacked RNN layers, then these RNNs become very expensive to compute. That’s another reason why they don’t get very deep.

Bidirectionality is useful whenever you can apply it, that is, whenever you have access to the entire input sequence. In that case, you should probably use it by default.
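As a sketch of what that looks like in Keras, here is a bidirectional variant of the stacked model above (the vocabulary size is an assumed placeholder). `tf.keras.layers.Bidirectional` wraps an RNN layer, runs it forwards and backwards over the sequence, and concatenates the two hidden states, so a 32-unit LSTM produces 64 output features.

```python
import tensorflow as tf

vocab_size = 10000  # assumed; set to your vocabulary size

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 16),
    # Each Bidirectional wrapper runs the LSTM in both directions and
    # concatenates the forward and backward hidden states (32 + 32 = 64).
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["acc"])
```

Remember that this only makes sense when the whole sequence is available up front; for left-to-right generation tasks, the backward pass would peek at the future.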

The last tip is that multi-layer RNNs are quite powerful, so you should probably use them if you have enough computational resources. But if you’re going to make your multi-layer RNN quite deep, then you might need skip connections.
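Here is one way such skip connections might look, sketched with the Keras functional API (layer sizes are illustrative, and the embedding dimension is chosen to match the LSTM units so the tensors can be added). Each residual connection adds a layer’s input to its output, which helps gradients flow through deep stacks.

```python
import tensorflow as tf

vocab_size = 10000  # assumed; set to your vocabulary size

inputs = tf.keras.Input(shape=(None,), dtype="int32")
# Embedding dim (32) matches the LSTM units so input and output can be added.
x = tf.keras.layers.Embedding(vocab_size, 32)(inputs)

for _ in range(4):
    h = tf.keras.layers.LSTM(32, return_sequences=True)(x)
    # Skip (residual) connection: layer input + layer output.
    x = tf.keras.layers.Add()([x, h])

x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```

This is a sketch of the general residual idea rather than a specific published recipe; dense (concatenation-based) connections are another common option.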

### Related Post

- Bidirectional LSTM using Keras
- Simple Text Classification using BERT in TensorFlow Keras 2.0
- State-of-the-Art Text Classification using BERT in ten lines of Keras
- Text Classification using Attention Mechanism in Keras
- Replace your RNN and LSTM with Attention-based Transformer model for NLP