I am using LSTM Networks for Multivariate Multi-Timestep predictions.
So it is basically a seq2seq prediction where n_inputs past timesteps are fed into the model in order to predict the next n_outputs timesteps of a time series.
My question is how to meaningfully apply Dropout and BatchNormalization, as this appears to be a highly discussed topic for recurrent and therefore LSTM networks. Let's stick to Keras as the framework for the sake of simplicity.
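For context, the data is arranged as sliding windows; the concrete numbers, the random arrays, and the hyperparameter values below are only placeholders to make the dimensions used in the snippets explicit:
import numpy as np

# X: (samples, n_inputs, n_features)  -> multivariate input window fed to the LSTM
# y: (samples, n_outputs)             -> future timesteps to predict (Case 1)
# for Case 2 below, which ends in TimeDistributed(Dense(1)), the targets
# are shaped (samples, n_outputs, 1) instead
n_samples, n_inputs, n_outputs, n_features = 1000, 24, 12, 5
X = np.random.rand(n_samples, n_inputs, n_features)
y = np.random.rand(n_samples, n_outputs)

# placeholder hyperparameters used in the snippets below
n_blocks, dropout_rate, activation = 64, 0.3, 'tanh'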
Case 1: Vanilla LSTM
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation, BatchNormalization, RepeatVector, TimeDistributed

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(Dense(int(n_blocks/2)))
model.add(BatchNormalization())
model.add(Activation(activation))
model.add(Dense(n_outputs))
- Q1: Is it good practice to avoid BatchNormalization directly after LSTM layers?
- Q2: Is it good practice to use Dropout inside the LSTM layer?
- Q3: Is the usage of BatchNormalization and Dropout between the Dense layers good practice?
- Q4: If I stack multiple LSTM layers, is it a good idea to use BatchNormalization between them?
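To make Q4 concrete, the stacked variant I have in mind looks roughly like this; whether the BatchNormalization between the two recurrent layers is sensible is exactly what I am asking:
model = Sequential()
model.add(LSTM(n_blocks, activation=activation, return_sequences=True,
               input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(BatchNormalization())  # normalizes the full sequence output of the first LSTM
model.add(LSTM(n_blocks, activation=activation, dropout=dropout_rate))
model.add(Dense(n_outputs))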
Case 2: Encoder-Decoder-like LSTM with TimeDistributed layers
model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(RepeatVector(n_outputs))
model.add(LSTM(n_blocks, activation=activation, return_sequences=True, dropout=dropout_rate))
model.add(TimeDistributed(Dense(int(n_blocks/2), use_bias=False)))  # no bias needed, BatchNormalization follows
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Activation(activation)))
model.add(TimeDistributed(Dropout(dropout_rate)))
model.add(TimeDistributed(Dense(1)))
- Q5: Should BatchNormalization and Dropout be wrapped inside TimeDistributed layers when used between TimeDistributed(Dense()) layers, or is it correct to leave them unwrapped?
- Q6: Can or should BatchNormalization be applied after, before, or in between the encoder and decoder LSTM blocks?
- Q7: If a ConvLSTM2D layer is used as the first layer (encoder), would this make a difference in the usage of Dropout and BatchNormalization? (A sketch of this variant is at the end of the post.)
- Q8: Should the recurrent_dropout argument be used inside the LSTM blocks? If yes, should it be combined with the normal dropout argument used in the example, or should it replace it?

Thank you very much in advance!
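For reference, this is the ConvLSTM2D encoder variant I am referring to in Q7, with dropout and recurrent_dropout combined as asked in Q8. The reshape of each input window into n_seq sub-sequences of n_steps steps (so that n_inputs = n_seq * n_steps) and the filter/kernel sizes are only placeholders:
from keras.layers import ConvLSTM2D, Flatten

model = Sequential()
model.add(ConvLSTM2D(filters=64, kernel_size=(1, 3), activation=activation,
                     input_shape=(n_seq, 1, n_steps, n_features),  # 5D input per sample: (time, rows, cols, channels)
                     dropout=dropout_rate,              # dropout on the input connections
                     recurrent_dropout=dropout_rate))   # dropout on the recurrent state (Q8)
model.add(Flatten())
model.add(RepeatVector(n_outputs))
model.add(LSTM(n_blocks, activation=activation, return_sequences=True,
               dropout=dropout_rate, recurrent_dropout=dropout_rate))
model.add(TimeDistributed(Dense(1)))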