I am using LSTM Networks for Multivariate Multi-Timestep predictions.
So it is basically a seq2seq prediction where n_inputs past timesteps are fed into the model in order to predict the next n_outputs timesteps of a time series.
My question is how to meaningfully apply Dropout and BatchNormalization, as this appears to be a highly discussed topic for recurrent and therefore LSTM networks. Let's stick to Keras as the framework for the sake of simplicity.
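For context, the data is arranged as sliding windows; the concrete numbers, the random arrays, and the hyperparameter values below are only placeholders to make the dimensions used in the snippets explicit:
import numpy as np

# X: (samples, n_inputs, n_features)  -> multivariate input window fed to the LSTM
# y: (samples, n_outputs)             -> future timesteps to predict (Case 1)
# for Case 2 below, which ends in TimeDistributed(Dense(1)), the targets
# are shaped (samples, n_outputs, 1) instead
n_samples, n_inputs, n_outputs, n_features = 1000, 24, 12, 5
X = np.random.rand(n_samples, n_inputs, n_features)
y = np.random.rand(n_samples, n_outputs)

# placeholder hyperparameters used in the snippets below
n_blocks, dropout_rate, activation = 64, 0.3, 'tanh'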
Case 1: Vanilla LSTM
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation, BatchNormalization, RepeatVector, TimeDistributed

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(Dense(int(n_blocks/2)))
model.add(BatchNormalization())
model.add(Activation(activation))
model.add(Dense(n_outputs))
- Q1: Is it good practice to avoid BatchNormalization directly after LSTM layers?
- Q2: Is it good practice to use Dropout inside the LSTM layer?
- Q3: Is the usage of BatchNormalization and Dropout between the Dense layers good practice?
- Q4: If I stack multiple LSTM layers, is it a good idea to use BatchNormalization between them?
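To make Q4 concrete, the stacked variant I have in mind looks roughly like this; whether the BatchNormalization between the two recurrent layers is sensible is exactly what I am asking:
model = Sequential()
model.add(LSTM(n_blocks, activation=activation, return_sequences=True,
               input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(BatchNormalization())  # normalizes the full sequence output of the first LSTM
model.add(LSTM(n_blocks, activation=activation, dropout=dropout_rate))
model.add(Dense(n_outputs))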
Case 2: Encoder-Decoder-like LSTM with TimeDistributed layers
model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(RepeatVector(n_outputs))
model.add(LSTM(n_blocks, activation=activation, return_sequences=True, dropout=dropout_rate))
model.add(TimeDistributed(Dense(int(n_blocks/2), use_bias=False)))  # no bias needed, BatchNormalization follows
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Activation(activation)))
model.add(TimeDistributed(Dropout(dropout_rate)))
model.add(TimeDistributed(Dense(1)))
- Q5: Should BatchNormalization and Dropout be wrapped inside TimeDistributed layers when used between TimeDistributed(Dense()) layers, or is it correct to leave them unwrapped?
- Q6: Can or should BatchNormalization be applied after, before, or in between the encoder and decoder LSTM blocks?
- Q7: If a ConvLSTM2D layer is used as the first layer (encoder), would this make a difference in the usage of Dropout and BatchNormalization? (A sketch of this variant is at the end of the post.)
- Q8: Should the recurrent_dropout argument be used inside the LSTM blocks? If yes, should it be combined with the normal dropout argument used in the example, or should it replace it?

Thank you very much in advance!
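For reference, this is the ConvLSTM2D encoder variant I am referring to in Q7, with dropout and recurrent_dropout combined as asked in Q8. The reshape of each input window into n_seq sub-sequences of n_steps steps (so that n_inputs = n_seq * n_steps) and the filter/kernel sizes are only placeholders:
from keras.layers import ConvLSTM2D, Flatten

model = Sequential()
model.add(ConvLSTM2D(filters=64, kernel_size=(1, 3), activation=activation,
                     input_shape=(n_seq, 1, n_steps, n_features),  # 5D input per sample: (time, rows, cols, channels)
                     dropout=dropout_rate,              # dropout on the input connections
                     recurrent_dropout=dropout_rate))   # dropout on the recurrent state (Q8)
model.add(Flatten())
model.add(RepeatVector(n_outputs))
model.add(LSTM(n_blocks, activation=activation, return_sequences=True,
               dropout=dropout_rate, recurrent_dropout=dropout_rate))
model.add(TimeDistributed(Dense(1)))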