References & Further Reading

References

  1. S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, 1997

    The original LSTM paper, introducing gated memory cells to mitigate the vanishing gradient problem.

  2. K. Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP, 2014

    Introduces the GRU cell and the RNN encoder-decoder architecture underlying sequence-to-sequence (Seq2Seq) models.

  3. S. Bai, J. Z. Kolter, and V. Koltun, An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, arXiv:1803.01271, 2018

    Shows that temporal convolutional networks (TCNs) often outperform LSTMs and GRUs on standard sequence-modeling benchmarks.

  4. A. Graves, A. Mohamed, and G. Hinton, Speech Recognition with Deep Recurrent Neural Networks, ICASSP, 2013

    Demonstrates deep bidirectional LSTMs for speech recognition, achieving then-state-of-the-art results on the TIMIT benchmark.

Further Reading

  • Understanding LSTM Networks

    Christopher Olah's blog post (colah.github.io)

    A widely recommended visual explanation of LSTM gates and information flow.

  • TCN reference implementation

    https://github.com/locuslab/TCN

    The authors' reference implementation (in PyTorch) of the temporal convolutional networks evaluated in reference [3].