Chapter Summary

Key Points

  1. LSTM is the workhorse recurrent cell. The cell state acts as a gradient highway, enabling the network to learn long-range dependencies. Use a GRU when you want fewer parameters with similar performance; the first sketch after this list compares the parameter counts.

  2. Always pack variable-length sequences. Call pack_padded_sequence before the LSTM and pad_packed_sequence after it. Packing avoids wasted computation on padding and keeps the final hidden states from being corrupted by padded time steps, as shown in the packing sketch below.

  3. Seq2Seq encodes the input into a context vector, and the decoder generates the output autoregressively. Teacher forcing accelerates training but creates exposure bias; beam search improves inference quality. A teacher-forcing sketch follows the list.

  4. TCNs offer parallel sequence processing. Causal dilated convolutions process all time steps simultaneously, enabling much faster training than LSTMs, and the receptive field grows exponentially with depth when the dilation doubles at each layer (see the causal-convolution sketch below).

  5. Gradient clipping is essential for RNN training. Repeated multiplication through the recurrent connections can blow up gradient magnitudes, so always apply clip_grad_norm_ with max_norm around 1 to 5, as in the clipping sketch below.
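
The sketches below expand on the key points in order. First, the LSTM-versus-GRU comparison from point 1: the layer sizes are arbitrary, but the four-gate versus three-gate structure is what drives the parameter difference.

```python
import torch.nn as nn

# Same input and hidden sizes; only the cell type differs (sizes are illustrative).
lstm = nn.LSTM(input_size=128, hidden_size=256)
gru = nn.GRU(input_size=128, hidden_size=256)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm))  # 395264 -- four gates
print(count(gru))   # 296448 -- three gates, roughly 25% fewer parameters
```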
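
For point 2, a minimal packing round trip around an LSTM. The batch shape and sequence lengths are made up for illustration; the utilities are the standard ones from torch.nn.utils.rnn.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

batch_size, max_len, input_size, hidden_size = 4, 10, 8, 16  # illustrative sizes
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

# Padded batch plus the true (unpadded) length of each sequence.
x = torch.randn(batch_size, max_len, input_size)
lengths = torch.tensor([10, 7, 5, 3])

# Pack so the LSTM skips padded time steps entirely.
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor for downstream layers.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)  # torch.Size([4, 10, 16])
print(h_n.shape)  # torch.Size([1, 4, 16]) -- last *real* hidden state per sequence
```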
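
For point 3, a sketch of a decoder loop trained with teacher forcing. The encoder is stubbed out with zero initial states and all sizes are placeholders; the important line is the one that feeds the gold token back in instead of the model's own prediction.

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size, batch = 1000, 64, 128, 4  # placeholder sizes
embed = nn.Embedding(vocab_size, emb_size)
decoder = nn.LSTMCell(emb_size, hidden_size)
out_proj = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for the encoder's context; a real Seq2Seq model would set (h, c)
# from the encoder's final states.
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
target = torch.randint(0, vocab_size, (batch, 12))  # gold output tokens

loss = 0.0
inp = target[:, 0]  # <sos> stand-in
for t in range(1, target.size(1)):
    h, c = decoder(embed(inp), (h, c))
    loss = loss + loss_fn(out_proj(h), target[:, t])
    inp = target[:, t]  # teacher forcing: feed the gold token, not the prediction
loss = loss / (target.size(1) - 1)
loss.backward()
```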
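
For point 4, a sketch of the causal dilated convolution that a TCN stacks. The CausalConv1d module and its sizes are hypothetical; left-padding by (kernel_size - 1) * dilation keeps the convolution causal, and doubling the dilation per layer makes the receptive field grow exponentially with depth.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution whose output at time t sees only inputs at times <= t."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left padding only
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

# Dilations 1, 2, 4, 8 with kernel size 3 give a receptive field of 31 steps,
# and every time step is processed in parallel.
tcn = nn.Sequential(*[CausalConv1d(16, kernel_size=3, dilation=2**i) for i in range(4)])
x = torch.randn(8, 16, 100)
print(tcn(x).shape)  # torch.Size([8, 16, 100])
```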
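
Finally, for point 5, gradient clipping applied to a single training step. The toy model, data, and learning rate are placeholders; the essential detail is that clip_grad_norm_ runs after backward() and before optimizer.step().

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

# Toy LSTM classifier; all sizes are illustrative.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 20, 8)      # (batch, time, features)
y = torch.randint(0, 2, (32,))  # class labels

optimizer.zero_grad()
out, _ = lstm(x)
loss = loss_fn(head(out[:, -1]), y)  # classify from the last time step
loss.backward()

# Rescale gradients so their global norm is at most max_norm, then step.
clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```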

Looking Ahead

Chapter 30 introduces attention mechanisms that allow models to selectively focus on relevant parts of the input, overcoming the bottleneck of compressing the entire sequence into a fixed-size vector.