Machine Learning · Top 1% Impact

Attention Is All You Need

Ashish Vaswani Google Brain
Noam Shazeer Google Brain
Niki Parmar Google Research
Jakob Uszkoreit Google Research
+4 others
Paper Analysis

"Proposes the Transformer model, which uses self-attention to process sequences in parallel, replacing RNNs and CNNs for state-of-the-art translation quality."

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Because every position is processed in parallel, training is substantially faster, and the model can relate any two positions in a sequence directly.

The results backed this up: the Transformer set new state-of-the-art BLEU scores on the WMT 2014 English-to-German and English-to-French translation tasks while requiring a fraction of the training cost of the best previous models.

Key Contributions

Parallelization

Replaces sequential RNN processing with parallelizable attention layers.

Self-Attention

Allows modeling of dependencies regardless of distance in the sequence.
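
A minimal single-head sketch of the scaled dot-product self-attention behind both contributions, in NumPy with illustrative shapes (the paper's full model uses multi-head attention with separate learned projections per head):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:          (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project every position at once (no recurrence)
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)                # all pairwise interactions in one matrix product
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over the sequence
    return weights @ V                               # each output mixes values from every position

# Illustrative usage: position 0 attends to position 9 as easily as to position 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))                        # 10 tokens, d_model = 64
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # shape (10, 64)
```

Because the full score matrix comes from a single matrix product, the sequence dimension is handled in parallel, and the path between any two positions is one attention step regardless of their distance.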

Peer Discourse

14 comments
Guidelines: Be constructive. Cite sources where possible.
Dr. Emily Zhang Research Scientist, Meta AI
Oct 12, 2024

The scalability of the self-attention mechanism is theoretically sound, but I'm curious about the memory constraints on sequence lengths > 4096. Has anyone benchmarked the quadratic bottleneck on consumer hardware?

Jakob Uszkoreit (Author)
Oct 13, 2024

Valid point. We addressed this in the follow-up work on 'Linear Attention' variants. The quadratic cost is indeed the main limiter for context windows.
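
A back-of-the-envelope sketch of the memory cost under discussion, with illustrative head and layer counts (not a benchmark):

```python
# Rough size of the attention-weight matrices alone at sequence length n,
# ignoring activations, gradients, and framework overhead.
def attention_matrix_bytes(n, heads=8, layers=6, bytes_per_elem=4):
    return n * n * heads * layers * bytes_per_elem

for n in (1024, 4096, 16384):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>6}: ~{gib:.2f} GiB of attention weights")
# n=  1024: ~0.19 GiB, n=  4096: ~3.00 GiB, n= 16384: ~48.00 GiB
```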

Michael Chen PhD Candidate, Stanford
Oct 15, 2024

For those implementing this from scratch: pay attention to the scaling factor in the dot-product attention. Forgetting the division by sqrt(d_k) completely destabilizes the gradients during early training.
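
A quick NumPy illustration of that failure mode, with hypothetical dimensions rather than any reference implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)
K = rng.normal(size=(8, d_k))

raw = K @ q                    # variance of the dot products grows with d_k
scaled = raw / np.sqrt(d_k)    # rescaling keeps the variance near 1

print(softmax(raw))            # nearly one-hot: saturated softmax, vanishing gradients
print(softmax(scaled))         # smoother weights, so every position receives gradient
```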
