Attention lets a model focus on relevant context regardless of position, which is key to handling long sequences.
How It Works
- Queries, keys, and values are computed from the input via learned projections
- Attention weights come from scaled dot products of queries and keys, passed through a softmax
- The output is a weighted combination of the values
- Multiple heads attend to different aspects of the input in parallel (see the sketch below)
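A minimal NumPy sketch of these steps, assuming a single sequence of shape (seq_len, d_model) and random projection matrices standing in for learned ones; the function names `attention` and `multi_head_attention` are illustrative, not a specific library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k). Scores compare every query with every key;
    # scaling by sqrt(d_k) keeps dot products from growing with dimension.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted combination of values

def multi_head_attention(X, num_heads, rng):
    # X: (seq_len, d_model). Each head gets its own projections
    # (random here for illustration) and attends to its own subspace.
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        Wk = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        Wv = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    # Concatenate head outputs and mix them with a final projection.
    Wo = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ Wo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 16))  # 5 tokens, d_model = 16
    print(multi_head_attention(X, num_heads=4, rng=rng).shape)  # (5, 16)
```

Everything here is a handful of matrix multiplies over all positions at once, which is what makes attention parallel-friendly compared with step-by-step recurrence.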
Significance
- Captures long-range dependencies without stepping through intermediate positions
- Parallelizable across positions (unlike sequential RNN processing)
- Foundation of the Transformer architecture behind modern LLMs