Scaled Dot-Product Attention
Scaled Dot-Product Attention is a fundamental building block of the Transformer architecture, introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017. This mechanism allows the model to weigh the importance of different tokens in the input sequence when generating a representation or an output.
Overview
Attention mechanisms help models focus on relevant parts of the input sequence, effectively allowing them to handle long-range dependencies without the need for recurrence. Scaled Dot-Product Attention operates by calculating a weighted sum of values based on the similarity between queries and keys.
Components of Scaled Dot-Product Attention
Scaled Dot-Product Attention can be described mathematically using three main components:
- Queries (Q): a matrix of query vectors, with one row per query, each of dimension $d_k$.
- Keys (K): a matrix of key vectors, with one row per key, each of dimension $d_k$.
- Values (V): a matrix of value vectors associated with the keys, with one row per key.
The attention mechanism can be computed using the following formula:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Where:
- $d_k$ is the dimension of the keys (and queries).
- The softmax function is applied row-wise to the scaled dot products to normalize the scores into a probability distribution over the keys.
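For reference, here is a minimal NumPy sketch of this formula; the function and variable names are illustrative choices, not from the original paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns an array of shape (n_queries, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scaled dot products, (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of the values
```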
Step-by-Step Calculation
1. Dot Product: Compute the dot products of each query with all keys.
2. Scaling: Divide the dot products by the square root of the key dimensionality, $\sqrt{d_k}$, to prevent excessively large values.
3. Softmax: Apply the softmax function to obtain the attention weights.
4. Weighted Sum: Multiply these weights by the values to get the output.

These four steps are traced in the sketch below.
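Purely to illustrate the tensor shapes involved at each step, the computation can be traced in NumPy as follows (the sizes and variable names here are hypothetical):

```python
import numpy as np

# Hypothetical sizes: 4 queries, 6 keys/values, d_k = 8, d_v = 16.
Q = np.random.randn(4, 8)    # (n_queries, d_k)
K = np.random.randn(6, 8)    # (n_keys, d_k)
V = np.random.randn(6, 16)   # (n_keys, d_v)

scores = Q @ K.T                      # step 1: raw dot products, shape (4, 6)
scaled = scores / np.sqrt(Q.shape[-1])  # step 2: same shape, smaller magnitude
weights = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)  # step 3: rows sum to 1
output = weights @ V                  # step 4: one output vector per query, shape (4, 16)
```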
Example Calculation
Consider a simple example with two queries, three keys, and three scalar values:
- Queries (Q): $$ Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $$
- Keys (K): $$ K = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} $$
- Values (V): $$ V = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} $$
1. Dot Product: Each query is compared against all three keys, so each query yields a row of three scores.
   - For the first query: $$ Q_1 K^T = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \end{bmatrix} $$
   - For the second query: $$ Q_2 K^T = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 1 \end{bmatrix} $$
2. Scaling: With $d_k = 2$, each score is divided by $\sqrt{2} \approx 1.414$:
   - First query: $$ \begin{bmatrix} 0.707 & 0 & 0.707 \end{bmatrix} $$
   - Second query: $$ \begin{bmatrix} 0 & 0.707 & 0.707 \end{bmatrix} $$
3. Softmax: Applying the softmax to each row of scaled scores gives the attention weights:
   - First query: $$ \text{softmax}(0.707, 0, 0.707) \approx \begin{bmatrix} 0.401 & 0.198 & 0.401 \end{bmatrix} $$
   - Second query: $$ \text{softmax}(0, 0.707, 0.707) \approx \begin{bmatrix} 0.198 & 0.401 & 0.401 \end{bmatrix} $$
4. Weighted Sum: Each output is the attention-weighted combination of the values:
   - First query: $$ 0.401 \cdot 1 + 0.198 \cdot 2 + 0.401 \cdot 3 \approx 2.00 $$
   - Second query: $$ 0.198 \cdot 1 + 0.401 \cdot 2 + 0.401 \cdot 3 \approx 2.20 $$
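As a sanity check, computing the same example with NumPy reproduces these values (up to rounding):

```python
import numpy as np

Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V = np.array([[1.0],
              [2.0],
              [3.0]])

scaled = Q @ K.T / np.sqrt(K.shape[-1])          # scaled scores, shape (2, 3)
weights = np.exp(scaled)
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

print(weights.round(3))        # approx [[0.401 0.198 0.401]
                               #         [0.198 0.401 0.401]]
print((weights @ V).round(2))  # approx [[2.0]
                               #         [2.2]]
```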