The Transformer architecture, introduced in the paper "Attention Is All You Need" [1], revolutionized machine learning and natural language processing (NLP). It became the foundation for state-of-the-art models such as BERT [2] and GPT [3]. But what exactly makes the Transformer so revolutionary? In this post, we will delve into the key innovations and how they work, with a focus on the math behind them.
1. Limitations of Previous Models
Before the advent of Transformers, models like RNNs (Recurrent Neural Networks) [4] and LSTMs (Long Short-Term Memory networks) processed information sequentially. This meant that the hidden state at time $t$ was updated based on the input at time $t$ and the hidden state from the previous time step $t-1$, as shown below:

$$h_t = f(W_h h_{t-1} + W_x x_t)$$

Where:
- $W_h$ is the weight matrix for the hidden state.
- $W_x$ is the weight matrix for the input at time $t$.
- $f$ is a non-linear activation function, such as $\tanh$ (or sigmoid).
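To make the recurrence concrete, here is a minimal NumPy sketch of this update (the dimensions and random weights are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration
hidden_size, input_size = 8, 4

rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden_size, hidden_size))  # weight matrix for the hidden state
W_x = rng.normal(size=(hidden_size, input_size))   # weight matrix for the input
h = np.zeros(hidden_size)                          # initial hidden state h_0

def rnn_step(h_prev, x_t):
    """One recurrent update: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t)

# The sequence must be processed one step at a time.
sequence = rng.normal(size=(5, input_size))  # 5 time steps of toy input
for x_t in sequence:
    h = rnn_step(h, x_t)
```

Note how the loop must walk through the sequence one step at a time; this is exactly the sequential bottleneck discussed next.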
While effective for short sequences, this process struggled with long-range dependencies due to the vanishing gradient problem. Sequential processing also slowed down training since each word depended on processing the previous word.
2. Self-Attention: A Game-Changer
The breakthrough of the Transformer architecture lies in self-attention. In self-attention, instead of processing words one by one, the model considers all words in the input sequence at once. Each word attends to every other word in the sentence to understand context, using attention scores.
The attention mechanism computes a weighted sum of values based on how relevant other words (or tokens) are. The relevance is determined by querying the word's relationship to other words using a query and key mechanism. The formula for calculating the attention scores is as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ is the query matrix.
- $K$ is the key matrix.
- $V$ is the value matrix.
- $d_k$ is the dimension of the key/query vectors.
- The softmax function normalizes the attention scores.
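Here is a minimal NumPy sketch of this formula (the token count and vector sizes are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise relevance of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: 6 tokens, d_k = d_v = 4 (hypothetical sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape)  # (6, 6): one attention distribution per token
```

Each row of `weights` is one token's attention distribution over every token in the sequence.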
This allows the model to weigh the importance of different words. For example, in the sentence, “The cat sat on the mat, and it was happy,” the model can understand that the word “it” refers to “cat” by assigning a higher attention score between these two words.
3. Parallelization for Faster Training
One of the major advantages of the Transformer over previous architectures is its ability to parallelize computations. Since each word attends to every other word at once (using self-attention), the model can process all words simultaneously rather than sequentially.
This results in significantly faster training because the model doesn’t need to wait for previous words to be processed before moving on to the next. In mathematical terms, the self-attention mechanism computes all pairwise relationships between words in parallel, which can be efficiently implemented using matrix multiplications.
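As a small illustration of that point, the sketch below (NumPy, hypothetical shapes) computes the same pairwise scores twice: once with an explicit loop over positions, and once with a single matrix multiplication that produces them all at once:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.normal(size=(6, 4))   # one query vector per token
K = rng.normal(size=(6, 4))   # one key vector per token

# Sequential view: compute each pairwise score one pair at a time.
scores_loop = np.empty((6, 6))
for i in range(6):
    for j in range(6):
        scores_loop[i, j] = Q[i] @ K[j]

# Parallel view: a single matrix multiplication produces the same scores at once.
scores_matmul = Q @ K.T

assert np.allclose(scores_loop, scores_matmul)
```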
4. Positional Encoding
Since Transformers do not process sequences in order, they need a way to represent the position of words in a sentence. This is done through positional encodings. The positional encoding for each word at position $pos$ is added to the input embeddings, helping the model understand the order of words in the sequence.

The positional encoding formula is as follows:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Where:
- $pos$ is the word's position in the sequence.
- $i$ indexes the dimension within the word embedding.
- $d_{\text{model}}$ is the dimension of the word embeddings.
These sine and cosine functions encode positions in a way that allows the model to learn patterns of relationships between words in a sequence, even though it processes them all in parallel.
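A minimal NumPy sketch of these sinusoidal encodings (sequence length and embedding size are hypothetical) might look like this:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]      # shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions get cosine
    return pe

# Added to the word embeddings before the first attention layer.
embeddings = np.random.default_rng(0).normal(size=(10, 16))  # 10 tokens, d_model = 16
inputs = embeddings + sinusoidal_positional_encoding(10, 16)
```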
5. Scalability and Pretraining
The self-attention mechanism is also highly scalable. It allows for deeper models with more layers and can handle larger datasets. By pretraining large models on vast amounts of text data, researchers can create general-purpose language models like BERT and GPT-3 that are fine-tuned on specific tasks, such as sentiment analysis or machine translation.
These pre-trained models leverage the power of large-scale attention mechanisms to improve their performance. The output of the self-attention mechanism can be fed into a stack of feedforward layers, making it easier for the model to learn complex patterns in the data.
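As a rough sketch of that idea, the snippet below passes a stand-in for the attention output through a position-wise feedforward layer (a full Transformer block also includes multi-head attention, residual connections, and layer normalization, which are omitted here; all sizes are hypothetical):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feedforward layer: ReLU(x W1 + b1) W2 + b2, applied to every token."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Hypothetical sizes: 6 tokens, model width d_model = 4, inner width d_ff = 16
rng = np.random.default_rng(0)
d_model, d_ff = 4, 16
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

attended = rng.normal(size=(6, d_model))              # stand-in for the self-attention output
block_output = feed_forward(attended, W1, b1, W2, b2)  # shape (6, d_model)
```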
Conclusion
The Transformer architecture has transformed machine learning, primarily through the use of self-attention and the ability to parallelize computations. This revolutionary approach overcame the limitations of sequential models like RNNs and LSTMs, enabling the model to capture long-range dependencies and scale efficiently. With its flexibility, the Transformer architecture now underpins many of the most advanced AI models used today in NLP, computer vision, and beyond.