The paper **“Attention Is All You Need”** from Google proposes a novel neural network architecture based on a self-attention mechanism that the authors believe to be particularly well-suited for language understanding.

# Introduction

## Current Recurrent Neural Networks

**Advantages**
- Popular and successful for variable-length representations such as sequences (e.g. language, time series)
- Gating models such as *LSTM* or *GRU* help with long-range error propagation
- RNNs are considered the core of Seq2Seq models with attention

**Disadvantages**
- Sequentiality prohibits parallelization within training instances
- Sequence-aligned states in RNNs are wasteful
- Hard to model hierarchical domains such as language
- Long-range dependencies are still tricky despite gating models, so there have been attempts to solve this with CNNs

## Current Convolutional Neural Networks

**Advantages**
- Trivial to parallelize (per layer)
- Fits the intuition that most dependencies are local
- The long-range dependency problem of RNNs can be mitigated by using convolution
- Path length between positions can be logarithmic when using dilated convolutions with left-padding for text (autoregressive CNNs such as WaveNet and ByteNet)
- The number of operations needed to relate two positions grows as O(log n) for *ByteNet* and O(n) for *ConvS2S*

**Disadvantages**
- Long-range dependencies in CNNs are referenced by position, while referencing by content (context) makes much more sense
- See, for example, the *WaveNet* and *ByteNet* architectures (figures omitted)

# Attention

Attention between encoder and decoder is crucial in *Neural Machine Translation (NMT)*.
So, why not use (self-)attention for the representations?

## Self-Attention

*Convolution* sees only positional neighborhoods, while *self-attention* looks at everything and weighs it by similarity.

- Sometimes called *intra-attention*
- Relates different positions of a single sequence in order to compute a representation of that sequence
- Makes a pairwise comparison between every two positions in the signal
- The comparisons produce, for each position, a distribution over all positions, which is used as the weights of a weighted average (see the sketch below)
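
As a minimal sketch of that idea in NumPy (the toy 4x8 signal and the plain dot-product similarity are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy signal: 4 positions, each represented by an 8-dimensional vector.
x = np.random.randn(4, 8)

scores = x @ x.T            # pairwise comparison between every two positions
weights = softmax(scores)   # one distribution per position over all positions
output = weights @ x        # each position becomes a weighted average of the signal
```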

## Why Self-Attention?

- Lower total computational complexity per layer than RNNs when the sequence length n is smaller than the representation dimension d (O(n²·d) vs. O(n·d²))
- More computation can be parallelized, as measured by the minimum number of required sequential operations (O(1) vs. O(n) for RNNs)
- Constant path length between long-distance dependencies in the network
- Self-attention can yield more interpretable models, since the attention distributions can be inspected

## Three Ways of Attention

The Transformer uses attention in three ways:

- *Encoder-decoder attention*: queries come from the decoder, keys and values from the encoder output
- *Encoder self-attention*: each position in the encoder attends to all positions in the previous encoder layer
- *Decoder (masked) self-attention*: each position in the decoder attends only to earlier positions in the decoder

# The Transformer

*Encoder-decoder structure* of general neural sequence transduction models:

- The *encoder* maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n).
- Given z, the *decoder* then generates an output sequence (y_1, ..., y_m) of symbols one element at a time.
- At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next (see the decoding sketch after this list).

- The *Transformer* follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
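
A sketch of that auto-regressive loop (the `encode` and `decode` callables and the token IDs below are placeholders for illustration, not the paper's API; the paper itself decodes with beam search):

```python
import numpy as np

def greedy_decode(encode, decode, src_tokens, bos_id=1, eos_id=2, max_len=50):
    """Sketch of auto-regressive generation; encode/decode are placeholder callables.

    encode(src_tokens)         -> continuous representations z = (z_1, ..., z_n)
    decode(memory, out_tokens) -> probability distribution over the next symbol
    """
    memory = encode(src_tokens)
    out = [bos_id]                        # previously generated symbols
    for _ in range(max_len):
        probs = decode(memory, out)       # conditioned on z and all prior outputs
        next_id = int(np.argmax(probs))   # greedy choice (the paper uses beam search)
        out.append(next_id)
        if next_id == eos_id:
            break
    return out
```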

**Encoder**
- Composed of a stack of N = 6 identical layers, each with two sub-layers
- The first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network
- A residual connection is employed around each of the two sub-layers, followed by layer normalization (see the sketch below)
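
A minimal NumPy sketch of that sub-layer wrapper, i.e. LayerNorm(x + Sublayer(x)); the gain/bias parameters of layer normalization and the placeholder sub-layer are simplifications for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Simplified layer normalization (learned gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)                     # 10 positions, d_model = 512
y = sublayer_connection(x, lambda h: h * 0.5)    # placeholder sub-layer for illustration
```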

**Decoder**
- Composed of a stack of N = 6 identical layers, each with three sub-layers
- Two are the same as in the encoder, and the third sub-layer performs multi-head attention over the output of the encoder stack
- Each sub-layer has a residual connection, followed by layer normalization
- The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent positions (a masking sketch follows below)
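
A small NumPy sketch of that masking step (the sequence length and random scores are illustrative): entries (i, j) with j > i are set to negative infinity before the softmax, so position i cannot attend to later positions.

```python
import numpy as np

n = 5                                     # sequence length
scores = np.random.randn(n, n)            # raw attention scores

# Mask out future positions: entry (i, j) with j > i becomes -inf.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1; future weights are 0
```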

- Example of how the Transformer works (figure omitted)

## Scaled Dot-Product Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The proposed **scaled dot-product attention** is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

- Q: the query vector (the current word being operated on)
- K: the key memory (used to find the key most similar to the query)
- V: the value memory (returns the value corresponding to the matched key)
- Queries and keys have dimension d_k
- Values have dimension d_v

The two most commonly used attention functions are *additive attention* and *dot-product attention*.
The Transformer uses *dot-product attention* with a scaling factor of 1/√d_k.

- Dot-product attention is faster and more space-efficient in practice, since it can be implemented with highly optimized matrix multiplication code.
- When d_k is large, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients, so the dot products are scaled by 1/√d_k (a NumPy sketch follows after the comparison below).
- The table below compares per-layer complexity among self-attention, recurrent, and convolutional layers (n = sequence length, d = representation dimension, k = kernel width):

| Layer type | Complexity per layer | Sequential operations | Maximum path length |
| --- | --- | --- | --- |
| Self-attention | O(n²·d) | O(1) | O(1) |
| Recurrent | O(n·d²) | O(n) | O(n) |
| Convolutional | O(k·n·d²) | O(1) | O(log_k n) |

- Almost every time, d (usually around 1000) is larger than n (usually around 20), so self-attention, O(n²·d), is considerably faster than recurrence, O(n·d²).
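
A minimal NumPy sketch of the scaled dot-product attention defined above (the shapes and variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaling keeps softmax gradients from vanishing
    return softmax(scores) @ V

# Illustrative shapes: 4 queries, 6 key-value pairs, d_k = 8, d_v = 16.
Q = np.random.randn(4, 8)
K = np.random.randn(6, 8)
V = np.random.randn(6, 16)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 16)
```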

## Multi-Head Attention

- What’s missing from Self-Attention?
- Convolution: a different linear transformation for each relative position. Allows you to distinguish what information came from where.
- Self-Attention: a weighted average

- The fix: **multi-head attention**
- Multiple attention layers (heads) run in parallel (shown by different colors in the original figure)
- Each head uses different linear transformations.
- Different heads can learn different relationships.

Instead of performing a single attention function with d_model-dimensional keys, values, and queries, it is beneficial to linearly project the queries, keys, and values h times with different, learned projections to d_k, d_k, and d_v dimensions, respectively. On each of these projected versions of queries, keys, and values the attention function is then performed in parallel, yielding d_v-dimensional output values.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
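
A compact NumPy sketch along these lines (d_model = 512 and h = 8 follow the paper's base model; the random projection matrices stand in for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, h=8, d_model=512, rng=np.random.default_rng(0)):
    d_k = d_model // h                        # 64 per head in the base model
    heads = []
    for _ in range(h):
        # Learned projections in the real model; random placeholders here.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = Q @ K.T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V)     # each head attends differently
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_o   # concatenate heads, project back

x = np.random.randn(10, 512)                  # 10 positions, d_model = 512
y = multi_head_attention(x)                   # shape (10, 512)
```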

## Position-wise Feed-Forward Networks

- A feed-forward network applied to each position separately and identically.
- Consists of two linear transformations with a ReLU activation in between (see the sketch below).
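
A sketch of that network, FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (d_model = 512 and d_ff = 2048 follow the paper's base model; the random weights are placeholders for learned parameters):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position identically."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

x = np.random.randn(10, d_model)                  # 10 positions
y = position_wise_ffn(x, W1, b1, W2, b2)          # same transformation at every position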

## Positional Encoding

- Injects position information, since the model contains no recurrent or convolutional layers.
- There are many choices of positional encodings; in this work, sine and cosine functions of different frequencies are used (see the sketch below).
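
A sketch of the sinusoidal encoding, following the paper's formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the max_len value is an illustrative choice:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sine/cosine encodings of different frequencies, one row per position."""
    pos = np.arange(max_len)[:, None]          # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices (2i)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)   # added to the input embeddings
```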

# Experiments

**Data and batching**
- WMT 2014 English-German and WMT 2014 English-French datasets
- English-German: about 4.5 million sentence pairs, with a shared byte-pair-encoding vocabulary of roughly 37,000 tokens
- English-French: about 36 million sentence pairs, with tokens split into a 32,000 word-piece vocabulary
- Sentence pairs were batched together by approximate sequence length

**Hardware and schedule**
- One machine with 8 NVIDIA P100 GPUs
- Base model
  - Approximately 0.4 seconds per training step
  - 100,000 steps in total, about 12 hours of training
- Big model
  - Approximately 1.0 second per training step
  - 300,000 steps in total, about 3.5 days of training

## Machine Translation Results: WMT-14

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4.

On the WMT 2014 English-to-French translation task, the big model achieves a BLEU score of 41.0, outperforming all previously published single models at less than 1/4 the training cost of the previous state-of-the-art model.

## Variations on the Transformer

To evaluate the importance of different components of the Transformer, the authors varied the base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013.