Why Transformer?
Before Transformers
Model Structure
attention mechanism
The attention mechanism allows the model to attend to every token in the sequence, with a different amount of focus on each token.
scaled dot-product attention
Before applying softmax, the dot-product attention scores should be divided by $\sqrt{d_{k}}$. For large $d_{k}$, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients; scaling avoids this gradient vanishing and the resulting slow training.
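A minimal NumPy sketch of scaled dot-product attention (shapes and random inputs are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Dividing the logits by sqrt(d_k) keeps their variance roughly
    constant as d_k grows, so the softmax does not saturate.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 64))
out = scaled_dot_product_attention(Q, K, V)
```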
self-attention
mask interactions between two tokens by setting their attention scores to $-\infty$ before the softmax layer.
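A small sketch of this masking trick: entries set to $-\infty$ before the softmax end up with exactly zero attention weight afterwards.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask: position i may only attend to j <= i."""
    seq_len = scores.shape[-1]
    # True above the diagonal -> these interactions are forbidden
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                      # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.zeros((4, 4)))
# upper triangle of w is exactly zero; each row still sums to 1
```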
cross-attention
In self-attention, the queries, keys, and values all come from the same input sequence. In cross-attention, we mix or combine two different input sequences. In the case of the vanilla Transformer architecture, these are the sequence returned by the last/top encoder layer on the left and the input sequence being processed by the decoder part on the right.
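A toy NumPy sketch contrasting the two: the only difference is where the keys and values come from (the helper and shapes are illustrative).

```python
import numpy as np

def attention(Q, K, V):
    """Plain scaled dot-product attention on 2-D arrays."""
    d_k = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d_k)
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
enc = rng.normal(size=(7, 32))  # encoder output, 7 source tokens
dec = rng.normal(size=(3, 32))  # decoder hidden states, 3 target tokens

self_attn = attention(dec, dec, dec)   # Q, K, V from the same sequence
cross_attn = attention(dec, enc, enc)  # Q from decoder, K/V from encoder
```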
causal/masked attention
layer norm
read more
 Review — PreLN Transformer: On Layer Normalization in the Transformer Architecture
 Why does Pre Norm perform worse than Post Norm? (为什么Pre Norm的效果不如Post Norm？)
Calculating Transformers Parameters
At a high level, the Transformer model consists of $L$ identical blocks, each composed of an attention module and an MLP module, or FFN for feed-forward network.
The weight matrices for query $Q$, key $K$, value $V$ and output $O$ are $W_{q}, W_{k}, W_{v}$, and $W_{o}\in\mathbb{R}^{h\times h}$, respectively. Each has a corresponding bias vector of shape $\mathbb{R}^{h}$.^{1} Hence the parameter count for this part is $4h^{2}+4h$.
The FFN module has two linear layers. The first layer scales up to a higher, intermediate dimension, and the second layer scales back down to a dimension of $h$. Back in GPT's early days, the scaling factor was 4 (recent models adopt different intermediate dimensions, but around 3 to 5 times $h$)^{2}, i.e., the weight matrix for the first layer is $W_{1}\in\mathbb{R}^{h\times 4h}$ and the weight matrix for the second layer is $W_{2}\in\mathbb{R}^{4h\times h}$. The bias vectors are in $\mathbb{R}^{4h}$ and $\mathbb{R}^{h}$, respectively. Hence the parameter count for the MLP module is $8h^{2}+5h$.
Don't forget about LayerNorm. Both the self-attention module and the MLP module are equipped with layer norm layers, whose learnable parameters are a weight $\gamma$ and a bias $\beta$, all in $\mathbb{R}^{h}$. Hence the parameter count for layer norm is $4h$.
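The per-block bookkeeping above can be sketched as a short Python function (assuming an FFN factor of 4 and biases everywhere, as in the derivation):

```python
def block_params(h):
    """Parameter count of one Transformer block.

    Assumes FFN expansion factor 4 and biases on every linear layer,
    matching the counts derived in the text.
    """
    attn = 4 * h * h + 4 * h                      # W_q, W_k, W_v, W_o + biases
    ffn = (h * 4 * h + 4 * h) + (4 * h * h + h)   # two linear layers + biases
    ln = 2 * 2 * h                                # two LayerNorms (gamma, beta)
    return attn + ffn + ln                        # = 12h^2 + 13h
```

For example, `block_params(4096)` agrees with the closed form $12h^{2}+13h$.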
In terms of positional encoding, there is a relatively small number of parameters if the encoding is learnable. Relative positional encodings, such as RoPE and ALiBi, include no trainable parameters.
As a matter of fact, the model starts with tokenization, followed by word embedding and positional embedding. The word embedding matrix has shape $\mathbb{R}^{V\times h}$. To reduce the memory footprint, many models tie the parameters of the final linear output (unembedding) layer to the word embedding matrix.
Take a look at the model layers of EleutherAI’s gpt-neo-1.3B, a replication of the GPT-3 architecture.
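Putting the pieces together gives a rough total. The sketch below assumes a gpt-neo-1.3B-like configuration ($h=2048$, $L=24$, $V=50257$, learned positions up to 2048, tied input/output embeddings, plus a final LayerNorm); the config values are assumptions for illustration.

```python
def total_params(num_layers, h, V, max_pos):
    """Rough total: L blocks + word embedding + learned positions + final LN.

    Assumes FFN factor 4, biases everywhere, and tied input/output
    embeddings (so the unembedding layer adds no parameters).
    """
    per_block = 12 * h * h + 13 * h
    return num_layers * per_block + V * h + max_pos * h + 2 * h

# gpt-neo-1.3B-like config (values assumed here for illustration)
n = total_params(num_layers=24, h=2048, V=50257, max_pos=2048)
print(f"{n / 1e9:.2f} B")  # prints 1.32 B
```

The estimate lands close to the advertised 1.3B parameters, which is a good sanity check on the per-block formula.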
Memory Footprint During Training
During the training process, the memory footprint is mainly divided into four parts: model parameters, intermediate activations produced during the forward pass, gradients computed during the backward pass, and optimizer states. Here we focus on the memory footprint of parameters, gradients, and optimizer states. When training large language models, the AdamW optimizer is commonly used, and mixed precision training is used to accelerate training. On this premise, we now analyze the memory footprint of the training process.
Inside a typical training iteration, each learnable parameter corresponds to one gradient and two optimizer states (the first and second moments from AdamW). Denoting the number of learnable parameters in the model as $\varPhi$, the number of gradients is also $\varPhi$, and the number of optimizer states is $2\varPhi$.
A float16-typed value occupies 2 bytes; a float32 value occupies 4 bytes. In mixed precision training, float16 is used for the forward and backward passes, hence the gradients are stored in float16. During the model parameter update, float32 optimizer states, float32 gradients, and float32 model parameters are used. Therefore, each learnable parameter occupies $(2+4)$ bytes for its weight (float16 working copy plus float32 master copy), $(2+4)$ bytes for its gradient, and $(4+4)$ bytes for the two optimizer states, i.e., $20$ bytes in total.
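Under these assumptions (mixed-precision AdamW, float32 master copies kept for the update), the per-parameter bookkeeping can be sketched as follows; the 7B model size is a hypothetical example.

```python
def training_bytes_per_param():
    """Bytes held per learnable parameter in mixed-precision AdamW training."""
    fp16, fp32 = 2, 4
    weights = fp16 + fp32  # float16 working copy + float32 master copy
    grads = fp16 + fp32    # float16 backward grad + float32 grad for the update
    optim = fp32 + fp32    # AdamW first and second moments, both float32
    return weights + grads + optim

# e.g. a hypothetical 7B-parameter model:
phi = 7e9
print(f"{training_bytes_per_param() * phi / 1e9:.0f} GB")  # prints 140 GB
```

So parameters, gradients, and optimizer states alone already require roughly $20\varPhi$ bytes, before counting activations.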
Memory Footprint During Inference
During the inference process, there are no optimizer states or gradients, and intermediate activation results need not be stored.
The memory footprint is therefore significantly smaller than that of training.
The majority of memory footprint comes from the model parameters.
If float16 is used for inference, the memory footprint of the model parameters is about $2\varPhi$ bytes.
Moreover, if a KV cache is used to speed up inference, it incurs an additional memory footprint.
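A back-of-the-envelope sketch of both costs. The KV cache estimate assumes standard multi-head attention (one key and one value vector of dimension $h$ per token per layer, stored in float16); the 7B-class configuration below is an assumption, not a specific model.

```python
def inference_weight_bytes(phi, bytes_per_param=2):
    """float16 weights only: about 2 * phi bytes."""
    return phi * bytes_per_param

def kv_cache_bytes(batch, seq_len, num_layers, h, bytes_per_elem=2):
    """Per layer and per token, one key and one value vector of dim h."""
    return 2 * batch * seq_len * num_layers * h * bytes_per_elem

# Illustrative 7B-class config (assumed values):
phi, num_layers, h = 7e9, 32, 4096
print(inference_weight_bytes(phi) / 1e9)             # 14.0  (GB of weights)
print(kv_cache_bytes(1, 2048, num_layers, h) / 1e9)  # ~1.07 (GB for one 2048-token sequence)
```

Note that the KV cache grows linearly with batch size and sequence length, so for long contexts or large batches it can rival the weights themselves.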
Estimating FLOPs
Footnotes

For instance, Llama 2 uses an intermediate dimension of 11008 (2.6875 times $h$), Qwen2 uses 22016 (5.375 times $h$), while Mistral and Llama 3 use 14336 (3.5 times $h$). They all use 4096 as the hidden dimension $h$. ↩