Transformer tricks: Removing weights for skipless transformers
DOI: https://doi.org/10.31224/3629

Abstract
He and Hofmann [1] described a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme is only applicable to MHA (multi-head attention), not to MQA (multi-query attention) or GQA (grouped-query attention). The latter schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity).
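As a back-of-the-envelope check of the 15% figure, the sketch below tallies Mistral-7B's weights from its published configuration (hidden size 4096, 32 layers, 32 query heads, 8 KV heads, FFN size 14336, vocabulary 32000, untied embeddings) and compares the Q and P projections against the total. The exact layer shapes of the skipless variant are an assumption here, and small terms such as RMSNorm weights are ignored.

```python
# Weight count for Mistral-7B, assuming its published configuration
# (hidden size 4096, 32 layers, 8 KV heads, FFN size 14336, vocab 32000).
d_model, n_layers, n_kv_heads, head_dim = 4096, 32, 8, 128
d_ffn, vocab = 14336, 32000

q_proj = d_model * d_model                  # W_Q
k_proj = d_model * n_kv_heads * head_dim    # W_K (GQA: 8 KV heads)
v_proj = d_model * n_kv_heads * head_dim    # W_V
p_proj = d_model * d_model                  # W_P (post-attention projection)
ffn    = 3 * d_model * d_ffn                # gate, up, and down projections
per_layer = q_proj + k_proj + v_proj + p_proj + ffn

total   = n_layers * per_layer + 2 * vocab * d_model  # plus input and output embeddings
removed = n_layers * (q_proj + p_proj)                # drop Q and P in every layer

print(f"total weights:   {total / 1e9:.2f}B")          # ~7.24B
print(f"removed weights: {removed / 1e9:.2f}B")        # ~1.07B
print(f"savings:         {100 * removed / total:.1f}%")  # ~14.8%, i.e. about 15%
```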
License
Copyright (c) 2024 Nils Graef
This work is licensed under a Creative Commons Attribution 4.0 International License.