Transformer tricks: Removing weights for skipless transformers
DOI: https://doi.org/10.31224/3629

Abstract
He and Hofmann [1] described a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme is only applicable to MHA (multi-head attention), not to MQA (multi-query attention) or GQA (grouped-query attention). The latter schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity).
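As a back-of-the-envelope check of the 15% figure, the sketch below tallies Mistral-7B's weights from its published configuration (hidden size 4096, 32 layers, 32 query heads, 8 KV heads, FFN size 14336, vocabulary 32000, untied embeddings) and compares the Q and P projections against the total. The exact layer shapes of the skipless variant are an assumption here, and small terms such as RMSNorm weights are ignored.

```python
# Weight count for Mistral-7B, assuming its published configuration
# (hidden size 4096, 32 layers, 8 KV heads, FFN size 14336, vocab 32000).
d_model, n_layers, n_kv_heads, head_dim = 4096, 32, 8, 128
d_ffn, vocab = 14336, 32000

q_proj = d_model * d_model                  # W_Q
k_proj = d_model * n_kv_heads * head_dim    # W_K (GQA: 8 KV heads)
v_proj = d_model * n_kv_heads * head_dim    # W_V
p_proj = d_model * d_model                  # W_P (post-attention projection)
ffn    = 3 * d_model * d_ffn                # gate, up, and down projections
per_layer = q_proj + k_proj + v_proj + p_proj + ffn

total   = n_layers * per_layer + 2 * vocab * d_model  # plus input and output embeddings
removed = n_layers * (q_proj + p_proj)                # drop Q and P in every layer

print(f"total weights:   {total / 1e9:.2f}B")          # ~7.24B
print(f"removed weights: {removed / 1e9:.2f}B")        # ~1.07B
print(f"savings:         {100 * removed / total:.1f}%")  # ~14.8%, i.e. about 15%
```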
License
Copyright (c) 2024 Nils Graef
This work is licensed under a Creative Commons Attribution 4.0 International License.