Upgrade the easier-to-understand GPT-2 attention code so that it can load the pre-trained GPT-2 weights.
That is, avoid keeping separate loaders and code paths for pre-trained and non-pre-trained model weights; see https://github.com/LxMLS/lxmls-toolkit/blob/master/lxmls/transformers/model.py#L123
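
One possible shape for a single loading path is sketched below. It assumes the pre-trained checkpoint follows the Hugging Face GPT-2 naming, where the `c_attn`, `c_proj` and `c_fc` weights are stored in Conv1D layout `(in_features, out_features)`, and that the simpler attention code expects linear-layer layout `(out_features, in_features)`. Whether that holds for the toolkit's code is an assumption. The helper names `transpose_2d` and `to_linear_layout` are hypothetical, and nested lists stand in for tensors to keep the sketch dependency-free.

```python
# Illustrative sketch only: helper names and the toy data are assumptions,
# not part of the toolkit.

# Weight-name suffixes that Hugging Face GPT-2 checkpoints store in
# Conv1D layout, i.e. (in_features, out_features).
CONV1D_SUFFIXES = (
    "attn.c_attn.weight",
    "attn.c_proj.weight",
    "mlp.c_fc.weight",
    "mlp.c_proj.weight",
)


def transpose_2d(matrix):
    """Transpose a matrix given as a list of rows.

    With PyTorch tensors this step would be `weight.t()`.
    """
    return [list(column) for column in zip(*matrix)]


def to_linear_layout(pretrained_state):
    """Return a copy of the state dict with Conv1D weights transposed.

    All other entries (biases, embeddings, layer norms) are passed through
    unchanged, so the result can be loaded by the same code path that loads
    weights saved from the simpler model.
    """
    converted = {}
    for name, weight in pretrained_state.items():
        if name.endswith(CONV1D_SUFFIXES):
            converted[name] = transpose_2d(weight)
        else:
            converted[name] = weight
    return converted


# Toy example: a 2 x 3 "Conv1D" weight becomes a 3 x 2 "linear" weight.
toy_state = {
    "h.0.attn.c_attn.weight": [[1, 2, 3], [4, 5, 6]],
    "h.0.attn.c_attn.bias": [0, 0, 0],
}
converted = to_linear_layout(toy_state)
```

With a conversion step like this applied once when the checkpoint is read, the model itself would need only one loader, regardless of where the weights came from.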