Made by: NVIDIA (Applied Deep Learning Research)
Unlike OpenAI, they have released the complete code for data processing, training, and evaluation.
Detailed writeup: https://nv-adlr.github.io/MegatronLM
Their submission is not on the SQuAD leaderboard, but their reported score exceeds the previous best single-model performance on SQuAD 2.0 (RoBERTa, 89.8 F1).
For language modelling, they report a zero-shot WikiText-103 perplexity of 17.4 with the 8.3B model, better than the 18.3 of Transformer-XL (257M). However, they claim this as SOTA even though GPT-2 itself reports 17.48 ppl and another model reaches 16.4 (https://paperswithcode.com/sota/language-modelling-on-wikitext-103).
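As a reminder of what these numbers measure, perplexity is the exponential of the average per-token negative log-likelihood over the evaluation corpus. A minimal sketch (the helper name and the toy NLL values are illustrative, not taken from the writeup):

```python
import math

def perplexity(token_nlls):
    """Perplexity: exp of the mean per-token negative
    log-likelihood (in nats) over the evaluation set."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical example: an average NLL of ~2.85 nats/token
# corresponds to a perplexity of ~17.3, the ballpark of the
# WikiText-103 numbers quoted above.
print(perplexity([2.85] * 1000))  # ~17.29
```

Note that cross-paper perplexity comparisons are only apples-to-apples when the evaluations use the same tokenization and token-count normalization.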
Sadly, they haven't mentioned anything about releasing the model weights.