ALBERT (paper read)
ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS
Summary
Applies several parameter-reduction techniques to BERT to make it more lightweight.
Research Objective: what are the authors' research goals?
Present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT.
Problem Statement: what problem needs to be solved?
The embedding matrix has too many parameters.
Parameters are not shared across layers, so model size grows linearly with depth.
Method(s): what methods/algorithms do the authors use to solve the problem? Are they based on prior work?
factorized embedding parameterization
Decomposes the large vocabulary embedding matrix into two smaller matrices, reducing the embedding parameters from O(V × H) to O(V × E + E × H), where E ≪ H; see the sketch below.
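A minimal PyTorch sketch of the factorization idea (not the paper's code); `vocab_size`, `embed_size`, and `hidden_size` below are illustrative defaults:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Factorized embedding: a V x E lookup followed by an E x H projection.

    The sizes (vocab_size, embed_size, hidden_size) are illustrative defaults,
    not the paper's exact configuration.
    """
    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=768):
        super().__init__()
        # V x E lookup table instead of a single V x H table.
        self.word_embeddings = nn.Embedding(vocab_size, embed_size)
        # E x H projection up to the Transformer's hidden size.
        self.projection = nn.Linear(embed_size, hidden_size)

    def forward(self, token_ids):
        return self.projection(self.word_embeddings(token_ids))

# Parameter count: roughly V*E + E*H instead of V*H. With V=30000, E=128, H=768:
# 30000*128 + 128*768 ≈ 3.9M parameters vs. 30000*768 ≈ 23.0M parameters.
```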
cross-layer parameter sharing
Parameters are shared across all layers; see the sketch below.
Unlike DQE (Deep Equilibrium Models), ALBERT's embeddings oscillate rather than converging to a fixed point.
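A minimal sketch of the all-shared strategy, assuming a standard `nn.TransformerEncoderLayer` stands in for ALBERT's Transformer block:

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """All-shared cross-layer parameter sharing: one layer's weights are reused
    at every depth, so the parameter count does not grow with the number of layers."""
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        # A single set of Transformer-layer parameters.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, hidden_states):
        # Apply the same layer repeatedly instead of stacking distinct layers.
        for _ in range(self.num_layers):
            hidden_states = self.shared_layer(hidden_states)
        return hidden_states
```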
sentence ordering objectives
sentence order prediction loss:
Positive examples are two consecutive segments in their original order; negative examples are the same two segments with the order swapped (see the sketch below).
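A small illustrative helper (hypothetical, not from the paper) for building SOP training pairs from two consecutive segments of the same document:

```python
import random

def make_sop_example(segment_a, segment_b):
    """Build one SOP training pair; returns ((first, second), label) with
    label 1 for the original order and 0 for the swapped order."""
    if random.random() < 0.5:
        # Positive example: keep the original order.
        return (segment_a, segment_b), 1
    # Negative example: swap the two segments.
    return (segment_b, segment_a), 0
```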
Evaluation: how do the authors evaluate their method? What is the experimental setup? Are there any issues, or anything worth borrowing?
Some findings from the experiments:
- The all-shared strategy hurts performance under both conditions (embedding sizes E = 768 and E = 128).
- When sharing all cross-layer parameters, there is no need for models deeper than a 12-layer configuration.
- Removing dropout improves MLM accuracy; there is also evidence that combining batch normalization and dropout in convolutional neural networks may have harmful results.
Conclusion: what conclusions do the authors draw? Which are strong conclusions and which are weak?
Directions worth exploring: sparse attention and block attention, hard example mining.
Notes (optional): anything that does not fit the framework above but is worth recording.
DQE (Deep Equilibrium Models)
Reference
- Training multi-billion parameter language models using model parallelism, 2019.
- Backpropagation without storing activations.
- Deep equilibrium models. In Neural Information Processing Systems (NeurIPS), 2019.
- RoBERTa: A robustly optimized BERT pre-training approach. arXiv preprint arXiv:1907.11692, 2019.