ALBERT (paper read)

ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS

Summary

Applies parameter-reduction techniques on top of BERT to make it more lightweight.

Research Objective (the authors' research goal)

Present two parameter-reduction techniques (factorized embedding parameterization and cross-layer parameter sharing) to lower memory consumption and speed up BERT training.

Problem Statement (what problem needs to be solved?)

The embedding matrix has too many parameters.
Parameters are not shared between layers, so each additional layer adds a full new set of weights.

Method(s)作者解决问题的方法/算法是什么?是否基于前人的方法?

factorized embedding parameterization
Reduces the vocabulary-embedding parameters from O(V × H) to O(V × E + E × H), where E ≪ H (see the sketch below).
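
A minimal PyTorch sketch of the factorized embedding idea (the module name and sizes are my own illustration, not the paper's released code): the V × H lookup is replaced by a V × E table followed by an E × H linear projection.

```python
import torch
import torch.nn as nn


class FactorizedEmbedding(nn.Module):
    """Factorize the V x H embedding into a V x E table and an E x H projection."""

    def __init__(self, vocab_size: int, embed_size: int, hidden_size: int):
        super().__init__()
        # V x E lookup table; keeping E small keeps this matrix cheap.
        self.word_embed = nn.Embedding(vocab_size, embed_size)
        # E x H projection up to the transformer's hidden size.
        self.proj = nn.Linear(embed_size, hidden_size, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.word_embed(token_ids))


# Illustrative sizes: V=30000, E=128, H=768.
V, E, H = 30000, 128, 768
factorized = FactorizedEmbedding(V, E, H)
num_params = sum(p.numel() for p in factorized.parameters())
print(num_params)  # V*E + E*H = 3,938,304
print(V * H)       # a direct V x H embedding would need 23,040,000
```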

cross-layer parameter sharing
Parameters are shared across layers (see the sketch below).
The solution space differs from that of DQE (Deep Equilibrium Models): ALBERT's layer embeddings oscillate rather than converging to an equilibrium point.
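
A minimal sketch of the all-shared variant of cross-layer parameter sharing (the class and sizes are illustrative, not the released implementation): a single encoder layer is instantiated once and unrolled for the full depth, so depth no longer multiplies the parameter count.

```python
import torch
import torch.nn as nn


class SharedLayerEncoder(nn.Module):
    """All-shared strategy: one transformer layer applied num_layers times."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12,
                 num_layers: int = 12):
        super().__init__()
        # A single layer's attention + FFN parameters are reused at every depth,
        # so the parameter count does not grow with num_layers.
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):
            x = self.layer(x)  # same weights at every iteration
        return x


encoder = SharedLayerEncoder()
out = encoder(torch.randn(2, 16, 768))  # (batch, seq_len, hidden)
print(out.shape)
```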

sentence ordering objectives
sentence order prediction (SOP) loss:
Positive examples keep two consecutive segments in their original order; negative examples swap the two segments (see the sketch below).
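
A minimal sketch of how SOP training pairs could be built (the function name, swap probability, and segmentation details are hypothetical): the positive case keeps two consecutive segments in document order, the negative case swaps them, matching the "positives in order, negatives swapped" description above.

```python
import random
from typing import List, Tuple


def make_sop_example(segment_a: List[str], segment_b: List[str],
                     swap_prob: float = 0.5) -> Tuple[List[str], List[str], int]:
    """Build one SOP example from two consecutive segments of the same document.

    Label 1: segments kept in their original order (positive).
    Label 0: the same two segments with their order swapped (negative).
    swap_prob is an illustrative choice, not a value from the paper.
    """
    if random.random() < swap_prob:
        return segment_b, segment_a, 0  # negative: order reversed
    return segment_a, segment_b, 1      # positive: original document order


seg_a = ["the", "model", "shares", "parameters", "across", "layers"]
seg_b = ["this", "keeps", "the", "parameter", "count", "small"]
print(make_sop_example(seg_a, seg_b))
```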

Evaluation: how do the authors evaluate the method, what is the experimental setup, and are there problems or ideas worth borrowing?

Some of the findings:

  1. The all-shared strategy hurts performance under both conditions (embedding sizes E = 768 and E = 128).
  2. When sharing all cross-layer parameters, there is no need for models deeper than a 12-layer configuration.
  3. Removing dropout improves MLM accuracy; prior work also suggests that combining batch normalization and dropout in convolutional neural networks may have harmful results.

Conclusion: what conclusions do the authors draw? Which are strong conclusions and which are weak?

Directions worth trying: sparse attention, block attention, and hard example mining.

Notes (optional): notes that do not fit the framework above but are worth recording.

DQE (Deep Equilibrium Models); see reference 3.

Reference

  1. Training multi-billion parameter language models using model parallelism, 2019.
  2. Backpropagation without storing activations.
  3. Deep equilibrium models. In Neural Information Processing Systems (NeurIPS), 2019.
  4. RoBERTa: A robustly optimized BERT pre-training approach. arXiv preprint arXiv:1907.11692, 2019.