Normalization Methods in Deep Learning [1]: Batch Normalization


Paper: https://arxiv.org/pdf/1502.03167.pdf

Batch Normalization

Normalizing a model's initial inputs speeds up training convergence, and normalizing the data flowing through the network's internal layers has the same accelerating effect. Batch Normalization (BN) is one of the most important such normalization methods. The following sections describe how BN normalizes data, what effects it has, and why those effects arise.

1. How Batch Normalization Works

(1) Forward pass

During training:
(1) For a mini-batch $\mathcal{B}=\left\{x_{1 \ldots m}\right\}$, say of shape (m, C, H, W), compute the mean and variance for each channel (i.e., reducing over the batch and spatial dimensions):
$$
\begin{aligned}
\mu_{\mathcal{B}} &= \frac{1}{m} \sum_{i=1}^{m} x_{i}\\
\sigma_{\mathcal{B}}^{2} &= \frac{1}{m} \sum_{i=1}^{m}\left(x_{i}-\mu_{\mathcal{B}}\right)^{2}
\end{aligned}
$$
(2) Normalize every sample in the mini-batch with this mean and variance:
$$\widehat{x}_{i} = \frac{x_{i}-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}}$$
(3) Finally, apply a shift-and-scale (affine) transform to the normalized data:
$$y_{i} = \gamma \widehat{x}_{i}+\beta \equiv \mathrm{BN}_{\gamma, \beta}\left(x_{i}\right)$$
$\gamma$ and $\beta$ are learnable parameters.
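
A minimal NumPy sketch of these three training-time steps for a 2-D batch of shape (m, D); the function name and shapes are illustrative choices, not from the paper:

import numpy as np

def bn_train_step(x, gamma, beta, eps=1e-5):
    """Minimal BN forward for a batch x of shape (m, D)."""
    mu = x.mean(axis=0)                    # step (1): per-feature mean
    var = x.var(axis=0)                    # step (1): per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step (2): normalize
    return gamma * x_hat + beta            # step (3): scale and shift

x = np.random.randn(8, 4)
y = bn_train_step(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))       # roughly 0 and 1 per feature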

At test time:
Unbiased estimates of $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^{2}$ gathered from the training data are used as the mean and variance for normalizing test data:
$$
\begin{aligned}
E(x) &= E_{\mathcal{B}}\left(\mu_{\mathcal{B}}\right) \\
\operatorname{Var}(x) &= \frac{m}{m-1} E_{\mathcal{B}}\left(\sigma_{\mathcal{B}}^{2}\right)
\end{aligned}
$$
They can be obtained by recording the mean and variance of every mini-batch during training and averaging them at the end.
In practice, however, a running mean and running variance maintained with a momentum parameter are commonly used instead:
$$
\begin{aligned}
r\mu_{B_t} &= \beta\, r\mu_{B_{t-1}} + (1-\beta)\, \mu_{\mathcal{B}} \\
r\sigma_{B_t}^{2} &= \beta\, r\sigma_{B_{t-1}}^{2} + (1-\beta)\, \sigma_{\mathcal{B}}^{2}
\end{aligned}
$$
The momentum $\beta$ (not to be confused with the learnable shift $\beta$ above) is typically 0.9.
The test-time transform is therefore:
$$y = \gamma \frac{x-E(x)}{\sqrt{Var(x)+\epsilon}} + \beta$$
or:
$$y = \gamma \frac{x-r\mu_{B_t}}{\sqrt{r\sigma_{B_t}^{2}+\epsilon}} + \beta$$
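
For the (m, C, H, W) inputs mentioned above, the same statistics are computed once per channel by reducing over the batch and spatial axes. A rough sketch, with names and shapes of my own choosing:

import numpy as np

def spatial_bn_train(x, gamma, beta, eps=1e-5):
    """BN forward for x of shape (N, C, H, W); gamma and beta have shape (C,)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # one mean per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)   # one variance per channel
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(2, 3, 4, 4)
y = spatial_bn_train(x, np.ones(3), np.zeros(3))
print(y.mean(axis=(0, 2, 3)))                    # roughly zero for each channel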

(2) Backward pass (gradient computation)
The cleanest way to derive the gradients is to build a computation graph from the forward-pass equations, which makes the dependencies between variables explicit.
Forward-pass equations:
$$
\begin{aligned}
\mu_{\mathcal{B}} &= \frac{1}{m} \sum_{i=1}^{m} x_{i}\\
\sigma_{\mathcal{B}}^{2} &= \frac{1}{m} \sum_{i=1}^{m}\left(x_{i}-\mu_{\mathcal{B}}\right)^{2}\\
\widehat{x}_{i} &= \frac{x_{i}-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}}\\
y_{i} &= \gamma \widehat{x}_{i}+\beta
\end{aligned}
$$
Computation graph: (figure omitted; black edges denote forward dependencies, orange edges denote the backward/gradient flow)
Two rules of thumb for reading gradients off the computation graph:

  • Compute the gradients of the variables closest to the known gradient first; the gradients of variables further away can then reuse these already-computed results.
  • A variable with several outgoing edges receives one gradient term per edge, and the terms are summed.

We are given $\frac{\partial l}{\partial y_{i}}$ and want the gradient of the loss with respect to every variable in the graph.
Working from near to far, we first compute the gradients for $\gamma$, $\beta$ and $\hat{x_i}$. Since $\sigma_{\mathcal{B}}^2$ depends on $\mu_{\mathcal{B}}$, we compute $\sigma_{\mathcal{B}}^2$ before $\mu_{\mathcal{B}}$, and $x_i$ last.
$\gamma$: it appears to have a single outgoing edge but actually has m of them, because every $y_i$ depends on $\gamma$. Therefore
$$
\begin{aligned}
\frac{\partial l}{\partial \gamma} &= \sum_{i=1}^m \frac{\partial l}{\partial y_i}\frac{\partial y_i}{\partial \gamma}\\
&= \sum_{i=1}^m \frac{\partial l}{\partial y_i} \hat{x_i}
\end{aligned}
$$
$\beta$: likewise,
$$
\begin{aligned}
\frac{\partial l}{\partial \beta} &= \sum_{i=1}^m \frac{\partial l}{\partial y_i} \frac{\partial y_i}{\partial \beta}\\
&= \sum_{i=1}^m \frac{\partial l}{\partial y_i}
\end{aligned}
$$
$\hat{x_i}$: one outgoing edge,
$$
\begin{aligned}
\frac{\partial l}{\partial \hat{x_i}} &= \frac{\partial l}{\partial y_i}\frac{\partial y_i}{\partial \hat{x_i}}\\
&= \frac{\partial l}{\partial y_i}\gamma
\end{aligned}
$$
$\sigma_{\mathcal{B}}^2$: m outgoing edges, because every $\hat{x_i}$ depends on $\sigma_{\mathcal{B}}^2$. Its paths to the loss are $\sigma_{\mathcal{B}}^2\rightarrow\hat{x_i}\rightarrow y_i \rightarrow loss$; since $\frac{\partial l}{\partial \hat{x_i}}$ has already been computed, each path shortens to $\sigma_{\mathcal{B}}^2\rightarrow\hat{x_i} \rightarrow loss$. Therefore
$$
\begin{aligned}
\frac{\partial l}{\partial \sigma_{\mathcal{B}}^2} &= \sum_{i=1}^m \frac{\partial l}{\partial \hat{x_i}}\frac{\partial \hat{x_i}}{\partial \sigma_{\mathcal{B}}^2}\\
&= \sum_{i=1}^m \frac{\partial l}{\partial \hat{x_i}} \cdot \left(-\frac{1}{2}\right)(x_i-\mu_{\mathcal{B}})(\sigma_{\mathcal{B}}^2+\epsilon)^{-\frac{3}{2}}
\end{aligned}
$$
$\mu_{\mathcal{B}}$: m + 1 outgoing edges, via the paths $\mu_{\mathcal{B}} \rightarrow \sigma_{\mathcal{B}}^2 \rightarrow loss$ and $\mu_{\mathcal{B}} \rightarrow \hat{x_i} \rightarrow loss$. Therefore
$$
\begin{aligned}
\frac{\partial l}{\partial \mu_{\mathcal{B}}} &= \frac{\partial l}{\partial \sigma_{\mathcal{B}}^2}\frac{\partial \sigma_{\mathcal{B}}^2}{\partial \mu_{\mathcal{B}}} + \sum_{i=1}^m\frac{\partial l}{\partial \hat{x_i}}\frac{\partial \hat{x_i}}{\partial \mu_{\mathcal{B}}}\\
&= \frac{\partial l}{\partial \sigma_{\mathcal{B}}^2} \cdot \frac{-2}{m}\sum_{i=1}^m(x_i-\mu_{\mathcal{B}}) + \sum_{i=1}^m\frac{\partial l}{\partial \hat{x_i}} \left(\frac{-1}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}}\right)
\end{aligned}
$$
$x_i$: three outgoing edges, via $x_i\rightarrow\mu_{\mathcal{B}}\rightarrow loss$, $x_i\rightarrow \sigma_{\mathcal{B}}^2\rightarrow loss$ and $x_i\rightarrow \hat{x_i}\rightarrow loss$:
$$
\begin{aligned}
\frac{\partial l}{\partial x_{i}} &= \frac{\partial l}{\partial \mu_{\mathcal{B}}} \frac{\partial \mu_{\mathcal{B}}}{\partial x_{i}}+\frac{\partial l}{\partial \sigma_{\mathcal{B}}^{2}} \frac{\partial \sigma_{\mathcal{B}}^{2}}{\partial x_{i}}+\frac{\partial l}{\partial \hat{x}_{i}} \frac{\partial \hat{x}_{i}}{\partial x_{i}} \\
&= \frac{\partial l}{\partial \mu_{\mathcal{B}}} \frac{1}{m}+\frac{\partial l}{\partial \sigma_{\mathcal{B}}^{2}} \frac{2}{m}\left(x_{i}-\mu_{\mathcal{B}}\right)+\frac{\partial l}{\partial \hat{x}_{i}} \frac{1}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}}
\end{aligned}
$$
(Along this direct edge, $\frac{\partial \sigma_{\mathcal{B}}^{2}}{\partial x_{i}} = \frac{2}{m}(x_{i}-\mu_{\mathcal{B}})$, treating $\mu_{\mathcal{B}}$ as a separate node.)

That completes the gradient derivation.
A reference implementation in NumPy follows:

import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    
    if mode == 'train':

        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)
        out_ = (x - sample_mean) / np.sqrt(sample_var + eps)

        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

        out = gamma * out_ + beta
        cache = (out_, x, sample_var, sample_mean, eps, gamma, beta)

    elif mode == 'test':

        scale = gamma / np.sqrt(running_var + eps)
        out = x * scale + (beta - running_mean * scale)

    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache

def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None

    out_, x, sample_var, sample_mean, eps, gamma, beta = cache

    N = x.shape[0]
    dout_ = gamma * dout                     # dl/dx_hat
    dvar = np.sum(dout_ * (x - sample_mean) * -0.5 * (sample_var + eps) ** -1.5, axis=0)  # dl/dvar
    dx_ = 1 / np.sqrt(sample_var + eps)      # dx_hat/dx along the direct edge
    dvar_ = 2 * (x - sample_mean) / N        # dvar/dx along the direct edge

    # intermediate for convenient calculation: the two direct-edge contributions to dx
    di = dout_ * dx_ + dvar * dvar_
    dmean = -1 * np.sum(di, axis=0)          # dl/dmean
    dmean_ = np.ones_like(x) / N             # dmean/dx

    dx = di + dmean * dmean_
    dgamma = np.sum(dout * out_, axis=0)
    dbeta = np.sum(dout, axis=0)

    return dx, dgamma, dbeta
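
As a quick sanity check (test scaffolding of my own, not part of the original code), the analytic gradients can be compared against numerical gradients:

import numpy as np

np.random.seed(0)
N, D = 4, 5
x = np.random.randn(N, D)
gamma, beta = np.random.randn(D), np.random.randn(D)

out, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
dout = np.random.randn(*out.shape)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

def num_grad(f, z, dout, h=1e-5):
    """Centered finite-difference gradient of sum(f(z) * dout) with respect to z."""
    grad = np.zeros_like(z)
    it = np.nditer(z, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = z[idx]
        z[idx] = old + h; fp = f(z)
        z[idx] = old - h; fm = f(z)
        z[idx] = old
        grad[idx] = np.sum((fp - fm) * dout) / (2 * h)
        it.iternext()
    return grad

dx_num = num_grad(lambda z: batchnorm_forward(z, gamma, beta, {'mode': 'train'})[0], x, dout)
print(np.max(np.abs(dx - dx_num)))   # should be tiny, on the order of 1e-8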

2. The Effects of Batch Normalization, and Why They Hold

(1) It reduces the impact of Internal Covariate Shift, making weight updates more stable.
In a network without BN, whenever a layer's weights are updated its outputs change, which means the inputs to the next layer change as well. The deeper the network, the more this effect compounds, so at each training iteration every neuron sees inputs whose distribution has shifted substantially; this is called Internal Covariate Shift. Batch Normalization, via normalization followed by an affine transform (shift + scale), keeps the input distribution of every layer approximately fixed.
Suppose one layer of the network is:
$$\mathbf{H_{i+1}} = \mathbf{W} \mathbf{H_i} + \mathbf{b}$$
The gradient with respect to the weights is:
$$\frac{\partial l}{\partial \mathbf{W}} = \frac{\partial l}{\partial \mathbf{H_{i+1}}} \mathbf{H_i}^T$$
and the weight update is:
$$\mathbf{W} \leftarrow \mathbf{W} - \eta \frac{\partial l}{\partial \mathbf{H_{i+1}}} \mathbf{H_i}^T$$
So when the previous layer's output $\mathbf{H_i}$ varies a lot, the weight updates fluctuate a lot as well; a small numeric illustration follows.
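
A tiny illustration of this dependence (purely illustrative; the shapes and the scale factor are arbitrary, and the row-major convention H_{i+1} = H_i W is used): scaling the layer input by 10 scales the weight gradient by 10 as well.

import numpy as np

np.random.seed(0)
H_i = np.random.randn(16, 8)         # previous layer's output, shape (batch, in_dim)
dL_dH_next = np.random.randn(16, 4)  # upstream gradient, shape (batch, out_dim)

# for H_{i+1} = H_i W + b, the weight gradient is H_i^T @ dL/dH_{i+1}
grad_small = H_i.T @ dL_dH_next
grad_large = (10 * H_i).T @ dL_dH_next
print(np.linalg.norm(grad_large) / np.linalg.norm(grad_small))  # -> 10.0
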
(2) Batch Normalization is invariant to weight scaling, which makes backpropagation more efficient and also has a regularizing effect on the parameters.
Write BN as:
$$Norm(\mathbf{Wx}) = \mathbf{g} \cdot \frac{\mathbf{W} \mathbf{x}-\mu}{\sigma}+\mathbf{b}$$
Why is it invariant to weight scaling?
Suppose the weights are scaled by a constant $\lambda > 0$; the corresponding mean and standard deviation then scale by the same factor, so:
$$
\begin{aligned}
Norm\left(\mathbf{W}^{\prime} \mathbf{x}\right) &= \mathbf{g} \cdot \frac{\mathbf{W}^{\prime} \mathbf{x}-\mu^{\prime}}{\sigma^{\prime}}+\mathbf{b}\\
&= \mathbf{g} \cdot \frac{\lambda \mathbf{W} \mathbf{x}-\lambda \mu}{\lambda \sigma}+\mathbf{b}\\
&= \mathbf{g} \cdot \frac{\mathbf{W} \mathbf{x}-\mu}{\sigma}+\mathbf{b}\\
&= Norm(\mathbf{W} \mathbf{x})
\end{aligned}
$$
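
This is easy to check numerically; the sketch below (names are my own) normalizes Wx and 5Wx over a batch and compares the outputs:

import numpy as np

np.random.seed(0)
x = np.random.randn(64, 8)
W = np.random.randn(8, 3)
g, b = np.random.randn(3), np.random.randn(3)
eps = 1e-5

def bn(z):
    return g * (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps) + b

out1 = bn(x @ W)
out2 = bn(x @ (5.0 * W))             # weights scaled by lambda = 5
print(np.max(np.abs(out1 - out2)))   # negligible; exactly zero if eps were 0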

Why does it make backpropagation more efficient?
Consider how the gradient changes when the weights are scaled.
For brevity, write $\mathbf{y}=Norm(\mathbf{Wx})$:
$$
\begin{aligned}
\frac{\partial l}{\partial \mathbf{x}} &= \frac{\partial l}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\\
&= \frac{\partial l}{\partial \mathbf{y}} \frac{\partial \left(\mathbf{g} \cdot \frac{\mathbf{W} \mathbf{x}-\mu}{\sigma}+\mathbf{b}\right)}{\partial \mathbf{x}}\\
&= \frac{\partial l}{\partial \mathbf{y}} \frac{\mathbf{g}\cdot\mathbf{W}}{\sigma}\\
&= \frac{\partial l}{\partial \mathbf{y}} \frac{\mathbf{g}\cdot \lambda\mathbf{W}}{\lambda\sigma}
\end{aligned}
$$
When the weights are scaled, the corresponding $\sigma$ scales by the same factor and the effect of the weight scaling cancels out ($\mu$ and $\sigma$ are treated as fixed statistics here).
More generally, when a layer's weights are large (small), the corresponding $\sigma$ is also large (small), so the gradient passed backward is largely insulated from the scale of the weights, which makes backpropagation more efficient. In addition, $\mathbf{g}$ is itself learnable and acts as an adaptive control on the gradient magnitude.
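
The cancellation can also be seen numerically. The snippet below (my own construction, treating μ and σ as fixed batch statistics as in the derivation) compares the factor g·W/σ that multiplies ∂l/∂y for W and for 5W:

import numpy as np

np.random.seed(1)
x = np.random.randn(256, 8)
W = np.random.randn(8, 3)
g = np.random.randn(3)

sigma1 = (x @ W).std(axis=0)          # per-unit std of the pre-activations
sigma5 = (x @ (5.0 * W)).std(axis=0)  # scales up by 5 along with the weights

J1 = W * (g / sigma1)                 # the g * W / sigma factor from the derivation
J5 = (5.0 * W) * (g / sigma5)
print(np.max(np.abs(J1 - J5)))        # ~0: lambda in the weights cancels with lambda in sigma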

Why does it have a regularizing effect on the parameters?
Compute the gradient with respect to the weights:
$$
\begin{aligned}
\frac{\partial l}{\partial \mathbf{W}} &= \frac{\partial l}{\partial \mathbf{y}} \frac{\partial\mathbf{y}}{\partial\mathbf{W}}\\
&= \frac{\partial l}{\partial \mathbf{y}} \frac{\partial\left(\mathbf{g} \cdot \frac{\mathbf{W} \mathbf{x}-\mu}{\sigma}+\mathbf{b}\right)}{\partial \mathbf{W}}\\
&= \frac{\partial l}{\partial \mathbf{y}} \frac{\mathbf{g}\cdot \mathbf{x}^T}{\sigma}
\end{aligned}
$$
If the layer's weights are large, the corresponding $\sigma$ is also larger, so the computed gradient is smaller and $\mathbf{W}$ changes less, keeping the weights stable; conversely, when the weights are small, $\sigma$ is smaller, the gradient is larger, and the weights change more.
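
A rough numeric illustration of this effect (again my own, with σ treated as above): scaling W up by 5 scales σ up by 5 and therefore scales the weight gradient down by 5.

import numpy as np

np.random.seed(2)
x = np.random.randn(256, 8)
W = np.random.randn(8, 3)
g = np.random.randn(3)
dl_dy = np.random.randn(256, 3)         # some upstream gradient

def weight_grad(W):
    sigma = (x @ W).std(axis=0)         # per-unit std of the pre-activations
    return x.T @ (dl_dy * (g / sigma))  # dl/dW = x^T (dl/dy * g / sigma)

print(np.linalg.norm(weight_grad(W)) / np.linalg.norm(weight_grad(5.0 * W)))  # -> ~5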

3. Why add $\gamma$ and $\beta$?

To keep normalization from reducing the model's expressive power.
If the activation function is a Sigmoid, the normalized data fall into its non-saturated (roughly linear) region, and using only that linear regime weakens the network's expressive power.
If the activation function is ReLU, normalization would always zero out roughly half of the activations; the learnable parameter $\beta$ can adjust that active fraction via gradient descent, restoring expressive power. A small illustration follows.
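
A small toy example of this point (my own): shifting the normalized activations by different β values changes the fraction that survives ReLU.

import numpy as np

np.random.seed(3)
x = np.random.randn(10000)
x_hat = (x - x.mean()) / x.std()       # normalized activations, roughly N(0, 1)

for beta in (-1.0, 0.0, 1.0):
    y = np.maximum(x_hat + beta, 0.0)  # BN output (gamma = 1) followed by ReLU
    print(beta, np.mean(y > 0))        # fraction of active units: ~0.16, ~0.5, ~0.84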

Doesn't applying an affine transform right after normalization simply undo it?
First, the new parameters subsume the original input distribution as a special case while being able to represent many more distributions.
Second, the mean and variance of the original $\mathbf{x}$ are entangled in a complicated way with the parameters of the shallower layers; normalizing to $\hat{\mathbf{x}}$ and then applying the affine transform $\mathbf{y}=\mathbf{g} \cdot \hat{\mathbf{x}}+\mathbf{b}$ removes that tight coupling with the shallower layers.
Finally, the new parameters are learned by gradient descent, so the network can shape a distribution that favors the model's expressiveness.
