    Deriving and Implementing a BP Network from Scratch

    Regression Model

    Goal

    Learn \(y = 2x\)

    Model

    A BP neural network with a single hidden layer containing a single node

    Strategy

    Mean Square Error (MSE)

    \[ MSE = \frac{1}{2}(\hat{y} - y)^2 \]

    The model's objective is \(\min \frac{1}{2} (\hat{y} - y)^2\).
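
    As a quick numerical check (not from the original post), the derivative of this objective with respect to the prediction is \(\hat{y} - y\), which is the term that appears in every update below; the sample values here are arbitrary.

    def mse(y_hat, y):
        return 0.5 * (y_hat - y) ** 2

    y_hat, y, eps = 1.3, 2.0, 1e-6
    numeric = (mse(y_hat + eps, y) - mse(y_hat - eps, y)) / (2 * eps)  # central difference
    analytic = y_hat - y
    print(round(numeric, 6), analytic)  # both are approximately -0.7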

    Algorithm

    Plain gradient descent: within each epoch, the model's error is minimized over all of the training data.

    Network Structure

    Forward Propagation Derivation

    \[ E = \frac{1}{2}(\hat{Y}-Y)^2 \\ \hat{Y} = \beta \\ \beta = w b \\ b = sigmoid(\alpha) \\ \alpha = v x \]
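
    These forward equations translate line by line into Python. The following is only an illustrative sketch, with \(v = w = 1\) chosen to match the initial weights used in the C++ code further below.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(x, v, w):
        alpha = v * x       # alpha = v x
        b = sigmoid(alpha)  # hidden activation
        beta = w * b        # beta = w b
        y_hat = beta        # identity output
        return y_hat, alpha, b

    print(forward(3, 1.0, 1.0))  # with v = w = 1, y_hat = sigmoid(3) ≈ 0.9526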

    Back Propagation Derivation

    The learnable parameters of the model are \(w, v\); they are updated in the same spirit as the perceptron model:

    The update rule for parameter w

    \[ w \leftarrow w + \Delta w \\ \Delta w = - \eta \frac{\partial E}{\partial w} \\ \frac{\partial E}{\partial w} = \frac{\partial E}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \beta} \frac{\partial \beta}{\partial w} \\ = (\hat{Y} - Y) \cdot 1 \cdot b \]

    The update rule for parameter v

    \[ v \leftarrow v + \Delta v \\ \Delta v = -\eta \frac{\partial E}{\partial v} \\ \frac{\partial E}{\partial v} = \frac{\partial E}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \beta} \frac{\partial \beta}{\partial b} \frac{\partial b}{\partial \alpha} \frac{\partial \alpha}{\partial v} \\ = (\hat{Y} - Y) \cdot 1 \cdot w \cdot \frac{\partial b}{\partial \alpha} \cdot x \\ \frac{\partial b}{\partial \alpha} = sigmoid(\alpha) [ 1 - sigmoid(\alpha) ] \\ sigmoid(\alpha) = \frac{1}{1+e^{-\alpha}} \]
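
    Both gradients can be verified numerically. The sketch below (a standalone illustration, not the post's implementation) compares the closed-form expressions with central finite differences on the single sample x = 3, y = 6.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def loss(x, y, v, w):
        y_hat = w * sigmoid(v * x)
        return 0.5 * (y_hat - y) ** 2

    x, y, v, w, eps = 3.0, 6.0, 1.0, 1.0, 1e-6
    b = sigmoid(v * x)
    y_hat = w * b

    # closed-form gradients from the derivation above
    grad_w = (y_hat - y) * b
    grad_v = (y_hat - y) * w * b * (1 - b) * x

    # central finite differences
    num_w = (loss(x, y, v, w + eps) - loss(x, y, v, w - eps)) / (2 * eps)
    num_v = (loss(x, y, v + eps, w) - loss(x, y, v - eps, w)) / (2 * eps)
    print(grad_w, num_w)  # the pairs should agree to several decimals
    print(grad_v, num_v)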

    Code Implementation

    C++ Implementation

    #include <iostream>
    #include <cmath>
    
    using namespace std;
    
    class Network {
    public:
        Network(float eta) : eta(eta) {}
    
        float predict(int x) {  // forward propagation: y_hat = w * sigmoid(v * x)
            this->alpha = this->v * x;
            this->b = this->sigmoid(alpha);
            this->beta = this->w * this->b;
            float prediction = this->beta;
            return prediction;
        }
    
        void step(int x, float prediction, float label) {
            // dE/dw = (y_hat - y) * b
            this->w = this->w
                - this->eta
                * (prediction - label)
                * this->b;
            // dE/dv = (y_hat - y) * w * sigmoid(alpha) * (1 - sigmoid(alpha)) * x
            this->alpha = this->v * x;
            this->v = this->v
                - this->eta
                * (prediction - label)
                * this->w
                * this->sigmoid(this->alpha) * (1 - this->sigmoid(this->alpha))
                * x;
        }
    private:
        float sigmoid(float x) { return (float)1 / (1 + exp(-x)); }
        float v = 1, w = 1, alpha = 1, beta = 1, b = 1, eta;
    };
    
    int main() {  // learn the linear relationship y = 2*x
        float loss, pred;
        Network model(0.01);
        cout << "x is " << 3 << " prediction is " << model.predict(3) << " label is " << 2*3 << endl;
        for (int epoch = 0; epoch < 500; epoch++) {
            loss = 0;
            for (int i = 0; i < 10; i++) {
                pred = model.predict(i);
                loss += pow((pred - 2*i), 2) / 2;  // accumulate 1/2 * (y_hat - y)^2
                model.step(i, pred, 2*i);
            }
            loss /= 10;  // average over the 10 training samples
            cout << "Epoch: " << epoch << "  Loss:" << loss << endl;
        }
        cout << "x is " << 3 << " prediction is " << model.predict(3) << " label is " << 2*3 << endl;
        return 0;
    }
    
    

    C++ Results

    With the initial network weights, the prediction for the sample x=3, y=6 is \(\hat{y} = 0.952534\).

    After 500 epochs of training the average loss has dropped to 7.82519, and the prediction for x=3, y=6 is \(\hat{y} = 11.242\).

    PyTorch Implementation

    # encoding:utf8
    # A minimal neural network: single hidden layer, single node, single input, single output
    
    import torch as t
    import torch.nn as nn
    import torch.optim as optim
    
    
    class Model(nn.Module):
        def __init__(self, in_dim, out_dim):
            super(Model, self).__init__()
            self.hidden_layer = nn.Linear(in_dim, out_dim)
    
        def forward(self, x):
            out = self.hidden_layer(x)
            out = t.sigmoid(out)
            return out
    
    
    if __name__ == '__main__':
        # training data: x = 0..9, y = 2x; X and Y share the shape (10, 1)
        X, Y = [[i] for i in range(10)], [[2 * i] for i in range(10)]
        X, Y = t.Tensor(X), t.Tensor(Y)
        model = Model(1, 1)
        optimizer = optim.SGD(model.parameters(), lr=0.01)
        criterion = nn.MSELoss(reduction='mean')
        y_pred = model(t.Tensor([[3]]))  # prediction before training
        print(y_pred.data)
        for i in range(500):
            optimizer.zero_grad()
            y_pred = model(X)
            loss = criterion(y_pred, Y)
            loss.backward()
            optimizer.step()
            print(loss.data)
        y_pred = model(t.Tensor([[3]]))  # prediction after training
        print(y_pred.data)
    
    

    PyTorch Results

    With the initial network weights, the prediction for the sample x=3, y=6 is \(\hat{y} = 0.5164\).

    After 500 epochs of training the average loss has dropped to 98.8590, and the prediction for x=3, y=6 is \(\hat{y} = 0.8651\).

    Conclusion

    Surprisingly, the hand-written implementation learns better than the PyTorch one. The gap, however, has less to do with the optimizer (PyTorch uses SGD here, which is still gradient descent) than with the architecture: the PyTorch model applies the sigmoid as its final output, so its predictions are confined to (0, 1) and can never reach targets like y = 2x, whereas the C++ network multiplies the sigmoid hidden unit by a linear output weight.
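
    To illustrate that point, here is a sketch (not from the original post) of a PyTorch model whose structure roughly mirrors the C++ network: a linear hidden layer, a sigmoid, then a separate linear output weight. The class name, the choice of bias terms, and the training loop are assumptions for demonstration; with an unbounded output layer the model can in principle fit \(y = 2x\), though the exact numbers it reaches depend on initialization.

    import torch as t
    import torch.nn as nn
    import torch.optim as optim

    class TwoLayerModel(nn.Module):
        # roughly mirrors the hand-written network: x -> linear -> sigmoid -> linear -> y_hat
        def __init__(self):
            super().__init__()
            self.hidden = nn.Linear(1, 1)
            self.output = nn.Linear(1, 1, bias=False)

        def forward(self, x):
            return self.output(t.sigmoid(self.hidden(x)))

    if __name__ == '__main__':
        X = t.Tensor([[i] for i in range(10)])
        Y = t.Tensor([[2 * i] for i in range(10)])
        model = TwoLayerModel()
        optimizer = optim.SGD(model.parameters(), lr=0.01)
        criterion = nn.MSELoss()
        for _ in range(500):
            optimizer.zero_grad()
            loss = criterion(model(X), Y)
            loss.backward()
            optimizer.step()
        print(model(t.Tensor([[3.0]])))  # prediction for x = 3 after training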

    Classification Model

    Goal

    The target function is not known in advance: the dataset for this experiment was built by taking the samples of the first two classes of iris and reducing the four-dimensional features to two dimensions.
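
    The exact preprocessing behind the data file is not described, so the following is only a hypothetical reconstruction: take the first two iris classes, project the four features to two dimensions with PCA, and write one-hot labels. The use of scikit-learn, the PCA choice, and the file name iris.txt are assumptions.

    # Hypothetical reconstruction of the data-preparation step (not from the original post).
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    iris = load_iris()
    mask = iris.target < 2                          # keep only the first two classes
    X = PCA(n_components=2).fit_transform(iris.data[mask])
    y = iris.target[mask]

    with open('iris.txt', 'w', encoding='utf8') as f:
        for (x1, x2), label in zip(X, y):
            one_hot = '1 0' if label == 0 else '0 1'
            f.write('%f %f %s\n' % (x1, x2, one_hot))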

    Sample of the data:

    -1.653443 0.198723 1 0  # the first two columns are features; the trailing "1 0" marks class 1
    1.373162 -0.194633 0 1  # "0 1" marks class 2
    

    Model

    A classification BP network with a single hidden layer and two input nodes

    Strategy

    The optimization aims to minimize the cross-entropy over the whole training set:

    \[ \mathop{argmin}_{\theta} H(Y, \hat{Y}) \]

    Cross-entropy:

    \[ \begin{align} H(y, \hat y) & = -\sum_{i=1}^{2} y_i \log \hat{y}_i \\ & = - (y_1 \log \hat{y}_1 + y_2 \log \hat{y}_2) \end{align} \]
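
    As a tiny standalone illustration (not part of the original code), the cross-entropy between a one-hot label and a predicted distribution can be computed directly; the probability values below are arbitrary examples.

    from math import log

    def cross_entropy(y, y_hat):
        # H(y, y_hat) = -(y1*log(y_hat1) + y2*log(y_hat2))
        return -sum(yi * log(pi) for yi, pi in zip(y, y_hat))

    print(cross_entropy([1, 0], [0.9, 0.1]))  # ≈ 0.105, confident and correct
    print(cross_entropy([1, 0], [0.5, 0.5]))  # ≈ 0.693, i.e. log 2 for a uniform prediction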

    Algorithm

    Gradient descent: within each epoch, the model's error is minimized over all of the training data.

    Network Structure

    (Figure: network structure with a softmax output layer.)

    Forward Propagation

    The formulas are derived as follows:

    \[ a_1 = w_{11}x_1 + w_{21}x_2 \\ a_2 = w_{12}x_1 + w_{22}x_2 \\ b_1 = sigmoid(a_1) \\ b_2 = sigmoid(a_2) \\ \hat{y}_1 = \frac{\exp(b_1)}{\exp(b_1) + \exp(b_2)} \\ \hat{y}_2 = \frac{\exp(b_2)}{\exp(b_1) + \exp(b_2)} \]

    \[ \begin{align} E^{(k)} & = H(y^{(k)}, \hat{y}^{(k)}) \\ & = - (y_1 \log\hat{y}_1 + y_2 \log\hat{y}_2) \end{align} \]
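
    A direct Python transcription of this forward pass may help connect the formulas to the implementation further below. This is an illustrative sketch, with the weights set to 0.5 to mirror the later initialization and the input taken from the sample data shown earlier.

    from math import exp

    def sigmoid(z):
        return 1 / (1 + exp(-z))

    def forward(x, w):
        # w[i][j] is the weight from input i to node j
        a1 = w[0][0] * x[0] + w[1][0] * x[1]
        a2 = w[0][1] * x[0] + w[1][1] * x[1]
        b1, b2 = sigmoid(a1), sigmoid(a2)
        denom = exp(b1) + exp(b2)
        return exp(b1) / denom, exp(b2) / denom  # softmax over (b1, b2)

    w = [[0.5, 0.5], [0.5, 0.5]]
    print(forward([-1.653443, 0.198723], w))     # symmetric weights give (0.5, 0.5)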

    Back Propagation

    \[ \frac{\partial E}{\partial w_{11}} = ( \frac{\partial E}{\partial\hat{y}_1} \frac{\partial\hat{y}_1}{\partial b_1} + \frac{\partial E}{\partial\hat{y}_2} \frac{\partial\hat{y}_2}{\partial b_1} ) \frac{\partial b_1}{\partial a_1} \frac{\partial a_1}{\partial w_{11}} \]

    where

    \[ \frac{\partial E}{\partial\hat{y}_1} = \frac{-y_1}{\hat{y}_1} \\ \frac{\partial E}{\partial\hat{y}_2} = \frac{-y_2}{\hat{y}_2} \\ \frac{\partial \hat{y}_1}{\partial b_1} = \hat{y}_1 (1- \hat{y}_1) \\ \frac{\partial \hat{y}_2}{\partial b_1} = - \hat{y}_1 \hat{y}_2 \\ \frac{\partial b_1}{\partial a_1} = sigmoid(a_1) [1 - sigmoid(a_1)] \\ \frac{\partial a_1}{\partial w_{11}} = x_1 \]

    Therefore,

    \[ \frac{\partial E}{\partial w_{11}} = (\hat{y}_1 - y_1) sigmoid(a_1) [ 1 - sigmoid(a_1)] x_1 \]

    Similarly, we obtain

    \[ \frac{\partial E}{\partial w_{21}} = (\hat{y}_1 - y_1) sigmoid(a_1) [ 1 - sigmoid(a_1)] x_2 \\ \frac{\partial E}{\partial w_{12}} = (\hat{y}_2 - y_2) sigmoid(a_2) [ 1 - sigmoid(a_2)] x_1 \\ \frac{\partial E}{\partial w_{22}} = (\hat{y}_2 - y_2) sigmoid(a_2) [ 1 - sigmoid(a_2)] x_2 \]
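
    The closed-form gradient can be sanity-checked against a central finite difference on a single sample. The sketch below is a standalone illustration; the weight values are arbitrary, chosen only to avoid the symmetric initialization.

    from math import exp, log

    def sigmoid(z):
        return 1 / (1 + exp(-z))

    def loss(x, y, w):
        a1 = w[0][0] * x[0] + w[1][0] * x[1]
        a2 = w[0][1] * x[0] + w[1][1] * x[1]
        b1, b2 = sigmoid(a1), sigmoid(a2)
        denom = exp(b1) + exp(b2)
        p1, p2 = exp(b1) / denom, exp(b2) / denom
        return -(y[0] * log(p1) + y[1] * log(p2)), a1, p1

    x, y = [-1.653443, 0.198723], [1, 0]
    w = [[0.6, 0.5], [0.4, 0.5]]
    eps = 1e-6

    _, a1, p1 = loss(x, y, w)
    analytic = (p1 - y[0]) * sigmoid(a1) * (1 - sigmoid(a1)) * x[0]  # dE/dw11 from the formula above

    w_plus = [[w[0][0] + eps, w[0][1]], [w[1][0], w[1][1]]]
    w_minus = [[w[0][0] - eps, w[0][1]], [w[1][0], w[1][1]]]
    numeric = (loss(x, y, w_plus)[0] - loss(x, y, w_minus)[0]) / (2 * eps)
    print(analytic, numeric)  # the two values should agree to several decimals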

    Code Implementation

    Python 3 Implementation

    # encoding:utf8
    
    from math import exp, log
    import numpy as np
    
    
    def load_data(fname):
        # each line: x1 x2 y1 y2, where (y1, y2) is a one-hot class label
        X, Y = list(), list()
        with open(fname, encoding='utf8') as f:
            for line in f:
                line = line.strip().split()
                X.append(line[:2])
                Y.append(line[2:])
        return X, Y
    
    
    class Network:
        eta = 0.5                      # learning rate
        w = [[0.5, 0.5], [0.5, 0.5]]   # w[i][j]: weight from input i to node j
        b = [0.5, 0.5]                 # sigmoid activations
        a = [0.5, 0.5]                 # pre-activations
        pred = [0.5, 0.5]              # softmax outputs
    
        def __sigmoid(self, x):
            return 1 / (1 + exp(-x))
    
        def forward(self, x):
            self.a[0] = self.w[0][0] * x[0] + self.w[1][0] * x[1]
            self.a[1] = self.w[0][1] * x[0] + self.w[1][1] * x[1]
            self.b[0] = self.__sigmoid(self.a[0])
            self.b[1] = self.__sigmoid(self.a[1])
            # softmax over (b1, b2), matching the forward-propagation formulas
            denom = exp(self.b[0]) + exp(self.b[1])
            self.pred[0] = exp(self.b[0]) / denom
            self.pred[1] = exp(self.b[1]) / denom
            return self.pred
    
        def step(self, x, label):
            # dE/dw_ij = (pred_j - y_j) * sigmoid(a_j) * (1 - sigmoid(a_j)) * x_i
            g = (self.pred[0] - label[0]) * self.__sigmoid(self.a[0]) * (1 - self.__sigmoid(self.a[0])) * x[0]
            self.w[0][0] = self.w[0][0] - self.eta * g
            g = (self.pred[0] - label[0]) * self.__sigmoid(self.a[0]) * (1 - self.__sigmoid(self.a[0])) * x[1]
            self.w[1][0] = self.w[1][0] - self.eta * g
            g = (self.pred[1] - label[1]) * self.__sigmoid(self.a[1]) * (1 - self.__sigmoid(self.a[1])) * x[0]
            self.w[0][1] = self.w[0][1] - self.eta * g
            g = (self.pred[1] - label[1]) * self.__sigmoid(self.a[1]) * (1 - self.__sigmoid(self.a[1])) * x[1]
            self.w[1][1] = self.w[1][1] - self.eta * g
    
    
    if __name__ == '__main__':
        X, Y = load_data('iris.txt')
        X, Y = np.array(X).astype(float), np.array(Y).astype(float)
    
        model = Network()
        pred = model.forward(X[0])
        print("Label: %d %d, Pred: %f %f" % (Y[0][0], Y[0][1], pred[0], pred[1]))
    
        epoch = 100
        for i in range(epoch):
            loss = 0
            for j in range(len(X)):
                pred = model.forward(X[j])
                loss = loss - Y[j][0] * log(pred[0]) - Y[j][1] * log(pred[1])  # cross-entropy
                model.step(X[j], Y[j])
            print("Loss: %f" % loss)
    
        pred = model.forward(X[0])
        print("Label: %d %d, Pred: %f %f" % (Y[0][0], Y[0][1], pred[0], pred[1]))
    
    

    Before training, the network predicts:

    Label: 1 0, Pred: 0.500000 0.500000
    Loss: 55.430875
    

    With a learning rate of 0.5, after 100 epochs of training:

    Label: 1 0, Pred: 0.593839 0.406161
    Loss: 52.136626
    

    Conclusion

    After training the loss has decreased and the predictions have moved toward the labels, so this experiment counts as a success.

    That said, the model has very few parameters, so its learning capacity is limited.

    Reference

    Derivative of Softmax Loss Function
