• 深度学习入门——基于Python的理论与实现 读书笔记


    深度学习入门——基于Python的理论与实现

    第一章 python入门

    1.5.1 广播

    import numpy as np
    
    # 广播
    A = np.array([[1, 2], [3, 4]])
    B = np.array([10, 20])
    C = np.array([[10, 20], [30, 40]])
    
    print(A * 10)
    print(A * B)
    print(A * C)
    

    输出:

    [[10 20]
     [30 40]]
    [[10 40]
     [30 80]]
    [[ 10  40]
     [ 90 160]]
    

    第二章 感知机

    • 感知机是由美国学者Frank Rosenblatt在1957年提出来的。

    • 只有一个神经元的单层神经网络,可以完成简单的线性分类任务

    • 多个输入、一个输出信号

    2.1 感知机是什么

    • 感知机的信号只有0/1两种取值

    • 输出为1时神经元被激活,此时的界限值称为阈值

      图 2.1

    • 数学表示为 $ y = \begin{cases} 0, (w_1x_1 + w_2x_2 \leq \theta) \\ 1, (w_1x_1 + w_2x_2 > \theta) \end{cases} $

      • 多个输入信号都有各自固有的权重,权重越大,其对应的信号重要性越高

    2.2 简单逻辑电路

    2.2.1 与门

    AND gate

    2.2.2 与非门和或门

    NAND gate

    OR gate

    2.3 感知机的实现

    2.3.1 简单的实现

    # AND gate
    def AND(x1, x2):
        w1, w2, theta = 0.5, 0.5, 0.7
        tmp = x1 * w1 + x2 * w2
        if tmp <= theta:
            return 0
        else:
            return 1
    
    
    print(AND(0, 0))
    print(AND(1, 0))
    print(AND(0, 1))
    print(AND(1, 1))
    
    # output:
    # 0 0 0 1
    

    2.3.2 导入权重和偏置

    把 $ \theta $ 换成 $ -b $ ,则感知机的行为可用数学式表示为:

    \[y = \begin{cases} 0,(b + w_1 x_1 + w_2 x_2 \leq 0 )\\ 1,(b + w_1 x_1 + w_2 x_2 > 0 ) \end{cases} \]

    • b称为偏置
    • $ w_1 , w_2 $ 称为权重

    2.3.3 使用权重和偏置的实现

    import numpy as np
    
    
    # new AND gate
    def AND(x1, x2):
        x = np.array([x1, x2])
        w = np.array([0.5, 0.5])
        b = -0.7
        tmp = np.sum(w * x) + b
        if tmp <= 0:
            return 0
        else:
            return 1
    
    
    # NAND gate
    def NAND(x1, x2):
        x = np.array([x1, x2])
        w = np.array([-0.5, -0.5])
        b = 0.7
        tmp = np.sum(w * x) + b
        if tmp <= 0:
            return 0
        else:
            return 1
    
    
    # OR gate
    def OR(x1, x2):
        x = np.array([x1, x2])
        w = np.array([0.5, 0.5])
        b = -0.2
        tmp = np.sum(w * x) + b
        if tmp <= 0:
            return 0
        else:
            return 1
    
    
    print(AND(0, 0))
    print(AND(1, 0))
    print(AND(0, 1))
    print(AND(1, 1))
    
    print(NAND(0, 0))
    print(NAND(1, 0))
    print(NAND(0, 1))
    print(NAND(1, 1))
    
    print(OR(0, 0))
    print(OR(1, 0))
    print(OR(0, 1))
    print(OR(1, 1))
    
    # output
    # 0 0 0 1
    # 1 1 1 0
    # 0 1 1 1
    

    2.4 感知机的局限性

    2.4.1 异或门

    无法画一条直线把异或门的两类输出分开,即异或门不是线性可分的,单层感知机无法表示它

    2.5 多层感知机

    可使用与非门、或门和与门实现异或门

    图 2.2

    2.5.2 异或门的具体实现

    # XOR gate
    def XOR(x1, x2):
        s1 = NAND(x1, x2)
        s2 = OR(x1, x2)
        y = AND(s1, s2)
        return y
    
    
    print(XOR(0, 0))
    print(XOR(1, 0))
    print(XOR(0, 1))
    print(XOR(1, 1))
    
    # output
    # 0 1 1 0
    

    叠加了多层的感知机也称为多层感知机(multi-layered perceptron)

    图 2.3

    第三章 神经网络

    3.1 从感知机到神经网络

    3.1.1 神经网络的例子

    图 3.1

    3.1.2 复习感知机

    图 3.2

    3.2 激活函数

    感知机中使用了阶跃函数作为激活函数。也就是说,在激活函数的众多候选函数中,感知机使用了阶跃函数。

    如果将激活函数从阶跃函数换成其他函数,就可以进入神经网络的世界了。

    3.2.1 sigmoid函数

    \[h(x)=\frac{1}{1+e^{-x}} \]

    • pros sigmoid的输出被限制在(0, 1)的固定范围内,传给下一层的输入取值比较稳定
    • cons 输出均值不是0(输出不以0为中心)
    • cons 含指数运算,计算复杂度较高
    • cons 存在饱和性问题:输入很大或很小时梯度趋近于0,容易造成梯度消失

    3.2.2 阶跃函数的实现

    \[h(x)= \begin{cases} 0, x \le 0 \\ 1, x>0 \end{cases} \]

    import numpy as np
    
    
    def step_function1(x):
        y = x > 0
        return y.astype(int)    # np.int 已从新版 NumPy 中移除,转换成内置 int 即可
    

    3.2.3 阶跃函数的图形

    import numpy as np
    import matplotlib.pylab as plt
    
    def step_function(x):
        return np.array(x > 0, dtype=int)
    
    
    x = np.arange(-5.0, 5.0, 0.1)
    y = step_function(x)
    plt.plot(x, y)
    plt.ylim(-0.1, 1.1)
    plt.show()
    

    图 3.3

    3.2.4 sigmoid 函数的实现

    # sigmoid 函数
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    
    
    x = np.arange(-5.0, 5.0, 0.1)
    y = sigmoid(x)
    
    plt.plot(x, y)
    plt.ylim(-0.1, 1.1)
    plt.show()
    

    图 3.4

    3.2.5 sigmoid 函数和阶跃函数的比较

    图 3.5

    sigmoid函数与阶跃函数相比:

    • 平滑性不同
      • sigmoid函数是一条平滑的曲线,输出随着输入发生连续性的变化。sigmoid函数的平滑性对神经网络的学习具有重要意义。
      • 而阶跃函数以0为界,输出发生急剧性的变化。
    • 连续性不同
      • sigmoid函数可以返回0.731 ...、0.880 ...等实数(这一点和刚才的平滑性有关)
      • 阶跃函数只能返回0或1

    相同点:

    • 两者的结构均是“输入小时,输出接近0(为0);随着输入增大,输出向1靠近(变成1)”。也就是说,当输入信号为重要信息时,阶跃函数和sigmoid函数都会输出较大的值;当输入信号为不重要的信息时,两者都输出较小的值。
    • 不管输入信号有多小,或者有多大,输出信号的值都在0到1之间

    3.2.6 非线性函数

    神经网络的激活函数必须使用非线性函数。换句话说,激活函数不能使用线性函数。为什么不能使用线性函数呢?因为使用线性函数的话,加深神经网络的层数就没有意义了。
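
    例如,若把线性函数 $ h(x) = cx $ 当作激活函数,叠加三层后有

    \[y(x) = h(h(h(x))) = c \cdot c \cdot c \cdot x = c^3 x \]

    这与单层的 $ y(x) = ax $ (取 $ a = c^3 $ )完全等价,叠加多层失去了意义,所以激活函数必须使用非线性函数。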

    3.2.7 ReLU函数

    ReLU函数, Rectified Linear Unit。

    \[h(x) = \begin{cases} x ( x > 0) \\ 0 ( x \le 0) \end{cases} \]

    # ReLU
    def relu(x):
        return np.maximum(0, x)
    
    
    x = np.arange(-6, 6, 2)
    y = relu(x)
    plt.plot(x, y)
    plt.ylim(-1, 5)
    
    plt.show()
    

    图 3.6

    3.3 多维数组的运算

    3.3.1 多维数组

    import numpy as np
    
    A = np.array([[1, 2], [3, 4], [5, 6]])
    print(A)
    print(np.ndim(A))   # np.ndim()返回数组的维数(轴的个数)
    print(A.shape)
    
    # output
    # [[1 2]
    # [3 4]
    # [5 6]]
    # 2
    # (3, 2)
    

    3.3.2 矩阵乘法

    import numpy as np
    
    A = np.array([[1, 2], [3, 4]])
    B = np.array([[5, 6], [7, 8]])
    
    print(np.dot(A, B))
    
    # output
    # [[19 22]
    # [43 50]]
    
    

    3.3.3 神经网络的内积

    图 3.7

    import numpy as np
    
    x = np.array([1, 2])
    w = np.array([[1, 3, 5], [2, 4, 6]])
    y = np.dot(x, w)
    print(y)
    
    # output
    # [ 5 11 17]
    

    3.4 三层神经网络的实现

    图 3.8 ~ 图 3.11

    代码实现

    import numpy as np
    
    
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    
    
    def init_network():
        network = {'W1': np.array([[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]]),
                   'b1': np.array([0.1, 0.2, 0.3]),
                   'W2': np.array([[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]]),
                   'b2': np.array([0.1, 0.2]),
                   'W3': np.array([[0.1, 0.3], [0.2, 0.4]]),
                   'b3': np.array([0.1, 0.2])}
    
        return network
    
    
    def identity_function(x):
        return x
    
    
    def forward(network, x):
        W1, W2, W3 = network['W1'], network['W2'], network['W3']
        b1, b2, b3 = network['b1'], network['b2'], network['b3']
    
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        z2 = sigmoid(a2)
        a3 = np.dot(z2, W3) + b3
        y = identity_function(a3)
    
        return y
    
    
    network = init_network()
    x = np.array([1.0, 0.5])
    y = forward(network, x)
    print(y)
    
    # output
    # [0.31682708 0.69627909]
    

    3.5 输出层的设计

    神经网络可以用在分类问题和回归问题上,不过需要根据情况改变输出层的激活函数。一般而言,回归问题用恒等函数,分类问题用softmax函数。

    3.5.1 恒等函数和softmax函数

    • 恒等函数

    \[ y_i = x_i \]

    将输入按原样输出,对于输入的信息,不加以任何改动地直接输出。因此,在输出层使用恒等函数时,输入信号会原封不动地被输出。

    图 3.12

    • softmax函数

    \[y_k = \frac{e^{a_k}}{\sum^n_{i=1}e^{a_i}} \]

      • 假设输出层共有 $ n $ 个神经元,计算第 $ k $ 个神经元的输出 $ y_k $
      • softmax函数的分子是输入信号 $ a_k $ 的指数函数
      • 分母是所有输入信号的指数函数的和。

    图 3.13

    import numpy as np
    
    def softmax(a):
        exp_a = np.exp(a)
        sum_exp_a = np.sum(exp_a)
        y = exp_a / sum_exp_a
    
        return y
    
    a = np.array([0.3, 2.9, 4.0])
    y = softmax(a)
    
    print(y)
    
    # output
    # [0.01821127 0.24519181 0.73659691]
    

    3.5.2 实现softmax函数的注意事项

    注意溢出问题

    改进:

    \[y_k = \frac{e^{a_k}}{\sum^n_{i=1}e^{a_i}}= \frac{C \cdot e^{a_k}}{C \cdot \sum^n_{i=1}e^{a_i}} \\ = \frac{e^{a_k + \log{C}}}{\sum^n_{i=1}e^{a_i + \log C}}\\ =\frac{e^{a_k + C'}}{\sum^n_{i=1}e^{a_i + C'}} \]

    $ C' $ 为 $ \log C $

    在进行softmax的指数函数的运算时,加上(或者减去)某个常数并不会改变运算的结果。这里的 $ C' $ 可以使用任何值,但是为了防止溢出,一般会取输入信号中的最大值,在做指数运算前先减去它。

    def softmax(a):
        c = np.max(a)
        exp_a = np.exp(a - c)       # 溢出对策
        sum_exp_a = np.sum(exp_a)
        y = exp_a / sum_exp_a
    
        return y
    
    
    a = np.array([0.3, 2.9, 4.0])
    y = softmax(a)
    
    print(y)
    print(np.sum(y))
    
    # output
    # [0.01821127 0.24519181 0.73659691]
    # 1.0
    

    3.5.3 softmax函数的特征

    • softmax函数的输出是0.0到1.0之间的实数。并且,softmax函数的输出值的总和是1。

    • 可以把softmax函数的输出解释为“概率”。

    一般而言,神经网络只把输出值最大的神经元所对应的类别作为识别结果。并且,即便使用softmax函数,输出值最大的神经元的位置也不会变。因此,神经网络在进行分类时,输出层的softmax函数可以省略。在实际的问题中,由于指数函数的运算需要一定的计算机运算量,因此输出层的softmax函数一般会被省略。

    3.5.4 输出层的神经元数量

    输出层的神经元数量需要根据待解决的问题来决定。对于分类问题,输出层的神经元数量一般设定为类别的数量。

    图 3.14

    3.6 手写数字识别

    mnist.py

    # package MNIST/dataset/mnist.py
    
    import os
    import pickle
    import gzip
    import numpy as np
    
    img_size = 784
    
    filepath = r'G:\code\Python\FirstNumPy\NeuralNetwork\MNIST\dataset'
    save_file = 'my_mnist.pkl'
    
    key_file = {
        'train_img': 'train-images-idx3-ubyte.gz',
        'train_label': 'train-labels-idx1-ubyte.gz',
        'test_img': 't10k-images-idx3-ubyte.gz',
        'test_label': 't10k-labels-idx1-ubyte.gz'
    }
    
    
    def load_label(filename):
        path = filepath + os.sep + filename
        with gzip.open(path, 'rb') as f:
            labels = np.frombuffer(f.read(), np.uint8, offset=8)
    
            return labels
    
    
    def load_img(filename):
        path = filepath + os.sep + filename
        with gzip.open(path, 'rb') as f:
            data = np.frombuffer(f.read(), np.uint8, offset=16)
            data = data.reshape(-1, img_size)
            print("done")
    
            return data
    
    
    def convert_numpy():
        dataset = {'train_img': load_img(key_file['train_img']),
                   'train_label': load_label(key_file['train_label']),
                   'test_img': load_img(key_file['test_img']),
                   'test_label': load_label(key_file['test_label'])}
    
        return dataset
    
    
    def init_mnist():  # serialize the dataset to a pkl file; later runs just deserialize (unpickle) it
        dataset = convert_numpy()
        with open(save_file, 'wb') as f:
            pickle.dump(dataset, f, -1)
        print('done!')
    
    
    def _change_one_hot_label(X):  # turn labels into one-hot vectors
        T = np.zeros((X.size, 10))
        for idx, row in enumerate(T):
            row[X[idx]] = 1
    
        return T
    
    
    def load_mnist(normalize=True, flatten=True, one_hot_label=False):
        """ read MNIST dataset
    
        :param normalize: 将图像像素正规化为0.0~1.0
        :param flatten: 是否将图像展开为一维数组
        :param one_hot_label: One_hot_label为true时,标签作为one-hot数组返回
        :return: train_img, train_label, test_img, test_label
    
        """
        if not os.path.exists(save_file):
            init_mnist()
    
        with open(save_file, 'rb') as f:
            dataset = pickle.load(f)
    
            if normalize:  # 归一化,将0-255像素值转化到0-1之间
                for key in ('train_img', 'test_img'):
                    dataset[key] = dataset[key].astype(np.float32)
                    dataset[key] /= 255.0
    
            if one_hot_label:
                dataset['train_label'] = _change_one_hot_label(dataset['train_label'])
                dataset['test_label'] = _change_one_hot_label(dataset['test_label'])
    
            if not flatten:
                for key in ('train_img', 'test_img'):
                    dataset[key] = dataset[key].reshape(-1, 1, 28, 28)
    
            return dataset['train_img'], dataset['train_label'], dataset['test_img'], dataset['test_label']
    
    

    mnist_show.py

    import sys, os
    sys.path.append(os.pardir)  # for import FDir
    import numpy as np
    from dataset.mnist import load_mnist
    from PIL import Image
    
    a = load_mnist(flatten=True, normalize=False)
    (x_train, t_train), (x_test, t_test) = (a[0], a[1]), (a[2], a[3])
    
    # print shape
    print(x_train.shape)
    print(t_train.shape)
    print(x_test.shape)
    print(t_test.shape)
    
    
    def img_show(img):
        pil_img = Image.fromarray(np.uint8(img))
        pil_img.show()
    
    img = x_train[0]
    label = t_train[0]
    print(label)
    
    print(img.shape)
    img = img.reshape(28, 28)
    
    img_show(img)
    
    # output
    # (60000, 784)
    # (60000,)
    # (10000, 784)
    # (10000,)
    # 5
    # (784,)
    

    图 3.15

    neuralnet_mnist.py

    import os
    import pickle
    import sys
    import numpy as np
    
    sys.path.append(os.pardir)  # for import FDir
    from dataset.mnist import load_mnist
    
    
    def get_data():
        a = load_mnist(flatten=True, normalize=True, one_hot_label=False)
        (x_train, t_train), (x_test, t_test) = (a[0], a[1]), (a[2], a[3])
        return x_test, t_test
    
    
    def init_network():
        with open("sample_weight.pkl", 'rb') as f:
            network = pickle.load(f)
    
    
        return network
    
    
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    
    
    def predict(network, x):
        W1, W2, W3 = network['W1'], network['W2'], network['W3']
        b1, b2, b3 = network['b1'], network['b2'], network['b3']
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        z2 = sigmoid(a2)
        a3 = np.dot(z2, W3) + b3
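        # 原书此处使用 softmax;sigmoid 逐元素单调,不改变 argmax 的分类结果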
        y = sigmoid(a3)
    
        return y
    
    
    x, t = get_data()
    network = init_network()
    
    accuracy_cnt = 0
    for i in range(len(x)):
        y = predict(network, x[i])
        p = np.argmax(y)
        if p == t[i]:
            accuracy_cnt += 1
    
    print("Accuracy:" + str(float(accuracy_cnt) / len(x)))
    
    # output
    # 0.9352
    

    3.6.3 批处理

    neuralnet_mnist_batch.py

    import os
    import pickle
    import sys
    import numpy as np
    
    sys.path.append(os.pardir)  # for import FDir
    from dataset.mnist import load_mnist
    
    
    def get_data():
        a = load_mnist(flatten=True, normalize=True, one_hot_label=False)
        (x_train, t_train), (x_test, t_test) = (a[0], a[1]), (a[2], a[3])
        return x_test, t_test
    
    
    def init_network():
        with open("sample_weight.pkl", 'rb') as f:
            network = pickle.load(f)
    
    
        return network
    
    
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    
    
    def predict(network, x):
        W1, W2, W3 = network['W1'], network['W2'], network['W3']
        b1, b2, b3 = network['b1'], network['b2'], network['b3']
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        z2 = sigmoid(a2)
        a3 = np.dot(z2, W3) + b3
        y = sigmoid(a3)
    
        return y
    
    
    x, t = get_data()
    network = init_network()
    
    batch_size = 100            # batch size
    accuracy_cnt = 0
    for i in range(0, len(x), batch_size):
        x_batch = x[i:i+batch_size]
        y_batch = predict(network, x_batch)
        p = np.argmax(y_batch, axis=1)
        accuracy_cnt += np.sum(p == t[i:i+batch_size])
    
    print("Accuracy:" + str(float(accuracy_cnt) / len(x)))
    
    # output
    # Accuracy:0.9352
    

    与neuralnet_mnist.py的不同之处:

    # 主体实现部分
    x, t = get_data()
    network = init_network()
    
    batch_size = 100            # batch size
    accuracy_cnt = 0
    for i in range(0, len(x), batch_size):
        x_batch = x[i:i+batch_size]
        y_batch = predict(network, x_batch)
        p = np.argmax(y_batch, axis=1)
        accuracy_cnt += np.sum(p == t[i:i+batch_size])
    
    
    • range()函数。

      • range()函数若指定为range(start, end),则会生成一个由start到end-1之间的整数构成的列表。
      • 若像range(start, end, step)这样指定3个整数,则生成的列表中的下一个元素会增加step指定的值。

      在range()函数生成的序列的基础上,通过x[i:i+batch_size]从输入数据中抽出批数据。x[i:i+batch_size]会取出从第i个到第i+batch_size个(不含)之间的数据。本例中是像x[0:100]、x[100:200]……这样,从头开始以100为单位将数据提取为批数据。

    • 通过argmax()获取值最大的元素的索引(见下面的小例子)。

      • 不过这里需要注意的是,我们给定了参数axis=1。这指定了在100 × 10的数组中,沿着第1维方向(即每一行、每个样本的10个输出)找到值最大的元素的索引;数组的第0维对应样本,第1维对应类别。
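
    一个小例子(数值仅作示意):

    import numpy as np
    
    # 每一行对应一个样本在3个类别上的输出
    y = np.array([[0.1, 0.8, 0.1],
                  [0.3, 0.1, 0.6],
                  [0.2, 0.5, 0.3]])
    print(np.argmax(y, axis=1))     # [1 2 1],即每个样本输出值最大的类别索引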

    第四章 神经网络的学习

    4.1 从数据中学习

    “从数据中学习”,是指可以由数据自动决定权重参数的值。

    4.1.1 数据驱动

    • 从图像中提取特征量
    • 用机器学习技术学习这些特征量的模式

    图 4.1

    深度学习有时也称为端到端机器学习。

    端到端指从一端到另一端,即从原始数据(输入)中获得目标结果(输出)。

    4.1.2 训练数据和测试数据

    使用训练数据进行学习,寻找最优的参数;然后,使用测试数据评价训练得到的模型的实际能力。

    为了正确评价模型的泛化能力,就必须划分训练数据和测试数据。另外,训练数据也可以称为监督数据。

    只对某个数据集过度拟合的状态称为过拟合(over fitting)。

    4.2 损失函数

    4.2.1 均方误差

    \[E = \frac 1 2 \sum_k(y_k -t_k)^2 \]

    • $ y_k $ 表示神经网络的输出
    • $ t_k $ 表示监督数据
    • $ k $ 表示数据的维数

    4.2.2 交叉熵误差

    \[E = - \sum_k{t_k \log{y_k}} \]

    • $ y_k $ 是神经网络的输出

    • $ t_k $是正确解标签

      $ t_k $ 中只有正确解标签的索引为1,其他均为0(one-hot表示)

    import numpy as np
    
    
    # 均方误差 mean_squared_error
    def mean_squared_error(y, t):
        return 0.5 * np.sum((y-t)**2)
    
    
    # 交叉熵误差 cross_entropy_error
    def cross_entropy_error(y, t):
        delta = 1e-7
        return -np.sum(t * np.log(y + delta))
    
    
    # 设“2”为正确解
    t1 = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    # 例1:“2”的概率最高的情况(0.6)
    y1 = [0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0]
    print(mean_squared_error(np.array(y1), np.array(t1)))
    # 0.09750000000000003
    # 例2:“7”的概率最高的情况(0.6)
    y2 = [0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0]
    print(mean_squared_error(np.array(y2), np.array(t1)))
    # 0.5975
    
    t2 = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    y3 = [0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0]
    print(cross_entropy_error(np.array(y3), np.array(t2)))
    # 0.510825457099338
    y4 = [0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0]
    print(cross_entropy_error(np.array(y4), np.array(t2)))
    # 2.302584092994546
    

    4.2.3 mini-batch 学习

    机器学习使用训练数据进行学习。即针对训练数据计算损失函数的值,找出使该值尽可能小的参数。因此,计算损失函数时必须将所有的训练数据作为对象。也就是说,如果训练数据有100个的话,我们就要把这100个损失函数的总和作为学习的指标。

    若要求所有训练数据的损失函数的总和,以交叉熵误差为例,可以写成下面的式

    \[E = - \frac 1 N \sum_n \sum_k t_{nk} \log{y_{nk}} \]

    假设数据有 \(N\) 个,

    • \(y_{nk}\) 是神经网络的输出(第 \(n\) 个数据的第 \(k\) 个元素的值)
    • \(t_{nk}\) 是监督数据(第 \(n\) 个数据的第 \(k\) 个元素的值)

    式子虽然看起来有一些复杂,其实只是把求单个数据的损失函数的式扩大到了 \(N\) 份数据,不过最后还要除以 $ N $ 进行正规化。通过除以 $ N $ ,可以求单个数据的“平均损失函数”。通过这样的平均化,可以获得和训练数据的数量无关的统一指标。比如,即便训练数据有1000个或10000个,也可以求得单个数据的平均损失函数

    我们从全部数据中选出一部分,作为全部数据的“近似”。神经网络的学习也是从训练数据中选出一批数据(称为mini-batch,小批量),然后对每个mini-batch进行学习。比如,从60000个训练数据中随机选择100笔,再用这100笔数据进行学习。这种学习方式称为mini-batch学习。

    import sys, os
    sys.path.append(os.pardir)
    import numpy as np
    from NeuralNetwork.MNIST.dataset.mnist import load_mnist
    
    a = load_mnist(flatten=True, normalize=False)
    (x_train, t_train), (x_test, t_test) = (a[0], a[1]), (a[2], a[3])
    print(x_train.shape)
    print(t_train.shape)
    
    print('import success!')
    
    train_size = x_train.shape[0]
    batch_size = 10
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    
    print(x_batch)
    print(t_batch)
    
    
    def cross_entropy_error(y, t):
        if y.ndim == 1:
            t = t.reshape(1, t.size)
            y = y.reshape(1, y.size)
    
        batch_size = y.shape[0]
        return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size
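
    这一版 cross_entropy_error 假设 t 以标签(索引)形式给出,其中 y[np.arange(batch_size), t] 会取出每个样本对应正确解标签的输出。一个小例子(数值仅作示意):

    import numpy as np
    
    y = np.array([[0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
    t = np.array([1, 2])              # 两个样本的正确解标签
    print(y[np.arange(2), t])         # [0.8 0.4]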
    

    4.3 数值微分

    以 $ y = 0.01 x^2 + 0.1 x $ 为例计算数值微分

    import numpy as np
    import matplotlib.pylab as plt
    
    
    def function_1(x):
        return 0.01*x**2 + 0.1*x
    
    
    x = np.arange(0.0, 20.0, 0.1)
    y = function_1(x)
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.plot(x, y)
    plt.show()
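
    数值微分可以用中心差分实现。下面是一个简短的实现(示意;h 取 1e-4),并用它求上面的 function_1 在 x=5 和 x=10 处的导数,解析解分别为 0.2 和 0.3:

    # 中心差分形式的数值微分
    def numerical_diff(f, x):
        h = 1e-4                                # 过小的h会引入舍入误差
        return (f(x + h) - f(x - h)) / (2 * h)
    
    
    print(numerical_diff(function_1, 5))        # 约 0.2
    print(numerical_diff(function_1, 10))       # 约 0.3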
    

    图 4.2、图 4.3

    4.3.3 偏导数

    4.4 梯度
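
    梯度是把所有变量的偏导数汇总成的向量。下面给出一个对任意形状的 NumPy 浮点数组逐元素做中心差分的数值梯度实现(示意,可起到后文导入的 common/gradient.py 中 numerical_gradient 的作用):

    import numpy as np
    
    
    def numerical_gradient(f, x):
        """数值梯度:对浮点数组 x 的每个元素分别做中心差分"""
        h = 1e-4
        grad = np.zeros_like(x)
    
        it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            tmp = x[idx]
    
            x[idx] = tmp + h
            fxh1 = f(x)                     # f(x+h)
            x[idx] = tmp - h
            fxh2 = f(x)                     # f(x-h)
    
            grad[idx] = (fxh1 - fxh2) / (2 * h)
            x[idx] = tmp                    # 还原
            it.iternext()
    
        return grad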

    4.5 学习算法的实现

    • 步骤1(mini-batch)
      • 从训练数据中随机选出一部分数据,这部分数据称为mini-batch。我们的目标是减小mini-batch的损失函数的值。
    • 步骤2(计算梯度)
      • 为了减小mini-batch的损失函数的值,需要求出各个权重参数的梯度。梯度表示损失函数的值减小最多的方向。
    • 步骤3(更新参数)
      • 将权重参数沿梯度方向进行微小更新。
    • 步骤4(重复)
      • 重复步骤1、步骤2、步骤3。

    4.5.1 2层神经网络的类

    import sys, os
    
    import numpy as np
    
    sys.path.append(os.pardir)
    from NeuralNetwork.OutputLayer import softmax
    from NeuralNetworkLearning.loss_func import cross_entropy_error
    from NeuralNetwork.ActFunc import sigmoid
    # 数值梯度函数,见书中源码的 common/gradient.py(或上文 4.4 节的实现)
    from common.gradient import numerical_gradient
    
    
    class TwoLayerNet:
    
        def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
            # init weight
            self.params = {'W1': weight_init_std * np.random.randn(input_size, hidden_size),
                           'b1': np.zeros(hidden_size),
                           'W2': weight_init_std * np.random.randn(hidden_size, output_size),
                           'b2': np.zeros(output_size)}
    
        def predict(self, x):
            W1, W2 = self.params['W1'], self.params['W2']
            b1, b2 = self.params['b1'], self.params['b2']
    
            a1 = np.dot(x, W1) + b1
            z1 = sigmoid(a1)
            a2 = np.dot(z1, W2) + b2
            y = softmax(a2)
    
            return y
    
        # x:input,  t:monitor
        def loss(self, x, t):
            y = self.predict(x)
    
            return cross_entropy_error(y, t)
    
        def accuracy(self, x, t):
            y = self.predict(x)
            y = np.argmax(y, axis=1)
            t = np.argmax(t, axis=1)
    
            accuracy = np.sum(y == t) / float(x.shape[0])
            return accuracy
    
        # x: input data ,   t: monitor data
        def numerical_gradient(self, x, t):
            loss_W = lambda W: self.loss(x, t)
    
            # 调用模块级的 numerical_gradient 函数(而不是 self.numerical_gradient,否则会无限递归)
            grads = {'W1': numerical_gradient(loss_W, self.params['W1']),
                     'b1': numerical_gradient(loss_W, self.params['b1']),
                     'W2': numerical_gradient(loss_W, self.params['W2']),
                     'b2': numerical_gradient(loss_W, self.params['b2']),
                     }
            return grads
    
    
    net = TwoLayerNet(input_size=784, hidden_size=100, output_size=10)
    print(net.params['W1'].shape)
    print(net.params['b1'].shape)
    print(net.params['W2'].shape)
    print(net.params['b2'].shape)
    
    # output
    # (784, 100)
    # (100,)
    # (100, 10)
    # (10,)
    

    4.5.2 mini-batch 的实现

    import numpy as np
    
    from NeuralNetwork.MNIST.dataset.mnist import load_mnist
    from NeuralNetworkLearning.two_layer_net import TwoLayerNet
    
    a = load_mnist(normalize=True, one_hot_label=True)
    (x_train, t_train), (x_test, t_test) = (a[0], a[1]), (a[2], a[3])
    
    train_loss_list = []
    
    # 超参数
    iters_num = 10000
    train_size = x_train.shape[0]
    batch_size = 100
    learning_rate = 0.1
    network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
    
    for i in range(iters_num):
        # get mini-batch
        batch_mask = np.random.choice(train_size, batch_size)
        x_batch = x_train[batch_mask]
        t_batch = t_train[batch_mask]
    
        # get grad
        grad = network.numerical_gradient(x_batch, t_batch)    # 数值梯度,速度很慢
        # grad = network.gradient(x_batch, t_batch)            # 高速版!(第5章实现误差反向传播法后可用)
    
        # update params
        for key in ('W1', 'b1', 'W2', 'b2'):
            network.params[key] -= learning_rate * grad[key]
    
        # write down learning process
        loss = network.loss(x_batch, t_batch)
        train_loss_list.append(loss)
    

    输出图像:

    item=100 / 4_item=100 / 4_item=1000:训练过程中记录的 train_loss_list(损失值随迭代次数的变化,原始数值列表从略)

    图 4.4

    4.5.3 基于测试数据的评价

    import numpy as np
    from dataset.mnist import load_mnist
    from two_layer_net import TwoLayerNet
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)
    
    train_loss_list = []
    train_acc_list = []
    test_acc_list = []
    
    # 超参数
    iters_num = 10000
    train_size = x_train.shape[0]
    batch_size = 100
    learning_rate = 0.1
    
    # 平均每个epoch的重复次数
    iter_per_epoch = max(train_size / batch_size, 1)
    
    network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
    
    for i in range(iters_num):
        # 获取mini-batch
        batch_mask = np.random.choice(train_size, batch_size)
        x_batch = x_train[batch_mask]
        t_batch = t_train[batch_mask]
    
        # 计算梯度
        grad = network.numerical_gradient(x_batch, t_batch)
        # grad = network.gradient(x_batch, t_batch) # 高速版!
    
        # 更新参数
        for key in ('W1', 'b1', 'W2', 'b2'):
            network.params[key] -= learning_rate * grad[key]
    
        loss = network.loss(x_batch, t_batch)
        train_loss_list.append(loss)
    
        # 计算每个epoch的识别精度
        if i % iter_per_epoch == 0:
            train_acc = network.accuracy(x_train, t_train)
            test_acc = network.accuracy(x_test, t_test)
            train_acc_list.append(train_acc)
            test_acc_list.append(test_acc)
            print("train acc, test acc | " + str(train_acc) + ", " + str(test_acc))
    

    图 4.5

    第五章 误差反向传播法

    • 基于数学式
    • 基于计算图

    5.1 计算图

    5.1.1 用计算图求解

    图 5.1

    5.1.2 局部计算

    各个节点处只需进行与自己有关的计算。

    • 计算图可以集中精力于局部计算。
    • 无论全局的计算有多么复杂,各个步骤所要做的就是对象节点的局部计算。
    • 虽然局部计算非常简单,但是通过传递它的计算结果,可以获得全局的复杂计算的结果。

    5.1.3 为何用计算图解题

    • 无论全局是多么复杂的计算,都可以通过局部计算使各个节点致力于简单的计算,从而简化问题。
    • 利用计算图可以将中间的计算结果全部保存起来。但是只有这些理由可能还无法令人信服。
    • 使用计算图最大的原因是,可以通过反向传播高效计算导数。

    5.2 链式法则

    反向传播将局部导数向正方向的反方向(从右到左)传递,一开始可能会让人感到困惑。传递这个局部导数的原理,是基于链式法则(chain rule)的。

    图 5.2
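
    举一个简单的例子:设 \(z = t^2\),\(t = x + y\),由链式法则

    \[\frac{\partial z}{\partial x} = \frac{\partial z}{\partial t} \cdot \frac{\partial t}{\partial x} = 2t \cdot 1 = 2(x + y) \]

    反向传播在每个节点上做的,就是把上游传来的导数乘以该节点的局部导数,这正是链式法则的逐节点应用。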

    5.3 反向传播

    5.3.1 加法节点的反向传播

    加法节点的反向传播只是将输入信号输出到下一个节点

    5.3.2 乘法节点的反向传播

    图 5.3

    5.4 简单层的实现

    • 乘法层
    • 加法层

    5.4.1 乘法层的实现

    layer_naive.py

    class MulLayer:
        def __init__(self):
            self.y = None
            self.x = None
    
        def forward(self, x, y):
            self.x = x
            self.y = y
            out = x * y
    
            return out
    
        def backward(self, dout):
            dx = dout * self.y      # reverse x , y
            dy = dout * self.x
    
            return dx, dy
    

    buy_apple.py

    from BPNeuralNet.layer_naive import MulLayer
    
    apple = 100
    apple_num = 2
    tax = 1.1
    
    # layer
    mul_apple_layer = MulLayer()
    mul_tax_layer = MulLayer()
    
    # forward
    apple_price = mul_apple_layer.forward(apple, apple_num)
    price = mul_tax_layer.forward(apple_price, tax)
    
    print(price)
    
    # backward
    dprice = 1
    dapple_price, dtax = mul_tax_layer.backward(dprice)
    dapple, dapple_num = mul_apple_layer.backward(dapple_price)
    
    print(dapple, dapple_num, dtax)
    
    # output
    # 220.00000000000003
    # 2.2 110.00000000000001 200
    

    5.4.2 加法层的实现

    class AddLayer:
        def __init__(self):
            pass
    
        def forward(self, x, y):
            out = x + y
            return out
    
        def backward(self, dout):
            dx = dout * 1
            dy = dout * 1
            return dx, dy
    

    加法层不需要特意进行初始化,所以__init__()中什么也不运行(pass语句表示“什么也不运行”)。加法层的forward()接收x和y两个参数,将它们相加后输出。backward()将上游传来的导数(dout)原封不动地传递给下游。

    图 5.4

    from BPNeuralNet.layer_naive import MulLayer, AddLayer
    
    apple = 100
    apple_num = 2
    orange = 150
    orange_num = 3
    tax = 1.1
    
    # layer
    mul_apple_layer = MulLayer()
    mul_orange_layer = MulLayer()
    add_apple_orange_layer = AddLayer()
    mul_tax_layer = MulLayer()
    
    # forward
    apple_price = mul_apple_layer.forward(apple, apple_num)
    orange_price = mul_orange_layer.forward(orange, orange_num)
    all_price = add_apple_orange_layer.forward(apple_price,orange_price)
    price = mul_tax_layer.forward(all_price, tax)
    
    # backward
    dprice = 1
    dall_price, dtax = mul_tax_layer.backward(dprice)
    dapple_price, dorange_price = add_apple_orange_layer.backward(dall_price)
    dorange, dorange_num = mul_orange_layer.backward(dorange_price)
    dapple, dapple_num = mul_apple_layer.backward(dapple_price)
    
    print(price)
    print(dapple_num, dapple, dorange, dorange_num, dtax)
    
    # output
    # 715.0000000000001
    # 110.00000000000001 2.2 3.3000000000000003 165.0 650
    

    5.5 激活函数层的实现

    5.5.1 ReLU层

    图 5.5

    class ReLU:
        def __init__(self):
            self.mask = None
    
        def forward(self, x):
            self.mask = (x <= 0)
            out = x.copy()
            out[self.mask] = 0
    
            return out
    
        def backward(self, dout):
            dout[self.mask] = 0
            dx = dout
    
            return dx
    
    

    mask是由True/False构成的NumPy数组,它会把正向传播时的输入x的元素中小于等于0的地方保存为True,其他地方(大于0的元素)保存为False。
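
    mask 的行为可以用一个小例子确认(数值仅作示意):

    import numpy as np
    
    x = np.array([[1.0, -0.5],
                  [-2.0, 3.0]])
    mask = (x <= 0)
    print(mask)
    # [[False  True]
    #  [ True False]]
    
    out = x.copy()
    out[mask] = 0
    print(out)
    # [[1. 0.]
    #  [0. 3.]]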

    5.5.2 Sigmoid层

    图 5.6 ~ 图 5.9

    class Sigmoid:
        def __init__(self):
            self.out = None
    
        def forward(self, x):
            out = 1 / (1 + np.exp(-x))
            self.out = out
    
            return out
    
        def backward(self, dout):
            dx = dout * (1.0 - self.out) * self.out
    
            return dx
    

    正向传播时将输出保存在了实例变量out中。然后,反向传播时,使用该变量out进行计算。
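
    backward 中的式子来自 sigmoid 函数的导数:设 \(y = \frac{1}{1+e^{-x}}\),则

    \[\frac{\partial y}{\partial x} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = y(1-y) \]

    因此反向传播只需把上游传来的 dout 乘以 \(y(1-y)\),而 \(y\) 正是正向传播时保存下来的 out。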

    5.6 Affine/Softmax 层的实现

    5.6.1 Affine 层

    神经网络的正向传播中进行的矩阵的乘积运算在几何学领域被称为“仿射变换”。因此,这里将进行仿射变换的处理实现为“Affine层”。

    图 5.10 ~ 图 5.12

    5.6.2 批版本的Affine层

    图 5.13

    class Affine:
        def __init__(self, W, b):
            self.W = W
            self.b = b
            self.x = None
            self.dW = None
            self.db = None
    
        def forward(self, x):
            self.x = x
            out = np.dot(x, self.W) + self.b
    
            return out
    
        def backward(self, dout):
            dx = np.dot(dout, self.W.T)
            self.dW = np.dot(self.x.T, dout)
            self.db = np.sum(dout, axis=0)
    
            return dx
    

    5.6.3 Softmax-with-Loss层

    图 5.14

    神经网络中进行的处理有推理(inference)和学习两个阶段。神经网络的推理通常不使用 Softmax层。比如,用图 5-28的网络进行推理时,会将最后一个 Affine层的输出作为识别结果。神经网络中未被正规化的输出结果(图 5-28中 Softmax层前面的 Affine层的输出)有时被称为“得分”。也就是说,当神经网络的推理只需要给出一个答案的情况下,因为此时只对得分最大值感兴趣,所以不需要 Softmax层。不过,神经网络的学习阶段则需要 Softmax层。

    图 5.15

    简化:

    softmax函数记为Softmax层,交叉熵误差记为Cross Entropy Error层。这里假设要进行3类分类,从前面的层接收3个输入(得分)。如图5-30所示,Softmax层将输入(a1, a2, a3)正规化,输出(y1, y2, y3)。Cross Entropy Error层接收Softmax的输出(y1, y2, y3)和教师标签(t1, t2, t3),从这些数据中输出损失L

    图 5.16

    需要注意的是反向传播的结果。Softmax层的反向传播得到了(y1 - t1, y2 - t2, y3 - t3)这样“漂亮”的结果。由于(y1, y2, y3)是Softmax层的输出,(t1, t2, t3)是监督数据,所以(y1 - t1, y2 - t2, y3 - t3)是Softmax层的输出和教师标签的差分。神经网络的反向传播会把这个差分表示的误差传递给前面的层,这是神经网络学习中的重要性质。

    神经网络学习的目的就是通过调整权重参数,使神经网络的输出(Softmax的输出)接近教师标签。因此,必须将神经网络的输出与教师标签的误差高效地传递给前面的层。刚刚的(y1 - t1, y2 - t2, y3 - t3)正是Softmax层的输出与教师标签的差,直截了当地表示了当前神经网络的输出与教师标签的误差。

    使用交叉熵误差作为 softmax函数的损失函数后,反向传播得到(y1 - t1, y2 - t2, y3 - t3)这样“漂亮”的结果。实际上,这样“漂亮”的结果并不是偶然的,而是为了得到这样的结果,特意设计了交叉熵误差函数。回归问题中输出层使用“恒等函数”,损失函数使用“平方和误差”,也是出于同样的理由(3.5节)。也就是说,使用“平方和误差”作为“恒等函数”的损失函数,反向传播才能得到(y1 - t1, y2 - t2, y3 - t3)这样“漂亮”的结果。

    class SoftmaxWithLoss:
        def __init__(self):
            self.loss = None
            self.y = None           # softmax output
            self.t = None           # monitor data (one-hot vector)
            
        def forward(self, x, t):
            self.t = t
            self.y = softmax(x)
            self.loss = cross_entropy_error(self.y, self.t)
            
            return self.loss
        
        def backward(self, dout= 1):
            batch_size = self.t.shape[0]
            dx = (self.y - self.t) / batch_size
            
            return dx
    

    5.7 误差反向传播法的实现

    5.7.1 神经网络学习的全貌图

    • step 1 mini-batch
      • 从训练数据中随机选择一部分数据。
    • step 2 get grad
      • 计算损失函数关于各个权重参数的梯度。
    • step 3 update the params
      • 将权重参数沿梯度方向进行微小的更新。
    • step 4 repeat
      • 重复步骤1、步骤2、步骤3。

    5.7.2 对应误差反向传播法的神经网络的实现

    图 5.17

    import sys, os
    
    sys.path.append(os.pardir)
    import numpy as np
    from common.layers import *
    from common.gradient import numerical_gradient
    from collections import OrderedDict
    
    
    class TwoLayerNet:
        def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
            # init weight
            self.params = {
                'W1': weight_init_std * np.random.randn(input_size, hidden_size),
                'b1': np.zeros(hidden_size),
                'W2': weight_init_std * np.random.randn(hidden_size, output_size),
                'b2': np.zeros(output_size)
            }
    
            # layers
            self.layers = OrderedDict()
            self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
            self.layers['Relu1'] = ReLU()
            self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])
    
            self.lastLayer = SoftmaxWithLoss()
    
        def predict(self, x):
            for layer in self.layers.values():
                x = layer.forward(x)
    
            return x
    
        # x:input t:monitor
        def loss(self, x, t):
            y = self.predict(x)
    
            return self.lastLayer.forward(y, t)
    
        def accuracy(self, x, t):
            y = self.predict(x)
            y = np.argmax(y, axis=1)
            if t.ndim != 1:
                t = np.argmax(t, axis=1)
    
            accuracy = np.sum(y == t) / float(x.shape[0])
    
            return accuracy
    
        # x:input t:monitor
        def numerical_gradient(self, x, t):
            loss_W = lambda W: self.loss(x, t)
    
            grads = {
                'W1': numerical_gradient(loss_W, self.params['W1']),
                'b1': numerical_gradient(loss_W, self.params['b1']),
                'W2': numerical_gradient(loss_W, self.params['W2']),
                'b2': numerical_gradient(loss_W, self.params['b2']),
            }
    
            return grads
    
        def gradient(self, x, t):
            # forward
            self.loss(x, t)
    
            # backward
            dout = 1
            dout = self.lastLayer.backward(dout)
    
            layers = list(self.layers.values())
            layers.reverse()
            for layer in layers:
                dout = layer.backward(dout)
    
            # setup
            grads = {
                'W1': self.layers['Affine1'].dW,
                'b1': self.layers['Affine1'].db,
                'W2': self.layers['Affine2'].dW,
                'b2': self.layers['Affine2'].db
            }
    
            return grads
    

    OrderedDict是有序字典,“有序”是指它可以记住向字典里添加元素的顺序。因此,神经网络的正向传播只需按照添加元素的顺序调用各层的forward()方法就可以完成处理,而反向传播只需要按照相反的顺序调用各层即可。因为Affine层和ReLU层的内部会正确处理正向传播和反向传播,所以这里要做的事情仅仅是以正确的顺序连接各层,再按顺序(或者逆序)调用各层。

    像这样通过将神经网络的组成元素以层的方式实现,可以轻松地构建神经网络。这个用层进行模块化的实现具有很大优点。因为想另外构建一个神经网络(比如5层、10层、20层……的大的神经网络)时,只需像组装乐高积木那样添加必要的层就可以了。之后,通过各个层内部实现的正向传播和反向传播,就可以正确计算进行识别处理或学习所需的梯度。

    5.7.3 误差反向传播法的梯度确认

    数值微分的优点是实现简单,因此,一般情况下不太容易出错。而误差反向传播法的实现很复杂,容易出错。所以,经常会比较数值微分的结果和误差反向传播法的结果,以确认误差反向传播法的实现是否正确。确认数值微分求出的梯度结果和误差反向传播法求出的结果是否一致(严格地讲,是非常相近)的操作称为梯度确认(gradient check)。

    # coding: utf-8
    import sys, os
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import numpy as np
    from dataset.mnist import load_mnist
    from two_layer_net import TwoLayerNet
    
    # 读入数据
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)
    
    network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
    
    x_batch = x_train[:3]
    t_batch = t_train[:3]
    
    grad_numerical = network.numerical_gradient(x_batch, t_batch)
    grad_backprop = network.gradient(x_batch, t_batch)
    
    for key in grad_numerical.keys():
        diff = np.average( np.abs(grad_backprop[key] - grad_numerical[key]) )
        print(key + ":" + str(diff))
        
    # output
    # W1:3.915006700095741e-10
    # b1:2.284402318441549e-09
    # W2:5.4334005556133635e-09
    # b2:1.3996668284527168e-07
    
    

    5.7.4 使用误差反向传播法的学习

    import sys, os
    sys.path.append(os.pardir)
    
    import numpy as np
    from dataset.mnist import load_mnist
    from two_layer_net import TwoLayerNet
    
    # 读入数据
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)
    
    network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
    
    iters_num = 10000
    train_size = x_train.shape[0]
    batch_size = 100
    learning_rate = 0.1
    
    train_loss_list = []
    train_acc_list = []
    test_acc_list = []
    
    iter_per_epoch = max(train_size / batch_size, 1)
    
    for i in range(iters_num):
        batch_mask = np.random.choice(train_size, batch_size)
        x_batch = x_train[batch_mask]
        t_batch = t_train[batch_mask]
        
        # 梯度
        #grad = network.numerical_gradient(x_batch, t_batch)
        grad = network.gradient(x_batch, t_batch)
        
        # 更新
        for key in ('W1', 'b1', 'W2', 'b2'):
            network.params[key] -= learning_rate * grad[key]
        
        loss = network.loss(x_batch, t_batch)
        train_loss_list.append(loss)
        
        if i % iter_per_epoch == 0:
            train_acc = network.accuracy(x_train, t_train)
            test_acc = network.accuracy(x_test, t_test)
            train_acc_list.append(train_acc)
            test_acc_list.append(test_acc)
            print(train_acc, test_acc)
            
    # output
    # 0.09918333333333333 0.1038
    # 0.9054833333333333 0.9099
    # 0.925 0.9284
    # 0.9399 0.9395
    # 0.9452 0.9441
    # 0.9528833333333333 0.9492
    # 0.9583333333333334 0.9561
    # 0.9625666666666667 0.9599
    # 0.9653 0.9622
    # 0.96625 0.9594
    # 0.9708166666666667 0.9657
    # 0.9724 0.9663
    # 0.9726666666666667 0.9673
    # 0.9753833333333334 0.9688
    # 0.9770166666666666 0.9706
    # 0.9784333333333334 0.9699
    # 0.9792166666666666 0.9705
    
    

    第六章 与学习相关的技巧

    6.1 参数的更新

    使用参数的梯度,沿梯度方向更新参数,并重复这个步骤多次,从而逐渐靠近最优参数,这个过程称为随机梯度下降法(stochastic gradient descent),简称SGD。SGD是一个简单的方法,不过比起胡乱地搜索参数空间,也算是“聪明”的方法。但是,根据不同的问题,也存在比SGD更加聪明的方法。

    6.1.1 探险家的故事

    探险家虽然看不到周围的情况,但是能够知道当前所在位置的坡度(通过脚底感受地面的倾斜状况)。于是,朝着当前所在位置的坡度最大的方向前进,就是SGD的策略。勇敢的探险家心里可能想着只要重复这一策略,总有一天可以到达“至深之地”。

    6.1.2 SGD

    \[W \leftarrow W - \eta \frac{\partial L}{\partial W} \]

    • 更新的权重参数记为W
    • 损失函数关于W的梯度记为 $ \frac{\partial L}{\partial W} $
    • η表示学习率,实际上会取0.01或0.001这些事先决定好的值。
    • 式子中的←表示用右边的值更新左边的值。
    import numpy as np
    
    #SGD Stochastic Gradient Descent
    class SGD:
        def __init__(self, lr=0.01):
            self.lr = lr
    
        def update(self, params, grads):
            for key in params.keys():
                params[key] -= self.lr * grads[key]
    
    

    这里,进行初始化时的参数lr表示learning rate(学习率)。这个学习率会保存为实例变量。此外,代码段中还定义了update(params, grads)方法,这个方法在SGD中会被反复调用。参数params和grads(与之前的神经网络的实现一样)是字典型变量,按params['W1']、grads['W1']的形式,分别保存了权重参数和它们的梯度。
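
    As a quick check of the `update(params, grads)` contract described above, the snippet below (an illustrative sketch, not taken from the book) builds toy `params`/`grads` dictionaries and applies one SGD step:

    import numpy as np
    
    # Toy check of the optimizer interface: params and grads are dicts
    # keyed by 'W1', 'b1', ... exactly as in the networks above.
    params = {'W1': np.zeros((2, 3)), 'b1': np.zeros(3)}
    grads = {'W1': np.ones((2, 3)), 'b1': np.ones(3)}
    
    optimizer = SGD(lr=0.1)          # the SGD class defined above
    optimizer.update(params, grads)
    print(params['W1'])              # every element is now -0.1 (0 - 0.1 * 1)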

    6.1.3 SGD的缺点

    SGD的缺点是,如果函数的形状非均向(anisotropic),比如呈延伸状,搜索的路径就会非常低效。因此,我们需要比单纯朝梯度方向前进的SGD更聪明的方法。SGD低效的根本原因是,梯度的方向并没有指向最小值的方向。

    6.1.4 Momentum

    \[v \leftarrow \alpha v - \eta \frac{\partial L}{\partial W} \\ W \leftarrow W + v \]

    • v 对应物理上的速度
    class Momentum:
        def __init__(self, lr=0.01, momentum=0.9):
            self.lr = lr
            self.momentum = momentum
            self.v = None
            
        def update(self, params, grads):
            if self.v is None:
                self.v = {}
                for key, val in params.items():
                    self.v[key] = np.zeros_like(val)
                    
            for key in params.keys():
                self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
                params[key] += self.v[key]
    
    

    初始化时,v中什么都不保存,但当第一次调用update()时,v会以字典型变量的形式保存与参数结构相同的数据。

    6.1.5 AdaGrad

    为学习率衰减(learning rate decay)的方法,即随着学习的进行,使学习率逐渐减小。实际上,一开始“多”学,然后逐渐“少”学的方法,在神经网络的学习中经常被使用。

    AdaGrad会为参数的每个元素适当地调整学习率,与此同时进行学习(AdaGrad的Ada来自英文单词Adaptive,即“适当的”的意思)。下面,让我们用数学式表示AdaGrad的更新方法。

    \[h \leftarrow h + \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W} \\ W \leftarrow W - \eta \frac{1}{\sqrt{h}} \frac{\partial L}{\partial W} \]

    h,它保存了以前的所有梯度值的平方和,在更新参数时,通过乘以 \(\frac{1}{\sqrt h}\),就可以调整学习的尺度。这意味着,参数的元素中变动较大(被大幅更新)的元素的学习率将变小。也就是说,可以按参数的元素进行学习率衰减,使变动大的参数的学习率逐渐减小。

    AdaGrad会记录过去所有梯度的平方和。因此,学习越深入,更新的幅度就越小。实际上,如果无止境地学习,更新量就会变为 0,完全不再更新。为了改善这个问题,可以使用 RMSProp方法。RMSProp方法并不是将过去所有的梯度一视同仁地相加,而是逐渐地遗忘过去的梯度,在做加法运算时将新梯度的信息更多地反映出来。这种操作从专业上讲,称为“指数移动平均”,呈指数函数式地减小过去的梯度的尺度。

    class AdaGrad:
        def __init__(self, lr=0.01):
            self.lr = lr
            self.h = None
    
        def update(self, params, grads):
            if self.h is None:
                self.h = {}
                for key, val in params.items():
                    self.h[key] = np.zeros_like(val)
    
            for key in params.keys():
                self.h[key] += grads[key] * grads[key]
                params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
    
    

    最后一行加上了微小值1e-7。这是为了防止当self.h[key]中有0时,将0用作除数的情况。在很多深度学习的框架中,这个微小值也可以设定为参数,但这里我们用的是1e-7这个固定值。
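
    The RMSProp variant mentioned above is not implemented in these notes; the following is a minimal sketch under the same `update(params, grads)` interface. `decay_rate=0.99` is an assumed default, not a value taken from the book:

    import numpy as np
    
    class RMSProp:
        def __init__(self, lr=0.01, decay_rate=0.99):
            self.lr = lr
            self.decay_rate = decay_rate
            self.h = None
    
        def update(self, params, grads):
            if self.h is None:
                self.h = {key: np.zeros_like(val) for key, val in params.items()}
    
            for key in params.keys():
                # exponential moving average: gradually forget old squared gradients
                self.h[key] *= self.decay_rate
                self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]
                params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)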

    6.1.6 Adam

    Adam会设置 3个超参数。一个是学习率(论文中以α出现),另外两个是一次momentum系数β1和二次momentum系数β2。根据论文,标准的设定值是β1为 0.9,β2 为 0.999。设置了这些值后,大多数情况下都能顺利运行。

    class Adam:
    
        """Adam (http://arxiv.org/abs/1412.6980v8)"""
    
        def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
            self.lr = lr
            self.beta1 = beta1
            self.beta2 = beta2
            self.iter = 0
            self.m = None
            self.v = None
            
        def update(self, params, grads):
            if self.m is None:
                self.m, self.v = {}, {}
                for key, val in params.items():
                    self.m[key] = np.zeros_like(val)
                    self.v[key] = np.zeros_like(val)
            
            self.iter += 1
            lr_t  = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)         
            
            for key in params.keys():
                #self.m[key] = self.beta1*self.m[key] + (1-self.beta1)*grads[key]
                #self.v[key] = self.beta2*self.v[key] + (1-self.beta2)*(grads[key]**2)
                self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
                self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])
                
                params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
                
                #unbias_m += (1 - self.beta1) * (grads[key] - self.m[key]) # correct bias
                #unbisa_b += (1 - self.beta2) * (grads[key]*grads[key] - self.v[key]) # correct bias
                #params[key] += self.lr * unbias_m / (np.sqrt(unbisa_b) + 1e-7)
    
    

    6.1.7 使用哪种更新方法

    import sys, os
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import numpy as np
    import matplotlib.pyplot as plt
    from collections import OrderedDict
    from common.optimizer import *
    
    
    def f(x, y):
        return x**2 / 20.0 + y**2
    
    
    def df(x, y):
        return x / 10.0, 2.0*y
    
    init_pos = (-7.0, 2.0)
    params = {}
    params['x'], params['y'] = init_pos[0], init_pos[1]
    grads = {}
    grads['x'], grads['y'] = 0, 0
    
    
    optimizers = OrderedDict()
    optimizers["SGD"] = SGD(lr=0.95)
    optimizers["Momentum"] = Momentum(lr=0.1)
    optimizers["AdaGrad"] = AdaGrad(lr=1.5)
    optimizers["Adam"] = Adam(lr=0.3)
    
    idx = 1
    
    for key in optimizers:
        optimizer = optimizers[key]
        x_history = []
        y_history = []
        params['x'], params['y'] = init_pos[0], init_pos[1]
        
        for i in range(30):
            x_history.append(params['x'])
            y_history.append(params['y'])
            
            grads['x'], grads['y'] = df(params['x'], params['y'])
            optimizer.update(params, grads)
        
    
        x = np.arange(-10, 10, 0.01)
        y = np.arange(-5, 5, 0.01)
        
        X, Y = np.meshgrid(x, y) 
        Z = f(X, Y)
        
        # for simple contour line  
        mask = Z > 7
        Z[mask] = 0
        
        # plot 
        plt.subplot(2, 2, idx)
        idx += 1
        plt.plot(x_history, y_history, 'o-', color="red")
        plt.contour(X, Y, Z)
        plt.ylim(-10, 10)
        plt.xlim(-10, 10)
        plt.plot(0, 0, '+')
        #colorbar()
        #spring()
        plt.title(key)
        plt.xlabel("x")
        plt.ylabel("y")
        
    plt.show()
    

    6.1

    6.1.8 基于MNIST数据集的更新方法的比较

    # coding: utf-8
    import os
    import sys
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import numpy as np
    import matplotlib.pyplot as plt
    from dataset.mnist import load_mnist
    from common.util import smooth_curve
    from common.multi_layer_net import MultiLayerNet
    from common.optimizer import *
    
    
    # 0:读入MNIST数据==========
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)
    
    train_size = x_train.shape[0]
    batch_size = 128
    max_iterations = 2000
    
    
    # 1:进行实验的设置==========
    optimizers = {}
    optimizers['SGD'] = SGD()
    optimizers['Momentum'] = Momentum()
    optimizers['AdaGrad'] = AdaGrad()
    optimizers['Adam'] = Adam()
    #optimizers['RMSprop'] = RMSprop()
    
    networks = {}
    train_loss = {}
    for key in optimizers.keys():
        networks[key] = MultiLayerNet(
            input_size=784, hidden_size_list=[100, 100, 100, 100],
            output_size=10)
        train_loss[key] = []    
    
    
    # 2:开始训练==========
    for i in range(max_iterations):
        batch_mask = np.random.choice(train_size, batch_size)
        x_batch = x_train[batch_mask]
        t_batch = t_train[batch_mask]
        
        for key in optimizers.keys():
            grads = networks[key].gradient(x_batch, t_batch)
            optimizers[key].update(networks[key].params, grads)
        
            loss = networks[key].loss(x_batch, t_batch)
            train_loss[key].append(loss)
        
        if i % 100 == 0:
            print( "===========" + "iteration:" + str(i) + "===========")
            for key in optimizers.keys():
                loss = networks[key].loss(x_batch, t_batch)
                print(key + ":" + str(loss))
    
    
    # 3.绘制图形==========
    markers = {"SGD": "o", "Momentum": "x", "AdaGrad": "s", "Adam": "D"}
    x = np.arange(max_iterations)
    for key in optimizers.keys():
        plt.plot(x, smooth_curve(train_loss[key]), marker=markers[key], markevery=100, label=key)
    plt.xlabel("iterations")
    plt.ylabel("loss")
    plt.ylim(0, 1)
    plt.legend()
    plt.show()
    
    # output
    """
    ===========iteration:0===========
    SGD:2.329331997760732
    Momentum:2.239740872105693
    AdaGrad:2.0115770564653603
    Adam:2.1785930354166183
    ===========iteration:100===========
    SGD:1.5920489715423023
    Momentum:0.40445967201487354
    AdaGrad:0.186087116303349
    Adam:0.3202725675759585
    ===========iteration:200===========
    SGD:0.8408885389025664
    Momentum:0.24873588375542072
    AdaGrad:0.10373581273435521
    Adam:0.16653281247982823
    ===========iteration:300===========
    SGD:0.5596081864950619
    Momentum:0.18414831005843796
    AdaGrad:0.09125911092710051
    Adam:0.11371677999728938
    ===========iteration:400===========
    SGD:0.46933835282849873
    Momentum:0.24217775785072454
    AdaGrad:0.10003178887653093
    Adam:0.1661620287373278
    ===========iteration:500===========
    SGD:0.29609331242089415
    Momentum:0.07445559661712355
    AdaGrad:0.03267636502916081
    Adam:0.06502710962026288
    ===========iteration:600===========
    SGD:0.37601822062465085
    Momentum:0.15003670583089365
    AdaGrad:0.04684578673211999
    Adam:0.07692807935356297
    ===========iteration:700===========
    SGD:0.3894117630292092
    Momentum:0.15315018744891754
    AdaGrad:0.08145681956534924
    Adam:0.1299980929871106
    ===========iteration:800===========
    SGD:0.31380572940653384
    Momentum:0.10224738540798715
    AdaGrad:0.043447184552153864
    Adam:0.07816942336275769
    ===========iteration:900===========
    SGD:0.3287891481693203
    Momentum:0.08263662072344545
    AdaGrad:0.06528379930283314
    Adam:0.04980330381233172
    ===========iteration:1000===========
    SGD:0.28671448636999114
    Momentum:0.142584091147013
    AdaGrad:0.05315507947076199
    Adam:0.04629697111431068
    ===========iteration:1100===========
    SGD:0.2437537644227087
    Momentum:0.07411436829359078
    AdaGrad:0.02345543572879498
    Adam:0.019983858202636526
    ===========iteration:1200===========
    SGD:0.2599874646085678
    Momentum:0.08101600943800173
    AdaGrad:0.0328496960177962
    Adam:0.04400568097017252
    ===========iteration:1300===========
    SGD:0.2732296779478336
    Momentum:0.12174912404866345
    AdaGrad:0.04576321258085212
    Adam:0.05222935659937171
    ===========iteration:1400===========
    SGD:0.2094405834812796
    Momentum:0.05774905191908389
    AdaGrad:0.022105036105863483
    Adam:0.03810723404437133
    ===========iteration:1500===========
    SGD:0.28297437694515026
    Momentum:0.09332288290561128
    AdaGrad:0.036781364998291405
    Adam:0.052597671291856135
    ===========iteration:1600===========
    SGD:0.1538592664823519
    Momentum:0.022802756823788968
    AdaGrad:0.01358420944437242
    Adam:0.012526788305912279
    ===========iteration:1700===========
    SGD:0.18623466123617463
    Momentum:0.07009903942845747
    AdaGrad:0.030205169973035427
    Adam:0.06250165917942531
    ===========iteration:1800===========
    SGD:0.10647058591246433
    Momentum:0.06438272550756108
    AdaGrad:0.013762186668756747
    Adam:0.017969324260635036
    ===========iteration:1900===========
    SGD:0.3658588317785016
    Momentum:0.11484925108134243
    AdaGrad:0.0435907026264711
    Adam:0.08143912741187331
    """
    

    6.2

    基于MNIST数据集的4种更新方法的比较:横轴表示学习的迭代次数(iteration),纵轴表示损失函数的值(loss)

    这个实验以一个5层神经网络为对象,其中每层有100个神经元。激活函数使用的是ReLU。

    与SGD相比,其他3种方法学习得更快,而且速度基本相同,仔细看的话,AdaGrad的学习进行得稍微快一点。这个实验需要注意的地方是,实验结果会随学习率等超参数、神经网络的结构(几层深等)的不同而发生变化。不过,一般而言,与SGD相比,其他3种方法可以学习得更快,有时最终的识别精度也更高

    6.2 权重的初始值

    权值衰减就是一种以减小权重参数的值为目的进行学习的方法。通过减小权重参数的值来抑制过拟合的发生。

    为了防止“权重均一化”(严格地讲,是为了瓦解权重的对称结构),必须随机生成初始值。
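
    A toy check of why symmetry must be broken (my own illustration, not from the book): two hidden units that start with identical weight columns receive identical gradients, so they can never become different.

    import numpy as np
    
    x = np.random.randn(5, 3)           # 5 samples, 3 input features
    W = np.full((3, 2), 0.5)            # both hidden units share the same weights
    h = 1 / (1 + np.exp(-x.dot(W)))     # sigmoid activations: the two columns are identical
    dout = np.ones_like(h)              # some upstream gradient
    dW = x.T.dot(dout * h * (1 - h))    # gradient w.r.t. W
    print(dW[:, 0] - dW[:, 1])          # all zeros: both units get the same update forever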

    6.2.2 隐藏层的激活值的分布

    import numpy as np
    import matplotlib.pyplot as plt
    
    
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    
    
    input_data = np.random.randn(1000, 100)  # 1000个数据
    node_num = 100  # 各隐藏层的节点(神经元)数
    hidden_layer_size = 5  # 隐藏层有5层
    activations = {}  # 激活值的结果保存在这里
    
    x = input_data
    
    for i in range(hidden_layer_size):
       if i != 0:
           x = activations[i - 1]
    
       # 改变初始值进行实验!
       w = np.random.randn(node_num, node_num) * 1
       # w = np.random.randn(node_num, node_num) * 0.01
       # w = np.random.randn(node_num, node_num) * np.sqrt(1.0 / node_num)
       # w = np.random.randn(node_num, node_num) * np.sqrt(2.0 / node_num)
    
       a = np.dot(x, w)
    
       # 将激活函数的种类也改变,来进行实验!
       z = sigmoid(a)
       # z = ReLU(a)
       # z = tanh(a)
    
       activations[i] = z
    
    # 绘制直方图
    for i, a in activations.items():
       plt.subplot(1, len(activations), i + 1)
       plt.title(str(i + 1) + "-layer")
       if i != 0: plt.yticks([], [])
       # plt.xlim(0.1, 1)
       # plt.ylim(0, 7000)
       plt.hist(a.flatten(), 30, range=(0, 1))
    plt.show()
    

    6.3

    这里使用的sigmoid函数是S型函数,随着输出不断地靠近0(或者靠近1),它的导数的值逐渐接近0。因此,偏向0和1的数据分布会造成反向传播中梯度的值不断变小,最后消失。这个问题称为梯度消失(gradient vanishing)。层次加深的深度学习中,梯度消失的问题可能会更加严重。
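
    The vanishing effect can be read directly from the sigmoid derivative (a standard identity, restated here for reference):

    \[h'(x) = h(x)\,(1 - h(x)) \le \frac{1}{4} \]

    When the output saturates near 0 or 1 this factor approaches 0, and backpropagation multiplies one such factor per layer, so the gradient shrinks geometrically with depth.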

    下面,将权重的标准差设为0.01,进行相同的实验。

    6.4

    这次呈集中在0.5附近的分布。因为不像刚才的例子那样偏向0和1,所以不会发生梯度消失的问题。但是,激活值的分布有所偏向,说明在表现力上会有很大问题。为什么这么说呢?因为如果有多个神经元都输出几乎相同的值,那它们就没有存在的意义了。比如,如果100个神经元都输出几乎相同的值,那么也可以由1个神经元来表达基本相同的事情。因此,激活值在分布上有所偏向会出现“表现力受限”的问题。

    各层的激活值的分布都要求有适当的广度。为什么呢?因为通过在各层间传递多样性的数据,神经网络可以进行高效的学习。反过来,如果传递的是有所偏向的数据,就会出现梯度消失或者“表现力受限”的问题,导致学习可能无法顺利进行。

    接着,我们尝试使用Xavier Glorot等人的论文[9]中推荐的权重初始值(俗称“Xavier初始值”)。现在,在一般的深度学习框架中,Xavier初始值已被作为标准使用。比如,Caffe框架中,通过在设定权重初始值时赋予xavier参数,就可以使用Xavier初始值。

    Xavier的论文中,为了使各层的激活值呈现出具有相同广度的分布,推导了合适的权重尺度。推导出的结论是,如果前一层的节点数为n,则初始值使用标准差为$ \frac{1}{\sqrt n} $的分布.

    6.5

    使用Xavier初始值进行实验。进行实验的代码只需要将设定权重初始值的地方换成如下内容即可(因为此处所有层的节点数都是100,所以简化了实现)。

    node_num = 100 # 前一层的节点数
    
    w = np.random.randn(node_num, node_num) / np.sqrt(node_num)
    

    使用Xavier初始值后的结果如图6-13所示。从这个结果可知,越是后面的层,图像变得越歪斜,但是呈现了比之前更有广度的分布。因为各层间传递的数据有适当的广度,所以sigmoid函数的表现力不受限制,有望进行高效的学习。

    6.6

    后面的层的分布呈稍微歪斜的形状。如果用tanh函数(双曲线函数)代替sigmoid函数,这个稍微歪斜的问题就能得到改善。实际上,使用tanh函数后,会呈漂亮的吊钟型分布。tanh函数和sigmoid函数同是 S型曲线函数,但tanh函数是关于原点(0, 0)对称的 S型曲线,而sigmoid函数是关于(x, y)=(0, 0.5)对称的S型曲线。众所周知,用作激活函数的函数最好具有关于原点对称的性质。

    6.7

    6.2.3 ReLU的权重初始值
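
    This subsection is otherwise empty in these notes; the key point is the "He initial value" for ReLU, a standard deviation of sqrt(2/n) when the previous layer has n nodes (the same scale that appears commented out in the histogram experiment above and in DeepConvNet later):

    import numpy as np
    
    node_num = 100   # number of nodes in the previous layer
    # He initial value: recommended scale when the activation function is ReLU
    w = np.random.randn(node_num, node_num) * np.sqrt(2.0 / node_num)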

    6.2.4 基于MNIST数据集的权重初始值的比较

    基于std = 0.01、Xavier初始值、He初始值进行实验

    import os
    import sys
    
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import numpy as np
    import matplotlib.pyplot as plt
    from dataset.mnist import load_mnist
    from common.util import smooth_curve
    from common.multi_layer_net import MultiLayerNet
    from common.optimizer import SGD
    
    
    # 0:读入MNIST数据==========
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)
    
    train_size = x_train.shape[0]
    batch_size = 128
    max_iterations = 2000
    
    
    # 1:进行实验的设置==========
    weight_init_types = {'std=0.01': 0.01, 'Xavier': 'sigmoid', 'He': 'relu'}
    optimizer = SGD(lr=0.01)
    
    networks = {}
    train_loss = {}
    for key, weight_type in weight_init_types.items():
        networks[key] = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100],
                                      output_size=10, weight_init_std=weight_type)
        train_loss[key] = []
    
    
    # 2:开始训练==========
    for i in range(max_iterations):
        batch_mask = np.random.choice(train_size, batch_size)
        x_batch = x_train[batch_mask]
        t_batch = t_train[batch_mask]
        
        for key in weight_init_types.keys():
            grads = networks[key].gradient(x_batch, t_batch)
            optimizer.update(networks[key].params, grads)
        
            loss = networks[key].loss(x_batch, t_batch)
            train_loss[key].append(loss)
        
        if i % 100 == 0:
            print("===========" + "iteration:" + str(i) + "===========")
            for key in weight_init_types.keys():
                loss = networks[key].loss(x_batch, t_batch)
                print(key + ":" + str(loss))
    
    
    # 3.绘制图形==========
    markers = {'std=0.01': 'o', 'Xavier': 's', 'He': 'D'}
    x = np.arange(max_iterations)
    for key in weight_init_types.keys():
        plt.plot(x, smooth_curve(train_loss[key]), marker=markers[key], markevery=100, label=key)
    plt.xlabel("iterations")
    plt.ylabel("loss")
    plt.ylim(0, 2.5)
    plt.legend()
    plt.show()
    
    # output
    """
    ===========iteration:0===========
    std=0.01:2.302436391025383
    Xavier:2.300084080358946
    He:2.386787120004988
    ===========iteration:100===========
    std=0.01:2.3022394969462425
    Xavier:2.253234279514994
    He:1.7575500618138786
    ===========iteration:200===========
    std=0.01:2.302069951736108
    Xavier:2.142814733558863
    He:0.9790482950812666
    ===========iteration:300===========
    std=0.01:2.3013049716549725
    Xavier:1.896462626208228
    He:0.8083410617816422
    ===========iteration:400===========
    std=0.01:2.302008006626096
    Xavier:1.3253100980594796
    He:0.4590751030353189
    ===========iteration:500===========
    std=0.01:2.3014269030890357
    Xavier:0.8068592230383133
    He:0.3180060687285214
    ===========iteration:600===========
    std=0.01:2.301968772619441
    Xavier:0.5794102719507597
    He:0.3347743637851007
    ===========iteration:700===========
    std=0.01:2.3021897879900326
    Xavier:0.5588898890252033
    He:0.372040423638795
    ===========iteration:800===========
    std=0.01:2.301080316597943
    Xavier:0.5479176307754225
    He:0.39650108813554363
    ===========iteration:900===========
    std=0.01:2.300080021476741
    Xavier:0.4341211761837327
    He:0.278681699178202
    ===========iteration:1000===========
    std=0.01:2.304514741525719
    Xavier:0.3707061790933307
    He:0.2648089489312217
    ===========iteration:1100===========
    std=0.01:2.3004539305945375
    Xavier:0.4915418374044957
    He:0.3356737712359732
    ===========iteration:1200===========
    std=0.01:2.3080665943377427
    Xavier:0.3969163571496253
    He:0.2935253899321855
    ===========iteration:1300===========
    std=0.01:2.3012077760792424
    Xavier:0.3407360593082971
    He:0.2942012842021573
    ===========iteration:1400===========
    std=0.01:2.298276820540031
    Xavier:0.359006013729691
    He:0.278139425213775
    ===========iteration:1500===========
    std=0.01:2.310406173713167
    Xavier:0.4101741561391103
    He:0.30956212151105017
    ===========iteration:1600===========
    std=0.01:2.3047867409446177
    Xavier:0.24739212103654792
    He:0.16658095765465294
    ===========iteration:1700===========
    std=0.01:2.2943479684706896
    Xavier:0.2953734058124592
    He:0.18372966666756582
    ===========iteration:1800===========
    std=0.01:2.296332673636079
    Xavier:0.38176579873634764
    He:0.36160779362439865
    ===========iteration:1900===========
    std=0.01:2.2950138265680415
    Xavier:0.2812381548905389
    He:0.21066174459718814
    """
    

    6.8

    神经网络有5层,每层有100个神经元,激活函数使用的是ReLU。从图6-15的结果可知,std = 0.01时完全无法进行学习。这和刚才观察到的激活值的分布一样,是因为正向传播中传递的值很小(集中在0附近的数据)。因此,逆向传播时求到的梯度也很小,权重几乎不进行更新。相反,当权重初始值为Xavier初始值和He初始值时,学习进行得很顺利。并且,我们发现He初始值时的学习进度更快一些。

    6.3 Batch Normalization

    6.3.1 Batch Normalization的算法

    优点:

    • 可以使学习快速进行(可以增大学习率)。
    • 不那么依赖初始值(对于初始值不用那么神经质)。
    • 抑制过拟合(降低Dropout等的必要性)。

    6.9

    6.10
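
    As a concrete reference for the algorithm in 6.3.1, here is a minimal sketch of the Batch Normalization forward transform at training time (the running averages used at test time and the backward pass of the full layer are omitted); `gamma` and `beta` are the learnable scale and shift parameters:

    import numpy as np
    
    def batchnorm_forward(x, gamma, beta, eps=1e-7):
        # x: (batch_size, features). Normalize each feature over the mini-batch,
        # then scale by gamma and shift by beta.
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta
    
    x = np.random.randn(100, 5) * 3 + 10                         # badly scaled input
    out = batchnorm_forward(x, np.ones(5), np.zeros(5))
    print(out.mean(axis=0).round(6), out.std(axis=0).round(6))   # roughly 0 and 1 per feature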

    6.3.2 Batch Normalization的评估

    import sys, os
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import numpy as np
    import matplotlib.pyplot as plt
    from dataset.mnist import load_mnist
    from common.multi_layer_net_extend import MultiLayerNetExtend
    from common.optimizer import SGD, Adam
    
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)
    
    # 减少学习数据
    x_train = x_train[:1000]
    t_train = t_train[:1000]
    
    max_epochs = 20
    train_size = x_train.shape[0]
    batch_size = 100
    learning_rate = 0.01
    
    
    def __train(weight_init_std):
        bn_network = MultiLayerNetExtend(input_size=784, hidden_size_list=[100, 100, 100, 100, 100], output_size=10, 
                                        weight_init_std=weight_init_std, use_batchnorm=True)
        network = MultiLayerNetExtend(input_size=784, hidden_size_list=[100, 100, 100, 100, 100], output_size=10,
                                    weight_init_std=weight_init_std)
        optimizer = SGD(lr=learning_rate)
        
        train_acc_list = []
        bn_train_acc_list = []
        
        iter_per_epoch = max(train_size / batch_size, 1)
        epoch_cnt = 0
        
        for i in range(1000000000):
            batch_mask = np.random.choice(train_size, batch_size)
            x_batch = x_train[batch_mask]
            t_batch = t_train[batch_mask]
        
            for _network in (bn_network, network):
                grads = _network.gradient(x_batch, t_batch)
                optimizer.update(_network.params, grads)
        
            if i % iter_per_epoch == 0:
                train_acc = network.accuracy(x_train, t_train)
                bn_train_acc = bn_network.accuracy(x_train, t_train)
                train_acc_list.append(train_acc)
                bn_train_acc_list.append(bn_train_acc)
        
                print("epoch:" + str(epoch_cnt) + " | " + str(train_acc) + " - " + str(bn_train_acc))
        
                epoch_cnt += 1
                if epoch_cnt >= max_epochs:
                    break
                    
        return train_acc_list, bn_train_acc_list
    
    
    # 3.绘制图形==========
    weight_scale_list = np.logspace(0, -4, num=16)
    x = np.arange(max_epochs)
    
    for i, w in enumerate(weight_scale_list):
        print( "============== " + str(i+1) + "/16" + " ==============")
        train_acc_list, bn_train_acc_list = __train(w)
        
        plt.subplot(4,4,i+1)
        plt.title("W:" + str(w))
        if i == 15:
            plt.plot(x, bn_train_acc_list, label='Batch Normalization', markevery=2)
            plt.plot(x, train_acc_list, linestyle = "--", label='Normal(without BatchNorm)', markevery=2)
        else:
            plt.plot(x, bn_train_acc_list, markevery=2)
            plt.plot(x, train_acc_list, linestyle="--", markevery=2)
    
        plt.ylim(0, 1.0)
        if i % 4:
            plt.yticks([])
        else:
            plt.ylabel("accuracy")
        if i < 12:
            plt.xticks([])
        else:
            plt.xlabel("epochs")
        plt.legend(loc='lower right')
        
    plt.show()
    

    6.11

    6.4 正则化

    机器学习的问题中,过拟合是一个很常见的问题。过拟合指的是只能拟合训练数据,但不能很好地拟合不包含在训练数据中的其他数据的状态。机器学习的目标是提高泛化能力,即便是没有包含在训练数据里的未观测数据,也希望模型可以进行正确的识别。

    6.4.1 过拟合

    主要原因:

    • 模型拥有大量参数、表现力强。
    • 训练数据少。
    import os
    import sys
    
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import numpy as np
    import matplotlib.pyplot as plt
    from dataset.mnist import load_mnist
    from common.multi_layer_net import MultiLayerNet
    from common.optimizer import SGD
    
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)
    
    # 为了再现过拟合,减少学习数据
    x_train = x_train[:300]
    t_train = t_train[:300]
    
    # weight decay(权值衰减)的设定 =======================
    #weight_decay_lambda = 0 # 不使用权值衰减的情况
    weight_decay_lambda = 0.1
    # ====================================================
    
    network = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100], output_size=10,
                            weight_decay_lambda=weight_decay_lambda)
    optimizer = SGD(lr=0.01)
    
    max_epochs = 201
    train_size = x_train.shape[0]
    batch_size = 100
    
    train_loss_list = []
    train_acc_list = []
    test_acc_list = []
    
    iter_per_epoch = max(train_size / batch_size, 1)
    epoch_cnt = 0
    
    for i in range(1000000000):
        batch_mask = np.random.choice(train_size, batch_size)
        x_batch = x_train[batch_mask]
        t_batch = t_train[batch_mask]
    
        grads = network.gradient(x_batch, t_batch)
        optimizer.update(network.params, grads)
    
        if i % iter_per_epoch == 0:
            train_acc = network.accuracy(x_train, t_train)
            test_acc = network.accuracy(x_test, t_test)
            train_acc_list.append(train_acc)
            test_acc_list.append(test_acc)
    
            print("epoch:" + str(epoch_cnt) + ", train acc:" + str(train_acc) + ", test acc:" + str(test_acc))
    
            epoch_cnt += 1
            if epoch_cnt >= max_epochs:
                break
    
    
    # 3.绘制图形==========
    markers = {'train': 'o', 'test': 's'}
    x = np.arange(max_epochs)
    plt.plot(x, train_acc_list, marker='o', label='train', markevery=10)
    plt.plot(x, test_acc_list, marker='s', label='test', markevery=10)
    plt.xlabel("epochs")
    plt.ylabel("accuracy")
    plt.ylim(0, 1.0)
    plt.legend(loc='lower right')
    plt.show()
    

    6.12

    6.4.2 权值衰减

    权值衰减是一直以来经常被使用的一种抑制过拟合的方法。该方法通过在学习的过程中对大的权重进行惩罚,来抑制过拟合。很多过拟合原本就是因为权重参数取值过大才发生的。
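
    Concretely, for L2 weight decay (the penalty that the `weight_decay_lambda` argument in the experiment above is meant to control), the loss and each weight gradient become:

    \[L' = L + \frac{\lambda}{2}\sum_{k}\|W_k\|^2 ,\qquad \frac{\partial L'}{\partial W_k} = \frac{\partial L}{\partial W_k} + \lambda W_k \]

    so every weight is pulled toward zero by an extra λW term at each update, which is exactly the "penalize large weights" behaviour described above.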

    6.13

    6.4.3 Dropout

    Dropout是一种在学习的过程中随机删除神经元的方法。训练时,随机选出隐藏层的神经元,然后将其删除。被删除的神经元不再进行信号的传递,如图6-22所示。训练时,每传递一次数据,就会随机选择要删除的神经元。然后,测试时,虽然会传递所有的神经元信号,但是对于各个神经元的输出,要乘上训练时的删除比例后再输出。

    6.14

    class Dropout:
      
        def __init__(self, dropout_ratio=0.5):
            self.dropout_ratio = dropout_ratio
            self.mask = None
    
        def forward(self, x, train_flg=True):
            if train_flg:
                self.mask = np.random.rand(*x.shape) > self.dropout_ratio
                return x * self.mask
            else:
                return x * (1.0 - self.dropout_ratio)
    
        def backward(self, dout):
            return dout * self.mask
    
    

    这里的要点是,每次正向传播时,self.mask中都会以False的形式保存要删除的神经元。self.mask会随机生成和x形状相同的数组,并将值比dropout_ratio大的元素设为True。反向传播时的行为和ReLU相同。也就是说,正向传播时传递了信号的神经元,反向传播时按原样传递信号;正向传播时没有传递信号的神经元,反向传播时信号将停在那里。

    import os
    import sys
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import numpy as np
    import matplotlib.pyplot as plt
    from dataset.mnist import load_mnist
    from common.multi_layer_net_extend import MultiLayerNetExtend
    from common.trainer import Trainer
    
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)
    
    # 为了再现过拟合,减少学习数据
    x_train = x_train[:300]
    t_train = t_train[:300]
    
    # 设定是否使用Dropuout,以及比例 ========================
    use_dropout = True  # 不使用Dropout的情况下为False
    dropout_ratio = 0.2
    # ====================================================
    
    network = MultiLayerNetExtend(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100],
                                  output_size=10, use_dropout=use_dropout, dropout_ration=dropout_ratio)
    trainer = Trainer(network, x_train, t_train, x_test, t_test,
                      epochs=301, mini_batch_size=100,
                      optimizer='sgd', optimizer_param={'lr': 0.01}, verbose=True)
    trainer.train()
    
    train_acc_list, test_acc_list = trainer.train_acc_list, trainer.test_acc_list
    
    # 绘制图形==========
    markers = {'train': 'o', 'test': 's'}
    x = np.arange(len(train_acc_list))
    plt.plot(x, train_acc_list, marker='o', label='train', markevery=10)
    plt.plot(x, test_acc_list, marker='s', label='test', markevery=10)
    plt.xlabel("epochs")
    plt.ylabel("accuracy")
    plt.ylim(0, 1.0)
    plt.legend(loc='lower right')
    plt.show()
    

    6.15

    6.5 超参数的验证

    6.5.1 验证数据

    不能使用测试数据评估超参数的性能。这一点非常重要,但也容易被忽视。

    如果使用测试数据调整超参数,超参数的值会对测试数据发生过拟合。换句话说,用测试数据确认超参数的值的“好坏”,就会导致超参数的值被调整为只拟合测试数据。这样的话,可能就会得到不能拟合其他数据、泛化能力低的模型。

    调整超参数时,必须使用超参数专用的确认数据。用于调整超参数的数据,一般称为验证数据(validation data)。我们使用这个验证数据来评估超参数的好坏。

    6.5.2 超参数的最优化

    进行超参数的最优化时,逐渐缩小超参数的“好值”的存在范围非常重要。所谓逐渐缩小范围,是指:

    • 一开始先大致设定一个范围,从这个范围中随机选出一个超参数(采样),用这个采样到的值进行识别精度的评估

    • 然后,多次重复该操作,观察识别精度的结果,根据这个结果缩小超参数的“好值”的范围。

    • 通过重复这一操作,就可以逐渐确定超参数的合适范围。

    • 步骤0 设定超参数的范围。

    • 步骤1 从设定的超参数范围中随机采样。

    • 步骤2 使用步骤1中采样到的超参数的值进行学习,通过验证数据评估识别精度(但是要将epoch设置得很小)。

    • 步骤3 重复步骤1和步骤2(100次等),根据它们的识别精度的结果,缩小超参数的范围。

    6.5.3 超参数最优化的实现

    import sys, os
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import numpy as np
    import matplotlib.pyplot as plt
    from dataset.mnist import load_mnist
    from common.multi_layer_net import MultiLayerNet
    from common.util import shuffle_dataset
    from common.trainer import Trainer
    
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)
    
    # 为了实现高速化,减少训练数据
    x_train = x_train[:500]
    t_train = t_train[:500]
    
    # 分割验证数据
    validation_rate = 0.20
    validation_num = int(x_train.shape[0] * validation_rate)
    x_train, t_train = shuffle_dataset(x_train, t_train)
    x_val = x_train[:validation_num]
    t_val = t_train[:validation_num]
    x_train = x_train[validation_num:]
    t_train = t_train[validation_num:]
    
    
    def __train(lr, weight_decay, epocs=50):
        network = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100],
                                output_size=10, weight_decay_lambda=weight_decay)
        trainer = Trainer(network, x_train, t_train, x_val, t_val,
                          epochs=epocs, mini_batch_size=100,
                          optimizer='sgd', optimizer_param={'lr': lr}, verbose=False)
        trainer.train()
    
        return trainer.test_acc_list, trainer.train_acc_list
    
    
    # 超参数的随机搜索======================================
    optimization_trial = 100
    results_val = {}
    results_train = {}
    for _ in range(optimization_trial):
        # 指定搜索的超参数的范围===============
        weight_decay = 10 ** np.random.uniform(-8, -4)
        lr = 10 ** np.random.uniform(-6, -2)
        # ================================================
    
        val_acc_list, train_acc_list = __train(lr, weight_decay)
        print("val acc:" + str(val_acc_list[-1]) + " | lr:" + str(lr) + ", weight decay:" + str(weight_decay))
        key = "lr:" + str(lr) + ", weight decay:" + str(weight_decay)
        results_val[key] = val_acc_list
        results_train[key] = train_acc_list
    
    # 绘制图形========================================================
    print("=========== Hyper-Parameter Optimization Result ===========")
    graph_draw_num = 20
    col_num = 5
    row_num = int(np.ceil(graph_draw_num / col_num))
    i = 0
    
    for key, val_acc_list in sorted(results_val.items(), key=lambda x:x[1][-1], reverse=True):
        print("Best-" + str(i+1) + "(val acc:" + str(val_acc_list[-1]) + ") | " + key)
    
        plt.subplot(row_num, col_num, i+1)
        plt.title("Best-" + str(i+1))
        plt.ylim(0.0, 1.0)
        if i % 5: plt.yticks([])
        plt.xticks([])
        x = np.arange(len(val_acc_list))
        plt.plot(x, val_acc_list)
        plt.plot(x, results_train[key], "--")
        i += 1
    
        if i >= graph_draw_num:
            break
    
    plt.show()
    # best 20 output
    """
    =========== Hyper-Parameter Optimization Result ===========
    Best-1(val acc:0.75) | lr:0.008200070286392537, weight decay:4.5482072610407504e-08
    Best-2(val acc:0.74) | lr:0.009677914844216824, weight decay:2.704927973779481e-05
    Best-3(val acc:0.73) | lr:0.006006587357128985, weight decay:2.7218451758199435e-05
    Best-4(val acc:0.73) | lr:0.00701304725392402, weight decay:8.733070897433102e-05
    Best-5(val acc:0.72) | lr:0.006546067260416884, weight decay:1.5425740472014986e-07
    Best-6(val acc:0.71) | lr:0.006562359239889968, weight decay:1.000627934904485e-08
    Best-7(val acc:0.71) | lr:0.006094759400051071, weight decay:1.883591991527881e-08
    Best-8(val acc:0.68) | lr:0.004589425471331721, weight decay:1.0236764580568442e-06
    Best-9(val acc:0.61) | lr:0.003944409201003058, weight decay:2.2466208552094944e-05
    Best-10(val acc:0.59) | lr:0.005025854453052883, weight decay:4.097107208014831e-07
    Best-11(val acc:0.56) | lr:0.0035697749335822765, weight decay:2.709040333229583e-08
    Best-12(val acc:0.43) | lr:0.001520050057799809, weight decay:1.9555450068116936e-06
    Best-13(val acc:0.42) | lr:0.002880244852484325, weight decay:2.62770521736281e-05
    Best-14(val acc:0.36) | lr:0.0015114014270343886, weight decay:3.284562455276901e-08
    Best-15(val acc:0.35) | lr:0.002550339616502866, weight decay:1.939261592796376e-06
    Best-16(val acc:0.32) | lr:0.0012609123427361356, weight decay:2.8721606736079626e-08
    Best-17(val acc:0.3) | lr:0.001916403939414624, weight decay:6.365133816732824e-08
    Best-18(val acc:0.3) | lr:0.001152759968041137, weight decay:3.8127883002145895e-06
    Best-19(val acc:0.21) | lr:0.0006833707169231253, weight decay:2.5272013724886472e-08
    Best-20(val acc:0.21) | lr:0.0012061158487880427, weight decay:1.0842194820846423e-05
    """
    

    6.16

    第七章 卷积神经网络

    CNN, Convolutional Neural Network

    7.1 整体结构

    之前介绍的神经网络中,相邻层的所有神经元之间都有连接,这称为全连接(fully-connected)。另外,我们用Affine层实现了全连接层。如果使用这个Affine层,一个5层的全连接的神经网络就可以通过图7-1所示的网络结构来实现。

    如图7-1所示,全连接的神经网络中,Affine层后面跟着激活函数ReLU层(或者Sigmoid层)。这里堆叠了4层“Affine-ReLU”组合,然后第5层是Affine层,最后由Softmax层输出最终结果(概率)。

    7.1

    如图7-2所示,CNN中新增了Convolution层和Pooling层。CNN的层的连接顺序是“Convolution - ReLU -(Pooling)”(Pooling层有时会被省略)。这可以理解为之前的“Affine - ReLU”连接被替换成了“Convolution - ReLU -(Pooling)”连接。

    还需要注意的是,在图7-2的CNN中,靠近输出的层中使用了之前的“Affine - ReLU”组合。此外,最后的输出层中使用了之前的“Affine - Softmax”组合。这些都是一般的CNN中比较常见的结构。

    7.2 卷积层

    7.2.1 全连接层的问题

    数据的形状被“忽视”了。比如,输入数据是图像时,图像通常是高、长、通道方向上的3维形状。但是,向全连接层输入时,需要将3维数据拉平为1维数据。

    CNN 中,有时将卷积层的输入输出数据称为特征图(feature map)。其中,卷积层的输入数据称为输入特征图(input feature map),输出数据称为输出特征图(output feature map)。本书中将“输入输出数据”和“特征图”作为含义相同的词使用。

    7.2.2 卷积运算

    7.2

    7.3

    将各个位置上滤波器的元素和输入的对应元素相乘,然后再求和(有时将这个计算称为乘积累加运算)。然后,将这个结果保存到输出的对应位置。将这个过程在所有位置都进行一遍,就可以得到卷积运算的输出。
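
    To make the multiply-accumulate operation concrete, here is a direct (deliberately naive: single channel, stride 1, no padding) sketch; the efficient `im2col`-based version appears in 7.4. The sample numbers are purely illustrative:

    import numpy as np
    
    def conv2d_single(x, w):
        # x: (H, W) input, w: (FH, FW) filter; stride 1, no padding.
        H, W = x.shape
        FH, FW = w.shape
        out = np.zeros((H - FH + 1, W - FW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(x[i:i+FH, j:j+FW] * w)   # multiply-accumulate
        return out
    
    x = np.array([[1, 2, 3, 0],
                  [0, 1, 2, 3],
                  [3, 0, 1, 2],
                  [2, 3, 0, 1]], dtype=float)
    w = np.array([[2, 0, 1],
                  [0, 1, 2],
                  [1, 0, 2]], dtype=float)
    print(conv2d_single(x, w))   # [[15. 16.]
                                 #  [ 6. 15.]]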

    7.2.3 填充

    在进行卷积层的处理之前,有时要向输入数据的周围填入固定的数据(比如0等),这称为填充(padding),是卷积运算中经常会用到的处理。比如,在图7-6的例子中,对大小为(4, 4)的输入数据应用了幅度为1的填充。“幅度为1的填充”是指用幅度为1像素的0填充周围。

    7.4

    因为如果每次进行卷积运算都会缩小空间,那么在某个时刻输出大小就有可能变为 1,导致无法再应用卷积运算。为了避免出现这样的情况,就要使用填充。在刚才的例子中,将填充的幅度设为 1,那么相对于输入大小(4, 4),输出大小也保持为原来的(4, 4)。因此,卷积运算就可以在保持空间大小不变的情况下将数据传给下一层。

    7.2.4 步幅

    应用滤波器的位置间隔称为步幅(stride)。之前的例子中步幅都是1,如果将步幅设为2,则如图7-7所示,应用滤波器的窗口的间隔变为2个元素。
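
    Combining padding and stride, the output size follows the usual formula (the same one used by the `im2col` and `Convolution` code in 7.4), where the input is (H, W), the filter (FH, FW), the padding P and the stride S:

    \[OH = \frac{H + 2P - FH}{S} + 1 ,\qquad OW = \frac{W + 2P - FW}{S} + 1 \]

    The hyperparameters have to be chosen so that both expressions divide evenly.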

    7.5

    7.6

    7.2.5 三维数据的卷积运算

    7.7

    7.8

    在3维数据的卷积运算中,输入数据和滤波器的通道数要设为相同的值。在这个例子中,输入数据和滤波器的通道数一致,均为3。滤波器大小可以设定为任意值(不过,每个通道的滤波器大小要全部相同)。这个例子中滤波器大小为(3, 3),但也可以设定为(2, 2)、(1, 1)、(5, 5)等任意值。再强调一下,通道数只能设定为和输入数据的通道数相同的值

    7.2.6 结合方块思考

    把3维数据表示为多维数组时,书写顺序为(channel, height, width)。比如,通道数为C、高度为H、长度为W的数据的形状可以写成(C, H, W)。滤波器也一样,要按(channel, height, width)的顺序书写。比如,通道数为C、滤波器高度为FH(Filter Height)、长度为FW(Filter Width)时,可以写成(C, FH, FW)。

    7.9

    在这个例子中,数据输出是1张特征图。所谓1张特征图,换句话说,就是通道数为1的特征图。那么,如果要在通道方向上也拥有多个卷积运算的输出,该怎么做呢?为此,就需要用到多个滤波器(权重)

    7.10

    通过应用FN个滤波器,输出特征图也生成了FN个。如果将这FN个特征图汇集在一起,就得到了形状为(FN, OH, OW)的方块。将这个方块传给下一层,就是CNN的处理流。

    作为4维数据,滤波器的权重数据要按(output_channel, input_channel, height, width)的顺序书写。比如,通道数为3、大小为5 × 5的滤波器有20个时,可以写成(20, 3, 5, 5)。

    每个通道只有一个偏置。这里,偏置的形状是(FN, 1, 1),滤波器的输出结果的形状是(FN, OH, OW)。这两个方块相加时,要对滤波器的输出结果(FN, OH, OW)按通道加上相同的偏置值。另外,不同形状的方块相加时,可以基于NumPy的广播功能轻松实现
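
    A tiny NumPy check of the broadcast described above (the shapes are arbitrary illustration values):

    import numpy as np
    
    out = np.random.randn(10, 4, 4)   # filter output: (FN, OH, OW)
    b = np.random.randn(10, 1, 1)     # one bias per output channel: (FN, 1, 1)
    y = out + b                       # broadcasting adds the same bias to every (OH, OW) position
    print(y.shape)                    # (10, 4, 4)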

    7.11

    7.2.7 批处理

    批处理时,需要将在各层间传递的数据保存为4维数据,即按(batch_num, channel, height, width)的顺序保存数据。比如,将图7-12中的处理改成对N个数据进行批处理时,数据的形状如图7-13所示。

    图7-13的批处理版的数据流中,在各个数据的开头添加了批用的维度。像这样,数据作为4维的形状在各层间传递。这里需要注意的是,网络间传递的是4维数据,对这N个数据进行了卷积运算。也就是说,批处理将N次的处理汇总成了1次进行。

    7.12

    7.3 池化层

    7.13

    除了Max池化之外,还有Average池化等。相对于Max池化是从目标区域中取出最大值,Average池化则是计算目标区域的平均值。在图像识别领域,主要使用Max池化。因此,本书中说到“池化层”时,指的是Max池化。
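
    A small direct illustration of 2×2 Max pooling with stride 2 (using a reshape trick that only works for this toy case; the general `im2col`-based `Pooling` layer follows in 7.4.4):

    import numpy as np
    
    x = np.array([[1, 2, 1, 0],
                  [0, 1, 2, 3],
                  [3, 0, 1, 2],
                  [2, 4, 0, 1]], dtype=float)
    
    # 2x2 max pooling, stride 2: split into 2x2 blocks and take the max of each block.
    out = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(out)   # [[2. 3.]
                 #  [4. 2.]]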

    池化层的特征

    • 没有要学习的参数
    • 通道数不发生变化
    • 对微小的位置变化具有鲁棒性

    7.4 卷积层和池化层的实现

    7.4.1 四维数组

    7.4.2 基于im2col的展开

    7.14

    7.15

    im2col这个名称是“image to column”的缩写,翻译过来就是“从图像到矩阵”的意思。Caffe、Chainer 等深度学习框架中有名为im2col的函数,并且在卷积层的实现中,都使用了im2col。

    使用im2col展开输入数据后,之后就只需将卷积层的滤波器(权重)纵向展开为1列,并计算2个矩阵的乘积即可(参照图7-19)。这和全连接层的Affi ne层进行的处理基本相同

    7.16

    7.4.3 卷积层的实现

    def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
        """
    
        Parameters
        ----------
        input_data : 由(数据量, 通道, 高, 长)的4维数组构成的输入数据
        filter_h : 滤波器的高
        filter_w : 滤波器的长
        stride : 步幅
        pad : 填充
    
        Returns
        -------
        col : 2维数组
        """
        N, C, H, W = input_data.shape
        out_h = (H + 2*pad - filter_h)//stride + 1
        out_w = (W + 2*pad - filter_w)//stride + 1
    
        img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
        col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
    
        for y in range(filter_h):
            y_max = y + stride*out_h
            for x in range(filter_w):
                x_max = x + stride*out_w
                col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
    
        col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
        return col
    
    import sys, os
    sys.path.append(os.pardir)
    import numpy as np
    
    from common.util import im2col, col2im  # col2im is used by Convolution.backward and Pooling.backward below
    
    x1 = np.random.rand(1, 3, 7, 7)
    col1 = im2col(x1, 5, 5, stride=1, pad=0)
    print(col1)
    print(col1.shape)
    
    x2 = np.random.rand(10, 3, 7, 7)
    col2 = im2col(x2, 5, 5, stride=1, pad=0)
    print(col2)
    print(col2.shape)
    # output
    # (9, 75)
    # (90, 75)
    
    class Convolution:
        def __init__(self, W, b, stride=1, pad=0):
            self.W = W
            self.b = b
            self.stride = stride
            self.pad = pad
    
            # mid data used with backward
            self.x = None
            self.col = None
            self.col_W = None
    
            # grad of W and b
            self.dW = None
            self.db = None
    
        def forward(self, x):
            FN, C, FH, FW = self.W.shape
            N, C, H, W = x.shape
            out_h = 1 + int((H + 2*self.pad - FH) / self.stride)
            out_w = 1 + int((W + 2*self.pad - FW) / self.stride)
    
            col = im2col(x, FH, FW, self.stride, self.pad)
            col_W = self.W.reshape(FN, -1).T
    
            out = np.dot(col, col_W) + self.b
            out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)
    
            self.x = x
            self.col = col
            self.col_W = col_W
    
            return out
    
        def backward(self, dout):
            FN, C, FH, FW = self.W.shape
            dout = dout.transpose(0, 2, 3, 1).reshape(-1, FN)
    
            self.db = np.sum(dout, axis=0)
            self.dW = np.dot(self.col_W.T, dout)
            self.dW = self.dW.transpose(1, 0).reshape(FN, C, FH, FW)
    
            dcol = np.dot(dout, self.col_W.T)
            dx = col2im(dcol, self.x.shape, FH, FW, self.stride, self.pad)
    
            return dx
    

    7.4.4 池化层的实现

    7.17

    7.18

    class Pooling:
        def __init__(self, pool_h, pool_w, stride=1, pad=0):
            self.pool_h = pool_h
            self.pool_w = pool_w
            self.stride = stride
            self.pad = pad
            
            self.x = None
            self.arg_max = None
            
        def forward(self, x):
            N, C, H, W = x.shape
            out_h = int(1 + (H - self.pool_h) / self.stride)
            out_w = int(1 + (W - self.pool_w) / self.stride)
            
            col = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
            col = col.reshape(-1, self.pool_h * self.pool_w)
            
            arg_max = np.argmax(col, axis=1)
            out = np.max(col, axis=1)
            out = out.reshape(N, out_h, out_w, C).transpose(0, 3, 1, 2)
            
            self.x = x
            self.arg_max = arg_max
            
            return out
        
        def backward(self, dout):    
            dout = dout.transpose(0, 2, 3, 1)
            
            pool_size = self.pool_h * self.pool_w
            dmax = np.zeros((dout.size, pool_size))
            dmax[np.arange(self.arg_max.size), self.arg_max.flatten()] = dout.flatten()
            dmax = dmax.reshape(dout.shape + (pool_size,))
            
            dcol = dmax.reshape(dmax.shape[0] * dmax.shape[1] * dmax.shape[2], -1)
            dx = col2im(dcol, self.x.shape, self.pool_h, self.pool_w, self.stride, self.pad)
            
            return dx
    
    • 展开输入数据
    • 求各行的最大值
    • 转换为合适的输出大小

    7.5 CNN的实现

    7.19

    import sys
    import os
    
    sys.path.append(os.pardir)
    import pickle
    import numpy as np
    from collections import OrderedDict
    from common.layers import *
    from common.gradient import numerical_gradient
    
    
    class SimpleConvNet:
        """Simple ConvNet
    
        conv - relu - pool - affine - relu - affine - softmax
    
        Parameters
        ----------
        input_size : 输入大小(MNIST的情况下为784)
        hidden_size_list : 隐藏层的神经元数量的列表(e.g. [100, 100, 100])
        output_size : 输出大小(MNIST的情况下为10)
        activation : 'relu' or 'sigmoid'
        weight_init_std : 指定权重的标准差(e.g. 0.01)
            指定'relu'或'he'的情况下设定“He的初始值”
            指定'sigmoid'或'xavier'的情况下设定“Xavier的初始值”
        """
    
        def __init__(self, input_dim=(1, 28, 28),
                     conv_param={'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
                     hidden_size=100, output_size=10, weight_init_std=0.01):
            filter_num = conv_param['filter_num']
            filter_size = conv_param['filter_size']
            filter_pad = conv_param['pad']
            filter_stride = conv_param['stride']
            input_size = input_dim[1]
            conv_output_size = (input_size - filter_size + 2 * filter_pad) / filter_stride + 1
            pool_output_size = int(filter_num * (conv_output_size / 2) * (conv_output_size / 2))
    		# 这里将由初始化参数传入的卷积层的超参数从字典中取了出来(以方便后面使用),然后,计算卷积层的输出大小。接下来是权重参数的初始化部分。
    
            # init weight
            self.params = {
                'W1': weight_init_std * np.random.randn(filter_num, input_dim[0], filter_size, filter_size),
                'b1': np.zeros(filter_num),
                'W2': weight_init_std * np.random.randn(pool_output_size, hidden_size),
                'b2': np.zeros(hidden_size),
                'W3': weight_init_std * np.random.randn(hidden_size, output_size),
                'b3': np.zeros(output_size)
            }
    		# 学习所需的参数是第1层的卷积层和剩余两个全连接层的权重和偏置。将这些参数保存在实例变量的params字典中。将第1层的卷积层的权重设为关键字W1,偏置设为关键字b1。同样,分别用关键字W2、b2和关键字W3、b3来保存第2个和第3个全连接层的权重和偏置。最后,生成必要的层。
    
            # 生成层
            self.layers = OrderedDict()
            self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
                                               conv_param['stride'], conv_param['pad'])
    
            self.layers['Relu1'] = Relu()
            self.layers['Pool1'] = Pooling(pool_h=2, pool_w=2, stride=2)
            self.layers['Affine1'] = Affine(self.params['W2'], self.params['b2'])
            self.layers['Relu2'] = Relu()
            self.layers['Affine2'] = Affine(self.params['W3'], self.params['b3'])
    
            self.last_layer = SoftmaxWithLoss()
            # 从最前面开始按顺序向有序字典(OrderedDict)的layers中添加层。只有最后的SoftmaxWithLoss层被添加到别的变量last_layer中。
    
    
        def predict(self, x):
            for layer in self.layers.values():
                x = layer.forward(x)
    
            return x
    
        def loss(self, x, t):
            """求损失函数
            参数x是输入数据、t是教师标签
            用于推理的predict方法从头开始依次调用已添加的层,并将结果传递给下一层。在求损失函数的loss方法中,除了使用 predict方法进行的 forward处理之外,还会继续进行forward处理,直到到达最后的SoftmaxWithLoss层。
            """
            y = self.predict(x)
            return self.last_layer.forward(y, t)
    
        def accuracy(self, x, t, batch_size=100):
            if t.ndim != 1 :
                t = np.argmax(t, axis=1)
    
            acc = 0.0
    
            for i in range(int(x.shape[0] / batch_size)):
                tx = x[i * batch_size : (i+1) * batch_size]
                tt = t[i * batch_size : (i+1) * batch_size]
                y = self.predict(tx)
                y = np.argmax(y, axis=1)
                acc += np.sum(y == tt)
    
            return acc / x.shape[0]
    
        def numerical_gradient(self, x, t):
            """求梯度(数值微分)
    
            Parameters
            ----------
            x : 输入数据
            t : 教师标签
    
             Returns
            -------
            具有各层的梯度的字典变量
                grads['W1']、grads['W2']、...是各层的权重
                grads['b1']、grads['b2']、...是各层的偏置
            """
            loss_w = lambda w:self.loss(x, t)
    
            grads = {}
            for idx in (1, 2, 3):
                grads['W' + str(idx)] = numerical_gradient(loss_w, self.params['W' + str(idx)])
                grads['b' + str(idx)] = numerical_gradient(loss_w, self.params['b' + str(idx)])
    
            return grads
    
        def gradient(self, x, t):
            """求梯度(误差反向传播法)
    
            Parameters
             ----------
            x : 输入数据
            t : 教师标签
    
            Returns
            -------
            具有各层的梯度的字典变量
            grads['W1']、grads['W2']、...是各层的权重
                grads['b1']、grads['b2']、...是各层的偏置
            """
            # forward
            self.loss(x, t)
    
            # backward
            dout = 1
            dout = self.last_layer.backward(dout)
    
            layers = list(self.layers.values())
            layers.reverse()
            for layer in layers:
                dout = layer.backward(dout)
    
            # setup
            grads = {'W1': self.layers['Conv1'].dW, 'b1': self.layers['Conv1'].db,
                     'W2': self.layers['Affine1'].dW, 'b2': self.layers['Affine1'].db,
                     'W3': self.layers['Affine2'].dW, 'b3': self.layers['Affine2'].db}
    
            return grads
    
        def save_params(self, file_name='params.pkl'):
            params = {}
            for key, val in self.params.items():
                params[key] = val
            with open(file_name, 'wb') as f:
                pickle.dump(params, f)
    
        def load_params(self, file_name="params.pkl"):
            with open(file_name, 'rb') as f:
                params = pickle.load(f)
            for key, val in params.items():
                self.params[key] = val
    
            for i, key in enumerate(['Conv1', 'Affine1', 'Affine2']):
                self.layers[key].W = self.params['W' + str(i+1)]
                self.layers[key].b = self.params['b' + str(i+1)]
    
    

    7.6 CNN的可视化

    7.6.1 第一层权重的可视化

    第1层的卷积层的权重的形状是(30, 1, 5, 5),即30个大小为5 × 5、通道为1的滤波器。滤波器大小是5 × 5、通道数是1,意味着滤波器可以可视化为1通道的灰度图像。现在,我们将卷积层(第1层)的滤波器显示为图像

    import numpy as np
    import matplotlib.pyplot as plt
    from simple_convnet import SimpleConvNet
    
    def filter_show(filters, nx=8, margin=3, scale=10):
        
        FN, C, FH, FW = filters.shape
        ny = int(np.ceil(FN / nx))
    
        fig = plt.figure()
        fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
    
        for i in range(FN):
            ax = fig.add_subplot(ny, nx, i+1, xticks=[], yticks=[])
            ax.imshow(filters[i, 0], cmap=plt.cm.gray_r, interpolation='nearest')
        plt.show()
    
    
    network = SimpleConvNet()
    # 随机进行初始化后的权重
    filter_show(network.params['W1'])
    
    # 学习后的权重
    network.load_params("params.pkl")
    filter_show(network.params['W1'])
    

    7.21

    7.22

    7.23

    “滤波器1”对垂直方向上的边缘有响应,“滤波器2”对水平方向上的边缘有响应。

    由此可知,卷积层的滤波器会提取边缘或斑块等原始信息。而刚才实现的CNN会将这些原始信息传递给后面的层。

    7.6.2 基于分层结构的信息提取

    7.24

    如果堆叠了多层卷积层,则随着层次加深,提取的信息也愈加复杂、抽象,这是深度学习中很有意思的一个地方。最开始的层对简单的边缘有响应,接下来的层对纹理有响应,再后面的层对更加复杂的物体部件有响应。也就是说,随着层次加深,神经元从简单的形状向“高级”信息变化。换句话说,就像我们理解东西的“含义”一样,响应的对象在逐渐变化。

    7.7 具有代表性的CNN

    7.7.1 LeNet

    7.25

    7.7.2 AlexNet

    7.26

    AlexNet叠有多个卷积层和池化层,最后经由全连接层输出结果。虽然结构上AlexNet和LeNet没有大的不同,但有以下几点差异。

    激活函数使用ReLU。

    使用进行局部正规化的LRN(Local Response Normalization)层。

    使用Dropout(6.4.3节)。

    如上所述,关于网络结构,LeNet和AlexNet没有太大的不同。但是,围绕它们的环境和计算机技术有了很大的进步。具体地说,现在任何人都可以获得大量的数据。而且,擅长大规模并行计算的GPU得到普及,高速进行大量的运算已经成为可能。大数据和GPU已成为深度学习发展的巨大的原动力

    # coding: utf-8
    import sys, os
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import numpy as np
    import matplotlib.pyplot as plt
    from dataset.mnist import load_mnist
    from simple_convnet import SimpleConvNet
    from common.trainer import Trainer
    
    # 读入数据
    (x_train, t_train), (x_test, t_test) = load_mnist(flatten=False)
    
    # 处理花费时间较长的情况下减少数据 
    #x_train, t_train = x_train[:5000], t_train[:5000]
    #x_test, t_test = x_test[:1000], t_test[:1000]
    
    max_epochs = 20
    
    network = SimpleConvNet(input_dim=(1,28,28), 
                            conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
                            hidden_size=100, output_size=10, weight_init_std=0.01)
                            
    trainer = Trainer(network, x_train, t_train, x_test, t_test,
                      epochs=max_epochs, mini_batch_size=100,
                      optimizer='Adam', optimizer_param={'lr': 0.001},
                      evaluate_sample_num_per_epoch=1000)
    trainer.train()
    
    # 保存参数
    network.save_params("params.pkl")
    print("Saved Network Parameters!")
    
    # 绘制图形
    markers = {'train': 'o', 'test': 's'}
    x = np.arange(max_epochs)
    plt.plot(x, trainer.train_acc_list, marker='o', label='train', markevery=2)
    plt.plot(x, trainer.test_acc_list, marker='s', label='test', markevery=2)
    plt.xlabel("epochs")
    plt.ylabel("accuracy")
    plt.ylim(0, 1.0)
    plt.legend(loc='lower right')
    plt.show()
    
    

    7.20

    第八章 深度学习

    8.1 加深网络

    8.1.1 向更深的网络出发

    这个网络的层比之前实现的网络都更深。这里使用的卷积层全都是3 × 3的小型滤波器,特点是随着层的加深,通道数变大(卷积层的通道数从前面的层开始按顺序以16、16、32、32、64、64的方式增加)。此外,如图8-1所示,插入了池化层,以逐渐减小中间数据的空间大小;并且,后面的全连接层中使用了Dropout层。

    8.1

    # coding: utf-8
    import sys, os
    sys.path.append(os.pardir)  # 为了导入父目录的文件而进行的设定
    import pickle
    import numpy as np
    from collections import OrderedDict
    from common.layers import *
    
    
    class DeepConvNet:
        """识别率为99%以上的高精度的ConvNet
    
        网络结构如下所示
            conv - relu - conv- relu - pool -
            conv - relu - conv- relu - pool -
            conv - relu - conv- relu - pool -
            affine - relu - dropout - affine - dropout - softmax
        """
        def __init__(self, input_dim=(1, 28, 28),
                     conv_param_1 = {'filter_num':16, 'filter_size':3, 'pad':1, 'stride':1},
                     conv_param_2 = {'filter_num':16, 'filter_size':3, 'pad':1, 'stride':1},
                     conv_param_3 = {'filter_num':32, 'filter_size':3, 'pad':1, 'stride':1},
                     conv_param_4 = {'filter_num':32, 'filter_size':3, 'pad':2, 'stride':1},
                     conv_param_5 = {'filter_num':64, 'filter_size':3, 'pad':1, 'stride':1},
                     conv_param_6 = {'filter_num':64, 'filter_size':3, 'pad':1, 'stride':1},
                     hidden_size=50, output_size=10):
            # 初始化权重===========
            # 各层的神经元平均与前一层的几个神经元有连接(TODO:自动计算)
            pre_node_nums = np.array([1*3*3, 16*3*3, 16*3*3, 32*3*3, 32*3*3, 64*3*3, 64*4*4, hidden_size])
            wight_init_scales = np.sqrt(2.0 / pre_node_nums)  # 使用ReLU的情况下推荐的初始值
            
            self.params = {}
            pre_channel_num = input_dim[0]
            for idx, conv_param in enumerate([conv_param_1, conv_param_2, conv_param_3, conv_param_4, conv_param_5, conv_param_6]):
                self.params['W' + str(idx+1)] = wight_init_scales[idx] * np.random.randn(conv_param['filter_num'], pre_channel_num, conv_param['filter_size'], conv_param['filter_size'])
                self.params['b' + str(idx+1)] = np.zeros(conv_param['filter_num'])
                pre_channel_num = conv_param['filter_num']
            self.params['W7'] = wight_init_scales[6] * np.random.randn(64*4*4, hidden_size)
            self.params['b7'] = np.zeros(hidden_size)
            self.params['W8'] = wight_init_scales[7] * np.random.randn(hidden_size, output_size)
            self.params['b8'] = np.zeros(output_size)
    
            # 生成层===========
            self.layers = []
            self.layers.append(Convolution(self.params['W1'], self.params['b1'], 
                               conv_param_1['stride'], conv_param_1['pad']))
            self.layers.append(Relu())
            self.layers.append(Convolution(self.params['W2'], self.params['b2'], 
                               conv_param_2['stride'], conv_param_2['pad']))
            self.layers.append(Relu())
            self.layers.append(Pooling(pool_h=2, pool_w=2, stride=2))
            self.layers.append(Convolution(self.params['W3'], self.params['b3'], 
                               conv_param_3['stride'], conv_param_3['pad']))
            self.layers.append(Relu())
            self.layers.append(Convolution(self.params['W4'], self.params['b4'],
                               conv_param_4['stride'], conv_param_4['pad']))
            self.layers.append(Relu())
            self.layers.append(Pooling(pool_h=2, pool_w=2, stride=2))
            self.layers.append(Convolution(self.params['W5'], self.params['b5'],
                               conv_param_5['stride'], conv_param_5['pad']))
            self.layers.append(Relu())
            self.layers.append(Convolution(self.params['W6'], self.params['b6'],
                               conv_param_6['stride'], conv_param_6['pad']))
            self.layers.append(Relu())
            self.layers.append(Pooling(pool_h=2, pool_w=2, stride=2))
            self.layers.append(Affine(self.params['W7'], self.params['b7']))
            self.layers.append(Relu())
            self.layers.append(Dropout(0.5))
            self.layers.append(Affine(self.params['W8'], self.params['b8']))
            self.layers.append(Dropout(0.5))
            
            self.last_layer = SoftmaxWithLoss()
    
        def predict(self, x, train_flg=False):
            for layer in self.layers:
                if isinstance(layer, Dropout):
                    x = layer.forward(x, train_flg)
                else:
                    x = layer.forward(x)
            return x
    
        def loss(self, x, t):
            y = self.predict(x, train_flg=True)
            return self.last_layer.forward(y, t)
    
        def accuracy(self, x, t, batch_size=100):
            if t.ndim != 1 : t = np.argmax(t, axis=1)
    
            acc = 0.0
    
            for i in range(int(x.shape[0] / batch_size)):
                tx = x[i*batch_size:(i+1)*batch_size]
                tt = t[i*batch_size:(i+1)*batch_size]
                y = self.predict(tx, train_flg=False)
                y = np.argmax(y, axis=1)
                acc += np.sum(y == tt)
    
            return acc / x.shape[0]
    
        def gradient(self, x, t):
            # forward
            self.loss(x, t)
    
            # backward
            dout = 1
            dout = self.last_layer.backward(dout)
    
            tmp_layers = self.layers.copy()
            tmp_layers.reverse()
            for layer in tmp_layers:
                dout = layer.backward(dout)
    
            # 设定
            grads = {}
            for i, layer_idx in enumerate((0, 2, 5, 7, 10, 12, 15, 18)):
                grads['W' + str(i+1)] = self.layers[layer_idx].dW
                grads['b' + str(i+1)] = self.layers[layer_idx].db
    
            return grads
    
        def save_params(self, file_name="params.pkl"):
            params = {}
            for key, val in self.params.items():
                params[key] = val
            with open(file_name, 'wb') as f:
                pickle.dump(params, f)
    
        def load_params(self, file_name="params.pkl"):
            with open(file_name, 'rb') as f:
                params = pickle.load(f)
            for key, val in params.items():
                self.params[key] = val
    
            for i, layer_idx in enumerate((0, 2, 5, 7, 10, 12, 15, 18)):
                self.layers[layer_idx].W = self.params['W' + str(i+1)]
                self.layers[layer_idx].b = self.params['b' + str(i+1)]
    
    
    # coding: utf-8
    import sys, os
    sys.path.append(os.pardir)  # 为了导入父目录而进行的设定
    import numpy as np
    import matplotlib.pyplot as plt
    from dataset.mnist import load_mnist
    from deep_convnet import DeepConvNet
    from common.trainer import Trainer
    
    (x_train, t_train), (x_test, t_test) = load_mnist(flatten=False)
    
    network = DeepConvNet()  
    trainer = Trainer(network, x_train, t_train, x_test, t_test,
                      epochs=20, mini_batch_size=100,
                      optimizer='Adam', optimizer_param={'lr':0.001},
                      evaluate_sample_num_per_epoch=1000)
    trainer.train()
    
    # 保存参数
    network.save_params("deep_convnet_params.pkl")
    print("Saved Network Parameters!")
    
    