Deep Learning Series, Part 2


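1. How Neural Networks Work

A neural network model is a typical supervised learning problem of the kind described in the previous chapter: given a set of inputs and their corresponding target outputs, we look for the best model. With that model, a new input yields a predicted output close to the true one.

Let us first see how to implement a simple neural network like this:

Input x = [1, 2, 3]

Target output y = [-0.85, 0.72]

A hidden layer with four units sits in between.

The structure is shown in the figure:

We want to find the parameters w1_0, w2_0, b1_0 and b2_0 such that the mean squared error (MSE) between the output computed for the given input x and the target output y is minimized.

First, how many parameters are there? This is a two-layer network with the structure shown below (image source: http://stackoverflow.com/questions/22054877/backpropagation-training-stuck), where the input layer has 3 units, the hidden layer 4, and the output layer 2:

So there are (3×4 + 4) + (4×2 + 2) = 26 parameters to train in total. We can initialize them as shown in the figure. Parameters may be initialized randomly or simply set by hand: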

    import numpy as np

    # Weights: input layer (3 units) -> hidden layer (4 units)
    w1_0 = np.array([[0.1, 0.2, 0.3, 0.4],
                     [0.5, 0.6, 0.7, 0.8],
                     [0.9, 1.0, 1.1, 1.2]])
    # Weights: hidden layer (4 units) -> output layer (2 units)
    w2_0 = np.array([[1.3, 1.4],
                     [1.5, 1.6],
                     [1.7, 1.8],
                     [1.9, 2.0]])

    # Biases of the hidden layer and the output layer
    b1_0 = np.array([-2.0, -6.0, -1.0, -7.0])
    b2_0 = np.array([-2.5, -5.0])
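As a quick sanity check on the count of 26 parameters, the arithmetic can be sketched in a few lines. The helper name `count_parameters` is hypothetical and not part of the original article; it assumes a fully connected network with one bias per unit in every layer except the input layer.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    `layer_sizes` lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between two adjacent layers
        total += n_out         # one bias per unit in the receiving layer
    return total


n_params = count_parameters([3, 4, 2])
print(n_params)  # 26
```

The same number follows from the array shapes: a 3x4 and a 4x2 weight matrix plus bias vectors of length 4 and 2.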

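Next we run one forward pass:

    x = [1, 2, 3]
    y = [-0.85, 0.72]

    o1 = np.dot(x, w1_0) + b1_0              # hidden-layer pre-activation
    os1 = np.power(1 + np.exp(o1 * -1), -1)  # sigmoid activation
    o2 = np.dot(os1, w2_0) + b2_0            # output-layer pre-activation
    os2 = np.tanh(o2)                        # tanh activation, the network output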
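To make the forward pass reusable, the same computation can be wrapped in a function. This is only a sketch: the names `sigmoid` and `forward` are illustrative and do not appear in the original, and `1 / (1 + exp(-z))` is simply another way of writing the `np.power(..., -1)` expression used above.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, w1, b1, w2, b2):
    """Return the hidden activations and the network output for input x."""
    hidden = sigmoid(np.dot(x, w1) + b1)
    output = np.tanh(np.dot(hidden, w2) + b2)
    return hidden, output


w1 = np.array([[0.1, 0.2, 0.3, 0.4],
               [0.5, 0.6, 0.7, 0.8],
               [0.9, 1.0, 1.1, 1.2]])
w2 = np.array([[1.3, 1.4],
               [1.5, 1.6],
               [1.7, 1.8],
               [1.9, 2.0]])
b1 = np.array([-2.0, -6.0, -1.0, -7.0])
b2 = np.array([-2.5, -5.0])

x = np.array([1.0, 2.0, 3.0])
y = np.array([-0.85, 0.72])

hidden, output = forward(x, w1, b1, w2, b2)
mse = np.mean((y - output) ** 2)  # error of the untrained network
print(np.round(output, 2))  # [ 0.72 -0.88]
```

With the hand-picked initial parameters the output is far from the target, which is what training is supposed to fix.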

Then one backward pass:

    alpha = 0.1  # learning rate

    # Error signals (negative gradients) at the output layer and the hidden layer
    grad_os2 = (y - os2) * (1 - np.power(os2, 2))
    grad_os1 = np.dot(w2_0, grad_os2.T).T * (1 - os1) * os1
    grad_w2 = ...
    grad_b2 = ...
    ...
    ...
    w2_0 = w2_0 + alpha * grad_w2
    b2_0 = b2_0 + alpha * grad_b2
    ...
    ...

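We repeat this many times until the error finally converges. A backward pass requires writing out the derivative with respect to every parameter and then updating each parameter accordingly. I have not written them all out here, because deriving them layer by layer is tedious. More importantly, every time we want to train a new network architecture, all of this has to be derived again, which costs time and effort.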
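For readers who want to see the elided steps, here is one possible way to complete them for this particular two-layer network. It is a sketch, not the author's code: the function name `train_step` is made up, and the loss is assumed to be half the sum of squared errors, which matches the sign convention of `grad_os2` above (the `grad_*` values are negative gradients, so parameters are updated with `+`).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_step(x, y, w1, b1, w2, b2, alpha=0.1):
    """Run one forward and one backward pass.

    Returns the updated parameters and the loss measured before the update.
    """
    # Forward pass, as in the article.
    os1 = sigmoid(np.dot(x, w1) + b1)    # hidden activations, shape (4,)
    os2 = np.tanh(np.dot(os1, w2) + b2)  # network output, shape (2,)
    loss = 0.5 * np.sum((y - os2) ** 2)  # assumed loss: half the squared error

    # Backward pass. Each term is the NEGATIVE gradient of the loss,
    # which is why the parameters are updated with "+" below.
    grad_os2 = (y - os2) * (1.0 - os2 ** 2)              # through tanh
    grad_os1 = np.dot(w2, grad_os2) * (1.0 - os1) * os1  # through sigmoid

    grad_w2 = np.outer(os1, grad_os2)  # shape (4, 2), same as w2
    grad_b2 = grad_os2
    grad_w1 = np.outer(x, grad_os1)    # shape (3, 4), same as w1
    grad_b1 = grad_os1

    params = (w1 + alpha * grad_w1, b1 + alpha * grad_b1,
              w2 + alpha * grad_w2, b2 + alpha * grad_b2)
    return params, loss


w1_init = np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.6, 0.7, 0.8],
                    [0.9, 1.0, 1.1, 1.2]])
w2_init = np.array([[1.3, 1.4],
                    [1.5, 1.6],
                    [1.7, 1.8],
                    [1.9, 2.0]])
b1_init = np.array([-2.0, -6.0, -1.0, -7.0])
b2_init = np.array([-2.5, -5.0])

x = np.array([1.0, 2.0, 3.0])
y = np.array([-0.85, 0.72])

params = (w1_init, b1_init, w2_init, b2_init)
losses = []
for _ in range(10):
    params, loss = train_step(x, y, *params)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss should shrink over the ten steps
```

Every line of the backward pass is tied to this exact architecture; changing a layer or an activation means rewriting it by hand.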

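On closer inspection, though, the derivation follows a pattern. In the backward direction, the gradient computed at one layer becomes the input to the gradient computation of the layer before it, just as in the forward direction each layer's output becomes the next layer's input. So can this process be abstracted into a framework that differentiates automatically? The family of deep learning frameworks represented by Tensorflow grew out of exactly this idea.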

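2. Deep Learning Frameworks

Which deep learning framework has been the most popular in recent years? Without question, Tensorflow gets the most votes.

In practice, however, these frameworks share some common features. Gokula Krishnan Santhanam argues that most deep learning frameworks contain the following five core components:

Tensors

Operations on tensors

A computation graph

Automatic differentiation tools

Extension libraries such as BLAS, cuBLAS and cuDNN

A tensor can be understood as an array with any number of dimensions: a one-dimensional array is called a vector, a two-dimensional one a matrix, and both are tensors. With tensors come the corresponding basic operations, such as reading the values of a given row or column, or multiplying a tensor by a constant. Using the extension libraries amounts to delegating the computation to low-level numerical software for speed.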
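The idea of tensors of different dimensionality, and the basic operations on them, can be illustrated with NumPy arrays. NumPy is used here only as a convenient stand-in for a framework's tensor type; that choice is an assumption made for illustration.

```python
import numpy as np

scalar = np.array(3.0)                       # 0-dimensional tensor
vector = np.array([1.0, 2.0, 3.0])           # 1-dimensional tensor (vector)
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2-dimensional tensor (matrix)
cube = np.zeros((2, 3, 4))                   # 3-dimensional tensor

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, t.ndim, t.shape)  # number of dimensions and size along each

row = matrix[1, :]     # second row of the matrix: 3.0, 4.0
column = matrix[:, 0]  # first column of the matrix: 1.0, 3.0
scaled = 2 * matrix    # multiply every element by a constant
```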

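Today we focus on two of these components: the computation graph and automatic differentiation. We first use the Torch framework as an example of how automatic differentiation is implemented, and then implement both parts ourselves in the simplest possible way.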

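2.1. How Deep Learning Frameworks Implement Automatic Differentiation

For frameworks such as Tensorflow, the web is full of "a few lines of code" and "get started in minutes" tutorials that let you quickly complete simple tasks like handwritten digit recognition. Understanding the principles behind Tensorflow in depth is not as easy. Here we briefly discuss that part.

When we have data and train a neural network, all parameters of the network are variables. Training the model means finding the set of variable values that makes the predictions most accurate. Concretely, the input data is propagated forward to produce a prediction, and the error between the prediction and the ground truth is then propagated backward to update the variables. Repeating this many times yields the optimal parameters. This raises a question: with so many layers in the network, how do we make sure that both forward and backward propagation run correctly?

Notably, both directions of propagation behave like a pipeline. Forward propagation simply computes layer by layer, with each layer's result serving as the next layer's input. Backward propagation uses the chain rule to work from back to front, attributing a share of the error to every parameter.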
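A tiny worked example may help make the chain rule concrete. The sketch below uses a made-up one-parameter model, a = sigmoid(w * x) with loss (a - t)^2, which is not from the original article; the function names `forward` and `backward` are likewise illustrative. The backward pass multiplies the local derivatives from back to front and is compared against a finite-difference estimate.

```python
import math


def forward(w, x, t):
    """Compute the intermediate values and the loss, front to back."""
    z = w * x                       # linear step
    a = 1.0 / (1.0 + math.exp(-z))  # sigmoid activation
    loss = (a - t) ** 2             # squared error against target t
    return z, a, loss


def backward(w, x, t):
    """Compute d(loss)/dw with the chain rule, back to front."""
    z, a, loss = forward(w, x, t)
    dloss_da = 2.0 * (a - t)  # derivative of the squared error
    da_dz = a * (1.0 - a)     # derivative of the sigmoid
    dz_dw = x                 # derivative of the linear step
    return dloss_da * da_dz * dz_dw


w, x, t = 0.5, 2.0, 1.0
analytic = backward(w, x, t)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (forward(w + eps, x, t)[2] - forward(w - eps, x, t)[2]) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```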

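Image source: Colah's blog

After this abstraction, we find that every module in a deep learning framework needs two functions, one for the forward direction and one for the backward direction. Forward and backward here are like the Ren and Du meridians of martial arts novels. During training, data flows forward to produce predictions and the error flows backward to update the parameters, much like qi circulating through the two meridians. As the training error shrinks and converges, the deep network has opened up both meridians.

Next, we first look at how the Torch source code implements these two parts, and then write the simplest possible deep learning framework directly in Python.

We use Torch's nn project as the example because its file layout is relatively simple. Tensorflow follows a similar pattern, but its file structure is more complex; interested readers can study articles on it.

In the GitHub source of the Torch nn module, almost every .lua file in the directory defines these two functions: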

    -- lua
    function xxx:updateOutput(input)
       input.THNN.xxx_updateOutput(
          input:cdata(),
          self.output:cdata()
       )
       return self.output
    end

    function xxx:updateGradInput(input, gradOutput)
       input.THNN.xxx_updateGradInput(
          input:cdata(),
          gradOutput:cdata(),
          self.gradInput:cdata(),
          self.output:cdata()
       )
       return self.gradInput
    end

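These are essentially just method definitions without the actual computation. The actual computation is implemented in C in the ./lib/THNN/generic directory. Take the Sigmoid function as an example.

The Sigmoid function has the form $f(x) = \frac{1}{1 + e^{-x}}$, and its implementation looks like this:

    // c
    void THNN_(Sigmoid_updateOutput)(
              THNNState *state,
              THTensor *input,
              THTensor *output)
    {
      THTensor_(resizeAs)(output, input);

      TH_TENSOR_APPLY2(real, output, real, input,
        *output_data = 1./(1.+ exp(- *input_data));
      );
    }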

The derivative of the Sigmoid function is $f'(x) = f(x)\,(1 - f(x))$, so the implementation is:

    // c
    void THNN_(Sigmoid_updateGradInput)(
              THNNState *state,
              THTensor *input,
              THTensor *gradOutput,
              THTensor *gradInput,
              THTensor *output)
    {
      THNN_CHECK_NELEMENT(input, gradOutput);
      THTensor_(resizeAs)(gradInput, output);
      TH_TENSOR_APPLY3(real, gradInput, real, gradOutput, real, output,
        real z = *output_data;
        *gradInput_data = *gradOutput_data * (1. - z) * z;
      );
    }

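Note that in updateOutput, output_data is on the left of the assignment and input_data on the right, whereas in updateGradInput, gradInput_data is on the left and gradOutput_data on the right. In other words, computing output from input is the forward pass, and computing gradInput from gradOutput is the backward pass.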
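The same forward/backward pairing can be mirrored in a few lines of Python. The class below is a hypothetical illustration, not part of Torch: the names `SigmoidModule`, `update_output` and `update_grad_input` are chosen only to echo the two Torch functions, and NumPy arrays stand in for Torch tensors.

```python
import numpy as np


class SigmoidModule:
    """Illustrative Python counterpart of the two Torch functions."""

    def update_output(self, input):
        # Forward: output is computed from input and cached for later.
        self.output = 1.0 / (1.0 + np.exp(-input))
        return self.output

    def update_grad_input(self, input, grad_output):
        # Backward: grad_input is computed from grad_output, reusing
        # the cached forward output z, since f'(x) = z * (1 - z).
        z = self.output
        self.grad_input = grad_output * (1.0 - z) * z
        return self.grad_input


module = SigmoidModule()
inp = np.array([-1.0, 0.0, 2.0])
out = module.update_output(inp)
grad_in = module.update_grad_input(inp, np.ones_like(inp))
print(out[1], grad_in[1])  # 0.5 0.25
```

Caching the forward output means the backward step does not need to evaluate the exponential again, which is also why the C code receives `output` as an argument.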

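2.2. Writing the Simplest Deep Learning Framework Directly in Python

This part is an exercise in "reinventing the wheel" and borrows from MiniFlow, a small project from Udacity.

Data structures

First we implement a parent class Node, and then build modules such as Input, Linear and Sigmoid on top of it using plain Python class inheritance. Each module overrides the forward and backward methods.

The code is as follows: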

    import numpy as np


    class Node(object):
        """
        Base class for nodes in the network.

        Arguments:

            `inbound_nodes`: A list of nodes with edges into this node.
        """
        def __init__(self, inbound_nodes=None):
            """
            Node's constructor (runs when the object is instantiated). Sets
            properties that all nodes need.
            """
            # A list of nodes with edges into this node. The `None` default
            # avoids sharing one mutable list between all nodes.
            self.inbound_nodes = inbound_nodes if inbound_nodes is not None else []
            # The eventual value of this node. Set by running
            # the forward() method.
            self.value = None
            # A list of nodes that this node outputs to.
            self.outbound_nodes = []
            # Keys are the inputs to this node and
            # their values are the partials of this node with
            # respect to that input.
            self.gradients = {}

            # Sets this node as an outbound node for all of
            # this node's inputs.
            for node in self.inbound_nodes:
                node.outbound_nodes.append(self)

        def forward(self):
            """
            Every node that uses this class as a base class will
            need to define its own `forward` method.
            """
            raise NotImplementedError

        def backward(self):
            """
            Every node that uses this class as a base class will
            need to define its own `backward` method.
            """
            raise NotImplementedError


    class Input(Node):
        """
        A generic input into the network.
        """
        def __init__(self):
            Node.__init__(self)

        def forward(self):
            # The value is set from outside (feed_dict), nothing to compute.
            pass

        def backward(self):
            # Sum the gradients arriving from every consumer of this input.
            self.gradients = {self: 0}
            for n in self.outbound_nodes:
                self.gradients[self] += n.gradients[self]


    class Linear(Node):
        """
        Represents a node that performs a linear transform.
        """
        def __init__(self, X, W, b):
            Node.__init__(self, [X, W, b])

        def forward(self):
            """
            Performs the math behind a linear transform.
            """
            X = self.inbound_nodes[0].value
            W = self.inbound_nodes[1].value
            b = self.inbound_nodes[2].value
            self.value = np.dot(X, W) + b

        def backward(self):
            """
            Calculates the gradient based on the output values.
            """
            self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
            for n in self.outbound_nodes:
                grad_cost = n.gradients[self]
                # Partials with respect to X, W and b, in that order.
                self.gradients[self.inbound_nodes[0]] += np.dot(grad_cost, self.inbound_nodes[1].value.T)
                self.gradients[self.inbound_nodes[1]] += np.dot(self.inbound_nodes[0].value.T, grad_cost)
                self.gradients[self.inbound_nodes[2]] += np.sum(grad_cost, axis=0, keepdims=False)


    class Sigmoid(Node):
        """
        Represents a node that performs the sigmoid activation function.
        """
        def __init__(self, node):
            Node.__init__(self, [node])

        def _sigmoid(self, x):
            """
            This method is separate from `forward` because it
            will be used with `backward` as well.

            `x`: A numpy array-like object.
            """
            return 1. / (1. + np.exp(-x))

        def forward(self):
            """
            Perform the sigmoid function and set the value.
            """
            input_value = self.inbound_nodes[0].value
            self.value = self._sigmoid(input_value)

        def backward(self):
            """
            Calculates the gradient using the derivative of
            the sigmoid function.
            """
            self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
            for n in self.outbound_nodes:
                grad_cost = n.gradients[self]
                sigmoid = self.value
                self.gradients[self.inbound_nodes[0]] += sigmoid * (1 - sigmoid) * grad_cost


    class Tanh(Node):
        def __init__(self, node):
            """
            Represents a node that performs the tanh activation function.
            """
            Node.__init__(self, [node])

        def forward(self):
            """
            Calculates the tanh.
            """
            input_value = self.inbound_nodes[0].value
            self.value = np.tanh(input_value)

        def backward(self):
            """
            Calculates the gradient using the derivative of tanh,
            1 - tanh^2 = (1 + tanh) * (1 - tanh).
            """
            self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
            for n in self.outbound_nodes:
                grad_cost = n.gradients[self]
                tanh = self.value
                # MSE hands back a column vector, so it is transposed to
                # match the (1, n_outputs) row produced by forward(). This
                # only lines up when the batch holds a single sample.
                self.gradients[self.inbound_nodes[0]] += (1 + tanh) * (1 - tanh) * grad_cost.T


    class MSE(Node):
        def __init__(self, y, a):
            """
            The mean squared error cost function.
            Should be used as the last node for a network.
            """
            Node.__init__(self, [y, a])

        def forward(self):
            """
            Calculates the mean squared error.
            """
            # Flatten targets and predictions into column vectors.
            y = self.inbound_nodes[0].value.reshape(-1, 1)
            a = self.inbound_nodes[1].value.reshape(-1, 1)

            self.m = self.inbound_nodes[0].value.shape[0]
            self.diff = y - a
            self.value = np.mean(self.diff**2)

        def backward(self):
            """
            Calculates the gradient of the cost.
            """
            self.gradients[self.inbound_nodes[0]] = (2 / self.m) * self.diff
            self.gradients[self.inbound_nodes[1]] = (-2 / self.m) * self.diff

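Scheduling and optimization

Optimization will be covered in detail in a later part of this series. Here we mainly give a brief account of how computation on the graph is scheduled. The modules in Tensorflow form a directed acyclic graph, as in the figure below (source: http://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/):

During computation the modules depend on one another. For example, computing module 1 requires modules 3 and 4 to be finished, and finishing module 3 requires modules 5 and 2 to be finished first, in that order. We can therefore use Kahn's algorithm as the scheduler (the topological_sort function below) to derive from the graph a computation order such as 5->2->3->4->1.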

    def topological_sort(feed_dict):
        """
        Sort the nodes in topological order using Kahn's Algorithm.

        `feed_dict`: A dictionary where the key is a `Input` Node and the value is the respective value feed to that Node.

        Returns a list of sorted nodes.
        """
        input_nodes = [n for n in feed_dict.keys()]

        # Walk the graph from the inputs and record every edge.
        G = {}
        nodes = [n for n in input_nodes]
        while len(nodes) > 0:
            n = nodes.pop(0)
            if n not in G:
                G[n] = {'in': set(), 'out': set()}
            for m in n.outbound_nodes:
                if m not in G:
                    G[m] = {'in': set(), 'out': set()}
                G[n]['out'].add(m)
                G[m]['in'].add(n)
                nodes.append(m)

        # Repeatedly emit a node with no remaining incoming edges.
        L = []
        S = set(input_nodes)
        while len(S) > 0:
            n = S.pop()

            if isinstance(n, Input):
                n.value = feed_dict[n]

            L.append(n)
            for m in n.outbound_nodes:
                G[n]['out'].remove(m)
                G[m]['in'].remove(n)
                if len(G[m]['in']) == 0:
                    S.add(m)
        return L


    def forward_and_backward(graph):
        """
        Performs a forward pass and a backward pass through a list of sorted Nodes.

        Arguments:

            `graph`: The result of calling `topological_sort`.
        """
        for n in graph:
            n.forward()

        for n in graph[::-1]:
            n.backward()


    def sgd_update(trainables, learning_rate=1e-2):
        """
        Updates the value of each trainable with SGD.

        Arguments:

            `trainables`: A list of `Input` Nodes representing weights/biases.
            `learning_rate`: The learning rate.
        """
        for t in trainables:
            t.value = t.value - learning_rate * t.gradients[t]
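Kahn's algorithm can also be seen in isolation on the five-module example from the text. The sketch below is independent of the Node classes; the function name `kahn_sort` and the edge list are assumptions that encode only the dependencies stated in the prose (5 before 2, 2 before 3, 3 and 4 before 1).

```python
from collections import deque


def kahn_sort(edges):
    """Return one topological order of the nodes in `edges`.

    `edges` maps each node to the list of nodes that depend on it.
    """
    # Count incoming edges for every node that appears anywhere.
    in_degree = {}
    for node, targets in edges.items():
        in_degree.setdefault(node, 0)
        for target in targets:
            in_degree[target] = in_degree.get(target, 0) + 1

    # Start with the nodes that depend on nothing.
    ready = deque(sorted(n for n, d in in_degree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        # "Remove" the outgoing edges; newly freed nodes become ready.
        for target in edges.get(node, []):
            in_degree[target] -= 1
            if in_degree[target] == 0:
                ready.append(target)

    if len(order) != len(in_degree):
        raise ValueError("graph contains a cycle")
    return order


edges = {5: [2], 2: [3], 3: [1], 4: [1]}
order = kahn_sort(edges)
print(order)  # [4, 5, 2, 3, 1]
```

A graph usually has several valid topological orders. Both [4, 5, 2, 3, 1] and the 5->2->3->4->1 order mentioned in the text respect every dependency, so either can be used for scheduling.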

Using the model

    import numpy as np
    from sklearn.utils import resample
    np.random.seed(0)

    w1_0 = np.array([[0.1, 0.2, 0.3, 0.4],
                     [0.5, 0.6, 0.7, 0.8],
                     [0.9, 1.0, 1.1, 1.2]])
    w2_0 = np.array([[1.3, 1.4],
                     [1.5, 1.6],
                     [1.7, 1.8],
                     [1.9, 2.0]])
    b1_0 = np.array([-2.0, -6.0, -1.0, -7.0])
    b2_0 = np.array([-2.5, -5.0])

    # One training sample, stored as a batch with a single row.
    X_ = np.array([[1.0, 2.0, 3.0]])
    y_ = np.array([[-0.85, 0.72]])
    n_features = X_.shape[1]

    W1_ = w1_0
    b1_ = b1_0
    W2_ = w2_0
    b2_ = b2_0

    # Build the graph: Linear -> Sigmoid -> Linear -> Tanh -> MSE.
    X, y = Input(), Input()
    W1, b1 = Input(), Input()
    W2, b2 = Input(), Input()

    l1 = Linear(X, W1, b1)
    s1 = Sigmoid(l1)
    l2 = Linear(s1, W2, b2)
    t1 = Tanh(l2)
    cost = MSE(y, t1)

    feed_dict = {
        X: X_, y: y_,
        W1: W1_, b1: b1_,
        W2: W2_, b2: b2_
    }

    epochs = 10
    m = X_.shape[0]
    batch_size = 1
    steps_per_epoch = m // batch_size

    graph = topological_sort(feed_dict)
    trainables = [W1, b1, W2, b2]

    # History of the weights and outputs, used for plotting later.
    l_Mat_W1 = [w1_0]
    l_Mat_W2 = [w2_0]
    l_Mat_out = []

    l_val = []
    for i in range(epochs):
        loss = 0
        for j in range(steps_per_epoch):
            X_batch, y_batch = resample(X_, y_, n_samples=batch_size)
            X.value = X_batch
            y.value = y_batch
            forward_and_backward(graph)
            sgd_update(trainables, 0.1)
            loss += cost.value

        # Record the mean loss, the current weights and the network output.
        # The nodes are read directly instead of by position in `graph`,
        # because the order of independent nodes in the sort can vary.
        l_val.append(loss / steps_per_epoch)
        l_Mat_W1.append(W1.value)
        l_Mat_W2.append(W2.value)
        l_Mat_out.append(t1.value)

Let us take a look at the result. There are of course more advanced ways to visualize this: visualized neural networks.

    import matplotlib.pyplot as plt
    # In a Jupyter notebook, enable inline plots with: %matplotlib inline

    fig = plt.figure(figsize=(14, 10))

    # Left panel: network output per epoch.
    ax0 = fig.add_subplot(131)
    c0 = ax0.imshow(np.array(l_Mat_out).reshape([-1, 2]).T,
                    interpolation='nearest', aspect='auto',
                    cmap="Reds", vmax=1, vmin=-1)
    ax0.set_title("Output")
    cbar = fig.colorbar(c0, ticks=[-1, 0, 1])

    # Middle panel: the 12 entries of w1 per epoch.
    ax1 = fig.add_subplot(132)
    c1 = ax1.imshow(np.array(l_Mat_W1).reshape(len(l_Mat_W1), 12).T,
                    interpolation='nearest', aspect='auto', cmap="Reds")
    ax1.set_title("w1")
    cbar = fig.colorbar(c1, ticks=[np.min(np.array(l_Mat_W1)), np.max(np.array(l_Mat_W1))])

    # Right panel: the 8 entries of w2 per epoch.
    ax2 = fig.add_subplot(133)
    c2 = ax2.imshow(np.array(l_Mat_W2).reshape(len(l_Mat_W2), 8).T,
                    interpolation='nearest', aspect='auto', cmap="Reds")
    ax2.set_title("w2")
    cbar = fig.colorbar(c2, ticks=[np.min(np.array(l_Mat_W2)), np.max(np.array(l_Mat_W2))])

    ax0.set_yticks([0, 1])
    ax0.set_yticklabels(["out0", "out1"])

    ax1.set_xlabel("epochs")
    plt.show()

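We can see that as the number of training epochs grows, the output moves from its initial value of [0.72, -0.88] steadily toward y = [-0.85, 0.72]. The reason is that the model parameters keep changing from their initial values as they are updated, as the w1 and w2 matrices in the figure show.

So the simplest wheel is now built. It implements the Input, Linear, Sigmoid, Tanh and MSE modules. In the following parts we will build on Tensorflow, currently the most popular wheel, and introduce more modules in detail.

Finally, this article only builds the most basic wheel. Our Jizhi (集智) column on Zhihu has a series of articles on hand-writing a deep learning framework in Matlab; you are welcome to have a look.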
Original article: https://www.cnblogs.com/yangshunde/p/7740055.html