1. Gradient
- A derivative is taken with respect to a single variable and yields a scalar.
- A partial derivative is taken with respect to one variable of a multivariate function, treating the other variables as constants.
- The gradient collects the partial derivatives with respect to all variables into a vector, so the gradient is a vector with both direction and magnitude.
In the upper-left figure, the length of each arrow indicates steepness: the steeper the surface, the longer the arrow. The arrows point in the direction in which y increases, so gradient descent has to move in the opposite (negative) direction.
In the right figure, blue marks low regions and red marks high regions. The arrows in the middle point from blue toward red, and they are longest where the surface is steepest.
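To make the definition concrete, here is a minimal sketch (my own example, not from the original figures) that computes a gradient with torch.autograd; the function f(x, y) = x² + y² is an arbitrary illustrative choice:

```python
import torch

# Illustrative function: f(x, y) = x^2 + y^2 (chosen only for this example)
x = torch.tensor(1., requires_grad=True)
y = torch.tensor(2., requires_grad=True)

f = x ** 2 + y ** 2
f.backward()

# The gradient is the vector of partial derivatives (df/dx, df/dy) = (2x, 2y)
print(x.grad, y.grad)  # tensor(2.) tensor(4.)
```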
2. Gradient Descent
The figure above shows gradient descent applied to θ1 and θ2. α is the learning rate, a multiplicative factor applied each time the parameters are updated along the negative gradient direction; it controls how fast the parameters change.
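As a minimal sketch of the update rule θ ← θ − α·∇loss (my own example, with an assumed toy quadratic loss):

```python
import torch

# Assumed toy loss: loss(theta1, theta2) = theta1^2 + theta2^2
theta = torch.tensor([3., -4.], requires_grad=True)
alpha = 0.1  # learning rate

for step in range(100):
    loss = (theta ** 2).sum()
    loss.backward()
    with torch.no_grad():
        # Step in the negative gradient direction, scaled by the learning rate
        theta -= alpha * theta.grad
    theta.grad.zero_()

print(theta)  # both parameters approach 0, the minimum of this toy loss
```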
3. Convex Functions and Non-convex Functions
The figure above shows a convex function: for any two points z1 and z2, ( f(z1) + f(z2) ) / 2 ≥ f( (z1 + z2) / 2 ).
The figure above shows a non-convex function, which contains several local minima and local maxima as well as one global maximum and one global minimum.
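A quick numerical check of the midpoint inequality, using f(z) = z² as an assumed convex example (my own sketch):

```python
import torch

# f(z) = z^2 is convex, so (f(z1) + f(z2)) / 2 >= f((z1 + z2) / 2) for any z1, z2
f = lambda z: z ** 2
z1, z2 = torch.tensor(-1.), torch.tensor(3.)

lhs = (f(z1) + f(z2)) / 2  # 5.0
rhs = f((z1 + z2) / 2)     # 1.0
print(lhs >= rhs)          # tensor(True)
```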
4*. Why ResNet Works Well (Intuition)
ResNet's residual blocks effectively smooth the loss surface, which makes it easier for a deep network to find the global minimum (or a reasonably good local minimum).
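A minimal sketch of the residual idea (not code from the original post; the layer sizes are arbitrary): the block outputs F(x) + x, and the identity shortcut keeps information and gradients flowing even when F itself is hard to optimize.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = relu(F(x) + x), where x is the skip connection."""
    def __init__(self, dim=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The identity shortcut is what makes the loss surface easier to navigate
        return torch.relu(self.f(x) + x)

x = torch.randn(8, 64)
print(ResidualBlock()(x).shape)  # torch.Size([8, 64])
```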
5. Saddle Points
A saddle point is a point where the gradient is zero but which is neither a local minimum nor a local maximum: the function curves upward along some directions and downward along others.
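As an illustration (my own example, not the original figure), f(x, y) = x² − y² has a saddle point at the origin: both partial derivatives vanish there, yet the point is a minimum along x and a maximum along y.

```python
import torch

x = torch.tensor(0., requires_grad=True)
y = torch.tensor(0., requires_grad=True)

f = x ** 2 - y ** 2
f.backward()

# The gradient is zero at (0, 0), but (0, 0) is neither a minimum nor a maximum
print(x.grad, y.grad)  # tensor(0.) tensor(0.)
```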
6. Activation Functions and Their Gradients
The sigmoid activation function:
$$f(x)=\sigma(x)=\frac{1}{1+e^{-x}}$$
Derivative:
$$\begin{aligned} \frac{d}{dx} \sigma(x) &=\frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right) \\ &=\frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} \\ &=\frac{\left(1+e^{-x}\right)-1}{\left(1+e^{-x}\right)^{2}} \\ &=\frac{1+e^{-x}}{\left(1+e^{-x}\right)^{2}}-\left(\frac{1}{1+e^{-x}}\right)^{2} \\ &=\sigma(x)-\sigma(x)^{2} \\ \sigma^{\prime} &=\sigma(1-\sigma) \end{aligned}$$
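A small sanity check (my own sketch) comparing the analytic derivative σ(1 − σ) with what autograd computes:

```python
import torch

x = torch.linspace(-3, 3, 7, requires_grad=True)
s = torch.sigmoid(x)

# Autograd derivative of sigmoid, element-wise
s.sum().backward()

# Analytic derivative: sigma * (1 - sigma)
analytic = (s * (1 - s)).detach()
print(torch.allclose(x.grad, analytic))  # True
```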
The tanh activation function:
$$\begin{aligned} f(x) &=\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \\ &=2\operatorname{sigmoid}(2x)-1 \end{aligned}$$
Derivative:
$$\begin{aligned} \frac{d}{dx}\tanh(x) &=\frac{\left(e^{x}+e^{-x}\right)\left(e^{x}+e^{-x}\right)-\left(e^{x}-e^{-x}\right)\left(e^{x}-e^{-x}\right)}{\left(e^{x}+e^{-x}\right)^{2}} \\ &=1-\frac{\left(e^{x}-e^{-x}\right)^{2}}{\left(e^{x}+e^{-x}\right)^{2}}=1-\tanh^{2}(x) \end{aligned}$$
```python
import torch

z = torch.linspace(-1, 1, 10)
a = torch.tanh(z)
print(a)
# tensor([-0.7616, -0.6514, -0.5047, -0.3215, -0.1107,  0.1107,  0.3215,  0.5047,
#          0.6514,  0.7616])
```
The ReLU activation function (Rectified Linear Unit):
$$f(x)=\left\{\begin{array}{ll} 0 & \text{for } x<0 \\ x & \text{for } x \geq 0 \end{array}\right.$$
Derivative:
$$f^{\prime}(x)=\left\{\begin{array}{ll} 0 & \text{for } x<0 \\ 1 & \text{for } x \geq 0 \end{array}\right.$$
```python
import torch

z = torch.linspace(-1, 1, 10)
a = torch.relu(z)
print(a)
# tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1111, 0.3333, 0.5556, 0.7778,
#         1.0000])
```
7. Loss Functions and Their Gradients
Mean Squared Error (MSE):
$$\text{loss}=\sum\left[y-f_{\theta}(x)\right]^{2}$$
Derivative:
$$\frac{\nabla\,\text{loss}}{\nabla \theta}=-2\sum\left[y-f_{\theta}(x)\right]\cdot\frac{\nabla f_{\theta}(x)}{\nabla \theta}$$
```python
# -*- coding:utf-8 -*-
__author__ = 'Leo.Z'

import torch
import torch.nn.functional as F

x = torch.ones(1)
w = torch.tensor([2.], requires_grad=True)
# w = torch.full([1], 2.)
# If w was created without requires_grad=True, enable it afterwards with:
# w.requires_grad_()

mse = F.mse_loss(torch.ones(1), x * w)

# Method 1: if there are N parameters to differentiate, pass the list [w1, w2, ..., wn, b];
# torch.autograd.grad returns [dw1, dw2, ..., dwn, db]
# dw = torch.autograd.grad(mse, [w])
# print(dw)

# Method 2: backward() writes each parameter's gradient into w1.grad, w2.grad, ..., wn.grad, b.grad
mse.backward()
print(w.grad)
```
Softmax loss:
```python
# -*- coding:utf-8 -*-
__author__ = 'Leo.Z'

import torch
import torch.nn.functional as F

# Output z of layer L
z = torch.tensor([3., 2., 1.], requires_grad=True)
print(z)

# Pass z through the softmax activation
y_hat = F.softmax(z, dim=0)
print(y_hat)

# Suppose the label is y
y = torch.tensor([1., 0., 0.])

# The loss value is the cross-entropy
loss = -torch.sum(y * torch.log(y_hat))
print(loss)

# Method 1
loss.backward()
print(z.grad)

# Method 2
# dz = torch.autograd.grad(loss, [z], retain_graph=True)
# print(dz)

# tensor([-0.3348,  0.2447,  0.0900])
```
```python
import torch
import torch.nn.functional as F

z = torch.linspace(-100, 100, 10)
# Newer API
a = torch.softmax(z, dim=0)
# or: a = F.softmax(z, dim=0)
print(a)
# Output:
# tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 2.4891e-39,
#         1.1144e-29, 4.9890e-20, 2.2336e-10, 1.0000e+00])
```
8. A First Try at Optimizing Parameters with Gradient Descent
```python
# -*- coding:utf-8 -*-
__author__ = 'Leo.Z'

import torch


# This function has 4 local minima, all with the same value 0
def func(x, y):
    return (x ** 2 + y - 11) ** 2 + (x + y ** 2 - 7) ** 2


# Different initializations lead gradient descent to different local minima:
# starting from x=4, y=4 it finds the local minimum x=3.0, y=2.0;
# starting from x=0, y=4 it finds the local minimum x=-2.8051, y=3.1313
x = torch.tensor(0., requires_grad=True)
y = torch.tensor(4., requires_grad=True)

# Use the Adam optimizer here
optimizer = torch.optim.Adam([x, y], lr=1e-3)

for step in range(20000):
    optimizer.zero_grad()
    pred = func(x, y)
    pred.backward()
    optimizer.step()

    if step % 2000 == 0:
        print("step:", x, y)
```
In the code above, the objective being optimized is the function z = func(x, y), and the optimized variables are x and y.
In deep learning, the objective function is the loss, and the optimized variables are w and b (plus γ and β if batch normalization or similar layers are used).
9. Multi-class Classification with cross_entropy
```python
# -*- coding:utf-8 -*-
__author__ = 'Leo.Z'

import torch
import torch.nn.functional as F

x = torch.randn(5, 784)
w = torch.randn(10, 784, requires_grad=True)

# Output z of layer L
z = x @ w.t()
z.requires_grad_()
print(z.size())

# Note: the labels are NOT one-hot encoded
y = torch.tensor([1, 4, 6, 2, 4])

# Adam optimizer
optimizer = torch.optim.Adam([w], lr=1e-3)

for step in range(100000):
    # Recompute z after every update of w
    z = x @ w.t()
    optimizer.zero_grad()
    # F.cross_entropy takes z with shape [B, C] (B = batch_size, C = classes)
    # and the labels y, given as class indices rather than one-hot vectors
    loss = F.cross_entropy(z, y)
    # Backpropagation
    loss.backward(retain_graph=True)
    # One optimization step
    optimizer.step()
    if step % 2000 == 0:
        print("step:", loss)

# Check whether the predictions match the labels
print(torch.softmax(z, dim=1))
```
Alternatively, cross_entropy can be replaced by its decomposed steps:
```python
# -*- coding:utf-8 -*-
__author__ = 'Leo.Z'

import torch
import torch.nn.functional as F

x = torch.randn(5, 784)
w = torch.randn(10, 784, requires_grad=True)

# Output z of layer L
z = x @ w.t()
z.requires_grad_()
print(z.size())

# Note: the labels are NOT one-hot encoded
y = torch.tensor([1, 4, 6, 2, 4])

# Adam optimizer
optimizer = torch.optim.Adam([w], lr=1e-3)

for step in range(100000):
    # Recompute z after every update of w
    z = x @ w.t()
    optimizer.zero_grad()
    # First compute the log of the softmax, i.e. log(y_hat) in -y*log(y_hat)
    pred = torch.log_softmax(z, dim=1)
    # Then compute -y*log(y_hat), sum over the classes, and average over the batch
    loss = F.nll_loss(pred, y)
    loss.backward(retain_graph=True)
    # One optimization step
    optimizer.step()
    if step % 2000 == 0:
        print("step:", loss)

# Check whether the predictions match the labels
print(torch.softmax(z, dim=1))
```
10. MNIST Classification
Low-level API version:
```python
# -*- coding:utf-8 -*-
__author__ = 'Leo.Z'

import torch
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import time

batch_size = 200
learning_rate = 0.001
epochs = 10

train_loader = DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True)

test_loader = DataLoader(
    datasets.MNIST('../data', train=False,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True)

w1, b1 = torch.randn(200, 784, requires_grad=True), torch.zeros(200, requires_grad=True)
w2, b2 = torch.randn(200, 200, requires_grad=True), torch.zeros(200, requires_grad=True)
w3, b3 = torch.randn(10, 200, requires_grad=True), torch.zeros(10, requires_grad=True)

# Use Kaiming He's initialization -- initialization matters a lot
torch.nn.init.kaiming_normal_(w1)
torch.nn.init.kaiming_normal_(w2)
torch.nn.init.kaiming_normal_(w3)


# Network structure
def forward(x):
    x = x @ w1.t() + b1
    x = F.relu(x)
    x = x @ w2.t() + b2
    x = F.relu(x)
    x = x @ w3.t() + b3
    # Since softmax is applied later, this relu is optional,
    # but it must not be sigmoid or tanh
    x = F.relu(x)
    return x


# Choose an optimizer and give it the parameters to optimize and the learning rate
optimizer = torch.optim.SGD([w1, b1, w2, b2, w3, b3], lr=learning_rate)

s_time = 0
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Reshape data to [200, 784]; target has shape [200]
        data = data.view(-1, 28 * 28)
        # Run the network once to get z with shape [200, 10] (batch_size x classes)
        z = forward(data)
        # Compute the loss with cross_entropy from z and the labels (target)
        loss = F.cross_entropy(z, target)
        # Zero the gradients before each iteration
        optimizer.zero_grad()
        # Backpropagation: compute the gradients
        loss.backward()
        # Equivalent to w = w - lr * dw, i.e. update the weights
        optimizer.step()

        if batch_idx % 100 == 0:
            e_time = time.time()
            if s_time != 0:
                print("time:", e_time - s_time)
            s_time = time.time()
            print(loss)
```
High-level API version:
```python
# -*- coding:utf-8 -*-
__author__ = 'Leo.Z'

import torch
import torch.nn.functional as F
from torch.nn import Module, Sequential, Linear, LeakyReLU
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import time

batch_size = 200
learning_rate = 0.001
epochs = 10

train_loader = DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True)

test_loader = DataLoader(
    datasets.MNIST('../data', train=False,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True)


# Network structure
class MLP(Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.model = Sequential(
            Linear(784, 200),
            LeakyReLU(inplace=True),
            Linear(200, 200),
            LeakyReLU(inplace=True),
            Linear(200, 10),
            LeakyReLU(inplace=True)
        )

    def forward(self, x):
        x = self.model(x)
        return x


# Define the GPU device
device = torch.device('cuda:0')
# Move the model to the GPU
net = MLP().to(device)
# Choose an optimizer and give it the parameters to optimize and the learning rate
optimizer = torch.optim.SGD(net.parameters(), lr=learning_rate)

for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Reshape data to [200, 784]; target has shape [200]
        data = data.view(-1, 28 * 28)
        # Move data and target to the GPU
        data, target = data.to(device), target.to(device)
        # data is the input; calling net(data) invokes Module.__call__, which runs forward()
        # One pass through the network gives z with shape [200, 10];
        # since both net and data are on the GPU, z is also on the GPU
        z = net(data)
        # The loss is computed on the GPU as well
        loss = F.cross_entropy(z, target).to(device)
        # Zero the gradients before each iteration
        optimizer.zero_grad()
        # Backpropagation: compute the gradients
        loss.backward()
        # Equivalent to w = w - lr * dw, i.e. update the weights
        optimizer.step()

        if batch_idx % 100 == 0:
            print(loss)
```
11. Using the GPU
In the code above:
```python
# Moving a model to the GPU updates net itself -- it is effectively a move (cut-and-paste)
net = MLP()
net2 = net.to(device)
print(net is net2)  # prints True

# Moving a tensor to the GPU creates a copy on the GPU -- the two tensors are different objects
data2, target2 = data.to(device), target.to(device)
print(data is data2)  # prints False
```