今天这篇博文针对Assignment3的全连接网络作业,对前面学习的内容进行一些总结
在前面的作业中我们建立神经网络的操作比较简单,也不具有模块化的特征,在A3作业中,引导我们对前面的比如linear layer,Relu layer,Loss layer以及dropout layer(这个前面课程内容未涉及 但是在cs231n中有出现),以及梯度下降不同方法(SGD,SGD+Momentum,RMSprop,Adam)等等进行模块化的实现
Linear与Relu单层实现
class Linear(object): @staticmethod def forward(x, w, b): """ Computes the forward pass for an linear (fully-connected) layer. The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N examples, where each example x[i] has shape (d_1, ..., d_k). We will reshape each input into a vector of dimension D = d_1 * ... * d_k, and then transform it to an output vector of dimension M. Inputs: - x: A tensor containing input data, of shape (N, d_1, ..., d_k) - w: A tensor of weights, of shape (D, M) - b: A tensor of biases, of shape (M,) Returns a tuple of: - out: output, of shape (N, M) - cache: (x, w, b) """ out = None out = x.view(x.shape[0],-1).mm(w)+b cache = (x, w, b) return out, cache @staticmethod def backward(dout, cache): """ Computes the backward pass for an linear layer. Inputs: - dout: Upstream derivative, of shape (N, M) - cache: Tuple of: - x: Input data, of shape (N, d_1, ... d_k) - w: Weights, of shape (D, M) - b: Biases, of shape (M,) Returns a tuple of: - dx: Gradient with respect to x, of shape (N, d1, ..., d_k) - dw: Gradient with respect to w, of shape (D, M) - db: Gradient with respect to b, of shape (M,) """ x, w, b = cache dx, dw, db = None, None, None db = dout.sum(dim = 0) dx = dout.mm(w.t()).view(x.shape) dw = x.view(x.shape[0],-1).t().mm(dout) return dx, dw, db class ReLU(object): @staticmethod def forward(x): """ Computes the forward pass for a layer of rectified linear units (ReLUs). Input: - x: Input; a tensor of any shape Returns a tuple of: - out: Output, a tensor of the same shape as x - cache: x """ out = None out = x.clone() out[out<0] = 0 cache = x return out, cache @staticmethod def backward(dout, cache): """ Computes the backward pass for a layer of rectified linear units (ReLUs). Input: - dout: Upstream derivatives, of any shape - cache: Input x, of same shape as dout Returns: - dx: Gradient with respect to x """ dx, x = None, cache dx = dout.clone() dx[x<0] = 0 return dx class Linear_ReLU(object): @staticmethod def forward(x, w, b): """ Convenience layer that performs an linear transform followed by a ReLU. Inputs: - x: Input to the linear layer - w, b: Weights for the linear layer Returns a tuple of: - out: Output from the ReLU - cache: Object to give to the backward pass """ a, fc_cache = Linear.forward(x, w, b) out, relu_cache = ReLU.forward(a) cache = (fc_cache, relu_cache) return out, cache @staticmethod def backward(dout, cache): """ Backward pass for the linear-relu convenience layer """ fc_cache, relu_cache = cache da = ReLU.backward(dout, relu_cache) dx, dw, db = Linear.backward(da, fc_cache) return dx, dw, db
从上面的代码我们可以看到,针对linear与relu层,我们可以将前向传播与反向传播分开实现,具体过程在上一篇我的博文中有讨论:https://www.cnblogs.com/dyccyber/p/17764347.html
不同的是我们要对x进行一个reshape,将其转换为N*D的矩阵,才能与矩阵进行点积
在分别实现了linear与relu之后,因为神经网络的架构往往是在linear之后立马加入一个relu层,所以我们可以再建立一个linear-relu class,将这两个层的前向与反向传播合并
LossLayer实现
def svm_loss(x, y): """ Computes the loss and gradient using for multiclass SVM classification. Inputs: - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class for the ith input. - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and 0 <= y[i] < C Returns a tuple of: - loss: Scalar giving the loss - dx: Gradient of the loss with respect to x """ N = x.shape[0] correct_class_scores = x[torch.arange(N), y] margins = (x - correct_class_scores[:, None] + 1.0).clamp(min=0.) margins[torch.arange(N), y] = 0. loss = margins.sum() / N num_pos = (margins > 0).sum(dim=1) dx = torch.zeros_like(x) dx[margins > 0] = 1. dx[torch.arange(N), y] -= num_pos.to(dx.dtype) dx /= N return loss, dx def softmax_loss(x, y): """ Computes the loss and gradient for softmax classification. Inputs: - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class for the ith input. - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and 0 <= y[i] < C Returns a tuple of: - loss: Scalar giving the loss - dx: Gradient of the loss with respect to x """ shifted_logits = x - x.max(dim=1, keepdim=True).values Z = shifted_logits.exp().sum(dim=1, keepdim=True) log_probs = shifted_logits - Z.log() probs = log_probs.exp() N = x.shape[0] loss = (-1.0/ N) * log_probs[torch.arange(N), y].sum() dx = probs.clone() dx[torch.arange(N), y] -= 1 dx /= N return loss, dx
上面损失函数层我们在之前已经实现过,具体实现需要用到一些矩阵微分的知识,具体可以参考这两篇博文:
http://giantpandacv.com/academic/算法科普/深度学习基础/SVM Loss以及梯度推导/
https://blog.csdn.net/qq_27261889/article/details/82915598
多层神经网络
关于多层神经网络,首先是类的初始化定义,我们可以看神经网络的结构{linear - relu - [dropout]} x (L - 1) - linear - softmax,有L-1个linear层与relu层与dropout层的组合,最后再以linear-softmax的结构结束输出结果,初始化我们要遍历每个隐藏层,初始化权重矩阵与偏置项,最后再去初始化最后一个linear层,要注意矩阵的维度
class FullyConnectedNet(object): """ A fully-connected neural network with an arbitrary number of hidden layers, ReLU nonlinearities, and a softmax loss function. For a network with L layers, the architecture will be: {linear - relu - [dropout]} x (L - 1) - linear - softmax where dropout is optional, and the {...} block is repeated L - 1 times. Similar to the TwoLayerNet above, learnable parameters are stored in the self.params dictionary and will be learned using the Solver class. """ def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10, dropout=0.0, reg=0.0, weight_scale=1e-2, seed=None, dtype=torch.float, device='cpu'): """ Initialize a new FullyConnectedNet. Inputs: - hidden_dims: A list of integers giving the size of each hidden layer. - input_dim: An integer giving the size of the input. - num_classes: An integer giving the number of classes to classify. - dropout: Scalar between 0 and 1 giving the drop probability for networks with dropout. If dropout=0 then the network should not use dropout. - reg: Scalar giving L2 regularization strength. - weight_scale: Scalar giving the standard deviation for random initialization of the weights. - seed: If not None, then pass this random seed to the dropout layers. This will make the dropout layers deteriminstic so we can gradient check the model. - dtype: A torch data type object; all computations will be performed using this datatype. float is faster but less accurate, so you should use double for numeric gradient checking. - device: device to use for computation. 'cpu' or 'cuda' """ self.use_dropout = dropout != 0 self.reg = reg self.num_layers = 1 + len(hidden_dims) self.dtype = dtype self.params = {} ############################################################################ # TODO: Initialize the parameters of the network, storing all values in # # the self.params dictionary. Store weights and biases for the first layer # # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be # # initialized from a normal distribution centered at 0 with standard # # deviation equal to weight_scale. Biases should be initialized to zero. # ############################################################################ # Replace "pass" statement with your code last_dim = input_dim for n ,hidden_dim in enumerate(hidden_dims): i = n+1 self.params['W{}'.format(i)] = torch.zeros(last_dim, hidden_dim, dtype=dtype,device = device) self.params['W{}'.format(i)] += weight_scale*torch.randn(last_dim, hidden_dim, dtype=dtype,device= device) self.params['b{}'.format(i)] = torch.zeros(hidden_dim, dtype=dtype,device= device) last_dim = hidden_dim i+=1 self.params['W{}'.format(i)] = torch.zeros(last_dim, num_classes, dtype=dtype,device = device) self.params['W{}'.format(i)] += weight_scale*torch.randn(last_dim, num_classes, dtype=dtype,device= device) self.params['b{}'.format(i)] = torch.zeros(num_classes, dtype=dtype,device= device) # When using dropout we need to pass a dropout_param dictionary to each # dropout layer so that the layer knows the dropout probability and the mode # (train / test). You can pass the same dropout_param to each dropout layer. self.dropout_param = {} if self.use_dropout: self.dropout_param = {'mode': 'train', 'p': dropout} if seed is not None: self.dropout_param['seed'] = seed
其次,我们可以定义save与load函数,对模型参数等等进行存储与加载:
def save(self, path): checkpoint = { 'reg': self.reg, 'dtype': self.dtype, 'params': self.params, 'num_layers': self.num_layers, 'use_dropout': self.use_dropout, 'dropout_param': self.dropout_param, } torch.save(checkpoint, path) print("Saved in {}".format(path)) def load(self, path, dtype, device): checkpoint = torch.load(path, map_location='cpu') self.params = checkpoint['params'] self.dtype = dtype self.reg = checkpoint['reg'] self.num_layers = checkpoint['num_layers'] self.use_dropout = checkpoint['use_dropout'] self.dropout_param = checkpoint['dropout_param'] for p in self.params: self.params[p] = self.params[p].type(dtype).to(device) print("load checkpoint file: {}".format(path))
最后是前向传播与反向传播的实现,这里直接使用前面基础的linear与relu的前向与反向传播即可,注意一下神经网络的结构,不要把顺序搞错即可
def loss(self, X, y=None): """ Compute loss and gradient for the fully-connected net. Input / output: Same as TwoLayerNet above. """ X = X.to(self.dtype) mode = 'test' if y is None else 'train' # Set train/test mode for batchnorm params and dropout param since they # behave differently during training and testing. if self.use_dropout: self.dropout_param['mode'] = mode scores = None ############################################################################ # TODO: Implement the forward pass for the fully-connected net, computing # # the class scores for X and storing them in the scores variable. # # # # When using dropout, you'll need to pass self.dropout_param to each # # dropout forward pass. # ############################################################################ # Replace "pass" statement with your code cache_dict = {} last_out = X for n in range(self.num_layers-1): i=n+1 last_out, cache_dict['cache_LR{}'.format(i)] = Linear_ReLU.forward(last_out,self.params['W{}'.format(i)],self.params['b{}'.format(i)]) if self.use_dropout: last_out, cache_dict['cache_Dropout{}'.format(i)] = Dropout.forward(last_out,self.dropout_param) i+=1 last_out, cache_dict['cache_L{}'.format(i)] = Linear.forward(last_out,self.params['W{}'.format(i)],self.params['b{}'.format(i)]) scores = last_out # If test mode return early if mode == 'test': return scores loss, grads = 0.0, {} ############################################################################ # TODO: Implement the backward pass for the fully-connected net. Store the # # loss in the loss variable and gradients in the grads dictionary. Compute # # data loss using softmax, and make sure that grads[k] holds the gradients # # for self.params[k]. Don't forget to add L2 regularization! # # NOTE: To ensure that your implementation matches ours and you pass the # # automated tests, make sure that your L2 regularization includes a factor # # of 0.5 to simplify the expression for the gradient. # ############################################################################ # Replace "pass" statement with your code loss, dout = softmax_loss(scores, y) loss += (self.params['W{}'.format(i)]*self.params['W{}'.format(i)]).sum()*self.reg last_dout, dw, db = Linear.backward(dout, cache_dict['cache_L{}'.format(i)]) grads['W{}'.format(i)] = dw + 2*self.params['W{}'.format(i)]*self.reg grads['b{}'.format(i)] = db for n in range(self.num_layers-1)[::-1]: i = n +1 if self.use_dropout: last_dout = Dropout.backward(last_dout, cache_dict['cache_Dropout{}'.format(i)]) last_dout, dw, db = Linear_ReLU.backward(last_dout, cache_dict['cache_LR{}'.format(i)]) grads['W{}'.format(i)] = dw + 2*self.params['W{}'.format(i)]*self.reg grads['b{}'.format(i)] = db loss += (self.params['W{}'.format(i)]*self.params['W{}'.format(i)]).sum()*self.reg return loss, grads
不同梯度下降方法
SGD,SGD+Momentum,RMSprop,Adam(Momentum+RMSprop+bias)的实现
具体原理介绍可参考之前的一篇博文:https://www.cnblogs.com/dyccyber/p/17759697.html
这里特别提及一下在Adam中我们加入了偏置项,是为了防止在初期进行梯度下降的过程中,下降的过快
def sgd(w, dw, config=None): """ Performs vanilla stochastic gradient descent. config format: - learning_rate: Scalar learning rate. """ if config is None: config = {} config.setdefault('learning_rate', 1e-2) w -= config['learning_rate'] * dw return w, config def sgd_momentum(w, dw, config=None): """ Performs stochastic gradient descent with momentum. config format: - learning_rate: Scalar learning rate. - momentum: Scalar between 0 and 1 giving the momentum value. Setting momentum = 0 reduces to sgd. - velocity: A numpy array of the same shape as w and dw used to store a moving average of the gradients. """ if config is None: config = {} config.setdefault('learning_rate', 1e-2) config.setdefault('momentum', 0.9) v = config.get('velocity', torch.zeros_like(w)) next_w = None ############################################################################# # TODO: Implement the momentum update formula. Store the updated value in # # the next_w variable. You should also use and update the velocity v. # ############################################################################# # Replace "pass" statement with your code v = config['momentum']*v - config['learning_rate'] * dw next_w = w + v ############################################################################# # END OF YOUR CODE # ############################################################################# config['velocity'] = v return next_w, config def rmsprop(w, dw, config=None): """ Uses the RMSProp update rule, which uses a moving average of squared gradient values to set adaptive per-parameter learning rates. config format: - learning_rate: Scalar learning rate. - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared gradient cache. - epsilon: Small scalar used for smoothing to avoid dividing by zero. - cache: Moving average of second moments of gradients. """ if config is None: config = {} config.setdefault('learning_rate', 1e-2) config.setdefault('decay_rate', 0.99) config.setdefault('epsilon', 1e-8) config.setdefault('cache', torch.zeros_like(w)) next_w = None ########################################################################### # TODO: Implement the RMSprop update formula, storing the next value of w # # in the next_w variable. Don't forget to update cache value stored in # # config['cache']. # ########################################################################### # Replace "pass" statement with your code config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw**2 w += -config['learning_rate'] * dw / (torch.sqrt(config['cache']) + config['epsilon']) next_w = w ########################################################################### # END OF YOUR CODE # ########################################################################### return next_w, config def adam(w, dw, config=None): """ Uses the Adam update rule, which incorporates moving averages of both the gradient and its square and a bias correction term. config format: - learning_rate: Scalar learning rate. - beta1: Decay rate for moving average of first moment of gradient. - beta2: Decay rate for moving average of second moment of gradient. - epsilon: Small scalar used for smoothing to avoid dividing by zero. - m: Moving average of gradient. - v: Moving average of squared gradient. - t: Iteration number. """ if config is None: config = {} config.setdefault('learning_rate', 1e-3) config.setdefault('beta1', 0.9) config.setdefault('beta2', 0.999) config.setdefault('epsilon', 1e-8) config.setdefault('m', torch.zeros_like(w)) config.setdefault('v', torch.zeros_like(w)) config.setdefault('t', 0) next_w = None ############################################################################# # TODO: Implement the Adam update formula, storing the next value of w in # # the next_w variable. Don't forget to update the m, v, and t variables # # stored in config. # # # # NOTE: In order to match the reference output, please modify t _before_ # # using it in any calculations. # ############################################################################# # Replace "pass" statement with your code config['t'] += 1 config['m'] = config['beta1']*config['m'] + (1-config['beta1'])*dw mt = config['m'] / (1-config['beta1']**config['t']) config['v'] = config['beta2']*config['v'] + (1-config['beta2'])*(dw*dw) vc = config['v'] / (1-(config['beta2']**config['t'])) w = w - (config['learning_rate'] * mt)/ (torch.sqrt(vc) + config['epsilon']) next_w = w ############################################################################# # END OF YOUR CODE # ############################################################################# return next_w, config
Dropout层
注意在前面多层全连接网络的实现中,dropout只有在我们进行train的时候才使用,在test的时候是不使用的
dropout层是一个非常高效与简单的正则化方法,具体来说,在训练时,dropout 是通过仅以一定概率 p 保持神经元活跃来实现的,如果我们设置的随机数小于p就将其设置为零,如下图所示:

用另一种视角去看,dropout实际上是一种对全神经网络进行抽样的方法,可以减少不同神经元之间复杂的关系
具体论文原文见:https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
代码实现:
class Dropout(object): @staticmethod def forward(x, dropout_param): """ Performs the forward pass for (inverted) dropout. Inputs: - x: Input data: tensor of any shape - dropout_param: A dictionary with the following keys: - p: Dropout parameter. We *drop* each neuron output with probability p. - mode: 'test' or 'train'. If the mode is train, then perform dropout; if the mode is test, then just return the input. - seed: Seed for the random number generator. Passing seed makes this function deterministic, which is needed for gradient checking but not in real networks. Outputs: - out: Tensor of the same shape as x. - cache: tuple (dropout_param, mask). In training mode, mask is the dropout mask that was used to multiply the input; in test mode, mask is None. NOTE: Please implement **inverted** dropout, not the vanilla version of dropout. See http://cs231n.github.io/neural-networks-2/#reg for more details. NOTE 2: Keep in mind that p is the probability of **dropping** a neuron output; this might be contrary to some sources, where it is referred to as the probability of keeping a neuron output. """ p, mode = dropout_param['p'], dropout_param['mode'] if 'seed' in dropout_param: torch.manual_seed(dropout_param['seed']) mask = None out = None if mode == 'train': ########################################################################### # TODO: Implement training phase forward pass for inverted dropout. # # Store the dropout mask in the mask variable. # ########################################################################### # Replace "pass" statement with your code mask = torch.rand(x.shape) > p out = x.clone() out[mask] = 0 ########################################################################### # END OF YOUR CODE # ########################################################################### elif mode == 'test': ########################################################################### # TODO: Implement the test phase forward pass for inverted dropout. # ########################################################################### # Replace "pass" statement with your code out = x cache = (dropout_param, mask) return out, cache @staticmethod def backward(dout, cache): """ Perform the backward pass for (inverted) dropout. Inputs: - dout: Upstream derivatives, of any shape - cache: (dropout_param, mask) from Dropout.forward. """ dropout_param, mask = cache mode = dropout_param['mode'] dx = None if mode == 'train': ########################################################################### # TODO: Implement training phase backward pass for inverted dropout # ########################################################################### # Replace "pass" statement with your code dx = dout dx[mask] = 0 elif mode == 'test': dx = dout return dx