umicv cv-summary1: Modular Implementation of Fully-Connected Neural Networks

Tech sharing, 2023-10-22

Modular Implementation of Fully-Connected Neural Networks

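This post uses the fully-connected network assignment from Assignment 3 (A3) to summarize what we have covered so far.

In earlier assignments the networks we built were fairly simple and not modular. A3 guides us through modular implementations of the pieces seen so far: the linear layer, ReLU layer, loss layer, and dropout layer (not covered earlier in this course, but it appears in cs231n), as well as several gradient descent variants (SGD, SGD+Momentum, RMSprop, Adam).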

Single-layer implementation of Linear and ReLU

import torch


class Linear(object):

  @staticmethod
  def forward(x, w, b):
    """
    Computes the forward pass for a linear (fully-connected) layer.
    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.
    Inputs:
    - x: A tensor containing input data, of shape (N, d_1, ..., d_k)
    - w: A tensor of weights, of shape (D, M)
    - b: A tensor of biases, of shape (M,)
    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    # Flatten each example to a row of length D, then apply x @ w + b.
    out = x.view(x.shape[0], -1).mm(w) + b
    cache = (x, w, b)
    return out, cache

  @staticmethod
  def backward(dout, cache):
    """
    Computes the backward pass for a linear layer.
    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, of shape (M,)
    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)
    """
    x, w, b = cache
    db = dout.sum(dim=0)
    # Reshape dx back to the original input shape.
    dx = dout.mm(w.t()).view(x.shape)
    dw = x.view(x.shape[0], -1).t().mm(dout)
    return dx, dw, db


class ReLU(object):

  @staticmethod
  def forward(x):
    """
    Computes the forward pass for a layer of rectified linear units (ReLUs).
    Input:
    - x: Input; a tensor of any shape
    Returns a tuple of:
    - out: Output, a tensor of the same shape as x
    - cache: x
    """
    # Clone first so the caller's tensor is not modified in place.
    out = x.clone()
    out[out < 0] = 0
    cache = x
    return out, cache

  @staticmethod
  def backward(dout, cache):
    """
    Computes the backward pass for a layer of rectified linear units (ReLUs).
    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout
    Returns:
    - dx: Gradient with respect to x
    """
    x = cache
    # Block the gradient wherever the input was negative.
    dx = dout.clone()
    dx[x < 0] = 0
    return dx


class Linear_ReLU(object):

  @staticmethod
  def forward(x, w, b):
    """
    Convenience layer that performs a linear transform followed by a ReLU.

    Inputs:
    - x: Input to the linear layer
    - w, b: Weights for the linear layer
    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, fc_cache = Linear.forward(x, w, b)
    out, relu_cache = ReLU.forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache

  @staticmethod
  def backward(dout, cache):
    """
    Backward pass for the linear-relu convenience layer
    """
    fc_cache, relu_cache = cache
    da = ReLU.backward(dout, relu_cache)
    dx, dw, db = Linear.backward(da, fc_cache)
    return dx, dw, db

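As the code above shows, the forward and backward passes of the linear and ReLU layers can be implemented separately. The details are discussed in my previous post: https://www.cnblogs.com/dyccyber/p/17764347.html
The difference here is that x must first be reshaped into an N×D matrix before it can be multiplied by the weight matrix.
With linear and ReLU implemented separately, and since architectures usually place a ReLU immediately after a linear layer, we can also build a Linear_ReLU class that chains the forward and backward passes of the two layers.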
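One way to convince yourself that the backward rules are right is a numeric gradient check. The sketch below is a dependency-free illustration on nested lists rather than torch tensors; the helper names (matmul, transpose, linear_relu_forward, linear_relu_backward, total) are made up for this example and are not part of the assignment code. It assumes x has already been flattened to N×D.

```python
# Pure-Python sketch of linear + ReLU forward/backward (hypothetical helpers).

def matmul(a, b):
  return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
           for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
  return [list(col) for col in zip(*a)]

def linear_relu_forward(x, w, b):
  # x: N x D (already flattened), w: D x M, b: length M
  a = [[v + b[j] for j, v in enumerate(row)] for row in matmul(x, w)]
  out = [[max(v, 0.0) for v in row] for row in a]
  return out, (x, w, a)

def linear_relu_backward(dout, cache):
  x, w, a = cache
  # ReLU: zero the gradient where the pre-activation was negative.
  da = [[d if v >= 0 else 0.0 for d, v in zip(drow, arow)]
        for drow, arow in zip(dout, a)]
  dx = matmul(da, transpose(w))        # N x D
  dw = matmul(transpose(x), da)        # D x M
  db = [sum(col) for col in zip(*da)]  # length M
  return dx, dw, db

def total(x, w, b):
  # Scalar test function: sum of all outputs, so dout is all ones.
  out, _ = linear_relu_forward(x, w, b)
  return sum(sum(row) for row in out)

x = [[1.0, -2.0, 0.5], [0.3, 0.8, -1.0]]
w = [[0.2, -0.4], [0.1, 0.5], [-0.3, 0.7]]
b = [0.05, -0.1]

out, cache = linear_relu_forward(x, w, b)
dout = [[1.0] * len(b) for _ in x]
dx, dw, db = linear_relu_backward(dout, cache)

# Central-difference estimate of d(total) / d(w[0][0]).
h = 1e-6
w_plus = [row[:] for row in w]
w_plus[0][0] += h
w_minus = [row[:] for row in w]
w_minus[0][0] -= h
numeric = (total(x, w_plus, b) - total(x, w_minus, b)) / (2 * h)
print(abs(numeric - dw[0][0]) < 1e-6)
```

The same idea carries over to the torch version: perturb one entry of w, recompute the output, and compare the finite difference with the analytic dw.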

Loss layer implementation

def svm_loss(x, y):
  """
  Computes the loss and gradient for multiclass SVM classification.
  Inputs:
  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
    class for the ith input.
  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
    0 <= y[i] < C
  Returns a tuple of:
  - loss: Scalar giving the loss
  - dx: Gradient of the loss with respect to x
  """
  N = x.shape[0]
  correct_class_scores = x[torch.arange(N), y]
  # Hinge margins with delta = 1; the correct class contributes no margin.
  margins = (x - correct_class_scores[:, None] + 1.0).clamp(min=0.)
  margins[torch.arange(N), y] = 0.
  loss = margins.sum() / N
  # Each violated margin adds +1 for that class and -1 for the correct class.
  num_pos = (margins > 0).sum(dim=1)
  dx = torch.zeros_like(x)
  dx[margins > 0] = 1.
  dx[torch.arange(N), y] -= num_pos.to(dx.dtype)
  dx /= N
  return loss, dx


def softmax_loss(x, y):
  """
  Computes the loss and gradient for softmax classification.
  Inputs:
  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
    class for the ith input.
  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
    0 <= y[i] < C
  Returns a tuple of:
  - loss: Scalar giving the loss
  - dx: Gradient of the loss with respect to x
  """
  # Subtract the row-wise max for numerical stability.
  shifted_logits = x - x.max(dim=1, keepdim=True).values
  Z = shifted_logits.exp().sum(dim=1, keepdim=True)
  log_probs = shifted_logits - Z.log()
  probs = log_probs.exp()
  N = x.shape[0]
  loss = (-1.0 / N) * log_probs[torch.arange(N), y].sum()
  # Gradient: softmax probabilities minus the one-hot labels, averaged over N.
  dx = probs.clone()
  dx[torch.arange(N), y] -= 1
  dx /= N
  return loss, dx

We implemented these loss functions earlier. Deriving their gradients requires some matrix calculus; see these two posts:
http://giantpandacv.com/academic/算法科普/深度学习基础/SVM Loss以及梯度推导/
https://blog.csdn.net/qq_27261889/article/details/82915598
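As a quick sanity check of the softmax gradient (probabilities minus the one-hot label), here is a small sketch for a single example in plain Python. The function name softmax_loss_single and the sample scores are assumptions made for illustration; the sketch compares the analytic gradient with a central-difference estimate.

```python
import math

def softmax_loss_single(scores, label):
  # Numerically stable softmax cross-entropy for one example.
  top = max(scores)
  shifted = [s - top for s in scores]
  log_z = math.log(sum(math.exp(s) for s in shifted))
  log_probs = [s - log_z for s in shifted]
  loss = -log_probs[label]
  # Gradient: softmax probabilities minus the one-hot label.
  grad = [math.exp(lp) - (1.0 if j == label else 0.0)
          for j, lp in enumerate(log_probs)]
  return loss, grad

scores, label = [2.0, 1.0, 0.1], 0
loss, grad = softmax_loss_single(scores, label)

# Central-difference estimate of the gradient for each score.
h = 1e-6
numeric = []
for j in range(len(scores)):
  plus = scores[:]
  plus[j] += h
  minus = scores[:]
  minus[j] -= h
  diff = softmax_loss_single(plus, label)[0] - softmax_loss_single(minus, label)[0]
  numeric.append(diff / (2 * h))

max_error = max(abs(a - n) for a, n in zip(grad, numeric))
print(max_error < 1e-6)
```

Because the probabilities sum to one, the gradient entries for one example sum to zero, which is another cheap property to check.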

Multi-layer neural network

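For the multi-layer network, we start with the class initialization. The architecture is {linear - relu - [dropout]} x (L - 1) - linear - softmax: L - 1 blocks of linear, ReLU, and optional dropout, followed by a final linear-softmax stage that produces the output. During initialization we loop over the hidden layers to initialize each weight matrix and bias, then initialize the final linear layer, taking care that the matrix dimensions line up.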
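To make the dimension bookkeeping concrete, the following sketch lists the parameter shapes that such an initialization would produce. layer_shapes is a hypothetical helper written for this illustration (it is not part of the assignment), and the sizes in the usage line are arbitrary.

```python
def layer_shapes(input_dim, hidden_dims, num_classes):
  # Chain the sizes: input -> hidden_1 -> ... -> hidden_{L-1} -> classes.
  dims = [input_dim] + list(hidden_dims) + [num_classes]
  shapes = {}
  for i in range(1, len(dims)):
    # Layer i maps dims[i - 1] features to dims[i] features.
    shapes['W{}'.format(i)] = (dims[i - 1], dims[i])
    shapes['b{}'.format(i)] = (dims[i],)
  return shapes

shapes = layer_shapes(3 * 32 * 32, [100, 50], 10)
for name in sorted(shapes):
  print(name, shapes[name])
```

Each weight matrix takes its row count from the previous layer's output size, which is why the initialization loop carries a running "last dimension" from layer to layer.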

class FullyConnectedNet(object):
  """
  A fully-connected neural network with an arbitrary number of hidden layers,
  ReLU nonlinearities, and a softmax loss function.
  For a network with L layers, the architecture will be:

  {linear - relu - [dropout]} x (L - 1) - linear - softmax

  where dropout is optional, and the {...} block is repeated L - 1 times.

  Similar to the TwoLayerNet above, learnable parameters are stored in the
  self.params dictionary and will be learned using the Solver class.
  """

  def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
               dropout=0.0, reg=0.0, weight_scale=1e-2, seed=None,
               dtype=torch.float, device='cpu'):
    """
    Initialize a new FullyConnectedNet.

    Inputs:
    - hidden_dims: A list of integers giving the size of each hidden layer.
    - input_dim: An integer giving the size of the input.
    - num_classes: An integer giving the number of classes to classify.
    - dropout: Scalar between 0 and 1 giving the drop probability for networks
      with dropout. If dropout=0 then the network should not use dropout.
    - reg: Scalar giving L2 regularization strength.
    - weight_scale: Scalar giving the standard deviation for random
      initialization of the weights.
    - seed: If not None, then pass this random seed to the dropout layers. This
      will make the dropout layers deterministic so we can gradient check the
      model.
    - dtype: A torch data type object; all computations will be performed using
      this datatype. float is faster but less accurate, so you should use
      double for numeric gradient checking.
    - device: device to use for computation. 'cpu' or 'cuda'
    """
    self.use_dropout = dropout != 0
    self.reg = reg
    self.num_layers = 1 + len(hidden_dims)
    self.dtype = dtype
    self.params = {}

    # Store weights and biases for the first layer in W1 and b1, for the
    # second layer in W2 and b2, etc. Weights are drawn from a normal
    # distribution centered at 0 with standard deviation weight_scale;
    # biases are initialized to zero.
    last_dim = input_dim
    for n, hidden_dim in enumerate(hidden_dims):
      i = n + 1
      self.params['W{}'.format(i)] = weight_scale * torch.randn(
          last_dim, hidden_dim, dtype=dtype, device=device)
      self.params['b{}'.format(i)] = torch.zeros(
          hidden_dim, dtype=dtype, device=device)
      last_dim = hidden_dim

    # Final linear layer that maps to class scores. Using num_layers here
    # (rather than i += 1) also works when hidden_dims is empty.
    i = self.num_layers
    self.params['W{}'.format(i)] = weight_scale * torch.randn(
        last_dim, num_classes, dtype=dtype, device=device)
    self.params['b{}'.format(i)] = torch.zeros(
        num_classes, dtype=dtype, device=device)

    # When using dropout we need to pass a dropout_param dictionary to each
    # dropout layer so that the layer knows the dropout probability and the mode
    # (train / test). You can pass the same dropout_param to each dropout layer.
    self.dropout_param = {}
    if self.use_dropout:
      self.dropout_param = {'mode': 'train', 'p': dropout}
      if seed is not None:
        self.dropout_param['seed'] = seed

Next, we can define save and load functions to store and restore the model parameters and related settings:

  # Methods of FullyConnectedNet (continued)

  def save(self, path):
    checkpoint = {
      'reg': self.reg,
      'dtype': self.dtype,
      'params': self.params,
      'num_layers': self.num_layers,
      'use_dropout': self.use_dropout,
      'dropout_param': self.dropout_param,
    }

    torch.save(checkpoint, path)
    print("Saved in {}".format(path))

  def load(self, path, dtype, device):
    checkpoint = torch.load(path, map_location='cpu')
    self.params = checkpoint['params']
    self.dtype = dtype
    self.reg = checkpoint['reg']
    self.num_layers = checkpoint['num_layers']
    self.use_dropout = checkpoint['use_dropout']
    self.dropout_param = checkpoint['dropout_param']

    # Cast every parameter to the requested dtype and move it to the device.
    for p in self.params:
      self.params[p] = self.params[p].type(dtype).to(device)

    print("load checkpoint file: {}".format(path))

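Finally come the forward and backward passes. We can directly reuse the basic linear and ReLU forward and backward passes from above; just follow the network architecture and keep the layers in the right order.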
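The ordering can be spelled out with a tiny sketch. forward_sequence is a hypothetical helper used only for illustration: it lists the layer names in forward order for the {linear - relu - [dropout]} x (L - 1) - linear - softmax architecture, and the backward pass simply visits the same layers in reverse.

```python
def forward_sequence(num_layers, use_dropout):
  seq = []
  # L - 1 hidden blocks: linear, relu, and optionally dropout.
  for i in range(1, num_layers):
    seq.append('linear{}'.format(i))
    seq.append('relu{}'.format(i))
    if use_dropout:
      seq.append('dropout{}'.format(i))
  # Final linear layer followed by the softmax loss.
  seq.append('linear{}'.format(num_layers))
  seq.append('softmax')
  return seq

forward = forward_sequence(3, use_dropout=True)
backward = forward[::-1]
print(' -> '.join(forward))
print(' -> '.join(backward))
```

In the backward direction the dropout gradient of a block therefore comes before its ReLU and linear gradients, which is the order a matching backward pass has to follow.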

  # Method of FullyConnectedNet (continued)

  def loss(self, X, y=None):
    """
    Compute loss and gradient for the fully-connected net.
    Input / output: Same as TwoLayerNet above.
    """
    X = X.to(self.dtype)
    mode = 'test' if y is None else 'train'

    # Set train/test mode for batchnorm params and dropout param since they
    # behave differently during training and testing.
    if self.use_dropout:
      self.dropout_param['mode'] = mode
    scores = None

    # Forward pass: L - 1 blocks of linear - relu - [dropout], then a final
    # linear layer that produces the class scores. Each dropout forward pass
    # receives self.dropout_param.
    cache_dict = {}
    last_out = X
    for n in range(self.num_layers - 1):
      i = n + 1
      last_out, cache_dict['cache_LR{}'.format(i)] = Linear_ReLU.forward(
          last_out, self.params['W{}'.format(i)], self.params['b{}'.format(i)])
      if self.use_dropout:
        last_out, cache_dict['cache_Dropout{}'.format(i)] = Dropout.forward(
            last_out, self.dropout_param)
    # Index of the last layer; also valid when there are no hidden layers.
    i = self.num_layers
    last_out, cache_dict['cache_L{}'.format(i)] = Linear.forward(
        last_out, self.params['W{}'.format(i)], self.params['b{}'.format(i)])
    scores = last_out

    # If test mode return early
    if mode == 'test':
      return scores

    loss, grads = 0.0, {}

    # Backward pass: softmax data loss plus L2 regularization, with grads[k]
    # holding the gradient for self.params[k]. The assignment template asks
    # for a factor of 0.5 in the L2 term; the code below instead uses
    # reg * sum(W * W), with the matching gradient 2 * reg * W.
    loss, dout = softmax_loss(scores, y)
    loss += (self.params['W{}'.format(i)] * self.params['W{}'.format(i)]).sum() * self.reg
    last_dout, dw, db = Linear.backward(dout, cache_dict['cache_L{}'.format(i)])
    grads['W{}'.format(i)] = dw + 2 * self.params['W{}'.format(i)] * self.reg
    grads['b{}'.format(i)] = db

    # Walk the hidden blocks in reverse: dropout first, then linear-relu.
    for n in reversed(range(self.num_layers - 1)):
      i = n + 1
      if self.use_dropout:
        last_dout = Dropout.backward(last_dout, cache_dict['cache_Dropout{}'.format(i)])
      last_dout, dw, db = Linear_ReLU.backward(last_dout, cache_dict['cache_LR{}'.format(i)])
      grads['W{}'.format(i)] = dw + 2 * self.params['W{}'.format(i)] * self.reg
      grads['b{}'.format(i)] = db
      loss += (self.params['W{}'.format(i)] * self.params['W{}'.format(i)]).sum() * self.reg
    return loss, grads

Different gradient descent methods

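Implementations of SGD, SGD+Momentum, RMSprop, and Adam (Momentum + RMSprop + bias correction)
For the underlying principles, see an earlier post: https://www.cnblogs.com/dyccyber/p/17759697.html
One point worth highlighting: Adam includes a bias correction term. The moving averages m and v start at zero, so their early estimates are biased toward zero; the correction compensates for this and keeps the first few updates from being too large.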
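The effect of the bias correction on the very first update can be worked out with scalars. The sketch below is an illustration only: adam_first_step is a hypothetical function that assumes m and v start at zero and uses the common defaults beta1 = 0.9, beta2 = 0.999; the gradient value 0.5 is arbitrary.

```python
import math

def adam_first_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, correct=True):
  # First iteration (t = 1) with m and v initialized to zero.
  t = 1
  m = (1 - beta1) * g
  v = (1 - beta2) * g * g
  if correct:
    # Bias correction rescales both moving averages.
    m = m / (1 - beta1 ** t)
    v = v / (1 - beta2 ** t)
  # Size of the update that would be subtracted from the weight.
  return lr * m / (math.sqrt(v) + eps)

with_fix = adam_first_step(0.5, correct=True)
without_fix = adam_first_step(0.5, correct=False)
ratio = without_fix / with_fix
print(with_fix, without_fix, ratio)
```

With correction, the first step has magnitude close to the learning rate. Without it, v is underestimated more strongly than m (factor 0.001 versus 0.1), so the step comes out larger by about 0.1 / sqrt(0.001) = sqrt(10).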

def sgd(w, dw, config=None):
  """
  Performs vanilla stochastic gradient descent.
  config format:
  - learning_rate: Scalar learning rate.
  """
  if config is None: config = {}
  config.setdefault('learning_rate', 1e-2)

  # In-place update of w.
  w -= config['learning_rate'] * dw
  return w, config


def sgd_momentum(w, dw, config=None):
  """
  Performs stochastic gradient descent with momentum.
  config format:
  - learning_rate: Scalar learning rate.
  - momentum: Scalar between 0 and 1 giving the momentum value.
    Setting momentum = 0 reduces to sgd.
  - velocity: A torch tensor of the same shape as w and dw used to store a
    moving average of the gradients.
  """
  if config is None: config = {}
  config.setdefault('learning_rate', 1e-2)
  config.setdefault('momentum', 0.9)
  v = config.get('velocity', torch.zeros_like(w))

  # Momentum update: decay the old velocity, add the new gradient step,
  # then move the weights along the velocity.
  v = config['momentum'] * v - config['learning_rate'] * dw
  next_w = w + v
  config['velocity'] = v

  return next_w, config


def rmsprop(w, dw, config=None):
  """
  Uses the RMSProp update rule, which uses a moving average of squared
  gradient values to set adaptive per-parameter learning rates.
  config format:
  - learning_rate: Scalar learning rate.
  - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
    gradient cache.
  - epsilon: Small scalar used for smoothing to avoid dividing by zero.
  - cache: Moving average of second moments of gradients.
  """
  if config is None: config = {}
  config.setdefault('learning_rate', 1e-2)
  config.setdefault('decay_rate', 0.99)
  config.setdefault('epsilon', 1e-8)
  config.setdefault('cache', torch.zeros_like(w))

  # Update the moving average of squared gradients, then scale the step
  # per parameter. w is updated in place.
  config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw**2
  w += -config['learning_rate'] * dw / (torch.sqrt(config['cache']) + config['epsilon'])
  next_w = w

  return next_w, config


def adam(w, dw, config=None):
  """
  Uses the Adam update rule, which incorporates moving averages of both the
  gradient and its square and a bias correction term.
  config format:
  - learning_rate: Scalar learning rate.
  - beta1: Decay rate for moving average of first moment of gradient.
  - beta2: Decay rate for moving average of second moment of gradient.
  - epsilon: Small scalar used for smoothing to avoid dividing by zero.
  - m: Moving average of gradient.
  - v: Moving average of squared gradient.
  - t: Iteration number.
  """
  if config is None: config = {}
  config.setdefault('learning_rate', 1e-3)
  config.setdefault('beta1', 0.9)
  config.setdefault('beta2', 0.999)
  config.setdefault('epsilon', 1e-8)
  config.setdefault('m', torch.zeros_like(w))
  config.setdefault('v', torch.zeros_like(w))
  config.setdefault('t', 0)

  # t is incremented before it is used, to match the reference output.
  config['t'] += 1
  # First moment and its bias-corrected estimate.
  config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
  mt = config['m'] / (1 - config['beta1']**config['t'])
  # Second moment and its bias-corrected estimate.
  config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dw * dw)
  vc = config['v'] / (1 - (config['beta2']**config['t']))
  w = w - (config['learning_rate'] * mt) / (torch.sqrt(vc) + config['epsilon'])
  next_w = w

  return next_w, config

Dropout layer

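Note that in the multi-layer network above, dropout is applied only during training and is disabled at test time.
Dropout is a simple and very effective regularization method. During training, each neuron output stays active only with some probability and is zeroed otherwise. In this assignment p is the probability of dropping an output: we draw a uniform random number for each element and set the element to zero when that number is less than p.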

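From another angle, dropout is a way of sampling sub-networks from the full network, which reduces complex co-adaptation between neurons.
See the original paper: https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf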
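The assignment asks for inverted dropout, where the surviving activations are scaled by 1 / (1 - p) during training so that nothing needs to change at test time. A minimal plain-Python sketch of that idea follows; inverted_dropout is a hypothetical helper for illustration, p is taken to be the drop probability, and the seed and sample size are arbitrary.

```python
import random

def inverted_dropout(values, p, rng):
  # Drop each value with probability p; scale survivors by 1 / (1 - p).
  keep_scale = 1.0 / (1.0 - p)
  mask = [keep_scale if rng.random() >= p else 0.0 for _ in values]
  out = [v * m for v, m in zip(values, mask)]
  return out, mask

rng = random.Random(0)
x = [1.0] * 100000
out, mask = inverted_dropout(x, p=0.3, rng=rng)

# The mean activation stays close to the input mean of 1.0,
# and roughly 30% of the entries are zeroed.
mean_out = sum(out) / len(out)
drop_fraction = mask.count(0.0) / len(mask)
print(mean_out, drop_fraction)
```

Since the mask already carries the scaling, the backward pass is just the upstream gradient multiplied element-wise by the same mask.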
Code implementation:


class Dropout(object):

  @staticmethod
  def forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.
    Inputs:
    - x: Input data: tensor of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We *drop* each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
      if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
      function deterministic, which is needed for gradient checking but not
      in real networks.
    Outputs:
    - out: Tensor of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.
    NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
    See http://cs231n.github.io/neural-networks-2/#reg for more details.
    NOTE 2: Keep in mind that p is the probability of **dropping** a neuron
    output; this might be contrary to some sources, where it is referred to
    as the probability of keeping a neuron output.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
      torch.manual_seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
      # Inverted dropout: keep each element with probability 1 - p and scale
      # the survivors by 1 / (1 - p) so the expected activation is unchanged.
      # The mask is created on the same device as x.
      mask = (torch.rand(x.shape, device=x.device) > p).to(x.dtype) / (1 - p)
      out = x * mask
    elif mode == 'test':
      # Test phase: inverted dropout leaves the input untouched.
      out = x
    cache = (dropout_param, mask)

    return out, cache

  @staticmethod
  def backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.
    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from Dropout.forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
      # Apply the same scaled mask to the gradient, without modifying dout
      # in place.
      dx = dout * mask
    elif mode == 'test':
      dx = dout
    return dx
