TensorFlow 2: Ten Deep Learning Must-Knows

Tech sharing (2022-07-01)

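Drawing on many years of R&D work on deep learning algorithms, the author has put together the following ten must-knows to share.

Links to reference material are included, and some items come with code.

Joy shared is better than joy kept to oneself, so I hope this helps.

When I have time I will come back and go into the details of the tricks of the trade in deep learning.

If you have a technical problem you would like solved on a paid basis, you can also contact the author by email or QQ.

Email and QQ share the same ID: gaozhihan@vip.qq.com

Of course there are other "must-knows" beyond these ten.

You are welcome to share more in the comments. These are just ten picked for now, so don't take the number too seriously.

The main thing to learn is the idea behind each one. Keep in mind that these ideas do not apply in some scenarios.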

1. Data echoing

[1907.05550] Faster Neural Network Training with Data Echoing

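    import tensorflow as tf


    def data_echoing(factor):
        # Repeat each (image, label) element `factor` times; apply with dataset.flat_map(...).
        return lambda image, label: tf.data.Dataset.from_tensors((image, label)).repeat(factor)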

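What it does:

After the dataset is loaded, the current batch is repeated, either before or after data augmentation, so that it is fed to the model several times. This cuts the time spent waiting on data loading.

It is equivalent to letting the model see the current data n times, or see n augmented variants of it.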
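
To make the echoing step concrete, here is a minimal sketch of how the helper above could be plugged into a `tf.data` pipeline with `flat_map`. The toy tensors and the echo factor of 2 are assumptions for illustration only.

```python
import tensorflow as tf


def data_echoing(factor):
    # Same helper as in the article: emit each (image, label) pair `factor` times.
    return lambda image, label: tf.data.Dataset.from_tensors((image, label)).repeat(factor)


# Toy data (hypothetical): three 2x2 "images" with labels 0, 1, 2.
images = tf.reshape(tf.range(12, dtype=tf.float32), (3, 2, 2))
labels = tf.constant([0, 1, 2], dtype=tf.int64)

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
# Echo each element twice; later stages (augmentation, batching) see the copies.
echoed = dataset.flat_map(data_echoing(2))

echoed_labels = [int(label) for _, label in echoed]
```

If the echo is placed before a random augmentation `map`, each copy can be augmented differently; placed after it, the model sees identical copies. Which one works better likely depends on where the pipeline bottleneck is.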

2. AMP (automatic mixed precision)

在bert4keras中使用混合精度和XLA加速训练 (Using mixed precision and XLA to speed up training in bert4keras) - 科学空间|Scientific Spaces

    tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

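What it does:

Reduces GPU memory use and speeds up training by running part of the network's computation in lower precision (typically float16), which cuts compute cost while keeping results close to full precision.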
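
The one-liner above turns on the graph-level rewrite. As an alternative sketch, and not the approach of the linked article, Keras also exposes a mixed-precision policy. The example below only checks which dtypes a layer ends up with; the layer size is an arbitrary assumption.

```python
import tensorflow as tf

# Compute in float16, keep the weights in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

dense = tf.keras.layers.Dense(8)
compute_dtype = dense.compute_dtype    # dtype used for the layer's math
variable_dtype = dense.variable_dtype  # dtype used to store the weights

# Restore the default policy so later code is unaffected.
tf.keras.mixed_precision.set_global_policy("float32")
```

With float16 it is usually advisable to keep the final softmax or loss computation in float32 and to use loss scaling, since small gradients can underflow.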

3. Memory-saving optimizers

3.1 [1804.04235] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

mesh/optimize.py at master · tensorflow/mesh · GitHub

3.2 [1901.11150] Memory-Efficient Adaptive Optimization

google-research/sm3 at master · google-research/google-research (github.com)

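What it does:

Saves GPU memory and speeds up training.

The main idea is to store the second-moment statistics in a factored, structure-specific form rather than one value per parameter, which shrinks the optimizer state kept in memory.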
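
To illustrate the factoring idea, here is a small NumPy sketch in the spirit of Adafactor: instead of a full matrix of squared gradients, only row and column sums are kept, and a rank-1 estimate is rebuilt from them. It is a single-step illustration under simplifying assumptions (no moving average, no update clipping), not the optimizer from the linked repositories.

```python
import numpy as np


def factored_second_moment(grad_sq):
    """Rank-1 reconstruction of a squared-gradient matrix from row/column sums."""
    row_sums = grad_sq.sum(axis=1)   # shape (n,)
    col_sums = grad_sq.sum(axis=0)   # shape (m,)
    approx = np.outer(row_sums, col_sums) / row_sums.sum()
    return row_sums, col_sums, approx


rng = np.random.default_rng(0)
n, m = 64, 128
# A rank-1 matrix of squared gradients is recovered exactly by the factored form.
grad_sq = np.outer(rng.uniform(0.1, 1.0, n), rng.uniform(0.1, 1.0, m))
rows, cols, approx = factored_second_moment(grad_sq)

full_state = grad_sq.size                # n * m values for an Adam-style state
factored_state = rows.size + cols.size   # n + m values for the factored state
max_error = np.abs(approx - grad_sq).max()
```

For a 64 x 128 weight matrix the state drops from n * m to n + m values. Real gradients are not exactly rank-1, so in practice the reconstruction is an approximation.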

4. Weight standardization (normalization)

[2102.06171] High-Performance Large-Scale Image Recognition Without Normalization

deepmind-research/nfnets at master · deepmind/deepmind-research · GitHub

    import numpy as np
    import tensorflow as tf


    class WSConv2D(tf.keras.layers.Conv2D):
        """Conv2D with weight standardization and a learnable per-filter gain."""

        def __init__(self, *args, **kwargs):
            super(WSConv2D, self).__init__(
                kernel_initializer=tf.keras.initializers.VarianceScaling(
                    scale=1.0, mode='fan_in', distribution='untruncated_normal',
                ),
                use_bias=False,
                kernel_regularizer=tf.keras.regularizers.l2(1e-4), *args, **kwargs
            )
            self.gain = self.add_weight(
                name='gain',
                shape=(self.filters,),
                initializer="ones",
                trainable=True,
                dtype=self.dtype
            )

        def standardize_weight(self, eps):
            # Statistics are taken per output filter, over the H, W and input-channel axes.
            mean, var = tf.nn.moments(self.kernel, axes=[0, 1, 2], keepdims=True)
            fan_in = np.prod(self.kernel.shape[:-1])
            # Manually fused normalization, eq. to (w - mean) * gain / sqrt(N * var)
            scale = tf.math.rsqrt(
                tf.math.maximum(
                    var * fan_in,
                    tf.convert_to_tensor(eps, dtype=self.dtype)
                )
            ) * self.gain
            shift = mean * scale
            return self.kernel * scale - shift

        def call(self, inputs):
            eps = 1e-4
            weight = self.standardize_weight(eps)
            outputs = tf.nn.conv2d(
                inputs, weight, strides=self.strides,
                padding=self.padding.upper(), dilations=self.dilation_rate
            )
            if self.bias is not None:
                outputs = tf.nn.bias_add(outputs, self.bias)
            return outputs

What it does:

Standardizing (normalizing) the kernel acts as a prior constraint on the weights, which speeds up training convergence.
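
As a quick sanity check of the formula `(w - mean) * gain / sqrt(fan_in * var)`, the NumPy sketch below standardizes a random HWIO kernel and inspects the per-filter statistics. The kernel shape, the gain of ones, and the helper name `standardize_kernel` are assumptions for illustration.

```python
import numpy as np


def standardize_kernel(kernel, gain, eps=1e-4):
    """NumPy version of (w - mean) * gain / sqrt(fan_in * var) for an HWIO kernel."""
    mean = kernel.mean(axis=(0, 1, 2), keepdims=True)
    var = kernel.var(axis=(0, 1, 2), keepdims=True)
    fan_in = np.prod(kernel.shape[:-1])
    scale = gain / np.sqrt(np.maximum(var * fan_in, eps))
    return (kernel - mean) * scale


rng = np.random.default_rng(0)
kernel = rng.normal(loc=0.5, scale=2.0, size=(3, 3, 16, 8))  # HWIO
standardized = standardize_kernel(kernel, gain=np.ones(8))

fan_in = 3 * 3 * 16
per_filter_mean = standardized.mean(axis=(0, 1, 2))
per_filter_var = standardized.var(axis=(0, 1, 2))
```

Each output filter ends up with zero mean and variance `1 / fan_in`, regardless of how the raw kernel was initialized or has drifted during training.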

5. Adaptive gradient clipping

deepmind-research/agc_optax.py at master · deepmind/deepmind-research · GitHub

    import tensorflow as tf


    def unitwise_norm(x):
        """L2 norm per output unit (per tensor for scalars and vectors)."""
        if len(tf.squeeze(x).shape) <= 1:  # Scalars and vectors
            axis = None
            keepdims = False
        elif len(x.shape) in [2, 3]:  # Linear layers of shape IO
            axis = 0
            keepdims = True
        elif len(x.shape) == 4:  # Conv kernels of shape HWIO
            axis = [0, 1, 2, ]
            keepdims = True
        else:
            raise ValueError(f'Got a parameter with shape not in [1, 2, 3, 4]! {x}')
        square_sum = tf.reduce_sum(tf.square(x), axis, keepdims=keepdims)
        return tf.sqrt(square_sum)


    def gradient_clipping(grad, var):
        clipping = 0.01
        # Largest gradient norm allowed for each unit, relative to its weight norm.
        max_norm = tf.maximum(unitwise_norm(var), 1e-3) * clipping
        grad_norm = unitwise_norm(grad)
        trigger = (grad_norm > max_norm)
        clipped_grad = (max_norm / tf.maximum(grad_norm, 1e-6))
        return grad * tf.where(trigger, clipped_grad, tf.ones_like(clipped_grad))

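What it does:

Prevents exploding gradients and stabilizes training. Gradients are clipped based on the ratio of the gradient norm to the parameter norm, which in effect bounds the step size.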
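
A small worked example may help show when the clipping triggers. The NumPy sketch below applies the same rule to a dense kernel of shape (in, out). The toy weights and gradients and the helper name `agc_clip_dense` are assumptions; the constants mirror the code above.

```python
import numpy as np


def agc_clip_dense(grad, weight, clipping=0.01, eps=1e-3):
    """Adaptive gradient clipping for a dense kernel of shape (in, out)."""
    # Unit-wise norms: one norm per output unit (column).
    weight_norm = np.sqrt((weight ** 2).sum(axis=0, keepdims=True))
    grad_norm = np.sqrt((grad ** 2).sum(axis=0, keepdims=True))
    max_norm = np.maximum(weight_norm, eps) * clipping
    scale = np.where(grad_norm > max_norm, max_norm / np.maximum(grad_norm, 1e-6), 1.0)
    return grad * scale


weight = np.ones((4, 2))                      # each column has norm 2
grad = np.stack([np.full(4, 5.0),             # column 0: norm 10, far too large
                 np.full(4, 0.001)], axis=1)  # column 1: norm 0.002, small enough
clipped = agc_clip_dense(grad, weight)
clipped_norms = np.sqrt((clipped ** 2).sum(axis=0))
```

The first column is rescaled so that its gradient norm equals `clipping * weight_norm = 0.02`, while the second column passes through unchanged.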

6.recompute_grad

[1604.06174] Training Deep Nets with Sublinear Memory Cost

google-research/recompute_grad.py at master · google-research/google-research (github.com)

bojone/keras_recompute: saving memory by recomputing for keras (github.com)

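What it does:

Saves GPU memory by recomputing intermediate activations during the backward pass instead of keeping them in memory, at the cost of extra compute.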
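
TensorFlow ships a built-in `tf.recompute_grad` wrapper, separate from the repositories linked above. The sketch below wraps a small two-layer block and checks that the gradients match the unwrapped version. The block itself, its sizes, and the fixed random weights are assumptions for illustration.

```python
import tensorflow as tf

w1 = tf.random.stateless_normal((16, 16), seed=(1, 2))
w2 = tf.random.stateless_normal((16, 16), seed=(3, 4))


def block(x):
    # The intermediate activation `h` is what recomputation avoids storing.
    h = tf.tanh(tf.matmul(x, w1))
    return tf.tanh(tf.matmul(h, w2))


# Same forward computation; the forward pass is rerun when gradients are needed.
block_recompute = tf.recompute_grad(block)
x = tf.random.stateless_normal((4, 16), seed=(5, 6))


def grad_wrt_input(fn):
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.reduce_sum(fn(x))
    return tape.gradient(loss, x)


grad_plain = grad_wrt_input(block)
grad_recomputed = grad_wrt_input(block_recompute)
max_diff = float(tf.reduce_max(tf.abs(grad_plain - grad_recomputed)))
```

The memory saving only becomes noticeable when the wrapped block is large and repeated many times, for example once per residual or transformer block.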

7. Normalization

[2003.05569] Extended Batch Normalization (arxiv.org)

    import tensorflow as tf
    # Private Keras 2.x module path; it may differ between Keras versions.
    from keras.layers.normalization.batch_normalization import BatchNormalizationBase


    class ExtendedBatchNormalization(BatchNormalizationBase):
        def __init__(self,
                     axis=-1,
                     momentum=0.99,
                     epsilon=1e-3,
                     center=True,
                     scale=True,
                     beta_initializer='zeros',
                     gamma_initializer='ones',
                     moving_mean_initializer='zeros',
                     moving_variance_initializer='ones',
                     beta_regularizer=None,
                     gamma_regularizer=None,
                     beta_constraint=None,
                     gamma_constraint=None,
                     renorm=False,
                     renorm_clipping=None,
                     renorm_momentum=0.99,
                     trainable=True,
                     name=None,
                     **kwargs):
            # Currently we only support aggregating over the global batch size.
            super(ExtendedBatchNormalization, self).__init__(
                axis=axis,
                momentum=momentum,
                epsilon=epsilon,
                center=center,
                scale=scale,
                beta_initializer=beta_initializer,
                gamma_initializer=gamma_initializer,
                moving_mean_initializer=moving_mean_initializer,
                moving_variance_initializer=moving_variance_initializer,
                beta_regularizer=beta_regularizer,
                gamma_regularizer=gamma_regularizer,
                beta_constraint=beta_constraint,
                gamma_constraint=gamma_constraint,
                renorm=renorm,
                renorm_clipping=renorm_clipping,
                renorm_momentum=renorm_momentum,
                fused=False,
                trainable=trainable,
                virtual_batch_size=None,
                name=name,
                **kwargs)

        def _calculate_mean_and_var(self, x, axes, keep_dims):
            with tf.keras.backend.name_scope('moments'):
                # Compute statistics in float32 even when the input is float16.
                y = tf.cast(x, tf.float32) if x.dtype == tf.float16 else x
                replica_ctx = tf.distribute.get_replica_context()
                if replica_ctx:
                    # Aggregate sums across replicas to get global-batch statistics.
                    local_sum = tf.math.reduce_sum(y, axis=axes, keepdims=True)
                    local_squared_sum = tf.math.reduce_sum(tf.math.square(y), axis=axes,
                                                           keepdims=True)
                    batch_size = tf.cast(tf.shape(y)[0], tf.float32)
                    y_sum = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, local_sum)
                    y_squared_sum = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM,
                                                           local_squared_sum)
                    global_batch_size = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM,
                                                               batch_size)
                    axes_vals = [(tf.shape(y))[i] for i in range(1, len(axes))]
                    multiplier = tf.cast(tf.reduce_prod(axes_vals), tf.float32)
                    multiplier = multiplier * global_batch_size
                    mean = y_sum / multiplier
                    y_squared_mean = y_squared_sum / multiplier
                    # var = E(x^2) - E(x)^2
                    variance = y_squared_mean - tf.math.square(mean)
                else:
                    # Compute true mean while keeping the dims for proper broadcasting.
                    mean = tf.math.reduce_mean(y, axes, keepdims=True, name='mean')
                    variance = tf.math.reduce_mean(
                        tf.math.squared_difference(y, tf.stop_gradient(mean)),
                        axes,
                        keepdims=True,
                        name='variance')
                if not keep_dims:
                    mean = tf.squeeze(mean, axes)
                    variance = tf.squeeze(variance, axes)
                # The "extended" part: average the per-channel variances into one shared value.
                variance = tf.math.reduce_mean(variance)
                if x.dtype == tf.float16:
                    return (tf.cast(mean, tf.float16),
                            tf.cast(variance, tf.float16))
                else:
                    return mean, variance

What it does:

A lightly modified version of Batch Normalization. The idea is simple and effective.
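
The only change relative to plain Batch Normalization in the code above is the final averaging of the per-channel variances. A minimal NumPy sketch of the training-time computation, leaving out gamma, beta and the moving averages, might look as follows. The input shape, the channel scales and the helper name `extended_batch_norm` are assumptions for illustration.

```python
import numpy as np


def extended_batch_norm(x, eps=1e-3):
    """Training-time Extended BN for an NHWC tensor, without gamma/beta."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)          # one mean per channel
    per_channel_var = x.var(axis=(0, 1, 2), keepdims=True)
    shared_var = per_channel_var.mean()                    # one variance for all channels
    return (x - mean) / np.sqrt(shared_var + eps)


rng = np.random.default_rng(0)
# Channels with very different scales (hypothetical data).
x = rng.normal(size=(8, 4, 4, 3)) * np.array([0.1, 1.0, 10.0]) + 5.0
y = extended_batch_norm(x, eps=0.0)

channel_means = y.mean(axis=(0, 1, 2))
channel_stds = y.std(axis=(0, 1, 2))
overall_var = y.var()
```

Each channel is still centered, but because the variance is shared, the relative scale between channels is preserved instead of being forced to one per channel.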

8. Learning rate schedule

[1506.01186] Cyclical Learning Rates for Training Neural Networks (arxiv.org)

What it does:

A recommended learning rate schedule. In certain cases it yields better generalization.
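
For reference, the triangular policy from the paper can be written in a few lines of plain Python. The values of `base_lr`, `max_lr` and `step_size` below are placeholder assumptions and should be tuned per task, for example with the learning rate range test the paper describes.

```python
import math


def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate; step_size is half a cycle in iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Learning rate at the start, quarter, half, three-quarter and end of one cycle.
schedule = [triangular_clr(i) for i in (0, 1000, 2000, 3000, 4000)]
```

The rate climbs linearly from `base_lr` to `max_lr` over `step_size` iterations, then falls back linearly over the next `step_size`, and the cycle repeats.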

9. Re-parameterization

[1908.03930] ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks

https://zhuanlan.zhihu.com/p/361090497

What it does:

Improves generalization by training several sets of parameters in parallel and then merging their weights into one.
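
The merge step works because convolution is linear: the outputs of parallel 3x3, 1x3 and 3x1 branches can be reproduced by a single 3x3 kernel. The NumPy sketch below checks this for one channel. It is a simplified assumption that leaves out the per-branch batch normalization, which ACNet first folds into each kernel before adding them.

```python
import numpy as np


def conv2d_same(image, kernel):
    """Naive single-channel 2D cross-correlation with zero 'same' padding."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
k_square = rng.normal(size=(3, 3))
k_horizontal = rng.normal(size=(1, 3))
k_vertical = rng.normal(size=(3, 1))

# Training time: three parallel branches whose outputs are summed.
branch_sum = (conv2d_same(image, k_square)
              + conv2d_same(image, k_horizontal)
              + conv2d_same(image, k_vertical))

# Inference time: add the 1x3 and 3x1 kernels onto the 3x3 kernel's centre row/column.
k_fused = k_square.copy()
k_fused[1:2, :] += k_horizontal
k_fused[:, 1:2] += k_vertical
fused_out = conv2d_same(image, k_fused)

max_diff = np.abs(branch_sum - fused_out).max()
```

After fusion the deployed model has the same cost as a plain 3x3 convolution, so the extra branches only add cost during training.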

10. Long-tailed learning