• TensorFlow 2: Ten Deep Learning Must-Knows


    Based on my years of experience in deep learning algorithm R&D, I have compiled and am sharing the following ten must-knows.

    Reference links are included, and some items come with code implementations.

    Knowledge is better shared than kept to oneself; I hope this helps my readers.

    When I find time later, I will go into the finer details behind these deep learning tricks.

    If you have technical needs you would like solved on a paid basis, you can also contact me by email or QQ.

    Email (same ID on QQ): gaozhihan@vip.qq.com

    Of course, beyond these ten there are certainly other "must-knows";

    feel free to share more in the comments. These are just the ten I have drafted for now, so please don't take the list too literally.

    Focus on learning the underlying ideas, and keep in mind that some of them do not apply in every scenario.

    1. Data Echoing

    [1907.05550] Faster Neural Network Training with Data Echoing

    def data_echoing(factor): 
        return lambda image, label: tf.data.Dataset.from_tensors((image, label)).repeat(factor)

    Purpose:

    After the dataset is loaded, each batch is repeated (before or after data augmentation) a configurable number of times on its way into the model, which reduces the time spent on data loading.

    This is equivalent to letting the model see the same batch n times, or see n augmented variants of the same samples; a usage sketch follows.
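
    A minimal usage sketch of the helper above inside a tf.data input pipeline; `dataset`, `augment`, and the echo factor of 2 are illustrative placeholders, not part of the original post.

    dataset = (dataset
               .flat_map(data_echoing(factor=2))  # echo each example before augmentation
               .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
               .batch(32)
               .prefetch(tf.data.AUTOTUNE))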

    2. AMP (Automatic Mixed Precision)

    Using mixed precision and XLA to accelerate training in bert4keras - 科学空间|Scientific Spaces

        tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

    Purpose:

    Reduces GPU memory usage and speeds up training by converting part of the network's computation to equivalent lower-precision arithmetic, thereby lowering the compute cost; a sketch of the Keras mixed-precision route follows.
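
    As a complementary route, newer TF 2.x releases (assuming TF >= 2.4) ship a Keras mixed-precision API that keeps float32 master weights while casting most compute to float16; the tiny model below is only an illustration.

    import tensorflow as tf

    tf.keras.mixed_precision.set_global_policy('mixed_float16')

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation='relu'),
        # Keep the final layer in float32 for numerical stability of the loss.
        tf.keras.layers.Dense(10, dtype='float32'),
    ])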

    3. Memory-Efficient Optimizers

    3.1 [1804.04235] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

    mesh/optimize.py at master · tensorflow/mesh · GitHub

    3.2 [1901.11150] Memory-Efficient Adaptive Optimization

    google-research/sm3 at master · google-research/google-research (github.com)

    Purpose:

    Saves GPU memory and speeds up training.

    The main idea is to decompose the optimizer's second-moment statistics into a specialized factored form, so much less optimizer state has to be stored; a conceptual sketch follows.
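
    The sketch below is not the official Adafactor/SM3 code; it only illustrates, under simplifying assumptions, the factored second-moment idea for a 2-D (matrix) parameter: keep per-row and per-column accumulators instead of a full elementwise one.

    import tensorflow as tf

    def factored_rms(grad, row_acc, col_acc, beta2=0.999, eps=1e-30):
        # grad: [m, n] gradient; row_acc: [m] and col_acc: [n] tf.Variable accumulators.
        sq = tf.square(grad) + eps
        # Update two small accumulators instead of a full m x n second-moment tensor.
        row_acc.assign(beta2 * row_acc + (1.0 - beta2) * tf.reduce_mean(sq, axis=1))
        col_acc.assign(beta2 * col_acc + (1.0 - beta2) * tf.reduce_mean(sq, axis=0))
        # Reconstruct an approximate per-element second moment from the two factors.
        v_hat = tf.einsum('i,j->ij', row_acc, col_acc) / tf.reduce_mean(row_acc)
        return grad * tf.math.rsqrt(v_hat)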

    4. Weight Standardization (Normalization)

    [2102.06171] High-Performance Large-Scale Image Recognition Without Normalization

    deepmind-research/nfnets at master · deepmind/deepmind-research · GitHub

    import numpy as np
    import tensorflow as tf

    class WSConv2D(tf.keras.layers.Conv2D):
        def __init__(self, *args, **kwargs):
            super(WSConv2D, self).__init__(
                kernel_initializer=tf.keras.initializers.VarianceScaling(
                    scale=1.0, mode='fan_in', distribution='untruncated_normal',
                ),
                use_bias=False,
                kernel_regularizer=tf.keras.regularizers.l2(1e-4), *args, **kwargs
            )
            self.gain = self.add_weight(
                name='gain',
                shape=(self.filters,),
                initializer="ones",
                trainable=True,
                dtype=self.dtype
            )
    
        def standardize_weight(self, eps):
            mean, var = tf.nn.moments(self.kernel, axes=[0, 1, 2], keepdims=True)
            fan_in = np.prod(self.kernel.shape[:-1])
            # Manually fused normalization, eq. to (w - mean) * gain / sqrt(N * var)
            scale = tf.math.rsqrt(
                tf.math.maximum(
                    var * fan_in,
                    tf.convert_to_tensor(eps, dtype=self.dtype)
                )
            ) * self.gain
            shift = mean * scale
            return self.kernel * scale - shift
    
        def call(self, inputs):
            eps = 1e-4
            weight = self.standardize_weight(eps)
            return tf.nn.conv2d(
                inputs, weight, strides=self.strides,
                padding=self.padding.upper(), dilations=self.dilation_rate
            ) if self.bias is None else tf.nn.bias_add(
                tf.nn.conv2d(
                    inputs, weight, strides=self.strides,
                    padding=self.padding.upper(), dilations=self.dilation_rate
                ), self.bias)

    Purpose:

    Standardizing or normalizing the kernel acts as a prior constraint on the weights, which speeds up training convergence; a usage sketch follows.
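
    A hypothetical drop-in usage, replacing a plain Conv2D inside a Keras model (the layer sizes below are arbitrary):

    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(224, 224, 3)),
        WSConv2D(filters=32, kernel_size=3, strides=2, padding='same'),
        tf.keras.layers.Activation('relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])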

    5. Adaptive Gradient Clipping

    deepmind-research/agc_optax.py at master · deepmind/deepmind-research · GitHub

    def unitwise_norm(x):
        if len(tf.squeeze(x).shape) <= 1:  # Scalars and vectors
            axis = None
            keepdims = False
        elif len(x.shape) in [2, 3]:  # Linear layers of shape IO
            axis = 0
            keepdims = True
        elif len(x.shape) == 4:  # Conv kernels of shape HWIO
            axis = [0, 1, 2, ]
            keepdims = True
        else:
            raise ValueError(f'Got a parameter with shape not in [1, 2, 3, 4]! {x}')
        square_sum = tf.reduce_sum(tf.square(x), axis, keepdims=keepdims)
        return tf.sqrt(square_sum)
    
    
    def gradient_clipping(grad, var):
        clipping = 0.01
        max_norm = tf.maximum(unitwise_norm(var), 1e-3) * clipping
        grad_norm = unitwise_norm(grad)
        trigger = (grad_norm > max_norm)
        clipped_grad = (max_norm / tf.maximum(grad_norm, 1e-6))
        return grad * tf.where(trigger, clipped_grad, tf.ones_like(clipped_grad))

    Purpose:

    Prevents exploding gradients and stabilizes training. Gradients are clipped based on their norm relative to the corresponding parameter's norm, which effectively constrains the learning rate; a usage sketch follows.
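
    A hypothetical way to apply the clipping inside a custom training step; `model`, `optimizer`, `loss_fn`, `images`, and `labels` are illustrative names.

    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    grads = [gradient_clipping(g, v) for g, v in zip(grads, model.trainable_variables)]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))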

    6. recompute_grad

    [1604.06174] Training Deep Nets with Sublinear Memory Cost

    google-research/recompute_grad.py at master · google-research/google-research (github.com)

    bojone/keras_recompute: saving memory by recomputing for keras (github.com)

    Purpose:

    Saves GPU memory by recomputing activations during the backward pass instead of storing them (gradient checkpointing); a minimal sketch follows.
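
    TF 2 also exposes tf.recompute_grad, which wraps a function so its intermediate activations are recomputed in the backward pass rather than kept in memory. Below is a minimal sketch, with `block` being an arbitrary sub-model chosen for illustration.

    import tensorflow as tf

    block = tf.keras.Sequential([
        tf.keras.layers.Dense(4096, activation='relu'),
        tf.keras.layers.Dense(4096, activation='relu'),
    ])

    # Activations inside the wrapped function are not stored for backprop;
    # they are recomputed when gradients are needed, trading compute for memory.
    checkpointed_block = tf.recompute_grad(lambda x: block(x))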

    7. Normalization

    [2003.05569] Extended Batch Normalization (arxiv.org)

    # Note: this import path is internal to Keras and changes across versions;
    # adjust it to match the installed TF/Keras release.
    import tensorflow as tf
    from keras.layers.normalization.batch_normalization import BatchNormalizationBase
    
    class ExtendedBatchNormalization(BatchNormalizationBase):
        def __init__(self,
                     axis=-1,
                     momentum=0.99,
                     epsilon=1e-3,
                     center=True,
                     scale=True,
                     beta_initializer='zeros',
                     gamma_initializer='ones',
                     moving_mean_initializer='zeros',
                     moving_variance_initializer='ones',
                     beta_regularizer=None,
                     gamma_regularizer=None,
                     beta_constraint=None,
                     gamma_constraint=None,
                     renorm=False,
                     renorm_clipping=None,
                     renorm_momentum=0.99,
                     trainable=True,
                     name=None,
                     **kwargs):
            # Currently we only support aggregating over the global batch size.
            super(ExtendedBatchNormalization, self).__init__(
                axis=axis,
                momentum=momentum,
                epsilon=epsilon,
                center=center,
                scale=scale,
                beta_initializer=beta_initializer,
                gamma_initializer=gamma_initializer,
                moving_mean_initializer=moving_mean_initializer,
                moving_variance_initializer=moving_variance_initializer,
                beta_regularizer=beta_regularizer,
                gamma_regularizer=gamma_regularizer,
                beta_constraint=beta_constraint,
                gamma_constraint=gamma_constraint,
                renorm=renorm,
                renorm_clipping=renorm_clipping,
                renorm_momentum=renorm_momentum,
                fused=False,
                trainable=trainable,
                virtual_batch_size=None,
                name=name,
                **kwargs)
    
        def _calculate_mean_and_var(self, x, axes, keep_dims):
            with tf.keras.backend.name_scope('moments'):
                y = tf.cast(x, tf.float32) if x.dtype == tf.float16 else x
                replica_ctx = tf.distribute.get_replica_context()
                if replica_ctx:
                    local_sum = tf.math.reduce_sum(y, axis=axes, keepdims=True)
                    local_squared_sum = tf.math.reduce_sum(tf.math.square(y), axis=axes,
                                                           keepdims=True)
                    batch_size = tf.cast(tf.shape(y)[0], tf.float32)
                    y_sum = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, local_sum)
                    y_squared_sum = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM,
                                                           local_squared_sum)
                    global_batch_size = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM,
                                                               batch_size)
                    axes_vals = [(tf.shape(y))[i] for i in range(1, len(axes))]
                    multiplier = tf.cast(tf.reduce_prod(axes_vals), tf.float32)
                    multiplier = multiplier * global_batch_size
                    mean = y_sum / multiplier
                    y_squared_mean = y_squared_sum / multiplier
                    # var = E(x^2) - E(x)^2
                    variance = y_squared_mean - tf.math.square(mean)
                else:
                    # Compute true mean while keeping the dims for proper broadcasting.
                    mean = tf.math.reduce_mean(y, axes, keepdims=True, name='mean')
                    variance = tf.math.reduce_mean(
                        tf.math.squared_difference(y, tf.stop_gradient(mean)),
                        axes,
                        keepdims=True,
                        name='variance')
                if not keep_dims:
                    mean = tf.squeeze(mean, axes)
                    variance = tf.squeeze(variance, axes)
                # Key change of Extended Batch Normalization: use a single scalar
                # variance computed over all dimensions instead of per-channel variance.
                variance = tf.math.reduce_mean(variance)
                if x.dtype == tf.float16:
                    return (tf.cast(mean, tf.float16),
                            tf.cast(variance, tf.float16))
                else:
                    return mean, variance
    

      

    Purpose:

    A simple improved variant of Batch Normalization; the idea is straightforward and effective. A usage sketch follows.
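
    Hypothetical usage: the layer keeps the same interface as the standard BatchNormalization layer, so it can be swapped in directly.

    inputs = tf.keras.Input((32, 32, 3))
    x = tf.keras.layers.Conv2D(64, 3, padding='same')(inputs)
    x = ExtendedBatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)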

    8. Learning-Rate Schedules

    [1506.01186] Cyclical Learning Rates for Training Neural Networks (arxiv.org)

    Purpose:

    A recommended learning-rate schedule; in certain settings it leads to better generalization. A sketch of the triangular policy follows.
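
    A minimal sketch of the paper's triangular cyclical policy as a Keras LearningRateSchedule; the bounds and step size below are illustrative values, not from the original post.

    class TriangularCyclicalLR(tf.keras.optimizers.schedules.LearningRateSchedule):
        def __init__(self, base_lr=1e-4, max_lr=1e-3, step_size=2000.0):
            self.base_lr = base_lr
            self.max_lr = max_lr
            self.step_size = step_size  # half-cycle length, in iterations

        def __call__(self, step):
            step = tf.cast(step, tf.float32)
            cycle = tf.floor(1.0 + step / (2.0 * self.step_size))
            x = tf.abs(step / self.step_size - 2.0 * cycle + 1.0)
            return self.base_lr + (self.max_lr - self.base_lr) * tf.maximum(0.0, 1.0 - x)

    optimizer = tf.keras.optimizers.Adam(learning_rate=TriangularCyclicalLR())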

    9. Re-parameterization

    [1908.03930] ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks

    https://zhuanlan.zhihu.com/p/361090497

    Purpose:

    Improves generalization by training several parallel sets of parameters and then merging their weights into a single kernel for inference; a fusion sketch follows.
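
    A minimal sketch of the ACNet-style fusion step (BatchNorm folding omitted): after training, the 1x3 and 3x1 branch kernels are added into the middle row/column of the 3x3 kernel, so inference only needs a single convolution. The kernel layout and helper name are assumptions for illustration.

    import numpy as np

    def fuse_acnet_kernels(k3x3, k1x3, k3x1):
        # Kernels in HWIO layout: (3, 3, Cin, Cout), (1, 3, Cin, Cout), (3, 1, Cin, Cout).
        fused = k3x3.copy()
        fused[1:2, :, :, :] += k1x3  # horizontal branch -> middle row of the 3x3 kernel
        fused[:, 1:2, :, :] += k3x1  # vertical branch   -> middle column of the 3x3 kernel
        return fused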

    10. Long-Tailed Learning

    [2110.04596] Deep Long-Tailed Learning: A Survey (arxiv.org)

    Jorwnpay/A-Long-Tailed-Survey: a Chinese translation of the paper Deep Long-Tailed Learning: A Survey (github.com)

    Purpose:

    Addresses the long-tailed (class-imbalance) problem, which can speed up convergence, improve generalization, and stabilize training; one illustrative recipe follows.
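
    The survey covers many recipes; as one concrete illustration (not from the original post), the sketch below applies logit adjustment, shifting the logits by the log class priors before the softmax cross-entropy.

    def logit_adjusted_loss(labels, logits, class_counts, tau=1.0):
        # class_counts: float tensor of per-class training-sample counts.
        prior = class_counts / tf.reduce_sum(class_counts)
        adjusted = logits + tau * tf.math.log(prior)
        return tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=adjusted)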
