    WaveNet: A First Look

    The paper describes DeepMind's work on speech generation with convolutional networks; the model is trained directly on raw audio waveforms. An open-source TensorFlow implementation, tensorflow-wavenet, is available on GitHub.
    The paper's main points:
    1. Speech generated by the model sounds very close to a real human voice.
    2. WaveNet can capture the characteristics of different speakers with high fidelity.
    3. Trained on music, the model can generate novel, high-fidelity musical fragments.
    4. With a discriminative head added, the model can also be used for speech recognition.

    Model Architecture

    (Figure: WaveNet model architecture; original image not available.)
    To enlarge the receptive field, the paper uses dilated convolutions (convolutions that skip over input samples). Although audio is continuous, each sample is quantized to one of 256 values, which turns prediction into a classification problem. The conditional probability model underlying the paper is implemented with a causal convolutional network.
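
    WaveNet is autoregressive: the joint probability of a waveform x = (x_1, ..., x_T) factorizes into per-sample conditionals, each predicted by the network from all earlier samples (this is the factorization given in the paper):

    p(x) = prod_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})

    Each conditional is a softmax over the 256 quantized amplitude values.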

    input

    Raw audio is stored as a sequence of 16-bit integer samples. To make the problem tractable, the paper applies µ-law companding: each sample is normalized and then mapped to one of 256 quantization levels.

    G711 encoding gives clear, natural-sounding speech but has a low compression ratio; it outputs 8 bits per sample and comes in two variants, µ-law and A-law. A-law (also called G711a) takes a 13-bit input (the high 13 bits of an s16 sample) and is widely used in Europe and China; µ-law (also called G711u) takes a 14-bit input and is mainly used in North America and Japan. The encoding algorithms are mostly table-based: a base value plus an average offset.
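
    As a concrete illustration, here is a minimal NumPy sketch of µ-law companding to 256 levels, mirroring the mu_law_encode call that appears in the loss code below; it assumes the waveform has already been normalized to [-1, 1]:

    import numpy as np

    def mu_law_encode(audio, quantization_channels=256):
        """Map waveform samples in [-1, 1] to integer levels 0..255 (µ-law companding)."""
        mu = quantization_channels - 1
        # Non-linear compression: small amplitudes get proportionally more resolution.
        magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
        signal = np.sign(audio) * magnitude                      # still in [-1, 1]
        return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)    # quantize to 0..255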

    Convolutions

    WaveNet uses causal convolutions, which guarantee that the prediction p(x_{t+1} | x_1, ..., x_t) emitted at timestep t cannot depend on any of the future samples x_{t+1}, x_{t+2}, ..., x_T.
    For 1-D data, the target label is therefore just the input sequence shifted by one step.

    input:  x, integers in 0-255
    output: label, x shifted left by one position: the first t samples are used to predict sample t+1, and the vacated last position is padded with 0.
    e.g. x = [1, 2, 3, 4, 5]  ->  y = [2, 3, 4, 5, 0]
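
    A quick NumPy sketch of this target construction (the TensorFlow loss code below does the same thing with tf.slice and tf.pad); the array is assumed 1-D for simplicity:

    import numpy as np

    def shift_targets(x):
        """Next-sample targets: y[t] = x[t + 1], with the last position padded with 0."""
        y = np.zeros_like(x)
        y[:-1] = x[1:]
        return y

    print(shift_targets(np.array([1, 2, 3, 4, 5])))   # [2 3 4 5 0]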

    Because the audio sampling rate is very high, the paper uses dilated convolutions to enlarge the receptive field: with the same number of convolutional layers and parameters, the receptive field grows exponentially with depth.
    (Figures: stacked dilated convolution layers conv1 and conv2, original images not available.)
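
    To make the exponential growth concrete, here is a small calculation of the receptive field of a stack of kernel-size-2 dilated convolutions. The dilation schedule 1, 2, ..., 256 repeated twice is only an assumption chosen to match the "18 layers" note in the pseudo-code below; other configurations give different numbers:

    # Each dilated layer with kernel size 2 adds `dilation` samples of context.
    dilations = [2 ** i for i in range(9)] * 2   # 1, 2, 4, ..., 256, repeated twice = 18 layers
    kernel_size = 2
    # 1 for the current sample, +1 for the initial causal layer, plus each dilated layer's context
    receptive_field = 1 + (kernel_size - 1) * 1 + (kernel_size - 1) * sum(dilations)
    print(receptive_field)   # 1024 samples, i.e. 64 ms at a 16 kHz sampling rate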

    Gated Activation Units

    (Figure: gated activation unit, original image not available.)
    For modelling audio signals, this gated non-linearity was found to work better than a plain ReLU.
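
    For reference, the gated activation unit from the paper, which the dilation-layer code below implements as tf.tanh(conv_filter) * tf.sigmoid(conv_gate), is:

    z = tanh(W_{f,k} * x) ⊙ sigmoid(W_{g,k} * x)

    where * is a dilated convolution, ⊙ is element-wise multiplication, k is the layer index, and f and g denote the filter and the gate.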

    residual block and skip connection

    WaveNet uses residual blocks and skip connections to speed up convergence and to let gradients propagate through a deeper model.
    Residual and skip connections were introduced by Kaiming He et al. in the 2015 ResNet paper.
    ResNet was motivated by the observation that deeper networks are harder to train, mainly because of vanishing/exploding gradients. One remedy is normalization, i.e. normalized initialization and normalization of intermediate layers (batch normalization). But as depth keeps increasing, training accuracy saturates and then degrades sharply: adding more layers to a suitably deep model produces higher training error, and this degradation is not caused by overfitting. ResNet addresses the degradation problem by introducing a deep residual learning framework.
    (Figure: residual learning building block, original image not available.)

    The construction adds an identity mapping (f(x) = x) and explicitly lets the stacked layers fit a residual mapping: if the desired underlying mapping is H(x), the non-linear layers instead fit F(x) = H(x) - x. Approximating an identity mapping with a stack of non-linear layers is hard; after the reformulation H(x) = F(x) + x, if the identity mapping is optimal, the weights of the non-linear layers can simply be driven towards zero.
    Experiments in the ResNet paper show that identity shortcuts are sufficient to solve the degradation problem and are cheap: they introduce neither extra parameters nor extra computational complexity.

    Formally, a building block is y = F(x, {W_i}) + x. When the dimensions of F and x do not match (for example before and after a pooling operation), a linear projection is applied to the shortcut instead: y = F(x, {W_i}) + W_s x. W_s is used only to match dimensions.
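
    A minimal sketch of such a building block, written in the same TF1-style API as the code below; w1 and w2 are hypothetical 1x1-convolution weight tensors whose channel counts match x:

    import tensorflow as tf

    def residual_block(x, w1, w2):
        """y = F(x) + x with two 1x1 convolutions; the shortcut is a parameter-free identity."""
        f = tf.nn.relu(tf.nn.conv1d(x, w1, stride=1, padding="SAME"))
        f = tf.nn.conv1d(f, w2, stride=1, padding="SAME")
        return x + f   # identity shortcut: introduces no extra parameters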


    Code

    pseudo-code

    def loss(input_batch):
    	'''Creates a WaveNet network and returns the autoencoding loss.'''
    	with tf.name_scope('wavenet'):
    		# mu-law encode the waveform; returns integers in 0-255
    		input_batch = mu_law_encode(input_batch,
    									self.quantization_channels)
    		encoded = one_hot(input_batch)
    		# WaveNet model output
    		raw_output = self._create_network(encoded)
    
    		with tf.name_scope('loss'):
    			# Shift left by one position (drop the first element) so the
    			# network is always trained to predict the next sample.
    			# encoded=[0,0,0,0,1,2,3,4,5], encoded.shape=(1,9,1)
    			# shifted.shape=(1,8,1), [0,0,0,1,2,3,4,5]
    			shifted = tf.slice(encoded, [0, 1, 0],
    							   [-1, tf.shape(encoded)[1] - 1, -1])
    			# Pad a zero at the end: shifted.shape=(1,9,1), [0,0,0,1,2,3,4,5,0]
    			shifted = tf.pad(shifted, [[0, 0], [0, 1], [0, 0]])
    
    			# Reshape the model output into per-sample predictions
    			prediction = tf.reshape(raw_output,
    									[-1, self.quantization_channels])
    			# Cross-entropy loss between predictions and shifted targets
    			loss = tf.nn.softmax_cross_entropy_with_logits(
    				logits=prediction,
    				labels=tf.reshape(shifted, [-1, self.quantization_channels]))
    			reduced_loss = tf.reduce_mean(loss)
    
    	return reduced_loss
    
    def create_network(input_batch):
    	outputs = []
    	current_layer = input_batch
    	current_layer = create_causal_layer(current_layer)
    	# Add all defined dilation layers.	#18 layers
    	for layer_index, dilation in enumerate(self.dilations):
    		output, current_layer = create_dilation_layer(
    			current_layer, layer_index, dilation)
    		outputs.append(output)
    	# Perform (+) -> ReLU -> 1x1 conv -> ReLU -> 1x1 conv to the output.
    	total = sum(outputs)
    	transformed1 = tf.nn.relu(total)
    	conv1 = tf.nn.conv1d(transformed1, w1, stride=1, padding="SAME")
    	transformed2 = tf.nn.relu(conv1)
    	conv2 = tf.nn.conv1d(transformed2, w2, stride=1, padding="SAME")
    
    	return conv2
    
    
    def create_dilation_layer(input_batch, layer_index, dilation):
    
    	'''Creates a single causal dilated convolution layer.
    	The layer contains a gated filter that connects to dense output
    	and to a skip connection:
    
    		   |-> [gate]   -|        |-> 1x1 conv -> skip output
    		   |             |-> (*) -|
    	input -|-> [filter] -|        |-> 1x1 conv -|
    		   |                                    |-> (+) -> dense output
    		   |------------------------------------|
    
    	Where `[gate]` and `[filter]` are causal convolutions with a
    	non-linear activation at the output.
    	'''
    
    	conv_filter = causal_conv(input_batch, weights_filter, dilation)
    	conv_gate = causal_conv(input_batch, weights_gate, dilation)
    	out = tf.tanh(conv_filter) * tf.sigmoid(conv_gate)
    	transformed = tf.nn.conv1d(
    		out, weights_dense, stride=1, padding="SAME", name="dense")
    	skip_contribution = tf.nn.conv1d(
    		out, weights_skip, stride=1, padding="SAME", name="skip")
    
    	# Return the skip output and (input + residual)
    	return skip_contribution, input_batch + transformed
    
    
    def create_causal_layer():
    	# The implementation is clever; see the detailed source-code analysis below.
    

    source code

    causal_conv

    def time_to_batch(value, dilation, name=None):
        with tf.name_scope('time_to_batch'):
            #value=[[[0],[0],[0],[0],[1],[2],[3],[4],[5]]]
            #value.shape=(1,9,1) dilation=4
            shape = tf.shape(value)
            #pad_elements=4-1-(9+4-1)%4=3
            pad_elements = dilation - 1 - (shape[1] + dilation - 1) % dilation
            #padded=[[[0],[0],[0],[0],[1],[2],[3],[4],[5],[0],[0],[0]]]
            padded = tf.pad(value, [[0, 0], [0, pad_elements], [0, 0]])
            # reshaped = [[[0],[0],[0],[0]], [[1],[2],[3],[4]], [[5],[0],[0],[0]]]
            # reshaped.shape = (3, 4, 1)
            reshaped = tf.reshape(padded, [-1, dilation, shape[2]])
            # transposed = [[[0],[1],[5]], [[0],[2],[0]], [[0],[3],[0]], [[0],[4],[0]]]
            # transposed.shape = (4, 3, 1): sub-batch i holds every dilation-th sample starting at i
            transposed = tf.transpose(reshaped, perm=[1, 0, 2])
            #return shape=(4,3,1)
            return tf.reshape(transposed, [shape[0] * dilation, -1, shape[2]])
    
    
    def batch_to_time(value, dilation, name=None):
        with tf.name_scope('batch_to_time'):
            shape = tf.shape(value)
            prepared = tf.reshape(value, [dilation, -1, shape[2]])
            transposed = tf.transpose(prepared, perm=[1, 0, 2])
            # Restores the original batch layout (inverse of time_to_batch);
            # for the running example the result has shape (1, 12, channels),
            # and the caller trims away the extra padded positions.
            return tf.reshape(transposed,
                              [tf.div(shape[0], dilation), -1, shape[2]])
    # Running example: filter_width=2, dilation=4, value=[[[1],[2],[3],[4],[5]]], value.shape=(1,5,1)
    def causal_conv(value, filter_, dilation, name='causal_conv'):
        with tf.name_scope(name):
            # Pad beforehand to preserve causality.
            filter_width = tf.shape(filter_)[0]
            #padding=[[0, 0], [4, 0], [0, 0]]
            padding = [[0, 0], [(filter_width - 1) * dilation, 0], [0, 0]]
            # Prepend 4 zeros along the time dimension: shape (1,5,1) -> (1,9,1)
            # padded=[[[0],[0],[0],[0],[1],[2],[3],[4],[5]]]
            padded = tf.pad(value, padding)
            if dilation > 1:
                #return shape=(4,3,1)
                transformed = time_to_batch(padded, dilation)
                
                conv = tf.nn.conv1d(transformed, filter_, stride=1, padding='SAME')
    
                restored = batch_to_time(conv, dilation)
            else:
                restored = tf.nn.conv1d(padded, filter_, stride=1, padding='SAME')
            # Remove excess elements at the end.
            result = tf.slice(restored,
                              [0, 0, 0],
                              [-1, tf.shape(value)[1], -1])
            # result is trimmed back to the time length of the original input: shape (1, 5, out_channels)
            return result
    

    dilation_layer

    def _create_dilation_layer(self, input_batch, layer_index, dilation):
    	'''Creates a single causal dilated convolution layer.
    
    	The layer contains a gated filter that connects to dense output
    	and to a skip connection:
    
    		   |-> [gate]   -|        |-> 1x1 conv -> skip output
    		   |             |-> (*) -|
    	input -|-> [filter] -|        |-> 1x1 conv -|
    		   |                                    |-> (+) -> dense output
    		   |------------------------------------|
    
    	Where `[gate]` and `[filter]` are causal convolutions with a
    	non-linear activation at the output.
    	'''
    	variables = self.variables['dilated_stack'][layer_index]
    
    	weights_filter = variables['filter']
    	weights_gate = variables['gate']
    
    	conv_filter = causal_conv(input_batch, weights_filter, dilation)
    	conv_gate = causal_conv(input_batch, weights_gate, dilation)
    
    	if self.use_biases:
    		filter_bias = variables['filter_bias']
    		gate_bias = variables['gate_bias']
    		conv_filter = tf.add(conv_filter, filter_bias)
    		conv_gate = tf.add(conv_gate, gate_bias)
    
    	# Gated activation: multiply the filter and gate outputs
    	out = tf.tanh(conv_filter) * tf.sigmoid(conv_gate)
    
    	# The 1x1 conv to produce the residual output
    	weights_dense = variables['dense']
    	transformed = tf.nn.conv1d(
    		out, weights_dense, stride=1, padding="SAME", name="dense")
    
    	# The 1x1 conv to produce the skip output
    	weights_skip = variables['skip']
    	#skip output
    	skip_contribution = tf.nn.conv1d(
    		out, weights_skip, stride=1, padding="SAME", name="skip")
    
    	if self.use_biases:
    		dense_bias = variables['dense_bias']
    		skip_bias = variables['skip_bias']
    		transformed = transformed + dense_bias
    		skip_contribution = skip_contribution + skip_bias
    
    	# Return the skip output and (input + residual)
    	return skip_contribution, input_batch + transformed
    

    network

    def _create_network(self, input_batch):
    	'''Construct the WaveNet network.'''
    	outputs = []
    	current_layer = input_batch
    
    	current_layer = self._create_causal_layer(current_layer)
    
    	# Add all defined dilation layers.	#18 layers
    	with tf.name_scope('dilated_stack'):
    		for layer_index, dilation in enumerate(self.dilations):
    			with tf.name_scope('layer{}'.format(layer_index)):
    				output, current_layer = self._create_dilation_layer(
    					current_layer, layer_index, dilation)
    				outputs.append(output)
    
    	# Post-processing layers
    	with tf.name_scope('postprocessing'):
    		# Perform (+) -> ReLU -> 1x1 conv -> ReLU -> 1x1 conv to
    		# postprocess the output.
    		# conv weight
    		w1 = self.variables['postprocessing']['postprocess1']
    		w2 = self.variables['postprocessing']['postprocess2']
    		if self.use_biases:
    			b1 = self.variables['postprocessing']['postprocess1_bias']
    			b2 = self.variables['postprocessing']['postprocess2_bias']
    
    		# Collect the skip connections from the output of each layer
    		# and sum them all up here.
    		total = sum(outputs)
    		transformed1 = tf.nn.relu(total)
    		conv1 = tf.nn.conv1d(transformed1, w1, stride=1, padding="SAME")
    		if self.use_biases:
    			conv1 = tf.add(conv1, b1)
    		transformed2 = tf.nn.relu(conv1)
    		conv2 = tf.nn.conv1d(transformed2, w2, stride=1, padding="SAME")
    		if self.use_biases:
    			conv2 = tf.add(conv2, b2)
    
    	return conv2
    

    References

    WAVENET
    github open source code
    【Emotibot Tech】WaveNet语音合成与深度生成模型解析
    WaveNet 分析和实现
    谷歌WaveNet如何通过深度学习方法来生成声音?
    谷歌WaveNet 源码详解
    技术 | DeepMind语音生成模型WaveNet的TensorFlow实现
    Pixel CNN, Wavenet, GCNN笔记
    Some ResNet references:
    https://zhuanlan.zhihu.com/p/27082562
    https://tracholar.github.io/wiki/machine-learning/residual-network.html
    https://zhuanlan.zhihu.com/p/22071346
    https://zhuanlan.zhihu.com/p/22447440
    http://blog.csdn.net/mao_feng/article/details/52734438
    http://blog.csdn.net/bea_tree/article/details/51817142
    http://www.voidcn.com/article/p-hnytmiyq-pt.html
    http://pengshuang.space/2017/08/05/Resnet-学习笔记/
    G711 encoding:
    https://en.wikipedia.org/wiki/Μ-law_algorithm
    http://www.voidcn.com/article/p-fgyvjsfz-bmq.html
    http://www.21ic.com/evm/audio/201705/721797.htm
