• [Text Classification-03] charCNN


    Contents

    1. Overview
    2. Dataset
    3. Main code

    1. Overview

    This text-classification series will consist of roughly eight articles. The code can be downloaded directly from GitHub and the training data from Baidu Cloud; import the project into PyCharm and it is ready to run. The series covers text classification based on pre-trained word2vec embeddings, as well as classification based on recent pre-trained models (ELMo, BERT, etc.). The full series:

    word2vec pre-trained word vectors

    textCNN model

    charCNN model

    Bi-LSTM model

    Bi-LSTM + Attention model

    Transformer model

    ELMo pre-trained model

    BERT pre-trained model

    charCNN model structure

    The charCNN paper, Character-level Convolutional Networks for Text Classification, proposes a structure of 6 convolutional layers followed by 3 fully connected layers.

    Two sets of structural parameters are proposed for datasets of different sizes (a large and a small variant); the tables below list them:

    1) Convolutional layers
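
    Reproduced here from the paper: the two variants share kernel widths and pooling sizes and differ only in the number of feature maps.

    Layer   Large feature maps   Small feature maps   Kernel   Pool
    1       1024                 256                  7        3
    2       1024                 256                  7        3
    3       1024                 256                  3        N/A
    4       1024                 256                  3        N/A
    5       1024                 256                  3        N/A
    6       1024                 256                  3        3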

    2) Fully connected layers
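
    Likewise reproduced from the paper; the size of the last layer depends on the task.

    Layer   Large units   Small units
    7       2048          1024
    8       2048          1024
    9       depends on the problem (number of classes)

    Note that the configuration used in this post (see 3.1) is a reduced variant: three convolutional layers with 256 filters and pooling size 4, plus a single 512-unit fully connected layer.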

    2. Dataset

    The dataset is the IMDB movie reviews. There are three data files in the /data/rawData directory: unlabeledTrainData.tsv, labeledTrainData.tsv and testData.tsv. Text classification requires labeled data (labeledTrainData), but when training the word2vec embedding model (unsupervised learning) the unlabeled data can be used as well.

    Training data download: https://pan.baidu.com/s/1-XEwx1ai8kkGsMagIFKX_g (extraction code: rtz8)
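
    For reference, a minimal sketch of peeking at the labeled file (assuming the standard Kaggle IMDB tsv layout with columns id, sentiment, review):

        import pandas as pd

        # The tsv files are tab-separated; quoting=3 (QUOTE_NONE) keeps the quotes
        # embedded in the review text intact
        df = pd.read_csv("../data/rawData/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
        print(df.shape)             # (25000, 3): 25,000 labeled reviews
        print(df.columns.tolist())  # ['id', 'sentiment', 'review']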

    3. Main code

    3.1 Training parameter configuration: parameter_config.py

        # Author:yifan
        # 1. Parameter configuration
        class TrainingConfig(object):
            epoches = 6
            evaluateEvery = 100
            checkpointEvery = 100
            learningRate = 0.001

        class ModelConfig(object):
            # Each sublist holds three elements: number of filters, filter height, pooling size
            convLayers = [[256, 7, 4],
                          [256, 7, 4],
                          [256, 3, 4]]
            fcLayers = [512]
            dropoutKeepProb = 0.5
            epsilon = 1e-3  # small value added in the BN layer to avoid division by zero
            decay = 0.999   # decay used for the moving averages in the BN layer

        class Config(object):
            # We use the 69 characters proposed in the paper to represent the input
            # (the double quote and the backslash are escaped for Python)
            alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
            # alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
            sequenceLength = 1014  # length of the character sequence
            batchSize = 128
            rate = 0.8  # proportion of data used for training
            dataSource = "../data/preProcess/labeledCharTrain.csv"
            training = TrainingConfig()
            model = ModelConfig()

        config = Config()
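
    Before building the graph in 3.3, it can be handy to check what sequence length survives the conv/pool stack, since the reshape into the fully connected layers depends on it. A minimal sketch of the arithmetic (VALID padding, pooling stride equal to the pooling size, as in the model code below):

        # Trace the sequence length through the conv/pool stack (VALID padding)
        length = 1014                                   # config.sequenceLength
        for numFilters, kernel, pool in [[256, 7, 4], [256, 7, 4], [256, 3, 4]]:
            length = length - kernel + 1                # VALID convolution
            length = length // pool                     # max pooling, stride == pool size
            print(length)                               # 252, then 61, then 14
        print(length * 256)                             # 3584: the fcDim fed to the first fc layer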

    3.2 Generating the training data: get_train_data.py

    1) Load the data and split every sentence into characters

    2) Build the char-to-index mapping and save it as JSON, so it can be loaded at inference time (a sample of this mapping is shown after this list)

    3) Convert the characters to one-hot vectors, used to initialize the model's embedding layer

    4) Split the dataset into a training set and a validation set
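
    For example, with the 69-character alphabet from parameter_config.py, the vocabulary built below is ["pad", "UNK", "a", "b", ...], so the saved charToIndex.json begins:

        {"pad": 0, "UNK": 1, "a": 2, "b": 3, "c": 4, ...}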

        # Author:yifan
        import json
        import pandas as pd
        import numpy as np
        import parameter_config

        # 2. Training data generation
        #   1) Load the data and split every sentence into characters
        #   2) Build the char-to-index mapping and save it as JSON for use at inference time
        #   3) Turn the characters into one-hot embeddings, used to initialize the model's embedding layer
        #   4) Split the dataset into a training set and a validation set

        # Data preprocessing class: generates the training and evaluation sets
        class Dataset(object):
            def __init__(self, config):   # the config.* values come from parameter_config.py
                self._dataSource = config.dataSource          # data path
                self._sequenceLength = config.sequenceLength  # length of the character sequence
                self._rate = config.rate                      # proportion of data used for training
                self._alphabet = config.alphabet
                self.trainReviews = []
                self.trainLabels = []
                self.evalReviews = []
                self.evalLabels = []
                self.charEmbedding = None
                self._charToIndex = {}
                self._indexToChar = {}

            def _readData(self, filePath):
                """
                Read the dataset from a csv file
                """
                df = pd.read_csv(filePath)
                labels = df["sentiment"].tolist()
                review = df["review"].tolist()
                reviews = [[char for char in line if char != " "] for line in review]
                return reviews, labels

            def _reviewProcess(self, review, sequenceLength, charToIndex):
                """
                Represent each review as a sequence of indices;
                in charToIndex, "pad" maps to index 0
                """
                reviewVec = np.zeros((sequenceLength))
                sequenceLen = sequenceLength
                # If the review is shorter than the fixed sequence length, only fill its actual length
                if len(review) < sequenceLength:
                    sequenceLen = len(review)
                for i in range(sequenceLen):
                    if review[i] in charToIndex:
                        reviewVec[i] = charToIndex[review[i]]
                    else:
                        reviewVec[i] = charToIndex["UNK"]
                return reviewVec

            def _genTrainEvalData(self, x, y, rate):
                """
                Generate the training and validation sets. Each output row is one review of
                sequenceLength = 1014 characters, each character represented by its index
                """
                reviews = []
                labels = []
                # Walk through all reviews and convert their characters to indices
                for i in range(len(x)):
                    reviewVec = self._reviewProcess(x[i], self._sequenceLength, self._charToIndex)
                    reviews.append(reviewVec)
                    labels.append([y[i]])
                trainIndex = int(len(x) * rate)
                trainReviews = np.asarray(reviews[:trainIndex], dtype="int64")
                trainLabels = np.array(labels[:trainIndex], dtype="float32")
                evalReviews = np.asarray(reviews[trainIndex:], dtype="int64")
                evalLabels = np.array(labels[trainIndex:], dtype="float32")
                return trainReviews, trainLabels, evalReviews, evalLabels

            def _getCharEmbedding(self, chars):
                """
                Map every character to a one-hot vector:
                "pad" is [0, 0, 0, ...], "UNK" is [1, 0, 0, ...], "a" is [0, 1, 0, ...], and so on
                """
                alphabet = ["UNK"] + [char for char in self._alphabet]
                vocab = ["pad"] + alphabet
                charEmbedding = []
                charEmbedding.append(np.zeros(len(alphabet), dtype="float32"))

                for i, alpha in enumerate(alphabet):
                    onehot = np.zeros(len(alphabet), dtype="float32")
                    # Generate the one-hot vector for this character
                    onehot[i] = 1
                    # Append it to the character embedding matrix
                    charEmbedding.append(onehot)
                return vocab, np.array(charEmbedding)

            def _genVocabulary(self, reviews):
                """
                Generate the character embeddings and the char-index mapping dictionaries
                """
                chars = [char for char in self._alphabet]
                vocab, charEmbedding = self._getCharEmbedding(chars)
                self.charEmbedding = charEmbedding

                self._charToIndex = dict(zip(vocab, list(range(len(vocab)))))
                self._indexToChar = dict(zip(list(range(len(vocab))), vocab))

                # Save the vocabulary-index mappings as JSON so they can be loaded at inference time
                with open("../data/charJson/charToIndex.json", "w", encoding="utf-8") as f:
                    json.dump(self._charToIndex, f)
                with open("../data/charJson/indexToChar.json", "w", encoding="utf-8") as f:
                    json.dump(self._indexToChar, f)

            def dataGen(self):
                """
                Initialize the training and validation sets
                """
                # Load the dataset
                # reviews: e.g. [['"', 'w', 'i', 't', 'h', 'a', 'l', 'l', ...], ...]
                # labels: [1, ...]
                reviews, labels = self._readData(self._dataSource)
                # Initialize the vocabulary-index mappings and the embedding matrix
                self._genVocabulary(reviews)
                # Initialize the training and validation sets (20000 training, 5000 validation)
                trainReviews, trainLabels, evalReviews, evalLabels = self._genTrainEvalData(reviews, labels, self._rate)
                self.trainReviews = trainReviews
                self.trainLabels = trainLabels
                self.evalReviews = evalReviews
                self.evalLabels = evalLabels

        # test
        # config = parameter_config.Config()
        # data = Dataset(config)
        # data.dataGen()
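
    As a sanity check, running the test stub above (from another script, importing the module) should give shapes like these — a sketch, assuming the 25,000 labeled reviews split with rate = 0.8:

        import parameter_config
        import get_train_data

        config = parameter_config.Config()
        data = get_train_data.Dataset(config)
        data.dataGen()
        print(data.trainReviews.shape)    # (20000, 1014): one row of character indices per review
        print(data.evalReviews.shape)     # (5000, 1014)
        print(data.charEmbedding.shape)   # (71, 70): pad + UNK + 69 characters as one-hot rows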

    3.3 Model construction: mode_structure.py

        # Author:yifan
        import tensorflow as tf
        import math
        import parameter_config

        # 3. Build the model: define the char-CNN classifier
        class CharCNN(object):
            """
            char-CNN for text classification.
            A BN layer was added to the charCNN model, but the effect was not obvious and
            there were even some convergence problems, to be investigated later.
            """
            def __init__(self, config, charEmbedding):
                # placeholders for input, output and dropout
                self.inputX = tf.placeholder(tf.int32, [None, config.sequenceLength], name="inputX")
                self.inputY = tf.placeholder(tf.float32, [None, 1], name="inputY")
                self.dropoutKeepProb = tf.placeholder(tf.float32, name="dropoutKeepProb")
                self.isTraining = tf.placeholder(tf.bool, name="isTraining")
                self.epsilon = config.model.epsilon
                self.decay = config.model.decay

                # Character embedding
                with tf.name_scope("embedding"):
                    # Use the one-hot character vectors to initialize the embedding matrix
                    self.W = tf.Variable(tf.cast(charEmbedding, dtype=tf.float32, name="charEmbedding"), name="W")
                    # Look up the character embeddings
                    self.embededChars = tf.nn.embedding_lookup(self.W, self.inputX)
                    # Add a channel dimension
                    self.embededCharsExpand = tf.expand_dims(self.embededChars, -1)

                for i, cl in enumerate(config.model.convLayers):
                    print("Processing conv layer " + str(i + 1))
                    # Use name_scope to group the variables of each layer
                    with tf.name_scope("convLayer-%s" % (i + 1)):
                        # Width of the character vectors
                        filterWidth = self.embededCharsExpand.get_shape()[2].value
                        # filterShape = [height, width, in_channels, out_channels]
                        filterShape = [cl[1], filterWidth, 1, cl[0]]
                        stdv = 1 / math.sqrt(cl[0] * cl[1])

                        # Initialize w and b from a uniform distribution
                        wConv = tf.Variable(tf.random_uniform(filterShape, minval=-stdv, maxval=stdv),
                                            dtype='float32', name='w')
                        bConv = tf.Variable(tf.random_uniform(shape=[cl[0]], minval=-stdv, maxval=stdv), name='b')
                        # w_conv = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.05), name="w")
                        # b_conv = tf.Variable(tf.constant(0.1, shape=[cl[0]]), name="b")

                        # Build the convolution layer
                        conv = tf.nn.conv2d(self.embededCharsExpand, wConv, strides=[1, 1, 1, 1], padding="VALID", name="conv")
                        # tf.nn.conv2d only performs the convolution, so the bias is added
                        # explicitly and the result is fed through the relu activation
                        hConv = tf.nn.bias_add(conv, bConv)
                        hConv = tf.nn.relu(hConv)

                        # with tf.name_scope("batchNormalization"):
                        #     hConvBN = self._batchNorm(hConv)

                        if cl[-1] is not None:
                            ksizeShape = [1, cl[2], 1, 1]
                            hPool = tf.nn.max_pool(hConv, ksize=ksizeShape, strides=ksizeShape, padding="VALID", name="pool")
                        else:
                            hPool = hConv

                        print(hPool.shape)

                        # Transpose so the output can be fed into the next conv layer
                        self.embededCharsExpand = tf.transpose(hPool, [0, 1, 3, 2], name="transpose")

                print(self.embededCharsExpand)
                with tf.name_scope("reshape"):
                    fcDim = self.embededCharsExpand.get_shape()[1].value * self.embededCharsExpand.get_shape()[2].value
                    self.inputReshape = tf.reshape(self.embededCharsExpand, [-1, fcDim])

                weights = [fcDim] + config.model.fcLayers

                for i, fl in enumerate(config.model.fcLayers):   # fcLayers = [512]
                    with tf.name_scope("fcLayer-%s" % (i + 1)):
                        print("Processing fc layer " + str(i + 1))
                        stdv = 1 / math.sqrt(weights[i])
                        # Initialize w and b of the fully connected layer from a uniform distribution
                        wFc = tf.Variable(tf.random_uniform([weights[i], fl], minval=-stdv, maxval=stdv), dtype="float32",
                                          name="w")
                        bFc = tf.Variable(tf.random_uniform(shape=[fl], minval=-stdv, maxval=stdv), dtype="float32", name="b")
                        # w_fc = tf.Variable(tf.truncated_normal([weights[i], fl], stddev=0.05), name="W")
                        # b_fc = tf.Variable(tf.constant(0.1, shape=[fl]), name="b")

                        self.fcInput = tf.nn.relu(tf.matmul(self.inputReshape, wFc) + bFc)
                        with tf.name_scope("dropOut"):
                            self.fcInputDrop = tf.nn.dropout(self.fcInput, self.dropoutKeepProb)
                    self.inputReshape = self.fcInputDrop

                with tf.name_scope("outputLayer"):
                    stdv = 1 / math.sqrt(weights[-1])
                    # Initialize the hidden-to-output weights and bias
                    # w_out = tf.Variable(tf.truncated_normal([fc_layers[-1], num_classes], stddev=0.1), name="W")
                    # b_out = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
                    wOut = tf.Variable(tf.random_uniform([config.model.fcLayers[-1], 1], minval=-stdv, maxval=stdv),
                                       dtype="float32", name="w")
                    bOut = tf.Variable(tf.random_uniform(shape=[1], minval=-stdv, maxval=stdv), name="b")
                    # tf.nn.xw_plus_b computes x @ w + b
                    self.predictions = tf.nn.xw_plus_b(self.inputReshape, wOut, bOut, name="predictions")
                    # Binary classification: a logit >= 0 is equivalent to sigmoid(logit) >= 0.5
                    self.binaryPreds = tf.cast(tf.greater_equal(self.predictions, 0.0), tf.float32, name="binaryPreds")

                with tf.name_scope("loss"):
                    # Loss function: sigmoid cross-entropy on the logits
                    losses = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.predictions, labels=self.inputY)
                    self.loss = tf.reduce_mean(losses)

            def _batchNorm(self, x):
                # Batch normalization implementation
                gamma = tf.Variable(tf.ones([x.get_shape()[3].value]))
                beta = tf.Variable(tf.zeros([x.get_shape()[3].value]))
                self.popMean = tf.Variable(tf.zeros([x.get_shape()[3].value]), trainable=False, name="popMean")
                self.popVariance = tf.Variable(tf.ones([x.get_shape()[3].value]), trainable=False, name="popVariance")

                def batchNormTraining():
                    # Use the right axes so the mean and variance are computed per feature map,
                    # not over the whole tensor
                    batchMean, batchVariance = tf.nn.moments(x, [0, 1, 2], keep_dims=False)
                    trainMean = tf.assign(self.popMean, self.popMean * self.decay + batchMean * (1 - self.decay))
                    trainVariance = tf.assign(self.popVariance,
                                              self.popVariance * self.decay + batchVariance * (1 - self.decay))
                    with tf.control_dependencies([trainMean, trainVariance]):
                        return tf.nn.batch_normalization(x, batchMean, batchVariance, beta, gamma, self.epsilon)

                def batchNormInference():
                    return tf.nn.batch_normalization(x, self.popMean, self.popVariance, beta, gamma, self.epsilon)

                batchNormalizedOutput = tf.cond(self.isTraining, batchNormTraining, batchNormInference)
                return tf.nn.relu(batchNormalizedOutput)

    3.4 Model training: mode_trainning.py

        # Author:yifan
        import os
        import datetime
        import warnings
        import numpy as np
        import tensorflow as tf
        from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score
        warnings.filterwarnings("ignore")
        import parameter_config
        import get_train_data
        import mode_structure

        # Get the config and data from the previous modules
        config = parameter_config.Config()
        data = get_train_data.Dataset(config)
        data.dataGen()

        # 4. Batch generation
        def nextBatch(x, y, batchSize):
            # Shuffle the data and yield batches with a generator
            perm = np.arange(len(x))
            np.random.shuffle(perm)
            x = x[perm]
            y = y[perm]
            numBatches = len(x) // batchSize

            for i in range(numBatches):
                start = i * batchSize
                end = start + batchSize
                batchX = np.array(x[start: end], dtype="int64")
                batchY = np.array(y[start: end], dtype="float32")
                yield batchX, batchY

        # 5. Performance metrics
        def mean(item):
            return sum(item) / len(item)

        def genMetrics(trueY, predY, binaryPredY):
            """
            Compute accuracy, auc, precision and recall
            """
            auc = roc_auc_score(trueY, predY)
            accuracy = accuracy_score(trueY, binaryPredY)
            precision = precision_score(trueY, binaryPredY, average='macro')
            recall = recall_score(trueY, binaryPredY, average='macro')
            return round(accuracy, 4), round(auc, 4), round(precision, 4), round(recall, 4)

        # 6. Model training
        # Training and validation sets
        trainReviews = data.trainReviews
        trainLabels = data.trainLabels
        evalReviews = data.evalReviews
        evalLabels = data.evalLabels
        charEmbedding = data.charEmbedding

        # Define the computation graph
        with tf.Graph().as_default():
            session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
            session_conf.gpu_options.allow_growth = True
            session_conf.gpu_options.per_process_gpu_memory_fraction = 0.9  # GPU memory fraction
            sess = tf.Session(config=session_conf)

            # Define the session
            with sess.as_default():
                cnn = mode_structure.CharCNN(config, charEmbedding)
                globalStep = tf.Variable(0, name="globalStep", trainable=False)
                # Define the optimizer, passing in the learning rate
                optimizer = tf.train.RMSPropOptimizer(config.training.learningRate)
                # Compute the gradients for each variable
                gradsAndVars = optimizer.compute_gradients(cnn.loss)
                # Apply the gradients to the variables to build the training op
                trainOp = optimizer.apply_gradients(gradsAndVars, global_step=globalStep)

                # Summaries for tensorBoard
                for g, v in gradsAndVars:
                    if g is not None:
                        tf.summary.histogram("{}/grad/hist".format(v.name), g)
                        tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
                outDir = os.path.abspath(os.path.join(os.path.curdir, "summarys"))
                print("Writing to {}\n".format(outDir))
                lossSummary = tf.summary.scalar("trainLoss", cnn.loss)

                summaryOp = tf.summary.merge_all()

                trainSummaryDir = os.path.join(outDir, "train")
                trainSummaryWriter = tf.summary.FileWriter(trainSummaryDir, sess.graph)
                evalSummaryDir = os.path.join(outDir, "eval")
                evalSummaryWriter = tf.summary.FileWriter(evalSummaryDir, sess.graph)

                # Saver for checkpoint files (one way of saving the model)
                saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)

                # The other way of saving: export a SavedModel (pb file)
                builder = tf.saved_model.builder.SavedModelBuilder("../model/charCNN/savedModel")

                # Initialize all variables
                sess.run(tf.global_variables_initializer())

                def trainStep(batchX, batchY):
                    """
                    One training step
                    """
                    feed_dict = {
                        cnn.inputX: batchX,
                        cnn.inputY: batchY,
                        cnn.dropoutKeepProb: config.model.dropoutKeepProb,
                        cnn.isTraining: True
                    }
                    _, summary, step, loss, predictions, binaryPreds = sess.run(
                        [trainOp, summaryOp, globalStep, cnn.loss, cnn.predictions, cnn.binaryPreds],
                        feed_dict)
                    timeStr = datetime.datetime.now().isoformat()
                    acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)
                    print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(
                        timeStr, step, loss, acc, auc, precision, recall))
                    trainSummaryWriter.add_summary(summary, step)

                def devStep(batchX, batchY):
                    """
                    One validation step
                    """
                    feed_dict = {
                        cnn.inputX: batchX,
                        cnn.inputY: batchY,
                        cnn.dropoutKeepProb: 1.0,
                        cnn.isTraining: False
                    }
                    summary, step, loss, predictions, binaryPreds = sess.run(
                        [summaryOp, globalStep, cnn.loss, cnn.predictions, cnn.binaryPreds],
                        feed_dict)

                    acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)
                    evalSummaryWriter.add_summary(summary, step)

                    return loss, acc, auc, precision, recall

                for i in range(config.training.epoches):
                    # Train the model
                    print("start training model")
                    for batchTrain in nextBatch(trainReviews, trainLabels, config.batchSize):
                        trainStep(batchTrain[0], batchTrain[1])

                        currentStep = tf.train.global_step(sess, globalStep)
                        if currentStep % config.training.evaluateEvery == 0:
                            print("\nEvaluation:")

                            losses = []
                            accs = []
                            aucs = []
                            precisions = []
                            recalls = []

                            for batchEval in nextBatch(evalReviews, evalLabels, config.batchSize):
                                loss, acc, auc, precision, recall = devStep(batchEval[0], batchEval[1])
                                losses.append(loss)
                                accs.append(acc)
                                aucs.append(auc)
                                precisions.append(precision)
                                recalls.append(recall)

                            time_str = datetime.datetime.now().isoformat()
                            print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(
                                time_str, currentStep, mean(losses), mean(accs), mean(aucs),
                                mean(precisions), mean(recalls)))

                        if currentStep % config.training.checkpointEvery == 0:
                            # Save a checkpoint file
                            path = saver.save(sess, "../model/charCNN/model/my-model", global_step=currentStep)
                            print("Saved model checkpoint to {}\n".format(path))

                inputs = {"inputX": tf.saved_model.utils.build_tensor_info(cnn.inputX),
                          "keepProb": tf.saved_model.utils.build_tensor_info(cnn.dropoutKeepProb)}

                outputs = {"binaryPreds": tf.saved_model.utils.build_tensor_info(cnn.binaryPreds)}

                prediction_signature = tf.saved_model.signature_def_utils.build_signature_def(
                    inputs=inputs, outputs=outputs,
                    method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
                legacy_init_op = tf.group(tf.tables_initializer(), name="legacy_init_op")
                builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING],
                                                     signature_def_map={"predict": prediction_signature},
                                                     legacy_init_op=legacy_init_op)

                builder.save()
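
    The script above saves the model in two forms: checkpoint files (used by predict.py in 3.5) and a SavedModel. A minimal sketch of loading the SavedModel back, assuming the export path and the "predict" signature name used above (TF 1.x API):

        import tensorflow as tf

        with tf.Session(graph=tf.Graph()) as sess:
            # Load the graph and variables exported by builder.save()
            metaGraph = tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING],
                                                   "../model/charCNN/savedModel")
            signature = metaGraph.signature_def["predict"]
            inputX = sess.graph.get_tensor_by_name(signature.inputs["inputX"].name)
            keepProb = sess.graph.get_tensor_by_name(signature.inputs["keepProb"].name)
            binaryPreds = sess.graph.get_tensor_by_name(signature.outputs["binaryPreds"].name)
            # reviewVec: a [1, 1014] array of character indices, e.g. built as in predict.py
            # pred = sess.run(binaryPreds, feed_dict={inputX: reviewVec, keepProb: 1.0})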

    3.5 Prediction: predict.py

        # Author:yifan
        import tensorflow as tf
        import parameter_config
        import get_train_data

        config = parameter_config.Config()
        data = get_train_data.Dataset(config)

        # 7. Prediction
        x = "this movie is full of references like mad max ii the wild one and many others the ladybug´s face it´s a clear reference or tribute to peter lorre this movie is a masterpiece we´ll talk much more about in the future"
        # x = "This film is not good"   # predicted as 1
        # x = "This film is   bad"      # predicted as 0
        # x = "This film is   good"     # predicted as 1

        # Turn the sentence into an index vector, using the utilities from get_train_data
        y = list(x)
        data._genVocabulary(y)
        print(x)
        reviewVec = data._reviewProcess(y, config.sequenceLength, data._charToIndex)
        print(reviewVec)

        graph = tf.Graph()
        with graph.as_default():
            gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
            session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False, gpu_options=gpu_options)
            sess = tf.Session(config=session_conf)
            with sess.as_default():
                # Restore the model from the latest checkpoint
                checkpoint_file = tf.train.latest_checkpoint("../model/charCNN/model/")
                saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
                saver.restore(sess, checkpoint_file)

                # Get the input tensors the prediction depends on
                inputX          = graph.get_operation_by_name("inputX").outputs[0]
                dropoutKeepProb = graph.get_operation_by_name("dropoutKeepProb").outputs[0]

                # Get the output tensor
                predictions = graph.get_tensor_by_name("outputLayer/binaryPreds:0")
                pred = sess.run(predictions, feed_dict={inputX: [reviewVec], dropoutKeepProb: 1.0})[0]

        # pred = [idx2label[item] for item in pred]
        print(pred)   # 1.0 for positive, 0.0 for negative

    Results

     The full code is available at: https://github.com/yifanhunter/NLP_textClassifier

    Main reference:

    [1] https://home.cnblogs.com/u/jiangxinyang/
