• NLP (18): IMDB Sentiment Analysis with a One-Dimensional Convolutional Network


    Original link: http://www.one2know.cn/nlp18/

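    • Preparation
      The Keras IMDB dataset contains movie reviews encoded as lists of word indices, each with a sentiment label (0 = negative, 1 = positive)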
    import pandas as pd
    from keras.preprocessing import sequence
    from keras.models import Sequential
    from keras.layers import Dense,Dropout,Activation
    from keras.layers import Embedding
    from keras.layers import Conv1D,GlobalAveragePooling1D
    from keras.datasets import imdb
    from sklearn.metrics import accuracy_score,classification_report
    
    # Parameters: keep only the 6000 most frequent words; maximum review length is 400 tokens
    max_features = 6000
    max_length = 400
    (x_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)
    print(len(x_train),'train observations')
    print(len(x_test),'test observations')
    
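    wind = imdb.get_word_index()  # dict mapping each word to its frequency rank
    # load_data() shifts every rank by 3 (0 = padding, 1 = start, 2 = out-of-vocabulary),
    # so build the reverse mapping with the same offset: encoded index => word
    revind = {v + 3: k for k, v in wind.items()}
    print(x_train[0])
    print(y_train[0])
    
    def decode(sent_list):  # decode with the reverse mapping: indices => words
        # reserved indices (padding, start, out-of-vocabulary) have no word and are shown as '?'
        return " ".join(revind.get(i, '?') for i in sent_list)
    print(decode(x_train[0]))
    

    Output:

    25000 train observations
    25000 test observations
    [1, 14, 22, 16, 43, 530, 973, 1622, 1385, ...]
    1
    tsukino 'royale rumbustious canet thrace bellow headbanger ...

    Note: the decoded line above is from the original run, in which revind was built with enumerate(wind) and therefore paired each index with an unrelated word. With the reverse mapping shown here, the same call prints the review as readable English, with '?' in place of the reserved indices.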
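By default, `imdb.load_data` shifts every word index by 3 (`index_from=3`), with 1 marking the start of a review and 2 standing for out-of-vocabulary words; 0 is conventionally used for padding. A minimal sketch of decoding under that convention follows. The dictionary `toy_word_index` and the helper `decode_review` are hypothetical and only stand in for `imdb.get_word_index()` and the `decode` function, so that the example runs without downloading the dataset:

```python
# Hypothetical stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent)
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

INDEX_FROM = 3  # default offset applied by imdb.load_data
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<oov>"}  # reserved indices

# Invert the mapping and apply the offset: encoded index -> word
index_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
index_to_word.update(SPECIAL)

def decode_review(encoded):
    """Map a list of encoded indices back to a space-separated string."""
    return " ".join(index_to_word.get(i, "<unk>") for i in encoded)

print(decode_review([1, 4, 5, 6, 7, 2]))
# prints: <start> the film was brilliant <oov>
```

Without the offset, every index would be mapped to the word three ranks away from the intended one.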
    
    • How to implement
      1. Preprocessing: pad the data to a fixed length
      2. Build and validate the 1D CNN model
      3. Evaluate the model
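The preprocessing in step 1 can be illustrated without Keras. With its default arguments, `sequence.pad_sequences` pads short sequences with zeros on the left and, for sequences longer than `maxlen`, keeps only the last `maxlen` tokens. The function `pad_pre` below is a hypothetical pure-Python imitation of that default behaviour, written only for illustration:

```python
def pad_pre(seqs, maxlen, value=0):
    """Imitate pad_sequences defaults: left-pad with `value`, keep the last `maxlen` tokens."""
    padded = []
    for seq in seqs:
        trunc = list(seq)[-maxlen:]  # 'pre' truncation drops tokens from the front
        padded.append([value] * (maxlen - len(trunc)) + trunc)  # 'pre' padding adds zeros on the left
    return padded

toy_reviews = [[5, 6, 7], [1, 2, 3, 4, 5, 6]]
print(pad_pre(toy_reviews, maxlen=4))
# prints: [[0, 5, 6, 7], [3, 4, 5, 6]]
```

After this step every review has the same length, which is what allows the data to be stacked into a single array of shape (number of reviews, `maxlen`).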
    • Code
    import pandas as pd
    from keras.preprocessing import sequence
    from keras.models import Sequential
    from keras.layers import Dense,Dropout,Activation
    from keras.layers import Embedding
    from keras.layers import Conv1D,GlobalAveragePooling1D
    from keras.datasets import imdb
    from sklearn.metrics import accuracy_score,classification_report
    
    # Parameters: keep only the 6000 most frequent words; maximum review length is 400 tokens
    max_features = 6000
    max_length = 400
    (x_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)
    # print(x_train) # a list of reviews, each a list of word indices
    # print(y_train) # an array of 0s and 1s
    # print(len(x_train),'train observations')
    # print(len(x_test),'test observations')
    
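    wind = imdb.get_word_index()  # dict mapping each word to its frequency rank
    # load_data() shifts every rank by 3 (0 = padding, 1 = start, 2 = out-of-vocabulary),
    # so build the reverse mapping with the same offset: encoded index => word
    revind = {v + 3: k for k, v in wind.items()}
    # print(x_train[0])
    # print(y_train[0])
    
    def decode(sent_list):  # decode with the reverse mapping: indices => words
        # reserved indices (padding, start, out-of-vocabulary) have no word and are shown as '?'
        return " ".join(revind.get(i, '?') for i in sent_list)
    # print(decode(x_train[0]))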
    
    # Pad (or truncate) every review to 400 tokens so that all inputs have the same length
    x_train = sequence.pad_sequences(x_train,maxlen=max_length)
    x_test = sequence.pad_sequences(x_test,maxlen=max_length)
    print('x_train.shape:',x_train.shape)
    print('x_test.shape:',x_test.shape)
    
    ## 1D CNN model built with Keras
    # Hyperparameters
    batch_size = 32
    embedding_dims = 60
    num_kernels = 260
    kernel_size = 3
    hidden_dims = 300
    epochs = 3
    # Build the model
    model = Sequential()
    model.add(Embedding(max_features,embedding_dims,input_length=max_length))
    model.add(Dropout(0.2))
    model.add(Conv1D(num_kernels,kernel_size,padding='valid',activation='relu',strides=1))
    model.add(GlobalAveragePooling1D())
    model.add(Dense(hidden_dims))
    model.add(Dropout(0.5))
    model.add(Activation('relu'))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
    print(model.summary())
    
    model.fit(x_train,y_train,batch_size=batch_size,epochs=epochs,validation_split=0.2)
    
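    # Predict: threshold the sigmoid output at 0.5 to get class labels (0 or 1).
    # This matches predict_classes(), which newer Keras releases no longer provide.
    y_train_predclass = (model.predict(x_train,batch_size=batch_size) > 0.5).astype('int32').ravel()
    y_test_preclass = (model.predict(x_test,batch_size=batch_size) > 0.5).astype('int32').ravel()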
    
    print('\n\nCNN 1D - Train accuracy:',round(accuracy_score(y_train,y_train_predclass),3))
    print('\nCNN 1D of Training data\n',classification_report(y_train,y_train_predclass))
    print('\nCNN 1D - Train Confusion Matrix\n\n',pd.crosstab(y_train,y_train_predclass,
                        rownames=['Actuall'],colnames=['Predicted']))
    print('\nCNN 1D - Test accuracy:',round(accuracy_score(y_test,y_test_preclass),3))
    print('\nCNN 1D of Test data\n',classification_report(y_test,y_test_preclass))
    print('\nCNN 1D - Test Confusion Matrix\n\n',pd.crosstab(y_test,y_test_preclass,
                        rownames=['Actuall'],colnames=['Predicted']))
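The output shapes and parameter counts that `model.summary()` reports for this architecture can be checked by hand. The sketch below uses the standard formulas for embedding, `Conv1D` (with `padding='valid'`) and `Dense` layers; the helper function names are hypothetical, while the hyperparameter values are the ones set in the code above:

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim

def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units

max_features, max_length = 6000, 400
embedding_dims, num_kernels, kernel_size, hidden_dims = 60, 260, 3, 300

conv_len = conv1d_output_length(max_length, kernel_size)
counts = [
    embedding_params(max_features, embedding_dims),           # Embedding
    conv1d_params(embedding_dims, num_kernels, kernel_size),  # Conv1D
    dense_params(num_kernels, hidden_dims),                   # first Dense
    dense_params(hidden_dims, 1),                             # output Dense
]
print(conv_len, counts, sum(counts))
# prints: 398 [360000, 47060, 78300, 301] 485661
```

The dropout, pooling and activation layers have no trainable weights, so they contribute 0 parameters.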
    

    Output:

    Using TensorFlow backend.
    x_train.shape: (25000, 400)
    x_test.shape: (25000, 400)
    WARNING:tensorflow:From D:\Python37\Lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Colocations handled automatically by placer.
    WARNING:tensorflow:From D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    embedding_1 (Embedding)      (None, 400, 60)           360000    
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 400, 60)           0         
    _________________________________________________________________
    conv1d_1 (Conv1D)            (None, 398, 260)          47060     
    _________________________________________________________________
    global_average_pooling1d_1 ( (None, 260)               0         
    _________________________________________________________________
    dense_1 (Dense)              (None, 300)               78300     
    _________________________________________________________________
    dropout_2 (Dropout)          (None, 300)               0         
    _________________________________________________________________
    activation_1 (Activation)    (None, 300)               0         
    _________________________________________________________________
    dense_2 (Dense)              (None, 1)                 301       
    _________________________________________________________________
    activation_2 (Activation)    (None, 1)                 0         
    =================================================================
    Total params: 485,661
    Trainable params: 485,661
    Non-trainable params: 0
    _________________________________________________________________
    None
    WARNING:tensorflow:From D:\Python37\Lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.cast instead.
    Train on 20000 samples, validate on 5000 samples
    Epoch 1/3
    2019-07-07 15:27:37.848057: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
    
       32/20000 [..............................] - ETA: 7:03 - loss: 0.6929 - acc: 0.5000
       64/20000 [..............................] - ETA: 4:13 - loss: 0.6927 - acc: 0.5156
       96/20000 [..............................] - ETA: 3:19 - loss: 0.6933 - acc: 0.5000
      128/20000 [..............................] - ETA: 2:50 - loss: 0.6935 - acc: 0.4844
      160/20000 [..............................] - ETA: 2:32 - loss: 0.6931 - acc: 0.4813
      (the remaining training progress output for all epochs is omitted here)
      
    CNN 1D - Train accuracy: 0.949
    
    CNN 1D of Training data
                   precision    recall  f1-score   support
    
               0       0.94      0.96      0.95     12500
               1       0.95      0.94      0.95     12500
    
        accuracy                           0.95     25000
       macro avg       0.95      0.95      0.95     25000
    weighted avg       0.95      0.95      0.95     25000
    
    CNN 1D - Train Confusion Matrix
    
     Predicted      0      1
    Actuall                
    0          11938    562
    1            715  11785
    
    CNN 1D - Test accuracy: 0.876
    
    CNN 1D of Test data
                   precision    recall  f1-score   support
    
               0       0.86      0.89      0.88     12500
               1       0.89      0.86      0.87     12500
    
        accuracy                           0.88     25000
       macro avg       0.88      0.88      0.88     25000
    weighted avg       0.88      0.88      0.88     25000
    
    CNN 1D - Test Confusion Matrix
    
     Predicted      0      1
    Actuall                
    0          11144   1356
    1           1744  10756
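The figures in the classification reports follow directly from the confusion matrices. As a rough check, the sketch below recomputes accuracy and the class-1 precision, recall and F1 from the four cells of each matrix; `binary_metrics` is a hypothetical helper, and the counts are copied from the output above (rows are actual labels, columns are predicted labels):

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy plus precision/recall/F1 for the positive class (label 1)."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Cells taken from the train and test confusion matrices
train = binary_metrics(tn=11938, fp=562, fn=715, tp=11785)
test = binary_metrics(tn=11144, fp=1356, fn=1744, tp=10756)

print(round(train["accuracy"], 3), round(test["accuracy"], 3))
# prints: 0.949 0.876
print({name: round(value, 2) for name, value in test.items()})
```

The gap between 0.949 on the training data and 0.876 on the test data suggests that the model overfits the training set to some degree.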
    
    
  • Original source: https://www.cnblogs.com/peng8098/p/nlp_18.html