• TensorFlow Learning | Spam SMS Prediction Based on TF-IDF


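    Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/shillyshally/article/details/82988661

    While studying, I came across the use of TF-IDF for spam SMS prediction, so as a learning exercise and a record I tried implementing it by following the textbook's approach.

    First, define a few parameters.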

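    import tensorflow as tf  # the code in this post uses the TensorFlow 1.x API

    sess = tf.Session()
    batch_size = 200  # batch size used during training
    max_features = 1000  # dimensionality of the TF-IDF representation of each SMS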

    1. Read an existing public SMS dataset

    import csv
    import io
    import os
    from zipfile import ZipFile

    import requests

    # Read the SMS dataset (if it has already been downloaded, read the local copy instead)
    save_file_name = 'temp_spam_data.csv'
    if os.path.isfile(save_file_name):
        text_data = []
        with open(save_file_name, 'r', newline='') as temp_output_file:
            reader = csv.reader(temp_output_file)
            for row in reader:
                text_data.append(row)
    else:
        zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
        r = requests.get(zip_url)
        z = ZipFile(io.BytesIO(r.content))
        file = z.read('SMSSpamCollection')
        # Decode, drop non-ASCII characters, then split into lines and tab-separated fields
        text_data = file.decode()
        text_data = text_data.encode('ascii', errors='ignore')
        text_data = text_data.decode().split('\n')
        text_data = [x.split('\t') for x in text_data if len(x) >= 1]
        # Cache the parsed rows locally so later runs skip the download
        with open(save_file_name, 'w', newline='') as temp_output_file:
            writer = csv.writer(temp_output_file)
            writer.writerows(text_data)
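The parsing and caching steps above can be tried without downloading anything. The following is a minimal sketch that assumes the file uses one `label<TAB>message` pair per line; the two sample messages are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Two made-up lines in the assumed "label<TAB>message" layout.
raw = "ham\tSee you at 5, ok?\nspam\tWIN a FREE prize!!! Call 0800 now\n"

# Drop non-ASCII characters, split into lines, then split each line on the tab.
ascii_text = raw.encode("ascii", errors="ignore").decode()
rows = [line.split("\t") for line in ascii_text.split("\n") if len(line) >= 1]

# Round-trip through the csv module, as the caching step does with a real file.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
buffer.seek(0)
restored = list(csv.reader(buffer))

print(restored)
```

Because the csv module quotes fields that contain commas, a message such as "See you at 5, ok?" survives the round trip as a single field.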
    

    2. Preprocess the dataset and convert it to TF-IDF vectors

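    import string

    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Separate the labels from the message texts
    texts = [x[1] for x in text_data]
    targets = [x[0] for x in text_data]
    # Normalize case, then remove digits, punctuation, and extra whitespace
    texts = [x.lower() for x in texts]
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
    texts = [' '.join(x.split()) for x in texts]


    def tokenizer(text):
        # word_tokenize requires NLTK's Punkt tokenizer data to be downloaded
        return nltk.word_tokenize(text)


    # Convert the texts to a sparse TF-IDF matrix
    tfidf_texts = TfidfVectorizer(tokenizer=tokenizer, stop_words='english', max_features=max_features).fit_transform(texts)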
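To make the TF-IDF step less of a black box, here is a small standard-library sketch of the same idea. It is not the article's pipeline: it splits on whitespace instead of using `nltk.word_tokenize`, applies no stop-word list, and uses the smoothed weighting idf = ln((1 + n) / (1 + df)) + 1 with L2 row normalization, which matches scikit-learn's default settings. The helper names `clean` and `tfidf` and the sample messages are made up for illustration.

```python
import math
import string
from collections import Counter


def clean(text):
    """Lowercase, drop digits and punctuation, collapse whitespace."""
    text = text.lower()
    text = "".join(c for c in text if c not in string.digits)
    text = "".join(c for c in text if c not in string.punctuation)
    return " ".join(text.split())


def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized, smoothed TF-IDF weights."""
    tokenized = [clean(d).split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(tokenized)
    # Document frequency: the number of documents that contain each word.
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard empty documents
        rows.append([v / norm for v in row])
    return vocab, rows


docs = ["Free prize!!! Call 0800 now", "Call me now", "Lunch at 12?"]
vocab, rows = tfidf(docs)
print(vocab)
print([round(v, 3) for v in rows[0]])
```

In the first message, "free" appears in only one document while "call" appears in two, so "free" receives the larger weight. This is the property that lets rare, spam-specific words stand out.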

    3. Split the dataset into training and test sets so the model can be validated

    import numpy as np

    # Randomly split the data into an 80% training set and a 20% test set
    train_indics = np.random.choice(tfidf_texts.shape[0], round(0.8*tfidf_texts.shape[0]), replace=False)
    test_indics = np.array(list(set(range(tfidf_texts.shape[0]))-set(train_indics)))
    x_train = tfidf_texts[train_indics]
    x_test = tfidf_texts[test_indics]
    y_target = np.array([1 if y == 'spam' else 0 for y in targets])
    # Index the labels with the same arrays as the features so rows and labels stay aligned
    y_train = y_target[train_indics]
    y_test = y_target[test_indics]
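When splitting by random indices, the features and the labels need to be selected with the same index sequence, otherwise each row can end up paired with another row's label. The sketch below illustrates this with the standard library only; the stand-in samples and labels are made up, and with NumPy arrays the same selection is just `array[indices]`.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

samples = ["msg%d" % i for i in range(10)]            # stand-in features
labels = [1 if i % 3 == 0 else 0 for i in range(10)]  # stand-in labels

# Shuffle the row indices once, then cut them into 80% / 20%.
indices = list(range(len(samples)))
random.shuffle(indices)
n_train = round(0.8 * len(indices))
train_idx, test_idx = indices[:n_train], indices[n_train:]

# Select features and labels with the same index lists, in the same order.
x_train = [samples[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
x_test = [samples[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

print(list(zip(x_train, y_train)))
```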

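    4. Build the model and declare the loss function

    The initial values of the weight matrix A and the bias b are drawn from a Gaussian distribution and are then optimized by gradient descent.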

    # Model parameters: weight matrix A and bias b, initialized from a normal distribution
    A = tf.Variable(tf.random_normal(shape=[max_features, 1]))
    b = tf.Variable(tf.random_normal(shape=[1, 1]))

    # Placeholders for a batch of TF-IDF vectors and their 0/1 labels
    x_data = tf.placeholder(shape=[None, max_features], dtype=tf.float32)
    y_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)

    # Logits of the logistic regression model
    model_output = tf.add(tf.matmul(x_data, A), b)

    optimizer = tf.train.GradientDescentOptimizer(0.0025)
    # Sigmoid cross-entropy, summed over the batch
    loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_data))
    train_step = optimizer.minimize(loss)

    # Threshold the predicted probability at 0.5 and measure the accuracy
    prediction = tf.round(tf.sigmoid(model_output))
    predictions_correct = tf.cast(tf.equal(prediction, y_data), dtype=tf.float32)
    acc = tf.reduce_mean(predictions_correct)

    init = tf.global_variables_initializer()
    sess.run(init)
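The loss used above works directly on the logits rather than on the sigmoid output. A rough standard-library sketch of why: for a logit x and label z, the cross-entropy can be written as max(x, 0) - x * z + log(1 + exp(-|x|)), which is the form given in the TensorFlow documentation for `sigmoid_cross_entropy_with_logits` and stays finite for large logits. The function names below are made up for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_cross_entropy(logit, label):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_cross_entropy(logit, label):
    """Textbook form; only safe for moderate logits."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


for x in (-3.0, 0.5, 2.0):
    for z in (0.0, 1.0):
        print(x, z, sigmoid_cross_entropy(x, z), naive_cross_entropy(x, z))
```

For moderate logits both forms agree. For a logit such as 1000 with label 0, the textbook form would take the logarithm of zero, while the stable form simply returns 1000.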

    5. Train the model and print the test accuracy

    for i in range(10000):
        # Sample a random mini-batch (with replacement) from the training set
        rand_indics = np.random.choice(x_train.shape[0], batch_size)
        rand_x = x_train[rand_indics].todense()
        rand_y = np.transpose([y_train[rand_indics]])
        sess.run(train_step, feed_dict={x_data: rand_x, y_data: rand_y})
        # Print the accuracy on the test set every 1000 iterations
        if (i + 1) % 1000 == 0:
            print(i+1, ' acc:', sess.run(acc, feed_dict={x_data: x_test.todense(), y_data: np.transpose([y_test])}))
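The training loop is ordinary mini-batch gradient descent on a logistic regression model. As a rough illustration of what each `train_step` does, here is a dependency-free sketch on made-up two-dimensional data. The toy data, learning rate, batch size, and step count are assumptions chosen for the example, not the article's settings, and the gradient is averaged over the batch rather than summed.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Toy data: the label is 1 when the first feature exceeds the second.
points = [(random.random(), random.random()) for _ in range(200)]
data = [((a, b), 1.0 if a > b else 0.0) for a, b in points]

w = [0.0, 0.0]
bias = 0.0
learning_rate = 0.5
batch_size = 20


def predict(x):
    """Predicted probability of label 1 for one sample."""
    z = w[0] * x[0] + w[1] * x[1] + bias
    return 1.0 / (1.0 + math.exp(-z))


for step in range(2000):
    batch = random.choices(data, k=batch_size)  # sample with replacement
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    for x, y in batch:
        err = predict(x) - y  # d(loss)/d(logit) for sigmoid cross-entropy
        grad_w[0] += err * x[0]
        grad_w[1] += err * x[1]
        grad_b += err
    # Gradient descent update using the batch-averaged gradient.
    w[0] -= learning_rate * grad_w[0] / batch_size
    w[1] -= learning_rate * grad_w[1] / batch_size
    bias -= learning_rate * grad_b / batch_size

accuracy = sum((predict(x) > 0.5) == (y == 1.0) for x, y in data) / len(data)
print("accuracy:", accuracy)
```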

    6. Part of the training output and the accuracy on the test set

    The model had already converged by iteration 4000.

    To use the model in a real application, it is enough to save the model's parameter matrices together with the model structure.
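One possible way to read "saving the parameters and the structure" is sketched below, under stated assumptions. The weights are hypothetical stand-ins for values that would be fetched from the session (for example with `sess.run([A, b])`) and converted to plain lists, the file name is made up, and the "structure" is just the linear-plus-sigmoid function. TensorFlow 1.x also offers `tf.train.Saver` for checkpointing, which this sketch does not use.

```python
import json
import math
import os
import tempfile

# Hypothetical trained parameters, standing in for the fetched values of A and b.
params = {"weights": [0.8, -1.2, 0.3], "bias": -0.1}

# Save the parameters to a JSON file, then load them back.
path = os.path.join(tempfile.mkdtemp(), "spam_model.json")
with open(path, "w") as f:
    json.dump(params, f)

with open(path) as f:
    loaded = json.load(f)


def predict_spam(features, model):
    """Apply the saved linear model and threshold the sigmoid at 0.5."""
    z = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    return 1.0 / (1.0 + math.exp(-z)) > 0.5


print(predict_spam([1.0, 0.0, 0.0], loaded))
print(predict_spam([0.0, 1.0, 0.0], loaded))
```

A real deployment would also need to save the fitted TF-IDF vocabulary and weights, since new messages must be vectorized the same way as the training data.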

  • Original article: https://www.cnblogs.com/waj2018/p/10626033.html