Implementing the Nearest Neighbor Algorithm in TensorFlow

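The most direct way to understand the nearest neighbor algorithm is this: the features of an input sample are compared one by one against the features of the existing samples, and the input is assigned to the class of whichever sample is closest. If you instead take the k closest samples and pick the class that appears most often among them, you get the k-nearest-neighbor (kNN) algorithm. This post again classifies sklearn's breast cancer dataset, this time with the nearest neighbor algorithm. The basic setup is the same as in the previous post; the difference is that nearest neighbor computes distances, and the problem being solved is finding the minimum distance. Here we use either the L1 distance (Manhattan distance) or the L2 distance (Euclidean distance) to measure how far apart two feature vectors are: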

# L1 distance (Manhattan)
distance = tf.reduce_sum(tf.abs(tf.subtract(xtr, xte)), axis=1)
# L2 distance (Euclidean)
distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(xtr, xte)), axis=1))
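
To make the two formulas concrete, here is a minimal NumPy sketch of the same computation on a tiny hand-made array. The function names `l1_distance` and `l2_distance` and the sample values are illustrative assumptions, not part of the original post; the broadcasting of a single query vector against every training row mirrors what the TensorFlow expressions above do.

```python
import numpy as np

def l1_distance(train, query):
    """Manhattan distance from `query` (shape [d]) to each row of `train` (shape [n, d])."""
    return np.sum(np.abs(train - query), axis=1)

def l2_distance(train, query):
    """Euclidean distance from `query` (shape [d]) to each row of `train` (shape [n, d])."""
    return np.sqrt(np.sum(np.square(train - query), axis=1))

# Hypothetical toy data: three 2-D training points and one query point.
train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
query = np.array([0.0, 0.0])

d1 = l1_distance(train, query)  # per-row sums of absolute differences
d2 = l2_distance(train, query)  # per-row square roots of summed squares
print("L1:", d1)
print("L2:", d2)
```

For the point (3, 4) the L1 distance to the origin is 7 while the L2 distance is 5, which shows how the two metrics can rank neighbors differently.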

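The optimization step is to find the index of the training sample with the smallest distance; that index is then used to look up the label:

pred = tf.argmin(distance, 0)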
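
The k-nearest-neighbor variant mentioned at the start can be sketched the same way: instead of taking only the index of the minimum distance, take the k smallest and let their labels vote. The following is a minimal NumPy sketch under assumptions of my own (the function name `knn_predict`, the L2 metric, and the toy data are hypothetical); with `k=1` it reduces to the argmin lookup shown above.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest rows."""
    distances = np.sqrt(np.sum(np.square(train_x - query), axis=1))
    # A stable sort keeps the lower index first when distances tie.
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class 0 near the origin, class 1 near (5, 5).
train_x = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0], [0.0, 1.0]])
train_y = np.array([0, 0, 1, 1, 0])

label_1nn = knn_predict(train_x, train_y, np.array([4.0, 4.0]), k=1)
label_3nn = knn_predict(train_x, train_y, np.array([0.5, 0.5]), k=3)
print("1-NN label:", label_1nn)
print("3-NN label:", label_3nn)
```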

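Finally, the performance of the nearest neighbor classifier is measured by counting the correct and incorrect classifications and computing the accuracy. The complete code, which uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session), is as follows: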

from __future__ import print_function
import tensorflow as tf
from sklearn import datasets as skd
from sklearn.model_selection import train_test_split


# Load the breast cancer dataset: 569 samples, 30 features each, 2 classes
cancer = skd.load_breast_cancer()

# Separate the features from the labels
X_data = cancer.data
Y_data = cancer.target
print("X_data.shape = ", X_data.shape)
print("Y_data.shape = ", Y_data.shape)

# Split the features and labels into training and test sets
x_train,x_test,y_train,y_test = train_test_split(X_data,Y_data,test_size=0.2,random_state=1)
print("y_test=", y_test)
print("x_train.shape = ", x_train.shape)
print("x_test.shape = ", x_test.shape)
print("y_train.shape = ", y_train.shape)
print("y_test.shape = ", y_test.shape)

# Inputs to the TF graph: all training rows, and one test row at a time
xtr = tf.placeholder("float", [None, 30])
xte = tf.placeholder("float", [30])

# L1 distance (Manhattan)
# distance = tf.reduce_sum(tf.abs(tf.subtract(xtr, xte)), axis=1)
# L2 distance (Euclidean)
distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(xtr, xte)), axis=1))
# Prediction: index of the minimum distance (the nearest neighbor)
pred = tf.argmin(distance, 0)

accuracy = 0.
error_count = 0

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for i in range(x_test.shape[0]):
        # Get the index of the nearest training sample
        nn_index = sess.run(pred, feed_dict={xtr: x_train, xte: x_test[i, :]})
        print("Test", i, "Prediction:", y_train[nn_index], "True Class:", y_test[i])
        if y_train[nn_index] == y_test[i]:
            accuracy += 1./len(x_test)
        else:
            error_count = error_count + 1
    print("Done!")
    print("Correctly classified:", x_test.shape[0] - error_count)
    print("Misclassified:", error_count)
    print("Accuracy:", accuracy)
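
The script above relies on `tf.placeholder` and `tf.Session`, which belong to the TensorFlow 1.x graph API. As a framework-free cross-check, the sketch below repeats the same steps (L2 distance, argmin, counting correct and incorrect predictions) in plain NumPy, computing all test-to-train distances at once instead of looping. It assumes the same split parameters as the script (`test_size=0.2`, `random_state=1`); the helper name `nearest_neighbor_labels` is my own and not from the original post.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def nearest_neighbor_labels(x_train, y_train, x_test):
    """Return, for each test row, the label of its closest training row (L2)."""
    # Broadcasting gives diff the shape [n_test, n_train, n_features].
    diff = x_test[:, None, :] - x_train[None, :, :]
    distances = np.sqrt(np.sum(np.square(diff), axis=2))
    # For every test row, the index of the training row at minimum distance.
    nn_index = np.argmin(distances, axis=1)
    return y_train[nn_index]

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=1)

predicted = nearest_neighbor_labels(x_train, y_train, x_test)
correct = int(np.sum(predicted == y_test))
errors = len(y_test) - correct
accuracy = correct / len(y_test)

print("Correctly classified:", correct)
print("Misclassified:", errors)
print("Accuracy:", accuracy)
```

Dividing the count of correct predictions once at the end avoids accumulating many small floating-point increments, which is the only deliberate difference from the loop in the script.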

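The nearest neighbor algorithm performs as follows:

A few factors affect the result:

1. The dataset: in general, the larger the training set, the higher the accuracy tends to be.

2. The distance metric: here, the Euclidean distance gave somewhat better results than the Manhattan distance.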
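
Both observations can be checked empirically. The sketch below is one possible way to do so, not the author's method: it swaps in scikit-learn's `KNeighborsClassifier` with `n_neighbors=1`, where `p=1` selects the Manhattan distance and `p=2` the Euclidean distance, and varies the share of data held out for testing. The helper name `one_nn_accuracy` and the chosen test sizes are assumptions for illustration. Changing `test_size` changes the test set as well as the training set, so differences between runs are only a rough indication.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def one_nn_accuracy(test_size, p, random_state=1):
    """Accuracy of a 1-nearest-neighbor classifier using Minkowski distance of order p."""
    data = load_breast_cancer()
    x_train, x_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=test_size, random_state=random_state)
    model = KNeighborsClassifier(n_neighbors=1, p=p)
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

results = {}
for test_size in (0.5, 0.2):       # smaller test_size means a larger training set
    for p in (1, 2):               # 1 = Manhattan (L1), 2 = Euclidean (L2)
        results[(test_size, p)] = one_nn_accuracy(test_size, p)
        print("test_size =", test_size, "p =", p,
              "accuracy =", results[(test_size, p)])
```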

Original post: https://www.cnblogs.com/Bearoom/p/11721777.html