Machine Learning -- Classifier Comparison


    I have been studying machine learning recently and collecting notes on its algorithms. This post walks through a scikit-learn example that compares several classification algorithms side by side.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    
    """
    =====================
    Classifier comparison
    =====================
    
    A comparison of several classifiers in scikit-learn on synthetic datasets.
    The point of this example is to illustrate the nature of decision boundaries
    of different classifiers.
    This should be taken with a grain of salt, as the intuition conveyed by
    these examples does not necessarily carry over to real datasets.
    
    Particularly in high-dimensional spaces, data can more easily be separated
    linearly and the simplicity of classifiers such as naive Bayes and linear SVMs
    might lead to better generalization than is achieved by other classifiers.
    
    The plots show training points in solid colors and testing points
    semi-transparent. The lower right shows the classification accuracy on the test
    set.
    """
    print(__doc__)
    
    
    # Code source: Gaël Varoquaux
    #              Andreas Müller
    # Modified for documentation by Jaques Grobler
    # License: BSD 3 clause
    
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import make_moons, make_circles, make_classification
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    
    h = .02  # step size in the mesh
    
    names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
             "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
             "Naive Bayes", "QDA"]
    
    classifiers = [
        KNeighborsClassifier(3),
        SVC(kernel="linear", C=0.025),
        SVC(gamma=2, C=1),
        GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
        DecisionTreeClassifier(max_depth=5),
        RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
        MLPClassifier(alpha=1),
        AdaBoostClassifier(),
        GaussianNB(),
        QuadraticDiscriminantAnalysis()]
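    # Note: these hyperparameters are illustrative rather than tuned. For
    # example, the linear SVM's small C (0.025) regularizes heavily, giving a
    # wide soft margin, while gamma=2 makes the RBF SVM's boundary
    # comparatively wiggly and local.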
    
    X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                               random_state=1, n_clusters_per_class=1)
    # Jitter the points with uniform noise so the dataset stays linearly
    # separable but is no longer trivially clean
    rng = np.random.RandomState(2)
    X += 2 * rng.uniform(size=X.shape)
    linearly_separable = (X, y)
    
    datasets = [make_moons(noise=0.3, random_state=0),
                make_circles(noise=0.2, factor=0.5, random_state=1),
                linearly_separable
                ]
    
    figure = plt.figure(figsize=(27, 9))
    i = 1
    # iterate over datasets
    for ds_cnt, ds in enumerate(datasets):
        # ds_cnt runs from 0 to len(datasets) - 1;
        # ds is the (X, y) pair returned by each dataset generator
        # preprocess dataset, split into training and test part
        # Unpack the dataset into the feature matrix X and label vector y
        X, y = ds
        # Standardize the features: remove each feature's mean and scale it to
        # unit variance. In practice the distribution shape is usually ignored;
        # subtracting the per-feature mean and dividing by the per-feature
        # standard deviation centers the data. (If a feature departs strongly
        # from a Gaussian, standardization can be less effective.)
        X = StandardScaler().fit_transform(X)
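        # Worked example: a feature column [1.0, 3.0, 5.0] has mean 3.0 and
        # population std ~1.63, so it becomes roughly [-1.22, 0.0, 1.22]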
        # train_test_split divides the data into a training set and a test set,
        # together with the matching label vectors (four variables in all);
        # test_size=.4 holds out 40% of the samples and random_state=42 makes
        # the split reproducible
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)
    
        # Build a mesh over the feature plane, padded by 0.5 on each side,
        # on which the decision surface will later be evaluated
        x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
        y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))
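        # xx and yy are 2-D coordinate arrays; flattening and stacking them as
        # np.c_[xx.ravel(), yy.ravel()] later yields an (n_points, 2) matrix
        # that each classifier can score point by point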
    
        # just plot the dataset first
        cm = plt.cm.RdBu
        # Bright red and blue for the two class labels
        cm_bright = ListedColormap(['#FF0000', '#0000FF'])
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        if ds_cnt == 0:
            ax.set_title("Input data")
        # Plot the training points
        # scatter() draws the points; training data appears in solid colors
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
        # and testing points
        # Test data is drawn semi-transparent (alpha=0.6), so each class shows
        # up in a solid shade (train) and a faded shade (test)
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
        
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        i += 1
        # iterate over classifiers
        for name, clf in zip(names, classifiers):
            ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)
    
            # Plot the decision boundary. For that, we will assign a color to each
            # point in the mesh [x_min, x_max]x[y_min, y_max].
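            # Prefer decision_function (signed distance to the boundary) when
            # the classifier exposes it; otherwise fall back to predict_proba
            # and take column 1, the probability of the positive class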
            if hasattr(clf, "decision_function"):
                Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
            else:
                Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    
            # Put the result into a color plot
            Z = Z.reshape(xx.shape)
            ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
    
            # Plot also the training points
            ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
            # and testing points
            ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                       alpha=0.6)
    
            ax.set_xlim(xx.min(), xx.max())
            ax.set_ylim(yy.min(), yy.max())
            ax.set_xticks(())
            ax.set_yticks(())
            if ds_cnt == 0:
                ax.set_title(name)
            ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                    size=15, horizontalalignment='right')
            i += 1
    
    plt.tight_layout()
    plt.show()
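
    If you only want the numbers rather than the plots, the same comparison can be run as cross-validated scores. Below is a minimal sketch under my own assumptions (5-fold cross-validation and a three-classifier subset, neither of which appears in the original example); it wraps each classifier in a pipeline so the scaler is fit only on the training folds.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB

    # Same moons dataset as in the example above
    X, y = make_moons(noise=0.3, random_state=0)

    # A subset of the classifiers compared above (my choice, for brevity)
    candidates = [
        ("Nearest Neighbors", KNeighborsClassifier(3)),
        ("RBF SVM", SVC(gamma=2, C=1)),
        ("Naive Bayes", GaussianNB()),
    ]

    for name, clf in candidates:
        # The pipeline refits the scaler on each training fold, so no
        # information leaks from the held-out fold into the scaling
        model = make_pipeline(StandardScaler(), clf)
        scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV: my assumption
        print("%-18s %.2f (+/- %.2f)" % (name, scores.mean(), scores.std()))

    Because cross_val_score averages over folds, the resulting ranking is less sensitive to a single lucky split than the one-shot accuracies printed in the corner of each plot.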

Original source: https://www.cnblogs.com/ncut/p/6004236.html