• Python Data Analysis and Applications


    Notes on Python Data Analysis and Applications

    Building models with sklearn

    1. Processing data with sklearn transformers

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.preprocessing import MinMaxScaler  # min-max scaling: rescales each feature to the range [0, 1]
    from sklearn.decomposition import PCA  # dimensionality reduction
    from sklearn.model_selection import train_test_split  # splits data into training and test sets
    cancer = load_breast_cancer()   # load the data set into the variable cancer
    cancer_data = cancer['data']   # feature matrix
    cancer_target = cancer['target']   # labels
    cancer_names = cancer['feature_names']  # feature names
    cancer_desc = cancer['DESCR']  # text description of the data set
    
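    # Split into training and test sets, holding out 20% as the test set
    cancer_train_data, cancer_test_data, cancer_train_target, cancer_test_target = train_test_split(
        cancer_data, cancer_target, test_size=0.2, random_state=42)
    scaler = MinMaxScaler().fit(cancer_train_data)  # learn the scaling rule from the training set only
    # Apply the rule to both the training set and the test set
    cancer_trainScaler = scaler.transform(cancer_train_data)
    cancer_testScaler = scaler.transform(cancer_test_data)
    # Build the PCA dimensionality-reduction model
    pca_model = PCA(n_components=10).fit(cancer_trainScaler)
    # Apply the PCA model to the scaled training and test data
    cancer_trainPca = pca_model.transform(cancer_trainScaler)
    cancer_testPca = pca_model.transform(cancer_testScaler)
    
    print('Shape of training data before reduction:', cancer_trainScaler.shape)
    print('Shape of training data after reduction:', cancer_trainPca.shape)
    print('Shape of test data before reduction:', cancer_testScaler.shape)
    print('Shape of test data after reduction:', cancer_testPca.shape)
    
    Shape of training data before reduction: (455, 30)
    Shape of training data after reduction: (455, 10)
    Shape of test data before reduction: (114, 30)
    Shape of test data after reduction: (114, 10)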
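
The fit-on-training-data, transform-both pattern above can also be written with sklearn's `Pipeline`, which chains the scaler and the PCA model so that both are fitted on the training set only. This is a minimal sketch under the same split as above, not the approach used in the notes; the names `reduce_pipeline`, `train_pca`, `test_pca` and `kept` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
train_data, test_data, train_target, test_target = train_test_split(
    cancer['data'], cancer['target'], test_size=0.2, random_state=42)

# Chain min-max scaling and PCA; fit() learns both steps from the training set only
reduce_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('pca', PCA(n_components=10)),
])
reduce_pipeline.fit(train_data)

train_pca = reduce_pipeline.transform(train_data)
test_pca = reduce_pipeline.transform(test_data)
print(train_pca.shape, test_pca.shape)

# Share of the total variance kept by the 10 components
kept = reduce_pipeline.named_steps['pca'].explained_variance_ratio_.sum()
print('variance kept: %.3f' % kept)
```

Keeping both steps in one object makes it harder to fit the scaler on the test data by accident.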
    
    • Task: use sklearn to preprocess data and reduce its dimensionality
    # Note: load_boston was removed in scikit-learn 1.2, so this code needs an earlier version
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    boston = load_boston()
    boston_data = boston['data']
    boston_target = boston['target']
    boston_names = boston['feature_names']
    # Hold out 20% of the samples as the test set
    boston_train_data, boston_test_data, boston_train_target, boston_test_target = train_test_split(
        boston_data, boston_target, test_size=0.2, random_state=42)
    # Standardize: learn mean and standard deviation from the training set, then apply to both sets
    stdScale = StandardScaler().fit(boston_train_data)
    boston_trainScaler = stdScale.transform(boston_train_data)
    boston_testScaler = stdScale.transform(boston_test_data)
    
    # Reduce the standardized data to 5 principal components
    pca_model = PCA(n_components=5).fit(boston_trainScaler)
    boston_trainPca = pca_model.transform(boston_trainScaler)
    boston_testPca = pca_model.transform(boston_testScaler)
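
To see what the standardization and PCA steps actually produce, the sketch below checks the column means and standard deviations after `StandardScaler` and prints the cumulative explained variance of the five components. It is an illustration only: `load_wine` is swapped in as an assumption, because recent sklearn versions no longer ship the Boston data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine stands in for the Boston data set used in the task
wine = load_wine()
train_data, test_data = train_test_split(wine['data'], test_size=0.2, random_state=42)

std_scale = StandardScaler().fit(train_data)
train_scaled = std_scale.transform(train_data)
test_scaled = std_scale.transform(test_data)

# Training columns have mean ~0 and standard deviation ~1 by construction;
# test columns are only close to that, because they reuse the training statistics
print(np.allclose(train_scaled.mean(axis=0), 0))
print(np.allclose(train_scaled.std(axis=0), 1))

pca_model = PCA(n_components=5).fit(train_scaled)
# Cumulative share of variance explained by the first 1..5 components
cumulative = np.cumsum(pca_model.explained_variance_ratio_)
print(cumulative)
```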
    

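    2. Building and evaluating clustering models

    Commonly used clustering algorithms are listed in the table:

    The clustering algorithms provided by sklearn's cluster module, and the cases each one suits, are shown in the figure:
    [Figure: clustering algorithms provided by cluster and their scope of application]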

    import pandas as pd
    from sklearn.manifold import TSNE  # t-SNE projects high-dimensional data for visualization
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.cluster import KMeans 
    iris = load_iris()
    iris_data = iris['data']
    iris_target = iris['target']
    iris_names = iris['feature_names']
    scale = MinMaxScaler().fit(iris_data)  # learn the scaling rule
    iris_dataScale = scale.transform(iris_data)  # apply the rule to the data
    kmeans = KMeans(n_clusters = 3,random_state = 123).fit(iris_dataScale)  # build and train the clustering model
    result = kmeans.predict([[1.5,1.5,1.5,1.5]])  # predict the cluster of a new sample (the model expects scaled features)
    
    tsne = TSNE(n_components = 2,init = 'random',random_state=177).fit(iris_data)  # reduce the data to two dimensions with t-SNE
    df = pd.DataFrame(tsne.embedding_)  # put the 2-D embedding into a DataFrame
    df['labels']=kmeans.labels_  # store the cluster labels in df
    
    # One subset per cluster
    df1 = df[df['labels']==0]
    df2 = df[df['labels']==1]
    df3 = df[df['labels']==2]
    
    # Plot each cluster with its own marker
    fig = plt.figure(figsize=(9,6))
    plt.plot(df1[0],df1[1],'bo',df2[0],df2[1],'r*',df3[0],df3[1],'gD')
    #plt.axis([-60,60,-80,80])
    plt.savefig('聚类结果.png')
    plt.show()
    # print(df)
    # print(df1)
    # print(kmeans.labels_)
    print(iris_names)
    


    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
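
Because the K-Means model above is trained on min-max scaled data, a new sample given in the original units (centimetres) would normally be passed through the same scaler before prediction. A minimal sketch of that step follows; the measurements in `new_sample` are made-up illustrative values, and `n_init=10` is set explicitly only to keep the behaviour the same across sklearn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris_data = load_iris()['data']
scale = MinMaxScaler().fit(iris_data)
iris_dataScale = scale.transform(iris_data)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=123).fit(iris_dataScale)

# Hypothetical flower measured in cm (illustrative values)
new_sample = [[6.0, 3.0, 4.5, 1.5]]
new_scaled = scale.transform(new_sample)   # apply the rule learned from the training data
label = kmeans.predict(new_scaled)
print('cluster of the new sample:', label[0])

# Cluster centres mapped back to the original units for easier reading
centres_cm = scale.inverse_transform(kmeans.cluster_centers_)
print(centres_cm)
```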
    

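    Evaluating clustering models

    • The criterion: the more similar the samples within a cluster and the more different the clusters are from one another, the better the clustering.
      The metrics module of sklearn provides several evaluation metrics for clustering models; the ones used in these notes are the FMI, the silhouette coefficient, and the Calinski-Harabasz index.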
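
The criterion of high similarity within clusters and large differences between them can be checked numerically by comparing the average distance between samples in the same cluster with the average distance between samples in different clusters. This is a rough sketch for intuition, not one of the metrics from the metrics module; the names `mean_within` and `mean_between` are introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances

iris_data = load_iris()['data']
labels = KMeans(n_clusters=3, n_init=10, random_state=123).fit(iris_data).labels_

dist = pairwise_distances(iris_data)               # Euclidean distance between every pair of samples
same_cluster = labels[:, None] == labels[None, :]  # True where both samples share a cluster
not_self = ~np.eye(len(labels), dtype=bool)        # ignore each sample's distance to itself

mean_within = dist[same_cluster & not_self].mean()
mean_between = dist[~same_cluster].mean()
print('mean distance within clusters:  %.3f' % mean_within)
print('mean distance between clusters: %.3f' % mean_between)
```

A good clustering should show a within-cluster distance clearly smaller than the between-cluster distance.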


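    Evaluating the K-Means clustering model with the FMI

    from sklearn.metrics import fowlkes_mallows_score
    # Compare the clusters with the true labels for 2 to 6 clusters
    for i in range(2,7):
        kmeans = KMeans(n_clusters = i,random_state = 123).fit(iris_data)   
        score = fowlkes_mallows_score(iris_target,kmeans.labels_)
        print('FMI score for iris clustered into %d clusters: %f'%(i,score))
    
    FMI score for iris clustered into 2 clusters: 0.750473
    FMI score for iris clustered into 3 clusters: 0.820808
    FMI score for iris clustered into 4 clusters: 0.753970
    FMI score for iris clustered into 5 clusters: 0.725483
    FMI score for iris clustered into 6 clusters: 0.600691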
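
The FMI is defined over pairs of samples: FMI = TP / sqrt((TP + FP) * (TP + FN)), where TP counts pairs placed together by both the true classes and the clustering, FP counts pairs placed together only by the clustering, and FN counts pairs placed together only by the true classes. The sketch below counts the pairs directly and compares the result with `fowlkes_mallows_score`; the loop is written for clarity rather than speed, and `fmi_manual` is a name introduced here.

```python
from itertools import combinations
from math import sqrt

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import fowlkes_mallows_score

iris = load_iris()
iris_data, iris_target = iris['data'], iris['target']
labels = KMeans(n_clusters=3, n_init=10, random_state=123).fit(iris_data).labels_

tp = fp = fn = 0
for i, j in combinations(range(len(labels)), 2):
    same_true = iris_target[i] == iris_target[j]
    same_pred = labels[i] == labels[j]
    if same_true and same_pred:
        tp += 1      # pair grouped together in both labelings
    elif same_pred:
        fp += 1      # grouped together by K-Means only
    elif same_true:
        fn += 1      # grouped together by the true classes only

fmi_manual = tp / sqrt((tp + fp) * (tp + fn))
fmi_sklearn = fowlkes_mallows_score(iris_target, labels)
print('manual FMI:  %f' % fmi_manual)
print('sklearn FMI: %f' % fmi_sklearn)
```

Because the FMI needs the true labels, it only applies when a labeled reference is available.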
    

    Evaluating with the silhouette coefficient

    from sklearn.metrics import silhouette_score
    import matplotlib.pyplot as plt
    silhouetteScore = []
    # Silhouette coefficient for 2 to 14 clusters; no true labels are needed
    for i in range(2,15):
        kmeans = KMeans(n_clusters = i,random_state = 123).fit(iris_data) 
        score = silhouette_score(iris_data,kmeans.labels_)
        silhouetteScore.append(score)
    # Plot the score against the number of clusters
    plt.figure(figsize=(10,6))
    plt.plot(range(2,15),silhouetteScore,linewidth = 1.5,linestyle = '-')
    plt.show()
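
One simple way to read the silhouette curve is to take the number of clusters with the highest score. The sketch below applies that rule; it is only one possible reading and not necessarily the selection rule intended by the notes, and `candidate_k` and `best_k` are names introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

iris_data = load_iris()['data']
candidate_k = list(range(2, 15))
scores = []
for k in candidate_k:
    labels = KMeans(n_clusters=k, n_init=10, random_state=123).fit(iris_data).labels_
    scores.append(silhouette_score(iris_data, labels))

# The silhouette coefficient lies in [-1, 1]; larger values indicate better-separated clusters
best_k = candidate_k[int(np.argmax(scores))]
print('k with the highest silhouette coefficient:', best_k)
```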
    


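    Evaluating the K-Means clustering model with the Calinski-Harabasz index

    # calinski_harabaz_score was renamed to calinski_harabasz_score (old name removed in scikit-learn 0.23)
    from sklearn.metrics import calinski_harabasz_score
    for i in range(2,7):
        kmeans = KMeans(n_clusters = i,random_state = 123).fit(iris_data) 
        score = calinski_harabasz_score(iris_data,kmeans.labels_)
        print('Calinski-Harabasz index for iris clustered into %d clusters: %f'%(i,score))
    
    Calinski-Harabasz index for iris clustered into 2 clusters: 513.303843
    Calinski-Harabasz index for iris clustered into 3 clusters: 560.399924
    Calinski-Harabasz index for iris clustered into 4 clusters: 529.120719
    Calinski-Harabasz index for iris clustered into 5 clusters: 494.094382
    Calinski-Harabasz index for iris clustered into 6 clusters: 474.753604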
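
The Calinski-Harabasz index is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom: CH = (B / (k - 1)) / (W / (n - k)), for n samples and k clusters. The sketch below computes B and W by hand for k = 3 and compares the result with `calinski_harabasz_score`; the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import calinski_harabasz_score

iris_data = load_iris()['data']
n_samples, k = len(iris_data), 3
labels = KMeans(n_clusters=k, n_init=10, random_state=123).fit(iris_data).labels_

overall_mean = iris_data.mean(axis=0)
within = between = 0.0
for c in range(k):
    members = iris_data[labels == c]
    centre = members.mean(axis=0)
    within += ((members - centre) ** 2).sum()                        # W: spread around each cluster centre
    between += len(members) * ((centre - overall_mean) ** 2).sum()   # B: spread of the centres, weighted by size

ch_manual = (between / (k - 1)) / (within / (n_samples - k))
ch_sklearn = calinski_harabasz_score(iris_data, labels)
print('manual CH:  %f' % ch_manual)
print('sklearn CH: %f' % ch_sklearn)
```

A larger value means tighter clusters that lie further apart, and like the silhouette coefficient the index needs no true labels.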
  • Original article: https://www.cnblogs.com/LouieZhang/p/9164302.html