• A Complete and Honest Beginner's Tutorial for Sklearn


    The complete .ipynb file can be downloaded from my OneDrive share: https://1drv.ms/u/s!Al86h1dThXMNxDtq_wkOF1PNARrl?e=WvRNaI


    All the materials come from the Machine Learning class at PolyU, Hong Kong.

    I promise that I use and share them only for learning and non-profit purposes.

    from sklearn.datasets import load_iris
    iris=load_iris()
    X=iris.data
    y=iris.target
    
    #use LogisticRegression from sklearn
    from sklearn.linear_model import LogisticRegression
    logreg=LogisticRegression()
    logreg.fit(X,y)
    #the prediction for the training data
    y_pred=logreg.predict(X)
    
    /home/jiading/.conda/envs/nn/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
      FutureWarning)
    /home/jiading/.conda/envs/nn/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
      "this warning.", FutureWarning)
    
    #analyze the result
    from sklearn import metrics
    #The sklearn.metrics module includes score functions, performance metrics and pairwise metrics and distance computations.
    print(metrics.accuracy_score(y,y_pred))
    
    0.96
    
    #use KNN in sklearn
    from sklearn.neighbors import KNeighborsClassifier
    knn=KNeighborsClassifier(n_neighbors=5)#set the value of K
    knn.fit(X,y)#KNN is a lazy learner: fit() merely stores the training data, so no real training happens here
    y_pred=knn.predict(X)
    print(metrics.accuracy_score(y,y_pred))
    
    0.9666666666666667
    

    test_size specifies the fraction of the data that is held out for testing.

    The parameter "random_state" seeds the random number generator, so we get the same random output each time, which simplifies and eases our evaluation.

    We will get the same split every time as long as we use the same seed, as the check after the next code block shows.

    #split the data into training data and test data
    #the train_test_split method helps us to do this work
    from sklearn.model_selection import train_test_split
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4,random_state=4)
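
    As a quick sanity check (this snippet is not in the original notebook; it reuses X and y from above), two calls with the same random_state produce identical splits:

    #a minimal sketch: the same seed reproduces the same split
    import numpy as np
    X_a,_,_,_=train_test_split(X,y,test_size=0.4,random_state=4)
    X_b,_,_,_=train_test_split(X,y,test_size=0.4,random_state=4)
    print(np.array_equal(X_a,X_b))#prints True: identical seed, identical split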
    
    #we use LogisticRegression again
    logreg=LogisticRegression()
    logreg.fit(X_train,y_train)
    y_pred=logreg.predict(X_test)
    print(metrics.accuracy_score(y_test,y_pred))
    
    0.9333333333333333
    
    
    /home/jiading/.conda/envs/nn/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
      FutureWarning)
    /home/jiading/.conda/envs/nn/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
      "this warning.", FutureWarning)
    
    #we test the accuracy of KNN and find the k that gives the highest accuracy
    k_range=list(range(1,26))#[1,25]
    scores=[]
    for k in k_range:
        knn=KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train,y_train)
        y_pred=knn.predict(X_test)
        scores.append(metrics.accuracy_score(y_test,y_pred))
    
    #we draw a graph to show the result
    import matplotlib.pyplot as plt
    #a magic function,which allows plots to appear within the notebook
    %matplotlib inline
    plt.plot(k_range,scores)
    plt.xlabel('Value of K for KNN')
    plt.ylabel('Testing Accuracy')
    
    Text(0, 0.5, 'Testing Accuracy')
    

    [Figure: testing accuracy versus the value of K for KNN]


    The following experiment requires a file named "l3_data.csv".

    You can download it from my OneDrive:

    https://1drv.ms/u/s!Al86h1dThXMNugsNGgtFBFYmZpYt?e=QDU4c4

    #use the data in a csv file named "l3_data.csv"
    #use pandas now
    import pandas as pd
    
    /home/jiading/.conda/envs/nn/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
      return f(*args, **kwds)
    

    index_col : int, str, sequence of int / str, or False, default None
    Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.

    Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

    1. index_col=None (the default) — pandas generates a new integer index instead of using an existing column
    2. index_col=False — likewise forces pandas to generate a new index rather than use the first column
    3. index_col=0 — the first column of the file becomes the index

    reference:

    1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    2. https://blog.csdn.net/qq_33217634/article/details/85305731
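
    To make the difference concrete, here is a small sketch (not in the original; it assumes the csv file has already been downloaded to the working directory):

    #index_col=None (the default) creates a fresh 0-based RangeIndex,
    #while index_col=0 uses the first csv column as the index
    df_default=pd.read_csv('./l3_data.csv')
    df_indexed=pd.read_csv('./l3_data.csv',index_col=0)
    print(df_default.index[:3])#RangeIndex: 0,1,2
    print(df_indexed.index[:3])#values from the first column: 1,2,3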
    data=pd.read_csv('./l3_data.csv',index_col=0)# indicate the location where the file is being stored
    #show first 5 rows in the file
    data.head()
    
          TV  Radio  Newspaper  Sales
    1  230.1   37.8       69.2   22.1
    2   44.5   39.3       45.1   10.4
    3   17.2   45.9       69.3    9.3
    4  151.5   41.3       58.5   18.5
    5  180.8   10.8       58.4   12.9
    #show last 5 rows in the file
    data.tail()
    
            TV  Radio  Newspaper  Sales
    196   38.2    3.7       13.8    7.6
    197   94.2    4.9        8.1    9.7
    198  177.0    9.3        6.4   12.8
    199  283.6   42.0       66.2   25.5
    200  232.1    8.6        8.7   13.4

    From "shape" (see the check below), we know that there are 200 rows (observations) and 4 columns (3 features and 1 response). The three features are "TV", "Radio" and "Newspaper". The response is "Sales".
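
    The shape check itself is not shown above; a minimal call (an assumption about how it was obtained) would be:

    #check the dimensions of the DataFrame
    print(data.shape)#(200, 4): 200 observations, 4 columns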

    The dataset shows the advertising dollars spent on different media (TV, Radio and Newspaper), and the corresponding sales amount of a product in a given market. All figures are in thousands of units.

    It is hard to tell the relationships between the response and the three features from the raw numbers. Plotting some graphs to visualize the relationships could be helpful.

    Seaborn is a higher-level API built on top of matplotlib that makes plotting easier. In most cases seaborn alone can produce quite attractive plots, while matplotlib allows plots with more custom features. Seaborn should be seen as a complement to matplotlib, not a replacement.

    Seaborn expects the input data to be a pandas DataFrame or a NumPy array, and its plotting functions generally take one of the following forms (see the sketch after the references below):
    sns.plotname(x='x-axis column name', y='y-axis column name', data=DataFrame)

    sns.plotname(x='x-axis column name', y='y-axis column name', hue='grouping variable', data=DataFrame)

    sns.plotname(x=np.array, y=np.array[, ...])
    hue means "variable in data to map plot aspects to different colors".

    reference:

    1. https://www.cnblogs.com/kylinlin/p/5236601.html
    2. https://www.jianshu.com/p/4b925654f506
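
    As a small illustration of the hue parameter (a sketch, not from the original; it reuses the iris data loaded at the start of this tutorial):

    #color the points by class with hue: each species gets its own color and fitted line
    import seaborn as sns
    import pandas as pd
    from sklearn.datasets import load_iris
    iris=load_iris()
    df=pd.DataFrame(iris.data,columns=['sepal_length','sepal_width','petal_length','petal_width'])
    df['species']=iris.target
    sns.lmplot(x='sepal_length',y='sepal_width',hue='species',data=df)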
    import seaborn as sns
    
    %matplotlib inline
    
    /home/jiading/.conda/envs/nn/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
      return f(*args, **kwds)
    

    Plot pairwise relationships in a dataset.

    By default, this function will create a grid of Axes such that each variable in data will by shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.

    size gives the height (in inches) of each facet

    aspect : aspect * size gives the width (in inches) of each facet.

    kind='reg' fits and draws a regression line on each scatter plot

    pairplot knows which variable is x and which is y from the attribute names we specify in "x_vars" and "y_vars"

    source:https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot

    #visualize the relationship between the features and the response using scatter plots
    sns.pairplot(data,x_vars=['TV','Radio','Newspaper'],y_vars='Sales',size=7,aspect=0.7,kind='reg')
    
    <seaborn.axisgrid.PairGrid at 0x7eff2ed36710>
    

    [Figure: scatter plots of Sales against TV, Radio and Newspaper, each with a fitted regression line]

    From the three graphs, it seems that there is a strong relationship between the TV ads and Sales. For Newspaper, it seems it does not affect Sales much. Later we will try to back up that observation.

    Remember that scikit-learn needs the dataset in two parts: a feature dataset in matrix form and a response in vector form. That means we have to preprocess the dataset into the correct format before we can use it for the prediction task.

    feature_cols=['TV','Radio','Newspaper']
    # use the list to select a subset of the original DataFrame
    X=data[feature_cols]#we can select columns from the DataFrame this way!
    #show the data in X
    X.head()
    
          TV  Radio  Newspaper
    1  230.1   37.8       69.2
    2   44.5   39.3       45.1
    3   17.2   45.9       69.3
    4  151.5   41.3       58.5
    5  180.8   10.8       58.4
    #pay attention to the type of X
    print(type(X))
    
    <class 'pandas.core.frame.DataFrame'>
    
    #create y in another way: access the column as an attribute of the DataFrame object
    y=data.Sales
    y.head()
    
    1    22.1
    2    10.4
    3     9.3
    4    18.5
    5    12.9
    Name: Sales, dtype: float64
    
    #pay attention to the type of y
    print(type(y))
    
    <class 'pandas.core.series.Series'>
    
    #do the two ways of building y give the same type?
    y=data['Sales']
    print(type(y))
    #the answer is yes
    
    <class 'pandas.core.series.Series'>
    
    #now we split the data into training data and testing data
    from sklearn.model_selection import train_test_split
    X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)
    

    Let's check the size and the type of the training data and testing data.

    Since we don't set the 'test_size' parameter, the split uses the default ratio (25% of the data goes to the test set).

    print("Size:")
    print(X_train.shape)
    print(y_train.shape)
    print(X_test.shape)
    print(y_test.shape)
    print("Type:")
    print(type(X_train))
    
    Size:
    (150, 3)
    (150,)
    (50, 3)
    (50,)
    Type:
    <class 'pandas.core.frame.DataFrame'>
    
    #now we switch to LinearRegression:
    from sklearn.linear_model import LinearRegression
    linreg=LinearRegression()
    #sklearn models accept pandas types such as DataFrame and Series directly
    linreg.fit(X_train,y_train)
    
    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
    

    We can see the intercept (the b in the formula y = b + w·x) and the coefficients (the w in the formula):

    print(linreg.intercept_)
    print(linreg.coef_)
    
    2.8769666223179318
    [0.04656457 0.17915812 0.00345046]
    

    We can use Python's zip function to pair each feature name with its coefficient:

    list(zip(feature_cols,linreg.coef_))
    
    [('TV', 0.04656456787415029),
     ('Radio', 0.17915812245088839),
     ('Newspaper', 0.003450464711180378)]
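
    Putting the printed intercept and coefficients together, the fitted model reads (rounded from the values above):

    Sales ≈ 2.8770 + 0.0466 × TV + 0.1792 × Radio + 0.0035 × Newspaper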
    
    y_pred=linreg.predict(X_test)
    

    Notice that, thanks to Python's dynamic typing, we don't need to declare variables before use, so we can keep reusing the same "y_pred" name from the very beginning (haha).

    Evaluation metrics for classification problems, such as accuracy, are not useful for regression
    problems. Instead, we need evaluation metrics designed for comparing continuous values.

    There are three ways to do this:
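
    For reference (this notation is added here, not part of the original text), with n test samples, true values y_i and predictions ŷ_i:

    MAE = (1/n) * Σ|y_i − ŷ_i|
    MSE = (1/n) * Σ(y_i − ŷ_i)²
    RMSE = √MSE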

    #first,use the MAE(Mean Absolute Error)
    #MAE simply sums the absolute error of every test sample, then divides by the number of samples
    print(metrics.mean_absolute_error(y_test,y_pred))
    
    1.0668917082595206
    
    #second,MSE(Mean Squared Error)
    #different from MAE,MSE squares the error of every test sample first, then sums up and divides by the number of samples
    print(metrics.mean_squared_error(y_test,y_pred))
    
    1.9730456202283368
    
    #third,RMSE(Root Mean Squared Error)
    #we need the sqrt function from numpy
    #as you can see,RMSE just takes the square root of the MSE
    import numpy as np
    print(np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
    
    1.404651423032895
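
    As a quick sanity check (not in the original), RMSE can also be computed by hand and should match the value above:

    #manual RMSE: square the errors, average them, then take the square root
    print(np.sqrt(np.mean((y_test-y_pred)**2)))#should match 1.4046...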
    

    These methods have their advantages and disadvantages:

    • MAE is the easiest to understand, because it is just the average error.
    • MSE is more popular than MAE, because MSE "punishes" larger errors.
    • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

    In this experiment, we use RMSE as the evaluation metric.

    Then, we can remove "Newspaper" and re-run the Linear Regression model.

    feature_cols=['TV','Radio']
    X=data[feature_cols]
    y=data.Sales
    X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)
    linreg.fit(X_train,y_train)
    y_pred=linreg.predict(X_test)
    print(np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
    
    1.3879034699382886
    

    The RMSE decreased when we removed Newspaper from the model. (Error is something we
    want to minimize, so a lower number for RMSE is better.) Thus, it is unlikely that this feature is
    useful for predicting Sales, and should be removed from the model.

    It suggests that this feature does not follow the same relationship with the response as the other two features: either the attribute is simply unrelated to the result, or its relationship with the y label has a different form than the others and would call for a different prediction method, that is, a combined model; of course the second possibility is more complex to implement.
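
    To push this further, a hedged sketch (not in the original) can sweep every feature subset and report its test RMSE, so Newspaper's contribution can be judged systematically:

    #compare test RMSE across all feature subsets of the advertising data
    from itertools import combinations
    for r in (1,2,3):
        for cols in combinations(['TV','Radio','Newspaper'],r):
            X=data[list(cols)]
            X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)
            linreg.fit(X_train,y_train)
            y_pred=linreg.predict(X_test)
            print(cols,np.sqrt(metrics.mean_squared_error(y_test,y_pred)))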

