• Step by Step with sklearn


    http://kukuruku.co/hub/python/introduction-to-machine-learning-with-python-andscikit-learn

    Hello, %username%!

    My name is Alex. I deal with machine learning and web graph analysis (mostly in theory). I also work on the development of Big Data products for one of the mobile operators in Russia. This is the first time I have written a post, so please don’t judge me too harshly.

    Nowadays, a lot of people want to develop efficient algorithms and take part in machine learning competitions, so they come to me and ask: “Where should I start?”. Some time ago, I led the development of Big Data tools for the analysis of media and social networks in one of the institutions of the Government of the Russian Federation. I still have some documentation my team used, and I’d like to share it with you. It is assumed that the reader has a good knowledge of mathematics and machine learning (my team mostly consisted of graduates of MIPT (the Moscow Institute of Physics and Technology) and the School of Data Analysis).

    That documentation was, in effect, an introduction to Data Science, a field that has become quite popular recently. Machine learning competitions are held more and more often (for example, on Kaggle or TunedIT), and their prize budgets are often considerable.

    The most common tools for a Data Scientist today are R and Python. Each tool has its pros and cons, but lately Python has been winning in all respects (just my humble opinion, although I use both). This happened after the appearance of the very well documented Scikit-Learn library, which contains a great number of machine learning algorithms.

    Please note that we will focus on Machine Learning algorithms in this article. Primary data analysis is usually better performed with the Pandas package, which is quite simple to pick up on your own. So let’s focus on implementation. For definiteness, we assume that there is a feature-object matrix at the input, and that it is stored in a *.csv file.
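
    As a quick, hedged illustration of what that primary analysis might look like (the file name data.csv and its header row are hypothetical, not part of the examples below):

    import pandas as pd

    # a minimal first look at a CSV file with Pandas
    df = pd.read_csv("data.csv")      # hypothetical file with a header row
    print(df.head())                  # first few rows, to eyeball the features
    print(df.describe())              # per-column count, mean, std, quartiles
    print(df.isnull().sum())          # missing values per column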

    Data Loading

    First of all, the data should be loaded into memory so that we can work with it. The Scikit-Learn library uses NumPy arrays in its implementation, so we will use NumPy to load *.csv files. Let’s download one of the datasets from the UCI Machine Learning Repository.

    import numpy as np
    from urllib.request import urlopen
    # url with dataset
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    # download the file
    raw_data = urlopen(url)
    # load the CSV file as a numpy matrix
    dataset = np.loadtxt(raw_data, delimiter=",")
    # separate the features (the first 8 columns) from the target attribute (the last column)
    X = dataset[:, 0:8]
    y = dataset[:, 8]

    We will work with this dataset in all examples, namely, with the X feature-object matrix and values of the y target variable.

    Data Normalization

    All of us know well that the majority of gradient methods (on which almost all machine learning algorithms are based) are highly sensitive to data scaling. Therefore, before running an algorithm, we should perform either normalization or the so-called standardization. Normalization, in the sense of the normalize function used below, rescales each sample to unit norm (scaling each feature to the range from 0 to 1 is done with MinMaxScaler instead). Standardization is a pre-processing step after which each feature has a mean of 0 and a variance of 1. The Scikit-Learn library provides ready-made functions for this:

    from sklearn import preprocessing
    # normalize the data attributes
    normalized_X = preprocessing.normalize(X)
    # standardize the data attributes
    standardized_X = preprocessing.scale(X)
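
    If what you want instead is each feature scaled to the [0, 1] range, a minimal sketch using MinMaxScaler looks like this:

    from sklearn.preprocessing import MinMaxScaler
    # rescale each feature (column) to the range [0, 1]
    rescaled_X = MinMaxScaler().fit_transform(X)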

    Feature Selection

    It’s no secret that the most important thing in solving a task is the ability to properly choose or even create features. This is called Feature Selection and Feature Engineering. While Feature Engineering is quite a creative process that relies more on intuition and expert knowledge, there are plenty of ready-made algorithms for Feature Selection. Tree-based algorithms, for example, allow you to compute the informativeness of features.

    from sklearn.ensemble import ExtraTreesClassifier
    model = ExtraTreesClassifier()
    model.fit(X, y)
    # display the relative importance of each attribute
    print(model.feature_importances_)
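
    To make that output easier to read, the importances can be paired with feature names and sorted. The names below are the conventional column names of the Pima Indians diabetes dataset and are an assumption on my part, since the *.csv file itself has no header:

    # assumed column names for the Pima Indians diabetes dataset
    names = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
             "insulin", "bmi", "pedigree", "age"]
    # print features in order of decreasing importance
    for importance, name in sorted(zip(model.feature_importances_, names), reverse=True):
        print("%-16s %.3f" % (name, importance))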

    All other methods are based on an efficient search over subsets of features, looking for the subset on which the resulting model gives the best quality. One such search algorithm is Recursive Feature Elimination (RFE), which is also available in the Scikit-Learn library.

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression()
    # create the RFE model and select 3 attributes
    rfe = RFE(model, n_features_to_select=3)
    rfe = rfe.fit(X, y)
    # summarize the selection of the attributes
    print(rfe.support_)
    print(rfe.ranking_)
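
    To see which concrete features survived, the boolean mask can be applied to the assumed column names from the sketch above (features with support_ equal to True have ranking_ equal to 1):

    # names is the assumed column-name list defined earlier
    selected = [name for name, kept in zip(names, rfe.support_) if kept]
    print(selected)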

    Algorithm Development

    As I have said, Scikit-Learn has implemented all the basic algorithms of machine learning. Let’s take a look at some of them.

    Logistic Regression

    It is most often used for (binary) classification tasks, but multiclass classification (via the so-called one-vs-all method) is also possible. An advantage of this algorithm is that it outputs a probability of belonging to each class for every object.

    from sklearn import metrics
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
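
    As mentioned above, logistic regression also gives class membership probabilities; a short sketch continuing this example:

    # probability of belonging to each class, one row per object
    probabilities = model.predict_proba(X)
    print(probabilities[:5])  # first five objects: P(class 0), P(class 1)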

    Naive Bayes

    Naive Bayes is also one of the most well-known machine learning algorithms; its main task is to estimate the data distribution density of the training sample. This method often provides good quality in multiclass classification problems.

    from sklearn import metrics
    from sklearn.naive_bayes import GaussianNB
    model = GaussianNB()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))

    k-Nearest Neighbours

    The kNN (k-Nearest Neighbors) method is often used as a component of a more complex classification algorithm. For instance, its estimate can be used as a feature of an object. Sometimes a simple kNN provides great quality on well-chosen features. When the parameters (mostly the distance metric) are set well, the algorithm often gives good quality in regression problems as well.

    from sklearn import metrics
    from sklearn.neighbors import KNeighborsClassifier
    # fit a k-nearest neighbor model to the data
    model = KNeighborsClassifier()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
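
    Since the quality of kNN depends strongly on the number of neighbors, here is a minimal sketch of choosing k by cross-validation (the candidate values of k are an arbitrary assumption):

    from sklearn.model_selection import cross_val_score
    # try several values of k and report the mean cross-validated accuracy of each
    for k in (1, 3, 5, 7, 9):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(k, scores.mean())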

    Decision Trees

    Classification and Regression Trees (CART) are often used in problems where objects have categorical features; they work for both regression and classification. Trees are very well suited for multiclass classification.

    from sklearn import metrics
    from sklearn.tree import DecisionTreeClassifier
    # fit a CART model to the data
    model = DecisionTreeClassifier()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))

    Support Vector Machines

    SVM (Support Vector Machines) is one of the most popular machine learning algorithms, used mainly for classification problems. Like logistic regression, SVM allows multiclass classification with the help of the one-vs-all method.

    from sklearn import metrics
    from sklearn.svm import SVC
    # fit a SVM model to the data
    model = SVC()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
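
    One caveat: all of the snippets above evaluate the model on the same data it was trained on, which overstates its quality. A minimal sketch of a more honest evaluation on a held-out test set (the 30% split size is an arbitrary assumption):

    from sklearn.model_selection import train_test_split
    # hold out 30% of the data for testing; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = SVC()
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    print(metrics.classification_report(y_test, predicted))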

    In addition to classification and regression algorithms, Scikit-Learn provides a huge number of more complex algorithms, including clustering, as well as techniques for building compositions of algorithms, including Bagging and Boosting.
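
    As a hedged sketch of what such compositions look like in code (default base estimators, with ensemble sizes chosen arbitrarily):

    from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    # bagging averages many trees trained on bootstrap samples of the data
    bagging = BaggingClassifier(n_estimators=50)
    # boosting builds trees sequentially, each one correcting its predecessors
    boosting = GradientBoostingClassifier(n_estimators=100)
    for name, ensemble in (("bagging", bagging), ("boosting", boosting)):
        print(name, cross_val_score(ensemble, X, y, cv=5).mean())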

    How to Optimize Algorithm Parameters

    One of the most difficult stages in creating really efficient algorithms is choosing correct parameters. It’s usually easier with experience, but one way or another, we have to do the search. Fortunately, Scikit-Learn provides many implemented functions for this purpose.

    As an example, let’s take a look at selecting the regularization parameter, where several values are tried in turn:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV
    # prepare a range of alpha values to test
    alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
    # create and fit a ridge regression model, testing each alpha
    model = Ridge()
    grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
    grid.fit(X, y)
    print(grid)
    # summarize the results of the grid search
    print(grid.best_score_)
    print(grid.best_estimator_.alpha)

    Sometimes it is more efficient to sample the parameter randomly from a given range, estimate the algorithm’s quality for each sampled value, and choose the best one.

    import numpy as np
    from scipy.stats import uniform as sp_rand
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import RandomizedSearchCV
    # prepare a uniform distribution to sample for the alpha parameter
    param_grid = {'alpha': sp_rand()}
    # create and fit a ridge regression model, testing random alpha values
    model = Ridge()
    rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
    rsearch.fit(X, y)
    print(rsearch)
    # summarize the results of the random parameter search
    print(rsearch.best_score_)
    print(rsearch.best_estimator_.alpha)

    We have reviewed the entire process of working with the Scikit-Learn library, except for outputting the results back to a file; I leave that to you as an exercise, since one of Python’s (and the Scikit-Learn library’s) advantages over R is its excellent documentation.

    In the next articles, we will consider other problems in detail. In particular, we will touch on such an important thing as Feature Engineering.

    I really hope this material helps novice Data Scientists get down to solving machine learning problems in practice as soon as possible.

    In conclusion, I’d like to wish success and patience to those who are just beginning to take part in machine learning competitions!

  • Original post: https://www.cnblogs.com/wxquare/p/5292537.html