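1. Data Loading
Assume the input is a feature matrix or a CSV file; the first step is to load the data into memory.
scikit-learn is built on NumPy arrays, so we use NumPy to load the CSV file.
The example below downloads a dataset from the UCI Machine Learning Repository.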
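    # data loading
    import numpy as np
    import urllib.request  # Python 3; the Python 2 call was urllib.urlopen

    # URL of the dataset
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    # download the file
    raw_data = urllib.request.urlopen(url)
    # load the CSV file as a NumPy matrix
    dataset = np.loadtxt(raw_data, delimiter=",")
    # separate the data from the target attribute (column 8)
    # note: 0:7 keeps only the first seven feature columns (indices 0-6);
    # use dataset[:, 0:8] to include all eight features
    X = dataset[:, 0:7]
    y = dataset[:, 8]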
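To try the loading and slicing step without a network connection, here is a minimal sketch that parses a tiny in-memory CSV. The three rows of numbers are invented for illustration and are not taken from the UCI file; only the layout (eight feature columns followed by one label column) is assumed to mirror the dataset above.

```python
import io
import numpy as np

# Hypothetical sample: 8 feature columns followed by 1 label column.
csv_text = (
    "1,2,3,4,5,6,7,8,0\n"
    "2,3,4,5,6,7,8,9,1\n"
    "3,4,5,6,7,8,9,10,1\n"
)

# np.loadtxt accepts any file-like object, so StringIO stands in for the download.
dataset = np.loadtxt(io.StringIO(csv_text), delimiter=",")

X_all = dataset[:, 0:8]    # all eight feature columns (indices 0-7)
X_seven = dataset[:, 0:7]  # the slice used above: indices 0-6 only
y = dataset[:, 8]          # label column

print(dataset.shape, X_all.shape, X_seven.shape, y.shape)
```

The end index of a slice is exclusive, which is why 0:7 yields seven columns rather than eight.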
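2. Data Normalization
Most gradient-based machine learning methods are sensitive to the scale of the data, so before running an algorithm we should normalize or standardize the features. Normalization usually means rescaling values into a fixed range such as 0 to 1, while standardization transforms each feature to zero mean and unit variance. scikit-learn provides functions for both. Note that preprocessing.normalize, used below, rescales each sample (row) to unit norm rather than mapping each feature into the 0-1 range.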
    # data normalization
    from sklearn import preprocessing

    # normalize the data attributes
    normalized_X = preprocessing.normalize(X)
    # standardize the data attributes
    standardized_X = preprocessing.scale(X)
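The difference between these transforms is easier to see on a small array. The sketch below uses a made-up 3x2 matrix (X_demo is a hypothetical name) and adds MinMaxScaler, which is not used in the original text, as one way to obtain a per-feature 0-1 range.

```python
import numpy as np
from sklearn import preprocessing

# Made-up data: two features on very different scales.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])

# Each row is rescaled to unit L2 norm.
row_normalized = preprocessing.normalize(X_demo)
# Each column is shifted and scaled to mean 0 and standard deviation 1.
standardized = preprocessing.scale(X_demo)
# Each column is mapped linearly onto [0, 1].
min_max = preprocessing.MinMaxScaler().fit_transform(X_demo)

print(np.linalg.norm(row_normalized, axis=1))
print(standardized.mean(axis=0), standardized.std(axis=0))
print(min_max.min(axis=0), min_max.max(axis=0))
```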
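3. Feature Selection
When solving a real problem, the ability to choose or construct suitable features matters a great deal. This is called feature selection or feature engineering.
Feature engineering is a creative process that relies heavily on intuition and domain expertise, but many ready-made algorithms exist for feature selection.
The tree-based algorithm below estimates how informative each feature is: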
    # feature selection
    from sklearn import metrics
    from sklearn.ensemble import ExtraTreesClassifier

    model = ExtraTreesClassifier()
    model.fit(X, y)
    # display the relative importance of each attribute
    print(model.feature_importances_)
Result:
    [ 0.12315529  0.25870914  0.11863867  0.08749797  0.08296516  0.1840623   0.14497146]
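A raw importance vector is easier to act on once it is sorted. The following sketch ranks features by importance on synthetic data; make_classification, the fixed random_state, and the names X_demo and y_demo are assumptions made for reproducibility and are not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the real data: 7 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=3, n_redundant=0, random_state=0
)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

importances = model.feature_importances_
# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]

for idx in ranking:
    print("feature %d: %.3f" % (idx, importances[idx]))
```

The importances are normalized to sum to 1, so each value can be read as a relative share.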
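4. Using the Algorithms
- Logistic Regression
Logistic regression is most often used for binary classification, and many practical problems can be framed that way. An advantage of this algorithm is that it outputs the probability that a sample belongs to each class.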
    # logistic regression
    from sklearn import metrics
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
Result:
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
                 precision    recall  f1-score   support

            0.0       0.79      0.89      0.84       500
            1.0       0.74      0.55      0.63       268

    avg / total       0.77      0.77      0.77       768

    [[447  53]
     [120 148]]
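The class probabilities mentioned above are available through predict_proba. This is a minimal sketch on synthetic data (X_demo and y_demo are hypothetical names, and max_iter=1000 is an arbitrary choice to help the solver converge).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
proba = model.predict_proba(X_demo[:3])
labels = model.predict(X_demo[:3])

print(model.classes_)
print(proba)
print(labels)
```

The columns of proba follow the order of model.classes_, so the predicted label is the class whose column holds the largest probability.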
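- Naive Bayes
The main task of this method is to recover the distribution density of the training data; it often performs well on multi-class problems.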
    # GaussianNB
    from sklearn.naive_bayes import GaussianNB

    model = GaussianNB()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
Result:
    GaussianNB()
                 precision    recall  f1-score   support

            0.0       0.80      0.86      0.83       500
            1.0       0.69      0.60      0.64       268

    avg / total       0.76      0.77      0.76       768

    [[429  71]
     [108 160]]
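To illustrate what "recovering the distribution" means for GaussianNB, the sketch below fits the model on two made-up Gaussian clusters and reads back the estimated class priors and per-class feature means. The data, the seed, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
# Two hypothetical classes drawn from Gaussians centered at 0 and 3.
class0 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
class1 = rng.normal(loc=3.0, scale=1.0, size=(100, 2))

X_demo = np.vstack([class0, class1])
y_demo = np.array([0] * 100 + [1] * 100)

model = GaussianNB().fit(X_demo, y_demo)

# Estimated prior probability of each class.
print(model.class_prior_)
# Estimated mean of each feature within each class (shape: classes x features).
print(model.theta_)
```

The fitted means should sit close to the cluster centers used to generate the data, since theta_ is simply the per-class sample mean of each feature.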
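- k-Nearest Neighbors
The k-nearest neighbors (kNN) algorithm is often used as one part of a larger classification method; for example, it can be used to evaluate features, which makes it useful during feature selection.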
    # KNN
    from sklearn.neighbors import KNeighborsClassifier

    model = KNeighborsClassifier()
    model.fit(X, y)
    print(model)
    expected = y
    predicted = model.predict(X)
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
Result:
    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               metric_params=None, n_jobs=1, n_neighbors=5, p=2,
               weights='uniform')
                 precision    recall  f1-score   support

            0.0       0.82      0.90      0.86       500
            1.0       0.77      0.63      0.69       268

    avg / total       0.80      0.80      0.80       768

    [[448  52]
     [ 98 170]]
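The main knob of kNN is n_neighbors (5 by default, as the printed model shows). The sketch below, on synthetic data with hypothetical names, compares training accuracy for a few values of k; the specific values 1, 5, and 15 are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

train_acc = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
    # Accuracy on the same data the model was fitted on.
    train_acc[k] = model.score(X_demo, y_demo)

for k, acc in train_acc.items():
    print("k=%d training accuracy: %.3f" % (k, acc))
```

With k=1 each training point is its own nearest neighbor, so training accuracy is perfect; this says nothing about how the model behaves on unseen data.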
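- Decision Trees
Classification and Regression Trees (CART) are commonly used for classification or regression problems whose features carry categorical information; the method is well suited to multi-class classification.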
    # decision tree
    from sklearn.tree import DecisionTreeClassifier

    model = DecisionTreeClassifier()
    model.fit(X, y)
    print(model)
    expected = y
    predicted = model.predict(X)
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
Result:
    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
                min_samples_split=2, min_weight_fraction_leaf=0.0,
                presort=False, random_state=None, splitter='best')
                 precision    recall  f1-score   support

            0.0       1.00      1.00      1.00       500
            1.0       1.00      1.00      1.00       268

    avg / total       1.00      1.00      1.00       768

    [[500   0]
     [  0 268]]
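The perfect scores above are measured on the same data the tree was fitted on, so they most likely reflect memorization rather than generalization. One common check is to hold out a test set. The sketch below does this on synthetic data; train_test_split, the 30% split, the label noise (flip_y), and the seed are assumptions added for illustration and are not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so a perfect fit cannot generalize perfectly.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, flip_y=0.1, random_state=0
)

# Keep 30% of the samples aside for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data seen during fitting
test_acc = model.score(X_test, y_test)     # accuracy on held-out data

print("train accuracy: %.3f" % train_acc)
print("test accuracy:  %.3f" % test_acc)
```

An unpruned tree can keep splitting until every training sample is classified correctly, which is why the held-out figure is the more informative one.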
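- SVM
The support vector machine (SVM) is a very popular machine learning algorithm used mainly for classification. Like logistic regression, it can handle multi-class problems with a one-vs-rest scheme.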
    # SVM
    from sklearn.svm import SVC

    model = SVC()
    model.fit(X, y)
    print(model)
    expected = y
    predicted = model.predict(X)
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
Result:
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False)
                 precision    recall  f1-score   support

            0.0       1.00      1.00      1.00       500
            1.0       1.00      1.00      1.00       268

    avg / total       1.00      1.00      1.00       768

    [[500   0]
     [  0 268]]
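RBF-kernel SVMs are generally sensitive to feature scale, which ties back to the normalization step in section 2. A possible way to combine the two is a pipeline evaluated with cross-validation, sketched below on synthetic data. The pipeline, the 5-fold setting, and all variable names are assumptions for illustration, not the setup used in the original example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

# The scaler is refitted on the training part of every fold,
# so no information from the validation part leaks into it.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# One accuracy value per fold, each measured on data not used for fitting.
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)

print(scores)
print("mean accuracy: %.3f" % scores.mean())
```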
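5. How to Optimize Algorithm Parameters
A harder task is building an effective way to choose the right parameters; in practice this means searching over candidate values. scikit-learn provides functions for this purpose.
The example below selects the regularization parameter: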
    # parameter selection
    from sklearn.linear_model import Ridge
    # the old sklearn.grid_search module was removed in scikit-learn 0.20;
    # GridSearchCV now lives in sklearn.model_selection
    from sklearn.model_selection import GridSearchCV

    # prepare a range of alpha values to test
    alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
    # create and fit a ridge regression model, testing each alpha
    model = Ridge()
    grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
    grid.fit(X, y)
    print(grid)
    # summarize the results of the grid search
    print(grid.best_score_)
    print(grid.best_estimator_.alpha)
Result:
    GridSearchCV(cv=None, error_score='raise',
           estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
           normalize=False, random_state=None, solver='auto', tol=0.001),
           fit_params={}, iid=True, n_jobs=1,
           param_grid={'alpha': array([  1.00000e+00,   1.00000e-01,   1.00000e-02,
             1.00000e-03,   1.00000e-04,   0.00000e+00])},
           pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
    0.282118955686
    1.0
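Beyond the single best value, it can help to look at the score of every candidate. The sketch below runs a grid search on synthetic regression data and prints the mean cross-validated score per alpha. The data, the explicit cv=5 and scoring="r2" settings, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the real data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=7, noise=10.0, random_state=0
)

alphas = [1.0, 0.1, 0.01, 0.001]
grid = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5, scoring="r2")
grid.fit(X_demo, y_demo)

# Mean R^2 across the 5 folds for each candidate alpha.
mean_scores = grid.cv_results_["mean_test_score"]
for params, score in zip(grid.cv_results_["params"], mean_scores):
    print(params["alpha"], round(float(score), 4))

print("best:", grid.best_params_)
```

For a regressor such as Ridge the default score is R^2, so best_score_ reads as explained variance averaged over folds rather than as a classification accuracy.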
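Sometimes it is more effective to sample parameter values at random from a given range, evaluate the algorithm with each sampled value, and keep the best one.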
    # randomized parameter search
    from scipy.stats import uniform as sp_rand
    from sklearn.linear_model import Ridge
    # RandomizedSearchCV lives in sklearn.model_selection
    # (sklearn.grid_search was removed in scikit-learn 0.20)
    from sklearn.model_selection import RandomizedSearchCV

    # prepare a uniform distribution on [0, 1] to sample for the alpha parameter
    param_grid = {'alpha': sp_rand()}
    model = Ridge()
    rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
    rsearch.fit(X, y)
    print(rsearch)
    # summarize the results of the random search
    print(rsearch.best_score_)
    print(rsearch.best_estimator_.alpha)
Result:
    RandomizedSearchCV(cv=None, error_score='raise',
              estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
              normalize=False, random_state=None, solver='auto', tol=0.001),
              fit_params={}, iid=True, n_iter=100, n_jobs=1,
              param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000000008739C18>},
              pre_dispatch='2*n_jobs', random_state=None, refit=True,
              scoring=None, verbose=0)
    0.282118896925
    0.997818886895
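The example above samples alpha uniformly from [0, 1]. Regularization strengths often span several orders of magnitude, so a log-uniform distribution is sometimes preferred. The sketch below swaps in scipy.stats.loguniform as an alternative, not as the method used in the original; the data, the bounds 1e-4 and 1, n_iter=20, and the seed are all assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the real data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=7, noise=10.0, random_state=0
)

# Sample alpha so that each decade between 1e-4 and 1 is equally likely.
param_distributions = {"alpha": loguniform(1e-4, 1e0)}

rsearch = RandomizedSearchCV(
    Ridge(),
    param_distributions=param_distributions,
    n_iter=20,        # number of sampled candidates
    cv=5,
    random_state=0,   # fixes the sampled candidates for reproducibility
)
rsearch.fit(X_demo, y_demo)

best_alpha = rsearch.best_params_["alpha"]
print(rsearch.best_score_)
print(best_alpha)
```

Fixing random_state makes the sampled candidates repeatable, which helps when comparing runs.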