• The sklearn preprocessing


    Recently, while writing a feature engineering module, I found two excellent packages: tsfresh and sklearn.

    tsfresh specializes in time series data. It has two main modules, feature extraction and feature selection:

    from tsfresh import feature_selection, feature_extraction

    To limit the number of irrelevant features, tsfresh deploys the fresh algorithm. The whole process consists of three steps.

    First, the algorithm characterizes the time series with comprehensive and well-established feature mappings. The feature calculators used to derive the features are contained in tsfresh.feature_extraction.feature_calculators.
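
    To make this first step concrete, here is a minimal sketch of what a feature mapping does: it reduces a whole series to a single number. The helper names and the toy series are illustrative and written in plain Python. They only mirror the idea behind calculators such as mean, maximum and abs_energy, and are not tsfresh's implementation.

```python
from statistics import fmean, pstdev


def abs_energy(series):
    """Sum of squared values of the series."""
    return sum(v * v for v in series)


def characterize(series):
    """Map one time series to a flat dict of scalar features (illustrative helper)."""
    return {
        "mean": fmean(series),
        "std": pstdev(series),  # population standard deviation
        "maximum": max(series),
        "minimum": min(series),
        "abs_energy": abs_energy(series),
    }


# Two toy series, keyed by a made-up id
toy_series = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 2.0, 2.0, 2.0]}

# One row of features per series: this is the "feature vector" the later steps test
features = {sid: characterize(values) for sid, values in toy_series.items()}
for sid, row in features.items():
    print(sid, row)
```

    Each series becomes one row of scalar features, which is the shape the significance tests in the next step work on.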

    In a second step, each extracted feature vector is individually and independently evaluated with respect to its significance for predicting the target under investigation. Those tests are contained in the submodule tsfresh.feature_selection.significance_tests. The result of a significance test is a vector of p-values, quantifying the significance of each feature for predicting the target.

    Finally, the vector of p-values is evaluated on the basis of the Benjamini-Yekutieli procedure in order to decide which features to keep.
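
    The Benjamini-Yekutieli step is short enough to write out by hand. The sketch below is a plain Python version under my own naming (the function name, the fdr_level parameter and the p-values are made up for illustration), not the code tsfresh runs internally. It sorts the p-values, compares the k-th smallest against k * fdr_level / (m * c(m)) with c(m) = 1 + 1/2 + ... + 1/m, and keeps every feature up to the largest rank that passes.

```python
def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return a list of booleans: True where the feature is kept."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction c(m) = 1 + 1/2 + ... + 1/m
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value is at or under k * fdr_level / (m * c(m))
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr_level / (m * c_m):
            cutoff_rank = rank
    # Keep every feature ranked at or below the cutoff
    keep = [False] * m
    for idx in order[:cutoff_rank]:
        keep[idx] = True
    return keep


p_values = [0.001, 0.30, 0.008, 0.04, 0.75]  # made-up p-values, one per feature
print(benjamini_yekutieli(p_values, fdr_level=0.05))
# → [True, False, True, False, False]
```

    Note that the feature with p = 0.04 is dropped even though it is below 0.05: the correction makes each threshold stricter as the number of tested features grows.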

    In summary, tsfresh is a scalable and efficient tool for feature engineering.

    Although tsfresh is powerful, I chose sklearn.

    I downloaded the heart disease data set. Its target is binary and it has 13 features. I only used MinMaxScaler to transform a few numeric columns (age, trestbps and chol; the code below also scales column index 7), then tried two models: the ensemble built by AutoSklearnClassifier and a RandomForest ensemble. Both models performed poorly.

    import pandas as pd
    from numpy import set_printoptions, inf
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Print arrays in full instead of truncating them
    set_printoptions(threshold=inf)

    # Every column except the last is a feature; the last column is the target
    data = pd.read_csv("../data_set/heart.csv")
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values

    # Scale the selected columns to [0, 1] and write them back into X
    scale_cols = [0, 3, 4, 7]
    X[:, scale_cols] = MinMaxScaler().fit_transform(X[:, scale_cols])
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    from autosklearn.classification import AutoSklearnClassifier
    from sklearn.metrics import accuracy_score

    # Let auto-sklearn search for 120 seconds with feature preprocessing disabled
    model_auto = AutoSklearnClassifier(time_left_for_this_task=120, n_jobs=3, include_preprocessors=["no_preprocessing"], seed=3)
    model_auto.fit(x_train, y_train)

    y_pred = model_auto.predict(x_test)
    print(accuracy_score(y_test, y_pred))  # my run printed 0.8021978021978022
    
    
    model = RandomForestClassifier(n_estimators=500)
    model.fit(x_train, y_train)  # the forest must be fitted before it can predict
    y_pred_rf = model.predict(x_test)
    print(accuracy_score(y_test, y_pred_rf))  # my run printed 0.8051648351648352
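
    One detail in the code above: the scaler is fitted on the full data before the split, so the test rows influence the min and max it learns. A minimal sketch of fitting on the training rows only follows. It uses random stand-in data rather than heart.csv, so the row count of 303 is an assumption, the column indices are simply copied from the code above, and the printed numbers carry no meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(303, 13))  # stand-in for the 13 features
y = rng.integers(0, 2, size=303)                      # stand-in binary target
scale_cols = [0, 3, 4, 7]                             # same indices as the code above

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Learn min and max from the training rows only, then reuse them on the test rows
scaler = MinMaxScaler().fit(x_train[:, scale_cols])
x_train[:, scale_cols] = scaler.transform(x_train[:, scale_cols])
x_test[:, scale_cols] = scaler.transform(x_test[:, scale_cols])

# Training columns span exactly [0, 1]; test columns may fall slightly outside
print(x_train[:, scale_cols].min(axis=0), x_train[:, scale_cols].max(axis=0))
```

    This probably does not explain the low scores by itself. Random forests split on thresholds, so min-max scaling is not expected to change their accuracy much either way.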

    My personal website provides an AutoML service. I uploaded this data set to it, and it got a better score than my code: http://simple-code.cn/

  • Original article: https://www.cnblogs.com/xu-xiaofeng/p/10934296.html