• Model persistence of sklearn


    Model persistence

    https://scikit-learn.org/stable/modules/model_persistence.html

    Once a model has been trained, how do you save it so it can be used later? That is what model persistence is about.

    After training a scikit-learn model, it is desirable to have a way to persist the model for future use without having to retrain. The following sections give you some hints on how to persist a scikit-learn model.

    Python specific serialization

    sklearn is built on Python, and Python itself supports object serialization.

    The first approach is to use pickle to serialize and deserialize the model.

    It is possible to save a model in scikit-learn by using Python’s built-in persistence model, namely pickle:

    >>> from sklearn import svm
    >>> from sklearn import datasets
    >>> clf = svm.SVC()
    >>> X, y = datasets.load_iris(return_X_y=True)
    >>> clf.fit(X, y)
    SVC()
    
    >>> import pickle
    >>> s = pickle.dumps(clf)
    >>> clf2 = pickle.loads(s)
    >>> clf2.predict(X[0:1])
    array([0])
    >>> y[0]
    0
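
    The session above keeps the pickle in memory as a byte string. In practice the bytes usually go to a file. A minimal sketch, assuming a hypothetical file name model.pkl inside a temporary directory:

```python
import os
import pickle
import tempfile

from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

# "model.pkl" is a hypothetical file name, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Serialize the fitted estimator to disk; pickle needs binary mode.
    with open(path, "wb") as f:
        pickle.dump(clf, f)

    # Load it back, as a later process would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly what the original predicts.
same_predictions = bool((restored.predict(X) == clf.predict(X)).all())
print(same_predictions)
```

    Comparing predictions on known inputs right after loading is a cheap sanity check that the round trip preserved the fitted state.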

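    The second approach uses the dump and load functions of the joblib library. This tool is more efficient than pickle, but it only supports saving to a file and cannot produce a byte string.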

    In the specific case of scikit-learn, it may be better to use joblib’s replacement of pickle (dump & load), which is more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string:

    >>> from joblib import dump, load
    >>> dump(clf, 'filename.joblib') 

    Later you can load back the pickled model (possibly in another Python process) with:

    >>> clf = load('filename.joblib') 
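
    A self-contained round trip with joblib might look like the sketch below. The file name model.joblib and the compress=3 setting are choices made for this example, not recommendations from the scikit-learn documentation.

```python
import os
import tempfile

from joblib import dump, load
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name

    # compress=3 trades some CPU time for a smaller file (0 means no compression).
    # dump returns the list of file names it wrote.
    written = dump(clf, path, compress=3)

    # Load the estimator back while the temporary file still exists.
    restored = load(path)

# The restored estimator should reproduce the original predictions.
round_trip_ok = bool((restored.predict(X) == clf.predict(X)).all())
print(round_trip_ok)
```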

    Security & maintainability limitations

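    Python's serialization approach has drawbacks:

    (1) Security: an untrusted pickle file may contain malicious code, so loading it into memory is a security risk.

    (2) Maintainability: if the sklearn version in the training environment differs from the one in the loading environment, the model may not be supported.

    To work around the maintainability problem, the training environment is usually packaged as a Docker container, and at deployment time the model is loaded and used inside that same container.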

    pickle (and joblib by extension) has some issues regarding maintainability and security. Because of this:

    • Never unpickle untrusted data as it could lead to malicious code being executed upon loading.

    • While models saved using one version of scikit-learn might load in other versions, this is entirely unsupported and inadvisable. It should also be kept in mind that operations performed on such data could give different and unexpected results.
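
    To see why the first warning matters, the sketch below builds a pickle whose load step calls a function chosen by whoever created the file. The Payload class is hypothetical and the call is deliberately harmless (sorted); a real attack would substitute something destructive.

```python
import pickle


class Payload:
    """Stand-in for an object crafted by an attacker (hypothetical)."""

    def __reduce__(self):
        # pickle stores "call sorted('cba')" as the recipe for rebuilding
        # this object. sorted is harmless; an attacker would pick another callable.
        return (sorted, ("cba",))


blob = pickle.dumps(Payload())

# Loading executes the stored call. No Payload instance comes back at all.
result = pickle.loads(blob)
print(result)
```

    The value returned by pickle.loads is whatever the stored callable returned, which shows that the file's author, not the loader, decides what code runs.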

    In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along the pickled model:

    • The training data, e.g. a reference to an immutable snapshot

    • The python source code used to generate the model

    • The versions of scikit-learn and its dependencies

    • The cross validation score obtained on the training data

    This should make it possible to check that the cross-validation score is in the same range as before.
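
    One possible way to record this metadata is a small JSON document stored next to the pickled model. The field names below are made up for this sketch; scikit-learn does not prescribe a format.

```python
import json
import platform

import numpy
import sklearn
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC()

# 5-fold cross-validation score obtained on the training data.
cv_scores = cross_val_score(clf, X, y, cv=5)
clf.fit(X, y)

# Illustrative field names, not a scikit-learn convention.
metadata = {
    "sklearn_version": sklearn.__version__,
    "numpy_version": numpy.__version__,
    "python_version": platform.python_version(),
    "estimator": type(clf).__name__,
    "params": clf.get_params(),
    "cv_mean_accuracy": float(cv_scores.mean()),
    "training_data": "iris (sklearn.datasets.load_iris)",
}

# default=str guards against parameter values that JSON cannot encode.
metadata_json = json.dumps(metadata, indent=2, sort_keys=True, default=str)
print(metadata_json)

# At load time, compare the recorded version with the running one.
recorded = json.loads(metadata_json)
version_matches = recorded["sklearn_version"] == sklearn.__version__
```

    When version_matches is false, retraining from the recorded data and source and comparing the new cross-validation score with cv_mean_accuracy is the check the text describes.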

    Since a model's internal representation may differ between architectures, dumping a model on one architecture and loading it on another is not supported behaviour, even if it might work in some cases. To overcome this portability issue, pickled models are often deployed in production using containers such as Docker.

    If you want to know more about these issues and explore other possible serialization methods, please refer to this talk by Alex Gaynor.

    Interoperable formats

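    The problems above are really about model interoperability across architectures and platforms. Industry has proposed two solutions:

    (1) Open Neural Network Exchange (ONNX): a binary serialization specification and toolset for models.

    (2) Predictive Model Markup Language (PMML): an XML-based specification for representing models. It is human-readable, which makes it convenient to test the performance of the same model configuration on different frameworks or platforms.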

    For reproducibility and quality control needs, when different architectures and environments should be taken into account, exporting the model in Open Neural Network Exchange format or Predictive Model Markup Language (PMML) format might be a better approach than using pickle alone. These are helpful where you may want to use your model for prediction in a different environment from where the model was trained.

    ONNX is a binary serialization of the model. It has been developed to improve the usability of the interoperable representation of data models. It aims to facilitate the conversion of data models between different machine learning frameworks, and to improve their portability across computing architectures. More details are available from the ONNX tutorial. To convert a scikit-learn model to ONNX, a dedicated tool, sklearn-onnx, has been developed.

    PMML is an implementation of the XML document standard defined to represent data models together with the data used to generate them. Being human and machine readable, PMML is a good option for model validation on different platforms and long term archiving. On the other hand, as with XML in general, its verbosity does not help in production when performance is critical. To convert a scikit-learn model to PMML you can use, for example, sklearn2pmml, which is distributed under the Affero GPLv3 license.
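
    To illustrate the human- and machine-readable point, here is a hand-written, heavily simplified PMML fragment read with Python's standard library. It contains only a header and a data dictionary, has not been validated against the PMML schema, and is not the output of sklearn2pmml.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified PMML for illustration only (no model element).
PMML_TEXT = """\
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="Toy iris example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="species" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>
"""

# PMML elements live in a versioned XML namespace.
ns = {"pmml": "http://www.dmg.org/PMML-4_4"}
root = ET.fromstring(PMML_TEXT)

# Collect (name, optype, dataType) for every declared field.
fields = [
    (field.get("name"), field.get("optype"), field.get("dataType"))
    for field in root.findall("pmml:DataDictionary/pmml:DataField", ns)
]

print(root.get("version"))
for name, optype, data_type in fields:
    print(name, optype, data_type)
```

    Because the format is plain XML, any platform with an XML parser can inspect the declared fields without Python or scikit-learn.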

    ONNX

    https://onnx.ai/supported-tools.html#buildModel

    The ONNX community provides tools to assist with creating and deploying your next deep learning model. Use the information below to select the tool that is right for your project.


    PMML

    http://dmg.org/pmml/v4-4-1/GeneralStructure.html

    PMML Version  Model Type                       Vendor        Application             Dataset   PMML File
    4.1           Clustering                       KNIME         KNIME 2.8               Audit     View
    4.1           Clustering                       KNIME         KNIME 2.8               Iris      View
    4.1           Neural Network                   KNIME         KNIME 2.8               Audit     View
    4.1           Neural Network                   KNIME         KNIME 2.8               Iris      View
    4.1           Regression                       KNIME         KNIME 2.8               Audit     View
    4.0           Regression                       KNIME         KNIME 2.6.2             Elnino    View
    4.0           Regression                       KNIME         KNIME 2.6.2             Elnino    View
    4.1           Regression                       KNIME         KNIME 2.8               Iris      View
    4.1           Tree                             KNIME         KNIME 2.8               Audit     View
    4.1           Tree                             KNIME         KNIME 2.8               Iris      View
    4.1           Support Vector Machine           KNIME         KNIME 2.8               Audit     View
    4.1           Support Vector Machine           KNIME         KNIME 2.8               Iris      View
    4.1           Model Ensemble - Clustering      KNIME         KNIME 2.8               Audit     View
    4.1           Model Ensemble - Neural Network  KNIME         KNIME 2.8               Audit     View
    4.1           Model Ensemble - Neural Network  KNIME         KNIME 2.8               Iris      View
    4.1           Model Ensemble - Regression      KNIME         KNIME 2.8               Audit     View
    4.1           Model Ensemble - Regression      KNIME         KNIME 2.8               Iris      View
    4.1           Model Ensemble - Tree            KNIME         KNIME 2.8               Audit     View
    4.1           Model Ensemble - Tree            KNIME         KNIME 2.8               Iris      View
    4.1           Model Ensemble - SVM             KNIME         KNIME 2.8               Audit     View
    4.1           Model Ensemble - SVM             KNIME         KNIME 2.8               Iris      View
    3.2           Clustering                       R/Rattle      PMML Package 1.2.29     Audit     View
    3.2           Clustering                       R/Rattle      PMML Package 1.2.29     Iris      View
    3.2           Clustering                       R/Rattle      PMML Package 1.2.29     Iris      View
    3.2           Tree                             R/Rattle      PMML Package 1.2.29     Audit     View
    3.2           Tree                             R/Rattle      PMML Package 1.2.29     Iris      View
    3.2           Regression                       R/Rattle      PMML Package 1.2.29     Audit     View
    3.2           Regression                       R/Rattle      PMML Package 1.2.29     Iris      View
    3.2           Regression                       R/Rattle      PMML Package 1.2.29     Iris      View
    4.0           Support Vector Machine           R/Rattle      PMML Package 1.2.30     Audit     View
    4.0           Random Forest                    R/Rattle      PMML Package 1.2.30     Audit     View
    4.0           Random Forest                    R/Rattle      PMML Package 1.2.30     Iris      View
    4.0           General Regression               R/Rattle      PMML Package 1.2.30     Iris      View
    4.0           Association Rules                R/Rattle      PMML Package 1.2.30     Shopping  View
    4.1           Transformations                  R/Rattle      PMML Package 1.3        Audit     View
    4.1           Transformations                  R/Rattle      PMML Package 1.3        Iris      View
    4.2           Clustering                       Apache Spark  Apache Spark MLlib 1.4  Iris      View
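    Source: http://www.cnblogs.com/lightsong/ Copyright of this article is shared by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a link to the original article must be shown prominently on the page.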
  • Original article: https://www.cnblogs.com/lightsong/p/14344784.html