• ML.NET Machine Learning Framework Study Notes [9]: Automated Learning (AutoML)


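    I. Overview

    In this article we first build a wine quality prediction program with a regression algorithm, then re-implement it with AutoML, and learn how AutoML is used by comparing the two implementations.

    The dataset is the UCI Wine Quality Dataset from the competition site kaggle.com, available at: https://www.kaggle.com/c/uci-wine-quality-dataset/data

    The inputs of this dataset are physicochemical measurements of each wine, such as alcohol content, and the output is the score given by wine tasters. The fields are described as follows: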

    Data fields
    Input variables (based on physicochemical tests): 
    1 - fixed acidity 
    2 - volatile acidity 
    3 - citric acid 
    4 - residual sugar 
    5 - chlorides 
    6 - free sulfur dioxide 
    7 - total sulfur dioxide 
    8 - density 
    9 - pH 
    10 - sulphates 
    11 - alcohol
    
    Output variable (based on sensory data): 
    12 - quality (score between 0 and 10)
    
    Other:
    13 - id (unique ID for each sample, needed for submission)
    

       

    II. Code

    using System;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    namespace Regression_WineQuality
    {
        public class WineData
        {
            [LoadColumn(0)]
            public float FixedAcidity;
    
            [LoadColumn(1)]
            public float VolatileAcidity;
    
            [LoadColumn(2)]
            public float CitricACID;
    
            [LoadColumn(3)]
            public float ResidualSugar;
    
            [LoadColumn(4)]
            public float Chlorides;
    
            [LoadColumn(5)]
            public float FreeSulfurDioxide;
    
            [LoadColumn(6)]
            public float TotalSulfurDioxide;
    
            [LoadColumn(7)]
            public float Density;
    
            [LoadColumn(8)]
            public float PH;
    
            [LoadColumn(9)]
            public float Sulphates;
    
            [LoadColumn(10)]
            public float Alcohol;
          
            [LoadColumn(11)]
            [ColumnName("Label")]
            public float Quality;
           
            [LoadColumn(12)]
            public float Id;
        }
    
        public class WinePrediction
        {
            [ColumnName("Score")]
            public float PredictionQuality;
        }
    
        class Program
        {
            static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "MLModel", "model.zip");
    
            static void Main(string[] args)
            { 
                Train();
                Prediction();
    
                Console.WriteLine("Hit any key to finish the app");
                Console.ReadKey();
            }
    
            public static void Train()
            {
                MLContext mlContext = new MLContext(seed: 1);
    
                // Prepare the data
                string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-full.csv");
                var fulldata = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);
    
                var trainTestData = mlContext.Data.TrainTestSplit(fulldata, testFraction: 0.2);
                var trainData = trainTestData.TrainSet;
                var testData = trainTestData.TestSet;
    
                // Build the learning pipeline and fit the model on the training data
                var dataProcessPipeline = mlContext.Transforms.DropColumns("Id")
                    .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.FreeSulfurDioxide)))
                    .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.TotalSulfurDioxide)))
                    .Append(mlContext.Transforms.Concatenate("Features", new string[] { nameof(WineData.FixedAcidity),
                                                                                        nameof(WineData.VolatileAcidity),
                                                                                        nameof(WineData.CitricACID),
                                                                                        nameof(WineData.ResidualSugar),
                                                                                        nameof(WineData.Chlorides),
                                                                                        nameof(WineData.FreeSulfurDioxide),
                                                                                        nameof(WineData.TotalSulfurDioxide),
                                                                                        nameof(WineData.Density),
                                                                                        nameof(WineData.PH),
                                                                                        nameof(WineData.Sulphates),
                                                                                        nameof(WineData.Alcohol)}));
    
                var trainer = mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: "Label", featureColumnName: "Features");
                var trainingPipeline = dataProcessPipeline.Append(trainer);
                var trainedModel = trainingPipeline.Fit(trainData);
    
                // Evaluate on the test set
                var predictions = trainedModel.Transform(testData);
                var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
                // PrintRegressionMetrics is a console helper that is not shown in this listing
                PrintRegressionMetrics(trainer.ToString(), metrics);
    
                // Save the model
                Console.WriteLine("====== Save model to local file =========");
                mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
            }
    
            static void Prediction()
            {
                MLContext mlContext = new MLContext(seed: 1);
    
                ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
                var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);
    
                WineData wineData = new WineData
                {
                    FixedAcidity = 7.6f,
                    VolatileAcidity = 0.33f,
                    CitricACID = 0.36f,
                    ResidualSugar = 2.1f,
                    Chlorides = 0.034f,
                    FreeSulfurDioxide = 26f,
                    TotalSulfurDioxide = 172f,
                    Density = 0.9944f,
                    PH = 3.42f,
                    Sulphates = 0.48f,
                    Alcohol = 10.5f
                };
    
                var wineQuality = predictor.Predict(wineData);
                Console.WriteLine($"Wine Data  Quality is:{wineQuality.PredictionQuality} ");           
            }        
        }
    }

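     The Poisson regression algorithm was already covered in the earlier article on judging facial attractiveness. This program introduces no new concepts, so it is not explained again; its main purpose is to serve as a baseline for comparison with the AutoML code below.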

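    III. Automated Learning (AutoML)

    The overall machine learning workflow is much the same from project to project: prepare the data, define the features, choose an algorithm, train, and so on. This leaves recurring questions: which algorithm should we choose, and how should its parameters be configured? Automated learning addresses exactly this. The framework repeats the cycle of data selection, algorithm selection, parameter tuning, and evaluation many times, and through that process finds the model with the best evaluation result.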
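    To make the "try, evaluate, keep the best" cycle concrete, here is a minimal sketch in plain C# that does not use ML.NET at all. It assumes a toy one-feature model y ≈ w·x with an L2 penalty lambda as the only tunable parameter; the class and method names (TinyAutoSearch, FitSlope, Search) are hypothetical and only illustrate the idea, not how ML.NET's AutoML is implemented internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy illustration of an automated search loop: fit one model per candidate
// configuration, score each on held-out validation data, keep the best.
// Hypothetical names; this is not ML.NET code.
public static class TinyAutoSearch
{
    // Fits y ≈ w * x with an L2 penalty: w = Σxy / (Σx² + lambda).
    public static double FitSlope(double[] x, double[] y, double lambda)
    {
        double sxy = 0, sxx = 0;
        for (int i = 0; i < x.Length; i++)
        {
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
        }
        return sxy / (sxx + lambda);
    }

    // Mean squared error of the model y = w * x on the given data.
    public static double MeanSquaredError(double w, double[] x, double[] y)
    {
        double sum = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double error = y[i] - w * x[i];
            sum += error * error;
        }
        return sum / x.Length;
    }

    // Trains once per candidate lambda and returns the candidate with the
    // lowest validation error, together with that error.
    public static (double Lambda, double ValidationMse) Search(
        double[] trainX, double[] trainY,
        double[] validX, double[] validY,
        IEnumerable<double> candidates)
    {
        return candidates
            .Select(l => (Lambda: l,
                          ValidationMse: MeanSquaredError(FitSlope(trainX, trainY, l), validX, validY)))
            .OrderBy(r => r.ValidationMse)
            .First();
    }
}
```

    A call such as TinyAutoSearch.Search(trainX, trainY, validX, validY, new[] { 0.0, 1.0, 10.0 }) returns the lambda whose model scored best on the validation data. A real AutoML experiment searches over many trainers and many parameters and stops on a time budget, but the selection principle is the same.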

    The complete code is as follows:

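    using System;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.AutoML;
    using Microsoft.ML.Data;

    namespace Regression_WineQuality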
    {
        public class WineData
        {
            [LoadColumn(0)]
            public float FixedAcidity;
    
            [LoadColumn(1)]
            public float VolatileAcidity;
    
            [LoadColumn(2)]
            public float CitricACID;
    
            [LoadColumn(3)]
            public float ResidualSugar;
    
            [LoadColumn(4)]
            public float Chlorides;
    
            [LoadColumn(5)]
            public float FreeSulfurDioxide;
    
            [LoadColumn(6)]
            public float TotalSulfurDioxide;
    
            [LoadColumn(7)]
            public float Density;
    
            [LoadColumn(8)]
            public float PH;
    
            [LoadColumn(9)]
            public float Sulphates;
    
            [LoadColumn(10)]
            public float Alcohol;
          
            [LoadColumn(11)]
            [ColumnName("Label")]
            public float Quality;
    
            [LoadColumn(12)]       
            public float ID; 
        }
    
        public class WinePrediction
        {
            [ColumnName("Score")]
            public float PredictionQuality;
        }
     
    
        class Program
        {
            static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "MLModel", "model.zip");
            static readonly string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-train.csv");
            static readonly string TestDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "winequality-data-test.csv");
    
            static void Main(string[] args)
            {           
                TrainAndSave();
                LoadAndPrediction();
    
                Console.WriteLine("Hit any key to finish the app");
                Console.ReadKey();
            }
    
            public static void TrainAndSave()
            {
                MLContext mlContext = new MLContext(seed: 1);
    
                // Prepare the data
                var trainData = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);
                var testData = mlContext.Data.LoadFromTextFile<WineData>(path: TestDataPath, separatorChar: ',', hasHeader: true);
             
                var progressHandler = new RegressionExperimentProgressHandler();
                uint ExperimentTime = 200;
    
                ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
                   .CreateRegressionExperiment(ExperimentTime)
                   .Execute(trainData, "Label", progressHandler: progressHandler);           
    
                Debugger.PrintTopModels(experimentResult);
    
                RunDetail<RegressionMetrics> best = experimentResult.BestRun;
                ITransformer trainedModel = best.Model;
    
                // Evaluate BestRun on the test set
                var predictions = trainedModel.Transform(testData);
                var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
                Debugger.PrintRegressionMetrics(best.TrainerName, metrics);
    
                // Save the model
                Console.WriteLine("====== Save model to local file =========");
                mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);           
            }
           
    
            static void LoadAndPrediction()
            {
                MLContext mlContext = new MLContext(seed: 1);
    
                ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
                var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);
    
                WineData wineData = new WineData
                {
                    FixedAcidity = 7.6f,
                    VolatileAcidity = 0.33f,
                    CitricACID = 0.36f,
                    ResidualSugar = 2.1f,
                    Chlorides = 0.034f,
                    FreeSulfurDioxide = 26f,
                    TotalSulfurDioxide = 172f,
                    Density = 0.9944f,
                    PH = 3.42f,
                    Sulphates = 0.48f,
                    Alcohol = 10.5f
                };
    
                var wineQuality = predictor.Predict(wineData);
                Console.WriteLine($"Wine Data  Quality is:{wineQuality.PredictionQuality} ");           
            }
        }
    }

      

    IV. Code Analysis

    1. The AutoML process

                var progressHandler = new RegressionExperimentProgressHandler();
                uint ExperimentTime = 200;
    
                ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
                   .CreateRegressionExperiment(ExperimentTime)
                   .Execute(trainData, "Label", progressHandler: progressHandler);           
    
                Debugger.PrintTopModels(experimentResult); // Print the metrics of every model

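      ExperimentTime is the maximum time allowed for the experiment, in seconds. progressHandler is a progress reporter: each time one training run completes, the framework calls its Report method once.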

        public class RegressionExperimentProgressHandler : IProgress<RunDetail<RegressionMetrics>>
        {
            private int _iterationIndex;
    
            public void Report(RunDetail<RegressionMetrics> iterationResult)
            {
                _iterationIndex++;
                Console.WriteLine($"Report index:{_iterationIndex},TrainerName:{iterationResult.TrainerName},RuntimeInSeconds:{iterationResult.RuntimeInSeconds}");            
            }
        }

     The debug output is as follows:

    Report index:1,TrainerName:SdcaRegression,RuntimeInSeconds:12.5244426
    Report index:2,TrainerName:LightGbmRegression,RuntimeInSeconds:11.2034988
    Report index:3,TrainerName:FastTreeRegression,RuntimeInSeconds:14.810409
    Report index:4,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:14.7338553
    Report index:5,TrainerName:FastForestRegression,RuntimeInSeconds:15.6224459
    Report index:6,TrainerName:LbfgsPoissonRegression,RuntimeInSeconds:11.1668197
    Report index:7,TrainerName:OnlineGradientDescentRegression,RuntimeInSeconds:10.5353
    Report index:8,TrainerName:OlsRegression,RuntimeInSeconds:10.8905459
    Report index:9,TrainerName:LightGbmRegression,RuntimeInSeconds:10.5703296
    Report index:10,TrainerName:FastTreeRegression,RuntimeInSeconds:19.4470509
    Report index:11,TrainerName:FastTreeTweedieRegression,RuntimeInSeconds:63.638882
    Report index:12,TrainerName:LightGbmRegression,RuntimeInSeconds:10.7710518
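    The handler above relies on the standard .NET IProgress<T> interface, so the reporting mechanism can be tried out without ML.NET. The sketch below is only an illustration: RunInfo is a hypothetical stand-in for RunDetail<RegressionMetrics>, and FakeExperiment simply replays a list of finished runs instead of training anything.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for RunDetail<RegressionMetrics>; not an ML.NET type.
public sealed class RunInfo
{
    public string TrainerName { get; }
    public double RuntimeInSeconds { get; }

    public RunInfo(string trainerName, double runtimeInSeconds)
    {
        TrainerName = trainerName;
        RuntimeInSeconds = runtimeInSeconds;
    }
}

// Same shape as RegressionExperimentProgressHandler, but it stores the
// report lines in a list instead of writing them to the console.
public sealed class CollectingProgressHandler : IProgress<RunInfo>
{
    private int _iterationIndex;

    public List<string> Lines { get; } = new List<string>();

    public void Report(RunInfo run)
    {
        _iterationIndex++;
        Lines.Add($"Report index:{_iterationIndex},TrainerName:{run.TrainerName}");
    }
}

// Simulates an experiment: calls Report once for every completed run.
public static class FakeExperiment
{
    public static void Execute(IEnumerable<RunInfo> runs, IProgress<RunInfo> handler)
    {
        foreach (var run in runs)
        {
            handler?.Report(run);
        }
    }
}
```

    Implementing IProgress<T> directly, as the article's handler does, means Report runs synchronously on the thread that calls it. The built-in Progress<T> class instead dispatches its callback through the captured synchronization context or the thread pool, so console output from it may arrive later than the run it describes.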
    

    After the experiment finishes, we print the metrics of all models with Debugger.PrintTopModels:

        // Requires: using System; using System.Linq; using Microsoft.ML.AutoML; using Microsoft.ML.Data;
        public class Debugger
        {
            private const int Width = 114;
            public  static void PrintTopModels(ExperimentResult<RegressionMetrics> experimentResult)
            {            
                var topRuns = experimentResult.RunDetails
                    .Where(r => r.ValidationMetrics != null && !double.IsNaN(r.ValidationMetrics.RSquared))
                    .OrderByDescending(r => r.ValidationMetrics.RSquared);
    
                Console.WriteLine("Top models ranked by R-Squared --");
                PrintRegressionMetricsHeader();
                for (var i = 0; i < topRuns.Count(); i++)
                {
                    var run = topRuns.ElementAt(i);
                    PrintIterationMetrics(i + 1, run.TrainerName, run.ValidationMetrics, run.RuntimeInSeconds);
                }
            }       
    
            public static void PrintRegressionMetricsHeader()
            {
                CreateRow($"{"",-4} {"Trainer",-35} {"RSquared",8} {"Absolute-loss",13} {"Squared-loss",12} {"RMS-loss",8} {"Duration",9}", Width);
            }
    
            public static void PrintIterationMetrics(int iteration, string trainerName, RegressionMetrics metrics, double? runtimeInSeconds)
            {
                CreateRow($"{iteration,-4} {trainerName,-35} {metrics?.RSquared ?? double.NaN,8:F4} {metrics?.MeanAbsoluteError ?? double.NaN,13:F2} {metrics?.MeanSquaredError ?? double.NaN,12:F2} {metrics?.RootMeanSquaredError ?? double.NaN,8:F2} {runtimeInSeconds.Value,9:F1}", Width);
            }
    
            public static void CreateRow(string message, int width)
            {
                Console.WriteLine("|" + message.PadRight(width - 2) + "|");
            }
        }

     CreateRow is only used for layout. The debug output is as follows:

    Top models ranked by R-Squared --
    |     Trainer                             RSquared Absolute-loss Squared-loss RMS-loss  Duration                 |
    |1    FastTreeTweedieRegression             0.4731          0.46         0.41     0.64      63.6                 |
    |2    FastTreeTweedieRegression             0.4431          0.49         0.43     0.65      14.7                 |
    |3    FastTreeRegression                    0.4386          0.54         0.49     0.70      19.4                 |
    |4    LightGbmRegression                    0.4177          0.52         0.45     0.67      10.8                 |
    |5    FastTreeRegression                    0.4102          0.51         0.45     0.67      14.8                 |
    |6    LightGbmRegression                    0.3944          0.52         0.46     0.68      11.2                 |
    |7    LightGbmRegression                    0.3501          0.60         0.57     0.75      10.6                 |
    |8    FastForestRegression                  0.3381          0.60         0.58     0.76      15.6                 |
    |9    OlsRegression                         0.2829          0.56         0.53     0.73      10.9                 |
    |10   LbfgsPoissonRegression                0.2760          0.62         0.63     0.80      11.2                 |
    |11   SdcaRegression                        0.2746          0.58         0.56     0.75      12.5                 |
    |12   OnlineGradientDescentRegression       0.0593          0.69         0.81     0.90      10.5                 |
    

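    The results show that some algorithms were tried more than once, but each run of the same algorithm used a different parameter configuration, such as thresholds or depth.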
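    The metric columns in the table above are related to one another, and it can help to compute them by hand once. The sketch below uses the textbook definitions of these metrics; it assumes, without checking ML.NET's source, that RegressionMetrics follows the same definitions, and the class name RegressionMetricsSketch is hypothetical.

```csharp
using System;
using System.Linq;

// Textbook regression metrics computed from actual and predicted values.
// Hypothetical helper; not part of ML.NET.
public static class RegressionMetricsSketch
{
    // "Absolute-loss": average of |actual - predicted|.
    public static double MeanAbsoluteError(double[] actual, double[] predicted) =>
        actual.Zip(predicted, (a, p) => Math.Abs(a - p)).Average();

    // "Squared-loss": average of (actual - predicted)^2.
    public static double MeanSquaredError(double[] actual, double[] predicted) =>
        actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();

    // "RMS-loss": square root of the squared loss.
    public static double RootMeanSquaredError(double[] actual, double[] predicted) =>
        Math.Sqrt(MeanSquaredError(actual, predicted));

    // R² = 1 - SS_res / SS_tot, where SS_tot is measured around the mean label.
    public static double RSquared(double[] actual, double[] predicted)
    {
        double mean = actual.Average();
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - ssRes / ssTot;
    }
}
```

    For example, with actual scores { 5, 6, 7 } and predictions { 5, 6, 8 }, the squared loss is 1/3 and R² is 0.5. The same relation is visible in the first row of the table: a squared loss of 0.41 corresponds to an RMS loss of √0.41 ≈ 0.64.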

    2. Getting the best model

                RunDetail<RegressionMetrics> best = experimentResult.BestRun;
                ITransformer trainedModel = best.Model;

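     Once the best model is obtained, evaluating and saving it work the same way as in the earlier code. Evaluating it on the test data gives: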

    *************************************************
    *       Metrics for FastTreeTweedieRegression regression model
    *------------------------------------------------
    *       LossFn:        0.67
    *       R2 Score:      0.34
    *       Absolute loss: .63
    *       Squared loss:  .67
    *       RMS loss:      .82
    *************************************************
    

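    Judging by these results, the model is not accurate enough: its R² score on the test data is only 0.34, lower than the 0.47 it reached during validation. A result like this cannot be used in production. The problem is probably that we have not found the key features that determine wine quality.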

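    V. Summary

    With this article, the "ML.NET Study Notes" series comes to an end. The original code used throughout the series mainly comes from: https://github.com/dotnet/machinelearning-samples .

    That repository also contains examples of other algorithms, including clustering, matrix factorization, and anomaly detection. Their overall workflow is much the same, so with this series as a foundation, interested readers can explore them on their own.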

      

    VI. Resources

    Source code download: https://github.com/seabluescn/Study_ML.NET

    Regression project name: Regression_WineQuality

    AutoML project name: Regression_WineQuality_AutoML


  • Original article: https://www.cnblogs.com/seabluescn/p/10995991.html