• timescaledb 集成 madlib


    github 上有人提出了一个问题(2017 很早了),然后搜索timescaledb 的docs 文档,发现有
    一片介绍的文章,所以尝试运行下
    备注: 环境使用虚拟机安装(没有使用docker madlib 的原因,实际上可以尝试基于timescaledb 的镜像改造)

    安装madlib

    这个可以参考madlib 官方文档,或者https://www.cnblogs.com/rongfengliang/p/10298159.html
    因为timescaledb 是一个pg 的标准扩展,可以复用以前的安装方式

    • 添加pg 10 repo
     
    yum install https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm
     
    • 安装pg python 基本包
    yum -y install postgresql10-plpython supervisor
     
    • 安装madlib 依赖包
      注意python 版本,我使用python 2.7 安装时候失败了,修改为了python34
     
    yum update -y && yum install -y 
                        git 
                        gcc 
                        wget 
                        postgresql10-devel 
                        openssl 
                        m4 
                        vim 
                        flex 
                        bison 
                        graphviz 
                        java 
                        epel-release 
                        python34-devel
     
     
    • 安装pip 包
      默认一般是包含的
     
    yum install -y python34-pip
     
     
    • pg_conf 配置(环境变量)
    PATH="$PATH:/usr/pgsql-10/bin"
     
     
    • 安装python 依赖(通过pip)
     pip3 install awscli pygresql paramiko --upgrade
     
     
    • 安装apache-madlib
    下载rpm 
    wget   https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.rpm
    安装
    yum install -y apache-madlib-1.15.1-bin-Linux.rpm
     
     

    timescaledb 扩展安装

    • 下载timescaledb pg 扩展
    wget https://timescalereleases.blob.core.windows.net/rpm/timescaledb-1.1.1-postgresql-10-0.x86_64.rpm
     
    • 安装
    yum install -y timescaledb-1.1.1-postgresql-10-0.x86_64.rpm
     
    • 初始化数据库
    /usr/pgsql-10/bin/postgresql-10-setup initdb
     
     
    • 配置pg 扩展
      /var/lib/pgsql/10/data/postgresql.conf 文件
     
    + shared_preload_libraries = 'timescaledb'
     
     
    • 修改pg_hba.conf 添加访问支持
      之后修改之后,需要重启服务,systemctl restart postgresql-10
     
    /var/lib/pgsql/10/data/pg_hba.conf
    修改如下:
    # "local" is for Unix domain socket connections only
    local all all trust
    # IPv4 local connections:
    host all all 127.0.0.1/32 trust
    # IPv6 local connections:
    host all all ::1/128 trust
     
     
    • 启动数据库
    systemctl enable postgresql-10
    systemctl start postgresql-10
     
     

    注册madlib 扩展

    • 下载&&安装madlib 扩展
    下载rpm 
    wget https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.rpm
    安装
    yum install -y apache-madlib-1.15.1-bin-Linux.rpm
     
     
    • 注册
    /usr/local/madlib/bin/madpack -s madlib -p postgres -c postgres@localhost:5432/postgres install
     
     

    效果

    madpack.py: INFO : Detected PostgreSQL version 10.6.
    madpack.py: INFO : *** Installing MADlib ***
    madpack.py: INFO : MADlib tools version = 1.15.1 (/usr/local/madlib/Versions/1.15.1/bin/../madpack/madpack.py)
    madpack.py: INFO : MADlib database version = None (host=localhost:5432, db=postgres, schema=madlib)
    madpack.py: INFO : Testing PL/Python environment...
    madpack.py: INFO : > Creating language PL/Python...
    madpack.py: INFO : > PL/Python environment OK (version: 2.7.5)
    madpack.py: INFO : > Preparing objects for the following modules:
    madpack.py: INFO : > - array_ops
    madpack.py: INFO : > - bayes
    madpack.py: INFO : > - crf
    madpack.py: INFO : > - elastic_net
    madpack.py: INFO : > - linalg
    madpack.py: INFO : > - pmml
    madpack.py: INFO : > - prob
    madpack.py: INFO : > - sketch
    madpack.py: INFO : > - svec
    madpack.py: INFO : > - svm
    madpack.py: INFO : > - tsa
    madpack.py: INFO : > - stemmer
    madpack.py: INFO : > - conjugate_gradient
    madpack.py: INFO : > - knn
    madpack.py: INFO : > - lda
    madpac
     
     
    • 检查
    /usr/local/madlib/bin/madpack -s madlib -p postgres -c postgres@localhost:5432/postgres install-check
     
     

    效果

    TEST CASE RESULT|Module: bayes|bayes.ic.sql_in|PASS|Time: 121 milliseconds
    TEST CASE RESULT|Module: crf|crf_train_small.ic.sql_in|PASS|Time: 112 milliseconds
    TEST CASE RESULT|Module: crf|crf_test_small.ic.sql_in|PASS|Time: 133 milliseconds
    TEST CASE RESULT|Module: elastic_net|elastic_net.ic.sql_in|PASS|Time: 133 milliseconds
    TEST CASE RESULT|Module: linalg|linalg.ic.sql_in|PASS|Time: 44 milliseconds
    TEST CASE RESULT|Module: linalg|svd.ic.sql_in|PASS|Time: 168 milliseconds
    TEST CASE RESULT|Module: linalg|matrix_ops.ic.sql_in|PASS|Time: 238 milliseconds
    TEST CASE RESULT|Module: prob|prob.ic.sql_in|PASS|Time: 21 milliseconds
    TEST CASE RES
     
     

    简单测试

    内容来自官方文档
    https://docs.timescale.com/v0.11/tutorials/tutorial-forecasting,https://docs.timescale.com/v1.1/tutorials/tutorial-hello-nyc
    同时需要安装的组件也比较多,gis,timescale。。

    • 预备安装
      gis 扩展安装,后边需要,实际上这个是可选的
    wget https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-centos10-10-2.noarch.rpm // 这个yum 源
    上边实际上已经执行了
    yum  install -y postgis25_10
     
     
    • 加载扩展
      都在postgres 数据库
     
    psql -U postgres -d  postgres -h localhost 
    c postgres
    CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
    CREATE EXTENSION postgis;
     
     
    • 加载数据部分数据(因为gis 数据有依赖关系)
      数据有点多,可能会比较慢
     
    wget https://timescaledata.blob.core.windows.net/datasets/nyc_data.tar.gz
    tar -xvzf nyc_data.tar.gz
    psql -U postgres -d  postgres -h localhost < nyc_data.sql
    psql -U postgres -d postgres -h localhost -c "COPY rides FROM nyc_data_rides.csv CSV"
     
     
    • 初始化gis 扩展 && forecast.sql
      数据有点多,可能会比较慢
     
    ALTER TABLE rides ADD COLUMN pickup_geom geometry(POINT,2163);
    ALTER TABLE rides ADD COLUMN dropoff_geom geometry(POINT,2163);
    生成geo 数据:
    UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);
    UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);
    wget http://assets.iobeam.com/sql/forecast.sql
    // 此过程会创建rides_count rides_length rides_price  等hypertables 
    psql -U postgres -d postgres -h localhost -f forecast.sql
     
     

    查询操作

    • 查询rides_price
      如下:
     
    SELECT * FROM rides_price;
          one_hour | trip_price
    ---------------------+------------------
     2016-01-01 00:00:00 | 58.34
     2016-01-01 01:00:00 | 58.34
     2016-01-01 02:00:00 | 58.34
     2016-01-01 03:00:00 | 58.34
     2016-01-01 04:00:00 | 58.34
     2016-01-01 05:00:00 | 59.59
     2016-01-01 06:00:00 | 58.34
     2016-01-01 07:00:00 | 60.3833333333333
     2016-01-01 08:00:00 | 61.2575
     2016-01-01 09:00:00 | 58.435
     2016-01-01 10:00:00 | 63.952
     2016-01-01 11:00:00 | 59.9576923076923
     2016-01-01 12:00:00 | 60.4
     
     
    • 创建训练以及测试的数据集
      这个主要是测试从肯尼迪国际机场到时代广场的车程价格
     
    -- Make the training dataset
    SELECT * INTO rides_price_train FROM rides_price
    WHERE one_hour <= '2016-01-21 23:59:59';
    -- Make the testing dataset
    SELECT * INTO rides_price_test FROM rides_price
    WHERE one_hour >= '2016-01-22 00:00:00';
     
     

    效果

    SELECT * INTO rides_price_train FROM rides_price
    postgres-# WHERE one_hour <= '2016-01-21 23:59:59';
    SELECT 504
    postgres=# SELECT * INTO rides_price_test FROM rides_price
    postgres-# WHERE one_hour >= '2016-01-22 00:00:00';
    SELECT 240
     
     
    • 使用madlib 函数生成训练数据
    DROP TABLE IF EXISTS rides_price_output;
    DROP TABLE IF EXISTS rides_price_output_residual;
    DROP TABLE IF EXISTS rides_price_output_summary;
    DROP TABLE IF EXISTS rides_price_forecast_output;
    SELECT madlib.arima_train('rides_price_train', -- input table
          'rides_price_output', -- output table
          'one_hour', -- timestamp column
          'trip_price', -- time-series column
          NULL, -- grouping columns
          TRUE, -- include_mean
          ARRAY[2,1,3] -- non-seasonal orders
          );
    SELECT madlib.arima_forecast('rides_price_output', -- model table
                            'rides_price_forecast_output', -- output table
                            240 -- steps_ahead (10 days)
                            );
     
     
    • 查看生成的结果
     
        SELECT * FROM rides_price_forecast_output;
     

    数据

     SELECT * FROM rides_price_forecast_output;
     steps_ahead | forecast_value
    -------------+----------------
               1 | 62.317574661
               2 | 62.7126520761
               3 | 62.892038632
               4 | 62.755044625
               5 | 62.6064068106
               6 | 62.6197088752
               7 | 62.7032172965
               8 | 62.729257786
               9 | 62.6956015739
              10 | 62.6685762986
              11 | 62.6755999496
              12 | 62.6926055721
              13 | 62.695637064
              14 | 62.6878845777
              15 | 62.683
     
     

    结论: 从肯尼迪国际机场到时代广场的车程价格在日常基础上保持不变

    • 模型评估的
    ALTER TABLE rides_price_test ADD COLUMN id SERIAL PRIMARY KEY;
    ALTER TABLE rides_price_test ADD COLUMN forecast DOUBLE PRECISION;
    UPDATE rides_price_test
    SET forecast = rides_price_forecast_output.forecast_value
    FROM rides_price_forecast_output
    WHERE rides_price_test.id = rides_price_forecast_output.steps_ahead;
    SELECT madlib.mean_abs_perc_error('rides_price_test', 'rides_price_mean_abs_perc_error', 'trip_price', 'forecast');
    SELECT * FROM rides_price_mean_abs_perc_error;
     
     

    结果

    SELECT * FROM rides_price_mean_abs_perc_error;
     mean_abs_perc_error
    ---------------------
      0.0423789161947618
     
     

    结论: 从机场到曼哈顿的旅行价格保持在62美元,并且与测试数据集相比表现良好

    说明

    有一些关于机器学习以及统计的专业术语不是很懂,但是从基本的运行上,说明,madlib 与timescaledb 的集成还是
    很方便的,这样我们既能有时序数据库的方便,同时还包含了基于sql 的机器学习能力,当然我们同时也可以集成graphql
    engine 进行方便的graphql api ,总的来说基于标准pg 扩展的模型,部署还挺方便的

    参考资料

    https://docs.timescale.com/v0.11/tutorials/tutorial-forecasting
    https://docs.timescale.com/v0.11/getting-started
    https://docs.timescale.com/v1.1/tutorials/tutorial-hello-nyc#tutorial-postgis

  • 相关阅读:
    分布式session管理解决方案
    RabbitMQ知识汇总
    RabbitMQ之集群模式总结
    Flexbox参数详解
    CSS Lint
    javascript中的defer属性和async属性
    简介BFC
    GIT 牛刀小试 (第二发)
    GIT 牛刀小试 (第一发)
    如何让浏览器支持HTML5标签
  • 原文地址:https://www.cnblogs.com/rongfengliang/p/10300733.html
Copyright © 2020-2023  润新知