• mongoDB BI 分析利器


    背景

    mongoDB是近几年迅速崛起的一种文档型数据库,广泛应用于对事务无要求,但是要求较好的开发灵活性,扩展弹性的领域,。

    随着企业对数据挖掘需求的增加,用户可能会对存储在mongo中的数据有挖掘需求,但是mongoDB的语法较为单一,不能满足挖掘的需求。

    PostgreSQL是起源于伯克利大小的一个开源数据库,已经有20多年的历史,以稳定性,功能强大著称,号称"开源界的Oracle"。

    在国内外各个行业都有非常多的用户,如平安银行,邮储银行,中移动,去哪儿,高德,菜鸟,美国宇航局,俄罗斯杜马等等。

    PostgreSQL 9.6 新增了基于CPU的并行计算。 20TB以内的OLTP+OLAP的混合场景,PostgreSQL 会是很好的选择。

    PostgreSQL的FDW特性,可以允许它连接任何数据源,将外部数据源当成本地源使用。

    MongoDB Connector for BI就是PostgreSQL 的FDW衍生的产品。 为mongoDB用户提供丰富的SQL接口。
    screenshot

    除了可以连接mongoDB,PostgreSQL FDW还能连接几乎所有数据源,图中没有完全列出。
    screenshot
    FDW请参考
    http://wiki.postgresql.org/wiki/Fdw

    本文将从mongodb用户视角,讲解一下mongodb bi connector的用法。

    MongoDB Connector for BI的部署

    因为国内下载mongodb-bi的包非常慢,我这里没有验证整个过程,以互联网上一篇文档或蓝本,细化一下整个过程。

    OS环境

    [root@mongobihost raj]# lsb_release -a
    Distributor ID: RedHatEnterpriseServer
    Description: Red Hat Enterprise Linux Server release 6.5 (Santiago)
    Release: 6.5
    [root@mongobihost raj]# cat /etc/redhat-release 
    Red Hat Enterprise Linux Server release 6.5 (Santiago)
    

    python版本

    [root@mongobihost raj]# which python
    /usr/bin/python
    [root@mongobihost raj]# python -V
    Python 2.6.6
    

    下载 mongodb-bi-1.1.3-1-centos6-rpms.tar.bz2, 解压
    包含了PostgreSQL, FDW接口以及mongodb schema转换成SQL的工具等。

    root@mongobihost bin]# cd /tmp/
    [root@mongobihost tmp]# ls -ltr
    mongodb-bi-schematools-1.1.3-1.el6.x86_64.rpm
    mongodb-bi-libs-1.1.3-1.el6.x86_64.rpm
    mongodb-bi-1.1.3-1.el6.x86_64.rpm
    mongodb-bi-server-1.1.3-1.el6.x86_64.rpm  -- PostgreSQL server
    mongodb-bi-contrib-1.1.3-1.el6.x86_64.rpm  -- PostgreSQL contrib
    mongodb-bi-devel-1.1.3-1.el6.x86_64.rpm  -- PostgreSQL include
    mongodb-bi-multicorn-1.1.3-1.el6.x86_64.rpm  -- PostgreSQL python FDW 开发接口
    mongodb-bi-pymongo-1.1.3-1.x86_64.rpm
    mongodb-bi-fdw-1.1.3-1.noarch.rpm    --  PostgreSQL mongofdw based on mulitcorn
    mongodb-bi-1.1.3-1-centos6-rpms.tar.bz2
    

    安装这些 rpm

    [root@mongobihost tmp]# rpm -ivh *.rpm --nodeps
    Preparing...                ########################################### [100%]
     package mongodb-bi-libs-1.1.3-1.el6.x86_64 is already installed
     package mongodb-bi-1.1.3-1.el6.x86_64 is already installed
     package mongodb-bi-devel-1.1.3-1.el6.x86_64 is already installed
     package mongodb-bi-server-1.1.3-1.el6.x86_64 is already installed
     package mongodb-bi-contrib-1.1.3-1.el6.x86_64 is already installed
     package mongodb-bi-schematools-1.1.3-1.el6.x86_64 is already installed
     package mongodb-bi-pymongo-1.1.3-1.x86_64 is already installed
     package mongodb-bi-fdw-1.1.3-1.noarch is already installed
    

    安装 mongodb-bi-multicorn

    [root@mongobihost tmp]# rpm -ivh mongodb-bi-multicorn-1.1.3-1.el6.x86_64 --nodeps
    error: open of mongodb-bi-multicorn-1.1.3-1.el6.x86_64 failed: No such file or directory
    
    [root@mongobihost tmp]# rpm -ivh mongodb-bi-multicorn-1.1.3-1.el6.x86_64.rpm --nodeps
    Preparing...                ########################################### [100%]
       1:mongodb-bi-multicorn   ########################################### [100%]
    

    安装完后,检查python 的collections模块是否正常

    NOTE: python Version should be greater than 2.6 - Hence, upgrade it and then install RPMs.
    
    One way to check is :  to start a Python2.6 shell, and confirm that the "collections" module includes the "OrderedDict()" methods. For example:
    
    python
    Python 2.6.6 (r266:84292, Sep  4 2013, 07:46:00) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import collections
    >>> od = collections.OrderedDict()
    >>> od
    
    OrderedDict()
    
    Ctrl+D to exit..
    

    检查 本地 Mongo

     mongo ${HOST}:${PORT}/admin -u mongoadmin -p $password
    MongoDB shell version: 3.2.4
    connecting to: mongobihost:27017/admin
    Server has startup warnings: 
    
    2016-04-01T16:49:54.454-0700 I CONTROL  [initandlisten] 
    MongoDB Enterprise set01:PRIMARY> show dbs
    admin              0.000GB
    rajdb              1.210GB
    abcdeconfig        0.015GB
    abcdb              0.166GB
    jiradb             0.026GB
    local              1.199GB
    exit;
    

    创建 mongodb bi 用户
    对应的操作是在PostgreSQL 中使用 create server和CREATE USER MAPPING FOR定义foreign server与user mapping的操作。( 指向提供的 mongodb url )
    参考 https://docs.mongodb.com/bi-connector/reference/mongobiuser/#bin.mongobiuser

    [root@mongobihost bin]# mongobiuser create biuser mongodb://biuser:test@mongobihost.myhost.com:27017/admin
    
    or  
    
    [root@mongobihost bin]# mongobiuser create biuser mongodb://mongobihost.myblog.com:27017/admin
    Enter password:  
    
    2016-06-17T12:12:15.403-0700 creating user biuser  
    2016-06-17T12:12:15.408-0700 creating database buses  
    

    检查PostgreSQL是否启动
    mongo bi connector修改了PostgreSQL中的一些默认选项,例如端口已修改为27032,当然你也可以自己去修改这个端口。
    以下是PostgreSQL在unix socket上的监听,监听端口27032,如果你需要监听在IP端口上,需要修改postgresql.conf重启数据库.

    [root@mongobihost bin]# netstat -an|grep PG
    Active Internet connections (servers and established)
    Proto RefCnt Flags       Type       State         I-Node Path
    unix  2      [ ACC ]     STREAM     LISTENING     1262987 /tmp/.s.PGSQL.27032
    

    查看PostgreSQL 配置文件的位置
    其实用 rpm -ql mongodb-bi-server 更好

    [root@mongobihost tmp]# find / -name postgresql.conf 
    /var/lib/pgsql/9.4/data/postgresql.conf  
    

    修改监听,在所有的接口上。这样你的BI软件才能通过网络连到PostgreSQL

    vi /var/lib/pgsql/9.4/data/postgresql.conf
    listen_addresses = '0.0.0.0'
    

    配置PostgreSQL pg_hba.conf,允许所有来源IP访问这个PostgreSQL

    [root@mongobihost bin]# vi /var/lib/pgsql/9.4/data/pg_hba.conf
    #** Add below content :
    # IPv4 local connections:
    host    all             all             0.0.0.0/0               md5
    

    重启postgresql

    pg_ctl restart -m fast -D /var/lib/pgsql/9.4/data
    

    使用mongodrdl将需要参与BI分析的collection导出成为创建PostgreSQL 外部表的DDL

    mongodrdl -d rajdb -o rajdb.drdl -h mongobihost:27017 -u mongoadmin -p $password --authenticationDatabase admin
    
    Note: 27017 is mongo port
    2016-06-17T14:20:15.546-0700 Table "employee", column "sfg.sfgsf" has no types: ignoring column.
    2016-06-17T14:20:15.546-0700 Table "employee", column "fgfs.gsdfgf" has no types: ignoring column.
    2016-06-17T14:20:15.546-0700 Table "employee", column "fgsf.sgfgs" has no types: ignoring column.
    2016-06-17T14:20:15.546-0700 Table "employee", column "sgss.srgs" has no types: ignoring column.
    2016-06-17T14:20:16.123-0700 Table "emp_Pack_flat", column "rtgs.comments" has no types: ignoring column.
    2016-06-17T14:20:16.972-0700 Table "customer_transaction", column "FValues" is an array that has no types: ignoring column.
    2016-06-17T14:20:16.973-0700 Table "customer_transaction_Notes", column "Notes.enumValues" is an array that has no types: ignoring column.
    2016-06-17T14:20:16.973-0700 Table "customer_transaction_SiteValues", column "F1z_v.fields.SiteAbbr.enumValues" is an array that has no types: ignoring column.
    2016-06-17T14:20:16.973-0700 Table "customer_transaction_URL", column "URL.enumValues" is an array that has no types: ignoring column.
    2016-06-17T14:20:16.974-0700 Table "customer_transaction_active", column "F1z_v.fields.active.enumValues" is an array that has no types: ignoring column.
    2016-06-17T14:20:16.974-0700 Table "customer_transaction_active", column "colCur.enumValues" is an array that has no types: ignoring column.
    2016-06-17T14:20:16.974-0700 Table "customer_transaction_active", column "colDiff.enumValues" is an array that has no types: ignoring column.
    

    使用mongobischema 将DDL导入PostgreSQL

    # To import data into BI schema
    
    [root@mongobihost bin]# mongobischema import biuser ./rajdb.drdl
    Enter password:
    
    2016-06-17T14:55:02.541-0700 creating table employee
    2016-06-17T14:55:02.572-0700 creating table emp_Pac_fla
    2016-06-17T14:55:02.579-0700 creating table customer_transaction
    2016-06-17T14:55:02.588-0700 creating table customer_transaction_Notes
    2016-06-17T14:55:02.597-0700 creating table customer_transaction_SiteVa
    2016-06-17T14:55:02.606-0700 creating table customer_transaction_URL
    2016-06-17T14:55:02.614-0700 creating table customer_transaction_active
    
    # to look at the tables in the BI schema, run below stmt.
    

    检查已导入的外部表

    [root@mongobihost]# mongobischema list biuser
    Enter password:
    
    
    employee
    
    customer_transaction
    
    customer_transaction_Notes
    
    customer_transaction_SiteVa
    
    customer_transaction_URL 
    customer_transaction_active
    

    如何重启PostgreSQL,也可直接使用pg_ctl。

    If you need to restart the BI Connector, then
    sudo service postgresql-9.4 stop
    sudo service postgresql-9.4 start
    
    or 
    
    pg_ctl restart -m fast -D /var/lib/pgsql/9.4/data
    

    列出bi用户,也可以直接用PostgreSQL中的SQL或视图查看

    # mongobiuser list
    

    检查连接PostgreSQL是否正常

    to check if things are okay on postgre Sql..
    
     psql -h localhost -p 27032 -U biuser
    Password for user biuser: 
    psql (9.4.5 MongoDB BI Connector 1.1.3)
    SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
    Type "help" for help.
    
    biuser=> d
                                                    List of relations
     Schema |                                     Name                                      |     Type      | Owner  
    --------+-------------------------------------------------------------------------------+---------------+--------
     public | customer_transaction                                                            | view          | biuser
     public | customer_transaction_Notes                                                        | foreign table | biuser
     public | customer_transaction_SiteVa                                               | view          | biuser
    
    biuser=> select * from "customer_transaction" limit 1;
    

    现在你可以用BI软件连接PostgreSQL来分析存储在mongoDB的数据了 。

    小结

    • 其实创建fdw没有这么麻烦,几条SQL语句就搞定了。 mongodb只是出于mongodb用户可能不熟悉PostgreSQL的考虑,把操作都封装成了命令,便于mongodb的用户使用。
    • 如果用户未来又新增了需要分析的collection,使用导出和导入的步骤即可。
    • PostgreSQL作为mongodb bi connector的作用是扩充mongoDB的SQL功能,并不存储数据,数据都是在mongoDB中的。 如果涉及到比较复杂的运算无法下推到mongodb,则会将数据提取到PostgreSQL本地进行运算(过程自动,对用户透明)。但是如果数据量非常庞大(例如每次分析都超过百GB需要提取)会被数据在网络传输的时间拖累。
    • 如果数据量很庞大,建议还是讲mongoDB的数据导出到PostgreSQL或者Greenplum,直接在PostgreSQL或Greenplum中分析。效率会更高。
    • Greenplum 是基于PostgreSQL的一个MPP OLAP产品,在OLAP领域有着非常好的口碑,国内外的用户群也非常多。
    • 用户群覆盖了互联网,金融、物流、政府部门等各大行业,最大的集群规模有超过1000 segment的。
      20TB ~ 1PB 纯OLAP场景 Greenplum 会是更好的选择。
    • 如何将mongoDB的数据导入PostgreSQL或Greenplum , 参考 https://yq.aliyun.com/articles/31632 , 或者直接使用SQL导入. create table local_table(表结构); insert into local_table select * from 外部表;

    参考

    扩展阅读

  • 相关阅读:
    CnForums1.0 Alpha 开始试运行
    asp.net Forums2.0修改密码后无法登陆问题——都是Cache惹的祸
    CnForums1.0 Alpha RC1 发布
    Docker: Nvidia Driver, Nvidia Docker 推荐安装步骤
    Docker: docker pull, wget, curl, git clone 等如何更快?
    DL4J实战之三:经典卷积实例(LeNet5)
    纯净Ubuntu16安装CUDA(9.1)和cuDNN
    DL4J实战之四:经典卷积实例(GPU版本)
    JAVA 中静态块、静态变量加载顺序详解
    设计模式之单例模式
  • 原文地址:https://www.cnblogs.com/aliyunblogs/p/5703361.html
Copyright © 2020-2023  润新知