• Apache Atlas Basic Usage


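    When we talk about data governance and metadata management, what are we actually talking about?

      Data governance is inseparable from metadata. Metadata, defined in one sentence, is data that describes data. It connects data sources, data warehouses, and data applications, and records the entire path of data from production to consumption. Metadata management is therefore the core of data governance.

      The real value of data lies in data-driven decision making, that is, using data to guide operations. A data-driven approach helps us identify trends and uncover problems, which in turn drives innovation or new solutions. As enterprise data grows explosively, its volume becomes harder and harder to gauge, and it becomes difficult to say exactly what data we have, where it comes from, where it goes, how it has changed, and how it should be used. Metadata management (data governance) has therefore become an indispensable part of an enterprise data lake.

      Unfortunately, for a long time there was no mature data governance solution on the market. In 2015, Hortonworks finally ran out of patience and, together with a group of partner companies, launched an initiative to build one. The result was Atlas, which covers data classification, a centralized policy engine, data lineage, security, and lifecycle management.

      Atlas is a scalable and extensible set of core foundational governance services. It enables enterprises to meet their compliance requirements within Hadoop effectively and efficiently, and allows integration with the whole enterprise data ecosystem.

      Apache Atlas provides organizations with open metadata management and governance capabilities to build a catalog of their data assets, classify and govern these assets, and provide collaboration capabilities around them for data scientists, data analysts, and the data governance team.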

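    Basic Architecture
    Key Concepts
    Type
    A metadata type definition. A type can describe a table, column, view, materialized view, and so on, and can be more specific, such as a Hive table (hive_table) or an HBase table (hbase_table). It can even describe a data operation: a scheduled job that syncs one table into another can also be modeled as a metadata type. Atlas ships with many built-in types, and custom types can be defined through the API.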
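
    As a rough sketch of what defining a custom type through the API could look like, the snippet below builds a type-definition payload for the Atlas v2 REST endpoint POST /api/atlas/v2/types/typedefs. The type name table_sync_job and its two attributes are hypothetical, chosen to mirror the scheduled-sync example above. The base URL and the admin / admin credentials assume the local Docker setup described later in this post.

```python
import base64
import json
import urllib.request

ATLAS_URL = "http://localhost:21000"  # assumption: local Docker container


def build_sync_job_typedef():
    """Build a typedef payload for a hypothetical 'table_sync_job' type."""

    def attr(name, type_name):
        # A plain optional, single-valued attribute.
        return {
            "name": name,
            "typeName": type_name,
            "isOptional": True,
            "cardinality": "SINGLE",
            "isUnique": False,
            "isIndexable": False,
        }

    return {
        "entityDefs": [
            {
                "name": "table_sync_job",      # hypothetical type name
                "superTypes": ["Process"],     # built-in base type with inputs/outputs
                "attributeDefs": [
                    attr("schedule", "string"),    # hypothetical attribute
                    attr("lastRunTime", "date"),   # hypothetical attribute
                ],
            }
        ]
    }


def submit_typedef(payload, user="admin", password="admin"):
    """POST the payload to Atlas using HTTP basic auth (not called here)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{ATLAS_URL}/api/atlas/v2/types/typedefs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

    With a running container, submit_typedef(build_sync_job_typedef()) would attempt to register the type. Building the payload separately makes it easy to inspect before anything is sent.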

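    Classification
    A classification is, put simply, a tag attached to metadata. Classifications can propagate: if the view user_view is derived from the table user, and user is tagged HR, then user_view is automatically tagged HR as well. This makes data easier to track.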
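
    The propagation idea can be illustrated with a toy model. The sketch below is not how Atlas implements propagation internally; it only assumes a simple "derived from" mapping between entity names and pushes tags from each source to everything derived from it.

```python
from collections import deque


def propagate_classifications(direct_tags, derived_from):
    """Return the effective tags of every entity in a toy lineage graph.

    direct_tags:  {entity: set of tags applied directly}
    derived_from: {derived entity: list of source entities}
    """
    # Invert the mapping so we can walk from a source to what is derived from it.
    downstream = {}
    for target, sources in derived_from.items():
        for source in sources:
            downstream.setdefault(source, []).append(target)

    effective = {entity: set(tags) for entity, tags in direct_tags.items()}
    queue = deque(direct_tags)
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            child_tags = effective.setdefault(child, set())
            missing = effective[current] - child_tags
            if missing:
                child_tags |= missing
                queue.append(child)  # revisit so the new tags keep flowing
    return effective
```

    Calling propagate_classifications({"user": {"HR"}}, {"user_view": ["user"]}) gives user_view the HR tag, matching the example above.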

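    Glossary
    A glossary contains two concepts: Category and Term. A Category is a collection of Terms. A Term gives metadata an alias so that users can understand the data more easily. For example, suppose a table named pig has a column for pork kidney (猪肾), while many people are more used to the colloquial name (猪腰子). Adding a Term with the colloquial name to that column makes it both easier to understand and easier to find by search.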

    Entity
    An entity represents a concrete piece of metadata. The objects that Atlas manages are entities of the various types.

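    Lineage
    Data lineage describes how data flows between entities. Lineage shows clearly where data came from, where it goes, and which operations it passed through along the way.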
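
    A minimal sketch of reading lineage programmatically follows. It assumes the Atlas v2 lineage endpoint (GET /api/atlas/v2/lineage/{guid}) and a response containing a relations list with fromEntityId and toEntityId fields. The SAMPLE_LINEAGE data and its GUIDs are made up for illustration and do not come from a real server.

```python
def build_lineage_url(base_url, guid, direction="BOTH", depth=3):
    """Compose the lineage query URL for one entity GUID."""
    return f"{base_url}/api/atlas/v2/lineage/{guid}?direction={direction}&depth={depth}"


def trace(lineage, start_guid, direction="upstream"):
    """Walk the 'relations' edges of a lineage response from start_guid.

    Returns the GUIDs reached, in the order they were discovered.
    """
    edges = {}
    for relation in lineage.get("relations", []):
        src, dst = relation["fromEntityId"], relation["toEntityId"]
        if direction == "upstream":
            edges.setdefault(dst, []).append(src)   # follow edges backwards
        else:
            edges.setdefault(src, []).append(dst)   # follow edges forwards

    seen, order, stack = {start_guid}, [], [start_guid]
    while stack:
        node = stack.pop()
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                stack.append(neighbour)
    return order


# Hypothetical response fragment: table t1 feeds process p1, which writes t2.
SAMPLE_LINEAGE = {
    "relations": [
        {"fromEntityId": "t1", "toEntityId": "p1"},
        {"fromEntityId": "p1", "toEntityId": "t2"},
    ]
}
```

    Here trace(SAMPLE_LINEAGE, "t2") walks back through p1 to t1, which answers "where did this data come from", and trace(SAMPLE_LINEAGE, "t1", direction="downstream") answers "where does it go".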

    Basic Usage

    This Apache Atlas image is built from the 2.1.0 release source tarball and patched to run in a Docker container.

    Atlas is built with embedded HBase and Solr and comes pre-initialized, so you can use it right after downloading the image, with no additional steps.

    If you want to use external Atlas backends, set them up according to the documentation.
    For Chinese-language reference documentation, see the Apache Atlas v1.1 docs.

    1. Pull the latest release image:
    docker pull sburn/apache-atlas
    
    2. Start Apache Atlas in a container, exposing Web UI port 21000:
    docker run -d -p 21000:21000 --name atlas sburn/apache-atlas /opt/apache-atlas-2.1.0/bin/atlas_start.py
    

    Note that the first startup of Atlas may take a few minutes, depending on host machine performance, before the web interface becomes available at http://localhost:21000/

    Web-UI default credentials: admin / admin
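
    Because the first startup can take a few minutes, a script that talks to Atlas may want to wait until the web interface answers. The following is a minimal polling sketch using only the Python standard library. Treating any HTTP response as "up" is a simplifying assumption, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5):
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused or timed out: not up yet


def wait_for_atlas(url="http://localhost:21000/", probe=http_probe,
                   attempts=60, delay=5.0, sleep=time.sleep):
    """Poll until the probe succeeds; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"{url} did not respond after {attempts} attempts")
```

    The probe and sleep parameters are injectable so the loop can be exercised without a running container.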

    Usage options


    Gracefully stop Atlas:

    docker exec -ti atlas /opt/apache-atlas-2.1.0/bin/atlas_stop.py
    

    Check Atlas startup script output:

    docker logs atlas
    

    Follow the Atlas application.log interactively (useful on the first run and for debugging under load):

    docker exec -ti atlas tail -f /opt/apache-atlas-2.1.0/logs/application.log
    

    Run the example (this will add sample types and instances along with traits):

    docker exec -ti atlas /opt/apache-atlas-2.1.0/bin/quick_start.py
    

    Start Atlas, overriding settings with environment variables
    (for example, to support a large number of metadata objects):

    docker run --detach \
        -e "ATLAS_SERVER_OPTS=-server -XX:SoftRefLRUPolicyMSPerMB=0 \
        -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution \
        -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof \
        -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation \
        -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails \
        -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps" \
        -p 21000:21000 \
        --name atlas \
        sburn/apache-atlas \
        /opt/apache-atlas-2.1.0/bin/atlas_start.py
    

    Start Atlas exposing logs directory on the host to view them directly:

    docker run --detach \
        -v ${PWD}/atlas-logs:/opt/apache-atlas-2.1.0/logs \
        -p 21000:21000 \
        --name atlas \
        sburn/apache-atlas \
        /opt/apache-atlas-2.1.0/bin/atlas_start.py
    

    Start Atlas exposing conf directory on the host to place and edit configuration files directly:

    docker run --detach \
        -v ${PWD}/pre-conf:/opt/apache-atlas-2.1.0/conf \
        -p 21000:21000 \
        --name atlas \
        sburn/apache-atlas \
        /opt/apache-atlas-2.1.0/bin/atlas_start.py
    

    Start Atlas with the data directory mounted on the host for persistence:

    docker run --detach \
        -v ${PWD}/data:/opt/apache-atlas-2.1.0/data \
        -p 21000:21000 \
        --name atlas \
        sburn/apache-atlas \
        /opt/apache-atlas-2.1.0/bin/atlas_start.py
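
    The variants above differ only in their -e and -v options, so they can be combined. As a sketch, the helper below assembles such a docker run command as an argument list in Python, which sidesteps shell quoting and line-continuation issues. The function name and its parameters are illustrative and not part of the image.

```python
ATLAS_HOME = "/opt/apache-atlas-2.1.0"
IMAGE = "sburn/apache-atlas"


def atlas_run_command(host_dir, mounts=None, env=None, name="atlas", port=21000):
    """Assemble a 'docker run' argv list for the Atlas image.

    mounts: {host subdirectory: directory under ATLAS_HOME}
    env:    {environment variable: value}
    """
    argv = ["docker", "run", "--detach"]
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]
    for host_sub, container_sub in (mounts or {}).items():
        argv += ["-v", f"{host_dir}/{host_sub}:{ATLAS_HOME}/{container_sub}"]
    argv += [
        "-p", f"{port}:{port}",
        "--name", name,
        IMAGE,
        f"{ATLAS_HOME}/bin/atlas_start.py",
    ]
    return argv
```

    For example, atlas_run_command("/srv/atlas", mounts={"atlas-logs": "logs", "data": "data"}) yields a list that can be passed to subprocess.run(argv, check=True), or printed with shlex.join(argv) for review first.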
    

    Tinkerpop Gremlin support

    The image contains built-in extras for those who want to explore JanusGraph and Atlas artifacts using the Apache TinkerPop Gremlin Console (gremlin CLI).

    1. You need Atlas container up and running as shown above.

    2. Install gremlin-server and gremlin-console into the container by running included automation script:

    docker exec -ti atlas /opt/gremlin/install-gremlin.sh
    
    3. Start gremlin-server in the same container:
    docker exec -d atlas /opt/gremlin/start-gremlin-server.sh
    
    4. Finally, run gremlin-console interactively:
    docker exec -ti atlas /opt/gremlin/run-gremlin-console.sh
    

    Gremlin-console usage example:

             \,,,/
             (o o)
    -----oOOo-(3)-oOOo-----
    
    gremlin> :remote connect tinkerpop.server conf/remote.yaml session
    ==>Configured localhost/127.0.0.1:8182-[d1b2d9de-da1f-471f-be14-34d8ea769ae8]
    gremlin> :remote console
    ==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[d1b2d9de-da1f-471f-be14-34d8ea769ae8] - type ':remote console' to return to local mode
    gremlin> g = graph.traversal()
    ==>graphtraversalsource[standardjanusgraph[hbase:[localhost]], standard]
    gremlin> g.V().has('__typeName','hdfs_path').count()
    

    Environment Variables

    The following environment variables are available for configuration:

    Name                       Default                             Description
    JAVA_HOME                  /usr/lib/jvm/java-8-openjdk-amd64   The Java implementation to use. If JAVA_HOME is not found, java and jar are expected to be on the PATH.
    ATLAS_OPTS                 -                                   Additional Java options, applied to both client and server operations.
    ATLAS_CLIENT_OPTS          -                                   Additional Java options for the client only.
    ATLAS_CLIENT_HEAP          -                                   Java heap size for the client. Default is 1024MB.
    ATLAS_SERVER_OPTS          -                                   Additional options for the Atlas service.
    ATLAS_SERVER_HEAP          -                                   Java heap size for the Atlas server. Default is 1024MB.
    ATLAS_HOME_DIR             -                                   The Atlas home directory. Default is the base location of the installed software.
    ATLAS_LOG_DIR              -                                   Where log files are stored. Default is the logs directory under the base install location.
    ATLAS_PID_DIR              -                                   Where PID files are stored. Default is the logs directory under the base install location.
    ATLAS_EXPANDED_WEBAPP_DIR  -                                   Where the WAR file is expanded. Default is the /server/webapp directory under the base install directory.

    Bug Tracker

    Bugs are tracked on GitHub Issues.
    In case of trouble, please check there to see if your issue has already been reported.
    If you spotted it first, help us smash it by providing detailed and welcomed feedback.

    Maintainer

  • Original article: https://www.cnblogs.com/xx2017/p/15162884.html