• Amazon Athena学习笔记


    Amazon Athena概览

    快速了解Athena 是什么?关键字:

    1. 交互式查询服务
    2. ad-hoc查询
    3. 支持标准SQL
    4. 指定S3中的数据形成表(类似hive)
    5. 快速响应(seconds级别)
    6. serverless
    7. 支持JDBC连接和Java API连接

    Amazon Athena is an interactive query service that lets you use standard SQL to analyze data directly in Amazon S3. You can point Athena at your data in Amazon S3 and run ad-hoc queries and get results in seconds. Athena is serverless, so there is no infrastructure to set up or manage. You pay only for the queries you run. Athena scales automatically—executing queries in parallel—so results are fast, even with large datasets and complex queries.

     

    If you connect to Athena using the JDBC driver, use version 1.1.0 of the driver or later with the Amazon Athena API. Earlier version drivers do not support the API. For more information and to download the driver, see Accessing Amazon Athena with JDBC.

    For code samples using the AWS SDK for Java, see Examples and Code Samples

     

    Athena数据库名,表名,字段名规范

    1. 数据库名字,表名字,列名字必须是小写

    2. 特殊字符"_"支持,其他的则不支持

    3. 如果名字以"_"开头,则需要使用``来修饰

     

    创建Athena表加载数据

    1.数据在s3,创建athena表通过location参数指定加载s3上的数据

    NOTE:这个好像必须创建外部表才行,后续验证

    CREATE EXTERNAL TABLE IF NOT EXISTS default.self_learning_old(rowkey STRING,windspd INT,directh INT,directv INT,func STRING,value INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION 's3://com.kong.bp.cn.test/test_folder/'

    2.基于已有的表,创建分区表demo

    CREATE table self_learning
    WITH (format='PARQUET',
    parquet_compression='SNAPPY',
    partitioned_by=array['year'],
    external_location = 's3://com.kong.bp.cn.test/test_folder/self_learning_old/')
    AS
    SELECT
          windspd,
          directh,
          directv,
          func,
          value,
         cast(substr(split(rowkey,':')[2],1,4) AS bigint) as year
    FROM default.self_learning_old

     

    Athena查询json数据

    关于Athena加载json数据参考文档中的:Querying JSON

    JSON样例数据:

    {
    "name": "Bob Smith",
    "org": "engineering",
    "projects": [{
    "name": "project1",
    "completed": false
    }, {
    "name": "project2",
    "completed": true
    }]
    }

    1.使用json_extract函数解析数据:

    WITH dataset AS (
    SELECT '{"name": "Susan Smith",
    "org": "engineering",
    "projects": [{"name":"project1", "completed":false},
    {"name":"project2", "completed":true}]}'
    AS blob
    )
    SELECT
    json_extract(blob, '$.name') AS name,
    json_extract(blob, '$.projects') AS projects
    FROM dataset

    返回结果:

    2.使用json_extract_scalar函数

    json_extract_scalar类似json_extract函数,但是json_extract_scalar只返回scalar values (Boolean, number, or string)。

    NOTE:此函数不适用于arrays, maps, or structs,这里的"scalar"我理解为对应的数据类型

    比如使用json_extract_scalar解析出对应的数据:

    WITH dataset AS (
    SELECT '{"name": "Susan Smith",
    "org": "engineering",
    "projects": [{"name":"project1", "completed":false},{"name":"project2",
    "completed":true}]}'
    AS blob
    )
    SELECT
    json_extract_scalar(blob, '$.name') AS name,
    json_extract_scalar(blob, '$.projects') AS projects
    FROM dataset

    查询的结果:

    +---------------------------+
    | name       | projects   |
    +---------------------------+
    | Susan Smith |             |
    +---------------------------+

    因为json中的projects是一个数组类型,所以这里使用json_extract_scalar无法识别

    3.使用json_array_get函数

    对于这种数组类型,可以使用json_array_get函数,比如:

    WITH dataset AS (
    SELECT '{"name": "Bob Smith",
    "org": "engineering",
    "projects": [{"name":"project1", "completed":false},{"name":"project2",
    "completed":true}]}'
    AS blob
    )
    SELECT json_array_get(json_extract(blob, '$.projects'), 0) AS item
    FROM dataset

    先使用json_extract函数获得projects项数据,得到的是一个数组类型,再使用json_array_get函数按下标(index)来获取。返回的结果:

    +---------------------------------------+
    | item                                 |
    +---------------------------------------+
    | {"name":"project1","completed":false} |
    +---------------------------------------+

     

  • 相关阅读:
    [c language] getopt
    编程经典问题
    一些常用的正则表达式
    [Head First Python]6. summary
    Java多线程
    JVM运行原理
    Struts2---自定义拦截器
    SpringMVC框架初步
    测试基本问题
    自动化测试
  • 原文地址:https://www.cnblogs.com/zz-ksw/p/13339718.html
Copyright © 2020-2023  润新知