• Simple Example Use Cases -- translated from the Hive GettingStarted guide


    1. MovieLens User Ratings

    First, create a table with tab-delimited text file format:


    CREATE TABLE u_data (

      userid INT,

      movieid INT,

      rating INT,

      unixtime STRING)

    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

    STORED AS TEXTFILE;

    Then, download the data files from MovieLens 100k on the GroupLens datasets page (which also has a README.txt file and index of unzipped files):


    wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

    or:

    curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

    Note:  If the link to GroupLens datasets does not work, please report it on HIVE-5341 or send a message to the user@hive.apache.org mailing list.

    Unzip the data files:


    unzip ml-100k.zip

    And load u.data into the table that was just created:


    LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;

    Count the number of rows in table u_data:


    SELECT COUNT(*) FROM u_data;

    Note that for older versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

    Now we can do some complex data analysis on the table u_data:


    Create weekday_mapper.py:

    import sys

    import datetime

    for line in sys.stdin:

      line = line.strip()

      userid, movieid, rating, unixtime = line.split('\t')

      weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()

      print('\t'.join([userid, movieid, rating, str(weekday)]))
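The per-line transformation in weekday_mapper.py can be tried outside Hive. The sketch below wraps the same logic in a function and feeds it one illustrative tab-delimited row; the row values are made up, and the printed weekday depends on the local timezone:

```python
import datetime

def map_line(line):
    # Same transformation as weekday_mapper.py: replace the unix
    # timestamp column with the ISO weekday (1 = Monday ... 7 = Sunday).
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

# An illustrative u.data-style row: userid, movieid, rating, unixtime.
print(map_line('196\t242\t3\t881250949'))
```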

    Use the mapper script:


    CREATE TABLE u_data_new (

      userid INT,

      movieid INT,

      rating INT,

      weekday INT)

    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    add FILE weekday_mapper.py;

    INSERT OVERWRITE TABLE u_data_new

    SELECT

      TRANSFORM (userid, movieid, rating, unixtime)

      USING 'python weekday_mapper.py'

      AS (userid, movieid, rating, weekday)

    FROM u_data;

    Explanation: here the Python script cleans the data in table u_data. In the TRANSFORM clause:

    TRANSFORM (userid, movieid, rating, unixtime)   -- input columns

      USING 'python weekday_mapper.py'              -- processing script

      AS (userid, movieid, rating, weekday)         -- output columns

    SELECT weekday, COUNT(*)

    FROM u_data_new

    GROUP BY weekday;

    Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).
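The GROUP BY above is just a frequency count over the weekday column. A minimal Python sketch of the same aggregation, using made-up rows rather than real MovieLens data:

```python
from collections import Counter

# Illustrative rows of u_data_new as (userid, movieid, rating, weekday);
# the values are invented for this example.
rows = [
    (196, 242, 3, 4),
    (186, 302, 3, 4),
    (22, 377, 1, 6),
    (244, 51, 2, 6),
    (166, 346, 1, 7),
]

# SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
counts = Counter(weekday for _, _, _, weekday in rows)
for weekday in sorted(counts):
    print(weekday, counts[weekday])
```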

    2. Apache Weblog Data

    The Apache weblog format is customizable, though most webmasters use the default.
    For the default Apache weblog format, we can create a table with the following command.

    More about RegexSerDe can be found here in HIVE-662 and HIVE-1719.

    CREATE TABLE apachelog (

    host STRING,

    identity STRING,

    user STRING,

    time STRING,

    request STRING,

    status STRING,

    size STRING,

    referer STRING,

    agent STRING)

    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

    WITH SERDEPROPERTIES (

    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"

    )

    STORED AS TEXTFILE;
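The input.regex value is a Java regex, so the backslashes are doubled inside the SQL string; written as a Python raw string it can be sanity-checked against a fabricated combined-format log line. RegexSerDe requires the whole line to match (Java's Matcher.matches()), which re.fullmatch mirrors:

```python
import re

# Python form of the SerDe pattern (single backslashes instead of the
# doubled ones needed inside the SQL string literal).
APACHE_RE = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?')

# A fabricated line in the default combined log format.
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08 (Win98)"')

# Groups 1-9 map to the table columns host .. agent.
m = re.fullmatch(APACHE_RE, line)
print(m.group(1))  # host
print(m.group(6))  # status
print(m.group(7))  # size
```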

  • Original article: https://www.cnblogs.com/wcwen1990/p/7082922.html