• Data Lake Iceberg in Practice, Lesson 2: Iceberg's Underlying Data Format on Hadoop


     

    Preface

    How does Iceberg manage data at the storage layer, and what changes does each data modification cause in the underlying file structure?
    This article answers that question by direct observation:
    1. Create a catalog and snapshot the HDFS directory
    2. Create a table and snapshot the HDFS directory
    3. Insert one row, snapshot HDFS, and observe the metadata and data changes
    4. Insert another row, snapshot HDFS, and observe the metadata and data changes
    5. Draw conclusions
    6. Next step: observe the Hive catalog



    1. Creating the Hadoop catalog

    Create the catalog with the script below; the warehouse path is created automatically.
    In the HDFS path, ns is the HA nameservice name; a namenode ip:port pair can be used instead.
    Start the Flink SQL client and run the script there:

    sql-client.sh embedded -j /opt/software/iceberg-flink-runtime-0.11.1.jar shell

    1.1 Run the CREATE CATALOG script

    CREATE CATALOG hadoop_catalog2 WITH (
      'type'='iceberg',
      'catalog-type'='hadoop',
      'warehouse'='hdfs://ns/user/hive/warehouse/iceberg_hadoop_catalog2',
      'property-version'='1'
    );

    1.2 Inspect the HDFS directory

    Result: the warehouse directory and the default database directory were created: /user/hive/warehouse/iceberg_hadoop_catalog2/default

    2. Creating a table

    2.1 Run the CREATE TABLE statement

    The DDL:

    CREATE TABLE `hadoop_catalog2`.`default`.`sample` (
        id BIGINT COMMENT 'unique id',
        data STRING
    );

    Execute it:

    Flink SQL> CREATE TABLE `hadoop_catalog2`.`default`.`sample` (
    >     id BIGINT COMMENT 'unique id',
    >     data STRING
    > );
    [INFO] Table has been created.

    2.2 Directory structure after table creation

    List it from the command line:

    [root@hadoop101 software]# hadoop fs -ls -R /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/
    drwxr-xr-x   - root supergroup          0 2022-01-13 14:29 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata
    -rw-r--r--   2 root supergroup        826 2022-01-13 14:29 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/v1.metadata.json
    -rw-r--r--   2 root supergroup          1 2022-01-13 14:29 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/version-hint.text

    The same directory and files are also visible in the NameNode web UI (screenshots omitted).

    Now look at the contents of these two files:

    [root@hadoop101 software]# hadoop fs -cat /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/v1.metadata.json
    {
      "format-version" : 1,
      "table-uuid" : "956e7d6f-7184-4147-990e-90c923c43c2f",
      "location" : "hdfs://ns/user/hive/warehouse/iceberg_hadoop_catalog2/default/sample",
      "last-updated-ms" : 1642055374103,
      "last-column-id" : 2,
      "schema" : {
        "type" : "struct",
        "fields" : [ {
          "id" : 1,
          "name" : "id",
          "required" : false,
          "type" : "long"
        }, {
          "id" : 2,
          "name" : "data",
          "required" : false,
          "type" : "string"
        } ]
      },
      "partition-spec" : [ ],
      "default-spec-id" : 0,
      "partition-specs" : [ {
        "spec-id" : 0,
        "fields" : [ ]
      } ],
      "default-sort-order-id" : 0,
      "sort-orders" : [ {
        "order-id" : 0,
        "fields" : [ ]
      } ],
      "properties" : { },
      "current-snapshot-id" : -1,
      "snapshots" : [ ],
      "snapshot-log" : [ ],
      "metadata-log" : [ ]
    }[root@hadoop101 software]# hadoop fs -cat /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/version-hint.text
    1[root@hadoop101 software]#
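
    Two details in v1.metadata.json are worth calling out: current-snapshot-id is -1 (the table exists but no commit has written data yet), and every column carries a stable field id. A minimal Python sketch over the JSON above, trimmed to the fields being inspected and inlined here for illustration (in practice you would fetch the file with hadoop fs -cat first):

```python
import json

# v1.metadata.json from above, trimmed to the fields we inspect.
metadata = json.loads("""
{
  "format-version": 1,
  "table-uuid": "956e7d6f-7184-4147-990e-90c923c43c2f",
  "current-snapshot-id": -1,
  "snapshots": [],
  "schema": {
    "type": "struct",
    "fields": [
      {"id": 1, "name": "id", "required": false, "type": "long"},
      {"id": 2, "name": "data", "required": false, "type": "string"}
    ]
  }
}
""")

# -1 means "no current snapshot": the table has been created,
# but no commit has produced any data yet.
assert metadata["current-snapshot-id"] == -1
assert metadata["snapshots"] == []

# Each column has a stable numeric field id; Iceberg tracks columns
# by id rather than by name, which is what makes schema evolution safe.
for field in metadata["schema"]["fields"]:
    print(field["id"], field["name"], field["type"])
# → 1 id long
# → 2 data string
```

    After the first insert below, v2.metadata.json will instead show a real current-snapshot-id and one entry in the snapshots list.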

    3. Writing data

    3.1 Insert one row

    Run the INSERT statement:

    Flink SQL> INSERT INTO `hadoop_catalog2`.`default`.`sample` VALUES (1, 'a');
    [INFO] Submitting SQL update statement to the cluster...
    [INFO] Table update statement has been successfully submitted to the cluster:
    Job ID: 794a9c88a06bd2e889968ee11d213a93

    3.2 The write is executed as a Flink job

    The client log shows that a job was submitted; checking the Flink web UI confirms the job was generated and ran to completion.

    3.3 Inspect the HDFS directory again

    The listing now contains a new v2.metadata.json, a manifest and a snapshot (manifest list) file under metadata/, and a data/ directory holding a Parquet data file:

    [root@hadoop101 software]# hadoop fs -ls -R /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample
    drwxr-xr-x   - root supergroup          0 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/data
    -rw-r--r--   2 root supergroup        637 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/data/00000-0-2f98c753-78a8-434f-ba52-60aa9b9a95bf-00001.parquet
    drwxr-xr-x   - root supergroup          0 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata
    -rw-r--r--   2 root supergroup       5591 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/47517408-7822-4345-940b-a4ee956ab29f-m0.avro
    -rw-r--r--   2 root supergroup       3522 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/snap-7677920539998136227-1-47517408-7822-4345-940b-a4ee956ab29f.avro
    -rw-r--r--   2 root supergroup        826 2022-01-13 14:29 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/v1.metadata.json
    -rw-r--r--   2 root supergroup       1825 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/v2.metadata.json
    -rw-r--r--   2 root supergroup          1 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/version-hint.text
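
    The first commit produced several kinds of files, distinguishable by name: a Parquet data file under data/, a manifest (*-m0.avro) listing data files with stats, a manifest list (snap-*.avro) describing the snapshot, and a new table-metadata version. A sketch that classifies the paths above by these naming conventions (the patterns are inferred from this listing and the standard Iceberg layout, not taken from an official API):

```python
import re

# Paths from the listing above (directory prefix trimmed for brevity).
paths = [
    "data/00000-0-2f98c753-78a8-434f-ba52-60aa9b9a95bf-00001.parquet",
    "metadata/47517408-7822-4345-940b-a4ee956ab29f-m0.avro",
    "metadata/snap-7677920539998136227-1-47517408-7822-4345-940b-a4ee956ab29f.avro",
    "metadata/v1.metadata.json",
    "metadata/v2.metadata.json",
    "metadata/version-hint.text",
]

def classify(path: str) -> str:
    """Classify an Iceberg table file by its naming convention."""
    name = path.rsplit("/", 1)[-1]
    if name == "version-hint.text":
        return "version hint (current metadata version number)"
    if re.fullmatch(r"v\d+\.metadata\.json", name):
        return "table metadata (schema, specs, snapshot list)"
    if name.startswith("snap-"):
        return "manifest list (one per snapshot/commit)"
    if re.search(r"-m\d+\.avro$", name):
        return "manifest (lists data files with stats)"
    if name.endswith(".parquet"):
        return "data file"
    return "unknown"

for p in paths:
    print(f"{p}: {classify(p)}")
```

    Reads resolve the chain in that order: version-hint.text gives the current metadata version, the metadata JSON gives the current snapshot, the snapshot's manifest list points at manifests, and manifests point at data files.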

    3.4 Insert a second row and observe

    Flink SQL> INSERT INTO `hadoop_catalog2`.`default`.`sample` VALUES (2, 'b');
    [INFO] Submitting SQL update statement to the cluster...
    [INFO] Table update statement has been successfully submitted to the cluster:
    Job ID: 4ce2ab5fc0f12ccc196a25da8c307028

    List the HDFS files again:

    [root@hadoop101 software]# hadoop fs -ls -R /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample
    drwxr-xr-x   - root supergroup          0 2022-01-13 15:03 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/data
    -rw-r--r--   2 root supergroup        637 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/data/00000-0-2f98c753-78a8-434f-ba52-60aa9b9a95bf-00001.parquet
    -rw-r--r--   2 root supergroup        636 2022-01-13 15:03 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/data/00000-0-d753dd95-bfae-4bd7-be87-d6f6cc53ae60-00001.parquet
    drwxr-xr-x   - root supergroup          0 2022-01-13 15:03 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata
    -rw-r--r--   2 root supergroup       5591 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/47517408-7822-4345-940b-a4ee956ab29f-m0.avro
    -rw-r--r--   2 root supergroup       5591 2022-01-13 15:03 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/7277ffe1-6e1e-4959-9bbd-16e583835dee-m0.avro
    -rw-r--r--   2 root supergroup       3588 2022-01-13 15:03 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/snap-4489783265506344084-1-7277ffe1-6e1e-4959-9bbd-16e583835dee.avro
    -rw-r--r--   2 root supergroup       3522 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/snap-7677920539998136227-1-47517408-7822-4345-940b-a4ee956ab29f.avro
    -rw-r--r--   2 root supergroup        826 2022-01-13 14:29 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/v1.metadata.json
    -rw-r--r--   2 root supergroup       1825 2022-01-13 14:55 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/v2.metadata.json
    -rw-r--r--   2 root supergroup       2858 2022-01-13 15:03 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/v3.metadata.json
    -rw-r--r--   2 root supergroup          1 2022-01-13 15:03 /user/hive/warehouse/iceberg_hadoop_catalog2/default/sample/metadata/version-hint.text
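
    Comparing the two listings makes the append pattern visible: the second INSERT added exactly four files (a data file, a manifest, a manifest list, and v3.metadata.json) and left every existing file untouched; only the one-byte version-hint.text was overwritten to point at the new version. A sketch of that diff using the file names from the listings above:

```python
# File names (directories omitted) after the first INSERT.
after_insert_1 = {
    "data/00000-0-2f98c753-78a8-434f-ba52-60aa9b9a95bf-00001.parquet",
    "metadata/47517408-7822-4345-940b-a4ee956ab29f-m0.avro",
    "metadata/snap-7677920539998136227-1-47517408-7822-4345-940b-a4ee956ab29f.avro",
    "metadata/v1.metadata.json",
    "metadata/v2.metadata.json",
    "metadata/version-hint.text",
}

# After the second INSERT: everything from before, plus four new files.
after_insert_2 = after_insert_1 | {
    "data/00000-0-d753dd95-bfae-4bd7-be87-d6f6cc53ae60-00001.parquet",
    "metadata/7277ffe1-6e1e-4959-9bbd-16e583835dee-m0.avro",
    "metadata/snap-4489783265506344084-1-7277ffe1-6e1e-4959-9bbd-16e583835dee.avro",
    "metadata/v3.metadata.json",
}

# Every file from the first commit is still present: a commit only
# adds files; it never rewrites or deletes earlier ones.
assert after_insert_1 <= after_insert_2

# The delta of one append commit: one data file, one manifest,
# one manifest list, one metadata version.
for name in sorted(after_insert_2 - after_insert_1):
    print(name)
```

    The only mutation is to version-hint.text, whose content changed from 2 to 3; all other files are written once and never modified.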

    3.5 Conclusion

    For appended data, each commit generates a new metadata version (vN.metadata.json) plus new snapshot files (a manifest list and a manifest); existing files are never rewritten.

    This raises a question: how would an UPDATE statement affect the file layout? Attempting to modify the row with id=1, however, shows that UPDATE statements are not supported in this setup (Flink SQL with iceberg-flink-runtime-0.11.1).

    Summary

    Walking through catalog creation, table creation, and two inserts shows how a Hadoop-catalog Iceberg table manages its files: immutable vN.metadata.json versions, manifest lists (snap-*.avro), manifests (*-mN.avro), and Parquet data files, with version-hint.text pointing at the current metadata version. The next article repeats the experiment with a Hive catalog.

  • Original article: https://www.cnblogs.com/huanghanyu/p/16152294.html