• Using Kinesis Firehose to convert Kinesis stream data to Parquet format and store it in S3


    I. Beanstalk deployment
    1. Create two S3 buckets
    sengled-bucket-reliabilityops-backuptemp
    sengled-bucket-reliabilityops-kinesis
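
The two buckets can also be created from a script instead of the console. The following is a minimal sketch, assuming boto3 and the `us-west-2` region used by the Glue script in this post; the helper names `create_buckets` and `default_s3_client` are hypothetical.

```python
BUCKETS = [
    "sengled-bucket-reliabilityops-backuptemp",
    "sengled-bucket-reliabilityops-kinesis",
]


def default_s3_client(region="us-west-2"):
    """Build a real S3 client (assumes boto3 is installed and AWS credentials are configured)."""
    import boto3
    return boto3.client("s3", region_name=region)


def create_buckets(s3_client, names, region="us-west-2"):
    """Create each bucket in the given region and return the names that were created."""
    created = []
    for name in names:
        # Outside us-east-1, S3 expects an explicit LocationConstraint.
        s3_client.create_bucket(
            Bucket=name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
        created.append(name)
    return created
```

With credentials in place, `create_buckets(default_s3_client(), BUCKETS)` would create both buckets.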

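    2. Configure Glue
    1) Create a database: only the name is required; the location can be left empty.
    2) Configure a crawler: fill in the name, the database, and the S3 path of the source data (use the path of whichever data set should define the table schema).
    3) Run the crawler; it generates the table in the Data Catalog automatically.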
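
The three console steps above can be sketched with the boto3 Glue client as follows. This is only an illustration: the function names are hypothetical, and the database name, IAM role ARN, and S3 path in the usage note are placeholders that do not come from the original setup. Only the crawler name `clean_alertlog` is taken from the script in this post.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for glue.create_crawler."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes (placeholder)
        "DatabaseName": database,    # catalog database that receives the table
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def setup_catalog(glue_client, database, crawler):
    """Create the database, register the crawler, then run it once."""
    # 1) Only the name is supplied; no location is set.
    glue_client.create_database(DatabaseInput={"Name": database})
    # 2) Register the crawler against the S3 path that defines the schema.
    glue_client.create_crawler(**crawler)
    # 3) Run it so that the table appears in the Data Catalog.
    glue_client.start_crawler(Name=crawler["Name"])
```

A call could look like `setup_catalog(boto3.client("glue", region_name="us-west-2"), "alertlog_db", crawler_request("clean_alertlog", "<role-arn>", "alertlog_db", "s3://<bucket>/<prefix>/"))`, where every value in angle brackets and the database name are assumptions.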

    3. Configure the Glue job (mainly writing the script)
    import sys
    from awsglue.transforms import *
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.utils import getResolvedOptions
    from pyspark.sql.types import TimestampType,DateType
    from awsglue.job import Job
    import boto3
    import datetime
    import time
    import logging

    logging.basicConfig()
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

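    # NOTE: despite its name, start_hour_before is used as a day offset in main();
    # end_hour_before is not used anywhere in this script.
    start_hour_before = 1
    end_hour_before = 1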

    source_bucket_name = "backuptemp"
    target_bucket_name = "kinesis"
    target_prefix_name = "alertlog"
    delimiter = "/"
    default_region = "us-west-2"
    crawler_name = "clean_alertlog"

    sns_arn = "arn:aws:sns:us-east-1:67789:logprocess"
    client = boto3.client('s3')

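    def delete_object(bucket_name, key_name):
        """Delete one source object after it has been converted."""
        try:
            client.delete_object(Bucket=bucket_name, Key=key_name)
        except Exception as e:
            logger.error("error when delete_object %s/%s: %s" % (bucket_name, key_name, str(e)))
            #email_alert("error when delete_object %s/%s" % (bucket_name, key_name))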
                    
    def email_alert(message):
        sns_client = boto3.client('sns',region_name = default_region)
        response = sns_client.publish(
            TopicArn=sns_arn,
            Message=message,
            Subject="glue %s process error" % args['JOB_NAME']
        )
        logger.error(response)

    def aggragate_files(date, key_name):
        """Convert one JSON object to Parquet under dt=<date>, then delete the source object."""
        logger.info("start aggregating %s, time is %s." % (key_name, time.strftime('%Y-%m-%d %H:%M:%S')))
        if key_name == "":
            return

        try:
            dataframe = spark.read.json("s3://%s/%s" % (source_bucket_name, key_name))
            logger.info("dataframe schema: %s" % dataframe.schema.simpleString())
            dataframe.write.parquet("s3://%s/%s/dt=%s" % (target_bucket_name, target_prefix_name, date), mode="append")
            logger.info("finish aggregating %s, time is %s." % (key_name, time.strftime('%Y-%m-%d %H:%M:%S')))
        except Exception as e:
            # On failure the source object is kept, so the next run picks it up again.
            #email_alert("error when aggregating %s/%s: %s." % (key_name, date, str(e)))
            logger.error("error when aggregating %s: %s" % (key_name, str(e)))
        else:
            # Delete the source object only after the Parquet write succeeded.
            delete_object(source_bucket_name, key_name)
                    
    def main():
        s3 = boto3.resource('s3')
        # All objects found in this run are written to yesterday's dt= partition.
        process_slot = datetime.datetime.now() - datetime.timedelta(days=start_hour_before)
        bucket = s3.Bucket(source_bucket_name)
        dt = process_slot.strftime("%Y-%m-%d")
        for obj in bucket.objects.all():
            aggragate_files(dt, obj.key)

        # Re-run the crawler so that the new partition is registered in the Data Catalog.
        glue_client = boto3.client('glue', region_name=default_region)
        glue_client.start_crawler(Name=crawler_name)

    main()

    ####commit job
    job.commit()

    4. Create a Glue trigger
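
The post does not say how the trigger is configured. Since the script writes everything into the previous day's partition, a daily schedule seems a natural fit; the sketch below assumes that, using boto3's `create_trigger`. The trigger name, job name, and cron expression are all hypothetical.

```python
def trigger_request(trigger_name, job_name, schedule="cron(0 1 * * ? *)"):
    """Build the keyword arguments for glue.create_trigger (scheduled trigger)."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": schedule,                # assumed: every day at 01:00 UTC
        "Actions": [{"JobName": job_name}],  # the Glue job to start
        "StartOnCreation": True,             # activate the trigger immediately
    }


def create_daily_trigger(glue_client, trigger_name, job_name):
    """Register the scheduled trigger and return the request that was sent."""
    request = trigger_request(trigger_name, job_name)
    glue_client.create_trigger(**request)
    return request
```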

    5. Deploy the Java version of the Kinesis stream data backup service on Beanstalk

    II. Kinesis Firehose deployment
    1. Create two S3 buckets
    sengled-bucket-reliabilityops-gluetable
    sengled-bucket-reliabilityops-gluedatabase

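    2. Configure Glue
    1) Create a database: only the name is required; the location can be left empty.
    2) Configure a crawler: fill in the name, the database, and the S3 path of the source data (sengled-bucket-reliabilityops-gluetable), and tick "Create a single schema for each S3 path".
    3) Run the crawler; it generates the table in the Data Catalog automatically.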
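
For reference, the console option "Create a single schema for each S3 path" corresponds, to my understanding, to the `CombineCompatibleSchemas` grouping policy in the crawler's `Configuration` JSON. The sketch below builds such a `create_crawler` request. The function name, crawler name, and role ARN are hypothetical, and the hourly schedule is an assumption based on the hourly crawler run mentioned in this post.

```python
import json


def single_schema_crawler_request(name, role_arn, database,
                                  bucket="sengled-bucket-reliabilityops-gluetable"):
    """Build glue.create_crawler arguments with one schema per S3 path."""
    configuration = {
        "Version": 1.0,
        # API counterpart of the "Create a single schema for each S3 path" checkbox.
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": "s3://%s/" % bucket}]},
        "Configuration": json.dumps(configuration),  # must be passed as a JSON string
        "Schedule": "cron(0 * * * ? *)",             # assumed: run once per hour
    }
```

The resulting dictionary can be passed as `glue_client.create_crawler(**request)`.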

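    3. Configure Firehose
    Name: sengled-firehose-reliabilityOps-falcon
    Source: Kinesis stream
    Lambda transformation: Disabled. The source records are already JSON, so no conversion to JSON is needed.
    Record format conversion: Enabled, with the default output format Apache Parquet
    Glue: select the region, database name, table name, and Glue table version
    Destination S3 bucket: sengled-bucket-reliabilityops-gluedatabase
    Backup S3 bucket: sengled-bucket-reliabilityops-gluetable (the crawler runs on it every hour to keep the Glue table schema up to date)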
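
The same settings can be expressed as a request for boto3's `firehose.create_delivery_stream`. The sketch below only builds the request dictionary. The Kinesis stream ARN, IAM role ARN, database and table names are placeholders, and the region default, `VersionId="LATEST"`, and buffering values are assumptions rather than values from the original setup.

```python
def s3_arn(bucket):
    """Return the ARN of an S3 bucket."""
    return "arn:aws:s3:::%s" % bucket


def firehose_request(kinesis_stream_arn, role_arn, database, table, region="us-west-2"):
    """Build create_delivery_stream arguments: Kinesis source, JSON to Parquet, S3 backup."""
    conversion = {
        "Enabled": True,
        # The Glue table supplies the schema used for the conversion.
        "SchemaConfiguration": {
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
            "Region": region,
            "VersionId": "LATEST",
        },
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    }
    return {
        "DeliveryStreamName": "sengled-firehose-reliabilityOps-falcon",
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": kinesis_stream_arn,
            "RoleARN": role_arn,
        },
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": s3_arn("sengled-bucket-reliabilityops-gluedatabase"),
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
            # Lambda transformation stays disabled: records are already JSON.
            "ProcessingConfiguration": {"Enabled": False},
            "DataFormatConversionConfiguration": conversion,
            # Raw records are also kept in the backup bucket for the crawler.
            "S3BackupMode": "Enabled",
            "S3BackupConfiguration": {
                "RoleARN": role_arn,
                "BucketARN": s3_arn("sengled-bucket-reliabilityops-gluetable"),
            },
        },
    }
```

The dictionary would then be passed as `boto3.client("firehose").create_delivery_stream(**request)`.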

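    4. Create a new Glue job that periodically deletes the backup data. The backup data is only needed here as input for the scheduled crawler runs.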
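
The cleanup script itself is not shown in the post. A minimal sketch of what such a job might do is given below, assuming a boto3 `Bucket` resource and a 24-hour retention period; both the retention period and the function names are assumptions.

```python
import datetime


def expired_keys(objects, now, max_age_hours):
    """Return the keys of all objects last modified before the cutoff."""
    cutoff = now - datetime.timedelta(hours=max_age_hours)
    return [obj.key for obj in objects if obj.last_modified < cutoff]


def purge_backup(bucket, now, max_age_hours=24):
    """Delete expired objects from the backup bucket and return their keys."""
    keys = expired_keys(bucket.objects.all(), now, max_age_hours)
    # delete_objects accepts at most 1000 keys per request, so delete in batches.
    for start in range(0, len(keys), 1000):
        batch = keys[start:start + 1000]
        bucket.delete_objects(Delete={"Objects": [{"Key": key} for key in batch]})
    return keys
```

Inside the Glue job this could be called as `purge_backup(boto3.resource("s3").Bucket("sengled-bucket-reliabilityops-gluetable"), datetime.datetime.now(datetime.timezone.utc))`. The timezone-aware `now` matters because S3 reports `last_modified` as an aware UTC timestamp.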

  • Original article: https://www.cnblogs.com/husbandmen/p/10151329.html