• AWS Data Analytics Fundamentals: official course notes


    Variety

      structured data applications include Amazon RDS, Amazon Aurora, MySQL, MariaDB, PostgreSQL, Microsoft SQL Server, and Oracle

      semistructured data stores include CSV, XML, JSON, Amazon DynamoDB, Amazon Neptune, and Amazon ElastiCache.

      OLTP workloads are write-heavy; OLAP workloads are read-heavy.

      On AWS, the row-based indexing database for OLTP and OLAP is Amazon RDS (engine options include Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and SQL Server); the column-based indexing database for OLAP is Amazon Redshift.
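
      The difference between row-based and column-based layouts can be illustrated with a minimal Python sketch. The records and field names are hypothetical, and this does not reflect how Amazon RDS or Amazon Redshift store data internally; it only shows why single-record access favors rows and single-field aggregation favors columns.

```python
# Hypothetical order records; the field names are illustrative only.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "bo", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]


def to_columns(records):
    """Pivot row-oriented records into a column-oriented layout."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# OLTP-style access: fetch one complete record.
# A row layout keeps all fields of that record together.
order_2 = next(r for r in rows if r["order_id"] == 2)

# OLAP-style access: aggregate a single field across every record.
# A column layout lets the scan touch only the "amount" values.
total_amount = sum(columns["amount"])

print(order_2)
print(total_amount)
```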

      Semi-structured and unstructured data

      NoSQL DB - Amazon DynamoDB (Key-value and document store DB)

      Graph DB - Amazon Neptune

    Veracity

    • Understanding data integrity
    • Understanding database consistency
    • Introduction to the ETL process

    data integrity

    Curation is the action or process of selecting, organizing, and looking after the items in a collection.
    Data integrity is the maintenance and assurance of the accuracy and consistency of data over its entire lifecycle.
    Data veracity is the degree to which data is accurate, precise, and trusted.

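    Data cleansing is part of ETL: when data is read, it is checked for corruption, and corrupt records are simply discarded. Beyond data cleansing, the next concept is how to enforce data integrity, starting with the database schema (the logical schema helps analysts write good queries; the information schema helps databases provide data quickly).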
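
    As a rough illustration of cleansing plus schema enforcement, the sketch below checks each incoming record against a small schema and discards anything that does not conform. The schema, field names, and sample records are assumptions made for this example, not part of the course material.

```python
# Hypothetical schema: field name -> required Python type.
SCHEMA = {"sensor_id": str, "temperature": float}


def conforms(record, schema=SCHEMA):
    """Return True if the record has exactly the schema's fields with the right types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[name], kind) for name, kind in schema.items())


def cleanse(records):
    """Split records into (kept, discarded) according to the schema."""
    kept, discarded = [], []
    for record in records:
        (kept if conforms(record) else discarded).append(record)
    return kept, discarded


raw = [
    {"sensor_id": "a1", "temperature": 21.5},
    {"sensor_id": "a2", "temperature": "n/a"},  # wrong type
    {"sensor_id": "a3"},                         # missing field
    {"sensor_id": "a4", "temperature": 19.0},
]

clean, rejected = cleanse(raw)
print(len(clean), len(rejected))
```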

     database consistency

      

    Besides data cleansing to ensure integrity and a database schema to enforce integrity, another key factor in veracity is the ability to ensure compliance with the consistency and availability of data within a database. There are a few different methods for this. We are going to discuss two: ACID and BASE.

    ACID requirements for transactions are the data-consistency standard that many relational databases follow; Amazon RDS, for example, is ACID compliant. For NoSQL databases this level of consistency costs too much time, so they generally follow BASE instead; Amazon DynamoDB, for example, needs to respond quickly. Under BASE, a change made on one node does not have to be synchronized to the other nodes immediately. Note: In November 2018, Amazon introduced Amazon DynamoDB transactions. This feature implements ACID compliance across one or more tables within a single AWS account and region.
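
    The contrast can be sketched in plain Python. The first half uses the standard-library sqlite3 module as a stand-in for an ACID-compliant relational database (it is not Amazon RDS); the second half is a toy two-replica model of BASE-style eventual consistency (it is not DynamoDB). Table names, keys, and the replication queue are all hypothetical.

```python
import sqlite3

# --- ACID: a transfer either applies completely or not at all. ---
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction; roll back everything on failure."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rejected
balances = dict(conn.execute("SELECT name, balance FROM accounts"))

# --- BASE: a write lands on one node first and reaches others later. ---
primary, secondary = {}, {}
pending = []  # replication queue not yet applied to the secondary


def write(key, value):
    """Accept the write on the primary and queue it for replication."""
    primary[key] = value
    pending.append((key, value))


def replicate():
    """Apply all queued writes to the secondary."""
    while pending:
        key, value = pending.pop(0)
        secondary[key] = value


write("cart", ["book"])
stale = secondary.get("cart")  # read before replication sees no value yet
replicate()
fresh = secondary.get("cart")  # read after replication sees the write
```

    In the ACID half, the credit to bob is undone when the debit from alice violates the CHECK constraint, so no partial transfer is ever visible. In the BASE half, a reader of the secondary briefly sees old state, which is the trade-off accepted in exchange for fast responses.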

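     intro to ETL process

    AWS offers two ETL services, Amazon EMR and AWS Glue. Both target batch data; for streaming data, use Kinesis. EMR and Glue are functionally similar: EMR is more customizable and therefore demands stronger skills, while Glue is largely hands-off. Glue also ships with its own metastore, the AWS Glue Data Catalog, which is a replacement for the Hive metastore.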
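
    To make the three ETL stages concrete, here is a minimal batch sketch using only the Python standard library. It does not use Amazon EMR or AWS Glue; the CSV content, the column names, and the in-memory SQLite target are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical batch input; a real job would read files from storage.
RAW_CSV = """user,country,spend
ana,de,10.50
bo,US,3.25
cy,us,
"""


def extract(text):
    """Extract: parse the CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop rows without a spend value and normalize the rest."""
    out = []
    for r in records:
        if not r["spend"]:
            continue  # incomplete row, discarded
        out.append((r["user"], r["country"].upper(), float(r["spend"])))
    return out


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spend (user TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO spend VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = dict(conn.execute("SELECT country, SUM(amount) FROM spend GROUP BY country"))
print(totals)
```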

    Value

    Data analytics falls into two categories: information analytics and operational analytics.

    Information analytics is the process of analyzing information to find the value contained within it. This term is often synonymous with data analytics. There are five types of analysis: descriptive, diagnostic, predictive, prescriptive, and cognitive.

    The other category, operational analytics, is a subform of information analytics. This form of analytics is used specifically to retrieve, analyze, and report on data for IT operations.

    The five types of analysis:
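
    Two of these types, descriptive and predictive, can be contrasted with a small sketch. The monthly sales figures are invented for illustration, and the least-squares line is just one simple way to produce a forecast, not a method prescribed by the course.

```python
from statistics import mean

# Hypothetical monthly sales figures (units sold in months 1 to 5).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive analytics: summarize what happened.
average_sales = mean(sales)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


# Predictive analytics: estimate what is likely to happen next.
slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept

print(average_sales, forecast_month_6)
```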

    Within AWS, the Amazon Elasticsearch Service is commonly used to implement operational analytics

    An example of predictive analytics:

     

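     Examples of cognitive analytics include automated investing in finance and intelligent treatment recommendations in healthcare.

    Relative speed of the various AWS services

    The three choices for stream processing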

     

    topic 2 Introduction to visualizing data

    Reports come in three forms: static reports, interactive reports, and dashboards.

    Amazon QuickSight is the AWS service for visualization.

    With Amazon QuickSight, you can upload CSV and Excel files; connect to software as a service (SaaS) applications, such as Salesforce; access on-premises databases such as SQL Server, MySQL, and PostgreSQL; and seamlessly utilize your AWS data sources, such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3

     

     

  • Original article: https://www.cnblogs.com/mashuai-191/p/13727292.html