• 0001. Big Data Course Overview and Background Knowledge



    02-01-What Is Big Data

    Example applications of big data:
    1. E-commerce recommendation systems
    Storage: how to store a massive number of orders
    Computation: how to process a massive number of orders
    2. Weather forecasting
    Storage: how to store massive amounts of weather data
    Computation: how to process massive amounts of weather data
    Core problems:
    1. Storage: a distributed file system: HDFS (Hadoop Distributed File System)
    2. Computation: not a matter of a single algorithm, but of distributed computing: MapReduce and Spark (RDD: Resilient Distributed Dataset)


    02-02-Data Warehouses and Big Data

    A data warehouse is essentially a database (Oracle, MySQL, MS SQL Server) that is generally used only for SELECT queries.

    搭建数据仓库的过程.png (figure: the process of building a data warehouse)

    A data warehouse can be built on traditional databases such as Oracle or MySQL, or on Hadoop or Spark.


    02-03-OLTP and OLAP

    1. OLTP: Online Transaction Processing. Handles insert/update/delete operations, i.e. transactions; this is the problem traditional relational databases solve.
    2. OLAP: Online Analytical Processing. Generally performs only SELECT queries (analysis).
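The distinction can be sketched with Python's built-in sqlite3 module; the orders table and the values below are hypothetical, chosen only to illustrate the two workloads:

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")

# OLTP: short transactions that modify individual rows (insert/update/delete).
with conn:
    conn.execute("INSERT INTO orders (product, amount) VALUES (?, ?)", ("phone", 699.0))
    conn.execute("INSERT INTO orders (product, amount) VALUES (?, ?)", ("laptop", 1299.0))
    conn.execute("UPDATE orders SET amount = 649.0 WHERE product = 'phone'")

# OLAP: read-only analytical queries that scan and aggregate many rows.
total, count = conn.execute("SELECT SUM(amount), COUNT(*) FROM orders").fetchone()
print(total, count)  # 1948.0 2
```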


    02-04-The Basic Idea of a Distributed File System

    • GFS: Google File System ---- HDFS: Hadoop Distributed File System

      1. A distributed file system
      2. Solves the big data storage problem
      3. In HDFS, the location of each piece of data (its metadata) is recorded -----> using an inverted index
        • What is an index?
          (1) create index creates an index
          (2) An index is essentially a table of contents
          (3) Through the index you can find the corresponding data
          (4) Question: does an index always speed up a query?
        • What is an inverted index?
      4. Demo: using a pseudo-distributed environment
    • MapReduce: a distributed computing model; the original motivating problem was PageRank (ranking web pages)

    • BigTable: the wide-table model ------ NoSQL database: HBase

    分布式文件系统的基本思想.png (figure: the basic idea of a distributed file system)


    02-05-What Is Rack Awareness

    机架感知的基本思想.png (figure: the basic idea of rack awareness)
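As commonly described, HDFS's default placement for three replicas is: the first on the writer's own node, the second on a node in a different rack, and the third on another node in the same remote rack, trading write cost against rack-failure tolerance. A minimal sketch, assuming a hypothetical two-rack cluster map:

```python
import random

# Hypothetical cluster map: rack -> nodes.
cluster = {
    "rack1": ["node11", "node12", "node13"],
    "rack2": ["node21", "node22", "node23"],
}
rack_of = {n: r for r, nodes in cluster.items() for n in nodes}

def place_replicas(writer):
    """Sketch of HDFS's default 3-replica placement policy."""
    first = writer                                  # replica 1: the writer's local node
    remote_rack = random.choice(
        [r for r in cluster if r != rack_of[writer]])
    second = random.choice(cluster[remote_rack])    # replica 2: a node in a different rack
    third = random.choice(                          # replica 3: same rack as replica 2
        [n for n in cluster[remote_rack] if n != second])
    return [first, second, third]

print(place_replicas("node11"))
```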


    02-06-What Is an Inverted Index

    什么是索引.png (figure: what is an index)

    什么是倒排索引.png (figure: what is an inverted index)
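A forward index maps a document to the words it contains; an inverted index maps each word back to the documents containing it, which is how content can be located by keyword. A minimal Python sketch, using the three lines of /input/data.txt from the WordCount demo as the corpus:

```python
# Corpus: the three lines of /input/data.txt used in the WordCount demo.
docs = {
    1: "I love Beijing",
    2: "I love China",
    3: "Beijing is the capital of China",
}

# Inverted index: word -> set of document ids that contain it.
inverted = {}
for doc_id, text in docs.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)

print(sorted(inverted["Beijing"]))  # [1, 3]
print(sorted(inverted["China"]))    # [2, 3]
```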


    02-07-HDFS Architecture and Demo


    02-08-What Is PageRank

    Google的向量矩阵.png (figure: Google's vector-matrix model)
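The idea behind PageRank: a page's importance is the sum of the importance passed to it by the pages linking to it, computed by iterating a vector-matrix multiplication until the rank vector converges. A minimal power-iteration sketch over a hypothetical 4-page link graph:

```python
# PageRank by power iteration on a tiny hypothetical 4-page web graph.
# links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = len(links)
d = 0.85                      # damping factor
rank = [1.0 / n] * n          # start from a uniform distribution

for _ in range(50):
    new = [(1 - d) / n] * n
    for page, outs in links.items():
        share = d * rank[page] / len(outs)
        for target in outs:
            new[target] += share   # each page passes its rank along its links
    rank = new

print([round(r, 3) for r in rank])
```

Page 3 has no incoming links, so it ends up with the minimum possible rank; page 2, which is linked from three pages, ranks highest.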


    02-09-The MapReduce Programming Model

    MapReduce的编程模型.png (figure: the MapReduce programming model)
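The model in one sentence: map turns each input record into (key, value) pairs, the framework shuffles the pairs so that equal keys are grouped together, and reduce aggregates each group. A minimal in-process Python sketch of WordCount over the lines of /input/data.txt from the demo:

```python
from itertools import groupby
from operator import itemgetter

# Input: the lines of /input/data.txt from the demo.
lines = ["I love Beijing", "I love China", "Beijing is the capital of China"]

# Map phase: each line becomes a list of (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group pairs by key (here simulated by sorting).
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the values for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)
```

The result matches the part-r-00000 output of the WordCount job run in the demo below.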


    02-10-Demo: Word Count (WordCount)

    [root@demo11 ~]# start-yarn.sh
    starting yarn daemons
    starting resourcemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-resourcemanager-demo11.out
    localhost: starting nodemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-nodemanager-demo11.out
    
    [root@demo11 ~]# jps
    16164 ResourceManager
    16596 Jps
    15976 SecondaryNameNode
    15772 DataNode
    15661 NameNode
    16271 NodeManager
    
    [root@demo11 ~]# hdfs dfs -ls /input
    Found 3 items
    -rw-r--r--	1 root supergroup		 204	2018-08-14	11:18	/input/a.xml
    -rw-r--r--	1 root supergroup		  60	2018-08-13	23:48	/input/data.txt
    -rw-r--r--	1 root supergroup	30826876	2018-08-17	10:19	/input/sales
    
    [root@demo11 ~]# hdfs dfs -cat /input/data.txt
    I love Beijing
    I love China
    Beijing is the capital of China
    
    [root@demo11 ~]# cd training/hadoop-2.7.3/share/hadoop/mapreduce/
    [root@demo11 mapreduce]# pwd
    /root/training/hadoop-2.7.3/share/hadoop/mapreduce
    [root@demo11 mapreduce]# ls hadoop-mapreduce-examples-2.7.3.jar
    hadoop-mapreduce-examples-2.7.3.jar
    [root@demo11 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar
    
    An example program must be given as the first argument.
    Valid program names are:
    aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
    aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
    bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
    dbcount: An example job that count the pageview counts from a database.
    distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
    grep: A map/reduce program that counts the matches of a regex in the input.
    join: A job that effects a join over sorted, equally partitioned datasets.
    multifilewc: A job that counts words from several files.
    pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
    pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
    randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
    randomwriter: A map/reduce program that writes 10GB of random data per node.
    secondarysort: An example defining a secondary sort to the reduce.
    sort: A map/reduce program that sorts the data written by the random writer.
    sudoku: A sudoku solver.
    teragen: Generate data for the terasort.
    terasort: Run the terasort.
    teravalidate: Checking results of terasort.
    wordcount: A map/reduce program that counts the words in the input files.
    wordmean: A map/reduce program that counts the average length of the words in the input files.
    wordmedian: A map/reduce program that counts the median length of the words in the input files.
    wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
    
    [root@demo11 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /input/data.txt /output/day0829/wc1
    
    

    You can monitor job execution at http://192.168.157.11:8088/cluster (the Yarn web console).

    The Yarn web console

    ![](0001.大数据课程概述与大数据背景知识.assets/web console.png)

    [root@demo11 mapreduce]# hdfs dfs -ls /output/day0829/wc1
    Found 2 items
    -rw-r--r--   1 root supergroup          0 2018-08-29 20:57 /output/day0829/wc1/_SUCCESS
    -rw-r--r--   1 root supergroup         55 2018-08-29 20:57 /output/day0829/wc1/part-r-00000
    
    [root@demo11 mapreduce]# hdfs dfs -cat /output/day0829/wc1/part-r-00000
    Beijing	2
    China	2
    I		2
    capital	1
    is		1
    love	2
    of		1
    the		1
    

    02-11-BigTable (Wide Tables)

    Oracle表结构和HBase的表结构.png (figure: Oracle table structure vs. HBase table structure)
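Unlike an Oracle table with a fixed schema, an HBase table is a sparse wide table: a row key maps to column families, and each row may populate different column qualifiers. A minimal Python sketch of the data model; the row keys, families, and columns below are hypothetical:

```python
# Sketch of the HBase data model: row key -> column family -> qualifier -> value.
table = {}

def put(row, family, qualifier, value):
    """Store one cell, creating the row and family on first use."""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

# Unlike a fixed relational schema, each row can carry different columns.
put("emp001", "info", "name", "Tom")
put("emp001", "info", "dept", "sales")
put("emp002", "info", "name", "Mike")
put("emp002", "contact", "email", "mike@example.com")

print(table["emp001"]["info"]["name"])   # Tom
print("contact" in table["emp001"])      # False
```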

  • Original post: https://www.cnblogs.com/RoyalGuardsTomCat/p/13825013.html