

                Custom InputFormat: Code Implementation

                                         Author: Yin Zhengjie (尹正杰)

    Copyright notice: This is an original work. Reproduction without permission is prohibited; violations will be pursued legally.

    I. How MapReduce decides MapTask parallelism

      Before discussing how MapTask parallelism is decided, a few concepts need to be clear:
        1>. MapTask parallelism determines how many map tasks run concurrently, which in turn affects how fast the whole Job finishes;
        2>. A block (Block) is how an HDFS cluster physically divides data for storage;
        3>. A split is only a logical division of the input; nothing is actually cut into pieces on disk.
    
      How splits determine MapTask parallelism:
        1>. The parallelism of a Job's Map phase is decided by the number of splits computed by the client when the Job is submitted;
        2>. Each split is assigned to one MapTask instance;
        3>. By default, the split size equals the block size (Block size);
        4>. Splitting does not look at the data set as a whole; each input file is split on its own.
    
      Based on the above, here are a few self-posed questions and answers. My knowledge is limited, so anyone who can add to these answers is welcome to leave a comment.
        Q1: A 300MB access.log file is already stored in an HDFS cluster. If the cluster's default block size (Block size) is 128MB, how is this 300MB file actually stored?
          A: The file is physically cut into blocks of the default 128MB size, so the 300MB file is stored as 3 blocks of 128MB, 128MB and 44MB.
    
        Q2: Following on, the 300MB file is cut into 128MB, 128MB and 44MB. The first two pieces each fill a whole block, which is easy to understand, but how much space does the 44MB piece physically occupy in the HDFS cluster?
          A: Still just 44MB, of course. Because HDFS stores data in blocks, it is easy to assume that storing 44MB consumes an entire default 128MB block. That is not the case: data can still be appended into the block holding those 44MB, since it still has plenty of room. It is like a 700ml glass into which you have poured only 300ml of water; there is still 400ml of capacity left to fill.
    
        Q3: Suppose the 300MB access.log is split with a split size of 100MB versus the default 128MB. What is the parallelism in each case?
          A: Clearly, whether you cut the 300MB file at 100MB or at the default 128MB block size, it ends up as 3 splits either way, so only 3 MapTasks can run in parallel.
    
        Q4: Following on, what are the pros and cons of splitting the 300MB access.log at 100MB?
          A: The 100MB scheme produces 3 MapTasks with fairly even amounts of data, so when the cluster's network IO is idle, processing should in theory be fast. When the network is busy, however, it may not be efficient. The reason is that HDFS stores the file in 128MB blocks: with 100MB splits, the second split needs roughly 28MB (the tail of the first block) moved over the network, and the third split covers 56MB of the second block plus the 44MB third block, so running it on the third block's node costs about 56MB of network transfer; if that node is busy and the task runs elsewhere, the 44MB may need to travel as well, so up to about 100MB of network IO may be spent.
    
        Q5: Following on, what are the pros and cons of splitting the 300MB access.log at the default 128MB?
          A: The downside is obvious: the data in the 3 splits is uneven. The first two splits are full 128MB blocks and naturally run on the nodes that hold them, while the third split holds only 44MB, so its MapTask will very likely finish early and then have to wait a long time for the results from the first two nodes. The upside is that, compared with the 100MB scheme, it saves network IO.

        Q6: How should we understand the statement that "splitting does not look at the data set as a whole, but splits each file individually"?
          A: To compute anything we must provide a data source, which can be a single file or a directory. If we provide a directory, it may contain several files; say the input directory holds 3 files of 300MB, 59MB and 32MB. Whether we split at 100MB or 128MB, there will be 5 splits, because the last two files (59MB and 32MB), small as they are, each get their own split. They are never merged into one split, because they are two different files.

    II. Overview of the InputFormats shipped with Hadoop

      InputFormat handles the hand-off from the input to the Mapper; its main job is to turn the data source into K,V pairs.
    
      There are only two stages of InputFormat we need to care about: how files become splits, and how a split becomes K,V pairs.
    
      Common FileInputFormat implementations include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, FixedLengthInputFormat, SequenceFileInputFormat, and custom InputFormats.

    1>. TextInputFormat splitting mechanism

      TextInputFormat is a subclass of FileInputFormat. It reads the data one line at a time.
    
      Splitting method:
        By default it uses the splitting method of its parent class, FileInputFormat.
        FileInputFormat splits according to the following rules (a code sketch of the last rule follows the example below):
          1>. The split size is the block size (Block size);
          2>. Each file is split separately;
          3>. Whether another split is cut is judged against 1.1 times the split size;
    
      K,V method:
        It uses LineRecordReader to turn the split's data into K,V pairs.
        The key is the starting byte offset of the line within the whole file, of type LongWritable. The value is the content of that line, excluding any line terminators (newline and carriage return), of type Text.
    
      For example:
        Suppose the following is the beginning of a split:
          I have a pen
          I have an apple
          Ah apple pen
          I have a pen
          I have a pineapple
          Ah pineapple pen
          ......
    
        As shown above, with TextInputFormat the first 6 records are represented as:
          (0,I have a pen)
          (13,I have an apple)
          (29,Ah apple pen)
          (42,I have a pen)
          (55,I have a pineapple)
          (74,Ah pineapple pen)
          ......
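
        To make the 1.1x rule concrete, here is a minimal, self-contained sketch (a hypothetical helper class, not the actual Hadoop source) that reproduces the splitting arithmetic for the 300MB / 128MB example discussed earlier:

    package cn.org.yinzhengjie.inputformat;

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the FileInputFormat splitting rule: a new full-size split is cut only while the
    // remaining bytes exceed 1.1x the split size; whatever is left becomes the last split.
    public class SplitRuleSketch {
        private static final double SPLIT_SLOP = 1.1;   // the 1.1x factor

        static List<Long> planSplits(long fileLength, long splitSize) {
            List<Long> splits = new ArrayList<>();
            long bytesRemaining = fileLength;
            while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                splits.add(splitSize);
                bytesRemaining -= splitSize;
            }
            if (bytesRemaining != 0) {
                splits.add(bytesRemaining);   // the tail (at most 1.1x splitSize) becomes the last split
            }
            return splits;
        }

        public static void main(String[] args) {
            // 300MB with a 128MB split size -> [134217728, 134217728, 46137344], i.e. 128MB, 128MB, 44MB.
            System.out.println(planSplits(300L << 20, 128L << 20));
        }
    }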

    2>. KeyValueTextInputFormat splitting mechanism

        KeyValueTextInputFormat also extends FileInputFormat and reads the data one line at a time.
     
    
      Splitting method:
        By default it uses the splitting method of its parent class, FileInputFormat.
        FileInputFormat splits according to the following rules:
          1>. The split size is the block size (Block size);
          2>. Each file is split separately;
          3>. Whether another split is cut is judged against 1.1 times the split size;
    
      K,V method:
        It uses KeyValueLineRecordReader to turn the split's data into K,V pairs. The separator can be specified in the driver class with conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t"); (a driver snippet follows the example below).
        Each line is one record and is split by the separator into Key and Value. The default separator is the tab character ("\t"), so the key is the Text sequence before the first tab on each line.
    
      For example:
        Suppose the following is the beginning of a split:
          line1	I have a pen
          line2	I have an apple
          line3	Ah apple pen
          line4	I have a pen
          line5	I have a pineapple
          line6	Ah pineapple pen
          ......
    
        As shown above, with KeyValueTextInputFormat the first 6 records are represented as:
          (line1,I have a pen)
          (line2,I have an apple)
          (line3,Ah apple pen)
          (line4,I have a pen)
          (line5,I have a pineapple)
          (line6,Ah pineapple pen)
          ......
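
        A minimal driver fragment (a sketch; the class name is hypothetical and the Mapper/Reducer wiring is omitted) showing how KeyValueTextInputFormat and its separator are typically configured:

    package cn.org.yinzhengjie.inputformat;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    public class KeyValueDriverSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            //Split each line at the first tab (this is also the default separator).
            conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");
            Job job = Job.getInstance(conf);
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            //... set the Mapper/Reducer, output types and input/output paths as usual ...
            //The Mapper then receives Text/Text pairs such as (line1, "I have a pen").
        }
    }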

    3>. NLineInputFormat splitting mechanism

      NLineInputFormat also extends FileInputFormat.
     
      Splitting method:
        The InputSplit handled by each map task is no longer based on blocks, but on a line count N specified on NLineInputFormat. That is, splits = total number of input lines / N; if it does not divide evenly, splits = quotient + 1. (A driver snippet follows the example below.)
        
      K,V method:
        The keys and values are the same as those produced by TextInputFormat, i.e. LineRecordReader turns the split's data into K,V pairs.
    
      For example:
        Suppose the input file begins with the following lines:
          I have a pen
          I have an apple
          Ah apple pen
          I have a pen
          I have a pineapple
          Ah pineapple pen
          ......
    
        As shown above, if N is 3, each split contains 3 lines and one MapTask is started for every 3 lines.
          The first MapTask receives:
            (0,I have a pen)
            (13,I have an apple)
            (29,Ah apple pen)
          The second MapTask receives:
            (42,I have a pen)
            (55,I have a pineapple)
            (74,Ah pineapple pen)
          The Mth MapTask receives:
                    ......
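
        A minimal driver sketch (hypothetical class name; the rest of the job wiring is omitted) showing how N is set to 3 for NLineInputFormat:

    package cn.org.yinzhengjie.inputformat;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLineDriverSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration());
            //Every 3 input lines form one split, as in the example above.
            NLineInputFormat.setNumLinesPerSplit(job, 3);
            job.setInputFormatClass(NLineInputFormat.class);
            //... set the Mapper/Reducer, output types and input/output paths as usual ...
        }
    }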

    4>. CombineTextInputFormat splitting mechanism

      CombineTextInputFormat also extends FileInputFormat.
     
      Splitting method:
            As described for TextInputFormat, the default FileInputFormat splitting plans splits per file: no matter how small a file is, it becomes its own split and is handed to its own MapTask. With a large number of small files this creates a large number of MapTasks, and processing efficiency becomes extremely poor.
            CombineTextInputFormat is intended for scenarios with many small files: it can logically group several small files into one split, so that many small files are handled by a single MapTask.
            The maximum virtual-storage split size is set with "CombineTextInputFormat.setMaxInputSplitSize(job, 10485760);", where 10485760 = 1024*1024*10, i.e. 10MB (a driver snippet follows the example below).
            Split formation:
                1>. If a virtual-storage chunk is greater than or equal to the setMaxInputSplitSize value, it forms a split on its own;
                2>. Otherwise it is merged with the next virtual-storage chunk, and together they form one split.
        
      K,V method:
        It produces K,V pairs with CombineFileRecordReader, which reads data in much the same way as LineRecordReader, but it cannot simply use the default LineRecordReader because the data it handles spans multiple files.
    
        Tips:
            The maximum virtual-storage split size is best chosen according to the actual sizes of your small files.
            HDFS is not good at handling small files, so in production you should avoid creating them in the first place; if a production cluster really does have many small files, consider packing them into a har archive.
            The fundamental cure for the small-file problem in production is simply not to generate small files.
    
     For example:
        Suppose the maximum virtual-storage split size is 10485760, i.e. 10MB, and there is a batch of small files of sizes 8.7MB, 28MB, 12MB, 2.12MB and 18MB.
                    File                Size            Virtual-storage process
                access.log             8.7MB       8.7MB < 10MB, so it remains a single 8.7MB chunk.
                ftp.log                28MB        28MB > 2*10MB, so a 10MB chunk is cut off first; the remaining 18MB is between 10MB and 2*10MB and is split evenly, giving three chunks: 10MB, 9MB, 9MB.
                sumba.log              12MB        12MB > 10MB but less than 2*10MB, so it is split evenly into two 6MB chunks.
                error.log              2.12MB      2.12MB < 10MB, so it remains a single chunk.
                nfs.log                18MB        18MB > 10MB but less than 2*10MB, so it is split evenly into two 9MB chunks.
    
        As shown above, access.log, ftp.log, sumba.log, error.log and nfs.log are broken into these chunks; the split-formation step then merges the chunks in order into splits, giving 5 splits in total:
                First split:
                    8.7MB
                    10MB
    
                Second split:
                    9MB
                    9MB
    
                Third split:
                    6MB
                    6MB
    
                Fourth split:
                    2.12MB
                    9MB
    
                Fifth split:
                    9MB
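
        A minimal driver sketch (hypothetical class name; the rest of the job wiring is omitted) showing how a job is switched to CombineTextInputFormat with the 10MB maximum used above:

    package cn.org.yinzhengjie.inputformat;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineDriverSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration());
            job.setInputFormatClass(CombineTextInputFormat.class);
            //10485760 bytes = 10MB, the maximum virtual-storage split size used in the example above.
            CombineTextInputFormat.setMaxInputSplitSize(job, 10485760);
            //... set the Mapper/Reducer, output types and input/output paths as usual ...
        }
    }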

    5>. FixedLengthInputFormat splitting mechanism

      FixedLengthInputFormat also extends FileInputFormat.
     
      Splitting method:
            By default it uses the splitting method of its parent class, FileInputFormat.
        FileInputFormat splits according to the following rules:
          1>. The split size is the block size (Block size);
          2>. Each file is split separately;
          3>. Whether another split is cut is judged against 1.1 times the split size;
        
      K,V method:
        It reads a fixed number of bytes from the file at a time, using FixedLengthRecordReader.
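
      A minimal driver sketch (the class name and the 100-byte record length are illustrative assumptions) showing how the fixed record length is configured before the job is created:

    package cn.org.yinzhengjie.inputformat;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;

    public class FixedLengthDriverSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            //Every record occupies exactly 100 bytes; FixedLengthRecordReader reads one record at a time.
            FixedLengthInputFormat.setRecordLength(conf, 100);
            Job job = Job.getInstance(conf);
            job.setInputFormatClass(FixedLengthInputFormat.class);
            //... set the Mapper/Reducer, output types and input/output paths as usual ...
        }
    }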

    6>. SequenceFileInputFormat splitting mechanism

      SequenceFileInputFormat also extends FileInputFormat.
     
      Splitting method:
            By default it uses the splitting method of its parent class, FileInputFormat.
        FileInputFormat splits according to the following rules:
          1>. The split size is the block size (Block size);
          2>. Each file is split separately;
          3>. Whether another split is cut is judged against 1.1 times the split size;
        
      When to use SequenceFileInputFormat:
            SequenceFileInputFormat is normally used together with SequenceFileOutputFormat.
            Suppose there are two MapReduce jobs and job B depends on the output of job A, so A has to persist its results after it runs. We could persist them as plain text, but that takes more space on disk, and B has to parse the data all over again when reading it in, which is inefficient.
            If instead job A writes its output with SequenceFileOutputFormat and job B reads it with SequenceFileInputFormat, the two jobs connect seamlessly, and the hand-off even preserves the types, so there is no need to re-parse the data the way plain text would require.
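
      A minimal sketch (job objects and class name are hypothetical; only the hand-off lines are shown) of how two jobs chain through a SequenceFile:

    package cn.org.yinzhengjie.inputformat;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ChainedJobsSketch {
        public static void main(String[] args) throws Exception {
            Job jobA = Job.getInstance(new Configuration());
            //Job A writes its key/value output as a binary SequenceFile, types included.
            jobA.setOutputFormatClass(SequenceFileOutputFormat.class);

            Job jobB = Job.getInstance(new Configuration());
            //Job B reads that SequenceFile back without having to re-parse text.
            jobB.setInputFormatClass(SequenceFileInputFormat.class);
            //... point jobB's input path at jobA's output path; the rest of the wiring is omitted ...
        }
    }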

    7>. Custom InputFormat

      In real-world development, the InputFormat types that ship with Hadoop cannot cover every scenario, so a custom InputFormat is needed to solve the actual problem.
    
      The steps to build a custom InputFormat are:
        1>. Define a class that extends FileInputFormat;
        2>. Write a RecordReader that reads one complete file at a time and wraps it as a K,V pair;
        3>. Use SequenceFileOutputFormat on the output side to write the merged file.

    III. Custom InputFormat: code implementation

    1>. Requirement analysis

      Both HDFS and MapReduce are very inefficient at handling small files, yet large numbers of small files are often unavoidable, so a solution is needed. We can implement small-file merging with a custom InputFormat.

      The goal is to merge several small files into one SequenceFile (the file format Hadoop uses to store binary Key-Value pairs). The SequenceFile holds all the small files, with the Key being the file path plus file name and the Value being the file contents.

      Below are the small files used as input; we now need to merge them into a single file.
    This tutorial assumes you are starting fresh and have no existing Kafka or ZooKeeper data. Since Kafka console scripts are different for Unix-based and Windows platforms, on Windows platforms use binwindows instead of bin/, and change the script extension to .bat.
    A streaming platform has three key capabilities:
    
    Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
    Store streams of records in a fault-tolerant durable way.
    Process streams of records as they occur.
    Kafka is generally used for two broad classes of applications:
    
    Building real-time streaming data pipelines that reliably get data between systems or applications
    Building real-time streaming applications that transform or react to the streams of data
    To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
    
    First a few concepts:
    
    Kafka is run as a cluster on one or more servers that can span multiple datacenters.
    The Kafka cluster stores streams of records in categories called topics.
    Each record consists of a key, a value, and a timestamp.
    Kafka has four core APIs:
    
    The Producer API allows an application to publish a stream of records to one or more Kafka topics.
    The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
    The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
    The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
    
    In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older version. We provide a Java client for Kafka, but clients are available in many languages.
    Contents of kafka.txt
    The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
    Apache Hive is an open source project run by volunteers at the Apache Software Foundation. Previously it was a subproject of Apache® Hadoop®, but has now graduated to become a top-level project of its own. We encourage you to learn about the project and contribute your expertise.
    Contents of hive.txt
    The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
    This is the first stable release of Apache Hadoop 2.10 line. It contains 362 bug fixes, improvements and enhancements since 2.9.0.
    
    Users are encouraged to read the overview of major changes since 2.9.0. For details of 362 bug fixes, improvements, and other enhancements since the previous 2.9.0 release, please check release notes and changelog detail the changes since 2.9.0.
    This is the third stable release of Apache Hadoop 3.1 line. It contains 246 bug fixes, improvements and enhancements since 3.1.2.
    
    Users are encouraged to read the overview of major changes since 3.1.2. For details of the bug fixes, improvements, and other enhancements since the previous 3.1.2 release, please check release notes and changelog
    Contents of hadoop.txt

    2>. The custom RecordReader class

    package cn.org.yinzhengjie.inputformat;
    
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    
    import java.io.IOException;
    
    /**
     *  Our custom RecordReader handles exactly one file and turns that whole file into a single KV pair,
     *  i.e. it reads one complete small file in one go.
     */
    public class WholeFileRecorder extends RecordReader<Text,BytesWritable> {
    
        //notRead starts as true, meaning the file has not been read yet
        private boolean notRead = true;
        //the key to emit
        private Text key = new Text();
        //the value to emit
        private BytesWritable value = new BytesWritable();
        //the input stream for the file
        private FSDataInputStream inputStream;
        //the FileSplit being processed
        private FileSplit fs;
        /**
         *  Initialization method; the framework calls it once before reading starts.
         * @param split : the split that defines the range of records to read
         * @param context : information about this task
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
            //Our input format extends FileInputFormat, so the InputSplit can be cast to a FileSplit.
            fs = (FileSplit)split;//FileSplit is a subclass of InputSplit (plain old polymorphism).
            //Get the file path from the split
            Path path = fs.getPath();
            //Get the file system from the path
            FileSystem fileSystem = path.getFileSystem(context.getConfiguration());
            //Open an input stream; remember to release it in close()
            inputStream = fileSystem.open(path);
        }
    
        /**
         * Read the next key/value pair.
         * @return  : true if a pair was read, false once everything has been read.
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (notRead){
                //The actual file-reading logic
                key.set(fs.getPath().toString());   //the key is the file path
                byte[] buf = new byte[(int) fs.getLength()];//a byte array as long as the file
                IOUtils.readFully(inputStream, buf, 0, buf.length);//read the whole file content in one go (readFully guarantees the buffer is filled)
                value.set(buf,0,buf.length);//the value is the file content
                notRead = false;    //mark the file as read
                return true; //return true on the first call only
            }
            return false;
        }
    
        /**
         * Get the current key.
         * @return
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        public Text getCurrentKey() throws IOException, InterruptedException {
            return key;
        }
    
        /**
         * Get the current value.
         * @return: the current value
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        public BytesWritable getCurrentValue() throws IOException, InterruptedException {
            return value;
        }
    
        /**
         *  The record reader's current progress through its data.
         * @return: a number between 0.0 and 1.0
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        public float getProgress() throws IOException, InterruptedException {
            return notRead ? 0 : 1;
        }
    
        /**
         * Close the record reader.
         * @throws IOException
         */
        @Override
        public void close() throws IOException {
            //Use Hadoop's helper for closing streams
            IOUtils.closeStream(inputStream);
        }
    }

    3>. The custom WholeFileInputFormat class

    package cn.org.yinzhengjie.inputformat;
    
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    
    import java.io.IOException;
    
    /**
        Our custom InputFormat. To save ourselves work we do not write a splitting method and simply
        use FileInputFormat's default splitting.
    
        The stored form is:
            Key: file path + file name, Value: file content.
    
         Therefore the key generic type is Text and the value type is BytesWritable (which holds a chunk of binary data).
     */
    public class WholeFileInputFormat extends FileInputFormat<Text,BytesWritable> {
    
    
        /**
         *  Our input files must not be split any further, so simply return false;
         *  splitting such a file could corrupt the data.
         */
        @Override
        protected boolean isSplitable(JobContext context, Path filename) {
            return false;
        }
    
        @Override
        public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
            //Return our custom RecordReader
            return new WholeFileRecorder();
        }
    }

    4>. The custom Driver class

    package cn.org.yinzhengjie.inputformat;
    
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    
    
    import java.io.IOException;
    
    public class WholeFileDriver {
        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
            //Get a Job instance
            Job job = Job.getInstance(new Configuration());
    
            //Set the driver class so the framework can locate the jar (classpath)
            job.setJarByClass(WholeFileDriver.class);
    
            //Set the output types of the Map stage
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(BytesWritable.class);
    
            //Set the output types of the Reduce stage
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(BytesWritable.class);
    
            //Use our custom InputFormat (classpath)
            job.setInputFormatClass(WholeFileInputFormat.class);
    
            //Write the output as a SequenceFile
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
    
            //Set the input path
            FileInputFormat.setInputPaths(job,new Path(args[0]));
    
            //Set the output path
            FileOutputFormat.setOutputPath(job,new Path(args[1]));
    
            //Submit the Job; the result is a boolean
            boolean result = job.waitForCompletion(true);
    
            //If the job succeeded, print "Task executed successfully!!!"
            if(result){
                System.out.println("Task executed successfully!!!");
            }else {
                System.out.println("Task execution failed...");
            }
    
            //Exit with 0 if the job ran successfully, otherwise 1
            System.exit(result ? 0 : 1);
        }
    
    }

    5>. Viewing the output

    SEQorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable      7?€*櫘G4�蹀  +   &%file:/E:/yinzhengjie/input/hadoop.txt  The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
    This is the first stable release of Apache Hadoop 2.10 line. It contains 362 bug fixes, improvements and enhancements since 2.9.0.
    
    Users are encouraged to read the overview of major changes since 2.9.0. For details of 362 bug fixes, improvements, and other enhancements since the previous 2.9.0 release, please check release notes and changelog detail the changes since 2.9.0.
    This is the third stable release of Apache Hadoop 3.1 line. It contains 246 bug fixes, improvements and enhancements since 3.1.2.
    
    Users are encouraged to read the overview of major changes since 3.1.2. For details of the bug fixes, improvements, and other enhancements since the previous 3.1.2 release, please check release notes and changelog  X   $#file:/E:/yinzhengjie/input/hive.txt  0The Apache Hive 鈩?data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
    Apache Hive is an open source project run by volunteers at the Apache Software Foundation. Previously it was a subproject of Apache庐 Hadoop庐, but has now graduated to become a top-level project of its own. We encourage you to learn about the project and contribute your expertise.7?€*櫘G4�蹀  ?  %$file:/E:/yinzhengjie/input/kafka.txt  €This tutorial assumes you are starting fresh and have no existing Kafka or ZooKeeper data. Since Kafka console scripts are different for Unix-based and Windows platforms, on Windows platforms use binwindows instead of bin/, and change the script extension to .bat.
    A streaming platform has three key capabilities:
    
    Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
    Store streams of records in a fault-tolerant durable way.
    Process streams of records as they occur.
    Kafka is generally used for two broad classes of applications:
    
    Building real-time streaming data pipelines that reliably get data between systems or applications
    Building real-time streaming applications that transform or react to the streams of data
    To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
    
    First a few concepts:
    
    Kafka is run as a cluster on one or more servers that can span multiple datacenters.
    The Kafka cluster stores streams of records in categories called topics.
    Each record consists of a key, a value, and a timestamp.
    Kafka has four core APIs:
    
    The Producer API allows an application to publish a stream of records to one or more Kafka topics.
    The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
    The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
    The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
    
    In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older version. We provide a Java client for Kafka, but clients are available in many languages.
    Contents of the output file E:\yinzhengjie\output\part-r-00000
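
    The part-r-00000 file above is a binary SequenceFile, which is why its header and sync markers show up as garbled characters when the file is dumped as plain text. A minimal sketch (the local path matches this example; adjust it to your own output directory) for reading the merged file back and listing what it contains:

    package cn.org.yinzhengjie.inputformat;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("E:/yinzhengjie/output/part-r-00000");  //adjust to your own output path
            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
                Text key = new Text();
                BytesWritable value = new BytesWritable();
                while (reader.next(key, value)) {
                    //Print each original file name followed by the number of bytes stored for it
                    System.out.println(key + " -> " + value.getLength() + " bytes");
                }
            }
        }
    }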
