• An In-Depth Look at Flume's Core Architecture


     

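    Let's walk through the full path an event takes through Source, Channel, and Sink.

    I. Flume Architecture

    The core components in the architecture diagram are:

    Source, ChannelProcessor, Channel, and Sink. Their relationships are structured as follows: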

    Source  {
        ChannelProcessor  {
                 Channel  ch1
                 Channel  ch2
                 …
        }
    } 
    Sink  {
       Channel  ch; 
    } 
    SinkGroup {
       Channel ch;
       Sink s1;
       Sink s2;
       …
    }
    

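    II. The Components in Detail

    1. The Source Component

    Source is the general term for a data source. Once a source is configured, data is continuously pulled from it or pushed into it.

    Common sources include ExecSource, KafkaSource, HTTPSource, NetcatSource, JMSSource, AvroSource, and others.

    All sources implement one common interface, shown below: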

    @InterfaceAudience.Public
    @InterfaceStability.Stable
    public interface Source extends LifecycleAware, NamedComponent {
    
      /**
       * Specifies which channel processor will handle this source's events.
       *
       * @param channelProcessor
       */
      public void setChannelProcessor(ChannelProcessor channelProcessor);
    
      /**
       * Returns the channel processor that will handle this source's events.
       */
      public ChannelProcessor getChannelProcessor();
    
    }
    

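    There are two kinds of Source: PollableSource (polling, pull-based) and EventDrivenSource (event-driven):

    The figure above shows the class diagram of the Source inheritance hierarchy.

    From the class diagram we can see that NetcatSource, ExecSource, and HTTPSource follow the event-driven model, while KafkaSource, SequenceGeneratorSource, and JMSSource follow the polling model.

    The Source interface extends LifecycleAware, so a source's work is started and shut down in its start and stop methods.

    The following figure shows the classes and their methods:

    The Source interface defines what a concrete source ultimately has to implement. For example, when collecting logs, the actual collection work happens inside the corresponding Source implementation, such as ExecSource. So what drives these Source implementations? That is the job of the SourceRunner class. Its class hierarchy is shown below:

    Let's look at how PollableSourceRunner and EventDrivenSourceRunner are implemented: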

    //PollableSourceRunner:
    public void start() {
      PollableSource source = (PollableSource) getSource();
      ChannelProcessor cp = source.getChannelProcessor();
      cp.initialize();
      source.start();
    
      runner = new PollingRunner();
    
      runner.source = source; // The concrete Source implementation is assigned here.
      runner.counterGroup = counterGroup;
      runner.shouldStop = shouldStop;
    
      runnerThread = new Thread(runner);
      runnerThread.setName(getClass().getSimpleName() + "-" + 
          source.getClass().getSimpleName() + "-" + source.getName());
      runnerThread.start();
    
      lifecycleState = LifecycleState.START;
    }
    
    //EventDrivenSourceRunner:
    @Override
    public void start() {
      Source source = getSource();
      ChannelProcessor cp = source.getChannelProcessor();
      cp.initialize();
      source.start();
      lifecycleState = LifecycleState.START;
    }
    

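    Note: event-driven Source implementations typically maintain their own threads (or a server) internally, and calling source.start() starts them. For pollable sources, it is the PollingRunner thread created above that repeatedly calls the source.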
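
    The polling model can be sketched with a plain Java thread. This is a simplified stand-in rather than Flume's code: `MiniPollableSource`, `MiniPollingRunner`, and `PollingRunnerDemo` are hypothetical names, and a boolean return value replaces Flume's READY/BACKOFF status.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a pollable source: the runner calls process() repeatedly.
interface MiniPollableSource {
  boolean process(); // true = delivered an event, false = nothing to do (back off)
}

// Hypothetical polling loop in the spirit of the runner thread started above.
class MiniPollingRunner implements Runnable {
  private final MiniPollableSource source;
  private final AtomicBoolean shouldStop;

  MiniPollingRunner(MiniPollableSource source, AtomicBoolean shouldStop) {
    this.source = source;
    this.shouldStop = shouldStop;
  }

  @Override
  public void run() {
    while (!shouldStop.get()) {
      if (!source.process()) {
        try {
          Thread.sleep(10); // back off briefly when the source had nothing to deliver
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }
  }
}

class PollingRunnerDemo {
  // Polls a source that yields three events and then asks the runner to stop.
  static List<String> run() {
    List<String> delivered = Collections.synchronizedList(new ArrayList<>());
    AtomicBoolean shouldStop = new AtomicBoolean(false);
    MiniPollableSource source = () -> {
      if (delivered.size() < 3) {
        delivered.add("event-" + delivered.size());
        return true;
      }
      shouldStop.set(true);
      return false;
    };
    Thread runnerThread = new Thread(new MiniPollingRunner(source, shouldStop), "MiniPollingRunner");
    runnerThread.start();
    try {
      runnerThread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new ArrayList<>(delivered);
  }
}
```

    An event-driven source needs no such loop: its runner only calls start(), and the source pushes events whenever its own threads receive them.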

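    The code above keeps referring to ChannelProcessor, which also appears in the architecture overview at the top. What does it actually do? Let's break it down.

    2. The Channel Component

    A Channel connects a Source to a Sink: the Source writes log events into the Channel, and the Sink consumes them from it. The Channel is a temporary buffer that holds the events handed over by the Source until a Sink takes them.

    First, the code that wires a ChannelProcessor to a Source: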

    ChannelSelectorConfiguration selectorConfig = config.getSelectorConfiguration();
    
    ChannelSelector selector = ChannelSelectorFactory.create(sourceChannels, selectorConfig);
    
    ChannelProcessor channelProcessor = new ChannelProcessor(selector);
    Configurables.configure(channelProcessor, config);
    
    source.setChannelProcessor(channelProcessor);
    

    ChannelSelectorFactory.create is implemented as follows:

    public static ChannelSelector create(List<Channel> channels,
        ChannelSelectorConfiguration conf) {
      String type = ChannelSelectorType.REPLICATING.toString();
      if (conf != null){
        type = conf.getType();
      }
      ChannelSelector selector = getSelectorForType(type);
      selector.setChannels(channels);
      Configurables.configure(selector, conf);
      return selector;
    }
    

    Next, the ChannelSelectorType enum, which defines the available selector types:

    public enum ChannelSelectorType {
    
      /**
       * Place holder for custom channel selectors not part of this enumeration.
       */
      OTHER(null),
    
      /**
       * Replicating channel selector
       */
      REPLICATING("org.apache.flume.channel.ReplicatingChannelSelector"),
    
      /**
       * Multiplexing channel selector
       */
      MULTIPLEXING("org.apache.flume.channel.MultiplexingChannelSelector");
    }
    

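    The ChannelSelector class structure is shown below:

    Note: ReplicatingChannelSelector and MultiplexingChannelSelector are the two channel selectors. The first, the replicating selector, is the default and sends each received event to every configured channel. The second, the multiplexing selector, chooses the target channels based on a value in the event's headers.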
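
    The two selection strategies can be illustrated with a small sketch. It is not Flume's implementation: `SelectorSketch` and its method names are hypothetical, and channels are represented by plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SelectorSketch {
  // Replicating: every event goes to all configured channels.
  static List<String> replicate(List<String> channels) {
    return new ArrayList<>(channels);
  }

  // Multiplexing: the value of one header picks the channels;
  // a missing or unmapped value falls back to the default channels.
  static List<String> multiplex(Map<String, String> headers, String headerName,
      Map<String, List<String>> mapping, List<String> defaultChannels) {
    String value = headers.get(headerName);
    List<String> chosen = (value == null) ? null : mapping.get(value);
    return (chosen != null) ? chosen : defaultChannels;
  }
}
```

    For example, with a mapping of "CA" to [ch1] and a default of [ch3], an event whose "state" header is "CA" is routed to ch1, while an event with any other value goes to ch3.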

    With channel selectors covered, we can now look at what a Channel actually is. Here is the interface:

    public interface Channel extends LifecycleAware, NamedComponent {  
      public void put(Event event) throws ChannelException;  
      public Event take() throws ChannelException;  
      public Transaction getTransaction();  
    }
    

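    Note: put adds an event to the channel, take removes and returns one, and getTransaction returns the Transaction that makes these operations transactional.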
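
    To show why put and take are paired with a transaction, here is a minimal single-threaded sketch. `MiniChannel` is a hypothetical class, not a Flume channel: it folds the transaction into the channel itself, staging puts and takes until commit and undoing them on rollback.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical single-threaded channel: puts and takes are staged until commit.
class MiniChannel {
  private final Deque<String> queue = new ArrayDeque<>();
  private final List<String> pendingPuts = new ArrayList<>();
  private final Deque<String> pendingTakes = new ArrayDeque<>();

  void put(String event) {
    pendingPuts.add(event); // not visible to consumers until commit()
  }

  String take() {
    String event = queue.poll(); // null when the channel is empty
    if (event != null) {
      pendingTakes.push(event); // remember it in case of rollback
    }
    return event;
  }

  void commit() {
    queue.addAll(pendingPuts);
    pendingPuts.clear();
    pendingTakes.clear();
  }

  void rollback() {
    pendingPuts.clear();
    // pendingTakes is a stack, so popping restores the original order at the head.
    while (!pendingTakes.isEmpty()) {
      queue.addFirst(pendingTakes.pop());
    }
  }

  int size() {
    return queue.size();
  }
}
```

    With this shape, a consumer that fails halfway through a batch can roll back and find the same events waiting in the same order.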

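    The class structure is shown below:

    3. The Sink Component

    A Sink takes events out of a Channel and stores them in a file system or database, or forwards them to a remote server.

    A Sink can be configured to write to a file system, a database, or Hadoop. When log volume is low, the data can be stored in a file system and saved at a set time interval. When log volume is high, the log data can be stored in Hadoop, which makes later analysis easier.

    The Sink interface looks like this: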

    public interface Sink extends LifecycleAware, NamedComponent {  
      public void setChannel(Channel channel);  
      public Channel getChannel();  
      public Status process() throws EventDeliveryException;  
      public static enum Status {  
        READY, BACKOFF  
      }  
    }
    

    A Sink is created with the following code:

    Sink sink = sinkFactory.create(comp.getComponentName(),  comp.getType());
    

    DefaultSinkFactory.create is shown below:

    public Sink create(String name, String type) throws FlumeException {
      Preconditions.checkNotNull(name, "name");
      Preconditions.checkNotNull(type, "type");
      logger.info("Creating instance of sink: {}, type: {}", name, type);
      Class<? extends Sink> sinkClass = getClass(type);
      try {
        Sink sink = sinkClass.newInstance();
        sink.setName(name);
        return sink;
      } catch (Exception ex) {
        System.out.println(ex);
        throw new FlumeException("Unable to create sink: " + name
            + ", type: " + type + ", class: " + sinkClass.getName(), ex);
      }
    }
    

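    Note: Sinks are created by a SinkFactory, and DefaultSinkFactory is the default implementation. It looks up the type in the org.apache.flume.conf.sink.SinkType enum to find the matching Sink class, for example org.apache.flume.sink.LoggerSink. If no match is found, the type is treated as a class name and loaded directly with Class.forName(className).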
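
    The "alias first, class name as fallback" lookup can be sketched as follows. This is only an illustration: `TypeResolverSketch` and its alias table are hypothetical, and JDK collection classes stand in for Sink implementations so the example runs without Flume.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class TypeResolverSketch {
  // Hypothetical alias table standing in for the SinkType enum (alias -> class name).
  private static final Map<String, String> ALIASES = new HashMap<>();
  static {
    ALIASES.put("LIST", "java.util.ArrayList");
    ALIASES.put("MAP", "java.util.HashMap");
  }

  static Object create(String type) {
    // Known alias: use the registered class name. Otherwise treat the type as a class name.
    String className = ALIASES.getOrDefault(type.toUpperCase(Locale.ENGLISH), type);
    try {
      return Class.forName(className).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException ex) {
      throw new IllegalArgumentException("Unable to create instance of type: " + type, ex);
    }
  }
}
```

    Here create("list") resolves through the alias table, create("java.util.LinkedList") falls through to Class.forName, and an unknown name fails with an exception that carries the original cause.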

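    The Sink class structure is shown below:

    The counterpart of ChannelProcessor is SinkProcessor, which is created by the SinkProcessorFactory class. The available SinkProcessor types are defined by an enum, shown below: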

    public enum SinkProcessorType {
      /**
       * Place holder for custom sinks not part of this enumeration.
       */
      OTHER(null),
    
      /**
       * Failover processor
       *
       * @see org.apache.flume.sink.FailoverSinkProcessor
       */
      FAILOVER("org.apache.flume.sink.FailoverSinkProcessor"),
    
      /**
       * Default processor
       *
       * @see org.apache.flume.sink.DefaultSinkProcessor
       */
      DEFAULT("org.apache.flume.sink.DefaultSinkProcessor"),
    
      /**
       * Load-balancing processor
       *
       * @see org.apache.flume.sink.LoadBalancingSinkProcessor
       */
      LOAD_BALANCE("org.apache.flume.sink.LoadBalancingSinkProcessor");
    
      private final String processorClassName;
    
      private SinkProcessorType(String processorClassName) {
        this.processorClassName = processorClassName;
      }
    
      public String getSinkProcessorClassName() {
        return processorClassName;
      }
    }
    

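    The SinkProcessor class structure is shown below:

    Notes:

    1. FailoverSinkProcessor is the failover processor. It handles the case where a sink fails while processing data from the channel: the failing sink is moved to a failed queue, the next live sink takes over, and failed sinks are retried after a cool-down period. The code is as follows: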

    public Status process() throws EventDeliveryException {
      // Retry failed sinks whose cool-down period has elapsed
      Long now = System.currentTimeMillis();
      while(!failedSinks.isEmpty() && failedSinks.peek().getRefresh() < now) {
        // Take a sink from the failed queue
        FailedSink cur = failedSinks.poll(); 
        Status s;
        try {
          // Let the sink do its work, e.g. read data from the channel and
          // write it to a file; that write happens inside process().
          s = cur.getSink().process();
          if (s  == Status.READY) {
            // Success: move the sink back to the live map
            liveSinks.put(cur.getPriority(), cur.getSink());
            activeSink = liveSinks.get(liveSinks.lastKey());
            logger.debug("Sink {} was recovered from the fail list",
                    cur.getSink().getName());
          } else {
            // if it's a backoff it needn't be penalized.
            // Not READY: put the sink back into the failed queue
            failedSinks.add(cur);
          }
          return s;
        } catch (Exception e) {
          cur.incFails();
          failedSinks.add(cur);
        }
      }
    
      Status ret = null;
      while(activeSink != null) {
        try {
          ret = activeSink.process();
          return ret;
        } catch (Exception e) {
          logger.warn("Sink {} failed and has been sent to failover list",
                  activeSink.getName(), e);
          activeSink = moveActiveToDeadAndGetNext();
        }
      }
    
      // Every sink has failed, so there is nothing left to fail over to
      throw new EventDeliveryException("All sinks failed to process, " +
          "nothing left to failover to");
    }
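
    The core idea of the second loop, "use the highest-priority live sink and fall through to the next on failure", can be reduced to a few lines. `FailoverSketch` is a hypothetical class: sinks are modeled as Supplier instances that either return a result or throw, and the cool-down retry of failed sinks is left out.

```java
import java.util.TreeMap;
import java.util.function.Supplier;

class FailoverSketch {
  // priority -> sink; the highest key is the active sink.
  private final TreeMap<Integer, Supplier<String>> liveSinks = new TreeMap<>();

  void addSink(int priority, Supplier<String> sink) {
    liveSinks.put(priority, sink);
  }

  // Try the highest-priority live sink; on failure drop it and try the next one.
  String process() {
    while (!liveSinks.isEmpty()) {
      Integer top = liveSinks.lastKey();
      try {
        return liveSinks.get(top).get();
      } catch (RuntimeException e) {
        // The excerpt above moves the sink to a failed queue for a later retry;
        // this sketch simply removes it.
        liveSinks.remove(top);
      }
    }
    throw new IllegalStateException("All sinks failed");
  }

  int liveCount() {
    return liveSinks.size();
  }
}
```

    Keeping the sinks in a sorted map means that choosing the next active sink is just a matter of reading the last key again.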
    

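    2. LoadBalancingSinkProcessor is the load-balancing Sink processor

    As with ChannelProcessor, the part that deserves the most attention here is the selector, in this case SinkSelector.

    Here is part of the configure method that chooses the SinkSelector (in the Flume source this code appears in LoadBalancingSinkProcessor.configure):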

    if (selectorTypeName.equalsIgnoreCase(SELECTOR_NAME_ROUND_ROBIN)) {
      selector = new RoundRobinSinkSelector(shouldBackOff);
    } else if (selectorTypeName.equalsIgnoreCase(SELECTOR_NAME_RANDOM)) {
      selector = new RandomOrderSinkSelector(shouldBackOff);
    } else {
      try {
        @SuppressWarnings("unchecked")
        Class<? extends SinkSelector> klass = (Class<? extends SinkSelector>)
            Class.forName(selectorTypeName);
    
        selector = klass.newInstance();
      } catch (Exception ex) {
        throw new FlumeException("Unable to instantiate sink selector: "
            + selectorTypeName, ex);
      }
    }
    

    With the code above in mind, here is the class structure:

    Note: RoundRobinSinkSelector picks sinks in round-robin order, and RandomOrderSinkSelector picks them in random order.
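
    The two ordering strategies can be sketched as follows. `SelectorOrderSketch` is a hypothetical class that only illustrates the ordering idea; it does not model the back-off behavior that the shouldBackOff flag enables in Flume.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SelectorOrderSketch {
  private int next = 0;

  // Round-robin: each call starts one position later than the previous call.
  <T> List<T> roundRobinOrder(List<T> sinks) {
    List<T> order = new ArrayList<>(sinks.size());
    if (sinks.isEmpty()) {
      return order;
    }
    for (int i = 0; i < sinks.size(); i++) {
      order.add(sinks.get((next + i) % sinks.size()));
    }
    next = (next + 1) % sinks.size();
    return order;
  }

  // Random: every call returns the sinks in a freshly shuffled order.
  static <T> List<T> randomOrder(List<T> sinks, Random rng) {
    List<T> order = new ArrayList<>(sinks);
    Collections.shuffle(order, rng);
    return order;
  }
}
```

    A load-balancing processor would then try the sinks in the returned order until one of them succeeds.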

    Finally, let's take KafkaSink as an example of what a Sink's process method actually does:

    public Status process() throws EventDeliveryException {
      Status result = Status.READY;
      Channel channel = getChannel();
      Transaction transaction = null;
      Event event = null;
      String eventTopic = null;
      String eventKey = null;
    
      try {
        long processedEvents = 0;
    
        transaction = channel.getTransaction();
        transaction.begin();
    
        messageList.clear();
        for (; processedEvents < batchSize; processedEvents += 1) {
          event = channel.take();
    
          if (event == null) {
            // no events available in channel
            break;
          }
    
          byte[] eventBody = event.getBody();
          Map<String, String> headers = event.getHeaders();
    
          if ((eventTopic = headers.get(TOPIC_HDR)) == null) {
            eventTopic = topic;
          }
    
          eventKey = headers.get(KEY_HDR);
    
          if (logger.isDebugEnabled()) {
            logger.debug("{Event} " + eventTopic + " : " + eventKey + " : "
              + new String(eventBody, "UTF-8"));
            logger.debug("event #{}", processedEvents);
          }
    
          // create a message and add to buffer
          KeyedMessage<String, byte[]> data = new KeyedMessage<String, byte[]>
            (eventTopic, eventKey, eventBody);
          messageList.add(data);
    
        }
    
        // publish batch and commit.
        if (processedEvents > 0) {
          long startTime = System.nanoTime();
          producer.send(messageList);
          long endTime = System.nanoTime();
          counter.addToKafkaEventSendTimer((endTime-startTime)/(1000*1000));
          counter.addToEventDrainSuccessCount(Long.valueOf(messageList.size()));
        }
    
        transaction.commit();
    
      } catch (Exception ex) {
        String errorMsg = "Failed to publish events";
        logger.error("Failed to publish events", ex);
        result = Status.BACKOFF;
        if (transaction != null) {
          try {
            transaction.rollback();
            counter.incrementRollbackCount();
          } catch (Exception e) {
            logger.error("Transaction rollback failed", e);
            throw Throwables.propagate(e);
          }
        }
        throw new EventDeliveryException(errorMsg, ex);
      } finally {
        if (transaction != null) {
          transaction.close();
        }
      }
    
      return result;
    }
    

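    Note: the method takes up to batchSize events from the channel inside a transaction, then uses the Kafka producer to send them to Kafka as one batch. If anything fails, the transaction is rolled back and the sink reports the failure.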
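
    The take-batch, send, then commit-or-rollback pattern can be isolated from Kafka entirely. The sketch below is an assumption-laden simplification: `BatchDrainSketch` is a hypothetical class, a Deque stands in for the channel, a Consumer stands in for the producer, and a boolean return value replaces the exception that KafkaSink throws on failure.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

class BatchDrainSketch {
  // Takes up to batchSize events and hands them to sender as one batch.
  // Returns true when the batch was delivered (or there was nothing to send),
  // false when sending failed and the events were put back.
  static boolean drain(Deque<String> channel, int batchSize, Consumer<List<String>> sender) {
    List<String> batch = new ArrayList<>();
    while (batch.size() < batchSize && !channel.isEmpty()) {
      batch.add(channel.poll());
    }
    try {
      if (!batch.isEmpty()) {
        sender.accept(batch);
      }
      return true; // "commit": the taken events stay removed from the channel
    } catch (RuntimeException ex) {
      // "rollback": return the events to the head of the channel in their original order
      for (int i = batch.size() - 1; i >= 0; i--) {
        channel.addFirst(batch.get(i));
      }
      return false;
    }
  }
}
```

    Because a failed batch goes back to the head of the channel, the next call sees the same events again, which is why a downstream system may receive duplicates after a retry.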

  • Original article: https://www.cnblogs.com/hd-zg/p/5975399.html