Big Data: Data Collection

Data collection is the foundation of big data. Data scattered across many places only becomes available for downstream processing once it has been collected and brought together. Since big data technology began to develop, many data collection frameworks have appeared; this article surveys several popular data collection solutions.

Evaluating whether a technical framework fits a given business scenario usually means weighing several aspects:

- Most basic of all, interface fit: is the data coming from sockets or from log files, and where does the output need to go?
- Performance: does the framework meet the business requirements?
- Flexibility: if filtering or custom development is needed, how easy is it?
- Impact on the host system: collection must not disturb the business system itself or consume too many resources.
- Operability: some solutions have many dependencies and complex configuration, which makes them error-prone.
- Reliability: the framework must be highly reliable and must not lose data.

With these criteria, it should be possible to find a suitable framework.

1. Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.

Source interfaces

| Source Type | Comments |
| --- | --- |
| Avro Source | Listens on Avro port and receives events from external Avro client streams. |
| Thrift Source | Listens on Thrift port and receives events from external Thrift client streams. |
| Exec Source | Runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless the property logStdErr is set to true). |
| JMS Source | Reads messages from a JMS destination such as a queue or topic. |
| SSL and JMS Source | JMS client implementations typically support configuring SSL/TLS via Java system properties defined by JSSE (Java Secure Socket Extension). |
| Spooling Directory Source | Lets you ingest data by placing files to be ingested into a spooling directory on disk. |
| Taildir Source | Watches the specified files, and tails them in near real-time once new lines appended to each file are detected. |
| Kafka Source | An Apache Kafka consumer that reads messages from Kafka topics. |
| NetCat TCP Source | A netcat-like source that listens on a given port and turns each line of text into an event. |
| NetCat UDP Source | As per the original NetCat (TCP) source, this source listens on a given port, turns each line of text into an event and sends it via the connected channel. Acts like nc -u -k -l [host] [port]. |
| Sequence Generator Source | A simple sequence generator that continuously generates events with a counter that starts from 0, increments by 1 and stops at totalEvents. |
| Syslog Sources | Read syslog data and generate Flume events. |
| Syslog TCP Source | The original, tried-and-true syslog TCP source. |
| Multiport Syslog TCP Source | A newer, faster, multi-port capable version of the syslog TCP source. |
| Syslog UDP Source | Reads syslog messages over UDP; an entire datagram is treated as a single event. |
| HTTP Source | Accepts Flume events by HTTP POST and GET. |
| Stress Source | An internal load-generating source implementation which is very useful for stress tests. |
| Custom Source | A custom source is your own implementation of the Source interface. |
| Scribe Source | Scribe is another type of ingest system. To adopt an existing Scribe ingest system, Flume should use ScribeSource, which is based on Thrift with a compatible transfer protocol. |


Sink interfaces

| Sink Type | Comments |
| --- | --- |
| HDFS Sink | Writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, with compression in both file types. |
| Hive Sink | Streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions. |
| Logger Sink | Logs events at INFO level. Typically useful for testing/debugging purposes. |
| Avro Sink | Forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair. |
| Thrift Sink | Forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Thrift events and sent to the configured hostname/port pair. |
| IRC Sink | Takes messages from the attached channel and relays them to configured IRC destinations. |
| File Roll Sink | Stores events on the local filesystem. |
| Null Sink | Discards all events it receives from the channel. |
| HBaseSink | Writes data to HBase. The HBase configuration is picked up from the first hbase-site.xml encountered in the classpath. A class implementing HbaseEventSerializer, specified by the configuration, is used to convert the events into HBase puts and/or increments. |
| HBase2Sink | The equivalent of HBaseSink for HBase version 2. |
| AsyncHBaseSink | Writes data to HBase using an asynchronous model. |
| MorphlineSolrSink | Extracts data from Flume events, transforms it, and loads it in near-real-time into Apache Solr servers, which in turn serve queries to end users or search applications. |
| ElasticSearchSink | Writes data to an Elasticsearch cluster. By default, events are written so that the Kibana graphical interface can display them, just as if Logstash wrote them. |
| Kafka Sink | A Flume sink implementation that can publish data to a Kafka topic. |
| HTTP Sink | Takes events from the channel and sends them to a remote service using an HTTP POST request; the event content is sent as the POST body. |
| Custom Sink | A custom sink is your own implementation of the Sink interface. A custom sink's class and its dependencies must be included in the agent's classpath when starting the Flume agent. |

As the tables above show, Flume supports a rich set of interfaces; the most common combination in practice is a file-based log collection source paired with a Kafka sink, and a configuration sketch of that setup follows.
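This is a minimal sketch, assuming a single agent named a1; the file paths, position/checkpoint directories, broker addresses and topic are placeholder values, not anything from the original article.

```properties
# Hypothetical agent "a1": Taildir source -> file channel -> Kafka sink.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Tail every .log file under /var/log/app and remember read positions across restarts.
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
a1.sources.r1.positionFile = /var/lib/flume/taildir_position.json

# A file channel trades some throughput for durability (events survive agent restarts).
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data

# Publish each event to a Kafka topic.
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
a1.sinks.k1.kafka.topic = app-logs

# Wire the pieces together.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Swapping the channel type from file to memory trades durability for the higher throughput discussed in the next paragraph.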

Throughput: throughput depends on both machine performance and the data itself, so when sizing a deployment consider both events per second and megabytes per second. Typically, with a file channel Flume can process a few tens of megabytes per second, and with a memory channel a few hundred megabytes per second.

Flexibility: Flume is quite flexible. It exposes hooks at every stage of the data pipeline for custom development, and since Flume itself is written in Java, extending it is all the more approachable.

Resource consumption: Flume itself consumes a fair amount of resources; if that is a concern, parameter tuning can reduce its footprint considerably.

Maintainability: Flume has few dependencies; download the jars and it is ready to use. Some people find the configuration cumbersome, but that complexity is the flip side of its flexibility and is manageable in practice.

Reliability: Flume's reliability shows in several places. First, choose a suitable channel to guarantee reliable message handling; second, Flume also ships with load balancing across sinks, as sketched below.
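For reference, a hedged sketch of what that load balancing looks like in configuration: a sink group that spreads events across two downstream Avro sinks. The agent and sink names, hostnames and ports are illustrative assumptions.

```properties
# Hypothetical sink group on agent "a1": round-robin across two collector-tier agents.
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.backoff = true

# Both sinks forward events to downstream Flume agents over Avro
# (channel c1 as defined in the earlier sketch).
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector1
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = collector2
a1.sinks.k2.port = 4545
a1.sinks.k2.channel = c1
```

Setting processor.type to failover (with per-sink priorities) is the alternative when an active/standby pair is wanted instead of load balancing.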

http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html

https://blog.csdn.net/lijinqi1987/article/details/77506034

     

2. Logstash

Logstash began as the ingestion member of the ELK stack, responsible for data intake; over time it has added support for many more data sources and destinations.

Input interfaces

| Input Plugin | Description |
| --- | --- |
| azure_event_hubs | Receives events from Azure Event Hubs |
| beats | Receives events from the Elastic Beats framework |
| cloudwatch | Pulls events from the Amazon Web Services CloudWatch API |
| couchdb_changes | Streams events from CouchDB's _changes URI |
| dead_letter_queue | Reads events from Logstash's dead letter queue |
| elasticsearch | Reads query results from an Elasticsearch cluster |
| exec | Captures the output of a shell command as an event |
| file | Streams events from files |
| ganglia | Reads Ganglia packets over UDP |
| gelf | Reads GELF-format messages from Graylog2 as events |
| generator | Generates random log events for test purposes |
| github | Reads events from a GitHub webhook |
| google_cloud_storage | Extracts events from files in a Google Cloud Storage bucket |
| google_pubsub | Consumes events from a Google Cloud PubSub service |
| graphite | Reads metrics from the graphite tool |
| heartbeat | Generates heartbeat events for testing |
| http | Receives events over HTTP or HTTPS |
| http_poller | Decodes the output of an HTTP API into events |
| imap | Reads mail from an IMAP server |
| irc | Reads events from an IRC server |
| java_generator | Generates synthetic log events |
| java_stdin | Reads events from standard input |
| jdbc | Creates events from JDBC data |
| jms | Reads events from a JMS broker |
| jmx | Retrieves metrics from remote Java applications over JMX |
| kafka | Reads events from a Kafka topic |
| kinesis | Receives events through an AWS Kinesis stream |
| log4j | Reads events over a TCP socket from a Log4j SocketAppender object |
| lumberjack | Receives events using the Lumberjack protocol |
| meetup | Captures the output of command line tools as an event |
| pipe | Streams events from a long-running command pipe |
| puppet_facter | Receives facts from a Puppet server |
| rabbitmq | Pulls events from a RabbitMQ exchange |
| redis | Reads events from a Redis instance |
| relp | Receives RELP events over a TCP socket |
| rss | Captures the output of command line tools as an event |
| s3 | Streams events from files in a S3 bucket |
| s3_sns_sqs | Reads logs from AWS S3 buckets using SQS |
| salesforce | Creates events based on a Salesforce SOQL query |
| snmp | Polls network devices using Simple Network Management Protocol (SNMP) |
| snmptrap | Creates events based on SNMP trap messages |
| sqlite | Creates events based on rows in an SQLite database |
| sqs | Pulls events from an Amazon Web Services Simple Queue Service queue |
| stdin | Reads events from standard input |
| stomp | Creates events received with the STOMP protocol |
| syslog | Reads syslog messages as events |
| tcp | Reads events from a TCP socket |
| twitter | Reads events from the Twitter Streaming API |
| udp | Reads events over UDP |
| unix | Reads events over a UNIX socket |
| varnishlog | Reads from the varnish cache shared memory log |
| websocket | Reads events from a websocket |
| wmi | Creates events based on the results of a WMI query |
| xmpp | Receives events over the XMPP/Jabber protocol |

Besides the usual inputs such as file, stdin and kafka, Logstash also supports a long tail of miscellaneous inputs. That makes it a fully capable ingestion layer for ELK, but whether it is what a given scenario needs still has to be evaluated carefully.

Output interfaces

| Output Plugin | Description |
| --- | --- |
| boundary | Sends annotations to Boundary based on Logstash events |
| circonus | Sends annotations to Circonus based on Logstash events |
| cloudwatch | Aggregates and sends metric data to AWS CloudWatch |
| csv | Writes events to disk in a delimited format |
| datadog | Sends events to DataDogHQ based on Logstash events |
| datadog_metrics | Sends metrics to DataDogHQ based on Logstash events |
| elastic_app_search | Sends events to the Elastic App Search solution |
| elasticsearch | Stores logs in Elasticsearch |
| email | Sends email to a specified address when output is received |
| exec | Runs a command for a matching event |
| file | Writes events to files on disk |
| ganglia | Writes metrics to Ganglia's gmond |
| gelf | Generates GELF formatted output for Graylog2 |
| google_bigquery | Writes events to Google BigQuery |
| google_cloud_storage | Uploads log events to Google Cloud Storage |
| google_pubsub | Uploads log events to Google Cloud Pubsub |
| graphite | Writes metrics to Graphite |
| graphtastic | Sends metric data on Windows |
| http | Sends events to a generic HTTP or HTTPS endpoint |
| influxdb | Writes metrics to InfluxDB |
| irc | Writes events to IRC |
| java_sink | Discards any events received |
| java_stdout | Prints events to the STDOUT of the shell |
| juggernaut | Pushes messages to the Juggernaut websockets server |
| kafka | Writes events to a Kafka topic |
| librato | Sends metrics, annotations, and alerts to Librato based on Logstash events |
| loggly | Ships logs to Loggly |
| lumberjack | Sends events using the lumberjack protocol |
| metriccatcher | Writes metrics to MetricCatcher |
| mongodb | Writes events to MongoDB |
| nagios | Sends passive check results to Nagios |
| nagios_nsca | Sends passive check results to Nagios using the NSCA protocol |
| opentsdb | Writes metrics to OpenTSDB |
| pagerduty | Sends notifications based on preconfigured services and escalation policies |
| pipe | Pipes events to another program's standard input |
| rabbitmq | Pushes events to a RabbitMQ exchange |
| redis | Sends events to a Redis queue using the RPUSH command |
| redmine | Creates tickets using the Redmine API |
| riak | Writes events to the Riak distributed key/value store |
| riemann | Sends metrics to Riemann |
| s3 | Sends Logstash events to the Amazon Simple Storage Service |
| sns | Sends events to Amazon's Simple Notification Service |
| solr_http | Stores and indexes logs in Solr |
| sqs | Pushes events to an Amazon Web Services Simple Queue Service queue |
| statsd | Sends metrics using the statsd network daemon |
| stdout | Prints events to the standard output |
| stomp | Writes events using the STOMP protocol |
| syslog | Sends events to a syslog server |
| tcp | Writes events over a TCP socket |
| timber | Sends events to the Timber.io logging service |
| udp | Sends events over UDP |
| webhdfs | Sends Logstash events to HDFS using the webhdfs REST API |
| websocket | Publishes messages to a websocket |
| xmpp | Posts events over XMPP |
| zabbix | Sends events to a Zabbix server |

Much like the inputs, the outputs lean heavily toward Elastic's own ecosystem. Support for the Hadoop stack is fairly limited, while support for NoSQL stores such as Redis and MongoDB is comparatively richer.

     

Throughput: Logstash handles on the order of a few thousand events per second.

Flexibility: Logstash offers powerful filtering and preprocessing of data in flight (a minimal pipeline sketch follows after this list).

Resource consumption: Logstash is relatively demanding, requiring a fair amount of memory.

Operations: Logstash itself is written in JRuby, and its dependencies and configuration are comparatively complex.

Reliability: Logstash runs as a single instance, so in extreme cases data loss is possible.
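As an illustration of the filtering flexibility mentioned above, here is a minimal pipeline sketch that tails a log file, parses each line with grok, and writes to Elasticsearch. The paths, grok pattern, hosts and index name are placeholder assumptions rather than anything from the original text.

```
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"   # read existing content on first run
  }
}

filter {
  grok {
    # Assumed line format: "<ISO8601 timestamp> <level> <message>"
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://es1:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"   # daily indices
  }
}
```

Replacing the elasticsearch block with a kafka output (bootstrap_servers, topic_id) is equally common when Logstash feeds a message queue rather than the ELK stack.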

https://www.elastic.co/guide/en/logstash/current/index.html

https://doc.yonyoucloud.com/doc/logstash-best-practice-cn/index.html

     

3. FileBeat

FileBeat is another data collection framework from Elastic. Compared with Logstash its data processing capabilities are weaker, but that is precisely the point: it aims to be lightweight, so its resource consumption is very low.

Input interfaces

| Input type | Comments |
| --- | --- |
| Log | Reads lines from log files. |
| Stdin | Reads events from standard input. |
| Container | Reads container log files. |
| Kafka | Reads from topics in a Kafka cluster. |
| Redis | Reads entries from Redis slowlogs. |
| UDP | Reads events over UDP. |
| Docker | Reads logs from Docker containers. |
| TCP | Reads events over TCP. |
| Syslog | Reads events over TCP or UDP and parses BSD (RFC 3164) syslog messages and some variants. |
| s3 | Retrieves logs from S3 objects that are pointed to by messages from specific SQS queues. |
| NetFlow | Reads NetFlow and IPFIX exported flows and options records over UDP. |
| Google Pub/Sub | Reads messages from a Google Cloud Pub/Sub topic subscription. |
| Azure eventhub | Reads messages from an Azure Event Hub. |

FileBeat supports noticeably fewer input types, but the usual ones such as files, standard input and Kafka are covered, and there is some support for specific products as well.

Output interfaces

| Output type | Comments |
| --- | --- |
| Elasticsearch | Sends the transactions directly to Elasticsearch by using the Elasticsearch HTTP API. |
| Logstash | Sends events directly to Logstash by using the Lumberjack protocol, which runs over TCP. Logstash allows for additional processing and routing of generated events. |
| Kafka | Sends the events to Apache Kafka. |
| Redis | Inserts the events into a Redis list or a Redis channel. |
| File | Dumps the transactions into a file where each transaction is in JSON format. |
| Console | Writes events in JSON format to stdout. |

As the output list shows, FileBeat is relatively weak on persistence targets. Having been incubated inside the Elastic ecosystem, most of its persistence options are Elastic components, but the Kafka and Redis outputs still cover a fair share of needs; a minimal configuration sketch follows below.
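For example, a minimal filebeat.yml sketch that ships application logs to Kafka; the paths, broker addresses and topic name are placeholder assumptions.

```yaml
# Minimal Filebeat sketch: tail application logs and publish them to Kafka.
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]
  topic: "app-logs"
  required_acks: 1        # wait for the leader to acknowledge each batch
```

Only one output can be enabled at a time, so shipping to Logstash instead means replacing output.kafka with output.logstash.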

     

Throughput: FileBeat's throughput is middling; it is fine for basic needs, on the order of a few thousand events per second, but larger volumes call for special handling.

Flexibility: FileBeat offers some flexibility, but Go itself raises the bar for customization, and it falls short of Flume and Logstash in both flexibility and processing power.

Resource consumption: being written in Go, it normally uses few resources, but when event payloads are large the default configuration can run into OOM; the tuning sketch after this list shows the knobs usually involved.

Operations: FileBeat is written in Go, so it has no extra dependencies; its configuration is YAML-style, and there is quite a lot that can be configured.

Reliability: I have not seen FileBeat take explicit measures against loss during transmission itself, so in extreme cases data loss is possible.
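Regarding the memory note above, a hedged sketch of the settings usually adjusted when events are large; the numbers are illustrative assumptions, not recommendations.

```yaml
# Shrink the in-memory queue and the Kafka batch size to cap peak memory usage.
queue.mem:
  events: 2048             # maximum events buffered in memory
  flush.min_events: 512    # flush in smaller batches

output.kafka:
  hosts: ["kafka1:9092"]
  topic: "app-logs"
  bulk_max_size: 512           # fewer events per Kafka request
  max_message_bytes: 1000000   # oversized messages are dropped at the output
```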

https://www.elastic.co/guide/en/beats/filebeat/current/index.html
