02. Website Clickstream Data Analysis Project_Module Development_Data Collection

3 Module Development: Data Collection

3.1 Requirements

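    Broadly speaking, the data collection requirements fall into two parts.

    1) Capturing user visit behavior on the page. The development work involved:

        1. Develop the page tracking JS that captures user visit behavior

        2. Have the backend receive the requests sent by the page JS and record them as logs

    This part can also be regarded as the "data source"; it is usually developed by the web development team.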

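    2) Aggregating logs from the web servers into HDFS. This is the data collection of the data analysis system itself, and the team building the analysis platform is responsible for it. There are several ways to implement it:

        Shell script. Pros: lightweight and simple to develop. Cons: fault tolerance during log collection is hard to control.

        Java collection program. Pros: allows fine-grained control over the collection process. Cons: heavy development workload.

        Flume log collection framework. A mature open-source log collection system that is itself part of the Hadoop ecosystem, so it integrates naturally with the other frameworks and components in that ecosystem, and it is highly extensible.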

3.2 Setting Up the Flume Log Collection System

    1. Data source: this project analyzes the traffic log generated by the web server: /data/flumedata/access.log

    2. Sample of the data:

58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
Field breakdown:
1. Visitor IP address: 58.215.204.118
2. Visitor user information: - -
3. Request time: [18/Sep/2013:06:51:35 +0000]
4. Request method: GET
5. Requested URL: /wp-includes/js/jquery/jquery.js?ver=1.10.2
6. Request protocol: HTTP/1.1
7. Response status code: 304
8. Bytes returned: 0
9. Referrer URL: http://blog.fens.me/nodejs-socketio-chat/
10. Visitor's browser (User-Agent): Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0
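
The field breakdown above can be checked with a small parser. The following is a minimal sketch, not part of the original project: the regular expression, the group names, and the parse_line helper are all assumptions made for illustration, and they only cover lines laid out like the sample.

```python
import re
from datetime import datetime

# Hypothetical pattern for lines laid out like the sample above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of the parsed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # %b expects English month abbreviations, so this assumes a C/English locale.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = (
    '58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] '
    '"GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 '
    '"http://blog.fens.me/nodejs-socketio-chat/" '
    '"Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"'
)
record = parse_line(sample)
print(record["ip"], record["method"], record["url"], record["status"])
```

Returning None for lines that do not match lets a later cleaning step count or discard malformed records instead of failing on them.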

    3. Flume collection implementation. Configure the collection scheme:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Alternative source (disabled): read new lines with tail and sink them to HDFS
#a1.sources.r1.type = exec
#a1.sources.r1.command = tail -F /home/hadoop/log/test.log
#a1.sources.r1.channels = c1
# Describe/configure the source: collect files placed in a directory and sink them to HDFS
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/flumedata
a1.sources.r1.fileHeader = false

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /fensiweblog/events/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
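# Roll the sink file every 30 seconds (rollInterval is in seconds; 30 minutes would be 1800)
a1.sinks.k1.hdfs.rollInterval = 30
# Roll the sink file once it reaches 1024 bytes
a1.sinks.k1.hdfs.rollSize = 1024
# Roll the sink file after 10000 events
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream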

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
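
The hdfs.path setting in the configuration above contains the escapes %y, %m and %d, which the sink fills in from the event timestamp (here the agent's local time, because useLocalTimeStamp is true). As a rough illustration of which directory an event ends up in, the sketch below reuses Python's strftime, whose codes for these three escapes have the same meaning; the expand_hdfs_path helper is hypothetical and is not how Flume itself is implemented.

```python
from datetime import datetime

def expand_hdfs_path(template, event_time):
    """Expand the %y, %m and %d escapes used in the sink's hdfs.path."""
    # These three codes mean the same in strftime and in the sink's path escapes:
    # %y = two-digit year, %m = month (01-12), %d = day of month (01-31).
    return event_time.strftime(template)

path = expand_hdfs_path("/fensiweblog/events/%y-%m-%d/", datetime(2013, 9, 18, 6, 51, 35))
print(path)  # /fensiweblog/events/13-09-18/
```

Note that %y yields a two-digit year, so the directory is named 13-09-18 rather than 2013-09-18.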

    Once a file is placed in the directory /data/flumedata, Flume sinks its contents to HDFS.
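
The spooling directory source expects each file to be complete and uniquely named by the time it appears in the directory, so a log that is still being written should not be placed there directly. One common way to satisfy this is to copy the file to a staging directory first and then rename it into the spool directory. The sketch below assumes a hypothetical staging directory on the same filesystem as the spool directory; the drop_into_spool helper and the naming scheme are illustrative, not part of the original project.

```python
import os
import shutil
import time

def drop_into_spool(src_path, spool_dir, staging_dir):
    """Copy src_path to staging_dir, then rename it into spool_dir in one step."""
    os.makedirs(staging_dir, exist_ok=True)
    # A timestamp suffix keeps names unique across repeated copies of the same log.
    unique_name = "%s.%d" % (os.path.basename(src_path), time.time_ns())
    staged = os.path.join(staging_dir, unique_name)
    shutil.copy2(src_path, staged)
    final = os.path.join(spool_dir, unique_name)
    # os.rename is atomic on POSIX when both paths are on the same filesystem,
    # so the spool directory never contains a half-written file.
    os.rename(staged, final)
    return final
```

For example, drop_into_spool("/var/log/nginx/access.log.1", "/data/flumedata", "/data/flumestaging") would hand a rotated log to the agent; both the source path and the staging path here are made-up examples.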

    Start the Flume agent: bin/flume-ng agent -c conf -f conf/fensi.conf -n a1 -Dflume.root.logger=INFO,console

Note: the -n argument in the start command must be the agent name defined in the configuration file (a1 here).
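
Since a mismatched agent name is an easy mistake to make, it can be checked before starting the agent. The following is a minimal sketch under the assumption that agent names are declared with lines such as "a1.sources = r1"; the agent_names helper is hypothetical, and the embedded text is an excerpt of the configuration shown earlier.

```python
import re

def agent_names(config_text):
    """Collect agent names from declarations such as 'a1.sources = r1'."""
    names = set()
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Only top-level declarations match; 'a1.sources.r1.type = ...' does not.
        match = re.match(r"([^.\s=]+)\.(sources|sinks|channels)\s*=", line)
        if match:
            names.add(match.group(1))
    return names

config_text = """
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = spooldir
"""
requested = "a1"  # the value that would be passed to -n
print(requested in agent_names(config_text))
```

In practice the text would be read from conf/fensi.conf instead of an inline string.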

Original article: https://www.cnblogs.com/yaboya/p/9329361.html