Hadoop Comprehensive Assignment

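1. Use Hive to run a word-frequency count on the text file produced by the web-crawler assignment (or on the long English novel downloaded for the English word-frequency exercise).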

I downloaded the full-length English book The Souls of Black Folk (69,100 words in total) and renamed the file to Black.txt to make it easier to work with.

Start Hadoop:

start-all.sh

Create a directory on HDFS:

hdfs dfs -mkdir storyinput
hdfs dfs -ls /user/hadoop

Upload the file to HDFS:

Black.txt is saved in ~/下载 (the Downloads directory). List the directory, then upload the file to HDFS:

cd ~/下载
ls
hdfs dfs -put ./Black.txt storyinput
hdfs dfs -ls /user/hadoop/storyinput

Start Hive:

hive

Create the database story and, inside it, a table for the raw document:

show databases;
create database story;
use story;
create table storydocs(line string);
show tables;

Load the file contents into the table storydocs and inspect them:

load data inpath '/user/hadoop/storyinput/Black.txt' overwrite into table storydocs;
select * from storydocs;

Run the word-frequency count in HQL and store the result in the table story_count:

create table story_count as select word, count(1) as cnt from (select explode(split(line, ' ')) as word from storydocs) w group by word;

View the results:

select * from story_count;
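
As a sanity check on the Hive output, the same count can be approximated locally with standard shell tools. This is only a sketch: the function name word_freq is made up for illustration, and it assumes the local copy Black.txt is still in the current directory. Like the HQL above, it splits on single spaces only, but it drops the empty tokens that Hive would count for runs of consecutive spaces, so totals may differ slightly.

```shell
#!/usr/bin/env bash
# word_freq FILE: print "word<TAB>count" lines, most frequent first.
# Hypothetical helper for cross-checking the Hive table story_count.
word_freq() {
  LC_ALL=C tr -s ' ' '\n' < "$1" |   # one token per line, blank lines squeezed
    grep -v '^$' |                   # drop any remaining empty token
    LC_ALL=C sort |                  # group identical tokens together
    uniq -c |                        # count each group
    LC_ALL=C sort -rn |              # highest count first
    awk '{print $2 "\t" $1}'         # reorder to word, count
}

# Show the ten most frequent tokens if the local file is present.
if [ -f Black.txt ]; then
  word_freq Black.txt | head -10
fi
```

Comparing the top few rows with select * from story_count order by cnt desc limit 10; (or the default _c1 column if no alias was given) is usually enough to spot a loading or quoting mistake.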

2. Use Hive to analyse the CSV file produced by the web-crawler assignment, and write a blog post describing the analysis process and the results.

The data set summarises the main figures for IT and computer-related discussion topics on Docin (豆丁网).

I saved the file to my mailbox, downloaded it inside the virtual machine, then made a copy and converted it to CSV format.

With HDFS running, upload wordcount.csv to HDFS and look at the first 10 lines:

cd ~/下载
hdfs dfs -put ./wordcount.csv storyinput
hdfs dfs -ls /user/hadoop/storyinput
hdfs dfs -cat /user/hadoop/storyinput/wordcount.csv | head -10

Start Hive, switch to the story database, create the table big_data, load the data from wordcount.csv into it, and query the first 10 records.
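
The original commands for this step were only shown in a screenshot. A minimal sketch of what they might look like follows, assuming wordcount.csv has two comma-separated columns, a word and its count. The column names word and cnt and the script name big_data.hql are hypothetical, so adjust them to the actual layout of the file.

```shell
#!/usr/bin/env bash
# Write the HiveQL for this step to a script file.
# Assumption: the CSV has two comma-separated columns (word, count).
cat > big_data.hql <<'EOF'
use story;
create table if not exists big_data(word string, cnt int)
  row format delimited fields terminated by ',';
load data inpath '/user/hadoop/storyinput/wordcount.csv' overwrite into table big_data;
select * from big_data limit 10;
EOF

# Run the script only when the hive client is available on this machine.
if command -v hive >/dev/null 2>&1; then
  hive -f big_data.hql
else
  echo "hive not found; HiveQL written to big_data.hql"
fi
```

Note that load data inpath moves the file out of storyinput and into the table's warehouse directory, so the earlier hdfs dfs -ls will no longer list it afterwards.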

Original post: https://www.cnblogs.com/206cch/p/9090797.html