Big Data Basics, Day 5: Offline Computing with Hadoop (Part 2), Hadoop Shell and an Introduction to the HDFS Java API

I. Hadoop Shell Commands

　　Since official documentation exists, start with it: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

　　There are three similar-looking commands. They differ as follows:

The following is adapted from Stack Overflow.
The following three commands look the same but differ in small ways:
hadoop fs <args>
hadoop dfs <args>
hdfs dfs <args>

hadoop fs <args>
fs refers to a generic file system and can point to any file system, such as the local one or HDFS. Use it when dealing with different file systems such as Local FS, HFTP FS, S3 FS, and others.
hadoop dfs <args>
dfs is specific to HDFS and works for operations related to HDFS. It has been deprecated; use hdfs dfs instead.
hdfs dfs <args>
Same as the second: it works for all operations related to HDFS and is the recommended replacement for hadoop dfs.

Below is the list categorized as HDFS commands:
namenode|secondarynamenode|datanode|dfs|dfsadmin|fsck|balancer|fetchdt|oiv|dfsgroups
So even if you use hadoop dfs, it will locate hdfs and delegate the command to hdfs dfs.

　　The command that invokes the HDFS client is:

bin/hadoop fs <args>

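　　1. Uploading a file

　　　　First create a test file:

echo "i love china" > 1.txt

　　　　// > redirects the output into the file and overwrites any existing content (>> appends instead); see the Linux basics notes for details

　　　　Use -put. Detailed usage: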

put

Usage: hadoop fs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> .. ]. <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”

Copying fails if the file already exists, unless the -f flag is given.

Options:

-p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk, Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
-d : Skip creation of temporary file with the suffix ._COPYING_.
Examples:

hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
Exit Code:

Returns 0 on success and -1 on error.


　　　　Example:

[hadoop@mini2 ~]$ hadoop fs -put 1.txt /

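　　　　// This works from any machine, because they all belong to the same cluster.

　　　　The result can be checked in the web UI:

http://mini1:50070 (HDFS)
http://mini1:8088 (YARN)

　　　　　　The web pages are only an aid (they can be buggy); hdfs dfsadmin -report is the recommended way to check the cluster state.

　　　　　　If something goes wrong, check the logs first. See the pitfalls recorded in the previous post for more.

　　　　To create a directory use mkdir; its -p option also creates missing parent directories.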

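　　2. Viewing file contents

　　　　Use cat. Its detailed usage is not repeated here; see the official documentation linked at the top of this post.

　　　　Example: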

[hadoop@mini2 ~]$ hadoop fs -cat /1.txt

　　3. Listing directories

　　　　Besides the web UI above, ls can also be used:

ls

Usage: hadoop fs -ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] <args>

Options:

-C: Display the paths of files and directories only.
-d: Directories are listed as plain files.
-h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
-q: Print ? instead of non-printable characters.
-R: Recursively list subdirectories encountered.
-t: Sort output by modification time (most recent first).
-S: Sort output by file size.
-r: Reverse the sort order.
-u: Use access time rather than modification time for display and sorting.
-e: Display the erasure coding policy of files and directories only.
For a file ls returns stat on the file with the following format:

permissions number_of_replicas userid groupid filesize modification_date modification_time filename
For a directory it returns list of its direct children as in Unix. A directory is listed as:

permissions userid groupid modification_date modification_time dirname
Files within a directory are ordered by filename by default.

Example:

hadoop fs -ls /user/hadoop/file1
hadoop fs -ls -e /ecdir
Exit Code:

Returns 0 on success and -1 on error.


　　　　Example:

[hadoop@mini2 ~]$ hadoop fs -ls /
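The file line format quoted above (permissions, number_of_replicas, userid, groupid, filesize, modification_date, modification_time, filename) can be picked apart by a script. The following is a minimal sketch, not part of Hadoop: the class name LsLineSketch and the sample line are made up for illustration, and real output depends on the cluster and Hadoop version.

```java
// Splits one file line of "hadoop fs -ls" output into the eight fields named above.
final class LsLineSketch {

    // Field order: permissions, replicas, user, group, size, date, time, name.
    static String[] parseFileLine(String line) {
        // Limit the split to 8 parts so a file name containing spaces stays in one piece.
        return line.trim().split("\\s+", 8);
    }

    public static void main(String[] args) {
        // Hypothetical sample line for the 1.txt uploaded earlier.
        String sample = "-rw-r--r--   2 hadoop supergroup         13 2018-01-28 10:15 /1.txt";
        String[] f = parseFileLine(sample);
        System.out.println("replicas=" + f[1] + " size=" + f[4] + " name=" + f[7]);
    }
}
```

Only file lines are handled here; directory lines carry no replica count and would need their own handling.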

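　　// Note that the / here is the root of the HDFS file system, not of the local one.

　　The files actually live in the data directory configured earlier, namely hdpdata.

　　The directory tree is deep (to use tree, install it first: yum install tree):

/home/hadoop/hdpdata/dfs/data/current/BP-1456370730-192.168.137.128-1517051116045/current/finalized/subdir0/subdir0

　　Inside it, the files are all named uniformly as blk plus an id (blk_<id>).

　　Note that the replication factor was configured as 2 earlier, so two copies of the data are stored.

　　By default, a file is split into several blocks only when it is larger than 128 MB.

　　Can the pieces be put back together and used again? (Concatenate them with redirect-append: cat file1 >> tmp, then cat file2 >> tmp.)

　　It turns out they can. In other words, HDFS does a plain split and adds nothing to the data.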
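The splitting described above can be sketched in plain Java. This is an illustration of the idea only, not HDFS code: the class and method names are invented, and the 128 MB block size is the default mentioned in the text.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BlockSplitSketch {

    // Number of blocks needed for a file: ceil(fileSize / blockSize).
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Cuts data into consecutive chunks of at most blockSize bytes, adding nothing.
    static List<byte[]> split(byte[] data, int blockSize) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < data.length; off += blockSize) {
            blocks.add(Arrays.copyOfRange(data, off, Math.min(off + blockSize, data.length)));
        }
        return blocks;
    }

    // Appends the chunks in order, like cat file1 >> tmp followed by cat file2 >> tmp.
    static byte[] concat(List<byte[]> blocks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] b : blocks) {
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 150 MB file with 128 MB blocks needs 2 blocks (128 MB + 22 MB).
        System.out.println(blockCount(150 * mb, 128 * mb));
        byte[] data = "i love china".getBytes();
        // Splitting and re-joining gives back the original bytes.
        System.out.println(Arrays.equals(data, concat(split(data, 5))));
    }
}
```

Because each chunk is a verbatim slice of the original, concatenating the chunks in order is enough to restore the file, which is what the cat experiment shows.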

　　Of course, since Hadoop does the splitting, it does not leave us to merge the blocks by hand; there is a shell command for that as well.

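　　4. Downloading a file

　　　　Use the get command. For example, suppose a 150 MB file xuexi.mp4 was uploaded to /xuexi.mp4. As seen above, HDFS actually stores it as two blocks, each named blk plus an id-like number. Running get is enough for the framework to merge the two blocks and hand back the complete file. With the small test file from earlier:

hadoop fs -put 1.txt /
hadoop fs -get /1.txt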

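　　　Other commonly used commands:

-help
Purpose: print the usage manual for a command
-ls
Purpose: show directory information
Example: hadoop fs -ls hdfs://hadoop-server01:9000/
Note: in all of these arguments an HDFS path can be abbreviated; the scheme and address are then taken from the configured default file system
-->hadoop fs -ls /   has the same effect as the previous command
-mkdir
Purpose: create a directory on HDFS
Example: hadoop fs -mkdir -p /aaa/bbb/cc/dd
-moveFromLocal
Purpose: cut and paste from the local file system to HDFS
Example: hadoop fs -moveFromLocal /home/hadoop/a.txt /aaa/bbb/cc/dd
-moveToLocal
Purpose: cut and paste from HDFS to the local file system (note: the official FileSystemShell documentation describes this command as printing a "Not implemented yet" message)
Example: hadoop fs -moveToLocal /aaa/bbb/cc/dd /home/hadoop/a.txt
-appendToFile
Purpose: append a file to the end of an existing file
Example: hadoop fs -appendToFile ./hello.txt hdfs://hadoop-server01:9000/hello.txt
Which can be abbreviated to:
hadoop fs -appendToFile ./hello.txt /hello.txt

-cat
Purpose: show the contents of a file
Example: hadoop fs -cat /hello.txt

-tail
Purpose: show the end of a file
Example: hadoop fs -tail /weblog/access_log.1
-text
Purpose: print the contents of a file as text
Example: hadoop fs -text /weblog/access_log.1
-chgrp
-chmod
-chown
Purpose: same usage as on a Linux file system; change the group, permissions, and owner of a file
Example:
hadoop fs -chmod 666 /hello.txt
hadoop fs -chown someuser:somegrp /hello.txt
-copyFromLocal
Purpose: copy a file from the local file system to an HDFS path
Example: hadoop fs -copyFromLocal ./jdk.tar.gz /aaa/
-copyToLocal
Purpose: copy from HDFS to the local file system
Example: hadoop fs -copyToLocal /aaa/jdk.tar.gz
-cp
Purpose: copy from one HDFS path to another HDFS path
Example: hadoop fs -cp /aaa/jdk.tar.gz /bbb/jdk.tar.gz.2

-mv
Purpose: move a file within HDFS
Example: hadoop fs -mv /aaa/jdk.tar.gz /
-get
Purpose: same as copyToLocal, that is, download a file from HDFS to the local file system
Example: hadoop fs -get /aaa/jdk.tar.gz
-getmerge
Purpose: download several files merged into one
Example: say the HDFS directory /aaa/ contains several files: log.1, log.2, log.3, ...
hadoop fs -getmerge /aaa/log.* ./log.sum
-put
Purpose: same as copyFromLocal
Example: hadoop fs -put /aaa/jdk.tar.gz /bbb/jdk.tar.gz.2

-rm
Purpose: delete a file or directory
Example: hadoop fs -rm -r /aaa/bbb/

-rmdir
Purpose: delete an empty directory
Example: hadoop fs -rmdir /aaa/bbb/ccc
-df
Purpose: report the free space of the file system
Example: hadoop fs -df -h /

-du
Purpose: report the size of directories
Example:
hadoop fs -du -s -h /aaa/*

-count
Purpose: count the file nodes under a given directory
Example: hadoop fs -count /aaa/

-setrep
Purpose: set the number of replicas of a file in HDFS
Example: hadoop fs -setrep 3 /aaa/jdk.tar.gz
<The replica count set here is only recorded in the NameNode metadata; whether that many replicas really exist depends on the number of DataNodes>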
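The note on -setrep can be made concrete with a tiny calculation. The sketch below is not Hadoop code; it assumes that a DataNode holds at most one replica of a given block, so the number of replicas that can really exist is capped by the number of live DataNodes. The class and method names are invented for illustration.

```java
final class ReplicationSketch {

    // Replicas that can actually exist, assuming at most one replica per DataNode.
    static int achievableReplicas(int requested, int liveDataNodes) {
        return Math.min(requested, liveDataNodes);
    }

    public static void main(String[] args) {
        // Requesting 3 replicas on a cluster with 2 DataNodes still yields 2 copies.
        System.out.println(achievableReplicas(3, 2));
        // With 5 DataNodes the requested 3 replicas can all be placed.
        System.out.println(achievableReplicas(3, 5));
    }
}
```

The requested value stays in the metadata, so under this assumption the missing replicas can be created later once more DataNodes join the cluster.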


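　　Reference for the common commands: http://blog.51cto.com/flycc258/1615120

　　As for users, HDFS does not check identities as strictly as Linux does. It basically takes your word for who you are and does not care, for example, whether that user really exists.

　　Other HDFS internals will be covered in later posts.

II. Operating HDFS with the Java API

　　1. Adding the dependency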

<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.4</version>
</dependency>

　　// If network access is inconvenient, the jars can also be found under the share directory of the Hadoop distribution.

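　　2. Notes on developing on Windows

　　Developing Hadoop applications on Linux is recommended, since it avoids compatibility problems. To develop a client application on Windows, set up the environment as follows:

　　　　A. Unpack a Hadoop distribution into some directory on Windows.

　　　　B. Replace the lib and bin directories of the distribution with native libraries compiled for the matching Windows platform. Link: https://pan.baidu.com/s/1eTkjdrs password: f6el

　　　　　　Or download them from GitHub: https://github.com/srccodes/hadoop-common-2.2.0-bin

　　　　C. Set HADOOP_HOME in Windows to point at the unpacked distribution.

　　　　D. Add Hadoop's bin directory to the Windows PATH variable.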
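Steps A to D can be sanity-checked before any Hadoop code runs. The following is a rough sketch using only the JDK; the class and method names are invented. It assumes that the Hadoop client looks at the hadoop.home.dir system property first and falls back to the HADOOP_HOME environment variable, and that on Windows it expects winutils.exe in the bin directory below that location.

```java
import java.io.File;

final class HadoopHomeCheck {

    // Assumed lookup order: system property first, then environment variable.
    static String resolveHome(String sysProp, String envVar) {
        return sysProp != null ? sysProp : envVar;
    }

    // True if <home>/bin/winutils.exe exists as a regular file.
    static boolean hasWinutils(String home) {
        return home != null && new File(new File(home, "bin"), "winutils.exe").isFile();
    }

    public static void main(String[] args) {
        String home = resolveHome(System.getProperty("hadoop.home.dir"),
                                  System.getenv("HADOOP_HOME"));
        System.out.println("Hadoop home: " + home);
        System.out.println("winutils.exe present: " + hasWinutils(home));
    }
}
```

If the second line prints false on Windows, step B or step C is the likely place to look.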

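　　　　Many fixes for the errors that can show up at this stage are available online, so they are not repeated here; the precompiled builds shared by other users can also be used.

　　3. Testing and troubleshooting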

package com.hdfs.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Before;
import org.junit.Test;

/**
 * HDFS client demo
 *
 * @author zcc ON 2018/1/28
 **/
public class HdfsClientDemo {
    FileSystem fs = null;
    @Before
    public void init() throws Exception{
        // Backslashes must be escaped inside a Java string literal
        System.setProperty("hadoop.home.dir", "F:\\work\\hadoop-2.6.4");
        Configuration conf = new Configuration();
        // Obtain a client instance for file system operations
        fs = FileSystem.get(conf);
    }
    @Test
    public void testUpload() throws Exception{
        // Equivalent to the put shell command
        fs.copyFromLocalFile(new Path("F:/c.log"),new Path("/c.log.copy"));
        // Close the client
        fs.close();
    }
}

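　　This runs successfully. But conf obviously has nothing configured, and the official documentation shows that the default file system is the local one, so the file was actually "uploaded" to the local file system. (With the code at hand this can be verified with a breakpoint: inspect the type of fs and compare it before and after setting conf.)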
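Why the same path /c.log.copy can end up in two different places may be easier to see with a small sketch. It uses only java.net.URI and does not call Hadoop; the class name is invented, file:/// stands for the unconfigured default, and hdfs://mini1:9000 is an assumed NameNode address (the host comes from this post, the port is a guess).

```java
import java.net.URI;

final class DefaultFsSketch {

    // Qualifies a scheme-less path against the URI of the default file system.
    static URI qualify(String defaultFs, String path) {
        return URI.create(defaultFs).resolve(path);
    }

    public static void main(String[] args) {
        // With nothing configured, the path is interpreted on the local file system.
        URI local = qualify("file:///", "/c.log.copy");
        // With the default file system pointing at a NameNode, the same path lands in HDFS.
        URI remote = qualify("hdfs://mini1:9000", "/c.log.copy");
        System.out.println(local.getScheme() + " -> " + local.getPath());
        System.out.println(remote.getScheme() + " -> " + remote.getHost() + remote.getPath());
    }
}
```

The path string never changes; only the file system it is resolved against does, which is why the unconfigured demo silently writes to the local disk.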

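　　Another issue: once the file system is pointed at HDFS directly, the upload fails with insufficient permissions, because the request is submitted as the Windows user. There are two fixes. The first is to open up the HDFS permissions to 777 so that everyone can operate on it, but 777 is too permissive and makes access control impossible. The second is to make HDFS trust the current user.

　　For the concrete steps of both approaches see: http://www.linuxidc.com/Linux/2014-08/105335p2.htm

Alternatively, disable permission checking (use with caution):

Managing the DFS system directory. The current practice is to turn off permission checking on the Hadoop cluster: edit hadoop-1.2.0/conf/mapred-site.xml on the cluster master and add the following. (This quote targets Hadoop 1.2.0; in Hadoop 2.x the property is named dfs.permissions.enabled, and HDFS properties normally go in hdfs-site.xml.)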
 
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

　　Here we solve it in code instead.
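One way to do this in code is the FileSystem.get overload that takes a URI, a Configuration, and a user name. The following is a sketch under assumptions rather than the exact code used for this post: the NameNode address hdfs://mini1:9000 and the user hadoop are guesses based on the host name and shell prompt seen earlier, and running it needs the hadoop-client dependency from above plus a reachable cluster.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class HdfsAsUserSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust host and port to the actual cluster.
        URI nameNode = new URI("hdfs://mini1:9000");
        // The third argument is the user the client acts as (assumed to be "hadoop"),
        // replacing the Windows login name that caused the permission error.
        FileSystem fs = FileSystem.get(nameNode, conf, "hadoop");
        try {
            fs.copyFromLocalFile(new Path("F:/c.log"), new Path("/c.log.copy"));
        } finally {
            fs.close();
        }
    }
}
```

Passing the URI explicitly also settles the default file system question, so the upload goes to HDFS rather than to the local disk.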

　　4. Downloading

    @Test
    public void testDownloadFileToLocal() throws IllegalArgumentException, IOException {
        fs.copyToLocalFile(new Path("/jdk-7u65-linux-i586.tar.gz"), new Path("d:/"));
        fs.close();
    }

　　5. Directory operations

    @Test
    public void testMkdirAndDeleteAndRename() throws IllegalArgumentException, IOException {

        // Create a directory, including missing parents
        fs.mkdirs(new Path("/a1/b1/c1"));

        // Delete a directory; for a non-empty directory the second argument must be true
        fs.delete(new Path("/aaa"), true);

        // Rename a file or directory
        fs.rename(new Path("/a1"), new Path("/a2"));

    }

　　6. Listing files

    @Test
    public void testListFiles() throws FileNotFoundException, IllegalArgumentException, IOException {

        // Question to consider: why does this return an iterator rather than a container such as a List?
        RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);

        while (listFiles.hasNext()) {
            LocatedFileStatus fileStatus = listFiles.next();
            System.out.println(fileStatus.getPath().getName());
            System.out.println(fileStatus.getBlockSize());
            System.out.println(fileStatus.getPermission());
            System.out.println(fileStatus.getLen());
            BlockLocation[] blockLocations = fileStatus.getBlockLocations();
            for (BlockLocation bl : blockLocations) {
                System.out.println("block-length:" + bl.getLength() + "--" + "block-offset:" + bl.getOffset());
                String[] hosts = bl.getHosts();
                for (String host : hosts) {
                    System.out.println(host);
                }
            }
            System.out.println("--------------separator line printed for angelababy--------------");
        }
    }
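On the question raised in the code comment, one likely reason for returning an iterator is that a recursive listing can be very large, and an iterator lets the client pull entries in batches instead of holding everything in memory at once. The sketch below simulates that idea with the plain JDK; it is not the RemoteIterator implementation, and the class name, batch size, and generated file names are invented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

// A lazy listing that pulls entries from a simulated server one batch at a time.
final class BatchedListingSketch implements Iterator<String> {
    private final int total;      // entries the simulated server holds
    private final int batchSize;  // entries fetched per round trip
    private final Deque<String> buffer = new ArrayDeque<>();
    private int fetched = 0;
    int roundTrips = 0;           // how many simulated remote calls were made

    BatchedListingSketch(int total, int batchSize) {
        this.total = total;
        this.batchSize = batchSize;
    }

    // Stand-in for one remote call that returns the next batch of names.
    private void fetchNextBatch() {
        roundTrips++;
        int end = Math.min(fetched + batchSize, total);
        for (; fetched < end; fetched++) {
            buffer.add("/file-" + fetched);
        }
    }

    @Override
    public boolean hasNext() {
        // Only go back to the server when the local buffer has run dry.
        if (buffer.isEmpty() && fetched < total) {
            fetchNextBatch();
        }
        return !buffer.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffer.poll();
    }

    public static void main(String[] args) {
        BatchedListingSketch it = new BatchedListingSketch(10, 4);
        int shown = 0;
        while (shown < 3 && it.hasNext()) {
            System.out.println(it.next());
            shown++;
        }
        // Reading 3 of 10 entries only cost a single batch fetch.
        System.out.println("round trips so far: " + it.roundTrips);
    }
}
```

A caller that stops early never pays for the remaining entries, which a fully materialized List could not offer.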

　　Further reference: https://www.cnblogs.com/Eddyer/p/6641778.html

　　Additional notes will be added in the detailed HDFS post.

Original article: https://www.cnblogs.com/jiangbei/p/8366238.html