• Hadoop: HDFS FilePattern


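    An example: suppose a MapReduce job has to process one or two months of log files, which may mean a large number of files. How can the file paths be collected quickly?
    In HDFS, wildcards (glob patterns) solve this problem. They work the same way as wildcards in the Linux shell.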

    For example:

    Pattern        Expands to
    2016/*         2016/05 2016/04
    2016/0[45]     2016/05 2016/04
    2016/0[4-5]    2016/05 2016/04
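
    To try these patterns without a cluster, the sketch below uses the JDK's own glob matcher (java.nio.file.PathMatcher) as a stand-in. This is an assumption made for illustration: it is not Hadoop's glob implementation, although `*`, `[45]` and `[4-5]` behave the same way for these inputs. The class name GlobExpansionSketch and the extra directory 2016/06 are hypothetical, and the example assumes a Unix-like default file system where `/` is the separator.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: uses the JDK glob matcher, not Hadoop's FileSystem.globStatus.
class GlobExpansionSketch {

    // Return the candidates matched by the glob pattern, in input order.
    static List<String> expand(String pattern, List<String> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> matched = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                matched.add(candidate);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical directory names; 2016/06 is added to show what the brackets exclude.
        List<String> dirs = Arrays.asList("2016/04", "2016/05", "2016/06");
        for (String pattern : new String[] {"2016/*", "2016/0[45]", "2016/0[4-5]"}) {
            System.out.println(pattern + " -> " + expand(pattern, dirs));
        }
    }
}
```

    With 2016/06 present, `2016/*` picks up all three directories while both bracket forms keep only 04 and 05, which is why a bracket pattern is the safer choice when only specific months are wanted.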

    Code:

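        import java.io.IOException;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.FileUtil;
        import org.apache.hadoop.fs.Path;

        // Assumes `configuration` is an org.apache.hadoop.conf.Configuration
        // field of the enclosing class, initialized elsewhere.
        public static void globFiles(String pattern){

            try {
                FileSystem fileSystem = FileSystem.get(configuration);

                // globStatus returns null when the pattern has no glob
                // and the path does not exist, so guard before iterating.
                FileStatus[] statuses = fileSystem.globStatus(new Path(pattern));
                if (statuses == null) {
                    return;
                }
                Path[] listPaths = FileUtil.stat2Paths(statuses);
                for (Path path : listPaths){
                    System.out.println(path);
                }

            } catch (IOException e) {
                e.printStackTrace();
            }
        }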
    
    

    HDFS also provides PathFilter for filtering the file paths we retrieve. It is similar to java.io.FileFilter.

      /**
       * Return an array of FileStatus objects whose path names match pathPattern
       * and is accepted by the user-supplied path filter. Results are sorted by
       * their path names.
       * Return null if pathPattern has no glob and the path does not exist.
       * Return an empty array if pathPattern has a glob and no path matches it. 
       * 
       * @param pathPattern
       *          a regular expression specifying the path pattern
       * @param filter
       *          a user-supplied path filter
       * @return an array of FileStatus objects
       * @throws IOException if any I/O error occurs when fetching file status
       */
      public FileStatus[] globStatus(Path pathPattern, PathFilter filter)
          throws IOException {
        return new Globber(this, pathPattern, filter).glob();
      }
    
    

    HDFS itself provides a number of filters. Hadoop: The Definitive Guide gives an implementation of a regular-expression filter:

    public class RegexExcludePathFilter implements PathFilter {
    
        private final String regex;
    
        public RegexExcludePathFilter(String regex) {
            this.regex = regex;
        }
    
        @Override
        public boolean accept(Path path) {
            return !path.toString().matches(regex);
        }
    }
    
    

    Using a regular expression to refine the results:

    fileSystem.listStatus(new Path(uri),new RegexExcludePathFilter("^.*/2016/0$"));
    
    

    The output is:

    hdfs://hadoop:9000/hadoop/2016/04
    hdfs://hadoop:9000/hadoop/2016/05
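
    Both directories are listed because String.matches succeeds only when the whole string matches, and neither path ends in exactly `/2016/0`, so the filter excludes nothing here. The sketch below reproduces the accept logic of RegexExcludePathFilter on plain strings so it can be run without Hadoop. The class name RegexExcludeSketch and the second regex `^.*/2016/04$` are hypothetical, added only to show a pattern that does exclude a path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mirrors RegexExcludePathFilter.accept on plain strings.
class RegexExcludeSketch {

    // Keep the path unless the WHOLE string matches the regex.
    static boolean accept(String path, String regex) {
        return !path.matches(regex);
    }

    // Return the paths that survive the exclusion filter, in input order.
    static List<String> filter(List<String> paths, String regex) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (accept(path, regex)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "hdfs://hadoop:9000/hadoop/2016/04",
                "hdfs://hadoop:9000/hadoop/2016/05");

        // Regex from the article: nothing ends in "/2016/0", so both paths are kept.
        System.out.println(filter(paths, "^.*/2016/0$"));

        // Hypothetical variant: excludes the April directory and keeps only May.
        System.out.println(filter(paths, "^.*/2016/04$"));
    }
}
```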
    
    

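    A filter is given only a Path, so it can act only on the file name and path, not on other file properties such as modification time.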

  • Original article: https://www.cnblogs.com/re-myself/p/5527587.html