Hadoop Primer: Analyzing Massive Web Logs and Extracting KPI Statistics with Hadoop


    Reposted from:
    http://blog.fens.me/hadoop-mapreduce-log-kpi/

    I worked through this post today. It is very well written, and I typed out the code while following along.

    A few issues came up along the way.

    First, the Hadoop version used in the post is quite old. If you run the code on Hadoop 2.x, the result files may end up with no data written to them at all. To solve this I followed the official MapReduce tutorial at http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html, which is very detailed and easy to follow even with basic English. Compared with Hadoop 1.x, a Mapper class in Hadoop 2.x simply extends org.apache.hadoop.mapreduce.Mapper; there is no need to implement the old Mapper interface any more, and the map method signature changes to the following:

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.filterPVs(value.toString());
            System.out.println(kpi);
            if (kpi.isValid()) {
                word.set(kpi.getIp());
                context.write(word, one);
            }
        }
    

    For comparison, the Hadoop 1.x version looks like this:

        @Override
        public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            KPI kpi = KPI.filterPVs(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getRequest());
                output.collect(word, one);
            }
        }
    

    On Hadoop 2.x the signature has to change, and the reduce method of the Reducer changes in the same way (a sketch of the updated Mapper and Reducer appears after the download link below). I initially missed the GitHub link in the original post, so I searched Baidu and, after quite some effort, found a roughly 150 MB log file; grab it if you need it:

    Link: https://pan.baidu.com/s/1hz5dTX69Hc_l9Aj-axvfqw, extraction code: ssys. Note that this log file does not exactly match the one in the original post: it is missing two fields, so adjust the code against it accordingly.
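
    For reference, here is a minimal sketch of what the new-API (org.apache.hadoop.mapreduce) Mapper and Reducer pair can look like for the PV statistic. The class and field names are my own, not from the original post, and the KPI helper class is the one defined later in this article:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch of a Hadoop 2.x (new API) Mapper/Reducer pair for the PV statistic.
    public class KPIPVNewAPI {

        public static class PVMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                KPI kpi = KPI.filterPVs(value.toString());
                if (kpi.isValid()) {
                    word.set(kpi.getRequest());
                    context.write(word, one); // emit (request, 1)
                }
            }
        }

        public static class PVReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get(); // sum the 1s emitted for this request
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }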

    Second, running on Hadoop 2.x with the job parameters configured in the main method. This time I used Hadoop 2.9.2, which on Windows needs the winutils.exe and hadoop.dll helpers. I have uploaded them to Baidu Pan at https://pan.baidu.com/s/1RTSeGjV2VwWxRAvsUMkkrA, extraction code: dkxt. The share contains three items: hadoop 2.9.2, the Eclipse plugin, and winutils. Copy everything from the hadoop 2.6.x folder into hadoop-2.9.2/bin, and copy hadoop.dll from hadoop 2.6.x into C:/Windows/System32. Close all applications, restart the computer, and set the following system properties in the main method:

        System.setProperty("HADOOP_HOME", "E:\\hadoop\\hadoop2.6");
        System.setProperty("hadoop.home.dir", "E:\\hadoop\\hadoop-2.9.2");
        System.setProperty("HADOOP_USER_NAME", "hadoop");
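
    For context, here is a minimal sketch of a new-API driver with these properties set at the top of main(). The class names are illustrative (PVMapper/PVReducer refer to the sketch above), and the HDFS address matches the one used later in KPIPV.java; this is not the project's exact driver:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class KPIPVDriver {
        public static void main(String[] args) throws Exception {
            // Windows-only workarounds: set these before any Hadoop class touches the local filesystem.
            System.setProperty("hadoop.home.dir", "E:\\hadoop\\hadoop-2.9.2");
            System.setProperty("HADOOP_USER_NAME", "hadoop");

            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "KPIPV");
            job.setJarByClass(KPIPVDriver.class);
            job.setMapperClass(KPIPVNewAPI.PVMapper.class);    // the sketch Mapper shown earlier
            job.setCombinerClass(KPIPVNewAPI.PVReducer.class);
            job.setReducerClass(KPIPVNewAPI.PVReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("hdfs://192.168.1.210:9000/user/hdfs/log_kpi/"));
            FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }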

    With everything set up, the run may still fail with an error mentioning something like access0 (from NativeIO$Windows). If that happens, create a NativeIO.java file under the project's src directory (package org.apache.hadoop.io.nativeio) with the contents below, so that it shadows the class shipped inside the Hadoop jar:

    /**
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements.  See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership.  The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    package org.apache.hadoop.io.nativeio;
    
    import java.io.File;
    import java.io.FileDescriptor;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.lang.reflect.Field;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    
    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.CommonConfigurationKeys;
    import org.apache.hadoop.fs.HardLink;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SecureIOUtils.AlreadyExistsException;
    import org.apache.hadoop.util.NativeCodeLoader;
    import org.apache.hadoop.util.Shell;
    import org.apache.hadoop.util.PerformanceAdvisory;
    
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import sun.misc.Unsafe;
    
    import com.google.common.annotations.VisibleForTesting;
    
    /**
     * JNI wrappers for various native IO-related calls not available in Java.
     * These functions should generally be used alongside a fallback to another
     * more portable mechanism.
     */
    @InterfaceAudience.Private
    @InterfaceStability.Unstable
    public class NativeIO {
      public static class POSIX {
        // Flags for open() call from bits/fcntl.h - Set by JNI
        public static int O_RDONLY = -1;
        public static int O_WRONLY = -1;
        public static int O_RDWR = -1;
        public static int O_CREAT = -1;
        public static int O_EXCL = -1;
        public static int O_NOCTTY = -1;
        public static int O_TRUNC = -1;
        public static int O_APPEND = -1;
        public static int O_NONBLOCK = -1;
        public static int O_SYNC = -1;
    
        // Flags for posix_fadvise() from bits/fcntl.h - Set by JNI
        /* No further special treatment.  */
        public static int POSIX_FADV_NORMAL = -1;
        /* Expect random page references.  */
        public static int POSIX_FADV_RANDOM = -1;
        /* Expect sequential page references.  */
        public static int POSIX_FADV_SEQUENTIAL = -1;
        /* Will need these pages.  */
        public static int POSIX_FADV_WILLNEED = -1;
        /* Don't need these pages.  */
        public static int POSIX_FADV_DONTNEED = -1;
        /* Data will be accessed once.  */
        public static int POSIX_FADV_NOREUSE = -1;
    
    
        // Updated by JNI when supported by glibc.  Leave defaults in case kernel
        // supports sync_file_range, but glibc does not.
        /* Wait upon writeout of all pages
           in the range before performing the
           write.  */
        public static int SYNC_FILE_RANGE_WAIT_BEFORE = 1;
        /* Initiate writeout of all those
           dirty pages in the range which are
           not presently under writeback.  */
        public static int SYNC_FILE_RANGE_WRITE = 2;
        /* Wait upon writeout of all pages in
           the range after performing the
           write.  */
        public static int SYNC_FILE_RANGE_WAIT_AFTER = 4;
    
        private static final Logger LOG = LoggerFactory.getLogger(NativeIO.class);
    
        // Set to true via JNI if possible
        public static boolean fadvisePossible = false;
    
        private static boolean nativeLoaded = false;
        private static boolean syncFileRangePossible = true;
    
        static final String WORKAROUND_NON_THREADSAFE_CALLS_KEY =
          "hadoop.workaround.non.threadsafe.getpwuid";
        static final boolean WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT = true;
    
        private static long cacheTimeout = -1;
    
        private static CacheManipulator cacheManipulator = new CacheManipulator();
    
        public static CacheManipulator getCacheManipulator() {
          return cacheManipulator;
        }
    
        public static void setCacheManipulator(CacheManipulator cacheManipulator) {
          POSIX.cacheManipulator = cacheManipulator;
        }
    
        /**
         * Used to manipulate the operating system cache.
         */
        @VisibleForTesting
        public static class CacheManipulator {
          public void mlock(String identifier, ByteBuffer buffer,
              long len) throws IOException {
            POSIX.mlock(buffer, len);
          }
    
          public long getMemlockLimit() {
            return NativeIO.getMemlockLimit();
          }
    
          public long getOperatingSystemPageSize() {
            return NativeIO.getOperatingSystemPageSize();
          }
    
          public void posixFadviseIfPossible(String identifier,
            FileDescriptor fd, long offset, long len, int flags)
                throws NativeIOException {
            NativeIO.POSIX.posixFadviseIfPossible(identifier, fd, offset,
                len, flags);
          }
    
          public boolean verifyCanMlock() {
            return NativeIO.isAvailable();
          }
        }
    
        /**
         * A CacheManipulator used for testing which does not actually call mlock.
         * This allows many tests to be run even when the operating system does not
         * allow mlock, or only allows limited mlocking.
         */
        @VisibleForTesting
        public static class NoMlockCacheManipulator extends CacheManipulator {
          public void mlock(String identifier, ByteBuffer buffer,
              long len) throws IOException {
            LOG.info("mlocking " + identifier);
          }
    
          public long getMemlockLimit() {
            return 1125899906842624L;
          }
    
          public long getOperatingSystemPageSize() {
            return 4096;
          }
    
          public boolean verifyCanMlock() {
            return true;
          }
        }
    
        static {
          if (NativeCodeLoader.isNativeCodeLoaded()) {
            try {
              Configuration conf = new Configuration();
              workaroundNonThreadSafePasswdCalls = conf.getBoolean(
                WORKAROUND_NON_THREADSAFE_CALLS_KEY,
                WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT);
    
              initNative();
              nativeLoaded = true;
    
              cacheTimeout = conf.getLong(
                CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_KEY,
                CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_DEFAULT) *
                1000;
              LOG.debug("Initialized cache for IDs to User/Group mapping with a " +
                " cache timeout of " + cacheTimeout/1000 + " seconds.");
    
            } catch (Throwable t) {
              // This can happen if the user has an older version of libhadoop.so
              // installed - in this case we can continue without native IO
              // after warning
              PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);
            }
          }
        }
    
        /**
         * Return true if the JNI-based native IO extensions are available.
         */
        public static boolean isAvailable() {
          return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded;
        }
    
        private static void assertCodeLoaded() throws IOException {
          if (!isAvailable()) {
            throw new IOException("NativeIO was not loaded");
          }
        }
    
        /** Wrapper around open(2) */
        public static native FileDescriptor open(String path, int flags, int mode) throws IOException;
        /** Wrapper around fstat(2) */
        private static native Stat fstat(FileDescriptor fd) throws IOException;
    
        /** Native chmod implementation. On UNIX, it is a wrapper around chmod(2) */
        private static native void chmodImpl(String path, int mode) throws IOException;
    
        public static void chmod(String path, int mode) throws IOException {
          if (!Shell.WINDOWS) {
            chmodImpl(path, mode);
          } else {
            try {
              chmodImpl(path, mode);
            } catch (NativeIOException nioe) {
              if (nioe.getErrorCode() == 3) {
                throw new NativeIOException("No such file or directory",
                    Errno.ENOENT);
              } else {
                LOG.warn(String.format("NativeIO.chmod error (%d): %s",
                    nioe.getErrorCode(), nioe.getMessage()));
                throw new NativeIOException("Unknown error", Errno.UNKNOWN);
              }
            }
          }
        }
    
        /** Wrapper around posix_fadvise(2) */
        static native void posix_fadvise(
          FileDescriptor fd, long offset, long len, int flags) throws NativeIOException;
    
        /** Wrapper around sync_file_range(2) */
        static native void sync_file_range(
          FileDescriptor fd, long offset, long nbytes, int flags) throws NativeIOException;
    
        /**
         * Call posix_fadvise on the given file descriptor. See the manpage
         * for this syscall for more information. On systems where this
         * call is not available, does nothing.
         *
         * @throws NativeIOException if there is an error with the syscall
         */
        static void posixFadviseIfPossible(String identifier,
            FileDescriptor fd, long offset, long len, int flags)
            throws NativeIOException {
          if (nativeLoaded && fadvisePossible) {
            try {
              posix_fadvise(fd, offset, len, flags);
            } catch (UnsatisfiedLinkError ule) {
              fadvisePossible = false;
            }
          }
        }
    
        /**
         * Call sync_file_range on the given file descriptor. See the manpage
         * for this syscall for more information. On systems where this
         * call is not available, does nothing.
         *
         * @throws NativeIOException if there is an error with the syscall
         */
        public static void syncFileRangeIfPossible(
            FileDescriptor fd, long offset, long nbytes, int flags)
            throws NativeIOException {
          if (nativeLoaded && syncFileRangePossible) {
            try {
              sync_file_range(fd, offset, nbytes, flags);
            } catch (UnsupportedOperationException uoe) {
              syncFileRangePossible = false;
            } catch (UnsatisfiedLinkError ule) {
              syncFileRangePossible = false;
            }
          }
        }
    
        static native void mlock_native(
            ByteBuffer buffer, long len) throws NativeIOException;
    
        /**
         * Locks the provided direct ByteBuffer into memory, preventing it from
         * swapping out. After a buffer is locked, future accesses will not incur
         * a page fault.
         * 
         * See the mlock(2) man page for more information.
         * 
         * @throws NativeIOException
         */
        static void mlock(ByteBuffer buffer, long len)
            throws IOException {
          assertCodeLoaded();
          if (!buffer.isDirect()) {
            throw new IOException("Cannot mlock a non-direct ByteBuffer");
          }
          mlock_native(buffer, len);
        }
        
        /**
         * Unmaps the block from memory. See munmap(2).
         *
         * There isn't any portable way to unmap a memory region in Java.
         * So we use the sun.nio method here.
         * Note that unmapping a memory region could cause crashes if code
         * continues to reference the unmapped code.  However, if we don't
         * manually unmap the memory, we are dependent on the finalizer to
         * do it, and we have no idea when the finalizer will run.
         *
         * @param buffer    The buffer to unmap.
         */
        public static void munmap(MappedByteBuffer buffer) {
          if (buffer instanceof sun.nio.ch.DirectBuffer) {
            sun.misc.Cleaner cleaner =
                ((sun.nio.ch.DirectBuffer)buffer).cleaner();
            cleaner.clean();
          }
        }
    
        /** Linux only methods used for getOwner() implementation */
        private static native long getUIDforFDOwnerforOwner(FileDescriptor fd) throws IOException;
        private static native String getUserName(long uid) throws IOException;
    
        /**
         * Result type of the fstat call
         */
        public static class Stat {
          private int ownerId, groupId;
          private String owner, group;
          private int mode;
    
          // Mode constants - Set by JNI
          public static int S_IFMT = -1;    /* type of file */
          public static int S_IFIFO  = -1;  /* named pipe (fifo) */
          public static int S_IFCHR  = -1;  /* character special */
          public static int S_IFDIR  = -1;  /* directory */
          public static int S_IFBLK  = -1;  /* block special */
          public static int S_IFREG  = -1;  /* regular */
          public static int S_IFLNK  = -1;  /* symbolic link */
          public static int S_IFSOCK = -1;  /* socket */
          public static int S_ISUID = -1;  /* set user id on execution */
          public static int S_ISGID = -1;  /* set group id on execution */
          public static int S_ISVTX = -1;  /* save swapped text even after use */
          public static int S_IRUSR = -1;  /* read permission, owner */
          public static int S_IWUSR = -1;  /* write permission, owner */
          public static int S_IXUSR = -1;  /* execute/search permission, owner */
    
          Stat(int ownerId, int groupId, int mode) {
            this.ownerId = ownerId;
            this.groupId = groupId;
            this.mode = mode;
          }
          
          Stat(String owner, String group, int mode) {
            if (!Shell.WINDOWS) {
              this.owner = owner;
            } else {
              this.owner = stripDomain(owner);
            }
            if (!Shell.WINDOWS) {
              this.group = group;
            } else {
              this.group = stripDomain(group);
            }
            this.mode = mode;
          }
          
          @Override
          public String toString() {
            return "Stat(owner='" + owner + "', group='" + group + "'" +
              ", mode=" + mode + ")";
          }
    
          public String getOwner() {
            return owner;
          }
          public String getGroup() {
            return group;
          }
          public int getMode() {
            return mode;
          }
        }
    
        /**
         * Returns the file stat for a file descriptor.
         *
         * @param fd file descriptor.
         * @return the file descriptor file stat.
         * @throws IOException thrown if there was an IO error while obtaining the file stat.
         */
        public static Stat getFstat(FileDescriptor fd) throws IOException {
          Stat stat = null;
          if (!Shell.WINDOWS) {
            stat = fstat(fd); 
            stat.owner = getName(IdCache.USER, stat.ownerId);
            stat.group = getName(IdCache.GROUP, stat.groupId);
          } else {
            try {
              stat = fstat(fd);
            } catch (NativeIOException nioe) {
              if (nioe.getErrorCode() == 6) {
                throw new NativeIOException("The handle is invalid.",
                    Errno.EBADF);
              } else {
                LOG.warn(String.format("NativeIO.getFstat error (%d): %s",
                    nioe.getErrorCode(), nioe.getMessage()));
                throw new NativeIOException("Unknown error", Errno.UNKNOWN);
              }
            }
          }
          return stat;
        }
    
        private static String getName(IdCache domain, int id) throws IOException {
          Map<Integer, CachedName> idNameCache = (domain == IdCache.USER)
            ? USER_ID_NAME_CACHE : GROUP_ID_NAME_CACHE;
          String name;
          CachedName cachedName = idNameCache.get(id);
          long now = System.currentTimeMillis();
          if (cachedName != null && (cachedName.timestamp + cacheTimeout) > now) {
            name = cachedName.name;
          } else {
            name = (domain == IdCache.USER) ? getUserName(id) : getGroupName(id);
            if (LOG.isDebugEnabled()) {
              String type = (domain == IdCache.USER) ? "UserName" : "GroupName";
              LOG.debug("Got " + type + " " + name + " for ID " + id +
                " from the native implementation");
            }
            cachedName = new CachedName(name, now);
            idNameCache.put(id, cachedName);
          }
          return name;
        }
    
        static native String getUserName(int uid) throws IOException;
        static native String getGroupName(int uid) throws IOException;
    
        private static class CachedName {
          final long timestamp;
          final String name;
    
          public CachedName(String name, long timestamp) {
            this.name = name;
            this.timestamp = timestamp;
          }
        }
    
        private static final Map<Integer, CachedName> USER_ID_NAME_CACHE =
          new ConcurrentHashMap<Integer, CachedName>();
    
        private static final Map<Integer, CachedName> GROUP_ID_NAME_CACHE =
          new ConcurrentHashMap<Integer, CachedName>();
    
        private enum IdCache { USER, GROUP }
    
        public final static int MMAP_PROT_READ = 0x1; 
        public final static int MMAP_PROT_WRITE = 0x2; 
        public final static int MMAP_PROT_EXEC = 0x4; 
    
        public static native long mmap(FileDescriptor fd, int prot,
            boolean shared, long length) throws IOException;
    
        public static native void munmap(long addr, long length)
            throws IOException;
      }
    
      private static boolean workaroundNonThreadSafePasswdCalls = false;
    
    
      public static class Windows {
        // Flags for CreateFile() call on Windows
        public static final long GENERIC_READ = 0x80000000L;
        public static final long GENERIC_WRITE = 0x40000000L;
    
        public static final long FILE_SHARE_READ = 0x00000001L;
        public static final long FILE_SHARE_WRITE = 0x00000002L;
        public static final long FILE_SHARE_DELETE = 0x00000004L;
    
        public static final long CREATE_NEW = 1;
        public static final long CREATE_ALWAYS = 2;
        public static final long OPEN_EXISTING = 3;
        public static final long OPEN_ALWAYS = 4;
        public static final long TRUNCATE_EXISTING = 5;
    
        public static final long FILE_BEGIN = 0;
        public static final long FILE_CURRENT = 1;
        public static final long FILE_END = 2;
        
        public static final long FILE_ATTRIBUTE_NORMAL = 0x00000080L;
    
        /**
         * Create a directory with permissions set to the specified mode.  By setting
         * permissions at creation time, we avoid issues related to the user lacking
         * WRITE_DAC rights on subsequent chmod calls.  One example where this can
         * occur is writing to an SMB share where the user does not have Full Control
         * rights, and therefore WRITE_DAC is denied.
         *
         * @param path directory to create
         * @param mode permissions of new directory
         * @throws IOException if there is an I/O error
         */
        public static void createDirectoryWithMode(File path, int mode)
            throws IOException {
          createDirectoryWithMode0(path.getAbsolutePath(), mode);
        }
    
        /** Wrapper around CreateDirectory() on Windows */
        private static native void createDirectoryWithMode0(String path, int mode)
            throws NativeIOException;
    
        /** Wrapper around CreateFile() on Windows */
        public static native FileDescriptor createFile(String path,
            long desiredAccess, long shareMode, long creationDisposition)
            throws IOException;
    
        /**
         * Create a file for write with permissions set to the specified mode.  By
         * setting permissions at creation time, we avoid issues related to the user
         * lacking WRITE_DAC rights on subsequent chmod calls.  One example where
         * this can occur is writing to an SMB share where the user does not have
         * Full Control rights, and therefore WRITE_DAC is denied.
         *
         * This method mimics the semantics implemented by the JDK in
         * {@link java.io.FileOutputStream}.  The file is opened for truncate or
         * append, the sharing mode allows other readers and writers, and paths
         * longer than MAX_PATH are supported.  (See io_util_md.c in the JDK.)
         *
         * @param path file to create
         * @param append if true, then open file for append
         * @param mode permissions of new directory
         * @return FileOutputStream of opened file
         * @throws IOException if there is an I/O error
         */
        public static FileOutputStream createFileOutputStreamWithMode(File path,
            boolean append, int mode) throws IOException {
          long desiredAccess = GENERIC_WRITE;
          long shareMode = FILE_SHARE_READ | FILE_SHARE_WRITE;
          long creationDisposition = append ? OPEN_ALWAYS : CREATE_ALWAYS;
          return new FileOutputStream(createFileWithMode0(path.getAbsolutePath(),
              desiredAccess, shareMode, creationDisposition, mode));
        }
    
        /** Wrapper around CreateFile() with security descriptor on Windows */
        private static native FileDescriptor createFileWithMode0(String path,
            long desiredAccess, long shareMode, long creationDisposition, int mode)
            throws NativeIOException;
    
        /** Wrapper around SetFilePointer() on Windows */
        public static native long setFilePointer(FileDescriptor fd,
            long distanceToMove, long moveMethod) throws IOException;
    
        /** Windows only methods used for getOwner() implementation */
        private static native String getOwner(FileDescriptor fd) throws IOException;
    
        /** Supported list of Windows access right flags */
        public static enum AccessRight {
          ACCESS_READ (0x0001),      // FILE_READ_DATA
          ACCESS_WRITE (0x0002),     // FILE_WRITE_DATA
          ACCESS_EXECUTE (0x0020);   // FILE_EXECUTE
    
          private final int accessRight;
          AccessRight(int access) {
            accessRight = access;
          }
    
          public int accessRight() {
            return accessRight;
          }
        };
    
        /** Windows only method used to check if the current process has requested
         *  access rights on the given path. */
        private static native boolean access0(String path, int requestedAccess);
    
        /**
         * Checks whether the current process has desired access rights on
         * the given path.
         * 
         * Longer term this native function can be substituted with JDK7
         * function Files#isReadable, isWritable, isExecutable.
         *
         * @param path input path
         * @param desiredAccess ACCESS_READ, ACCESS_WRITE or ACCESS_EXECUTE
         * @return true if access is allowed
         * @throws IOException I/O exception on error
         */
        public static boolean access(String path, AccessRight desiredAccess)
            throws IOException {
          return true;
        }
    
        /**
         * Extends both the minimum and maximum working set size of the current
         * process.  This method gets the current minimum and maximum working set
         * size, adds the requested amount to each and then sets the minimum and
         * maximum working set size to the new values.  Controlling the working set
         * size of the process also controls the amount of memory it can lock.
         *
         * @param delta amount to increment minimum and maximum working set size
         * @throws IOException for any error
         * @see POSIX#mlock(ByteBuffer, long)
         */
        public static native void extendWorkingSetSize(long delta) throws IOException;
    
        static {
          if (NativeCodeLoader.isNativeCodeLoaded()) {
            try {
              initNative();
              nativeLoaded = true;
            } catch (Throwable t) {
              // This can happen if the user has an older version of libhadoop.so
              // installed - in this case we can continue without native IO
              // after warning
              PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);
            }
          }
        }
      }
    
      private static final Logger LOG = LoggerFactory.getLogger(NativeIO.class);
    
      private static boolean nativeLoaded = false;
    
      static {
        if (NativeCodeLoader.isNativeCodeLoaded()) {
          try {
            initNative();
            nativeLoaded = true;
          } catch (Throwable t) {
            // This can happen if the user has an older version of libhadoop.so
            // installed - in this case we can continue without native IO
            // after warning
            PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t);
          }
        }
      }
    
      /**
       * Return true if the JNI-based native IO extensions are available.
       */
      public static boolean isAvailable() {
        return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded;
      }
    
      /** Initialize the JNI method ID and class ID cache */
      private static native void initNative();
    
      /**
       * Get the maximum number of bytes that can be locked into memory at any
       * given point.
       *
       * @return 0 if no bytes can be locked into memory;
       *         Long.MAX_VALUE if there is no limit;
       *         The number of bytes that can be locked into memory otherwise.
       */
      static long getMemlockLimit() {
        return isAvailable() ? getMemlockLimit0() : 0;
      }
    
      private static native long getMemlockLimit0();
      
      /**
       * @return the operating system's page size.
       */
      static long getOperatingSystemPageSize() {
        try {
          Field f = Unsafe.class.getDeclaredField("theUnsafe");
          f.setAccessible(true);
          Unsafe unsafe = (Unsafe)f.get(null);
          return unsafe.pageSize();
        } catch (Throwable e) {
          LOG.warn("Unable to get operating system page size.  Guessing 4096.", e);
          return 4096;
        }
      }
    
      private static class CachedUid {
        final long timestamp;
        final String username;
        public CachedUid(String username, long timestamp) {
          this.timestamp = timestamp;
          this.username = username;
        }
      }
      private static final Map<Long, CachedUid> uidCache =
          new ConcurrentHashMap<Long, CachedUid>();
      private static long cacheTimeout;
      private static boolean initialized = false;
      
      /**
       * The Windows logon name has two parts, the NetBIOS domain name and
       * the user account name, in the format DOMAIN\UserName. This method
       * will remove the domain part of the full logon name.
       *
       * @param name the full principal name containing the domain
       * @return name with domain removed
       */
      private static String stripDomain(String name) {
        int i = name.indexOf('\\');
        if (i != -1)
          name = name.substring(i + 1);
        return name;
      }
    
      public static String getOwner(FileDescriptor fd) throws IOException {
        ensureInitialized();
        if (Shell.WINDOWS) {
          String owner = Windows.getOwner(fd);
          owner = stripDomain(owner);
          return owner;
        } else {
          long uid = POSIX.getUIDforFDOwnerforOwner(fd);
          CachedUid cUid = uidCache.get(uid);
          long now = System.currentTimeMillis();
          if (cUid != null && (cUid.timestamp + cacheTimeout) > now) {
            return cUid.username;
          }
          String user = POSIX.getUserName(uid);
          LOG.info("Got UserName " + user + " for UID " + uid
              + " from the native implementation");
          cUid = new CachedUid(user, now);
          uidCache.put(uid, cUid);
          return user;
        }
      }
    
      /**
       * Create a FileDescriptor that shares delete permission on the
       * file opened at a given offset, i.e. other process can delete
       * the file the FileDescriptor is reading. Only Windows implementation
       * uses the native interface.
       */
      public static FileDescriptor getShareDeleteFileDescriptor(
          File f, long seekOffset) throws IOException {
        if (!Shell.WINDOWS) {
          RandomAccessFile rf = new RandomAccessFile(f, "r");
          if (seekOffset > 0) {
            rf.seek(seekOffset);
          }
          return rf.getFD();
        } else {
          // Use Windows native interface to create a FileDescriptor that
          // shares delete permission on the file opened, and set it to the
          // given offset.
          //
          FileDescriptor fd = NativeIO.Windows.createFile(
              f.getAbsolutePath(),
              NativeIO.Windows.GENERIC_READ,
              NativeIO.Windows.FILE_SHARE_READ |
                  NativeIO.Windows.FILE_SHARE_WRITE |
                  NativeIO.Windows.FILE_SHARE_DELETE,
              NativeIO.Windows.OPEN_EXISTING);
          if (seekOffset > 0)
            NativeIO.Windows.setFilePointer(fd, seekOffset, NativeIO.Windows.FILE_BEGIN);
          return fd;
        }
      }
    
      /**
       * Create the specified File for write access, ensuring that it does not exist.
       * @param f the file that we want to create
       * @param permissions we want to have on the file (if security is enabled)
       *
       * @throws AlreadyExistsException if the file already exists
       * @throws IOException if any other error occurred
       */
      public static FileOutputStream getCreateForWriteFileOutputStream(File f, int permissions)
          throws IOException {
        if (!Shell.WINDOWS) {
          // Use the native wrapper around open(2)
          try {
            FileDescriptor fd = NativeIO.POSIX.open(f.getAbsolutePath(),
                NativeIO.POSIX.O_WRONLY | NativeIO.POSIX.O_CREAT
                    | NativeIO.POSIX.O_EXCL, permissions);
            return new FileOutputStream(fd);
          } catch (NativeIOException nioe) {
            if (nioe.getErrno() == Errno.EEXIST) {
              throw new AlreadyExistsException(nioe);
            }
            throw nioe;
          }
        } else {
          // Use the Windows native APIs to create equivalent FileOutputStream
          try {
            FileDescriptor fd = NativeIO.Windows.createFile(f.getCanonicalPath(),
                NativeIO.Windows.GENERIC_WRITE,
                NativeIO.Windows.FILE_SHARE_DELETE
                    | NativeIO.Windows.FILE_SHARE_READ
                    | NativeIO.Windows.FILE_SHARE_WRITE,
                NativeIO.Windows.CREATE_NEW);
            NativeIO.POSIX.chmod(f.getCanonicalPath(), permissions);
            return new FileOutputStream(fd);
          } catch (NativeIOException nioe) {
            if (nioe.getErrorCode() == 80) {
              // ERROR_FILE_EXISTS
              // 80 (0x50)
              // The file exists
              throw new AlreadyExistsException(nioe);
            }
            throw nioe;
          }
        }
      }
    
      private synchronized static void ensureInitialized() {
        if (!initialized) {
          cacheTimeout =
              new Configuration().getLong("hadoop.security.uid.cache.secs",
                  4*60*60) * 1000;
          LOG.info("Initialized cache for UID to User mapping with a cache" +
              " timeout of " + cacheTimeout/1000 + " seconds.");
          initialized = true;
        }
      }
      
      /**
       * A version of renameTo that throws a descriptive exception when it fails.
       *
       * @param src                  The source path
       * @param dst                  The destination path
       * 
       * @throws NativeIOException   On failure.
       */
      public static void renameTo(File src, File dst)
          throws IOException {
        if (!nativeLoaded) {
          if (!src.renameTo(dst)) {
            throw new IOException("renameTo(src=" + src + ", dst=" +
              dst + ") failed.");
          }
        } else {
          renameTo0(src.getAbsolutePath(), dst.getAbsolutePath());
        }
      }
    
      /**
       * Creates a hardlink "dst" that points to "src".
       *
       * This is deprecated since JDK7 NIO can create hardlinks via the
       * {@link java.nio.file.Files} API.
       *
       * @param src source file
       * @param dst hardlink location
       * @throws IOException
       */
      @Deprecated
      public static void link(File src, File dst) throws IOException {
        if (!nativeLoaded) {
          HardLink.createHardLink(src, dst);
        } else {
          link0(src.getAbsolutePath(), dst.getAbsolutePath());
        }
      }
    
      /**
       * A version of renameTo that throws a descriptive exception when it fails.
       *
       * @param src                  The source path
       * @param dst                  The destination path
       * 
       * @throws NativeIOException   On failure.
       */
      private static native void renameTo0(String src, String dst)
          throws NativeIOException;
    
      private static native void link0(String src, String dst)
          throws NativeIOException;
    
      /**
       * Unbuffered file copy from src to dst without tainting OS buffer cache
       *
       * In POSIX platform:
       * It uses FileChannel#transferTo() which internally attempts
       * unbuffered IO on OS with native sendfile64() support and falls back to
       * buffered IO otherwise.
       *
       * It minimizes the number of FileChannel#transferTo call by passing the the
       * src file size directly instead of a smaller size as the 3rd parameter.
       * This saves the number of sendfile64() system call when native sendfile64()
       * is supported. In the two fall back cases where sendfile is not supported,
       * FileChannle#transferTo already has its own batching of size 8 MB and 8 KB,
       * respectively.
       *
       * In Windows Platform:
       * It uses its own native wrapper of CopyFileEx with COPY_FILE_NO_BUFFERING
       * flag, which is supported on Windows Server 2008 and above.
       *
       * Ideally, we should use FileChannel#transferTo() across both POSIX and Windows
       * platform. Unfortunately, the wrapper(Java_sun_nio_ch_FileChannelImpl_transferTo0)
       * used by FileChannel#transferTo for unbuffered IO is not implemented on Windows.
       * Based on OpenJDK 6/7/8 source code, Java_sun_nio_ch_FileChannelImpl_transferTo0
       * on Windows simply returns IOS_UNSUPPORTED.
       *
       * Note: This simple native wrapper does minimal parameter checking before copy and
       * consistency check (e.g., size) after copy.
       * It is recommended to use wrapper function like
       * the Storage#nativeCopyFileUnbuffered() function in hadoop-hdfs with pre/post copy
       * checks.
       *
       * @param src                  The source path
       * @param dst                  The destination path
       * @throws IOException
       */
      public static void copyFileUnbuffered(File src, File dst) throws IOException {
        if (nativeLoaded && Shell.WINDOWS) {
          copyFileUnbuffered0(src.getAbsolutePath(), dst.getAbsolutePath());
        } else {
          FileInputStream fis = new FileInputStream(src);
          FileChannel input = null;
          try {
            input = fis.getChannel();
            try (FileOutputStream fos = new FileOutputStream(dst);
                 FileChannel output = fos.getChannel()) {
              long remaining = input.size();
              long position = 0;
              long transferred = 0;
              while (remaining > 0) {
                transferred = input.transferTo(position, remaining, output);
                remaining -= transferred;
                position += transferred;
              }
            }
          } finally {
            IOUtils.cleanupWithLogger(LOG, input, fis);
          }
        }
      }
    
      private static native void copyFileUnbuffered0(String src, String dst)
          throws NativeIOException;
    }
    

      Third, about building this project with Maven: I was working on the company intranet and dependency downloads were extremely slow, so I changed strategy. I created a plain Java project and copied in the jars from the common, hdfs, httpfs, yarn and mapreduce directories under hadoop-2.9.2/share; quite a few issues came up while running. The jars I ended up with are listed below.

    hadoop-hdfs-2.9.2.jar
    hadoop-hdfs-client-2.9.2.jar
    hadoop-mapreduce-client-app-2.9.2.jar
    hadoop-mapreduce-client-common-2.9.2.jar
    hadoop-mapreduce-client-core-2.9.2.jar
    hadoop-mapreduce-client-hs-2.9.2.jar
    hadoop-mapreduce-client-jobclient-2.9.2-tests.jar
    hadoop-mapreduce-client-shuffle-2.9.2.jar
    hadoop-yarn-api-2.9.2.jar
    hadoop-yarn-applications-distributedshell-2.9.2.jar
    hadoop-yarn-applications-unmanaged-am-launcher-2.9.2.jar
    hadoop-yarn-client-2.9.2.jar
    activation-1.1.jar
    aopalliance-1.0.jar
    apacheds-i18n-2.0.0-M15.jar
    apacheds-kerberos-codec-2.0.0-M15.jar
    api-asn1-api-1.0.0-M20.jar
    api-util-1.0.0-M20.jar
    asm-3.2.jar
    avro-1.7.7.jar
    commons-beanutils-1.7.0.jar
    commons-beanutils-core-1.8.0.jar
    commons-cli-1.2.jar
    commons-codec-1.4.jar
    commons-collections-3.2.2.jar
    commons-compress-1.4.1.jar
    commons-configuration-1.6.jar
    commons-digester-1.8.jar
    commons-io-2.4.jar
    commons-lang-2.6.jar
    commons-lang3-3.4.jar
    commons-logging-1.1.3.jar
    commons-math3-3.1.1.jar
    commons-net-3.1.jar
    curator-client-2.7.1.jar
    curator-framework-2.7.1.jar
    curator-recipes-2.7.1.jar
    ehcache-3.3.1.jar
    fst-2.50.jar
    geronimo-jcache_1.0_spec-1.0-alpha-1.jar
    gson-2.2.4.jar
    guava-11.0.2.jar
    guice-3.0.jar
    guice-servlet-3.0.jar
    HikariCP-java7-2.4.12.jar
    htrace-core4-4.1.0-incubating.jar
    httpclient-4.5.2.jar
    httpcore-4.4.4.jar
    jackson-core-asl-1.9.13.jar
    jackson-jaxrs-1.9.13.jar
    jackson-mapper-asl-1.9.13.jar
    jackson-xc-1.9.13.jar
    java-util-1.9.0.jar
    java-xmlbuilder-0.4.jar
    javax.inject-1.jar
    jaxb-api-2.2.2.jar
    jaxb-impl-2.2.3-1.jar
    jcip-annotations-1.0-1.jar
    jersey-client-1.9.jar
    jersey-core-1.9.jar
    jersey-guice-1.9.jar
    jersey-json-1.9.jar
    jersey-server-1.9.jar
    jets3t-0.9.0.jar
    jettison-1.1.jar
    jetty-6.1.26.jar
    jetty-sslengine-6.1.26.jar
    jetty-util-6.1.26.jar
    jsch-0.1.54.jar
    json-io-2.5.1.jar
    json-smart-1.3.1.jar
    jsp-api-2.1.jar
    jsr305-3.0.0.jar
    leveldbjni-all-1.8.jar
    log4j-1.2.17.jar
    metrics-core-3.0.1.jar
    mssql-jdbc-6.2.1.jre7.jar
    netty-3.6.2.Final.jar
    nimbus-jose-jwt-4.41.1.jar
    paranamer-2.3.jar
    protobuf-java-2.5.0.jar
    servlet-api-2.5.jar
    snappy-java-1.0.5.jar
    stax-api-1.0-2.jar
    stax2-api-3.1.4.jar
    woodstox-core-5.0.3.jar
    xmlenc-0.52.jar
    xz-1.0.jar
    zookeeper-3.4.6.jar
    hadoop-common-2.9.2.jar
    slf4j-api-1.7.25.jar
    slf4j-log4j12-1.7.25.jar
    hadoop-yarn-server-nodemanager-2.9.2.jar
    hadoop-yarn-server-resourcemanager-2.9.2.jar
    hadoop-yarn-server-router-2.9.2.jar
    hadoop-yarn-server-sharedcachemanager-2.9.2.jar
    hadoop-yarn-server-timeline-pluginstorage-2.9.2.jar
    hadoop-yarn-server-web-proxy-2.9.2.jar
    hadoop-yarn-ui-2.9.2.war
    hadoop-annotations-2.9.2.jar
    hadoop-auth-2.9.2.jar
    hadoop-nfs-2.9.2.jar
    hamcrest-core-1.3.jar
    junit-4.11.jar
    hadoop-mapreduce-client-jobclient-2.9.2.jar
    mockito-all-1.8.5.jar
    ojdbc7.jar
    orai18n.jar
    hadoop-yarn-common-2.9.2.jar
    hadoop-yarn-registry-2.9.2.jar
    hadoop-yarn-server-applicationhistoryservice-2.9.2.jar
    hadoop-yarn-server-common-2.9.2.jar

    Preface

    Web logs hold some of a site's most important information. Through log analysis we can learn the site's traffic, which pages are visited the most, which pages are the most valuable, and so on. A typical mid-sized site (over 100,000 PV per day) generates more than 1 GB of web logs every day; a large or very large site can generate 10 GB of data every hour.

    For log data at this scale, Hadoop is an excellent fit for the analysis.

    Contents

    1. Overview of web log analysis
    2. Requirements analysis: KPI design
    3. Algorithm model: Hadoop parallel algorithms
    4. Architecture design: log KPI system architecture
    5. Development part 1: building a Hadoop project with Maven
    6. Development part 2: implementing the MapReduce programs

    1. Overview of Web Log Analysis

    Web logs are produced by web servers such as Nginx, Apache or Tomcat. From them we can obtain the PV (PageView, page view count) and the number of unique IPs for each class of page; with a little more work we can derive a ranking of users' search keywords or the pages where users stay the longest; going further still, we can build ad-click models, analyze user behavior patterns, and more.

    In a web log, each record normally represents one user visit. For example, the following is a single nginx log line:

    
    222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939
     "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1)
     AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    

    It breaks down into the following 8 fields:

    • remote_addr: the client IP address, 222.68.172.190
    • remote_user: the client user name, –
    • time_local: the access time and time zone, [18/Sep/2013:06:49:57 +0000]
    • request: the requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"
    • status: the request status (200 means success), 200
    • body_bytes_sent: the size of the response body sent to the client, 19939
    • http_referer: the page the request was linked from, "http://www.angularjs.cn/A00n"
    • http_user_agent: the client browser information, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

    Note: to capture more information you need other means, for example sending separate requests from JavaScript or using cookies to record the user's visits.
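
    As a quick illustration of how a single line maps onto these 8 fields, here is a small standalone sketch that extracts them with a regular expression. The KPI parser later in this post simply splits on spaces; the pattern and class name below are just for illustration:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NginxLineDemo {
        // $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
        private static final Pattern LOG = Pattern.compile(
                "^(\\S+) - (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

        public static void main(String[] args) {
            String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] \"GET /images/my.jpg HTTP/1.1\" "
                    + "200 19939 \"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) "
                    + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
            Matcher m = LOG.matcher(line);
            if (m.find()) {
                String[] names = {"remote_addr", "remote_user", "time_local", "request",
                        "status", "body_bytes_sent", "http_referer", "http_user_agent"};
                for (int i = 0; i < names.length; i++) {
                    System.out.println(names[i] + ": " + m.group(i + 1));
                }
            }
        }
    }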

    With this log information we can start digging into the site's secrets.

    The small-data case

    With small amounts of data (10 MB, 100 MB, even 10 GB), while a single machine can still cope, we can simply use the usual Unix/Linux tools: awk, grep, sort, join and friends are all excellent for log analysis, and combined with perl, python and regular expressions they can solve just about every problem.

    For example, to get the top 10 IPs by visit count from the nginx log above, the implementation is simple:

    
    ~ cat access.log.10 | awk '{a[$1]++} END {for(b in a) print b"	"a[b]}' | sort -k2 -r | head -n 10
    163.177.71.12   972
    101.226.68.137  972
    183.195.232.138 971
    50.116.27.194   97
    14.17.29.86     96
    61.135.216.104  94
    61.135.216.105  91
    61.186.190.41   9
    59.39.192.108   9
    220.181.51.212  9
    

    The big-data case

    When the data grows by 10 GB or 100 GB per day, a single machine can no longer keep up. We have to accept more system complexity and turn to compute clusters and storage arrays. Before Hadoop appeared, storing and analyzing data at this scale was very hard, and only a few companies held the core technologies for efficient parallel computing, distributed computing and distributed storage.

    The arrival of Hadoop drastically lowered the barrier to processing massive data, putting it within reach of small companies and even individuals. And Hadoop happens to be a very good fit for log analysis systems.

    2. Requirements Analysis: KPI Design

    Below, we start from a company case study and walk through, end to end, how to analyze massive web logs with Hadoop and extract KPI data.

    Case study:
    An e-commerce website running an online group-buying business. It sees 1,000,000 PV and 50,000 unique IPs per day. Traffic peaks on workdays between 10:00-12:00 and 15:00-18:00. During the day most visits come from desktop browsers; on weekends and at night mobile devices dominate. Search traffic accounts for 80% of the site's total; fewer than 1% of PC users make a purchase, while about 5% of mobile users do.

    From this short description we can get a rough picture of how this e-commerce site is doing: where the paying users come from, which potential users could be tapped, and whether the site is at risk of going under.

    KPI design

    • PV (PageView): page view counts
    • IP: unique IP counts per page
    • Time: PV counts per hour
    • Source: counts by referrer domain
    • Browser: counts by visitor device

    Note: due to commercial confidentiality, the e-commerce site's logs cannot be shared.
    The content below instead extracts and analyzes data from my personal website.

    Baidu Tongji (Baidu Analytics) statistics for my personal site: http://www.fens.me

    Basic statistics:
    hadoop-kpi-baidu

    Visitor device statistics:
    hadoop-kpi-baidu2

    From a business standpoint a personal site differs quite a bit from an e-commerce site: there is no conversion rate, and the bounce rate is relatively high. From a technical standpoint, though, both care about the same KPI design.

    3. Algorithm Model: Hadoop Parallel Algorithms

    hadoop-kpi-log

    Designing the parallel algorithms:
    Note: the variables below are the 8 fields defined in section 1.

    PV (PageView): page view counts

    • Map: {key: $request, value: 1}
    • Reduce: {key: $request, value: sum}

    IP: unique IP counts per page

    • Map: {key: $request, value: $remote_addr}
    • Reduce: {key: $request, value: count after deduplication (sum(unique))}

    Time: PV counts per hour

    • Map: {key: $time_local, value: 1}
    • Reduce: {key: $time_local, value: sum}

    Source: counts by referrer domain

    • Map: {key: $http_referer, value: 1}
    • Reduce: {key: $http_referer, value: sum}

    Browser: counts by visitor device

    • Map: {key: $http_user_agent, value: 1}
    • Reduce: {key: $http_user_agent, value: sum}
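
    Of the five indicators, only the IP one needs deduplication in the reduce step (the others are plain sums, like the PV reducer shown later). Here is a minimal sketch of such a reducer, written against the old (mapred) API used in the rest of this post; the class names are illustrative:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative reducer for the IP indicator: count distinct remote_addr values per request.
    public class KPIIPSketch {

        public static class IPReducer extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            private final Text result = new Text();

            @Override
            public void reduce(Text key, Iterator<Text> values,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
                Set<String> unique = new HashSet<String>();
                while (values.hasNext()) {
                    unique.add(values.next().toString()); // deduplicate the IP addresses
                }
                result.set(String.valueOf(unique.size()));
                output.collect(key, result); // emit (request, distinct IP count)
            }
        }
    }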

    4. Architecture Design: Log KPI System Architecture

    hadoop-kpi-architect

    In the diagram above, the left side is the application (business) system and the right side is Hadoop's HDFS and MapReduce.

    1. The logs are produced by the business systems. We can have the web server create a new directory every day, containing multiple log files of 64 MB each.
    2. A cron job runs shortly after midnight to import the previous day's log files into HDFS.
    3. Once the import completes, another scheduled job starts the MapReduce programs to extract and compute the KPI indicators.
    4. Once the computation completes, a further scheduled job exports the indicator data from HDFS into a database, for convenient ad-hoc queries later.

    hadoop-kpi-process

    This second diagram shows more clearly how the data flows. The parts on the blue background live inside Hadoop; our task from here is to implement the MapReduce programs.

    5. Development Part 1: Building a Hadoop Project with Maven

    See the article "用Maven构建Hadoop项目" (Building a Hadoop Project with Maven).

    The Windows 7 development environment and the Hadoop runtime environment were both covered in that article.

    We need to upload the log file into the /user/hdfs/log_kpi/ directory in HDFS, with commands along these lines:

    
    ~ hadoop fs -mkdir /user/hdfs/log_kpi
    ~ hadoop fs -copyFromLocal /home/conan/datafiles/access.log.10 /user/hdfs/log_kpi/
    

    I have published the complete MapReduce implementation on GitHub:

    https://github.com/bsspirit/maven_hadoop_template/releases/tag/kpi_v1

    6. Development Part 2: Implementing the MapReduce Programs

    Development steps:

    1. Parse the log lines
    2. Implement the map function
    3. Implement the reduce function
    4. Implement the driver (startup) program

    1). Parsing the log lines
    New file: org.conan.myhadoop.mr.kpi.KPI.java

    
    package org.conan.myhadoop.mr.kpi;
    
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    
    /*
     * KPI Object
     */
    public class KPI {
        private String remote_addr;// client IP address
        private String remote_user;// client user name; "-" when absent
        private String time_local;// access time and time zone
        private String request;// requested URL and HTTP protocol
        private String status;// request status; 200 means success
        private String body_bytes_sent;// size of the body sent to the client
        private String http_referer;// the page the request was linked from
        private String http_user_agent;// client browser information

        private boolean valid = true;// whether this record is valid
        
        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            sb.append("valid:" + this.valid);
            sb.append("\nremote_addr:" + this.remote_addr);
            sb.append("\nremote_user:" + this.remote_user);
            sb.append("\ntime_local:" + this.time_local);
            sb.append("\nrequest:" + this.request);
            sb.append("\nstatus:" + this.status);
            sb.append("\nbody_bytes_sent:" + this.body_bytes_sent);
            sb.append("\nhttp_referer:" + this.http_referer);
            sb.append("\nhttp_user_agent:" + this.http_user_agent);
            return sb.toString();
        }
    
        public String getRemote_addr() {
            return remote_addr;
        }
    
        public void setRemote_addr(String remote_addr) {
            this.remote_addr = remote_addr;
        }
    
        public String getRemote_user() {
            return remote_user;
        }
    
        public void setRemote_user(String remote_user) {
            this.remote_user = remote_user;
        }
    
        public String getTime_local() {
            return time_local;
        }
    
        public Date getTime_local_Date() throws ParseException {
            SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
            return df.parse(this.time_local);
        }
        
        public String getTime_local_Date_hour() throws ParseException{
            SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");
            return df.format(this.getTime_local_Date());
        }
    
        public void setTime_local(String time_local) {
            this.time_local = time_local;
        }
    
        public String getRequest() {
            return request;
        }
    
        public void setRequest(String request) {
            this.request = request;
        }
    
        public String getStatus() {
            return status;
        }
    
        public void setStatus(String status) {
            this.status = status;
        }
    
        public String getBody_bytes_sent() {
            return body_bytes_sent;
        }
    
        public void setBody_bytes_sent(String body_bytes_sent) {
            this.body_bytes_sent = body_bytes_sent;
        }
    
        public String getHttp_referer() {
            return http_referer;
        }
        
        public String getHttp_referer_domain(){
            if(http_referer.length()<8){ 
                return http_referer;
            }
            
            String str = this.http_referer.replace("\"", "").replace("http://", "").replace("https://", "");
            return str.indexOf("/")>0?str.substring(0, str.indexOf("/")):str;
        }
    
        public void setHttp_referer(String http_referer) {
            this.http_referer = http_referer;
        }
    
        public String getHttp_user_agent() {
            return http_user_agent;
        }
    
        public void setHttp_user_agent(String http_user_agent) {
            this.http_user_agent = http_user_agent;
        }
    
        public boolean isValid() {
            return valid;
        }
    
        public void setValid(boolean valid) {
            this.valid = valid;
        }
    
        public static void main(String args[]) {
            String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] \"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
            System.out.println(line);
            KPI kpi = new KPI();
            String[] arr = line.split(" ");
    
            kpi.setRemote_addr(arr[0]);
            kpi.setRemote_user(arr[1]);
            kpi.setTime_local(arr[3].substring(1));
            kpi.setRequest(arr[6]);
            kpi.setStatus(arr[8]);
            kpi.setBody_bytes_sent(arr[9]);
            kpi.setHttp_referer(arr[10]);
            kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
            System.out.println(kpi);
    
            try {
                SimpleDateFormat df = new SimpleDateFormat("yyyy.MM.dd:HH:mm:ss", Locale.US);
                System.out.println(df.format(kpi.getTime_local_Date()));
                System.out.println(kpi.getTime_local_Date_hour());
                System.out.println(kpi.getHttp_referer_domain());
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
    
    }
    

    Take a line from the log file and run a quick parsing test through the main function.

    Console output:

    
    222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
    valid:true
    remote_addr:222.68.172.190
    remote_user:-
    time_local:18/Sep/2013:06:49:57
    request:/images/my.jpg
    status:200
    body_bytes_sent:19939
    http_referer:"http://www.angularjs.cn/A00n"
    http_user_agent:"Mozilla/5.0 (Windows
    2013.09.18:06:49:57
    2013091806
    www.angularjs.cn
    

    We can see that the log line has been parsed correctly into the fields of the KPI object. Next we extract the parsing logic into a method of its own.

    
        private static KPI parser(String line) {
            System.out.println(line);
            KPI kpi = new KPI();
            String[] arr = line.split(" ");
            if (arr.length > 11) {
                kpi.setRemote_addr(arr[0]);
                kpi.setRemote_user(arr[1]);
                kpi.setTime_local(arr[3].substring(1));
                kpi.setRequest(arr[6]);
                kpi.setStatus(arr[8]);
                kpi.setBody_bytes_sent(arr[9]);
                kpi.setHttp_referer(arr[10]);
                
                if (arr.length > 12) {
                    kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
                } else {
                    kpi.setHttp_user_agent(arr[11]);
                }
    
                if (Integer.parseInt(kpi.getStatus()) >= 400) {// status >= 400 is treated as an HTTP error
                    kpi.setValid(false);
                }
            } else {
                kpi.setValid(false);
            }
            return kpi;
        }
    

    The map method, reduce method and driver are implemented in separate classes, one per indicator.

    The MapReduce implementation classes are introduced below:

    • PV:org.conan.myhadoop.mr.kpi.KPIPV.java
    • IP: org.conan.myhadoop.mr.kpi.KPIIP.java
    • Time: org.conan.myhadoop.mr.kpi.KPITime.java
    • Browser: org.conan.myhadoop.mr.kpi.KPIBrowser.java

    1). PV:org.conan.myhadoop.mr.kpi.KPIPV.java

    
    package org.conan.myhadoop.mr.kpi;
    
    import java.io.IOException;
    import java.util.Iterator;
    
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    
    public class KPIPV { 
    
        public static class KPIPVMapper extends MapReduceBase implements Mapper<Object, Text, Text, IntWritable> {
            private IntWritable one = new IntWritable(1);
            private Text word = new Text();
    
            @Override
            public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                KPI kpi = KPI.filterPVs(value.toString());
                if (kpi.isValid()) {
                    word.set(kpi.getRequest());
                    output.collect(word, one);
                }
            }
        }
    
        public static class KPIPVReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();
    
            @Override
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                result.set(sum);
                output.collect(key, result);
            }
        }
    
        public static void main(String[] args) throws Exception {
            String input = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/";
            String output = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv";
    
            JobConf conf = new JobConf(KPIPV.class);
            conf.setJobName("KPIPV");
            conf.addResource("classpath:/hadoop/core-site.xml");
            conf.addResource("classpath:/hadoop/hdfs-site.xml");
            conf.addResource("classpath:/hadoop/mapred-site.xml");
    
            conf.setMapOutputKeyClass(Text.class);
            conf.setMapOutputValueClass(IntWritable.class);
    
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
    
            conf.setMapperClass(KPIPVMapper.class);
            conf.setCombinerClass(KPIPVReducer.class);
            conf.setReducerClass(KPIPVReducer.class);
    
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
    
            FileInputFormat.setInputPaths(conf, new Path(input));
            FileOutputFormat.setOutputPath(conf, new Path(output));
    
            JobClient.runJob(conf);
            System.exit(0);
        }
    }
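
    One practical note when re-running this job: FileOutputFormat will fail if the pv directory from a previous run still exists. A small optional addition (not part of the original code) is to clear the output path in main() before submitting the job:

        // Optional: drop the previous run's output before JobClient.runJob(conf).
        // Requires: import org.apache.hadoop.fs.FileSystem;  (reuses `output` and `conf` from main() above)
        FileSystem fs = FileSystem.get(new java.net.URI(output), conf);
        Path outPath = new Path(output);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true); // recursive delete of the old result directory
        }

    Alternatively, you can simply run `hadoop fs -rmr /user/hdfs/log_kpi/pv` by hand before each run.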
    

    The program calls a method of the KPI class:

    KPI kpi = KPI.filterPVs(value.toString());

    Through the filterPVs method we gain finer control over what counts toward PV.

    Add the filterPVs method to KPI.java:

    
        /**
         * 按page的pv分类
         */
        public static KPI filterPVs(String line) {
            KPI kpi = parser(line);
            Set<String> pages = new HashSet<String>();
            pages.add("/about");
            pages.add("/black-ip-list/");
            pages.add("/cassandra-clustor/");
            pages.add("/finance-rhive-repurchase/");
            pages.add("/hadoop-family-roadmap/");
            pages.add("/hadoop-hive-intro/");
            pages.add("/hadoop-zookeeper-intro/");
            pages.add("/hadoop-mahout-roadmap/");
    
            if (!pages.contains(kpi.getRequest())) {
                kpi.setValid(false);
            }
            return kpi;
        }
    

    In filterPVs we define a set of pages to filter on, so that PV is counted only for those pages.
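
    Before submitting the whole job, the filter can be sanity-checked locally with a throwaway main placed in the same package as KPI. The log line below is made up for illustration only (it is not copied from the real access.log):

        // Quick local check of filterPVs -- the input line is illustrative only.
        public static void main(String[] args) {
            String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
                    + "\"GET /hadoop-hive-intro/ HTTP/1.1\" 200 19939 "
                    + "\"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36\"";
            KPI kpi = KPI.filterPVs(line);
            // Expected (assuming KPI's valid flag defaults to true): "/hadoop-hive-intro/ -> true",
            // since the page is in the whitelist and the status code is below 400.
            System.out.println(kpi.getRequest() + " -> " + kpi.isValid());
        }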

    Let's run KPIPV.java:

    
    2013-10-9 11:53:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
    信息: Starting flush of map output
    2013-10-9 11:53:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
    信息: Finished spill 0
    2013-10-9 11:53:28 org.apache.hadoop.mapred.Task done
    信息: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
    2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    信息: hdfs://192.168.1.210:9000/user/hdfs/log_kpi/access.log.10:0+3025757
    2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    信息: hdfs://192.168.1.210:9000/user/hdfs/log_kpi/access.log.10:0+3025757
    2013-10-9 11:53:30 org.apache.hadoop.mapred.Task sendDone
    信息: Task 'attempt_local_0001_m_000000_0' done.
    2013-10-9 11:53:30 org.apache.hadoop.mapred.Task initialize
    信息:  Using ResourceCalculatorPlugin : null
    2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    信息: 
    2013-10-9 11:53:30 org.apache.hadoop.mapred.Merger$MergeQueue merge
    信息: Merging 1 sorted segments
    2013-10-9 11:53:30 org.apache.hadoop.mapred.Merger$MergeQueue merge
    信息: Down to the last merge-pass, with 1 segments left of total size: 213 bytes
    2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    信息: 
    2013-10-9 11:53:30 org.apache.hadoop.mapred.Task done
    信息: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
    2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    信息: 
    2013-10-9 11:53:30 org.apache.hadoop.mapred.Task commit
    信息: Task attempt_local_0001_r_000000_0 is allowed to commit now
    2013-10-9 11:53:30 org.apache.hadoop.mapred.FileOutputCommitter commitTask
    信息: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv
    2013-10-9 11:53:31 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    信息:  map 100% reduce 0%
    2013-10-9 11:53:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    信息: reduce > reduce
    2013-10-9 11:53:33 org.apache.hadoop.mapred.Task sendDone
    信息: Task 'attempt_local_0001_r_000000_0' done.
    2013-10-9 11:53:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    信息:  map 100% reduce 100%
    2013-10-9 11:53:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    信息: Job complete: job_local_0001
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息: Counters: 20
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:   File Input Format Counters 
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Bytes Read=3025757
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:   File Output Format Counters 
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Bytes Written=183
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:   FileSystemCounters
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     FILE_BYTES_READ=545
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     HDFS_BYTES_READ=6051514
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     FILE_BYTES_WRITTEN=83472
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     HDFS_BYTES_WRITTEN=183
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:   Map-Reduce Framework
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Map output materialized bytes=217
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Map input records=14619
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Reduce shuffle bytes=0
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Spilled Records=16
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Map output bytes=2004
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Total committed heap usage (bytes)=376569856
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Map input bytes=3025757
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     SPLIT_RAW_BYTES=110
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Combine input records=76
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Reduce input records=8
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Reduce input groups=8
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Combine output records=8
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Reduce output records=8
    2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log
    信息:     Map output records=76
    

    Use the hadoop command to inspect the result file on HDFS:

    
    ~ hadoop fs -cat /user/hdfs/log_kpi/pv/part-00000
    
    /about  5
    /black-ip-list/ 2
    /cassandra-clustor/     3
    /finance-rhive-repurchase/      13
    /hadoop-family-roadmap/ 13
    /hadoop-hive-intro/     14
    /hadoop-mahout-roadmap/ 20
    /hadoop-zookeeper-intro/        6
    

    This gives us the PV counts for the specified pages from the log file we just processed.

    Specifying pages works like a site map for the website: without it, every requested URL would be listed; by supplying this "site map" we can find the information we actually need much more easily.

    The other statistics are extracted following the same approach as the PV implementation above; you can download the source code and run it to see the results for yourself.
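
    As an example of that shared pattern, here is a minimal sketch of what the hourly-traffic (Time) mapper could look like in the same old-API style. KPI.filterTime(...) is a hypothetical helper named here by analogy with filterPVs, and getTime_local_Date_hour() is assumed to return the "2013091806"-style field shown earlier; the actual KPITime.java is in the project source.

        // Sketch only -- same imports as KPIPV.java above.
        // KPI.filterTime(...) is a hypothetical factory, analogous to KPI.filterPVs(...).
        public static class KPITimeMapper extends MapReduceBase implements Mapper<Object, Text, Text, IntWritable> {
            private IntWritable one = new IntWritable(1);
            private Text word = new Text();

            @Override
            public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                KPI kpi = KPI.filterTime(value.toString()); // hypothetical helper name
                if (kpi.isValid()) {
                    word.set(kpi.getTime_local_Date_hour()); // key by hour, e.g. 2013091806
                    output.collect(word, one);
                }
            }
        }

    The reducer is the same summing reducer as KPIPVReducer, so each output row becomes an (hour, request count) pair.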

    Later I will upload my code to GitHub:

    https://github.com/blench/
