Hadoop in the Big Data Era (Part 2): Hadoop Script Analysis



    Previous post: Hadoop in the Big Data Era (Part 1): Hadoop Installation



    "An army marches on its stomach": before digging into Hadoop itself, I think the scripts that start and stop it are worth understanding first. At its core, Hadoop is a distributed storage and computation framework, but how is that distributed environment actually started and managed? Let's begin with the scripts. Frankly, Hadoop's startup scripts are very well written; they handle a lot of corner cases (paths containing spaces, symlinks, and so on).


    1. Overview of the Hadoop scripts


       The scripts live in the bin and conf directories under $HADOOP_HOME. The main ones are:

    bin directory:
    hadoop                 The core script; every distributed daemon and client is ultimately launched through it.
    hadoop-config.sh       Sourced by almost every other script; it parses the optional command-line arguments (--config: the Hadoop conf directory, and --hosts).
    hadoop-daemon.sh       Starts or stops, on the local machine, the daemon named by its command argument, by calling the hadoop script.
    hadoop-daemons.sh      Starts a Hadoop daemon on all machines, by calling slaves.sh.
    slaves.sh              Runs a given command on all machines (over passwordless ssh); the higher-level scripts build on it.
    start-dfs.sh           Starts the namenode on the local machine, datanodes on the slaves hosts, and the secondarynamenode on the masters hosts, by calling hadoop-daemon.sh and hadoop-daemons.sh.
    start-mapred.sh        Starts the jobtracker on the local machine and tasktrackers on the slaves hosts, by calling hadoop-daemon.sh and hadoop-daemons.sh.
    start-all.sh           Starts all the Hadoop daemons, by calling start-dfs.sh and start-mapred.sh.
    start-balancer.sh      Starts the HDFS balancer, which evens out storage load across the datanodes.


    There are also the matching stop-* scripts; they mirror the start scripts and need no separate discussion.


    conf directory:
    hadoop-env.sh          Sets the environment variables Hadoop needs at runtime, such as JAVA_HOME, HADOOP_LOG_DIR, and HADOOP_PID_DIR.


    2. The scripts in detail

    The Hadoop scripts are genuinely well written, and there is a lot to be learned from them.

    2.1 hadoop-config.sh

     This script is fairly simple. Almost every other script sources it with ". $bin/hadoop-config.sh", so it does not need an interpreter declaration on its first line: sourcing effectively pastes its contents into the parent script, where they run in the same interpreter.
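    For illustration, this is roughly how the calling scripts pull it in (a condensed sketch; the exact lines vary slightly from script to script):

    # resolve the directory of the calling script, then source the shared config
    bin=`dirname "$0"`
    bin=`cd "$bin"; pwd`
    . "$bin"/hadoop-config.sh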

         The script does three things:

    1. Resolve symlinks and convert the script path to an absolute path
    # resolve symlinks: follow the chain until $this names the real script
    this="$0"
    while [ -h "$this" ]; do
      ls=`ls -ld "$this"`
      link=`expr "$ls" : '.*-> \(.*\)$'`
      if expr "$link" : '.*/.*' > /dev/null; then
        this="$link"
      else
        this=`dirname "$this"`/"$link"
      fi
    done
    
    # convert relative path to absolute path
    bin=`dirname "$this"`
    script=`basename "$this"`
    bin=`cd "$bin"; pwd`
    this="$bin/$script"
    
    # the root of the Hadoop installation
    export HADOOP_HOME=`dirname "$this"`/..

    2. Parse and assign the optional --config argument
    #check to see if the conf dir is given as an optional argument
    if [ $# -gt 1 ]
    then
        if [ "--config" = "$1" ]
    	  then
    	      shift
    	      confdir=$1
    	      shift
    	      HADOOP_CONF_DIR=$confdir
        fi
    fi
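    This is what lets every Hadoop command be pointed at an alternative configuration directory, for example (the path here is just an illustration):

    # use the configuration under /etc/hadoop/conf instead of the default
    bin/hadoop --config /etc/hadoop/conf fs -ls /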

    3. Parse and assign the optional --hosts argument
    #check to see it is specified whether to use the slaves or the
    # masters file
    if [ $# -gt 1 ]
    then
        if [ "--hosts" = "$1" ]
        then
            shift
            slavesfile=$1
            shift
            export HADOOP_SLAVES="${HADOOP_CONF_DIR}/$slavesfile"
        fi
    fi

    2.2 hadoop


        This is the core script: it sets the runtime variables, and every daemon and client program is launched through it.

    1. Print the usage message
    # if no args specified, show usage
    if [ $# = 0 ]; then
      echo "Usage: hadoop [--config confdir] COMMAND"
      echo "where COMMAND is one of:"
      echo "  namenode -format     format the DFS filesystem"
      echo "  secondarynamenode    run the DFS secondary namenode"
      echo "  namenode             run the DFS namenode"
      echo "  datanode             run a DFS datanode"
      echo "  dfsadmin             run a DFS admin client"
      echo "  mradmin              run a Map-Reduce admin client"
      echo "  fsck                 run a DFS filesystem checking utility"
      echo "  fs                   run a generic filesystem user client"
      echo "  balancer             run a cluster balancing utility"
      echo "  jobtracker           run the MapReduce job Tracker node" 
      echo "  pipes                run a Pipes job"
      echo "  tasktracker          run a MapReduce task Tracker node" 
      echo "  job                  manipulate MapReduce jobs"
      echo "  queue                get information regarding JobQueues" 
      echo "  version              print the version"
      echo "  jar <jar>            run a jar file"
      echo "  distcp <srcurl> <desturl> copy file or directories recursively"
      echo "  archive -archiveName NAME <src>* <dest> create a hadoop archive"
      echo "  daemonlog            get/set the log level for each daemon"
      echo " or"
      echo "  CLASSNAME            run the class named CLASSNAME"
      echo "Most commands print help when invoked w/o parameters."
      exit 1
    fi

    2. Set up the Java runtime environment

            The code is straightforward, so it is not reproduced in full. It locates java through JAVA_HOME and sets JAVA_HEAP_MAX, CLASSPATH, HADOOP_LOG_DIR, and HADOOP_POLICYFILE. One detail worth noting: it adjusts IFS, the shell's field separator (whose default is whitespace: newline, tab, or space), so that file names containing spaces survive the classpath-building loops.
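    A condensed sketch of that setup, assuming a Hadoop 1.x-era bin/hadoop (abridged; the real script also handles build trees and several extra lib directories):

    if [ "$JAVA_HOME" = "" ]; then
      echo "Error: JAVA_HOME is not set."
      exit 1
    fi
    JAVA=$JAVA_HOME/bin/java
    JAVA_HEAP_MAX=-Xmx1000m

    # allow hadoop-env.sh to override the default heap size
    if [ "$HADOOP_HEAPSIZE" != "" ]; then
      JAVA_HEAP_MAX="-Xmx""$HADOOP_HEAPSIZE""m"
    fi

    # so that filenames with spaces are handled correctly in the loops below
    IFS=

    # CLASSPATH starts with the conf dir, then accumulates the jars
    CLASSPATH="${HADOOP_CONF_DIR}"
    CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
    for f in $HADOOP_HOME/hadoop-*-core.jar; do
      CLASSPATH=${CLASSPATH}:$f;
    done
    for f in $HADOOP_HOME/lib/*.jar; do
      CLASSPATH=${CLASSPATH}:$f;
    done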


    3. Choose the class to run based on the command

    # figure out which class to run
    if [ "$COMMAND" = "namenode" ] ; then
      CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"
    elif [ "$COMMAND" = "secondarynamenode" ] ; then
      CLASS='org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode'
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_SECONDARYNAMENODE_OPTS"
    elif [ "$COMMAND" = "datanode" ] ; then
      CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
    elif [ "$COMMAND" = "fs" ] ; then
      CLASS=org.apache.hadoop.fs.FsShell
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "dfs" ] ; then
      CLASS=org.apache.hadoop.fs.FsShell
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "dfsadmin" ] ; then
      CLASS=org.apache.hadoop.hdfs.tools.DFSAdmin
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "mradmin" ] ; then
      CLASS=org.apache.hadoop.mapred.tools.MRAdmin
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "fsck" ] ; then
      CLASS=org.apache.hadoop.hdfs.tools.DFSck
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "balancer" ] ; then
      CLASS=org.apache.hadoop.hdfs.server.balancer.Balancer
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_BALANCER_OPTS"
    elif [ "$COMMAND" = "jobtracker" ] ; then
      CLASS=org.apache.hadoop.mapred.JobTracker
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOBTRACKER_OPTS"
    elif [ "$COMMAND" = "tasktracker" ] ; then
      CLASS=org.apache.hadoop.mapred.TaskTracker
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_TASKTRACKER_OPTS"
    elif [ "$COMMAND" = "job" ] ; then
      CLASS=org.apache.hadoop.mapred.JobClient
    elif [ "$COMMAND" = "queue" ] ; then
      CLASS=org.apache.hadoop.mapred.JobQueueClient
    elif [ "$COMMAND" = "pipes" ] ; then
      CLASS=org.apache.hadoop.mapred.pipes.Submitter
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "version" ] ; then
      CLASS=org.apache.hadoop.util.VersionInfo
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "jar" ] ; then
      CLASS=org.apache.hadoop.util.RunJar
    elif [ "$COMMAND" = "distcp" ] ; then
      CLASS=org.apache.hadoop.tools.DistCp
      CLASSPATH=${CLASSPATH}:${TOOL_PATH}
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "daemonlog" ] ; then
      CLASS=org.apache.hadoop.log.LogLevel
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "archive" ] ; then
      CLASS=org.apache.hadoop.tools.HadoopArchives
      CLASSPATH=${CLASSPATH}:${TOOL_PATH}
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "sampler" ] ; then
      CLASS=org.apache.hadoop.mapred.lib.InputSampler
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    else
      CLASS=$COMMAND
    fi
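    Note the final else branch: any unrecognized command is treated as a fully qualified class name, which is what makes the "CLASSNAME" form from the usage message work. For example (the class name here is hypothetical):

    # runs com.example.MyTool with Hadoop's classpath and JVM options
    bin/hadoop com.example.MyTool arg1 arg2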

    4. Set up the native library path

    # setup 'java.library.path' for native-hadoop code if necessary
    JAVA_LIBRARY_PATH=''
    if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" ]; then
    # determine the current platform by running a Java class; a neat trick
      JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`
      
      if [ -d "$HADOOP_HOME/build/native" ]; then
        JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
      fi
      
      if [ -d "${HADOOP_HOME}/lib/native" ]; then
        if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
          JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
        else
          JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
        fi
      fi
    fi
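    You can reproduce the platform probe by hand thanks to the CLASSNAME fallback described above; on a 64-bit Linux JVM the output is something like the following (the sed in the script then replaces any spaces with underscores):

    $ bin/hadoop org.apache.hadoop.util.PlatformName
    Linux-amd64-64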

    5. Launch the program

     # run it
    exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
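    Because exec replaces the shell process with the JVM, the daemon ends up running under the pid the wrapper script held. For bin/hadoop namenode, the final command comes out roughly like this (paths and options depend on the environment):

    # roughly what `bin/hadoop namenode` ends up exec'ing (paths illustrative)
    exec /usr/java/latest/bin/java -Xmx1000m $HADOOP_NAMENODE_OPTS \
      -classpath "$CLASSPATH" org.apache.hadoop.hdfs.server.namenode.NameNode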


    2.3 hadoop-daemon.sh

           Starts or stops, on the local machine, the daemon named by its command argument, by calling the hadoop script. It is actually quite simple.

    1. Print the usage message

    usage="Usage: hadoop-daemon.sh [--config <conf-dir>] [--hosts hostlistfile] (start|stop) <hadoop-command> <args...>"
    
    # if no args specified, show usage
    if [ $# -le 1 ]; then
      echo $usage
      exit 1
    fi

    2. Set up environment variables

    The script first sources hadoop-env.sh, then sets HADOOP_PID_DIR and related variables.
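    A sketch of that setup, again assuming a Hadoop 1.x-era script (abridged):

    if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
      . "${HADOOP_CONF_DIR}/hadoop-env.sh"
    fi

    if [ "$HADOOP_PID_DIR" = "" ]; then
      HADOOP_PID_DIR=/tmp
    fi

    if [ "$HADOOP_IDENT_STRING" = "" ]; then
      export HADOOP_IDENT_STRING="$USER"
    fi

    # log and pid files are keyed by user, command, and host
    log=$HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-$command-$HOSTNAME.out
    pid=$HADOOP_PID_DIR/hadoop-$HADOOP_IDENT_STRING-$command.pid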


    3. Start or stop the daemon

    case $startStop in
    
      (start)
    
        mkdir -p "$HADOOP_PID_DIR"
    
        if [ -f $pid ]; then
        # if the daemon is already running, tell the user to stop it first, and exit
          if kill -0 `cat $pid` > /dev/null 2>&1; then
            echo $command running as process `cat $pid`.  Stop it first.
            exit 1
          fi
        fi
    
        if [ "$HADOOP_MASTER" != "" ]; then
          echo rsync from $HADOOP_MASTER
          rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
        fi
    # rotate any existing log file
        hadoop_rotate_log $log
        echo starting $command, logging to $log
        cd "$HADOOP_HOME"
        # launch the daemon through nohup and the bin/hadoop script
        nohup nice -n $HADOOP_NICENESS "$HADOOP_HOME"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null &
        # capture the new process's pid and write it to the pid file
        echo $! > $pid
        sleep 1; head "$log"
        ;;
              
      (stop)
    
        if [ -f $pid ]; then
          if kill -0 `cat $pid` > /dev/null 2>&1; then
            echo stopping $command
            kill `cat $pid`
          else
            echo no $command to stop
          fi
        else
          echo no $command to stop
        fi
        ;;
    
      (*)
        echo $usage
        exit 1
        ;;
    esac
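    For reference, hadoop_rotate_log (defined near the top of the same script) keeps a small history of old logs by shifting log.1 to log.2 and so on before the new run starts. A sketch of it, reconstructed from the 1.x script, so treat the details as approximate:

    hadoop_rotate_log ()
    {
        log=$1;
        num=5;
        if [ -n "$2" ]; then
            num=$2
        fi
        if [ -f "$log" ]; then # rotate logs
            while [ $num -gt 1 ]; do
                prev=`expr $num - 1`
                [ -f "$log.$prev" ] && mv "$log.$prev" "$log.$num"
                num=$prev
            done
            mv "$log" "$log.$num";
        fi
    }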


    2.4 slaves.sh

    Runs a given command on every machine in the host list (over passwordless ssh); the higher-level scripts build on it.


    1. Print the usage message

    usage="Usage: slaves.sh [--config confdir] command..."
    
    # if no args specified, show usage
    if [ $# -le 0 ]; then
      echo $usage
      exit 1
    fi

    2. Determine the list of remote hosts

    # If the slaves file is specified in the command line,
    # then it takes precedence over the definition in 
    # hadoop-env.sh. Save it here.
    HOSTLIST=$HADOOP_SLAVES
    
    if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
      . "${HADOOP_CONF_DIR}/hadoop-env.sh"
    fi
    
    if [ "$HOSTLIST" = "" ]; then
      if [ "$HADOOP_SLAVES" = "" ]; then
        export HOSTLIST="${HADOOP_CONF_DIR}/slaves"
      else
        export HOSTLIST="${HADOOP_SLAVES}"
      fi
    fi

    3. Run the command on each remote host

    # This short loop does a lot: it strips comments and blank lines from the host file, escapes the spaces in the command line, runs the command on each target host over ssh (tagging every output line with the host name), and finally waits for the command to finish on all hosts before exiting.
    for slave in `cat "$HOSTLIST"|sed  "s/#.*$//;/^$/d"`; do
     ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" \
       2>&1 | sed "s/^/$slave: /" &
     if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
       sleep $HADOOP_SLAVE_SLEEP
     fi
    done
    
    wait
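    Because the command is arbitrary, slaves.sh is useful on its own as well, for example to check that every slave is reachable (assuming passwordless ssh is already set up):

    # prints each host's uptime, prefixed with the host name
    bin/slaves.sh uptime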

    2.5 hadoop-daemons.sh

          Starts Hadoop daemons on the remote machines, by calling slaves.sh.

    1. Print the usage message

    # Run a Hadoop command on all slave hosts.
    
    usage="Usage: hadoop-daemons.sh [--config confdir] [--hosts hostlistfile] [start|stop] command args..."
    
    # if no args specified, show usage
    if [ $# -le 1 ]; then
      echo $usage
      exit 1
    fi

    2. Invoke the command on the remote hosts
     # implemented via slaves.sh; the escaped semicolon survives local parsing and is interpreted by the remote shell, so the cd and the daemon start run as one remote command
     exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_HOME" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@"
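    On each slave this produces a remote invocation roughly like the following (host name and paths are illustrative):

    ssh slave1 "cd /opt/hadoop ; /opt/hadoop/bin/hadoop-daemon.sh --config /opt/hadoop/conf start datanode"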
    

    2.6 start-dfs.sh

          Starts the namenode on the local machine (the one this script is invoked on), datanodes on the slaves hosts, and the secondarynamenode on the masters hosts, via hadoop-daemon.sh and hadoop-daemons.sh.


    1. Print the usage message

    # Start hadoop dfs daemons.
    # Optionally upgrade or rollback dfs state.
    # Run this on master node.
    
    usage="Usage: start-dfs.sh [-upgrade|-rollback]"

    2. Start the daemons

    # start dfs daemons
    # start namenode after datanodes, to minimize time namenode is up w/o data
    # note: datanodes will log connection errors until namenode starts
    # start the namenode on the local machine (the one invoking this script)
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
    # start datanodes on the hosts listed in the slaves file
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
    # start the secondarynamenode on the hosts listed in the masters file
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
    

    2.7 start-mapred.sh

         Starts the jobtracker on the local machine (the one this script is invoked on) and tasktrackers on the slaves hosts, via hadoop-daemon.sh and hadoop-daemons.sh.

    # start mapred daemons
    # start jobtracker first to minimize connection errors at startup
    # start the jobtracker on the local machine (the one invoking this script)
    "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
    # start tasktrackers on the hosts listed in the slaves file
    "$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker


    The remaining scripts are all quite simple and need no detailed walkthrough; a quick read is enough to follow them.


    Finally, a word about the interpreter declaration at the top of the Hadoop scripts:

    #!/usr/bin/env bash
    It makes the scripts portable across Linux variants: env locates bash through the PATH rather than relying on a hard-coded location, which is a handy trick in its own right.


    Original post: https://www.cnblogs.com/slgkaifa/p/6978788.html