Spark 2.2 (Part 39): How to monitor a Spark job by appName, starting it when it is not running, and killing it when it has had no activity for too long so it can be restarted on the next run


    Business requirement

    Implement a monitor that uses a Spark job's appName to check whether the job exists and whether it has hung. The monitor script is meant to be run periodically (for example from crontab):

    1) Given an appName, verify against the output of yarn application -list whether the job exists; if it does not, call the spark-submit.sh script to start it;

    2) If the job does appear in yarn application -list, read the monitor file (its content is: appId,last activity time) and take the job's last activity time from it. Compute the difference X between the current time and that last activity time. If X is greater than 30 minutes, the job is considered hung (in our release environment we saw jobs whose DAG threw an OOM, so the app's executors and driver stayed alive while no tasks were scheduled and the program was stuck; for details see https://issues.apache.org/jira/browse/SPARK-26452). In that case run yarn application -kill appId to kill the job and wait for the next run of the monitor script to restart it.
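
    For reference, for the appName myBatchTopic the monitor file lives at /user/dx/streaming/monitor/myBatchTopic and holds a single line of the form appId,lastActivityTime; the values below are illustrative:

    application_1543820999543_0193,2019-01-09 17:30:00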

    Monitoring implementation

    Script

    #!/bin/bash
    #LANG=zh_CN.utf8
    #export LANG
    export SPARK_KAFKA_VERSION=0.10
    export LANG=zh_CN.UTF-8
    # export env variable
    if [ -f ~/.bash_profile ];
    then
       source ~/.bash_profile
    fi
    source /etc/profile
    
    myAppName='myBatchTopic'               # appName of the Spark job to monitor. Note: if this name is not unique, the monitor will fail.
    apps=''
    
    for app in `yarn application -list`
    do
      apps=${app},$apps
    done
    apps=${apps%?}
    
    if [[ $apps =~ $myAppName ]];
    then
      echo "appName($myAppName) exists in yarn application list"
      #1) Run hadoop fs -cat /<dir>/<appName> and read the last update time from it (if the file does not exist yet, skip this round and wait for it to be generated).
      monitorInfo=$(hadoop fs -cat /user/dx/streaming/monitor/${myAppName} 2>/dev/null)
      if [ -z "$monitorInfo" ];
      then
        echo "monitor file for ${myAppName} does not exist yet, skip this round"
        exit 0
      fi
      OLD_IFS="$IFS"
      IFS=","
      array=($monitorInfo)
      IFS="$OLD_IFS"
      appId=${array[0]}
      monitorLastDate=${array[1]}
      echo "loading monitor information 'appId:$appId,monitorLastUpdateDate:$monitorLastDate'"
    
      current_date=$(date "+%Y-%m-%d %H:%M:%S")
      echo "loading current date '$current_date'"
      
      #2) Compare with the current time:
      # if the difference from now is less than 30 minutes, do nothing;
      # if it is greater than 30 minutes, kill the job: using the appId read from the monitor file, run yarn application -kill appId
      t1=`date -d "$current_date" +%s`
      t2=`date -d "$monitorLastDate" +%s`
      diff_minute=$(($(($t1-$t2))/60))
      echo "current date($current_date) over than monitorLastDate($monitorLastDate) $diff_minute minutes"
      if [ $diff_minute -gt 30 ];
      then
        echo 'more than 30 minutes'
        yarn application -kill ${appId}
        echo "killed application ${appId}"
      else
        echo 'less than 30 minutes'
      fi
    else
      echo "appName($myAppName) not exists in yarn application list"
      #./submit_x1_x2.sh abc TestRestartDriver #这里指定需要启动的脚本来启动相关任务
      $(nohup ./submit_checkpoint2.sh >> ./output.log 2>&1 &)
    fi

    Flow chart of the monitoring script (figure omitted).

    Monitor file generation

    The program here is a Spark Structured Streaming job, so we can register a listener for the query events of the sparkSession's streams():

    sparkSession.streams().addListener(new GlobalStreamingQueryListener(sparkSession, ...))
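
    For context, here is a minimal driver-side sketch of how the listener and its accumulator might be wired together; the class name MonitorDriver and the accumulator name are illustrative, not from the original post:

    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.util.LongAccumulator;

    public class MonitorDriver {
        public static void main(String[] args) {
            // The appName must match the name the monitor script checks for.
            SparkSession sparkSession = SparkSession.builder()
                    .appName("myBatchTopic")
                    .getOrCreate();

            // Driver-side accumulator that the listener increments on every trigger.
            LongAccumulator triggerAccumulator =
                    sparkSession.sparkContext().longAccumulator("triggerAccumulator");

            sparkSession.streams().addListener(
                    new GlobalStreamingQueryListener(sparkSession, triggerAccumulator));

            // ... define and start the streaming query here, then awaitTermination() ...
        }
    }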

    In the listener, implement the following:

    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQueryListener;
    import org.apache.spark.util.LongAccumulator;

    public class GlobalStreamingQueryListener extends StreamingQueryListener {
        private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(GlobalStreamingQueryListener.class);
        private static final String monitorBaseDir = "/user/dx/streaming/monitor/";
        private SparkSession sparkSession = null;
        private LongAccumulator triggerAccumulator = null;
    
        public GlobalStreamingQueryListener(SparkSession sparkSession, LongAccumulator triggerAccumulator) {
            this.sparkSession = sparkSession;
            this.triggerAccumulator = triggerAccumulator;
        }
    
        @Override
        public void onQueryStarted(QueryStartedEvent queryStarted) {
            System.out.println("Query started: " + queryStarted.id());
        }
    
        @Override
        public void onQueryTerminated(QueryTerminatedEvent queryTerminated) {
            System.out.println("Query terminated: " + queryTerminated.id());
        }
    
        @Override
        public void onQueryProgress(QueryProgressEvent queryProgress) {
            System.out.println("Query made progress: " + queryProgress.progress());
            // sparkSession.sql("select * from " +
            // queryProgress.progress().name()).show();
    
            triggerAccumulator.add(1);
            System.out.println("Trigger accumulator value: " + triggerAccumulator.value());
    
            logger.info("minitor start .... ");
            try {
                if (HDFSUtility.createDir(monitorBaseDir)) {
                    logger.info("Create monitor base dir(" + monitorBaseDir + ") success");
                } else {
                    logger.info("Create monitor base dir(" + monitorBaseDir + ") fail");
                }
            } catch (IOException e) {
                logger.error("An error was thrown while create monitor base dir(" + monitorBaseDir + ")");
                e.printStackTrace();
            }
    
            // spark.app.id application_1543820999543_0193
            String appId = this.sparkSession.conf().get("spark.app.id");
            // spark.app.name myBatchTopic
            String appName = this.sparkSession.conf().get("spark.app.name");
            // HDFS paths always use "/", so avoid the platform-dependent File.separator here.
            String monitorFilePath = (monitorBaseDir.endsWith("/") ? monitorBaseDir : monitorBaseDir + "/") + appName;
            logger.info("The application's id is " + appId);
            logger.info("The application's name is " + appName);
            logger.warn("If the appName is not unique, it will result in a monitor error");
    
            try {
                HDFSUtility.overwriter(monitorFilePath, appId + "," + new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date()));
            } catch (IOException e) {
                logger.error("An error was thrown while write info to monitor file(" + mintorFilePath + ")");
                e.printStackTrace();
            }
            logger.info("minitor stop .... ");
        }
    
    }
    The methods in HDFSUtility.java are as follows:
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HDFSUtility {
        private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HDFSUtility.class);
    
        /**
         * Create the directory if it does not already exist.
         * 
         * @param dirPath
         *            target directory
         * @return true - created successfully (or already exists); false - failed.
         * @throws IOException
         * */
        public static boolean createDir(String dirPath) throws IOException {
            FileSystem fs = null;
            Path dir = new Path(dirPath);
            boolean success = false;
            try {
                fs = FileSystem.get(new Configuration());
    
                if (!fs.exists(dir)) {
                    success = fs.mkdirs(dir);
                } else {
                    success = true;
                }
            } catch (IOException e) {
                logger.error("create dir (" + dirPath + ") fail:", e);
                throw e;
            } finally {
                if (fs != null) {
                    try {
                        fs.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
    
            return success;
        }
    
        /**
         * Overwrite the target file with the given content.
         * 
         * @param filePath
         *            target file path
         * @param content
         *            content to be written
         * @throws IOException
         * */
        public static void overwriter(String filePath, String content) throws IOException {
            FileSystem fs = null;
            // Create an FSDataOutputStream at the given path; by default it overwrites the file.
            FSDataOutputStream outputStream = null;
            Path file = new Path(filePath);
    
            try {
                fs = FileSystem.get(new Configuration());
                if (fs.exists(file)) {
                    System.out.println("File exists(" + filePath + ")");
                }
                outputStream = fs.create(file);
                outputStream.write(content.getBytes());
            } catch (IOException e) {
                logger.error("write into file(" + filePath + ") fail:", e);
                throw e;
            } finally {
                if (outputStream != null) {
                    try {
                        outputStream.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                if (fs != null) {
                    try {
                        fs.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }
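
    For reference, a minimal sketch of how these utilities produce the monitor file that the shell script above reads; the class name MonitorFileDemo and the appId value are illustrative:

    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class MonitorFileDemo {
        public static void main(String[] args) throws IOException {
            // Make sure the monitor directory exists, then overwrite the
            // "appId,lastActivityTime" record that the monitor script parses.
            HDFSUtility.createDir("/user/dx/streaming/monitor/");
            String record = "application_1543820999543_0193,"
                    + new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
            HDFSUtility.overwriter("/user/dx/streaming/monitor/myBatchTopic", record);
        }
    }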
     
Original post: https://www.cnblogs.com/yy3b2007com/p/10241914.html