Notes on Hive window functions
https://www.cnblogs.com/zz-ksw/p/12917693.html
Hadoop basics: common HDFS API operations
https://www.cnblogs.com/yinzhengjie/p/9906192.html
YARN's three resource schedulers explained
https://www.cnblogs.com/zz-ksw/p/12895909.html
https://winyter.github.io/MyBlog/2020/05/23/yarn-fair-scheduler-guide/
How many applications YARN can run in parallel
https://cloud.tencent.com/developer/article/1534332
To make it easy to control the number of running applications (i.e., those in the RUNNING state), YARN provides an important configuration parameter:
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>
        Maximum percent of resources in the cluster which can be used to run
        application masters i.e. controls number of concurrent running applications.
    </description>
</property>
The configuration file is hadoop-2.7.4/etc/hadoop/capacity-scheduler.xml.
The parameter's meaning is straightforward: the total memory used by all ApplicationMasters (AMs) must stay below a given fraction of the total memory managed by YARN; the default is 0.1.
In other words, the number of applications YARN can run concurrently is bounded by this parameter and the memory of a single AM.
Two factors determine how many applications YARN can run concurrently:
1. YARN's minimum memory allocation unit (yarn.scheduler.minimum-allocation-mb);
2. the yarn.scheduler.capacity.maximum-am-resource-percent parameter in hadoop-2.7.4/etc/hadoop/capacity-scheduler.xml shown above.
So keep the minimum scheduling memory at the default 1 GB (in practice it usually needs no adjustment) and raise the AM memory fraction, for example to 1.
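As a back-of-the-envelope check, the concurrency limit implied by the two factors above can be computed as follows. The cluster size here is hypothetical; only the 0.1 default and the 1 GB minimum allocation come from the text.

```python
# Rough estimate of how many applications YARN can run concurrently:
# total AM memory must stay below maximum-am-resource-percent of cluster memory.
def max_concurrent_apps(cluster_mem_mb, am_percent, am_mem_mb):
    """Upper bound on concurrently running applications."""
    return int(cluster_mem_mb * am_percent // am_mem_mb)

# Hypothetical 100 GB cluster, default am_percent = 0.1,
# each AM container takes the 1 GB minimum allocation.
print(max_concurrent_apps(100 * 1024, 0.1, 1024))  # -> 10
```

Raising am_percent to 1 removes this cap entirely (the whole cluster could then, in principle, be occupied by AMs).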
The following configuration makes YARN allocate containers based on the memory and CPU resources each task actually requests (the default calculator counts memory only):
$HADOOP_HOME/etc/hadoop/capacity-scheduler.xml
<property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
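For intuition, the dominant-resource idea behind DominantResourceCalculator can be sketched as below. This is an illustrative toy with made-up numbers, not Hadoop's actual implementation:

```python
# A container's share of the cluster is its largest share across resources,
# so a CPU-heavy request is limited by CPU rather than by memory alone.
def dominant_share(request, cluster):
    return max(request[r] / cluster[r] for r in cluster)

cluster = {"memory_mb": 100 * 1024, "vcores": 50}
cpu_heavy = {"memory_mb": 1024, "vcores": 5}   # 1% of memory, 10% of vcores
print(dominant_share(cpu_heavy, cluster))       # -> 0.1 (CPU dominates)
```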
=========================================================================================
spark-submit
spark-submit \
  --master yarn \
  --driver-memory 1G --driver-cores 1 \
  --executor-memory 1G --executor-cores 1 \
  --num-executors 3 \
  --py-files dacomponent.zip \
  taskflow_22_5457.py
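A rough estimate of the total YARN resources this command requests, assuming Spark's default per-container memory overhead of max(384 MB, 10% of the heap); the overhead assumption is mine, not stated in the command:

```python
# Estimate total YARN memory/cores requested by the spark-submit above.
def container_mem(heap_mb, overhead_min_mb=384, overhead_frac=0.10):
    """Heap plus assumed per-container overhead (max of 384 MB or 10%)."""
    return heap_mb + max(overhead_min_mb, int(heap_mb * overhead_frac))

driver_mb = container_mem(1024)          # --driver-memory 1G
executor_mb = container_mem(1024)        # --executor-memory 1G
total_mb = driver_mb + 3 * executor_mb   # --num-executors 3
total_cores = 1 + 3 * 1                  # --driver-cores 1, --executor-cores 1
print(total_mb, total_cores)             # -> 5632 4
```

Note that YARN rounds each container up to a multiple of the minimum allocation, so the actual reservation may be larger.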