• Notes on problems running Spark on YARN


    Problem 1:

    18/03/15 07:59:23 INFO yarn.Client: 
    	 client token: N/A
    	 diagnostics: Application application_1521099425266_0002 failed 2 times due to AM Container for appattempt_1521099425266_0002_000002 exited with  exitCode: 1
    For more detailed output, check application tracking page:http://spark1:8088/proxy/application_1521099425266_0002/Then, click on links to logs of each attempt.
    Diagnostics: Exception from container-launch.
    Container id: container_1521099425266_0002_02_000001
    Exit code: 1
    Stack trace: ExitCodeException exitCode=1: 
    	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    	at org.apache.hadoop.util.Shell.run(Shell.java:455)
    	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    

      

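    This problem is usually memory-related, so first increase the memory allocated to the job.

    Then disable the NodeManager's physical and virtual memory checks: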

    	<property>
    		<name>yarn.nodemanager.pmem-check-enabled</name>
    		<value>false</value>
    	</property>
    	<property>
    		<name>yarn.nodemanager.vmem-check-enabled</name>
    		<value>false</value>
    	</property>
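
To decide how far to raise the memory, it can help to estimate what YARN actually enforces for each container. The sketch below assumes Spark's default overhead (10% of the requested heap, with a 384 MB floor) and YARN's default `yarn.nodemanager.vmem-pmem-ratio` of 2.1. The function names are made up for illustration, and scheduler rounding of the request is not modeled.

```python
# Rough estimate of the memory YARN enforces for one Spark container.
# Assumptions: default overhead = max(384 MB, 10% of heap) and the default
# yarn.nodemanager.vmem-pmem-ratio of 2.1. Scheduler rounding is ignored.

OVERHEAD_FACTOR = 0.10
OVERHEAD_MIN_MB = 384
VMEM_PMEM_RATIO = 2.1


def container_request_mb(heap_mb: int) -> int:
    """Heap (--driver-memory / --executor-memory) plus off-heap overhead."""
    overhead = max(OVERHEAD_MIN_MB, int(OVERHEAD_FACTOR * heap_mb))
    return heap_mb + overhead


def vmem_limit_mb(container_mb: int, ratio: float = VMEM_PMEM_RATIO) -> float:
    """Virtual-memory ceiling applied when the vmem check is enabled."""
    return container_mb * ratio


if __name__ == "__main__":
    for heap in (1024, 4096, 8192):
        request = container_request_mb(heap)
        print(f"heap={heap} MB -> container={request} MB, "
              f"vmem limit={vmem_limit_mb(request):.0f} MB")
```

If the estimated container size exceeds what a NodeManager can offer, raising `--driver-memory` or `--executor-memory` alone will not help; the NodeManager and scheduler limits would need to grow as well.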
    

      

    Problem 2:

    Container exited with a non-zero exit code 1
    Failing this attempt. Failing the application.
    	 ApplicationMaster host: N/A
    	 ApplicationMaster RPC port: -1
    	 queue: root.kfk
    	 start time: 1521115132862
    	 final status: FAILED
    	 tracking URL: http://spark1:8088/cluster/app/application_1521099425266_0002
    	 user: kfk
    Exception in thread "main" org.apache.spark.SparkException: Application application_1521099425266_0002 finished with failed status
    	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
    	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
    	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    18/03/15 07:59:23 INFO util.ShutdownHookManager: Shutdown hook called
    18/03/15 07:59:23 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-edf48e42-1bda-41b6-8a1b-7f9e176da728
    

      

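    This problem is usually caused by cluster misconfiguration; check the JDK installation and the YARN configuration files on every node.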
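
One way to check the YARN configuration files is to compare them between a working node and a suspect one. The following is a minimal sketch under that assumption: it parses the `<property>` entries of a Hadoop site file and reports the keys whose values differ. The function names and the sample contents are hypothetical.

```python
# Compare the properties of two Hadoop *-site.xml files (for example the
# yarn-site.xml of two nodes). Function names here are illustrative only.
import xml.etree.ElementTree as ET


def parse_site_xml(text: str) -> dict:
    """Return {name: value} for every <property> in a Hadoop site file."""
    root = ET.fromstring(text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_configs(a: dict, b: dict) -> dict:
    """Map each differing key to its (value_in_a, value_in_b) pair."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Hypothetical contents; in practice read each node's yarn-site.xml.
    node1 = """<configuration>
      <property><name>yarn.nodemanager.vmem-check-enabled</name><value>false</value></property>
    </configuration>"""
    node3 = """<configuration>
      <property><name>yarn.nodemanager.vmem-check-enabled</name><value>true</value></property>
    </configuration>"""
    print(diff_configs(parse_site_xml(node1), parse_site_xml(node3)))
```

A key that is missing on one side shows up with `None` in that position, which makes an incompletely copied configuration easy to spot.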

    Problem 3:

    diagnostics: Application application_1521099425266_0004 failed 2 times due to Error launching appattempt_1521099425266_0004_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. 
    This token is expired. current time is 1521213771615 found 1521138303131
    Note: System times on machines may be out of sync. Check system time and time zones.
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    	at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
    	at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
    	at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:123)
    	at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    

      

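    Synchronizing the clocks across the cluster fixes this. My cluster had actually been kept in sync all along, but for some unknown reason the clock on the third node suddenly went wrong, and its JDK version was wrong as well.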
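
The two numbers in the "This token is expired" message are epoch timestamps in milliseconds, so the size of the problem can be read directly from the log. The sketch below converts the values from the message above and computes the gap, which comes to about 21 hours; the helper names are illustrative.

```python
# Decode the timestamps from "This token is expired. current time is ...
# found ...". Both values are milliseconds since the Unix epoch.
from datetime import datetime, timedelta, timezone


def to_utc(epoch_ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


def expiry_gap(current_ms: int, found_ms: int) -> timedelta:
    """How far the reported current time lies past the token's expiry."""
    return timedelta(milliseconds=current_ms - found_ms)


if __name__ == "__main__":
    current_ms = 1521213771615  # "current time is" in the log
    found_ms = 1521138303131    # "found" (token expiry) in the log
    print("current:", to_utc(current_ms).isoformat())
    print("expiry :", to_utc(found_ms).isoformat())
    print("gap    :", expiry_gap(current_ms, found_ms))
```

A gap of many hours is far longer than a container token would normally remain valid, which points to a node clock that is off rather than a slow launch.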

    Problem 4:

    Container exited with a non-zero exit code 15
    Failing this attempt. Failing the application.
    2018-03-16 11:59:29,345 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1521214648009_0003 State change from FINAL_SAVING to FAILED
    2018-03-16 11:59:29,346 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=kfk	OPERATION=Application Finished - Failed	TARGET=RMAppManager	RESULT=FAILURE	DESCRIPTION=App failed with state: FAILED	PERMISSIONS=Application application_1521214648009_0003 failed 2 times due to AM Container for appattempt_1521214648009_0003_000002 exited with  exitCode: 15
    For more detailed output, check application tracking page:http://spark2:8088/proxy/application_1521214648009_0003/Then, click on links to logs of each attempt.
    Diagnostics: Exception from container-launch.
    Container id: container_1521214648009_0003_02_000001
    Exit code: 15
    Stack trace: ExitCodeException exitCode=15: 
    	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    	at org.apache.hadoop.util.Shell.run(Shell.java:455)
    	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    
    Container exited with a non-zero exit code 15
    Failing this attempt. Failing the application.	APPID=application_1521214648009_0003
    2018-03-16 11:59:29,346 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1521214648009_0003,name=com.spark.test.MyScalaWordCout,user=kfk,queue=root.kfk,state=FAILED,trackingUrl=http://spark2:8088/cluster/app/application_1521214648009_0003,appMasterHost=N/A,startTime=1521215923660,finishTime=1521215968592,finalStatus=FAILED
    2018-03-16 11:59:30,164 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
    2018-03-16 12:00:15,892 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6667ms for sessionid 0x3622d0b65080001, closing socket connection and attempting reconnect
    2018-03-16 12:00:15,996 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
    2018-03-16 12:00:15,996 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected
    2018-03-16 12:00:16,123 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server spark1/192.168.208.151:2181. Will not attempt to authenticate using SASL (unknown error)
    2018-03-16 12:00:17,199 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6670ms for sessionid 0x1622882ae9c0001, closing socket connection and attempting reconnect
    2018-03-16 12:00:17,301 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
    2018-03-16 12:00:17,838 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server spark3/192.168.208.153:2181. Will not attempt to authenticate using SASL (unknown error)
    2018-03-16 12:00:18,838 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.208.152:35089, server: spark3/192.168.208.153:2181
    2018-03-16 12:00:18,843 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server spark3/192.168.208.153:2181, sessionid = 0x1622882ae9c0001, negotiated timeout = 10000
    2018-03-16 12:00:18,844 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
    2018-03-16 12:00:18,858 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
    2018-03-16 12:00:18,862 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0272731203726d32
    2018-03-16 12:00:18,862 INFO org.apache.hadoop.ha.ActiveStandbyElector: But old node has our own data, so don't need to fence it.
    2018-03-16 12:00:18,862 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /yarn-leader-election/rs/ActiveBreadCrumb to indicate that the local node is the most recent active...
    2018-03-16 12:00:19,127 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.208.152:50168, server: spark1/192.168.208.151:2181
    2018-03-16 12:00:21,384 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server spark1/192.168.208.151:2181, sessionid = 0x3622d0b65080001, negotiated timeout = 10000
    2018-03-16 12:00:21,386 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/opt/modules/hadoop-2.6.0-cdh5.4.5/etc/hadoop/yarn-site.xml
    2018-03-16 12:00:21,387 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
    2018-03-16 12:00:21,387 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
    2018-03-16 12:00:21,387 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
    2018-03-16 12:00:21,406 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=kfk	OPERATION=refreshAdminAcls	TARGET=AdminService	RESULT=SUCCESS
    2018-03-16 12:00:21,407 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in active state
    2018-03-16 12:00:21,407 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=kfk	OPERATION=refreshQueues	TARGET=AdminService	RESULT=SUCCESS
    2018-03-16 12:00:21,408 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/opt/modules/hadoop-2.6.0-cdh5.4.5/etc/hadoop/yarn-site.xml
    2018-03-16 12:00:21,426 INFO org.apache.hadoop.util.HostsFileReader: Setting the includes file to 
    2018-03-16 12:00:21,426 INFO org.apache.hadoop.util.HostsFileReader: Setting the excludes file to 
    2018-03-16 12:00:21,426 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list
    2018-03-16 12:00:21,431 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=kfk	OPERATION=refreshNodes	TARGET=AdminService	RESULT=SUCCESS
    2018-03-16 12:00:21,432 INFO org.apache.hadoop.conf.Configuration: found resource core-site.xml at file:/opt/modules/hadoop-2.6.0-cdh5.4.5/etc/hadoop/core-site.xml
    2018-03-16 12:00:21,432 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/opt/modules/hadoop-2.6.0-cdh5.4.5/etc/hadoop/yarn-site.xml
    2018-03-16 12:00:21,450 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=kfk	OPERATION=refreshSuperUserGroupsConfiguration	TARGET=AdminService	RESULT=SUCCESS
    2018-03-16 12:00:21,450 INFO org.apache.hadoop.conf.Configuration: found resource core-site.xml at file:/opt/modules/hadoop-2.6.0-cdh5.4.5/etc/hadoop/core-site.xml
    2018-03-16 12:00:21,451 INFO org.apache.hadoop.security.Groups: clearing userToGroupsMap cache
    2018-03-16 12:00:21,451 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=kfk	OPERATION=refreshUserToGroupsMappings	TARGET=AdminService	RESULT=SUCCESS
    2018-03-16 12:00:21,451 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=kfk	OPERATION=transitionToActive	TARGET=RMHAProtocolService	RESULT=SUCCESS
    

      Problems like this usually cannot be diagnosed from the surface output; the specific cause can be found in the YARN logs.
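
A small helper can pull the application ID out of the client or ResourceManager output and fetch the container logs with the `yarn logs -applicationId` command. This is a sketch that assumes log aggregation is enabled on the cluster and that the `yarn` binary is on the `PATH`; the function names are made up for illustration.

```python
# Find YARN application IDs in log text and fetch their aggregated logs.
# Assumes log aggregation is enabled and `yarn` is available on PATH.
import re
import subprocess

APP_ID_RE = re.compile(r"application_\d+_\d+")


def find_application_ids(text: str) -> list:
    """Return the distinct application IDs in text, in order of appearance."""
    seen = []
    for app_id in APP_ID_RE.findall(text):
        if app_id not in seen:
            seen.append(app_id)
    return seen


def yarn_logs_command(app_id: str) -> list:
    """Build the argument list for `yarn logs -applicationId <id>`."""
    return ["yarn", "logs", "-applicationId", app_id]


def fetch_logs(app_id: str) -> str:
    """Run the command and return its standard output (needs a cluster)."""
    result = subprocess.run(yarn_logs_command(app_id),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    sample = "Application application_1521214648009_0003 failed 2 times"
    for app_id in find_application_ids(sample):
        print(" ".join(yarn_logs_command(app_id)))
```

Without log aggregation the same information lives in the NodeManager's local log directories on the node that ran the container.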

    Problem 5:

    Exception in thread "main" org.apache.spark.SparkException: Application application_1521293577934_0006 finished with failed status
    	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
    	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
    	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    

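      This is only a surface error. To find the real one, locate the failed application in the ResourceManager's application list and click into it: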

    Diagnostics:	User class threw exception: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ns/opt/datas/stu2.txt
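
Before resubmitting, it may be worth confirming whether the path named in the diagnostics really exists on HDFS. The sketch below extracts the path from the message and builds an `hdfs dfs -test -e` call, which exits with status 0 when the path exists. It assumes the `hdfs` binary is on the `PATH`, and the function names are hypothetical.

```python
# Extract the missing input path from the diagnostics and check it on HDFS.
# Assumes the `hdfs` command-line client is available on PATH.
import re
import subprocess

MISSING_PATH_RE = re.compile(r"Input path does not exist: (\S+)")


def missing_input_path(diagnostics: str):
    """Return the path named in an InvalidInputException message, or None."""
    match = MISSING_PATH_RE.search(diagnostics)
    return match.group(1) if match else None


def hdfs_exists_command(path: str) -> list:
    """Build the argument list for `hdfs dfs -test -e <path>`."""
    return ["hdfs", "dfs", "-test", "-e", path]


def hdfs_path_exists(path: str) -> bool:
    """Run the check; exit status 0 means the path exists (needs a cluster)."""
    return subprocess.run(hdfs_exists_command(path)).returncode == 0


if __name__ == "__main__":
    diag = ("User class threw exception: "
            "org.apache.hadoop.mapred.InvalidInputException: "
            "Input path does not exist: hdfs://ns/opt/datas/stu2.txt")
    path = missing_input_path(diag)
    print(" ".join(hdfs_exists_command(path)))
```

If the file exists only on a local disk, the check fails, and the file has to be uploaded to HDFS or the job pointed at the correct location.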
    

      

  • Original article: https://www.cnblogs.com/qiaoyihang/p/8593613.html