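The anti-spam RD team has an HQL job that exits with an error during execution, reporting a java.io.IOException: Broken pipe exception. The HQL calls a Python script, and neither the HQL nor the script had been changed recently. The job still ran normally on October 1, but from October 4 onward it kept failing with the same error, always in the reduce phase of stage-2. The error message on the gateway is as follows:
2014-10-10 15:05:32,724 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201406171104_4019895 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
Job error message on the JobTracker page: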
2014-10-10 15:00:29,614 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"1000390355","reducesinkkey1":"14"},"value":{"_col0":"1000390355","_col1":25,"_col2":"Infinity","_col3":"14","_col4":17},"alias":0}
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:268)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:518)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:419)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1061)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"1000390355","reducesinkkey1":"14"},"value":{"_col0":"1000390355","_col1":25,"_col2":"Infinity","_col3":"14","_col4":17},"alias":0}
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:256)
    ... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Broken pipe
    at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:348)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
    at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:247)
    ... 7 more
Caused by: java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:43)
    at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:331)
    ... 15 more
stderr logs:
Traceback (most recent call last):
  File "/data10/hadoop/local/taskTracker/liangjun/jobcache/job_201406171104_4019895/attempt_201406171104_4019895_r_000000_0/work/./pranalysis.py", line 86, in <module>
    pranalysis(cols[0],pr,cols[1],cols[4],prnum)
  File "/data10/hadoop/local/taskTracker/liangjun/jobcache/job_201406171104_4019895/attempt_201406171104_4019895_r_000000_0/work/./pranalysis.py", line 60, in pranalysis
    print '%s %d %d %d'%(uid,v[14]-20,type,rank)
TypeError: %d format: a number is required, not float
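From the job's error messages, the preliminary conclusion is that the data produced after October 1 is bad and makes the Python script exit while it is running. When the script exits, the data pipe to it is closed, but ExecReducer.reduce() does not know that the pipe it uses to write to Python has been closed by the exception. It keeps writing into it, and that is when the java.io.IOException: Broken pipe exception appears.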
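The mechanism described above can be reproduced outside Hive. The following is a minimal sketch, not the Hive code path: it assumes a POSIX system and Python 3 (the original script is Python 2), and CHILD_CODE and feed_rows are hypothetical names introduced only for illustration. The child process stands in for a script that dies after reading one row, while the parent plays the role of the writer that keeps sending rows.

```python
import subprocess
import sys

# Hypothetical stand-in for the failing script: read one row, then exit with an error.
CHILD_CODE = "import sys; sys.stdin.readline(); sys.exit(1)"

def feed_rows(row=b"1000390355 25 Infinity 14 17\n", max_rows=200000):
    """Keep writing rows to the child; report whether the pipe broke."""
    child = subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                             stdin=subprocess.PIPE)
    broken = False
    try:
        for _ in range(max_rows):
            child.stdin.write(row)
            child.stdin.flush()  # push the row into the OS pipe right away
    except BrokenPipeError:
        # The reader is gone: the same condition Java reports as
        # java.io.IOException: Broken pipe.
        broken = True
    finally:
        try:
            child.stdin.close()
        except OSError:
            pass  # closing may try to flush into the dead pipe again
        child.wait()
    return broken, child.returncode

if __name__ == "__main__":
    print(feed_rows())
```

The writer only learns about the dead reader when a write fails, so the exception it reports says nothing about why the reader died. That is why the script's own stderr has to be checked.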
The analysis follows:
1. HQL and Python
The HQL is as follows:
add file /usr/home/wbdata_anti/shell/sass_offline/pranalysis.py;
select transform(BS.*) using 'pranalysis.py' as uid,prvalue,trend,prlevel
from (
    select B1.uid,B1.flws,B1.pr,iter,B2.alivefans
    from tmp_anti_user_pagerank1 B1
    join mds_anti_user_flwpr B2 on B1.uid=B2.uid
    where iter>'00' and iter<='14' and dt='lowrlfans20141001'
    distribute by uid
    sort by uid,iter
)BS;
The Python script is as follows:
#!/usr/bin/python
#coding=utf-8
import sys,time
import re,math
from optparse import OptionParser
import ConfigParser

reload(sys)
sys.setdefaultencoding('utf-8')

parser = OptionParser(usage="usage:%prog [optinos] filepath")
parser.add_option("-i", "--iter",action = "store",type = 'string', dest = "iter", default = '14', help="how many iterators" )
(options, args) = parser.parse_args()

def pranalysis(uid,prs,flw,fans,prnum):
    tasc=tdesc=0
    try:
        v=[float(pr)*100000000000 for pr in prs]
        fans=int(fans)
        interval=fans/100
    except:
        #rst=sys.exc_info()
        #sys.excepthook(rst[0],rst[1],rst[2])
        return
    for i in range(1,prnum-1) :
        if i==1:
            if v[i+1]-v[i]>interval and v>fans:
                tasc += 1
            elif v[i]-v[i+1]>interval and v[i+1]<fans:
                tdesc += 1
            continue
        if v[i+1]-v[i]>interval:
            tasc += 1
        elif v[i]-v[i+1]>interval:
            tdesc += 1
    # rank indicate the rate between pr and fans. higher rank(big number) mean more possible negative user
    rate=v[prnum-1]/fans
    rank=4
    if rate>3.0:
        rank=0
    elif rate>2.0:
        rank=1
    elif rate>1.3:
        rank=2
    elif rate>0.7:
        rank=3
    elif rate>0.5:
        rank=4
    elif rate>0.3:
        rank=5
    elif rate>0.2:
        rank=6
    else:
        rank=7
    # 0 for stable trend. 1 for round trend, 2, for positive user, 3 for negative user.
    type=0
    if tasc>0 and tdesc>0:
        type=1
    elif tasc>0:
        type=2
    elif tdesc>0:
        type=3
    else: # tdesc=0 and tasc=0
        type=0
    #if fans<60:
    #    type=0
    print '%s %d %d %d'%(uid,v[14]-20,type,rank)

#format sort by uid, iter
#uid follow pr iter fans
#1642909335 919 0.00070398898 04 68399779
prnum=int(options.iter)+1
pr=[0]*prnum
idx=1
lastiter='00'
lastuid=''
for line in sys.stdin:
    line=line.rstrip(' ')
    cols=line.split(' ')
    if len(cols)<5:
        continue
    if cols[3]>options.iter or cols[3]=='00':
        continue
    if cols[3]<=lastiter:
        print '%s %d %d %d'%(lastuid,2,0,7)
        pr=[0]*prnum
        idx=1
    lastiter=cols[3]
    lastuid=cols[0]
    pr[idx]=cols[2]
    idx+=1
    if cols[3]==options.iter:
        pranalysis(cols[0],pr,cols[1],cols[4],prnum)
        pr=[0]*prnum
        lastiter='00'
        idx=1
2. Execution plan of the stage-2 reduce phase:
Reduce Operator Tree:
  Extract
    Select Operator
      expressions:
            expr: _col0
            type: string
            expr: _col1
            type: bigint
            expr: _col2
            type: string
            expr: _col3
            type: string
            expr: _col4
            type: bigint
      outputColumnNames: _col0, _col1, _col2, _col3, _col4
      Transform Operator
        command: pranalysis.py
        output info:
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        File Output Operator
          compressed: false
          GlobalTableId: 0
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
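The execution plan shows that the stage-2 reduce phase is actually very simple: it passes the data received from the map phase through the pranalysis.py script, which turns 5 input columns into 4 output columns. The Python output has a format requirement: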
print '%s %d %d %d'%(uid,v[14]-20,type,rank)
Combining what the execution plan points to with the job's stderr logs:
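Traceback (most recent call last):
  File "/data10/hadoop/local/taskTracker/liangjun/jobcache/job_201406171104_4019895/attempt_201406171104_4019895_r_000000_0/work/./pranalysis.py", line 86, in <module>
    pranalysis(cols[0],pr,cols[1],cols[4],prnum)
  File "/data10/hadoop/local/taskTracker/liangjun/jobcache/job_201406171104_4019895/attempt_201406171104_4019895_r_000000_0/work/./pranalysis.py", line 60, in pranalysis
    print '%s %d %d %d'%(uid,v[14]-20,type,rank)
TypeError: %d format: a number is required, not float
This confirms that the HQL fails while running the Python script, and that abnormal data is the cause. One of the values the script computes is a float that %d cannot format. The failing row in the JobTracker log carries "_col2":"Infinity", so v[14]-20 is most likely infinite, and an infinite float cannot be converted to the integer that %d expects. The script therefore exits abnormally, and on exit its end of the data pipe is closed. ExecReducer.reduce(), however, does not know that the pipe it uses to write to Python has been closed by the exception. It keeps writing into it, which produces the java.io.IOException: Broken pipe exception.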
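One possible way to keep the script alive when such a row arrives is to check the value before formatting it. The sketch below is an illustration under assumptions, not the fix the team applied: format_row is a hypothetical helper, the output format string is copied from the script's print statement, and skipping the row with a note on stderr is just one policy among several (the row could also be emitted with a sentinel value). It uses math.isinf and math.isnan, which exist in both Python 2.6+ and Python 3.

```python
import math
import sys

def format_row(uid, pr_value, trend, rank):
    """Return the output line, or None when pr_value has no integer form.

    Hypothetical helper: mirrors the script's print format, but refuses
    infinite or NaN values instead of letting '%d' raise.
    """
    if math.isinf(pr_value) or math.isnan(pr_value):
        # Report on stderr so the skipped row shows up in the task's stderr logs.
        sys.stderr.write("skip uid=%s: non-finite pr value %r\n" % (uid, pr_value))
        return None
    return '%s %d %d %d' % (uid, int(pr_value - 20), trend, rank)

if __name__ == "__main__":
    # The pr column of the failing row is the string "Infinity".
    bad = float("Infinity") * 100000000000
    for line in (format_row("1000390355", bad, 0, 7),
                 format_row("1642909335", 120.7, 2, 3)):
        if line is not None:
            print(line)
```

With a guard like this the script keeps reading stdin to the end, so the writer on the Hive side never hits a closed pipe, and the skipped rows remain visible in the stderr logs for follow-up on the upstream data.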
References:
http://fgh2011.iteye.com/blog/1684544
http://blog.csdn.net/churylin/article/details/11969925