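A SaltStack master usually has many minions connected to it. The program below works out which minions ran a job successfully, which failed, and which never returned.
Script notes:
1. The script first prints the job id, the command name and other details of the job, then the job's execution and results, which look the same as when the command is run on its own. Finally it prints every minion whose job failed or did not return, and runs the job against those minions once more. Two caveats: I have not studied the failure cases closely, so failure detection is crude; and minions that are not connected are also counted as "not returned" and get the job again, although a job should not be rerun on a minion that is not connected.
2. The program first forks a child process to run the salt command. While the child runs, the parent listens on the master's event bus and records which minions return. When the salt command has finished, the parent receives SIGCLD and its handler runs the job a second time on the minions that failed or did not return.
3. Writing the script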
import salt.utils.event
import re
import signal, time
import sys
import os

def single_handler(target):
    # Replace the current (child) process with a salt run for one minion.
    os.execl('/usr/bin/salt', 'salt', target, 'state.sls', 'os')

def handler(num1, num2):
    # Runs in the parent when the child that executed salt exits (SIGCLD).
    #signal.signal(signal.SIGCLD,signal.SIG_IGN)
    # .get() avoids a KeyError if the child exits before any job event was seen.
    notret = record.get(jid, [])
    failed = failedrecord.get(jid, [])
    print('We are in signal handler')
    print('Job Not Ret: ' + str(notret))
    print(' Job Failed: ' + str(failed))
    print('all done...')
    for item in failed:
        #print item
        try:
            pid = os.fork()
            if pid == 0:
                single_handler(item)
        except OSError:
            print('we exec. ' + item + ' error!')
    for item in notret:
        #print item
        try:
            print('fork ok ' + item)
            pid = os.fork()
            if pid == 0:
                single_handler(item)
        except OSError:
            print('we exec. ' + item + ' error!')
    sys.stdout.flush()
    os._exit(0)

fd = open('/tmp/record', 'w+')
#sys.stdout = fd
#sys.stderr = fd
signal.signal(signal.SIGCLD, handler)
#fd = open('/var/log/record', 'w+')
# Send stdout and stderr of this process and its children to /tmp/record.
os.dup2(fd.fileno(), sys.stdout.fileno())
os.dup2(fd.fileno(), sys.stderr.fileno())
#sys.stdout = fd
#sys.stderr = fd
try:
    pid = os.fork()
    if pid == 0:
        # Child: wait until the parent is listening, then run the job.
        time.sleep(2)
        try:
            os.execl('/usr/bin/salt', 'salt', '*', 'state.sls', 'os')
        except OSError:
            print('exec error!')
            os._exit(1)
except OSError:
    print('first fork error!')
    os._exit(1)

event = salt.utils.event.MasterEvent('/var/run/salt/master')
flag = False
reg = re.compile('salt/job/([0-9]+)/new')
reg1 = reg
#a process to exec. command, but will sleep some time
#another process listen the event
#if we use this method, we can filter the event through func. name
record = {}        # jid -> minions that have not returned yet
failedrecord = {}  # jid -> minions that returned with success == False
jid = 0
#try:
for eachevent in event.iter_events(tag='salt/job', full=True):
    eachevent = dict(eachevent)
    data = dict(eachevent['data'])
    result = reg.findall(eachevent['tag'])
    if not flag and result:
        # First "new job" event: remember the jid and the targeted minions.
        flag = True
        jid = result[0]
        print(" job_id: " + jid)
        print(" Command: " + data['fun'] + ' ' + str(data['arg']))
        print(" RunAs: " + data['user'])
        print("exec_time: " + data['_stamp'])
        print("host_list: " + str(data['minions']))
        sys.stdout.flush()
        record[jid] = data['minions']
        failedrecord[jid] = []
        # Note: [0-9.]+ only matches minion ids that look like IP addresses.
        reg1 = re.compile('salt/job/' + jid + '/ret/([0-9.]+)')
    else:
        result = reg1.findall(eachevent['tag'])
        if result:
            record[jid].remove(result[0])
            if not data['success']:
                failedrecord[jid].append(result[0])
#except:
#    print 'we in except'
"""
print 'Job Not Ret: '+str(record[jid])
print ' Job Failed: '+str(failedrecord[jid])
for item in failedrecord[jid]:
    os.system('salt '+ str(item) + ' state.sls os')
for item in record[jid]:
    os.system('salt '+ str(item) + ' state.sls os')
os._exit(0)
"""
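The bookkeeping done in the event loop can be separated from Salt itself, which makes it easier to test. The following is only a minimal sketch, not the script's own code: summarize_job and its (tag, data) input format are hypothetical names introduced here, and it assumes only the two tag shapes the script matches (salt/job/<jid>/new and salt/job/<jid>/ret/<minion>) and the 'minions' and 'success' fields the script reads. Unlike the script's [0-9.]+ pattern, the return-tag pattern here accepts any minion id, not just IP addresses.

```python
import re

# Tag shapes taken from the script: salt/job/<jid>/new and salt/job/<jid>/ret/<minion>
NEW_TAG = re.compile(r'^salt/job/(\d+)/new$')
RET_TAG = re.compile(r'^salt/job/(\d+)/ret/(.+)$')


def summarize_job(events):
    """Track the first job seen in `events`, an iterable of (tag, data) pairs.

    Returns (jid, succeeded, failed, not_returned); jid is None if no
    'new' event was seen. This helper is hypothetical, not part of Salt.
    """
    jid = None
    pending, succeeded, failed = [], [], []
    for tag, data in events:
        if jid is None:
            new = NEW_TAG.match(tag)
            if new:
                jid = new.group(1)
                pending = list(data['minions'])  # copy, so the caller's list is untouched
            continue
        ret = RET_TAG.match(tag)
        # Ignore returns that belong to other jobs or to minions we did not target.
        if ret and ret.group(1) == jid and ret.group(2) in pending:
            minion = ret.group(2)
            pending.remove(minion)
            # Assumption: a missing 'success' field is treated as a failure.
            (succeeded if data.get('success') else failed).append(minion)
    return jid, succeeded, failed, pending
```

Fed with the events collected from iter_events, the last two lists correspond to the script's failedrecord[jid] and record[jid].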
Execution result:
 job_id: 20151208025319005896
 Command: state.sls ['os']
 RunAs: root
exec_time: 2015-12-08T02:53:19.006284
host_list: ['172.18.1.212', '172.18.1.214', '172.18.1.213', '172.18.1.211']
172.18.1.213:
----------
          ID: configfilecopy
    Function: file.managed
        Name: /root/node3
      Result: True
     Comment: File /root/node3 is in the correct state
     Started: 02:53:19.314015
    Duration: 13.033 ms
     Changes:
----------
          ID: commonfile
    Function: file.managed
        Name: /root/commonfile
      Result: True
     Comment: File /root/commonfile is in the correct state
     Started: 02:53:19.327173
    Duration: 1.993 ms
     Changes:

Summary
------------
Succeeded: 2
Failed:    0
------------
Total states run:     2
172.18.1.212:
----------
          ID: configfilecopy
    Function: file.managed
        Name: /root/node2
      Result: True
     Comment: File /root/node2 is in the correct state
     Started: 02:53:19.337325
    Duration: 8.327 ms
     Changes:
----------
          ID: commonfile
    Function: file.managed
        Name: /root/commonfile
      Result: True
     Comment: File /root/commonfile is in the correct state
     Started: 02:53:19.345787
    Duration: 1.996 ms
     Changes:

Summary
------------
Succeeded: 2
Failed:    0
------------
Total states run:     2
172.18.1.211:
----------
          ID: configfilecopy
    Function: file.managed
        Name: /root/node1
      Result: True
     Comment: File /root/node1 is in the correct state
     Started: 02:53:19.345017
    Duration: 12.741 ms
     Changes:
----------
          ID: commonfile
    Function: file.managed
        Name: /root/commonfile
      Result: True
     Comment: File /root/commonfile is in the correct state
     Started: 02:53:19.357873
    Duration: 1.948 ms
     Changes:

Summary
------------
Succeeded: 2
Failed:    0
------------
Total states run:     2
172.18.1.214:
    Minion did not return. [Not connected]
We are in signal handler
Job Not Ret: ['172.18.1.214']
 Job Failed: []
all done...
fork ok 172.18.1.214
172.18.1.214:
    Minion did not return. [Not connected]
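The end of this output shows the limitation mentioned in the script notes: 172.18.1.214 is not connected, yet the job is run against it again. One possible way to avoid that is sketched below. It assumes the salt CLI prints a minion id line ending in a colon, followed by an indented "Minion did not return. [Not connected]" line, as in the run above; the function names are hypothetical and the approach is a suggestion, not something the script does.

```python
NOT_CONNECTED = 'Minion did not return. [Not connected]'


def not_connected_minions(output):
    """Return minion ids that captured salt CLI output reports as not connected."""
    found = []
    current = None
    for line in output.splitlines():
        text = line.strip()
        # A minion header starts in column 0 and ends with a colon.
        if line[:1].strip() and text.endswith(':'):
            current = text[:-1]
        elif text == NOT_CONNECTED and current is not None:
            found.append(current)
    return found


def retry_targets(failed, not_returned, output):
    """Failed and unreturned minions, minus those reported as not connected."""
    skip = set(not_connected_minions(output))
    return [m for m in failed + not_returned if m not in skip]
```

With the run above, not_returned would be ['172.18.1.214'] and the captured output marks that minion as not connected, so nothing would be retried.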