• A Brief Analysis of the Android Watchdog Source Code (Based on Android 6.0.1)


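    1. Introduction to Watchdog

    To keep the system highly available, Android provides Watchdog, which monitors the health of several key system services. If one of them deadlocks, Watchdog restarts SystemServer. It also receives reboot requests raised inside the system and reboots the device.

    In short, Watchdog has two main jobs:

    1. Receive internal reboot requests and reboot the system;
    2. Monitor key system services and restart SystemServer if one of them deadlocks.
      The monitored services, each of which must implement the Watchdog.Monitor interface, are: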
      ActivityManagerService
      InputManagerService
      MountService
      NativeDaemonConnector
      NetworkManagementService
      PowerManagerService
      WindowManagerService
      MediaRouterService
      MediaProjectionManagerService
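
    The Monitor contract described above can be sketched in plain Java. This is a minimal illustration, not framework code: Monitor, MonitorRegistry and DemoService are hypothetical stand-ins for Watchdog.Monitor, the watchdog's monitor list and a monitored service, since the real classes are only available inside the Android framework.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Watchdog.Monitor (the real interface is framework-internal).
interface Monitor {
    void monitor();
}

// Hypothetical registry that plays the role of the watchdog's list of monitors.
class MonitorRegistry {
    private final List<Monitor> monitors = new ArrayList<>();

    synchronized void addMonitor(Monitor m) {
        monitors.add(m);
    }

    // Calls every registered monitor in turn and returns how many were checked.
    // The call blocks if any monitor cannot take its service lock.
    int checkAll() {
        List<Monitor> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(monitors);
        }
        for (Monitor m : snapshot) {
            m.monitor();
        }
        return snapshot.size();
    }
}

// Hypothetical service that guards its state with synchronized (this).
class DemoService implements Monitor {
    private int counter;

    synchronized void doWork() {
        counter++;
    }

    @Override
    public void monitor() {
        synchronized (this) { }  // returns only once the service lock can be taken
    }
}
```

    A service would register itself once, for example with registry.addMonitor(service), and the watchdog side would then call checkAll() periodically.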

    2. Watchdog in detail

    Watchdog in one diagram
    ![Watchdog in one diagram](http://images2015.cnblogs.com/blog/632312/201611/632312-20161130115015006-1586848184.png)

    Watchdog is set up when SystemServer starts, from startOtherServices. It is a singleton object that extends Thread, so it actually runs inside the SystemServer process.
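
    The "singleton that extends Thread" shape can be sketched as follows. MiniWatchdog is a hypothetical miniature written for illustration only; the real run() loops forever, while this one records a single pass and returns.

```java
// Hypothetical miniature of the pattern: a Thread subclass with a lazily created singleton.
class MiniWatchdog extends Thread {
    private static MiniWatchdog sInstance;
    private volatile boolean ran;

    private MiniWatchdog() {
        super("watchdog");  // thread name, visible in stack dumps
    }

    static synchronized MiniWatchdog getInstance() {
        if (sInstance == null) {
            sInstance = new MiniWatchdog();
        }
        return sInstance;
    }

    @Override
    public void run() {
        // The real watchdog loops forever here; this sketch records one pass and exits.
        ran = true;
    }

    boolean hasRun() {
        return ran;
    }
}
```

    Because the instance is a thread of whichever process creates it, calling MiniWatchdog.getInstance().start() from SystemServer-like code would run the loop inside that same process.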

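    Once started, Watchdog's run() loop checks every 30s whether any monitored service has deadlocked. It does so by calling hc.scheduleCheckLocked(), which in turn invokes monitor() on each monitored object. Take ActivityManagerService as an example: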

        /** In this method we try to acquire our lock to make sure that we have not deadlocked */
        public void monitor() {
            synchronized (this) { }
        }
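
    How a scheduled check turns a held lock into a detectable condition can be sketched like this. It is a rough analogue only: LockChecker is hypothetical, and a single-thread executor stands in for the Handler thread that the real HandlerChecker posts to.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical checker: posts a lock probe to another thread and records completion.
class LockChecker {
    private final Object monitoredLock;
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "checker");
        t.setDaemon(true);  // a stuck probe must not keep the JVM alive
        return t;
    });
    private volatile boolean completed = true;

    LockChecker(Object monitoredLock) {
        this.monitoredLock = monitoredLock;
    }

    // Rough analogue of scheduleCheckLocked(): do nothing if a probe is still pending.
    void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        worker.execute(() -> {
            synchronized (monitoredLock) { }  // same idea as monitor()
            completed = true;
        });
    }

    // Polls for up to the given number of milliseconds; returns whether the probe finished.
    boolean awaitCompletion(long millis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);
        while (!completed && System.nanoTime() < deadline) {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return completed;
    }
}
```

    While some other thread holds monitoredLock, the probe never sets completed, which is the signal the watchdog side looks for.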
    

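    The service's critical sections are guarded by synchronized (this), so monitor() can return only when that lock is free. If monitor() still has not acquired the lock after two consecutive 30s checks, the process is treated as deadlocked and SystemServer is killed. (Watchdog runs in the SystemServer process, so in Process.killProcess(Process.myPid()) the value returned by myPid() is SystemServer's PID.)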
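
    The "two 30s checks" rule amounts to classifying each pending check by how long it has been outstanding. The sketch below is an assumption-laden illustration: WaitStates is a hypothetical helper, the numeric values of the constants are chosen here only so that a larger value means a worse state, and the 60s timeout is inferred from the two 30s intervals described above.

```java
// Hypothetical helper modeled on the four wait states used by the watchdog loop.
class WaitStates {
    // Numeric values are assumptions; only their ordering matters in this sketch.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Classifies one check from its completion flag and the time it has been pending.
    static int stateOf(boolean completed, long elapsedMs, long timeoutMs) {
        if (completed) {
            return COMPLETED;
        }
        if (elapsedMs < timeoutMs / 2) {
            return WAITING;
        }
        if (elapsedMs < timeoutMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // The overall state is the worst (largest) state across all checks.
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }
}
```

    With a 60s timeout, a check that is still pending after the first 30s interval is WAITED_HALF, and one that is still pending after the second is OVERDUE.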

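    After SystemServer is killed, Zygote dies as well: in com_android_internal_os_Zygote.cpp the SIGCHLD handler kills the Zygote process when it learns that SystemServer has exited. Finally the init process detects that zygote has died and starts Zygote again, which in turn starts SystemServer (the zygote service in init.rc is configured with onrestart, so it is flagged SVC_RESTARTING and init.cpp restarts it when it reaches restart_processes()).

    Now let's walk through this flow in the code: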

        @Override
        public void run() {
            boolean waitedHalf = false;
            while (true) {
                final ArrayList<HandlerChecker> blockedCheckers;
                final String subject;
                final boolean allowRestart;
                int debuggerWasConnected = 0;
                synchronized (this) {
                    long timeout = CHECK_INTERVAL;
                    // Make sure we (re)spin the checkers that have become idle within
                    // this wait-and-check interval
                    for (int i=0; i<mHandlerCheckers.size(); i++) {
                        HandlerChecker hc = mHandlerCheckers.get(i);
                        // 1. Schedule a check for each monitored service
                        hc.scheduleCheckLocked();
                    }
    
                    if (debuggerWasConnected > 0) {
                        debuggerWasConnected--;
                    }
    
                    // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                    // wait while asleep. If the device is asleep then the thing that we are waiting
                    // to timeout on is asleep as well and won't have a chance to run, causing a false
                    // positive on when to kill things.
                    long start = SystemClock.uptimeMillis();
                    while (timeout > 0) {
                        if (Debug.isDebuggerConnected()) {
                            debuggerWasConnected = 2;
                        }
                        try {
                            // 2. Wait for the timeout, 30s by default
                            wait(timeout);
                        } catch (InterruptedException e) {
                            Log.wtf(TAG, e);
                        }
                        if (Debug.isDebuggerConnected()) {
                            debuggerWasConnected = 2;
                        }
                        timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                    }
    
                    // 3. Read the waitState after the check. COMPLETED, WAITING and WAITED_HALF end this
                    // iteration and go back to the top of the loop; OVERDUE falls through to the
                    // logging and kill logic below.
                    final int waitState = evaluateCheckerCompletionLocked();
                    if (waitState == COMPLETED) {
                        // The monitors have returned; reset
                        waitedHalf = false;
                        continue;
                    } else if (waitState == WAITING) {
                        // still waiting but within their configured intervals; back off and recheck
                        continue;
                    } else if (waitState == WAITED_HALF) {
                        if (!waitedHalf) {
                            // We've waited half the deadlock-detection interval.  Pull a stack
                            // trace and wait another half.
                            ArrayList<Integer> pids = new ArrayList<Integer>();
                            pids.add(Process.myPid());
                            ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                    NATIVE_STACKS_OF_INTEREST);
                            waitedHalf = true;
                        }
                        continue;
                    }
    
                    // 4. OVERDUE: run the overdue path (log, then end the process)
                    // something is overdue!
                    blockedCheckers = getBlockedCheckersLocked();
                    subject = describeCheckersLocked(blockedCheckers);
                    allowRestart = mAllowRestart;
                }
    
                // If we got here, that means that the system is most likely hung.
                // First collect stack traces from all threads of the system process.
                // Then kill this process so that the system will restart.
                EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
    
                ArrayList<Integer> pids = new ArrayList<Integer>();
                pids.add(Process.myPid());
                if (mPhonePid > 0) pids.add(mPhonePid);
                // 5. Dump the stack traces of the listed processes via ActivityManagerService
                // Pass !waitedHalf so that just in case we somehow wind up here without having
                // dumped the halfway stacks, we properly re-initialize the trace file.
                final File stack = ActivityManagerService.dumpStackTraces(
                        !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
    
                // Give some extra time to make sure the stack traces get written.
                // The system's been hanging for a minute, another second or two won't hurt much.
                SystemClock.sleep(2000);
    
                // 6. Dump the kernel stacks
                // Pull our own kernel thread stacks as well if we're configured for that
                if (RECORD_KERNEL_THREADS) {
                    dumpKernelStackTraces();
                }
    
                // 7. Have the kernel write all blocked threads and the backtraces of all CPUs to the kernel log
                // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
                doSysRq('w');
                doSysRq('l');
    
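                // 8. Try to put the error into the dropbox. AMS itself may be deadlocked, so this runs on a
                // separate thread with a bounded join; otherwise the watchdog would hang here as well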
                // Try to add the error to the dropbox, but assuming that the ActivityManager
                // itself may be deadlocked.  (which has happened, causing this statement to
                // deadlock and the watchdog as a whole to be ineffective)
                Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                        public void run() {
                            mActivity.addErrorToDropBox(
                                    "watchdog", null, "system_server", null, null,
                                    subject, null, stack, null);
                        }
                    };
                dropboxThread.start();
                try {
                    dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
                } catch (InterruptedException ignored) {}
    
                // 9. If an ActivityController is set, ask it via systemNotResponding(subject) how to proceed: 1 = keep waiting, -1 = kill system
                IActivityController controller;
                synchronized (this) {
                    controller = mController;
                }
                if (controller != null) {
                    Slog.i(TAG, "Reporting stuck state to activity controller");
                    try {
                        Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                        // 1 = keep waiting, -1 = kill system
                        int res = controller.systemNotResponding(subject);
                        if (res >= 0) {
                            Slog.i(TAG, "Activity controller requested to coninue to wait");
                            waitedHalf = false;
                            continue;
                        }
                    } catch (RemoteException e) {
                    }
                }
    
                // Only kill the process if the debugger is not attached.
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                if (debuggerWasConnected >= 2) {
                    Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
                } else if (debuggerWasConnected > 0) {
                    Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
                } else if (!allowRestart) {
                    Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
                } else {
                    // 10. Log the stack traces of the blocked checker threads
                    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                    for (int i=0; i<blockedCheckers.size(); i++) {
                        Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                        StackTraceElement[] stackTrace
                                = blockedCheckers.get(i).getThread().getStackTrace();
                        for (StackTraceElement element: stackTrace) {
                            Slog.w(TAG, "    at " + element);
                        }
                    }
                    Slog.w(TAG, "*** GOODBYE!");
                    // 11. Kill the process
                    Process.killProcess(Process.myPid());
                    System.exit(10);
                }
    
                waitedHalf = false;
            }
        }
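
    Step 8 relies on a general pattern: run work that might hang on its own thread and wait for it only for a bounded time. A minimal sketch, assuming a hypothetical helper named BoundedReporter (the real code inlines the thread and the join, as shown in the listing):

```java
// Hypothetical helper: run a task that may hang, but wait for it only briefly.
class BoundedReporter {
    // Returns true if the task finished within timeoutMs, false if it is still running.
    // timeoutMs should be positive, because Thread.join(0) waits without a limit.
    static boolean runWithTimeout(Runnable task, long timeoutMs) {
        Thread t = new Thread(task, "reportWriter");
        t.setDaemon(true);  // a stuck reporter must not keep the JVM alive
        t.start();
        try {
            t.join(timeoutMs);  // wait at most timeoutMs for the task to return
        } catch (InterruptedException ignored) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }
}
```

    If the task blocks on a lock that is never released, the caller still continues after the timeout, which is what lets the watchdog go on to kill the process.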
    
  • Original article: https://www.cnblogs.com/GMCisMarkdownCraftsman/p/6117129.html