ANR 机制以及问题分析

本文目标讲解Android中的ANR原理以及如何分析ANR问题

概览

ANR(Application Not Responding),应用程序无响应,简单的一个定义,却涵盖了很多AAndroid系统的设计思想。

首先,ANR属于应用程序范畴,不同于SNR(System Not Responding),SNR反映的问题是系统进程(system_server)失去了响应能力,而ANR明确将问题圈定在应用程序。SNR是有Watchdog机制保证,具体可查阅Watchdog机制以及问题分析;ANR由消息处理机制保证,Android在系统层实现了一套机制来发现ANR,核心原理是消息调度和超时处理。

其次,ANR机制的主体实现在系统层。所有与ANR相关的消息,都会经过系统进程(system_server)调度,然后派发发哦应用进程完成对消息的实际处理,同时,系统进程设计了不同的超时限制来跟踪消息的处理。一旦应用程序处理消息不当,超时限制就起了作用了,它收集一些系统状态,比如CPU/IO使用情况、进程函数调用栈,并报告用户进程无响应(弹出ANR对话框).

然后,ANR问题本质是一个性能问题和稳定性的问题。ANR机制实际上对应用程序主线程的限制,要求主线程在限定的时间内处理完一些常见的操作(启动服务、处理广播、处理输入),如果超时,则任务主线程已经失去了响应其他操作的能力。主线程中的耗时操作,比如密集CPU运算、大量IO、复杂界面布局等,都会降低应用程序的响应能力。当系统中的ANR问题太多的时候,就影响到了系统的稳定性。

最后,部分ANR问题时很难分析的,有时候由于系统底层的一些影响,导致消息调度失败,比如说Binder资源耗尽,nativepollonce等,出现问题的场景又难以复现。这类ANR问题往往需要花费大量的事件去了解系统的一些行为,超出了ANR机制本省的范畴。

出现ANR的场景主要有以下几种:

Service Timeout:服务在20s内未执行完成;
BroadcastQueue Timeout:比如前台广播在10s内执行完成
ContentProvider Timeout:内容提供者执行超时
inputDispatching Timeout: 输入事件分发超时5s,包括按键分发事件的超时

Google 官方关于ANR的介绍

Google官方关于ANR的介绍是在介绍使用Android Vitals工具时提到:https://developer.android.com/topic/performance/vitals/anr

查找ANR根因

查找ANR根因的方法有

  1. 使用StrictMode
    具体请参照: StrictMode 机制以及性能调优

  2. 启动ANR 背景对话框
    Enable background ANR dialogs

    Android shows ANR dialogs for apps that take too long to process the broadcast message only if Show all ANRs is enabled in the device’s Developer options. For this reason, background ANR dialogs are not always displayed to the user, but the app could still be experiencing performance issues.

    打开 开发者模式 –> 打开 Show All ANR

  3. 使用TraceView

  4. adb pull data/anr

解决ANR问题的方式

There are some common patterns to look for when diagnosing ANRs:
1. The app is doing slow operations involving I/O on the main thread.
2. The app is doing a long calculation on the main thread.
3. The main thread is doing a synchronous binder call to another process, and that other process is taking a long time to return.
4. The main thread is blocked waiting for a synchronized block for a long operation that is happening on another thread.
5. The main thread is in a deadlock with another thread, either in your process or via a binder call. The main thread is not just waiting for a long operation to finish, but is in a deadlock situation. For more information, see Deadlock on Wikipedia.

翻译如下:
在诊断ANR时有一些常见的模式需要查找:

1. 该应用程序在主线程上执行涉及I / O的缓慢操作。
2. 该应用程序正在对主线程进行长时间计算。
3. 主线程正在对另一个进程执行同步绑定程序调用,而其他进程需要很长时间才能返回。
4. 主线程被阻塞,等待正在另一个线程上发生的长操作的同步块。
5. 主线程与另一个线程处于死锁状态,无论是在您的进程中还是通过联编程序调用。主线不仅仅是等待长时间的操作才能完成,而是处于死锁状态。有关更多信息,请参阅 维基百科上的死锁。
  1. 主线程执行超时
    定位到是主线程超时时,需要找到对应的代码,将超时操作移到Handler里面去处理,或者使用AsyncTask(https://developer.android.com/reference/android/os/AsyncTask) 异步任务的方式来处理。

  2. 主线程I/O超时
    I/O超时,将超时操作移动到线程中处理,使用Handler或者AsyncTask.如果I/O是网络操作的或者存储操作的话,网络操作使用Volly的方式来实现,数据存储的话,参照Android提供的几种存储方式https://developer.android.com/guide/topics/data/data-storage

  3. 互斥锁
    当前工作线程持有了UI线程所需要的资源太长时间,导致UI线程一直处于Wait状态.UI线程被阻塞住。可以使用Semaphore(计数信号量)或者线程等其他的互斥机制来解决。

  4. 死锁
    UI线程和其他线程互相持有对方所需要的资源,导致死锁。双方都在等待对方释放资源。
    解决死锁的方式:

  5. 广播超时
    广播OnReceiver方法中耗时太久,比如前台广播超过10s未完成等
    如果广播中需要执行复杂的操作,可以通过IntenService来实现。

  6. 服务超时/内容提供者超时/输入事件超时
    都是用Handler或者AsyncTask来处理

  7. CPU负载过高
    一般CPU负载过高,只有在进行压力测试的时候才能出现,出现CPU负载过高之后,会使得很多事件得不到响应。一般CPU负载过高的ANR问题都不解。

ANR 机制

分析一些初级的ANR问题,只需要简单理解最终输出的日志即可,但对于一些系统问题(比如CPU负载过高、进程卡死) 发的ANR,就需要对整个ANR机制有所了解,才能定位出问题的原因。

ANR机制可以分为两部分:

• * ANR的监测。 Android对不同的ANR类型(Broadcase/Service/InputEvent)都有一套监测机制。

• * ANR的报告。 在监测到ANR以后,需要显示ANR对话框、输出日志(发生ANR时的进程函数调用栈、CPU使用情况等)。

整个ANR机制的代码也是横跨了Android的几个层:

• * App层:应用主线程的处理逻辑

• * Framework层: ANR机制的核心

  • frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
  • frameworks/base/services/core/java/com/android/server/am/BroadcastQueue.java
  • frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
  • frameworks/base/services/core/java/com/android/server/input/InputManagerService.java
  • frameworks/base/services/core/java/com/android/server/wm/InputMonitor.java
  • frameworks/base/core/java/android/view/InputChannel
  • frameworks/base/services/core/java/com/android/internal/os/ProcessCpuTracker

• * Native层: 输入事件派发机制。针对InputEvent类型的ANR

  • frameworks/base//services/core/jni/com_android_server_input_InputManagerService.cpp
  • frameworks/native/services/inputflinger/InputDispatcher.cpp

下面我们会深入源码,分析ANR的监测和报告过程。

ANR检测机制

Service 处理超时

Service运行在应用程序的主线程,如果Service的执行时间超过20秒,则会引发ANR。

当发生Service ANR时,一般可以先排查一下在Service的生命周期函数中(onCreate(), onStartCommand()等)有没有做耗时的操作,譬如复杂的运算、IO操作等。如果应用程序的代码逻辑查不出问题,就需要深入检查当前系统的状态:CPU的使用情况、系统服务的状态等,判断当时发生ANR进程是否受到系统运行异常的影响。

如何检测Service超时呢?Android是通过设置定时消息实现的。定时消息是由AMS的消息队列处理的(system_server的ActivityManager线程)。 AMS有Service运行的上下文信息,所以在AMS中设置一套超时检测机制也是合情合理的。

Service ANR机制相对最为简单,简单说就是AMS中的mHandler收到SERVICE_TIMEOUT_MSG消息时触发。
主体实现在ActiveServices.java中。 当Service的生命周期开始时,bumpServiceExecutingLocked()会被调用,紧接着会调用scheduleServiceTimeoutLocked():

StartService的时候:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
private final void realStartServiceLocked(ServiceRecord r,
ProcessRecord app, boolean execInFg) throws RemoteException {
if (app.thread == null) {
throw new RemoteException();
}
if (DEBUG_MU)
Slog.v(TAG_MU, "realStartServiceLocked, ServiceRecord.uid = " + r.appInfo.uid
+ ", ProcessRecord.uid = " + app.uid);
r.app = app;
r.restartTime = r.lastActivity = SystemClock.uptimeMillis();

final boolean newService = app.services.add(r);
bumpServiceExecutingLocked(r, execInFg, "create");//执行bumpServiceExecutingLocked
mAm.updateLruProcessLocked(app, false, null);
updateServiceForegroundLocked(r.app, /* oomAdj= */ false);
mAm.updateOomAdjLocked();

boolean created = false;
try {
if (LOG_SERVICE_START_STOP) {
String nameTerm;
int lastPeriod = r.shortName.lastIndexOf('.');
nameTerm = lastPeriod >= 0 ? r.shortName.substring(lastPeriod) : r.shortName;
EventLogTags.writeAmCreateService(
r.userId, System.identityHashCode(r), nameTerm, r.app.uid, r.app.pid);
}
........

BindService的时候:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
private final boolean requestServiceBindingLocked(ServiceRecord r, IntentBindRecord i,
boolean execInFg, boolean rebind) throws TransactionTooLargeException {
if (r.app == null || r.app.thread == null) {
// If service is not currently running, can't yet bind.
return false;
}
if (DEBUG_SERVICE) Slog.d(TAG_SERVICE, "requestBind " + i + ": requested=" + i.requested
+ " rebind=" + rebind);
if ((!i.requested || rebind) && i.apps.size() > 0) {
try {
bumpServiceExecutingLocked(r, execInFg, "bind");//执行bumpServiceExecutingLocked
r.app.forceProcessStateUpTo(ActivityManager.PROCESS_STATE_SERVICE);
r.app.thread.scheduleBindService(r, i.intent.getIntent(), rebind,
r.app.repProcState);
if (!rebind) {
i.requested = true;
}
i.hasBound = true;
i.doRebind = false;
} catch (TransactionTooLargeException e) {
// Keep the executeNesting count accurate.
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Crashed while binding " + r, e);
final boolean inDestroying = mDestroyingServices.contains(r);
serviceDoneExecutingLocked(r, inDestroying, inDestroying);
throw e;
} catch (RemoteException e) {
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Crashed while binding " + r);
// Keep the executeNesting count accurate.
final boolean inDestroying = mDestroyingServices.contains(r);
serviceDoneExecutingLocked(r, inDestroying, inDestroying);
return false;
}
}
return true;
}

调用bumpServiceExecutingLocked

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
private final void bumpServiceExecutingLocked(ServiceRecord r, boolean fg, String why) {
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, ">>> EXECUTING "
+ why + " of " + r + " in app " + r.app);
else if (DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING, ">>> EXECUTING "
+ why + " of " + r.shortName);
long now = SystemClock.uptimeMillis();
if (r.executeNesting == 0) {
r.executeFg = fg;
ServiceState stracker = r.getTracker();
if (stracker != null) {
stracker.setExecuting(true, mAm.mProcessStats.getMemFactorLocked(), now);
}
if (r.app != null) {
r.app.executingServices.add(r);
r.app.execServicesFg |= fg;
if (r.app.executingServices.size() == 1) {
scheduleServiceTimeoutLocked(r.app);//
}
}
} else if (r.app != null && fg && !r.app.execServicesFg) {
r.app.execServicesFg = true;
scheduleServiceTimeoutLocked(r.app);//
}
r.executeFg |= fg;
r.executeNesting++;
r.executingStart = now;
}

紧接着调用scheduleServiceTimeoutLocked

1
2
3
4
5
6
7
8
9
10
void scheduleServiceTimeoutLocked(ProcessRecord proc) {
if (proc.executingServices.size() == 0 || proc.thread == null) {
return;
}
Message msg = mAm.mHandler.obtainMessage(
ActivityManagerService.SERVICE_TIMEOUT_MSG);
msg.obj = proc;
mAm.mHandler.sendMessageDelayed(msg,
proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
}

上述方法通过AMS.MainHandler抛出一个定时消息SERVICE_TIMEOUT_MSG:

前台进程中执行Service,超时时间是SERVICE_TIMEOUT(20秒)
后台进程中执行Service,超时时间是SERVICE_BACKGROUND_TIMEOUT(200秒)

1
2
3
4
5
6
7
8
9
// How long we wait for a service to finish executing.
static final int SERVICE_TIMEOUT = 20*1000;

// How long we wait for a service to finish executing.
static final int SERVICE_BACKGROUND_TIMEOUT = SERVICE_TIMEOUT * 10;

// How long the startForegroundService() grace period is to get around to
// calling startForeground() before we ANR + stop it.
static final int SERVICE_START_FOREGROUND_TIMEOUT = 5*1000;

当Service的生命周期结束时,会调用serviceDoneExecutingLocked()方法,之前抛出的SERVICE_TIMEOUT_MSG消息在这个方法中会被清除。 如果在超时时间内,SERVICE_TIMEOUT_MSG没有被清除,那么,AMS.MainHandler就会响应这个消息:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
boolean finishing) {
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "<<< DONE EXECUTING " + r
+ ": nesting=" + r.executeNesting
+ ", inDestroying=" + inDestroying + ", app=" + r.app);
else if (DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING,
"<<< DONE EXECUTING " + r.shortName);
r.executeNesting--;
if (r.executeNesting <= 0) {
if (r.app != null) {
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE,
"Nesting at 0 of " + r.shortName);
r.app.execServicesFg = false;
r.app.executingServices.remove(r);
if (r.app.executingServices.size() == 0) {
if (DEBUG_SERVICE || DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING,
"No more executingServices of " + r.shortName);
mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);//清除SERVICE_TIMEOUT_MSG消息
} else if (r.executeFg) {
// Need to re-evaluate whether the app still needs to be in the foreground.
for (int i=r.app.executingServices.size()-1; i>=0; i--) {
if (r.app.executingServices.valueAt(i).executeFg) {
r.app.execServicesFg = true;
break;
}
}
}
.....

AMS.MainHandler就会响应这个消息:
ActivityManagerService
本文是基于Android8.0的代码分析,与http://duanqz.github.io/2015-10-12-ANR-Analysis 这篇文章这里的代码不一样。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
final class MainHandler extends Handler {
public MainHandler(Looper looper) {
super(looper, null, true);
}

@Override
public void handleMessage(Message msg) {
switch (msg.what) {
case UPDATE_CONFIGURATION_MSG: {
final ContentResolver resolver = mContext.getContentResolver();
Settings.System.putConfigurationForUser(resolver, (Configuration) msg.obj,
msg.arg1);
} break;
case GC_BACKGROUND_PROCESSES_MSG: {
synchronized (ActivityManagerService.this) {
performAppGcsIfAppropriateLocked();
}
} break;
case SERVICE_TIMEOUT_MSG: {
mServices.serviceTimeout((ProcessRecord)msg.obj);
} break;
case SERVICE_FOREGROUND_TIMEOUT_MSG: {
mServices.serviceForegroundTimeout((ServiceRecord)msg.obj);
} break;
case SERVICE_FOREGROUND_CRASH_MSG: {
mServices.serviceForegroundCrash((ProcessRecord)msg.obj);
} break;
case DISPATCH_PENDING_INTENT_CANCEL_MSG: {
RemoteCallbackList<IResultReceiver> callbacks
= (RemoteCallbackList<IResultReceiver>)msg.obj;
int N = callbacks.beginBroadcast();
for (int i = 0; i < N; i++) {
try {
callbacks.getBroadcastItem(i).send(Activity.RESULT_CANCELED, null);
} catch (RemoteException e) {
}
}
callbacks.finishBroadcast();
} break;
......

mServices是ActiveServices,又返回到ActiveService中进行处理:
serviceTimeout 的处理如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
void serviceTimeout(ProcessRecord proc) {
String anrMessage = null;
synchronized(mAm) {
if (proc.executingServices.size() == 0 || proc.thread == null) {
return;
}
final long now = SystemClock.uptimeMillis();
final long maxTime = now -
(proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
ServiceRecord timeout = null;
long nextTime = 0;
for (int i=proc.executingServices.size()-1; i>=0; i--) {// 寻找运行超时的Service
ServiceRecord sr = proc.executingServices.valueAt(i);
if (sr.executingStart < maxTime) {
timeout = sr;
break;
}
if (sr.executingStart > nextTime) {
nextTime = sr.executingStart;
}
}
if (timeout != null && mAm.mLruProcesses.contains(proc)) { // 判断执行Service超时的进程是否在最近运行进程列表,如果在,则输出log到logcat和dumplog中,并记录anrMessage供弹框使用
Slog.w(TAG, "Timeout executing service: " + timeout);
StringWriter sw = new StringWriter();
PrintWriter pw = new FastPrintWriter(sw, false, 1024);
pw.println(timeout);
timeout.dump(pw, " ");
pw.close();
mLastAnrDump = sw.toString();
mAm.mHandler.removeCallbacks(mLastAnrDumpClearer);
mAm.mHandler.postDelayed(mLastAnrDumpClearer, LAST_ANR_LIFETIME_DURATION_MSECS);
anrMessage = "executing service " + timeout.shortName;
} else { //其他的三种情况分别如下,则继续等待监测20s或者200s
//a)如果Service超时,但是不在最近运行进程列表中
//b)如果Service不超时,且不在最近运行进程列表中
//c)如果Service不超时,且在最近运进程列表中
Message msg = mAm.mHandler.obtainMessage(
ActivityManagerService.SERVICE_TIMEOUT_MSG);
msg.obj = proc;
mAm.mHandler.sendMessageAtTime(msg, proc.execServicesFg
? (nextTime+SERVICE_TIMEOUT) : (nextTime + SERVICE_BACKGROUND_TIMEOUT));
}
}
if (anrMessage != null) {
mAm.mAppErrors.appNotResponding(proc, null, null, false, anrMessage);//ANR的弹框处理
}
}

上述方法会找到当前进程已经超时的Service,经过一些判定后,决定要报告ANR,最终调用AMS.appNotResponding()方法。 走到这一步,ANR机制已经完成了监测报告任务,剩下的任务就是ANR结果的输出,我们称之为ANR的报告机制。 ANR的报告机制是通过AMS.appNotResponding()完成的,Broadcast和InputEvent类型的ANR最终也都会调用这个方法,我们后文再详细展开。

至此,我们分析了Service的ANR机制:

通过定时消息跟踪Service的运行,当定时消息被响应时,说明Service还没有运行完成,这就意味着Service ANR。

Broadcast处理超时

应用程序可以注册广播接收器,实现BroadcastReceiver.onReceive()方法来完成对广播的处理。通常,这个方法是在主线程执行的,Android限定它执行时间不能超过10秒,否则,就会引发ANR。

onReceive()也可以调度在其他线程执行,通过Context.registerReceiver(BroadcastReceiver, IntentFilter, String, Handler)这个方法注册广播接收器, 可以指定一个处理的Handler,将onReceive()调度在非主线程执行。

这里先把问题抛出来了:

Android如何将广播投递给各个应用程序?
Android如何检测广播处理超时?

广播消息的调度
AMS维护了两个广播队列BroadcastQueue:

foreground queue,前台队列的超时时间是10秒
background queue,后台队列的超时时间是60秒

之所以有两个,就是因为要区分的不同超时时间。所有发送的广播都会进入到队列中等待调度,在发送广播时,可以通过Intent.FLAG_RECEIVER_FOREGROUND参数将广播投递到前台队列。 AMS线程会不断地从队列中取出广播消息派发到各个接收器(BroadcastReceiver)。当要派发广播时,AMS会调用BroadcastQueue.scheduleBroadcastsLocked()方法:

参见 Android中广播注册和发送机制分析

frameworks/base/services/core/java/com/android/server/am/BroadcastQueue.java

1
2
3
4
5
6
7
8
9
10
11
public void scheduleBroadcastsLocked() {
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Schedule broadcasts ["
+ mQueueName + "]: current="
+ mBroadcastsScheduled);

if (mBroadcastsScheduled) {
return;
}
mHandler.sendMessage(mHandler.obtainMessage(BROADCAST_INTENT_MSG, this));
mBroadcastsScheduled = true;
}

上述方法中,往AMS线程的消息队列发送BROADCAST_INTENT_MSG消息,由此也可以看到真正派发广播的是AMS线程(system_server进程中的ActivityManager线程)。 由于上述方法可能被并发调用,所以通过mBroadcastsScheduled这个变量来标识BROADCAST_INTENT_MSG是不是已经被AMS线程接收了,当已经抛出的消息还未被接受时,不需要重新抛出。 该消息被接收后的处理逻辑如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
static final int BROADCAST_INTENT_MSG = ActivityManagerService.FIRST_BROADCAST_QUEUE_MSG;
static final int BROADCAST_TIMEOUT_MSG = ActivityManagerService.FIRST_BROADCAST_QUEUE_MSG + 1;

final BroadcastHandler mHandler;

private final class BroadcastHandler extends Handler {
public BroadcastHandler(Looper looper) {
super(looper, null, true);
}

@Override
public void handleMessage(Message msg) {
switch (msg.what) {
case BROADCAST_INTENT_MSG: {
if (DEBUG_BROADCAST) Slog.v(
TAG_BROADCAST, "Received BROADCAST_INTENT_MSG");
processNextBroadcast(true);
} break;
case BROADCAST_TIMEOUT_MSG: {
synchronized (mService) {
broadcastTimeoutLocked(true);
}
} break;
}
}
}

直接调用BroadcastQueue.processNextBroadcast()方法,fromMsg参数为true表示这是一次来自BROADCAST_INTENT_MSG消息的派发请求。 BroadcastQueue.processNextBroadcast()是派发广播消息最为核心的函数,代码量自然也不小,我们分成几个部分来分析:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
// processNextBroadcast部分1:处理非串行广播消息
final ActivityManagerService mService;//mService是ActivityManagerService
.....
final void processNextBroadcast(boolean fromMsg) {
synchronized(mService) {
BroadcastRecord r;

if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "processNextBroadcast ["
+ mQueueName + "]: "
+ mParallelBroadcasts.size() + " parallel broadcasts, "
+ mOrderedBroadcasts.size() + " ordered broadcasts");

mService.updateCpuStats();//更新CPU状态信息

//1. 更新mBroadcastsScheduled 标志状态
if (fromMsg) {
mBroadcastsScheduled = false;
}

// First, deliver any non-serialized broadcasts right away.
//2. 处理并行广播
while (mParallelBroadcasts.size() > 0) {
r = mParallelBroadcasts.remove(0);
r.dispatchTime = SystemClock.uptimeMillis();
r.dispatchClockTime = System.currentTimeMillis();

if (Trace.isTagEnabled(Trace.TRACE_TAG_ACTIVITY_MANAGER)) {
Trace.asyncTraceEnd(Trace.TRACE_TAG_ACTIVITY_MANAGER,
createBroadcastTraceTitle(r, BroadcastRecord.DELIVERY_PENDING),
System.identityHashCode(r));
Trace.asyncTraceBegin(Trace.TRACE_TAG_ACTIVITY_MANAGER,
createBroadcastTraceTitle(r, BroadcastRecord.DELIVERY_DELIVERED),
System.identityHashCode(r));
}

final int N = r.receivers.size();
if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST, "Processing parallel broadcast ["
+ mQueueName + "] " + r);
for (int i=0; i<N; i++) {
Object target = r.receivers.get(i);
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
"Delivering non-ordered on [" + mQueueName + "] to registered "
+ target + ": " + r);
deliverToRegisteredReceiverLocked(r, (BroadcastFilter)target, false, i);
}
addBroadcastToHistoryLocked(r);
if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST, "Done with parallel broadcast ["
+ mQueueName + "] " + r);
}

// Now take care of the next serialized one...

// If we are waiting for a process to come up to handle the next
// broadcast, then do nothing at this point. Just in case, we
// check that the process we're waiting for still exists.
//3. 处理阻塞广播
if (mPendingBroadcast != null) {
if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST,
"processNextBroadcast [" + mQueueName + "]: waiting for "
+ mPendingBroadcast.curApp);

boolean isDead; // isDead表示当前广播消息的进程的存活状态
synchronized (mService.mPidsSelfLocked) {
ProcessRecord proc = mService.mPidsSelfLocked.get(mPendingBroadcast.curApp.pid);
isDead = proc == null || proc.crashing;
}
if (!isDead) {
// It's still alive, so keep waiting// 如果还活着,则返回该函数,继续等待下次派发
return;
} else {
Slog.w(TAG, "pending app ["
+ mQueueName + "]" + mPendingBroadcast.curApp
+ " died before responding to broadcast");
mPendingBroadcast.state = BroadcastRecord.IDLE;
mPendingBroadcast.nextReceiver = mPendingBroadcastRecvIndex;
mPendingBroadcast = null;
}
}
//未完待续

第一个部分是处理非”串行广播消息”,有以下几个步骤:

  1. 设置mBroadcastsScheduled。该变量在前文说过,是对BROADCAST_INTENT_MSG进行控制。 如果是响应BROADCAST_INTENT_MSG 的派发调用,则将mBroadcastsScheduled设为false, 表示本次 BROADCAST_INTENT_MSG 已经处理完毕,可以继续抛出下一次 BROADCAST_INTENT_MSG消息了

  2. 处理”并行广播消息”。广播接受器有”动态”和”静态”之分,通过Context.registerReceiver()注册的广播接收器为”动态”的,通过AndroidManifest.xml注册的广播接收器为”静态”的。 广播消息有”并行”和”串行”之分,”并行广播消息”都会派发到”动态”接收器,”串行广播消息”则会根据实际情况派发到两种接收器。 我们先不去探究Android为什么这么设计,只关注这两种广播消息派发的区别。在BroadcastQueue维护着两个队列:

mParallelBroadcasts,”并行广播消息”都会进入到此队列中排队。”并行广播消息”可以一次性派发完毕,即在一个循环中将广播派发到所有的”动态”接收器

mOrderedBroadcasts,”串行广播消息”都会进入到此队列中排队。”串行广播消息”需要轮侯派发,当一个接收器处理完毕后,会再抛出BROADCAST_INTENT_MSG消息, 再次进入BroadcastQueue.processNextBroadcast()处理下一个

  1. 处理阻塞的广播消息。有时候会存在一个广播消息派发不出去的情况,这个广播消息会保存在mPendingBroadcast变量中。新一轮的派发启动时,会判断接收该消息的进程是否还活着, 如果接收进程还活着,那么就继续等待。否则,就放弃这个广播消息

接下来是最为复杂的一部分,处理”串行广播消息”,ANR监测机制只在这一类广播消息中才发挥作用,也就是说”并行广播消息”是不会发生ANR的。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
// processNextBroadcast部分2:从队列中取出“串行广播消息”
boolean looped = false;

do {//处理有序广播
if (mOrderedBroadcasts.size() == 0) {//有序广播都处理完了时
// No more broadcasts pending, so all done!
mService.scheduleAppGcsLocked();
if (looped) {
// If we had finished the last ordered broadcast, then
// make sure all processes have correct oom and sched
// adjustments.
mService.updateOomAdjLocked();
}
return;
}
r = mOrderedBroadcasts.get(0);
boolean forceReceive = false;

// 1. 广播消息的第一个ANR监测机制
// Ensure that even if something goes awry with the timeout
// detection, we catch "hung" broadcasts here, discard them,
// and continue to make progress.
//
// This is only done if the system is ready so that PRE_BOOT_COMPLETED
// receivers don't get executed with timeouts. They're intended for
// one time heavy lifting after system upgrades and can take
// significant amounts of time.
int numReceivers = (r.receivers != null) ? r.receivers.size() : 0;
if (mService.mProcessesReady && r.dispatchTime > 0) {
long now = SystemClock.uptimeMillis();
if ((numReceivers > 0) &&
(now > r.dispatchTime + (2*mTimeoutPeriod*numReceivers))) {
Slog.w(TAG, "Hung broadcast ["
+ mQueueName + "] discarded after timeout failure:"
+ " now=" + now
+ " dispatchTime=" + r.dispatchTime
+ " startTime=" + r.receiverTime
+ " intent=" + r.intent
+ " numReceivers=" + numReceivers
+ " nextReceiver=" + r.nextReceiver
+ " state=" + r.state);
broadcastTimeoutLocked(false); // forcibly finish this broadcast
forceReceive = true;
r.state = BroadcastRecord.IDLE;
}
}

if (r.state != BroadcastRecord.IDLE) {
if (DEBUG_BROADCAST) Slog.d(TAG_BROADCAST,
"processNextBroadcast("
+ mQueueName + ") called when not idle (state="
+ r.state + ")");
return;
}

// 2. 判断该广播消息是否处理完毕
if (r.receivers == null || r.nextReceiver >= numReceivers
|| r.resultAbort || forceReceive) {
// No more receivers for this broadcast! Send the final
// result if requested...
if (r.resultTo != null) {
try {
if (DEBUG_BROADCAST) Slog.i(TAG_BROADCAST,
"Finishing broadcast [" + mQueueName + "] "
+ r.intent.getAction() + " app=" + r.callerApp);
performReceiveLocked(r.callerApp, r.resultTo,
new Intent(r.intent), r.resultCode,
r.resultData, r.resultExtras, false, false, r.userId);
// Set this to null so that the reference
// (local and remote) isn't kept in the mBroadcastHistory.
r.resultTo = null;
} catch (RemoteException e) {
r.resultTo = null;
Slog.w(TAG, "Failure ["
+ mQueueName + "] sending broadcast result of "
+ r.intent, e);

}
}

if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Cancelling BROADCAST_TIMEOUT_MSG");
cancelBroadcastTimeoutLocked();

if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST,
"Finished with ordered broadcast " + r);

// ... and on to the next...
addBroadcastToHistoryLocked(r);
if (r.intent.getComponent() == null && r.intent.getPackage() == null
&& (r.intent.getFlags()&Intent.FLAG_RECEIVER_REGISTERED_ONLY) == 0) {
// This was an implicit broadcast... let's record it for posterity.
mService.addBroadcastStatLocked(r.intent.getAction(), r.callerPackage,
r.manifestCount, r.manifestSkipCount, r.finishTime-r.dispatchTime);
}
mOrderedBroadcasts.remove(0);
r = null;
looped = true;
continue;
}
} while (r == null);
//未完待续

这部分是一个do-while循环,每次都从mOrderedBroadcasts队列中取出第一条广播消息进行处理。第一个Broadcast ANR监测机制千呼万唤总算是出现了:

判定当前时间是否已经超过了r.dispatchTime + 2×mTimeoutPeriod×numReceivers:

dispatchTime表示这一系列广播消息开始派发的时间。“串行广播消息”是逐个接收器派发的,一个接收器处理完毕后,才开始处理下一个消息派发。 开始派发到第一个接收器的时间就是dispatchTime。dispatchTime需要开始等广播消息派发以后才会设定,也就是说,第一次进入processNextBroadcast()时, dispatchTime=0,并不会进入该条件判断

mTimeoutPeriod由当前BroadcastQueue的类型决定(forground为10秒,background为60秒)。这个时间在初始化BroadcastQueue的时候就设置好了, 本意是限定每一个Receiver处理广播的时间,这里利用它做了一个超时计算

假设一个广播消息有2个接受器,mTimeoutPeriod是10秒,当2×10×2=40秒后,该广播消息还未处理完毕,就调用broadcastTimeoutLocked()方法, 这个方法会判断当前是不是发生了ANR,我们后文再分析。

如果广播消息是否已经处理完毕,则从mOrderedBroadcasts中移除,重新循环,处理下一条;否则,就会跳出循环。

以上代码块完成的主要任务是从队列中取一条“串行广播消息”,接下来就准备派发了:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
// processNextBroadcast部分3:串行广播消息的第二个ANR监测机制
// Get the next receiver...
int recIdx = r.nextReceiver++;

// Keep track of when this receiver started, and make sure there
// is a timeout message pending to kill it if need be.
r.receiverTime = SystemClock.uptimeMillis();
if (recIdx == 0) {
r.dispatchTime = r.receiverTime;
r.dispatchClockTime = System.currentTimeMillis();
if (Trace.isTagEnabled(Trace.TRACE_TAG_ACTIVITY_MANAGER)) {
Trace.asyncTraceEnd(Trace.TRACE_TAG_ACTIVITY_MANAGER,
createBroadcastTraceTitle(r, BroadcastRecord.DELIVERY_PENDING),
System.identityHashCode(r));
Trace.asyncTraceBegin(Trace.TRACE_TAG_ACTIVITY_MANAGER,
createBroadcastTraceTitle(r, BroadcastRecord.DELIVERY_DELIVERED),
System.identityHashCode(r));
}
if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST, "Processing ordered broadcast ["
+ mQueueName + "] " + r);
}
if (! mPendingBroadcastTimeoutMessage) {
long timeoutTime = r.receiverTime + mTimeoutPeriod;
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
"Submitting BROADCAST_TIMEOUT_MSG ["
+ mQueueName + "] for " + r + " at " + timeoutTime);
setBroadcastTimeoutLocked(timeoutTime);
}
//未完待续

取出“串行广播消息”后,一旦要开始派发,第二个ANR检测机制就出现了。mPendingBroadcastTimeoutMessage变量用于标识当前是否有阻塞的超时消息, 如果没有则调用BroadcastQueue.setBroadcastTimeoutLocked():

1
2
3
4
5
6
7
final void setBroadcastTimeoutLocked(long timeoutTime) {
if (! mPendingBroadcastTimeoutMessage) {
Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this);
mHandler.sendMessageAtTime(msg, timeoutTime);
mPendingBroadcastTimeoutMessage = true;
}
}

通过设置一个定时消息BROADCAST_TIMEOUT_MSG来跟踪当前广播消息的执行情况,这种超时监测机制跟Service ANR很类似,也是抛到AMS线程的消息队列。 如果所有的接收器都处理完毕了,则会调用cancelBroadcastTimeoutLocked()清除该消息;否则,该消息就会响应,并调用broadcastTimeoutLocked(), 这个方法在第一种ANR监测机制的时候调用过,第二种ANR监测机制也会调用,我们留到后文分析。

继续分析processNextBroadcast函数
设置完定时消息后,就开始派发广播消息了,首先是“动态”接收器:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
// processNextBroadcast部分4: 向“动态”接收器派发广播消息
final BroadcastOptions brOptions = r.options;
final Object nextReceiver = r.receivers.get(recIdx);
// 动态接收器的类型都是BroadcastFilter
if (nextReceiver instanceof BroadcastFilter) {
// Simple case: this is a registered receiver who gets
// a direct call.
BroadcastFilter filter = (BroadcastFilter)nextReceiver;
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
"Delivering ordered ["
+ mQueueName + "] to registered "
+ filter + ": " + r);
deliverToRegisteredReceiverLocked(r, filter, r.ordered, recIdx);
if (r.receiver == null || !r.ordered) {
// The receiver has already finished, so schedule to
// process the next one.
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Quick finishing ["
+ mQueueName + "]: ordered="
+ r.ordered + " receiver=" + r.receiver);
r.state = BroadcastRecord.IDLE;
scheduleBroadcastsLocked();
} else {
if (brOptions != null && brOptions.getTemporaryAppWhitelistDuration() > 0) {
scheduleTempWhitelistLocked(filter.owningUid,
brOptions.getTemporaryAppWhitelistDuration(), r);
}
}
return;
}
//未完待续

“动态”接收器的载体进程一般是处于运行状态的,所以向这种类型的接收器派发消息相对简单,调用BroadcastQueue.deliverToRegisteredReceiverLocked()完成接下来的工作。 但“静态”接收器是在AndroidManifest.xml中注册的,派发的时候,可能广播接收器的载体进程还没有启动,所以,这种场景会复杂很多。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
// processNextBroadcast部分5: 向“静态”接收器派发广播消息
// Hard case: need to instantiate the receiver, possibly
// starting its application process to host it.
//// 静态接收器的类型都是 ResolveInfo
ResolveInfo info =
(ResolveInfo)nextReceiver;
ComponentName component = new ComponentName(
info.activityInfo.applicationInfo.packageName,
info.activityInfo.name);

boolean skip = false;
if (brOptions != null &&
(info.activityInfo.applicationInfo.targetSdkVersion
< brOptions.getMinManifestReceiverApiLevel() ||
info.activityInfo.applicationInfo.targetSdkVersion
> brOptions.getMaxManifestReceiverApiLevel())) {
skip = true;
}
// 1. 权限检查
int perm = mService.checkComponentPermission(info.activityInfo.permission,
r.callingPid, r.callingUid, info.activityInfo.applicationInfo.uid,
info.activityInfo.exported);
if (!skip && perm != PackageManager.PERMISSION_GRANTED) {
if (!info.activityInfo.exported) {
Slog.w(TAG, "Permission Denial: broadcasting "
+ r.intent.toString()
+ " from " + r.callerPackage + " (pid=" + r.callingPid
+ ", uid=" + r.callingUid + ")"
+ " is not exported from uid " + info.activityInfo.applicationInfo.uid
+ " due to receiver " + component.flattenToShortString());
} else {
Slog.w(TAG, "Permission Denial: broadcasting "
+ r.intent.toString()
+ " from " + r.callerPackage + " (pid=" + r.callingPid
+ ", uid=" + r.callingUid + ")"
+ " requires " + info.activityInfo.permission
+ " due to receiver " + component.flattenToShortString());
}
skip = true;
} else if (!skip && info.activityInfo.permission != null) {
final int opCode = AppOpsManager.permissionToOpCode(info.activityInfo.permission);
if (opCode != AppOpsManager.OP_NONE
&& mService.mAppOpsService.noteOperation(opCode, r.callingUid,
r.callerPackage) != AppOpsManager.MODE_ALLOWED) {
Slog.w(TAG, "Appop Denial: broadcasting "
+ r.intent.toString()
+ " from " + r.callerPackage + " (pid="
+ r.callingPid + ", uid=" + r.callingUid + ")"
+ " requires appop " + AppOpsManager.permissionToOp(
info.activityInfo.permission)
+ " due to registered receiver "
+ component.flattenToShortString());
skip = true;
}
}
if (!skip && info.activityInfo.applicationInfo.uid != Process.SYSTEM_UID &&
r.requiredPermissions != null && r.requiredPermissions.length > 0) {
for (int i = 0; i < r.requiredPermissions.length; i++) {
String requiredPermission = r.requiredPermissions[i];
try {
perm = AppGlobals.getPackageManager().
checkPermission(requiredPermission,
info.activityInfo.applicationInfo.packageName,
UserHandle
.getUserId(info.activityInfo.applicationInfo.uid));
} catch (RemoteException e) {
perm = PackageManager.PERMISSION_DENIED;
}
if (perm != PackageManager.PERMISSION_GRANTED) {
Slog.w(TAG, "Permission Denial: receiving "
+ r.intent + " to "
+ component.flattenToShortString()
+ " requires " + requiredPermission
+ " due to sender " + r.callerPackage
+ " (uid " + r.callingUid + ")");
skip = true;
break;
}
int appOp = AppOpsManager.permissionToOpCode(requiredPermission);
if (appOp != AppOpsManager.OP_NONE && appOp != r.appOp
&& mService.mAppOpsService.noteOperation(appOp,
info.activityInfo.applicationInfo.uid, info.activityInfo.packageName)
!= AppOpsManager.MODE_ALLOWED) {
Slog.w(TAG, "Appop Denial: receiving "
+ r.intent + " to "
+ component.flattenToShortString()
+ " requires appop " + AppOpsManager.permissionToOp(
requiredPermission)
+ " due to sender " + r.callerPackage
+ " (uid " + r.callingUid + ")");
skip = true;
break;
}
}
}
if (!skip && r.appOp != AppOpsManager.OP_NONE
&& mService.mAppOpsService.noteOperation(r.appOp,
info.activityInfo.applicationInfo.uid, info.activityInfo.packageName)
!= AppOpsManager.MODE_ALLOWED) {
Slog.w(TAG, "Appop Denial: receiving "
+ r.intent + " to "
+ component.flattenToShortString()
+ " requires appop " + AppOpsManager.opToName(r.appOp)
+ " due to sender " + r.callerPackage
+ " (uid " + r.callingUid + ")");
skip = true;
}
if (!skip) {
skip = !mService.mIntentFirewall.checkBroadcast(r.intent, r.callingUid,
r.callingPid, r.resolvedType, info.activityInfo.applicationInfo.uid);
}
boolean isSingleton = false;
try {
isSingleton = mService.isSingleton(info.activityInfo.processName,
info.activityInfo.applicationInfo,
info.activityInfo.name, info.activityInfo.flags);
} catch (SecurityException e) {
Slog.w(TAG, e.getMessage());
skip = true;
}
if ((info.activityInfo.flags&ActivityInfo.FLAG_SINGLE_USER) != 0) {
if (ActivityManager.checkUidPermission(
android.Manifest.permission.INTERACT_ACROSS_USERS,
info.activityInfo.applicationInfo.uid)
!= PackageManager.PERMISSION_GRANTED) {
Slog.w(TAG, "Permission Denial: Receiver " + component.flattenToShortString()
+ " requests FLAG_SINGLE_USER, but app does not hold "
+ android.Manifest.permission.INTERACT_ACROSS_USERS);
skip = true;
}
}
if (!skip && info.activityInfo.applicationInfo.isInstantApp()
&& r.callingUid != info.activityInfo.applicationInfo.uid) {
Slog.w(TAG, "Instant App Denial: receiving "
+ r.intent
+ " to " + component.flattenToShortString()
+ " due to sender " + r.callerPackage
+ " (uid " + r.callingUid + ")"
+ " Instant Apps do not support manifest receivers");
skip = true;
}
if (!skip && r.callerInstantApp
&& (info.activityInfo.flags & ActivityInfo.FLAG_VISIBLE_TO_INSTANT_APP) == 0
&& r.callingUid != info.activityInfo.applicationInfo.uid) {
Slog.w(TAG, "Instant App Denial: receiving "
+ r.intent
+ " to " + component.flattenToShortString()
+ " requires receiver have visibleToInstantApps set"
+ " due to sender " + r.callerPackage
+ " (uid " + r.callingUid + ")");
skip = true;
}
if (!skip) {
r.manifestCount++;
} else {
r.manifestSkipCount++;
}
if (r.curApp != null && r.curApp.crashing) {
// If the target process is crashing, just skip it.
Slog.w(TAG, "Skipping deliver ordered [" + mQueueName + "] " + r
+ " to " + r.curApp + ": process crashing");
skip = true;
}
if (!skip) {
boolean isAvailable = false;
try {
isAvailable = AppGlobals.getPackageManager().isPackageAvailable(
info.activityInfo.packageName,
UserHandle.getUserId(info.activityInfo.applicationInfo.uid));
} catch (Exception e) {
// all such failures mean we skip this receiver
Slog.w(TAG, "Exception getting recipient info for "
+ info.activityInfo.packageName, e);
}
if (!isAvailable) {
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
"Skipping delivery to " + info.activityInfo.packageName + " / "
+ info.activityInfo.applicationInfo.uid
+ " : package no longer available");
skip = true;
}
}

// If permissions need a review before any of the app components can run, we drop
// the broadcast and if the calling app is in the foreground and the broadcast is
// explicit we launch the review UI passing it a pending intent to send the skipped
// broadcast.
if (mService.mPermissionReviewRequired && !skip) {
if (!requestStartTargetPermissionsReviewIfNeededLocked(r,
info.activityInfo.packageName, UserHandle.getUserId(
info.activityInfo.applicationInfo.uid))) {
skip = true;
}
}

// This is safe to do even if we are skipping the broadcast, and we need
// this information now to evaluate whether it is going to be allowed to run.
final int receiverUid = info.activityInfo.applicationInfo.uid;
// If it's a singleton, it needs to be the same app or a special app
if (r.callingUid != Process.SYSTEM_UID && isSingleton
&& mService.isValidSingletonCall(r.callingUid, receiverUid)) {
info.activityInfo = mService.getActivityInfoForUser(info.activityInfo, 0);
}

// 2. 获取接收器所在的进程
String targetProcess = info.activityInfo.processName;
ProcessRecord app = mService.getProcessRecordLocked(targetProcess,
info.activityInfo.applicationInfo.uid, false);

if (!skip) {
final int allowed = mService.getAppStartModeLocked(
info.activityInfo.applicationInfo.uid, info.activityInfo.packageName,
info.activityInfo.applicationInfo.targetSdkVersion, -1, true, false);
if (allowed != ActivityManager.APP_START_MODE_NORMAL) {
// We won't allow this receiver to be launched if the app has been
// completely disabled from launches, or it was not explicitly sent
// to it and the app is in a state that should not receive it
// (depending on how getAppStartModeLocked has determined that).
if (allowed == ActivityManager.APP_START_MODE_DISABLED) {
Slog.w(TAG, "Background execution disabled: receiving "
+ r.intent + " to "
+ component.flattenToShortString());
skip = true;
} else if (((r.intent.getFlags()&Intent.FLAG_RECEIVER_EXCLUDE_BACKGROUND) != 0)
|| (r.intent.getComponent() == null
&& r.intent.getPackage() == null
&& ((r.intent.getFlags()
& Intent.FLAG_RECEIVER_INCLUDE_BACKGROUND) == 0)
&& !isSignaturePerm(r.requiredPermissions))) {
mService.addBackgroundCheckViolationLocked(r.intent.getAction(),
component.getPackageName());
Slog.w(TAG, "Background execution not allowed: receiving "
+ r.intent + " to "
+ component.flattenToShortString());
skip = true;
}
}
}

if (!skip) {
skip = !mService.isAutoStartAllowed(info.activityInfo.applicationInfo.uid, info.activityInfo.applicationInfo.packageName);
}

if (skip) {
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
"Skipping delivery of ordered [" + mQueueName + "] "
+ r + " for whatever reason");
r.delivery[recIdx] = BroadcastRecord.DELIVERY_SKIPPED;
r.receiver = null;
r.curFilter = null;
r.state = BroadcastRecord.IDLE;
scheduleBroadcastsLocked();
return;
}

r.delivery[recIdx] = BroadcastRecord.DELIVERY_DELIVERED;
r.state = BroadcastRecord.APP_RECEIVE;
r.curComponent = component;
r.curReceiver = info.activityInfo;
if (DEBUG_MU && r.callingUid > UserHandle.PER_USER_RANGE) {
Slog.v(TAG_MU, "Updated broadcast record activity info for secondary user, "
+ info.activityInfo + ", callingUid = " + r.callingUid + ", uid = "
+ info.activityInfo.applicationInfo.uid);
}

if (brOptions != null && brOptions.getTemporaryAppWhitelistDuration() > 0) {
scheduleTempWhitelistLocked(receiverUid,
brOptions.getTemporaryAppWhitelistDuration(), r);
}

// Broadcast is being executed, its package can't be stopped.
try {
AppGlobals.getPackageManager().setPackageStoppedState(
r.curComponent.getPackageName(), false, UserHandle.getUserId(r.callingUid));
} catch (RemoteException e) {
} catch (IllegalArgumentException e) {
Slog.w(TAG, "Failed trying to unstop package "
+ r.curComponent.getPackageName() + ": " + e);
}

// 3. 进程已经启动
// Is this receiver's application already running?
if (app != null && app.thread != null && !app.killed) {
try {
app.addPackage(info.activityInfo.packageName,
info.activityInfo.applicationInfo.versionCode, mService.mProcessStats);
processCurBroadcastLocked(r, app);
return;
} catch (RemoteException e) {
Slog.w(TAG, "Exception when sending broadcast to "
+ r.curComponent, e);
} catch (RuntimeException e) {
Slog.wtf(TAG, "Failed sending broadcast to "
+ r.curComponent + " with " + r.intent, e);
// If some unexpected exception happened, just skip
// this broadcast. At this point we are not in the call
// from a client, so throwing an exception out from here
// will crash the entire system instead of just whoever
// sent the broadcast.
logBroadcastReceiverDiscardLocked(r);
finishReceiverLocked(r, r.resultCode, r.resultData,
r.resultExtras, r.resultAbort, false);
scheduleBroadcastsLocked();
// We need to reset the state if we failed to start the receiver.
r.state = BroadcastRecord.IDLE;
return;
}

// If a dead object exception was thrown -- fall through to
// restart the application.
}

// 4. 进程还未启动
// Not running -- get it started, to be executed when the app comes up.
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
"Need to start app ["
+ mQueueName + "] " + targetProcess + " for broadcast " + r);
if ((r.curApp=mService.startProcessLocked(targetProcess,
info.activityInfo.applicationInfo, true,
r.intent.getFlags() | Intent.FLAG_FROM_BACKGROUND,
"broadcast", r.curComponent,
(r.intent.getFlags()&Intent.FLAG_RECEIVER_BOOT_UPGRADE) != 0, false, false))
== null) {
// Ah, this recipient is unavailable. Finish it if necessary,
// and mark the broadcast record as ready for the next.
Slog.w(TAG, "Unable to launch app "
+ info.activityInfo.applicationInfo.packageName + "/"
+ info.activityInfo.applicationInfo.uid + " for broadcast "
+ r.intent + ": process is bad");
logBroadcastReceiverDiscardLocked(r);
finishReceiverLocked(r, r.resultCode, r.resultData,
r.resultExtras, r.resultAbort, false);
scheduleBroadcastsLocked();
r.state = BroadcastRecord.IDLE;
return;
}

// 5. 进程启动失败
mPendingBroadcast = r;
mPendingBroadcastRecvIndex = recIdx;
}
}
// processNextBroadcast完

• 1.“静态”接收器是ResolveInfo,需要通过PackageManager获取包信息,进行权限检查。权限检查的内容非常庞大,此处不表。

• 2.经过一系列复杂的权限检查后,终于可以向目标接收器派发了。通过AMS.getProcessRecordLocked()获取广播接收器的进程信息

• 3.如果app.thread != null,则进程已经启动,就可以调用BroadcastQueue.processCurBroadcastLocked()进行接下来的派发处理了

• 4.如果进程还没有启动,则需要通过AMS.startProcessLocked()来启动进程,当前消息并未派发,调用BroadcastQueue.scheduleBroadcastsLocked()进入下一次的调度

• 5.如果进程启动失败了,则当前消息记录成mPendingBroadcast,即阻塞的广播消息,等待下一次调度时处理

庞大的processNextBroadcast()终于完结了,它的功能就是对广播消息进行调度,该方法被设计得十分复杂而精巧,用于应对不同的广播消息和接收器的处理。

广播消息的跨进程传递调度是完成了,接下来,我们就来分析被调度广播消息如何到达应用程序。上文的分析中,最终有两个方法将广播消息派发出去: BroadcastQueue.deliverToRegisteredReceiverLocked() 和BroadcastQueue.processCurBroadcastLocked()。

我们先不展开这两个函数的逻辑,试想要将广播消息的从AMS线程所在的system_server进程传递到应用程序的进程,该怎么实现? 自然需要用到跨进程调用,Android中最常规的手段就是Binder机制。没错,广播消息派发到应用进程就是这么玩的。

对于应用程序已经启动(app.thread != null)的情况,会通过IApplicationThread发起跨进程调用, 调用关系如下:

1
2
3
ActivityThread.ApplicationThread.scheduleReceiver()
└── ActivityThread.handleReceiver()
└── BroadcastReceiver.onReceive()

对于应用程序还未启动的情况,会调用IIntentReceiver发起跨进程调用,应用进程的实现在LoadedApk.ReceiverDispatcher.IntentReceiver 中, 调用关系如下:
1
2
3
4
LoadedApk.ReceiverDispatcher.IntentReceiver.performReceive()
└── LoadedApk.ReceiverDispatcher.performReceiver()
└── LoadedApk.ReceiverDispatcher.Args.run()
└── BroadcastReceiver.onReceive()

最终,都会调用到BroadcastReceiver.onReceive(),在应用进程执行接收广播消息的具体动作。 对于“串行广播消息”而言,执行完了以后,还需要通知system_server进程,才能继续将广播消息派发到下一个接收器,这又需要跨进程调用了。 应用进程在处理完广播消息后,即在BroadcastReceiver.onReceive()执行完毕后,会调用BroadcastReceiver.PendingResult.finish(), 接下来的调用关系如下:

1
2
3
4
5
BroadcastReceiver.PendingResult.finish()
└── BroadcastReceiver.PendingResult.sendFinished()
└── IActivityManager.finishReceiver()
└── ActivityManagerService.finishReceiver()
└── BroadcastQueue.processNextBroadcat()

通过IActivityManager发起了一个从应用进程到system_server进程的调用,最终在AMS线程中,又走到了BroadcastQueue.processNextBroadcat(), 开始下一轮的调度。

broadcastTimeoutLocked()方法
前文说过,两种ANR机制最终都会调用BroadcastQueue.broadcastTimeoutLocked()方法, 第一种ANR监测生效时,会将fromMsg设置为false;第二种ANR监测生效时,会将fromMsg参数为True时,表示当前正在响应BROADCAST_TIMEOUT_MSG消息。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
final void broadcastTimeoutLocked(boolean fromMsg) {
if (fromMsg) {// 1. 设置mPendingBroadcastTimeoutMessage
mPendingBroadcastTimeoutMessage = false;
}

if (mOrderedBroadcasts.size() == 0) {
return;
}

// 2. 判断第二种ANR机制是否超时
long now = SystemClock.uptimeMillis();
BroadcastRecord r = mOrderedBroadcasts.get(0);
if (fromMsg) {
if (!mService.mProcessesReady) {
// Only process broadcast timeouts if the system is ready. That way
// PRE_BOOT_COMPLETED broadcasts can't timeout as they are intended
// to do heavy lifting for system up.
return;
}

long timeoutTime = r.receiverTime + mTimeoutPeriod;
if (timeoutTime > now) {
// We can observe premature timeouts because we do not cancel and reset the
// broadcast timeout message after each receiver finishes. Instead, we set up
// an initial timeout then kick it down the road a little further as needed
// when it expires.
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
"Premature timeout ["
+ mQueueName + "] @ " + now + ": resetting BROADCAST_TIMEOUT_MSG for "
+ timeoutTime);
setBroadcastTimeoutLocked(timeoutTime);
return;
}
}

BroadcastRecord br = mOrderedBroadcasts.get(0);
if (br.state == BroadcastRecord.WAITING_SERVICES) {
// In this case the broadcast had already finished, but we had decided to wait
// for started services to finish as well before going on. So if we have actually
// waited long enough time timeout the broadcast, let's give up on the whole thing
// and just move on to the next.
Slog.i(TAG, "Waited long enough for: " + (br.curComponent != null
? br.curComponent.flattenToShortString() : "(null)"));
br.curComponent = null;
br.state = BroadcastRecord.IDLE;
processNextBroadcast(false);
return;
}

Slog.w(TAG, "Timeout of broadcast " + r + " - receiver=" + r. receiver
+ ", started " + (now - r.receiverTime) + "ms ago");
r.receiverTime = now;
r.anrCount++;

ProcessRecord app = null;
String anrMessage = null;

Object curReceiver;
if (r.nextReceiver > 0) {
curReceiver = r.receivers.get(r.nextReceiver-1);
r.delivery[r.nextReceiver-1] = BroadcastRecord.DELIVERY_TIMEOUT;
} else {
curReceiver = r.curReceiver;
}
Slog.w(TAG, "Receiver during timeout of " + r + " : " + curReceiver);
logBroadcastReceiverDiscardLocked(r);
if (curReceiver != null && curReceiver instanceof BroadcastFilter) {
BroadcastFilter bf = (BroadcastFilter)curReceiver;
if (bf.receiverList.pid != 0
&& bf.receiverList.pid != ActivityManagerService.MY_PID) {
synchronized (mService.mPidsSelfLocked) {
app = mService.mPidsSelfLocked.get(
bf.receiverList.pid);
}
}
} else {
app = r.curApp;
}

if (app != null) {
anrMessage = "Broadcast of " + r.intent.toString();
}

if (mPendingBroadcast == r) {
mPendingBroadcast = null;
}

// 3. 已经超时,则结束对当前接收器,开始新一轮调度
// Move on to the next receiver.
finishReceiverLocked(r, r.resultCode, r.resultData,
r.resultExtras, r.resultAbort, false);
scheduleBroadcastsLocked();

// 4. 抛出绘制ANR对话框的消息
if (anrMessage != null) {
// Post the ANR to the handler since we do not want to process ANRs while
// potentially holding our lock.
mHandler.post(new AppNotResponding(app, anrMessage));
}
}

• 1.mPendingBroadcastTimeoutMessage 标识是否存在未处理的BROADCAST_TIMEOUT_MSG 消息, 将其设置成false,允许继续抛出BROADCAST_TIMEOUT_MSG 消息

• 2.每次将广播派发到接收器,都会将r.receiverTime 更新,如果判断当前还未超时,则又抛出一个 BROADCAST_TIMEOUT_MSG 息。正常情况下,所有接收器处理完毕后,才会清除 BROADCAST_TIMEOUT_MSG ;否则,每进行一次广播消息的调度,都会抛出 BROADCAST_TIMEOUT_MSG 消息

• 3.判断已经超时了,说明当前的广播接收器还未处理完毕,则结束掉当前的接收器,开始新一轮广播调度

• 4.最终,发出绘制ANR对话框的消息

至此,我们回答了前文提出的两个问题:

AMS维护着广播队列BroadcastQueue,AMS线程不断从队列中取出消息进行调度,完成广播消息的派发。 在派发“串行广播消息”时,会抛出一个定时消息 BROADCAST_TIMEOUT_MSG ,在广播接收器处理完毕后,AMS会将定时消息清除。 如果BROADCAST_TIMEOUT_MSG得到了响应,就会判断是否广播消息处理超时,最终通知ANR的发生。

Input处理超时

应用程序可以接收输入事件(按键、触屏、轨迹球等),当5秒内没有处理完毕时,则会引发ANR。

如果Broadcast ANR一样,我们抛出Input ANR的几个问题:

  1. 输入事件经历了一些什么工序才能被派发到应用的界面?
  2. 如何检测到输入时间处理超时

输入事件最开始由硬件设备(譬如按键或触摸屏幕)发起,Android有一套输入子系统来发现各种输入事件, 这些事件最终都会被InputDispatcher分发到各个需要接收事件的窗口。 那么,窗口如何告之InputDispatcher自己需要处理输入事件呢?Android通过InputChannel 连接InputDispatcher和窗口,InputChannel其实是封装后的Linux管道(Pipe)。 每一个窗口都会有一个独立的InputChannel,窗口需要将这个InputChannel注册到InputDispatcher中:

frameworks/native/services/inputflinger/InputDispatcher.cpp

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
status_t InputDispatcher::registerInputChannel(const sp<InputChannel>& inputChannel,
const sp<InputWindowHandle>& inputWindowHandle, bool monitor) {
#if DEBUG_REGISTRATION
ALOGD("channel '%s' ~ registerInputChannel - monitor=%s", inputChannel->getName().string(),
toString(monitor));
#endif

{ // acquire lock
AutoMutex _l(mLock);

if (getConnectionIndexLocked(inputChannel) >= 0) {
ALOGW("Attempted to register already registered input channel '%s'",
inputChannel->getName().string());
return BAD_VALUE;
}

sp<Connection> connection = new Connection(inputChannel, inputWindowHandle, monitor);

int fd = inputChannel->getFd();
mConnectionsByFd.add(fd, connection);

if (monitor) {
mMonitoringChannels.push(inputChannel);
}

mLooper->addFd(fd, 0, ALOOPER_EVENT_INPUT, handleReceiveCallback, this);
} // release lock

// Wake the looper because some connections have changed.
mLooper->wake();
return OK;
}

对于InputDispatcher而言,每注册一个InputChannel都被视为一个Connection,通过文件描述符来区别。InputDispatcher是一个消息处理循环,当有新的Connection时,就需要唤醒消息循环队列进行处理。

输入事件的类型有很多,按键、轨迹球、触屏等,Android对这些事件进行了分类,处理这些事件的窗口也被赋予了一个类型 (targetType):Foucused或Touched ,如果当前输入事件是按键类型,则寻找 Focused类型的窗口;如果当前输入事件类型是触摸类型,则寻找Touched类型的窗口。 InputDispatcher需要经过以下复杂的调用关系,才能把一个输入事件派发出去(调用关系以按键事件为例,触屏事件的调用关系类似):

1
2
3
4
5
6
7
8
9
InputDispatcherThread::threadLoop()
└── InputDispatcher::dispatchOnce()
└── InputDispatcher::dispatchOnceInnerLocked()
└── InputDispatcher::dispatchKeyLocked()
└── InputDispatcher::dispatchEventLocked()
└── InputDispatcher::prepareDispatchCycleLocked()
└── InputDispatcher::enqueueDispatchEntriesLocked()
└── InputDispatcher::startDispatchCycleLocked()
└── InputPublisher::publishKeyEvent()

具体每个函数的实现逻辑此处不表。我们提炼出几个关键点:

  1. InputDispatcherThread是一个线程,它处理一次消息的派发
  2. 输入事件作为一个消息,需要排队等待派发,每一个Connection都维护两个队列:
    outboundQueue: 等待发送给窗口的事件。每一个新消息到来,都会先进入到此队列
    waitQueue: 已经发送给窗口的事件
  3. publishKeyEvent完成后,表示事件已经派发了,就将事件从outboundQueue挪到了waitQueue

事件经过这么一轮处理,就算是从InputDispatcher派发出去了,但事件是不是被窗口收到了,还需要等待接收方的“finished”通知。 在向InputDispatcher 注册InputChannel 的时候,同时会注册一个回调函数handleReceiveCallback():

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
int InputDispatcher::handleReceiveCallback(int fd, int events, void* data) {
InputDispatcher* d = static_cast<InputDispatcher*>(data);

{ // acquire lock
AutoMutex _l(d->mLock);

ssize_t connectionIndex = d->mConnectionsByFd.indexOfKey(fd);
if (connectionIndex < 0) {
ALOGE("Received spurious receive callback for unknown input channel. "
"fd=%d, events=0x%x", fd, events);
return 0; // remove the callback
}

bool notify;
sp<Connection> connection = d->mConnectionsByFd.valueAt(connectionIndex);
if (!(events & (ALOOPER_EVENT_ERROR | ALOOPER_EVENT_HANGUP))) {
if (!(events & ALOOPER_EVENT_INPUT)) {
ALOGW("channel '%s' ~ Received spurious callback for unhandled poll event. "
"events=0x%x", connection->getInputChannelName(), events);
return 1;
}

nsecs_t currentTime = now();
bool gotOne = false;
status_t status;
for (;;) {
uint32_t seq;
bool handled;
status = connection->inputPublisher.receiveFinishedSignal(&seq, &handled);
if (status) {
break;
}
d->finishDispatchCycleLocked(currentTime, connection, seq, handled);
gotOne = true;
}
if (gotOne) {
d->runCommandsLockedInterruptible();
if (status == WOULD_BLOCK) {
return 1;
}
}

notify = status != DEAD_OBJECT || !connection->monitor;
if (notify) {
ALOGE("channel '%s' ~ Failed to receive finished signal. status=%d",
connection->getInputChannelName(), status);
}
} else {
// Monitor channels are never explicitly unregistered.
// We do it automatically when the remote endpoint is closed so don't warn
// about them.
notify = !connection->monitor;
if (notify) {
ALOGW("channel '%s' ~ Consumer closed input channel or an error occurred. "
"events=0x%x", connection->getInputChannelName(), events);
}
}

// Unregister the channel.
d->unregisterInputChannelLocked(connection->inputChannel, notify);
return 0; // remove the callback
} // release lock
}

当收到的status为OK时,会调用finishDispatchCycleLocked()来完成一个消息的处理:

1
2
3
4
InputDispatcher::finishDispatchCycleLocked()
└── InputDispatcher::onDispatchCycleFinishedLocked()
└── InputDispatcher::doDispatchCycleFinishedLockedInterruptible()
└── InputDispatcher::startDispatchCycleLocked()

调用到doDispatchCycleFinishedLockedInterruptible() 方法时,会将已经成功派发的消息从waitQueue中移除, 进一步调用会 startDispatchCycleLocked开始派发新的事件。

至此,我们回答了第一个问题:

一个正常的输入事件会经过从outboundQueue挪到waitQueue的过程,表示消息已经派发出去;再经过从waitQueue中移除的过程,表示消息已经被窗口接收。InputDispatcher作为中枢,不停地在递送着输入事件,当一个事件无法得到处理的时候,InputDispatcher不能就此死掉啊,否则系统也太容易崩溃了。 InputDispatcher 的策略是放弃掉处理不过来的事件, 并发出通知(这个通知机制就是ANR),继续进行下一轮消息的处理。

理解输入事件分发模型,我们可以举一个生活中的例子:
每一个输入事件可以比做一个快递,InputDispatcher就像一个快递中转站,窗口就像是收件人,InputChannel就像是快递员。 所有快递都会经过中转站中处理,中转站需要知道每一个快递的收件人是谁,通过快递员将快递发送到具体的收件人。 这其中有很多场景导致快递不能及时送到:譬如联系不到收件人;快递很多,快递员会忙不过来;快递员受伤休假了等等… 这时候快递员就需要告知中转站:有快递无法及时送到了。中转站在收到快递员的通知后,一边继续派发其他快递,一边报告上级。

在了解输入事件分发模型之后,我们可以见识一下ANR机制了。在派发事件时,dispatchKeyLocked()和dispatchMotionLocked(), 需要找到当前的焦点窗口 ,焦点窗口才是最终接收事 件的地方,找窗口的过程就会判断是否已经发生了ANR:

1
2
3
4
5
6
InputDispatcher::findFocusedWindowTargetsLocked()
InputDispatcher::findTouchedWindowTargetsLocked()
└── InputDispatcher::handleTargetsNotReadyLocked()
└── InputDispatcher::onANRLocked()
└── InputDispatcher::doNotifyANRLockedInterruptible()
└── NativeInputManager::notifyANR()
  • 首先,会调用findFocusedWindowTargetsLocked()或findTouchedWindowTargetsLocked()寻找接收输入事件的窗口。在找到窗口以后,会调用checkWindowReadyForMoreInputLocked() 检查窗口是否有能力再接收新的输入事件,会有一系列的场景阻碍事件的继续派发:

    场景1: 窗口处于paused状态,不能处理输入事件“Waiting because the [targetType] window is paused.”

    场景2: 窗口还未向InputDispatcher注册,无法将事件派发到窗口“Waiting because the [targetType] window’s input channel is not registered with the input dispatcher. The window may be in the process of being removed.”

    场景3: 窗口和InputDispatcher的连接已经中断,即InputChannel不能正常工作“Waiting because the [targetType] window’s input connection is [status]. The window may be in the process of being removed.”

    场景4: InputChannel已经饱和,不能再处理新的事件“Waiting because the [targetType] window’s input channel is full. Outbound queue length: %d. Wait queue length: %d.”

    场景5: 对于按键类型(KeyEvent)的输入事件,需要等待上一个事件处理完毕“Waiting to send key event because the [targetType] window has not finished processing all of the input events that were previously delivered to it. Outbound queue length: %d. Wait queue length: %d.”

    场景6: 对于触摸类型(TouchEvent)的输入事件,可以立即派发到当前的窗口,因为TouchEvent都是发生在用户当前可见的窗口。但有一种情况,如果当前应用由于队列有太多的输入事件等待派发,导致发生了ANR,那TouchEvent事件就需要排队等待派发。“Waiting to send non-key event because the %s window has not finished processing certain input events that were delivered to it over %0.1fms ago. Wait queue length: %d. Wait queue head age: %0.1fms.”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
String8 InputDispatcher::checkWindowReadyForMoreInputLocked(nsecs_t currentTime,
const sp<InputWindowHandle>& windowHandle, const EventEntry* eventEntry,
const char* targetType) {
// If the window is paused then keep waiting.
if (windowHandle->getInfo()->paused) {
return String8::format("Waiting because the %s window is paused.", targetType);
}

// If the window's connection is not registered then keep waiting.
ssize_t connectionIndex = getConnectionIndexLocked(windowHandle->getInputChannel());
if (connectionIndex < 0) {
return String8::format("Waiting because the %s window's input channel is not "
"registered with the input dispatcher. The window may be in the process "
"of being removed.", targetType);
}

// If the connection is dead then keep waiting.
sp<Connection> connection = mConnectionsByFd.valueAt(connectionIndex);
if (connection->status != Connection::STATUS_NORMAL) {
return String8::format("Waiting because the %s window's input connection is %s."
"The window may be in the process of being removed.", targetType,
connection->getStatusLabel());
}

// If the connection is backed up then keep waiting.
if (connection->inputPublisherBlocked) {
return String8::format("Waiting because the %s window's input channel is full. "
"Outbound queue length: %d. Wait queue length: %d.",
targetType, connection->outboundQueue.count(), connection->waitQueue.count());
}

// Ensure that the dispatch queues aren't too far backed up for this event.
if (eventEntry->type == EventEntry::TYPE_KEY) {
// If the event is a key event, then we must wait for all previous events to
// complete before delivering it because previous events may have the
// side-effect of transferring focus to a different window and we want to
// ensure that the following keys are sent to the new window.
//
// Suppose the user touches a button in a window then immediately presses "A".
// If the button causes a pop-up window to appear then we want to ensure that
// the "A" key is delivered to the new pop-up window. This is because users
// often anticipate pending UI changes when typing on a keyboard.
// To obtain this behavior, we must serialize key events with respect to all
// prior input events.
if (!connection->outboundQueue.isEmpty() || !connection->waitQueue.isEmpty()) {
return String8::format("Waiting to send key event because the %s window has not "
"finished processing all of the input events that were previously "
"delivered to it. Outbound queue length: %d. Wait queue length: %d.",
targetType, connection->outboundQueue.count(), connection->waitQueue.count());
}
} else {
// Touch events can always be sent to a window immediately because the user intended
// to touch whatever was visible at the time. Even if focus changes or a new
// window appears moments later, the touch event was meant to be delivered to
// whatever window happened to be on screen at the time.
//
// Generic motion events, such as trackball or joystick events are a little trickier.
// Like key events, generic motion events are delivered to the focused window.
// Unlike key events, generic motion events don't tend to transfer focus to other
// windows and it is not important for them to be serialized. So we prefer to deliver
// generic motion events as soon as possible to improve efficiency and reduce lag
// through batching.
//
// The one case where we pause input event delivery is when the wait queue is piling
// up with lots of events because the application is not responding.
// This condition ensures that ANRs are detected reliably.
if (!connection->waitQueue.isEmpty()
&& currentTime >= connection->waitQueue.head->deliveryTime
+ STREAM_AHEAD_EVENT_TIMEOUT) {
return String8::format("Waiting to send non-key event because the %s window has not "
"finished processing certain input events that were delivered to it over "
"%0.1fms ago. Wait queue length: %d. Wait queue head age: %0.1fms.",
targetType, STREAM_AHEAD_EVENT_TIMEOUT * 0.000001f,
connection->waitQueue.count(),
(currentTime - connection->waitQueue.head->deliveryTime) * 0.000001f);
}
}
return String8::empty();
}
  • 然后,上述有任何一个场景发生了,则输入事件需要继续等待,紧接着就会调用handleTargetsNotReadyLocked()来判断是不是已经的等待超时了:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
int32_t InputDispatcher::handleTargetsNotReadyLocked(nsecs_t currentTime,
const EventEntry* entry,
const sp<InputApplicationHandle>& applicationHandle,
const sp<InputWindowHandle>& windowHandle,
nsecs_t* nextWakeupTime, const char* reason) {
if (applicationHandle == NULL && windowHandle == NULL) {
if (mInputTargetWaitCause != INPUT_TARGET_WAIT_CAUSE_SYSTEM_NOT_READY) {
#if DEBUG_FOCUS
ALOGD("Waiting for system to become ready for input. Reason: %s", reason);
#endif
mInputTargetWaitCause = INPUT_TARGET_WAIT_CAUSE_SYSTEM_NOT_READY;
mInputTargetWaitStartTime = currentTime;
mInputTargetWaitTimeoutTime = LONG_LONG_MAX;
mInputTargetWaitTimeoutExpired = false;
mInputTargetWaitApplicationHandle.clear();
}
} else {
if (mInputTargetWaitCause != INPUT_TARGET_WAIT_CAUSE_APPLICATION_NOT_READY) {
#if DEBUG_FOCUS
ALOGD("Waiting for application to become ready for input: %s. Reason: %s",
getApplicationWindowLabelLocked(applicationHandle, windowHandle).string(),
reason);
#endif
nsecs_t timeout;
if (windowHandle != NULL) {
timeout = windowHandle->getDispatchingTimeout(DEFAULT_INPUT_DISPATCHING_TIMEOUT);
} else if (applicationHandle != NULL) {
timeout = applicationHandle->getDispatchingTimeout(
DEFAULT_INPUT_DISPATCHING_TIMEOUT);
} else {
timeout = DEFAULT_INPUT_DISPATCHING_TIMEOUT;
}

mInputTargetWaitCause = INPUT_TARGET_WAIT_CAUSE_APPLICATION_NOT_READY;
mInputTargetWaitStartTime = currentTime;
mInputTargetWaitTimeoutTime = currentTime + timeout;
mInputTargetWaitTimeoutExpired = false;
mInputTargetWaitApplicationHandle.clear();

if (windowHandle != NULL) {
mInputTargetWaitApplicationHandle = windowHandle->inputApplicationHandle;
}
if (mInputTargetWaitApplicationHandle == NULL && applicationHandle != NULL) {
mInputTargetWaitApplicationHandle = applicationHandle;
}
}
}

if (mInputTargetWaitTimeoutExpired) {
return INPUT_EVENT_INJECTION_TIMED_OUT;
}

if (currentTime >= mInputTargetWaitTimeoutTime) {
onANRLocked(currentTime, applicationHandle, windowHandle,
entry->eventTime, mInputTargetWaitStartTime, reason);

// Force poll loop to wake up immediately on next iteration once we get the
// ANR response back from the policy.
*nextWakeupTime = LONG_LONG_MIN;
return INPUT_EVENT_INJECTION_PENDING;
} else {
// Force poll loop to wake up when timeout is due.
if (mInputTargetWaitTimeoutTime < *nextWakeupTime) {
*nextWakeupTime = mInputTargetWaitTimeoutTime;
}
return INPUT_EVENT_INJECTION_PENDING;
}
}
  • 最后,如果当前事件派发已经超时,则说明已经检测到了ANR,调用onANRLocked()方法,然后将nextWakeupTime设置为最小值,马上开始下一轮调度。 在onANRLocked()方法中, 会保存ANR的一些状态信息,调用doNotifyANRLockedInterruptible(),进一步会调用到JNI层的 NativeInputManager::notifyANR()方法, 它的主要功能就是衔接Native层和 Java层,直接调用Java层的InputManagerService.notifyANR()方法。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
void InputDispatcher::onANRLocked(
nsecs_t currentTime, const sp<InputApplicationHandle>& applicationHandle,
const sp<InputWindowHandle>& windowHandle,
nsecs_t eventTime, nsecs_t waitStartTime, const char* reason) {
float dispatchLatency = (currentTime - eventTime) * 0.000001f;
float waitDuration = (currentTime - waitStartTime) * 0.000001f;
ALOGI("Application is not responding: %s. "
"It has been %0.1fms since event, %0.1fms since wait started. Reason: %s",
getApplicationWindowLabelLocked(applicationHandle, windowHandle).string(),
dispatchLatency, waitDuration, reason);

// Capture a record of the InputDispatcher state at the time of the ANR.
time_t t = time(NULL);
struct tm tm;
localtime_r(&t, &tm);
char timestr[64];
strftime(timestr, sizeof(timestr), "%F %T", &tm);
mLastANRState.clear();
mLastANRState.append(INDENT "ANR:\n");
mLastANRState.appendFormat(INDENT2 "Time: %s\n", timestr);
mLastANRState.appendFormat(INDENT2 "Window: %s\n",
getApplicationWindowLabelLocked(applicationHandle, windowHandle).string());
mLastANRState.appendFormat(INDENT2 "DispatchLatency: %0.1fms\n", dispatchLatency);
mLastANRState.appendFormat(INDENT2 "WaitDuration: %0.1fms\n", waitDuration);
mLastANRState.appendFormat(INDENT2 "Reason: %s\n", reason);
dumpDispatchStateLocked(mLastANRState);

CommandEntry* commandEntry = postCommandLocked(
& InputDispatcher::doNotifyANRLockedInterruptible);
commandEntry->inputApplicationHandle = applicationHandle;
commandEntry->inputWindowHandle = windowHandle;
commandEntry->reason = reason;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
void InputDispatcher::doNotifyANRLockedInterruptible(
CommandEntry* commandEntry) {
mLock.unlock();

nsecs_t newTimeout = mPolicy->notifyANR(
commandEntry->inputApplicationHandle, commandEntry->inputWindowHandle,
commandEntry->reason);

mLock.lock();

resumeAfterTargetsNotReadyTimeoutLocked(newTimeout,
commandEntry->inputWindowHandle != NULL
? commandEntry->inputWindowHandle->getInputChannel() : NULL);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
InputDispatcher::InputDispatcher(const sp<InputDispatcherPolicyInterface>& policy) :
mPolicy(policy),
mPendingEvent(NULL), mLastDropReason(DROP_REASON_NOT_DROPPED),
mAppSwitchSawKeyDown(false), mAppSwitchDueTime(LONG_LONG_MAX),
mNextUnblockedEvent(NULL),
mDispatchEnabled(false), mDispatchFrozen(false), mInputFilterEnabled(false),
mInputTargetWaitCause(INPUT_TARGET_WAIT_CAUSE_NONE) {
mLooper = new Looper(false);

mKeyRepeatState.lastKeyEntry = NULL;

policy->getDispatcherConfiguration(&mConfig);
}

看到mPolicy是InputDispatcherPolicyInterface,对应的在JNI层frameworks/base/services/core/jni/com_android_server_input_InputManagerService.cpp

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
class NativeInputManager : public virtual RefBase,
public virtual InputReaderPolicyInterface,
public virtual InputDispatcherPolicyInterface,
public virtual PointerControllerPolicyInterface {
protected:
virtual ~NativeInputManager();

public:
NativeInputManager(jobject contextObj, jobject serviceObj, const sp<Looper>& looper);

inline sp<InputManager> getInputManager() const { return mInputManager; }

void dump(String8& dump);

void setVirtualDisplayViewports(JNIEnv* env, jobjectArray viewportObjArray);
void setDisplayViewport(int32_t viewportType, const DisplayViewport& viewport);

status_t registerInputChannel(JNIEnv* env, const sp<InputChannel>& inputChannel,
const sp<InputWindowHandle>& inputWindowHandle, bool monitor);
status_t unregisterInputChannel(JNIEnv* env, const sp<InputChannel>& inputChannel);

void setInputWindows(JNIEnv* env, jobjectArray windowHandleObjArray);
void setFocusedApplication(JNIEnv* env, jobject applicationHandleObj);
void setInputDispatchMode(bool enabled, bool frozen);
void setSystemUiVisibility(int32_t visibility);
void setPointerSpeed(int32_t speed);
void setInputDeviceEnabled(uint32_t deviceId, bool enabled);
void setShowTouches(bool enabled);
void setInteractive(bool interactive);
void reloadCalibration();
void setPointerIconType(int32_t iconId);
void reloadPointerIcons();
void setCustomPointerIcon(const SpriteIcon& icon);
void setPointerCapture(bool enabled);

/* --- InputReaderPolicyInterface implementation --- */

virtual void getReaderConfiguration(InputReaderConfiguration* outConfig);
virtual sp<PointerControllerInterface> obtainPointerController(int32_t deviceId);
virtual void notifyInputDevicesChanged(const Vector<InputDeviceInfo>& inputDevices);
virtual sp<KeyCharacterMap> getKeyboardLayoutOverlay(const InputDeviceIdentifier& identifier);
virtual String8 getDeviceAlias(const InputDeviceIdentifier& identifier);
virtual TouchAffineTransformation getTouchAffineTransformation(JNIEnv *env,
jfloatArray matrixArr);
virtual TouchAffineTransformation getTouchAffineTransformation(
const String8& inputDeviceDescriptor, int32_t surfaceRotation);

/* --- InputDispatcherPolicyInterface implementation --- */

virtual void notifySwitch(nsecs_t when, uint32_t switchValues, uint32_t switchMask,
uint32_t policyFlags);
virtual void notifyConfigurationChanged(nsecs_t when);
virtual nsecs_t notifyANR(const sp<InputApplicationHandle>& inputApplicationHandle,
const sp<InputWindowHandle>& inputWindowHandle,
const String8& reason);
virtual void notifyInputChannelBroken(const sp<InputWindowHandle>& inputWindowHandle);
virtual bool filterInputEvent(const InputEvent* inputEvent, uint32_t policyFlags);
virtual void getDispatcherConfiguration(InputDispatcherConfiguration* outConfig);
virtual void interceptKeyBeforeQueueing(const KeyEvent* keyEvent, uint32_t& policyFlags);
virtual void interceptMotionBeforeQueueing(nsecs_t when, uint32_t& policyFlags);
virtual nsecs_t interceptKeyBeforeDispatching(
const sp<InputWindowHandle>& inputWindowHandle,
const KeyEvent* keyEvent, uint32_t policyFlags);
virtual bool dispatchUnhandledKey(const sp<InputWindowHandle>& inputWindowHandle,
const KeyEvent* keyEvent, uint32_t policyFlags, KeyEvent* outFallbackKeyEvent);
virtual void pokeUserActivity(nsecs_t eventTime, int32_t eventType);
virtual bool checkInjectEventsPermissionNonReentrant(
int32_t injectorPid, int32_t injectorUid);

/* --- PointerControllerPolicyInterface implementation --- */

virtual void loadPointerIcon(SpriteIcon* icon);
virtual void loadPointerResources(PointerResources* outResources);
virtual void loadAdditionalMouseResources(std::map<int32_t, SpriteIcon>* outResources,
std::map<int32_t, PointerAnimation>* outAnimationResources);
virtual int32_t getDefaultPointerIconId();
virtual int32_t getCustomPointerIconId();
......

JNI层中notifyANR的实现:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
nsecs_t NativeInputManager::notifyANR(const sp<InputApplicationHandle>& inputApplicationHandle,
const sp<InputWindowHandle>& inputWindowHandle, const String8& reason) {
#if DEBUG_INPUT_DISPATCHER_POLICY
ALOGD("notifyANR");
#endif
ATRACE_CALL();

JNIEnv* env = jniEnv();

// 将应用程序句柄、窗口句柄、ANR原因字符串,转化为Java层的对象
jobject inputApplicationHandleObj =
getInputApplicationHandleObjLocalRef(env, inputApplicationHandle);
jobject inputWindowHandleObj =
getInputWindowHandleObjLocalRef(env, inputWindowHandle);
jstring reasonObj = env->NewStringUTF(reason.string());

// 调用Java层的InputManagerService.notifyANR()方法
jlong newTimeout = env->CallLongMethod(mServiceObj,
gServiceClassInfo.notifyANR, inputApplicationHandleObj, inputWindowHandleObj,
reasonObj);
if (checkAndClearExceptionFromCallback(env, "notifyANR")) {
newTimeout = 0; // abort dispatch
} else {
assert(newTimeout >= 0);
}

env->DeleteLocalRef(reasonObj);
env->DeleteLocalRef(inputWindowHandleObj);
env->DeleteLocalRef(inputApplicationHandleObj);
return newTimeout;
}

至此,ANR的处理逻辑转交到了Java层。底层(Native)发现一旦有输入事件派发超时,就会通知上层(Java),上层收到ANR通知后,决定是否终止当前输入事件的派发。

发生ANR时,Java层最开始的入口是InputManagerService.notifyANR(),它是直接被Native层调用的。我们先把ANR的Java层调用关系列出来:

1
2
3
4
5
6
7
8
9
InputManagerService.notifyANR()
└── InputMonitor.notifyANR()
├── IApplicationToken.keyDispatchingTimedOut()
│ └── ActivityRecord.keyDispatchingTimedOut()
│ └── AMS.inputDispatchingTimedOut()
│ └── AMS.appNotResponding()

└── AMS.inputDispatchingTimedOut()
└── AMS.appNotResponding()

在Java层的InputManagerService代码: frameworks/base/services/core/java/com/android/server/input/InputManagerService.java

1
2
3
4
5
6
7
8
private WindowManagerCallbacks mWindowManagerCallbacks;
....
// Native callback.
private long notifyANR(InputApplicationHandle inputApplicationHandle,
InputWindowHandle inputWindowHandle, String reason) {
return mWindowManagerCallbacks.notifyANR(
inputApplicationHandle, inputWindowHandle, reason);
}

WindowManagerCallbacks由 InputMonitor.java实现: frameworks/base/services/core/java/com/android/server/wm/InputMonitor.java
InputManagerService.notifyANR()只是为Native层定义了一个接口,它直接调用InputMonitor.notifyANR()。 如果该方法的返回值等于0, 则放弃本次输入事件;如果大于0, 则表示 需要继续等待的时间。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
final class InputMonitor implements InputManagerService.WindowManagerCallbacks {
private final WindowManagerService mService;
....

/* Notifies the window manager about an application that is not responding.
* Returns a new timeout to continue waiting in nanoseconds, or 0 to abort dispatch.
*
* Called by the InputManager.
*/
@Override
public long notifyANR(InputApplicationHandle inputApplicationHandle,
InputWindowHandle inputWindowHandle, String reason) {
AppWindowToken appWindowToken = null;
WindowState windowState = null;
boolean aboveSystem = false;
synchronized (mService.mWindowMap) {
if (inputWindowHandle != null) {
windowState = (WindowState) inputWindowHandle.windowState;
if (windowState != null) {
appWindowToken = windowState.mAppToken;
}
}
if (appWindowToken == null && inputApplicationHandle != null) {
appWindowToken = (AppWindowToken)inputApplicationHandle.appWindowToken;
}

if (windowState != null) {
Slog.i(TAG_WM, "Input event dispatching timed out "
+ "sending to " + windowState.mAttrs.getTitle()
+ ". Reason: " + reason);
// Figure out whether this window is layered above system windows.
// We need to do this here to help the activity manager know how to
// layer its ANR dialog.
int systemAlertLayer = mService.mPolicy.getWindowLayerFromTypeLw(
TYPE_APPLICATION_OVERLAY, windowState.mOwnerCanAddInternalSystemWindow);
aboveSystem = windowState.mBaseLayer > systemAlertLayer;
} else if (appWindowToken != null) {
Slog.i(TAG_WM, "Input event dispatching timed out "
+ "sending to application " + appWindowToken.stringName
+ ". Reason: " + reason);
} else {
Slog.i(TAG_WM, "Input event dispatching timed out "
+ ". Reason: " + reason);
}

mService.saveANRStateLocked(appWindowToken, windowState, reason);
}

// All the calls below need to happen without the WM lock held since they call into AM.
mService.mAmInternal.saveANRState(reason);

if (appWindowToken != null && appWindowToken.appToken != null) {
// appToken实际上就是当前的ActivityRecord。
// 如果发生ANR的Activity还存在,则直接通过ActivityRecord通知事件派发超时
// Notify the activity manager about the timeout and let it decide whether
// to abort dispatching or keep waiting.
final AppWindowContainerController controller = appWindowToken.getController();
final boolean abort = controller != null
&& controller.keyDispatchingTimedOut(reason,
(windowState != null) ? windowState.mSession.mPid : -1);
if (!abort) {
// The activity manager declined to abort dispatching.
// Wait a bit longer and timeout again later.
return appWindowToken.mInputDispatchingTimeoutNanos;
}
} else if (windowState != null) {
// 如果发生ANR的Activity已经销毁了,则通过AMS通知事件派发超时
try {
// Notify the activity manager about the timeout and let it decide whether
// to abort dispatching or keep waiting.
long timeout = ActivityManager.getService().inputDispatchingTimedOut(
windowState.mSession.mPid, aboveSystem, reason);
if (timeout >= 0) {
// The activity manager declined to abort dispatching.
// Wait a bit longer and timeout again later.
return timeout * 1000000L; // nanoseconds
}
} catch (RemoteException ex) {
}
}
return 0; // abort dispatching
}

上述方法中有两种不同的调用方式,但最终都会交由AMS.inputDispatchingTimedOut()处理。AMS有重载的inputDispatchingTimedOut()方法,他们的参数不一样。 ActivityRecord 调用时,可以传入的信息更多一点(当前发生ANR的界面是哪一个)。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
@Override
public long inputDispatchingTimedOut(int pid, final boolean aboveSystem, String reason) {
if (checkCallingPermission(android.Manifest.permission.FILTER_EVENTS)
!= PackageManager.PERMISSION_GRANTED) {
throw new SecurityException("Requires permission "
+ android.Manifest.permission.FILTER_EVENTS);
}
ProcessRecord proc;
long timeout;
synchronized (this) {
synchronized (mPidsSelfLocked) {
proc = mPidsSelfLocked.get(pid);// 1. 根据进程号获取到ProcessRecord
}
// 2. 获取超时时间
// 测试环境下的超时时间是INSTRUMENTATION_KEY_DISPATCHING_TIMEOUT(60秒),
// 正常环境下的超时时间是KEY_DISPATCHING_TIMEOUT(5秒)
timeout = getInputDispatchingTimeoutLocked(proc);
}

// 调用重载的函数,如果返回True,则表示需要中断当前的事件派发;
if (inputDispatchingTimedOut(proc, null, null, aboveSystem, reason)) {
return -1;
}

// 3. 返回继续等待的时间,这个值会传递到Native层
return timeout;
}

/**
* Handle input dispatching timeouts.
* Returns whether input dispatching should be aborted or not.
*/
public boolean inputDispatchingTimedOut(final ProcessRecord proc,
final ActivityRecord activity, final ActivityRecord parent,
final boolean aboveSystem, String reason) {
if (checkCallingPermission(android.Manifest.permission.FILTER_EVENTS)
!= PackageManager.PERMISSION_GRANTED) {
throw new SecurityException("Requires permission "
+ android.Manifest.permission.FILTER_EVENTS);
}

final String annotation;
if (reason == null) {
annotation = "Input dispatching timed out";
} else {
annotation = "Input dispatching timed out (" + reason + ")";
}

if (proc != null) {
synchronized (this) {
if (proc.debugging) {// 1. 发生ANR进程正处于调试状态,不需要中断事件
return false;
}

// 2. 发生ANR的进程是测试进程,需要中断,但不在UI界面显示ANR信息判断
if (proc.instr != null) {
Bundle info = new Bundle();
info.putString("shortMsg", "keyDispatchingTimedOut");
info.putString("longMsg", annotation);
finishInstrumentationLocked(proc, Activity.RESULT_CANCELED, info);
return true;
}
}
// 3. 通知UI界面显示ANR信息
mHandler.post(new Runnable() {
@Override
public void run() {
mAppErrors.appNotResponding(proc, activity, parent, aboveSystem, annotation);
}
});
}

return true;
}

至此,我们回答了第二个问题:

在InputDispatcher派发输入事件时,会寻找接收事件的窗口,如果无法正常派发,则可能会导致当前需要派发的事件超时(默认是5秒)。 Native层发现超时了,会通知Java层,Java层经过一些处理后,会反馈给Native层,是继续等待还是丢弃当前派发的事件。

小结

ANR监测机制包含三种:

• * Service ANR,前台进程中Service生命周期不能超过20秒,后台进程中Service的生命周期不能超过200秒。 在启动Service时,抛出定时消息SERVICE_TIMEOUT_MSG或SERVICE_BACKGOURND_TIMEOUT_MSG,如果定时消息响应了,则说明发生了ANR

• * Broadcast ANR,前台的“串行广播消息”必须在10秒内处理完毕,后台的“串行广播消息”必须在60秒处理完毕, 每派发串行广播消息到一个接收器时,都会抛出一个定时消息BROADCAST_TIMEOUT_MSG,如果定时消息响应,则判断是否广播消息处理超时,超时就说明发生了ANR

• * Input ANR,输入事件必须在5秒内处理完毕。在派发一个输入事件时,会判断当前输入事件是否需要等待,如果需要等待,则判断是否等待已经超时,超时就说明发生了ANR

ANR监测机制实际上是对应用程序主线程的要求,要求主线成必须在限定的时间内,完成对几种操作的响应;否则,就可以认为应用程序主线程失去响应能力。

从ANR的三种监测机制中,我们看到不同超时机制的设计:

Service和Broadcast都是由AMS调度,利用Handler和Looper,设计了一个TIMEOUT消息交由AMS线程来处理,整个超时机制的实现都是在Java层; InputEvent由InputDispatcher调度,待处理的输入事件都会进入队列中等待,设计了一个等待超时的判断,超时机制的实现在Native层。

ANR的报告机制

无论哪种类型的ANR发生以后,最终都会调用 AMS.appNotResponding() 方法,所谓“殊途同归”。这个方法的职能就是向用户或开发者报告ANR发生了。 最终的表现形式是:弹出一个对话框,告诉用户当前某个程序无响应;输入一大堆与ANR相关的日志,便于开发者解决问题。

最终形式我们见过很多,但输出日志的原理是什么,未必所有人都了解,下面我们就来认识一下是如何输出ANR日志的。

AppErrors.java frameworks/base/services/core/java/com/android/server/am/AppErrors.java中处理ANR的弹框

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
final void appNotResponding(ProcessRecord app, ActivityRecord activity,
ActivityRecord parent, boolean aboveSystem, final String annotation) {
// app: 当前发生ANR的进程
// activity: 发生ANR的界面
// parent: 发生ANR的界面的上一级界面
// aboveSystem:
// annotation: 发生ANR的原因
ArrayList<Integer> firstPids = new ArrayList<Integer>(5);
SparseArray<Boolean> lastPids = new SparseArray<Boolean>(20);
if (mService.mController != null) {
try {
// 0 == continue, -1 = kill process immediately
int res = mService.mController.appEarlyNotResponding(
app.processName, app.pid, annotation);
if (res < 0 && app.pid != MY_PID) {
app.kill("anr", true);
}
} catch (RemoteException e) {
mService.mController = null;
Watchdog.getInstance().setActivityController(null);
}
}
long anrTime = SystemClock.uptimeMillis();
if (ActivityManagerService.MONITOR_CPU_USAGE) {// 1. 更新CPU使用信息。ANR的第一次CPU信息采样
mService.updateCpuStatsNow();
}
// Unless configured otherwise, swallow ANRs in background processes & kill the process.
boolean showBackground = Settings.Secure.getInt(mContext.getContentResolver(),
Settings.Secure.ANR_SHOW_BACKGROUND, 0) != 0;
boolean isSilentANR;
synchronized (mService) {
// PowerManager.reboot() can block for a long time, so ignore ANRs while shutting down.
if (mService.mShuttingDown) {//正在关机过程中的ANR,直接忽视掉
Slog.i(TAG, "During shutdown skipping ANR: " + app + " " + annotation);
return;
} else if (app.notResponding) {//重复的ANR,直接忽略掉,不重复弹框
Slog.i(TAG, "Skipping duplicate ANR: " + app + " " + annotation);
return;
} else if (app.crashing) {//APP crash 引起的ANR,不弹框
Slog.i(TAG, "Crashing app skipping ANR: " + app + " " + annotation);
return;
} else if (app.killedByAm) {// APP 已经被 AM kill掉的,不弹框
Slog.i(TAG, "App already killed by AM skipping ANR: " + app + " " + annotation);
return;
} else if (app.killed) {//APP 已经挂掉了,不弹框
Slog.i(TAG, "Skipping died app ANR: " + app + " " + annotation);
return;
}
// In case we come through here for the same app before completing
// this one, mark as anring now so we will bail out.
app.notResponding = true;
// Log the ANR to the event log.
//2.将ANR 信息写入event log中
EventLog.writeEvent(EventLogTags.AM_ANR, app.userId, app.pid,
app.processName, app.info.flags, annotation);
// Dump thread traces as quickly as we can, starting with "interesting" processes.
// 3. 填充firstPids和lastPids数组。从最近运行进程(Last Recently Used)中挑选:
// firstPids用于保存ANR进程及其父进程,system_server进程和persistent的进程(譬如Phone进程)
// lastPids用于保存除firstPids外的其他进程
firstPids.add(app.pid);
// Don't dump other PIDs if it's a background ANR
isSilentANR = !showBackground && !isInterestingForBackgroundTraces(app);
if (!isSilentANR) {
int parentPid = app.pid;
if (parent != null && parent.app != null && parent.app.pid > 0) {
parentPid = parent.app.pid;
}
if (parentPid != app.pid) firstPids.add(parentPid);
if (MY_PID != app.pid && MY_PID != parentPid) firstPids.add(MY_PID);
for (int i = mService.mLruProcesses.size() - 1; i >= 0; i--) {
ProcessRecord r = mService.mLruProcesses.get(i);
if (r != null && r.thread != null) {
int pid = r.pid;
if (pid > 0 && pid != app.pid && pid != parentPid && pid != MY_PID) {
if (r.persistent) {
firstPids.add(pid);
if (DEBUG_ANR) Slog.i(TAG, "Adding persistent proc: " + r);
} else if (r.treatLikeActivity) {
firstPids.add(pid);
if (DEBUG_ANR) Slog.i(TAG, "Adding likely IME: " + r);
} else {
lastPids.put(pid, Boolean.TRUE);
if (DEBUG_ANR) Slog.i(TAG, "Adding ANR proc: " + r);
}
}
}
}
}
}
// Log the ANR to the main log.
//4. 将ANR信息输出到 main log中
StringBuilder info = new StringBuilder();
info.setLength(0);
info.append("ANR in ").append(app.processName);
if (activity != null && activity.shortComponentName != null) {
info.append(" (").append(activity.shortComponentName).append(")");
}
info.append("\n");
info.append("PID: ").append(app.pid).append("\n");
if (annotation != null) {
info.append("Reason: ").append(annotation).append("\n");
}
if (parent != null && parent != activity) {
info.append("Parent: ").append(parent.shortComponentName).append("\n");
}
ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
// don't dump native PIDs for background ANRs unless it is the process of interest
String[] nativeProcs = null;
if (isSilentANR) {
for (int i = 0; i < NATIVE_STACKS_OF_INTEREST.length; i++) {
if (NATIVE_STACKS_OF_INTEREST[i].equals(app.processName)) {
nativeProcs = new String[] { app.processName };
break;
}
}
} else {
nativeProcs = NATIVE_STACKS_OF_INTEREST;
}
int[] pids = nativeProcs == null ? null : Process.getPidsForCommands(nativeProcs);
ArrayList<Integer> nativePids = null;
if (pids != null) {
nativePids = new ArrayList<Integer>(pids.length);
for (int i : pids) {
nativePids.add(i);
}
}
// For background ANRs, don't pass the ProcessCpuTracker to
// avoid spending 1/2 second collecting stats to rank lastPids.
File tracesFile = ActivityManagerService.dumpStackTraces(
true, firstPids,
(isSilentANR) ? null : processCpuTracker,
(isSilentANR) ? null : lastPids,
nativePids);
String cpuInfo = null;
if (ActivityManagerService.MONITOR_CPU_USAGE) {
mService.updateCpuStatsNow();
synchronized (mService.mProcessCpuTracker) {
cpuInfo = mService.mProcessCpuTracker.printCurrentState(anrTime);
}
info.append(processCpuTracker.printCurrentLoad());
info.append(cpuInfo);
}
info.append(processCpuTracker.printCurrentState(anrTime));
Slog.e(TAG, info.toString());//将上面拿到的所有信息,打印到main log中
if (tracesFile == null) {//5.将log打印到 traces.txt中
// There is no trace file, so dump (only) the alleged culprit's threads to the log
Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
}
mService.addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
cpuInfo, tracesFile, null);// 6.将log打印到dropbox
if (mService.mController != null) {
try {
// 0 == show dialog, 1 = keep waiting, -1 = kill process immediately
int res = mService.mController.appNotResponding(
app.processName, app.pid, info.toString());
if (res != 0) {
if (res < 0 && app.pid != MY_PID) {
app.kill("anr", true);
} else {
synchronized (mService) {
mService.mServices.scheduleServiceTimeoutLocked(app);
}
}
return;
}
} catch (RemoteException e) {
mService.mController = null;
Watchdog.getInstance().setActivityController(null);
}
}
// 弹框处理
synchronized (mService) {
mService.mBatteryStatsService.noteProcessAnr(app.processName, app.uid);
if (isSilentANR) {
app.kill("bg anr", true);
return;
}
// Set the app's notResponding state, and look up the errorReportReceiver
makeAppNotRespondingLocked(app,
activity != null ? activity.shortComponentName : null,
annotation != null ? "ANR " + annotation : "ANR",
info.toString());
// Bring up the infamous App Not Responding dialog
Message msg = Message.obtain();
HashMap<String, Object> map = new HashMap<String, Object>();
msg.what = ActivityManagerService.SHOW_NOT_RESPONDING_UI_MSG;
msg.obj = map;
msg.arg1 = aboveSystem ? 1 : 0;
map.put("app", app);
if (activity != null) {
map.put("activity", activity);
}
mService.mUiHandler.sendMessage(msg);
}
}

该方法的主体逻辑可以分成五个部分来看:

  1. 更新CPU的统计信息。这是发生ANR时,第一次CPU使用信息的采样,采样数据会保存在mProcessStats这个变量中
  2. 将Log信息,写入Event log中
  3. 填充firstPids和lastPids数组。当前发生ANR的应用会首先被添加到firstPids中,这样打印函数栈的时候,当前进程总是在trace文件的最前面打印函数调用栈(StackTrace)。具体实现由dumpStackTraces()函数完成更新CPU的统计信息。这是发生ANR时,第二次CPU使用信息的采样,两次采样的数据分别对应ANR发生前后的CPU使用情况
  4. 准备main log需要的参数,将ANR信息输出到main log中
  5. 将log打印到trace.txt文件中
  6. 将log打印到dropbox中
  7. 弹框处理,显示ANR对话框。抛出SHOW_NOT_RESPONDING_MSG消息,AMS.MainHandler会处理这条消息,显示AppNotRespondingDialog

当然,除了主体逻辑,发生ANR时还会输出各种类别的日志:

event log,通过检索”am_anr”关键字,可以找到发生ANR的应用
main log,通过检索”ANR in “关键字,可以找到ANR的信息,日志的上下文会包含CPU的使用情况
dropbox,通过检索”anr”类型,可以找到ANR的信息,log位于/data/system/dropbox
traces, 发生ANR时,各进程的函数调用栈信息,log位于/data/anr
我们分析ANR问题,往往是从main log中的CPU使用情况和traces中的函数调用栈开始。所以,更新CPU的使用信息updateCpuStatsNow()方法和打印函数栈dumpStackTraces()方法,是系统报告ANR问题关键所在。

CPU的使用情况

AMS.updateCpuStatsNow()方法的实现见下面的代码,更新CPU使用信息的间隔最小是5秒,即如果5秒内连续调用updateCpuStatsNow()方法,其实是没有更新CPU使用信息的。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
void updateCpuStatsNow() {
synchronized (mProcessCpuTracker) {
mProcessCpuMutexFree.set(false);
final long now = SystemClock.uptimeMillis();
boolean haveNewCpuStats = false;

if (MONITOR_CPU_USAGE &&
mLastCpuTime.get() < (now-MONITOR_CPU_MIN_TIME)) {
mLastCpuTime.set(now);
mProcessCpuTracker.update();
if (mProcessCpuTracker.hasGoodLastStats()) {
haveNewCpuStats = true;
//Slog.i(TAG, mProcessCpu.printCurrentState());
//Slog.i(TAG, "Total CPU usage: "
// + mProcessCpu.getTotalCpuPercent() + "%");

// Slog the cpu usage if the property is set.
if ("true".equals(SystemProperties.get("events.cpu"))) {
int user = mProcessCpuTracker.getLastUserTime();
int system = mProcessCpuTracker.getLastSystemTime();
int iowait = mProcessCpuTracker.getLastIoWaitTime();
int irq = mProcessCpuTracker.getLastIrqTime();
int softIrq = mProcessCpuTracker.getLastSoftIrqTime();
int idle = mProcessCpuTracker.getLastIdleTime();

int total = user + system + iowait + irq + softIrq + idle;
if (total == 0) total = 1;

EventLog.writeEvent(EventLogTags.CPU,
((user+system+iowait+irq+softIrq) * 100) / total,
(user * 100) / total,
(system * 100) / total,
(iowait * 100) / total,
(irq * 100) / total,
(softIrq * 100) / total);
}
}
}

final BatteryStatsImpl bstats = mBatteryStatsService.getActiveStatistics();
synchronized(bstats) {
synchronized(mPidsSelfLocked) {
if (haveNewCpuStats) {
if (bstats.startAddingCpuLocked()) {
int totalUTime = 0;
int totalSTime = 0;
final int N = mProcessCpuTracker.countStats();
for (int i=0; i<N; i++) {
ProcessCpuTracker.Stats st = mProcessCpuTracker.getStats(i);
if (!st.working) {
continue;
}
ProcessRecord pr = mPidsSelfLocked.get(st.pid);
totalUTime += st.rel_utime;
totalSTime += st.rel_stime;
if (pr != null) {
BatteryStatsImpl.Uid.Proc ps = pr.curProcBatteryStats;
if (ps == null || !ps.isActive()) {
pr.curProcBatteryStats = ps = bstats.getProcessStatsLocked(
pr.info.uid, pr.processName);
}
ps.addCpuTimeLocked(st.rel_utime, st.rel_stime);
pr.curCpuTime += st.rel_utime + st.rel_stime;
if (pr.lastCpuTime == 0) {
pr.lastCpuTime = pr.curCpuTime;
}
} else {
BatteryStatsImpl.Uid.Proc ps = st.batteryStats;
if (ps == null || !ps.isActive()) {
st.batteryStats = ps = bstats.getProcessStatsLocked(
bstats.mapUid(st.uid), st.name);
}
ps.addCpuTimeLocked(st.rel_utime, st.rel_stime);
}
}
final int userTime = mProcessCpuTracker.getLastUserTime();
final int systemTime = mProcessCpuTracker.getLastSystemTime();
final int iowaitTime = mProcessCpuTracker.getLastIoWaitTime();
final int irqTime = mProcessCpuTracker.getLastIrqTime();
final int softIrqTime = mProcessCpuTracker.getLastSoftIrqTime();
final int idleTime = mProcessCpuTracker.getLastIdleTime();
bstats.finishAddingCpuLocked(totalUTime, totalSTime, userTime,
systemTime, iowaitTime, irqTime, softIrqTime, idleTime);
}
}
}

if (mLastWriteTime < (now-BATTERY_STATS_TIME)) {
mLastWriteTime = now;
mBatteryStatsService.scheduleWriteToDisk();
}
}
}
}

CPU使用信息由ProcessCpuTracker这个类维护, 每次调用ProcessCpuTracker.update()方法,就会读取设备节点 /proc 下的文件,来更新CPU使用信息,具体有以下几个维度:

CPU的使用时间: 读取 /proc/stat

user: 用户进程的CPU使用时间
nice: 降低过优先级进程的CPU使用时间。Linux进程都有优先级,这个优先级可以进行动态调整,譬如进程初始优先级的值设为10,运行时降低为8,那么,修正值-2就定义为nice。 Android将user和nice这两个时间归类成user
sys: 内核进程的CPU使用时间
idle: CPU空闲的时间
wait: CPU等待IO的时间
hw irq: 硬件中断的时间。如果外设(譬如硬盘)出现故障,需要通过硬件终端通知CPU保存现场,发生上下文切换的时间就是CPU的硬件中断时间
sw irg: 软件中断的时间。同硬件中断一样,如果软件要求CPU中断,则上下文切换的时间就是CPU的软件中断时间

CPU负载: 读取 /proc/loadavg, 统计最近1分钟,5分钟,15分钟内,CPU的平均活动进程数。 CPU的负载可以比喻成超市收银员负载,如果有1个人正在买单,有2个人在排队,那么该收银员的负载就是3。 在收银员工作时,不断会有人买单完成,也不断会有人排队,可以在固定的时间间隔内(譬如,每隔5秒)统计一次负载,那么,就可以统计出一段时间内的平均负载。

页错误信息: 进程的CPU使用率最后输出的“faults: xxx minor/major”部分表示的是页错误次数,当次数为0时不显示。 major是指Major Page Fault(主要页错误,简称MPF),内核在读取数据时会先后查找CPU的高速缓存和物理内存,如果找不到会发出一个MPF信息,请求将数据加载到内存。 minor是指Minor Page Fault(次要页错误,简称MnPF),磁盘数据被加载到内存后,内核再次读取时,会发出一个MnPF信息。 一个文件第一次被读写时会有很多的MPF,被缓存到内存后再次访问MPF就会很少,MnPF反而变多,这是内核为减少效率低下的磁盘I/O操作采用的缓存技术的结果。

堆栈调用

AMS.dumpStackTraces()方法用于打印进程的函数调用栈,该方法的主体逻辑如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
/**
* If a stack trace dump file is configured, dump process stack traces.
* @param clearTraces causes the dump file to be erased prior to the new
* traces being written, if true; when false, the new traces will be
* appended to any existing file content.
* @param firstPids of dalvik VM processes to dump stack traces for first
* @param lastPids of dalvik VM processes to dump stack traces for last
* @param nativePids optional list of native pids to dump stack crawls
*/
public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids,
ArrayList<Integer> nativePids) {
ArrayList<Integer> extraPids = null;

// Measure CPU usage as soon as we're called in order to get a realistic sampling
// of the top users at the time of the request.
if (processCpuTracker != null) {
processCpuTracker.init();
try {
Thread.sleep(200);
} catch (InterruptedException ignored) {
}

processCpuTracker.update();

// We'll take the stack crawls of just the top apps using CPU.
final int N = processCpuTracker.countWorkingStats();
extraPids = new ArrayList<>();
for (int i = 0; i < N && extraPids.size() < 5; i++) {
ProcessCpuTracker.Stats stats = processCpuTracker.getWorkingStats(i);
if (lastPids.indexOfKey(stats.pid) >= 0) {
if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for extra pid " + stats.pid);

extraPids.add(stats.pid);
} else if (DEBUG_ANR) {
Slog.d(TAG, "Skipping next CPU consuming process, not a java proc: "
+ stats.pid);
}
}
}

boolean useTombstonedForJavaTraces = false;
File tracesFile;

final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
if (tracesDirProp.isEmpty()) {
// When dalvik.vm.stack-trace-dir is not set, we are using the "old" trace
// dumping scheme. All traces are written to a global trace file (usually
// "/data/anr/traces.txt") so the code below must take care to unlink and recreate
// the file if requested.
//
// This mode of operation will be removed in the near future.


String globalTracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
if (globalTracesPath.isEmpty()) {
Slog.w(TAG, "dumpStackTraces: no trace path configured");
return null;
}

tracesFile = new File(globalTracesPath);
try {
if (clearTraces && tracesFile.exists()) {
tracesFile.delete();
}

tracesFile.createNewFile();
FileUtils.setPermissions(globalTracesPath, 0666, -1, -1); // -rw-rw-rw-
} catch (IOException e) {
Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesFile, e);
return null;
}
} else {
File tracesDir = new File(tracesDirProp);
// When dalvik.vm.stack-trace-dir is set, we use the "new" trace dumping scheme.
// Each set of ANR traces is written to a separate file and dumpstate will process
// all such files and add them to a captured bug report if they're recent enough.
maybePruneOldTraces(tracesDir);

// NOTE: We should consider creating the file in native code atomically once we've
// gotten rid of the old scheme of dumping and lot of the code that deals with paths
// can be removed.
tracesFile = createAnrDumpFile(tracesDir);
if (tracesFile == null) {
return null;
}

useTombstonedForJavaTraces = true;
}

dumpStackTraces(tracesFile.getAbsolutePath(), firstPids, nativePids, extraPids,
useTombstonedForJavaTraces);
return tracesFile;
}


private static void dumpStackTraces(String tracesFile, ArrayList<Integer> firstPids,
ArrayList<Integer> nativePids, ArrayList<Integer> extraPids,
boolean useTombstonedForJavaTraces) {

// We don't need any sort of inotify based monitoring when we're dumping traces via
// tombstoned. Data is piped to an "intercept" FD installed in tombstoned so we're in full
// control of all writes to the file in question.
final DumpStackFileObserver observer;
if (useTombstonedForJavaTraces) {
observer = null;
} else {
// Use a FileObserver to detect when traces finish writing.
// The order of traces is considered important to maintain for legibility.
observer = new DumpStackFileObserver(tracesFile);
}

// We must complete all stack dumps within 20 seconds.
long remainingTime = 20 * 1000;
try {
if (observer != null) {
observer.startWatching();
}

// First collect all of the stacks of the most important pids.
if (firstPids != null) {
int num = firstPids.size();
for (int i = 0; i < num; i++) {
if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for pid "
+ firstPids.get(i));
final long timeTaken;
if (useTombstonedForJavaTraces) {
timeTaken = dumpJavaTracesTombstoned(firstPids.get(i), tracesFile, remainingTime);
} else {
timeTaken = observer.dumpWithTimeout(firstPids.get(i), remainingTime);
}

remainingTime -= timeTaken;
if (remainingTime <= 0) {
Slog.e(TAG, "Aborting stack trace dump (current firstPid=" + firstPids.get(i) +
"); deadline exceeded.");
return;
}

if (DEBUG_ANR) {
Slog.d(TAG, "Done with pid " + firstPids.get(i) + " in " + timeTaken + "ms");
}
}
}

// Next collect the stacks of the native pids
if (nativePids != null) {
for (int pid : nativePids) {
if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for native pid " + pid);
final long nativeDumpTimeoutMs = Math.min(NATIVE_DUMP_TIMEOUT_MS, remainingTime);

final long start = SystemClock.elapsedRealtime();
Debug.dumpNativeBacktraceToFileTimeout(
pid, tracesFile, (int) (nativeDumpTimeoutMs / 1000));
final long timeTaken = SystemClock.elapsedRealtime() - start;

remainingTime -= timeTaken;
if (remainingTime <= 0) {
Slog.e(TAG, "Aborting stack trace dump (current native pid=" + pid +
"); deadline exceeded.");
return;
}

if (DEBUG_ANR) {
Slog.d(TAG, "Done with native pid " + pid + " in " + timeTaken + "ms");
}
}
}

// Lastly, dump stacks for all extra PIDs from the CPU tracker.
if (extraPids != null) {
for (int pid : extraPids) {
if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for extra pid " + pid);

final long timeTaken;
if (useTombstonedForJavaTraces) {
timeTaken = dumpJavaTracesTombstoned(pid, tracesFile, remainingTime);
} else {
timeTaken = observer.dumpWithTimeout(pid, remainingTime);
}

remainingTime -= timeTaken;
if (remainingTime <= 0) {
Slog.e(TAG, "Aborting stack trace dump (current extra pid=" + pid +
"); deadline exceeded.");
return;
}

if (DEBUG_ANR) {
Slog.d(TAG, "Done with extra pid " + pid + " in " + timeTaken + "ms");
}
}
}
} finally {
if (observer != null) {
observer.stopWatching();
}
}
}

问题分析方法

分析ANR问题,有三大利器:Logcat,traces和StrictMode。 在StrictMode机制一文中,有介绍StrictMode的实现机制以及用途,本文中不讨论利用StrictMode来解决ANR问题,但各位读者需要有这个意识。 在Watchdog机制以及问题分析一文中,我们介绍过logcat和traces这两种日志的用途。 分析ANR问题同Watchdog问题一样,都需要经过日志获取、问题定位和场景还原三个步骤。

日志的获取

分析ANR问题最主要的途径就是通过log来分析。Android中log的获取见,Android的log机制分析一问中log的存储位置章节。

问题定位

通过在event log中检索 am_anr 关键字,就可以找到发生ANR的进程,譬如以下日志:

1
10-16 00:48:27 820 907 I am_anr: [0,29533,com.android.systemui,1082670605,Broadcast of Intent { act=android.intent.action.TIME_TICK flg=0x50000114 (has extras) }]

表示在 10-16 00:48:27 这个时刻,PID为 29533 进程发生了ANR,进程名是 com.android.systemui
接下来可以在system log检索 ANR in 关键字,找到发生ANR前后的CPU使用情况:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
10-16 00:50:10 820 907 E ActivityManager: ANR in com.android.systemui, time=130090695
10-16 00:50:10 820 907 E ActivityManager: Reason: Broadcast of Intent { act=android.intent.action.TIME_TICK flg=0x50000114 (has extras) }
10-16 00:50:10 820 907 E ActivityManager: Load: 30.4 / 22.34 / 19.94
10-16 00:50:10 820 907 E ActivityManager: Android time :[2015-10-16 00:50:05.76] [130191,266]
10-16 00:50:10 820 907 E ActivityManager: CPU usage from 6753ms to -4ms ago:
10-16 00:50:10 820 907 E ActivityManager: 47% 320/netd: 3.1% user + 44% kernel / faults: 14886 minor 3 major
10-16 00:50:10 820 907 E ActivityManager: 15% 10007/com.sohu.sohuvideo: 2.8% user + 12% kernel / faults: 1144 minor
10-16 00:50:10 820 907 E ActivityManager: 13% 10654/hif_thread: 0% user + 13% kernel
10-16 00:50:10 820 907 E ActivityManager: 11% 175/mmcqd/0: 0% user + 11% kernel
10-16 00:50:10 820 907 E ActivityManager: 5.1% 12165/app_process: 1.6% user + 3.5% kernel / faults: 9703 minor 540 major
10-16 00:50:10 820 907 E ActivityManager: 3.3% 29533/com.android.systemui: 2.6% user + 0.7% kernel / faults: 8402 minor 343 major
10-16 00:50:10 820 907 E ActivityManager: 3.2% 820/system_server: 0.8% user + 2.3% kernel / faults: 5120 minor 523 major
10-16 00:50:10 820 907 E ActivityManager: 2.5% 11817/com.netease.pomelo.push.l.messageservice_V2: 0.7% user + 1.7% kernel / faults: 7728 minor 687 major
10-16 00:50:10 820 907 E ActivityManager: 1.6% 11887/com.android.email: 0.5% user + 1% kernel / faults: 6259 minor 587 major
10-16 00:50:10 820 907 E ActivityManager: 1.4% 11854/com.android.settings: 0.7% user + 0.7% kernel / faults: 5404 minor 471 major
10-16 00:50:10 820 907 E ActivityManager: 1.4% 11869/android.process.acore: 0.7% user + 0.7% kernel / faults: 6131 minor 561 major
10-16 00:50:10 820 907 E ActivityManager: 1.3% 11860/com.tencent.mobileqq: 0.1% user + 1.1% kernel / faults: 5542 minor 470 major
...
10-16 00:50:10 820 907 E ActivityManager: +0% 12832/cat: 0% user + 0% kernel
10-16 00:50:10 820 907 E ActivityManager: +0% 13211/zygote64: 0% user + 0% kernel
10-16 00:50:10 820 907 E ActivityManager: 87% TOTAL: 3% user + 18% kernel + 64% iowait + 0.5% softirq

这一段日志对于Android开发人员而言,实在太熟悉不过了,它包含的信息量巨大:

发生ANR的时间。event log中,ANR的时间是 00:48:27,因为AMS.appNotResponding()首先会打印event log,然后再打印system log, 所以,在system log中,找到ANR的时间是 00:50:10。可以从这个时间点之前的日志中,还原ANR出现时系统的运行状态
打印ANR日志的进程。ANR日志都是在system_server进程的AMS线程打印的,在event log和system log中,都能看到 820 和 907, 所以system_server的PID是 802,AMS线程的TID是 907。ANR的监测机制实现在AMS线程,分析一些受系统影响的ANR,需要知道system_server进程的运行状态
发生ANR的进程。ANR in关键字就表明了当前ANR的进程是com.android.system.ui,通过event log,知道进程的PID是 29533
发生ANR的原因。Reason关键字表明了当前发生ANR的原因是,处理TIME_TICK广播消息超时。 隐含的意思是TIME_TICK是一个串行广播消息,在 29533 的主线程中,执行BroadcastReceiver.onReceive()方法已经超过10秒
CPU负载。Load关键字表明了最近1分钟、5分钟、15分钟内的CPU负载分别是30.4、22.3、19.94。CPU最近1分钟的负载最具参考价值,因为ANR的超时限制基本都是1分钟以内, 这可以近似的理解为CPU最近1分钟平均有30.4个任务要处理,这个负载值是比较高的
CPU使用统计时间段。CPU usage from XX to XX ago关键字表明了这是在ANR发生之前一段时间内的CPU统计。 类似的还有CPU usage from XX to XX after关键字,表明是ANR发生之后一段时间内的CPU统计
各进程的CPU使用率。我们以com.android.systemui进程的CPU使用率为例,它包含以下信息:
1)总的CPU使用率: 3.3%,其中systemui进程在用户态的CPU使用率是2.6%,在内核态的使用率是0.7%
2) 缺页次数fault:8402 minor表示高速缓存中的缺页次数,343 major表示内存的缺页次数。minor可以理解为进程在做内存访问,major可以理解为进程在做IO操作。 当前minor和major值都是比较高的,从侧面反映了发生ANR之前,systemui进程有有较多的内存访问操作,引发的IO次数也会较多
3) CPU使用率前面的 “+”。部分进程的CPU使用率前面有 “+” 号,譬如cat和zygote64,表示在上一次CPU统计的时间片段内,还没有这些进程,而这一次CPU统计的时间片段内,运行了这些进程。 类似的还有 “-” 号,表示两次CPU统计时间片段时,这些进程消亡了
CPU使用汇总。TOTAL关键字表明了CPU使用的汇总,87%是总的CPU使用率,其中有一项iowait表明CPU在等待IO的时间,占到64%,说明发生ANR以前,有大量的IO操作。app_process、 system_server, com.android.systemui这几个进程的major值都比较大,说明这些进程的IO操作较为频繁,从而拉升了整个iowait的时间

信息量是如此的庞大,以致于我们都要下结论了:CPU大量的时间都在等待IO,导致systemui进程分配不到CPU时间,从而主线程处理广播消息超时,发生了ANR。
对于一个严谨的开发人员而言,这种结论下得有点早,因为还有太多的疑问:

  • systemui进程也分到了一些CPU时间(3.3%),难道BroadcastReceiver.onReceive()方法就一直无法执行吗?
  • 为什么iowait的时间会这么多,而且多个进程的major值都很高?
    接下来还是需要从其他日志中还原ANR出现的场景。

场景还原

第一个假设和验证

带着上文提出来的第一个疑问,我们先来做一个假设:如果systemui进程正在执行BroadcatReceiver.onReceive()方法,那么从traces.txt文件中,应该可以看到主线程的函数调用栈正在执行这个方法。
接下来,我们首先从traces文件中,找到发生ANR时(00:48:27),sysemtui进程的函数调用栈信息。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
----- pid 29533 at 2015-10-16 00:48:06 -----
Cmd line: com.android.systemui

DALVIK THREADS (53):
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x75bd5818 self=0x7f8549a000
| sysTid=29533 nice=0 cgrp=bg_non_interactive sched=0/0 handle=0x7f894bbe58
| state=S schedstat=( 288625433917 93454573244 903419 ) utm=20570 stm=8292 core=3 HZ=100
| stack=0x7fdffda000-0x7fdffdc000 stackSize=8MB
| held mutexes=
native: #00 pc 00060b0c /system/lib64/libc.so (__epoll_pwait+8)
native: #01 pc 0001bb54 /system/lib64/libc.so (epoll_pwait+32)
native: #02 pc 0001b3d8 /system/lib64/libutils.so (android::Looper::pollInner(int)+144)
native: #03 pc 0001b75c /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+76)
native: #04 pc 000d7194 /system/lib64/libandroid_runtime.so (android::NativeMessageQueue::pollOnce(_JNIEnv*, int)+48)
at android.os.MessageQueue.nativePollOnce(Native method)
at android.os.MessageQueue.next(MessageQueue.java:148)
at android.os.Looper.loop(Looper.java:151)
at android.app.ActivityThread.main(ActivityThread.java:5718)
at java.lang.reflect.Method.invoke!(Native method)
at java.lang.reflect.Method.invoke(Method.java:372)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:975)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:770)

----- pid 29533 at 2015-10-16 00:48:29 -----
Cmd line: com.android.systemui

DALVIK THREADS (54):
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 obj=0x75bd5818 self=0x7f8549a000
| sysTid=29533 nice=0 cgrp=bg_non_interactive sched=0/0 handle=0x7f894bbe58
| state=S schedstat=( 289080040422 93461978317 904874 ) utm=20599 stm=8309 core=0 HZ=100
| stack=0x7fdffda000-0x7fdffdc000 stackSize=8MB
| held mutexes=
at com.mediatek.anrappmanager.MessageLogger.println(SourceFile:77)
- waiting to lock <0x26b337a3> (a com.mediatek.anrappmanager.MessageLogger) held by thread 49
at android.os.Looper.loop(Looper.java:195)
at android.app.ActivityThread.main(ActivityThread.java:5718)
at java.lang.reflect.Method.invoke!(Native method)
at java.lang.reflect.Method.invoke(Method.java:372)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:975)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:770)
...
"Binder_5" prio=5 tid=49 Native
| group="main" sCount=1 dsCount=0 obj=0x136760a0 self=0x7f7e453000
| sysTid=6945 nice=0 cgrp=default sched=0/0 handle=0x7f6e3ce000
| state=S schedstat=( 5505571091 4567508913 30743 ) utm=264 stm=286 core=4 HZ=100
| stack=0x7f6b83f000-0x7f6b841000 stackSize=1008KB
| held mutexes=
native: #00 pc 00019d14 /system/lib64/libc.so (syscall+28)
native: #01 pc 0005b5d8 /system/lib64/libaoc.so (???)
native: #02 pc 002c6f18 /system/lib64/libaoc.so (???)
native: #03 pc 00032c40 /system/lib64/libaoc.so (???)
at libcore.io.Posix.getpid(Native method)
at libcore.io.ForwardingOs.getpid(ForwardingOs.java:83)
at android.system.Os.getpid(Os.java:176)
at android.os.Process.myPid(Process.java:754)
at com.mediatek.anrappmanager.MessageLogger.dump(SourceFile:219)
- locked <0x26b337a3> (a com.mediatek.anrappmanager.MessageLogger)
at com.mediatek.anrappmanager.ANRAppManager.dumpMessageHistory(SourceFile:65)
at android.app.ActivityThread$ApplicationThread.dumpMessageHistory(ActivityThread.java:1302)
at android.app.ApplicationThreadNative.onTransact(ApplicationThreadNative.java:682)
at android.os.Binder.execTransact(Binder.java:451)

最终,我们找到systemui进程ANR时刻(00:48:27)附近的两个函数调用栈:

  • 在ANR发生之前(00:48:06),主线程的函数调用栈处于正常状态:消息队列中,循环中处理消息
  • 在ANR发生之后2秒(00:48:29),主线程处于Blocked状态,在等待一个被49号线程持有的锁。而49号线程是一个Binder线程,anrappmanager正在做dump操作。

笔者分析的日志是MTK平台产生的,所以从函数调用栈中看到com.mediatek.anrappmanager.MessageLogger这样的类,它是MTK在AOSP上的扩展,用于打印ANR日志。

至此,systemui进程发生ANR的直接原因我们已经找到了,systemui进程正在打印traces,存在较长时间的IO操作,导致主线程阻塞,从而无法处理TIME_TICK广播消息,所以发生了ANR。

要避免这种场景下的ANR,我们就需要打破主线程中Blocked的逻辑。其实本例是由于MTK在AOSP的android.os.Looper.loop()扩展了打印消息队列的功能,该功能存在设计缺陷,会导致锁等待的情况。

第二个假设和验证

我们进一步挖掘在systemui还没有发生ANR时,就在打印traces的原因。带着上文提出的第二个疑问,我们来做另一个假设: iowait较高,而且多个进程的major都很高,可能是由于当前正在调用AMS.dumpStackTraces()方法,很多进程都需要将自己的函数调用栈写到traces文件,所以IO就会较高。 如果当前正在调用AMS.dumpStackTraces()方法,那说明当时系统已经发生了异常,要么已经有ANR发生,要么有SNR发生

从event log中,我们检索到了另一个ANR:

1
10-16 00:47:58 820 907 I am_anr  : [0,10464,com.android.settings,1086864965,Input dispatching timed out (Waiting to send key event because the focused window has not finished processing all of the input events that were previously delivered to it.  Outbound queue length: 0.  Wait queue length: 1.)]

在 00:47:58 这个时刻,com.android.settings进程发生了ANR,而且ANR的时间在systemui之前(00:48:27)。这一下,我们就找到佐证了,正是因为settings进程先发生了ANR,调用AMS.dumpStackTraces(), 从而很多进程都开始了打印traces的操作,所以系统的整个iowait比较高,大量进程的major值也比较高,systemui就在其列。在MTK逻辑的影响下,打印ANR日志会导致主线程阻塞,从而就连带引发了其他应用的ANR。

在system log中,我们检索到了settings进程ANR的CPU使用信息:

1
2
3
4
5
6
7
8
9
10
10-16 00:48:12 820 907 E ActivityManager: ANR in com.android.settings (com.android.settings/.SubSettings), time=130063718
10-16 00:48:12 820 907 E ActivityManager: Reason: Input dispatching timed out (Waiting to send key event because the focused window has not finished processing all of the input events that were previously delivered to it. Outbound queue length: 0. Wait queue length: 1.)
10-16 00:48:12 820 907 E ActivityManager: Load: 21.37 / 19.25 / 18.84
10-16 00:48:12 820 907 E ActivityManager: Android time :[2015-10-16 00:48:12.24] [130077,742]
10-16 00:48:12 820 907 E ActivityManager: CPU usage from 0ms to 7676ms later:
10-16 00:48:12 820 907 E ActivityManager: 91% 820/system_server: 16% user + 75% kernel / faults: 13192 minor 167 major
10-16 00:48:12 820 907 E ActivityManager: 3.2% 175/mmcqd/0: 0% user + 3.2% kernel
10-16 00:48:12 820 907 E ActivityManager: 2.9% 29533/com.android.systemui: 2.3% user + 0.6% kernel / faults: 1352 minor 10 major
10-16 00:48:12 820 907 E ActivityManager: 2.2% 1736/com.android.phone: 0.9% user + 1.3% kernel / faults: 1225 minor 1 major
10-16 00:48:12 820 907 E ActivityManager: 2.2% 10464/com.android.settings: 0.7% user + 1.4% kernel / faults: 2801 minor 105 major

具体的涵义我们不再赘述了,只关注一下ANR的原因:

1
2
Input dispatching timed out (Waiting to send key event because the focused window has not finished processing all of the input events that were previously delivered to it.
Outbound queue length: 0. Wait queue length: 1.)

之前对Input ANR机制的分析派上用长了,我们轻松知道这种ANR的原因是什么。 Wait queue length: 1表示之前的输入事件已经派发到Settings进程了,但Settings进程还没有处理完毕,新来的KeyEvent事件已经等待超过了5秒,所以ANR产生了。
接下来,又需要找到Settings的traces,分析Settings主线程处理输入事件超时的原因,我们点到为止。

总结

本文对Android ANR机制进行了深入的分析:
ANR的监测机制,从Service,Broadcast,InputEvent三种不同的ANR监测机制的源码实现开始,分析了Android如何发现各类ANR。在启动服务、派发广播消息和输入事件时,植入超时检测,用于发现ANR
ANR的报告机制,分析Android如何输出ANR日志。当ANR被发现后,两个很重要的日志输出是:CPU使用情况和进程的函数调用栈,这两类日志是我们解决ANR问题的利器

ANR的解决方法,通过一个案例,对ANR日志进行了深入解读,梳理了分析ANR问题的思路和途径
最后,致各位读者,从日志出发解决ANR问题,理解ANR机制背后的实现原理,碰到再难的ANR问题也无需惊慌。

Android中的各种log机制以及分析。

参考资料

  1. ANR机制以及问题分析
  2. ANR问题分析流程
  3. StrictMode 机制以及使用场景
  4. Watchdog机制以及问题分析