art gc相关死锁黑屏问题总结 | wwm的笔记

背景:

从测试同学那边拿过来两台黑屏的机器,点击电源、屏幕都没有反应,不过还好adb可以链接,有root权限,一番deubgging后,确认表面原因是system_server进程的art虚拟机卡在了gc前flip线程状态的操作中,而最终的root cause也比较有意思,所以本文就简单记录下问题的排查过程。

1. 按照习惯,快速检查下是否发生了watchdog

$ adb shell ``ls` `-l ``/data/anr/
-rw------- 1 system system 53592 2019-05-03 14:21 anr_2019-05-03-14-21-53-368
-rw------- 1 system system 845398 2019-05-03 14:30 anr_2019-05-03-14-30-03-498
-rw------- 1 system system 1078028 2019-05-03 14:58 anr_2019-05-03-14-58-40-505
-rw------- 1 system system 1060669 2019-05-03 14:59 anr_2019-05-03-14-59-00-410
-rw------- 1 system system 527393 2019-05-03 17:32 anr_2019-05-03-17-32-22-852
-rw------- 1 system system 619826 2019-05-03 17:33 anr_2019-05-03-17-32-58-212
// ...

并没有watchdog相关日志

2. 查看系统当前时间

`$ adb shell ``date``Sun May  5 13:52:25 CST 2019`

3. 查看system_server进程各个线程的状态

$ pid=`adb shell pidof system_server` && adb shell ``ps` `-T $pid
USER           PID   TID  PPID     VSZ    RSS WCHAN            ADDR S CMD           
system        1471  1471   635 11629888 168336 futex_wait_queue_me 0 S system_server
system        1471  1477   635 11629888 168336 do_sigtimedwait    0 S Signal Catcher
system        1471  1478   635 11629888 168336 futex_wait_queue_me 0 S ADB-JDWP Connec
system        1471  1479   635 11629888 168336 futex_wait_queue_me 0 S Binder:filter-p
system        1471  1480   635 11629888 168336 poll_schedule_timeout 0 S Binder:``read``-per
system        1471  1481   635 11629888 168336 futex_wait_queue_me 0 S ReferenceQueueD
system        1471  1482   635 11629888 168336 futex_wait_queue_me 0 S FinalizerDaemon
system        1471  1483   635 11629888 168336 futex_wait_queue_me 0 S FinalizerWatchd
system        1471  1484   635 11629888 168336 futex_wait_queue_me 0 S HeapTaskDaemon
system        1471  1524   635 11629888 168336 futex_wait_queue_me 0 S Binder:1471_1
system        1471  1525   635 11629888 168336 futex_wait_queue_me 0 S Binder:1471_2
system        1471  1673   635 11629888 168336 futex_wait_queue_me 0 S android.``bg
system        1471  1675   635 11629888 168336 futex_wait_queue_me 0 S ActivityManager
system        1471  1676   635 11629888 168336 SyS_epoll_wait     0 S android.ui
system        1471  1677   635 11629888 168336 SyS_epoll_wait     0 S ActivityManager
system        1471  1678   635 11629888 168336 SyS_epoll_wait     0 S ActivityManager
system        1471  1684   635 11629888 168336 futex_wait_queue_me 0 S batterystats-wo
system        1471  1687   635 11629888 168336 wait_woken         0 S FileObserver
system        1471  1688   635 11629888 168336 futex_wait_queue_me 0 S android.``fg
system        1471  1689   635 11629888 168336 futex_wait_queue_me 0 S android.io
system        1471  1690   635 11629888 168336 futex_wait_queue_me 0 S android.display
system        1471  1691   635 11629888 168336 futex_wait_queue_me 0 S CpuTracker
system        1471  1692   635 11629888 168336 futex_wait_queue_me 0 S PowerManagerSer
system        1471  1693   635 11629888 168336 futex_wait_queue_me 0 S BatteryStats_wa
system        1471  1694   635 11629888 168336 SyS_epoll_wait     0 S work-thread
system        1471  1695   635 11629888 168336 SyS_epoll_wait     0 S PackageManager
system        1471  1696   635 11629888 168336 SyS_epoll_wait     0 S PackageManager
system        1471  1817   635 11629888 168336 SyS_epoll_wait     0 S PackageInstalle
system        1471  1821   635 11629888 168336 SyS_epoll_wait     0 S android.anim
system        1471  1822   635 11629888 168336 SyS_epoll_wait     0 S android.anim.lf
...

这一步主要是快速检查下有没有D(disk blocked)状态的线程

3. 查看system_server进程backtrace

$ pid=`adb shell pidof system_server` && adb shell ``kill` `-3 $pid

得到:

点击下载完整backtrace!system_server_1471堆栈1471.txt-Android文档类资源-CSDN下载

   
检查各个线程在虚拟机内部的状态后,看到一个比较奇怪的现象:

"Binder:filter-perf-event"` `prio=``5` `tid=``4` `WaitingForGcThreadFlip
"HeapTaskDaemon"` `daemon prio=``5` `tid=``9` `WaitingForGcThreadFlip
"PowerManagerService"` `prio=``5` `tid=``24` `WaitingForGcThreadFlip
"HwBinder:1471_1"` `prio=``5` `tid=``35` `WaitingForGcThreadFlip
"WifiStateMachine"` `prio=``5` `tid=``56` `WaitingForGcThreadFlip
"Binder:1471_3"` `prio=``5` `tid=``92` `WaitingForGcThreadFlip
"Binder:1471_5"` `prio=``5` `tid=``105` `WaitingForGcThreadFlip
"Binder:1471_7"` `prio=``5` `tid=``107` `WaitingForGcThreadFlip
"Binder:1471_A"` `prio=``5` `tid=``120` `WaitingForGcThreadFlip
"Binder:1471_D"` `prio=``5` `tid=``123` `WaitingForGcThreadFlip
"Binder:1471_E"` `prio=``5` `tid=``124` `WaitingForGcThreadFlip
"Binder:1471_11"` `prio=``5` `tid=``127` `WaitingForGcThreadFlip
"Binder:1471_17"` `prio=``5` `tid=``133` `WaitingForGcThreadFlip
"Binder:1471_18"` `prio=``5` `tid=``134` `WaitingForGcThreadFlip
"Binder:1471_19"` `prio=``5` `tid=``135` `WaitingForGcThreadFlip
"HwBinder:1471_4"` `prio=``5` `tid=``12` `WaitingForGcThreadFlip
"Binder:1471_20"` `prio=``5` `tid=``146` `WaitingForGcThreadFlip
"HwBinder:1471_5"` `prio=``5` `tid=``34` `WaitingForGcToComplete
"ADB-JDWP Connection Control Thread"` `daemon prio=``0` `tid=``5` `WaitingForGcToComplete
"android.io"` `prio=``5` `tid=``21` `WaitingForGcToComplete
"CpuTracker"` `prio=``5` `tid=``23` `WaitingForGcToComplete
"UEventObserver"` `prio=``5` `tid=``36` `WaitingForGcToComplete
"HwBinder:1471_2"` `prio=``5` `tid=``43` `WaitingForGcToComplete
"HwBinder:1471_3"` `prio=``5` `tid=``44` `WaitingForGcToComplete
"NetdConnector"` `prio=``5` `tid=``50` `WaitingForGcToComplete
"WifiService"` `prio=``5` `tid=``55` `WaitingForGcToComplete
"DeviceStorageMonitorService"` `prio=``5` `tid=``67` `WaitingForGcToComplete
"SecurityWriteHandlerThread"` `prio=``5` `tid=``72` `WaitingForGcToComplete
"watchdog"` `prio=``5` `tid=``101` `WaitingForGcToComplete
"HwBinder:1471_5"` `prio=``5` `tid=``34` `WaitingForGcToComplete
"Timer-20"` `prio=``5` `tid=``170` `WaitingForGcToComplete
"IpClient.wlan0"` `prio=``5` `tid=``176` `WaitingForGcToComplete
"OkHttp ConnectionPool"` `daemon prio=``5` `tid=``185` `WaitingForGcToComplete
...
"main"` `prio=``5` `tid=``1` `Blocked
"Binder:1471_1"` `prio=``5` `tid=``10` `Blocked
"Binder:1471_2"` `prio=``5` `tid=``11` `Blocked
"android.bg"` `prio=``5` `tid=``13` `Blocked
"ActivityManager"` `prio=``5` `tid=``14` `Blocked
"batterystats-worker"` `prio=``5` `tid=``18` `Blocked
"android.fg"` `prio=``5` `tid=``20` `Blocked
"android.display"` `prio=``5` `tid=``22` `Blocked
"AlarmManager"` `prio=``5` `tid=``40` `Blocked
"InputDispatcher"` `prio=``10` `tid=``46` `Blocked
"InputReader"` `prio=``10` `tid=``47` `Blocked
"NetworkPolicy.uid"` `prio=``5` `tid=``53` `Blocked
"Binder:1471_4"` `prio=``5` `tid=``93` `Blocked
"Binder:1471_6"` `prio=``5` `tid=``106` `Blocked
"Binder:1471_8"` `prio=``5` `tid=``108` `Blocked
"Binder:1471_9"` `prio=``5` `tid=``118` `Blocked
"Binder:1471_B"` `prio=``5` `tid=``121` `Blocked
"Binder:1471_C"` `prio=``5` `tid=``122` `Blocked
"Binder:1471_F"` `prio=``5` `tid=``125` `Blocked
"Binder:1471_10"` `prio=``5` `tid=``126` `Blocked
"Binder:1471_12"` `prio=``5` `tid=``128` `Blocked
"Binder:1471_13"` `prio=``5` `tid=``129` `Blocked
"Binder:1471_14"` `prio=``5` `tid=``130` `Blocked
"Binder:1471_15"` `prio=``5` `tid=``131` `Blocked
"Binder:1471_16"` `prio=``5` `tid=``132` `Blocked
"Binder:1471_1A"` `prio=``5` `tid=``136` `Blocked
"Binder:1471_1B"` `prio=``5` `tid=``140` `Blocked
"Binder:1471_1C"` `prio=``5` `tid=``117` `Blocked
"Binder:1471_1D"` `prio=``5` `tid=``141` `Blocked
"Binder:1471_1E"` `prio=``5` `tid=``144` `Blocked
"Binder:1471_1F"` `prio=``5` `tid=``145` `Blocked
"NetworkMonitorNetworkAgentInfo [WIFI () - 950]"` `prio=``5` `tid=``149` `Blocked

一堆线程都是gc相关的状态,具体什么意思,我也不关心,还是再继续看下blocked的线程,都具体卡在什么地方:

"Binder:1471_6" prio=5 tid=106 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x14064288 self=0x721bdfb000
  | sysTid=2126 nice=0 cgrp=default sched=0/0 handle=0x72107ff4f0
  | state=S schedstat=( 5605933458050 5415448290382 12052578 ) utm=475916 stm=84677 core=5 HZ=100
  | stack=0x7210704000-0x7210706000 stackSize=1009KB
  | held mutexes=
  at com.android.server.net.NetworkPolicyManagerService.isUidNetworkingBlockedInternal(NetworkPolicyManagerService.java:4825)
  - waiting to lock <0x05e68352> (a java.lang.Object) held by thread 53
  at com.android.server.net.NetworkPolicyManagerService.access$3600(NetworkPolicyManagerService.java:291)
  at com.android.server.net.NetworkPolicyManagerService$NetworkPolicyManagerInternalImpl.isUidNetworkingBlocked(NetworkPolicyManagerService.java:4900)
  at com.android.server.ConnectivityService.isNetworkWithLinkPropertiesBlocked(ConnectivityService.java:1133)
  at com.android.server.ConnectivityService.filterNetworkStateForUid(ConnectivityService.java:1163)
  at com.android.server.ConnectivityService.getActiveNetworkInfo(ConnectivityService.java:1185)
  at android.net.IConnectivityManager$Stub.onTransact(IConnectivityManager.java:85)
  at android.os.Binder.execTransact(Binder.java:728)
 
"Binder:1471_8" prio=5 tid=108 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x14064398 self=0x7219dd9000
  | sysTid=2168 nice=0 cgrp=default sched=0/0 handle=0x720fdff4f0
  | state=S schedstat=( 5601724562758 5464273391739 12340199 ) utm=473400 stm=86772 core=6 HZ=100
  | stack=0x720fd04000-0x720fd06000 stackSize=1009KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.activityPaused(ActivityManagerService.java:8527)
  - waiting to lock <0x00c6c1f5> (a com.android.server.am.ActivityManagerService) held by thread 146
  at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:225)
  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3399)
  at android.os.Binder.execTransact(Binder.java:728)
 
"NetworkPolicy.uid" prio=5 tid=53 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x1405f790 self=0x72163fc000
  | sysTid=1881 nice=-2 cgrp=default sched=0/0 handle=0x721767d4f0
  | state=S schedstat=( 992941159886 2687927167537 3629569 ) utm=24231 stm=75063 core=5 HZ=100
  | stack=0x721757a000-0x721757c000 stackSize=1041KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService$LocalService.notifyNetworkPolicyRulesUpdated(ActivityManagerService.java:27503)
  - waiting to lock <0x00c6c1f5> (a com.android.server.am.ActivityManagerService) held by thread 146
  at com.android.server.net.NetworkPolicyManagerService.handleUidChanged(NetworkPolicyManagerService.java:4461)
  - locked <0x05e68352> (a java.lang.Object)
  at com.android.server.net.NetworkPolicyManagerService$18.handleMessage(NetworkPolicyManagerService.java:4435)
  at android.os.Handler.dispatchMessage(Handler.java:102)
  at android.os.Looper.loop(Looper.java:207)
  at android.os.HandlerThread.run(HandlerThread.java:65)
  at com.android.server.ServiceThread.run(ServiceThread.java:44)

接着看 0x05e68352 和 0x00c6c1f5 这俩monitor锁的持有线程:

"Binder:1471_20" prio=5 tid=146 WaitingForGcThreadFlip
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x14147028 self=0x72126b5800
  | sysTid=18045 nice=-2 cgrp=default sched=0/0 handle=0x72049824f0
  | state=S schedstat=( 6101767900274 5905637387083 13051781 ) utm=516502 stm=93674 core=7 HZ=100
  | stack=0x7204887000-0x7204889000 stackSize=1009KB
  | held mutexes=
  kernel: __switch_to+0xa4/0xfc
  kernel: futex_wait_queue_me+0xc4/0x13c
  kernel: futex_wait+0xe4/0x204
  kernel: do_futex+0x168/0x80c
  kernel: SyS_futex+0x90/0x1b8
  kernel: __sys_trace+0x4c/0x4c
  native: #00 pc 000000000001ef2c  /system/lib64/libc.so (syscall+28)
  native: #01 pc 00000000000d855c  /system/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+148)
  native: #02 pc 00000000001f0680  /system/lib64/libart.so (art::gc::Heap::IncrementDisableThreadFlip(art::Thread*)+516)
  native: #03 pc 0000000000384fe4  /system/lib64/libart.so (art::JNI::GetStringCritical(_JNIEnv*, _jstring*, unsigned char*)+692)
  native: #04 pc 000000000012e268  /system/lib64/libandroid_runtime.so (android::android_os_Parcel_writeString(_JNIEnv*, _jclass*, long, _jstring*)+64)
  at android.os.Parcel.nativeWriteString(Native method)
  at android.os.Parcel$ReadWriteHelper.writeString(Parcel.java:369)
  at android.os.Parcel.writeString(Parcel.java:707)
  at android.content.pm.PackageItemInfo.writeToParcel(PackageItemInfo.java:652)
  at android.content.pm.ApplicationInfo.writeToParcel(ApplicationInfo.java:1513)
  at android.app.IApplicationThread$Stub$Proxy.bindApplication(IApplicationThread.java:943)
  at com.android.server.am.ActivityManagerService.attachApplicationLocked(ActivityManagerService.java:8146)
  at com.android.server.am.ActivityManagerService.attachApplication(ActivityManagerService.java:8255)
  - locked <0x00c6c1f5> (a com.android.server.am.ActivityManagerService)
  at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:199)
  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3399)
  at android.os.Binder.execTransact(Binder.java:728)

可以看到还是owner线程正是之前 WaitingForGcThreadFlip 中的一员,那么下一步通过gdb看下这一类线程具体卡在哪一行代码上:

  • 根据机器的版本号确认symbol文件后,通过公司同事分享的脚本一行命令就可以attach到system_server进程上
  • $ ~/Documents/gdb_native_tools/droid.py attach /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols system_server
     
    process: system_server, pid(1471), parent(zygote64)
    gdbserver:/data/local/tmp/gdbserver64
    bin_path: /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/
    solib:    /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/
    bin:      /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64
     
    Attached; pid = 1471
    gdbserver: Unable to determine the number of hardware watchpoints available.
    gdbserver: Unable to determine the number of hardware breakpoints available.
    Listening on port 1234
     
    GNU gdb (GDB) 7.11
    Copyright (C) 2016 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
    For help, type "help".
    Type "apropos word" to search for commands related to "word".
    Reading symbols from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64...done.
    Remote debugging using localhost:1234
    Remote debugging from host 127.0.0.1
    warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64" is not at the expected address (wrong library or version mismatch?)
    warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libandroid_runtime.so" is not at the expected address (wrong library or version mismatch?)
    warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libcutils.so" is not at the expected address (wrong library or version mismatch?)
    warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libaudioclient.so" is not at the expected address (wrong library or version mismatch?)
    warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libcutils.so" is not at the expected address (wrong library or version mismatch?)
    warning: Could not load shared library symbols for 15 libraries, e.g. /system/lib64/libclang_rt.ubsan_standalone-aarch64-android.so.
    Use the "info sharedlibrary" command to see the complete listing.
    Do you need "set solib-search-path" or "set sysroot"?
    Reading symbols from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64...done.
    Reading symbols from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/linker64...done.
    ......
    syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
    41  bionic/libc/arch-arm64/bionic/syscall.S: No such file or directory.
    Source directories searched: /mnt/miui/v10-p-dipper-alpha:$cdir:$cwd
    #0  syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
    #1  0x0000007238a80560 in art::futex (uaddr=0x723961df68, op=0, val=20935915, timeout=0x7fc723ee90, uaddr2=0x7fc723ee90, val3=0) at art/runtime/base/mutex-inl.h:43
    #2  art::ConditionVariable::WaitHoldingLocks (this=0x723961df58, self=0x7239614c00) at art/runtime/base/mutex.cc:953
    #3  0x0000007238d6241c in art::Monitor::Lock<(art::LockReason)1> (this=0x723961df20, self=0x7239614c00) at art/runtime/monitor.cc:559
    #4  0x0000007238d687a8 in art::Monitor::MonitorEnter (self=0x7239614c00, obj=<optimized out>, trylock=<optimized out>) at art/runtime/monitor.cc:1462
    #5  0x0000007238ec40fc in art::mirror::Object::MonitorEnter (this=0x62 <_DYNAMIC+98>, self=0x7239614c00) at art/runtime/mirror/object-inl.h:98
    #6  artLockObjectFromCode (obj=0x62 <_DYNAMIC+98>, self=0x7239614c00) at art/runtime/entrypoints/quick/quick_lock_entrypoints.cc:32
    #7  0x0000007238f04e9c in art_quick_lock_object () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:1510
    #8  0x00000072207cf6a4 in ?? ()
    Backtrace stopped: not enough registers or memory available to unwind further
    /home/pip/Documents/gdb_native_tools/gdb-script/init.gdb: No such file or directory.
    /home/pip/Documents/gdb_native_tools/shadow/gdb_driver.py: No such file or directory.
    (gdb) handle SIG35 nostop noprint pass
    Signal        Stop  Print   Pass to program Description
    SIG35         No    No  Yes     Real-time event 35
    (gdb) handle SIG33 nostop noprint pass
    Signal        Stop  Print   Pass to program Description
    SIG33         No    No  Yes     Real-time event 33
    

tips:handle SIG35 nostop noprint pass 是为了不处理android平台上用于抓native trace的 sig35 信号,减小我们的调试过程意外中断的可能性(只是个习惯, 这个问题中其实不做也没关系)

  • 切换到gc线程
  • (gdb) thread 9
    [Switching to thread 9 (Thread 1471.1484)]
    #0  syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
    41      svc     #0
    (gdb) bt
    #0  syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
    #1  0x0000007238a80560 in art::futex (uaddr=0x7239723910, op=0, val=93152, timeout=0x7221a89ab0, uaddr2=0x7221a89ab0, val3=0) at art/runtime/base/mutex-inl.h:43
    #2  art::ConditionVariable::WaitHoldingLocks (this=0x7239723900, self=0x723966b000) at art/runtime/base/mutex.cc:953
    #3  0x0000007238b98d54 in art::gc::Heap::ThreadFlipBegin (this=0x72396b7600, self=0x723966b000) at art/runtime/gc/heap.cc:845 // <<<<<<<<<<<<<<<<<<<<<<<
    #4  0x0000007238e52204 in art::ThreadList::FlipThreadRoots (this=<optimized out>, thread_flip_visitor=0x7221a89ca0, flip_callback=<optimized out>, collector=
        0x7239716500, pause_listener=<optimized out>) at art/runtime/thread_list.cc:591
    #5  0x0000007238b65cc4 in art::gc::collector::ConcurrentCopying::FlipThreadRoots (this=0x7239716500) at art/runtime/gc/collector/concurrent_copying.cc:620
    #6  0x0000007238b64c50 in art::gc::collector::ConcurrentCopying::RunPhases (this=0x7239716500) at art/runtime/gc/collector/concurrent_copying.cc:178
    #7  0x0000007238b7af80 in art::gc::collector::GarbageCollector::Run (this=0x7239716500, gc_cause=<optimized out>, clear_soft_references=<optimized out>)
        at art/runtime/gc/collector/garbage_collector.cc:96
    #8  0x0000007238b9d93c in art::gc::Heap::CollectGarbageInternal (this=0x72396b7600, gc_type=art::gc::collector::kGcTypeFull, gc_cause=art::gc::kGcCauseBackground,
        clear_soft_references=false) at art/runtime/gc/heap.cc:2616
    #9  0x0000007238baeeb8 in art::gc::Heap::ConcurrentGC (this=0x72396b7600, self=<optimized out>, cause=art::gc::kGcCauseBackground, force_full=false)
        at art/runtime/gc/heap.cc:3621
    #10 0x0000007238bb4524 in art::gc::Heap::ConcurrentGCTask::Run (this=<optimized out>, self=0x0 <_DYNAMIC>) at art/runtime/gc/heap.cc:3582
    #11 0x0000007238bd692c in art::gc::TaskProcessor::RunAllTasks (this=0x72396bb100, self=0x723966b000) at art/runtime/gc/task_processor.cc:129
    #12 0x0000000072a7e3f0 in dalvik.system.VMRuntime.clampGrowthLimit [DEDUPED] ()
       from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-libart.oat
    #13 0x0000000072c20bd8 in java.lang.Daemons$HeapTaskDaemon.runInternal ()
       from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-libart.oat
    #14 0x0000000072a80c40 in java.lang.Daemons$Daemon.run ()
       from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-libart.oat
    #15 0x0000000071baa4dc in java.lang.Thread.run ()
       from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-oj.oat
    #16 0x0000007238f0498c in art_quick_invoke_stub () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:1702
    #17 0x0000007238a78524 in art::ArtMethod::Invoke (this=0x6ff68ad8, self=0x723966b000, args=<optimized out>, args_size=<optimized out>, result=0x7221a8a310,
        shorty=0x7269daa9 "V") at art/runtime/art_method.cc:374
    #18 0x0000007238e07258 in art::(anonymous namespace)::InvokeWithArgArray (soa=..., method=0x6ff68ad8, arg_array=0x7221a8a318, result=<optimized out>,
        shorty=0x7269daa9 "V") at art/runtime/reflection.cc:456
    #19 0x0000007238e08314 in art::InvokeVirtualOrInterfaceWithJValues (soa=..., obj=0x5 <_DYNAMIC+5>, mid=<optimized out>, args=0x0 <_DYNAMIC>)
        at art/runtime/reflection.cc:548
    #20 0x0000007238e36764 in art::Thread::CreateCallback (arg=0x723966b000) at art/runtime/thread.cc:473
    #21 0x00000072bb501db0 in __pthread_start (arg=0x7221a8a4f0) at bionic/libc/bionic/pthread_create.cpp:254
    #22 0x00000072bb4a378c in __start_thread (fn=0x72bb501d88 <__pthread_start(void*)>, arg=0x7221a8a4f0) at bionic/libc/bionic/clone.cpp:52
    

结合代码: art/runtime/gc/heap.cc#845

void Heap::ThreadFlipBegin(Thread* self) {
  // Supposed to be called by GC. Set thread_flip_running_ to be true. If disable_thread_flip_count_
  // > 0, block. Otherwise, go ahead.
  CHECK(kUseReadBarrier);
  ScopedThreadStateChange tsc(self, kWaitingForGcThreadFlip);
  MutexLock mu(self, *thread_flip_lock_);
  bool has_waited = false;
  uint64_t wait_start = NanoTime();
  CHECK(!thread_flip_running_);
  // Set this to true before waiting so that frequent JNI critical enter/exits won't starve
  // GC. This like a writer preference of a reader-writer lock.
  thread_flip_running_ = true;
  while (disable_thread_flip_count_ > 0) { // <<<<<<<<<<<<
    has_waited = true;
    thread_flip_cond_->Wait(self); // <<<<<<<<<<<<<
  }
  if (has_waited) {
    uint64_t wait_time = NanoTime() - wait_start;
    total_wait_time_ += wait_time;
    if (wait_time > long_pause_log_threshold_) {
      LOG(INFO) << __FUNCTION__ << " blocked for " << PrettyDuration(wait_time);
    }
  }
}

既然在等待一个condition,那么 disable_thread_flip_count_ 肯定是大于0的,我们可以打印一下这个变量的值:

(gdb) print disable_thread_flip_count_
$1 = 1

果然它是1,正常情况下,当前线程在thread_flip_cond_这个条件上wait一段时间后,肯定会有一个线程会去notify它,来唤醒处于wait状态的线程;

  • 那么,接下来自然是快速过一下源码,看看都有哪些函数会去操作thread_flip_cond_:
void Heap::ThreadFlipEnd(Thread* self) {
  // Supposed to be called by GC. Set thread_flip_running_ to false and potentially wake up mutators
  // waiting before doing a JNI critical.
  CHECK(kUseReadBarrier);
  MutexLock mu(self, *thread_flip_lock_);
  CHECK(thread_flip_running_);
  thread_flip_running_ = false;
  // Potentially notify mutator threads blocking to enter a JNI critical section.
  thread_flip_cond_->Broadcast(self);  // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
}
 
void Heap::DecrementDisableThreadFlip(Thread* self) {
  // Supposed to be called by mutators. Decrement disable_thread_flip_count_ and potentially wake up
  // the GC waiting before doing a thread flip.
  CHECK(kUseReadBarrier);
  self->DecrementDisableThreadFlipCount();
  bool is_outermost = self->GetDisableThreadFlipCount() == 0;
  if (!is_outermost) {
    // If this is not an outermost JNI critical exit, we don't need to decrement the global counter.
    // The global counter is decremented only once for a thread for the outermost exit.
    return;
  }
  MutexLock mu(self, *thread_flip_lock_);
  CHECK_GT(disable_thread_flip_count_, 0U);
  --disable_thread_flip_count_;
  if (disable_thread_flip_count_ == 0) {
    // Potentially notify the GC thread blocking to begin a thread flip.
    thread_flip_cond_->Broadcast(self); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  }
}

再看一下各个线程的backtrace,确认了,没有哪个在调用上面两个任意一个

  • 那些处于WaitingForGcThreadFlip的线程block在哪?
  • (gdb) t 23
    [Switching to thread 23 (Thread 1471.1692)]
    #0  syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
    41      svc     #0
    (gdb) bt
    #0  syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41
    #1  0x0000007238a80560 in art::futex (uaddr=0x7239723910, op=0, val=93152, timeout=0x721cbe2850, uaddr2=0x721cbe2850, val3=0) at art/runtime/base/mutex-inl.h:43
    #2  art::ConditionVariable::WaitHoldingLocks (this=0x7239723900, self=0x721d26a000) at art/runtime/base/mutex.cc:953
    #3  0x0000007238b98684 in art::gc::Heap::IncrementDisableThreadFlip (this=0x72396b7600, self=0x721d26a000) at art/runtime/gc/heap.cc:798
    #4  0x0000007238d2cfe8 in art::JNI::GetStringCritical (env=<optimized out>, java_string=<optimized out>, is_copy=0x0 <_DYNAMIC>) at art/runtime/jni_internal.cc:1887
    #5  0x00000072bae3177c in android::sp<android::IGraphicBufferProducer>::~sp (this=0x0 <_DYNAMIC>) at system/core/libutils/include/utils/StrongPointer.h:156
    #6  android::nativeCreateScoped (env=<optimized out>, clazz=<optimized out>, surfaceObject=<optimized out>)
        at frameworks/base/core/jni/android_view_SurfaceSession.cpp:51
    #7  0x00000000748c94ac in android.filterfw.format.ImageFormat.create ()
       from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-framework.oat
    #8  0x0000000000000000 in ?? () from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64
    

art/runtime/gc/heap.cc:798 对应的代码:

void Heap::IncrementDisableThreadFlip(Thread* self) {
  // Supposed to be called by mutators. If thread_flip_running_ is true, block. Otherwise, go ahead.
  CHECK(kUseReadBarrier);
  bool is_nested = self->GetDisableThreadFlipCount() > 0;
  self->IncrementDisableThreadFlipCount();
  if (is_nested) {
    // If this is a nested JNI critical section enter, we don't need to wait or increment the global
    // counter. The global counter is incremented only once for a thread for the outermost enter.
    return;
  }
  ScopedThreadStateChange tsc(self, kWaitingForGcThreadFlip);     // <<<<<<<<<<<<<<<<<<<<<<< 改变线程状态
  MutexLock mu(self, *thread_flip_lock_);
  bool has_waited = false;
  uint64_t wait_start = NanoTime();
  if (thread_flip_running_) {
    ScopedTrace trace("IncrementDisableThreadFlip");
    while (thread_flip_running_) {
      has_waited = true;
      thread_flip_cond_->Wait(self); // <<<<<<<<<<<<<<<<<<<<
    }
  }
  ++disable_thread_flip_count_;
  if (has_waited) {
    uint64_t wait_time = NanoTime() - wait_start;
    total_wait_time_ += wait_time;
    if (wait_time > long_pause_log_threshold_) {
      LOG(INFO) << __FUNCTION__ << " blocked for " << PrettyDuration(wait_time);
    }
  }
}

所以到这儿,我们清楚的一点是,肯定是由于某个线程没有正常调用 DecrementDisableThreadFlip() 函数,导致 disable_thread_flip_count_ 一直不为0;

  • disable_thread_flip_count_ 一直不为 0 这个锅应该是谁的?
    仔细看一下上面这个函数的代码,不仅全局有个disable_thread_flip_count_, 同时每个线程还有thread local的 disable_thread_flip_count,每次调用IncrementDisableThreadFlip()开始的时候,都会先执行Thread::Current()->IncrementDisableThreadFlipCount(),看下它的代码:
  • // art/runtime/thread.h
    uint32_t GetDisableThreadFlipCount() const {
      CHECK(kUseReadBarrier);
      return tls32_.disable_thread_flip_count;
    }
    

YEAH!下面只需要逐个遍历art里面每个线程,把 tls32_.disable_thread_flip_count > 1的部分打印出来:
稍微修改一下gdb脚本下的 art.gdb脚本里的art_dump_thread_list_base函数:

define art_dump_thread_list
    art_dump_thread_list_base 0
end
 
define art_dump_thread_list_base
    set $dump_mutex = $arg0
    #  第一个节点
    set $first = ('art::Runtime'::instance_)->thread_list_->list_->__end_->__next_
    printf "\n  Id    Thread Id   art::Thread*    disable_flip_count    STATE         FLAGS                NAME             SUSPEND_COUNT           MUTEX\n"
 
    set $list_node = 0
    set $count = 1
 
    while $list_node != $first
        if $list_node == 0
            set $list_node = $first
        end
         
        set $thread = ('art::Thread'*)(*(void**)((uint64_t)$list_node + 2*sizeof(void*)))
        set $flip_count = (('art::Thread'*)$thread)->tls32_->disable_thread_flip_count
        if $flip_count != 0  # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 1 
            set $thread_id = (('art::Thread'*)$thread)->tls32_->tid
 
            #printf "monitor_enter_object: 0x%lx", $thread->tlsPtr_.monitor_enter_object
            printf "%4d     %6d     %p   ", $count, $thread_id, $thread
 
            printf "         %d        ", $flip_count # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 2
 
            thread_print_state $thread
 
            printf "              "
            thread_print_name $thread
            printf "              "
            thread_print_suspend_count $thread
 
            if $dump_mutex == 1
                thread_print_mutex $thread
            end
 
            printf "\n"
        end
 
        set $list_node = $list_node->__next_
        set $count = $count + 1
 
        if $list_node == $first
            loop_break
        end
    end
end

然后重新导入脚本然后执行这个函数即可:

(gdb) source /home/pip/Documents/gdb_native_tools/art.gdb
(gdb) art_dump_thread_list
  Id    Thread Id   art::Thread*    disable_flip_count    STATE                       NAME                        SUSPEND_COUNT        
   4       1479     0x723973c800            1          kWaitingForGcThreadFlip     Binder:filter-perf-event     0
  23       1692     0x721d26a000            1          kWaitingForGcThreadFlip     PowerManagerService          0
  31       1825     0x722b7ab800            1          kWaitingForGcThreadFlip     HwBinder:1471_1              0
  52       1884     0x721c426c00            1          kWaitingForGcThreadFlip     WifiStateMachine             0
  88       1938     0x721bc96800            1          kWaitingForGcThreadFlip     Binder:1471_3                0
  98       2125     0x721c54d000            1          kWaitingForGcThreadFlip     Binder:1471_5                0
 100       2166     0x721a129800            1          kWaitingForGcThreadFlip     Binder:1471_7                0
 113       2894     0x7219cc1000            1          kWaitingForGcThreadFlip     Binder:1471_A                0
 116       2949     0x721273ec00            1          kWaitingForGcThreadFlip     Binder:1471_D                0
 117       2950     0x7219cc0400            1          kWaitingForGcThreadFlip     Binder:1471_E                0
 120       3089     0x720b025000            1          kWaitingForGcThreadFlip     Binder:1471_11               0
 126       3135     0x720c454800            1          kWaitingForGcThreadFlip     Binder:1471_17               0
 127       3140     0x721929cc00            1          kWaitingForGcThreadFlip     Binder:1471_18               0
 128       3142     0x720eb42000            1          kWaitingForGcThreadFlip     Binder:1471_19               0
 132       8981     0x7219d25800            1          kWaitingForGcThreadFlip     HwBinder:1471_4              0
 142      18045     0x72126b5800            1          kWaitingForGcThreadFlip     Binder:1471_20               0
 181       4738     0x7211c94400            1          kNative                     AsyncTask #36709             0

最后那个 AsyncTask #36709 十分可疑,别的线程的disable_flip_count 为1的原因是正在调用 IncrementDisableThreadFlip()函数,OK,接着看下它在执行什么函数:

  • 嫌疑人 “AsyncTask #36709” 干了些什么?
  • "AsyncTask #36709" prio=5 tid=180 Native
      | group="main" sCount=1 dsCount=0 flags=1 obj=0x14d149c8 self=0x7211c94400
      | sysTid=4738 nice=0 cgrp=default sched=0/0 handle=0x708411b4f0
      | state=S schedstat=( 3931924 16620626 18 ) utm=0 stm=0 core=7 HZ=100
      | stack=0x7084018000-0x708401a000 stackSize=1041KB
      | held mutexes=
      kernel: __switch_to+0xa4/0xfc
      kernel: binder_thread_read+0xa78/0x11a0
      kernel: binder_ioctl+0x920/0xb08
      kernel: do_vfs_ioctl+0xb8/0x8d8
      kernel: SyS_ioctl+0x84/0x98
      kernel: __sys_trace+0x4c/0x4c
      native: #00 pc 000000000006e5bc  /system/lib64/libc.so (__ioctl+4)
      native: #01 pc 0000000000029544  /system/lib64/libc.so (ioctl+136)
      native: #02 pc 000000000001e22c  /system/lib64/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+240)
      native: #03 pc 000000000001e884  /system/lib64/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+324)
      native: #04 pc 0000000000012204  /system/lib64/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+312)
      native: #05 pc 00000000000c9764  /system/lib64/android.hardware.gnss@1.0.so (android::hardware::gnss::V1_0::BpHwGnssXtra::_hidl_injectXtraData(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, android::hardware::hidl_string const&)+244)
      native: #06 pc 000000000005d5c4  /system/lib64/libandroid_servers.so (android::android_location_GnssLocationProvider_inject_xtra_data(_JNIEnv*, _jobject*, _jbyteArray*, int)+268)
      at com.android.server.location.GnssLocationProvider.native_inject_xtra_data(Native method)
      at com.android.server.location.GnssLocationProvider.access$2300(GnssLocationProvider.java:112)
      at com.android.server.location.GnssLocationProvider$10.run(GnssLocationProvider.java:1140)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
      at java.lang.Thread.run(Thread.java:764)
    

它在一个jni里执行一个hw binder call,结合下代码:

// frameworks/base/services/core/jni/com_android_server_location_GnssLocationProvider.cpp#1613
static void android_location_GnssLocationProvider_inject_xtra_data(JNIEnv* env, jobject /* obj */,
        jbyteArray data, jint length) {
    if (gnssXtraIface == nullptr) {
        ALOGE("XTRA Interface not supported");
        return;
    }
 
    jbyte* bytes = reinterpret_cast<jbyte *>(env->GetPrimitiveArrayCritical(data, 0));
    gnssXtraIface->injectXtraData(std::string((const char*)bytes, length));
    env->ReleasePrimitiveArrayCritical(data, bytes, JNI_ABORT);
}

(不熟悉art实现)一眼看过去,都是正常的jni调用,好像跟disable_flip_count没啥关系呀?全局搜一下DecrementDisableThreadFlip()和IncrementDisableThreadFlip被调用的地方:

ReleaseStringCritical() → DecrementDisableThreadFlip()
ReleasePrimitiveArray() → DecrementDisableThreadFlip()

GetPrimitiveArrayCritical() → IncrementDisableThreadFlip()
GetStringCritical() → IncrementDisableThreadFlip()

到这个地方,稍稍消化总结一下,不难推断出:
这个问题必然是因为 AsyncTask #36709 在执行jni调用过程中,它执行了:
GetPrimitiveArrayCritical() → IncrementDisableThreadFlip() → 使得全局变量 disable_thread_flip_count_ = 1
但是在继续执行 gnssXtraIface →injectXtraData() 这个 hal 接口的时候,一直被block,自然无法执行 ReleasePrimitiveArrayCritical() → DecrementDisableThreadFlip() → disable_thread_flip_count_ = 0

4. gnssXtraIface→injectXtraData() 为什么会一直被block?

查看binder transactin日志

$ adb shell cat /sys/kernel/debug/binder/transactions
 
proc 1471
context hwbinder
  thread 1673: l 00 need_return 0 tr 0
    incoming transaction 979513688: ffffffed66132800 from 1007:1007 to 1471:1673 code 2 flags 10 pri 0:120 r1 node 978962390 size 48:0 data ffffff8014f00258
    outgoing transaction 979513687: ffffffed66133e00 from 1471:1673 to 1007:1007 code 3 flags 10 pri 0:120 r1
  thread 8981: l 01 need_return 0 tr 0
    incoming transaction 979513990: ffffffed2541c700 from 1007:2040 to 1471:8981 code 4 flags 10 pri 0:120 r1 node 978962390 size 132:16 data ffffff8014f00508
  thread 4738: l 00 need_return 0 tr 0
    outgoing transaction 979513850: ffffffee6324a400 from 1471:4738 to 1007:0 code 2 flags 10 pri 0:120 r1
  thread 8981: l 01 need_return 0 tr 0
    incoming transaction 979513990: ffffffed2541c700 from 1007:2040 to 1471:8981 code 4 flags 10 pri 0:120 r1 node 978962390 size 132:16 data ffffff8014f00508
 
proc 1007
context hwbinder8981
  thread 1007: l 12 need_return 0 tr 0
    outgoing transaction 979513688: ffffffed66132800 from 1007:1007 to 1471:1673 code 2 flags 10 pri 0:120 r1
    incoming transaction 979513687: ffffffed66133e00 from 1471:1673 to 1007:1007 code 3 flags 10 pri 0:120 r1 node 956 size 36:0 data ffffff8012000008
  thread 2040: l 10 need_return 0 tr 0
    outgoing transaction 979513990: ffffffed2541c700 from 1007:2040 to 1471:8981 code 4 flags 10 pri 0:120 r1
pending transaction 979513850: ffffffee6324a400 from 1471:4738 to 1007:0 code 2 flags 10 pri 0:120 r1 node 5700 size 120:16 data ffffff8012000030

incoming transaction from 1471:4738 to 1007:0,得到对端进程pid为1007, 继续看下这个进程里各个线程的状态

$ adb shell ps -T 1007
USER           PID   TID  PPID     VSZ    RSS WCHAN            ADDR S CMD           
system        1007  1007     1   29016    156 binder_thread_read  0 S ignss_1_1
system        1007  2040     1   29016    156 binder_thread_read  0 S IPC Thread
system        1007 17865     1   29016    156 poll_schedule_timeout 0 S ignss_1_1

各个线程看起来都是正常的,通过debuggerd查看backtrace

$ adb shell debuggerd -b 1007
----- pid 1007 at 2019-05-05 13:25:52 -----
Cmd line: /vendor/bin/ignss_1_1
ABI: 'arm64'
 
"ignss_1_1" sysTid=1007
  #00 pc 000000000006e5bc  /system/lib64/libc.so (__ioctl+4)
  #01 pc 0000000000029544  /system/lib64/libc.so (ioctl+136)
  #02 pc 000000000001e22c  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+240)
  #03 pc 000000000001e77c  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+60)
  #04 pc 0000000000012204  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+312)
  #05 pc 0000000000097ddc  /system/lib64/android.hardware.gnss@1.0.so (android::hardware::gnss::V1_0::BpHwGnssCallback::_hidl_gnssStatusCb(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, android::hardware::gnss::V1_0::IGnssCallback::GnssStatusValue)+208)
  #06 pc 000000000001ccf0  /vendor/lib64/android.hardware.gnss@1.1-impl-xiaomi.so (android::hardware::gnss::V1_0::implementation::Gnss::statusCb(GpsStatus*)+68)
  #07 pc 000000000001b5b4  /vendor/lib64/hw/gps.brcm.so (gps_stop()+156)
  #08 pc 0000000000013ae0  /vendor/lib64/hw/gps.brcm.so (proxy__gps_stop()+168)
  #09 pc 000000000001e1d4  /vendor/lib64/android.hardware.gnss@1.1-impl-xiaomi.so (android::hardware::gnss::V1_0::implementation::Gnss::stop()+28)
  #10 pc 0000000000088328  /system/lib64/android.hardware.gnss@1.0.so (android::hardware::gnss::V1_0::BnHwGnss::_hidl_stop(android::hidl::base::V1_0::BnHwBase*, android::hardware::Parcel const&, android::hardware::Parcel*, std::__1::function<void (android::hardware::Parcel&)>)+152)
  #11 pc 000000000002e324  /system/lib64/android.hardware.gnss@1.1.so (android::hardware::gnss::V1_1::BnHwGnss::onTransact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+1624)
  #12 pc 000000000001df34  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::BHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+72)
  #13 pc 000000000001508c  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::executeCommand(int)+1508)
  #14 pc 0000000000014938  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::getAndExecuteCommand()+204)
  #15 pc 0000000000015708  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::joinThreadPool(bool)+268)
  #16 pc 0000000000000cb8  /vendor/bin/ignss_1_1 (main+468)
  #17 pc 00000000000acec0  /system/lib64/libc.so (__libc_init+88)
 
"IPC Thread" sysTid=2040
  #00 pc 000000000006e5bc  /system/lib64/libc.so (__ioctl+4)
  #01 pc 0000000000029544  /system/lib64/libc.so (ioctl+136)
  #02 pc 000000000001e22c  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+240)
  #03 pc 000000000001e884  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+324)
  #04 pc 0000000000012204  /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+312)
  #05 pc 0000000000098294  /system/lib64/android.hardware.gnss@1.0.so (android::hardware::gnss::V1_0::BpHwGnssCallback::_hidl_gnssNmeaCb(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, long, android::hardware::hidl_string const&)+272)
  #06 pc 000000000001d0b4  /vendor/lib64/android.hardware.gnss@1.1-impl-xiaomi.so (android::hardware::gnss::V1_0::implementation::Gnss::nmeaCb(long, char const*, int)+112)
  #07 pc 00000000000136d8  /vendor/lib64/hw/gps.brcm.so (proxy__gps_nmea_cb(long, char const*, int)+180)
  #08 pc 0000000000018c88  /vendor/lib64/hw/gps.brcm.so (GpsiClient::marshal_gps_nmea_cb(IpcIncomingMessage&)+300)
  #09 pc 00000000000252a4  /vendor/lib64/hw/gps.brcm.so (IpcPipeTransportBase::OnSelect(int, bool, bool, bool, void*)+344)
  #10 pc 0000000000025a28  /vendor/lib64/hw/gps.brcm.so (SelectManager::PerformOneWaitAndProcess()+468)
  #11 pc 0000000000017860  /vendor/lib64/hw/gps.brcm.so (ipc_thread_proc(void*)+56)
  #12 pc 0000000000023160  /vendor/lib64/android.hardware.gnss@1.1-impl-xiaomi.so (threadFunc(void*)+12)
  #13 pc 0000000000081dac  /system/lib64/libc.so (__pthread_start(void*)+36)
  #14 pc 0000000000023788  /system/lib64/libc.so (__start_thread+68)
 
"ignss_1_1" sysTid=17865
  #00 pc 000000000006e604  /system/lib64/libc.so (__pselect6+4)
  #01 pc 000000000002bbf8  /system/lib64/libc.so (select+144)
  #02 pc 000000000001974c  /vendor/lib64/hw/flp.brcm.so (SelectManager::PerformOneWaitAndProcess()+268)
  #03 pc 000000000000e8e8  /vendor/lib64/hw/flp.brcm.so (ipc_thread_proc(void*)+116)
  #04 pc 0000000000081dac  /system/lib64/libc.so (__pthread_start(void*)+36)
  #05 pc 0000000000023788  /system/lib64/libc.so (__start_thread+68)
 
----- end 1007 -----

结合上面的binder call日志,可以看到ignss_1_1进程内部的两个binder线程都在回调system_server进程的binder 接口,没有空闲binder线程可以处理来自system_server的gnssXtraIface→injectXtraData() 请求!

incoming 1471:1673
outgoing 1007:1007 to 1471:1673 // blocked !
outgoing 1007:2040 to 1471:8981 // WaitingForGcThreadFlip !

p.s. 1471:1673 并不是一个binder线程, 但确响应了来自1007:1007的binder call,这是binder api设计上面的优化,复用? todo:确认

"android.bg" prio=5 tid=13 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13fc3428 self=0x7232239400
  | sysTid=1673 nice=0 cgrp=default sched=0/0 handle=0x721da8e4f0
  | state=S schedstat=( 19417818968810 14397831932031 20757829 ) utm=561393 stm=1380388 core=5 HZ=100
  | stack=0x721d98b000-0x721d98d000 stackSize=1041KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:22705)
  - waiting to lock <0x00c6c1f5> (a com.android.server.am.ActivityManagerService) held by thread 146
  at android.app.ContextImpl.sendBroadcastAsUser(ContextImpl.java:1195)
  at com.android.server.location.GnssLocationProvider.reportStatus(GnssLocationProvider.java:1766)
  at com.android.server.location.GnssLocationProvider.native_stop(Native method)
  at com.android.server.location.GnssLocationProvider.stopNavigating(GnssLocationProvider.java:1607)
  at com.android.server.location.GnssLocationProvider.updateRequirements(GnssLocationProvider.java:1405)
  at com.android.server.location.GnssLocationProvider.handleSetRequest(GnssLocationProvider.java:1345)
  at com.android.server.location.GnssLocationProvider.access$4100(GnssLocationProvider.java:112)
  at com.android.server.location.GnssLocationProvider$ProviderHandler.handleMessage(GnssLocationProvider.java:2400)
  at android.os.Handler.dispatchMessage(Handler.java:106)
  at android.os.Looper.loop(Looper.java:207)
  at android.os.HandlerThread.run(HandlerThread.java:65)

5. 修复方案

所以这个问题也是system_server↔ ignss_1_1 这两个进程相互binder call + jni call + 虚拟机正好在执行垃圾回收;

参考 oracle的文档:

After calling GetPrimitiveArrayCritical, the native code should not run for an extended period of time before it calls ReleasePrimitiveArrayCritical. We must treat the code inside this pair of functions as running in a “critical region.” Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread. (For example, the current thread must not call read on a stream being written by another Java thread.)

These restrictions make it more likely that the native code will obtain an uncopied version of the array, even if the VM does not support pinning. For example, a VM may temporarily disable garbage collection when the native code is holding a pointer to an array obtained via GetPrimitiveArrayCritical.

GetPrimitiveArrayCritical 和 ReleasePrimitiveArrayCritical 两个函数间的代码执行耗时不应太久! 结合此处的业务逻辑,而且也不会被太频繁的调用,拷贝一份原始数组给binder call作为参数即可:

static void android_location_GnssLocationProvider_inject_xtra_data(JNIEnv* env, jobject /* obj */,
        jbyteArray data, jint length) {
    if (gnssXtraIface == nullptr) {
        ALOGE("XTRA Interface not supported");
        return;
    }
 
--    jbyte* bytes = reinterpret_cast<jbyte *>(env->GetPrimitiveArrayCritical(data, 0));
++    jbyte* bytes = reinterpret_cast<jbyte *>(env->GetByteArrayElements(data, 0));
    gnssXtraIface->injectXtraData(std::string((const char*)bytes, length));
--    env->ReleasePrimitiveArrayCritical(data, bytes, JNI_ABORT);
++    env->ReleaseByteArrayElements(data, bytes, JNI_ABORT);
}
Logo

华为开发者空间,是为全球开发者打造的专属开发空间,汇聚了华为优质开发资源及工具,致力于让每一位开发者拥有一台云主机,基于华为根生态开发、创新。

更多推荐