【转载】art gc相关死锁黑屏问题总结
lart gc相关死锁黑屏问题总结 · 大专栏背景:从测试同学那边拿过来两台黑屏的机器,点击电源、屏幕都没有反应,不过还好 adb 可以链接,有 root 权限,一番 deubgging 后,确认表面原因是 system_server 进程的 art 虚拟机卡在了 gc 前 flip 线程状态的操作中,而最终的 root cause 也比较有意思,所以本文就简单记录下问题的排查过程。1. 按照习惯
art gc相关死锁黑屏问题总结 | wwm的笔记
背景:
从测试同学那边拿过来两台黑屏的机器,点击电源、屏幕都没有反应,不过还好adb可以链接,有root权限,一番deubgging后,确认表面原因是system_server进程的art虚拟机卡在了gc前flip线程状态的操作中,而最终的root cause也比较有意思,所以本文就简单记录下问题的排查过程。
1. 按照习惯,快速检查下是否发生了watchdog
$ adb shell ``ls` `-l ``/data/anr/ -rw------- 1 system system 53592 2019-05-03 14:21 anr_2019-05-03-14-21-53-368 -rw------- 1 system system 845398 2019-05-03 14:30 anr_2019-05-03-14-30-03-498 -rw------- 1 system system 1078028 2019-05-03 14:58 anr_2019-05-03-14-58-40-505 -rw------- 1 system system 1060669 2019-05-03 14:59 anr_2019-05-03-14-59-00-410 -rw------- 1 system system 527393 2019-05-03 17:32 anr_2019-05-03-17-32-22-852 -rw------- 1 system system 619826 2019-05-03 17:33 anr_2019-05-03-17-32-58-212 // ... |
并没有watchdog相关日志
2. 查看系统当前时间
`$ adb shell ``date``Sun May 5 13:52:25 CST 2019` |
3. 查看system_server进程各个线程的状态
$ pid=`adb shell pidof system_server` && adb shell ``ps` `-T $pid USER PID TID PPID VSZ RSS WCHAN ADDR S CMD system 1471 1471 635 11629888 168336 futex_wait_queue_me 0 S system_server system 1471 1477 635 11629888 168336 do_sigtimedwait 0 S Signal Catcher system 1471 1478 635 11629888 168336 futex_wait_queue_me 0 S ADB-JDWP Connec system 1471 1479 635 11629888 168336 futex_wait_queue_me 0 S Binder:filter-p system 1471 1480 635 11629888 168336 poll_schedule_timeout 0 S Binder:``read``-per system 1471 1481 635 11629888 168336 futex_wait_queue_me 0 S ReferenceQueueD system 1471 1482 635 11629888 168336 futex_wait_queue_me 0 S FinalizerDaemon system 1471 1483 635 11629888 168336 futex_wait_queue_me 0 S FinalizerWatchd system 1471 1484 635 11629888 168336 futex_wait_queue_me 0 S HeapTaskDaemon system 1471 1524 635 11629888 168336 futex_wait_queue_me 0 S Binder:1471_1 system 1471 1525 635 11629888 168336 futex_wait_queue_me 0 S Binder:1471_2 system 1471 1673 635 11629888 168336 futex_wait_queue_me 0 S android.``bg system 1471 1675 635 11629888 168336 futex_wait_queue_me 0 S ActivityManager system 1471 1676 635 11629888 168336 SyS_epoll_wait 0 S android.ui system 1471 1677 635 11629888 168336 SyS_epoll_wait 0 S ActivityManager system 1471 1678 635 11629888 168336 SyS_epoll_wait 0 S ActivityManager system 1471 1684 635 11629888 168336 futex_wait_queue_me 0 S batterystats-wo system 1471 1687 635 11629888 168336 wait_woken 0 S FileObserver system 1471 1688 635 11629888 168336 futex_wait_queue_me 0 S android.``fg system 1471 1689 635 11629888 168336 futex_wait_queue_me 0 S android.io system 1471 1690 635 11629888 168336 futex_wait_queue_me 0 S android.display system 1471 1691 635 11629888 168336 futex_wait_queue_me 0 S CpuTracker system 1471 1692 635 11629888 168336 futex_wait_queue_me 0 S PowerManagerSer system 1471 1693 635 11629888 168336 futex_wait_queue_me 0 S BatteryStats_wa system 1471 1694 635 11629888 168336 SyS_epoll_wait 0 S work-thread system 1471 1695 635 11629888 168336 SyS_epoll_wait 0 S PackageManager system 1471 1696 635 11629888 168336 SyS_epoll_wait 0 S PackageManager system 1471 1817 635 11629888 168336 SyS_epoll_wait 0 S PackageInstalle system 1471 1821 635 11629888 168336 SyS_epoll_wait 0 S android.anim system 1471 1822 635 11629888 168336 SyS_epoll_wait 0 S android.anim.lf ... |
这一步主要是快速检查下有没有D(disk blocked)状态的线程
3. 查看system_server进程backtrace
$ pid=`adb shell pidof system_server` && adb shell ``kill` `-3 $pid |
得到:
点击下载完整backtrace!system_server_1471堆栈1471.txt-Android文档类资源-CSDN下载
检查各个线程在虚拟机内部的状态后,看到一个比较奇怪的现象:
"Binder:filter-perf-event"` `prio=``5` `tid=``4` `WaitingForGcThreadFlip "HeapTaskDaemon"` `daemon prio=``5` `tid=``9` `WaitingForGcThreadFlip "PowerManagerService"` `prio=``5` `tid=``24` `WaitingForGcThreadFlip "HwBinder:1471_1"` `prio=``5` `tid=``35` `WaitingForGcThreadFlip "WifiStateMachine"` `prio=``5` `tid=``56` `WaitingForGcThreadFlip "Binder:1471_3"` `prio=``5` `tid=``92` `WaitingForGcThreadFlip "Binder:1471_5"` `prio=``5` `tid=``105` `WaitingForGcThreadFlip "Binder:1471_7"` `prio=``5` `tid=``107` `WaitingForGcThreadFlip "Binder:1471_A"` `prio=``5` `tid=``120` `WaitingForGcThreadFlip "Binder:1471_D"` `prio=``5` `tid=``123` `WaitingForGcThreadFlip "Binder:1471_E"` `prio=``5` `tid=``124` `WaitingForGcThreadFlip "Binder:1471_11"` `prio=``5` `tid=``127` `WaitingForGcThreadFlip "Binder:1471_17"` `prio=``5` `tid=``133` `WaitingForGcThreadFlip "Binder:1471_18"` `prio=``5` `tid=``134` `WaitingForGcThreadFlip "Binder:1471_19"` `prio=``5` `tid=``135` `WaitingForGcThreadFlip "HwBinder:1471_4"` `prio=``5` `tid=``12` `WaitingForGcThreadFlip "Binder:1471_20"` `prio=``5` `tid=``146` `WaitingForGcThreadFlip "HwBinder:1471_5"` `prio=``5` `tid=``34` `WaitingForGcToComplete "ADB-JDWP Connection Control Thread"` `daemon prio=``0` `tid=``5` `WaitingForGcToComplete "android.io"` `prio=``5` `tid=``21` `WaitingForGcToComplete "CpuTracker"` `prio=``5` `tid=``23` `WaitingForGcToComplete "UEventObserver"` `prio=``5` `tid=``36` `WaitingForGcToComplete "HwBinder:1471_2"` `prio=``5` `tid=``43` `WaitingForGcToComplete "HwBinder:1471_3"` `prio=``5` `tid=``44` `WaitingForGcToComplete "NetdConnector"` `prio=``5` `tid=``50` `WaitingForGcToComplete "WifiService"` `prio=``5` `tid=``55` `WaitingForGcToComplete "DeviceStorageMonitorService"` `prio=``5` `tid=``67` `WaitingForGcToComplete "SecurityWriteHandlerThread"` `prio=``5` `tid=``72` `WaitingForGcToComplete "watchdog"` `prio=``5` `tid=``101` `WaitingForGcToComplete "HwBinder:1471_5"` `prio=``5` `tid=``34` `WaitingForGcToComplete "Timer-20"` `prio=``5` `tid=``170` `WaitingForGcToComplete "IpClient.wlan0"` `prio=``5` `tid=``176` `WaitingForGcToComplete "OkHttp ConnectionPool"` `daemon prio=``5` `tid=``185` `WaitingForGcToComplete ... "main"` `prio=``5` `tid=``1` `Blocked "Binder:1471_1"` `prio=``5` `tid=``10` `Blocked "Binder:1471_2"` `prio=``5` `tid=``11` `Blocked "android.bg"` `prio=``5` `tid=``13` `Blocked "ActivityManager"` `prio=``5` `tid=``14` `Blocked "batterystats-worker"` `prio=``5` `tid=``18` `Blocked "android.fg"` `prio=``5` `tid=``20` `Blocked "android.display"` `prio=``5` `tid=``22` `Blocked "AlarmManager"` `prio=``5` `tid=``40` `Blocked "InputDispatcher"` `prio=``10` `tid=``46` `Blocked "InputReader"` `prio=``10` `tid=``47` `Blocked "NetworkPolicy.uid"` `prio=``5` `tid=``53` `Blocked "Binder:1471_4"` `prio=``5` `tid=``93` `Blocked "Binder:1471_6"` `prio=``5` `tid=``106` `Blocked "Binder:1471_8"` `prio=``5` `tid=``108` `Blocked "Binder:1471_9"` `prio=``5` `tid=``118` `Blocked "Binder:1471_B"` `prio=``5` `tid=``121` `Blocked "Binder:1471_C"` `prio=``5` `tid=``122` `Blocked "Binder:1471_F"` `prio=``5` `tid=``125` `Blocked "Binder:1471_10"` `prio=``5` `tid=``126` `Blocked "Binder:1471_12"` `prio=``5` `tid=``128` `Blocked "Binder:1471_13"` `prio=``5` `tid=``129` `Blocked "Binder:1471_14"` `prio=``5` `tid=``130` `Blocked "Binder:1471_15"` `prio=``5` `tid=``131` `Blocked "Binder:1471_16"` `prio=``5` `tid=``132` `Blocked "Binder:1471_1A"` `prio=``5` `tid=``136` `Blocked "Binder:1471_1B"` `prio=``5` `tid=``140` `Blocked "Binder:1471_1C"` `prio=``5` `tid=``117` `Blocked "Binder:1471_1D"` `prio=``5` `tid=``141` `Blocked "Binder:1471_1E"` `prio=``5` `tid=``144` `Blocked "Binder:1471_1F"` `prio=``5` `tid=``145` `Blocked "NetworkMonitorNetworkAgentInfo [WIFI () - 950]"` `prio=``5` `tid=``149` `Blocked |
一堆线程都是gc相关的状态,具体什么意思,我也不关心,还是再继续看下blocked的线程,都具体卡在什么地方:
"Binder:1471_6" prio=5 tid=106 Blocked | group="main" sCount=1 dsCount=0 flags=1 obj=0x14064288 self=0x721bdfb000 | sysTid=2126 nice=0 cgrp=default sched=0/0 handle=0x72107ff4f0 | state=S schedstat=( 5605933458050 5415448290382 12052578 ) utm=475916 stm=84677 core=5 HZ=100 | stack=0x7210704000-0x7210706000 stackSize=1009KB | held mutexes= at com.android.server.net.NetworkPolicyManagerService.isUidNetworkingBlockedInternal(NetworkPolicyManagerService.java:4825) - waiting to lock <0x05e68352> (a java.lang.Object) held by thread 53 at com.android.server.net.NetworkPolicyManagerService.access$3600(NetworkPolicyManagerService.java:291) at com.android.server.net.NetworkPolicyManagerService$NetworkPolicyManagerInternalImpl.isUidNetworkingBlocked(NetworkPolicyManagerService.java:4900) at com.android.server.ConnectivityService.isNetworkWithLinkPropertiesBlocked(ConnectivityService.java:1133) at com.android.server.ConnectivityService.filterNetworkStateForUid(ConnectivityService.java:1163) at com.android.server.ConnectivityService.getActiveNetworkInfo(ConnectivityService.java:1185) at android.net.IConnectivityManager$Stub.onTransact(IConnectivityManager.java:85) at android.os.Binder.execTransact(Binder.java:728) "Binder:1471_8" prio=5 tid=108 Blocked | group="main" sCount=1 dsCount=0 flags=1 obj=0x14064398 self=0x7219dd9000 | sysTid=2168 nice=0 cgrp=default sched=0/0 handle=0x720fdff4f0 | state=S schedstat=( 5601724562758 5464273391739 12340199 ) utm=473400 stm=86772 core=6 HZ=100 | stack=0x720fd04000-0x720fd06000 stackSize=1009KB | held mutexes= at com.android.server.am.ActivityManagerService.activityPaused(ActivityManagerService.java:8527) - waiting to lock <0x00c6c1f5> (a com.android.server.am.ActivityManagerService) held by thread 146 at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:225) at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3399) at android.os.Binder.execTransact(Binder.java:728) "NetworkPolicy.uid" prio=5 tid=53 Blocked | group="main" sCount=1 dsCount=0 flags=1 obj=0x1405f790 self=0x72163fc000 | sysTid=1881 nice=-2 cgrp=default sched=0/0 handle=0x721767d4f0 | state=S schedstat=( 992941159886 2687927167537 3629569 ) utm=24231 stm=75063 core=5 HZ=100 | stack=0x721757a000-0x721757c000 stackSize=1041KB | held mutexes= at com.android.server.am.ActivityManagerService$LocalService.notifyNetworkPolicyRulesUpdated(ActivityManagerService.java:27503) - waiting to lock <0x00c6c1f5> (a com.android.server.am.ActivityManagerService) held by thread 146 at com.android.server.net.NetworkPolicyManagerService.handleUidChanged(NetworkPolicyManagerService.java:4461) - locked <0x05e68352> (a java.lang.Object) at com.android.server.net.NetworkPolicyManagerService$18.handleMessage(NetworkPolicyManagerService.java:4435) at android.os.Handler.dispatchMessage(Handler.java:102) at android.os.Looper.loop(Looper.java:207) at android.os.HandlerThread.run(HandlerThread.java:65) at com.android.server.ServiceThread.run(ServiceThread.java:44) |
接着看 0x05e68352 和 0x00c6c1f5 这俩monitor锁的持有线程:
"Binder:1471_20" prio=5 tid=146 WaitingForGcThreadFlip | group="main" sCount=1 dsCount=0 flags=1 obj=0x14147028 self=0x72126b5800 | sysTid=18045 nice=-2 cgrp=default sched=0/0 handle=0x72049824f0 | state=S schedstat=( 6101767900274 5905637387083 13051781 ) utm=516502 stm=93674 core=7 HZ=100 | stack=0x7204887000-0x7204889000 stackSize=1009KB | held mutexes= kernel: __switch_to+0xa4/0xfc kernel: futex_wait_queue_me+0xc4/0x13c kernel: futex_wait+0xe4/0x204 kernel: do_futex+0x168/0x80c kernel: SyS_futex+0x90/0x1b8 kernel: __sys_trace+0x4c/0x4c native: #00 pc 000000000001ef2c /system/lib64/libc.so (syscall+28) native: #01 pc 00000000000d855c /system/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+148) native: #02 pc 00000000001f0680 /system/lib64/libart.so (art::gc::Heap::IncrementDisableThreadFlip(art::Thread*)+516) native: #03 pc 0000000000384fe4 /system/lib64/libart.so (art::JNI::GetStringCritical(_JNIEnv*, _jstring*, unsigned char*)+692) native: #04 pc 000000000012e268 /system/lib64/libandroid_runtime.so (android::android_os_Parcel_writeString(_JNIEnv*, _jclass*, long, _jstring*)+64) at android.os.Parcel.nativeWriteString(Native method) at android.os.Parcel$ReadWriteHelper.writeString(Parcel.java:369) at android.os.Parcel.writeString(Parcel.java:707) at android.content.pm.PackageItemInfo.writeToParcel(PackageItemInfo.java:652) at android.content.pm.ApplicationInfo.writeToParcel(ApplicationInfo.java:1513) at android.app.IApplicationThread$Stub$Proxy.bindApplication(IApplicationThread.java:943) at com.android.server.am.ActivityManagerService.attachApplicationLocked(ActivityManagerService.java:8146) at com.android.server.am.ActivityManagerService.attachApplication(ActivityManagerService.java:8255) - locked <0x00c6c1f5> (a com.android.server.am.ActivityManagerService) at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:199) at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3399) at android.os.Binder.execTransact(Binder.java:728) |
可以看到还是owner线程正是之前 WaitingForGcThreadFlip 中的一员,那么下一步通过gdb看下这一类线程具体卡在哪一行代码上:
- 根据机器的版本号确认symbol文件后,通过公司同事分享的脚本一行命令就可以attach到system_server进程上
-
$ ~/Documents/gdb_native_tools/droid.py attach /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols system_server process: system_server, pid(1471), parent(zygote64) gdbserver:/data/local/tmp/gdbserver64 bin_path: /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/ solib: /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/ bin: /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64 Attached; pid = 1471 gdbserver: Unable to determine the number of hardware watchpoints available. gdbserver: Unable to determine the number of hardware breakpoints available. Listening on port 1234 GNU gdb (GDB) 7.11 Copyright (C) 2016 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word". Reading symbols from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64...done. Remote debugging using localhost:1234 Remote debugging from host 127.0.0.1 warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64" is not at the expected address (wrong library or version mismatch?) warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libandroid_runtime.so" is not at the expected address (wrong library or version mismatch?) warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libcutils.so" is not at the expected address (wrong library or version mismatch?) warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libaudioclient.so" is not at the expected address (wrong library or version mismatch?) warning: .dynamic section for "/home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/lib64/libcutils.so" is not at the expected address (wrong library or version mismatch?) warning: Could not load shared library symbols for 15 libraries, e.g. /system/lib64/libclang_rt.ubsan_standalone-aarch64-android.so. Use the "info sharedlibrary" command to see the complete listing. Do you need "set solib-search-path" or "set sysroot"? Reading symbols from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64...done. Reading symbols from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/linker64...done. ...... syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 41 bionic/libc/arch-arm64/bionic/syscall.S: No such file or directory. Source directories searched: /mnt/miui/v10-p-dipper-alpha:$cdir:$cwd #0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 #1 0x0000007238a80560 in art::futex (uaddr=0x723961df68, op=0, val=20935915, timeout=0x7fc723ee90, uaddr2=0x7fc723ee90, val3=0) at art/runtime/base/mutex-inl.h:43 #2 art::ConditionVariable::WaitHoldingLocks (this=0x723961df58, self=0x7239614c00) at art/runtime/base/mutex.cc:953 #3 0x0000007238d6241c in art::Monitor::Lock<(art::LockReason)1> (this=0x723961df20, self=0x7239614c00) at art/runtime/monitor.cc:559 #4 0x0000007238d687a8 in art::Monitor::MonitorEnter (self=0x7239614c00, obj=<optimized out>, trylock=<optimized out>) at art/runtime/monitor.cc:1462 #5 0x0000007238ec40fc in art::mirror::Object::MonitorEnter (this=0x62 <_DYNAMIC+98>, self=0x7239614c00) at art/runtime/mirror/object-inl.h:98 #6 artLockObjectFromCode (obj=0x62 <_DYNAMIC+98>, self=0x7239614c00) at art/runtime/entrypoints/quick/quick_lock_entrypoints.cc:32 #7 0x0000007238f04e9c in art_quick_lock_object () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:1510 #8 0x00000072207cf6a4 in ?? () Backtrace stopped: not enough registers or memory available to unwind further /home/pip/Documents/gdb_native_tools/gdb-script/init.gdb: No such file or directory. /home/pip/Documents/gdb_native_tools/shadow/gdb_driver.py: No such file or directory. (gdb) handle SIG35 nostop noprint pass Signal Stop Print Pass to program Description SIG35 No No Yes Real-time event 35 (gdb) handle SIG33 nostop noprint pass Signal Stop Print Pass to program Description SIG33 No No Yes Real-time event 33
tips:handle SIG35 nostop noprint pass 是为了不处理android平台上用于抓native trace的 sig35 信号,减小我们的调试过程意外中断的可能性(只是个习惯, 这个问题中其实不做也没关系)
- 切换到gc线程
-
(gdb) thread 9 [Switching to thread 9 (Thread 1471.1484)] #0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 41 svc #0 (gdb) bt #0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 #1 0x0000007238a80560 in art::futex (uaddr=0x7239723910, op=0, val=93152, timeout=0x7221a89ab0, uaddr2=0x7221a89ab0, val3=0) at art/runtime/base/mutex-inl.h:43 #2 art::ConditionVariable::WaitHoldingLocks (this=0x7239723900, self=0x723966b000) at art/runtime/base/mutex.cc:953 #3 0x0000007238b98d54 in art::gc::Heap::ThreadFlipBegin (this=0x72396b7600, self=0x723966b000) at art/runtime/gc/heap.cc:845 // <<<<<<<<<<<<<<<<<<<<<<< #4 0x0000007238e52204 in art::ThreadList::FlipThreadRoots (this=<optimized out>, thread_flip_visitor=0x7221a89ca0, flip_callback=<optimized out>, collector= 0x7239716500, pause_listener=<optimized out>) at art/runtime/thread_list.cc:591 #5 0x0000007238b65cc4 in art::gc::collector::ConcurrentCopying::FlipThreadRoots (this=0x7239716500) at art/runtime/gc/collector/concurrent_copying.cc:620 #6 0x0000007238b64c50 in art::gc::collector::ConcurrentCopying::RunPhases (this=0x7239716500) at art/runtime/gc/collector/concurrent_copying.cc:178 #7 0x0000007238b7af80 in art::gc::collector::GarbageCollector::Run (this=0x7239716500, gc_cause=<optimized out>, clear_soft_references=<optimized out>) at art/runtime/gc/collector/garbage_collector.cc:96 #8 0x0000007238b9d93c in art::gc::Heap::CollectGarbageInternal (this=0x72396b7600, gc_type=art::gc::collector::kGcTypeFull, gc_cause=art::gc::kGcCauseBackground, clear_soft_references=false) at art/runtime/gc/heap.cc:2616 #9 0x0000007238baeeb8 in art::gc::Heap::ConcurrentGC (this=0x72396b7600, self=<optimized out>, cause=art::gc::kGcCauseBackground, force_full=false) at art/runtime/gc/heap.cc:3621 #10 0x0000007238bb4524 in art::gc::Heap::ConcurrentGCTask::Run (this=<optimized out>, self=0x0 <_DYNAMIC>) at art/runtime/gc/heap.cc:3582 #11 0x0000007238bd692c in art::gc::TaskProcessor::RunAllTasks (this=0x72396bb100, self=0x723966b000) at art/runtime/gc/task_processor.cc:129 #12 0x0000000072a7e3f0 in dalvik.system.VMRuntime.clampGrowthLimit [DEDUPED] () from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-libart.oat #13 0x0000000072c20bd8 in java.lang.Daemons$HeapTaskDaemon.runInternal () from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-libart.oat #14 0x0000000072a80c40 in java.lang.Daemons$Daemon.run () from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-libart.oat #15 0x0000000071baa4dc in java.lang.Thread.run () from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-core-oj.oat #16 0x0000007238f0498c in art_quick_invoke_stub () at art/runtime/arch/arm64/quick_entrypoints_arm64.S:1702 #17 0x0000007238a78524 in art::ArtMethod::Invoke (this=0x6ff68ad8, self=0x723966b000, args=<optimized out>, args_size=<optimized out>, result=0x7221a8a310, shorty=0x7269daa9 "V") at art/runtime/art_method.cc:374 #18 0x0000007238e07258 in art::(anonymous namespace)::InvokeWithArgArray (soa=..., method=0x6ff68ad8, arg_array=0x7221a8a318, result=<optimized out>, shorty=0x7269daa9 "V") at art/runtime/reflection.cc:456 #19 0x0000007238e08314 in art::InvokeVirtualOrInterfaceWithJValues (soa=..., obj=0x5 <_DYNAMIC+5>, mid=<optimized out>, args=0x0 <_DYNAMIC>) at art/runtime/reflection.cc:548 #20 0x0000007238e36764 in art::Thread::CreateCallback (arg=0x723966b000) at art/runtime/thread.cc:473 #21 0x00000072bb501db0 in __pthread_start (arg=0x7221a8a4f0) at bionic/libc/bionic/pthread_create.cpp:254 #22 0x00000072bb4a378c in __start_thread (fn=0x72bb501d88 <__pthread_start(void*)>, arg=0x7221a8a4f0) at bionic/libc/bionic/clone.cpp:52
结合代码: art/runtime/gc/heap.cc#845
void Heap::ThreadFlipBegin(Thread* self) { // Supposed to be called by GC. Set thread_flip_running_ to be true. If disable_thread_flip_count_ // > 0, block. Otherwise, go ahead. CHECK(kUseReadBarrier); ScopedThreadStateChange tsc(self, kWaitingForGcThreadFlip); MutexLock mu(self, *thread_flip_lock_); bool has_waited = false; uint64_t wait_start = NanoTime(); CHECK(!thread_flip_running_); // Set this to true before waiting so that frequent JNI critical enter/exits won't starve // GC. This like a writer preference of a reader-writer lock. thread_flip_running_ = true; while (disable_thread_flip_count_ > 0) { // <<<<<<<<<<<< has_waited = true; thread_flip_cond_->Wait(self); // <<<<<<<<<<<<< } if (has_waited) { uint64_t wait_time = NanoTime() - wait_start; total_wait_time_ += wait_time; if (wait_time > long_pause_log_threshold_) { LOG(INFO) << __FUNCTION__ << " blocked for " << PrettyDuration(wait_time); } } } |
既然在等待一个condition,那么 disable_thread_flip_count_ 肯定是大于0的,我们可以打印一下这个变量的值:
(gdb) print disable_thread_flip_count_ $1 = 1 |
果然它是1,正常情况下,当前线程在thread_flip_cond_这个条件上wait一段时间后,肯定会有一个线程会去notify它,来唤醒处于wait状态的线程;
- 那么,接下来自然是快速过一下源码,看看都有哪些函数会去操作thread_flip_cond_:
void Heap::ThreadFlipEnd(Thread* self) { // Supposed to be called by GC. Set thread_flip_running_ to false and potentially wake up mutators // waiting before doing a JNI critical. CHECK(kUseReadBarrier); MutexLock mu(self, *thread_flip_lock_); CHECK(thread_flip_running_); thread_flip_running_ = false; // Potentially notify mutator threads blocking to enter a JNI critical section. thread_flip_cond_->Broadcast(self); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< } void Heap::DecrementDisableThreadFlip(Thread* self) { // Supposed to be called by mutators. Decrement disable_thread_flip_count_ and potentially wake up // the GC waiting before doing a thread flip. CHECK(kUseReadBarrier); self->DecrementDisableThreadFlipCount(); bool is_outermost = self->GetDisableThreadFlipCount() == 0; if (!is_outermost) { // If this is not an outermost JNI critical exit, we don't need to decrement the global counter. // The global counter is decremented only once for a thread for the outermost exit. return; } MutexLock mu(self, *thread_flip_lock_); CHECK_GT(disable_thread_flip_count_, 0U); --disable_thread_flip_count_; if (disable_thread_flip_count_ == 0) { // Potentially notify the GC thread blocking to begin a thread flip. thread_flip_cond_->Broadcast(self); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< } } |
再看一下各个线程的backtrace,确认了,没有哪个在调用上面两个任意一个
- 那些处于WaitingForGcThreadFlip的线程block在哪?
-
(gdb) t 23 [Switching to thread 23 (Thread 1471.1692)] #0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 41 svc #0 (gdb) bt #0 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 #1 0x0000007238a80560 in art::futex (uaddr=0x7239723910, op=0, val=93152, timeout=0x721cbe2850, uaddr2=0x721cbe2850, val3=0) at art/runtime/base/mutex-inl.h:43 #2 art::ConditionVariable::WaitHoldingLocks (this=0x7239723900, self=0x721d26a000) at art/runtime/base/mutex.cc:953 #3 0x0000007238b98684 in art::gc::Heap::IncrementDisableThreadFlip (this=0x72396b7600, self=0x721d26a000) at art/runtime/gc/heap.cc:798 #4 0x0000007238d2cfe8 in art::JNI::GetStringCritical (env=<optimized out>, java_string=<optimized out>, is_copy=0x0 <_DYNAMIC>) at art/runtime/jni_internal.cc:1887 #5 0x00000072bae3177c in android::sp<android::IGraphicBufferProducer>::~sp (this=0x0 <_DYNAMIC>) at system/core/libutils/include/utils/StrongPointer.h:156 #6 android::nativeCreateScoped (env=<optimized out>, clazz=<optimized out>, surfaceObject=<optimized out>) at frameworks/base/core/jni/android_view_SurfaceSession.cpp:51 #7 0x00000000748c94ac in android.filterfw.format.ImageFormat.create () from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/framework/arm64/boot-framework.oat #8 0x0000000000000000 in ?? () from /home/pip/Downloads/e8-V10.3.3.0.PECCNXM-gcflip-hang/out/target/product/equuleus/symbols/system/bin/app_process64
art/runtime/gc/heap.cc:798 对应的代码:
void Heap::IncrementDisableThreadFlip(Thread* self) { // Supposed to be called by mutators. If thread_flip_running_ is true, block. Otherwise, go ahead. CHECK(kUseReadBarrier); bool is_nested = self->GetDisableThreadFlipCount() > 0; self->IncrementDisableThreadFlipCount(); if (is_nested) { // If this is a nested JNI critical section enter, we don't need to wait or increment the global // counter. The global counter is incremented only once for a thread for the outermost enter. return; } ScopedThreadStateChange tsc(self, kWaitingForGcThreadFlip); // <<<<<<<<<<<<<<<<<<<<<<< 改变线程状态 MutexLock mu(self, *thread_flip_lock_); bool has_waited = false; uint64_t wait_start = NanoTime(); if (thread_flip_running_) { ScopedTrace trace("IncrementDisableThreadFlip"); while (thread_flip_running_) { has_waited = true; thread_flip_cond_->Wait(self); // <<<<<<<<<<<<<<<<<<<< } } ++disable_thread_flip_count_; if (has_waited) { uint64_t wait_time = NanoTime() - wait_start; total_wait_time_ += wait_time; if (wait_time > long_pause_log_threshold_) { LOG(INFO) << __FUNCTION__ << " blocked for " << PrettyDuration(wait_time); } } } |
所以到这儿,我们清楚的一点是,肯定是由于某个线程没有正常调用 DecrementDisableThreadFlip() 函数,导致 disable_thread_flip_count_ 一直不为0;
- disable_thread_flip_count_ 一直不为 0 这个锅应该是谁的?
仔细看一下上面这个函数的代码,不仅全局有个disable_thread_flip_count_, 同时每个线程还有thread local的 disable_thread_flip_count,每次调用IncrementDisableThreadFlip()开始的时候,都会先执行Thread::Current()->IncrementDisableThreadFlipCount(),看下它的代码:
-
// art/runtime/thread.h uint32_t GetDisableThreadFlipCount() const { CHECK(kUseReadBarrier); return tls32_.disable_thread_flip_count; }
YEAH!下面只需要逐个遍历art里面每个线程,把 tls32_.disable_thread_flip_count > 1的部分打印出来:
稍微修改一下gdb脚本下的 art.gdb脚本里的art_dump_thread_list_base函数:
define art_dump_thread_list art_dump_thread_list_base 0 end define art_dump_thread_list_base set $dump_mutex = $arg0 # 第一个节点 set $first = ('art::Runtime'::instance_)->thread_list_->list_->__end_->__next_ printf "\n Id Thread Id art::Thread* disable_flip_count STATE FLAGS NAME SUSPEND_COUNT MUTEX\n" set $list_node = 0 set $count = 1 while $list_node != $first if $list_node == 0 set $list_node = $first end set $thread = ('art::Thread'*)(*(void**)((uint64_t)$list_node + 2*sizeof(void*))) set $flip_count = (('art::Thread'*)$thread)->tls32_->disable_thread_flip_count if $flip_count != 0 # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 1 set $thread_id = (('art::Thread'*)$thread)->tls32_->tid #printf "monitor_enter_object: 0x%lx", $thread->tlsPtr_.monitor_enter_object printf "%4d %6d %p ", $count, $thread_id, $thread printf " %d ", $flip_count # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 2 thread_print_state $thread printf " " thread_print_name $thread printf " " thread_print_suspend_count $thread if $dump_mutex == 1 thread_print_mutex $thread end printf "\n" end set $list_node = $list_node->__next_ set $count = $count + 1 if $list_node == $first loop_break end end end |
然后重新导入脚本然后执行这个函数即可:
(gdb) source /home/pip/Documents/gdb_native_tools/art.gdb (gdb) art_dump_thread_list Id Thread Id art::Thread* disable_flip_count STATE NAME SUSPEND_COUNT 4 1479 0x723973c800 1 kWaitingForGcThreadFlip Binder:filter-perf-event 0 23 1692 0x721d26a000 1 kWaitingForGcThreadFlip PowerManagerService 0 31 1825 0x722b7ab800 1 kWaitingForGcThreadFlip HwBinder:1471_1 0 52 1884 0x721c426c00 1 kWaitingForGcThreadFlip WifiStateMachine 0 88 1938 0x721bc96800 1 kWaitingForGcThreadFlip Binder:1471_3 0 98 2125 0x721c54d000 1 kWaitingForGcThreadFlip Binder:1471_5 0 100 2166 0x721a129800 1 kWaitingForGcThreadFlip Binder:1471_7 0 113 2894 0x7219cc1000 1 kWaitingForGcThreadFlip Binder:1471_A 0 116 2949 0x721273ec00 1 kWaitingForGcThreadFlip Binder:1471_D 0 117 2950 0x7219cc0400 1 kWaitingForGcThreadFlip Binder:1471_E 0 120 3089 0x720b025000 1 kWaitingForGcThreadFlip Binder:1471_11 0 126 3135 0x720c454800 1 kWaitingForGcThreadFlip Binder:1471_17 0 127 3140 0x721929cc00 1 kWaitingForGcThreadFlip Binder:1471_18 0 128 3142 0x720eb42000 1 kWaitingForGcThreadFlip Binder:1471_19 0 132 8981 0x7219d25800 1 kWaitingForGcThreadFlip HwBinder:1471_4 0 142 18045 0x72126b5800 1 kWaitingForGcThreadFlip Binder:1471_20 0 181 4738 0x7211c94400 1 kNative AsyncTask #36709 0 |
最后那个 AsyncTask #36709 十分可疑,别的线程的disable_flip_count 为1的原因是正在调用 IncrementDisableThreadFlip()函数,OK,接着看下它在执行什么函数:
- 嫌疑人 “AsyncTask #36709” 干了些什么?
-
"AsyncTask #36709" prio=5 tid=180 Native | group="main" sCount=1 dsCount=0 flags=1 obj=0x14d149c8 self=0x7211c94400 | sysTid=4738 nice=0 cgrp=default sched=0/0 handle=0x708411b4f0 | state=S schedstat=( 3931924 16620626 18 ) utm=0 stm=0 core=7 HZ=100 | stack=0x7084018000-0x708401a000 stackSize=1041KB | held mutexes= kernel: __switch_to+0xa4/0xfc kernel: binder_thread_read+0xa78/0x11a0 kernel: binder_ioctl+0x920/0xb08 kernel: do_vfs_ioctl+0xb8/0x8d8 kernel: SyS_ioctl+0x84/0x98 kernel: __sys_trace+0x4c/0x4c native: #00 pc 000000000006e5bc /system/lib64/libc.so (__ioctl+4) native: #01 pc 0000000000029544 /system/lib64/libc.so (ioctl+136) native: #02 pc 000000000001e22c /system/lib64/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+240) native: #03 pc 000000000001e884 /system/lib64/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+324) native: #04 pc 0000000000012204 /system/lib64/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+312) native: #05 pc 00000000000c9764 /system/lib64/android.hardware.gnss@1.0.so (android::hardware::gnss::V1_0::BpHwGnssXtra::_hidl_injectXtraData(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, android::hardware::hidl_string const&)+244) native: #06 pc 000000000005d5c4 /system/lib64/libandroid_servers.so (android::android_location_GnssLocationProvider_inject_xtra_data(_JNIEnv*, _jobject*, _jbyteArray*, int)+268) at com.android.server.location.GnssLocationProvider.native_inject_xtra_data(Native method) at com.android.server.location.GnssLocationProvider.access$2300(GnssLocationProvider.java:112) at com.android.server.location.GnssLocationProvider$10.run(GnssLocationProvider.java:1140) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) at java.lang.Thread.run(Thread.java:764)
它在一个jni里执行一个hw binder call,结合下代码:
// frameworks/base/services/core/jni/com_android_server_location_GnssLocationProvider.cpp#1613 static void android_location_GnssLocationProvider_inject_xtra_data(JNIEnv* env, jobject /* obj */, jbyteArray data, jint length) { if (gnssXtraIface == nullptr) { ALOGE("XTRA Interface not supported"); return; } jbyte* bytes = reinterpret_cast<jbyte *>(env->GetPrimitiveArrayCritical(data, 0)); gnssXtraIface->injectXtraData(std::string((const char*)bytes, length)); env->ReleasePrimitiveArrayCritical(data, bytes, JNI_ABORT); } |
(不熟悉art实现)一眼看过去,都是正常的jni调用,好像跟disable_flip_count没啥关系呀?全局搜一下DecrementDisableThreadFlip()和IncrementDisableThreadFlip被调用的地方:
ReleaseStringCritical() → DecrementDisableThreadFlip() ReleasePrimitiveArray() → DecrementDisableThreadFlip() GetPrimitiveArrayCritical() → IncrementDisableThreadFlip() GetStringCritical() → IncrementDisableThreadFlip() |
到这个地方,稍稍消化总结一下,不难推断出:
这个问题必然是因为 AsyncTask #36709 在执行jni调用过程中,它执行了:
GetPrimitiveArrayCritical() → IncrementDisableThreadFlip() → 使得全局变量 disable_thread_flip_count_ = 1
但是在继续执行 gnssXtraIface →injectXtraData() 这个 hal 接口的时候,一直被block,自然无法执行 ReleasePrimitiveArrayCritical() → DecrementDisableThreadFlip() → disable_thread_flip_count_ = 0
4. gnssXtraIface→injectXtraData() 为什么会一直被block?
查看binder transactin日志
$ adb shell cat /sys/kernel/debug/binder/transactions proc 1471 context hwbinder thread 1673: l 00 need_return 0 tr 0 incoming transaction 979513688: ffffffed66132800 from 1007:1007 to 1471:1673 code 2 flags 10 pri 0:120 r1 node 978962390 size 48:0 data ffffff8014f00258 outgoing transaction 979513687: ffffffed66133e00 from 1471:1673 to 1007:1007 code 3 flags 10 pri 0:120 r1 thread 8981: l 01 need_return 0 tr 0 incoming transaction 979513990: ffffffed2541c700 from 1007:2040 to 1471:8981 code 4 flags 10 pri 0:120 r1 node 978962390 size 132:16 data ffffff8014f00508 thread 4738: l 00 need_return 0 tr 0 outgoing transaction 979513850: ffffffee6324a400 from 1471:4738 to 1007:0 code 2 flags 10 pri 0:120 r1 thread 8981: l 01 need_return 0 tr 0 incoming transaction 979513990: ffffffed2541c700 from 1007:2040 to 1471:8981 code 4 flags 10 pri 0:120 r1 node 978962390 size 132:16 data ffffff8014f00508 proc 1007 context hwbinder8981 thread 1007: l 12 need_return 0 tr 0 outgoing transaction 979513688: ffffffed66132800 from 1007:1007 to 1471:1673 code 2 flags 10 pri 0:120 r1 incoming transaction 979513687: ffffffed66133e00 from 1471:1673 to 1007:1007 code 3 flags 10 pri 0:120 r1 node 956 size 36:0 data ffffff8012000008 thread 2040: l 10 need_return 0 tr 0 outgoing transaction 979513990: ffffffed2541c700 from 1007:2040 to 1471:8981 code 4 flags 10 pri 0:120 r1 pending transaction 979513850: ffffffee6324a400 from 1471:4738 to 1007:0 code 2 flags 10 pri 0:120 r1 node 5700 size 120:16 data ffffff8012000030 |
incoming transaction from 1471:4738 to 1007:0,得到对端进程pid为1007, 继续看下这个进程里各个线程的状态
$ adb shell ps -T 1007 USER PID TID PPID VSZ RSS WCHAN ADDR S CMD system 1007 1007 1 29016 156 binder_thread_read 0 S ignss_1_1 system 1007 2040 1 29016 156 binder_thread_read 0 S IPC Thread system 1007 17865 1 29016 156 poll_schedule_timeout 0 S ignss_1_1 |
各个线程看起来都是正常的,通过debuggerd查看backtrace
$ adb shell debuggerd -b 1007 ----- pid 1007 at 2019-05-05 13:25:52 ----- Cmd line: /vendor/bin/ignss_1_1 ABI: 'arm64' "ignss_1_1" sysTid=1007 #00 pc 000000000006e5bc /system/lib64/libc.so (__ioctl+4) #01 pc 0000000000029544 /system/lib64/libc.so (ioctl+136) #02 pc 000000000001e22c /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+240) #03 pc 000000000001e77c /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+60) #04 pc 0000000000012204 /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+312) #05 pc 0000000000097ddc /system/lib64/android.hardware.gnss@1.0.so (android::hardware::gnss::V1_0::BpHwGnssCallback::_hidl_gnssStatusCb(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, android::hardware::gnss::V1_0::IGnssCallback::GnssStatusValue)+208) #06 pc 000000000001ccf0 /vendor/lib64/android.hardware.gnss@1.1-impl-xiaomi.so (android::hardware::gnss::V1_0::implementation::Gnss::statusCb(GpsStatus*)+68) #07 pc 000000000001b5b4 /vendor/lib64/hw/gps.brcm.so (gps_stop()+156) #08 pc 0000000000013ae0 /vendor/lib64/hw/gps.brcm.so (proxy__gps_stop()+168) #09 pc 000000000001e1d4 /vendor/lib64/android.hardware.gnss@1.1-impl-xiaomi.so (android::hardware::gnss::V1_0::implementation::Gnss::stop()+28) #10 pc 0000000000088328 /system/lib64/android.hardware.gnss@1.0.so (android::hardware::gnss::V1_0::BnHwGnss::_hidl_stop(android::hidl::base::V1_0::BnHwBase*, android::hardware::Parcel const&, android::hardware::Parcel*, std::__1::function<void (android::hardware::Parcel&)>)+152) #11 pc 000000000002e324 /system/lib64/android.hardware.gnss@1.1.so (android::hardware::gnss::V1_1::BnHwGnss::onTransact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+1624) #12 pc 000000000001df34 /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::BHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+72) #13 pc 000000000001508c /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::executeCommand(int)+1508) #14 pc 0000000000014938 /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::getAndExecuteCommand()+204) #15 pc 0000000000015708 /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::joinThreadPool(bool)+268) #16 pc 0000000000000cb8 /vendor/bin/ignss_1_1 (main+468) #17 pc 00000000000acec0 /system/lib64/libc.so (__libc_init+88) "IPC Thread" sysTid=2040 #00 pc 000000000006e5bc /system/lib64/libc.so (__ioctl+4) #01 pc 0000000000029544 /system/lib64/libc.so (ioctl+136) #02 pc 000000000001e22c /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+240) #03 pc 000000000001e884 /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+324) #04 pc 0000000000012204 /system/lib64/vndk-sp-28/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+312) #05 pc 0000000000098294 /system/lib64/android.hardware.gnss@1.0.so (android::hardware::gnss::V1_0::BpHwGnssCallback::_hidl_gnssNmeaCb(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, long, android::hardware::hidl_string const&)+272) #06 pc 000000000001d0b4 /vendor/lib64/android.hardware.gnss@1.1-impl-xiaomi.so (android::hardware::gnss::V1_0::implementation::Gnss::nmeaCb(long, char const*, int)+112) #07 pc 00000000000136d8 /vendor/lib64/hw/gps.brcm.so (proxy__gps_nmea_cb(long, char const*, int)+180) #08 pc 0000000000018c88 /vendor/lib64/hw/gps.brcm.so (GpsiClient::marshal_gps_nmea_cb(IpcIncomingMessage&)+300) #09 pc 00000000000252a4 /vendor/lib64/hw/gps.brcm.so (IpcPipeTransportBase::OnSelect(int, bool, bool, bool, void*)+344) #10 pc 0000000000025a28 /vendor/lib64/hw/gps.brcm.so (SelectManager::PerformOneWaitAndProcess()+468) #11 pc 0000000000017860 /vendor/lib64/hw/gps.brcm.so (ipc_thread_proc(void*)+56) #12 pc 0000000000023160 /vendor/lib64/android.hardware.gnss@1.1-impl-xiaomi.so (threadFunc(void*)+12) #13 pc 0000000000081dac /system/lib64/libc.so (__pthread_start(void*)+36) #14 pc 0000000000023788 /system/lib64/libc.so (__start_thread+68) "ignss_1_1" sysTid=17865 #00 pc 000000000006e604 /system/lib64/libc.so (__pselect6+4) #01 pc 000000000002bbf8 /system/lib64/libc.so (select+144) #02 pc 000000000001974c /vendor/lib64/hw/flp.brcm.so (SelectManager::PerformOneWaitAndProcess()+268) #03 pc 000000000000e8e8 /vendor/lib64/hw/flp.brcm.so (ipc_thread_proc(void*)+116) #04 pc 0000000000081dac /system/lib64/libc.so (__pthread_start(void*)+36) #05 pc 0000000000023788 /system/lib64/libc.so (__start_thread+68) ----- end 1007 ----- |
结合上面的binder call日志,可以看到ignss_1_1进程内部的两个binder线程都在回调system_server进程的binder 接口,没有空闲binder线程可以处理来自system_server的gnssXtraIface→injectXtraData() 请求!
incoming 1471:1673
outgoing 1007:1007 to 1471:1673 // blocked !
outgoing 1007:2040 to 1471:8981 // WaitingForGcThreadFlip !
p.s. 1471:1673 并不是一个binder线程, 但确响应了来自1007:1007的binder call,这是binder api设计上面的优化,复用? todo:确认
"android.bg" prio=5 tid=13 Blocked | group="main" sCount=1 dsCount=0 flags=1 obj=0x13fc3428 self=0x7232239400 | sysTid=1673 nice=0 cgrp=default sched=0/0 handle=0x721da8e4f0 | state=S schedstat=( 19417818968810 14397831932031 20757829 ) utm=561393 stm=1380388 core=5 HZ=100 | stack=0x721d98b000-0x721d98d000 stackSize=1041KB | held mutexes= at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:22705) - waiting to lock <0x00c6c1f5> (a com.android.server.am.ActivityManagerService) held by thread 146 at android.app.ContextImpl.sendBroadcastAsUser(ContextImpl.java:1195) at com.android.server.location.GnssLocationProvider.reportStatus(GnssLocationProvider.java:1766) at com.android.server.location.GnssLocationProvider.native_stop(Native method) at com.android.server.location.GnssLocationProvider.stopNavigating(GnssLocationProvider.java:1607) at com.android.server.location.GnssLocationProvider.updateRequirements(GnssLocationProvider.java:1405) at com.android.server.location.GnssLocationProvider.handleSetRequest(GnssLocationProvider.java:1345) at com.android.server.location.GnssLocationProvider.access$4100(GnssLocationProvider.java:112) at com.android.server.location.GnssLocationProvider$ProviderHandler.handleMessage(GnssLocationProvider.java:2400) at android.os.Handler.dispatchMessage(Handler.java:106) at android.os.Looper.loop(Looper.java:207) at android.os.HandlerThread.run(HandlerThread.java:65) |
5. 修复方案
所以这个问题也是system_server↔ ignss_1_1 这两个进程相互binder call + jni call + 虚拟机正好在执行垃圾回收;
参考 oracle的文档:
After calling GetPrimitiveArrayCritical, the native code should not run for an extended period of time before it calls ReleasePrimitiveArrayCritical. We must treat the code inside this pair of functions as running in a “critical region.” Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread. (For example, the current thread must not call read on a stream being written by another Java thread.)
These restrictions make it more likely that the native code will obtain an uncopied version of the array, even if the VM does not support pinning. For example, a VM may temporarily disable garbage collection when the native code is holding a pointer to an array obtained via GetPrimitiveArrayCritical.
GetPrimitiveArrayCritical 和 ReleasePrimitiveArrayCritical 两个函数间的代码执行耗时不应太久! 结合此处的业务逻辑,而且也不会被太频繁的调用,拷贝一份原始数组给binder call作为参数即可:
static void android_location_GnssLocationProvider_inject_xtra_data(JNIEnv* env, jobject /* obj */, jbyteArray data, jint length) { if (gnssXtraIface == nullptr) { ALOGE("XTRA Interface not supported"); return; } -- jbyte* bytes = reinterpret_cast<jbyte *>(env->GetPrimitiveArrayCritical(data, 0)); ++ jbyte* bytes = reinterpret_cast<jbyte *>(env->GetByteArrayElements(data, 0)); gnssXtraIface->injectXtraData(std::string((const char*)bytes, length)); -- env->ReleasePrimitiveArrayCritical(data, bytes, JNI_ABORT); ++ env->ReleaseByteArrayElements(data, bytes, JNI_ABORT); }
更多推荐
所有评论(0)