Linux: Troubleshooting an unexpected automatic reboot of a production server
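Background: one of our company's HP servers recently rebooted on its own without anyone asking it to, so I tried to track down the cause.
Troubleshooting steps:
1. Log in to the machine and run last or uptime to find out when it rebooted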
$ last | grep reboot
reboot system boot 3.10.0-1160.24.1 Mon Oct 11 19:19 - 10:49 (15:30)
reboot system boot 3.10.0-1160.24.1 Wed Oct 6 14:08 - 10:49 (5+20:41)
reboot system boot 3.10.0-1160.24.1 Mon Oct 4 13:03 - 10:49 (7+21:46)
reboot system boot 3.10.0-1160.24.1 Sun Oct 3 21:39 - 10:49 (8+13:10)
reboot system boot 3.10.0-1160.24.1 Sun Oct 3 09:12 - 10:49 (9+01:37)
reboot system boot 3.10.0-1160.24.1 Sat Sep 25 23:13 - 10:49 (16+11:36)
$ uptime
10:53:27 up 15:34, 1 user, load average: 2.43, 1.74, 1.43
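To tell a crash apart from an ordinary restart, it can help to check whether each boot record is paired with a shutdown record. The following is a minimal sketch, assuming `last -x` on this system writes `shutdown` entries for clean shutdowns; the function name `unclean_boots` is made up for illustration.

```shell
# Reads `last -x` output (newest first) on stdin and prints every boot record
# that has no shutdown record between it and the previous boot.
# Assumption: a clean shutdown always leaves a "shutdown" line in wtmp.
unclean_boots() {
  awk '
    $1 == "reboot" {
      # A boot record was still waiting for its shutdown record: flag it.
      if (pending != "") print pending
      pending = $0
      next
    }
    # A shutdown record pairs with the newer boot record seen just before it.
    $1 == "shutdown" { pending = "" }
  '
}

# Usage on a live system (not run here):
#   last -x | unclean_boots
```

The oldest boot record in wtmp is never flagged, because the records that preceded it may have been rotated away.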
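2. Check the relevant system logs (dmesg, /var/log/messages, kdump, and so on)
dmesg: kernel ring buffer (kernel messages since the current boot, so it shows nothing from before the reboot)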
$ dmesg | grep -Ei 'error|Fail'
[ 0.000000] tsc: Fast TSC calibration failed
[ 3.120763] pci 0000:12:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
[ 3.178571] pci 0000:5c:00.0: BAR 6: failed to assign [mem size 0x00200000 pref]
[ 3.223819] pci 0000:5d:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
[ 3.240238] pci 0000:5d:00.2: BAR 6: failed to assign [mem size 0x00080000 pref]
[ 3.256824] pci 0000:5d:00.3: BAR 6: failed to assign [mem size 0x00080000 pref]
[ 3.366034] pci 0000:00:14.0: xHCI BIOS handoff failed (BIOS bug ?) 00012201
[ 4.635351] ioapic: probe of 0000:00:05.4 failed with error -22
[ 4.642051] ioapic: probe of 0000:11:05.4 failed with error -22
[ 4.648757] ioapic: probe of 0000:36:05.4 failed with error -22
[ 4.655459] ioapic: probe of 0000:5b:05.4 failed with error -22
[ 4.662176] ioapic: probe of 0000:80:05.4 failed with error -22
[ 4.668874] ioapic: probe of 0000:85:05.4 failed with error -22
[ 4.675576] ioapic: probe of 0000:ae:05.4 failed with error -22
[ 4.682278] ioapic: probe of 0000:d7:05.4 failed with error -22
[ 4.716010] ERST: Error Record Serialization Table (ERST) support is initialized.
[ 6.058884] smartpqi: module verification failed: signature and/or required key missing - tainting kernel
[24726.679793] tsar[94262]: segfault at fffffffffffffff0 ip 00007fd5cddf5dd6 sp 00007fff9aa2c608 error 5 in libc-2.17.so[7fd5cdca0000+1c3000]
[24737.612788] tsar[95267]: segfault at fffffffffffffff0 ip 00007f9205c50dd6 sp 00007ffe55047368 error 5 in libc-2.17.so[7f9205afb000+1c3000]
[24740.345420] tsar[95426]: segfault at fffffffffffffff0 ip 00007f99f20efdd6 sp 00007ffe70032fa8 error 5 in libc-2.17.so[7f99f1f9a000+1c3000]
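Most of the matches above are routine boot-time noise, so a narrower filter may be easier to read. This is only a sketch: the function name `crash_hints` and the keyword list are assumptions chosen for illustration, not a complete catalogue of kernel crash messages.

```shell
# Filters kernel log text on stdin down to lines that commonly accompany
# crashes or hardware faults. The keyword list is illustrative only.
crash_hints() {
  grep -Ei 'kernel panic|oops|machine check|mce:|hardware error|out of memory|soft lockup|hard lockup|hung_task'
}

# Usage on a live system (not run here):
#   dmesg | crash_hints
#   crash_hints < /var/log/messages
```

An empty result does not prove the kernel was healthy; it only means none of these phrases were logged.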
/var/log/messages: system log
$ grep -Ei 'error|Fail' /var/log/messages
Oct 11 19:19:35 kuyun.a01.host kernel: tsc: Fast TSC calibration failed
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:12:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:5c:00.0: BAR 6: failed to assign [mem size 0x00200000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:5d:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:5d:00.2: BAR 6: failed to assign [mem size 0x00080000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:5d:00.3: BAR 6: failed to assign [mem size 0x00080000 pref]
Oct 11 19:19:35 kuyun.a01.host kernel: pci 0000:00:14.0: xHCI BIOS handoff failed (BIOS bug ?) 00012201
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:00:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:11:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:36:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:5b:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:80:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:85:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:ae:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ioapic: probe of 0000:d7:05.4 failed with error -22
Oct 11 19:19:35 kuyun.a01.host kernel: ERST: Error Record Serialization Table (ERST) support is initialized.
Oct 11 19:19:35 kuyun.a01.host kernel: smartpqi: module verification failed: signature and/or required key missing - tainting kernel
Oct 11 19:19:35 kuyun.a01.host systemd[1]: Failed to start Configure CPU turboboost.
Oct 11 19:19:35 kuyun.a01.host systemd[1]: Unit cpunoturbo.service entered failed state.
Oct 11 19:19:35 kuyun.a01.host systemd[1]: cpunoturbo.service failed.
Oct 11 19:19:35 kuyun.a01.host syslog-ng[1144]: [2021-10-11T19:19:35.010289] Error resolving hostname; host='syslog.tbsite.net'
Oct 11 19:19:35 kuyun.a01.host syslog-ng[1144]: [2021-10-11T19:19:35.010373] Initiating connection failed, reconnecting; time_reopen='10'
Oct 11 19:19:39 kuyun.a01.host systemd[1562]: Failed at step EXEC spawning /home/staragent/bin/agent.sh: No such file or directory
Oct 11 19:19:39 kuyun.a01.host systemd[1]: Failed to start StarAgent2.0.
Oct 11 19:19:39 kuyun.a01.host systemd[1]: Unit staragentctl.service entered failed state.
Oct 11 19:19:39 kuyun.a01.host systemd[1]: staragentctl.service failed.
Oct 11 19:21:22 kuyun.a01.host useradd[9397]: failed adding user 'terminal', exit code: 9
Oct 12 02:11:08 kuyun.a01.host kernel: tsar[94262]: segfault at fffffffffffffff0 ip 00007fd5cddf5dd6 sp 00007fff9aa2c608 error 5 in libc-2.17.so[7fd5cdca0000+1c3000]
Oct 12 02:11:19 kuyun.a01.host kernel: tsar[95267]: segfault at fffffffffffffff0 ip 00007f9205c50dd6 sp 00007ffe55047368 error 5 in libc-2.17.so[7f9205afb000+1c3000]
Oct 12 02:11:22 kuyun.a01.host kernel: tsar[95426]: segfault at fffffffffffffff0 ip 00007f99f20efdd6 sp 00007ffe70032fa8 error 5 in libc-2.17.so[7f99f1f9a000+1c3000]
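Because /var/log/messages survives reboots, the lines written just before a boot are often more telling than the boot messages themselves. Here is a small sketch that prints the last few lines logged before each kernel boot banner. It assumes the banner shows up in the file as a `kernel:` line containing `Linux version`; the function name `before_boot` is hypothetical.

```shell
# Prints the N lines (default 20, must be >= 1) that were logged immediately
# before each kernel boot banner in a syslog-format file.
before_boot() {
  file=$1
  n=${2:-20}
  awk -v n="$n" '
    / kernel: .*Linux version / {
      print "=== before boot at: " $1, $2, $3
      # Replay the ring buffer, oldest retained line first.
      start = (count > n) ? count - n : 0
      for (i = start; i < count; i++) print buf[i % n]
      count = 0
      next
    }
    # Keep only the most recent n lines in a ring buffer.
    { buf[count % n] = $0; count++ }
  ' "$file"
}

# Usage on a live system (not run here):
#   before_boot /var/log/messages 30
```

If the log simply stops with no shutdown messages before a banner, that is consistent with a sudden reset rather than an orderly restart.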
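kdump: crash dump logs
The kdump service writes its crash dumps under /var/crash/, but nothing had been generated there at the time. When a dump does exist, it can be searched like this:
$ grep -Ei 'fail|error' /var/crash/<crash-date>/vmcore-dmesg.txt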
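An empty /var/crash/ can also mean that kdump was never armed, so it may be worth checking that before drawing conclusions. The sketch below looks for a `crashkernel=` reservation on the kernel command line and reads /sys/kernel/kexec_crash_loaded. The function name `kdump_ready` and its two optional path arguments are assumptions added so the check can be tried against sample files.

```shell
# Reports whether a crash kernel appears to be reserved and loaded.
# Arguments (optional, for testing): cmdline file, kexec_crash_loaded file.
kdump_ready() {
  cmdline_file=${1:-/proc/cmdline}
  loaded_file=${2:-/sys/kernel/kexec_crash_loaded}
  if ! grep -q 'crashkernel=' "$cmdline_file"; then
    echo "no crashkernel= reservation on the kernel command line"
    return 1
  fi
  if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
    echo "crash kernel is not loaded"
    return 1
  fi
  echo "kdump looks armed"
}

# Usage on a live system (not run here):
#   kdump_ready
```

Even with kdump armed, a hardware-triggered reset may leave no dump, because the kernel may never reach its panic path.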
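One kernel line in the system log mentions ERST: ERST: Error Record Serialization Table (ERST) support is initialized.
Note that this line matches the grep only because the table's name contains the word "Error"; the message itself reports that ERST support was initialized, not that something failed.
For background on the ERST message, see: https://access.redhat.com/solutions/527433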
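3. Log in to the server's out-of-band management console and check its logs
This HP server has an out-of-band management page, so I logged in to it directly. It shows detailed hardware error records, which is very convenient.
On the Integrated Management Log page of the console there was indeed a CPU-class hardware error: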
Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000038, Bank 0x00000003, Status 0xBE000000'00800400, Address 0xFFFFFFFF'81637323, Misc 0xFFFFFFFF'81637323).
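The Status value in that record can be broken down into its flag bits. The following is a rough sketch, assuming the value is a raw x86 machine-check bank status register with the bit layout documented in the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 31:16 model-specific code, bits 15:0 MCA error code). The function name `decode_mci_status` is made up, and whether iLO reports the register unmodified is an assumption.

```shell
# Decodes the top flag bits and error codes of a 64-bit MCi_STATUS value.
# Accepts the iLO style 0xBE000000'00800400 or a plain 0xBE00000000800400.
decode_mci_status() {
  hex=$(printf '%s' "$1" | tr -d "'" | sed 's/^0[xX]//')
  [ "${#hex}" -eq 16 ] || { echo "expected 16 hex digits" >&2; return 1; }
  # Split into two 32-bit halves to stay clear of 64-bit signed overflow.
  hi=$((0x${hex%????????}))   # bits 63..32
  lo=$((0x${hex#????????}))   # bits 31..0
  printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n' \
    $(( (hi >> 31) & 1 )) $(( (hi >> 30) & 1 )) $(( (hi >> 29) & 1 )) \
    $(( (hi >> 28) & 1 )) $(( (hi >> 27) & 1 )) $(( (hi >> 26) & 1 )) \
    $(( (hi >> 25) & 1 ))
  printf 'mca_error_code=0x%04X model_specific_code=0x%04X\n' \
    $(( lo & 0xFFFF )) $(( (lo >> 16) & 0xFFFF ))
}

decode_mci_status "0xBE000000'00800400"
```

Under those assumptions the record has UC=1 and PCC=1, meaning an uncorrected error with processor context corrupt, which is consistent with the log entry calling it uncorrectable.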
The recommended action is:
Update the system firmware. If the issue persists, contact support.
Learn more:
https://techlibrary.hpe.com/docs/enterprise/servers/gen10/ilo5/en/class0x0005code0x0003-gen10.html
Conclusion: this has to go to the server vendor's after-sales support engineers, who can help diagnose and repair the hardware.