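I ran into this bug last November and should have written it up long ago, because its consequences turned out to be far worse than I had imagined.
The symptom at the time was that some machines failed to come back up after a reboot, and /var/log/messages contained entries like these: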
Nov 15 03:46:09 kernel: INFO: task sh:7684 blocked for more than 120 seconds.
Nov 15 03:46:09 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 03:46:11 kernel: Call Trace:
Nov 15 03:46:11 kernel: [] ? ext4_file_open+0x0/0x130 [ext4]
Nov 15 03:46:11 kernel: [] schedule_timeout+0x215/0x2e0
Nov 15 03:46:12 kernel: [] ? nameidata_to_filp+0x54/0x70
Nov 15 03:46:12 kernel: [] ? cpumask_next_and+0x29/0x50
Nov 15 03:46:12 kernel: [] wait_for_common+0x123/0x180
Nov 15 03:46:12 kernel: [] ? default_wake_function+0x0/0x20
Nov 15 03:46:13 kernel: [] wait_for_completion+0x1d/0x20
Nov 15 03:46:13 kernel: [] sched_exec+0xdc/0xe0
Nov 15 03:46:13 kernel: [] do_execve+0xe0/0x2c0
Nov 15 03:46:13 kernel: [] sys_execve+0x4a/0x80
Nov 15 03:46:13 kernel: [] stub_execve+0x6a/0xc0
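If you want to check a whole fleet for the same symptom, reports like the one above can be picked out of a log mechanically. Here is a minimal Python sketch that only looks for the "blocked for more than N seconds" header line. The function name `find_hung_tasks` is made up for illustration, and the pattern is based solely on the log excerpt shown here, so other kernels may word the message differently.

```python
import re

# Matches the hung-task header line printed by the kernel, for example:
#   kernel: INFO: task sh:7684 blocked for more than 120 seconds.
# The pattern is an assumption derived from the excerpt in this post.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (command, pid, seconds) for each hung-task report."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]
```

To use it, read the log (for example `open("/var/log/messages").read()`, which usually needs root) and pass the text to `find_hung_tasks`. A hit only says that some task hung; whether it is this particular bug still depends on the call trace and the conditions discussed in this post.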
A quick search showed that this is a known bug. See erratum BT81 in http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf , which I quote here:
BT81. TSC is Not Affected by Warm Reset
Problem : The TSC (Time Stamp Counter MSR 10H) should be cleared on reset. Due to this erratum the TSC is not affected by warm reset.
Implication : The TSC is not cleared by a warm reset. The TSC is cleared by power-on reset as expected. Intel has not observed any functional failures due to this erratum.
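It says this with complete confidence, as if nothing were wrong.
In reality, as long as the following three conditions are all met:
- The operating system is Red Hat Enterprise Linux 6.1 to 6.4. (6.5 and later are fine.)
- The CPU belongs to the Intel® Xeon® E5, Intel® Xeon® E5 v2, or Intel® Xeon® E7 v2 family.
- The machine has not been power-cycled for roughly 200 days or more. (This means no hard reset; typing reboot remotely from within Linux does not count.)
the operating system will fail to reboot. The temporary workaround is to send someone to the data center, cut the power, and boot the machine again.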
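The first two conditions can be checked roughly from a script. The Python sketch below makes several assumptions: the release string follows the usual /etc/redhat-release wording, the CPU "model name" in /proc/cpuinfo looks like "Xeon(R) CPU E5-2620 0" or "E7-4890 v2", and an E5 with no "vN" suffix is a first-generation part. All function names are made up for illustration. The third condition cannot be read from uptime, because a warm reboot resets uptime but, per the erratum, not the TSC, so the script does not try to check it.

```python
import re

AFFECTED_RHEL6_MINORS = {1, 2, 3, 4}  # RHEL 6.1 to 6.4, per the list above

# Heuristic for /proc/cpuinfo "model name" strings (an assumption).
CPU_RE = re.compile(r"Xeon.*?\b(?P<family>E5|E7)-\d+\w*(?:\s+v(?P<ver>\d+))?")


def rhel_release_affected(release_text):
    """True if the release string names RHEL 6.1 through 6.4."""
    m = re.search(r"Red Hat Enterprise Linux.*release (\d+)\.(\d+)", release_text)
    if not m:
        return False
    return int(m.group(1)) == 6 and int(m.group(2)) in AFFECTED_RHEL6_MINORS


def cpu_model_affected(model_name):
    """True for Xeon E5 (no suffix or v2) and Xeon E7 v2 model names."""
    m = CPU_RE.search(model_name)
    if not m:
        return False
    version = int(m.group("ver")) if m.group("ver") else 1
    if m.group("family") == "E5":
        return version in (1, 2)
    return version == 2  # E7: only v2 is listed above


def read_text(path):
    """Return the file contents, or an empty string if it cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


if __name__ == "__main__":
    os_hit = rhel_release_affected(read_text("/etc/redhat-release"))
    cpu_hit = any(
        cpu_model_affected(line)
        for line in read_text("/proc/cpuinfo").splitlines()
        if line.startswith("model name")
    )
    print("OS in affected range: ", os_hit)
    print("CPU in affected range:", cpu_hit)
    # Time since the last power cycle has to come from elsewhere,
    # for example the machine-room or out-of-band management records.
```

If both lines print True, treat the machine as a candidate and confirm how long it has been since the last real power cycle.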
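For details, see Red Hat's statement: https://access.redhat.com/solutions/433883
If you compare the conditions above and find that you are affected, upgrade your kernel as soon as you can.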
When reposting, please credit: 爱开源 » Intel CPU bug causes reboot to fail