Intel CPU Bug Causes Reboot to Fail - 爱开源

I ran into this bug last November and should have written it up long ago, because the damage it caused went far beyond what I had imagined.

At the time, the symptom was that some machines failed to come back up after a reboot, and /var/log/messages contained entries like these:

Nov 15 03:46:09 kernel: INFO: task sh:7684 blocked for more than 120 seconds. 
Nov 15 03:46:09 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Nov 15 03:46:11 kernel: Call Trace: 
Nov 15 03:46:11 kernel: [] ? ext4_file_open+0x0/0x130 [ext4] 
Nov 15 03:46:11 kernel: [] schedule_timeout+0x215/0x2e0 
Nov 15 03:46:12 kernel: [] ? nameidata_to_filp+0x54/0x70 
Nov 15 03:46:12 kernel: [] ? cpumask_next_and+0x29/0x50 
Nov 15 03:46:12 kernel: [] wait_for_common+0x123/0x180 
Nov 15 03:46:12 kernel: [] ? default_wake_function+0x0/0x20 
Nov 15 03:46:13 kernel: [] wait_for_completion+0x1d/0x20 
Nov 15 03:46:13 kernel: [] sched_exec+0xdc/0xe0 
Nov 15 03:46:13 kernel: [] do_execve+0xe0/0x2c0 
Nov 15 03:46:13 kernel: [] sys_execve+0x4a/0x80 
Nov 15 03:46:13 kernel: [] stub_execve+0x6a/0xc0
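To check other machines for the same symptom, a small script can scan the log for the hung-task warning shown above. This is only a minimal sketch: the function name `find_hung_tasks` and the regular expression are illustrative, and the pattern assumes the warning keeps the exact wording seen in this excerpt.

```python
import re

# Matches the hung-task warning format seen in the log excerpt above.
# Assumption: the kernel message wording is the same on the machine being checked.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a (task name, pid, seconds) tuple for each hung-task warning."""
    hits = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits
```

It could be called as `find_hung_tasks(open("/var/log/messages"))`, which usually requires root to read the log.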

A quick search showed that this is a known bug. See erratum BT81 in http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf, which I quote below:

BT81. TSC is Not Affected by Warm Reset

Problem : The TSC (Time Stamp Counter MSR 10H) should be cleared on reset. Due to this erratum the TSC is not affected by warm reset.

Implication : The TSC is not cleared by a warm reset. The TSC is cleared by power-on reset as expected. Intel has not observed any functional failures due to this erratum.

Intel states this with full confidence, as if it were harmless.

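In practice, the operating system fails to reboot whenever all three of the following conditions hold:

The operating system is Red Hat Enterprise Linux 6.1 through 6.4 (6.5 and later are not affected).
The CPU belongs to the Intel® Xeon® E5, Intel® Xeon® E5 v2, or Intel® Xeon® E7 v2 family.
The machine has gone roughly 200 days or more without a power cycle (that is, without a hard reset; running reboot remotely from within Linux does not count).

The temporary workaround is to send someone to the data center to cut the power and then power the machine back on.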
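The three conditions can be written as a simple check, which may help when auditing a fleet of machines. This is a rough sketch, not an official test: the function name `reboot_at_risk`, the family labels, and the 200-day threshold are illustrative choices taken from the description in this post. The caller has to supply the time since the last cold power cycle, which plain uptime does not capture because a warm reboot does not count.

```python
# CPU families named in this post. The label strings are an assumption of this
# sketch; mapping a real CPU model string to a family is left to the caller.
AFFECTED_CPU_FAMILIES = {"Xeon E5", "Xeon E5 v2", "Xeon E7 v2"}

# "Roughly 200 days" per the post; treat this as an approximate threshold.
RISK_DAYS = 200


def reboot_at_risk(rhel_version, cpu_family, days_since_power_cycle):
    """Return True when all three conditions described in the post hold.

    rhel_version is a (major, minor) tuple, for example (6, 3).
    days_since_power_cycle counts days since the last cold power-on,
    not since the last warm reboot.
    """
    major, minor = rhel_version
    os_affected = major == 6 and 1 <= minor <= 4
    cpu_affected = cpu_family in AFFECTED_CPU_FAMILIES
    long_running = days_since_power_cycle >= RISK_DAYS
    return os_affected and cpu_affected and long_running
```

For example, `reboot_at_risk((6, 3), "Xeon E5 v2", 250)` reports the machine as at risk, while the same machine on RHEL 6.5 does not.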

For details, see Red Hat's statement: https://access.redhat.com/solutions/433883

If you check the conditions above and find that you are affected, upgrade your kernel as soon as possible.

When reposting, please credit: 爱开源 » Intel CPU Bug Causes Reboot to Fail
