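Problem description:
A server was reported as down. Testing showed that both ssh and ping failed, and an attempt to connect through iLO also failed (the web page would not open).
While I was preparing to reboot the server with the ipmitool command, it recovered on its own and ssh login worked again. It turned out the server had rebooted itself a few minutes earlier.
To restore service as quickly as possible, I first started the application services. I then checked the system logs and hardware information and found nothing abnormal.
Based on past experience, I checked the operating system version, which was RHEL 6.1 x86_64, and suspected a kernel bug.
Analyzing the kernel dump finally showed that the system crash occurred in swapper. swapper is the kernel's idle task, the first process on a Linux system (pid=0).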
Solution:
The problem was resolved by upgrading the kernel to the RHEL 6.5 kernel version.
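After an upgrade like this it can be useful to confirm that a host is no longer on the old kernel. The following is a minimal sketch, not part of the original procedure: kernel_older_than is a hypothetical helper, it relies on GNU sort -V, and the release string 2.6.32-431.el6.x86_64 for the RHEL 6.5 kernel is an assumption that does not appear in the original text.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when release $1 sorts strictly before $2.
# Relies on GNU sort's version ordering (-V).
kernel_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Assumed release string for the RHEL 6.5 kernel (not from the original text).
target="2.6.32-431.el6.x86_64"
# Kernel release recorded in the crash output of this incident.
running="2.6.32-131.0.15.el6.x86_64"

if kernel_older_than "$running" "$target"; then
    echo "kernel $running is older than $target"
else
    echo "kernel $running is at or above $target"
fi
```

On a live host, running="$(uname -r)" would replace the fixed string.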
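How to analyze a kernel failure with a kernel dump:
1. Install the required packages
Install the following packages, matching the kernel version that crashed:
crash-trace-command-1.0-3.el6.x86_64
crash-6.1.0-5.el6.x86_64
kernel-debuginfo-2.6.32-131.0.15.el6.x86_64
kernel-debuginfo-common-x86_64-2.6.32-131.0.15.el6.x86_64
These packages can be downloaded from http://debuginfo.centos.org/6/x86_64/.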
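The two debuginfo package names follow directly from the kernel release string, so they can be generated rather than typed by hand. A minimal sketch, assuming the naming pattern seen in the list above; debuginfo_packages is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel debuginfo package names for a
# given kernel release string (the format printed by `uname -r`).
debuginfo_packages() {
    rel="$1"             # e.g. 2.6.32-131.0.15.el6.x86_64
    arch="${rel##*.}"    # text after the last dot, e.g. x86_64
    printf '%s\n' \
        "kernel-debuginfo-${rel}" \
        "kernel-debuginfo-common-${arch}-${rel}"
}

# Release of the kernel that produced the dump in this article.
debuginfo_packages "2.6.32-131.0.15.el6.x86_64"
```

If the dump was taken on the kernel that is still running, debuginfo_packages "$(uname -r)" gives the same list. The crash and crash-trace-command packages are versioned independently of the kernel, so they are not derived here.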
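2. Analysis
# Change to the directory that holds the kernel crash dump
# Run: crash /usr/lib/debug/lib/modules/2.6.32-xxx (the matching kernel version)/vmlinux ./vmcore
# In the output, the PANIC field usually indicates whether a kernel bug is involved, and the COMMAND field shows which process was running when the crash was triggered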
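The crash command line depends only on the kernel release and the dump directory, so it can be assembled by a small script. This is a sketch under the path layout shown above; vmlinux_path and crash_command are hypothetical helper names, and the script only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: path of the debuginfo vmlinux for a kernel release.
vmlinux_path() {
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}

# Hypothetical helper: print the crash invocation.
# $1 = kernel release, $2 = directory containing the vmcore file
crash_command() {
    printf 'crash %s %s/vmcore\n' "$(vmlinux_path "$1")" "$2"
}

# Values taken from the session in this article.
crash_command "2.6.32-131.0.15.el6.x86_64" \
    "/var/crash/127.0.0.1-2014-09-16-10:20:28"
```

Printing the command first makes it easy to check that the vmlinux path matches the kernel that produced the dump before crash is actually started.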
[root@17173.com 127.0.0.1-2014-09-16-10:20:28]# cd /var/crash/127.0.0.1-2014-09-16-10:20:28
[root@17173.com 127.0.0.1-2014-09-16-10:20:28]# crash /usr/lib/debug/lib/modules/2.6.32-131.0.15.el6.x86_64/vmlinux ./vmcore

crash 6.1.0-5.el6
Copyright (C) 2002-2012 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel version inconsistency between vmlinux and dumpfile

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-131.0.15.el6.x86_64/vmlinux
    DUMPFILE: ./vmcore [PARTIAL DUMP]
        CPUS: 16
        DATE: Tue Sep 16 10:19:24 2014
      UPTIME: 273 days, 22:14:50
LOAD AVERAGE: 0.11, 0.30, 0.43
       TASKS: 2335
    NODENAME: myhost.17173ops.com
     RELEASE: 2.6.32-131.0.15.el6.x86_64
     VERSION: #1 SMP Tue May 10 15:42:40 EDT 2011
     MACHINE: x86_64 (2400 Mhz)
      MEMORY: 24 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper"
        TASK: ffff88037222ca80 (1 of 16) [THREAD_INFO: ffff88067273e000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

crash> quit
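When several dumps have to be compared, the fields discussed above (PANIC, PID, COMMAND) can be pulled out of a saved copy of the crash banner. A minimal sketch, assuming the banner text has been saved or pasted as plain text; crash_field is a hypothetical helper, and the sample lines are copied from the session above.

```shell
#!/bin/sh
# Hypothetical helper: read a crash banner on stdin and print the value
# of the field named in $1 (for example PANIC, PID or COMMAND).
crash_field() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)            # drop the right-alignment padding
            if (index(line, key ": ") == 1) {   # field label at start of line
                print substr(line, length(key) + 3)
                exit
            }
        }'
}

# Sample banner lines copied from the session in this article.
banner='       PANIC: ""
         PID: 0
     COMMAND: "swapper"
         CPU: 2'

printf '%s\n' "$banner" | crash_field COMMAND
printf '%s\n' "$banner" | crash_field PID
```

The label has to match at the start of the line, so asking for CPU does not pick up the CPUS line of a full banner.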
When reposting, please credit: 爱开源 » Analyzing kernel failures with a kernel dump