How to Troubleshoot Linux Kernel Crashes (2)

Date: 2015-03-06 17:08:48 Author: qipeng Source: 系统之家

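  I then went on to analyze another error that appeared in the kernel log: "BUG: soft lockup - CPU#N stuck for 4278190091s! [qmgr/master:PID]". I have lightly edited the message: the N after CPU# stands for a specific CPU number, which differs from server to server, and the process name and PID inside the square brackets also differ, although the process is always either qmgr or master. The occurrences are summarized below: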

  IP     Log time    Error type and process
  107    13:01:20    1 qmgr, 1 master
  108    14:03:34    2 qmgr, 1 qmgr
  109    14:05:44    2 master, 1 qmgr
  110    14:22:44    1 qmgr
  111    14:19:58    2 master, 1 qmgr
  112    14:17:12    2 master, 1 qmgr
  113    14:22:49    2 master, 1 qmgr
  114    14:19:58    2 master

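  Error type 1 is the error mentioned earlier that does not cause the kernel to hang. Type 2 is the error analyzed here, which leads to a Linux kernel panic. The statistics show that only hosts 107 and 110 did not hang at that time.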
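  A tally like the one above can be produced by scanning the kernel log for the soft lockup message. The following is a minimal sketch, assuming the message body has the form quoted in this article; the sample lines, CPU numbers, and PIDs are made up for illustration, and the log prefix on a real system depends on the syslog configuration.

```python
import re
from collections import Counter

# Matches only the message body quoted in the article. The separator after
# "soft lockup" is matched loosely because it may be "-" or a typographic dash.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup\s+\S\s+CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<proc>[^:\]]+):(?P<pid>\d+)\]"
)

def parse_soft_lockups(lines):
    """Yield (cpu, seconds, process, pid) for every soft lockup message."""
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            yield int(m["cpu"]), int(m["secs"]), m["proc"], int(m["pid"])

# Hypothetical sample lines, not taken from the servers in the article.
sample = [
    "kernel: BUG: soft lockup - CPU#3 stuck for 4278190091s! [qmgr:5492]",
    "kernel: BUG: soft lockup - CPU#1 stuck for 4278190091s! [master:2135]",
    "kernel: some unrelated message",
]

events = list(parse_soft_lockups(sample))
by_process = Counter(proc for _, _, proc, _ in events)   # count per process name
distinct_durations = {secs for _, secs, _, _ in events}  # set of "stuck for" values

print(by_process)
print(distinct_durations)
```

  Collecting the distinct "stuck for" values makes it easy to see whether every host reports the same number.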

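  Continuing with the kernel error logs above, I noticed one striking thing they had in common: the value 4278190091s. First, what the value means: normally, if a CPU goes more than 10s without feeding the watchdog (running the watchdog routine), the kernel reports a soft lockup error and hangs. Here, however, the value was 4278190091s, and it was identical on every host, so it could reasonably be treated as one fixed error. To verify this I searched the Red Hat website for the error message and, to my excitement, found a matching bug (URL: https://access.redhat.com/knowledge/solutions/68466). The Red Hat release and kernel version listed there matched ours (RHEL 6.2 corresponds to CentOS 6.2). The error description and resolution are as follows: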

  Does Red Hat Enterprise Linux 6 or 5 have a reboot problem which is caused by sched_clock() overflow around 208.5 days?

  (Updated 21 Feb 2013, 5:11 AM GMT. KCS Solution content by Marc Milgram.)

  Issue

  •Linux Kernel panics when sched_clock() overflows after an uptime of around 208.5 days.

  •Red Hat Enterprise Linux 6.1 system reboots with sched_clock() overflow after an uptime of around 208.5 days

  •This symptom may happen on the systems using the CPU which has TSC.

  •A process showing BUG: soft lockup - CPU#N stuck for 4278190091s!

  Environment

  •Red Hat Enterprise Linux 6

  ◦Red Hat Enterprise Linux 6.0, 6.1 and 6.2 are affected

  ◦several kernels affected, see below

  ◦TSC clock source - **see root cause

  •Red Hat Enterprise Linux 5

  ◦Red Hat Enterprise Linux 5.3, 5.6, 5.8: please refer to the resolution section for affected kernels

  ◦Red Hat Enterprise Linux 5.0, 5.1, 5.2, 5.4, 5.5, 5.7: all kernels affected

  ◦Red Hat Enterprise Linux 5.9 and later are not affected

  ◦TSC clock source - **see root cause

  •An approximate uptime of around 208.5 days.
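  The environment conditions above (TSC clock source, uptime near 208.5 days) can be checked on a running host. This is a rough sketch, assuming the standard Linux paths /proc/uptime and /sys/devices/system/clocksource/clocksource0/current_clocksource; the function names and the 7-day warning margin are hypothetical choices, not part of the Red Hat solution.

```python
from pathlib import Path

OVERFLOW_DAYS = 208.5  # figure quoted in the Red Hat solution

def uptime_days(proc_uptime_text):
    """Convert the contents of /proc/uptime (first field is seconds) to days."""
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400

def at_risk(clocksource, days, margin_days=7.0):
    """True if the clock source is TSC and uptime is within margin_days of
    the overflow point, or already past it. margin_days is an arbitrary choice."""
    return clocksource.strip() == "tsc" and days >= OVERFLOW_DAYS - margin_days

if __name__ == "__main__":
    cs_path = Path("/sys/devices/system/clocksource/clocksource0/current_clocksource")
    up_path = Path("/proc/uptime")
    if cs_path.exists() and up_path.exists():
        clocksource = cs_path.read_text()
        days = uptime_days(up_path.read_text())
        print(f"clocksource={clocksource.strip()} uptime={days:.1f} days "
              f"at_risk={at_risk(clocksource, days)}")
    else:
        print("clocksource or uptime information not available on this system")
```

  A check like this only flags candidates. Whether a host is actually exposed still depends on the kernel version, as listed in the resolution section.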

  Resolution

  •Red Hat Enterprise Linux 6

  ◦Red Hat Enterprise Linux 6.x: update to kernel-2.6.32-279.el6 (from RHSA-2012-0862) or later. This kernel is already part of RHEL6.3GA. This fix was implemented with (private) bz765720.

  ◦Red Hat Enterprise Linux 6.2: update to kernel-2.6.32-220.4.2.el6 (from RHBA-2012-0124) or later. This fix was implemented with (private) bz781974.

  ◦Red Hat Enterprise Linux 6.1 Extended Update Support: update to kernel-2.6.32-131.26.1.el6 (from RHBA-2012-0424) or later. This fix was implemented with (private) bz795817.

  •Red Hat Enterprise Linux 5

  ◦architecture x86_64/64bit

  ■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.

  ■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.11.1.el5 (from RHSA-2012-1061) or later.

  ■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.

  ■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.

  ◦architecture x86/32bit

  ■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.

  ■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.13.1.el5 (from RHSA-2012-1174) or later.

  ■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.

  ■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.

  Root Cause

  •An insufficiently designed calculation in the CPU accelerator in the previous kernel caused an arithmetic overflow in the sched_clock() function. This overflow led to a kernel panic or any other unpredictable trouble on the systems using the Time Stamp Counter (TSC) clock source.

  •This problem will occur only when system uptime becomes 208.5 days or exceeds 208.5 days.

  •This update corrects the aforementioned calculation so that this arithmetic overflow and kernel panic can no longer occur under these circumstances.

  •On Red Hat Enterprise 5, this problem is a timing issue and very very rare to happen.

  •**Switching to another clocksource is usually not a workaround for most of customers as the TSC is a fast access clock whereas the HPET and PMTimer are both slow access clocks. Using notsc would be a significant performance hit.

  Diagnostic Steps

  Note: This issue could likely happen in numerous locations that deal with time in the kernel. For example, a user running a non-Red Hat kernel had the kernel panic with a soft lockup in __ticket_spin_lock.

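  The information above confirms that this is a Linux kernel bug. Its cause is also described in detail above: on kernel builds for the x86_64 architecture, an uptime of more than 208.5 days causes an overflow.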
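  The 208.5-day figure can be reproduced with a little arithmetic. The sketch below assumes that the overflow comes from a 64-bit multiplication whose result is then shifted right by 10 bits, which is my understanding of how x86 kernels of that era converted TSC cycles to nanoseconds. The Red Hat text quoted above only gives the 208.5-day figure, so treat the shift width as an assumption. The second part simply looks at the reported "stuck for" value.

```python
NS_PER_DAY = 86_400 * 10**9

# Assumption: nanoseconds are computed as (cycles * scale) >> SCALE_SHIFT using
# 64-bit arithmetic, so the intermediate product overflows once the nanosecond
# count reaches 2 ** (64 - SCALE_SHIFT).
SCALE_SHIFT = 10
overflow_ns = 2 ** (64 - SCALE_SHIFT)
overflow_days = overflow_ns / NS_PER_DAY
print(f"overflow after about {overflow_days:.1f} days")  # about 208.5 days

# The value reported in the log is not a plausible stall time.
stuck_seconds = 4278190091
print(hex(stuck_seconds))                                 # 0xff00000b
print(f"{stuck_seconds / (365.25 * 86_400):.1f} years")   # 135.6 years
```

  A stall of more than a century is clearly impossible, which is consistent with the value being the product of broken time arithmetic rather than a real measurement.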

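  Although this confirmed the cause of the kernel panic, I wanted to know whether Taobao's kernel engineers had run into the same problem, so I asked a Taobao kernel engineer I had chatted with before on QQ. It turned out that they had hit the same error, also could not reproduce it, and their solution was likewise to upgrade the kernel.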
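  Since the fix is a kernel upgrade, it helps to check whether a given RHEL 6 or CentOS 6 kernel string (as printed by uname -r) is older than the fixed releases named in the Red Hat solution. This is a simplified sketch: the function names are hypothetical, the comparison is a plain numeric tuple comparison rather than full RPM version ordering, and it assumes CentOS kernels follow the same release numbering.

```python
import re

# First fixed kernel release per update stream, from the Red Hat solution.
FIXED_BY_STREAM = {
    131: (131, 26, 1),  # RHEL 6.1 EUS: kernel-2.6.32-131.26.1.el6
    220: (220, 4, 2),   # RHEL 6.2:     kernel-2.6.32-220.4.2.el6
}
FIXED_DEFAULT = (279,)  # RHEL 6.x:     kernel-2.6.32-279.el6

def release_tuple(kernel):
    """Extract the numeric release from e.g. '2.6.32-220.4.2.el6.x86_64'."""
    m = re.match(r"2\.6\.32-([\d.]+?)\.el6", kernel)
    if not m:
        raise ValueError(f"not a RHEL 6 style kernel string: {kernel!r}")
    return tuple(int(part) for part in m.group(1).split("."))

def is_affected(kernel):
    """True if the kernel is older than the fixed release for its stream."""
    release = release_tuple(kernel)
    fixed = FIXED_BY_STREAM.get(release[0], FIXED_DEFAULT)
    return release < fixed

for name in ("2.6.32-220.el6.x86_64", "2.6.32-220.4.2.el6.x86_64",
             "2.6.32-279.el6.x86_64"):
    print(name, "affected" if is_affected(name) else "fixed")
```

  The first string is the kernel release I would expect on a stock 6.2 installation, and it is reported as affected. The other two are at or beyond the fixed releases.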

  4. Summary

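  That concludes this walkthrough of troubleshooting Linux kernel crashes. As it shows, tracking down a kernel crash is fairly difficult and takes a measure of patience and technical skill.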
