Tuesday, June 18, 2013

Guide to unexpected Linux system restarts

Sometimes you really don't have clue of what the root cause of a system restarts. You, your colleagues, nobody then who?

Giving three common examples:

1) A deliberate action of a user (fence event, shutdown command)
2) Software error (kernel panic, NMI, etc)
3) Hardware fault/power failure in the server (power supply, disk, memory, system board, etc.)

Environment

Do we have real picture of what been configured, function of each box?

  • Is the server part of a cluster (cluster node) with fence device?
  • What software installed and does it perform any tasks which would change its typical resource use?
  • Is the server hardware capable of rebooting during system hang (configured with health monitoring software, such as HP ASR)?
  • Does it have Baseboard Management Controller connected to the system? HP iLO, Dell DRAC, etc.?

Gathering information

Potential software faults will most typically leave traces in /var/log/messages

Hardware faults are difficult to diagnose from an OS level, be alert of power failures, maintenance events, or other environmental occurrences around the time of the restart.

Investigation

Examine /var/log/message,

Many but not all restart causes will leave traces in /var/log/messages. All full system restarts will begin by listing the kernel command line, searching the message log for the phrase "Command line" is the first step when beginning an investigation.

Aug 22 03:18:15 node1 kernel: Command line: ro root=LABEL=/ rhgb quiet crashkernel=128M@16M

Now look for similar output from the log,

User initiated:

shutdown: shutting down for system reboot 
init: Switching to runlevel: 6 
exiting on signal 15 
Got SIGTERM, quitting

Veritas Cluster Fence:

GAB WARNING V-15-1-20138 Port h isolated due to client process failure

RHEL High-Availability Cluster Suite Fence Event

fenced[xxxx]: fencing node "node1.example.com" 


Hardware Fault

CPU 1: Machine Check Exception: 3 Bank 3: ba00000000070f0f

Thermal Event/Cooling Failure  Hardware Fault Power Button Pressed

kernel: CPUX: Temperature above threshold, cpu clock throttled
kernel: CPUX: Core power limit notification (total events = 1)


Hardware Fault Power Button Pressed

received event "button/power PWRF 00000000 00000000"
 

Non-Maskable Interrupt Received

Uhhuh. NMI received for unknown reason XX.


Kernel Soft Lockup Task Blocked for Too Long

Kernel: BUG: soft lockup - CPU#7 stuck for 10s!
 


Task Blocked for Too Long

kernel: INFO: task khugepaged:60 blocked for more than 120 seconds.
 


Above messages may not necessarily be the root cause of the reboot, but are important clues for further investigation.