Giving three common examples:
1) A deliberate action of a user (fence event, shutdown command)
2) Software error (kernel panic, NMI, etc)
3) Hardware fault/power failure in the server (power supply, disk, memory, system board, etc.)
Environment
Do we have real picture of what been configured, function of each box?
- Is the server part of a cluster (cluster node) with fence device?
- What software installed and does it perform any tasks which would change its typical resource use?
- Is the server hardware capable of rebooting during system hang (configured with health monitoring software, such as HP ASR)?
- Does it have Baseboard Management Controller connected to the system? HP iLO, Dell DRAC, etc.?
Gathering information
Potential software faults will most typically leave traces in /var/log/messages
Hardware faults are difficult to diagnose from an OS level, be alert of power failures, maintenance events, or other environmental occurrences around the time of the restart.
Investigation
Examine /var/log/message,
Many but not all restart causes will leave traces in /var/log/messages. All full system restarts will begin by listing the kernel command line, searching the message log for the phrase "Command line" is the first step when beginning an investigation.
Aug 22 03:18:15 node1 kernel: Command line: ro root=LABEL=/ rhgb quiet crashkernel=128M@16M
Now look for similar output from the log,
User initiated:
shutdown: shutting down for system reboot
init: Switching to runlevel: 6
exiting on signal 15
Got SIGTERM, quitting
Veritas Cluster Fence:
GAB WARNING V-15-1-20138 Port h isolated due to client process failure
RHEL High-Availability Cluster Suite Fence Event
fenced[xxxx]: fencing node "node1.example.com"
Hardware Fault
CPU 1: Machine Check Exception: 3 Bank 3: ba00000000070f0f
Thermal Event/Cooling Failure Hardware Fault Power Button Pressed
kernel: CPUX: Temperature above threshold, cpu clock throttled
kernel: CPUX: Core power limit notification (total events = 1)
Hardware Fault Power Button Pressed
received event "button/power PWRF 00000000 00000000"
Non-Maskable Interrupt Received
Uhhuh. NMI received for unknown reason XX.
Kernel Soft Lockup Task Blocked for Too Long
Kernel: BUG: soft lockup - CPU#7 stuck for 10s!
Task Blocked for Too Long
kernel: INFO: task khugepaged:60 blocked for more than 120 seconds.
Above messages may not necessarily be the root cause of the reboot, but are important clues for further investigation.
Good attempt!!
ReplyDelete