Watchdog timeout messages are displayed in the advanced management module event
log. Use this procedure if there are multiple blade servers in a BladeCenter S chassis and
you are seeing these messages for only one of the blade servers.
Problem
The advanced management module event
log displays watchdog timeout messages for only one of the blade servers in
a BladeCenter S chassis.
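To confirm that the messages really are confined to one blade server, the event log can be exported as text and the watchdog timeout entries tallied per blade. The following is a minimal sketch only: the line layout, the `Blade_NN` source field, and the sample entries are assumptions made for illustration, not the actual advanced management module log format.

```python
import re
from collections import Counter

# Hypothetical event log lines; the real advanced management module
# export format may differ, so adjust the pattern to match your log.
SAMPLE_LOG = """\
E Blade_03 01/12/09 10:14:02 OS watchdog timeout
I Blade_01 01/12/09 10:15:40 Blade powered on
E Blade_03 01/12/09 11:02:17 POST watchdog timeout
E Blade_03 01/12/09 12:30:55 OS watchdog timeout
"""

# Assumed source field naming: "Blade_" followed by the bay number.
BLADE_RE = re.compile(r"\b(Blade_\d+)\b")


def watchdog_counts(log_text):
    """Count watchdog timeout entries per blade source."""
    counts = Counter()
    for line in log_text.splitlines():
        if "watchdog timeout" not in line.lower():
            continue
        match = BLADE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = watchdog_counts(SAMPLE_LOG)
for blade, n in sorted(counts.items()):
    print(blade, n)
```

If more than one blade appears in the tally, the symptom is not the single-blade case that this procedure covers.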
Investigation
Perform these steps to resolve the
problem:
- Find firmware updates for the blade server and service
processor. Look in the firmware change history for information related to
watchdog timeout errors and update the firmware if necessary.
You can find
the firmware by going to Software and device drivers - IBM BladeCenter and selecting
the blade server that you have installed. It is typically listed under Advanced
Systems Management.
- Find firmware updates for the advanced management module.
Look in the firmware change history for information related to watchdog timeout
errors and update the firmware if necessary.
- Verify the operation of the blade server. If it is responsive, the problem
may be a false error condition.
  - Verify that the IBM Automatic Server Restart (ASR) driver is installed
    on the blade server.
  - Update the firmware for the service processor on the blade server.
- If the blade server is nonresponsive, determine the cause of the problem:
  - If there are POST watchdog timeout messages for this blade server in the
    event log, the BIOS flash image on the blade server may be corrupt.
    - If an I/O expansion card is installed in the blade server, remove it and
      reboot the blade server.
    - If the blade server boots properly, replace the I/O expansion card.
    - If the blade server is still nonresponsive, force the blade server to
      boot from the backup flash image. You will need to remove the blade server
      from the BladeCenter S chassis,
      open the cover, and move one of the jumpers. See the documentation that came
      with the blade server for information about this procedure.
    - If the blade server boots from the backup flash image, update the firmware
      for the blade server.
    - If the blade server continues to be nonresponsive, replace the blade server.
  - If there are OS watchdog timeout messages for this blade server in the
    event log, access the operating system logs to determine why the blade server
    is nonresponsive.
    - Determine if the nonresponsiveness is due to a software driver or module
      problem.
    - Look for machine checks or memory errors in the event log.
    - Verify that the disk and communications drivers are up to date.
- Check the event log for other hardware related errors such as CPU or DIMM
errors. If you see hardware faults occurring before the watchdog timeout occurs,
the problem may be in one of the blade server hardware components. Follow
normal debug procedures to isolate the faulty hardware component and replace
it.
Note: Hard disk drives, I/O cards, and I/O expansion modules can cause CPU
faults because of bus errors.
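
The branching in the steps above can be sketched as a small triage helper. It separates POST from OS watchdog timeouts and lists hardware-related entries that were logged before the first watchdog timeout. This is an illustrative sketch under stated assumptions: the `(timestamp, message)` tuple layout, the keyword list, and the sample entries are hypothetical and do not reflect the actual event log format.

```python
from datetime import datetime

# Hypothetical, already-parsed entries for one blade: (timestamp, message).
ENTRIES = [
    (datetime(2009, 1, 12, 9, 58), "DIMM 2 correctable memory error"),
    (datetime(2009, 1, 12, 10, 14), "OS watchdog timeout"),
    (datetime(2009, 1, 12, 10, 20), "CPU 1 fault"),
]

# Assumed keywords for hardware-related errors; extend as needed.
HARDWARE_KEYWORDS = ("cpu", "dimm", "memory error", "machine check")


def triage(entries):
    """Return the watchdog kinds seen and hardware faults logged earlier."""
    ordered = sorted(entries, key=lambda e: e[0])
    kinds = set()
    first_watchdog = None
    for stamp, message in ordered:
        text = message.lower()
        if "watchdog timeout" not in text:
            continue
        if first_watchdog is None:
            first_watchdog = stamp
        # Match whole words so that "post" is not mistaken for "os".
        words = text.split()
        if "post" in words:
            kinds.add("POST")
        elif "os" in words:
            kinds.add("OS")
    prior_faults = [
        message
        for stamp, message in ordered
        if first_watchdog is not None
        and stamp < first_watchdog
        and any(key in message.lower() for key in HARDWARE_KEYWORDS)
    ]
    return kinds, prior_faults


kinds, prior_faults = triage(ENTRIES)
print(sorted(kinds))
print(prior_faults)
```

A POST result points toward the flash image and I/O expansion card checks, an OS result toward the operating system logs and drivers, and a non-empty list of earlier faults toward isolating a hardware component.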