Watchdog timeout messages displayed for a blade server

Watchdog timeout messages are displayed in the advanced management module event log. Use this procedure if there are multiple blade servers in a BladeCenter S chassis and you are seeing these messages for only one of the blade servers.

Problem

The advanced management module event log displays watchdog timeout messages for only one of the blade servers in a BladeCenter S chassis.
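One way to confirm that the messages really are limited to a single blade server is to tally them from an exported copy of the event log. The sketch below is illustrative only: the log line layout, the `Blade_NN` source label, and the function names are assumptions made for this example, not a documented advanced management module export format.

```python
import re
from collections import Counter

# Assumed (hypothetical) source label in each log line, e.g. "Blade_03".
BLADE_PATTERN = re.compile(r"\bBlade_(\d+)\b", re.IGNORECASE)


def count_watchdog_events(log_lines):
    """Count lines that mention a watchdog timeout, keyed by blade bay number."""
    counts = Counter()
    for line in log_lines:
        if "watchdog" not in line.lower():
            continue  # not a watchdog message
        match = BLADE_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def single_affected_blade(log_lines):
    """Return the bay number if exactly one blade logged watchdog timeouts, else None."""
    counts = count_watchdog_events(log_lines)
    if len(counts) == 1:
        return next(iter(counts))
    return None


# Made-up sample lines in the assumed layout.
sample_log = [
    "Blade_03  Error  POST watchdog timeout",
    "Blade_03  Error  OS watchdog timeout",
    "Blade_01  Info   Blade powered on",
]
```

If `single_affected_blade` returns a bay number, this procedure applies to that blade server; if it returns `None`, either no blade or more than one blade logged watchdog timeouts.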

Investigation

Perform these steps to resolve the problem:
  1. Find firmware updates for the blade server and service processor. Look in the firmware change history for information related to watchdog timeout errors and update the firmware if necessary.

    You can find the firmware by going to Software and device drivers - IBM BladeCenter and selecting the blade server that you have installed. The firmware is typically listed under Advanced Systems Management.

  2. Find firmware updates for the advanced management module. Look in the firmware change history for information related to watchdog timeout errors and update the firmware if necessary.
  3. Verify the operation of the blade server. If it is responsive, the watchdog timeout messages may indicate a false error condition. In that case:
    1. Verify that the IBM Automatic Server Restart (ASR) driver is installed on the blade server.
    2. Update the firmware for the service processor on the blade server.
  4. If the blade server is nonresponsive, determine the cause of the problem:
    • If there are POST watchdog timeout messages for this blade server in the event log, the BIOS flash image on the blade server may be corrupt.
      1. If an I/O expansion card is installed in the blade server, remove it and reboot the blade server.
        • If the blade server boots properly, replace the I/O expansion card.
        • If the blade server is still nonresponsive, force the blade server to boot from the backup flash image. To do this, remove the blade server from the BladeCenter S chassis, open the cover, and move one of the jumpers. See the documentation that came with the blade server for information about this procedure.
          • If the blade server boots from the backup flash image, update the firmware for the blade server.
          • If the blade server continues to be nonresponsive, replace the blade server.
    • If there are OS watchdog timeout messages for this blade server in the event log, access the operating system logs to determine why the blade server is nonresponsive.
      • Determine if the nonresponsiveness is due to a software driver or module problem.
      • Look for machine checks or memory errors in the event log.
      • Verify that the disk and communications drivers are up to date.
  5. Check the event log for other hardware-related errors, such as CPU or DIMM errors. If you see hardware faults occurring before the watchdog timeout occurs, the problem may be in one of the blade server hardware components. Follow normal debug procedures to isolate the faulty hardware component and replace it.
    Note: Hard disk drives, I/O cards, and I/O expansion modules can cause CPU faults because of bus errors.
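
The branching in steps 3 and 4 can be summarized as a small decision function. This is only a sketch of the logic described above: the `BladeState` fields and the `next_actions` function are hypothetical names invented for this example, and the inputs must be gathered by hand from the blade server and the event log.

```python
from dataclasses import dataclass


@dataclass
class BladeState:
    """Observations collected manually (hypothetical structure for this sketch)."""
    responsive: bool
    post_watchdog_logged: bool = False
    os_watchdog_logged: bool = False
    io_expansion_card_installed: bool = False


def next_actions(state):
    """Return the follow-up actions from steps 3 and 4 for the given observations."""
    if state.responsive:
        # Step 3: possibly a false error condition.
        return [
            "Verify that the ASR driver is installed",
            "Update the service processor firmware",
        ]
    actions = []
    if state.post_watchdog_logged:
        # Step 4, first bullet: the BIOS flash image may be corrupt.
        if state.io_expansion_card_installed:
            actions.append("Remove the I/O expansion card and reboot")
        actions.append("If still nonresponsive, boot from the backup flash image")
    if state.os_watchdog_logged:
        # Step 4, second bullet: investigate from the operating system side.
        actions.append("Review the operating system logs")
    if not actions:
        # Neither message type was seen; this sketch moves on to step 5.
        actions.append("Continue with step 5: check for other hardware-related errors")
    return actions
```

For example, `next_actions(BladeState(responsive=False, post_watchdog_logged=True, io_expansion_card_installed=True))` lists removing the I/O expansion card first and the backup flash image second. Step 5 still applies in every case.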