Troubleshooting thermal issues

The advanced management module provides status and alerts for temperature sensors throughout the chassis. Some sensors provide an actual temperature and others provide only a notification of whether a threshold has been exceeded. The advanced management module adjusts the speed of cooling devices (blowers and fans) based on the environmental conditions and threshold indications.

The main ambient temperature sensor is located in the media tray, so removing the media tray causes the blowers and fans to run at maximum speed. If the advanced management module cannot read the temperature sensor, it operates as if the temperature is at the peak operational value and all components in the chassis require maximum cooling.
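The fail-safe behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the advanced management module firmware: the function name `effective_ambient_c` and the `peak_operational_c` parameter are hypothetical, and the actual peak operational value depends on the chassis.

```python
from typing import Optional


def effective_ambient_c(reading_c: Optional[float],
                        peak_operational_c: float) -> float:
    """Return the ambient temperature the cooling logic should act on.

    reading_c is None when the sensor cannot be read, for example
    because the media tray has been removed (hypothetical convention).
    """
    if reading_c is None:
        # Fail safe: assume the worst case so that maximum cooling is applied.
        return peak_operational_c
    return reading_c
```

With an assumed peak value of 35.0, `effective_ambient_c(None, 35.0)` returns 35.0, while a readable sensor value is passed through unchanged.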

Other temperature sensors are located near the CPU modules in the blade servers. The readings from the CPU temperature sensors are compared against a warning limit and a critical limit. If the CPU temperature exceeds the warning limit specified for that blade server, a temperature warning event is posted and the chassis blowers increase in speed to correct the temperature condition. If the CPU temperature exceeds the critical limit, the blade server is shut off. To check the CPU temperature sensors, click the status indicator next to each blade server listed on the System Status page in the advanced management module Web interface.
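The two-limit comparison above can be illustrated with a minimal Python sketch. The names `ThermalState` and `classify_cpu_temperature` are hypothetical, and the limits are passed in as parameters because the real warning and critical values are specified per blade server.

```python
from enum import Enum


class ThermalState(Enum):
    NORMAL = "normal"      # no action needed
    WARNING = "warning"    # warning event posted, blowers speed up
    CRITICAL = "critical"  # blade server is shut off


def classify_cpu_temperature(temp_c: float,
                             warning_limit_c: float,
                             critical_limit_c: float) -> ThermalState:
    """Compare one CPU reading against its warning and critical limits."""
    if warning_limit_c >= critical_limit_c:
        raise ValueError("warning limit must be below critical limit")
    # "Exceeds" is modeled as a strict comparison.
    if temp_c > critical_limit_c:
        return ThermalState.CRITICAL
    if temp_c > warning_limit_c:
        return ThermalState.WARNING
    return ThermalState.NORMAL
```

For example, with assumed limits of 80 and 95, a reading of 85 is classified as `ThermalState.WARNING`.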

Temperature warning events are recorded in the event log. They are also reported through e-mail notifications, SNMP traps, and IBM® Director alerts, if these are enabled.

The following conditions can cause thermal errors:
  • The ambient temperature in the environment is hot, which could be due to problems with air conditioning.
  • The intake vents in the front of the chassis are obstructed.
  • The air filter on the chassis needs cleaning.
  • A heat sink on a blade server CPU is loose, or its thermal grease is missing or incorrectly applied.
  • A fan module or blower module has failed or has been removed.
  • Bays on the chassis are empty, which prevents normal front-to-back air flow. A filler or component should always be installed in each bay in a chassis.
  • A thermal sensor is faulty.

Thermal conditions tend to develop gradually (with the exception of heat sink problems). You can view and compare the temperature readings for various components in the BladeCenter® chassis. The advanced management module thermal sensor, located at the rear of the chassis, is expected to report slightly higher temperatures than the ambient sensor located at the front of the chassis in the media tray.

A thermal sensor could be faulty if the advanced management module posts a thermal warning or keeps the blower and fan speeds at maximum RPM immediately after power is applied to the chassis or after a specific blade server is powered up. It is normal for the chassis blowers to increase their speed immediately after power is applied to the chassis or after the advanced management module is reset. However, the blower speed should decrease within two minutes if ambient temperature conditions are good and all temperature sensors are working as expected.
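The two-minute rule above can be expressed as a simple check. The following Python sketch is a hypothetical heuristic, not a feature of the advanced management module: the function name, its parameters, and the way the inputs are obtained are all assumptions, and a positive result only suggests that a sensor should be examined, since a failed blower or a removed media tray can also keep blowers at maximum speed.

```python
SETTLE_SECONDS = 120  # "within two minutes", taken from the text above


def sensor_fault_suspected(seconds_since_power_on: float,
                           blowers_at_max: bool,
                           ambient_ok: bool) -> bool:
    """Flag a possible faulty thermal sensor after power-on or reset.

    Blowers running at maximum speed during the settle window is normal.
    Blowers still at maximum after the window, while ambient conditions
    are good, is the pattern the text associates with a faulty sensor.
    """
    if seconds_since_power_on < SETTLE_SECONDS:
        return False  # the initial ramp-up is expected behavior
    return blowers_at_max and ambient_ok
```

For example, `sensor_fault_suspected(300, True, True)` returns `True`, while the same inputs at 30 seconds return `False`.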

A blade with a faulty heat sink connection to a CPU might show a temperature increase within seconds after powering up the blade server. One way to determine if the temperature increase is due to a faulty heat sink, a faulty temperature sensor, chassis air flow, or an ambient air temperature problem is to monitor the blade server CPU temperature through the advanced management module Web interface while powering up the blade server.

If the blade server has two CPU modules installed, the temperature for both modules should rise at about the same rate under the same stress. If the temperature reading for one CPU module rises much faster than the other module, the heat sink for that CPU module might not be installed correctly. If both CPU module temperatures rise at about the same rate to warning or critical limits, there might be a problem with chassis air flow or ambient temperature.
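The comparison of the two CPU modules can be sketched as follows, assuming temperature samples have been noted by hand from the Web interface as (seconds, degrees Celsius) pairs. Everything here is illustrative: the function names, the `ratio_threshold` used to model "much faster", and the `min_rate` floor that ignores negligible drift are assumptions, not values documented for the BladeCenter chassis.

```python
def rise_rate(samples):
    """Average rise in degrees C per second between first and last sample."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) / (t1 - t0)


def diagnose_dual_cpu(cpu1_samples, cpu2_samples, warning_limit_c,
                      ratio_threshold=2.0, min_rate=0.05):
    """Suggest where to look, based on how two CPU temperatures rise."""
    rate1 = rise_rate(cpu1_samples)
    rate2 = rise_rate(cpu2_samples)
    slower, faster = sorted((rate1, rate2))
    # One CPU heating much faster than the other points at its heat sink.
    if faster >= min_rate and faster > ratio_threshold * max(slower, 0.0):
        which = "CPU 1" if rate1 > rate2 else "CPU 2"
        return f"check heat sink on {which}"
    # Both CPUs reaching the warning limit together points at the chassis.
    both_hot = (cpu1_samples[-1][1] >= warning_limit_c
                and cpu2_samples[-1][1] >= warning_limit_c)
    if both_hot:
        return "check chassis air flow or ambient temperature"
    return "no thermal fault indicated"
```

A result from this kind of comparison is a starting point for inspection only; the readings should be taken while both CPU modules are under the same stress, as described above.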