MAP4510 CEC enclosure to CEC enclosure communication failure

About this task

This MAP is used for communication failures between the partner logical partitions (LPARs) in the CEC enclosures. The communication begins after AIX is loaded and the functional code load occurs. For models 961 and 98x, this communication between CEC enclosures is across the PCIe interface. For models 941 and 951, this communication between CEC enclosures is across the RIO interface.

Because this communication uses redundant paths, no single cable fault should cause a communication failure. Each LPAR periodically sends a communication message (heartbeat) to the partner LPAR and sets a timer while it waits for the response. If the timer expires with no response, the error recovery process causes the non-responding LPAR to fail over its resources to the originating LPAR and then reboot itself to try to restore communication.
  • If communication is restored, a fail-back occurs to return resources, and no serviceable event is created.
  • If communication still fails, the LPAR is fenced and a serviceable event is created to send you to this MAP.
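The recovery flow described above can be sketched as a small decision function. This is only an illustrative model of the documented behavior, not the actual storage facility code: the names `Outcome` and `recover`, and the two boolean inputs, are hypothetical and stand in for the heartbeat timer result before and after the reboot.

```python
from enum import Enum


class Outcome(Enum):
    HEALTHY = "communication normal"
    FAILED_BACK = "fail-back done, no serviceable event"
    FENCED = "LPAR fenced, serviceable event created"


def recover(responds_before_timeout, responds_after_reboot):
    """Model the heartbeat recovery flow (hypothetical helper).

    Returns the final outcome and the ordered list of recovery actions.
    """
    actions = []
    if responds_before_timeout:
        # The partner answered the heartbeat before the timer expired.
        return Outcome.HEALTHY, actions

    # Timer expired: the non-responding LPAR fails over and reboots.
    actions.append("fail over resources to originating LPAR")
    actions.append("reboot non-responding LPAR")

    if responds_after_reboot:
        # Communication restored: resources are returned.
        actions.append("fail back resources")
        return Outcome.FAILED_BACK, actions

    # Communication still fails: fence and raise a serviceable event.
    actions.append("fence LPAR")
    actions.append("create serviceable event")
    return Outcome.FENCED, actions
```

Only the last outcome, `Outcome.FENCED`, corresponds to the situation in which a serviceable event directs you to this MAP.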

Procedure

  1. Display the status of the servers:
    1. From the navigation area, click Storage Facility Management > storage facility > Server View.
    2. Note the status of each server, and the reference code, if one is displayed.
      Record the serial number of any server that does not show a status of "Operating," or displays a reference code.
  2. Display the status of the LPARs:
    1. From the navigation area, click Storage Facility Management > storage facility > Server View > server.
    2. Note the status of each LPAR, and the reference code, if one is displayed.
      Record the name of any LPAR that does not show a status of "Running," or displays a reference code.
    3. Repeat substeps 1 and 2 for the other server.
  3. If a server does not show a status of "Operating" or is missing from the display, check the input power LEDs of all CEC enclosure power supplies for that server.

    Is at least one input power LED lit?

  4. Check the DC GOOD (output power) LEDs on all the CEC enclosure power supplies. Are all the DC GOOD LEDs flashing?
  5. Were both I2C cables disconnected from the CEC and reconnected?
    • Yes, continue to the next step.
    • No, continue with step 9.
  6. Attempt to power on the CEC enclosure using the Service Utilities. Do the following steps to select the CEC enclosure:
    Note: If there is a communication problem between the HMC and the service processor, these steps might not work.
    1. From the navigation area, click Storage Facility Management > storage facility > Server View > server.
    2. From the bottom Task area, click Service Utilities > Storage System Power Control.
    3. Click Power On System to Ready and confirm the action.

    Did the managed system (CEC) power on and the LPAR IML successfully?

    • Yes, the repair is complete; close the serviceable event that sent you here.
    • No, continue to the next step.
  7. Is the CEC/LPAR fenced? See MAP1100 View storage facility state (end of call).
    • Yes, continue to the next step.
    • No, continue with step 9.
  8. Use the following special pseudo repair procedure to reset the fenced CEC enclosure, which quiesces, powers off, powers on, and resumes the CEC enclosure:
    1. Use the View Storage Facility State (end of call) window to determine which CEC enclosure is fenced.
      1. From the navigation area, click Storage Facility Management > storage facility.
      2. From the bottom Task area, click Service Utilities > View Storage Facility State. The View Storage Facility State (end of call) window opens.
      3. Click the Fenced Resources option at the bottom of the list. Then, click Details and the fenced LPAR information is shown.
        For example:
        Server     Not Good
        lparName   SF75FW820ESS11
        state=4(Fenced),PartitionState=Running

        Note: SFsssssssssESS0x (x = 1 or 2) is in CEC0 (upper)
              SFsssssssssESS1x (x = 1 or 2) is in CEC1 (lower)
      4. Return to the Task area, and click Service Utilities > View Hardware Topology. You can identify the fenced CEC enclosure location code.
        For example:
        Current Hardware Topology
        CEC 0 MTMS = 9117-MMA*10D5242
        CEC 0 Unit ID = U787D.001.DQD53K3
        CEC 1 MTMS = 9117-MMA*10D5272
        CEC 1 Unit ID = U787D.001.DQD17BM
    2. Use the Exchange Parts procedure to select the CEC enclosure that was displayed as fenced:
      1. From the navigation area, click Storage Facility Management > storage facility.
      2. From the bottom Task area, click Exchange Parts > Exchange CEC Components.... The Exchange CEC Components window opens.
      3. Select a CEC enclosure and click Show FRUs. The Show CEC FRUs window opens.
      4. Select the FRU Location Code, and then click Exchange FRU.
    3. Select the System Processor Card FRU and continue the guided repair. From the Show CEC FRUs window, select System Processor Card for any Processor Card Slot (LEDs not used) and click Exchange FRU.
      Notes:
      1. If the System Processor Card is not displayed, you might need to maximize the window and manually scroll down the list.
      2. When the guided repair directs you to disconnect the black power cables to the CEC enclosure power supplies, do not disconnect them.
      3. When the guided repair directs you to replace the system processor card, do not replace it; leave it installed and continue the repair.
    4. After the pseudo repair of the CEC is complete, did the LPAR successfully IML?
      • Yes, the repair is complete; close the serviceable event that sent you here.
      • No, continue with step 9.
  9. Do one of the following:
    • If any server or partition that displays an Operator Panel Value is not responding, exit this MAP and go to MAP4360 Codes displayed by the CEC enclosure control panel. After the repair is complete, close the serviceable event that sent you here.
    • If any Server State is not "Operating" or any Partition State is not "Running," display and repair any related serviceable events for that CEC enclosure. After the repair is complete, close the serviceable event that sent you here. If no related serviceable events are found, contact your next level of support.
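
The LPAR naming note in step 8 (SFsssssssssESS0x is in CEC0, SFsssssssssESS1x is in CEC1) can be applied mechanically when reading the Fenced Resources details. The following is a minimal sketch under stated assumptions: the function name `cec_for_lpar` is hypothetical, and the pattern assumes that the serial portion between `SF` and `ESS` is alphanumeric, as in the example name SF75FW820ESS11.

```python
import re

# Assumed name layout: "SF" + serial + "ESS" + CEC digit (0 or 1) + x (1 or 2).
_LPAR_NAME = re.compile(r"SF([A-Za-z0-9]+?)ESS([01])([12])")

_POSITION = {0: "upper", 1: "lower"}


def cec_for_lpar(lpar_name):
    """Return (cec_number, position) for an LPAR name (hypothetical helper)."""
    match = _LPAR_NAME.fullmatch(lpar_name)
    if match is None:
        raise ValueError(f"unrecognized LPAR name: {lpar_name!r}")
    cec_number = int(match.group(2))
    return cec_number, _POSITION[cec_number]


if __name__ == "__main__":
    # Example name taken from the Fenced Resources details in step 8.
    print(cec_for_lpar("SF75FW820ESS11"))
```

The result only identifies which CEC enclosure holds the fenced LPAR; the location code still comes from View Hardware Topology as described in step 8.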