MAP4040 Isolating LPAR operating system boot problems

A checkpoint might be displayed on the operator panel while the boot image is retrieved from the device. If the checkpoint is displayed for an extended period and the hard drive LED shows no activity, there might be a problem loading the boot image from the device.

MAP4040 Section-1

Procedure

Look at the service action event error log in the HMC Service Focal Point. Perform the actions necessary to resolve any open entries that affect devices in the boot path of the partition. Then, try to reboot the partition. Does the partition reboot successfully?

MAP4040 Section-2

About this task

This section shows you how to use the SMS menus to display the status of the partition boot devices (hard drives).

Procedure

  1. Log on to the Management Console (HMC) with the CE user ID and password (default serv1cece).
  2. Use the Service Utilities to quiesce and shut down the failing partition:
    1. From the navigation area, click Storage Facility Management > storage facility > SF image.
    2. From the right work area, select the affected LPAR.
    3. From the bottom Task area, click Service Utilities > Change/Show LPAR State. The LPAR Server Control window opens.
    4. Click Quiesce LPAR. Click Yes to confirm.
    5. Wait 10 minutes until the Quiesce started! window opens. Click OK.
    6. On the Server Control window, click Refresh to see the current status.
    7. When the quiesce is complete, click Shutdown LPAR. Click Yes to confirm.
    8. Wait 10 minutes until the Shutdown started window opens. Click OK.
    9. On the Server Control window, click Refresh to see the current status.
    10. Wait until the Operational State is Deactivated.
    11. Leave the Server Control window open.
    12. From the bottom Task area, click Service Utilities > Set No-rsStart.
    13. When the Set No-rsStart Successful window opens, click OK.
  3. Open a Terminal window:
    1. From the navigation area, click Storage Facility Management > storage facility > Server View > server.
    2. From the right work area, select the failing partition (the state of the failing partition is Not Activated).
    3. From the bottom Task area, click Console Window > Open Terminal Window.
  4. Activate the partition and interrupt access through the SMS menus.
    1. On the Server Control window, click Activate LPAR. Click Yes to confirm.
    2. Quickly return to the Terminal Window. When the following screen is displayed, type 1 and press Enter.
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM IBM
      
                1 = SMS Menu                          5 = Default Boot List
                8 = Open Firmware Prompt              6 = Stored Boot List
      
      
           Memory      Keyboard     Network     SCSI     Speaker
    3. The SMS menu is displayed. Type 1 and press Enter to Select Language.
    4. The SMS language menu is displayed. Type 1 and press Enter to choose English. The SMS Main menu is displayed.
  5. Use the SMS menus to display the list of available boot devices:
    1. On the SMS Main menu, type 5 and press Enter to choose Select Boot Options.
    2. On the next menu, type 1 and press Enter to choose Select Install/Boot Device.
    3. On the next menu, type 7 and press Enter to choose List all Devices.
  6. Review the list of devices that are found. See the example screen shown below. In some cases, you might need to enter N (Next page of list) to see the entire list of available boot devices. Also see Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Table 1, and Table 2 which show the hard drive location codes for the different DS8000® models.
    PowerPC Firmware
     Version AL740_075
     SMS 1.7 (c) Copyright IBM Corp. 2000,2008 All rights reserved.
    -------------------------------------------------------------------------------
     Select Device
     Device  Current  Device
     Number  Position  Name
     1.        -      Port 1 - IBM 2 PORT PCIe 10/100/1000 Base-TX Adapter
            ( loc=U78AA.001.WZSGRLF-P1-C2-T1 )
     2.        -      Port 2 - IBM 2 PORT PCIe 10/100/1000 Base-TX Adapter
            ( loc=U78AA.001.WZSGRLF-P1-C2-T2 )
     3.        2      SAS 136 GB Harddisk, part=2 (AIX 7.1.0)
            ( loc=U78AA.001.WZSGRLF-P2-D2 )
     4.        1      SAS 136 GB Harddisk, part=2 (AIX 7.1.0)
            ( loc=U78AA.001.WZSGRLF-P2-D1 )
     5.        -      SAS 136 GB Harddisk, part=4 (AIX 7.1.0)
            ( loc=U78AA.001.WZSGRLF-P2-D2 )
    
    
     -------------------------------------------------------------------------------
     Navigation keys:
     M = return to Main Menu
     ESC key = return to previous screen         X = eXit System Management Services
     -------------------------------------------------------------------------------
     Type menu item number and press Enter or select Navigation key:  
    Figure 1. CEC enclosure location codes (front) (model 961)
    Figure 2. CEC enclosure location codes (front view) (models 980, 983, 984)
    Figure 3. CEC enclosure location codes (front view) (models 981, 985, 986)
    Note: Model 981 CEC shown; drive locations for models 980 and 984 are similar.
    Figure 4. CEC enclosure location codes (front view) (models 982, 988)
    Table 1. Hard drive locations (models 961, 98x)
    Model                         Storage Facility Image (SFI)  Partition name              Location code
                                                                                            Primary boot drive       Secondary boot drive
    961                           First                         SFxxxxxxx01 (in upper CEC)  U78AA.001.xxxxxxx-P2-D1  U78AA.001.xxxxxxx-P2-D2
                                                                SFxxxxxxx11 (in lower CEC)  U78AA.001.xxxxxxx-P2-D1  U78AA.001.xxxxxxx-P2-D2
    980, 981, 983, 984, 985, 986  First                         SFxxxxxxx01 (in upper CEC)  U78Cx.001.xxxxxxx-P2-D1  U78Cx.001.xxxxxxx-P2-D2
                                                                SFxxxxxxx11 (in lower CEC)  U78Cx.001.xxxxxxx-P2-D1  U78Cx.001.xxxxxxx-P2-D2
    982, 988                      First                         SFxxxxxxx01 (in upper CEC)  U78Cx.001.xxxxxxx-P4-D2  U78Cx.001.xxxxxxx-P4-D6
                                                                SFxxxxxxx11 (in lower CEC)  U78Cx.001.xxxxxxx-P4-D2  U78Cx.001.xxxxxxx-P4-D6
    Figure 5. CEC enclosure locations (front) (models 941, 951)
    Table 2. Hard drive locations (models 941, 951)
    Model                         Storage Facility Image (SFI)  Partition name              Location code
                                                                                            Primary boot drive       Secondary boot drive
    941, 951                      First                         SFxxxxxxx01 (in upper CEC)  U789D.001.xxxxxxx-P3-D1  U789D.001.xxxxxxx-P3-D2
                                                                SFxxxxxxx11 (in lower CEC)  U789D.001.xxxxxxx-P3-D1  U789D.001.xxxxxxx-P3-D2
    Note: The same device might appear multiple times, indicating a multibos installation on that device. For example, the following display shows two partitions on the same hard drive:

    3.        2      SAS 136 GB Harddisk, part=2 (AIX 7.1.0)
            ( loc=U78AA.001.WZSGRLF-P2-D2 )
    5.        -      SAS 136 GB Harddisk, part=4 (AIX 7.1.0)
            ( loc=U78AA.001.WZSGRLF-P2-D2 )
    This is not the same as two separate hard drives.
  7. Are two hard drives listed for the failing partition?
    • Yes, go to MAP4040 Section-3.
    • No, go to MAP4040 Section-4.
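The check in steps 6 and 7 comes down to counting distinct location codes, not distinct list entries. The following is a minimal Python sketch of that idea, assuming the SMS "Select Device" screen has been copied from the terminal window as text. The function name `group_hard_drives` and the regular expressions are illustrative assumptions, not part of any IBM tool, and the parsing relies on the layout shown in the example screen above.

```python
import re
from collections import defaultdict

# Entry line, e.g. " 3.        2      SAS 136 GB Harddisk, part=2 (AIX 7.1.0)"
ENTRY = re.compile(r"^\s*(\d+)\.\s+(\d+|-)\s+(.*\S)\s*$")
# Location line that follows each entry, e.g. "( loc=U78AA.001.WZSGRLF-P2-D2 )"
LOC = re.compile(r"\(\s*loc=(\S+)\s*\)")


def group_hard_drives(screen_text):
    """Group hard disk entries by location code.

    Returns {location_code: [(device_number, boot_position, name), ...]}.
    Several entries under one location code are one physical drive.
    """
    drives = defaultdict(list)
    pending = None  # entry line waiting for its location line
    for line in screen_text.splitlines():
        entry = ENTRY.match(line)
        if entry:
            pending = (int(entry.group(1)), entry.group(2), entry.group(3))
            continue
        loc = LOC.search(line)
        if loc and pending:
            if "Harddisk" in pending[2]:  # skip network adapters and other devices
                drives[loc.group(1)].append(pending)
            pending = None
    return dict(drives)


# Device entries copied from the example screen in this section.
SAMPLE = """\
 1.        -      Port 1 - IBM 2 PORT PCIe 10/100/1000 Base-TX Adapter
        ( loc=U78AA.001.WZSGRLF-P1-C2-T1 )
 2.        -      Port 2 - IBM 2 PORT PCIe 10/100/1000 Base-TX Adapter
        ( loc=U78AA.001.WZSGRLF-P1-C2-T2 )
 3.        2      SAS 136 GB Harddisk, part=2 (AIX 7.1.0)
        ( loc=U78AA.001.WZSGRLF-P2-D2 )
 4.        1      SAS 136 GB Harddisk, part=2 (AIX 7.1.0)
        ( loc=U78AA.001.WZSGRLF-P2-D1 )
 5.        -      SAS 136 GB Harddisk, part=4 (AIX 7.1.0)
        ( loc=U78AA.001.WZSGRLF-P2-D2 )
"""

if __name__ == "__main__":
    for location, entries in group_hard_drives(SAMPLE).items():
        print(location, [number for number, _, _ in entries])
```

With the sample screen, entries 3 and 5 share the location code ending in P2-D2, so the result contains two physical drives even though three hard disk entries are listed.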

MAP4040 Section-3

About this task

This section isolates a boot problem where both hard drives are listed on the boot list.

Procedure

  1. Unplug the hard drive that is in position 1 on the boot list and retry the operation.
    1. See Exchange the CEC enclosure disk drive to unplug the hard drive in position 1 on the boot list. Do not install a new drive at this time.
    2. Type x and press Enter to exit the SMS menus and begin loading the operating system. Keep the terminal window open.
  2. Did the boot problem occur again?
    • Yes, go to the next step.
    • No, it appears that the unplugged hard drive was faulty. Record the code level and installation date from the terminal window login herald. (See an example in MAP4040 Section-5, step 2.)

      Go to MAP4040 Section-5.

  3. Reinstall the removed hard drive. See Exchange the CEC enclosure disk drive.
  4. Remove the hard drive in position 2 on the boot list.
  5. Shut down and activate the partition to retest it.
    1. Return to the Server Control window. Click Shutdown. Click Yes to confirm.
    2. Wait 10 minutes until the Shutdown started window opens. Click OK.
    3. On the Server Control window, click Refresh to view the current status.
    4. Wait until the Operational State is Deactivated.
    5. Click Activate LPAR. Click Yes to confirm.
  6. Did the boot problem occur again?
    • Yes, go to the next step.
    • No, it appears that the unplugged hard drive was faulty. Record the code level and installation date from the terminal window login herald. (See an example in MAP4040 Section-5, step 2.)

      Go to MAP4040 Section-5.

  7. There are two likely causes of this problem. Contact your next level of support for guidance to either:

MAP4040 Section-4

About this task

This section isolates a boot problem where only one hard drive is listed on the boot list.

Procedure

  1. Use Table 1 or Table 2 (depending on the model) to identify the location for the hard drive that is not listed.
  2. Unplug the hard drive that was identified in step 1 and retry the operation.
    1. See Exchange the CEC enclosure disk drive to unplug the hard drive identified in step 1. Do not install a new drive at this time.
    2. Type x and press Enter to exit the SMS menus and begin loading the operating system. Keep the terminal window open.
  3. Did the boot problem occur again?
    • Yes, go to the next step.
    • No, it appears that the unplugged hard drive was faulty. Record the code level and installation date from the terminal window login herald. (See an example in MAP4040 Section-5, step 2.)

      Go to MAP4040 Section-5.

  4. Reinstall the removed hard drive. See Exchange the CEC enclosure disk drive.
  5. Remove the hard drive that was listed on the boot list.
  6. Shut down and activate the partition to retest it.
    1. Return to the Server Control window. Click Shutdown. Click Yes to confirm.
    2. Wait 10 minutes until the Shutdown started window opens. Click OK.
    3. On the Server Control window, click Refresh to view the current status.
    4. Wait until the Operational State is Deactivated.
    5. Click Activate LPAR. Click Yes to confirm.
  7. Did the boot problem occur again?
    • Yes, go to the next step.
    • No, it appears that the unplugged hard drive was faulty. Record the code level and installation date from the terminal window login herald. (See an example in MAP4040 Section-5, step 2.)

      Go to MAP4040 Section-5.

  8. Replace the disk drive backplane assembly (Storage Facility Management > storage facility > Exchange Parts).
    1. After the repair is complete, use step 6 to retest.

    Did the boot problem occur again?

    • Yes, contact your next level of support.
    • No, it appears that the disk drive backplane assembly was faulty. Record the code level and installation date from the terminal window login herald. (See an example in MAP4040 Section-5, step 2.)

      Go to MAP4040 Section-5.
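The isolation order in MAP4040 Section-4 can be summarized as a small decision function. This is only an illustrative Python sketch of the flow described above, not an IBM utility. The names `SECTION4_SUSPECTS` and `section4_outcome` are hypothetical, and the input is the sequence of answers to "Did the boot problem occur again?" in the order the retests are performed.

```python
# Suspects in the order MAP4040 Section-4 tests them.
SECTION4_SUSPECTS = (
    "hard drive not listed on the boot list",
    "hard drive listed on the boot list",
    "disk drive backplane assembly",
)


def section4_outcome(problem_recurred):
    """Map the ordered retest answers to the outcome the procedure describes.

    problem_recurred: sequence of bools, one per retest performed so far;
    True means "Yes, the boot problem occurred again".
    """
    for suspect, recurred in zip(SECTION4_SUSPECTS, problem_recurred):
        if not recurred:
            # The first retest that boots cleanly identifies the suspect part.
            return f"{suspect} appears faulty; go to MAP4040 Section-5"
    if len(problem_recurred) >= len(SECTION4_SUSPECTS):
        # All three retests failed, including the one after the backplane repair.
        return "contact your next level of support"
    return "continue with the next step"


if __name__ == "__main__":
    print(section4_outcome([True, False]))
```

For example, answers of Yes then No (`[True, False]`) point to the hard drive that was listed on the boot list, because the partition booted once that drive was removed.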

MAP4040 Section-5

About this task

This section cleans up after recovering from the LPAR operating system boot problem.

Procedure

  1. Determine the installed SF LIC level:
    1. From the navigation area, click Updates.
    2. From the right work area, select the storage facility.
    3. From the bottom Task area, click Display Storage Facility Code Levels.

    Examples:

    SFI Code Levels:
       VRMF: 7.7.0.379 locationCode: 8205-E6C*100B7FR-V1
       VRMF: 7.7.0.379 locationCode: 8205-E6C*100B7ER-V1

    CDA Install History: (Most recent successful update)
       Package: SEA.sfi , MTMS: 8205-E6C*100B7ER-V1
       Date: 2012/04/23-03:23, Bundle VRMF: 87.0.97.0 , Package Level: 7.7.0.379, Mode: CCL

       Package: SEA.sfi , MTMS: 8205-E6C*100B7FR-V1
       Date: 2012/04/23-04:00, Bundle VRMF: 87.0.97.0 , Package Level: 7.7.0.379, Mode: CCL


  2. Compare the code level (code EC) recorded from the login herald with the installed level (SFI level VRMF, SEA.sfi package level) obtained in step 1.
    Login herald example:
    IBM System Storage Enterprise Storage Server (TM)
        2107 Model 961            SN 75-YZ581          Server 1  SF75YZ580ESS01
    OS Level 7.1.0.403         Code EC 7.7.0.379      Installed on: Apr 23 2012

    Do the VRMF/EC levels match?

    • Yes, go to the next step.
    • No, it appears that the LPAR was booted from the wrong multibos image.

      Contact your next level of support to run the "recovery" section of the DS8000 Field Tip entitled "AIX boots old code level (SFG, MES, Rack disc, FSP repair, Model Conversion)." Inform the next level of support that the affected LPAR is already quiesced and that the No-rsStart function has been set.

  3. Use the Service Utilities to reset No-rsStart and resume the affected LPAR. (This step should not be necessary if the "recovery" section was done in step 2.)
    1. From the navigation area, click Storage Facility Management > storage facility > SF image.
    2. From the right work area, select the affected LPAR.
    3. From the bottom Task area, click Service Utilities > Reset No-rsStart.
    4. When the Reset No-rsStart Successful window opens, click OK.
    5. From the bottom Task area, click Service Utilities > Change/Show LPAR State. The LPAR Server Control window opens.
    6. Click Resume LPAR. Click Yes to confirm. Allow the resume to complete.
  4. Was the disk drive backplane assembly replaced?
    • Yes, this completes this procedure.
    • No, it appears that the unplugged hard drive was faulty. A serviceable event should be created that lists the hard drive to remove as a FRU.

      Repair the serviceable event to replace the faulty hard drive. If a serviceable event is not found, the hard drive can be replaced using the Exchange Parts menu (Storage Facility Management > storage facility > Exchange Parts).
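The level comparison in step 2 of this section can be sketched in Python as a text comparison between the login herald and the Display Storage Facility Code Levels output. This is a minimal sketch under stated assumptions: the function names are hypothetical, the text formats are taken from the two examples in this section only, and the check does not pair each level with a specific LPAR.

```python
import re

# A four-part level such as 7.7.0.379
LEVEL = r"\d+(?:\.\d+){3}"


def herald_code_ec(herald_text):
    """Return the "Code EC" level from the login herald, or None if absent."""
    match = re.search(r"Code EC\s+(" + LEVEL + ")", herald_text)
    return match.group(1) if match else None


def installed_vrmfs(code_levels_text):
    """Return the set of SFI VRMF levels.

    Only lines that start with "VRMF:" are read, so the "Bundle VRMF"
    value in the install history is ignored.
    """
    pattern = r"^\s*VRMF:\s*(" + LEVEL + ")"
    return set(re.findall(pattern, code_levels_text, flags=re.MULTILINE))


def levels_match(herald_text, code_levels_text):
    """True when the herald Code EC equals one of the installed VRMF levels."""
    code_ec = herald_code_ec(herald_text)
    return code_ec is not None and code_ec in installed_vrmfs(code_levels_text)


# Text copied from the examples in this section.
HERALD = """\
IBM System Storage Enterprise Storage Server (TM)
    2107 Model 961            SN 75-YZ581          Server 1  SF75YZ580ESS01
OS Level 7.1.0.403         Code EC 7.7.0.379      Installed on: Apr 23 2012
"""

CODE_LEVELS = """\
SFI Code Levels:
   VRMF: 7.7.0.379 locationCode: 8205-E6C*100B7FR-V1
   VRMF: 7.7.0.379 locationCode: 8205-E6C*100B7ER-V1

CDA Install History: (Most recent successful update)
   Package: SEA.sfi , MTMS: 8205-E6C*100B7ER-V1
   Date: 2012/04/23-03:23, Bundle VRMF: 87.0.97.0 , Package Level: 7.7.0.379, Mode: CCL
"""

if __name__ == "__main__":
    print("Levels match:", levels_match(HERALD, CODE_LEVELS))
```

A result of False corresponds to the "No" branch of step 2, where the LPAR appears to have booted from the wrong multibos image.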