
Replacing a failing / failed HBA on an AIX server.


* * This server has dual paths: there are two HBAs in it, fcs0 and fcs1, and the failed HBA here is fcs1.

^^ Make sure that the failed HBA really is fcs1; even # errpt will give you a clue about it.

# lspath
Enabled hdisk0  scsi0
Missing hdisk1  scsi0
Enabled hdisk2  scsi0
Enabled hdisk3  fscsi0
Enabled hdisk4  fscsi0
Enabled hdisk5  fscsi0
Enabled hdisk6  fscsi0
Missing hdisk3  fscsi1
Missing hdisk4  fscsi1
Missing hdisk5  fscsi1
Missing hdisk6  fscsi1
Enabled hdisk7  fscsi0
Enabled hdisk8  fscsi0
Enabled hdisk9  fscsi0
Enabled hdisk10 fscsi0
Enabled hdisk11 fscsi0
Missing hdisk7  fscsi1
Missing hdisk8  fscsi1
Missing hdisk9  fscsi1
Missing hdisk10 fscsi1
Missing hdisk11 fscsi1
Enabled hdisk12 fscsi0
Missing hdisk12 fscsi1
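If the lspath listing is long, the failing paths can be picked out programmatically instead of by eye. Below is a small sketch (not part of the original procedure) that filters lspath-style output for Missing paths on a given adapter; the sample rows are taken from the listing above and fed in via a here-doc, but in practice you would pipe real “# lspath” output in.

```shell
# Print the hdisks whose path through a given parent adapter is Missing.
missing_on() {
    # $1 = parent adapter (e.g. fscsi1); reads lspath-style lines on stdin
    awk -v parent="$1" '$1 == "Missing" && $3 == parent { print $2 }'
}

# Sample rows from the lspath listing above
missing_on fscsi1 <<'EOF'
Enabled hdisk3  fscsi0
Missing hdisk3  fscsi1
Missing hdisk4  fscsi1
Enabled hdisk12 fscsi0
Missing hdisk12 fscsi1
EOF
# prints: hdisk3, hdisk4, hdisk12 (one per line)
```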

^^ Arrange downtime with the application team and schedule replacement of the failed HBA if it is not hot-swappable. Here the server model is “IBM eServer pSeries 615”, which is not hot-pluggable.
^^ Finding the parent PCI device of fcs1
 # lsdev -C -l fcs1 -F parent
pci11

^^ Changing parameters of “fscsi1” (both dyntrk and fc_err_recov are attributes of the protocol device; fast_fail is the value given to fc_err_recov, not an attribute of its own). If the device is busy, add the -P flag so the change is applied at the next boot.

# chdev -l fscsi1 -a dyntrk=yes
fscsi1 changed

# chdev -l fscsi1 -a fc_err_recov=fast_fail
fscsi1 changed
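Whichever way the attributes are set, it is worth confirming with “# lsattr -El fscsi1” that they took effect. The sketch below is illustrative only: the sample lines mimic lsattr output (attribute name in the first column, current value in the second), and the helper simply checks a value in that format.

```shell
# Check that an attribute has the expected value in `lsattr -El`-style output.
attr_is() {
    # $1 = attribute name, $2 = expected value; lsattr-style lines on stdin
    awk -v a="$1" -v v="$2" '$1 == a && $2 == v { ok = 1 } END { exit ok ? 0 : 1 }'
}

# Illustrative lines in the shape of `lsattr -El fscsi1` output
lsattr_sample='dyntrk       yes       Dynamic Tracking of FC Devices          True
fc_err_recov fast_fail FC Fabric Event Error RECOVERY Policy True'

echo "$lsattr_sample" | attr_is dyntrk yes && echo "dyntrk OK"
# prints: dyntrk OK
```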

^^ Once the failed HBA is replaced and the server is up, find the new HBA’s WWPN and give it to the storage team for zoning / mapping of the LUNs to it. You can find the WWPN with the command below; the WWPN is what storage generally uses for mapping / masking / zoning LUNs to an HBA / FC card.

# lscfg -vl fcs1
 fcs1             U0.1-P1-I4/Q1  FC Adapter

        Part Number..................00P4295
        EC Level.....................A
        Serial Number................1D3310C353
        Manufacturer.................001D
        Feature Code/Marketing ID....5704
        FRU Number...................00P4297
        Device Specific.(ZM).........3
        Network Address..............10000000C935A988  [ WWPN ]
        ROS Level and ID.............02E01991
        Device Specific.(Z0).........2003806D
        Device Specific.(Z1).........00000000
        Device Specific.(Z2).........00000000
        Device Specific.(Z3).........03000909
        Device Specific.(Z4).........FF601416
        Device Specific.(Z5).........02E01991
        Device Specific.(Z6).........06631991
        Device Specific.(Z7).........07631991
        Device Specific.(Z8).........20000000C935A988  [ WWNN ]
        Device Specific.(Z9).........HS1.92A1
        Device Specific.(ZA).........H1D1.92A1
        Device Specific.(ZB).........H2D1.92A1
        Device Specific.(YL).........U0.1-P1-I4/Q1
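The WWPN can also be extracted from the lscfg output directly, which is handy when handing it over to the storage team. A sketch (the helper strips the label and the dot leaders from the Network Address line, which IBM’s documentation identifies as the adapter’s WWPN):

```shell
# Pull the WWPN out of `lscfg -vl fcsX` output (Network Address field).
wwpn_of() {
    grep 'Network Address' | sed 's/.*\.\.*//'
}

# The relevant line from the lscfg listing above
wwpn_of <<'EOF'
        Network Address.............10000000C935A988
EOF
# prints: 10000000C935A988
```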

^^ Now remove the protocol device and its child devices from the device tree / ODM definition, and from the server, with either of the two command sequences below. This step can also be completed before the new HBA is installed (the downtime for the replacement comes later); in fact, doing this before the HBA replacement is the better approach.

# rmdev -dl fscsi1 -R
fscsi1 deleted

# rmdev -dl fcs1 -R
fcnet1 deleted
fcs1 deleted

^^ This single command accomplishes the same task as the two commands above.

# rmdev -Rdl fcs1
fscsi1 deleted
fcnet1 deleted
fcs1 deleted

### Now if you observe “# lspath” or “# lsdev -Cc disk”, only the disks from a single path are shown and all the disks that were showing as “Missing” are gone. The third-party Hitachi multipath software (HDLM) likewise now shows disks from a single path only.

# lsdev -Cc disk
hdisk0  Available 1S-08-00-5,0 16 Bit LVD SCSI Disk Drive
hdisk1  Defined   1S-08-00-8,0 16 Bit LVD SCSI Disk Drive
hdisk2  Available 1S-08-00-8,0 16 Bit LVD SCSI Disk Drive
hdisk3  Available 1Z-08-01     Hitachi Disk Array (Fibre)
hdisk4  Available 1Z-08-01     Hitachi Disk Array (Fibre)
hdisk5  Available 1Z-08-01     Hitachi Disk Array (Fibre)
hdisk6  Available 1Z-08-01     Hitachi Disk Array (Fibre)
hdisk7  Available 1Z-08-01     Hitachi Disk Array (Fibre)
hdisk8  Available 1Z-08-01     Hitachi Disk Array (Fibre)
hdisk9  Available 1Z-08-01     Hitachi Disk Array (Fibre)
hdisk10 Available 1Z-08-01     Hitachi Disk Array (Fibre)
hdisk11 Available 1Z-08-01     Hitachi Disk Array (Fibre)
hdisk12 Available 1Z-08-01     Hitachi Disk Array (Fibre)

# lspath
Enabled hdisk0  scsi0
Missing hdisk1  scsi0    – this is an onboard / local disk inside the server itself.
Enabled hdisk2  scsi0
Enabled hdisk3  fscsi0
Enabled hdisk4  fscsi0
Enabled hdisk5  fscsi0
Enabled hdisk6  fscsi0
Enabled hdisk7  fscsi0
Enabled hdisk8  fscsi0
Enabled hdisk9  fscsi0
Enabled hdisk10 fscsi0
Enabled hdisk11 fscsi0
Enabled hdisk12 fscsi0

/usr/DynamicLinkManager/bin:#  ./dlnkmgr view -lu
Product       : USP_V
SerialNumber  : 0029222
LUs           : 10

iLU    HDevName OSPathID PathID Status
000296 hdisk12  00000    000008 Online
00034D hdisk3   00000    000002 Online
00034F hdisk4   00000    000005 Online
000350 hdisk5   00000    000007 Online
00035B hdisk6   00000    000001 Online
0005F1 hdisk7   00000    000004 Online
0005F2 hdisk8   00000    000003 Online
0005F3 hdisk9   00000    000009 Online
0005F4 hdisk10  00000    000006 Online
0005F5 hdisk11  00000    000000 Online

^^ Now, once the storage team confirms that they’ve completed the mapping of LUNs to the newly provided WWPN of the replaced HBA, do the following.

# cfgmgr             # Scans the device tree and configures newly detected devices.

^^ Once scanning is complete, check the “# lspath” and “# dlnkmgr view -lu” outputs to confirm that the second set of paths shows up and you have dual paths again.
[root@dsearch2:/usr/DynamicLinkManager/bin] lspath
Enabled hdisk0  scsi0
Missing hdisk1  scsi0
Enabled hdisk2  scsi0
Enabled hdisk3  fscsi0
Enabled hdisk4  fscsi0
Enabled hdisk5  fscsi0
Enabled hdisk6  fscsi0
Enabled hdisk3  fscsi1
Enabled hdisk4  fscsi1
Enabled hdisk5  fscsi1
Enabled hdisk6  fscsi1
Enabled hdisk7  fscsi0
Enabled hdisk8  fscsi0
Enabled hdisk9  fscsi0
Enabled hdisk10 fscsi0
Enabled hdisk11 fscsi0
Enabled hdisk7  fscsi1
Enabled hdisk8  fscsi1
Enabled hdisk9  fscsi1
Enabled hdisk10 fscsi1
Enabled hdisk11 fscsi1
Enabled hdisk12 fscsi0
Enabled hdisk12 fscsi1
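A quick sanity check on the restored configuration is to count Enabled paths per parent adapter; with dual paths healthy, the fscsi0 and fscsi1 counts should match. A small sketch (not part of the original procedure) over lspath-style output, with sample rows from the listing above:

```shell
# Count Enabled paths per parent adapter in lspath-style output.
enabled_counts() {
    awk '$1 == "Enabled" { n[$3]++ } END { for (a in n) print a, n[a] }' | sort
}

enabled_counts <<'EOF'
Enabled hdisk0  scsi0
Enabled hdisk3  fscsi0
Enabled hdisk3  fscsi1
Enabled hdisk4  fscsi0
Enabled hdisk4  fscsi1
EOF
# prints: fscsi0 2 / fscsi1 2 / scsi0 1 (one per line)
```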

/usr/DynamicLinkManager/bin:# ./dlnkmgr view -lu
Product       : USP_V
SerialNumber  : 0029222
LUs           : 10

iLU    HDevName OSPathID PathID Status
000296 hdisk12  00000    000008 Online
                00001    000015 Online
00034D hdisk3   00000    000002 Online
                00001    000014 Online
00034F hdisk4   00000    000005 Online
                00001    000012 Online
000350 hdisk5   00000    000007 Online
                00001    000016 Online
00035B hdisk6   00000    000001 Online
                00001    000019 Online
0005F1 hdisk7   00000    000004 Online
                00001    000011 Online
0005F2 hdisk8   00000    000003 Online
                00001    000013 Online
0005F3 hdisk9   00000    000009 Online
                00001    000018 Online
0005F4 hdisk10  00000    000006 Online
                00001    000017 Online
0005F5 hdisk11  00000    000000 Online
                00001    000010 Online
KAPL01001-I The HDLM command completed normally. Operation name = view, completion time = 2014/04/21 03:04:53
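The same verification can be scripted against the HDLM output: every LU should now show two Online paths. A sketch that parses dlnkmgr-style rows (sample rows from the listing above; the continuation rows carry no iLU/HDevName columns, so Online rows are counted against the last hdisk seen):

```shell
# Count Online paths per hdisk in `dlnkmgr view -lu`-style output.
paths_per_disk() {
    awk '$1 ~ /^[0-9A-F]+$/ && $2 ~ /^hdisk/ { dev = $2 }
         $NF == "Online"                     { n[dev]++ }
         END { for (d in n) print d, n[d] }' | sort
}

paths_per_disk <<'EOF'
000296 hdisk12  00000    000008 Online
                00001    000015 Online
00034D hdisk3   00000    000002 Online
                00001    000014 Online
EOF
# prints: hdisk12 2 / hdisk3 2 (one per line)
```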

NOTES on a few things:

> > Dynamic Tracking of the FC adapter driver detects when the Fibre Channel N_Port ID of a device changes. The FC adapter driver then reroutes traffic destined for that device to the new address while the devices are still online. Events that can cause an N_Port ID to change include:

  • Moving a cable between a switch and storage device from one switch port to another.
  • Connecting two separate switches by using an inter-switch link (ISL).
  • Rebooting a switch.

> > When dynamic tracking is disabled, there is a marked difference between the delayed_fail and fast_fail settings of the fc_err_recov attribute. With dynamic tracking enabled, however, the setting of the fc_err_recov attribute is less significant, because there is some overlap between the dynamic tracking and fast fail error-recovery policies; enabling dynamic tracking inherently enables some of the fast fail logic.

The general error-recovery procedure when a device is no longer reachable on the fabric is the same for both fc_err_recov settings with dynamic tracking enabled. The minor difference is that the storage drivers can choose to inject delays between I/O retries if fc_err_recov is set to delayed_fail, which increases the I/O failure time by an additional amount (depending on the delay value and number of retries) before the I/O is permanently failed. With high I/O traffic, the difference between delayed_fail and fast_fail might be more noticeable.
^^ This procedure was tested and worked well on an AIX 5.2 server.
# uname -a
AIX   unixmemis1   2  5   00574ACE4D01
