Friday, January 24, 2014

Perform Storage Maintenance on NetApp Clustered Data ONTAP with ZERO downtime

Welcome: To stay updated with all my Blog posts follow me on Twitter @arunpande !


I am writing this blog to share my experience scheduling a maintenance activity on a NetApp FAS3270 running clustered Data ONTAP. I had to reboot one node that was hosting 500 virtual machines across eight ESXi hosts.


When a storage administrator has to schedule a maintenance activity like a firmware or hardware upgrade that requires a reboot, he has the following options:



Work hard with Traditional Storage
  • Spend hours shutting down the VMs on all eight ESXi hosts.
  • Make sure all VMs are powered off and there is no active I/O, to avoid any application-specific issues.
  • Reboot the controller.
  • Spend several more hours powering on all 500 virtual machines.
  • Spend your weekend completing this maintenance.


Work Smart with Clustered Data ONTAP
  • Use clustered Data ONTAP with LIF migration and SFO (Storage Failover).
  • Perform a takeover/giveback of the controller.
  • Make no changes to the vSphere infrastructure.
  • Migrate the LIFs back to the source node.
  • Complete the maintenance within 10-15 minutes during production hours.


This is the procedure I followed to perform this activity.

 


I have the following cluster configured with 515 VMs.


IMPORTANT: You don’t have to make any changes to your vSphere infrastructure, and you do NOT need any downtime for the VMs.


The following activity has to be performed on your NetApp storage.


Make sure that the cluster is healthy.
lab-f3270::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
lab-filer1            true    true
lab-filer2            true    true
lab-filer3            true    true
lab-filer4            true    true
4 entries were displayed.


Check the Storage Failover settings
lab-f3270::> storage failover show
                             Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
lab-filer1     lab-filer2     true     Connected to lab-filer2
lab-filer2     lab-filer1     true     Connected to lab-filer1
lab-filer3     lab-filer4     true     Connected to lab-filer4
lab-filer4     lab-filer3     true     Connected to lab-filer3
4 entries were displayed.


Enable the advanced privilege mode
lab-f3270::> set adv


Warning: These advanced commands are potentially dangerous; use them only when directed to do so by NetApp personnel.
Do you want to continue? {y|n}: y
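Once the maintenance is finished, you can drop back to the regular admin privilege level so that the dangerous advanced commands are no longer available:

```
lab-f3270::*> set -privilege admin
```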


Check which data LIFs are currently hosted on this node
lab-f3270::*> network interface show -data-protocol nfs|iscsi|fcp -curr-node lab-filer4
           Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
Lab_Vserver
           nfs_lif04    up/up    192.168.40.244/24  lab-filer4    i0a-400 true


Migrate the LIFs to other nodes in the cluster, and verify that none remain on this node
lab-f3270::*> network interface migrate-all -node lab-filer4


lab-f3270::*> network interface show -data-protocol nfs|iscsi|fcp -curr-node lab-filer4
There are no entries matching your query.


IMPORTANT: Create LIF failover groups to perform seamless migration of the LIFs during a link failure or takeover. In this blog post I have shared the steps for manual LIF migration in case you have not configured failover groups. I encourage you to configure failover groups; refer to the Clustered Data ONTAP 8.2 High-Availability Configuration Guide for detailed information.
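As a rough sketch of what that setup looks like in clustered Data ONTAP 8.2: create a failover group containing the ports the LIF may fail over to, then assign the group to the LIF. The group name (fg_data) and port (e0c) below are illustrative, so substitute your own:

```
lab-f3270::> network interface failover-groups create -failover-group fg_data -node lab-filer3 -port e0c
lab-f3270::> network interface failover-groups create -failover-group fg_data -node lab-filer4 -port e0c
lab-f3270::> network interface modify -vserver Lab_Vserver -lif nfs_lif04 -failover-group fg_data
```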


Initiate the takeover of the controller to reboot it.
lab-f3270::*> storage failover takeover -ofnode lab-filer4


The controller now reboots:
lab-filer4% Waiting for PIDS: /usr/sbin/ypbind 722.
Waiting for PIDS: /usr/sbin/rpcbind 688.
Terminated
.
Uptime: 112d2h54m45s
Top Shutdown Times (ms): {if_reset=1161, shutdown_wafl=223(multivol=0, sfsr=0, abort_scan=0, snapshot=0, start=62, sync1=77, sync2=4, mark_fs=80), wafl_sync_tagged=148, shutdown_raid=28, iscsimgt_notify_shutdown_appliance=22, shutdown_fm=15}
Shutdown duration (ms): {CIFS=2607, NFS=2607, ISCSI=2585, FCP=2585}
HALT:  HA partner has taken over (ic) on Fri Jan 24 04:08:38 EST 2014


System rebooting...


Once the reboot is complete and the storage is ready for giveback, initiate the giveback for this controller
lab-f3270::*> storage failover giveback -ofnode lab-filer4


Info: Run the storage failover show-giveback command to check giveback status.
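You can poll the giveback status until it completes; for example, filtering to the node being given back:

```
lab-f3270::*> storage failover show-giveback -node lab-filer4
```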


Revert the LIF to its home node, and confirm it is back on the home port
lab-f3270::*> network interface revert -vserver Lab_Vserver -lif nfs_lif04


lab-f3270::*> network interface show -data-protocol nfs|iscsi|fcp -curr-node lab-filer4
           Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
Lab_Vserver
           nfs_lif04    up/up    192.168.40.244/24  lab-filer4    i0a-400 true


Make sure that the cluster is healthy again. 
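A quick health check at this point is to confirm all nodes are healthy, storage failover is possible again on both HA pairs, and no LIF is stranded away from its home port (an empty result from the last command means every LIF is back home):

```
lab-f3270::> cluster show
lab-f3270::> storage failover show
lab-f3270::> network interface show -is-home false
```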

The entire maintenance activity of rebooting the controller and verifying that it was back online was complete within 10-15 minutes.


IMPORTANT: Make sure you set up the cluster as per best practices; refer to the Clustered Data ONTAP 8.2 documentation for more information.