Friday, January 24, 2014

Perform Storage Maintenance on NetApp Clustered Data ONTAP with ZERO downtime

Welcome: To stay updated with all my Blog posts follow me on Twitter @arunpande !


I am writing this blog to share my experience scheduling a maintenance activity on a NetApp FAS3270 running clustered Data ONTAP. I had to reboot one node that was hosting 500 virtual machines across eight ESXi hosts.


When a storage administrator has to schedule a maintenance activity like a firmware or hardware upgrade that requires a reboot, he has the following options:



Work hard with Traditional Storage
  • Spend hours shutting down the VMs on all eight ESXi hosts.
  • Make sure all VMs are powered off and there is no active I/O, to avoid any application-specific issues.
  • Reboot the controller.
  • Spend several more hours powering on all 500 virtual machines.
  • Spend your weekend completing this maintenance.


Work Smart with Clustered Data ONTAP
  • Use clustered Data ONTAP with LIF migration and SFO (Storage Failover).
  • Perform a takeover/giveback of the controller.
  • Make no changes to the vSphere infrastructure.
  • Migrate the LIFs back to the source node.
  • Complete the maintenance within 10-15 minutes during production hours.


This is the procedure I followed to perform this activity.

 


I have the following cluster configured with 515 VMs.


IMPORTANT: You don’t have to make any changes to your vSphere infrastructure, and you do NOT need any downtime for the VMs.


The following activity has to be performed on your NetApp storage.


Make sure that the cluster is healthy.
lab-f3270::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
lab-filer1            true    true
lab-filer2            true    true
lab-filer3            true    true
lab-filer4            true    true
4 entries were displayed.


Check the Storage Failover settings
lab-f3270::> storage failover show
                             Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
lab-filer1     lab-filer2     true     Connected to lab-filer2
lab-filer2     lab-filer1     true     Connected to lab-filer1
lab-filer3     lab-filer4     true     Connected to lab-filer4
lab-filer4     lab-filer3     true     Connected to lab-filer3
4 entries were displayed.


Enable the advanced privilege mode
lab-f3270::> set adv


Warning: These advanced commands are potentially dangerous; use them only when directed to do so by NetApp personnel.
Do you want to continue? {y|n}: y
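Once the maintenance is finished, you can drop back to the regular admin privilege level so that the dangerous advanced commands are no longer available:

```
lab-f3270::*> set -privilege admin
```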


Check which data LIFs are currently hosted on this node
lab-f3270::*> network interface show -data-protocol nfs|iscsi|fcp -curr-node lab-filer4
           Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
Lab_Vserver
           nfs_lif04    up/up    192.168.40.244/24  lab-filer4    i0a-400 true


Migrate the LIFs to other nodes in the cluster, and verify that none remain on this node
lab-f3270::*> network interface migrate-all -node lab-filer4


lab-f3270::*> network interface show -data-protocol nfs|iscsi|fcp -curr-node lab-filer4
There are no entries matching your query.


IMPORTANT: Create LIF failover groups to perform seamless migration of the LIFs during a link failure or takeover. In this blog post I have shared the steps for manual LIF migration in case you have not configured failover groups. I encourage you to configure failover groups; refer to the Clustered Data ONTAP 8.2 High-Availability Configuration Guide for detailed information.
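As a rough sketch of what that setup looks like in clustered Data ONTAP 8.2: create a failover group containing the ports the LIF may fail over to, then assign the group to the LIF. The group name (fg_data) and port (e0c) below are illustrative, so substitute your own:

```
lab-f3270::> network interface failover-groups create -failover-group fg_data -node lab-filer3 -port e0c
lab-f3270::> network interface failover-groups create -failover-group fg_data -node lab-filer4 -port e0c
lab-f3270::> network interface modify -vserver Lab_Vserver -lif nfs_lif04 -failover-group fg_data
```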


Initiate the takeover of the controller to reboot it.
lab-f3270::*> storage failover takeover -ofnode lab-filer4


The controller now reboots:
lab-filer4% Waiting for PIDS: /usr/sbin/ypbind 722.
Waiting for PIDS: /usr/sbin/rpcbind 688.
Terminated
.
Uptime: 112d2h54m45s
Top Shutdown Times (ms): {if_reset=1161, shutdown_wafl=223(multivol=0, sfsr=0, abort_scan=0, snapshot=0, start=62, sync1=77, sync2=4, mark_fs=80), wafl_sync_tagged=148, shutdown_raid=28, iscsimgt_notify_shutdown_appliance=22, shutdown_fm=15}
Shutdown duration (ms): {CIFS=2607, NFS=2607, ISCSI=2585, FCP=2585}
HALT:  HA partner has taken over (ic) on Fri Jan 24 04:08:38 EST 2014


System rebooting...


Once the reboot is complete and the storage is ready for giveback, initiate the giveback for this controller
lab-f3270::*> storage failover giveback -ofnode lab-filer4


Info: Run the storage failover show-giveback command to check giveback status.
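You can poll the giveback status until it completes; for example, filtering to the node being given back:

```
lab-f3270::*> storage failover show-giveback -node lab-filer4
```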


Revert the LIF to its home node, and confirm it is back on the home port
lab-f3270::*> network interface revert -vserver Lab_Vserver -lif nfs_lif04


lab-f3270::*> network interface show -data-protocol nfs|iscsi|fcp -curr-node lab-filer4
           Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
Lab_Vserver
           nfs_lif04    up/up    192.168.40.244/24  lab-filer4    i0a-400 true


Make sure that the cluster is healthy again. 
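A quick health check at this point is to confirm all nodes are healthy, storage failover is possible again on both HA pairs, and no LIF is stranded away from its home port (an empty result from the last command means every LIF is back home):

```
lab-f3270::> cluster show
lab-f3270::> storage failover show
lab-f3270::> network interface show -is-home false
```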

The entire maintenance activity of rebooting the controller and verifying that it was back online was complete within 10-15 minutes.


IMPORTANT: Make sure you set up the cluster as per best practices; refer to the Clustered Data ONTAP 8.2 documentation for more information.