** Maintenance Announcement – No service interruption anticipated **

We will be applying a configuration change to our iSCSI switches that support our StoreVirtual SAN. This is the storage network that backs our VMware infrastructure.

We do not anticipate any service interruption. Our switching is redundant, we will change the switches one at a time, and the changes should not interrupt service.

Start: 06/01/2013 10:00 PM

End: 06/01/2013 11:00 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

This past Monday we were paged twice more for down events on one of our switches, once at 8 AM and once at 5 PM. This did not impact service, but it is troubling, as we want everything in our environment to be healthy all the time. For a description of the problem we are seeing, take a look at my earlier blog post and its follow-up.

I put in a support call to HP, referencing the older ticket and the recurrence of the problem. Support requested the output of the command show tech all from each switch. I dumped the output and sent it off to the helpful support person. Later that day the support person called back and asked why we were on such a new version of the firmware! I pointed out that it was their own support who gave us that copy of the firmware and told us to run it. At the end of this support call HP came back with two changes. They would like us to add loop protection on the ports that feed our blade centers. They would also like us to reconfigure switch2 at each site so that its trunk ports are statically defined instead of auto-detected/dynamic.

So our next maintenance window is this Saturday and we will perform the following changes:

All switches will get:

config
loop-protect mode port
loop-protect a2
write mem
exit
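
Once loop protection is in place, a read-only check along these lines should confirm it is active on the blade-center-facing port (a sketch; the exact output format varies by firmware version):

show loop-protect a2

If the port is listed with loop protection enabled and its loop-detected count stays at zero, the change took effect without tripping.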

Each switch2 will get (where ? is 1 for site1 and 2 for site2):

config
no interface <port list> lacp
trunk <port list> trk? lacp
write mem
exit
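
After each switch2 is reconfigured, a couple of read-only commands can confirm the result (a sketch; trk? stands in for trk1 or trk2 as above):

show trunks
show lacp

show trunks should now list the member ports under the statically defined trunk, and show lacp should show LACP still active on those ports.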

Part 1 | Part 1 follow up | Part 2 | Part 3 | Part 4

** Maintenance Announcement – No service interruption anticipated **

We will be applying a configuration change to our iSCSI switches that support our StoreVirtual SAN. This is the storage network that back’s our VMware infrastructure.

We do not anticipate any service interruption. Our switching is redundant, we will only change the switches one at a time, and the changes should not be service interrupting.

Start: 05/18/2013 10:00 PM

End: 05/18/2013 11:00 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

Part1.A Post-config follow-up:

We applied the sNTP change last night, and none of the switches would pull time. After some quick searching on the internet we found this: 2910al-48G-can-not-get-time-from-W2K3-NTP-server

From this thread we learn that the firmware we are running (W.14.38) has a bug talking to our management host, which also serves as the NTP server. So while the sNTP config we have now is good, the firmware is not. When we install the new firmware during this coming Saturday's maintenance, we will hopefully fix sNTP and get good timestamps on our logs!

The spanning tree configuration went as expected, and we saw the changes take effect as we made them. We have not had a recurrence of the spanning tree flapping, but we have previously gone a day or two without an event, so we are still playing a wait-and-see game.

Part1.B Pre-firmware questions to HP Support:

I sent the following response in on our open ticket with HP:

HP Support recommended firmware version W.15.10.0010 (and gave us a copy)
I see the current HP stable version is W.15.08.0012
And there is an early release version, W.15.12.0006
(see 2910al firmware download). Is there a particular reason support sent us a version between these two? Which would be the best version to load on the switches?

HP support then replied back with the response:

Hello

This email is regarding the case 4************, for the 2910al also the version W.15.10.0010 is stable but havent been posted on the website, and earliest availability version W.15.12.0006 you are right seems to be a new one but I dont see it under the list of 2910al software release versions that’s why suggest to use the W.15.10.0010

If HP support recommends this version, considers it stable, and will support us running it, then that is what we will do. We just want to be running a current, supported configuration. So we will apply W.15.10.0010 during this Saturday's maintenance window as planned.


Our Problem:

We have an HP StoreVirtual / LeftHand OS multi-site SAN. Part of this SAN is its switching infrastructure, which is built out of four 2910al-24G switches. Each switch has a 10-gig add-on module providing two 10-gig ports.

Each site has two switches joined together via an LACP trunk. Between the sites we have two 10-gig fiber pairs. Hanging off this switching core we have one blade center per site, as well as six P4500 G2 StoreVirtual nodes per site. The blade centers have a pair of 10-gig uplinks (one per switch per side) in an active/passive configuration. Each P4500 G2 node has a pair of 1-gig uplinks (one per switch per side) in an ALB configuration. So our SAN network looks like this:
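
A rough text sketch of that layout, reconstructed from the description above (the assumption that switch1 pairs with switch1 across sites is mine; the original diagram is the authority):

site1-switch1 <== 10-gig fiber ==> site2-switch1
site1-switch2 <== 10-gig fiber ==> site2-switch2
site1-switch1 <== LACP trunk ==> site1-switch2 (likewise at site2)
blade center: one 10-gig uplink to each local switch, active/passive
P4500 G2 nodes (6 per site): one 1-gig uplink to each local switch, ALB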

The configuration on these switches was set up for us by a vendor. At the time we were very new to the StoreVirtual world and needed the help. All was well for over a year! Then, about a week ago, we started to get pages and notifications that a switch was down. This was disconcerting, but when we jumped on the switches all seemed well; we never saw any traffic problems or anything wrong at all. This week we started to get multiple pages per night, and we were not happy. My colleague Josh put a call in to HP support, and they noticed we have a spanning tree problem: which switch thinks it is the root bridge keeps flapping around.

Looking into this problem and others has revealed what seems to be a misconfiguration on our switches. Switch1 in each location is configured; switch2 in each location is auto-detecting its world and has no configuration set other than its local IP address.

So here we are: Part 1 of switch re-configuration. Let's see if we can get these switches configured optimally. Our Part 1 strategy is simply to mitigate the spanning tree flapping. We do not know whether this indicates a hardware error or whether each switch having the same priority is causing the flapping. We also discovered that no NTP was set, so the logs out of the switches are less than useful.

Part1.A Configuration changes on our next Tuesday maintenance window:

All switches will get:

config
timesync sntp
sntp unicast
sntp server priority 1 ***.***.***.1
show sntp
write mem
exit

Then each switch will get a spanning tree priority set. We are going to be as minimally disruptive as possible, so we will keep the current (most often winning) switch as the spanning tree root. In the commands below, X is 1, 2, 3, or 4 depending on which switch it is.

config
spanning-tree clear-debug-counters
spanning-tree priority X
show spanning-tree
write mem
exit

Part1.B Firmware upgrade to latest on our next Saturday maintenance window:

We will be applying the latest firmware to each switch. This is also a bit interesting, so we will follow up with HP support to answer the question: which firmware should we apply?

Current stable: W.15.08.0012
Support provided: W.15.10.0010
Early availability: W.15.12.0006
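
For reference, loading an image on a 2910al generally looks something like the following; the TFTP server address and image filename here are placeholders, not our actual values:

copy tftp flash 10.0.0.50 W_15_10_0010.swi primary
show flash
boot system flash primary

copy tftp flash pulls the image into the primary flash slot, show flash confirms which versions sit in each slot, and boot system flash primary reboots the switch onto the new code.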


** Maintenance Announcement – DEV VM service interruption anticipated **

We are upgrading our 2910al-24g switches to the latest firmware. During the upgrade, the switch that provides storage connectivity for the DEV VMware cluster will be unavailable; as such, the DEV VMware cluster will also be shut down.

Production SANs are redundantly connected and maintenance will have no noticeable effect on these SANs, meaning that the OSU systems used by students, staff, and faculty will not experience a service interruption.

Start: 04/20/2013 at 9:00 PM

End:  04/20/2013 at 11:59 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

** Maintenance Announcement – No service interruption anticipated **

We will be applying a configuration change to our iSCSI switches that support our StoreVirtual SAN. This is the storage network that backs our VMware infrastructure.

We do not anticipate any service interruption. Our switching is redundant, we will change the switches one at a time, and the changes should not interrupt service.

Start: 04/16/2013 10:00 PM

End: 04/16/2013 11:00 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

Last night we upgraded from HP StoreVirtual LeftHand OS 10.0.00.1896 to 10.5.00.0149.

When it came time for the Central Management Console (CMC) to reboot each node, our Linux and Windows hosts noticed their respective gateway connections disappear. Each host retried once, got a new gateway connection from one of the remaining nodes in the cluster, and all was well. This manifested in the logs of the affected hosts as follows:

Windows host:

3/26/2013 11:59:54 PM – Error Event ID 20 – iScsiPrt – Connection to the target was lost. The initiator will attempt to retry the connection.
3/26/2013 11:59:55 PM – Error Event ID 1 – iScsiPrt – Initiator failed to connect to the target. Target IP address and TCP Port number are given in dump data.
3/26/2013 11:59:59 PM – Informational Event ID 34 – iScsiPrt – A connection to the target was lost, but Initiator successfully reconnected to the target. Dump data contains the target name.

Linux Host:

03/27 00:08:15 iscsid: connection3:0 is operational after recovery (1 attempts)
03/27 00:08:14 kernel: [22973221.827841] connection3:0: detected conn error (1020)
03/27 00:08:12 iscsid: Kernel reported iSCSI connection 3:0 error (1020) state (3)
03/27 00:08:12 kernel: [22973219.322744] connection3:0: detected conn error (1020)

Several of our hosts were unlucky and randomly received a new gateway connection on a node that had yet to reboot as part of the LeftHand OS update. They then had a second event where the same thing happened again when it was time for the new node to reboot, leading them to receive yet another gateway connection.
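
On the Linux side, the window in which a host will quietly retry like this before failing I/O up the stack is governed by open-iscsi's replacement timeout. A sketch of the relevant /etc/iscsi/iscsid.conf setting (120 seconds is the shipped default on many distributions; our actual value may differ):

node.session.timeo.replacement_timeout = 120

As long as the reconnect to a new gateway connection completes inside that window, hosts see only the brief log noise above rather than I/O errors.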

What is interesting is that our VMware ESXi 5.1 hosts did not notice their respective gateway connections drop or disappear throughout the reboots of each StoreVirtual cluster.

Throughout the entire LeftHand OS upgrade, no customer-facing service was impacted and all hosts kept on serving.

** Maintenance Announcement – DEV VM service interruption anticipated **

We are upgrading our StoreVirtual firmware from LeftHand OS 10.0 to LeftHand OS 10.5. During the upgrade, the storage that backs the DEV VMware cluster will be unavailable; as such, the DEV VMware cluster will also be shut down.

Production SANs are redundant and maintenance will have no noticeable effect on these SANs, meaning that the OSU systems used by students, staff, and faculty will not experience a service interruption.

Start Time: 03/26/2013 at 10:30 PM

End Time:  03/27/2013 at 4:00 AM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.

** Maintenance Announcement – No service interruption anticipated **

We will be moving node1-site2 and node2-site2 from rack mcc-b5 to rack mcc-b6. We are working toward a standard rack layout, and this will help bring mcc-b6 closer to our anticipated standard for racks with blade centers.

Start: 03/26/2013 9:00 PM

End: 03/26/2013 10:30 PM

If you have questions or concerns about this maintenance, please contact the Shared Infrastructure Group at osu-sig (at) oregonstate.edu or call 737-7SIG.