Our Problem:
We have an HP StoreVirtual / LeftHand OS multi-site SAN. Part of this SAN is its switching infrastructure, which is built from four 2910al-24G switches. Each switch has a 10-gig add-on module providing two 10-gig ports.
Each site has two switches joined together via an LACP trunk. The two sites are linked by two 10-gig fiber pairs. Hanging off this switching core we have one blade center per site, as well as six P4500 G2 StoreVirtual nodes per site. Each blade center has a pair of 10-gig uplinks (one to each of the site's two switches) in an active/passive configuration. Each P4500 G2 node has a pair of 1-gig uplinks (one to each of the site's two switches) in an ALB configuration. So our SAN network looks like this:
The configuration on these switches was set up by a vendor. At the time we were very new to the StoreVirtual world and needed the help. All was well for over a year! Then, about a week ago, we started to get pages and notifications that a switch was down. This was disconcerting, but when we jumped on the switches all seemed well, and we had never seen any traffic problems or anything else wrong. This week we started to get multiple pages per night, and we were not too happy. My colleague Josh put a call in to HP support, and they noticed that we have a spanning tree problem: the switch that thinks it is the root bridge keeps flapping around.
Looking into this problem and others has revealed what appears to be a misconfiguration on our switches. Switch1 at each site is configured. Switch2 at each site is auto-detecting its world and has no configuration other than its local IP address.
So here we are! This is Part 1 of our switch reconfiguration. Let's see if we can get these switches configured optimally. Our Part 1 strategy is simply to mitigate the spanning tree flapping. We do not know whether the flapping indicates a hardware error or whether every switch having the same priority is causing it. We also discovered that NTP was never set up, so the logs from the switches are less than useful.
Part 1.A: Configuration changes during our next Tuesday maintenance window:
All switches will get:
config
timesync sntp
sntp unicast
sntp server priority 1 ***.***.***.1
show sntp
write mem
exit
Then each switch will get a spanning tree priority. We are going to try to be as minimally disruptive as possible, so we will keep the switch that currently (or most often) wins the root election as the spanning tree root by giving it X = 1. X is 1, 2, 3, or 4 depending on the switch, and the lowest priority value wins the root election.
config
spanning-tree clear-debug-counters
spanning-tree priority X
show spanning-tree
write mem
exit
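To make the reasoning behind the priority change concrete, here is a minimal Python sketch of how spanning tree picks a root bridge. It is an illustration, not how the switch firmware works internally. It assumes the ProVision convention that `spanning-tree priority` takes a multiplier from 0 to 15, which the switch multiplies by 4096, with 8 (32768) as the default. The switch names and MAC addresses are hypothetical. With every switch left at the same default priority, the tie is broken by the lowest MAC address. Once X = 1..4 is set, the switch given X = 1 wins no matter what its MAC is.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption: ProVision-style switches take a priority multiplier of 0-15,
# and the effective bridge priority is that multiplier times 4096.
PRIORITY_STEP = 4096
DEFAULT_MULTIPLIER = 8  # 8 * 4096 = 32768, the common STP default


@dataclass(frozen=True)
class Switch:
    name: str
    mac: str  # hypothetical base MAC address
    multiplier: int = DEFAULT_MULTIPLIER

    def __post_init__(self) -> None:
        if not 0 <= self.multiplier <= 15:
            raise ValueError(f"{self.name}: priority multiplier must be 0-15")

    def bridge_id(self) -> tuple[int, int]:
        # Bridge ID = (priority, MAC). Priority is compared first;
        # the MAC address only breaks ties between equal priorities.
        return (self.multiplier * PRIORITY_STEP, int(self.mac.replace(":", ""), 16))


def elect_root(switches: list[Switch]) -> Switch:
    # The bridge with the numerically lowest bridge ID becomes the root.
    return min(switches, key=lambda s: s.bridge_id())


# Hypothetical names and MACs for the four SAN switches.
HYPOTHETICAL = [
    ("site1-switch1", "00:1f:fe:00:00:40"),
    ("site1-switch2", "00:1f:fe:00:00:30"),
    ("site2-switch1", "00:1f:fe:00:00:20"),
    ("site2-switch2", "00:1f:fe:00:00:10"),
]

# Every switch at the default priority: the lowest MAC decides the root.
defaults = [Switch(name, mac) for name, mac in HYPOTHETICAL]

# Priorities X = 1..4: the switch with X = 1 wins regardless of its MAC.
tuned = [Switch(name, mac, x) for x, (name, mac) in enumerate(HYPOTHETICAL, start=1)]

if __name__ == "__main__":
    print("default root:", elect_root(defaults).name)
    print("tuned root:", elect_root(tuned).name)
    for sw in tuned:
        print(f"{sw.name}: spanning-tree priority {sw.multiplier} -> {sw.bridge_id()[0]}")
```

Running it prints `site2-switch2` as the default root (lowest hypothetical MAC) and `site1-switch1` as the tuned root. Spreading X across 1 to 4 rather than setting only the root also gives a predictable order for which switch takes over if the root goes away.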
Part 1.B: Firmware upgrade to the latest version during our next Saturday maintenance window:
We will be applying the latest firmware to each switch. This is also a bit tricky, so we will follow up with HP support to answer the question: which firmware should we apply?
Current stable: W.15.08.0012
Support provided: W.15.10.0010
Early availability: W.15.12.0006