Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades?
Closed, Resolved · Public

Description

I only became fully aware today of a very annoying issue we have in the former 1G racks in codfw. Specifically, it came up while discussing T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G.

That host is in codfw rack A3, which has a shiny new QFX5120 in it. The problem, however, is that every single port on the box has a 1G copper SFP in it, with all hosts connected at 1G:

cmooney@lsw1-a3-codfw> show interfaces descriptions 
Interface       Admin Link Description
ge-0/0/0        up    up   mw2377
ge-0/0/1        up    up   mw2378
ge-0/0/2        up    up   mw2379
ge-0/0/3        up    up   mw2380
ge-0/0/4        up    up   mw2381
ge-0/0/5        up    up   mw2382
ge-0/0/6        up    up   mw2383
ge-0/0/7        up    up   mw2384
ge-0/0/8        up    up   mw2385
ge-0/0/9        up    up   mw2386
ge-0/0/10       up    up   mw2387
ge-0/0/11       up    up   mw2388
ge-0/0/12       up    up   mw2389
ge-0/0/13       up    up   mw2390
ge-0/0/14       down  down DISABLED
ge-0/0/15       up    up   mw2392
ge-0/0/16       up    up   db2142
ge-0/0/17       up    up   mw2393
ge-0/0/18       up    up   mw2394
ge-0/0/19       up    up   mw2395
ge-0/0/20       up    up   mw2396
ge-0/0/21       up    up   db2158
ge-0/0/26       up    up   mw2397
ge-0/0/27       up    up   mw2398
ge-0/0/28       up    up   mw2399
ge-0/0/29       up    up   mw2400
ge-0/0/30       up    up   netmon2002
ge-0/0/31       up    up   mw2291
ge-0/0/32       up    up   mw2292
ge-0/0/33       up    up   kafka-main2006
ge-0/0/34       up    up   es2020
ge-0/0/36       up    up   mw2293
ge-0/0/37       up    up   mw2294
ge-0/0/38       up    up   mw2295
ge-0/0/39       up    up   mw2296
ge-0/0/40       up    up   mw2297
ge-0/0/41       up    up   mw2298
ge-0/0/42       up    up   mw2299
ge-0/0/43       up    up   mw2300
et-0/0/54       up    up   Core: ssw1-a8-codfw:et-0/0/2 {#230403800021}
et-0/0/55       up    up   Core: ssw1-a1-codfw:et-0/0/2 {#230403800027}

As we know the QFX5120 is Trident 3 based, and due to how the ASIC works each group of 4 SFP ports is actually a single 40/100G connection broken out. The result is that adjacent ports in blocks of 4 (0-3, 4-7, and so on) all have to run at the same speed, which is why each of the speed statements below covers a whole block:

cmooney@lsw1-a3-codfw> show configuration chassis | display set 
set chassis fpc 0 pic 0 port 0 speed 1G
set chassis fpc 0 pic 0 port 4 speed 1G
set chassis fpc 0 pic 0 port 8 speed 1G
set chassis fpc 0 pic 0 port 12 speed 1G
set chassis fpc 0 pic 0 port 16 speed 1G
set chassis fpc 0 pic 0 port 20 speed 1G
set chassis fpc 0 pic 0 port 24 speed 1G
set chassis fpc 0 pic 0 port 28 speed 1G
set chassis fpc 0 pic 0 port 32 speed 1G
set chassis fpc 0 pic 0 port 36 speed 1G
set chassis fpc 0 pic 0 port 40 speed 1G

So to complete the upgrade in the above task @Papaul is planning to move the server to another rack. Which is fine, but nobody can deny it's a lot of effort.

To be honest this is quite an annoying problem, and one I hadn't fully thought through. It strikes me that over time we are going to have to do an awful lot of server moves/shuffling as we move/upgrade hosts to 10G. In rows A and B of codfw alone we have 7 switches exactly like this one, with over 200 servers connected at 1G.

One option that occurs to me is that we could use some of the QSFP28 ports on these switches, in channelized mode, to create 10/25G ports that server uplinks could be moved to when a host goes from 1G to 10G, avoiding having to move / re-rack the whole server. As we upgrade hosts and free up more SFP ports we could then reconfigure those for the higher speeds and move hosts back. There would still be more juggling of links than we'd like, but at least we wouldn't be moving servers the whole time. Anyway, just an idea — something like the sketch below.
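To make it concrete, the switch side would look roughly like this — a sketch only, assuming we channelize QSFP28 port 48 with the standard Junos channel-speed knob (the port number is arbitrary and nothing here has been tested on these boxes):

set chassis fpc 0 pic 0 port 48 channel-speed 10g

After a commit the port should come up as four logical interfaces, xe-0/0/48:0 through xe-0/0/48:3, which we could describe and configure like any other server-facing port. Using channel-speed 25g instead should give four 25G legs (et- naming), if the optics support it.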

Event Timeline

cmooney created this task.

We could use these cables but we might not have enough slack to connect to servers at different heights in the rack:

https://www.fs.com/products/48676.html

So probably we should go with these QSFPs:

https://www.fs.com/products/36178.html

With one of these MPO cables connecting the QSFP side:

https://www.fs.com/products/68047.html

And either use couplers or a patch panel to connect the 4 ends to 4 x duplex MMF.
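One nice thing either way is that each leg of the breakout is its own logical interface on the switch, so we can keep the per-server labelling convention — purely illustrative, assuming port 48 is the channelized one and using the wikikube-ctrl hosts as placeholder names:

set interfaces xe-0/0/48:0 description wikikube-ctrl2001
set interfaces xe-0/0/48:1 description wikikube-ctrl2002
set interfaces xe-0/0/48:2 description wikikube-ctrl2003

So at least on the switch side the mapping from breakout leg to server stays explicit for tracing.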

Can we move the cables instead of moving the servers?

For example, ports 44-47 can be used right away at 10/25G.
If the 1G servers on ports 20/21 are moved to 24/25 (or to 44-47 once they are upgraded to 10G), then ports 20-23 can be reconfigured for 10/25G, and we can repeat the same process (rough config sketch below).
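Config-wise, freeing a block should just be a matter of flipping its speed statement — a sketch using the same chassis knob as above (untested; the exact allowed/default speeds on these ports are worth double-checking):

delete chassis fpc 0 pic 0 port 20 speed 1G
set chassis fpc 0 pic 0 port 20 speed 25G

With that, ports 20-23 should come back as 25G (et-) interfaces, or 10G if we set speed 10G instead, ready for the upgraded hosts.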

Using the QSFP28 ports could make sense if all of ports 0-47 were already fully in use, but as we have fewer servers than switch ports that shouldn't be the case.

The downsides of using the QSFP28 ports that I can see are: shared fate (if we need to change one breakout cable, 4 servers need to go down), more complex cabling and configuration, and more moves needed if we need to free up those QSFP28 ports in the future.

@ayounsi perhaps I was a little quick to conclude all the blocks were assigned, you are correct.

The advantage of still using them would be not having to co-ordinate all the downtime with other teams when shuffling the links.

I also realise that in codfw dc-ops match the switch port to the server's rack height, so perhaps they'd have to re-rack the servers regardless to keep that convention (even if within the same rack). Perhaps there is no easy way out of the dilemma.

I’ll leave it open for them to comment and close the task if we’re agreed.

Yes, I don't think this approach will work for codfw. Like @cmooney said, "codfw dc-ops match the switch port to server rack height": this helps when it comes to tracing a cable during troubleshooting, keeps everything organized, and when racking servers it also helps to identify which cable is going where. In one word, it just saves a lot of time for us onsite when we need to do any work in a rack.

cmooney claimed this task.

Cool, thanks @Papaul. I guess we can see how we get on over the next while. If the number of 1G -> 10G upgrades becomes excessive we might have to review this, as moving rack also means the SREs who manage the system have to reimage/renumber it (the rack vlan changes), so it adds an extra layer of complexity for them. If the upgrades are only occasional, hopefully that won't become a major pain point.

I don't think we have a lot of servers right now that have a 10G NIC but are using the 1G NIC. Most of the servers that have a 10G NIC but are using the 1G NIC are DBs. The case I am working on right now is wikikube-ctrl200[1-3]; they requested to add a 10G NIC to those servers. I hope not everybody requests a 10G NIC added to their servers :). Also, all the servers that we will be receiving from now on will have a 10G NIC, so if one needs to be racked in row A or B we just use the 10G NIC. The same will be true when we have rows C and D up. So maybe in 5 years from now, when we do server refreshes, we will have eliminated a lot of those servers with 1G NICs and have room.