All of us depend on the underlying network to be stable, whether in the datacenter or in the cloud. We all have a basic knowledge of how traditional networks run. However, in the past 10 years we've moved to building redundant physical topologies in our networks, optimized our routing methodologies accordingly, moved into the cloud, and gained greater visibility and tunables in the Linux kernel network stack. A lot has changed!
However, the way we troubleshoot the network in relation to the applications we support hasn't adapted. In this session, we'll review the progress that network infrastructure has made, look at specific examples where traditional troubleshooting responses fail us, and demonstrate the need to rethink our approach to making applications and the network interact harmoniously.
3. Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 3 years
American
• Former Network Engineer at the
University of Queensland
9. What are we trying to solve?
Problem Statement
• Network Design – Has evolved
• Network software/ hardware –
Has advanced
• Learning – The average SRE may
not necessarily understand the
ramifications
• Tooling – Has been left behind
10. What this talk is
• A tale of potential pitfalls of modern-day
networks
11. What this talk isn’t
• How to make the network do all the
things…quickly & reliably…
12. What this talk isn’t
• How to make the network do all the
things…quickly & reliably…
• Sorry
24. Advancement of Network Speeds
Speed          Name          Standard   Year
10Mb           10BASE-T      802.3i     1990
100Mb          100BASE-TX    802.3u     1995
1000Mb (1Gb)   1000BASE-T    802.3ab    1999
10Gb           10GBASE       802.3ae    2002
40/100Gb       40GbE/100GbE  802.3ba    2010
25. Advancement of Network Speeds
• What this gives us
• Better transfer bulk speeds
• The ability to have higher concurrency
services (1M connection problem)
• Run multiple high-concurrency
applications (LPS)
31. Advancement of Network Speeds
• Network Interface Cards
• Various RX/TX queue size limits/defaults
• Various interrupt schemes
• Plethora of tunables that vary wildly
• LITTLE TO NO DOCUMENTATION!
• How do you monitor/tune it???
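One place to start on the monitoring question: on Linux, the kernel exposes per-interface counters under /sys/class/net/&lt;iface&gt;/statistics. A minimal sketch of reading them all at once (the sysfs layout is Linux-specific, and `eth0` below is a hypothetical interface name):

```python
import os

def read_nic_stats(iface, sysfs_root="/sys/class/net"):
    """Return every per-interface counter (rx_packets, tx_errors, ...)
    exposed under sysfs for `iface`, as a name -> int dict."""
    stats_dir = os.path.join(sysfs_root, iface, "statistics")
    stats = {}
    for name in sorted(os.listdir(stats_dir)):
        with open(os.path.join(stats_dir, name)) as f:
            stats[name] = int(f.read())
    return stats

# e.g. read_nic_stats("eth0")["rx_dropped"]
```

Sampled periodically and graphed as deltas, counters like rx_dropped or tx_fifo_errors are often the first hint that queue sizing or interrupt handling needs attention.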
32. Advancement of Network Speeds
• Linux Kernel
• Lots of network tunables
• Some defaults assume year ~2000
era hardware
• E.g. net.ipv4.tcp_max_syn_backlog
• Important to understand the type of
application you run and cater your
tunables to that.
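As a concrete example of catering tunables to your application, here is a hypothetical sysctl fragment raising the backlog-related defaults mentioned above. The values are purely illustrative, not recommendations; benchmark against your own workload first:

```
# /etc/sysctl.d/99-network-tuning.conf  (illustrative values only)
net.ipv4.tcp_max_syn_backlog = 4096  # half-open connection queue; default can be as low as 128
net.core.somaxconn = 4096            # kernel cap on any socket's listen() backlog
net.core.netdev_max_backlog = 8192   # per-CPU queue for packets arriving faster than the kernel drains them
```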
33. Advancement of Network Speeds
• Network switches
• Similarly to interfaces and Linux
software, there’s a lot of options
• Deep Buffers
• DSCP marking
• Switching latency
• DCTCP
36. IPv6: Address Space
• Moving from a 32-bit address space to
128-bit.
• ~4 billion addresses → ~340 trillion
trillion trillion (3.4 × 10^38) addresses
• Read up on IPv6 addressing
representation
• RFC-5952
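The scale difference is easy to understate, so here's a quick sanity check on the numbers:

```python
ipv4 = 2 ** 32    # IPv4 address space
ipv6 = 2 ** 128   # IPv6 address space

print(f"{ipv4:,}")          # 4,294,967,296 (~4.3 billion)
print(f"{ipv6:.2e}")        # 3.40e+38
print(f"{ipv6 // ipv4:,}")  # 2**96: ~7.9e28 entire IPv4 internets
```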
37. IPv6: Address Space
A SINGLE ADDRESS CAN BE REPRESENTED MANY WAYS
2001:db8:0:0:1:0:0:1
2001:0db8:0:0:1:0:0:1
2001:db8::1:0:0:1
2001:db8::0:1:0:0:1
2001:0db8::1:0:0:1
2001:db8:0:0:1::1
2001:db8:0000:0:1::1
2001:DB8:0:0:1::1
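All eight strings on this slide parse to the same address. Python's stdlib `ipaddress` module, which emits the RFC 5952 recommended form, is a handy way to canonicalize before comparing or deduplicating addresses:

```python
import ipaddress

forms = [
    "2001:db8:0:0:1:0:0:1",
    "2001:0db8:0:0:1:0:0:1",
    "2001:db8::1:0:0:1",
    "2001:db8::0:1:0:0:1",
    "2001:0db8::1:0:0:1",
    "2001:db8:0:0:1::1",
    "2001:db8:0000:0:1::1",
    "2001:DB8:0:0:1::1",
]
# Parse each spelling and render it back in canonical form.
canonical = {str(ipaddress.IPv6Address(f)) for f in forms}
assert len(canonical) == 1          # every spelling is the same address
print(next(iter(canonical)))        # 2001:db8::1:0:0:1
```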
38. IPv6: Address Space
YOU CAN MAKE FUN PHRASES
• :cafe:beef
• :feed:f00d:
• :bad:f00d:
• :bad:beef:
• :bad:d00d:
• :f00d:cafe:
• :bad:fa11:
39. IPv6: Address Space
OR CLEVER ADVERTISING
[mkehoe@mkehoe ~]$ host -6 www.facebook.com
www.facebook.com is an alias for star-mini.c10r.facebook.com.
star-mini.c10r.facebook.com has IPv6 address
2a03:2880:f113:8083:face:b00c:0:25de
40. IPv6: Address Space
SPECIAL ADDRESSES: IPV4
RFC        IP Block                                      Use
1918       10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16    Private IP Addressing
6890/3927  169.254.0.0/16                                Link-Local
5771/2365  224.0.0.0/4                                   Multicast
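Rather than memorizing these blocks, you can test membership programmatically. A small sketch (the table and function names are mine, not a standard API) using the stdlib `ipaddress` module:

```python
import ipaddress

SPECIAL_V4 = {
    "private (RFC 1918)": ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"],
    "link-local (RFC 3927)": ["169.254.0.0/16"],
    "multicast (RFC 5771)": ["224.0.0.0/4"],
}

def classify(ip):
    """Return which special-use block an IPv4 address falls in, if any."""
    addr = ipaddress.ip_address(ip)
    for use, blocks in SPECIAL_V4.items():
        if any(addr in ipaddress.ip_network(block) for block in blocks):
            return use
    return "global"

print(classify("10.1.2.3"))      # private (RFC 1918)
print(classify("169.254.0.42"))  # link-local (RFC 3927)
print(classify("8.8.8.8"))       # global
```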
41. IPv6: Address Space
SPECIAL ADDRESSES: IPV6
IP Block Use
::/128 Unspecified Address
::1/128 Loopback address
::ffff:0:0/96 IPv4 mapped addresses
64:ff9b::/96 IPv4/IPv6 translation
fc00::/7 Unique Local Address
fe80::/10 Link-Local address
ff00::/8 Multicast addresses
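The stdlib knows most of these IPv6 blocks directly, via flag properties on the address objects. A quick sketch (fd12::1 is an arbitrary example address inside the locally-assigned half of fc00::/7):

```python
import ipaddress

assert ipaddress.ip_address("::").is_unspecified      # ::/128
assert ipaddress.ip_address("::1").is_loopback        # ::1/128
assert ipaddress.ip_address("fe80::1").is_link_local  # fe80::/10
assert ipaddress.ip_address("ff02::1").is_multicast   # ff00::/8
assert ipaddress.ip_address("fd12::1").is_private     # inside fc00::/7 (ULA)
```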
44. IPv6: No NAT
• No need for NAT anymore
• Simplified Configuration
• Less points-of-failure
• Potential for better performance
• NAT is slow
• Harder for abusers to hide behind NAT
46. IPv6: Better Performance
• The elimination of NAT is a significant
factor
• Generally fewer hops across the internet
for IPv6 vs IPv4
• Simplified Header gives small amount of
optimization
48. Summary
• Don’t implicitly trust the network!
• Understand where your packets flow
• End-to-End monitoring of your network. It
is the lifeblood of your infrastructure
• For any network infrastructure changes,
ensure you understand how to
benchmark and monitor it!
So today I want to briefly talk about what this talk is about and what I hope to achieve by the end of this session.
I then want to quickly review some of the basics of the internet and networks.
Then talk about three specific advances in networks.
So what’s the problem we’re trying to solve in this space:
If there’s one thing I would like you all to get out of this talk, it is:
Don’t trust any part of the network
Tier 1:
AT&T
Level3
Tata
Telecom Italia
Telefonica
We have our Layer 7 application layer, which contains the application protocols that we use daily: HTTP, DNS, SSH, SMTP and, somewhat importantly, BGP.
We have the Layer 4 transport layer, which is where our TCP & UDP protocols live.
We have the Layer 3 IP or internet layer; this is where the IP protocol lives, along with ICMP and a number of other important protocols including IPsec, OSPF & RIP.
We have our Layer 2 data-link layer, which provides the functional means to transfer data between entities. This is where the Ethernet (802.3) protocol sits.
And finally we have the physical layer which we’ll talk about in a few minutes
So in the last 10 years or so we’ve finally started to see an advancement in the implementation of networks, particularly in the following areas
Clos Networks
Advancement of network speeds
Eventual implementation of IPv6 within networks and on the internet
Multi-homed internet connections
Using BGP as an Interior Routing Protocol
All of these things have brought their own set of unique challenges to the way we operate the network, but also to the applications we as SREs run on top of it.
So let’s talk about these
Clos Networks, named after Charles Clos who formalized this design in 1952.
The Clos Network design actually started out as a multi-stage switching system for telephone systems. Funnily enough, the original “key advantage” of this design was to increase capacity and reduce bottlenecks in switching devices.
Fast-forward approximately 60 years, Network Engineers started to use Clos topology in datacenter networks. In a fashion similar to what you see on the screen.
The interesting thing about the typical implementation of the Clos (Spine/Leaf) topology is that instead of it being a switching network (A Layer 2 network), It’s a Routed Network (A Layer 3 network).
As an aside, Clos networks can be represented in a number of different ways.
In the three representations shown here, The spine planes are all connected in the same way, just arranged differently.
So now we have traffic being routed across multiple links (no L2 protocols here: no spanning tree or LACP). We are using what's known as Equal-Cost Multipath routing, or ECMP. So what does this mean for us as SREs? Simply put, the route from server A to server B (within a datacenter or fabric) could be any one of 16, 64 or 256 different paths, making "why are my packets not making it to server B?" a difficult problem to troubleshoot.
According to the Paris-Traceroute research paper, ECMP flows are load-balanced using a set of five fields (Source/ Destination IP’s, Ports and Type of Service).
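As a mental model (not any vendor's actual algorithm), you can think of ECMP path selection as a stable hash of those flow fields, taken modulo the number of equal-cost next hops:

```python
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, tos, num_paths):
    """Toy ECMP selector: hash the five flow fields, pick one of N paths.
    Real switches use proprietary (often CRC-based) hardware hashes."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{tos}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# The same flow always takes the same path...
a = ecmp_path("10.0.0.1", "10.1.0.9", 49152, 443, 0, 16)
assert a == ecmp_path("10.0.0.1", "10.1.0.9", 49152, 443, 0, 16)
# ...but change one field (e.g. the source port of a retried connection)
# and the flow may land on a different path, which is why a single
# traceroute run only ever shows you one of many possible routes.
```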
Unfortunately, unless you have a SDN controller that’s aware of these flows, it’s not possible to identify the path of application traffic in real time.
So where does that leave us for troubleshooting poor connectivity between servers.
Unfortunately, for the most part, traditional tools like ping, traceroute and even MTR aren't useful. Using the default options on these tools will only let you discover one path out of potentially hundreds.
There are two utilities that have made ground in this area:
Dublin-traceroute, which draws the paths it discovers
fbtracert, by Facebook, which is written in Go
Hopefully in the near future, we can bring a similar utility to LinkedIn
As you can see, since the 1990’s, we’ve been growing our LAN network speeds every few years. In the space of 20 years, we’ve gone from 10Mb Ethernet over copper wires to 100Gb over optical fibers.
As internet backbone speeds have grown, so have the speeds on our desktops and of course on our servers.
So to think you're going to get 10Gb out of the box is somewhat of a pipedream, unfortunately; some optimization and forethought are required.
So for this to work harmoniously together, there’s three components that need to work together
NIC
Linux Kernel
Network Switches
So when you look at the NIC side of the equation, there’s so many variables
I suggest you check out Joe Damato's talk from Monitorama 2016, where he talks about why statistics collection for network devices in Linux is probably wrong.
Standard TCP congestion control relies on packet loss to detect congestion. DCTCP instead uses ECN marks applied by switches, so the sender can react to queue build-up before packets are dropped.
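DCTCP's reaction is proportional rather than TCP's fixed halve-on-loss: the sender keeps an EWMA, alpha, of the fraction of ECN-marked packets per window, and cuts cwnd by alpha/2. A sketch of the update rules from the DCTCP paper (g is the EWMA gain, typically 1/16):

```python
def dctcp_update_alpha(alpha, marked, total, g=1.0 / 16):
    """alpha <- (1 - g) * alpha + g * F, where F is the fraction of
    packets that were ECN-marked in the last window of data."""
    f = marked / total
    return (1 - g) * alpha + g * f

def dctcp_cwnd(cwnd, alpha):
    """On congestion, cut cwnd by alpha/2 instead of TCP's fixed 1/2."""
    return cwnd * (1 - alpha / 2)

# Mild congestion (10% of packets marked) barely dents the window,
# while alpha -> 1 (everything marked) converges to TCP's halving.
alpha = dctcp_update_alpha(0.0, marked=10, total=100)
print(round(dctcp_cwnd(100, alpha), 2))  # 99.69
print(dctcp_cwnd(100, 1.0))              # 50.0
```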