- 1. Multi-Tenant Isolation and Network Virtualization in Cloud Data Centers
Raj Jain
Washington University in Saint Louis
Saint Louis, MO 63130
Jain@cse.wustl.edu
These slides and audio/video recordings of this class lecture are at:
http://www.cse.wustl.edu/~jain/cse570-13/
Washington University in St. Louis
http://www.cse.wustl.edu/~jain/cse570-13/
13-1
©2013 Raj Jain
- 2. Overview
1. NVO3
2. VXLAN
3. NVGRE
4. STT
Note: Data center interconnection and LAN extension techniques are covered in another module, which includes OTV, TRILL, and LISP.
- 3. Network Virtualization
Network virtualization allows tenants to form an overlay network in a multi-tenant network such that each tenant can control:
1. Connectivity layer: The tenant network can be L2 while the provider is L3, and vice versa
2. Addresses: MAC addresses and IP addresses
3. Network partitions: VLANs and subnets
4. Node location: Move nodes freely
Network virtualization allows providers to serve a large number of tenants without worrying about:
1. Internal addresses used in client networks
2. Number of client nodes
3. Location of individual client nodes
4. Number and values of client partitions (VLANs and subnets)
The network could be a single physical interface, a single physical machine, a data center, a metro network, …, or the global Internet.
The provider could be a system owner, an enterprise, a cloud provider, or a carrier.
- 4. Network Virtualization Techniques

Entity               | Partitioning             | Aggregation/Extension/Interconnection**
NIC                  | SR-IOV                   | MR-IOV
Switch               | VEB, VEPA                | VSS, VBE, DVS, FEX
L2 Link              | VLANs                    | LACP, Virtual PortChannels
L2 Network using L2  | VLAN                     | PB (Q-in-Q), PBB (MAC-in-MAC), PBB-TE, Access-EPL, EVPL, EVP-Tree, EVP-LAN
L2 Network using L3  | NVO3, VXLAN, NVGRE, STT  | MPLS, VPLS, A-VPLS, H-VPLS, PWoMPLS, PWoGRE, OTV, TRILL, LISP, L2TPv3, EVPN, PBB-EVPN
Router               | VDCs, VRF                | VRRP, HSRP
L3 Network using L1  |                          | GMPLS, SONET
L3 Network using L3* | MPLS, GRE, PW, IPSec     | MPLS, T-MPLS, MPLS-TP, GRE, PW, IPSec
Application          | ADCs                     | Load Balancers

*All L2/L3 technologies for L2 network partitioning and aggregation can also be used for L3 network partitioning and aggregation, respectively, by simply putting L3 packets in L2 payloads.
**The aggregation technologies can also be seen as partitioning technologies from the provider's point of view.
- 5. NVO3
Network Virtualization Overlays using L3 techniques
Problem: Data Center Virtual Private Networks (DCVPNs) in a multi-tenant data center
Issues:
Scale in number of networks: Hundreds of thousands of DCVPNs in a single administrative domain
Scale in number of nodes: Millions of VMs on hundreds of thousands of physical servers
VM (or pM) migration
Support for both L2 and L3 VPNs
Dynamic provisioning
Addressing independence
Virtual and private ⇒ Other tenants do not see your frames
Optimal forwarding (VRRP is inefficient in a large network)
Ref: Network Virtualization Overlays (nvo3) charter, http://datatracker.ietf.org/wg/nvo3/charter/
- 6. NVO3 Goals
Develop a general architectural framework
Identify key functional blocks
Identify alternatives for each functional block
Deployments can mix and match these alternatives
Analyze which requirements are satisfied by different alternatives
Operation, Administration, and Management (OAM)
- 7. NVO3 Terminology
Tenant System (TS): A VM or pM
Virtual Network (VN): An L2 or L3 tenant network
Network Virtualization Edge (NVE): An entity connecting TSs (a virtual or physical switch/router)
Network Virtualization Authority (NVA): Manages forwarding information for a set of NVEs
NV Domain: The set of NVEs under one authority (NVA)
NV Region: A set of domains that share some information (to support VNs that span multiple domains)
[Figure: TSs attach to NVEs, which form per-tenant VNs; each NVA manages the NVEs of one domain; a region spans multiple domains]
- 8. NVO3 Components
Underlay network: The network over which the overlay network service is provided
Orchestration systems: Create new VMs and associated vSwitches and other networking entities and properties. May share this information with NVAs.
NVEs could be in vSwitches, in external pSwitches, or span both.
The NVA could be distributed, or centralized and replicated.
NVEs get information from hypervisors and/or the NVA.
Hypervisor-to-NVE protocol (data-plane learning)
NVE-NVA protocol: Push or pull (on-demand) model. Control-plane learning.
Map and encap: Find the destination NVE (map) and send (encap)
Ref: T. Narten, et al., “An Architecture for Overlay Networks (NVO3),” http://datatracker.ietf.org/doc/draft-narten-nvo3-arch/
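The map-and-encap step can be sketched as a table lookup followed by encapsulation. The table contents, function names, and the 3-byte stand-in overlay header below are illustrative assumptions, not part of the NVO3 drafts:

```python
# Illustrative map-and-encap sketch; the table, names, and the
# stand-in overlay header are hypothetical, not from NVO3.
FORWARDING_TABLE = {
    # (VN ID, inner destination MAC) -> underlay IP of the remote NVE
    (34, "00:1b:21:aa:bb:01"): "192.0.2.10",
    (34, "00:1b:21:aa:bb:02"): "192.0.2.20",
}

def map_and_encap(vni: int, dst_mac: str, inner_frame: bytes):
    """Map: find the destination NVE; Encap: prepend an overlay
    header carrying the VN ID, then send over the underlay."""
    nve_ip = FORWARDING_TABLE.get((vni, dst_mac))  # pushed or pulled from the NVA
    if nve_ip is None:
        # Unknown TS: flood via multicast or query the NVA on demand
        raise KeyError("unknown tenant system")
    overlay_header = vni.to_bytes(3, "big")        # stand-in for a VNI/VSID header
    return nve_ip, overlay_header + inner_frame
```

A pull-model NVE would replace the dictionary miss with an on-demand query to the NVA; a push-model NVA pre-populates the table.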
- 9. Current NVO Technologies
BGP/MPLS IP VPNs: Widely deployed in enterprise networks. Difficult in data centers because hosts/hypervisors do not implement BGP.
BGP/MPLS Ethernet VPNs: Deployed in carrier networks. Difficult in data centers.
802.1Q, PB, PBB VLANs
Shortest Path Bridging: IEEE 802.1aq
Virtual Station Interface (VSI) Discovery and Configuration Protocol (VDP): IEEE 802.1Qbg
Address Resolution for Massive numbers of hosts in the Data center (ARMD): RFC 6820
TRILL
L2VPN: Provider-provisioned L2 VPNs
Proxy Mobile IP: Does not support multi-tenancy
LISP: RFC 6830
- 10. GRE
Generic Routing Encapsulation (RFC 1701/1702)
Generic X over Y for any X or Y
Over IPv4, GRE packets use protocol number 47
Optional checksum, loose/strict source routing, and key
The key is used to authenticate the source
Recursion control: Number of additional encapsulations allowed. 0 ⇒ Restricted to a single provider network end-to-end
Offset: Points to the next source-route field to be used
IP or IPSec is commonly used as the delivery header
Packet format: Delivery Header | GRE Header | Payload
GRE header fields (in order):
Checksum Present (1b) | Routing Present (1b) | Key Present (1b) | Seq # Present (1b) | Strict Source Route (1b) | Recursion Control (3b) | Flags (5b) | Version (3b) | Protocol Type (16b) | Checksum (16b) | Offset (16b) | Key (32b) | Seq # (32b) | Source Route List (variable)
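As a sketch, the base GRE header above can be assembled with Python's struct module. This minimal version supports only the Key and Sequence Number options and omits the checksum and routing fields:

```python
import struct

def gre_header(proto_type, key=None, seq=None):
    """Minimal RFC 1701 GRE header: 16-bit flag/version word,
    16-bit protocol type, then optional Key and Seq # fields."""
    flags = 0                    # C, R, S(strict), recursion bits left 0
    if key is not None:
        flags |= 0x2000          # K: Key Present (bit 13)
    if seq is not None:
        flags |= 0x1000          # S: Seq # Present (bit 12)
    hdr = struct.pack("!HH", flags, proto_type)   # Version = 0
    if key is not None:
        hdr += struct.pack("!I", key)
    if seq is not None:
        hdr += struct.pack("!I", seq)
    return hdr

# Example: IPv4-in-GRE (protocol type 0x0800) with an authentication key
hdr = gre_header(0x0800, key=0xDEADBEEF)
```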
- 11. EoMPLSoGRE
Ethernet over MPLS over GRE (point-to-point)
VPLS over MPLS over GRE (multipoint-to-multipoint)
Used when the provider offers only L3 connectivity
Subscribers use their own MPLS-over-GRE tunnels
VPLSoGRE or Advanced-VPLSoGRE can also be used
GRE offers an IPSec encryption option
[Figure: Customer MPLS Router ⟷ Provider Edge Router ⟷ Provider Core Router(s) ⟷ Provider Edge Router ⟷ Customer MPLS Router; the GRE tunnel across the provider network carries IP | GRE | MPLS | Ethernet]
- 12. NVGRE
Ethernet over GRE over IP (point-to-point)
A unique 24-bit Virtual Subnet Identifier (VSID) is carried in 24 bits of the GRE Key field ⇒ 2^24 tenants can share the network
A unique IP multicast address is used for BUM (Broadcast, Unknown, Multicast) traffic on each VSID
Equal-Cost Multipath (ECMP) is allowed on point-to-point tunnels
[Figure: Customer Edge Switch ⟷ Provider Edge Router ⟷ Provider Core Router(s) ⟷ Provider Edge Router ⟷ Customer Edge Switch; the tunnel across the provider network carries IP | GRE | Ethernet]
Ref: M. Sridharan, “NVGRE: Network Virtualization using GRE,” Aug 2013, http://tools.ietf.org/html/draft-sridharan-virtualization-nvgre-03
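The NVGRE encapsulation above is GRE with the Key Present bit set and the payload type Transparent Ethernet Bridging. A sketch of the outer header, assuming the draft's Key-field layout of VSID in the upper 24 bits and an 8-bit FlowID in the lower 8:

```python
import struct

def nvgre_header(vsid, flow_id=0):
    """NVGRE outer header sketch: GRE with the K bit set, protocol
    type 0x6558 (Transparent Ethernet Bridging), and the Key field
    carrying the 24-bit VSID plus an 8-bit FlowID (layout assumed
    from draft-sridharan-virtualization-nvgre)."""
    assert 0 <= vsid < 2**24 and 0 <= flow_id < 2**8
    flags = 0x2000                       # K: Key Present
    key = (vsid << 8) | flow_id          # VSID in the high 24 bits
    return struct.pack("!HHI", flags, 0x6558, key)
```

The FlowID gives intermediate routers per-flow entropy for ECMP without touching the tenant identifier.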
- 13. NVGRE (Cont)
In a cloud, a pSwitch or a vSwitch can serve as the tunnel endpoint
VMs need to be in the same VSID to communicate
VMs in different VSIDs can have the same MAC address
The inner IEEE 802.1Q tag, if present, is removed
[Figure: Two virtual subnets overlaid on three physical subnets (192.168.1.X, 192.168.2.X, 192.168.3.X, connected via the Internet): VMs 10.20.1.1 through 10.20.1.6 form virtual subnet 10.20.1.X, and VMs 10.20.2.1 through 10.20.2.6 form virtual subnet 10.20.2.X, spread across the physical subnets]
Ref: Emulex, “NVGRE Overlay Networks: Enabling Network Scalability,” Aug 2012, 11pp.,
http://www.emulex.com/artifacts/074d492d-9dfa-42bd-9583-69ca9e264bd3/elx_wp_all_nvgre.pdf
- 14. VXLAN
Virtual eXtensible Local Area Network (VXLAN)
An L3 solution to isolate multiple tenants in a data center (the L2 solutions are Q-in-Q and MAC-in-MAC)
Developed by VMware. Supported by many companies in the IETF NVO3 working group.
Problem:
4096 VLANs are not sufficient in a multi-tenant data center
Tenants need to control their MAC, VLAN, and IP address assignments ⇒ overlapping MAC, VLAN, and IP addresses
Spanning tree is inefficient with a large number of switches ⇒ too many links are disabled
Better throughput with IP Equal-Cost Multipath (ECMP)
Ref: M. Mahalingam, “VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” draft-mahalingam-dutt-dcops-vxlan-04, May 8, 2013, http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-04
- 15. VXLAN Architecture
Creates a virtual L2 overlay (called a VXLAN) over L3 networks
2^24 VXLAN Network Identifiers (VNIs)
Only VMs in the same VXLAN can communicate
vSwitches serve as VTEPs (VXLAN Tunnel End Points). They encapsulate L2 frames in UDP over IP and send them to the destination VTEP(s).
Segments may have overlapping MAC addresses and VLANs, but L2 traffic never crosses a VNI
[Figure: Three tenant virtual L2 networks overlaid on one L3 network]
- 16. VXLAN Deployment Example
Example: Three tenants. 3 VNIs. 4 tunnels for unicast + 3 tunnels for multicast (not shown).
Hypervisor VTEP IP1: VM1-1 (VNI 34), VM2-1 (VNI 22)
Hypervisor VTEP IP2: VM2-2 (VNI 22), VM1-2 (VNI 34), VM3-1 (VNI 74)
Hypervisor VTEP IP3: VM3-2 (VNI 74), VM1-3 (VNI 34)
All three hypervisors are connected over an L3 network.
- 17. VXLAN Encapsulation Format
The outer VLAN tag is optional. It is used to isolate VXLAN traffic on the LAN.
The source VM ARPs to find the destination VM’s MAC address.
All L2 multicasts/unknowns are sent via IP multicast.
The destination VM sends a standard IP unicast ARP response.
The destination VTEP learns the inner-Src-MAC-to-outer-Src-IP mapping ⇒ avoids unknown-destination flooding for returning responses
Outer: Dest VTEP MAC | Src VTEP MAC | Outer VLAN | Dest VTEP IP | Src VTEP IP | UDP Header | VXLAN Header | Inner: Dest VM MAC | Src VM MAC | Tenant VLAN | Ethernet Payload
VXLAN Header: Flags (8b) | Reserved (24b) | VNI (24b) | Reserved (8b)
Only key fields are shown.
- 18. VXLAN Encapsulation Format (Cont)
IGMP is used to prune multicast trees
7 of the 8 bits in the flags field are reserved. The I flag bit is set if the VNI field is valid.
The UDP source port is a hash of the inner MAC header ⇒ allows load balancing via Equal-Cost Multipath using L3-L4 header hashing
VMs are unaware of whether they are operating over a VLAN or a VXLAN
VTEPs need to learn the MAC addresses of other VTEPs and of client VMs of the VNIs they are handling.
A VXLAN gateway switch can forward traffic to/from non-VXLAN networks, encapsulating or decapsulating the packets.
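The 8-byte VXLAN header and the inner-header-hashed source port described above can be sketched as follows; the hash function and the ephemeral port range here are illustrative choices, not mandated values:

```python
import zlib

def vxlan_header(vni):
    """8-byte VXLAN header: Flags (I bit set = VNI valid),
    24 reserved bits, 24-bit VNI, 8 reserved bits."""
    assert 0 <= vni < 2**24
    flags = 0x08                         # I bit; other 7 flag bits reserved
    word0 = flags << 24                  # flags + 24 reserved bits
    word1 = vni << 8                     # VNI + 8 reserved bits
    return word0.to_bytes(4, "big") + word1.to_bytes(4, "big")

def vxlan_src_port(inner_frame):
    """UDP source port derived from a hash of the inner MAC header,
    so ECMP spreads flows; hash and port range are assumptions."""
    h = zlib.crc32(inner_frame[:14])     # inner dst/src MAC + EtherType
    return 49152 + (h % 16384)           # ephemeral range 49152-65535
```

Because all packets of one inner flow hash to the same source port, they follow one underlay path while different flows spread across paths.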
- 19. VXLAN: Summary
VXLAN solves the problem of multiple tenants with overlapping MAC addresses, VLANs, and IP addresses in a cloud environment.
A server may have VMs belonging to different tenants.
No changes to VMs; hypervisors are responsible for all details.
Uses UDP-over-IP encapsulation to isolate tenants.
- 20. Stateless Transport Tunneling Protocol (STT)
Ethernet over TCP-like over IP tunnels. GRE or IPSec tunnels can also be used if required.
Tunnel endpoints may be inside the end systems (vSwitches).
Designed for large storage blocks of up to 64 KB ⇒ fragmentation allowed.
Most other overlay protocols use UDP and disallow fragmentation ⇒ Maximum Transmission Unit (MTU) issues.
TCP-like = Stateless TCP: Header identical to TCP (same protocol number, 6) but no 3-way handshake, no connections, no windows, no retransmissions, no congestion state. Stateless transport (recognized by a standard port number).
Broadcast, Unknown, Multicast (BUM) traffic is handled by IP multicast tunnels.
Ref: B. Davie and J. Gross, "A Stateless Transport Tunneling Protocol for Network Virtualization (STT)," Sep 2013,
http://tools.ietf.org/html/draft-davie-stt-04
- 21. LSO and LRO
Large Send Offload (LSO): The host hands a large chunk of data plus metadata to the NIC. The NIC makes MSS-sized segments and adds checksum, TCP, IP, and MAC headers to each segment.
Large Receive Offload (LRO): NICs attempt to reassemble multiple TCP segments and pass larger chunks to the host. The host does the final reassembly with fewer per-packet operations.
STT takes advantage of LSO and LRO features, if available.
Using a protocol number other than 6 would not allow LSO/LRO to handle STT.
[Figure: The host passes metadata plus a large payload; LSO emits L2 Header | IP Header | TCP Header | Segment packets on transmit, and LRO coalesces them on receive]
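The LSO split described above can be sketched in a few lines; the function name and MSS value are illustrative, and header construction is omitted:

```python
def lso_segment(payload, mss):
    """What an LSO-capable NIC does conceptually: cut one large
    host buffer into MSS-sized segments, each of which then gets
    its own checksum and TCP/IP/MAC headers (not shown here)."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

# A 4000-byte buffer with a typical Ethernet MSS of 1460 bytes
segments = lso_segment(b"x" * 4000, 1460)
```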
- 22. STT Optimizations
[Figure: VM ⟷ vNIC ⟷ vSwitch ⟷ NIC ⟷ underlay network ⟷ NIC ⟷ vSwitch ⟷ vNIC ⟷ VM]
Large data size ⇒ less overhead per payload byte
Context ID: 64-bit tunnel end-point identifier
Optimizations:
2-byte padding is added to Ethernet frames to make their size a multiple of 32 bits
The source port is a hash of the inner header ⇒ ECMP, with each flow taking a different path and all packets of a flow taking one path
No protocol type field ⇒ the payload is assumed to be Ethernet, which can carry any payload identified by its protocol type
- 23. STT Frame Format
16-bit MSS ⇒ 2^16 B = 64 KB maximum
L4 Offset: From the end of the STT header to the start of the encapsulated L4 (TCP/UDP) header ⇒ helps locate the payload quickly
Checksum Verified: Checksum covers the entire payload and is valid
Checksum Partial: Checksum only covers the TCP/IP headers
Packet format: IP Header | TCP-like Header | STT Header | Payload
STT header fields:
Version (8b) | Flags (8b) | L4 Offset (8b) | Reserved (8b) | Maximum Segment Size (16b) | Priority Code Point (3b) | VLAN ID Valid (1b) | VLAN ID (12b) | Context ID (64b) | Padding (16b)
Flag bits: Checksum Verified (1b), Checksum Partial (1b), IP Version = IPv4 (1b), TCP Payload (1b), Reserved (4b)
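A sketch of packing the fixed-size STT header fields above with Python's struct module; the individual flag-bit positions are left to the caller because their exact layout is not reproduced here, and the field order follows the list above:

```python
import struct

def stt_header(l4_offset, mss, context_id, vlan_id=0, flags=0):
    """18-byte STT header sketch: Version | Flags | L4 Offset |
    Reserved | MSS (16b) | PCP/V/VLAN word (16b) | Context ID (64b) |
    Padding (16b). Flag-bit assignments are an assumption left to
    the caller."""
    version = 0
    vlan_word = vlan_id & 0x0FFF         # PCP = 0, VLAN-valid bit not set
    return struct.pack("!BBBBHHQH", version, flags, l4_offset, 0,
                       mss, vlan_word, context_id, 0)

hdr = stt_header(l4_offset=18, mss=1438, context_id=0x1122334455667788)
```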
- 24. TCP-Like Header in STT
Destination port: A standard port to be requested from IANA
Source port: Selected for efficient ECMP
Ack number: STT payload sequence identifier; the same in all segments of a payload
Sequence number (32b): Length of the STT payload (16b) + offset of the current segment (16b) ⇒ correctly handled by NICs with the Large Receive Offload (LRO) feature
No acks. STT delivers partial payloads to higher layers. A higher-layer TCP can handle retransmissions if required.
Middleboxes will need to be programmed to allow STT to pass through.
TCP-like header fields:
Source Port (random, 16b) | Dest. Port (standard, 16b) | Sequence Number* (32b: STT payload length 16b + segment offset 16b) | Ack Number* (32b: payload sequence #) | Data Offset | …
*Different meaning than in TCP
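The sequence-number reuse described above can be expressed directly (function names are illustrative):

```python
def stt_seq(payload_len, offset):
    """STT packs (STT payload length, segment offset) into the
    32-bit TCP sequence number field: length in the upper 16 bits,
    offset in the lower 16. Successive segments then carry
    in-order-looking sequence numbers that LRO NICs can coalesce."""
    assert 0 <= payload_len < 2**16 and 0 <= offset < 2**16
    return (payload_len << 16) | offset

def stt_seq_parse(seq):
    """Recover (payload length, segment offset) at the receiver."""
    return seq >> 16, seq & 0xFFFF
```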
- 25. STT Summary
STT solves the problem of efficient transport of large (up to 64 KB) storage blocks.
Uses Ethernet over TCP-like over IP tunnels.
Designed for software implementation in hypervisors.
- 26. Summary
1. NVO3 is a generalized framework for network virtualization and partitioning for multiple tenants over L3. It covers both L2 and L3 connectivity.
2. NVGRE uses Ethernet over GRE for L2 connectivity.
3. VXLAN uses Ethernet over UDP over IP.
4. STT uses Ethernet over a TCP-like stateless protocol over IP.
- 27. Reading List
B. Davie and J. Gross, “A Stateless Transport Tunneling Protocol for Network Virtualization (STT),” Sep 2013, http://tools.ietf.org/html/draft-davie-stt-04
Emulex, “NVGRE Overlay Networks: Enabling Network Scalability,” Aug 2012, 11 pp., http://www.emulex.com/artifacts/074d492d-9dfa-42bd-9583-69ca9e264bd3/elx_wp_all_nvgre.pdf
M. Mahalingam, “VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” draft-mahalingam-dutt-dcops-vxlan-04, May 8, 2013, http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-04
M. Sridharan, “NVGRE: Network Virtualization using GRE,” Aug 2013, http://tools.ietf.org/html/draft-sridharan-virtualization-nvgre-03
Network Virtualization Overlays (nvo3) charter, http://datatracker.ietf.org/wg/nvo3/charter/
T. Narten, et al., “An Architecture for Overlay Networks (NVO3),” http://datatracker.ietf.org/doc/draft-narten-nvo3-arch/
V. Josyula, M. Orr, and G. Page, “Cloud Computing: Automating the Virtualized Data Center,” Cisco Press, 2012, 392 pp., ISBN: 1587204347.
- 29. Acronyms
ARMD: Address Resolution for Massive numbers of hosts in the Data center
ARP: Address Resolution Protocol
BGP: Border Gateway Protocol
BUM: Broadcast, Unknown, Multicast
DCN: Data Center Network
DCVPN: Data Center Virtual Private Network
ECMP: Equal Cost Multi Path
EoMPLSoGRE: Ethernet over MPLS over GRE
EVPN: Ethernet Virtual Private Network
GRE: Generic Routing Encapsulation
IANA: Internet Assigned Numbers Authority
ID: Identifier
IEEE: Institute of Electrical and Electronics Engineers
IETF: Internet Engineering Task Force