Multi-Tenant Isolation and
Network Virtualization in
Cloud Data Centers


Raj Jain
Washington University in Saint Louis
Saint Louis, MO 63130
Jain@cse.wustl.edu
These slides and audio/video recordings of this class lecture are at:
http://www.cse.wustl.edu/~jain/cse570-13/
Overview
1. NVO3
2. VXLAN
3. NVGRE
4. STT
Note: Data center interconnection and LAN extension techniques are covered in another module, which includes OTV, TRILL, and LISP.

Network Virtualization
 Network virtualization allows tenants to form an overlay network in a multi-tenant network such that each tenant can control:
 1. Connectivity layer: The tenant network can be L2 while the provider is L3, and vice versa
 2. Addresses: MAC addresses and IP addresses
 3. Network partitions: VLANs and subnets
 4. Node location: Move nodes freely
 Network virtualization allows providers to serve a large number of tenants without worrying about:
 1. Internal addresses used in client networks
 2. Number of client nodes
 3. Location of individual client nodes
 4. Number and values of client partitions (VLANs and subnets)
 The network could be a single physical interface, a single physical machine, a data center, a metro area, … or the global Internet.
 The provider could be a system owner, an enterprise, a cloud provider, or a carrier.

Network Virtualization Techniques

Entity                | Partitioning             | Aggregation/Extension/Interconnection**
NIC                   | SR-IOV                   | MR-IOV
Switch                | VEB, VEPA                | VSS, VBE, DVS, FEX
L2 Link               | VLANs                    | LACP, Virtual PortChannels
L2 Network using L2   | VLAN                     | PB (Q-in-Q), PBB (MAC-in-MAC), PBB-TE, Access-EPL, EVPL, EVP-Tree, EVPLAN
L2 Network using L3   | NVO3, VXLAN, NVGRE, STT  | MPLS, VPLS, A-VPLS, H-VPLS, PWoMPLS, PWoGRE, OTV, TRILL, LISP, L2TPv3, EVPN, PBB-EVPN
Router                | VDCs, VRF                | VRRP, HSRP
L3 Network using L1   |                          | GMPLS, SONET
L3 Network using L3*  | MPLS, GRE, PW, IPSec     | MPLS, T-MPLS, MPLS-TP, GRE, PW, IPSec
Application           | ADCs                     | Load Balancers

*All L2/L3 technologies for L2 network partitioning and aggregation can also be used for L3 network partitioning and aggregation, respectively, by simply putting L3 packets in L2 payloads.
**The aggregation technologies can also be seen as partitioning technologies from the provider point of view.
NVO3
 Network Virtualization Overlays using L3 techniques
 Problem: Data Center Virtual Private Network (DCVPN) in a multi-tenant data center
 Issues:
  Scale in number of networks: Hundreds of thousands of DCVPNs in a single administrative domain
  Scale in number of nodes: Millions of VMs on hundreds of thousands of physical servers
  VM (or pM) migration
  Support both L2 and L3 VPNs
  Dynamic provisioning
  Addressing independence
  Virtual private ⇒ other tenants do not see your frames
  Optimal forwarding (VRRP is inefficient in a large network)

Ref: Network Virtualization Overlays (nvo3) charter, http://datatracker.ietf.org/wg/nvo3/charter/
NVO3 Goals
 Develop a general architectural framework:
  Identify key functional blocks
  Identify alternatives for each functional block
  Deployments can mix and match these alternatives
  Analyze which requirements are satisfied by different alternatives
 Operation, Administration and Management (OAM)

NVO3 Terminology
 Tenant System (TS): VM or pM
 Virtual Network (VN): L2 or L3 tenant networks
 Network Virtualization Edges (NVEs): Entities connecting TSs (virtual/physical switches/routers)
 Network Virtualization Authority (NVA): Manages forwarding info for a set of NVEs
 NV Domain: Set of NVEs under one authority
 NV Region: Set of domains that share some information (to support VNs that span multiple domains)
[Figure: TSs attach to NVEs, NVEs form VNs, each NV domain has its own NVA, and a region spans multiple domains]
NVO3 Components
 Underlay Network: Provides the overlay network service
 Orchestration Systems: Create new VMs and associated vSwitches and other networking entities and properties. May share this information with NVAs.
 NVEs could be in vSwitches, in external pSwitches, or span both.
 NVA could be distributed, or centralized and replicated.
 NVEs get information from hypervisors and/or the NVA:
  Hypervisor-to-NVE Protocol (data plane learning)
  NVE-NVA Protocol: Push or pull (on-demand) model. Control plane learning.
 Map and Encap: Find the destination NVE (map) and send (encap)

Ref: T. Narten, et al., “An Architecture for Overlay Networks (NVO3),” http://datatracker.ietf.org/doc/draft-narten-nvo3-arch/
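To make the map-and-encap step concrete, here is a minimal Python sketch (not part of the NVO3 drafts): the NVA mapping is modeled as a plain per-VN dictionary and the outer header as a dict; the table contents, function name, and addresses are illustrative assumptions only.

```python
# Illustrative map-and-encap sketch; the NVA table is modeled as a dictionary
# keyed by VN ID, mapping inner (tenant) MAC addresses to remote NVE IPs.
nva_table = {
    5001: {"00:11:22:33:44:55": "192.0.2.10",   # hypothetical tenant MAC -> NVE IP
           "00:11:22:33:44:66": "192.0.2.20"},
}

def nve_forward(vn_id, inner_frame, inner_dst_mac, local_nve_ip):
    """Map: look up the destination NVE for this tenant MAC.
       Encap: wrap the tenant frame in an outer header addressed to that NVE."""
    remote_nve_ip = nva_table.get(vn_id, {}).get(inner_dst_mac)
    if remote_nve_ip is None:
        # Unknown destination: would be flooded on the VN's multicast tree or
        # resolved by querying the NVA (pull model); omitted in this sketch.
        return None
    outer_header = {"src_ip": local_nve_ip, "dst_ip": remote_nve_ip, "vn_id": vn_id}
    return outer_header, inner_frame      # handed to the underlay for delivery

print(nve_forward(5001, b"<ethernet frame>", "00:11:22:33:44:55", "192.0.2.1"))
```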
Current NVO Technologies
 BGP/MPLS IP VPNs: Widely deployed in enterprise networks. Difficult in data centers because hosts/hypervisors do not implement BGP.
 BGP/MPLS Ethernet VPNs: Deployed in carrier networks. Difficult in data centers.
 802.1Q, PB, PBB VLANs
 Shortest Path Bridging: IEEE 802.1aq
 Virtual Station Interface (VSI) Discovery and Configuration Protocol (VDP): IEEE 802.1Qbg
 Address Resolution for Massive numbers of hosts in the Data Center (ARMD): RFC 6820
 TRILL
 L2VPN: Provider-provisioned L2 VPN
 Proxy Mobile IP: Does not support multi-tenancy
 LISP: RFC 6830
GRE
 Generic Routing Encapsulation (RFC 1701/1702)
 Generic ⇒ X over Y for any X or Y
 Over IPv4, GRE packets use a protocol type of 47
 Optional Checksum, Loose/Strict Source Routing, and Key fields
 Key is used to authenticate the source
 Recursion Control: Number of additional encapsulations allowed. 0 ⇒ restricted to a single provider network ⇒ end-to-end
 Offset: Points to the next source route field to be used
 IP or IPSec is commonly used as the delivery header
Packet layout: Delivery Header | GRE Header | Payload

GRE header fields:
Checksum Present     1b
Routing Present      1b
Key Present          1b
Seq. # Present       1b
Strict Source Route  1b
Recursion Control    3b
Flags                5b
Version              3b
Protocol Type        16b
Checksum             16b
Offset               16b
Key                  32b
Seq. #               32b
Source Route List    Variable
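As a worked example of the layout above, the following sketch packs a minimal GRE header with only the Key Present bit set. The bit positions follow the field order shown above (RFC 1701); the helper name and example values are mine, and real implementations also handle the optional checksum, sequence, and routing fields.

```python
import struct

def gre_header(protocol_type, key=None):
    """Pack a minimal GRE header: C/R/K/S/s flag bits, recursion control,
       flags, version, protocol type, and an optional 32-bit Key."""
    c = r = s = strict = 0                 # no checksum, routing, or sequence number
    k = 1 if key is not None else 0
    recursion, flags, version = 0, 0, 0    # recursion 0 => no further encapsulation
    first16 = (c << 15) | (r << 14) | (k << 13) | (s << 12) | (strict << 11) \
              | (recursion << 8) | (flags << 3) | version
    hdr = struct.pack("!HH", first16, protocol_type)
    if key is not None:
        hdr += struct.pack("!I", key)
    return hdr

# Example: Transparent Ethernet Bridging payload (0x6558) with a tenant key
print(gre_header(0x6558, key=0x00ABCDEF).hex())   # '2000655800abcdef'
```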
EoMPLSoGRE
 Ethernet over MPLS over GRE (point-to-point)
 VPLS over MPLS over GRE (multipoint-to-multipoint)
 Used when the provider offers only L3 connectivity
 Subscribers use their own MPLS over GRE tunnels
 VPLSoGRE or Advanced-VPLSoGRE can also be used
 GRE offers an IPSec encryption option
[Figure: Customer MPLS routers connect through an MPLS-over-GRE tunnel between provider edge routers across the provider IP network; encapsulation on the tunnel: IP | GRE | MPLS | Ethernet]
NVGRE
 Ethernet over GRE over IP (point-to-point)
 A unique 24-bit Virtual Subnet Identifier (VSID) is carried in the upper 24 bits of the GRE Key field (the remaining 8 bits carry a FlowID) ⇒ 2^24 tenants can share the network
 A unique IP multicast address is used for BUM (Broadcast, Unknown, Multicast) traffic on each VSID
 Equal Cost Multipath (ECMP) allowed on point-to-point tunnels
[Figure: Customer edge switches connect through a GRE tunnel between provider edge routers across the provider IP network; encapsulation on the tunnel: IP | GRE | Ethernet]

Ref: M. Sridharan, “NVGRE: Network Virtualization using GRE,” Aug 2013, http://tools.ietf.org/html/draft-sridharan-virtualization-nvgre-03
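A small sketch of how the VSID and FlowID could be packed into the 32-bit GRE Key word described above; it reuses the gre_header() helper from the GRE sketch earlier, and the VSID/FlowID values are illustrative.

```python
import struct

def nvgre_key(vsid, flow_id=0):
    """Pack the NVGRE key word: 24-bit Virtual Subnet ID in the upper bits,
       8-bit FlowID in the lower bits (values here are illustrative)."""
    assert 0 <= vsid < (1 << 24) and 0 <= flow_id < (1 << 8)
    return (vsid << 8) | flow_id

key = nvgre_key(vsid=6001, flow_id=7)
print(hex(key), struct.pack("!I", key).hex())   # key 0x177107, bytes 00177107 on the wire
# gre_header(0x6558, key=key) from the GRE sketch above would yield the NVGRE header
```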
NVGRE (Cont)
 In a cloud, a pSwitch or a vSwitch can serve as the tunnel endpoint
 VMs need to be in the same VSID to communicate
 VMs in different VSIDs can have the same MAC address
 The inner IEEE 802.1Q tag, if present, is removed
[Figure: Two virtual subnets, 10.20.1.X and 10.20.2.X, each spanning VMs placed across three physical subnets (192.168.1.X, 192.168.2.X, 192.168.3.X) that are interconnected over the Internet]

Ref: Emulex, “NVGRE Overlay Networks: Enabling Network Scalability,” Aug 2012, 11 pp., http://www.emulex.com/artifacts/074d492d-9dfa-42bd-9583-69ca9e264bd3/elx_wp_all_nvgre.pdf
VXLAN
 Virtual eXtensible Local Area Networks (VXLAN)
 An L3 solution to isolate multiple tenants in a data center (the L2 solutions are Q-in-Q and MAC-in-MAC)
 Developed by VMware. Supported by many companies in the IETF NVO3 working group.
 Problem:
  4096 VLANs are not sufficient in a multi-tenant data center
  Tenants need to control their MAC, VLAN, and IP address assignments ⇒ overlapping MAC, VLAN, and IP addresses
  Spanning tree is inefficient with a large number of switches ⇒ too many links are disabled
  Better throughput with IP equal cost multipath (ECMP)

Ref: M. Mahalingam, “VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,”
draft-mahalingam-dutt-dcops-vxlan-04, May, 8, 2013, http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-04
VXLAN Architecture
 Create a virtual L2 overlay (called a VXLAN) over L3 networks
 2^24 VXLAN Network Identifiers (VNIs)
 Only VMs in the same VXLAN can communicate
 vSwitches serve as VTEPs (VXLAN Tunnel End Points):
  Encapsulate L2 frames in UDP over IP and send to the destination VTEP(s)
 Segments may have overlapping MAC addresses and VLANs, but L2 traffic never crosses a VNI
[Figure: Three tenant virtual L2 networks overlaid on a shared L3 network]
VXLAN Deployment Example
Example: Three tenants, 3 VNIs, 4 tunnels for unicast, plus 3 tunnels for multicast (not shown)
[Figure: Hypervisor VTEP IP1 hosts VM1-1 (VNI 34) and VM2-1 (VNI 22); hypervisor VTEP IP2 hosts VM2-2 (VNI 22), VM1-2 (VNI 34), and VM3-1 (VNI 74); hypervisor VTEP IP3 hosts VM3-2 (VNI 74) and VM1-3 (VNI 34); all three VTEPs are connected by an L3 network]
VXLAN Encapsulation Format
 Outer VLAN tag is optional. Used to isolate VXLAN traffic on the LAN.
 Source VM ARPs to find the destination VM’s MAC address. All L2 multicasts/unknowns are sent via IP multicast.
 Destination VM sends a standard IP unicast ARP response. The destination VTEP learns the inner-source-MAC-to-outer-source-IP mapping ⇒ avoids unknown-destination flooding for the returning responses.
Encapsulation (only key fields shown):
Outer Ethernet:  Dest. VTEP MAC | Source VTEP MAC | Outer VLAN
Outer IP:        Dest. VTEP IP | Source VTEP IP
Outer UDP:       UDP Header
VXLAN Header:    Flags (8b) | Reserved (24b) | VNI (24b) | Reserved (8b)
Inner frame:     Dest VM MAC | Source VM MAC | Tenant VLAN | Ethernet Payload
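A minimal sketch of packing the 8-byte VXLAN header shown above (flags with the I bit set, reserved bits zero, 24-bit VNI); the helper name and example VNI are mine.

```python
import struct

def vxlan_header(vni):
    """Pack the 8-byte VXLAN header: Flags (I bit => VNI valid), 24 reserved
       bits, 24-bit VNI, 8 reserved bits."""
    assert 0 <= vni < (1 << 24)
    flags = 0x08                      # only the I flag set; the other 7 bits are reserved
    return struct.pack("!II", flags << 24, vni << 8)

print(vxlan_header(34).hex())         # '0800000000002200' for VNI 34
```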
VXLAN Encapsulation Format (Cont)
 IGMP is used to prune multicast trees
 7 of the 8 bits in the flags field are reserved; the I flag bit is set if the VNI field is valid
 UDP source port is a hash of the inner MAC header ⇒ allows load balancing using Equal Cost Multi Path (ECMP) with L3-L4 header hashing
 VMs are unaware of whether they are operating on a VLAN or a VXLAN
 VTEPs need to learn the MAC addresses of other VTEPs and of the client VMs of the VNIs they are handling
 A VXLAN gateway switch can forward traffic to/from non-VXLAN networks, encapsulating or decapsulating the packets

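To illustrate the ECMP point above, here is one way a VTEP might derive the UDP source port from a hash of the inner MAC header; the hash function and ephemeral port range are assumptions, since the draft only requires a per-flow hash of the inner headers.

```python
import zlib

def vxlan_source_port(inner_dst_mac: bytes, inner_src_mac: bytes,
                      low=49152, high=65535):
    """Return a UDP source port derived from the inner MAC header, so all
       packets of a flow hash to the same port (and the same ECMP path)."""
    h = zlib.crc32(inner_dst_mac + inner_src_mac)    # any stable hash would do
    return low + h % (high - low + 1)

print(vxlan_source_port(bytes.fromhex("001122334455"),
                        bytes.fromhex("66778899aabb")))
```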
VXLAN: Summary
 VXLAN solves the problem of multiple tenants with overlapping MAC addresses, VLANs, and IP addresses in a cloud environment
 A server may have VMs belonging to different tenants
 No changes to VMs; hypervisors are responsible for all details
 Uses UDP over IP encapsulation to isolate tenants

Stateless Transport Tunneling Protocol (STT)
 Ethernet over TCP-like over IP tunnels. GRE or IPSec tunnels can also be used if required.
 Tunnel endpoints may be inside the end systems (vSwitches)
 Designed for large storage blocks (up to 64 KB). Fragmentation is allowed.
 Most other overlay protocols use UDP and disallow fragmentation ⇒ Maximum Transmission Unit (MTU) issues
 TCP-like: Stateless TCP ⇒ header identical to TCP (same protocol number 6) but no 3-way handshake, no connections, no windows, no retransmissions, no congestion state ⇒ stateless transport (recognized by a standard port number)
 Broadcast, Unknown, Multicast (BUM) traffic is handled by IP multicast tunnels

Ref: B. Davie and J. Gross, "A Stateless Transport Tunneling Protocol for Network Virtualization (STT)," Sep 2013,
http://tools.ietf.org/html/draft-davie-stt-04
LSO and LRO
 Large Send Offload (LSO): The host hands a large chunk of data plus metadata to the NIC. The NIC makes MSS-sized segments and adds checksum, TCP, IP, and MAC headers to each segment.
 Large Receive Offload (LRO): The NIC attempts to reassemble multiple TCP segments and pass larger chunks to the host. The host does the final reassembly with fewer per-packet operations.
 STT takes advantage of LSO and LRO features, if available.
 Using a protocol number other than 6 would not allow LSO/LRO to handle STT.
[Figure: LSO: the host hands metadata and a large payload to the NIC, which emits segments each framed with L2, IP, and TCP headers; LRO: the NIC coalesces received segments into larger chunks for the host]
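A simplified sketch of the LSO send path described above: the host hands one large payload plus an MSS and prototype headers, and the NIC cuts it into MSS-sized segments; real NICs also fix up checksums, IP IDs, and sequence numbers, which this sketch glosses over.

```python
def lso_segment(payload: bytes, mss: int, make_headers):
    """Split one large send into MSS-sized segments and prepend per-segment
       headers, roughly what a NIC with Large Send Offload does (simplified)."""
    segments = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        segments.append(make_headers(offset, len(chunk)) + chunk)
    return segments

# Hypothetical header builder: 14 B Ethernet + 20 B IP + 20 B TCP placeholders
headers = lambda offset, length: b"\x00" * 54
segs = lso_segment(b"x" * 150000, mss=1460, make_headers=headers)
print(len(segs), len(segs[0]))   # 103 segments; the first is 1514 bytes
```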
STT Optimizations
 Large data size: Less overhead per payload byte
 Context ID: 64-bit tunnel endpoint identifier
 Optimizations:
  2-byte padding is added to Ethernet frames to make the size a multiple of 32 bits
  Source port is a hash of the inner header ⇒ ECMP with each flow taking a different path and all packets of a flow taking one path
 No protocol type field ⇒ payload is assumed to be Ethernet, which can carry any payload identified by its protocol type
[Figure: VMs attach through vNICs to vSwitches, which act as the tunnel endpoints and connect through physical NICs over the underlay network]
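A quick check of the padding optimization, using the field widths listed on the next slide: the 2-byte pad makes the 18-byte STT header plus the 14-byte inner Ethernet header add up to 32 bytes, which keeps the encapsulated IP header 32-bit aligned. The alignment interpretation is mine; the slide only states that the frame size becomes a multiple of 32 bits.

```python
# Field widths in bits, from the STT frame format slide
stt_header_bits = 8 + 8 + 8 + 8 + 16 + 16 + 64 + 16   # version ... context ID, padding
inner_eth_header_bytes = 14

stt_header_bytes = stt_header_bits // 8
total = stt_header_bytes + inner_eth_header_bytes
print(stt_header_bytes, total, total % 4 == 0)          # 18 32 True
```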
STT Frame Format
 16-bit MSS ⇒ 2^16 B = 64 KB maximum
 L4 Offset: From the end of the STT header to the start of the encapsulated L4 (TCP/UDP) header ⇒ helps locate the payload quickly
 Checksum Verified: Checksum covers the entire payload and is valid
 Checksum Partial: Checksum only includes the TCP/IP headers
Frame layout: IP Header | TCP-Like Header | STT Header | STT Payload

STT header fields:
Version               8b
Flags                 8b  (Checksum Verified 1b, Checksum Partial 1b, IP Version = IPv4 1b, TCP Payload 1b, Reserved 4b)
L4 Offset             8b
Reserved              8b
Maximum Segment Size  16b
Priority Code Point   3b
VLAN ID Valid         1b
VLAN ID               12b
Context ID            64b
Padding               16b
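A packing sketch for the header layout above. The field order and widths come from this slide; the flag-bit positions and default values are assumptions made for illustration, not taken from the STT draft.

```python
import struct

def stt_frame_header(l4_offset, mss, context_id, vlan_id=0, pcp=0,
                     csum_partial=True, csum_verified=False, ipv4=True):
    """Pack an 18-byte STT frame header: version, flags, L4 offset, reserved,
       MSS, PCP/VLAN-valid/VLAN ID, 64-bit context ID, 2-byte padding."""
    version = 0
    flags = ((1 if csum_verified else 0)            # assumed bit order
             | (1 if csum_partial else 0) << 1
             | (1 if ipv4 else 0) << 2)
    vlan_word = (pcp << 13) | ((1 << 12) if vlan_id else 0) | (vlan_id & 0x0FFF)
    return struct.pack("!BBBBHHQH", version, flags, l4_offset, 0,
                       mss, vlan_word, context_id, 0)

hdr = stt_frame_header(l4_offset=50, mss=1400, context_id=0x1234)
print(len(hdr), hdr.hex())   # 18-byte header
```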
TCP-Like Header in STT
 Destination Port: A standard port, to be requested from IANA
 Source Port: Selected for efficient ECMP
 Ack Number: STT payload sequence identifier; the same in all segments of a payload
 Sequence Number (32b): Length of the STT payload (16b) + offset of the current segment (16b) ⇒ correctly handled by NICs with the Large Receive Offload (LRO) feature
 No ACKs. STT delivers partial payloads to higher layers; a higher-layer TCP can handle retransmissions if required.
 Middleboxes will need to be configured to allow STT to pass through
Header fields (*different meaning than in TCP):
Source Port (random, per flow)   16b
Dest. Port (standard)            16b
Sequence Number*                 32b = STT payload length (16b) + segment offset (16b)
Ack Number*                      32b = payload sequence #
Data Offset and remaining fields as in TCP
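A short sketch of the repurposed sequence and acknowledgment numbers described above, showing why LRO sees what looks like contiguous TCP data: consecutive segments of one STT payload carry sequence numbers that grow by the segment size. The function name and example sizes are illustrative.

```python
def stt_tcp_like_fields(payload_len, segment_offset, payload_id):
    """Sequence number = STT payload length (upper 16 bits) + segment offset
       (lower 16 bits); ack number = payload identifier, same in every segment."""
    assert payload_len < (1 << 16) and segment_offset < (1 << 16)
    return (payload_len << 16) | segment_offset, payload_id & 0xFFFFFFFF

# Segments of one 9000-byte STT payload cut every 1460 bytes by LSO
for offset in range(0, 9000, 1460):
    seq, ack = stt_tcp_like_fields(9000, offset, payload_id=42)
    print(hex(seq), ack)
```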
STT Summary
 STT solves the problem of efficient transport of large 64 KB storage blocks
 Uses Ethernet over TCP-like over IP tunnels
 Designed for software implementation in hypervisors

Summary
1. NVO3 is a generalized framework for network virtualization and partitioning for multiple tenants over L3. It covers both L2 and L3 connectivity.
2. NVGRE uses Ethernet over GRE for L2 connectivity.
3. VXLAN uses Ethernet over UDP over IP.
4. STT uses Ethernet over a TCP-like stateless protocol over IP.

Reading List
 B. Davie and J. Gross, “A Stateless Transport Tunneling Protocol for Network Virtualization (STT),” Sep 2013, http://tools.ietf.org/html/draft-davie-stt-04
 Emulex, “NVGRE Overlay Networks: Enabling Network Scalability,” Aug 2012, 11 pp., http://www.emulex.com/artifacts/074d492d-9dfa-42bd-9583-69ca9e264bd3/elx_wp_all_nvgre.pdf
 M. Mahalingam, “VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” draft-mahalingam-dutt-dcops-vxlan-04, May 8, 2013, http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-04
 M. Sridharan, “NVGRE: Network Virtualization using GRE,” Aug 2013, http://tools.ietf.org/html/draft-sridharan-virtualization-nvgre-03
 Network Virtualization Overlays (nvo3) charter, http://datatracker.ietf.org/wg/nvo3/charter/
 T. Narten, et al., “An Architecture for Overlay Networks (NVO3),” http://datatracker.ietf.org/doc/draft-narten-nvo3-arch/
 V. Josyula, M. Orr, and G. Page, “Cloud Computing: Automating the Virtualized Data Center,” Cisco Press, 2012, 392 pp., ISBN: 1587204347.

Wikipedia Links
 http://en.wikipedia.org/wiki/Generic_Routing_Encapsulation
 http://en.wikipedia.org/wiki/Locator/Identifier_Separation_Protocol
 http://en.wikipedia.org/wiki/Large_segment_offload
 http://en.wikipedia.org/wiki/Large_receive_offload

Acronyms
 ARMD: Address Resolution for Massive numbers of hosts in the Data center
 ARP: Address Resolution Protocol
 BGP: Border Gateway Protocol
 BUM: Broadcast, Unknown, Multicast
 DCN: Data Center Networks
 DCVPN: Data Center Virtual Private Network
 ECMP: Equal Cost Multi Path
 EoMPLSoGRE: Ethernet over MPLS over GRE
 EVPN: Ethernet Virtual Private Network
 GRE: Generic Routing Encapsulation
 IANA: Internet Assigned Numbers Authority
 ID: Identifier
 IEEE: Institute of Electrical and Electronics Engineers
 IETF: Internet Engineering Task Force

Acronyms (Cont)
 IGMP: Internet Group Management Protocol
 IP: Internet Protocol
 IPSec: IP Security
 IPv4: Internet Protocol v4
 LAN: Local Area Network
 LISP: Locator/ID Separation Protocol
 LRO: Large Receive Offload
 LSO: Large Send Offload
 MAC: Media Access Control
 MPLS: Multi Protocol Label Switching
 MSS: Maximum Segment Size
 MTU: Maximum Transmission Unit
 NIC: Network Interface Card
 NV: Network Virtualization
 NVA: Network Virtualization Authority
 NVE: Network Virtualization Edge

Acronyms (Cont)
 NVGRE: Network Virtualization using GRE
 NVO3: Network Virtualization over L3
 OAM: Operation, Administration and Management
 OTV: Overlay Transport Virtualization
 PB: Provider Bridges
 PBB: Provider Backbone Bridge
 pM: Physical Machine
 pSwitch: Physical Switch
 QoS: Quality of Service
 RFC: Request for Comments
 RS: Routing System
 STT: Stateless Transport Tunneling Protocol
 TCP: Transmission Control Protocol
 TRILL: Transparent Interconnection of Lots of Links
 TS: Tenant System
 UDP: User Datagram Protocol

Acronyms (Cont)
 VDP: VSI Discovery and Configuration Protocol
 VLAN: Virtual Local Area Network
 VM: Virtual Machine
 VN: Virtual Network
 VNI: Virtual Network Identifier
 VPLS: Virtual Private LAN Service
 VPLSoGRE: Virtual Private LAN Service (VPLS) over GRE
 VPN: Virtual Private Network
 VRRP: Virtual Router Redundancy Protocol
 VSI: Virtual Station Interface
 VSID: Virtual Subnet Identifier
 vSwitch: Virtual Switch
 VTEP: VXLAN Tunnel End Point
 VXLAN: Virtual Extensible Local Area Network

