This document describes Network Stack in Userspace (NUSE), which implements a full network stack as a userspace library. NUSE aims to allow faster evolution of network stacks outside the kernel and enable network protocol personalization. It works by patching the Linux kernel to include a new architecture, implementing the network stack components as a userspace library, and hijacking POSIX socket calls to redirect them to the NUSE implementation. Performance tests show NUSE adding only small overhead compared to kernel implementations. NUSE can also integrate with the ns-3 network simulator to enable controllable and reproducible network simulations using real protocol implementations.
NUSE (Network Stack in Userspace) at #osio
1. Network Stack in Userspace (NUSE)
Hajime Tazaki
Ryo Nakamura
(University of Tokyo)
New Directions in Operating Systems
London, 2014
2. Motivation
Implementation of the Internet is not finished yet
Faster evolution of OSes (network stack)
OS personalization
3. I have a new Layer-3/4 protocol! Yay!
I have a new, great Layer-3/4 protocol! It will change the WORLD!
Replace network stack?
  No: destroy my life?! (experimental? not tested?)
  Yes: I wanna be your slave.
Slow evolution of network stack?
VM on personal device?
4. Virtual Machine ?
Poll: “When you download and run software, how often do you use a virtual machine (to reduce security risks)?”
Jon Howell, Galen Hunt, David Molnar, and Donald E. Porter, Living Dangerously: A Survey of Software Download Practices, no. MSR-TR-2010-51, May 2010
5. Slow evolution of network stack
Honda et al., Rekindling Network Protocol Innovation with User-Level Stacks, ACM SIGCOMM CCR, Vol. 44, No. 2, April 2014
[Figure 1: TCP options deployment over time. Ratio of flows (0.00 to 1.00) by date (2007 to 2012) for the SACK, Timestamp, and Windowscale options, inbound and outbound.]
Excerpt from the paper: "[...]pen infrequently not only because of slow release cycles, but also due to their cost and potential disruption to existing setups. If protocol stacks were embedded into applications, they could be updated on a case-by-case basis, and deployment would be a lot more timely. For example, Mac OS, Windows XP and FreeBSD still use a traditional Additive Increase Multiplicative Decrease (AIMD) algorithm for TCP congestion control, while Linux [...]"
6. Meanwhile in Filesystem world..
There is Filesystem in Userspace (FUSE)
Userspace code can host new filesystems (sshfs, GmailFS, etc.)
Performance is bad, but that doesn’t matter
Flexibility and functionality do matter
http://fuse.sourceforge.net/
7. Alternatives
Container (LXC, OpenVZ, vimage)
  shares the kernel with the host operating system (no flexibility)
Library OS
  From scratch: mtcp, Mirage, lwIP
  Porting: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?)
  Glue-layer: LKL (Linux-2.6), rumpkernel (NetBSD)
9. What’s NUSE ?
Network stack in Userspace
A library operating system
Library version of network stack (of monolithic kernel)
Linux (latest), FreeBSD (plan)
(UNIX) Process-based virtualization
[Diagram "nuse example": in userspace, glibc and libnuse (TCP/IP, ARP/ndisc); the kernel is bypassed; the NIC is reached via raw sock, netmap, DPDK (etc).]
10. Why NUSE ?
minimized porting effort
  Linux (net-next) changes frequently
fully functional network stack for
  netmap
  DPDK
  (any kernel-bypass technology)
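One way to picture why a library stack can sit on top of any kernel-bypass technology: the stack talks only to a narrow frame I/O interface, and each technology supplies its own implementation of it. The sketch below is a hypothetical illustration, not NUSE's actual interface. The names `FrameBackend`, `LoopbackBackend`, and `ToyStack` are invented here, and an in-memory queue stands in for raw sockets, netmap, or DPDK.

```python
from abc import ABC, abstractmethod
from collections import deque


class FrameBackend(ABC):
    """Hypothetical frame I/O interface a userspace stack could depend on."""

    @abstractmethod
    def send_frame(self, frame: bytes) -> None:
        """Hand one frame to the underlying I/O technology."""

    @abstractmethod
    def recv_frame(self):
        """Return the next received frame, or None if nothing is pending."""


class LoopbackBackend(FrameBackend):
    """In-memory stand-in for a real backend (raw socket, netmap, DPDK)."""

    def __init__(self):
        self._queue = deque()

    def send_frame(self, frame: bytes) -> None:
        self._queue.append(frame)

    def recv_frame(self):
        return self._queue.popleft() if self._queue else None


class ToyStack:
    """Toy 'stack' that only prepends a 1-byte protocol tag to payloads."""

    def __init__(self, backend: FrameBackend):
        self.backend = backend  # the stack never sees which technology this is

    def send(self, proto: int, payload: bytes) -> None:
        self.backend.send_frame(bytes([proto]) + payload)

    def receive(self):
        frame = self.backend.recv_frame()
        if frame is None:
            return None
        return frame[0], frame[1:]  # (protocol tag, payload)


stack = ToyStack(LoopbackBackend())
stack.send(17, b"hello")
print(stack.receive())
```

Swapping `LoopbackBackend` for another `FrameBackend` subclass would leave `ToyStack` unchanged, which is the property the slide claims for the real stack.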
11. How it works
[Diagram: Application on top of POSIX glue; kernel layer (TCP, UDP, DCCP, SCTP, ICMP, ARP, IPv6, IPv4, Qdisc, Netfilter, Bridging, Netlink, IPSec, Tunneling); NUSE core (bottom halves/rcu/timer/interrupt, struct net_device, petit-scheduler); RAW, DPDK, netmap, ... down to the NIC.]
1. (monolithic) kernel source
2. scheduler
3. POSIX glue: redirect system calls
4. network I/O: raw socket, DPDK, netmap, etc.
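The "POSIX glue" step can be pictured as a dispatcher: calls on descriptors owned by the library stack are redirected to it, and all other calls fall through to the host kernel. The sketch below is a conceptual model only, assuming a private descriptor range for the library stack. `ToyUserStack` and `PosixGlue` are hypothetical names, and the slides do not describe the real redirection mechanism at this level of detail.

```python
import os


class ToyUserStack:
    """Hypothetical userspace stack: data written to a descriptor is queued locally."""

    def __init__(self):
        self._next_fd = 10_000   # assumed private descriptor range
        self._sent = {}          # fd -> list of payloads handed to the stack

    def socket(self) -> int:
        fd = self._next_fd
        self._next_fd += 1
        self._sent[fd] = []
        return fd

    def write(self, fd: int, data: bytes) -> int:
        self._sent[fd].append(data)
        return len(data)

    def sent(self, fd: int):
        return list(self._sent[fd])


class PosixGlue:
    """Routes POSIX-style calls: stack-owned descriptors go to the library
    stack, everything else falls through to the host kernel."""

    def __init__(self, stack: ToyUserStack):
        self._stack = stack
        self._owned = set()

    def socket(self) -> int:
        fd = self._stack.socket()
        self._owned.add(fd)
        return fd

    def write(self, fd: int, data: bytes) -> int:
        if fd in self._owned:
            return self._stack.write(fd, data)   # redirected call
        return os.write(fd, data)                # untouched host call


glue = PosixGlue(ToyUserStack())
sock = glue.socket()
glue.write(sock, b"to the library stack")

r, w = os.pipe()                      # an ordinary host descriptor
glue.write(w, b"to the host kernel")
print(os.read(r, 64))
os.close(r)
os.close(w)
```

The application sees a single `write` entry point in both cases; only the glue knows which descriptors belong to the library stack.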
12. 1) kernel build
[Diagram: the same layer stack as on the "How it works" slide.]
patch to the kernel tree adding a new (hardware-independent) arch (arch/sim)
robust to (frequent) mainline changes
18. (possible) use cases
New protocol deployment
  Chrome + Linux mptcp (on NUSE)
Process-level virtual instance
  % NUSE-linux-ovs | NUSE-freebsd-NAT | NUSE-router | NUSE-nginx
  VM chaining via UNIX command line
19. Limitations (ongoing work)
no fork(2)/exec(2) support
  no multi-process support
no sysctl/proc
(inefficient) thread scheduling
20. Experiments
1. Can we benefit from OS personalization?
   present a custom (NUSE) kernel with an application (OS personalization)
2. How much overhead does NUSE add?
   simple performance measurements
23. Host Tx (NUSE->Receiver)
[Diagram: NUSE -> Rx]
ping RTT (ms)
          avg     max     min
dpdk      2.610   8.000   0.156
netmap    0.370   0.494   0.252
raw       0.396   0.501   0.290
tap       0.397   0.538   0.303
[Figure: bar charts for dpdk, netmap, raw, and tap. Left: throughput in Mbps (1024-byte UDP, axis 0 to 500). Right: ping RTT in ms (axis 0 to 8).]
24. L3 Routing (Sender->NUSE->Receiver)
[Diagram: Tx -> NUSE -> Rx]
ping RTT (ms)
          avg     max     min
dpdk     11.998  27.700   0.252
netmap    0.664   0.741   0.556
raw       0.663   0.761   0.575
tap       0.694   0.749   0.602
[Figure: bar charts. Left: throughput in Mbps (1024-byte UDP, axis 0 to 500) for netmap, raw, and tap. Right: ping RTT in ms (axis 0 to 30) for dpdk, netmap, raw, and tap.]
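To put the two sets of measurements side by side, the snippet below divides the average RTT of the L3 routing setup by the average RTT of the Host Tx setup for each backend. It assumes the "avg" columns on slides 23 and 24 are ping round-trip times in milliseconds, as the chart labels suggest. The two topologies differ (routing adds a hop), so the ratio is a rough comparison and not a measure of NUSE overhead on its own.

```python
# Average values copied from the "avg" columns of slides 23 and 24
# (assumed to be ping RTT in milliseconds).
host_tx = {"dpdk": 2.610, "netmap": 0.370, "raw": 0.396, "tap": 0.397}
l3_routing = {"dpdk": 11.998, "netmap": 0.664, "raw": 0.663, "tap": 0.694}


def rtt_ratio(routed, direct):
    """Return routed/direct average RTT for every backend in `direct`."""
    return {name: routed[name] / direct[name] for name in direct}


ratios = rtt_ratio(l3_routing, host_tx)
print(f"{'backend':8s}{'host':>8s}{'routed':>8s}{'ratio':>8s}")
for name, ratio in ratios.items():
    print(f"{name:8s}{host_tx[name]:8.3f}{l3_routing[name]:8.3f}{ratio:7.2f}x")
```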
25. Discussions
performance is not so bad
  we don’t care much about performance
network stack is fully functional
  but supplemental tools are not sufficient
26. Network Simulator Integration (ns-3)
network stack + ns-3 network simulator
Direct Code Execution (DCE)
  Established by Mathieu Lacage (2006)
  part of ns-3 project
Features
  reproducible (deterministic clock)
  controllable (simulator’s facility)
http://www.nsnam.org/overview/projects/direct-code-execution/
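The "reproducible (deterministic clock)" feature can be illustrated with a toy discrete-event loop: time advances only when the simulator pops the next event, and all randomness comes from a seeded generator, so two runs with the same seed produce the same trace. This is a minimal sketch of the general idea, not the ns-3 or DCE API. `Simulator` and `run_ping` are hypothetical names, and the delay range is arbitrary.

```python
import heapq
import random


class Simulator:
    """Toy discrete-event simulator with a virtual clock (not the ns-3 API)."""

    def __init__(self, seed: int):
        self.now = 0.0                  # virtual time in seconds
        self._events = []               # heap of (time, sequence, callback)
        self._seq = 0                   # tie-breaker keeps ordering deterministic
        self.rng = random.Random(seed)  # all randomness comes from this seed

    def schedule(self, delay: float, callback) -> None:
        heapq.heappush(self._events, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self) -> None:
        while self._events:
            # The clock jumps straight to the next event; wall time is never read.
            self.now, _, callback = heapq.heappop(self._events)
            callback()


def run_ping(seed: int, count: int = 3):
    """Simulate `count` echo requests over a link with random one-way delays."""
    sim = Simulator(seed)
    trace = []

    def send(i):
        sent_at = sim.now
        rtt = sim.rng.uniform(0.001, 0.005) + sim.rng.uniform(0.001, 0.005)
        sim.schedule(rtt, lambda: trace.append((i, sent_at, sim.now)))

    for i in range(count):
        sim.schedule(i * 1.0, lambda i=i: send(i))
    sim.run()
    return trace


assert run_ping(seed=1) == run_ping(seed=1)   # same seed, same trace
print(run_ping(seed=1))
```

Because nothing depends on wall-clock time or thread timing, a failing run can be replayed exactly, which is what makes debugging under a simulator repeatable.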
30. Bug reproducibility
[Diagram: topology with a Home Agent, Wi-Fi access points AP1 and AP2, a mobile node performing a handoff between them, and a correspondent node; ping6 traffic.]
(gdb) b mip6_mh_filter if dce_debug_nodeid()==0
Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.
<continue>
(gdb) bt 4
#0 mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0) at net/ipv6/mip6.c:109
#1 0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:199
#2 0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:232
#3 0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0) at net/ipv6/ip6_input.c:197
31. Debugging
==5864== Memcheck, a memory error detector
==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==5864== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info
==5864== Command: ../build/bin/ns3test-dce-vdl --verbose
==5864==
==5864== Conditional jump or move depends on uninitialised value(s)
==5864== at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
==5864== by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)
==5864== by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)
==5864== by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)
==5864== by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)
==5864== by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)
==5864== by 0x7D442E4: ip_rcv_finish (dst.h:318)
==5864== by 0x7D2313F: process_backlog (dev.c:3368)
==5864== by 0x7D23455: net_rx_action (dev.c:3526)
==5864== by 0x7CF2477: do_softirq (softirq.c:65)
==5864== by 0x7CF2544: softirq_task_function (softirq.c:21)
==5864== by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manager.
==5864== Uninitialised value was created by a stack allocation
==5864== at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)
==5864==
Memory error detection among distributed nodes in a single process, using Valgrind