
I am running a Debian-based Linux container under Proxmox 4.4. This host has five network interfaces (though only two come into play in the problem I'm having).

While I am shelled into this host, I ping the IP address associated with eth1. What is happening and what I believe should happen are two very different things.

What I want to happen is for the ping packet to egress eth3, where it will be routed to eth1.

What is happening is that the IP stack sees I'm pinging a local interface and it then sends the reply right back up the stack. I know the packet is not going out and coming back for two reasons:

  1. A packet capture shows nothing hitting either eth1 or eth3.
  2. The ping latency averages 0.013 ms. If the packet were going out and back as intended, the latency would be about 60 ms.

Of course, I desire corresponding behavior when I ping the IP address associated with eth3. In that case, I want the packet to egress eth1, where it will be routed to eth3. Unfortunately, the same short-circuiting behavior described above occurs.

Below, I show the static routes I've set up to try to induce the desired behavior. Such routes work as intended on a Windows machine, but they do not work under the Linux setup I am using.

How may I configure this host to forward as intended?

root@my-host:~# uname -a
Linux my-host 4.4.35-1-pve #1 SMP Fri Dec 9 11:09:55 CET 2016 x86_64 GNU/Linux
root@my-host:~#
root@my-host:~# cat /etc/debian_version
8.9
root@my-host:~#
root@my-host:~# ifconfig
eth0      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:192.0.2.65  Bcast:192.0.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:195028 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12891 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:92353608 (88.0 MiB)  TX bytes:11164530 (10.6 MiB)

eth1      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:128.66.100.10  Bcast:128.66.100.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:816 errors:0 dropped:0 overruns:0 frame:0
          TX packets:486 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:149517 (146.0 KiB)  TX bytes:34107 (33.3 KiB)

eth2      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:203.0.113.1  Bcast:203.0.113.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:738 errors:0 dropped:0 overruns:0 frame:0
          TX packets:880 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:423603 (413.6 KiB)  TX bytes:94555 (92.3 KiB)

eth3      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:128.66.200.10  Bcast:128.66.200.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:611 errors:0 dropped:0 overruns:0 frame:0
          TX packets:182 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:43921 (42.8 KiB)  TX bytes:13614 (13.2 KiB)

eth4      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:198.51.100.206  Bcast:198.51.100.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:183427 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:85706791 (81.7 MiB)  TX bytes:3906 (3.8 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:252 errors:0 dropped:0 overruns:0 frame:0
          TX packets:252 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:22869 (22.3 KiB)  TX bytes:22869 (22.3 KiB)
root@my-host:~#
root@my-host:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.0.2.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
128.66.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
203.0.113.0     0.0.0.0         255.255.255.0   U     0      0        0 eth2
128.66.200.0    0.0.0.0         255.255.255.0   U     0      0        0 eth3
198.51.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth4
root@my-host:~#
root@my-host:~# route -v add 128.66.200.10/32 gw 128.66.100.1
root@my-host:~# route -v add 128.66.100.10/32 gw 128.66.200.1
root@my-host:~#
root@my-host:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.0.2.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
203.0.113.0     0.0.0.0         255.255.255.0   U     0      0        0 eth2
198.51.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth4
128.66.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
128.66.100.10   128.66.200.1    255.255.255.255 UGH   0      0        0 eth3
128.66.200.0    0.0.0.0         255.255.255.0   U     0      0        0 eth3
128.66.200.10   128.66.100.1    255.255.255.255 UGH   0      0        0 eth1
root@my-host:~#
root@my-host:~# ping -c 3 128.66.100.10
PING 128.66.100.10 (128.66.100.10) 56(84) bytes of data.
64 bytes from 128.66.100.10: icmp_seq=1 ttl=64 time=0.008 ms
64 bytes from 128.66.100.10: icmp_seq=2 ttl=64 time=0.014 ms
64 bytes from 128.66.100.10: icmp_seq=3 ttl=64 time=0.017 ms

--- 128.66.100.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.008/0.013/0.017/0.003 ms
root@my-host:~#

THURSDAY, 8/17/2017 8:12 AM PDT UPDATE

Per the request of dirkt, I am elaborating on our architecture and the reason for my question.

The virtual host that is the subject of this post (i.e. the host with network interfaces eth1, eth3, and three other network interfaces unrelated to my question) is being used to test a physical, wired TCP/IP networking infrastructure we have set up. Specifically, it is the routing functionality of this TCP/IP networking infrastructure that we are testing.

We used to have two virtual hosts, not one as I've described in my original post. A ping between these two hosts would be our smoke test to ensure that the TCP/IP networking infrastructure under test was still working.

For reasons too detailed to get into, having two hosts made it difficult to collect the logs we need. So, we switched to one host, gave it two NICs, and set up static routes so that anything destined for NIC 2 would egress NIC 1 and vice versa. The problem is, as I've stated, the packets are not egressing.

This one host / two NIC setup has worked under Windows for us for years. I don't know if that is because Windows is broken and we were inadvertently taking advantage of a bug, or if Windows is fine (i.e. RFC-compliant) and we just need to get the configuration right on our Linux VMs to get the same behavior.

To summarize and distill down the long block of shell text above:

Two Interfaces:

eth1: 128.66.100.10/24; the router on this interface's network has IP address 128.66.100.1
eth3: 128.66.200.10/24; the router on this interface's network has IP address 128.66.200.1

Relevant Routes:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
128.66.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
128.66.100.10   128.66.200.1    255.255.255.255 UGH   0      0        0 eth3
128.66.200.0    0.0.0.0         255.255.255.0   U     0      0        0 eth3
128.66.200.10   128.66.100.1    255.255.255.255 UGH   0      0        0 eth1

Command I'm Executing:

ping -c 3 128.66.100.10

The destination of 128.66.100.10 matches two of the above routes:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
128.66.100.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
128.66.100.10   128.66.200.1    255.255.255.255 UGH   0      0        0 eth3

The route with the longest prefix match is:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
128.66.100.10   128.66.200.1    255.255.255.255 UGH   0      0        0 eth3

What I am trying to understand is why, given the existence of this route, the packet won't egress eth3, travel through our TCP/IP networking infrastructure, come back and hit eth1 from the outside.

The TCP/IP stack is apparently not consulting the forwarding table. It's as if, when it sees that I'm pinging a locally-connected interface, the TCP/IP stack just says, "Oh, this is a local interface. So I'm not going to consult the forwarding table. Instead, I'll just send an echo reply right back up the stack."
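
(For reference, the routing decision the kernel makes for a given destination can be inspected with ip route get; the comment below describes what such a lookup typically reports for an address the host owns, it is not output captured from my-host.)

ip route get 128.66.100.10
# typically reports something like "local 128.66.100.10 dev lo src 128.66.100.10",
# i.e. a route taken from the kernel's "local" table rather than the /32 route
# via 128.66.200.1 added above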

Is the behavior I desire RFC-compliant? If it is not, I must abandon the attempt. But if it is RFC-compliant, I would like to learn how to configure the Linux TCP/IP stack to allow this behavior.

MONDAY, 8/21/2017 UPDATE

I've discovered the sysctl rp_filter and accept_local kernel parameters. I have set them as follows:

root@my-host:~# cat /proc/sys/net/ipv4/conf/eth1/accept_local
1
root@my-host:~# cat /proc/sys/net/ipv4/conf/eth3/accept_local
1
root@my-host:~# cat /proc/sys/net/ipv4/conf/all/accept_local
1
root@my-host:~# cat /proc/sys/net/ipv4/conf/default/accept_local
1
root@my-host:~# cat /proc/sys/net/ipv4/conf/eth1/rp_filter
0
root@my-host:~# cat /proc/sys/net/ipv4/conf/eth3/rp_filter
0
root@my-host:~# cat /proc/sys/net/ipv4/conf/all/rp_filter
0
root@my-host:~# cat /proc/sys/net/ipv4/conf/default/rp_filter
0

Setting these kernel parameters, rebooting, verifying they survived the reboot, and testing again showed no difference in behavior.
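
(For reference, a typical way to make such settings persist across reboots on Debian is a drop-in file like the sketch below; the file name is illustrative, and this is not necessarily how they were set on my-host.)

cat > /etc/sysctl.d/99-boomerang-ping.conf <<'EOF'
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.eth1.rp_filter = 0
net.ipv4.conf.eth3.rp_filter = 0
net.ipv4.conf.all.accept_local = 1
net.ipv4.conf.default.accept_local = 1
net.ipv4.conf.eth1.accept_local = 1
net.ipv4.conf.eth3.accept_local = 1
EOF
sysctl --system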

Please note that my-host is an LXC Linux container running under Proxmox 4.4. I have also set rp_filter and accept_local as shown above on the hypervisor interfaces that correspond to the eth1 and eth3 interfaces on my-host.

To re-summarize my objective, I have a Linux host with two NICs, eth1 and eth3. I am trying to ping out eth1, have the ping packet get routed through a TCP/IP network infrastructure under test, and make its way back to eth3.

Nothing I've tried above has allowed me to do so. How may I do so?

8/27/2017 UPDATE

Per a note by dirkt pointing out that I had failed to mention whether eth1 and eth3 are purely virtual or correspond to a physical interface: eth1 and eth3 both correspond to the same physical interface on the hypervisor. The intent is that a packet that egresses eth1 actually physically leaves the hypervisor box, goes out onto a real TCP/IP network, and gets routed back.

8/27/2017 UPDATE #2

Per dirkt, I have investigated network namespaces, as they seemed quite promising. However, they don't "just work".

I am using LXC containers, and it seems that some of the isolation mechanisms present in containers are preventing me from creating a network namespace. Were I not running in a container, I think I'd have no problem adding the network namespace.

I am finding some references to making this work in LXC containers, but they are quite obscure and arcane. Not there yet, and have to throw in the towel for today... Should anyone have any suggestions in this regard, please advise...
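
(For reference, the commands below are the standard way to create and wire up a network namespace on a plain Linux host; the names testns, vethA and vethB are hypothetical. It is the creation step that the container's isolation appears to block.)

ip netns add testns                              # create the namespace
ip link add vethA type veth peer name vethB      # a veth pair to connect it
ip link set vethB netns testns                   # move one end inside
ip netns exec testns ip link set lo up           # bring up loopback inside it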

  • Your description of what you want is a bit confusing and doesn't match the way routing works in the kernel. When a packet egresses on eth3, it leaves the physical interface, so it can't be routed to eth1 afterwards. Routing is always by destination. Can you explain the reason why you need your ping to behave this way? If you want to test forwarding from eth3 to eth1, you need to ping eth3 from the outside, i.e. from another computer connected to that LAN segment. Or there may be a way to simulate ingressing packets on eth3, but not with ping.
    – dirkt
    Commented Aug 17, 2017 at 9:29
  • You might find your answer here.
    – harrymc
    Commented Aug 23, 2017 at 18:41

2 Answers


(I'll leave the other answer because of the comments).

Description of the task: Given a single virtual host in an LXC container with two network interfaces eth1 and eth3, which are on different LAN segments and externally connected through routers, how can one implement a "boomerang" ping that leaves on eth3 and returns on eth1 (or vice versa)?

The problem here is that the Linux kernel will detect that the destination address is assigned to eth1, and will try to directly deliver the packets to eth1, even if the routing tables prescribe that the packets should be routed via eth3.
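
(For reference, this is visible in the kernel's routing policy database: a separate "local" table holds a host route for every address assigned to the machine, and it is consulted before the main table that ordinary static routes go into. The commands below illustrate this; the quoted output is typical, not captured from the host in question.)

ip rule show               # "0: from all lookup local" comes before "lookup main"
ip route show table local  # holds e.g. "local 128.66.100.10 dev eth1 proto kernel scope host ..."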

It's not possible to just remove the IP address from eth1, because the ping must be answered. So the only solution is to somehow use two different addresses (or to separate eth1 and eth3 from each other).

One way to do that is to use iptables, as in this answer linked by harrymc in the comments.

Another way, which I have tested on my machine with the following setup, uses one network namespace to simulate the external network and two network namespaces to separate the destination IP addresses:

Routing NS     Main NS      Two NS's

+----------+                   +----------+
|   veth0b |--- veth0a ....... | ipvl0    |
| 10.0.0.1 |    10.0.0.254     | 10.0.0.2 |
|          |                   +----------+
|          |                   +----------+
|   veth1b |--- veth1a ....... | ipvl1    |
| 10.0.1.1 |    10.0.1.254     | 10.0.1.2 |
+----------+                   +----------+
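
For reference, the simulated topology above can be built roughly as follows (the namespace name nsr for the Routing NS is illustrative; the interface names and addresses come from the diagram, and the commands are a reconstruction, not a transcript):

ip netns add nsr                                   # the "Routing NS"
ip link add veth0a type veth peer name veth0b
ip link add veth1a type veth peer name veth1b
ip link set veth0b netns nsr
ip link set veth1b netns nsr

ip addr add 10.0.0.254/24 dev veth0a
ip addr add 10.0.1.254/24 dev veth1a
ip link set veth0a up
ip link set veth1a up

ip netns exec nsr ip addr add 10.0.0.1/24 dev veth0b
ip netns exec nsr ip addr add 10.0.1.1/24 dev veth1b
ip netns exec nsr ip link set veth0b up
ip netns exec nsr ip link set veth1b up
ip netns exec nsr sysctl -w net.ipv4.ip_forward=1  # forwarding enabled in the Routing NS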

The Routing NS has forwarding enabled. The additional 10.0.*.2 addresses are assigned to an IPVLAN device, which one can think of as an extra IP address assigned to the master interface it is connected to. More details about IPVLAN can be found e.g. here. Create it like this:

ip link add ipvl0 link veth0a type ipvlan mode l2
ip link set ipvl0 netns nsx

where nsx is the new network namespace, then in that namespace,

ip netns exec nsx ip addr add 10.0.0.2/24 dev ipvl0
ip netns exec nsx ip link set ipvl0 up
ip netns exec nsx ip route add default via 10.0.0.1 dev ipvl0
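
The second IPVLAN is created the same way (the namespace name nsy is illustrative):

ip link add ipvl1 link veth1a type ipvlan mode l2
ip netns add nsy
ip link set ipvl1 netns nsy
ip netns exec nsy ip addr add 10.0.1.2/24 dev ipvl1
ip netns exec nsy ip link set ipvl1 up
ip netns exec nsy ip route add default via 10.0.1.1 dev ipvl1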

The Main NS has the following routing rules in addition to the default rules:

ip route add 10.0.0.2/32 via 10.0.1.1 dev veth1a
ip route add 10.0.1.2/32 via 10.0.0.1 dev veth0a

and then ping 10.0.0.2 will do a "boomerang" round trip, as can be seen with tcpdump on both veth0a and veth1a. So with this setup, all logging can be done from the Main NS as far as pinging is concerned, but fancier tests with nc etc. might need the other namespaces, at least to provide a receiver.
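
For example (illustrative commands), with one capture per interface in two terminals,

tcpdump -ni veth0a icmp
tcpdump -ni veth1a icmp

a ping -c 3 10.0.0.2 from the Main NS shows the echo requests and replies crossing both interfaces.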

The LXC container uses network namespaces (and other namespaces). I am not too familiar with LXC containers, but if making new network namespaces inside the container is blocked, work from outside the container. First identify the container's network namespace with

ip netns list

and then do ip netns exec NAME_OF_LXC_NS ... as above. You can also delay moving eth1 and eth3 into the LXC container, first create the two IPVLANs, and then move them into the container. Script as appropriate.

Edit

There's a third variant that works without network namespaces. The trick is to use policy routing, and give local lookup a higher ("worse") priority than normal, and treat packets from a socket bound to a specific interface differently. This prevents delivery to the local address, which was the main source of the problem.

With the same simulation setup as above minus the IPVLANs,

ip rule add pref 1000 lookup local
ip rule del pref 0
ip rule add pref 100 oif veth0a lookup 100
ip rule add pref 100 oif veth1a lookup 101
ip route add default dev veth0a via 10.0.0.1 table 100
ip route add default dev veth1a via 10.0.1.1 table 101

the commands

ping 10.0.1.254 -I veth0a
ping 10.0.0.254 -I veth1a

correctly egress ping requests. To also get a ping reply, one must disable the tests against source spoofing:

echo "0" > /proc/sys/net/ipv4/conf/veth{0,1}a/rp_filter
echo "1" > /proc/sys/net/ipv4/conf/veth{0,1}a/accept_local

I also tried nc or socat, but I couldn't get them to work, because there are no options for nc to force the listener to answer on a specific device, and while there is such an option for socat, it doesn't seem to have an effect.

So network testing beyond pings is somewhat limited with this setup.


So to sum up, you have the following configuration:

Host 1           Main Host            Host 2
  ethX -------- eth1   eth3 --------- ethY
       128.66.100.10   128.66.200.10

In the Main Host, /proc/sys/net/ipv4/ip_forward is enabled, and you want to test that the connection between Host 1 and Host 2 works.

Quick reminder how Linux processes IP packets, per interface:

(Diagram: Linux packet processing through the netfilter chains, per interface)

So an ingressing packet from the physical layer traverses PREROUTING of the ingress interface, then gets routed by destination, then traverses POSTROUTING of the egress interface, and egresses to the physical layer. Conversely, applications like ping send packets to the OUTPUT chain, then they get routed (not shown in the picture), then traverse the POSTROUTING chain, and finally they egress.

Here I am using ingress in the sense of "enters the physical layer", and egress in the sense of "leave the physical layer".

What you are trying to do is to somehow tell the Linux kernel not to handle packets this way, but instead to simulate a packet ingressing on eth3 using the application ping, that then should get routed to eth1, where it egresses.

But that just doesn't work: applications send packets via the OUTPUT chain. If you force ping to bind to eth3 with the -I option, Linux will simply decide that this is the wrong interface for the packet and drop it. It will never attempt to treat the packet as if it were ingressing into eth3.

So the normal way to handle this is to just send the ping from Host 1, and verify if it arrives on Host 2 (and the other direction). Nice, simple and easy, no contortions necessary.

As the "Main Host" is virtual, eth1 and eth3 are very likely not real interfaces (you didn't say). If they are just one end of a veth pair, it's easy to get hold of the other end, and just produce the ping on that end (whereever it happens to be).

If you insist on testing everything on the "Main Host" for some reason, you can also go through some contortions: bridge eth3 to one end of another veth pair, and then ping from the other end of that veth pair. As the packet is bridged from the veth, it will be treated as ingressing into eth3, so that does what you want. But it's really unnecessarily complicated.

I don't know any other ways to simulate ingressing packets.

You may try some iptables magic, but if you are trying to test your network connection, that's a bad idea: you'll never know whether your iptables rules also work for real traffic, because that's not what you are testing.

  • dirkt, thank you for this response. I will look at it more closely in a few hours (before the bounty period ends, for sure). But a couple of quick comments... I do not have /proc/sys/net/ipv4/ip_forward enabled; I am not trying to get my dual-NIC VM to act as a router. Rather, I am trying to get it to act like two hosts.
    – Dave
    Commented Aug 27, 2017 at 16:38
  • Also, I am not trying to simulate ingress and egress. If I am successful in getting this to work as I desire, the ping really will egress out of one of the hypervisor's physical interfaces, travel into a real, physical, wired TCP/IP intranet, get routed back, ingress back into the same hypervisor physical interface, and get passed back up the hypervisor's stack via the virtual interface that corresponds to eth3 in my VM.
    – Dave
    Commented Aug 27, 2017 at 16:46
  • dirkt, you are correct, I failed to say if eth1 and eth3 correspond to physical interfaces. eth1 and eth3 do indeed correspond to a physical interface on the hypervisor. However, they do correspond to the same physical interface. They do not each have a separate physical interface allocated to them.
    – Dave
    Commented Aug 27, 2017 at 16:49
  • Maybe I still don't understand your setup. So eth3 and eth1 are connected to each other (how? through several routers? They are not in the same LAN segment), and the problem is that when you do ping 128.66.200.10, you want the Linux kernel to ignore that this address is already present on eth1, and instead route it through eth3, where it will travel through the network, and come back to eth1? If that is correct, I totally misunderstood your question...
    – dirkt
    Commented Aug 27, 2017 at 17:41
  • In that case, the simplest solution I can think of is to make an additional network namespace (think "mini virtual host"), so you have a similar situation as before, but now can do the logging in a single virtual host. I don't think a route conflicting with an interface address can be made to work (possibly with some iptables magic, but I'd have to try that).
    – dirkt
    Commented Aug 27, 2017 at 17:44
