Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore error during network cleanup if the route doesn't exist #7003

Closed
wants to merge 3 commits into from

Conversation

maggie-lou
Copy link
Contributor

We are getting vm_network_cleanup_failed alerts because of this

Related issues: Example https://buildbuddy-corp.slack.com/archives/C0154UFBCJ2/p1720610696894509

Copy link
Member

@tylerwilliams tylerwilliams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Any idea why we'd try to clean up a route that already was removed? Is this a symptom of some race condition?

@maggie-lou
Copy link
Contributor Author

Any idea why we'd try to clean up a route that already was removed? Is this a symptom of some race condition?

I'm not positive to be honest.
The function that calls the cleanup (firecracker::remove) is in a sync.Once, so should never get called again.
We aren't clearing the cleanup function after it's been called, but for firecracker we create a new Container object even when we're loading a snapshot, so that shouldn't matter.
Brandon added some network locking, so another VM shouldn't be trying to create / delete the same route. @bduffany do you have any ideas how this could happen?

if err != nil && !strings.Contains(err.Error(), "No such process") {
return err
}
return nil
})
} else {
log.Debugf("ip route %s via %s already exists", cloneIP, cloneEndpointAddr)
Copy link
Member

@bduffany bduffany Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "routeExists" logic seems a little sketchy since we don't expect these routes to be used by anything other than VMs, and the route IPs should be locked. Maybe we should promote this log line to Warning level or return an Unavailable error here?

Copy link
Contributor Author

@maggie-lou maggie-lou Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and we haven't seen this log for over 2 weeks in dev - upgraded to an error. This was probably more likely before we added network locking

@maggie-lou
Copy link
Contributor Author

maggie-lou commented Jul 12, 2024

Okay had a chance to investigate this a little more

In dev, every time we got this ip route delete error, there were a couple errors like ip route add default via 192.168.0.X" Error: Nexthop has invalid gateway. a couple milliseconds apart. So it seems like the networking has a temporary glitch, and a lot of the networking commands fail for a couple milliseconds.

The errors setting up the network should be okay, because the executors should retry the task

The aspect I'm not sure about is whether the route does actually still exist when we fail to delete it, and it's incorrectly reported as not existing because of the network glitch. If we ignore the error, then we'll unlock that ip and another VM can claim it. This PR makes it so that step will fail if the route already exists at that point. This alert is firing relatively frequently, so I'm inclining to trying this out and seeing how it does, but let me know if you have concerns with this approach @tylerwilliams @bduffany

@bduffany
Copy link
Member

@maggie-lou is it possible that Nexthop has invalid gateway is hinting at a legitimate error?

@vadimberezniker
Copy link
Member

Nexthop has invalid gateway is supposed to indicate that the gateway IP belongs to a subnet that is not attached to any interface. This would suggest something has already gone wrong prior to route setup.

@bduffany
Copy link
Member

I think I got to the bottom of this and fixed the issues in #7028

@maggie-lou maggie-lou closed this Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants