-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore error during network cleanup if the route doesn't exist #7003
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Any idea why we'd try to clean up a route that already was removed? Is this a symptom of some race condition?
I'm not positive to be honest. |
server/util/networking/networking.go
Outdated
if err != nil && !strings.Contains(err.Error(), "No such process") { | ||
return err | ||
} | ||
return nil | ||
}) | ||
} else { | ||
log.Debugf("ip route %s via %s already exists", cloneIP, cloneEndpointAddr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "routeExists" logic seems a little sketchy since we don't expect these routes to be used by anything other than VMs, and the route IPs should be locked. Maybe we should promote this log line to Warning level or return an Unavailable error here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and we haven't seen this log for over 2 weeks in dev - upgraded to an error. This was probably more likely before we added network locking
Okay had a chance to investigate this a little more In dev, every time we got this The errors setting up the network should be okay, because the executors should retry the task The aspect I'm not sure about is whether the route does actually still exist when we fail to delete it, and it's incorrectly reported as not existing because of the network glitch. If we ignore the error, then we'll unlock that ip and another VM can claim it. This PR makes it so that step will fail if the route already exists at that point. This alert is firing relatively frequently, so I'm inclining to trying this out and seeing how it does, but let me know if you have concerns with this approach @tylerwilliams @bduffany |
@maggie-lou is it possible that |
|
I think I got to the bottom of this and fixed the issues in #7028 |
We are getting vm_network_cleanup_failed alerts because of this
Related issues: Example https://buildbuddy-corp.slack.com/archives/C0154UFBCJ2/p1720610696894509