
I have a GKE Kubernetes cluster and a deployment.

I set the resources stanza on the deployment to

resources.requests.memory: 4G
resources.requests.cpu: 2
resources.limits.memory: 4G

cpu limit is unset
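
In manifest form, the stanza looks roughly like this (a sketch -- the Deployment name, container name, and image are illustrative placeholders, not from my actual config):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: django-app
spec:
  template:
    spec:
      containers:
        - name: django
          image: django-app:latest
          resources:
            requests:
              memory: "4G"
              cpu: "2"
            limits:
              memory: "4G"
              # cpu limit intentionally unset
```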

I deploy the pod (a Django web app) and hammer it with a load test. When saturated, the pod's CPU usage climbs to 1 CPU -- but essentially refuses to go above 1 CPU.

Any ideas what I should be troubleshooting here?

  • What research have you done and what have you tried? Is the system actually not working?
    Commented Feb 10, 2023 at 4:28
  • @music2myear I answered my own question. Thanks for responding!
    – JDS
    Commented Apr 7, 2023 at 15:51
  • Answering your own question is perfectly OK. You found the solution, you shared it, you get points! Glad you solved this.
    Commented Apr 8, 2023 at 3:51

1 Answer


I left this post open, so let's close it.

In my specific case, the bottleneck actually turned out to be upstream of the K8S Pod, at the database. The Postgres instance simply wasn't big enough: it was peaking at 100% CPU usage and causing downstream timeouts.

I suspect, but am not certain, that the CPU leveling out on the Pods was simply because the Pods were waiting for responses from upstream, and couldn't go above 1 CPU of usage because there wasn't anything else for them to do.

Additionally, the Django instances are using Django Channels and the ASGI asynchronous model, which is single-threaded and doesn't have the same "child process" model as uWSGI; that's another reason -- or maybe the actual reason -- that the CPU usage on the Pod maxes out at 1 CPU.
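
For context, the ASGI container runs a single Daphne process, so one event loop caps out at roughly one core. The container spec is a sketch along these lines (image name and ASGI module path are placeholders):

```yaml
containers:
  - name: daphne
    image: django-app:latest
    # One Daphne process = one event loop = roughly one core at saturation.
    # "myproject.asgi:application" is a placeholder for the real ASGI module.
    command: ["daphne", "-b", "0.0.0.0", "-p", "8000", "myproject.asgi:application"]
```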

So I'm pretty sure the correct way to scale this up is to

  • Vertically scale Postgres
  • Increase the baseline number of Pods
  • Lower the autoscaler (HPA) threshold to scale up and add new Pods
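
The HPA change might look something like this (a sketch, assuming CPU-based autoscaling against the existing Deployment; names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: django-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: django-app
  minReplicas: 4          # raised baseline number of Pods
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 40   # lowered threshold so new Pods are added sooner
```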

EDIT: Additional information

The issue also has to do with the way the app itself is designed. We are trying to use Django Channels asynchronous Python, running that in a Daphne ASGI container; however, not all of the app is async, and apparently that's bad. I did a lot of research into this async-vs-sync application problem and the resulting deadlocks, and while I'm having the dev team redesign the app, I also redesigned the deploy:

  • Add a uWSGI server to the deployment
  • Deploy the codebase into two Deployments/Pods
    • One runs the ASGI (Daphne) server
    • One runs the uWSGI server
  • Route all ASGI endpoints (there's only one) to the ASGI pod in the Ingress Path rules
  • Route all uWSGI endpoints to the sync Pod in the Ingress Path rules
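
The Ingress path rules are a sketch along these lines (hostname, Service names, ports, and the ASGI path are all placeholders, not my actual values):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: django-app
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          # The single ASGI endpoint goes to the Daphne Pod's Service
          - path: /ws
            pathType: Prefix
            backend:
              service:
                name: django-asgi
                port:
                  number: 8000
          # Everything else goes to the sync uWSGI Pod's Service
          - path: /
            pathType: Prefix
            backend:
              service:
                name: django-uwsgi
                port:
                  number: 8000
```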

This works; however, I don't have full load testing done yet.
