Terraform `aws_eks_node_group` Ready But Create Never Completes

Question

I have a Terraform setup that creates a new launch template and a node group. Without the launch template, everything works correctly. With the launch template, the nodes become Ready, but creation of the node group never completes.

main.tf

...

resource "aws_launch_template" "this" {
  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_type           = var.block_device_mappings.type
      volume_size           = var.block_device_mappings.size
      iops                  = var.block_device_mappings.iops
      kms_key_id            = var.block_device_mappings.kms_key_id
      encrypted             = var.block_device_mappings.encrypted
      delete_on_termination = var.block_device_mappings.delete_on_termination
    }
  }

  user_data = base64encode(templatefile("${path.module}/user_data.tpl", {
    cluster_endpoint           = var.cluster_endpoint
    certificate_authority_data = var.certificate_authority_data
    bootstrap_extra_args       = "--use-max-pods false"
    cluster_name               = var.cluster_name
  }))
}

resource "aws_eks_node_group" "this" {
  cluster_name    = var.cluster_name
  node_group_name = var.node_group_name
  node_role_arn   = var.node_group_arn
  instance_types  = [var.instance_type]
  subnet_ids = [
    for subnet in var.subnets : subnet.id
  ]
  capacity_type = var.capacity_type

  scaling_config {
    desired_size = var.desired_capacity
    max_size     = var.max_capacity
    min_size     = var.min_capacity
  }

  update_config {
    max_unavailable = 1
  }

  labels = var.node_group_labels

  dynamic "taint" {
    for_each = toset(var.node_group_taints)

    content {
      key    = taint.value.key
      value  = taint.value.value
      effect = taint.value.effect
    }
  }

  launch_template {
    id      = aws_launch_template.this.id
    version = aws_launch_template.this.latest_version
  }
}

...

user_data.tpl

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="/:/+++"

--/:/+++
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash

/etc/eks/bootstrap.sh --apiserver-endpoint '${cluster_endpoint}' --b64-cluster-ca '${certificate_authority_data}' ${bootstrap_extra_args} '${cluster_name}'

--/:/+++--

kubectl get nodes

NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-1-128.us-west-1.compute.internal   Ready    <none>   13m   v1.29.0-eks-5e0fdde
ip-192-168-1-140.us-west-1.compute.internal   Ready    <none>   13m   v1.29.0-eks-5e0fdde
ip-192-168-1-157.us-west-1.compute.internal   Ready    <none>   13m   v1.29.0-eks-5e0fdde

kubectl describe node ip-192-168-1-128.us-west-1.compute.internal

Name:               ip-192-168-1-128.us-west-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.4xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-west-1
                    failure-domain.beta.kubernetes.io/zone=us-west-1a
                    k8s.io/cloud-provider-aws=cff041cdc91d38d182baa77beef8bf9f
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-192-168-1-128.us-west-1.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=m5.2xlarge
                    topology.kubernetes.io/region=us-west-1
                    topology.kubernetes.io/zone=us-west-1a
Annotations:        alpha.kubernetes.io/provided-node-ip: 192.168.1.128
                    csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"ip-192-168-1-128.us-gov-west-1.compute.internal"}
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.1.128/24
                    projectcalico.org/IPv4VXLANTunnelAddr: 10.42.7.192
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 14 Mar 2024 10:40:54 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-192-168-1-128.us-west-1.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Thu, 14 Mar 2024 10:54:21 -0400
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 14 Mar 2024 10:41:24 -0400   Thu, 14 Mar 2024 10:41:24 -0400   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Thu, 14 Mar 2024 10:52:09 -0400   Thu, 14 Mar 2024 10:40:54 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 14 Mar 2024 10:52:09 -0400   Thu, 14 Mar 2024 10:40:54 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 14 Mar 2024 10:52:09 -0400   Thu, 14 Mar 2024 10:40:54 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 14 Mar 2024 10:52:09 -0400   Thu, 14 Mar 2024 10:41:18 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   192.168.1.128
  InternalDNS:  ip-192-168-1-128.us-west-1.compute.internal
  Hostname:     ip-192-168-1-128.us-west-1.compute.internal
Capacity:
  cpu:                16
  ephemeral-storage:  20959212Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             64333324Ki
  pods:               110
Allocatable:
  cpu:                15890m
  ephemeral-storage:  18242267924
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             61334028Ki
  pods:               110
System Info:
  Machine ID:                 ec2821bfac66895c1abc29a47021fe76
  System UUID:                ec2821bf-ac66-895c-1abc-29a47021fe76
  Boot ID:                    356d15db-1436-4c45-af1e-6a668eddd8e0
  Kernel Version:             5.10.210-201.852.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.11
  Kubelet Version:            v1.29.0-eks-5e0fdde
  Kube-Proxy Version:         v1.29.0-eks-5e0fdde
ProviderID:                   aws:///us-west-1a/i-0874068c9ab354407
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                 ------------  ----------  ---------------  -------------  ---
  calico-apiserver            calico-apiserver-5f98fdb745-cf4xg    0 (0%)        0 (0%)      0 (0%)           0 (0%)         12m
  calico-system               calico-node-6c98k                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13m
  calico-system               calico-typha-695fb789b5-sfq4n        0 (0%)        0 (0%)      0 (0%)           0 (0%)         13m
  calico-system               csi-node-driver-qtczs                0 (0%)        0 (0%)      0 (0%)           0 (0%)         13m
  ionic-system                tigera-operator-967f9fc76-tghqf      0 (0%)        0 (0%)      0 (0%)           0 (0%)         15m
  kube-system                 kube-proxy-cnlnc                     100m (0%)     0 (0%)      0 (0%)           0 (0%)         13m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                100m (0%)  0 (0%)
  memory             0 (0%)     0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:
  Type     Reason                   Age                From                   Message
  ----     ------                   ----               ----                   -------
  Normal   Starting                 13m                kube-proxy             
  Normal   Synced                   13m                cloud-node-controller  Node synced successfully
  Normal   Starting                 13m                kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      13m                kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  13m (x2 over 13m)  kubelet                Node ip-192-168-1-128.us-west-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    13m (x2 over 13m)  kubelet                Node ip-192-168-1-128.us-west-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     13m (x2 over 13m)  kubelet                Node ip-192-168-1-128.us-west-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  13m                kubelet                Updated Node Allocatable limit across pods
  Normal   RegisteredNode           13m                node-controller        Node ip-192-168-1-128.us-west-1.compute.internal event: Registered Node ip-192-168-1-128.us-west-1.compute.internal in Controller
  Normal   NodeReady                13m                kubelet                Node ip-192-168-1-128.us-west-1.compute.internal status is now: NodeReady

What makes this more confusing is that if I remove `--apiserver-endpoint '${cluster_endpoint}' --b64-cluster-ca '${certificate_authority_data}'` from the .tpl file, everything works without issue, except that the max pod count is wrong (it drops to 58 because of the instance type).
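
For context, 58 is consistent with the per-instance pod limit used with the AWS VPC CNI, which is derived from the number of network interfaces and the IPv4 addresses available on each. The arithmetic can be sketched as Terraform locals, assuming an m5.2xlarge (4 interfaces with 15 IPv4 addresses each, as the `node.kubernetes.io/instance-type` label in the output above suggests). The local names are illustrative and not part of the original configuration.

```hcl
locals {
  # Assumed limits for an m5.2xlarge; other instance types differ.
  eni_count   = 4
  ips_per_eni = 15

  # One address per interface is reserved for the interface itself,
  # and 2 is added for pods that use host networking.
  vpc_cni_max_pods = local.eni_count * (local.ips_per_eni - 1) + 2 # evaluates to 58
}
```

With Calico in VXLAN mode, pod addresses presumably do not come from the instance's network interfaces, so this limit is not the relevant one here, which is the reason for passing `--use-max-pods false`.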

Notes:

We are using Calico instead of the AWS VPC CNI. This is a requirement for the project, so I am stuck with it.
The only oddity so far is that when I run this without the above arguments, the taint is applied, and when I run it with them, the taint is not applied. That might be a red herring.

Any suggestions are appreciated.

bulletshot60 · Accepted Answer · 2024-03-15 17:27:11Z

It took a couple of days of digging, but the answer is that the launch template was not applying a label that AWS needs: eks.amazonaws.com/nodegroup.

Below is the new template we are using, which supplies that label along with eks.amazonaws.com/capacityType.

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="/:/+++"

--/:/+++
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
set -ex
B64_CLUSTER_CA=${cluster_certificate_authority_data}
API_SERVER_URL=${cluster_endpoint}
K8S_CLUSTER_DNS_IP=172.20.0.10
/etc/eks/bootstrap.sh ${cluster_name} --kubelet-extra-args '--node-labels=eks.amazonaws.com/capacityType=${node_group_capacity_type},eks.amazonaws.com/nodegroup=${node_group_name} --max-pods=${max_pods_per_node}' --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL --dns-cluster-ip $K8S_CLUSTER_DNS_IP --use-max-pods false

--/:/+++--
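
This template references variables that the `templatefile` call in the question does not pass. A minimal sketch of an updated call might look like the following. It reuses the variables from the question's main.tf where they exist. `var.max_pods_per_node` is a hypothetical variable introduced only for illustration, and its default of 110 is an assumption based on the pod capacity shown in the node output.

```hcl
# Hypothetical variable, not part of the original configuration.
variable "max_pods_per_node" {
  description = "Value passed to kubelet as --max-pods"
  type        = number
  default     = 110
}

resource "aws_launch_template" "this" {
  # block_device_mappings as in the question

  user_data = base64encode(templatefile("${path.module}/user_data.tpl", {
    cluster_name                       = var.cluster_name
    cluster_endpoint                   = var.cluster_endpoint
    cluster_certificate_authority_data = var.certificate_authority_data

    # Rendered into --node-labels so that the nodes carry the
    # eks.amazonaws.com/nodegroup and eks.amazonaws.com/capacityType labels.
    node_group_name          = var.node_group_name
    node_group_capacity_type = var.capacity_type

    max_pods_per_node = var.max_pods_per_node
  }))
}
```

The node group name passed here needs to match the `node_group_name` of the `aws_eks_node_group` resource, since the label value is what ties each node to its node group.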
