Cost Optimization with Spot Instances in Karpenter

Azure spot instances run on unused Azure capacity and are offered at a significant discount compared to on-demand pricing. The trade-off is that Azure may reclaim these instances with only 30 seconds' notice when it needs the capacity back.

Key characteristics of spot instances:

  • Much lower cost (up to 90% discount)
  • May be evicted with 30 seconds' notice
  • Availability varies by region and VM size
  • Perfect for fault-tolerant, batch, or stateless workloads
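
To make the 30-second notice concrete: Azure reports a pending spot eviction through the Instance Metadata Service Scheduled Events endpoint, as an event whose EventType is Preempt. The sketch below is illustrative only and is not part of the workshop steps. The function name has_preempt_event is a hypothetical helper, and the commented curl call assumes the process can reach the metadata endpoint from the node.

```shell
# Succeeds (exit 0) if the Scheduled Events JSON on stdin contains a
# spot eviction, which Azure reports with EventType "Preempt".
has_preempt_event() {
  grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"Preempt"'
}

# Hypothetical usage from a process running on the node:
#   curl -s -H "Metadata:true" \
#     "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" \
#     | has_preempt_event && echo "eviction pending: start draining"
```

A pattern match keeps the sketch free of extra tools. A real handler would more likely parse the JSON properly and read the NotBefore field of the event.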

In this module, we'll explore how to leverage Karpenter to use Spot efficiently.

Prerequisites

Before beginning, ensure you have:

  1. A running AKS cluster with Karpenter/NAP enabled
  2. The workshop namespace created

Exercise 1: Basic Spot Instance Configuration

Let's start by creating a new NodePool that can only deploy spot instances.

Step 1: Create a Spot-Based NodePool

cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default-demo-spot
  annotations:
    kubernetes.io/description: "Basic NodePool for generic workloads"
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5s
    budgets:
      - nodes: 100%
  limits:
    cpu: "100"
  template:
    metadata:
      labels:
        # required for Karpenter to predict overhead from cilium DaemonSet
        kubernetes.azure.com/ebpf-dataplane: cilium
        aks-karpenter: demo
    spec:
      expireAfter: Never
      startupTaints:
        # https://karpenter.sh/docs/concepts/nodepools/#cilium-startup-taint
        - key: node.cilium.io/agent-not-ready
          effect: NoExecute
          value: "true"
      requirements:
        # Switch off to D family (B has no arm64)
        - key: karpenter.azure.com/sku-family
          operator: In
          values: [D]
        - key: kubernetes.io/arch
          operator: In
          # Note addition of ARM64
          values: ["amd64", "arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.azure.com/sku-cpu
          operator: Lt
          values: ["5"]
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default-demo 
---
apiVersion: karpenter.azure.com/v1alpha2
kind: AKSNodeClass
metadata:
  name: default-demo
  annotations:
    kubernetes.io/description: "Basic AKSNodeClass for running Ubuntu2204 nodes"
spec:
  imageFamily: Ubuntu2204
  osDiskSizeGB: 100
EOF

# PowerShell equivalent of the bash command above
$yamlContent = @"
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default-demo-spot
  annotations:
    kubernetes.io/description: "Basic NodePool for generic workloads"
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5s
    budgets:
      - nodes: 100%
  limits:
    cpu: "100"
  template:
    metadata:
      labels:
        # required for Karpenter to predict overhead from cilium DaemonSet
        kubernetes.azure.com/ebpf-dataplane: cilium
        aks-karpenter: demo
    spec:
      expireAfter: Never
      startupTaints:
        # https://karpenter.sh/docs/concepts/nodepools/#cilium-startup-taint
        - key: node.cilium.io/agent-not-ready
          effect: NoExecute
          value: "true"
      requirements:
        # Switch off to D family (B has no arm64)
        - key: karpenter.azure.com/sku-family
          operator: In
          values: [D]
        - key: kubernetes.io/arch
          operator: In
          # Note addition of ARM64
          values: ["amd64", "arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.azure.com/sku-cpu
          operator: Lt
          values: ["5"]
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default-demo 
---
apiVersion: karpenter.azure.com/v1alpha2
kind: AKSNodeClass
metadata:
  name: default-demo
  annotations:
    kubernetes.io/description: "Basic AKSNodeClass for running Ubuntu2204 nodes"
spec:
  imageFamily: Ubuntu2204
  osDiskSizeGB: 100
"@

$yamlContent | kubectl apply -f -

The key configuration is the karpenter.sh/capacity-type requirement set to spot, which tells Karpenter to provision only spot instances. For this module we also restrict the VM size with karpenter.azure.com/sku-cpu (operator Lt, value 5), that is, at most 4 vCPUs per node, so that costs remain comparable across the configurations in the following exercises.
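
As a small illustration of how the Lt operator on karpenter.azure.com/sku-cpu filters VM sizes, the sketch below applies the same strict "less than 5" comparison to a few example vCPU counts. The function name sku_cpu_allowed and the list of sizes are assumptions made for illustration; Karpenter performs this filtering internally.

```shell
# Mirrors the NodePool requirement: operator Lt, value 5 (strictly fewer than 5 vCPUs).
sku_cpu_allowed() {
  [ "$1" -lt 5 ]
}

# Example vCPU counts, chosen only for illustration.
for vcpus in 2 4 8 16; do
  if sku_cpu_allowed "$vcpus"; then
    echo "$vcpus vCPUs: eligible"
  else
    echo "$vcpus vCPUs: filtered out"
  fi
done
```

Because Lt is a strict comparison, a size with exactly 5 vCPUs would also be excluded.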

Step 2: Deploy a Spot-Compatible Workload

Next, let's update our deployment so it runs exclusively on spot instances, for now using amd64:

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: workshop
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: karpenterossazuredemo.azurecr.io/pause:3.7
          resources:
            requests:
              cpu: 1
      nodeSelector:
        aks-karpenter: demo
        kubernetes.io/arch: amd64
EOF

# PowerShell equivalent of the bash command above
$yamlContent = @"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: workshop
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: karpenterossazuredemo.azurecr.io/pause:3.7
          resources:
            requests:
              cpu: 1
      nodeSelector:
        aks-karpenter: demo
        kubernetes.io/arch: amd64
"@

$yamlContent | kubectl apply -f -

Step 3: Observe Spot Instance Provisioning

Examine the nodes to verify they're using spot capacity:

kubectl get nodes -l aks-karpenter=demo -o custom-columns=NAME:.metadata.name,CAPACITY_TYPE:.metadata.labels.karpenter\\.sh/capacity-type

[Screenshot: Spot Capacity Nodes]

You can also view nodes using the AKS Node Viewer tool:

aks-node-viewer --node-selector aks-karpenter=demo

[Screenshot: AKS Node Viewer]

Compared with the on-demand AMD64 cost of the same workload from the previous exercise ($186.15/month), the spot cost in our run (about $30.10/month) represents a saving of roughly 84%. You will not see exactly the same values in your environment, because spot prices depend on region and availability.

Exercise 2: Combining Spot with ARM64 for Maximum Cost Savings

ARM64 instances are already cheaper than equivalent AMD64 VMs. By combining ARM64 architecture with spot pricing, we can achieve even greater cost savings. The nodepool we created in Exercise 1 already allows for both AMD64 and ARM64 architectures, so we can simply update our deployment to request ARM64 nodes.

Step 1: Deploy a Workload for ARM64 Spot Instances

Let's update our workload to specifically request ARM64 spot instances with a combination of nodeSelector configurations:

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: workshop
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: karpenterossazuredemo.azurecr.io/pause:3.7
          resources:
            requests:
              cpu: 1
      nodeSelector:
        aks-karpenter: demo
        kubernetes.io/arch: arm64
EOF

# PowerShell equivalent of the bash command above
$yamlContent = @"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: workshop
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: karpenterossazuredemo.azurecr.io/pause:3.7
          resources:
            requests:
              cpu: 1
      nodeSelector:
        aks-karpenter: demo
        kubernetes.io/arch: arm64
"@

$yamlContent | kubectl apply -f -

Step 2: Observe and Compare Cost Savings

Let's verify our ARM64 spot instances are provisioned correctly:

kubectl get nodes -l aks-karpenter=demo -o custom-columns=NAME:.metadata.name,ARCH:.metadata.labels.kubernetes\\.io/arch,CAPACITY_TYPE:.metadata.labels.karpenter\\.sh/capacity-type

You should see nodes with ARM64 architecture and spot capacity type.
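
If you prefer a scripted check over reading the table by eye, the sketch below validates the same three columns. The function name check_arm64_spot is a hypothetical helper, and it assumes the kubectl output is produced with --no-headers so that each row is "NAME ARCH CAPACITY_TYPE".

```shell
# Reads "NAME ARCH CAPACITY_TYPE" rows (no header line) on stdin.
# Prints any row that is not arm64 + spot and exits non-zero if one is
# found, or if no rows were received at all.
check_arm64_spot() {
  awk '
    $2 != "arm64" || $3 != "spot" { print "unexpected node: " $0; bad = 1 }
    END {
      if (NR == 0) { print "no nodes found"; exit 1 }
      exit (bad ? 1 : 0)
    }'
}

# Usage against the cluster (same columns as the command above):
#   kubectl get nodes -l aks-karpenter=demo --no-headers \
#     -o custom-columns=NAME:.metadata.name,ARCH:.metadata.labels.kubernetes\\.io/arch,CAPACITY_TYPE:.metadata.labels.karpenter\\.sh/capacity-type \
#     | check_arm64_spot && echo "all demo nodes are arm64 spot"
```

While the rollout is still in progress, amd64 nodes from the previous step may still be listed, so a failure shortly after applying the deployment is not necessarily a problem.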

Now let's use the AKS Node Viewer to compare the cost of our AMD64 and ARM64 spot instances:

aks-node-viewer --node-selector "aks-karpenter=demo" --resources cpu

[Screenshot: AKS Node Viewer]

In our run, this gives a cost reduction of roughly 85% compared with ARM64 on-demand pricing, and the lowest cost profile overall.

So far we have observed the following monthly costs for our 4 pods on 2 nodes. Keep in mind that the exact amounts will vary for you based on region, VM size, and current spot market conditions:

Capacity type   AMD64      ARM64
On-demand       $186.15    $135.78
Spot            $30.10     $20.85

As this sample pricing table shows (it is built from the AKS Node Viewer output collected so far), combining ARM64 with spot instances gives the largest savings: about 89% less than standard on-demand AMD64 VMs.
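
The percentages quoted in this module can be reproduced from the sample table with a little arithmetic. The sketch below is one way to do it; the function name savings_pct is a hypothetical helper, and the prices are the sample values from our run, not figures you should expect to see in your own environment.

```shell
# Percentage saved when paying "$2" instead of the baseline price "$1".
savings_pct() {
  awk -v base="$1" -v price="$2" 'BEGIN { printf "%.1f\n", (1 - price / base) * 100 }'
}

# Monthly figures from the sample table (4 pods on 2 nodes).
echo "AMD64 spot vs AMD64 on-demand: $(savings_pct 186.15 30.10)%"
echo "ARM64 spot vs ARM64 on-demand: $(savings_pct 135.78 20.85)%"
echo "ARM64 spot vs AMD64 on-demand: $(savings_pct 186.15 20.85)%"
```

To compare your own results, replace the sample prices with the values reported by AKS Node Viewer in your cluster.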

In module 6 we will explore additional concepts and discuss how to deploy workloads across spot and on-demand configurations for additional safety.

Best Practices for Spot Instances with Karpenter

  1. Design for resilience: Build applications that can handle sudden terminations.
  2. Deploy adequate replicas: Ensure you have enough replicas to maintain service when nodes are evicted.
  3. Implement health checks and readiness probes: Ensure traffic only routes to healthy pods.
  4. Distribute pods across nodes: Use pod anti-affinity to reduce impact of single-node evictions.
  5. Use Horizontal Pod Autoscaler: Automatically maintain minimum required replicas.
  6. Avoid using spot for stateful workloads: Unless you have a robust replication strategy.
  7. Monitor spot evictions: Track metrics to understand impact and optimize strategies.
  8. Use a mix of instance types: In the exercises above we explicitly restricted the NodePool to small nodes from the D family. In practice, allow more instance families and sizes so that Karpenter has more spot capacity to choose from and can find the lowest price.
  9. Implement proper lifecycle hooks: Use preStop hooks to gracefully terminate connections when the eviction signal is received.
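
As an illustration of practice 4, the sketch below prints a variant of the inflate Deployment that asks the scheduler to prefer placing replicas on different nodes. It is not part of the workshop steps. The function name render_spread_deployment is a hypothetical helper, and the anti-affinity weight is an arbitrary example value. The manifest is only printed, so you can review it before deciding to apply it.

```shell
# Prints the inflate Deployment with a preferred pod anti-affinity rule.
# Nothing is applied until the output is piped to kubectl.
render_spread_deployment() {
  cat <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: workshop
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      affinity:
        podAntiAffinity:
          # Soft rule: prefer nodes that do not already run an inflate pod.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: inflate
                topologyKey: kubernetes.io/hostname
      containers:
        - name: inflate
          image: karpenterossazuredemo.azurecr.io/pause:3.7
          resources:
            requests:
              cpu: 1
      nodeSelector:
        aks-karpenter: demo
EOF
}

# Review first, then apply if it suits your workload:
#   render_spread_deployment | kubectl apply -f -
```

A preferred rule is used here rather than a required one so that pods can still be scheduled when few nodes are available. Spreading replicas generally means more nodes, so it trades some cost efficiency for resilience.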

Cleanup

Before moving to the next module, clean up your resources:

kubectl delete deployment -n workshop inflate
kubectl delete nodepool default-demo-spot

Conclusion

In this module, you've learned how to leverage spot instances with Karpenter for significant cost savings. Key takeaways include:

  • How to configure NodePools and deployments to use spot instances
  • How to combine ARM64 architecture with spot pricing for maximum cost savings
  • Practical comparison of cost differences between on-demand and spot instances across architectures

By effectively using spot instances with Karpenter, you can dramatically reduce your AKS compute costs while maintaining application availability and performance. The combination of spot instances with Karpenter's intelligent provisioning across different architectures makes for a particularly powerful cost optimization strategy.

In the next module, we'll explore team isolation and working with multiple nodepools to provide separation and organization of workloads across different teams.