Optimizing Kubernetes Costs and Availability in OKE: A Strategy for Preemptible Nodes and Descheduler Integration

Written by Heitor Augusto Murari Cardozo, Site Reliability Engineer (SRE) at Pipefy.

In cloud-native environments, balancing cost efficiency with application availability and performance is a constant challenge. Preemptible (or Spot) instances offer significant cost savings, but they come with the risk of being reclaimed at any time, which makes them suitable only for stateless, interruptible workloads that can be terminated and restarted without loss of state.

In addition to the possibility of being interrupted, preemptible instances in Oracle Kubernetes Engine (OKE) have a considerable probability of encountering an “Out of Capacity” event. These events occur when the cloud provider, in this case, Oracle Cloud Infrastructure (OCI), temporarily lacks sufficient resources (compute instances) in a specific availability domain or region to fulfill a request for new preemptible nodes. This can happen due to high demand for preemptible instances or other operational constraints in the OCI infrastructure. 

When such an event occurs, OKE cannot provision new preemptible nodes, which can lead to pending pods and application performance degradation if there are no alternative on-demand nodes available to take over the workload.

The “Out of Capacity” Problem with Preemptible Nodes in OKE

Preemptible nodes are a double-edged sword. While cost-effective, they can be terminated by the cloud provider at any time. In OKE, this can lead to a specific issue when dealing with “Out of Capacity” scenarios for preemptible instances:

  • The ~20-Minute Delay: When the Cluster Autoscaler (CA) attempts to provision a new preemptible node in an Availability Domain (AD) that is experiencing “Out of Capacity,” it doesn’t fail immediately. Instead, the CA waits for a defined period (the default is --max-node-provision-time=15m; see the excerpt after this list) for the node to appear. OCI might not return an immediate “Out of Capacity” error, causing the CA to wait out this entire timeout.
  • Back-off and Re-evaluation: After the 15-minute timeout, the CA registers the provisioning attempt as a failure and enters a back-off period for that specific AD/Node Pool. This means a new attempt (or consideration of another AD) will only occur after an additional delay, often totaling around 20 minutes from the initial request.
  • Pod Starvation: During this prolonged delay, if pods are strictly bound to the problematic AD or node pool via scheduling rules (like nodeSelector or requiredDuringSchedulingIgnoredDuringExecution affinity), they will remain in a Pending state. This can lead to application starvation and impact availability, especially in highly elastic environments.
  • Limited AD Flexibility: Even if other Availability Domains have available capacity, the CA will prioritize attempts in the AD where pods are “stuck” due to rigid affinity rules. It will only consider other ADs/Node Pools after the initial timeout and re-evaluation.
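
For reference, this waiting period is controlled by the Cluster Autoscaler’s own command-line flags. The excerpt below sketches how the relevant flag might appear among the autoscaler’s container arguments; the node pool OCID and the min/max counts are placeholders, and the full argument list in your cluster will differ.

# Excerpt of the Cluster Autoscaler container command (not a complete manifest).
command:
  - ./cluster-autoscaler
  - --nodes=1:5:<ON_DEMAND_NODE_POOL_OCID> # min:max:node-pool-ocid (placeholder)
  - --max-node-provision-time=15m # default; how long the CA waits for a requested node before treating the attempt as failed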

Proposed Solution Architecture

To mitigate these risks while still leveraging the cost benefits of preemptible nodes, we propose a hybrid architecture:

Oracle already recommends using preemptible nodes with on-demand nodes as a backup solution to reduce costs (as highlighted in their blog post: Reduce your Kubernetes costs with preemptible nodes). However, a key limitation of this basic approach is that when preemptible nodes are interrupted and pods fall back to on-demand nodes, they typically only return to preemptible nodes during a manual rollout or redeployment. Our proposed solution addresses this by actively rebalancing pods.

  • Fixed Preemptible Node Pool: This pool will run a fixed number of preemptible nodes (without Cluster Autoscaler enabled for this specific pool). It serves as the primary, cost-optimized capacity for your baseline workload.
  • On-Demand Node Pool with Cluster Autoscaler: This pool consists of standard on-demand nodes and has the Cluster Autoscaler enabled. It acts as a resilient fallback, providing burst capacity and ensuring high availability when preemptible nodes are unavailable or reclaimed.
  • Kubernetes Descheduler: This component is crucial for optimizing pod placement. It actively monitors the cluster and moves pods from the more expensive on-demand nodes back to the cost-effective preemptible nodes once they become available and have capacity.

The overall goal of this setup is to achieve significant cost savings by prioritizing preemptible nodes, while simultaneously improving application response time and availability by ensuring a reliable fallback and intelligent rebalancing.

Implementation Steps

Here are the necessary steps to implement this configuration:

Step 1: Configure Fixed Preemptible Node Pool

Create your preemptible node pool in OKE with a fixed number of nodes. Do not enable Cluster Autoscaler for this pool.

  • Node Label: Ensure these nodes are automatically labeled by OKE (or manually, if needed) with oci.oraclecloud.com/oke-is-preemptible: "true". You can verify the label with the kubectl command shown after this list.
  • Node Count: Set the desired fixed number of nodes (e.g., 3 nodes). This number should be carefully defined to meet the base load of your environment, avoiding unnecessary resource waste.
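
To confirm that the pool’s nodes carry the expected label, a standard kubectl query by label selector works:

kubectl get nodes -l oci.oraclecloud.com/oke-is-preemptible=true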

Step 2: Configure On-Demand Node Pool with Cluster Autoscaler

Create your standard on-demand node pool and enable the Cluster Autoscaler for it.

  • Standard Nodes: These nodes will not have the oci.oraclecloud.com/oke-is-preemptible label.
  • Cluster Autoscaler Configuration: Review and adjust the Cluster Autoscaler’s scale-down parameters to prevent aggressive node termination, which could lead to pods being stuck in Pending if preemptible nodes are not yet available.
    • --scale-down-unneeded-time: Increase this value (e.g., to 15m or 20m) to allow more time for pods to be rescheduled before an unneeded on-demand node is terminated. This is especially important if your Descheduler runs less frequently.
    • Other parameters, such as --scale-down-delay-after-add and --scale-down-delay-after-delete, might also need fine-tuning based on your cluster’s behavior; an illustrative excerpt follows below.
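
These scale-down settings are also plain Cluster Autoscaler command-line flags, passed alongside the arguments shown earlier. The values below are illustrative examples for this setup, not universal recommendations; adjust them to your Descheduler schedule and workload patterns.

# Additional Cluster Autoscaler flags governing scale-down of the on-demand pool (illustrative values).
  - --scale-down-unneeded-time=20m # wait longer before an unneeded on-demand node is removed
  - --scale-down-delay-after-add=10m # pause scale-down evaluation after a scale-up
  - --scale-down-delay-after-delete=10m # pause scale-down evaluation after a node deletion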

Step 3: Pod Affinity Configuration

Configure your application’s pods (e.g., my-generic-app) to prefer preemptible nodes through affinity rules, while still allowing them to be scheduled on on-demand nodes when preemptible capacity is unavailable. This ensures that pods do not get stuck in a Pending state.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-generic-app
  namespace: <YOUR_NAMESPACE> # Change to your namespace
spec:
  replicas: 3 # Adjust replica count as needed
  selector:
    matchLabels:
      app: my-generic-app
  template:
    metadata:
      labels:
        app: my-generic-app
    spec:
      containers:
      - name: my-generic-app-container
        image: your-repo/my-generic-app-image:latest # Replace with your image
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
      affinity:
        nodeAffinity:
          # This rule prefers preemptible nodes but allows fallback to others.
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              # Preference 1: Preemptible nodes
              matchExpressions:
              - key: oci.oraclecloud.com/oke-is-preemptible
                operator: In
                values:
                - "true"
            weight: 100 # High preference for preemptible nodes
          - preference:
              # Preference 2: Non-preemptible nodes (fallback)
              matchExpressions:
              - key: oci.oraclecloud.com/oke-is-preemptible
                operator: NotIn
                values:
                - "true"
            weight: 50 # Lower preference, acts as a fallback

Step 4: Kubernetes Descheduler Deployment (via Helm)

Deploy the Kubernetes Descheduler (v0.33.0, the latest version available at the time of writing) using Helm. The key is configuring its policy to avoid eviction loops and to target the right nodes and pods.

helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm repo update

Create a descheduler-values.yaml file with the following content:

# descheduler-values.yaml
kind: CronJob
image:
  repository: registry.k8s.io/descheduler/descheduler
  tag: "v0.33.0" # Pin to this exact version
  pullPolicy: IfNotPresent
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
schedule: "*/10 * * * *" # Run every 10 minutes (adjust as needed, e.g., "*/15 * * * *")
cmdOptions:
  v: 3 # Verbose logging for debugging
deschedulerPolicyAPIVersion: "descheduler/v1alpha2"
deschedulerPolicy:
  maxNoOfPodsToEvictPerNode: 1 # Limit evictions per node per run
  maxNoOfPodsToEvictPerTotal: 3 # Limit total evictions per run (adjust based on PDBs and cluster capacity)
  profiles:
    - name: default
      pluginConfig:
        - name: DefaultEvictor
          args:
            ignorePvcPods: true
            evictLocalStoragePods: true
            evictSystemCriticalPods: false
            nodeFit: true # Crucial: Descheduler will check if a pod can be scheduled elsewhere before evicting
        - name: LowNodeUtilization
          args:
            thresholds:
              cpu: 40
              memory: 40
            targetThresholds:
              cpu: 70
              memory: 70
        # The 'RemovePodsViolatingNodeAffinity' strategy is DISABLED to prevent eviction loops.
        # - name: RemovePodsViolatingNodeAffinity
        #   args:
        #     nodeAffinityType:
        #       - requiredDuringSchedulingIgnoredDuringExecution
        #       - preferredDuringSchedulingIgnoredDuringExecution
      plugins:
        balance:
          enabled:
            - LowNodeUtilization # Enable LowNodeUtilization for balancing
        deschedule:
          enabled: [] # No other deschedule strategies enabled to avoid conflicts/loops
priorityClassName: system-cluster-critical
rbac:
  create: true
serviceAccount:
  create: true
  name:

Explanation of Descheduler Policies Used:

The Descheduler uses customizable policies to optimize pod placement for cost-efficiency and availability. Key strategies in our solution include:

  • DefaultEvictor: Manages the eviction process. Its nodeFit: true ensures a pod is only evicted if it can be scheduled elsewhere, preventing Pending states.
  • LowNodeUtilization Strategy: Our core rebalancing tool. It identifies underutilized and overutilized nodes and evicts pods from the overutilized ones.
    • Purpose: Moves pods off overutilized on-demand nodes; the affinity preferences from Step 3 then steer the rescheduled pods onto preemptible nodes that have free capacity.
  • RemovePodsViolatingNodeAffinity Strategy: Explicitly disabled.
    • Reason: Enabling it with preferredDuringSchedulingIgnoredDuringExecution affinity creates an undesirable eviction loop between on-demand nodes, as the Descheduler would repeatedly move pods between them without ensuring a move to a preemptible node.

By combining these, Descheduler intelligently moves pods to cheaper preemptible nodes when available, maintaining stability and avoiding unnecessary churn.

Install the Descheduler using Helm:

helm install descheduler descheduler/descheduler \
  --namespace kube-system \
  -f descheduler-values.yaml
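
Since the chart was installed with kind: CronJob, you can verify that the CronJob exists and, optionally, trigger a one-off run instead of waiting for the next scheduled execution. The CronJob name below assumes the Helm release name descheduler used above; the job name is arbitrary.

kubectl get cronjob descheduler -n kube-system
kubectl create job descheduler-manual-run --from=cronjob/descheduler -n kube-system
kubectl logs -f job/descheduler-manual-run -n kube-system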

Monitoring and Validation

After implementing this solution, continuous monitoring is essential:

  • Pod Status: Use kubectl get pods -o wide -n <YOUR_NAMESPACE> to observe pod placement.
  • Descheduler Logs: Check Descheduler pod logs (kubectl logs -f <descheduler-pod-name> -n kube-system) to verify its actions and ensure it’s not entering an eviction loop. Look for messages from LowNodeUtilization evicting pods from on-demand nodes.
  • Node Utilization: Monitor node resource usage to confirm that pods are being consolidated onto preemptible nodes as expected.
  • Cluster Autoscaler Logs: Review CA logs to understand its scaling decisions for on-demand nodes.
  • PodDisruptionBudgets (PDBs): Ensure your critical applications have PDBs configured to prevent excessive downtime during Descheduler evictions; a minimal example follows below.
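
For reference, a minimal PDB for the my-generic-app Deployment from Step 3 might look like the following; the minAvailable value is only an example and should be tuned to your availability requirements.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-generic-app-pdb
  namespace: <YOUR_NAMESPACE> # Change to your namespace
spec:
  minAvailable: 2 # With 3 replicas, at most one pod can be disrupted at a time
  selector:
    matchLabels:
      app: my-generic-app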

By carefully configuring your node pools, pod affinities, and the Descheduler policies, you can effectively leverage preemptible instances for cost savings in OKE while maintaining high availability and responsiveness for your applications.
