Optimizing Kubernetes Costs and Availability in OKE: A Strategy for Preemptible Nodes and Descheduler Integration
Written by Heitor Augusto Murari Cardozo, Site Reliability Engineer (SRE) at Pipefy.
In cloud-native environments, balancing cost efficiency with application availability and performance is a constant challenge. Preemptible (or Spot) instances offer significant cost savings. However, they come with the risk of being reclaimed, making them suitable for stateless and interruptible workloads that can be terminated and restarted without loss of state.
In addition to the possibility of being interrupted, preemptible instances in Oracle Kubernetes Engine (OKE) have a considerable probability of encountering an “Out of Capacity” event. These events occur when the cloud provider, in this case, Oracle Cloud Infrastructure (OCI), temporarily lacks sufficient resources (compute instances) in a specific availability domain or region to fulfill a request for new preemptible nodes. This can happen due to high demand for preemptible instances or other operational constraints in the OCI infrastructure.
When such an event occurs, OKE cannot provision new preemptible nodes, which can lead to pending pods and application performance degradation if there are no alternative on-demand nodes available to take over the workload.
The “Out of Capacity” Problem with Preemptible Nodes in OKE
Preemptible nodes are a double-edged sword. While cost-effective, they can be terminated by the cloud provider at any time. In OKE, this can lead to a specific issue when dealing with “Out of Capacity” scenarios for preemptible instances:
- The ~20-Minute Delay: When the Cluster Autoscaler (CA) attempts to provision a new preemptible node in an Availability Domain (AD) that is experiencing “Out of Capacity,” it doesn’t immediately fail. Instead, the CA waits for a defined period (defaulting to --max-node-provision-time=15m) for the node to appear. Oracle Cloud Infrastructure (OCI) might not return an immediate “Out of Capacity” error, causing the CA to wait for this timeout.
- Back-off and Re-evaluation: After the 15-minute timeout, the CA registers the provisioning attempt as a failure and enters a back-off period for that specific AD/Node Pool. This means a new attempt (or consideration of another AD) will only occur after an additional delay, often totaling around 20 minutes from the initial request.
- Pod Starvation: During this prolonged delay, if pods are strictly bound to the problematic AD or node pool via scheduling rules (like nodeSelector or requiredDuringSchedulingIgnoredDuringExecution affinity, as in the sketch after this list), they will remain in a Pending state. This can lead to application starvation and impact availability, especially in highly elastic environments.
- Limited AD Flexibility: Even if other Availability Domains have available capacity, the CA will prioritize attempts in the AD where pods are “stuck” due to rigid affinity rules. It will only consider other ADs/Node Pools after the initial timeout and re-evaluation.
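For illustration, here is a minimal sketch of the kind of rigid placement rule that triggers this behavior; the availability domain value is a placeholder. Because requiredDuringSchedulingIgnoredDuringExecution is a hard requirement, the scheduler has no fallback when the matching capacity is exhausted.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Hard requirement: only preemptible nodes...
            - key: oci.oraclecloud.com/oke-is-preemptible
              operator: In
              values:
                - "true"
            # ...and only a single availability domain (placeholder value).
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - <AD_NAME>

If <AD_NAME> has no preemptible capacity, pods matching this rule stay Pending until the Cluster Autoscaler's provision timeout and back-off cycle completes.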
Proposed Solution Architecture
To mitigate these risks while still leveraging the cost benefits of preemptible nodes, we propose a hybrid architecture:
Oracle already recommends using preemptible nodes with on-demand nodes as a backup solution to reduce costs (as highlighted in their blog post: Reduce your Kubernetes costs with preemptible nodes). However, a key limitation of this basic approach is that when preemptible nodes are interrupted and pods fall back to on-demand nodes, they typically only return to preemptible nodes during a manual rollout or redeployment. Our proposed solution addresses this by actively rebalancing pods.
- Fixed Preemptible Node Pool: This pool will run a fixed number of preemptible nodes (without Cluster Autoscaler enabled for this specific pool). It serves as the primary, cost-optimized capacity for your baseline workload.
- On-Demand Node Pool with Cluster Autoscaler: This pool consists of standard on-demand nodes and has the Cluster Autoscaler enabled. It acts as a resilient fallback, providing burst capacity and ensuring high availability when preemptible nodes are unavailable or reclaimed.
- Kubernetes Descheduler: This component is crucial for optimizing pod placement. It actively monitors the cluster and moves pods from the more expensive on-demand nodes back to the cost-effective preemptible nodes once they become available and have capacity.
The overall goal of this setup is to achieve significant cost savings by prioritizing preemptible nodes, while simultaneously improving application response time and availability by ensuring a reliable fallback and intelligent rebalancing.
Implementation Steps
Here are the necessary steps to implement this configuration:
Step 1: Configure Fixed Preemptible Node Pool
Create your preemptible node pool in OKE with a fixed number of nodes. Do not enable Cluster Autoscaler for this pool.
- Node Label: Ensure these nodes carry the label oci.oraclecloud.com/oke-is-preemptible: "true" (OKE applies it automatically to preemptible nodes; add it manually if needed). A quick check is shown after this list.
- Node Count: Set the desired fixed number of nodes (e.g., 3 nodes). This number should be carefully defined to meet the base load of your environment, avoiding unnecessary resource waste.
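To confirm the pool is labeled as expected, you can list the nodes and their preemptible label:

kubectl get nodes -L oci.oraclecloud.com/oke-is-preemptible   # show the label as a column
kubectl get nodes -l oci.oraclecloud.com/oke-is-preemptible=true   # list only preemptible nodes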
Step 2: Configure On-Demand Node Pool with Cluster Autoscaler
Create your standard on-demand node pool and enable the Cluster Autoscaler for it.
- Standard Nodes: These nodes will not have the oci.oraclecloud.com/oke-is-preemptible label.
- Cluster Autoscaler Configuration: Review and adjust the Cluster Autoscaler’s scale-down parameters to prevent aggressive node termination, which could lead to pods being stuck in Pending if preemptible nodes are not yet available.
- --scale-down-unneeded-time: Increase this value (e.g., to 15m or 20m) to allow more time for pods to be rescheduled before an unneeded on-demand node is terminated. This is critical if your Descheduler runs less frequently.
- Other parameters, such as --scale-down-delay-after-add and --scale-down-delay-after-delete, might also need fine-tuning based on your cluster’s behavior. An illustrative excerpt of these flags appears after this list.
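As a reference, here is a partial sketch of how these flags might look in the Cluster Autoscaler Deployment’s container command. The node pool OCID is a placeholder, and any other flags from your existing OKE Cluster Autoscaler manifest should be kept as-is.

command:
  - ./cluster-autoscaler
  - --v=4
  - --nodes=1:10:<ON_DEMAND_NODEPOOL_OCID>   # min:max:node pool OCID for the on-demand pool
  - --max-node-provision-time=15m            # how long to wait for a new node before registering a failure
  - --scale-down-unneeded-time=20m           # keep unneeded on-demand nodes longer than the Descheduler interval
  - --scale-down-delay-after-add=10m
  - --scale-down-delay-after-delete=10m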
Step 3: Pod Affinity Configuration
Configure your application’s pods (e.g., my-generic-app) to prefer preemptible nodes through affinity rules, while still allowing them to be scheduled on on-demand nodes if preemptible ones are not available. This ensures that pods do not get stuck in a Pending state.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-generic-app
  namespace: <YOUR_NAMESPACE> # Change to your namespace
spec:
  replicas: 3 # Adjust replica count as needed
  selector:
    matchLabels:
      app: my-generic-app
  template:
    metadata:
      labels:
        app: my-generic-app
    spec:
      containers:
        - name: my-generic-app-container
          image: your-repo/my-generic-app-image:latest # Replace with your image
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
      affinity:
        nodeAffinity:
          # This rule prefers preemptible nodes but allows fallback to others.
          preferredDuringSchedulingIgnoredDuringExecution:
            # Preference 1: Preemptible nodes
            - preference:
                matchExpressions:
                  - key: oci.oraclecloud.com/oke-is-preemptible
                    operator: In
                    values:
                      - "true"
              weight: 100 # High preference for preemptible nodes
            # Preference 2: Non-preemptible nodes (fallback)
            - preference:
                matchExpressions:
                  - key: oci.oraclecloud.com/oke-is-preemptible
                    operator: NotIn
                    values:
                      - "true"
              weight: 50 # Lower preference, acts as a fallback
Step 4: Kubernetes Descheduler Deployment (via Helm)
Deploy the Kubernetes Descheduler (v0.33.0, the latest version available at the time of writing) using Helm. The key is to configure its policy so that it avoids eviction loops and only targets the intended nodes and pods.
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm repo update
Create a descheduler-values.yaml file with the following content:
# descheduler-values.yaml
kind: CronJob
image:
  repository: registry.k8s.io/descheduler/descheduler
  tag: "v0.33.0" # Ensure this exact version
  pullPolicy: IfNotPresent
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
schedule: "*/10 * * * *" # Run every 10 minutes (adjust as needed, e.g., "*/15 * * * *")
cmdOptions:
  v: 3 # Verbose logging for debugging
deschedulerPolicyAPIVersion: "descheduler/v1alpha2"
deschedulerPolicy:
  maxNoOfPodsToEvictPerNode: 1 # Limit evictions per node per run
  maxNoOfPodsToEvictTotal: 3 # Limit total evictions per run (adjust based on PDBs and cluster capacity)
  profiles:
    - name: default
      pluginConfig:
        - name: DefaultEvictor
          args:
            ignorePvcPods: true
            evictLocalStoragePods: true
            evictSystemCriticalPods: false
            nodeFit: true # Crucial: Descheduler will check if a pod can be scheduled elsewhere before evicting
        - name: LowNodeUtilization
          args:
            thresholds:
              cpu: 40
              memory: 40
            targetThresholds:
              cpu: 70
              memory: 70
        # The 'RemovePodsViolatingNodeAffinity' strategy is DISABLED to prevent eviction loops.
        # - name: RemovePodsViolatingNodeAffinity
        #   args:
        #     nodeAffinityType:
        #       - requiredDuringSchedulingIgnoredDuringExecution
        #       - preferredDuringSchedulingIgnoredDuringExecution
      plugins:
        balance:
          enabled:
            - LowNodeUtilization # Enable LowNodeUtilization for balancing
        deschedule:
          enabled: [] # No other deschedule strategies enabled to avoid conflicts/loops
priorityClassName: system-cluster-critical
rbac:
  create: true
serviceAccount:
  create: true
  name:
Explanation of Descheduler Policies Used:
The Descheduler uses customizable policies to optimize pod placement for cost-efficiency and availability. Key strategies in our solution include:
- DefaultEvictor: Manages the eviction process. Its nodeFit: true ensures a pod is only evicted if it can be scheduled elsewhere, preventing Pending states.
- LowNodeUtilization Strategy: Our core rebalancing tool. It identifies under/overutilized nodes.
- Purpose: Moves pods from overutilized on-demand nodes to available preemptible nodes.
- RemovePodsViolatingNodeAffinity Strategy: Explicitly disabled.
- Reason: Enabling it with preferredDuringSchedulingIgnoredDuringExecution affinity creates an undesirable eviction loop between on-demand nodes, as the Descheduler would repeatedly move pods between them without ensuring a move to a preemptible node.
By combining these, Descheduler intelligently moves pods to cheaper preemptible nodes when available, maintaining stability and avoiding unnecessary churn.
Install the Descheduler using Helm:
helm install descheduler descheduler/descheduler \
--namespace kube-system \
-f descheduler-values.yaml
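After installation, you can confirm that the CronJob was created and inspect its most recent run. The CronJob name typically matches the Helm release name, and the pod label used below is the chart’s standard app.kubernetes.io/name label.

kubectl get cronjob descheduler -n kube-system
kubectl get pods -n kube-system -l app.kubernetes.io/name=descheduler
kubectl logs -n kube-system -l app.kubernetes.io/name=descheduler --tail=50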
Monitoring and Validation
After implementing this solution, continuous monitoring is essential:
- Pod Status: Use kubectl get pods -o wide -n <YOUR_NAMESPACE> to observe pod placement.
- Descheduler Logs: Check Descheduler pod logs (kubectl logs -f <descheduler-pod-name> -n kube-system) to verify its actions and ensure it’s not entering an eviction loop. Look for messages from LowNodeUtilization evicting pods from on-demand nodes.
- Node Utilization: Monitor node resource usage to confirm that pods are being consolidated onto preemptible nodes as expected.
- Cluster Autoscaler Logs: Review CA logs to understand its scaling decisions for on-demand nodes.
- PodDisruptionBudgets (PDBs): Ensure your critical applications have PDBs configured to prevent excessive downtime during Descheduler evictions (a minimal example follows this list).
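As a starting point, a minimal PodDisruptionBudget for the example my-generic-app Deployment might look like the following; adjust minAvailable to your own availability requirements.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-generic-app-pdb
  namespace: <YOUR_NAMESPACE> # Change to your namespace
spec:
  minAvailable: 2 # With 3 replicas, at most one pod can be disrupted at a time
  selector:
    matchLabels:
      app: my-generic-app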
By carefully configuring your node pools, pod affinities, and the Descheduler policies, you can effectively leverage preemptible instances for cost savings in OKE while maintaining high availability and responsiveness for your applications.
