Balancing cost and reliability for Spark on Kubernetes
Breaking News · Mar 8, 2026 · 4 min read

Notion, AWS Open-Source Spot Balancer to Cut Spark on Kubernetes Costs by Up to 90%

SAN FRANCISCO — Notion has open-sourced Spot Balancer, a tool developed in collaboration with AWS that dramatically reduces compute costs for Apache Spark workloads running on Kubernetes while preserving job reliability. The company says the solution helped it achieve up to 90% savings on Spark compute expenses.

In a blog post published today, Notion Software Engineer Justin Lee detailed the challenges of balancing cost and reliability when running large-scale Spark jobs on Kubernetes clusters, particularly when leveraging AWS EC2 Spot Instances. Spot Balancer addresses these issues by intelligently managing the trade-offs between cheaper, interruptible Spot capacity and more expensive, stable on-demand instances.

The tool was built to solve a common pain point for data engineering teams: Spark workloads often require significant compute resources for batch processing, machine learning pipelines, and analytics, but maintaining reliability on cost-optimized infrastructure has traditionally been complex. According to Notion's announcement, Spot Balancer uses application-aware logic to monitor cluster conditions and automatically adjust resource allocation to minimize costs without triggering costly job failures or excessive retries.
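The announcement does not include source code, but the described behavior can be sketched as a simple capacity-selection policy. The names, thresholds, and `ClusterState` fields below are illustrative assumptions, not Spot Balancer's actual internals:

```python
# Hypothetical sketch of application-aware Spot/on-demand balancing.
# Field names and thresholds are illustrative, not from Spot Balancer.
from dataclasses import dataclass

@dataclass
class ClusterState:
    spot_interruption_rate: float  # recent fraction of Spot nodes reclaimed
    pending_executors: int         # Spark executors waiting to be scheduled
    job_is_critical: bool          # e.g. an SLA-bound pipeline stage

def choose_capacity(state: ClusterState, max_interruption_rate: float = 0.15) -> str:
    """Pick a capacity type for the next batch of executor pods."""
    if state.job_is_critical:
        return "on-demand"  # never risk critical stages on interruptible capacity
    if state.spot_interruption_rate > max_interruption_rate:
        return "on-demand"  # Spot market is churning; pay for stability
    return "spot"           # otherwise default to the cheapest capacity
```

The key idea is that the policy is application-aware: it consults job criticality and observed interruption behavior rather than blindly maximizing Spot usage.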

Technical Approach and Implementation

Notion's implementation focuses on Kubernetes-native Spark deployments, likely running on Amazon Elastic Kubernetes Service (EKS). The solution integrates with AWS EC2 Spot Instance features, including managed draining of Spot nodes in response to rebalance recommendations. Nodes are automatically labeled so that Spark pods can be scheduled using node affinity rules, allowing the system to handle interruptions gracefully.
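A pod-level node affinity rule of the kind described might look like the following fragment (shown here as a Python dict mirroring the Kubernetes pod-spec schema). The label key `node-lifecycle` is an assumption; Spot Balancer's actual labels may differ:

```python
# Illustrative nodeAffinity fragment for a Spark executor pod, expressed as a
# Python dict in the shape of the Kubernetes pod-spec "affinity" field.
# The "node-lifecycle" label key is a hypothetical example.

def executor_affinity(prefer_spot: bool) -> dict:
    lifecycle = "spot" if prefer_spot else "on-demand"
    return {
        "nodeAffinity": {
            # "preferred" (soft) rather than "required" (hard), so pods can
            # still fall back to other capacity if the preferred pool is empty.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "node-lifecycle",
                                "operator": "In",
                                "values": [lifecycle],
                            }
                        ]
                    },
                }
            ]
        }
    }
```

Using a preferred rather than required affinity is what lets the scheduler degrade gracefully: if Spot nodes are being drained, executors can still land on remaining capacity instead of staying unschedulable.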

This approach builds on established best practices for running cost-optimized Spark workloads on Kubernetes, as previously outlined in AWS documentation and industry talks at Spark + AI Summit. By combining Spot capacity with strategic use of on-demand instances for critical job stages, Spot Balancer reportedly maintains high reliability while slashing infrastructure bills.
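One common way to implement this split (not necessarily Spot Balancer's exact mechanism) is to pin the Spark driver to on-demand nodes while letting executors run on Spot. The `spark.kubernetes.{driver,executor}.node.selector.*` conf prefixes are standard Spark-on-Kubernetes settings in recent Spark releases; the `node-lifecycle` label value is an assumed example:

```python
# Sketch of a Spark-on-Kubernetes config splitting capacity by role.
# The "node-lifecycle" node label is a hypothetical example.
spark_conf = {
    # The driver is a single point of failure: keep it on stable capacity.
    "spark.kubernetes.driver.node.selector.node-lifecycle": "on-demand",
    # Executors are retryable: schedule them on cheaper Spot nodes.
    "spark.kubernetes.executor.node.selector.node-lifecycle": "spot",
    # Tolerate task retries when Spot nodes are reclaimed mid-job.
    "spark.task.maxFailures": "8",
}

# Render as spark-submit arguments.
submit_args = " ".join(f"--conf {k}={v}" for k, v in spark_conf.items())
```

Because Spark can reschedule lost executors but cannot survive the loss of its driver, this role-based split captures most of the Spot discount while protecting the job's critical control path.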

The open-source release makes the tool available to the broader data engineering community, allowing other organizations running Spark on Kubernetes to implement similar cost-optimization strategies without building the solution from scratch.

Impact on Developers and Data Teams

For developers and data platform teams, Spot Balancer represents a practical advancement in infrastructure optimization. Organizations that run substantial Spark workloads — whether for ETL processes, data warehousing, or AI/ML training — can potentially achieve massive reductions in cloud compute spending.

The collaboration between Notion and AWS highlights growing industry emphasis on application-aware, AI/ML-driven infrastructure optimization. Similar themes appear in tools and services from companies like Spot.io, which focus on automated provisioning and cost optimization for Spark on Kubernetes.

Early results from Notion suggest the tool delivers substantial value for companies with variable or bursty Spark workloads that can tolerate some level of interruption through intelligent scheduling and graceful degradation.

What's Next

Notion has not yet detailed a specific roadmap for future enhancements to Spot Balancer, but open-sourcing the project invites community contributions and potential integration with broader Kubernetes and Spark ecosystem tools.

As more organizations migrate big data workloads to Kubernetes, solutions that intelligently balance cost and reliability are expected to see increased adoption. The availability of Spot Balancer could accelerate experimentation with Spot Instances for Spark, potentially influencing how companies architect their data platforms on AWS and other cloud providers.

The tool's focus on maintaining reliability while optimizing costs addresses a key barrier that has prevented wider Spot Instance adoption for mission-critical data processing workloads.

Sources

Original source: notion.com
