Using Machine Learning to Automate Kubernetes Optimization The New Stack – thenewstack.io
Brian Likosar
Brian is an open source geek with a passion for working at the intersection of people and technology. Throughout his career, he's been involved in open source, whether that was with Linux, Ansible and OpenShift/Kubernetes while at Red Hat, Apache Kafka while at Confluent, or Apache Flink while at AWS. Currently a senior solutions architect at StormForge, he is based in the Chicago area and enjoys horror, sports, live music and theme parks.
Note: This is the third of a five-part series covering Kubernetes resource management and optimization. In this article, we explain how machine learning can be used to manage Kubernetes resources efficiently. Previous articles explained Kubernetes resource types and requests and limits.
As Kubernetes has become the de facto standard for application container orchestration, it has also raised vital questions about optimization strategies and best practices. One of the reasons organizations adopt Kubernetes is to improve efficiency, even while scaling up and down to accommodate changing workloads. But the same fine-grained control that makes Kubernetes so flexible also makes it challenging to tune and optimize effectively.
In this article, we'll explain how machine learning can be used to automate tuning of these resources and ensure efficient scaling for variable workloads.
Optimizing applications for Kubernetes is largely a matter of ensuring that the code uses its underlying resources, namely CPU and memory, as efficiently as possible. That means ensuring performance that meets or exceeds service-level objectives at the lowest possible cost and with minimal effort.
When creating a cluster, we can configure the use of two primary resources, memory and CPU, at the container level. Namely, we can set requests and limits that determine how much of these resources our application asks for and how much it is allowed to use. We can think of those resource settings as our input variables and of the performance, reliability and resource usage (or cost) of running our application as the output. As the number of containers increases, the number of variables also increases, and with that, the overall complexity of cluster management and system optimization increases exponentially.
We can think of Kubernetes configuration as an equation with resource settings as our variables and cost, performance and reliability as our outcomes.
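To make the "equation" concrete, here is a minimal Python sketch of the inputs for a single container and of how quickly the number of possible configurations grows. The class name, field names and the figure of five candidate values per setting are illustrative assumptions, not part of any Kubernetes API; only the `requests`/`limits` layout with `cpu` and `memory` keys mirrors the container `resources` field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerResources:
    """The four resource settings we can tune for one container."""
    cpu_request_millicores: int
    cpu_limit_millicores: int
    memory_request_mib: int
    memory_limit_mib: int

    def to_manifest(self) -> dict:
        # Same shape as the `resources` field of a container spec.
        return {
            "requests": {
                "cpu": f"{self.cpu_request_millicores}m",
                "memory": f"{self.memory_request_mib}Mi",
            },
            "limits": {
                "cpu": f"{self.cpu_limit_millicores}m",
                "memory": f"{self.memory_limit_mib}Mi",
            },
        }


def search_space_size(num_containers: int, options_per_setting: int,
                      settings_per_container: int = 4) -> int:
    """Count the configurations if each setting has a fixed number of candidate values."""
    return options_per_setting ** (settings_per_container * num_containers)


if __name__ == "__main__":
    print(ContainerResources(500, 1000, 256, 512).to_manifest())
    for containers in (1, 2, 5):
        print(containers, "container(s):", search_space_size(containers, 5), "configurations")
```

Even with only five candidate values per setting, one container has 625 possible configurations and two containers have 390,625, which is why exhaustive manual tuning does not scale.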
To further complicate matters, different resource parameters are interdependent. Changing one parameter may have unexpected effects on cluster performance and efficiency. This means that manually determining the precise configurations for optimal performance is an impossible task, unless you have unlimited time and Kubernetes experts.
If we do not set custom values for resources during the container deployment, Kubernetes falls back to whatever defaults are in effect: the values from a namespace LimitRange if one is defined, and otherwise no requests or limits at all. The challenge here is that such defaults tend to be generous in order to prevent two situations: service failure due to an out-of-memory (OOM) error and unreasonably slow performance due to CPU throttling. However, using the default configurations to create a cloud-based cluster will result in unreasonably high cloud costs without guaranteeing sufficient performance.
This all becomes even more complex when we seek to manage multiple parameters for several clusters. For optimizing an environment's worth of metrics, a machine learning system can be an integral addition.
There are two general approaches to machine learning-based optimization, each of which provides value in a different way. First, experimentation-based optimization can be done in a non-prod environment using a variety of scenarios to emulate possible production scenarios. Second, observation-based optimization can be performed either in prod or non-prod by observing actual system behavior. These two approaches are described next.
Optimizing through experimentation is a powerful, science-based approach because we can try any possible scenario, measure the outcomes, adjust our variables and try again. Since experimentation takes place in a non-prod environment, we're only limited by the scenarios we can imagine and the time and effort needed to perform these experiments. If experimentation is done manually, the time and effort needed can be overwhelming. That's where machine learning and automation come in.
Let's explore how experimentation-based optimization works in practice.
To set up an experiment, we must first identify which variables (also called parameters) can be tuned. These are typically CPU and memory requests and limits, replicas and application-specific parameters such as JVM heap size and garbage collection settings.
Some ML optimization solutions can scan your cluster to automatically identify configurable parameters. This scanning process also captures the cluster's current, or baseline, values as a starting point for our experiment.
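As a rough illustration of what such a parameter definition might look like, the sketch below describes each tunable parameter by a name, an allowed range and a baseline value. The `Parameter` class, the parameter names and all the numbers are hypothetical; a real optimization tool would discover them from the cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    """One tunable setting with its allowed range and current (baseline) value."""
    name: str
    minimum: int
    maximum: int
    baseline: int

    def __post_init__(self) -> None:
        # The baseline is the starting point, so it must be a legal value.
        if not (self.minimum <= self.baseline <= self.maximum):
            raise ValueError(f"baseline for {self.name} is outside its range")


# Hypothetical parameters of the kinds mentioned above.
PARAMETERS = [
    Parameter("cpu_request_millicores", 100, 2000, 500),
    Parameter("memory_request_mib", 128, 4096, 1024),
    Parameter("replicas", 1, 10, 3),
    Parameter("jvm_heap_mib", 64, 3072, 768),
]


def baseline_config(parameters: list) -> dict:
    """Return the configuration the experiment starts from."""
    return {p.name: p.baseline for p in parameters}


if __name__ == "__main__":
    print(baseline_config(PARAMETERS))
```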
Next, you must specify your goals. In other words, which metrics are you trying to minimize or maximize? In general, the goal will consist of multiple metrics representing trade-offs, such as performance versus cost. For example, you may want to maximize throughput while minimizing resource costs.
Some optimization solutions will allow you to apply a weighting to each optimization goal, as performance may be more important than cost in some situations and vice versa. Additionally, you may want to specify boundaries for each goal. For instance, you might not want to even consider any scenarios that result in performance below a particular threshold. Providing these guardrails will help to improve the speed and efficiency of the experimentation process.
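One simple way to picture weighted goals with guardrails is a scoring function like the sketch below. The metric names, the weights and the throughput threshold are assumptions chosen for illustration; the weights also absorb the difference in units between requests per second and dollars per hour.

```python
from typing import Optional


def score_trial(metrics: dict, weights: dict,
                min_throughput_rps: float) -> Optional[float]:
    """Return a weighted score (higher is better), or None if a guardrail is violated.

    Throughput is rewarded and cost is penalized; trials whose throughput
    falls below the threshold are rejected outright instead of being scored.
    """
    if metrics["throughput_rps"] < min_throughput_rps:
        return None
    return (weights["throughput"] * metrics["throughput_rps"]
            - weights["cost"] * metrics["cost_per_hour"])


if __name__ == "__main__":
    weights = {"throughput": 1.0, "cost": 100.0}
    fast_but_pricey = {"throughput_rps": 900, "cost_per_hour": 2.0}
    too_slow = {"throughput_rps": 300, "cost_per_hour": 0.5}
    print(score_trial(fast_but_pricey, weights, min_throughput_rps=500))
    print(score_trial(too_slow, weights, min_throughput_rps=500))
```

Rejecting out-of-bounds trials early means the search does not spend effort refining configurations that could never be accepted.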
Here are some considerations for selecting the right metrics for your optimization goals:
Of course, these are just a few examples. Determining the proper metrics to prioritize requires communication between developers and those responsible for business operations. Determine the organization's primary goals. Then examine how the technology can achieve these goals and what it requires to do so. Finally, establish a plan that emphasizes the metrics that best accommodate the balance of cost and function.
With an experimentation-based approach, we need to establish the scenarios to optimize for and build those scenarios into a load test. This might be a range of expected user traffic or a specific scenario like a retail holiday-based spike in traffic. This performance test will be used during the experimentation process to simulate production load.
Once we've set up our experiment with optimization goals and tunable parameters, we can kick off the experiment. An experiment consists of multiple trials, with your optimization solution iterating through the following steps for each trial:
The machine learning engine uses the results of each trial to build a model representing the multidimensional parameter space. In this space, it can examine the parameters in relation to one another. With each iteration, the ML engine moves closer to identifying the configurations that optimize the goal metrics.
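The overall shape of an experiment can be sketched as a loop over trials. This is a deliberately simplified illustration, not how any particular product works: plain random search stands in for the ML engine, and `simulated_trial` is a made-up formula standing in for deploying a configuration and running the load test against it. All names, bounds and coefficients are hypothetical.

```python
import random

# Hypothetical tunable ranges: (minimum, maximum).
BOUNDS = {"cpu_millicores": (100, 2000), "memory_mib": (128, 4096)}


def simulated_trial(config: dict) -> tuple:
    """Stand-in for 'deploy the config and run the load test' (invented formula)."""
    latency_ms = 50 + 40000 / config["cpu_millicores"] + 20000 / config["memory_mib"]
    cost_per_hour = 0.00004 * config["cpu_millicores"] + 0.00001 * config["memory_mib"]
    return latency_ms, cost_per_hour


def objective(config: dict) -> float:
    """Combine latency and cost into one number to minimize (weights are assumptions)."""
    latency_ms, cost_per_hour = simulated_trial(config)
    return latency_ms + 1000 * cost_per_hour


def run_experiment(baseline: dict, num_trials: int, seed: int = 0) -> tuple:
    """Run trials and keep the best configuration seen so far."""
    rng = random.Random(seed)
    best_config, best_value = dict(baseline), objective(baseline)
    history = [(dict(baseline), best_value)]
    for _ in range(num_trials):
        # 1. Propose a candidate configuration (random search, not a learned model).
        candidate = {name: rng.randint(low, high) for name, (low, high) in BOUNDS.items()}
        # 2. Run the trial and measure the outcome.
        value = objective(candidate)
        # 3. Record the result so it can inform later analysis.
        history.append((candidate, value))
        # 4. Keep the candidate if it beats the best so far.
        if value < best_value:
            best_config, best_value = candidate, value
    return best_config, best_value, history


if __name__ == "__main__":
    baseline = {"cpu_millicores": 500, "memory_mib": 1024}
    best, value, history = run_experiment(baseline, num_trials=50)
    print("baseline objective:", objective(baseline))
    print("best config:", best, "objective:", value)
```

A real ML engine would replace step 1 with a model that uses the recorded history to choose promising candidates, which is what lets it converge in far fewer trials than random search.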
While machine learning automatically recommends the configuration that will result in the optimal outcomes, additional analysis can be done once the experiment is complete. For example, you can visualize the trade-offs between two different goals, see which parameters have a significant impact on outcomes and which matter less.
Results are often surprising and can lead to key architectural improvements, for example, determining that a larger number of smaller replicas is more efficient than a smaller number of heavier replicas.
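One common way to examine the trade-off between two goals is to keep only the trials that no other trial beats on both goals at once, often called the Pareto front. The sketch below assumes each trial is summarized as a hypothetical `(cost_per_hour, latency_ms)` pair where lower is better for both; the sample numbers are invented.

```python
def pareto_front(trials: list) -> list:
    """Return the trials not dominated by any other trial, sorted by cost.

    A trial is dominated when a different trial is at least as good on
    both cost and latency.
    """
    front = []
    for candidate in trials:
        dominated = any(
            other != candidate
            and other[0] <= candidate[0]
            and other[1] <= candidate[1]
            for other in trials
        )
        if not dominated:
            front.append(candidate)
    return sorted(set(front))


if __name__ == "__main__":
    # Invented (cost_per_hour, latency_ms) results from five trials.
    results = [(1.0, 200), (2.0, 120), (2.5, 150), (3.0, 90), (3.5, 95)]
    print(pareto_front(results))
```

Each point on the front represents a different balance of cost and performance, so choosing among them is a business decision, not a purely technical one.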
Experiment results can be visualized and analyzed to fully understand system behavior.
While experimentation-based optimization is powerful for analyzing a wide range of scenarios, it's impossible to anticipate every possible situation. Additionally, highly variable user traffic means that an optimal configuration at one point in time may not be optimal as things change. Kubernetes autoscalers can help, but they are based on historical usage and fail to take application performance into account.
This is where observation-based optimization can help. Let's see how it works.
Depending on what optimization solution you're using, configuring an application for observation-based optimization may consist of the following steps:
Once configured, the machine learning engine begins analyzing observability data collected from Prometheus, Datadog or other observability tools to understand actual resource usage and application performance trends. The system then begins making recommendations at the interval specified during configuration.
If you specified automatic implementation of recommendations during configuration, the optimization solution will automatically patch deployments with recommended configurations as they are recommended. If you selected manual deployment, you can view the recommendation, including container-level details, before deciding to approve or not.
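To give a feel for what a recommendation and the resulting patch could look like, here is a minimal sketch. It is not how any specific product computes recommendations: a 95th-percentile-plus-headroom heuristic stands in for the machine learning engine, the usage samples are invented, and the function names are hypothetical. The patch dictionary follows the usual Deployment layout (`spec.template.spec.containers`).

```python
import math


def percentile(samples: list, pct: int) -> int:
    """Nearest-rank percentile of a non-empty list of integer samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_requests(cpu_millicores: list, memory_mib: list,
                       headroom_pct: int = 15) -> dict:
    """Recommend requests as the 95th percentile of observed usage plus headroom."""
    factor = 100 + headroom_pct
    cpu = math.ceil(percentile(cpu_millicores, 95) * factor / 100)
    memory = math.ceil(percentile(memory_mib, 95) * factor / 100)
    return {"cpu": f"{cpu}m", "memory": f"{memory}Mi"}


def build_patch(container_name: str, requests: dict) -> dict:
    """Build a patch body that updates one container's resource requests."""
    return {
        "spec": {"template": {"spec": {"containers": [
            {"name": container_name, "resources": {"requests": requests}}
        ]}}}
    }


if __name__ == "__main__":
    # Invented usage samples, as might be read from an observability tool.
    cpu_samples = list(range(100, 300, 10))
    memory_samples = [500, 520, 510, 530, 540, 515, 525, 535, 505, 600]
    recommendation = recommend_requests(cpu_samples, memory_samples)
    print(build_patch("web", recommendation))
```

With manual approval, the patch would be shown for review before being applied; with automatic implementation, it would be applied as soon as it is generated.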
As you may have noted, observation-based optimization is simpler than experimentation-based approaches. It provides value faster with less effort, but on the other hand, experimentation-based optimization is more powerful and can provide deep application insights that aren't possible using an observation-based approach.
Which approach to use shouldn't be an either/or decision; both approaches have their place and can work together to close the gap between prod and non-prod. Here are some guidelines to consider:
Using both experimentation-based and observation-based approaches creates a virtuous cycle of systematic, continuous optimization.
Optimizing our Kubernetes environment to maximize efficiency (performance versus cost), scale intelligently and achieve our business goals requires:
For small environments, this task is arduous. For an organization running apps on Kubernetes at scale, it is likely already beyond the scope of manual labor.
Fortunately, machine learning can bridge the automation gap and provide powerful insights for optimizing a Kubernetes environment at every level.
StormForge provides a solution that uses machine learning to optimize based on both observation (using observability data) and experimentation (using performance-testing data).
To try StormForge in your environment, you can request a free trial here and experience how complete optimization does not need to be a complete headache.
Stay tuned for future articles in this series where we'll explain how to tackle specific challenges involved in optimizing Java apps and databases running in containers.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: StormForge.