AIOps Now: Scaling Kubernetes With AI and Machine Learning

Use AI and digital twins to optimize Kubernetes apps and address SRE challenges, with continuous learning for improved outcomes.

If you are a site reliability engineer (SRE) for a large Kubernetes-powered application, optimizing resources and performance is a daunting job. Some spikes, like a busy shopping day, can be broadly anticipated and planned for. Doing that well, however, requires painstakingly understanding the behavior of hundreds of microservices and their interdependencies, and that understanding has to be re-evaluated with each new release. It is not a scalable approach, to say nothing of the monotony and resulting stress for the SRE. Moreover, there will always be unexpected peaks to respond to. Continually keeping tabs on performance and putting the optimal amount of resources in the right place is essentially impossible.

Today this is handled through gross overprovisioning, or through a combination of guesswork and endless alerts that require support teams to review and intervene. It’s simply not sustainable or practical, and certainly not scalable. But it’s just the kind of problem that machine learning and AI thrive on. We have spent the last decade dealing with such problems, and the arrival of the latest generation of AI tools, such as generative AI, has opened the possibility of applying machine learning to the real problems of the SRE and realizing the promise of AIOps.

Turning Up the Compute Knob…to Be Safe

No matter how great your observability dashboard is, the amount of data and the need for agility are just too much. You have to provision adequate resources to achieve the desired response times and error rates. It is not unusual for people in this role to peg compute utilization at 30 percent “to be safe” and be prepared to monitor hundreds of microservices to ensure the desired service-level agreement (SLA) is achieved. The end result is costly, not just in compute resources but also in the DevOps resources dedicated to maintaining the SLA.

It seems that, for all it has brought us, Kubernetes has gone beyond the comprehension of those charged with operating it. Horizontal pod autoscaling (HPA) and other reactive scaling solutions still leave SREs guessing at where to set the CPU utilization threshold so that it works across varying traffic loads and service graph dependencies. Traffic does not have a linear relationship to microservice load, and thus to performance, and traffic is not the only reason to change the state of the application deployment. SREs are also monitoring issues like temperature, faults, and latency.
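
To make the threshold guesswork concrete, here is a minimal Python sketch of the replica calculation that the Kubernetes documentation describes for HPA: the desired replica count is the current count scaled by the ratio of observed to target utilization, rounded up, with no change when that ratio is within a tolerance (0.1 by default). The function name and the sample numbers are illustrative assumptions, not values from any real cluster.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         observed_utilization: float,
                         target_utilization: float,
                         tolerance: float = 0.1) -> int:
    """Replica count suggested by the documented HPA formula.

    desired = ceil(current * observed / target), skipped when the
    observed/target ratio is within `tolerance` of 1.0.
    """
    ratio = observed_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * observed_utilization / target_utilization)


if __name__ == "__main__":
    # Same observed load (60% CPU on 10 replicas), three candidate targets.
    for target in (30, 50, 70):
        print(target, hpa_desired_replicas(10, 60, target))
```

With identical observed load, the choice of target alone moves the answer between 9 and 20 replicas, and nothing in the formula knows anything about the service graph.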

A typical Kubernetes application comprises several hundred microservices on average. Each one depends on others in a web of interconnected relationships. It is not easy for a person to view and understand it all, make detailed changes, and then repeat that for every release of each microservice every week. SREs figuratively “turn up the compute knob” and hope that it improves whatever has dropped below the service-level objective (SLO). In reality, though, it is useless to add resources to a microservice when the actual bottleneck is another microservice that it depends on.
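
The bottleneck point can be illustrated with a toy Python model. It is only a sketch under loud assumptions: the call graph is acyclic, every request to a service calls each of its dependencies once, capacity is a single requests-per-second number, and the service names and figures are hypothetical.

```python
from typing import Dict, List


def effective_capacity(service: str,
                       calls: Dict[str, List[str]],
                       capacity: Dict[str, float]) -> float:
    """Request rate `service` can sustain, capped by every downstream dependency."""
    limit = capacity[service]
    for dependency in calls.get(service, []):
        limit = min(limit, effective_capacity(dependency, calls, capacity))
    return limit


# Hypothetical three-service chain: frontend -> cart -> inventory-db.
calls = {"frontend": ["cart"], "cart": ["inventory-db"], "inventory-db": []}
capacity = {"frontend": 500.0, "cart": 400.0, "inventory-db": 100.0}

baseline = effective_capacity("frontend", calls, capacity)
# "Turn up the compute knob" on the frontend only.
scaled_frontend = effective_capacity(
    "frontend", calls, {**capacity, "frontend": 1000.0})
# Scale the dependency that is actually the bottleneck.
scaled_db = effective_capacity(
    "frontend", calls, {**capacity, "inventory-db": 300.0})

print(baseline, scaled_frontend, scaled_db)
```

In this toy model, doubling the frontend leaves end-to-end capacity unchanged, while adding capacity to the database raises it. Real services are not this simple, which is the argument for a service-graph-aware approach.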

An Ideal Use Case for AI

In 2024, when someone says AI, the next thought is almost inevitably ChatGPT. ChatGPT is generative AI that selects the best next word. While the architecture required for a strong AIOps platform is very different from ChatGPT (more on that later), the goal is similar — choose the best next state for the application.

The intricately interconnected ecosystems of modern microservice applications are too big and complex for the SRE team to comprehend in detail and make those decisions. Most efforts to autoscale these applications fail to take into account the nuanced requirements and performance needs of individual services. I’ve been hearing about this problem continuously for over 20 years (starting with the L5 network load balancer we invented at Arrowpoint Communications).

The Digital Twin Goes Through the Paces

Training data is the fuel for AI. To teach an application to operate a mission-critical Kubernetes instance, we need to develop good information about how performance can be optimized. Digital twins have been used for decades in multiple fields, including manufacturing and racing, to help people create a digital equivalent of the real subject and study its behavior. In our case, we use performance metrics to build a digital twin of each microservice.

In reinforcement learning (RL), digital twins are used to create a simulation environment that generates an observation space. In that space, a model can be trained to discover and learn the best paths (also known as “trajectories”) for guiding the system to states that have the desired target properties in terms of cost, performance, etc. In our case, we use proximal policy optimization (PPO) as the RL training algorithm. Our approach is service-graph aware, so it takes into account the dependencies between microservices that affect scaling. Ultimately, we will have a model-free network that is continually learning based on operational experience.
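
For readers unfamiliar with PPO, the sketch below shows the clipped surrogate objective from the standard PPO formulation, in plain Python. It is not the implementation described in this article: the function name is hypothetical, the clip range of 0.2 is a commonly used default assumed here, and a real trainer would compute probability ratios and advantages from trajectories collected in the simulation environment.

```python
from typing import Sequence


def ppo_clipped_objective(ratios: Sequence[float],
                          advantages: Sequence[float],
                          epsilon: float = 0.2) -> float:
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch.

    ratios:     new-policy / old-policy probability of each sampled action
    advantages: estimated advantage of each sampled action
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Taking the minimum removes any incentive to move the policy
        # far from the one that generated the data.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)
```

The clipping is what keeps each policy update small, which matters when the policy being updated is deciding the next state of a production application.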

Better Responsiveness and Ongoing Improvement

Kubernetes has come a long way. There is extensive tool-level automation, but not a lot of effective system-level automation. Perhaps that has a lot to do with the vast amount of activity within a Kubernetes instance. We boiled the problem down to deciding the best next state for the application.

People have been playing with generative AI that can produce words and images for a general audience. We are seeing how the same technology can transform our digital experience.

For SREs Now and Developers of the Future

SREs today could benefit from a transformation. Talking to SRE teams, we have learned that they are asked to contribute to their own SLOs and they simply don’t know where to begin. It seems that the complexity of Kubernetes has outpaced the ability of humans alone to operate it.

Looking ahead, applying AIOps models and moving toward autonomous infrastructure can allow for a new level of complexity and scale for microservices applications.
