The New Stack Podcast

ML Can Prevent Getting Burned For Kubernetes Provisioning

Episode Summary

In the rush to create, provision and manage Kubernetes, proper resource provisioning is often left out. According to StormForge, a company spending, for example, $1 million a month on cloud computing resources is likely wasting $6 million a year on Kubernetes resources in the cloud that are left unused. The reasons for this are manifold and can vary. They include DevOps teams’ tendency to estimate too conservatively or too aggressively, and to overspend on resource provisioning. In this podcast with StormForge’s Yasmin Rajabi, vice president of product management, and Patrick Bergstrom, CTO, we look at how to properly provision Kubernetes resources and the associated challenges. The podcast was recorded live in Detroit during KubeCon + CloudNativeCon North America 2022. Patrick Bergstrom - https://www.linkedin.com/in/bergstrompatrick/ Yasmin Rajabi - https://www.linkedin.com/in/yasminrajabi/ Bruce Gain - @bcamerongain The New Stack - @thenewstack

Episode Notes

In the rush to create, provision and manage Kubernetes, proper resource provisioning is often left out. According to StormForge, a company spending, for example, $1 million a month on cloud computing resources is likely wasting $6 million a year on Kubernetes resources in the cloud that are left unused. The reasons for this are manifold and can vary. They include DevOps teams’ tendency to estimate too conservatively or too aggressively, and to overspend on resource provisioning. In this podcast with StormForge’s Yasmin Rajabi, vice president of product management, and Patrick Bergstrom, CTO, we look at how to properly provision Kubernetes resources and the associated challenges. The podcast was recorded live in Detroit during KubeCon + CloudNativeCon North America 2022.

 


Almost ironically, the most commonly used Kubernetes mechanisms can complicate the ability to optimize resources for applications. The process typically involves setting Kubernetes resource requests and limits, and predicting how those settings might impact quality of service for pods. Developers deploying an application on Kubernetes often need to set CPU requests, memory requests and other resource limits. “They are usually like ‘I don't know — whatever was there before or whatever the default is,’” Rajabi said. “They are in the dark.”
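For reference, these requests and limits live directly in the pod spec. Here is a minimal sketch; the workload name, image and numbers are illustrative, not from the episode:

    # Hypothetical example workload; the values are the kind of
    # guesses Rajabi describes, not recommended settings.
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app
    spec:
      containers:
      - name: web
        image: registry.example.com/web:1.0
        resources:
          requests:          # what the scheduler reserves on a node
            cpu: 250m        # 0.25 of a CPU core
            memory: 128Mi
          limits:            # hard ceiling: excess CPU is throttled,
            cpu: 500m        # exceeding memory gets the pod OOM-killed
            memory: 256Mi

Set too low, the limits throttle or kill the application; set too high, the requests reserve capacity that sits idle, which is where the waste accumulates.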

 

Sometimes, developers might use their favorite observability tool and “look where the max is, and then take a guess,” Rajabi said. “The challenge is, if you start from there when you start to scale that out — especially for organizations that are using horizontal scaling with Kubernetes — then you're taking that problem and you're just amplifying it everywhere,” Rajabi said. “And so, when you've hit that complexity at scale, taking a second to look back and say, ‘how do we fix this?’ You don't want to just arbitrarily go reduce resources, because you have to look at the trade-off of how that impacts your reliability.”
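To see why horizontal scaling amplifies a bad guess: a HorizontalPodAutoscaler multiplies whatever per-pod request was set by every replica it adds, and its utilization target is itself measured relative to that request. A sketch, reusing the hypothetical example-app from above:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: example-app
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: example-app
      minReplicas: 3
      maxReplicas: 50       # an over-estimated request is now paid for up to 50 times
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70   # a percentage of the guessed request, not of actual need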

 

The process then becomes very hit or miss. “That's where it becomes really complex, when there are so many settings across all those environments, all those namespaces,” Rajabi said. “It's almost a problem that can only be solved by machine learning, which makes it very interesting.”

 

But before organizations learn the hard way that optimizing the deployment and management of Kubernetes should be automated, many resources — and costs — go to waste. “It's one of those things that becomes a bigger and bigger challenge, the more you grow as an organization,” Bergstrom said. Many StormForge customers are deploying thousands of workloads across thousands of namespaces. “You are suddenly trying to manage each workload individually to make sure it has the resources and the memory that it needs,” Bergstrom said. “It becomes a bigger and bigger challenge.”

 

The process should actually be pain-free when ML is properly implemented. Through StormForge’s partnership with Datadog, it is possible to collect historical data and apply ML to it, Bergstrom explained. “Then, within just hours of us deploying our algorithm into your environment, we have machine learning that's used two to three weeks’ worth of data to train that can then automatically set the correct resources for your application. This is because we know what the application is actually using,” Bergstrom said. “We can predict the patterns and we know what it needs in order to be successful.”

Episode Transcription

Colleen Coll  0:08  

Welcome to this special edition of The New Stack Makers: On the Road. We're here at KubeCon North America with discussions from the show floor, with technologists giving you their expertise and insights to help you with your everyday work.

 

Bruce Gain  0:29  

Hi, it's B. Cameron Gain here at KubeCon + CloudNativeCon in beautiful Detroit. And I'm here today with Yasmin Rajabi, who is vice president of product for StormForge. Hi, Yasmin. Hey, happy to be here. Great. And I'm here as well with Patrick Bergstrom, CTO of StormForge. How are you doing? Absolutely.

 

And we're here today to talk about the automation of optimizing deployments and the management of Kubernetes. I learned recently what a challenge this is, and I was wondering, could you please elaborate on how people are learning the hard way that they're wasting a lot of resources when they're deploying and managing Kubernetes?

 

Patrick Bergstrom  1:26  

Absolutely. And it's one of those things that becomes a bigger and bigger challenge the more you grow as an organization. So a lot of the customers that Yasmin is talking with right now, and our regular customers, they're deploying into thousands of namespaces. And so the challenge that we see with organizations of this size boils down to day-2 Kubernetes, right? So you've got your Kubernetes cluster set up, you're running in production, you're deploying workloads. And then as you deploy thousands and thousands of workloads, suddenly trying to manage each workload individually, to make sure it has the resources and the memory that it needs, becomes a bigger and bigger challenge.

 

Bruce Gain  2:01  

And how does that work, or how does it play out, for example, on a day-to-day level? I mean, where are they struggling with this? How are they realizing that they're wasting a lot of resources?

 

Yasmin Rajabi  2:09  

Well, when you ask a developer, for example, who's deploying an application on Kubernetes, "How did you set your CPU requests and limits? How did you set your memory requests and limits?", usually it's like, "I don't know, whatever was there before, or whatever the default for that application was." They're in the dark, honestly. And the answer I got the other day was, "We look at Grafana, we look at where the max is, and then take a guess." And so the challenge is, if you start from there, when you start to scale that out, especially for organizations that are using horizontal scaling with Kubernetes, then you're taking that problem and you're just amplifying it everywhere. And so when you've hit that complexity at scale, taking a second to look back and say, "How do we fix this?", you don't want to just arbitrarily go reduce resources, because you have to look at the trade-off of how that impacts your reliability. And so that's where it becomes really complex: there are so many settings across all those environments, all those namespaces. It's almost a problem that can only be solved by machine learning, which makes it very interesting.

 

Bruce Gain  3:06  

That's another thing I definitely want to cover; I find the mechanics of how that plays out fascinating. But, you know, how does that realization usually happen? Is there a threshold of namespaces where they suddenly realize, wow, we're wasting a lot of resources?

 

Patrick Bergstrom  3:23  

From my experience, it's usually been that the CFO comes knocking, and it's like, "Hey, you spent $10 million on cloud last year. Why is that up 60% over the year before? And why is it continuing to grow? What are we going to do about it?" And people don't know; they don't know how to solve for it.

 

Yasmin Rajabi  3:40  

Yeah. And sometimes the CFO comes calling and says, oh, our cloud bill is high. And they'll have teams that are dedicated to providing the reporting, to see the namespaces that had the biggest waste. But then the challenge is: okay, you're giving it to a developer. How are they going to address that?

 

Patrick Bergstrom  3:55  

In a developer's mind, especially, too (I like to talk about, you know, my past as an SRE), when you give me the controls to a namespace, if I'm deploying an application, I'm going to turn the memory and the CPU as high as I can go, because that eliminates my chances of throttling or getting OOM-killed. And I don't care how much I spend on it, right? They're the ones that gave me that knob to turn. So I want to make sure my app works.

 

Bruce Gain  4:16  

Well, the goal is for it to work, and to optimize it as much as you can. And as a developer, your goal is rarely, actually, to optimize resources. And so this helps keep everybody honest, in many respects.

 

Patrick Bergstrom  4:33  

Yeah, it does. And it's pain-free because it's automated, too. Like Yasmin said, it's not uncommon for organizations to spin up massive teams to try to tackle this problem, when in reality it's a perfect fit for machine learning. The data is very well structured: we can see the usage, and we know what your settings are currently using. And then we can do a look-back through, like, our partnership with Datadog, for instance. We can connect to collect historical data, and then within just hours of us deploying our algorithm into your environment, we have machine learning that's used two to three weeks' worth of data to train that can then automatically set the correct resources for your application. Because we know what the application is actually using, we can predict the patterns, and we know what it needs in order to be successful.

 

Bruce Gain  5:19  

How many weeks does it have to go back? Or how does that work, actually? I mean, conceptually, you would think you utilize machine learning to take over and to learn what the patterns are and make forecasts, et cetera. And it obviously has to be very, very smart. So I was curious how that works. I mean, they're realizing, for example, that they're wasting a lot of resources; they're learning the hard way. And you're coming in with this machine learning algorithm, with the partnership with Datadog. How does that usually work?

 

Yasmin Rajabi  5:51  

Yeah, for us, it's about building confidence with our users. So we do have an automatic mode, where we'll just deploy the configurations and make the changes in real time in production. But if you're a large organization, you want to watch that and see what the machine learning would do; there's still a balance of trust between the computers and the human. So we provide the recommendations at whatever level of aggressiveness you want, and however often you want to deploy them, and let you see that before you actually flip the switch to say, okay, take care of this for me.

 

Bruce Gain  6:21  

It's not the machine taking over the controls?

 

Patrick Bergstrom  6:24  

It is, at a certain point. So that's part of our application that we deploy: we actually have a controller that goes into the customer environment, and that's where our recommender application actually makes the recommendations, based on the telemetry monitoring that we're watching. And we're a machine learning company first; we were doing machine learning before we actually got into Kubernetes optimization. So we've got three PhDs in applied mathematics on staff who are actually writing the algorithms that do this for us. It's even more fascinating when we talk about our history, because we were doing other things first, and then we migrated to Kubernetes, and we had no idea how to set it up or how to set the right, you know, container resources. And then they just kind of said, oh, well, this algorithm that we use for this other thing actually could work for that. And then they modified it, and we've been iterating on it since then. And that's where we are today.

 

Bruce Gain  7:17  

It's interesting, because you mentioned, I think, a threshold of around 200 namespaces, et cetera; or is that just a random number? But the machine learning is coming in and extrapolating this data, and I cannot imagine somebody, a human, or even a human team, trying to do this; it would be hit or miss. And I don't want to use the phrase sweet spot, but this particular application of machine learning seems to fit particularly well.

 

Yasmin Rajabi  7:51  

It's a challenge we hear about from a lot of our users. For example, we work with a lot of platform engineering teams, and they're providing the tools and the automation for other engineers inside their environment to be able to operate efficiently, both from a resource standpoint and also to be resilient. The perfect balance is machine learning with automation, because it's a couple of settings, right? Your CPU requests, your memory: five, six settings. But when you start to multiply that out by all the namespaces, all the different environments, you start getting into the thousands. And the human brain can't figure out how to set all of those settings for different traffic patterns. And so machine learning handles that and provides the best settings for you. But then you have to deploy those out, and that's where automation comes in, because for a platform engineering team to deploy that out still takes a lot of time; you wouldn't want to do it manually. So we provide the automation to actually push out those settings.

 

Bruce Gain  8:44  

It seems that the automation aspect is critical. And I was wondering, could you please explain what you mean by automation in this context, like how it actually works? And maybe you have an example, a customer example, of how you really helped somebody?

 

Patrick Bergstrom  9:02  

Yeah, absolutely. As for how it actually works: we have a custom-built controller that extends the Kubernetes SDK; we integrate very heavily into Kubernetes. In fact, I think we have among the largest density of CKA-certified engineers working for StormForge of almost any company. We're a very small startup, but the percentage is very high. In fact, I think our entire sales team is CKA-certified; it's one of the requirements for working with us. And so our controller, like I said, is purpose-built to basically update Kubernetes and say, hey, swap out these resources, using those native APIs that Kubernetes makes available.

 

Yasmin Rajabi  9:40  

And we also often hear from organizations that are like, "I have my own CI/CD pipeline, so I want to use my own automation." So we can just export our recommendations as a YAML config, and then you deploy that and use your own processes.
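The episode doesn't show the export format, but conceptually the output is just the resource fields, whether StormForge's controller patches them in through the Kubernetes API or your own pipeline applies them. A hypothetical sketch, not StormForge's actual schema:

    # Hypothetical exported recommendation; illustrative numbers only.
    # Could be applied as a strategic-merge patch to the target Deployment,
    # or committed to Git and rolled out by your own CI/CD.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-app
    spec:
      template:
        spec:
          containers:
          - name: web
            resources:
              requests:
                cpu: 310m
                memory: 410Mi
              limits:
                cpu: 620m
                memory: 410Mi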

 

Bruce Gain  9:53  

Good. So obviously you're not locked into the automation. Could you please maybe walk me through an example of how you helped somebody, a specific example?

 

Yasmin Rajabi  10:04  

Recently we had a big product launch, and this customer was mentioned there; we've been working with them. They're a large SaaS company, and they have a very bursty workload. So as a team they had been doing horizontal scaling, and their platform team was dealing with this problem: we're heavily over-provisioned, but we also need to be able to support that burstiness in our traffic patterns. And so they knew there were optimization opportunities, but they didn't know how to tune the settings so that when the bursts did come, they could actually handle them. So we've been working with them to use machine learning to come up with the optimal settings and then have them deployed automatically, so that they can prepare for those bursts and also scale down: reduce their requests and set their HPA to be the most optimal from a resource perspective, so they're scaling appropriately and can still handle those bursts. And they saw a savings of 60% on their first app.

 

Bruce Gain  10:57  

A 60% reduction in resources? Congratulations.

 

Patrick Bergstrom  11:02  

It's a lot, yeah. And that's typical; that's not an edge case. And that's the interesting thing about the new product launch that she was just talking about. We started out doing vertical scaling, right? So we have machine learning that will basically figure out the correct size for your pod; it's very much like the VPA, except backed by machine learning. The latest product launch that we just announced, though, is that we actually now work with the VPA and the HPA, so we can do vertical and horizontal scaling together at the same time, so you're not trading one for the other. Right? And I think our partner Datadog just released that survey. What was it? It was like 40?

 

Yasmin Rajabi  11:38  

I think less than 40% use the HPA, and less than 1% use the VPA. Nobody uses them together.

 

Bruce Gain  11:44  

Yeah. Was that something that was often asked for, I'm assuming?

 

Yasmin Rajabi  11:49  

Yeah, I think the challenge that users had, when they asked us to work with them to build this out, is that horizontal scaling makes more sense intuitively. You know, in the traditional world, when you needed more, you would add more things, and when you needed less, you'd remove things. So the HPA makes sense. But with vertical scaling, you have to really understand how your application works. And so being able to right-size and then scale was difficult because of all the settings involved. So users wanted a way to be able to do vertical efficiently and then scale horizontally as their traffic needs it.
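As a sketch of what running both autoscalers together can look like with the stock Kubernetes components (the usual caution is to split the resources between them, for example letting the HPA scale replicas on CPU while vertical right-sizing manages memory; the names here are hypothetical):

    # Hypothetical VPA constrained to memory, leaving CPU-based replica
    # scaling to an HPA like the earlier example.
    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: example-app
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: example-app
      updatePolicy:
        updateMode: "Auto"    # applies new sizes by evicting and recreating pods
      resourcePolicy:
        containerPolicies:
        - containerName: "*"
          controlledResources: ["memory"]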

 

Bruce Gain  12:19  

I'm under the impression that this is particularly well adapted for SaaS.

 

Patrick Bergstrom  12:25  

Oh yeah, it absolutely is.

 

Yasmin Rajabi  12:27  

But it's honestly even just great if you're running Kubernetes on-prem, where you'd otherwise have to physically go buy hardware and extend your data center. And so we've been helping our customers make more use of what they've already deployed: existing on-prem data centers.

 

Patrick Bergstrom  12:44  

And the 60% number there is critical, too. At the last organization that I worked for, we had multiple data centers, and two or three of them were packed to the gills; we were talking about having to build another data center, or figure out, you know, where do we move our compute to? Imagine being able to reduce your data center consumption by 60% and free up all that extra capacity: you've now added five to six years to the life of your data center before you have to think about adding another data center, or completely changing your model by bursting into the cloud.

 

Bruce Gain  13:15  

Yeah, I mean, that's a data center that you don't have to build.

 

Patrick Bergstrom  13:22  

Yeah, that's almost a billion dollars these days to build a new data center. Yeah.

 

Bruce Gain  13:26  

Unthinkable. And I'm wondering, too, as this is deployed and used, et cetera: where do you see this going?

 

Patrick Bergstrom  13:36  

That's a product question, I think.

 

Yasmin Rajabi  13:38  

Where we look to is how we can extend our optimization to more things. So whether it's custom metrics that you want to be able to tweak and tune and get recommendations on, or kind of moving up the stack, or even just extending what we do to pull in more platforms. For example, we partner with Datadog and Prometheus today, but there are additional observability platforms to ingest from. And we're available on the AWS Marketplace, but we want to show up in more areas, so that it's super easy to make optimization just one step in your deployment.

 

Patrick Bergstrom  14:11  

And it should be a default step, I think, in everyone's deployment, whether you're using Amazon's EKS solution or AKS with Azure. Optimization is one of those critical tools that unlocks additional developer and engineer capability that you don't have the time for today. Never mind the fact that I don't know of anybody who wants to wake up every day and go tune, you know, Kubernetes resources as their full-time job, as our CEO likes to say.

 

Bruce Gain  14:35  

Yeah, absolutely. And in their minds, that's something that they're not there to do, but it's essential. Interesting.

 

Patrick Bergstrom  14:43  

And even if you did do that, and you had a team that is dedicated to it, you're talking about maybe addressing each application once a year in a cycle. Whereas with a machine learning solution like ours, you can literally do it every five minutes if you wanted to.

 

Bruce Gain  14:54  

Let the machine do the work: exactly what I want to do. Absolutely. Excellent. Well, this is B. Cameron Gain signing off from Detroit and KubeCon + CloudNativeCon. And I would like to thank Patrick Bergstrom, CTO of StormForge, and Yasmin Rajabi, vice president of product at StormForge, for taking the time to speak with me. I really found this fascinating. Thanks so much for having us.

 

Alex Williams  15:19  

Thanks for listening. If you like the show, please rate and review us on Apple Podcasts, Spotify, or wherever you get your podcasts. That's one of the best ways you can help us grow this community, and we really appreciate your feedback. You can find the full video version of this episode on YouTube; search for The New Stack, and don't forget to subscribe so you never miss any new videos. Thanks for joining us, and see you soon.

 

Transcribed by https://otter.ai