Fixing DevOps with ML

Feb 15, 2024


Doing too much

The modern DevOps engineer is expected to be a log detective, a mentat to previously-unseen application code, an incident responder, a procurement liaison, a fiscal analyst for cloud pricing, an eBPF live debugger, a security guru, a real-time chart whisperer and dataviz expert, and half a dozen other things.

The breadth is only increasing. DevOps engineers are now also MLOps experts, AI toolchain selectors, and LLM quality analysts.

This, for lack of a better word, is real ops (RealOps, if you'd like), or what we actually do every day, not the world of over-idealized conference talks, CIO whitepapers, and, well, a lot of blog posts.

Most of these things have little to do with automation to improve software delivery (the supposed focus of much DevOps work), and a lot to do with just thinking and, well, operating. How can we help operators think and then take action? How can we help, right now?

Metrics and monitoring are easy and awful

One way is with data (all kinds). But getting the right data—when you want it and how you want it—is itself a huge task. Consider just one part of all this: metrics & monitoring.

Even without explicitly instrumenting code, today's apps generate massive amounts of operational data. If you run on any cloud provider, you’ve got system-level data. If you run Kubernetes, you’ve got cluster data (and a lot more). Then with a little bit of OpenTelemetry/SkyWalking/Jaeger/etc you have all the app-level request tracing you'd ever want. Cool. 

All this data moves through your systems via who-knows-how-many tools, shippers, pollers, whatever, each one from some recent vintage of the CNCF, and eventually the data ends up in some sort of database that some people know how to access with some sort of query language, and there's also something to do graphs behind a VPN and also an alerting thing (maybe the same thing?). And this is mostly fine from a strict operational standpoint once it’s all working! Just ignore it really!

Here's Jason Dixon on monitoring & metrics back in 2011 (that's year 3 BKE if you use the Kubernetes calendar):

We don't realize it yet, but what we really need is more competition in this area, particularly from the bottom-up... How awesome would it be if, five years from now, monitoring software looked like EC2, CloudFoundry or Heroku? I like to envision it as the Voltron of monitoring software.

By this he meant a vibrant set of metrics and monitoring tools that played really well together. Well, that was 12 years ago, and here’s a recent image of Voltron courtesy of Midjourney:

Sad Voltron

CNCF lists 100+ monitoring tools in their tooling landscape (out of 1500 overall tools). It’s an underestimate, too. So yeah: we did get competition; too much of it, in a bloated, complex, expensive ecosystem. 

I could talk about sad Voltron all day, but here's the real point: I'm talking about just metrics & monitoring. And that is just one of the dozen or so things expected of the modern DevOps engineer. (And who here wants to build yet another monitoring tool from scratch? Actually, don’t answer.)

Should've been a farmer?

Reading this state of the world you may have convinced yourself that this profession is awful and we should all do something else.

Well, despite that, I'd like to try to make this stuff better.

I know, I know. Operators are drowning in data, with too many tools, way too many responsibilities, and so the way to help is: software? This time we’ll get it right! Well.

ML can help

It’s 2023 and the term "automation", of course, means something new. 

Earlier this year, Ben, co-founder here at Cased, and I wanted to figure out what machine learning could do to help make work better for DevOps engineers. This was exactly the goal and the level of thinking. Not to build a better configuration tool, or another monitoring tool, or another deployment platform. Just, quite literally, to make things better across the board for the people operating software. Because DevOps engineers are supposed to be operating software for customers, not operating yet another DevOps tool.

To us that meant using ML innovations (LLMs, mostly) to reduce the cognitive load for everything DevOps engineers do. All their tools. Not to replace those tools (not yet at least!), but instead to soften their edges, to bring out the parts that are most useful, and to weave together all the very disparate software that powers our applications. We don’t think “DevOps is broken,” at least no more than usual. But we do think things can be better, and this is an important time to try.

Cased is starting out as an ML-powered agent you work with in Slack, because that's an easy way for us (and you) to get going. Cased does stuff like this (and more):

  • Fast Contextual Understanding
    Cased provides a quick snapshot of ongoing events, incidents, and statuses. So less scrambling around different tools (or bothering somebody!) to get a status update.

  • Data Retrieval Made Simple
    LLMs come into play heavily here: ask for the information you need in plain English, and reduce the cognitive burden of navigating yet another SaaS dashboard or random query language. Just stay in chat and share information without having to do anything but ask.

  • Visualizations On-Demand
    Skip the static dashboards and ask for the data you want, from any data source. Cased's graphing works best with time-series data (say, CPU usage over time), but it's generalized to work with almost anything. We even use it for contributor graphs and lightweight DORA metrics.

  • Take Action from Slack
    Move out of a siloed terminal or single-player web app and make sure everyone sees important ops actions (updating a status site, taking a backup, and so on).
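
If "lightweight DORA metrics" sounds abstract, here is a rough sketch of the kind of arithmetic involved. To be clear, this is not how Cased does it; the deploy records, the `dora_summary` function, and the 7-day window are all made up for illustration, and a real setup would pull this data from your deploy tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical deploy records: (commit time, deploy time, caused an incident?)
deploys = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 15, 0), False),
    (datetime(2023, 5, 2, 10, 0), datetime(2023, 5, 3, 10, 0), True),
    (datetime(2023, 5, 4, 8, 0), datetime(2023, 5, 4, 10, 0), False),
    (datetime(2023, 5, 5, 12, 0), datetime(2023, 5, 5, 16, 0), False),
]


def dora_summary(records, window_days):
    """Summarize three DORA-style numbers for a list of deploy records."""
    if not records:
        raise ValueError("need at least one deploy record")
    # Lead time: hours from commit to deploy, per record.
    lead_hours = [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed, _ in records
    ]
    failures = sum(1 for _, _, failed in records if failed)
    return {
        "deploys_per_day": len(records) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": failures / len(records),
    }


summary = dora_summary(deploys, window_days=7)
print(summary)
```

Three of the usual four DORA numbers fall out of one small list of records here (time to restore service would need incident data as well), which is why this works as a quick chat-sized question rather than a dashboard project.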

Cased isn't supposed to be just another tool in the DevOps stack. It's more like a companion, and, we hope, a very helpful one, as we try to bring calm and simplicity to operating software. We're not yet sure of all the things Cased will be capable of (it's learning every day), so if you're interested in trying it out and seeing for yourself, share your email address and we'll be in touch directly.

Get a 15 minute demo of Cased with Ted, co-founder and CEO

© 2024 Cased, Inc. All rights reserved.
