We kick things off by weighing the merits of two gender-neutral regional pronouns: the familiar y’all and the underappreciated yinz. Now that that’s covered...
The global population of developers will hit 45 million by 2030, up from 26.9 million in 2021 (EDC). What platforms will they want to build on?
Did Kubernetes solve all your problems? Did it create new ones?
It seems there’s always an XKCD relevant to our conversation. Today, it’s How standards proliferate.
Maxwell, a solutions architect at xMatters, took a winding road to get to where he is. After a computer engineering education, he worked as a field support engineer, a product manager, and an SRE before landing in his current role as a solutions architect, where he serves as something of an SRE for SREs, helping them solve incident management problems with the help of xMatters.
When he moved to the SRE role, Maxwell wanted to get back to doing technical work. It was a lateral move within his company, which was migrating an on-prem solution into the cloud. It’s a journey plenty of companies are making now: breaking an application into microservices, running processes in containers, and using Kubernetes to orchestrate the whole thing. But the migration came with friction: non-production environments would go down and waste SRE time, making it harder to address problems in the production pipeline.
At the heart of their issues was the incident response process. Several bottlenecks prevented them from delivering value to their customers quickly. Incidents would trigger emails to the relevant engineers (sometimes 20 on a single email), which made it easy for any one engineer to ignore the problem: surely someone else has got this. They had a bad silo problem, too, where escalating to the right person across groups became an issue of its own. And of course, most of this was manual. Their MTTR—mean time to resolve—was lagging.
Maxwell moved over to xMatters because they managed to solve these problems through clever automation. Their product automates the scheduling and notification process so that the right person knows about the incident as soon as possible. At the core of this process was a different MTTR—mean time to respond. Once an engineer started working to resolve a problem, it was all down to runbooks and skill. But the lag between the initial incident and that start was the real slowdown.
It’s not just the response from the first SRE on call. It’s the other escalations down the line—to data engineers, for example—that can eat up time. xMatters has worked hard to make escalation configuration easy: it handles not only who's responsible for specific services and metrics, but who’s in the escalation chain from there. When an incident hits, notifications go out through a series of configured channels; maybe it tries a chat program first, then email, then SMS.
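To give a flavor of the idea, here’s a minimal sketch of what such an escalation policy might look like in code. This is a hypothetical illustration, not xMatters’ actual configuration format; the service name, channel ordering, and notify helper are all made up.

```python
from dataclasses import dataclass, field

# Hypothetical escalation policy; illustrative only, not xMatters' real schema.
@dataclass
class EscalationPolicy:
    service: str                  # the service this policy covers
    responders: list[str]         # on-call order: first responder, then escalations
    channels: list[str] = field(default_factory=lambda: ["chat", "email", "sms"])
    retry_seconds: int = 120      # wait before trying the next channel or person

def notify(policy: EscalationPolicy, incident: str) -> None:
    """Walk the escalation chain, trying each channel in order."""
    for responder in policy.responders:
        for channel in policy.channels:
            # A real system would stop as soon as someone acknowledges;
            # here we just show the fan-out order.
            print(f"[{channel}] paging {responder} about: {incident}")

policy = EscalationPolicy(
    service="checkout-api",
    responders=["sre-on-call", "data-engineer-on-call", "team-lead"],
)
notify(policy, "p95 latency above SLO for 10 minutes")
```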
The on-call process is often a source of dread, but automating the escalation process can take some of the sting out of it. Check out the episode to learn more.
Rennie grew up in Kenya, Honduras, Somalia, and Oklahoma; his parents volunteered for the Peace Corps before working for the US Government overseas.
Audio tape drives are real! Check out this Retrocomputing question about how the Commodore 64 audio interface worked.
If you want to remember something better, a 2014 study says you should write it out by hand.
Rennie worked at BlackBerry, and Ben remembers his colleagues at The Verge fondly hoping for the company's comeback. In fact, here's Ben hoping for that comeback!
Alex comes up with better ways to interact with technology and writes about it on his website.
Is there a link between playing music and writing code? A previous article of ours covered the merging of the two in the music programming language Sonic Pi.
If you're curious about the weird extremes of operating system development, check out TempleOS.
Cassidy and Alex both take copious notes in Obsidian. Alex has a plugin that may help you organize your notes automatically.
The infrastructure that networked applications live on is getting more and more complicated. There was a time when you could serve an application from a single machine on premises. But now, with cloud computing offering painless scaling to meet your demand, your infrastructure becomes abstracted away, not something you have direct contact with. Compound that problem with an architecture spread across dozens, even hundreds of microservices, replicated across multiple data centers in an ever-changing cloud, and tracking down the source of a system failure becomes something like a murder mystery. Who shot our uptime in the foot?
A good observability system helps with that. On this sponsored episode of the Stack Overflow Podcast, we talk with Greg Leffler of Splunk about the keys to instrumenting an observable system and how the OpenTelemetry standard makes observability easier, even if you aren’t using Splunk’s product.
Observability is really an outgrowth of traditional monitoring. You expect that some service or system could break, so you keep an eye on it. But observability applies that monitoring to an entire system and gives you the ability to answer the unexpected questions that come up. It uses three principal ways of viewing system data: logs, traces, and metrics.
A metric is a number plus a timestamp: a measurement of a particular detail at a moment in time. A trace follows a single request through the system. And logs are the causes and effects recorded from a system in motion. Splunk wants to add a fourth signal—events—that would track specific user events and browser failures.
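As a rough illustration (a toy sketch, not any particular vendor's data format), the three signals might look something like this:

```python
import time

# A metric: a named number with a timestamp.
metric = {"name": "http.requests.count", "value": 42, "timestamp": time.time()}

# A log: a timestamped record of something the system did.
log = {
    "timestamp": time.time(),
    "level": "ERROR",
    "message": "payment service timed out after 3 retries",
}

# A trace: spans following one request across services, linked by a trace id.
trace = [
    {"trace_id": "abc123", "span": "frontend /checkout", "duration_ms": 840},
    {"trace_id": "abc123", "span": "payment.charge", "duration_ms": 790},
    {"trace_id": "abc123", "span": "db.insert_order", "duration_ms": 12},
]
```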
Before you can observe any of that data, your system has to be instrumented to produce it. Greg and his colleagues at Splunk are huge fans of OpenTelemetry, an open standard that can extract data for any observability platform. You instrument your application once and never have to worry about it again, even if you change observability platforms.
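To make that concrete, here's a minimal sketch of instrumenting with OpenTelemetry's Python API (assuming the opentelemetry-api and opentelemetry-sdk packages; the service and span names are made up). The application code talks to the vendor-neutral API, while the exporter (a console exporter here, for demonstration) is the only piece you'd swap to change backends.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One-time setup: swapping observability vendors later means swapping this
# exporter, not re-instrumenting the application code below.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Application code is written against the vendor-neutral API.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")  # hypothetical attribute
    # ... the actual request handling happens here ...
```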
Why use an approach that makes it easy for a client to switch vendors? Leffler and Splunk argue that it’s not only better for customers, but for Splunk and the observability industry as a whole. If you’ve instrumented your system with a vendor-locked solution, you may not switch; you may just let your observability program fall by the wayside. That helps exactly no one.
As we’ve seen, people are moving to the cloud at an ever-faster pace. That’s no surprise; it offers automatic scaling for arbitrary traffic volumes, high availability, and worry-free recovery from infrastructure failures. But moving to the cloud can be expensive, and you have to do some work on your application to be able to see everything that’s going on inside it. Plenty of people just throw everything into the cloud and let the provider handle it, which is fine until they see the bill.
Observability based on an open standard makes it easier for everyone to build a more efficient and robust service in the cloud. Give the episode a listen and let us know what you think in the comments.