Do you rely on your coworkers? I don’t. And they don’t rely on me. Individual humans are not reliable, and we aren’t supposed to be.

We don’t rely on our teammates to be available 24/7. Nor to be right all the time. Humans sleep, we go on vacation, we have limited knowledge, sometimes we’re grouchy. If you’re out sick, I’ll answer chat messages addressed to you. If my dad dies, you’ll take my on-call shift. If the work you planned is harder than you thought, I’ll help you complete it. We work with our teammates so that someone on the team is available all the time.

You need to tell me there’s a problem, that’s all. Don’t just try harder! That’s not teamwork!

We don’t rely on a particular instance of the software we operate to be stay live and responsive in production. We work with tools that take dead instances out of rotation, and we work with the code to make the software report its status and its difficulties, so we can fix them.

Within our team (of people and software and tools), we work with each other, so that outside the team, the world can rely on us. The rest of the company can rely on our software to be up and on someone to answer questions all the time. We build certainty on top of uncertainty.

If we need queuing in our production software, and we operate kafka to do that queuing, can we rely on Kafka? Heck no. It’s gonna be fine for months, but when something does go bad, it’s ours to repair. When too many nodes run out of disk space and everything is doom, it’s you and me frantically googling concepts and locating backups to search for lost messages. We have to work with the software we operate. That gets harder when problems are infrequent.

If we need queuing in our production software, and we pay for a service that operates Kafka, this is different. Yes, it’ll go down sometimes, but worst case we contact a human. Someone who is an expert in Kafka gets to trace down those lost messages. It takes humans inside a system to make it resilient. The humans on the team operating Kafka are prepared to work with it; someone’s instance is going down every week, so their heads are in that space. We can rely on the vendor, because people and software are working together to provide queuing as a service.

We work with our teammates, so the world can rely on our team.