Charity is an ops engineer and accidental startup founder at Honeycomb. Before this she worked at Parse, Facebook, and Linden Lab on infrastructure and developer tools, and always seemed to wind up running the databases. She is the co-author of O’Reilly’s Database Reliability Engineering, and loves free speech, free software, and single malt scotch.
You were the Production Engineering Manager at Facebook (Now Meta) for over 2 years, what were some of your highlights from this period and what are some of your key takeaways from this experience?
I worked on Parse, which was a backend for mobile apps, sort of like Heroku for mobile. I had never been interested in working at a big company, but we were acquired by Facebook. One of my key takeaways was that acquisitions are really, really hard, even in the very best of circumstances. The advice I always give other founders now is this: if you’re going to be acquired, make sure you have an executive sponsor, and think really hard about whether you have strategic alignment. Facebook acquired Instagram not long before acquiring Parse, and the Instagram acquisition was hardly bells and roses, but it was ultimately very successful because they did have strategic alignment and a strong sponsor.
I didn’t have an easy time at Facebook, but I am very grateful for the time I spent there; I don’t know that I could have started a company without the lessons I learned about organizational structure, management, strategy, etc. It also lent me a pedigree that made me attractive to VCs, none of whom had given me the time of day until that point. I’m a little cranky about this, but I’ll still take it.
Could you share the genesis story behind launching Honeycomb?
Definitely. From an architectural perspective, Parse was ahead of its time — we were using microservices before there were microservices, we had a massively sharded data layer, and as a platform serving over a million mobile apps, we had a lot of really complicated multi-tenancy problems. Our customers were developers, and they were constantly writing and uploading arbitrary code snippets and new queries of, shall we say, “varying quality” — and we just had to take it all in and make it work, somehow.
We were on the vanguard of a bunch of changes that have since gone mainstream. It used to be that most architectures were pretty simple, and they would fail repeatedly in predictable ways. You typically had a web layer, an application, and a database, and most of the complexity was bound up in your application code. So you would write monitoring checks to watch for those failures, and construct static dashboards for your metrics and monitoring data.
This industry has seen an explosion in architectural complexity over the past 10 years. We blew up the monolith, so now you have anywhere from several services to thousands of application microservices. Polyglot persistence is the norm; instead of “the database” it’s normal to have many different storage types as well as horizontal sharding, layers of caching, db-per-microservice, queueing, and more. On top of that you’ve got server-side hosted containers, third-party services and platforms, serverless code, block storage, and more.
The hard part used to be debugging your code; now, the hard part is figuring out where in the system the code is that you need to debug. Instead of failing repeatedly in predictable ways, it’s more likely the case that every single time you get paged, it’s about something you’ve never seen before and may never see again.
That’s the state we were in at Parse, on Facebook. Every day the entire platform was going down, and every time it was something different and new; a different app hitting the top 10 on iTunes, a different developer uploading a bad query.
Debugging these problems from scratch is insanely hard. With logs and metrics, you basically have to know what you’re looking for before you can find it. But we started feeding some data sets into a FB tool called Scuba, which let us slice and dice on arbitrary dimensions and high cardinality data in real time, and the amount of time it took us to identify and resolve these problems from scratch dropped like a rock, like from hours to…minutes? seconds? It wasn’t even an engineering problem anymore, it was a support problem. You could just follow the trail of breadcrumbs to the answer every time, clicky click click.
It was mind-blowing. This massive source of uncertainty and toil and unhappy customers and 2 am pages just … went away. It wasn’t until Christine and I left Facebook that it dawned on us just how much it had transformed the way we interacted with software. The idea of going back to the bad old days of monitoring checks and dashboards was just unthinkable.
But at the time, we honestly thought this was going to be a niche solution — that it solved a problem other massive multitenant platforms might have. It wasn’t until we had been building for almost a year that we started to realize that, oh wow, this is actually becoming an everyone problem.
For readers who are unfamiliar, what specifically is an observability platform and how does it differ from traditional monitoring and metrics?
Traditional monitoring famously has three pillars: metrics, logs and traces. You usually need to buy many tools to get your needs met: logging, tracing, APM, RUM, dashboarding, visualization, etc. Each of these is optimized for a different use case in a different format. As an engineer, you sit in the middle of these, trying to make sense of all of them. You skim through dashboards looking for visual patterns, you copy-paste IDs around from logs to traces and back. It’s very reactive and piecemeal, and typically you refer to these tools when you have a problem — they’re designed to help you operate your code and find bugs and errors.
Modern observability has a single source of truth; arbitrarily wide structured log events. From these events you can derive your metrics, dashboards, and logs. You can visualize them over time as a trace, you can slice and dice, you can zoom in to individual requests and out to the long view. Because everything’s connected, you don’t have to jump around from tool to tool, guessing or relying on intuition. Modern observability isn’t just about how you operate your systems, it’s about how you develop your code. It’s the substrate that allows you to hook up powerful, tight feedback loops that help you ship lots of value to users swiftly, with confidence, and find problems before your users do.
You’re known for believing that observability offers a single source of truth in engineering environments. How does AI integrate into this vision, and what are its benefits and challenges in this context?
Observability is like putting your glasses on before you go hurtling down the freeway. Test-driven development (TDD) revolutionized software in the early 2000s, but TDD has been losing efficacy the more complexity is located in our systems instead of just our software. Increasingly, if you want to get the benefits associated with TDD, you actually need to instrument your code and perform something akin to observability-driven development, or ODD, where you instrument as you go, deploy fast, then look at your code in production through the lens of the instrumentation you just wrote and ask yourself: “is it doing what I expected it to do, and does anything else look … weird?”
Tests alone aren’t enough to confirm that your code is doing what it’s supposed to do. You don’t know that until you’ve watched it bake in production, with real users on real infrastructure.
This kind of development — that includes production in fast feedback loops — is (somewhat counterintuitively) much faster, easier and simpler than relying on tests and slower deploy cycles. Once developers have tried working that way, they’re famously unwilling to go back to the slow, old way of doing things.
What excites me about AI is that when you’re developing with LLMs, you have to develop in production. The only way you can derive a set of tests is by first validating your code in production and working backwards. I think that writing software backed by LLMs will be as common a skill as writing software backed by MySQL or Postgres in a few years, and my hope is that this drags engineers kicking and screaming into a better way of life.
You’ve raised concerns about mounting technical debt due to the AI revolution. Could you elaborate on the types of technical debts AI can introduce and how Honeycomb helps in managing or mitigating these debts?
I’m concerned about both technical debt and, perhaps more importantly, organizational debt. One of the worst kinds of tech debt is when you have software that isn’t well understood by anyone. Which means that any time you have to extend or change that code, or debug or fix it, somebody has to do the hard work of learning it.
And if you put code into production that nobody understands, there’s a very good chance that it wasn’t written to be understandable. Good code is written to be easy to read and understand and extend. It uses conventions and patterns, it uses consistent naming and modularization, it strikes a balance between DRY and other considerations. The quality of code is inseparable from how easy it is for people to interact with it. If we just start tossing code into production because it compiles or passes tests, we’re creating a massive iceberg of future technical problems for ourselves.
If you’ve decided to ship code that nobody understands, Honeycomb can’t help with that. But if you do care about shipping clean, iterable software, instrumentation and observability are absolutely essential to that effort. Instrumentation is like documentation plus real-time state reporting. Instrumentation is the only way you can truly confirm that your software is doing what you expect it to do, and behaving the way your users expect it to behave.
How does Honeycomb utilize AI to improve the efficiency and effectiveness of engineering teams?
Our engineers use AI a lot internally, especially CoPilot. Our more junior engineers report using ChatGPT every day to answer questions and help them understand the software they’re building. Our more senior engineers say it’s great for generating software that would be very tedious or annoying to write, like when you have a giant YAML file to fill out. It’s also useful for generating snippets of code in languages you don’t usually use, or from API documentation. Like, you can generate some really great, usable examples of stuff using the AWS SDKs and APIs, since it was trained on repos that have real usage of that code.
However, any time you let AI generate your code, you have to step through it line by line to ensure it’s doing the right thing, because it absolutely will hallucinate garbage on the regular.
Could you provide examples of how AI-powered features like your query assistant or Slack integration enhance team collaboration?
Yeah, for sure. Our query assistant is a great example. Using query builders is complicated and hard, even for power users. If you have hundreds or thousands of dimensions in your telemetry, you can’t always remember offhand what the most valuable ones are called. And even power users forget the details of how to generate certain kinds of graphs.
So our query assistant lets you ask questions using natural language. Like, “what are the slowest endpoints?”, or “what happened after my last deploy?” and it generates a query and drops you into it. Most people find it difficult to compose a new query from scratch and easy to tweak an existing one, so it gives you a leg up.
Honeycomb promises faster resolution of incidents. Can you describe how the integration of logs, metrics, and traces into a unified data type aids in quicker debugging and problem resolution?
Everything is connected. You don’t have to guess. Instead of eyeballing that this dashboard looks like it’s the same shape as that dashboard, or guessing that this spike in your metrics must be the same as this spike in your logs based on time stamps….instead, the data is all connected. You don’t have to guess, you can just ask.
Data is made valuable by context. The last generation of tooling worked by stripping away all of the context at write time; once you’ve discarded the context, you can never get it back again.
Also: with logs and metrics, you have to know what you’re looking for before you can find it. That’s not true of modern observability. You don’t have to know anything, or search for anything.
When you’re storing this rich contextual data, you can do things with it that feel like magic. We have a tool called BubbleUp, where you can draw a bubble around anything you think is weird or might be interesting, and we compute all the dimensions inside the bubble vs outside the bubble, the baseline, and sort and diff them. So you’re like “this bubble is weird” and we immediately tell you, “it’s different in xyz ways”. SO much of debugging boils down to “here’s a thing I care about, but why do I care about it?” When you can immediately identify that it’s different because these requests are coming from Android devices, with this particular build ID, using this language pack, in this region, with this app id, with a large payload … by now you probably know exactly what is wrong and why.
It’s not just about the unified data, either — although that is a huge part of it. It’s also about how effortlessly we handle high cardinality data, like unique IDs, shopping cart IDs, app IDs, first/last names, etc. The last generation of tooling cannot handle rich data like that, which is kind of unbelievable when you think about it, because rich, high cardinality data is the most valuable and identifying data of all.
How does improving observability translate into better business outcomes?
This is one of the other big shifts from the past generation to the new generation of observability tooling. In the past, systems, application, and business data were all siloed away from each other into different tools. This is absurd — every interesting question you want to ask about modern systems has elements of all three.
Observability isn’t just about bugs, or downtime, or outages. It’s about ensuring that we’re working on the right things, that our users are having a great experience, that we are achieving the business outcomes we’re aiming for. It’s about building value, not just operating. If you can’t see where you’re going, you’re not able to move very swiftly and you can’t course correct very fast. The more visibility you have into what your users are doing with your code, the better and stronger an engineer you can be.
Where do you see the future of observability heading, especially concerning AI developments?
Observability is increasingly about enabling teams to hook up tight, fast feedback loops, so they can develop swiftly, with confidence, in production, and waste less time and energy.
It’s about connecting the dots between business outcomes and technological methods.
And it’s about ensuring that we understand the software we’re putting out into the world. As software and systems get ever more complex, and especially as AI is increasingly in the mix, it’s more important than ever that we hold ourselves accountable to a human standard of understanding and manageability.
From an observability perspective, we are going to see increasing levels of sophistication in the data pipeline — using machine learning and sophisticated sampling techniques to balance value vs cost, to keep as much detail as possible about outlier events and important events and store summaries of the rest as cheaply as possible.
AI vendors are making lots of overheated claims about how they can understand your software better than you can, or how they can process the data and tell your humans what actions to take. From everything I have seen, this is an expensive pipe dream. False positives are incredibly costly. There is no substitute for understanding your systems and your data. AI can help your engineers with this! But it cannot replace your engineers.
Thank you for the great interview, readers who wish to learn more should visit Honeycomb.
Credit: Source link