The .NET on AWS Show, featuring Martin Thwaites (again!)
In this episode, we are again joined by Observability Advocate, Martin Thwaites! Join us as we discuss the differences in observability and telemetry needs between development and infrastructure engineers.
Brandon Minnick
Amazon Employee
Published Jun 10, 2024
Brandon Minnick 1:05
Hello, good morning, good evening. Welcome back, everybody, to another episode of the .NET on AWS Show. I'm your host, Brandon Minnick, and with me, as always, is the amazing Francois. Francois, how's your week?
Francois Bouteruche 1:20
Oh, fine, thank you. It's been rainy in Paris. And today I'm very happy because there is sun outside, and it is a surprise. It's the first time I'm seeing the sun in a week, so I'm really happy today. It's a sunny day. Yeah. And I will ask if you are well, Brandon, because I know you're making a really big push to be here today.
Brandon Minnick 1:48
I'm not. I'm sick. Yeah, I was just telling Francois before the show, I got sick again. I feel like I was just sick when I was coming back from dotnet Days in Romania, which, not coincidentally, is the last time I got to see our amazing guest today. But yeah, I feel like I just got sick about a month ago. And I didn't really do anything last week. I didn't go into any major crowd or even go grocery shopping. I left the house twice and somehow got sick. So I'm fighting it. I'll be giving a talk at NDC Oslo next week, so for anybody out there who is in the area and near Oslo, come join us, come attend. We'll have a whole AWS booth along with a couple of AWS talks at Oslo. But yeah, my goal this week is to get healthy for that. I haven't tested yet because this is hot off the presses: I just got sick yesterday. But if it is COVID, then obviously I won't be able to make it, I think, based on the COVID guidelines and travel and all that stuff. But we'll see. I'm gonna do my best to get healthy. Hopefully I don't have too many sneezes or sniffles on the stream, but I apologize if I do.
Francois Bouteruche 3:05
Yeah, fingers crossed for next week. I hope you'll be there.
Brandon Minnick 3:10
Thanks. I mean, my plan was to have you fill in for me if you're available, but hopefully I'll be there.
But speaking of announcements, Francois, you were mentioning before the show that we've got a couple of big announcements this week.
Francois Bouteruche 3:27
Yeah. The thing I want to highlight this week is the release of .NET Aspire, and not only the release of .NET Aspire, but the fact that .NET Aspire supports AWS. So when you're using .NET Aspire and you want to use AWS services, you can use both together. We have a nice NuGet package, which is still in preview, but you can already download it, try it, and provide feedback. It's the Aspire.Hosting.AWS NuGet package, and we really welcome your feedback. Norm was our guest a few weeks ago, with David Fowler, to discuss this. I encourage you to go back and watch that stream. It was a great stream, with both David Fowler and Norm, to discover the integration of AWS inside .NET Aspire. So that's my big item of the week.
Brandon Minnick 4:39
Yeah, I'm so excited for this, because I really think .NET Aspire is the future for .NET developers building any sort of back end, even if it's not in the cloud, that has multiple, let's just say, things going on. If you've got an API and a website and a database, even if it's really just two or three, Aspire just makes your life easier as a .NET developer, which is why I really see it taking off. And yeah, I'm super excited that we have AWS support there on day one. So go check it out. Like Francois said, Aspire.Hosting.AWS is still in preview. The rest of the Aspire NuGet packages, most of them at least, are now out of preview, because it did GA at the Microsoft Build conference recently. But we chose to keep ours in preview just to get some more feedback. So we want you to download the Aspire.Hosting.AWS NuGet package and try it out. You know, we've got a couple hundred downloads now. Actually, that's probably closer to... 1,786, there it is. So we're getting up there. But what we're essentially looking for is feedback, so that we have more confidence that the API surface, the API design, is what you're looking for. That way we can feel comfortable taking off the preview tag without having to make any breaking changes once it's GA and in production. So let us know. You're the experts out there; .NET developers are the whole reason why this NuGet package exists. So if there's something you don't like, feel free to let the team know. And hopefully, yeah, we'll be able to promote it soon. Then we'll get to take off the preview tag, and that gives you the full confidence and backing of AWS.
Francois Bouteruche 6:38
Yeah, for those who are wondering, just as a small highlight: it relies on AWS CloudFormation. So basically, everything you can deploy with AWS CloudFormation, our infrastructure as code service, you can use in .NET Aspire. So just to let you know. Yep.
Brandon Minnick 7:03
And we even have a pull request open. Well, we don't; our amazing colleague Vincent does. He has a pull request open to add CDK support to .NET Aspire as well. So lots of big things coming, exciting things coming. But the good news is, if you delve into Aspire, if you become a .NET Aspire advocate, as I certainly have over the last couple of months, then you're set up for success with AWS. We'll be there with you along the way, and all the things you love doing in AWS, you'll also be able to do in .NET Aspire. But with that, we have such an amazing guest today, Francois. It's actually his second time joining us on the show. We liked him so much that we wanted to have him back. And since we're talking about Aspire, how do you talk about Aspire without observability, and how do you talk about observability without Martin Thwaites, the observability advocate from Honeycomb? Welcome, welcome, Martin.
Martin Thwaites 8:04
Hello, hello. Thank you for having me. I obviously didn't do well enough on the first one, so I have to come back. It's like school, you know: if you got it right on the first day, you wouldn't have to go back the second day. So, well, obviously I'll try and do better this time.
Brandon Minnick 8:22
Man, just one day of school? Is that how it works in the UK? You just show up day one, first grade, you nail it, and then you're good?
Martin Thwaites 8:30
I mean, if you can do everything on the first day, yeah, they send you home. It's just
Brandon Minnick 8:36
he can already spell his name. He's first grade.
Martin Thwaites 8:39
You go one grade a day. So yeah. Aspire, where do I start? Oh, my.
Brandon Minnick 8:48
All right, let's first, uh, for folks who haven't met you yet... I mean, first, pause this episode and go check out the previous episode where Martin joined us. We have an audio podcast of it, the .NET... And
Martin Thwaites 9:03
What's this about "us"?
Brandon Minnick 9:08
I skipped Martin's last episode. Our amazing colleague James Eastham filled in for me, but they did a great job, and you should go check out their show. But, Martin, for folks who haven't met you yet, who are you? What do you do?
Martin Thwaites 9:25
Um, so my name is Martin Thwaites. I go by MartinDotNet on the Twitters and all of those kinds of places. I term myself an observability evangelist, or observability advocate, whatever term you want to use. My goal has always been to help people understand production. I've been talking about logs, metrics, events, traces, just generally how we understand production, for the better part of a decade now. Recently I joined Honeycomb, which is an observability back end for telemetry data, and I've been a developer advocate for them for a little while now, traveling the world to Romania and Moldova and drinking lots of wine. So yeah, that's me.
Brandon Minnick 10:19
Yeah, yeah. Quick throwback to dotnet Days Romania. Actually, all three of us were there at that conference. Yeah, I highly recommend that one. Again, if you're in the area, they'll do it again next year, I hope. But yeah, fantastic conference. I got sick, so as long as that doesn't happen, you're gonna have a great time. Apparently, I always get sick. Yeah.
Martin Thwaites 10:42
It had nothing to do with us visiting a winery at all.
Brandon Minnick 10:47
That's right, we did cross the border. The dotnet Days conference is in the city of Iași in Romania, which is on the eastern border, right next to Moldova. So as a little thank-you gift, the conference organizers put us in a car and drove us around for, what, 10 hours the next day? Yeah. And Martin... okay.
Martin Thwaites 11:13
Yeah... it was, it was great. I mean, the conference is amazing. I've been twice now, and the attendees are some of the best that I've interacted with. They have some of the best questions. And the reason why I go to conferences, the reason why I do podcasts like this, is that people get to hear me speak. Some people like it, some people don't, I don't care. But what I get out of it is people talking to me and telling me what they want to hear, what's missing, what things we should be talking about as speakers. Because, you know, we can submit abstracts and that kind of stuff, but really what we're here to do is educate people, to give them resources to be able to do their jobs better. That's our role. And there's no better way to do that than to actually speak to people and get them to tell you what they need. And I just absolutely loved the crowd at dotnet Days the first time. I had awful trouble the first time, so I remember very little of it other than I loved the attendees. And then this time, I got to see them all again. And it was just... yeah. I did a workshop there, and the people were amazing. And then I did two talks, because somebody dropped out. So I gave two talks. And yeah, just the feedback. And it's not just about "was it good, was it bad"; it was, "I wanted to hear something more about this." Like, great. Okay, that is gold feedback, because I can take that away. Either I've got the experience myself, and I can use my experience to provide some credibility around how to do that, or I can go and talk to some other people who have, and try and formulate some ideas around that. And, you know, one of the things I've been talking about recently is credibility in DevRel, which is what I do. How do we make sure that people can trust what we're saying, that it isn't just selling a product?
Whether it's selling AWS or selling Azure or selling Honeycomb or selling any of those things, how do we maintain that credibility? By getting people to actually tell me what they want, I can, on my consulting gigs, because I do that on the side, try and work out how to do those things, so I actually have some credibility in it. But yeah, dotnet Days is amazing. The two people who run it, Irina and Bogdan, are amazing as well. So yeah, go check it out.
Brandon Minnick 13:50
That's right. Welcome back to the dotnet Days Romania show, featuring...
Martin Thwaites 13:58
So I'll just message Irina and make sure my cheque's in the post.
Francois Bouteruche 14:03
Irina.
Martin Thwaites 14:07
But yeah, there are so many conferences out there, and some of them are well run, some of them aren't, and I do pop dotnet Days in that bucket.
Brandon Minnick 14:19
Of the well-run, of course.
Martin Thwaites 14:22
Definitely the well-run ones, yeah. Cool.
Francois Bouteruche 14:27
Yeah, and I think, as you mentioned, there are many different conferences, and there is probably a conference that fits your needs. Because dotnet Days Romania is, I agree with you, very... The attendees are very focused and they have questions, meaningful questions. With other conferences, sometimes you get more content. If what you're looking for is awareness, to learn new things, discover new things, not a deep dive but "oh, I want to discover many new things," you can go to the other big conferences. So I really encourage people... The COVID crisis is over, or at least seems to be over, or we now live as if it was over. Anyway, I really encourage people to go back to in-person conferences. I'm an in-person person. I really value the relationships we can build at in-person conferences. Being in the room, discussing with the speaker or discussing with the attendees, makes a real difference compared to watching a YouTube video. So I really encourage you to go back to in-person events. You can meet amazing people and learn a lot.
Martin Thwaites 15:59
Yeah, I also think... I mean, I'm at Oslo, same as, hopefully, Brandon, next week as well. And one of the things I find is that attendees don't believe they can come and speak to speakers. They kind of put us up on a pedestal. If I'm walking around the conference, the thing I absolutely love is when somebody grabs me and asks me a question. I mean, don't ask me something that's not relevant, obviously. But ask me something like, "How would I do this? What's your opinion on this?" I mean, I have nothing but opinions. I will tell everybody. But I just love that people actually come and do it. We have a speaker room, so if we don't want to talk to people and we want to decompress after or before our talks, we'll go and sit in the speaker room. If we're sat at the venue, walking around, or anything like that, come and talk to us. You know, that's what you get out of going to an in-person conference: that interaction, getting some idea or some knowledge that's bespoke to your use case. You know, it really energizes me, talking to people.
Brandon Minnick 17:10
Yeah, and to kind of marry what you've both said together: internally, a little behind the scenes for everybody watching and listening today, we take that feedback and we escalate it to our product teams. We let folks know, "Hey, here's what we're hearing." At AWS, for example, every month Francois and I get to meet and attend what we call the .NET Language Advisory Board, which is something Francois started and has handed off to me, so I'm leading it now. But basically, Francois started this as a way to bring together all the folks at AWS who work on .NET. Whether you're a solutions architect, whether you're an engineer working on a product, let's all get in the same room, and let's share what we're working on and what we're hearing from the community. And so, literally, when you give us feedback at events like this, we highlight it, we bring it back, and we share it with all the engineers, the leads, the PMs. Because that's really what we're here for. Like, yeah, sure, we showed up to give a talk. But it's not about us. It's about getting that feedback from fellow developers. Because outside of that, you know, most folks don't get to chat with developers day to day. So yeah, definitely come up to us. Let us know what you're working on. Let us know what we can do to make it better. Because we want to be your loudspeaker. We want to amplify that to our product teams, and this is definitely one way to do it. Now, if you have problems with your amazon.com shopping account, I can't help with that. But if it's AWS related and it's .NET related, we can help.
Martin Thwaites 19:03
Buying a book from Amazon on .NET?
Brandon Minnick 19:12
No, I just say anytime. When you know when folks ask you what do you do? Oh, yeah, I work at Amazon. And then immediately it's like, oh, here's the problem I had with my shopping cart the other day, like can you fix that? It's a big deal. But yeah.
Francois Bouteruche 19:30
If you say, "I work in IT," people say, "Can you fix my printer?" And now, when you say, "Hey, I work at Amazon," it's, "Can you fix my account, please?"
Martin Thwaites 19:41
The easy way to fix a printer is to buy a new one.
Brandon Minnick 19:47
Right, your printer's run out of ink? Buy a new one. There you go. Honestly cheaper.
Martin Thwaites 19:52
I suppose the feedback thing is, you know, like you were saying about the new Aspire AWS stuff: the reason it's in preview is that you want that feedback first, and you need people to give it. And the most effective way to do that... I mean, yeah, it's great if you can write an issue and give four pages of context and all of that kind of stuff. But the lowest-friction way of doing that is just to grab somebody on the booth who's a developer advocate, not one of the salespeople. You know, there are a lot of booths where there are just salespeople and marketing people, and it will get back if you tell them. But grabbing somebody who works for that company as a developer advocate is by far the easiest way. Just say, "I've got this problem." Don't come and complain and say, "It's your fault. You caused me a bad day last Tuesday." But, you know, come and talk to us and tell us what's wrong, tell us how we can make it better. Because our job is to make things better.
Brandon Minnick 20:50
I feel like I constantly have to remind my boss, and my boss's boss, and my boss's boss's boss about that. Just like in any job, right, your manager or skip-level manager or VP, whoever's in your vertical, will always come down and be like, "We need to do this now, and this is what we need to focus on." And I always prioritize the feedback we get from the community, and then also remind them that, no, we actually work for the community. Like, yes, I know I work for you, and you sign my check, and I'm gonna have to do most of what you ask. But this is always going to be my top priority. So no matter what, just...
Martin Thwaites 21:32
You've just identified what my problem has been over the many years of working for companies: I don't do what they ask. That might be the reason why I've been having problems over the years.
Brandon Minnick 21:43
Man of the people, Martin,
Martin Thwaites 21:46
A man of myself. Speaking of which, we're 22 minutes into a .NET podcast, and we've talked about .NET conferences and DevRel.
Brandon Minnick 22:04
But, good segue: we have mentioned observability, we've mentioned .NET Aspire, and even before the show, Martin, you were mentioning how there are a lot of folks who are first learning about observability through .NET Aspire. So let's go all the way back to the starting line. Martin, what is observability? What is tracing, for folks who maybe have never even heard of those words?
Martin Thwaites 22:29
Okay, so observability is not new. It's coming into the consciousness now a lot more, in fuller force than it ever was in the past, but it's not new. Observability is from a paper in the 1960s. It was popularized in the code world, the sort of production software world, arguably somewhere in the last 10 to 15 years. But really, it's just about how we understand what's going on inside systems from the outputs they give us. That's all observability is. It's a category of thing, if you like. It's just about how we can ask questions from the outside about what's going on, without attaching a debugger to production. Because, you know, the best way to debug production is to attach Visual Studio to your IIS instance and then just step through the requests that come through. Everybody does that once. They do it once, and then they realize that the entire production service stops every single thread while you step through, which means it's not really that easy to do. You're also stepping through, like, 1,000 threads, and it just doesn't work. But that's the easiest way to do it. What we're trying to do is get as close to that as we can, because from an observability angle, what we're trying to do is understand what's going on on the inside based on what the system gives us, and what it gives us is telemetry. The better the telemetry signals, the higher the fidelity of those telemetry signals, the easier it is for us to ask more specific questions and get to the real crux of the problem. The more questions we ask, the more information we have; the more information we have, the easier it is to fix a bug. The easier it is to fix a bug, the quicker it is to fix a bug, normally. And the quicker we can fix bugs, the more confidence we have with deploying code to production. Because if we know that we can fix bugs quickly, that we can find the problem and it's easy to fix those things, then we're going to feel more confident with deploying things.
It's this big sort of flywheel, if you like: if we have good observability, we can deploy things faster. There's a whole thing in the middle of there that is the reason why we can do it, but observability is the catalyst that allows us to do it. The thing that I think most people struggle with is that they think observability is about, "I've made sure I've got all my metrics, I've made sure I've got all my logs, I've made sure I've got all my traces, I've made sure I've got all my profiles, I've made sure I've got all my RUM data." And it really annoys me, this whole "there are three pillars of observability." There aren't. There are no pillars. There have been white papers by the OpenTelemetry community, by TAG Observability in the CNCF, which say no, there are no pillars, there are only signals. And there are lots of signals. Which signals you use, and how you use them, is how you achieve observability. But you don't need all of them. And it's about getting it into people's heads that there is more than just logs. Log aggregation is what we had. It doesn't mean that it's bad. It just means that it is not the best for every scenario. And there are a lot of scenarios where tracing data, or metrics data, or profile data would be a better fit for what you're looking for. The thing that annoys me at the moment, I suppose, is the, "Well, I've always had logs. I've always used logs. I've used logs for the last 20 years; they've always served me well." To which you turn around and say, "Wow. So 20 years ago, you were writing distributed systems using microservices? You were writing 1,500 different microservices using messaging? You were really advanced back then." And they go, "No, no, I wasn't doing that." And I'm like, "Oh, so the systems that you've built have changed, but the way that you debug them hasn't. Interesting."
Because, you know, maybe you can. I've seen people that have done lots of things with logs and have been able to debug their systems. And yeah, it's fine. The reality is, as your system starts to get distributed, as it starts to get bigger and more complex, I've seen the hoops that people jump through to make their logs still valuable when they could just add a trace instead. And this is the thing: they're contorting themselves in weird and wonderful ways to say, "Look, I can still use logs," with something that looks very much like a trace. You know, they'll take their log, and they'll add a duration onto each one of the log lines. And then they'll add the previous log line that was on there. And then they'll add loads of these extra attributes. Like, what you've just done is you've recreated tracing. If you just did, you know, ActivitySource.StartActivity instead, you wouldn't have to do all that. It's like, "Yeah, but I like using logs." No, you like using traces. You're just doing it using the ILogger. You don't need to.
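Editor's sketch of the point Martin makes here: a log line that carries a duration, a pointer to the previous line, and extra attributes is carrying what a span records for you. In .NET the call he names is ActivitySource.StartActivity; the Python below is only a stand-in to show the shape of the data, and `Span`, `start_span`, `FINISHED`, and `handle_checkout` are hypothetical names, not any library's API.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed unit of work, linked to its parent by ID."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


FINISHED: list = []   # completed spans, in the order they ended
_stack: list = []     # spans currently open on this thread


@contextmanager
def start_span(name, **attributes):
    """Open a span; the parent link and duration are recorded automatically."""
    parent = _stack[-1] if _stack else None
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        attributes=dict(attributes),
    )
    _stack.append(span)
    started = time.perf_counter()
    try:
        yield span
    finally:
        span.duration_ms = (time.perf_counter() - started) * 1000
        _stack.pop()
        FINISHED.append(span)


def handle_checkout():
    # No manual stopwatch and no "previous log line" field needed.
    with start_span("checkout", user="demo"):
        with start_span("db.query", table="orders"):
            time.sleep(0.005)
        with start_span("http.call", host="payments.example"):
            time.sleep(0.005)


handle_checkout()
for s in FINISHED:
    print(f"{s.name:10} parent={s.parent_id} took {s.duration_ms:.1f} ms")
```

Every span in the run shares one trace ID and each child names its parent, which is the structure people end up rebuilding by hand when they bolt durations and back-references onto log lines.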
Brandon Minnick 27:47
I mean, you don't have to come on my show and attack me so directly, Martin.
Martin Thwaites 27:54
I think that's fine. You know, "I'm in this podcast, and I don't like it." Look, we do what we can. What I mostly advocate for is for people to understand what's available to them, and then choose the right tool. It's an informed decision. You know, why are you using this over that? And the answer isn't "because I always did." This is a prevailing thing in the software industry. "Why are you doing that?" "Well, I've always done this." Okay. You know the meme, the Lego meme, which is a Lego cart with square wheels, and there's a guy behind with a circular wheel, and it says, "Are you too busy to improve?" It's like, "No, no, too busy, got too many things to do," like pulling this cart with square wheels along. And, you know, maybe that is right; maybe the terrain is best for square wheels. I don't know. Your context is not my context. But do you know what you can do with traces? Do you know what you can do with metrics? Do you know what you can do with profiles? And I think that is why I like Aspire. There are lots of reasons to hate Aspire; I've been very vocal about that. But the reason I like Aspire is that it's putting the dashboard front and center. And in that dashboard, it makes it very easy for you to see what tracing looks like. And I'm not even talking about distributed tracing. For a single application, where you've got just a database and maybe a third-party service that you're calling, being able to see what that trace looks like and go, "Oh, that'd be really interesting when I'm trying to debug this in production. That'd be really cool. Where do I put this? Oh, what's this tracing?" That idea of just being able to know that tracing exists. I mean, I've seen certain people, popular people within Microsoft, who maybe a year or two ago were touting, "Logging is the future. You don't need traces. I can do everything with logging."
And then a year ago they kind of changed their mind and went, "Tracing's amazing. We're seeing all of these issues that we never saw before." I'm like, "Was I not saying that last year?" Because, I mean, we were talking about this earlier, it's a visceral thing. It's something where, once you've seen it, you go, "I get it now. I get why that's more powerful." But, you know, Aspire has popularized it, and I love it for that. I love the realization on people's faces when they start seeing tracing data. We have a thing at Honeycomb called MTTWTF, which is a play on MTTR. It's when somebody starts looking at proper tracing data from fully instrumented systems, and they look at their tracing data, and they go, "What's that? Why is that thing calling that? The data is wrong. The data is wrong. No, no, tracing is not working. It's all bugged." And then a week later, they come back: "So, we fixed it." So yeah, I just love that Aspire is putting that in people's minds.
Brandon Minnick 31:11
Absolutely. And Martin, I don't know if you're able to pull up a .NET Aspire dashboard on the spot. Totally okay if you're not. But that's certainly my story. You know, I've been doing logging, I'll call it tracking, because I'm a mobile developer. So we would always do user tracking, user behavior, where, spoiler, every mobile app tracks everything you do in it. Some more so than others; some more invasively so than others. But yeah, for years, we would set it up so that anytime you clicked a button in my app, I would log that. And then if the app happened to crash, not only would I get the stack trace, the crash report, back, but also in that crash report would be all the events that led up to it. But when I say events, it was everything that I manually logged. So I could see, okay, Francois launched the app, he tapped this button to go to this page, he tapped this, he clicked Submit, and then the app crashed. And then I could try to reproduce that. Yeah, but observability really bakes that user journey in for developers. And it gives you so much more information about, I mean, specific libraries that are firing, times, you know, durations. Like if you make an API call. Martin, you were saying earlier, and I totally did this: I would just start a stopwatch, essentially, to track the round trip back and forth to my API, and just include that as a little note in the log, like, "We made this API request, and it took this long." But of course, you know, we're using a stopwatch. It's not that accurate. And then what happened on my back end? I don't know. I could see, maybe in the logs, where the event started and stopped. But I'd have to piece everything together to figure out, is it a back-end problem, a front-end problem, where's the bug? And
Martin Thwaites 33:15
You'd be looking at the timestamps between the
Brandon Minnick 33:17
timestamps, oh, it's, well, this
Martin Thwaites 33:19
event happened here. And then there were four seconds, and then this event happened. So the call that I made here, maybe that's what... But wait a minute, the stopwatch I put on this one only says three seconds. So what was happening in the other second? The thing that's really amusing about that is, when we had one server, when that one client was a desktop app or a mobile app, it was actually really easy to correlate by timestamp, because everything was based on the same clock. They'd all have the same time. But when you get into a distributed system, so you're using some kind of distributed system, you've got Lambdas firing, your ECS has got 10 different tasks in it, how do you know that all of those have the same timestamp? You know, when was the last time you made sure that the sync between all of those different things was using, to the millisecond, the same timestamp? I say this in my talks: "So how many people have synced their time servers between all of their instances?" And I can count on no hands the people who put their hands up, because you don't. And what we were doing was trying to use that correlation. Like you said, you're trying to step through these entries. And the reality is, on low-volume systems, on systems that aren't massively complex, and even on some mid-complexity and mid-volume systems, that's perfectly fine. You can get answers to those questions. It's when you hit large volume, and you start needing to think about how you sample logs and traces, when you have to think about how you reduce that volume. Because when you're hitting 120 million spans per second, that's too much, and you need to look at sampling to reduce the volume. You sample data. When you start needing to think about how you do that,
and you think, "I'm just gonna look at a correlation of logs and go through the timestamps and how that went through 1,500 different microservices," it's the stuff of nightmares.
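Editor's sketch of the clock problem described above: when two machines disagree about the time, sorting events by timestamp can reorder them, while following parent links (the way a trace does) does not depend on the clocks at all. The 250 ms skew, the service names, and the `Event` type are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    service: str
    name: str
    timestamp_ms: int        # as reported by that machine's own clock
    span_id: str
    parent_id: Optional[str]


# Hypothetical skew: the worker's clock runs 250 ms behind the API's clock.
WORKER_SKEW_MS = -250

events = [
    Event("api", "POST /orders", 1_000, "a1", None),
    Event("api", "publish message", 1_040, "a2", "a1"),
    Event("worker", "consume message", 1_100 + WORKER_SKEW_MS, "w1", "a2"),
    Event("worker", "write to database", 1_150 + WORKER_SKEW_MS, "w2", "w1"),
]


def order_by_timestamp(evts):
    """What log correlation by timestamp would reconstruct."""
    return [e.name for e in sorted(evts, key=lambda e: e.timestamp_ms)]


def order_by_causality(evts):
    """Walk parent -> child links, ignoring the clocks entirely."""
    children = {}
    for e in evts:
        children.setdefault(e.parent_id, []).append(e)
    ordered, queue = [], list(children.get(None, []))
    while queue:
        e = queue.pop(0)
        ordered.append(e.name)
        queue.extend(children.get(e.span_id, []))
    return ordered


print("by timestamp:", order_by_timestamp(events))
print("by causality:", order_by_causality(events))
```

With this skew, the timestamp ordering puts the worker's events before the request that caused them; the parent links recover the real sequence.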
Francois Bouteruche 35:33
I have a question for you, Martin. Because here, I'd like to make sure that our audience doesn't think that observability is only for tech issues. A few years ago, nine years ago, I was working in a lending company. And I think it was at that time that I started to realize that observability was helpful, because the questions we wanted to answer were not technical questions. They were questions we were asked by the business: "Okay, we would like to understand why..." So we had a web front end with an acquisition funnel for people who came to the website and started making a loan request. Then the request went to a back-end API, which took a first decision of granting the loan or not, and then it triggered a message. And this message was processed in the back end; sometimes there was a human in the loop. So they were asking business questions like, "Okay, we would like to understand, when someone makes a request on the front end, how many loan requests on the website end up being processed by a human, for example?" And we were not able to answer this question at that time, because we were like, "What? You want us to correlate a request on the website to a file processed by a human?" And we started to introduce some tracing to be able to say, "Hey, this event triggers this message in the queue, which triggers this." And these were business events. In the end, the question we wanted to answer was a business-oriented question. Are you okay with it? Is this...?
Martin Thwaites 37:49
So, technically, yes, because I define observability as the ability to answer any question that you might want to ask. The reason I'm a little bit cagey about it is that business operations is something very different, because you should always accept some kind of loss of data in your telemetry. Whether you have to sample your data, or whether you use metrics for that data, which inherently have a margin of inaccuracy, there is some level of inaccuracy. So using observability data to work out what the profit or loss is, you absolutely should not do that. That isn't something you should ever consider doing, because telemetry is not guaranteed. You should use your operational data for that, your OLTP. Sorry, there's OTLP, which is the OpenTelemetry Protocol, and OLTP, which is online transaction processing. The letters are just switched. So if you're doing stuff like that, where you want to report in some sort of regulatory way or a financial planning kind of way, no, that's not what observability is for. If you're wanting something to give you an idea of how production is working right now, 100%, observability. So if those things are working out how many things came from the front end to the back end over a particular time period, and did it drop, so that you can scale your servers, so that you can understand whether there was a business problem or an application problem at that time, it's all about guiding you to where these problems are, rather than the business going, right, well, our conversion rate has just dropped. I mean, we've been doing this on the front end for years, whether it's Google Analytics or any of those analytics platforms, where we do conversion rates: on ecommerce sites, from when I click the button to check out versus when I actually paid, how many people dropped out? And where did they drop out in that flow? Then, yes, that is a thing that you could say is observability.
I'm not precious about that definition. I'm precious about it being misused to mean three pillars, to mean just getting data, and about people using it as a stick to beat people with, to say, you're not doing these things. What I care more about is that people understand production. What questions they ask is irrelevant, really, as long as they're using it to understand what's going on in production, and using that as part of their DevOps lifecycle. Because DevOps is not a role. DevOps is a culture, an ecosystem and a process, where you use these things inside of that DevOps lifecycle to understand things, and then feed them back into your process to change things. If you're not looking for 100% accuracy, yeah, there's no reason why you couldn't. And use high-fidelity data like tracing data, which will give you all of that data, and make it really wide, not deep. So add tons of properties. You know, what was the temperature in the datacenter when the request happened? Maybe that will be important. So add all of that data in so you can start to understand the trends. The ones that ran slow, or the ones that dropped out, what did they have in common? Oh, they were all using Safari? Hmm, maybe there's a problem with Safari. I mean, there's tons of problems with Safari, but maybe that's the problem. Like I say, I am full of opinions. But that idea of how we get at that commonality: use the highest-fidelity data you can get away with. Sometimes you can't. Sometimes you have to sample using head sampling, sometimes you have to sample using tail sampling. Sometimes you want long-term analysis. We've seen some customers using things like BigQuery as a long-term analysis tool: they dump all of their traces into something like BigQuery so they can do long-term analysis over, say, 12 to 18 months. That's a very different use of telemetry, but it is something you could use it for.
What I care more about is the people who need it from a development perspective. The people who are engineers like me, the ones who write code that gets deployed into production: I want those people to be able to use that telemetry data to feel more confident about deploying stuff into production, and all the different ways that makes things better.
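For readers who have not met the two sampling styles Martin mentions, here is a minimal Python sketch of the difference. It is an illustration under stated assumptions, not how any specific vendor or SDK implements sampling: the function names, the `error` and `duration_ms` fields, and the one-second threshold are all hypothetical.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head sampling: decide up front, from the trace ID alone.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID reaches the same keep/drop decision.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def tail_sample(spans: list[dict], slow_ms: float = 1000.0) -> bool:
    """Tail sampling: decide after the trace has finished, using its contents.

    Simplification: the longest span stands in for the duration of the trace.
    """
    has_error = any(span["error"] for span in spans)
    is_slow = max(span["duration_ms"] for span in spans) >= slow_ms
    return has_error or is_slow


# A fast, healthy trace is dropped by this tail rule; a failing one is kept.
healthy = [{"error": False, "duration_ms": 20.0}]
failing = [{"error": False, "duration_ms": 20.0}, {"error": True, "duration_ms": 35.0}]
print(tail_sample(healthy), tail_sample(failing))
```

The trade-off the sketch is meant to show: head sampling is cheap because nothing has to be buffered, but it cannot know which traces will turn out to be interesting, whereas tail sampling can keep every error or slow request at the cost of holding the whole trace until it completes.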
Francois Bouteruche 42:41
You mentioned something before the show about the different views software developers and infrastructure engineers have of observability and telemetry. I'd love to have your thoughts on this.
Martin Thwaites 43:00
Yeah. So I mean, it's something that we've been talking about quite a lot recently, which is the idea that we've got development engineers, and really everybody who works in these organizations is an engineer: infrastructure engineer, product engineer, developer, tester, they're all engineers. They're all doing some part of the engineering discipline. So all of these different engineers have different uses. What I think people are trying to do as an industry at the moment is shoehorn every one of those use cases into one tool and one signal type. So everybody's using metrics because, well, we need to use metrics. And that comes from the person who chooses the tool. So if the person who chooses the tool is in infrastructure, you'll be a metrics-heavy and dashboard-heavy organization. If the people who choose the tool are the developers who write the application code, the product engineers, then you'll get something that's more akin to a high-fidelity analysis tool, something that's using tracing data or logging data to get that wide context. And I don't agree with either of those, because they're two different use cases. Infrastructure engineers need to be able to know compute capacity. They need to know, are we hitting our limit on the number of pods that we can deploy right now? Are the nodes hitting 90% CPU, so we need more headroom so that we can scale? They need metrics data for that. They need fairly low-cardinality and low-dimensionality data, so that's fewer fields, and the values in those fields have fewer possible values, to be able to make those decisions. Whereas the developers, they care about the end user. They care that the end user's request is being serviced. The developers don't care if the pods are running at 99% CPU, because what they care about is that their users are actually getting their requests served in 20 milliseconds.
And for the ones that aren't being served in 20 milliseconds, that are being served in, you know, 20 seconds, they want to know what's the commonality of those ones. They don't want a metrics dashboard that says, yep, this route, the checkout, it's gone up, we've got 50 errors when we didn't used to. They don't. They want to know about those 50 errors: what's common to those versus the other 5000 requests that succeeded? They want to be able to dig into those requests and look at the really high-fidelity data that tells them what was happening. It's what Dan was talking about: you want to know, did they click on this button then that button, or that button then this button? They want to know what that journey was. They want to know how the request flowed through the 1500 microservices, and which ones it didn't touch. They're two different use cases, and trying to shoehorn those into one tool is part of the problem that we have, because one person is going to lose out. It might be the infrastructure engineers who lose out, because they need this metrics data, they need lots of different types of data in order to do their job. Because platform is hard. Running these platform systems at scale is hard, and developers don't appreciate that. But the opposite is also true. Infrastructure engineers don't understand how hard it is to debug a production system when you've got thousands of requests going through, you've got deployments going out day after day after day, and the system is constantly changing. These are unknown unknowns, because, let's just say, users are weird. They do weird and wonderful things, and we need that high-fidelity data. These two people are not enemies. Together, they service production. Together, these two people are the ones that service end users. They need different types of data, they need different tools, and they need to use them in different ways. And I'm starting to see a bit of a trend of change.
Where, unfortunately, everything comes down to people. You know, people and communication are kind of the superpower of engineering teams. Being able to talk to somebody and say, well, what is your use case? Can I understand and empathize with your use case, so I can help you do your job? The organizations that get that right, those are the high-performing teams, the ones that do really, really well. Consistently, the superpower of developers is talking to people. And I don't mean actually face to face. I don't mean that you need to get in a room together, or go to an in-person conference together, or anything like that. I mean just talking to them. It might be a Slack conversation. It's being open to what they need and understanding their needs. Whichever role you're in, tester, infrastructure engineer, cloud engineer, platform engineer, whatever it is, that is the superpower.
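Martin's earlier question, what is common to the 50 failing requests versus the 5000 that succeeded, can be sketched in a few lines of Python. This is only an illustration of the idea on wide, attribute-rich events: the `common_attributes` helper, the attribute names and the sample requests are all hypothetical, and real analysis tools do this at far larger scale.

```python
from collections import Counter


def common_attributes(failed, succeeded, min_share=0.8):
    """Find attribute=value pairs shared by most failures but rarer in successes."""
    failed_counts = Counter((k, v) for req in failed for k, v in req.items())
    ok_counts = Counter((k, v) for req in succeeded for k, v in req.items())
    results = []
    for (key, value), count in failed_counts.items():
        failed_share = count / len(failed)
        ok_share = ok_counts[(key, value)] / len(succeeded) if succeeded else 0.0
        if failed_share >= min_share and failed_share > ok_share:
            results.append((key, value, failed_share, ok_share))
    # Biggest gap between failures and successes first.
    return sorted(results, key=lambda r: r[2] - r[3], reverse=True)


# Made-up wide events: every request carries several attributes.
failed = [
    {"browser": "Safari", "route": "/checkout", "region": "eu-west-1"},
    {"browser": "Safari", "route": "/checkout", "region": "us-east-1"},
    {"browser": "Safari", "route": "/checkout", "region": "eu-west-1"},
]
succeeded = [
    {"browser": "Chrome", "route": "/checkout", "region": "eu-west-1"},
    {"browser": "Firefox", "route": "/checkout", "region": "us-east-1"},
    {"browser": "Safari", "route": "/checkout", "region": "eu-west-1"},
    {"browser": "Chrome", "route": "/checkout", "region": "us-east-1"},
]

suspects = common_attributes(failed, succeeded)
print(suspects)
```

In this toy data the route is identical for every request, so it explains nothing, while the browser stands out. A metrics dashboard with only a per-route error count could not surface that, which is the point about needing wide, high-fidelity data.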
Brandon Minnick 48:17
So I'm curious, Martin. I agree, you know, back end, front end, engineering, ops, we have different goals, we have different responsibilities, albeit in the perfect world we all communicate with each other. But you mentioned that we're trying to shoehorn multiple roles, multiple responsibilities into one tool, and that's not the best way to do it, which I don't disagree with. But it's got me curious and thinking: what do we do? What should we do? Should we have two different tools? Should we have different dashboards? What is a solution for that?
Martin Thwaites 48:57
There isn't a solution. There isn't one solution. And this is my whole point: you need to work out what you need, you need to work out your goals. When we talk to our customers, and we get this feedback all the time, we work more like consultants than like people trying to sell a tool. Because what we're trying to do is understand what's specific about that use case, and how does our tool solve it? And how does our tool work with the other tools that they have? Because it's not about, here's a tool, let's go sell a tool. It's not, let's buy the tool. It's a case of, well, how am I going to use that, and how is that going to affect the rest of the people in the organization, and how can they use it? Can they use it? Should they use it? Unfortunately, it's not just, let's buy two tools. Sometimes that is the right option. Sometimes that's absolutely not the right option. And by tools, I'm not talking about commercial tools. I'm talking about the way that we use them. It might be different databases to store the data. It might be that we need a metrics database that can do some metrics dashboards, we need a tracing database that can do this, we need a logging database that can do that. Now, it might be that those all sit under the one tool, if you like, maybe it's one brand or whatever. What I'm saying is that you need to know which of those signals you're using and how you're going to use it. Telling people that they can't log data because, well, it's costing us too much, isn't the right answer. It's a case of understanding why that logging data is useful, to allow them to justify the cost. Because sometimes, yeah, it is going to cost you 10,000 pounds a month, whether it's hosting a database on AWS, or a ClickHouse database to store all your tracing data, and maybe that 10,000 is actually worth it.
But maybe somebody else needs a different tool, and you're saying, well, we've already got 10,000 pounds a month going into this database that we're hosting ourselves, just use that. The answer is to talk to them about why they can't use that, and then work out whether there's a different database, a different data store that they need, with different visualization tooling on top of it. It's like trying to say to somebody, well, we already have Kubernetes, so you must run Kubernetes locally, you can't run Docker. Well, maybe Docker is the best one for me to run locally. I don't know. It's about asking them the questions, talking to people.
Brandon Minnick 51:35
Dang, alright, good answer.
Francois Bouteruche 51:38
I just want to have a bit of fun here. Sorry, but for me, observability has reached a certain maturity level. I can compare this to agility, because the essence of agile is like observability. The essence of what you're saying is, it depends. You have to talk to each other and define solutions that match your needs. That's the essence. And in reality, what we see nowadays about observability is some folks saying, hey, there are a few pillars you need to follow by the book, like we saw before in the agility space, where it was, these are the rules of agility, and if you don't do this, you will not be agile. So I think observability has reached that maturity level now, because we have those folks saying, these are the rules.
Martin Thwaites 52:28
Oh, yeah. I mean, this time it's the tool, and it's not the rules. It's, well, this is the tool we have, this is the tool you use. And you know, it's the whole "go buy me some DevOps" thing. Can you go buy me, like, six DevOps, please? It's become "go buy me observability". I'd like four observability, please. Like, no, that is not what you're supposed to be doing. What you're supposed to be doing is trying to understand that the goal of observability is: how do I understand what's going on in production for all of those roles? How are they able to get the answers to the questions that they need? That is the goal of observability. The data stores, the products that companies use, those are all just a conduit. That's an implementation detail. Because, and I say this in my talks, if you can get all of the answers to all of the questions that you want to ask with purely your logging data, you have observability. Well done to you. Don't go looking for something else, because don't change it if it works. If you're struggling to answer some questions, write those questions down. Go to your teams and ask them what questions they would like to know the answer to. Forget about the tools that they're currently using. What questions would you like to ask? Pie in the sky, blue-sky thinking, whatever term you want to put on it. What questions would you like to ask? Start from there. And then work out why the data stores and the tools that you've got aren't allowing them to do that. Because I would say nine times out of 10, those tools are allowing them to do it. It might be clunky. It might be hard to do. It might be a knowledge gap. But nine times out of 10, the tools will be able to do that. Are they the best tools? I don't know. Maybe they are. If you're using Honeycomb, I can guarantee they are. But they might not be the right tools.
If you start from the questions that people want to ask, and then work back to why your solutions don't do it, and then work out which solutions do do the thing that you want, you'll be in a much better place. Because it's not just "go buy me observability". It's, how can I understand production a lot better than I do right now? And how can I feel more confident about deploying things? One of our engineers at Honeycomb, Jamie Danielson, said observability is about confidence, and I love that statement. It's short, it's succinct, and it's exactly what observability is about: confidence in deploying things, confidence in understanding things. And systems have changed. Do you have the same confidence now, when you've got a distributed system, as you had with the same tools when you had a non-distributed system? I don't think a lot of people can, hand on heart, say yes to that.
Brandon Minnick 55:28
No, probably not. And to be honest, as an engineer myself, who has a couple of apps in the App Store that I've got to maintain, you know, you don't know what you don't know. I just kind of made things work, and as the need arose, I added timings to my logging to see how long my API requests were taking. But what I didn't know at the time was that there was this step change in the industry with observability, tracing and metrics that I totally missed. And now there's new tooling that does a lot of this for me. So I'm a convert. I'm gonna be using tracing for, well, until the next best thing comes out. But Martin, we've only got a couple of minutes left, and we get cut off at the top of the hour. So again, thank you so much for coming on the show. For anybody out there who wants to continue the conversation and keep talking about observability, where can they find you, and where are you headed next, where they can catch you in person to give you all this awesome feedback?
Martin Thwaites 56:37
Okay, so find me on Twitter or on LinkedIn. I will chat to anybody. Tag me, ask me questions, that's fine. If you Google my name and "office hours", I do office hours, so people can book some time in and chat about anything to do with Honeycomb, or observability, or OpenTelemetry. All of that's fine. Next week, I am in Amsterdam for GOTO Amsterdam, and then Oslo for NDC Oslo. And I'm also doing an OpenTelemetry hands-on course for NDC Porto in October, so the tickets are online for that now. But yeah, I will teach you everything there is to know about getting good data out of production, with regards to, well, dotnet, but also any other language as well.
Brandon Minnick 57:27
Absolutely incredible. And that's @MartinDotNet, so M-A-R-T-I-N-D-O-T-N-E-T, with dotnet spelled out. And Martin, thank you so much. You dropped the dot. Or rather, dotted the dot. We do a "dot the dot" campaign, where we use D-O-T-N-E-T instead of period-N-E-T, dotnet, going forward. But Martin, again, thank you so much for coming back on the show, for being a two-time guest. We love having you here so much that we'll invite you back anytime, of course. And thank you all for joining us. Thanks so much for joining us on the dotnet on AWS Show today. We stream live every two weeks, every other Monday at 8am Pacific. And don't forget, we also have an audio podcast, so you can join us in the car or on the go: listen to the dotnet on AWS Show on any of your favorite podcasting platforms. And we'll see you in two weeks.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.