OwlTail

Ben Sigelman

16 Podcast Episodes

Latest 18 Mar 2023 | Updated Daily

Distributed Tracing Infrastructure with Ben Sigelman and Alex Kehlenbeck

Cloud Engineering – Software Engineering Daily

Guests: Ben Sigelman and Alex Kehlenbeck. Observability consists of metrics, logs, and traces. Lightstep is a company that builds distributed tracing infrastructure, which requires them to store and serve high volumes of trace data. There are numerous architecture challenges that come with managing this data. Ben Sigelman and Alex Kehlenbeck join the show to discuss the implementation of Lightstep.

45mins

19 Apr 2022

Distributed Tracing Infrastructure with Ben Sigelman and Alex Kehlenbeck

Podcast – Software Engineering Daily

52mins

19 Apr 2022

Similar People

Distributed Tracing Infrastructure with Ben Sigelman and Alex Kehlenbeck

Software Engineering Daily

45mins

19 Apr 2022

Observability improving speed and reliability with Ben Sigelman

The Confident Commit

Rob sits down with Lightstep CEO Ben Sigelman to discuss observability and how it connects with delivering change with confidence. Get answers to questions like: What is observability vs. monitoring, tracing, and logging? Do they all have their own jobs, or is there overlap? Where is the split between validating before production and validating in production? Are we making software more complex than we need to? Is that complexity driving this push towards observability? Tune in today, and if there's something you want us to discuss on a future episode, reach out to us on Twitter at @circleci!

42mins

3 Sep 2021

Most Popular

Software at Scale 15 - Ben Sigelman: CEO, Lightstep

Software at Scale

Ben Sigelman is the CEO and co-founder of Lightstep, a DevOps observability platform. He was the co-creator of Dapper, Google’s distributed tracing system, and Monarch, an in-memory time-series database for metrics. He’s also the co-creator of the OpenTelemetry and OpenTracing standards. We spent this episode discussing Dapper and Monarch: their design, rollout, and lessons learned in practice.
Transcript
[Intro] [00:00]: Welcome to Software At Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host Utsav Shah and thank you for listening.
Utsav Shah: Hey Ben, welcome to another episode of the Software At Scale Podcast. Could you tell our listeners about your story? There's so much in your background that is interesting to me, right from starting off at Google to creating Lightstep.
Ben: Sure. Thanks for having me, I'm excited to be here. I don't know whether my background is interesting or not; to me it's kind of boring. But yeah, I graduated from college right in the thick of the dot-com bust, the 2003 era, and I was very fortunate to get an offer to work at Google at the time. When I went over there, they put me on some stuff in the ad system that was incredibly boring, to be honest with you. Of course, ads make a lot of money at Google, but this particular part of the ad system wasn't making any money, so it was kind of boring, not very lucrative for Google, and I didn't like it very much. The way I got into Dapper and distributed tracing was actually incredibly arbitrary, but it's a funny story. They had this one-time event where you could opt in to a program (I don't know what they called it) where they would take everyone who opted in.
They looked at a bunch of different dimensions, like how long you'd been out of school, what office you worked in, what languages you worked in, where you were in the org chart, that kind of stuff. I think they had 10 dimensions. Then they found the person who had also opted into the program who was literally the furthest from you in this 10-dimensional space, and set up a half-hour meeting with no agenda, and that was it. So I was working on this stuff in ads that was, as I was saying, totally pointless. And they paired me up with this woman named Sharon Perl, who was a very distinguished researcher who had come over from Digital Equipment’s research lab when it kind of fizzled out after the merger in the late nineties. She and some of the other old guard at Google were doing all the really cool systems stuff. She asked me what I was doing, and I said, "I don't want to talk about it; what are you doing?" Then she went through this list of really interesting systems projects. One of them was kind of like a predecessor to S3; it was a blob store. There was some NLP thing she was working on. And in this list was a prototype of a distributed tracing system called Dapper that had never really seen the light of day. It was just kind of an idea, and she described it to me. I thought it sounded incredibly useful and really fun. My manager at the time had 150 direct reports. Direct reports! I don't know the exact number, but it was more than a hundred. He had no idea what I was doing, obviously; how could he? So I just started working on it, basically.
Utsav Shah: Did you switch teams or anything?
Ben: Well, Google famously had this 20% program, so it was kind of that type of thing. But I really liked it and I thought it was quite valuable, actually. Then I moved to New York for personal reasons and just started working on Dapper full-time. My manager there, Yorick, also had like a hundred direct reports.
So he also had no idea what I was doing, and I got it to the point where it was in production and actually solving problems pretty quickly. I can get into why it was possible to do that, if you want. I got hooked on that stuff and I really haven't looked back. That was early 2005, and now, sixteen years later, I'm still basically working in that same overall space: how do you observe complex distributed systems, and what kind of improvements can you make to the software engineering process if you are able to observe them effectively? After working on Dapper for a while, I just wanted to do something different. So I went and did a couple of systems projects that really didn't work and are not well known because they were failures. I'm happy to talk about those too, if you want. But I eventually found my way over to Monarch, which I started in order to create a multitenant, high-availability time series database, basically. In terms of the open source world, probably the closest parallel would be M3 or something like that. I ended up working on that for about three or four years, and then left Google and started a social media company that was, as a product, a complete failure. About a year into it, I realized that it was never going to work and abandoned the product, but realized I enjoyed being an entrepreneur. I would even say "pivot" is the wrong word, because pivot implies that you keep one foot in the same place. I just started playing a different sport, but with the same investors, and that's actually Lightstep. Lightstep was founded as a social media company in 2013, and a year and a half in, I completely changed what I was doing and added some co-founders at that point. And here we are six years after that, and I'm still working on building stuff and really enjoying it.
Utsav Shah: Yeah. I think that is a super interesting background.
[05:00] The first question on that is: was that the era when Larry Page, or whoever, decided that there's no need for managers, and that's why they just fired all of them?
Ben: I don't think they so much fired them; they would just hire a lot of engineers and not hire managers to go along with them.
Utsav Shah: Yeah.
Ben: I think there was this idea that management was bad in some capacity, and I understand where they were coming from. I definitely don't agree. I think good management is actually one of the most incredibly supportive things you can possibly have in an organization. But I think a lot of the people who had come to believe that were just coming from really bad management. Certainly bad management is worse than no management, but good management is better than all of it.
Utsav Shah: Right.
Ben: The other thing that was interesting about Google at the time was that they were growing so quickly that if you didn't like what you were doing, you only had to wait a couple of months and some new person would take over. So that papers over a lot of issues. I think there was also a belief that Google had solved the management problem through software or something like that: a belief that by writing internal software systems to do a lot of the blocking and tackling that managers might do (and they certainly had tech leads, who serve a managerial purpose for just dividing work up), they had solved that issue. But as with everything, there's a law of large numbers: even though Google has been very successful, at some point it had to grow slower. When that started to happen, the need for managers became much more obvious, and sure enough, at this point, I don't know what the ratio is, but I'm sure it's not 150 to 1 anymore. They realized that they needed to correct that.
It was liberating in the sense that you could do whatever you wanted, but I think it was pretty disorganized and not very efficient.
Utsav Shah: Yeah. Could you talk about the architecture? You said you worked on some part of the ad system that wasn't particularly interesting. From my understanding, and I could be completely wrong about this, there was one monolith, the Google Web Server (GWS), and not that many services around it. Is that roughly accurate? Because I'm also thinking, why did Dapper make sense if it's just one large server? But I guess that's clearly wrong.
Ben: Yeah, I don't think that's correct. Certainly if you go back far enough, to 1999 or something, it was probably true, but by the time I showed up, we didn't call them microservices, but they absolutely were. And I would say that the best, probably the only good, reason to adopt microservices goes back to management. It's difficult to get more than 15 or 20 engineers to work efficiently on anything in a single code base that's deployed as a single unit; it's just difficult to do that from an engineering standpoint. So microservices serve a purpose from a software development management standpoint, where you can create a unit of deployment. But microservices at Google were much more about horizontal scaling, and that was a necessary thing. They had throughput that required that kind of horizontal scaling. So they definitely had them. I remember when we turned Dapper on in production; we'd never really been able to visualize it before. A cache miss in Google web search would certainly hit GWS, which you're referring to, the web server at the top of the stack. But between the front-end load balancers and the final thing that would actually look through some index on disk, it was 10 or 20 levels of depth to get down to the bottom of the stack. So yeah, it was definitely quite distributed, and there was also huge fan-out.
Oftentimes a parent would send parallelized requests to 30 or 50 or, in some cases, a hundred children that held different parts of the index. So tail latency issues were really scary, and stuff like that. So yeah, it was quite distributed, especially on the web search side, early on. There were other parts of the system that were different: the front end of the ad system that merchants would actually use was basically a database and a Java web server. So it wasn't everything, but the high-throughput, low-latency stuff was pretty distributed from early on.
Utsav Shah: Interesting. And just out of curiosity, did Google prioritize consistency or availability? Because of that large fan-out, I'm assuming availability, and it just dropped data coming from a few shards that were too slow or something?
Ben: Yeah. I don't think there's one answer to that, but Jeff Dean did a talk at Berkeley in, like, 2010 or 2012 (the slides are online). It was a really good talk where he discussed a lot of the techniques that they would use, depending on the situation, to deal with tail latency. I understand you're referring to the CAP theorem there, but another trade-off that I think we had to wrestle with a lot was basically just cost or efficiency versus latency. [10:00] We would often end up with something that was more expensive in order to put a tighter bound on latencies. So if you had three copies of some service, you'd send the request to two of them in parallel and just take the first one that came back, in order to manage high-latency outliers and things like that. But I don't think there's a single answer from an availability versus consistency standpoint. It really depends on, I guess, the business requirements.
Utsav Shah: Yeah. I've seen the Tail at Scale stuff; that might be what you're referring to.
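As a rough illustration of the trick Ben describes (send the same request to two replicas in parallel and take whichever answers first), here is a minimal Python sketch. The `replica` coroutine, its names, and its delays are hypothetical stand-ins for real RPCs; this is not how Google implemented it, just one way to express the idea with asyncio.

```python
import asyncio


async def replica(name: str, delay_s: float) -> str:
    """Hypothetical stand-in for an RPC to one replica; delay_s simulates its latency."""
    await asyncio.sleep(delay_s)
    return name


async def hedged_call(calls):
    """Run all calls in parallel, return the first result, and cancel the rest."""
    tasks = [asyncio.ensure_future(c) for c in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Wait for the cancelled tasks to finish so nothing is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()


def demo() -> str:
    # One replica is a slow outlier; the hedged call should not wait for it.
    return asyncio.run(hedged_call([replica("slow", 0.2), replica("fast", 0.01)]))


if __name__ == "__main__":
    print(demo())
```

The cost side of the trade-off is visible here too: every hedged call does roughly twice the work in order to cut off the slow outliers.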
That's interesting. You said you turned on Dapper in 2005, so what was the immediate engineering impact? Engineers at Google were your customers, so what was their reaction, and did you see some immediate changes based on releasing it and showing it to people?
Ben: That's a great question. One of the most interesting things about Dapper is that when we first got it out there in the world (well, at Google), it was definitely not something where everyone was like, "Oh my God, this is incredible," and telling their office mates about it. It was nothing like that. In fact, I would basically go and find a tech lead for, you name it, Gmail, web search, anything that was operating at scale and had a lot of services, and I would kind of beg them to meet me. Then I would show up in their office. The UI admittedly was terrible, but it was still good enough to be useful. I would show them some traces and they would always be like, "Wow, this is actually really interesting, I didn't know this." They would often explore it with me, and we'd find something that was troublesome and novel to them. So they would get something that was interesting to them, and sometimes they would go in and fix that issue. But we had our own dashboards to track activity, and it really didn't get a lot of use. It did generate a lot of value, in the sense that we were able to highlight the outliers, understand where that latency was coming from, and make some substantial optimizations. But it was very much a special-purpose tool used by experts doing performance analysis in the steady state. That was really what it was primarily used for initially, and there are some technical reasons for why that was the case. But if you think of it from a product standpoint, the issue is that we weren't integrated into the tools that people were already using.
And that is still the number one problem with the sophisticated side of the observability spectrum: the insights that are generated are genuinely useful, and even self-explanatory when you put them in front of someone, but people simply are not going to find them themselves unless they're integrated into the tools they're already using. I still think the number one barrier to value in observability is just that it's not integrated into the daily-habit tools, whatever those may be. At some point we did make a change. Josh MacDonald, who actually still works with me at Lightstep and who was working on Dapper in 2005 as well, eventually made a change to Stubby, which is essentially the internal name for gRPC, and particularly to this library called requestz, which was used to look at active requests going through the process, to basically cordon off the requests that had the Dapper trace flag set to true. You could go to any process where people were already using this requestz thing all the time to see requests going through their service. It kept a cache of slow requests, from the last hour or whatever, at different latencies. And we had a little table of requests that had Dapper traces, where you could click on the link and go directly to the trace. Then it was part of something people were already using. I don't remember exactly the number of people that used Dapper, but it must've been like a 20x improvement when we released that; it was a huge change. And the lesson is that Dapper didn't get any more powerful when we did that. It just got a lot easier to access. So it's all about being in the context of the workflow. There were some people, like Jonathan [surname unclear], an incredibly smart person, much smarter than I am, that's for sure.
He ended up really pressing us to build a kind of bulk data API to run MapReduces and things like that over the Dapper data. He was in charge of something called TeraGoogle, which was actually the largest part of Google's index, but also the least frequently accessed. It's a very complicated system. I won't go into the way it worked, just because we don't have time, and I don't know if I'm allowed to talk about it, but suffice it to say it was really complicated. He did some fascinating work to understand the critical path of the system, and made some really substantial improvements as a result of it. So there were people like that who made these big improvements. But there's a big difference between delivering quote-unquote business value to Google, usually in the form of latency reductions, and having a lot of daily activity. Daily activity really didn't come until we integrated into these everyday tools. [15:00] I think that was one of the most important lessons from the Dapper stuff: cool technology really is not enough to get retention from engineers who are busy doing other things.
Utsav Shah: That's super interesting. I think I heard the term TeraGoogle maybe five years ago when I interned there, and I think I finally learned what it meant. I'm sure I forgot about it in like three months. That's interesting. And requestz seems like a front end for visualizing a request's context, in a sense. Is that an accurate way of phrasing it? And why did engineers use requestz? That's something I'm curious about now.
Ben: Well, for different things. What was particularly nice about requestz (it was also part of rpcz, but requestz is the part that we're really talking about) was what it allowed you to do. I guess it was basically just a table.
That's all that you saw. The table would have a row for every RPC method that you had in your Stubby service, your gRPC service, so each row was a different method. Okay, fine, that's simple enough. Then the columns were basically different latency buckets. So you'd have requests that took less than 10 microseconds, less than a hundred microseconds, less than one millisecond, less than 10, et cetera, and it would go all the way up to, I don't know, things that took longer than 10 seconds. And you could examine a very detailed kind of micro-log of what took place during that request. You could think of it as just a little snippet of logs that pertained to that request and only that request. And then, as I was saying, if the request was Dapper traced, you could link off to the distributed version of it and see the full context. The thing that was particularly powerful, though, is that it had one special column for requests that were still in flight and had been taking a really long time. So you could have a request that was stuck while you were trying to debug it in a live incident, and you could inspect the logs just for requests that were stuck. Let's say, as was often the case, there was a mutex lock that was under contention and requests were stuck waiting on it. You could go and see that exact thing had happened. And there was a lot of really pretty clever stuff they did in the implementation to defer the evaluation of any of the strings in the logs until someone was looking at them. So you could afford to be extremely verbose with the logging on this thing, and the logs were only evaluated when an actual human being was sitting there, hitting refresh in the browser, looking at it. That's not generally how most logging frameworks work.
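To make the two ideas in this passage concrete (latency buckets per RPC method, and log strings that are only formatted when a human looks at them), here is a small Python sketch. The class and function names are invented for illustration and the bucket bounds are only the ones Ben mentions; the real internal implementation is not public, so treat this as an assumption-laden toy.

```python
import bisect

# Upper bounds of the latency buckets in microseconds: 10us, 100us, 1ms, ... 10s.
# Anything at or above the last bound falls into a final "longer than 10s" bucket.
BUCKET_BOUNDS_US = [10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]


def bucket_index(latency_us: float) -> int:
    """Return the column a request lands in: 0 means < 10us, 7 means >= 10s."""
    return bisect.bisect_right(BUCKET_BOUNDS_US, latency_us)


class MicroLog:
    """Per-request log that stores (format, args) and formats only on render()."""

    def __init__(self) -> None:
        self._entries = []      # unformatted (fmt, args) pairs
        self.format_calls = 0   # counts how many strings were actually built

    def log(self, fmt: str, *args) -> None:
        # Cheap on the hot path: no string formatting happens here.
        self._entries.append((fmt, args))

    def render(self) -> list:
        # Called only when someone opens the debug page for this request.
        lines = []
        for fmt, args in self._entries:
            self.format_calls += 1
            lines.append(fmt % args)
        return lines


if __name__ == "__main__":
    log = MicroLog()
    log.log("waiting on lock %s held by txn %d", "row-17", 42)
    print(bucket_index(250), log.format_calls, log.render())
```

Holding on to the unformatted arguments is what makes the logging cheap, and it is also why such a scheme needs care about the lifetime of whatever those arguments reference.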
So there was a lot of pretty complicated reference counting and things like that. But it also meant that if you were having an issue, you could figure out that you were blocked on this Bigtable tablet server, and that it was this particular mutex lock that was contended, and that it was being contended by these other transactions, which you could look at to figure out where the contention came from. To be able to pivot like that in real time was pretty powerful, and having that linked into Dapper to understand the context was also pretty powerful. I don't know if that makes sense, but it was one of those tools that I really haven't seen elsewhere. I think OpenCensus and zPages have some of that functionality, but it doesn't really make sense unless everything is using that little micro-logging framework. And I just haven't seen that outside of Google or Google open source, so I still miss that. It was a really useful piece of technology.
Utsav Shah: No, I think that's amazing given this was so long ago. It makes me think, taking a step back: maybe five or ten years ago, internal tools at Google were probably better than the development experience externally, right? You had so much stuff for free. You talk about Blaze and all of these different tools. But things have evolved a lot recently. It seems like there are so many startups coming up with different things, and even Datadog and the like are fairly mature now; you get a lot of stuff for free from them. Would you say that the development environment externally is now probably better than anything that Google can offer, in the sense that you get the holistic experience?
Or do you think there are still things, like requestz as you said, some functionality that's just missing because of the lack of consistency, where you have to integrate a million different things and maintain this Rube Goldberg set of integrations to get a similar development experience?
Ben: Yeah, that's a good question. It's really hard to compare the inside and outside of Google experience. It's not that Google was all better; a lot of stuff at Google was actually really annoying. I was just talking to someone about this yesterday, but the trouble with Google was that everything had to operate at Google scale. There's this idea, [20:00] which is totally false in my mind, that things that operate at higher scale are better. They're usually not. There's a natural trade-off between the scale that something can operate at and the feature set. So a lot of the stuff we had at Google was actually pretty feature-poor compared to what you can use right now in open source. The only thing it really had going for it is that it scaled incredibly well. The exceptions were mostly areas where having a monorepo, with almost no inconsistencies in terms of the way things are built, gives you some leverage. And of course there are a lot of examples of that. requestz is a fine example, and Dapper is actually a fine example, too. The instrumentation for Dapper, to get that thing most of the way there, was a couple thousand lines of code for all of Google. Whereas, just look at the scope of OpenTelemetry or something to get a sense of how much effort is going to be required to get that sort of thing to happen in the broader ecosystem. So they had this lever around consistency. It's not that a lot of the tooling at Google was bad; it was very scalable, but it didn't have a lot of the features that we would expect from tooling outside.
I'd also say that in my seven years at Google working on infrastructure and observability, I never had a designer on staff and I barely ever had a PM, and it really showed. Having now worked with really talented designers, not just ones who can make UIs look nice, but ones who really think about design with a capital D, it's just a completely different ball game in terms of how discoverable the value and the feature set are. So the Google technology often lacked that sort of polish, and I think there are many different vendors out in the world right now that have built things that are much easier for an inexperienced user to consume. Even if the technology is equivalent, I think the user experience is not. So that's another area where I think what we had at Google is actually, unfortunately, a pretty far cry from what you can get now just by buying SaaS.
Utsav Shah: Yeah. We've thought about how, if people ran infrastructure teams like product teams, as if you were trying to sell each piece of your infrastructure to potential buyers, you would create a better product, because you'd have to think about user experience and about customers actually getting value. Trying to make that happen is not the easiest thing in the world, but that makes sense. When you released Dapper, did you have any sampling at all? I'm assuming yes, because it wouldn't have worked otherwise.
Ben: Yeah. Sampling is an interesting topic. There are a lot of places you can perform sampling in a tracing system, and Dapper performed it in almost every one of them. Dapper actually had pretty aggressive sampling. We started with one in 1,024, so that was the base sampling rate in Dapper, and then we realized that even after that cut of roughly one in a thousand, centralizing the data was expensive.
Initially we wrote the data just to local disk, where we wrote log files, and we deployed a daemon that ran on every host at Google. By the way, if you ever want to jump through some hoops, try to deploy a new piece of software that runs as root on every machine; I can tell you that was a real nightmare from a process standpoint. But anyway, the daemon would sit there, scan the log files, and basically do a binary search anytime someone was looking for a trace. That's the thing that I started with, and it was honestly a terrible way to build that system, really bad. So eventually we moved to a model where we would try to centralize that data somewhere, for all the reasons you might imagine. But it turned out that the network costs of centralizing that data, even after the one-in-a-thousand cut, were substantial, and the storage costs were also really substantial. So we did another one in 10 on top of the one in a thousand. So we were doing one-in-10,000 sampling, randomly, before we got to the central store that was used for things like MapReduces and stuff like that. And that pretty much means you can never use Dapper for all sorts of applications. For web search it was fine, because (I don't remember the number, but) it was on the order of a million queries per second, so fine. But take something like Google Checkout, where people are actually buying stuff. It's of course intrinsically a much lower-throughput service, but the transactions are actually more valuable, so you're getting cut both ways. We didn't have a dynamic sampling mechanism for that when I was working there. People could adjust the sampling rates themselves, but they usually didn't. So the technology was really not that useful except for the high-throughput services where that sampling wasn't a complete deal breaker.
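The arithmetic behind this two-stage sampling is worth spelling out. The sketch below multiplies the two rates Ben quotes (1 in 1,024 at the source, then 1 in 10 before the central store) and estimates how many traces per day survive. The million-queries-per-second figure is his rough recollection for web search; the 10 requests per second for a checkout-style service is a made-up number used only to show the contrast.

```python
from fractions import Fraction

BASE_RATE = Fraction(1, 1024)   # sampling applied at the source
CENTRAL_RATE = Fraction(1, 10)  # second cut applied before the central store
SECONDS_PER_DAY = 86_400


def effective_rate(*stages: Fraction) -> Fraction:
    """Independent sampling stages multiply."""
    rate = Fraction(1)
    for stage in stages:
        rate *= stage
    return rate


def expected_traces_per_day(requests_per_second: int, rate: Fraction) -> Fraction:
    """Expected number of requests per day that survive sampling at the given rate."""
    return requests_per_second * SECONDS_PER_DAY * rate


if __name__ == "__main__":
    rate = effective_rate(BASE_RATE, CENTRAL_RATE)
    print("effective rate:", rate)  # 1/10240, i.e. roughly one in 10,000
    print("web search (~1,000,000 rps):", float(expected_traces_per_day(1_000_000, rate)))
    print("checkout-style (10 rps, hypothetical):", float(expected_traces_per_day(10, rate)))
```

Under these assumptions a search-scale service still keeps millions of centrally stored traces per day, while a 10 rps service keeps fewer than a hundred, which is the "cut both ways" problem described above.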
I think with Lightstep and with other systems that have been written in the last couple of years, there's a recognition that sampling really serves a couple of purposes. One is to protect the system from itself. You don't want to have [25:00] an observer effect and actually create latency through tracing. With Dapper, we had that issue because we wrote to local disk, so we were basically entering the kernel, at least on disk flush, and for hosts that were doing a lot of disk activity we could actually create latency with high sampling percentages. But there's no need to do that; you can just flush the stuff over the network. That was 2005, and networks are a lot faster now. Now you can actually get the data out of the process without sampling in almost all situations. There are probably outlier cases here or there where that's not true, but overall there is no issue with flushing all the data out of the process. Then you just need to decide how much you're willing to spend on network and how much you're willing to spend on storage, and that's a whole other set of constraints. The other thing that I have recognized is that long-term storage is quite cheap. The wide-area networking costs over the lifetime of that data end up being almost as expensive as, or in many cases more expensive than, storing it for a year. So if you can find some way to push the storage closer to the application itself, even if it's just in the same physical building or availability zone, that's a pretty big win as well. So a lot of the work that we've done at Lightstep is actually trying to take advantage of that. You're just trying to be on the right side of those cost curves in terms of where you do the high-throughput work, where you do the sampling, and stuff like that.
Utsav Shah: This reminds me of how Monarch is designed.
I was just reading up on the paper before this. It's the same idea, where you're trying to flush to something that's in a local data center or the local availability zone. Then finally, when you query, you're getting so much less data that you can do that at once and ask questions across multiple regions. Is that roughly accurate?
Ben: Yeah, I think there are definitely some similarities with Monarch. I have to say, if we could go back in time, I would have pushed back harder on some of the requirements that were put on us. I don't think we did the wrong thing, given the requirements that were handed down, but the requirement that we depend on almost nothing except for physical DM ("physical DM" is kind of a misnomer), the fact that we weren't allowed to take advantage of Google's other infrastructure beyond just the scheduling system and the kernel and things like that, really limited what we could do. And then, when you also pile on some other requirements around performance and availability, you're kind of forced to store everything in memory, and we did. The paper goes on to talk about the number of tasks, which are basically virtual machines, that Monarch consumes in the steady state. If I remember correctly, the paper's number is something like 250,000 VMs in steady state, and that is just an extraordinarily expensive system right there. A VM of course is not the same size as a physical machine, but it's a lot, and that's not even counting the VMs that are being used for durable storage and long-term storage of the data, wherever they're putting that stuff in Google's longer-term storage systems. I mean, it's just a tremendously expensive system, and that's not a good thing, and I'm not convinced that's the right approach.
With some of the work we've been doing lately at LightStep, we basically had to write our own time series database from scratch. Rather than trying to re-implement what we did with Monarch, I think one of the lessons we've learned is that there are ways to do this that are far more efficient without really paying a penalty in terms of performance. And yeah, I remember that we felt like we had no choice but to do everything in memory. There are some similar systems: Facebook's Gorilla system, I think, also ended up making the same decision at about the same time. Maybe it was because flash wasn't quite commodity at that point, so we felt like the choice was physical spinning disk or memory. Now of course there are some interesting options, but that was expensive. I don't know, it's a cool system and it's very powerful, but awfully expensive.

Utsav Shah: Yeah. So just for listeners, Monarch is a monitoring system. It provides roughly the same interface to the end user as something like Prometheus, but it's designed in a very different way internally. I think the configuration system is different, but it serves kind of the same purpose. It was designed to replace Borgmon, the original monitoring system, which engineers had to deploy for themselves, whereas Monarch was like a SaaS service in the sense that you just had to add your metrics and things would work automatically. Is that a good summary?

Ben: I think that's exactly right about what Monarch is. The Prometheus thing is a little funny, though. Prometheus is architecturally much more similar to Borgmon than to Monarch. Borgmon had a lot of issues that Prometheus, I think, has improved upon, issues that were self-inflicted. To actually use Borgmon to monitor your system, [30:00] you had to use not one but five different domain-specific languages, all totally unique to Google.

Utsav Shah: PSM.

Ben: Yeah.
All of which were totally arcane, if you want my honest opinion, and had lots of gotchas. For instance (sorry, this is a rant), if you wanted to do arithmetic, which of course is something you'll want to do when you're writing queries, you could use the minus operator. No surprise. But if you had variables that you were subtracting and you didn't separate them with spaces, then because hyphens were allowed in variable names, it would just silently fail. Or rather, it would just substitute a zero for that expression, and crazy stuff like that. And of course, since it was a handwritten DSL that wasn't particularly well documented or maintained, there really wasn't staffing to improve it. There was definitely a period at Google where it was kind of the fashion, anytime you ran into a new problem, to write some kind of language. Some of these languages in the Borgmon universe were pretty small, to be fair. But the point I'm making is that they each had their own grammar and their own rules, and most people basically just copy-pasted someone else's Borgmon config in order to hit their launch criteria. So there wasn't a lot of thought and care being given to writing maintainable code, and it definitely is code. If I remember correctly, Gmail's configurations, which admittedly were generated programmatically, were something like 50,000 lines of Borgmon code, and they were totally inscrutable. So there was a lot of frustration about that kind of stuff. Whereas Prometheus, I think, is far more sensible. I could critique this or that, but it generally makes sense. I get it. I don't want to sound overly critical. This is not a good or a bad thing; it's just recognizing that every system is designed for a certain set of problems.
But Prometheus did inherit one of the most problematic things from Borgmon, which is that it wasn't really designed for distributed query evaluation. You can kind of do it, but you have to manually shard the thing yourself, and that's a very difficult thing to maintain, with all the rebalancing and things like that. I think the initial effort at Google wasn't Prometheus, but it was almost like it: let's fix Borgmon by building a new system that has the same scaling characteristics but has one language, not five; a better language; improvements to this or that; a better internal time series, things like that. But it was still basically the same architecture. My recollection is that Alan Donovan, another person who's a lot smarter than I am, a really clever person, was working on this stuff at the time. I think his observation was: if we're saying that Borgmon has tons of issues, how could it be the right thing to keep the block diagram exactly the same and just make each block better? Shouldn't we be thinking about this a little more holistically and really examine the problems that people are having? And I think when we did that, we realized that the number one problem, the one causing a lot of the other weird things people were doing, was the fact that they had to manually shard and balance this thing, and that distributed query evaluation was kind of a hack. So the thing that made Monarch so interesting, and also so difficult, was that it really was horizontally scalable and users did not need to worry about where their data was being placed. It was also multitenant from day one, which allowed a central team to run it for all of Google instead of repeating that effort with every team in its own little cluster.
It made the design much harder, but ultimately, I think, more robust. I'm not knowledgeable enough about Prometheus to know how much effort would have to go into making it really do that. I've seen that Thanos has added some functionality like this, but with query evaluation, really pushing it down and making sure that you do as much of the aggregation as you can at the lowest level and then bring things back up, there are a lot of subtleties that I think we felt we had to build into the design pretty early. That's the thing we were really trying to escape from with Borgmon: a design that made it difficult to do distributed query evaluation and difficult to handle really large datasets that don't fit in a single machine, because that's the underlying pain point in Borgmon that led to a lot of other pain points.

Utsav Shah: That is super interesting. The name Alan Donovan, I think I've seen it in the Bazel git logs. He wrote Starlark for Go, and I think I might be in that Google group.

Ben: Yes, that's right. I think he wrote one of the official Go programming language books. He has a languages background. Really nice person, very intelligent guy. I credit him with forcing us to step back and really think about what problem we were solving with Monarch. And yeah, that was really fun. I loved building that system. Probably my happiest time at Google was the summer that we were prototyping it. It was just an amazing team that moved very quickly. It was a lot of fun.

Utsav Shah [35:00]: Can you talk more about the distributed query evaluation? I don't fully get why it's problematic. So let me explain it to you and you can tell me where I'm wrong. What you're saying is that with Borgmon, query evaluation mostly happened at the higher layers? I guess if you could just explain it to me, because I don't fully grasp it. What exactly is the difference in architecture?
Ben: Yeah, I wasn't being clear, so that totally makes sense. Let's take a simple query. You want to understand the ratio of your error rate to your total request rate across your application, and you want to group it by RPC method. Let's assume that the amount of data you have for all of these time series is large. To put this in context, some of Gmail's metrics were distribution-valued, so the actual value type was a histogram, and a single metric with all its cardinality turned into 250 million time series in the steady state. So that's a very high-cardinality surface area that we were trying to aggregate over. The problem you have when you're trying to do that sort of error ratio query is that it's a join. You have two different queries, an error query and a count query, and then you have to create buckets for each of these RPC methods, just doing a group by, and within each of those you have to do a bunch of math. One option is to have a query evaluator at the top of the stack that just talks to all of the leaf nodes. (In Monarch, we called them leaves.) You would say, okay, give me all the data you have for this particular metric, and they would stream the data back to you, and you would just do the math. It turns out the data size is large enough that if you do that, your query evaluation times move into the tens of seconds, or minutes in some cases, which was kind of a non-starter. So instead, what you'd like to do is say, okay, fine, we'll compute this at the leaves. But the problem is, and this is the most important point, if you're grouping by RPC method, there's absolutely no guarantee that all of the data for one RPC method is going to be on one server or another. In fact, it's just not true.
So each of the Monarch leaves is going to have some portion of the data, and what you want to do is compute what we would call a partial aggregation. Every leaf computes its part of this particular query, so they each make the RPC method buckets, and then they pass those partial results up the stack to the mixer level. Because the leaves have already done that aggregation, the data size over the wire is pretty small. Then you complete the aggregation now that you have all the data, get the final numbers for both the errors and the count, and at the last step at the top you join the two things, divide them, and you've finished the query. The most important thing to understand is that the lower-level nodes, where the horizontal scaling has happened, cannot compute the final number because they don't have enough data to do it. So they have to have some way of communicating partial results back to the top of the stack. In that example it's not that difficult, but in terms of the full query plan for an arbitrary query in the language we were building, there's a lot of subtlety and complexity in how different types of queries can and cannot be pushed down to be evaluated at the leaves. And if you ever end up in a situation where you need to pull all of the data up into the mixer level, the whole thing totally falls apart from a performance standpoint, and oftentimes even from a feasibility standpoint: you end up OOMing that thing if you're not careful. So there's a lot of streaming evaluation and pushback on channels so you don't flood the thing that's getting this huge fan-in from all the children. John Banning, another person much smarter than I am, designed that thing and worked on it for years to optimize it. It was a really interesting piece of technology, but it was quite subtle.
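The partial aggregation described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Monarch's implementation: the function names `leaf_partial` and `mixer_merge`, the tuple layout, and the sample numbers are all hypothetical. Each leaf returns per-method sums rather than a ratio, because a ratio computed on one leaf cannot be combined correctly with ratios from other leaves.

```python
from collections import defaultdict

def leaf_partial(rows):
    """Run on each leaf: sum errors and requests per RPC method,
    using only the rows that happen to live on this leaf."""
    partial = defaultdict(lambda: [0, 0])  # method -> [errors, requests]
    for method, errors, requests in rows:
        partial[method][0] += errors
        partial[method][1] += requests
    return {method: tuple(sums) for method, sums in partial.items()}

def mixer_merge(partials):
    """Run at the mixer: finish the aggregation across leaves,
    then join errors with requests by dividing them."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for method, (errors, requests) in partial.items():
            merged[method][0] += errors
            merged[method][1] += requests
    return {m: e / r for m, (e, r) in merged.items() if r > 0}

# Hypothetical data: the same RPC methods are spread over two leaves.
leaf_a = [("Get", 1, 10), ("Put", 0, 50)]
leaf_b = [("Get", 1, 90), ("Put", 5, 50)]

ratios = mixer_merge([leaf_partial(leaf_a), leaf_partial(leaf_b)])
print(ratios)
```

Averaging the per-leaf ratios for `Get` (0.1 and about 0.011) would give roughly 0.056 rather than the true 2/100 = 0.02, which is why the leaves ship sums and the division happens only after the merge.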
And I think if we hadn't designed for that initially... it's just hard to take a query model that isn't designed to create these partial aggregates. It was hard for me to imagine how you would shim that in after the fact, because the way you approach the computation is that you have to be able to truncate the computation and send it up the stack as a partial computation instead of as a set of query results. I'm not saying it's impossible; I'm assuming it gets pretty ugly. So that's the thing I was referring to. Does that make sense?

Utsav Shah: Are you saying that the query language itself also needs to be designed with this in mind? Or is that mostly just the way you shard and make your query plan?

Ben: We really tried not to put constraints on the query language because of this. There were certain types of joins that were really hard, like a full relational join. The problem was deeper than that, but it wasn't designed to be a relational [40:00] database, and those things are really quite difficult to implement. The joins are where, I think, we had to draw the line on some of the functionality. But for a lot of other stuff, the query language has some very powerful capabilities, especially dealing with the time dimension. It also was much more limited; it wasn't as powerful as a lot of SQL-like languages for doing general-purpose computation, stuff like that. So it was very much designed for time series data, but I would say it seems to have been general enough to represent a wide variety of operational use cases. It just wouldn't take the place of BigQuery or something like that for general-purpose computation.

Utsav Shah: Yeah. And one thing I think listeners should note is that you might think users are not running that many queries. How does it matter?
But a lot of people write alerts and monitors, and they're basically super complicated and have to be evaluated a lot. I'd imagine that a lot of your load was from these alerts that have to be continuously evaluated, and it's super critical that they fire quickly. You don't want there to be a production outage and a slowdown due to the monitoring system.

Ben: That's certainly the case!

Utsav Shah: So it's a really interesting problem. I think Dropbox tried to build their own, and I think they rolled it out successfully, and they just put in cardinality limits. You can't have queries with super high cardinality, but there are very few other limits, because I think cardinality explosion is a large source of problems. Is that accurate?

Ben: Yeah, that's definitely accurate. That's another rant, if you don't mind me going in that direction for a minute. This is one of the things I did not realize when I was working on Monarch, but I've come to believe that there really are two types of monitoring data, or telemetry, I guess you could call it. There is statistical data, which we usually call metrics, and there's transactional data, which we usually call traces, or sometimes logs or structured events. Those are the two flavors of data. The Achilles heel for the transactional data, the structured events and traces, is that at high throughput, just retaining it for a long time is really expensive, because there's so much of it. If you're processing a lot of transactions, it's big data, and it's expensive. I don't mean big data in the buzzword sense, but it's a lot of data. On the metrics side, you can handle high throughput very naturally, because the only thing that changes is the value of the counter. It just goes up; high throughput doesn't make the metric data larger, it just changes the numbers. So the Achilles heel for metrics data is high cardinality.
I actually wrote some stuff about this on Twitter a few weeks ago. I can send it to you afterwards if you want to include it in the article. The thing that's so frustrating is that cardinality is necessary. It's totally inevitable that you're going to want to include some tags in order to understand and isolate symptoms of interest. I think metrics should be used to isolate symptoms of interest that connect the health of some part of your system to the business. At some abstract level, that's really what you're supposed to be doing with metrics. But because that's the tool people use, it becomes the hammer for every nail they see, and you try to use cardinality to address every aspect of observability, which is a complete disaster from a cost standpoint and from a user experience standpoint. I will try to elaborate on that. Let's go back to this example of RPCs. It's totally fair, and smart actually, to have a tag on your RPC metrics for the method, because you want to distinguish writes from reads. Totally fine, because those are different things you might want to measure independently from a health standpoint. But then you might say, I see a traffic spike or a latency spike, and I want to understand why. If you're using a metrics system, the only tool you really have to understand that variance or that change is to do a group by on some other tag and hope to see that one tag value explains the blip. Then you have a lead, and you follow that tag value where it leads you, whether it's a hostname or something else. I understand the appeal of that, but there are two problems. One, you add a couple more tags and the combinatorial explosion is immediate, and you're suddenly spending a ton of money. Whether it's on Prometheus or a vendor, it's really expensive. And the even more pressing issue, I think, is that in the code you only have access to things that are locally available.
So you have access to your own hostname and things like that. But the numbers I've seen, which ring true for me, are that 70-75% of incidents in production are caused by an intentional change, like a deployment or a config push, elsewhere in the stack. That version change or config push is not going to be in your local tags. If you're in service B [45:00] at the bottom of the stack, and service A at the top of the stack pushes a new version that's flooding you with traffic, service A's version is not going to be available for grouping and filtering anyway. So you're paying a lot of money to have all this cardinality, and you can't even group by the thing that would explain it. Now, in the transaction data, the tracing data, you absolutely do have that. The traces flow through both services. These health metrics are linked to the transactions via hosts, via service names, via method names, all sorts of stuff like that. There are ways to pivot over to the tracing data programmatically in an observability solution, and in the transaction data, high cardinality is not an issue. You can do an analysis of thousands of traces in real time and actually understand that the thing that changed is that service A, above you, was on version five before and is now on version six, and that explains the difference in the health metric you started with. But using metric cardinality as the way to do observability is a big mistake. I think metrics should be used to understand health and nothing more. And then you have to be using observability tooling that's smart enough to pivot over to the transaction data, where it's both cheaper and more effective to understand the sorts of systemic changes that lead to these health changes in the first place.
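To make the pivot from a health metric to trace data concrete, here is a minimal Python sketch. It assumes a deliberately simplified trace format (a list of span dictionaries with hypothetical `service` and `version` keys) and hypothetical helper names; it is not how LightStep or any particular tracing system represents or analyzes traces. The idea is only that service B can compare the upstream versions seen in traces before and after a regression, even though service A's version never appears in B's own metric tags.

```python
from collections import Counter

def upstream_versions(traces, service):
    """Count the versions of `service` seen across a set of traces.

    Each trace is a list of span dicts with the hypothetical keys
    "service" and "version".
    """
    return Counter(
        span["version"]
        for trace in traces
        for span in trace
        if span["service"] == service
    )

def newly_seen(before, after):
    """Return values present in the `after` window but absent `before`."""
    return sorted(set(after) - set(before))

def make_trace(a_version):
    # Service A calls service B; only A's version differs between windows.
    return [
        {"service": "service-a", "version": a_version},
        {"service": "service-b", "version": "v12"},
    ]

baseline = [make_trace("v5") for _ in range(1000)]
regression = [make_trace("v5")] * 100 + [make_trace("v6")] * 900

before = upstream_versions(baseline, "service-a")
after = upstream_versions(regression, "service-a")
suspects = newly_seen(before, after)
print(suspects)
```

Here `suspects` contains only the version of service A that was absent from the baseline window, which is the kind of lead a group by on service B's local metric tags could not produce.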
At Google, I think we were actually way behind where we are now with LightStep, honestly, in terms of how we would pivot from time series data over to transaction data and back again. But that's really the essence of it: you can't do high throughput with one and you can't do high cardinality with the other, so you have to be able to use tools that pivot from one to the other intrinsically. I think that's the thing that allows you to get out of that cardinality trap more than anything else. It's not setting a cardinality limit; it's not needing the cardinality to be in your metric data in the first place. I think that's really the solution we'll find ourselves pursuing over the next couple of years.

Utsav Shah: I think that makes total sense. And maybe just super quick, to explain to listeners: why is high cardinality bad? I think the answer is that in a time series database, when you have a data point with a new tag value, you basically have to store it as a different row or a different column, and that's what causes the explosion.

Ben: Yeah, that's basically it. And high cardinality isn't bad; high-cardinality metrics are bad. You can basically think of a time series database as a huge spreadsheet where each row is a different time series. It turns out that creating a new row is a lot more expensive than creating a new cell.

Utsav Shah: So you're just incrementing a number in that cell?

Ben: Yeah, or adding another data point to an existing time series is far cheaper than creating a new time series. The issue with high cardinality sounds so esoteric, but it ends up being an issue that your chief financial officer will quickly start caring about.
Because you can add one piece of code, literally one line of instrumentation, that says, I've got this metric that tracks requests, and I'm going to add customer ID and hostname, and that just explodes. Every single new value that line of code sees is going to create a new time series. And it's not vendors being evil, by the way. They have COGS; they have to pay for it. If you write a piece of code that incurs 10 million time series in some TSDB somewhere, that's going to be expensive no matter which TSDB you're using. Some are more expensive than others, but it's still just different flavors of expensive. I think ideally the developer can add that code, and a platform team can write some kind of control that says, I never want cardinality for any metric to exceed X, and then you can do some kind of top-K thing to retain the high-frequency data and aggregate the rest. That's the kind of long-term vision I have for how cardinality should work. But really it would be great to use event data, tracing data, for high cardinality, where you don't pay a penalty at all, where it just literally doesn't matter in terms of the cost of the solution, and then stick to metrics for high-throughput data where you need precise answers about critical symptoms.

Utsav Shah: And then the flip side of that is, why is cardinality not a problem for tracing? How do you store tracing data in a way that cardinality isn't an issue at all?

Ben: I think it depends on how you index it. Cardinality is an issue with the index in the TSDB.

Utsav Shah: Yes.

Ben: In the tracing database, there are many different ways of doing it. Some people have column stores. I don't work on any of them, so I don't want to misspeak, but I'm pretty confident that their underlying database is a column store, where you have different trade-offs in that type of world.
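The arithmetic behind the explosion, and the top-K control Ben mentions, can be sketched as follows. Everything here is an assumption made for illustration: the tag cardinalities, the `__other__` bucket name, and both function names are hypothetical, and real systems enforce limits in their own ways. `series_count` gives a worst-case upper bound (the product of the tag cardinalities), since not every combination necessarily occurs.

```python
from collections import Counter

def series_count(tag_cardinalities):
    """Worst-case number of time series for one metric:
    the product of the distinct-value counts of its tags."""
    total = 1
    for distinct_values in tag_cardinalities.values():
        total *= distinct_values
    return total

def cap_cardinality(counts, k, other="__other__"):
    """Keep the k largest series and fold the rest into one aggregate."""
    top = dict(Counter(counts).most_common(k))
    rest = sum(value for key, value in counts.items() if key not in top)
    if rest:
        top[other] = rest
    return top

# Hypothetical metric: requests tagged by method only, then by
# method, customer ID, and hostname.
print(series_count({"method": 20}))
print(series_count({"method": 20, "customer_id": 50_000, "hostname": 10}))

# Hypothetical per-customer request counts, capped at two series.
per_customer = {"acme": 500, "globex": 300, "initech": 5, "hooli": 2}
capped = cap_cardinality(per_customer, k=2)
print(capped)
```

With these made-up numbers, one extra line of instrumentation takes the metric from 20 series to a worst case of 10 million, while the cap keeps the two busiest customers as their own series and preserves the total request count in the aggregate bucket.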
What LightStep does is too complicated for me to explain right now, but we have our own way of managing cardinality in the trace data. It's not that it's free, but it's not an issue like it is for a time series database. Different people have addressed it in different ways, [50:00] and I don't feel that's a very satisfying answer to your question. In a time series database, because of the locality constraints you have around time series, adding cardinality just has a fixed cost that's relatively high. I think that's probably the best way I can explain it.

Utsav Shah: Okay. Is the index in tracing data just the trace ID, or a hash basically, or a service, which is a relatively straightforward index?

Ben: There are a lot of different ways to do it. That's why I'm kind of hedging on this. I like being precise, and there's actually quite a diversity of ways that it's done right now, so I can't specify how it's done everywhere. In the Dapper ecosystem, we had a few indices that we special-cased, and if you wanted anything that wasn't one of those indices, you had to write a MapReduce, which is super high latency. Other systems have no limits on cardinality, and others will index up to N values of every key and no more. There are a lot of different approaches to it.

Utsav Shah: That makes sense. I can read up more on how Jaeger works, but when you search for a trace ID, you get that fast, and everything else is kind of not that fast, so that makes sense to me. I think there was a lot of good information here. I feel I've learned a lot about how all of this infrastructure works. Do you have anything that you want to add on top of this?
Given everything you've learned since you left Google in 2012, what is the one thing that you still use or still remember from that time? One design principle, perhaps, or one way of thinking about things?

Ben: That's a good question. I kind of referred to it earlier, but not as precisely. One thing that Jeff Dean said a few times, which really did stick with me, was this idea that you really can't design a system that's appropriate for more than three or four orders of magnitude of scale, appropriate being the key word. This goes back to the idea that systems at Google were not better because they were more scalable. One thing I liked (and it's not about Google; I think most companies are like this) is that the in-house technology at Google came with pretty accurate advertising for what it was good for and what it wasn't good for. There's no shame in saying, yeah, this database is good until you hit this scale, and this one's terrible if you go below that because it doesn't have all the features you'd expect, or what have you. My experience outside of Google, whether it's open source software or vendor software, is that people are understandably reluctant to describe the scale that their system is appropriate for. It's a great question to ask, actually. If you're talking to someone about something they're really excited about and they're trying to pitch you on it, just say: so tell me, what's too much scale for this, and what's too little scale for it? If people can't come up with an actual answer to that, I think that's a bit of a red flag in my book. That was something that really stuck with me after Google. I think it applies to any technology that you're building, and it's good to be humble about that too.
You can remind yourself: you built some really scalable thing, and it probably doesn't do something that a less scalable thing could do. Just try to think about fitness for purpose. That's maybe the thing that comes up over and over again, in everything from engineering to product, almost to marketing: thinking about what market segment this is really appropriate for, and where the scale we're targeting lives in the marketplace. So I think it's relevant at that level too.

Utsav Shah: Yeah. It reminds me of this meme, MongoDB is web scale.

Ben: What does that mean? No comment.

Utsav Shah: The web is so different for so many different things. There is no concept of web scale; it just sounds fancy. My current company uses MongoDB, and it seems to work so far, so it's probably fine. I have so many more questions. I want to ask you about OpenTelemetry and OpenTracing and all of these things, and LightStep. But I think it would be nice to do that in a follow-up session. This was great, and I feel like I learned a lot, so thank you so much for being a guest.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

54mins

4 Apr 2021

Ep. #3, OpenTelemetry with Ben Sigelman of LightStep

The Kubelist Podcast

In episode 3 of The Kubelist Podcast, Marc Campbell speaks with Ben Sigelman of LightStep. They discuss the inspiration and origin story behind OpenTelemetry, the challenges of observability, and the path from sandbox to incubation. The post appeared first on Heavybit.

40mins

14 Oct 2020

Ep. #3, OpenTelemetry with Ben Sigelman of LightStep

Podcasts – Heavybit

In episode 3 of The Kubelist Podcast, Marc Campbell speaks with Ben Sigelman of LightStep. They discuss the inspiration and origin story behind OpenTelemetry, the challenges of observability, and the path from sandbox to incubation. The post appeared first on Heavybit.

40mins

14 Oct 2020

Ben Sigelman - The Future Of Observability & Why It's Not Just Telemetry (Observability Series - Part 2)

Masters of Data Podcast

There are three types of people in the data world: mathematicians, scientists, and engineers. Mathematicians are interested in understanding things that are true or false. Scientists are interested in furthering knowledge and enjoy answering challenging questions. Engineers are interested in building things that are useful, so they can solve a problem that's important. Engineers in the software industry are currently searching for ways to resolve the issues associated with microservices. Right now, the software industry is facing a massive architectural transformation, and engineers have the opportunity to create systems that solve important problems. That's why Ben Sigelman, CEO and co-founder, started Lightstep: to create something useful and impactful. He saw an opportunity to accelerate the industry's transformation while improving the developer and end-user experience, and he took it. Using observability, he built something that could help people gain more confidence in and understanding of their own systems.

As an ex-Googler and co-creator of Dapper, Ben Sigelman witnessed the birth of microservices at Google. He learned a great deal from his experiences, and Lightstep is in many ways a reaction to and a generational improvement beyond those approaches. Sigelman's fascination lies in deep systems and how they break, but he is also passionate about separating the telemetry from the rest of observability. There is a lot of noise in the marketplace and confusion about how to approach observability, but Sigelman is confident that in the next 5-10 years, applications could change the way the software actually works, not just the way we understand it. Listen to Ben Sigelman and Ben Newton discuss the future of observability, and learn more about how this transformation could impact the industry.

This week's episode is the second installment of a special three-part series on observability in data. Tune in each week to hear about how the world of observability is transforming into a major player in the data realm.

35mins

24 Aug 2020

Episode 110: Kelsey Hightower and Ben Sigelman Debate Microservices vs. Monoliths

The New Stack Context

Listen to ALL of our shows here: https://thenewstack.io/podcasts/

Welcome to The New Stack Context, a podcast where we discuss the latest news and perspectives in the world of cloud native computing. For this week's episode, we spoke with Kelsey Hightower, a developer advocate at Google, and Ben Sigelman, CEO and co-founder of observability services provider LightStep, about whether or not teams should favor a monolith over a microservices approach when architecting cloud native applications.

Hightower recently tweeted a prediction that "Monolithic applications will be back in style after people discover the drawbacks of distributed monolithic applications." It was quite a surprise for those who have been advocating for the operational benefits of microservices. Why go back to a monolith?

As Hightower explains in the podcast: "There are a lot of people who have never left a monolith. So there's really not anything to go back to. So it's really about the challenges of adopting a microservices architecture. From a design perspective, like very few companies talk about, here's how we designed our monolith."

Sigelman, on the other hand, maintained that microservices are necessary for rapid development, which, in turn, is necessary for sustaining a business. "It's not so much that you should use microservices, it's more like, if you don't innovate faster than your competitors, your company will eventually be erased, like, that's the actual problem. And in order to do that, you need to build a lot of differentiated technology," he said. Microservices is the most logical approach for maintaining a large software team while still maintaining a competitive velocity of development.

Later in the show, we discuss some of the top TNS podcasts and news posts of the week, including an interview with IBM's Lin Sun on the importance of the service mesh, Sysdig's offer of a distributed, scalable Prometheus, a group of chief technology officers who want to help the U.S. government with the current COVID-19 pandemic, and the hidden vulnerabilities that come with open source security.

TNS editorial and marketing director Libby Clark hosted this episode, alongside founder and TNS publisher Alex Williams and TNS managing editor Joab Jackson.

40mins

27 Mar 2020

The Joy of Building Enterprise Software with Ben Sigelman

Screaming in the Cloud

About Ben Sigelman

Ben Sigelman is a co-founder and the CEO at LightStep, a co-creator of Dapper (Google’s distributed tracing system), and co-creator of the OpenTracing and OpenTelemetry projects (both part of the CNCF). Ben's work and interests gravitate towards observability, especially where microservices, high transaction volumes, and large engineering organizations are involved.

Links Referenced

OpenTracing: https://opentracing.io/
OpenTelemetry: https://opentelemetry.io/
Twitter: https://twitter.com/el_bhs
Email: bhs@gmail.com
This podcast: http://ScreamingintheCloud.com

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Cloud Economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. This promoted episode is brought to you by LightStep and, as a result, I am speaking with Ben Sigelman, founder and CEO at LightStep. Ben, welcome to the show.

Ben: Hi, Corey. Thanks for having me.

Corey: You have an interesting backstory. We’ll get to the whole modern LightStep story, but originally, some folks are born in the cloud, you instead were born at Google. You were a co-creator of Dapper, which is, to my understanding, their internal distributed tracing system, and you’ve done a lot of open source work, too, OpenTracing then OpenTelemetry, both part of the CNCF, so you’ve been focusing on the monitoring/observability/don’t ever get those two words confused movement for a while now. What’s your backstory? Where do you come from?

Ben: Well, my mother and father looked a—no let’s see, where did I come from? I was in college, and I started off freshman year with all of the seniors getting a thousand job offers in 1999. 
And then I graduated in a very different environment, and all the internet busts had happened and things were looking a little grim. And I just barely was able to get an offer anywhere, but I was lucky to get it from Google. And I went there and worked on ads originally, which I, frankly, didn’t enjoy at all. My first couple of months there, I was pretty unhappy actually. I didn’t like the work I was doing, I didn’t like the product. And then I had a meeting with this woman named Sharon Pearl, who was, at the time, she was working on five or six different computer science research projects. She had come over from Digital Equipment’s Research Lab along with a bunch of the other old-guard people at Google, like the first 100 employees. And she was super, super, super—well she is super, super, super smart. And I just—we had this, literally this serendipitous, completely random conversation, and she asked me what I was doing. And I said I didn’t like it that much. Asked her what she was doing. And she rattled off a list of several projects. One of them, I remember was this distributed blob store, kind of like an S3 or GCS type of thing. There’s another one that was a global identity management system for all Google end-users, etc. But there’s this one called Dapper that she had prototyped with Mike Burrows and Luiz Barroso, who also came from these research labs in the late 90s, early 2000s. And it just sounded fascinating. Unfortunately, it wasn’t done, so no one could really use it, but they’d realized that it was possible to trace requests across what we would now call microservices at Google. They didn’t call it that, but you could watch a single request go from a web user all the way down through thousands of services and back to the user in 100 milliseconds, or whatever it was. And I just thought it was fascinating. And at the time, my direct manager had 120 direct reports. I don’t mean that his organization was 120 people. 
But he had 120 direct reports, one of which was me, which is to say he had no idea what I was doing, because how could you? And I just started working on Dapper instead, and I thought it was awesome. And it started to work, actually, I got it to work in a pre-production environment, and built a team around it, and then put that into production. And that was 2005, and I’ve just been pretty mesmerized by this overall problem space, and I don’t think that’s ever really going to let up, so I just keep on working on it.

Corey: It’s strange, in that I had the privilege of working with a Google VP many years ago who had left Google and was talking about some of the same principles of tracing. Specifically, every system should expect an event identifier in it, and if it doesn’t wind up getting one of those, first it should add one, and secondly, it should raise an exception, so that that can get caught as the fact there’s something that is not participating in this event tracing system. Now, what made this unique at the time was this was circa 2011 or so, and we all looked at him like he had just grown a second head, because how big do you think this website is, buddy? Maybe that’s fine for Google. But here in reality, that’s not how the rest of us tend to conceptualize these things. Well, then we went into a microservices direction, which turns every outage into its own version of a murder mystery. And now having something like that is no longer optional for reasonable troubleshooting perspectives. It’s sort of suffered on some level from the curse of being too early to the market. It seems like you folks are right on time.

Ben: Yeah, at the moment, it does seem that way. When we started the company, that was my biggest concern, actually, I wasn’t worried about whether this would be necessary, but I didn’t know when, and I think in retrospect, our timing was right on target. 
There are other products that came before LightStep’s that were in a similar vein that I think were actually too early, that started four or five years earlier, and they built a great product, but all that you could install it on was like a PHP website or something, and it’s just, not like a Facebook kind of thing, but just like a blog, and it’s just you don’t need distributed tracing to manage your personal blog. So I think we did get the timing right on the nose, but frankly, that was an accident, [laughing], just good luck.

Corey: One of the things that I’ve always found to be a challenge for the distributed tracing set has been in trying to articulate the value of what you do. For example, I’ve gone round and round with this, with the Honeycomb folks in previous venues and different folks. And I know, for example, that you are legitimately in this space because whenever I refer to you as being observability-focused, Charity Majors doesn’t punch me in the face. So, first, you have the Charity-not-screaming-at-people seal of approval. So good job on that, this is legitimate, not a branding exercise.

Ben: I’ll put it on my tombstone. Yeah, Charity is my best frienemy in the industry, we obviously compete at some level, but I think Honeycomb does great work, and she doesn’t suffer fools, so I’m glad that so far she hasn’t called me out, or something like that. Yeah, to be honest, I don’t really like positioning LightStep as a distributed tracing company. That’s not really how I think of our mission, or even really exactly our product. I think our technology under the covers certainly is all about distributed tracing. But that’s, in my mind, an implementation detail. We do see a lot of people in the market that have heard about distributed tracing, can recognize at some instinctual level that being able to follow requests across services is going to be part of the solution, and then they start looking for distributed tracing. 
And, frankly, if someone comes to our door and says, I want to buy tracing, and we have a tracing based solution, we can sell them that product, and I think there’s a lot to be said for that. Both parties benefit from that, but it’s not really the way I think about the space, and I do think that for distributed tracing going forward, it’s important that we talk about what it does and not what it is, if that makes sense. The point of distributed tracing, for me, is just to satisfy the same old requirements we’ve always had for observability or monitoring before; we need to deploy our software faster, we need to understand why it’s performing slowly and where, and then we need to reestablish regular performance if there’s been some kind of emergency. Really, those three things are the driving forces behind every observability product, and tracing just happens to be necessary if you want to do any of that in a distributed environment, like microservices or serverless.

Corey: One of the challenges you have is that historically describing what it is—an offering like this does—presupposes A) that someone has an extensive background in running large scale applications, and doing that in a very public, very global fashion. And secondly, that they have a spare 45 minutes to sit there and listen to the in-depth exposition that describes what the heck your thing does. So has that problem gotten better? In other words, is it easier to describe to people today what you folks do, than it was a few years back?

Ben: It has gotten a bit easier. I think you hit the nail on the head, though. We were chatting before we started recording and I was explaining that I have no interest in turning this into a product pitch. And this question, it risks us going in that direction, which I really don’t want. But part of the reason that’s gotten easier is that products like LightStep’s product, they solve problems, right? 
And I think it’s much easier to explain these problems to people when they’re actively feeling a lot of pain around them, as opposed to it being a theoretical exercise. I think before people moved to microservices, we could draw diagrams of, this is what it's gonna look like in a year or two years, when you make all these transitions and when your system is distributed and ordinary transactions no longer exist in only a single host. It was a theoretical exercise at that point. Now, it’s a much more visceral thing where we can say, “Have you ever had an experience where you have two teams shouting at each other because they can’t decide which one is the root cause of the problem? And they both have dashboards saying they’re healthy, yet it’s clear that one or the other is actually responsible?” or, “Have you ever had Kafka just totally on fire, because you have one of the 10 tenants is suddenly sending more traffic, and you can’t figure out which?” Or, “Have you ever had a situation where you’re dealing with a P0 emergency, and the one person who actually understands how to debug it is on vacation?” These sorts of things are symptoms of microservices and deep multi-layered systems, and once people can identify those problems, it’s much easier to say, “Well, let me explain how the sort of technology we’re bringing to bear is relevant to those problems.” So it has gotten easier over the last couple of years, frankly, because the level of active pain has gone way, way, way up with, I think, the credible migration towards more distributed architectures in the last couple of years in ordinary mid-market enterprise companies.

Corey: Well, let’s go back in time a little bit, if we may, to originally, I don’t believe LightStep was aimed at the monitoring/observability space at all, to my understanding you were something of a social media company, and then you had one heck of a hard pivot. 
Did it turn out that you just had—you sucked at telling a social media story, and then well, we’ve raised this money, we may as well do something fun, because we’ve made ourselves unemployable along the way, or is there something more nuanced to it?

Ben: That’s, yeah, it’s not a well-known fact. It’s not a secret, it’s just—feels, I don’t know, it’s not something that I expect people to ask about, and I forget to tell them. So LightStep, per se—when I left Google, I really had a bee in my bonnet about Facebook actually, as a product. I thought it made people miserable. I actually still think it makes people miserable, and the observation was that most people, certainly including myself, are complicated, and if you compare your inner experience as a human being to other people’s carefully curated vacation photos, it doesn’t feel very good. And this is, at this point, a well known—it’s almost a trope at this point, but when I left Google in 2012, that wasn’t as well known. I wanted to create a social media product where people were encouraged to be more candid and to be themselves, and then we would connect to each other—they could, I don’t know, find some common ground. The funny thing about the product—so I managed to raise a seed round around that idea, which I’m forever grateful for, I mean it was a really fun thing to build. And I had a very small but very high-quality team, and we built a prototype of this, and got it out on the app store and so on, and the surprising thing is one, it actually kind of worked. Like, we had a bunch of people that love this product, they really loved it, and you’d say, “What do you think of this product?” And they’d say, “Oh, this is the most important app on my phone. 
This has gotten me through really hard times, that sort of stuff, which is great if you’re building a social product to have that kind of zeal.” And then you ask the same people, “Okay, well, who would you tell about this product?” And they would say, “I would never tell anybody about this. It’s way too personal, way too private.” And so I—after about a year, after having the product in market, I decided that we had built almost exactly what I wanted to build. The vision had been achieved, and it was a total failure. The people that we retained, were, I would say 90 something percent of them were depressed introverts, and I love depressed introverts, a lot of my friends are depressed introverts, but they’re a terrible, terrible audience if you’re trying to build a viral social media product, they just won’t talk about anything. Like it’s impossible to get them to talk about it. So, at that point, I told the investors—I wrote a board deck that one of them is actually anonymized and used with her other portfolio companies that won’t admit that they’re failing. And I basically explained why this is never going to work. And I said, you can have your money back if you want it, but like I’m done, because I’m not interested in running out the clock, or I’m gonna do something totally different. And I do think it was relevant because prior to that I had been working on this observability type stuff at Google. And I actually really enjoyed it intellectually but had this idea that I wanted to work on something that was more meaningful to society in a direct kind of human to human way. And that experience building that social product was quite sobering for me. First of all, I’m really bad at it. I mean, really, really bad at it. I think my intuition around what’s going to work, what’s not going to work is not that good compared to how it is in other areas, like in the [inaudible 00:13:45] realm. 
Second of all, I think to win in those games, you have to play dirty and I don’t like doing that. And the funny thing about enterprise software is that when people are paying significant amounts of money for a product, they don’t just take your word for it. You actually have to deliver value, and it kind of goes back to high school economics, where it’s a mutualistic thing for all parties. A vendor can exist by amortizing the cost of developing something really powerful across many customers, and the customers win because they could never build something like this, or if they did, it would cost way too much money. So it’s this thing where you have this really clean, honest sales process, and at the end of it, both parties feel like they’re winning, because they are, and I find that, after working with consumer, which I thought was frankly kind of depressing, I found it to be a huge relief. And the reason that we’re working on this is just that it’s an area where we think we’re contributing actual value in a way that’s tangible. Like, you can tell that you’re doing something valuable because people pay for it and they want to renew year after year. And to me that’s a more validating feeling, than trying to sneak a couple seconds of people’s time, while they’re in the bathroom or something like that, which is like how it felt on social media, frankly. It just wasn’t that gratifying when it was all said and done.

Corey: So a common problem that you’re going to see with a lot of companies that trend into the monitoring/observability/yelled at by Charity Majors space is the propensity to wind up going broad, where, yeah, today you do, for example, distributed tracing. Tomorrow, you’ll do log analytics, the next week, you’ll do alerting, and suddenly you’re trying to be Datadog, Jr., but we already have a Datadog. 
And as you look at these companies continuing to expand to all of these different coverage areas, it becomes very challenging to differentiate any of them other than that one area that they excel at. First, you haven’t done that, so how have you avoided it? And secondly, what do you think drives that?

Ben: Well, we’re not as old as Datadog, so one way not to do it is not to be around long enough, right? But there’s also why we wouldn’t do that. I don’t want to throw too many stones at Datadog, either. I mean, they’ve obviously built a—

Corey: To be clear, I’m not trying to insult Datadog with that comparison. They’re fantastic, but they are the best of breed in this space. So everything that’s the newer generation trying to become the next coming of Datadog, well, why? I can see the story around individual components being awesome. What I’m not loving is this idea that everyone needs to be a broad platform for all use cases.

Ben: I think the problem in my mind, it really comes back to what I was saying earlier about whether LightStep is a distributed tracing company. Again, we use distributed tracing, and it’s the core of what we do. I do not think of us as a distributed tracing company. Nor do I think the problem that we solve—the problem we solve is not distributed tracing, or it really shouldn’t be. And I don’t consider it to be dishonest, but I do think it’s confusing for the market to have large vendors, whether it’s Datadog, or Splunk is also acquiring their way into a similar position, right? I don’t think it’s helpful for the market to position the problem space in terms of these technologies. Not just because it’s confusing, because tracing is not a problem, it’s a solu-, it’s not even a solution, it’s just a technology, right? Like, it solves nothing on its own. It’s partly that and, I think more importantly, that you don’t want to solve problems by having three or four different tabs open. 
Like having a tab open to the logging, and tracing, and metrics portions of some product suite is not a useful workflow. In my mind, the things that people are trying to do on a day to day basis are to deploy software with more confidence, to improve performance, and/or to recover from errors with haste, like some kind of on-call firefighting workflow. That’s like—you know that, of course, we can drill down into the ontology below that, but those are the main things you’re trying to accomplish if you’re actually an end-user of say, Datadog, and my issue with Datadog’s product strategy is not so much the accumulation of all these different data sources, which I think actually is totally reasonable, but the fact that they’re positioned—

Corey: Oh, that’s what you want if you’re a Datadog customer, absolutely.

Ben: —right, but they’re positioned almost as separate SKUs. In some cases, they literally are separate SKUs you pay for separately, but they’re actually not integrating that data from a workflow standpoint in a way that I feel like is very beneficial for their end-users. I think because it’s hard to do that, it’s not that they don’t want to, I think if you watch their keynotes and stuff, I think that’s what they’d like to do, as well. But they haven’t been able to do it because there’s too much gravity and too much velocity around the products they do have for them to execute on that. So I think the way you become—let me say one more thing. I definitely hear vendors talking all the time about building this platform or that platform. When you go and talk to buyers, especially at larger organizations, nobody is saying that we want to have one vendor for everything. I talked to a buyer at one of the major investment banks once, and he was saying, we already buy 57 different monitoring products, so don’t tell me you’re gonna sell me one tool to rule them all. Of course, I asked him why he couldn’t buy 58, right? 
Like, that’s the—

Corey: Oh, that’s the question you got to follow up with.

Ben: —right, exactly. But seriously, it’s not—maybe at some fantasy level, they would like that, but it’s completely unrealistic because you’re dealing with maybe four or five different generations of application technology. And so, if nothing else, Datadog is great, but it doesn’t really do a lot for your mainframe, right? I think you’re going to, at the very least have to integrate generationally, and then I think you also have to do some integration across different business units and pieces of the org that buy different tools for whatever reason. And so, the one platform thing isn’t something I really hear from the market as much as I hear it from vendors, for what it’s worth. Now, going back to the heart of the question, though, I think that the only answer, in terms of the company that wins in any of this stuff, whether it’s LightStep or someone else, is actually to focus on specific jobs to be done and to try to solve them end-to-end within a single tool. I do think that it’s necessary to bring other forms of data to bear on that problem, which is why LightStep, frankly, by the end of the year, I don’t think will be thought of as a tracing company, per se, as much as we are right now. I do think other forms of data are necessary. But I think it’s a mistake to position the product as a series of modules that are tied and tightly coupled to specific types of data. For instance, metric data is mostly totally unused and totally useless in any given investigation. There’s a very small subset of metric data that’s actually relevant. In order to figure out what subset that is, you have to, I don’t mean you should, but you have to be able to understand the relationships between different services on a per-transaction basis. The only way you can do that is to look at traces in the aggregate. So in my mind, the tracing data, it’s not the product. 
In LightStep’s product, you spend a very small amount of time actually looking at traces. You spend a much larger amount of time looking at statistical aggregates drawn from those traces, either directly or used to inform the interpretation of other data, whether it’s metrics, or logs, or whatever. So in my mind, the user needs to tell you what they’re trying to do. Are you trying to deploy software? Are you trying to solve a performance problem? Are you trying to resolve some kind of page? That context is enough to take all of this telemetry data, which is what we’re talking about here: metrics, logs, traces, etc, are telemetry data, that context is enough to interpret the telemetry data in a way that doesn’t require some kind of advanced degree or dozens of years of experience with tracing systems. And I don’t think that Datadog’s strategy is a bad one from a data acquisition standpoint, but from a product standpoint, I think it ultimately, it leads to a really fragmented end-user experience. And I find that to be kind of problematic. So that’s, that’s how I see it. And LightStep’s overall strategy is just to focus on specific workflows and to be the best at that. That’s what we’re trying to do, and not to be terribly distracted by the portfolio of telemetry data types that various other companies are integrating or acquiring through partnership or otherwise.

Corey: And I think that that’s a very fair assessment. To be clear, I have no problem at all with Datadog providing this. It’s something that is, I think, the right move. The problem I have is that you see so many companies that specialize in one thing, and they’re founded and they do that one thing so well, that then it feels like they’re suddenly veering into that we’ve got to do everything story. 
And for example, LightStep does a phenomenal job of effectively instrumenting observability into microservices applications, I don’t know that you would necessarily do nearly as well with log analytics, for example. The idea of—it’s the loss of focus on the one differentiating factor that makes everything work. It’s the same story as why no one has ever bought a multifunction printer that they liked. It’s, do one thing, do it well, and leave the rest for other folks.

Ben: Yeah, I think I was talking to, this is early on in LightStep when we were just in some customer discovery mode, we didn’t really have a product, but I talked to someone who bought—well, I’ll just say it—they bought New Relic. This is like in 2016 or something. And I asked them if they liked it, and they’re like, “No, not really.” And I was like, “Well, why do you buy it?” And they said, “Well, it’s B minus at everything.” And I think that was supposed to be sort of a good thing, right? It’s like, they didn’t need to buy—they wanted to have fewer vendors, not more vendors, and that helped them in that goal, but they weren’t particularly enthusiastic about it. And to your point about LightStep, if someone has a three-tier app like if they’ve got some Java app sitting on top of Oracle, and that’s the whole application, we would immediately walk away from that. You said log analytics. To me, it’s less about that particular data type and more about the architecture. LightStep is very focused on organizations that have incorporated some microservices. I mean, 100 percent of our customers still have a monolith as well, but the point is that they’re actually doing microservices in some capacity, and that opens up a whole set of problems that we’re designed to address. So, I think it’s funny when vendors claim that they’re the right thing for everybody. Everyone should be focused on a particular part of the market, and that’s the part that we’re focused on.

Corey: And I think that that’s very fair. 
Now, what makes this interesting is you’re also involved heavily with the OpenTelemetry project, which is a CNCF open-source project. Do you find that that is a, I guess, either a diversion of focus, or a conflict of interest, given that you have a private company that’s aimed at something that is very similar, if not identical, from a naïve third-person point of view?

Ben: I really don’t think that LightStep and OpenTelemetry have that much overlap, actually. Any of these companies—or open-source projects, if you were to look at things like Jaeger, Prometheus, that sort of stuff—the problem can be segmented pretty neatly between the acquisition of data, which in this case is telemetry data for observability, and the interpretation and analysis of that data. LightStep has long believed that the acquisition of data really should be something that’s done in the commons. This has a lot of benefits for customers in that integrating with LightStep or anything else that supports OpenTelemetry is a matter of binding yourself to a portable, non-vendor-specific, I don’t want to use the S-word, standard, because it’s not like an IEEE thing, but the point is that it’s a portable framework that you can use to integrate any number of potential downstream analytical tools. That decision is completely decoupled from which of those analytical tools you want to use. I think the fact of the matter is that OpenTelemetry doesn’t actually do anything. It doesn’t present you with a UI. It just gathers data in a way that is vendor-neutral. And that’s the point of that project. The reason that LightStep pursued that is partly just, frankly, trying to be customer-focused. That’s what we think is best for our market. And so we want to bet on that technology. 
And then the other piece of it was that, if you go and talk to people who worked at New Relic and AppDynamics, in their glory days when they were ascending very quickly, they were spending like 80 or 90 percent of their engineering resources on agents, which are actually not even differentiated anymore. For a while there, APM agents were the thing you really were buying, and then the analytical tool was pretty basic. That’s kind of flipped over at this point, where the analytics are getting much more elaborate, mainly because of the rise of deep multi-layered systems, and microservices, and so on. But the collection of telemetry data, people expect it to be automatic at this point. No one has patience for manual instrumentation. And the idea, with things like OpenTelemetry and auto-instrumentation in OpenTelemetry, is to make that a shared responsibility of everyone who’s trying to do observability rather than having every vendor repeat the cost of building all that in a proprietary way. Because that’s how things were several years ago. And the irony of all this is that the vendors that did that work, they’ve spent a lot of muscle marketing those agents, but privately, they’re very excited about getting out of that business. It’s not differentiating for them, and it’s still a big cost center. It’s not something that their customers really benefit from anymore and it takes up a lot of the resources. So there’s pretty wide alignment around the value of something like OpenTelemetry, and with so many different competing vendors involved in the project—at this point, I think there are like eight or nine of us—it’s difficult for anyone to kind of run away with the ball. There’s a governance structure and so on. So, I don’t think there’s much of a risk of it turning into a mechanism for any one vendor to win or lose. The main thing I see is a potential for us to have some kind of rising tide that our customers benefit from as well. 
And that sounds like B.S., frankly, but I mean every word of it. [laughing]. You’re welcome to call me on it.

Corey: No, no, I accept that. One thing as well, that I think has been extremely valuable from the perspective of looking at LightStep and understanding what it is, is the interactive sandbox that really takes you by the hand through using it in a production style environment. It’s handy to help folks really understand what it is, it sort of walks around the, “How the hell do I describe this in an elevator pitch to someone without the same level of experience?” But it is useful and, credit where due, it only demands an email address and not a 15 field form to start playing around with it. So, if people are curious about what LightStep does, I would encourage them to go and take a look at the interactive sandbox at LightStep.com.

Ben: Yeah, I think that’s a good idea, mainly because people often ask us to describe what we do and how we’re different, and we can describe it in words, but we realized what people want to understand is how it’s useful, it’s not how it’s different, and the sandbox environment allows you to walk through scenarios like deploying software, or finding the root cause for an error or performance anomaly in a way that lets you do all the clicking. And you can explore the whole product from there if you want, but it gives you some guardrails just to actually solve the scenario. And people have found it to be quite educational, I think, just in terms of how we built it, and we’ve actually had folks from much larger organizations, like the kind of Googles and Facebooks of the world have been using it as well, to help develop their own internal approach [inaudible 00:29:22]. So, even if you have no interest in LightStep, I think it’s still a worthwhile thing to check out, because a lot of the stuff that we’re showing in there, I think, is actually somewhat novel and just kind of fun. 
So, many people have told us, that’s been a really helpful thing for them just to understand the space better, independent of LightStep.

Corey: Excellent. Well, Ben, thank you so much for taking the time out of your day to speak to me today. If people want to hear more about what you have to say, where can they find you?

Ben: Yeah, you can find me on Twitter. @el_BHS, like the Spanish article, el BHS, and you’re welcome to look me up on the internet and send me email, or whatever. I’m actually pretty good about responding to that. And of course, LightStep is at LightStep.com, and the sandbox really is the best way to understand what we do if you’re an engineer, but you’re always welcome to reach out to me through any channel if you want to provide feedback or ask any questions or any of that stuff. I love talking with folks.

Corey: Thank you so much for taking the time to speak with me today, I really do appreciate it.

Ben: Thank you, Corey. It’s been really fun.

Corey: Ben Sigelman, founder and CEO of LightStep. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on Apple Podcasts. If you’ve hated this podcast, please leave a five-star review in Apple Podcasts, and make sure to instrument it appropriately so that we can trace where it entered and exited.

Announcer: This has been this week’s episode of Screaming in the Cloud. You can also find more Corey at ScreamingintheCloud.com or wherever fine snark is sold.

This has been a HumblePod production. Stay humble.

31mins

25 Mar 2020
