Real World DevOps

I'm setting out to meet interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.

Understanding Observability (and Monitoring) with Christine Yen

About Christine Yen

Christine delights in being a developer in a room full of ops folks. As a cofounder of Honeycomb.io, a tool for engineering teams to understand their production systems, she cares deeply about bridging the gap between devs and ops with technological and cultural improvements. Before Honeycomb, she built out an analytics product at Parse (bought by Facebook) and wrote software at a few now-defunct startups.

Links Referenced:
https://www.honeycomb.io
https://www.heavybit.com/library/podcasts/o11ycast/

Transcript

Mike Julian: This is the Real World DevOps podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.

Mike Julian: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of it out at influxdata.com. My thanks to InfluxData for helping to make this podcast possible.

Mike Julian: Hi folks. Welcome to another episode of the Real World DevOps podcast. I'm your host Mike Julian. This week's conversation is one I've been wanting to have for quite some time. I'm chatting with Christine Yen, CEO and co-founder of Honeycomb, and previously an engineer at Parse. Welcome to the show.

Christine Yen: Hello.
Thanks for having me.

Mike Julian: I want to start this conversation off with what might sound like a really foundational question. What are we talking about when we're all talking about observability? What do we mean?

Christine Yen: When I think about observability, and I talk about observability, I like to frame it in my head as the ability to ask questions of our systems. And the reason we've got that word rather than just say, "Okay, well, monitoring is asking questions about our system," is that we really feel like observability is about being a little bit more flexible and ad hoc about asking those questions. Monitoring sort of brings to mind defining very strict parameters within which to watch your systems, or thresholds, or putting your systems in a jail cell and monitoring that behavior, whereas we're like, "Okay, our systems are going to do things, but they're not necessarily bad." But let's be able to understand what's happening and why. And let's observe and look at the data that your systems are putting out, as well as thinking about how asking more free-form questions might impact how you even think about your systems, and how you even think about what to do with that data.

Mike Julian: When you say asking questions, what do you mean?

Christine Yen: When I say asking questions of my system, I mean being able to proactively investigate and dig deeper into data, rather than sort of passively sitting back and looking at the answers I've curated in the past. To illustrate this, and to compare observability a little more directly with monitoring, especially traditional monitoring: when we're curating these dashboards, what we're essentially doing is looking at sets of answers to questions that we posed when we pulled those dashboards together.
All right, so if a dashboard has existed for six months, the graphs that I'm looking at to answer a question like "what's going on in my system?" are answers to the questions that I had in mind six months ago, when I tried to figure out what information I would need to figure out whether my system was healthy or not. In contrast, an observability tool should let you say, "Oh, is my system healthy?" What does healthy mean today? What do I care about today? And if I see some sort of anomaly in a graph, or I see something odd, I should be able to continue investigating that thread without losing track of where I am, or again relying on answers from past questions.

Mike Julian: So does that mean that curating these dashboards to begin with is just the wrong way to go? Like, is it just a bad idea?

Christine Yen: I think dashboards can be useful, but I think that overuse of them has led to a lot of really bad habits in our industry.

Mike Julian: Yeah, tell me more about the bad habits there.

Christine Yen: An analogy I like to use is when you go to the doctor and you're not feeling well. A doctor looks at you and asks you, "What doesn't feel well? Oh, it's your head. What kind of pain are you feeling in your head? Is it acute? Is it just kind of a dull ache? Oh, it's acute. Where in your head?" They're asking progressively more detailed questions based on what they learned at each step along the way. Honestly, this kind of parallels a natural human problem-solving process. In contrast, I think the bad habits that dashboards lead us to build are the equivalent of a doctor saying, "Oh, well, based on the charts from the last three times you visited, you broke your ankle and you skinned your knee." Pretend you went to the doctor for a skinned knee. You know, "Oh okay, you broke your ankle last time, did you break your ankle again? No. Okay, did you ...
How's your knee doing?"

With dashboards, we have built up this belief that these answers to past questions that we've asked are going to continue to be relevant today. And there's no guarantee that they are. Especially for engineering teams that are staying on top of incidents and responding, and fixing things that they've found along the way. You're going to continue to run into new problems, new kinds of strange interactions with routine components. And you're going to need to be able to ask new questions about what your systems are doing today.

Mike Julian: It seems like with that dashboard problem we have that same issue with alerting. I've started calling this kind of reflexive alerting strategy where it's like, "Oh God. We just had this problem. Well, we better add a new alert so we catch it next time it happens." It's like, well, how many times is that new alert going to fire? Probably never. Like, you're probably never going to see that issue again. Dashboards are the same way. What you're describing, God, I've seen this 100 billion times, where someone curates a dashboard and is like, "Okay, well, first thing now that we have this alert is let's go look at the dashboards and see what went wrong." I'm like, "Well no, graphs look fine." So no problem, but clearly the site's down.

Christine Yen: Yeah, there's a term that we've been playing with: dashboard blindness. Where if it doesn't exist in the dashboard it clearly hasn't happened, or you know, it just can't exist, because people start to feel like, "Okay, we have so many dashboards. One of them must be showing something wrong if there's something going wrong in our system." But you can't. You can't, that's not always going to be the case. To expect that means that you have this unholy ability to predict the future of your system, and man, if people could really predict the future, I would do a lot more things than just build dashboards with that.

Mike Julian: Right.
Rather than just shit on dashboards forever, what is a good use of a dashboard? Like, presumably you have dashboards in your office somewhere?

Christine Yen: Yes. I think dashboards are great jumping-off points. And I actually very much feel like dashboards are a tool, they've just been overused. So I absolutely don't want to shit on dashboards, because they serve a purpose of providing kind of a unified entry point. Right. What are our KPIs? What are the things that matter the most? Great. Let's agree on those. Let's go through the exercise of agreeing on those, because as much as we would like to think that this is a technology problem that can be solved with tools, a lot of the time these sorts of things require humans and process to determine. So let's decide on KPIs, and let's put them up on a wall, but expect and spread the understanding that the wall is only going to tell us when to start paying attention. Dashboards themselves can't be our way of debugging, or our way of interacting with our systems.

Mike Julian: Right. So in other words, that dashboard is going to tell you that something has gone wrong, but it won't tell you what?

Christine Yen: Right.

Mike Julian: I think that's a fantastic thing. And that actually mirrors a lot of the current advice around alerting strategy too: you find your SLIs, and alert only on an SLI, not on these low-level system metrics.

Christine Yen: Yeah, I love watching this conversation evolve. I think at Monitorama 2018, something like three talks in a row were all about alert fatigue. And it's so true, to see these engineering teams fall into this purely reactive mode of, "Okay, well, if this happened, this is how we will prevent it from happening again." And each postmortem just spins out more alerts and more dashboards. Inevitably your people are going to end up in an unsustainable state, with hundreds or thousands of dashboards to comb through. And then their problem isn't, how do I find out what's going on?
It's, how do I figure out which dashboard to look at? Which again is looking at things from the wrong perspective. Dashboards tell you that something has happened, and you need a tool that's flexible enough to follow your human problem-solving brain patterns to figure out what's actually wrong.

Mike Julian: Funny you mention Monitorama. There was a talk, I want to say 2016 maybe, I think it was Twitter, where they had this problem of alert overload, just constant alerts. So they decided, "You know what we're going to do? We're just going to delete them all." Done. Like, "We'll just start over." I'm like, "That's such a fantastic idea." People think that I'm insane when I recommend it, but hey, Twitter did it, so I'm sure it's fine.

Christine Yen: Yeah, I mean, drastic times call for drastic measures. It's funny, being especially in the vendor seat, talking to a lot of different engineering teams about their tools and how they solve problems with their production systems. There is definitely an element of kind of this safety blanket feeling. Right? "Okay, but we need all of our alerts. How will we know when anything is going wrong?" Or, "We need all of our alarms for all time at full resolution." And I get it. I feel that there are patterns that folks kind of get into, and it's how they know how to solve their problems, especially when things are on fire. It feels like you don't have time to step back and change your process when you're like, "No, this is what I'm doing to keep most of the fires under control." And I think this is why communities like yours and Monitorama are so good: we have ways that we can share different techniques for addressing this, so that folks who are in the piles-and-piles-of-alerts hole can dig themselves out of it, and start to find ways to address that.

Mike Julian: Yep, yep, completely agreed. So I want to take a few steps back and talk about monitoring.
There's been a lot of discussion about how observability is not monitoring. Monitoring is kind of, I guess, looking at things that we can predict. We think through, and feel free to correct me at any time here, we think through failure modes that could possibly happen, and design dashboards, design alerts for those failure modes that we can predict. Whereas what you were describing earlier, observability, is not that; it's for the things that we can't predict. Therefore, we have to make the data able to be explored. Is that about right?

Christine Yen: That's about right. For anyone in the audience knee-jerking about that, I want to clarify. I really think of observability as a superset of monitoring. And the exercise of thinking through what might go wrong is still a necessary exercise. It's the equivalent of how software developers should still write tests. You should still be doing this due diligence of what might go wrong. What will be the signals for when it goes wrong? What information will I need in order to address it once it does go wrong? All these are still important parts of any release process. But instead of framing it as, here's the signal, I'm going to preserve it as this one metric and immortalize this as the only way to know if something is going wrong, what we'd say, what Honeycomb would encourage you to do, is take those signals, whatever metric or whatever piece of metadata that you'd want in order to identify that something is going wrong, and instead of immortalizing them, flattening them as pre-aggregated metrics, capture those as events. And you know, maybe it does make sense to define an alert, or define a graph somewhere, so that you can keep an eye on it.
But instead of freezing the sort of question that you might ask, make sure you have the information available later if you want to ask a slightly different take on that question, or have a little bit of flexibility down the road.

Mike Julian: So thinking through all the times that I've instrumented code, hasn't this always been possible?

Christine Yen: It has. I would say not-

Mike Julian: I feel a very large "but" coming on.

Christine Yen: I think that as engineers we are taught to think about, or understand, the constraints of essentially the data store we're writing into when we write into it. We're taught to think about the type of data we are writing, and the trade-offs, and traditionally the two kinds of data stores that we've used, either a log store or a time series metrics store, have limitations. Either they limit the expressiveness of the metadata that we can send, and I'm talking specifically about things like high-cardinality data and time series metrics, where we've just been conditioned that we can't send that sort of information over there. Or, okay, logs are just going to be read by human eyeballs and grep, so I'm not going to challenge myself to structure them or put information potentially useful for analytic queries into my logs. I think that the known trade-offs of the end result have impacted habits in instrumentation. When instead, like you say, all this should have been possible all along. We just haven't done it because the end tools haven't supported the sort of very high-level, flexible analytical queries that we can and should be asking today.

Mike Julian: Yeah, you used a word there that I want to call attention to because it's kind of the crux of all of this, which is high-cardinality. I have had the question come up many, many times of what in the world is it? And it's always couched in terms of like, "I think of myself as quite a smart person, but what the shit is high-cardinality?"
It's one of those things of, I'm afraid to ask the question, because I should know this, like everyone thinks I should. I know it because I had to go figure out what in the world everyone was talking about. So what is it? What are we talking about here?

Christine Yen: I'm glad you asked. This is also why, for the record, our marketing folks have tried to shy away from us using this term publicly, because a lot of people don't know what it means, and they're afraid to ask. So thank you for asking.

Mike Julian: But it's so core to everything we're talking about.

Christine Yen: So at a very clinical level, high-cardinality describes a quality of the data in which there are many, many unique values. So types of data that are high-cardinality are things like IP addresses or social security numbers, not that you would ever store those in your, in any data-

Mike Julian: And if you were, please don't.

Christine Yen: Things that are lower cardinality are things like species, or species of person issuing the request, or things like AWS instance type. Yes, there's a lot of them, but there's far fewer of them than there are IP addresses. And-

Mike Julian: There's a known bound on that, measured in maybe hundreds.

Christine Yen: Yeah. Yeah, and I think the reason that we're talking about this term more, and it's coming up more, is that we are moving towards a more high-cardinality world in our infrastructure. In our systems. And when I say things like that, I'm like, well, 10, 15 years ago it was much more common to have a monolithic application on five micro servers, where when you needed to find out what was going wrong you really only had five different places to look. Or five different places to start looking. Now, even at that kind of basic level, we have maybe, instead of one monolith, 10 microservices spread across 50 containers, and then 500 Kubernetes pods all shuffling in and out over the course of a day.
And even just that basic question, which process is struggling, is much harder to answer now, because we have many more of these combinations of attributes, which then produce a high-cardinality data problem. And I think that's something that people are starting to experience more of in their own lives, and that a lot of vendors or open source metrics projects are starting to recognize they also have to deal with, as an effect of the industry moving in this technical direction.

Mike Julian: One of my favorite examples of this came from days back when I ran Graphite clusters. The common advice was, don't include request IDs or user IDs in a metric name. And for anyone running Graphite that's still a pretty common thing, because if you do, well, it explodes your Graphite server. The number of Whisper files that get created is astronomical. So the end result is that we just don't do it. And like, you just don't record that data, but what you're saying is, no, you actually do need that data. Like, not having that is hampering your exploration and trying to answer the questions.

Christine Yen: Absolutely, and I mean, in this case with request IDs or user IDs, again there might be some folks in the audience being like, "Well, I'm Pinterest and I have the luxury of not having to worry about individual user IDs." And maybe, but I guarantee that there are some high-cardinality attributes that you do care about that are important for debugging. For us at Parse it was app ID. We were a platform, so we had like 10s, 100s, eventually millions of unique apps all sending us taped data, and we needed to be able to distinguish, "Okay, well, this one app is doing something terrible. Let's black list him and go on with our day." And if it's not user ID, for some folks it might be shopping cart ID, or the Mongo instance that it's talking to. Our infrastructure has gotten so much more complicated. There are so many more intersections of things.
In Graphite world you would need to define so many individual metrics to figure out that a particular combination of SDK on a particular node type, hitting a particular endpoint for this particular class of user, was at fault. You'd have to track so many different combination metrics to find out that one intersection of those was misbehaving. But more and more that's our reality. And more and more our tools need to support this very flexible combining of attributes in this way.

Mike Julian: Right. Yeah, the more and more that we start to build customer-facing applications, especially the applications where the customer can kind of have free rein over what they're doing, what they're sending, like, I don't know, a public API, means that one customer using one version of the API, using one particular SDK, could cause everyone to have a very bad day. And if you're aggregating all that, how are you going to find it's them? All you see is just that the service is sucking.

Christine Yen: 100%. Yeah, ultimately we're all moving towards a world where we are multi-tenant platforms, and if not user-facing platforms, then often shared services inside larger companies. Your co-workers are your customers, and you still need to be able to distinguish between that one team using 70% of your resources versus other folks.

Mike Julian: Right. Yep. So it seems to me that there's kind of a certain level of scale and engineering maturity required before you can really begin to leverage these techniques. Is that actually true?

Christine Yen: I don't think that there is. There's no "you must be this tall to ride" bar on the observability journey. There are a number of steps along the way that allow you to use more and more of these techniques, but when I think about teams that are farther along their journey than others, it's often more of a mindset than anything technical or anything like that. Right.
When I think of steps along the observability maturity model, and Liz Fong-Jones, our new developer advocate, formerly with Google, is actually working on something along these lines for release, I think sometime in June, when we think of that, it's part tools, but it's also process and people. And I think that there are some changes afoot in the industry about how people think about their systems. How people instrument. How people set up their systems in order to be observable. Those really all factor into how effectively they're able to pick up some of these techniques and start running.

And choice of tooling is a catalyst for this. Ideally you have a tool that, sorry Graphite, lets you capture the high-cardinality attributes that you want to, but that's only a piece. And I think that we are in for a lot of really fun kind of cultural conversations about what it means to have a data-driven culture. What it means to be grounded by what's actually happening in production when trying to figure out why the things that you're actually observing don't line up with what you expect.

Mike Julian: All right. So you've given a lot of talks lately and over the past year or two about observability-driven development, which sounds really cool. Can you tell us what it is?

Christine Yen: Yeah. Observability-driven development, or as I like to say, to kind of zoom out, just talking about observability and the development process, is a way of trying to bring the conversation about observability away from pure ops land, or pure SRE land, and into the part of the room where developers and engineers hang out.
So my background is much more of an engineer, my co-founder Charity comes much more from the ops side of the room, and we've really started to see observability as basically just a bridge that allows and empowers software engineers to think more about production and really own their services.

And one of the things that I've pressed on in these talks about how observability can benefit the development process is what a positive feedback loop it is to be looking at production even way before I'm at the point of shipping code. There are so many spots along the development process when you're figuring out what to build, or how to go about building it, what algorithm to choose. Or, "Hey, I've written this. My tests passed, but I'm not totally sure whether it works." There are so many spots where, if developers gained this muscle of, "Hey, let me sanity-check my theory with what's actually happening in production," people can ship faster, better code, and be a lot more confident in the code that they're pushing out there in the first place.

My favorite example is from one of our customers, Geckoboard. They're obviously a very data-driven culture; their primary business is providing dashboards for KPI metrics. They were telling me the other day about a project that their PMs were running, actually, and the PMs were the primary users here, not the engineers, where they ultimately had an NP-complete problem to try and solve. And their PMs were like, "Well, we could probably have the engineers go off and try to come up with a perfect solution, or we can come up with like three possible approaches to solving this problem. We could run these experiments in production. Capture the results in Honeycomb. And then actually look at what the data is saying about how these algorithms are performing." And the key here is that they're actually running it on their data.
Right?

There's a realism that looking at real production data gets you that is so much better than sitting around debating theoreticals, because they're able to say, "Okay, well, we've had these three implementations running in parallel, and looking at the data, this one seems like the clear winner. Great. Let's move forward with this implementation." And they can feel confident that it's going to continue behaving well, at least for the foreseeable future, as long as traffic remains the same.

Again, these are bad habits that people have fallen into, right? Where devs are like, "Okay, monitoring is something that I need to add right before I ship it, just so that ops folks will stay off my back when I tell them that everything is fine." Or, "That's what the ops folks are going to look at in order to come yell at me." I don't know, but it's like, that shouldn't be the only time we're thinking about instrumentation. That shouldn't be the only time we, I'm speaking for software developers here, should be thinking about what will happen in production. Because at every stage, you know, more and more people are using feature flags to release their code. Cool. You should be capturing those feature flags in your instrumentation, and alongside, "Hey, cool, what does user X think about this thing that we've feature flagged them into?" you should be looking at, okay, what does the performance of your system look like for folks who have that feature flag turned on or turned off? Are your monitoring, metrics, observability tools, whatever, flexible enough to capture that? Isn't that just as interesting as the qualitative, does user X like this new feature? It's got to be.

And there are so many things that more and more are starting to be part of the development process that observability tools should be tapping into, and should be encouraging, in order to break down this wall between developers and operators.
Because ultimately, you know, as you said, more and more we're building user-facing systems, and at the end of the day the goal has to be delivering a great experience for those users.

Mike Julian: Right. Yeah, we're all on the same team here.

Christine Yen: We're all on the same team.

Mike Julian: So let's say that I'm a listener to this show, but I don't use Honeycomb, I can't use Honeycomb for whatever reason, but I really like all of these ideas. I want more of this for me. How can I get started with it? Like, are there ways I can implement this stuff with open source technologies?

Christine Yen: There are probably some. First you want a data store that is flexible enough to support these operations. Right? So you should be looking for something that lets you capture all the bits of metadata that you know are important to your business. For Parse, to use as an example, that was things like app ID, operating system version of the client operator. Parse was a mobile backend as a service. So we had a bunch of SDKs that you could use to talk to our API. So when we were evaluating the quote, unquote health of that service, it was: which app is sending this traffic? What SDKs are they using? Which endpoints are they hitting? Those mattered to our business, and those are also incidentally much easier for developers to map to code when talking about health or anomalies than traditional monitoring system metrics.

So identify those useful pieces of metadata. Make sure your tool can support any kind of interesting slices along those pieces of metadata that you'll want. And honestly, again, there might be some folks in the audience thinking, "Well, I can do this with my data tools." I don't know how many data scientists you have in your listenership, and it's true, lots of data science tools can do that.
I know that for our intents and purposes, as an engineering team at Honeycomb, we care about real time, so that tends to be something that disqualifies many of the data science tools.

But I think that more than tool choice, folks who are excited about observability, folks who are looking for the next step beyond monitoring, should really start looking at places in their development process or release process where they're relying on intuition rather than data. Right? Where else can we be validating our assumptions? Where else can we be checking what our expectations are versus what is actually happening out there in the wild? This culture and process is really what that observability-driven concept is trying to get at: where can you more regularly, efficiently, naturally be looking to production to inform development in order to deliver a great experience for your customers?

Mike Julian: Yeah, that's fantastic advice. This has been absolutely wonderful. Thank you so much for joining me. Where can people find out more about you and your work?

Christine Yen: The Honeycomb blog is a great place to find kind of a mix of stories and more conceptual posts. Honeycomb.io/blog. I know that we actually also have our own podcast. It's called the o11ycast. I think it's o11y.fn. And of course we have the Honeycomb Twitter feed, and we have a community Slack as well for folks who just want to talk about observability and want to get a chance to play around.

Mike Julian: Yeah, awesome. As a parting story, I was on one of the first trials of Honeycomb, way back when it was still closed, and I can't remember where I read it. It might have been part of the in-app documentation, it might have been something that Charity said on Twitter, but it was like, "Don't use Honeycomb for WordPress, that's not what we're built for." At the time I had about a 100-node WordPress cluster.
So I'm like, "You know what, I'm going to use this for WordPress."

Christine Yen: Awesome.

Mike Julian: I did actually find some interesting things out of it, which I found pretty hilarious.

Christine Yen: Cool.

Mike Julian: So there you go. I believe you do actually have a free trial as well now?

Christine Yen: We do. We have a free trial. We also have the community edition. It's a little bit smaller, but should be enough for folks to get a feel for what Honeycomb can offer. A note about the WordPress disclaimer: I'm glad you got value out of it. I think that's awesome. I would also say that a 100-node WordPress cluster is a whole lot more complicated than what we had in mind when we said that early on. And I think that the distinction that we wanted to make there was, you know, if you have a simple system, maybe you don't need this much flexibility. Maybe whatever you have set up is working fine. Because ultimately, as we've discussed over the course of this podcast, observability involves changes not just to your tooling, but to how you work and how you think about your systems. And that was really kind of a disclaimer to make sure folks were interested in investing a little bit in all of that.

Mike Julian: Yeah. Yeah, absolutely.

Christine Yen: I'm glad you overcame and tried it out.

Mike Julian: All right. Well, thank you so much for joining. It's been wonderful.

Christine Yen: Thank you. This has been a lot of fun. I'm a big fan.

Mike Julian: Wonderful.

Christine Yen: Thanks.

Mike Julian: And to everyone listening, thanks for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes you can find us at realworlddevops.com, on iTunes, Google Play, or wherever it is you get your podcasts. I will see you in the next episode.

VO: This has been a HumblePod production. Stay humble.
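
A minimal sketch of the idea discussed in this episode, capturing raw events with high-cardinality fields instead of pre-aggregated metrics, so that the grouping can be chosen at query time. Everything here is an assumption for illustration: the field names (`app_id`, `sdk`, `endpoint`), the sample data, and the helper functions are hypothetical, and this is not Honeycomb's API or data model. The cardinalities used for `sdk` and `endpoint` in the usage note are made-up numbers; only "millions of apps" comes from the conversation.

```python
from collections import defaultdict
from math import prod

# Hypothetical wide events: one dict per request. The high-cardinality field
# (app_id) is stored alongside lower-cardinality ones (sdk, endpoint).
EVENTS = [
    {"app_id": "app-1", "sdk": "ios-1.2", "endpoint": "/login", "duration_ms": 40, "error": False},
    {"app_id": "app-1", "sdk": "ios-1.2", "endpoint": "/query", "duration_ms": 55, "error": False},
    {"app_id": "app-2", "sdk": "android-2.0", "endpoint": "/query", "duration_ms": 60, "error": False},
    {"app_id": "app-3", "sdk": "ios-1.2", "endpoint": "/query", "duration_ms": 900, "error": True},
    {"app_id": "app-3", "sdk": "ios-1.2", "endpoint": "/query", "duration_ms": 1200, "error": True},
    {"app_id": "app-3", "sdk": "android-2.0", "endpoint": "/login", "duration_ms": 45, "error": False},
]


def slice_events(events, keys):
    """Group raw events by any combination of fields, chosen at query time."""
    groups = defaultdict(list)
    for event in events:
        groups[tuple(event[key] for key in keys)].append(event)
    return {
        group: {
            "count": len(members),
            "error_rate": sum(e["error"] for e in members) / len(members),
            "max_duration_ms": max(e["duration_ms"] for e in members),
        }
        for group, members in groups.items()
    }


def series_needed(cardinalities):
    """Pre-aggregated series needed to cover every combination of values."""
    return prod(cardinalities.values())
```

Calling `slice_events(EVENTS, ["app_id"])` and then `slice_events(EVENTS, ["app_id", "sdk", "endpoint"])` narrows from "which app is misbehaving" to "which combination is misbehaving" without defining anything up front. By contrast, `series_needed({"app_id": 1_000_000, "sdk": 20, "endpoint": 50})` shows how many separate metric names a pre-aggregated store would need to answer the same question for every combination.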


16 May 2019

Rank #1

DevOps is Dead with James Turnbull

About James Turnbull

James Turnbull is originally from Australia but now lives in Brooklyn, NY. He likes wine, food, and cooking (in that order) and tattoos, books, and cats (in no particular order).

He is a CTO in residence and leads startup advocacy at Microsoft. Prior to Microsoft, he was the founding CTO at Empatico. Before that, James was CTO at Kickstarter, VP of Engineering at Venmo, and in leadership roles at Docker and Puppet. He also had a long career in enterprise, working in banking, biotech, and e-commerce. James also chairs the O'Reilly Velocity conference series. In lieu of sleep, James has written eleven technical books, largely on infrastructure topics.

Links Referenced:
Twitter: @kartar
Turnbull Press

Transcript

Mike Julian: This is the Real World DevOps podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps, from the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.

This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools and that's where Influx comes in. Personally, I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization and Kapacitor for real-time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of it out at influxdata.com. My thanks to InfluxData for helping make this podcast possible.

Hi folks, I'm Mike Julian, your host for the Real World DevOps podcast. My guest this week is James Turnbull. 
You probably know James from his seeming inability to stop writing technical books such as Monitoring with Prometheus, The Art of Monitoring, The Terraform book and like a bajillion others. He has also worked for some pretty neat companies too, like Puppet, Kickstarter and Venmo and now he works at Microsoft leading a team as CTO-in-residence. Welcome to the show, James.James Turnbull: Hi Mike.Mike Julian: I'm really curious like what is a CTO in residence?James Turnbull: I guess my primary mission is to make Microsoft relevant to start ups much the same way that Microsoft is shaping its relevancy towards the open source community. We're also interested in looking at other audiences that we've traditionally not been involved with and so it's just one of those.Mike Julian: Gotcha. And you're just leading a team of people that are focused on that sort of stuff?James Turnbull: Yeah, so most of my team is people who've come from startups or and particularly from engineering management leadership roles in startups. One of my colleagues is ... was the CTO of SwiftKey and another one, fairly famous, Duncan Davidson who wrote Tomcat and Ant and has been around engineering management for a long time — and folks like that who really are here to help sort of startups understand a bit more about how to grow and scale. And I think some of the big challenges startups have are actually not technology related at all. They're really about, you know, how do I build a recruiting process? You know, I had 10 engineers last week, I have 100 this week. How do we structure the team? So we've sort of brought together a group of folks who have fairly deep experience in those sort of problems for startups and have sort of a deep empathy for the startup community.Mike Julian: Yeah that's quite the task ahead of you.James Turnbull: Yeah. Look, I think, I mean Microsoft traditionally been known as an enterprise software company. You know, a lot of startups are not sure of their relevance to us. 
I think increasingly we're seeing traction, and there are two things to cover. One is that obviously Azure is one of our focuses, and the cloud platform in there. And that platform is looking more broadly at not just enterprise audiences but other groups. And secondly, a lot of startups ... Microsoft's deep in the middle of most of their customers. So particularly if you're a B2B-based startup, something like that, and you're, you know, you're trying to sell into enterprise business, Microsoft has been doing that for 30 years. They have all the connections and account managers and sales folks and, you know, multimillion dollar relationships with some of the people you want to be customers with. We can provide you with A, some of those connections, but also a lot of advice and expertise about how to sell to those customers.

James Turnbull: And having worked at both Empatico and Docker, you know, a large part of my job was attempting to sell, you know, as a small startup, as a fairly early employee at both, into big companies. You know, you can't walk in the door to Wall Street financial if you're a 30 person startup in Portland, Oregon without having a pretty credible story. So I'm happy to sort of help startups do some of that messaging and understand how to have some of those conversations.

Mike Julian: Yeah. It's one of the interesting things about my own company is my clients are all these large companies too and I'm a two person company, but selling into a very large company is not ... it's nothing like selling to a small company. Everything works differently. People think about their jobs differently.

James Turnbull: Yeah.

Mike Julian: Yeah. I think maybe my most favorite thing of everything you just said is this isn't your daddy's Microsoft. The Microsoft we all grew to know and hate is not today's Microsoft at all. Not by a stretch and that's just absolutely incredible to see that turnaround.

James Turnbull: Yeah. 
I got a LinkedIn request from somebody yesterday and the message said, you know, I've read a bunch of your books and you know, I've used a bunch of different sorts of things you've worked on. I was really surprised to see you at Microsoft. And I was like, okay, this could end badly the next couple of sentences, because I've certainly had a few people of my generation who remember the bad old days and “Linux is a cancer” and things like that. And he finished with, you know, it's really interesting to see companies grow and change. And I was like, wow, okay, that's a ... I thought that was going to go really badly but I think it's a fairly accurate reflection.

Microsoft is aware of the fact that, pragmatically, that was not a good business position to be in. The world is changing. It's moving towards the cloud, you know, the stacks in people's companies, the way they manage things, the infrastructure, the software, you know, things are changing. And I hesitate to say this conclusively, but I think open source won, you know, for certain values of "won", given recent sort of events, discussions about large corporates and their contribution to open source, but as a technology choice, it's pretty clear to me that open source won. And I'm kind of a bit smug about that to be honest.

Mike Julian: Speaking of your books I would just straight up say you're the one that got me into monitoring and you kind of did this unknowingly, like we hadn't met until a couple of years ago, but in 2006 I guess it was, you released a book called Pro Nagios 2.0 and at the time I was working for a very small private school and someone said, hey, we've got like these couple hundred printers and they keep going offline and like, you know, we should probably know when that happens. So I'm like, well I don't know how to solve this problem. So I started googling around and found this thing, Nagios, and then found, oh hey, there's a book on it. 
So I bought the book and like that ... I learned about monitoring that day. Like that's ... Pro Nagios 2.0 is what got me into that and this whole time like it kind of started my career, which was really cool. So thank you for that.

James Turnbull: You're welcome. As I said to you before we started I feel like apologizing too because I can't remember anything that's in the book and I think that it's probably acting as a monitor stand for a lot of folks and I'm pretty sure that my ideas about monitoring were very embryonic, but I really appreciate that. That's always exciting to hear when someone actually is like, this was really helpful. Because none of us are ever going to be John Grisham, right? We aren't in this for the money, and those conversations where someone reached out and said, "I read your book, it really helped," or even, "I read your book, it didn't help and here's why," I'm like, that's awesome. I'm glad somebody got, you know, got something out of it and had some feedback. And so yeah, if anyone who's listening has ever had the urge to tell me what's wrong with things or what went well in the book, my email is easy to find, feel free to shoot me an email. Always happy to chat.

Mike Julian: Yeah, absolutely. Like one of the things that authors don't get very often is feedback, positive or negative. I really expected to get a lot of hate mail for the stuff I wrote in my book and it just didn't happen. I was very disappointed.

James Turnbull: I actually think that ... I was thinking about your book the other day, when we exchanged some emails, and I think your timing was very good. I think people were waking up to the fact that monitoring was evolving and I think that what you had to say was not only very timely, but for me it solidified a bunch of different ideas that I had that I'd previously sort of, you know, I could talk about in abstract or you know, in the solid sort of way. 
I think the first couple I would strongly recommend people should read the first couple of chapters of your book.Mike Julian: Those were my favorite to write.James Turnbull: Yeah, they are one of the better summaries I've read. I guess more modern monitoring.Mike Julian: Well, thank you for that.James Turnbull: Yeah well it's a topic that I think you, I and about 200 people care about but we all do.Mike Julian: When I was ... I'd tell people like hey I am writing this book on monitoring their first response was almost invariably, have you read The Art of Monitoring?James Turnbull: Oh dear.Mike Julian: Do you really think I would set out to write a book without having read every book on monitoring there is? Like that I was somehow unaware of one of the most popular books out there, so I'm like, you know what? I'm going to have him ... I'm going to have James do a blurb on the back of the book and solve that problem forever.James Turnbull: That was a good plan. I looked ... was looking through my Amazon history the other day when I was doing my taxes and I can tell when I'm working on a new topic cause I have literally bought every single book on that topic. Not just writing a book but I decided the other day that I should ... 
I gave Rust a stab last year and I didn't get a chance to do anything with it and I thought I'd give it a stab again and I thought oh I'll buy a couple of books and see what they're like and so I can see the pattern of here's some Rust books and then a few years ago here's some go books and a few years before that here's every book about monitoring, which is not actually a large portfolio but there's enough around.Mike Julian: It is much smaller than people would think.James Turnbull: I bought a bunch of books on around the same time it's like on systems theory and stuff like that because I was struggling to find adequate ways to talk about monitoring as a construct and I realized that the maturity of the vast majority of the conversations out there, you know those on any technical topic, there's sort of like I look at the very bottom of the pile Hacker News post comment thread on there is like the worst case scenario and then there's like a few Stack Overflow answers and then this may be a detailed blog post that's going to explain how to use something and then maybe there's a ... somebody having an opinion about design or the language aspect of some language and then there's like a computer sciency like somebody's thought about things, document and monitoring is very heavily stuck at the blog post end of that spectrum.Mike Julian: Yeah, I completely agree. Like when I was trying to find higher level thoughts on it, they're just not there. The conversation, I think the level of conversation has started to shift in the past couple of years and that's awesome. Like, I really want to update my book now because of all the stuff around observability coming out has changed the conversation dramatically. One of the interesting things is I never even used the word observability anywhere in my book.Yeah.Mike Julian: Like it just wasn't ... people didn't ... 
people weren't talking about it that way so I didn't talk about it that way either.

James Turnbull: Yeah, I was having this conversation with Baron Schwartz, who makes some software for database observability, database monitoring, and Baron is super smart and very much more a computer sciencey person than I am. I realized that there was a bunch of stuff in there ... stuff you'd hear in the way of his thinking that, you know, he was one of the handful of people that had taken commentary about monitoring and observability further than just, you know, that scratched the surface sort of thing. And he's not a person who ... I thoroughly recommend, there's a couple of short things he's written, and his blog posts, that are really interesting sort of reading from ... as I learned stuff that was sort of more high level and useful and overarching than I had previously seen.

Mike Julian: So I want to shift topics a little bit. You and I were talking before we started recording about DevOps and I will start off with a very provocative statement. DevOps is dead. What do you think?

James Turnbull: I think I agree. I was involved in very early days. I was trying to look at it before, when I wrote my first blog post on DevOps, and I think it's like 2008 or 2009, and I think I went to the second DevOps Days. I didn't go to the first. I think probably, and I take some responsibility for this because I worked for a company that sold a DevOps tool, but I think the first time a marketing person A, categorized DevOps as being about tools and B, used it as a somewhat abstract rallying cry, a marketing rallying cry rather than a cultural statement, that's when the first knife was sort of stuck into the entity as it were and I think, yeah, I think I would agree now.

Mike Julian: Tell me more about that. Like what do you mean by that knife going in? 
Like is it really that marketing killed DevOps?James Turnbull: I mean I'm being honest here on marketing now I think it's probably a factor. To me, DevOps was almost nothing about tools, tools to me were enablers for folks doing DevOps things. To me, the big thing about DevOps and the thing that really struck me when I first started thinking about the concept is that I've been doing engineering things for 25 years now. I feel really old and a significant part of the scars that I have as far as those experiences are being on eight ... one of the ... of either sides of the conversation being the developer of a bit of software or the operator bit of software where I've been in conflict with the other party because you know, we either didn't talk about how they built it or they don't understand ... they didn't understand that the environment that they were deploying it into.And most of those conversations happened at three o'clock in the morning on a conference call where a vendor is screaming at us because some mission critical piece of infrastructure is down and they're losing money. And to me that was ... that's been a really ... that was a really scarring experience. And to me, DevOps was about solving that problem. It was about having conversations with the people we work with and going, you're building this thing, here's an idea of what it looks like in production. You know, and by the way, can we make sure that we do this, this, and this to ensure that we care about security and monitoring and you know, backup and recovery or whatever it happens to be and you know, create that sort of bridge between those two disciplines in which really hasn't existed for most of my career.Mike Julian: Yeah, completely agree with all that. What do you think about SRE? Like has that changed things to? To me I feel like SRE is also kind of a marketing label.James Turnbull: Yeah. 
I mean I know a lot of people out at the Google SRE organization and I deeply respect the work that they've done. Yeah and it's definitely ... there are definitely a lot of solid ideas in, like the SRE book is an example of ... I actually ... it was very ... responses to the SRE book were very polarizing, let's put it that way. It was very … I quite liked ...

Mike Julian: That's a very polite way of putting it.

James Turnbull: I quite liked it and I thought it was really useful. What I'm really sad about was that it wasn't published in 2005.

Mike Julian: Yeah.

James Turnbull: When it would have been actually life changing to a bunch of people. A bunch of us who worked in the high volume, high value web facing world. It's a solid ... the SRE program at Google is a solid platform. Not everything applies to everybody, you know, the classic refrain of, you know, you're not Google. I think that needs to be reemphasized a few times. Not everybody has Borg. Like there's definitely a flavor to it, but I can't deny the fact that a part of the reason it was released was definitely as a marketing aid to Google's recruiting in the SRE organization and there's nothing fundamentally wrong with that but it needs to be acknowledged as one of the origins of that movement.

Mike Julian: I observed a conversation happen recently in a Slack where someone released a series of articles, fantastic articles, and they were referring to an important measurement as a KPI and someone responded just like, why didn't you call it an SLI? It's like, well, because an SLI is something that essentially Google came up with. We've been using the term KPI to mean an important metric for, I don't know, decades and the term SLI is less than 10 years old.

James Turnbull: Yeah. I mean but, you know, this is one of those things like the ... every generation reinvents the past, right? You know, I'm going to say something controversial here. 
I look at the way Kubernetes is configured and I look at the sea of YAML files that I'm expected to poke my way through. There's still some tooling around that and I'm like, did we learn nothing from the horror that was configuration files? I mean it could be worse. I was having the argument the other day, it could be worse, it could be XML, but I'm like, so I could be stabbed in the front and the back. But it feels very ... it feels weird to me that we, you know, there's a bunch of lessons we haven't learnt and a bunch of things that we have reinvented the wheel about.

So, you know, I kind of ... I'm not really fussed about the terminology people use. I'm not even fussed about sort of recognizing that there's a past history there except to hopefully learn from it as long as people take it on board and go, you know, in the case of SLIs and KPIs, it's like you have a customer, they have a measure of how successful they are, you know, that should mirror your measurement of how your ... the functionality of your infrastructure or the thing you look after for them. And if they have that sentiment, I don't care what they call it, you know, an SLA, an SLI or a KPI. Yeah, I think debating about that is a funny one.

Mike Julian: So you mentioned the sea of configuration files in Kubernetes, which is absolutely wild. Like yes, it's like we didn't learn anything at all. That brings me to, do we even need DevOps anymore? Like on one hand we're still making the mistakes that we used to make, on the other things are very different than they used to be.

James Turnbull: Yeah. I think that there is no yes or no answer to that statement. I think it's a bit more nuanced. Obviously there's a bunch of things we haven't learnt and I, you know, I overheard a conversation at a KubeCon a couple of years ago where two fairly ... 
I would say 20 something looking engineers were talking about the fact that it'd be so much easier if there was some sort of templating system for configuration. It would make it so much easier if we could build templates and stuff. And I was ... I have no hair anymore, so I wasn't pulling my hair out, but I was mentally doing it. I thought don't say anything James, you'll look like an old fart. Like just turn around and walk away, go to the bar, have a quiet drink. But, so yeah, definitely we need to ... we should learn from the things that came before us to make the experience of the people maintaining these systems at least as good as if not better than the experiences we had.

But that being said, Kubernetes is an example of how far up the stack we've moved. You know, back in the day I spent a lot of time worrying about Linux kernel modules and package management systems and iptables and stuff like that. To a large extent those are not skills that are relevant to a lot of contemporary engineers who are working on say container based systems because that's all black box to them. It's taken care of under the covers for good or ill, you know, they're running a Kube cluster on top of a machine that they may not even maintain or may not even know anything about. So perhaps some of those, the problems we had in the past might not exist anymore perhaps. I don't know, I've ... never been a huge fan of black boxes either so.

Mike Julian: It seems to me that we've ... we have moved some problems around, like some of the problems are still there, we just don't see them anymore or like we've made them someone else's responsibility. Like with, say, Amazon or Azure, pick your cloud provider of choice. I don't have to care about the network anymore except I kind of do, but it's an entire black box to me so when something is kind of hinky, I can't really do anything about it anyways.

James Turnbull: Yeah.

Mike Julian: So there's that whole discipline of network engineering that has ... 
where a lot of systems people who were also amateur network engineers are not anymore.

James Turnbull: Yeah, I think there ... I mean the argument the cloud providers make, and I think it's a reasonable one, is that economies of scale apply not just to cost, they apply to, you know, stability and availability and you know, the ... in the vast majority of cases the 80/20 rule applies and you don't need to care about the fabric between your infrastructure. In the cases where you do, like, let's say I'm a high frequency trader or something like that where I care about every picosecond between me and the pipe out the building and me and the trading floor. Yeah, maybe you're not running in the cloud, right? Maybe you're running on, you know, custom built high-performance machines with incredibly tuned kernels and network stacks. Does everybody else, you know, need that? Probably not, but you're right, it does make debugging more complex and potentially problematic.

Mike Julian: So I think all that is pretty interesting. And there's also something you mentioned before we started recording about Puppet, Chef, config management in general, being significantly less relevant than it used to be. I remember at my last full-time job, most of the work I did was writing Chef and Puppet, like it was a whole lot of orchestration and config management of like how do we build a system, and now like as far as I can tell, no one's really caring about that anymore. Like we have ... the problems have moved further up the stack.

James Turnbull: Yeah I agree and I've been saying for a few years and I think if I look at some of the work that's come out of some companies and you know, HashiCorp too, to some extent. 
The important thing about configuration management is the lessons learnt about configuration management.Mike Julian: You know, right.James Turnbull: And the fact that the abstraction has moved up the technology stack should be like, you know, how do we apply the lessons we learnt managing infrastructure level components and managing application and service level components. Orchestration is not a solved problem by any stretch of the imagination and things like microservices make things considerably more complex.Mike Julian: Yep.James Turnbull: You know obviously they're very flexible in many ways but you know all of a sudden you have 300 little services that talk to each other via various ports and require different levels of security and AAA, you know with the required different pieces of configuration like this is a non trivial problem and guess what, we've actually solved some of these non trivial problems before why some of these companies hopefully will reinvent themselves to be in that space and I see a little bit of that happening now. We'll just see who survives I guess.Mike Julian: Right. I have a few last questions for you. Of the bajillion books you've written, which ones have been your favorite? Like what was the most interesting one to write?James Turnbull: I think it's probably the art of monitoring. There's a lot wrong with that book and a good amount ...Mike Julian: As always.James Turnbull: I went into an obsessively deep hole and I wrote a 700 page book, which is very focused on technology, using technology stack to articulate what is effectively a change in thinking. And I did that because I ... everyone had a conversation with, I was like, who's going to buy a book about the theory of monitoring, but people might buy a book that has like configuration files and technology and shows you how to do things, maybe that'll work better. 
And of course I realized that I would have written a much shorter book, and probably not have spent a year and a half of my life buried in complex configurations, if I had written a theory of monitoring book, and it might've been quite timely.

So yeah, I think it's probably my favorite book, but it's also my least favorite one too. And there's definitely ... there were some terrible ... I brushed over some topics that I probably should have covered in more detail and there's a couple of ... I recently found a calculation error in one of my graphs that a Russian PhD student pointed out to me and I was like, huh. It's not a big miscalculation, but it's enough that I felt embarrassed and I went back to this guy with, I'd done a calculation wrong and a formula wrong. And, but yeah, so there's moments where I'm like, how many people saw that and thought, what an idiot. So that's ... and that's never good.

Mike Julian: I had that happen recently with mine. My book is being translated into Japanese right now, which is super cool and the Japanese translators are thorough.

James Turnbull: Okay.

Mike Julian: They have found so many errors and so many like typos, but some of them are calculation errors. They don't fundamentally change what I'm talking about or the illustration, but it's one of those like, huh, I really did screw up an average calculation.

James Turnbull: Yeah, I did that too… but yes this Russian PhD student, he was hilarious. He's like, I just don't understand how you got this number. And it was ... there's no spreadsheet. There's no formulas in there. There's just like a graph and he's obviously smart enough to look at the graph and go, that's wrong. And he was very polite about it, but he genuinely thought I might've discovered a new branch of math as opposed to me making a terrible mistake, which I thought was flattering and horrifying at the same time.

Mike Julian: Absolutely. So I noticed there's ... 
seemingly, there's a trend with how you write books. To me looking from the outside, it's you start writing on a topic right about the time it hits mainstream. Whether that's true or not, that's how it seems to be. So what's ... what can we look forward to next from you? Like are you thinking about writing any new books?James Turnbull: I am contemplating it. I had ... I've had a long dry spell of just writing sort of bits and pieces for internally. I'm writing a bunch of content for Microsoft right now. I'm thinking about writing something again, something technical. I feel like maybe service mesh is probably somewhere in this space that I'm interested in, but I don't see anything in there yet that sort of resonates with me as a solution I want to write about. But I think that's the space I'm going to watch. I would love to write a book about startup engineering practices and about [inaudible 00:29:59].Mike Julian: That could be fun.James Turnbull: But I think that I've been beaten to. Camille Fournier wrote The Manager's Path, which is, to me, every time I read it I'm like, I can't do any better than this. This is an awesome book.Mike Julian: It is a very good book.James Turnbull: And so I feel like that position has been taken, but I've had some thoughts about like the startup and things like that. Like I think there's definitely ... we're definitely in a different era and there's definitely some lessons learned and you know, particularly things around topics like work life balance and ethics and diversity and inclusion where an update to some of the seminal ideas about startup, the way startups work might be welcomed.Mike Julian: Yeah. All that sounds great to me. Where can people find out more about you and your work?James Turnbull: Probably the easiest place is Twitter. I'm one of the dying generation that uses Twitter, so my Twitter handle is @kartar and if you're interested in my books, turnbull.press will ... 
is my grandiose imprint, and that lists all of the books and the topics and so forth, and that's probably the easiest way to find me.

Mike Julian: James, thank you so much for coming on the show. It's been a pleasure to chat with you.

James Turnbull: You too. Thanks so much for having me.

Mike Julian: And thank you to everyone else listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at RealWorldDevOps.com and on iTunes, Google Play or wherever it is you get your podcasts. I'll see you in the next episode.

Announcer: This has been a HumblePod production. Stay humble.


9 May 2019

Rank #2

Similar Podcasts

Avoiding & Treating Burnout with Dr. Sherry Walling

About the Guest

Dr. Sherry Walling is a clinical psychologist, entrepreneur, international speaker, yoga teacher, podcaster and best-selling author. She works with leaders and entrepreneurs around the world to help them tackle the many mental health and relationship challenges that go along with building a great business.

Her best-selling book, The Entrepreneur’s Guide to Keeping Your Sh*t Together, is a handbook for navigating life as an entrepreneur.

Married to a serial entrepreneur, Sherry combines her extensive experience helping people who have high-intensity jobs with her 18 years of personal experience in the trenches of the startup world. Sherry combines the insight and warmth of a therapist with the truth-telling mirth of someone who has been there.

When she’s not in the consulting room or hopping conferences, Sherry can be found on her paddleboard, in the yoga studio, or ushering her three kiddos through an art museum in some fabulous city.

She can also be found at ZenFounder.com, SherryWalling.com or on Twitter as @ZenFounder.

Links Referenced:
Book recommendation: Play it Away by Charlie Hoehn
Book recommendation: The Happiness Advantage by Shawn Achor
Sherry’s “How to Stay at the Top of Your Game” talk at BoS USA 2017
Managing Founder Stress Guide

Transcript:

Mike Julian: Running infrastructure at scale is hard, it's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O’Reilly's Practical Monitoring.

Mike Julian: This episode is sponsored by the lovely folks at InfluxData. 
If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source, and they also have a hosted commercial version too. You can check all of this out at influxdata.com.
Mike Julian: Hi folks, I'm Mike Julian, your host for the Real World DevOps podcast. I'm here with Dr. Sherry Walling, clinical psychologist, author and fellow podcaster at ZenFounder.com. Welcome to the show, Sherry.
Sherry Walling: Hey, it's my pleasure, Mike. After a couple of reschedules, I'm glad we finally got this to come together.
Mike Julian: It was a bit of work to make that happen, but I'm really excited about this episode. So Sherry, why don't you start off by just telling us a bit about who you are and what you do.
Sherry Walling: Yeah. So like you said, I'm a clinical psychologist and I have spent my professional life working with people who have high intensity jobs. That has looked like a couple of different things over the course of my career. Sometimes it's with folks in the military returning from military service. I've worked a lot with physicians, who have high intensity work either in the ER or in surgery. Then I actually work a lot with software entrepreneurs and software folks. My connection to the developer world is largely through my husband, Rob Walling, who maybe some of your audience know from MicroConf and things like that. So I work with really smart people who are trying to do really hard things, and often have a lot of pressure in their work lives.
Mike Julian: Yeah, that's some awesome stuff.
The reason I'm especially interested in talking to you is that as DevOps engineers, we lead extraordinarily stressful lives. A day in the life of an operations engineer… we’re project driven, and yet it's often interrupt heavy. So we're never really finishing anything, thanks to putting out fires constantly. It's just so many fires all the time, everything is always awful, everything's just a tire fire everywhere we look. So we're responsible for keeping systems of multimillion dollar, sometimes multibillion dollar companies running and available, and on call is a standard part of the job. And sometimes this is really bad, like the on call rotations at some companies might be 15 pages a night for a week or two weeks at a time, and it just gets insane. This is not just a few years or one job, this is like an entire career. So to say that the role of an Ops engineer is stressful is kind of an understatement. So I'm really excited to talk to you about how can we manage the stress? Is there something we can do about it? How can we improve our lives?
Sherry Walling: Well, I'm really empathizing with the way that you're describing this, and there's a couple of things that you said that would lead me to believe that folks who are doing this kind of work are really at a pretty high risk for burnout. Whenever there's high work demand, so lots of things to do without having appropriately accomplishable goals, goals that have a tight timeline and clear successes, those three things tend to be a combination that really causes burnout in a lot of folks. So I can empathize with the amount of stress that folks are feeling, and I'm certainly guessing that some of your listeners are really struggling at this point with stress related difficulties, or things like burnout.
Mike Julian: Yeah, absolutely. So why don't we talk, why don't we just start with burnout? There's so many different ways that I want to tackle this conversation, but why don't we just start with burnout?
What is burnout?
Sherry Walling: Yeah, so burnout is a newly recognized clinical syndrome. It has its own diagnostic code. It is a real thing.
Mike Julian: That's fantastic. It's about time.
Sherry Walling: It is about time. It exists in ICD-10, which is the International Classification of Diseases 10. It's called something like “burnout, state of vital exhaustion,” is the technical title, which I think is really lovely language that paints the picture of what this feels like. Vital exhaustion. So burnout is something that was researched by Christina Maslach, a psychologist at UC Berkeley. She spent her whole career, like 40-something years, really developing this and researching it thoroughly. Burnout really has three components. It's a sense of, first of all, emotional and physical exhaustion. So people are just tired, they're physically tired. They might have a cold that they just can't recover from. Their bodies are broken down and worn out. Then emotionally exhausted, which can look like depression, like flat, apathetic, not a lot of oomph or passion. So physical and emotional exhaustion is number one. The second component of burnout is feeling detached or cynical. So this really is where you feel like the work that you were doing doesn't have a lot of value, and the people that you are serving or working with are just irritating the hell out of you. You don't care so much as maybe you used to about what it is you're trying to accomplish. You don't find it rewarding or meaningful. Then the third component that's used to sort of technically diagnose burnout is a lack of personal efficacy. So feeling like no matter how hard you work, or what accomplishments you may be able to brag about, it feels like you're not getting anything meaningful done. You're just not pushing the needle forward, or you're not able to push the boulder up the hill. So burnout is specific, and it is specific to how someone is doing in their mental health in relationship to their job.
So you don't get burnt out from being in a crummy relationship with a girlfriend or boyfriend. You have other problems perhaps, but burnout is very specific to your job.
Mike Julian: Man, I'm just having flashbacks. Just so many jobs where I was checking the boxes on all three of those pretty hard. Is burnout inevitable? Is it just going to happen when we're having these stressful jobs?
Sherry Walling: Yeah, some of the epidemiological studies suggest that between 25 and 30% of adults experience burnout in a given year. So it's really common.
Mike Julian: Well, that sounds terrible.
Sherry Walling: A lot of people experience it, and that stat's from the US and then there's a similar study for the UK. It's not inevitable in the sense that there are some things that protect people from burnout, even in really high intensity jobs. So again, some of the folks that I've worked with over the years, some of them do incredibly difficult, really hard, stressful things, but they do okay. So it's not inevitable, but it depends a lot on how much control you have over your work environment, and then what you're doing in the other parts of your life to help protect yourself from burnout.
Mike Julian: That's a fantastic point. I went through a super bad case of burnout some years ago, and during the course of this, I found a book. It's called Play it Away, by Charlie Hoehn. It's a fantastic book about his experiences with burnout. One of the things that he really talked about was he had realized he stopped hanging out with friends. He had stopped doing hobbies, but all this stuff crept up on him. It wasn't an immediate thing. It wasn't like, “You know what? Screw my friends, I'm not going to talk to them anymore.” It was like over the span of say six months or a year, you realize looking back that you haven't had dinner with a friend in a while. You haven't had dinner with any friends in a while. So it comes up on you slowly.
Is that part of your experience as well?
Sherry Walling: Yeah, I think that's very true. It's a slippery slope. Most of us start our work feeling pretty positive, or pretty ambitious about what we're going to get done and how we're going to contribute to the world. Then over time, some of that optimism and energy just gets eroded away. Hanging out with people, having strong social connections is one of the most important protective factors that helps prevent burnout. Some of the other ones are being able to celebrate successes, which I hear is hard in your community when you have all these projects, and there's not a point where it's like, okay it's done. But when you do have those projects that ship, when you are finished with something, taking the time to really celebrate that and let your brain on a neurological level feel the positive chemicals that come along with the satisfaction of finishing something. People don't do that very often or do that very well, but it's really important.
Mike Julian: Yeah. I've seen someone, it's actually a whole bunch of people now, have a practice for every minor win. No matter how small it is, they celebrate with a cupcake to themselves. At first, I thought like I want to save that for something big and then I realized no, actually you can't. We don't often get big wins and we can't ever predict when they're going to come, so you celebrate the wins as you get them.
Sherry Walling: Yeah. There's all kinds of different ways that people can do that. A lot of people do it with food or sweets or a special drink. Having a bottle of wine that you write a sticky note for and you say like, “I will drink this bottle of wine or I'll open this wine when I finish this project, or when this thing launches” or something like that. Having those designations both for short term goals and bigger goals is helpful.
You might have a cupcake a week, sort of on a weekly reward schedule, but then give yourself a bigger treat when something bigger is accomplished.
Mike Julian: Yeah, I love it. So we've talked about a couple of different ways to mitigate and detect when you might be burning out, or have already crossed that threshold. Are there any others?
Sherry Walling: One of the most important things that helps protect people from burnout is feeling attached to the meaningfulness of their work. Like really having a strong sense of I'm going to do a good job writing this email, or fixing this problem because it is ultimately meaningful to my customer, to my community, to my business, to my whatever. People define that in lots of different ways, but the bigger the gap between what you're doing on a daily basis and what you find meaningful, the more vulnerable you are to burnout.
Mike Julian: Yeah, that makes sense. It's hard to get excited about being a system administrator for ad networks.
Sherry Walling: I don't know, I mean is it an interesting problem to solve? Is it sort of like actually interesting?
Mike Julian: For me. Yeah for me, I don't think I could do it. It's just nowhere near my interest, and I think for a lot of people that are listening, the work that they're doing is not that interesting to them. Whether it's not intellectually challenging, or they don't like the company they actually work for, they don't agree with them from a philosophical standpoint; so the gap may be pretty big. What are some ways to deal with that?
Sherry Walling: You do have to listen to yourself as to how tolerable it is. One thing is to make sure that you're really optimizing for meaning in other parts of your life. You do the 9:00 to 5:00, you collect the paycheck and that paycheck allows you to live the life that you want to live. So you have other parts of your life that you find deeply meaningful, which sort of offset the not super meaningful part of your 9:00 to 5:00.
I don't love to leave people there for years on end. I don't think that's a recipe for a satisfying healthy life over the long run, but I think we all go through seasons where we're like, “I could take or leave this day job, but hey, at least I can afford my model train hobby, or the great vacation I'm taking with my husband or wife.”
Mike Julian: It sounds like that one might be not a long-term strategy, but more of a short-term strategy to get your mind back on track, to even bring yourself closer back to being healthy, so that you can think about longer term solutions. In that situation, a long-term solution might be getting a new job.
Sherry Walling: Yeah, sure. I think one of the ways that we can reconnect with meaning is to notice what we're grateful for, especially if we're bummed out about what's happening at work. It can be a really helpful practice to at the end of each day or some point of your day, just jot down three to five things that you are grateful for, that you are happy about in your life. That sounds maybe like a little bit hippy dippy, or touchy feely, or something, but there's actually a tremendous body of research behind the psychological benefits of a simple practice like that.
Mike Julian: Yes. I had the pleasure of seeing Shawn Achor speak, I guess last year, and I think he wrote the book, The Happiness Advantage. I think that was the name of it. And he talked about this body of research. He was basically going around to companies and saying, “Let me practice on your team.” Gratitude journals were one of the things he did. We're going to have everyone on your team every morning, they're going to write down three things they're grateful for. Then we're going to measure before this project and after this project, how they feel about the work they're doing, and we're just going to measure it by asking them, “How do you feel about your job?
How do you feel about your life?” He says there were noticeable improvements despite the job not changing. Everywhere he went, he repeated this study and it just kept coming out the same way. What essentially the research is saying, you are happy when you think you're happy. Like you can choose to be happy.
Sherry Walling: That our emotions are largely a state of mind. I say that somewhat gently, because I do think that that can also lead to some frustration on the part of friends and family of those who are in burnout, or who are experiencing something like depression to say like, just pull yourself together, you can choose this because it's certainly not that straightforward.
Mike Julian: Yeah, absolutely. That is probably the hardest part of this. You're with a friend and they've just lost a loved one and all you have to say is choose to be happy. Yeah, that's not going to get anyone anywhere, like it's awful.
Sherry Walling: Yeah. I think the gratitude work is really important in protecting against burnout, and in part of the recovery from burnout, but when someone is really truly in the midst of burnout, there begin to be these neurological changes in their brain. So the amygdala, which is the fear center becomes overly active and really floods the system with cortisol.
Mike Julian: That's interesting.
Sherry Walling: So we can measure changes in the brain. The other part of that that happens that's really problematic is that the connections between the amygdala and the prefrontal cortex, which is where we make all of our fancy human decisions and plans. Those connections begin to weaken, which means that the skills that we would use to talk ourselves through our feelings, or to even actively choose a different way of feeling, those resources become much less available to us just on a brain level when we're in the midst of burnout.
We see neurons begin to die that connect those kinds of brain circuitry, that help us to choose our way of feeling.
Sherry Walling: So again, if you're sitting with somebody who's in burnout and you're like, “Can you just pull yourself together please? Like, oh my God, why are you choosing to be miserable?” It may be that they've gotten so far down this path that they're really impaired in their brain's ability to make a choice about how they feel. At that point, people really often do need to take a pretty significant rest so that the brain can begin to repair itself. That can also happen. It's not a death sentence for your neurological health. So to be burnt out, there's possibility to bounce back.
Mike Julian: That's exactly what happened to me. I was so far down the burnout hole that I snapped at a coworker one day, which is like, it's not me. It was pretty bad. The guy nearly started to cry and my boss called me and he's like, "What did you do?" He's like, "You take the weekend off because we need to have a chat." I decided I was going to leave the company. After the weekend I'm like, I got to go, I got to leave. He's like, "So where are you going to go?" "Well, I'm going to a remote beach in Costa Rica." He's like, "Is that a euphemism for another company I'm not aware of?" Like, "No, no, it's actually a remote beach in Costa Rica." I stayed there for months. I just sat around doing nothing. Just the very idea of being on a computer, it made me nervous, it made me anxious.
Sherry Walling: Yeah, well you're perfectly representing what often needs to happen when people need to recover from burnout. It's usually a minimum of two weeks of really deep rest. Sounds like you were able to take more time than that, which is great and perfect. I hope everyone ... I wish we had sabbaticals, right? So I used to be an academic and every seven years, you apply for sabbatical. You don't just go lay on a beach somewhere, although you can, you work on a project.
You're learning something, you're writing something, you're bringing new life into your academic or intellectual career. I wish that all careers had that, because I think all thought workers could really use that three to five months off to just reset.
Mike Julian: I completely agree. Yeah, I completely agree with you. That sounds fantastic. So we have two problems here. One is for those people that are so far deep down into this, and they can't take that time off. Then we have the people who are maybe not quite that far down, but they're feeling the effects of burnout. So it sounds like the solutions to these two are different solutions.
Sherry Walling: Rest is always helpful. Relationships are always helpful. Even if you can't do weeks of rest, you can do evenings where you're not working unless you're on call, or weekends when you're technology free. Obviously, unless you're on call, but taking clean breaks as much as you can. So rest is always helpful. Relationships are always helpful. Creating or connecting to as much meaning in your life is always helpful. Celebrating any rewards that are, or any positive things that are happening is always helpful. Setting clear goals for yourself is always helpful, even in someone who's burnt out. Maybe the goal is like, I'm going to read three non-technical books, or I'm going to have coffee with three friends in the next month. So you have that sense of here's what I'm setting out for myself, and here's what I'm chipping away at.
Mike Julian: Okay. Yeah, that makes a lot more sense. So it's not just like, nope, everyone has to go take four months off now. You really can have smaller things that you can do, no matter what your situation is. So for some people it might just be to set a strict work schedule. Like yes, you have been working for your company 60 plus-hour weeks, cut that down to 40 and like a strict 40.
So say 5:00 or 6:00 PM every night, stop working.
Sherry Walling: I think it's tough in the field that you're in, because this around the clock or mostly around the clock expectation for work availability is really, really contraindicated for human mental health. I will say that of course, in the United States and to some extent in the UK, or I'm sorry in the EU, depression is the number one driver of disability in the US. Most people who are missing work or who are taking large chunks of time off work are doing so because of depression, which of course overlays with burnout and other things. So this is a really deep problem.
Sherry Walling: I do think that for those of us who have positions where we have some power to advocate for changes in our work environments, this is a worthy cause to take up, because 60-hour work weeks or being on call over the weekend, that's fine for a short duration, but if that is your work life forever without significant break, that's almost always a recipe for really significant trouble.
Mike Julian: Let's look at it from the perspective of a concerned friend or family member. If I have a friend who I know is burnt out but they don't agree, but I see all the signs, what can I do to help them?
Sherry Walling: I think it's hard when you're having a conversation with someone and they just flat out disagree. I think when we use language like, "Hey, I'm concerned about you. It seems like you're really carrying a lot right now.” Or “Wow, you've been working so many hours. How are you holding up?" I think when we use this more neutral language that expresses how we feel, but isn't finger wagging at someone-
Mike Julian: Not accusatory.
Sherry Walling: Yeah, or not pathologizing. Not saying like, “Wow, you seem really depressed. You look really tired.” Nobody wants to hear that. I think to some extent when we drop the seeds of like, "Hey, I've been just talking with a lot of folks lately who are really burnt out. I've certainly been really burned out.
I see how hard you're working. I just worry about that for you." Those kinds of sentences, they normalize it. You're not being like wow, you're weak, you're lazy or whatever. You're also telling your story or sharing vulnerably, but like not in this way that's not invited. You're not lecturing. So I think the way that we approach the conversation is very important, and sometimes you're a squeaky wheel a little bit. Especially if someone just disagrees with you.
Mike Julian: Right. Could there be something like very simple steps that I could take of, so the suggestions on wording on how to approach a conversation would be super helpful. You've also mentioned that having friends, having hobbies, having relationships in general is also very helpful on this. So if I have a friend who is consistently bowing out of going to some event with me, going out for coffee in exchange for doing work, how can I handle that? How can I pull them out of that, pull them out of the hole?
Sherry Walling: I think consistently showing up and meeting people where they are. I think when someone is really in a bad state, maybe they're at their house and they're on their laptop all evening, maybe you just drop by, bring some guac and chips, and just sit with them. Maybe you're on your laptop too. It's not a deeply communicative interaction, but just feeling the presence of another person can be really helpful. Also suggesting things like let's just walk around the block. We don't have to have a two-hour hang out, like let's just walk around the block for a couple of minutes. I'll pop by and then I'll go and you can finish your work. So being a little bit pushy in a graceful way, and respecting that you ... I mean even me as a professional helper, I can't force people to do more than they are ready to do, period. You can show up, you can drop hints, you can make it easy, you could be available. You can give resources, but you have to wait for people where they are, which is really hard.
Mike Julian: Yeah, absolutely especially when it's someone you care about and watching what they're going through, it's hard to sit there and know that you can't cause them to be better. You can't make their situation better yourself, it's they have to do it. You can help, but sometimes all you can do is just watch and be there.
Sherry Walling: Yep, and hanging around, dropping by with a snack. Those are pretty powerful things. They don't feel like we're doing much, but they're pretty powerful. Again, when someone's feeling bad, sometimes a 10-minute conversation is all that's really tolerable.
Mike Julian: Yeah, absolutely. So let's talk about post-burnout. When I recovered from my own burnout, the thing that was on my mind was I want a stress free life. I was, it's like the extreme reaction to what I had just come out of. You know, sitting on a beach in Costa Rica is pretty stress free.
Sherry Walling: I don't know. There's some big iguanas I've seen there down in Manuel Antonio.
Mike Julian: That's true. They are huge.
Sherry Walling: That's a little stressful.
Mike Julian: It was a little stressful the first day I was there because you wake up and it's like 6:00 AM, and there's iguanas running across the tin roof, but you have no idea what they are because it's your first day there. So you’re like, what the hell is on the roof?
Sherry Walling: Thud, thud, thud, thud, thud, thud, thud.
Mike Julian: Is it going to eat me?
Sherry Walling: Maybe.
Mike Julian: That was a little stressful. So for me, when I came back from this and I thought I want a stress-free life, and I very quickly realized that's actually kind of a bad thing. I got really bored. That led me to this question of is all stress bad? What's your experiences of that? Is there a balance? Is there such thing as good stress?
Sherry Walling: Absolutely.
Mike Julian: Okay. So what does that mean? What is the difference between good stress and bad stress?
How do I tell the difference?
Sherry Walling: Well, this is going to sound so geeky, I'm sorry. I'd like to quote the Yerkes-Dodson principle. So there's this old school Psychology 101 study, looking at the relationship between stress and performance. It's called the Yerkes-Dodson principle and it's a beautiful inverted U. So it says that as stress rises, performance also rises, but when stress goes past the tipping point or past the midpoint, then performance begins to drop. You can picture that inverted U in your head maybe. Stress is simply a measure of… Physiologically, it's a measure of activation. So stress feels very much like excitement. Very much like passion. It's the same part of our bodies that are like, “We're awake, we're wired.” So stress is motivating. Stress helps us feel keyed up, and awake, and focused. So like you're identifying, no stress means we're really not fully awake, or functioning at any level of intellectual or physiological stimulation. Like we're just existing, but the magic is finding the point at which your stress helps you, and really being wise about the point at which your stress becomes too much, and it starts to inhibit your ability to do a complicated task or be a kind person.
Mike Julian: Yeah. When I came back from my sabbatical, I took a job that was very low stress. It was very intentional. I took a job that was slightly lower paying but had much, much lower stress levels; but I wasn't really doing anything interesting. I sat around basically doing absolutely nothing of interest to me for like four to six months. I realized like, I'm really bored. So I realized I didn't have a big project. I didn't have anything I was working towards. I didn't have anything pushing me to do anything really. So for me, I decided to write a book. You want to talk about stress, like there's one. To me that was a good sort of stress, because it pushed me along.
Now at the end of writing a book, as I know you've done too, you start to tip into the bad stress a little bit.
Sherry Walling: Yeah, for sure especially the marketing part for me. I was like, I don't know how to do this.
Mike Julian: No, it's awful. You've spoken before about this idea of acute versus chronic stress. I can maybe guess about that, but why don't you tell us, what does all that mean? What is acute stress? What is chronic stress?
Sherry Walling: Yeah. I think before diving into the differences of those, we have to tell the truth about the fact that we are, our emotional mental life is very integrated into our bodies. So we don't really have these dichotomies between emotional health and physical health, or mental health and physical health. It's just all one system. So when we're talking about stress, we are talking about how the body responds to elevated stimulus, which sounds not very sexy, but we have these amazing bodies that are hardwired to protect us in the event of stress. You can think of the fight or flight. How do you, when you're threatened, your heart rate elevates, your breathing becomes faster. It moves into your chest, your muscles become tense. You're ready to get some shit done, or run for your life. Either way.
Mike Julian: I mean, you know, whichever.
Sherry Walling: So that is the perfect use of acute stress. Something is threatening, something is in your face, something is elevated and your body needs to live in a different space for a short period of time to respond. When that process goes on too long, when we live at a constant state of elevated stimulus, when the demands that we experience in our lives are beyond what we can meet, then we move into chronic stress. Frankly, it tears apart our bodies because we're just not meant to live in that elevated place. So our muscles are sore. We get heart disease, our breathing is never calm. We never relax, our bodies don't learn to relax.
So again, acute stress is the perfect physiological response to something that's threatening, and that can even be an existential threat, like an angry boss or a project that's coming up. It's not always a tiger. It's something that we need to gear up for and be all in on: highly focused, pretty adrenaline fueled, usually moving pretty quickly. It's great; acute stress is how our bodies are meant to react. Again, when there's a project due every day or the boss is mad every day, and we're always living in that place, that is when we begin to experience a number of physical and mental health problems.
Mike Julian: Yeah, I remember towards the end of my couple of jobs ago of the burnout job, I was waking up in the middle of the night thinking that my pager was going off and it never was, but I would still wake up multiple times a night, and just like bolting awake, sitting up, grabbing my phone and thinking that it had just gone off from a page. This still actually went on for months afterward. Getting a page, getting that on call alarm, sounds like acute stress, but because that happened so often for me, it actually became chronic stress.
Sherry Walling: Yeah, absolutely. Our bodies need to relax. I mean that's part of this whole conversation about burnout and stress, is really allowing ourselves safe, quiet spaces where we can re-establish a low baseline heart rate, where our breathing can calm down, where our muscles can relax and where our mind can be calm and sort of unbothered, I guess.
Mike Julian: Yeah. I had the pleasure of seeing you speak at a conference a while back, and one of the things you did in your talk, I thought it was pretty interesting. You had us go through a breathing exercise right there in the conference. I can't even remember which one you did, but I've seen you write about them before as well.
What's your favorite breathing exercise to have people just to relax themselves?
Sherry Walling: I think the simplest one that I often begin as an introduction point is I just call it four by four breath. It's where you try to breathe in for four seconds, and then exhale for four seconds. So you're slowing down your breathing and then you're moving your breath both slow and low, so down into your belly button. So if I'm doing this in an event, I'll have people put their hand, their palm over their belly button and see if they can move their hand, or even watch their hand move when they inhale. So it's like picturing your belly getting filled with air like a balloon. Then when you exhale, your hand goes back towards your spine because the balloon is deflating. So low, slow breaths is one of the best tools that we have to counteract the stress response in real time. There's a ton of science behind why it works, but it's not just like a yoga thing or a meditation thing, or like a weird Sherry thing. There's a tremendous amount of research to support that technique is a great one to use.
Mike Julian: Yeah and I can definitely say that it really does work. It is incredible how such a simple thing affects you so profoundly.
Sherry Walling: Yeah. People can, I love that tool because you can do it in a meeting. You could do it without people knowing you're doing it, it just looks like you're breathing slowly.
Mike Julian: I was reading an article you wrote for Stripe Atlas website a while back, where you mentioned that you've actually done this onstage while giving a talk, and people don't notice. That's incredible to me, I'm totally going to do that.
Sherry Walling: Yeah. Like you're doing an intervention on your own self.
Mike Julian: Right, it's great.
Sherry Walling: Yeah. It's pretty powerful and it's something that people need to practice. Taking four breaths sounds real simple, but really slowing and lowering, slowering.
Really doing those two things and being able to get there pretty quickly is an important part of this skill. So when someone is trying to work on this, I just recommend people do it three times a day, after you eat, before you eat, when you brush your teeth, whatever anchoring event will help you remember to do it regularly for a period of time, until the skill becomes muscle memory or a little bit more hardwired.

Mike Julian: Yeah, that's great advice. So on that note, I like to ask all my guests to give us something actionable that people can start on today or this week. So for us, if I think I'm experiencing burnout, or I'm trying to avoid it, or I'm in the deepest, darkest hole I can find already, what can I do? What advice would you give?

Sherry Walling: The way that you described that, the deepest, darkest hole. I mean honestly I would look for a mental health professional. I think sometimes mental health professionals, therapists, counselors, get a bad rap or there's a lot of stigma perhaps about seeing them, or they can be inconvenient or whatever, but I do think that if you are feeling like you are at the bottom of the hole, and you're not sure how to get out or you don't feel like you can go lay on the beach in Costa Rica, you need some ideas that are really specific to your situation, I'd get in touch with a therapist. There are increasingly people who will see people via Zoom, or via something that makes it easy for your life. You don't necessarily have to go to an office.

Sherry Walling: The other thing that I would do is really to write about how you're feeling. Good old fashioned journaling can be very therapeutic. Certainly in addition to some of the other skills that we've already talked about, like adding a deep breathing practice, adding a gratitude list at the beginning or end of your day. Those are really simple things, they don't need to take a lot of time.
They're free, they're not expensive, but they can really move the needle in terms of you feeling a little bit more in control over what's happening inside of you.

Mike Julian: All right then. Yeah, that's fantastic advice. So Sherry, this has been an absolute pleasure talking with you today. This has been great. I love everything you've said.

Sherry Walling: My pleasure. Thank you.

Mike Julian: Thank you so much for joining us. Where can people find out more about you and your work?

Sherry Walling: Yeah, so I live online largely at zenfounder.com. So my podcast is called ZenFounder, and we talk about lots of topics like this. Work, family life, mental health kinds of issues. I also have a book called The Entrepreneur's Guide to Keeping Your Shit Together, which I'm told is relevant for people who are not only entrepreneurs, but anybody who's really doing a high-intensity job.

Mike Julian: I've read the book, it is a wonderful book.

Sherry Walling: Thank you. I have a guide to stress that Stripe put out, that you can put in the show notes so that's free and available, and I think a helpful tool. So I love doing this. I love that I get to do this and so I also work one-on-one with people. If people want to reach out to me about some consulting work or doing a talk at your company or whatever, this is my jam. So happy to help.

Mike Julian: All right, well, thank you. Thanks to all you listeners. Thanks for listening to the Real World DevOps Podcast. If you want to stay up to date on the latest episodes, you can find this at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


31 Jan 2019

Rank #3

The Vendor Is Not the Enemy with Cory Watson

About the Guest
Cory G Watson is a Technical Director at SignalFx with 20 years of SWE, SRE and leadership experience at the likes of Twitter and Stripe. He's an observability wonk, optimist, and lover of sneakers. He hopes you're having a great day!

Links
Twitter: @gphat
Website: onemogin.com

Links Referenced
Patrick McKenzie’s blog post, “I’m Joining Stripe to Work on Atlas”
Book recommendation: Information Dashboard Design: The Effective Visual Communication of Data by Stephen Few

Transcript

Mike Julian: This is the Real World DevOps podcast. I'm your host, Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools, to the organizers of amazing conferences, from the authors of great books to fantastic public speakers, I want to introduce you to the most interesting people I can find.

Mike Julian: Ah, crash reporting, the oft forgotten about piece of a solid monitoring strategy. If you struggle to replicate bugs or elusive performance issues you're hearing about from your users, you should check out Raygun. Whether you're responsible for web or mobile applications, Raygun makes it pretty easy to find and diagnose problems in minutes instead of what you usually do which, if you're anything like me, is ask the nearest person, "Hey, is the app slow for you?" and get a blank stare back because, "Hey, this is Starbucks, and who's the weird guy asking questions about mobile app performance?" Anyways, it's Raygun. My personal thanks to them for helping to make this podcast possible. You can check out their free trial today by going to raygun.com.

Mike Julian: Hi folks. I'm Mike Julian, your host for the Real World DevOps podcast. My guest this week is Cory Watson, Technical Director for the Office of the CTO at SignalFx. He's previously run observability at Stripe and Twitter.
Cory, welcome to the show.

Cory Watson: Thanks for having me.

Mike Julian: I think it's interesting that you have gone from running observability teams at pretty interesting places like Stripe and Twitter.

Cory Watson: Thank you, thank you.

Mike Julian: To now, you're working for the enemy.

Cory Watson: That's a good way to put it.

Mike Julian: Yeah, you have suddenly gone to the vendor side.

Cory Watson: Yeah, the nicer way that I put it, is I say I switched sides of the table.

Mike Julian: Ah, yes. That is a much better way to put it. Apologies to my sponsors for insinuating that they're the enemy. You're completely right, it really is just the other side of the table.

Cory Watson: That implies that it's just a simple change in aspect, though. It's really not. It's actually a pretty fascinating difference, I think.

Mike Julian: Well, why don't you tell us more about that? What is that difference? What is it all about?

Cory Watson: I think, in the past, when you work at a place that uses these vendors, you're often trying to make sure you maintain, or at least it's always been my goal to maintain, a sort of neutrality or maybe not a shim layer is the right way to put it. But how do we retain our independence and make sure that we could switch, if necessary, and all these other things? It's both good, usually from a technical perspective, but also from a leverage perspective, right? Because you wanna be able to switch if you need to or go to the new, cool thing that might come out. Not that we jump and switch that frequently, but you wanna be able to retain that independence. Now, suddenly, I'm on the opposite side. I think there's two interesting bits about it. One is the change in perception or the change in the approach. The second is just the learning experience that I have from watching business being conducted from this side of the table. I guess we can start on the difference in perspective. To your earlier question, it's like, alright.
Here I am previously sitting over here, going like, "Okay, don't tell them anything and pretend like you hate everything that they show you, and never show your true colors."

Mike Julian: Fantastic negotiating tactics.

Cory Watson: For lack of any better training, I think that it's the place that you try to go. But at the same time, it's actually somewhat different than I imagined, because I'm happy to work for a company that largely isn't trying to sell you something for the sake of selling it to you. I think this is true of pretty much all vendors, we only want you, as a customer, if you're ultimately going to be happy. Especially for the duration that many of these engagements, contracts, or purchase periods, or whatever last, you don't just want to ease in, make a buck, and then ease back out again. You've got to stay with it.

Cory Watson: In many ways, I think that my past experience of wanting to hold everything close to the vest, to use that idiom, doesn't really work that well because you need to give as much information as you can to the vendor so that the vendor can, hopefully, tailor the solution as well as possible, right? At the same time, I think it all comes down to price, though, at the end of the day. I think in that, luckily I don't have to have that conversation. I am only here to talk about the pros and cons and the approaches of how to do observability and how SignalFx can be helpful for you. I actually think that in switching sides, I've seen how it can actually hinder the process of adoption and understanding. Because, if you're like, "Well, we're not gonna tell you how many things we're monitoring, or how many metrics there are, or what our challenges are," it makes it extremely difficult to articulate a value proposition to the other people because suddenly I'm like, "If I can't help you size this, I don't know how to have the conversation."
It's interesting to be on the side where you suddenly lack so much information because-

Mike Julian: I've run into that in the Amazon world, where people will say, "Oh, we're spending a ton of money with Amazon, but we don't want to tell them what our future plans are. We don't wanna tell them about our product roadmap because security reasons. What if they leak it, what if they try to do it themselves?" On some level, like it's AWS so maybe it will actually happen.

Cory Watson: Yeah, as we saw last week.

Mike Julian: Right. When you're spending that much money with a company, when a company is such a core aspect of how you do business, for example a monitoring product, it's no longer a vendor. They're a partner.

Cory Watson: Yeah, that's an excellent way to put it. I love that phrasing because I felt that way even as a customer. I've been a customer of many companies in this context, in the monitoring, observability, whatever context system stuff, and it's true. Once that spend gets to a certain amount or when it's such a critical part of your infrastructure, these systems are effectively the highest criticality or whatever of your internal systems. Because if you can't see what's happening, you can't make changes. You can't run if you can't see. It's absolutely right. It needs to be a partnership. The more information you can give, the better. Once everybody gets under a mutual NDA, things, I think, loosen up a bit. It's easier to share. It's also understandable because things like the number of hosts you run and the magnitude that some companies operate are sensitive subjects. I think it's very reasonable to hold it close. But at the same time, the better information you can give, then the better the solution can be tailored. So yeah, that's one side of the difference from switching over to being a vendor. Luckily, I think, on the other side of what I've learned ...
actually, I shouldn't use "side" because that's confusing.

Cory Watson: So there's that aspect, but then the learning experience is pretty interesting for me too. How sales organizations structure and work through this stuff. I'm not in sales. I've always worked in infrastructure at companies, be it for observability things or as an SRE. Suddenly now, I'm faced with learning how they approach it. Recognizing who at the customer that you're working with, who's your champion there? Just like anything, you need a champion. Someone who’s going to help. It's not strictly adversarial, but at the same time, the terminology is often like, "Well, who are the people who are fighting this? What are their motivations? Who are the people who are championing? What are their motivations?"

Cory Watson: It's interesting because I've always used LinkedIn, mostly as a tool to stay connected with people I used to work with. Nowadays, I don't see this because I don't do it, but the sales people do. They know who everybody in the org is because LinkedIn is the org chart. It's like, "Well, who do they answer to? Who's their boss? Who's gonna be in this meeting? What are their titles? Who's got the purse strings in this conversation?" It's all stuff that I look back and I see. I remember salespeople asking me these sorts of questions. I was always like, "Harrumph, why are you bugging me with these questions?" Only to realize that they're trying to figure out how to position themselves to best answer the questions that are out there, and also to sort of understand is this going to be fruitful, because companies can waste a lot of time if they're talking to the wrong people or all this other stuff. It's been really fascinating. Some of this was intuitive to me, just in working with small companies. It's just been fascinating to be on this side of the table, as we've been describing it, and learn how they approach this problem because I know how to approach systems problems.
But these are essentially people problems which is, at least in this context, all new to me.

Mike Julian: Yeah, absolutely. I can totally echo everything you're saying on that as a consultant myself. I do also go through LinkedIn and start mapping org charts. Definitely been there. What's interesting to me about the sales conversations, is once you stop looking at vendor sales as adversaries, as someone trying to sell you a thing, where you get this idea of, "They're trying to sell you a thing whether you need it or not." They're trying to foist a thing on you. They're trying to trick you into signing a contract. That's not actually how any of it's going because most salespeople are not measured just strictly on how much money they bring in, but also on retention.

Cory Watson: Yeah, yeah. It's a good point. It's about that year over year, over year. No one wants to sign a contract. I often saw companies saying, "Oh, we want you to sign these long contracts." I thought of it in purely dollar magnitude. I never thought of it as the comfort of that relationship being there. It's reasonable that when you sign up with a new company, you don't want to immediately get into some three-year deal or something. But, at the same time, knowing that's gonna be there helps every side of the equation. I used to think about the contracts we were signing in raw margin terms. Well, I know what it costs to buy a server. Put it in a rack and run it. But you don't think about the engineering organization that's built up to also deal with all the silly stuff I'm asking for like the 100 different features I've got in a list, making sure that those are all getting done. I'm not the only customer. There's so much that goes on behind the scenes. I don't know. I feel privileged every day to be able to see it from this side while still leveraging the fact that I'm an observability wonk and I do this stuff every day.
I still get to leverage my strengths, but also shoring up my weaknesses when it comes to the sales side of the table. It's been a lot of fun.

Mike Julian: Yeah, learning the business side of things, I think it's been the most interesting aspect of my entire career. Basically for me, the past five years learning how business is done, especially for selling infrastructure services and infrastructure products. To me, I think it's also the most impactful thing I've ever done in my career. All the knowledge and skills I've gained over the entire career of doing monitoring and observability and infrastructure. Yeah, it's all great. The thing that really made the most difference for me was learning how the business functions.

Cory Watson: I don't think I've thought about it that way until listening to that explanation. I think if I rewind a little bit, the reason I took the job was I felt like I could have more of an effect on our industry by helping people connect these dots and leverage a vendor, if it was the right vendor for them, to get this job done instead of developing some of this stuff perhaps in-house. I mean not that you shouldn't necessarily do that, but there's a trade-off. Now, basically, I have shorter conversations leveraging my past experience to your exact point, to help them make this decision. Then hopefully have a significant, like an outsized, impact. The conversations are almost small compared to the impact that they have. Whereas my engineering impact is so much more long-term everyday typing and drudgery than going and spending two hours meeting a customer and having these conversations directly.

Mike Julian: Yeah, absolutely. When you're working on the vendor side, you do have the ability to be a much larger multiplier. When you're working inside of a company, the effect you have is pretty limited. If you want to affect an entire industry, it's possible but it's hard.
Especially if your company is not a vendor. If you're at a vendor, and that vendor is also sizable and doing interesting things, then you become a multiplier.

Cory Watson: An interesting connection to my past life, a fellow by the name of Patrick McKenzie, who works at Stripe, goes by @patio11 on Twitter, recently wrote a blog post about why he joined Stripe. Part of it was, even though he had previously worked in trying to help small companies succeed, it was because working for that vendor, in this case, gave him an outsized impact on all of those companies. He's probably much more articulate about it than I am. I just read it yesterday. It was like, "Yeah, man, I believe that. That's what I'm trying to do."

Mike Julian: I will find that post and put it in the show notes because I'm a big fan of Patrick.

Cory Watson: Yeah, he's a pretty good dude.

Mike Julian: Transitioning a little bit, you've had a background running observability teams. Now, you've also been an IC at various places. But now you're in this weird middle ground, where you're not actually sales. You're not running a team anymore, but you're not strictly an IC either. It sounds like you've got a weird role.

Cory Watson: I like that you basically defined it by the absence of things, instead of the presence of any one thing because that's what I find tough about it. I think to pick up on the thread you've dropped there, having spent now a little over 20 years, mostly doing IC work, occasionally engineering manager, a few VP roles. In all those cases, though, there was a fairly direct connection between either infrastructure work or engineering, programming output, even as a consultant, right? I also consulted, although that was more general programming consulting.
There was always this time spent with hands on keyboard, making code pop out the other side was what I was judged on, whether it was myself or the people that I worked with as a manager or what have you.

Cory Watson: In this role, it's tricky because I just said that a lot of my job is to go and have pre-sales conversations. I'm not a sales engineer, and I'm also not a salesperson. I'm often brought out as, "Well, here's Cory Watson, who, as you said at the beginning of this session, has done a bunch of observability work. He's here to effectively just have a friendly conversation with you about what you're doing." Thankfully, I don't work in a company that expects me to just shill for them or anything else, right? I tell them what I think and what the approaches are there. This is rarely a thing that you do with just one vendor. There are often a few that overlap or are mutually beneficial to each other. I think the trick, though, is trying to figure out, what do I value at the end of the day? What releases those endorphins in my brain or whatever that triggers my happiness and excitement, and makes me want to get up and come to work every day? The conversation we started with, this idea that we're learning much more about the business, and having a larger effect is one thing, but that's a long... That's like parenting. That takes decades to pay off. It's often unfulfilling, in the moment. I often joke with my partner that our daughter, she's not gonna repay us for all this work for many years. For now, we're just, "Nope, just do the kid thing." The difficulty here is connecting that. I've been spending a lot of time documenting my work, because it often hasn't felt like I "accomplished much." I'm making air quotes. I've had to spend a lot of time documenting what I am doing and learning to recalibrate my internal measure of what types of accomplishment I've had.
How many conversations that I have with customers, sometimes months ago, that now materialize into someone who's a happy customer? How many conversations did I collect feedback on? Or how much insight have I given into product changes that ultimately then land and turn into something like that. I think maybe there's something to say here about that outsize impact as you were describing. That larger impact that you have also often takes much longer to propagate.

Mike Julian: Absolutely.

Cory Watson: The waves, even though they're big, take quite a long time to travel. That's a big difference in internal awareness of your own role.

Mike Julian: Yeah, absolutely. When you think about the vendors that we all look up to like, "Oh, well, they've made a really cool product, look how old they are." What we see now was not quick. Success looks an awful lot like drudgery. [laughter]

Cory Watson: You caught me off guard for that one. We may have to edit that laugh down. That was a snorty one.

Mike Julian: Success looks an awful lot like just hard work. You're absolutely right. The success that you see in your day-to-day work, it's not actually felt for months.

Cory Watson: Also, things that can seem simple. I've been doing observability work now, since basically as long as observability's been associated with computer stuff in some capacity. Since, I don't know, 2013-ish, something like that. Now, when I have conversations with some of our customers, some of that is re-discussing things that I have ... a lot of it's re-discussing my past experiences, some of the decisions I made as a customer. That often doesn't feel... I feel like I'm just telling them something I already knew. I don't feel like that's valuable. I often feel, for example in this conversation, am I giving you new insights or just saying something someone else already said on Medium 100 times?
The point is not to make it sound like it doesn't have value, just that you don't often see the impact until much later when you realize that that customer made some internal cultural change. I was just discussing with someone last week, how to help them make observability more fundamental to their day-to-day attributes. I asked them like, "What carrots are you providing to your team?" It's easy to have sticks and say, "We're gonna whack you on the hand if you don't measure." Are your deployment processes connected to your observability data so that you can say, "If you do it right, you get these cool features as a side effect." How much of that is really happening in your org? That's a conversation I've had many times. Sometimes, for a customer, that's brand new. Even if it is new, the other person you're speaking with often brings a whole new perspective, some really exciting new ways of thinking about the problem. I don't know. Every customer conversation is special and awesome in its own way. In some cases, we find out later they had a big impact. In other cases, they buy something else. But that's okay too, because we've all gotten better as part of the process.

Mike Julian: Something you said there reminds me of one of the ways a vendor really helps is that there are often conversations happening internally that when someone external comes in, and says the exact same thing, it lends more credibility to it because now it's not just internal. It's now someone third-party that has no vested interest is now saying the same thing. New initiatives that internal people want to do will often find traction as a result of a vendor coming in and saying it.

Cory Watson: So much so that I have had active conversations with vendors when I was a customer, I mean clearly, some of the companies that I have worked for, the name of the company meant something. Then I'm something of a personality sometimes.
The combination of that weight, it was not uncommon for vendors to basically call me and want me to break a tie. Not me as in Cory, but me as Cory or maybe the job I'm doing et cetera, et cetera. It really does because it's very easy. The problem with eating your own dog food, to use the industry metaphor for using your own stuff, is that eventually, it all tastes the same. Sometimes you can't tell. Do I need some cumin on this or not? Do I need more salt? I don't know, it's dog food to me. It's funny because that's also my role internally, is I work in the office of the CTO, which means I don't participate directly in day-to-day engineering. I do some R&D function. I do a lot of customer feedback stuff because I talk to a lot of customers. I also am a neutral party when it comes to this stuff. I don't work on the back end, so I'm not defensive about it. If you're listening to this and there's something you don't like about SignalFx, hey, I probably don't like that either. I'm helping them understand what we could do to improve it and helping to break those ties. Also, to shape where we're going and what we're doing. In addition to the customer side, to the feedback side, to the sales side, there's also just the rank and file every day. What more cool stuff can we be doing?

Mike Julian: Yep, absolutely. On the topic of you, being a character and you talking to customers, I imagine there's a lot of stuff that's ... Let me rephrase that. What do you see the future of observability holding? What are you talking about with your customers? What are you thinking about? What are you working on?

Cory Watson: I think there's two pieces to this. When I started at SignalFx, I tweeted out one day like, "Hey, I've started doing work now. If you've got questions, thoughts, ideas, let me know." John Allspaw, who many listeners may be familiar with, but if you're not, he used to be the CTO at Etsy, now runs a company called Adaptive Capacity Labs.
He talks a lot, especially lately, about incident stuff. How do people function in the systems when there is failure? He tweeted back to me, which first off was like, "Holy crap, that's so cool. I've loved his work. How do you even know who I am?" He says like, "Tell me what your tool does that helps us." This wasn't just aimed at me or aimed at SignalFx. It was aimed at the industry.

Cory Watson: A lot of what I've been discussing internally is, "Okay, you've instrumented your stuff. You've got these charts. You've got these artifacts; charts, dashboards, mechanism, tracing, and all these mechanisms for looking into the problem. But what are we doing to help you with that?" What are we doing to provide you with that information? I don't wanna overfit for the problem of the person who got paged, because these platforms often do a lot of other stuff as well, right? But what are we doing for the person that got paged? It's a good line of work too. I think as an industry, I feel like all the tools that we operate, we lost the person on call being the person giving the feedback about the tool. I felt there's a lot of improvement in all the vendors' tools that I've used. In fact, there's even a rich third-party ecosystem of tools that basically intercept your alerts to help you provide context or deduplification. Is that deduplification? I usually hate when you say a word so many times, you don't even know if it's real. I think that's a word. There's so much more we could do in that space. What is just the simple usability of it? I remember I did a short stint as a product manager in training at Twitter. One of the things that I learned from someone who was training me, was sometimes a feature can be implemented really cheaply and simply, just to prove its efficacy. I don't think I've seen a single monitoring tool that when it notifies you, has a button that's like, "Hey, computer, this is not helpful. Please stop doing this."
It doesn't have to actually take action, but it should record that sentiment, because that's really important.

Mike Julian: That's a fantastic idea.

Cory Watson: Well, you can implement it pretty cheaply. Just log the thing. Take it to a web server and log that HTTP endpoint and then just go back and scrape it together later. Then feed that information back to the folks who do your DevOps tooling, or maybe your manager. This is something that I often think we provide very poor tools for engineering managers to look at the health of their own call rotations that they're responsible for and to help guide people. It's very easy when you're on call to get caught up. These things just go off all the time, and I can't make it any better. We rarely leave time. One of the things I used to push on my team at Stripe that I think we're pretty good at is, if you're on call, part of your responsibility is to leave it, assuming that you have time, in a better place than it was when you got there.

Cory Watson: If you're seeing alerts that are bothering you, allocate time as part of your on-call rotation to go and re-tune those. If it's a threshold, a static threshold that you need to change, do that. If the runbook's slightly out of date, make sure you schedule time for that or make tickets for other people. You don't necessarily have to take on all that work yourself, but record it because engineering managers are never gonna be able to help you allocate time for it, if it's an undefined quantity. That's something that I'm thinking a lot about. Today, specifically, I've been digging in a lot on accessibility of the tools that we operate. When people say accessibility, we often think of people who have some either permanent or temporary disability.
Maybe they have an injury and lost the use of an arm or they have a sling, or something or maybe they're blind, but we also [crosstalk 00:26:40].

Mike Julian: There's a lot of those so-called invisible disabilities.

Cory Watson: Well, 8% of those of Northern European descent have red-green color blindness. What is the thing we all use in all of our charts to denote good and badness? It's red and green. These are the colors that those people are most likely to not be able to see. You've lost that entire channel of communication with those people. Today, for something I'm working on about dashboard design, I'm looking into accessibility. How about screen readers? What are we doing? Think about our charting elements. Are screen readers capable of deciphering those? There's also a lot of stuff that's not even that technically complex. What are your titles of your charts? I don't have it handy, but something I ran into earlier was some study done about ... or some research. I don't know if it was a study, but there was some research where someone said basically no one knows what charts are except what the title says. It's often been one of my favorite parts of observability tools where someone goes, "Well, what's in that chart?" Well, it's so and so number of seconds that this happened. They're like, "Yeah, but where?" You just have to go look in the code and find that measurement and go, "Okay, that's what that means."

Mike Julian: Yeah, I hate that.

Cory Watson: Well, I think that we do ourselves a disservice by not gratuitously labeling.

Mike Julian: Right, I agree.

Cory Watson: X-axis labels, Y-axis labels, the titles of the charts, the units that are placed there.
Taking the time to basically wean yourself off of using jargon and just the assumption that the other person knows what these things mean.

Mike Julian: One of the things that has completely changed my views and perspective and really leveled up my own skills on the aspect of visualization was reading Stephen Few's book on Information Dashboard Design. It’s this massive, full color, really amazing print quality book originally published with O'Reilly. He's got a second edition out. It's also fantastic. Also, he's written just tons and tons of other books. The entire thing is, this is how you should be building dashboards. This is what visualizations should look like. There's a whole bunch of examples of visualizations done poorly, with explanations of why they're not working. One of my favorite things is he lays out really clear reasons why a pie chart is the worst chart ever created.

Cory Watson: Oh, yeah. I've also been ... I mean, that sounds really interesting. I just put it on my list to pick up. Some of the work that he probably even researched is what I've been reading lately. A lot of old theory. [crosstalk 00:29:26]

Mike Julian: He was a student of Tufte.

Cory Watson: That pie chart example, this is one people love to harp on, but like, our ability to understand the amount of area contained in a pie wedge, it's pretty terrible. It works for large differences; but for small differences, less so.

Mike Julian: One of the visualization types that I really wish more monitoring tools used (Few talks about this as well) is tables. Tables are actually a very valid visualization. But we don't think about a table as a visualization. But, in a lot of cases, that's the most effective and easiest way to understand that information.

Cory Watson: In that same perspective, we also underutilize, I think, bar charts.
If you look at our ability to process quantitative information like the length of a bar, that's why bar charts are often cited as being superior to a pie chart, because our ability to understand length is so much better.

Mike Julian: Basically, anything that you would reach for a pie chart for, it should probably be a bar chart.

Cory Watson: Yeah, but then that lends itself also to the tabular form because a table puts all the information on those X-Y axes because it's basically a chart without graphics. It's just a chart with numbers in its place. One of the things that's even suggested for accessibility purposes is to put the label for the data as close to the visualization as possible. If you've got a line chart, put it at the tip, for lack of a better word, the right-hand side of the line. My favorite, I think, thing I've learned about that, and then I'll switch topics, was that we build what are called run charts. The generic form of a chart that shows time on the bottom, or on the X-axis, is a run chart. It's a special form of a line chart. My favorite thing that I read about it, going back to my Tufte books, which I have tucked away and haven't looked at in forever, but I pulled them back out for this research, was that we rely on these charts, yet time changing is very rarely the causal part of what's happened. We have this entire visualization technique built out of showing things change over time. Yes, time is changing. That's very important to us; but rarely do we put as much thought into providing the context inside our organization or in our systems that is actually affecting that change. How many of those hints are we providing to people? Which is the teaser for what I'm gonna try to talk about a lot, which is, how do we do a better job of instrumenting the things that are currently not instrumented and actually being able to ... I don't think causally is a word, but we correlate things often. Correlation is not causality, as we all know.
Causalating things is much more difficult.

Mike Julian: You and I have talked about this previously, totally not on the show, where there's plenty of stuff out there that seems hard to measure or impossible to measure, but it's actually the stuff we care about the most. One of my favorite examples is measuring customer happiness. If you have a service-level indicator, that has to be mapped to customer happiness; otherwise, how do you know that it is going to be valid? Well, now I have a new problem. How do I measure customer happiness?

Cory Watson: We get lazy and we just don't measure that stuff. We say, "Oh, you just can't measure that."

Mike Julian: But that's not true.

Cory Watson: Yeah, it turns out there are actually some techniques for that stuff.

Mike Julian: There's a lot of really interesting techniques. Unfortunately, we could talk for hours about that topic alone. Perhaps I'll have to have you back again sometime soon and we can dig into that one.

Cory Watson: Yeah, the only other thing I'm poking at in this realm is control theory, which is the roots of classical observability. Not necessarily what we're talking about in computers. It's something I've doubled down on recently. It's pretty common when you're demonstrating or when vendors show you this stuff like, "Here's a chart, here's it behaving badly, and here's some effect that we're having on the system." If you move forward into a more automated world, most modern, large scale, industrial things are automated to the point that things like control theory are what govern them. I'm very interested in why we still so heavily rely on intuition and people to do a lot of the operational work that's on the Ops side of the DevOps equation. How much of that could be improved if our systems were more, drumroll, controllable?

Cory Watson: Observability is how measurable something is and whether or not the state of its internals can be inferred by summing together its outputs.
What we don't have are systems that are easily controllable; how many of our systems have direct APIs for influencing those knobs and levers that govern their operation? That's something I'm really interested in because as we increase the surface area of our applications that can be controlled, if you imagine them as a three-dimensional space of good configuration and bad configuration, how much of that stuff can we automate? There are people out here doing a lot of stuff in terms of real math and science on systems being safe. Little's Law is a very commonly-cited one for queuing. There's the Universal Scalability Law and a lot of other performance-minded ideas that can actually be applied to the things that we're doing, but our systems very rarely allow us to manipulate them in that way. Instead, someone's gotta go edit the YAML files, save it, upload it, stop it, start it, redeploy it, blue, green it. How much could we be doing if these things were a little more automated? I think that's something that sort of solves that dirty secret that observability has, which is that it tells you something's wrong, but not necessarily what it is. That's probably the third thing I'm poking at a lot these days.

Mike Julian: Oh, that's fantastic stuff. I'm hoping that you're going to be writing a bunch of blog posts and giving talks about all this.

Cory Watson: Yeah, that's definitely something that's in my ... I don't have a contract, but it's in my proverbial contract that I'm supposed to be writing about a lot of this stuff. In between talking with customers and assembling some of this, I think over the next year or so, you'll see a lot more of that come out. I am definitely currently working on how to make better dashboards.

Mike Julian: Wonderful.

Cory Watson: As inspired by a lot of the stuff we're talking about here, I'm gonna have to totally check out the book that you mentioned earlier.

Mike Julian: Awesome.
Well, where can people find out more about you and your work?

Cory Watson: Well, you can find out more about me either on my personal website, which is onemogin.com, O-N-E-M-O-G-I-N. That's the southernism for do it one more time, if you haven't read it before. Then you can also find me on Twitter @gphat, G-P-H-A-T. Both of those places, and I do a pretty good job about babbling about both of them. Usually, though, Twitter's probably the right place, if you can deal with all my retweeted hilarious, weird Twitter jokes.

Mike Julian: Yes. All right. Well, thank you so much for joining us on the show.

Cory Watson: No problem. Thanks for having me.

Mike Julian: To all the listeners, thank you for listening to the Real World DevOps podcast. If you wanna stay up-to-date on the latest episodes, you can find us at realworlddevops.com and in iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


28 Mar 2019

Rank #4

DevOps in a 150 Year Old Nonprofit with Dan Barker

About the Guest

Dan spent 12 years in the military as a fighter jet mechanic before transitioning to a career in technology as a Software/DevOps Engineer/Manager. He's now the Chief Architect at the National Association of Insurance Commissioners. He's leading the technical and cultural transformation for the NAIC, a non-profit focused on consumer protection in the insurance industry. Dan is also an organizer of the DevOps KC Meetup and the DevOpsDays KC conference.

Links Referenced:
Insure U: created by the NAIC for consumer insurance education
Dan’s talk at DevOps Enterprise Summit 2018
State Ahead Strategic Plan from NAIC
Explore technology proposals, fiscal budget proposals, etc. on the NAIC site
Dan’s website

Transcript

Mike Julian: Running infrastructure at scale is hard. It's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O’Reilly's Practical Monitoring.

Mike Julian: Alright folks, I've got a question. How do you know when your users are running into showstopping bugs? When they complain about you on Twitter? Maybe they're nice enough to open a support ticket? You know most people won't even bother telling your support about bugs. They'll just suffer through it all instead and God, don't even get me started about Twitter. Great teams are actually proactive about this. They have processes and tools in place to detect bugs in real time, well before they're frustrating all the customers. Teams from companies such as Twilio, Instacart and CircleCI rely on Rollbar for exactly this.
Rollbar provides your entire team with a real-time feed of application errors and automatically collects all the relevant data, presenting it to you in a nice and easily readable format. Just imagine: no more grepping logs and trying to piece together what happened. Rollbar even provides you with an exact stack trace, linked right into your code base, along with any request parameters, browser, operating system and affected users, so you can easily reproduce the issue all in one application. To sweeten the pot, Rollbar has a special offer for everyone. Visit rollbar.com/realworlddevops. Sign up and Rollbar will give you $100 to donate to an open source project of your choice through OpenCollective.com.

Mike Julian: Hi folks. I'm Mike Julian, your host for the Real World DevOps podcast. I'm here with Dan Barker, the chief architect for the National Association of Insurance Commissioners. Welcome to the show Dan.

Dan Barker: Hi Mike, it's great to be here.

Mike Julian: So National Association of Insurance Commissioners, it's like the four least interesting words in one title I've ever heard. What in the world do you do?

Dan Barker: We thought if we combined them it would be more interesting; that may not have had the intended effect. So the NAIC is a nonprofit, about 150 years old. We kind of got our start organizing events for insurance regulators, so primarily the chief insurance regulators for each state and territory, and we organize events to get them together. And we also created model law. Over time, as technology advanced, we started to host some centralized technologies within the NAIC and provide a lot of the back-end applications that regulators use. We also offer something called Insure U, where you can go and learn about insurance, all kinds of insurance. I'm sure that it will get flooded now.

Mike Julian: Because we're all chomping at the bit about insurance.

Dan Barker: Yeah.
And so we still do a lot of the event planning, taking regulators and kind of informing them on technology. There's a big movement in InsurTech with tons of investment right now. And so we're trying to make sure everyone is staying up to date with things like blockchain, AI, a lot of the data-focused stuff, particularly around unconscious bias in the data and how to clean that out. So we're a pretty diverse group and we have a lot of kind of different focuses, but one of those is the technology side, and I'm the chief architect and leading up a lot of the technological transformation as we move to Amazon Web Services, moving everything to the cloud and moving to some more open source tools and trying to move towards a DevOps culture.

Mike Julian: That sounds like a pretty fascinating situation.

Dan Barker: Yeah, it's really exciting. We have all kinds of new tools and new things. We're moving towards ... This company has done a pretty good job of standard blocking and tackling actually, which you might be surprised at in a nonprofit.

Mike Julian: Right.

Dan Barker: One of the only companies I've ever even heard of that had one version of Java and that has been-

Mike Julian: That's pretty impressive.

Dan Barker: Yes, I continue to ask in every meeting if that's true, and-

Mike Julian: Are you sure?

Dan Barker: Yeah. I'm sure there's one somewhere around here. But yeah, for all of our applications, they're all on the same version of Java and it's up to date to, you know, like Java 4.

Mike Julian: That's pretty impressive, congratulations.

Dan Barker: Yeah, so it's a great base to start on. And it's been a great journey so far. I've been here for a year and the CTO that kind of came in to start this off has been here for, I think about three years. So it's about the length of the transformation so far.

Mike Julian: Okay. So we're talking about transformation here. You actually gave a talk about this at DevOps Enterprise Summit in Vegas earlier this year, or 2018.
What was the problem that started this entire transformation process? What were you trying to solve?

Dan Barker: So, we had several different opportunities that we were looking to kind of utilize moving forward. So we have this infrastructure that has been ... We're a nonprofit, so we don't have a ton of funding and it's been a bit challenging. So we haven't been able to move as efficiently because we haven't optimized a lot of our technological systems. Much of what we have has been more by request, kind of an IT department within a non-technical company. And so we're trying to move towards being more of a technology company, which requires a little bit more proactive planning. So one of those areas is that we're trying to gain some efficiencies across all the different teams. So trying to standardize a lot of our [inaudible 00:06:30] methodologies, standardizing our development techniques. We also have a lot of silos, so we were siloed not only between our operations and development sides, but we also had each development area being siloed. So we have three main areas and they all do everything differently, which is very challenging, especially when you try to move someone across teams. As priorities shift, it becomes very hard for them to pick up on what is going on there. And they really form different cultures, different processes, and they all use different technologies.

Dan Barker: We also needed to improve our technology. So I talked about InsurTech coming. And the regulators and their staff are expecting more from us on the technology side to help them better regulate more technologically advanced companies, particularly companies that are now getting into blockchain and AI.
And we need to advance all of our technology and our training and understanding to a point where we can explain to them and hopefully help audit some of the algorithms that'll be used in the future and validate that the data doesn't have any implicit bias in it that they aren't noticing or that the company hasn't noticed, which has been pretty common to this point. It's something that hasn't been looked at enough in most companies, and most of it is unintentional. It just happens to be that the data is formed in a way that there's unconscious bias. So we need to accelerate all of our technological capabilities to deliver on those types of features that will help protect consumers of insurance, which is a long-term buy. So you're not going to know if it's going well until it's too late. So we're trying to protect against that.

Mike Julian: Right. So if we were talking about what the whole business case behind this was, it's really that second one. The second group of explanation is the business reason why you started down this process: your main stakeholders, the insurance commissioners, are looking at the market, seeing a whole bunch of changes, and realizing they're not ready, from a technical perspective, to handle those changes.

Dan Barker: Right.

Mike Julian: So they're looking to you for that support and that increased capability and flexibility to handle the major shifts in the market that they're seeing.

Dan Barker: Right, yeah. The main driver is definitely that we are able to offer faster response to demands by regulators and higher quality products for them. And notice I never said anything about saving money.

Mike Julian: Right.

Dan Barker: And that is not even an expectation of ours as we move to the cloud, which is something important, that we've often focused on saving money, but now we're focused more on delivering high quality products.

Mike Julian: That's absolutely interesting. I want to dig into that a bit.
Because every DevOps transformation we tend to see, as well as every cloud migration, is either explicitly or implicitly focused on cost savings. We have all these companies out there with their own data centers, and now they're looking at cloud services and realizing, "Hey, we can save a boatload of money by moving." The reality is that they rarely do, they always end up costing more, but they generally come out ahead.

Dan Barker: Yeah, exactly.

Mike Julian: It's interesting to me that you've kind of skipped that whole thing and said, "No, we're just not going to focus on saving money here. We're looking for the increased capability."

Dan Barker: Yeah, so that's something that we discussed as well, about how we were going to focus on that or choose not to focus on that, and it's something that was discussed. As soon as I got here, I was supposed to read and edit, and I did, a document called State Ahead that we released shortly after I arrived here. Scott Morris, the CTO here, had written most of that for the technology side, and in there it never mentions anything about saving money. And the point of that was to make sure that people weren't focused on trying to save money, because we thought that we would save money, but we didn't want to make it the priority, the focus, because we could make a lot of caveats, a lot of choices, to save money that negatively impact our quality or speed of product development. And so State Ahead is something that you can find out on the naic.org website. It's our three-year strategic plan. It's been really cool to work for a nonprofit because they ... And particularly a nonprofit like this where the board is a little bit more in flux. So they've had a lot of problems of every year the board changes because of the way it cycles. I won't bother explaining here. But anyway, it changes every year.
And the membership changes regularly because people get re-elected or they term limit out or they have-

Mike Julian: Does it roll all at once every year or is it staggered?

Dan Barker: No, it basically ... Yes, so I guess I'll explain it here.

Mike Julian: But my whole point there is are you losing all of your stakeholders every single year?

Dan Barker: No. You're losing the top one; it pops off.

Mike Julian: Okay.

Dan Barker: It's basically a queue.

Mike Julian: You have an organizational queue, congratulations.

Dan Barker: We have our board as a queue. So it's a first-in first-out queue and there's five of them. And so every year we get a new president and a new vice president. Our CEO and COO are static, but we do have that change and we have membership changes as well, who approve all these things. So the reason we wanted State Ahead is because we wanted a three-year plan that everyone in the membership was bought into. And so that no matter who got into the board, it was very likely that they had signed off on the three-year plan so that we could commit to that. Because a lot of times people would come in, they would have their own initiatives for their own state or wherever they were at, or whatever the reasoning was, they had their own initiative. And so they would focus on that for a year, and then we do the next year and the next year and the next year. And so we were able to extend very few of the projects beyond just one year. And so this was really great to have that, and then also all the budgets and everything, all the proposals I write up are public. So you can actually go and comment on them if you'd like to. All of our technology proposals are out there.
They're available for comment for a certain number of days and then they are voted on after all the comments are addressed, and we'll usually address comments if you have them.

Mike Julian: That's pretty cool, also a little nerve-wracking.

Dan Barker: Yeah, well, I was really happy that my first cloud one didn't get any comments, because I was a little nervous. It was the first one I'd ever done. But I think the next one after that had gotten some really good comments. So that's really helpful to help shape the direction of the insurance industry.

Mike Julian: So was this report how you actually got started on this whole process, or did you start somewhere else?

Dan Barker: Well, so that's kind of the end of one journey.

Mike Julian: Okay.

Dan Barker: Or the culmination of one journey maybe. And so what happened is ... So Scott Morris was here for a couple years before I got here, and he kind of started on the culture side and he really wanted to build a conviction in leadership so that we didn't sway after a few months, things don't ... In public companies, it's very common to have a three-to-six-month window, and then if it doesn't show massive improvement, then it's gone.

Mike Julian: Right.

Dan Barker: And this is not a feature improvement to an app, that we can show value coming back. This is largely a paying off debt maintenance site move. That should show improvement but it'll take time to show. So he wanted to build that conviction. So he took a lot of the leadership to Amazon, to their data center, took them to partner companies. The leadership team and the technology area went to partner companies to discuss what they've done with cloud and how they move with blockchain and other technologies. So we really wanted to build some shared experiences, so that we all had the same vernacular kind of to look back on.

Dan Barker: We did an initial assessment as well. And we used that throughout the entire transformation.
We did it with AWS, and it was focused, it was AWS, an AWS partner, and it was very focused on those technologies, but we've used it for a lot of the same verbiage and language so that we had a common lexicon moving forward. And that has been really helpful. Even though we aren't necessarily using all of the recommendations, we have something to look back on, that's a common place. He also went ahead and encouraged everyone to read The Phoenix Project. And so that went really well. Everyone really enjoyed that book and kind of understood why we're doing what we're doing, and the focus initially was all about culture and has really been about culture the entire time. We're doing technological things, but we know that we have to have a solid base in the culture before we can execute on those technological areas effectively.

Mike Julian: What does culture mean in this context? What were you trying to change? What were you trying to pay attention to?

Dan Barker: So, I mean, this place has a really good culture. I don't know what it was like when Scott arrived, but when I arrived, I was shocked at how nice people were and how open people were to different ideas and to kind of shining light into areas. And what we focused on is really a culture of continuous learning, trying to encourage people to proactively go out and learn on their own. This is a nonprofit, a larger enterprise and it's right next to government. So it had a bit of a top down structure for a long time. And a lot of people were used to ... I'm sure they've been in trouble for doing things they weren't told to do.

Mike Julian: Right.

Dan Barker: And so they had a bit of a defensive mechanism of kind of like, "We’ll wait until we're told what we're supposed to do and then we'll go to that." And so very quickly, once given room, they've kind of opened up space so that they can move on their own and solve their own problems.
That was the big piece, kind of empowering everyone, so they can solve their own issues, their own problems in their area, rather than having someone have to say, "Hey, this is how you're going to do it, and this is what you need to do." More just giving direction and additional context. We've done that in many ways, so we do lunch and learns. Those are really hard to keep populated at other companies that I've been at. It takes a great deal of effort. We haven't had a single vendor come in and speak, I don't believe anyway. We may have had some consultants, but pretty much everybody has been internal.

Mike Julian: That's awesome.

Dan Barker: And we've had it every other week, but I used to do it once a month, and it was hard to staff the vendors a lot of times. So it's been pretty amazing and there have been talks on all kinds of things, work-related, not work related, and really impressive to see everyone share-

Mike Julian: How do you keep that so populated? What's the secret to success with that?

Dan Barker: I think some of it is just the inherent culture here, that people want to share and want to help. That might be a side effect of working for a nonprofit that's focused on insurance, that includes things like health care and other things about helping people. So I don't know if that's part of it. Definitely the person in charge of that, Gail McDaniels, really gets out there and tries to gin up interest in coming and speaking. So he's done a great job running that piece. We've also done a Tableau vizathon, a city-wide Tableau vizathon is what they call them.

Mike Julian: Okay.

Dan Barker: I'm not a Tableau person, but that was really well attended. And we had basically a hackathon and we really helped kind of get our folks involved in the community more. And we get all of our money, all of the nonprofit’s money comes from consumers; it comes through the companies that pay us based off of, I don't know, some algorithm that other people know in the company.
I have no idea how they get the money. So this comes from consumers, and my goal when I got here was, I wanted to give back as much as possible. So some of these city-wide events that we've done, we've done meetups and other things, was something I wanted to focus on, just like our adoption of more open-source tools has largely been, so that we can give back more to the community since they're helping fund us.

Dan Barker: So that's been a big thing. We also have done a hackathon. So we did this during the AWS re:Invent Conference. And I think we may have a hard time getting people to go to re:Invent because it was such a successful event here. We watched the keynotes live as a group and we actually had a Slack channel open, so that people at the conference and in Kansas City could live chat about what was happening during the keynotes. And then we had a hackathon that had, I think it was over 40 participants. We only have about 260 engineering staff. And a lot of people, I think another 20 or 40, or something like that were in Las Vegas at re:Invent. So it was a pretty good attendance just for the hackathon. And we had a lot of cool things that people produced, some that I'm probably not allowed to mention.

Mike Julian: These are always the most interesting ones.

Dan Barker: Yeah. So it was a great time, everyone really seemed to love it. And we also had internal people present. We had AWS and GitLab come and present for us and we had a couple other people present on culture. We had an internal panel. I mean, it was really just a ton of sharing for I think it was three days long. We rented out a local theater. I mean, this was ... The amazing point or the amazing thing about all of this wasn't even necessarily the event, which was amazing. But it was how the event was planned, which was all in a Slack channel.
So basically a guy named Dennis Wilson who's the Director of Technology over on the NIPR side, which is the National Insurance Producer Registry, which is a wholly owned subsidiary, a lot of explanation. Go watch my talk notes. I'll talk a little bit more about it.

Mike Julian: That talk will be on the show notes.

Dan Barker: Yeah, okay, great. So he just kind of mentioned it and had an idea of watching the keynotes, and then somehow that snowballed into a hackathon and all these other people coming in and talking and renting out this theater, and having all these GitLab training sessions. And it was pretty amazing to have people just come into the Slack channel as they heard about it or said, "Oh, well, I think I can help with that." And they would just come in, kind of read the history, grab a task off the task list and start going. I mean it was wild to just sit there and watch and not really get involved anymore.

Mike Julian: Yeah.

Dan Barker: Kind of like ideas stuff.

Mike Julian: That's pretty cool. To me I see a pretty direct line between the ... There's something you mentioned earlier where when you first got there people ... Or before you joined NAIC people were very, "I will wait to be told what to do." But since you've been there that's not been the case at all; people are jumping to do whatever they think is necessary to do. To me, there's a pretty direct line from that culture change, whatever prompted that, wherever it got started, to what you're talking about, where people feel empowered to do what they think is necessary or do what they think is interesting.

Dan Barker: Yeah, definitely. And I think it's all just really giving people opportunity and allowing them that opportunity to succeed or fail, and to not judge whether or not they succeed or fail, but to judge that they've taken the opportunity. There are times where, whether I'm predicting it right or wrong, I may have seen that, "Well, this probably isn't going to work well."
But it's better to let it happen and let it succeed or fail than say, "Well, I don't think this is going to work." Which is, "Stop it now." Because you're automatically not empowering them, which is by definition, you're not empowering them.

Mike Julian: Yeah, but it's exactly.

Dan Barker: …protect them from possible failures. And it's like, "I may have done the same thing, but this is a different context as well." So when I started, I was very clear that I may have done similar things at other companies, but the context here is different. And that's going to change a lot of decisions, and I will not be the ... Very quickly we will make decisions where I have never been here either.

Mike Julian: Yeah.

Dan Barker: I've never been to this situation before either, so you're going to have to help out.

Mike Julian: Yeah, absolutely. So kind of switching gears a little bit, you mentioned how you got started with this, with your CTO, working with the executive teams and taking them to partners, taking them to Amazon data centers, basically getting executive buy-in. Once all that got started, and was underway, how did you get everyone else on board? What did you do for the rest of your management? What about the engineers, maybe people outside of engineering but who are kind of on the ground, doing the work? What did you do to get everyone on the same page?

Dan Barker: Yeah. So part of that was the State Ahead document. But we even got started before that with taking an application to the cloud. So we took one of our smaller applications, but something pretty important, and we moved that up into AWS and we used Lambda and we found some pain points in there with Java.

Mike Julian: As one does.

Dan Barker: It’s like the older Java apps spinning up on Lambdas. But it was a good experience, and it's still running up in production. Now we've obviously made modifications and updated it, but we wanted to get a win.
And then we also used a bunch of new technologies for another app with MongoDB and some other more NoSQL type systems and that was a big hit because we were able to get it done very quickly, replace an old and fairly complex system and do it in half the time and half the budget of the original project.Mike Julian:  That's pretty awesome.Dan Barker: Yeah. Which was amazing.Mike Julian: I bet everyone loved that one.Dan Barker: Yeah, definitely. Everyone loves that, even if it's not about money.Mike Julian: Right.Dan Barker: People love saving money.Mike Julian: Yeah.Dan Barker: So that was a huge success. And so we did those with a small, engaged team. We tried to pick people who would really step out of their boundaries and help wherever they could. And we then just repeat that over and over again. What I've learned is that if you're not tired of saying it, then you probably haven't said it enough. And so I try to make sure that I'm fully tired of saying everything and congratulating people and telling people what a good job this company has done so far and particularly those two were kind of trophies that we can hold up and say, "We've done this already, we don't have to worry about this. This is something that everyone can achieve." We also offer a ton of opportunities for learning. We'll basically pay for whatever courses you want to take or if you want to have a subscription to any of the online learning systems we'll provide those, we buy books like crazy. Had an order of, I got a little worried. So I listed all the books that I liked and gave links for them on Amazon and it was over $1,000 in books.Mike Julian: Wow.Dan Barker: Things I could remember kind of off the top of my head and they bought them.Mike Julian: That's incredible.Dan Barker: Yeah, and I was like, "Wow. You just bought $1,000 in books. The for-profit companies I've worked for would never do that." So it was really great. We have our own little library. 
We also have an actual formal library, which is an interesting component of the cultural transformation journey. There are actually researchers up there, because we have a research center and we have model laws that we create. And you can basically ask them to do any research and they'll just go do it and send you back all this amazing stuff.Mike Julian: That's pretty cool.Dan Barker: They do pretty much everything now. But one of the people, Erin Campbell, jumped on to Slack before we ever told anybody about it. We were just kind of beta testing it and I guess she heard through the grapevine how to get on it. I don't even know how she knew how to get on it. But somehow she showed up there, created a book club channel and some other channels, and then started ordering books that people were talking about and then taking them down to their desk and giving them to them when they were in, and then really engaging with everybody, which was an awesome thing to have, when you have someone who's not in the technical area, actually in the library, which you don't usually think of. I mean, it's like insurance.Mike Julian: Right.Dan Barker: Most people don't get that excited about the library, I'm sure there are people yelling at me on their radio now.Mike Julian: Absolutely.Dan Barker: Although they're probably whispering, “Let's be honest.” So she got on and really engaged and that was a great thing to be able to show people on the technical side that, "Look, there's someone from the library getting on here and fully engaging." And another piece that we were able to show, particularly developers but also project managers and management throughout the company, is that I did some of the docs on our internal communication standards for the IT group, and I basically copied most of those from GitLab, but don't tell them. They're all in their public handbook. 
And so I had the head of HR go in and help us make sure that they were all in agreement with it and he actually submitted a commit back to GitLab, the HR director.Mike Julian: That's incredible.Dan Barker: It is shocking, right? And so then you-Mike Julian: Yeah, that's amazing.Dan Barker: Yeah. So I went and helped him through it and stuff and we talked through everything, but he did everything. I never typed anything for him. And so it was great to hold that up, as kind of like a collaboration with someone outside of the IT group and that they're engaging in these systems that we're using and that we find to be easier systems to interact with because that kind of stuff, we can put it into a web page, we can put it into Confluence and all these other tools through automation. Once it's in a ... like git-type repo. So they thought that was pretty cool. And to have someone who's on the HR side interacting with our systems was a nice achievement.Mike Julian: Yeah, that's great.Dan Barker: And we've also had engagement, I mean, this company is really amazing. We've also had engagement with our lawyers. So they've been always very fast to respond and always do a great job of reviewing all the standard stuff you'd expect IT lawyers to be reviewing. However, they also engage with us early on our ... Nexus IQ Server engagement with Sonatype and coming in and trying to understand how they're going to work in this new system. They never complain about something being new or us trying to get them into other systems. They've been fully on board with new open-source program that we're introducing. So we can open source more of our internal stuff and have more contributions to open-source. They've been really strong advocates. So it's pretty amazing to have so many people on board with the transformation across every area of the company, really.Mike Julian: Yeah that's great. 
So all that sounds amazing, but surely there's been some stuff that hasn't gone as well as you'd like.Dan Barker: Yeah. So there always are, right?Mike Julian: Right.Dan Barker: So a lot of times with these things, certainly everyone wants you to predict the future of how long things will take. So you're always going to be behind schedule because no matter what you say, something happens. We've taken down our Kubernetes clusters that run in GitLab or that [inaudible 00:35:00] the runners. That was a bit of a hard impact but we're not ... We've done a good job of introducing blameless postmortems and the blameless culture so that people don't ... You know, we don't want them to feel bad if something happens. It was an honest mistake, it was a growing pain of having to swap out some fundamentals of the AMI that caused the overall crash of the Kubernetes system. And so that was a little bit of a setback and we've had multiple things like that. Networking is really hard between [inaudible 00:35:38] in AWS in the data center, and we've had some struggles of ... There's the silo mentality of, "I am network." Or, "I am database and I don't know anything about the other things," has been a bit of a struggle. We have one person on the platform team building out a lot of the platform components, who is a database administrator and has learned a ton of other stuff now. But it's funny because that person will answer questions about monitoring, but then will get upset if someone wants to do database stuff because database stuff is just too complex.Mike Julian: Yeah, interesting.Dan Barker: Yeah. And it was an interesting podcast last week or the last podcast on this series talking about databases actually, because it's very pertinent to a lot of the things that we're doing. 
We've had some connectivity issues and we've questioned, “Do we have any settings anywhere limiting connections?” And we've had this persistent issue of single connections are clearly being limited somewhere because they just hit a flat plateau and just stay there, and you can run 1000 of them and they'll increase linearly. But every one of them has the same limit. But we haven't been able to find it. And we were told very sternly, that there were no limits in the system anywhere. And we're still searching that one out, we're still trying to make sure that when we move more of our apps up there, that we are going to be able to connect our databases on premises, because we're not prepared to move them. At the same time we feel like that's going to be too big of a bang, right? To do all at once. So we're trying to do the apps and then the databases. We also have ... I mean, it's a normal enterprise, so a lot of apps talk to other apps' databases. So you gotta wean those off during that process as well or at least understand the connections. And we've had other things, so open communication like Slack is hard. I sometimes kind of suck at biting my tongue. And so I've made mistakes and I've said things I shouldn't have in public forums, and I know that others have argued and things. And so it's been really nice to be in a company where people forgive you rather than holding things against you because you said something when you're upset and in Slack, it's permanent.Mike Julian: Right.Dan Barker: You can delete it but I think that it's better-Mike Julian: Someone saw it.Dan Barker: Yeah, someone saw it, and I think it's better to leave it, to show that, "Look, you can recover from a failure like that." That's a failure of ... It's a different kind of failure than the technical failures. 
We often talk about DevOps, but I think it equally is important because we all have hard days and we have hard weeks or months or years.Mike Julian: Right.Dan Barker: And stressful situations that come up where you accidentally say something or type something that you wish you hadn't.Mike Julian: Yeah.Dan Barker: So it's a very forgiving culture. And that's a personal failure for me. But we've also had some of the technological failures and we've had a lot of questions on QA, what's going to happen to QA? And MBAs and other types of jobs like that. And for the most part, I've tried to share articles or other information but not specify a direction, because I want the leaders to emerge in those areas organically and then lead those teams through that.Mike Julian: So all this has been pretty awesome, is really fascinating stuff. For those people that are listening that are also working in a similar organization or looking to start or perhaps in the middle of their DevOps transformation, what advice would you give them?Dan Barker: So I guess the most important thing that I've learned is patience. You have to be patient. These are usually long transformations. Oh, they're really never-ending, I don't know if you achieve the perfect culture and then you don't have to do anything. You'll repeat the same thing over and over again to the same people sometimes. And the way you say it one time will be the time they get it, just by changing up how you say it, or how you display it. And that takes patience to not ... Especially when they are like, "Hey, I totally understand this" and you're kind of happy, but you have that little happy frustration.Mike Julian: Right.Dan Barker: [crosstalk 00:40:55]. And so I try to place the blame on myself in those situations of, I need to figure out how to say it differently every time rather than just saying it the same way; and the change needs to be organic, but supported from the top. 
So we can't force anything into the culture, we can create opportunities like Slack, or like GitLab, where there's more open communication and it's easier to interact with each other. We can provide those platforms somewhat by force, although when I provided Slack, it was adopted in an insane way. We didn't expect it. We had to quickly buy more licenses because we were hitting our license limit very quickly. And then we had to offer training and a rollout plan which we hadn't planned on doing. It was just going to be our IT department.Dan Barker: And then the business unit started getting involved and legal and everybody… It's a real battle of maintaining patience throughout the long haul and then trying to play catch-up when the dominoes start to fall. The same thing happened with GitLab and so I would say be patient, offer as many opportunities for learning or collaboration as possible and let people choose whether they want to do something, whether or not you think they'll succeed or fail. Just let them do it. Even spending several weeks on something that you think is going to fail or may not be as big of a value to the company might spur tons of innovation in the future from that person. So you often have to kind of let them ... Give them some goals to achieve. You can't just let people go do whatever they want, without any type of limit, but usually we'll try to have some goals but then some free time to do extra stuff and letting people go and lead those initiatives and not have it just, "Well, you're not a manager so you can't lead." So that's kind of the core of what I would do. But all of that is [inaudible 00:43:28] patience.Mike Julian: Right.Dan Barker: I was in the military for 12 years. So I learned patience.Mike Julian: Oh, yes. Well, thank you so much for joining me. This has been great. Where can people find out more about you and your work?Dan Barker: So all of my information is on danbarker.codes, that's C-O-D-E-S. 
And you can also check out naic.org for all additional information around NAIC and our initiative and get our State Ahead document, any of our fiscal budget proposals. And then all my information is on danbarker.codes, all my presentations, where I'll be speaking next.Mike Julian: That's awesome. Well, thank you so much for joining me.Dan Barker: Yeah, thank you very much, Mike. I appreciate it.Mike Julian: Thank you to our dear listeners for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play or wherever it is you get your podcasts. I'll see you folks in the next episode.


7 Mar 2019

Rank #5

InfoSec For DevOps Engineers with Kelly Shortridge

About Kelly ShortridgeKelly Shortridge is currently VP of Product Strategy at Capsule8. Kelly is known for research into the applications of behavioral economics to information security, which Kelly has presented at conferences internationally, including Black Hat, AusCERT, Hacktivity, Troopers, and ZeroNights. Most recently, Kelly was the Product Manager for Analytics at SecurityScorecard. Previously, Kelly was the Product Manager for cross-platform detection capabilities at BAE Systems Applied Intelligence as well as co-founder and COO of IperLane, which was acquired. Prior to IperLane, Kelly was an investment banking analyst at Teneo Capital covering the data security and analytics sectors. Kelly graduated from Vassar College with a B.A. in Economics and was awarded the Leo M. Prince Prize for Academic Achievement. In Kelly's spare time, she enjoys world-building, weight lifting, reading sci-fi novels, and playing open-world RPGs.Links Referenced:  InfraGard The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win RSA Conference Chaos Monkey TranscriptMike Julian: This is the Real World DevOps Podcast and I'm your host Mike Julian. I'm setting out to meet the world's most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences. From the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools and that's where Influx comes in. Personally, I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools. Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. 
All of this is available as open source, and as a hosted SaaS solution. You can check it out at InfluxData.com.My thanks to InfluxData for helping to make this podcast possible.Hi folks I'm Mike Julian, your host for the Real World DevOps Podcast. My guest this week is Kelly Shortridge, the VP of Product at Capsule8 and an internationally known speaker on InfoSec topics.So Kelly, welcome to the show.Kelly Shortridge: Thank you so much for having me Mike.Mike Julian: You know I was looking at your LinkedIn and there was something that kind of stood out to me was your FINRA license. You started your career off in finance.Kelly Shortridge: That's true.Mike Julian: So what in the world happened there? How does that work?Kelly Shortridge: Yeah. How does that even happen? So one, FINRA exams are very painful so I didn't want to have to re-up those, but mostly I started my career doing mergers and acquisitions covering information security companies. And while I quite liked the finance side I noticed that security had a ton of opportunity not just as far as vendors, but the problem space is huge and it's very unsolved and the incentive problems are enormous. So for someone like myself who had studied some Behavioral Economics, just all of the messy incentive problems were kind of like catnip for me. So I knew I had to go into the industry.Mike Julian: Yeah. The Behavioral Economics side, the study of incentives. That's absolutely fascinating to me because, especially in security and systems, everything we do is incentives based.Kelly Shortridge: Yes.Mike Julian: And it's often incentives that we're not even paying attention to. Things we don't even think about. Like if you-Kelly Shortridge: That's definitely true.Mike Julian: Yeah. 
Like you make it hard to use two-factor and people aren't going to use two-factor.Kelly Shortridge: Yup, so that's something that I feel like people outside of security understand immediately, but inside security they don't always understand the fact that if you don't make something that integrates into workflows, people are going to bypass it. But you're absolutely right, there's such a web of incentives and on the one hand you have things that are explicitly stated you know, security to your privacy is important, but then you have more tacit goals and priorities, which are that, well security's a cost center and really what matters is being able to deliver on time and you know releasing at a certain cadence.So those tacit assumptions also create a bunch of incentive problems in InfoSec. But I always think that InfoSec, because they wish they were more relevant, try to be the culture of no and ram through really annoying technologies for people to use just to show that they're still relevant.Mike Julian: Yeah, that hits home. I've seen that way too many times.Kelly Shortridge: Yeah. I think most developers... It's not really a love hate relationship, it's mostly just a hate relationship for the most part. Somewhat bi-directional.Mike Julian: What really about security drew you away from finance? Have you found there's good parallels that you've been seeing?Kelly Shortridge: It's interesting. There's certainly parallels, particularly on the risk management side, particularly anything around risk centrality because that's a huge part of financial systems, so that's not really what I did day to day. I think what's interesting in security is there's a huge lack of effective communication. Even when you go to conferences you know, there will be some 0-day that's dropped or whatever, but it's often not communicated very well, and certainly when you look at enterprises, security priorities aren't really communicated well to the rest of the business. 
And a lot of investment banking is quite frankly effective communication.It's about quickly researching something in general companies and understanding it very deeply to be able to talk about it and effectively persuade acquirers that it's worth acquiring a company and so frankly even the Excel and PowerPoint skills I learned along the way are really helpful in just being able to talk to people about security. You know I can speak to CEOs, I can speak to Board members, I can speak to DevOps people about security in a way that's still understandable and that's what I feel like we're still missing a lot of in information security.Mike Julian: Yeah that makes a ton of sense. I was reading one of your articles, I can't remember which one it is. I'll have to go find it and throw it into the show notes, but you made mention of attacks from nation states. And I thought that was pretty interesting. It definitely stood out to me. And perhaps I have a bit more security exposure to that side of things than the average DevOps person might, just from, I worked for the government for a while so I've seen it. But for the vast majority of people who haven't, what is a nation state? Like what are we talking about in security context?Kelly Shortridge: Nation state generally in the context of an attacker is a government sponsored entity. So in some cases it gets a bit blurry, particularly with China or Russia where you have criminal groups that either lightly or strongly have the backing of the government, or at least the government looks the other way because it still benefits you know the government's goals. But in general it means a nation state attacker and you'll also hear the term APT which is an Advanced Persistent Threat. Part of the reason why they can be advanced and persistent is because they're well resourced and they're also well motivated. 
They have very stated goals and they have actual budgets that can go towards pwning things.Mike Julian: This is actually a thing, like this is actually happening?Kelly Shortridge: It is actually happening, so the extent to which it's happening, particularly for the average organization I think is a bit more dubious. I think by far and away, the script kiddie threat or the criminal group threat is far bigger.And that's where for example why a transition to security looking at behavioral economics, there's a concept called prospect theory. And part of that is basically people overweight small probabilities and underweight large probabilities. So for example you overweight the probability that a shark's going to eat you and you underweight the probability that you know you'll be hit by a car you know, succumb to some sort of car accident.The same applies in security that people vastly underweight the fact that probably they'll succumb to phishing or some other sort of kind of like somewhat stupid scripted attack and then they overweight how much you know Mossad is a classic example, Mossad's going to you know find who your secretary is and they're going to install some special sort of pen that then transmits some sort of exploit over to his or her machine. And then that machine's going to exfiltrate data by fluctuating the power supply and someone's then hacked into the power plant to read that... all of that stuff is more fan fiction than anything else except for you know national laboratories or governments. Maybe Fortune 10’s.So in general, people definitely like how sexy nation states are as far as this kind of attacker because no one wants to be owned by a 12 year old, right? That just feels bad.Mike Julian: Right. That's just embarrassing.Kelly Shortridge: Exactly. Exactly. So I think that's why there's so much focus on nation states rather than kind of the quotidian threats.Mike Julian: I got to see James Mickens speak a while back. 
And one of his slides was... It basically just said, the threat model is Mossad or not Mossad.Kelly Shortridge: Absolutely. Yes.Mike Julian: If it's Mossad you're screwed anyway. It's like, don't bother, give up, you're done. If it's not Mossad well now we can have a conversation about what you can do.Kelly Shortridge: Precisely. And that goes into really good threat modeling in the sense that even with, I think it was APT 28, which most people know as the one that you know hacked the DNC. They tried phishing first. I think it was something like google-admin, but the google had two zeros and maybe you know people should have spotted it. I tend not to blame users for those sorts of errors. But the point is that even these super sophisticated groups will absolutely try phishing and absolutely will try these unsophisticated methods first because if they don't have to blow something expensive like a zero-day vulnerability, which takes tons of time to research and perfect and get reliable, they won't do it. They'll absolutely try low hanging fruit.I think it's similar with a lot of developers, if they don't have to like reinvent the wheel and create something fancy, they won't. People tend to optimize for what's quick and what works.Mike Julian: Right. Yeah. When I used to work for Oak Ridge National Lab, one of the first things that we were taught was, you don't plug in USB drives from outside of here. And be very careful about what links you click in an email. And the FBI came around through the InfraGard program to tell us that... they basically gave us this briefing on system administrators by nation states are the, they're kind of at the core of who they're targeting because they have tons of access and no one pays attention to them.Kelly Shortridge: Definitely.Mike Julian: That was kind of scary when I first heard it and then I realized, well actually most... what they're going to be doing is just coming in at the lowest level possible. 
Like here, plug in this USB drive.Kelly Shortridge: Exactly.Mike Julian: They're not going to be like, a beautiful woman in a bar trying to find me in this long 18 month process. That's just not going to happen.Kelly Shortridge: Yeah. If the USB stick in the parking lot works, you might as well try it and then after that you know it's some discounted thing on NewEgg in an e-mail, right? And then it escalates from there.Mike Julian: Yeah. Exactly. They're not starting at the top.Kelly Shortridge: Exactly. I think it completely defies human nature to think that they would start with the most expensive option rather than exhausting the rest of the options.Mike Julian: The threat modeling here is really interesting to me because... so say the attacker is going to be starting with the cheapest perhaps most effective option, that means how I'm thinking about my defense is also going to be very different. I'm not protecting against these really fantastical situations. I'm protecting against phishing links.Kelly Shortridge: Yes.Mike Julian: When I'm trying to design some sort of security posture and like, I'm a DevOps engineer, I don't have any standard security staff I don't have any specialists around me, what can I do?Kelly Shortridge: So one thing that I definitely recommend, and it's actually lucky because it's somewhat easy, is just go through kind of 101 guides on how to hack web applications. Because whatever the 101 guide says is probably what their minimum viable threat model is, right?It's the same thing with corporate security. Going through and trying to crack passwords is probably step one using some sort of dictionary. Obviously something like two-factor kind of mitigates that. 
So what I've proposed before is the concept of decision trees, which I assume a lot of the audience will be pretty familiar with.But if you're not, it's basically the idea that you start with, okay, you have the state of the world, you have some sort of attacker and you have let's say an application that contains sensitive information. Obviously the attacker isn't going to care about the non sensitive information. They'll probably go for whatever the, let's say credit card data.Then you figure out, okay, what's the easiest way the attacker will get there? So I have this notion called YOLOsec, which is basically if you do nothing and you just hope that security will happen out of the ether, that would be for example if you're storing the credentials in the database, you don't have any network segmentation. You don't have any data tokenization. You certainly don't have any access control on it. So that would be the YOLOsec option.And so then when you start thinking about, okay, if we did absolutely nothing, what would be the easiest thing for the attacker to do? You can start eliminating those paths and then forcing the attacker down the hardest paths possible, which again eliminates a lot of the very common script kiddie threats. Then you move onto eliminating the common criminal group threat. And finally again once you get to the Mossad level, like just don't care about Mossad, they're going to find a way regardless. So as long as you keep forcing attackers down that harder path, you're going to frankly eliminate yourself as a target.Mike Julian: That's fantastic advice and for whatever reason, and I'm kind of ashamed to admit this now, I never considered looking at the 101 attack guides to figure out how to set up my defense.Kelly Shortridge: Yeah. I think if you want to role play a bit it's like, okay imagine you're a teenager you know the stereotypical teenager in Eastern Europe, what would you do first? You know that you're... 
I assume probably a lot of your listeners were at technology companies. You know of a technology company through an article in TechCrunch. You know that they have sensitive customer communication, something like that. Okay, now you think, how am I going to hack them? You're probably going to go to really stupid guides at first, so just look at those and eliminate all of those stupid ways, right?Mike Julian: Yeah.Kelly Shortridge: Yeah.Mike Julian: How do you consider... you're absolutely right.Kelly Shortridge: Yeah.Mike Julian: So you and I were talking before we started recording, about... DevOps people and security people.Kelly Shortridge: Yes.Mike Julian: And you have opinions on this. Can you tell me more?Kelly Shortridge: I do and I'm a bit of a traitor in that I definitely empathize more with the DevOps side than the security side of things. But-Mike Julian: Why is that?Kelly Shortridge: It's that way because I really dislike the notion that I see a lot in security which is again that culture of "No." It's that notion that there's this almost, you know I almost call it this moral and almost like missionary perspective of there's this abstract perfect security archetype and every company has to meet it and anyone who violates security is just you know, violating divine blessing or something like that. It's just very overly serious to a certain extent and there's this lack of self awareness that security most of the time slows companies down. And that if security isn't working on behalf of the business to make sure the business can survive and not choking out those workflows, then what is it doing? 
If it's hindering the business then you might as well not have deployed security at all because frankly, most of the consequences of any sort of breach are reasonably minimal, particularly with the rise of cyber insurance, that means you'll get reimbursed for incidents.So to me DevOps at least understands, okay we are supporting money making activities, but also we do face cost constraints and stuff. Security doesn't quite have that same level of self awareness and they certainly aren't making money for the business. So, again I empathize more with the people that seem to be supporting the business more than not. And I do in my experience think that when I talk to DevOps people about security they're way more receptive than when I try to talk to security about DevOps and what they can learn.So that's part of why I side more on the DevOps side. But my thesis right now, which I've been harping on for at least a year now, is that DevOps and security should actually be BFFs. They're frenemies but they shouldn't be. But there are obviously a bunch of cultural challenges. There's a ton of scar tissue, but ultimately with the rise of something like resilience engineering, you can extend the concept of like, okay, assume that things are going to fail. To assume that things are going to fail also in a security context. There will be a breach. So really there's a lot of common ground that I think both teams so to speak don't realize exists.Mike Julian: Tell me more about those cultural challenges you mentioned.Kelly Shortridge: There are a ton of cultural challenges. So for one security people tend to have the break it mindset rather than the build it mindset. And they tend to think that most people who don't consider security first, are stupid. So if you aren't beginning your design phase with architecting perfect security, a lot of times they'll just think that developers are fundamentally less intelligent. That's something I've legitimately heard. 
And I think that's stupid in itself. Right? It's just-
Mike Julian: Yeah that's awful.
Kelly Shortridge: Yeah. There are different priorities obviously. I think on the DevOps side, there's also this notion of it's the, what is it... fail fast or fail faster, maybe I'm just quoting Silicon Valley at this point, but you know building things not necessarily with regard for security. Which also isn't great because security still is a part of managing business risk. So I think it's... fundamentally those mindsets are different, and frankly the best security people I know are the ones who have developer experience.
Kelly Shortridge: And even on the DevOps side the ones who tend to consider security, tend to be I think better in organizations. I love the stat from the State of DevOps Report by Dr. Forsgren, where it states that companies that resolve security incidents more quickly and also have security sooner in the build phase, actually reduce mean time to recovery. It actually benefits the business. And it benefits velocity when you're considering security. It's just, are you doing it in the right way?
Mike Julian: All right. Have you ever read The Phoenix Project?
Kelly Shortridge: I have not.
Mike Julian: Oh. It's a really great book by Gene Kim.
Kelly Shortridge: Okay.
Mike Julian: And how it opens up is talking about this archetype of a security professional and this person, I think they named him John. It's a parable, which is a really great read, really enjoyable. But this character John does a whole bunch of stuff and just isn't telling anyone that he's doing it. And then the entire application completely crashes.
Kelly Shortridge: I believe it.
Mike Julian: And then he starts blaming everyone else and like, oh well, none of you care about security and I'm the only one protecting this company and, all the rest of you only care about making money. And like, we've got to save this.
Kelly Shortridge: Exactly.
Mike Julian: And then it progresses through and there's kind of a...
this security person changes their mindset over time with the input and experience of talking to other people to realize, well, no, there are actually layers of security and I may not see them all.
Kelly Shortridge: Exactly.
Mike Julian: So the example given in the book was the security person wanted to encrypt a field with CVVs or something like that, and it later came out that was completely unnecessary because finance had paper controls to handle it all.
Kelly Shortridge: Okay.
Mike Julian: So it was a completely moot point, but in doing so he actually broke the entire business as a result of making this completely unnecessary change because he was only seeing things from his perspective.
Kelly Shortridge: Yes. Yeah. I feel like half of my talks around conferences are about seeing things from other people's perspectives and teaching security people how to do that. So I'm not sure if the parable has been fully digested yet in InfoSec, but I think it's a good one.
Mike Julian: Like hearing everything you're saying, I'm like, surely we've solved this by now. But apparently not.
Kelly Shortridge: No. If you look... One meme that I really hate, and if you hear security people telling you this, don't believe them, is that the pace of attacks and, you know, attackers shifting methods, it's just evolving constantly and we could never keep up. That's just not true. For the most part if you look at underlying techniques, they don't change that much. Even phishing is something that was happening in the 90s. So fundamentally-
Mike Julian: That's interesting-
Kelly Shortridge: Yeah, fundamentally things don't change all that much. Though I do think on the positive side of things, some of the new technology around infrastructure is actually changing things for the better. But otherwise if someone is saying that they can't keep up with the pace of change, it means that they don't have good underlying basics in place. And so they are just constantly reactive.
And that's a huge problem in security, is very few people are proactive and thinking frankly more in the DevOps kind of like architectural view, rather than in just like, oh there's a fire, must put it out. Okay next fire, et cetera.
Mike Julian: Right. So I want to get into that but I have one other question before we talk about that area. We've been talking about this terrible archetype of security people. Some of the people listening have security staff that they may not have the best of relationships with. What can they do to start to bridge that gap? You mentioned that DevOps and security should really be BFFs. How do we get there if we're not already?
Kelly Shortridge: So one reason why I like resilience is because of all the commonalities there are with security. So I think even starting with the conversation about like, okay listen we want to make sure that our apps have really good uptime and they aren't disrupted somehow because we have to be performant. What are some of the security benefits there? Is there a way that maybe we can collaborate to make sure that part of that uptime, for example, reduces the threat of like denial of service? That's something that I think both goals have in common. Kind of looking for what are you working on and where could that somehow apply to security I think is a good first step.
Also just acknowledging, it's a bit of I guess ego stroking and nudging in a certain way, but acknowledging like, listen we think security's really important, like we're looking to implement x technology, kind of leading into the next discussion we'll have, like, what are some of the security benefits here? Like for service meshes for example, is there a way that this orchestration can actually reduce work for you? We know that you're super busy and you're putting out a ton of fires.
Is there some way that we can actually help automate this for you because we're going to automate some stuff for ourselves?
So I think those are the sort of olive branches I would recommend. It's kind of like you don't want to tell them they're unimportant.
Mike Julian: Sure.
Kelly Shortridge: That's... right? That's their biggest fear in a certain way. But I think sometimes they just fundamentally don't realize... they see new technology as something almost scary and it's another again fire they have to put out. It's another threat model they have to create. So figuring out, okay like we're frankly going to do this regardless, like what are the ways that we can reduce work for them, is something that I think for the most part security people will be really receptive to.
Mike Julian: I have found that telling a security team that thinks like that, hey by the way we're going to turn over all of the servers every couple days, or every couple minutes. Like we're just going to rotate the entire infrastructure and also by the way we're going to consistently break our own infrastructure intentionally. And you just see their heads explode. Like, "You can't do that!"
Kelly Shortridge: Yes. So here's my counter because this is something I've been talking about constantly and will keep talking about, is something like that, what you just mentioned... Remember back to that nation state and the APT, well the P in that is persistence and it turns out it's really hard to persist on something that's constantly rotating, right? So there are actually security benefits. So you can tell them, no... you can even drop my name, not that I'm super important, but say like, "Listen, I heard that it's really hard for an attacker to persist if our infrastructure's constantly rotating."
Another thing I've constantly mentioned is how Chaos Monkey is actually a really good security tool, not just a resilience tool, for that reason, because it reduces persistence.
Again bringing up service mesh, like, I believe I've personally only scratched the surface, but I promise you most security people know nothing about it. And I guarantee you they don't know the fact that it means that they don't have to manage individual blinky boxes anymore. That they can actually just deploy firewall rules and access control and stuff like that in a much friendlier manner.
So I think it's trying to, and this is where I put the onus more on security to understand the technology rather than on DevOps to understand the threat models and the security needs, by going back to, if you create a really basic threat model, right? And you go through those 101 things and every time you're looking at new technology like you mentioned thinking about, okay how would this stop the script kiddie? Where would this make it difficult for them? And presenting that to the security team is a really effective way to remind them like, "Listen, this isn't scary, this is something that can actually help you."
Mike Julian: Yeah. I mean it's hard to not be scared when you go into security conferences.
Kelly Shortridge: Yes.
Mike Julian: Until recently I lived in San Francisco, so... RSA Conference is there every year. And walking the floor of RSA it's hard to not think the world is burning and everything is awful.
Kelly Shortridge: It's so true. There's so much fear, uncertainty and doubt in all of the marketing messaging. I personally hate it. I think it's totally unnecessary, but yeah I think security, the industry itself really tries to be inaccessible and sound scary, which totally hurts and then they complain about the fact that there's a security skills shortage. And it's like, well you're kind of presenting the industry like a nightmare. No wonder people don't want to join in with you.
Mike Julian: Right.
Kelly Shortridge: So I definitely quibble with that.
But I think remembering that security fundamentally is about, in this digital age, businesses just inevitably have to be in the digital world, just making sure they can survive. That's fundamentally what security is. There are digital risks and how can we make sure the business can still survive and ideally thrive with those digital risks. We remove the nation state stuff and you know the 0-day and the FUD and all of that. When you make it that simple I think it's a lot more accessible and it starts to make a lot more sense as far as what you should do strategically.
Mike Julian: If I'm looking at security products, like something to help me, should I just categorically ignore all the ones that are using FUD in marketing?
Kelly Shortridge: I think you may be left with basically no products to be honest. I think-
Mike Julian: Well that sucks.
Kelly Shortridge: Yeah it does suck. It's incredibly difficult even for seasoned security professionals to navigate. You know, there are people seasoned with 20 years of experience that still talk to me about how difficult they find it just to figure out what companies are actually doing, particularly with the rise of AI and machine learning and everything. Then they just hand wave and say, "Oh it's our crystal ball don't worry about it." Which is helpful to no one.
So I think if you're a systems administrator or a DevOps person looking at security tools, the key thing to ask I would say is start with the workflows. Make sure that you're not going to be adding undue work because if something, you know some, what are called SIEMs, I think like Splunk and other things that basically ingest a bunch of data and help you manage alerts and stuff like that, sometimes those can add 30 hours of work a month just to maintain them, right? Yeah.
Mike Julian: Wow.
Kelly Shortridge: They're really difficult to implement and this is often across the board with security products.
They're really difficult to maintain so just starting even with like, okay but what's the realistic, essentially cost of using your product on an ongoing basis, I think will help you a lot because security shelfware isn't going to help anyone.
I think the other thing is specifically looking at the site to see, are they kind of pain... do they at least acknowledge that that's even a pain point. Because companies that are just hyping up again like the machine learning or the AI and stuff like that and not talking about optimizing workflows or reducing manual effort, those are the ones that probably in general aren't going to provide as much value. Again, because either they're going to sit there or they're going to be so time consuming that you can't actually focus on more strategic products.
Mike Julian: It has been absolutely fantastic chatting with you.
Kelly Shortridge: Thank you so much, yeah.
Mike Julian: I've learned a ton. This is great.
Kelly Shortridge: Yes definitely and anyone listening, feel free to always talk to me because I'm always looking to see how we can... don't tell the security people, but how to make security teams a little more obsolete and integrated more into the DevOps process itself.
Mike Julian: Well on that note, where can people find you?
Kelly Shortridge: Yes so I have a website, swagitda.com. It's S-W-A-G-I-T-D-A. It's a finance joke for another time, but I have both speaking and writing sections which include kind of blog posts both long form and shorter as well as some of the conference presentations I've given, and that word 'swagitda' is also where you can find me on Twitter and reach out.
Mike Julian: Well fantastic. Well thank you so much.
Kelly Shortridge: Thank you so much Mike.
Mike Julian: And to the rest of you thanks for listening to the Real World DevOps Podcast.
If you want to stay up to date on the latest episodes you can find us at realworlddevops.com and on iTunes, Google Play, wherever it is you get your podcasts.
Mike Julian: I'll see you in the next episode.


23 May 2019

Rank #6

Open Source is Not A Business Model with VM Brasseur

About VM Brasseur
VM (aka Vicky) spent most of her twenty-plus years in the tech industry leading software development departments and teams, providing technical management and leadership consulting for small and medium businesses, and helping companies understand, use, release, and contribute to free and open source software in a way that's good for both their bottom line and for the community. Now, as the Director of Open Source Strategy for Juniper Networks, she leverages her nearly 30 years of free and open source software experience and a strong business background to help Juniper be successful through free and open source software.
She is the author of Forge Your Future with Open Source, the first and only book to detail how to contribute to free and open source software projects. The book is published by The Pragmatic Programmers and is now available at https://fossforge.com.
Vicky is a moderator and author for opensource.com, an author for Linux Journal, the former Vice President of the Open Source Initiative, and a frequent and popular speaker at free/open source conferences and events. She's the proud winner of the Perl White Camel Award (2014) and the O’Reilly Open Source Award (2016). She blogs about free/open source, business, and technical management at {anonymous => 'hash'};.
Links
opensource.org
Fossforge.com
anonymoushash.vmbrasseur.com
vmbrasseur.com
marythengvall.com
Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure
Transcript
Mike Julian: This is the Real World DevOps podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools, to the organizers of amazing conferences. From the authors of great books, to fantastic public speakers, I want to introduce you to the most interesting people I can find.
This episode is sponsored by the lovely folks at InfluxData.
If you're listening to this podcast you're probably also interested in better monitoring tools and this is where Influx comes in. Personally I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools. Telegraf for metrics collection from systems, Chronograf for visualization and Kapacitor for real time streaming. All of this is available as open source, and they also have a hosted commercial version, too. You can check all of this out at influxdata.com.
Hi folks, I'm Mike Julian your host for the Real World DevOps podcast. My guest this week is VM Brasseur otherwise known as Vicky, an expert in open source strategy and the author of the book Forge Your Future with Open Source. She was previously the Vice President of the Open Source Initiative and is currently Director of Open Source Strategy at Juniper Networks. Well Vicky thanks for coming on the show.
Vicky Brasseur: Well thanks for having me Mike, I'm very happy to be here.
Mike Julian: I want to start with a seemingly simple question, but I have recently learned in the past half hour that this is more complex than it seems. What is open source?
Vicky Brasseur: Yeah, can't imagine how you learned that. No, it's a question that a lot of folks in technology think they know the answer to, but unfortunately they're usually wrong. That's because they usually don't realize that there is a legitimate definition of what it means to be open source software. It is called the Open Source Definition. It is maintained by the Open Source Initiative. If something does not adhere to each of those 10 points on the Open Source Definition, it isn't really open source.
Unfortunately people just sort of assume, well if my source is out there, if my source code is out there, it's open, right?
Well, not really, because if you restrict it in any way or if you don't put an appropriate license on it, then people don't know it's open source. If you just put your code out there without a license for instance, it's all rights reserved. You have the copyright over that code or your company if you developed it for your company. It's all rights reserved as far as copyright and no one else can use it, unless you put a license on and that's what the license does for you. Only an open source license, one that is approved by the Open Source Initiative, that's the only kind that you can be assured actually gives you all of the things that the open source definition guarantees.
Mike Julian: What's really interesting about that is, there's always people that go around GitHub onto like the main project and say, "Hey, I noticed that you don't have this license, you should really have a license file." I'd always thought that that was just kind of an oversight, like, "Oh yeah, it's fine, it's totally open source. There's just no license. There's no license file." What you're actually telling us is that, if you don't have that, if you haven't specified what license this is under, by default it's not open source. Like, it is “all rights reserved.”
Vicky Brasseur: It is, exactly. It is all rights reserved. The best you can call it is source available. You still retain all of the copyright over that, and therefore it is all rights reserved. You retain all rights to that code, no one can use that software at all unless you give them the rights to it. That means somebody could use your software and put themselves at legal risk by violating the copyright of your software and you. If you don't put a license on it, that's what they're doing. Therefore, they are at legal risk, they can get sued and if they are running a company and they're using your software, they can't really get acquired frankly if they are using software that is encumbered by somebody else's copyright.
That's why it's so important for multiple reasons to make sure you have a license on there. It really takes care of all those legalities. It's a relatively short list of OSI approved licenses, you've got the Apache and the MIT and all your GPL flavors and LGPL and AGPL and yeah. There's a bunch of them and they cover a broad swath of things. If you just use one of them, you don't have to care about the legalities, somebody has already taken the time to figure that out for you.
Professional lawyers have written these things, gotten them approved by OSI. You know they give you everything from the open source definition and you know it's legal. Just use it. It's pretty easy.
Mike Julian: You just named off a whole bunch of different open source licenses. I'm always confused when I release a project, like what should I license this under? Screw it, I'll go with MIT or Apache and call it a day, and I never really put any thought into it. There's a lot of these licenses, so presumably I should probably be putting more than two seconds of thought into which of them, if I'm even doing open source at all.
Vicky Brasseur: It depends. I mean if you're a business, you're going to put a great deal of thought into this, because you have specific business requirements and strategic needs for releasing that software at all. If you don't care, put GPLv3, put MIT, just slap that on it and throw it over the wall. If you don't care, don't think about it. GPLv3, MIT, that's great. If you care about software freedom, if you care about the morality of allowing other people to look at and manipulate and redistribute your software, use a copyleft license, use GPLv3.
If really you could not give two farts about that, then put MIT on it and just get it out there, but license it appropriately otherwise you're screwed.
If you really have a lot of other considerations as far as some sort of patent concerns or something like that, that's when you need to take it to your lawyer and have them look at it and figure out strategically what makes the most sense. If you're just an individual, default to GPLv3, default to MIT, you should be fine.
Mike Julian: It sounds like there's actually a whole lot more to open sourcing something than just slapping a license on anything you throw in GitHub.
Vicky Brasseur: Yes.
Mike Julian: Especially if I'm a company.
Vicky Brasseur: So much more. I mean if you're an individual even, it's very important that you do more than just slap a license on it. I know I've been saying that the past, I don't know, five minutes, just slap a license on it and move on. Unfortunately it is slightly more complicated than that, but not much. That's because most software is composed of multiple different pieces of code. You've got this module, that module, this library, that library. Now with open source as you release it, someone doesn't just have to take your whole package and move on. If they want they can cherry pick individual pieces of your code. They could just take one module if it does what they want for instance.
Now if all you do is you slap a license file in that repository, then you walk away, if someone just takes that one piece of code, then later on when they're under a merger and acquisition situation for instance, that piece of code is going to be found. Nobody will know where it came from. You won't have some sort of path showing that oh I wrote this. You won't be able to prove it via version control and you won't have a license file. You won't know who wrote it, you won't know under what license you're using it, so you're going to be in a big bind depending upon the software. You might have to completely re-architect to get that out of there or rewrite it or something like that.
You don't have that copyright encumbrance, because while it was originally open source, you don't know where it came from, you don't have that provenance.
As you are releasing software, make it so much easier for everyone. At the very top of each file, I know and developers roll their eyes every time I say this, but come on people, we have tools now that can avoid this. You can zip that up, and you don't see it. At the top of your file, just have a commented out section, which is a simple copyright statement. Copyright Mike Julian 2019, done. Then underneath that you put licensed under GPLv3, say that. With those two lines at the top of every single file, you know now if that file gets lifted out and used elsewhere, somebody will know under what rights they are allowed to use it and who wrote it.
They have their legal butt covered because you have put a copyright and a license statement in there. Then you have the full copyright file elsewhere in your repository. I have a nice big section on this in my book and how to release your software as an individual. As a company, yeah, there's a lot more concerns. I personally at Juniper, I don't want to release software A, if it has IP concerns that we can make money off of. I need to talk to legal, I need to talk to the product teams. I need to figure out how to get this released appropriately, because just throwing it over the wall as an open source project but appropriately licensed is one thing. My company is not going to get any benefit out of it if we don't treat the community properly, if we don't actually engage in it. All we're going to get is people looking at it and saying, "Yeah, hey look, Juniper released code, isn't that cool." There's a lot of benefit in that, don't get me wrong, but there is so much more benefit in building a community of users and of contributors. That can gain companies a great deal.
Mike Julian: Yeah, that makes a lot of sense.
I've definitely been in companies where they have a strong culture of being and working in the open source community. They have software they've open sourced and they're maintaining it. They're actually growing communities around it and that brings them so much good will in the community. Not to mention it brings them more business as well as recruiting. It's a huge recruiting magnet too.
Vicky Brasseur: It's a massive recruiting magnet. Why would you not do this appropriately and build a community for recruiting alone?
Mike Julian: Right.
Vicky Brasseur: I used to run software engineering departments at the VP level in various companies. The amount of time and effort and money that goes into recruiting is spectacular. Depending upon whatever employment firm is putting up the study, it can be anywhere from 150 to 250 or more percent of that person's salary. That's how much it costs to replace them. A, you want to manage appropriately so they don't leave in the first place, so you don't get that 150 to 250% hit on their salary. Also, you want to make it as quick and easy as possible to get the right person in there.
Now if I'm using open source software strategically within my company to build my products, and I'm releasing software appropriately and I'm engaging with all of these communities in an authentic way, then what I am doing is I am meeting a lot of people who already are familiar with my stack. Who already know ECO and Kubernetes and name your flavor of the week. They'll know that and they'll know my company. When I come knocking saying, "Hey, I have an opening," I'm going to have people lined out the door. Not only will they be more qualified to come on board, but since they already know the stack, their onboarding time is dramatically cut. Therefore, they can get more productive more quickly because they already know the software.
They don't necessarily know all the special little delicate snowflake things of my stack, but they're familiar with the software. I don't have to teach them YAML and stuff like that. I'm not going to get there if my company isn't being an authentic community member in the free and open source software communities that my company is using and participating in and really relies on.
Mike Julian: Right. You've been talking about these concepts of open source strategy for years. It sounds like a lot of what we're just now discussing is part of that idea. Like, the open source strategy, is there more to it?
Vicky Brasseur: Yes. What a leading question, yes, oh my gosh.
Mike Julian: Yes. What else am I missing? Please tell us more.
Vicky Brasseur: I mean there's using open source software. If you look at the various studies out there, it's anywhere from 70% to 90% of the software that's being used and written right now is relying on free and open source software in some way. We're not just simply counting Linux in there, but it's everything else. It's the entire Node ecosystem and it's Python and it's PHP, it's everything. It's huge. Everything relies on free and open source software. That frankly, that's not really strategic. That's just a gimme, yeah, whatever. You're going to be using free and open source.
Mike Julian: Of course, we're going to use Apache.
Vicky Brasseur: Exactly, right, exactly. That's what we're going to do. Everyone does this. It would be stupid for us to roll our own at that point. Like are you going to roll your own SSL libraries? Not if you're wise and that sort of thing. You're going to use …
Mike Julian: I sure hope you aren't.
Vicky Brasseur: Oh please and if you are, stop now. Just stop, back away slowly. You know you're using these things, but some of them are more important than others.
What makes the most sense for your business to be looking at, to be investing in, because you could just throw money and people and time at every single thing you're using in your stack, but that doesn't make a lot of sense. You have due diligence you have to perform and you have to look at this strategically. It's not just releasing software strategically such that you can get the benefits of it, but it's also supporting software strategically. It's contributing to software strategically. You have to know how to do that properly and your people have to be trained appropriately. You have to have policies in place for compliance and various things like that. There's just so many different moving parts to doing open source well from a business point of view. A lot of companies think they know how to do it and as a now former, thank you Juniper, free and open source software business strategist, I'm here to tell you most companies do it wrong. They're putting themselves at massive risk. They just assume they know what they are doing, but it's as though they learned about open source software, like most open source practitioners now learned about it, via the telephone game.
They heard from someone who heard from someone who heard from someone who heard from someone, who heard from I don't know Stallman 40 years ago this is what it's about. Therefore, they know what they're talking about, and I'm sorry they don't. They just don't. There's a lot to this to do it properly.
Mike Julian: I guess on that note, shifting gears a little bit, let's talk about open source business models. This has been a hot topic in the news in the past couple of years with Amazon trying to kill Mongo, and then trying to kill Elasticsearch. Well basically Amazon just trying to kill everyone.
What's going on with these concepts of an open source business model, why are people suddenly changing their licensing now, what's going on there?
Vicky Brasseur: You can't see me gritting my teeth because this is radio so to speak. There is not now and there never will be an open source business model, full stop. People who say there is know absolutely nothing about business and my goodness it's difficult not dropping F bombs right now because I'm pretty passionate about this subject. There is no open source business model. Open source is one of the many tools you use to make your business successful. Just like any other tool you're using, just like your marketing team, your sales team, all the tools you're using, Salesforce and the people who are cleaning up your office, they're helping to make your company successful. Your support team is incredibly useful to make your company successful. Open source software is just another one of those things.
If you as a business are going to release your secret sauce and you're going to put it out there for the world to see and take and put it under an OSI approved, free and open source software license, then you're going to get your knickers in a twist because someone else takes it and does something with it, I'm sorry, the license you put on there, you have given them permission to do this. They're doing exactly what you told them they could do. It is not their fault if you can't run a damn business. If they take this open source software that you have released and they make a more compelling business and product offering out of it than you do, that's not their fault, that's yours. That's you not listening to the market. That's you not listening to the users. That's you not able to deliver on your particular business prospect. That's not the fault of open source, you've got to learn how to do some business honey. There is no open source software business model, there are only business models.
Open source is one of the many things that can help contribute to a successful business model. Sorry I did say I was a little passionate about this.Mike Julian: I really wish I had an applause sound effect right now would be great. Yay, like that was all very enlightening. There's no such thing as an open source business model, instead we use open source as a technique for growing our business, but really we still need a business model to begin with. Open source is just a component of that.Vicky Brasseur: Yes.Mike Julian: Looking at companies like, I really don't mean to be calling out Mongo and Elasticsearch, but they're the two most recent ones. In those situations, I actually read this too, what should they have considered doing instead of as you say getting their knickers in a twist over something they told the market was totally fine? What is the other option?Vicky Brasseur: Well, I can't say specifically what these companies could have done or should have done because I don't know what they did do.Mike Julian: Let's come at it from a different angle rather than telling some other company what they should do. Let's say that I'm writing some software that is kind of along the same lines of I want to open source it for the world to use and use that as a lead gen to sell my commercial offering. You know that sounds an awful lot like what everyone else has already been doing and now they're getting their lunch eaten. What are my other options? What else could I consider?Vicky Brasseur: Well there are multiple business aspects that you could take there. I mean yes, a lot of other companies are going the open core model and there's not necessarily anything wrong with the open core model. Now for the listeners who don't know what that is, it is essentially where you have the core of your software be under a free and open source software license. It's freely available. Then you can have an enterprise version that you sell that has value adds on top of it. 
You have your core version that's free and anyone can take it and do things with it. Then you have your enterprise version that people pay for and they get increased support or they get more features and they get more speed or more seats or whatever, it doesn't matter what it is. That's part of your business model, that's part of your business. There's nothing wrong with open core in that way, that's perfectly fine. Part of what these particular companies are complaining about is, as you mentioned earlier, saying that other companies are eating their lunch by taking these things and not contributing back to the software itself. I am going to take your database software and I'm going to have another offering. I'm going to build a better product on it, and so I'm going to take your customers. That's fine, and that's perfectly okay, but using that software and not giving back to it is kind of dirty pool. In the free and open source software world we call that the free rider problem, where people are using the software and not contributing back. Now these companies that have recently switched licensing and said, "Oh my gosh open source business model doesn't work," yawn, whatever, your business model doesn't work, there is no open source business model. It's like saying unicorns don't work. They all complain about this, but none of them have ever once said, "And here's how we reached out to these other companies and asked them to contribute to the community." None of them have said, "And here's how we asked how we can make it easier for them to contribute to the community." None of them are talking about the attempts they have made to try to get other community members. Frankly if you look at their repositories for their core software, it doesn't look like they've done that for anyone. 
You can't point fingers at a large bookstore to the north of me saying that they've been doing bad things.If you haven't been running a good community, if you are just doing things where you are the only people who are playing in your little sandbox, you're not letting anyone else in. Then you really get pissed off if someone else builds their own sandbox next door out of the same sand you're using? I'm sorry that doesn't make any sense to me. If you want your free and open source software project to be successful, you have to build a good community around it. That means reaching out rather than expecting everyone else to reach in. Meet the people where they are and try to figure out how you can make it easier for them to become a part of the community, because that becomes the rising tide that lifts all boats. How many metaphors could I throw into this particular rant? A lot of them.Vicky Brasseur: That's something that I think a lot of companies do very, very poorly when they release the software, is they just assume if I release it they will come. No, that is not what happens. Community takes time, community takes effort. If you want your open source software to benefit you more than just word of mouth of look at them releasing something, you have to put a lot of effort into it to get it right. Sorry little mini rant there.Mike Julian: It sounds like the community is really the core facet of all this. If you want good software and you want people to really like using your software, you need to build a community, you need to foster that. What are some tips for, if I'm launching an EP software, how can I grow my community from there?Vicky Brasseur: How do you grow your community? Well there's lots of different ways to do this, and Mary Thengvall has a really great book that's come out recently that's related to community. You should check that out. 
It's officially about developer advocates, but there's a ton of community work in that and Mary does really great work in community. She is your community specialist. However, being in free and open source software for 30 years now, I have picked up a few things about community, so I feel more than qualified to talk about this. Number one, documentation. For the love of dog, write everything down, document all the stuff. Documentation is going to scale so much better than your developers. Make sure you have stuff documented before you release it. By stuff, I mean how to stand up your developer environment, how to get started. Why would I even want to use this software, here's our glossary and very importantly how do we contribute? How do I as a user of your software, how do I show up and even just make a simple bug patch? How do I send a documentation patch? How do I do even the simplest stuff? Where do I communicate with you? Document all of those things as well and really just throw open the doors. Also, it's absolutely vital, it's table stakes now and people who say otherwise are probably jerks you don't want in your community anyway. It's table stakes to have a code of conduct and to enforce it, because if your community, if your project is not friendly to people, if it doesn't treat people with the basic level of respect that a code of conduct ensures, then your community is not somewhere that anyone wants to be, which means you don't have a community, you have a cesspit. Get a code of conduct, learn how to use it.Mike Julian: Completely agreed.Vicky Brasseur: Yeah. There's many different other things you can do as far as building community, but those are some of the starters.Mike Julian: Yeah, those are some really great tips. Shifting gears a little bit, you and I were talking before we started this call about a concept you've been talking about called open source sustainability. Could you tell us more about that? 
What's your idea?Vicky Brasseur: This is a big buzz word in free and open source circles lately, it's all about making free and open source sustainable. This started, well we've kind of been talking about it for a long time, because of this whole free rider problem. With "free rider," you can't see my air quotes, but that's been something that we talk about in free and open source software for a very long time, people using but not contributing back. That's a problem and that's something that we can potentially work to not fix, but at least shift a bit. That's something, but this is really, we've been talking about that for a long time. A few years ago Nadia Eghbal came out with a study through the Ford Foundation called Roads and Bridges. It was about I guess the crumbling infrastructure of free and open source software and how so much of what we use is not well maintained. That's led to a lot of conversations around this. We've all seen problems with this around OpenSSL and Heartbleed, how there just weren't enough people there to go maintaining it. They were just killing themselves, almost literally, to maintain this stuff.Mike Julian: Yeah, turns out it's like one person is kind of doing most of the work.Vicky Brasseur: There was a lot going on there. That started the conversation around, what does it mean to make sure our free and open source projects on which we all rely, because we've all built our businesses on them, what can we do to make sure that free and open source software is sustainable and will stay around? Frankly to me it's a business risk to be using something that's not maintained.Mike Julian: Absolutely.Vicky Brasseur: I can't put my company's money into something that I can't guarantee is going to be maintained for a long amount of time. Now because we're in technology, and because most of technology is run by VCs, most of the conversation around open source sustainability has focused laser, just laser focused on money. 
What we're going to do is, we're going to get a ton of money and we're going to pay these maintainers. If we pay these maintainers, it'll all be better because money fixes all the problems. What money doesn't fix, technology does. And no, no, for crying out loud, no, that's not the only way to solve this problem. This is a social problem as well as a financial problem. I have, in the past, managed a team where people were paid to contribute to free and open source software projects. That's all they were paid to do, is make these open source projects better and whatever makes sense for you. These projects were very strategic for the company. Made sense for the company to be paying people to make these strategic things better for them, which was brilliant for the company. I'm really glad they did that, but at least one of those people was the only maintainer on an absolutely vital piece of internet infrastructure. Like something that ran so many different things. I know exactly how much this person made because they reported to me. I also know that throwing more money at the problem was not going to solve the fact that they were working 70 to 80 hours a week to try to maintain this. That is not a money problem, that is a resourcing problem. It's a standard sort of management issue. What we have to do if we want things to be more maintainable is you fix that, you fix that bottleneck. You fix that incredibly horrible bus factor of one. That's what you have in a lot of free and open source software projects. Now how do you fix that? You as a company need to contribute back. By contributing back I'm not just talking about throwing money at the problem, you have to contribute resources. Those resources can be human, they can be technological, for servers to help scale things out better. They can be more people to document, they can be people to design. They can be people to market. 
It doesn't matter, but get these vital free and open source software contributors support, and that support is not necessarily money. Money helps and certainly all these people would love to get paid more and get paid full-time to work in their free and open source software projects. It's not going to help if they are still the only one, and they're still working 80 plus hours a week to save your ass sometime. Give back and contribute to, and learn how to contribute to these projects, which is where I'm going to plug my book frankly because we didn't talk about this, but I'm going to do it.Mike Julian: Please.Vicky Brasseur: It is the only book on how to contribute to free and open source software projects. If people don't do this properly, free and open source software will not scale. It is growing at millions of new open repositories in GitHub alone every single year. That's just GitHub, that doesn't count GitLab, that doesn't count Bitbucket, that doesn't count all the things that Apache and all these other projects are running. Millions of new repositories every year, who's going to maintain that? We need to train people how to contribute to open source software and that's why I wrote my book. Otherwise, we are going to collapse under our own weight. Please learn how to do this.Mike Julian: On that note, where can we find your book?Vicky Brasseur: Where can you find it? Will there be show notes?Mike Julian: There will be show notes.Vicky Brasseur: Okay, good, well there will be a link in the show notes then. The link which will go directly to my publisher is fossforge.com, so F-O-S-S-F-O-R-G-E.com and that will go directly to the Pragmatic Bookshelf page for this. I love the Pragmatic folks.Mike Julian: Wonderful.Vicky Brasseur: They've been so amazing to work with. If you ever need to write a book, man go with them, they're so fun.Mike Julian: Yeah, that's great to hear. 
Aside from your book, where can more people find out about you and your work?Vicky Brasseur: Oh about me, well they can go to my blog which is anonymoushash, one word .vmbrasseur.com. You can also just find it from my website which is vmbrasseur.com and I do way too much of the twittering, so that's probably the best way to keep up with all the things that are on my mind right now. You're not going to see this is what I had for lunch or OMG look at the cute kitties. That goes on a different Twitter account, but you will hear all about …Mike Julian: Open source all the time?Vicky Brasseur: Yes, this one is open source all the time, management all the time. It's a lot less dull than that, but trust me, I hope.Mike Julian: Yeah, all right. That's awesome. Well thank you so much for joining us, this has been an absolute pleasure to have you.Vicky Brasseur: It's been super fun. I love talking about this stuff and I'm very grateful for the opportunity to do so.Mike Julian: Well thank you. To all our listeners thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play or wherever it is you get your podcast. I'll see you in the next episode.


1 May 2019

Rank #7

Observability in Mega-Scale Banking with Greg Parker

About the Guest: Greg established and leads the Enterprise Monitoring Services team at Standard Chartered Bank, and together with his team wrote and implemented a strategy and approach to effectively monitor and leverage data from over 1,000 applications, 30,000 servers, 15,000 network devices, public and private cloud, mainframe, tandem, and multiple other technologies in a sustainable and scalable way. Applying Agile and DevOps techniques to the build, engineering, and support of the monitoring ecosystem at Standard Chartered, the team brought together tools across the technology stack and advocated techniques such as monitoring as code in order to improve monitoring quality and make it a mandatory part of the deployment pipeline. Prior to that he worked at Barclays Capital in Singapore and Goldman Sachs in Tokyo, Japan in various infrastructure and engineering roles.Links Referenced: Connect with Greg on LinkedIn TranscriptMike Julian: Running infrastructure at scale is hard, it's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O’Reilly's Practical Monitoring.Mike Julian: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization and Kapacitor for real-time streaming. 
All of this is available as open source, and they also have a hosted commercial version too. You can check all of this out at influxdata.com.Mike Julian: Hi folks. Welcome to the Real World DevOps podcast. I'm here with Greg Parker, head of enterprise monitoring services at Standard Chartered Bank, way out in Singapore. Welcome to the show Greg.Greg Parker: Thanks,  Mike. I'm doing well. How are you doing?Mike Julian: I'm doing fantastic. So Standard Chartered Bank, like what is this? It sounds like just a bank, but I've been talking to you about it and it sounds like it's a whole lot bigger than I imagined.Greg Parker: Well, Standard Chartered operates across 70 countries. There's more than 1200 branches, there's 90,000 employees, and it's just a sprawling financial institution, but it's primarily operating in Africa, Middle East, a lot of emerging markets, and the headquarters for IT is in Singapore, though the bank is headquartered in London. And so out of Singapore we drive the technology strategy and across all of the markets over 70 countries. And we get a lot of diversity in our environment because of the different strategies that we have in each country. Coming from, I was working for Goldman Sachs for about ten years, where IT was very tightly controlled from the center from New York where, the word came down from the heavens around this is how you're going to do everything. And then I went to Barclays, which was a similar model except the word came down from London, and at Standard Chartered it was really Singapore saying, this is what we should be doing and this is how we operated for our group owned applications, but there were 70 other countries saying, this is how we have to do it in Nigeria, and this is how we have to do it Kenya, this is how you have to do it in Pakistan. 
And so you have all of those issues creep up when you're working across emerging markets and especially in a financial.Mike Julian: What's your role in Standard Chartered?Greg Parker: So my role at Standard Chartered is to run enterprise monitoring. And it wasn't my original role. I came in to drive some infrastructure projects, large infrastructure projects, and when I got there, I saw that monitoring was essentially chaos. There was really no central strategy around how we're going to do it. And so I worked with some people there and we effectively established a central enterprise monitoring organization for Standard Chartered. The problem was there was no central strategy or tool set, or group of tools that we were using for monitoring, and there were multiple vendor deals, negotiated at different prices at different times with different countries. And so there's a lot of inefficiencies that were contributing to massive MTTRs. Which meant, when an issue occurred, a thousand different teams got an alert, nobody knew whose fault it was, and it took all this time to work out, what's the root cause and how are we going to resolve it? And I think a lot of that comes down to the fact that monitoring wasn't precise.Mike Julian: And I'm sure in no small part due to countries not being able to talk to each other.Greg Parker: Certain countries couldn't talk to each other and other countries just didn't know to talk to us. And so there was a lot of people working in silos.Mike Julian: How does your strategy even look when you have all these different entities that are doing their own thing, and like culturally you're not able to say, “This is what we're going to do.” So what's your approach instead?Greg Parker: Well, we do have the authority to dictate if we want to. 
And that's one of the things that came along with establishing this central organization which is backed by the CTO and the head of technology services, which is going to say, our mandate is to go out and fix monitoring for SCB. But at the same time it's not something that we want to do, is to just give people a mandate. And I've been saying this the whole time is that our strategy is not perfect and it's never going to be perfect. Our strategy is focused around corralling all the data that's out there, and translating it, and enriching it, and normalizing it, and then exposing it through APIs. And that's really the crux of it. But, it's never gonna be perfect, and our focus was to just implement a working framework, and help to improve monitoring so that we can reduce those mean time to resolve and mean time to detect times, and just give generally a better sense of observability for the company. So we have a sense like that we know what's going on. Mike Julian: Why don't we talk more about the strategy behind what you're doing. Like what does it actually look like? How did you come to this strategy? How are you implementing it? What is it even?Greg Parker: So we started with multiple teams that had implemented their own monitoring with the tools that they wanted to work with. We have BMC, deployed across all of the infrastructure. A lot of the application teams had purchased ITRS. There were other tools like AppDynamics and Dynatrace and some open source tools out there with Grafana and Elastic and all of that. And so my first thought was, we're not going to standardize everybody to one tool, and there's not one tool that's going to, be this panacea that's solves all our monitoring problems. And that's kind of one of the key tenets of monitoring is that, there's a lot of different tools. It's not about finding the perfect tool, it's about getting them to work together. 
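A minimal sketch of the pipeline shape described above (translate each tool's events, normalize them, enrich them, then expose one common record) might look like the following in Python. Everything here is an assumption made for illustration: the `tool_a` and `tool_b` payload fields, the severity mapping, and the host-to-owner lookup are hypothetical and do not reflect the real output of BMC, ITRS, AppDynamics, or any other product named in the conversation.

```python
from dataclasses import dataclass, asdict, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class Event:
    """Common schema that every source is translated into (hypothetical)."""
    source: str
    host: str
    severity: str
    message: str
    owner: str = "unknown"


# Translate: one small adapter per tool. The raw field names are invented.
def translate_tool_a(raw: dict) -> Event:
    return Event("tool_a", raw["hostname"], str(raw["level"]), raw["text"])


def translate_tool_b(raw: dict) -> Event:
    return Event("tool_b", raw["node"], str(raw["sev"]), raw["summary"])


TRANSLATORS: Dict[str, Callable[[dict], Event]] = {
    "tool_a": translate_tool_a,
    "tool_b": translate_tool_b,
}

# Normalize: map each tool's severity vocabulary onto one scale (assumed values).
SEVERITY_MAP = {"crit": "critical", "1": "critical", "warn": "warning", "2": "warning"}

# Enrich: attach ownership from an inventory lookup (assumed data).
HOST_OWNERS = {"pay-01": "payments"}


def normalize(event: Event) -> Event:
    severity = SEVERITY_MAP.get(event.severity.strip().lower(), "info")
    return replace(event, host=event.host.strip().lower(), severity=severity)


def enrich(event: Event) -> Event:
    return replace(event, owner=HOST_OWNERS.get(event.host, "unknown"))


def ingest(source: str, raw: dict) -> dict:
    """Run one raw event through translate -> normalize -> enrich."""
    event = TRANSLATORS[source](raw)
    return asdict(enrich(normalize(event)))
```

Calling `ingest("tool_a", {"hostname": "PAY-01", "level": "CRIT", "text": "disk full"})` yields a plain dictionary that an API layer could serve as JSON. Keeping the per-tool adapters separate from the shared normalize and enrich steps means adding another tool only requires one new translator.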
And so the initial thought was, let's just take all this data that we have and somehow ingest it and normalize it and then figure out a way to leverage it. We can use more modern tools once we have a dataset that we can leverage. And so we started off on that path, and we started building integrations, and building ingestions, and building a translate, an enrichment, and a normalization layer that would ingest not only the data coming from the patrol agents, of BMC Patrol agents across all of our servers, but everything that was being deployed, AppDynamics, or APM. We onboarded a tool called Sysdig for container monitoring, Dynatrace does synthetics, ITRS is obviously doing a ton of application component monitoring as well as infrastructure monitoring, and some tools are harder than others to integrate. And that was the main focus of our efforts over the last 18 months or so was ...Mike Julian: Is on the integration side?Greg Parker: Yeah. Was bringing everything into one place. When you're at a big company, like Standard Chartered or any large, investment bank, you can spend years talking about what you want to do and-Mike Julian: At some point you just got to do it.Greg Parker: ... Yes. And I've seen it happen. Exactly. I've seen companies talk for years about cloud strategies, and monitoring strategies, and web service strategies, and database strategies, and at some point you just have to dive in and start doing it.Mike Julian: So what makes the integrations some more difficult than others? Like what are the actual challenges with it?Greg Parker: Well, we talk a lot about, there being open source tools and there being vendor tools and proprietary tools, and there is this thinking that proprietary tools aren't open and you can't get the data in or out. That's not necessarily the case. There are some open source tools where it's really hard to pull up the data and there is some vendor tools where it's actually relatively easy. But for us, Patrol wasn't too bad. 
AppDynamics, Dynatrace have relatively open platforms for us. Actually, we ran to a bit of an issue with ITRS because they don't really structure their data in the way that you would expect from a normal monitoring tool. There's [inaudible 00:10:57]. These things make it difficult for us to say, this is where the breach occurred and this is where it was resolved, and this is what it should look like, and this is how we should normalize it. And at the same time, they have a Kafka method in order to bring in the data. But we were running it on such old VMs that if we were to turn on that Kafka, then it would impact the performance of the gateways, and ITRS operates on a siloed model so that we had 300 gateways across our environment, more than 300 gateways. And for us to ingest the data from all of those gateways would've meant either a single script pulling stuff from the central database or 360 different scripts running on each gateway and bringing data in. So for us that was a major challenge.Mike Julian: Yeah, that sounds like a nightmare. How are you going around to all the teams and getting them to adopt what you're doing?Greg Parker: Well, like I said, we wrote a white paper in the team essentially about what we were going to do, what the approach is going to be as far as enterprise monitoring. And we circulated that in the forums that we have for technology standards. We got some feedback, we got people who didn't read it and just said fine, and I think you'll see a lot of that especially in big banks or people who are running a large organizations. And then we got people who are like vehemently disagreed with our approach. Like violently, just like absolutely not. And I think that's the problem that you'll always find is, you'll have somebody who's completely ideologically focused on a single type of solution. 
The protests, to attend to come from people that hadn't spent a lot of their careers in large corporations but in technology companies where you can probably do some more experimentation.Greg Parker: But it's not that they were wrong at all. They're absolutely right. We should maybe look at these more modern technologies. But when we have a limited amount of time in order to deliver value for a company that's not focused on technology, like our primary service isn't delivering, cloud to our users or something like that, or primary services being a bank, and we had an environment, built out on BMC, that we just needed to upgrade in order to be able to achieve what we wanted to do. And so it ended up being a compromise obviously, where we would agree, yes, this is what we want. We're not necessarily delivering it the way everybody wants, but we're doing something for the purpose of expediency, getting a solution out there, and at the same time we're going to start evaluating, what's the best way to do it? And I think it's always an evolution like that at big companies.Mike Julian: I have spent a lot of time at very large companies myself and I've also spent time in the small companies, and I ran into a lot of people that are, they don't like how technology decisions are made in very large companies, and mostly how slow they are and how behind the curve they feel like they are. But to me there's actually a lot of really good reasons for that. But to me that actually takes away a whole lot of stuff that I don't need to worry about anymore. Like with you, you have the support of a very large bank behind you. You have all this data that you get to play with, a single division at which you're working on towards the entire revenue of many of the companies that are in Silicon Valley. 
So just the scale of what you're doing is way more interesting that you're only gonna find it a large place.Greg Parker: You don't account to these issues when you're trying to monitor a single application, then you're just like kids, fantastic, let's do [inaudible 00:15:06] of tracing and let's just get everything into a database and then you can build a Grafana dashboard and it's fantastic. But I think that's the thing is, we're not trying to monitor our company. We're trying to build a framework so that each individual team can monitor their application. Right?.Mike Julian: You and I were talking while back about this particular challenge as well, about how, if you develop this strategy and develop this platform, how do we get it out there? And the one where we're talking about is, well, you don't have like one team that needs to adopt it. You have several dozen teams that all are doing their own thing and you want to get them to adopt it. So it's not a quick thing. This is not going to be done anytime soon. So this is very much a very long play for you. Do you have any idea about how long it's going, how long of an investment this is?Greg Parker: It feels like, well, it's something that never ends first of all, but it feels like it's already been going on forever. So we establish the team like around the end of 2017. We spent the better part of the last, 14 to 16 months building the platform and the integrations, and our thinking was, we'll get people on board with the concept, we'll build something, we'll deliver it to them, and then we'll slowly drive the adoption. And so having spent the most of the last year building the platform, in 2019 we're going to be primarily focused on driving the adoption. 
So we've migrated over about 50 teams at this point from monitoring through either email or remedy tickets or something like that to our standard platform which exposes data through API and it has a front end, and we plan to drive, another 50 to 100 teams next year onto that central platform.Greg Parker: And I think one of the difficult pieces we'll be looking at the teams that are using more mature solutions that, where they've actually spent a lot of time. We have a team that's has for resources that have been fully dedicated full time for the last three years on our eCommerce area [inaudible 00:17:28] this straight to bank platform, which is eCommerce in real time, trade execution and all of these sort of things. And they've built an ecosystem around this ITRS gateway that is completely custom and completely complicated. It's just insanely complicated. And then they generate dynamic dashboards and they inject XML because that's the only way you can really talk to ITRS in a programmatic way. So it generates XML flat files and pushes it out to the gateways and everything like that. And I mean, for me the solution that works for them, and so that's not going to be our primary goal is to absolutely get 100% of the bank onto our framework and our platform. It's really for the teams that haven't put that type of thought into their monitoring. And if you have just been working off of emails and remedy tickets because alerts were auto-generated to tickets, which is a horrible way to deal with alerting, but for teams that were doing that sort of thing, they're more willing and they're definitely much more eager to say it. I'm so glad that you delivered a platform that is actually purpose built for monitoring.Mike Julian: You made a really fantastic point there. I want to call it out so we don't miss it. That your goal is not 100% adoption. Your goal is to provide something where there wasn't anything good before. 
So it's not that you're trying to tell the teams, all the teams that you support that, no, you have to use my thing. What you're telling them is, this is available, but if you want to do something else that's fine. But this thing that ...Greg Parker: If you like your manual, keep it.Mike Julian: Like this thing that we've built is fully supported, so if you don't want to maintain your own stuff, then you can use ours.Greg Parker: Yeah, exactly. And we try not to position it as a mandate. Generally, we’re an internal team, we could manage one, but we sell it. I have a team of people that spend their time sitting down with the support people on the ground and selling our platform to them as if we're a technology company, because we want people to want to buy into it and not just feel like they're forced to use it.Mike Julian: What sort of questions or I guess objections are you getting when you're selling this to teams? Are there any common complaints, any common objections that come up?Greg Parker: There's two big ones. I mean, the first one is that people are very used to the way that they were doing things. And that happens when they've been monitoring in a certain way for the last 10 years, which is, they just stare at a remedy queue and they hit F5 every 30 seconds to refresh it. And some of them actually have created macros, so they don't have to actually hit F5. It just automatically refreshes every thirty seconds.Mike Julian: That's incredible.Greg Parker: Talk about automation. But they're very used to that and they're very ... Because audit and compliance is always a huge pressure at a bank and they're very worried that they might miss a ticket. And that's why they've auto-ticketed every alert, even minor alerts, 70% threshold breaches and that sort of thing. And they're like, as long as it's a ticket, then we can't say that we've missed it. 
And that's entirely missing the point of monitoring, which is like, you shouldn't actually even have to look at anything if there's no remedial action to take. So you do have to sit down with a lot of junior support people because the domain heads that are sitting in Singapore and aren't really on the ground hear gripes coming up from the ground saying, “We don't want to move to this solution that the monitoring team is trying to force on us,” and it's about helping them to understand the bigger picture. And so, even though we have the support from the domain heads, you have to do a grassroots campaign. I feel it's the best way to approach it. And the other big gripe that comes along is from people that are using tools like ITRS, because there are certain features in ITRS that we can't replicate in our central platform, like acknowledging and snoozing an alert or something like that. And that's a functionality that people have been used to, even though it's not necessarily a good practice. And there are things like that and you just have to kind of take them case by case.

Mike Julian: Sure. And on that note, you mentioned that there's this functionality in ITRS that people are used to of acknowledging or snoozing an alert and it's not a good practice. So in that situation, are you also, through your platform, teaching people how to do monitoring better? Like, is there an education component to it?

Greg Parker: We really try. There were four pillars of our monitoring transformation: people, process, governance and technology. On the people side, we built up the team. On the process side, we are building a lot of documentation and training about how people should be thinking about monitoring, and a monitoring pyramid, which is that you obviously think about your business deliverables and your business KPIs before you think about what your alerts are going to be and what you are going to monitor for.
And so that's another completely different aspect of my organization: trying to drive better monitoring practices across the organization. And that's much harder than building the technology.

Mike Julian: As it turns out, people are a lot more work.

Greg Parker: Yes, exactly. Driving changes in behavior, especially when they're like deep seated little habits, is very difficult, and we formally modified our software development framework, our SDF and our SDLC and gateway checks and all of these things in our governance documents, but it doesn't mean anything. People are not going to sit down and read those. It's really about helping people to drive better monitoring in their area.

Mike Julian: You kind of touched on a point earlier that audit and compliance is a big deal. So with all this stuff you're doing, regulatory controls. I mean you're a bank, so regulatory controls and audit and compliance and all that stuff surely plays a pretty big role in what you're doing.

Greg Parker: Yep. For me that's probably 70% of my job. It's massive.

Mike Julian: What does that mean? What does that look like to you?

Greg Parker: So the compliance aspect is driven by your own policies, and that's the thing that a lot of people probably don't get: all of your audit burden is brought on by yourself, because your internal … you have three lines of defense in any organization, any corporation. You have your first line which is your internal risk and control team, and your second line which is your group operational risk, and then your third line which is your group internal audit, and they'll all judge you based on the policies that you write. But at the same time, if your policies don't adequately address the risks that face the company, they'll write them for you.
Mike Julian: So it's much better for you to write them?

Greg Parker: Yes, but it's always a balancing act because I could write a policy that says, there's no central monitoring standard for the bank, and then they'll say, well what about the risk of the bank not having adequate monitoring? And I'll say, there's a small monitoring standard for the bank where all you have to do is monitor CPU. And they're like, well, what about the risk of a file system failure? And so you're just constantly balancing: I don't want to have too much governance burden, but you want to address the risks of the bank. And so you go into this negotiation with GOR, with group operational risk, and talk about, I think this is not a material risk, based on my experience and based on industry practice, and you arrive usually at a compromise, but in the end there has to be a monitoring standard. And that's the little piddly tedious thing that GIA is going to audit you on every time. So for us, a constant finding in audits across the company is that the monitoring standard says that your CPU threshold should be 90%, and we looked at the configuration and it’s 87%, and it's just nonstop. And so from our perspective, we want to put out a monitoring standard that helps improve the company's overall production stability and addresses the risks of the company not having adequate monitoring. But it shouldn't be so specific that it causes internal teams to fail audits when there's not actually a material risk.

Mike Julian: Like alerting on a CPU?

Greg Parker: Right. Alerting on a CPU at 91% is not a material risk, if the standard says it should be 90%. One way that we try to do that is we de-emphasize the importance of static thresholds, obviously. And it's sort of an ancient monitoring technique anyways. Now you have a lot more machine learning and dynamic thresholds. And so we try to de-emphasize the importance of static thresholds.
We put more emphasis on broad themes of monitoring, like you're monitoring for your performance, you're monitoring for errors, you're monitoring for peak utilization and high levels of demand. And then from that point on it's really about educating your internal audit and risk and control teams about why these better address the risks of the company.

Mike Julian: You actually end up with really two different customers here. So you have the group that's using the platform and then you have your internal governance that's judging you on what you've written and what you're doing, and you're having to satisfy both, which is almost an impossible task in some ways.

Greg Parker: That's a good word. They're constantly judging us, and sometimes based off of an antiquated understanding of the industry. I love these auditors, sometimes being in the business for 30 years, and they saw how monitoring worked in 1990 at IBM, and they expect to see a similar structure when we're trying to drive something that's more modern. But generally the users are appreciative of the fact that we field the monitoring-related questions around audit, but they have to be aware of the policies that exist and try to drive them and implement them because, another way to drive our policies across the bank is that they’re going to show up on these noncompliance lists if they're not compliant with the policies. And so what we do is we try to establish a reasonable standard and a reasonable framework, and from that point on we can let GOR and GIA drive the compliance. To an extent, that's sort of a hammer that a central team can use across a large organization.

Mike Julian: So our listeners are going to absolutely roast me over a fire if I don't ask this question. But what's the tech stack look like underneath this platform you've built? Like what goes into it? So you mentioned that there's a lot of vendor, a lot of proprietary, a lot of commercial and a lot of open source tools.
But what's gluing it all together?

Greg Parker: Like I said, we tried to bring together a bunch of different tools. At the center is a tool called TrueSight Operations Manager, which is from BMC. And that's an aggregation. It's a manager of managers, and it's a way to ... it has enrichment models, it has normalization models, and it has a lot of interfaces. And so the different agents that we have across the organization to collect data include, like I said, ITRS, BMC Patrol, AppDynamics. There's open source tools out there, there's Elastic and there's Beats that are out there. Like I said, Sysdig does our container monitoring and that's based on an open source agent that was developed back in 2011. There are teams that are using Prometheus, there are teams that are using Telegraf and Influx to collect data. And we were able to ingest that data via a set of proxies across every country and in our two main data centers, normalize it, and enrich it with data from our CMDB, our configuration management database, which enriches that monitoring data with information about business criticality and application and owner and a lot of different things.

Greg Parker: And on the other side of that, because like I talked about before, BMC is a massive corporate vendor, but they're starting to step in the API direction and they've developed APIs for TSOM. But there's a lot of issues with the APIs for TSOM. They're complicated, they're hard to use. And so while teams can use the front end of TSOM, of TrueSight Operations Manager, to build simple dashboards and there's drag-and-drop and there's widgets, we then stream that data out of TSOM into an Elastic database. And then we've built a platform on top of Elastic using the Elastic APIs, and then we have the Elastic APIs plus the set of custom APIs that we've built to expose all of the data to users in real time, so that they can build their own real-time dashboards and visualizations off the back of that.
And for right now, that's why we have a team of like 60 people or whatever, that ...

Mike Julian: Just to make sure that people got that. You said 60?

Greg Parker: Yes. Across engineering and support and our governance and our service management teams.

Mike Julian: I just want to be sure that people listening to this, like, this is not a simple thing to do. That's a pretty significant organization working on that.

Greg Parker: And most of the banks that I've been with have not actually devoted that much resource to their enterprise monitoring organization. There's usually an enterprise tools team that has a handful of people or maybe 10 or 20 people, and then there's a support team that supports a group of sort of IT for IT applications, but we really thought that we're going to take all of those teams and bring them together, and really try to drive a strategy centrally. And that's why it's a relatively large organization, but usually you have that number of people spread out across the bank, supporting the entire ecosystem.

Mike Julian: Got you. There's so much that goes into how a large bank operates that it's just never occurred to me before. So this has been absolutely fantastic to learn about.

Greg Parker: I mean, it's just an octopus with its tentacles just going out everywhere and you're really just trying to corral everything together. And so, still being early on in our journey, like I said, we're just trying to corral everything. But there's other teams that have gone out. In Standard Chartered for example, the cloud team has gone out and sort of built everything from a greenfield, with their best of breed tool set of choice for DevOps, and all of the aspects of the DevOps pipeline as well as monitoring, which also integrates with our central framework. And so there are areas and there are pockets where there are application teams out there that are building microservices applications based on Docker Swarm and based on Kubernetes.
And so they're very modern and fault tolerant and performant. And then of course there's still plenty of applications out there that run on a mainframe and Tandem, and all of those fall under the same rules and policies.

Mike Julian: So surely you've learned a few things from doing this whole project. What's gone well? What hasn't worked?

Greg Parker: I would say that the thing that I would change if I were just starting over from the beginning would be to initially just talk to my users more and really focus on improving monitoring from day one. Whereas, we were saying, we're so far behind we have to upgrade our system and we have to start building all these equations, we have to build this central data layer. But we actually had the tools in place where we could, from day one, start talking to our users and implement these fundamental things that didn't require our modern tool set. They didn't require anything advanced. You don't need Kafka, databus and Cassandra and all these different things if a Unix team is not able to monitor VCS. I mean, we needed a few scripts and a knowledge module, and then you can have active, proactive monitoring of your VCS clusters and know when they're going down or falling over or hitting resource constraints. And it was only after we had spent maybe half a year that I had the head of Unix or the head of platform coming up and being like, "When are we going to get better ping monitoring and when are we going to get better VCS monitoring?" And I said, "Well, actually we've been working on completely overhauling or upgrading the platform." But I think a lot of problems can be solved just by obviously talking to your users, and we probably could have reduced our MTTRs and MTTDs quicker if we had started with that, while at the same time concurrently starting to drive an upgrade.

Mike Julian: So this has been an absolutely fantastic conversation. Thank you so much for coming.
For people in a company like yours, they're probably listening here and saying, “Yeah, but I can't do that, like for all these different reasons, like that won't work for me.” What advice would you give them?

Greg Parker: I would really just say to focus on your monitoring coverage and compliance with your monitoring standards, and focus on developing good standards. Because like I said, if you're at a large company or if you're at a company that has a typical corporate governance structure, you don't necessarily have to be the person that drives the compliance to that. If you write a good, risk reviewed policy that addresses your company's main risks as far as monitoring is concerned, then you have teams of people whose job it is to drive compliance with those policies and who will audit the other teams for those policies. So federate the work out, try not to take everything on yourself, which is a lesson that I've learned. It's impossible to do in a bank with 34, 35,000 production systems, and 1,000 applications. If you're driving a large organization and you don't have the resources, then focus on a very solid base, and ensure that your group risk and your group audit are on board with your policies and understand them, so that when they do their jobs, they drive those teams to adhere to those policies.

Mike Julian: All right, so is there anywhere that people can find out more about you or your work?

Greg Parker: I'm trying to think.

Mike Julian: You do work for a very large bank, so perhaps not nearly as public as some others.

Greg Parker: I mean, I'm on LinkedIn and I welcome people to get in touch with me there.

Mike Julian: Alright. I'll throw that in the show notes of course.

Greg Parker: Sure.

Mike Julian: Well, thank you so much for joining me, Greg.

Greg Parker: No problem, Mike, anytime you want.

Mike Julian: And thank you to all our listeners as well.
If you want to stay up to date on the latest episodes, you can follow along at realworlddevops.com or on iTunes. And if you're listening to this on iTunes, please rate us. So thank you and have a wonderful evening.

Greg Parker: All right. Thanks Mike. Appreciate it.


21 Feb 2019

Rank #8

Podcast cover

The Science Behind DevOps with Dr. Nicole Forsgren

About the Guest

Dr. Nicole Forsgren does research and strategy at Google Cloud following the acquisition of her startup DevOps Research and Assessment (DORA) by Google. She is co-author of the book Accelerate: The Science of Lean Software and DevOps, and is best known for her work measuring the technology process and as the lead investigator on the largest DevOps studies to date. She has been an entrepreneur, professor, sysadmin, and performance engineer. Nicole’s work has been published in several peer-reviewed journals. Nicole earned her PhD in Management Information Systems from the University of Arizona, and is a Research Affiliate at Clemson University and Florida International University.

Links Referenced:

2019 State of DevOps Survey
Previous State of DevOps Reports

Transcript

Mike Julian: This is The Real World DevOps Podcast, and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, and the authors of great books to fantastic public speakers, I want to introduce you to the most interesting people I can find.

Mike Julian: Ah, crash reporting. The oft-forgotten piece of a solid monitoring strategy. Do you struggle to replicate bugs, or elusive performance issues you're hearing about from your users? You should check out Raygun. Whether you're responsible for web or mobile applications, Raygun makes it pretty easy to find and diagnose problems in minutes instead of what you usually do, which if you're anything like me, is ask the nearest person, "Hey, is the app slow for you?" And getting a blank stare back because hey, this is Starbucks, and who's the weird guy asking questions about mobile app performance? Anyways, Raygun, my personal thanks to them for helping to make this podcast possible. You can check out their free trial today by going to raygun.com.

Mike Julian: Hi folks.
I'm Mike Julian, your host for the Real World DevOps Podcast. My guest this week is Dr. Nicole Forsgren. You may know her as the author of the book Accelerate: The Science of Lean Software and DevOps or perhaps as a researcher behind the annual State of DevOps report. Of course that's not all. She's also the founder of DevOps Research and Assessment, recently acquired by Google, was a Professor of Management Information Systems and Accounting, and has also been a performance engineer and sysadmin. To say I'm excited to talk to you is probably an understatement here. So, welcome to the show.

Nicole Forsgren: Thank you. It's a pleasure to be here. I'm so glad we finally connected. How long have we been trying to do this?

Mike Julian: Months. I think I reached out to you, it's March now. I reached out in November, and you're like, "Well, you know, I have all this other stuff going on, and by the way, my company was acquired."

Nicole Forsgren: Well, back then, I had to be sly, right? I had to be like, "I've got this real big project. I'm sorry. Can we meet later?" And, God bless, you were very gracious and kind, and you said, "Sure-"

Mike Julian: Well thank you.

Nicole Forsgren: ... "we can chat later." And then I think you actually sent me a message after saying, "Oh, congrats on your 'big project'." I said, "Thank you."

Mike Julian: That sounds about right.

Nicole Forsgren: I appreciate it. Yeah. And then, you reached out again, and I said, "Oh, I'm actually working on another big project. But, this time ..."

Mike Julian: It's not an acquisition.

Nicole Forsgren: Yeah, it's not an acquisition. This time, it's a normal big project, and it's this year's State of DevOps report. And we just launched the survey, so I'm super excited we're collecting data again.

Mike Julian: So we can get that right out of the way, where can you find the State of DevOps report?

Nicole Forsgren: All of the State of DevOps reports are hosted at DORA's site. We still have the site up.
And all of the reports that we've been involved in from, I want to say we started in 2014, I'm so old I already forgot. All the reports that we've done are hosted. We'll post them in the show notes. If you can, grab yourself a Diet Coke or coffee or a tea or a water, or if you want, a bourbon. Get comfortable. Sit back, it takes about 25 minutes. I know, right, everyone's like, "Girl, 25 minutes?"

Mike Julian: That's a big survey.

Nicole Forsgren: I know. It is. But it's because the State of DevOps report is scientific, right? We study prediction, and not just correlation. But sit back, get comfy and let me know what it's like to do your work. Because we're digging into some additional things this year: productivity, tool chains, additional things around burnout and happiness, and how we can get into flow, and really what that looks like. And one really great thing is that a bunch of people have already chimed in after taking the survey in really thoughtful ways. Also, by the way, I love you all for taking it if you have. Share it with your colleagues, share it with your peers.

But they've said that just by taking the survey, even before the report has come out, they've already walked away with really interesting ideas and tips and insights about how they can make their work better.

Mike Julian: Yeah, that's wild to think about, that the act of taking a survey actually improves my work. Because most surveys I take, I'm finished, I'm like, "Well, that was kind of a waste of time." It feels like I just gave away a bunch of stuff without getting anything.

Nicole Forsgren: Yeah, and I think the reason it works that way is because we're so careful about the way we write questions that sometimes just the act of taking the survey helps you think about the way you do your work. So just the act of kind of taking some of these questions helps people think about what they're doing.
And then, of course, like I joked already, it's my circle of life, the survey will be open until May 3rd and then I will go into data analysis and report writing. And we expect the report itself to come out about mid-August.

Mike Julian: Well, why don't we take a few steps back and say ... Everyone loves a good origin story. I believe you and I met at a LISA many, many years ago. You were giving a joint workshop with Carolyn Rowland on-

Nicole Forsgren: Oh, I love Carolyn.

Mike Julian: Yes, she's also wonderful. I should have her on here.

Nicole Forsgren: My twin. Yes. Absolutely.

Mike Julian: So you were a professor then when I first met you. I'm like, you know that's kind of interesting that a professor's hanging out at a LISA and giving all this great advice on how to understand business value, which I thought was absolutely fascinating. Professor, hanging out in the DevOps world, how'd that happen?

Nicole Forsgren: Oh my gosh. Okay so, the interesting thing is, I actually started in industry. My very first job was on a mainframe, writing medical systems, and then writing finance systems. So I was a mainframe programmer. And then supported my mainframe systems, right? Which is how so many of us in Ops got our start in Ops: someone was like, "Well somebody's gotta run this nonsense." Right? I was still in school, and then I ended up as a dev, right? I was a Software Engineer at IBM for several years, and then pivoted into academia. Went and got a PhD, where I started asking questions about how to analyze systems, so I was actually doing NLP, natural language processing.

Mike Julian: Interesting.

Nicole Forsgren: Yeah, I was doing…

Mike Julian: Yeah, that's a weird entry point into that. Definitely not what I would have expected.

Nicole Forsgren: Yeah, so the crazy thing, my first year was actually deception detection.

Mike Julian: I bet that's awesome.

Nicole Forsgren: It was really interesting, it was super fun.
But I leveraged so much of my background from systems work, right? Because what do we do? We analyze log systems.

Mike Julian: Right.

Nicole Forsgren: Right? We're so used to analyzing a ton of data in a messy format, many times text based, super noisy, can't always trust it, right? Right now people are like, "I can't trust surveys. People lie." Kids, so do our systems.

Mike Julian: All the time.

Nicole Forsgren: Right? And so, they loved me for a bunch of this work. All of a sudden, I randomly did a usability study with sysadmins. We wrote up the results, gave them back to IBM, and IBM was like, "Well what do you mean? We followed UCD guidelines, user-centered design guidelines. This should be applicable." And I was like, "Wait, whoa whoa whoa whoa, what?"

At the time, they had one set of UCD guidelines, for all users. Super super advanced, high level advanced sysadmins, who were doing back-up, disaster recovery, everything. And people who had bought a laptop and were using email for the first time in their lives.

Mike Julian: I'm sure that went over super well.

Nicole Forsgren: What? I'm like, "That's it. Changing my dissertation." Which of course, panicked my advisors. They were like, "You're gonna what?" So I start doing what, at the time, was kind of the groundwork for DevOps. Which is, how do you understand and predict information systems? And by information system, technology, automation, usage and prediction and then outcomes and impacts of the team, individual team at an organizational level.

Which now, I say all that, that's big words, that's academic words, for basically what's DevOps. How do I understand when people use automation and process and tooling and culture, and how do I know that it rolls up to make a difference and add value? Which now we're like, "Oh that's DevOps."

This is late 2007.

Mike Julian: Oh wow. So you were early days with us.

Nicole Forsgren: Yeah.
It was a really interesting parallel track, because now we look back and we're like, oh this is about 10 years ago. That was kind of the nascent origins about the same time as DevOps, right? So, so many of us kind of stumbled into it about the same time. I had no idea this was happening in industry. I kept plugging away, I kept doing it, stumbled into LISA, trying to collect data, of course, like every good academic does. Desperately trying to find data.

Stumbled into, bumped into a group collecting similar things but using different rough methods. A team from a cute little configuration management startup called Puppet, right? Started working with them, invited myself onto the project. God bless them, I have so much love and respect for them because they basically let this random, random academic tear apart their study and redo it and lovingly tell these two dudes I had never met before, on the phone, named Gene and Jez, that they were doing everything wrong and that this word they were using wasn't the right word. Redid, in late 2013, the State of DevOps report, made it academically rigorous, and then that kept going for several years, right? And then suddenly, we redid a bunch of stuff after a couple years.

I left academia, walked away from what was about to be tenure, to go to another cute little configuration management startup called Chef. That was fun, right? So I'm working on the report with Puppet, and working for Chef, and continuing to do research and work with organizations and companies. And I left academia in part because I was seeing this crazy DevOps thing make a difference. But in academia, they weren't quite getting it yet.
And I wanted to make sure I could make a bigger difference, because I'd started working in tech in college in '98, '99, 2000; we lived through this crazy dot-com bust.

And it wasn't a bust because everything crashed and the world ended like people thought, but companies failed, and it had huge implications and impacts for what happens to people. They lose their jobs, it breaks apart families, they get depressed, it impacts their lives, some people were committing suicide. And I was so worried about what happens when we hit this wave again, and we're starting to see that hit again. So what happens if companies and organizations don't understand smart ways to make technology, because you can't just keep throwing people at the problem, or throwing the same people at the problem. And when I say throwing the same people I mean, seven day forced marches.

I was at IBM when they made us do that, right? They got pulled into a class action lawsuit, you can't do that. That's not a way to live.

Mike Julian: Yeah, I've been on many of those, they're brutal. And they don't result in anything useful.

Nicole Forsgren: It's just broken hearts and broken lives, right? And so, some people say, you really care about this. I'm just this nerd academic who just cares too much about what I do. And so if we really can fundamentally change the way that people make software, because it will in fact, actually, fundamentally make their lives better ... let's do it.

And then, thank God, what we found is that it really does. Sure, it's nice that it delivers value to the business, but that matters because then, what it does, is it helps them make smarter investments, because then in turn, it reduces burnout. It makes people happier, it makes their lives better, and I think that's the part that's important.

Mike Julian: So what you've been finding is that by a company implementing all these better practices of continuous deployment, and faster time to delivery, faster time to value ...
it makes the lives of the people doing the work better?

Nicole Forsgren: Yeah and John Shook has found this as well. Right? He did this great work in Lean, in that in order to change ... some people have said like, "How do you change culture?" Let's find ways to change culture. Sometimes the best way to change culture is to change the way you do your work, and I'm sure we've seen that ourselves, right? In other aspects of our lives. To change the way we feel, to change the way our family works, to change the way our relationships work. You actually physically change your lived experience, or some aspect of your lived experience.

And so if we change the way that we make our software, we will change the way that our teams function, which is changing the way that the culture is. And so, said another way, if we can tell our organizations which smart investments to make in technology and process, then we can also improve the culture. We can also change the lives of the people, right? And the Microsoft Bing team found this, right? They wanted to make smart investments in continuous delivery.

And in one year, they saw work-life scores go from, I'm pulling this off the top of my head, but I want to say it went from 38% to 75%. That's huge.

Mike Julian: That's an incredible jump.

Nicole Forsgren: Right. And it's because people are able to leave work at work and then go home. You can go see your families, you can go to a movie, you can go eat, you can have hobbies, or you can go binge watch Grey's Anatomy. You can do what you want.

Mike Julian: That's one of the most incredible things to me is that there's this idea that in order for a company to be successful they have to push their employees, kind of put them through the wringer. Intuitively, that's never felt right. And you actually have data that shows that's not right. Doing these things actually makes everyone better.
The business improves dramatically, the people's lives improve dramatically, and everything's awesome.

Nicole Forsgren: Right, and if we want to push people, that's not sustainable. And if anything, we want to push people to do things that they're good at and we want to leverage automation for things that automation is good at. So what does that mean?

We want to have people doing creative, innovative, novel things. Let's have people solve problems, let's have automation do things that we need consistency for, reliability for, repeatability for, auditability for. Let's not have people bang a hammer and do manual testing constantly. Let's have people figure out how to solve a problem, do it once or twice to make sure that's the right thing, automate it, delegate that to the automation and the machines and the tooling, hand it off, be done, and then pull people back into the loop, into the cycle, to figure out something new.

I think it was Jesse Purcell that said, "I want to automate myself out of a job constantly." Right? Automate yourself out of your current job, and then find a new job to automate yourself out of again. We will never be out of work.

Mike Julian: Yeah, I used to worry about that when I first started getting into DevOps and actually, when I first started working on automation it wasn't DevOps at the time, it was automating Windows desktop deployments at a University. And this is in the early 2000s. And one of my big worries was, well because I spend half my week doing this, if I were to automate it I'd spend an hour doing this, what am I gonna do the rest of the time? They're just gonna fire me 'cause they don't need me anymore.

As it turns out, no, that's not what happened at all. Higher value of work became work because I wasn't focused so much on the toil.

Nicole Forsgren: Right, and those types of things, machines and computers can't do. And the other thing, I used to tell all my friends, don't think about that in terms of job security, right?
Don't try to paint yourself into a thing that no one else can ever do because then you can't be replaced, because that also means that you can never get promoted.

If we always make sure that there are aspects of our job that can be automated so that there are opportunities for us to pick up new work, that only creates more opportunities for amazing things. There are always going to be problems, there are always problems for us to solve. I don't want to be stuck doing boring work.

Mike Julian: Yeah, God knows that's the truth.

Nicole Forsgren: Oh my gosh I know. I don't want to be stuck doing boring, repetitive work. That's just a headache. If we can find, especially really challenging, complex things, and if we can find ways to automate that, trust me, we will never dig ourselves to the bottom of that hole. That is always there.

Mike Julian: So I want to talk about the State of DevOps report and I want to start off by asking a question about something you mentioned earlier. You mentioned this phrase, academic rigor. What is that, what does that mean?

Nicole Forsgren: Academic rigor includes a few things, okay? So one part of academic rigor is research design. So it's not just yoloing a bunch of questions ... sorry, yolo is my shorthand for like, "Your methodology is questionable."

Mike Julian: I've been seeing a lot of those surveys come out recently.

Nicole Forsgren: Yeah. So one is research design. And some people say, "Nicole, what do you mean by research design?" So research design is, are the types of questions you're asking appropriately matched to the method that you're using to collect the data? Right? Are these things matched? And for some things, a survey is appropriate. A one time, so one time is cross-sectional, one slice in time survey across a whole industry. Some things this is appropriate for.
Some things this is not appropriate for.

One good example: a whole bunch of people really want me to do open spaces questions in the State of DevOps report.

Mike Julian: What does that mean? Like open ended questions?

Nicole Forsgren: No, open spaces. So a lot of people have a lot of feels about open office spaces. Should I work in an open office space? Does open office space influence productivity? Or pair-programming ... does pair-programming affect productivity? Does pair-programming affect quality? People have a lot of feels about these things. The type of research design employed in the State of DevOps report, a survey that is deployed completely anonymously, across the entire industry, at a single point in time, is not the appropriate research design to answer either of those questions.

Mike Julian: Why is that?

Nicole Forsgren: Because what you would need to do is have a much more controlled research design. So I would need to know, for example, who you were working with. I would need to know, so let's go with the pair-programming one, I would need to know the types of problems you're working on, the types of code problems, I would need to know the complexity of the problems, I would need to know how long it's taking you, right? If you're wanting to know productivity, right? 'Cause I would need to know a measure of productivity. I would need to know what the outcome is. So if my outcome's productivity, I would need to measure productivity, because I'm gonna need to control for complexity, right? Because things that are more complex, we expect to take longer. Things that are less complex, I expect to take not as long, right?

And then I would need to match and control. Right? So even things like open office spaces, right? Because if you're doing pair programming in an open office space versus not an open office space, if you're doing it at an office, I would need to know seniority of the person, or some proxy of seniority.
I would need to know how you're paired, are you paired with someone at your approximate experience level, if not seniority experience level. I would need to know how the pair-programming works, I would need to know the technology involved, I would need to know if you're remote, or if you're actually sitting next to each other. I would need to know if you're both able to input text at the same time or if one person is inserting and the other person is not.

So that when I do comparisons, I know what the comparisons are like.

Mike Julian: That's an incredible amount of information. I never expected that you would have to know so much in order to get a good answer out of that.

Nicole Forsgren: And that's off the top of my head. Right, I'm spitballing because you asked me a good question. And that's just on research design and then you move on to analysis, right? When you move on to analysis, then we need to get into the types of questions that you have asked. Are these types of questions, are we looking at correlation? Are we looking at prediction? Are we looking at causation? What types of data do we have available and which types of analysis and questions are they appropriate for?

Again, they need to match up the right way. Some types of data are not appropriate for certain types of analysis or questions. So you really need to make sure that each one is appropriate for the right types of things. Right? Certain types of analysis, like mechanistic, survey questions will never be appropriate for mechanistic analysis, right? Although, quite honestly, no one's ever gonna be doing mechanistic analysis. Never and by the way, if anyone comes to me and says they're doing mechanistic analysis, I'm gonna sit back and listen to you very intently, very interested because I don't think anyone's doing mechanistic ...
it's not a thing.

Mike Julian: So when you're analyzing the results of the survey, what we're seeing is one question followed by another question, followed by another question, and you know hundreds of questions. When you're analyzing this stuff, are you looking at a question at a time, or are you looking at multiple questions and then interpreting the answers based on what you're seeing across several different questions?

Nicole Forsgren: So when I'm writing up the results, when I'm writing up the report, I am writing up the results of my analysis, and my analysis is taking into account a very, very careful research design. Now what that means is, my research design has been very carefully constructed to minimize misunderstandings. It tries to minimize drift in answers. So, one way that we do that, and this is outlined in Part Two of Accelerate if there's any stats nerds that want to read up on this, we do things called latent constructs.

So, you asked about having only a few questions or several questions. One way we do this, I mentioned, is called latent constructs. If I want to ask you about culture, right, I could ask 10 people about culture and I would get 15 answers. 'Cause culture could mean so many different things, right? In general, when we talk about culture in a DevOps context, we tend to get something that ... people will say very common things like, breaking down silos, having good trust, having novelty, right?

So what we do is we start with a definition, and then we will come up with several items, questions, that capture each of those dimensions. So you might want to think about a dope Venn diagram, where each of the questions is overlaid and then all of the things where they have the biggest, or the perfect overlay, that very center, that little nut, that is what the construct is. That is what culture is, that is what's represented by culture.

And then each of the individual circles is each question. That's what we do in research design.
One part of research design. When I get to stats analysis mode, I take all of the questions, all of the items, across, not just culture, but every single thing that I'm thinking about. So in years past I've done monitoring and observability, I've done CI, I've done automated testing, I've done version control, I've done all of these things, and I throw all of them into the hopper, right?

Mike Julian: Which is probably your massive Excel spreadsheet I'm sure.

Nicole Forsgren: No, it's SPSS. I use SPSS but you can use several different stats tools. And we do principal components analysis. And what we do is we say, how do they load? Basically, how do they group together and do we have convergent validity? Do they converge? Do they only measure what they're supposed to measure? And do we have discriminant validity? Do they not measure what they're not supposed to measure? And do we have reliability? Does everyone who's reading these questions read them in a very, very similar way?

Once we have all of those things, and there's several statistical tests for all of those, then I say, "Okay, these several items, usually three to five items, all of these items together are culture," or "all of these items together are CI," or "all of these, right, this grouping of items, represent this." Okay, now, now, I can start looking at things like correlations, or predictions, or something else and then I get to the report, and now I will just talk about it like, culture.

So I talk about it as one thing, but it's actually several things and then when I talk about culture, I can say, "This is what culture is," and I can talk about it in this nuanced, multidimensional way, and I know what those dimensions are because it's made up of three to five, to six to seven questions, and by the way, if one of those questions didn't fit, because I know from the stats analysis, I can toss it, and I know why. And I always have several items.
That's the risk, if you only have one question or if you only have two questions. If one of them doesn't work, which one is the wrong one? You don't know. Right? Because, is it A or is it B? I don't know.

At least if I start with three and one falls out, then it's probably the two that are good.

Mike Julian: Yeah. Many listeners on here have taken a lot of the surveys run by marketing organizations, except the surveys are also designed by people in marketing …

Nicole Forsgren: They're designed by people who want a specific answer.

Mike Julian: Exactly.

Nicole Forsgren: And that's the challenge.

Mike Julian: Right, whereas, to make this very clear, the State of DevOps report is not that at all. There's a lot, as you said, rigor that goes into this.

Nicole Forsgren: So the nice thing is that we have always been vendor and tool agnostic.

Mike Julian: You're not looking for a very particular answer to come out, you want to know what is actually out there.

Nicole Forsgren: And we're not looking for an answer to a product. So, in the example of CI, what is CI? I don't care about a tool. I'm saying, if you're doing CI and if you're doing CI, continuous integration, in a way that's predictive of smart outcomes, you will have these four things. The power in that is that anyone can go back and look at this as an evaluative tool. If you are a manager, or a leader, or a developer, you can say, "Any tool that I use, any tool in the world, I should look for these four things," or "Any tool I build myself, or if I'm doing CI, I should have these four things."

If you're a vendor, you should say, "If I think I'm building or selling CI, I better have these four things." Right? So that's the great thing and I've gotta say, God bless my new team. They're letting me run this the same way. It's still the same way. It's still vendor and tool agnostic, it's still capabilities focused.
Every single thing you look for, whether it's automation or process or culture or outcomes, it's vendor and tool agnostic, it's capabilities focused, and again, the power is that you can use it as an evaluative tool.

Is my team doing this? Is my tooling doing this? Is my technology doing this? Am I able to do this? If I'm not, what is my weakness? What is my constraint? Because if I take us back to the beginning, what is it that drives me and the DORA team, what is it that we want to get out of this? We want to make things better. And how do we do that? We can give people easy evaluation criteria. And I'm not saying it's easy, because none of this is easy, it takes work. But if there's clear evaluation criteria, we've got somewhere to go.

Mike Julian: Since I know that you love talking about what you found in your several years of doing this, what are some of the most interesting results you've come up with?

Nicole Forsgren: Oh, there's so many good ones.

Mike Julian: Let's pick your top three.

Nicole Forsgren: Okay, I think one of my favorites is, and I'm gonna do this in cheesy marketing speak …

Mike Julian: Please have at it. We have prepared ourselves.

Nicole Forsgren: Someone who had a little startup and had to fake it as a marketer for a minute, we'll see how I do at this.

Architecture matters, technology doesn't. Number one. Okay. So what does that mean? What that means is, we have found that if you architect it the right way, your architectural outcomes have a greater impact than your technology stack. So architectural outcomes, some key questions are: Can I test? Can I deploy? Can I build without fine grained communication and coordination?

Mike Julian: What does the fine grained mean?

Nicole Forsgren: Do I have to meet and work with and requisition something among, do I have to spin up some crazy new test environment or do I have to get approvals across 17 different teams? Notice, I just mentioned teams.
Communication and coordination can be a technology limitation or it can be a people limitation. This harkens very much back to Conway's law.

Mike Julian: One of my favorite laws.

Nicole Forsgren: Right? This is very much a DevOps thing. But, it's very true. Whatever our communication patterns look like, we usually end up building into our tech. Now, I will say this is very often easier to implement in Cloud and Cloud native environments, but it can absolutely be achieved in Legacy and Mainframe environments as well. We did not see statistically significant differences among Brownfield and Greenfield respondents in previous years.

Mike Julian: That's good to know.

Nicole Forsgren: Yeah, so I love that one. That one's super fun.

Okay, number two. Cloud matters, but only if you're doing it right.

Mike Julian: Oh, what does right mean?

Nicole Forsgren: Dun dun duh. So, this was one of my favorite stats. We found that you are 23 times more likely to be an elite performer if you're doing all five essential Cloud characteristics. I guess you could say if you're doing all five essential characteristics of Cloud computing according to NIST, the National Institute of Standards and Technology. So I didn't make this up, this comes from NIST, okay?

So it was interesting because we asked a whole bunch of people if they were in the Cloud. They're like, of course we're in the Cloud, we're totally in the Cloud, right? But only 22% of people are doing all five things. So what are these five? So these five are on demand self service. You can provision resources without human interaction, right? If you have to fill out a ticket and wait for a person to do a ticket, this doesn't count. No points.

Another one is broad network access. So you can access your Cloud stuff through any type of platform: mobile phones, tablets, laptops, workstations. Most people are pretty good at this. Another one is resource pooling, so resources are dynamically assigned and reassigned on demand.
Another one is rapid elasticity, right, bursting magic. We usually know this one.

Now the last one is measured service. So we only pay for what we use. So the ones that are most often overlooked are usually broad network access and on demand self service.

Mike Julian: Yeah, what's interesting about that, to me, there's nothing in there that prevents, like say, an internal OpenStack cluster from qualifying.

Nicole Forsgren: Exactly, right. So this could be private Cloud. I love that you pointed that out. The reason that this is so important to call out is, it just comes down to execution. It can be done and the other challenge is so often organizations, executives, or the board says you have to go to the Cloud and so someone says, "Oh yes, we're going to the Cloud." But then someone has redefined what it means to be in the Cloud. Right? And so, you get there, someone checks off their little box, puts a gold star on someone's chart, they walk away, and they're like, "Well we're not seeing any benefits." Well yeah, 'cause you're not doing it.

Mike Julian: Right. Yep.

Nicole Forsgren: It's like, "I bought a gym membership, I'm done." No. And again, I'm not saying it's easy, right? There's some work involved. Now the other thing that I love is that, let's say you're not in the Cloud, for some reason you have to stay in a Legacy environment, you can look at these five things and you can implement as many as possible, and you can still realize benefits.

Mike Julian: Right. It's not an all or nothing approach. You can do some of these and still get a lot of benefit from it.

Nicole Forsgren: It's almost like a cheater back to number one, which was architecture matters, technology doesn't.
How can I do a cheat sheet to see some really good tips on how to get there?

Mike Julian: So what's your number three here?

Nicole Forsgren: My number three would probably be, outsourcing doesn't work.

Mike Julian: Yeah.

Nicole Forsgren: Which some people hate me for and they shoot laser beams out of their eyes. So let's say outsourcing doesn't work*.

Mike Julian: Okay, what's the asterisk?

Nicole Forsgren: Asterisk, the asterisk is going to be that functional outsourcing doesn't work.

Mike Julian: Okay, so say outsourcing my on call duties probably isn't going to work so well.

Nicole Forsgren: Taking all of DEV, shipping it away. Taking all of TEST, shipping it away. Taking all of OPS, shipping it away. Now, why is that? Because then, all you've done is taken on another set of handoffs, you've created another silo. You've also batched up a huge set of work, and you're making everyone wait for that to happen. The goal is to create value and not make people wait. If now everyone has to wait for everything to come back, if you're making high value work wait on low value work, because it all has to come back together, which is usually the way it works, you're boned.

Now, functional outsourcing. If you have an outsourcing partner that collaborates with you and coordinates with you and delivers at the same cadence, that's not functional outsourcing. That's the asterisk.

Mike Julian: Okay, gotcha.

Nicole Forsgren: Also, if they're part of your team and they're part of your company but they basically disappear for three months at a time. Sorry kids, that's functional outsourcing. I award you no points, and may God have mercy on your soul. It's not helpful.

Mike Julian: Right. It seems to me, how you could tell if you're in this predicament is if there is a noticeable handoff between your team and whoever you have given these items to, you have functional outsourcing.
Would that be about right?

Nicole Forsgren: Yes, and especially if there's a noticeable handoff and then a black box of mystery.

Mike Julian: Of like, how is the work getting done?

Nicole Forsgren: Step one, something, step two, question mark, step three: profit.

Mike Julian: Maybe. So the first two, it's all good because we can kind of see where to go from there, but this third one actually seems a bit harder because if I'm a sysadmin, I have absolutely no control over this functional outsourcing. I may hate it just as much, I may hate it myself, but I don't have any control over it. What can I do as a sysadmin, or someone in ops, someone in dev, how can I improve that situation?

Nicole Forsgren: So some ideas might include things like, seeing if there's any way to improve communication or cadences in the interim. Right? You might still have that outsourcing partner, because that's just the way it's gonna be. But, let's say that you've batched up work in three month increments, is there any way to increase handoffs to once a month? Is there any way that we can take capabilities that we know are important, like working in small batches, and just increase that handoff? Is there any way that we can integrate them into our cadence, into our teams?

Now I realize there is some challenge here because from a legal standpoint, we can't treat them like our team because then, at least from the United States standpoint, once we treat them like an employee, then we're liable for employment taxes and all of that other legal stuff. But if we can integrate them into our work cadence, or more closely into our work cadence, then our outcomes improve.

Mike Julian: Okay, cool. That makes a lot more sense. That doesn't sound nearly as hard as I was fearing.

Nicole Forsgren: So it can be starting to decrease the delay on the cadence, asking for slightly more visibility into what's happening, if it's a complete black box, looking for that.

Mike Julian: Nicole, this has been absolutely fantastic.
Thank you so much for joining me. I have two last questions. Where can people find this State of DevOps report to take the survey? Where is the survey at?

Nicole Forsgren: Oh, we've got the survey posted. Can I include it in show notes?

Mike Julian: Absolutely. Alright, folks, check the show notes for the link. And my last question for you is where can people find out more about you and your work, aside from this survey?

Nicole Forsgren: I'm a couple places. So my own website is at nicolefv.com and I'm always on Twitter, usually talking about ice cream and Diet Coke, that's @nicolefv.

Mike Julian: I do love your Twitter feed. It's one of my favorites.

Nicole Forsgren: Yeah, everybody come say hi. My DMs are open.

Mike Julian: What I love most about your Twitter feed is roughly around the time that you're writing the report and saying, "Oh my God, why did I do this?"

Nicole Forsgren: Yeah, I try to keep it locked down, but every once in a while something will slip, like "Oh my gosh everybody, something good is happening," or "oh I forgot this one thing," or "So much good is happening."

Mike Julian: Yeah, I remember last year like, "Oh my God this is so cool but I can't tell you about it."

Alright, well thank you so much for coming on and thanks to everyone else listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you on the next episode.
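
The item-level reliability check described in this episode (several questions per construct, and tossing the one that does not fit) can be illustrated with a small sketch. Nicole mentions principal components analysis in SPSS and "several statistical tests" without naming them; the sketch below assumes Cronbach's alpha, one commonly used reliability statistic, and is not presented as the method behind the State of DevOps report. The survey scores and the names `cronbach_alpha` and `alpha_if_deleted` are made up for illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a construct.

    items: list of columns, one list of respondent scores per survey question.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items to estimate reliability")
    # Each respondent's total score across all questions in the construct.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def alpha_if_deleted(items):
    """Alpha recomputed with each question left out in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]


# Hypothetical 1-5 Likert answers from six respondents to four "culture" questions.
# The first three questions move together; the fourth does not.
culture = [
    [5, 4, 2, 1, 4, 3],
    [5, 5, 2, 1, 4, 2],
    [4, 4, 1, 2, 5, 3],
    [1, 5, 4, 2, 3, 5],
]

overall = cronbach_alpha(culture)
without_each = alpha_if_deleted(culture)
worst = max(range(len(culture)), key=lambda i: without_each[i])
print(f"alpha with all items: {overall:.2f}")
print(f"dropping question {worst + 1} raises alpha to {without_each[worst]:.2f}")
```

With three or more questions, the leave-one-out values point at the question that does not belong. With only two, removing either leaves a single item and no way to tell which one was the problem, which is the risk described in the conversation.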

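The "all five characteristics" test from this episode can also be written down as a simple self-assessment. The five field names follow the NIST characteristics as listed in the conversation; the class, the helper functions, and the example environment are hypothetical and only sketch how a team might record its own answers.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CloudCharacteristics:
    """One yes/no answer per essential characteristic named in the episode."""
    on_demand_self_service: bool
    broad_network_access: bool
    resource_pooling: bool
    rapid_elasticity: bool
    measured_service: bool


def missing_characteristics(env):
    """Names of the characteristics the environment does not yet meet."""
    return [f.name for f in fields(env) if not getattr(env, f.name)]


def meets_all_five(env):
    return not missing_characteristics(env)


# Hypothetical private cloud where provisioning still goes through a ticket queue.
ticket_driven_cluster = CloudCharacteristics(
    on_demand_self_service=False,  # a person has to action the ticket
    broad_network_access=True,
    resource_pooling=True,
    rapid_elasticity=True,
    measured_service=True,
)

print(meets_all_five(ticket_driven_cluster))
print(missing_characteristics(ticket_driven_cluster))
```

Nothing in the check refers to a vendor, so an internal cluster can pass it and a public cloud account can fail it, which matches the point made about private cloud above.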

11 Apr 2019

Rank #9

Podcast cover

The Business Value of Serverless with Yan Cui

About the Guest

Yan is an experienced engineer who has run production workloads at scale in AWS for nearly 10 years. He has been an architect and principal engineer across a variety of industries ranging from banking, e-commerce, sports streaming to mobile gaming. He has worked extensively with AWS Lambda in production, and has been helping various UK clients adopt AWS and serverless as an independent consultant.

He is an AWS Serverless Hero and a regular speaker at user groups and conferences internationally, and he is also the author of Production-Ready Serverless.

Guest Links
Yan’s blog
Yan’s video course: Production Ready Serverless
Find Yan on Twitter (@theburningmonk)
Subscribe to Yan’s newsletter
Centralised logging for AWS Lambda

Transcript

Mike Julian: Running infrastructure at scale is hard, it's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O’Reilly's Practical Monitoring.

Mike Julian: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source, and they also have a hosted commercial version too.
You can check all of this out at influxdata.com.

Mike Julian: Hi folks, I'm here with Yan Cui, an independent consultant who helps companies adopt serverless technologies. Welcome to the show, Yan.

Yan Cui: Hi Mike, it's good to be here.

Mike Julian: So tell me what do you do? You're an independent consultant helping companies with serverless. What does that mean?

Yan Cui: So I actually started using serverless quite a few years back, pretty much as soon as AWS announced it, I started playing around with it and the last couple of years I've done quite a lot of work building serverless applications in production. And I've also been really active in just writing about things I've learned along the way, so as part of that, a lot of people have been asking me questions because they saw my blog and talk about some problems that they've been struggling with, and asked me, "Hey can you come help me with this? I got some questions." So as part of doing that, I like to help people, first of all, and that's something that's been happening more and more often, so in the last couple of months I have started to work as an independent consultant, helping companies who are looking at adopting serverless or maybe moving to serverless for new projects and want to have some guidance in terms of things they should be thinking about and maybe have some architectural reviews on a regular basis. So for things like that, I've been helping a number of companies, both in terms of workshops but also regular architectural reviews. And at the same time, I also work part-time at a company called DAZN, which is a sports streaming platform, and we also use serverless and containers very heavily there as well.

Mike Julian: Okay, so why don't we back up like several steps. What the hell is serverless? Just to make sure that we're all talking about the same thing.
What are we talking about?

Yan Cui: Yeah that's a good question, and I guess a lot of people have been asking the same question as well because nowadays you see pretty much everyone throwing the serverless label at their products and services. And just going by the popular definition out there, based on what I see in the talks and blog posts, I guess in terms of my social media circle, by the most popular definition serverless is pretty much any technology where, one, you don't pay for it when you are not using it, because paying for uptime is a very serverful way of thinking and planning; two, you don't have to worry about managing and patching servers, because installing daemons or agents or any form of subsidiary or support software on them is, again, definitely tied to having servers that you have to manage; and three, you don't have to worry about scaling and provisioning, because the system just scales the number of underlying servers on demand. And by this definition, I think a lot of the traditional backend-as-a-service things out there like AWS S3 or Google BigQuery also qualify as serverless as well.

Mike Julian: Okay, so Lambda is a good example of serverless, but there's also this thing of like a function as a service and they seem to be used interchangeably sometimes. What's going on there?

Yan Cui: So to me, function as a service describes a change in terms of how we structure our applications, changing the unit of deployment and scaling to the level of the functions that make up an application. A lot of the function-as-a-service solutions like Azure Functions or Lambda, as you mentioned, will also qualify as serverless, based on the definition we just talked about, and generally I find that there is a lot of overlap between the two concepts or paradigms, between function as a service and serverless.
But I think there are some important subtleties in how they differ because you also have function-as-a-service solutions like Kubeless or Knative that give you the function oriented programming model and the reactive and event driven model for building applications, but then run on your own Kubernetes cluster.

Yan Cui: So if you have to manage and run your own Kubernetes cluster, then you do have to worry about scaling, and you do have to worry about patching servers, and you do have to worry about paying for uptime for those servers, even when no one is running stuff on them. So the line is blurred when you consider Kubernetes-as-a-service things like Amazon’s EKS or Google GKE where they offer Kubernetes as a service, or Amazon's Fargate, which lets you run containers on Amazon's fleet of machines so you don't have to worry about provisioning, and managing, and scaling servers yourself.

Yan Cui: At the end of the day, I think being serverless or having the right labels associated with your product is not important. It's all about delivering on business needs quickly, but having a well understood definition of those different ideas that we have really helps us in terms of understanding the implicit assumptions we make when we talk about something. So now that everyone is calling their services or products serverless, it's really not helping anyone, because if everything is serverless, then nothing is serverless and I really can't tell what sort of assumptions I can make when I think about your product.

Mike Julian: Right, this is the problem with the buzzwords is, the more you have of them, the less it actually means and the more confused I am about what you do. So because I love talking about where things fall apart... Like serverless, it's a cool idea. I think it works really well and yet, I've seen so many companies get so enamored with it that they spend six months trying to build their application on serverless or in that model.
And then a month later, they go under. I can't help but make the tie between the two: you spend all your time trying to innovate on this and at the end of the day, you didn't have any time to innovate on the product. So that's an interesting failure model. But I'm sure there's others here where people are adopting serverless in the same way as when we first started adopting containers. Like, "Hey, I just deployed a container and it works on my machine, have fun." So when is serverless not a good idea? What are the pitfalls we're running into? What are people not thinking about?

Yan Cui: I think one of the problems we see all the time ... You mentioned when something's a hype, a lot of the adoptions happen because there's a lot of hype behind the technology and there's a lack of understanding of, this is the requirement we have and the technical constraints that you have, and you go straight into it. I think this happens all the time and that's why we have the whole hype cycle to go with it. I think when you are a newcomer to a new paradigm, it's so easy to become infatuated by what this product can do, and when you see the whole world as a hammer, you start looking for nails everywhere, and this happened when we discovered NoSQL. All of a sudden, everything has to be done with NoSQL. MongoDB and Redis were everywhere, used to solve every possible database problem, often again with disastrous results because again, people are not thinking about the constraints and the business needs they actually have and focus too much on the tech. If anything, I think with serverless, we have this great opportunity to think more about how do we deliver business value quickly, rather than thinking about technology itself. But as engineers, as technology people ourselves, you can see how easy it is to fall into that trap and I think there's a couple of use cases where serverless is just not very good in general right now.
One of them is when you require consistent and very high performance.

Yan Cui: So quite a lot has been made about cold starts, which are relatively new to a lot of people using serverless, but again, it's not something that's entirely new. For a very long time, we've had to deal with long garbage collection pauses or a server being overloaded because load is not evenly distributed, but with serverless, that becomes something that's systematic, because every time a new container is spawned for one of your functions, you get this spike in latency. For some applications, that is not acceptable, because maybe you are building a realtime game, for example, where latency has to be consistent and has to be very, very fast. If you are talking about, say, a multiplayer game needing a 99th percentile latency below 100 milliseconds, that's just not something that you can guarantee with Lambda or any serverless platform today.

Mike Julian: I worked with a company a while back that was building a realtime engine and that was a hell of a problem. So we were building everything on bare metal and VMware, and then had this really nice orchestration layer running on top of Puppet. And this is a hell of a problem because as load comes up, we're automatically scaling the stuff out, except as we're adding the new nodes, latency is spiking because we're trying to move traffic over to something that's not ready for it.

Yan Cui: Yes, and with serverless, you don't have the luxury of, say, letting the server warm up first and giving it some time before you actually put it into active use. Literally, a new container is spawned on the first request that there isn't a spare one around to handle. So you always have cold starts, and you can't just say, "Okay I'm gonna give this server five minutes to get warmed up first." 
Maybe it's the JVM that takes a while to warm up, so you configure your load balancer and the rest of the system to take into account the time it needs to warm up before you put it into active service. With serverless you can't do that, so where you do need consistent high performance, serverless is a really bad fit right now. I think you just touched on something else there as well, the fact that you need to have a persistent connection to a server, so there's some kind of logical notion of a server.

Yan Cui: That's again something that serverless is not a good fit for: if you want, say, a persistent connection in order to do realtime push notifications to connected devices, or to implement subscription features in GraphQL, for example. In those cases, you are also constrained by the fact that an invocation of a function can run for only a certain amount of time. I think that's a good constraint. It tells you that there are certain use cases that are a really good fit for functions-as-a-service, but there are whole other cases where you just shouldn't even think about doing it. There are ways you can get around it, but by the time you do all of that, you really have to ask yourself, "Am I doing the right thing here?"

Mike Julian: Right.

Yan Cui: And I think another interesting case, and this is again something that I find often blown out of proportion, is cost. Sure, Lambda is cheap because you don't pay for it when it's not running, but when you have even a medium amount of load, you might find that you pay more for API Gateway and Lambda compared to if you just ran a web server yourself. 
Now that's true, but one of the things that most people don't think about enough is the personnel cost: the skill set you need to run your own cluster, to be able to look after your Kubernetes cluster, to do all these other things associated with having a server. That often makes it more expensive than whatever premium you pay AWS to run your functions.

Yan Cui: However, if you are talking about a system that has got, I don't know, maybe tens of thousands of requests per second, consistently throughout the day, then those premiums on individual invocations can start to stack up really, really quickly. I had a chat with some of the guys at Netflix a while back, and they mentioned that they did a calculation that if everything at Netflix ran on Lambda today, it would cost them something like eight times more. If you are running at Netflix scale, that is a lot of money, way more than the amount of money you would pay to hire the best team in the world to look after your infrastructure. So if you are at that level of scale and the cost is starting to rack up, then maybe it's time to think about moving your load onto a more traditional containerized or VM-based setup where you can get a lot more out of your servers and do a lot more performance optimization than if you run them in Lambda.

Yan Cui: And I think the final use case where Lambda, or serverless, is probably not that good a fit today is this: even though you get a good baseline of redundancy built in, so you get Multi-AZ out of the box and you can also build multi-region active APIs relatively easily, because we are relying on the platform to do a lot more, and the platform service is essentially a black box to us, there are also cases where some of the built-in redundancy might not be enough. 
For example, if I'm processing events in real time with Kinesis and Lambda, the state of the poller is a black box; it's something that I can't access. So if I want to build a multi-region setup whereby if one region starts to fail, I can move the stream processing to a different region and turn it on, so an active-passive setup, then I need to access the internal state of the poller, which is not something that I can do, or I have to build a whole lot of infrastructure around it to be able to simulate that.

Yan Cui: And again, by the time I invest all the effort and do all of that, maybe I should just have started with something else to begin with. Those are some of the constraints that I've had to think about when I decide whether or not Lambda or serverless is a good fit for the problem that I'm trying to solve. As much as I love serverless, again, I don't think it's about the technology. It's about finding ways that can deliver the business needs you have, so whatever you choose, you have to meet the business needs first and foremost, and then anything that can let you move faster, you should go with that.

Mike Julian: So all this reminds me of an image that floated around Twitter a while back, that people dubbed the "Docker Cliff." And the idea was that you had Docker at the very bottom of Dev and Prod, but to get something from Dev, like when I'm developing with Docker on my laptop, to actually put it in production takes way more than just a container. How do you do the orchestration? How do you do the scheduling? How are you managing the network? What are you doing about deployment, monitoring, supervision, security, and all this other stuff on top of it that people weren't really thinking about? And so for developers, Docker was fantastic. Like oh, hey, everything is great. It's a really nice self-contained deployable thing, except it's not really that deployable. 
And I'm kind of seeing that serverless is much the same way: we throw out a bunch of Lambda functions, like this is great. And immediately the next question is, "How do I know they're working? How do I know when they're not working? What's going on with them?" CloudWatch Logs is absolutely awful, so trying to understand what it's doing through there is just super painful, and the deployment model is kind of janky right now. How I've been deploying them is just a shell script wrapped around the aws-cli. I'm sure there are better ways to do it, so is there other stuff like this? Are there other things that we're not really thinking about, and what do we do about those?

Yan Cui: Yeah absolutely. The funny thing is that a lot of the problems that you talk about are things I hear from other clients or from people in the community all the time, in terms of how do I do deployment, and how do I do basic observability stuff. The thing is that there are solutions out there that do this to various degrees, and I think you find that is the case with a lot of AWS services: they cover the basic needs. CloudWatch Logs is a perfect example of that, but it does it very crudely.

Mike Julian: Right, it's like an MVP of a logging system.

Yan Cui: Yes.

Mike Julian: Sorry, CloudWatch team, it's true.

Yan Cui: And the same goes for, I guess, CloudWatch itself as well, but the good thing is that at least you don't have to worry about installing agents and whatever to ship your logs to CloudWatch Logs. So CloudWatch Logs becomes a good staging place to gather your logs, and from there you can ship them somewhere else. Maybe an ELK Stack, or maybe one of the managed services like [inaudible 00:18:48] Loggly or Splunk or something else. So the paradigm of doing that is actually pretty straightforward. I've got two blog posts which I guess we can link to ...

Mike Julian: Yeah we'll throw those in the show notes.

Yan Cui: ... 
In the show notes. One other thing which I think is quite important is security. Again, as developers, we are just not used to thinking about security, and I see a lot of organizations try to tackle the security problem with this hammer called VPC, as if having a VPC is gonna solve all of your problems. In fact, of every single VPC I've seen in production, none of them do egress filtering, so if anyone is able to compromise your network security, they find themselves in this fully trusted environment where services talk to each other with no authentication, because you just assume it's trusted because you're inside a VPC. But we've seen several times how easy it is to compromise the whole ecosystem by attacking the dependencies everyone [has]. I think it was last year when a researcher managed to compromise something like 14% of all NPM packages, which accounts for something like a quarter of the monthly downloads of NPM, including-

Mike Julian: Well that's gonna make me sleep well.

Yan Cui: So imagine if someone just compromised one of your dependencies and put in a few lines of code to scan your environment variables and then send them to their own backend to harvest all these different AWS credentials and see whether or not they can do some funky stuff or do bitcoin mining with them. That is not something that you can really protect against by putting a VPC in front of things. And yet, we see people try to take this huge hammer and apply it to serverless all the same, even though when it comes to Lambda, you pay a massive price for using VPCs in terms of how much cold start you experience. My experience tells me that having a Lambda function running inside a VPC can add as much as 10 seconds to your cold start time, which basically rules out any user-facing APIs you have. 
But with Lambda, you can actually control your permissions down to the function level, and that's again something that I see people struggle with because we don't like to think about, oh, these IAM permissions and stuff. It's difficult, it's laborious.

Mike Julian: Well you know, I think the real problem is that no one knows how IAM actually works.

Yan Cui: To be fair though, I guess I'm probably a bad example because I've been using AWS for such a long time and I'm used to the mechanics of IAM and writing the permissions and the policies, but yes, it is much more complicated than people-

Mike Julian: It is a little esoteric.

Yan Cui: Yes, definitely. And I have seen some tools now coming onto the market, I think PureSec is one of them and there are a few others, that are all looking at how to automate this process: identifying what your function needs by doing static analysis on your code to see how you're interacting with the AWS SDK, to see, oh, your function talks to this table, and then when you deploy through a CI/CD pipeline, it notices that, hey, your function doesn't have the right permissions, it's overly permissive. Because again, a lot of people are just using star, giving the function access to everything, which also means that if your function is compromised, the attacker can get your credentials and do everything with the temporary credentials you have. So some of these tools are going to automate away the pain that we experience as developers in terms of figuring out what permissions our function actually needs, and then try to automatically generate templates that we can just put into our deployment framework. And you talked about the deployment framework being clunky right now. There are quite a lot of different deployment frameworks that take care of a lot of the plumbing and complexity under the hood. 
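To make the "just using star" problem concrete, here is a minimal sketch of the kind of check such tooling might run against an IAM policy document. It is not how PureSec or any other product works; the function name `find_wildcards` and the sample policies are hypothetical, and the only assumption is the standard IAM policy JSON layout (`Statement`, `Effect`, `Action`, `Resource`).

```python
def find_wildcards(policy: dict) -> list:
    """Return (statement_index, key, value) for each overly broad Allow entry.

    Hypothetical helper for illustration: it flags "*" in Action or Resource,
    and service-wide actions such as "dynamodb:*".
    """
    findings = []
    statements = policy.get("Statement", [])
    # IAM allows a single statement object instead of a list.
    if isinstance(statements, dict):
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            values = statement.get(key, [])
            # Action and Resource may each be a string or a list of strings.
            if isinstance(values, str):
                values = [values]
            for value in values:
                if value == "*" or (key == "Action" and value.endswith(":*")):
                    findings.append((index, key, value))
    return findings


# Hypothetical policies: one wide open, one scoped to a single table.
too_broad = {"Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}
scoped = {
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
        }
    ]
}
```

A check like this could run in a CI/CD pipeline and fail the build when the list of findings is not empty.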
I don't know if you've ever tried to provision an API Gateway instance using CloudFormation or Terraform; it's horrendous.

Mike Julian: It's not exactly simple.

Yan Cui: It's so, so complicated because of the way resources are organized in API Gateway. But with something like the Serverless Framework or AWS SAM or a number of other frameworks out there, I can just write a human-readable URL in one line that translates to, I don't know, maybe 100 lines of CloudFormation template code.

Mike Julian: That's awful.

Yan Cui: This is just not stuff that I wanna deal with, so there are frameworks out there that ease a lot of the burden with deployment and similar things. On the visibility side of things as well, there are quite a lot of companies that are focusing on tackling that side of the equation, in terms of giving you better traceability. Because one of the things we find with serverless is that people are now building more and more event-driven architectures, because it's so easy to do nowadays.

Mike Julian: Right.

Yan Cui: And part of the problem with that is they are a lot harder to trace compared to direct API calls. With API calls, I can easily just pass a correlation ID along in the headers, and a lot of the existing tools like Amazon X-Ray just kick in and integrate with API Gateway and Lambda out of the box, but as soon as my event goes over asynchronous event sources like SNS, Kinesis, or SQS, I lose the trace entirely because they don't support these asynchronous event sources. But there are companies like Epsagon who are now looking at that problem specifically and trying to understand how data flows through the entirety of the system, whether it's synchronous through APIs, or asynchronous through the event streams or task queues or SNS topics that you have. 
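The correlation ID idea can be carried over to asynchronous hops by hand. The following is a rough sketch, not tied to any AWS SDK call: the envelope layout, the attribute name `x-correlation-id`, and the functions `wrap_message`, `unwrap_message`, and `handle_order_event` are all assumptions made for illustration. The point is only that each consumer reads the ID from the incoming message and copies it onto whatever it publishes next.

```python
import json
import uuid
from typing import Optional

# Hypothetical attribute name; any key agreed on by all services would do.
CORRELATION_KEY = "x-correlation-id"


def wrap_message(payload: dict, correlation_id: Optional[str] = None) -> str:
    """Serialize a payload together with a correlation ID.

    The first hop has no ID yet, so a new one is generated for it.
    """
    envelope = {
        "attributes": {CORRELATION_KEY: correlation_id or str(uuid.uuid4())},
        "payload": payload,
    }
    return json.dumps(envelope)


def unwrap_message(raw: str) -> tuple:
    """Return (correlation_id, payload) from a serialized envelope."""
    envelope = json.loads(raw)
    return envelope["attributes"][CORRELATION_KEY], envelope["payload"]


def handle_order_event(raw: str) -> str:
    """Hypothetical consumer: process an event and forward the same ID."""
    correlation_id, payload = unwrap_message(raw)
    follow_up = {"order_id": payload["order_id"], "status": "processed"}
    return wrap_message(follow_up, correlation_id)
```

In a real system the serialized envelope would be the body published to a topic, queue, or stream, and the same ID would also be written into every log line so that logs from different functions can be joined.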
And there are also companies that are focusing on the cost side of things, understanding the cost of user transactions that span across this massive web of different functions, loosely coupled together through different event sources, CloudZero being one of those, I guess, foremost companies focusing on the cost side of the story for serverless architectures. So there are quite a lot of interesting startups that are focusing on different aspects of the problems that we've described so far. And I think in the next six to twelve months we're definitely gonna see more and more innovation in this space, even beyond all the things that Amazon's already doing under the hood.

Mike Julian: Yeah that sounds like it will be awesome. This whole area still feels pretty immature to me. I know there are people using it in production. There were also people using Mongo in production while it was dropping data like crazy every day. So more power to them if they don't like data. But I like stable things. So it sounds like serverless is still maturing. It is ready, but we're still kinda working some of the kinks out? Would that be a fair characterization?

Yan Cui: I think that's a fair characterization in terms of the tooling space, because a lot of things are provided by the platform and, as I mentioned before, Amazon is good at meeting the basic needs that you have. So you can probably get by with a lot of the tools out of the box, but that also, I guess, slows down some of the commercial tooling support compared with something like containers and Kubernetes, because there you only get so much out of the box, so that's a huge opportunity for vendors to jump in very, very quickly. But at the same time, I think those innovations are happening a lot faster than people realize. 
Maybe one of the problems is just education: getting the information out about all the different tools that are coming into the space and making people aware of them.

Mike Julian: That's really interesting, and what I think a lot of people forget is exactly how old Docker is, because Docker was kind of in the same position as serverless, where it was really cool but it was still pretty immature. And thinking about when these things came out, now we're seeing Kubernetes, which is maturing that ecosystem further, that is actually in production. We know the patterns, we know how all that stuff is being deployed, we know how to manage it, we know the security. It is pretty mature, but how long did it actually take to get there? Looking at it, you have Docker, whose initial release was in 2013. That's like five years ago, which blows my mind, and Kubernetes' initial release was in 2014, four years ago. But it's only really been in the past year or two that Kubernetes has been what we'd call mature. And now we're starting to see this massive uptick of abstraction layers on top of Docker in the form of Kube. At some point, I think we're gonna see that with serverless, where it's not just like, oh we're deploying this Lambda function and calling it a day. I think we're gonna see a lot more tooling, a lot more abstraction that brings it all together and makes it so much easier to deal with, especially at scale.

Yan Cui: Yeah I absolutely agree, and just in terms of the dates you mentioned, the initial announcement of Lambda was in 2014, so in terms of age, it's not that much younger than Docker and Kubernetes.

Mike Julian: Wow.

Yan Cui: Where it differs is that it's a brand new paradigm, whereas with containers and with Kubernetes, it's a lot easier for you to lift and shift existing workloads without having to massively restructure your application to be idiomatic for this paradigm. 
With Lambda, and with serverless, there is that requirement that in order to be idiomatic, there's a lot of restructuring and rethinking you need to do, because it's a mindset shift. And that takes a lot longer than just a technology change.

Mike Julian: Right, yeah. We're talking about something completely new here. So it's not like, oh we'll just go implement Lambda overnight and call it a day. We'll just move our whole application over. It's not like when we started putting things in containers. We could actually put a thing in a container, but really all we were doing by lifting and shifting was moving from one server to another, except now it's a smaller server.

Yan Cui: Yes.

Mike Julian: We had the idea of the fat container where you had absolutely everything in a container. That is a bad idea, it's a dumb pattern. And it's going the same way with serverless, I think. You can't just lift and shift. It is a brand new architectural pattern. It requires a lot of serious thought.

Yan Cui: Yeah, and I think one of the pitfalls I see in serverless adoption sometimes is that we are so wrapped up in this whole movement into a new paradigm that we just forsake all the things we've learned in the past, even though a lot of the principles still very much apply. In fact, a lot of what I've been writing about is basically how we take previous principles, apply them, adjust them, and make them work in this new paradigm. The practices and patterns may have to change because some things just don't work anymore, but a lot of the principles still very much apply. Why do we do structured logging? Why do we do sampling in production? All those principles still very much apply when it comes to serverless. It's just that how we get there is different. 
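As one illustration of carrying those two principles over, here is a small sketch of structured logging with sampling. It assumes nothing about any particular logging library: `make_logger`, its `sample_rate` parameter, and the field names are hypothetical. Every record is a single JSON line, warnings and errors are always kept, and only a fraction of debug records is emitted.

```python
import json
import random
import time


def make_logger(sample_rate: float, sink=print, rng=random.random):
    """Build a log function that writes JSON lines and samples DEBUG records.

    sample_rate is the fraction of DEBUG records to keep (0.0 to 1.0).
    sink and rng are injectable so the behavior can be tested.
    """

    def log(level: str, message: str, **fields):
        # Keep every non-debug record; keep debug records only when sampled.
        if level == "DEBUG" and rng() >= sample_rate:
            return None
        record = {"timestamp": time.time(), "level": level, "message": message}
        record.update(fields)
        line = json.dumps(record)
        sink(line)
        return line

    return log


# Hypothetical usage inside a function handler: keep roughly 1% of debug logs.
log = make_logger(sample_rate=0.01)
log("ERROR", "payment failed", request_id="req-1", order_id=42)
```

Because each record is machine-readable, whatever system the logs are shipped to can filter on fields such as `request_id` instead of parsing free text.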
And I think that is one of the key things I had to learn over the last couple of years: a lot of things that we learned in the past still apply. Just like with databases, a lot of the things we learned about databases are still very much here to stay, even if we don't need the specific skill set that DBAs provide in the new world of NoSQL databases. When it comes to serverless, I guess making the leap from looking at practices to understanding the principles behind them, why we do them and how we can apply those principles, is super important to a successful adoption of serverless in your organization.

Mike Julian: That's an absolutely fascinating perspective, and I completely agree. What I absolutely love about it is, the principles of site reliability haven't actually changed. The principles of how we run and manage systems haven't really changed a whole lot in the past 10 years, which is fantastic. That's how it should be. We should always be looking for true principles, the kind of pillars of how we behave and how we look at what we work on. How we do it changes all the time, and it absolutely should, but the principles shouldn't change that much. So that's interesting, trying to apply the principles that we already know to be true, the practices that we know work. How do we apply them to a new paradigm? And sure, maybe some of them aren't going to apply very well and we may have to create new ones, which I'm sure will be coming out of this. But we don't have to start from scratch.

Yan Cui: No, what's that saying again? Those who don't know history are doomed to repeat it.

Mike Julian: Right, exactly. We've talked a lot about the failures and the challenges, and you keep mentioning this idea, the business case for serverless. So sell me on it. I want to deploy serverless in my company. I'm just an engineer, but I really like it, so I wanna move everything to it. I wanna do a new application in it. 
What should I be thinking about? How do I come up with this business case?

Yan Cui: I think the most important question there is, what does the business care about? And I think pretty much every business I know of cares about delivery and speed. As a business, you want to deliver the best possible user experience and you want to build the right features that your users actually want, but to do that, you need to be able to hit the market quickly and inexpensively, so that you can then iterate on those ideas. That allows you to tell the good ideas from the bad ones, and then you can double down on the good ideas and make them really great. And the more your engineering team has to do themselves, then by definition, the slower you're gonna be able to move. That's why businesses should care about serverless: it frees the engineering teams from having to worry about a whole load of concerns. They don't need to know how the applications are hosted, and they can let the real experts, the people that work for AWS, worry about that undifferentiated heavy lifting. That frees up the brainpower that you actually have, which by the way is super expensive, to focus on solving the problems that your users actually care about. No user cares about whether or not your application runs on containers or VMs or serverless, but they do care about when you're gonna deliver and they do care about you building the right features. And that, again, needs you to optimize for time to market and to be able to iterate quickly. 
A lot of people talk about vendor lock-in, as if they worry about Amazon one day holding the key to your kingdom, but I think the real-

Mike Julian: That's the last thing I'm worried about.

Yan Cui: Yeah exactly, I think the biggest problem we should worry about is a competitor who can iterate faster than you locking you out of the market altogether.

Mike Julian: Right.

Yan Cui: Yeah so I think that's why they should really, really care about serverless.

Mike Julian: I agree with that. That sounds great. The biggest thing that I see with technology, with engineers and their architectural decisions, is that a lot of decisions seem to be based essentially on resume-driven development. I've met a lot of engineers who say, "I built this new application in Go because I wanted to learn Go," and I'm like, that's cool, what does the business have to say about that? And it's like, "Well, I convinced my boss to use Go." I'm like, "No you didn't." Like your entire shop's in PHP, you basically just said PHP is shit. That was your business case. Instead, yes, we should be looking at this from the perspective of how quickly can I get this new product to market? How quickly can I ship this feature? And yeah, there might be some scenarios where switching a language or switching a framework would be useful, but I agree with you that we really should be focused significantly more on time to market and time to value. We're here to help our businesses make money, or in my case, help my business make money. But for me, I have an application that I'm writing in PHP right now. It's PHP and MySQL and it's gonna be a core facet of my own company. And most engineers would say I'm crazy for writing PHP, but the entire point is that I don't have time to dick around. I need to have this out in the market.

Yan Cui: Yeah absolutely, totally agree. 
And those kinds of conversations, I've had quite a few of them in the past myself, and I've also heard a lot of similar arguments in terms of, oh, why should we use, for example, functional programming? I've been part of the functional programming community for quite a long time and I'm still a big fan of functional programming, but not for the reason that it makes your code more readable. Again, it's about moving up the abstraction ladder so that I have to do less, and it's about getting that leverage to be able to do more with less. I think that's the argument that we should be making more, as opposed to how I like to read my code.

Mike Julian: Right, let's take this from two different perspectives. For the people that are brand new to serverless, what can they do this week, or today, to learn more about it? And for the people that already have serverless in their infrastructure, what can they do this week to improve their situation?

Yan Cui: I think learning by doing is always the best way to get to grips with something. So if you are just starting, with serverless it's so easy to get started and play around with something, and when you're done, just delete everything with CloudFormation with a single button click, or if you're using the right tools, it's just a single command. So definitely go build something. If you've got questions about how the platform behaves, then build a proof of concept and try it out yourself. It's super, super simple nowadays. That's how I've learnt a lot of the things I know about serverless: just by running experiments. 
Come up with a question, come up with a hypothesis on how I expect the platform to behave, do a proof of concept to answer those questions, and then, I like to write about things so that I have a record afterwards, but also so I can share with other people the things that I've learned.

Yan Cui: And if you've already started and you want to take your game to the next level, I don't wanna be boasting, but do check out my blog. I have shared a lot of the things that I've learnt about running serverless in production, some of the problems you run into, and how to address a lot of the observability concerns. I also have a video course with Manning, feel free to check it out, where we actually build something from scratch and apply a lot of the things that I've been talking about for the last year and a half, two years, in terms of how you do all the basic observability things, how to think about security, VPCs, performance, and so on. So all of that will be available in the podcast episode notes. Yeah, and also just go out there and talk to other people and learn from them. There are a lot of very knowledgeable people in this space already. People like Ben Kehoe from iRobot, people like Paul Johnston and Jeremy Daly, and quite a lot of other people who have been very active in sharing their knowledge and their experiences. Definitely go out there, find other people who are doing this, and try to learn from them.

Mike Julian: That's awesome. So thank you so much for joining us. Where can people find more about you and your work?

Yan Cui: You can find me on theburningmonk.com, that's my blog, where I try to write actively, and you can also find me on Twitter. I try to share new things that I find interesting, anything I learn, and whenever I write something, I publish it there as well. And if you don't wanna miss anything, I also have a newsletter you can subscribe to on my blog. 
And so I try to write up regular summaries and updates on things I've been doing. Also, I'm available for consultancy work if you need some help in your organization, whether to get started or to tackle specific problems that you have with serverless.

Mike Julian: Wonderful. Well thank you so much for joining us. And on that note, thanks for listening to the Real World DevOps podcast. If you wanna stay up to date on the latest episodes, you can find us at realworlddevops.com, and on iTunes, Google Play, or wherever you get your podcasts. I'll see you in the next episode.

Yan Cui: See you guys.


14 Feb 2019

Rank #10