
The Podlets - A Cloud Native Podcast

The Podlets is a weekly show that explores cloud native, one buzzword at a time. Each week experts in the field will discuss and contrast distributed systems concepts, practices, trade-offs, and lessons learned to help you on your cloud native journey. This space moves fast, and we shouldn’t reinvent the wheel. If you are an engineer, operator, or technically minded decision maker, this podcast is for you! Find us at https://thepodlets.io.


CI and CD in Cloud Native (Ep 11)

A warm welcome to John Harris, who will be joining us for his first time on the show today to discuss our exciting topic, CI and CD in cloud native! CI and CD are two terms that usually get spoken about together but are actually two different things entirely if you think about them. We begin by getting into exactly what these differences are, highlighting the regulatory aspects of CD in contrast to the future-focused nature of CI. We then move on to a deep exploration of their benefits in optimizing processes in the cloud native space through automation and surveillance from development to production environments. You’ll hear about the benefits of automatic building in container orchestration, the value of make files and local test commands, and the evolution of CI from its ‘rubber chicken’ days with Martin Fowler and Jez Humble. We take a deep dive into the many ways that containers differ from regular binaries as far as deployment methods, build speed, automation, run targets, real-time reflections of changes, and regulation. Moreover, we talk about the challenges of transitioning between testing and production environments, getting past human error through automation, and using sealed secrets to manage clusters. We also discuss the benefits and drawbacks of different tools in this space such as Kubebuilder, Argo, Jenkins X, and Tekton. Our conversation gets wrapped up by looking at some of the exciting developments on the horizon of CI and CD, so make sure to tune in!

Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback: info@thepodlets.io and https://github.com/vmware-tanzu/thepodlets/issues

Hosts: Bryan Liles, Nicholas Lane

Key Points From This Episode:

• The difference between CI and CD.
• Understanding the meaning of CD: ‘continuous delivery’ and ‘continuous deployment’.
• Building an artifact that can be deployed in the future is termed ‘continuous integration’.
• The benefits of continuous integration for container orchestration: automatic building.
• What to do before starting a project regarding make files and local test commands.
• Kubebuilder is a tool that scaffolds out the creation of controllers and web hooks.
• Where CI has got to as far as location since its ‘rubber chicken’ co-located days.
• The prescience of Martin Fowler and Jez Humble regarding continuous integration.
• The value of running tests in a CI process for quality maintenance purposes.
• What makes containers great as far as architecture, output, deployment, and speed.
• The benefits of CD regarding deployment automation, reflection, and regulation.
• Transitioning between testing and production environments using targets, clusters, and pipelines.
• Getting past human error through automation via continuous deployment.
• What containers mean for the traditional idea of environments.
• How labeling factors into the simplicity of transitioning from development to production.
• What GitOps means for keeping track of changes in environments using tags.
• How sealed secrets stop the need to change an app when managing clusters.
• The tools around CD and what a good CD system should look like.
• Using Argo and Spinnaker to take better advantage of hardware.
• How Jenkins X helps mediate YAML when installing into clusters.
• Why the customizable nature of CI tools can be seen as negative.
• The benefits of using cloud native-built tools like Tekton.
• Perspectives on what is missing in the cloud native space.
• A definition of blue-green deployments and how they operate in service meshes.
• The business abstraction elements of CI tools that are lacking.
• Testing and data storage-related aspects of CI/CD that need to be developed.

Quotes:

“With the advent of containers, now it’s as simple as identifying the images you want and basically running that image in that environment.” — @bryanl [0:18:32]

“The whole goal whenever you’re thinking about continuous delivery or continuous deployment is that any human intervention on the actual moving of code is a liability and is going to break.” — @bryanl [0:21:27]

“Any time you’re in developer tooling, everyone wants to do something slightly differently. All of these tools are so tweak-able that they become so general.” — @johnharris85 [0:34:23]

Links Mentioned in Today’s Episode:

John Harris — https://www.linkedin.com/in/johnharris85/
Jenkins — https://jenkins.io/
CircleCI — https://circleci.com/
Drone — https://drone.io/
Travis — https://travis-ci.org/
GitLab — https://about.gitlab.com/
Docker — https://www.docker.com/
Go — https://golang.org/
Rust — https://www.rust-lang.org/
Kubebuilder — https://github.com/kubernetes-sigs/kubebuilder
Martin Fowler — https://martinfowler.com/
Jez Humble — https://continuousdelivery.com/about/
David Farley — https://dfarley.com/index.html
AMD — https://www.amd.com/en
Intel — https://www.intel.com/content/www/us/en/homepage.html
Windows — https://www.microsoft.com/en-za/windows
Linux — https://www.linux.org/
Intel 386 — http://www.computinghistory.org.uk/det/6192/Introduction-of-Intel-386/
386SX — https://www.computerworld.com/article/2475341/flashback--remembering-the-386sx.html
386DX — https://en.wikipedia.org/wiki/Intel_80386
Pentium — https://www.intel.com/content/www/us/en/products/processors/pentium.html
AMD64 — https://www.webopedia.com/TERM/A/AMD64.html
ARM — https://en.wikipedia.org/wiki/ARM_architecture
Tomcat — http://tomcat.apache.org/
Netflix — https://www.netflix.com/za/
GitOps — https://www.weave.works/technologies/gitops/
Weave — https://www.weave.works/
Argo — https://www.intuit.com/blog/technology/introducing-argo-flux/
Spinnaker — https://www.spinnaker.io/
Google X — https://x.company/
Jenkins X — https://jenkins.io/projects/jenkins-x/
YAML — https://yaml.org/
Tekton — https://github.com/tekton
Concourse CI — https://concourse-ci.org/

Transcript:

EPISODE 11 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically-minded decision maker, this podcast is for you. [EPISODE] [00:00:41] BL: Back to the Kubelets Podcast, episode 11. I’m Bryan Liles, and today we have Nicholas Lane. [00:00:50] NL: Hello! [00:00:51] BL: And joining us for the first time, we have John Harris. [00:00:55] JH: Hey everyone. How is it going? [00:00:56] BL: All right! So today we’re going to talk about CI and CD in cloud native. I want to start this off with this whole term CI and CD. We talk about them together, but they are two different things almost entirely if you think about them. But CI stands for continuous integration, and then we have CD. What does CD stand for?  [00:01:19] NL: Compact disk. [00:01:20] BL: Right. True, and actually I’ve used that term before. I actually do agree. But what else does CD stand for? [00:01:28] NL: It’s continuous deployment right? [00:01:30] BL: Yeah, and? [00:01:31] JH: Continuous delivery.  [00:01:32] NL: Oh!
I forgot about that one. [00:01:35] BL: Yeah, that’s the interesting thing, is that as we talk about tech and we give things acronyms, CD is just a great one. Change in directories, compact disk, continuous delivery and continuous deployment. Here’s the bonus question, does anyone here know the difference between continuous delivery and continuous deployment?  [00:01:58] NL: Now that’s interesting. [00:01:59] JH: I would go ahead and say continuous delivery is the ability to move changes through the pipeline, but you still have the ability to do human intervention at any stage, and usually deployments production and continuous delivery would be a business decision, whereas continuous deployment is no gating and everything just go straight to product.  [00:02:18] BL: Oh, John! Gold start for you, because that is one of the common ones. I just like to bring that up because we always talk about CI and CD as they are just one thing, but they’re actually way bigger topics and we’ve already introduced three things here. Let’s start at the beginning and let’s talk about continuous integration, a.k.a CI.  I’ll start off. We have CI, and what is the goal of CI? I think that we always get boggled down with tech terms and all these technology and all these packages from all these companies. But I’d like to boil CI down to one simple thing. The process of continuous integration is to build an artifact that can be deployed somewhere at some future date at some future time by some future person, process. Everything else is a detail of the system you choose to use. Whether you use Jenkins, or CircleCI, or Drone, or you built your own thing, or you’re using Travis, or any of the other online CI tools. At the end of the day, you’re building either – If you’re doing web development. Maybe you’re building out Docker files, because we’re in cloud native. I mean docker images, because we’re in cloud native. But if you’re not, maybe you’re just building JARs, WARs, or EARs, or a ZIP file, or a binary, or something. I’d just like to start off, start this off with there. Any more thoughts on continuous integration? [00:03:48] NL: Yeah. I think the only times that I’ve ever used something that’s like continuous integration is when I’ve been doing like more container orchestration, like development, things on top of like things like Kubernetes, for instance. The thing I really like about it is like the concept of being able to like, from my computer, save and do an automatic save and push to a local repo and have all of the pieces get built for me automatically somewhere else, and I just love that so much because it saves so much brain thinky juice to run every command to make the binary you need. [00:04:28] BL: So did you actually create those scripts yourself? [00:04:30] NL: Some of them. When I’ve used things like GitLab, I use the pipeline that exists there and just fiddled around with like a little bit of code, like some bash there, but like not too much because GitLab has a pretty robust pipeline. Travis — I don’t think I needed to actually. Travis had a pretty good just go make Docker build, scripts already templated out for you. [00:04:53] JH: Yeah. I’d like to tell people whenever you start any project, whether it’s big or small, especially if it’s on – Not on Windows. I’ll tell you something different if it’s on Windows. 
But if you’re developing on a Mac or developing on Linux, the first thing you should do in your project is create a make file or your programming language equivalent of a make file, and then in that make file what you should do is write a command that will build your software that runs its tests locally, and also builds – whatever the process is.  I mean, if you’re running in Go, you do a Go build. If you’re using Rust, build with Rust, or C++, or whatever before you even write any code. The reason why is because the hardest part is making your code build, and if you leave that to the end, you’re actually making it harder on yourself. If your code build works from the beginning, all you have to do is change it to fit what you’re doing rather than thinking about it when it’s crunch time.  [00:05:57] NL: I actually ran into that exact scenario recently, because I’ve been building some tooling around some Kubernetes stuff, and the first one I did, I built it all manually by hand. Then at the end I was like – I gave it to the person who wanted it and they’re like, “So, where’s the make file?” I’m like, “Where’s the what?” So I had go in and like fill in the make file, and that was a huge pain in the butt.  Then recently the other thing I’ve been using is Kubebuilder. John, you and I have been talking about Kubebuilder quite a bit, but using Kubebuilder, and one of the things it does for you is it scaffolds out and a make file for you, and that was like going from me doing it by myself to having it already exist for you or just having it at the beginning was so much better. I totally agree with you, Brian. [00:06:42] BL: So quick point of order here. For those of us who don’t know what Kubebuilder is. What is Kubebuilder? [00:06:48] NL: Kubebuilder is a tool that was created by members of the Kubernetes Community to scaffold out the creation of controllers and web hooks. What a controller is in Kubernetes is a piece of software that waits, sort of watches a specific object or many specific objects and reconciles them. If they noticed that something has changed and you want to make an action based on that change, the controller does that for you. [00:07:17] JH: Okay. So it actually makes the action of working with CRDs and Kubernetes much easier than creating it all yourself.  [00:07:26] NL: Correct. Yeah. So, for instance, the one that I made for myself was a tool that watched, updated and watched a specific CRD, but it wasn’t necessarily a controller. It was just like flagging on whether or not a change occurred, and I used the dynamic client, and that was a huge headache on of itself.  Kubebuilder has like the ability to watch not just CRDs, but any object in Kubernetes and then reconcile them based on changes.  [00:07:53] NL: It’s pretty great. [00:07:54] BL: All right. So back to CI. John, do you have any opinions on CI or anecdotes or anything like that?  [00:07:59] JH: Yeah. I think one of the interesting things about the original kind of philosophy of CI outside of tooling was like trunk-based development that every develop changes get integrated into trunk as soon as possible. You don’t get into integration hell and rebasing. I guess it’s kind of interesting when you apply that to a cloud native landscape where like when that stuff came out with like Martin Fowler or Jez Humble probably 10, 15 years ago almost now, a lot of dev teams were co-located. You could do CI. I think there was a rubber chicken method where you didn’t use a tool. 
It was just whoever had the chicken that’s responsible for the build. Just to pull everyone else’s changes.  But now it seems like everything is branch-based. When you look at a project like Kubernetes, there’s a huge number of contributors all geographically displaced, different time zones, lots of different branches and features going on at the same time. It’s interesting how these original principles of continuous integration from the beginning now apply to these huge projects in the cloud native landscape.  [00:08:56] BL: Yeah, that’s actually a great point of how prescient Martin Fowler has been for many, many years, and even with Jez Humble being able to see these problems 10, 15 years ago and be able to describe them. I believe Jez Humble wrote the CD book, the continuous delivery book. [00:09:15] JH: Yeah, with David Farley, I think.  [00:09:18] NL: Yeah. Yeah, he did. So, John, you brought up some good things about CI. I try to simplify everything. I think the mark of someone who really knows what they’re talking about is being able to explain everything in the simplest words possible, and then you can work backwards when people understand.  I started off by saying that CI produces an artifact. I didn’t talk about branches or anything like that, or even the integration piece. But now let’s go into that a little bit. There are a lot of misconceptions about CI in general, but one of the things that we talk about is that you have to run test. No, you don’t have to run test, but should you? Yes, 100% of the time. Your CI process, your integration process should actually build your software and run the test, because running the test on this dedicated service or hardware wherever it is ensures that the quality of your software is there at least as much as your developers have insured the quality in the test.  It’s very important those run, and a lot of bugs of course can be spotted by running a CI. I mean, we are all sorts of developers here, and I tell you what, sometimes I forget to run the test locally and CI catches me before a commit makes it into master and it has a huge typo or a whole bunch of print lines in there. Moving on here, thinking about CI and cloud native. Whenever you’re creating a cloud native app, have you ever thought about the differences between let’s say creating just a regular binary that maybe runs on a server, but not in a container on somebody’s cloud native stack, i.e. Kubernetes? Have you ever thought about the differences of things to think about?  [00:11:04] BL: Yeah. So part of it is – I would imagine or I believe it’s like things like resource, like what resources you need or what architecture you’re deploying into. You need the binary to make like run in this – With containerization, it’s easy because you’re like, “I know that the container is going to be this architecture,” but you can’t necessarily guarantee that outside of a containerized world. I mean, I suppose you can being like with the right tooling setup you can be like, “I only want to run on this.”  But that isn’t necessarily guaranteed, because any computer that runs on could be just whatever architecture that happens to land on, right?  Also, something to – I think of is like how do you start processes on disparate computers in a controlled fashion? Something like, again, with containers, you can trust that the container runtime will run it for you. But without that, it seems like a much harder task. [00:12:01] NL: Yeah, I would agree. 
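To make the CI flow described above concrete — build the software, run the same tests a developer would run locally, and produce a container image as the artifact — here is a minimal sketch of a GitLab CI pipeline of the kind Nicholas mentions. The image names, registry, and Make target are illustrative assumptions, not anything taken from the episode:

```yaml
# .gitlab-ci.yml — illustrative sketch only; project name and registry are hypothetical.
stages:
  - test
  - build

test:
  stage: test
  image: golang:1.13
  script:
    - make test                    # run the same target developers use locally

build-image:
  stage: build
  image: docker:stable
  services:
    - docker:dind                  # Docker-in-Docker so the job can build images
  script:
    - docker build -t registry.example.com/myapp:$CI_COMMIT_SHORT_SHA .
    - docker push registry.example.com/myapp:$CI_COMMIT_SHORT_SHA
  only:
    - master                       # only trunk commits produce a deployable artifact
```

The tagged image is the artifact that the delivery side of the pipeline can later promote from environment to environment.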
Then I said that containers in general just help us out, because most of our workloads go on some AMD or Intel 64 bit and it’s Linux. We know what our output is going to be. So it’s not like in the old days where you had to actually figure out what your run target was. I mean, that’s even on Intel stacks. I mean, I’m updating myself here where you had like – When the 386 was out and then you had the 386SX and the 386DX, there were different things there, and you actually compile your code different. Then when the 46 came out and then when we had introduction of Pentium chips, things were different.  But now we can pretty much all target AMD64, and in some cases, I mean, there are some chip things like the bigger encryption things that are in the newer chips. But for the most part, we know what our deployed target is going to be.  But the cool thing is also that we don’t have to have Intel or AMD64. It could be ARM32 or ARM64, and with the addition to a lot of the work that has been going on in Windows land lately, we can have Windows images. I don’t know so many people were doing that yet. I’m not out and part of the field, but I like that the opportunity is there. [00:13:25] JH: Oh! I think one of the interesting things is the deployment method as well. Now with containers, everything is kind of an immutable rip and replace. Like if we develop an application, we know that the old container is going to stop when I deploy a new one. I think Netflix were doing a little bit of this before containers and some other folks with like baking AMIs and using that immutable method. But I think before that it was if we had a WAR file, we had to throw it back into Tomcat, let Tomcat pick it up or whatever. Everything was a little bit more flaky in terms of deployment. We had to do a lot of checks around deployment rather than just bring something out, bring something back in blue/green, whatever.  [00:13:59] BL: Well, I actually like that you brought that up, because that’s actually one of the greatest parts of this whole cloud native thing, is that when we’re using containers and we’re deploying with containers, we know what our file system is going to look like, because we created it. There would not be some rogue file or another configuration there that will trip up our deployment, because at build time, we’ve created the environment. It’s much better than that facility that Netflix was doing with baking AMIs.  In a previous life, I actually ran the facility for baking AMIs at a large company where we had thousands of developers on more than a thousand dev teams, and we had a lot of spyware. Whenever you had to build an image, it was fine in one account, but if you had let’s say a thousand accounts with the way that AWS works and encrypted images, you actually had to copy all the images to all the accounts. It couldn’t actually boot it from your account. That process would literally take all night to get it done across all of our accounts.  If you made a mistake, guess what? You get to do it again. So I am glad that we actually have this thing called a container and all these things based on CRI, the container runtime, that we are able to quickly build containers.  I don’t want to just limit this conversation to continuous integration. Let’s get into the other parts too with deployment and delivery. What is so novel about CD and the cloud native world? [00:15:35] NL: I think to me it’s the ability to have your code or your artifact or whatever it is, whatever you’re working on. 
When you make a change, you can see the change reflected in reality, whatever your reality looks like, without your intervention. I mean, you might have had to set up all the pipelines and all that jargon, but when you press save in VS code and it creates a branch and runs all your tests and then deploys it for you or delivers it for you into what you’d define as reality, that’s just so nice, because it really kind of sucks having to do the like, “Okay, I’ve got a new deployment. Destroy the old deployment. Put in the new one or like rev the new image tag or whatever in the deployment you’re doing.” All these manual steps, again, thinky-brain juice, it takes pieces of your attention away, and having these pieces like added for you is just so nice. [00:16:30] BL: Yeah, what do you think, John?  [00:16:32] JH: Yeah. I think just something in the state of DevOps we’ve bought one of the best predictors for a company’s success is like cycle time of feature from ideation to production. I think like the faster we can get that cycle – It kind of gets me interested. How long does an application take to build? If it takes two hours, how good are you at getting features out there quickly? Maybe one of the drivers with microservices, smaller pieces independently deployed, we can get features out to production quicker, because I think the name of the game is just about enabling developers to put the decision in the hands of the business to decide when the customer should see that feature. I think the tighter we can make that cycle, the better for everyone. [00:17:14] BL: Oh, no! I agree. I love and hate web services, but what I do like is the idea of making these abstractions smaller, and if the abstractions are smaller, it’s less code. A lot of the languages we use now are faster compiling, let’s say, a large C++ project. That could take literally two hours to compile. But now when we have languages like Go, and Rust is not as fast, but it’s not slow as well. Then we have all of our interpret languages, whether it’d be Python, or JavaScript, or TypeScript, where we can actually go from an idea, run the test in a few minutes and build this image that we can actually run and see it almost in real-time.  Now with the complexity of the tools, I mean, the features that are built in the tools, we can now easily manage multiple deployment environments, because think about before, you would have a dev environment, and that would be the Wild West. That would be literally where it would be awful. You might have to rebuild it every couple of months. Then you would have staging, and then maybe you would have some kind of pre-prod environment just as like your final smoke test, and then you would have your production.  Maintaining all the software on all those was extremely hard. But now with the advent of containers, now it’s as simple as identifying the images you want and basically running that image in that environment. I like where we’ve ended up. But with all power comes new problems, and just because we can deploy quicker means we just run into a lot of different problems we didn’t run into before.  The first one that I’ll bring up is the complexity. Auto conversion between environments, so moving code between test staging and production. How do we do that? Any ideas before I throw some out there?   [00:19:11] NL: I guess you would have different, or maybe the same pipeline but different targets for like if say you’re using something like Kubernetes. 
You could have one part of your pipeline deploy initially to this Kubernetes context, which points to like one cluster. It’s building up clusters by environment type and then deploying into those, running your tests, see if it runs properly and then switch over to the next context to apply that image tag and that information and then just go down the chain until you go to production.  [00:19:44] BL: Well, that’s interesting. One thing I’d like to throw out there, and I’m not advocating any particular product. But the idea of having pipelines for continuous integration and your CD process is great, where you can now have gates and you can basically automate the whole thing. Code goes into CI and we built an artifact, and a message can go out automatically to an approver or not, and that message could say, “Hey! This code is going to be integrated into our trunk or our master branch.” They can either do it themselves manually as a lot of people do or they can actually maybe click on a link or check a checkbox and this gets integrated in.  Then what automatically could happen at this point is, and I’ve seen a lot of companies doing this, is now we take that software and we spin up a new whole environment and we just install that software. For that one particular feature that you worked on, you can actually get an automatic environment for that.  Then what we can do is we can take that environment itself and we can now merge this maybe into a staging branch or tag it with a staging label, and that automatically gets moved to staging. Depending on how complicated you are, how advanced you are, now you can actually have it go out to your product people or people who make decisions, maybe your executives, and they can view the software in whatever context it happens to be in. Then they can say, “Okay.” Now that’s when we’re talking about now we can hit okay and the software just keeps on moving to the pipeline and it gets into production. The whole goal here, and this is actually where your goal should be just in general whenever you’re thinking about continuous delivery or continuous deployment is that any human intervention on the actual moving of code is a liability and is going to break, and it’s going to break because on Friday afternoon at 5:25 PM, someone’s thinking about the weekend and they’re not thinking about code, and they’re going to break your build. Our goal is to build these delivery systems that are Friday afternoon proof. We can push code anytime. It doesn’t matter. We trust our process.   [00:22:03] JH: I think it’s a great point about environments. I think back in the day, an environment used to be a set of machines, and then test used to be – staging was where there were kind of more stable versions of APIs and folks were more coordinated pushing things into them. What really is an environment? Like you said, when we push micro services or whatever service, we can spin up an entire Kubernetes cluster just for that service. We can set it up. We can run whatever tests we want. We could tear it down.  With the advent of Elastic compute, and now containers, they really enabled this world where like the traditional idea of an environment and what constitutes an environment is starting to get a bit kind of sloppy and blend into each other.  [00:22:42] BL: I like it though. I think it’s progress.  [00:22:45] NL: I totally agree. The one that scares me but I also find like really interesting, is the idea of having all of your environments in one set of machines. So clusters. 
Having a multi-tenanted set of machines for like dev staging and production, they’re all running in the same place and they’re all just separated by like what configuration of like connectivity from different networking and things like that set up.  When a user hits your website, bryanliles.com, they should go to the production images, but those are binaries, and those binaries should be running in the same space essentially as the development ones. It’s scary, but it’s also like allows for like some really fast testing and integration. I find it to be very fascinating. [00:23:33] BL: I mean that’s where we want to be. I find more often than not that people have separate clusters for dev and staging and production. But using the Kubernetes API, you don’t have to do that, because what we can do is we can force deployment or workload to a set of machines based on their label. That’s actually one of the very strong positives for Kubernetes. Forget all the complexity.  One of the things that makes it easy is to say that I want this particular deployment to only live on my development machines. Well, which development machine? I don’t care. What if we increase our development pool size? We just re-label nodes. It doesn’t matter. Now we can just control that. When it comes down to controlling cost and complexity, this is actually one idea that Kubernetes is leading and just making it easier to actually use more of your hardware.  [00:24:31] NL: Yeah. Absolutely. That’s so great because if you think about it from a CI/CD standpoint, at that point all you have to do is just change the label to where you’re applying this piece of code. So you’re like, “Node selector, label equals dev. Okay, now it’s staging. Okay, now it’s prod.”   [00:24:47] BL: So this brings me into the next part of what I want to talk about or introduce to you all today. We’re on a journey as you probably can tell. Now whenever we have our CI process and we’re building and we’re deploying, where do we store our configurations?  [00:25:04] NL: [inaudible 00:25:04]. [00:25:06] BL: Ever thought about that? [00:25:08] NL: Okay. I mean, in a Kubernetes perspective, you might be using something like etcd to sort of – But like everything else, what if you’re using Travis? [inaudible 00:25:16] store everything. Everything should be versioned, right? Everything should be – [00:25:20] BL: Yeah, 100%. [00:25:24] NL: I would store everything these as much as possible. Now, do I do that all the time? God, no! Absolutely not. I’m a human being after all. [00:25:32] BL: I mean, that’s what I actually want to bring up, is this concept of GitOps. GitOps was a coined term by my friend, Alexis, who works at Weave. I think Weave created this. Really what it’s about is instead of having – basically, Kubernetes is declarative, and our configurations can be declarative too, because what we can do is make sure is we can have tech space configurations, and for one reason it’s because tech space means it can be versioned. It can be diffs. We take those text versions and we put them in our same repository we put our code in. How do we know what’s in production at any given time or any given time in the past? We just look at the tags of what we did.  We had a push at 5:15 on August 13th. Of course, this is 5:15, you could see time, because any other time doesn’t exist in the computer land. So what we could do is we could just basically tag that particular version as like 2019-08-13. 
If I said 5-17-55, and we call 01 just so we could have 100 deploys in a day. If we started doing that, now not only can we control what we have, but we can also know what was on in any given environment at any given time.  Because with Git and with Mercurial and any other of these – Well, only the popular ones, with Git and Mercurial, you can definitely do this. Any given commit can have multiple tags. You could actually have a tag that hit dev and then a tag that, let’s say, hits staging, and then a tag that hit production, the exact same code but three different tags. So you know at any given time what happened.   [00:27:18] JH: Yeah, the config thing is so important. I think that was another Jez Humble quote where it was like, “Give me three hours access to your code and I’ll break it. But give me 5 minutes with your configurations and I’ll break it.” Almost like every big bug is, right, someone was accidentally pointing the prod server to the staging database like, “Oops! Their API was pointing to the wrong port, and everything came down,” or we changed the wrong versions or whatever. I think that’s one of the intersections of developers and operations folks. We kind of talked about like Dev Ops and things like that. I really love the idea of everything being kept in Git and using GitOps, but then we’ve got things like secrets and configuration that shouldn’t be seen or being able to be edited by developers, but need to be for ops folks. But we still want to keep the single point of truth. Things like sealed secrets have really enabled us to move along in this area where we can keep everything in text-based version.  [00:28:08] BL: All right. Quick point of order here. Sealed secrets is a controller/CRD created by Bitnami. What it allows you do is, John –  [00:28:23] JH: It allows you – It creates a CRD, which is sealed secret, which is a special resource type in your cluster and also creates a key, which is only available to that operator running in your cluster. You can submit a sealed secret in plain text or you can submit a secret in plain text and it will throw it back out as an encrypted secret with that key and then you can check that into version control. Then when you go to deploy your software, you can deploy that encrypted secret into the cluster. The operator will pick it up, decrypt it using only the key that it has access to and then put it back in the cluster as a regular secret. Your application just interacts with regular Kubernetes secrets. You don’t need to change your app. They deal with all the encryption outside of the user intervention.   [00:29:03] BL: I think the most important part of what you said is that this allows us to have no excuses about what we can store in our repositories for our configuration, because someone is going to make the argument, “No, we can’t store secrets, because someone’s going to be able to see them.” Well, guess what? We never even stored an unencrypted secret in our repository. They’re all encrypted, and it’s still secrets. It’s [inaudible 00:29:25]. I don’t know if anyone’s cracked yet. I’m sure maybe a state level actor has thought of it. But for us regular people, even our companies, like even at VMware, or even at Google, they have not done it yet.  So it’s still pretty safe. Thinking even further now, and really what I’m trying to paint the picture of is not just how do you do CD, but really what CD could look like and how it can actually make you happy rather than sad.  
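As a concrete illustration of the sealed-secrets flow John describes, here is a rough sketch of the object that would be committed to the repository, assuming the Bitnami sealed-secrets controller and its v1alpha1 API; the names are hypothetical and the encryptedData value is a truncated placeholder rather than real ciphertext produced by the project’s kubeseal CLI:

```yaml
# Safe to commit: only the controller's private key, which never leaves the cluster,
# can decrypt the value below.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  encryptedData:
    password: AgBy3i4OJSWK+PiTySYZZA9rO...   # placeholder ciphertext
  template:
    metadata:
      name: db-credentials
      namespace: production
```

When this is applied, the controller decrypts it and writes an ordinary Kubernetes Secret named db-credentials, so the application keeps reading secrets the usual way, exactly as discussed above.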
The next item I wanted to think about was tools around CD and creating tools and what does a good continuous delivery system look like. I kind of hinted about this earlier whenever I was talking about pipelines. The ability to take advantage of your hardware, so we’re deploying to let’s say 100 servers. We’re pulling 5 or 6 services to 100 node cluster. We can do those all at once, and what we can do is you want to have a system that can actually run like this. I could think of a couple.  From Intuit, there is Argo, and they have Argo CD. There is the tool created by Google and maybe Netflix. I want to have to look that one up. It’s funny, because they quoted – [00:30:40] JH: Spinnaker? [00:30:42] BL: Spinnaker. They quoted me in their book, and I don’t remember their name. I’m sorry anyone from Spinnaker product listening. Once again, not advocating any products, but they have the concept of doing pipelines. Then you also have other things for your projects, like if you’re using open source, Drone. Another X Google – I think it was X-Googler that made this. Basically, they have ways you can do more than one thing at a time.  The most important piece about this is not only can you do more than one thing at a time, is that you have a programmatic check that it’ll make sure that you can verify that whatever you did was successful. We deployed to staging or we deployed to our smoke test servers for our smoke test, and that requires our testing people and an executive signoff. They can actually just wait until they get their signoff or maybe if it goes over a day or so, they can actually – It just fails, and now the build is done. But that part is pretty neat. Any other topics over here before I start throwing out more? [00:31:45] NL: I think I just have thoughts on some of the tools that we’ve used. Everyone  Jenkins. Jenkins can do anything that you want it to do, but you really have to tighten the screws on it. It is super powerful. It’s kind of like Bash, like Bash scripting. It’s super powerful, but you have to know precisely what you’re doing, otherwise it can really hurt you.  Actually, I have used Spinnaker in the past, and I’ve really liked it. It has a good UI, very good pipelines. Easy blue/green or canary deployment mechanism, I thought that was great. I’ve looked at Drone, believe it or not, but Drone is actually pretty cool. Check out Drone. I really liked it.  [00:32:25] BL: Well, since we’re throwing out products, Jenkins, does have JenkinsX. I have not given it the full rundown yet. But what I do like about it, and I think everyone should pay attention to this if you’re doing a product in this space, is that when you install JenkinsX, you install it locally to your machine. You basically get this binary called JX, and you then tell JX to install it into your cluster. Instead of just doing kubectl apply-f a whole bunch of YAML, it actually ask you questions and it sets up GitHub repositories or wherever you need these repositories. It sets up [inaudible 00:33:01] spaces for you.  There’s no just [inaudible 00:33:05] kubectl apply-f HTTPS: I just owned your system, because that’s actually a problem. Then it solves the YAML sprawl, because YAML and Kubernetes is something that is complained about a lot, but it’s how it’s configured. But it’s also just a detail what we’re supposed to be doing, and we actually work with Joe Beda and I could talk about this all the time, is that the YAML is the implementation, but it’s not the idea. 
The idea is that we build tools on top of that that create YAML so users have to see less YAML. I think that’s a problem with Jenkins, is that it’s so powerful and they’re like, “Well, we want powerful people or smart people to be able to do smart things. So here you go.”  The problem with that is that where do I start? It’s a little daunting. So I do think that they definitely came with the much stronger game with this JX command. Just as a little sidebar, we do it as well with our Valero project, and I think that just speaks, should be like the bar for anything. If you’re installing something into a cluster, you should come up with a command line tool that helps you manage the lifecycle of whatever you’re installing to the operator, YAML, whatever.  [00:34:18] JH: I think what’s interesting about the options, this is definitely one area where there’s so much nuance. Any time you’re in developer tooling, everyone wants to do something slightly differently. All of these tools are so tweak-able that they become so general. I think it’s probably one of the criticisms that could be leveraged against Jenkins is that you can do everything, and that’s actually a negative as well as a positive. Sometimes it’s too overwhelming. There are too many ways of doing things. I’m a fan of some of the more kind opinionated tools in that space.  [00:34:45] BL: Yeah. I like opinionated tools as well, but the problem that we’re having in this cloud native space is that, yeah, Kubernetes is five-years-old now. We are just getting to the point where we actually understand what a good decision is, because there was a lot of guesses before and we’ve done a lot of things, and some of these have been good ideas, but in some cases they have not been great ideas.  Even I ran the project case on it. Great idea on paper, but implementation, it required people to know too many things. We’d learned a lot of lessons from that. That’s what I think we’re going to find out in this space is that we’re going to learn little lessons. I say this project from my last project that I was going to bring up is something that I think has learned some of the lessons.  Google sponsors a project called Tekton, and if you go to – It’s like I believe, and they have some continuous delivery stuff in there and they implement pipelines. But the neat part is, and this is actually the best part, it’s actually a cloud native built service. So every step of your delivery process, from creating images, to actually putting them on clusters, is backed by a Docker image or a container, and I think that part is pretty neat. So now you can define your steps.  What is your step? Well, you can use one of their pre-baked, run this command, or if you have something special, like the example before I was giving out where you would say that you need an approval, maybe it’s a Slack approval. You send something with Slack and it has a checkbox, check yes if you like me. What we can do now is we can actually control that and it’s easy to write something a little Docker image that can actually make that call and then get the request and then it can move it on.  If you’re looking at more of a toolkit full of good ideas, I do think that Tekton has definitely has some lots of industry. People are looking at it and it’s probably the best example of getting it right in the cloud native way. Because a lot of the products we have now are not cloud native. We’re talking about Jenkins. 
We’re talking about Spinnaker and we talk about Drone and Travis, which is totally a SaaS product. They’re not cloud native.  Actually, the neat part about Tekton is that it actually comes with its own controllers and its own CRDs. So you can actually build these things up using your familiar Kubernetes tooling, which means in theory we could actually use the tooling that we are deploying. We can actually control it in the same way as our applications, because it’s just yet another object that goes in our cluster.  [00:37:21] NL: That does sound pretty cool. One other that I meant to bring up was Concourse. Have you check out Concourse yet?  [00:37:27] BL: CouncourseCI. I have not. I have used it, but never in a way where I would have a big opinion on it.  [00:37:34] NL: I’m kind of in the same place. I think it’s a good idea. It seems really neat, but I need to kick the tires a little more. I will say that I really like the UI. The structure of the UI is really nice. Everything makes sense, and anything you can click on like drills into something a bit deeper. I think that’s pretty cool, but it is one of the shout that I went out to as well as like another tool that I’m aware of.  [00:37:52] BL: Yeah, that’s pretty interesting. So we’ve gone about 40 minutes now. Let’s actually start winding this down, and the way that I’m going to suggest that we wind this down is thinking about where we are now. What’s missing in this space and what else could we actually be doing in the cloud native space to make this work out better?  [00:38:12] NL: I think I’d like to see better structured or better examples of blue-green or canary deployments with tests associated, and that might just be like me not looking hard enough at this problem. But anytime I began looking at blue-green, I get the idea of what someone’s done, but I would love to see some implementation details, or any of these opinionated tools having opinions around blue-green and what they specifically do to test it. I feel like I’m just not seeing that.  [00:38:41] BL: With blue-green, blue-green is hard to do in Kubernetes without an external tool, because for everyone, a blue-green deployment is, I have a software deployment and we’ll give it a color. We’ll call it blue, and I have the next version, and we’ll call it green. Really what I can do is I basically have two versions of my application deployed and I can use my load balancer, or in this case, my service to just change the label or the selector in my service and now I can point at at my green from my blue. Then I want to deploy again, I can just deploy another blue and then change my label selector again.  The problem with this is that you can do it in Kubernetes, just fine. But out of the box with Kubernetes, you will drop traffic, because guess what? What happens to a connection that was initiated or a session that was initiated on the blue cluster when you went to green? Actually, this is a whole conversation in itself about service meshes and this is actually one of the reasons service mesh is a big topic, because you can do this blue-green, or another example would be Netflix and Redblack, or you get the creative people who are like rainbow deployments, because just having two is not good enough for them. So they want to have any number of deployments going at one time. I agree with that 100%.  [00:39:57] JH: I think, yeah, integrating tools like launch. 
[inaudible 00:40:01] and I think there are more which enable – I think we’re missing the business abstractions on this stuff so far. Like you said, it’s kind of hard to do if you need to go into the gritty of it right now, but I think the business abstractions of if we deploy a different version to a certain subset of customers, can we get all of those metrics? Can we get those traces back in? Will you automate it, roll it out? Can we increase the percentage of customers that are seeing those things? Have that all controlled in a Kubernetes native way, but having roll it up to a business and more of an abstraction. I think that stuff is currently missing. I think the underpinning kind of technologies are coming up, stuff like service mesh, but I think it’s the abstraction that’s really going to make it useful, which doesn’t exist today.  [00:40:39] BL: Yeah. Actually, that’s pretty close to what I was going to say. We built all these tooling that helps us basically as technologists, but really what it comes down to is the business. A lot of the things we’re talking about where we’re talking about CD is important to the business, but when we’re talking about metrics or trace collection, that’s not important to the business, because they only care about the SLA. This is on the SLO side.  What we really need to do is mature our processes enough that we can actually marry our outputs to something that other people can understand that has no jargon and it’s sales going up, sales going down. Everything else is just a detail.  So, anything else?  [00:41:20] NL: Something I think I’d like to see is in our testing, if there was a good way to accurately show the effect of something at load in a CI/CD component. Because one of the things that I’ve run into is like I’ve got this great idea for how this code should work and when I deploy it, it works great. The like a thousand people touch it all at once and it doesn’t work right anymore. I’d love to have some tool along the way that can test things out of load and like show me something that I could fix before all those people touch it. [00:41:57] BL: Yes, that would be a good tool to have. So John, anything else for you? [00:42:02] JH: I’ll open a can of worms right at the end and say the biggest problem here is probably going to be data when we have a lot of systems we need to talk to each other and we need the data to align between those systems and we have now proliferation of environments and clusters. Like how do we get that data reliably into the place that it needs to be to make up testing robust enough to get things out there? It’s probably an episode on some –  [00:42:23] BL: Yeah, that’s a big conversation that if we could answer it, we wouldn’t working at VMware. We would have our own companies doing all these great things. But we can definitely iterate on it. So with that, I think we’re going to wrap it up. Thanks for listening to the Kubelets. I’m Bryan Liles, and with me today was Nicholas Lane and John – Yeah, and John Harris.   [00:42:47] JH: Thanks everyone. [00:42:47] BL: All right, we’ll see you next time. [END OF EPISODE] [00:42:50] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END]See omnystudio.com/listener for privacy information.
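For readers who want to picture the blue-green mechanics Bryan describes near the end of the episode — two versions of the application deployed side by side, with the Service’s label selector switched from one to the other — a minimal sketch follows; the labels and ports are hypothetical:

```yaml
# Assumes two Deployments are already running, labeled app: myapp, track: blue
# and app: myapp, track: green. Cutting over is a one-line change to the selector.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    track: blue        # change to "green" to shift traffic to the new version
  ports:
    - port: 80
      targetPort: 8080
```

As noted in the discussion, this switch alone can drop in-flight connections, which is one reason service meshes come up in the same breath as blue-green.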


6 Jan 2020


The Past, Present and Future of Kubernetes with Craig McLuckie (Ep 13)

Today on The Podlets Podcast, we are joined by VMware’s Vice President of Research and Development, Craig McLuckie! Craig is also a founder of Heptio, which was acquired by VMware, and during his time at Google he was part of bringing Kubernetes into being. Craig has loads of expertise and shareable experience in the cloud native space, and we have a fascinating chat with him, asking about his work, Heptio and, of course, Kubernetes! Craig shares some insider perspective on the space, the rise of Kubernetes and how the increase in Kubernetes’ popularity can be managed. We talk a lot about who can use Kubernetes and the prerequisites for implementation; Craig insists it is not a one-size-fits-all scenario. We also get into the shortage of suitably qualified engineers and how this is impacting competition in the hiring pool. Craig comments on taking part in the open source community and the buy-in that is required to meaningfully contribute, as well as sharing his thoughts on the need to ship new products and services regularly. We finish off the episode with some of Craig’s perspectives on the future of Kubernetes, the dangers of using it to mask problems in your code, and the next phase of its lifespan. For this amazing chat with a true expert in his field, make sure to join us for this episode!

Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback: info@thepodlets.io and https://github.com/vmware-tanzu/thepodlets/issues

Special guest: Craig McLuckie

Hosts: Carlisia Campos, Duffie Cooley, Josh Rosso

Key Points From This Episode:

• A brief introduction to Craig’s history and his work in the cloud native space.
• The questions that Craig believes more people should be asking about Kubernetes.
• Weighing the explosion of the Kubernetes space; fragmentation versus progress.
• The three pieces of enterprise software and aiming to enlarge the ‘crystalline core’.
• Craig’s thoughts on specialized Kubernetes operating systems and their tradeoffs.
• Quantifying the readiness of an organization to implement Kubernetes.
• Craig’s reflections on Heptio and the lessons he feels he learned in the process.
• The skills shortage for Kubernetes and how companies are approaching this issue.
• Balancing the needs and level of the community and shipping products regularly.
• Involvement in the open source community and the leap of faith that is inherent in the process.
• The question of microliths; making monoliths more complex and harder to manage.
• Masking problems with Kubernetes and how detrimental this can be to your code.
• Craig’s thoughts on the future of the Kubernetes space and possible changes.
• The two duty cycles of any technology; the readiness phase that follows the hype.

Quotes:

“I think Kubernetes has opened it up, not just in terms of the world of applications that can run Kubernetes, but also this burgeoning ecosystem of supporting technologies that can create value.” — @cmcluck [0:06:20]

“You’re not a cool mainstream enterprise software provider if you don’t have a Kubernetes story today. I think we’ll start to see continued focus and consolidation around a set of the larger organizations that are operating in this space.” — @cmcluck [0:06:39]

“We are so much better served as a software company if we can preserve consistency from environment to environment.” — @cmcluck [0:09:12]

“I’m a fan of rendered down, container-optimized operating system distributions. There’s a lot of utility there, but I think we also need to be practical and recognize that enterprises have gotten comfortable with the OS landscape that they have.” — @cmcluck [0:14:54]

Links Mentioned in Today’s Episode:

Craig McLuckie on LinkedIn
Craig McLuckie on Twitter
The Podlets on Twitter
Kubernetes
VMware
Brendan Burns
Cloud Native Computing Foundation
Heptio
Mesos
Velero
vSphere
Red Hat
IBM
Microsoft
Amazon
KubeCon

Transcript:

EPISODE 13 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically-minded decision maker, this podcast is for you. [INTERVIEW] [00:00:41] CC: Hi, everybody. Welcome back to The Podlets podcast, and today we have a special guest, Craig McLuckie. Craig, I have the hardest time pronouncing your last name. You will correct me, but let me just quickly say, well, I’m Carlisia Campos and today we also have Duffie Cooley and Josh Rosso on the show. Say that three times fast, Craig McLuckie.  Please help us say your last name and give us a brief introduction. You are super well-known in the Kubernetes community and inside VMware, but I’m sure there are not enough people that should know about you that didn’t know about you.  [00:01:20] CM: All right. I’ll do a very quick intro. Hi, I’m Craig McLuckie. I’m a Vice President of Research and Development here at VMware. Prior to VMware, I spent a fair amount of time at Google where my friend Joe and I were responsible for building and shipping Google Compute Engine, which was an interesting exercise in bringing traditional enterprise virtualized workloads into the very sophisticated Google data center.  We then went ahead and as our next project with Brendan Burns, started Kubernetes, and that obviously worked out okay, and I was also responsible for the ideation and formation of the Cloud Native Computing Foundation. I then wanted to work with Joe again. So we started Heptio, a little startup in the Kubernetes ecosystem. Almost precisely a year ago, we were acquired by VMware. So I’m now part of the VMware company and I’m working on our broader strategy around cloud native apps under the brand [inaudible 00:02:10]. [00:02:11] CC: Let me start off with a question. I think it is going to be my go-to first question for every guest that we have in the show. Some people are really well-versed in the cloud native technologies and Kubernetes and some people are completely not. Some people are asking really good questions out there, and I try to too as I’m one of those people who are still learning.  So my question for you is what do you think people are asking that they are not asking in the right frame, that you wish they would be asking in a different way. [00:02:45] CM: It’s a very interesting question. I don’t think there’s any bad questions in the world, but one question I encountered a fair bit is, “Hey, I’ve heard about this Kubernetes thing and I want one.” I’m not sure it’s actually the right question, right? Kubernetes is a powerful technology. I definitely think we’re in this sort of peak hype phase of the project.
There are a set of opportunities that Kubernetes really brings a much more robust ability to manage, it abstracts  a way infrastructure — there are some very powerful things. But to be able to be really successful with Kubernetes project, there’re a number of additional ingredients that really need to be thought through.  The questions that ought to be asked are, "I understand the utility of Kubernetes and I believe that it would bring value to my organization, but do I have the skills and capabilities necessary to stand up and run a successful Kubernetes program?" That’s something to really think about. It’s not just about the nature of the technology, but it really brings in a lot of new concepts that challenge organizations.  If we think about applications that exist in Kubernetes, there’s challenges with observability. When you think the mechanics of delivering into a containerized sort of environment, there are a lot of dos and don’ts that make a ton of sense there. A lot of organizations I’ve worked with are excited about the technology, but they don’t necessarily have the depth of understanding of where it's best used and then how to operate it.  The second addendum to that is, “Okay, I’m able to deploy Kubernetes, but what happens the next day? What happens if I need to update it? When I need to maintain it? What happens when I discover that I need not one Kubernetes cluster or even 10 Kubernetes clusters, but a hundred or a thousand or 10,000.” Which is what we are starting to see out there in the industry. “Have I taken the right first step on that journey to set me up for success in the long-term?” I do think there’s just a tremendous amount of opportunity and excitement around the technology, but also think it’s something that organizations really need to look at as not just about deploying a platform technology, but introducing the necessary skills that are necessary to operate and maintain it and the supporting technologies that are necessary to get the workloads on to it in a sustainable way.  [00:04:42] JR: You’ve raised a number of assumptions around how people think about it I think, which are interesting. Even just starting with the idea of the packaging problem that represents containerization is a reasonable start. So infrequently, do we describe like the context of the problems that — all of the problems that Kubernetes solve that frequently I think people just get way ahead of themselves. It’s a pretty good description.  [00:05:04] DC: So maybe in a similar vein, Craig, we had mentioned all the pieces that go into running Kubernetes successfully. You have to bolt some things on maybe for security or do some things to ensure observability as adequate, and it seems like the ecosystem has taken notice of all those needs and has built a million projects and products around that space.  I’m curious of your thoughts on that because it’s like in one way it’s great because it shows it’s really healthy and thriving. In another way, it causes a lot of fragmentation and confusion for people who are thinking whether they can or cannot run Ku, because there are so many options out there to accomplish those kinds of things. So I was just curious of your general thoughts on that and where it’s headed. [00:05:43] CM: It’s fascinating to see the sort of burgeoning ecosystem around Kubernetes, and I think it’s heartening, because if you think at the very highest level, the world is going to go one of two ways with the introduction of the hyper-scale public cloud. 
It’s either going to lead us into a world which feels like the mainframe era again, where no one ever got [inaudible 00:06:01] Amazon in this case, or by Microsoft, whatever the case, whoever sort of emerges over time as the dominant force. But it also represents some challenges where you have these vertically integrated closed systems; innovation becomes prohibitively difficult. It’s hard to innovate in a closed system, because you’re innovating only for organizations that have already taken that dependency. I think Kubernetes has opened it up, not just in terms of the world of applications that can run on Kubernetes, but also this burgeoning ecosystem of supporting technologies that can create value. There’s a reason why startups are building around Kubernetes. There’s a reason they’re looking to solve these problems. I do think we’ll see a continued period of consolidation. You're not a cool mainstream enterprise software provider if you don’t have a Kubernetes story today. I think we’ll start to see continued focus and consolidation around a set of the larger organizations that are operating in this space. It’s not accidental that Heptio is a part of VMware at this point. When I looked at the ecosystem, it was pretty clear we needed a bigger boat to fully materialize the value of Kubernetes, and I am pleased to be part of this organization. So I do think you’ll start to see a variety of different vendors emerging with pretty clear, well-defined opinions and relatively turnkey solutions that address the gamut of capabilities an organization needs to get into Kubernetes. One of the things that delights me about Kubernetes is that if you are a sophisticated organization that is self-identifying as a software company (and this is sort of manifest in the internet space; if you’re running a sort of hyper-scale internet service, you are kind of by definition a software company), you probably have the skills on hand to make great choices around which projects to adopt, follow the communities, and identify when things are reaching a point of critical mass. You’re running in a space where your system is relatively homogenous. You don’t have the sort of massive gamut of workloads that a lot of mainstream enterprise organizations have. There are going to be different approaches to the ecosystem depending on which organization is looking at the problem space. I do think this is prohibitively challenging for a lot of organizations that are not resourced at the level of a hyper-scale internet company from a technology perspective, where their day job isn’t running a production service for millions or billions of users. In situations like that, I do think it makes a tremendous amount of sense to identify and work with someone you trust in the ecosystem, that can help you navigate the wild map that is the Kubernetes landscape, that can participate in a number of these emerging communities, that has the ability to put their thumb on the scale where necessary to make sure that things converge. I think it’s situational. I think the lovely thing about Kubernetes is that it does give organizations a chance to cut their teeth without having to dig into a deep procurement cycle with a major vendor. We see a lot of self-service Kubernetes projects getting initiated. But at some point, almost inevitably, people need a little bit more help, and that’s the role of a lot of these vendors.
The thing that I truly hope for, and that I’m personally committed to, is that as we start to see the convergence of this ecosystem, as we start to see the pieces falling into place, we retain an emphasis on the value of community, and that we also avoid the balkanization and fragmentation which sometimes comes out of these types of systems. We are so much better served as a software company if we can preserve consistency from environment to environment. The reality is, as we start looking at large organizations, enterprises that are consuming Kubernetes, it’s almost inevitable that they’re going to be consuming Kubernetes from a number of different sources. Whether those sources are cloud providers delivering Kubernetes services, or Kubernetes clusters that a dedicated centralized IT team is delivering, or vendor-provided Kubernetes, there are going to be a lot of different flavors and variants of it. I think working within the community not as kingmakers, but as concerned citizens that are looking to make sure that there are very high levels of consistency from offering to offering, means that our customers are going to be better served. We’re right now in a time where this technology is burgeoning. It’s highly scrutinized, but it’s not necessarily very widely deployed. So I think it’s important to just keep an eye on that sort of community centricity. Stay as true to upstream as possible. Avoid balkanization, and I think everyone will benefit from that. [00:10:16] DC: Makes sense. I was just looking back at my year and learning, consolidating my thoughts on what had happened, and one of the big takeaways for me in my customer engagements this year was that a number of customers came out and explicitly said, “Our success as a company is not going to be measured by our ability to operate Kubernetes,” which is true and obvious. But at the same time, I think that that’s a really interesting moment of awareness for a lot of the people that I work with out there in the field, where they realized, you know what, Kubernetes may be the next best thing. It may be an incredible technology, but fundamentally, it’s not going to be the measure by which we are graded on success. It’s going to be what we do on top of that that is more interesting. So I think that your point about that ecosystem being large enough that people will be consuming Kubernetes from multiple sources is sort of amplified by that, because people are going to look for that easy button as an inroad. They’re going to look for some way to get the Kubernetes thing so that they can actually start exploring what will happen on top of it as their primary goal, rather than how to get Kubernetes from an operational perspective, or even understand the care and feeding of it, because they don’t see that as the primary measure of success. [00:11:33] CM: That is entirely true. When I think about enterprise software, there are sort of these three pieces of it. The first piece is the sort of crystalline core of enterprise software. That’s consistent from enterprise to enterprise to enterprise. It’s purchased from primary vendors or it’s built by open source communities. It represents a significant basis for everything. There’s the periphery, the sort of sea of applications that enterprises build around that core, which are entirely unique to their environment, and they’re relatively fluid.
Then there’s this weird sort of interstitial layer, which is the integration glue that exists between that crystalline core and those applications and operating practices that enterprises create. So I think from my side, we benefit if that crystalline core is as large as possible, so that enterprises don’t have to rely on bespoke integration practices as much as possible. We also need to make allowances for the idea that that interstitial layer between the sort of core of a technology like Kubernetes and the applications may be modular or extended by a variety of different vendors. If you’re operating in a space like the telco space, your problems are going to be unique to telco, but they’re going to be shared by every other telco provider. One of the beautiful things about Kubernetes is it is sufficiently modular; it is a pretty well-thought-out system. So I think we will start to see a lot of specialization in terms of those integration pieces, a lot of specialization in terms of how Kubernetes is fit to a specific area, and I think that represents an awful lot of opportunity for the community to continue to evolve. But I also think it means that we as contributors to the project need to make allowances for that. We can’t hold opinions to the point where they preclude significant value for organizations as they look at modularizing and extending the platform. [00:13:19] CC: What is your opinion on people making specialized Kubernetes operating systems? For example, we’re talking about telcos. I think there’s a Kubernetes OS specifically for telcos that strips away things that that kind of industry doesn’t need. What are the tradeoffs that you see? [00:13:39] CM: It’s almost inevitable that you’re going to start to see specialized operating system distributions that are tailored to container-based workloads. I think as we start looking at the telco space with network function virtualization, Kubernetes promises to be something that we never really saw before. At the end of the day, telcos very broadly deployed OpenStack as the primary substrate for network function virtualization. But at the end of the day, they ended up not just deploying one rendition of OpenStack, but in many cases three, four, five, depending on what functions they wanted to run, and there wasn’t sufficient commonality in terms of the implementations. It became very vendor-centric and balkanized in many ways. I think there’s an opportunity here to work hard as a community to drive convergence around a lot of those Kubernetes constructs so that, sure, the operating system is going to be different. If you’re running an NFV data plane implementation, doing a lot of bit slinging, it’s going to look fundamentally different to anything else in the industry, right? But that shouldn’t necessarily mean that you can’t use the same tools to organize, manage and reason about the workloads. A lot of the innovations that happen above that shouldn’t necessarily be tied to that. I think there’s promise there, and it’s going to be an amazing test for Kubernetes itself to see how well it scales into those environments. By and large, I’m a fan of rendered-down, container-optimized operating system distributions. There’s a lot of utility there, but I think we also need to be practical and recognize that enterprises have gotten comfortable with the OS landscape that they have.
So we have to make allowances that, as part of containerizing and distributing your application, maybe you don’t necessarily need to re-qualify the underlying OS and challenge a lot of the assumptions. So I think we just need to be pragmatic about it. [00:15:19] DC: I know that’s a dear topic to Josh and I. We’ve fought that battle in the past as well. I do think it’s another one of those things where it’s a set of assumptions. It’s fascinating to me how many different ecosystems are sort of collapsing. Maybe not ecosystems; how many different audiences are brought together by a technology like container orchestration. That you are having that conversation with, “You know what? Let’s just change the paradigm for operating systems.” That you are having that conversation with, “Let’s change the paradigm for observability and lifecycle stuff. Let’s change the paradigm for packaging. We’ll call it containers.” You know what I mean? It’s so many big changes in one idea. It’s crazy. [00:15:54] CM: It’s a little daunting if you think about it, right? I always say change is easiest across one dimension, right? If I’m going to change everything all at once across all the dimensions, life gets really hard. I think, again, it’s one of these things where Kubernetes represents a lot of value. I walk into a lot of customer accounts and I spend a lot of time with customers. I think, based on their experiences, they sort of make one of two assumptions. There’s a set of vendors that will come into an environment and say, “Hey, just run this tool against your virtual machine images and, boom, Kubernetes,” right? Then they have another set of vendors that will come in and say, “Yeah. Hey, you just need to turn this thing into 12-factor, cloud native, service mesh-linked applications driven through CI/CD, and your life is magic.” There are some cases where it makes sense, but there are some cases where it just doesn’t. Hey, what use is a 24-gigabyte container? Is that really solving the problems that you have in some systematic way? At the other end of the spectrum, there’s no world in which an enterprise organization is rewriting 3,000, 5,000 applications to be cloud native from the ground up. It just is not going to happen, right? So just understanding the return on investment associated with the migration into Kubernetes, knowing where it makes sense and where it doesn’t, is such an important part of this story. [00:17:03] JR: On that front, and this is something Duffie and I talk to our customers about all the time: say you’re sitting with someone and you’re talking about potentially using Kubernetes, or they’re thinking about it, are there some key indicators that you see, Craig, as like, “Okay, maybe Kubernetes does have that return on investment pretty soon to justify it”? Or maybe even in the reverse, some things where you think, “Okay, these people are just going to implement Kubernetes and it’s going to become shelfware.” How do you qualify as an org, “I might be ready to bring on something like Kubernetes”? [00:17:32] CM: It’s interesting. For me, it’s almost inevitably as much about the human skills as anything else. I mean, the technology itself isn’t rocket science. I think the critical success criterion, when I start looking at an engagement, is: is there a cultural understanding of what Kubernetes represents? Kubernetes is not easy to use. That initial [inaudible 00:17:52] to the face is kind of painful for people that are used to different experiences.
Making sure that the basic skills and expectations are met is really important. I think there’s definitely some sort of acid test around workload fit as you start looking at Kubernetes. It’s an evolving ecosystem and it’s maturing pretty rapidly, but there are still areas that need a little bit more heavy lifting, right? So if you think about, “Hey, I want to run a vertically-scaled OLTP database in Kubernetes today,” I don’t know, maybe not the best choice. If the customer knows that, if they have enough familiarity or they’re willing to engage, I think it makes a tremendous amount of sense. By and large, the biggest challenge I see is not so much in the Kubernetes space. It’s easy enough to get to a basic cluster. There are sort of two dimensions to this; the first is day-two operations. I see a lot of organizations that have worked to create scale-up programs of platform technologies. Before Kubernetes there was Mesos, and there’s obviously PCF, which we’re becoming increasingly involved in. Organizations that have chewed on creating and deploying a standardized platform often have the operational skills, but you also need to look at whether that previous technology really met the criteria, and whether you have the skills to operate the new one on a day-two basis. Often they’ve worked out the day-two operational issues, but they still haven’t figured out what it means to create a modern software supply chain that can deliver into the Kubernetes space. They haven’t figured out necessarily how to create the right incentive structures and experiences for the developers that are looking to build, package and deliver into that environment. That’s probably the biggest point of frustration I see with enterprises: “Okay, I got to Kubernetes. Now what?” That question just hasn’t been answered. They haven’t really thought through, “These are the CI/CD processes. This is how you engage your cyber team to qualify the platform for these classes of workloads. This is how you set up a container repo and run scans against it. This is how you assign TTLs on images, so you don’t just end up with a massive repo.” There’s so much in the application domain that just needs to exist that I think people often trivialize. It’s really about taking the time, picking a couple of projects, and being measured in the investments. Making sure you have the right kind of cultural profile of teams that are engaged. Creating that sort of celebratory moment of success. Making sure that the team is measuring and communicating the productivity improvements, etc. That really drives the adoption and engagement with the whole customer base. [00:20:11] CC: It sounds to me like you have a book in the making. [00:20:13] CM: Oh! I will never write a book. It just seems like a lot of work. Brendan and a bunch of my friends write books. Yeah, that seems like a whole lot of work. [00:20:22] DC: You had mentioned that you decided you wanted to work with Joe again, and you formed Heptio. I was actually there for a year. I think I was around for a bit longer than that, obviously. I’m curious what your thoughts about that were as an experiment. If you just think about it as that part of the journey, do you think it was a success, and what did you learn from that whole experiment that you wish everybody knew, just from a business perspective? It might have been business or it might have been running a company, any of that stuff. [00:20:45] CM: So I’m very happy with the way that Heptio went.
There were a few things that sort of stood out for me as things that folks should think about if they’re going to start a startup or they want to join a startup. The first and foremost, I would say, is design the culture for the problem at hand. Culture isn’t accidental. I think that Heptio had a pretty distinct and nice culture, and I don’t want to sound self-congratulatory. I mean, as with anything, a certain amount of this is work, but a lot of it is luck as well. Making sure that the cultural identity of the company is well-suited to the problem at hand is critical, right? When I think about what Heptio embodied, it was really tailored to the specific journey that we were setting ourselves up for. We were looking to be passionate advocates for Kubernetes. We were looking to walk the journey with our customers in an authentic way. We were looking to create a company that was built around sustainability. I think the culture is good, and I encourage folks, whether the thing you’re starting is a startup or you're looking to join one, to think hard about that culture and how it’s going to map to the problems you're trying to solve. The other thing that really motivated me to do Heptio, and I think this is something that I’m really excited to continue on with at VMware, was the opportunity to walk the journey with customers. So many startups have this massive reticence to really engage deeply in professional services. In many ways, Google is fun. I had a blast there. It’s a great company to work for. We were able to build out some really cool tech and do good things. But I grew kind of tired of writing letters from the future. It was, “Okay, we have flying cars.” And when you're interacting with the customer, they're saying, "I can’t even start my car and get to work. It’s great that you have flying cars, but right now I just need to get in my car, drive down the block, get out and get to work." So walking the journey with customers is probably the most important learning from Heptio, and it’s one of the things I’m most proud of. That opportunity to share the pain. Get involved from day one. Look at that as your most valuable apparatus to not just build your business, but also to learn what you need to build. Having a really smart set of people that are comfortable working directly with customers and invested in the success of those customers is so powerful. So if you’re in the business or in the startup game, investors may be leery of building out a significant professional services function, because that’s just how Silicon Valley works. But it is absolutely imperative in terms of your ability to engage with customers, particularly around nascent technologies filled with gaps where the product doesn’t exist. Learn from those experiences and bring that back into the core product. It’s just a huge part of what we did. If I was ever in a situation where I had to advise a startup in the sort of open source space, I’d say lean into the professional services. Lean into field engineering. It’s a critical way to build your business. Learn what customers need. Walk the journey with them and just develop a deep empathy. [00:23:31] CC: With new technology, there is always a concern about having enough professionals in the market who are knowledgeable in that new technology. There is always a gap while people catch up. So I’m curious to know how customers, or prospective customers and companies, are thinking in terms of finding professionals to help them?
Are they concerned that there aren't enough professionals in the market? Are they finding that the people who are admins and operators today are having an easy time because their skills are transferable, if they’re going to embark on the Kubernetes journey? What are they telling you? [00:24:13] CM: I mean, there’s a huge skills shortage. This is one of the primary threats to the short-term adoption of Kubernetes. I think Kubernetes will ultimately permeate enterprise organizations. I think it will become a standard for distributed systems development, effectively emerging as an operating system for distributed systems as people build more natively around Kubernetes. But right now it’s like the early days of Linux, where to deploy Linux you’d have to kind of build it from scratch. It is definitely a challenge. For enterprise organizations, it’s interesting, because there’s a war for talent. There’s just this incredible appetite for Kubernetes talent. There’s always that old joke around the job description asking for 10 years of Kubernetes experience on a five-year-old project. That certainly is something we see a lot. I’d take it from two sides. One is recognizing that as an enterprise organization, you are not going to be able to hire this talent. Just accept that sad truth. You can hire a seed crystal for it, but you really need to look at that as something that you’re going to build out as an enablement function for your own consumption. As you start assessing individuals that you’re going to bring on in that role, don’t just assess for Kubernetes talent. Assess for the ability to teach. Look for people that can come in and not just do, but teach and enable others to do it, right? Because at the end of the day, if you need like 50 Kubernauts at a certain level, so does your competitor and all of your other competitors. So does every other function out there. There’s just a massive shortage of skills. So emphasize taking on the responsibility of building your own expertise. Educate your own organization. Find ways to identify people that are motivated by this type of technology, create space for them, and recognize and reward their work as they build this out. Because it’s far more practical to hire into your existing skillset and then create space so that the people that have the appetite and capability to really absorb these types of disruptive technologies can do so within the parameters of your organization. Create the structures to support them and then make it their job to help permeate that knowledge and information into the organization. It’s just not something you can simply bring in. The skills just don’t exist in the broader world. Then for professionals that are interested in Kubernetes, this is definitely a field where I think we’ll see a lot of job security for a very long time. Taking on that effort is just well worth the journey. Then I’d say the other piece of this is that for vendors like VMware, our job can’t be just delivering skills and delivering technology. We need to think about our role as enablers in the ecosystem, as folks that are helping not just build up our own expertise in Kubernetes that we can represent to customers; we’re well-served by our customers developing their own expertise. It’s not a threat to us. It actually enables them to consume the technologies that we provide.
So focusing on that enablement through us as integration partners and [inaudible] community, focusing on enablement for our customers and education programs and the things that they need to start building out their capacity internally, is going to serve us all well. [00:27:22] JR: Something going back to maybe the Heptio conversation; I’m super interested in this. Being a very open source-oriented company, and at VMware this is of course true as well, we have to engage with large groups of humans from all different kinds of companies, and we have to do that while building and shipping product to some degree. So where I’m going with this is: I remember back in the Heptio days, there was something with dynamic audit logging that we were struggling with, and we needed it for some project we were working on. But we needed to get consensus and a design approved at a bigger community level. I do know to some degree that did limit our ability to ship quickly. So you probably know where I’m going with this. When you’re working on projects or products, how do you balance making sure the whole community is coming along with you, but also making sure that you can actually ship something? [00:28:08] CM: That harkens back to that sort of catchphrase that Tim Sinclair always uses: if you want to go fast, go alone; if you want to go far, go together. I think, as with almost everything in the world, these things are situational, right? There are situations where it is so critical that you bring the community along with you, that you don’t find yourself carrying the load for something by yourself, that you just have to accept and absorb that it’s going to be like pushing string. Working with an engaged community necessitates consensus; it necessitates buy-in not just from you, but from potentially your competitors, the people that you’re working with, and recognizing that they’ll be doing their own sort of mental calculus around whether this advantages them or not and whatnot. But hopefully, and I think certainly in the Kubernetes community, there is general recognition that making the underlying technology accessible, making it ubiquitous, making it intrinsically supportable profits everyone. I think there are a couple of things that I look at. Make the decision pretty early on as to whether this is something you want to kind of spark off and stride off on your own and innovate around, or whether it’s something where it’s critical to bring the community along with you. I’ll give you two examples of this, right? One example was the work we did around technologies like Velero, which is a backup and restore product. There was an urgent and critical need to provide a sustainable way to back up and recover Kubernetes, so we didn’t have the time to do this through the Kubernetes community process. But it also didn’t necessarily matter, because everything we were doing was building an addendum to Kubernetes. That project created a lot of value and we donated it as an open source project. Anyone can use it. But we took on the commitment to drive the development ourselves, not just because we needed to, but because we had to push very quickly in that space. Whereas if you look at the work that we’re doing around things like Cluster API and the sort of broader provisioning of Kubernetes, it’s so important that the ecosystem avoids the tragedy of the commons around things like lifecycle management. It’s so important that we as a community converge on a consistent way to reason about the deployment, upgrade and scaling of Kubernetes clusters.
For any single vendor to try to do that by themselves, they’re going to take on the responsibility of dealing with not just one or two environments. If you’re a hyperscale cloud provider [inaudible 00:30:27] many can do that. But think about doing that, in our case, for vSphere: not just what’s coming next, but also earlier versions of vSphere. We need to be able to deploy into all of the hyper-scalers. We need to deploy into some of the emerging cloud providers. We need to start reasoning about edge. We need to start thinking about all of these. We’re a big company and we have a lot of engineers, but you’re going to get stretched very thin, very quickly if you try to chew that off by yourself. So I think a lot of it is situational. I think there are situations where it does pay for organizations to innovate, charge off in a new direction, run an experiment and see if it sticks. Over time, open that up to the community as it makes sense. The thing that I think is most important is that you just wear your heart on your sleeve. The worst thing you can do is to present a charter of, “Hey, we’re doing this as a community-centric, open project with open design, open community, open source,” and then change your mind later, because that just creates drama. I think it’s situational. Pick the path that makes sense for the problem at hand. Figure out how long your customer can wait for something. Sometimes you can bring things back to communities that are very open and accepting. You can look at it as an experiment, and if it makes sense in that experimental form factor, present it back to the Kubernetes community and see if you can get it back in. But in some cases it just makes sense to work within the structure and constraints of the community, and just accept that great things from a community angle take a lot of time. [00:31:51] CC: I think, too, one additional thing that I don’t think was mentioned is that if a project grows too big, you can always break it off. I mean, Kubernetes is such a great example of that. Break it off into separate components. Break it off into separate governance groups, and then parts can move at different speeds. [00:32:09] CM: Yeah, and there are all kinds of options. So the heart of it is there's no one rule, right? It’s entirely situational: what are you trying to accomplish, and on what horizon? And acknowledge and accept that the evolution of the core of Kubernetes is slowing, as it should. That’s a signal that the project is maturing. If you cannot deliver value on a timeline that your business or your customers can absorb, then maybe it makes sense to do something on the outside. Just wear your heart on your sleeve and make sure your customers and your partners know what you’re doing. [00:32:36] DC: One of your earlier points, I think in Josh's question, was around how companies attract talent. I think there is some relation to this particular topic because, frequently, I’ve seen companies find some success by making room for open source or upstream engineers to focus on the Kubernetes piece and to help drive that adoption internally.
So if you’re going to adopt something like a Kubernetes strategy as part of a larger company goal, if you can actually make room within your organization to bring in, or to support, people who want to focus on that upstream, I think that you get a lot of ancillary benefits from that, including that it makes it easier to adopt that technology and understand it, and you actually have some more skin in the game around where the open source project itself is going. [00:33:25] CM: Yeah, absolutely. I think one of the lovely things about the Kubernetes community is this idea that your position is earned, not granted, right? The way that you earn influence and leadership, and basically the goodwill of everyone else in that community, is by chopping wood and carrying water, doing the things that are good for the community. Over time, any organization, any human being can become influential and lead based on the merits of their contributions. It’s important that vendors think about that. But at the same time, I have a hard time taking exception with practically any use of open source. At the end of the day, open source by its nature is a leap of faith. You’re making that technology accessible. If someone else can take it, operationalize it well and deliver value for organizations, that’s part of your contract. That’s what you absorb as a vendor when you start the thing. So people shouldn’t feel like they have to. But if you want to influence and lead, you do need to participate in these communities in an open way. [00:34:22] DC: When you were helping form the CNCF and some of those projects, did you foresee it being like a driving goal for people, not just vendors, but also consumers of the technologies associated with those foundations? [00:34:34] CM: Yeah, it was interesting. Starting the CNCF, I can speak from the position of where I was inside Google. I was highly motivated by the success of Kubernetes. Not just personally motivated because it was a project that I was working on. I was motivated to see it emerge as a standard for distributed systems development that abstracts away the infrastructure provider. I’m not ashamed of it. It was entirely self-serving, if you looked at Google’s market position at that time, if you looked at where we were as a hyper-scale cloud provider. Instituting something that enabled the intrinsic mobility of workloads could shuffle around the cards on the deck, so to speak [inaudible 00:35:09]. I also felt very privileged that that was our position, because we didn’t necessarily have to create artificial structures or constraints around the controls of the system, because of that process of getting something to become ubiquitous; there’s a natural path if you approach it as a single provider. I’m not saying we couldn’t have succeeded with Kubernetes as a single provider. But if Red Hat and IBM and Microsoft and Amazon had all piled on to something else, it’s less obvious, right? It’s less obvious that Kubernetes would have gone as far as it did. So when I was setting up the CNCF, I was highly motivated by preserving neutrality, creating structures that separated the various forms of governance. I always joke that at the time of creating the CNCF, I was motivated by the way the U.S. Constitution is structured, where you have these different checks and balances. So I wanted to have something that would separate vendor interests from the things that maintain taste on the discrete projects.
The sort of architectural integrity, and maintaining separation from customer segments, so that you create a sort of natural, self-balancing system. It was definitely in my thinking, and I think it worked out pretty well. Certainly not perfect, but it did lead down a path which I think has supported the success of the project a fair bit. [00:36:26] DC: So we've talked a lot about Kubernetes. I’m curious, do you have some thoughts, Carlisia? [00:36:31] CC: Actually, I know you have a question about microliths. I was very interested in exploring that. [00:36:37] CM: There’s an interesting pattern that I see out there in the industry, and this manifests in a lot of different ways, right? When you think about the process of bringing applications and workloads into Kubernetes, there’s this sort of predispositional bias towards, “Hey, I’ve got this monolithic application. It’s vertically scaled. I’m having a hard time with the sort of team structure. So I’m going to start breaking it up into a set of microservices that I can then manage discretely and ideally evolve on a separate cadence.” This is an example of a real customer situation where someone said, “Hey, I’ve just broken this monolith down into 27 microservices.” So I was sort of asking a couple of questions. The first one was: when you want to update one of those, how many do you have to touch? The answer was 27. I was like, “Ha! You just created a microlith.” It’s like a monolith, except it’s just harder to live with. You’ve taken a packaging problem and turned it into a massively complicated orchestration problem. I always use that jokingly, but there’s something real there, which is that there are a lot of secondary things you need to think through as you start progressing on this cloud native journey. In the case of microservice development, it’s one thing to have API-separated microservices. That’s easy enough to institute. But instituting the organizational controls around an API versioning strategy, such that you can start to establish stable APIs with consistent schemas and manage the dependencies of consuming teams, requires a level of sophistication that a lot of organizations haven’t necessarily thought through. So it’s very easy to just get caught up in the hype without necessarily thinking through what happens downstream. It’s funny. I see the same thing in functions, right? I interact with organizations and they’re like, “Wow! We took this thing that was running in a container and we turned it into 15 different functions.” I’m like, “Ha! Okay.” You start asking questions like, “Well, do you have any challenges with state coherency?” They’re like, “Yeah! It’s funny you say that. Because these things are a little bit less transactionally coherent, we have to write state watches. So we try and sort of watermark state and watch this thing.” I’m like, “You’re building a distributed transaction coordinator in your free time. Is this really the best use of your resources?” Right? So it really gets back to that idea that there’s a different tool for a different job. Sometimes the tool is a virtual machine. Sometimes it’s not. Sometimes the tool is a bare metal deployment. If you’re building a quantitative trading application that’s microsecond latency-sensitive, you probably don’t want a hypervisor there. Sometimes a VM is the natural destination and there’s no reason to move from a VM. Sometimes it’s a container.
Sometimes you want to start looking at that container and modularizing it so you can run a set of things next to each other in the same process space. Sometimes you’re going to want to put APIs between those things and separate them out into separate containers. There’s an ROI. There’s a cost and there’s a benefit associated with each of those transitions. More importantly, there is a set of skills that you have to have as you start looking at that continuum and making sure that you’re making good choices and being wise about it. [00:39:36] CC: That is a very good observation. Design is such an important part of software development. I wonder if Kubernetes helps mask these design problems, for example, the ones you are mentioning, or does Kubernetes sort of surface them even more? [00:39:53] CM: It’s an interesting philosophical question. Kubernetes certainly masks some problems. I ran into an early customer, this is like years ago, who confided in me, "I think we’re writing worse code now." I was like, "What do you mean?" He was like, "Well, it used to be that when we ran out of memory on something, we got paged. Now it just restarts the container and everything continues." There’s no real incentive for the engineers to actually go back and deal with the underlying issues and root-cause them, because the system is just more intrinsically robust and self-healing by nature. I think there are definitely some problems that Kubernetes will compound. If you’re very sloppy with your dependencies, if you create a really large, vertically scaled monolith that’s running in a VM today, putting it in a container is probably strictly going to make your life worse. Just be respectful of that. But at the same time, I do think that the discipline associated with the transition to Kubernetes, if you walk it a little bit further along, if you start thinking about the fact that you’re not running a lot of imperative processes during a production push, where deploying a container is effectively a bin copy with some minimal post-deployment configuration changes, sort of leads you onto a much happier path naturally. I think it can mask some issues, but by and large, the types of systems you end up building are going to be more intrinsically operationally stable and scalable. But it is also worth recognizing that you are going to encounter corner cases. I’ve run into a lot of customers that will push the envelope in a direction that was unanticipated by the community, or they accidentally find themselves on new ground that’s just unstable, because the technology is relatively nascent. So just recognize that if you’re going to walk down a new path (I’m not saying don’t), you’re probably going to encounter some stuff that’s going to take some work to get through. [00:41:41] DC: We had an earlier episode about API contracts, which I think highlights some of this stuff as well, because it gets into some of those sharp edges of why some of those things are super important when you start thinking about microservices and stuff. We’re coming to the end of our time, but one of the last questions I want to ask you, since we’ve talked a lot about Kubernetes in this episode: I’m curious what the future holds. We see a lot of really interesting things happening in the ecosystem around moving more towards serverless.
There are a lot of people thinking that perhaps a better line would be to move away from the infrastructure offering and just basically allow cloud providers and the like to manage your nodes for you. We have a few shots on goal for that ourselves. It’s been a really interesting evolution over the last year in that space. I’m curious, what sort of lifetime would you ascribe to it today? Do you think this is going to be the thing in 10 years? Do you think it will be a thing in 5 years? What do you see coming that might change it? [00:42:32] CM: It’s interesting. Well, first of all, I think 2018 was the largest year ever for mainframe sales. So with these technologies, once they’re in the enterprise, they tend to be pretty durable. The duty cycle of enterprise software technology is pretty long-lived. The real question is: we’ve seen a lot of technologies in this space emerge, ascend, reach a point of critical mass and then fade as they’re disrupted by other technologies. Is Kubernetes going to be a Linux, or is Kubernetes going to be a Mesos, right? I mean, I don’t claim to know the answer. My belief, and I think this is probably true, is that it’s more like a Linux. When you think about the heart of what Kubernetes is doing, it’s just providing a better way to build and organize distributed systems. I’m sure that the code will evolve rapidly and I’m sure there will be a lot of continued innovation and enhancement. But when you start thinking about the fact that what Kubernetes has really done is bring controller- and reconciler-based management to distributed systems development everywhere, and when you think about the fact that pretty much every system these days is distributed by nature, it really needs something that supports that model. So I think we will see Kubernetes stick. We’ll see it become richer. We’ll start to see it becoming more applicable for a lot of things that we’re just running in VMs today. Those things may well continue to run in VMs and just be managed by Kubernetes. I don’t have an opinion about how to reason about the underlying OS and virtualization structure. The thing I do have an opinion about is that it makes a ton of sense to be able to use a declarative framework, to use a set of well-structured controllers and reconcilers to drive your world into a desired state. That pattern has been quite successful. It can be quite durable. I think we’ll start to see organizations embrace a lot of these technologies over time. It is possible that something brighter, shinier, newer comes along. Anyone will tell you that we made enough mistakes during the journey, and there is stuff that I think everyone involved with Kubernetes regrets to some degree. I do think it’s likely to be pretty durable. I don’t think it’s a silver bullet. Nothing is, right? It’s like any of these technologies: there’s always a cost and there’s a benefit associated with it. The benefits are relatively well-understood. But there are going to be different tools to do different jobs. There are going to be new patterns that emerge that simplify things. Is Kubernetes the best framework for running functions? I don’t know. Maybe. Kind of like what the [inaudible] people are doing. But are there more intrinsically optimal ways to do this? Maybe. I don’t know. [00:45:02] JR: It has been interesting watching Kubernetes itself evolve as that moving target. Some of the other technologies I’ve seen kind of stagnate on their one solution and don’t grow further.
But that’s definitely not what I see within this community. It’s always coming up with something new. Anyway, thank you very much for your time. That was an incredible session. [00:45:22] CM: Yeah. Thank you. It’s always fun to chat. [00:45:24] CC: Yeah. We’ll definitely have you back, Craig. Yes, we are coming up at the end, but I do want to ask if you have any thoughts that you haven’t brought up, or we haven’t brought up, that you’d like to share with the audience of this podcast. [00:45:39] CM: I guess the one thing that was going through my head earlier that I didn’t say is that, as you look at these technologies, there are sort of these two duty cycles. There’s the hype duty cycle, where a technology ascends in awareness and everyone looks at it as an answer to all the everythings. Then there’s the readiness duty cycle, which is sometimes offset. I do think we’re certainly at peak hype right now with Kubernetes, as you could see if you attended KubeCon. I do think there’s perhaps a gap between the promise and the reality for a lot of organizations. I would always just counsel caution: be judicious about how you approach this. It’s a very powerful technology and I see a very bright future for it. Thanks for your time. [00:46:17] CC: Really, thank you so much. It’s so refreshing to hear from you. You have great thoughts. With that, thank you very much. We will see you next week. [00:46:28] JR: Thanks, everybody. See you. [00:46:29] DC: Cheers, folks. [END OF INTERVIEW] [00:46:31] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END]
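Craig's closing point about declarative frameworks and controller- and reconciler-based management is the core mechanic behind Kubernetes controllers. The sketch below is a minimal, hypothetical illustration of that pattern in Go, not the Kubernetes implementation itself: the app names and the in-memory maps standing in for a cluster API are invented for the example. The shape is what matters: read desired state, observe actual state, and repeatedly act to close the gap.

package main

import (
	"fmt"
	"time"
)

// desiredState is the declarative spec: how many replicas we want per app.
var desiredState = map[string]int{"web": 3, "worker": 2}

// actualState stands in for what a real controller would observe via the cluster API.
var actualState = map[string]int{"web": 1}

// reconcile drives actualState toward desiredState, one app at a time.
func reconcile() {
	for app, want := range desiredState {
		have := actualState[app]
		switch {
		case have < want:
			fmt.Printf("%s: scaling up %d -> %d\n", app, have, want)
			actualState[app] = want // a real controller would create pods here
		case have > want:
			fmt.Printf("%s: scaling down %d -> %d\n", app, have, want)
			actualState[app] = want // a real controller would delete pods here
		default:
			// already converged; nothing to do
		}
	}
}

func main() {
	// Controllers run the loop continuously; convergence is ongoing, not one-shot.
	for i := 0; i < 3; i++ {
		reconcile()
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("converged:", actualState)
}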


20 Jan 2020

Rank #2

Podcast cover

The Network (Ep 15)

There are two words that get the blame more often than not when a problem cannot be root-caused: the network! Today, along with special guest, Scott Lowe, we try to dig into what the network actually means. We discover, through our discussion, that the network is, in fact, a distributed system. This means that each component of the network has a degree of independence, and the complexity of those components makes it difficult to understand the true state of the network. We also look at some of the fascinating parallels between networks and other systems, such as the configuration patterns for distributed systems. A large portion of the show deals with infrastructure and networks, but we also look at how developers understand networks. In a changing space, despite self-service becoming more common, there is still generally a poor understanding of networks from the developers’ vantage point. We also cover other network-related topics, such as the future of the network engineer’s role, transferability of their skills and other similarities between network problem-solving and development problem-solving. Tune in today! Follow us: https://twitter.com/thepodlets Website: https://thepodlets.io Feedback: info@thepodlets.io https://github.com/vmware-tanzu/thepodlets/issues Hosts: Duffie Cooley Nicholas Lane Josh Rosso Key Points From This Episode: • The network is often confused with the server or other elements when there is a problem.• People forget that the network is a distributed system, which has independent routers.• The distributed pieces that make up a network could be standalone computers.• The parallels between routing protocols and configuration patterns for distributed systems.• There is not a model for eventually consistent networks, particularly if they are old.• Most routing patterns have a time-sensitive mechanism where traffic can be re-dispersed.• Understanding that a network is a distributed system gives insights into other ones, like Kubernetes.• Even from a developers’ perspective, there is a limited understanding of the network.• There are many overlaps between developer and infrastructural thinking about systems.• How can network engineers apply their skills across different systems?• As the future changes, understanding the systems and theories is crucial for network engineers.• There is a chasm between networking and development.• The same ‘primitive’ tools are still being used for software application layers.• An explanation of CSMA/CD, collisions and their applicability.• Examples of cloud native applications where the network does not work at all.• How Spanning Tree works and the problems that it solves.• The relationship between software-defined networking and the adoption of cloud native technologies.• Software-defined networking increases the ability to self-service.• With self-service on-prem solutions, there is still not a great deal of self-service.
Quotes: “In reality, what we have are tens or hundreds of devices with the state of the network as a system, distributed in little bitty pieces across all of these devices.” — @scott_lowe [0:03:11] “If you understand how a network is a distributed system and how these theories apply to a network, then you can extrapolate those concepts and apply them to something like Kubernetes or other distributed systems.” — @scott_lowe [0:14:05] “A lot of these software defined networking concepts are still seeing use in the modern clouds these days.” — @scott_lowe [0:44:38] “The problems that we are trying to solve in networking are not different than the problems that you are trying to solve in applications.” — @mauilion [0:51:55] Links Mentioned in Today’s Episode: Scott Lowe on LinkedIn — https://www.linkedin.com/in/scottslowe/ Scott Lowe’s blog — https://blog.scottlowe.org/ Kafka — https://kafka.apache.org/ Redis — https://redis.io/ Raft — https://raft.github.io/ Packet Pushers — https://packetpushers.net/ AWS — https://aws.amazon.com/ Azure — https://azure.microsoft.com/en-us/ Martin Casado — http://yuba.stanford.edu/~casado/ Transcript: EPISODE 15 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you. [EPISODE] [0:00:41.4] DC: Good afternoon everybody. In this episode, we’re going to talk about the network. My name is Duffie Cooley and I’ll be the lead of this episode, and with me I have Nick. [0:00:49.0] NL: Hey, what’s up everyone. [0:00:51.5] DC: And Josh. [0:00:52.5] JS: Hi. [0:00:53.6] DC: And Mr. Scott Lowe joining us as a guest speaker. [0:00:56.2] SL: Hey everyone. [0:00:57.6] DC: Welcome, Scott. [0:00:58.6] SL: Thank you. [0:01:00.5] DC: In this discussion, we’re going to try and stay away, like we always do, from particular products or solutions that are related to the problem. The goal is to really dig into what the network means when we refer to it as it relates to cloud native applications, or just application design in general. One of the things that I’ve noticed over time, and I’m curious what you all think, is that people are kind of of the mind that if they can’t root-cause a particular issue they run into, they’re like, “That was the network.” Have you all seen that kind of stuff out there? [0:01:31.4] NL: Yes, absolutely. In my previous life, before being a Kubernetes architect, I actually used my network engineering degree to be a network administrator for the Boeing Company. Time and time again, someone would come to me and say, “This isn’t working. The network is down.” And I’m like, “Is the network down, or is the server down?” Because those are different things. Turns out it was usually the server. [0:01:58.5] SL: My kids used to come to me and say, “The Internet is down,” and I would say, “Well, you know, I don’t think the entire Internet is down. I think it’s just our connection to the Internet.” [0:02:10.1] DC: Exactly. [0:02:11.7] JS: Dad, the entire global economy is just taking a total hit.
[0:02:15.8] SL: Exactly, right. [0:02:17.2] DC: I frequently tell people that the first distributed system I ever had a real understanding of was the network, you know? It’s interesting because it kind of relies on the premises that I think a good distributed system should, in that there is some autonomy to each of the systems, right? They are dependent on each other and intercommunicate with each other, but fundamentally, when you look at routers and things like that, they are autonomous in their own way. There’s work that they do exclusive to the work that others do and exclusive to their dependencies, which I think is very interesting. [0:02:50.6] SL: I think the fact that the network is a distributed system, and I’m glad you said that, Duffie, I think the fact that the network is a distributed system is what most people overlook when they start sort of blaming the network, right? Let’s face it, in the diagrams, right, the network’s always just this blob, right? Here’s the network, right? It’s this thing, this one singular thing. When in reality, what we have are tens or hundreds of devices with the state of the network as a system distributed in little bitty pieces across all of these devices. And in no way, aside from logging in to each one of these devices, are we able to assemble what the overall state is, right? Even routing protocols, I mean, their entire purpose is to assemble some sort of common understanding of what the state of the network is. Melding together not just IP addresses, which are this abstract concept, but physical addresses and physical connections, and trying to reason and make decisions about them, how we send traffic across. It’s far more complex than a lot of people understand, and I think that’s why it’s just, "the network is down," right? When in reality, it’s probably something else entirely. [0:03:58.1] DC: Yeah, absolutely. Another good point to bring up is that each of these distributed pieces of this distributed system is in itself basically just a computer. A lot of times, I’ve talked to people and they were like, “Well, the router is something special.” And I’m like, “Not really. Technically, a Linux box could just be a router if you have enough ports that you plug into it. Or it could be a switch if you needed it to be, just plug in ports.” [0:04:24.4] NL: Another interesting parallel there is when we talk about routing protocols, which are a way of allowing configuration changes to particular components within that distributed system to be known about by other components within that distributed system. I think there’s an interesting parallel here between the way that works and the way that the configuration patterns we have for distributed systems work, right? If you wanted to make a configuration-only change to a set of applications that make up some distributed system, you might go about leveraging Ansible or one of the many other configuration models for this. I think it’s interesting because it represents sort of an evolution of that same idea, in that you’re making it so that each of the components is responsible for informing the other components of the change, rather than taking the outside approach of "my job is to actually push a change that should be known about by all of these components down to them." Really, it’s an interesting parallel. What do you all think of that? [0:05:22.2] SL: I don’t know, I’m not sure. I’d have to process that for a bit.
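To make the parallel being drawn here concrete, the following is a small, hypothetical Go sketch of the "components inform each other" style of configuration propagation, as opposed to a central tool pushing a change to every node. The node structure and version numbers are invented for the example; real systems use gossip protocols or consensus stores rather than this toy loop.

package main

import (
	"fmt"
	"math/rand"
)

// node tracks the newest configuration version it has heard about.
type node struct {
	id      int
	version int
}

func main() {
	// Ten nodes all start at config version 1; only node 0 learns about version 2.
	nodes := make([]*node, 10)
	for i := range nodes {
		nodes[i] = &node{id: i, version: 1}
	}
	nodes[0].version = 2

	// Each round, every node tells one random peer what it knows.
	// There is no central pusher: the change spreads peer to peer until all agree.
	for round := 1; ; round++ {
		for _, n := range nodes {
			peer := nodes[rand.Intn(len(nodes))]
			if n.version > peer.version {
				peer.version = n.version
			}
		}
		converged := true
		for _, n := range nodes {
			if n.version != 2 {
				converged = false
				break
			}
		}
		if converged {
			fmt.Printf("all nodes converged on version 2 after %d rounds\n", round)
			return
		}
	}
}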
But I mean, are you saying like the interesting thought here is that in contrast to typical systems management where we push configuration out to something, using a tool like an Ansible, whatever, these things are talking amongst themselves to determine state? [0:05:41.4] DC: Yeah, it’s like, there are patterns for this like inside of distributed systems today, things like Kafka and you know, Kafka and Gossip protocol, stuff like this actually allows all of the components of a particular distributed system to understand the common state or things that would be shared across them and if you think about them, they’re not all that different from a routing protocol, right? Like the goal being that you give the systems the ability to inform the other systems in some distributed system of the changes that they may have to react to. Another good example of this one, which I think is interesting is like, what they call – when you have a feature behind a flag, right? You might have some distributed configuration model, like a Redis cache or database somewhere that you’ve actually – that you’ve held the running configuration of this distributed system. And when you want to turn on this particular feature flag, you want all of the components that are associated with that feature flag to enable that new capability. Some of the patterns for that are pretty darn close to the way that routing protocol models work. [0:06:44.6] SL: Yeah, I see what you're saying. Actually, that’ makes a lot of sense. I mean, if we think about things like Gossip protocols or even consensus protocols like Raft, right? They are similar to routing protocols in that they are responsible for distributing state and then coming to an agreement on what that state is across the entire system. And we even apply terms like convergence to both environments like we talk about how long it takes routing protocol to converge. And we might also talk about how long it takes for and ETCD cluster to converge after changing the number of members in the cluster of that nature. The point at which everybody in that distributed system, whether it be the network ETCD or some other system comes to the same understanding of what that shared state is. [0:07:33.1] DC: Yeah, I think that’s a perfect breakdown, honestly. Pretty much every routing technology that’s out there. You know, if you’re taking that – the computer of the network, you know, it takes a while but eventually, everyone will reconcile the fact that, “Yeah, that node is gone now.” [0:07:47.5] NL: I think one thing that’s interesting and I don’t know how much of a parallel there is in this one but like as we consider these systems like with modern systems that we’re building at scale, frequently we can make use of things like eventual consistency in which it’s not required per se for a transaction to be persisted across all of the components that it would affect immediately. Just that they eventually converge, right? Whereas with the network, not so much, right? The network needs to be right now and every time and there’s not really a model for eventually consistent networks, right? [0:08:19.9] SL: I don’t know. I would contend that there is a model for eventually consistent networks, right? Certainly not on you know, most organizations, relatively simple, local area networks, right? But even if we were to take it and look at something like a Clos fabric, right, where we have top of rack switches and this is getting too deep for none networking blokes that we know, right? 
Where you take top of rack switches that are talking layer two to the servers below them or the endpoints below them. And they’re talking layer three across a multi-link piece up to the top, right? To the spine switches, so you have leaf switches talking up to spine switches, they’re going to have multiple uplinks. If one of those uplinks goes down, it doesn’t really matter if the rest of that fabric knows that that link is down because we have equal-cost multipathing going across that one, right? In a situation like that, that fabric is eventually consistent in that it’s okay if, you know, link number one from leaf A up to spine A is down and the rest of the system doesn’t know about that yet. But, on the other hand, if you are looking at network designs where convergence is being handled on active standby links or something of that nature or there aren’t enough paths to get from point A to point B until convergence happens then yes, you’re right. I think it kind of comes down to network design and the underlying architecture and there are so many factors that affect that and so many designs over the years that it’s hard to – I would agree that from the perspective of, like, if you have an older network and it’s been around for some period of time, right? You probably have one that is not going to be tolerant of a link being down, like it will cause problems. [0:09:58.4] NL: That’s another really great parallel in software development, I think. Another great example of that, right? If we consider for a minute like the circuit breaking pattern or even like you know, most load balancer patterns, right? In which you have some way of understanding a list of healthy endpoints behind the load balancer and are able to react when certain endpoints are no longer available. I don’t consider that a pattern that I would relate specifically to eventual consistency. I feel like that still has to be immediate, right? We have to be able to not send the new transaction to the dead thing. That has to stop immediately, right? It does in most routing patterns that are described by multipath, there is a very time sensitive mechanism that allows for the re-dispersal of that traffic across known paths that are still good. And the work, the amazing amount of work that protocol architects and network engineers go through to understand just exactly how the behavior of those systems will work, such that we don’t see traffic black hole in the network for a period of time, right? Making sure we don’t send traffic to the trash while things converge really has a lot going for it. [0:11:07.0] SL: Yeah, I would agree. I think the interesting thing about discussing eventual consistency with regards to the networking is that even if we take a relatively simple model like the DOD model where we only have four layers to contend with, right? We don’t have to go all the way to the seven-layer OSI model. But even if we take a simple layer like the DOD four-layer model, we could be talking about the rapid response of a device connected at layer two but the less than rapid response of something operating at layer three or layer four, right? In the case of a network where we have these discrete layers that are intentionally loosely coupled which is another topic, we could talk about from a distribution perspective, right?
We have these layers that are intentionally loosely coupled, we might even see consistency, and the application of the cap theorem, behave differently at different layers of their model. [0:12:04.4] DC: That’s right. I think it’s fascinating like how much parallel there is here. As you get into like you know, deep architectures around software, you’re thinking of these things as it relates to like these distributed systems, especially as you’re moving toward more cloud native systems in which you start employing things like control theory and thinking about the behaviours of those systems both in aggregate, like you know, some component of my application, can I scale this particular component horizontally or can I not, how am I handling state. So many of those things have parallels to the network that I feel like it kind of highlights, I’m sure what everybody has heard a million times, you know, that there’s nothing new under the sun. There’s a million things that we could learn from things that we’ve done in the past. [0:12:47.0] NL: Yeah, totally agree. I recently have been getting more and more development practice and something that I do sometimes is like draw out how all of my functions and my methods interact with each other across a consistent code base and lo and behold, when I draw everything out, it sure does look a lot like a network diagram. All these things have to flow together in a very specific way and you expect the kind of returns that you’re looking for. It looks exactly the same, it’s kind of the – you know, how an atom kind of looks like a galaxy from a diagram? All these things are extrapolated across like – [0:13:23.4] SL: Yeah, totally. [0:13:24.3] NL: Different models. Or an atom looks like a solar system which looks like a galaxy. [0:13:28.8] SL: Nicholas, you said you were a network administrator at Boeing? [0:13:30.9] NL: I was, I was a network engineer at Boeing. [0:13:34.0] SL: You know, as you were sitting there talking, Duffie, I thought back to you, Nick. I think of all the times – I have a personal passion for helping people continue to grow and evolve in their career and not being stuck. I talk to a lot of networking folks, probably dating back to my involvement in the NSX team, right? But folks being like, “I’m just a network engineer, there’s so much for me to learn if I have to go learn Kubernetes, I wouldn’t even know where to start.” This discussion to me underscores the fact that if you understand how a network is a distributed system and how these theories apply to a network, then you can extrapolate those concepts and apply them to something like Kubernetes or other distributed systems, right? Immediately begin to understand, okay. Well, you know, this is how these pieces talk to each other, this is how they come to consensus, this is where the state is stored, this is how they understand and exchange data, I got this. [0:14:33.9] NL: If you want to go down that path, the control plane of your cluster is just like your central routing backbone and then the kubelets themselves are just your edge switches going to each of your individual smaller networks and then the pods themselves are the nodes inside of the network, right? You can easily – look at that, holy crap, it looks exactly the same. [0:14:54.5] SL: Yeah, that’s a good point. [0:14:55.1] DC: I mean, another interesting part, when you think about how we characterize systems, like where we learn that, where that skillset comes from. You raise a very good point.
I think it’s an easier – maybe slightly easier thing to learn inside of networking, how to characterize that particular distributed system because of the way the components themselves are laid out and in such a common way. Where when we start looking at different applications, we find a myriad of different patterns with particular components that may behave slightly differently depending, right? Like there are different patterns within software like almost on per application bases whereas like with networks, they’re pretty consistently applied, right? Every once in a while, they’ll be kind of like a new pattern that emerges, that it just changes the behavior a little bit, right? Or changes the behavior like a lot but at the same time, consistently across all of those things that we call data center networks or what have you. To learn to troubleshoot though, I think the key part of this is to be able to spend the time and the effort to actually understand that system and you know, whether you light that fire with networking or whether you light that fire with like just understanding how to operationalize applications or even just developing and architecting them, all of those things come into play I think. [0:16:08.2] NL: I agree. I’m actually kind of curious, the three of us have been talking quite a bit about networking from the perspective that we have which is more infrastructure focused. But Josh, you have more of a developer focused background, what’s your interaction and understanding of the network and how it plays? [0:16:24.1] JS: Yeah, I’ve always been a consumer of the network. It’s something that is sat behind an API and some library, right? I call out to something that makes a TCP connection or an http interaction and then things just happen. I think what’s really interesting hearing talk and especially the point about network engineers getting into thee distributed system space is that I really think that as we started to put infrastructure behind API’s and made it more and more accessible to people like myself, app developers and programmers, we started – by we, you know, I’m obviously generalizing here. But we started owning more and more of the infrastructure. When I go into teams that are doing big Kubernetes deployments, it’s pretty rare, that’s the conventional infrastructure and networking teams that are standing up distributed systems, Kubernetes or not, right? It's a lot of times, a bunch of app developers who have maybe what we call dev-ops, whatever that means but they have an application development background, they understand how they interact with API’s, how to write code that respects or interacts with their infrastructure and they’re standing up these systems and I think one of the gaps of that really creates is a lot of people including myself just hearing you all talk, we don’t understand networking at that level.  When stuff falls over and it’s either truly the network or it’s getting blamed on the network, it’s often times, just because we truly don’t understand a lot of these things, right? Encapsulation, meshes, whatever it might be, we just don’t understand these concepts at a deep level and I think if we had a lot more people with network engineering backgrounds, shifting into the distributed system space. It would alleviate a bit of that, right? Bringing more understanding into the space that we work in nowadays. 
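As a minimal sketch of the consumer's view Josh describes here, where the whole network sits behind one API call, the snippet below uses only the Python standard library; the host name is a placeholder, and everything below the HTTP call (DNS, the TCP handshake, routing, retransmission) happens out of sight of the application.

    # Illustrative only: the app-developer view of the network as an API.
    import http.client

    conn = http.client.HTTPSConnection("example.com", timeout=5)  # TCP + TLS handshakes happen here
    conn.request("GET", "/")                                      # bytes go onto the wire
    resp = conn.getresponse()
    print(resp.status, len(resp.read()))                          # the app only ever sees this surface
    conn.close()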
[0:18:05.4] DC: I wonder if maybe it also would be a benefit to have like more cross discussions like this one between developers and infrastructure kind of focused people, because we’re starting to see, like as we’re crossing boundaries, we see that the same things that we’re doing on the infrastructure side, you’re also doing on the developer side. Like the cap theorem as Scott mentioned, which is the idea that you can have two out of three of consistency, availability and partition tolerance. That also applies to networking in a lot of ways. You can only have a network that is either like consistent or available but it can’t handle partitioning. It can be consistent and handle partitioning but it’s not always going to be available, that sort of thing. These things that apply from the software perspective also apply to us but we think about them as being so completely different. [0:18:52.5] JS: Yeah, I totally agree. I really think like on the app side, a couple of years ago, you know, I really just didn’t care about anything outside of the JVM, like my stuff on the JVM, and if it got out to the network layer of the host, I just didn’t care, didn’t need to know about that at all. But ever since cloud computing and distributed systems and everything became more prevalent, the overlap has become extremely obvious, right? In all these different concepts and it’s been really interesting to try to ramp up on that. [0:19:19.3] NL: Yeah, I think you know Scott and I both do this. I think, as I imagine, actually, this is true of all four of us to be honest. But I think that it’s really interesting when you are out there talking to people who do feel like they’re stuck in some particular role, like they’re specialists in some particular area, and we end up having the same discussion with them over and over again. You know, like, “Look, that may pay the bills right now but it’s not going to pay the bills in the future.” And so you know, the question becomes, how can you, as a network engineer, take your skills forward and not feel as though you’re just going to have to like learn everything all over again. I think that one of the things that network engineers are pretty decent at is characterizing those systems and being able to troubleshoot them and being able to do it right now and being able to, like, firefight. Those capabilities and those skills are incredibly valuable in software development and in operationalizing applications and in SRE models. I mean, all of those skills transfer, you know? If you’re out there and you’re listening and you feel like I will always be a network engineer, consider that you could actually take those skills forward into some other role if you chose to. [0:20:25.1] JS: Yeah, totally agree. I mean, look at me, the lofty career that I’ve come to.
[0:20:31.4] SL: You know, I would also say that the fascinating thing to me and one of the reasons I launched – I don’t say this to like try and plug it but just as a way of talking about it – the reason I launched my own podcast, which is now part of Packet Pushers, was exploring this very space. And that is, like, we’ve got folks like Josh who come from the application development space and are now being, you know, in a way, forced to own and understand more infrastructure, and we’ve got the infrastructure folks who now in a way, whether it be through the rise of cloud computing and abstractions away from visible items, are being forced kind of up the stack, and so they’re coming together and this idea of what does the future of the folks that are kind of like in our space, what does that look like? How much longer does a network engineer really need to be deeply versed in all the different layers? Because everything’s been abstracted away by some other type of thing, whether it’s VPCs or Azure VNets or whatever the case is, right? I mean, you’ve got companies bringing the VPC model to on premises networks, right? As APIs become more prevalent, as everything gets sort of abstracted away, what does the future look like, what are the most important skills, and it seems to me that it’s these concepts that we’re talking about, right? This idea of distributed systems and how distributed systems behave and how the components react to one another and understanding things like the cap theorem that are going to be most applicable rather than the details of troubleshooting BGP or understanding AWS VPCs or whatever the case may be. [0:22:08.5] NL: I think there is always going to be a place for the people who know how things are running under the hood from like a physical layer perspective, that sort of thing, there’s always going to be the need for the greybeards, right? Even in software development, we still have the people who are slinging kernel code in C. And you know, they’re the best, we salute you, but that is not something that I’m interested in, for sure. We always need someone there to pick up the pieces as it were. I think that, yeah, just being like, I’m a Cisco guy, I’m a Juniper guy, you know? I know how to log in or SSH into the switch and execute these commands and suddenly I’ve got this port, you know, trunked to this VLAN. Crap, I was like, Nick, remember your training, you know? How to issue those commands, I wonder, I think that that isn’t necessarily going away but it will be less in demand in the future. [0:22:08.5] SL: I’m curious to hear Josh’s perspective as like having to own more and more of the infrastructure underneath, like what seems to be the right path forward for those folks? [0:23:08.7] JS: Yeah, I mean, unfortunately, I feel like a lot of times, it just ends up being trial by fire and it probably shouldn’t be that. But the amount of times that I have seen a deployment of some technology fall over because we overlapped the CIDR range or something like that is crazy. Because we just didn’t think about it or really understand it that well. You know, like using one protocol you just described, BGP. I never ever dreamt of what BGP was until I started using distributed systems, right? Started using BGP as a way to communicate routes, and the amount of times that I’ve messed up that connection because I don’t have a background in how to set that up appropriately, it’s been rough.
I guess my perspective is that the technology has gotten better overall and I’m mostly obviously in the Kubernetes space, speaking to the technologies around a lot of the container networking solutions, but I’m sure this is true overall. It seems like a lot of the sharp edges have been buffed out quite a bit and I have less of an opportunity to do things terribly wrong. I’ve also noticed, for what it’s worth, a lot of folks that have my kind of background are going out to, like, the AWSes and the Azures of the world. They’re using all these, like, abstracted networking technologies that allow them to do really cool stuff without really having to understand how it works, and they’re oftentimes going back to their networking team on prem when they have on prem requirements and being like, it should be this easy, or X, Y and Z, and they’re almost like pushing the networking team to modernize that and make things simpler, based on experiences they’re having with these cloud providers. [0:24:44.2] DC: Yeah, what do you mean I can’t create a load balancer that crosses between these two disparate data centers as easily as just issuing a single command? Doesn’t this just exist from a networking standpoint? Even just the idea that you can issue an API command and get a load balancer, just that idea alone, the thousands of times I have heard that request in my career. [0:25:08.8] JS: And like the actual work under the hood to get that to work properly is – it’s a lot, there’s a lot of stuff going on. [0:25:16.5] SL: Absolutely, yeah. [0:25:17.5] DC: Especially when you get into the plumbing, you know? If you’re going to create a load balancer with an API, well then, what API does the load balancer use to understand where to send that traffic when it’s being balanced? How do you handle discovery, how do you hit like – obviously, yeah, there’s no shortage on the amount of work there. [0:25:36.0] JS: Yeah. [0:25:36.3] DC: That’s a really good point, I mean, I think sometimes it’s easy for me to think about some of these API driven networking models and the costs that come with them, the hidden costs that come with them. An example of this is, if you’re in AWS and you have connectivity between two availability zones – actually, it could be any cloud, it doesn’t have to be AWS, right? If you have connectivity between two different availability zones and you’re relying on that to be reliable and consistent and definitely not to experience problems, what tools do you have at your disposal, what guarantees do you have that that network is even operating in a way that is responsive, right? And in a way, this is kind of taking us towards the observability conversation that I think we’ve talked about a little bit in the past. Because I think it highlights the same set of problems again, right? You have to understand, you have to be able to provide the consumers of any service, whether that service is plumbing, whether it’s networking, whether it’s your application that you’ve developed that represents a set of microservices. You have to provide everybody a way – or you know, you have to provide the people who are going to answer the phone at two in the morning, or even the robots that are going to answer the phone at two in the morning, some mechanism by which to observe those systems as they are in use. [0:26:51.7] JS: I’m not convinced that very many of the cloud providers do that terribly well today, you know?
I feel like I’ve been burned in the past without actually having an understanding of the state that we’re in, and so it is interesting, maybe the software development team can actually start pushing that down toward the networking vendors out there in the world. [0:27:09.9] NL: Yeah, that would be great. I mean, I have been recently using a managed Kubernetes service. I have been kicking the tires on it a little bit. And yeah, there has been a couple of times where I had just been got by networking issues. I am not going to get into what I have seen in a container network interface or any of the technologies around that. We are going to talk about that another time. But the CNI that I am using in this managed service was just so wonky and weird. And it was failing from a network standpoint. The actual network was failing in a sense because the IP addresses for the nodes themselves or the pods weren’t being released properly because of a bug. And so, the rules associated with my account could not remove IP addresses from a node in the network because it wasn’t allowed to, and so from a network standpoint, I ran out of IP addresses in my very small CIDR there. [0:28:02.1] SL: And this could happen in a database, right? This could happen in a cache of information, this could happen in – pretty much the same pattern that you are describing is absolutely relevant in both of these fields, right? And the fascinating thing about this is that, you know, we talk about the network generally in these nebulous terms, that it is like a black box and I don’t want to know anything about it. I don’t want to learn about it, I don’t want to understand it. I just want to be able to consume it via an API and I want to have the expectation that everything will work the way it is supposed to. I think it is fascinating that on the other side of that API are people maybe just like you who are doing their level best to provide, to chase the cap theorem to its happy end and figure out how to actually give you what you need out of that service, you know? So, empathy I think is important. [0:28:50.4] NL: Absolutely, to bring that to an interesting thought that I just had, where on both sides of this chasm or whatever it is between networking and development, the same principles exist like we have been saying, but just to elaborate on it a little bit more, it’s like on one side you have, I need to make sure that these ETCD nodes communicate with each other and that the data is consistent across the other ones. So, we use a protocol called Raft, right? And so that’s eventually consistent, and then that information is sent onto a network, which is probably using OSPF, which is the “open shortest path first” routing protocol, to become eventually consistent on the data getting from one point to the other by opening the shortest path possible. And so these two things are very similar. They are both these communication protocols, which is, I mean, that is what protocol means, right? A standard for communication, but there are just so many different layers, obviously, of the OSI model, but people don’t put them together and they really are – and we keep coming back to that where it is all the same thing but we think about it so differently. And I am actually really appreciating this conversation because now I am having a galaxy brain moment like boo. [0:30:01.1] SL: Another really interesting one, like another galaxy moment, I think that is interesting is if you think about – so let us break them down, like TCP and UDP.
These are interesting patterns that actually do totally relate, again, just in software patterns, right? In TCP the guarantee is that every datagram – if you didn’t get the entire datagram, you will understand that you are missing data and you will request a new version of that same packet. And so, you can provide consistency in the form of retries or repeats if things don’t work, right? Not dissimilar from the ability to understand, like, whether you chuck some data across the network or, like, in a particular database, if you make a query for a bunch of information you have to have some way of understanding that you got the most recent version of it, right? ETCD supports this by using the revision, by understanding what revision you received last or whether that is the most recent one. And other software patterns kind of follow the same model and I think that is also kind of interesting. Like we are still using the same primitive tools to solve the same problems, whether we are doing it at a software application layer or whether we are doing it down in the plumbing at the network layer, these tools are still very similar. Another example is like UDP where it is basically, there are no repeats. You either got the packet or you didn’t, which sounds a lot like an event stream to me in some ways, right? Like it is very interesting, you just figured out, like, I put it on the line, you didn’t get it? It is okay, I will put another one on the line here in a minute, you can react to that one, right? It is an interesting overlap. [0:31:30.6] NL: Yeah, totally. [0:31:32.9] JS: Yeah, the comparison to event streams or message queues, right? There is an interesting one that I hadn’t considered before, but yeah, there are certainly parallels between saying, “Okay, I am going to put this on the message queue,” and waiting for the acknowledgement that somebody has taken it and taken ownership of it, as opposed to an event stream where it is like, this happened. I emit this event. If you get it and you do something with it, great. If you don’t get it then you don’t do something with it, great, because another event is going to come along soon. So, there you go. [0:32:02.1] DC: Yep, I am going to go down a weird topic associated with what we are just talking about. But I am going to get a little bit more into the weeds of networking and this is actually directed at us in a way. So, talking about the kind of parallels between networking and development, in networking, at least with TCP and networking, there is something called CSMA/CD, which is “carrier sense multiple,” oh, I can’t remember what the A stands for, and the CD. [0:32:29.2] SL: Access. [0:32:29.8] DC: Multiple access, and then CD is collision detection, and so basically what that means is whenever you send out a packet on the network, the network device itself is listening on the network for any collisions, and if it detects a collision it will refuse to send a packet for a certain period of time and then will do a retry to make sure that these packets are getting sent as efficiently as possible. There is an alternative to that called CSMA/CA, which was used by Mac before they switched over to using a Linux based operating system. And then putting a fancy UI in front of it. With collision avoidance, it would listen and try and – I can’t remember exactly, it would time it differently so that it would totally just avoid any chance that there could be a collision. It would make sure that no packets were being sent right then and then send it back out.
And so I was wondering if something like that exists in the realm of the communication path between applications. [0:33:22.5] JS: Is a collision two of the same packets being sent, or what exactly is that? [0:33:26.9] DC: With the packets, so basically any data going back and forth. [0:33:29.7] JS: What makes it a collision? [0:33:32.0] SL: It is the idea that you can only transmit one message at a time because if they both populate the same media, it is trash, both of them are trash. [0:33:39.2] JS: And how do you qualify that? Do you receive an ack from the system or? [0:33:42.8] NL: No, there is just nothing returned essentially, so it is like literally like the electrical signals going down the wire. They physically collide with each other and then the signal breaks. [0:33:56.9] JS: Oh, I see, yeah, I am not sure. I think there are some parallels to that maybe with like queuing technologies and things like that, but I can’t think of anything on like the direct app dev side. [0:34:08.6] DC: Okay, anyway, sorry for that tangent. I just wanted to go down that little rabbit-hole a little bit. It was like, while we are talking about networking, I was like, “Oh yeah, I wanted to see how deep down we can make this parallel go,” so that was the direction I went. [0:34:20.5] SL: Like, that CSMA/CD piece is seriously old school, right? Because it only applied to half duplex Ethernet and as soon as we went to full duplex Ethernet it didn’t matter anymore. [0:34:33.7] DC: That is true. I totally forgot about that. [0:34:33.8] JS: It applied to satellite with all of these as well. [0:34:35.9] DC: Yeah, I totally forgot about that. Yeah, and with full duplex, we totally just spaced on that. This is – damn Scott, way to make me feel old. [0:34:45.9] SL: Well, I mean, satellite stuff, too, right? I mean it is actually any shared media upon which you have to – where if this stuff goes and overlaps there, you are not going to be able to make it work, right? And so, I mean, it is interesting. It is actually an interesting parallel. I am struggling to think of an example of this as well. I mean, my brain is going towards circuit breaking but I don’t think that that is quite the same thing. It is sort of the same thing in that in a circuit breaking pattern, the application that is making the request has the ability, obviously because it is the thing making the request, to understand that the target it is trying to connect to is not working correctly. And so, it is able to make an almost instantaneous decision, or at least a very timely decision, about what to do when it detects that state. And so that’s a little similar in that, from the requester side, you can do things if you see things going awry. And really, in reality, in the circuit breaking pattern we are making the assumption that only the application making the request will ever get that information fast enough to react to it. [0:35:51.8] JS: Yeah, where my head was kind of going with it, but I think it is pretty off, is like on a low level piece of code, like it is maybe something you write in C where you implement your own queue in that area and then multiple threads are firing off at the same time and there is no locking system or mechanism if two threads contend to put something in the same memory space that that queue represents. That is really going down the rabbit hole. I can’t even speak to what degree that is possible in modern programming but that is where my head was.
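A minimal Python sketch of the shared-access hazard Josh is gesturing at: two writers and one shared value, with a lock playing roughly the role that collision detection and avoidance play on shared media. The counter, thread count and iteration numbers are arbitrary; this is illustrative, not anyone's production code.

    # Illustrative only: unguarded read-modify-write can interleave ("collide"),
    # while a lock serializes access to the shared value.
    import threading

    counter = 0
    lock = threading.Lock()

    def bump_unsafe(n):
        global counter
        for _ in range(n):
            counter += 1          # read, add, write: another thread can slip in between

    def bump_safe(n):
        global counter
        for _ in range(n):
            with lock:            # only one thread at a time touches the shared value
                counter += 1

    threads = [threading.Thread(target=bump_safe, args=(100_000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)                # 400000 with the lock; can come up short with bump_unsafe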
[0:36:20.3] NL: Yeah, that is a good point. [0:36:21.4] SL: Yeah, I think that is actually a pretty good analogy because the key commonality here is some sort of shared access, right? Multiple threads accessing the same stack or memory buffer. The other thing that came to mind for me was like some sort of session multiplexing, right? Where you are running multiple application layer sessions inside a single sort of network connection and those network sessions getting comingled in some fashion, whether through identifiers or sequence numbers or something else of that nature, and therefore, you know, garbling the ultimate communication that is trying to be sent. [0:36:59.2] DC: Yeah, locks are exactly the right direction, I think. [0:37:03.6] NL: That is a very good point. [0:37:05.2] DC: Yeah, I think that makes perfect sense. Good, all right. Yes, we nailed it. [0:37:09.7] SL: Good job. [0:37:10.8] DC: Can anybody here think of a software pattern that maybe doesn’t come across that way? When you are thinking about some of the patterns that you see today in cloud native applications, is there a counter example, something that the network does not do at all? [0:37:24.1] NL: That is interesting. I am trying to think – what about event streams? No, that is just straight up packets. [0:37:30.7] JS: I feel like we should open up one of those old school Java books of like 9,000 design patterns you need to know and we should go one by one and be like, “What about this?” you know? There is probably something, I can’t think of it off the top of my head. [0:37:43.6] DC: Yeah, me neither. I was trying to think of it. I mean, like, I can think of a myriad of things that do cross over, even the idea of only locally relevant state, right? That is like a CAM table on a switch that is only locally relevant because once you get outside of that switching domain it doesn’t matter anymore, and it is like there is a ton of those things that totally do relate, you know? But I am really struggling to come up with one that doesn’t. One thing that is actually interesting, I was going to bring up – we mentioned the cap theorem and it is an interesting one, that you can only pick like two of three of consistency, availability and partition tolerance. And I think, you know, when I think about the way that networks solve or try to address this problem, they do it in some pretty interesting ways. It’s like if you were to consider like Spanning Tree, right? The idea that there can really only be one path through a series of broadcast domains. Because if we have multiple paths then obviously we are going to get duplication and things are going to get bad, because you are going to have packets that are addressed the same going across both and you are going to have all kinds of bad behaviors, switching loops and broadcast storms and all kinds of stuff like that, and so Spanning Tree came along, and Spanning Tree was invented by an amazing woman engineer who created it to basically ensure that there was only one path through a set of broadcast domains. And in a way, this solved that cap theorem problem, because you are getting to the point where you said, like, since I understand that for availability purposes, I only need one path through the whole thing, and so to ensure consistency, I am going to turn off the other paths, and to allow for partition tolerance, I am going to enable the system to learn when one of those paths is no longer viable so that it can re-enable one of the other paths.
Now the challenge of course is there is a transition period in which we lose traffic because we haven’t been able to open one of those other paths fast enough, right? And so, it is interesting to think about how the network is trying to solve that same set of problems that is described by the cap theorem that we see people trying to solve with software, right? [0:39:44.9] SL: No man, I totally agree. In a case like Spanning Tree, you are sacrificing availability essentially for consistency and partition tolerance – when the network achieves consistency then availability will be restored – and there are other ways of doing that. So as we move into systems like, I mentioned Clos fabrics earlier, you know, a Clos fabric is a different way of establishing a solution to that, and that is saying, at layer two I will have multiple connections. I will weight those connections using the higher-level protocol and I will sacrifice consistency in terms of how the routes are exchanged to get across that fabric in exchange for availability and partition tolerance. So, it is a different way of solving the same problem and using a different set of tools to do that, right? [0:40:34.7] DC: I personally find it funny that in the cap theorem, at no point do we mention complexity, right? We are just trying to get all three and we don’t care if it’s complex. But at the same time, as a consumer of all of these systems, you care a lot about the complexity. I hear it all the time. Whether that complexity is in the way that the API itself works or whether, even in this episode, we are talking about like, I maybe don’t want to learn how to make the network work. I am busy trying to figure out how to make my application work, right? Like cognitive load is a thing. I can only really focus on so many things at a time, where am I going to spend my time? Am I going to spend it learning how to do plumbing or am I going to spend it actually trying to write the application that solves my business problem, right? It is an interesting thing. [0:41:17.7] NL: So, with the rise of software defined networking, how did that play into the adoption of cloud native technologies? [0:41:27.9] DC: I think it is actually one of the more interesting overlaps in the space because I think, to Josh’s point again, this is where we were taking – I mean, I work for a company called [inaudible 0:41:37], in which we were virtualizing the network, and this is fascinating because effectively we were looking at this as a software service that we had to bring up and build, and build reliably and scalably – reliably and consistently and scalably. We wanted to create this all while we were solving problems. But we needed to do it within an API. It is like we couldn’t make the assumption that the way that networks were being defined today, like going to each component and configuring them or using protocols, was actually going to work in this new model of software defined networking. And so, we had an incredible amount of engineers who were really focused, from a computer science perspective, on how to effectively reinvent the network as a software solution. And I do think that there is a huge amount of crossover here, like this is actually where I think the waters meet between the way the developers think about the problems and the way that network engineers think about the problem, but it has been a rough road, I will say.
I will say that SDN, I think, has definitely thrown a lot of network engineers back on their heels because they’re like, “Wait, wait, but that is not a network,” you know? Because I can’t actually look at it and characterize it in the way that I am accustomed to looking at and characterizing the other networks that I play with. And then from the software side, you’re like, “Well, maybe that is okay,” right? Maybe that is enough, it is really interesting. [0:42:57.5] SL: You know, I don’t know enough about the details of how AWS or Azure or Google are actually doing their networking – and I don’t even know, and maybe you guys all do know – but I don’t even know that, aside from a few tidbits here and there, AWS is going to even divulge the details of how things work under the covers for VPCs, right? But I can’t imagine that any modern cloud networking solution, whether it would be VPCs or VNets or whatever, doesn’t have a significant software defined aspect to it. You know, we don’t need to get into the definitions of what SDN is or isn’t. That was a big discussion Duffie and I had six years ago, right? But there has to be some part of it that is taking and using the concepts that are common in SDN, right? And applying that. Just the same way as the cloud vendors are using the concepts from compute virtualization to enable what they are doing. I mean, like, the reality is that, you know, the work that was done by the Cambridge folks on Xen was a massive enabler for AWS, right? The work done on KVM was also a massive enabler for lots of people. I think GCP is KVM based, and vSphere, what VMware did, as well. I mean, all of this stuff was a massive enabler for what we do with compute virtualization in the cloud. I have to think that, whether it is – even if it wasn’t necessarily directly stemming out of Martin Casado’s OpenFlow work at Stanford, right? That a lot of these software defined networking concepts are still seeing use in the modern clouds these days and that is what enables us to do things like issue an API call and have an isolated network space with its own address space and its own routing, instantiated in some way and managed. [0:44:56.4] JS: Yeah, and on that latter point, you know, as a consumer of this new software defined nature of networking, it is amazing the amount of – I don’t know, I’m using like a blanket marketing term here – but agility that it has added, right? Because it has turned all of these constructs that I used to file a ticket and follow up with people on into self-service things, so that when I need to poke holes in the network – hopefully the rights are locked down, so I just can’t open it all up – assuming I know what I am doing and the rights are correct, it is totally self-service for me. I go into AWS, I change the security group rule and boom, the ports have changed, and it never looked like that prior to this full takeover of what I believe is SDN almost end to end in the case of AWS and so on. So, not only has it made people like myself have to understand more about networking, but it has allowed us to self-service a lot of the things that I would imagine most network engineers were probably tired of doing anyways, right? How many times do you want to go to that firewall and open up that port? Are you really that excited about that? I would imagine not so.
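To make the "I change the security group rule and boom" moment concrete, here is a rough sketch using boto3, the AWS SDK for Python. The region, security group ID and CIDR range are placeholders, and it assumes the caller's IAM rights actually permit the change, which is exactly the "hopefully the rights are locked down" caveat Josh raises.

    # Illustrative only: a self-service firewall change as a single API call.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",              # placeholder security group
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "example office range"}],
        }],
    )
    # No ticket, no change window: the port is open as soon as the call returns.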
And so, they want to know what these ports are that are opening and it is scary to be like this person has opened up these ports, “Wait what?” Like without them even totally knowing. I mean I was generalizing, I was more so speaking to myself as being self-deprecating. It doesn’t apply to you listener.  [0:46:22.9] JS: I mean it is a really interesting point though. I mean do you think it makes the networking people or network engineers maybe a little bit more into the realm of observability and like knowing when to trigger when something has gone wrong? Does it make them more reactive in their role I guess. Or maybe self-service is not as common as I think it is. It is just from my point of view, it seems like with STN’s the ability to modify the network more power has been put into the developers’ hands is how I look at it, you know?  [0:46:50.7] DC: I definitely agree with that. It is interesting like if we go back a few years there was a time when all of us in the room here I think are employed by VMware. So, there was a time where VMware’s thing was like the real value or one of the key values that VMware brought to the table was the idea that a developer come and say “Give me 10 servers.” And you could just call an API or make it or you could quickly provision those 10 servers on behalf of that developer and hand them right back.  You wouldn’t have to go out and get 10 new machines and put them into a rack, power them and provision them and go through that whole process that you could actually just stamp those things out, right? And that is absolutely parallel to the network piece as well. I mean if there is nothing else that SPN did bring to the fore is that, right? That you can get that same capability of just stamping up virtual machines but with networks that the API is important in almost everything we do.  Whether it is a service that you were developing, whether it is a network itself, whether it is the firewall that we need to do these things programmatically.  [0:47:53.7] SL: I agree with you Duffie. Although I would contend that the one area that and I will call it on premises STN shall we say right? Which is the people putting on STN solutions. I’d say the one area at least in my observation that they haven’t done well is that self-service model. Like in the cloud, self-service is paramount to Josh’s point. They can go out there, they can create their own BPC’s, create their own sub nets, create their own NAT gateways, Internet gateways to run security groups. Load balancers, blah-blah, all of that right?  But it still seems to me that even though we are probably 90, 95% of the way there, maybe farther in terms of on premise STN solutions right that you still typically don’t see self-service being pushed out in the same way you would in the public cloud, right? That is almost the final piece that is needed to bring that cloud experience to the on-premises environment.  [0:48:52.6] DC: That is an interesting point. I think from an infrastructure as a service perspective, it falls into that realm. It is a problem to solve in that space, right? So when you look at things like OpenStack and things like AWS and things like JKE or not JKE but GCE and areas like that, it is a requirement that if you are going to provide infrastructure as a service that you provide some capability around networking but at the same time, if we look at some of the platforms that are used for things like cloud native applications.  
Things like Kubernetes, what is fascinating about that is that we have agreed on at least a common – we agreed on an abstraction of networking that is maybe, I don’t know, maybe a little more precooked, you know what I mean? The assumption within, like, most of the platforms as a service that I have seen is that when I deploy a container or I deploy a pod or I deploy some function as a service or any of these things, the networking is going to be handled for me. I shouldn’t have to think about whether it is being routed to the Internet or not, or routed back and forth between these domains. I should, if anything, only have to actually give you intent, be able to describe to you the intent of what could be connected to this and what ports I am actually going to be exposing, and the platform actually hides all of the complexity of that network away from me, which is an interesting balance to strike. [0:50:16.3] SL: So, this is one of my favorite things, one of my favorite distinctions to make, right? And that is, these are the two worlds that we have been talking about, applications and infrastructure, and the perfect example of these different perspectives. You even said it as you talked there, Duffie, like from an IaaS perspective it is considered a given that you have to be able to say, I want a network, right? But when you come at this from the application perspective, you don’t care about a network. You just want network connectivity, right? And so, when you look at the abstractions that IaaS vendors and solutions or products have created, they are IaaS centric, but when you look at the abstractions that have been created in the cloud native space, like within Kubernetes, they are application centric, right? And so, we are talking about infrastructure artifacts versus application artifacts and they end up meeting, but they are coming at this from two very different perspectives. [0:51:18.5] DC: Yeah. [0:51:19.4] NL: Yeah, I agree. [0:51:21.2] DC: All right, well that was a great discussion. I imagine that we are probably going to get into – at least I have a couple of different networking discussions that I wanted to dig into, and with this conversation I hope that we’ve helped draw some parallels back and forth between the two. I mean, there is both some empathy to spend here, right? I mean, the people who are providing the service of networking to you in your cloud environments and your data centers are solving almost exactly the same sorts of availability problems and capabilities that you are trying to solve with your own software. And I think that in itself is a really interesting takeaway. Another one is that, again, there is nothing new under the sun. The problems that we are trying to solve in networking are not different than the problems that you are trying to solve in applications. We have far fewer tools, and generally we network engineers are focused on specific changes that happen in the industry rather than looking at a breadth of industries. Like, I mean, as Josh pointed out, you could break open a Java book and see 8,000 patterns for how to do Java, and this is true of every programming language that I am aware of. I mean, if you look at Go, you see a bunch of different patterns there, and we have talked about different patterns for just developing cloud native aware applications as well, right? I mean, there are so many options in software versus what we can do and what is available to us within networks.
And so I think I am rambling a little bit but I think that is the takeaway from this session.  Is that there is a lot of overlap and there is a lot of really great stuff out there. So, this is Duffie, thank you for tuning in and I look forward to the next episode.  [0:52:49.9] NL: Yep and I think we can all agree that Token Ring should have won.  [0:52:53.4] DC: Thank you Josh and thank you Scott.  [0:52:55.8] JS: Thanks.  [0:52:57.0] SL: Thanks guys, this was a blast.  [END OF EPISODE] [0:52:59.4] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END]See omnystudio.com/listener for privacy information.


3 Feb 2020

Rank #3

Podcast cover

The Dichotomy of Security (Ep 10)

Security is inherently dichotomous because it involves hardening an application to protect it from external threats, while at the same time ensuring agility and the ability to iterate as fast as possible. This in-built tension is the major focal point of today’s show, where we talk about all things security. From our discussion, we discover that there are several reasons for this tension. The overarching problem with security is that the starting point is often rules and parameters, rather than understanding what the system is used for. This results in security being heavily constraining. For this to change, a culture shift is necessary, where security people and developers come around the same table and define what optimizing to each of them means. This, however, is much easier said than done as security is usually only brought in at the later stages of development. We also discuss why the problem of security needs to be reframed, the importance of defining what normal functionality is and issues around response and detection, along with many other security insights. The intersection of cloud native and security is an interesting one, so tune in today! Follow us: https://twitter.com/thepodlets Website: https://thepodlets.io Feeback: info@thepodlets.io https://github.com/vmware-tanzu/thepodlets/issues Hosts: Carlisia Campos Duffie Cooley Bryan Liles Nicholas Lane Key Points From This Episode: Often application and program security constrain optimum functionality. Generally, when security is talked about, it relates to the symptoms, not the root problem. Developers have not adapted internal interfaces to security. Look at what a framework or tool might be used for and then make constraints from there. The three frameworks people point to when talking about security: FISMA, NIST, and CIS. Trying to abide by all of the parameters is impossible. It is important to define what normal access is to understand what constraints look like. Why it is useful to use auditing logs in pre-production. There needs to be a discussion between developers and security people. How security with Kubernetes and other cloud native programs work. There has been some growth in securing secrets in Kubernetes over the past year. Blast radius – why understanding the extent of security malfunction effect is important. Chaos engineering is a useful framework for understanding vulnerability. Reaching across the table – why open conversations are the best solution to the dichotomy. Security and developers need to have the same goals and jargon from the outset. The current model only brings security in at the end stages of development. There needs to be a place to learn what normal functionality looks like outside of production. How Google manages to run everything in production. It is difficult to come up with security solutions for differing contexts. Why people want service meshes. 
Quotes: “You’re not able to actually make use of the platform as it was designed to be made use of, when those constraints are too tight.” — @mauilion [0:02:21] “The reason that people are scared of security is because security is opaque and security is opaque because a lot of people like to keep it opaque but it doesn’t have to be that way.” — @bryanl [0:04:15] “Defining what that normal access looks like is critical to us to our ability to constrain it.” — @mauilion [0:08:21] “Understanding all the avenues that you could be impacted is a daunting task.” — @apinick [0:18:44] “There has to be a place where you can go play and learn what normal is and then you can move into a world in which you can actually enforce what that normal looks like with reasonable constraints.” — @mauilion [0:33:04] “You don’t learn to ride a motorcycle on the street. You’d learn to ride a motorcycle on the dirt.” — @apinick [0:33:57] Links Mentioned in Today’s Episode: AWS — https://aws.amazon.com/Kubernetes https://kubernetes.io/IAM https://aws.amazon.com/iam/Securing a Cluster — https://kubernetes.io/docs/tasks/administer-cluster/securing-a-cluster/TGI Kubernetes 065 — https://www.youtube.com/watch?v=0uy2V2kYl4U&list=PL7bmigfV0EqQzxcNpmcdTJ9eFRPBe-iZa&index=33&t=0sTGI Kubernetes 066 —https://www.youtube.com/watch?v=C-vRlW7VYio&list=PL7bmigfV0EqQzxcNpmcdTJ9eFRPBe-iZa&index=32&t=0sBitnami — https://bitnami.com/Target — https://www.target.com/Netflix — https://www.netflix.com/HashiCorp — https://www.hashicorp.com/Aqua Sec — https://www.aquasec.com/CyberArk — https://www.cyberark.com/Jeff Bezos — https://www.forbes.com/profile/jeff-bezos/#4c3104291b23Istio — https://istio.io/Linkerd — https://linkerd.io/ Transcript: EPISODE 10 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores cloud native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you. [EPISODE] [0:00:41.2] NL: Hello and welcome back to The Kubelets Podcast. My name is Nicholas Lane and this time, we’re going to be talking about the dichotomy of security. And to talk about such an interesting topic, joining me are Duffie Coolie. [0:00:54.3] DC: Hey, everybody. [0:00:55.6] NL: Bryan Liles. [0:00:57.0] BM: Hello [0:00:57.5] NL: And Carlisia Campos. [0:00:59.4] CC: Glad to be here. [0:01:00.8] NL: So, how’s it going everybody? [0:01:01.8] DC: Great. [0:01:03.2] NL: Yeah, this I think is an interesting topic. Duffie, you introduced us to this topic. And basically, what I understand, what you wanted to talk about, we’re calling it the dichotomy of security because it’s the relationship between security, like hardening your application to protect it from attack and influence from outside actors and agility to be able to create something that’s useful, the ability to iterate as fast as possible. [0:01:30.2] DC: Exactly. I mean, the idea from this came from putting together a talks for the security conference coming up here in a couple of weeks. And I was noticing that obviously, if you look at the job of somebody who is trying to provide some security for applications on their particular platform, whether that be AWS or GCE or OpenStack or Kubernetes or anything of these things. 
It’s frequently in their domain to kind of define constraints for all of the applications that would be deployed there, right? Such that you can provide rational defaults for things, right? Maybe you want to make sure that things can’t do a particular action because you don’t want to allow that for any application within your platform, or you want to provide some constraint around quota, or all of these things. And some of those constraints make total sense and some of them I think actually do impact your ability to design the systems or to consume that platform directly, right? You’re not able to actually make use of the platform as it was designed to be made use of, when those constraints are too tight. [0:02:27.1] DC: Yeah. I totally agree. There’s kind of a joke that we have in certain tech fields which is that the primary responsibility of security is to halt productivity. It isn’t actually true, right? But there are tradeoffs, right? If security is too tight, you can’t move forward, right? Examples of this that come to mind are like, if you’re too tight on your firewall rules where you can’t actually use anything of value. That’s a quick example of like security gone haywire. That’s too controlling, I think. [0:02:58.2] BM: Actually, this is an interesting topic just in general but I think that before we fall prey to what everyone does when they talk about security, let’s take a step back and understand why things are the way they are. Because all we’re talking about are the symptoms of what’s going on and I’ll give you one quick example of why I say this. Things are the way they are because we haven’t made them any better. In developer land, whenever we consume external resources, what we were supposed to do and what we should be doing but what we don’t do is we should create our internal interfaces. Only program to those interfaces and then let that interface adapt or talk to the external service, and in the security world, we should be doing the same thing and we don’t do this. My canonical example for this is IAM on AWS. It’s hard to create a secure IAM configuration and it’s even harder to keep it over time and it’s even harder to do it whenever you have 150, 100, 5,000 people dealing with this. What companies do is they actually create interfaces where they could describe the part of IAM they want to use and then they translate that over. The reason I bring this up is because the reason that people are scared of security is because security is opaque and security is opaque because a lot of people like to keep it opaque. But it doesn’t have to be that way. [0:04:24.3] NL: That’s a good point, that’s a reasonable design and wherever I see that adopted it actually is very helpful, right? Because you highlight a critical point in that these constraints have to be understood by the people who are constrained by them, right? Otherwise it will just continue to kind of like drive that wedge between the people who are responsible for defining them and the people who are being affected by them, right? That transparency, I think, is definitely key. [0:04:48.0] BM: Right, this is our cloud native discussion, any idea of where we should start thinking about this in cloud native land? [0:04:56.0] DC: For my part, I think it’s important to understand, if you can, what the consumer of a particular framework or tool might need, right? And then, just take it from there and figure out what rational constraints are.
Rather than the opposite which is frequently where people go and evaluate a set of rules as defined by some particular, some third-part company. Like you look at CIS packs and you look at like a lot of these other tooling. I feel like a lot of people look at those as like, these are the hard rules, we must comply to all of these things. Legally, in some cases, that’s the case. But frequently, I think they’re just kind of like casting about for some semblance of a way to start defining constraint and they go too far, they’re no longer taking into account what the consumers of that particular platform might meet, right? Kubernetes is a great example of this. If you look at the CIS spec for Kubernetes or if you look at a lot of the talks that I’ve seen kind of around how to secure Kubernetes, we defined like best practices for security and a lot of them are incredibly restrictive, right? I think of the problem there is that restriction comes at a cost of agility. You’re no longer able to use Kubernetes as a platform for developing microservices because you provided so much constraints that it breaks the model, you know? [0:06:12.4] NL: Okay. Let’s break this down again. I can think of a top of my head, three types of things people point to when I’m thinking about security. And spoiler alert, I am going to do some acronyms but don’t worry about the acronyms are, just understand they are security things. The first one I’ll bring up is FISMA and then I’ll think about NIST and the next one is CIS like you brought up. Really, the reason they’re so prevalent is because depending on where you are, whether you’re in a highly regulated place like a bank or you’re working for the government or you have some kind of automate concern to say a PIPA or something like that. These are the words that the auditors will use with you. There is good in those because people don’t like the CIS benchmarks because sometimes, we don’t understand why they’re there. But, from someone who is starting from nothing, those are actually great, there’s at least a great set of suggestions. But the problem is you have to understand that they’re only suggestions and they are trying to get you to a better place than you might need. But, the other side of this is that, we should never start with NIST or CIS or FISMA. What we really should do is our CISO or our Chief Security Officer or the person in charge of security. Or even just our – people who are in charge, making sure our stack, they should be defining, they should be taking what they know, whether it’s the standards and they should be building up this security posture in this security document and these rules that are built to protect whatever we’re trying to do. And then, the developers of whoever else can operate within that rather than everything literally. [0:07:46.4] DC: Yeah, agreed. Another thing I’ve spent some time talking to people about like when they start rationalizing how to implement these things or even just think about the secure surface or develop a threat model or any of those things, right? One of the things that I think it’s important is the ability to define kind of like what normal looks like, right? What normal access between applications or normal access of resources looks like. I think that your point earlier, maybe provides some abstraction in front of a secure resource such that you can actually just share that same fraction across all the things that might try to consume that external resource is a great example of the thing. 
Defining what that normal access looks like is critical to our ability to constrain it, right? I think that frequently people don’t start there, they start with the other side, they’re saying, here are all the constraints, you need to tell me which ones are too tight. You need to tell me which ones to loosen up so that you can do your job. You need to tell me which application needs access to whichever application so that I can open the firewall for you. I’m like, we need to turn that on its head. We need the environments that are perhaps less secure so that we can actually define what normal looks like and then take that definition and move it into a more secured state, perhaps by defining these across different environments, right? [0:08:58.1] BM: A good example of that would be in larger organizations – not every part of the organization does this – but there are environments running your application where there are really no rules applied. What we do with that is we turn on auditing in those environments, so you have two applications or a single application that talks to something and you let that application run, and then after the application runs, you go take a look at the audit logs and you determine at that point what a good profile of this application is. Whenever it’s in production, you set up the security parameters, whether it be identity access or network, based on what you saw in auditing in your preproduction environment. That’s all it’s allowed to run, because we tested it fully in our preproduction environment; it should not do any more than that. And that’s actually something – I’ve seen tools that will do it for AWS IAM. I’m sure you can do it for anything else that creates audit logs. That’s a good way to get started. [0:09:54.5] NL: It sounds like what we’re coming to is that the breakdown of security, or the way that security has impacted agility, is when people don’t take a rational look at their own use case and instead rely too much on the guidance of other people essentially. Instead of using things like the CIS benchmarking or NIST or FISMA – that’s one that, I knew the other two and I’m like, I don’t know this other one – if they follow them less as guidelines and more as like hard set rules, that’s when we get impacts to agility. Instead of like, “Hey. This is what my application needs, like you’re saying, let’s go from there. What does this one look like?” as Duffie was saying. I’m kind of curious, let’s flip that on its head a little bit, are there examples of times when agility impacts security? [0:10:39.7] BM: You want to move fast and moving fast is counter to being secure? [0:10:44.5] NL: Yes. [0:10:46.0] DC: Yeah, literally every single time we run software. What it comes down to is developers are going to want to develop and then security people are going to want to secure. And generally, I’m looking at it from a developer who has written security software that a lot of people have used, you guys know that. Really, there needs to be a conversation, it’s the same thing as we had this dev ops conversation for a year – and then over the last couple of years, this whole dev sec ops conversation has been happening. We need to have this conversation because from a security person’s point of view, you know, no access is great access. No data, you can’t get owned if you don’t have any data going across the wire. You know what? Can’t get into that server if there’s no ports opened.
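Bryan’s suggestion of deriving a profile from pre-production audit logs can be made concrete with a small sketch. This is only an illustration, not a finished tool: it assumes the API server has been configured to write audit events as JSON lines (the audit.k8s.io format) to a file, and the file name and the handful of fields picked out are just what is needed for a rough tally of who did what to which resource.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// auditEvent picks out only the fields of an audit.k8s.io event that we
// need to sketch an access profile; everything else in the event is ignored.
type auditEvent struct {
	Stage string `json:"stage"`
	Verb  string `json:"verb"`
	User  struct {
		Username string `json:"username"`
	} `json:"user"`
	ObjectRef struct {
		Resource  string `json:"resource"`
		Namespace string `json:"namespace"`
	} `json:"objectRef"`
}

func main() {
	f, err := os.Open("audit.log") // illustrative path: one JSON event per line
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	profile := map[string]int{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(scanner.Bytes(), &ev); err != nil {
			continue // skip lines that are not well-formed events
		}
		if ev.Stage != "ResponseComplete" {
			continue // only count requests that actually finished
		}
		key := fmt.Sprintf("%s %s %s/%s",
			ev.User.Username, ev.Verb, ev.ObjectRef.Namespace, ev.ObjectRef.Resource)
		profile[key]++
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	// Whatever shows up here is a candidate "normal"; anything that never
	// appears is a candidate for denial when the workload is promoted.
	for key, count := range profile {
		fmt.Printf("%6d  %s\n", count, key)
	}
}
```

The resulting tally of which identity used which verb against which resource is exactly the kind of “normal” that the later constraints – network policies, RBAC rules, IAM policies – can then be written to enforce.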
But practically, that doesn’t work, and what we find is that there is actually a failing on both sides to understand what the other person was optimizing for. [0:11:41.2] BM: That’s actually where a lot of this comes from. I will offer up that the only default secure posture is no access to anything, and you should be working from that direction to where you want to be rather than working from, what should we close down? You should close down everything and then you work with an allow list rather than a block list. [0:12:00.9] NL: Yeah, I agree with that model but I think that there’s an important step that has to happen before that and that’s, you know, the tooling or the wherewithal to define what the application looks like when it’s in a normal state or the running state, and if we can accomplish that, then I feel like we’re in a better position to define what that allow list looks like. And I think that one of the other challenges there of course – let’s back up for a second. I have actually worked on a platform that supported many services, hundreds of services, right? Clearly, if I needed to define what normal looked like for a hundred services or a thousand services or 2,000 services, that’s going to be difficult in the way that people approach the problem, right? How do you define it for each individual service? I need to have some declaration of intent. I need the developer to engage here and tell me what they’re expecting, to set some assumptions about the application, like what it’s going to connect to, what those dependencies are – that sort of stuff. And I also need tooling to verify that. I need to be able to kind of like build up the whole thing so that I have some way of automatically, you know, maybe with oversight, defining what that security context looks like for this particular service on this particular platform. Trying to do it holistically is actually I think where we get into trouble, right? Obviously, we can’t scale the number of people that it takes to actually understand all of these individual services. We need to actually scale this stuff as a software problem instead. [0:13:22.4] CC: With the cloud native architecture and infrastructure, I wonder if it makes it more restrictive because let’s say, these are running on Kubernetes, everything is running on Kubernetes. Things are more connected because it’s Kubernetes, right? It’s this one huge thing that you’re running on and Kubernetes makes it easier to have access to different nodes, and when the nodes took those apart, of course, you have to find this connection. Still, it’s supposed to make it easy. I wonder if security, from the perspective of somebody needing to put a restriction in, for example, makes it harder, or if it makes it easier to just delegate: you have this entire area here for you, and because your app is constrained to this space or name space or this part, this node, then you can have as much access as you need. Is there any difference? Do you know what I mean? Does it make sense what I said? [0:14:23.9] BM: It’s actually exactly the same thing as we had before. We need to make sure that applications have access to what they need and don’t have access to what they don’t need. Now, Kubernetes does make it easier because you can have network policies and you can apply those, and they’re easier to manage than who knows what networking management tooling you have.
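As a rough sketch of the network policies Bryan is describing, here is a default-deny policy built with the Kubernetes networking/v1 Go types. The namespace and name are made up for illustration, and in practice you would layer explicit allow rules on top of it rather than stopping here:

```go
package main

import (
	"fmt"
	"log"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	policy := networkingv1.NetworkPolicy{
		TypeMeta:   metav1.TypeMeta{APIVersion: "networking.k8s.io/v1", Kind: "NetworkPolicy"},
		ObjectMeta: metav1.ObjectMeta{Name: "default-deny-all", Namespace: "team-a"},
		Spec: networkingv1.NetworkPolicySpec{
			// An empty pod selector matches every pod in the namespace.
			PodSelector: metav1.LabelSelector{},
			// Declaring both policy types with no ingress or egress rules
			// means: deny all traffic until other policies allow it.
			PolicyTypes: []networkingv1.PolicyType{
				networkingv1.PolicyTypeIngress,
				networkingv1.PolicyTypeEgress,
			},
		},
	}

	out, err := yaml.Marshal(policy)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out)) // ready to pipe into `kubectl apply -f -`
}
```

Because the pod selector is empty, the policy selects every pod in the namespace, and because no ingress or egress rules are listed, nothing is allowed until further policies open specific paths – which is the allow-list posture Bryan argues for.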
Kubernetes also has pod security policies which, again, actually codify this knowledge around my pod should be able to do this, or should not be able to run as root, it shouldn’t be able to do this and be able to do that. It’s still the same practice, Carlisia, but the way that we can control it is now with a standard set of tools. We still have not cracked the whole nut, because the whole thing of turning auditing on to understand, and then having great tools that can read audit logs from Kubernetes, those just still aren’t there. Just to add one more last thing: before we were at VMware, when we were Heptio, we had a coworker who wrote basically dynamic audit, and that was probably one of the first steps that we would need to be able to employ this at scale. We are early, early, super early in our journey in getting this right, we just don’t have all the necessary tools yet. That’s why it’s hard and that’s why people don’t do it. [0:15:39.6] NL: I do think it is nice to have those primitives available to people who are making use of that platform though, right? Because again, it kind of opens up that conversation, right? Around transparency. The goal being, if you understood the tools that were defining that constraint, perhaps you’d have access to view what the constraints are and understand if they’re actually rational or not for your applications. When you’re trying to resolve, like, I have deployed my application in dev and it’s the wild west, there’s no constraints anywhere. I can do anything within dev, right? When I’m trying to actually promote my application to staging, it gives you some platform around which you can actually say, “If you want to get to staging, I do have to enforce these things and I have a way, and again, all still part of that same API, I still have that same user experience that I had when just deploying or designing the application to getting them deployed.” I could still look at it again and understand what the constraints are being applied and make sure that they’re reasonable for my application. Does my application run, does it have access to the network resources that it needs to? If not, can I see where the gaps are, you know? [0:16:38.6] DC: For anyone listening to this, Kubernetes doesn’t have all the documentation we need and no one has actually written this book yet. But on Kubernetes.io, there are a couple of documents about security and if we have shownotes, I will make sure those get included in our shownotes, because I think there are things that you should at least understand. You should at least understand what’s in a pod security policy. You should at least understand what’s in a network security policy. You should at least understand how roles and role bindings work. You should understand what you’re going to do for certificate management. How do you manage this certificate authority in Kubernetes? How do you actually work these things out? This is where you should start before you do anything else really fancy. At least, understand your landscape. [0:17:22.7] CC: Jeffrey did a TGIK talk on secrets. I think – was that a series? There were a couple of them, Duffie? [0:17:29.7] DC: Yeah, there were. I need to get back and do a little more but yeah. [0:17:33.4] BM: We should then add those to our shownotes too. Hopefully they actually exist, or I’m willing to see to their existence. [0:17:40.3] CC: We are going to have shownotes, yes. [0:17:44.0] NL: That is an interesting point, bringing up secrets and secret management and also, like, securing secrets in Kubernetes.
There are some tools that exist that we can use now in a cloud native world, at least in the container world. Things like Vault exist, things like, well, now with kubeadm you can roll certificates, which is really nice. We are getting to a place where we have more tooling available and I’m really happy about it. Because I remember using Kubernetes a year ago and everyone’s like, “Well, how do you secure a secret in Kubernetes?” And I’m like, “Well, it’s just base64 encoded. That’s not at all secure.” [0:18:15.5] BM: I would give credit to Bitnami, they have been doing sealed secrets, that’s been out for quite a while, but the problem is, how are you supposed to know about that and how are you supposed to know if it’s a good standard? And then also, how are you supposed to benchmark against that? How do you know if your secrets are okay? We haven’t talked about the other side which is response or detection of issues. We’re just talking about starting out, what do you do? [0:18:42.3] DC: That’s right. [0:18:42.6] NL: It is tricky. We’re just saying like, understanding all the avenues that you could be impacted is kind of a daunting task. Let’s talk about like the Target breach that occurred a few years ago. If anybody doesn’t remember this, basically, Target had a huge credit card breach from their database and basically, what happened is that their – if I recall properly, their OIDC token had a – not expired, but the audience for it was so broad that someone had hacked into one computer, essentially like a register or something, and they were able to get the OIDC token from the local machine. The authentication audience for that whole token was so broad that they were able to access the database that had all of the credit card information in it. This is one of those things that you don’t think about when you’re setting up security, when you’re just maybe getting started or something like that. What are the avenues of attack, right? You’d say like, “OIDC is just a pure authentication mechanism, why would we need to concern ourselves with this?” And then, not understanding – kind of what we were talking about earlier with the networking and the broadcasting – what is the blast radius of something like this? And so, I feel like this is a good example of sometimes security can be really hard and getting started can be really daunting. [0:19:54.6] DC: Yeah, I agree. To Bryan’s point, it’s like, how do you test against this? How do you know that what you’ve defined is enough, right? We can define all of these constraints and we can even think that they’re pretty reasonable or rational, and the application may come up and operate, but how do you know? How can you verify that what you’ve done is enough? And then also, remember, OIDC has its own foundations and limits. You realize that it’s a very strong door, but it’s only a strong door – it can’t do anything if you can walk around the wall that it’s protecting or climb over the wall that it’s protecting. There’s a bit of trust and when you get into things like the Target breach, you really have to understand blast radius for anything that you’re going to do. A good example would be if you’re using shared key kind of things or like public shared keys. You have certificate authorities and you’re generating certificates. You should probably have multiple certificate authorities and you can have basically a hierarchy of these, so you could have basically the root one controlled by just a few people in security.
And then, each department has their own certificate authority and then you should also have things like revocation, you should be able to say that, “Hey, all this is bad and it should all go away and it probably should have every revocation list,” which a lot of us don’t have believe it or not, internally. Where if I actually kill our own certificate, a certificate was generated and I put it in my revocation list, it should not be served and in our clients that are accepting that our service is to see that, if we’re using client side certificates, we should reject these instantly. Really, what we need to do is stop looking at security as this one big thing and we need to figure out what are our blast radius. Firecracker, blowing up in my hand, it’s going to hurt me. But Nick, it’s not going to hurt you, you know? If someone drops in a huge nuclear bomb on the United States or the west coast United States, I’m talking to myself right now. You got to think about it like that. What’s the worst that can happen if this thing gets busted or get shared or someone finds that this should not happen? Every piece off data that you have that you consider secure or sensitive, you should be able to figure out what that means and that is how whenever you are defining a security posture that’s butchered to me. Because that is why you’ll notice that a lot of companies some of them do run open within a contained zone. So, within this contained zone you could talk to whomever you want. We don’t actually have to be secure here because if we lose one, we lost them all so who cares?  So, we need to think about that and how do we do that in Kubernetes? Well, we use things like name spaces first of all and then we use things like this network policies and then we use things like pod security policies. We can lock some access down to just name spaces if need be. You can only talk to pods and your name space. And I am not telling you how to do this but you need to figure out talking with your developer, talking to the security people.  But if you are in security you need to talk to your product management staff and your software engineering staff to figure out really how does this need to work? So, you realize that security is fun and we have all sorts of neat tools depending on what side you’re on. You know if you are on red team, you’re half knee in, you’re blue team you are saving things. We need to figure out these conversations and tooling comes from these conversations but we need to have these conversation first.  [0:23:11.0] DC: I feel like a little bit of a broken record on this one but I am going to go back to chaos engineering again because I feel like it is critical to stuff like this because it enables a culture in which you can explore both the behavior of applications itself but why not also use this model to explore different ways of accessing that information? Or coming up with theories about the way the system might be vulnerable based on a particular attack or a type of attack, right?  I think that this is actually one of the movements within our space that I think provides because then most hope in this particular scenario because a reasonable chaos engineering practice within an organization enables that ability to explore all of the things. You don’t have to be red team or blue team. 
You can just be somebody who understands this application well and the question for the day is, “How can we attack this application?” Let’s come up with theories about the way that perhaps this application could be attacked. Think about the problem differently instead of thinking about it as an access problem, think about it as the way that you extend trust to the other components within your particular distributed system like do they have access that they don’t need. Come up with a theory around being able to use some proxy component of another system to attack yet a third system.  You know start playing with those ideas and prove them out within your application. A culture that embraces that I think is going to be by far a more secure culture because it lets developers and engineers explore these systems in ways that we don’t generally explore them.  [0:24:36.0] BM: Right. But also, if I could operate on myself I would never need a doctor. And the reason I bring that up is because we use terms like chaos engineering and this is no disrespect to you Duffie, so don’t take it as this is panacea or this idea that we make things better and true. That is fine, it will make us better but the little secret behind chaos engineering is that it is hard. It is hard to build these experiments first of all, it is hard to collect results from these experiments.  And then it is hard to extrapolate what you got out of the experiments to apply to whatever you are working on to repeat and what I would like to see is what people in our space is talking about how we can apply such techniques. But whether it is giving us more words or giving us more software that we can employ because I hate to say it, it is pretty chaotic in chaos engineering right now for Kubernetes. Because if you look at all the people out there who have done it well.  And so, you look at what Netflix has done with pioneering this and then you listen to what, a company such us like Gremlin is talking about it is all fine and dandy. You need to realize that it is another piece of complexity that you have to own and just like any other things in the security world, you need to rationalize how much time you are going to spend on it first is the bottom line because if I have a “Hello, World!” app, I don’t really care about network access to that.  Unless it is a “Hello, World!” app running on the same subnet as some doing some PCI data then you know it is a different conversation.  [0:26:05.5] DC: Yeah. I agree and I am certainly not trying to version as a panacea but what I am trying to describe is that I feel like I am having a culture that embraces that sort of thinking is going to enable us to be in a better position to secure these applications or to handle a breach or to deal with very hard to understand or resolve problems at scale, you know? Whether that is a number of connections per second or whether that is a number of applications that we have horizontally scaled.  You know like being able to embrace that sort of a culture where we asked why where we say “well, what if…” or if we actually come up you know embracing the idea of that curiosity that got you into this field, you know what I mean like the thing that is so frequently our cultures are opposite of that, right? It becomes a race to the finish and in that race to the finish, lots of pieces fall off that we are not even aware of, you know? That is what I am highlighting here when I talk about it.  
[0:26:56.5] NL: And so, it seems maybe the best solution to the dichotomy between security and agility is really just open conversation, in a way. People actually reaching across the aisle to talk to each other. So, if you are embracing this culture as you are saying, Duffie, the security team should be having constant communication with the application team, instead of just like the team doing something wrong and the security team coming down and smacking their hand and being like, “Oh you can’t do it this way because of our draconian rules” right? These people are working together and almost playing together a little bit inside of their own environment to create also a better environment. And I am sorry, I didn’t mean to cut you off there, Bryan. [0:27:34.9] BM: Oh man, I thought it was fleeting like all my thoughts. But more about what you are saying is that, you know, it is not just more conversations, because we can still have conversations and I am talking about CIDRs and subnets and attack vectors and buffer overflows and things like that. But my developer is talking, “Well, I just need to be able to serve this data so accounting can do this.” And that’s what happens a lot in security conversations. You have two groups of individuals who have wholly different goals, and part of that conversation needs to be aligning our jargon and then aligning on those goals. But what happens with pretty much everything in the development world, we always bring our networking, our security and our operations people in right at the end, right when we are ready to ship, “Hey make this thing work.” And really that is where a lot of our problems come out. Now, if security either could or wanted to be involved at the beginning of a software project, when we actually are talking about what we are trying to do – we are trying to open up this service to talk to this, share this kind of data – security can be in there early saying, “Oh no you know, we are using this resource in our cloud provider. It doesn’t really matter what cloud provider and we need to protect this. This data is sitting here at rest.” If we get those conversations earlier, it would be easier to engineer solutions that can hopefully be reused so we don’t have to have that conversation in the future. [0:29:02.5] CC: But then it goes back to the issue of agility, right? Like Duffie was saying, how you can develop in, I guess, a development cluster which has much less restrictive restrictions and then move to a production environment where the proper restrictions are – or maybe a staging environment, let’s say. And then you find out, “Oh whoops. There are a bunch of restrictions I didn’t deal with, and I did move a lot faster because I didn’t have them, but now I have to deal with them.” [0:29:29.5] DC: Yeah, I do think it is important to have a promotion model in which you are able to move toward a more secure deployment, right? Because I guess a parallel to this is, like, I have heard it said that you should develop your monolith first and then, when you actually have the working prototype of what you’re trying to create, then consider carefully whether it is time to break this thing up into a set of distinct services, right? And consider carefully also what the value of that might be. And I think that the reason that that’s said is because it is easier. It is going to be a lower cognitive load with everything all right there in the same codebase.
You understand how all of these pieces interconnect and you can quickly develop or prototype what you are working on. Whereas if you are trying to develop these things into individual micro services first, it is harder to figure out where the line is. Like where to divide all of the business logic. I think this is also important when you are thinking about the security aspects of this, right? Being able to do a thing in which you are not constrained – define all of these services and your application and the model for how they communicate – without constraint is important. And once you have that, once you actually understand what normal looks like from that set of applications, then enforce it, right? If you are able to declare that intent, you are going to say, like, these are the ports I listen on for these things, these are the things that they are going to access, this is the way that they are going to go about accessing them. You know, if you can declare that intent, then that is actually a reasonable body of knowledge for which the security people can come along and say, “Okay well, you have told us. You informed us. You have worked with us to tell us what your intent is. We are going to enforce that intent and see what falls out and we can iterate there.” [0:31:01.9] CC: Yeah, everything you said makes sense to me. Starting with build the monolith first. I mean, when you start out, why would you abstract things that you don’t really – I mean, you might think you know, but you only really know in practice what you are going to need to abstract. So, don’t abstract things too early. I am a big fan of that idea. So yeah, start with the monolith and then you figure out how to break it down based on what you need. With security I would imagine the same idea resonates with me. Don’t secure things that you don’t need or don’t know just yet need securing, except the deal breaker things. Like, there are some things we know – like we don’t want production data being accessed in certain ways – some things we know we need to secure from the beginning. [0:31:51.9] BM: Right. But I will still reiterate that it is always deny by default, just remember that. Security is actually the opposite way. We want to make sure that we have the least amount of access and, even if it is harder for us, you always want to start with “allow TCP communication on port 443”, or UDP as well. That is what I would allow, rather than starting open and saying shut everything else off. I would rather have it the way where we only allow that, and that also goes in with the declarative nature in cloud native things we like anyways. We just say what we want and everything else doesn’t exist. [0:32:27.6] DC: I do want to clarify though, because I think what you and I – we are the representatives of the dichotomy right at this moment, right? I feel like what you are saying is the constraint should be the normal, being able to drop all traffic, do not allow anything, is normal, and then you have to declare intent to open anything up. And what I am saying is frequently developers don’t know what normal looks like yet. They need to be able to explore what normal looks like by developing these patterns and then enforce them, right, which is turning the model on its head.
And this is actually I think the kernel that I am trying to get to in this conversation is that there has to be a place where you can go play and learn what normal is and then you can move into a world in which you can actually enforce what that normal looks like with reasonable constraint. But until you know what that is, until you have that opportunity to learn it, all we are doing here is restricting your ability to learn. We are adding friction to the process.  [0:33:25.1] BM: Right, well I think what I am trying to say here layer on top of this is that yes, I agree but then I understand what a breach can do and what bad security can do. So I will say, “Yeah, go learn. Go play all you want but not on software that will ever make it to production. Go learn these practices but you are going to have to do it outside of” – you are going to have a sandbox and that sandbox is going to be unconnected from the world I mean from our obelisk and you are going to have to learn but you are not going to practice here. This is not where you learn how to do this. [0:33:56.8] NL: Exactly right, yeah. You don’t learn to ride a motorcycle on the street you know? You’d learn to ride a motorcycle on the dirt and then you could take those skills later you know? But yeah I think we are in agreement like production is a place where we do have to enforce all of those things and having some promotion level in which you can come from a place where you learned it to a place where you are beginning to enforce it to a place where it is enforced I think is also important.  And I frequently describe this as like development, staging and production, right? Staging is where you are going to hit the edges from because this is where you’re actually defining that constraint and it has to be right before it can be promoted to production, right? And I feel like the middle ground is also important.  [0:34:33.6] BM: And remember that production is any environment production can reach. Any environment that can reach production is production and that is including that we do data backup dumps and we clean them up from production and we use it as data in our staging environment. If production can directly reach staging or vice versa, it is all production. That is your attack vector. That is also what is going to get in and steal your production data.  [0:34:59.1] NL: That is absolutely right. Google actually makes an interesting not of caveat to that but like side point to that where like if I understand the way that Google runs, they run everything in production, right? Like dev, staging and production are all the same environment. I am more positing this is a question because I don’t know if anybody of us have the answer but I wonder how they secure their infrastructure, their environment well enough to allow people to play to learn these things? And also, to deploy production level code all in the same area? That seems really interesting to be and then if I understood that I probably would be making a lot more money.  [0:35:32.6] BM: Well it is simple really. There were huge people process at Google that access gatekeeper for a lot of these stuff. So, I have never worked in Google. I have no intrinsic knowledge of Google or have talked to anyone who has given me this insight, this is all speculation disclaimer over. But you can actually run a big cluster that if you can actually prove that you have network and memory and CPU isolation between containers, which they can in certain cases and certain things that can do this.  
What you can do is you can use your people process and your approvals to make sure that software gets to where it needs to be. So, you can still play on the same clusters but we have great handles on network that you can’t talk to these networks or you can’t use this much network data. We have great things on CPU that this CPU would be a PCI data. We will not allow it unless it’s tied to CPU or it is PCI. Once you have that in place, you do have a lot more flexibility. But to do that, you will have to have some pretty complex approval structures and then software to back that up.  So, the burden on it is not on the normal developer and that is actually what Google has done. They have so many tools and they have so many processes where if you use this tool it actually does the process for you. You don’t have to think about it. And that is what we want our developers to be. We want them to be able to use either our networking libraries or whenever they are building their containers or their Kubernetes manifest, use our tools and we will make sure based on either inspection or just explicit settings that we will build something that is as secure as we can given the inputs. And what I am saying is hard and it is capital H hard and I am actually just pitting where we want to be and where a lot of us are not. You know most people are not there.  [0:37:21.9] NL: Yeah, it would be nice if we had like we said earlier like more tooling around security and the processes and all of these things. One thing I think that people seem to balk on or at least I feel is developing it for their own use case, right? It seems like people want an overarching tool to solve all the use cases in the world. And I think with the rise of cloud native applications and things like container orchestration, I would like to see people more developing for themselves around their own processes, around Kubernetes and things like that.  I want to see more perspective into how people are solving their security problems, instead of just like relying on let’s say like HashiCorp or like Aqua Sec to provide all the answers like I want to see more answers of what people are doing.  [0:38:06.5] BM: Oh, it is because tools like Vault are hard to write and hard to maintain and hard to keep correct because you think about other large competitors to vault and they are out there like tools like CyberArk. I have a secret and I want to make sure only certain will keep it. That is a very difficult tool but the HashiCorp advantage here is that they have made tools to speak to people who write software or people who understand ops not just as a checkbox.  It is not hard to get. If you are using vault it is not hard to get a secret out if you have the right credentials. Other tools is super hard to get the secret out if you even have the right credential because they have a weird API or they just make it very hard for you or they expect you to go click on some gooey somewhere. And that is what we need to do. We need to have better programming interfaces and better operator interfaces, which extends to better security people are basis for you to use these tools.  You know I don’t know how well this works in practice. 
But the Jeff Bezos thing, how teams at AWS or Amazon, you know, teams communicate through APIs – and I am not saying that you shouldn’t talk, but we should definitely make sure that our APIs between teams, the team who owns security stuff and the teams who are writing developer stuff, let us talk at the same level of fidelity that we can have in an in-person conversation. We should be able to do that through our software as well. Whether that be asking for ports or asking for our resources or just talking about the problem that we have, that is my thought-leadering answer to this. This is “Bryan wants to be a VP of something one day” and that is the answer I am giving. I’m going to be the CIO, that is my CIO answer. [0:39:43.8] DC: I like it. So cool. [0:39:45.5] BM: Is there anything else on this subject that we wanted to hit? [0:39:48.5] NL: No, I think we have actually touched on pretty much everything. We got a lot out of this and I am always impressed with the direction that we go. I did not expect us to go down this route and I was very pleased with the discussion we have had so far. [0:39:59.6] DC: Me too. I think if we are going to explore anything else that we talked about, like, you know, to get it more into that state, we are talking about how we need more feedback loops. We need developers to talk to security people. We need security people to talk to developers. We need to have some way of actually pushing that feedback loop, much like some of the other cultural changes that we have seen in our industry are trying to allow for better feedback loops in other spaces. And you’ve brought up dev sec ops, which is another move to try and open up that feedback loop, but the problem I think is still going to be that even if we improved that feedback loop, we are at an age where – especially if you ended up in some of the larger organizations, there are too many applications to solve this problem for and I don’t know yet how to address this problem in that context, right? If you are in a state where you are a 20-person, 30-person security team and your responsibility is to secure a platform that is running a number of Kubernetes clusters, a number of vSphere clusters, a number of cloud provider implementations, whether that would be AWS or GCE, I mean that is a set of problems that is very difficult. It is like I am not sure that improving the feedback loop really solves it. I know that it helps but, you know, I definitely have empathy for those folks for sure. [0:41:13.0] CC: Security is not my forte at all because whenever I am developing, I have a narrow need. You know, I have to access a cluster, I have to access a machine or I have to be able to access the database. And it is usually a no brainer, but I get a lot of the issues that were brought up. But as a builder of software, I have empathy for people who use software, consume software, mine and others, and how they often can’t have any visibility as far as security goes. For example, in the world of cloud native, let’s say you are using Kubernetes, I sort of start thinking, “Well, shouldn’t there be a scanner that just lets me declare?” I think I am starting a new episode right now – should there be a scanner that lets me declare, for example, this node can only access this set of nodes, like a graph. But you just declare and then you run it periodically and you make sure – of course this goes down to part of an app can only access part of the database.
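Carlisia’s “scanner” idea can be approximated today with a small client-go program run on a schedule. This sketch is only a starting point and its scope is deliberately narrow: it just flags namespaces that have no NetworkPolicy at all (so every pod in them is reachable by default); checking that only a declared graph of connections is allowed would build on the same listing calls. The kubeconfig handling is the stock client-go default.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from the default location; an in-cluster config
	// would work the same way for a checker that runs as a CronJob.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	namespaces, err := client.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, ns := range namespaces.Items {
		policies, err := client.NetworkingV1().NetworkPolicies(ns.Name).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Fatal(err)
		}
		if len(policies.Items) == 0 {
			// No NetworkPolicy selects anything here, so every pod in the
			// namespace is reachable by default -- flag it for review.
			fmt.Printf("namespace %q has no NetworkPolicies\n", ns.Name)
		}
	}
}
```

Run from a CronJob, the same pattern gives the periodic “keep checking that it is still in place” behavior she describes.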
It can get very granular but maybe at a very high level I mean how hard can this be? For example, this pod can only access that pods but this pod cannot access this name space and just keep checking what if the name spaces changes, the permission changes. Or for example would allow only these answers can do a backup because they are the same users who will have access to the restore so they have access to all the data, you know what I mean? Just keep checking that is in place and it only changes when you want to.  [0:42:48.9] BM: So, I mean I know we are at the end of this call and I want to start a whole new conversation but this is actually is why there are applications out there like Istio and Linkerd. This is why people want service meshes because they can turn off all network access and then just use the service mesh to do the communication and then they can use, they can make sure that it is encrypted on both sides and that is a honey cave on all both sides. That is why this is operated.  [0:43:15.1] CC: We’ll definitely going to have an episode or multiple on service mesh but we are on the top of the hour. Nick, do your thing.  [0:43:23.8] NL: All right, well, thank you so much for joining us on another interesting discussion at The Kubelets Podcast. I am Nicholas Lane, Duffie any final thoughts?  [0:43:32.9] DC: There is a whole lot to discuss, I really enjoyed our conversations today. Thank you everybody.  [0:43:36.5] NL: And Bryan?  [0:43:37.4] BM: Oh it was good being here. Now it is lunch time.  [0:43:41.1] NL: And Carlisia.  [0:43:42.9] CC: I love learning from you all, thank you. Glad to be here.  [0:43:46.2] NL: Totally agree. Thank you again for joining us and we’ll see you next time. Bye.  [0:43:51.0] CC: Bye.  [0:43:52.1] DC: Bye.  [0:43:52.6] BM: Bye.  [END OF EPISODE] [0:43:54.7] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END]See omnystudio.com/listener for privacy information.


30 Dec 2019

Rank #4

Most Popular Podcasts

Podcast cover

Cloud Native Apps (Ep 16)

Do you know what cloud native apps are? Well, we don’t really either, but today we’re on a mission to find out! This episode is an exciting one, where we bring all of our different understandings of what cloud native apps are to the table. The topic is so interesting and diverse and can be interpreted in a myriad of ways. The term ‘cloud native app’ is not very concrete, which allows for this open interpretation. We begin by discussing what we understand cloud native apps to be. We see that while we all have similar definitions, there are still many differences in how we interpret this term. These different interpretations unlock some other important questions that we also delve into. Tied into cloud native apps is another topic we cover today – monoliths. This is a term that is used frequently but not very well understood and defined. We unpack some of the pros and cons of monoliths as well as the differences between monoliths and microservices. Finally, we discuss some principles of cloud native apps and how having these umbrella terms can be useful in defining whether an app is a cloud native one or not. These are complex ideas and we are only at the tip of the iceberg. We hope you join us on this journey as we dive into cloud native apps! Follow us: https://twitter.com/thepodlets Website: https://thepodlets.io Feeback: info@thepodlets.io https://github.com/vmware-tanzu/thepodlets/issues Hosts: Carlisia Campos Bryan Liles Josh Rosso Nicholas Lane Key Points From This Episode: What cloud native applications mean to Carlisia, Bryan, Josh, and Nicholas. Portability is a big factor of cloud native apps. Cloud native applications can modify their infrastructure needs through API calls. Cloud native applications can work well with continuous delivery/deployment systems. A component of cloud native applications is that they can modify the cloud. An application should be thought of as multiple processes that interact and link together. It is possible resources will begin to be requested on-demand in cloud native apps. An explanation of the commonly used term ‘monolith.’ Even as recently as five years ago, monoliths were still commonly used. The differences between a microservice approach and a monolith approach. The microservice approach requires thinking about the interface at the start, making it harder. Some of the instances when using a monolith is the logical choice for an app. A major problem with monoliths is that as functionality grows, so too does complexity. Some other benefits and disadvantages of monolith apps. In the long run, separating apps into microservices gives a greater range of flexibility. A monolith can be a cloud native application as well. Clarification on why Brian uses the term ‘microservices’ rather than cloud native. ‘Cloud native’ is an umbrella term and a set of principles rather than a strict definition. If it can run confidently on someone else’s computer, it is likely a cloud native application. Applying cloud native principles when building an app from scratch makes it simpler. It is difficult to adapt a monolith app into one which uses cloud native principles. The applications which could never be adapted to use cloud native principles. A checklist of the key attributes of cloud native applications. Cloud native principles are flexible and can be adapted to the context. It is the responsibility of thought leaders to bring cloud native thinking into the mainstream. Kubernetes has the potential to allow us to see our data centers differently. 
Quotes: “An application could be made up of multiple processes.” — @joshrosso [0:14:43] “A monolith is simply an application or a single process that is running both the UI, the front-end code and the code that fetches the state from a data store, whether that be disk or database.” — @joshrosso [0:16:36] “Separating your app is actually smarter in the long run because what it gives you is the flexibility to mix and match.” — @bryanl [0:22:10] “A cloud native application isn’t a thing. It is a set of principles that you can use to guide yourself to running apps in cloud environments.” — @bryanl [0:26:13] “All of these things that we are talking about sound daunting. But it is better that we can have these conversations and talk about things that don’t work rather than not knowing what to talk about in general.” — @bryanl [0:39:30] Links Mentioned in Today’s Episode: Red Hat — https://www.redhat.com/en IBM — https://www.ibm.com/ VMware — https://www.vmware.com/ The New Stack — https://thenewstack.io/ 10 Key Attributes of Cloud-Native Applications — https://thenewstack.io/10-key-attributes-of-cloud-native-applications/ Kubernetes — https://kubernetes.io/ Linux — https://www.linux.org/ Transcript: EPISODE 16 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you. [EPISODE] [0:00:41.4] NL: Hello and welcome back, my name is Nicholas Lane. This time, we’ll be diving into what it’s all about: cloud native applications. Joining me this week are Bryan Liles. [0:00:53.2] BL: Hi. [0:00:54.3] NL: Carlisia Campos. [0:00:55.6] CC: Hi everybody, glad to be here. [0:00:57.6] NL: And Josh Rosso. [0:00:58.6] JR: Hey everyone. [0:01:00.0] NL: How’s it going everyone? [0:01:01.3] JR: It’s been a great week so far. I’m just happy that I have a good job and am able to do things that make me feel whole. [0:01:08.8] NL: That’s awesome, wow. [0:01:10.0] BL: Yeah, I’ve been having a good week as well and doing a bit of some fun stuff after work. Like my soon to be in-laws are in town so I’ve been visiting with them and that’s been really fun. Cloud native applications, what does that mean to you all? Because I think that’s an interesting topic. [0:01:25.0] CC: Definitely not a monolith. I think if you have a monolith running on the clouds, even if you start it out that way, I wouldn’t say it’s a cloud native app. I always think of containerized applications, and if you’re using the container system then it’s usually because you want to have smaller systems and more of them, that sort of thing. Also, when I think of cloud native applications, I think that they were developed – the whole strategy of the development and the whole strategy of deploying and shipping has been designed from scratch to put on the cloud system. [0:02:05.6] JR: I think of it as applications that were designed to run in containers. And I also think about things like services, like micro services or macro services, whatever you want to call them, that we have multiple applications that are made to talk not just with themselves but with other apps, and they deliver a bigger functionality through their coordination.
Then, when I also think about cloud native apps, I think of apps that we are moving to the cloud – that’s a big topic in itself – but applications that we run in the cloud. All of our new fancy services and our SaaS offerings, a lot of these are cloud native apps. But then on the other side, I think about applications that are cloud native as being tolerant to failure and, on the other side, able to actually tell us about their health and who they’re talking to. [0:02:54.8] CC: Gets very complicated. [0:02:56.6] BL: Yeah. That was the side of that I haven’t thought about. [0:03:00.7] JR: Actually, the things for me that always come to mind are obviously portability, right? Wherever you're running this application, it can run somewhat consistently, be it on different clouds or even – a lot of people, you know, are running their own cloud which is basically their on-prem cloud, right? That application being able to move across any of those places, and oftentimes, containerization is one of the mechanisms we use to do that, right? Which is what we all stated. Then I guess the other thing too is like, this whole cloud ecosystem, be it a cloud provider or your own personal one, is oftentimes very API driven, right? So, the applications maybe being able to take advantage of some of those APIs should they need to, be it for scaling purposes or otherwise. It’s a really interesting model. [0:03:43.2] NL: It’s interesting, for me, like, this question, because so far everyone is giving similar but also different answers. And for me, I’m going to give a similar answer: to me, a cloud native application is a lot of things we said, like portable. I think of micro services when I think of a cloud native application. But it’s also an application that can modify the infrastructure it needs via API calls, right? If your application needs a service or needs a networking connection, it can – the application itself can manifest that via the cloud offering, right? That’s what I always thought of as a cloud native application, right? If you need like a database, the application can reach out to like AWS RDS and spin up the database, and that was an aspect I always found very fascinating with cloud native applications. It isn’t necessarily the definition, but for me, that’s the part that I was really focused on, which I think is quite interesting. [0:04:32.9] BL: Also, CI/CD – cloud native apps are made to work well with our CI, our continuous integration, and our continuous delivery/deployment systems as well. That’s like another very important aspect of cloud native applications. We should be able to deploy them to production without typing anything in. It should be some kind of automated process. [0:04:56.4] NL: Yeah, that is for sure. Carlisia, you mentioned something that I think it’s good for us to talk about a little bit, which is terminology. I keep coming back to that. You mentioned monolithic apps, what are monoliths then? [0:05:09.0] CC: I am so hung up on what you just said, can we table that for a few minutes? You said cloud native applications for you is an application that can interact with the infrastructure and maybe, for example, the database. I wonder if you have an example or if you could expand on that, I want to – if everybody agrees with that, I’m not clear on what that even is. Because as a developer, which is always my point of view, that is what I know. It’s a lot of responsibility for the application to have. And for example, when I would think cloud native – and I’m thinking now, maybe I’m going off on a tangent here.
But we have Kubernetes, isn’t that what Kubernetes is supposed to do to glue it all together? So, the application only needs to know what it needs to do. But spinning up an all tight system is not one of the things it would need to do? [0:05:57.3] BL: Sure, actually, I was going to use Kubernetes as my example for cloud native application. Because Kubernetes is what it is, an app, right? It can modify the cloud that it’s running. And so, if you have Kubernetes running in AWS, you can create ELB’s, elastic load balancers. It can create new nodes. It can create new databases if you need, as I mentioned. Kubernetes itself is my example like a cloud native application. I should say that that’s a good callout. My example of what a cloud native application isn’t necessarily like that’s a rule. All cloud native applications have to modify the cloud in which they exist in. It’s more that they can modify. That is a component of a cloud native application. Kubernetes is being an example there. I don’t know, I guess things like operators inside of Kubernetes like the rook operator will create storage for you when you spin up like root create a Ceph cluster, it will also spin up like the ELB’s behind it or at least I believe it does. Or that kind of functionality. [0:06:57.2] CC: I can see what you're saying because for example, if I choose to use the storage inside something like Kubernetes, then you will be required of my app to call an SDK and connect so that their storage somehow. So, in that sense I guess, you are using your app. Someone correct me if I’m wrong but that’s how the connection is created, right? You just request – but you’re not necessarily saying I want this thing specific, you just say I want this sort of thing like which has their storage and then you define that elsewhere. So, your applications don’t need to know details bit definitely needs to say, I need this. I’m talking about again, when your data storage is running on top of Kubernetes and not outside of it. [0:07:46.4] BL: Yeah. [0:07:47.3] NL: That brings up an interesting part of this whole term cloud native app. Because it’s like everything else in the space, our terms are not super concrete and an interesting piece about this is that an application – does an application half the map one to one with the running process? What is an application? [0:08:06.1] NL: That is interesting because it could say that a serverless app or a serverless rule, whatever serverless really is, I guess we can get into that in another episode. Are those cloud native applications? They’re not just running anywhere. [0:08:19.8] JR: I will punt on that because I know my boundaries are and that definitely not in my boundaries. But the reason I bring this up is because a little while ago, it’s probably year ago in a Kubernetes [inaudible 0:08:32] apps, we actually have a conversation about what an application was. And the consensus from the community and from the group members was that actually, an application could be made up of multiple processes. So, let’s say you were building this large SaaS service and because you’re selling dog food online, your application could be your dog food application. But you have inventory management. You have a front end, maybe you haven’t had service, you have a shipping manager and things like that. Sales tax calculator. Are those all applications? Or is it one application? 
The piece about cloud applications, or cloud native applications, is that what we found in Kubernetes is that the way we’re thinking about applications is: an application is multiple processes that can be linked together, and we can tell the whole story of how all those interact and work. Just something else, another way to think about this. [0:09:23.5] NL: Yeah, that is interesting, I never really considered that before but that makes a lot of sense. Particularly with the rise of things like gRPC and the ability to send dedicated, well codified messages to different processes. That gives rise to things like this multi-process view of an application. [0:09:41.8] BL: Right. But going back to your idea around cloud native applications being able to commandeer the resources that they need. That’s something that we do see. We see it within Kubernetes right now. I’ll give you, above and beyond the example that you gave: whenever you create a StatefulSet in Kubernetes, the operator behind StatefulSets actually goes and provisions a PVC for you. You requested a resource, and whenever you change the number of instances from one to, like, five, guess what? You get four more PVCs. Just think about it, that is actually something that is happening; it’s a little transparent to people. But I can see, to the point of, we’re just requesting a new resource, and if we are using cloud services to watch our other things, or our cloud native services to watch our applications, I could see us asking for this on demand, or even a service like a database or some other type of queuing thing on demand. [0:10:39.2] CC: When I hear things like this, I think, “Wow, it sounds very complicated.” But then I start to think about it and I think it’s really neat, because it is complicated but the alternative would have been way more complicated. I mean, we can talk about, this is sort of how it’s done now. I mean, it’s really hard to go into details on a one-hour episode. We can’t cover the how it’s done, maybe conceptually; we are sort of throwing a lot of words out there to sort of conceptualize it, but we can also try to talk about, in a conceptual way, how it is done in a non-cloud native world. [0:11:15.3] NL: Yeah, I kind of want to get back to the question I posed before, what is a monolithic app, what is a non-cloud native app? And not all non-cloud native apps are monoliths, but this is actually something that I’ve heard a lot and I’ll be honest, I have an idea of what a monolithic app is but I think I don’t have a very good grasp of it. We kind of talked a bit about what a cloud native app is; what is a non-cloud native app, or what came before cloud native applications? What is a monolith? [0:11:39.8] CC: I’m personally not a big fan of monoliths. Of course, I worked with them, but once microservices started becoming common I started developing in that mode. I am much more of a fan of breaking things down, for so many different reasons. It is a controversial topic for sure. But to go back to your question, the monolith is basically, you have an app – it sort of goes to what Brian was saying, it’s like, what is an app? If you think of an app as, like, one thing, Amazon is an app, right? It’s an app that we use to buy things as consumers. And you know, the other part is the cloud. But let’s look at it like it’s an app that we use to buy things as consumers; we know it’s broken down into so many different services. There is the checkout service, there is the cart service.
I mean, I’m imagining, these I can imagine thought, the small services that compose that one Amazon app.  If it was a monolith, those services that you know – those things are different systems that are talking together. The whole thing would be on one code base. It would reside in same code base or it will be deployed together. It will be shipped together. If you make a change in one place and you needed to deploy that, you have to deploy the whole thing together. You might have teams that are working on separate aspects but they’re working against the same code base. And maybe because of that, that will lend itself to teams not really specializing on separate aspects because everything is together so you might make one change of the impacts another place and then you have to know that part as well. So, there is a lot less specialization and separation of teams as well. [0:13:32.3] BL: Maybe to give an example of my experience and I think it aligns with a lot of the details Carlisia just went over. Even taking five years back, my experience at least was, we’d write up a ticket and we’d ask somebody to make a server space for us, maybe run [inaudible 0:13:44] on it, right? We’d write all this Java code and we’d package it into these things that run on a JDL somewhere, right? We would deploy this whole big application you know?Let’s call it that dog food app, right?  It would have maybe even like a state layer and have the web server layer, maybe have all these different pieces all running together, this big code base as Carlisia put it. And we’d deploy it, you know, that process took a lot of time and was very consuming especially when we needed to change stuff, we didn’t have all these modern API’s and this kind of decoupled applications, right? But then, over time, you know, we started learning more and more about the notion of isolating each of these pieces or layers. So that we could have the web server, isolated in its how, put some site container or a unit and then the state layer and the other layers even isolated, you know, the micro service approach more or less. And then we were able to scale independently and that was really awesome. so we saw a lot of the gains in that respect. We basically moved our complexity to other areas, we took our complexity that you need to all happen in the same memory space and we moved a lot of it into the network with this new protocols of that different services talk to one another. It’s been an interesting thing kind of seeing the monolith approach and the micro service approach and how a lot of these micro service apps are in my opinion a lot more like cloud native aligned, if that makes sense? Just seeing how the complexity shows around in that regard. [0:15:05.8] CC: Let me just say one more thing because it’s actually the biggest aspect of micro services that I like the most in comparison, you know, the aspect of monolith that I hate the most and that I don’t hate it, I appreciate the least, let’s put it that way. Is that, when you have a monolith, it is so easy because design is hard so it’s so easy to couple different parts of your app with other parts of your app and have couples cold and coupled functionality. When you break this into micro services, that is impossible.  Because it was working with separate code bases. If you force to think what is your interface, you’re always thinking about the interface and what people need to consume from you, your interface is the only way into your app, into your system. 
I really like the aspect that it forces you to think about your API. And people will argue, “Well, you can put the same amount of effort into that if you have a monolith.” Absolutely, but in reality, I don’t see it. And like Josh was saying, it is not a walk in the park, but I’d much rather deal with those issues, those complexities that microservices create, than the challenges of running a big – I’m talking about big monoliths, right? Not something trivial. [0:16:29.8] JR: I will come in to distill this, about how I look at monoliths and how it fits into this conversation. A monolith is simply an application, or a single process in this case, that is running both the UI, the front-end code, and the code that fetches the state from a data store, whether that be disk or database. That is what a monolith is. The reasons people use monoliths are many, but I can actually think of some very good reasons. If you have code reuse – let’s say you have a website and you were trying to – you have forms and you want to be able to reuse those form libraries, or you have data access and you want to be able to reuse that data access code, a monolith is great. The problem with monoliths is as functionality becomes larger, complexity becomes larger, and not at the same rate. I’m not going to say that it is linear, but it’s not quite exponential either. Maybe it’s n log n or something like that. But the problem is that at a certain point, you’re going to have so much functionality, you’re not going to be able to put it inside of one process – see Rails. Rails is actually a great example of this, where we run into the issues where we put so much application into a Rails source directory and we try to run it and we basically end up with these huge processes. And we split them up. But what we found is that we could actually split out the front-end code to one process. We could split out the middleware, maybe multiple processes in the middle, the data access layer to another process, and we could use those; we could actually take advantage of multiple CPU cores or multiple computers. The problem with this is that with splitting this out, it’s complexity. So, what if you have a [inaudible 0:18:15] – what I’m trying to say here in a very long way is that monoliths have their places. As a matter of fact, I encourage, at least I still encourage, people to start with the monolith. Put everything in one place. Whenever it gets too big, you split it out. But in a cloud native world, because we’re trying to take advantage of containers, we’re trying to take advantage of cores on CPUs, we’re trying to take advantage of multiple computers to do that in the most efficient way, you want to split your application up into smaller pieces so that your front end versus your middle layer, versus your data access layer, versus your data layer itself can run on as many computers and as many cores as possible. Therefore, spreading the risk and spreading the usage, because everything should be faster. [0:19:00.1] NL: Awesome. That is some great insight into monolithic apps and also the benefits and pros and cons of them. Like something I didn’t have before. Because I’ve only ever heard the phrase “monolithic apps” said in hushed tones or with a swear word directly after it. And so, it’s interesting to hear the concept of it being that each way you deploy your application is complex, but there are different tradeoffs, right? It’s the idea that I was like, “Why don’t you want to turn your monolith into micro services?
Well, there’s so much more overhead, so much more yak shaving you have to do to get there to take advantage of micro services. That was awesome, thank you so much for that insight. [0:19:39.2] CC: I wanted to reiterate a couple of aspects of what Brian said and Josh said in regards to that. One huge advantage – I mean, your application needs to be substantial enough that you feel like you need to do that, that you’re going to get some advantage from it. When you hit that point and you do that, you’re breaking into services like Josh was saying and Brian was saying, you have the ability to increase your capabilities, your processing capabilities, based on one aspect of the system that needs it. So, you have something that requires very low processing, you run that service with a certain level of capabilities. And something like your orders process or your orders microservice, you increase the processing power for that much more than some other part. When it comes to running this in the cloud native world, I think this is more an infrastructure aspect. But my understanding is that you can automate all of that; you can determine, “Okay, I have analyzed my requirements based on history and this is what I need. So, I’m going to tell the cloud native infrastructure, this is what I need, and the automation will take care of bringing the system up to that if anything happens.” It is always going to be healing your system in an automated way, and this is something that I don’t think gets talked about enough. Like we say, we talk about, “Oh, things split up this way and they run this way,” but doing it in an automated mode, that makes all of the difference. [0:21:15.4] NL: Yeah, that makes a lot of sense actually. So, basically monolithic apps don’t give us the benefit of automation or automated deployment the way that microservices and cloud native applications kind of give us, right? [0:21:28.2] BL: Yes, and think about this: whenever you have five microservices delivering your application’s functionality and you need to upgrade the front-end code for the HTML, whatever generates the HTML, you can actually replace just that piece and not bring your whole application down. And even better yet, you can replace that piece one at a time or two at a time, still have the majority of your application still running, and maybe your users won’t even know at all. So, let’s say you have a monolith and you are running multiple versions of this monolith. When you take that whole application down, you literally take the whole application down: not only do you lose front-end capacity, you also lose back-end capacity as well. So, separating your app is actually smarter in the long run because what it gives you is the flexibility to mix and match, and you could actually scale the front end at a different level than you did the backend. And that is actually super important in [inaudible 0:22:22] land and actually Python land and .NET land if you’re writing monoliths. You have to scale at the level of your monolith, and if you scale it like that then you are going to have wasted resources. So smaller microservices, smaller cloud native apps that run in containers, will actually use fewer resources. [0:22:41.4] JR: I have an interesting question for us all. So obviously a lot of cloud native applications usually maybe look like these micro services we’re describing, can a monolith be a cloud native application as well? [0:22:54.4] BL: Yes, it can. [0:22:55.1] JR: Cool.
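The point just made about replacing one piece at a time without users noticing depends on each piece being able to report its own health and shut down cleanly when the platform swaps it out. A minimal sketch of that, assuming a small Go HTTP service (the port, paths, and drain timeout below are made up for illustration, not anything discussed in the episode):

```go
// main.go: a tiny service with a health endpoint and graceful shutdown.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()

	// The slice of the app this process owns, e.g. the front end.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello from the front-end service\n"))
	})

	// A health endpoint the orchestrator can probe; while this returns
	// 200, the instance keeps receiving traffic.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}

	// Serve in the background so main can wait for a termination signal.
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			panic(err)
		}
	}()

	// Block until the platform asks this instance to go away, then drain
	// in-flight requests before exiting, so users never notice the swap.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```

An orchestrator that probes /healthz can pull one instance out of rotation, let it drain, and start a replacement while the remaining instances keep serving, which is what makes the one-or-two-at-a-time rollout described above possible.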
[0:22:55.6] NL: Yeah, I think so. As long as the – basically, if the monolith can be deployed with the mechanisms that we described, like CI/CD, or can take advantage of the cloud, I believe the monolith can be a cloud native application, sure. [0:23:08.8] CC: There are monoliths – because I am glad you brought that up, because I was going to bring that up; I hear Brian using microservices and cloud native apps interchangeably and it makes it really hard for me to follow, “Okay, so what is not a cloud native application, or what is not a cloud native service, and what is not a cloud native monolith?” So, to start this thread with the question that Josh just asked, which also became my question: if I have a monolith app running on a cloud provider, is that a cloud native app? If it is not, what piece of the puzzle needs to exist for that to be considered a cloud native app? And then the follow-up question I am going to throw out there already is, why do we care? What is the big deal if it is or if it isn’t? [0:23:55.1] BL: Wow, okay. Well let’s see. Let’s unpack this. I have been using microservice and cloud native interchangeably, probably not to the best effect. But let me clear up something here about cloud native versus microservices. Cloud native is a big term and it goes further than an application itself. It is not only the application. It is also the environment that the application can run in. It is the process that we use to get the application to production. So, monoliths can be cloud native apps. We can run them through CI/CD. They can run in containers. They can take advantage of their environment. We can scale them independently. But if we use microservices instead, this becomes easier because our surface area is smaller. So, what I want to do is not use that term like this. Cloud native applications is an umbrella term, but I will never actually say cloud native application. I always say a microservice, and the reason why I will say the microservice is because it is a much more accurate description of that process that is running. Cloud native applications is more of the umbrella. [0:25:02.0] JR: It is really interesting because a lot of the times that we are working with customers, when we go out and introduce them to Kubernetes, we are oftentimes asked, “How do I make my application cloud native?” To what you are talking about, Brian, and to your question, Carlisia, I feel like a lot of times people are a little bit confused about it because sometimes they are actually asking us, “How do I break this legacy app into smaller micro services,” right? But sometimes they are actually asking like, “How do I make it more cloud native?” And usually our guidance or the things that we are working with them on is exactly that, right? It is like getting that application containerized so we can get it portable, whether it is a monolith or a microservice, right? We are containerizing it. We are making it more portable. We are maybe helping them out with health checks so that the infrastructure environment that they are running in can tap into it and know the health of that application, whether it’s to restart it, with Kubernetes as an example. We are going through and helping them understand those principles that I think fall more into the umbrella of cloud native, like you are saying, Brian, if I am following you correctly, and helping them kind of enhance their application. But it doesn’t necessarily mean splitting it apart, right? It doesn’t mean running it in smaller services.
It just means following these more cloud native principles. It is hard talk up so that was continuing to say cloud native right?  [0:26:10.5] BL: So that is actually a good way of putting it. A cloud native application isn’t a thing. It is a set of principles that you can use to guide yourself to running apps in cloud environments. And it is interesting. When I say cloud environments I am not even really particularly talking about Kubernetes or any type of scheduler. I am just talking about we are running apps on other people’s computers in the cloud this is what we should think about and it goes through those principles.  Where we use CI/CD, storage maybe most likely will be ephemeral. Actually, you know what? That whole process, that whole virtual machine that we are running on that is ephemeral too, everything will go away. So, cloud native applications is basically a theory that allows us to be strategic about running applications with other people’s computers and storage and networking and compute may go away. So, we do this at this way, this is how to get our 5-9’s or 4-9’s above time because we can actually do this.  [0:27:07.0] NL: That is actually a great point. The cloud native application is one that can confidently run on somebody else’s computer. That is a good stake in the ground.  [0:27:15.9] BL: I stand behind that and I like the way that you put it. I am going to steal that and say I made it up.  [0:27:20.2] NL: Yeah, go ahead. We have been talking about monoliths and cloud native applications. I am curious, since you all are developers, what is your experience writing cloud native applications?  [0:27:31.2] JR: I guess for green field projects where we are starting from scratch and we are kind of building this thing, it is a really pleasant experience because a lot of things are sort of done for us. We just need to know how to interact with the API or the contract to get the things we need. So that is kind of my blanket statement. I am not trying to say it is easy, I am just saying like it has become quite convenient in a lot of respects when adopting these cloud native principles.  Like the idea that I have a docker file and I build this container and now I am running this app that I am writing code for all over the place, it’s become such a more pleasant experience and at least in my experience years and years ago with like dropping things into the tomcat instances running all over the place, right? But I guess what’s also been interesting is it’s been a bit hard to convert older applications into the cloud native space, right?  Because I think the point Carlisia had started with around the idea of all the code being in one place, you know it is a massive undertaking to understand how some of these older applications work. Again, not saying that all older applications are only monoliths. But my experience has been that they generally are. Their bigger code base is hard to understand how they work and where to modify things without breaking other things, right?  When you go and you say, “All right, let’s adopt some cloud native principles on this app that has been running on the mainframe for decades” right? That is a pretty hard thing to do but again, green field projects I found it to be pretty convenient.  [0:28:51.6] CC: It is actually easy, Josh. You just rewrite it. [0:28:54.0] JR: Totally yes. That is always a piece of cake. ,[0:28:56.9] BL: You usually write it in Go and then it is cloud native. That is actually the secret to cloud native apps. 
You write it in Go, you install it, you deploy it in Kubernetes, mission accomplished, cloud native to you. [0:29:07.8] CC: Anything written in Go is cloud native. We are declaring that here, you heard it here first. [0:29:13.4] JR: That is a great question, it’s like, how do we get there? That is a hard question and not one that I would basically just wave a magic set of words over and say that we are there. But what I would say is that as we start thinking of moving applications to cloud native, first we need to identify applications that cannot be, let’s call it, updated, and I could actually give you some. Your Windows 2003 applications – and yes, I do know some of you are running 2003 still. Those are not cloud native and they never will be, and the problem is that you won’t be able to run them in a containerized environment. Microsoft says stop using 2003; you should stop using it. Other applications that won’t be cloud native are applications that require a certain level of machine or server access. We have been able to abstract GPUs. But if you’re working at the IO level, like you are actually looking at IO, or if you are looking at hardware interrupts, or you are looking at anything like that, that application will never be cloud native. Because there is no way that we can do that in a shared environment, which is most likely what your application will be running in, in the cloud. There is no way, first of all, that the hypervisor that is actually running your virtual machine wants to give you that access, or that it is not being shared with one to 200 other processes on that server. So, applications that want low level access or have real time requirements, you don’t want to run those in the cloud. They cannot be cloud native. That still means a lot of applications can be. [0:30:44.7] CC: So, I keep thinking of, if I own a tech stack and every once in a while I stop and evaluate, am I squeezing as much tech as I can out of my system? Meaning, am I using the best technology out there to the extent that fits my needs? If I am that kind of person and, I don’t know – it’s like, say I am a decision maker, and even if I was a tech person, like I am also a tech person, I still would not know – unless I am one of the architects. And sometimes even the architects don’t have an entire vision. I mean, they have to talk to other architects who have a greater vision of the whole system, because systems can be so big. But at any rate, if I am an architect or I own the tech stack one way or another, my question is, is my system a cloud native system? Is my app a cloud native app? I am not even sure that we clarified enough for people to answer that. I mean, it is so complicated; maybe we did, hopefully we helped a little bit. So basically, this would be my question, how do I know if I am there or not? Because my next step would be, well, if I am not there then what am I missing? Let me look into it and see if the cost benefit is worth it. But if I don’t know what is missing, what do I look at? How do I evaluate? How do I evaluate if I am there, and if I am not, what do I need to do? So, we talked about this a little bit on episode one, in which we talked about cloud native, like what is cloud native in general, and now we are talking about apps. And so, you know, there should be a checklist of things that a cloud native app should at least have. Like the 12-factor app: what do you need to have to be considered a 12-factor app?
We should have a checklist. The 12-factor app – I think having that checklist is part of being a microservice or a cloud native app. But I think there needs to be more. I just wish we would have that – not that we need to come up with that list now, but it is something to think about. Someone should do it, you know? [0:32:57.5] JR: Yeah, it would be cool. [0:32:58.0] CC: Is it reasonable or not to want to have that checklist? [0:33:00.6] BL: So, there is, there is a checklist that exists. I know that Red Hat has one. I know that IBM has one. I would guess VMware has one on one of our web pages. Now the problem is they’re all different. What I do, and this is me trying to be fair here: The New Stack, basically they talk about things that are happening in the cloud and in tech. If you search for The New Stack and cloud native applications, there is a 10-bullet list. That is what I send to people now. The reason I send that one rather than any vendor’s is because a vendor is trying to sell you something. They are trying to sell you their vision of cloud native where they excel and they will give you products that help you with that part, like CI/CD, “oh, we have a product for that.” I like The New Stack list and actually, I Googled it while you were talking, Carlisia, because I wanted to bring it up. So, I will just go through the titles of this list and we’ll make sure that we make this link available. So, there is 10 Key Attributes of Cloud-Native Applications. Packaged as lightweight containers. Developed with best-of-breed languages and frameworks – you know, that doesn’t mean much, but that is how nebulous this is. Designed as loosely coupled microservices. Centered around APIs for interaction and collaboration. Architected with a clean separation of stateless and stateful services. Isolated from server and operating system dependencies. Deployed on self-service, elastic cloud infrastructure. Managed through agile DevOps processes. Automated capabilities. And the last one, Defined, policy-driven resource allocation. And as you see, those are all very much up for interpretation or implementation. So, a cloud native app, from my point of view, tries to target most of these items and has an opinion on most of these items. So, a cloud native app isn’t just one thing. It is a mindset that I am running with. Like I said before, I am running my software on other people’s computers; how can I best do this? [0:34:58.1] CC: I added the link to our show notes. When I look at this list, I don’t see observability. That word is not there. Does it fall under one of those points? Because observability is another new-ish term that seems to be part and parcel of cloud native. Correct me here, people. [0:35:19.1] JR: I am. Actually, the eighth item, ‘Managed through agile DevOps processes,’ is actually – they don’t talk about monitoring or observability. But for an application, for a person who is not developing the application, so whether you have a DevOps team or you have an SRE practice, you are going to have to be able to communicate the status of the application, whether it be through metrics or logs or whatever the other one is. I am thinking – traces. So that is actually, I think, baked in; it is just not called out. So, to get to proper DevOps you would need some observability; that is how you get that status when you have a problem. [0:35:57.9] CC: So, this is how obscure these things can be. I just want to point this out.
It is so frustrating, so literally we have item eight, which Brian has been, as the main developer so he is super knowledgeable. He can look at that and know what it means. But I look at that and the words log metrics, observability none of these words are there and yet Brian knew that that is what it means that that is what he meant. And I don’t disagree with him. I can see it now but why does it have to be so obscure?  [0:36:29.7] JR: I think a big thing to consider too is like it very much lands on spectrum, right? Like something you would ask Carlisia is how do I qualify if my app is cloud native or what do I need to do? And you know a lot of people in my experience are just adopting parts of this list and that’s totally fine. You know worrying about whether you fully qualify as a cloud native app since we have talked about it as more of a set of principles is something –  I don’t know if there is too too much value in worrying about whether you can block that label onto your app as much as it is, “Oh I can see our organization our applications having these problems.” Like lacking portability when we move them across providers or going back to observability, not being able to know what is going on inside of the application and where the network packets are headed and they switched to being asked we’re late to see these happening.  And as those problems come on, really looking at and adopting these principles where it is appropriate. Sometimes it might not be with the engineering efforts without them, one of the more cloud native principles. You know you just have to pick and choose what is most valuable to you.  [0:37:26.7] BL: Yes, and actually this is what we should be doing as experts, as thought-leaders, as industry movers and shakers. Our job is to make this easier for people coming behind us. At one time, it was hard to even start an application or start your operating system. Remember when we had to load AN1, you know? Remember we had to do that back in the day on our basic, on our Comado64’s or Apple or Apple2. Now you turn your computer on and it comes with instantly.  We click on application and it works. We need to actually bring this whole cloud movement to that point where things like if you include these libraries and you code with these API’s you get automatic observability. And I am saying that with air quotes but you get the ability to have this thing to monitor it in some fashion. If you use this practice and you have this stack, CI/CD should be super simple for you and we are just not quite there yet.  And that is why the industry is definitely rotating around this and that is why there has been a lot of buzz around cloud native and Kubernetes is because people are looking at this to actually solve a lot of these problems that we’ve had. Because they just haven’t been solvable because everybody stacks are too different. But this one though, the reason Linux is I think ultimately successful is because it allowed us to do things and all of these Linux things we liked and it worked on all sorts of computers.  And it got that mindset behind it behind companies. Kubernetes could also do this. It allows us to think about our data centers as potentially one big computer or fewer computers that allows us to make sure things are running. And once we have this, now we can develop new tools that will help us with our observability, with our getting software into production and upgraded and where we need it.  [0:39:17.1] NL: Awesome. 
So, on that, we are going to have to wrap up for this week. Let’s go ahead and do a round of closing thoughts. [0:39:22.7] JR: I don’t know if I have any closing thoughts. But it was a pleasure talking about cloud native applications with you all. Thanks. [0:39:28.1] BL: Yeah, I have one thought, and it is that all of these things that we are talking about sound kind of daunting. But it is better that we can have these conversations and talk about things that don’t work rather than not knowing what to talk about in general. So this is a journey for us and I hope you come for more of our journey. [0:39:46.3] CC: First I was going to follow up on Josh and say I am thoughtless. But now I want to follow up on Brian’s and say, no, I have no opinions. It is very much what Brian said for me: the bridging of what we can do using cloud native infrastructure and what we read about it and what we hear about it – for people who are not actually doing it, it is so hard to connect one with the other. I hope that by being here and asking questions and answering questions – and hopefully people will also be very interactive with us and ask us to talk about things they want to know – we all try to connect it, little by little. I am not saying it is rocket science and nobody can understand it. I am just saying for some people who don’t have a multi-background experience, they might have big gaps. [0:40:38.7] NL: And that is for sure. This was a very useful episode for me. I am glad to know that everybody else is just as confused as to what cloud native applications actually mean. So that was awesome. It was a very informative episode for me and I had a lot of fun doing it. So, thank you all for having me. Thank you for joining us on this week of The Podlets Podcast. And I just want to wish our friend Brian a very happy birthday. Bye you all. [0:41:03.2] CC: Happy birthday Brian. [0:41:04.7] BL: Ahhhh. [0:41:05.9] NL: All right, bye everyone. [END OF EPISODE] [0:41:07.5] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END] See omnystudio.com/listener for privacy information.


10 Feb 2020

Rank #5

Podcast cover

Jobs in Cloud Native (Ep 14)

Our topic in today's great episode is how we think jobs in software engineering have changed since the advent of cloud native computing. We begin by giving our listeners an idea of our jobs and speak more to what a job in cloud native would look like as well as how Kubernetes fits into the whole picture. Next up we cover some old challenges and how advances in the field have made those go away while simultaneously opening the gateway to even more abstract problems. We talk about some of the specific new developments and how they have changed certain jobs. For example, QA has not disappeared but rather evolved toward becoming ever more automated, and language evolution has left more space for actual development instead of debugging. Our conversation shifts toward some tips for what to know to get into cloud native and where to find this information. We wrap up our conversation with some thoughts on the future of this exciting space, predicting how it might change but also how it should change. Software engineering is still in a place where it is continuously breaking new ground, so tune in to hear why you should be learning as much as you can about development right now. Follow us: https://twitter.com/thepodlets Website: https://thepodlets.io Feeback: info@thepodlets.io https://github.com/vmware-tanzu/thepodlets/issues Hosts: Carlisia Campos Bryan Liles Nicholas Lane Key Points From This Episode: • The work descriptions of our hosts who merge development, sysadmin, and consulting.• What a cloud native related job looks like.• Conceptualizing cloud native in relation to development, sysadmin, and DevOps.• A cloud native job is anything related to building software for other people’s computers.• Kubernetes is just one way of helping software run easily on a cloud.• Differences between cloud native today and 10 years ago: added ease through more support.• How cloud native developing is the new full stack due to the wide skillset required.• An argument that old challenges are gone but have introduced more abstract ones.• Advances making transitioning from testing to production more problem-free.• How QA has turned into SDE, meaning engineers now write software that tests.• Why jobs have matured after the invention of cloud native.• Whether the changes in jobs have been one of titles or function.• How languages like Rust, Go, and Swift have changed developer jobs by being less buggy.• What good support equates to, beyond names like CRE and company size.• The many things people who want to get into cloud native should know.• Prospective cloud native workers should understand OSs, networking, and more.• Different training programs for learning Kubernetes such as CKA and CKAD.• Resources for learning such as books, YouTube videos, and podcasts.• Predictions and recommendations for the future of cloud native. • Tips for recruiters such as knowing the software they are hiring for. Quotes: “What is the cloud? The cloud is other people’s computers. It's LPC, and what is Kubernetes? 
Well, basically, it’s a way that we can run our software on other people’s computers, AKA the cloud.” — @bryanl [0:07:35] “What we have now is we know what we can do with distributed computing and now we have a great set of software for multiple vendors who allow us to do what we want to do.” — @bryanl[0:10:03] “There are certain challenges now in cloud native that are gone, so the things that were hard before like spinning up a server or getting the database are gone and that frees us to worry about more complicated or more abstract ideas.” — @apinick  [0:12:58] “The biggest problem with what we are doing is that we are trailblazing. So a lot of the things that are happening, like the way that Kubernetes advances every few months is new, new, new, new.” — @bryanl  [0:36:11] “Now is the literal best time to get into writing software and specifically for cloud native applications.” — @bryanl  [0:42:22] Links Mentioned in Today’s Episode: Azure — https://azure.microsoft.com/en-us/ Google Cloud Platform — https://cloud.google.com/ AWS — https://aws.amazon.com/ Amazon RDS — https://aws.amazon.com/rds/ Mesosphere — https://d2iq.com/ Aurora — https://stackshare.io/stackups/aurora-vs-mesos-vs-mesosphere Marathon — https://mesosphere.github.io/marathon/ Rails Rumble — http://blog.railsrumble.com/ Terraform — https://www.terraform.io/intro/index.html Swift — https://developer.apple.com/swift/ Go — https://golang.org/ Rust — https://www.rust-lang.org/ DigitalOcean — https://www.digitalocean.com/ Docker — https://www.docker.com/ Swarm — https://www.santafe.edu/research/results/working-papers/the-swarm-simulation-system-a-toolkit-for-building HashiCorp — https://www.hashicorp.com/ Programming Kubernetes on Amazon — https://www.amazon.com/Programming-Kubernetes-Developing-Native-Applications/dp/1492047104 The Kubernetes Cookbook on Amazon — https://www.amazon.com/Kubernetes-Cookbook-Building-Native-Applications/dp/1491979682 Kubernetes Patterns on Amazon — https://www.amazon.com/Kubernetes-Patterns-Designing-Cloud-Native-Applications/dp/1492050288 Cloud Native DevOps with Kubernetes on Amazon — https://www.amazon.com/Cloud-Native-DevOps-Kubernetes-Applications/dp/1492040762 Kubernetes in Action on Amazon — https://www.amazon.com/Kubernetes-Action-Marko-Luksa/dp/1617293725 Managing Kubernetes on Amazon — https://www.amazon.com/Managing-Kubernetes-Operating-Clusters-World/dp/149203391X Transcript: EPISODE 14 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you. [EPISODE] [0:00:41.3] NL: Hello and welcome back. This week, we’ll be discussing the thing that’s brought us all together, our jobs. But not just our jobs. I think we’re going to be talking about the difference kind of jobs you can find in cloud native land. This time, I’m your host, Nicolas Lane and with me are Brian Liles. [0:00:57.1]  BL: Howdy. [0:00:58.0] NL: And Carlisia Campos. [0:00:59.6] CC: Hi everybody, glad to be here. [0:01:02.6] NL: How’s it going you all? [0:01:03.7] CC: Very good. [0:01:05.4] NL: Cool. 
To get us started, let’s talk about our jobs, and like, what it means to have a job in, like, cloud native land, from our current perspective. Brian, you want to go ahead and kick us off? [0:01:17.8] BL: Wow, cloud native jobs. What is my job? My job is – I look at productivity of developers and people who are using Kubernetes. My job is to understand cloud native apps but also understand that the systems that they are running on are complex and, whether they be Windows or Linux or Mac based, being able to understand those too. Really, my job is the combination of a senior developer composed with a senior level admin, whether it be Windows or Linux. Maybe I am the actual epitome of DevOps. [0:01:58.5] NL: Yeah, you seem to be kind of a fusion of the two. Carlisia? [0:02:03.3] CC: My job is – so I’m mainly a developer, but to do the job that I need to do, I need to be a bit of a DevOps person as well because, as I’ve talked about many times here on the show, I work on an open source tool called Velero that does backup and recovery for Kubernetes clusters. I need to be able to boot up a cluster with at least the three main providers: Azure, Google Cloud Platform, and AWS. I need to know how to do that, how to tweak things, how to troubleshoot things, and I don’t think – when we think of just a straight up developer, that usually is not part of the daily activity. In that sense, I think – I’m not sure how we would define the cloud native job, but I think my job, if there is such a thing, my job definitely is a cloud native job because I have to interact with these cloud native technologies, even beyond what I – the actual app that I’m developing, which runs inside a Kubernetes cluster, so it all ties in. And you, Nick? [0:03:16.0] NL: My job is, I’m a cloud native architect or a Kubernetes architect, I’m not sure what we’re calling ourselves these days honestly. What that means is we work with customers to help them along their cloud native journey. Either that means helping them set up like a Kubernetes cluster and then getting them like running with certain tools that are going to make their life easier, or helping them develop tools in their cloud environments to help make the running of their jobs easier. We kind of run the gamut of developers and sysadmins a bit and consultants. We kind of touch a little bit of everything. Let’s take a step back now and talk about what we think a cloud native job looks like. Because for me, that’s kind of hard to describe. A cloud native job seems to be any job that has to do with some cloud native technology, but that’s kind of broad, right? You could have things from sysadmins, people who are running their cloud infrastructure for the company, who are like managing things like, you know, rights access, accounting, that sort of thing, to people who are doing development like yourselves, like Brian and Carlisia, you guys are doing this type of work. Is there anything that you think is like unique to a cloud native job? [0:04:35.2] CC: Yeah, it’s very interesting to talk about, I think, because especially in relation to, if you don’t have a cloud native job, what do you have and how is it different? I wonder if the new cloud native job title is the new full stack developer for developers because I think it’s easier to conceptualize what a cloud native job is for a systems admin or DevOps person. But for a developer, I think it’s a little more tricky, right? Is it the new full stack?
Is it now that the developer even if you’re not doing – for example, my application runs inside Kubernetes, it’s an extension of Kubernetes but some applications just run on Kubernetes as a platform. Now, are we talking about developers with a cloud native title like ‘cloud native software engineer’ and for those developers, does it mean that they now have to design, code and deploy consistently?  You know, in my old days, when I – before doing this type of work, I would deploy apps but it was not all the time. There was a system, every single job I had, the system was different. The one thing that I love about Kubernetes is that if I was just a regular app developer, again as supposed to like extending Kubernetes, right?  If I was building apps that would run on Kubernetes as supposed to extending Kubernetes, and if I had to deploy them at Kubernetes, if I move jobs and they were working with Kubernetes, this process would be exactly the same and that’s one really cool thing about. I wouldn’t mind – in other words, I wouldn’t mind so much if I had to do deployment in the deployment, the process was the same everywhere. Because it’s really painful to do like a one off deployment here and there, each place was different, I had to write a ton of notes to make sure, you know – it was like, 200 stacks and if anyone of them, you had to troubleshoot and I’m not a systems admin so it will be a struggle. [0:06:44.6] BL: Yeah. [0:06:45.8] CC: Because each system – it’s not that I couldn’t learn but each system would be different and I make – anyway, I think I went off on a tangent. [0:06:53.1] NL: No worries. [0:06:54.1] CC: But I also wanted to mention that I searched on LinkedIn for cloud native in the jobs section and there are a ton of job titles, job postings with cloud native in the title like a lot of it is architect but there is also product manager, there is also software engineer, I found the one that was senior Kubernetes engineer. It’s definitely a thing. [0:07:21.0] BL: All right. What is the question here?  [0:07:25.6] NL: It was what do we think a cloud native job looks like essentially? [0:07:29.6] BL: All right. I’m going to blow your mind here. Basically, what is the cloud? The cloud is other people’s computers. It's LPC and what is Kubernetes? Well, basically, it’s a way that we can run our software on other people’s computers, AKA the cloud. Kubernetes makes running software in the cloud easier. What that really breaks down to is if you are writing software on other people’s computers or if you were designing software that runs well on clouds, well, you’re a cloud native person. Actually, the term is basically been co opted for marketing purposes by who knows who.Basically, everyone. But what I think is, as long as you are working on software that runs on modern infrastructure which means that nodes may go away, you might not own all of your services, you might use a least database server, you know, something like RDS from Amazon. Everyone working in that realm, working with software that may go away with things that aren’t theirs, was doing cloud native work. We just happen to be doing on Kubernetes because that’s the most popular option right now. It isn’t the only option and it probably won’t be the final option. [0:08:48.7] CC: Do you see any difference between what is required for a job like that today? Versus maybe 10 years ago or five years ago? Brian? [0:08:58.0] BL: Yeah, actually, I do see some differences. 
One of the biggest differences is that there’s a lot more services out there that are provided to help you do what you need to do and so 10 years ago, having a database provider would be hard because one, the network wouldn’t be good enough and you’re hosting company probably didn’t have that unless you were at AWS and even they didn’t have that. Now, what we get to take advantage off is things are just easier, it’s easier to fire up databases, it’s easier to add nodes to our production. It’s easier to have multiple productions, it’s easier to keep that all in order. It’s easier to put automated configuration around that than it was 10 years ago.  Now, five years ago, back in 2014, I would actually say that the way that we progressed since then is that we became more mature. I remember when Kubernetes came out and I thought it was going to win but Mesosphere was, mesosphere with Aurora, or marathon was actually better than Kubernetes, just it worked out of the box, for what we thought we could do with it but now, what we have now is we know what we can do with distributed computing and now we have a great set of software for multiple vendors who allow us to do what we want to do. That’s the best part about now versus five years ago. [0:10:17.7] CC: Yeah, I have to agree with that, it’s definitely easier. As a developer, I’m not going to tell you it’s easy but it’s easier. As an example. I remember when that was Rails Rumble maybe 10 years ago, I don’t know. [0:10:31.3] BL: Yeah, I remember. [0:10:34.0] CC: You did a video showing step by step how to boot up a Linux server to run apps on that server. I don’t remember why we needed to boot up from scratch. Remember that Brian? [0:10:46.9] BL: I do remember that. That was 2007 or eight? It was a long time ago.  [0:10:53.0] CC: That was one of the place that made me very impressed about you because I followed all the steps and at the end it worked. You just was – you were right on with – as far as the instructions went. I think doing that, I think it took me about two hours, I remember it took a long time and because this again, these are things that I do once in a while, I don’t do these things all the time.  Now, we can use a Terraform script and have something running in a matter of 15 minutes if you have. [0:11:26.9] BL: Side bar. Quick side bar. Yeah, we can use Terraform. I use Terraform for even all my personal infrastructure so things that are running in my house use Terraform. All my work stuff uses Terraform. But still, it’s sometimes easier to just write a script or type in the commands on the command line or click something. We’re still not to the point where using things like Terraform actually makes us not want to do it manually. That’s how I know that we’re not to our ultimate level maturity yet. But, if you want to, the options are there and they’re pretty good. [0:12:01.8] CC: Yeah. [0:12:03.6] NL: Carlisia, you said something that kind of reminded me and maybe kind of get down this path. While we’re talking about like there are certain challenges that we aren’t faced anymore in a cloud native land like things are easier, there are certain things that are easier, not to say that our jobs are easy, like you’re saying Carlisia. But it was something along the lines of like a developer now needs to be – like a cloud native job is now the full stack kind of job or full stack developer. That was the name of the game back in the day, now, it’s a cloud native job. 
I actually kind of agree with that in a sense where a cloud native developer or anyone in the cloud native realm has to exist not just in their own silo anymore. You need to understand more of the infrastructure that you’re using to write your code on someone else’s computer better. I actually kind of like that. [0:12:56.6] CC: Exactly. [0:12:58.0] NL: Yeah, there are certain challenges now in cloud native that are gone so the things that were hard before like spinning up a server, you know, getting the database, these things are gone and that now that frees us to worry about more complicated or more abstract ideas like how do we have everyone agree on the API to use and thus rises Kubernetes. [0:13:19.1] CC: Yeah, I see that as a very positive thing. It might sound like – it’s a huge burden to ask developers to now have to now this but again, if we stick to the same stack, the burden diminishes really quickly because you learn it once and then that’s it. That’s been a huge advantage. If it works out this way, I mean, I’m all for like you know, the best technology should win. But there is that advantage. If we remain using the same container orchestrator, you know, we use containers, we can run our code as if we were running any machine. One advantage that I see is that I’ve had cases where you know, these was working on my computer (™) and it will be deployed and one little stupid thing wouldn’t work because the way the URL was redirected, didn’t work, broke things, I got yelled at. I’m like, “Okay, you want me to do this right? Give me a server.” Back then, good luck, “I’m going to give you a server, no way.” It was just so expensive, developers will be lucky to get a tasking of our staging environments. And, even when you get there, you had to coordinate with QA and there was a process. Now, because I have access to my own servers here, right? I can just imagine if I were a developer, building apps to run Kubernetes, admin could just say, “Okay, you have these resources, go for it.” I’ll have my own name space and I could run my code as if it was running and a production environment and I’ll have just more assurance that my code works basically. Which is to me so satisfying. It’s one less thing to worry about if I deploy something to production, I have already tested it. [0:15:17.3] NL: Yeah, that’s great. That’s something I really do cherish about the current landscape. We can actually test these things out locally and have confidence that they’ll work at least fairly well in production, right? [0:15:30.2] CC: It’s not just running things locally, you can actually get access to like a little slice of let’s say an AWS server and just shift your things there and test it there. But because these system admins people, they can just carve out that little one slice for your team of even in the per person basis, maybe that’s too much but it’s relatively uncomplicated to do that and not very costly. [0:15:56.9] NL: Yeah. You mentioned a team and the name of a team that I haven’t heard of in quite some time which is QA. How do we think the rise of cloud native have affected jobs and also kind of tangential to that, what were jobs like prior to cloud native because I haven’t heard of a QA team in many of the organizations that I’ve touched. Now, I’m not touching their like production dev team that they actually make this, I just haven’t heard of that name in a while and I’m wondering if like jobs like that have kind of gone away with the rise of cloud native. 
[0:16:27.7] BL: No, I'm going to end that rumor right here, that is a whole untruth.

[0:16:33.8] NL: That was not a rumor, it was just conjecture on my part, literally unfounded.

[0:16:38.7] BL: We've got to think, what does QA do? QA is supposed to be responsible for the quality of our applications, and when they first started, there wasn't a lot of good tooling, so a lot of our QA people were manual testers. They started the app, they clicked on everything, they put in all the inputs until it came back, and they were professional app breakers. I'd say, over a decade ago, we got more automated tools, and moving into now, you can automate your web browser, you can actually write software to do all the actions that a human would do. What we found is that QA as a profession has actually matured, and you can see that because Google, I don't think they even have QA; they have, what do they call them? Software engineers in test, or SDETs.

What they do is – they're developers in their own right, but they write software that makes it easier for developers to write code that works well and code that can be tested. I think that the role has matured and has taken another angle in a lot of cases, even where we work. There are QA engineers in our group and we still need them, because you've seen the meme where you talk about unit testing and it would be like a door that had all the right parts but didn't fit in its casing, or two hot handles on a sink. The pieces work, right? They both put out hot water, but together they didn't work. We still have that, it just looks a little bit different now.

Also, a lot of software is not written in a huge monolithic cycle where we would take six months to release a new version, or a year, or a year and a half. Now, people are trying to turn around a lot of software quicker, so QA has had to optimize how they work to fit in those processes. They're still there though.

[0:18:37.8] CC: I would hope so. I mean, I can't answer the question if the question is do we have as much QA effort out there as before. I don't know, but I hope so, because if you don't have QA, if you're not QAing your apps, then your users are. That's not good.

For my team, for example, we do our own QA, but we do QA. We don't have separate people doing it, we do it ourselves. It might be just because it's pretty special – I mean, we are a small team to begin with and what we do is very specialized. It would be difficult to bring someone in and teach them, and if they're just running QA, I don't know, maybe – I don't think it would be that difficult, we could just have instructions, you know: "Run this command, this is what the output should be." I don't think it would be that difficult, I take that back, but still, we do it ourselves.

[0:19:31.7] NL: The question was more – less in line with, "What happened to QA?" It was more like, how do we think that cloud native has affected jobs and the job market? And it sounds like jobs have changed because of cloud native, they've matured, as we were just discussing with QA, where people aren't doing the same kind of drudgery or the same kind of toil that they were doing before. Now, we're using more tooling to do our jobs and kind of lifting up each position to be more cloud native-y, right? More development and infrastructure focused at the same time. At least, that's what I was getting from it.

[0:20:09.8] BL: Yeah, I think that is true, but I think all types of development jobs, especially jobs that are in the cloud native space, have changed.
One good example would be, with organizations moving to cloud native apps, we're starting to see – and this is all anecdotal, I have no evidence to back this up – that there are more developers who are on call for the software they write, because, one, they know it better than anyone else and they're closer to it. And two, because having an ops group that just supports apps isn't conducive to being productive, because there's no way that one group can understand all the apps.

What we're finding in this new cloud native era is that jobs are maturing: they're getting new functionality, they're losing some functionality, some jobs are combining, but at the end of the day, it's still the same thing we were doing 20 years ago. It all just has new titles and we use new software to do it. Which is good, because some of the ideas that we came up with 20, 30 years ago are still good ones today.

[0:21:15.5] NL: Yeah, that's actually an interesting question. Do you think that it's just the titles that are changing or are the functions changing, right? It's like sys admins used to be sys admins, then they were DevOps for a while, and now they're SREs, I should say. Our support team are now CREs, Customer Reliability Engineers. Is that just a title change or are there functional differences? I'm inclined to believe that there are functional differences.

[0:21:43.9] BL: I think it's both. I think it's the same reason why all engineers after two years in the field are somehow senior engineers. People feel like they have progressed when they get new titles, even though you're the most junior engineer on this team – how can you be a senior engineer? And then also the same thing with CRE – shout out to Google for making that term popular – but really, what it comes down to is maybe the focus changed, but maybe it didn't. Maybe we were already doing that, maybe we were already doing resilience engineering with our customers, and maybe we already had a great customer support or customer success team.

But I do think that there have been some changes in jobs, because what we're finding with the modern languages that people are using – teams are moving away from C++ to things like Swift and Go and Rust – is that because our software is easier to write, we actually don't have to think about some of the things that we did before. With Go, technically, you don't have to worry about memory access. With Rust, 100%, you don't have to worry about null pointer exceptions; they don't exist.

Now that we've freed our developers to do more development rather than more debugging, what we find is that the jobs will actually change over time. But at the end of the day, even where we work right now and all over the place, people are devs, they do ops stuff, they do security stuff, or they're pointing here at Boston. I challenge anyone listening to this to find something where I am not telling the truth: we might do both, or more than one thing, but at the end of the day, we can still break it down to what people do.

[0:23:24.8] NL: Yeah, Carlisia, any thoughts on that?

[0:23:26.1] CC: No, I think that sits well with me, sounds right.

[0:23:29.8] NL: Yeah, I agree. I think that there are some functional changes. I think that support versus CRE isn't just like receiving tickets and then going to a ticket queue and filing those things.
I think there are some changes, like, I know from our CRE team they are actively going out and saying, "Here's our opinion based on these technologies and this is why we validated these things." They are reevaluating their support model constantly and just making sure that they're abreast of everything that's going on, so they can more resiliently engineer their customer support.

[0:24:04.5] BL: But hold on, one second though. That's what I'm talking about with the marketing, because guess what? It is support – a good support team would be doing all those things whether it's called customer reliability engineering or whatever. It's support, it's customer success, it's getting in front of our customers' problems and having the answers before they even ask the question. That's good support.

Whenever we label things like CRE, that's somebody in some corporate marketing department who thought that was a good idea. But it doesn't mean that because you don't call it CRE, it's not good support, because I will tell you, in the past at DigitalOcean we did that, and the term CRE didn't even exist yet, but we were out there in front of problems whenever we could be and we thought that was good for our customers. What we're finding is that people have the capabilities now, with the progress of whatever technologies we have, that we can actually give our customers good support, and you don't have to be a Google-sized company to do that anymore. That's the plus.

[0:25:02.9] NL: Yeah, I agree with that.

[0:25:05.5] CC: I want us to talk a little bit about, for people who are not working in the cloud native space but who see it coming or want to move towards doing something more in that area, what should they be looking at? What should they be brushing up on, learning, or incorporating into what they are currently doing? And of course there are different roles, so it will be a little different for each role. We have developers, we have DevOps or SRE or admins or operators, managers, recruiters. It changes a little bit for everybody.

[0:25:47.5] BL: Well, I will hop in here first and say it is all code at the end of the day. When it comes down to what we are doing in cloud native for ops, it doesn't really matter. You could take a lot of the same principles and do them on prem or wherever else you happen to be. I mean, I am not trying to diminish the role of anyone that we work with or anyone in our industry when I say this, but when it comes down to it, what I see is people understand the operating system, mostly Linux.

People understand public key encryption, so they understand PKI – you know, we deal with a lot of certs. They understand networking; they can tell you how many IPs are in a /23, and I'm just throwing CIDR numbers out there. These are things that people know. I don't think there is anything specific to cloud native other than learning Kubernetes itself, or Mesosphere, or Docker Swarm, or whatever, or the tool from HashiCorp that always escapes me whenever I have to say it out loud. But it is all the same thing.

What you need to know to be good at any job where you are doing ops: you need to understand the theory of operating computers. You need to understand operating systems, networking and how that all works, and then everything around that, and some security. For developers now, it is a little bit interesting, because a lot of the apps that we are writing these days are more stateless. So as a developer you need to think more about "my app may crash."
So anything that I am holding in memory that is important can go away at any given time. So either, one, I need to store it on more than one thing, I need to have it in a redundant fashion, or, two, I need to store it in the database instantly.

And I would once again challenge anyone – if you are a developer who can actually understand those topics, you would be able to apply for a cloud native job in this space, because frankly a lot of developers, a lot of cloud native developers writing apps in cloud native, two years ago were doing something else.

[0:27:50.1] CC: Yeah, that sounds right. For developers, like you said, I think focusing on authentication: how do you handle secret keys, and the question of authentication and authorization, and, if you can, even being well-versed in developing clients and servers and handling certs for that interaction. I guess it comes down to being well-versed in distributed systems development, which is what this whole cloud native thing is all about, and on top of that, I think, being well-versed in how to push your apps into containers.

You know, creating images, creating containers, pushing them into a repository, pulling them from the repository, and manipulating and creating containers in different ways. Then on top of that, maybe you want to learn Kubernetes, and we can talk about that too, but I wanted to give Nick a chance to talk about his take.

[0:28:59.0] NL: I agree with pretty much everything you guys have said. I think the only thing I would add is really understanding how to use and work with an API and an API-driven environment, because that is what a lot of cloud native is, right? It is using someone else's computer, so how do you do that? It is via an API. Like, we're talking about containers and orchestration – those are all done, hopefully, via an API. Luckily, if you are using Kubernetes, which likely you are, it is all API driven.

So using an API, I think, and getting familiar with that. Most developers, I think, at some point are familiar with that, but that would be the main thing I would add, outside of what you and Brian have already said, as what is needed to do a cloud native job.

[0:29:40.2] CC: Yeah. Now, if someone wanted to learn Kubernetes, well, there is the Kubernetes Academy.

[0:29:47.5] NL: There is a Kubernetes Academy.

[0:29:49.4] CC: That is pretty awesome, but do you think going through the certification would help?

[0:29:55.3] NL: I think that is a good place to start. So the current certification that exists is the CKA, the Certified Kubernetes Administrator, and I think that is a good starting place, especially for someone who has not really touched Kubernetes before. If they're like, "How do I know the basics of Kubernetes?" going through that certification process I think will be a huge step forward, because that really covers most of what you are going to touch on a day-to-day basis with Kubernetes.

[0:30:21.6] CC: And there is the CKAD as well, which is for developers. The CKAD is Certified Kubernetes –

[0:30:29.7] BL: Application Developer.

[0:30:31.7] CC: Application Developer, and the other one is Certified Kubernetes Admin.

[0:30:34.9] NL: Yeah, I was like, "administrative developer?" like.

[0:30:38.2] CC: If you are brand new, I think it is worthwhile doing the developer one first because it is mostly the commands.
You go through the commands just so you have a knowledge of how to interact with Kubernetes, and the admin one is more like how you manage and troubleshoot a cluster. So it is more involved, I think. You need to know more, but in any case, I agree with you that it would help, because it serves as a syllabus for what to learn.

It is like, "Okay, these are the things that, if you know them, would help you a lot if you had to do anything with Kubernetes."

[0:31:14.6] NL: Yeah, I don't think that you need to have a certification to do a job. I really don't think so, unless it is like required by law, like you have to.

[0:31:23.2] CC: No, yeah, that's not at all what I am saying, but if you don't know anything at all and you're like, "Where do I start?" I would recommend that. That is not a bad place to start. Or if you know some things but you feel like you don't know others and you want to fill in the gaps and you don't know what your gaps are, also same idea. What do you think, Brian? Do you think having this certification would be useful?

[0:31:46.5] BL: I don't know, some people need it, but I also barely graduated from high school and I don't have a college degree. So I have always leaned on myself for learning things on my own schedule, at my own pace, on my own terms, but some people do need the structure provided to them by certifications, and I've only heard good things from people taking those tests. So I think for some people it is actually really good, but for others, it might be a waste of time, because what will actually happen if you get that certification?

If you work at some large companies – I do know this for a fact – getting your AWS certificates actually had a money thing behind it, but in a lot of places, I don't know. But it couldn't hurt. That is the most important piece. It can't hurt.

[0:32:36.3] NL: Yeah, I totally agree. You learn at least something – even when I am taking a certification exam for something that I was already pretty aware of, I always learn at least one thing by taking the examination. There's always that one good question that you likely have never even thought of. But I also agree with Brian, where it is like, I don't have my CKA and I think I am a pretty damn good expert on Kubernetes. So I don't think anything would change for me if I took the exam.

[0:33:00.2] CC: Oh yeah. I work with so many people who have none of those sorts of certifications and they are absolutely experts. I was talking about how it would help me. I want to take those two certifications because it would help me fill in the gaps, and I know there is a lot that I am going to learn, especially with the admin one. So it is using the curriculum as a guide for what I need to learn, and then testing, did I really learn it? And also it would make me feel good, but other than that, I don't think it has any – I don't know, I don't think it is bad either.

[0:33:33.4] BL: And that is the most important piece, what you just said – it made you feel good, because you take certifications to test your knowledge against yourself in a lot of cases. So I think it is good. I just realized you can – I mean, people cannot see behind me. I don't think I have as many books as Carlisia has up there, but I have read all of mine except for like four of them.

[0:33:52.6] CC: Yeah, I did not read all of these books. I mean, a lot of these books are school-related books that I kept because they are really good, and books that I have acquired, and I have read some but not the entire book.
Some things I use for reference but definitely have not read. Don't be impressed, I have not read all of these books. Hopefully one day when I retire, maybe. Anyway –

[0:34:17.7] BL: I think one interesting thing is that the amount of study that you need to do to gain a certification when you are not working in the space actually gives you that little bit of push you need to make sure that you know what you need to know. So if you organically came to cloud native as I did, as I'll explain in my story, you know, I am not really interested in that certification.

But if I wanted to change – maybe I wanted to change my focus to doing more graphics stuff and there was a certification for that – maybe I would think about it, just to make sure that I knew I was eligible for the jobs I was trying for, so.

[0:34:57.8] CC: Yeah.

[0:34:58.8] NL: Yeah, that makes sense. Also, my books are over there, and I have read most of the way through many of my books but not all the way through, because a lot of them are boring.

[0:35:09.5] BL: But I will say, since we are talking about books and talking about getting yourself into Kubernetes land, right now is actually the best time to buy books, because there are lots of them, and I am not actually saying that these books are all super awesome, but some of them are. Notably, this Programming Kubernetes book is pretty awesome, and the reason it is so awesome is because my quote is on the back of it.

[0:35:33.3] NL: I was going to say.

[0:35:34.4] BL: Yeah, my name is on the back of it. Then another book that I just picked up lately is called the Kubernetes Cookbook, for building cloud native applications, from O'Reilly, and the reason that I like it is because I have always, since I mean 20 years ago, loved the O'Reilly cookbooks: small problem, answer, with an explanation. And then there is another one called Kubernetes Patterns, which I just started and I think is pretty good too.

And just to say, these are not endorsements, but this is what I am reading right now. It is like a thousand pages here – the things that I am trying to get through right now to keep up to date with what we are trying to do, because actually the biggest problem with what we are doing is that we are trailblazing. So a lot of the things that are happening, like the way that Kubernetes advances every few months, are new, new, new, new.

So there is not a lot of prior art in what we are doing that is public. So what you need to do is turn yourself into someone who actually understands the theory of what we are doing rather than just the practical application of it. Understand that piece too, but you've got to understand the theory, which is why I said I've literally been doing the same thing for the last 25 years, because I learned how to program and I learned Unix and then I learned Linux and then I learned networking.

I take all of those lessons and I can apply them all the time. So that is actually the most important part of any of this.

[0:36:56.9] CC: Yeah, I agree with you – going through the fundamentals helps so much more than going through the specifics. In fact, trying to learn specifics without having the fundamentals can be very painful, but then you learn the fundamentals and you go, "Oh yeah, it totally makes sense."
I have been trying to listen to YouTube lectures on server systems, and I have a lot of moments of, "Ah, that is why Kubernetes works this way, to address this problem."

And I have that Programming Kubernetes book, which is not in my office. I have to find it, but yes, that is a very good book. I have this one.

[0:37:37.9] BL: Oh, Cloud Native DevOps with Kubernetes. That is another good book.

[0:37:43.5] CC: Yes.

[0:37:44.3] BL: I have it too.

[0:37:46.3] CC: You like that one as well?

[0:37:47.6] BL: Yes.

[0:37:48.2] CC: Good book. And this one, I haven't gotten through it yet.

[0:37:52.7] BL: It is called Kubernetes in Action.

[0:37:54.3] CC: Yes, thank you for saying the name, because if you are not watching the video you wouldn't know.

[0:37:58.5] BL: So really what we are saying now –

[0:37:59.7] CC: People say great things about Kubernetes in Action, this one.

[0:38:02.9] BL: So I actually want to bring up another thing and say, I read a lot. I like to read. I read a lot of blog posts, and here is another crazy thing: the YouTube videos from KubeCon – every year, or every few months, we publish 180 talks for free, and there are some good lessons in those. So the good thing about getting into cloud native is that you can get into it for cheap, because all of this information is out there. The Kubernetes source is free. Go read it.

I mean, 5,000 developers have worked on it. I am sure you will get a lot out of that, go do that. But also YouTube talks, blog posts, just following your favorite Kubernetes SIGs – Special Interest Groups – and their community meetings. You can learn so much about how this space works and really how to write software in it without spending a dime, other than having a computer and Internet.

[0:38:55.5] CC: Yeah, and I am going to give a tip for people that I actually caught on to not too long ago. I subscribed to YouTube Premium, which I think is $5 a month. It is the best $5 I have ever spent, because really, I don't have time to sit in front of a video unless it is very special and just watch something, and reading is also – after I spend a whole day reading code, my mind doesn't want to read anything else. So I love podcasts and I listen to a lot of podcasts.

And now the YouTube videos have been even more educational for me, because with the premium version of YouTube, if your phone locks, it will still play.

[0:39:40.2] BL: And you can download the videos.

[0:39:42.0] CC: You can download the videos too. Yeah, if you go on a camping trip or an airplane, you have them, so it's been fantastic. I just put on my headset, my little Bluetooth headset, and as I am doing laundry or as I'm cooking or anything, I am always listening to something. There goes the tip.

[0:40:01.9] NL: Yeah, I totally agree. I love YouTube Premium. No ads, as Brian said, is the best. I am going to throw out a book recommendation, one written – or co-written – by my colleague and good friend, Craig Tracey, called Managing Kubernetes, and, like I was saying that these tech books are kind of boring, this one is actually a lot of fun to read. It is written well, in a way that I found I kept turning the page. So I really liked it.

[0:40:26.3] BL: Yeah, it is only 150 pages too.

[0:40:29.3] NL: Yeah, that is pretty short.

[0:40:30.5] BL: And the software that Carlisia writes is in the last chapter of it – the next-to-last chapter, so.

[0:40:37.6] NL: Oh shoot, all right, throw it out then.
[0:40:40.4] BL: Well no, I am just saying it is another good book, and I like the way you bring this up, because this information is out there. But I know we're coming close to the end and I had one thing that I wanted to talk about today.

[0:40:50.3] NL: I was just about to bring that up, please take us away.

[0:40:52.3] BL: All right, so we talked about where we come from, and we talked about things in the space, about the jobs, how we keep up to date, but really, the most important piece is what happens in the future. You know, Kubernetes is only five years old, so theoretically cloud native jobs are only a few years old. So how does cloud native move into the future? I do have some thoughts on this one.

So what we are going to see is what we have seen over the last two decades: our stacks will get more complex, we will run more apps, we will have more CPUs and more networking, and it is not even Moore's Law stuff. We'll just have more stuff. So what I find is that in the future, what we need to think about are things like automation. We need to think about better resilience – apps that can actually take care of themselves. So your app goes down, what happens? Well, nothing, because it brought itself back up.

So I see that the jobs that we have now are just going to evolve into better versions of what we have right now. So developers will still be developing. The more interesting piece is that we are going to have more developers, because more people are taking these boot camp courses, more people are going into computer science in school. So we are actually going to have more developers out there. So all that means is that we are just going to have more problems to solve, at least for the next few years.

A generation from now, I couldn't tell you what is going to happen. Maybe we will all be out of work. I will be retired, so I probably won't care, but just think about this. Now is literally the best time to get into writing software, and specifically for cloud native applications, whether you are in operations or you are writing applications that run on clouds or anything like this. This is the best time, because it is still the beginning and there is more work to do than we have people, and if you look through job postings you'll realize that, wow, everyone is looking for this.

[0:42:48.3] CC: Yeah, and at the same time, there is a sufficient amount of resources out there for you to learn, whether you don't want to pay or you can't pay. We are not so much at the beginning that there is nothing out there, so it is a very good time.

[0:43:04.6] NL: Yeah, the wealth of knowledge that is out there for free is unheard of. It is unprecedented, and yeah, I totally agree that this is the best time. Brian, if we go by your thesis throughout this entire episode, basically we are going to be doing the same thing in 20 years as we are doing now, and it is the same thing we did 20 years ago. So it is probably going to be like you said: developers are going to develop-ate, sys admins are going to sys administrate.

[0:43:28.6] CC: I love that.

[0:43:30.1] BL: And security people are going to complain about everything.

[0:43:33.4] NL: That is how we are going to change. So we are just going to be running, like, quantum applications in 20 years, but they are still going to be if/else statements.
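Brian's point about resilience, apps that bring themselves back up when they go down, is what reconciliation-style control loops do, whether inside Kubernetes controllers or in your own code. Purely as an illustrative sketch under my own assumptions (nothing here is from the episode), this is the basic shape of that idea in Go: keep observing the work, and when it fails, restart it instead of paging a human.

// supervise.go: hypothetical sketch of the "it brought itself back up" idea:
// a loop that keeps a worker running, restarting it whenever it fails.
package main

import (
	"errors"
	"log"
	"math/rand"
	"time"
)

// doWork stands in for the real application; here it fails randomly so the
// supervisor has something to recover from.
func doWork() error {
	time.Sleep(500 * time.Millisecond)
	if rand.Intn(3) == 0 {
		return errors.New("simulated crash")
	}
	return nil
}

// supervise is the reconciliation-style loop: run the worker, and if it
// fails, bring it back after a short back-off rather than giving up.
func supervise(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		if err := doWork(); err != nil {
			log.Printf("worker failed: %v; restarting", err)
			time.Sleep(time.Second) // back off briefly before retrying
			continue
		}
		log.Println("worker healthy")
	}
}

func main() {
	stop := make(chan struct{})
	go supervise(stop)
	time.Sleep(5 * time.Second) // let the demo run briefly
	close(stop)
}

At cluster scope, a Kubernetes Deployment does the same job for whole pods: the desired state says "keep N replicas running," and the control loop replaces anything that dies.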
[0:43:41.1] CC: My prediction is that we are going to have greater server access, like easier server access, especially for developers, and there will be more buttons to press and more visual tools, so you don't necessarily have to be logging into a server to run command lines – we'll have more tools abstracting all of that detailed work for developers.

[0:44:07.0] BL: So more abstractions on top of abstractions.

[0:44:10.1] CC: Yeah, that is my prediction. Why not?

[0:44:13.3] BL: Well, you know what? I mean, that is probably true, because that is what we have been doing forever now, so we are going to continue doing this thing.

[0:44:20.0] CC: Because it is what people want.

[0:44:22.0] BL: Because it works.

[0:44:23.0] CC: Yeah, it makes life easier for some people. I don't see why we wouldn't move in that direction. But before we wrap up, unless you guys want to make predictions too, I really wanted to touch base on the hiring side of things – the recruiters and hiring managers doing the interviewing. I can imagine there are a whole bunch of people out there who need to recruit people to do these cloud native jobs, and how can we help them? Can we give them some tips? How can they attract people? What should they be looking for?

[0:45:03.4] NL: Well, I guess my thought is that I really feel like recruiters need to start learning the technology that they are hiring for. I don't think that they can hide behind the idea that they're recruiters and they don't need to know. If you want to hire good people, if you want to weed out the bad people, or whatever it is that you are trying to do, you need to actually learn the technology that you are hiring for, and, like we are saying, there is now a wealth of knowledge that is free for you to access, so please look.

[0:45:32.9] CC: I am not going to disagree with that.

[0:45:34.3] BL: And the interesting thing is, when he says learn it, he doesn't mean that you have to be able to produce it, but you should understand how it works, at a minimum.

[0:45:42.8] NL: Yeah, and also know when someone's BS-ing you in the tech screen.

[0:45:48.2] CC: But it is not easy, because you might be going in that direction with the intention of learning and you might misunderstand things, and, you know, how deep do you have to go to not misunderstand the technology?

[0:46:06.1] BL: You know what? I don't think there is an answer for that. I think there is something in between not knowing and being an expert. You need to be somewhere in between, where if you're hiring for cloud native and Kubernetes, you can't offer a job that wants 10 years of Kubernetes experience. First of all, Kubernetes is huge and no one has experience throughout the whole Kubernetes stack, and second of all, Kubernetes is only five years old.

So please don't do that to yourself either. You should know how old it is and at least know the parts and what your team is going to be working on. But for managers, wow, actually I don't have a good answer for that. So I am just going to punt on that one.

[0:46:45.1] CC: Well, how would it be different? Actually, it is going to sound like I asked a loaded question, but I just now had this realization. I don't think it would be different from what we were saying in regard to giving tips for people to prepare themselves to make a move into this space if they are not working with any of this stuff. It will be the same: try to find people who know distributed systems, who can debug well.
I am not even going to go into working well with people. That is such a given. Let's just keep it to the tech stack and all of those things that we recommended for people to learn, I don't know.

[0:47:26.0] BL: Yeah, it sounds good to me.

[0:47:28.1] NL: All right, well, I think that just about wraps it up for this week of the podcast, the Kubelets Podcast. I thought this was a really interesting discussion. It was cool to talk about where we were and where we are going and, you know, what brought us all together, as I said.

[0:47:44.2] CC: Nick, do you want to share with us what your tagline for this episode was?

[0:47:48.1] NL: Yeah, the tagline for this episode is CREAM: Cash Rules Everything Around Me.

[0:47:53.3] BL: Dollar-dollar bills, y'all.

[0:47:55.8] CC: Ka-ching, ka-ching, ka-ching.

[0:47:58.8] NL: All right, thank you so much. Thank you, Brian, thanks for joining us.

[0:48:03.9] BL: Thank you for having me.

[0:48:05.3] NL: Yeah, and thank you, Carlisia.

[0:48:07.6] CC: This was really good, thank you.

[0:48:09.7] NL: Yeah, I had a lot of fun. Bye, y'all.

[0:48:13.5] BL: Bye.

[0:48:14.1] CC: Bye.

[END OF EPISODE] [0:48:14.8] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END] See omnystudio.com/listener for privacy information.


27 Jan 2020

Rank #6

Podcast cover

Learning Distributed Systems (Ep 12)

In this episode of The Podlets Podcast, we welcome Michael Gasch from VMware to join our discussion on the necessity (or not) of formal education in working in the realm of distributed systems. There is a common belief that studying computer science is a must if you want to enter this field, but today we talk about the various ways in which individuals can teach themselves everything they need to know. What we establish, however, is that you need a good dose of curiosity and craziness to find your feet in this world, and we discuss the many different pathways you can take to fully equip yourself. Long gone are the days when you needed a degree from a prestigious school: we give you our hit-list of top resources that will go a long way in helping you succeed in this industry. Whether you are someone who prefers learning by reading, attending Meetups or listening to podcasts, this episode will provide you with lots of new perspectives on learning about distributed systems. Follow us: https://twitter.com/thepodlets Website: https://thepodlets.io Feeback: info@thepodlets.io https://github.com/vmware-tanzu/thepodlets/issues Hosts: Carlisia Campos Duffie Cooley Michael Gasch Key Points From This Episode: • Introducing our new host, Michael Gasch, and a brief overview of his role at VMware. • Duffie and Carlisia’s educational backgrounds and the value of hands-on work experience. • How they first got introduced to distributed systems and the confusion around what it involves.  • Why distributed systems are about more than simply streamlining communication and making things work. • The importance and benefit of educating oneself on the fundamentals of this topic. • Our top recommended resources for learning about distributed systems and their concepts. • The practical downside of not having a formal education in software development. • The different ways in which people learn, index and approach problem-solving. • Ensuring that you balance reading with implementation and practical experience. • Why it’s important to expose yourself to discussions on the topic you want to learn about. • The value of getting different perspectives around ideas that you think you understand. • How systems thinking is applicable to things outside of computer science.• The various factors that influence how we build systems.   Quotes: “When people are interacting with distributed systems today, or if I were to ask like 50 people what a distributed system is, I would probably get 50 different answers.” — @mauilion [0:14:43] “Try to expose yourself to the words, because our brains are amazing. Once you get exposure, it’s like your brain works in the background. All of a sudden, you go, ‘Oh, yeah! I know this word.’” — @carlisia [0:14:43] “If you’re just curious a little bit and maybe a little bit crazy, you can totally get down the rabbit hole in distributed systems and get totally excited about it. There’s no need for having formal education and the degree to enter this world.” — @embano1 [0:44:08] Learning resources suggested by the hosts: Book, Designing Data-Intensive Applications, M. Kleppmann Book, Distributed Systems, M. van Steen and A.S. Tanenbaum (free with registration) Book, Thesis on Raft, D. Ongaro. - Consensus - Bridging Theory and Practice (free PDF) Book, Enterprise Integration Patterns, B.Woolf, G. Hohpe Book, Designing Distributed Systems, B. 
Burns (free with registration) Video, Distributed Systems Video, Architecting Distributed Cloud Applications Video, Distributed Algorithms Video, Operating System - IIT Lectures Video, Intro to Database Systems (Fall 2018) Video, Advanced Database Systems (Spring 2018) Paper, Time, Clocks, and the Ordering of Events in a Distributed System Post, Notes on Distributed Systems for Young Bloods Post, Distributed Systems for Fun and Profit Post, On Time Post, Distributed Systems @The Morning Paper Post, Distributed Systems @Brave New Geek Post, Aphyr’s Class materials for a distributed systems lecture series Post, The Log - What every software engineer should know about real-time data’s unifying abstraction Post, Github - awesome-distributed-systems Post, Your Coffee Shop Doesn’t Use Two-Phase Commit Podcast, Distributed Systems Engineering with Apache Kafka ft. Jason Gustafson Podcast, The Systems Bible - The Beginner’s Guide to Systems Large and Small - John Gall Podcast, Systems Programming - Designing and Developing Distributed Applications - Richard Anthony Podcast, Distributed Systems - Design Concepts - Sunil Kumar Links Mentioned in Today’s Episode: The Podlets on Twitter — https://twitter.com/thepodlets Michael Gasch on LinkedIn — https://de.linkedin.com/in/michael-gasch-10603298 Michael Gasch on Twitter — https://twitter.com/embano1 Carlisia Campos on LinkedIn — https://www.linkedin.com/in/carlisia Duffie Cooley on LinkedIn — https://www.linkedin.com/in/mauilion VMware — https://www.vmware.com/ Kubernetes — https://kubernetes.io/ Linux — https://www.linux.org Brian Grant on LinkedIn — https://www.linkedin.com/in/bgrant0607 Kafka — https://kafka.apache.org/ Lamport Article — https://lamport.azurewebsites.net/pubs/time-clocks.pdf Designing Date-Intensive Applications — https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable-ebook/dp/B06XPJML5D Designing Distributed Systems — https://www.amazon.com/Designing-Distributed-Systems-Patterns-Paradigms/dp/1491983647 Papers We Love Meetup — https://www.meetup.com/papers-we-love/ The Systems Bible — https://www.amazon.com/Systems-Bible-Beginners-Guide-Large/dp/0961825170 Enterprise Integration Patterns — https://www.amazon.com/Enterprise-Integration-Patterns-Designing-Deploying/dp/0321200683 Transcript: EPISODE 12 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you. [EPISODE] [00:00:41] CC: Hi, everybody. Welcome back. This is Episode 12, and we are going to talk about distributed systems without a degree or even with a degree, because who knows how much we learn in university. I am Carlisia Campos, one of your hosts. Today, I also have Duffie Cooley. Say hi, Duffie.  [00:01:02] DC: Hey, everybody.  [00:01:03] CC: And a new host for you, and this is such a treat. Michael Gasch, please tell us a little bit of your background.  [00:01:11] MG: Hey! Hey, everyone! Thanks, Carlisia. Yes. So I’m new to the show. I just want to keep it brief because I think over the show we’ll discuss our backgrounds a little bit further. So right now, I’m with VMware. So I’ve been with VMware almost for five years. 
Currently, I'm in the office of the CTO. I'm a platform architect in the office of the CTO, and I mainly use Kubernetes on a daily basis from an engineering perspective. So we build a lot of prototypes based on customer input or ideas that we have, and we work with different engineering teams.

Kubernetes has become kind of my bread and butter, but lately more from a consumer perspective, like developing with Kubernetes or against Kubernetes, instead of the former way of mostly being around implementing and architecting Kubernetes.

[00:01:55] CC: Nice. Very impressive. Duffie?

[00:01:58] MG: Thank you.

[00:01:59] DC: Yeah.

[00:02:00] CC: Let's give the audience a little bit of your backgrounds. We've done this before, but just to frame the episode, so people will know how we came into distributed systems.

[00:02:13] DC: Sure. In my experience, I spent – I don't have a formal education history. I spent most of my time, kind of, just up through high school. Then from there, I basically worked into different systems administration, network administration, network architect roles, and up into virtualization and now containerization. So I've got a pretty hands-on, kind of bootstrapped experience around managing infrastructure, both at small scale, inside of offices, and all the way up to very large scale, working for some of the larger companies here in Silicon Valley.

[00:02:46] CC: All right. My turn, I guess. So I do have a computer science degree, but I don't feel that I really went deep at all into distributed systems. My degree is also from a long time ago. So mainly, what I do know now is almost entirely from hands-on work experience. Even so, I think I'm very much lacking, and I'm very interested in this episode, because we are going to go through some great resources that I am also going to check out later. So let's get this party started.
What do they really mean?” Because, yes, we understand apps stuck to apps and then there is API, but there’s always for me at least a question at the back of my head. Is that all there is to it? It sounds like it should be a lot more involved and complex and complicated than just having an app stuck on another app.  In fact, it is because there are so many concepts and problems involved in distributed systems, right? From timing, clock, and sequence, and networking, and failures, how do you recover. There is a whole world in how do you log this properly, how do you monitor. There’s a whole world that revolves around this concept of systems residing in different places and [inaudible 00:05:34] each other.  [00:05:37] DC: I think you made a very good point. I think this is sort of like there’s an analog to this in containers, oddly enough. When people say, “I want a container within and then the orchestration systems,” they think that that's just a thing that you can ask for. That you get a container and inside of that is going to be your file system and it’s going to do all those things. In a way, I feel like that same confusion is definitely related to distributed systems.  When people are interacting with distributed systems today or if I were to ask like 50 people what a distributed system is, I would probably get 50 different answers. I think that you got a pretty concise definition there in that it is a set of systems that intercommunicate to perform some function. It’s like found at its base line. I feel like that's a pretty reasonable definition of what distributed systems are, and then we can figure out from there like what functions are they trying to achieve and what are some of the problems that we’re trying to solve with them.  [00:06:29] CC: Yeah. That’s what it’s all about in my head is solving the problems because at the beginning, I was thinking, “Well, it must be just about communicating and making things work.” It’s the opposite of that. It’s like that’s a given. When a job says you need to understand about distributed systems, what they are really saying is you need to know how to deal with failures, not just to make it work. Make it work is sort of the easy part, but the whole world of where the failures can happen, how do you handle it, and that, to me is what needing to know distributed system comes in handy.  In a couple different things, like at the top layer or 5% is knowing how to make things work, and 95% is knowing how to handle things when they don’t work, because it’s inevitable.  [00:07:19] DC: Yeah, I agree. What do you think, Michael? How would you describe the context around distributed systems? What was the first one that you worked with? [00:07:27] MG: Exactly. It’s kind of similar to your background, Duffie, which is no formal degree or education on computer science right after high school and jumping into kind of my first job, working with computers, computer administration.  I must say that from the age of I think seven or so, I was interested in computers and all that stuff but more from a hardware perspective, less from a software development perspective. So my take always was on disassembling the pieces and building my own computers than writing programs. In the early days, that just was me.  So I completely almost missed the whole education and principles and fundamentals of how you would write a program for a single computer and then obviously also for how to write programs that run across a network of computers. 
So over time, as I progressed in my career, especially in that first job, which was like seven years of different Linux systems, Linux administration, I kind of – like you, Duffie, I dealt with distributed systems without necessarily knowing that I was dealing with distributed systems. It was mostly storage systems, Linux file servers, but distributed file servers. Samba, if some of you recall that project.

So I knew that things could fail. A node could fail, for example, or it could not be writable, and so a client might be stuck, but I didn't necessarily think of that as directly related to the fundamentals of how distributed systems work or don't work. Over time – and this is really why I appreciate the Kubernetes project and community – I got more questions, especially when this whole container movement came up. I got so many questions around how does that thing work. How does scheduling work? Because scheduling was kind of close to my interest in hardware design and low-level details. But I was looking at Kubernetes like, "Okay. There is the scheduler." In the beginning, the documentation was pretty scarce around the implementation and all the controllers and what's going on. So I had to – I listened to a lot of podcasts and Brian Grant's great talks and different shows that he gave in the Kubernetes space, and other people there as well.

In the end, I had more questions than answers. So I had to dig deeper. Eventually, that led me to a path of wanting to understand the more formal theory behind distributed systems, by reading the papers, reading books, taking some online classes, just to get a basic understanding of those issues. So I got interested in resource scheduling in distributed systems and in consensus. Those were two areas that kind of caught my eye, like, "What is it? How do machines agree in a distributed system if so many things can go wrong?"

Maybe we can explore this later on, so I'm going to park this for a bit. But back to your question – this was kind of a long-winded answer, or road to answering your question, Duffie. For me, a distributed system is this kind of coherent network of computer machines that, from the outside, to an end-user or to another client, looks like one gigantic big machine that is [inaudible 00:10:31] to run as fast, that is also performing efficiently. It has a lot of characteristics and properties that we want from our systems that a single machine usually can't handle. But it looks like one big single machine to a client.

[00:10:46] DC: I think that – I mean, it is interesting. I don't want to get into – I guess this is probably not just a distributed systems talk. But obviously, one of the questions that falls out for me when I hear that answer is: what is the difference between a microservices architecture and distributed systems? Because I think it's – I mean, to your point, the way that a lot of people learn to develop software, it's like we're going to develop a monolithic application just by nature. We're going to solve a software problem using code.

Then later on, when we decide to actually scale this thing, or understand how to better operate it under a significant load, we start thinking about, "Okay. Well, how do we have to architect this differently in such a way that it can support that load?" That's where I feel like the lines cross, right? We're suddenly in a world where you're not just talking about microservices.
You're also talking about distributed systems, because you're going to start thinking about how to understand transactionality throughout that system, how to understand all of those consensus things that you're referring to. How do they affect it when I add Mister Network in there? That's cool.

[00:11:55] MG: Just one comment on this, Duffie, which took me a very long time to realize, coming from my definition of what a distributed system is – this group of machines that perform work in a certain sense, or maybe even more abstracted, like a bunch of computers networked together.

What I kind of missed most of the time, and this goes back to the DNS example that you gave in the beginning, was that the client, or the clients, are also part of this distributed system, because they might have caches, especially in DNS. So you always deal with this kind of state that is distributed everywhere. Maybe you don't even know where it is distributed, and the client works with local, stale data.

So that is also part of a distributed system, and something I want to give credit to the Kafka community and some of the engineers on Kafka for, because there was a great talk I heard lately. It's like, "Right. The client is also part of your distributed system, even though usually we think it's just the server side – those many server machines, all those microservices." At least, I missed that for a long time.

[00:12:58] DC: You should put a link to that talk in our [inaudible 00:13:00]. That would be awesome. It sounds great. So what do you think, Carlisia?

[00:13:08] CC: Well, one thing that I wanted to mention is that Michael was saying how he's been self-teaching distributed systems, and I think if we want to be competent in this area, we have to do that. I'm saying this to myself, even.

It's very refreshing when you read a book or a paper and you really understand the fundamentals of an aspect of distributed systems. A lot of things fall into place for you. I'm saying this because even prioritizing reading about and learning the fundamentals is really hard for me, because you have your life. You have things to do. You have the minutiae of things to get done. But so many times, I struggle.

On the rare occasions where I go, "Okay, let me just learn this stuff, trial and error," it makes such a difference. Then once you learn it, it stays with you forever. So it's really good. It's so refreshing to read a paper and understand things at a different level, and that is what this episode is. I don't know if this is the time to jump into, "So here are our recommendations." I don't know how deep, Michael, you're going to go. You have a ton of things listed. Everything we mention on the show is going to be on our website, in the show notes. So nobody necessarily needs to be taking notes.

Another thing I wanted to say is it would be lovely if people would get back to us once you've listened to this. Let us know if you want to add anything to this list. It would be awesome. We can even add it to the list later and give a shout-out to you. So it'd be great.

[00:14:53] MG: Right. I don't want to cover this whole list. I just wanted to be as complete as possible about the stuff that I have read or watched. So I just put it all in, and I just picked some highlights there, if you want.

[00:15:05] CC: Yeah. Go for it.

[00:15:06] MG: Yeah. Okay. Perfect.
Honestly, even though it's not the first in the list, the first thing that I read – so maybe this is from kind of my history of how I approach things – was searching for how do computers work, what are some of the issues, and how do computers and machines agree. Obviously, the classic paper that I read was the Lamport paper, "Time, Clocks, and the Ordering of Events in a Distributed System".

I want to be honest. The first time I read it, I didn't really get the full essence of the paper, because of the proof in there. The mathematical proof didn't click for me immediately, and there were so many things and concepts and physics and time that were thrown at me where I was looking for answers, and I had more questions than answers. But this is not a knock on Leslie. This is more like, at the time, I just wasn't prepared for how deep the rabbit hole goes.

So I thought, if someone asked me – if I only had time to read one book out of this huge list that I have there and all the other resources, which one would it be? Which one would I recommend? I would recommend Designing Data-Intensive Applications by Martin Kleppmann. I've been following his blog posts and some partial releases that he did before fully releasing that book, which took him more than four years to release.

It's almost the Bible, the state-of-the-art Bible, when it comes to all the concepts in distributed systems. Obviously, consensus, network failures, and all that stuff, but then also leading into modern data streaming and data platform architectures inspired by, for example, LinkedIn and other communities. So that would be the book I would recommend to someone who only has time to read one book.

[00:16:52] DC: That's a neat approach. I like the idea of, if you had one thing, one way to help somebody ramp up on distributed systems and stuff, what would it be? For me, actually, I don't think I would recommend a book, oddly enough. I feel like I would actually – I'd probably drive them toward the Kind project, like the Kind [inaudible 00:17:09] project, and say, "This is a distributed system all by itself." Start tearing it apart into pieces and seeing how they work, and breaking them, and then exploring and kind of just playing with the parts. You can do a lot of really interesting things.

There is actually another book in your list, written by Brendan Burns, called Designing Distributed Systems, I think. In that book, I think he actually uses Kubernetes as a model for how to go about achieving these things, which I think is incredibly valuable, because it really gets into some of the more stable distributed systems patterns that are around.

I feel like that's a great entry point. So if I had one thing, if I had to pick one way to help somebody, or to push somebody in the direction of trying to learn distributed systems, I would say identify those distributed systems that maybe you're already aware of and really explore how they work, and what the problems with them are, and how they went about solving those problems. Really dig into the idea of it. It's something you can put your hands on and play with. I mean, Kubernetes is a great example of this, and this is actually why I referred to it.

[00:18:19] CC: The way that works for me when I'm learning something like that is to really think about where the boundaries are, where the limitations are, where the tradeoffs are. If you can take a smaller system, maybe something like the Kind project, and identify what those things are.
If you can’t, then ask around. Ask someone. Google it. I don’t know. Maybe it will be a good episode topic for us to do that. This part is doing this to map things out. So maybe we can understand better and help people understand things better. So mainly like yeah. They try to do the distributed system thesis are. But for people who don’t even know what they could be, it’s harder to identify it. I don’t know what a good strategy for that would be, because you can read about distributed systems and then you can go and look at a project. How do you map the concept to learning to what you’re seeing in the code base? For me, that’s the hardest thing.  [00:19:26] MG: Exactly. Something that kind of I had related experience was like when I went into software development, without having formal education on algorithms and data structures, sometimes in your head, you have the problem statement and you're like, “Okay. I would do it like that.” But you don't know the word that describes, for example, a heap structure or queue because you’ve never – Someone told you that is heap, that is a queue, and/or that is a stick.  So, for me, reading the book was a bit easier. Even though I have done distributed systems, if you will, administration for many years, many years ago, I didn't realize that it was a distributed system because I never had this definition or I never had those failure scenarios in mind and it never had a word for consensus. So how would I search for something like how do machines agree? I mean, if you put that on Google, then likely they will come – Have a lot of stuff. But if you put it in consensus algorithm, likely you get a good hit on what the answer should be.  [00:20:29] CC: It is really problematic when we don't know the names of things because – What you said is so right, because we are probably doing a lot of distributed systems without even knowing that that’s what it is. Then we go in the job interview, and people are, “Oh! Have you done a distributed system?” No. You have but you just don’t know how to name things. But that’s one – [00:20:51] DC: Yeah, exactly. [00:20:52] CC: Yeah. Right? That’s one issue. Another issue, which is a bigger issue though is at least that’s how it is for me. I don’t want to speak for anybody else but for me definitely. If I can’t name things and I face a problem and I solve it, every time I face that problem it’s a one-off thing because I can’t map to a higher concept.  So every time I face that problem, it’s like, “Oh!” It’s not like, “Oh, yeah!” If this is this kind of problem, I have a pattern. I’m going to use that to this problem. So that’s what I’m saying. Once you learn the concept, you need to be able to name it. Then you can map that concept to problems you have.  All of a sudden, if you have like three things [inaudible 00:21:35] use to solve this problem, because as you work with computers, coding, it’s like you see the same thing over and over again. But when you don’t understand the fundamentals, things are just like – It’s a bunch of different one-offs. It’s like when you have an argument with your spouse or girlfriend or boyfriend. Sometimes, it’s like you’re arguing 10 times in a month and you thought, “Oh! I had 10 arguments.” But if you’d stop and think about it, no. We had one argument 10 times. It’s very different than having 10 problems versus having 1 problem 10 times, if that makes sense.  [00:22:12] MG: It does. [00:22:11] DC: I think it does, right?  [00:22:12] MG: I just want to agree.  
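To make the ordering idea from the Lamport paper that Michael brought up earlier a little more concrete, here is a minimal sketch of a logical clock in Go. Nothing in it comes from a particular library; the Clock type and its methods are made up for illustration, and a real system would still have to decide what to do with the ordering the clock gives it.

package main

import "fmt"

// Clock is a Lamport logical clock: a counter that only ever moves forward.
type Clock struct {
	time uint64
}

// Tick records a local event and advances the clock.
func (c *Clock) Tick() uint64 {
	c.time++
	return c.time
}

// Receive merges the timestamp carried on an incoming message. The local
// clock jumps past whatever it has seen, then ticks for the receive event.
func (c *Clock) Receive(msgTime uint64) uint64 {
	if msgTime > c.time {
		c.time = msgTime
	}
	return c.Tick()
}

func main() {
	a, b := &Clock{}, &Clock{}
	a.Tick()                  // local event on A -> 1
	t := a.Tick()             // A sends a message stamped 2
	b.Tick()                  // unrelated local event on B -> 1
	fmt.Println(b.Receive(t)) // B receives A's message -> prints 3
}

The whole trick is in Receive: a process never lets its own clock fall behind a timestamp it has already seen, which is enough to preserve the happened-before relation the paper describes.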
[00:22:16] DC: I think it does make sense. I think it’s interesting. You’ve highlighted kind of an interesting pattern around the way that people learn, which I think is really interesting. That is like some people are able to read about patterns or software patterns or algorithms or architectures and have that suddenly be an index of their heads. They can actually then later on correlate what they've read with the experience that they’re having around the things they're working on.  For some, it needs to be hands-on. They need to actually be able to explore that idea and understand and manipulate it and be able to describe how it works or functions in person, in reality. They need to have that hands-on like, “I need to touch it to understand it,” kind of experience. Those people also, as they go through those experiences, start building this index of patterns or algorithms in their head. They have this thing that they can correlate to, right, like, “Oh! This is a time problem,” or, “This is a consensus problem,” or what have you, right? [00:23:19] CC: Exactly.  [00:23:19] DC: You may not know the word for that saying but you're still going to develop a pattern in your mind like the ability to correlate this particular problem with some pattern that you’ve seen before. What's interesting is I feel like people have taken different approaches to building that index, right? For me, it’s been troubleshooting. Somebody gives me a hard problem, and I dig into it and I figure out what the problem is, regardless of whether it's to do with distributed systems or cooking. It could be anything, but I always want to get right in there and figure out what that problem and start building a map in my mind of all of the players that are involved.  For others, I feel like with an educational background, if you have an education background, I think that sometimes you end up coming to this with a set of patterns already instilled that you understand and you're just trying to apply those patterns to the experience you’re having instead. It’s just very – It’s like horse before the cart or cart before the horse. It’s very interesting when you think about it.  [00:24:21] CC: Yes.  [00:24:22] MG: The recommendation that I just want to give to people that are like me who like reading is that I went overboard a bit in the beginnings because I was so fascinated by all the stuff, and it went down the rabbit hole deeper, deeper, deeper, deeper. Reading and reading and reading. At some point, even coming to weird YouTube channels that talk about like, “Is time real and where does time emerge from?” It became philosophical even like the past where I went to.  Now, the thing is, and this is why I like Duffie’s approach with like breaking things and then undergo like trying to break things and understanding how they work and how they can fail is that immediately you practice. You’re hands-on. So that would be my advice to people who are more like me who are fascinated by reading and all the theory that your brain and your mind is not really capable of kind of absorbing all the stuff and then remembering without practicing. Practicing can be breaking things or installing things or administrating things or even writing software. But for me, that was also a late realization that I should have maybe started doing things earlier than the time I spent reading.  [00:25:32] CC: By doing, you mean, hands-on? [00:25:35] MG: Yeah.  [00:25:35] CC: Anything specific that you would have started with? [00:25:38] MG: Yes. 
On Kubernetes – So going back those 15 years to my early days of Linux and Samba, which is a project. By the time, I think it was written in C or C++. But the problem was I wasn’t able to read the code. So the only thing that I had by then was some mailing lists and asking questions and not even knowing which questions to ask because of lack of words of understanding. Now, fast-forward into Kubernetes’ time, which got me deeper in distributed systems, I still couldn't read the code because I didn't know [inaudible 00:26:10]. But I forced myself to read the code, which helped a little bit for myself to understand what was going on because the documentation by then was lacking. These days, it’s easier, because you can just install [inaudible 00:26:20] way easier today. The hands-on piece, I mean.  [00:26:23] CC: You said something interesting, Michael, and I have given this advice before because I use this practice all the time. It's so important to have a vocabulary. Like you just said, I didn't know what to ask because I didn’t know the words. I practice this all the time. To people who are in this position of distributed systems or whatever it is or something more specific that you are trying to learn, try to expose yourself to the words, because our brains are amazing. Once you get exposure, it’s like your brain works in the background. All of a sudden, you go, “Oh, yeah! I know this word.”  So podcasts are great for me. If I don't know something, I will look for a podcast on the subject and I start listening to it. As the words get repeated, just contextually. I don’t have to go and get a degree or anything. Just by listening to the words being spoken in context, absorb the meaning of it. So podcasting is great or YouTube or anything that you can listen. Just in reading too, of course. The best thing is talking to people. But, again, it’s really – Sometimes, it’s not trivial to put yourself in positions where people are discussing these things.  [00:27:38] DC: There are actually a number of Meetups here in the Bay Area, and there’s a number of Meetups – That whole Meetup thing is sort of nationwide across the entire US and around the world it seems like now lately. Those Meetups I feel like there are a number of Meetups in different subject areas. There’s one here in the Bay Area called Papers We Love, where they actually do explore interesting technical papers, which are obviously a great place to learn the words for things, right? This is actually where those words are being defined, right? When you get into the consensus stuff, they really get into – One even is Raft. There are many papers on Raft and many papers on multiple things that get into consensus. So definitely, whether you explore a meetup on a distributed system or in a particular application or in a particular theme like Kubernetes, those things are great places just to kind of get more exposure to what people are thinking about in these problems.  [00:28:31] CC: That is such a great tip.  [00:28:34] MG: Yeah. The podcast is twice as good as well, because for people, non-natives – English speaker, I mean. Oh, people. Not speakers. People. The thing is that the word you’re looking for might be totally different than the English word. For example, consensus in Germany has this totally different meaning. So if I would look that up in German, likely I would find nothing or not really related at all. So you have to go through translation and then finding the stuff. 
So what you said, Duffie, with PWL, Papers We Love, or podcasts, those words, often they are in English, those podcasts and they are natural consensus or charting or partitioning. Those are the words that you can at least look up like what does it mean. That’s what I did as well thus far.  [00:29:16] CC: Yes. I also wanted to do a plus one for Papers We Love. It’s – They are everywhere and they also have an online. They have an online version of the Papers We Love Meetup, and a lot of the local ones film their meetups. So you can go through the history and see if they talked about any paper that you are interested in.  Probably, I’m sure multiple locations talk about the same paper, so you can get different takes too. It’s really, really cool. Sometimes, it’s completely obscure like, “I didn’t get a word of what they were saying. Not one. What am I doing here?” But sometimes, they talk about things. You at least know what the thing is and you get like 10% of it. But some paper you don’t. People who deal with papers day in and day out, it’s very much – I don’t know.  [00:30:07] DC: It’s super easy when going through a paper like that to have the imposter syndrome wash over you, right, because you’re like – [00:30:13] CC: Yes. Thank you. That’s what I wanted to say. [00:30:15] DC: I feel like I’ve been in this for 20 years. I probably know a few things, right. But in talking about reading this consensus paper going, “Can I buy a vowel? What is happening?” [00:30:24] CC: Yeah. Can I buy a vowel? That’s awesome, Duffie.  [00:30:28] DC: But the other piece I want to call out to your point, which I think is important is that some people don't want to go out and be there in person. They don’t feel comfortable or safe exploring those things in person.  So there are tons of resources like you have just pointed out like the online version of Papers We Love. You can also sign into Slack and just interact with people via text messaging, right? There’s a lot of really great resources out there for people of all types, including the amount of time that you have.  [00:30:53] CC: For Papers We Love, it’s like going to language class. If you go and take a class in Italian, your first day, even though that is going to be super basic, you’re going to be like, “What?” You’ll go back in your third week. You start, “Oh! I’m getting this.” Then a month, three months, “Oh! I’m starting to be competent.”  So you go once. You’re going to feel lost and experience imposter syndrome. But you keep going, because that is a format. First, you start absorbing what the format is, and that helps you understand the content. So once your mind absorbs the format, you’re like, “Okay. Now, I have – I know how to navigate this. I know what’s coming next.” So you don’t have to focus on that. You start focusing in the content. Then little but little, you become more proficient in understanding. Very soon, you’re going to be willing to write a paper. I’m not there yet.  [00:31:51] DC: That’s awesome.  [00:31:52] CC: At least that’s how I think it goes. I don’t know.  [00:31:54] MG: I agree.  [00:31:55] DC: It’s also changed over time. It’s fascinating. If you read papers from like 20 years ago and you read papers that are written more recently, it's interesting. The papers have changed their language when considering competition. When you're introducing a new idea with a paper, frequently that you are introducing it into a market full of competition. 
You're being very careful about the language, almost in a way to complicate the idea rather than to make it clear, which is challenging. There are definitely some papers that I've read where I was like, "Why are you using so many words to describe this simple idea?" It makes no sense, but yeah.  [00:32:37] CC: I don't want to make this episode all about Papers We Love. It was so good that you mentioned that, Duffie. It's really good to be in a room, or to be watching something online, where you see people asking questions and people go, "Oh! Why is this thing like this? Why is X like this," or, "Why is Y doing like this?" Then you go, "Oh! I didn't even think that X was important. I didn't even know that Y was important."  So you start picking up what the important things are, and that's what makes it click, is now you're hooking into the important concepts because people who know more than you are pointing out and asking questions. So you start paying attention and learning what the main things are that you should be paying attention to, which is different from reading the paper by yourself. It's just a ton of content that you need to sort through.  [00:33:34] DC: Yeah. I frequently self-describe as a perspective junkie, because I feel like for any of us really to learn more about a subject that we feel we understand, we need the perspective of others to really engage, to expand our understanding of that thing. I feel like I know how to make a peanut butter and jelly sandwich. I've done it a million times. It's a solid thing. But then I watch my kid do it and I'm like, "I hadn't thought of that problem." [inaudible 00:33:59], right? This is a great example of that.  Those communities like Papers We Love are a great opportunity to understand the perspective of others around these hard ideas. When we're trying to understand complex things like distributed systems, this is where it's at. This is actually how we go about achieving this. There is a lot that you can do on your own but there is always going to be more that you can do together, right? You can always do more. You can always understand this idea faster. You can understand the complexity of a system and how to break it down into these things by exploring it with other people. That's I feel like – [00:34:40] CC: That is so well said, so well said, and it's the reason for this show to exist, right? We come on a show and we give our perspectives, and people get to learn from people with different backgrounds, what their takes are on distributed systems, cloud native. So this was such a major plug for the show. Keep coming back. You're going to learn a ton.  Also, it was funny that you – It was the second time you mentioned cooking, made a cooking reference, Duffie, which brings me to something I want to make sure I say on this episode. I added a few things for reference, three books. But the one that I definitely would recommend starting with is The Systems Bible by John Gall. This book is so cool, because it helps you see everything through systems. Everything is a system. A conversation can be a system. An interaction between two people can be a system. I'm not saying this book says that. It's just my translation, and that you can look – Cooking is a system. There is a process. There is a sequence. It's really, really cool and it really helps to have things framed in this way and then go out and read the other books on systems. I think it helps a lot.
This is definitely what I am starting with and what I would recommend people start with, The Systems Bible. Did you two know this book? [00:36:15] MG: I did not. I don’t.  [00:36:17] DC: I’m not aware of it either but I really appreciate the idea. I do think that that's true. If you develop a skill for understanding systems as they are, you’ll basically develop – Frequently, what you’re developing is the ability to recognize patterns, right? [00:36:32] CC: Exactly.  [00:36:32] DC: You could recognize those patterns on anything.  [00:36:37] MG: Yeah. That's a good segue for just something that came to my mind. Recently, I gave a talk on event-driven architectures. For someone who's not a software developer or architect, it can be really hard to grab all those concepts on asynchrony and eventual consistency and idempotency. There are so many words of like, “What is this all – It sounds weird, way too complex.” But I was reading a book some years ago by Gregor Hohpe. He’s the guy behind Enterprise Integration Patterns. That’s also a book that I have on my list here. He said, “Your barista doesn't use two-phase commit.” So he was basically making this analogy of he was in a coffee shop and he was just looking at the process of how the barista makes the coffee. You pay for it and all the things that can go wrong while your coffee is brewed and served to you.  So he was making this relation between the real world and the life and human society to computer systems. There it clicked to me where I was like, “So many problems we solve every day, for example, agreeing on a time where we should meet for dinner or cooking, is a consensus problem, and we solve it.”  We even solve it in the case of failure. I might not be able to call Duffie, because he is not available right now. So somehow, we figure out. I always thought that those problems just exist in computer science and distributed systems. But I realized actually that's just a subset of the real world as is. Looking at those problems through the lens of your daily life and you get up and all the stuff, there are so many things that are related to computer systems.  [00:38:13] CC: Michael, I missed it. Was it an article you read? [00:38:16] MG: Yes. I need to put that in there as well. Yeah. It’s a plug.  [00:38:19] CC: Please put that in there. Absolutely. So far from being any kind of expert in distributed systems, but I have noticed. I have caught myself using systems thinking for even complicated conversations. Even in my personal life, I started approaching things in the systems oriented and just the – just a high-level example.  When I am working with systems, I can approach from the beginning, the end. It’s like a puzzle, putting the puzzle together, right? Sometimes, it starts from the middle. Sometimes, it starts from the edges. When I‘m having conversations that I need to be very strategic like I have one shot. Let’s say maybe I’m in a school meeting and I have to reach a consensus or have a solution or have a plan of action. I have to ask the right questions. My private self would do things linearly. Historically like, “Let’s go from the beginning and work out through the end.” Now, I don’t do that anymore. Not necessarily. Sometimes, I like, “Let me maybe ask the last question I would ask and see where it leads and just approach things from a different way.” I don’t know if this is making sense.  [00:39:31] MG: It does. It does.  [00:39:32] CC: But my thinking has changed. The way I see the possibilities is not a linear thing anymore. 
I see how you can truly switch things. I use this in programming a lot and also in writing. Sometimes, when you're a beginner writer, you start at the top and you go down to the conclusion. Sometimes, I start in the middle and go up, right? So you can start anywhere. It's beautiful, or it just gives you so many more options. Or maybe I'm just crazy. Don't listen to me.  [00:40:03] DC: I don't think you're crazy. I was going to say, one of the funny things about Michael's point and your point both, it's like in a way they have kind of referred to Conway's law, the idea that people will build systems in the way that they communicate. So this is actually – It totally brings it back to that same point, right? We by nature will build systems that we can understand, because that is the constraint in which we have to work, right? So it's very interesting.  [00:40:29] CC: Yeah. But it's an interesting thing, because we are [inaudible 00:40:32] by the way we are forced to work. For example, I work with constraints and what I'm saying is that that has been influencing my way of thinking.  So, yes, I build systems in the way I think, but also because of the constraints that I'm dealing with and that I have to be – the tradeoffs I need to make, that also turns around and influences the way I think, the way I see the world and the rest of the systems and all the rest of the world. Of course, as I change my thinking, possibly you can theorize that you go back and apply that. Apply things that you learn outside of your work back to your work. It's a beautiful back-and-forth, I think.  [00:41:17] MG: I had the same experience with some – When I had to design kind of my first API and think of, "Okay. What would the consumer contract be and what would a consumer expect me to deliver in response and so on?" I was forcing myself to be explicit in communicating and not throwing everything back at the client and confusing them, but being very explicit and precise. Also in communication every day when you talk to people, being explicit and precise really helps to avoid a lot of problems and trouble. Be it partnership or amongst friends or at work.  This is what I took from computer science actually back into my real world in order to take all those perceptions, perceiving things from a different perspective, and being more precise and explicit in how I respond or communicate.  [00:42:07] CC: My take on what you just said, Michael, is we design systems thinking how is this going to fail. We know this is going to fail. We're going to design for that. We're going to implement for that.  In real life, for example, if I need to get an agreement from someone, I try to understand the person's thinking and just go, "I just had this huge thing this week. This is in my mind." I'm not constantly thinking about this, I'm not crazy like that. Just a little bit crazy. It's like, "How does this person think? What do they need to know? How far can I push?" Right? We need to make a decision quickly, so the approach is everything, and sometimes you only get one shot, so yeah. I mean, correct me if I'm wrong. That's how I heard or interpreted what you just said.  [00:42:52] MG: Yeah, absolutely. Spot on. Spot on. So I'm not crazy as well.  [00:42:55] CC: Basically, I think we ended up turning this episode into a little bit of like, "Here are great references," and also a huge endorsement for really going deep into distributed systems, because it's going to be good for your jobs. It's going to be good for your life.
It's going to be good for your health. We are crazy.  [00:43:17] DC: I'm definitely crazy. You guys might be. I'm not. All right. So we started this episode with the idea of coming to learning distributed systems perhaps without a degree or without a formal education in it. We talked about a ride of different ideas on that subject, like different approaches that each of us took, how each of us sees the problem. Is there any important point that either of you want to throw back into the mix here or bring up in relation to that? [00:43:48] MG: Well, what I take from this episode, being my first episode and getting to know your background, Duffie and Carlisia, is that whoever is going to listen to this episode, whatever background you have, even though you might not be in computer systems or the industry at all, I think we three all have proved that whatever background you have, if you're just curious a little bit and maybe a little bit crazy, you can totally go down the rabbit hole in distributed systems and get totally excited about it. There's no need for having formal education and a degree to enter this world. It might help, but it's kind of not the high bar that I was perceiving it to be 10 years ago, for example.  [00:44:28] CC: Yeah. That's a good point. My takeaway is it always puzzled me how some people are so good and experienced and such experts in distributed systems. I always look at myself. It's like, "How am I lacking?" It's like, "What memo did I miss? What class did I miss? What project did I not work on to get the experience?" What I'm seeing is you just need to put yourself in that place. You need to do the work. But the good news is achieving competency in distributed systems is doable.  [00:45:02] DC: My takeaway is, as we discussed before, I think that there is no one thing that comprises a distributed system. It is a number of things, right, and basically a number of behaviors or patterns that we see that comprise what a distributed system is.  So when I hear people say, "I'm not an expert in distributed systems," I think, "Well, perhaps you are and maybe you don't know it already." Maybe there's some particular set of patterns with which you are incredibly familiar. Like you understand DNS better than the other 20 people in the room. That exposes you to a set of patterns that certainly give you the capability of saying that you are an expert in that particular set of patterns.  So I think that, to both of your points, it's like you can enter this stage where you want to learn about distributed systems from pretty much any direction. You can learn it from a CIS background. You can come at it with no computer experience whatsoever, and it will obviously take a bit more work. But this is really just about developing an understanding around how these things communicate and the patterns with which they accomplish that communication. I think that's the important part.  [00:46:19] CC: All right, everybody. Thank you, Michael Gasch, for being with us now. I hope to – [00:46:25] MG: Thank you.  [00:46:25] CC: To see you in more episodes [inaudible 00:46:27]. Thank you, Duffie.  [00:46:30] DC: My pleasure.  [00:46:31] CC: Again, I'm Carlisia Campos. With us was Duffie Cooley and Michael Gasch. This was episode 12, and I hope to see you next time. Bye. [00:46:41] DC: Bye.  [00:46:41] MG: Goodbye.  [END OF EPISODE] [00:46:43] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast.
Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END]


13 Jan 2020

Rank #7

Podcast cover

Stateful and Stateless Workloads (Ep 9)

This week on The Podlets Cloud Native Podcast we have Josh, Carlisia, Duffie, and Nick on the show, and are also happy to be joined by a newcomer, Bryan Liles, who is a senior staff engineer at VMware! The purpose of today’s show is coming to a deeper understanding of the meaning of ‘stateful’ versus ‘stateless’ apps, and how they relate to the cloud native environment. We cover some definitions of ‘state’ initially and then move to consider how ideas of data persistence and co-ordination across apps complicate or elucidate understandings of ‘stateful’ and ‘stateless’. We then think about the challenging practice of running databases within Kubernetes clusters, which effectively results in an ephemeral system becoming stateful. You’ll then hear some clarifications of the meaning of operators and controllers, the role they play in mediating and regulating states, and also how important they are in a rapidly evolving but skills-scarce environment. Another important theme in this conversation is the CAP theorem, or the impossibility of consistency, availability and partition tolerance all at once, and the way different databases allow for different combinations of two out of the three. We then move on to chat about the fundamental connection between workloads and state and then end off with a quick consideration about how ideas of stateful and stateless play out in the context of networks. Today’s show is a real deep dive offering perspectives from some of the most knowledgeable in the cloud native space so make sure to tune in! Follow us: https://twitter.com/thepodlets Website: https://thepodlets.io Feedback: info@thepodlets.io https://github.com/vmware-tanzu/thepodlets/issues Hosts: Carlisia Campos Duffie Cooley Bryan Liles Josh Rosso Nicholas Lane Key Points From This Episode: • What ‘stateful’ means in comparison to ‘stateless’.• Understanding ‘state’ as a term referring to data which must persist.• Examples of stateful apps such as databases or apps that revolve around databases.• The idea that ‘persistence’ is debatable, which then problematizes the definition of ‘state’. • Considerations of the push for cloud native to run stateless apps.• How inter-app coordination relates to definitions of stateful and stateless applications.• Considering stateful data as data outside of a stateless cloud native environment.• Why it is challenging to run databases in Kubernetes clusters.• The role of operators in running stateful databases in clusters.• Understanding CRDs and controllers, and how they relate to operators.• Controllers mediate between actual and desired states.• Operators are codified system administrators.• The importance of operators as app number grows in a skill-scarce environment.• Mechanisms around stateful apps are important because they ensure data integrity.• The CAP theorem: the impossibility of consistency, availability, and partition tolerance.• Why different databases allow for different iterations of the CAP theorem.• When partition tolerance can and can’t get sacrificed.• Recommendations on when to run stateful or stateless apps through Kubernetes.• The importance of considering models when thinking about how to run a stateful app.• Varying definitions of workloads.• Pods can run multiple workloads.• Workloads create states, so you can’t have one without the other.• The term ‘workloads’ can refer to multiple processes running at once.• Why the ephemerality of Kubernetes systems makes it hard to run stateful applications. 
• Ideas of stateful and stateless concerning networks.• The shift from server to browser in hosting stateful sessions. Quotes: “When I started envisioning this world of stateless apps, to me it was like, ‘Why do we even call them apps? Why don’t we just call them a process?’” — @carlisia [0:03:00] “‘State’ really is just that data which must persist.” — @joshrosso [0:04:26] “From the best that I can surmise, the operator pattern is the combination of a CRD plus a controller that will operate on events from the Kubernetes API based on that CRD’s configuration.” — @bryanl [0:17:00] “Once again, don’t let developers name them anything.” — @bryanl [0:17:35] “Data integrity is so important” — @apinick [0:22:31] “You have to really be careful about the different models that you’re evaluating when trying to think about how to manage a stateful application like a database.” — @mauilion [0:31:34] Links Mentioned in Today’s Episode: KubeCon+CloudNativeCon — https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2019/ Google Spanner — https://cloud.google.com/spanner/ CockroachDB — https://www.cockroachlabs.com/ CoreOS — https://coreos.com/ Red Hat — https://www.redhat.com/en Metacontroller — https://metacontroller.app/ Brandon Philips — https://www.redhat.com/en/blog/authors/brandon-phillips MySQL — https://www.mysql.com/ Transcript: EPISODE 009 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you. [INTERVIEW] [00:00:41] JR: All right! Hello, everybody, and welcome to episode 6 of The Cubelets Podcast. Today we are going to be discussing the concept of stateful and stateless and what that means in this crazy cloud native landscape that we all work in. I am Josh Rosso. Joined with me today is Carlisia.  [00:00:59] CC: Hi, everybody.  [00:01:01] JR: We also have Duffie.  [00:01:03] D: Hey, everybody.  [00:01:04] JR: Nicholas.  [00:01:05] NL: Yo!  [00:01:07] JR: And a newcomer to the podcast, we also have Bryan. Bryan, you want to give us a little intro about yourself?  [00:01:12] BL: Hi! I’m Bryan. I work at VMware. I do lots of community stuff, including co-chairing KubeCon+CloudNativeCon. [00:01:22] JR: Awesome! Cool. All right. We’ve got a pretty good cast this week. So let’s dive right into it. I think one of the first things that we’ve been talking a bit about is the concept of what makes an application stateful? And of course in reverse, what makes an application stateless? Maybe we could try to start by discerning those two. Maybe starting with stateless if that makes sense? Does someone want to take that on?  [00:01:45] CC: Well, I’m going to jump right in. I have always been a developer, as opposed to some or all of you who have system admin backgrounds. The first time that I heard the term stateless app, I was like, “What?” That wasn’t recent, okay? It was a long time ago, but that was a knot in my head. Why would you have a stateless app? If you have an app, you’re going to need state. I couldn’t imagine what that was. But of course it makes a lot of sense now. That was also when we were more in the monolithic world.   [00:02:18] BL: Actually that’s a good point. 
Before you go into that, it’s a great point. Whenever we start with apps or we start developing apps, we think of an application. An application does everything. It takes input and it does stuff and it gives output. But now in this new world where we have lots of apps, big apps, small apps, we start finding that there’s apps that only talk and coordinate with other apps. They don’t do anything else. They don’t save any data. They don’t do anything. That’s what – where we get into this thing called stateless apps. Apps don’t have any type of data that they store locally. [00:02:53] CC: Yeah. It’s more like when I envision in my head. You said it brilliantly, Brian. It’s almost like a process. When I started envisioning this world of stateless apps, to me it was like, “Why do we even call them apps? Why don’t we just call them a process?” They’re just shifting back data and forth but they’re not – To me, at the beginning, apps were always stateless. They went together.  [00:03:17] D: I think, frequently, people think of applications that have only locally relevant stuff that is actually not going to persist to disc, but maybe held in memory or maybe only relevant to the type of connection that’s coming through that application also as stateless, which is interesting, because there’s still some state there, but the premise is that you could lose that state and not lose the functionality of that code. [00:03:42] NL: Something that we might want to dive into really quickly when talking about stateless and stateful apps. What do we mean by the word state? When I first learned about these things, that was what always screwed me up. I’m like, “What do you mean state? Like Washington? Yeah. We got it over here.”   [00:03:57] JR: Oh! State. That’s that word. State is one of those words that we use to sound smarter than we actually are 95% of the time, and that’s a number I just made up. When people are talking about state, they mean databases. Yeah. But there are other types of state as well. If you maintain local cache that needs to be persistent, if you have local files that you’re dealing with, like you’re opening files. That’s still state. State really is just that it’s data that must persist. [00:04:32] D: I agree with that definition. I think that state, whether persisted to memory or persisted to disc or persisted to some external system, that’s still what we refer to as state.  [00:04:41] JR: All right. Makes sense and sounds about like what I got from it as well.  [00:04:45] CC: All right. So now we have this world where we talk about stateless apps and stateful apps. Are there even stateful apps? Do we call a database an app? If we have a distributed system where we have one stateless app over here, another stateless app over there and then we have the database that’s connected to the two of them, are we calling the database a stateful app or is that whole thing – How do we call this?  [00:05:15] NL: Yeah. The database is very much a state as an app with state. I’m very much – [00:05:19] D: That’s a close definition. Yeah.  [00:05:21] NL: Yeah. Literally, it’s the epitome of a stateful app. But then you also have these apps that talk to databases as well and they might have local data, like data that – they start a transaction and then complete it or they have a long distributed type transaction. Any apps that revolve around a database, if they store local data, whether it’s within a transaction or something else, they’re still stateful apps.  [00:05:46] D: Yup. 
I think you can modify and input data or modify state that has to be persisted in some way I think is a stateful app, even though I do think it’s confusing because of what – As I said before, I think that there are a bunch of applications that we think of, like not everybody considers Spark jobs to be stateful. Spark jobs, for example, are something that would bring data in, mutate that data in some way, produce some output and go away.  The definition there is that Spark would generally push the resulting data into some other external system. It’s interesting, because in that model, Spark is not considered to be a stateful app because the Spark job could fail, go away, get recreated, pick up the pieces where it left off or just redo that work until all of the work is done.  In many cases, people consider that to be a stateless application. That’s I think is like the crux – In my opinion, the crux of the confusion around what a stateful and stateless application is, is that people frequently – I think it’s more about where you store – what you mean by persistence and how that actually realizes in your application. If you’re pushing your state to an external database, is your application still stateful?  [00:06:58] NL: I think it’s a good question, or if you are gathering data from an external source and mutating it in some way, but you don’t need data to be present when you start up, is that a stateful app or a stateless app? Even though you are taking in data, modifying it and checking it, sending out to some other mechanism or serving it in your own way, does that become like a stateless app? If that app gets killed and it comes back and it’s able to recover, is it stateful or stateless? That’s a bit of a gray area, I think. [00:07:26] JR: Yeah. I feel like a lot of the customers I work with, if the application can get killed even if it has some type of local state, they still refer to it as stateless usually, to me at least, when we talk about it because they think, “I can kind of restart this application and I’m not too worried about losing whatever it may have had.” Let’s say cached for simplicity, right?  I think that kind of leads us into an interesting question. We’ve talked a lot on this podcast about cloud native infrastructure and cloud native applications and it seems like since the inception of cloud native, there’s always been this push that a stateless app is the best candidate to run or the easiest candidate to run. I’m just curious if we could dive into that for a moment. Why in the cloud native infrastructure area has there always been this push for running stateless applications? Why is it simpler? Those kinds of things. [00:08:15] BL: Before we dive into that, we have to realize – And this is just a problem of our whole ecosystem, this whole cloud native. We’re very hand-wavy in our descriptions for things. There’re a lot of ambiguous descriptions, and state is one of those. Just keep that in mind, that when we’re talking today, we’re really just talking about these things that store data and when that’s the state. Just keep that in mind as you’re listening to this.  But when it comes to distributed systems in general, the easiest system is a system that doesn’t need coordination with any other system. If it happens to die, that’s okay. We can just restart it. People like to start there. It’s the easiest thing to start.  [00:08:58] NL: Yeah, that was basically what I was going to say. 
If your application needs to tie into other applications, it becomes significantly more complicated to implement it, at least for your first time and in your system. These small applications that only – They don’t care about anybody else, they just take in data or not, they just do whatever. Those are super easy to start with because they’re just like, “Here. Start this up. Who cares? Whatever happens, it happens.”   [00:09:21] CC: That could be a good boundary to define – I don’t want to jump back too far, but to define where is the stateless app to me is part of a system and just say it depends for it to come back up. Does it depend on something else that has state?   [00:09:39] BL: I’ll give you an example. I can give you a good example of a stateless app that we use every day, every single one of us, none of us on this call, but when you search Google. You go to google.com and you go to the bar and you type in a search, what’s happening is there is a service at the beginning that collects that search and it federates the search over many different probably clusters of computers so they can actually do the search currently. That app that actually coordinates all that work is a stateless app most likely. All it does is just splits it up and allows more CPUs to do the work. Probably, that goes away. Probably not a problem. You probably have 10 more of them. That’s what I consider stateless. It doesn’t really own any of the data. It’s the coordinator. [00:10:25] CC: Yeah. If it goes down, it comes back up. It doesn’t need to reset itself to the state where it was before. It can truly be considered a stateless because it can just, “Okay. I reset. I’m starting from the beginning from this clear state.” [00:10:43] BL: Yes. That’s a good summary of that.  [00:10:45] CC: Because another way to think about stateless – What makes an app stateful app, does it have to be combined or like deployed and shipped together with the part that maintains the state? That’s a more clear cut definition. Then that app is definitely a stateful app.  [00:11:05] D: What we frequently talk about in like the cloud native space is like you know that you have a stateless app if you can just create 20 of them and not have to worry about the coordination of them. They are all workers. They are all going to take input. You could spread the load across those 20 in an identical way and not worry about which one you landed on. That’s stateless application.  A stateful application is a very different thing. You have to have some coordination. You have to say how many databases can you have on a backend? Because you’re persisting data there, you have to be really careful about that you only write to the master database or to the writing database and you could read of any other memories of that database cluster, that sort of stuff. [00:11:44] CC: It might seem that we are going so deep into this differentiating between stateful and stateless, but this is so important because clusters are usually designed to be ephemeral. Ephemeral means obviously they die down, they are brought back up, the nodes, and you should worry as least as possible with the state of things.  Then going back to what Joshua is saying, when we are in this cloud native world, usually we are talking about stateless apps, stateless workloads and then we’re going to just talk about what workload means. But then if that’s the case, where are the stateful apps? It’s like we have this vision that the stateful apps live outside the cloud native world? 
How does it work? But it’s supposed to work. [00:12:36] BL: Yup. This is the question that keeps a lot of people employed. Making sure my state is available when I need it. You know what? I’m not going to even use that word state. Making sure my data is available wherever I need it and when I need it. I don’t want to go too deep in right now, but this is actually a huge problem in the Kubernetes community in general, and we see it because there’s been lots of advice given, “Don’t run things like databases in your clusters.” This is why we see people taking the ideas of Google Spanner and like CockroachDB and actually going through a lot of work to make sure that you can run databases in Kubernetes clusters.  The interesting piece about this is that we’re actually to the point where we can run these types of workloads in our clusters, but with a caveat, big star at the end, it’s very difficult and you have to know what you’re doing. [00:13:34] JR: Yeah. I want to dovetail on that Brian, because it’s something that we see all the time. I feel like when we first started setting up, let’s call them clusters, but in our case it was Kubernetes, right? We always saw that data level always being delegated to like if you’re in Amazon, some service that they hosted and so on. But now I think more and more of the customers that at least I’m seeing. I’m sure Nicholas and Duffie too, they’re interested in doing exactly what you just described.  Cockroach is an example I literally just worked with recently, and it’s just interesting how much more thoughtful they have to be about their cluster operations. Going back to what you said Carlisia, it’s not as easy as just like trashing a cluster and instantiating a new one anymore, like they’re used to. They need to be more thoughtful about keeping that data integrity intact through things like upgrades and disaster recover.  [00:14:18] D: Another interesting point kind to your point, Brian, is that like, frequently, people are starting to have conversations and concerns around data gravity, which means that I have a whole bunch of data that I need to work with, like to a Spark job, which I mentioned earlier. I need to basically put my compute where that data is. The way that I store that data inside the cluster and use Kubernetes to manage it or whether I just have to make sure that I have some way of bringing up compute workloads close to that data. It’s actually kind of introducing a whole new layer to this whole thing. [00:14:48] BL: Yeah! Whole new layer of work and a whole new layer of complexity, because that’s actually – The crux of all this is like where we slide the complexity too, but this is interesting, and I don’t want to go too far to this one definitely. This is why we’re seeing more people creating operators around managing data. I’ve seen operators who are bringing databases up inside of Kubernetes. I’ve seen operators that actually can bring up resources outside of Kubernetes using the Kubernetes API.  The interesting thing about this is that I looked at both solutions and I said, “I still don’t know what the answer is,” and that’s great. That means that we have a lot to learn about the problem, and at least we have some paths for it.  [00:15:29] NL: Actually, that kind of reminds me of the first time I ever heard the word stateful or stateless – I’m an infrastructure guy. Was around the discussion of operators, which there’s only a couple of years ago when operators were first introduced at CoreOS and some people were like, “Oh! 
Well, this is how you now operate a stateful mechanism inside of Kubernetes. This is the way forward that we want to propose.” I was just like, “Cool! What is that? What’s state? What do you mean stateful and stateless?” I had no idea. Josh, you were there. You’re like, “Your frontend doesn’t care about state and your backend does.” I’m like, “Does it? I don’t know. I’m not a developer.” [00:16:10] JR: Let’s talk about exactly that, because I think these patterns we’re starting to see are coming out of the needs that we’re all talking about, right? We’ve seen at least in the Kubernetes community a lot of push for these different constructs, like something called a stateful [inaudible 00:16:21], which isn’t that important right now, but then also like an operator. Maybe we can start by defining what is an operator? What is that pattern and why does it relate to stateful apps? [00:16:31] CC: I think that would be great. I am not clear what an operator is. I know there’s going to be a controller involved. I know it’s not a CRD. I am not clear on that at all, because I only work with CRDs and we don’t define – like the project I worked on with Velero, we don’t categorize it as an operator. I guess an operator uses specific framework that exists out there. Is it a Kubernetes library? I have no idea. [00:16:56] BL: We did it to ourselves again. We’re all doing these to ourselves. From the best that I can surmise, the operator pattern is the combination of a CRD plus a controller that will operate on events from the Kubernetes API based on that CRD’s configuration. That’s what an operator is. [00:17:17] NL: That’s exactly right.  [00:17:18] BL: To conflate this, Red Hat created the operator SDK, and then you have [inaudible 00:17:23] and you have a Metacontroller, which can help you build operators. Then we actually sometimes conflate and call CRDs operators, and that’s pretty confusing for everyone. Once again, don’t let developers name anything.  [00:17:41] CC: Wait. So let’s back up a little. Okay. There is an actual library that’s called an operator.  [00:17:46] BL: Yes. There’s an operator SDK.  [00:17:47] CC: Referred to as an operator. I heard that. Okay. Great. But let me back up a little because – [00:17:49] D: The word operator can [00:17:50] CC: Because if you are developing an app for Kubernetes, if you’re extending Kubernetes, you are – Okay, you might not use CRDs, but if you are using CRDs, you need a controller, right? Because how will you do actions? Then every app that has a CRD – because the alternative to having CRDs is just using the API directly without creating CRDs to reflect to resources. If you’re creating CRDs to reflect to resources, you need controllers. All of those apps, they have CRDs, are operators.   [00:18:24] D: Yip [inaudible 00:18:25] is an operator. [00:18:26] CC: [inaudible 00:18:26] not an operator. How can you extend Kubernetes and not be qualified [inaudible 00:18:31] operator? [00:18:32] BL: Well, there’s a way. There is a way. You can actually just create a CRD and use a CRD for data storage, you know, store states, and you can actually query the Kubernetes API for that information. You don’t need a controller, but we couple them with controllers a lot to perform action based on that state we’ve saved to etcd. [00:18:50] CC: Duffie.  [00:18:51] D: I want to back up just for a moment and talk about the controller pattern and what it is and then go from there to operators, because I think it makes it easier to get it in your head. 
A control pattern is effectively a way to understand desired state and real state and provide some logic or business code that will allow you to converge those two states, your actual state and your desired state. This is a pattern that we see used in almost everything within a distributed system. It’s like within Kubernetes, within most of the kind of more interesting systems that are out there. This control pattern describes a pretty good way of actually managing application flow across distributed systems.  Now, operators, when they were initially introduced, we were talking about that this is a slightly different thing. Operators, when we introduced the idea, came more from like the operational burden of these stateful applications, things like databases and those sorts of stuff. With the database, etcd for example, you have a whole bunch of operational and runtime concerns around managing the lifecycle of that system. How do I add a new member to the cluster? What do I do when a member dies? How do I take action?  Right now, that’s somebody like myself waking up at 2 in the morning and working through a run book to basically make sure that that service remains operational through the night. But the idea of an operator was to take that control pattern that we described earlier and make it wake up at 2 in the morning to fix this stuff. We’re going to actually codify the operational knowledge of managing the burden of these stateful applications so that we don’t have to wake up at 2 in the morning and do it anymore. Nobody wants to do that.  [00:20:32] BL: Yeah. That makes sense. Remember back at KubCon years ago, I know it was one in Seattle where Brandon Philips was on stage talking about operators. He basically was saying if we think about SysOp, system operators, it was a way to basically automate or capture the knowledge of our system administrators in scripts or in a process or in code a la operators.  [00:20:57] D: The last part that I’ll add to this thing, which I think is actually what really describes the value of this idea to me is that there are only so many people on the planet that do what the people in this blog post do. Maybe you’re one of them that listen to this podcast. People who are operating software or operating infrastructure at scale, there just aren’t that many of us on the planet. So as we add more applications, as more people adopt the cloud native regime or start coming to a place where they can crank out more applications more quickly, we’re going to have to get to a place where we are able to automate the burden of managing those applications, because there just aren’t enough of us to be able to support the load that is coming. There just aren’t enough people on the planet that do this to be able to support that.  That’s the thing that excites me most about the operator pattern, is that it gives us a place to start. It gives us a place to actually start thinking about managing that burden over time, because if we don’t start changing the way we think about managing that burden, we’re going to run out of people. We’re not going to be able to do it.  [00:22:05] NL: Yeah. It’s interesting. With stateful apps, we keep kind of bringing them – coming back to stateful apps, because stateful apps are hard and stateless apps are easy, and we’ve created all these mechanisms around operating things with state because of how just complicated it is to make sure that your data is ready, accessible and has integrity. 
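For anyone who wants to see the shape of the control loop described above, here is a minimal sketch in Go, assuming nothing beyond the standard library. The Desired and Actual types and the reconcile function are invented for the example; a real Kubernetes controller plays the same game against the API server, with watches and work queues instead of a polling loop.

package main

import (
	"fmt"
	"time"
)

// Desired is what was asked for (think: the spec on a custom resource).
type Desired struct{ Replicas int }

// Actual is what is really running right now.
type Actual struct{ Replicas int }

// reconcile looks only at state, not at events, and takes one small step
// toward convergence each time it is called.
func reconcile(want Desired, have *Actual) string {
	switch {
	case have.Replicas < want.Replicas:
		have.Replicas++ // stand-in for "create one replica"
		return "created a replica"
	case have.Replicas > want.Replicas:
		have.Replicas-- // stand-in for "delete one replica"
		return "deleted a replica"
	default:
		return "in sync, nothing to do"
	}
}

func main() {
	want := Desired{Replicas: 3}
	have := &Actual{}

	// The loop never assumes it is done; if something deletes a replica
	// out from under it, the next pass notices the drift and fixes it.
	for i := 0; i < 5; i++ {
		fmt.Println(reconcile(want, have))
		time.Sleep(10 * time.Millisecond)
	}
}

In these terms, an operator is the same loop plus the codified run book: the reconcile step carries the operational moves a human would otherwise be making at 2 in the morning.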
That’s the big one that I keep not thinking about as a SysOps person coming into the Dev world. Data integrity is so important and making sure that your data is exactly what it needs to be and was the last time you checked it, is super important. It’s only something I’m really starting to grasp. That’s why I was like these things, like operators and all these mechanisms that we keep creating and recreating and recreating keep coming about, because making sure that your stateful apps have the right data at the right time is so important. [00:22:55] BL: Since you brought this up, and we just talked about why a state is so hard, I want to introduce the new term to this conversation, the whole CAP theorem, where data would typically be – in a distributed system at least, your data will be consistent or your data can be available, or if your distributed systems falls in multiple parts, you can have partition tolerance. This is one of those computer science things where you can actually pick two. You can have it be available and have partition tolerance, but your data won’t be consistent, or you can have consistency and availability, but you won’t have partition tolerance. If your cluster splits into two for some reason, the data will be bad. This is why it’s hard, this is why people have written basically lots of PhD dissertations on this subject, and this is why we are talking about this here today, is because managing state, and particularly managing distributed, is actually a very, very hard problem. But there’s software out there that will help us, and Kubernetes is definitely part of that and stateful sets are definitely part of that as well. [00:24:05] JR: I was just going to say on those three points, consistently, availability and partition tolerance. Obviously, we’d want all three if we could have them. Is there one that we most commonly tradeoff and give up or does it go case-by-case? [00:24:17] BL: Actually, it’s been proven. You can’t have all three. It’s literally impossible. It depends. If you have a MySQL server and you’re using MySQL to actually serve data out of this, you’re going to most likely get consistency and availability. If you have it replicated, you might not have partition tolerance. That’s something to think about, and there are different databases and this is actually one of the reasons why there are different databases. This is why people use things like relational databases and they use key value stores not because we really like the interfaces, but because they have different properties around the data.   [00:24:55] NL: That’s an interesting point and something that I had recently just been thinking about, like why are there so many different types of databases. I just didn’t know. It was like in only recently heard of CAP theorem as well just before you mentioned it. I’m like, “Wow! That’s so fascinating.” The whole thing where you only pick two. You can’t get three.  Josh, to kind of go back to your question really quickly, I think that partition tolerance is the one that we throw away the most. We’re willing to not be able to segregate our database as much as possible because C and A are just too important, I think. At least that’s what I’m saying, like I am wearing an [inaudible 00:25:26] shirt and [inaudible 00:25:27] is not partition tolerant. It’s bad at it. [00:25:31] BL: This is why Google introduced Spanner, and Spanner in some situations can get free with tradeoffs and a lot of really, really smart stuff, but most people can’t run this scale. 
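As a toy illustration of the pick-two tradeoff being described here, the sketch below (in Go) shows a single replica that has been cut off from its primary by a network partition. It can either keep answering from a possibly stale local copy, which favors availability, or refuse until the partition heals, which favors consistency. All of the names are invented for the example.

package main

import (
	"errors"
	"fmt"
)

// Replica holds a local copy of the data and knows whether it can
// currently reach the primary.
type Replica struct {
	data        map[string]string
	partitioned bool // true when the primary is unreachable
	preferCP    bool // true: refuse possibly stale reads; false: serve them
}

var errUnavailable = errors.New("partitioned: refusing possibly stale read")

// Read answers from the local copy unless this replica chooses
// consistency over availability during a partition.
func (r *Replica) Read(key string) (string, error) {
	if r.partitioned && r.preferCP {
		return "", errUnavailable // consistent, but not available
	}
	return r.data[key], nil // available, but possibly stale
}

func main() {
	r := &Replica{
		data:        map[string]string{"stock:widget": "5"},
		partitioned: true,
	}

	r.preferCP = false
	v, _ := r.Read("stock:widget")
	fmt.Println("AP choice answers:", v) // may already have been sold elsewhere

	r.preferCP = true
	_, err := r.Read("stock:widget")
	fmt.Println("CP choice answers:", err)
}

Neither choice is wrong; it is the same tradeoff that pushes people toward different databases for different kinds of data.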
But we do need to think about partition tolerance, especially with data whenever – Let’s say you run a store and you have multiple instances across the world and someone buys something from inventory, what does your inventory look like at any particular point? You don’t have to answer my question, of course, but think about that. These are still very important problems if fiber gets cut across the Atlantic and now I’ve sold more things than I have. Carlisia, speaking to you as someone who’s only been a developer, have you moved your thoughts on state any further? [00:26:19] CC: Well, I feel that I’m clear on – Well, I think you need to clarify your question better for me. If you’re asking if I understand what it means, I understand what it means. But I actually was thinking to ask this question to all of you, because I don’t know the answer, if that’s the question you’re asking me. I want to put that to the group. Do you recommend people, as in like now-ish, to run stateful workloads? We need to talk about what workloads mean. Run stateful apps or databases inside, if they’re running a Kubernetes cluster or if they’re planning for that? Do you all as experts recommend that they should already be looking into doing that, or should they be running, for now, their stateful apps or databases outside of the cloud native ecosystem and just connecting the two? Because if that’s what your question was, I don’t know. [00:27:21] BL: Well, I’ll take this first. I think that we should be spending a lot more time than we are right now coming up with community-tested solutions around using stateful sets to their best ability. What that means is, let’s say if you’re running a database inside of Kubernetes and you’re using a stateful set to manage this, what we do need to figure out is what happens when my database goes down? The pod just gets killed? When I bring up a new version, I need to make sure that I have the correct software to verify integrity, rebuild things, so that when it comes back up, it comes back up correctly. That’s what I think we should be doing. [00:27:59] JR: For me, I think working with customers, at least Kubernetes-oriented folks, when they’re trying to introduce Kubernetes as the orchestration part of their overall platform, I’m usually just trying to kind of meet them where they’re at. If they’re new to Kubernetes and distributed systems as a whole, if we have stateless, let’s call them maybe simpler applications to start with, I generally have them lean into that first, because we already have so much in front of us to learn about. I think it was either Brian or Duffie, you said it introduces a whole bunch more complexity. You have to know what you’re doing. You have to know how to operate these things. If they’re new to Kubernetes, I generally will advise starting with stateless still. But that being said, so many of our customers that we work with are very interested in running stateful workloads on Kubernetes. [00:28:42] CC: But just to clarify what you said, Josh, because you spoke like an expert, but I still have beginner’s ears. You said something that sounded to me like you recommend that you go stateless. It sounded to me like that. What you’re really saying is that they take out the stateless part of what they have, which they might already have or they might have to change, and put in the stateless. You’re not suggesting that, “Oh! You can’t do stateful anymore.
You need to just do everything stateless.” What you’re saying is take the stateless part of your system, put that in Kubernetes, because that is really well-tested, and keep the stateful outside of that ecosystem. Is that right? [00:29:27] JR: I think that’s a better way to put it. Again, it’s not that Kubernetes can’t do stateful. It’s more of a concept of biting off more than you can chew. We still work with a lot of people who are very new to these distributed systems concepts, and to take on running stateful workloads, if we could just delegate that to some other layer, like outside of the cluster, that could be a better place to start, at least in my experience. Nicholas and Duff might have different – [00:29:51] NL: Josh, you basically nailed what I was going to say, where it’s like, if the team that I’m working with is interested in taking on the complexity of maintaining their databases, their stateful sets and making sure that they have data integrity and availability, then I’m all for them using Kubernetes for a stateful set. Kubernetes can run stateful applications, but there is all this complexity that we keep talking about in maintaining data and all that. If they’re willing to take on that complexity, great, it’s there for you. If they’re not, if they’re a little bit kind of behind as – Not behind, but if they’re kind of starting out their Kubernetes journey or their distributed systems journey, I would recommend them to move that complexity to somebody else and start with something a little bit easier, like a stateless application. There are a lot of good services that provide data as a service, right? You’ve got things like AWS RDS, which is great for creating stateful applications. You can leverage it anytime, and you’ve got like dedicated wires too. I would point them there first if they don’t want to take on that complexity. [00:30:51] D: I completely agree with that. An important thing I would add, which is in response to the stateful set piece here, is that as we’ve already described, managing a stateful application like a database does come with some complexity. So you should really carefully look at just what these different models provide you. Whether that model is making use of a stateful set, which provides you like ordinality, ensuring that things start up in a particular order, and some of the other capabilities around that stuff. But it won’t, for example, manage some of the complexity. A stateful set won’t, for example, try and issue a command to the new member to make sure that it’s part of an existing database cluster. It won’t manage that kind of stuff. So you have to really be careful about the different models that you’re evaluating when trying to think about how to manage a stateful application like a database. I think that’s actually why the topic of an operator came up earlier, which was that there are a lot of primitives within Kubernetes in general that provide you a lot of capability for managing things like stateful applications, but they may not entirely suit your needs. Because of the complexity with stateful applications, you have to really kind of be careful about what you adopt and where you jump in. [00:32:04] CC: Yeah. I know just from working with Velero, which is a tool for doing backup, recovery and migration of Kubernetes clusters. I know that we back up volumes. So if you have something mounted on a volume, we can back that up. I know for a fact that people are using that to back up stateful workloads.
We need to talk about workloads. But in any case, one thing to – I think one of you mentioned is that you definitely also need to look at a backup and recovery strategy, which is ever more important if you’re doing stateful workloads. [00:32:46] NL: That’s the only time it’s important. If you’re doing stateless, who cares? [00:32:49] BL: Have we defined what a workload is? [00:32:50] CC: Yeah. But let me say something. Yeah, I think we should do an episode on that maybe, maybe not. We should do an episode on GitOps type of thing for related things, because even though you – Things are stateless, but I don’t want to get into it. Your cluster will change state. You can recover in stuff from like a fresh version. But as it goes through a lifecycle, it will change state and you might want to keep that state. I don’t know. I’m not the expert in that area, but let’s talk about workloads, Brian. Okay. Let me start talking about workloads. I never heard the term workload until I came into the cloud native world, and that was about a year ago, or when I started looking in this space more closely. Maybe a little bit before a year ago. It took me forever to understand what a workload was. Now I understand, especially today – we were talking about it a little bit before we started recording. Let me hear from you all what it means to you. [00:34:00] BL: This is one of those terms, and I’m sure if you ask any ex-Googlers about this, they’ll probably agree. This is a Google term that we actually have zero context about why it’s a term. I’m sure we could ask somebody and they would tell us, but workloads to me personally are anything that ultimately creates a pod. Deployments create replica sets, which create pods. That whole thing is a workload. That’s how I look at it. [00:34:29] CC: Before there were pods, were there workloads, or is a workload a new thing that came along with pods? [00:34:35] BL: Once again, these words don’t make any sense to us, because they’re Google terms. I think that a pod is a part of a workload, like a deployment is a part of a workload, like a replica set is part of a workload. Workload is the term that encompasses an entire set of objects. [00:34:52] D: I think of a workload as a subset of an application. When I think of an application or a set of microservices, I might think of each of the services that make up that entire application as a workload. I think of it that way because that’s generally how I would divide it up, to Brian’s point, into different deployments or different stateful sets or different – That sort of stuff. Thinking of them each as their own autonomous piece, and altogether they form an application. That’s my way of thinking of it. [00:35:20] CC: To connect to what Brian said, a deployment will always run in pods, which is super confusing if you’re not looking at these things, just so people understand, because it took me forever to understand that. The connection between a workload, a deployment and a pod. Pods contain – If you have a deployment that you’re going to ship to Kubernetes – I don’t know if ship is the right word. You’re going to need to run it on Kubernetes. That deployment needs to run somewhere, in some artifact, and that artifact is called a pod. [00:35:56] NL: Yeah. Going back to what Duffie said really quickly. A workload to me was always a process, kind of like not just a pod necessarily, but like whatever it is that if you’re like, “I just need to get this to run,” whatever that is. To me that was always a workload, but I think I’m wrong.
I think I’m oversimplifying it. I’m just like, whatever your process is. [00:36:16] BL: Yeah. I would give you – The reason why I would not say that is because a pod can run multiple containers at once, which ergo is multiple processes. That’s why I say it that way. [00:36:29] NL: Oh! You changed my mind. [00:36:33] BL: The reason I bring this up, and this is probably a great idea for a future show, is about all the jargon and terminology that we use in this land that we just take as everyone knows it, but we don’t all know it, and it should be a great conversation to have around that. But the reason I always bring up the whole workload thing is because when we think about workloads – and then you can’t have state without workloads, really. I just wanted to make sure that we tied those two things together. [00:36:58] CC: Why can you not have state without workloads? What does that mean? [00:37:01] BL: Well, the reason you can’t have state without workloads is because something is going to have to create that state, whether that workload is running in or out of a cluster. Something is going to have to create it. It just doesn’t come out of nowhere. [00:37:11] CC: That goes back to what Nick was saying, that he thinks a workload is a process. Was that what you said, Nick? [00:37:18] NL: It is, yeah, but I’m reneging on that. [00:37:23] CC: At least I could see why you said that. Sorry, Brian. I cut you off. [00:37:28] BL: What I was saying is a workload ultimately is one or more processes. It’s not just a process. It’s not a single process. It could be 10, it could be 1. [00:37:39] JR: I have one final question, and we can bail on this and edit it out if it’s not a good one to end with. I hope it’s not too big, but I think maybe one thing we overlooked is just why it’s hard to run stateful workloads in these new systems like Kubernetes. We talked about how there’s more complexity and stuff, but there might be some room to talk about – People have been spinning up an EC2 server, a server on the web, and running MySQL on it forever. Why, in like the Kubernetes world of pods and things, is it a little bit harder to run, say, MySQL just [inaudible 00:38:10]. Is that something worth diving into? [00:38:13] NL: Yeah, I think so. I would say that for things like, say, applications, like databases particularly, they are less resilient to outages. While Kubernetes itself is dedicated to – Or most container orchestrations, but Kubernetes specifically, are dedicated to running your pods continuously as long as they will, it is still somewhat of a shifting landscape. You do have priority and preemption. If you don’t set those things up properly, or if there’s just like a total failure of your system at large, your stateful application can just go down at any time. Then how do you reconcile the outage in data, whatever data that might have gotten lost? Those sorts of things become significantly more complicated in an environment like Kubernetes, where you don’t necessarily have access to a command line to run the commands to recover as easily. You may not, but it’s the same.
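Tying back to Brian’s earlier definition of a workload as everything that ultimately creates a pod — deployments create replica sets, which create pods — the sketch below walks that ownership chain with client-go. It assumes a reachable cluster, a kubeconfig at a placeholder path, and a reasonably recent client-go; it is an illustration of the relationship, not code from any project discussed in the episode.

```go
// Walk pod -> ReplicaSet -> Deployment via owner references, printing the
// "workload" each pod in the default namespace belongs to.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// "/path/to/kubeconfig" is a placeholder; point it at your own config.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	pods, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		owner := metav1.GetControllerOf(&pod)
		if owner == nil || owner.Kind != "ReplicaSet" {
			fmt.Printf("%s: not owned by a ReplicaSet\n", pod.Name)
			continue
		}
		rs, err := client.AppsV1().ReplicaSets(pod.Namespace).Get(ctx, owner.Name, metav1.GetOptions{})
		if err != nil {
			continue
		}
		if rsOwner := metav1.GetControllerOf(rs); rsOwner != nil && rsOwner.Kind == "Deployment" {
			// The pod, its ReplicaSet, and the Deployment together form one
			// "workload" in the sense discussed above.
			fmt.Printf("pod %s <- replicaset %s <- deployment %s\n",
				pod.Name, rs.Name, rsOwner.Name)
		}
	}
}
```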
In other cases, what databases do is they have these huge transactional logs, maybe they write them out in files, and then they process the transaction log whenever they have CPU time. If a database dies just suddenly, maybe its state is inconsistent because it had items that were to be processed in a queue that haven’t been processed. Now it doesn’t know what’s going on, which is why – [00:39:39] NL: That’s interesting. I didn’t know that. [00:39:40] BL: If you kill MySQL, like kill mysqld with a -9, that’s why it might not come back up. [00:39:46] JR: Yeah. Going back to Kubernetes as an example, we are living in this newer world where things can get rescheduled and moved around and killed and their IPs changed and things. It seems like this environment is, should I say, more ephemeral, and those types of considerations are becoming more complex. [00:40:04] NL: I think that really nails it. Yeah. I didn’t know that there were transactional logs in databases. I should, I feel like, have known that, but I just had no idea. [00:40:11] D: There’s one more part to the whole stateful, stateless thing that I think is important to cover, but I don’t know if we’ll be able to cover it entirely in the time that we have left, and that is from the network perspective. If you think about the types of connections coming into an application, we refer to some of those connections as stateful and stateless. I think that’s something we could tackle in our remaining time, or what’s everybody’s thought? [00:40:33] JR: Why don’t you try giving us maybe a quick summary of it, Duffie, and then we can end on that. [00:40:36] CC: Yeah. I think it’s a good idea to talk about network and then address that in the context of network. I’m just thinking an idea for an episode. But give us like a quick rundown. [00:40:45] D: Sure. A lot of the kind of older monolithic applications, the way that you would scale these things is you would have multiple of them, and then you would have some intelligence in the way that you’re routing connections down to those applications that would describe the ability to ensure that when Bob accesses a website and he authenticates, he’s going to authenticate to one specific instance of this application, and the intelligence up in the frontend is going to handle the routing to make sure that Bob’s connection always comes back to that same instance. This is an older pattern. It’s been around for a very long time, and it’s certainly the way that we first kind of learned to scale applications before we decided to break into microservices and kind of handle a lot of this routing in a more resilient way. That was kind of one of the early versions of how we do this, and that is a pretty good example of a stateful session, and that there is actually some – Perhaps Bob has authenticated and he has a cookie that allows him, that when he comes back to that particular application, a lot of the settings, his browser settings, whether he’s using the dark theme or the light theme, that sort of stuff, is persisted on the server side rather than on the client side. That’s kind of what I mean by stateful sessions. Stateless sessions mean it doesn’t really matter that the user is terminating to the same endpoint, because we’ve managed to keep the state with the client. We’re handling state on the browser side of things rather than on the server side of things.
So you’re not necessarily gaining anything by pushing that connection back to the same specific instance, but just to a service that is more widely available. There are lots of examples of this. I mean, Brian’s example of Google earlier. Obviously, when I come back to Google, there are some things I want it to remember. I want it to remember that I’m logged in as myself. I want it to remember that I’ve used a particular – I want it to remember my history. I want it to remember that kind of stuff so that I could go back and find things that I looked at before. There are a ton of examples of this when we think about it. [00:42:40] JR: Awesome! All right, everyone. Thank you for joining us in episode 6, Stateful and Stateless. Signing off. I’m Josh Rosso, and going across the line, thank you Nicholas Lane. [00:42:54] NL: Thank you so much. This was really informative for me. [00:42:56] JR: Carlisia Campos. [00:42:57] CC: This was a great conversation. Bye, everybody. [00:42:59] JR: Our newcomer, Brian Liles. [00:43:01] BL: Until next time. [00:43:03] JR: And Duffie Cooley. [00:43:05] D: Thank you so much, everybody. [00:43:06] JR: Thanks all. [00:43:07] CC: Bye! [END OF EPISODE] [0:50:00.3] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END] See omnystudio.com/listener for privacy information.
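As a small footnote to the stateful versus stateless session discussion that closes this episode, here is a hedged sketch using only Go's standard library. One handler keeps a per-user counter in server memory, so a sticky load balancer would have to pin the user to that one instance; the other carries the counter in a cookie, so any replica can serve the next request. The endpoint names and cookie format are made up for illustration, and a real application would sign or encrypt anything it stores client-side.

```go
// Stateful (in-memory) vs. stateless (cookie-carried) session counters.
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"sync"
)

var (
	mu     sync.Mutex
	visits = map[string]int{} // state held inside this single instance
)

// statefulHandler: the count lives in this process, so it resets or diverges
// if the next request lands on a different replica.
func statefulHandler(w http.ResponseWriter, r *http.Request) {
	user := r.URL.Query().Get("user")
	mu.Lock()
	visits[user]++
	n := visits[user]
	mu.Unlock()
	fmt.Fprintf(w, "server-side count for %q: %d\n", user, n)
}

// statelessHandler: the count travels with the client, so no instance
// affinity is needed.
func statelessHandler(w http.ResponseWriter, r *http.Request) {
	n := 0
	if c, err := r.Cookie("visits"); err == nil {
		n, _ = strconv.Atoi(c.Value)
	}
	n++
	http.SetCookie(w, &http.Cookie{Name: "visits", Value: strconv.Itoa(n)})
	fmt.Fprintf(w, "cookie-carried count: %d\n", n)
}

func main() {
	http.HandleFunc("/stateful", statefulHandler)
	http.HandleFunc("/stateless", statelessHandler)
	http.ListenAndServe(":8080", nil)
}
```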


23 Dec 2019

Rank #8

Podcast cover

Keeping up with Cloud Native (Ep 17)

If you work in Kubernetes, cloud native, or any other fast-moving ecosystem, you might have found that keeping up to date with new developments can be incredibly challenging. We feel this as well, and so we decided to make today’s episode a tribute to that challenge, as well as a space for sharing the best resources and practices we can think of to help manage it. Of course, there are audiences in this space who require information at various levels of depth, and fortunately the resources to suit each one exist. We get into the many different places we go in order to receive information at each part of the spectrum, such as SIG meetings on YouTube, our favorite Twitter authorities, the KubeWeekly blog, and the most helpful books out there. Another big talking point is the idea of habits or practices that can be helpful in consuming all this information, whether it be waiting for the release notes of a new version, tapping into different TLDR summaries of a topic, streaming videos, or actively writing posts as a way of clarifying and integrating newly learned concepts. In the end, there is no easy way, and passionate as you may be about staying in tune, burnout is a real possibility. So whether you’re just scratching the cloud native surface or up to your eyeballs in the codebase, join us for today’s conversation because you’re bound to find some use in the resources we share. Follow us: https://twitter.com/thepodlets Website: https://thepodlets.io Feedback: info@thepodlets.io https://github.com/vmware-tanzu/thepodlets/issues Hosts: Carlisia Campos Josh Rosso Duffie Cooley Olive Power Michael Gasch Key Points From This Episode: Audiences and different levels of depth that our guests/hosts follow Kubernetes at. What ‘keeping up’ means: merely following news, or actually grasping every new concept? The impossibility of truly keeping up with Kubernetes as it becomes ever more complex. Patterns used to keep up with new developments: the LWKD website, release notes, etc. Twitter’s helpful provision of information, from opinions to tech content, all in one place. How helpful Cindy Sridharan is on Twitter in her orientation toward distributed systems. The active side of keeping up such as writing posts and helping newcomers. More helpful Twitter accounts such as InfoSec. How books provide one source of deep information as opposed to the noise on Twitter. Books: Programming Kubernetes; Managing Kubernetes; Kubernetes Best Practices. Another great resource for seeing Kubernetes in action: the KubeWeekly blog. A call to watch the SIG playlists on the Kubernetes YouTube channel. Tooling: tab management and Michael’s self-built Twitter searcher. Live streaming and CTF live code demonstrations as another resource. How to keep a team updated using platforms like Slack and Zoom. The importance of organizing shared content on Slack. Challenges around not knowing the most important thing to focus on. Cognitive divergence and the temptation of escaping the isolation of coding by socializing. The idea that not seeing keeping up to date as being a personal sacrifice is dangerous. Using multiple different TLDR summaries to cement a concept in one’s brain. Incentives for users rather than developers of projects to share their experiences. The importance of showing appreciation for free resources in keeping motivation up. 
Quotes: “An audience I haven’t mentioned is the audience that basically just throws up their hands and walks away because there’s just too much to keep track of, right?” — @mauilion [0:05:15] “Maybe it’s because I’m lazy, I don’t know? But I wait until 1.17 drops, then I go to the release notes and really kind of ingest it because I’ve just struggled so much to kind of keep up with the day to day, ‘We merged this, we didn’t merge this,’ and so on.” — @joshrosso [0:10:18] “If you find value in being up to date with these things, just figure out – there are so many resources out there that address these different audiences and figure out what the right measure for you is. You don’t have to go deep on the code on everything.” — @mauilion [0:27:57] “Actually putting the right content in the right channel, at least from a higher level, helps me decide whether I want to like look at that channel today, and stuff that should be in the channel is not kind of in a conversation channel.” — @opowero [0:32:21] “When I see something that is going to give me the fundamentals, like I have other priorities now, I sort of always want to consume that to learn the fundamentals, because I think in the long term phase of, but then I neglect physically what I need to know to do in the moment.” — @carlisia [0:33:39] “Just do nothing, because our brain needs that. We need to not be listening, not be reading, just nothing. Just sit and look at the ceiling. Our brain needs that. Ideally, look at nature, like look outside, look at the air, go for a walk. We need that, because that recharges the brain.” — @carlisia [0:42:38] “Just consuming and keeping up, that doesn’t necessarily mean you don’t give back.” — @embano1 [0:49:32] Links Mentioned in Today’s Episode: Chris Short — https://chrisshort.net/ Last Week in Kubernetes Development — http://lwkd.info/ 1.17 Release Notes — https://kubernetes.io/docs/setup/release/notes/ Release Notes Filter Page — https://relnotes.k8s.io/ Cindy Sridharan on Twitter — https://twitter.com/copyconstruct InfoSec on Twitter — https://twitter.com/infosec?lang=en Programming Kubernetes on Amazon — https://www.amazon.com/Programming-Kubernetes-Developing-Cloud-Native-Applications/dp/1492047104 Managing Kubernetes on Amazon — https://www.amazon.com/Managing-Kubernetes-Operating-Clusters-World/dp/149203391X Brendan Burns on Twitter — https://twitter.com/brendandburns Kubernetes Best Practices on Amazon — https://www.amazon.com/Kubernetes-Best-Practices-Blueprints-Applications-ebook/dp/B081J62KLW/ KubeWeekly — https://kubeweekly.io/ Kubernetes SIG playlists on YouTube — https://www.youtube.com/channel/UCZ2bu0qutTOM0tHYa_jkIwg/playlists Twitch — https://www.twitch.tv/ Honeycomb — https://www.honeycomb.io/ KubeCon EU 2019 — https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2019/ Aaron Crickenberger on LinkedIn — https://www.linkedin.com/in/spiffxp/ Stephen Augustus on LinkedIn — https://www.linkedin.com/in/stephenaugustus Office Hours — https://github.com/kubernetes/community/blob/master/events/office-hours.md Transcript: EPISODE 17 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. 
If you’re an engineer, operator or technically minded decision maker, this podcast is for you.[EPISODE][0:00:41.5] DC: Good afternoon everybody and welcome to The Podlets. In this episode, we’re going to talk about, you know, one of the more challenging things that we all have to do, just kind of keep up with cloud native and how we each approach that and what we do. Today, I have a number of cohosts with me, I have Olive Power.[0:00:56.6] OP: Hi.[0:00:57.4] DC: Carlisia Campos.[0:00:58.6] CC: Hi everybody.[0:00:59.9] DC: Josh Rosso.[0:01:01.3] JR: Hey all.[0:01:02.8] DC: And Michael.[0:01:01.1] MICHAEL: Hey, hello.[0:01:04.8] DC: This episode, we’re going to do something a little different than we normally do. In most of our episodes, we try to remain somewhat objective around the problem and the potential solutions for it, rather than prescribing a particular solution. In this episode, however, since we’re talking about how we keep up with all of the crazy things that happen in such a fast ecosystem, we’re going to probably provide quite a number of examples or resources that you yourself could use to drive and to try and keep up to date with what’s happening out there.Be sure to check out the notes after the episode is over at thepodlets.io and you will find a link to the episodes up at the top part, click down to this episode, and check out the notes. There will be tons of resources. Let’s get started.One of the things I think about that’s interesting about keeping up with something like, you know, a Kubernetes or a fast-moving project, regardless of what that project is, whether it’s Kubernetes or, you know, for a while, it was the Mesos that I was following or OpenStack or a number have been big infrastructure projects that have been very fast moving over time and I think what’s interesting is I find that there’s multiple audiences that we kind of address when we think about what it means to ‘keep up,’ right?Keeping up with something like a project is interesting because I feel like there’s an audience that it’s actually very interested in what’s happening with the design goals or the code base of the project, and there’s an audience that is very specific to wanting to understand at a high level – like, “Give me the State of the World report like every month or so just so I can understand generally what’s happening with the project, like is it thriving? Is it starting to kind of wane? Are there big projects that it’s taking on?”And then there’s like, then I feel like there’s an audience somewhere in the middle there where they really want to see people using the project and understand, and know how to learn from those people who are using it so that they can elevate their own use of that project. They’re not particularly interested in the codebase per se but they do want to understand, are they exploring this project at a depth that makes sense for themselves? What do you all think about that?[0:03:02.0] CC: I think one thing that I want to mention is that this episode, it’s not so much about on-boarding people onto Kubernetes and the Kubernetes ecosystem. We are going to have an episode soon to talk specifically about that. How you get going, like get started. I think Duffy mentioned this so we’re going to be talking about how we all keep up with things. 
Definitely, there are different audiences, even when we’re talking about keeping up.[0:03:32.6] JR: Yeah, I think what’s funny about your audience descriptions, Duffy, is I feel like I’ve actually slid between those audiences a bit, right? It’s funny, back in the day, Kubernetes like one-four, one-five days, I feel like I was much more like, “What’s going on in the code?” Like trying to keep track of like how things are progressing.Now my role is a lot more focused with working with customers and standing up cube and like making a production ready. I feel like I’m a lot more, kind of reactive and more interested to see like, what features have become stable and impact me, you know what I mean? I’m far less in the weeds than I used to be. It’s a super interesting thing.[0:04:08.3] OP: Yeah, I tend to – for my role, I tend to definitely fall into the number three first which is the kind of general keeping an eye on things. Like when you see like interesting articles pop up that maybe have been linked internally because somebody said, “Oh, check out this article. It’s really interesting.”Then you find that you kind of click through five or six articles similar but then you can kind of flip to that kind of like, “Oh, I’m kind of learning lots of good stuff generally about things that folks are doing.” To actually kind of having to figure out some particular solution for one of my customers and so having to go quite deep into that particular feature.You kind of go – I kind of found myself going right in and then back out, right in, going back out depending on kind of where I am on a particular day of the week. It’s kind of a bit tricky. My brain sometimes doesn’t kind of deal with that sort of deep concentration into one particular topic and then back out again. It’s not easy.I find it quite tough actually some of the time.[0:05:05.0] DC: Yeah, I think we can all agree on that. Keeping track of everything is – it’s why the episode, right? How do we even approach it? It seems – I feel like, an audience I haven’t mentioned is the audience that basically just throws up their hands and walks away because there’s just too much to keep track of, right? I feel like we are all that at some point, you know?I get that.[0:05:26.4] OP: That’s why we have Christmas holidays, right? To kind of refresh the brain.[0:05:31.4] CC: Yeah, I maybe purposefully or maybe not even – not trying to keep up because it is too much, it is a lot, and what I’m trying to do is, go deeper on the things that I already, like sort of know. And things that I am working with on a day to day basis. I only really need to know, I feel like, I really only need to know – because I’m not working directly with customers.My scope is very well defined and I feel that I really only need to know whenever there’s a new Kubernetes release. I need to know what the release is. We usually – every once in a while, we update our project to the – we bump up the Kubernetes release that we are working against and in general, yeah, it’s like if things come my way, if it’s interesting, I’ll take a look, but mostly, I feel like I work in a spiral.If I’m doing codes related to controllers and there’s a conference talk about controllers then okay, let me take a look at this to maybe learn how to design this thing better, implement in a better way if I know more about it. If I’m doing, looking at CRDs, same thing. I really like conference talks for education but that’s not so much keeping up with what’s new. 
Are we talking about educating ourselves with things that we don’t know about?Things that we don’t know about. Or are we talking about just news?[0:07:15.6] JR: I think it’s everything. That’s a great question. One of my other questions when we were starting to talk about this was like, what is keeping up even mean, right? I mean, does it mean, where do you find resources that are interesting that keep you interested in the project or are you looking for resources that just kind of keep you up to date with what’s changing? It’s a great question.[0:07:36.2] MICHAEL: Actually, there was some problem that I faced when I edit the links that I wanted to share in the show. I started writing the links and then I realized, “Well, most of the stuff is not keeping up with news, it’s actually understanding the technology,” because I cannot keep up.What does help me in understanding specific areas, when I need to dig into them and I think back five or four years into early days of Kubernetes, it was easy to catch up by the time because it was just about Kubernetes. Later right, it became this platform. We realized that it actually this platform thing. Then we extended Kubernetes and then we realized there are CICD-related stuff and operations and monitoring and so the whole ecosystem grew. The landscape grew so much that today, it’s impossible to keep up, right?I think I’m interested in all those patterns that you have developed over the years that help you to manage this, let’s say complexity or stream of information.[0:08:33.9] DC: Yeah, I agree. This year, I was thinking about putting up a talk with Chris Short, it was actually last year. That was about kind of on the same topic of keeping up with it. In that, I kind of did a little research into how that happens and I feel like some of the interesting stuff that came out of that was that there are certain patterns that a project might take on that make it easier or more approachable to, you know, stay in contact with what’s happening.If we take Kubernetes as an example, there are a number of websites I think that pretty much everybody here kind of follows to some degree, that helps, sort of, kind of, address those different audiences that we were talking about.One of the ones that I’ve actually been really impressed with is LWKD which stands for Last Week in Kubernetes Development, and as you can imagine, this is really kind of focused on, kind of – I wouldn’t say it’s like super deep on the development but it is watching for things that are changing, that are interesting to the people who are curating that particular blog post, right?They’ll have things in there like, you know, code freezes coming up on this date, IPV6, IPV4, duel stack is merging, they’ll have like some of the big mile markers that are happening in a particular release and where they are in time as it relates to that release. I think if that’s a great pattern and I think that – it’s a very narrow audience, right? It would really only be interesting to people who are interested in, or who are caught up in the code base, or just trying to understand like, maybe I want a preview of what the release notes might look like, so I might just like look for like a weekly kind of thing.[0:10:03.4] JR: Yeah, speaking of the release notes, right? It’s funny. I do get to look at Last Week in Kubernetes development every now and then. 
It’s an awesome resource but I’ve gotten to the point where the release notes are probably my most important thing for staying up to date.Maybe it’s because I’m lazy, I don’t know, but I wait till 1.17 drops, then I go to the release notes and really kind of ingest it because I’ve just struggled so much to kind of keep up with the day to day, “We merged this, we didn’t merge this,” and so on. That has been a huge help for me, you know, day to day, week to week, month to month.[0:10:37.0] MICHAEL: Well, what was also helpful just on the release notes that the new filter webpage that they put out in 1.15, starting 1.15. Have you all seen that?[0:10:44.4] JR: I’ve never heard of it.[0:10:45.4] DC: Rel dot, whatever it is. Rel dot –[0:10:47.7] MICHAEL: Yeah, if you can share it Duffy, that’s super useful. Especially like if you want to compare releases and features added and –[0:10:55.2] DC: I’ll have to dig it up as well. I don’t remember exactly what –[0:10:56.7] CC: I’m sorry, say? Which one is that again?[0:10:59.1] MICHAEL: The real notes. I’ll put it in the hackMD.[0:11:02.8] DC: Yeah relnotes.k8s.io which is an interesting one because it’s sort of like a comparison engine that allows you to kind of compare what it would have featured like how to feature relates to different versions of stuff.[0:11:14.4] CC: That’s great. I cannot encourage enough for the listeners to look at the show notes because we have a little document here that we – can I? The resources are amazing. There are so many things that I have never even heard about and sound great – is – I want to go to this whole entire list. Definitely check it out. We might not have time to mention every single thing. I don’t want people to miss on all the goodness that’s been put together.[0:11:48.7] DC: Agreed, and again, if you’re looking for those notes, you just go to the podlets.io. Click on ‘episodes’ at the right? And then look for this episode and you’ll find that it’s there.[0:11:58.0] CC: I can see that a lot of the content in those notes are like Twitter feeds. Speaking personally, I’m not sure I’m at the stage yet where I learn a lot about Twitter feeds in terms of technical content. Do you guys find that it’s more around people’s thoughts around certain things so thought-provoking things around Kubernetes and the ecosystem rather than actual technical content. I mean, that’s my experience so far.But looking at those Twitter feeds, maybe I guess I might need to follow some of those feeds. What do you all think?[0:12:30.0] MICHAEL: Do you mean the tweets are from those like learn [inaudible 0:12:32] or the person to be tweets?[0:12:35.3] OP: You’ve listed some of there, Michael, and some sort of.[0:12:37.6] MICHAEL: I just wanted to get some clarity. The reason I listed so many Twitter accounts there is because Twitter is my only kind of newsfeed if you will. I used Feedly and RSS and others before and emails and threads. But then I just got overwhelmed and I had this feeling of missing out on all of those times.That’s why I said, “Okay, let’s just use Twitter.” To your question, most of these accounts are people who have been in the Kubernetes space for very long, either running Kubernetes, developing on Kubernetes, having opinions about Kubernetes.Opinions in general on topics related to cloud native because we didn’t want to make the search just about Kubernetes. 
Most of these people, I really appreciate their thoughts and some of them also just a retweet things that they see which I missed somewhere else and not necessarily just opinions. I think It’s a good mix of these accounts, providing options, some guidance, and also just news that I miss out on because not being on the other channels.[0:13:35.6] OP: Yeah, I agree because sometimes you can kind of read – I tend to require a lot of sort of blog posts and sort of web posts which, you know, without realizing it can be kind of opinionated and then, you know, it’s nice to then see some Twitter feeds that kind of actually just kind of give like a couple of words, a kind of a different view which sometimes makes me think “Okay, I understand that topic from a certain article that I’ve read, it’s just really nice to hear a kind of a different take on it through Twitter.”[0:14:03.0] CC: I think some of the accounts, like fewer of the accounts – and there are a bunch of things that – there are listed accounts here that I didn’t know before so I’ll check them out. I think fewer of the accounts are providing technical content, for example, Cindy Sridharan, not pronouncing it correctly but Cindy is great, she puts out a lot of technical content and a lot of technical opinion and observations that is really good to consume. I wish I had time to just read her blog posts and Twitter alone.She’s very oriented towards distributed systems in general, so she’s not even specific just Kubernetes. Most of the accounts are very opinionated and the benefit for me is that sometimes I catch people talking about something that I didn’t even know was a thing. It’s like, “Oh, this is a thing I should know about for the work that I do,” and like Michael was saying, you know, sometimes I catch retweets that I didn’t catch before and I just – I’m not checking out places, I’m not checking – hash tagging Reddit.I rely on Twitter and the people who I follow to – if there is a blog post that sounds important, I just trust that somebody would, that I’m going to see it multiple times until like, “Okay, this is content that is related to something and I’m working on, that I want to get better at.” Then I’ll go and look at it. My sources are mainly Twitter and YouTube and it’s funny because I love blog posts but it’s like I haven’t been reading them because it takes a long time to read a blogpost.I give preference to video because I can just listen while I’m doing stuff. I sort of stopped reading blog post which is sad. I also want to start writing posts because it’s so helpful for me to engrain the things that I’m learning and hopefully it will be helpful to other people too. But in any case, go Duffy.[0:16:02.8] DC: A number of people that I follow – I have been cultivating my feed pretty carefully, trying to get a broad perspective of technical stuff that’s happening. But also I’ve been trying to develop my persona on Twitter a bit more, right? I’m actually trying to build my audience there. What’s interesting there is I’ve been trying to – to that end, what I’ve been doing is like trying to amplify voices that I think aren’t heard enough out there, right?If I see an article by somebody who is just coming into Kubernetes. or just coming into distributed systems and they’ve taken an effort to really lay out something that they found really interesting about pretty much anything, right? I’m like, “Okay, that’s pretty awesome,” and I’ll try to amplify that, right? 
Sometimes I even get involved or I’ll, not directly in public on Twitter but I’ll offer to help edit or help provide whatever our guidance I can provide around that sort of stuff.If I see people like having a difficult time with a particular project or something like that, I’ll reach out privately and say, “Hey, can I help you with it so you can go out there and do a great job,” you know? That is something I love to do. I think your point about like not necessarily going at Twitter for the deep knowledge stuff but more just like making sure that you have a broad enough awareness of what’s happening in different ecosystems that you’re not surprised by the things when the things change, right?A couple of other people that I follow are Akira Asuta, I can’t say enough about that person. They are amazing, they have been doing like, incredibly deep security stuff as it relates to containerization and stuff like that for quite a while. I’m always like, learning brand new things to me when following folks like that. I’ve been kind of getting more interested in InfoSec Twitter lately, learning how people kind of approach that problem.Also some of the bias arounds that which has been pretty interesting. Both the bias against people who are in InfoSec which seems weird to me. Also, how InfoSec approaches a problem, like do they put it like a learning experience or they approach it like an attack experience.It’s been kind of fascinating to get in there.[0:18:08.1] OP: You know, I kind of use Twitter as well for some of this stuff but you know, books are kind of a resource as well but in my head, kind of like at the opposite scale. You know, I obviously don’t read as many books as I read twitter feeds, right? It’s just kind of like, with Twitter, you can kind of digest the whole of the stuff and with books, it’s kind of like – I tend to be trying – because I know, I’m only going to read – like I’m only going to read maybe one/two books a year.I’ve kind of like – as I said before, blog posts seem to take up my reading time and books kind of tend to be for like on airplanes and stuff. So if – they’re just kind of two opposite resources for me but I find actually, the content of books are probably stuff that I digest a bit more because you know, it’s kind of like, I don’t know, back to the old days. It’s kind of a physical thing on hand and I can kind of read it and digest it a bit more than the kind of throwaway stuff that kind of keeps on Twitter.Because to be honest, I don’t know what’s on Twitter. Who is kind of a person to listen to or who is not or who is – I just try and form my own opinions and then, again, it kind of gets a bit overwhelming, because it’s a lot of content just streaming through continuously, whereas a book, it’s kind of like just one source of information that is kind of like a bit more personal that I can digest a bit more.[0:19:18.1] JR: Any particular book recommendation in 2019, Olive, that you found particularly interesting?[0:19:23.5] OP: I’m still reading, and it’s on the list for the episode notes actually, Programming Kubernetes. I just want to kind of get into that sort of CRD sort of mindset a bit. 
I think that’s kind of an area that’s interesting and an area that a lot of people will want to use in their organizations, right, because it’s going to do some of the extensibility to Kubernetes that’s just not there out of the box and everybody wants something that’s not out of the box or always in my experience.[0:19:47.4] MICHAEL: I found the Managing Kubernetes, I think was it, by – from Brendan Burns and some other folks which was just released I think in the end of last year. Super deep and that is kind of the opposite to the Programming Kubernetes, because I like that as well. That is more geared towards understanding architecture and operations.Operational concepts –[0:20:05.0] OP: They’re probably the two books I’ve read.[0:20:08.4] MICHAEL: Okay.[0:20:08.9] OP: One a year, remember?[0:20:11.4] MICHAEL: Yeah.[0:20:14.6] OP: Prolific reading.[0:20:19.6] CC: I think if you know what you need to learn about cloud native or Kubernetes, there’s amazing books out there, and if you are still exploring Kubernetes and trying to learn, I cannot recommend this book enough. If you are watching this on YouTube, you’ll see the cover. It’s called Kubernetes Best Practices because it’s about Kubernetes best practices but what they did simultaneously and maybe they didn’t even realize is just they gave a map for the entire thing.You go, “Oh, these are all the elements in Kubernetes.” Of course, it’s saying, “Okay, this is the best way to go about setting the stuff up,” and this is relatively thin but I just think that going through this book, you get really fast overview of the elements in Kubernetes. Then you can go to other books like Managing Kubernetes to go deep and understand all of the knobs and switches.[0:21:24.6] DC: I want to bring it back to the patterns that we see successful projects. Projects that you think are approachable but, you know, projects that are out there that make it easy for you to kind of stay – or easier at least to stay up to date with them, what some of those patterns are that you think are useful for projects.We’re talking about like having a couple of different entry points from kind of a weekly report mechanism, we’ve talked about the one that LWKD is, I don’t think we got to talk about KubeWeekly which is actually a weekly blog that is actually curated by a lot of the CNCF ambassadors. KubeWeekly is also broken up in different sections, so like sometimes they’ll just talk about – but they’re actually going out actively and trying to find articles of people using Kubernetes and then trying to post those.If you’re interested in understanding how people are actually out there using it, then that’s a great place to go find articles that are kind of related to that. What are some other patterns that we see that are out there that are useful for books?[0:22:27.6] DC: One that I really like. Kubernetes, for everyone listening has this notion of special interest groups, SIGs oftentimes. They’re focused on certain areas of the project. There’s some for networking and storage and life cycles of clusters and what’s amazing, I try to watch them somewhat weekly, I don’t always succeed.They’re all on YouTube and if you go to the Kubernetes project YouTube, there’s playlists for every SIG. A lot of times I’m doing work relating to life cycles of clusters. I’ll open up the cluster life cycle playlist and I’ll just watch the weekly meetings. 
While it doesn’t always pertain to completely to me, it lets me understand kind of where the developers and contributor’s heads are at and where they’re kind of headed with a lot of different things.There’s a link to that as well if anyone wants to check it out.[0:23:15.9] MICHAEL: Exactly, to add to that. If you don’t have the time to watch the videos, the meeting notes that these gentlemen and women put together are amazing. Usually, I just scroll through and if it’s something to triggers, I go into the episode and watch it.[0:23:28.7] OP: I almost feel like we should talk about tooling to handle all of this stuff, for example, right now, I think I have 200 tabs opened. I just started learning about some chrome extensions to manage tabs. I haven’t started really using them but I need. I don’t have a good system. My system is open a video that I’m pretty sure I want to watch and just get to that tab eventually until something happens in my chrome goes bust and I lose everything.I wanted to mention that when we say watch YouTube, some things you don’t need to sit there and actually watch, you can just listen to it and if you pay for the five bucks for YouTube premium – I don’t get a commission you people, but I’m just saying, for me, it’s so helpful. I can just turn off you know, put my phone on my pocket and keep listening to it without having to have the phone open and on the whole time. It’s very handy.It’s just like listening to a podcast. I also listen to podcasts lots of days.[0:24:35.1] MICHAEL: For tooling, since I’m just mostly on Twitter and by the time I was using or starting to use Twitter, they didn’t have this bookmark function, so I was basically abusing likes or favorites at the time, I think, to bookmark. What I realized later, my bookmarks grew, well, my likes grew.I wanted to go back and find something but that through the Twitter search was just impossible. I blew the tiny little go tool, kind of my first exercise there to just parse my likes and then use JQ because it’s all JSON to query and manipulate the stuff. I almost use it every day because I was like, that was a talk or blog post about scheduling and just correct for scheduling and the likes.I’m sure there’s a better tool or way of doing that but for me, that’s mine too. Because that’s my workflow.[0:25:27.6] DC: Both of the two blogs that you mentioned both KubeWeekly and LWKD, they both have the ability to take – you can submit stories to them. 
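For readers curious about the kind of tool Michael describes above — parsing a JSON export of Twitter likes and searching it, the job he otherwise hands to jq — here is a hedged Go sketch. The file name and field names are assumptions, since the shape of a likes export varies; adjust the struct tags to match whatever your export actually contains.

```go
// Search a local JSON export of liked tweets for a keyword.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// like models one liked tweet; "full_text" and "url" are assumed field names.
type like struct {
	FullText string `json:"full_text"`
	URL      string `json:"url"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: likegrep <keyword>")
		return
	}
	keyword := strings.ToLower(os.Args[1])

	data, err := os.ReadFile("likes.json") // hypothetical export file
	if err != nil {
		panic(err)
	}
	var likes []like
	if err := json.Unmarshal(data, &likes); err != nil {
		panic(err)
	}
	for _, l := range likes {
		if strings.Contains(strings.ToLower(l.FullText), keyword) {
			fmt.Printf("%s\n  %s\n", l.FullText, l.URL)
		}
	}
}
```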
If you come across things that are interesting and you’d like to put that up on an aggregator somewhere, this is one of the ways to kind of solve that problem because at least if it gets cleared up on an aggregator, you know that you go back to the aggregator to see it, so that helps.Some other ones I’ve seen out there, I’ve seen people, I’ve seen a number of interesting startups now, starting to kind of like put out a podcast or – and I have started to see a number of people like you know, engaging with Twitch and also doing things like what we do with TJK.io which is like have sort of some kind of a weekly thing where you are just hacking on stuff live and just exploring it whether that is related to – if you think of about TJK is we’re going to do without being related necessarily to anything that we are doing at VMware just anything to do with the community but obviously if you are working for one of the small companies like Honeycomb or some other company.A smaller kind of startup, you can really just get people more aware of that because for some reason people love to watch others code. They love to understand how people go through that, what are their thought process is and I find it awesome as well. I think it is amazing to me how big a draw that is, you know?[0:26:41.1] OP: And is there lots of them out there Duffy? Is that kind of an easy searchable thing or is it like how do you know those things are going on?[0:26:48.4] DC: Oddly enough Twitter, most of the time, yeah. I mean, most of the time I see that kind of stuff happening on Twitter, like somebody will like – I will scope with this or a number of other people will say, “Hey, I am going to do a live stream during this period of time on this,” and I have actually seen a number of people doing live streams on CTFs, which are capture the flags. That one’s really been fascinating to me because it has been how do people think about approaching the security of an application.Like where do they look for weak spots and how do you determine, how do you approach that kind of a problem, which is fascinating. So yeah, I think it is important to remember that like you know, you are not the only one trying to keep up to date with all of this stuff, right? The one thing we all have said pretty consistently here is that it is a lot, and it is not just Kubernetes, right? Like any fast moving project. It could be your favorite Ruby module that has 200 contributors, right?It doesn’t matter what it is, it is a lot to keep a track of, and it represents some of that cognitive overheads that you have to think about. That is a lot to take on. Even if it is overwhelming, if you find value in being up to date with these things, just figure out – there are so many resources out there that address these different audiences and figure out what the right measure for you is. You don’t have to go deep on the code on everything.Sometimes it might be better to just try and find a source of information that gives you a high enough of a view. Maybe you are looking at the blog posts that come out on Kubernetes.io every release and you are just looking at the release notes and if you just read the release notes every release, that is already miles ahead of what I have seen a lot of folks out there when they are starting to ask me questions about how do you keep up to date.[0:28:35.9] JR: I’m curious, we have been talking a lot about keeping up as an individual. 
Do you all have strategies for how you help, let’s say, your overall team keep up with all the things that are going on? To give an example, Duffie, Olive and myself, at least at one point, were on the same team and we’d go out to disparate customers and see all of these different new things that they are trying to do or new projects that they are using. So we’d have to think about how do we get together and share that internally to make sure we are bringing the whole team along with what is going on in the ecosystem, especially from a customer perspective. I know one of the ways that we do that is having demos and things of that nature that we share weekly. Are there other strategies that you all use with your teams to kind of share interesting information and news? [0:29:25.5] MICHAEL: So what we do is mostly the way we share in our team, and we are a small team. We use Slack. We pre-filter in terms of like if there is stuff that I think is valuable for me and probably not for the whole team – obviously we are not going to share, but I think if it is related to something that the team has or to come grant and then I will share on Slack but we don’t have any formal way. I know people use some reports, weekly reports, or other platforms to distribute but we just use Slack. [0:29:53.0] DC: I think one of the things – one of the patterns that we had at [inaudible 0:29:54] that I thought was actually super helpful was that we would engage in a conversation. “I learned a cool new thing about whatever today,” and so we would say, “I am going to – ” and then we would start a Zoom call around that and then people could join if they wanted to, to be a part of the live discussion or not, and if they didn’t, they would still be able to see a recorded Zoom pop up in the channel later on. So even if your time zones don’t line up, like I know it is 2 AM or 3 AM or something like that for Olive right now, you can still go back to those recorded sessions and you’ll just see it on your daily Slack stuff. You would be able to see, “Oh, there was a conversation about whether you should deploy Kubernetes across availability zones or not. I would like to go see that,” and see what the inputs were, and so that can be helpful. [0:30:42.5] JR: Yeah, that is a super interesting observation. It is almost like remote-first teams that are used to these processes of recording everything and putting it in a Google doc. They are more equipped for that information sharing perhaps than like the water cooler conversations you’d have in the office. [0:30:58.5] OP: And on Slack or any of the communication tools, we have different channels, because we are all in lots of channels, and to have channels dedicated to a particular subject is absolutely the way to go, because otherwise, in my previous company there seemed to be kind of one main channel that all the architects used to discuss everything on, and you know, sometimes you join and you’re like, “What is everybody talking about?” There would be literally about a hundred messages on some sort of theme that I have never heard of. So you come away from that thinking, “That is the main channel. Where is the bit – are there messages in the middle that I missed that were just normal discussions as opposed to ones around the technical stuff,” and so it made me a bit sad, right? I would be like, “I haven’t understood something and there is a whole load of stuff on this channel that I don’t understand.” But it is the kind of central channel for everyone. 
So I think you end up then start looking up things that they are discussing and then realizing actually that is not really anything related to what I need to know about today or next week. It might be something for the future but I’ve got other stuff to focus on. So my point is that those communication channels for me sometimes can make me feel a little bit behind the curve or very much sort of reactive in trying to jump on things that are actually not really anything to do with me for me now and wasting my time slightly and kind of messing with my head a little bit in that like, “I really need to try and focus out stuff,” and actually putting the right content in the right channel, at least from a higher level, helps me decide whether I want to like look at that channel today, and stuff that should be in the channel is not kind of in a conversation channel. So organization of where that content is, is important to me.[0:32:37.6] CC: I am so in the same page with you Olive. That is the way my brain works as well. I want to have multiple channels, like if we are talking about Slack or any chat tool, but some people have such aversion to multiple channels. They really have a hard time dealing with too many – like testing their threshold of what they think is too many channels. So I am always mindful too, like it has to work for everybody but if it was up to me, there will be one channel per topic. So I know where to focus on.But you said something that is so interesting. How do we even just – like you were saying in the context of channel, multiple channels, and I go, if I need to pay attention to this this week as oppose to like, I don’t need to look at this until some time in the future. How do we even decide what we focus on that is useful for us in the moment versus it would be good for me to know but I don’t need to know right now.I am super bad at this. When I see something that is going to give me the fundamentals, like I have other priorities now, I sort of always want to consume that to learn the fundamentals because I think in the long term phase of, but then I neglect physically what I need to know to do in the moment and I am trying to sort of fish there and get focused on in the moment things. Anybody else have a hard time?[0:34:04.5] DC: You are not alone on that, yeah.[0:34:06.7] CC: It is terrible.[0:34:08.3] MICHAEL: Something that I wish I would do more often as like being a good citizen is like when you read a lot, probably 90% of my time is not writing but reading, maybe even more and then I share and then on Twitter, the tweet for them the most successful ones in terms of retweets or likes are the ones where I do like TLDR’s or some screen captures like too long to read. Where people don’t have the time, they might want to read the article but they don’t have the time.But if you put in like a TLDR like either a tweet or a thread on it, a lot of people would jump onto it because they can just easily capture it and they can still read the full article if they want but that is something that I learned and it is pretty – what is the right word? Helpful to my followers and the community but I just don’t do it that often unfortunately. If I am writing, summarizing, writing, I kind of remember. That is how the brain works. It is a nice side effect.[0:35:04.9] DC: I was saying, this is definitely one of those things where you can be the change you want to see if you, you know?[0:35:08.6] M: Yeah, I know.[0:35:10.0] DC: This is awesome. 
I would also say that what you just raised Carlisia is like a super valid point. I mean like not everybody’s brain works the same way, right? There are people who are neuro-divergent. There are people who think very linearly and they are very comfortable with that and there are people who don’t. So it is a struggle I think regardless of how your brain is wired to understand to how to prioritize the attention you will give any given subject.In some cases, your brain is not wired – your brain is almost wired against that whole idea, like you are just not set up for success when it comes to figuring out how to prioritize your attention.[0:35:49.0] CC: You hit the nail on the head. We are so set up for failure in that department because there are so many interesting conversations and you want to hop in and you want to be a part of the conversation and part of the group and socialize. Our work is so isolating to really put our heads down and just work, it can be so isolating. So it is great to participate in conversations out there even if it is for only via Twitter. I mean, obviously we are very biased towards Twitter here in this group.But I am not even this on Twitter so just keep that in mind that we are cognizant of that but in any case, I don’t know what the answer is but what I am trying always to cut down on that, those social activities that seem so appealing. I don’t know how to do that from working out.[0:36:43.9] JR: I am in the same boat. 2020, I am hoping to let more of that go and to your point, it is not that there is no value in it. It is just, I don’t know, I am not deriving the same amount of quality out of it because I am so just multiplexed all over the place, right? So we’ll see how it goes.[0:36:59.9] CC: Oh if any listener has opinions and obviously it seems that all of us are helpless in that department. Share with us, please.[0:37:12.5] DC: It is a tricky one. I think it is also interesting because I find that when we talk about things like work-life balance, we think of the idea of maybe work-life balance is that when you come at the end of the day and you go home and you don’t think about work, right? Sometimes we think that work-life balance means that you have a certain amount of time off that you can actually spend with your family and your friends or your community, what have you, and not be engaging on multiple fronts.Just be that – have that be your focus, but when it comes to things like keeping up, when it comes to things like learning or elevating your education and stuff, it seems like, for the most part, and this is just my own assumption, I am curious how you all feel about this, that we don’t – that that doesn’t enter into it, right? Your personal time is totally on the table when it comes to how do you keep up with these things. We don’t even think about it that way, right?I know I personally don’t. I definitely have to do more and cut back on the amount of time that I spend reading. I am right there with Michael on 90% of my time when my eyes are open, they are either reading or staring up on the sky while I try to think about what I am going to write next. You know one way or the other it is like that is what I am doing.[0:38:24.0] CC: Yeah.[0:38:25.1] MICHAEL: I noticed last year on my Twitter feed, more people than the years before will complain about like personal burn out. I saw a pattern, like reading those people’s tweets, I saw a pattern there. 
It wasn’t really like a spiral and then they realized and they shot down like deleted Twitter from their phones or any messaging and other stuff, and I think I am at the point where I also need to do that when it comes to vacation PDO, or whatever.Because I am just like, as you said Duffy, my free time is on the table when it comes to Twitter and catching up and keeping up because work-life balance in my mind is not work but what is not work for like – Kubernetes is exciting, adding in all the space, like what is not work there? I need to really get better at that because I think I might end in the same spiral of just soaking in more until I just –[0:39:17.7] CC: Yeah and like Josh said, it is not that there isn’t a value. Obviously we derive a huge value, that is why we’re on it, but you have to weigh things and what are your goals and is that the best way to your goals from where you are right now, and maybe you know, Twitter you use for a while, ramp up your knowledge, ramp up the connections because it is great for making connections, and then you step back and focus on something else, then to go on a cycle.This is how I am thinking now. It is just like what Olive was saying, you know, books are great, blog posts are great, and I absolutely agree with that. It is just that I don’t have even the time and when I have the time, I would be reading code and I would be reading things all day long, it is just really tiring for me at the end of the day to sit down and read more. I want to invest in learning how to speed read to solve that problem because I read a lot of books and blog posts. So something on my list.[0:40:22.8] DC: One of the biggest tips on speed reading I ever learned is that frequently when you read you think of saying the word and if you can get out of that habit, if you get out of the habit of saying the word even with your mouth or you just get out of that habit that will already increase the quickness of what you read.[0:40:39.5] CC: That is so interesting.[0:40:41.4] DC: Yeah, that is a trippy one.[0:40:43.1] CC: Because I think being bilingual, I totally like – that really helps me understand things, by saying the words.[0:40:52.9] DC: I think the point that we are all working around here is, there is a great panel that came out at KubeCon EU in 2019 was put on by Aaron Crickenberger, Esther McNaMara, Steven Augustus, these folks are all very high output people. I mean, they do a lot of stuff especially with regard to community and so they put on a panel that was talking about burn out and self-care and I think that it is definitely worth checking that one out.And actually also thinking about what keeping up means to you and making sure that you are measuring that against your ability to sustain, is incredibly important, right? I feel like keeping up is one of those subjects where we end up – it is almost insidious in its way to – it is a thing that we can just do all the time. We can just spend all of our time, any free moment that you have, you are sitting on the bus, you are trying to keep up with things.And because that happens so much, I feel like that is sort of one of the ways that we can feel burnt out as you are seeing today. We can feel like we did a lot of things but there was no real result to it and keep in mind that that’s part of it, right? 
Like when you are thinking about how we are keeping up with it, make sure that the value to your time is still something that you have some cognizance about, that you have some thought about, like is it worth it to me to just spend this six hours reading everything, right?Or would it be better for me to spend some amount of time just not reading, you know? Like doing something else, you know? Like bake a cake for crying out loud, you know?[0:42:29.5] CC: Something that a lot of times we don’t allow ourselves to do and I decided to speak for everybody I am sorry, I just do nothing, because our brain needs that. We need to not be listening, not be reading, just nothing. Just sit and look at the ceiling, our brain needs that. Ideally, look at nature, like look outside, look at the air, go for a walk. We need that, because that recharges the brain. Anyway, one thing also that I want to bring up, maybe we can mention real quick because we are coming up at the top of the hour.How do people, projects, how do we really help the users of those projects to be up to date with what they are doing?[0:43:18.4] DC: Well yeah I mean this is the different patterns that we are talking about. So I think the blog posts help. I like the idea of having blogs that are targeted towards different audiences. I like the idea of having an aggregate here for putting up a big project. I mean obviously Kubernetes is such a huge ecosystem that if you have things like KubeWeekly and I know that there are actually quite a number of things out there that try and do this.But if we can kind of agree on one like KubeWeekly I think is a pretty good one because it is actually run by the CNCF. So it kind of falls within that sort of governance as a model but having an aggregator where you can actually produce content or curate content as it relates to your project that’s helpful, and then office-hours I think is also helpful to Josh’s point. I mean office-hours and SIG hours are very similar things. I mean like office-hours there like how to developers think about what’s happening with the space.This is an opportunity for you as an end user to show up and ask questions, those sorts of patterns I think all are incredibly helpful as a project to figure out there to those things.[0:44:17.8] OP: Yeah, I know summary articles or the sort of TLDRs that Michael mentioned earlier, I think I need more of those things in my life because I do a lot of reading, because I think my brain is a bit weird in that I need to read something about five or six different times from five or six different articles for it to sort of frame in my head.So what I am trying to – like for 2020, I have almost tried to do this, is like if I think somebody knows all about this and it would save me reading those five, six, seven articles and if that person has the time, I try and sort of reach out to them and say, “Listen, have you got 20 minutes or so to explain this topic to me? Can I ask you questions about it?” It just saves me, saves my eyes reading the screen, and it just saves me time. 
I just need a TLDR summary of a project or a feature or something just so I can know what it is all about in my head and talk fairly sort of confidently about it.If I need to get in front and down under the weeds then there is more reading to kind of do for me maybe the coding on the technical side, but sometimes I can’t figure out what this feature sort of means and what is its use case in the real world and I have to read through lots of articles and sometimes kind of vendor specific ones and they’ve got a different slant than maybe an independent one and trying to marry those bits up my head is a bit hard for me and there is sort of wealth of information.So if you are interested in a topic and there is hundreds of articles and you start reading four or five and they are all slightly different, eventually you figure out that – you are confident and I understand what that product is about but it has taken a long time to get there and it is taken a lot of reading time. So TLDRs is like really work and I think as Josh mentioned before, we have this thing internally where we do bench demos.And that is like a TLDR and a show and tell really quickly, like, “This is what this does and this is why we need to know about it and this is why our customers needs to know about it, the end,” you know? And that’s really, really useful because that just saves a whole bunch of people a whole bunch of time figuring out A, whether they need to know about it and B, actually now understanding that product or feature at the end of the five, 10 minutes which is what they typically are. So they are very useful short snippets of information. Maybe we are back to Twitter.[0:46:37.8] JR: Similar to the idea of giving a demo Olive, you made me think of something and that is that I think one of the ways that I keep up with the space is actually through writing along with reading and I think the notion of like – and this admittedly takes up time and the whole quality of life conversation comes in but using writing to help develop your thoughts and kind of aggregate all of these crazy inputs and try to be somewhat concise, which I know I struggle with, around something I’ve learned.It’s helped me a ton and then that asset kind of becomes reusable to share with other people the thing that you wrote. So for people listening to this I guess maybe a call to action for 2020 if that is your style as well, consider starting to write yourself and becoming a resource, right? Because even if you are new to this space, you’d be amazed at just how writing from your perspective can help other people.[0:47:26.3] DC: I think another one that I actually have been impressed with lately is that a number of consumer companies like people out there like Lyft and companies like that have actually started to surface engineering blogs around how they are using technology and how they are using technology to solve things, which I think, as a service provider, as somebody who is involved in the community of Kubernetes, I find those to be incredibly valuable because I get to actually see how those things are doing.I mean at the same time, I see things like – we talked about KubeCon, which is a convention that they have every year. Obviously the project is large enough to support it but there is actually an incentive if you are a consumer of that project to go and talk about how you are using it, right? It is incentivized in that it is more likely your talk will be accepted if you are a consumer of the product than somebody building it, right? 
We hear from people building it all the time.I love that idea of incentivizing people who are using this thing get out there and talk about it or share their ideas about it or how they are using it, what problems did it solve for them. That is critical I think.[0:48:31.0] CC: Can I also make a suggestion – is to not so much following on the thread that we are talking about just now but kind of on the general thread of this episode. If you have resources that you do use to keep up with things, stop this recording right now and go and give them a like, give them a follow, give them a thumbs up, show somehow appreciation because what Duffy said just now, he was saying, “Oh it is so helpful when I read a blog post.”But people who are writing, they want to know that. So give them some indication, it counts a lot. It takes a lot of effort to sit down and write something or produce a podcast and if you take any, derive any benefit from it, show appreciation. It motivates people to keep doing it.[0:49:26.4] DC: Yeah, agreed.[0:49:27.9] M: I think that is a great bind maybe to close off this episode because it reiterates that just consuming and keeping up that doesn’t necessarily mean you don’t give back, right? So this is a way of giving back, which is really important to keep that flow and creativeness.[0:49:41.8] CC: I go through a lot of YouTube videos and sometimes I just play one after the other but sometimes, you know, I have been making a point of going back and liking it. Liking the ones that I like – obviously I don’t like everything. I mean things that I don’t like I don’t listen in but you know what I mean? It takes no effort but just so people know, “OK, you did a good job here.” By the way, go to iTunes and rate us. So we will know that you liked it and it will help people find our show, our podcast, and if you are watching us on YouTube, give us a like.[0:50:16.1] DC: All right, well unless anybody has any final thoughts, that is what we wanted to cover this session. So thank you all very, very much and I look forward to seeing you next week.[0:50:25.3] M: Bye-bye.[0:50:26.3] CC: Thank you so much.[0:50:27.4] OP: Bye.[0:50:28.1] JR: Bye.[END OF EPISODE][0:50:28.7] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.[END]See omnystudio.com/listener for privacy information.


17 Feb 2020

Rank #9


Disaster and Recovery (Ep 8)

In this episode of The Podlets Podcast, we are talking about the very important topic of recovery from a disaster! A disaster can take many forms, from errors in software and hardware to natural disasters and acts of God. That being said, there are better and worse ways of preparing for and preventing the inevitable problems that arise with your data. The message here is that issues will arise, but through careful precaution and the right kind of infrastructure, the damage to your business can be minimal. We discuss some of the different ways that people are backing things up to suit their individual needs, recovery time objectives and recovery point objectives, what high availability can offer your system and more! The team offers a bunch of great safety tips to keep things from falling through the cracks, and we get into keeping things simple, avoiding too much mutation of infrastructure, and why testing your backups can make all the difference. We naturally look at this question with an added focus on Kubernetes and go through a few tools that are currently available. So for anyone wanting to ensure safe data and a safe business, this episode is for you! Follow us: https://twitter.com/thepodlets Website: https://thepodlets.io Feedback: info@thepodlets.io https://github.com/vmware-tanzu/thepodlets/issues Hosts: https://twitter.com/carlisia https://twitter.com/bryanl https://twitter.com/joshrosso https://twitter.com/opowero Key Points From This Episode: • A little introduction to Olive and her background in engineering, architecture, and science. • Disaster recovery strategies and the portion of customers who are prepared. • What is a disaster? What is recovery? The fundamentals of the terms we are using. • The physicality of disasters; replication of storage for recovery. • The simplicity of recovery and keeping things manageable for safety. • What high availability offers in terms of failsafes and disaster avoidance. • Disaster recovery for Kubernetes; safety on declarative systems. • The state of the infrastructure and its interaction with good and bad code. • Mutating infrastructure and the complications in terms of recovery and recreation. • Plug-ins and tools for Kubernetes such as Velero. • Fire drills, testing backups and validating your data before a disaster! • The future of backups and considering what disasters might look like. 
Quotes:
“It is an exciting space, to see how different people are figuring out how to back up distributed systems in a reliable manner.” — @opowero [0:06:01]
“I can assure you, careers and fortunes have been made on helping people get this right!” — @bryanl [0:07:31]
“Things break all the time, it is how that affects you and how quickly you can recover.” — @opowero [0:23:57]
“We do everything through the Kubernetes API, that's one reason why we can do selective backups and restores.” — @carlisia [0:32:41]
Links Mentioned in Today’s Episode:
The Podlets — https://thepodlets.io/
The Podlets on Twitter — https://twitter.com/thepodlets
VMware — https://www.vmware.com/
Olive Power — https://uk.linkedin.com/in/olive-power-488870138
Kubernetes — https://kubernetes.io/
PostgreSQL — https://www.postgresql.org/
AWS — https://aws.amazon.com/
Azure — https://azure.microsoft.com/
Google Cloud — https://cloud.google.com/
Digital Ocean — https://www.digitalocean.com/
SoftLayer — https://www.ibm.com/cloud
Oracle — https://www.oracle.com/
HackIT — https://hackit.org.uk/
Red Hat — https://www.redhat.com/
Velero — https://blog.kubernauts.io/backup-and-restore-of-kubernetes-applications-using-heptios-velero-with-restic-and-rook-ceph-as-2e8df15b1487
CockroachDB — https://www.cockroachlabs.com/
Cloud Spanner — https://cloud.google.com/spanner/
Transcript: EPISODE 08 [INTRODUCTION] [0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you. [EPISODE] [00:00:41] CC: Hi, everybody. We are back. This is episode number 8. Today we have on the show myself, Carlisia Campos, and Josh. [00:00:51] JR: Hello, everyone. [00:00:52] CC: That was Josh Rosso. And Olive Power. [00:00:55] OP: Hello. [00:00:57] CC: And also Bryan Liles. [00:00:59] BL: Hello. [00:00:59] CC: Olive, this is your first time, and I didn’t even give you a heads-up. But tell us a little bit about your background. [00:01:06] OP: Yeah, sure. I’m based in the UK. I joined VMware as part of the Heptio acquisition; I joined Heptio way back last year in October. The acquisition happened pretty quickly for me. Before that, I was at Red Hat working on some of their cloud management tooling and a bit of OpenShift as well. Before that, I worked with HP and Fujitsu. I kind of work in enterprise management a lot, so things like desired state and automation are kind of things that have followed me around through most of my career. Coming in here to VMware, working in the cloud native applications business unit, is kind of a good fit for me. I’m a mom of two and I’m based in the UK, which, I have to point out, is currently undergoing a heat wave. We’ve had about 3 weeks of 25 to 30 degrees, which is warm, very warm for us. Everybody is in a great mood. [00:01:54] CC: You have a science background, right? [00:01:57] OP: Yeah, I studied chemistry in university and then I went on to do a PhD in cancer research. 
I was trying to figure out ways where we could predict how different people will going to respond to radiation treatments and then with a view to tailoring everybody’s treatment to make it unique for them rather than giving the same treatment to different who present you with the same disease but were response very, very different. Yeah, that was really, really interesting.   [00:02:22] CC: What is your role at VMware?  [00:02:23] OP: I’m a cloud native architect. I help customers predominantly focus on their Kubernetes platforms and how to build them either from scratch or help them get more production-ready depending on where they are in their Kubernetes journey. It’s been really exciting part of being part of Heptio and following through into the VMware acquisition. We’re going to speak to customers a lot at very exciting times for them. They’re kind of embarking on their Kubernetes journey a lot of them. We’re with them from the start and every step of the way. That’s really rewarding and exciting. [00:02:54] CC: Let me pick up on that thread actually, because one thing that I love about this group for me, because I don’t get to do that. You all meet customers and you know what they are doing. Get that knowledge first-hand. What would you say the percentage of the clients that you see, how disaster recovery strategy, which by the way is a topic of today’s show.  [00:03:19] OP: I speak to customers a lot. As I mentioned earlier, a lot of them are like in different stages of their journey in terms of automation, in terms of infrastructure of code, in terms of where they want to go for their next platform. But there generally in the room a team that is responsible for backup and recovery, and that’s generally sort of leads into this storage team really because you’re trying to backup state predominantly.  When we’re speaking to customers, we’ll have the automation people in the room. We’ll have the developers in the room and we’ll have the storage people in the room, and they are the ones that are primarily – Out of those three sort of folks I’ve mentioned, they’re the ones that are primarily concerned about backup. How to back up their data. How to restore it in a way that satisfies the SLAs or the time to get your systems back online in a timely manner. They are the force concerned with that. [00:04:10] JR: I think it’s interesting, because it’s almost scary how many of our customers  don’t actually have a disaster recovery strategy of any sort. I think it’s often times just based on the maturity of the platform. A lot of the applications and such, they’re worried about downtime, but not necessarily like it’s going to devastate the business in a lot of these apps. I’m not trying to say that people don’t run mission critical apps on things like Kubernetes. It’s just a lot of people are very new and they’re just kind of ramping up. It’s a really complicated thing that we work with our customers on, and there’re so many like layers to this. I’m sure layers that we’ll get into. There are things like disaster recovery of the actual platform.  If Kubernetes, as an example, goes down. Getting it back up, backing up its data store that we call etcd. There’s obviously like the applications disaster recovery. If a cluster of some sort goes own, be it Kubernetes or otherwise, shifting some CI system and redeploying that into some B cluster to bring it back up. Then to Olive’s point, what she said, it all comes back to storage. Yeah. I mean, that’s where it gets extremely complicated. 
Well, at least in my mind, it’s complicated for me, I should say.  When you’re thinking about, “Okay, I’m running this PostgreS as a service thing on this cluster.” It’s not that simple to just move the app from cluster A to cluster B anymore. I have to consider what do I do with the data? How do I make sure I don’t lose it out? Then that’s a pretty complicated question to answer.  [00:05:32] OP: I think a lot of the storage providers, vendors playing in that storage space are kind of looking at novel ways to solve that and have adapted their current thinking maybe that was maybe slightly older thinking to new ways of interacting with Kubernetes cluster to provide that ongoing replication of data around different systems outside of the Kubernetes and then allowing it to be ported back in when a Kubernetes cluster – If we’re talking about Kubernetes in this instance as a platform, porting that data back in.  There’re a lot of vendors playing in that space. It’s kind of an exciting space really to see how different people are figuring out how to back up distributed systems in reliable manner, because different people want different levels of backup. Because of the microservices nature of the cloud native architectures that we predominantly deal with, your application is not just one thing anymore. Certain parts of that application need to be recovered fairly quickly, and other parts don’t need to recover that quickly.  It’s all about functionality ultimately that your end customers or your end users see. If you think about visually as like a banking application, for example, where if you’re looking at things like – The customer is interacting with that and they can check their financial details and they can check the current stages of their account, then they are two different services. But the actual service to transfer money into their account is down. It’s still a pretty functional system to the end user. But in the background, all those great systems are in place to recover that transfer of money functionality, but it’s not detrimental to your business if that’s down.  There’ll be different SLAs and different objectives in terms of recovery, in terms of the amount of time that it takes for you to restore. All of that has to be factored in into disaster recovery plans and it’s up to the company and we can help as much as possible for them to figure out which feats of the applications and which feats of your business need to conform to certain SLAs in terms of recovery, because different feats will have different standards and different times in and around that space. It’s a complicated thing. It definite is. [00:07:29] BL: I want to take a step back and unpack this term, disaster recovery, because I can assure you, careers and fortunes have been made on helping people get this right. Before we get super deep into this, what’s a disaster and then what’s a recovery for that? Have you thought about that at a fundamental level? [00:07:45] OP: Just for me, if we would kind of take it at face value. A physical disaster, they could be physical ones or software-based ones. Physical ones can be like earthquakes or floodings, fires, things like that that are happening either in your region or can be fairly widespread across the area that you’re in, or software, cyber attacks that are perhaps to your own internal systems, like your system has been compromised. That’s fairly local to you.  There are two different design strategies there. 
Physical disaster, you have to have a recover plan that is outside of that physical boundary that you can recover your system from somewhere that’s not affected by that physical disaster. For the recovery in terms of software in terms of your system has been compromised, then the recovery from that is different. I’m not an expert on cyber attacks and vulnerabilities, but the recovery from there for companies trying to recover from that, they plan for it as much as possible. So they down their systems and try and get patches and fixes to them as quickly as possible and spin the system backups. [00:08:49] BL: I’m understanding what you’re saying. I’m trying to unpack it for those of us listening who don’t really understand it. I’m going to go through what you said and we’ll unpack it a little bit. Physical from my assumption is we’re running workloads. Let’s say we’re just going to say in a cloud, not on-premise. We’re running workloads in let’s say AWS, and in the United States, we can take care local diversity by running in East and West regions.  Also, we can take care of local diversity by running in availability, but they don’t reach it, because AWS is guaranteed that AZ1 and AZ3 have different network connections, are not in the same building, and things like that. Would you agree? Do you see that? I mean, this is for everyone out there. I’m going to go from super high-level down to more specific.  [00:09:39] OP: I personally wouldn’t argue that, except not everybody is on AWS. [00:09:43] BL: Okay. AWS, or Azure, or Google Cloud, DigitalOcean, or SoftLayer, or Oracle, or Packet. If I thought about this, probably we could do 20 more.  [00:09:55] JR: IBM.   [00:09:56] BL: IBM. That’s why I said SoftLayer. They all practice in the physical diversity. They all have different regions that you can deploy software. Whether it’s be data locality, but also for data protection. If you’re thinking about creating a planet for this, this would be something you could think about. Where does my rest? What could happen to that data? Building could actually just fall over on to itself. All the hard drives are gone. What do I do? [00:10:21] OP: You’re saying that replication is a form of backup?  [00:10:26] BL: I’m actually saying way more than that. Before you even think about things when it comes to disaster recovery, you got to define what a disaster is. Some applications can actually run out of multiple physical locations. Let’s go back to my AWS example, because it’s everywhere and everyone understands how AWS works at a high-level. Sometimes people are running things out of US-East-1 and US-West-2, and they could run both of the applications. The reason they can do that is because the individual transactions of whatever they’re doing don’t need to talk to one another. They connect just websites out of places.  To your point, when you talk about now you have the issue where maybe you’re doing inventory management, because you have a large store and you’re running it out of multiple countries. You’re in the EU and you’re somewhere on APAC as well. What do you do about that?  Well, there are a couple of ways that – I could think about how we would do that. We could actually just have all the database connections go back to one single main service. Then what we could do with that main service is that we could have it replicated in their local place and then we can replicate it in a remote place too. If the local place goes up, at least you can point all the other sites back to this one. 
That’s the simplest way.  The reason I wanted to bring this up, is because I don’t like acronyms all that much, but disaster recovery has two of my favorite ones and they’re called RPO and RTO. Really, what it comes down to is you need to think about when you have a disaster, no matter that disaster is or how you define it, you have RTO. Basically, it’s the time that you can be down before there’s a huge issue. Then you have something called DPO, which is without going into all the names, is how far you can go since your last backup before you have business problems.  Just thinking about those things is how we should think about our backup disaster recovery, and it’s all based on how your business works or how your project works and how long you can be down and how much data you have.  [00:12:27] CC: Which goes to what Olive was saying. Please spell out to us what RTO and RPO stand for.   [00:12:35] BL: I’m going to look them up real quick, because I literally pushed those acronym meanings out. I just know what they mean.  [00:12:40] OP: I think it’s recovery time objective and recovery data objective.  [00:12:45] BL: Yeah. I don’t know what the P stands for, but it is for data.  [00:12:49] OP: Recovery.  [00:12:51] BL: It’s the recovery points. Yeah. That’s what it is. It is the recovery point objective, RPO; and recovery time objective, RTO. You could tell that I’ve spent a lot of time in enterprise, because we don’t even define words. The acronym means what it is. Do you know what the acronym stands for anymore?   [00:13:09] OP: How far back in terms of data can we go that was still okay? How far back in time can we be down, basically, until we’re okay?  [00:13:17] CC: It is true though, and as Josh was saying, some teams or companies or products, especially companies that are starting their journey, their cloud native journey. They don’t have a backup, because there are many complicated things to deal with, and backup is super complicated, I mean, the disaster recovery strategy. Doing that is not trivial.  But shouldn’t you start with that or at least because it is so complex? It’s funny to me when people say I don’t have that kind of a strategy. Maybe just like what Bryan said why utilizing, spreading out your data through regions, that is a strategy in itself, and there’s more to it. [00:14:00] JR: Yeah. I think I oversimplified too much. Disaster recovery could theoretically be anything I suppose. Going back to what you were saying, Brian, the recovery aspect of it. Recovery for some of the customers I work with is literally to stand on a brand-new cluster, whatever that cluster is, a cluster, that is their platform. Then redeploy all the applications on top of it.  That is a recovery strategy. It might not be the most elegant and it might make assumptions about the apps that run on it, but it is a recovery strategy that somewhat simple, simple to kind of conceptualize and get started with.  I think a lot of the customers that I work with when they’re first getting their bearings with distributed system of sorts, they’re a lot more concerned about solving for high availability, which is what you just said, Carlisia, where we’re spreading across maybe multiple sites. There’s the notion of different parts of the world, but there’s also the idea of like what I think Amazon has coined availability zones. Making sure if there is a disaster, you’re somewhat resilient to that disaster like Brian was saying with moving connections over and so on.  
Then once we’ve done high availability somewhat well, depending on the workloads that are running, we might try to get a more fancy recovery solution in place. One that’s not just rebuild everything and redeploy, because the downtime might not be acceptable. [00:15:19] BL: I’m actually going to give some advice to all the people out there who might be listening to this and thinking about disaster recovery. First of all, all that complex stuff, that book you read, forget about it. Not because you don’t need to know it. It’s because you should only think about what’s in scope at any given time. When you’re starting an application, let’s say – I’m actually making a huge assumption here – that you’re using someone else’s cloud. You’re using public cloud. Whenever you’re in your own data center, there’s a different problem. Whenever you’re using public cloud, think about what you already have. All the major public clouds have durable object storage. Many 9s of durability, and then fewer 9s, but still a lot of 9s, of availability too. The canonical example there is S3. When you’re designing your applications and you know that you’re going to have disaster issues, realize that S3 is almost always going to be there, unless it was 2017 and it goes down, or the other two failures that it had. Pretty much, it will be there. Think about how do I get that data into S3. I’m just saying, you can use it for storage. It’s fairly cheap for how much storage you can get. You can make sure it’s encrypted, and using IAM, you can definitely make sure that people who have the right privileges can see it. The same goes with Azure and the same goes with Google. That’s the first phase. The second phase is that now you’re going to say, “Well, what about a relational database?” Once again, use your cloud provider. All the major cloud providers have great relational databases, and actually key value stores as well. The neat thing about them is you can actually set them up sometimes to run in a whole region. You can set them up to do automated backups. At the minimum, you actually use your cloud provider for what it’s valuable for. Now, if you’re not using a cloud provider and you’re doing it on-premise, I’m going to tell you, the simple answer is I hope you have a little bit of money, because you’re going to have to pay somebody, either one of your Kubernetes architects or somebody else, to do it. There’s no easy button for this kind of solution. Just for this little mini-rant, I’m going to leave everyone with the biggest piece of advice, the best piece of advice that I can ever leave you if you’re running relational databases. If you are running a relational database, whether it be Postgres, MySQL, Aurora, have it replicated. But here’s the kicker: have another replica that you delay, and make it delay 10 minutes, 15 minutes, not much longer than that. Because what’s going to happen, especially in a young company, especially if you’re using Rails or something like that, is you’re going to have somebody who has access to production, because you’re a small company and you haven’t really federated this out yet, who’s going to drop your main database table. They’re just going to do it and it’s going to happen and you’re going to panic. If you have that database going to a replica, and you have a 10-minute delayed replica, you have 10 minutes to figure it out before the world ends if somebody deletes the master database. 
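[For illustration: a minimal sketch of the delayed-replica tip Bryan describes above, assuming a MySQL primary with a spare replica set aside for this purpose; the episode doesn't prescribe a database, and PostgreSQL standbys have an equivalent recovery_min_apply_delay setting. Run on the spare replica:]

  # Apply changes from the primary 10 minutes late (MySQL 5.6+ replication syntax)
  mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 600; START SLAVE;"
  # If someone drops a table on the primary, freeze this replica before the
  # bad statement replays, then recover the data from it
  mysql -e "STOP SLAVE SQL_THREAD;"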
You’re going to know pretty quickly and you can just cut that replica out, pull that other one over. I’m not going to say where i learned this trick. We had to employ it multiple times, and it saves our butts multiple times. That’s my favorite thing to share.  [00:18:24] OP: Is that replica on separate system?  [00:18:26] BL: It was on a separate system. I actually don’t say, because it will be telling on who did it. Let’s say that it was physically separate from the other one in a different location as well.  [00:18:37] OP: I think we’ve all been there. We’ve all have deleted something that maybe –  [00:18:41] CC: I’m going to tell who did it. It was me.  [00:18:45] BL: Oh no! It definitely wasn’t me.  [00:18:46] OP: We mentioned HA. Will the panel think that there’s now a slightly inverse relationship between the amount of HA that you architect for versus the disaster recovery plan that you have implemented on the back of that? More you’re architecting around HA, like the less you architect or plan for DR. Not eliminating ether of them. [00:19:08] BL: I see it more.  Mean, it used to be 15 years ago.   [00:19:11] CC: Sorry. HA, we’re talking about high availability.  [00:19:15] BL: When you think about high availability, a lot of sites were hosted. This is really before you had public cloud and a lot of people were hosting things on WebHost or they’re hosting themselves. Even if you are a company who had like a big equinox of level 3, you probably didn’t have two facilities at two different equinoxes or level 3, which probably does had one big cage and you just had diversity in the systems in there.  We found people had these huge tape backups and we’re very diligent about swapping our tapes out. One thing you did was we made sure that – I mean, lots of practice of bringing this huge system down, because we assumed that the database would die and we would just spend a few hours bringing it back up, or days. Now with high availability, we can architect systems where that is less of a problem, because we could run more things that manage our data. Then we can also do high availability in the backend on the database side too. We can do things like multi-writes and multi-reads. We can actually write our data in multiple places.  What we find when we do this is that the loss of a single database or a slice of processing/webhosts just means that our services degraded, which means we don’t really have a disaster in this point and we’re trying to avoid disasters.  [00:20:28] JR: I think on that point, the way I’ve always thought about it, and I’ll admit this is super overly simplified, but like successful high availability or HA could make your lead to perform disaster recovery less likely, can, maybe, right? It’s possible.   [00:20:45] BL: Also realize that everybody is running in public cloud. In that case, well, you can still back your stuff up to public cloud even if you’re not running in public cloud. There are still people out there who are running big tape arrays, and I’ve seen them. I’ve seen tape arrays that are wider. I’m sitting in an 80-inch wide table, bigger than this table with robotic arms and takes the restic and you had to make sure that you got the text right for that particular day doing your implementation.  I guess what I’m saying is that there is a balance. HA, high availability, if you’re doing it in a truly high available way, you can’t miss whole classes of disaster. 
But I’m not saying that you will not have disaster, because if that was the case, we won’t be having this discussion right now.  I’d like to move the conversation just a little bit to more cloud native. If you’re running on Kubernetes, what should you think about for disaster recovery? What are the types of disasters we could have? How could we recover them? [00:21:39] JR: Yeah. I think one thing that comes to mind, I was actually reading the Kubernetes Best Practices book last night, but I just got an O’Reilly membership. Awesome. Really cool book. One of the things that they had recommended early on, which I thought was a really good pull out is that since Kubernetes is a declarative system where we write these manifests to describe the desired state of our application and how it should run, recommending that we make sure to keep that declarative state in source control, just like we would our code so that if something were to go wrong, it is somewhat more trivial to redeploy the application should we need to recover. That does assume we’re not worried about like data and things like that, but it is a good call out I think. I think the book made a good call out.  [00:22:22] OP: That’s on the declarative system and enable to bring your systems back up to the exact way they were before kind of itself adds comfort to the whole notion that they could be disaster. If they was, we can spin up backup relatively quickly. That’s back from the days of automation where the guys originally – I came from Red Hat, so fork at Ansible.  We’re kind of trying to do the infrastructure as a code, being able to deploy, redeploy, redeploy in the same manner as the previous installation, because I’ve been in this game long-time now and I’ve spent a lot of time working with processes in and around building physical servers. That process will get handled over to lots of different teams. It was a huge thing to build these things, to get one of these things built and signed off, because it literally has to pass through the different teams to do their own different bits of things.  The idea that you would get a language that had the functionality that suited the needs of all those different teams, of the store team, could automate their piece, which they were doing. They just wasn’t interactive with any of the other teams. The network people would automate theirs and the application install people would do their bit. The server OS people would do their bit.  Having a process that could tie those teams together in terms of a language, so Ansible, Puppet, Chef, those kinds of things try to unite those teams and it can all do your automation, but we have a tool that can take that code and run it as one system end-to-end. At the end of that, you get an up and running system. If you run it again, you get all the systems exactly the same as the previous one. If you run it again, you get another one.  Reducing the time to build these things plays very importantly into this space. Disaster is only disaster in terms of time, because things break all the time. How that affects you and how quickly you can recover. If you can recover in like seconds, in minutes and it hasn’t affected your business at all, then it wasn’t really a disaster.  The time it takes you to recover, to build your things back is key. All that automation and then leading on to Kubernetes, which is the next step, I think, this whole declarative, self-healing and implementing the desired state on a regular basis really plays well into this space.  
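[For illustration: the "keep the declarative state in source control and redeploy it" recovery path Josh and Olive describe, as a minimal sketch; the repository URL and directory layout are made up for the example.]

  # Recreate a cluster's workloads by reapplying the manifests kept in Git
  git clone https://github.com/example-org/platform-manifests.git
  kubectl apply -R -f platform-manifests/             # -R walks the directory tree
  kubectl get deployments,statefulsets,services -A    # sanity-check what came back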
[00:24:25] CC: That makes me think, I don’t completely understand because I’m not out there architecting people’s systems. The one thing that I do is building this backup tool, which happens to be for Kubernetes. I don’t completely get the limitations and use cases, but my question is, is it enough to have the declarations of how your infrastructure should be in source control? Because what if you’re running applications on the platform and your applications are interacting with a platform, change in the state of the platform. Is that not something that happens?  Of course, ideally, having those declarations and source control of course is a great backup, but don’t you also want to back up the changes to state as they keep happening? [00:25:14] BL: Yeah, of course. That has been used for a long-time. That’s how replication works. Literally, you take the change and you push it over the wire and it gets applied to the remote system. The problem is, is that there isn’t just one way to do this, because if you do only transaction-based. If you only do the changes, you need a good base to start with, because you have to apply those changes to something. How do you get that piece? I’m not asking you to answer that. It’s just something to think about.  [00:25:44] JR: I think you’ve hit a fatal flaw too, Carlisia, and like what that simplified just like having source control model kind of falls over. I think having that declarative kind of stamped out, this is the ideal nature of the world to this deployment and source control has benefits beyond just that of disaster recovery scenario, right?  For stateless applications especially, like we talked about in the previous podcast, it can actually be all lead potentially, which is so great. Move your CI system over to cluster B. Boom! You’re back up and running. That’s really neat. A lot of our customers we work with, once we get them to a point where they’re at that stage, they then go, “Well, what about all these persisted volumes?” which by the way is evolving on a computer, which is a Kubernetes term. But like what about all these parts on like disk that I don’t want to lose if I lose my cluster? That it totally feeds into why tools like the one you work on are so helpful. Maybe I don’t know if now would be a good time. But maybe, Carlisia, you could expand on that tool. What it tries to solve for?  [00:26:41] CC: I want to back up a little though. Let’s put aside stateful workloads and volumes and databases. I was talking about the infrastructure itself, the state of the infrastructure. I mean, isn’t that common? I don’t know the answer to this. I might be completely off. Isn’t that common for you to develop a cloud native application that is changing the state of the infrastructure, or is this something that’s not good to do?  [00:27:05] JR: It’s possible that you can write applications that can change infrastructure, but think about that. What happens when you have bad code? We all have bad code. Our people like to separate those two things. You can still have infrastructure as code, but it’s separated from the application itself, and that’s just to protect your app people from your not app people and vice versa.  A lot of that is being handled through systems that people are writing right now. You have Ansible from IBM. You have things like HashiCorp and all the things that they’re doing. They have their hosted thing. They have their own premise thing. They have their local thing. People are looking at that problem.  
The good thing is that that problem hasn’t been solved. I guess good and bad at the same time, because it hasn’t been solved. So someone can solve it better. But the bad thing is that if we’re looking for good infrastructure as code software, that has not been solved yet.  [00:27:57] OP: I think if we’re talking about containerized applications, I think if there was systems that interacted or affected or changed the infrastructure, they would be separate from the applications. As you were saying, Brian, you just expanded a little bit [inaudible 00:28:11] containerized or sandboxed, processes that were running separate to the main application.  You’re separating out what’s actually running and doing function in terms of application versus systems that have to edit that infrastructure first before that main application runs. They’re two separate things. If you had to restore the infrastructure back to the way it was without rebuilding it, but perhaps have a system whereby if you have something editing the infrastructure, you would always have something that would edit it back.  If you have the process that runs to stop something, you’d also have a process that start at something. If you’re trying to [inaudible 00:28:45] your applications and if it needs to interact with other things, then that application design should include the consideration of what do I need to do to interact with the infrastructure. If I’m doing something left-wise, I have to do the opposite in equal reaction right-wise to have an effectively clean application. That’s the kind of stuff I’ve seen anyway. [00:29:04] JR: I think it maybe even fold into a whole other topic that we could even cover on another podcast, which is like the notion of the concern of mutating infrastructure. If you have a ton of hands in those cookie jars and they’re like changing things all over the place, you’re losing that potential single source of declarative truth even, right? It just could become very complicated. I think maybe to the crux of your original point, Carlisia. Hopefully I’m not super off. If that is happening a lot, I think it could actually make recover more complicated, or maybe recovery is not the way to put it, but recreating the infrastructure, if that makes sense. [00:29:36] BL: Your infrastructure should be deterministic, and that’s why I said you could. I know we talked about this before about having applications modify infrastructure. Think about that. Can and should are two different things. If you have it happen within your application due to input of any kind, then you’re no longer deterministic, unless you can figure out what that input is going to be. Be very careful about that.  That’s why people split infrastructure as code from their other code. You could still have CI, continuous integration and continuous delivery/deployment for both, but they’re on different pipelines with different release metrics and different monitoring and different validation to make sure they work correctly. [00:30:18] OP: Application design plays a very important role now, especially in terms of cloud native architecture. We’re talking a lot about microservices. A lot of companies are looking to re-architect their applications. Maybe mistakes that were made in the past, or maybe not mistakes. It’s perhaps a strong word. But maybe things that were allowed in the past perhaps are now best practices going forward. 
If we’re looking to be able to run things independently of each other, and by definition, applications independent on the infrastructure, that should be factored in into the architecture of those applications going forward.  [00:30:50] CC: Josh asked me to talk a little bit about Velerao. I will touch up on it quickly. First of all, we’d love to have a whole show just about infrastructure code, GitOps. Maybe that would be two episodes. Velero doesn’t do any backup of the infrastructure itself. It works at the Kubernetes level. We back up the Kubernetes clusters including the volumes. If you have any sort of stateful app attached to a pod that can get backed up as well.  If you want to restore that to even a different service provider, then the one you backed up from, we have a restic plugin that you can use. It’s embedded in the Velero tool. So you can do that using this plugin. There are few really cool things that I find really cool about Velero is, one, you can do selective backups, which really, really don’t recommend. We recommend you always back up everything, but you can do selective restores. That would be – If you don’t need to restore a whole cluster, why would you do it? You can just do parts of it.  It’s super simple to use. Why would you not have a backup? Because this is ridiculously simple. You do it through a command line, and we have a scheduler. You can just put your backup on scheduler. Determine the expiration date of each backup. A lot of neat simple features and we are actively developing things all the time.  Velero is not the only one. It’d be fair to mention, and I’m not a super well versed on the tools out there, but etcd itself has a backup tool. I’m not familiar with any of these other tools. One thing to highlight is that we do everything through the Kubernetes API. That’s for example one reason why we can do selective backup or restores. Yes, you can backup etcd completely yourself, but you have to back up the whole thing. If you’re on a managed service, you wouldn’t be able to do that, because you just wouldn’t have access.  All the tools like we use to back up to the etcd offers or a service provider. PX-motion. I’m not sure what this is. I’m reading the documentation here. There is this K10 from [inaudible 00:33:13] Canister. I haven’t used any of these tools. [inaudible 00:33:16]. [00:33:17] OP: I just want to say, Velero, the last customer I worked on, they wanted to use Velero in its capacity to be able to back up a whole cluster and then restore that whole cluster on a different cloud provider, as you mentioned. They weren’t thoroughly using it as – Well, they were using it as backup, but their primary function was that they wanted to populate the cluster as it was on a brand-new cloud provider.  [00:33:38] CC: Yeah. It’s a migration. One thing that, like I said, Velero does, is back up the cluster, like all the Kubernetes objects, because why would we want to do that? Because if you’re declaring – Someone explain to everybody who’s listening, including myself. Some people bring this up and they say, “Well, I don’t need to back up the Kubernetes objects if all of that is declared and I have the declaration is source control. If something happens, I can just do it again.   [00:34:10] BL: Untrue, because just for any given Kubernetes object, there is a configuration that you created. Let’s say if you’re creating an appointment, you need spec replicas, you need the spec templates, you need labels and selectors. 
[00:33:17] OP: I just want to say, Velero, the last customer I worked with, they wanted to use Velero in its capacity to be able to back up a whole cluster and then restore that whole cluster on a different cloud provider, as you mentioned. They weren’t thoroughly using it as – well, they were using it as backup, but their primary goal was that they wanted to populate the cluster as it was on a brand-new cloud provider.
[00:33:38] CC: Yeah, it’s a migration. One thing that, like I said, Velero does is back up the cluster, like all the Kubernetes objects. But why would we want to do that? Because if you’re declaring – someone explain to everybody who’s listening, including myself. Some people bring this up and they say, “Well, I don’t need to back up the Kubernetes objects if all of that is declared and I have the declaration in source control. If something happens, I can just do it again.”
[00:34:10] BL: Untrue, because just for any given Kubernetes object, there is a configuration that you created. Let’s say you’re creating a deployment: you need the spec replicas, you need the spec template, you need labels and selectors. But if you actually go and pull down that object afterwards, what you’ll see is there are other things inside of that object. If you didn’t specify any replicas, you get the defaults, or other things that you should get defaults for. You don’t want to have a lossy backup and restore, because then you get yourself into a place where if I back this thing up and then I restore it to a different cluster to actually test it out and see if it works, it will be different. Just keep that in mind when you’re doing that.
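Bryan’s point about defaulting is easy to see from the API. Here is a minimal sketch using the official Kubernetes Python client; the deployment name and namespace are made up, and the values noted in the comments are the usual defaults the API server fills in when a manifest leaves them out.

```python
from kubernetes import client, config

# Authenticate the same way kubectl does, via ~/.kube/config.
config.load_kube_config()
apps = client.AppsV1Api()

# Hypothetical Deployment; point this at one that exists in your cluster.
dep = apps.read_namespaced_deployment(name="web", namespace="default")

# Fields the API server typically fills in even if your manifest never set them.
print("replicas:", dep.spec.replicas)                              # defaults to 1
print("strategy:", dep.spec.strategy.type)                         # RollingUpdate
print("revisionHistoryLimit:", dep.spec.revision_history_limit)    # 10
print("terminationGracePeriodSeconds:",
      dep.spec.template.spec.termination_grace_period_seconds)     # 30
```

That gap between what you declared in source control and what is actually running in the cluster is exactly what an API-level backup captures and a redeploy-from-source approach loses.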
[00:34:51] JR: I think it just comes down to knowing exactly what Bryan just said, because there certainly are times when I’m working with a customer where there’s just such a simple use case that, at the notion of redeploying the application and potentially losing some of those factors that may have mutated over time, they just shrug and go, “Whatever.” It is so awesome that tools like Velero and others are bridging that gap, and I think, to a point that Olive made, not only just backing that stuff up and capturing its state as it was in the cluster, but providing us with a good way to section out one namespace or one group of applications and just move those over, potentially, and so on. Yeah, it just kind of comes down to knowing what exactly you are going to have to solve for and how complex your solution should be.
[00:35:32] BL: Yeah. We’re getting towards the end, and I wanted to make sure that we talked about testing your backup, because that’s a popular thing here. People take backups. “I’ve done my backups,” whether I dump to S3, or I have Velero dumping to S3, or I have some other method. That is an invalid backup. It’s not valid until someone comes and takes that backup, restores it somewhere and actually verifies that it works, because there’ll be nothing worse than finding yourself in a situation where you need a backup, you’re in some kind of disaster, whether small or large, and you go to find out that, “Oh my gosh! We didn’t even back up the important thing.”
[00:36:11] CC: That is so true. I have only been in this backup world for a minute, but I mean, I’ve needed to back up things before. I don’t think I learned this concept after coming here. I think I’ve known this concept; it just became stronger in my mind. So I always tell people, if you haven’t done that restore, you don’t have a backup.
[00:36:29] JR: One thing I love to add on to that concept too is having my customers run fire drills, if they’re open to it. Effectively, having a list of potential terrible things that can happen, from losing a cluster to just losing an important component. And then one person on the team, let’s say, once a week or once a month, depending on their tolerance, just chooses something from that list and does it. Not in production, but does it. It gives you the opportunity to test everything end-to-end. Did your alerting fire off? When you did the restore, to your point, was the backup valid? Did the application come back online? There are a lot of semi-fun, using the word fun loosely there, fun ways that you can approach it, and it really is a good way to stress test.
[00:37:09] BL: I do have one small follow-up on that. You’re doing backups, and no matter how you’re doing them, think about your strategy and then how long to keep data, whether that’s due to regulation or just physical space, because it costs money. You don’t just back up yesterday and then back up again. Back up every day and keep the last 8 days, and then, like old school, actually have a full backup and keep that for a while just in case, because you never know.
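Tying together “a backup isn’t valid until you’ve restored it” and the fire-drill idea, here is a minimal sketch of one such drill: restore a recent backup into a non-production cluster and check that the workload actually comes back. It shells out to velero and kubectl, and the backup, namespace, and deployment names are all hypothetical.

```python
import subprocess
import time

def run(*cmd: str) -> None:
    """Run a command, echoing it first, and stop the drill if it fails."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Restore a recent backup into the current (non-production) cluster context.
run("velero", "restore", "create", "drill-restore",
    "--from-backup", "daily-full-20191216010000",
    "--include-namespaces", "shop-frontend")

# 2. Velero restores run asynchronously; give it a moment to create objects.
#    (A real drill might poll `velero restore describe drill-restore` instead.)
time.sleep(30)

# 3. Verify the restored Deployment actually becomes available again.
run("kubectl", "wait", "--namespace", "shop-frontend",
    "--for=condition=Available", "deployment/web", "--timeout=300s")

# 4. Anything past this point (smoke tests, checking that alerting fired)
#    is specific to the application being drilled.
print("Restore drill passed.")
```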
You’re thinking that things will fail, and that’s okay, because it will go back to the way it was before. The concept of something stopping in mid-run is not so scary anymore, because it would get put back to its state.  Maybe you might need to investigate if it keeps stopping and starting and Kubernetes keeps bringing it back. The system is actually still fully functional in terms of end users. You as the operator might need to investigate why that’s so. But the actual endpoint is still that your application is still up and running. Things fail and it’s okay. That’s maybe a thing that’s changed from maybe 5 years ago, 10 years ago. [00:41:25] CC: This is a great conversation. I want to thank everybody, Olive Power, Josh Rosso, Brian Lyles. I’m Carlisia Campos singing off. Make sure to subscribe. This was Episode 8. We’ll be back next week. See you. [END OF EPISODE] [0:50:00.3] KN: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END]See omnystudio.com/listener for privacy information.


16 Dec 2019

Rank #10