
O'Reilly Data Show Podcast

The O'Reilly Data Show Podcast explores the opportunities and techniques driving big data, data science, and AI.


Popular episodes


The best episodes, ranked by user listens.


Trends in data, machine learning, and AI

For the end-of-year holiday episode of the Data Show, I turned the tables on Data Show host Ben Lorica to talk about trends in big data, machine learning, and AI, and what to look for in 2019. Lorica also showcased some highlights from our upcoming Strata Data and Artificial Intelligence conferences.

Here are some highlights from our conversation:

Real-world use cases for new technology

If you’re someone who wants to use data, data infrastructure, data science, machine learning, and AI, we’re really at the point where there are a lot of tools for implementers and developers. They’re not necessarily doing research and development; they just want to build better products and automate workflows. I think that’s the most significant development in my mind. And then I think use case sharing also has an impact. For example, at our conferences, people are sharing how they’re using AI and ML in their businesses, so the use cases are getting better defined—particularly for some of these technologies that are relatively new to the broader data community, like deep learning. There are now use cases that touch the types of problems people normally tackle—so, things that involve structured data: for example, time series forecasting or recommenders. With that said, while we are in an implementation phase, I think, as people who follow this space will attest, there are still a lot of interesting things coming out of the R&D world, so still a lot of great innovation and a lot more growth in terms of how sophisticated and how easy to use these technologies will be.

Addressing ML and AI bottlenecks

We have a couple of surveys that we’ll release early in 2019. In one of these surveys, we asked people what the main bottleneck is in terms of adopting machine learning and AI technologies. Interestingly enough, the main bottleneck was cultural issues—people are still facing challenges in convincing others within their companies to adopt these technologies. And then, of course, the next two are the ones we’re familiar with: lack of data and lack of skilled people. The fourth bottleneck people cited was trouble identifying business use cases. What’s interesting about that is, if you then ask people how mature their practice is and you look at the people with the most mature AI and machine learning practices, they still cite a lack of data as the main bottleneck. What that tells me is that there’s still a lot of opportunity for people to apply these technologies within their companies, but there’s a lot of foundational work people have to do in terms of just getting data in place: getting data collected and ready for analytics.

Focus on foundational technologies

At the Strata Data conferences in San Francisco, London, and New York, the emphasis will be on building technologies and bringing in the cultural practices that will allow you to sustain analytics and machine learning in your organization. That means having all of the foundational technologies in place—data ingestion, data governance, ETL, data lineage, data science platforms, metadata stores, and things like that: the various pieces of technology that will be important as you scale the practice of machine learning and AI in your company. At the Artificial Intelligence conferences, we remain focused on being the de facto gathering place for people interested in applied artificial intelligence. We will focus on serving the most important use cases in many, many domains. That means showcasing, of course, the latest research in deep learning and other branches of machine learning, but also helping people grapple with some of the other important considerations, like privacy and security, fairness, reliability, and safety.

…At both the Strata Data and Artificial Intelligence conferences, we will focus on helping people understand the capabilities of the technology, its strengths and limitations; that’s why we run executive briefings at all of these events. We also showcase case studies aimed at the non-technical and business user—so, we’ll have two types of case studies, one more technical and one less technical, so that business decision-makers can benefit from seeing how their peers are using and succeeding with some of these technologies.


20 Dec 2018

Rank #1


The evolution of data science, data engineering, and AI

This episode of the Data Show marks our 100th episode. This podcast grew out of video interviews conducted at O’Reilly’s 2014 Foo Camp. We had a collection of friends who were key members of the data science and big data communities on hand, and we decided to record short conversations with them. We originally conceived of those initial conversations as the basis of a regular series of video interviews. The logistics of studio interviews proved too complicated, but those Foo Camp conversations got us thinking about starting a podcast, and the Data Show was born. To mark this milestone, my colleague Paco Nathan, co-chair of JupyterCon, turned the tables on me and asked me questions about previous Data Show episodes. In particular, we examined the evolution of key topics covered in this podcast: data science and machine learning, data engineering and architecture, AI, and the impact of each of these areas on businesses and companies. I’m proud of how this show has reached so many people across the world, and I’m looking forward to sharing more conversations in the future.

Here are some highlights from our conversation:

AI is more than machine learning

I think for many people, machine learning is AI. In the AI Conference series, I’m trying to convince people that a true AI system will involve many components, machine learning being one. Many of the guests I have seem to agree with that.

Evolving infrastructure for big data

In the early days of the podcast, many of the people I interacted with had Hadoop as one of the essential pieces of their infrastructure. While that might still be the case, there are more alternatives these days. I think a lot of people are moving to object stores in the cloud. Another example is that before, people maintained specialized systems. There’s still some of that, but people are trying to see if they can combine some of these systems, or come up with systems that can do more than one workload. For example, this whole notion in Spark of having a unified system that is able to do batch and streaming caught on during the span of this podcast.

Related resources:
An easy-to-scan episode list of the Data Show
“What is data science?”
“What are machine learning engineers?”
“What is Artificial Intelligence?”
“Building tools for the AI applications of tomorrow”
“Data engineering: A quick and simple definition”


24 May 2018

Rank #2



What machine learning engineers need to know

In this episode of the Data Show, I spoke with Jesse Anderson, managing director of the Big Data Institute, and my colleague Paco Nathan, who recently became co-chair of JupyterCon. This conversation grew out of a recent email thread the three of us had on machine learning engineers, a new job role that LinkedIn recently pegged as the fastest growing job in the U.S. In our email discussion, there was some disagreement on whether such a specialized job role/title was needed in the first place. As Eric Colson pointed out in his beautiful keynote at Strata Data San Jose, creating specialized roles too soon can slow down your data team. We recorded this conversation at Strata San Jose, while Anderson was in the middle of teaching his very popular two-day training course on real-time systems. We closed the conversation with Anderson’s take on Apache Pulsar, a very impressive new messaging system that is starting to gain fans among data engineers.

Here are some highlights from our conversation:

Why we need machine learning engineers

Jesse Anderson: (2:09) One of the issues I’m seeing as I work with teams is that they’re trying to operationalize machine learning models, and the data scientists are not the ones to productionize these. They simply don’t have the engineering skills. Conversely, the data engineers don’t have the skills to operationalize this either. So, we’re seeing this kind of gap between data science and data engineering, and the way I’m seeing it being filled is through a machine learning engineer. … I disagree with Paco that generalization is the way to go. I think it’s hyper-specialization, actually. This is coming from my experience having taught a lot of enterprises. At a startup, I would say that super-specialization is probably not going to be as possible, but at an enterprise, you are going to have to have a team that specializes in big data, and that is separate from a team, even a software engineering team, that doesn’t work with data.

Putting Apache Pulsar on the radar of data engineers

[Image: Key features of Apache Pulsar. Image by Karthik Ramasamy, used with permission.]

Jesse Anderson: (23:30) A lot of my time, since I’m really teaching data engineering, is spent on data integration and data ingestion. How do we move this data around efficiently? For a lot of that time, Kafka was really the only open source game in town for that. But now there’s another technology called Apache Pulsar. I’ve spent a decent amount of time actually going through Pulsar, and there are some things I see in it that Kafka will either have difficulty doing or won’t be able to do. … Apache Pulsar separates pub-sub from storage. When I first read about that, I didn’t quite get it. I didn’t quite see why this is so important or so interesting. It’s because you can scale your pub-sub and your storage resources independently. Now you’ve got something. Now you can say, “Well, we originally decided we wanted to store data for seven days. All right, let’s spin up some more BookKeeper processes, and now we can store fourteen days, now we can store twenty-one days.” I think that’s going to be a pretty interesting addition. The other side of that, the corollary to that, is: “Okay, we’re hitting Black Friday, and it’s not that we have so much more data coming through; we have way more consumption, way more things hitting our pub-sub. We could spin up more pub-sub for that.” This separation is actually allowing some interesting use cases.
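Anderson’s point about scaling serving and storage independently can be sketched with a toy model. The classes below are invented for illustration (they are not Pulsar’s actual API): a stateless broker tier fans messages out to subscribers, while a separate storage tier persists them, so either side can grow on its own.

```python
class StorageLayer:
    """Toy stand-in for Pulsar's BookKeeper tier: entries are spread
    across storage nodes ("bookies"), and capacity grows by adding nodes."""
    def __init__(self, num_bookies=2):
        self.bookies = [[] for _ in range(num_bookies)]
        self._count = 0

    def add_bookie(self):
        # Scale storage without touching the serving layer at all.
        self.bookies.append([])

    def append(self, entry):
        # Round-robin entries across whatever bookies exist right now.
        self.bookies[self._count % len(self.bookies)].append(entry)
        self._count += 1

    def total_entries(self):
        return sum(len(b) for b in self.bookies)


class Broker:
    """Toy stand-in for the pub-sub (serving) tier: stateless fan-out,
    with persistence delegated entirely to the storage layer."""
    def __init__(self, storage):
        self.storage = storage
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        self.storage.append(message)       # durable write
        for callback in self.subscribers:  # fan out to consumers
            callback(message)


storage = StorageLayer(num_bookies=2)
broker = Broker(storage)

received = []
broker.subscribe(received.append)

broker.publish("order-1")
broker.publish("order-2")
storage.add_bookie()        # "store 21 days instead of 7": grow storage only
broker.publish("order-3")

print(received)                 # ['order-1', 'order-2', 'order-3']
print(storage.total_entries())  # 3
```

The Black Friday case from the quote is the mirror image: you would add brokers (more fan-out capacity) while leaving the storage tier unchanged.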
Related resources:
“What are machine learning engineers?”
“We need to build machine learning tools to augment machine learning engineers”
“Differentiating via data science”: Eric Colson explains why companies must now think very differently about the role and placement of data science in organizations.
“Architecting and building end-to-end streaming applications”: Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications.
“How machine learning will accelerate data management systems”: Tim Kraska on why ML will change how we build core algorithms and data structures.


29 Mar 2018

Rank #3


The current state of Apache Kafka

In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “the age of machine learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics—including machine learning—remains an area of focus for most companies. It turns out “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. By making it easier to create and productionize data refinement pipelines on both batch and streaming data sources, analysts and data scientists can focus on analytics that can unlock value from data. On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where this popular framework is headed.

Here are some highlights from our conversation:

The first engineering project that made use of Apache Kafka

If I remember correctly, we were putting Hadoop into place at LinkedIn for the first time, and I was on the team that was responsible for that. The problem was that all our scripts were actually built for another data warehousing solution. The question was, are we going to rewrite all of those scripts and make them Hadoop-specific? And what happens when a third and a fourth and a fifth system is put into place? So, the initial motivating use case was: ‘We are putting this Hadoop thing into place. That’s the new-age data warehousing solution. It needs access to the same data that is coming from all our applications. So, that is the thing we need to put into practice.’ This became Kafka’s very first use case at LinkedIn. From there, because that was very easy and I actually helped move one of the very first workloads to Kafka, it was hardly difficult to convince the rest of the LinkedIn engineering team to start moving over to Kafka. So from there, Kafka adoption became pretty viral. Now, years down the line, all of LinkedIn runs on Kafka. It’s essentially the central nervous system for the whole company.

Microservices and Kafka

My own opinion of microservices is that they let you add more money and turn it into software at a more constant rate, by allowing engineers to focus on various parts of the application, essentially decoupling a big monolith so that a lot of development on real applications can happen in parallel. … The upside is that it lets you move fast. It adds a certain amount of agility to an engineering organization. But it comes with its own set of challenges, and these were not very obvious back then. How are all these microservices deployed? How are they monitored? And, most importantly, how do they communicate with each other? The communication bit is where Kafka comes in. When you break a monolith, you break state, and you distribute that state across different machines that run all those different applications. So now the problem is, ‘Well, how do these microservices share that state? How do they talk to each other?’ Frequently, the expectation is that things happen in real time. Where streams or Kafka come in, in the context of microservices, is as the communication model for those microservices. I should just say that there isn’t a one-size-fits-all when it comes to communication patterns for microservices.

Related resources:
Kafka: The Definitive Guide
“Architecting and building end-to-end streaming applications”: Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications.
“Semi-supervised, unsupervised, and adaptive algorithms for large-scale time series”: Ira Cohen on developing machine learning tools for a broad range of real-time applications.
“Building Apache Kafka from scratch”: Jay Kreps on data integration, event data, and the Internet of Things.
I Heart Logs
“How companies can navigate the age of machine learning”: to become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models.
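The log-centric communication model Narkhede describes can be sketched as a toy append-only log (illustrative Python, not Kafka’s actual client API): services share state by replaying events at their own pace, instead of calling each other synchronously.

```python
class Log:
    """Toy append-only log in the spirit of Kafka: the broker retains
    every event, and each consumer tracks its own read offset."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        return self.events[offset:]


class Microservice:
    """A service that rebuilds its local state by replaying the shared log,
    rather than querying other services directly."""
    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.state = {}

    def poll(self):
        for key, value in self.log.read_from(self.offset):
            self.state[key] = value   # apply the event to local state
            self.offset += 1


log = Log()
orders = Microservice(log)
billing = Microservice(log)

log.append(("order-42", "created"))
log.append(("order-42", "paid"))

orders.poll()
billing.poll()
print(orders.state == billing.state)  # True: both converge on the same state
```

Because each consumer keeps its own offset, a slow service (or a brand-new one) can catch up later by replaying from where it left off, which is the decoupling the quote is pointing at.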


22 Nov 2017

Rank #4

Most Popular Podcasts


Make data science more useful

In this episode of the Data Show, I speak with Cassie Kozyrkov, technical director and chief decision scientist at Google Cloud. She describes “decision intelligence” as an interdisciplinary field concerned with all aspects of decision-making, one that combines data science with the behavioral sciences. Most recently, she has been focused on developing best practices that can help practitioners make safe, effective use of AI and data. Kozyrkov uses her platform to help data scientists develop skills that will enable them to connect data and AI with their organizations’ core businesses.

We had a great conversation spanning many topics, including:
How data science can be more useful
The importance of the human side of data
The leadership talent shortage in data science
Is data science a bubble?

Related resources:
“Managing machine learning in the enterprise: Lessons from banking and health care”
“Managing risk in machine learning”
“What are model governance and model operations?”
“Becoming a machine learning company means investing in foundational technologies”
Forough Poursabzi-Sangdeh: “It’s time for data scientists to collaborate with researchers in other disciplines”
Jacob Ward: “How social science research can inform the design of AI systems”
“AI and machine learning will require retraining your entire organization”
Ihab Ilyas and Ben Lorica on “The quest for high-quality data”
“Product management in the machine learning era”—a tutorial at the Artificial Intelligence Conference in San Jose


1 Aug 2019

Rank #5


Enabling end-to-end machine learning pipelines in real-world applications

In this episode of the Data Show, I spoke with Nick Pentreath, principal engineer at IBM. Pentreath was an early and avid user of Apache Spark, and he subsequently became a Spark committer and PMC member. Most recently, his focus has been on machine learning, particularly deep learning, and he is part of a group within IBM focused on building open source tools that enable end-to-end machine learning pipelines.

We had a great conversation spanning many topics, including:
AI Fairness 360 (AIF360), a set of fairness metrics for data sets and machine learning models
Adversarial Robustness Toolbox (ART), a Python library for adversarial attacks and defenses
Model Asset eXchange (MAX), a curated and standardized collection of free and open source deep learning models
Tools for model development, governance, and operations, including MLflow, Seldon Core, and Fabric for Deep Learning
Reinforcement learning in the enterprise, and the emergence of relevant open source tools like Ray

Related resources:
“Modern Deep Learning: Tools and Techniques”—a new tutorial at the Artificial Intelligence conference in San Jose
Harish Doddi on “Simplifying machine learning lifecycle management”
Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models”
“Managing risk in machine learning”: considerations for a world where ML models are becoming mission critical
“The evolution and expanding utility of Ray”
“Local Interpretable Model-Agnostic Explanations (LIME): An Introduction”
Forough Poursabzi-Sangdeh on why “It’s time for data scientists to collaborate with researchers in other disciplines”


20 Jun 2019

Rank #6


Tools for machine learning development

In this week’s episode of the Data Show, we’re featuring an interview that Data Show host Ben Lorica gave on the Software Engineering Daily podcast, where he was interviewed by Jeff Meyerson. Their conversation centered around data engineering, data architecture and infrastructure, and machine learning (ML).

Here are a few highlights:

Tools for productive collaboration

A data catalog, at a high level, basically answers questions around the data that’s available and who is using it, so an enterprise can understand access patterns. … The term “data science platform” is generally used when you’ve gotten to the point where you have a team of data scientists and you need a place where they can use libraries in a setting where they can collaborate, and where they can share not only models but maybe even data pipelines and features. The more advanced data science platforms will have automation tools built in. … The ideal scenario is that the data science platform is not just for prototyping, but also for pushing things to production.

Tools for ML development

We have tools for software development, and now we’re beginning to hear about tools for machine learning development—there’s a company here at Strata called Comet.ml, and there’s another startup called Verta.ai. But what has really caught my attention is an open source project from Databricks called MLflow. When it first came out, I thought, ‘Oh, yeah, we don’t have anything like this. It might have a decent chance of success.’ But I didn’t pay close attention until recently; fast forward to today, and there are 80 contributors from 40 companies and 200+ companies using it. What’s good about MLflow is that it has three components and you’re free to pick and choose—you can use one, two, or three. Based on their surveys, the most popular component is the one for tracking and managing machine learning experiments. It’s designed to be useful for individual data scientists, but it’s also designed to be used by teams of data scientists, and they have documented use cases of MLflow where a company manages thousands of models in production.
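To make the tracking idea concrete, here is a minimal toy tracker (a hypothetical stand-in for what MLflow’s tracking component does, not its actual API): each run records its parameters and metrics, and runs can then be compared across experiments.

```python
class ExperimentTracker:
    """Minimal stand-in for an experiment tracker: log params and
    metrics per run, then query across runs to compare them."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        self.runs.append({"params": dict(params), "metrics": dict(metrics)})

    def best_run(self, metric, maximize=True):
        # Pick the run with the best recorded value for the given metric.
        pick = max if maximize else min
        return pick(self.runs, key=lambda run: run["metrics"][metric])


tracker = ExperimentTracker()
tracker.log_run({"lr": 0.1,  "depth": 3}, {"auc": 0.81})
tracker.log_run({"lr": 0.01, "depth": 5}, {"auc": 0.86})
tracker.log_run({"lr": 0.05, "depth": 4}, {"auc": 0.84})

print(tracker.best_run("auc")["params"])  # {'lr': 0.01, 'depth': 5}
```

The value for a team comes from this record being shared and queryable: anyone can see which hyperparameters produced which results without re-running the experiments.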


3 Jul 2019

Rank #7


How privacy-preserving techniques can lead to more robust machine learning models

In this episode of the Data Show, I spoke with Chang Liu, applied research scientist at Georgian Partners. In a previous post, I highlighted early tools for privacy-preserving analytics, both for improving decision-making (business intelligence and analytics) and for enabling automation (machine learning). One of the tools I mentioned is an open source project for SQL-based analysis that adheres to state-of-the-art differential privacy (a formal guarantee that provides robust privacy assurances). Since business intelligence typically relies on SQL databases, this open source project is something many companies can already benefit from today.

What about machine learning? While I didn’t have space to point this out in my previous post, differential privacy has been an area of interest to many machine learning researchers. Most practicing data scientists aren’t aware of the research results, and popular data science tools haven’t incorporated differential privacy in meaningful ways (if at all). But things will change over the coming months. For example, Liu wants to make ideas from differential privacy accessible to industrial data scientists, and she is part of a team building tools to make this happen.

Here are some highlights from our conversation:

Differential privacy and machine learning

In the literature, there are actually multiple ways differential privacy is used in machine learning. We can inject noise directly at the input data level, or while we’re training a model. We can also inject noise into the gradient: at every iteration where we’re computing the gradients, we can inject some sort of noise. Or we can inject noise during aggregation: if we’re using ensembles, we can inject noise there. And we can also inject noise at the output level. So after we’ve trained the model, and we have our vector of weights, we can inject noise directly into the weights.
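The simplest of the injection points Liu lists, output-level noise, is the classic Laplace mechanism: add noise calibrated to the query’s sensitivity and the privacy budget epsilon. A minimal sketch (illustrative only, not the tooling Liu’s team is building):

```python
import random

def laplace_noise(scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon:
    smaller epsilon means stronger privacy and a noisier answer."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 47]
# With a large epsilon (weak privacy), the answer stays close to the
# true count of 5; shrink epsilon and the answers spread out.
print(private_count(ages, lambda a: a >= 29, epsilon=100.0))
```

Injecting noise into gradients during training follows the same calibration idea, but the sensitivity is controlled by clipping each gradient’s norm first.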
A mechanism for building robust models

There’s a chance that differential privacy methods can actually make your model more general. Essentially, when models memorize their training data, it could be due to overfitting. So, injecting all of this noise may move the resulting model further away from overfitting, and you get a more general model.

Related resources:
“How to build analytic products in an age when data privacy has become critical”
“Managing risk in machine learning models”: Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain.
“Data regulations and privacy discussions are still in the early stages”: Aurélie Pols on GDPR, ethics, and ePrivacy.
“Data collection and data markets in the age of privacy and machine learning”


2 Aug 2018

Rank #8


Specialized hardware for deep learning will unleash innovation

In this episode of the Data Show, I spoke with Andrew Feldman, founder and CEO of Cerebras Systems, a startup in the blossoming area of specialized hardware for machine learning. Since the release of AlexNet in 2012, we have seen an explosion of activity in machine learning, particularly in deep learning. A lot of the work to date has happened primarily on general purpose hardware (CPUs, GPUs). But now that we’re six years into the resurgence of interest in machine learning and AI, these new workloads have attracted technologists and entrepreneurs who are building specialized hardware for both model training and inference, in the data center or on edge devices.

In fact, companies with enough volume have already begun building specialized processors for machine learning. But you have to either use specific cloud computing platforms or work at specific companies to have access to such hardware. A new wave of startups (including Cerebras) will make specialized hardware affordable and broadly available. Over the next 12-24 months, architects and engineers will need to revisit their infrastructure and decide between general purpose or specialized hardware, and cloud or on-premise gear. In light of the training duration and cost they face using current (general purpose) hardware, some experiments might be hard to justify. Upcoming specialized hardware will enable data scientists to try out ideas they previously would have hesitated to pursue. This will surely lead to more research papers and interesting products, as data scientists are able to run many more experiments (on even bigger models) and iterate faster. Since Feldman is the founder of one of the most anticipated hardware startups in the deep learning space, I wanted to get his views on the challenges and opportunities faced by engineers and entrepreneurs building hardware for machine learning workloads.

Here are some highlights from our conversation:

A renaissance for computer architecture

OpenAI put out some very interesting analysis recently that showed that since 2012, the compute used for the largest AI training runs has increased by 300,000x. … What’s available to us to attack the vast discrepancy between compute demand and what we have today? Two things are available to us. The first is exploring interesting compute architectures; I think this ushers in a golden age for compute architectures. And number two is building dedicated hardware and saying: ‘We’re prepared to make trade-offs to accelerate AI compute by not trying to be good at other things. By not trying to be good at graphics, or by not trying to be a good web server. But we will attack this vast demand for compute by building dedicated hardware for artificial intelligence work.’ Historically, the following has been a very productive and valuable trade-off: new and interesting architectures dedicated to a particular type of work. That’s the opportunity that many of these hardware companies, or chip companies, have seen.

Communication-intensive workloads

When you stay on a chip, one can communicate fairly quickly. The problem is that our work in artificial intelligence often spans more than one traditional chip, and the performance penalty for leaving the chip is very, very high. On-chip, you stay in silicon; off-chip, you have to wrap your communication in some sort of protocol, send it, and connect it over lanes on a printed circuit board, or maybe through a PCI switch, or maybe through an Ethernet switch or an InfiniBand switch. This adds two, three, four orders of magnitude of latency. Some of the problems the hardware vendors interested in data center training and data center inference are working on are how you can accelerate the communication between cores, and across tens of thousands of cores, or even hundreds of thousands of cores across many chips. Some are inventing new techniques for special switches and modifying PCIe to do that. Others have more fundamental approaches to accelerating this communication. But if you can’t communicate quickly, you can’t train a model quickly and you can’t provide inference quickly.

New hardware will significantly reduce training times

You should see a reduction in training times of 10-50x sometime over the next 12-18 months. … I think you’re going to see an additional 10-25x in the following year. We’re looking at three orders of magnitude of reduction in training time over the next several years.

Related resources:
“How big compute is powering the deep learning rocket ship”: Greg Diamos on building computer systems for deep learning and AI.
“The artificial intelligence computing stack”: a look at why the U.S. and China are investing heavily in this new computing stack.
“How to train and deploy deep learning at scale”: Ameet Talwalkar on large-scale machine learning.
“Scaling machine learning”: Reza Zadeh on deep learning, hardware/software interfaces, and why.
“A new benchmark suite for machine learning”: David Patterson and Gu-Yeon Wei on MLPerf, a new set of benchmarks compiled by industry and academic contributors.
“Building tools for the AI applications of tomorrow”
“Toward the Jet Age of machine learning”


19 Jul 2018

Rank #9


Why companies are in need of data lineage solutions

In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, important foundational technologies come into play. This shouldn’t come as a shock: current machine learning and AI technologies require large amounts of data—specifically, labeled data for training models. There are also many other considerations—including security, privacy, and reliability/safety—that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up. Several San Francisco Bay Area companies have embarked on building data lineage systems—including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It’s borne out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I’m describing data lineage: think of it as a journey for data. The data takes a journey entering your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew exactly what journey the data took to come into being in your data warehouse, or any other storage appliance you use, that would be really useful. … Think about data lineage as helping with issues of data quality, like understanding whether something is corrupted. On the security side, think of GDPR, which was one of the hot topics I heard about at the Strata Data Conference in London in 2018.

Why companies are suddenly building data lineage solutions

A data lineage system becomes necessary as time progresses. It makes maintainability easier. You need it for audit trails, for security and compliance. But you also need to think of the benefit of managing the data sets you’re working with. If you’re working with 10 databases, you need to know what’s going on in them. If I had to give you a vision of a data lineage system, think of it as a final graph or view of some data set: it shows you a graph of what that data set is linked to, and then it gives you some metadata information so you can drill down. Let’s say you have corrupted data; let’s say you want to debug something. All these cases tie into the actual use cases for which we want to build it.

Related resources:
“Deep automation in machine learning”
Vitaly Gordon on “Building tools for enterprise data science”
“Managing risk in machine learning”
Haoyuan Li explains why “In the age of AI, fundamental value resides in data”
“What machine learning means for software development”
Joe Hellerstein on how “Metadata services can lead to performance and organizational improvements”
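Salian’s “graph of what a data set is linked to” can be sketched as a toy lineage store (dataset names are hypothetical, and this is an illustration of the idea rather than Stitch Fix’s system): record each dataset’s upstream sources, then walk the edges to answer the two questions he raises, provenance and corruption impact.

```python
class LineageGraph:
    """Toy lineage store: for each dataset, record the upstream datasets
    it was derived from, then walk the edges to answer questions."""
    def __init__(self):
        self.parents = {}   # dataset -> list of source datasets

    def record(self, dataset, sources):
        self.parents[dataset] = list(sources)

    def upstream(self, dataset):
        """Everything `dataset` was (transitively) derived from."""
        seen, stack = set(), list(self.parents.get(dataset, []))
        while stack:
            d = stack.pop()
            if d not in seen:
                seen.add(d)
                stack.extend(self.parents.get(d, []))
        return seen

    def downstream(self, dataset):
        """Everything affected if `dataset` turns out to be corrupted."""
        affected, changed = set(), True
        while changed:
            changed = False
            for child, sources in self.parents.items():
                if child not in affected and (
                        dataset in sources or affected & set(sources)):
                    affected.add(child)
                    changed = True
        return affected


graph = LineageGraph()
graph.record("clean_orders", ["raw_orders"])
graph.record("dashboard", ["clean_orders", "customers"])

print(sorted(graph.upstream("dashboard")))     # ['clean_orders', 'customers', 'raw_orders']
print(sorted(graph.downstream("raw_orders")))  # ['clean_orders', 'dashboard']
```

A production system would hang metadata (owners, timestamps, job IDs) off each edge so you can drill down, which is exactly the debugging workflow described in the quote.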


25 Apr 2019

Rank #10


Using machine learning to improve dialog flow in conversational applications

In this episode of the Data Show, I spoke with Alan Nichol, co-founder and CTO of Rasa, a startup that builds open source tools to help developers and product teams build conversational applications. About 18 months ago, there was tremendous excitement and hype surrounding chatbots, and while things have quieted lately, companies and developers continue to refine and define tools for building conversational applications. We spoke about the current state of chatbots, specifically about the types of applications developers are building today and how he sees conversational applications evolving in the near future. As I described in a recent post, workflow automation will happen in stages. With that in mind, chatbots and intelligent assistants are bound to improve as underlying algorithms, technologies, and training data get better. Here are some highlights from our conversation: Chatbots and state machines The first component is what we call natural language understanding, which typically means taking a short message that a user sends and extracting some meaning from it; that is, turning it into structured data. In the case we talked about regarding the SQL database, if somebody asks, for example, ‘What was my ROI on my Facebook campaigns last month?’, the first thing you want to understand is that this is a data question, and you want to assign it a label identifying it as one; the person is not saying hello, or goodbye, or thank you, but asking a specific question. Then you want to pick out those fields to help you create a query. … The second piece is, how do you actually know what to do next? How do you build a system that can hold a conversation that is coherent? What you realize very quickly is that it’s not enough to have one input always matched to the same output. For example, if you ask somebody a yes or no question and they say, ‘yes,’ the next thing to do, of course, depends on what the original question was.
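Nichol's NLU step, taking a short message and turning it into a label plus extracted fields, can be made concrete with a crude keyword-based stand-in. Rasa's real NLU uses trained statistical models rather than rules; the intents, patterns, and field names below are invented for illustration:

```python
import re

# A crude stand-in for trained NLU: real systems classify intents and extract
# entities with statistical models. Intents and patterns here are invented.
def parse_message(text):
    lowered = text.lower()
    if any(greeting in lowered for greeting in ("hello", "hi there")):
        return {"intent": "greet", "entities": {}}
    if "roi" in lowered:
        entities = {}
        channel = re.search(r"\b(facebook|google|email)\b", lowered)
        if channel:
            entities["channel"] = channel.group(1)
        if "last month" in lowered:
            entities["period"] = "last_month"
        return {"intent": "data_question", "entities": entities}
    return {"intent": "unknown", "entities": {}}

parsed = parse_message("What was my ROI on my Facebook campaigns last month?")
```

The structured output (an intent label and the fields needed to build a query) is what the dialogue layer consumes in the next step.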
… Real conversations aren’t stateless; they have some context and they need to pay attention to the history. So, the way developers do that is to build a state machine, which means, for example, that you have a bot that can do some different things. It can talk about flights; it can talk about hotels. Then you define different states for when the person is still searching, or for when they are comparing different things, or for when they finish a booking. And then you have to define rules for how to behave for every input, for every possible state. Beyond state machines The problem is that [the state machine] approach works for building your first version, but it really restricts you to what we call “the happy path,” which is where the user is compliant and cooperative and does everything you ask them to do. But in typical cases, you ask a person, “Do you like option A, or option B?” Then you probably build a path for the person saying A, and a path for the person saying B. But then you give it to real users, and they say, “No, I don’t like either of those.” Or they ask a question like, “Why is A so much more expensive than B?” Or, “Let me get back to you about that.” … They don’t scale, that’s the problem. If you’re a developer and somebody has a conversation with your bot and you realize that it did the wrong thing, now you have to go look back at your (literally) thousands or tens of thousands of rules to figure out which one crashed and which one did the wrong thing. Then you figure out where to inject one more rule to handle one more edge case, and that just doesn’t scale at all. … With our dialogue library Rasa Core, we give the user the ability to talk to the bot and provide feedback. So, in Rasa, the whole flow of dialogue is also controlled with machine learning. And it’s learned from real sample conversations. You talk to the system and if it does something wrong, you provide feedback and it corrects itself.
So, you explore the space of possible conversations interactively yourself, and then your users do as well. Related resources: Alan Nichol on “The Next Generation of AI Assistants in Enterprise” “Bots: What you need to know” “Using machine learning to monitor and optimize chatbots” “Commercial speech recognition systems in the age of big data and deep learning”: Yishay Carmiel on applications of deep learning in text and speech “How to think about AI and machine learning technologies, and their roles in automation” “Deep learning revolutionizes conversational AI” “Topic models—past, present, and future”: David Blei discusses the origins and applications of topic models.
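The state-machine approach Nichol contrasts with a learned dialogue policy can be sketched in a few lines: one hand-written rule per (state, input) pair. The states, intents, and actions below are hypothetical; the point is that every off-script turn needs yet another rule.

```python
# Hand-written dialogue state machine: every (state, intent) pair needs a rule.
# States, intents, and bot actions are invented for illustration.
RULES = {
    ("start", "search_flights"): ("ask_destination", "searching_flights"),
    ("start", "search_hotels"): ("ask_city", "searching_hotels"),
    ("searching_flights", "inform"): ("show_flights", "comparing"),
    ("comparing", "affirm"): ("confirm_booking", "booked"),
    ("comparing", "deny"): ("ask_preferences", "searching_flights"),
}

def step(state, intent):
    """Return (bot action, next state); fall back when no rule matches."""
    return RULES.get((state, intent), ("fallback", state))

# The happy path works fine:
state = "start"
for intent in ["search_flights", "inform", "affirm"]:
    action, state = step(state, intent)

# The unhappy path: an off-script question hits the fallback rule,
# and the only fix is to hand-write yet another rule.
fallback_action, _ = step("comparing", "ask_price_difference")
```

At a handful of states this is manageable; at thousands of rules, every new edge case means hunting through the table by hand, which is the scaling problem the interview describes.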


13 Sep 2018

Rank #11


Acquiring and sharing high-quality data

In this episode of the Data Show, I spoke with Roger Chen, co-founder and CEO of Computable Labs, a startup focused on building tools for the creation of data networks and data exchanges. Chen has also served as co-chair of O’Reilly’s Artificial Intelligence Conference since its inception in 2016. This conversation took place the day after Chen and his collaborators released an interesting new white paper, Fair value and decentralized governance of data. Current-generation AI and machine learning technologies rely on large amounts of data, and to the extent they can use their large user bases to create “data silos,” large companies in large countries (like the U.S. and China) enjoy a competitive advantage. With that said, we are awash in articles about the dangers posed by these data silos. Privacy and security, disinformation, bias, and a lack of transparency and control are just some of the issues that have plagued the perceived owners of “data monopolies.” In recent years, researchers and practitioners have begun building tools focused on helping organizations acquire, build, and share high-quality data. Chen and his collaborators are doing some of the most interesting work in this space, and I recommend their new white paper and accompanying open source projects. [Figure: Sequence of basic market transactions in the Computable Labs protocol. Source: Roger Chen, used with permission.] We had a great conversation spanning many topics, including: Why he chose to focus on data governance and data markets. The unique and fundamental challenges in accurately pricing data. The importance of data lineage and provenance, and the approach they took in their proposed protocol. What cooperative governance is and why it’s necessary. How their protocol discourages an unscrupulous user from just scraping all data available in a data market.
Related resources: Roger Chen: “Data liquidity in the age of inference” Ihab Ilyas and Ben Lorica on “The quest for high-quality data” Chris Ré: “Software 2.0 and Snorkel” Alex Ratner on “Creating large training data sets quickly” Jeff Jonas on “Real-time entity resolution made accessible” “Data collection and data markets in the age of privacy and machine learning” Guillaume Chaslot on “The importance of transparency and user control in machine learning”


18 Jul 2019

Rank #12


Machine intelligence for content distribution, logistics, smarter cities, and more

In this episode of the Data Show, I spoke with Rhea Liu, analyst at China Tech Insights, a new research firm that is part of Tencent’s Online Media Group. If there’s one place where AI and machine learning are discussed even more than in the San Francisco Bay Area, it would be China. Each time I go to China, there are new applications that weren’t widely available just the year before. This year, it was impossible to miss bike sharing, mobile payments seemed to be accepted everywhere, and people kept pointing out nascent applications of computer vision (facial recognition) to identity management and retail (unmanned stores). I wanted to consult local market researchers to help make sense of some of the things I’ve been observing from afar. Liu and her colleagues have put out a series of interesting reports highlighting some of these important trends. They also have an annual report—Trends & Predictions for China’s Tech Industry in 2018—that Liu will discuss in her keynote and talk at Strata Data Singapore in December. Here are some highlights from our conversation: Machine learning and content distribution Media consumption takes up a large proportion of people’s everyday lives here in China. Before, people learned their news from news portals and from editorial teams who served as the gatekeepers. People now trust machine learning algorithms with editorial and agenda setting. Apps like Toutiao have become very popular. It’s been quite a surprise to most news portals and media professionals here in China. People are trying to find a balance between the traditional ways of content creation and the new ways of content distribution by aggregators fully powered by machines. Toutiao’s news recommendation engine is purely a black box to most people. … But users are spending more and more time on these types of platforms. And, machine-generated news feeds have become a big thing. … So, it’s now becoming a content war again.
After these algorithms improve the efficiency of content distribution, the battle may come down to what content you have. Bike sharing Bike sharing is kind of a new model adapted to Chinese society. … In between every subway station, there’s still several miles to go, where people still need to walk or maybe take a taxi. Bike sharing is being used to replace these other kinds of approaches. There are two primary players. One is Mobike and the other one is Ofo, and they started with different models, actually. Ofo started a year or two earlier from a university campus. … It provided this kind of public bike rental system to users on campus. This was kind of the preliminary prototype of this model. Mobike started in a city. These bike sharing companies have their GPS systems on the bikes, and the bikes have digital electronic locks that can be unlocked with an app on your phone. These technologies, combined together, can help them collect data as well as have a better management system of all the bikes they distribute over a city. Smart cities It’s still a maybe, but it’s very likely we are going to include things about smart cities in our 2018 reports. … This includes AR applications to help build better cities for urban planning. … Urban planning is a very complicated thing, and it can fall a little bit behind because of a lack of data. But now people have different types of data. For example, I know the ride sharing company Didi is collaborating with several city governments to help them do urban planning: by using data to better understand traffic, how to manage traffic light systems in the city, and also the bus system. City governments at all levels are now collaborating with all these tech companies to explore applications of their data to improve the cities we have in China.
… This is going to be a very important opportunity for the tech companies here in China, especially in terms of their data applications, and their contributions to society. Related resources: “Vehicle-to-vehicle communication networks and smart cities”: Bruno Fernandez-Ruiz on the importance of building the ground control center of the future “How intelligent data platforms are powering smart cities” “Creating autonomous vehicle systems”: Shaoshan Liu on understanding AV technologies and how to integrate them Cars that coordinate with people: 2017 AI Conference keynote by Anca Dragan


26 Oct 2017

Rank #13


Effective mechanisms for searching the space of machine learning algorithms

In this episode of the Data Show, I spoke with Ken Stanley, founding member of Uber AI Labs and associate professor at the University of Central Florida. Stanley is an AI researcher and a leading pioneer in the field of neuroevolution—a method for evolving and learning neural networks through evolutionary algorithms. In a recent survey article, Stanley went through the history of neuroevolution and listed recent developments, including its applications to reinforcement learning problems. Stanley is also the co-author of a book entitled Why Greatness Cannot Be Planned: The Myth of the Objective—a book I’ve been recommending to anyone interested in innovation, public policy, and management. Inspired by Stanley’s research in neuroevolution (into topics like novelty search and open endedness), the book is filled with examples of how notions first uncovered in the field of AI can be applied to many other disciplines and domains. The book closes with a case study that hits closer to home—the current state of research in AI. One can think of machine learning and AI as a search for ever better algorithms and models. Stanley points out that gatekeepers (editors of research journals, conference organizers, and others) impose two objectives that researchers must meet before their work gets accepted or disseminated: (1) empirical: their work should beat incumbent methods on some benchmark task, and (2) theoretical: proposed new algorithms are better if they can be proven to have desirable properties. Stanley argues this means that interesting work (“stepping stones”) that fails to meet either of these criteria falls by the wayside, preventing other researchers from building on potentially interesting but incomplete ideas. Here are some highlights from our conversation: Neuroevolution today In the state of the art today, the algorithms have the ability to evolve variable topologies or different architectures.
There are pretty sophisticated algorithms for evolving the architecture of a neural network; in other words, what’s connected to what, not just what the weights of those connections are—which is what deep learning is usually concerned with. There’s also an idea of how to encode very, very large patterns of connectivity. This is something that’s been developed independently in neuroevolution, where there’s not really an analogous thing in deep learning right now. This is the idea that if you’re evolving something that’s really large, then you probably can’t afford to encode the whole thing in the DNA. In other words, if we have 100 trillion connections in our brains, our DNA does not have 100 trillion genes. In fact, it couldn’t have 100 trillion genes. It just wouldn’t fit. That would be astronomically too high. So then, with a much, much smaller space of DNA—about 30,000 genes or so, three billion base pairs—how would you get enough information in there to encode something that has 100 trillion parts? This is the issue of encoding. We’ve become sophisticated at creating artificial encodings that are basically compressed in an analogous way, where you can have a relatively short string of information to describe a very large structure that comes out—in this case, a neural network. We’ve gotten good at doing encoding and we’ve gotten good at searching more intelligently through the space of possible neural networks. We originally thought what you need to do is just breed by choosing among the best. So, you say, ‘Well, there’s some task we’re trying to do and I’ll choose among the best to create the next generation.’ We’ve learned since then that that’s actually not always a good policy. Sometimes you want to explicitly choose for diversity. In fact, that can lead to better outcomes. The myth of the objective Our book does recognize that sometimes pursuing objectives is a rational thing to do.
But I think the broader point that’s important here is there’s a class of discoveries for which it really is against your interest to frame what you’re doing in terms of an objective. The reason we wrote the book is because … I started to realize this principle that ‘sometimes in order to make discovery possible, you have to stop having an objective’ speaks to people beyond just computer scientists who are developing algorithms. It’s an issue for our society and for institutions because there are many things we do that are driven by some kind of objective metric. It almost sounds like heresy to suggest that you shouldn’t do that. It’s like an unquestioned assumption that exists throughout our culture that the primary route to progress is to set objectives and move toward those objectives and measure your performance with respect to those objectives. We began to think that, given the hard empirical results we have, it is important to counterbalance this belief that pervades society with a counterargument pointing out that there are cases where this is actually a really bad idea. The thing I learned more and more talking to different groups is that this discussion is not being had. We’re not talking about this, and I think it’s a very important discussion because our institutions are geared away from innovation because they are so objectively driven. We could do more to foster innovation if we recognize this principle. A lot of people want this security blanket of an objective because they don’t trust anything that isn’t driven by an objective. Actually, it turns out there are principled ways of exploring the world without an objective. In other words, it’s not just random, and the book is about that. It’s about how smart ways of exploring in a non-objective way can lead to really, really important results.
We just wanted to open up that conversation ‘society wide’ and not just have it narrowly within the field of computer science because it is such an important conversation to have. Related resources: “Neuroevolution: A different kind of deep learning“: a must-read overview by Ken Stanley “Why continuous learning is key to AI“: A look ahead at the tools and methods for learning from sparse feedback. Ray: A distributed execution framework for emerging AI applications: 2017 Strata Data keynote by Michael Jordan Deep reinforcement learning for robotics: 2016 AI Conference presentation by Pieter Abbeel Cars that coordinate with people: 2017 AI Conference keynote by Anca Dragan
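Stanley's point that explicitly selecting for diversity can beat pure objective-driven selection can be illustrated with a toy evolutionary loop. This is a deliberately minimal sketch, not NEAT or novelty search proper; the one-dimensional "genome," the fitness function, the novelty weight, and all other parameters are made up.

```python
import random

random.seed(0)

def fitness(x):
    # Toy objective: be as close as possible to a hidden target (invented).
    return -abs(x - 7.3)

def novelty(x, population):
    # Mean distance to the rest of the population: rewards being different.
    return sum(abs(x - other) for other in population) / len(population)

# Evolve a population of one-number "genomes."
population = [random.uniform(-10, 10) for _ in range(20)]
for generation in range(50):
    # Score partly on fitness, partly on novelty: a crude stand-in for the
    # diversity pressure Stanley describes.
    scored = sorted(population,
                    key=lambda x: fitness(x) + 0.1 * novelty(x, population),
                    reverse=True)
    parents = scored[:5]
    # Next generation: mutated copies of the selected parents.
    population = [p + random.gauss(0, 0.5) for p in parents for _ in range(4)]

best = max(population, key=fitness)
```

Real neuroevolution evolves network topologies and weights rather than single numbers, and novelty search measures behavioral rather than genomic distance, but the selection-with-diversity loop has this same shape.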


31 Aug 2017

Rank #14


Machine learning at Spotify: You are what you stream

In this episode of the Data Show, I spoke with Christine Hung, head of data solutions at Spotify. Prior to joining Spotify, she led data teams at the NY Times and at Apple (iTunes). Because she has led teams at three different companies, I wanted to hear her thoughts on digital transformation, and I wanted to know how she approaches the challenge of building, managing, and nurturing data teams. I also wanted to learn more about what goes into building a recommender system for a popular consumer service like Spotify. Engagement should clearly be the most important metric, but there are other considerations, such as introducing users to new or “long tail” content. Here are some highlights from our conversation: Recommenders at Spotify For us, engagement always comes first. At Spotify, we have a couple hundred people who are just focused on user engagement, and this is the group that creates personalized playlists, like Discover Weekly or your Daily Mix, for you. We know our users love discovery and see Spotify as a very important platform for them to discover something new, but there are also times when people just want to have some music played in the background that fits the mood. But again, we don’t have a specific agenda in terms of what we should push for. We want to give you what you want so that you are happy, which is why we invested so much in understanding people through music. If we believe you might like some “long tail” content, we will recommend it to you because it makes you happy, but we can also do the same for top 100 tracks if we believe you will enjoy them. Music is like a mirror Music is like a mirror, and it tells people a lot about who you are and what you care about, whether you like it or not. We love to say “you are what you stream,” and that is so true. As you can imagine, we invest a lot in our machine learning capabilities to predict people’s preferences and context, and of course, all the data we use to train the model is anonymized.
We take in large amounts of anonymized training data to develop these models, and we test them out with different use cases, analyze results, and use what we learn to improve those models. Just to give you my personal example to illustrate how it works, you can learn a lot about me just by me telling you what I stream. You will see that I use my running playlist only during the weekend in early mornings, and I have a lot of children’s songs streamed at my house between 5 p.m. and 7 p.m. I also have a lot of tango and salsa playlists that I created and followed. So what does that tell you? It tells you that I am probably a weekend runner, which means I have some kind of affinity for fitness; it tells you that I am probably a mother and play songs for my child after I get home from work; it also tells you that I somehow like tango and salsa, so I am probably a dancer, too. As you can see, we are investing a lot into understanding people’s context and preferences so we can start capturing different moments of their lives. And, of course, the more we understand your context, your preferences, and what you are looking for, the better we can customize your playlists for you. Related resources: Music, the window into your soul: Christine Hung’s keynote at Strata Data NYC 2017 “Transforming organizations through analytics centers of excellence”: Carme Artigas on helping enterprises transform themselves with big data tools and technologies. “A framework for building and evaluating data products”: Grace Huang on lessons learned in the course of machine learning product launches. “How companies can navigate the age of machine learning”: to become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models.
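Hung's examples (a running playlist on weekend mornings, children's songs between 5 p.m. and 7 p.m.) amount to conditioning recommendations on context. A toy rule-based version of that idea; Spotify's real systems learn these associations from anonymized streaming data, and the playlist names and thresholds here are invented:

```python
# Toy context-aware playlist picker. Real systems learn context/preference
# associations from data; here the rules are hard-coded and invented.
def pick_playlist(weekday, hour):
    if weekday in ("Saturday", "Sunday") and hour < 9:
        return "running_mix"
    if 17 <= hour < 19:
        return "kids_songs"
    return "daily_mix"

choice = pick_playlist("Sunday", 7)  # weekend early morning
```

The learned equivalent replaces each hard-coded rule with a model that scores every candidate playlist given the listener's context, but the input/output contract is the same: context in, playlist out.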


7 Dec 2017

Rank #15


How Ray makes continuous learning accessible and easy to scale

In this episode of the Data Show, I spoke with Robert Nishihara and Philipp Moritz, graduate students at UC Berkeley and members of RISE Lab. I wanted to get an update on Ray, an open source distributed execution framework that makes it easy for machine learning engineers and data scientists to scale reinforcement learning and other related continuous learning algorithms. Many AI applications involve an agent (for example, a robot or a self-driving car) interacting with an environment. In such a scenario, an agent will need to continuously learn the right course of action to take for a specific state of the environment. What do you need in order to build large-scale continuous learning applications? You need a framework with low-latency response times, one that is able to run massive numbers of simulations quickly (agents need to be able to explore states within an environment), and one that supports heterogeneous computation graphs. Ray is a new execution framework written in C++ that contains these key ingredients. In addition, Ray is accessible via Python (and Jupyter Notebooks), and comes with many of the standard reinforcement learning and related continuous learning algorithms that users can easily call. As Nishihara and Moritz point out, frameworks like Ray are also useful for common applications such as dialog systems, text mining, and machine translation. Here are some highlights from our conversation: Tools for reinforcement learning Ray is something we’ve been building that’s motivated by our own research in machine learning and reinforcement learning. If you look at what researchers who are interested in reinforcement learning are doing, they’re largely ignoring the existing systems out there and building their own custom frameworks or custom systems for every new application that they work on.
… For reinforcement learning, you need to be able to share data very efficiently, without copying it between multiple processes on the same machine, you need to be able to avoid expensive serialization and deserialization, and you need to be able to create a task and get the result back in milliseconds instead of hundreds of milliseconds. So, there are a lot of little details that come up. … In fact, people often use MPI along with lower-level multi-processing libraries to build the communication infrastructure for their reinforcement learning applications. Scaling machine learning in dynamic environments I think right now when we think of machine learning, we often think of supervised learning. But a lot of machine learning applications are changing from making just one prediction to making sequences of decisions and taking sequences of actions in dynamic environments. The thing that’s special about reinforcement learning is it’s not just the different algorithms that are being used, but rather the different problem domain that it’s being applied to: interactive, dynamic, real-time settings bring up a lot of new challenges. … The set of algorithms actually goes even a little bit further. Some of these techniques are even useful in, for example, things like text summarization and translation. You can use these techniques that have been developed in the context of reinforcement learning to better tackle some of these more classical problems [where you have some objective function that may not be easily differentiable]. … Some of the classic applications that we have in mind when we think about reinforcement learning are things like dialogue systems, where the agent is one participant in the conversation. Or robotic control, where the agent is the robot itself and it’s trying to learn how to control its motion. … For example, we implemented the evolution algorithm described in a recent OpenAI paper in Ray. 
It was very easy to port to Ray, and writing it only took a couple of hours. Then we had a distributed implementation that scaled very well and we ran it on up to 15 nodes. Related resources: Why continuous learning is key to AI: A look ahead at the tools and methods for learning from sparse feedback Ray: A distributed execution framework for emerging AI applications: A Strata Data keynote by Michael Jordan Deep reinforcement learning for robotics: A 2016 Artificial Intelligence Conference presentation by Pieter Abbeel Cars that coordinate with people: A 2017 Artificial Intelligence Conference keynote by Anca Dragan Introduction to reinforcement learning and OpenAI Gym Reinforcement learning explained
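Ray exposes remote tasks through a futures-style Python API (a function decorated with @ray.remote returns an object reference from .remote(), and ray.get fetches the result). The underlying submit-now, collect-later pattern can be approximated with the standard library alone; this sketch is an analogue built on threads, not Ray itself, and the simulation function is invented:

```python
from concurrent.futures import ThreadPoolExecutor

def run_simulation(seed):
    # Stand-in for an expensive rollout of an agent in an environment.
    total = 0
    for step in range(1000):
        total += (seed + step) % 7  # arbitrary deterministic "work"
    return total

# Submitting returns futures immediately, much like Ray returns object refs;
# results are fetched later, much like ray.get().
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_simulation, seed) for seed in range(8)]
    results = [f.result() for f in futures]
```

What Ray adds to this pattern is exactly what the interview highlights: shared memory instead of copies between processes, cheap serialization, and millisecond task overheads across a whole cluster rather than a thread pool on one machine.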


17 Aug 2017

Rank #16


Applications of data science and machine learning in financial services

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China. We had a great conversation spanning many topics, including: Potential applications of data science in financial services. The current state of data science in financial services in both the U.S. and China. His experience recruiting, training, and managing data science teams in both the U.S. and China. Here are some highlights from our conversation: Opportunities in financial services There’s a customer acquisition piece and then there’s a customer retention piece. For customer acquisition, we can see that new technologies can really add value by looking at all sorts of data sources that can help a financial service company identify who they want to target to provide those services. So, it’s a great place where data science can help find the product market fit, not just at one instance like identifying who you want to target, but also in a continuous form where you can evolve a product and then continuously find the audience that would best fit the product and continue to analyze the audience so you can design the next generation product. … Once you have a specific cohort of users who you want to target, there’s a need to be able to precisely convert them, which means understanding the stage of the customer’s thought process and understanding how to form the narrative to convince the user or the customer that a particular piece of technology or particular piece of service is the current service they need. … On the customer serving or retention side, for financial services we commonly talk about building hundred-year businesses, right? 
They have to be profitable businesses, and for a financial service to be profitable, there are operational considerations—quantifying risk requires a lot of data science; preventing fraud is really important; and there is garnering long-term trust with customers so they stay with you, which means having the work ethic to take care of customers’ data and being able to serve the customer better with automated services whenever and wherever the customer is. It’s all those opportunities where I see we can help serve the customer by having the right services presented to them and being able to serve them in the long term. Opportunities in China A few important areas in the financial space in China include mobile payments, wealth management, lending, and insurance—basically, the major areas for the financial industry. For these areas, China may be a forerunner in using internet technologies, especially mobile internet technologies for FinTech, and I think the wave started way back in the 2012/2013 time frame. If you look at mobile payments, like Alipay and WeChat, those have hundreds of millions of active users. The latest data from Alipay is about 608 million users, and these are monthly active users we’re talking about. This is about two times the U.S. population actively using Alipay on a monthly basis, which is a crazy number if you consider all the data that generates and all the things you can see people buying to be able to understand how to serve the users better. If you look at WeChat, they were boasting one billion monthly active users early this year. Those are the huge players, and with that amount of traffic, they are able to generate a lot of interest for the lower-frequency services like wealth management and lending, as well as insurance.
Related resources: Kai-Fu Lee outlines the factors that enabled China’s rapid ascension in AI Gary Kazantsev on how “Data science makes an impact on Wall Street” Juan Huerta on “Upcoming challenges and opportunities for data technologies in consumer finance” Geoffrey Bradway on “Programming collective intelligence for financial trading” Jason Dai on why “Companies in China are moving quickly to embrace AI technologies” Haoyuan Li on why “In the age of AI, fundamental value resides in data”


23 May 2019

Rank #17


The state of machine learning in Apache Spark

In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio. We began our conversation by discussing recent academic research that would be of interest to the Apache Spark community (Stoica leads the RISE Lab at UC Berkeley, Zaharia is part of Stanford’s DAWN Project). The bulk of our conversation centered around machine learning. Like many in the audience, I was first attracted to Spark because it simultaneously allowed me to scale machine learning algorithms to large data sets while providing reasonable latency. Here is a partial list of the items we discussed: The current state of machine learning in Spark. Given that a lot of innovation has taken place outside the Spark community (e.g., scikit-learn, TensorFlow, XGBoost), we discussed the role of Spark ML moving forward. The plan to make it easier to integrate advanced analytics libraries that aren’t “textbook machine learning,” like NLP, time series analysis, and graph analysis into Spark and Spark ML pipelines. Some upcoming projects from Berkeley and Stanford that target AI applications (including newer systems that provide lower latency, higher throughput). Recent Berkeley and Stanford projects that address two key bottlenecks in machine learning—lack of training data, and deploying and monitoring models in production. [Full disclosure: I am an advisor to Databricks.] Related resources: Spark: The Definitive Guide Advanced Analytics with Spark High-performance Spark Learning Path: Get Started with Natural Language Processing Using Python, Spark, and Scala Learning Path: Getting Up and Running with Apache Spark “The current state of applied data science”


14 Sep 2017

Rank #18


How social science research can inform the design of AI systems

In this episode of the Data Show, I spoke with Jacob Ward, a Berggruen Fellow at Stanford University. Ward has an extensive background in journalism, mainly covering topics in science and technology at National Geographic, Al Jazeera, Discovery Channel, BBC, Popular Science, and many other outlets. Most recently, he’s become interested in the interplay between research in psychology, decision-making, and AI systems. He’s in the process of writing a book on these topics, and was gracious enough to give an informal preview by way of this podcast conversation. Here are some highlights from our conversation:

Psychology and AI

I began to realize there was a disconnect between what is a totally revolutionary set of innovations coming through in psychology right now that are really just beginning to scratch the surface of how human beings make decisions; at the same time, we are beginning to automate human decision-making in a really fundamental way. I had a number of different people say, ‘Wow, what you’re describing in psychology really reminds me of this piece of AI that I’m building right now,’ to change how expectant mothers see their doctors or change how we hire somebody for a job or whatever it is.

Transparency and designing systems that are fair

I was talking to somebody the other day who was trying to build a loan company that was using machine learning to present loans to people. He and his company did everything they possibly could to not redline the people they were loaning to. They were trying very hard not to make unfair loans that would give preference to white people over people of color. They went to extraordinary lengths to make that happen. They cut addresses out of the process. They did all of this to try to basically neutralize the process, and the machine learning model still would pick white people at a disproportionate rate over everybody else. They can’t explain why. They don’t know why that is.
There’s some variable that’s mapping to race that they just don’t know about. But that sort of opacity—this is somebody explaining it to me who just happened to have been inside the company, but it’s not as if that’s on display for everybody to check out. These kinds of closed systems are picking up patterns we can’t explain, and that their creators can’t explain. They are also making really, really important decisions based on them. I think it is going to be very important to change how we inspect these systems before we begin trusting them.

Anthropomorphism and complex systems

In this book, I’m also trying to look at the way human beings respond to being given an answer by an automated system. There are some very well-established psychological principles out there that can give us some sense of how people are going to respond when they are told what to do based on an algorithm. The people who study anthropomorphism, the imparting of intention and human attributes to an automated system, say there’s a really well-established pattern. When people are shown a very complex system and given some sort of exposure to that complex system, whether it gives them an answer or whatever it is, it tends to produce in human beings a level of trust in that system that doesn’t really have anything to do with reality. … The more complex the system, the more people tend to trust it.

Related resources:

- Jacob Ward on “How AI will amplify the best and worst of humanity”
- Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models”
- “Managing risk in machine learning models”: Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain
- “We need to build machine learning tools to augment machine learning engineers”
- “Case studies in data ethics”
- “Haunted by data”: Maciej Ceglowski makes the case for adopting enforceable limits for data storage
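The proxy-variable problem Ward describes, where a model never sees race yet still produces racially skewed outcomes, can be made concrete with a small audit sketch. Everything below (the applicants, the approve() rule, the group labels) is invented for illustration; the point is that comparing per-group selection rates exposes a disparity even when the protected attribute is excluded from the model’s inputs.

```python
# Hypothetical sketch of a disparate-impact audit. The "model" only
# sees zip_code and income, never race, yet zip_code acts as a proxy.

def approve(applicant):
    # Stand-in for a trained model's decision rule.
    return applicant["income"] > 50 and applicant["zip_code"] in {"94110", "94611"}

applicants = [
    {"race": "white", "zip_code": "94110", "income": 60},
    {"race": "white", "zip_code": "94611", "income": 55},
    {"race": "black", "zip_code": "94124", "income": 60},
    {"race": "black", "zip_code": "94110", "income": 40},
]

def selection_rates(applicants):
    """Approval rate per group: the quantity a demographic-parity audit compares."""
    totals, approved = {}, {}
    for a in applicants:
        g = a["race"]
        totals[g] = totals.get(g, 0) + 1
        approved[g] = approved.get(g, 0) + int(approve(a))
    return {g: approved[g] / totals[g] for g in totals}

rates = selection_rates(applicants)
print(rates)  # -> {'white': 1.0, 'black': 0.0}
```

An audit like this detects the skew but, as Ward notes, does not by itself explain which input is doing the proxying; that requires inspecting the model, not just its outputs.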


11 Oct 2018

Rank #19


Unleashing the potential of reinforcement learning

In this episode of the Data Show, I spoke with Danny Lange, VP of AI and machine learning at Unity Technologies. Lange previously led data and machine learning teams at Microsoft, Amazon, and Uber, where his teams were responsible for building data science tools used by other developers and analysts within those companies. When I first heard that he was moving to Unity, I was curious as to why he decided to join a company whose core product targets game developers. As you’ll glean from our conversation, Unity is at the forefront of some of the most exciting, practical applications of deep learning (DL) and reinforcement learning (RL). Realistic scenery and imagery are critical for modern games. GANs and related semi-supervised techniques can ease content creation by enabling artists to produce realistic images much more quickly. In a previous post, Lange described how reinforcement learning opens up the possibility of training/learning rather than programming in game development. Lange explains why simulation environments are going to be important tools for AI developers. We are still in the early days of machine intelligence, and I am looking forward to more tools that can democratize AI research (including future releases by Lange and his team at Unity). Here are some highlights from our conversation:

Why reinforcement learning is so exciting

I’m a huge fan of reinforcement learning. I think it has incredible potential, not just in game development but in a lot of other areas, too. … What we are doing at Unity is basically making reinforcement learning available to the masses. We have shipped open source software on GitHub called Unity ML Agents, which includes the basic frameworks for people to experiment with reinforcement learning. Reinforcement learning is really creating a machine learning-driven feedback loop.
Recall the example I previously wrote about, of the chicken crossing the road; yes, it gets hit thousands and thousands of times by these cars, but every time it gets hit, it learns that’s a bad thing. And every time it manages to pick up a gift package on the way over the road, that’s a good thing. Over time, it gets superhuman capabilities in crossing this road, and that is fantastic because there’s not a single line of code going into that. It’s pure simulation, and through reinforcement learning it captures a method. It learns a method to cross the road, and you can take that into many different aspects of games. There are many different methods you can train. You can add two chickens—can they collaborate to do something together? We are looking at what we call multi-agent systems, where two or more of these reinforcement learning-trained agents are acting together to achieve a goal. … I want a million developers to start working on this. I want a lot more innovation, and I want a lot more out-of-the-box thinking, and that is what we want by making our RL tools and platform available to our Unity community. Let me just jump to one thing here: most people think that reinforcement learning in the game world or in game-like situations is a lot about what we call ‘path finding.’ Path finding is basically for a character in a game to navigate through some situation—this is pretty well understood. There are good algorithms for that. Looking ahead, I’m actually thinking about a different set of decisions. For instance, which weapon or which tool should a character pick up and bring with them in a game? That is a much, much harder decision. It’s strategy at a higher level.

Machine learning and AI at Unity

If you think about where intelligence originated around us (animals and humans), it’s really originating out of surviving and thriving in a physical world. That is really the job of intelligence.
You have to survive, you have to find food, you have to avoid your enemies, you have to walk without falling down—so, gravity is playing a big role there. If you think about the Unity game engine, it creates a visual 3D environment where the laws of physics rule. So, you have gravity, you have inertia, you have friction, and you basically have a 3D environment. Unity provides a fantastic lab to explore machine learning, and thereby elements of artificial intelligence, in simulations within that world. So, rather than just using machine learning to work on spreadsheets and sell more products at Amazon or get your car to arrive more quickly at Uber, now you can actually start running simulations that are covering aspects of the real world, and you can explore things like vision, touching, path finding, etc.

Related resources:

- “Practical applications of reinforcement learning in industry”
- “Bringing gaming to life with AI and deep learning”: Danny Lange on how machine learning opens the door for the use of training rather than programming in game development
- “Introducing RLlib – A composable and scalable reinforcement learning library”: this new software makes the task of training RL models much more accessible
- “An Outsider’s Tour of Reinforcement Learning”: Ben Recht’s ongoing series of posts
- “Open-endedness: The last grand challenge you’ve never heard of”
- “Why continuous learning is key to AI”
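Lange’s chicken-crossing-the-road example is, at its simplest, tabular Q-learning: the crossing behavior emerges from rewards and penalties rather than hand-written rules. The toy sketch below is invented for illustration (the environment, rewards, and hyperparameters are all assumptions, and Unity ML Agents wires the same feedback-loop idea into full 3D simulations rather than this five-cell “road”).

```python
# Toy Q-learning sketch of the reward-driven feedback loop: an agent
# learns to cross a five-cell road despite sometimes being "hit" at
# the middle cell. All details here are hypothetical.
import random

N = 5                # cells 0..4; reaching cell 4 means "crossed"
ACTIONS = [-1, +1]   # step back or step forward
CAR_CELL = 2         # stepping here sometimes gets the agent hit

def step(pos, action, rng):
    pos = max(0, min(N - 1, pos + action))
    if pos == CAR_CELL and rng.random() < 0.3:
        return 0, -10, False   # hit: big penalty, back to the start
    if pos == N - 1:
        return pos, +10, True  # crossed: reward, episode ends
    return pos, -1, False      # small cost per step

rng = random.Random(0)
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(500):
    pos, done = 0, False
    while not done:
        if rng.random() < eps:
            a = rng.choice(ACTIONS)                       # explore
        else:
            a = max(ACTIONS, key=lambda x: Q[(pos, x)])   # exploit
        nxt, reward, done = step(pos, a, rng)
        best_next = max(Q[(nxt, x)] for x in ACTIONS)
        Q[(pos, a)] += alpha * (reward + gamma * best_next - Q[(pos, a)])
        pos = nxt

# The learned greedy policy: despite thousands of hits during training,
# stepping forward wins from every cell, with no crossing logic coded.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N - 1)}
print(policy)
```

No line of the code encodes "cross the road"; the policy falls out of the reward signal, which is exactly the training-rather-than-programming shift Lange describes.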


1 Mar 2018

Rank #20