Business
Technology

O'Reilly Data Show Podcast

Updated 8 days ago

The O'Reilly Data Show Podcast explores the opportunities and techniques driving big data, data science, and AI.

iTunes Ratings

58 Ratings
Ratings breakdown: 31, 12, 8, 6, 1

Dropping Knowledge Bombs

By Virtually Natalie - May 21 2019
Ben and his wide variety of knowledgeable guests are truly rockstars! They drop quality (and free!) knowledge bombs in each and every episode. The great advice they provide, combined with the relatable way in which they deliver it had me hooked from the very first listen. Thanks for putting out such a stellar show Ben - keep up the great work!

Great to hear from those in the front lines

By Daddictedy - Jan 17 2016
Great way to catch up on the history and evolution of DS.

Latest release on Oct 10, 2019

The Best Episodes Ranked Using User Listens

Updated by OwlTail 8 days ago

Rank #1: The evolution of data science, data engineering, and AI

This episode of the Data Show marks our 100th episode. This podcast grew out of video interviews conducted at O’Reilly’s 2014 Foo Camp. We had a collection of friends on hand who were key members of the data science and big data communities, and we decided to record short conversations with them. We originally planned to use those initial conversations as the basis of a regular series of video interviews. The logistics of studio interviews proved too complicated, but those Foo Camp conversations got us thinking about starting a podcast, and the Data Show was born.

To mark this milestone, my colleague Paco Nathan, co-chair of JupyterCon, turned the tables on me and asked me questions about previous Data Show episodes. In particular, we examined the evolution of key topics covered in this podcast: data science and machine learning, data engineering and architecture, AI, and the impact of each of these areas on businesses and companies. I’m proud of how this show has reached so many people across the world, and I’m looking forward to sharing more conversations in the future.

Here are some highlights from our conversation:

AI is more than machine learning

I think for many people machine learning is AI. I’m trying to, in the AI Conference series, convince people that a true AI system will involve many components, machine learning being one. Many of the guests I have seem to agree with that.

Evolving infrastructure for big data

In the early days of the podcast, many of the people I interacted with had Hadoop as one of the essential things in their infrastructure. I think while that might still be the case, there are more alternatives these days. I think a lot of people are going to object stores in the cloud. Another example is that before, people maintained specialized systems. There’s still that, but people are trying to see if they can combine some of these systems, or come up with systems that can do more than one workload. For example, this whole notion in Spark of having a unified system that is able to do batch and streaming caught on during the span of this podcast.

May 24 2018

30mins

Rank #2: What machine learning engineers need to know

In this episode of the Data Show, I spoke with Jesse Anderson, managing director of the Big Data Institute, and my colleague Paco Nathan, who recently became co-chair of JupyterCon. This conversation grew out of a recent email thread the three of us had on machine learning engineers, a new job role that LinkedIn recently pegged as the fastest growing job in the U.S. In our email discussion, there was some disagreement on whether such a specialized job role or title was needed in the first place. As Eric Colson pointed out in his beautiful keynote at Strata Data San Jose, when done too soon, creating specialized roles can slow down your data team.

We recorded this conversation at Strata San Jose, while Anderson was in the middle of teaching his very popular two-day training course on real-time systems. We closed the conversation with Anderson’s take on Apache Pulsar, a very impressive new messaging system that is starting to gain fans among data engineers.

Here are some highlights from our conversation:

Why we need machine learning engineers

Jesse Anderson: (2:09) One of the issues I’m seeing as I work with teams is that they’re trying to operationalize machine learning models, and the data scientists are not the ones to productionize these. They simply don’t have the engineering skills. Conversely, the data engineers don’t have the skills to operationalize this either. So, we’re seeing this kind of gap in between the data science and the data engineering, and the way I’m seeing it being filled is through a machine learning engineer.

… I disagree with Paco that generalization is the way to go. I think it’s hyper-specialization, actually. This is coming from my experience having taught a lot of enterprises. At a startup, I would say that super-specialization is probably not going to be as possible, but at an enterprise, you are going to have to have a team that specializes in big data, and that is apart from a team, even a software engineering team, that doesn’t work with data.

Putting Apache Pulsar on the radar of data engineers

Key features of Apache Pulsar. Image by Karthik Ramasamy, used with permission.

Jesse Anderson: (23:30) A lot of my time, since I’m really teaching data engineering, is spent on data integration and data ingestion. How do we move this data around efficiently? For a lot of that time, Kafka was really the only open source game in town for that. But now there’s another technology called Apache Pulsar. I’ve spent a decent amount of time actually going through Pulsar and there are some things that I see in it that Kafka will either have difficulty doing or won’t be able to do.

… Apache Pulsar separates pub-sub from storage. When I first read about that, I didn’t quite get it. I didn’t quite see why this is so important or why this is so interesting. It’s because you can scale your pub-sub and your storage resources independently. Now you’ve got something. Now you can say, “Well, we originally decided I wanted to store data for seven days. All right, let’s spin up some more BookKeeper processes and now we can store fourteen days, now we can store twenty-one days.” I think that’s going to be a pretty interesting addition there. The other side of that, the corollary to that, is, “Okay, we’re hitting Black Friday and we don’t have so much more data coming through as we have way more consumption and have way more things hitting our pub-sub. We could spin up more pub-sub with that.” This separation is actually allowing some interesting use cases.

Mar 29 2018

32mins

Rank #3: The current state of Apache Kafka

In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “the age of machine learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics—including machine learning—remains an area of focus for most companies. It turns out, “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. By making it easier to create and productionize data refinement pipelines on both batch and streaming data sources, analysts and data scientists can focus on analytics that can unlock value from data.

On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where this popular framework is headed.

Here are some highlights from our conversation:

The first engineering project that made use of Apache Kafka

If I remember correctly, we were putting Hadoop into place at LinkedIn for the first time, and I was on the team that was responsible for that. The problem was that all our scripts were actually built for another data warehousing solution. The question was, are we going to rewrite all of those scripts and now sort of make them Hadoop specific? And what happens when a third and a fourth and a fifth system is put into place?

So, the initial motivating use case was: ‘we are putting this Hadoop thing into place. That’s the new-age data warehousing solution. It needs access to the same data that is coming from all our applications. So, that is the thing we need to put into practice.’ This became Kafka’s very first use case at LinkedIn. From there, because that was very easy and I actually helped move one of the very first workloads to Kafka, it was hardly difficult to convince the rest of the LinkedIn engineering team to start moving over to Kafka.

So from there, Kafka adoption became pretty vital. Now, I think years down the line, all of LinkedIn runs on Kafka. It’s essentially the central nervous system for the whole company.

Microservices and Kafka

My own opinion of microservices is that it lets you add more money and turn it into software at a more constant rate by allowing engineers to focus on various parts of the application, by essentially decoupling a big monolith so that a lot of things can happen in parallel development of real applications.

… The upside is that it lets you move fast. It adds a certain amount of agility to an engineering organization. But it comes with its own set of challenges. And these were not very obvious back then. How are all these microservices deployed? How are they monitored? And, most importantly, how do they communicate with each other? The communication bit is where Kafka comes in. When you break a monolith, you break state. And you distribute that state across different machines that run all those different applications.

So now the problem is, ‘well, how do these microservices share that state? How do they talk to each other?’ Frequently, the expectation is that things happen in real time. The context of microservices where streams or Kafka comes in is in the communication model for those microservices. I should just say that there isn’t a one size fits all when it comes to communication patterns for microservices.
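
As a rough illustration of the communication model described above, here is a minimal in-memory sketch of services sharing state through an append-only log. It is not the Kafka API: `TopicLog`, `publish`, `poll`, and the `billing` and `shipping` consumer names are hypothetical, and a real deployment would use a Kafka client library against a broker.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log standing in for a topic (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._records = []
        # Each consumer tracks its own position, so services read independently.
        self._offsets = defaultdict(int)

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer):
        """Return records this consumer has not seen yet and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:]
        self._offsets[consumer] = len(self._records)
        return batch


orders = TopicLog()
orders.publish({"order_id": 1, "status": "created"})
orders.publish({"order_id": 1, "status": "paid"})

# Two services consume the same events without calling each other directly.
billing_view = orders.poll("billing")
shipping_view = orders.poll("shipping")

orders.publish({"order_id": 2, "status": "created"})
billing_next = orders.poll("billing")  # only the record published since the last poll
```

The point of the sketch is that the log, not a direct service-to-service call, carries the shared state. As the excerpt notes, this is one communication pattern among several, not a universal answer.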

Nov 22 2017

37mins

Rank #4: Make data science more useful

In this episode of the Data Show, I speak with Cassie Kozyrkov, technical director and chief decision scientist at Google Cloud. She describes “decision intelligence” as an interdisciplinary field concerned with all aspects of decision-making, and which combines data science with the behavioral sciences. Most recently she has been focused on developing best practices that can help practitioners make safe, effective use of AI and data. Kozyrkov uses her platform to help data scientists develop skills that will enable them to connect data and AI with their organizations’ core businesses.

We had a great conversation spanning many topics, including:

  • How data science can be more useful
  • The importance of the human side of data
  • The leadership talent shortage in data science
  • Is data science a bubble?

Aug 01 2019

35mins

Rank #5: Enabling end-to-end machine learning pipelines in real-world applications

In this episode of the Data Show, I spoke with Nick Pentreath, principal engineer at IBM. Pentreath was an early and avid user of Apache Spark, and he subsequently became a Spark committer and PMC member. Most recently his focus has been on machine learning, particularly deep learning, and he is part of a group within IBM focused on building open source tools that enable end-to-end machine learning pipelines.

We had a great conversation spanning many topics, including:

Jun 20 2019

42mins

Rank #6: Tools for machine learning development

In this week’s episode of the Data Show, we’re featuring an interview Data Show host Ben Lorica participated in for the Software Engineering Daily Podcast, where he was interviewed by Jeff Meyerson. Their conversation mainly centered around data engineering, data architecture and infrastructure, and machine learning (ML).

Here are a few highlights:

Tools for productive collaboration

A data catalog, at a high level, basically answers questions around the data that’s available and who is using it so an enterprise can understand access patterns. … The term “data catalog” is generally used when you’ve gotten to the point where you have a team of data scientists and you need a place where they can use libraries in a setting where they can collaborate, and where they can share not only models but maybe even data pipelines and features. The more advanced data science platforms will have automation tools built in. … The ideal scenario is the data science platform is not just for prototyping, but also for pushing things to production.

Tools for ML development

We have tools for software development, and now we’re beginning to hear about tools for machine learning development. There’s a company here at Strata called Comet.ml, and there’s another startup called Verta.ai. But what has really caught my attention is an open source project from Databricks called MLflow. When it first came out, I thought, ‘Oh, yeah, so we don’t have anything like this. Might have a decent chance of success.’ But I didn’t pay close attention until recently; fast forward to today, there are 80 contributors from 40 companies and 200+ companies using it.

What’s good about MLflow is that it has three components and you’re free to pick and choose: you can use one, two, or three. Based on their surveys, the most popular component is the one for tracking and managing machine learning experiments. It’s designed to be useful for individual data scientists, but it’s also designed to be used by teams of data scientists, so they have documented use cases of MLflow where you have a company managing thousands of models in production.

Jul 03 2019

39mins

Rank #7: Using machine learning to improve dialog flow in conversational applications

In this episode of the Data Show, I spoke with Alan Nichol, co-founder and CTO of Rasa, a startup that builds open source tools to help developers and product teams build conversational applications. About 18 months ago, there was tremendous excitement and hype surrounding chatbots, and while things have quieted lately, companies and developers continue to refine and define tools for building conversational applications. We spoke about the current state of chatbots, specifically about the types of applications developers are building today and how he sees conversational applications evolving in the near future.

As I described in a recent post, workflow automation will happen in stages. With that in mind, chatbots and intelligent assistants are bound to improve as underlying algorithms, technologies, and training data get better.

Here are some highlights from our conversation:

Chatbots and state machines

The first component is what we call natural language understanding, which typically means taking a short message that a user sends and extracting some meaning from it, which means turning it into structured data. In the case we talked about regarding the SQL database, if somebody asks, for example, ‘What was my ROI on my Facebook campaigns last month?’, the first thing you want to understand is that this is a data question and you want to assign it a label identifying it as a person, and they’re not saying hello, or goodbye, or thank you, but asking a specific question. Then you want to pick out those fields to help you create a query.

… The second piece is, how do you actually know what to do next? How do you build a system that can hold a conversation that is coherent? What you realize very quickly is that it’s not enough to have one input always matched to the same output. For example, if you ask somebody a yes or no question and they say, ‘yes,’ the next thing to do, of course, depends on what the original question was.

… Real conversations aren’t stateless; they have some context and they need to pay attention to the history. So, the way developers do that is build a state machine. Which means, for example, that you have a bot that can do some different things. It can talk about flights; it can talk about hotels. Then you define different states for when the person is still searching, or for when they are comparing different things, or for when they finish a booking. And then you have to define rules for how to behave for every input, for every possible state.
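
The state machine approach described above can be sketched in a few lines. This is an illustration only, not Rasa code: the state names, intent labels, and replies are hypothetical, and the intent is assumed to have already been extracted by the natural language understanding step.

```python
# Hand-written rules: (current state, user intent) -> (next state, bot reply).
# All states, intents, and replies here are made up for illustration.
RULES = {
    ("searching", "ask_flights"): ("comparing", "Here are some flights."),
    ("searching", "ask_hotels"): ("comparing", "Here are some hotels."),
    ("comparing", "affirm"): ("booked", "Great, your booking is confirmed."),
    ("comparing", "deny"): ("searching", "Okay, what else can I look for?"),
    ("booked", "affirm"): ("booked", "Your booking is already confirmed."),
}
FALLBACK = "Sorry, I did not understand that."


def step(state, intent):
    """Apply one rule; stay in the same state with a fallback reply if none matches."""
    return RULES.get((state, intent), (state, FALLBACK))


def run(intents, state="searching"):
    """Feed a sequence of intents through the machine and collect the replies."""
    replies = []
    for intent in intents:
        state, reply = step(state, intent)
        replies.append(reply)
    return state, replies


final_state, replies = run(["ask_flights", "affirm", "affirm"])
```

The same `affirm` input produces a different reply depending on the current state, which is the context-dependence the excerpt describes. Every (state, intent) pair without an explicit rule falls through to the fallback, so the rule table has to grow with the number of states times the number of intents.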

Beyond state machines

The problem is that [the state machine] approach works for building your first version, but it really restricts you to what we call “the happy paths,” which is where the user is compliant and cooperative and does everything you ask them to do. But in typical cases, you ask a person, “Do you like option A, or option B?” Then you probably build the path for the person saying A, and you build a path for the person saying B. But then you give it to real users, and they say, “No, I don’t like either of those.” Or they ask a question like, “Why is A so much more expensive than B?” Or, “Let me get back to you about that.”

… They don’t scale, that’s the problem. If you’re a developer and somebody has a conversation with your bot and you realize that it did the wrong thing, now you have to go look back at your (literally) thousands or tens of thousands of rules to figure out which one crashed and which one did the wrong thing. You figure out where to inject one more rule to handle one more etiquette, and that just doesn’t scale at all.

… With our dialogue library Rasa Core, we give the user the ability to talk to the bot and provide feedback. So, in Rasa, the whole flow of dialogue is also controlled with machine learning. And it’s learned from real sample conversations. You talk to the system and if it does something wrong, you provide feedback and it corrects itself. So, you explore the space of possible conversations interactively yourself, and then your users do as well.

Sep 13 2018

45mins

Rank #8: How privacy-preserving techniques can lead to more robust machine learning models

In this episode of the Data Show, I spoke with Chang Liu, applied research scientist at Georgian Partners. In a previous post, I highlighted early tools for privacy-preserving analytics, both for improving decision-making (business intelligence and analytics) and for enabling automation (machine learning). One of the tools I mentioned is an open source project for SQL-based analysis that adheres to state-of-the-art differential privacy (a formal guarantee that provides robust privacy assurances). Since business intelligence typically relies on SQL databases, this open source project is something many companies can already benefit from today.

What about machine learning? While I didn’t have space to point this out in my previous post, differential privacy has been an area of interest to many machine learning researchers. Most practicing data scientists aren’t aware of the research results, and popular data science tools haven’t incorporated differential privacy in meaningful ways (if at all). But things will change over the next months. For example, Liu wants to make ideas from differential privacy accessible to industrial data scientists, and she is part of a team building tools to make this happen.

Here are some highlights from our conversation:

Differential privacy and machine learning

In the literature, there are actually multiple ways differential privacy is used in machine learning. We can either inject noise directly at the input data level, or while we’re training a model. We can also inject noise into the gradient. At every iteration we’re computing the gradients, we can inject some sort of noise. Or we can also inject noise during aggregation. If we’re using ensembles, we can inject noise there. And we can also inject noise at the output level. So after we’ve trained the model, and we have our vectors of weights, then we can also inject noise directly to the weights.
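
Two of the injection points mentioned above, noise added to the gradient at each iteration and noise added to the trained weights, can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling Liu's team is building: the function names are hypothetical, Gaussian noise with gradient clipping is one common choice among several, and the sketch does not calibrate the noise to a privacy budget, so on its own it provides no formal differential privacy guarantee.

```python
import math
import random


def clip_l2(vector, max_norm):
    """Scale the vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= max_norm or norm == 0.0:
        return list(vector)
    scale = max_norm / norm
    return [x * scale for x in vector]


def add_gaussian_noise(vector, stddev, rng):
    """Add independent zero-mean Gaussian noise to each coordinate."""
    return [x + rng.gauss(0.0, stddev) for x in vector]


def noisy_gradient_step(weights, gradient, lr, max_norm, noise_stddev, rng):
    """Gradient-level injection: clip the gradient, add noise, then update."""
    clipped = clip_l2(gradient, max_norm)
    noisy = add_gaussian_noise(clipped, noise_stddev, rng)
    return [w - lr * g for w, g in zip(weights, noisy)]


def perturb_weights(weights, noise_stddev, rng):
    """Output-level injection: add noise directly to the trained weights."""
    return add_gaussian_noise(weights, noise_stddev, rng)


rng = random.Random(0)       # fixed seed so the example is repeatable
weights = [0.5, -1.0]
gradient = [3.0, 4.0]        # L2 norm is 5.0, so it will be clipped

clipped = clip_l2(gradient, max_norm=1.0)
updated = noisy_gradient_step(weights, gradient, lr=0.1, max_norm=1.0,
                              noise_stddev=0.5, rng=rng)
released = perturb_weights(updated, noise_stddev=0.1, rng=rng)
```

Clipping bounds how much any single gradient can move the model, which is what makes it meaningful to choose a noise scale relative to that bound. Choosing the scale for a target privacy level is the part this sketch leaves out.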

A mechanism for building robust models

There could be a chance that differential privacy methods can actually make your model more general. Because, essentially, when models memorize their training data, it could be due to overfitting. So, injecting all of this noise may help the resulting model move you further away from overfitting, and you get a more general model.

Aug 02 2018

36mins

Rank #9: Trends in data, machine learning, and AI

For the end-of-year holiday episode of the Data Show, I turned the tables on Data Show host Ben Lorica to talk about trends in big data, machine learning, and AI, and what to look for in 2019. Lorica also showcased some highlights from our upcoming Strata Data and Artificial Intelligence conferences.

Here are some highlights from our conversation:

Real-world use cases for new technology

If you’re someone who wants to use data, data infrastructure, data science, machine learning, and AI, we’re really at the point where there are a lot of tools for implementers and developers. They’re not necessarily doing research and development; they just want to build better products and automate workflow. I think that’s the most significant development in my mind.

And then I think use case sharing also has an impact. For example, at our conferences, people are sharing how they’re using AI and ML in their businesses, so the use cases are getting better defined—particularly for some of these technologies that are relatively new to the broader data community, like deep learning. There are now use cases that touch the types of problems people normally tackle—so, things that involve structured data, for example, for time series forecasting, or recommenders.

With that said, while we are in an implementation phase, I think as people who follow this space will attest, there’s still a lot of interesting things coming out of the R&D world, so still a lot of great innovation and a lot more growth in terms of how sophisticated and how easy to use these technologies will be.

Addressing ML and AI bottlenecks

We have a couple of surveys that we’ll release early in 2019. In one of these surveys, we asked people what the main bottleneck is in terms of adopting machine learning and AI technologies.

Interestingly enough, the main bottleneck was cultural issues—people are still facing challenges in terms of convincing people within their companies to adopt these technologies. And then, of course, the next two are the ones we’re familiar with: lack of data and lack of skilled people. And then the fourth bottleneck people cited was trouble identifying business use cases.

What’s interesting about that is, if you then ask people how mature their practice is and you look at the people with the most mature AI and machine learning practices, they still cite a lack of data as the main bottleneck. What that tells me is that there’s still a lot of opportunity for people to apply these technologies within their companies, but there’s a lot of foundational work people have to do in terms of just getting data in place, getting data collected and ready for analytics.

Focus on foundational technologies

At the Strata Data conferences in San Francisco, London, and New York, the emphasis will be on building technologies, bringing in technologies and cultural practices that will allow you to sustain analytics and machine learning in your organization. That means having all of the foundational technologies in place: data ingestion, data governance, ETL, data lineage, data science platform, metadata store, and things like that, the various pieces of technology that will be important as you scale the practice of machine learning and AI in your company.

At the Artificial Intelligence conferences, we remain focused on being the de facto gathering place for people interested in applied artificial intelligence. We will focus on servicing the most important use cases in many, many domains. That means showcasing, of course, the latest research in deep learning and other branches of machine learning, but also helping people grapple with some of the other important considerations, like privacy and security, fairness, reliability, and safety.

…At both the Strata Data and Artificial Intelligence conferences, we will focus on helping people understand the capabilities of the technology, the strengths and limitations; that’s why we run executive briefings at all of these events. We showcase case studies that are aimed at the non-technical and business user as well—so, we’ll have two types of case studies, one more technical and one not so technical so the business decision-makers can benefit from seeing how their peers are using and succeeding with some of these technologies.

Dec 20 2018

28mins

Rank #10: Specialized hardware for deep learning will unleash innovation

In this episode of the Data Show, I spoke with Andrew Feldman, founder and CEO of Cerebras Systems, a startup in the blossoming area of specialized hardware for machine learning. Since the release of AlexNet in 2012, we have seen an explosion in activity in machine learning, particularly in deep learning. A lot of the work to date happened primarily on general purpose hardware (CPU, GPU). But now that we’re six years into the resurgence in interest in machine learning and AI, these new workloads have attracted technologists and entrepreneurs who are building specialized hardware for both model training and inference, in the data center or on edge devices.

In fact, companies with enough volume have already begun building specialized processors for machine learning. But you have to either use specific cloud computing platforms or work at specific companies to have access to such hardware. A new wave of startups (including Cerebras) will make specialized hardware affordable and broadly available. Over the next 12-24 months architects and engineers will need to revisit their infrastructure and decide between general purpose or specialized hardware, and cloud or on-premise gear.

In light of the training duration and cost they face using current (general purpose) hardware, some experiments might be hard to justify. Upcoming specialized hardware will enable data scientists to try out ideas that they previously would have hesitated to pursue. This will surely lead to more research papers and interesting products as data scientists are able to run many more experiments (on even bigger models) and iterate faster.

Because Feldman is the founder of one of the most anticipated hardware startups in the deep learning space, I wanted to get his views on the challenges and opportunities faced by engineers and entrepreneurs building hardware for machine learning workloads.

Here are some highlights from our conversation:

A renaissance for computer architecture

OpenAI put out some very interesting analysis recently that showed that since 2012, the compute use for the largest AI training runs has increased by 300,000x. … What’s available to us to attack the vast discrepancy between compute demand and what we have today? Two things are available to us. The first is, exploring interesting compute architectures. I think this ushers in a golden age for compute architectures. And number two, it’s building dedicated hardware and saying: ‘We’re prepared to make trade offs to accelerate AI compute by not trying to be good at other things. By not trying to be good at graphics or by not trying to be a good web server. But we will attack this vast demand for compute by building dedicated hardware for artificial intelligence work.’ Historically, the following has been a very productive and valuable trade off: new and interesting architectures dedicated for a particular type of work. That’s the opportunity that many of these hardware companies or chip companies have seen.

Communication intensive workloads

When you stay on a chip, you can communicate fairly quickly. The problem is our work in artificial intelligence often spans more than one traditional chip. And the performance penalty for leaving the chip is very, very high. On-chip, you stay in silicon; off-chip, you have to wrap your communication in some sort of protocol, you need to send it, connect it over lanes on a printed circuit board or maybe through a PCI switch, or maybe through an Ethernet switch or an InfiniBand switch. This adds two, three, four orders of magnitude of latency.

Some of the problems the hardware vendors who are interested in solving data center training, data center inference are working on are how you can accelerate the communication between cores and across tens of thousands of cores, or even hundreds of thousands of cores across many chips. Some are inventing new techniques for special switches and modifying PCIe to do that. Others have sort of more fundamental approaches to accelerating this communication. But if you can’t communicate quickly, you can’t train a model quickly and you can’t provide inference quickly.

New hardware will significantly reduce training times

You should see a reduction in training times of 10-50x sometime over the next 12-18 months. … I think you’re going to see an additional 10-25x in the following year. We’re looking at three orders of magnitude in reduction in training time over the next several years.
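
The arithmetic behind the "three orders of magnitude" figure can be checked directly. The sketch below assumes the two quoted speedup ranges compound multiplicatively, which the excerpt implies but does not state.

```python
import math

# Ranges quoted in the excerpt above.
first_period = (10, 50)    # speedup over the next 12-18 months
second_period = (10, 25)   # additional speedup in the following year

# Assumption: the second speedup applies on top of the first, so they multiply.
combined_low = first_period[0] * second_period[0]
combined_high = first_period[1] * second_period[1]

# Orders of magnitude are the base-10 logarithm of the combined speedup.
orders_low = math.log10(combined_low)
orders_high = math.log10(combined_high)

print(f"combined speedup: {combined_low}x to {combined_high}x")
print(f"orders of magnitude: {orders_low:.1f} to {orders_high:.1f}")
```

Under that assumption the combined range is 100x to 1,250x, so three orders of magnitude corresponds to the optimistic end of both estimates.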

Jul 19 2018

41mins

Rank #11: Why companies are in need of data lineage solutions

In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data—specifically, labeled data for training models. There are also many other considerations—including security, privacy, reliability/safety—that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up.

There are several San Francisco Bay Area companies that have embarked on building data lineage systems—including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It’s something that is born out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I’m describing data lineage: think of it as a journey for data. The data takes a journey entering into your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew what journey the data took, and exactly what went into producing it, before it landed in your data warehouse or any other storage appliance you use, that would be really useful.

… Think about data lineage as helping issues about quality of data, understanding if something is corrupted. On the security side, think of GDPR … which was one of the hot topics I heard about at the Strata Data Conference in London in 2018.

Why companies are suddenly building data lineage solutions

A data lineage system becomes necessary as time progresses. It becomes easier for maintainability. You need it for audit trails, for security and compliance. But you also need to think of the benefit of managing the data sets you’re working with. If you’re working with 10 databases, you need to know what’s going on in them. If I have to give you a vision of a data lineage system, think of it as a final graph or view of some data set, and it shows you a graph of what it’s linked to. Then it gives you some metadata information so you can drill down. Let’s say you have corrupted data, let’s say you want to debug something. All these cases tie into the actual use cases for which we want to build it.
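The "graph of what it's linked to" idea can be sketched in a few lines. This is only an illustration, not the system built at Stitch Fix: the data set names, the `owner` metadata field, and the `upstream` and `downstream` helpers are all hypothetical.

```python
from collections import deque

# Hypothetical lineage records: each data set lists the data sets it was
# derived from, plus some metadata to drill into. All names are made up.
LINEAGE = {
    "raw_orders":      {"parents": [], "owner": "ingest"},
    "raw_clients":     {"parents": [], "owner": "ingest"},
    "orders_cleaned":  {"parents": ["raw_orders"], "owner": "etl"},
    "client_features": {"parents": ["raw_clients", "orders_cleaned"], "owner": "ml"},
    "recommendations": {"parents": ["client_features"], "owner": "ml"},
}

def upstream(dataset, lineage=LINEAGE):
    """Return every data set that `dataset` depends on, nearest first."""
    seen, order = set(), []
    queue = deque(lineage[dataset]["parents"])
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        order.append(name)
        queue.extend(lineage[name]["parents"])
    return order

def downstream(dataset, lineage=LINEAGE):
    """Return every data set derived, directly or indirectly, from `dataset`."""
    return sorted(name for name in lineage if dataset in upstream(name, lineage))

# Debugging a suspect recommendation: walk back through everything it was built from.
print(upstream("recommendations"))
# ['client_features', 'raw_clients', 'orders_cleaned', 'raw_orders']

# Corrupted raw data: find everything that may need to be rebuilt.
print(downstream("raw_orders"))
# ['client_features', 'orders_cleaned', 'recommendations']
```

Walking the graph backward covers the debugging case, and walking it forward covers the corrupted-data case. A real system would record these edges automatically as jobs read and write tables, rather than in a hand-written dictionary.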

Apr 25 2019

34mins

Rank #12: Acquiring and sharing high-quality data

In this episode of the Data Show, I spoke with Roger Chen, co-founder and CEO of Computable Labs, a startup focused on building tools for the creation of data networks and data exchanges. Chen has also served as co-chair of O’Reilly’s Artificial Intelligence Conference since its inception in 2016. This conversation took place the day after Chen and his collaborators released an interesting new white paper, Fair value and decentralized governance of data. Current-generation AI and machine learning technologies rely on large amounts of data, and to the extent they can use their large user bases to create “data silos,” large companies in large countries (like the U.S. and China) enjoy a competitive advantage. With that said, we are awash in articles about the dangers posed by these data silos. Privacy and security, disinformation, bias, and a lack of transparency and control are just some of the issues that have plagued the perceived owners of “data monopolies.”

In recent years, researchers and practitioners have begun building tools focused on helping organizations acquire, build, and share high-quality data. Chen and his collaborators are doing some of the most interesting work in this space, and I recommend their new white paper and accompanying open source projects.

Sequence of basic market transactions in the Computable Labs protocol. Source: Roger Chen, used with permission.

We had a great conversation spanning many topics, including:

  • Why he chose to focus on data governance and data markets.
  • The unique and fundamental challenges in accurately pricing data.
  • The importance of data lineage and provenance, and the approach they took in their proposed protocol.
  • What cooperative governance is and why it’s necessary.
  • How their protocol discourages an unscrupulous user from just scraping all data available in a data market.

Jul 18 2019

39mins

Rank #13: Machine intelligence for content distribution, logistics, smarter cities, and more

In this episode of the Data Show, I spoke with Rhea Liu, analyst at China Tech Insights, a new research firm that is part of Tencent’s Online Media Group. If there’s one place where AI and machine learning are discussed even more than the San Francisco Bay Area, that would be China. Each time I go to China, there are new applications that weren’t widely available just the year before. This year, it was impossible to miss bike sharing, mobile payments seemed to be accepted everywhere, and people kept pointing out nascent applications of computer vision (facial recognition) to identity management and retail (unmanned stores).

I wanted to consult local market researchers to help make sense of some of the things I’ve been observing from afar. Liu and her colleagues have put out a series of interesting reports highlighting some of these important trends. They also have an annual report—Trends & Predictions for China’s Tech Industry in 2018—that Liu will discuss in her keynote and talk at Strata Data Singapore in December.

Here are some highlights from our conversation:

Machine learning and content distribution

Media consumption takes a large proportion of people’s everyday life here in China. Before, people learned their news from news portals and from editorial teams who served as the gatekeepers. People now trust machine learning algorithms with editorial and agenda setting. Apps like Toutiao have become very popular.

It’s been quite a surprise to most news portals and media professionals here in China. People are trying to find a balance between the traditional ways of content creation and the new ways of content distribution by aggregators fully powered by machines. Toutiao’s news recommendation engine is purely a black box to most people. … But users are spending more and more time on these types of platforms. And, machine-generated news feeds have become a big thing.

… So, it’s now becoming a content war again. After these algorithms improve the efficiency of content distribution, the battle may come down to what content you have.

Bike sharing

Bike sharing is kind of a new model adapted to Chinese society. … In between every subway station, there’s still several miles to go, where people still need to walk or maybe take a taxi. Bike sharing is being used to replace these other kinds of approaches.

There are two primary players. One is Mobike and the other one is Ofo, and they started with different models, actually. Ofo started a year or two earlier from a university campus. … It provided this kind of public bike rental system to users on campus. This was kind of the preliminary prototype of this model. Mobike started in a city.

These bike sharing companies have their GPS systems on the bikes, and the bikes have digital electronic locks that can be unlocked with an app on your phone. These technologies, combined together, can help them collect data as well as have a better management system of all the bikes they distribute over a city.

Smart cities

It’s still a maybe, but it’s very likely we are going to include things about smart cities in our 2018 reports. … This includes AR applications to help build better cities for urban planning. … Urban planning is a very complicated thing, and what we are missing there is, we can be a little bit left behind because of the lack of data. But now people have different types of data. For example, I know the ride sharing company Didi is collaborating with several city governments to help them do urban planning: by using data to better understand traffic, how to manage traffic light systems in the city, and also the bus system.

City governments at all levels are now collaborating with all these tech companies to explore applications of their data to improve the cities we have in China. … This is going to be a very important opportunity for the tech companies here in China, especially in terms of their data applications, and their contributions to society.

Oct 26 2017

36mins

Rank #14: Effective mechanisms for searching the space of machine learning algorithms

In this episode of the Data Show, I spoke with Ken Stanley, founding member of Uber AI Labs and associate professor at the University of Central Florida. Stanley is an AI researcher and a leading pioneer in the field of neuroevolution—a method for evolving and learning neural networks through evolutionary algorithms. In a recent survey article, Stanley went through the history of neuroevolution and listed recent developments, including its applications to reinforcement learning problems.

Stanley is also the co-author of a book entitled Why Greatness Cannot Be Planned: The Myth of the Objective—a book I’ve been recommending to anyone interested in innovation, public policy, and management. Inspired by Stanley’s research in neuroevolution (into topics like novelty search and open endedness), the book is filled with examples of how notions first uncovered in the field of AI can be applied to many other disciplines and domains.

The book closes with a case study that hits closer to home: the current state of research in AI. One can think of machine learning and AI as a search for ever better algorithms and models. Stanley points out that gatekeepers (editors of research journals, conference organizers, and others) impose two objectives that researchers must meet before their work gets accepted or disseminated: (1) empirical: their work should beat incumbent methods on some benchmark task, and (2) theoretical: proposed new algorithms are better if they can be proven to have desirable properties. Stanley argues this means that interesting work (“stepping stones”) that fails to meet either of these criteria falls by the wayside, preventing other researchers from building on potentially interesting but incomplete ideas.

Here are some highlights from our conversation:

Neuroevolution today

In the state of the art today, the algorithms have the ability to evolve variable topologies or different architectures. There are pretty sophisticated algorithms for evolving the architecture of a neural network; in other words, what’s connected to what, not just what the weights of those connections are, which is what deep learning is usually concerned with.

There’s also an idea of how to encode very, very large patterns of connectivity. This is something that’s been developed independently in neuroevolution where there’s not a really analogous thing in deep learning right now. This is the idea that if you’re evolving something that’s really large, then you probably can’t afford to encode the whole thing in the DNA. In other words, if we have 100 trillion connections in our brains, our DNA does not have 100 trillion genes. In fact, it couldn’t have 100 trillion genes. It just wouldn’t fit. That would be astronomically too high. So then, how is it that with a much, much smaller space of DNA, which is about 30,000 genes or so, three billion base pairs, how would you get enough information in there to encode something that’s 100 trillion parts?
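A quick back-of-the-envelope check shows how large the gap is. The two figures come from the passage above. The bits-per-base-pair and bits-per-connection values are assumptions made here for illustration, chosen to be as generous to a direct encoding as possible.

```python
# Figures quoted in the passage above.
connections = 100e12   # 100 trillion connections in the brain
base_pairs = 3e9       # about three billion base pairs of DNA

# Assumption: each base pair is one of four nucleotides, so it can carry
# at most 2 bits of information.
genome_bits = base_pairs * 2

# Assumption: a direct encoding spends at least 1 bit per connection
# (merely "present or absent", with no weight stored at all).
direct_bits = connections * 1

shortfall = direct_bits / genome_bits
print(f"genome capacity: {genome_bits:.1e} bits")
print(f"direct encoding: {direct_bits:.1e} bits")
print(f"shortfall: about {shortfall:,.0f}x")
```

Even under these generous assumptions the genome is short by a factor of roughly 17,000, which is why some compressed, indirect encoding is needed.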

This is the issue of encoding. We’ve become sophisticated at creating artificial encodings that are basically compressed in an analogous way, where you can have a relatively short string of information to describe a very large structure that comes out—in this case, a neural network. We’ve gotten good at doing encoding and we’ve gotten good at searching more intelligently through the space of possible neural networks. We originally thought what you need to do is just breed by choosing among the best. So, you say, ‘Well, there’s some task we’re trying to do and I’ll choose among the best to create the next generation.’

We’ve learned since then that that’s actually not always a good policy. Sometimes you want to explicitly choose for diversity. In fact, that can lead to better outcomes.
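One way to "explicitly choose for diversity" is to rank individuals by how different their behavior is from everyone else's, instead of by how well they score. The sketch below follows the general shape of novelty search (mean distance to the k nearest neighbors in a behavior space). The function names, the tuple-of-numbers behavior descriptor, and the value of `k` are assumptions for illustration, not code from Stanley's work.

```python
import math

def novelty(behavior, others, k=3):
    """Mean distance from `behavior` to its k nearest neighbors in `others`."""
    distances = sorted(math.dist(behavior, other) for other in others)
    nearest = distances[:k]
    if not nearest:          # a population of one has nothing to differ from
        return 0.0
    return sum(nearest) / len(nearest)

def select_for_diversity(population, n_parents, k=3):
    """Pick the n_parents individuals whose behaviors are most novel.

    `population` is a list of (genome, behavior) pairs. A behavior is a tuple
    of numbers describing what the individual did, not how well it scored.
    """
    scored = []
    for i, (genome, behavior) in enumerate(population):
        others = [b for j, (_, b) in enumerate(population) if j != i]
        scored.append((novelty(behavior, others, k), genome))
    # Sort on the novelty score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [genome for _, genome in scored[:n_parents]]

# Four individuals that behave almost identically, and one that ends up far away.
population = [
    ("a", (0.0, 0.0)),
    ("b", (0.1, 0.0)),
    ("c", (0.0, 0.1)),
    ("d", (0.1, 0.1)),
    ("e", (5.0, 5.0)),
]
print(select_for_diversity(population, n_parents=1, k=2))  # ['e']
```

No task objective appears anywhere in the selection step. In practice, novelty-based methods usually also keep an archive of past behaviors so that the search does not circle back to places it has already been.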

The myth of the objective

Our book does recognize that sometimes pursuing objectives is a rational thing to do. But I think the broader point that’s important here is there’s a class of discoveries for which it really is against your interest to frame what you’re doing in terms of an objective.

The reason we wrote the book is because … I started to realize this principle that ‘sometimes in order to make discovery possible, you have to stop having an objective’ speaks to people beyond just computer scientists who are developing algorithms. It’s an issue for our society and for institutions because there are many things we do that are driven by some kind of objective metric. It almost sounds like heresy to suggest that you shouldn’t do that.

It’s like an unquestioned assumption that exists throughout our culture that the primary route to progress is to set objectives and move toward those objectives and measure your performance with respect to those objectives. We began to think that given the results we have that are hard empirical results, that it is important to counterweight this belief that pervades society with a counter argument that points out that there are cases where this is actually a really bad idea.

The thing I learned more and more talking to different groups is that this discussion is not being had. We’re not talking about this, and I think it’s a very important discussion because our institutions are geared away from innovation because they are so objectively driven. We could do more to foster innovation if we recognize this principle. A lot of people want this security blanket of an objective because they don’t trust anything that isn’t driven by an objective.

Actually, it turns out there are principled ways of exploring the world without an objective. In other words, it’s not just random and the book is about that. It’s about how smart ways of exploring in a non-objective way can lead to really, really important results. We just wanted to open up that conversation ‘society wide’ and not just have it narrowly within the field of computer science because it is such an important conversation to have.

Aug 31 2017

45mins

Rank #15: How social science research can inform the design of AI systems

In this episode of the Data Show, I spoke with Jacob Ward, a Berggruen Fellow at Stanford University. Ward has an extensive background in journalism, mainly covering topics in science and technology, at National Geographic, Al Jazeera, Discovery Channel, BBC, Popular Science, and many other outlets. Most recently, he’s become interested in the interplay between research in psychology, decision-making, and AI systems. He’s in the process of writing a book on these topics, and was gracious enough to give an informal preview by way of this podcast conversation.

Here are some highlights from our conversation:

Psychology and AI

I began to realize there was a disconnect between what is a totally revolutionary set of innovations coming through in psychology right now that are really just beginning to scratch the surface of how human beings make decisions; at the same time, we are beginning to automate human decision-making in a really fundamental way. I had a number of different people say, ‘Wow, what you’re describing in psychology really reminds me of this piece of AI that I’m building right now,’ to change how expectant mothers see their doctors or change how we hire somebody for a job or whatever it is.

Transparency and designing systems that are fair

I was talking to somebody the other day who was trying to build a loan company that was using machine learning to present loans to people. He and his company did everything they possibly could to not redline the people they were loaning to. They were trying very hard not to make unfair loans that would give preference to white people over people of color.

They went to extraordinary lengths to make that happen. They cut addresses out of the process. They did all of this stuff to try to basically neutralize the process, and the machine learning model still would pick white people at a disproportionate rate over everybody else. They can’t explain why. They don’t know why that is. There’s some variable that’s mapping to race that they just don’t know about.

But that sort of opacity—this is somebody explaining it to me who just happened to have been inside the company, but it’s not as if that’s on display for everybody to check out. These kinds of closed systems are picking up patterns we can’t explain, and that their creators can’t explain. They are also making really, really important decisions based on them. I think it is going to be very important to change how we inspect these systems before we begin trusting them.

Anthropomorphism and complex systems

In this book, I’m also trying to look at the way human beings respond to being given an answer by an automated system. There are some very well-established, psychological principles out there that can give us some sense of how people are going to respond when they are told what to do based on an algorithm.

The people who study anthropomorphism, the imparting of intention and human attributes to an automated system, say there’s a really well-established pattern. When people are shown a very complex system and given some sort of exposure to that complex system, whether it gives them an answer or whatever it is, it tends to produce in human beings a level of trust in that system that doesn’t really have anything to do with reality. … The more complex the system, the more people tend to trust it.

Oct 11 2018

45mins

Rank #16: Machine learning at Spotify: You are what you stream

In this episode of the Data Show, I spoke with Christine Hung, head of data solutions at Spotify. Prior to joining Spotify, she led data teams at the NY Times and at Apple (iTunes). Because she has led teams at three different companies, I wanted to hear her thoughts on digital transformation, and I wanted to know how she approaches the challenge of building, managing, and nurturing data teams.

I also wanted to learn more about what goes into building a recommender system for a popular consumer service like Spotify. Engagement should clearly be the most important metric, but there are other considerations, such as introducing users to new or “long tail” content.

Here are some highlights from our conversation:

Recommenders at Spotify

For us, engagement always comes first. At Spotify, we have a couple hundred people who are just focused on user engagement, and this is the group that creates personalized playlists, like Discover Weekly or your Daily Mix for you. We know our users love discovery and see Spotify as a very important platform for them to discover something new, but there are also times when people just want to have some music played in the background that fits the mood. But again, we don’t have a specific agenda in terms of what we should push for. We want to give you what you want so that you are happy, which is why we invested so much in understanding people through music. If we believe you might like some “long tail” content, we will recommend it to you because it makes you happy, but we can also do the same for the top 100 tracks if we believe you will enjoy them.

Music is like a mirror

Music is like a mirror, and it tells people a lot about who you are and what you care about, whether you like it or not. We love to say “you are what you stream,” and that is so true. As you can imagine, we invest a lot in our machine learning capabilities to predict people’s preference and context, and of course, all the data we use to train the model is anonymized. We take in large amounts of anonymized training data to develop these models, and we test them out with different use cases, analyze results, and use the learning to improve those models.

Just to give you my personal example to illustrate how it works, you can learn a lot about me just by me telling you what I stream. You will see that I use my running playlist only during the weekend in early mornings, and I have a lot of children’s songs streamed at my house between 5 p.m. and 7 p.m. I also have a lot of tango and salsa playlists that I created and followed. So what does that tell you? It tells you that I am probably a weekend runner, which means I have some kind of affinity for fitness; it tells you that I am probably a mother and play songs for my child after I get home from work; it also tells you that I somehow like tango and salsa, so I am probably a dancer, too. As you can see, we are investing a lot into understanding people’s context and preference so we can start capturing different moments of their lives. And, of course, the more we understand your context, your preference, and what you are looking for, the better we can customize your playlists for you.

Dec 07 2017

21mins

Rank #17: Building tools for enterprise data science

In this episode of the Data Show, I spoke with Vitaly Gordon, VP of data science and engineering at Salesforce. As the use of machine learning becomes more widespread, we need tools that will allow data scientists to scale so they can tackle many more problems and help many more people. We need automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection and hyperparameter tuning, as well as monitoring.

I wanted the perspective of someone who is already faced with having to support many models in production. The proliferation of models is still a theoretical consideration for many data science teams, but Gordon and his colleagues at Salesforce already support hundreds of thousands of customers who need custom models built on custom data. They recently took their learnings public and open sourced TransmogrifAI, a library for automated machine learning for structured data, which sits on top of Apache Spark.

Here are some highlights from our conversation:

The need for an internal data science platform

It’s more about how much commonality there is between every single data science use case—how many of the problems are redundant and repeatable.

… A lot of data scientists solve problems that honestly have a lot to do with engineering, a lot to do with things that are not pure modeling.

TransmogrifAI

TransmogrifAI is an automated machine learning library for mostly structured data, and the problem that it aims to solve is that we at Salesforce have hundreds of thousands of customers. While all of them share a common set of data, the Salesforce platform itself is extremely customizable. In fact, 80% of the data inside the Salesforce platform sits in what we refer to as custom objects, which one can think of as custom tables in a database.

… We don’t build models that are shared between customers. We always use a single customer’s data. We have hundreds of thousands of models potentially that we need to build, and because of that, we needed to automate the entire process. We just cannot throw people at the problem. We basically created TransmogrifAI to automate the entire end-to-end process for creating a model for a user and we decided to open source it a couple months ago.
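The pattern described here, one model per customer with no human choosing the model, can be sketched in plain Python. This is not TransmogrifAI's API (the library is built on Apache Spark). The two toy candidate models, the holdout split, and every name below are assumptions made purely for illustration.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Candidate 1: always predict the average target seen in training."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate 2: least-squares straight line through the training data."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return lambda x: my + slope * (x - mx)

CANDIDATES = {"mean": fit_mean, "line": fit_line}

def auto_fit(xs, ys, holdout=0.25):
    """Train every candidate and keep the one with the lowest holdout error."""
    split = max(1, int(len(xs) * (1 - holdout)))
    train_x, train_y = xs[:split], ys[:split]
    test_x, test_y = xs[split:], ys[split:]
    best = None
    for name, fit in CANDIDATES.items():
        model = fit(train_x, train_y)
        error = mean((model(x) - y) ** 2 for x, y in zip(test_x, test_y))
        if best is None or error < best[1]:
            best = (name, error, model)
    return best[0], best[2]

def build_models(customers):
    """One model per customer, each trained only on that customer's data."""
    return {cid: auto_fit(xs, ys) for cid, (xs, ys) in customers.items()}

# Hypothetical customers with differently shaped data.
customers = {
    "acme":   ([1, 2, 3, 4, 5, 6, 7, 8], [2, 4, 6, 8, 10, 12, 14, 16]),
    "globex": ([1, 2, 3, 4, 5, 6, 7, 8], [5, 5, 5, 5, 5, 5, 5, 5]),
}
models = build_models(customers)
for cid, (chosen, model) in models.items():
    print(cid, chosen, model(10))
```

The selection loop is identical for every customer, so adding a customer adds no human work. Only the data differs, and the data decides which candidate wins.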

Nov 21 2018

31mins

Rank #18: How Ray makes continuous learning accessible and easy to scale

In this episode of the Data Show, I spoke with Robert Nishihara and Philipp Moritz, graduate students at UC Berkeley and members of RISE Lab. I wanted to get an update on Ray, an open source distributed execution framework that makes it easy for machine learning engineers and data scientists to scale reinforcement learning and other related continuous learning algorithms. Many AI applications involve an agent (for example a robot or a self-driving car) interacting with an environment. In such a scenario, an agent will need to continuously learn the right course of action to take for a specific state of the environment.

What do you need in order to build large-scale continuous learning applications? You need a framework with low-latency response times, one that is able to run massive numbers of simulations quickly (agents need to be able to explore states within an environment), and that supports heterogeneous computation graphs. Ray is a new execution framework written in C++ that contains these key ingredients. In addition, Ray is accessible via Python (and Jupyter Notebooks), and comes with many of the standard reinforcement learning and related continuous learning algorithms that users can easily call.

As Nishihara and Moritz point out, frameworks like Ray are also useful for common applications such as dialog systems, text mining, and machine translation. Here are some highlights from our conversation:

Tools for reinforcement learning

Ray is something we’ve been building that’s motivated by our own research in machine learning and reinforcement learning. If you look at what researchers who are interested in reinforcement learning are doing, they’re largely ignoring the existing systems out there and building their own custom frameworks or custom systems for every new application that they work on.

… For reinforcement learning, you need to be able to share data very efficiently, without copying it between multiple processes on the same machine, you need to be able to avoid expensive serialization and deserialization, and you need to be able to create a task and get the result back in milliseconds instead of hundreds of milliseconds. So, there are a lot of little details that come up.

… In fact, people often use MPI along with lower-level multi-processing libraries to build the communication infrastructure for their reinforcement learning applications.

Scaling machine learning in dynamic environments

I think right now when we think of machine learning, we often think of supervised learning. But a lot of machine learning applications are changing from making just one prediction to making sequences of decisions and taking sequences of actions in dynamic environments.

The thing that’s special about reinforcement learning is it’s not just the different algorithms that are being used, but rather the different problem domain that it’s being applied to: interactive, dynamic, real-time settings bring up a lot of new challenges.

… The set of algorithms actually goes even a little bit further. Some of these techniques are even useful in, for example, things like text summarization and translation. You can use these techniques that have been developed in the context of reinforcement learning to better tackle some of these more classical problems [where you have some objective function that may not be easily differentiable].

… Some of the classic applications that we have in mind when we think about reinforcement learning are things like dialogue systems, where the agent is one participant in the conversation. Or robotic control, where the agent is the robot itself and it’s trying to learn how to control its motion.

… For example, we implemented the evolution algorithm described in a recent OpenAI paper in Ray. It was very easy to port to Ray, and writing it only took a couple of hours. Then we had a distributed implementation that scaled very well and we ran it on up to 15 nodes.
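To see why this kind of algorithm distributes so easily, here is a minimal evolution-strategies-style loop. It is neither the code from the OpenAI paper nor Ray code: a standard-library thread pool stands in for the distributed workers, and the toy objective, the hyperparameters, and all function names are assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fitness(params):
    """Toy objective (an assumption): higher is better, best when every value is 3.0."""
    return -sum((p - 3.0) ** 2 for p in params)

def evaluate(task):
    """One independent task: perturb the parameters and score the result."""
    params, sigma, seed = task
    rng = random.Random(seed)            # private RNG, so tasks share no state
    noise = [rng.gauss(0.0, 1.0) for _ in params]
    candidate = [p + sigma * n for p, n in zip(params, noise)]
    return noise, fitness(candidate)

def evolve(dim=5, iterations=200, population=20, sigma=0.1, lr=0.05, workers=4):
    params = [0.0] * dim
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for step in range(iterations):
            tasks = [(params, sigma, step * population + i) for i in range(population)]
            # Fan out: every perturbation is scored independently of the others.
            results = list(pool.map(evaluate, tasks))
            mean_score = sum(score for _, score in results) / population
            # Fan in: move toward the perturbations that scored above average.
            for d in range(dim):
                grad = sum((score - mean_score) * noise[d] for noise, score in results)
                params[d] += lr * grad / (population * sigma)
    return params

best = evolve()
print([round(p, 2) for p in best], round(fitness(best), 4))
```

Each call to `evaluate` needs only the current parameters and a seed, and it returns a small result. That is the shape of workload the speakers describe: many short tasks whose results have to come back quickly. In a distributed setting, the `pool.map` call is the part a framework like Ray would replace.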

Aug 17 2017

18mins

Rank #19: Applications of data science and machine learning in financial services

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China.

We had a great conversation spanning many topics, including:

  • Potential applications of data science in financial services.
  • The current state of data science in financial services in both the U.S. and China.
  • His experience recruiting, training, and managing data science teams in both the U.S. and China.

Here are some highlights from our conversation:

Opportunities in financial services

There’s a customer acquisition piece and then there’s a customer retention piece. For customer acquisition, we can see that new technologies can really add value by looking at all sorts of data sources that can help a financial service company identify who they want to target to provide those services. So, it’s a great place where data science can help find the product market fit, not just at one instance like identifying who you want to target, but also in a continuous form where you can evolve a product and then continuously find the audience that would best fit the product and continue to analyze the audience so you can design the next generation product. … Once you have a specific cohort of users who you want to target, there’s a need to be able to precisely convert them, which means understanding the stage of the customer’s thought process and understanding how to form the narrative to convince the user or the customer that a particular piece of technology or particular piece of service is the current service they need.

… On the customer serving or retention side, for financial services we commonly talk about building hundred-year businesses, right? They have to be profitable businesses, and for a financial service to be profitable, there are operational considerations: quantifying risk requires a lot of data science; preventing fraud is really important; and there is garnering the long-term trust of the customer so they stay with you, which means having the work ethic to take care of customers’ data and being able to serve the customer better with automated services whenever and wherever the customer is. It’s all those opportunities where I see we can help serve the customer by having the right services presented to them and being able to serve them in the long term.

Opportunities in China

A few important areas in the financial space in China include mobile payments, wealth management, lending, and insurance—basically, the major areas for the financial industry.

For these areas, China may be a forerunner in using internet technologies, especially mobile internet technologies for FinTech, and I think the wave started way back in the 2012/2013 time frame. If you look at mobile payments, like Alipay and WeChat, those have hundreds of millions of active users. The latest data from Alipay is about 608 million users, and these are monthly active users we’re talking about. This is about two times the U.S. population actively using Alipay on a monthly basis, which is a crazy number if you consider all the data that can generate and all the things you can see people buying to be able to understand how to serve the users better.

If you look at WeChat, they’re boasting one billion users, monthly active users, early this year. Those are the huge players, and with that amount of traffic, they are able to generate a lot of interest for the lower-frequency services like wealth management and lending, as well as insurance.

May 23 2019

42mins

Rank #20: The state of machine learning in Apache Spark

In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio.

We began our conversation by discussing recent academic research that would be of interest to the Apache Spark community (Stoica leads the RISE Lab at UC Berkeley, Zaharia is part of Stanford’s DAWN Project). The bulk of our conversation centered around machine learning. Like many in the audience, I was first attracted to Spark because it simultaneously allowed me to scale machine learning algorithms to large data sets while providing reasonable latency.

Here is a partial list of the items we discussed:

  • The current state of machine learning in Spark.
  • Given that a lot of innovation has taken place outside the Spark community (e.g., scikit-learn, TensorFlow, XGBoost), we discussed the role of Spark ML moving forward.
  • The plan to make it easier to integrate advanced analytics libraries that aren’t “textbook machine learning,” like NLP, time series analysis, and graph analysis into Spark and Spark ML pipelines.
  • Some upcoming projects from Berkeley and Stanford that target AI applications (including newer systems that provide lower latency, higher throughput).
  • Recent Berkeley and Stanford projects that address two key bottlenecks in machine learning—lack of training data, and deploying and monitoring models in production.

[Full disclosure: I am an advisor to Databricks.]

Sep 14 2017

21mins
