O'Reilly Data Show - O'Reilly Media Podcast

Updated 12 days ago

Business
Technology
Tech News

Big data and data science interviews, insight, and analysis.
iTunes Ratings

46 Ratings
5 stars: 26
4 stars: 8
3 stars: 6
2 stars: 5
1 star: 1
Dropping Knowledge Bombs

By Virtually Natalie - May 21 2019
Ben and his wide variety of knowledgeable guests are truly rockstars! They drop quality (and free!) knowledge bombs in each and every episode. The great advice they provide, combined with the relatable way in which they deliver it had me hooked from the very first listen. Thanks for putting out such a stellar show Ben - keep up the great work!

Great to hear from those in the front lines

By Daddictedy - Jan 17 2016
Great way to catch up on the history and evolution of DS.


Rank #1: Applications of data science and machine learning in financial services


The O’Reilly Data Show Podcast: Jike Chong on the many exciting opportunities for data professionals in the U.S. and China.

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience applying analytics and machine learning in financial services, and he has built data science teams in both the U.S. and China.

We had a great conversation spanning many topics, including:

  • Potential applications of data science in financial services.

  • The current state of data science in financial services in both the U.S. and China.

  • His experience recruiting, training, and managing data science teams in both the U.S. and China.

Here are some highlights from our conversation:

Opportunities in financial services

There's a customer acquisition piece and then there's a customer retention piece. For customer acquisition, new technologies can really add value by looking at all sorts of data sources that help a financial services company identify who they want to target. It's a great place where data science can help find product-market fit, not just at one instant, like identifying who you want to target, but in a continuous form where you evolve the product, continuously find the audience that best fits it, and keep analyzing that audience so you can design the next-generation product. ... Once you have a specific cohort of users you want to target, you need to be able to convert them precisely, which means understanding the stage of the customer's thought process and how to form the narrative that convinces the customer that a particular piece of technology or service is the one they need.

... On the customer retention side, financial services companies commonly talk about building hundred-year businesses, right? They have to be profitable, and for a financial service to be profitable there are operational considerations: quantifying risk requires a lot of data science, and preventing fraud is really important. There is also earning the customer's long-term trust so they stay with you, which means taking good care of the customer's data and serving the customer better with automated services whenever and wherever they are. Those are all opportunities where I see we can help: presenting the right services to customers and serving them over the long term.

Opportunities in China

A few important areas in the financial space in China include mobile payments, wealth management, lending, and insurance—basically, the major areas for the financial industry.

For these areas, China may be a forerunner in using internet technologies, especially mobile internet technologies, for FinTech, and I think the wave started back in the 2012/2013 time frame. If you look at mobile payments, like Alipay and WeChat, those have hundreds of millions of active users. The latest figure from Alipay is about 608 million monthly active users. That's nearly two times the U.S. population actively using Alipay every month, which is a staggering number if you consider all the data it generates and everything you can see people buying in order to understand how to serve users better.

If you look at WeChat, they're boasting one billion users, monthly active users, early this year. Those are the huge players, and with that amount of traffic, they are able to generate a lot of interest for the lower-frequency services like wealth management and lending, as well as insurance.


May 23 2019
42 mins

Rank #2: A framework for building and evaluating data products


The O’Reilly Data Show Podcast: Pinterest data scientist Grace Huang on lessons learned in the course of machine learning product launches.

In this episode of the Data Show, I spoke with Grace Huang, data science lead at Pinterest. With its combination of a large social graph, enthusiastic users, and multimedia data, I’ve long regarded Pinterest as a fascinating lab for data science. Huang described the challenge of building a sustainable content ecosystem and shared lessons from the front lines of machine learning product launches. We also discussed recommenders, the emergence of deep learning as a technique used within Pinterest, and the role of data science within the company.

Here are some highlights from our conversation:

Using machine learning to strengthen content ecosystems

Pinterest content is a giant, complicated corpus with very rich metadata associated with it. If you build a recommendation system with a lot of bias in it, over time you can end up showing just a particular corner of that corpus to the world, because you think your users might find that corner of content particularly engaging. This is an issue when you base your algorithms only on your existing users.

When Pinterest first started out, we had a very strong user base around particular user demographics. That part of the content corpus becomes very well curated, which makes those content pieces rank really high in our machine learning products. Then we had to start consciously thinking about how to combat that problem because otherwise, over time, you're just going to build a product that only appeals to that segment of users.

From the user perspective, you want to make sure you're creating a corpus that covers enough in terms of topics and interests, in terms of different languages people speak, in terms of different cultural backgrounds. Then, I think on the content side, we have the same problem where fresher, newer content may have trouble competing with older content that's been around for a long time and has really good historical performance.

Maintaining this healthy ecosystem involves creating mechanisms to jump-start new content so we can show it enough times to quickly learn whether or not it's high quality, and whether it might be relevant for certain segments of users. We then want to use that information very efficiently to drive our downstream products.
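A jump-start mechanism like the one Huang describes is often implemented as an explore/exploit policy. Below is a minimal epsilon-greedy sketch in Python: with small probability, serve the least-shown fresh item instead of the historically best one, so new content accumulates impressions quickly. All names and fields here are illustrative assumptions, not Pinterest's actual system.

```python
import random

def rank_with_exploration(seasoned, fresh, epsilon=0.1, rng=random):
    """Return one item to show: usually the best-scoring seasoned item,
    but with probability `epsilon` a fresh item that still needs impressions."""
    if fresh and rng.random() < epsilon:
        # Jump-start: show the least-seen new item so it can collect
        # enough impressions to estimate its quality quickly.
        return min(fresh, key=lambda item: item["impressions"])
    # Exploit: serve the historically best-performing content.
    return max(seasoned, key=lambda item: item["score"])

catalog = [{"id": "old1", "score": 0.9, "impressions": 10_000},
           {"id": "old2", "score": 0.5, "impressions": 8_000}]
new_items = [{"id": "new1", "score": 0.0, "impressions": 3}]
print(rank_with_exploration(catalog, new_items, epsilon=0.0)["id"])  # old1
```

In a real system the exploration budget would itself be learned (e.g., with a bandit algorithm), but the trade-off is the same: give up a little short-term engagement to learn about new content faster.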

Building data products: Three anti-patterns

The first one is: do not build a model only for the users you have today. You have to think about your users tomorrow as well. Second, it's really easy to build a system where the rich get richer. There are a lot of techniques out there to prevent that from happening; it's often not by design. It's very subtle, and it takes a long time for this rich-get-richer effect to build up and be observed, so you have to be very vigilant about it. ... The third anti-pattern is that you might find yourself optimizing not quite the right thing. You get exactly what you wish for with a machine learning system: it's very good at optimizing the goal you specify. But that goal may not correlate with your ultimate goal. Keeping your ultimate goal in mind, and evaluating your products against the ultimate goal instead of an intermediate goal, is really important. For example, short-term metrics are easier to optimize toward, but they may or may not correlate with a long-term goal like retention.


Jul 06 2017
22 mins

Rank #3: Trends in data, machine learning, and AI


The O’Reilly Data Show Podcast: Ben Lorica looks ahead at what we can expect in 2019 in the big data landscape.

For the end-of-year holiday episode of the Data Show, I turned the tables on Data Show host Ben Lorica to talk about trends in big data, machine learning, and AI, and what to look for in 2019. Lorica also showcased some highlights from our upcoming Strata Data and Artificial Intelligence conferences.

Here are some highlights from our conversation:

Real-world use cases for new technology

If you're someone who wants to use data, data infrastructure, data science, machine learning, and AI, we're really at the point where there are a lot of tools for implementers and developers. They're not necessarily doing research and development; they just want to build better products and automate workflow. I think that's the most significant development in my mind.

And then I think use case sharing also has an impact. For example, at our conferences, people are sharing how they're using AI and ML in their businesses, so the use cases are getting better defined, particularly for technologies that are relatively new to the broader data community, like deep learning. There are now use cases that touch the types of problems people normally tackle: things that involve structured data, such as time-series forecasting or recommenders.

With that said, while we are in an implementation phase, I think people who follow this space will attest that there are still a lot of interesting things coming out of the R&D world, so there is still a lot of great innovation ahead, and a lot more growth in how sophisticated and how easy to use these technologies will become.

Addressing ML and AI bottlenecks

We have a couple of surveys that we'll release early in 2019. In one of these surveys, we asked people what the main bottleneck is in terms of adopting machine learning and AI technologies.

Interestingly enough, the main bottleneck was cultural issues—people are still facing challenges in terms of convincing people within their companies to adopt these technologies. And then, of course, the next two are the ones we're familiar with: lack of data and lack of skilled people. And then the fourth bottleneck people cited was trouble identifying business use cases.

What's interesting about that is, if you then ask people how mature their practice is and you look at the people with the most mature AI and machine learning practices, they still cite a lack of data as the main bottleneck. What that tells me is that there's still a lot of opportunity for people to apply these technologies within their companies, but there's a lot of foundational work people have to do in terms of just getting data in place, getting data collected and ready for analytics.

Focus on foundational technologies

At the Strata Data conferences in San Francisco, London, and New York, the emphasis will be on building technologies, and bringing in technologies and cultural practices, that will allow you to sustain analytics and machine learning in your organization. That means having all of the foundational technologies in place: data ingestion, data governance, ETL, data lineage, data science platforms, metadata stores, and things like that, the various pieces of technology that will be important as you scale the practice of machine learning and AI in your company.

At the Artificial Intelligence conferences, we remain focused on being the de facto gathering place for people interested in applied artificial intelligence. We will focus on servicing the most important use cases in many, many domains. That means showcasing, of course, the latest research in deep learning and other branches of machine learning, but also helping people grapple with some of the other important considerations, like privacy and security, fairness, reliability, and safety.

...At both the Strata Data and Artificial Intelligence conferences, we will focus on helping people understand the capabilities of the technology, the strengths and limitations; that's why we run executive briefings at all of these events. We showcase case studies that are aimed at the non-technical and business user as well—so, we'll have two types of case studies, one more technical and one not so technical so the business decision-makers can benefit from seeing how their peers are using and succeeding with some of these technologies.

Dec 20 2018
28 mins

Rank #4: A scalable time-series database that supports SQL


The O’Reilly Data Show Podcast: Michael Freedman on TimescaleDB and scaling SQL for time-series.

In this episode of the Data Show, I spoke with Michael Freedman, CTO of Timescale and professor of computer science at Princeton University. When I first heard that Freedman and his collaborators were building a time-series database, my immediate reaction was: “Don’t we have enough options already?” The early incarnation of Timescale was a startup focused on IoT, and it was while building tools for the IoT problem space that Freedman and the rest of the Timescale team came to realize that the database they needed wasn’t available (at least not in open source). Specifically, they wanted a database that could easily support complex queries and the sort of real-time applications many have come to associate with streaming platforms. Based on early reactions to TimescaleDB, many users concur.

Here are some highlights from our conversation:

The need for a time-series database

We initially were developing a platform to collect, store, and analyze IoT data, and certainly a lot of IoT data is time-series in nature. We found ourselves struggling. The reason a lot of people adopted NoSQL was that they thought it offered scale in ways that more traditional relational databases did not; yet, they often gave up the rich query language, optimized complex queries, joins, and ecosystem that you get with more traditional relational databases. Customers who were using our platform kept wanting all these ways to query the data, and we couldn't do it with the existing NoSQL database we were using. It just didn't support those types of queries.

We ended up building one, in fact, on top of Postgres. In architecting Postgres in a very particular way for time-series workloads, we came to realize that this is not a problem limited to us. We think there is still an important gap in the market: people either use a vanilla relational database that does have scaling problems, or they go to something like NoSQL. A lot of time-series tooling grew out of one particular use case, things like server metrics, but people's needs are much broader than server metrics, so we thought there was an important area missing from what people had before.

... The interesting thing about a time-series database is sometimes that data starts in one part of your organization, and then different parts of your organization quickly find a use for that data.

... In many cases, the people who are asking questions actually know SQL already; some of them may not but are using existing tools that support SQL. So, if you have a database that doesn’t support SQL, then those existing tools often can't directly work with it. You would have to integrate them. You'd have to build special connectors. That was one of the things we wanted when we set out to build Timescale. We wanted to give the appearance that this looks like Postgres. It just looks like a traditional relational database. If you have any of those existing tools and business applications, you could just speak directly to it as if it's a traditional database. It just happens to be much more efficient and much more scalable for time-series data.

Column-oriented and row-oriented databases

In the beginning, we weren't setting out to build our own time-series database. ... A lot of the time-series databases on the market now are column-oriented, because that allows you to do very fast aggregations on a single column. TimescaleDB also allows you to define a schema, and different metrics can each live in their own column.

There is a difference between what are known as column-oriented databases and traditional SQL databases, which are row-oriented. This comes down to how they store data on disk: are all of the values in a row stored contiguously? In a column-oriented database, even when a bunch of metrics belong to the same row, they're actually stored almost separately. It's as if every column becomes its own table.

For example, columns make it really easy and fairly efficient to scan a single column. If all you want to do is take the average of the CPU, that's efficient. But suppose you want to ask a question with a rich predicate (a predicate is the WHERE clause in SQL), like: "Tell me the average temperature of all devices where the CPU is above a certain threshold, or the free memory is below some limit." Internally, in a column-oriented database, each of those WHERE clauses hits a different column, almost a different table, that the database needs to scan and then do a JOIN on. So while column-oriented databases might be very efficient for rolling up a single column, anything richer becomes a lot more expensive. Some of these databases don't have indexes for these WHERE clauses, so any time you ask such a question, it actually takes a full table scan.
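The row-versus-column distinction Freedman describes can be made concrete in a few lines of Python: the same records laid out row-wise and column-wise, where a single-column aggregate walks one contiguous list, but a rich predicate has to line values back up across columns, the implicit JOIN between per-column "tables". This is a toy model of on-disk layout, not any particular database.

```python
# Three metrics rows: (device, cpu, temp)
rows = [(1, 95.0, 41.5), (2, 20.0, 39.0), (3, 35.0, 40.2)]

# Row-oriented layout (e.g., Postgres): each row's values live together.
row_store = list(rows)

# Column-oriented layout: every column stored separately,
# "almost its own table".
col_store = {
    "device": [r[0] for r in rows],
    "cpu":    [r[1] for r in rows],
    "temp":   [r[2] for r in rows],
}

# A single-column aggregate touches only one contiguous list ...
avg_cpu = sum(col_store["cpu"]) / len(col_store["cpu"])

# ... but a rich predicate must line values back up across columns,
# which is the implicit JOIN between per-column tables.
hot_temps = [t for c, t in zip(col_store["cpu"], col_store["temp"]) if c > 90]
print(avg_cpu, hot_temps)  # 50.0 [41.5]
```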

If only 1% of devices have a high CPU and you say, "Tell me all the statistics where the device has a high CPU," then in time-series databases that lack indexing on these columns, you end up scanning all of the data, not just that 1%. With something like TimescaleDB, or anything that can build these efficient secondary indexes, you can quickly home in on the important data, so the only thing you need to touch is that 1% of the data, not all of it.
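Freedman's indexing point can be demonstrated with any row-oriented SQL database. The sketch below uses SQLite from Python's standard library as a stand-in (the schema and index name are invented, not TimescaleDB's): with a secondary index on the predicate column, the planner searches only the matching ~1% of rows rather than scanning the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (device INTEGER, cpu REAL, temp REAL)")
# ~1% of devices report high CPU (every 100th row).
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [(i, 95.0 if i % 100 == 0 else 20.0, 40.0 + i % 5) for i in range(10_000)],
)
# Secondary index on the WHERE-clause column: the high-CPU query
# no longer requires a full table scan.
conn.execute("CREATE INDEX idx_cpu ON metrics (cpu)")

query = "SELECT AVG(temp) FROM metrics WHERE cpu > 90"
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan[0][-1])  # e.g., 'SEARCH metrics USING INDEX idx_cpu (cpu>?)'
print(conn.execute(query).fetchone()[0])
```

The query plan's SEARCH-using-index step is exactly the "touch only the 1%" behavior described above; dropping the index turns it into a SCAN of all 10,000 rows.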


Jun 22 2017
49 mins

Rank #5: Simplifying machine learning lifecycle management


The O’Reilly Data Show Podcast: Harish Doddi on accelerating the path from prototype to production.

In this episode of the Data Show, I spoke with Harish Doddi, co-founder and CEO of Datatron, a startup focused on helping companies deploy and manage machine learning models. As companies move from machine learning prototypes to products and services, tools and best practices for productionizing and managing models are just starting to emerge. Today’s data science and data engineering teams work with a variety of machine learning libraries, data ingestion, and data storage technologies. Risk and compliance considerations mean that the ability to reproduce machine learning workflows is essential to meet audits in certain application domains. And as data science and data engineering teams continue to expand, tools need to enable and facilitate collaboration.

Doddi specializes in helping teams turn machine learning prototypes into production-ready services, so I wanted to hear what he has learned while working with organizations that aspire to “become machine learning companies.”

Here are some highlights from our conversation:

A central platform for building, deploying, and managing machine learning models

In one of the companies where I worked, we had built infrastructure related to Spark. We were a heavy Spark shop. So we built everything around Spark and other components. But later, when that organization grew, a lot of people came from a TensorFlow background. That suddenly created a little bit of frustration in the team because everybody wanted to move to TensorFlow. But we had invested a lot of time, effort and energy in building the infrastructure for Spark.

... We suddenly had hidden technical debt that needed to be addressed. ... Let's say right now you have two models running in production and you know that in the next two or three years you are going to deploy 20 to 30 models. You need to start thinking about this ahead of time.

... That's why, these days, I observe that organizations are creating centralized teams. The centralized team is responsible for maintaining flexible machine learning infrastructure that can be used to deploy, operate, and monitor many models simultaneously.

Feature store: Create, manage, and share canonical features

When I talk to companies these days, everybody knows that their data scientists are duplicating work because they don't have a centralized feature store. Everybody I talk to really wants to build or even buy a feature store, depending on what is easiest for them.

... The number of data scientists within most companies is increasing. And one of the pain points I've observed is that when a new data scientist joins an organization, there is an extremely long ramp-up period. A new data scientist needs to figure out what the data sets are, what the features are, and so on. But if an organization has a feature store, the ramp-up period can be much shorter.
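As a rough sketch of what a minimal feature store provides, a shared registry of named, documented feature definitions that any data scientist can discover and reuse, here is a toy Python version. It is not Datatron's product; every class and feature name is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class FeatureStore:
    """Toy registry mapping feature names to documented transformations."""
    _features: Dict[str, Callable] = field(default_factory=dict)
    _docs: Dict[str, str] = field(default_factory=dict)

    def register(self, name: str, fn: Callable, doc: str) -> None:
        if name in self._features:
            # One canonical definition per feature: no duplicated work.
            raise ValueError(f"duplicate feature: {name}")
        self._features[name] = fn
        self._docs[name] = doc

    def compute(self, name: str, record: dict):
        """Apply the canonical transformation to a raw record."""
        return self._features[name](record)

    def catalog(self) -> Dict[str, str]:
        """What a new data scientist browses during ramp-up."""
        return dict(self._docs)

store = FeatureStore()
store.register("txn_amount_usd", lambda r: r["amount_cents"] / 100,
               "Transaction amount in dollars")
print(store.compute("txn_amount_usd", {"amount_cents": 2599}))  # 25.99
```

Real feature stores add versioning, offline/online consistency, and materialization, but the ramp-up benefit Doddi describes comes from exactly this: one documented catalog instead of each data scientist rediscovering the features.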


Aug 16 2018
37 mins

Rank #6: Building accessible tools for large-scale computation and machine learning


The O’Reilly Data Show Podcast: Eric Jonas on Pywren, scientific computation, and machine learning.

In this episode of the Data Show, I spoke with Eric Jonas, a postdoc in the new Berkeley Center for Computational Imaging. Jonas is also affiliated with UC Berkeley’s RISE Lab. It was at a RISE Lab event that he first announced Pywren, a framework that lets data enthusiasts proficient with Python run existing code at massive scale on Amazon Web Services. Jonas and his collaborators are working on a related project, NumPyWren, a system for linear algebra built on a serverless architecture. Their hope is that by lowering the barrier to large-scale (scientific) computation, we will see many more experiments and research projects from communities that have been unable to easily marshal massive compute resources. We talked about Bayesian machine learning, scientific computation, reinforcement learning, and his stint as an entrepreneur in the enterprise software space.

Here are some highlights from our conversation:

Pywren

The real enabling technology for us was Amazon's announcement of AWS Lambda, their microservices framework, in 2014. Prompted by this, I went home one weekend and thought, 'I wonder how hard it is to take an arbitrary Python function and marshal it across the wire, get it running in Lambda; I wonder how many I can get at once?' Thus, Pywren was born.

... Right now, we're primarily focused on the entire scientific Python stack, so SciPy, NumPy, Pandas, Matplotlib, the whole ecosystem there. ... One of the challenges with all of these frameworks and running these things on Lambda is that, right now, Lambda is a fairly constrained resource environment. Amazon will quite happily give you 3,000 cores in the next two seconds, but each one has a maximum runtime and a small amount of memory and a small amount of local disk. Part of the current active research thrust for Pywren is figuring out how to do more general-purpose computation within those resource limits. But right now, we mostly support everything you would encounter in your normal Python workflow—including Jupyter, NumPy, and scikit-learn.
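The core trick Jonas describes, serializing an arbitrary Python function, shipping the bytes to a stateless worker, and fanning out to many such workers, can be sketched with the standard library. Pywren itself uses cloudpickle and AWS Lambda; the stand-in below fakes the "wire" with pickle bytes and a thread pool, so the function and helper names here are illustrative, not Pywren's API.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

def analyze(x):
    """Any ordinary Python function the user already has."""
    return x * x

def remote_invoke(payload: bytes):
    """What a Lambda-style worker does: deserialize the call, run it, return."""
    fn, arg = pickle.loads(payload)
    return fn(arg)

def pywren_style_map(fn, args, workers=8):
    """Fan a plain function out over many short-lived, stateless workers."""
    # Marshal each (function, argument) pair "across the wire" as bytes ...
    payloads = [pickle.dumps((fn, a)) for a in args]
    # ... then invoke one worker per payload, Lambda-style.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(remote_invoke, payloads))

print(pywren_style_map(analyze, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that the standard pickle module only serializes top-level functions by reference; handling closures and lambdas is why Pywren relies on cloudpickle, and the memory/runtime/disk limits Jonas mentions are what the real system must work around on each worker.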

Numpywren

Chris Ré has this nice quote: 'Why is it easier to train a bidirectional LSTM with attention than it is to just compute the SVD of a giant matrix?' One of these things is actually fantastically more complicated than the other, but right now, our linear algebra tools are just such an impediment to doing that sort of large-scale computation. We hope NumPyWren will enable this class of work for the machine learning community.

The growing importance of reinforcement learning

Ben Recht makes the argument that the most interesting problems in machine learning right now involve taking action based upon your intelligence. I think he's right about this—taking action based upon past data and doing it in a way that is safe and robust and reliable and all of these sorts of things. That is very much the domain that has traditionally been occupied by fields like control theory and reinforcement learning.

Reinforcement learning and Ray

Ray is an excellent platform for building large-scale distributed systems, and it's much more Python-native than Spark was. Ray also has much more of a focus on real-time performance. A lot of the things that people are interested in with Ray revolve around doing things like large-scale reinforcement learning—and it just so happens that deep reinforcement learning is something that everyone's really excited about.


Aug 30 2018
53 mins

Rank #7: How big data and AI will reshape the automotive industry


The O’Reilly Data Show Podcast: Evangelos Simoudis on next-generation mobility services.

In this episode of the Data Show, I spoke with Evangelos Simoudis, co-founder of Synapse Partners and a frequent contributor to O’Reilly. He recently published a book entitled The Big Data Opportunity in Our Driverless Future, and I wanted to get his thoughts on the transportation industry and the role of big data and analytics in its future. Simoudis is an entrepreneur, and he also advises and invests in many technology startups. He became interested in the automotive industry long before the current wave of autonomous vehicle startups was in the planning stages.

Here are some highlights from our conversation:

Understanding the automotive industry

The more time I spent with the automotive industry, the more I came to realize that, because of autonomous vehicle technology and because of the various forms of mobility services stemming from new business models, the incumbent automotive industry is at significant risk of being disrupted.

If you look at the automotive industry, the first thing that is very striking is that a small number of very large companies control a number of different brands. With GM, we talk about Chevy, we talk about Buick, we talk about Opel in Europe. A very small number of companies control this trillion-dollar industry.

The other interesting thing is that these companies are responsible for designing the vehicle, manufacturing it, assembling it, and post-manufacturing, and then creating demand, whereas the sale of the vehicle is done through dealers. And they pay relatively little attention to what happens post-sale, which means there is relatively little understanding of consumer behavior.

The third observation is that the reason there are so few of these companies is that starting one is very capital intensive. If you look at how much money a company like Tesla has been able to raise, you get a sense of what kind of capital is necessary. And the next point is that even though a lot of capital is being raised, in the end this is a relatively low-margin business, where you try to make it up in volume. That's why, if you look at all these corporations, they have extremely sophisticated supply chains and extremely sophisticated, highly optimized manufacturing lines, because they are working to maintain those margins.

Infrastructure for autonomous vehicles

A vehicle needs to know very much what's happening around it. So that means it needs to receive signals from roads, bridges, other vehicles. ... The term people use is V2X or vehicle-to-everything communication.

It will take a very long time to have the preponderance of vehicles being autonomous. So, we need infrastructure that will enable cars to safely operate in a hybrid world between autonomous vehicles and manually operated vehicles. I think the experiments that today involve just a few tens of cars will expand over the next few years. And I think the result of those experiments will give us an understanding and appreciation of the investments that we need to make and how to prioritize them, as well as the regulations that we will need to institute in order to have this type of hybrid environment operate safely.

AI and big data

The argument that I'm making, and this actually comes from my education in AI and my work on AI since the mid-'80s, is that while machine learning is important, everybody needs to appreciate that it's not only about machine learning. To bring an autonomous vehicle to realization, you need more than machine learning. And, of course, within machine learning we have neural network learning, and particularly deep learning, and these are very important areas.

But people need to realize that an autonomous vehicle requires the ability to plan, requires the ability to reason, to represent knowledge, to search. All of these are components of AI. What I'm hoping to impart is that it's not only about machine learning and particularly not about deep learning. The popular press, I think, is leading everybody to believe that it's only about deep learning.

There is the autonomous driving technology and then the data cloud, where big data gets processed, stored, and analyzed. I think we will have multiple cloud providers. In fact, I'm betting on that through my investments in the space. I think that those cloud providers will be in the application layer. So, those cloud providers may be utilizing infrastructures from the likes of Microsoft or Amazon or other generic clouds.


Jul 20 2017
51 mins

Rank #8: The state of machine learning in Apache Spark


The O’Reilly Data Show Podcast: Ion Stoica and Matei Zaharia explore the rich ecosystem of analytic tools around Apache Spark.

In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio.

We began our conversation by discussing recent academic research that would be of interest to the Apache Spark community (Stoica leads the RISE Lab at UC Berkeley, Zaharia is part of Stanford’s DAWN Project). The bulk of our conversation centered around machine learning. Like many in the audience, I was first attracted to Spark because it simultaneously allowed me to scale machine learning algorithms to large data sets while providing reasonable latency.

Here is a partial list of the items we discussed:

  • The current state of machine learning in Spark.

  • Given that a lot of innovation has taken place outside the Spark community (e.g., scikit-learn, TensorFlow, XGBoost), we discussed the role of Spark ML moving forward.

  • Plans to make it easier to integrate advanced analytics libraries that aren't "textbook machine learning" (such as NLP, time series analysis, and graph analysis) into Spark and Spark ML pipelines.

  • Some upcoming projects from Berkeley and Stanford that target AI applications (including newer systems that provide lower latency, higher throughput).

  • Recent Berkeley and Stanford projects that address two key bottlenecks in machine learning—lack of training data, and deploying and monitoring models in production.

[Full disclosure: I am an advisor to Databricks.]
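Spark ML composes pipelines out of stages: transformers, which map data to data, and estimators, which are fit on data to produce a transformer. As a minimal sketch of that pattern in plain Python (this is not the actual Spark API; the class and function names here are hypothetical stand-ins), a tokenizer stage can feed a learned vectorizer stage:

```python
class Tokenizer:
    # Transformer: rows in, transformed rows out (no fitting needed).
    def transform(self, rows):
        return [r.lower().split() for r in rows]

class CountVectorizer:
    # Estimator: fit() learns a vocabulary and returns a transformer.
    def fit(self, token_rows):
        vocab = sorted({tok for row in token_rows for tok in row})
        index = {tok: i for i, tok in enumerate(vocab)}

        class Model:
            def transform(self, token_rows):
                vectors = []
                for row in token_rows:
                    v = [0] * len(index)
                    for tok in row:
                        v[index[tok]] += 1
                    vectors.append(v)
                return vectors

        return Model()

def run_pipeline(rows):
    # Chain the stages, as a Spark ML Pipeline would do declaratively.
    tokens = Tokenizer().transform(rows)
    model = CountVectorizer().fit(tokens)
    return model.transform(tokens)
```

The point of the pattern is that any library exposing fit/transform stages, including NLP or time-series tools, can slot into the same pipeline.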

Sep 14 2017
21 mins

Rank #9: Bringing scalable real-time analytics to the enterprise


The O’Reilly Data Show Podcast: Dhruba Borthakur and Shruti Bhat on enabling interactive analytics and data applications against live data.

In this episode of the Data Show, I spoke with Dhruba Borthakur (co-founder and CTO) and Shruti Bhat (SVP of Product) of Rockset, a startup focused on building solutions for interactive data science and live applications. Borthakur was the founding engineer of HDFS and creator of RocksDB, while Bhat is an experienced product and marketing executive focused on enterprise software and data products. Their new startup is focused on a few trends I’ve recently been thinking about, including the re-emergence of real-time analytics and the hunger for simpler data architectures and tools. Borthakur exemplifies the need for companies to continually evaluate new technologies: while he was the founding engineer for HDFS, these days he mostly works with object stores like S3.

We had a great conversation spanning many topics, including:

  • RocksDB, an open source, embeddable key-value store originated by Facebook, and which is used in several other open source projects.

  • Time-series databases.

  • The importance of having solutions for real-time analytics, particularly now with the renewed interest in IoT applications and rollout of 5G technologies.

  • Use cases for Rockset’s technologies—and more generally, applications of real-time analytics.

  • The Aggregator Leaf Tailer architecture as an alternative to the Lambda architecture.

  • Building data infrastructure in the cloud.

The Aggregator Leaf Tailer (“CQRS for the data world”): A data architecture favored by web-scale companies. Source: Dhruba Borthakur, used with permission.
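"Embeddable," as used above for RocksDB, means the store runs as a library inside your process rather than as a separate server. As a rough illustration of that model, the sketch below uses Python's standard `dbm` module purely as a stand-in (it is not RocksDB, and `demo_embedded_kv` is a hypothetical helper name):

```python
import dbm
import os
import tempfile

def demo_embedded_kv():
    # An embedded key-value store is opened as a local file and
    # accessed in-process with put/get/delete operations; RocksDB
    # follows the same model, with an LSM-tree engine underneath.
    path = os.path.join(tempfile.mkdtemp(), "kv")
    with dbm.open(path, "c") as db:
        db[b"user:1"] = b"alice"          # put
        db[b"user:2"] = b"bob"
        value = db[b"user:1"]             # get
        del db[b"user:2"]                 # delete
        return value, b"user:2" in db.keys()
```

Because there is no server round trip, reads and writes are function calls, which is part of why embedded stores suit latency-sensitive applications.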

Jun 06 2019
37 mins

Rank #10: What data scientists and data engineers can do with current generation serverless technologies


The O’Reilly Data Show Podcast: Avner Braverman on what’s missing from serverless today and what users should expect in the near future.

In this episode of the Data Show, I spoke with Avner Braverman, co-founder and CEO of Binaris, a startup that aims to bring serverless to web-scale and enterprise applications. This conversation took place shortly after the release of a seminal paper from UC Berkeley (“Cloud Programming Simplified: A Berkeley View on Serverless Computing”), and this paper seeded a lot of our conversation during this episode.

Serverless is clearly on the radar of data engineers and architects. In a recent survey, we found 85% of respondents already had parts of their data infrastructure in one of the public clouds, and 38% were already using at least one of the serverless offerings we listed. As more serverless offerings get rolled out—e.g., things like PyWren that target scientists—I expect these numbers to rise.

We had a great conversation spanning many topics, including:

  • A short history of cloud computing.

  • The fundamental differences between serverless and conventional cloud computing.

  • The reasons serverless—specifically AWS Lambda—took off so quickly.

  • What data scientists and data engineers can do with the current generation of serverless offerings.

  • What's missing from serverless today, and what users should expect in the near future.
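The core programming model behind offerings like AWS Lambda is a stateless handler invoked once per event, with the platform provisioning and scaling the runtime. As a minimal sketch (the event fields here are hypothetical; only the `handler(event, context)` shape mirrors the Lambda Python convention):

```python
import json

def handler(event, context):
    # A serverless function is stateless: it receives an event,
    # computes a response, and returns; any state must live in
    # external storage, not in the process.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

Locally you would exercise it by calling `handler({"name": "data team"}, None)`; in production the platform constructs the event from an HTTP request, queue message, or storage trigger.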

Apr 11 2019
36 mins

Rank #11: Using machine learning and analytics to attract and retain employees


The O’Reilly Data Show Podcast: Maryam Jahanshahi on building tools to help improve efficiency and fairness in how companies recruit.

In this episode of the Data Show, I spoke with Maryam Jahanshahi, research scientist at TapRecruit, a startup that uses machine learning and analytics to help companies recruit more effectively. In an upcoming survey, we found that a “skills gap” or “lack of skilled people” was one of the main bottlenecks holding back adoption of AI technologies. Many companies are exploring a variety of internal and external programs to train staff on new tools and processes. The other route is to hire new talent. But recent reports suggest that demand for data professionals is strong and competition for experienced talent is fierce. Jahanshahi and her team are building natural language and statistical tools that can help companies improve their ability to attract and retain talent across many key areas.

Here are some highlights from our conversation:

Optimal job titles

The conventional wisdom in our field has always been that you want to optimize for “the number of good candidates” divided by “the number of total candidates.” ... The thinking is that one of the ways in which you get a good signal-to-noise ratio is if you advertise for a more senior role. ... In fact, we found the number of qualified applicants was lower for the senior data scientist role.

... We saw from some of our behavioral experiments that people were feeling like that was too senior a role for them to apply to. What we would call the "confidence gap" was kicking in at that point. It's a pretty well-known phenomenon that there are different groups of the population that are less confident. This has been best characterized in terms of gender. It's the idea that most women only apply for jobs when they meet 100% of the qualifications, versus most men will apply even with 60% of the qualifications. That was actually manifesting.

Highlighting benefits

We saw a lot of big companies that would offer 401(k), that would offer health insurance or family leave, but wouldn't mention those benefits in the job descriptions. This had an impact on how candidates perceived these companies. Even though it's implied that Coca-Cola is probably going to give you 401(k) and health insurance, not mentioning it changes the way you think of that job.

... So, don't forget the things that really should be there. Even the boring stuff really matters for most candidates. You'd think it would only matter for older candidates, but, actually, millennials and everyone in every age group are very concerned about these things because it's not specifically about the 401(k) plan; it's about what it implies in terms of the company—that the company is going to take care of you, is going to give you leave, is going to provide a good workplace.

Improving diversity

We found the best way to deal with representation at the end of the process is actually to deal with representation early in the process. What I mean by that is having a robust or a healthy candidate pool at the start of the process. We found for data scientist roles, that was about having 100 candidates apply for your job.

... If we're not getting to the point where we can attract 100 applicants, we'll take a look at that job description. We'll see what's wrong with it and what could be turning off candidates; it could be that you're not syndicating the job description well, it's not getting into search results, or it could be that it's actually turning off a lot of people. You could be asking for too many qualifications, and that turns off a lot of people. ... Sometimes it involves taking a step back and taking a look at what we're doing in this process that's not helping us and that's starving us of candidates.

Jan 31 2019
46 mins

Rank #12: How Ray makes continuous learning accessible and easy to scale


The O’Reilly Data Show Podcast: Robert Nishihara and Philipp Moritz on a new framework for reinforcement learning and AI applications.

In this episode of the Data Show, I spoke with Robert Nishihara and Philipp Moritz, graduate students at UC Berkeley and members of RISE Lab. I wanted to get an update on Ray, an open source distributed execution framework that makes it easy for machine learning engineers and data scientists to scale reinforcement learning and other related continuous learning algorithms. Many AI applications involve an agent (for example, a robot or a self-driving car) interacting with an environment. In such a scenario, an agent will need to continuously learn the right course of action to take for a specific state of the environment.

What do you need in order to build large-scale continuous learning applications? You need a framework with low-latency response times, one that can run massive numbers of simulations quickly (agents need to be able to explore states within an environment), and one that supports heterogeneous computation graphs. Ray is a new execution framework written in C++ that contains these key ingredients. In addition, Ray is accessible via Python (and Jupyter Notebooks), and comes with many of the standard reinforcement learning and related continuous learning algorithms that users can easily call.
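Ray's Python interface is built around remote tasks that return futures. As a rough illustration of that futures-style model, not the Ray API itself, the sketch below uses the standard library's `concurrent.futures` to fan out simulated rollouts and gather the results (the `rollout` function and its toy reward are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def rollout(seed):
    # Stand-in for one simulation episode; a real RL rollout would
    # step an environment and accumulate a reward.
    state, total = seed, 0
    for _ in range(100):
        state = (1103515245 * state + 12345) % 2**31
        total += state % 10
    return total

def run_rollouts(num_agents=8):
    # Launch many rollouts in parallel and block on all results,
    # analogous in spirit to gathering a batch of remote task futures.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(rollout, range(num_agents)))
```

Ray's contribution is making this pattern work across a cluster with millisecond task latencies and shared in-memory data, which thread pools on one machine do not give you.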

As Nishihara and Moritz point out, frameworks like Ray are also useful for common applications such as dialog systems, text mining, and machine translation. Here are some highlights from our conversation:

Tools for reinforcement learning

Ray is something we've been building that's motivated by our own research in machine learning and reinforcement learning. If you look at what researchers who are interested in reinforcement learning are doing, they're largely ignoring the existing systems out there and building their own custom frameworks or custom systems for every new application that they work on.

... For reinforcement learning, you need to be able to share data very efficiently, without copying it between multiple processes on the same machine, you need to be able to avoid expensive serialization and deserialization, and you need to be able to create a task and get the result back in milliseconds instead of hundreds of milliseconds. So, there are a lot of little details that come up.

... In fact, people often use MPI along with lower-level multi-processing libraries to build the communication infrastructure for their reinforcement learning applications.

Scaling machine learning in dynamic environments

I think right now when we think of machine learning, we often think of supervised learning. But a lot of machine learning applications are changing from making just one prediction to making sequences of decisions and taking sequences of actions in dynamic environments.

The thing that's special about reinforcement learning is it's not just the different algorithms that are being used, but rather the different problem domain that it's being applied to: interactive, dynamic, real-time settings bring up a lot of new challenges.

... The set of algorithms actually goes even a little bit further. Some of these techniques are even useful in, for example, things like text summarization and translation. You can use these techniques that have been developed in the context of reinforcement learning to better tackle some of these more classical problems [where you have some objective function that may not be easily differentiable].

... Some of the classic applications that we have in mind when we think about reinforcement learning are things like dialogue systems, where the agent is one participant in the conversation. Or robotic control, where the agent is the robot itself and it's trying to learn how to control its motion.

... For example, we implemented the evolution algorithm described in a recent OpenAI paper in Ray. It was very easy to port to Ray, and writing it only took a couple of hours. Then we had a distributed implementation that scaled very well and we ran it on up to 15 nodes.
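The evolution strategies approach mentioned above optimizes a reward function without gradients: sample Gaussian perturbations of the parameters, evaluate each, and move the parameters toward perturbations that scored well. A minimal single-machine sketch of that idea (hyperparameters and function names are illustrative, not taken from the OpenAI paper or the Ray port):

```python
import random

def evolution_strategy(f, theta, sigma=0.1, lr=0.05, pop=50, iters=200):
    # Maximize f(theta) using only function evaluations.
    n = len(theta)
    for _ in range(iters):
        noises, rewards = [], []
        for _ in range(pop):
            eps = [random.gauss(0, 1) for _ in range(n)]
            candidate = [t + sigma * e for t, e in zip(theta, eps)]
            noises.append(eps)
            rewards.append(f(candidate))
        mean_r = sum(rewards) / pop
        # Gradient estimate: reward-weighted average of the noise.
        for i in range(n):
            g = sum((r - mean_r) * eps[i]
                    for r, eps in zip(rewards, noises)) / (pop * sigma)
            theta[i] += lr * g
    return theta
```

Each population member's evaluation is independent, which is why the algorithm distributes so naturally: on a cluster, each `f(candidate)` call becomes a remote task.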

Aug 17 2017
18 mins

Rank #13: The real value of data requires a holistic view of the end-to-end data pipeline


The O’Reilly Data Show Podcast: Ashok Srivastava on the emergence of machine learning and AI for enterprise applications.

In this episode of the Data Show, I spoke with Ashok Srivastava, senior vice president and chief data officer at Intuit. He has a strong science and engineering background, combined with years of applying machine learning and data science in industry. Prior to joining Intuit, he led the teams responsible for data and artificial intelligence products at Verizon. I wanted his perspective on a range of issues, including the role of the chief data officer, ethics in machine learning, and the emergence of AI technologies for enterprise products and applications.

Here are some highlights from our conversation:

Chief data officer

A chief data officer, in my opinion, is a person who thinks about the end-to-end process of obtaining data, data governance, and transforming that data for a useful purpose. His or her purview is relatively large. I view my purview at Intuit to be exactly that, thinking about the entire data pipeline, proper stewardship, proper governance principles, and proper application of data. I think that as the public learns more about the opportunities that can come from data, there's a lot of excitement about the potential value that can be unlocked from it from the consumer standpoint, and also many businesses and scientific organizations are excited about the same thing. I think the CDO plays a role as a catalyst in making those things happen with the right principles applied.

I would say if you look back into history a little bit, you'll find the need for the chief data officer started to come into play when people saw a huge amount of data coming in at high speeds with high variety and variability—but then also the opportunity to marry that data with real algorithms that can have a transformational property to them. While it's true that CIOs, CTOs, and people who are in lines of business can and should think about this, it's a complex enough process that I think it merits having a person and an organization think about that end-to-end pipeline.

Ethics

We're actually right now in the process of launching a unified training program in data science that includes ethics as well as many other technical topics. I should say that I joined Intuit only about six months ago. They already had training programs happening worldwide in the area of data science and acquainting people with the principles necessary to use data properly as well as the technical aspects of doing it.

I really feel ethics is a critical area for those of us who work in the field to think about and to be advocates of proper use of data, proper use of privacy information and security, in order to make sure the data that we're stewards of is used in the best possible way for the end consumer.

Describing AI

You can think about two overlapping circles. One circle is really an AI circle. The other is a machine learning circle. Many people think that that intersection is the totality of it, but in fact, it isn't.

... I'm finding that AI needs to be bounded a little bit. I often say that it's a reasonable technology with unreasonable expectations associated with it. I really feel this way, that people for whatever reason have decided that deep learning is going to solve many problems. And there's a lot of evidence to support that, but frankly, there's a lot of evidence also to support the fact that much more work has to be done before these things become “general purpose AI solutions.” That's where a lot of exciting innovation is going to happen in the coming years.

Jun 07 2018
31 mins

Rank #14: Managing risk in machine learning models


The O’Reilly Data Show Podcast: Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain.

In this episode of the Data Show, I spoke with Andrew Burt, chief privacy officer at Immuta, and Steven Touw, co-founder and CTO of Immuta. Burt recently co-authored a white paper on managing risk in machine learning models, and I wanted to sit down with them to discuss some of the proposals they put forward to organizations that are deploying machine learning.

Some high-profile examples of models gone awry have raised awareness among companies of the need for better risk management tools and processes. There is now a growing interest in ethics among data scientists, specifically in tools for monitoring bias in machine learning models. In a previous post, I listed some of the key considerations organizations should keep in mind as they move models to production, but the report co-authored by Burt goes far beyond that and recommends lines of defense, including a description of key roles that are needed.

Here are some highlights from our conversation:

Privacy and compliance meet data science

Andrew Burt: I would say the big takeaway from our paper is that lawyers and compliance and privacy folks live in one world and data scientists live in another with competing objectives. And that can no longer be the case. They need to talk to each other. They need to have a shared process and some shared terminology so that everybody can communicate.

One of the recommendations that we make, taken from some of the model risk management frameworks, is to create what we call lines of defense: these are basically different lines of reviewers who conduct periodic reviews from the creation testing phase to validation to an auditing phase. And the members of those lines of review need to be made up of teams with multiple expertise. So, there needs to be data owners, the people responsible for the data being piped into the models; there needs to be compliance personnel who are thinking about legal and ethical obligations; there needs to be data scientists; and there needs to be subject domain experts.

... We also dive into how you should be thinking about de-risking and monitoring your input data. How you should be thinking about monitoring and de-risking your output data and using output data for models. … And then, I think really importantly, is this idea of thinking about what it means for a model to fail, and having a concrete plan for what that means, how to correct it if it fails, and how to pull it from production if you need to.

Explainability and GDPR

Steven Touw: I gave a talk, "How Does the GDPR Impact Machine Learning?", at Strata Data London. A lot of people are concerned about language in GDPR, which states that you must be able to explain how the model came to that conclusion. I think people are kind of overreacting to this a little bit, and we need to inject some common sense in there along the lines of: at the end of the day, you can explain what data went in; you can explain the logic of what you're trying to solve and why; and you don't have to explain every neuron in the neural net and how it was correlated to every other piece.

I think the GDPR is actually doing a good thing. It's enabling consumers to understand how the decisions are being made about them, but they don't have to understand everything in the weeds about it. Because the whole point of machine learning is that it can do things that we can't as humans. So, that's why we use it, and there are cases where it makes sense to trust the model rather than humans to get things done and, potentially and hopefully, done more accurately.

Jun 21 2018
32 mins
