
O'Reilly Data Show - O'Reilly Media Podcast

Updated 1 day ago

Business
Technology

Big data and data science interviews, insight, and analysis.


iTunes Ratings

54 Ratings
Average Ratings
5★: 29 · 4★: 11 · 3★: 7 · 2★: 6 · 1★: 1

Dropping Knowledge Bombs

By Virtually Natalie - May 21 2019
Ben and his wide variety of knowledgeable guests are truly rockstars! They drop quality (and free!) knowledge bombs in each and every episode. The great advice they provide, combined with the relatable way in which they deliver it had me hooked from the very first listen. Thanks for putting out such a stellar show Ben - keep up the great work!

Great to hear from those in the front lines

By Daddictedy - Jan 17 2016
Great way to catch up on the history and evolution of DS.


Listen to:


Trends in data, machine learning, and AI


The O’Reilly Data Show Podcast: Ben Lorica looks ahead at what we can expect in 2019 in the big data landscape.

For the end-of-year holiday episode of the Data Show, I turned the tables on Data Show host Ben Lorica to talk about trends in big data, machine learning, and AI, and what to look for in 2019. Lorica also showcased some highlights from our upcoming Strata Data and Artificial Intelligence conferences.

Here are some highlights from our conversation:

Real-world use cases for new technology

If you're someone who wants to use data, data infrastructure, data science, machine learning, and AI, we're really at the point where there are a lot of tools for implementers and developers. They're not necessarily doing research and development; they just want to build better products and automate workflow. I think that's the most significant development in my mind.

And then I think use case sharing also has an impact. For example, at our conferences, people are sharing how they're using AI and ML in their businesses, so the use cases are getting better defined—particularly for technologies that are relatively new to the broader data community, like deep learning. There are now use cases that touch the types of problems people normally tackle: things that involve structured data, such as time series forecasting or recommenders.

With that said, while we are in an implementation phase, I think as people who follow this space will attest, there's still a lot of interesting things coming out of the R&D world, so still a lot of great innovation and a lot more growth in terms of how sophisticated and how easy to use these technologies will be.

Addressing ML and AI bottlenecks

We have a couple of surveys that we'll release early in 2019. In one of these surveys, we asked people what the main bottleneck is in terms of adopting machine learning and AI technologies.

Interestingly enough, the main bottleneck was cultural issues—people are still facing challenges in terms of convincing people within their companies to adopt these technologies. And then, of course, the next two are the ones we're familiar with: lack of data and lack of skilled people. And then the fourth bottleneck people cited was trouble identifying business use cases.

What's interesting about that is, if you then ask people how mature their practice is and you look at the people with the most mature AI and machine learning practices, they still cite a lack of data as the main bottleneck. What that tells me is that there's still a lot of opportunity for people to apply these technologies within their companies, but there's a lot of foundational work people have to do in terms of just getting data in place, getting data collected and ready for analytics.

Focus on foundational technologies

At the Strata Data conferences in San Francisco, London, and New York, the emphasis will be on building and bringing in the technologies and cultural practices that will allow you to sustain analytics and machine learning in your organization. That means having all of the foundational technologies in place—data ingestion, data governance, ETL, data lineage, data science platforms, metadata stores, and things like that: the various pieces of technology that will be important as you scale the practice of machine learning and AI in your company.

At the Artificial Intelligence conferences, we remain focused on being the de facto gathering place for people interested in applied artificial intelligence. We will focus on servicing the most important use cases in many, many domains. That means showcasing, of course, the latest research in deep learning and other branches of machine learning, but also helping people grapple with some of the other important considerations, like privacy and security, fairness, reliability, and safety.

...At both the Strata Data and Artificial Intelligence conferences, we will focus on helping people understand the capabilities of the technology, the strengths and limitations; that's why we run executive briefings at all of these events. We showcase case studies that are aimed at the non-technical and business user as well—so, we'll have two types of case studies, one more technical and one not so technical so the business decision-makers can benefit from seeing how their peers are using and succeeding with some of these technologies.

Dec 20 2018 · 28mins

The state of machine learning in Apache Spark


The O’Reilly Data Show Podcast: Ion Stoica and Matei Zaharia explore the rich ecosystem of analytic tools around Apache Spark.

In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio.

We began our conversation by discussing recent academic research that would be of interest to the Apache Spark community (Stoica leads the RISE Lab at UC Berkeley, Zaharia is part of Stanford’s DAWN Project). The bulk of our conversation centered around machine learning. Like many in the audience, I was first attracted to Spark because it simultaneously allowed me to scale machine learning algorithms to large data sets while providing reasonable latency.

Here is a partial list of the items we discussed:

  • The current state of machine learning in Spark.

  • Given that a lot of innovation has taken place outside the Spark community (e.g., scikit-learn, TensorFlow, XGBoost), we discussed the role of Spark ML moving forward.

  • The plan to make it easier to integrate advanced analytics libraries that aren't "textbook machine learning" (e.g., NLP, time series analysis, and graph analysis) into Spark and Spark ML pipelines.

  • Some upcoming projects from Berkeley and Stanford that target AI applications (including newer systems that provide lower latency, higher throughput).

  • Recent Berkeley and Stanford projects that address two key bottlenecks in machine learning—lack of training data, and deploying and monitoring models in production.

[Full disclosure: I am an advisor to Databricks.]

Sep 14 2017 · 21mins

Applications of data science and machine learning in financial services


The O’Reilly Data Show Podcast: Jike Chong on the many exciting opportunities for data professionals in the U.S. and China.

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China.

We had a great conversation spanning many topics, including:

  • Potential applications of data science in financial services.

  • The current state of data science in financial services in both the U.S. and China.

  • His experience recruiting, training, and managing data science teams in both the U.S. and China.

Here are some highlights from our conversation:

Opportunities in financial services

There's a customer acquisition piece and then there's a customer retention piece. For customer acquisition, we can see that new technologies can really add value by looking at all sorts of data sources that can help a financial service company identify who they want to target to provide those services. So, it's a great place where data science can help find the product market fit, not just at one instance like identifying who you want to target, but also in a continuous form where you can evolve a product and then continuously find the audience that would best fit the product and continue to analyze the audience so you can design the next generation product. ... Once you have a specific cohort of users who you want to target, there's a need to be able to precisely convert them, which means understanding the stage of the customer's thought process and understanding how to form the narrative to convince the user or the customer that a particular piece of technology or particular piece of service is the current service they need.

... On the customer serving or retention side, for financial services we commonly talk about building hundred-year businesses, right? They have to be profitable businesses, and for a financial service to be profitable, there are operational considerations—quantifying risk requires a lot of data science; preventing fraud is really important; and there is garnering long-term trust from the customer so they stay with you, which means having the work ethic to take care of customers' data and being able to serve the customer better with automated services whenever and wherever the customer is. It's all those opportunities where I see we can help serve the customer by having the right services presented to them and being able to serve them in the long term.

Opportunities in China

A few important areas in the financial space in China include mobile payments, wealth management, lending, and insurance—basically, the major areas for the financial industry.

For these areas, China may be a forerunner in using internet technologies, especially mobile internet technologies, for FinTech, and I think the wave started back in the 2012/2013 time frame. If you look at mobile payments, like Alipay and WeChat, those have hundreds of millions of active users. The latest figure from Alipay is about 608 million users, and these are monthly active users we're talking about. This is about two times the U.S. population actively using Alipay on a monthly basis, which is a crazy number if you consider all the data that generates and all the things you can see people buying, helping you understand how to serve the users better.

If you look at WeChat, they're boasting one billion users, monthly active users, early this year. Those are the huge players, and with that amount of traffic, they are able to generate a lot of interest for the lower-frequency services like wealth management and lending, as well as insurance.

May 23 2019 · 42mins

The real value of data requires a holistic view of the end-to-end data pipeline


The O’Reilly Data Show Podcast: Ashok Srivastava on the emergence of machine learning and AI for enterprise applications.

In this episode of the Data Show, I spoke with Ashok Srivastava, senior vice president and chief data officer at Intuit. He has a strong science and engineering background, combined with years of applying machine learning and data science in industry. Prior to joining Intuit, he led the teams responsible for data and artificial intelligence products at Verizon. I wanted his perspective on a range of issues, including the role of the chief data officer, ethics in machine learning, and the emergence of AI technologies for enterprise products and applications.

Here are some highlights from our conversation:

Chief data officer

A chief data officer, in my opinion, is a person who thinks about the end-to-end process of obtaining data, data governance, and transforming that data for a useful purpose. His or her purview is relatively large. I view my purview at Intuit to be exactly that, thinking about the entire data pipeline, proper stewardship, proper governance principles, and proper application of data. I think that as the public learns more about the opportunities that can come from data, there's a lot of excitement about the potential value that can be unlocked from it from the consumer standpoint, and also many businesses and scientific organizations are excited about the same thing. I think the CDO plays a role as a catalyst in making those things happen with the right principles applied.

I would say if you look back into history a little bit, you'll find the need for the chief data officer started to come into play when people saw a huge amount of data coming in at high speeds with high variety and variability—but then also the opportunity to marry that data with real algorithms that can have a transformational property to them. While it's true that CIOs, CTOs, and people who are in lines of business can and should think about this, it's a complex enough process that I think it merits having a person and an organization think about that end-to-end pipeline.

Ethics

We're actually right now in the process of launching a unified training program in data science that includes ethics as well as many other technical topics. I should say that I joined Intuit only about six months ago. They already had training programs happening worldwide in the area of data science and acquainting people with the principles necessary to use data properly as well as the technical aspects of doing it.

I really feel ethics is a critical area for those of us who work in the field to think about and to be advocates of proper use of data, proper use of privacy information and security, in order to make sure the data that we're stewards of is used in the best possible way for the end consumer.

Describing AI

You can think about two overlapping circles. One circle is really an AI circle. The other is a machine learning circle. Many people think that that intersection is the totality of it, but in fact, it isn't.

... I'm finding that AI needs to be bounded a little bit. I often say that it's a reasonable technology with unreasonable expectations associated with it. I really feel this way, that people for whatever reason have decided that deep learning is going to solve many problems. And there's a lot of evidence to support that, but frankly, there's a lot of evidence also to support the fact that much more work has to be done before these things become “general purpose AI solutions.” That's where a lot of exciting innovation is going to happen in the coming years.

Jun 07 2018 · 31mins

Real-time entity resolution made accessible


The O’Reilly Data Show Podcast: Jeff Jonas on the evolution of entity resolution technologies.

In this episode of the Data Show, I spoke with Jeff Jonas, CEO, founder and chief scientist of Senzing, a startup focused on making real-time entity resolution technologies broadly accessible. He was previously a fellow and chief scientist of context computing at IBM. Entity resolution (ER) refers to techniques and tools for identifying and linking manifestations of the same entity/object/individual. Ironically, ER itself has many different names (e.g., record linkage, duplicate detection, object consolidation/reconciliation, etc.).

ER is an essential first step in many domains, including marketing (cleaning up databases), law enforcement (background checks and counterterrorism), and financial services and investing. Knowing exactly who your customers are is an important task for security, fraud detection, marketing, and personalization. The proliferation of data sources and services has made ER very challenging in the internet age. In addition, many applications now increasingly require near real-time entity resolution.

We had a great conversation spanning many topics including:

  • Why ER is interesting and challenging

  • How ER technologies have evolved over the years

  • How Senzing is working to democratize ER by making real-time AI technologies accessible to developers

  • Some early use cases for Senzing’s technologies

  • Some items on their research agenda

Here are a few highlights from our conversation:

Entity resolution through the years

In the early '90s, I worked on a much more advanced version of entity resolution for the casinos in Las Vegas and created software called NORA, non-obvious relationship awareness. Its purpose was to help casinos better understand who they were doing business with. We would ingest data from the loyalty club, everybody making hotel reservations, people showing up without reservations, everybody applying for jobs, people terminated, vendors, and 18 different lists of different kinds of bad people, some of them card counters (which aren't that bad), some cheaters. And they wanted to figure out across all these identities when somebody was the same, and then when people were related. Some people were using 32 different names and a bunch of different social security numbers.

... Ultimately, IBM bought my company and this technology became what is known now at IBM as “identity insight.” Identity insight is a real-time entity resolution engine that gets used to solve many kinds of problems. MoneyGram implemented it and their fraud complaints dropped 72%. They saved a few hundred million just in their first few years.

... But while at IBM, I had a grand vision about a new type of entity resolution engine that would have been unlike anything that's ever existed. It's almost like a Swiss Army knife for ER.

Recent developments

The Senzing entity resolution engine works really well on two records from a domain that you've never even seen before. Say you've never done entity resolution on restaurants from Singapore. The first two records you feed it, it's really, really already smart. And then as you feed it more data, it gets smarter and smarter.

... So, there are two things that we've intertwined. One is common sense. One type of common sense is the names—Dick, Dickie, Richie, Rick, Ricardo are all part of the same name family. Why should it have to study millions and millions of records to learn that again?

... Next to common sense, there's real-time learning. In real-time learning, we do a few things. You might have somebody named Bob, but who now goes by a nickname or an alias of Andy. Eventually, you might come to learn that. So, now you know you have to learn over time that Bob also has this nickname, and Bob lived at three addresses, and this is his credit card number, and now he's got four phone numbers. So you want to learn those over time.
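A minimal Python sketch can make these two ingredients, built-in name "common sense" and incremental learning of identifiers, concrete. Everything here (the tiny name-family table, the key scheme, the merge rule) is invented for illustration; it is not Senzing's API.

```python
# Toy entity resolver illustrating two ideas from the conversation:
# (1) common-sense knowledge such as name families, and (2) real-time,
# incremental learning: each new record can merge previously separate
# entities when it shares an identifier with them.

# Common-sense table: name variants map to one canonical family.
NAME_FAMILY = {
    "dick": "richard", "dickie": "richard", "richie": "richard",
    "rick": "richard", "ricardo": "richard", "richard": "richard",
    "bob": "robert", "bobby": "robert", "robert": "robert",
}

def keys(record):
    """Identifiers a record can match on: name family, phones, addresses."""
    name = record["name"].lower()
    ks = {("name", NAME_FAMILY.get(name, name))}
    ks |= {("phone", p) for p in record.get("phones", [])}
    ks |= {("addr", a) for a in record.get("addrs", [])}
    return ks

class Resolver:
    """Streams records in; merges entities whenever identifiers collide."""

    def __init__(self):
        self.entities = []  # each entity: {"keys": set, "records": list}

    def add(self, record):
        ks = keys(record)
        merged = {"keys": set(ks), "records": [record]}
        for e in [e for e in self.entities if e["keys"] & ks]:
            merged["keys"] |= e["keys"]
            merged["records"] += e["records"]
            self.entities.remove(e)
        self.entities.append(merged)
        return merged

r = Resolver()
r.add({"name": "Dick", "phones": ["555-0100"]})
r.add({"name": "Richard", "addrs": ["12 Elm St"]})
print(len(r.entities))  # 1: "Dick" and "Richard" share a name family
```

Real engines add probabilistic scoring, indexing so key lookups scale, and the ability to revisit earlier merge decisions as new evidence arrives.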

... These systems we're creating, our entity resolution systems—which really resolve entities and graph them (call it index of identities and how they're related)—never has to be reloaded. It literally cleans itself up in the past. You can do maintenance on it while you're querying it, while you're loading new transactional data, while you're loading historical data. There's nothing else like it that can work at this scale. It's really hard to do.

May 09 2019 · 27mins

Effective mechanisms for searching the space of machine learning algorithms


The O’Reilly Data Show Podcast: Kenneth Stanley on neuroevolution and other principled ways of exploring the world without an objective.

In this episode of the Data Show, I spoke with Ken Stanley, founding member of Uber AI Labs and associate professor at the University of Central Florida. Stanley is an AI researcher and a leading pioneer in the field of neuroevolution—a method for evolving and learning neural networks through evolutionary algorithms. In a recent survey article, Stanley went through the history of neuroevolution and listed recent developments, including its applications to reinforcement learning problems.

Stanley is also the co-author of a book entitled Why Greatness Cannot Be Planned: The Myth of the Objective—a book I’ve been recommending to anyone interested in innovation, public policy, and management. Inspired by Stanley’s research in neuroevolution (into topics like novelty search and open endedness), the book is filled with examples of how notions first uncovered in the field of AI can be applied to many other disciplines and domains.

The book closes with a case study that hits closer to home—the current state of research in AI. One can think of machine learning and AI as a search for ever-better algorithms and models. Stanley points out that gatekeepers (editors of research journals, conference organizers, and others) impose two objectives that researchers must meet before their work gets accepted or disseminated: (1) empirical: their work should beat incumbent methods on some benchmark task; and (2) theoretical: proposed new algorithms are better if they can be proven to have desirable properties. Stanley argues this means that interesting work ("stepping stones") that fails to meet either of these criteria falls by the wayside, preventing other researchers from building on potentially interesting but incomplete ideas.

Here are some highlights from our conversation:

Neuroevolution today

In the state of the art today, the algorithms have the ability to evolve variable topologies or different architectures. There are pretty sophisticated algorithms for evolving the architecture of a neural network; in other words, what's connected to what, not just what the weights of those connections are—which is what deep learning is usually concerned with.

There's also an idea of how to encode very, very large patterns of connectivity. This is something that's been developed independently in neuroevolution, where there's not a really analogous thing in deep learning right now. This is the idea that if you're evolving something that's really large, then you probably can't afford to encode the whole thing in the DNA. In other words, if we have 100 trillion connections in our brains, our DNA does not have 100 trillion genes. In fact, it couldn't have 100 trillion genes; it just wouldn't fit. That would be astronomically too high. So then, how is it that with a much, much smaller space of DNA, which is about 30,000 genes or so, three billion base pairs, you get enough information in there to encode something that has 100 trillion parts?

This is the issue of encoding. We've become sophisticated at creating artificial encodings that are basically compressed in an analogous way, where you can have a relatively short string of information to describe a very large structure that comes out—in this case, a neural network. We've gotten good at doing encoding and we've gotten good at searching more intelligently through the space of possible neural networks. We originally thought what you need to do is just breed by choosing among the best. So, you say, ‘Well, there's some task we're trying to do and I'll choose among the best to create the next generation.’

We've learned since then that that's actually not always a good policy. Sometimes you want to explicitly choose for diversity. In fact, that can lead to better outcomes.
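The compressed-encoding idea above can be sketched in a few lines of Python. Here the "genome" is just three numbers that parameterize a function over connection coordinates, so a 10,000-connection network is generated rather than stored gene-by-gene. The particular genome and function are invented for illustration; real indirect encodings such as HyperNEAT evolve the generating function itself (a CPPN) rather than fixing its form.

```python
import math

# A tiny indirect encoding: the genome is 3 parameters, but it determines
# all n*n connection weights through a function of the connection's
# coordinates (i, j). Nothing per-connection is stored in the genome.

genome = [0.8, 3.0, -0.5]  # three "genes"

def weight(i, j, g):
    """Weight of connection (i, j), computed from the genome on demand."""
    a, b, c = g
    return a * math.sin(b * (i - j)) + c * math.cos(i + j)

n = 100  # a 100 x 100 weight matrix: 10,000 connections
net = [[weight(i, j, genome) for j in range(n)] for i in range(n)]

print(len(genome), sum(len(row) for row in net))  # 3 genes -> 10000 weights
```

Mutating a single gene changes a coherent global pattern of weights at once, which is how a small genome can still search a very large weight space.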

The myth of the objective

Our book does recognize that sometimes pursuing objectives is a rational thing to do. But I think the broader point that's important here is there’s a class of discoveries for which it really is against your interest to frame what you're doing in terms of an objective.

The reason we wrote the book is because … I started to realize this principle that ‘sometimes in order to make discovery possible, you have to stop having an objective’ speaks to people beyond just computer scientists who are developing algorithms. It's an issue for our society and for institutions because there are many things we do that are driven by some kind of objective metric. It almost sounds like heresy to suggest that you shouldn't do that.

It's like an unquestioned assumption that exists throughout our culture: that the primary route to progress is to set objectives, move toward those objectives, and measure your performance with respect to those objectives. Given the hard empirical results we have, we began to think it is important to counterweight this belief that pervades society with a counterargument pointing out that there are cases where this is actually a really bad idea.

The thing I learned more and more talking to different groups is that this discussion is not being had. We're not talking about this, and I think it's a very important discussion because our institutions are geared away from innovation because they are so objectively driven. We could do more to foster innovation if we recognize this principle. A lot of people want this security blanket of an objective because they don't trust anything that isn't driven by an objective.

Actually, it turns out there are principled ways of exploring the world without an objective. In other words, it's not just random and the book is about that. It's about how smart ways of exploring in a non-objective way can lead to really, really important results. We just wanted to open up that conversation ‘society wide’ and not just have it narrowly within the field of computer science because it is such an important conversation to have.

Aug 31 2017 · 45mins

Building tools for enterprise data science


The O’Reilly Data Show Podcast: Vitaly Gordon on the rise of automation tools in data science.

In this episode of the Data Show, I spoke with Vitaly Gordon, VP of data science and engineering at Salesforce. As the use of machine learning becomes more widespread, we need tools that will allow data scientists to scale so they can tackle many more problems and help many more people. We need automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection and hyperparameter tuning, as well as monitoring.

I wanted the perspective of someone who is already faced with having to support many models in production. The proliferation of models is still a theoretical consideration for many data science teams, but Gordon and his colleagues at Salesforce already support hundreds of thousands of customers who need custom models built on custom data. They recently took their learnings public and open sourced TransmogrifAI, a library for automated machine learning for structured data, which sits on top of Apache Spark.

Here are some highlights from our conversation:

The need for an internal data science platform

It's more about how much commonality there is between every single data science use case—how many of the problems are redundant and repeatable.

... A lot of data scientists solve problems that honestly have a lot to do with engineering, a lot to do with things that are not pure modeling.

TransmogrifAI

TransmogrifAI is an automated machine learning library for mostly structured data, and the problem it aims to solve is that we at Salesforce have hundreds of thousands of customers. While all of them share a common set of data, the Salesforce platform itself is extremely customizable. In fact, 80% of the data inside the Salesforce platform sits in what we refer to as custom objects, which one can think of as custom tables in a database.

... We don't build models that are shared between customers. We always use a single customer’s data. We have hundreds of thousands of models potentially that we need to build, and because of that, we needed to automate the entire process. We just cannot throw people at the problem. We basically created TransmogrifAI to automate the entire end-to-end process for creating a model for a user and we decided to open source it a couple months ago.
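The shape of that end-to-end automation can be sketched in plain Python. TransmogrifAI itself is a Scala library on Apache Spark, so this is a language-neutral illustration of the workflow (infer feature types, vectorize, pick the best candidate model on a holdout split) rather than its API; the table, column names, and candidate models below are all invented.

```python
# Per-customer automated modeling, in miniature: given any table, infer
# feature types, vectorize, and select a model with no human in the loop.
# Illustrative only; TransmogrifAI's real pipeline adds type-safe feature
# engineering, leakage detection, and Spark-scale training.

def infer_types(rows):
    """Label each column 'num' or 'cat' by inspecting its values."""
    return {c: "num" if all(isinstance(r[c], (int, float)) for r in rows)
            else "cat" for c in rows[0]}

def vectorize(rows, types, target):
    """One-hot the categorical features, pass numeric ones through."""
    feats = [c for c in types if c != target]
    cats = {c: sorted({r[c] for r in rows}) for c in feats if types[c] == "cat"}
    X = []
    for r in rows:
        x = []
        for c in feats:
            if types[c] == "num":
                x.append(float(r[c]))
            else:
                x += [1.0 if r[c] == v else 0.0 for v in cats[c]]
        X.append(x)
    return X, [float(r[target]) for r in rows]

def knn_predict(X, y, queries, k):
    """k-nearest-neighbor regression: mean target of the k closest rows."""
    preds = []
    for q in queries:
        order = sorted(range(len(X)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], q)))
        preds.append(sum(y[i] for i in order[:k]) / k)
    return preds

def auto_model(rows, target):
    """The 'automated' part: try candidate models, keep the holdout winner."""
    types = infer_types(rows)
    X, y = vectorize(rows, types, target)
    cut = max(1, len(X) * 2 // 3)
    def holdout_err(k):
        preds = knn_predict(X[:cut], y[:cut], X[cut:], k)
        return sum((p - t) ** 2 for p, t in zip(preds, y[cut:]))
    return types, min([1, 3], key=holdout_err)  # candidates: 1-NN vs 3-NN

table = [  # one (hypothetical) customer's custom object
    {"plan": "pro",  "seats": 10, "renewed": 1},
    {"plan": "free", "seats": 1,  "renewed": 0},
    {"plan": "pro",  "seats": 8,  "renewed": 1},
    {"plan": "free", "seats": 2,  "renewed": 0},
    {"plan": "pro",  "seats": 12, "renewed": 1},
    {"plan": "free", "seats": 1,  "renewed": 0},
]
types, k = auto_model(table, target="renewed")
print(types["plan"], types["seats"], k)  # cat num 1
```

The point is that nothing above is specific to this table: feed it a different customer's columns and it adapts, which is what makes hundreds of thousands of per-customer models tractable.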

Nov 21 2018 · 31mins

Building accessible tools for large-scale computation and machine learning


The O’Reilly Data Show Podcast: Eric Jonas on Pywren, scientific computation, and machine learning.

In this episode of the Data Show, I spoke with Eric Jonas, a postdoc in the new Berkeley Center for Computational Imaging. Jonas is also affiliated with UC Berkeley’s RISE Lab. It was at a RISE Lab event that he first announced Pywren, a framework that lets data enthusiasts proficient with Python run existing code at massive scale on Amazon Web Services. Jonas and his collaborators are working on a related project, NumPyWren, a system for linear algebra built on a serverless architecture. Their hope is that by lowering the barrier to large-scale (scientific) computation, we will see many more experiments and research projects from communities that have been unable to easily marshal massive compute resources. We talked about Bayesian machine learning, scientific computation, reinforcement learning, and his stint as an entrepreneur in the enterprise software space.

Here are some highlights from our conversation:

Pywren

The real enabling technology for us was when Amazon announced the availability of AWS Lambda, their microservices framework, in 2014. Prompted by that, I went home one weekend and thought, 'I wonder how hard it is to take an arbitrary Python function and marshal it across the wire, get it running in Lambda; I wonder how many I can get at once?' Thus, Pywren was born.

... Right now, we're primarily focused on the entire scientific Python stack, so SciPy, NumPy, Pandas, Matplotlib, the whole ecosystem there. ... One of the challenges with all of these frameworks and running these things on Lambda is that, right now, Lambda is a fairly constrained resource environment. Amazon will quite happily give you 3,000 cores in the next two seconds, but each one has a maximum runtime and a small amount of memory and a small amount of local disk. Part of the current active research thrust for Pywren is figuring out how to do more general-purpose computation within those resource limits. But right now, we mostly support everything you would encounter in your normal Python workflow—including Jupyter, NumPy, and scikit-learn.
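The core move Jonas describes, marshaling an arbitrary Python function across the wire and fanning it out to stateless workers, can be sketched with the standard library. Real Pywren serializes with cloudpickle (which also handles closures and lambdas that plain pickle rejects) and invokes AWS Lambda; here a local thread pool stands in for Lambda so the sketch is self-contained.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

# Pywren's core trick, sketched locally: serialize a function and its
# argument, hand the bytes to a stateless worker, deserialize and run
# there, and fan out many such jobs at once. A thread pool stands in for
# AWS Lambda so this runs anywhere.

def worker(payload: bytes) -> bytes:
    """What each stateless worker does: deserialize, execute, serialize."""
    fn, arg = pickle.loads(payload)
    return pickle.dumps(fn(arg))

def parallel_map(fn, args, max_workers=64):
    """Pywren-style map: one serialized (fn, arg) job per input element."""
    jobs = [pickle.dumps((fn, a)) for a in args]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [pickle.loads(b) for b in pool.map(worker, jobs)]

def square(x):
    return x * x

print(parallel_map(square, range(5)))  # [0, 1, 4, 9, 16]
```

Swapping the executor for a Lambda invocation is what turns this pattern into "3,000 cores in the next two seconds," subject to the per-function runtime, memory, and disk limits described above.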

NumPyWren

Chris Ré has this nice quote: 'Why is it easier to train a bidirectional LSTM with attention than it is to just compute the SVD of a giant matrix?' One of these things is actually fantastically more complicated than the other, but right now, our linear algebra tools are just such an impediment to doing that sort of large-scale computation. We hope NumPyWren will enable this class of work for the machine learning community.
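To make the quote concrete, here is the kind of linear algebra primitive NumPyWren targets, shrunk to pure Python: the largest singular value of a matrix via power iteration on A^T A. NumPyWren's contribution is running steps like these as fleets of stateless serverless tasks over matrices far too large for one machine; this toy is illustrative only.

```python
# Top singular value of A by power iteration on A^T A: repeatedly apply
# A^T A to a vector and normalize; the vector converges to the top right
# singular vector, and ||A v|| converges to the largest singular value.

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_singular_value(A, iters=200):
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))          # v <- A^T A v
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Av = matvec(A, v)
    return sum(x * x for x in Av) ** 0.5      # ||A v|| = sigma_max

A = [[3.0, 0.0], [0.0, 2.0]]
print(round(top_singular_value(A), 3))        # 3.0
```

Every step here is matrix multiplication, which shards naturally into independent block-level tasks, exactly the kind of work a serverless backend can absorb.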

The growing importance of reinforcement learning

Ben Recht makes the argument that the most interesting problems in machine learning right now involve taking action based upon your intelligence. I think he's right about this—taking action based upon past data and doing it in a way that is safe and robust and reliable and all of these sorts of things. That is very much the domain that has traditionally been occupied by fields like control theory and reinforcement learning.

Reinforcement learning and Ray

Ray is an excellent platform for building large-scale distributed systems, and it's much more Python-native than Spark was. Ray also has much more of a focus on real-time performance. A lot of the things that people are interested in with Ray revolve around doing things like large-scale reinforcement learning—and it just so happens that deep reinforcement learning is something that everyone's really excited about.

Related resources:

Aug 30 2018

53mins


Using machine learning and analytics to attract and retain employees


The O’Reilly Data Show Podcast: Maryam Jahanshahi on building tools to help improve efficiency and fairness in how companies recruit.

In this episode of the Data Show, I spoke with Maryam Jahanshahi, research scientist at TapRecruit, a startup that uses machine learning and analytics to help companies recruit more effectively. In an upcoming survey, we found that a “skills gap” or “lack of skilled people” was one of the main bottlenecks holding back adoption of AI technologies. Many companies are exploring a variety of internal and external programs to train staff on new tools and processes. The other route is to hire new talent. But recent reports suggest that demand for data professionals is strong and competition for experienced talent is fierce. Jahanshahi and her team are building natural language and statistical tools that can help companies improve their ability to attract and retain talent across many key areas.

Here are some highlights from our conversation:

Optimal job titles

The conventional wisdom in our field has always been that you want to optimize for “the number of good candidates” divided by “the number of total candidates.” ... The thinking is that one of the ways in which you get a good signal-to-noise ratio is if you advertise for a more senior role. ... In fact, we found the number of qualified applicants was lower for the senior data scientist role.

... We saw from some of our behavioral experiments that people felt that was too senior a role for them to apply to. What we would call the "confidence gap" was kicking in at that point. It's a pretty well-known phenomenon that some groups of the population are less confident. This has been best characterized in terms of gender: the idea that most women only apply for jobs when they meet 100% of the qualifications, whereas most men will apply even with 60% of the qualifications. That was actually manifesting.

Highlighting benefits

We saw a lot of big companies that would offer 401(k), that would offer health insurance or family leave, but wouldn't mention those benefits in the job descriptions. This had an impact on how candidates perceived these companies. Even though it's implied that Coca-Cola is probably going to give you 401(k) and health insurance, not mentioning it changes the way you think of that job.

... So, don't forget the things that really should be there. Even the boring stuff really matters for most candidates. You'd think it would only matter for older candidates, but, actually, millennials and everyone in every age group are very concerned about these things because it's not specifically about the 401(k) plan; it's about what it implies in terms of the company—that the company is going to take care of you, is going to give you leave, is going to provide a good workplace.

Improving diversity

We found the best way to deal with representation at the end of the process is actually to deal with representation early in the process. What I mean by that is having a robust or a healthy candidate pool at the start of the process. We found for data scientist roles, that was about having 100 candidates apply for your job.

... If we're not getting to the point where we can attract 100 applicants, we'll take a look at that job description. We'll see what's wrong with it and what could be turning off candidates; it could be that you're not syndicating the job description well, it's not getting into search results, or it could be that it's actually turning off a lot of people. You could be asking for too many qualifications, and that turns off a lot of people. ... Sometimes it involves taking a step back and taking a look at what we're doing in this process that's not helping us and that's starving us of candidates.

Related resources:

Jan 31 2019

46mins


Using machine learning to improve dialog flow in conversational applications


The O’Reilly Data Show Podcast: Alan Nichol on building a suite of open source tools for chatbot developers.

In this episode of the Data Show, I spoke with Alan Nichol, co-founder and CTO of Rasa, a startup that builds open source tools to help developers and product teams build conversational applications. About 18 months ago, there was tremendous excitement and hype surrounding chatbots, and while things have quieted lately, companies and developers continue to refine and define tools for building conversational applications. We spoke about the current state of chatbots, specifically about the types of applications developers are building today and how he sees conversational applications evolving in the near future.

As I described in a recent post, workflow automation will happen in stages. With that in mind, chatbots and intelligent assistants are bound to improve as underlying algorithms, technologies, and training data get better.

Here are some highlights from our conversation:

Chatbots and state machines

The first component is what we call natural language understanding, which typically means taking a short message that a user sends and extracting some meaning from it, which means turning it into structured data. In the case we talked about regarding the SQL database, if somebody asks, for example, ‘What was my ROI on my Facebook campaigns last month?’, the first thing you want to understand is that this is a data question, and you want to assign it that label: the person is not saying hello, or goodbye, or thank you, but asking a specific question. Then you want to pick out the fields that help you create a query.

... The second piece is, how do you actually know what to do next? How do you build a system that can hold a conversation that is coherent? What you realize very quickly is that it's not enough to have one input always matched to the same output. For example, if you ask somebody a yes or no question and they say, ‘yes,’ the next thing to do, of course, depends on what the original question was.

... Real conversations aren't stateless; they have context, and they need to pay attention to the history. So, the way developers do that is to build a state machine. That means, for example, you have a bot that can do a few different things: it can talk about flights; it can talk about hotels. Then you define different states for when the person is still searching, for when they are comparing different things, or for when they have finished a booking. And then you have to define rules for how to behave for every input in every possible state.
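The state-machine approach described above can be sketched as a lookup table from (state, intent) pairs to replies and next states. The states and intents below are invented for illustration; they are not Rasa's API:

```python
# A toy dialogue state machine: (state, intent) -> (reply, next_state).
# Every input must be enumerated by hand for every state, which is
# exactly the scaling problem discussed in the interview.
RULES = {
    ("start", "search_flights"): ("Where would you like to fly?", "searching_flights"),
    ("start", "search_hotels"): ("Which city do you need a hotel in?", "searching_hotels"),
    ("searching_flights", "give_city"): ("Here are some flights.", "comparing"),
    ("searching_hotels", "give_city"): ("Here are some hotels.", "comparing"),
    ("comparing", "confirm"): ("Booking confirmed!", "done"),
}

def respond(state, intent):
    try:
        return RULES[(state, intent)]
    except KeyError:
        # The inevitable fallback: an input nobody wrote a rule for.
        return ("Sorry, I didn't understand that.", state)

state = "start"
for intent in ["search_flights", "give_city", "confirm"]:
    reply, state = respond(state, intent)
    print(reply)
# Where would you like to fly?
# Here are some flights.
# Booking confirmed!
```

Note that any off-script input falls through to the generic fallback, which previews the "happy paths" critique below.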

Beyond state machines

The problem is that [the state machine] approach works for building your first version, but it really restricts you to what we call “the happy paths,” where the user is compliant and cooperative and does everything you ask them to do. In a typical case, you ask a person, "Do you like option A, or option B?" You build a path for the person saying A, and you build a path for the person saying B. But then you give it to real users, and they say, "No, I don't like either of those." Or they ask a question like, "Why is A so much more expensive than B?" Or, "Let me get back to you about that."

... They don't scale; that's the problem. If you're a developer and somebody has a conversation with your bot and you realize it did the wrong thing, you now have to look back through your (literally) thousands or tens of thousands of rules to figure out which one crashed or did the wrong thing. You figure out where to inject one more rule to handle one more edge case, and that just doesn't scale at all.

... With our dialogue library Rasa Core, we give the user the ability to talk to the bot and provide feedback. So, in Rasa, the whole flow of dialogue is also controlled with machine learning, and it's learned from real sample conversations. You talk to the system, and if it does something wrong, you provide feedback and it corrects itself. So, you explore the space of possible conversations interactively yourself, and then your users do as well.

Related resources:

Sep 13 2018

45mins


Why companies are in need of data lineage solutions


The O’Reilly Data Show Podcast: Neelesh Salian on data lineage, data governance, and evolving data platforms.

In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data—specifically, labeled data for training models. There are also many other considerations—including security, privacy, reliability/safety—that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up.

There are several San Francisco Bay Area companies that have embarked on building data lineage systems—including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It's borne out of the necessity of understanding how data is written and interacted with in the data warehouse. I like to tell this story when I'm describing data lineage: think of it as a journey for data. The data takes a journey entering your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew the journey, exactly how that data came into being in your data warehouse or any other storage appliance you use, that would be really useful.

... Think about data lineage as helping with issues of data quality, such as understanding whether something is corrupted. On the security side, think of GDPR ... which was one of the hot topics I heard about at the Strata Data Conference in London in 2018.

Why companies are suddenly building data lineage solutions

A data lineage system becomes necessary as time progresses. It makes maintenance easier, and you need it for audit trails, for security and compliance. But you also need to think of the benefit of managing the data sets you're working with: if you're working with 10 databases, you need to know what's going on in them. If I had to give you a vision of a data lineage system, think of it as a final graph or view of some data set that shows you what it's linked to, along with metadata you can drill down into. Say you have corrupted data, or say you want to debug something; all of these cases tie into the actual use cases for which we want to build it.
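The "graph of what it's linked to" can be sketched as a mapping from each data set to its direct inputs; walking that mapping upstream recovers a data set's full journey into the warehouse, which is what you want when debugging a corrupted table. The table names below are invented for illustration:

```python
# Toy lineage graph: data set -> the data sets it was derived from.
LINEAGE = {
    "daily_dashboard": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}

def upstream(dataset, graph=LINEAGE):
    """Return every ancestor of `dataset`: its full journey into the warehouse."""
    seen = []
    stack = list(graph.get(dataset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(graph.get(parent, []))
    return sorted(seen)

print(upstream("daily_dashboard"))
# ['customers_clean', 'customers_raw', 'orders_clean', 'orders_raw']
```

A production system would attach metadata (owners, schemas, timestamps) to each node so you can drill down from the graph view.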

Related resources:

Apr 25 2019

34mins


How Ray makes continuous learning accessible and easy to scale


The O’Reilly Data Show Podcast: Robert Nishihara and Philipp Moritz on a new framework for reinforcement learning and AI applications.

In this episode of the Data Show, I spoke with Robert Nishihara and Philipp Moritz, graduate students at UC Berkeley and members of RISE Lab. I wanted to get an update on Ray, an open source distributed execution framework that makes it easy for machine learning engineers and data scientists to scale reinforcement learning and other related continuous learning algorithms. Many AI applications involve an agent (for example a robot or a self-driving car) interacting with an environment. In such a scenario, an agent will need to continuously learn the right course of action to take for a specific state of the environment.

What do you need in order to build large-scale continuous learning applications? You need a framework with low-latency response times, one that is able to run massive numbers of simulations quickly (agents need to be able to explore states within an environment), and one that supports heterogeneous computation graphs. Ray is a new execution framework written in C++ that contains these key ingredients. In addition, Ray is accessible via Python (and Jupyter Notebooks), and comes with many of the standard reinforcement learning and related continuous learning algorithms that users can easily call.

As Nishihara and Moritz point out, frameworks like Ray are also useful for common applications such as dialog systems, text mining, and machine translation. Here are some highlights from our conversation:

Tools for reinforcement learning

Ray is something we've been building that's motivated by our own research in machine learning and reinforcement learning. If you look at what researchers who are interested in reinforcement learning are doing, they're largely ignoring the existing systems out there and building their own custom frameworks or custom systems for every new application that they work on.

... For reinforcement learning, you need to be able to share data very efficiently, without copying it between multiple processes on the same machine, you need to be able to avoid expensive serialization and deserialization, and you need to be able to create a task and get the result back in milliseconds instead of hundreds of milliseconds. So, there are a lot of little details that come up.

... In fact, people often use MPI along with lower-level multi-processing libraries to build the communication infrastructure for their reinforcement learning applications.

Scaling machine learning in dynamic environments

I think right now when we think of machine learning, we often think of supervised learning. But a lot of machine learning applications are changing from making just one prediction to making sequences of decisions and taking sequences of actions in dynamic environments.

The thing that's special about reinforcement learning is it's not just the different algorithms that are being used, but rather the different problem domain that it's being applied to: interactive, dynamic, real-time settings bring up a lot of new challenges.

... The set of algorithms actually goes even a little bit further. Some of these techniques are even useful in, for example, things like text summarization and translation. You can use these techniques that have been developed in the context of reinforcement learning to better tackle some of these more classical problems [where you have some objective function that may not be easily differentiable].

... Some of the classic applications that we have in mind when we think about reinforcement learning are things like dialogue systems, where the agent is one participant in the conversation. Or robotic control, where the agent is the robot itself and it's trying to learn how to control its motion.

... For example, we implemented the evolution algorithm described in a recent OpenAI paper in Ray. It was very easy to port to Ray, and writing it only took a couple of hours. Then we had a distributed implementation that scaled very well and we ran it on up to 15 nodes.
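The evolution-strategies idea behind that experiment fits in a few lines: perturb the parameters with Gaussian noise, evaluate each perturbation, and step in the noise directions weighted by their returns. The sketch below optimizes a toy one-dimensional objective standing in for an episode return; in Ray, each evaluation would be a remote task running in parallel.

```python
import random

def reward(theta):
    # Stand-in for an episode return: maximized at theta = 3.0.
    return -(theta - 3.0) ** 2

def evolution_strategies(theta=0.0, sigma=0.5, lr=0.03, pop=50, iters=300):
    rng = random.Random(0)
    for _ in range(iters):
        noise = [rng.gauss(0, 1) for _ in range(pop)]
        returns = [reward(theta + sigma * n) for n in noise]
        # Gradient estimate: noise directions weighted by their returns.
        grad = sum(r * n for r, n in zip(returns, noise)) / (pop * sigma)
        theta += lr * grad  # gradient ascent on the estimated reward
    return theta

print(round(evolution_strategies(), 2))  # approximately 3.0
```

Because each of the `pop` evaluations is independent, the algorithm parallelizes trivially, which is why it ports so easily to a framework like Ray.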

Related resources:

Aug 17 2017

18mins


Why it’s hard to design fair machine learning models


The O’Reilly Data Show Podcast: Sharad Goel and Sam Corbett-Davies on the limitations of popular mathematical formalizations of fairness.

In this episode of the Data Show, I spoke with Sharad Goel, assistant professor at Stanford, and his student Sam Corbett-Davies. They recently wrote a survey paper, “A Critical Review of Fair Machine Learning,” where they carefully examined the standard statistical tools used to check for fairness in machine learning models. It turns out that each of the standard approaches (anti-classification, classification parity, and calibration) has limitations, and their paper is a must-read tour through recent research in designing fair algorithms. We talked about their key findings, and, most importantly, I pressed them to list a few best practices that analysts and industrial data scientists might want to consider.

Here are some highlights from our conversation:

Calibration and other standard metrics

Sam Corbett-Davies: The problem with many of the standard metrics is that they fail to take into account how different groups might have different distributions of risk. In particular, if there are people who are very low risk or very high risk, then it can throw off these measures in a way that doesn't actually change what the fair decision should be. ... The upshot is that if you end up enforcing or trying to enforce one of these measures, if you try to equalize false positive rates, or you try to equalize some other classification parity metric, you can end up hurting both the group you're trying to protect and any other groups for which you might be changing the policy.

... A layman's definition of calibration would be, if an algorithm gives a risk score—maybe it gives a score from one to 10, and one is very low risk and 10 is very high risk—calibration says the scores should mean the same thing for different groups (where the groups are defined based on some protected variable like gender, age, or race). We basically say in our paper that calibration is necessary for fairness, but it's not good enough. Just because your scores are calibrated doesn't mean you aren't doing something funny that could be harming certain groups.
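That layman's definition translates directly into a check: bucket the scores, then compare each group's observed positive-outcome rate per bucket. A minimal sketch with made-up records:

```python
from collections import defaultdict

def calibration_by_group(records, n_bins=2, max_score=10):
    """records: (group, score in 1..max_score, outcome 0/1).

    Returns {group: {bin: observed positive rate}}. Calibration asks
    that, within each score bin, these rates match across groups.
    """
    counts = defaultdict(lambda: [0, 0])  # (group, bin) -> [positives, total]
    for group, score, outcome in records:
        b = (score - 1) * n_bins // max_score
        counts[(group, b)][0] += outcome
        counts[(group, b)][1] += 1
    out = defaultdict(dict)
    for (group, b), (pos, total) in counts.items():
        out[group][b] = pos / total
    return dict(out)

records = [
    ("A", 2, 0), ("A", 3, 0), ("A", 9, 1), ("A", 8, 1),
    ("B", 1, 0), ("B", 2, 1), ("B", 9, 1), ("B", 10, 1),
]
print(calibration_by_group(records))
# {'A': {0: 0.0, 1: 1.0}, 'B': {0: 0.5, 1: 1.0}}
```

In this toy data, low scores mean different things for groups A and B (0% vs. 50% positive rate), so the scores are miscalibrated; and as the interview stresses, even passing this check is necessary but not sufficient for fairness.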

The need to interrogate data

Sharad Goel: One way to operationalize this is, if you have a set of reasonable measures that could serve as your label, you can see how much your algorithm changes when you use different measures. If your algorithm changes a lot across these different measures, then you really have to worry about determining the right measure: what is the right thing to predict? If, under a variety of reasonable measures, everything looks kind of stable, maybe it's less of an issue. This is very hard to carry out in practice, but I do think it's one of the most important things to understand and to be aware of when designing these types of algorithms.

... There are a lot of subtleties to these different types of metrics that are important to be aware of when designing these algorithms in an equitable way. ... But fundamentally, these are hard problems. It's not particularly surprising that we don't have an algorithm to help us make all of these algorithms fair. ... What is most important is that we really interrogate the data.

Related resources:

Sep 27 2018

34mins


Data regulations and privacy discussions are still in the early stages


The O’Reilly Data Show Podcast: Aurélie Pols on GDPR, ethics, and ePrivacy.

In this episode of the Data Show, I spoke with Aurélie Pols of Mind Your Privacy, one of my go-to resources when it comes to data privacy and data ethics. This interview took place at Strata Data London, a couple of days before the EU General Data Protection Regulation (GDPR) took effect. I wanted her perspective on this landmark regulation, as well as her take on trends in data privacy and growing interest in ethics among data professionals.

Here are some highlights from our conversation:

GDPR is just the starting point

GDPR is not an end point. It's a starting point for a journey where the balance between companies, society, and users of data needs to be redefined. Because when I look at my children, at how they use technology, at how smart my house or my car or my fridge might become, I know that in the long run this idea of giving consent to my fridge to share data is not totally viable. What are we going to build for the next generations?

... I've been teaching privacy and ethics in Madrid at the IE Business School, one of the top business schools in the world. I’ve been teaching in the big data and analytics graduate program. I see the evolution as well. Five years ago, they looked at me like, 'What is she talking about?' Three years ago, some of the people in the room started to understand. ... Last year it was like 'We get it.'

Privacy by design

It's defined as data protection by design and by default. The easy part is the default settings: when you create systems, it's the question I ask 20 times a week: 'Great. I love your system. What data do you collect by default, and what do you pass on by default?' Then you start turning things off, and we'll see who takes on the responsibility to turn things on again. That's the default part. Privacy by design was pushed by Ann Cavoukian in Ontario, Canada, more than 10 years ago.

These principles are finding themselves within the legislation. Not only in GDPR—for example, Hong Kong is starting to talk about this and Japan as well. One of these principles is about positive-sum, not zero-sum. It's not 'I win and you lose.' It's 'we work together and we both win.' That's a very good principle.

There are interesting challenges within privacy by design to translate these seven principles into technical requirements. I think there are opportunities as well. It talks about traceability, visibility, transparency. Which then comes back again to, we're sitting on so much data; how much data do we want to surface and are data subjects or citizens ready to understand what we have, and are they able to make decisions based on that? ... Hopefully this generation of more ethically minded engineers or data scientists will start thinking in that way as well.

Related resources:

Jul 05 2018

33mins


Bringing scalable real-time analytics to the enterprise


The O’Reilly Data Show Podcast: Dhruba Borthakur and Shruti Bhat on enabling interactive analytics and data applications against live data.

In this episode of the Data Show, I spoke with Dhruba Borthakur (co-founder and CTO) and Shruti Bhat (SVP of Product) of Rockset, a startup focused on building solutions for interactive data science and live applications. Borthakur was the founding engineer of HDFS and creator of RocksDB, while Bhat is an experienced product and marketing executive focused on enterprise software and data products. Their new startup is focused on a few trends I’ve recently been thinking about, including the re-emergence of real-time analytics, and the hunger for simpler data architectures and tools.  Borthakur exemplifies the need for companies to continually evaluate new technologies: while he was the founding engineer for HDFS, these days he mostly works with object stores like S3.

We had a great conversation spanning many topics, including:

  • RocksDB, an open source, embeddable key-value store created at Facebook and used in several other open source projects.

  • Time-series databases.

  • The importance of having solutions for real-time analytics, particularly now with the renewed interest in IoT applications and rollout of 5G technologies.

  • Use cases for Rockset’s technologies—and more generally, applications of real-time analytics.

  • The Aggregator Leaf Tailer architecture as an alternative to the Lambda architecture.

  • Building data infrastructure in the cloud.

The Aggregator Leaf Tailer (“CQRS for the data world”): A data architecture favored by web-scale companies. Source: Dhruba Borthakur, used with permission.

Related resources:

Jun 06 2019

37mins


What data scientists and data engineers can do with current generation serverless technologies


The O’Reilly Data Show Podcast: Avner Braverman on what’s missing from serverless today and what users should expect in the near future.

In this episode of the Data Show, I spoke with Avner Braverman, co-founder and CEO of Binaris, a startup that aims to bring serverless to web-scale and enterprise applications. This conversation took place shortly after the release of a seminal paper from UC Berkeley (“Cloud Programming Simplified: A Berkeley View on Serverless Computing”), and this paper seeded a lot of our conversation during this episode.

Serverless is clearly on the radar of data engineers and architects. In a recent survey, we found 85% of respondents already had parts of their data infrastructure in one of the public clouds, and 38% were already using at least one of the serverless offerings we listed. As more serverless offerings get rolled out—e.g., things like PyWren that target scientists—I expect these numbers to rise.

We had a great conversation spanning many topics, including:

  • A short history of cloud computing.

  • The fundamental differences between serverless and conventional cloud computing.

  • The reasons serverless—specifically AWS Lambda—took off so quickly.

  • What can data scientists and data engineers do with the current generation serverless offerings.

  • What is missing from serverless today and what should users expect in the near future.

Related resources:

Apr 11 2019

36mins


How social science research can inform the design of AI systems


The O’Reilly Data Show Podcast: Jacob Ward on the interplay between psychology, decision-making, and AI systems.

In this episode of the Data Show, I spoke with Jacob Ward, a Berggruen Fellow at Stanford University. Ward has an extensive background in journalism, mainly covering topics in science and technology, at National Geographic, Al Jazeera, Discovery Channel, BBC, Popular Science, and many other outlets. Most recently, he’s become interested in the interplay between research in psychology, decision-making, and AI systems. He’s in the process of writing a book on these topics, and was gracious enough to give an informal preview by way of this podcast conversation.

Here are some highlights from our conversation:

Psychology and AI

I began to realize there was a disconnect between what is a totally revolutionary set of innovations coming through in psychology right now that are really just beginning to scratch the surface of how human beings make decisions; at the same time, we are beginning to automate human decision-making in a really fundamental way. I had a number of different people say, 'Wow, what you're describing in psychology really reminds me of this piece of AI that I'm building right now,' to change how expectant mothers see their doctors or change how we hire somebody for a job or whatever it is.

Transparency and designing systems that are fair

I was talking to somebody the other day who was trying to build a loan company that was using machine learning to present loans to people. He and his company did everything they possibly could to not redline the people they were loaning to. They were trying very hard not to make unfair loans that would give preference to white people over people of color.

They went to extraordinary lengths to make that happen. They cut addresses out of the process. They did all of this stuff to try to basically neutralize the process, and the machine learning model still would pick white people at a disproportionate rate over everybody else. They can't explain why. They don't know why that is. There's some variable that's mapping to race that they just don't know about.

But that sort of opacity—this is somebody explaining it to me who just happened to have been inside the company, but it's not as if that's on display for everybody to check out. These kinds of closed systems are picking up patterns we can't explain, and that their creators can't explain. They are also making really, really important decisions based on them. I think it is going to be very important to change how we inspect these systems before we begin trusting them.
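The "variable mapping to race" problem can be made concrete with a few lines of synthetic data: remove the protected attribute from the model's inputs entirely, and a correlated feature still carries most of its signal. Everything below (the "neighborhood" proxy, the 90% correlation) is invented for illustration.

```python
import random

rng = random.Random(42)

# Synthetic applicants: a protected attribute plus a "neutral" feature
# that happens to correlate with it (e.g., neighborhood).
applicants = []
for _ in range(10_000):
    protected = rng.random() < 0.5
    # The proxy agrees with the protected attribute 90% of the time.
    neighborhood = protected if rng.random() < 0.9 else not protected
    applicants.append((protected, neighborhood))

# The protected column is removed from the model's inputs...
features = [(nbhd,) for _, nbhd in applicants]

# ...but the remaining proxy still predicts it almost perfectly.
agree = sum(p == n for p, n in applicants) / len(applicants)
print(f"proxy matches protected attribute {agree:.0%} of the time")
# roughly 90%, so dropping the protected column removed little information
```

This is why address-blinding alone did not neutralize the lender's model: any feature correlated with the protected attribute reintroduces it.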

Anthropomorphism and complex systems

In this book, I'm also trying to look at the way human beings respond to being given an answer by an automated system. There are some very well-established, psychological principles out there that can give us some sense of how people are going to respond when they are told what to do based on an algorithm.

The people who study anthropomorphism, the imparting of intention and human attributes to an automated system, say there's a really well-established pattern. When people are shown a very complex system and given some sort of exposure to that complex system, whether it gives them an answer or whatever it is, it tends to produce in human beings a level of trust in that system that doesn't really have anything to do with reality. ... The more complex the system, the more people tend to trust it.

Related resources:

Oct 11 2018

45mins


Tools for generating deep neural networks with efficient network architectures


The O’Reilly Data Show Podcast: Alex Wong on building human-in-the-loop automation solutions for enterprise machine learning.

In this episode of the Data Show, I spoke with Alex Wong, associate professor at the University of Waterloo, and co-founder of DarwinAI, a startup that uses AI to address foundational challenges with deep learning in the enterprise. As the use of machine learning and analytics become more widespread, we’re beginning to see tools that enable data scientists and data engineers to scale and tackle many more problems and maintain more systems. This includes automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection, and hyperparameter tuning, as well as tools for data engineering and data operations.

Wong and his collaborators are building solutions for enterprises, including tools for generating efficient neural networks and for the performance analysis of networks deployed to edge devices.

Here are some highlights from our conversation:

Using AI to democratize deep learning

Having worked in machine learning and deep learning for more than a decade, both in academia and industry, it became very evident to me that there's a significant barrier to widespread adoption. One of the main issues is that it is very difficult to design, build, and explain deep neural networks, especially ones that meet operational requirements. The process just involves way too much guesswork and trial and error, so it's hard to build systems that work in real-world industrial settings.

One of the out-of-the-box moments we had—pretty much the only way we could actually do this—was to reinvent the way we think about building deep neural networks. Which is, can we actually leverage AI itself as a collaborative technology? Can we build something that works with people to design and build much better networks? And that led to the start of DarwinAI—our main vision is pretty much enabling deep learning for anyone, anywhere, anytime.

Generative synthesis

The general concept of generative synthesis is to find the best generative model that meets your particular operational requirements (which could be size, speed, accuracy, and so forth). So, the intuition behind that is that we treat it as a large constrained optimization problem where we try to identify the generative machine that will actually give you the highest performance. We have a unique way of having an interplay between a generator and an inquisitor where the generator will generate networks that the inquisitor probes and understands. Then it learns intuition about what makes a good network and what doesn't.
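At toy scale, the constrained-optimization framing can be sketched as a search that rejects any candidate violating the operational requirement and keeps the best performer. This is only a caricature: DarwinAI's generator/inquisitor machinery is far more sophisticated, and the accuracy estimate below is a made-up stand-in for what the inquisitor would learn by probing real networks.

```python
import random

def estimated_accuracy(width, depth):
    # Made-up stand-in for a learned performance estimate: bigger helps,
    # with diminishing returns. A real system probes trained networks.
    return 1.0 - 1.0 / (1.0 + 0.05 * width * depth)

def parameter_count(width, depth):
    return width * width * depth  # rough dense-layer parameter count

def search(budget=50_000, trials=200, seed=0):
    """Random search for the best architecture under a size constraint."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        width, depth = rng.randint(8, 256), rng.randint(1, 12)
        if parameter_count(width, depth) > budget:
            continue  # violates the operational requirement
        acc = estimated_accuracy(width, depth)
        if best is None or acc > best[0]:
            best = (acc, width, depth)
    return best

acc, width, depth = search()
print(f"best within budget: width={width}, depth={depth}, est. accuracy={acc:.3f}")
```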

Related resources:

Dec 06 2018 · 32mins

The technical, societal, and cultural challenges that come with the rise of fake media


The O’Reilly Data Show Podcast: Siwei Lyu on machine learning for digital media forensics and image synthesis.

In this episode of the Data Show, I spoke with Siwei Lyu, associate professor of computer science at the University at Albany, State University of New York. Lyu is a leading expert in digital media forensics, a field of research into tools and techniques for analyzing the authenticity of media files. Over the past year, there have been many stories written about the rise of tools for creating fake media (mainly images, video, audio files). Researchers in digital image forensics haven’t exactly been standing still, though. As Lyu notes, advances in machine learning and deep learning have also found a receptive audience among the forensics community.

We had a great conversation spanning many topics including:

  • The many indicators used by forensic experts and forgery detection systems

  • Balancing “open” research with risks that come with it—including “tipping off” adversaries

  • State-of-the-art detection tools today, and what the research community and funding agencies are working on over the next few years.

  • Technical, societal, and cultural challenges that come with the rise of fake media.

Here are some highlights from our conversation:

Imbalance between digital forensics researchers and forgers

In theory, it looks difficult to synthesize media. This is true, but on the other hand, there are factors to consider on the side of the forgers. The first is the fact that most people working in forensics, like myself, usually just write a paper and publish it. So, the details of our detection algorithm become available immediately. On the other hand, people making fake media are usually secretive; they don't usually publish the details of their algorithms. So, there's a kind of imbalance between the information on the forensic side and the forgery side.

The other issue is user habit. Even if some of the fakes are very low quality, a typical user looks at them just for a second; sees something interesting, exciting, sensational; and helps distribute them without actually checking the authenticity. This actually helps fake media to broadcast very, very fast. Even though we have algorithms to detect fake media, these tools are probably not fast enough to actually stop the spread.

... Then there are the actual incentives for this kind of work. For forensics, even if we have the tools and the time to catch a piece of fake media, we don't get anything. But for people actually making the fake media, there is more financial or other forms of incentive to do that.

Related resources:

Feb 14 2019 · 30mins


Machine learning for operational analytics and business intelligence


The O’Reilly Data Show Podcast: Peter Bailis on data management, ML benchmarks, and building next-gen tools for analysts.

In this episode of the Data Show, I speak with Peter Bailis, founder and CEO of Sisu, a startup that is using machine learning to improve operational analytics. Bailis is also an assistant professor of computer science at Stanford University, where he conducts research into data-intensive systems and where he is co-founder of the DAWN Lab.

We had a great conversation spanning many topics, including:

  • His personal blog, which contains some of the best explainers on emerging topics in data management and distributed systems.
  • The role of machine learning in operational analytics and business intelligence.
  • Machine learning benchmarks—specifically two recent ML initiatives that he’s been involved with: DAWNBench and MLPerf.
  • Trends in data management and in tools for machine learning development, governance, and operations.

Related resources:

Oct 10 2019 · 51mins

Machine learning and analytics for time series data


The O’Reilly Data Show Podcast: Arun Kejariwal and Ira Cohen on building large-scale, real-time solutions for anomaly detection and forecasting.

In this episode of the Data Show, I speak with Arun Kejariwal of Facebook and Ira Cohen of Anodot (full disclosure: I’m an advisor to Anodot). This conversation stemmed from a recent online panel discussion we did, where we discussed time series data, and, specifically, anomaly detection and forecasting. Both Kejariwal (at Machine Zone, Twitter, and Facebook) and Cohen (at HP and Anodot) have extensive experience building analytic and machine learning solutions at large scale, and both have worked extensively with time-series data. The growing interest in AI and machine learning has not been confined to computer vision, speech technologies, or text. In the enterprise, there is strong interest in using similar automation tools for temporal data and time series.

We had a great conversation spanning many topics, including:

  • Why businesses should care about anomaly detection and forecasting; specifically, we delve into examples outside of IT Operations & Monitoring.
  • (Specialized) techniques and tools for automating some of the relevant tasks, including signal processing, statistical methods, and machine learning.
  • What are some of the key features of an anomaly detection or forecasting system.
  • What lies ahead for large-scale systems for time series analysis.
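Production systems like the ones Kejariwal and Cohen have built are far richer, but the "statistical methods" category mentioned above can be illustrated with a minimal rolling z-score detector. The window size and threshold are arbitrary choices for the sketch, not values from the episode.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=10, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the trailing window. A toy detector; real systems also
    handle trend, seasonality, and adaptive thresholds."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

data = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1, 9.9, 10.0,
        10.1, 25.0, 10.0, 9.9]
print(rolling_zscore_anomalies(data))   # the spike at index 11 is flagged
```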

Related resources:

Sep 26 2019 · 40mins

Understanding deep neural networks


The O’Reilly Data Show Podcast: Michael Mahoney on developing a practical theory for deep learning.

In this episode of the Data Show, I speak with Michael Mahoney, a member of RISELab, the International Computer Science Institute, and the Department of Statistics at UC Berkeley. A physicist by training, Mahoney has been at the forefront of many important problems in large-scale data analysis. On the theoretical side, his work spans algorithmic and statistical methods for matrices, graphs, regression, optimization, and related problems. On the applications side, he has contributed to systems used for internet and social media analysis, social network analysis, as well as for a host of applications in the physical and life sciences. Most recently, he has been working on deep neural networks, specifically developing theoretical methods and practical diagnostic tools that should be helpful to practitioners who use deep learning.

Analyzing deep neural networks with WeightWatcher. Image by Michael Mahoney and Charles Martin, used with permission.

We had a great conversation spanning many topics.

Related resources:

Sep 12 2019 · 39mins

Becoming a machine learning practitioner


The O’Reilly Data Show Podcast: Kesha Williams on how she added machine learning to her software developer toolkit.

In this episode of the Data Show, I speak with Kesha Williams, technical instructor at A Cloud Guru, a training company focused on cloud computing. As a full stack web developer, Williams became intrigued by machine learning and started teaching herself the ML tools on Amazon Web Services. Fast forward to today, Williams has built some well-regarded Alexa skills, mastered ML services on AWS, and has now firmly added machine learning to her developer toolkit.

Anatomy of an Alexa skill. Image by Kesha Williams, used with permission.

We had a great conversation spanning many topics, including:

  • How she got started and made the transition into a full-fledged machine learning practitioner.

  • The evolution of ML tools and learning resources, and how accessible they’ve become for developers.

  • How to build and monetize Alexa skills. Along the way, we took a deep dive and discussed some of the more interesting Alexa skills she has built, as well as one that she really admires.

Related resources:

Aug 29 2019 · 33mins

Labeling, transforming, and structuring training data sets for machine learning


The O’Reilly Data Show Podcast: Alex Ratner on how to build and manage training data with Snorkel.

In this episode of the Data Show, I speak with Alex Ratner, project lead for Stanford’s Snorkel open source project; Ratner also recently garnered a faculty position at the University of Washington and is currently working on a company supporting and extending the Snorkel project. Snorkel is a framework for building and managing training data. Based on our survey from earlier this year, labeled data remains a key bottleneck for organizations building machine learning applications and services.

Ratner was a guest on the podcast a little over two years ago when Snorkel was a relatively new project. Since then, Snorkel has added more features, expanded into computer vision use cases, and now boasts many users, including Google, Intel, IBM, and other organizations. Along with his thesis advisor professor Chris Ré of Stanford, Ratner and his collaborators have long championed the importance of building tools aimed squarely at helping teams build and manage training data. With today’s release of Snorkel version 0.9, we are a step closer to having a framework that enables the programmatic creation of training data sets.
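Snorkel's core idea, writing noisy programmatic labeling functions whose votes are combined into training labels, can be sketched in a few lines. This toy uses a simple majority vote (Snorkel's actual label model weights functions by their estimated accuracies), and the example functions and data are invented for illustration.

```python
from collections import Counter

ABSTAIN = -1

# Hypothetical labeling functions for spam (1) vs. ham (0).
# Each one is a cheap, noisy heuristic that may abstain.
def lf_contains_offer(text):
    return 1 if "offer" in text.lower() else ABSTAIN

def lf_contains_urgent(text):
    return 1 if "urgent" in text.lower() else ABSTAIN

def lf_short_greeting(text):
    return 0 if text.lower().startswith(("hi", "hello")) else ABSTAIN

LFS = [lf_contains_offer, lf_contains_urgent, lf_short_greeting]

def label(text):
    """Majority vote over non-abstaining labeling functions."""
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

print(label("URGENT: limited offer inside"))   # 1 (spam)
print(label("Hi Ben, see you at Strata"))      # 0 (ham)
```

The payoff of this style is that labels come from code rather than manual annotation, so a training set can be regenerated or revised by editing the functions.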

Snorkel pipeline for data labeling. Source: Alex Ratner, used with permission.

We had a great conversation spanning many topics, including:

  • Why he and his collaborators decided to focus on "data programming" and tools for building and managing training data.

  • A tour through Snorkel, including its target users and key components.

  • What’s in the newly released version (v 0.9) of Snorkel.

  • The number of Snorkel’s users has grown quite a bit since we last spoke, so we went through some of the common use cases for the project.

  • Data lineage, AutoML, and end-to-end automation of machine learning pipelines.

  • Holoclean and other projects focused on data quality and data programming.

  • The need for tools that can ease the transition from raw data to derived data (e.g., entities), insights, and even knowledge.

Related resources:

Aug 15 2019 · 40mins

Make data science more useful


The O’Reilly Data Show Podcast: Cassie Kozyrkov on connecting data and AI to business.

In this episode of the Data Show, I speak with Cassie Kozyrkov, technical director and chief decision scientist at Google Cloud. She describes "decision intelligence" as an interdisciplinary field concerned with all aspects of decision-making, and which combines data science with the behavioral sciences. Most recently she has been focused on developing best practices that can help practitioners make safe, effective use of AI and data. Kozyrkov uses her platform to help data scientists develop skills that will enable them to connect data and AI with their organizations' core businesses.

We had a great conversation spanning many topics, including:

  • How data science can be more useful

  • The importance of the human side of data

  • The leadership talent shortage in data science

  • Is data science a bubble?

Related resources:

Aug 01 2019 · 35mins

Acquiring and sharing high-quality data


The O’Reilly Data Show Podcast: Roger Chen on the fair value and decentralized governance of data.

In this episode of the Data Show, I spoke with Roger Chen, co-founder and CEO of Computable Labs, a startup focused on building tools for the creation of data networks and data exchanges. Chen has also served as co-chair of O'Reilly's Artificial Intelligence Conference since its inception in 2016. This conversation took place the day after Chen and his collaborators released an interesting new white paper, Fair value and decentralized governance of data. Current-generation AI and machine learning technologies rely on large amounts of data, and to the extent they can use their large user bases to create “data silos,” large companies in large countries (like the U.S. and China) enjoy a competitive advantage. With that said, we are awash in articles about the dangers posed by these data silos. Privacy and security, disinformation, bias, and a lack of transparency and control are just some of the issues that have plagued the perceived owners of “data monopolies.”

In recent years, researchers and practitioners have begun building tools focused on helping organizations acquire, build, and share high-quality data. Chen and his collaborators are doing some of the most interesting work in this space, and I recommend their new white paper and accompanying open source projects.

Sequence of basic market transactions in the Computable Labs protocol. Source: Roger Chen, used with permission.

We had a great conversation spanning many topics, including:

  • Why he chose to focus on data governance and data markets.

  • The unique and fundamental challenges in accurately pricing data.

  • The importance of data lineage and provenance, and the approach they took in their proposed protocol.

  • What cooperative governance is and why it's necessary.

  • How their protocol discourages an unscrupulous user from just scraping all data available in a data market.

Related resources:

Jul 18 2019 · 39mins

Tools for machine learning development


The O'Reilly Data Show: Ben Lorica chats with Jeff Meyerson of Software Engineering Daily about data engineering, data architecture and infrastructure, and machine learning.

In this week's episode of the Data Show, we're featuring an interview with Data Show host Ben Lorica from the Software Engineering Daily podcast, where he was interviewed by Jeff Meyerson. Their conversation mainly centered on data engineering, data architecture and infrastructure, and machine learning (ML).

Here are a few highlights:

Tools for productive collaboration

A data catalog, at a high level, basically answers questions around the data that's available and who is using it so an enterprise can understand access patterns. ... The term "data catalog" is generally used when you've gotten to the point where you have a team of data scientists and you need a place where they can use libraries in a setting where they can collaborate, and where they can share not only models but maybe even data pipelines and features. The more advanced data science platforms will have automation tools built in. ... The ideal scenario is the data science platform is not just for prototyping, but also for pushing things to production.

Tools for ML development

We have tools for software development, and now we're beginning to hear about tools for machine learning development—there's a company here at Strata called Comet.ml, and there's another startup called Verta.ai. But what has really caught my attention is an open source project from Databricks called MLflow. When it first came out, I thought, 'Oh, yeah, so we don't have anything like this. Might have a decent chance of success.' But I didn't pay close attention until recently; fast forward to today, there are 80 contributors from 40 companies, and 200+ companies using it.

What's good about MLflow is that it has three components and you're free to pick and choose—you can use one, two, or three. Based on their surveys, the most popular component is the one for tracking and managing machine learning experiments. It's designed to be useful for individual data scientists, but it's also designed to be used by teams of data scientists, so they have documented use cases of MLflow where you have a company managing thousands of models in production.

Jul 03 2019 · 39mins

Enabling end-to-end machine learning pipelines in real-world applications


The O’Reilly Data Show Podcast: Nick Pentreath on overcoming challenges in productionizing machine learning models.

In this episode of the Data Show, I spoke with Nick Pentreath, principal engineer at IBM. Pentreath was an early and avid user of Apache Spark, and he subsequently became a Spark committer and PMC member. Most recently his focus has been on machine learning, particularly deep learning, and he is part of a group within IBM focused on building open source tools that enable end-to-end machine learning pipelines.

We had a great conversation spanning many topics.

Related resources:

Jun 20 2019 · 42mins

Bringing scalable real-time analytics to the enterprise


The O’Reilly Data Show Podcast: Dhruba Borthakur and Shruti Bhat on enabling interactive analytics and data applications against live data.

In this episode of the Data Show, I spoke with Dhruba Borthakur (co-founder and CTO) and Shruti Bhat (SVP of Product) of Rockset, a startup focused on building solutions for interactive data science and live applications. Borthakur was the founding engineer of HDFS and creator of RocksDB, while Bhat is an experienced product and marketing executive focused on enterprise software and data products. Their new startup is focused on a few trends I’ve recently been thinking about, including the re-emergence of real-time analytics, and the hunger for simpler data architectures and tools.  Borthakur exemplifies the need for companies to continually evaluate new technologies: while he was the founding engineer for HDFS, these days he mostly works with object stores like S3.

We had a great conversation spanning many topics, including:

  • RocksDB, an open source, embeddable key-value store originated by Facebook, and which is used in several other open source projects.

  • Time-series databases.

  • The importance of having solutions for real-time analytics, particularly now with the renewed interest in IoT applications and rollout of 5G technologies.

  • Use cases for Rockset’s technologies—and more generally, applications of real-time analytics.

  • The Aggregator Leaf Tailer architecture as an alternative to the Lambda architecture.

  • Building data infrastructure in the cloud.

The Aggregator Leaf Tailer (“CQRS for the data world”): A data architecture favored by web-scale companies. Source: Dhruba Borthakur, used with permission.

Related resources:

Jun 06 2019 · 37mins

Applications of data science and machine learning in financial services


The O’Reilly Data Show Podcast: Jike Chong on the many exciting opportunities for data professionals in the U.S. and China.

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China.

We had a great conversation spanning many topics, including:

  • Potential applications of data science in financial services.

  • The current state of data science in financial services in both the U.S. and China.

  • His experience recruiting, training, and managing data science teams in both the U.S. and China.

Here are some highlights from our conversation:

Opportunities in financial services

There's a customer acquisition piece and then there's a customer retention piece. For customer acquisition, we can see that new technologies can really add value by looking at all sorts of data sources that can help a financial service company identify who they want to target to provide those services. So, it's a great place where data science can help find the product market fit, not just at one instance like identifying who you want to target, but also in a continuous form where you can evolve a product and then continuously find the audience that would best fit the product and continue to analyze the audience so you can design the next generation product. ... Once you have a specific cohort of users who you want to target, there's a need to be able to precisely convert them, which means understanding the stage of the customer's thought process and understanding how to form the narrative to convince the user or the customer that a particular piece of technology or particular piece of service is the current service they need.

... On the customer serving or retention side, for financial services we commonly talk about building hundred-year businesses, right? They have to be profitable businesses, and for financial service to be profitable, there are operational considerations—quantifying risk requires a lot of data science; preventing fraud is really important, and there is garnering the long-term trust with the customer so they stay with you, which means having the work ethic to be able to take care of customer's data and able to serve the customer better with automated services whenever and wherever the customer is. It's all those opportunities where I see we can help serve the customer by having the right services presented to them and being able to serve them in the long term.

Opportunities in China

A few important areas in the financial space in China include mobile payments, wealth management, lending, and insurance—basically, the major areas for the financial industry.

For these areas, China may be a forerunner in using internet technologies, especially mobile internet technologies for FinTech, and I think the wave started way back in the 2012/2013 time frame. If you look at mobile payments, like Alipay and WeChat, those have hundreds of millions of active users. The latest data from Alipay is about 608 million users, and these are monthly active users we're talking about. This is about two times the U.S. population actively using Alipay on a monthly basis, which is a crazy number if you consider all the data that can generate and all the things you can see people buying to be able to understand how to serve the users better.

If you look at WeChat, they're boasting one billion users, monthly active users, early this year. Those are the huge players, and with that amount of traffic, they are able to generate a lot of interest for the lower-frequency services like wealth management and lending, as well as insurance.

Related resources:

May 23 2019 · 42mins

Real-time entity resolution made accessible


The O’Reilly Data Show Podcast: Jeff Jonas on the evolution of entity resolution technologies.

In this episode of the Data Show, I spoke with Jeff Jonas, CEO, founder and chief scientist of Senzing, a startup focused on making real-time entity resolution technologies broadly accessible. He was previously a fellow and chief scientist of context computing at IBM. Entity resolution (ER) refers to techniques and tools for identifying and linking manifestations of the same entity/object/individual. Ironically, ER itself has many different names (e.g., record linkage, duplicate detection, object consolidation/reconciliation, etc.).

ER is an essential first step in many domains, including marketing (cleaning up databases), law enforcement (background checks and counterterrorism), and financial services and investing. Knowing exactly who your customers are is an important task for security, fraud detection, marketing, and personalization. The proliferation of data sources and services has made ER very challenging in the internet age. In addition, many applications now increasingly require near real-time entity resolution.

We had a great conversation spanning many topics including:

  • Why ER is interesting and challenging

  • How ER technologies have evolved over the years

  • How Senzing is working to democratize ER by making real-time AI technologies accessible to developers

  • Some early use cases for Senzing’s technologies

  • Some items on their research agenda

Here are a few highlights from our conversation:

Entity resolution through the years

In the early '90s, I worked on a much more advanced version of entity resolution for the casinos in Las Vegas and created software called NORA, non-obvious relationship awareness. Its purpose was to help casinos better understand who they were doing business with. We would ingest data from the loyalty club, everybody making hotel reservations, people showing up without reservations, everybody applying for jobs, people terminated, vendors, and 18 different lists of different kinds of bad people, some of them card counters (which aren't that bad), some cheaters. And they wanted to figure out across all these identities when somebody was the same, and then when people were related. Some people were using 32 different names and a bunch of different social security numbers.

... Ultimately, IBM bought my company and this technology became what is known now at IBM as “identity insight.” Identity insight is a real-time entity resolution engine that gets used to solve many kinds of problems. MoneyGram implemented it and their fraud complaints dropped 72%. They saved a few hundred million just in their first few years.

... But while at IBM, I had a grand vision about a new type of entity resolution engine that would have been unlike anything that's ever existed. It's almost like a Swiss Army knife for ER.

Recent developments

The Senzing entity resolution engine works really well on two records from a domain that you've never even seen before. Say you've never done entity resolution on restaurants from Singapore. The first two records you feed it, it's really, really already smart. And then as you feed it more data, it gets smarter and smarter.

... So, there are two things that we've intertwined. One is common sense. One type of common sense is the names—Dick, Dickie, Richie, Rick, Ricardo are all part of the same name family. Why should it have to study millions and millions of records to learn that again?

... Next to common sense, there's real-time learning. In real-time learning, we do a few things. You might have somebody named Bob, but who now goes by a nickname or an alias of Andy. Eventually, you might come to learn that. So, now you know you have to learn over time that Bob also has this nickname, and Bob lived at three addresses, and this is his credit card number, and now he's got four phone numbers. So you want to learn those over time.

... These systems we're creating, our entity resolution systems—which really resolve entities and graph them (call it index of identities and how they're related)—never has to be reloaded. It literally cleans itself up in the past. You can do maintenance on it while you're querying it, while you're loading new transactional data, while you're loading historical data. There's nothing else like it that can work at this scale. It's really hard to do.
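Senzing's engine is far more capable, but the two ingredients Jonas describes, name "common sense" and merging records that share identifiers, can be illustrated with a toy resolver. The nickname table and the records below are invented for illustration.

```python
# Toy entity resolver: normalize nicknames into a name family, then greedily
# merge records that share a canonical name or an identifier (phone/SSN-style).
NAME_FAMILY = {"dick": "richard", "dickie": "richard", "richie": "richard",
               "rick": "richard", "ricardo": "richard", "bob": "robert",
               "andy": "andrew"}

def canon(name):
    n = name.lower()
    return NAME_FAMILY.get(n, n)

def resolve(records):
    """Greedy merge: records sharing a canonical name or any identifier
    collapse into one entity. Real ER systems score many weaker signals
    (addresses, dates, fuzzy matches) and handle transitive merges."""
    entities = []
    for rec in records:
        keys = {("name", canon(rec["name"]))} | {("id", i) for i in rec["ids"]}
        for ent in entities:
            if ent["keys"] & keys:          # overlap: same entity
                ent["keys"] |= keys
                ent["records"].append(rec)
                break
        else:
            entities.append({"keys": keys, "records": [rec]})
    return entities

recs = [{"name": "Dick", "ids": {"555-1234"}},
        {"name": "Richard", "ids": {"555-9999"}},
        {"name": "Alice", "ids": {"555-0000"}}]
print(len(resolve(recs)))   # 2 entities: Dick/Richard merge, Alice stays apart
```

The nickname table plays the role of the built-in "common sense" Jonas mentions, while accumulating keys on each entity mirrors learning new aliases and identifiers over time.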

Related resources:

May 09 2019 · 27mins

Why companies are in need of data lineage solutions


The O’Reilly Data Show Podcast: Neelesh Salian on data lineage, data governance, and evolving data platforms.

In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data—specifically, labeled data for training models. There are also many other considerations—including security, privacy, reliability/safety—that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up.

There are several San Francisco Bay Area companies that have embarked on building data lineage systems—including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It's something that is borne out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I'm describing data lineage: think of it as a journey for data. The data takes a journey entering into your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew exactly what journey the data took, and what went into bringing it into your data warehouse or any other storage appliance you use, that would be really useful.

... Think about data lineage as helping issues about quality of data, understanding if something is corrupted. On the security side, think of GDPR ... which was one of the hot topics I heard about at the Strata Data Conference in London in 2018.

Why companies are suddenly building data lineage solutions

A data lineage system becomes necessary as time progresses. It becomes easier for maintainability. You need it for audit trails, for security and compliance. But you also need to think of the benefit of managing the data sets you're working with. If you're working with 10 databases, you need to know what's going on in them. If I have to give you a vision of a data lineage system, think of it as a final graph or view of some data set, and it shows you a graph of what it's linked to. Then it gives you some metadata information so you can drill down. Let's say you have corrupted data, let's say you want to debug something. All these cases tie into the actual use cases for which we want to build it.
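Stitch Fix's system isn't public, but the "final graph or view of some data set" Salian describes can be sketched as a directed graph of datasets with per-edge job metadata, queried for upstream ancestors when debugging corrupted data. All dataset and job names below are invented.

```python
# Minimal lineage graph: edges point from a derived dataset back to its
# inputs, annotated with the job that produced it.
class Lineage:
    def __init__(self):
        self.parents = {}   # dataset -> list of (input_dataset, job)

    def record(self, output, inputs, job):
        self.parents.setdefault(output, []).extend((i, job) for i in inputs)

    def ancestors(self, dataset):
        """All upstream datasets: what to inspect when `dataset` looks bad."""
        seen, stack = set(), [dataset]
        while stack:
            for parent, _job in self.parents.get(stack.pop(), []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

lin = Lineage()
lin.record("daily_dashboard", ["clean_orders"], job="build_dashboard")
lin.record("clean_orders", ["raw_orders", "raw_returns"], job="clean_etl")
print(sorted(lin.ancestors("daily_dashboard")))
# ['clean_orders', 'raw_orders', 'raw_returns']
```

A real system would attach richer metadata per node (owners, schemas, timestamps) so the graph supports the audit-trail and debugging use cases described above.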

Related resources:

Apr 25 2019 · 34mins

What data scientists and data engineers can do with current generation serverless technologies


The O’Reilly Data Show Podcast: Avner Braverman on what’s missing from serverless today and what users should expect in the near future.

In this episode of the Data Show, I spoke with Avner Braverman, co-founder and CEO of Binaris, a startup that aims to bring serverless to web-scale and enterprise applications. This conversation took place shortly after the release of a seminal paper from UC Berkeley (“Cloud Programming Simplified: A Berkeley View on Serverless Computing”), and this paper seeded a lot of our conversation during this episode.

Serverless is clearly on the radar of data engineers and architects. In a recent survey, we found 85% of respondents already had parts of their data infrastructure in one of the public clouds, and 38% were already using at least one of the serverless offerings we listed. As more serverless offerings get rolled out—e.g., things like PyWren that target scientists—I expect these numbers to rise.

We had a great conversation spanning many topics, including:

  • A short history of cloud computing.

  • The fundamental differences between serverless and conventional cloud computing.

  • The reasons serverless—specifically AWS Lambda—took off so quickly.

  • What data scientists and data engineers can do with the current generation of serverless offerings.

  • What is missing from serverless today, and what users should expect in the near future.

Apr 11 2019

36mins

It’s time for data scientists to collaborate with researchers in other disciplines

The O’Reilly Data Show Podcast: Forough Poursabzi-Sangdeh on the interdisciplinary nature of interpretable and interactive machine learning.

In this episode of the Data Show, I spoke with Forough Poursabzi-Sangdeh, a postdoctoral researcher at Microsoft Research New York City. Poursabzi works in the interdisciplinary area of interpretable and interactive machine learning. As models and algorithms become more widespread, many important considerations are becoming active research areas: fairness and bias, safety and reliability, security and privacy, and Poursabzi’s area of focus—explainability and interpretability.

We had a great conversation spanning many topics, including:

  • Current best practices and state-of-the-art methods used to explain or interpret deep learning—or, more generally, machine learning models.

  • The limitations of current model interpretability methods.

  • The lack of clear/standard metrics for comparing different approaches used for model interpretability.

  • Many current AI and machine learning applications augment humans, and, thus, Poursabzi believes it’s important for data scientists to work closely with researchers in other disciplines.

  • The importance of using human subjects in model interpretability studies.

Mar 28 2019

36mins

Algorithms are shaping our lives—here’s how we wrest back control

The O’Reilly Data Show Podcast: Kartik Hosanagar on the growing power and sophistication of algorithms.

In this episode of the Data Show, I spoke with Kartik Hosanagar, professor of technology and digital business, and professor of marketing at The Wharton School of the University of Pennsylvania. Hosanagar is also the author of a newly released book, A Human’s Guide to Machine Intelligence, an interesting tour through the recent evolution of AI applications that draws from his extensive experience at the intersection of business and technology.

We had a great conversation spanning many topics, including:

  • The types of unanticipated consequences of which algorithm designers should be aware.

  • The predictability-resilience paradox: as systems become more intelligent and dynamic, they also become more unpredictable, so there are trade-offs algorithm designers must face.

  • Managing risk in machine learning: AI application designers need to weigh considerations such as fairness, security, privacy, explainability, safety, and reliability.

  • A bill of rights for humans impacted by the growing power and sophistication of algorithms.

  • Some best practices for bringing AI into the enterprise.

Mar 14 2019

44mins

Why your attention is like a piece of contested territory

The O’Reilly Data Show Podcast: P.W. Singer on how social media has changed war, politics, and business.

In this episode of the Data Show, I spoke with P.W. Singer, strategist and senior fellow at the New America Foundation, and a contributing editor at Popular Science. He is co-author of an excellent new book, LikeWar: The Weaponization of Social Media, which explores how social media has changed war, politics, and business. The book is essential reading for anyone interested in how social media has become an important new battlefield in a diverse set of domains and settings.

We had a great conversation spanning many topics, including:

  • In light of the 10th anniversary of his earlier book Wired for War, we talked about progress in robotics over the past decade.

  • The challenge posed by the fact that social networks reward virality, not veracity.

  • How the internet has emerged as an important new battlefield.

  • How this new online battlefield changes how conflicts are fought and unfold.

  • How many of the ideas and techniques covered in LikeWar are trickling down from nation-state actors influencing global events, to consulting companies offering services that companies and individuals can use.

Here are some highlights from our conversation:

LikeWar

We spent five years tracking how social media was being used all around the world. ... We looked at everything from how was it being used by militaries, by terrorist groups, by politicians, by teenagers—you name it. The finding of this project is sort of a two-fold play on words. The first is, if you think of cyberwar as the hacking of networks, LikeWar is its twin. It's the hacking of people on the networks by driving ideas viral through a mix of likes and lies.

... Social media began as a space for fun, for entertainment. It then became a communication space. It became a marketplace. It's also turned into a kind of battle space. It's simultaneously all of these things at once, and you can see, for example, Russian information warriors who are using digital marketing techniques and teenage jokes to influence the outcomes of elections. A different example would be ISIS' top recruiter, Junaid Hussain, mimicking how Taylor Swift built her fan army.

A common set of tactics

The second finding of the project was that when you look across all these wildly diverse actors, groups, and organizations, they turned out to be using very similar tactics, very similar approaches. To put it a different way: it's a mode of conflict. There's ways of “winning” that all the different groups are realizing. More importantly, the groups that understand these new rules of the game are the ones that are winning their online wars and having a real effect, whether that real effect is winning a political campaign, winning a corporate marketing campaign, winning a campaign to become a celebrity, or to become the most popular kid in school. Or “winning” might be to do the opposite—to sabotage someone else's campaign to become a leading political candidate.

Feb 28 2019

43mins

The technical, societal, and cultural challenges that come with the rise of fake media

The O’Reilly Data Show Podcast: Siwei Lyu on machine learning for digital media forensics and image synthesis.

In this episode of the Data Show, I spoke with Siwei Lyu, associate professor of computer science at the University at Albany, State University of New York. Lyu is a leading expert in digital media forensics, a field of research into tools and techniques for analyzing the authenticity of media files. Over the past year, there have been many stories written about the rise of tools for creating fake media (mainly images, video, audio files). Researchers in digital image forensics haven’t exactly been standing still, though. As Lyu notes, advances in machine learning and deep learning have also found a receptive audience among the forensics community.

We had a great conversation spanning many topics, including:

  • The many indicators used by forensic experts and forgery detection systems.

  • Balancing “open” research with risks that come with it—including “tipping off” adversaries.

  • State-of-the-art detection tools today, and what the research community and funding agencies are working on over the next few years.

  • Technical, societal, and cultural challenges that come with the rise of fake media.

Here are some highlights from our conversation:

Imbalance between digital forensics researchers and forgers

In theory, it looks difficult to synthesize media. This is true, but on the other hand, there are factors to consider on the side of the forgers. The first is the fact that most people working in forensics, like myself, usually just write a paper and publish it. So, the details of our detection algorithms become available immediately. On the other hand, people making fake media are usually secretive; they don't usually publish the details of their algorithms. So, there's a kind of imbalance between the information on the forensic side and the forgery side.

The other issue is user habit. Even if some of the fakes are very low quality, a typical user looks at them for just a second; sees something interesting, exciting, sensational; and helps distribute them without actually checking their authenticity. This helps fake media broadcast very, very fast. Even though we have algorithms to detect fake media, these tools are probably not fast enough to actually stop the spread.

... Then there are the actual incentives for this kind of work. For forensics, even if we have the tools and the time to catch a piece of fake media, we don't get anything. But for people actually making the fake media, there is more financial or other forms of incentive to do that.

Feb 14 2019

30mins

Using machine learning and analytics to attract and retain employees

The O’Reilly Data Show Podcast: Maryam Jahanshahi on building tools to help improve efficiency and fairness in how companies recruit.

In this episode of the Data Show, I spoke with Maryam Jahanshahi, research scientist at TapRecruit, a startup that uses machine learning and analytics to help companies recruit more effectively. In an upcoming survey, we found that a “skills gap” or “lack of skilled people” was one of the main bottlenecks holding back adoption of AI technologies. Many companies are exploring a variety of internal and external programs to train staff on new tools and processes. The other route is to hire new talent. But recent reports suggest that demand for data professionals is strong and competition for experienced talent is fierce. Jahanshahi and her team are building natural language and statistical tools that can help companies improve their ability to attract and retain talent across many key areas.

Here are some highlights from our conversation:

Optimal job titles

The conventional wisdom in our field has always been that you want to optimize for “the number of good candidates” divided by “the number of total candidates.” ... The thinking is that one of the ways in which you get a good signal-to-noise ratio is if you advertise for a more senior role. ... In fact, we found the number of qualified applicants was lower for the senior data scientist role.

... We saw from some of our behavioral experiments that people were feeling that was too senior a role for them to apply to. What we would call the "confidence gap" was kicking in at that point. It's a pretty well-known phenomenon that there are different groups of the population that are less confident. This has been best characterized in terms of gender. It's the idea that most women only apply for jobs when they meet 100% of the qualifications, whereas most men will apply even with 60% of the qualifications. That was actually manifesting.

Highlighting benefits

We saw a lot of big companies that would offer 401(k), that would offer health insurance or family leave, but wouldn't mention those benefits in the job descriptions. This had an impact on how candidates perceived these companies. Even though it's implied that Coca-Cola is probably going to give you 401(k) and health insurance, not mentioning it changes the way you think of that job.

... So, don't forget the things that really should be there. Even the boring stuff really matters for most candidates. You'd think it would only matter for older candidates, but, actually, millennials and everyone in every age group are very concerned about these things because it's not specifically about the 401(k) plan; it's about what it implies in terms of the company—that the company is going to take care of you, is going to give you leave, is going to provide a good workplace.

Improving diversity

We found the best way to deal with representation at the end of the process is actually to deal with representation early in the process. What I mean by that is having a robust or a healthy candidate pool at the start of the process. We found for data scientist roles, that was about having 100 candidates apply for your job.

... If we're not getting to the point where we can attract 100 applicants, we'll take a look at that job description. We'll see what's wrong with it and what could be turning off candidates; it could be that you're not syndicating the job description well, it's not getting into search results, or it could be that it's actually turning off a lot of people. You could be asking for too many qualifications, and that turns off a lot of people. ... Sometimes it involves taking a step back and taking a look at what we're doing in this process that's not helping us and that's starving us of candidates.

Jan 31 2019

46mins

How machine learning impacts information security

The O’Reilly Data Show Podcast: Andrew Burt on the need to modernize data protection tools and strategies.

In this episode of the Data Show, I spoke with Andrew Burt, chief privacy officer and legal engineer at Immuta, a company building data management tools tuned for data science. Burt and cybersecurity pioneer Daniel Geer recently released a must-read white paper (“Flat Light”) that provides a great framework for how to think about information security in the age of big data and AI. They list important changes to the information landscape and offer suggestions on how to alleviate some of the new risks introduced by the rise of machine learning and AI.

We discussed their new white paper, cybersecurity (Burt was previously a special advisor at the FBI), and an exciting new Strata Data tutorial that Burt will be co-teaching in March.

Privacy and security are converging

The end goal of privacy and the end goal of security are now totally summed up by this idea: how do we control data in an environment where that control is harder and harder to achieve, and in an environment that is harder and harder to understand?

... As we see machine learning become more prominent, what's going to be really fascinating is that, traditionally, both privacy and security are really related to different types of access. One was adversarial access in the case of security; the other is the party you're giving the data to accessing it in a way that aligns with your expectations—that would be a traditional notion of privacy. ... What we're going to start to see is that both fields are going to be more and more worried about unintended inferences.

Data lineage and data provenance

One of the things we say in the paper is that as we move to a world where models and machine learning increasingly take the place of logical instruction-oriented programming, we're going to have less and less source code, and we're going to have more and more source data. And as that shift occurs, what then becomes most important is understanding everything we can about where that data came from, who touched it, and if its integrity has in fact been preserved.

In the white paper, we talk about how, when we think about integrity in this world of machine learning and models, it does us a disservice to think about a binary state, which is the traditional way: “either data is correct or it isn't. Either it's been tampered with, or it hasn't been tampered with.” And that was really the measure by which we judged whether failures had occurred. But when we're thinking not about source code but about source data for models, we need to be moving into more of a probabilistic mode. Because when we're thinking about data, data in itself is never going to be fully accurate. It's only going to be representative to some degree of whatever it's actually trying to represent.
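The binary-versus-probabilistic distinction can be made concrete with a small sketch. The hash check below is the traditional "tampered or not" test; the drift check treats integrity as a matter of degree. The specific statistics and the three-standard-deviation tolerance are hypothetical choices for illustration, not anything prescribed in the white paper:

```python
# Illustrative contrast between binary and probabilistic integrity checks.
# Thresholds and reference statistics are hypothetical, chosen to show the idea.
import hashlib
import statistics

def binary_integrity(payload: bytes, expected_sha256: str) -> bool:
    """Traditional check: the data either matches exactly or it doesn't."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256

def probabilistic_integrity(values, ref_mean, ref_stdev, tolerance=3.0):
    """Source-data check: ask how far a new batch drifts from what the data
    has historically looked like, rather than a yes/no tamper verdict."""
    drift = abs(statistics.mean(values) - ref_mean) / ref_stdev
    return drift <= tolerance, drift

payload = b"model training batch"
digest = hashlib.sha256(payload).hexdigest()
print(binary_integrity(payload, digest))  # → True

# A batch whose mean sits close to the historical mean passes the drift check.
ok, drift = probabilistic_integrity([9.8, 10.1, 10.3],
                                    ref_mean=10.0, ref_stdev=0.5)
print(ok, round(drift, 2))
```

The point of the sketch is the return type: the binary check yields a verdict, while the probabilistic check yields a degree of confidence that the source data still represents what it is supposed to represent.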

Jan 17 2019

39mins
