Linear Digressions

Rank #90 in Technology category

Technology

Updated 2 months ago
Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.


iTunes Ratings

313 Ratings
5 stars: 281
4 stars: 22
3 stars: 3
2 stars: 4
1 star: 3

Thanks for 4 years of awesomeness!

By Eliot01010 - Oct 09 2018
Look forward to listening every week. You guys are the best!

Great stats episode

By Lcat123456789 - Oct 17 2017
Very clear, very cutting-edge, and very helpful!

Latest release on Jul 26, 2020

Rank #1: The Three Types of Data Scientists, and What They Actually Do

If you've been in data science for more than a year or two, chances are you've noticed changes in the field as it's grown and matured. And if you're newer to the field, you may feel like there's a disconnect between lots of different stories about what data scientists should know, or do, or expect from their job. This week, we cover two thought pieces, one that arose from interviews with 35(!) data scientists speaking about what their jobs actually are (and aren't), and one from the head of data science at AirBnb organizing core data science work into three main specialties.

Relevant links:
https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists
https://www.linkedin.com/pulse/one-data-science-job-doesnt-fit-all-elena-grewal

Sep 09 2018

23mins


Rank #2: Troubling Trends In Machine Learning Scholarship

There are a lot of great machine learning papers coming out every day--and, if we're being honest, some papers that are not as great as we'd wish. In some ways this is symptomatic of a field that's growing really quickly, but it's also an artifact of strange incentive structures in academic machine learning, and the fact that sometimes machine learning is just really hard. At the same time, high-quality academic work is critical for maintaining the reputation of the field, so in this episode we walk through a recent paper that spells out some of the most common shortcomings of academic machine learning papers and what we can do to make things better.

Relevant links:
https://arxiv.org/abs/1807.03341

Aug 06 2018

29mins


Rank #3: Agile Development for Data Scientists, Part 1: The Good

If you're a data scientist at a firm that does a lot of software building, chances are good that you've seen or heard engineers sometimes talking about "agile software development." If you don't work at a software firm, agile practices might be newer to you. In either case, we wanted to go through a great series of blog posts about some of the practices from agile that are relevant for how data scientists work, in hopes of inspiring some transfer learning from software development to data science.

Relevant links:
https://www.locallyoptimistic.com/post/agile-analytics-p1/
https://www.locallyoptimistic.com/post/agile-analytics-p2/
https://www.locallyoptimistic.com/post/agile-analytics-p3/

Aug 19 2018

25mins


Rank #4: Bayesian Psychics

Come get a little "out there" with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as "psychics") to chat about Bayesian vs. frequentist statistics.

Aug 18 2015

11mins


Rank #5: The Fourier Transform

The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a bunch of sine and cosine waves, and figure out what the amplitude, frequency and offset of those component waves are. It's a really handy way of re-expressing periodic data--you'll never look at a time series graph the same way again.
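To make the "break it apart into sines and cosines" idea concrete, here's a small invented example (not from the episode) that uses NumPy's FFT to recover the frequency and amplitude of a pure sine wave:

```python
import numpy as np

fs = 64                                   # sampling rate, in Hz
t = np.arange(fs) / fs                    # one second of samples
signal = 2.0 * np.sin(2 * np.pi * 3 * t)  # a 3 Hz sine wave with amplitude 2

# The FFT re-expresses the signal as a sum of sinusoidal components.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(fs, d=1 / fs)

# The dominant component sits at 3 Hz, and its normalized magnitude
# recovers the original amplitude of 2.
peak = int(np.argmax(np.abs(spectrum)))
print(freqs[peak])                        # 3.0
print(2 * np.abs(spectrum[peak]) / fs)    # 2.0 (up to floating point error)
```

The same decomposition works for any periodic signal; a messier waveform just lights up more than one frequency bin.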

Jan 08 2018

15mins


Rank #6: The State of Data Science

How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these questions, so they decided to find out and share their analysis with the world. In this very special interview episode, we welcome Tristan Handy, VP of Marketing at RJMetrics, who will talk about "The State of Data Science Report."

Nov 10 2015

15mins


Rank #7: Yiddish Translation

Imagine a language that is mostly spoken rather than written, contains many words from other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning-based translation system that can convert that language to English. That's the problem that confronted researchers when they set out to automatically translate between Yiddish and English; the tricks they used help us understand a lot about machine translation.

Aug 03 2015

12mins


Rank #8: Model Interpretation (and Trust Issues)

Machine learning algorithms can be black boxes--inputs go in, outputs come out, and what happens in the middle is anybody's guess. But understanding how a model arrives at an answer is critical for interpreting the model, and for knowing if it's doing something reasonable (one could even say... trustworthy). We'll talk about a new algorithm called LIME that seeks to make any model more understandable and interpretable.

Relevant links:
http://arxiv.org/abs/1602.04938
https://github.com/marcotcr/lime/tree/master/lime

Apr 25 2016

16mins


Rank #9: Backpropagation

The reason that neural nets are taking over the world right now is because they can be efficiently trained with the backpropagation algorithm. In short, backprop allows you to adjust the weights of the neural net based on how good of a job the neural net is doing at classifying training examples, thereby getting better and better at making predictions. In this episode: we talk backpropagation, and how it makes it possible to train the neural nets we know and love.
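The chain-rule bookkeeping behind backprop fits in a few lines for the simplest possible "network", a single sigmoid unit; the toy data, learning rate, and iteration count below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, 0.0])      # a simple OR-style target

w, b, lr = rng.normal(size=2), 0.0, 1.0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

before = loss(w, b)
for _ in range(200):
    p = sigmoid(X @ w + b)       # forward pass: current predictions
    grad_z = (p - y) / len(y)    # backward pass: dLoss/dz via the chain rule
    w -= lr * (X.T @ grad_z)     # push the gradient back through the weights
    b -= lr * grad_z.sum()
after = loss(w, b)
print(before, "->", after)       # the loss goes down as training proceeds
```

A real deep network repeats exactly this backward step layer by layer, which is what makes training tractable.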

Feb 29 2016

12mins


Rank #10: Troll Detection

Ever found yourself wasting time reading online comments from trolls? Of course you have; we've all been there (it's 4 AM but I can't turn off the computer and go to sleep--someone on the internet is WRONG!). Now there's a way to use machine learning to automatically detect trolls, and minimize the impact when they try to derail online conversations.

Aug 07 2015

12mins


Rank #11: Rock the ROC Curve

This week: everybody's favorite WWII-era classifier metric! But it's not just for winning wars, it's a fantastic go-to metric for all your classifier quality needs.
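As a quick illustration of what the metric computes, here's a from-scratch ROC sweep on invented scores and labels (no real classifier behind them):

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])   # 1 = positive class

# Sweep the decision threshold from high to low; each threshold gives one
# (false positive rate, true positive rate) point on the ROC curve.
points = []
for thr in np.sort(np.unique(scores))[::-1]:
    pred = scores >= thr
    tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()
    fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()
    points.append((fpr, tpr))

# Area under the curve via the trapezoid rule, starting from (0, 0).
auc, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
for fpr, tpr in points:
    auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
    prev_fpr, prev_tpr = fpr, tpr
print(auc)   # 0.8125: better than chance (0.5), worse than perfect (1.0)
```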

Jan 30 2017

15mins


Rank #12: Guinness

Not to oversell it, but the Student's t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal in this episode.

Oct 07 2015

14mins


Rank #13: A Sprint Through What's New in Neural Networks

Advances in neural networks are moving fast enough that, even though it seems like we talk about them all the time around here, it also always seems like we're barely keeping up.  So this week we have another installment in our "neural nets: they so smart!" series, talking about three topics.  And all the topics this week were listener suggestions, too!

Mar 06 2017

16mins


Rank #14: Inside a Data Analysis: Fraud Hunting at Enron

It's storytime this week--the story, from beginning to end, of how Katie designed and built the main project for Udacity's Intro to Machine Learning class, when she was developing the course. The project was to use email and financial data to hunt for signatures of fraud at Enron, one of the biggest cases of corporate fraud in history. That description makes the project sound pretty clean, but getting the data into the right shape, and even doing some dataset merging (that hadn't ever been done before), made this project much more interesting to design than it might appear. Here's the story of what a data analysis like this looks like...from the inside.

May 16 2016

30mins


Rank #15: Data Engineering

If you’re a data scientist, you know how important it is to keep your data orderly, clean, moving smoothly between different systems, well-documented… there’s a ton of work that goes into building and maintaining databases and data pipelines. This job, that of owner and maintainer of the data being used for analytics, is often the realm of data engineers. From data extraction, transform and loading procedures to the data storage strategy and even the definitions of key data quantities that serve as focal points for a whole organization, data engineers keep the plumbing of data analytics running smoothly.

Sep 24 2018

16mins


Rank #16: Traffic Metering Algorithms

Originally released June 2016

This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it's also super awesome).

Jun 12 2017

18mins


Rank #17: Building Data Science Teams

At many places, data scientists don’t work solo anymore—it’s a team sport. But data science teams aren’t simply teams of data scientists working together. Instead, they’re usually cross-functional teams with engineers, managers, data scientists, and sometimes others all working together to build tools and products around data science. This episode talks about some of those roles on a typical data science team, what the responsibilities are for each role, and what skills and traits are most important for each team member to have.

Nov 12 2018

25mins


Rank #18: Neural Net Inception

When you sleep, the neural pathways in your brain take the "white noise" of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific explanation, but you get the idea). What happens when neural nets are put through the same process? Train a neural net to recognize pictures, and then send through an image of white noise, and it will start to see some weird (but cool!) stuff.

Oct 23 2015

15mins


Rank #19: Feature Importance

Figuring out which features actually matter in a model is harder than you might first guess. When a human makes a decision, you can just ask them--why did you do that? But with machine learning models, not so much. That's why we wanted to talk a bit about both regularization (again) and also other ways that you can figure out which features have the biggest impact on the predictions of your model.
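One common approach in that family (not necessarily the one discussed in the episode) is permutation importance: shuffle one feature at a time and see how much the model's error degrades. A toy sketch on synthetic data, with all coefficients and sizes invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# y depends strongly on feature 0, weakly on feature 1, not at all on feature 2.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

coef = np.linalg.lstsq(X, y, rcond=None)[0]   # fit a linear model

def mse(X_eval):
    return np.mean((X_eval @ coef - y) ** 2)

baseline = mse(X)
importance = []
for j in range(3):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])      # break feature j's link to y
    importance.append(mse(Xp) - baseline)

# The error increase mirrors how much each feature actually matters.
print([round(v, 3) for v in importance])
```

The same trick works for black-box models too, since it only needs predictions, not model internals.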

Mar 27 2017

20mins


Rank #20: A Data Scientist's View of the Fight against Cancer

In this episode, we're taking many episodes' worth of insights and unpacking an extremely complex and important question--in what ways are we winning the fight against cancer, where might that fight go in the coming decade, and how do we know when we're making progress? No matter how tricky you might think this problem is to solve, the fact is, once you get in there trying to solve it, it's even trickier than you thought.

Mar 14 2016

19mins


So long, and thanks for all the fish

All good things must come to an end, including this podcast. This is the last episode we plan to release, and it doesn’t cover data science—it’s mostly reminiscing, thanking our wonderful audience (that’s you!), and marveling at how this thing that started out as a side project grew into a huge part of our lives for over 5 years.

It’s been a ride, and a real pleasure and privilege to talk to you each week. Thanks, best wishes, and good night!

—Katie and Ben

Jul 26 2020

35mins


A Reality Check on AI-Driven Medical Assistants

The data science and artificial intelligence community has made amazing strides in the past few years to algorithmically automate portions of the healthcare process. This episode looks at two computer vision algorithms, one that diagnoses diabetic retinopathy and another that classifies liver cancer, and asks the question—are patients now getting better care, and achieving better outcomes, with these algorithms in the mix? The answer isn’t no, exactly, but it’s not a resounding yes, because these algorithms interact with a very complex system (the healthcare system) and other shortcomings of that system are proving hard to automate away.

Getting a faster diagnosis from an image might not be an improvement if the image is now harder to capture (because of strict data quality requirements associated with the algorithm that wouldn’t stop a human doing the same job). Likewise, an algorithm getting a prediction mostly correct might not be an overall benefit if it introduces more dramatic failures when the prediction happens to be wrong. For every data scientist whose work is deployed into some kind of product, and is being used to solve real-world problems, these papers underscore how important and difficult it is to consider all the context around those problems.

Jul 19 2020

14mins


A Data Science Take on Open Policing Data

A few weeks ago, we put out a call asking data scientists interested in issues of race and racism, or people studying how those topics can be approached with data science methods, to get in touch and come talk to our audience about their work. This week we’re excited to bring on Todd Hendricks, Bay Area data scientist and a volunteer who reached out to tell us about his studies with the Stanford Open Policing dataset.

Jul 13 2020

23mins


Procella: YouTube's super-system for analytics data storage

This is a re-release of an episode that originally ran in October 2019.

If you’re trying to manage a project that serves up analytics data for a few very distinct uses, you’d be wise to consider having custom solutions for each use case, optimized for the needs and constraints of that use case. You also wouldn’t be YouTube, which found itself with this problem (gigantic data needs and several very different use cases for what to do with that data) and went a different way: they built one analytics data system to serve them all. Procella, the system they built, is the topic of our episode today: by deconstructing the system, we dig into the four motivating uses of this system, the complexity they had to introduce to service all four uses simultaneously, and the impressive engineering that has to go into building something that “just works.”

Jul 06 2020

29mins


The Data Science Open Source Ecosystem

Open source software is ubiquitous throughout data science, and enables the work of nearly every data scientist in some way or another. Open source projects, however, are disproportionately maintained by a small number of individuals, some of whom are institutionally supported, but many of whom do this maintenance on a purely volunteer basis. The health of the data science ecosystem depends on the support of open source projects, on an individual and institutional level.

Relevant links:
https://hdsr.mitpress.mit.edu/pub/xsrt4zs2/release/2

Jun 29 2020

23mins


Rock the ROC Curve

This is a re-release of an episode that first ran on January 29, 2017.

This week: everybody's favorite WWII-era classifier metric! But it's not just for winning wars, it's a fantastic go-to metric for all your classifier quality needs.

Jun 21 2020

15mins


Criminology and Data Science

This episode features Zach Drake, a working data scientist and PhD candidate in the Criminology, Law and Society program at George Mason University. Zach specializes in bringing data science methods to studies of criminal behavior, and got in touch after our last episode (about racially complicated recidivism algorithms). Our conversation covers a wide range of topics—common misconceptions around race and crime statistics, how methodologically-driven criminology scholars think about building crime prediction models, and how to think about policy changes when we don’t have a complete understanding of cause and effect in criminology. For the many of us currently re-thinking race and criminal justice, but wanting to be data-driven about it, this conversation with Zach is a must-listen.

Jun 15 2020

30mins


Racism, the criminal justice system, and data science

As protests sweep across the United States in the wake of the killing of George Floyd by a Minneapolis police officer, we take a moment to dig into one of the ways that data science perpetuates and amplifies racism in the American criminal justice system. COMPAS is an algorithm that claims to give a prediction about the likelihood of an offender to re-offend if released, based on the attributes of the individual, and guess what: it shows disparities in the predictions for black and white offenders that would nudge judges toward giving harsher sentences to black individuals.

We dig into this algorithm a little more deeply, unpacking how different metrics give different pictures of the “fairness” of the predictions and what is causing its racially disparate output (to wit: race is explicitly not an input to the algorithm, and yet the algorithm gives outputs that correlate with race—what gives?). Unfortunately it’s not an open-and-shut case of a tuning parameter being off, or the wrong metric being used: instead the biases in the justice system itself are being captured in the algorithm outputs, in such a way that a self-fulfilling prophecy of harsher treatment for black defendants is all but guaranteed. Like many other things this week, this episode left us thinking about bigger, systemic issues, and why it’s proven so hard for years to fix what’s broken.
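The tension between metrics can be reproduced with a toy example (all counts below are invented, not COMPAS data): two groups can have identical precision among those flagged high-risk, while their false positive rates differ by an order of magnitude.

```python
# Invented counts for two groups of 100 defendants each:
# tp = flagged and reoffended, fp = flagged but did not reoffend,
# neg = total who did not reoffend.
groups = {
    "A": {"tp": 40, "fp": 10, "neg": 40},   # base rate of reoffense: 60%
    "B": {"tp": 8, "fp": 2, "neg": 80},     # base rate of reoffense: 20%
}

def rates(g):
    ppv = g["tp"] / (g["tp"] + g["fp"])     # precision among the flagged
    fpr = g["fp"] / g["neg"]                # non-reoffenders flagged anyway
    return ppv, fpr

for name, g in groups.items():
    print(name, rates(g))   # PPV is 0.8 for both; FPR is 0.25 vs 0.025
```

When base rates differ between groups, an algorithm generally cannot equalize both metrics at once, which is one reason "which fairness metric?" is such a loaded question.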

Jun 07 2020

31mins


An interstitial word from Ben

A message from Ben around algorithmic bias, and how our models are sometimes reflections of ourselves.

Jun 05 2020

5mins


Convolutional Neural Networks

This is a re-release of an episode that originally aired on April 1, 2018

If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation details of convolutional networks, and the tricks that make them so good at image tasks.
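The core operation, sliding a small kernel across an image, can be sketched in a few lines; the toy image and edge-detecting kernel below are invented for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    # Valid (no-padding) 2D cross-correlation, the operation convnet
    # layers apply with learned kernels.
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge = np.array([[-1.0, 1.0]])   # responds to left-to-right brightness jumps
print(conv2d(image, edge))       # nonzero only at the vertical edge
```

In a trained convnet the kernel values aren't hand-picked like this; they're learned, and stacking many such layers is what makes the architecture so good at image tasks.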

May 31 2020

21mins


Stein's Paradox

This is a re-release of an episode that was originally released on February 26, 2017.

When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, or get some extra information from the group? The James-Stein estimator tells you how to combine individual and group information to make predictions that, taken over the whole group, are more accurate than if you treated each individual, well, individually.
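A toy numeric sketch of the shrinkage idea (the batting averages and assumed noise variance below are invented, not from the episode):

```python
import numpy as np

obs = np.array([0.400, 0.378, 0.356, 0.333, 0.311, 0.289, 0.267, 0.244])
sigma2 = 0.004                 # assumed sampling variance of each average
grand_mean = obs.mean()

# James-Stein shrinkage: pull each individual estimate toward the group
# mean, more aggressively when estimates are noisy relative to their spread.
k = len(obs)
shrink = 1 - (k - 3) * sigma2 / np.sum((obs - grand_mean) ** 2)
js = grand_mean + shrink * (obs - grand_mean)
print(js.round(3))             # every estimate moves toward the group mean
```

The counterintuitive part is that the shrunken estimates beat the raw ones in total squared error, even though no individual player's data says anything about the others.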

May 24 2020

27mins


Protecting Individual-Level Census Data with Differential Privacy

The power of finely-grained, individual-level data comes with a drawback: it compromises the privacy of potentially anyone and everyone in the dataset. Even for de-identified datasets, there can be ways to re-identify the records or otherwise figure out sensitive personal information. That problem has motivated the study of differential privacy, a set of techniques and definitions for keeping personal information private when datasets are released or used for study. Differential privacy is getting a big boost this year, as it’s being implemented across the 2020 US Census as a way of protecting the privacy of census respondents while still opening up the dataset for research and policy use. When two important topics come together like this, we can’t help but sit up and pay attention.
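A minimal sketch of the Laplace mechanism, one of the basic building blocks of differential privacy: release a count plus noise scaled to sensitivity divided by epsilon. The count, epsilon values, and seed below are made up for illustration.

```python
import numpy as np

def laplace_mechanism(true_value, epsilon, sensitivity=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # A counting query changes by at most 1 when any one person is added
    # to or removed from the dataset, so its sensitivity is 1.
    scale = sensitivity / epsilon
    return true_value + rng.laplace(0.0, scale)

rng = np.random.default_rng(42)
# Smaller epsilon means stronger privacy and noisier released counts.
for eps in (1.0, 0.1):
    print(eps, laplace_mechanism(1000, epsilon=eps, rng=rng))
```

The Census Bureau's actual implementation is far more elaborate, but the core trade-off is the same: the privacy budget epsilon controls how much any individual's presence can shift the released statistics.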

May 18 2020

21mins


Causal Trees

What do you get when you combine the causal inference needs of econometrics with the data-driven methodology of machine learning? Usually these two don’t go well together (deriving causal conclusions from naive data methods leads to biased answers) but economists Susan Athey and Guido Imbens are on the case. This episode explores their algorithm for recursively partitioning a dataset to find heterogeneous treatment effects, or for you ML nerds, applying decision trees to causal inference problems. It’s not a free lunch, but for those (like us!) who love crossover topics, causal trees are a smart approach from one field hopping the fence to another.

Relevant links:
https://www.pnas.org/content/113/27/7353

May 11 2020

15mins


The Grammar Of Graphics

You may not realize it consciously, but beautiful visualizations have rules. The rules are often implicit and manifest themselves as expectations about how the data is summarized, presented, and annotated so you can quickly extract the information in the underlying data using just visual cues. It’s a bit abstract but very profound, and these principles underlie the ggplot2 package in R that makes famously beautiful plots with minimal code. This episode covers a paper by Hadley Wickham (author of ggplot2, among other R packages) that unpacks the layered approach to graphics taken in ggplot2, and makes clear the assumptions and structure of many familiar data visualizations.

May 04 2020

35mins


Gaussian Processes

It’s pretty common to fit a function to a dataset when you’re a data scientist. But in many cases, it’s not clear what kind of function might be most appropriate—linear? quadratic? sinusoidal? some combination of these, and perhaps others? Gaussian processes introduce a nonparametric option where you can fit over all the possible types of functions, using the data points in your dataset as constraints on the results that you get (the idea being that, no matter what the “true” underlying function is, it produced the data points you’re trying to fit). What this means is a very flexible, but, depending on your parameters, not-too-flexible, way to fit complex datasets.
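A toy sketch of GP regression with an RBF kernel, to make the "data points as constraints" idea concrete. The training points, length scale, and jitter value are all invented for illustration:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

X_train = np.array([-2.0, -1.0, 0.0, 1.5])
y_train = np.sin(X_train)       # pretend these are noise-free observations
X_test = np.array([-1.5, 0.5])

jitter = 1e-6                   # tiny diagonal term for numerical stability
K = rbf_kernel(X_train, X_train) + jitter * np.eye(len(X_train))
K_star = rbf_kernel(X_test, X_train)

# Posterior mean of the GP: K_* K^{-1} y (solve rather than invert).
mean = K_star @ np.linalg.solve(K, y_train)
print(mean)   # smooth interpolation between the training observations
```

At the training inputs themselves, the posterior mean reproduces y_train (up to the jitter), which is exactly the sense in which the observed data constrains the space of functions.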

The math underlying GPs gets complex, and the links below contain some excellent visualizations that help make the underlying concepts clearer. Check them out!

Relevant links:
http://katbailey.github.io/post/gaussian-processes-for-dummies/
https://thegradient.pub/gaussian-process-not-quite-for-dummies/
https://distill.pub/2019/visual-exploration-gaussian-processes/

Apr 27 2020

20mins


Keeping ourselves honest when we work with observational healthcare data

The abundance of data in healthcare, and the value we could capture from structuring and analyzing that data, is a huge opportunity. It also presents huge challenges. One of the biggest challenges is how, exactly, to do that structuring and analysis—data scientists working with this data have hundreds or thousands of small, and sometimes large, decisions to make in their day-to-day analysis work. What data should they include in their studies? What method should they use to analyze it? What hyperparameter settings should they explore, and how should they pick a value for their hyperparameters? The thing that’s really difficult here is that, depending on which path they choose among many reasonable options, a data scientist can get really different answers to the underlying question, which makes you wonder how to conclude anything with certainty at all.

The paper for this week’s episode performs a systematic study of many, many different permutations of the questions above on a set of benchmark datasets where the “right” answers are known. Which strategies are most likely to yield the “right” answers? That’s the whole topic of discussion.

Relevant links:
https://hdsr.mitpress.mit.edu/pub/fxz7kr65

Apr 20 2020

19mins


Changing our formulation of AI to avoid runaway risks: Interview with Prof. Stuart Russell

AI is evolving incredibly quickly, and thinking now about where it might go next (and how we as a species and a society should be prepared) is critical. Professor Stuart Russell, an AI expert at UC Berkeley, has a formulation for modifications to AI that we should study and try implementing now to keep it much safer in the long run. Prof. Russell’s new book, “Human Compatible: Artificial Intelligence and the Problem of Control” gives an accessible but deeply thoughtful exploration of why he thinks runaway AI is something we need to be considering seriously now, and what changes in formulation might be a solution. This episode features Prof. Russell as a special guest, exploring the topics in his book and giving more perspective on the long-term possible futures of AI: both good and bad.

Relevant links:
https://www.penguinrandomhouse.com/books/566677/human-compatible-by-stuart-russell/

Apr 13 2020

28mins


Putting machine learning into a database

Most data scientists bounce back and forth regularly between doing analysis in databases using SQL and building and deploying machine learning pipelines in R or python. But if we think ahead a few years, a few visionary researchers are starting to see a world in which the ML pipelines can actually be deployed inside the database. Why? One strong advantage for databases is they have built-in features for data governance, including things like permissioning access and tracking the provenance of data. Adding machine learning as another thing you can do in a database means that, potentially, these enterprise-grade features will be available for ML models too, which will make them much more widely accepted across enterprises with tight IT policies. The papers this week articulate the gap between enterprise needs and current ML infrastructure, how ML in a database could be a way to knit the two closer together, and a proof-of-concept that ML in a database can actually work.

Relevant links:
https://blog.acolyer.org/2020/02/19/ten-year-egml-predictions/
https://blog.acolyer.org/2020/02/21/extending-relational-query-processing/

Apr 06 2020

24mins


The work-from-home episode

Many of us have the privilege of working from home right now, in an effort to keep ourselves and our family safe and slow the transmission of covid-19. But working from home is an adjustment for many of us, and can hold some challenges compared to coming in to the office every day. This episode explores this a little bit, informally, as we compare our new work-from-home setups and reflect on what’s working well and what we’re finding challenging.

Mar 29 2020

29mins


Understanding Covid-19 transmission: what the data suggests about how the disease spreads

Covid-19 is turning the world upside down right now. One thing that’s extremely important to understand, in order to fight it as effectively as possible, is how the virus spreads, and especially how much of the spread of the disease comes from carriers who are experiencing no or mild symptoms but are contagious anyway. This episode digs into the epidemiological model that was published in Science this week—this model finds that the data suggests that the majority of carriers of the coronavirus, 80-90%, do not have a detected disease. This has big implications for the importance of social distancing as a way to get the pandemic under control, and explains why a more comprehensive testing program is critical for the United States.

Also, in lighter news, Katie (a native of Dayton, Ohio) lays a data-driven claim for just declaring the University of Dayton Flyers to be the 2020 NCAA College Basketball champions.

Relevant links:
https://science.sciencemag.org/content/early/2020/03/13/science.abb3221

Mar 23 2020

25mins

