Rank #1: Machine Learning Done Wrong
Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls that are made when doing machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning, and possibly some good reminders for experts as well. Our discussion parallels his recent blog postMachine Learning Done Wrong.
Cheng-tao Chu is an entrepreneur who has worked at many well known silicon valley companies. His paper Map-Reduce for Machine Learning on Multicore is the basis for Apache Mahout. His most recent endeavor has just emerged from steath, so please check out OneInterview.io.
Apr 01 2016
Rank #2: Advertising Attribution with Nathan Janos
A conversation with Convertro's Nathan Janos about methodologies used to help advertisers understand the affect each of their marketing efforts (print, SEM, display, skywriting, etc.) contributes to their overall return.
Jun 06 2014
Rank #3: [MINI] Backpropagation
Backpropagation is a common algorithm for training a neural network. It works by computing the gradient of each weight with respect to the overall error, and using stochastic gradient descent to iteratively fine tune the weights of the network. In this episode, we compare this concept to finding a location on a map, marble maze games, and golf.
Apr 07 2017
Rank #4: Being Bayesian
This episode explores the root concept of what it is to be Bayesian: describing knowledge of a system probabilistically, having an appropriate prior probability, know how to weigh new evidence, and following Bayes's rule to compute the revised distribution.
We present this concept in a few different contexts but primarily focus on how our bird Yoshi sends signals about her food preferences.
Like many animals, Yoshi is a complex creature whose preferences cannot easily be summarized by a straightforward utility function the way they might in a textbook reinforcement learning problem. Her preferences are sequential, conditional, and evolving. We may not always know what our bird is thinking, but we have some good indicators that give us clues.
Oct 26 2018
Rank #5: [MINI] The Girlfriend Equation
Economist Peter Backus put forward "The Girlfriend Equation" while working on his PhD - a probabilistic model attempting to estimate the likelihood of him finding a girlfriend. In this mini episode we explore the soundness of his model and also share some stories about how Linhda and Kyle met.
Nov 28 2014
Rank #6: [MINI] Recurrent Neural Networks
RNNs are a class of deep learning models designed to capture sequential behavior. An RNN trains a set of weights which depend not just on new input but also on the previous state of the neural network. This directed cycle allows the training phase to find solutions which rely on the state at a previous time, thus giving the network a form of memory. RNNs have been used effectively in language analysis, translation, speech recognition, and many other tasks.
Aug 18 2017
Rank #7: [MINI] The T-Test
The t-test is this week's mini-episode topic. The t-test is a statistical testing procedure used to determine if the mean of two datasets differs by a statistically significant amount. We discuss how a wine manufacturer might apply a t-test to determine if the sweetness, acidity, or some other property of two separate grape vines might differ in a statistically meaningful way.
Oct 17 2014
Rank #8: Zillow Zestimate
Zillow is a leading real estate information and home-related marketplace. We interviewed Andrew Martin, a data science Research Manager at Zillow, to learn more about how Zillow uses data science and big data to make real estate predictions.
Sep 01 2017
Rank #9: Quantum Computing
In this week's episode, Scott Aaronson, a professor at the University of Texas at Austin, explains what a quantum computer is, various possible applications, the types of problems they are good at solving and much more. Kyle and Scott have a lively discussion about the capabilities and limits of quantum computers and computational complexity.
Dec 01 2017
Rank #10: Crypto
How do people think rationally about small probability events?
What is the optimal statistical process by which one can update their beliefs in light of new evidence?
This episode of Data Skeptic explores questions like this as Kyle consults a cast of previous guests and experts to try and answer the question "What is the probability, however small, that Bigfoot is real?"
Jul 17 2015
Rank #11: [MINI] Logistic Regression on Audio Data
Logistic Regression is a popular classification algorithm. In this episode, we discuss how it can be used to determine if an audio clip represents one of two given speakers. It assumes an output variable (isLinhda) is a linear combination of available features, which are spectral bands in the discussion on this episode.
Keep an eye on the dataskeptic.com blog this week as we post more details about this project.
Thanks to our sponsor this week, the Data Science Association. Please check out their upcoming conference in Dallas on Saturday, February 18th, 2017 via the link below.
Jan 27 2017
Rank #12: [MINI] Activation Functions
In a neural network, the output value of a neuron is almost always transformed in some way using a function. A trivial choice would be a linear transformation which can only scale the data. However, other transformations, like a step function allow for non-linear properties to be introduced.
Activation functions can also help to standardize your data between layers. Some functions such as the sigmoid have the effect of "focusing" the area of interest on data. Extreme values are placed close together, while values near it's point of inflection change more quickly with respect to small changes in the input. Similarly, these functions can take any real number and map all of them to a finite range such as [0, 1] which can have many advantages for downstream calculation.
In this episode, we overview the concept and discuss a few reasons why you might select one function verse another.
Jun 16 2017
Rank #13: Opinion Polls for Presidential Elections
Recently, we've seen opinion polls come under some skepticism. But is that skepticism truly justified? The recent Brexit referendum and US 2016 Presidential Election are examples where some claims the polls "got it wrong". This episode explores this idea.
Apr 28 2017
Rank #14: [MINI] Multiple Regression
This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price.
Unlike a typical episode of Data Skeptic, these show notes are not just supporting material, but are actually featured in the episode.
The site Redfin gratiously allows users to download a CSV of results they are viewing. Unfortunately, they limit this extract to 500 listings, but you can still use it to try the same approach on your own using the download link shown in the figure below.
Feb 19 2016
Rank #15: Data Science at Patreon
In this week's episode of Data Skeptic, host Kyle Polich talks with guest Maura Church, Patreon's data science manager. Patreon is a fast-growing crowdfunding platform that allows artists and creators of all kinds build their own subscription content service. The platform allows fans to become patrons of their favorite artists- an idea similar the Renaissance times, when musicians would rely on benefactors to become their patrons so they could make more art. At Patreon, Maura's data science team strives to provide creators with insight, information, and tools, so that creators can focus on what they do best-- making art.
On the show, Maura talks about some of her projects with the data science team at Patreon. Among the several topics discussed during the episode include: optical music recognition (OMR) to translate musical scores to electronic format, network analysis to understand the connection between creators and patrons, growth forecasting and modeling in a new market, and churn modeling to determine predictors of long time support.
A more detailed explanation of Patreon's A/B testing framework can be found here
Other useful links to topics mentioned during the show:
Mar 31 2017
Rank #16: [MINI] Automated Feature Engineering
If a CEO wants to know the state of their business, they ask their highest ranking executives. These executives, in turn, should know the state of the business through reports from their subordinates. This structure is roughly analogous to a process observed in deep learning, where each layer of the business reports up different types of observations, KPIs, and reports to be interpreted by the next layer of the business. In deep learning, this process can be thought of as automated feature engineering. DNNs built to recognize objects in images may learn structures that behave like edge detectors in the first hidden layer. Proceeding layers learn to compose more abstract features from lower level outputs. This episode explore that analogy in the context of automated feature engineering.
Linh Da and Kyle discuss a particular image in this episode. The image included below in the show notes is drawn from the work of Lee, Grosse, Ranganath, and Ng in their paper Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations.
Feb 24 2017
Rank #17: [MINI] p-values
In this mini, we discuss p-values and their use in hypothesis testing, in the context of an hypothetical experiment on plant flowering, and end with a reference to the Particle Fever documentary and how statistical significance played a role.
Jun 13 2014
Rank #18: Data Science Hiring Processes
Kyle shares a few thoughts on mistakes observed by job applicants and also shares a few procedural insights listeners at early stages in their careers might find value in.
Dec 28 2018
Rank #19: Applied Data Science in Industry
Sep 06 2019
Rank #20: [MINI] The Chi-Squared Test
The χ2 (Chi-Squared) test is a methodology for hypothesis testing. When one has categorical data, in the form of frequency counts or observations (e.g. Vegetarian, Pescetarian, and Omnivore), split into two or more categories (e.g. Male, Female), a question may arrise such as "Are women more likely than men to be vegetarian?" or put more accurately, "Is any observed difference in the frequency with which women report being vegetarian differ in a statistically significant way from the frequency men report that?"
Feb 06 2015