Rank #1: [MINI] Multiple Regression
This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price.
Unlike a typical episode of Data Skeptic, these show notes are not just supporting material, but are actually featured in the episode.
The site Redfin graciously allows users to download a CSV of results they are viewing. Unfortunately, they limit this extract to 500 listings, but you can still use it to try the same approach on your own using the download link shown in the figure below.
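As a rough illustration of the idea, the sketch below fits a multiple regression by ordinary least squares in plain Python. The six listings, their prices, and the helper names (`solve`, `fit`, `predict`) are all made up for this example and are not taken from the episode or from Redfin data; the prices are generated from an exact linear rule so the recovered coefficients are easy to check.

```python
# Hypothetical listings: (bedrooms, bathrooms, square feet).
homes = [
    (2, 1, 900),
    (3, 2, 1500),
    (3, 1, 1200),
    (4, 3, 2400),
    (4, 2, 2000),
    (5, 3, 3000),
]
# Made-up prices that follow an exact linear rule (an assumption for the demo).
prices = [50_000 + 10_000 * bed + 15_000 * bath + 100 * sqft
          for bed, bath, sqft in homes]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit(features, targets):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0, *f] for f in features]  # prepend a constant term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, feature):
    """Predicted response for one observation vector."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature))


coefs = fit(homes, prices)
estimate = predict(coefs, (3, 2, 1800))
print([round(c, 2) for c in coefs], round(estimate, 2))
```

With real listings the fit will not be exact, and a numerical library would normally replace the hand-written solver.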
Rank #2: Quantum Computing
In this week's episode, Scott Aaronson, a professor at the University of Texas at Austin, explains what a quantum computer is, various possible applications, the types of problems they are good at solving and much more. Kyle and Scott have a lively discussion about the capabilities and limits of quantum computers and computational complexity.
Rank #3: Being Bayesian
This episode explores the root concept of what it is to be Bayesian: describing knowledge of a system probabilistically, having an appropriate prior probability, knowing how to weigh new evidence, and following Bayes's rule to compute the revised distribution.
We present this concept in a few different contexts but primarily focus on how our bird Yoshi sends signals about her food preferences.
Like many animals, Yoshi is a complex creature whose preferences cannot easily be summarized by a straightforward utility function the way they might be in a textbook reinforcement learning problem. Her preferences are sequential, conditional, and evolving. We may not always know what our bird is thinking, but we have some good indicators that give us clues.
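To make the updating step concrete, here is a minimal sketch of Bayes's rule applied to a single yes/no hypothesis. The scenario (Yoshi chirping when offered a food) and every probability in it are invented for illustration; the episode does not give these numbers.

```python
def bayes_update(prior, p_signal_if_true, p_signal_if_false):
    """Return P(hypothesis | signal) using Bayes's rule."""
    evidence = p_signal_if_true * prior + p_signal_if_false * (1 - prior)
    return p_signal_if_true * prior / evidence


# Assumed numbers: before any evidence we are undecided on whether Yoshi
# likes a new food, and a chirp is four times as likely if she does.
prior = 0.5
p_chirp_if_likes = 0.8
p_chirp_if_dislikes = 0.2

after_one_chirp = bayes_update(prior, p_chirp_if_likes, p_chirp_if_dislikes)
# The posterior from the first observation becomes the prior for the second.
after_two_chirps = bayes_update(after_one_chirp, p_chirp_if_likes, p_chirp_if_dislikes)
print(round(after_one_chirp, 3), round(after_two_chirps, 3))
```

Treating the two chirps as independent given the hypothesis is itself an assumption, and probably a generous one for a bird whose preferences are sequential and evolving.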
Rank #4: The Complexity of Learning Neural Networks
Over the past several years, we have seen many success stories in machine learning brought about by deep learning techniques. While the practical success of deep learning has been phenomenal, the formal guarantees have been lacking. Our current theoretical understanding of the many techniques that are central to the ongoing big-data revolution is far from sufficient for rigorous analysis. In this episode of Data Skeptic, our host Kyle Polich welcomes guest John Wilmes, a mathematics post-doctoral researcher at Georgia Tech, to discuss the efficiency of neural network learning through complexity theory.
Rank #5: Advertising Attribution with Nathan Janos
A conversation with Convertro's Nathan Janos about methodologies used to help advertisers understand the effect each of their marketing efforts (print, SEM, display, skywriting, etc.) has on their overall return.
Rank #6: The Library Problem
We close out 2016 with a discussion of a basic interview question which might get asked when applying for a data science job. Specifically, we consider how a library might build a model to predict whether a book will be returned late.
Rank #7: [MINI] Primer on Deep Learning
In this episode, we talk about a high-level description of deep learning. Kyle presents a simple game (pictured below), which is more of a puzzle really, to try and give Linh Da the basic concept.
Thanks to our sponsor for this week, the Data Science Association. Please check out their upcoming Dallas conference at dallasdatascience.eventbrite.com
Rank #8: Crypto
How do people think rationally about small probability events?
What is the optimal statistical process by which one can update their beliefs in light of new evidence?
This episode of Data Skeptic explores questions like this as Kyle consults a cast of previous guests and experts to try and answer the question "What is the probability, however small, that Bigfoot is real?"
Rank #9: Zillow Zestimate
Zillow is a leading real estate information and home-related marketplace. We interviewed Andrew Martin, a data science Research Manager at Zillow, to learn more about how Zillow uses data science and big data to make real estate predictions.
Rank #10: Big Data Tools and Trends
In this episode, I speak with Raghu Ramakrishnan, CTO for Data at Microsoft. We discuss services, tools, and developments in the big data sphere as well as the underlying needs that drove these innovations.
Rank #11: Transfer Learning
Sebastian Ruder is a research scientist at DeepMind. In this episode, he joins us to discuss the state of the art in transfer learning and his contributions to it.
Rank #12: Data Science at eHarmony
I'm joined this week by Jon Morra, director of data science at eHarmony, to discuss a variety of ways in which machine learning and data science are being applied to help connect people for successful long-term relationships.
Interesting open source projects mentioned in the interview include Face-parts, a web service for detecting faces and extracting a robust set of fiducial markers (features) from the image, and Aloha, a Scala-based machine learning library. You can learn more about these and other interesting projects at the eHarmony GitHub page.
In the wrap up, Jon mentioned the LA Machine Learning meetup which he runs. This is a great resource for LA residents separate and complementary to datascience.la groups, so consider signing up for all of the above and I hope to see you there in the future.
Rank #14: Auditing Algorithms
Algorithms are pervasive in our society and make thousands of automated decisions on our behalf every day. The possibility of digital discrimination is a very real threat, and it is very plausible for discrimination to occur accidentally (i.e. outside the intent of the system designers and programmers). Christian Sandvig joins us in this episode to talk about his work and the concept of auditing algorithms.
Christian Sandvig (@niftyc) has a PhD in communications from Stanford and is currently an Associate Professor of Communication Studies and Information at the University of Michigan. His research studies the predictable and unpredictable effects that algorithms have on culture. His work exploring the topic of auditing algorithms has framed the conversation of how and why we might want to have oversight on the way algorithms affect our lives. His writing appears in numerous publications including The Social Media Collective, The Huffington Post, and Wired.
One of his papers we discussed in depth on this episode was Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms, which is well worth a read.
Rank #15: [MINI] Entropy
Classically, entropy is a measure of disorder in a system. From a statistical perspective, it is more useful to say it's a measure of the unpredictability of the system. In this episode we discuss how information reduces the entropy in deciding whether or not Yoshi the parrot will like a new chew toy. A few other everyday examples help us examine why entropy is a nice metric for constructing a decision tree.
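A small sketch of that calculation follows. It computes Shannon entropy in bits and the information gain from splitting on one feature, which is the quantity a decision tree typically compares across candidate splits. The eight toys and the "has a bell" feature are hypothetical stand-ins, not data from the episode.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of `labels` minus the size-weighted entropy of its split `groups`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


# Hypothetical record of whether Yoshi liked each toy (True) or not (False),
# grouped by whether the toy has a bell.
with_bell = [True, True, True, False]
without_bell = [True, False, False, False]
all_toys = with_bell + without_bell

before = entropy(all_toys)  # four liked, four not: maximally unpredictable
gain = information_gain(all_toys, [with_bell, without_bell])
print(round(before, 3), round(gain, 3))
```

Knowing whether a toy has a bell removes some, but not all, of the uncertainty, so the gain is positive but less than the full starting entropy.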
Rank #16: [MINI] Random Forest
Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.
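The two kinds of randomness mentioned above can be sketched without any tree-building code at all. The helper names and the bookstore-flavored feature names below are hypothetical, and the tree learner itself is left out; note also that many implementations draw a fresh feature subset at every split rather than once per tree, as is done here for brevity.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features for one tree to consider."""
    return rng.sample(feature_names, k)


def majority_vote(predictions):
    """Combine the individual trees' predictions into one answer."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-ins for ten training rows
features = ["genre", "price", "page_count", "author"]  # assumed feature names

# Each (sample, subset) pair is what one tree in the forest would be trained on.
plans = [(bootstrap_sample(rows, rng), random_feature_subset(features, 2, rng))
         for _ in range(3)]
vote = majority_vote(["sells", "sells", "flops"])
print(plans, vote)
```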
Rank #17: [MINI] Auto-correlative functions and correlograms
Rank #18: [MINI] k-d trees
This episode reviews the concept of k-d trees: an efficient data structure for holding multidimensional objects. Kyle gives Linhda a dictionary and asks her to look up words as a way of introducing the concept of binary search. We actually spend most of the episode talking about binary search before getting into k-d trees, but this is a necessary prerequisite.
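The connection between the two ideas can be sketched in a few lines: binary search halves a sorted list at each step, and a k-d tree does the same thing to points by splitting on one coordinate at a time. The word list, the points, and the dictionary-based node layout are illustrative assumptions, not a reference implementation.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return -1


def build_kdtree(points, depth=0):
    """Build a k-d tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def contains(node, target):
    """Exact-match lookup that discards one side of the tree when it can."""
    if node is None:
        return False
    if node["point"] == target:
        return True
    axis = node["axis"]
    if target[axis] < node["point"][axis]:
        return contains(node["left"], target)
    if target[axis] > node["point"][axis]:
        return contains(node["right"], target)
    # Tie on the splitting coordinate: the target could be on either side.
    return contains(node["left"], target) or contains(node["right"], target)


words = ["apple", "banana", "cherry", "grape", "mango", "peach", "plum"]
tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(binary_search(words, "mango"), contains(tree, (8, 1)))
```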
Rank #19: [MINI] p-values
In this mini, we discuss p-values and their use in hypothesis testing, in the context of a hypothetical experiment on plant flowering, and end with a reference to the Particle Fever documentary and how statistical significance played a role.
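For a feel of what a p-value measures, the sketch below computes a one-sided binomial p-value. The setup is assumed rather than taken from the episode: 20 treated plants, 15 of which flower, tested against a null hypothesis that each plant flowers with probability 0.5.

```python
from math import comb


def binomial_p_value(successes, trials, null_rate=0.5):
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, null_rate)."""
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumed experiment: 15 of 20 treated plants flowered.
p_value = binomial_p_value(15, 20)
print(round(p_value, 4))
```

The result is the probability of seeing a count at least this extreme if the null hypothesis were true; it is not the probability that the null hypothesis is true.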
Rank #20: [MINI] The T-Test
The t-test is this week's mini-episode topic. The t-test is a statistical testing procedure used to determine if the means of two datasets differ by a statistically significant amount. We discuss how a wine manufacturer might apply a t-test to determine if the sweetness, acidity, or some other property of two separate grape vines might differ in a statistically meaningful way.
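As one possible illustration, the sketch below computes Welch's version of the two-sample t statistic, which does not assume the two groups share a variance, along with its approximate degrees of freedom. The sweetness readings for the two vines are invented, and the episode does not say which variant of the t-test it has in mind.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


# Hypothetical sweetness measurements from two grape vines.
vine_a = [20.1, 21.3, 19.8, 22.0, 20.7]
vine_b = [18.2, 19.1, 17.9, 18.8, 19.4]

t_stat, dof = welch_t(vine_a, vine_b)
print(round(t_stat, 3), round(dof, 2))
```

Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) handles that step.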