Rank #1: BERT
Kyle provides a non-technical overview of why Bidirectional Encoder Representations from Transformers (BERT) is a powerful tool for natural language processing projects.
Rank #2: Being Bayesian
This episode explores the root concept of what it is to be Bayesian: describing knowledge of a system probabilistically, having an appropriate prior probability, know how to weigh new evidence, and following Bayes's rule to compute the revised distribution.
We present this concept in a few different contexts but primarily focus on how our bird Yoshi sends signals about her food preferences.
Like many animals, Yoshi is a complex creature whose preferences cannot easily be summarized by a straightforward utility function the way they might in a textbook reinforcement learning problem. Her preferences are sequential, conditional, and evolving. We may not always know what our bird is thinking, but we have some good indicators that give us clues.
Rank #3: Machine Learning Done Wrong
Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls that are made when doing machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning, and possibly some good reminders for experts as well. Our discussion parallels his recent blog postMachine Learning Done Wrong.
Cheng-tao Chu is an entrepreneur who has worked at many well known silicon valley companies. His paper Map-Reduce for Machine Learning on Multicore is the basis for Apache Mahout. His most recent endeavor has just emerged from steath, so please check out OneInterview.io.
Rank #5: Zillow Zestimate
Zillow is a leading real estate information and home-related marketplace. We interviewed Andrew Martin, a data science Research Manager at Zillow, to learn more about how Zillow uses data science and big data to make real estate predictions.
Rank #6: [MINI] type i / type ii errors
In this first mini-episode of the Data Skeptic Podcast, we define and discuss type i and type ii errors (a.k.a. false positives and false negatives).
Rank #7: [MINI] R-squared
How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why different houses have different prices. There's some amount of variance that can be explained by a model, and some amount that cannot be directly measured. R-squared is the ratio of the explained variance to the total variance. It's not a measure of accuracy, it's a measure of the power of one's model.
Rank #8: Let's Talk About Natural Language Processing
This episode reboots our podcast with the theme of Natural Language Processing for the next few months.
We begin with introductions of Yoshi and Linh Da and then get into a broad discussion about natural language processing: what it is, what some of the classic problems are, and just a bit on approaches.
Finishing out the show is an interview with Lucy Park about her work on the KoNLPy library for Korean NLP in Python.
If you want to share your NLP project, please join our Slack channel. We're eager to see what listeners are working on!
Rank #9: Data Science Hiring Processes
Kyle shares a few thoughts on mistakes observed by job applicants and also shares a few procedural insights listeners at early stages in their careers might find value in.
Rank #10: Quantum Computing
In this week's episode, Scott Aaronson, a professor at the University of Texas at Austin, explains what a quantum computer is, various possible applications, the types of problems they are good at solving and much more. Kyle and Scott have a lively discussion about the capabilities and limits of quantum computers and computational complexity.
Rank #11: Crypto
How do people think rationally about small probability events?
What is the optimal statistical process by which one can update their beliefs in light of new evidence?
This episode of Data Skeptic explores questions like this as Kyle consults a cast of previous guests and experts to try and answer the question "What is the probability, however small, that Bigfoot is real?"
Rank #12: Advertising Attribution with Nathan Janos
A conversation with Convertro's Nathan Janos about methodologies used to help advertisers understand the affect each of their marketing efforts (print, SEM, display, skywriting, etc.) contributes to their overall return.
Rank #13: [MINI] Primer on Deep Learning
In this episode, we talk about a high-level description of deep learning. Kyle presents a simple game (pictured below), which is more of a puzzle really, to try and give Linh Da the basic concept.
Thanks to our sponsor for this week, the Data Science Association. Please check out their upcoming Dallas conference at dallasdatascience.eventbrite.com
Rank #14: The Complexity of Learning Neural Networks
Over the past several years, we have seen many success stories in machine learning brought about by deep learning techniques. While the practical success of deep learning has been phenomenal, the formal guarantees have been lacking. Our current theoretical understanding of the many techniques that are central to the current ongoing big-data revolution is far from being sufficient for rigorous analysis, at best. In this episode of Data Skeptic, our host Kyle Polich welcomes guest John Wilmes, a mathematics post-doctoral researcher at Georgia Tech, to discuss the efficiency of neural network learning through complexity theory.
Rank #15: [MINI] Recurrent Neural Networks
RNNs are a class of deep learning models designed to capture sequential behavior. An RNN trains a set of weights which depend not just on new input but also on the previous state of the neural network. This directed cycle allows the training phase to find solutions which rely on the state at a previous time, thus giving the network a form of memory. RNNs have been used effectively in language analysis, translation, speech recognition, and many other tasks.
Rank #16: [MINI] Gradient Descent
Today's mini episode discusses the widely known optimization algorithm gradient descent in the context of hiking in a foggy hillside.
Rank #17: [MINI] Random Forest
Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.
Rank #18: [MINI] The T-Test
The t-test is this week's mini-episode topic. The t-test is a statistical testing procedure used to determine if the mean of two datasets differs by a statistically significant amount. We discuss how a wine manufacturer might apply a t-test to determine if the sweetness, acidity, or some other property of two separate grape vines might differ in a statistically meaningful way.
Rank #19: Big Data Tools and Trends
In this episode, I speak with Raghu Ramakrishnan, CTO for Data at Microsoft. We discuss services, tools, and developments in the big data sphere as well as the underlying needs that drove these innovations.
Rank #20: [MINI] The Chi-Squared Test
The χ2 (Chi-Squared) test is a methodology for hypothesis testing. When one has categorical data, in the form of frequency counts or observations (e.g. Vegetarian, Pescetarian, and Omnivore), split into two or more categories (e.g. Male, Female), a question may arrise such as "Are women more likely than men to be vegetarian?" or put more accurately, "Is any observed difference in the frequency with which women report being vegetarian differ in a statistically significant way from the frequency men report that?"