An interview with Dan Becker, Team Lead of Kaggle Learn & Product Lead of Kaggle Kernels

Sayak Paul
8 min read · Sep 7, 2019


This is a part of the interview series I have started on data science and machine learning. These interviews are with people who have inspired me and taught me (and are still teaching me) about these beautiful subjects. My main purpose is to gather insights about real-world project experiences, perspectives on learning new things, and some fun facts, and thereby to enrich the community in the process.

This is where you can find all the interviews done so far.

Today, I have Dan Becker with me. Dan currently heads the Kaggle Learn team (which he also founded) at Google and is the Product Lead for Kaggle Kernels. Prior to joining Google, he was a Product Director at DataRobot. He has supervised data science consulting projects for six companies in the Fortune 100, and he has contributed to two of the major deep learning libraries, TensorFlow and Keras. He finished 2nd (out of 1,353 teams) in the $3 million Heritage Health Prize data mining competition. His DataCamp course Deep Learning in Python is one of the best introductory courses on the subject. You can learn more about Dan here.

I would like to wholeheartedly thank Dan for taking the time to do this interview. I hope this interview serves a purpose towards the betterment of data science and machine learning communities in general :)


Sayak: Hi Dan! Thank you for doing this interview. It’s a pleasure to have you here today.

Dan: Thanks. It’s a pleasure to talk with you.

Sayak: Maybe you could start by introducing yourself — what is your current job and what are your responsibilities over there?

Dan: I started Kaggle Learn in 2018. I started Learn in response to something really frustrating about how data science is broadly taught elsewhere.

I’d done machine learning for industry, and I’d been involved in hiring dozens of data scientists. So I saw the huge disconnect between what’s taught in most courses and what skills you need to be effective in real business situations. I started Kaggle Learn to focus on the skills that you need to do machine learning and data science. There are a lot of topics that you’d learn everywhere else that we skip in Learn courses, and some very practical topics we cover that conventional courses skip.

The other thing we do with Kaggle Learn is keep our courses really short. In practice, you learn best from hands-on practice on your own projects, and creating a portfolio is the best way to get noticed by employers. We make our courses short, with the goal of giving you just what you need to start working on independent projects.

We have two other very talented data science teachers now, Alexis Cook and Mat Leonard. So they do most of the course creation. I try to help them stay focused on Learn’s goal and philosophy.

So if those courses are the first steps in someone’s progression, the second step is independent projects to build a portfolio. I’ve recently taken a bigger interest in that, as the product manager of Kaggle Notebooks. We’re trying to provide a great place for you to do your work, find data, learn from others, and showcase your work in a portfolio. I’m not doing software engineering for Kaggle Notebooks, but I’m trying to ensure we build new features that make our users successful.

Putting these together, I aim to help thousands of people a month transition from learner to practitioner to professional.

Sayak: Machine Learning Interpretability was my first Kaggle Learn course and it was so awesome. Also, your DataCamp course Deep Learning in Python was one of the courses that formally introduced me to deep learning. I am curious to know how you became interested in data science and machine learning.

Dan: When I first finished college, I joined a start-up where we tried to apply machine learning to help retailers optimize their online product postings. This was nearly 20 years ago, and machine learning didn’t work especially well yet. So, that generally failed, and I spent years thinking that machine learning would never be useful for anything.

I decided conventional statistical methods like linear regression were the future. So I went to graduate school and got a PhD in econometrics. After getting my PhD, I heard about a Kaggle competition to predict who would be hospitalized, with a $3 million prize if someone built a sufficiently accurate model. I thought I knew a lot about data and models, so I tried it.

I still remember my first submission: I was almost in last place. I was shocked, thinking I was so smart and then realizing how far ahead the rest of the community was. This caused me to stay up late every night learning more and trying to improve my place in that competition. It turned out everyone ahead of me was using machine learning, so I had to relearn machine learning. I kept working very late every night, and I would make a new submission to the competition before going to bed. I climbed a little bit at a time but eventually reached second place.

That led to me being hired by an analytics consulting company, to help their consultants apply machine learning in their projects. Doing machine learning as a consultant for a range of companies was a great learning experience, and I think that experience has shaped the rest of my career.

Sayak: Thank you so much for sharing this story, Dan. I am sure this is going to serve as an inspiration for many aspiring data scientists like me. When you were starting what kind of challenges did you face? How did you overcome them?

Dan: The problems of getting started were a lot different then. There wasn’t any community around machine learning. There weren’t any meetups within hundreds of miles. I started one, and we never got more than 5 people. There just wasn’t much interest in machine learning yet. Now people feel bad because they can’t keep up with all the meetups, events and books. I came to Kaggle when it had only 9000 users because it was the first place I could find a community of people doing and learning ML.

Sayak: I strongly relate to this. I am one of the co-organizers of GDG Kolkata, and back in our early days we faced this issue too. But things are changing for the better nowadays. What were some of the capstone projects you did during your formative years?

Dan: The big one was the Heritage Health Prize. I was fortunate to end up in 2nd place, and that was a great credential.

That isn’t what I recommend to friends trying to break into the field today. Kaggle competitions are more competitive than they were back then. And if you need to be in the top 5% for it to be a powerful credential, 95% of people will finish below that. I think it’s better to do something creative on a topic you find interesting. And turn it into a notebook that’s graphically interesting.

I’ve recently done some side projects on how you can use simulations to go from predictions, which you get from conventional machine learning models, to optimized decisions. I have some ideas about how that can be done, and I think it’s going to be the next big wave in data science.
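Dan doesn’t spell out what such a simulation looks like, but here is a minimal sketch of the general idea (all names and numbers below are illustrative assumptions, not his actual projects): take a model’s predicted probability of an outcome, simulate that outcome under each candidate decision, and choose the decision with the best expected result.

```python
import random

# Hypothetical stand-in for a trained model's output: the predicted
# probability that a customer buys at a given price.
def predicted_buy_probability(price):
    return max(0.0, 1.0 - price / 100.0)

def simulate_expected_revenue(price, n_sims=10_000, seed=0):
    """Monte Carlo estimate of revenue per customer at this price."""
    rng = random.Random(seed)
    p = predicted_buy_probability(price)
    total = sum(price for _ in range(n_sims) if rng.random() < p)
    return total / n_sims

# Turn predictions into a decision: pick the price whose simulated
# expected revenue is highest.
candidate_prices = [20, 40, 50, 60, 80]
best_price = max(candidate_prices, key=simulate_expected_revenue)
print(best_price)
```

In a real project, the stand-in probability function would be replaced by a fitted model, and the simulated quantity could be anything you can score, such as revenue, cost, or patient outcomes.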

Sayak: Ah, that sounds really cool, using predictions to drive decisions! These fields, data science and machine learning, are rapidly evolving. How do you manage to keep track of the latest relevant happenings?

Dan: This felt harder for me to do a couple of years ago. I regularly look at research papers on the machine learning subreddit. And a couple of years ago, everything felt like a revolutionary new idea. Now, I go there and feel like there are a lot of papers I’m ok missing. I don’t know if the change is in me or in the underlying research. But I’ve been reading papers for long enough that I’ve gotten a good filter for what I think I need to know. And there’s a lot I feel ok skipping now.

Sayak: That is an interesting viewpoint. Keeping up with machine learning research has always been painful for me because there is so much of it. What are some of the upcoming courses in Kaggle Learn?

Dan: The next course coming out is Geospatial Analysis. It doesn’t sound cutting-edge, but it’s really fun, and the maps you can make are great. Most data scientists aren’t familiar with this stuff, and they’ll be shocked at how wide a range of problems they can solve with it, and with really interesting maps.

The other new courses coming up are Feature Engineering, Natural Language Processing and Reinforcement Learning. Those are all topics that speak for themselves.

Sayak: Thanks for sharing that, Dan. I am really looking forward to checking them out when they are published. As a practitioner, one thing that I often find myself struggling with is learning a new concept. Would you like to share how you approach that process?

Dan: I’m very focused on applications. Learn something well enough to use it. Ideally, you’ll do that with high-level libraries. You’ll be surprised how quickly you can get the overall understanding of what’s involved. Sometimes things won’t work like you expect. Now you can go into the paper with a very specific question, and think about it in a very directed way. Or perhaps you’ll need to look at the source code for the library.

It’s so engaging to do this while focused on a specific application. I’m usually surprised to realize I have a decent understanding of how the underlying library works, because I never invested time in learning the library itself; I picked it up as a side effect of solving a problem.

Sayak: I will remember that tip forever. Any advice for beginners?

Dan: Build a portfolio. Get feedback on it. The fastest way to learn is to have people give you feedback on your work. The fastest way to get a job is to show people a portfolio of interesting work. Don’t spend months or years on theory, because much of that you’ll eventually find out is irrelevant.

Even if your goal is to be awesome at data science in 10 years, learning theory first is an inefficient way to go about it. Once you are a professional data scientist, you’ll be spending 40 hours a week or more doing data science, in addition to what you do on your own time. So you’ll learn so quickly once you do this professionally, and you should optimize to get that first data science job quickly.

I think we’ve built courses and tools optimized for that, with Kaggle Learn and Kaggle notebooks. But if you come in with the right mindset, you can do it well with other tools and websites too.

Sayak: Thank you so much, Dan, for doing this interview and for sharing your valuable insights. I hope they will be immensely helpful for the community.

Dan: Thanks. I hope so too.

Summary

Dan gave the community a lot of takeaways in this interview, I believe. I am blown away by how his idea from nearly 20 years ago of incorporating machine learning into real projects has turned out, and I am sure you are too! He observed a critical gap in the skillset of data scientists, and he did not just stop there: Kaggle Learn is the result of that observation, and we all know how effective its courses are!

I hope you enjoyed reading this interview. Watch this space for the next one, and I hope to see you soon.

If you want to know more about me, check out my website.
