So You Want To Be A Data Scientist? A Self-Guided Curriculum

Data science and machine learning have become increasingly popular themes at SIOP over the past few years. Unfortunately, many of our grad programs don't have the time to give us a well-rounded education in both the core competencies of I/O and data science.

Luckily for us, many programs do teach what are, in my opinion, the most important competencies, and these provide a solid foundation for a career in data science.

While they don't often teach us about random forests, linear algebra, or programming, what they do teach is, in my experience, more fundamental and rarer in the practice of data science. Below I describe two specific areas that I think I/O teaches well and that are often overlooked.

  1. Critical Thinking: The idea that we, as data scientists, might not have all the data needed to fully answer the question asked is difficult for many people with purer computer science and statistics backgrounds to grasp at first. I/Os tend to understand that the data reflects what one decision maker thought was important at one point in time, and that it may not even be appropriate for the question being asked. We can leverage this more holistic view of the problem to judge whether we are asking the right questions, whether we have the right data to answer them, and (as a lead-in to the second area) whether we are using the right methodology and/or algorithms.
  2. Research Methods: I/Os typically have strong training in research methods. Our holy grail for years in grad school taught us all about the many experimental and quasi-experimental designs, along with the pros and cons of each. This allows us to view each research question in the context of the quasi-experimental design in which the data was collected. CS- and stats-trained professionals, on the other hand, might have only heard of A/B testing. We also typically have a strong understanding of the pros and cons of most of the more standard machine learning algorithms (linear vs. logistic regression, ANOVAs, etc.). For this reason, the image below wasn't as mind-blowing to me as it was to many of the purer data scientists I saw posting it all over the internet.

[Image hosted on Imgur]

And since programming is mostly learned through repetitive practice on your own time, teaching yourself through trial and error is honestly a great way to go about it. Plus, unlike critical thinking and research methods, there is an abundance of free or nearly free resources online for learning programming, linear algebra, and machine learning.

Our hope for this session and this blog post is to provide you with the resources we found helpful. We also reflected on our journey and the many courses and books we tried, and decided that if we could do it all over again we'd probably focus on only a few of them. We have arranged those into a curriculum that any of you can follow to build the skills necessary to make the transition into data science.

The Debate!

There is often a debate about which programming language to learn. For open source, the options are typically Python or R. In all honesty I think it's best to be able to use both, but I'd pick one and run with it. It's better to be able to do a lot in one language than a few things in each. I'll give a few thoughts on each one below that will hopefully help you decide which to focus on, and after that we will lay out our self-guided curriculum for both.

Python

  • A complete programming language. You can build a website and a deep learning algorithm in the same language. This makes it extremely user friendly for the traditional programming work that often comes up, like building functions to extract data.
  • The preferred programming language of the ML & DL community. This means that cutting-edge work in ML & DL will almost always come out first in Python, and someone will later build a wrapper or a similar implementation in R.
  • Most software developers know Python. While Python isn't the preferred language for the majority of their work, they are generally familiar with it and typically have an easier time tying Python code into their existing environment, which matters if your goal is to put a model into production.

The old saying is: Python is the second-best language for everything.

R

  • A statistical programming language, developed specifically for working with dataframes. This makes many of the commands in base R extremely well suited to the way the I/O community views data analysis.
  • Great niche packages, especially for the social sciences. If you can think of a niche implementation of an algorithm or a methodology, someone probably already has a package for it in R. This isn't necessarily true in Python. That's not to say you can't build one for Python, just that most package developers in Python are focused on software development, ML, and DL, and a lot less on Item Response Theory and Structural Equation Modeling.
  • Data manipulation and management are extremely easy and intuitive (or so I've heard) with dplyr and tidyr. I personally haven't used them, but people familiar with both languages often complain about the lack of a dplyr and tidyr equivalent in Python.
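
For readers leaning toward Python, the sketch below shows one way to approximate a dplyr-style pipeline with pandas method chaining. It is only an illustration, not a claim that pandas matches dplyr or tidyr: the `scores` table, its column names, and its values are invented for the example, and the comments mapping each step to a dplyr verb are rough analogies.

```python
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
scores = pd.DataFrame(
    {
        "dept": ["sales", "sales", "ops", "ops", "ops"],
        "tenure_years": [1, 4, 2, 6, 3],
        "rating": [3.0, 4.5, 2.5, 4.0, 3.5],
    }
)

# Chain the steps top to bottom, similar in spirit to a dplyr pipeline.
summary = (
    scores
    .query("tenure_years >= 2")                        # roughly filter()
    .assign(high_rating=lambda d: d["rating"] >= 4.0)  # roughly mutate()
    .groupby("dept", as_index=False)                   # roughly group_by()
    .agg(                                              # roughly summarise()
        mean_rating=("rating", "mean"),
        n_high=("high_rating", "sum"),
    )
    .sort_values("mean_rating", ascending=False)       # roughly arrange()
    .reset_index(drop=True)
)

print(summary)
```

Wrapping the chain in parentheses lets each step sit on its own line, which is about as close as plain pandas gets to the pipe-operator style people praise in dplyr.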

Now onto the Curriculum(s)!

The Python Self-Guided Curriculum

| Step | Resource Title | Why This Resource? | Link |
| --- | --- | --- | --- |
| 1 | Codecademy Introduction to Python | It's the fastest and easiest initial dive into Python out there. Learning Python basics is a fundamental first step toward Machine Learning, and this course will have you up and running in under a minute. The class provides a great introduction to Python and trains you up through Object Oriented Programming. This should be enough for you to begin learning Machine Learning specific Python modules and understanding Machine Learning specific educational resources. | Link |
| 2 | Python for Data Science and Machine Learning Bootcamp | It draws a good connection between basic Python and the more specific visualization and ML libraries that are required to begin doing Machine Learning. It also includes a nice introductory overview of Python, which lets you verify that you learned enough in the previous step and gives you the confidence to move on (you will always feel like you don't know enough to move on). The class ends with a nice applied introduction to most Machine Learning methods, giving you an idea of what each method does so you can decide where you want to dive deeper moving forward. | Link |
| 3A | Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow | Provides a nice survey of machine learning and deep learning methods. Assumes familiarity with Python and machine learning fundamentals. | Link |
| 3B | Hands-On Machine Learning with Scikit-Learn and Tensorflow | Provides a nice survey of machine learning and deep learning methods. Assumes familiarity with Python and machine learning fundamentals. | Link |
| 4 | Deep Learning with Python | This book provides a history of deep learning, an introduction to the concepts and topics relevant to understanding deep learning, and then walkthroughs of different deep learning methods using Keras. The book was written by Francois Chollet, the lead developer at Google for Keras. It's clearly written and a great way to quickly dive into deep learning methods and approaches. Additionally, Chollet stays away from confusing-looking mathematical equations and instead writes out any relevant equations in Python code. This approach is surprisingly effective, as it makes it clear what the equation is doing and how the deep learning method is using that math to develop models from the data. | Link |
| 5 | Coursera Deep Learning Specialization | Andrew Ng did a lot over the last decade to make machine learning and deep learning household names. He is one of the founders of Coursera, and his Machine Learning course on Coursera is still probably one of the most popular courses ever on the platform. This specialization does a great job of breaking down the math and walking you through the code step by step for neural networks, CNNs, and RNNs. It's a great overview of everything deep learning has to offer. | Link |
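
To give a feel for the "equations written as Python code" style mentioned for Step 4, here is a minimal sketch of a single dense layer, output = relu(inputs · W + b), written with NumPy. It is not taken from the book: the function names `relu` and `dense` and all of the numbers are assumptions made up for this illustration.

```python
import numpy as np


def relu(x):
    # relu(x) = max(x, 0), applied element-wise
    return np.maximum(x, 0.0)


def dense(inputs, weights, bias):
    # output = relu(inputs . W + b)
    return relu(inputs @ weights + bias)


# Hypothetical values chosen only to show the shapes involved.
inputs = np.array([[1.0, -2.0, 0.5]])   # one sample, three features: shape (1, 3)
weights = np.array([[0.2, 0.5],
                    [0.4, -0.1],
                    [-1.0, 0.3]])        # three inputs, two units: shape (3, 2)
bias = np.array([0.1, 0.0])              # one bias per unit: shape (2,)

output = dense(inputs, weights, bias)
print(output)                            # shape (1, 2)
```

Reading the layer as two short functions makes it easy to see that the first unit's weighted sum is negative and gets clipped to zero by `relu`, while the second unit's positive sum passes through unchanged.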

The R Self-Guided Curriculum

| Step | Resource Title | Why This Resource? | Link |
| --- | --- | --- | --- |
| 1 | SWIRL | SWIRL teaches you R programming and data science interactively, at your own pace, and right in the R console. This is a fast and effective way to learn R from scratch through a Data Science lens. | Link |
| 2 | An Introduction to Statistical Learning with Applications in R | This book provides an introduction to statistical learning methods. It is aimed at upper-level undergraduate students, master's students, and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations of how to implement the various methods in real-life settings, and should be a valuable resource for a practicing data scientist. | Link |
| 3 | Data Science and Machine Learning Bootcamp with R | Similar to the Python version of this course, it takes you from the basics of R programming, through vector and matrix algebra, into the ggplot2 plotting library, and finally introduces you to Machine Learning using real datasets. | Link |
| 4A | Pattern Recognition and Machine Learning | This book, intended for advanced undergraduates, PhD students, researchers, and practitioners in the ML field, provides an authoritative presentation of many statistical techniques in machine learning. The book is an excellent reference, clearly written and well structured. It is not math heavy and offers plenty of heuristics and intuitions. | Link |
| 4B | Pattern Recognition - Lecture Notes, North Carolina University | A series of lecture notes from Professor Osuna on machine learning. The slides are extremely clear, simple, and full of step-by-step examples. | Link |
| 5 | Deep Learning with R | This book provides a history of deep learning, an introduction to the concepts and topics relevant to understanding deep learning, and then walkthroughs of different deep learning methods using Keras. The book was written by Francois Chollet, the lead developer at Google for Keras. It's clearly written and a great way to quickly dive into deep learning methods and approaches. Additionally, Chollet stays away from confusing-looking mathematical equations and instead writes out any relevant equations in R code. This approach is surprisingly effective, as it makes it clear what the equation is doing and how the deep learning method is using that math to develop models from the data. | Link |
In [19]:
import pandas as pd
from IPython.display import HTML

A list of all resources gathered by our team along with the approximate level of difficulty.

In [29]:
HTML('<iframe src="https://docs.google.com/spreadsheets/d/1fmIPR0vCCGTOqWGJ--dJq6cdls6fDro8QTDGwt33Udc/edit?usp=sharing/pubhtml?gid=0&amp;single=true&amp;widget=true&amp;headers=false" width="1400" height="1000"></iframe>')