When I started learning about data science, I was overwhelmed by the ocean of resources available online. Thankfully, a few practicing data scientists and professors guided me in the right direction. Below is a list of resources that I found most useful — hopefully they will kickstart your data science fascination, as they did for me.
If you are completely new to programming, learning the basics of Python on Codecademy is your most logical first step. You don’t need to be a software developer to practice data science, but you should work to become proficient at programming. As you grow your data science career, expect your programming skills to also grow.
DataCamp is a great introduction to applying Python for data science. It has many courses that will help you nail down the basics of data science. DataCamp is not free, but its pricing is approachable at $30 per month. I recommend starting with these courses:
Intro to Python for Data Science: Learn the Python basics, everything from variables and lists, to functions and the Python package NumPy. I recommend this course because it specifically teaches Python for doing data science.
Intermediate Python for Data Science: This course builds on the foundations from Intro to Python for Data Science. You will learn how to use the Python packages Matplotlib and Pandas, as well as programming fundamentals like logic, control flow, and loops.
Python Data Science Toolbox (Part 1 & 2): These two courses enhance your Pythonic skills. After completing these courses, you will have an understanding of functions, lambda functions, list comprehensions, and more.
Pandas Foundations: Pandas is one of the most widely used Python packages. This course will cover the basics of importing data, manipulating data, and conducting exploratory data analysis, all critical skills for the fledgling data scientist.
Statistical Thinking in Python (Part 1 & 2): If it has been a few years since your Statistics 101 course in college, or if you are completely new to statistics, you’ll find this course helpful. The course covers introductory statistical topics like summary statistics, discrete and continuous variables, confidence intervals, and hypothesis testing.
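To give a flavor of the Python these courses cover, here is a minimal sketch that touches functions, list comprehensions, lambda functions, and summary statistics. It is not taken from any of the courses: the page-view numbers and the `summarize` helper are made up for illustration, and it uses only Python's built-in `statistics` module so it runs without installing anything.

```python
import statistics

# Hypothetical sample data: daily page views for a small blog.
page_views = [120, 135, 98, 160, 142, 110, 175]


def summarize(values):
    """Return a few basic summary statistics as a dictionary."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


mean_views = statistics.mean(page_views)

# List comprehension: keep only the days with above-average traffic.
busy_days = [v for v in page_views if v > mean_views]

# Lambda function as a sort key: order days by distance from the mean.
by_distance = sorted(page_views, key=lambda v: abs(v - mean_views))

summary = summarize(page_views)
print(summary)
print(busy_days)
print(by_distance)
```

Once you are comfortable with this style, the same ideas carry over to NumPy and Pandas, which do the heavy lifting on larger datasets.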
Anaconda and Jupyter Notebooks
Anaconda is an open-source data science platform. When you download Anaconda, it comes with many Python packages pre-installed, such as Pandas, NumPy, SciPy, Matplotlib, Scikit-Learn, TensorFlow, Statsmodels, NLTK, and Flask. Anaconda allows you to manage Python packages, download new packages with the Conda package manager, create Python environments, and switch between different versions of Python. Most importantly, Anaconda comes with Jupyter Notebooks. Jupyter allows you to execute Python code, review outputs from that code, and annotate your data analysis using Markdown (a markup language that allows you to include narrative text). These core functionalities make a Jupyter notebook easier for people to read than an array of Python scripts. The vast majority of my data science analyses are conducted within Jupyter Notebooks. Check out this installation guide from Quantitative Economics if you don’t already have Anaconda and Jupyter installed.
Once you have Jupyter installed, you can learn from a wide variety of data scientists. I like to watch lectures from experts in different fields through PyData’s YouTube Channel. A lot of the PyData lecturers make their code and data publicly available on GitHub, so you can replicate their analyses on your local machine. Walking through their analysis on your own creates a great hands-on learning experience.
For example, if you like time series analysis, check out Jeffrey Yau’s lecture on YouTube and pull his notebooks from GitHub.
SQL is another programming language many data scientists rely upon. SQL allows us to access data from relational databases. SQL Zoo is a great resource to learn the basics of SQL. Additionally, online coding platforms like Codecademy offer good SQL courses.
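If you want to try SQL before committing to a course, a minimal sketch like the one below lets you run queries from Python using the built-in `sqlite3` module. The `courses` table, its columns, and its rows are all hypothetical, invented only to have something to query.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of courses; names and hours are made up.
conn.execute("CREATE TABLE courses (name TEXT, provider TEXT, hours INTEGER)")
conn.executemany(
    "INSERT INTO courses VALUES (?, ?, ?)",
    [
        ("Course A", "Provider X", 4),
        ("Course B", "Provider X", 6),
        ("Course C", "Provider Y", 3),
    ],
)

# Count courses and total hours per provider.
rows = conn.execute(
    "SELECT provider, COUNT(*) AS n_courses, SUM(hours) AS total_hours "
    "FROM courses GROUP BY provider ORDER BY provider"
).fetchall()

for row in rows:
    print(row)

conn.close()
```

SELECT, GROUP BY, and ORDER BY as used here are standard SQL, so the query itself should transfer to other relational databases with little or no change.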
Deep dives into data science approaches
There are two textbooks nearly every data scientist owns: An Introduction to Statistical Learning (ISLR) and The Elements of Statistical Learning (ESLR). Luckily, both are available for free as PDFs. ISLR is accessible for folks who have taken a few statistics courses in college. ESLR is intended for those who have taken more than a few statistics courses. One of my graduate school classmates spent a full two semesters combing through ESLR, asking professors and other students for help along the way.
I have also found a handful of textbooks from O’Reilly helpful and interesting:
Think Bayes: Bayesian Statistics in Python
Doing Data Science: Straight Talk from the Frontline
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
Lastly, William Chen has a list of 22 free data science textbooks. I don’t need to recreate his list here, but definitely check out his article.
Andrew Ng’s Machine Learning course at Stanford has been one of the most popular machine learning courses of all time. You can take it for free on Coursera. Machine learning is a fairly complicated subject and requires advanced knowledge of statistics and math, but if you’re up for the challenge, Ng’s course is an amazing one.
Eventually, it will be time to leave the books and courses behind. Getting your hands on data and experimenting is a great way to learn. There is a ton of freely available data online, but I wanted to highlight a few resources I believe showcase interesting datasets.
Jeremy Singer-Vine’s newsletter, Data Is Plural, delivers new datasets weekly.
Kaggle is not only a great place to test out your data science chops in competitions, but is also a great data resource.
FiveThirtyEight publishes a fair amount of data on their GitHub account. Check out their 2018 World Cup data and analysis.
Medium.com has a thriving tech and data science community. Many folks at data-science-focused firms such as Airbnb freely share new information and insights into complex problems. Towards Data Science is a curated collection of data science articles that I find highly informative; they put together a praiseworthy weekly selection.
Books for Pleasure-Reading
Nate Silver’s The Signal and the Noise, Christian Rudder’s Dataclysm, and Seth Stephens-Davidowitz’s Everybody Lies inspired me and contextualized the role of data science in the real world.
You could start learning data science from almost any angle. The key is simply starting. Begin your journey with a healthy diet of statistics and programming, and see where it takes you. Maybe you will be the next Andrew Ng.