Data science is a popular field that attracts a lot of interest. Whether you’re looking to become a data scientist or to apply data science skills in your current role, “Where should I start?” and “What are some good books?” are common questions. Data science is also a broad field, with many skill areas in which to keep growing throughout your career. Given this, it can be helpful to put together a learning plan. In this article, we share some worthwhile resources that we’ve used to build a data science foundation.
In an earlier article, we shared a list of skills needed to be an effective data scientist. In this one, we provide resources such as courses, books, and papers to help develop those skills. In particular, we expand on the technical skills section of the prior article, because business and domain resources depend on the particular domain context.
As you dive into data science courseware, you’ll see that it draws on a variety of academic departments, and that various fields of study have contributed to current techniques. Data science is a diverse, interdisciplinary field. Here is a Venn diagram to illustrate how it brings math, statistics, computer science, and domain knowledge together:
Adapted from and with credit to “Data science concepts you need to know! Part 1,” by Michael Barber, on Medium.com.
Covered topics
Here are the topic areas we cover in this article, starting at the core and working outward, in this conceptual view:
Here are quick links to each section:
- Statistical concepts and techniques
- Programming languages (SQL, Python, R, and Kusto)
- Data analytics and forecasting
- Machine Learning and Deep Learning
- Experimentation and causal inference
- Data visualization and communication
- Communities, podcasts, datasets, and events
Statistical concepts and techniques
Data scientists must be familiar with statistics as they collect data and information and use it to investigate problems, analyze and forecast trends, conduct significance testing, design experiments, and inform business decision-making. Here are some good resources for building a foundation in statistics:
- Course: Statistics with R Specialization. This online course helps in mastering statistics with R.
- Book: Practical Statistics for Data Scientists: 50 Essential Concepts. A practical guide that explains how to apply various statistical methods to data science.
- Book: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. This book helps readers develop a deeper understanding of statistical learning and requires some mathematical and statistical sophistication.
- Book: An Introduction to Statistical Methods and Data Analysis. This book offers a broad overview of statistical methods, providing research studies and examples that connect the statistical concepts to data analysis problems data scientists may encounter in daily work.
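To make the idea of significance testing mentioned above a little more concrete, here is a minimal sketch of a two-sided permutation test for a difference in means, using only the Python standard library. The function name `permutation_test` and the sample measurements are hypothetical, invented purely for illustration. It is one of several ways to test for a difference between two groups, not a recommendation from any of the resources listed.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=5_000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    The group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose absolute mean difference is at least as large as the
    observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical measurements, invented for illustration only.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treated = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

p_value = permutation_test(control, treated)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value here suggests the observed difference would be unlikely if the group labels were interchangeable. The statistics resources above cover when this kind of test is appropriate and how to interpret it.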
Programming languages (SQL, Python, R, and Kusto)
As a data scientist, you can choose from a number of programming languages that are useful in your work. SQL is very important to learn because it allows you to query the data in a structured database. Python is particularly popular among data scientists today due to its wide range of uses across domains, such as data collection and cleaning, data visualization, Machine Learning, and Deep Learning. R is another popular language ideal for data science, big data, and Machine Learning. Kusto is a query language for Azure Data Explorer and related services that has a simplified syntax. Here are some training materials to learn these programming languages and their applications to data science:
SQL
- Course: LinkedIn Learning Path for SQL. This pathway includes various courses on leveraging SQL in data science projects.
- Course: SQL for Data Science. A Coursera course that teaches fundamental SQL query and data manipulation.
- Tutorial: W3 Schools SQL Tutorial. This set of online lessons provides hands-on practice with SQL queries.
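As a small illustration of querying data in a structured database, the sketch below runs a SQL aggregation from Python using the standard library's `sqlite3` module and an in-memory database. The `orders` table, its columns, and its rows are all hypothetical, made up for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for illustration.
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ben", 35.5), ("ana", 14.5), ("cy", 8.0)],
)

# Count orders and total spend per customer, largest total first.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```

The same `SELECT ... GROUP BY ... ORDER BY` pattern carries over to most relational databases, although dialects differ in details.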
Python
- Course: Data Scientist with Python. DataCamp track consisting of several courses covering introductory and intermediate programming in Python, usage of pandas and matplotlib libraries, and much more.
- Course: Python for Data Science. Coursera course that introduces users to foundational Python programming concepts.
- Course: LinkedIn Learning Path for Python. This pathway includes various Python courses ranging from introductory programming topics to deep learning and neural networks.
- Course: edX: Computational Thinking using Python. This learning program covers programming, data structures, computational thinking, data science, and algorithms related to Python.
- Book: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. This title teaches how to clean and manipulate data using popular libraries.
R
- Book: An Introduction to Statistical Learning with Applications in R. This volume interprets statistical concepts with R programming and is publicly available now.
- Course: Learn R with DataCamp. This course offers several tracks for R programming, ranging from data manipulation to statistical inference.
- Course: R Programming. This is an introductory course to programming R, covering topics like reading data and writing functions.
- Course: LinkedIn Learning Path for R. This pathway covers a range of R-related topics including cleaning data, forecasting, and text analytics.
Kusto
- Reference: Azure Data Explorer/Kusto Reference Documentation. This is an online reference for Kusto query language syntax and functions.
- Course: KQL from Scratch. This is an introductory course on fundamental Kusto syntax and function principles.
Data analytics and forecasting
Data analytics and forecasting are fundamental tools for data scientists. As technology generates vast and growing amounts of data, analytics and forecasting are core steps to explore business opportunities, identify key trends, and find insights to enable data-driven decision making.
- Course: Data Science with Databricks for Data Analyst Specialization
- Course: Data Science Fundamentals for Data Analyst
- Course: Applied Data Science for Data Analysts
- Course: MicroMasters program from Georgia Tech. This course covers fundamental data science programming in Python, R, and SQL, as well as modeling and data pipelines. (Credit can be counted toward Georgia Tech’s Master’s in Analytics program.)
- Course: [Forecasting in R (DataCamp)](https://learn.datacamp.com/courses/forecasting-in-r)
- Course: [Sequences, Time Series and Prediction (Coursera)](https://www.coursera.org/learn/tensorflow-sequences-time-series-and-prediction)
- Book: Forecasting: Principles and Practice (3rd ed.) (otexts.com)
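For a flavor of what a basic forecasting method looks like in code, here is a minimal sketch of simple exponential smoothing in plain Python. It is a standard textbook technique, chosen here only as an illustration. The function name, the smoothing weight `alpha`, and the sample demand figures are assumptions made up for this example.

```python
def simple_exponential_smoothing(series, alpha):
    """Return a one-step-ahead forecast for `series`.

    Each new observation is blended with the running level:
        level = alpha * observation + (1 - alpha) * previous_level
    Higher `alpha` reacts faster to recent values.
    """
    if not series:
        raise ValueError("series must contain at least one value")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialize the level with the first observation
    for observation in series[1:]:
        level = alpha * observation + (1 - alpha) * level
    return level


# Hypothetical weekly demand, invented for illustration only.
weekly_demand = [120, 132, 128, 141, 150, 147]
forecast = simple_exponential_smoothing(weekly_demand, alpha=0.5)
print(f"next-week forecast: {forecast:.1f}")
```

This method ignores trend and seasonality, so in practice it serves as a baseline. The forecasting resources above cover richer models and how to evaluate them.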
Machine Learning and Deep Learning
Machine Learning includes algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions. Deep Learning is a subset of Machine Learning that uses multilayered neural networks to learn useful representations directly from data, which reduces the need for hand-engineered features. The following are helpful resources to grow skills in Machine Learning and Deep Learning.
Machine Learning foundations
- Course: [Machine Learning by Stanford University (Coursera)](https://www.coursera.org/learn/machine-learning). Andrew Ng’s popular course that introduces supervised and unsupervised learning, as well as Machine Learning best practices. (Offered to the public through Coursera and to Stanford students as part of the university’s curriculum.)
- Course: Caltech: Machine Learning (Yaser Abu-Mostafa). This course is introductory but less friendly for novices. It covers most of the same topics as Ng’s course, but more deeply and with a more theoretical approach, and is recommended for students and practicing data scientists.
- Course: Reinforcement Learning. A set of four courses that covers fundamental concepts and includes a hands-on project.
- Course: LinkedIn: Machine Learning and AI Foundations
- Course: LinkedIn: Become a Machine Learning Specialist
- Book: Deep Learning with Python
- Course: [Improving your Model Performance - ML Strategy (1) (Coursera)](https://www.coursera.org/lecture/machine-learning-projects/improving-your-model-performance-4IPD6)
- Article: [What is Predictive Model Performance Evaluation, by divya singh (Medium)](/@divyacyclitics15/what-is-predictive-model-performance-evaluation-8ef117ae0e40)
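To illustrate what it means for an algorithm to learn from data and then apply what it has learned, here is a minimal sketch that fits a straight line by gradient descent on mean squared error, in plain Python. The function name `fit_line`, the learning rate, the epoch count, and the toy data are all assumptions chosen for this example. Real projects would normally rely on an established library instead.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, invented for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)          # "learn" from the data
prediction = w * 5.0 + b         # apply what was learned to a new input
print(f"w={w:.3f}, b={b:.3f}, prediction for x=5: {prediction:.3f}")
```

The learning rate and epoch count were picked to converge on this toy data. Other data may need different settings or feature scaling.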
Deep Learning
- Course: Deep Neural Networks with PyTorch. Similar to the course above, but using PyTorch.
- Course: Hugo Larochelle: Neural Networks. Graduate-level course covering topics such as Deep Learning, conditional random fields, autoencoders, and more.
- Course: Carnegie Mellon’s Neural Networks for NLP. This class starts with a brief overview of neural networks, and then spends the majority of the class demonstrating how to apply neural networks to natural language problems.
- Reference: Deep Learning Cheat Sheet (Stanford). Concise explanations of neural networks.
- Reference: Neural Network Zoo. Cheat sheet for neural network architectures.
- Course: Deep Learning with PyTorch: Build a NN (1 hour; guided project)
- Course: Getting started with PyTorch (1.5 hours; guided project)
- Course: Getting started with PyTorch (2) (2 hours; guided project)
- Course: Detecting Covid-19 with chest X-ray using PyTorch (2 hours; guided project)
ML Ops, data engineering and more
- Course: Distributed computing with Spark SQL (13 hours)
- Course: Introduction to High-Performance and Parallel Computing (18 hours)
- Course: Custom and Distributed Training with TensorFlow (approximately 24 hours)
- Course: Perform Data Science with Azure Databricks
- Course: Data Engineering with Azure Databricks
- Course: Distributed Programming on the Cloud
- Book: Data Science Solutions on Azure: Tools and Techniques Using Databricks and MLOps
- Book: Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud
- Course: fast.ai course: Practical Deep Learning for Coders (see Part 1 and Part 2)
- Book: fast.ai book (corresponding to the course above)
- Course: Getting Started with Python Concurrency (approximately 2.5 hours)
- Article: Python Concurrency: The Tricky Bits
Experimentation and causal inference
Experimentation and causal inference are designed to identify causal relationships among variables. Given the importance of understanding causal drivers to ensure the right data-driven decisions, these techniques have been gaining increased adoption among data scientists in the industry. These resources provide great learning opportunities on these topics:
Experimentation
- Book: Experimentation Works. Covers key tenets of driving an experimentation culture, including case studies from tech company examples.
- Book: Trustworthy Online Controlled Experiments (A Practical Guide to A/B Testing). Great resource for best practices in experiment design and analysis, using learnings from real-world examples.
- Course: EXP platform
- Course: Udacity
- Course: Coursera
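As a rough illustration of the analysis step in an A/B test, the sketch below computes a two-proportion z-test with the Python standard library. This is one common textbook approach and is not taken from the books above. The function name and the conversion counts are hypothetical, and the normal approximation it relies on assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Uses the pooled proportion for the standard error, which matches the
    null hypothesis that both variants share the same underlying rate.
    """
    rate_a = conversions_a / n_a
    rate_b = conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment counts, invented for illustration only.
z, p_value = two_proportion_z_test(
    conversions_a=200, n_a=5000,   # control
    conversions_b=260, n_b=5000,   # treatment
)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A test like this only covers the final comparison. The experimentation resources above discuss the design questions that matter just as much, such as randomization, sample size, and avoiding repeated peeking at results.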
Causal inference
- Course: Data Science with Databricks for Data Analyst Specialization
- Course: A Crash Course in Causality
- Course: Introduction to Causal Inference
- Book: The Book of Why: The New Science of Cause and Effect
- Book: Elements of Causal Inference
- Book: Causal Inference in Statistics: A Primer
- Book: Introduction to Causal Inference (textbook that accompanies course of the same name)
- Paper: Causal Structure Learning and Inference: A Selective Review
- Paper: A Crash Course in Good and Bad Controls
- Paper: On Identifying Causal Effects
- Methodology: https://arxiv.org/abs/2007.10979
- Methodology: https://github.com/microsoft/dowhy
- Methodology: https://github.com/microsoft/EconML
- Methodology: https://github.com/uber/causalml
- Research (Meta learners): Künzel, Sören R., et al. “Metalearners for estimating heterogeneous treatment effects using machine learning.” Proceedings of the National Academy of Sciences 116.10 (2019): 4156–4165.
- Research (Double Machine Learning): V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey. “Double Machine Learning for Treatment and Causal Parameters.” ArXiv e-prints, July 2016.
- Research (Estimation methods with instruments): W. K. Newey and J. L. Powell. “Instrumental variable estimation of nonparametric models.” Econometrica, 71 (5): 1565–1578, 2003. Also: Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. “Deep IV: A flexible approach for counterfactual prediction.” Proceedings of the 34th International Conference on Machine Learning, 2017.
- Research (Doubly robust learning): D. Foster and V. Syrgkanis. “Orthogonal Statistical Learning.” arXiv preprint arXiv:1901.09036, 2019. URL http://arxiv.org/abs/1901.09036.
Data visualization and communication
Data science organizations often partner with stakeholder teams throughout an organization. Communicating data science deliverables is an important step in maximizing their impact, whether through presentations, data visualizations, or written communications, and whether presented to a business or technical audience. Here are some resources to help with this:
Data visualization
Here is a range of books, courses and papers on data visualization techniques and approaches that you can incorporate into your work. Also see the data visualization articles on the Data Science at Microsoft online publication.
- Book: Storytelling with Data
- Book: The Wall Street Journal Guide to Information Graphics
- Book: Information Dashboard Design
- Book: The Visual Display of Quantitative Information
- Course: Data Visualization Tips and Tricks (LinkedIn Learning; may require free registration)
- Course: PowerPoint: Creating an Infographic (LinkedIn Learning; may require free registration)
- Paper: How deceptive are deceptive visualizations? An empirical analysis of common distortion techniques
- Paper: Surfacing Visualization Mirages
- Paper: The Persuasive Power of Data Visualization
- Paper: A Model-Based Visualization Taxonomy
- Paper: An Insight-Based Methodology for Evaluating Bioinformatics Visualizations
- Paper: Characterizing Visualization Insights from Quantified Selfers’ Personal Data Presentations
- Paper: Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods
- Paper: More Than Telling a Story: Transforming Data into Visually Shared Stories
- Paper: Reaching Broader Audiences with Data Visualization
- Paper: Understand Users’ Comprehension and Preferences for Composing Information Visualization
- Paper: Voyager: Exploratory Analysis via Faceted Browsing of Visualization Recommendations
- Paper: A Guide to Understanding Color
- Paper: Escaping RGBland: Selecting Colors for Statistical Graphics
- Paper: Rainbow Color Map (Still) Considered Harmful
- Paper: Somewhere Over the Rainbow: How to Make Effective Use of Colors in Meteorological Visualizations
- Paper: True Colors of Oceanography: Guidelines for Effective and Accurate Colormap Selection
- Paper: ModelTracker: Redesigning Performance Analysis Tools for Machine Learning
Communication and public speaking
Below are some resources for presentation training and scientific writing, as well as an organization you can join for further practice.
- Course: [Presentation Skills (LinkedIn Learning)](https://www.linkedin.com/learning/paths/develop-your-presentation-skills?u=3322)
- Course: [Public Speaking (LinkedIn Learning)](https://www.linkedin.com/learning/topics/public-speaking?u=3322)
- Course: [Scientific Writing (Coursera)](https://www.coursera.org/learn/sciwrite)
- Community: Toastmasters
Communities, podcasts, datasets, and events
As you continue to learn, here are some great spaces where you can exchange ideas with others and hear about their experiences with data science in practice. We’ve included opportunities to engage in online communities, participate in hands-on events, leverage publicly available datasets, listen to data science podcasts, and attend relevant conferences. We also recommend GitHub and Jupyter Notebooks as great ways to share your work and collaborate with others.
Communities
Countless data science meetups and communities exist. Here are a few where you can engage with other data scientists on relevant topics:
- Reddit channels (Analytics, AskStatistics, DataScience, Statistics): These subreddits cover data science topics.
- KDnuggets: This online platform covers business analytics, big data, data mining, and data science.
- Time Series Forecasting: This is a GitHub site for time series forecasting discussions.
Podcasts
For those who prefer learning via audio, the following podcasts are great options:
- Towards Data Science: In-depth subject area discussion.
- Women in Data Science: Stanford-led community interviewing women leaders in data science.
- TWIML AI: Leaders in AI and ML discuss how they’re innovating in their domains.
- Super Data Science: Higher-level overview of data science topics.
Hands-on events
These can be a great place to learn about new tools, hone your skills, and uncover best practices in the data science domain.
- Kaggle Competitions: Kaggle allows users to work with other data scientists and Machine Learning engineers and to enter competitions that solve data science challenges.
- Women in Data Science Datathons: A global event that encourages more women to enter the field of data science.
Datasets
The best way to learn data science is to practice with different projects. You can search and download free datasets online using the following resources.
- Kaggle datasets: Kaggle has one of the largest dataset libraries online. The data is free and you can also upload your own datasets there.
- KDnuggets datasets: KDnuggets maintains a good collection of datasets that are free and can be used for learning data science.
- Data is Plural: A weekly newsletter of useful and interesting datasets.
- TidyTuesday: A weekly data project aimed at the R ecosystem.
Conferences
Conferences can be a great way to learn from others’ experiences, get exposure to new ideas, and gain additional perspective. Here are some to explore:
- NeurIPS: The purpose of the Neural Information Processing Systems annual meeting is to foster the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects.
- SIGKDD: The main professional association for data mining and knowledge discovery.
- ICML: The International Conference on Machine Learning is the leading international academic conference in this subject area.
- CVPR: CVPR is the premier annual computer vision event comprising the main conference and several co-located workshops and short courses.
- ACL: The Association for Computational Linguistics (ACL) is the international scientific and professional society for people working on problems involving natural language and computation.
- SIGIR: The annual SIGIR conference is the major international forum for the presentation of new research results and the demonstration of new systems and techniques in the broad field of information retrieval (IR).
- MLSys: The Conference on Machine Learning and Systems targets research at the intersection of systems and Machine Learning.
Original article: https://medium.com/data-science-at-microsoft/data-science-learning-resources-193ccf6fafb