Here are 5 useful things to know about Data Science, including its relationship to BI, Data Mining, Predictive Analytics, and Machine Learning; Data Scientist job prospects; where to learn Data Science; and which algorithms/methods are used by Data Scientists
By Gregory Piatetsky, KDnuggets.
I am frequently asked questions about Data Science, so here are my answers to some frequent questions and 5 useful things to know about Data Science and Data Scientists.
1. Business Intelligence, Business Analytics, Data Science, Data Analytics, Data Mining, Predictive Analytics – what are the differences?
Data Science is concerned with analyzing data and extracting useful knowledge from it. Building predictive models is usually the most important activity for a Data Scientist.
However, because the term “Data Science” is relatively new, the name is not yet commonly accepted, and other names are frequently used for the same area.
Data Science can be understood in terms of The Data Science Process which includes business understanding, data understanding, data preparation, modeling, evaluation, and deployment, as described in this CRISP-DM framework:
Fig. 1: CRISP-DM – Data Science Process.
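To make the phases concrete, here is a minimal sketch that walks a toy problem through them in order. The data, the variable names, and the simple threshold rule are all hypothetical and chosen only for illustration; a real project would use far richer data and models.

```python
from statistics import mean

# Business understanding (assumed goal): predict whether a visitor will purchase.
# Hypothetical toy records: (monthly_visits, purchased). None marks a missing value.
raw = [(2, 0), (3, 0), (None, 1), (9, 1), (11, 1), (1, 0), (10, 1), (4, 0)]

# Data understanding: a basic profile of the raw data.
n_missing = sum(1 for visits, _ in raw if visits is None)

# Data preparation: drop incomplete records, then split into train and test sets.
clean = [(v, y) for v, y in raw if v is not None]
train, test = clean[:5], clean[5:]

# Modeling: a one-feature threshold rule placed midway between the class means.
mean_pos = mean(v for v, y in train if y == 1)
mean_neg = mean(v for v, y in train if y == 0)
threshold = (mean_pos + mean_neg) / 2

def predict(visits):
    """Deployment: the fitted rule packaged as a callable."""
    return 1 if visits > threshold else 0

# Evaluation: accuracy on the held-out records.
accuracy = sum(predict(v) == y for v, y in test) / len(test)
print(n_missing, threshold, accuracy)
```

In practice the process is iterative: a poor evaluation result usually sends you back to data preparation or even to business understanding.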
Many universities have recently created degrees in Business Analytics, Data Analytics, or Data Science. Business Analytics, as the name implies, puts more emphasis on business skills and methods, while “Data Science” and “Data Analytics” put more emphasis on data engineering aspects.
Within the scientific community, the most popular name for this field has changed over time:
- Data Mining: first appeared in the 1970s and peaked around 2002, but is still used today
- KDD (Knowledge Discovery in Data): used in the 1990s, after the start of the KDD conferences, but now used only within the research community
- Predictive Analytics: appeared in the 2000s and was popularized by Predictive Analytics World, but has not caught on with the general public
- Data Science: 2012 to now, fueled by the popularity of the “Data Scientist” job
This Google Trends chart shows the relative change in popularity of 5 Data Science related terms from 2004 to 2017.
Fig. 2: Google Trends for Data Mining, Data Science, Data Analytics, Business Analytics, Predictive Analytics, 2004-2017.
2. Data Science vs Machine Learning: What are the differences?
Data Science and Machine Learning can be thought of as close cousins.
What they have in common is supervised learning methods – learning from historical data.
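As a minimal sketch of what "learning from historical data" means, the example below fits a straight line to past observations by ordinary least squares and then predicts an unseen case. The data and the `fit_line` helper are hypothetical, made up for illustration only.

```python
# Hypothetical historical data: (years_of_experience, salary_in_thousands).
history = [(1, 50), (2, 60), (3, 70), (4, 80)]

def fit_line(points):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Learning": estimate the parameters from labeled past examples.
slope, intercept = fit_line(history)

# "Prediction": apply the learned parameters to a case not seen before.
prediction = slope * 5 + intercept
print(slope, intercept, prediction)
```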
However, Data Science is also concerned with Data Visualization and with presenting results in a form understandable to people. Data Science also has a much bigger focus on Data Preparation and Data Engineering.
The main focus of Machine Learning is on the learning algorithms; it is not concerned, for example, with data visualization. Machine Learning studies not only learning from historical data, but also learning in real time. A major part of ML is the algorithms for agents that act in an environment and learn from their actions. This is called Reinforcement Learning (RL). To learn more about the history and current state of RL, see my Interview with Rich Sutton, the Father of Reinforcement Learning.
RL was the key part of the recent success of AlphaGo Zero and AlphaZero.
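To illustrate the idea of an agent learning from its own actions, here is a minimal sketch of tabular Q-learning, one classic RL algorithm, on a tiny corridor world. The environment, constants, and function names are hypothetical and chosen for illustration; this is not how AlphaGo Zero or AlphaZero work, since those systems combine RL with deep networks and tree search.

```python
import random

N_STATES, GOAL = 5, 4                  # corridor of states 0..4; reaching 4 ends an episode
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount factor, exploration rate

def step(state, action):
    """Environment: action 0 moves left, 1 moves right; reward 1 only at the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Explore at random with probability EPSILON, or when both actions tie.
            explore = rng.random() < EPSILON or q[s][0] == q[s][1]
            a = rng.randrange(2) if explore else (0 if q[s][0] > q[s][1] else 1)
            s2, r = step(s, a)
            # Q-learning update: move the estimate toward reward + discounted best next value.
            q[s][a] += ALPHA * (r + GAMMA * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
# Greedy policy for the non-terminal states: 1 means "move right".
policy = [0 if left > right else 1 for left, right in q[:GOAL]]
print(policy)
```

No labeled examples are given to the agent here; it discovers that moving right is best only through the rewards it receives while acting.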
3. Is Data Scientist a good job?
Yes! Data Scientist was ranked by Glassdoor as the best job in America for 3 years in a row – see
- Data Scientist – best job in America, 3 years in a row
- Data Scientist – best job in America in 2017
- Data Scientist – best job in America in 2016
A recent LinkedIn Economic Graph report also had good news for this field. Machine Learning Engineer and Data Scientist were the top US emerging jobs in 2017, with Machine Learning Engineer jobs growing 9.8 times in 5 years, and Data Scientist jobs growing 6.5 times.
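For a sense of scale, the 5-year growth multiples above can be converted into an implied average annual growth rate. The sketch below assumes constant compound growth over the period, which is a simplification the report itself does not state; the helper name is hypothetical.

```python
def annualized_growth(multiple, years):
    """Implied constant yearly growth rate for a total growth multiple over `years`."""
    return multiple ** (1 / years) - 1

ml_engineer = annualized_growth(9.8, 5)     # Machine Learning Engineer
data_scientist = annualized_growth(6.5, 5)  # Data Scientist

print(f"ML Engineer: {ml_engineer:.0%} per year")
print(f"Data Scientist: {data_scientist:.0%} per year")
```

Under that assumption, the multiples correspond to roughly 58% and 45% growth per year, respectively.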
4. Where can I learn Data Science?
Data Science Education is one of the most popular topics on KDnuggets, with a whole section dedicated to it.
There are many options for learning data science and related topics.
We have recently done a series of surveys of the Best Masters in Analytics and Data Science, also examining the tuition and ranking of each program. See
- Best Online Masters in Data Science and Analytics
- Best Masters in Data Science and Analytics in US/Canada
- Best Masters in Data Science and Analytics – Europe Edition
- Best Masters in Data Science and Analytics – Asia and Australia Edition
Here is an overview chart of the top ranked programs from the first post:
MS in Analytics, Data Science – Online and On Campus from this post
Symbol color is blue for online, green for on-campus; shape is circle for MS in Analytics; square for MS in Data Science.
We note that there is little correlation between ranking and tuition. Most high-ranking universities do NOT offer online degrees. Berkeley and CMU are the exceptions.
Slightly over half of the MS degrees we surveyed are called “Data Science”, and most of them are technically oriented; slightly less than half are called “Analytics” and are mostly business oriented.
There are also many options for
- Certificates and Certification in Analytics, Big Data, Data Science, Machine Learning
- Online classes and courses on Data Science
- Bootcamps in Analytics, Big Data, and Data Science
See also relevant KDnuggets posts on Data Science courses and education under
- Tag: Online Education (201)
- Tag: Data Science Education (167)
- Tag: Master of Science (64)
- Tag: MS in Data Science (40)
- Tag: MS in Analytics (27)
- Tag: MS in Business Analytics (25)
5. What algorithms and methods do Data Scientists use?
While Deep Learning pushes the state of the art seemingly every day, and very advanced methods like XGBoost win many Kaggle competitions, most Data Science work involves more basic algorithms and methods.
A recent KDnuggets poll had these top 10 results:
Top 10 Data Science, Machine Learning Methods Used, 2017 KDnuggets Poll
Deep Learning was used by about 20% of respondents.
Our poll also found which methods were most affiliated with industry:
- Uplift modeling (for the second year in a row)
- Anomaly/Deviation detection
- Gradient Boosted Machines
The most “academic” methods are advanced topics related to Deep Learning:
- Generative Adversarial Networks (GAN)
- Reinforcement Learning
- Recurrent Neural Networks (RNN)
- Convolutional Nets
In the Kaggle 2017 survey, The State of Data Science & Machine Learning, the most common Data Science methods used at work were:
- Logistic Regression, 63.5%
- Decision Trees, 49.9%
- Random Forests, 46.3%
- Neural Networks, 37.6%
- Bayesian Techniques, 30.6%
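Since Logistic Regression tops this list, here is a minimal sketch of how it works: a linear score is passed through a sigmoid to give a probability, and the weights are fit by gradient descent on the log-loss. The data, learning rate, and function names are hypothetical; in practice most practitioners would use a library implementation rather than hand-written code like this.

```python
import math

# Hypothetical data: hours studied -> passed the exam (1) or not (0).
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lr=0.3, epochs=5000):
    """Batch gradient descent on the mean log-loss for one feature plus a bias."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residual for each example: predicted probability minus true label.
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit(xs, ys)

def predict_proba(x):
    """Estimated probability of the positive class for input x."""
    return sigmoid(w * x + b)

print(predict_proba(1.0), predict_proba(4.0))
```

The appeal of the method is visible even in this sketch: the model has only two parameters, trains quickly, and outputs a probability that is easy to explain.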
To learn more about the most important algorithms, see our most popular posts on algorithms:
- Top 10 Machine Learning Algorithms for Beginners
- The 10 Statistical Techniques Data Scientists Need to Master
- Logistic Regression: A Concise Technical Overview
- Which Machine Learning Algorithm be used in year 2118?
- Machine Learning Algorithms: Which One to Choose for Your Problem
- Random Forests®, Explained
- XGBoost, a Top Machine Learning Method on Kaggle, Explained
- Machine Learning Algorithms: A Concise Technical Overview – Part 1
- The Machine Learning Algorithms Used in Self-Driving Cars
- Keep it simple! How to understand Gradient Descent algorithm
- The 10 Algorithms Machine Learning Engineers Need to Know
and KDnuggets Posts tagged