We continue our series of analytical studies of skill demand in the labor market. This time, thanks to Pavel Surmenk of sharky, we look at a new profession: Data Scientist.
In recent years the term Data Science has been gaining popularity. It is written about a lot and discussed at conferences. Some companies even hire people for a position with the resonant title Data Scientist. What is Data Science? And who are Data Scientists?
Who are Data Scientists?
If you ask a resident of San Francisco this question, you may hear that a Data Scientist is a statistician who lives in San Francisco. Funny, though not very encouraging for those who live outside San Francisco, right? Then here is another definition: a Data Scientist is someone who understands statistics better than any programmer and programming better than any statistician. This version is closer to the essence. A Data Scientist, literally a "scientist of data", is a kind of hybrid of a statistician and a programmer. Both statisticians and programmers vary widely, so it is better to think of this profession as a broad spectrum from pure statisticians to pure programmers.
Robert Chiang, a Data Scientist at Twitter, divides members of the profession into two groups: Type A Data Scientists vs. Type B Data Scientists.
Type A, where A is for Analysis. These people are mostly engaged in extracting meaning from static data. They are very similar to statisticians, and may even be statisticians who simply changed their job title to Data Scientist. As we know, a mere change of job title can bring a considerable raise, plus honor and respect. But besides statistics they also know the practical side: how to clean data, how to work with large data sets, how to visualize data and how to describe the results of their work.
Type B, where B is for Building. They also know statistics, but at the same time they are strong, experienced programmers. They are more interested in applying data in real systems. They often build models that interact with users, for example recommendation systems for goods, movies and advertising.
Data Science also overlaps somewhat with fields such as Machine Learning and Artificial Intelligence; practitioners in those fields are close to Type B Data Scientists.
Data Scientist Skills
On the English-language Internet, interest in Data Science has been rising clearly since about 2012 (https://www.google.com/trends/explore#q=Data%20Science). Over the last several years interest in adjacent areas has also grown noticeably: Machine Learning, Artificial Intelligence, Deep Learning. Gartner placed Machine Learning at the peak of its hype curve in 2015: Gartner's 2015 Hype Cycle for Emerging Technologies Identifies the Computing Innovations That Organizations Should Monitor. And in 2012 Harvard Business Review published an article with the intriguing title: Data Scientist: The Sexiest Job of the 21st Century.
What should someone who wants to become a Data Scientist study, and what skills are needed? Let's look at the requirements American employers set for candidates for positions in Data Science and Machine Learning.
We analyzed 549 vacancies, published on one of the world's largest job search portals, that included Data Science and Machine Learning requirements.
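A tally of this kind can be sketched in a few lines of Python. The vacancy texts and the skill list below are made up for illustration and are not the data set from this study. The matching rule (a case-insensitive whole-word search) is also only an assumption about how requirements might be detected.

```python
import re
from collections import Counter

# Hypothetical vacancy texts; NOT the 549 vacancies analyzed in the article.
VACANCIES = [
    "Data Scientist: statistics, Python, machine learning, SQL",
    "Machine Learning Engineer: Python, Spark, statistics",
    "Analyst: R, statistics, data visualization",
]

# Hypothetical skill list chosen for this illustration.
SKILLS = ["statistics", "python", "r", "machine learning", "spark", "sql"]


def count_skill_mentions(vacancies, skills):
    """Count in how many vacancies each skill is mentioned at least once."""
    counts = Counter()
    for text in vacancies:
        lowered = text.lower()
        for skill in skills:
            # Whole-word match so that "r" does not match inside "spark".
            pattern = r"(?<![a-z0-9+#])" + re.escape(skill) + r"(?![a-z0-9+#])"
            if re.search(pattern, lowered):
                counts[skill] += 1
    return counts


if __name__ == "__main__":
    for skill, n in count_skill_mentions(VACANCIES, SKILLS).most_common():
        print(f"{skill}: {n} of {len(VACANCIES)} vacancies")
```

Counting each skill once per vacancy, rather than once per occurrence, keeps a single wordy posting from dominating the ranking.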
Data Scientist Hard Skills
Let's begin by analyzing the requirements for professional skills (hard skills).
As the rating shows, fundamental knowledge of mathematics, statistics, Computer Science and machine learning is in the highest demand. In addition to theoretical knowledge, a Data Scientist has to be able to acquire, clean, model and visualize data. Experience in software development and quality management is also important.
Data Science Tools and Technologies
The main Data Scientist tools are the Python and R programming languages.
R is a specialized programming language for statistical computing, which is why statisticians and data scientists have taken such a liking to it. It lets you quickly load a data set, compute the main statistical characteristics, visualize the data and build data models.
Python, although a general-purpose programming language, has a huge number of high-quality libraries and frameworks for Data Science and Machine Learning.
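The workflow described for R (load a data set, compute the main statistics) looks much the same in Python. Here is a minimal sketch using only the standard library. The CSV content and the column name salary are invented for the example, and in practice a dedicated data library would usually be used instead.

```python
import csv
import io
import statistics

# Hypothetical data set, embedded as text so the example is self-contained.
CSV_TEXT = """name,salary
a,90000
b,110000
c,100000
d,140000
"""


def summarize(csv_text, column):
    """Load a CSV from text and return basic statistics for one numeric column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


if __name__ == "__main__":
    print(summarize(CSV_TEXT, "salary"))
```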
Notably, 39% of vacancies require knowledge of both R and Python at the same time, so it is better to learn both languages rather than trying to choose one of them.
For working with big data, employers prefer Hadoop and Spark. Among databases, MySQL and MongoDB are popular.
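A share like this is a simple co-occurrence count. The sketch below assumes each vacancy has already been reduced to a set of required tools. The sample sets are hypothetical, so the resulting percentage has nothing to do with the 39% reported here.

```python
# Hypothetical per-vacancy sets of required tools; not the article's data.
REQUIRED_TOOLS = [
    {"python", "r", "sql"},
    {"python", "spark"},
    {"r"},
    {"python", "r"},
    {"hadoop", "sql"},
]


def share_requiring_all(vacancies, tools):
    """Fraction of vacancies whose requirements include every tool in `tools`."""
    if not vacancies:
        return 0.0
    wanted = set(tools)
    # `wanted <= required` is the subset test: all wanted tools are required.
    matching = sum(1 for required in vacancies if wanted <= required)
    return matching / len(vacancies)


if __name__ == "__main__":
    share = share_requiring_all(REQUIRED_TOOLS, {"python", "r"})
    print(f"{share:.0%} of the sample vacancies require both R and Python")
```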
Data Scientist Soft Skills
General competences (soft skills) are in less demand than professional skills: they are mentioned in vacancies less than half as often. Average salaries in vacancies that require soft skills are also significantly lower, by about 20%, than in vacancies that require hard skills and knowledge of technologies.
Nevertheless, among the soft skills that do appear, the most important are the ability to communicate, to visualize data, to give presentations, and to write and speak effectively. Teamwork, management and problem-solving skills are also useful.
Data Scientist Domain Knowledge
Some vacancies require knowledge of a subject domain, from physics and biology to real estate and the hotel business. The leaders here are economics, marketing and medicine.
Data Scientist Specializations
Before the study we expected to identify subspecializations within the Data Scientist profession, for example to separate those who mainly do analysis and data visualization from those who build models for predictive analytics or machine learning algorithms. But, as the data analysis showed, the requirements in most vacancies are rather homogeneous, and no clear split into specialties emerges.
Still, some patterns look interesting. For example, if a vacancy requires knowledge of Python or C++, it is unlikely to also require communication and management skills, and vice versa.
Influence of Technologies on Salary
The O'Reilly 2015 Data Science Salary Survey lets us look at the labor market from the other side. The study is based on a survey of 600 Data Scientists, and the data collected include salary levels, demographic information and the amount of time specialists spend on different types of tasks. The key findings of the study are:
- SQL, Excel, R and Python are the key tools, and this list has not changed in 3 years.
- The popularity of Spark and Scala is growing strongly.
- Those who previously used specialized commercial tools are shifting to R.
- But those who previously used R are moving to Python; Python is in the lead.
- Across all industries, the highest salaries are in Software Development.
- Cloud Computing remains in demand.
We recommend reading the full report. Among other things, it describes a mathematical model of how a Data Scientist's salary depends on where they live, what education they have and what tasks they work on. For example, Data Scientists who spend more time in meetings earn more, while those who spend more than 4 hours a day exploring data earn less.
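To make the idea of such a model concrete, here is a minimal sketch of a one-variable least-squares fit in Python. It is not the model from the O'Reilly report, which this article does not reproduce. The numbers are invented, and the single predictor (weekly hours in meetings) is chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented observations: weekly hours in meetings vs. salary in thousand USD.
MEETING_HOURS = [2, 4, 6, 8, 10]
SALARY_K = [90, 98, 102, 110, 115]

if __name__ == "__main__":
    intercept, slope = fit_line(MEETING_HOURS, SALARY_K)
    print(f"salary ~ {intercept:.1f} + {slope:.2f} * meeting_hours (thousand USD)")
```

A positive slope in a fit like this shows an association only. It does not show that attending more meetings causes a higher salary; seniority, for instance, could drive both.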
How to study Data Science?
In recent years many online courses on this subject have appeared. They are a very good way to start!
If you lean more toward data analysis, a good option is the Data Science specialization on Coursera: Launch Your Career in Data Science. The specialization itself is not free, but if you do not need the certificate, you can take all these courses free of charge: just look up the name of each course and find it through search.
For those interested in Machine Learning, we can recommend the course by Andrew Ng, Chief Scientist at Baidu Research, who also teaches at Stanford and is a co-founder of Coursera: Machine Learning.
What is Data Science?
Data Science is a new field, so the requirements for Data Scientists are not yet fully formed. Given how fast things change today, Data Science may never become an independent profession taught at universities, and may instead remain a set of practices and skills. But these are precisely the practices and skills that will be in high demand in the coming years.
This article is a translation of the original post at habrahabr.ru/post/271085/