
Will data scientists soon be replaced by automated algorithms and artificial intelligence?

Hello, Habr! Modern machine learning and data science have several major trends. The first is deep learning: recognition of images, audio and video, and natural language processing. Another is reinforcement learning, which lets algorithms play computer and board games successfully and makes it possible to keep improving models based on feedback from the external environment.

There is one more trend, less noticeable because its results look less impressive to outside observers, but no less important: the automation of machine learning. Its rapid development has again raised the question of whether data scientists will eventually be automated away and replaced by artificial intelligence.

According to estimates by the American research and consulting company Gartner, by 2020 more than 40% of tasks in big data and data science will be automated. Even if this estimate is not inflated, specialists in big data and machine learning have nothing to worry about. Most experts share this opinion, including the developers of automated machine learning systems themselves.

The point is that an analyst's role in a company, however sophisticated the analysis tools, does not boil down to applying those tools. According to CRISP-DM, the most popular methodology for running data analysis projects, such a project includes six phases, and an analyst or data scientist is directly involved in each of them:
  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Steps 3 and 4 involve a lot of routine work. To apply machine learning to specific cases, you constantly need to:

  1. Tune the hyperparameters of the models;
  2. Try new algorithms;
  3. Feed the model different representations of the original features (standardization, variance stabilization, monotone transformations, dimensionality reduction, encoding of categorical variables, creation of new features from existing ones, etc.).
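To make this routine concrete, here is a minimal sketch of how such a search can be automated: every combination of a feature representation and a hyperparameter is tried and scored by leave-one-out validation. Everything in it is an assumption made for illustration: the toy data set, the helper names (`fit_identity`, `fit_standardize`, `loo_accuracy`, `grid_search`) and the choice of a k-nearest-neighbours model. In practice a library routine such as scikit-learn's `GridSearchCV` would play this role.

```python
import itertools
import math
import statistics

# Toy data (hypothetical): the first feature separates the classes, the
# second is large-scale noise that dominates raw Euclidean distances.
DATA = [((1.0, v), 0) for v in (1000.0, 2000.0, 3000.0, 4000.0)] + \
       [((3.0, v), 1) for v in (1000.0, 2000.0, 3000.0, 4000.0)]

def fit_identity(rows):
    """Representation 1: leave the features unchanged."""
    return lambda row: list(row)

def fit_standardize(rows):
    """Representation 2: zero mean and unit variance, fitted on training rows only."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero spread
    return lambda row: [(v - m) / s for v, m, s in zip(row, means, stds)]

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def loo_accuracy(fit_transform, k):
    """Leave-one-out accuracy for one (representation, k) combination."""
    hits = 0
    for i, (x_test, y_test) in enumerate(DATA):
        train = DATA[:i] + DATA[i + 1:]
        transform = fit_transform([x for x, _ in train])
        train_x = [transform(x) for x, _ in train]
        train_y = [y for _, y in train]
        hits += knn_predict(train_x, train_y, transform(x_test), k) == y_test
    return hits / len(DATA)

def grid_search():
    """Try every combination and return (score, representation name, k) of the best."""
    grid = itertools.product([fit_identity, fit_standardize], [1, 3, 5])
    results = [(loo_accuracy(fit, k), fit.__name__, k) for fit, k in grid]
    return max(results, key=lambda r: r[0])

if __name__ == "__main__":
    print(grid_search())
```

The point of the sketch is only that the loop is mechanical: once the candidate representations, models and hyperparameter values are listed, no human judgment is needed to run through them.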

Automation can relieve analysts and data scientists of these routine operations, as well as of part of the data preparation and cleaning work. However, everything else in steps 3 and 4, and all the remaining steps of CRISP-DM, will stay, so this simplification of analysts' daily work poses no threat to the profession.

Machine learning is only one of a data scientist's tools, alongside visualization, exploratory data analysis, and statistical and econometric methods. Even within machine learning, full automation is impossible. The data scientist will of course keep a central role in solving non-standard problems and in developing and applying new algorithms and their combinations. An automated algorithm can iterate through all the standard combinations and produce a baseline solution that a qualified specialist can then build on and improve. In many cases, though, the results of the automated algorithm will be good enough to use directly, without further improvement.

One can hardly expect a business to use the results of automated machine learning without the help of analysts. Data preparation, interpretation of results and the other stages of the scheme above will be needed in any case. At the same time, many companies already employ analysts who work with data constantly, have the right mindset and know the subject area deeply, but have not mastered machine learning methods at the required level. It is often difficult for an industry-specific company to attract highly qualified and highly paid machine learning specialists, for whom demand is growing and exceeds supply many times over. One solution is to give the analysts already working in the company access to automated machine learning tools. This is the democratizing effect of automation: in the future, the advantages of big data will be available to many companies without building highly professional teams or hiring consulting firms.

Two automated machine learning packages currently stand out as the most effective. Both build on scikit-learn (sklearn), the Python machine learning library, and both are under active development.

The first is the Auto-sklearn library, developed at the University of Freiburg. This package won the automated machine learning competition recently held by the KDnuggets portal, and also showed the best results in the auto and tweakathon tracks of the ChaLearn AutoML challenge. Auto-sklearn automates model selection and hyperparameter optimization using Bayesian optimization, uses meta-learning, builds ensembles of models, and automates data preprocessing, including variable encoding and dimensionality reduction. Auto-sklearn works only on Linux and requires the sklearn library to be installed. The library supports distributed computing. Auto-sklearn can be downloaded from its official GitHub repository; package documentation is also available. Applying the Auto-sklearn classifier to the well-known MNIST data set (recognition of handwritten digits) takes about an hour and yields an accuracy of more than 98%.
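As an illustration of the "builds ensembles of models" step, below is a minimal sketch of greedy ensemble selection, one well-known way to combine already trained models by averaging their predictions. It is not Auto-sklearn's code, and the article does not say which ensembling method the library uses. The model names, the validation probabilities and the helper functions are all hypothetical.

```python
# Hypothetical validation-set probabilities of class 1 from three trained models.
LABELS = [1, 1, 1, 0, 0, 0]
PREDICTIONS = {
    "model_a": [0.9, 0.8, 0.4, 0.1, 0.2, 0.3],    # wrong on the third example
    "model_b": [0.6, 0.4, 0.9, 0.3, 0.1, 0.2],    # wrong on the second example
    "model_c": [0.7, 0.6, 0.55, 0.6, 0.7, 0.4],   # wrong on the fourth and fifth
}

def accuracy(probs, labels):
    """Share of examples where thresholding at 0.5 gives the true label."""
    hits = sum((p >= 0.5) == (y == 1) for p, y in zip(probs, labels))
    return hits / len(labels)

def average_probs(members, predictions):
    """Average the probabilities of the listed models (repeats count twice)."""
    n_examples = len(predictions[members[0]])
    return [sum(predictions[m][i] for m in members) / len(members)
            for i in range(n_examples)]

def greedy_ensemble(predictions, labels, max_rounds=10):
    """Repeatedly add the model that most improves validation accuracy.

    Models may be added more than once; the search stops as soon as no
    candidate gives a strict improvement.
    """
    members, best_score = [], -1.0
    for _ in range(max_rounds):
        round_best = None
        for name in sorted(predictions):
            score = accuracy(average_probs(members + [name], predictions), labels)
            if score > best_score:
                best_score, round_best = score, name
        if round_best is None:
            break
        members.append(round_best)
    return members, best_score

if __name__ == "__main__":
    print(greedy_ensemble(PREDICTIONS, LABELS))
```

In this toy case no single model is perfect on the validation set, but two models that make different mistakes correct each other once their predictions are averaged.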

The second leading solution in automated machine learning is the TPOT library. Its key differences from Auto-sklearn are as follows:

  1. Instead of Bayesian optimization it uses genetic programming, in which models undergo something like Darwinian natural selection;
  2. XGBoost, a well-known library for gradient boosting over decision trees, is supported;
  3. There are no restrictions on operating systems;
  4. Unlike Auto-sklearn, TPOT outputs not only the final trained model but also ready-to-use Python code for every step of building the best model (pipeline).

TPOT's accuracy on the same MNIST data set, without any prior tuning, is 98.4%.
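The "natural selection" idea from the list above can be sketched in a few lines. This is a deliberately simplified genetic algorithm over flat pipeline configurations, not TPOT's implementation; TPOT's genetic programming works with richer pipeline structures. The search space, the `toy_score` function (a stand-in for cross-validated accuracy) and all the parameter values are assumptions made for illustration.

```python
import random

# Hypothetical search space: each pipeline is one choice per key.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "model": ["knn", "tree", "logreg"],
    "complexity": [1, 2, 3, 4, 5],
}

def toy_score(pipeline):
    """Stand-in for cross-validated accuracy; a real system would train here."""
    score = 0.80
    score += {"none": 0.0, "standard": 0.05, "minmax": 0.03}[pipeline["scaler"]]
    score += {"knn": 0.02, "tree": 0.04, "logreg": 0.03}[pipeline["model"]]
    score -= 0.01 * abs(pipeline["complexity"] - 3)
    return score

def random_pipeline(rng):
    return {key: rng.choice(options) for key, options in SEARCH_SPACE.items()}

def crossover(parent_a, parent_b, rng):
    """Child takes each setting from one of the two parents at random."""
    return {key: rng.choice([parent_a[key], parent_b[key]]) for key in SEARCH_SPACE}

def mutate(pipeline, rng, rate=0.3):
    """Each setting is re-drawn at random with probability `rate`."""
    return {key: rng.choice(SEARCH_SPACE[key]) if rng.random() < rate else value
            for key, value in pipeline.items()}

def evolve(generations=15, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    history = []  # best score seen at the start of each generation
    for _ in range(generations):
        population.sort(key=toy_score, reverse=True)
        history.append(toy_score(population[0]))
        # Selection: the better half survives unchanged.
        survivors = population[: population_size // 2]
        # Reproduction: children are mutated crossovers of two survivors.
        children = [mutate(crossover(*rng.sample(survivors, 2), rng), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=toy_score), history

if __name__ == "__main__":
    best, history = evolve()
    print(best, history[-1])
```

Because the best pipelines always survive into the next generation, the best score never decreases from one generation to the next; mutation and crossover are what let the search reach combinations that were not in the initial population.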

The packages above are used together with the Python language and its libraries, which may be an obstacle for some analysts and for specialists in other professions. Some of the largest cloud services, such as Amazon Machine Learning and BigML, are trying to make machine learning accessible to everyone. Users of such services need no knowledge of machine learning algorithms or data preprocessing; they receive all the necessary hints, explanations and visualizations while building models. These cloud services also provide an already deployed, ready-to-use infrastructure for storing and processing large amounts of data, which a particular company may lack. Their disadvantage is the limited set of algorithms and optimization methods they use. For example, BigML focuses on decision trees, and Amazon Machine Learning uses only classifiers based on stochastic gradient descent. Such services are designed to build reasonably good solutions to standard problems rather than to get the best possible model in every situation.

There are also more advanced cloud services for automated machine learning whose capabilities are similar to those of the Python libraries discussed above. Among them, the DataRobot service deserves particular mention. Its advantages over the Python automated machine learning libraries are an intuitive web interface, the ability to combine the best algorithms of R, sklearn, Spark, XGBoost, H2O, TensorFlow, Vowpal Wabbit and other systems in one model, the infrastructure needed for analyzing and processing data, and visualization of the model-building stages and the final results. DataRobot automatically performs statistical processing of texts, automatically determines the types of variables and encodes them if necessary, applies the necessary transformations and can construct new features automatically, uses intelligent methods for selecting hyperparameters, and evaluates model performance over a wide range of available metrics. Parallel computation is supported, which speeds up both training and applying models, and the system includes tools for deploying the constructed models quickly and easily.

DataRobot is the most universal of these services. In addition to it, many cloud services have been developed to automate and democratize machine learning in specific subject areas. For example, ThingWorx Analytics is designed to automate machine learning for the Internet of Things, primarily to monitor equipment, predict breakdowns and optimize performance, while Context Relevant offers automated cybersecurity and anti-fraud solutions.

The future of the data scientist profession is still uncertain and remains a matter of expert judgment. However, nothing prevents you from using the results of automated machine learning right now. Automated machine learning is an easy first step toward applying machine learning in your current job or, if you already work as a data scientist, toward making your daily duties much easier.

The demand for specialists in machine learning and data analysis is constantly growing, and the automation of machine learning, by democratizing the technology, will only expand its use. Today almost anyone can create a beautiful website with a website builder such as Ucoz. Yet the demand for web designers and developers has not fallen since such builders appeared; on the contrary, it has increased many times over. The range of available web development tools has grown much richer, and the tools themselves have become more complex and functional. If we assume that data analysis will become as accessible to companies as website creation, you can imagine what the demand for highly qualified specialists in this field will be in a few years.

We remind you that the “Big Data Specialist” program starts on March 16. We will be glad to see you.

Source: https://habr.com/ru/post/322414/

