Five Myths about Data Science

My name is Ivan Serov, and I work in the Data Science department of the fintech company ID Finance. Data scientist is a fairly young but highly sought-after profession, and it has already accumulated plenty of myths. In this post I'll go over a few misconceptions that novice data scientists (DS) run into.

A DS doesn't need to understand the business

A good DS should be able not only to build a good model, but also to understand why the model is being built, and to say so when it is not needed. For example, in one of our projects we were building a model to predict whether money was available in a client's account, while debits were made by a special algorithm. While building the model, we realized it was not needed: it was easier to slightly improve the existing algorithm. Sometimes the cost of a DS's work greatly exceeds the revenue the new model would bring. In that case, the DS should discuss with the project manager whether the model is needed at all, and move on to something more useful.

Complex algorithms are always better

XGBoost, LightGBM, Random Forest... These algorithms are named as the default choice for any task, and many DS beginners never even try starting with something simpler. Trouble begins when you run into a sparse dataset with 10,000 variables and 20,000 rows, and XGBoost shows a Gini of 0.2 (AUROC 0.6). In a case like that, a simple SVM with a nonlinear kernel, which gave a Gini of 0.8, is a better fit. Simple models sometimes work better than complex ones.
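The two metrics quoted above are related by Gini = 2 * AUROC - 1, which is why an AUROC of 0.6 corresponds to a Gini of 0.2. Below is a minimal sketch of that relation in plain Python. The function names and the toy labels and scores are illustrative assumptions, not anything from our projects.

```python
def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of rows, so this is
    only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini_from_auroc(auc):
    """Gini coefficient of a binary classifier: Gini = 2 * AUROC - 1."""
    return 2.0 * auc - 1.0


# Toy data (hypothetical): three of the four positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)
toy_gini = gini_from_auroc(toy_auc)
```

Comparing a simple baseline and a boosted model on the same metric before committing to either is cheap, and the conversion keeps Gini and AUROC reports comparable.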

If you want to be a cool DS, go to a big company

Every day we hear from big companies about their new projects: how artificial intelligence improved one process by 10%, another by 20%, and so on. This can leave the impression that things only happen in large companies, and that smaller companies have neither interesting projects nor good data scientists. Fortunately, this is not the case. Having worked in one of the largest banks, one that positions itself as digital, I can say that startups have more interesting projects. The pace at which large companies implement projects has become the talk of the town and a source of memes. For example, a single project in a bank can take three months or even six, and in that time you could complete several projects in a startup. Conclusion: the PR of large companies is often just PR.

Project managers are paid more than good specialists

Those who outgrow the mid level often ask where to go next. There are really two options: Lead Data Scientist (team lead) and Senior DS. A lot has already been written about the difference between the levels (for example, here’s a good post from Viktor Kantor), so I’ll just say that a good specialist's salary can be much higher than any team lead's, and the choice should be driven only by what you want to do. Usually, after several years of work, burnout sets in: all the tasks start to look the same and become annoying. At that point you either look for something new (fortunately, market leaders like Nvidia, Amazon or Yandex always have something) or move into management (Lead DS -> Chief DS -> CDO), which is what many choose.

A DS should not have to deploy the model or test its results

Many will disagree and say that there are data engineers now, and deploying models is their job. But a DS still has to make the data engineer's work easier, and at the very least:

Write clean code that is easy to understand.
Think through how variables are encoded. For example, a LabelEncoder can easily be saved as a .pkl file, but frequency encoding on new data can be a problem.
Consider how A/B tests will be run later (incidentally, evaluating the model after it goes to production in most cases still falls to whoever developed it).
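To illustrate the encoding point from the list above: frequency encoding becomes a problem on new data when the frequencies are recomputed there, or when a category appears that was never seen in training. A minimal sketch of one way around this, assuming the frequencies are fitted once on training data, stored as a plain dictionary, and reused as-is. The function names, the card-type values and the default of 0.0 for unseen categories are all illustrative assumptions.

```python
import pickle
from collections import Counter


def fit_frequency_map(values):
    """Learn category -> relative frequency from training data only."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot fit a frequency map on empty data")
    return {category: count / total for category, count in counts.items()}


def apply_frequency_map(mapping, values, default=0.0):
    """Encode values with the stored training frequencies.

    Categories unseen during training get `default` (an assumed policy;
    another sensible choice would be the smallest training frequency).
    """
    return [mapping.get(value, default) for value in values]


# Hypothetical training column.
train_column = ["visa", "visa", "mastercard", "mir"]
mapping = fit_frequency_map(train_column)

# Round trip through pickle, as the mapping would travel inside a .pkl file.
restored = pickle.loads(pickle.dumps(mapping))

# New data contains "amex", which was never seen in training.
encoded = apply_frequency_map(restored, ["visa", "mir", "amex"])
```

Keeping the mapping as a plain dictionary means the engineer needs nothing beyond the standard library to load and apply it.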

In many companies there are no engineers at all, and the DS does everything. Another possible situation is when the model talks to your service through an API built by someone from IT, who may know nothing about data science. In this case the DS can write a data-processing module, export the model as a .pkl file, and build a ready-made executable that takes a JSON request as input and returns the answer as JSON. A separate note on testing: already at the model-building stage it is important to think about future A/B tests, choose the right metric, and understand the economic effect of the model.
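The JSON-in, JSON-out entry point described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the feature names, the coefficients, and the stand-in logistic model stored as a dictionary. In a real project the model object would be loaded from the exported .pkl file instead of being defined inline.

```python
import json
import math

# Hypothetical stand-in for a trained model. In practice this object would
# come from the exported .pkl file (for example via pickle.load).
MODEL = {"intercept": 0.0, "weights": {"x1": 1.0, "x2": -1.0}}


def score(model, features):
    """Logistic score computed from the stored coefficients."""
    z = model["intercept"]
    for name, weight in model["weights"].items():
        z += weight * float(features[name])
    return 1.0 / (1.0 + math.exp(-z))


def handle_request(raw_request, model=MODEL):
    """Take a JSON string with features, return a JSON string with the answer."""
    try:
        payload = json.loads(raw_request)
    except json.JSONDecodeError:
        return json.dumps({"error": "invalid JSON"})
    if not isinstance(payload, dict):
        return json.dumps({"error": "expected a JSON object"})

    # Tell the caller exactly which inputs are missing instead of crashing.
    missing = sorted(name for name in model["weights"] if name not in payload)
    if missing:
        return json.dumps({"error": "missing fields", "fields": missing})

    try:
        return json.dumps({"score": score(model, payload)})
    except (TypeError, ValueError):
        return json.dumps({"error": "features must be numeric"})


ok_response = json.loads(handle_request('{"x1": 2, "x2": 2}'))
bad_response = json.loads(handle_request('{"x1": 2}'))
```

Returning explicit error messages for malformed or incomplete requests lets the person who owns the API integrate the model without knowing anything about how it works inside.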

I hope this post has shed some light on the issues that newcomers to data science face, and that it will help someone. In future posts I will look at some of these myths and speculations in more detail.

Which myths have you run into most often?

About us:

The fintech holding ID Finance specializes in data science, credit scoring and non-bank lending. The company develops the brands MoneyMan, AmmoPay, Solva and Plazo in Russia, Spain, Kazakhstan, Georgia, Poland, Brazil and Mexico. ID Finance's R&D center is located in Minsk. The company's founders, Alexander Dunaev and Boris Batin, are former top managers of Deutsche Bank and Royal Bank of Scotland. ID Finance's investors include the venture fund Emery Capital. The company ranked 36th in the Financial Times rating of the fastest-growing companies in Europe in 2018. Since 2012, ID Finance has financed loans totaling more than 275 million EUR. At the beginning of 2018, the company's total loan portfolio was 77 million USD. We have been covered by Forbes, Business Insider, Finextra, Venture Beat, Crowdfund Insider, The Banker and the BBC. We also publish in Russian media: Forbes, VC, Roem, RusBase and others.

Source: https://habr.com/ru/post/353270/
