About a year ago, I wrote a post about my transition from academia to the now popular profession of Data Scientist. To my surprise, I received quite a lot of messages from people who found themselves in a similar situation; that is, the post found its audience and was useful to someone. Now the time has come to write a sequel.
(I apologize in advance for the abundance of English words: some of them I don't know how to translate, and others I simply don't want to translate.)
Summary of the first part:
It is important to note that I left academia not because I dislike science - I love it very much - but because I had a strong impression that at this stage work in industry would give me more. For example, I frankly did not like the code I was writing, especially after reading a book on "clean code", nor did I like how I maintained it. Yes, I actively used git for some projects and when writing papers (although my advisor did not encourage it; he prefers the old-fashioned way), but such self-study still differs from a real project with 100 developers. The fact that the academic environment is very conservative and unhurried also always annoyed me. I wanted to plunge into something new, dynamic, and youthful, so that words like agile development and things like "round B financing" would turn from abstract concepts in a book into something natural and understandable. A startup in Silicon Valley was clearly a wonderful option.
A company called Bidgely offered me a Data Scientist position with a salary of $130k a year gross (about $7,400 a month net) to work in an office in Sunnyvale, Silicon Valley, a few kilometers from the headquarters of Google, LinkedIn, Apple, etc.
Bidgely (hereinafter sometimes referred to by the code word "the Indians") is a startup that takes energy consumption as a function of time and splits it into components: this is the washing machine, this is the air conditioning, this is the kitchen stove, this is the refrigerator, etc.
All this is needed so that people know where their electricity goes and, if they want to optimize spending, know where to dig. For example, it is quite common to overestimate the energy consumption of a computer and underestimate that of air conditioning.
The business was built on a B2B model; that is, the customers were not mere mortals but utility companies like PetroElectroSbyt.
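To make the idea of splitting an aggregate signal into appliances concrete, here is a toy sketch. It is not Bidgely's algorithm, which the post does not describe in detail; it assumes each appliance switches on and off with a fixed, known power step, and the names `SIGNATURES`, `detect_events`, and `label_event` as well as all wattage values are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical appliance power steps in watts (illustrative values only).
SIGNATURES: Dict[str, float] = {
    "refrigerator": 150.0,
    "washing_machine": 500.0,
    "air_conditioner": 2000.0,
}


def detect_events(power: List[float], min_step: float = 100.0) -> List[Tuple[int, float]]:
    """Return (index, delta) for every jump of at least min_step watts."""
    events = []
    for i in range(1, len(power)):
        delta = power[i] - power[i - 1]
        if abs(delta) >= min_step:
            events.append((i, delta))
    return events


def label_event(delta: float, tolerance: float = 0.2) -> str:
    """Match the size of a jump to the closest known signature, if close enough."""
    name, watts = min(SIGNATURES.items(), key=lambda kv: abs(abs(delta) - kv[1]))
    if abs(abs(delta) - watts) <= tolerance * watts:
        return name
    return "unknown"


# Aggregate household power: a 200 W baseline, then a refrigerator cycle
# with an air conditioner cycle nested inside it.
aggregate = [200, 200, 350, 350, 2350, 2350, 350, 200, 200]
labeled = [
    (i, label_event(d), "on" if d > 0 else "off")
    for i, d in detect_events(aggregate)
]
print(labeled)
```

Real signals are noisy and appliances overlap and have multi-stage cycles, which is exactly why simple rules like these end up needing many hand-tuned constants.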
End users on the screens see something like this:
First, to always be able to clearly answer a user's question about why the electricity bill is so large and why it is not a mistake.
Secondly, in some countries there are government programs that subsidize electricity suppliers if they can convincingly show that their efforts have increased the efficiency of consumption.
Or, for example, here is an interesting use case: in Australia it is very hot in summer, and energy suppliers are swamped in the sense that the infrastructure cannot cope with the load. To mitigate the problem, they would ask those who actively use air conditioning at peak hours (offering them $20) to cool their homes one hour before the expected peak, which spreads the load out.
What is a product? => algorithm + visualization.
Although the above may sound like marketing bullshit, the salespeople managed to convince a bunch of companies in Australia, the USA, Canada, and European countries that this is the future and is at least worth trying. I had doubts about the commercial viability of the product because it reminds me painfully of Google Fit: it gives some analysis of what is happening, but how that analysis solves any real problem was not obvious to me.
No serious company commits to something right away, so client companies are brought on board in three stages:
Over the year the company grew from 18 to more than 50 people. While it was still small, the Data Science team consisted of one girl, who left for Facebook shortly before I was hired, and a guy named Alex, who either assisted her or worked in parallel with her. After the company expanded, it was decided to strengthen the Data Science team, and for this they hired me and an Indian guy named Pratik. Both of us had just finished our studies, what is called a fresh grad. The plan was for us to take some of the load off Alex, absorb part of his knowledge, and then push the algorithm vigorously into a bright future and the company toward billions in profit.
The company has two offices: frontend, sales, top management, and Data Science are in the USA, while the entire backend and QA are in India to save money.
All computation ran on AWS. The main programming languages were Java for the frontend and backend and Matlab for Data Science.
The task of Energy Disaggregation is similar to the task of speech recognition, which makes it very interesting: there is room for recurrent neural networks and many other beautiful modern techniques.
That is, I accepted this offer from the Indians because, on the one hand, I was tired of sleeping in a sleeping bag in a friend's basement and getting up every morning with the thought that the money on my credit cards would soon run out, that there was no job, and that the prospects for a freshly minted PhD were bleak. On the other hand, I was offered a position with decent junior money (salaries in the Valley for a similar position range from 100 to 150k a year) in a company where everything is dynamic (I understood that at the onsite interview in their office, when I saw how everyone was rushing about). And the problem is interesting from a mathematical point of view and has practical application.
There was one thing I did not like: the office, which was noisy and far too bright. Bright lamps plus a huge window through which the sun shines aggressively in the morning. It hits you right in the eyes. During the onsite I asked about the office, and the CEO said there was absolutely nothing to worry about, because nobody liked this office and in a couple of months we would move to a new building in Mountain View anyway.
During the conversation I liked the team - far from stupid people - and I was also very pleased by how easily I was allowed to take a vacation over New Year (two months after starting work).
With the exception of the office, which was only a temporary problem, this was not a job but a dream.
I wanted to believe that I had outgrown the stage in my evolution as a programmer when you want to throw away someone else's code, or your own old code, and rewrite it from scratch without thinking. I had also met people who suffered from the Dunning-Kruger effect during the transition from academia to industry, so I told myself in advance that everyone here is an expert and I am a fresh graduate, and therefore I first needed to absorb as much as possible from the senior comrades who were successfully conquering Silicon Valley through hard work and an immense level of knowledge, and only then draw any conclusions.
Upon arrival, I was given a 2012 13-inch MacBook with 4 GB of RAM, which, in my opinion, is a bit weak for developing algorithms even on small amounts of data. Alex, who was assigned to me as a mentor, quickly sketched a diagram of what needed to be done, pointed at a piece of code, added that it was very important, gave me instructions on how to compile the code, and flew off on business.
I followed the instructions and tried to compile the code, but it did not compile. I caught Alex, and it turned out that for everything to work you need to disable the unit tests, because they are outdated, and in general unit tests are bad style. I am not a TDD zealot, and this approach has never worked for me in its pure form, but even so Bidgely's attitude to testing made me uneasy.
I opened the code in Matlab and got scared. One of the reasons I left academia was to finally learn how to write production-quality code. To learn this, besides reading the right books, you have to work on complex code in a team with senior, experienced colleagues who, if necessary, rub your nose in your mistakes during code review, and you have to learn to use effective tools and practices such as smart IDEs, writing tests, proper architecture, etc.
I got scared because at first glance it was clear that I had landed squarely on the kind of code I had seen more than enough of, and struggled with, in grad school. Yes, they had tried to create a modular architecture, and in places it even worked, but there was random indentation, functions of 1000 lines, a complete absence of comments, variable names with mystical meanings like pkI, vIdx, featsT, pkPos, M1, a magic number in almost every function, and no documentation.
I asked Alex: "What do I do? This is a disaster." And he told me sincerely: "Well, I didn't have time to write good code, although I know how, of course. But now that there are three of us, we will do everything properly. For now everything is simple in principle: run it line by line in the debugger and watch how the variables evolve, and everything will become clear right away. It's just 10,000 lines."
Two weeks later, I made the commit that had been expected of me on the first day.
I think all this mess in the code is largely due to the fact that both Alex and the girl who left for Facebook, like me, came straight from academia, where on average people write crappy code, to an early-stage startup where features had to be produced quickly. For this they used Matlab, which encourages unreadable code and has no IDE with refactoring support. And since refactoring without convenient tools is painful, they did not do it.
At the same time, the people working on the frontend and backend wrote normal code, but they did not understand, even at a high level, what happens in the modules written by the Data Scientists, and used them as a black box.
So it turned out that the code was written by people who had not learned to write normal code, had no time to learn, and had no one to teach them, and on top of that the programming language and its IDE got in the way of anything good.
My conclusion for the future from this unhealthy situation: if a job description contains the word Matlab, I automatically exclude it from consideration. First, because the code is almost certainly a solid, unreadable, untestable mess, and adding features to it is an ocean of pain. Second, because Matlab itself, with the exception of its signal processing library, has no serious advantages over other tools. And the fact that the Data Science team chose it and, after three years of use, did not switch to something more sensible is a signal that the team's level of qualification is lower than where I would like to work.
When I was hired, the Indians pushed very actively for me to decide on the offer quickly, and then, after I agreed, asked me to start work immediately, the following Monday. As for the well-deserved trip to my historic homeland that I had planned to celebrate finding a job, they asked me to move it closer to New Year. All this haste was because the launch of a pilot project in Australia was planned for New Year and everything had to look beautiful; that is, the accuracy of the algorithm had to be raised to values that would impress the client.
When I asked our CEO, "Beautiful is how much?", he said that he understood that the data was noisy and that perfect accuracy could not be achieved, that everything was almost fine already, and that we only needed to tweak it a little to reach 90%. There was not much time left, so I was given the algorithms that detect the dishwasher and the washing machine, while electric stoves, swimming pools, and refrigerators went to the Data Scientists in the Indian office.
If 90% will do, then we will try to deliver 90%. Two months is not much, but on kaggle.com I had produced good models in less time, so "the Party said it must be done, the Komsomol answered: yes, sir."
The question is: how is this accuracy measured at all?
I imagined it would be something like classical cross-validation or similar, but it turned out that the Indians have their own ways. Done properly, you need a labeled data set that tries to be a representative sample of what the algorithm will be applied to; you compare the predictions with the labels and get an estimate of the model's accuracy. There are many nuances in doing this correctly (the accuracy of a model on training data, on test data, and in production are three very different things), but however you do it, to understand what is bad and what is good you need some reference against which the predictions are compared.
After three years of working on all these algorithms, the Indians had no such labeled reference data set. And this is very, very bad. Data is important. (A good post on this topic.) It should be noted that in most cases, with some practice, you can tell visually from the shape of the signal which one is the washing machine and which one is the pool. So what did the Indians do? They hired a whole team that, for little money, visually compares our predictions with what they see with their own eyes and files a bug report for each mismatch. Yet they made no attempt to build a labeled data set. That is, the assessment of the algorithm's accuracy was subjective.
It gets sadder still, because the decision to move a pilot project to the next stage is largely determined by how impressed the managers of the electricity supplier are with our disaggregation for their own homes. The words about 90% accuracy matter to them only insofar as they look at how the algorithm works on the data coming from their own homes.
That is, there were three different accuracies:
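The comparison against a labeled reference described above can be sketched as follows. This is only an illustration under assumptions: appliance runs are represented as hypothetical `(start, end)` index intervals, a prediction counts as correct if it overlaps a labeled run that has not been matched yet, and the function names and sample data are made up for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) of one appliance run, in sample indices


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one sample."""
    return a[0] < b[1] and b[0] < a[1]


def count_matches(predicted: List[Interval], labeled: List[Interval]) -> Tuple[int, int, int]:
    """Greedily match predictions to labels; return (tp, fp, fn)."""
    unmatched = list(labeled)
    tp = 0
    for p in predicted:
        hit = next((ref for ref in unmatched if overlaps(p, ref)), None)
        if hit is not None:
            unmatched.remove(hit)  # each labeled run can be matched only once
            tp += 1
    fp = len(predicted) - tp  # predictions with no labeled counterpart
    fn = len(unmatched)       # labeled runs the algorithm missed
    return tp, fp, fn


# Hypothetical hand-labeled washing machine runs and algorithm output.
labeled_runs = [(10, 50), (100, 140), (200, 260)]
predicted_runs = [(12, 48), (90, 110), (300, 320), (400, 410)]
print(count_matches(predicted_runs, labeled_runs))
```

The important design point is that the labeled runs used for this check are kept separate from whatever data the constants in the algorithm were tuned on; otherwise the estimate is optimistic.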
So it turns out that, on the one hand, the algorithm itself has to be improved, and on the other, a framework for throwing dust in people's eyes has to be developed: tweaking the results in semi-automatic mode for VIPs and, accordingly, monitoring the accuracy in the important houses and correcting it as necessary.
Sometimes electricity suppliers held a competition to choose whom to start a pilot project with, and then the most brutal hand labeling began, with half of our office and the entire Indian office marking up data by hand. How we would then deliver the same accuracy once we got into the pilot project was a question nobody raised.
So here I sit and think: where do I start? There is no labeled data. How to evaluate the accuracy of the algorithm is unclear. I have more or less figured out what happens with these washing machines at the conceptual level. And there is a system of hacks and balances built on magic numbers, of which the code is full. The idea seems clever, but what to do is not clear. I ask Alex, and he says again: "Run the debugger and watch how things evolve." This is of course an option if you are trying to understand the code in depth, but if you only need the big picture, who knows. I am all for visually inspecting what is going on as much and as often as possible, but a complete lack of documentation, even outdated documentation, is wrong, if only because it is inefficient in the long run. And our CTO's answer to this question, "Well, what do you want? We are a startup," is, in my opinion, not an answer for a company of 50+ people.
And all this happens in a kind of firefighting mode. Endless reminders that this particular client is strategic and that it is very important for the company and for the happiness of the whole world to do everything as well as possible. Everyone is scurrying around. And by and large it is not clear when, or whether, it will ever end. That is, the salespeople know, and top management knows what is planned and when, but none of this is communicated to the rest of us.
It is clear that for the company to survive it needs a client base, built as quickly as possible before the funding runs out. But building a company is a marathon, not a sprint, and there is no need to squeeze all the juice out of employees, if only because it does not work in the long run. I did not sweat it: I came in at 9 and left at 5 pm. My contract says that I work 8 hours a day. I signed up for that, so that is how it is. But Alex worked at least 10 hours a day. Of course, he is a religious Asian (he went to church several times a week without fail), so maybe he likes it that way. But this prospect did not excite me.
Another important nuance is that we often needed to interact with QA and the backend, which meant communicating with the Indian office where they sat. The time zones are very different, so meetings were scheduled in the evenings and on weekends. I joined a couple of them, but then realized that midnight meetings were not the exception but the rule, that I had not signed up to discuss work at night and on weekends, and that organizing the interaction between teams in different time zones properly is the managers' problem, not mine. So I marked myself in my calendar as busy from five in the evening, and it seems they stopped inviting me to such midnight meetings. At the same time, I do not mind taking part in an on-call rotation, where in the event of a major problem whoever is on call has to resolve it, even if it is two in the morning.
The claim that we need to push hard now and it will get a little easier after the pilot launches is not an argument either, because the salespeople had talked a bunch of different companies into it and another project launched every couple of months. (And indeed, after Australia came Germany, after that preparations began for Italy, and there was talk of Hong Kong.)
I labeled some of these washing machines, about 100 of them, I guess. I wrote code that estimates the accuracy and rewrote some pieces to speed things up (my little MacBook was too weak, and Alex waved off my suggestion to run the calculations in the cloud). I ran the algorithm. F1 score = 0.5 (very roughly, it can be read as accuracy). That is not even close to the 0.9 our CEO had been talking about so excitedly.
Alex made a show of being upset that the accuracy was low, but said he was glad that we would at last have a proper, systematic, scientific approach. I nearly smacked him, because he has a PhD in Computer Science from UC Berkeley, and he had at least heard about how to evaluate the accuracy of models in machine learning. Yet this blockhead and the entire office had not bothered with it for three years and had instead been busy with unstructured, inefficient garbage.
By mid-December, after reading a couple of books and Stack Overflow questions, talking to a couple of professors who work on signal processing, slightly changing the logic of the algorithm, tuning the magic numbers, and adding a couple of heuristics, our gallant washing machine detection algorithm produced F1 = 0.65. We got through the pilot launch in Australia, and I went home to St. Petersburg for a couple of weeks with the thought that on my return I would have to look for a new job.
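For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score is computed from match counts and why it is only loosely comparable to accuracy. The counts below are hypothetical and are not the numbers from my experiment.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing was matched."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of detections that are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of real runs that were detected
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical: 100 labeled runs, 50 found, 50 missed, plus 50 false detections.
print(f1_score(tp=50, fp=50, fn=50))

# Hypothetical: a cautious detector with few false alarms but many misses.
print(f1_score(tp=40, fp=10, fn=60))
```

Unlike accuracy, F1 ignores true negatives, which suits event detection: most of the time the washing machine is off, so an algorithm that never detects anything would still score a high accuracy.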
It is important to note that I was on a student visa that had long since expired. This is all legal, though not obvious. I could stay in the USA, I could work, and I could leave, but I could not re-enter. To enter, you need a new visa, and to get one you have to leave the United States. Over five years of study I had done this a couple of times, and getting a new visa took a little more than a week. The assumption was that since it had worked before, it would work the same way now. But there was a nuance: those times I was a student getting a student visa, and this time I was getting a student visa in order to work. The man who took my documents at the American consulate in St. Petersburg was very surprised by the whole situation, asked me to send a list of publications and a resume, and sent my documents off somewhere for review.
The Americans at the consulate do not overwork themselves either: they take both US and Russian public holidays off. And, of course, they did not give me a visa before the winter holidays, even though I was supposed to show up in Sunnyvale on January 2. But the Indians, to their credit, were understanding and allowed me to work remotely until the situation was resolved.
All this hassle with the visa changed my job-search plans: instead of searching immediately, I decided to wait for the H1B and only then, once my rear was covered, to make any moves. Besides, it was not at all obvious that if I started a job search the Indians would not fire me (I have seen this in films), or that the search would not drag on for many months, which again would mean problems with the visa and money (on a student visa I may be out of work for at most 90 days in the first year). But it was obvious that the question of finding a new job would come up soon enough.
Definitely did not like the office. Every evening my eyes hurt.
I did not like the technical skills I was picking up at work. I absorbed the magic of git and Jira at the level at which they were used in the company. The quality of my code did not improve, if only because the entire Data Science team wrote crappy code, so there was no one to learn from. Yes, I got used to Matlab, but one could live quite happily without that skill. I figured out the algorithms used in the company and did not see much potential in them, but Alex did not want to hear about neural networks (it is not that I want to plug neural networks in everywhere, but they suit this task very naturally, and there are interesting papers on the topic). There was an attempt to push the company toward Python as a development language; our CTO and all the Data Scientists were all for it, but Alex, as a big Matlab fan, was adamant. His logic was as follows: "With Matlab you compile a binary, upload it to the server, and that's it. Otherwise the engineer in India will have to pull the Python code from the repository instead of compiling a binary. That is difficult, we will not do that." I offered to rewrite everything function by function myself, test that it all works, and explain to the engineer in India what to do and how, but I failed. Apparently I used the wrong words or said them the wrong way. Machine learning, which I like very much and which largely determined my decision to leave the university, was, oddly enough, not something Bidgely did. The algorithms in use were a set of heuristics and a bit of statistics. There is nothing wrong with this approach in itself, but it scaled so-so. We tried to use the same algorithm in different parts of the world, and it coped poorly, if only because it lacked free parameters (model capacity). We could not increase their number either, because we set them by hand rather than trying to extract them from the data, as is done in classical machine learning.
I also could not bring myself to give up on building my machine learning knowledge, so almost every evening after work I watched lectures, read papers and books, and took part in relevant competitions on Kaggle.
I did not like Sunnyvale. I had thought that Silicon Valley was dynamic and youthful, with innovation at every turn. In Mountain View and Palo Alto that is largely true, but Sunnyvale is not in that league: it is a big village where there is nothing to do in your free time.
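The difference between a hand-set magic number and a parameter extracted from data can be illustrated with a deliberately tiny example. The single power threshold, the function `best_threshold`, and the sample values are all hypothetical; a real model would have many such parameters and a proper held-out set.

```python
from typing import Iterable, List, Tuple


def best_threshold(samples: List[Tuple[float, bool]], candidates: Iterable[float]) -> float:
    """Pick the candidate threshold that classifies the most labeled samples correctly."""
    def correct(threshold: float) -> int:
        # A sample is predicted "on" when its power reaches the threshold.
        return sum((watts >= threshold) == is_on for watts, is_on in samples)

    return max(candidates, key=correct)


# Hypothetical labeled samples for one region: (power in watts, washing machine on?).
region_samples = [
    (120.0, False), (180.0, False), (300.0, False),
    (450.0, True), (480.0, True), (520.0, True),
]
print(best_threshold(region_samples, [100.0, 200.0, 300.0, 400.0, 500.0]))
```

With labeled data for each region, the same procedure can be rerun per country instead of someone re-tuning constants by hand, which is the scaling problem described above.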
-, , , , — .
-, , , , .
-, , — , , .
CEO , Fresh Grad' , Data Scienist' 2-3 , Director of Analytics .
CEO, , , , , , 6 . , , , , :" I am not convinced that this your idea will work, let's not do it ", .
. Bigely , , , , . VP of Engineering, , . :
And away we go. Pratik 4-5 . . , Director Of Analytics. , , , , . , .
, 2-3 . , Bidgely Data Scientist'a 5 ( Data Science , , , Sr. Software Engineer) Director of Data Science 8 . ( , ). , , .
, , , . , — , , , , , . Data Scientist' , .
Pratik'a, 90 . — . .
, — . , — , . , - , - .
, , Sunnyvale — , - Mountain View, , , , , , .
Google Facebook, , , . e, , Bidgely, , . , , , c , , , .
, , , , onsite interview. , . , , , Bidgely.
( .):
, , ….
Data Science Team? - , => . . : N , John — Data Analyst, Mary — Data Engineer, Mike — Head of Data Science, Jennifer — experimentation, . .
, ? - , , . , , , ( ), Google Sheets, , DataDog Tableau. , Junior .
/ ? , Matlab', .
? — , Bidgely, , . , / , — . .
? , . Bidgely, ( ) , , , , , hand labeling, . “”, , .
? onsite .
, , TrueAccord Google, Uber Twitter, , , , , , , . CEO ( ) . .
, Bidgly '. , . Bidgely , , , . Pratik, , , , Bidgely. , , .
.
-, , ( ), , , , .
-, , , , , , , , . cc - , . , .
-, Pratik Bidgely . . , , , . , , . , , , 5 . . Google Sheets Tableu Matlab . , . , .
c Pratik , , , , Glassdoor ( , CEO email : “ . " , .)
PS — , , .
PPS Bidgely .
PPPS , , LinkedIn Data Scientist :
Source: https://habr.com/ru/post/310776/