Data Science is a very promising field. Last year, we received 210 resumes from people who wanted to do Data Science at EPAM. Of these, we invited 43 people to a technical interview and made offers to seven. If demand for specialists is so high, why were so few hired?
We talked to our technical interviewers and found that many candidates have a poor idea of what data analysts actually do, so their knowledge and skills are not always relevant to the job. Some think that experience with Big Data is enough to work in Data Science, some are sure that completing a few machine learning courses is enough, and some believe that it is not necessary to understand the algorithms well.
Dmitry Nikitko and Mikhail Kamalov, data analysts and technical interviewers at EPAM, explained what they expect from candidates, what questions they ask, what they value in a resume, and how to prepare for the interview.

Different companies understand the role of a data analyst differently: some define it broadly, others more narrowly. Here is what these specialists do at EPAM:
- Preprocess data
- Look for patterns in data and test hypotheses
- Create predictive models using machine learning algorithms
- Evaluate the quality of the models
- Visualize the data
- Help integrate the solution
Data analysts work on a wide variety of tasks. For example, ranking applies not only to search results but also to recommender systems and to searching for similar images, music, and even 3D models of faces. In each of these cases, you need to find an answer relevant to the query. But the data types differ, and you need to know which strategy to apply in each case.
EPAM has created a test that recruiters send to candidates before the interview. The multiple-choice part is checked automatically. The part that requires detailed answers is read by technical interviewers.
What you need to be able to do
In short, a data analyst is a person who can program (in most cases in Python), understands statistics, mathematics, and algorithms, and speaks English.
English is needed not only to read specialized literature and documentation: many analysts communicate directly with foreign customers. The ability to translate from the language of data scientists into language that business people understand is also useful.
Is a specialized degree required?
It is important to know mathematics well, and a technical degree is a big plus. Most data scientists at EPAM are mathematicians, programmers, or physicists. But this is not a strict requirement: we have a linguist on staff, and we recently hired a sociologist who, after graduating from university, processed the results of sociological research, built models, and did forecasting and social graph analysis. That experience is relevant to Data Science, so the candidate was interesting to us.
In general, it is not true that a candidate with a technical degree will suit us and one with a humanities degree will not. It all depends on skills and experience. For example, a computational linguist who has learned to write code is a more interesting candidate than a Big Data engineer who has worked with MapReduce and Hadoop but does not understand algorithms, or someone who holds a degree in statistics but has no work experience.
What is valued in the resume
Experience is valued most. If you have already worked in Data Science, describe in detail what you did, which algorithms you used, and what skills you have.
If you have no work experience, the following will be a big plus on your resume:
- A short description of your pet projects. It is important that the candidate not only knows the theory but also finds time to practice.
- Participation in hackathons. This shows at least that you have worked in a team and (most likely) built a working solution within a limited time frame. Hackathons are also good because employers can notice you there, in which case you may not need to send a resume at all.
- Participation in machine learning competitions (Kaggle, DrivenData). If you took part in, or even won, the Instacart competition on Kaggle, where the task was to build a recommender system, you will be able to solve a business problem with similar goals faster. But in our experience, winning such competitions does not always mean that the candidate knows, for example, how the algorithms they used work.
What is asked at the interview
The goal of a Data Science interview, as of any other, is to find out how well a person understands their subject area. First, the interviewer asks questions on the basics of machine learning and statistics. The answers show the depth and breadth of the candidate’s basic knowledge. After that come more specialized questions, for example on natural language processing, time series, or recommender systems. If the candidate says they can work with graphs, images, or other kinds of data, they will be asked about that too.
Universal soldiers are extremely rare, so the interview questions depend on the candidate’s experience. Interviewers usually ask about past projects: which technologies were used and why. After that, the candidate may be asked to reason through a problem. And of course there will be a few theoretical questions.
Here are some questions you may be asked at the interview:
• Neural networks
- What methods of preventing overfitting (regularization) for neural networks do you know? How do they work? Where should batch normalization be placed?
- What is the difference between a neural network with one output and a sigmoid activation function and the same network with two outputs and softmax?
- Imagine a multilayer fully connected network with a nonlinear activation function. What happens to the network if we remove the nonlinearity?
- Why use global pooling?
• Image recognition
- How is quality evaluated in object detection tasks?
- What neural network architectures for semantic segmentation do you know?
- How and why is transfer learning used?
• Time series
- How do you evaluate model quality when working with time series?
- What should be done about seasonality in the data?
- How do you look for anomalies in a time series?
• Natural language processing
- What is topic modeling based on? How does the algorithm work? How do you choose the number of topics the algorithm will learn?
- You have review texts and ratings on a 5-point scale. How would you build a system that predicts the rating from the review text? How would you evaluate the quality of this system?
In the course of reasoning and problem solving, the interviewers ask many clarifying questions and try to put the candidate in “combat conditions”. For example, the candidate proposes a solution, and the interviewer adds new conditions to the problem:
“What will you do if the data set is imbalanced?”
“How will you solve the problem if there are gaps in the data?”
“What will you do if there are outliers in the data?”
In addition, they may ask how the candidate organizes their working time, how they log experiments, whether they make sure experiments are reproducible, how they process large volumes of data, and how they build data processing pipelines.
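To show the kind of reasoning the neural network questions above call for, here is a minimal sketch of one possible line of argument for the "remove the nonlinearity" question. It is an illustration, not EPAM's reference answer. The layer sizes, the random weights, and the helper names `matmul`, `rand_matrix`, and `forward_linear` are all assumptions made for this example, and biases are left out for brevity (with biases the network would still reduce to a single affine map).

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def rand_matrix(rows, cols):
    """Hypothetical helper: a matrix of uniform random values in [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def forward_linear(x, weights):
    """Fully connected network with the activation removed:
    each layer is just a matrix multiplication."""
    out = x
    for w in weights:
        out = matmul(out, w)
    return out


random.seed(0)

# Three layers: 4 -> 8 -> 8 -> 3 (sizes chosen arbitrarily for the sketch).
weights = [rand_matrix(4, 8), rand_matrix(8, 8), rand_matrix(8, 3)]
x = rand_matrix(5, 4)  # a batch of 5 input vectors

deep_out = forward_linear(x, weights)

# Collapse all layers into a single 4 x 3 matrix.
collapsed = weights[0]
for w in weights[1:]:
    collapsed = matmul(collapsed, w)
single_out = matmul(x, collapsed)

# Largest element-wise difference between the two outputs.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(deep_out, single_out)
               for a, b in zip(row_a, row_b))
print(max_diff)
```

The three-layer network and the single collapsed matrix give the same outputs up to floating-point rounding, which is why a stack of layers without a nonlinearity is no more expressive than one linear layer.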
Typical interview mistakes
• The candidate does not understand how the algorithms they used work
Interviewers always ask about the algorithms candidates have used: what parameters they have and how to tune them. If there is no answer, or the candidate says they tuned the algorithm “on a whim”, that is bad. If you use an algorithm, it is worth taking the time to figure out how to configure it.
• The candidate does not understand how to apply their knowledge in “combat conditions”
It happens like this: the candidate knows the theory well but does not know how to deal with problems on real projects. It is important not only to be able to find insights in data, do feature engineering, and build models, but also to understand how to put all of this into production or make a solution run faster.
• The candidate cannot reason independently
If a person answers “I will google it” too often, that is not a good sign. Of course, data scientists google things, but being able to reason independently is also important: sometimes there is no ready-made solution to a problem, and you need to come up with something of your own.
• The candidate makes up how a system works
Sometimes people cannot answer a question about how a particular system works and start making things up, hoping to guess right. This is not recommended: the interviewer will notice. It is better to say honestly, “I do not know”; then there will be more time for other questions, and the likelihood that you will be asked about something you understand will increase.
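Returning to the first mistake in this list: one common alternative to configuring an algorithm “on a whim” is to compare candidate settings on held-out data. The sketch below is only an illustration under stated assumptions: the toy one-dimensional k-nearest-neighbours classifier, the synthetic data with 15% label noise, and the candidate values of `k` are all hypothetical and are not taken from the interviews.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0


def accuracy(train, valid, k):
    """Share of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)


random.seed(42)

# Synthetic data (an assumption for this sketch): the label is 1 when
# x > 0, and 15% of the labels are flipped to simulate noise.
data = []
for _ in range(300):
    x = random.uniform(-1, 1)
    y = int(x > 0)
    if random.random() < 0.15:
        y = 1 - y
    data.append((x, y))

# Hold out the last 100 points for validation.
train, valid = data[:200], data[200:]

# Odd values of k avoid ties in the majority vote.
candidate_ks = [1, 3, 5, 11, 21, 41]
scores = {k: accuracy(train, valid, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)

for k in candidate_ks:
    print(f"k={k:>2}  validation accuracy={scores[k]:.3f}")
print("selected k:", best_k)
```

In practice one would more likely use cross-validation and a library implementation, but the point is the same: every parameter choice can be justified by a measured result that the candidate can explain at the interview.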
Bibliography
For anyone who wants to study Data Science, we recommend the following courses, books, and articles:
• Course "Programming in Python" on Stepik
• Course "Introduction to machine learning" on Coursera
• Course "Machine Learning and Data Analysis" on Coursera
• Course "Machine Learning" by Konstantin Vorontsov
• Course on deep learning on Coursera
• Course "Neural networks" on Stepik
• Book "Deep Learning Book"
• Book "Deep Learning: Immersion in the World of Neural Networks", the first book about deep learning in Russian
• Book on NLP "Speech and Language Processing"
• Book on Information Retrieval and NLP "Introduction to Information Retrieval"
• Articles on opendatascience
• Course "Algorithms and Data Structures" by Maxim Babenko