In addition to stories about our own experience optimizing the services of our IaaS provider, we also analyze Western experience: everything from project management to the technology cases that other IT companies describe.
Today we decided to look at a profession built around working directly with data, and turned our attention to a note on data scientists by Philip Guo, who works at the University of Rochester.
/ Photo by Jer Thorp / CC
Philip developed a number of tools in this area back in 2012, while working on his PhD thesis on software for working with data. Since then, "data science" has become the generally accepted name of a distinct profession, and universities around the world have incorporated the field into their curricula.
Philip's experience lets us talk about the difficulties that await anyone who wants to engage seriously with this field.
How it works: data collection
To get a feel for being a "data scientist," you can draw on a number of publicly available sources: open statistical data published by governments and companies, an open API you can experiment with to pull data from your favorite social network, or even a dataset you generate yourself with specialized software.
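As a quick illustration, here is a minimal Python sketch of pulling data from an open API. The GitHub API endpoint is used purely as an example of a publicly available source, and the output file name is an arbitrary choice.

```python
import json

import requests  # pip install requests

# Example: pull public data from an open API (GitHub is used here
# purely as an illustration of a freely accessible endpoint).
response = requests.get(
    "https://api.github.com/repos/python/cpython",
    timeout=10,
)
response.raise_for_status()  # fail early if the source is unavailable

record = response.json()
print(record["full_name"], record["stargazers_count"])

# Save the raw response untouched: keeping the original data makes it
# possible to re-check later how it was collected.
with open("raw_github_repo.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)
```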
Working with data is a multi-step process that requires careful adherence to technique. Even the most basic step, data collection, where everything begins, carries non-obvious difficulties and potential errors that can make further analysis impossible because the collected data is of poor quality. You need to verify data quality on the source's side and understand how the data was originally obtained and organized.
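A few basic sanity checks go a long way here. The pandas sketch below is illustrative: the file name and the age column are our assumptions, not something from Philip's note.

```python
import pandas as pd

# Hypothetical raw file and column names -- substitute your own.
df = pd.read_csv("raw_data.csv")

# Basic sanity checks before any analysis:
print(df.shape)               # is the number of records plausible?
print(df.dtypes)              # did numeric columns parse as numbers?
print(df.isna().sum())        # how many values are missing per column?
print(df.duplicated().sum())  # are there exact duplicate records?

# Spot-check ranges for obviously impossible values, e.g. a negative age.
assert (df["age"].dropna() >= 0).all(), "negative ages -- check the source"
```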
The next stage follows from this one: data storage. The problem here, of course, is not which version of Excel to choose, but how to group and organize the thousands of files of related data that will later be analyzed in detail.
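One common convention (our assumption, not Philip's prescription) is to separate raw, intermediate, and processed data, and to encode the source and date in file names:

```python
from datetime import date
from pathlib import Path

# A layout that separates untouched source data from everything
# derived from it (the directory names are a convention, not a rule).
base = Path("project")
for sub in ("data/raw", "data/interim", "data/processed", "scripts", "results"):
    (base / sub).mkdir(parents=True, exist_ok=True)

# Encode the source and collection date in the file name, so that
# thousands of files stay identifiable without opening them.
fname = f"github_api_{date.today():%Y%m%d}.json"
print(base / "data" / "raw" / fname)
```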
With large volumes of data, it makes sense to consider cloud IT infrastructure, at least for individual experiments with a modest budget. It would be strange to spend those funds on buying your own hardware, which would later have to be sold off.
Data processing
Different data analysis tasks require the presentation of information in a specific form and format. As a rule, you will not receive a ready-made data set, which can be immediately analyzed without any additional processing.
At this stage you will run into the need to fix semantic errors and correct formatting. Specialized software that automates a number of these routine tasks is useful here.
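pandas, for instance, covers much of this routine. The sketch below is hypothetical: the column names and the specific fixes stand in for whatever your dataset actually needs.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical input file

# Normalize formatting: strip whitespace, unify capitalization.
df["city"] = df["city"].str.strip().str.title()

# Fix a typical semantic error: the same category spelled differently.
df["country"] = df["country"].replace(
    {"USA": "United States", "U.S.": "United States"}
)

# Parse dates stored as strings into a proper datetime type.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Drop records that are unusable after cleaning, and save the result
# separately so the raw file stays untouched.
df = df.dropna(subset=["signup_date"])
df.to_csv("clean_data.csv", index=False)
```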
While bringing the data into working form, you can re-examine its structure and gain additional insights into the hypotheses worth putting forward in your research.
Of course, at this step you will feel a general drop in productivity, but this work should be treated as mandatory. Without it, the data will be much harder to analyze, and its quality will be very easy to criticize.
Data analysis
Here we are talking about working directly on the algorithms and programs responsible for interpreting your dataset. For convenience, we can call them scripts; they are typically written in Python, Perl, R, or MATLAB.
You need to understand the entire data analysis cycle: preparing and editing scripts, getting the first results, interpreting them, and then adjusting the scripts based on what you have learned.
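The skeleton of one turn of that cycle might look like the following sketch; the input file and column names are placeholders.

```python
import pandas as pd

# Load the prepared data (file and column names are placeholders).
df = pd.read_csv("clean_data.csv")

# First pass: a deliberately simple computation to get an initial result.
summary = df.groupby("country")["signup_date"].count()
print(summary.sort_values(ascending=False).head())

# Interpret the output, then edit and re-run: each turn of the cycle
# (edit script -> run -> inspect results) should stay cheap and fast.
```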
Among the things that may not go as planned, time costs and various failures are worth noting. You can lose a huge amount of time to the sheer volume of data being processed and to inefficient use of computing resources, for example, if you rely only on a home computer, whose resources are very hard to scale.
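One common way to keep a large dataset tractable on a single machine is to stream it in chunks instead of loading it whole. A sketch, with a placeholder file and column:

```python
import pandas as pd

# Process a large CSV in fixed-size chunks instead of loading it whole:
# memory use stays flat regardless of file size.
total = 0.0
rows = 0
for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
    total += chunk["value"].sum()  # "value" is a placeholder column
    rows += len(chunk)

print("mean value:", total / rows)
```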
The analysis algorithm embedded in your script can also eat up time. To keep it in check, carry out test runs, monitor how the process unfolds, and make adjustments promptly. Pay the same attention to possible failures.
Try running the analysis with various parameters and input data characteristics. This may require a series of experiments that vary those parameters, plus additional iterations that adjust the processing algorithm itself.
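Such a series of experiments can be organized as a simple parameter sweep. In the sketch below, `run_analysis` and its parameters are placeholders for your actual computation.

```python
from itertools import product


def run_analysis(threshold: float, window: int) -> float:
    """Placeholder for the real analysis step."""
    return threshold * window  # substitute the actual computation


# Try every combination of parameters and record the outcomes,
# so that runs remain comparable across iterations.
results = {}
for threshold, window in product([0.1, 0.5, 0.9], [7, 30, 90]):
    results[(threshold, window)] = run_analysis(threshold, window)

for params, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(params, score)
```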
Findings
As a result of the first three steps, you obtain certain results. They are no longer raw, and they support conclusions. At this point, it is recommended to take detailed notes and present them to colleagues.
This approach helps connect the result with what you planned to obtain at the very start of work on a given topic. Such reflection lets you trace the evolution of your hypotheses and may push you toward additional experiments with the data. A visual presentation of the results to your colleagues helps here as well.
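Even a minimal chart serves this purpose. A sketch with matplotlib and made-up numbers:

```python
import matplotlib.pyplot as plt

# Placeholder results -- substitute the output of your own analysis.
countries = ["United States", "Germany", "Japan"]
signups = [120, 75, 60]

fig, ax = plt.subplots()
ax.bar(countries, signups)
ax.set_ylabel("Signups")
ax.set_title("Signups by country (example data)")
fig.savefig("results.png", dpi=150)  # attach to your notes or slides
```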
Comparing your results with those obtained in similar work by other scientists will help you catch potential errors, go back to one of the previous steps if needed, and then move on to writing up the research results.
Presentation
Besides an oral report, infographics, and the classic presentation that ties all of these elements together in front of an audience, there are other ways to conclude a research project. The result of many data analysis projects is programs and algorithms accompanied by documentation and explanatory notes.
This form allows colleagues in the field to quickly reproduce your results and moves data analysis forward as a discipline. It requires being reasonably well versed in software development, so as not to put the expert community in the awkward position of working with a script that has no coherent documentation.
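At a minimum, that means a docstring explaining purpose and usage, plus a reproducible entry point. A minimal sketch of what such a script skeleton could look like (file names and layout are assumptions):

```python
"""Reproduce the main result of the study.

Usage:
    python analyze.py clean_data.csv

Requires: pandas. The input is assumed to be the cleaned dataset
produced by the processing stage.
"""
import sys

import pandas as pd


def main(path: str) -> None:
    """Load the cleaned dataset and print summary statistics."""
    df = pd.read_csv(path)
    print(df.describe())


if __name__ == "__main__":
    main(sys.argv[1])
```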
PS
Additional reading recommended by Philip Guo.