Hello! We have opened enrollment for a new Otus course, "Applied Analytics on R", which starts at the end of this month. On that occasion, I want to share a translation of an article about the difference between a data scientist and a statistician, both of whom use R in practice.
Over the past ten years, data volumes and the rate at which data appears have grown exponentially. By some estimates, more than 3 quintillion bytes of data are generated every day! It is no surprise that a new profession emerged to deal with it: the data scientist, a versatile specialist in data analysis and processing. However, people practiced statistics long before digital data processing appeared. So what are the differences between these two professions, data scientist and statistician?
Let's see.
A data scientist is better at statistics than any software engineer, and better at software engineering than any statistician.
Data scientists work with large volumes of data that, as a rule, sit in an organization's repositories or on its websites and are, by themselves, practically useless for gaining strategic or financial advantage. To provide recommendations and suggestions for optimal decisions, data scientists arm themselves with statistical designs and evaluate past and current data from such sources.
In marketing and planning, data scientists are primarily concerned with identifying insights and statistical indicators that can be useful for preparing, implementing and tracking results-oriented marketing policies.
Statisticians collect and evaluate information in search of behavioral patterns or descriptions of the environment, and build models based on it. These models can then be used to predict and make sense of the world.
For example, statistics "show" that celebrating birthdays is safe: the older a person is, the more birthdays they have celebrated.
A statistician creates and applies statistical or mathematical models that help solve real problems based on collected and summarized data. Data is collected, analyzed and used in many fields, including engineering, science and business. The accumulated numerical data helps companies and their customers understand quantitative indicators and track or predict trends that are useful for business decisions.
1. Education
Data scientists are usually highly educated: 88% of them hold a master's degree, and 46% hold a PhD. Although there are exceptions to this rule, strong academic training is generally required to acquire the necessary expertise and skills in data science.
2. Programming on R
A data analyst should ideally know at least one such tool. R was created specifically for the needs of data science: with it you can process almost any information for analytical tasks, and 43% of data scientists use R to solve statistical problems. However, R has a rather steep learning curve.
3. Python programming
Python, along with Java, Perl and C/C++, is one of the most popular programming languages for data science, and for data scientists it is a good choice.
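As a quick illustration of the kind of everyday task Python handles well, here is a minimal sketch that computes basic descriptive statistics using only the standard library (the order counts are made up for the example):

```python
import statistics

# Hypothetical daily order counts for one week
orders = [132, 158, 121, 175, 190, 210, 164]

mean = statistics.mean(orders)      # average orders per day
median = statistics.median(orders)  # middle value, robust to outliers
stdev = statistics.stdev(orders)    # sample standard deviation

print(f"mean={mean:.1f}, median={median}, stdev={stdev:.1f}")
```

A few lines like these replace a spreadsheet full of manual formulas, which is part of Python's appeal for data work.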
4. Hadoop Platform
Not in every case, but in many, proficiency with this tool is highly desirable. A specialist's value increases if they also have experience with Hive or Pig. Cloud tools such as Amazon S3 can also come in handy.
5. SQL: working with databases and programming
Data scientists should be proficient in SQL. This language is designed specifically for working with data: it lets you retrieve the information you need from a database with brief query statements, quickly and without writing bulky code.
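To show how a brief SQL query replaces bulky procedural code, here is a small self-contained sketch using Python's built-in `sqlite3` module and a hypothetical `orders` table (the table name and figures are invented for the example):

```python
import sqlite3

# In-memory database with a hypothetical orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 120.0), (2, "south", 80.0), (3, "north", 45.5)],
)

# One brief query: total revenue per region, largest first
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM orders GROUP BY region ORDER BY total DESC"
).fetchall()

print(rows)
conn.close()
```

The grouping, summing and sorting that the query expresses in one statement would take noticeably more code by hand.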
6. Machine learning and artificial intelligence
Many data scientists are not fluent in machine learning algorithms and methods, and have no grasp of neural networks, deep learning, adversarial learning and the like. However, if you want to stand out among data scientists, you should understand methods such as supervised learning, decision trees, logistic regression, etc.
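As an illustration of supervised learning, here is a deliberately tiny logistic regression fitted with plain gradient descent on made-up data; no ML libraries are used, and the feature/label names are purely hypothetical:

```python
import math

# Toy supervised-learning example: one feature (e.g. hours studied)
# and a binary outcome (e.g. passed the exam or not)
X = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
y = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0   # model parameters
lr = 0.5          # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain batch gradient descent on the logistic loss
for _ in range(2000):
    dw = db = 0.0
    for xi, yi in zip(X, y):
        err = sigmoid(w * xi + b) - yi  # prediction error
        dw += err * xi
        db += err
    w -= lr * dw / len(X)
    b -= lr * db / len(X)

# Predict class 1 when the estimated probability exceeds 0.5
def predict(x):
    return int(sigmoid(w * x + b) > 0.5)

print([predict(x) for x in X])  # should recover the training labels
```

On real problems you would reach for a library such as scikit-learn, but the mechanics are the same: minimize a loss by iteratively adjusting the parameters.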
7. Data Visualization
The amount of data in the corporate world is huge, and it needs to be converted into simpler formats. As a rule, people perceive data better in the form of graphs and charts.
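As a toy illustration of turning raw numbers into something visual, here is a minimal text-based bar chart built with nothing but the standard library (the quarterly sales figures are invented):

```python
# Hypothetical sales per quarter
sales = {"Q1": 12, "Q2": 18, "Q3": 9, "Q4": 15}

# Render each value as a row of '#' marks, one line per quarter
chart = "\n".join(f"{q} {'#' * v}" for q, v in sales.items())
print(chart)
```

Real reporting would use a plotting library such as matplotlib or a BI tool, but even this crude chart makes the Q2 peak and Q3 dip easier to see than the bare numbers.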
8. Unstructured data
A data scientist should be ready to work with unstructured data. Such data has an arbitrary format and is not stored in databases: photos, blog posts, customer reviews, social media posts, videos, audio files, and so on.
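A simple sketch of working with unstructured text: counting the most frequent terms in a handful of hypothetical customer reviews using only the standard library (the review texts are made up):

```python
import re
from collections import Counter

# Hypothetical free-form customer reviews (unstructured text)
reviews = [
    "Great battery life, great screen!",
    "Battery died after a week. Terrible battery.",
    "Screen is sharp; battery life is fine.",
]

# Normalize to lowercase word tokens and count them
words = re.findall(r"[a-z]+", " ".join(reviews).lower())
counts = Counter(words)

print(counts.most_common(3))
```

Even this trivial pass over free text surfaces a signal (the word "battery" dominates) that a structured database would never have contained.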
9. Knowledge of business principles
To be a data scientist, you need to understand the industry you work in, as well as the business challenges facing your company.
10. Communication skills
Companies looking for a strong data scientist need a person who can clearly and fluently communicate technical results to non-technical audiences, such as marketers or sales specialists.
1. SPSS
The Statistical Package for the Social Sciences (SPSS) is perhaps the most widely used statistical software in human behavior research. SPSS's visual interface lets you combine descriptive statistics with parametric and non-parametric analysis results presented graphically. SPSS can also run scripts to automate estimation or complex statistical calculations.
2. R
R is a free software package that is actively used in human behavior research and other fields. Toolkits built on R, which simplify various stages of data processing, are available for a wide range of applications. R is powerful software, but mastering it is not easy, and using it requires the ability to write code.
3. MATLAB (Mathworks)
MATLAB is an analytics and programming platform widely used by engineers and researchers. As with R, the learning curve is steep, and at a certain stage you will need to write your own programs. A variety of toolboxes help with research tasks (for example, the EEGLab toolbox is designed for analyzing EEG data). Although MATLAB can be difficult for beginners, it offers very broad capabilities, provided you can write code (or at least run the necessary tools).
4. Microsoft Excel
Microsoft Excel offers a variety of visualization tools and easy-to-use statistical functions, although it is not a full statistical analysis tool. It makes it easy to work with numbers, compute summary totals and create custom charts. These are useful capabilities for anyone who wants to see what lies behind the available data. Since Excel is used by many people and companies, it can be considered an accessible option for beginners.
5. GraphPad Prism
GraphPad Prism provides many capabilities that can be applied in various fields, especially in statistics related to biology. As in SPSS, analysis and complex statistical calculations can be automated with scripts.
6. Minitab
The Minitab software package offers a variety of basic and fairly advanced statistical tools for evaluating data. Like GraphPad Prism, thanks to its graphical user interface and scripting, it is accessible both to beginners and to users who need more complex analysis.
1. R
R is a free software package for statistical computing and visualization. R compiles and runs on many UNIX platforms, as well as on Windows and macOS.
2. Python
Python is a popular programming language developed by Guido van Rossum; its source code was first published in 1991. Python is used for backend development, software development, mathematical computing and system scripting.
3. Julia
Julia was originally designed for high-performance computing. Julia programs compile, via LLVM, to efficient native code for a variety of platforms. Julia is a dynamically typed language that feels like a scripting language and has good support for interactive use in the development environment.
4. Tableau
Tableau is one of the fastest-growing data visualization tools in the business intelligence sector. It is one of the best ways to convert raw data into easy-to-understand formats, requiring no technical knowledge or programming skills.
5. QlikView
QlikView is one of the leading enterprise data discovery platforms. It differs from traditional business intelligence systems in a number of ways. As a data analysis tool, it always visualizes relationships between data using color, and also highlights unrelated data. Direct and indirect search is performed by typing queries into list boxes.
6. AWS
Offering compute capacity, database resources and content delivery services, the Amazon Web Services (AWS) secure cloud platform helps companies grow their businesses. Millions of customers already use AWS services to build sophisticated applications with great flexibility, scalability and reliability.
7. Spark
Apache Spark is a fast framework for cluster computing. It provides high-level APIs for Java, Scala, Python and R, as well as an optimized engine that supports general computation graphs.
8. RapidMiner
RapidMiner is a data science platform. It includes data preparation functions, machine learning and deep learning algorithms, text analytics tools and a predictive analytics environment. RapidMiner supports every step of machine learning work, including data preparation, results visualization, model validation and optimization. RapidMiner is used in business and industry, in training and teaching, and for rapid prototyping and application development.
9. Databricks
The Databricks platform, which combines data processing with business intelligence support, is designed for data specialists, engineers and scientists. The platform supports the entire machine learning life cycle: from data preparation to testing and deployment.
Data science work is not only more in demand than statisticians' work, but also better paid. According to Glassdoor, the average salary of a data scientist in the United States is $118,709, while a statistician's is $75,069. A data scientist is a versatile specialist who can answer important questions for an enterprise. Usually they are given an open-ended question; the specialist figures out what information is needed, sets a timeline for the task, performs modeling and analysis, and writes a great program that produces the answer.
Statistician
Specialists in statistical methods, as a rule, perform data analysis under the supervision of a senior statistician, who may also be their mentor. After a while, many of them move on to more responsible and independent positions and take on complex technical tasks.
Applied Statistician
Applied statisticians are responsible for ensuring that, for each important question, the relevant data is collected and prepared (or the corresponding analysis is carried out) and a report with the results is produced. They work closely with other technical experts and management as an integral part of the project team.
Senior Statistician
A senior statistician has a wider range of responsibilities than an applied statistician. They examine questions comprehensively to find links to the goals of the organization as a whole. Senior statisticians are proactive in offering fresh ideas that ultimately benefit the organization and its customers. They often join projects at an early stage, help identify problems based on the numbers, and recommend solutions to senior management. They are then brought in to prepare and present the results. On statistical matters, they are often the best source of information and experience.
Head of Statistics
Heads of statistical departments, especially the most junior ones, are involved in project planning, helping to determine what should happen. They recruit staff, give advice, and are responsible for the overall results of projects. They inform senior managers about the department's achievements, help their employees with career development, and set development directions. Their administrative duties include hiring and developing employees, as well as evaluating their performance. For obvious reasons, fewer managers are needed than rank-and-file specialists.
Private Statistical Consultant
Some applied statisticians become independent private consultants. They carry out special studies, often commissioned by organizations that do not employ statisticians, or evaluate the work of other statisticians. Statistical consultants are often engaged as experts in legal matters.
Data Scientist
Data scientists work with the statistical and mathematical models used to process information. The sharp mind of a data scientist is useful, for example, in building a system that estimates how many loans will go unpaid next month.
Data processing specialist
These broad-based specialists use computing systems to process large data sets, drawing on their knowledge of software development. As a rule, each of them knows several programming languages, such as Python and Java. Typically these specialists focus on writing code, cleaning data, and running queries requested by data scientists. To turn a predictive model created by a data scientist into production code, one usually turns to a data processing specialist.
Analyst
Finally, there are specialists who examine the data, create reports and visualize what the data contains. Analysts help company employees get information on specific questions.
An outstanding analyst is a valuable specialist whose coding style is optimized for speed. But they are not a statistician, not even a bad one, because they do not draw final conclusions from the facts. The analyst's main job is to say: "Here is what our data contains. What follows from it is not for me to say. Perhaps the decision maker will want to bring in a statistician to figure that out."
That's all; we look forward to seeing you on the course.
Source: https://habr.com/ru/post/459354/