
Which programming language should you choose for working with data?



A novice data scientist can choose from many programming languages, each of which can help in mastering the field faster.

However, no one can tell you exactly which programming language is best for this purpose. Your success as a specialist in this field depends on many factors, and today we will try to consider them. At the end of the article you will be able to vote for the language you consider most suitable for working with data.

Specificity


Be prepared: as you go deeper into data science, reinventing the wheel again and again will only get you so far. You will need to master the software packages and modules available for your chosen programming language. How far you can get depends, first of all, on which domain-specific packages exist for that language.

Versatility


A leading data scientist has good all-round programming skills, as well as the ability to crunch numbers and analyze results. Most day-to-day work in data science revolves around finding and processing raw data, or cleaning it up. Unfortunately, no newfangled machine learning package will help you with that.

Efficiency


In the fast-moving world of commercial data science, there is a lot to be said for getting the job done quickly. Yet precisely because of that pace, technical debt constantly creeps in, and only disciplined practice can keep it to a minimum.

Performance


In some cases it is very important to optimize the performance of your code, especially when working with large volumes of mission-critical data. Compiled languages are usually much faster than interpreted ones. Similarly, statically typed languages are much more fault tolerant than dynamically typed ones. The trade-off is reduced convenience: such languages usually take more effort to write.

To some extent, each of the programming languages below can be placed somewhere along two axes: versatility versus specificity, and performance versus convenience.

With these basic principles in mind, let's take a look at some of the most popular programming languages used in data science. Everything said about the languages below is based on my own observations and experience, as well as on the experience of my friends and colleagues.
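The fault-tolerance point can be illustrated with a minimal Python sketch (Python is dynamically typed). The function `add_shipping` and its inputs are hypothetical and chosen only for illustration. The type mistake goes unnoticed until the offending line actually runs, whereas a statically typed language would typically reject the equivalent code before the program starts.

```python
def add_shipping(subtotal, shipping):
    # No types are declared, so Python accepts any arguments here.
    return subtotal + shipping


# Works as intended with numbers.
ok = add_shipping(20.0, 4.5)

# A value read from a text file is often still a string. Python only notices
# the mismatch when the addition actually executes.
try:
    add_shipping("20.0", 4.5)
    failed_at_runtime = False
except TypeError:
    failed_at_runtime = True

print(ok, failed_at_runtime)
```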

R





R, a direct descendant of the older programming language S, was released back in 1995 and has been steadily improving ever since. Written in languages such as C and Fortran, the project is now supported by the R Foundation for Statistical Computing.

License:

Free

Benefits:


Disadvantages:


Our verdict: ideal for what it was designed for.

R is a powerful language that stands out for its huge range of applications in statistical analysis and data visualization, and being open source has earned it a large following among developers. It owes its wide popularity to how effective it is at what it was designed for.

Python





In 1991, Guido van Rossum introduced the Python programming language. Since then it has become an extremely popular general-purpose language and is widely used in the data science community. Currently, the main versions are Python 3.6 and Python 2.7.

License:

Free

Benefits:


Disadvantages:


Our verdict: convenient in every respect.

Python is a good option for data science, and that holds for both beginner and advanced work in the field. Much of data science centers on the ETL (extract, transform, load) process, and Python's general-purpose nature makes it well suited to such work. Libraries such as Google's TensorFlow also make Python a very interesting language for machine learning.
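To make the ETL idea concrete, here is a minimal sketch that uses only the Python standard library. The sample data, the `people` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and a real pipeline would read from files or APIs rather than an inline string.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """name,age,city
Alice, 34 ,London
Bob,,Paris
Carol,29,berlin
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing age and normalise the fields."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        if not age:
            continue  # skip incomplete records
        cleaned.append((row["name"].strip(), int(age), row["city"].strip().title()))
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, city TEXT)"
    )
    connection.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
people = connection.execute(
    "SELECT name, age, city FROM people ORDER BY name"
).fetchall()
print(people)
```

Keeping the three stages as separate functions is only one possible design; it makes each step easy to test on its own.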

SQL


SQL (Structured Query Language) is used to define, manage, and query relational databases. The language appeared in 1974 and has undergone many modifications since then, but its basic principles remain unchanged.

License:

There are free and paid options.

Benefits:


Disadvantages:


Our verdict: effective despite its age.

SQL is more useful as a data processing language than as an advanced analytical tool. Nevertheless, so much of data science depends on ETL, and SQL has proved so durable and efficient, that every data scientist should know it.
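As a small illustration of defining, populating, and querying a relational table, the sketch below runs SQL through Python's built-in `sqlite3` module. The `orders` table and its rows are made up for the example, and other database engines may differ in SQL dialect details.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Define: create a table (hypothetical schema).
connection.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

# Manage: insert a few made-up rows.
connection.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 15.5), ("alice", 4.5)],
)

# Query: total amount per customer, largest first.
totals = connection.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
print(totals)
```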

Java




Java is an extremely popular general-purpose language that runs on the Java Virtual Machine (JVM), an abstract computing system that provides seamless portability between platforms. It is currently supported by Oracle Corporation.

License:

Version 8: free

Benefits:


Disadvantages:


Our verdict: a serious contender for the best language for data science.

There is a lot to be said for learning Java as a data science language. Many companies will appreciate being able to integrate finished data science code seamlessly into their existing codebase, and Java's performance and type safety are undeniable advantages. On the downside, it lacks the range of specialized packages available for other languages. Despite this drawback, Java is definitely worth considering, especially if you already know R or Python.

Scala



The Scala programming language, which runs on the JVM, was developed by Martin Odersky in 2004. It is a multi-paradigm language that supports both object-oriented and functional approaches. In addition, the Apache Spark cluster computing framework is written in Scala.

License:

Free

Benefits:


Disadvantages:


Our verdict: perfect for working with big data.

If you decide to use cluster computing for big data, the Scala + Spark pairing is an excellent solution. Moreover, if you already have experience with Java and other statically typed languages, you will certainly appreciate these features of Scala. However, if your application does not deal with data volumes large enough to justify the added complexity of Scala, you are likely to be more productive in other languages, such as R or Python.

Julia





Released a little over 5 years ago, Julia has made an impression in the world of numerical computing. It gained prominence because several large organizations, including some in the financial industry, began using it almost immediately.

License:

Free

Benefits:


Disadvantages:


Our verdict: a language that has yet to show what it can do.

Yes, Julia's main problem is its youth, but it can hardly be blamed for that. Because it was created only recently, it cannot yet compete with its main rivals, Python and R. Be patient, and you will see that there are many reasons to keep a close eye on this language, which will certainly make great strides in the near future.

MATLAB



MATLAB is a well-established language for numerical computing, used both in academia and in industry. It is developed and licensed by MathWorks, a company established in 1984 to commercialize the software.

License:


Paid; prices vary depending on your use case.

Benefits:


Disadvantages:


Our verdict: the best option for work that requires heavy mathematical computation.

Thanks to its widespread use for quantitative work in both academia and industry, MATLAB has become a worthy option for data science. It will serve you well if your daily work needs intensive, advanced mathematical functionality, which is exactly what MATLAB was designed for.

Other languages


There are other popular programming languages that may be of interest to data scientists. This section provides a brief overview.

C++


C++ is rarely used in data science, even though it offers lightning-fast performance and is widely popular. The main reason it has not caught on in this field is that writing it is too slow for this kind of work.

As one forum participant wrote:
“Suppose you are writing code to do some special analysis that is likely to run only once. Would you rather spend 30 minutes writing a program that runs in 10 seconds, or 10 minutes writing a program that runs in 1 minute?”

And this guy is right! However, C++ is an excellent choice for implementing low-level machine learning algorithms.

Our verdict: not the best choice for everyday work, but when it comes to performance...
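The arithmetic behind the forum quote can be checked with a short Python sketch. The helper `total_seconds` and the variable names are hypothetical; the figures are the ones from the quote. For a single run the quick-to-write program wins easily, and the slower-to-write one only pays off after a couple of dozen runs.

```python
def total_seconds(dev_minutes, run_seconds, runs):
    """Total time spent: writing the program once plus running it `runs` times."""
    return dev_minutes * 60 + run_seconds * runs


# Figures from the quote: 30 min to write / 10 s to run versus 10 min / 1 min.
slow_to_write = {"dev_minutes": 30, "run_seconds": 10}
quick_to_write = {"dev_minutes": 10, "run_seconds": 60}

one_run_slow = total_seconds(runs=1, **slow_to_write)    # 1810 seconds
one_run_quick = total_seconds(runs=1, **quick_to_write)  # 660 seconds

# Number of runs at which both options cost the same total time.
extra_dev = (30 - 10) * 60   # extra seconds spent writing the faster program
saved_per_run = 60 - 10      # seconds saved on every run
break_even_runs = extra_dev / saved_per_run
print(one_run_slow, one_run_quick, break_even_runs)
```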

JavaScript


With the Node.js platform developing rapidly over the past few years, JavaScript has increasingly become a server-side language. However, its capabilities in data science and machine learning remain quite modest today (although brain.js and synaptic.js should not be forgotten!). The disadvantages of JavaScript include:


Node.js's undoubted advantages include its asynchronous I/O, its growing popularity, and the fact that many languages compile to JavaScript. So it is quite possible that in the near future we will see a useful framework for data science with real-time ETL processing. The question is whether it will still be relevant by then...

Our verdict: there is still a lot to do before JavaScript can be considered a worthy language for data science.

Perl


Perl is known as the "Swiss Army knife of programming languages" because of its versatility as a general-purpose scripting language. As a dynamically typed scripting language, it has a lot in common with Python, but it is still very far from the popularity that Python enjoys in data science.

This is a bit surprising given its use in quantitative fields such as bioinformatics. When it comes to data science, Perl has several drawbacks: it is unlikely to gain popularity quickly in this field, and its syntax is considered unfriendly. In addition, its developers have made no concerted effort to create libraries for data science. And as we all know, the right action at the right moment often decides everything.

Our verdict: a useful general-purpose scripting language, but it certainly won't get you a job as a data scientist...

Ruby


Ruby is another dynamically typed, interpreted, general-purpose language. However, its creators seem to have shown no desire to adapt it for data science, in contrast to what happened with Python.

This may seem strange, but it all comes down to Python's dominant position in scientific research and to the positive feedback loop among people who write in it. The more people choose Python, the more modules and frameworks are developed for it, and the more programmers prefer it in turn. The SciRuby project was created to bring scientific computing functionality, such as matrix algebra, to Ruby. Despite these efforts, Python remains in the lead for now.

Our verdict: not quite the right choice for data science, but knowing Ruby won't hurt your resume.

Conclusion


Well, that wraps up our short guide to the programming languages most relevant to data science. The important thing is to understand what you need more: specificity or versatility, convenience or efficiency.

I regularly use R, Python, and SQL, as my current work is mainly focused on developing existing data pipelines and ETL processes. These languages strike the right balance of versatility and efficiency for this work, with the option of using R's more advanced statistical packages when necessary.

However, perhaps you already have a good command of Java, or you are eager to try out Scala on big data, or maybe you are crazy about the Julia project.

Or maybe you crammed MATLAB in lectures at university, or you wouldn't mind giving SciRuby a chance to prove itself? You may have hundreds of different reasons! If so, leave a comment below, because it is really important for us to know what each of you thinks!

Thanks for your attention!


Source: https://habr.com/ru/post/337330/

