
A novice data scientist can choose from many programming languages, and the right choice will help them master the field faster.
However, no one can tell you exactly which programming language is best for this purpose. Your success as a specialist in this field depends on many factors. Today we will try to consider them, and at the end of the article you will be able to vote for the programming language that you consider most suitable for working with data.
Specificity
As you go deeper into data science, you will not get far by reinventing the wheel again and again. Instead, you will need to master the various software packages and modules available for your chosen programming language. How far you can take this depends, first of all, on which domain-specific packages exist for that language.
Versatility
A top data scientist has good all-round programming skills as well as the ability to crunch numbers and analyze data. Most of the day-to-day work in data science revolves around finding and processing raw data, or cleaning it. Unfortunately, no amount of fancy machine learning packages will help you with that.
Efficiency
In the fast-paced world of commercial data science, there is a lot to be said for getting the job done quickly. However, that same speed is what lets technical debt creep in, and only consistent good practice can keep it to a minimum.
Performance
In some cases it is very important to optimize the performance of your code, especially when working with large volumes of mission-critical data. Compiled languages are usually much faster than interpreted ones. Similarly, statically typed languages are much more robust against errors than dynamically typed ones. The obvious trade-off is productivity.
To some extent, each of the languages below can be placed along two axes: versatility versus specificity, and performance versus convenience.
With these basic principles in mind, let's take a look at some of the most popular programming languages used in data science. Everything said about the languages below is based on my own observations and experience, as well as those of my friends and colleagues.
R
R, a direct descendant of the older S programming language, was released in 1995 and has gone from strength to strength ever since. Written in C and Fortran, the project is now supported by the R Foundation for Statistical Computing.
License: Free
Benefits:
- An excellent range of high-quality, open source, domain-specific packages. R has packages for virtually any quantitative or statistical application imaginable, including neural networks, nonlinear regression, phylogenetics, advanced plotting and much more.
- The base installation comes with extensive built-in functions and methods. R also handles matrix algebra particularly well.
- Data visualization is a key strength, thanks to libraries such as ggplot2.
Disadvantages:
- Poor performance. There is no way around it: R is not a fast language.
- Specificity. R is great for statistics and data science, but much less so for general-purpose programming.
- Quirks. R has several unusual features that can confuse programmers used to other languages: indexing starts at 1, there are several assignment operators, and the data structures are unconventional.
Our verdict: excellent at what it was designed for.
R is a powerful language with a huge selection of statistical and data visualization applications, and being open source has allowed it to gather a large and active community of developers. It owes its wide popularity to how well it does the job it was built for.
Python
Guido van Rossum introduced the Python programming language in 1991. Since then it has become an extremely popular general-purpose language and is widely used in the data science community. Currently, the main versions are Python 3.6 and Python 2.7.
License: Free
Benefits:
- Python is a very popular, widely used general-purpose programming language. It has an extensive set of purpose-built modules and a large developer community. Many online services provide APIs for Python.
- Python is very easy to learn. The low barrier to entry makes it an ideal first language for those new to programming.
- Packages such as pandas, scikit-learn and TensorFlow make Python a solid option for modern machine learning applications.
Disadvantages:
- Type safety. Python is a dynamically typed language, which means you have to be careful when working with it. Type mismatch errors (for example, passing a string as an argument to a method that expects an integer) can happen from time to time.
- For specific statistical and data analysis purposes, R's extensive range of packages gives it an edge over Python. And among general-purpose languages, there are faster and safer alternatives to Python.
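The type mismatch mentioned above can be illustrated with a minimal sketch. The function names and the retirement example are hypothetical, invented only for this illustration. Python does not enforce type hints at run time, so the mistake surfaces only when the offending line actually executes, unless you validate the argument yourself.

```python
def years_until_retirement(age: int, retirement_age: int = 65) -> int:
    """Return the years left until retirement (hypothetical helper)."""
    # The type hints are documentation only; Python does not check them.
    return retirement_age - age


def safe_years_until_retirement(age, retirement_age: int = 65) -> int:
    """Same calculation, but reject non-integer input with a clear message."""
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(age, bool) or not isinstance(age, int):
        raise TypeError(f"age must be an int, got {type(age).__name__}")
    return retirement_age - age


if __name__ == "__main__":
    print(years_until_retirement(40))
    try:
        # A string slips through until the subtraction is attempted.
        years_until_retirement("40")
    except TypeError as err:
        print("Failed only at run time:", err)
```

A static checker such as mypy can catch this kind of call before the program runs, which narrows the gap with statically typed languages somewhat.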
Our verdict: an excellent all-rounder.
Python is a good option for data science, and this holds for both beginner and advanced work in the field. Much of data science revolves around the ETL (extract, transform, load) process, and Python's versatility makes it well suited to such tasks. Libraries such as Google's TensorFlow make Python a very interesting language for machine learning.
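To make the ETL idea concrete, here is a minimal sketch that uses only the Python standard library. The sample data, the `users` table and all function names are assumptions made up for this example; a real pipeline would read from files or services and would more likely use a library such as pandas.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: one row has a missing age.
RAW_CSV = """name,age
Alice,34
Bob,
Carol,17
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows and convert age to an integer."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        age = row["age"].strip()
        if not name or not age.isdigit():
            continue  # skip rows that cannot be cleaned
        cleaned.append((name, int(age)))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run all three steps and return the number of rows loaded."""
    records = transform(extract(text))
    load(records, conn)
    return len(records)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(RAW_CSV, connection), "rows loaded")
```

Keeping the three steps as separate functions is a design choice for this sketch: each one can be tested and replaced on its own.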
SQL
SQL (“Structured Query Language”) is used to define, manage and query relational databases. The language appeared in 1974 and has undergone many changes since then, but its core principles remain the same.
License: There are free and paid options.
Benefits:
- It is very efficient at querying, updating and manipulating relational databases.
- Declarative syntax makes SQL a very readable language. There is no ambiguity about what SELECT name FROM users WHERE age > 18 is supposed to do!
- SQL is used in a very wide range of applications, so familiarity with it is very useful. Modules such as SQLAlchemy make it easy to integrate SQL with other languages.
Disadvantages:
- SQL's declarative syntax can be hard going for those who are used to imperative programming.
- There are many different implementations of SQL, such as PostgreSQL, SQLite and MariaDB. They differ enough to make interoperability a real headache.
Our verdict: timeless and efficient.
SQL is more useful as a data processing language than as an advanced analytical tool. Nevertheless, so much of data science depends on ETL, and the longevity and efficiency of SQL show that every data scientist should know it.
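The query quoted in the benefits above can be tried out from Python with the standard `sqlite3` module. This is only a sketch: the table contents and function names are invented for the example, and an ORDER BY clause has been added so that the result order is predictable.

```python
import sqlite3


def demo_connection():
    """Create an in-memory database with a small, made-up users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO users (name, age) VALUES (?, ?)",
        [("Alice", 34), ("Bob", 18), ("Carol", 17), ("Dave", 52)],
    )
    return conn


def adult_names(conn, min_age=18):
    """Return the names of users strictly older than min_age."""
    # The ? placeholder lets the driver pass the value safely.
    cursor = conn.execute(
        "SELECT name FROM users WHERE age > ? ORDER BY name", (min_age,)
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    print(adult_names(demo_connection()))
```

Note that the query describes what result is wanted, not how to loop over the rows; that is the declarative style the article refers to. Bob is excluded because the condition is strictly greater than 18.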
Java
Java is an extremely popular general-purpose language that runs on the Java Virtual Machine (JVM). The JVM is an abstract computing system that provides seamless portability between platforms. Java is currently supported by Oracle Corporation.
License: Version 8 is free
Benefits:
- Versatility. Many modern systems and applications are built in Java. A great advantage of the language is the ability to integrate data science methods directly into an existing code base.
- Strong typing. Type safety is not an empty phrase for Java, and when developing critical big data applications this feature matters more than ever.
- Java is a high-performance, compiled general-purpose language. This makes it suitable for writing efficient production ETL code as well as computationally intensive machine learning algorithms.
Disadvantages:
- Java's verbosity makes it a poor choice for ad hoc analyses and for more specialized statistical applications.
- Java has far fewer libraries for advanced statistical methods than domain-specific languages such as R.
Our verdict: a serious contender for the best language for data science.
There is a lot to be said for learning Java as a data science language. Many companies will appreciate being able to integrate finished data science code seamlessly into their own code base, and Java's performance and type safety are undeniable advantages. On the other hand, it lacks the range of specialized packages available for other languages. Despite this flaw, Java is definitely worth your attention, especially if you already know R or Python.
Scala
The Scala programming language, which runs on the JVM, was developed by Martin Odersky and released in 2004. It is a multi-paradigm language that supports both object-oriented and functional approaches. In addition, the Apache Spark cluster computing framework is written in Scala.
License: Free
Benefits:
- Scala together with Spark gives you access to high-performance cluster computing. Scala is an ideal choice for those who work with large amounts of data.
- Multi-paradigm. Scala programmers have both the object-oriented and the functional programming paradigms at their disposal.
- Scala compiles to Java bytecode and runs on the JVM. This lets it interoperate with Java, which makes Scala a very powerful general-purpose language that is also well suited to data science.
Disadvantages:
- If you are just starting out with Scala, be prepared to rack your brains. Your best bet is to download sbt and set up an IDE such as Eclipse or IntelliJ with a dedicated Scala plugin.
- Scala's syntax and type system are often described as complex, so programmers coming from dynamic languages such as Python will face a steep learning curve.
Our verdict: perfect for working with big data.
If you decide to use cluster computing for big data, the Scala + Spark pair is an ideal solution. Moreover, if you already have experience with Java or other statically typed languages, you will certainly appreciate these features of Scala. However, if your application does not deal with volumes of data large enough to justify the added complexity of Scala, you are likely to be more productive with other languages, such as R or Python.
Julia
Released a little over 5 years ago, Julia has made an impression on the world of numerical computing. It gained its profile because several large organizations, including some in the financial industry, adopted it almost immediately.
License: Free
Benefits:
- Julia is a JIT (“just in time”) compiled language, which allows it to achieve good performance. It also offers the simplicity, dynamic typing and scripting capabilities of an interpreted language such as Python.
- Julia was designed for numerical analysis, but it can also be used as a general-purpose programming language.
- Readability. Many programmers who work with the language consider this its greatest advantage.
Disadvantages:
- Immaturity. Since Julia is a fairly new language, some developers run into instability when working with its packages. The core language, however, is considered stable.
- Another sign of immaturity is the limited number of packages and the small developer community. Unlike the well-established R and Python, Julia does not have a large choice of packages (for now).
Our verdict: a language that has yet to show what it can do.
Yes, Julia's main problem is its youth, but it can hardly be blamed for that. Because it was created only recently, it cannot yet compete with its main rivals, Python and R. Be patient, though: there are many reasons to keep a close eye on this language, which will certainly make great strides in the near future.
MATLAB
MATLAB is an established language for numerical computing, used both in academia and in industry. It is developed and licensed by MathWorks, a company founded in 1984 to commercialize the software.
License: Proprietary. Prices vary depending on the option you choose.
Benefits:
- MATLAB was designed for numerical computing and is well suited to quantitative applications with demanding mathematical requirements, such as signal processing, Fourier transforms, matrix algebra and image processing.
- Data visualization. MATLAB has a number of built-in features for plotting graphs and charts.
- MATLAB is often taught in undergraduate courses in quantitative subjects such as physics, engineering and applied mathematics, and is therefore widely used in these fields.
Disadvantages:
- Paid license. Whichever option you choose (academic, personal or commercial), you will have to pay for an expensive license. Our advice: take a look at the free alternative, Octave.
- MATLAB is not the best language for general-purpose programming.
Our verdict: the best option for mathematically intensive work.
Thanks to its widespread use for quantitative work in both academia and industry, MATLAB is a worthy option for data science. It will come in handy if your daily work requires intensive, advanced mathematical functionality, which is exactly what MATLAB was designed for.
Other languages
There are other popular programming languages that may be of interest to data scientists. This section provides a brief overview.
C++
C++ is not often used in data science, even though it offers lightning-fast performance and is widely popular. The main reason it has not caught on in the field is that it is less productive for this kind of work.
As one forum participant wrote:
“Suppose you are writing code to do some ad hoc analysis that is likely to run only once. Would you prefer to spend 30 minutes writing a program that runs in 10 seconds, or 10 minutes writing a program that runs in 1 minute?”
And this guy is right! However, C++ is an excellent choice for implementing low-level machine learning algorithms.
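The arithmetic behind the quote can be worked through with a short sketch. The function names are made up for this illustration, and the numbers are the ones from the quote: 30 minutes of development plus a 10-second run, against 10 minutes of development plus a 1-minute run.

```python
def total_seconds(dev_minutes, run_seconds, runs=1):
    """Total time spent: development once, plus every execution."""
    return dev_minutes * 60 + run_seconds * runs


def break_even_runs(dev_a_min, run_a_s, dev_b_min, run_b_s):
    """Number of runs at which option A and option B cost the same.

    Option A is assumed to take longer to write but to run faster.
    """
    extra_dev_seconds = (dev_a_min - dev_b_min) * 60
    saved_per_run = run_b_s - run_a_s
    if extra_dev_seconds <= 0 or saved_per_run <= 0:
        raise ValueError("option A must cost more to write and run faster")
    return extra_dev_seconds / saved_per_run


if __name__ == "__main__":
    print("fast program, one run:", total_seconds(30, 10), "s")
    print("quick script, one run:", total_seconds(10, 60), "s")
    print("break-even after", break_even_runs(30, 10, 10, 60), "runs")
```

For a single run, the quick script costs 660 seconds in total against 1810 for the faster program. The extra 20 minutes of development are recovered at 50 seconds per run, so the two options only draw level after 24 runs.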
Our verdict: not the best choice for everyday work, but when it comes to performance...
JavaScript
With the rapid development of the Node.js platform over the past few years, JavaScript has increasingly become a server-side language. However, its capabilities in data science and machine learning are still quite modest (although you should not forget about brain.js and synaptic.js!). The disadvantages of JavaScript include:
- It arrived late to the game (Node.js is only 8 years old!)...
- The Node.js platform is really fast, but there will always be those who actively criticize JavaScript as a language.
The undoubted advantages of Node.js include its asynchronous I/O, its growing popularity, and the fact that many languages compile to JavaScript. So it is quite possible that in the near future we will see a useful data science framework with real-time ETL processing. The question is whether it will still be relevant by then...
Our verdict: there is still a lot to do before JavaScript can be considered a worthy language for data science.
Perl
Perl is known as the "Swiss Army Knife of Programming Languages" because of its versatility as a general-purpose scripting language. As a dynamically typed scripting language it has a lot in common with Python, but it is still very far from enjoying Python's popularity in data science.
This is a bit surprising given its use in quantitative fields such as bioinformatics. When it comes to data science, Perl has several drawbacks: it is unlikely to gain popularity quickly in this field, and its syntax is considered unfriendly. In addition, its developers have made no real push to create data science libraries. And as we all know, the right move at the right moment often decides everything.
Our verdict: a useful general-purpose scripting language, but it certainly won't get you a job as a data scientist...
Ruby
Ruby is another dynamically typed, interpreted general-purpose language. However, unlike Python, it has seen little effort to make it suitable for data science.
This may seem strange, but it has a lot to do with Python's dominant position in scientific research and with the positive feedback from the people who write in it. The more people choose Python, the more modules and frameworks are developed for it, and the more programmers prefer it in turn. The SciRuby project was created to bring scientific computing functionality, such as matrix algebra, to Ruby. But despite these efforts, Python is still in the lead for now.
Our verdict: not exactly the right choice for data science, but knowing Ruby will not hurt your resume.
Conclusion
So, we have made a quick tour of the programming languages most closely associated with data science. The important point is to understand what you need more: a specific or a versatile language, convenience or performance.
I regularly use R, Python and SQL, as my current work focuses mainly on developing existing data pipelines and ETL processes. These languages strike the right balance of versatility and efficiency for this work, with the option of using R's more advanced statistical packages when necessary.
However, perhaps you already have a good command of Java, or you are keen to try out Scala for working with big data, or maybe you are crazy about the Julia project.
Or maybe you studied MATLAB in university classes, or would like to give SciRuby a chance to prove itself? You may have hundreds of different reasons! If so, leave a comment below, because it is really important for us to know what each of you thinks.
Thank you for your attention!