A couple of days ago, David Robinson published an article on Stack Overflow with a very provocative title: "Developers Who Use Spaces Make More Money Than Those Who Use Tabs" (a translation is available on Habr). He took data from the developer survey conducted by Stack Overflow and showed that using spaces is indeed associated with higher salaries, even among developers with the same level of experience. So, do you need to use spaces instead of tabs to increase your salary?
The answer is definitely "no": correlation does not imply causation, and intuition suggests that code indentation has no direct connection with anyone's salary. The whole story puzzled many people and even made it into a BBC news report.
I believe that the ultimate goal of data science is to answer questions and uncover new cause-and-effect relationships. Unfortunately, the original article leaves many questions unanswered. It is a funny correlation, but what is behind it? In this article I will try to shed some light on that question. The original post made many people think about this problem, including me. So here is my small scientific detective story, with a deep dive into the Stack Overflow survey data. You will see that tabs and spaces are not what they seem. Spoiler: your salary depends more on the type of company and the environment you work in than on the type of indentation you use.
In his article, David shows that using spaces instead of tabs is associated with a higher salary, and that this effect appears at every level of experience. At the same time, those who use both spaces and tabs have the same salaries as those who use only tabs.
In addition, this effect reportedly does not depend on the programming language or on your specialization as a developer. The same can be said about company size. So why do higher-paid developers prefer spaces? Obviously, there must be some confounding factor here, but I was not sure that it had been captured in the survey. I began my own investigation by analyzing the linear regression model from the original article.
The original article includes a linear regression model that predicts salaries from several variables.
I decided to examine the data carefully and experiment with modified models. For my linear regression, I took only developers from the USA, partly because it is the largest sample in the survey and analyzing a single country eliminates many regional differences, and partly because I doubted the reliability of the reported salaries in some countries (see below). Now let's take the data and analyze it. I want to show you the chain of reasoning that led me to my conclusions.
Note that I changed the regression model used by David because it did not include an intercept (a constant bias term), which made it resemble an ANOVA-style model. I used standard linear regression with an intercept and fitted two models: a full model that includes the indentation type, and an abbreviated model without it.
Comparing the models should show how much information the preferred type of indentation contributes. Both models predict salaries equally well, or equally poorly, depending on your point of view. How do we know? We can look at the coefficient of determination R², which measures the proportion of the variance in salary that can be explained by the input variables (years of experience, language, and so on). The higher the value, the better salary can be modeled as a combination of the other factors.
Model | R² | R² adj |
---|---|---|
Full model | 0.4008 | 0.3892 |
Abbreviated model | 0.3938 | 0.3892 |
The two models have very similar accuracy: each accounts for about 40% of the variance in salary. R² is higher for the full model, which is to be expected for a model with more variables. The adjusted value R² adj can be used to compare the two models and see which one fits better. In the full model, R² adj is also higher, but the difference is only 0.0068. It seems that information about tabs and spaces matters but does not make a noticeable contribution. In the abbreviated linear regression model, the missing information can be partially compensated for by other variables.
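To make the two measures concrete, here is a minimal Python sketch of how R² and adjusted R² are computed. The data and the function names are made up for illustration and have nothing to do with the survey. The point is only that adjusted R² penalizes every extra predictor, which is why it is the fairer number when a larger model is compared with a smaller one.

```python
def r_squared(y, y_hat):
    """Share of the variance in y that the predictions y_hat explain."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # total sum of squares
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """R² penalized for the number of predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical toy data: observed values and the predictions of some model.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(observed, predicted)
adj = adjusted_r_squared(r2, n_obs=len(observed), n_predictors=1)
print(f"R2 = {r2:.4f}, adjusted R2 = {adj:.4f}")
```

With the same R², a model that uses more predictors gets a lower adjusted R², so a small gain in plain R² from an added variable can disappear after the adjustment.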
I checked for collinearity, which is always dangerous for predictive models. Collinearity is a situation in which some variables are highly correlated with each other, which makes it difficult to isolate their separate effects. I found no signs of it, and the regression coefficients do not change dramatically between the models.
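One simple way to look for collinearity is to compute pairwise correlations between predictors and flag the pairs above some threshold. The sketch below does this for numeric columns only; it is not necessarily the check used in this analysis, and the column names, the data, and the 0.8 threshold are all assumptions. Categorical survey variables would first have to be encoded as indicator columns.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical predictors for six respondents (not survey data).
predictors = {
    "years_experience": [1, 3, 5, 10, 15, 20],
    "age": [23, 26, 27, 34, 40, 44],
    "company_size": [50, 10, 500, 20, 1000, 5],
}

flagged = correlated_pairs(predictors)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} are highly correlated (r = {r})")
```

Pairwise correlation only catches collinearity between two variables at a time; a variable that is a combination of several others needs a stronger check such as the variance inflation factor.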
So what is the difference between the full and the abbreviated model? I decided to look at the p-values of the regression coefficients, which reflect the significance of each variable in the model. Did the significance of any variable increase substantially? I looked for variables whose p-values dropped by at least an order of magnitude (a factor of 10) to find out which variables became more important in the abbreviated model than in the full one.
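The comparison described above boils down to a simple filter over two sets of p-values. Here is a minimal sketch; the variable names and the p-values are invented for illustration and are not the values from the actual models.

```python
def variables_gaining_significance(p_full, p_abbreviated, factor=10.0):
    """Names whose p-value is at least `factor` times smaller in the abbreviated model."""
    shared = p_full.keys() & p_abbreviated.keys()
    # Multiplying instead of dividing avoids a division by zero for p-values of 0.
    return sorted(name for name in shared
                  if p_abbreviated[name] * factor <= p_full[name])


# Hypothetical p-values of regression coefficients in the two models.
p_full = {
    "years_experience": 1e-3,
    "open_source": 5e-2,
    "language_php": 2e-2,
    "company_size": 3e-1,
}
p_abbreviated = {
    "years_experience": 1e-5,
    "open_source": 1e-3,
    "language_php": 1e-3,
    "company_size": 2.5e-1,
}

gained = variables_gaining_significance(p_full, p_abbreviated)
print(gained)
```

A drop of this size means the variable has taken over part of the explanatory role of whatever was removed from the model.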
It turned out that several variables became more important in the abbreviated model.
The coefficients of these variables also changed, but not dramatically. Taken together, this means that if you remove the data on tabs and spaces, the model compensates with experience and contributions to open source (and with whether you work with PHP). Experience is an obvious factor in salary, so that is not surprising. My next candidate for investigation was open source.
I looked at open source contributions in detail and found something interesting: contributing is associated with a higher salary, at least if you live in the United States. Perhaps people with higher salaries are more likely to contribute to open source? The effect is visible across the whole range of experience.
How does open source relate to our debate about spaces and tabs? It seems that open source contributors use spaces much more often than others. Among those who do not contribute to open source, tabs and spaces are used about equally.
Among open source contributors, space users outnumber tab users by more than two to one. The difference is statistically significant, with a p-value of 9.1981718 × 10⁻²⁴. The same trend is observed in other countries, although open source contributors there use tabs a little more often.
I think we are now closer to a potential explanation for David's results. The main advantage of tabs is that their display width can be customized in the IDE, whereas spaces give a fixed layout. This means that the same code indented with tabs can look completely different to different people, and once spaces and tabs are mixed in one file, the result is a mess. I think that on an open source project without an agreed code style, the potential formatting problems push people toward spaces so that the code looks the same for everyone.
This is just one possible theory. I did not assess how active open source contributors are in language communities where spaces predominate (for example, Python or Ruby). Again, correlation does not imply causation.
Now the question is: does open source work explain the higher salaries of those who use spaces rather than tabs? If we plot salaries by open source contribution and by type of indentation, we get a more complex picture than in the original article, where only spaces and tabs were compared.
Junior developers who use both spaces and tabs and contribute to open source have a slightly higher average salary than space users who do not contribute. Open source contributors with more than 15 years of experience who use tabs have a higher average salary than space users. In addition, if you have less than 15 years of experience and use tabs, contributing to open source does not affect your salary. But if you use spaces, you earn more when you contribute to open source than when you do not. These results should be taken with some skepticism, because some of the groups are relatively small.
Overall, there is some effect, but it does not change the big picture: space users generally earn more than tab users. Is there anything else that can be analyzed?
At this point, I was convinced that whatever variables affect the salaries of space users and tab users were not part of a simple regression model. I did not want to do the grunt work of adding every available variable (there are more than 150, all of them categorical). Instead, I decided to analyze the distribution of salaries for the different types of indentation: do space users have higher salaries in general, or are there subgroups of space users that skew the results?
I plotted the salary distributions for different levels of experience. The effect is most noticeable in the density of salaries for developers with less than 5 years of experience. All three distributions have their main peak at about the same salary level, around $65,000–70,000. This peak represents the majority of juniors, for whom the choice of spaces or tabs apparently does not affect salary.
Curiously, the salary distribution of space users is bimodal (it has two peaks). Most of them earn the same as other developers, but there is a second group, consisting mostly of space users, that earns much more than the rest. What makes them different? I looked for the answer in the survey results, using the χ² test to check whether the counts of space users and tab users differed significantly across categories.
Since the number of programmers in the high-salary category was small, I got a lot of potential candidates. I was surprised that one of the variables whose values differ strongly between the high-paid group and the others is version control. I kept only the version control systems that are commonly used by juniors in the USA (at least 20 users in the survey):
Version control system | Higher salary | Lower salary |
---|---|---|
Git | 168 | 660 |
Other system | 17 | 30 |
Subversion | 4 | 47 |
Team Foundation Server | 6 | 92 |
It turns out that the choice of version control system is associated with the type of indentation used, and this holds for developers all over the world, not only for juniors in the USA (p-value 1.5336476 × 10⁻⁴⁴)! This means that there is a strong link between tabs, spaces, and version control systems.
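For illustration, the χ² test of independence can be applied to the table of counts above (version control system versus salary group among US juniors). This is a minimal pure-Python sketch of Pearson's test without a continuity correction, not necessarily the exact procedure used in this analysis, and the function names are hypothetical. Note that it tests a different pair of variables than the worldwide p-value quoted above, so the two numbers are not comparable.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


def chi_square_p_value(statistic, dof):
    """Upper tail probability of the chi-square distribution (integer dof >= 1)."""
    half = statistic / 2.0
    if dof % 2 == 0:
        terms = sum(half ** i / math.factorial(i) for i in range(dof // 2))
        return math.exp(-half) * terms
    tail = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
               for i in range(1, (dof - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


# Rows: Git, other system, Subversion, Team Foundation Server.
# Columns: higher salary, lower salary (counts from the table above).
counts = [[168, 660], [17, 30], [4, 47], [6, 92]]

statistic, dof = chi_square_independence(counts)
p_value = chi_square_p_value(statistic, dof)
print(f"chi2 = {statistic:.2f}, dof = {dof}, p = {p_value:.2e}")
```

A small p-value says only that salary group and version control system are not independent in this sample; it says nothing about which one drives the other.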
Let's look at this more closely. The two most popular systems among US developers (at least 200 users in the dataset) are Git and Team Foundation Server (TFS). How do they relate to salaries?
Git users earn more regardless of experience. This is an interesting finding that may be related to our earlier look at open source contributors. But what is far more interesting is how it all fits together: version control, tabs and spaces, and salary.
Version control systems break the pattern that high salaries always go with spaces. Companies that use Git pay more regardless of the type of indentation, at least for developers with up to 10 years of experience! Those who use Git and tabs earn more than space users who use TFS, regardless of experience. Within the Git group, space users still have higher salaries. But in the TFS group the situation is different: space users earn the least.
In other countries the picture is somewhat different, but you still would not want to be a programmer with 15+ years of experience who uses spaces and TFS.
I also analyzed Subversion users; worldwide, Subversion is slightly more popular than TFS. Subversion does not support the claim that space users generally earn more either. Those who use Git and tabs earn almost as much as those who use Subversion and spaces, and as much as Git users who use both spaces and tabs.
To sum up, the combination of two factors, contributing to open source and the choice of version control system, at least partly explains the difference in salaries between tab users and space users. This does not mean that you should start using Git and contributing to open source in order to get paid more (although both are welcome in any case!).
I think these two factors point to differences between environments and types of companies: how closely they stick to traditional approaches and how readily they adopt modern technologies. More conservative, old-school companies that do not use Git or open source code generally pay less. The type of environment is difficult to estimate directly from the survey results, so these two factors are only indirect hints in that direction.
This is not the end of the story, and I am sure there are other variables that can shed light on the spaces and tabs question. Also, my conclusions are mostly based on data from US developers, where the effect is most noticeable. Below I explain why I have problems with analyzing salaries in other countries.
When I looked at the distribution of salaries in relation to other factors, one thing I could not explain caught my attention. The data I worked with covered only full-time professional developers, yet there is a large group of people with a very low annual income of less than $3,000. Unfortunately, that in itself is not surprising, because incomes vary greatly between countries. What was strange was which countries these low salaries came from.
Most of the lowest-paid respondents were from India, which is understandable in this context: the average salary in India is significantly lower than in OECD countries. But next come Poland, Russia and even Germany. Salaries there may not be gigantic, but far less than $3,000 per year for a full-time developer is implausibly low.
I come from the Czech Republic myself, so I know the peculiarities of the region and I have a guess as to why the data looks so strange. I therefore checked the distribution of salaries in a couple of countries in Central and Eastern Europe and compared them with the distributions in countries from other parts of the world.
In countries such as the United Kingdom, France and even India, the salary distributions have a single peak. In all the countries of Central and Eastern Europe, they have two. The first peak corresponds to a very low salary; the second, much higher one corresponds to a plausible annual income. This is less pronounced in Germany, more pronounced in Poland and much more so in Russia. I analyzed several other countries, including the Czech Republic and Ukraine, and the same trend exists there. Every country in this region has a bimodal salary distribution. What is going on?
In my experience, Czechs always discuss salaries in terms of monthly rather than annual income. I have never heard a Czech talk about annual income. A Polish friend of mine confirmed this: everyone there thinks only in monthly incomes. It seems that many respondents simply did not read the survey question carefully and entered their monthly income instead of their annual one, because that is the figure they deal with in everyday life.
Is it possible to correct the data somehow? One option is to fit a mixture model and multiply the incomes in the low-income group by 12. The resulting distribution is cut off on the left, but it reflects real salaries in these countries more accurately than the original distributions. I tried this on the data for Poland.
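A rough sketch of this kind of correction is shown below. It fits a two-component Gaussian mixture to the logarithm of the reported incomes with a few steps of expectation-maximization and multiplies everything assigned to the lower component by 12. The incomes are invented, the function names are hypothetical, and fitting on a log scale is my assumption rather than a detail taken from the analysis.

```python
import math


def fit_two_gaussians(values, iterations=100):
    """EM for a two-component 1-D Gaussian mixture: returns (weights, means, stds)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                       # start the components at the extremes
    spread = (hi - lo) / 4 or 1.0
    stds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point belongs to component 0.
        # The 1/sqrt(2*pi) factor cancels out, so it is omitted.
        resp = []
        for x in values:
            dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / s
                    for w, m, s in zip(weights, means, stds)]
            resp.append(dens[0] / (sum(dens) or 1e-300))
        # M-step: re-estimate weight, mean and spread of both components.
        for k in (0, 1):
            r = resp if k == 0 else [1 - p for p in resp]
            nk = max(sum(r), 1e-12)
            means[k] = sum(ri * x for ri, x in zip(r, values)) / nk
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, values)) / nk
            stds[k] = max(math.sqrt(var), 1e-3)   # floor avoids a collapsed component
            weights[k] = nk / len(values)
    return weights, means, stds


def correct_monthly_incomes(incomes):
    """Multiply incomes in the lower mixture component by 12 (incomes must be > 0)."""
    logs = [math.log10(x) for x in incomes]
    weights, means, stds = fit_two_gaussians(logs)
    low = 0 if means[0] < means[1] else 1
    corrected = []
    for income, x in zip(incomes, logs):
        dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / s
                for w, m, s in zip(weights, means, stds)]
        corrected.append(income * 12 if dens[low] > dens[1 - low] else income)
    return corrected


# Hypothetical reported incomes: the first five look monthly, the rest annual.
reported = [1200, 1500, 1800, 2000, 2500, 15000, 18000, 22000, 25000, 30000, 36000]
fixed = correct_monthly_incomes(reported)
print(fixed)
```

The approach only works when the two groups are clearly separated. Genuinely low annual salaries that fall into the lower component would be inflated by mistake, so the corrected data should still be treated with caution.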
The main lesson is that data should always be treated with care. There are many distortions in the survey results, and some of them are quite unexpected. If I were not familiar with local customs, I would probably have assumed that some countries really do have a lot of low-paid, trainee-level positions. I am not sure in exactly which countries respondents reported monthly salaries instead of annual ones, so I limited myself to analyzing the American sample. I hope that this data is the most consistent.
Unfortunately, people do not always answer survey questions correctly, and this is very difficult to detect. It may also have affected the spaces and tabs results. Judging by the reaction on social media, some respondents said they use tabs because they press the Tab key, even though their editors silently convert tabs into spaces.
Source: https://habr.com/ru/post/331696/