
Tabs or spaces? An analysis of 400 thousand GitHub repositories, 1 billion files, and 14 TB of code



Whether to format code with tabs or spaces is still an open question for developers. The two are not simply interchangeable: a tab may stand for 2 spaces or for 4, for example. There is no single standard, so developers sometimes misunderstand one another. In addition, different IDEs and compilers handle tabs in their own way.

The issue is usually settled by agreeing on formatting rules within a project or for a programming language as a whole.

A development team at Google studied projects hosted on GitHub. They analyzed code written in 14 programming languages. The goal of the study was to determine the ratio of tabs to spaces, that is, the most popular way of formatting code in each language.

Implementation


The analysis used an existing table, [bigquery-public-data:github_repos.sample_files], which records the names of GitHub repositories.

Recall that about two months ago, all open source code on GitHub became available in the form of BigQuery tables.

Not every repository was analyzed, however: only the 400 thousand repositories that received the most stars from January to May 2016 were selected.



From this table, the files containing code in the 14 most popular programming languages were selected. To do this, the corresponding file extensions were specified as parameters of the SQL query: .java, .h, .js, .c, .php, .html, .cs, .json, .py, .cpp, .xml, .rb, .cc, .go.

SELECT a.id id, size, content, binary, copies, sample_repo_name , sample_path
FROM (
  SELECT id, FIRST(path) sample_path, FIRST(repo_name) sample_repo_name 
  FROM [bigquery-public-data:github_repos.sample_files] 
  WHERE REGEXP_EXTRACT(path, r'\.([^\.]*)$') IN ('java','h','js','c','php','html','cs','json','py','cpp','xml','rb','cc','go')
  GROUP BY id
) a
JOIN [bigquery-public-data:github_repos.contents] b
ON a.id = b.id

864.6s elapsed, 1.60 TB processed
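The extension filter in the query above can be tried locally. The following is only a minimal Python sketch of that one step, not part of the original analysis: it reuses the query's regular expression and its list of 14 extensions, while the function names `extension` and `is_selected` are hypothetical.

```python
import re

# The 14 extensions listed in the query above.
EXTENSIONS = {"java", "h", "js", "c", "php", "html", "cs", "json",
              "py", "cpp", "xml", "rb", "cc", "go"}

# Same pattern as in the query: the text after the last dot in the path.
EXT_PATTERN = re.compile(r"\.([^\.]*)$")


def extension(path):
    """Return the text after the last dot, or None if the path has no dot."""
    match = EXT_PATTERN.search(path)
    return match.group(1) if match else None


def is_selected(path):
    """True if the path ends in one of the 14 extensions."""
    return extension(path) in EXTENSIONS
```

Because the pattern captures only what follows the last dot, a path such as `lib/jquery.min.js` (an illustrative name) is counted as a `js` file, and a path without a dot is skipped.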

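The query ran for a relatively long time because it joins a table of 190 million rows with one of 70 million rows and processes 1.6 TB of content.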

In the [contents] table, each unique file is represented only once.



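Each file was then classified as tab-indented or space-indented, and the results were aggregated by extension: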

SELECT ext, tabs, spaces, countext, LOG((spaces+1)/(tabs+1)) lratio
FROM (
  SELECT REGEXP_EXTRACT(sample_path, r'\.([^\.]*)$') ext, 
         SUM(best='tab') tabs, SUM(best='space') spaces, 
         COUNT(*) countext
  FROM (
    SELECT sample_path, sample_repo_name, IF(SUM(line=' ')>SUM(line='\t'), 'space', 'tab') WITHIN RECORD best,
           COUNT(line) WITHIN RECORD c
    FROM (
      SELECT LEFT(SPLIT(content, '\n'), 1) line, sample_path, sample_repo_name 
      FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
      HAVING REGEXP_MATCH(line, r'[ \t]')
    )
    HAVING c>10 # more than 10 lines that start with space or tab
  )
  GROUP BY ext
)
ORDER BY countext DESC
LIMIT 100

16.0s elapsed, 133 GB processed

Analyzing 133 GB of code took 16 seconds in BigQuery.
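The logic of the ranking query can be sketched in Python for small local inputs. This is a hedged illustration rather than the method used in the study: the function names `classify`, `log_ratio`, and `summarize` are hypothetical, and the input is assumed to be an iterable of (extension, content) pairs.

```python
import math
from collections import Counter


def classify(content):
    """Return 'space' or 'tab' for a file, or None if too few lines are indented."""
    # First character of each line, keeping only lines that start with space or tab.
    firsts = [line[0] for line in content.split("\n") if line[:1] in (" ", "\t")]
    if len(firsts) <= 10:  # the query keeps only files with c > 10
        return None
    spaces = firsts.count(" ")
    tabs = firsts.count("\t")
    # Ties go to 'tab', as in IF(SUM(line=' ')>SUM(line='\t'), 'space', 'tab').
    return "space" if spaces > tabs else "tab"


def log_ratio(spaces, tabs):
    """Natural log of (spaces+1)/(tabs+1): positive when space files dominate."""
    return math.log((spaces + 1) / (tabs + 1))


def summarize(files):
    """Map each extension to (tab files, space files, log ratio)."""
    counts = {}
    for ext, content in files:
        best = classify(content)
        if best is None:
            continue
        counts.setdefault(ext, Counter())[best] += 1
    return {ext: (c["tab"], c["space"], log_ratio(c["space"], c["tab"]))
            for ext, c in counts.items()}
```

Adding 1 to both counts keeps the ratio defined when a language has no tab-indented or no space-indented files, and the logarithm makes the result symmetric around zero.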




Source: https://habr.com/ru/post/308974/

