
The 10:1 rule in programming and writing

In this article, the author analyzes how much work goes into writing books and software, and finds an interesting pattern that can help when planning project timelines.


Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
- Douglas Hofstadter, Gödel, Escher, Bach

Writing prose and writing code have a lot in common. Perhaps the most noticeable similarity is that neither writers nor programmers can finish their work on time. Writers are notorious for missing deadlines. Programmers have a reputation for estimates that turn out to be wildly off. The question is: why?

Today I have an idea of how to answer that question, and what I found surprised me.

Analyzing my books


I wrote both of my books, Hello, Startup and Terraform: Up & Running, in Atlas, a book-authoring environment that manages all content with Git. This means that every line of text, every edit, and every change was recorded in the Git commit log.

Let's see how much effort went into writing the two books.

Hello, Startup

Let's start with my first book, Hello, Startup. It has 602 pages and about 190 thousand words. I ran cloc on the Hello, Startup Git repository and got the following results (fractional parts are dropped for simplicity):



Those 602 pages contain 26,571 lines of text. The lion's share is written in AsciiDoc, a markup language similar to Markdown that Atlas uses for almost all content. Atlas uses HTML and CSS to define the layout and structure of the book. The rest is in various programming languages (Java, Ruby, Python, and others) used for the code examples that accompany the topics discussed in the book.

But 602 pages and 26,571 lines are just the end result. They do not reflect about 10 months of writing, editing, proofreading, stylistic adjustments, research, note-taking, and other work that went into publishing the book. So, to get a more useful picture, I used git-quick-stats to analyze the book's entire commit log.



So I added 163,756 lines and removed 131,425, for a total of 295,181 lines of churn. In other words, I wrote or deleted 295,181 lines in total, of which 26,571 remained. That is a ratio of slightly more than 10:1. For every line that was published, I first had to write 10 others!
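The arithmetic behind this ratio is simple enough to write down. The following is a minimal sketch; the function name `churn_ratio` is hypothetical, and the figures are the ones quoted above.

```python
def churn_ratio(added: int, deleted: int, final_lines: int) -> float:
    """Return lines of churn (added + deleted) per line of final output."""
    if final_lines <= 0:
        raise ValueError("final_lines must be positive")
    return (added + deleted) / final_lines


# Figures for Hello, Startup as quoted in the text.
ratio = churn_ratio(added=163_756, deleted=131_425, final_lines=26_571)
print(f"{ratio:.1f}:1")  # prints 11.1:1
```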

I admit that counting the lines added to and deleted from Git is not an ideal metric for the editing process. But it at least shows that the final output alone is not enough to judge the amount of work done. Moreover, a significant part of the process was never recorded in the Git commit log at all. For example, the first few chapters were written in Google Docs before I moved to Atlas, and many edits were made on my computer without ever being committed.
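For readers who want to reproduce such counts without git-quick-stats, one possible approach is to sum the per-file numbers printed by `git log --numstat`. This is only a sketch under stated assumptions, not necessarily how git-quick-stats works internally: the function names are hypothetical, and binary files (which Git reports as `-`) are skipped.

```python
import subprocess


def sum_numstat(numstat_output: str) -> tuple[int, int]:
    """Sum (added, deleted) over lines shaped like 'added<TAB>deleted<TAB>path'."""
    added = deleted = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separator lines between commits
        ins, dels, _path = parts
        if not (ins.isdigit() and dels.isdigit()):
            continue  # binary files are reported as "-" instead of numbers
        added += int(ins)
        deleted += int(dels)
    return added, deleted


def repo_churn(repo_path: str) -> tuple[int, int]:
    """Run git in repo_path and return total (added, deleted) lines."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    )
    return sum_numstat(result.stdout)
```

Calling `repo_churn(".")` inside a repository returns the two totals; like any commit-based count, it only sees changes that were actually committed.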

Even though this data is far from ideal, I believe the overall ratio of "raw material" to published text is about 10:1.

Terraform: Up & Running

Let's check whether this ratio also holds for my second book, Terraform: Up & Running, which has 206 pages and about 52 thousand words.

Simplified cloc output:



Those 206 pages consist of 8,410 lines of text. Again, most of the text is written in AsciiDoc, although this book contains noticeably more code samples, written primarily in HCL, the main language of Terraform. There is also a lot of Markdown, which I used to document the HCL examples.

Let's use git-quick-stats to check the edit history of this book:



Over almost five months, I added 32,209 lines and deleted 22,402, for a total of 54,611 lines of churn. The estimate for this book is even less accurate, because the work began as a series of blog posts that were substantially reworked before they moved to Atlas and Git. Those blog posts account for at least half of the book, so it is reasonable to increase the churn figure by 50%. That gives 54,611 * 1.5 = 81,916 lines of churn for 8,410 lines of final text.
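The correction for work done outside Git can be sketched as follows. The 50% uplift is the rough estimate given in the text, not a measured value, and the function name `adjusted_churn` is hypothetical.

```python
def adjusted_churn(recorded_churn: int, uplift: float) -> float:
    """Scale churn recorded in Git by an estimated share of untracked work."""
    if uplift < 0:
        raise ValueError("uplift must not be negative")
    return recorded_churn * (1 + uplift)


# Figures for the Terraform book as quoted in the text.
recorded = 32_209 + 22_402                        # 54,611 lines recorded in Git
estimated = adjusted_churn(recorded, uplift=0.5)  # 81,916.5 lines
print(f"{estimated / 8_410:.1f}:1")               # prints 9.7:1
```

A different guess for the untracked share shifts the result: with no uplift at all, the same figures give about 6.5:1.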

And once again the ratio is about 10:1!

No wonder writers miss their deadlines. If the schedule calls for a 250-page book, in practice we end up writing 2,500 pages along the way.

What about programming?


How do things look in software development? I decided to check several open source Git repositories at different levels of maturity, from a few months old to 23 years old.

terraform-aws-couchbase (2018)

terraform-aws-couchbase is a set of modules for deploying and managing Couchbase on AWS; it was open sourced in 2018.

Simplified cloc output:



And here is the git-quick-stats check result:



That is 37,693 lines of churn for 7,481 lines of final code, a ratio of 5:1. Even in a repository less than 5 months old, each line has already been rewritten five times! No wonder software estimation is hard: we do not realize that to end up with 7.5 thousand lines of final code, we actually have to write more than 35 thousand.

Let's see how things look in older projects.

Terratest (2016)

Terratest is an open source library, created in 2016, for testing infrastructure code.

Simplified cloc output:



git-quick-stats results:



That is 49,126 lines of churn that turned into 6,140 lines of final code. For this two-year-old repository, the ratio is 8:1. But Terratest is still quite young, so let's look at older repositories.

Terraform (2014)

Terraform is an open source tool, released in 2014, for managing infrastructure as code.

Simplified cloc output:



git-quick-stats results:



That is 12,945,966 lines of churn that produced 1,371,718 lines of final code. The ratio is 9:1. Terraform has been around for almost 4 years, but it has not yet reached version 1.0, so even at this ratio its code base is not yet mature. Let's look further into the past.

Express.js (2010)

Express is a popular open source JavaScript web framework released in 2010.

Simplified cloc output:



git-quick-stats results:



That is 224,211 lines of churn, reduced to 15,332 final lines. The ratio is 14:1. Express is about 8 years old, and its latest releases are in the 4.x series. It is considered the most popular and battle-tested web framework for Node.js.

It seems that once the ratio reaches 10:1, we can say with confidence that the code base is "mature". Let's see what happens if we go even further back.

jQuery (2006)

jQuery is a popular open source JavaScript library released in 2006.

Simplified cloc output:



git-quick-stats results:



That is a total of 730,146 lines of churn for 47,559 lines of final code: a ratio of 15:1 for a repository that is nearly twelve years old.

Let's go back another ten years.

MySQL (1995)

MySQL is a popular open source relational database created in 1995.

Simplified cloc output:



git-quick-stats results:



That is 58,562,999 lines of churn for 3,662,869 lines of final code, a ratio of 16:1 for a repository that is nearly twenty-three years old. Wow! On average, each line of MySQL code has been rewritten 16 times.

Findings


Here is a summary of the results for my books:

| Title | Lines of churn | Final lines | Ratio |
|---|---|---|---|
| Hello, Startup | 295,181 | 26,571 | 11:1 |
| Terraform: Up & Running | 81,916 | 8,410 | 10:1 |

Here is a summary table for the programming projects:

| Project | Year released | Lines of churn | Final lines | Ratio |
|---|---|---|---|---|
| terraform-aws-couchbase | 2018 | 37,693 | 7,481 | 5:1 |
| Terratest | 2016 | 49,126 | 6,140 | 8:1 |
| Terraform | 2014 | 12,945,966 | 1,371,718 | 9:1 |
| Express | 2010 | 224,211 | 15,325 | 14:1 |
| jQuery | 2006 | 730,146 | 47,559 | 15:1 |
| MySQL | 1995 | 58,562,999 | 3,662,869 | 16:1 |
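The figures in this summary suggest that the ratio grows with the age of the repository. A quick way to check that reading is sketched below; the numbers are copied from the summary above, while the names `PROJECTS`, `ratios_by_age`, and `ratio_grows_with_age` are hypothetical.

```python
# (project, year released, lines of churn, final lines), copied from the summary.
PROJECTS = [
    ("terraform-aws-couchbase", 2018, 37_693, 7_481),
    ("Terratest", 2016, 49_126, 6_140),
    ("Terraform", 2014, 12_945_966, 1_371_718),
    ("Express", 2010, 224_211, 15_325),
    ("jQuery", 2006, 730_146, 47_559),
    ("MySQL", 1995, 58_562_999, 3_662_869),
]


def ratios_by_age(projects):
    """Return (name, year, ratio) tuples ordered from newest to oldest."""
    newest_first = sorted(projects, key=lambda p: p[1], reverse=True)
    return [(name, year, churn / final) for name, year, churn, final in newest_first]


def ratio_grows_with_age(projects) -> bool:
    """True if each older project has a ratio at least as high as the newer one."""
    ratios = [ratio for _, _, ratio in ratios_by_age(projects)]
    return all(newer <= older for newer, older in zip(ratios, ratios[1:]))


for name, year, ratio in ratios_by_age(PROJECTS):
    print(f"{name} ({year}): {ratio:.1f}:1")
```

Six repositories are far too few to establish a trend, so this only confirms that the sample is consistent with the idea.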

What do all these numbers mean?

The 10:1 rule in prose and programming

Given that my data set is limited, I can only draw some preliminary conclusions:

  1. The ratio of "raw material" to "final product" for a book is about 10:1. Keep this figure in mind when you discuss the delivery schedule with your editor. If you need to write a 300-page book, you will actually have to write about 3 thousand pages.
  2. A similar rule applies to mature, non-trivial software: the ratio of code churn to final code is at least 10:1. Keep this in mind when a manager or client asks you for a time estimate. A 10-thousand-line application will require you to write about 100 thousand lines.

These findings can be summarized as the 10:1 rule of writing and programming:
Writing good software or good prose requires rewriting each line about 10 times on average.
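As a rough planning aid, the rule can be turned into a one-line calculation. This is a minimal sketch, assuming the 10:1 ratio holds for the project at hand; the function name `estimate_total_output` is hypothetical.

```python
def estimate_total_output(final_size: int, ratio: float = 10.0) -> int:
    """Estimate how many pages or lines must be written to end up with final_size."""
    if final_size < 0 or ratio <= 0:
        raise ValueError("final_size must be >= 0 and ratio must be > 0")
    return round(final_size * ratio)


print(estimate_total_output(300))     # a 300-page book: prints 3000
print(estimate_total_output(10_000))  # a 10,000-line application: prints 100000
```

The `ratio` parameter can be changed for other cases, such as the 5:1 observed for the youngest repository in the sample.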

Next steps


Of course, lines of code and lines of text are far from an ideal measure. But I think that with enough data, we could determine how universal the 10:1 rule is and how useful it is for estimating project timelines.

Some questions I would like to answer:


If you are a book author and can run a similar analysis, I would be glad to hear about your results. And if someone has time to automate such an analysis, it would be great to learn the ratios for various open source projects.

Update August 13

Discussions of this post on Hacker News and Reddit's r/programming brought up two more interesting points:
  1. Apparently, a similar 10:1 rule holds for movies, journalism, music, and photography! Who would have thought?
  2. Many readers commented that changing even a single character can be counted in Git as the insertion or deletion of a line, so a figure of 100 thousand changed lines does not mean that every line was substantially reworked.
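The second point is easy to see with a line-based diff. The sketch below uses Python's `difflib` as a stand-in for Git's diff, which is an assumption: Git's own algorithm may pair lines differently, but it also reports changes per line. The function name `count_line_changes` is hypothetical.

```python
import difflib


def count_line_changes(before: str, after: str) -> tuple[int, int]:
    """Return (insertions, deletions) as a line-based diff would count them."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    insertions = deletions = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deletions += i2 - i1   # old lines that no longer appear
        if tag in ("replace", "insert"):
            insertions += j2 - j1  # new lines that were not there before
    return insertions, deletions


# A one-character typo fix still counts as one line deleted and one inserted.
before = "alpha\nbeta\ngamma"
after = "alpha\nbeto\ngamma"
print(count_line_changes(before, after))  # prints (1, 1)
```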

The last remark is valid, but, as I wrote above, my data also leaves out other kinds of changes:

  1. I do not make a commit for every single line. I can change a line ten times but make only one commit.
  2. This is even more true for programming. While testing code, I can change one line 50 times and still make only one commit.
  3. Many rounds of writing and editing happened outside of Git (some chapters were written in Google Docs or Medium, and stylistic edits were made in PDF).

I think these factors compensate for the way Git counts line insertions and deletions. Of course, my estimates may be inaccurate, and the actual ratio may be 8:1 or 12:1. But overall the difference is not that large, and 10:1 is easier to remember.

Update August 14

GitHub user Decagon created a repository called hofs-churn with a bash script that makes it easy to calculate the code churn in your repositories. He also used it to analyze a variety of repositories, such as React.js, Vue, Angular, RxJava, and many others, and the results were quite interesting.


Source: https://habr.com/ru/post/420821/

