
Analyzing book recommendations for developers using Stack Overflow data and Python tools

Deciding which programming book to read next is difficult and risky.

As a developer, your time is scarce, and reading a book takes a big chunk of it. You could be programming. You could be resting. But instead you devote that valuable time to improving your skills.


So which book should you read? My colleagues and I often discuss the books we have read, and I've noticed that our opinions on specific titles differ widely.
So I decided to dig deeper into the problem. My idea was this: analyze the world's most popular resource for programmers for links to a well-known bookstore, and then count how many times each book is mentioned.

Fortunately, Stack Exchange (the parent company of Stack Overflow) had just published its data dump. So I sat down and started coding.


Screenshot of the tool I created: dev-books.com

“In case you were wondering, the most frequently recommended book is Working Effectively with Legacy Code (Michael Feathers), followed by Design Patterns (Erich Gamma). The titles may be as dry as the Atacama Desert, but the content should be of good quality. You can also sort books by tags, such as javascript, c, graphics and so on. Of course, the list of recommendations doesn't end with these two, but they are certainly excellent starting points if you are just getting into coding or want to upgrade your skills,” reads a review on Lifehacker.com.

Soon after, I launched dev-books.com, which lets you explore all the data I collected and sorted. It drew more than 100,000 visitors, and many people wrote asking me to describe the whole technical process.

So today I will tell you how I did all this.

Receiving and importing data

I took a dump of the Stack Exchange database from archive.org.

From the very beginning, I realized that importing a 48 GB XML file into a freshly created database (PostgreSQL) would not work with popular approaches like myxml := pg_read_file('path/to/my_file.xml'), because I did not have 48 GB of RAM on the server. So I decided to use a SAX parser.

All the values were stored as attributes of <row> tags, so I wrote a Python handler to parse them:

# startElement method of the SAX handler; self.cur is a psycopg2 cursor.
def startElement(self, name, attributes):
    if name == 'row':
        self.cur.execute(
            "INSERT INTO posts (Id, Post_Type_Id, Parent_Id, Accepted_Answer_Id, "
            "Creation_Date, Score, View_Count, Body, Owner_User_Id, "
            "Last_Editor_User_Id, Last_Editor_Display_Name, Last_Edit_Date, "
            "Last_Activity_Date, Community_Owned_Date, Closed_Date, Title, Tags, "
            "Answer_Count, Comment_Count, Favorite_Count) "
            "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
            (
                attributes.get('Id'),
                attributes.get('PostTypeId'),
                attributes.get('ParentID'),   # the 'ParentID' mistake mentioned below
                attributes.get('AcceptedAnswerId'),
                attributes.get('CreationDate'),
                attributes.get('Score'),
                attributes.get('ViewCount'),
                attributes.get('Body'),
                attributes.get('OwnerUserId'),
                attributes.get('LastEditorUserId'),
                attributes.get('LastEditorDisplayName'),
                attributes.get('LastEditDate'),
                attributes.get('LastActivityDate'),
                attributes.get('CommunityOwnedDate'),
                attributes.get('ClosedDate'),
                attributes.get('Title'),
                attributes.get('Tags'),
                attributes.get('AnswerCount'),
                attributes.get('CommentCount'),
                attributes.get('FavoriteCount'),
            )
        )

After three days of importing (almost half of the XML had been loaded by then), I realized I had made a mistake: the attribute name should have been ParentId, not ParentID.

I didn't want to wait another week, so I switched from an AMD E-350 (2 x 1.35 GHz) to an Intel G2020 (2 x 2.90 GHz). But that didn't speed up the process either.

The next solution was to batch the inserts:

import xml.sax
from io import StringIO


class docHandler(xml.sax.ContentHandler):
    def __init__(self, cursor):
        self.cursor = cursor
        self.queue = 0
        self.output = StringIO()

    def startElement(self, name, attributes):
        if name == 'row':
            # Append one tab-separated line per post; missing attributes become \N.
            self.output.write(
                attributes['Id'] + '\t' +
                (attributes['PostTypeId'] if 'PostTypeId' in attributes else '\\N') + '\t' +
                (attributes['ParentId'] if 'ParentId' in attributes else '\\N') + '\t' +
                (attributes['AcceptedAnswerId'] if 'AcceptedAnswerId' in attributes else '\\N') + '\t' +
                (attributes['CreationDate'] if 'CreationDate' in attributes else '\\N') + '\t' +
                (attributes['Score'] if 'Score' in attributes else '\\N') + '\t' +
                (attributes['ViewCount'] if 'ViewCount' in attributes else '\\N') + '\t' +
                (attributes['Body'].replace('\\', '\\\\').replace('\n', '\\\n').replace('\r', '\\\r').replace('\t', '\\\t') if 'Body' in attributes else '\\N') + '\t' +
                (attributes['OwnerUserId'] if 'OwnerUserId' in attributes else '\\N') + '\t' +
                (attributes['LastEditorUserId'] if 'LastEditorUserId' in attributes else '\\N') + '\t' +
                (attributes['LastEditorDisplayName'].replace('\n', '\\n') if 'LastEditorDisplayName' in attributes else '\\N') + '\t' +
                (attributes['LastEditDate'] if 'LastEditDate' in attributes else '\\N') + '\t' +
                (attributes['LastActivityDate'] if 'LastActivityDate' in attributes else '\\N') + '\t' +
                (attributes['CommunityOwnedDate'] if 'CommunityOwnedDate' in attributes else '\\N') + '\t' +
                (attributes['ClosedDate'] if 'ClosedDate' in attributes else '\\N') + '\t' +
                (attributes['Title'].replace('\\', '\\\\').replace('\n', '\\\n').replace('\r', '\\\r').replace('\t', '\\\t') if 'Title' in attributes else '\\N') + '\t' +
                (attributes['Tags'].replace('\n', '\\n') if 'Tags' in attributes else '\\N') + '\t' +
                (attributes['AnswerCount'] if 'AnswerCount' in attributes else '\\N') + '\t' +
                (attributes['CommentCount'] if 'CommentCount' in attributes else '\\N') + '\t' +
                (attributes['FavoriteCount'] if 'FavoriteCount' in attributes else '\\N') + '\n'
            )
            self.queue += 1
            if self.queue >= 100000:
                self.queue = 0
                self.flush()

    def flush(self):
        # Feed the accumulated batch to PostgreSQL via COPY and start a new one.
        self.output.seek(0)
        self.cursor.copy_from(self.output, 'posts')
        self.output.close()
        self.output = StringIO()

StringIO lets you wrap the accumulated rows in a file-like object that copy_from can consume, and copy_from uses COPY under the hood. With this approach, the whole import took just one night.
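
A handler like this is driven by a standard SAX parser. Here is a minimal sketch of the wiring; the dump file name, connection settings, and final flush are illustrative assumptions, not the exact script that was used:

# Minimal driver sketch; the file name and DSN are illustrative assumptions.
import xml.sax
import psycopg2

conn = psycopg2.connect("dbname=so user=postgres")  # assumed connection settings
cursor = conn.cursor()

handler = docHandler(cursor)           # the handler class shown above
xml.sax.parse('Posts.xml', handler)    # streams the 48 GB dump row by row

handler.flush()                        # push the last, partially filled batch
conn.commit()
cursor.close()
conn.close()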

Great, now it was time to create indexes. In theory, GiST indexes are slower than GIN but take up less space, so I decided to use GiST. A day later, I had an index that occupied 70 GB.

When I tried a couple of test queries, I realized they took far too long to process. The reason? Disk I/O waits. A GOODRAM C40 120 GB SSD helped a lot, even though it is not the fastest SSD out there.

I created a new PostgreSQL cluster:

 initdb -D /media/ssd/postgresq/data 

Then I changed the data path in the service config (I use the Manjaro distribution):

vim /usr/lib/systemd/system/postgresql.service

Environment=PGROOT=/media/ssd/postgres
PIDFile=/media/ssd/postgres/data/postmaster.pid


I reloaded the systemd configuration and started PostgreSQL:

systemctl daemon-reload
systemctl start postgresql

This time the import took a couple of hours, and I used GIN for indexing. The index took 20 GB of space on the SSD, and simple queries ran in less than a minute.
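
The exact index definition isn't shown in the post. One plausible way to get a GIN index that speeds up the LIKE '%...%' searches used below is a trigram index from the pg_trgm extension; the following is only a sketch of that idea, not necessarily what was actually run:

# Sketch: a trigram GIN index on posts.body (an assumption, not the real DDL).
# pg_trgm lets a GIN index serve LIKE '%...%' queries.
import psycopg2

conn = psycopg2.connect("dbname=so user=postgres")  # assumed connection settings
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
    cur.execute(
        "CREATE INDEX IF NOT EXISTS posts_body_trgm_idx "
        "ON posts USING GIN (body gin_trgm_ops);"
    )
conn.close()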

Extracting books from the database

When my data was finally imported, I started looking for posts that mentioned books, and then copied them into a separate table using SQL:

CREATE TABLE books_posts AS SELECT * FROM posts WHERE body LIKE '%book%';

The next step was to find all the hyperlinks inside these posts:

CREATE TABLE http_books AS SELECT * FROM books_posts WHERE body LIKE '%http%';

At this point, I realized that Stack Overflow proxies such links in the following form:

rads.stackoverflow.com/[$isbn]/

I created another table for all posts with links:

CREATE TABLE rads_posts AS SELECT * FROM posts WHERE body LIKE '%http://rads.stackoverflow.com%';

Then I used regular expressions to extract all the ISBN numbers. I also extracted the Stack Overflow tags into a separate table using regexp_split_to_table.
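
The extraction code itself is not shown in the post; the sketch below only illustrates the idea. The regular expression and the staging tables post_isbns and post_tags are my assumptions:

# Sketch of the extraction step; the regex and the staging tables
# post_isbns / post_tags are assumptions, not the author's code.
import re
import psycopg2

# The proxied links look like rads.stackoverflow.com/[$isbn]/ .
ISBN_RE = re.compile(r'rads\.stackoverflow\.com/([0-9]{13}|[0-9X]{10})', re.I)

conn = psycopg2.connect("dbname=so user=postgres")  # assumed connection settings
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS post_isbns (post_id int, isbn text)")
cur.execute("CREATE TABLE IF NOT EXISTS post_tags  (post_id int, tag text)")

# 1) Pull ISBNs out of the post bodies with a regular expression.
cur.execute("SELECT id, body FROM rads_posts")
for post_id, body in cur.fetchall():
    for isbn in set(ISBN_RE.findall(body or '')):
        cur.execute("INSERT INTO post_isbns VALUES (%s, %s)", (post_id, isbn))

# 2) Explode the Tags column ('<python><postgresql>') into one row per tag.
#    (Answers inherit tags from their parent question; that join is omitted here.)
cur.execute("""
    INSERT INTO post_tags (post_id, tag)
    SELECT id,
           regexp_split_to_table(trim(both '<>' from tags), '><')
    FROM   rads_posts
    WHERE  tags IS NOT NULL
""")
conn.commit()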

After extracting the most popular tags and counting book mentions for each of them, it turned out that the top 20 most-mentioned books were roughly the same across all tags.
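
Judging by the final query below, the isbn_tags table carries a per-tag mention count (post_count). Roughly, it can be built by grouping the staging tables from the previous sketch; again, the exact schema is an assumption:

# Sketch: build isbn_tags(tag, isbn, post_count) by grouping the staging
# tables from the previous sketch; the schema is an assumption that merely
# matches the names used in the final query below.
import psycopg2

conn = psycopg2.connect("dbname=so user=postgres")  # assumed connection settings
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE isbn_tags AS
        SELECT t.tag,
               i.isbn,
               COUNT(DISTINCT i.post_id) AS post_count
        FROM   post_isbns AS i
        JOIN   post_tags  AS t ON t.post_id = i.post_id
        GROUP  BY t.tag, i.isbn
    """)
conn.close()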

My next step: tag refinement.

The idea was to take the top 20 most frequently mentioned books for each tag and exclude those books that have already been processed.

Since this was a one-off job, I decided to use PostgreSQL arrays. I wrote a script that generates a query like this one:

SELECT *
     , ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude))
     , ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)), 1)
FROM (
    SELECT *
         , ARRAY['isbn1', 'isbn2', 'isbn3'] AS to_exclude
    FROM (
        SELECT tag
             , ARRAY_AGG(DISTINCT isbn) AS isbns
             , COUNT(DISTINCT isbn)
        FROM (
            SELECT * FROM (
                SELECT it.*
                     , t.popularity
                FROM isbn_tags AS it
                LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                WHERE it.tag IN (
                    SELECT tag FROM tags ORDER BY popularity DESC LIMIT 1 OFFSET 0
                )
                ORDER BY post_count DESC
                LIMIT 20
            ) AS t1

            UNION ALL

            SELECT * FROM (
                SELECT it.*
                     , t.popularity
                FROM isbn_tags AS it
                LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                WHERE it.tag IN (
                    SELECT tag FROM tags ORDER BY popularity DESC LIMIT 1 OFFSET 1
                )
                ORDER BY post_count DESC
                LIMIT 20
            ) AS t2

            UNION ALL

            SELECT * FROM (
                SELECT it.*
                     , t.popularity
                FROM isbn_tags AS it
                LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                WHERE it.tag IN (
                    SELECT tag FROM tags ORDER BY popularity DESC LIMIT 1 OFFSET 2
                )
                ORDER BY post_count DESC
                LIMIT 20
            ) AS t3

            ...

            UNION ALL

            SELECT * FROM (
                SELECT it.*
                     , t.popularity
                FROM isbn_tags AS it
                LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
                LEFT OUTER JOIN tags AS t ON t.tag = it.tag
                WHERE it.tag IN (
                    SELECT tag FROM tags ORDER BY popularity DESC LIMIT 1 OFFSET 78
                )
                ORDER BY post_count DESC
                LIMIT 20
            ) AS t79
        ) AS tt
        GROUP BY tag
        ORDER BY MAX(popularity) DESC
    ) AS ttt
) AS tttt
ORDER BY ARRAY_UPPER(ARRAY(SELECT UNNEST(arr) EXCEPT SELECT UNNEST(la)), 1) DESC;
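
The script that assembles this query is not shown in the post. Here is a minimal sketch of how such a repetitive statement could be generated, assuming 79 top tags and a placeholder exclusion list:

# Sketch of a query generator for the statement above (assumed details:
# 79 tags, placeholder exclusion list).
TAG_BLOCK = """
SELECT * FROM (
    SELECT it.*, t.popularity
    FROM isbn_tags AS it
    LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
    LEFT OUTER JOIN tags AS t ON t.tag = it.tag
    WHERE it.tag IN (SELECT tag FROM tags ORDER BY popularity DESC LIMIT 1 OFFSET {offset})
    ORDER BY post_count DESC
    LIMIT 20
) AS t{n}"""

QUERY_TEMPLATE = """
SELECT *
     , ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude))
     , ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)), 1)
FROM (
    SELECT *, ARRAY[{exclude}] AS to_exclude
    FROM (
        SELECT tag, ARRAY_AGG(DISTINCT isbn) AS isbns, COUNT(DISTINCT isbn)
        FROM ({union}) AS tt
        GROUP BY tag
        ORDER BY MAX(popularity) DESC
    ) AS ttt
) AS tttt;"""


def build_query(tag_count=79, to_exclude=("isbn1", "isbn2", "isbn3")):
    """Return the big UNION ALL query over the top `tag_count` tags."""
    union = "\nUNION ALL\n".join(
        TAG_BLOCK.format(offset=i, n=i + 1) for i in range(tag_count))
    exclude = ", ".join("'%s'" % isbn for isbn in to_exclude)
    return QUERY_TEMPLATE.format(exclude=exclude, union=union)


print(build_query())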

With the data in hand, I moved on to the web part.

Creating a web application


Nginx vs. Apache

Since I am not a web developer, and certainly not an expert in building web interfaces, I decided to make a very simple single-page application based on a standard Bootstrap theme.

I added a “search by tag” option and then listed the most popular tags so that each one is clickable.

I visualized the search results with a bar chart. I tried Highcharts and D3, but they are better suited to dashboards: there were some issues with responsiveness, and they turned out to be quite tricky to configure. So I built my own responsive SVG-based chart, which has to be redrawn whenever the screen orientation changes:

var w = $('#plot').width();
var bars = "";
var imgs = "";
var texts = "";
var rx = 10;
var tx = 25;
var max = Math.floor(w / 60);
var maxPop = 0;

// First pass: find the highest popularity value to scale the bars.
for (var i = 0; i < max; i++) {
    if (i > books.length - 1) { break; }
    obj = books[i];
    if (maxPop < Number(obj.pop)) {
        maxPop = Number(obj.pop);
    }
}

// Second pass: build the bars, cover images and labels.
for (var i = 0; i < max; i++) {
    if (i > books.length - 1) { break; }
    obj = books[i];
    h = Math.floor((180 / maxPop) * obj.pop);
    dt = 0;
    if (('' + obj.pop + '').length == 1) { dt = 5; }
    if (('' + obj.pop + '').length == 3) { dt = -3; }

    var scrollTo = 'onclick="scrollTo(\'' + obj.id + '\'); return false;"';

    bars += '<rect id="rect' + obj.id + '" class="cla" x="' + rx + '" y="' + (180 - h + 30) + '" width="50" height="' + h + '" ' + scrollTo + '>';
    bars += '<title>' + obj.name + '</title>';
    bars += '</rect>';

    imgs += '<image height="70" x="' + rx + '" y="220" href="img/ol/jpeg/' + obj.id + '.jpeg" onmouseout="unhoverbar(' + obj.id + ');" onmouseover="hoverbar(' + obj.id + ');" width="50" ' + scrollTo + '>';
    imgs += '<title>' + obj.name + '</title>';
    imgs += '</image>';

    texts += '<text x="' + (tx + dt) + '" y="' + (180 - h + 20) + '" class="bar-label" style="font-size: 16px;" ' + scrollTo + '>' + obj.pop + '</text>';

    rx += 60;
    tx += 60;
}

$('#plot').html(
    '<svg width="100%" height="300" aria-labelledby="title desc" role="img">' +
    '  <defs>' +
    '    <style type="text/css"><![CDATA[' +
    '      .cla {' +
    '        fill: #337ab7;' +
    '      }' +
    '      .cla:hover {' +
    '        fill: #5bc0de;' +
    '      }' +
    '    ]]></style>' +
    '  </defs>' +
    '  <g class="bar">' + bars + '</g>' +
    '  <g class="bar-images">' + imgs + '</g>' +
    '  <g class="bar-text">' + texts + '</g>' +
    '</svg>');

Web server crash

As soon as I launched dev-books.com, a huge crowd of people descended on the site. Apache could not serve more than 500 concurrent visitors, so I quickly set up Nginx and switched to it on the fly. I was very surprised when the number of real-time visitors immediately rose to 800.

Conclusion

I hope I have explained everything clearly enough. If you have questions, feel free to ask. You can find me on Twitter and on Facebook.

As promised, at the end of March I will publish my full report based on Amazon.com and Google Analytics data. The results so far are quite unexpected.

Source: https://habr.com/ru/post/323522/

