
Social graph

Today I would like to tell you about a small experiment of mine on the Habr audience. The goal of the experiment was to build a social graph of the Habr community.



Description

The objectives pursued in the experiment:


Parsing and filling the graph

To obtain information about users and their friends, I wrote the parser.py script:
    # -*- coding:utf-8 -*-
    # parser.py
    # Written for Python 2 and BeautifulSoup 3 (urllib2, xrange, print statement).
    from BeautifulSoup import BeautifulSoup
    from urllib2 import urlopen, URLError
    from draw import Drawer


    class Parser(object):
        def __init__(self, address='http://habrahabr.ru/people/page', begin=1, end=3098):
            self.drawer = Drawer()
            self.queue_user = []
            self.__begin = begin
            self.__end = end
            self.__address = address

        def parse(self):
            for i in xrange(self.__begin, self.__end):
                try:
                    # Load one page of the user list and collect its user cells.
                    doc = BeautifulSoup(urlopen(self.__address + str(i)))
                    page = doc.findAll('td', attrs={'class': 'user'})
                    for user in page:
                        name = user.dl.dt.a.string
                        # Open the user's profile and collect the friend links.
                        print 'Parsing for user: %s' % name
                        doc = BeautifulSoup(urlopen(user.dl.dt.a['href']))
                        friends = doc.findAll('a', attrs={'rel': 'friend'})
                        if friends:
                            # Add an edge from the user to every friend.
                            for friend in friends:
                                self.drawer.graph.add_nodes_from((name, friend.string))
                                self.drawer.graph.add_edge(name, friend.string)
                                print 'Add edge (%s, %s)' % (name, friend.string)
                        else:
                            # A user without friends becomes an isolated node.
                            self.drawer.graph.add_node(name)
                except URLError:
                    # Decrementing i has no effect inside a for loop,
                    # so a page that fails to load is reported and skipped.
                    print 'Failed to load page %d, skipping it' % i
            # size() returns the number of edges; number_of_nodes() counts nodes.
            print 'Nodes: %s' % self.drawer.graph.number_of_nodes()
            self.drawer.draw()


    if __name__ == '__main__':
        parse = Parser(end=8)
        parse.parse()


I used BeautifulSoup to parse the pages. The Drawer class is responsible for storing and drawing the graph.
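
A failed request is easy to lose in a `for` loop, because changing the loop variable does not repeat the iteration. Below is a minimal sketch of retrying a download, written for Python 3 rather than the Python 2 of the script above. The helper name `fetch_with_retry` and its `attempts`, `delay` and `opener` parameters are assumptions made for this illustration, not part of the original parser.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def fetch_with_retry(url, attempts=3, delay=1.0, opener=urlopen):
    """Return the response body for url, retrying when URLError is raised.

    attempts, delay and opener are hypothetical parameters of this sketch.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except URLError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)  # wait a little before trying again
```

The `opener` argument only exists so that the function can be exercised without network access; in the parser it would stay at its default, `urlopen`.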

Graph drawing and image saving

As mentioned above, the Drawer class from the draw.py module is responsible for storing and drawing the graph:

    # -*- coding:utf-8 -*-
    # draw.py
    import networkx as nx
    import matplotlib.pyplot as plt


    class Drawer(object):
        def __init__(self, file_name='graph.png'):
            self.graph = nx.Graph()
            self.file_name = file_name

        def draw(self):
            '''Draw the graph and save the image to a file.'''
            # The keyword is node_color; the misspelled nodecolor is not a drawing option.
            nx.draw(self.graph, pos=nx.spring_layout(self.graph),
                    node_size=3500, node_color='r', edge_color='b', node_shape='o')
            # Set the size of the resulting figure in inches.
            plt.gcf().set_size_inches(100, 100)
            plt.savefig(self.file_name)


This class keeps an instance of the Graph class from the NetworkX module in its graph attribute; that instance holds our social graph. The Graph class provides a large number of methods for working with a graph, and the details can be found in the NetworkX documentation. Pay attention to the set_size_inches call, which sets the size of the resulting figure. Its arguments can be changed depending on the number of vertices in the graph.
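
One possible way to pick the figure size from the number of vertices is sketched below. The function `figure_side_inches`, the square-root scaling and all default values are assumptions for illustration; the original script simply hard-codes 100 by 100 inches.

```python
import math


def figure_side_inches(node_count, inches_per_node=3.0, min_side=8.0, max_side=100.0):
    """Return the side of a square figure, in inches, for node_count vertices.

    The side grows with the square root of the vertex count, so the area
    available per vertex stays roughly constant. All parameters are
    hypothetical tuning knobs, not values taken from the article.
    """
    side = inches_per_node * math.sqrt(max(node_count, 1))
    # Clamp the result so tiny graphs stay readable and huge ones stay bounded.
    return min(max(side, min_side), max_side)
```

The result could then be passed to the figure as `plt.gcf().set_size_inches(side, side)` in place of the fixed values.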

Results

The result turned out somewhat different from what I had planned. While the script was running, I discovered that it consumes a lot of resources. Habr has roughly 60,000 users. Even if we discard users who have no friends (as I actually did), the number is still significant. I tested the program on a machine with 3 GB of RAM. As soon as the graph started to draw, the system began swapping mercilessly, so I had to reduce the number of users in the graph. As a result, I got several versions of the rendered graph with different numbers of users.
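
Reducing the graph before drawing can be sketched as follows, in Python 3 with a current NetworkX release. The function `prune_graph` and its `max_users` parameter are hypothetical; keeping the best-connected users is only one possible way to cut the graph down and is not necessarily how the versions in this article were produced.

```python
import networkx as nx


def prune_graph(graph, max_users=None):
    """Return a copy of graph without friendless users, optionally capped in size.

    max_users is a hypothetical parameter: when set, only the max_users
    vertices with the highest degree are kept.
    """
    pruned = graph.copy()
    # Drop users that have no friends (vertices of degree zero).
    pruned.remove_nodes_from(list(nx.isolates(pruned)))
    if max_users is not None and pruned.number_of_nodes() > max_users:
        by_degree = sorted(pruned.degree(), key=lambda pair: pair[1], reverse=True)
        keep = [name for name, _ in by_degree[:max_users]]
        # Cutting may leave some kept users without edges; they are not removed again.
        pruned = pruned.subgraph(keep).copy()
    return pruned
```

The input graph is left untouched, so the full data can still be saved or analysed without drawing it.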
The figure shows a graph containing 852 users:


As you can see, the image had to be heavily compressed for the article, so the rest are given as links because of their large size (7-14 MB):


Prospects



UPD: archive with images

I apologize for the problems with downloading the images: the server was flooded with visitors.

UPD2: images on zoom.it

Thanks to mstyura for advice and assistance.
4095 users
7071 users

Source: https://habr.com/ru/post/126417/

