
WIKIzualize it, WIKIzualize it!

Good evening, dear friends!

Recently, while wandering the vast expanses of the Internet, I came across the amazing work of Chris Harrison. After sitting in mild shock for a while, I wondered: "Is it hard to visualize Wikipedia or not?" And I decided to try it!

image
So let's get started!


Instruments and tools


First of all, we need to decide what to visualize and with what tools. After a little research into what is available, I settled on the following: the Wiki API, the Graphviz package, and the PyGraphviz module.


The developers of the Wiki API have documented it very well, so there is no point in repeating the description here to explain how to call one method or another.

Graphviz has already been written about on Habr, so I will not describe it again either. I did, however, read the description of the dot language and the Graphviz tools right away, and I nearly started writing my own .dot files by hand.

I was rescued by a Python module called PyGraphviz, which lets you work conveniently with the structure of a graph and then write it out to a .dot file.

The basics


So, we will visualize the cross-references between Wikipedia articles. To do this, we need to call the API method at this URL:
ru.wikipedia.org/w/api.php?action=query&format=xml&titles=_&prop=links
Here action is the type of method, format is the format of the response (XML in our case), and prop=links asks for the cross-reference links of the page given in titles.
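To make the parameters concrete, here is a minimal Python 3 sketch that assembles the same query string. The `https://` scheme, the `WIKI_API_URL` constant, and the `build_links_url` helper are assumptions made for this illustration; they are not part of the original script.

```python
from urllib.parse import urlencode

# Endpoint from the article; the https:// scheme is an assumption.
WIKI_API_URL = "https://ru.wikipedia.org/w/api.php"


def build_links_url(title, base_url=WIKI_API_URL):
    """Return the query URL that asks for the links of one article."""
    params = {
        "action": "query",  # type of method
        "format": "xml",    # format of the response
        "titles": title,    # article whose links we want
        "prop": "links",    # ask for cross-reference links
    }
    # urlencode also escapes spaces and non-ASCII characters in the title.
    return base_url + "?" + urlencode(params)


print(build_links_url("Habrahabr"))
```

Building the query from a dictionary keeps the escaping of article titles in one place, which matters for titles that contain spaces or Cyrillic characters.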

The request returns a response like this:
    <api>
      <query>
        <normalized>
          <n from="Habrahabr" to="Habrahabr" />
        </normalized>
        <pages>
          <page pageid="340809" ns="0" title="Habrahabr">
            <links>
              <pl ns="0" title="2006" />
              <pl ns="0" title="2006" />
              <pl ns="0" title="2007" />
              <pl ns="0" title="Digg.com" />
              <pl ns="0" title="Linux.org.ru" />
              <pl ns="0" title="News 2.0" />
              <pl ns="0" title="Newsland" />
              <pl ns="0" title="Pligg" />
              <pl ns="0" title="Slashdot" />
              <pl ns="0" title="URL" />
            </links>
          </page>
        </pages>
      </query>
      <query-continue>
        <links plcontinue="340809|0|Blog" />
      </query-continue>
    </api>


This response can be processed with any DOM or SAX parser.
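For comparison with the SAX approach used in the article, here is a minimal DOM sketch in Python 3 based on the standard `xml.dom.minidom` module. The `links_from_xml` helper and the shortened `SAMPLE` document are made up for this illustration.

```python
from xml.dom import minidom

# A shortened version of the response shown above (illustration only).
SAMPLE = """<api><query><pages>
<page pageid="340809" ns="0" title="Habrahabr"><links>
<pl ns="0" title="Digg.com" />
<pl ns="0" title="Slashdot" />
</links></page></pages></query></api>"""


def links_from_xml(xml_text):
    """Collect the title attribute of every <pl> element."""
    doc = minidom.parseString(xml_text)
    return [pl.getAttribute("title") for pl in doc.getElementsByTagName("pl")]


print(links_from_xml(SAMPLE))
```

The DOM variant loads the whole document into memory, which is fine for responses of this size; SAX avoids that by handling elements as they stream in.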

Programming


For processing I used SAX and derived my class from xml.sax.handler.ContentHandler:
class LinksListHandler(xml.sax.handler.ContentHandler):
Then the main handler callbacks are overridden.
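The overridden methods themselves are not reproduced in this text. As a minimal Python 3 sketch, assuming the handler only needs to collect the `title` attribute of each `pl` element into a `links` list (the attribute name `links` matches the later `lh.links` call; everything else, including the `parse_links` helper, is an assumption), it could look like this:

```python
import xml.sax
import xml.sax.handler


class LinksListHandler(xml.sax.handler.ContentHandler):
    """Collects link titles from <pl> elements (assumed behavior)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def startElement(self, name, attrs):
        # Each cross-reference arrives as <pl ns="0" title="..." />.
        if name == "pl":
            title = attrs.get("title")
            if title is not None:
                self.links.append(title)


def parse_links(xml_bytes):
    """Hypothetical helper: run the handler over an XML byte string."""
    handler = LinksListHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.links


SAMPLE = (b'<api><query><pages><page pageid="340809" ns="0" title="Habrahabr">'
          b'<links><pl ns="0" title="Pligg" /><pl ns="0" title="URL" /></links>'
          b'</page></pages></query></api>')
print(parse_links(SAMPLE))
```

Only `startElement` is needed here because all the useful data sits in attributes; a handler that read element text would also override `characters` and `endElement`.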


The request itself is handled as follows:

    # Python 2 code (urllib2, print statement), as in the original script.
    # wiki_url() and verbose_message() are helpers defined elsewhere in the script.
    import sys
    import urllib
    import urllib2
    import xml.sax

    def get_links(page):
        # See the wiki API documentation: http://en.wikipedia.org/w/api.php
        query_val = {'action': 'query',
                     'prop': 'links',
                     'titles': page,
                     'format': 'xml'}
        url = wiki_url() + '?' + urllib.urlencode(query_val)
        request = urllib2.Request(url)
        verbose_message("Wiki url:" + url)
        try:
            response = urllib2.urlopen(request)
        except urllib2.HTTPError:
            print "HTTP request error!"
            sys.exit(1)
        # verbose_message("Response xml:\n" + response.read())
        # Feed the response to the SAX parser; the handler collects the link titles.
        lh = LinksListHandler()
        saxparser = xml.sax.make_parser()
        saxparser.setContentHandler(lh)
        saxparser.parse(response)
        return lh.links


Graphing


With the PyGraphviz module, building the graph is quite simple:

    # MIN_FONT and verbose_message() are defined elsewhere in the script.
    from pygraphviz import AGraph

    def make_wiki_graph(wiki_page, depth):
        gv = AGraph()
        page_list = [wiki_page]  # pages on the current level
        temp_list = []           # pages collected for the next level
        verbose_message('Create graph for' + wiki_page)
        gv.add_node(wiki_page)
        for i in range(depth):
            print '>>>> Get' + str(i) + 'level'
            for page in page_list:
                links = get_links(page)
                node = gv.get_node(page)
                # Pages closer to the starting article get a larger font.
                node.attr['fontsize'] = "%i" % (MIN_FONT * 2 * (depth - i))
                for link in links:
                    verbose_message(page + "=>" + link)
                    gv.add_edge(page, link)
                    temp_list.append(link)
            page_list = temp_list
            temp_list = []
        return gv
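To see what this level-by-level traversal produces without a network connection or PyGraphviz, here is a self-contained Python 3 sketch that emits dot text directly. The `FAKE_LINKS` table, the `build_dot` function, and the value `MIN_FONT = 10` are all assumptions for illustration; the real script takes its links from `get_links` and lets PyGraphviz write the .dot file.

```python
# Hypothetical link table standing in for get_links().
FAKE_LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
}
MIN_FONT = 10  # assumed value; the article does not show this constant


def build_dot(start, depth, get_links):
    """Walk the links level by level and return the graph as dot text."""
    lines = ["graph wiki {"]
    frontier = [start]
    for level in range(depth):
        # Same formula as in the listing above: the font shrinks with depth.
        fontsize = MIN_FONT * 2 * (depth - level)
        next_frontier = []
        for page in frontier:
            lines.append('    "%s" [fontsize=%d];' % (page, fontsize))
            for link in get_links(page):
                lines.append('    "%s" -- "%s";' % (page, link))
                next_frontier.append(link)
        frontier = next_frontier
    lines.append("}")
    return "\n".join(lines)


dot_text = build_dot("A", 2, lambda page: FAKE_LINKS.get(page, []))
print(dot_text)
```

The sketch does not escape quotation marks inside titles and does not deduplicate pages reached by several paths, so it is only meant to show the shape of the traversal and of the resulting .dot file.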

Results


Article "Mathematics" with 4 levels of nesting
image
Article "Habrahabr" with 5 levels of nesting
image

Others:
Socrates
Habrahabr 3 levels

Other examples

Script itself

Source: https://habr.com/ru/post/56209/

