
Visualization of computer usage statistics with R



I think many people are curious (at least idly) about exactly how they use their computer: which keys they press most often, how far the mouse travels, how long they work on average, and so on. In this article I will describe one way to collect such information and then present it in the form of interactive graphs. Everything described was done on a laptop running Debian Wheezy, Python 2.7.3 and R 2.15.


Data collection


It all started when, this fall, I decided I wanted the most complete statistics possible about my computer usage. No sooner said than done: I wrote a simple keylogger that records every key press and release on the keyboard, every mouse button event, and all mouse movements. I also had the idea of taking a webcam photo at regular intervals, which was implemented as well.
Initially, all data was written simply to a text file, but then I decided to do it "properly" and moved the recording of all events to an SQLite database. Although the database is several times larger than gzip-compressed text files (it currently takes up a little over 0.5 GB), it is still relatively small, and querying the database is much more convenient. Photos from the camera are still stored separately; their total size is now 2.2 GB (about 30,000 files).

Presentation of information



So, the data recording side seems settled; what remains is to present it all as convenient graphs and tables. Since I had been meaning to learn the R language for quite a while but never had a reason to start, I decided to use it here. On the whole, coming from other languages, getting started with R was fairly easy, although it does have some unusual traits: for example, most standard operators and functions are vectorized, and the naming convention separates words with a dot (instead of underscores or camel case as in other languages). At first I simply learned the language itself and the data-processing and plotting tools included in the standard distribution (which are quite rich, I must say); then I found the ggplot2 library for flexible plotting and plyr for split-apply-combine operations on data. It is also possible to create files with Markdown markup and embedded R calculations (using knitr), which is quite convenient. However, all of these produce static graphs and tables: to change their appearance or select a subset of the data, the code has to be edited every time. I wanted something more dynamic, with controls such as sliders and buttons.
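To give an idea of what the plyr + ggplot2 combination looks like, here is a minimal sketch; the data frame key.events and its columns are made up for illustration, not taken from my actual code:

    library(plyr)
    library(ggplot2)

    # count events per key (split-apply-combine with plyr);
    # ddply names the count column V1 by default
    key.counts <- ddply(key.events, ~key.name, nrow)

    # bar chart of the 20 most pressed keys
    top.keys <- head(key.counts[order(-key.counts$V1), ], 20)
    ggplot(top.keys, aes(x = reorder(key.name, V1), y = V1)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        labs(x = "Key", y = "Presses")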
As it turns out, a convenient way to achieve this in R appeared recently, requiring very little additional code. I stumbled upon this tool more or less by accident and have not regretted it. See for yourself: Shiny, "Easy web applications in R". The simplest example is right on that page, and it really is easy to write. I should say that Shiny is a fairly new product: development (judging by the repository) began in June of last year (2012) and is proceeding actively. I have not encountered any bugs in it, so I consider the project stable. By the way, on the same site you can find RStudio, a convenient IDE for developing in R.
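For reference, a minimal Shiny application consists of two files, ui.R and server.R. A sketch along the lines of the example on the Shiny page might look like this (the data frame daily.stats and its columns are hypothetical, just for illustration):

    # ui.R
    library(shiny)
    shinyUI(pageWithSidebar(
        headerPanel("Usage statistics"),
        sidebarPanel(
            sliderInput("days", "Days to show:", min = 1, max = 90, value = 30)
        ),
        mainPanel(plotOutput("activity"))
    ))

    # server.R
    library(shiny)
    shinyServer(function(input, output) {
        output$activity <- renderPlot({
            # daily.stats is a hypothetical data frame with columns day, events
            recent <- tail(daily.stats, input$days)
            plot(recent$day, recent$events, type = "l",
                 xlab = "Day", ylab = "Events")
        })
    })

The application is then started with shiny::runApp() from the folder containing these two files.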
So I began displaying my statistics with Shiny. Here are a few screenshots of the current result:



You can also view these pages (albeit in static form, i.e. the controls do not work) at shiny-sample.aplavin.ru .

As you can see, the possibilities are really rich, and the appearance of the page can be completely changed with custom HTML, CSS and JavaScript (I use the defaults everywhere). Conveniently, Shiny does not require installing any server: everything needed is contained in the R package itself.

Briefly about the implementation



All the code (both the keylogger and the visualization) is available on BitBucket. The webcam capture script is run by cron every minute and takes a snapshot with 50% probability (since the camera does not initialize instantly, 20 frames are captured and only the last one is saved). The keylogger itself is the executable file keylogger.py, launched from inittab (with the respawn option). The statistics folder contains the files keylogger.stats.R and keylogger.stats.Rmd: the first generates graphs simply as images, the second as an HTML page via knitr (both static, of course). Finally, the shiny_page folder contains the files of the actual page (ui.R, server.R) and the file compute.data.R, which precomputes all the necessary data (this currently takes 30 seconds to a minute, so it lives in a separate file to avoid recomputation every time the page is opened). For convenience, the same folder contains a Makefile that lets you start the application with the make run command.
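As a rough sketch of the deployment (the paths and runlevels here are assumptions, not copied from the repository), the cron and inittab entries could look like this:

    # /etc/crontab -- run the webcam capture script every minute
    * * * * * user /home/user/keylogger/capture.sh

    # /etc/inittab -- keep the keylogger running, restarting it if it dies
    kl:2345:respawn:/home/user/keylogger/keylogger.py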

Statistics Calculations



Initially, almost all calculations were done with SQLite queries, but then, comparing the performance of GROUP BY with the corresponding functions from the plyr package, I found that SQLite performs similar operations much more slowly, even with indices. The only (but important) problem is that to use these functions, the entire data set has to be loaded into memory. Running compute.data.R currently uses about 1 GB of memory, and before long the 4 GB in my laptop will not be enough. At that point I will probably have to go back to doing the calculations in the database: much slower, but at least it will work (suggestions on this are welcome, of course). For comparison, equivalent code in SQL and in R using plyr:

SQL:

    SELECT field, COUNT(*) FROM Table GROUP BY field

R with plyr:

    ddply(dataset, ~field, nrow)


More complex, multi-level groupings are also possible. Here is an example from my compute.data.R (I will not give an SQL analogue, but after the previous example this two-level grouping should be clear):

    mouse.coords.by.win <- ddply(coords, ~window, function(df) {
        res <- ddply(df,
                     .(x = as.integer(x / binsize), y = as.integer(y / binsize)),
                     .fun = nrow, .drop = F)
        res$V1 <- res$V1 / max(res$V1)
        res$cnt <- nrow(df)
        res
    })


For each window (the window column), this code computes the distribution of mouse coordinates over the screen, binned into squares with side binsize.
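The result is convenient to draw as a heat map. A possible sketch with ggplot2 (not the exact code from the repository):

    library(ggplot2)

    # one panel per window class; tile brightness = normalized visit count
    ggplot(mouse.coords.by.win, aes(x = x, y = y, fill = V1)) +
        geom_tile() +
        facet_wrap(~ window) +
        labs(x = "X bin", y = "Y bin", fill = "Density")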

Python Keylogger



To intercept X server events, I use the Python bindings for Xlib and write these events to the SQLite database. Note that since inittab scripts run as root, you need to set an environment variable before accessing Xlib: os.environ['XAUTHORITY'] = '/home/USER/.Xauthority' . Then we connect to the display and create a recording context for the events we need (key presses, mouse buttons, and mouse movements):

    from Xlib import display
    from Xlib.ext import record

    dpy = display.Display(':0')
    ctx = dpy.record_create_context(
        0,
        [record.AllClients],
        [{
            'core_requests': (0, 0),
            'core_replies': (0, 0),
            'ext_requests': (0, 0, 0, 0),
            'ext_replies': (0, 0, 0, 0),
            'delivered_events': (0, 0),
            # the event types we need: from KeyPress (2) to MotionNotify (6)
            'device_events': (2, 6),
            'errors': (0, 0),
            'client_started': False,
            'client_died': False,
        }])
    # blocks until record_disable_context is called from the callback
    dpy.record_enable_context(ctx, record_callback)


The rest of the code receives, processes, and records events inside the callback function. Because the data arriving from the X server has a complex structure, processing uses the following loop:

    from Xlib.protocol import rq

    def record_callback(reply):
        data = reply.data
        while len(data):
            # parse one event off the front of the binary buffer;
            # record_dpy is the display connection the context was created on
            event, data = rq.EventField(None).parse_binary_value(
                data, record_dpy.display, None, None)
            # ... process 'event' here


There is nothing complicated in the actual event processing: we take its type and additional data (the detail field) and record them. However, since we also want to record the window (or rather, its class) in which the event occurred, we have to obtain it. This is also done with Xlib functions:

    # the window that currently has input focus
    windowvar = dpy.get_input_focus().focus
    # returns an (instance, class) tuple, e.g. ('navigator', 'Firefox')
    wmclass = windowvar.get_wm_class()


The rest of the code obtains a human-readable key name from the key code, writes the event to the corresponding database table, and handles errors. One small subtlety concerns writing to the database: if the result were flushed to disk on every event, the writes would not keep up when events arrive in large numbers. Therefore I commit roughly once per hundred events:

    from random import randint

    # commit with probability 1/101, i.e. roughly once per hundred events
    if randint(0, 100) <= 0:
        dbconn.commit()


Of course, if the process terminates abnormally, a small number of recent events will be lost, but here that is not critical.

Database schema:

    CREATE TABLE KeyEvents(TimeStamp REAL, KeyName TEXT, EventType INTEGER, WindowClass TEXT);
    CREATE TABLE MouseBtnEvents(TimeStamp REAL, KeyName TEXT, EventType INTEGER, WindowClass TEXT);
    CREATE TABLE MouseMoves(TimeStamp REAL, MoveX INTEGER, MoveY INTEGER, WindowClass TEXT);
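On the R side, this database can be read with, for example, the RSQLite package. A minimal sketch (the file name is an assumption):

    library(RSQLite)

    # open the events database (file name is hypothetical)
    con <- dbConnect(SQLite(), dbname = "keylogger.db")

    # count key presses per window class, analogous to the plyr example above
    key.by.win <- dbGetQuery(con,
        "SELECT WindowClass, COUNT(*) AS cnt FROM KeyEvents GROUP BY WindowClass")

    dbDisconnect(con)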


Conclusion



In the end, I got to grips with R (a really convenient language for this kind of computation), figured out intercepting X server events, and, of course, obtained nice statistics about my computer usage. I have a few more ideas for graphs and tables to add to such a "report", but it would be interesting to hear your suggestions as well (if anyone has read this far).

P.S. Please recommend a good book or online course on statistics and/or graphical visualization of data.

UPD1: significantly updated the information on the implementation of the main parts of the system.

Source: https://habr.com/ru/post/165337/

