Data mining in 40 lines, or who you side with and against

We find ~~like-minded people and opponents~~ friends and enemies among the users of a Drupal site, using votingapi data.

Selecting the data

    SELECT v1.uid uid1, v2.uid uid2, u1.name name1, u2.name name2,
           v2.entity_id entity_id, v1.value value1, v2.value value2
    FROM votingapi_vote v1
    JOIN (votingapi_vote v2, users u1, users u2)
      ON (v1.uid != v2.uid
          AND v1.entity_id = v2.entity_id
          AND v1.entity_type = v2.entity_type
          AND v1.uid = u1.uid
          AND v2.uid = u2.uid)
    WHERE v1.uid < v2.uid AND v1.uid != 0 AND v2.uid != 0
    ORDER BY v1.uid, v2.uid;

Joining the votingapi_vote table with itself selects all permutations of user pairs, and the condition v1.uid < v2.uid turns the permutations into combinations.

The condition v1.entity_id = v2.entity_id AND v1.entity_type = v2.entity_type selects the votes that two users gave to the same topic or comment. For example, the first row of the sample below means that Administrator and Bob each gave 100 points to the same topic or the same comment.

The condition v1.uid != 0 AND v2.uid != 0 excludes votes cast by anonymous users.
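
To see what the three conditions do together, here is a minimal Python sketch of the same self-join on a handful of made-up votes. The sample rows and the pair_votes helper are hypothetical and only illustrate the logic; they are not part of votingapi.

```python
from itertools import product

# Hypothetical votes: (uid, entity_type, entity_id, value)
votes = [
    (1, "node", 10, 100),
    (2, "node", 10, 100),
    (3, "node", 10, 20),
    (0, "node", 10, 60),     # anonymous vote (uid 0), must be skipped
    (1, "comment", 7, 40),
    (2, "comment", 7, 80),
]

def pair_votes(votes, ordered_only):
    """Self-join the votes on (entity_type, entity_id), skipping uid 0."""
    rows = []
    for (u1, t1, e1, v1), (u2, t2, e2, v2) in product(votes, repeat=2):
        same_entity = (t1, e1) == (t2, e2)
        if not same_entity or u1 == 0 or u2 == 0 or u1 == u2:
            continue
        if ordered_only and not u1 < u2:
            continue  # keep each pair of users only once
        rows.append((u1, u2, e1, v1, v2))
    return sorted(rows)

permutations = pair_votes(votes, ordered_only=False)
combinations = pair_votes(votes, ordered_only=True)
print(len(permutations), len(combinations))
for row in combinations:
    print(row)
```

Without the uid1 < uid2 filter every pair shows up twice, once in each order, so the filter halves the amount of data passed on to the next step.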

As a result, we get a table with the following six columns (the entity_id column, which the query also returns, is omitted here):

    uid1  uid2  name1          name2  value1  value2
    1     2     Administrator  Bob    100     100
    1     2     Administrator  Bob    20      20
    1     2     Administrator  Bob    40      40
    1     2     Administrator  Bob    100     100
    1     2     Administrator  Bob    20      100
    1     2     Administrator  Bob    100     100
    1     2     Administrator  Bob    100     100
    1     2     Administrator  Bob    100     100
    1     2     Administrator  Bob    100     100
    1     2     Administrator  Bob    80      80
    1     2     Administrator  Bob    100     20
    1     2     Administrator  Bob    20      20
    1     2     Administrator  Bob    60      60
    1     2     Administrator  Bob    100     100
    1     2     Administrator  Bob    100     100

In the first column - the id of the first user, in this case the administrator (uid = 1)
in the second column - the id of the second user
in the third column - the name of the first user
in the fourth column - the name of the second user
in the fifth column - the vote of the first user
in the sixth column - the vote of the second user

Calculate the correlation of votes

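Of course, you could write the calculation in PHP, but then why was R invented?

Take the table generated in the previous step and write it to the file in.tsv. The script below reads it as ../in.tsv and writes its result files to the current directory, so run it from a subdirectory next to in.tsv: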

    #!/usr/bin/env Rscript

    d <- read.delim("../in.tsv")
    unique1 <- unique(c(d$uid1, d$uid2))

    for (id1 in unique1) {
      # start with a clean result file for this user
      if (file.exists(as.character(id1))) {
        file.remove(as.character(id1))
      }

      # all rows in which id1 takes part, on either side of the pair
      temp1 <- d[d$uid1 == id1 | d$uid2 == id1, ]
      unique2 <- unique(c(temp1$uid1, temp1$uid2))
      unique2 <- unique2[unique2 != id1]  # remove id1

      for (id2 in unique2) {
        if (id1 < id2) {
          result <- temp1[temp1$uid1 == id1 & temp1$uid2 == id2, ]
          name <- as.character(result$name2[1])
        } else {
          result <- temp1[temp1$uid1 == id2 & temp1$uid2 == id1, ]
          name <- as.character(result$name1[1])
        }

        n <- nrow(result)
        if (n > 7) {
          x <- result$value1
          y <- result$value2
          pvalue <- cor.test(x, y)$p.value
          if (is.finite(pvalue) && pvalue < 0.05) {
            correlation <- cor(x, y)
            cat(id2, name, n, correlation, pvalue, "\n",
                sep = "\t", file = as.character(id1), append = TRUE)
          }
        }
      }
    }

All the work of calculating the correlation is done by the cor(x, y) function. The cor.test(x, y) function computes the statistics of the correlation test, including its significance (p-value). By convention, anything with a p-value ≥ 0.05 is considered not significant enough, so we keep only results with a p-value < 0.05 and write them to a file whose name is the uid of the first user. Pairs with 7 or fewer common votes are skipped altogether.
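
For readers without R at hand, the following standard-library Python sketch redoes the calculation on the sample table. It is an illustration, not the author's code: it assumes the default Pearson test of cor.test, which derives the p-value from a t statistic with n - 2 degrees of freedom, and the helper names and the Simpson's rule integration are choices made for this sketch.

```python
import math

# Votes of Administrator (x) and Bob (y) from the sample table.
x = [100, 20, 40, 100, 20, 100, 100, 100, 100, 80, 100, 20, 60, 100, 100]
y = [100, 20, 40, 100, 100, 100, 100, 100, 100, 80, 20, 20, 60, 100, 100]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(r, n, steps=2000):
    """Two-sided p-value for the hypothesis of zero correlation (needs n >= 3)."""
    if abs(r) >= 1:
        return 0.0  # perfect correlation: the t statistic is infinite
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the area under the density between 0 and t
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)

r = pearson_r(x, y)
p = two_sided_p(r, len(x))
print(round(r, 7), round(p, 6))
```

Both numbers should come out close to the correlation and p-value that R reports for this pair.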

The juggling with id1, id2 and if-else is needed to pick up every pair a user belongs to, regardless of which side of the pair that user is on.
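
The same lookup can be sketched in Python as follows. The rows and the votes_by_partner helper are hypothetical; the only assumption taken from the query is that uid1 < uid2 holds in every row, so a given user may appear in either column.

```python
from collections import defaultdict

# Rows shaped like the query output: (uid1, uid2, value1, value2), uid1 < uid2.
rows = [
    (1, 2, 100, 100),
    (1, 2, 20, 100),
    (1, 3, 60, 40),
    (2, 3, 80, 80),
]

def votes_by_partner(rows, uid):
    """Return {partner_uid: [(own_vote, partner_vote), ...]} for one user."""
    result = defaultdict(list)
    for uid1, uid2, value1, value2 in rows:
        if uid1 == uid:
            result[uid2].append((value1, value2))
        elif uid2 == uid:
            # uid sits in the second column, so swap the votes
            result[uid1].append((value2, value1))
    return dict(result)

print(votes_by_partner(rows, 2))
```

The R script does not bother to swap the two vote columns, which is fine because the Pearson correlation of x and y equals that of y and x.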

From the table above you should get a file with the name “1” and the following contents:

 2 Bob 15 0.6039604 0.01710946

In the first column - the id of the second user
in the second column - the name of the second user (so that it can be shown on screen right away)
in the third column - the number of topics and comments on which both users voted
in the fourth column - the correlation
in the fifth column - the p-value

That is it for the data processing.

Show results

I decided to show the results in the user profile; here is the corresponding hook:

    /**
     * Hook into the user menu
     */
    function mymodule_menu() {
      $items['user/%user/likeminded'] = array(
        'access callback' => TRUE,
        'access arguments' => array(1),
        'page callback' => 'mymodule_likeminded', // function defined below
        'page arguments' => array(1),
        'title' => 'Likeminded',
        'weight' => 5,
        'type' => MENU_LOCAL_TASK,
      );
      return $items;
    }

Well, the longest part is the output of results.

    /**
     * Display likeminded users
     */
    function mymodule_likeminded($arg) {
      if (is_object($arg) && !$arg->uid) {
        return;
      }

      # this is my path to the results, your path may be different
      $path = drupal_get_path('module', 'mymodule') . '/pearsons/' . $arg->uid;

      # each line: id, name, number of common votes, correlation, p-value
      $lines = array();
      $min = NULL;
      $max = 0;
      if ($handle = @fopen($path, 'r')) {
        while ($line = fgets($handle)) {
          $line = explode("\t", $line);
          if ($line[2] >= $max) {
            $max = $line[2];
          }
          # $min starts as NULL so that the first line sets it
          # (starting from 0 would never record the real minimum)
          if ($min === NULL || $line[2] < $min) {
            $min = $line[2];
          }
          $lines[] = $line;
        }
        fclose($handle);
      }

      $output = '';

      // Likeminded
      $output .= '<h1>' . t('Likeminded') . '</h1>';
      $output .= '<div class="likeminded">';
      foreach ($lines as $line) {
        if ($line[3] > 0) {
          $size = mymodule_font_size($min, $max, $line[2]);
          $opacity = $line[3];
          $output .= "<span style=\"font-size:" . $size . "pt;opacity:" . $opacity . "\">";
          $output .= l($line[1], 'user/' . $line[0]);
          $output .= "</span>";
        }
      }
      $output .= '</div>';

      // Adversaries
      $output .= '<h1>' . t('Adversaries') . '</h1>';
      $output .= '<div class="adversaries">';
      foreach ($lines as $line) {
        if ($line[3] < 0) {
          $size = mymodule_font_size($min, $max, $line[2]);
          $opacity = abs($line[3]);
          $output .= "<span style=\"font-size:" . $size . "pt;opacity:" . $opacity . "\">";
          $output .= l($line[1], 'user/' . $line[0]);
          $output .= "</span>";
        }
      }
      $output .= '</div>';

      return $output;
    }

    /**
     * calculate the font size in proportion to the maximum and minimum of common votes
     */
    function mymodule_font_size($min_count, $max_count, $cur_count, $min_font_size = 11, $max_font_size = 36) {
      if ($min_count == $max_count) { # avoid division by zero
        return $min_font_size;
      }
      return ($max_font_size - $min_font_size) / ($max_count - $min_count)
        * ($cur_count - $min_count) + $min_font_size;
    }

It's simple. The larger the font, the more topics and comments both users voted on. The more opaque the text, the stronger the correlation. If the correlation is positive, the user is listed among the like-minded; otherwise, among the adversaries.
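
The mapping from a result line to its on-screen style can be sketched in Python as follows. The font_size function mirrors mymodule_font_size with its defaults of 11 and 36 points; display_style is a hypothetical helper added only for this illustration.

```python
def font_size(min_count, max_count, cur_count, min_font=11, max_font=36):
    """Linearly map cur_count from [min_count, max_count] to [min_font, max_font]."""
    if min_count == max_count:  # avoid division by zero
        return min_font
    scale = (max_font - min_font) / (max_count - min_count)
    return scale * (cur_count - min_count) + min_font

def display_style(count, correlation, min_count, max_count):
    """Return (group, font size in pt, opacity) for one result line."""
    group = "likeminded" if correlation > 0 else "adversaries"
    return group, font_size(min_count, max_count, count), abs(correlation)

# A user with 33 common votes, when the file ranges from 8 to 58 common votes
print(display_style(33, -0.6, 8, 58))
```

A pair halfway between the smallest and the largest number of common votes gets a font size halfway between the two limits, and the sign of the correlation only decides which list the user lands in.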

On real data with one hundred thousand users, a million posts and comments, and several million votes, the SQL query ran in a minute, and the R code took 30 minutes.

Why not make this a Drupal module, you ask? Who needs a module that calls R? And rewriting it in PHP would be ugly.

The end result in the profile of one of the users

Source: https://habr.com/ru/post/248607/
