"Related News" with PHP, phpmorphy and MySQL

I want to share a method for finding "similar" records; I think it will be useful for blogs and news sites.
The purpose of this post is to show the principle. The implementation may not be quite comme il faut, since the author is not a professional programmer but an amateur.

So, the task

News is stored in a MySQL table (called news in the query further down).

For each news item displayed on a page, we need to find the most similar items in the same table.
Only the contents of the title, lead and body fields matter here. For simplicity, we assume that everything is built from scratch and ignore the need to process existing records.

Tags field

We add a tags field (really pseudo-tags: they are never shown anywhere on the site, and the field exists only for comparing texts). Set the field type to VARCHAR(512) and add a FULLTEXT index on it: FULLTEXT (tags).
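As a rough sketch, the schema change could look like the statement built below. The table name news is taken from the query later in the post; the helper name addTagsColumnSql, the index name ft_tags and the NOT NULL DEFAULT '' clause are assumptions for illustration. Keep in mind that FULLTEXT indexes need a MyISAM table, or InnoDB on MySQL 5.6 and later.

```php
<?php
// Hypothetical helper: returns the DDL that adds the pseudo-tag column
// and its FULLTEXT index. Table and index names are assumptions.
function addTagsColumnSql(string $table = 'news'): string
{
    // Table names cannot be bound as parameters, so allow only plain identifiers.
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $table)) {
        throw new InvalidArgumentException("Invalid table name: $table");
    }
    return "ALTER TABLE `$table` "
         . "ADD COLUMN tags VARCHAR(512) NOT NULL DEFAULT '', "
         . "ADD FULLTEXT INDEX ft_tags (tags)";
}

echo addTagsColumnSql(), PHP_EOL;
```

The resulting string can be run once through any MySQL client or a PDO connection.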

Pseudo-tag generation

Generate the pseudo-tags from the title, lead and body fields just before writing the news item to the database (right before the INSERT statement). To do this, download phpMorphy and its dictionaries.
To exclude unimportant words (stop words), create an array $stopwords. We will load the stop words from a text file, one word per line (for example, saved as stopwords.txt).

$stopwords = array_map('trim', file('stopwords.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
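For long stop-word lists it may be worth turning the list into a lookup set, so that each check is an isset() call rather than a linear scan. The sketch below is only one way to do it; the function name loadStopwords is hypothetical, and the file is assumed to hold one word per line in the same encoding as the processed text.

```php
<?php
// Hypothetical helper: reads one stop word per line and returns a lookup set
// (word => true). Assumes the file uses the same encoding as the news text.
function loadStopwords(string $path): array
{
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read stop-word file: $path");
    }
    $set = [];
    foreach ($lines as $line) {
        $word = trim($line);      // drop stray whitespace such as a trailing "\r"
        if ($word !== '') {
            $set[$word] = true;   // array keys allow fast isset() lookups
        }
    }
    return $set;
}

// Small demonstration with a temporary file that mixes line endings.
$path = tempnam(sys_get_temp_dir(), 'stop');
file_put_contents($path, "и\r\nв\r\n\r\nна\n");
$stopSet = loadStopwords($path);
unlink($path);
var_dump(isset($stopSet['на']));
```

With such a set, a check like in_array($word, $stopwords) becomes isset($stopSet[$word]).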

Next, we load phpMorphy and its dictionaries, combine the title, lead and body, and run every word through phpMorphy.


require_once 'morphy/src/common.php';

// phpMorphy looks words up in upper case.
// Assumption: UTF-8 input and UTF-8 dictionaries (the source hints at windows-1251).
function cyrUpper($str) { return mb_strtoupper($str, 'UTF-8'); }
function cyrLower($str) { return mb_strtolower($str, 'UTF-8'); }

function cleanUP($new_string) {
    // Turn hyphens, line breaks and full stops into spaces
    $new_string = str_replace(array("-", "\r\n", "\r", "\n", "."), " ", $new_string);
    // Keep only letters, digits and spaces (assumed: the original character class is garbled in the source)
    return preg_replace('/[^\p{L}\p{N} ]/u', '', $new_string);
}

$text  = cleanUP($_REQUEST['title'] . " " . $_REQUEST['lead'] . " " . $_REQUEST['body']);
$aText = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

$aMorph = array();
foreach ($aText as $word) {
    $aMorph[] = cyrUpper($word);
}

// Set some options
$opts = array(
    'storage' => PHPMORPHY_STORAGE_FILE,
    // Extend graminfo for getAllFormsWithGramInfo method call
    'with_gramtab' => false,
    // Enable prediction by suffix
    'predict_by_suffix' => true,
    // Enable prediction by prefix
    'predict_by_db' => true
);

$dir = 'morphy/dicts';

// Create descriptor for the Russian dictionary located in $dir
$dict_bundle = new phpMorphy_FilesBundle($dir, 'rus');

// Create phpMorphy instance
try {
    $morphy = new phpMorphy($dict_bundle, $opts);
} catch (phpMorphy_Exception $e) {
    throw new Exception('Error occurred while creating stemmer instance: ' . $e->getMessage());
}

$getroot = false; // true: use pseudo-roots instead of base forms
try {
    if ($getroot) {
        $pseudo_root = $morphy->getPseudoRoot($aMorph);
    } else {
        $pseudo_root = $morphy->getBaseForm($aMorph);
    }
} catch (phpMorphy_Exception $e) {
    throw new Exception('Error occurred while text processing: ' . $e->getMessage());
}

$tags = '';
foreach ($pseudo_root as $roots) {
    // Skip unresolved words and words with more than one candidate form
    if (!is_array($roots) || count($roots) != 1) {
        continue;
    }
    $slovo = cyrLower($roots[0]);
    if (mb_strlen($slovo, 'UTF-8') > 3 && !in_array($slovo, $stopwords)) {
        $tags .= $slovo . " ";
    }
}

The resulting list of tags in the $tags variable is written to the corresponding field of the table (tags). As a result, each news item has in this field a list of words that we will use for comparison.
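The filtering step can also be separated from phpMorphy, which makes it easy to try out on hand-written data. The sketch below assumes that getBaseForm() returns an array mapping each input word to an array of base forms, or to false for a word it cannot resolve; the function name buildPseudoTags is hypothetical. Unlike the code above, it also drops duplicate lemmas, and it requires the mbstring extension.

```php
<?php
// Hypothetical helper: turns phpMorphy-style output into a space-separated tag string.
// $baseForms is assumed to map each input word to an array of base forms,
// or to false when the word could not be resolved.
function buildPseudoTags(array $baseForms, array $stopwords, int $minLength = 4): string
{
    $stop = array_flip($stopwords);
    $tags = [];
    foreach ($baseForms as $forms) {
        // Skip unresolved words and ambiguous ones (more than one base form).
        if (!is_array($forms) || count($forms) !== 1) {
            continue;
        }
        $word = mb_strtolower((string) reset($forms), 'UTF-8');
        if (mb_strlen($word, 'UTF-8') < $minLength || isset($stop[$word])) {
            continue;
        }
        $tags[$word] = true;   // array keys drop duplicate lemmas
    }
    return implode(' ', array_keys($tags));
}

// Hand-written sample data in the assumed shape.
$forms = [
    'ПАМЯТИ' => ['ПАМЯТЬ'],
    'ПАМЯТЬ' => ['ПАМЯТЬ'],            // same lemma, kept once
    'СТАЛИ'  => ['СТАЛЬ', 'СТАТЬ'],    // ambiguous, skipped
    'ДЛЯ'    => ['ДЛЯ'],               // shorter than four characters, skipped
    'ТОЛЬКО' => ['ТОЛЬКО'],            // stop word in this example, skipped
    'XYZZY'  => false,                 // unresolved, skipped
];
echo buildPseudoTags($forms, ['только']), PHP_EOL;   // prints "память"
```

Dropping duplicates keeps the tag string shorter, which helps it fit into the VARCHAR(512) field.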

Example

Source text

Samsung has begun manufacturing solid-state hard drives using V-NAND three-dimensional memory. The technology makes it possible to increase storage capacity, provides twice the data transfer speed, and increases device reliability up to 10 times. At the moment, SSD drives with 480 and 960 GB are being produced, only for corporate servers. As for home computers, no specific release date has been given.

Generated word list:

device increase only technology company solid state create speed reliability server production allow to increase ensure transfer memory volume start drive moment corporate specific computer touch use information hard disk high release three-dimensional

SQL query

Now the most interesting part: the following SQL query is used to find similar records:

  SELECT * FROM news WHERE MATCH (tags) AGAINST ('[pseudo-tags of the current news item]') > [relevance value]

Here the relevance value sets how "similar" the texts must be. Experiment with it (start with 1).
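In practice the current news item would match itself with the highest relevance, so it is usually excluded, and the results are sorted by relevance. The following is a minimal sketch of how such a query might be built for PDO. The id column, the LIMIT, the ordering and the function name similarNewsQuery are assumptions for illustration and are not part of the original query.

```php
<?php
// Hypothetical helper: builds the similarity query and its bound parameters.
// The id column, the LIMIT and the ordering are assumptions for illustration.
function similarNewsQuery(int $currentId, string $tags, float $threshold = 1.0, int $limit = 5): array
{
    $sql = 'SELECT *, MATCH (tags) AGAINST (:tags_score) AS relevance '
         . 'FROM news '
         . 'WHERE id <> :id AND MATCH (tags) AGAINST (:tags_filter) > :threshold '
         . 'ORDER BY relevance DESC '
         . 'LIMIT ' . max(1, $limit);
    $params = [
        ':tags_score'  => $tags,
        ':tags_filter' => $tags,   // bound twice: PDO may reject a reused named placeholder
        ':id'          => $currentId,
        ':threshold'   => $threshold,
    ];
    return [$sql, $params];
}

// Usage with an existing PDO connection $pdo (connection details omitted):
// [$sql, $params] = similarNewsQuery(42, $tags);
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
// $similar = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Binding the tag string as a parameter also avoids having to escape it by hand.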

Source: https://habr.com/ru/post/190034/
