
Parsing Yandex search results

Greetings to all readers!

I got into SEO recently and immediately faced the task of determining the positions of promoted sites in search engines for given keywords. The task is trivial and is handled easily by the tools every SEO specialist talks about: Semonitor, AllSubmitter, and so on. But besides being proprietary, as all such programs are, they paradoxically have a number of technical problems that make you want to throw the computer out the window.
I would not mind buying Semonitor, but after trying the demo version I gave up on the idea: the version I downloaded from the program's official site was buggy, demanded an update, and after updating refused to do any position analysis at all. And there is no way to fix it through settings, as you understand.
AllSubmitter is better in this respect: it even lets you customize the regular expressions for each search engine, which should make it resilient to changes in the results format. But not everything is rosy. On 18.08.2008 Yandex suddenly changed the format of its results page and, at the same time, the URLs of the links (possibly an experiment with click-through tracking; this has been written about in more detail elsewhere), and AllSubmitter was powerless. True, Yandex returned to the old format the next day, but the precedent had been set.

So, out of curiosity, I reinvented the wheel: I decided to write a position analyzer, and in PHP at that. I had no goal of reaching production; I just wanted to get a feel for how Semonitor and AllSubmitter work inside. I wrote classes for parsing Yandex, Google and Rambler, tested them, made sure everything worked, and then contentedly forgot about them: I had AllSubmitter, there was no need to build my own tool, and there were plenty of other tasks.

Problem statement


When someone wrote on Habr about a PHP class for working with Yandex.XML and I commented on it at length, the upvotes to my karma told me the topic was worth an article, especially since the opportunity appeared: I climbed out of negative karma. That post was about something slightly different (organizing site search via Yandex.XML), but the task of analyzing a site's positions for keywords and phrases overlaps with it. So my task is:
create an analyzer of a site's positions in search engine results (Yandex only, for now)

Solution


There is nothing difficult here.

First of all,


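Everything needed to get the search results is passed via GET parameters, namely text (the query), p (the page number, counted from 0 in the class shown later) and numdoc (the number of results per page).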

Query string pattern:
yandex.ru/yandsearch?text=[KEYWORD]&p=[PAGE_NUMBER]&numdoc=[RESULTS_ON_PAGE]
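
As an illustration, the pattern can be filled in with a few lines of PHP. This is only a sketch: the function name buildYandexUrl is hypothetical, the http:// prefix is taken from the class shown later in the post, and the URL format is the 2008 one described here, which may no longer be valid.

```php
<?php
// Hypothetical helper (not part of the article's classes): fills in the
// query-string pattern. The keyword must be URL-encoded.
function buildYandexUrl(string $keyword, int $page, int $perPage): string
{
    $pattern = 'http://yandex.ru/yandsearch?text=[KEYWORD]&p=[PAGE_NUMBER]&numdoc=[RESULTS_ON_PAGE]';
    // strtr() replaces all placeholders in one pass without re-scanning
    // the substituted values.
    return strtr($pattern, array(
        '[KEYWORD]'         => urlencode($keyword),
        '[PAGE_NUMBER]'     => (string) $page,
        '[RESULTS_ON_PAGE]' => (string) $perPage,
    ));
}

echo buildYandexUrl('wine production', 0, 50), "\n";
```

Using strtr() with an array instead of chained str_replace() calls avoids the case where a keyword that happens to contain a placeholder name gets substituted a second time.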

Secondly,


The Yandex results page is split into the components we need by running it through a regular expression of the form:
  #<li>.*<a[^>]*tabindex[^>]*onclick[^>]*=[^>]*"[^>]*"[^>]*href="([^<>"]+)"[^>]*>(.+)</a>.*</li>#Ui

Like this:
  preg_match_all(REGEX, HTML_PAGE, RESULTS_ARRAY, PREG_SET_ORDER);


The result is an array (with up to numdoc elements) of arrays, each with 3 elements: the HTML of one search result, the URL of the page found, and its title.
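
To make the shape of that array concrete, here is a small self-contained sketch. The HTML fragment is invented for illustration and only imitates the attributes the regular expression looks for (tabindex, onclick, href); it is not real Yandex markup.

```php
<?php
$regex = '#<li>.*<a[^>]*tabindex[^>]*onclick[^>]*=[^>]*"[^>]*"[^>]*href="([^<>"]+)"[^>]*>(.+)</a>.*</li>#Ui';

// Made-up sample page (an assumption, not actual Yandex output).
$html = '<ol>'
      . '<li><a tabindex="2" onclick="r(this)" href="http://example.com/">Example <b>site</b></a><div>snippet</div></li>'
      . '<li><a tabindex="3" onclick="r(this)" href="http://example.org/wine">Wine</a></li>'
      . '</ol>';

// PREG_SET_ORDER groups the captures per match: $matches[n] holds
// [0] the whole <li>...</li>, [1] the URL, [2] the title HTML.
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $i => $match) {
    printf("%d: url=%s title=%s\n", $i + 1, $match[1], strip_tags($match[2]));
}
```

The U modifier makes every quantifier lazy, which is what keeps each match inside a single list item instead of stretching to the last closing tag on the page.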

Finally,


Here is the scheme I use to find a site in Yandex for a specific query (searching up to the first occurrence):
  1. I fetch the first page of results.
  2. If it is not Yandex's "distrust page" with a captcha, I run it through the regular expression and go through the results looking for the right one.
  3. If I find it, I return the result: the position of the page in the results. If not, I wait 3-5 seconds, fetch the next page and go back to step 2.
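
The three steps above can be tried offline with a sketch like the following. It is not the author's class: findPosition and the $fetchPage callable are hypothetical names, the page source is an in-memory array instead of Yandex, and matching is a plain substring test rather than a host comparison.

```php
<?php
/**
 * Sketch of the "search until the first occurrence" loop.
 * $fetchPage is a hypothetical callable: page number -> list of result URLs,
 * or null when the page could not be parsed (for example, a captcha page).
 * Returns the 1-based overall position, or false if nothing was found.
 */
function findPosition(callable $fetchPage, string $needle, int $perPage, int $limit, int $pause = 0)
{
    for ($page = 0; $page * $perPage < $limit; $page++) {
        $urls = $fetchPage($page);
        if ($urls === null) {
            return false;                            // captcha or parser failure: stop
        }
        foreach ($urls as $index => $url) {
            if (stripos($url, $needle) !== false) {
                return $page * $perPage + $index + 1;
            }
        }
        if ($pause > 0) {
            sleep($pause);                           // the post waits 3-5 seconds here
        }
    }
    return false;
}

// Offline stand-in for the search engine: two pages of two results each.
$pages = array(
    array('http://a.example/', 'http://b.example/'),
    array('http://c.example/', 'http://vinzavod.ru/catalog'),
);
$fetch = function (int $page) use ($pages) {
    return isset($pages[$page]) ? $pages[$page] : null;
};

var_dump(findPosition($fetch, 'vinzavod.ru', 2, 4));
```

Passing the page source in as a callable keeps the loop testable without network access; in real use the callable would download and parse one results page.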

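The analysis process takes these parameters: the site URL, the keyword, the maximum number of results to scan, and the number of results per page.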

Implementation


This is implemented through a class hierarchy, so that the analyzer can easily be extended to other search engines if necessary.
The abstract class, SomeAnalyzer:

  abstract class SomeAnalyzer {
      //// INTERFACE
      // analysis function
      abstract public function analyzeThis($url);

      // Get the host name from a URL. This is parse_url() with extra handling,
      // because I found that plain parse_url() does not always work when the
      // URL is malformed (for example, when it has no scheme).
      public function getHost($url) {
          $parts = @parse_url($url);
          if (!is_array($parts))
              $parts = array();
          if (!empty($parts['path']) && empty($parts['host']))
              $parts['host'] = $parts['path'];
          $host = isset($parts['host']) ? $parts['host'] : '';
          // preg_replace() instead of ereg_replace(), which was removed in PHP 7
          $host = preg_replace('#/.*$#', '', $host);
          $host = preg_replace('#^www\.#', '', $host);
          return $host;
      }

      //// IMPLEMENTATION
      // check whether two URLs belong to the same host
      protected function compareURL($url1, $url2) {
          // getHost() returns a string, so the strings are compared directly
          return strtoupper($this->getHost($url1)) == strtoupper($this->getHost($url2));
      }
  }
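
To see what the host normalization does with typical inputs, here is a standalone sketch. The function names normalizeHost and sameHost are hypothetical and are not part of the classes in this post; the logic mirrors getHost() and compareURL(), with lowercasing added as an assumption about the intended behavior.

```php
<?php
// Standalone illustration of the host normalization (hypothetical names).
function normalizeHost(string $url): string
{
    $parts = parse_url($url);
    if (!is_array($parts)) {
        return '';
    }
    // Scheme-less input such as "vinzavod.ru/page" ends up entirely in 'path'.
    $host = isset($parts['host']) ? $parts['host']
          : (isset($parts['path']) ? $parts['path'] : '');
    $host = preg_replace('#/.*$#', '', $host);      // drop everything after the host
    $host = preg_replace('#^www\.#i', '', $host);   // drop a leading "www."
    return strtolower($host);
}

function sameHost(string $a, string $b): bool
{
    return normalizeHost($a) === normalizeHost($b);
}

var_dump(sameHost('vinzavod.ru', 'http://www.VINZAVOD.ru/catalog/?id=1'));
var_dump(sameHost('vinzavod.ru/page', 'http://example.com/'));
```

The first call compares a bare domain with a full URL of the same site; the second compares two different hosts.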

The Yandex results analyzer class:

  class YandexAnalyzer extends SomeAnalyzer {
      //// INTERFACE
      // settings
      public $resultsLimit = 200; // maximum number of results to scan
      public $url;
      public $keyword;
      public $resultsOnPage = 50; // only 10, 20, 30 or 50 are allowed
      // analysis function
      public function analyzeThis($url, $keyword = '') {
          $this->url = $url;
          $this->keyword = $keyword;
          $x = 0;
          while ($x * $this->resultsOnPage <= $this->resultsLimit - 1) {
              if ($results = $this->analyzePage(str_replace(array("\r", "\n", "\t"), '', $this->downloadPage($x)))) {
                  // convert the position on the page into the overall position
                  $results[0] = $x * $this->resultsOnPage + $results[0];
                  return $results;
              }
              $x++;
              sleep(rand(3, 5));
          }
          return false;
      }
      //// IMPLEMENTATION
      protected $regexpParseResults = '#<li>.*<a[^>]*tabindex[^>]*onclick[^>]*=[^>]*"[^>]*"[^>]*href="([^<>"]+)"[^>]*>(.+)</a>.*</li>#Ui';
      protected $urlMask = 'http://yandex.ru/yandsearch?text=[KEYWORD]&p=[PAGE_NUMBER]&numdoc=[RESULTS_ON_PAGE]';
      protected function downloadPage($pageNumber) {
          $mask = str_replace('[KEYWORD]', urlencode($this->keyword), $this->urlMask);
          $mask = str_replace('[PAGE_NUMBER]', $pageNumber, $mask);
          $mask = str_replace('[RESULTS_ON_PAGE]', $this->resultsOnPage, $mask);
          return file_get_contents($mask);
      }
      // deb() is the author's debug-output helper; it is not shown in the article
      protected function analyzePage($content) {
          if (preg_match_all($this->regexpParseResults, $content, $matches, PREG_SET_ORDER) !== false) {
              if (count($matches) <= 0)
                  deb('<br /><span style="color: red;">No matches found or parser error: Yandex may suspect you are a robot!</span>');
              else
                  foreach ($matches as $num => $match) {
                      if ($this->compareURL($match[1], $this->url))
                          return array($num + 1, $match[1], $match[2]);
                  }
          }
          else deb('<span style="color: red;">No matches found or parser error: Yandex may suspect you are a robot!</span>');
          return false;
      }
  }

I want to draw your attention to one thing: I don't know what happened on the 18th of this month, but Yandex changed its results format, and I had to change the regular expressions and rework the classes. On the 19th, however, everyone watched in amazement as the Yandex people put everything back the way it was. Since this may well happen again, here is a list of the changes and additions that would have to be made to the YandexAnalyzer class if the results format suddenly becomes what it was on 18.08.2008:


Result


Here is a short test script:

  // deb() is the author's debug-output helper; it is not shown in the article
  $url = "vinzavod.ru";
  $keywords = array(
      'winery',
      'alcohol production',
      'alcohol production',
      'sale of alcohol',
      'alcohol manufacturers',
      'wine',
      'wine',
      'wine production',
      'sale of wine',
      'cognac',
      'cognacs',
      'cognac production',
      'sale of brandies',
      'sale of brandy',
      'sale of brandy',
      'tincture',
      'tinctures',
      'production of tinctures',
      'sale of tinctures',
      'vermouth',
      'vermouths',
      'vermouth production',
      'port',
      'port',
      'port 777',
      'sale of port',
      'alcohol',
      'alcohol products',
      'branded alcohol',
      'alcoholic drinks',
      'classic alcoholic drinks'
  );
  $g = new YandexAnalyzer();
  foreach ($keywords as $keyword) {
      if ($res = $g->analyzeThis($url, $keyword)) {
          deb('<span style="color: green;">' . $res[0] . '-th position of the site ' . $url . ' for the phrase "<a href="' . $url . '">' . $keyword . '</a>"</span>');
      }
      else
          deb($url . ' not found in the first ' . $g->resultsLimit . ' results for the phrase "' . $keyword . '"');
      sleep(rand(3, 5));
  }

and the test output:
4th position of the site vinzavod.ru on the phrase "winery"
13th position of the site vinzavod.ru on the phrase "alcohol production"
158th position of the site vinzavod.ru on the phrase "production of alcohol"
45th position of the site vinzavod.ru on the phrase "sale of alcohol"
vinzavod.ru not found in the first 300 results for the phrase “alcohol producers”
181st position of the site vinzavod.ru on the phrase "wine"
255th position of the site vinzavod.ru on the phrase "wine"
4th position of the site vinzavod.ru on the phrase "wine production"
56th position of the site vinzavod.ru on the phrase "the sale of wine"
94th position of the site vinzavod.ru on the phrase "brandy"
56th position of the site vinzavod.ru on the phrase "brandy"
7th position of the site vinzavod.ru on the phrase "cognac production"
5th position of the site vinzavod.ru on the phrase "sale of brandy"
7th position of the site vinzavod.ru on the phrase "sale of brandy"
5th position of the site vinzavod.ru on the phrase "sale of brandy"
11th position of the site vinzavod.ru on the phrase "tincture"
17th position of the site vinzavod.ru on the phrase "tinctures"
3rd position of the site vinzavod.ru on the phrase "production of tinctures"
1st position of the site vinzavod.ru on the phrase "sale of tinctures"
30th position of the site vinzavod.ru for the phrase "vermouth"
25th position of the site vinzavod.ru on the phrase "vermouth"
32nd position of the site vinzavod.ru on the phrase "vermouth production"
32nd position of the site vinzavod.ru on the phrase "port"
15th position of the site vinzavod.ru on the phrase "port"
93rd position of the site vinzavod.ru on the phrase "port 777"
4th position of the site vinzavod.ru on the phrase "sale of port"

Prospects and plans for the future



The article took a week to prepare, so by the time of writing, analyzers for Yandex, Google and Rambler have already been implemented, and the application itself is slowly being written. But that is a topic for future posts))
P.S. This is my first post on Habr, so please don't kick me too hard; constructive comments are welcome))

Source: https://habr.com/ru/post/37913/

