While developing the PVS-Studio code analyzer, which finds problems in 64-bit and parallel programs, we needed to regularly collect fresh information from the Internet on certain topics. For example, it is always useful to answer questions on forums and blogs from programmers who might be interested in our tool. It turned out that there is a lot of information out there and that searching for it by hand is slow and tedious, so we set out to automate the search for fresh data. In this post we describe how we do it.
You will probably say right away: “Ha! These guys are reinventing the wheel and have never heard of Google Alerts.” We do know about Google Alerts. It is almost what we need, but not quite :-). In more than six months of using Google Alerts we never managed to get what we need from it. And what we need is this:
- search on specific listed sites;
- search only for the last day;
- the ability to add stop words;
- complete results: Google Alerts applies some additional filtering of its own, so a regular Google search returns more than Google Alerts does.
So we decided to try reinventing the wheel after all.
The task is to search the specified sites for new materials, up to 30 results, published no more than 24 hours before the automated search starts. Roughly speaking, we want to know who wrote what on the Internet over the last day. The input data are as follows:
- A list of site addresses: the URLs of the sites to search.
- A list of search phrases: phrases in Russian and/or English to search for.
- A list of unwanted words: words that must not appear in the search results.
Idea
Many services on the web offer search, so it makes sense to build on their capabilities. We chose google.com as the search engine best suited, in our opinion, to the task.
Google search
It works like any other search engine: a request is sent to Google, and Google returns a response. The request format is flexible, so the query we need can be assembled from URL parameters.
Google search options
Consider the search parameters most relevant to the task:
www.google.com/search? | The base address |
as_q | Search phrase (a phrase, not a set of words) |
num | Number of results shown on the page |
as_eq | Words that must not appear in the search results |
as_sitesearch | URL of the site to search |
The search engine has other parameters, but they are not relevant to this task. A sample Google query with search parameters:
http://www.google.com/search?as_q=64-bit+portability+&hl=en&newwindow=1&num=30&btnG=%D0%9F%D0%BE%D0%B8%D1%81%D0%BA+%D0%B2+Google&as_epq=&as_oq=&as_eq=%D0%BA%D1%83%D0%BF%D0%B8%D1%82%D1%8C+%D1%81%D0%BA%D0%B0%D1%87%D0%B0%D1%82%D1%8C+&lr=lang_ru&cr=&as_ft=i&as_filetype=&as_qdr=d&as_occt=any&as_dt=i&as_sitesearch=http://www.codeguru.com/&as_rights=&safe=images
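As a minimal sketch, a query like the one above could be assembled in PHP with the built-in http_build_query function. The helper name build_search_url and its argument order are hypothetical and not taken from our script; the as_qdr=d value is copied from the sample query, where it appears alongside the parameters listed in the table.

```php
<?php
// Hypothetical helper (the name is illustrative): builds a Google search URL
// from the parameters discussed above.
function build_search_url(string $phrase, string $site, array $stopWords, int $limit = 30): string
{
    $params = [
        'as_q'          => $phrase,                  // search phrase
        'num'           => $limit,                   // number of results on the page
        'as_eq'         => implode(' ', $stopWords), // words that must not appear
        'as_qdr'        => 'd',                      // value copied from the sample query
        'as_sitesearch' => $site,                    // site to search
    ];
    // http_build_query() URL-encodes each value and joins the pairs with "&".
    return 'http://www.google.com/search?' . http_build_query($params);
}

echo build_search_url('64-bit portability', 'http://www.codeguru.com/', ['buy', 'download']), "\n";
```

Letting http_build_query do the encoding avoids hand-escaping Cyrillic stop words, which is where a manually written query string is most likely to go wrong.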
How this can be used
It follows from the above that the search can be automated with the Google search engine. The algorithm is as follows:
- A request to Google is built from the input data.
- The request is executed.
- The result is processed (the HTML page is parsed).
- The previous steps are repeated for each site and each search phrase in the input data.
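The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual script: the function names are hypothetical, the page is parsed with PHP's built-in DOMDocument rather than the parser described later, and every link on the page is collected, whereas a real script would have to pick out only the search results. The fetcher is passed in as a callable so that the demonstration runs offline with a stub.

```php
<?php
// Hypothetical helper: collects the href of every <a> element on a page.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Hypothetical driver: one request per (site, phrase) pair.
// $fetch is any callable that takes a URL and returns the page HTML.
function run_searches(array $sites, array $phrases, array $stopWords, callable $fetch): array
{
    $found = [];
    foreach ($sites as $site) {
        foreach ($phrases as $phrase) {
            // Step 1: build the request from the input data.
            $url = 'http://www.google.com/search?' . http_build_query([
                'as_q'          => $phrase,
                'num'           => 30,
                'as_eq'         => implode(' ', $stopWords),
                'as_qdr'        => 'd', // value copied from the sample query
                'as_sitesearch' => $site,
            ]);
            // Steps 2 and 3: execute the request and parse the HTML page.
            $found[$site][$phrase] = extract_links($fetch($url));
        }
    }
    return $found;
}

// Offline demonstration: a stub stands in for the real HTTP request.
$stub = function (string $url): string {
    return '<html><body><a href="http://example.com/post">Post</a></body></html>';
};
print_r(run_searches(['http://www.codeguru.com/'], ['viva64'], ['buy'], $stub));
```

Passing the fetcher as a parameter keeps the loop testable without network access; for real requests a callable such as 'file_get_contents' could be supplied instead of the stub.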
Implementation
The script is written in PHP.
Input data
There are three kinds of input data: a list of site URLs to search, a list of search phrases, and a list of words that must not appear in the search results. The following XML file holds this data:
<?xml version="1.0" encoding="utf-8"?>
<search_params lang="ru">
  <sites>
    <url>http://www.dreamincode.net</url>
    <url>http://forum.vingrad.ru/</url>
    <url>http://forum.sources.ru/</url>
    <url>http://groups.google.com/</url>
  </sites>
  <words>
    <white_list>
      <phrase>"64-bit" c++</phrase>
      <phrase>64-bit migration</phrase>
      <phrase>viva64</phrase>
    </white_list>
    <black_list>
      <phrase>buy</phrase>
      <phrase>download</phrase>
    </black_list>
  </words>
</search_params>
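To illustrate how a file of this shape maps onto the three input lists, here is a minimal sketch that reads it with PHP's built-in SimpleXML extension. This is a swapped-in technique for the sake of a self-contained example, not the parser our script uses, and the function name load_search_params is hypothetical. The XML is shortened and embedded as a string so the example runs on its own.

```php
<?php
// A shortened copy of the configuration, embedded for the demonstration.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<search_params lang="ru">
  <sites>
    <url>http://www.dreamincode.net</url>
    <url>http://forum.vingrad.ru/</url>
  </sites>
  <words>
    <white_list>
      <phrase>"64-bit" c++</phrase>
      <phrase>viva64</phrase>
    </white_list>
    <black_list>
      <phrase>buy</phrase>
      <phrase>download</phrase>
    </black_list>
  </words>
</search_params>
XML;

// Hypothetical helper: turns the XML into plain PHP arrays.
function load_search_params(string $xml): array
{
    $root = simplexml_load_string($xml);
    if ($root === false) {
        throw new RuntimeException('Malformed search parameters XML');
    }
    // Cast each SimpleXMLElement to a trimmed string.
    $toStrings = function ($nodes): array {
        $out = [];
        foreach ($nodes as $node) {
            $out[] = trim((string) $node);
        }
        return $out;
    };
    return [
        'lang'      => (string) $root['lang'],
        'sites'     => $toStrings($root->sites->url),
        'phrases'   => $toStrings($root->words->white_list->phrase),
        'stopWords' => $toStrings($root->words->black_list->phrase),
    ];
}

print_r(load_search_params($xml));
```

The resulting arrays correspond directly to the site list, the search phrases, and the unwanted words.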
XML parsing
The XML file has a simple structure and is small, so it can be parsed with the PHP Simple HTML DOM Parser script.
Its use is described in the documentation, but it is worth noting that its way of working with the DOM closely resembles that of jQuery, the well-known JavaScript library. For example, the following code gets all the links from the HTML page at google.com and prints them:
include('../simple_html_dom.php');
// get the DOM from a URL or file
$html = file_get_html('http://www.google.com/');
// find all links and print their targets
foreach ($html->find('a') as $e)
    echo $e->href . '<br>';
However, Simple HTML DOM Parser has a small memory problem. Each call to file_get_html creates a new object of the simple_html_dom class, so if the function is called in a loop, memory eventually runs out. For some reason we were unable to free that memory explicitly. The workaround is not to call this function in a loop, but to call it once and work with a single simple_html_dom object.
Script creation
Nothing especially interesting here: an ordinary PHP script written using the MVC pattern. The source code is uncomplicated as well.
The user interface is very austere: opening the page in a browser shows a single “Send request” button, and some time after it is pressed the result is displayed.
Summary
Since we put this script into use, we always know what has happened in our subject area (64-bit and parallel programming) over the last 24 hours.