Over the past 5 years I have written many loaders. These are small scripts that parse information from source sites and store it in their own database. Usually they are a sequence of regular expressions that pull the values out into the right fields. Loaders can log in, connect through a proxy, and sometimes even recognize CAPTCHA images. But that is not the point here.
The theoretical problem is that it is impossible to write a fully automatic loader. We can pull in any data, but the database turns into a dump if the loader loses the source site's classification. And once we start preserving the classification, a problem arises.
Consider an example. Suppose there is a car site that loads car sale ads from hundreds of other sites. The loader parses an ad and produces an array:
{:"ford", :"focus", :"1.6 Ti-VCT 5d", : ...}.
An automatic loader often works like this: it looks up the make by name in the makes table; if "ford" is there, it takes the make's id; if not, it adds "ford" to the makes and takes the new id. It does the same with the model and the modification. Then it adds the ad with the ids it obtained. The flaw in such a system is that sooner or later there will be an ad where the make field holds "FORD", or a differently spelled "VAZ", or "AvtoVAZ", and where the city is written not as "St. Petersburg" but as "SPb" or "Spb". Smart Google will understand that these are synonyms; our dumb loader, which compares names character by character, will not. The result is a mess in the classification tables.
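To make the failure concrete, here is a minimal Python sketch of that find-or-add step (Python is used only for illustration; my loaders are in PHP). The function name `get_or_create` and the dict standing in for the makes table are hypothetical.

```python
def get_or_create(table, name):
    """Return the id for `name`, adding a new row when no exact match exists."""
    if name not in table:
        table[name] = len(table) + 1  # hypothetical auto-increment id
    return table[name]


makes = {}  # stands in for the makes table: name -> id
for raw in ["ford", "FORD", "VAZ", "AvtoVAZ"]:
    get_or_create(makes, raw)

# Two real makes, but the exact-match lookup has created four rows.
print(len(makes))
```

Lowercasing or trimming would merge "ford" and "FORD", but no simple normalization joins "VAZ" with "AvtoVAZ", which is why a human-maintained mapping is needed.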
Trying to minimize the manual labor of the moderator, I came up with the following algorithm.
First of all, the loader consists of two parts.
The first is loader_pages. This script scans the pages with lists of ads, such as http://cars.auto.ru/cars/used/ford/focus/, and simply collects links to the individual ads. It also finds the pagination links and follows them recursively. When it finds a link to an ad, it adds it to the database or, if the link is already there, updates its "last found date" to the current time. This is needed so that (the loader runs every hour) we can delete objects whose link was last found long enough ago: that means the link no longer appears in the lists, so the object has been deleted at the source.
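The "last found date" bookkeeping can be sketched in Python roughly as follows. This is only an illustration under assumptions: the names `record_link` and `purge_stale`, the in-memory dict in place of the database table, the example.com URLs, and the two-hour threshold are all made up for the example.

```python
from datetime import datetime, timedelta


def record_link(links, url, now):
    """Insert a newly found ad link, or refresh its last-found date."""
    entry = links.setdefault(url, {"processed": False, "last_found": now})
    entry["last_found"] = now


def purge_stale(links, now, max_age):
    """Delete links not seen within `max_age`; return the removed urls."""
    stale = [u for u, e in links.items() if now - e["last_found"] > max_age]
    for u in stale:
        del links[u]
    return stale


links = {}
t0 = datetime(2024, 1, 1, 12, 0)
record_link(links, "http://example.com/ad/1", t0)
record_link(links, "http://example.com/ad/2", t0)
# An hour later only ad/1 is still listed at the source.
record_link(links, "http://example.com/ad/1", t0 + timedelta(hours=1))
removed = purge_stale(links, t0 + timedelta(hours=3), timedelta(hours=2))
```

Here `removed` contains only the ad/2 link, because it was not seen again after the first scan.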
The second is loader_offer. It takes the links that have not been processed yet from the database, loads the HTML, and parses it. The result is an array like
{:"ford", :"focus", :"1.6 Ti-VCT 5d", : ...}
Then it loads the compares table. This table holds the mappings that the moderator will process by hand. It consists of the fields:
{,, ,id }.
In our case,
{:"auto.ru",:"",:"ford",:"..."}.
If the corresponding mapping has already been filled in, hooray, we take the id. If not, we add a new mapping to compares, but we do not add the object.
The moderator goes through the mappings that are not filled in yet and matches each one to a value from our own "good" tables of car makes, models, cities, etc.
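The lookup in compares could look roughly like this in Python. It is a sketch under assumptions: the key names `source`, `field`, and `value`, the functions `resolve` and `load_offer`, and the ids 7 and 42 are hypothetical, and a dict stands in for the table.

```python
def resolve(compares, source, field, value):
    """Return the moderator-assigned id, or None after queuing a new mapping."""
    key = (source, field, value)
    if key not in compares:
        compares[key] = None  # queued: not yet filled in by the moderator
    return compares[key]


def load_offer(compares, offers, source, parsed):
    """Add the ad only when every field already has a filled-in mapping."""
    ids = {f: resolve(compares, source, f, v) for f, v in parsed.items()}
    if any(i is None for i in ids.values()):
        return False  # wait for the moderator; the ad is not added
    offers.append(ids)
    return True


compares = {("auto.ru", "make", "ford"): 7}
offers = []
ad = {"make": "ford", "model": "focus"}
first = load_offer(compares, offers, "auto.ru", ad)   # model is unknown yet
compares[("auto.ru", "model", "focus")] = 42           # the moderator's work
second = load_offer(compares, offers, "auto.ru", ad)  # now the ad goes in
```

The first call returns False and leaves a pending row for "focus"; after the moderator fills it in, the second call adds the ad.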
Parents. Everything works fine while the tables are small. Take car makes: there are only 100 of them, and matching them is a breeze. But my database has 7,000 models and 20,000 modifications. Can you imagine picking, out of 20,000, the match for the modification "1.6 Ti-VCT 5d", which in my tables is called "1.6 Ti-VCT"? The moderator will not survive it. Either that, or you need a good search.
But it can be made easier. When loading an ad, we process the mappings in order: first the make, then the model, then the modification. We take the mapping for the make,
{:"auto.ru",:"",:"ford",:"..."},
and whether we find it or add it does not matter. We take the id of this mapping and write it into an additional parent field of the model's mapping:
{:"auto.ru",:"",:"focus",:"...",parent:"id "}.
We do the same for the modification: its parent holds the id of the model's mapping.
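A rough Python sketch of building this chain is below. The function `chain_mappings`, the field names, and the choice to make the parent id part of the lookup key are my assumptions for the example, not a description of the actual table layout.

```python
def chain_mappings(compares, source, parsed,
                   order=("make", "model", "modification")):
    """Find or add one mapping per field, linking each to the previous one."""
    parent_id = None
    chain = []
    for field in order:
        # Assumption: the parent id is part of the lookup key.
        key = (source, field, parsed[field], parent_id)
        if key not in compares:
            compares[key] = len(compares) + 1  # id of the new mapping row
        parent_id = compares[key]
        chain.append(parent_id)
    return chain


compares = {}
ad = {"make": "ford", "model": "focus", "modification": "1.6 Ti-VCT 5d"}
ids = chain_mappings(compares, "auto.ru", ad)
again = chain_mappings(compares, "auto.ru", ad)  # finds the same three rows
```

Each mapping id becomes the parent of the next one, so the modification row points to the model row, which points to the make row.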
The moderator works in order. First he takes the make mappings and fills them all in. Then he moves on to the model mappings. Here we can see that a model mapping has a parent make mapping that has already been filled in, so the options offered for it should not be all possible models, but only the models of the make that the parent mapping points to. In other words, once "Ford" has been matched, "Focus" is chosen not from 7,000 models but only from the Ford ones, which number in the hundreds at most.
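Narrowing the options by parent might be sketched in Python like this. All names here (`candidates`, `target_id`, `make_id`) and the sample rows are made up for illustration.

```python
def candidates(mapping, compares, models):
    """List the model rows the moderator should choose from for `mapping`."""
    parent = compares.get(mapping.get("parent"))
    if parent is None or parent["target_id"] is None:
        return list(models)  # parent not matched yet: no narrowing possible
    return [m for m in models if m["make_id"] == parent["target_id"]]


# Hypothetical rows: mapping 1 (make) is filled in, mapping 2 (model) is not.
compares = {
    1: {"field": "make", "value": "ford", "target_id": 10, "parent": None},
    2: {"field": "model", "value": "focus", "target_id": None, "parent": 1},
}
models = [
    {"id": 100, "name": "Focus", "make_id": 10},
    {"id": 101, "name": "Fiesta", "make_id": 10},
    {"id": 200, "name": "Corolla", "make_id": 20},
]
options = candidates(compares[2], compares, models)
```

With the make already matched, `options` holds only the two rows whose `make_id` is 10; without a filled-in parent the function falls back to the full list.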
The point of this post is not that I came up with something completely new. I have simply never come across a description of such programs. And I like that it is practical to a fault, because in principle it is clear that each object is a subset of the vertices of some trees, and a parser is a mapping of the elements of the page's HTML onto those vertices. One could build a theory around this, something like a language for describing parsers, and so on. On the other hand, the code of an average loader in PHP takes me 2 pages. And it is not clear whether it is worth bothering with theory, because I cannot see how this code could be made any shorter or simpler, even with some abstract language.