
Simple metasearch algorithm in Python

Lyrical digression


As part of my research work at the university, I was faced with the task of classifying textual information. Essentially, I had to create an algorithm that, given a text document as input, would return an array whose elements measure how strongly the text belongs (as a probability or a confidence score) to each of a set of predefined topics.

This article is not about solving the classification problem itself, but about an attempt to automate the most tedious stage in developing a rubricator: creating the training set.

When you are too lazy to do it by hand


My first and most obvious thought was to write a simple metasearch algorithm in Python. In other words, since I have no search databases of my own, all the automation boils down to reusing the results of another search engine (Google Search). I should note right away that there are ready-made libraries that solve a similar problem, for example pygoogle.

Closer to the point


For HTTP requests I used requests, and to extract links from the search results I used the BeautifulSoup parsing library. Here is what came out:

from bs4 import BeautifulSoup
import requests

query = input('What are you searching for?: ')
url = 'http://www.google.com/search?q='
page = requests.get(url + query)

soup = BeautifulSoup(page.text, "html.parser")
h3 = soup.find_all("h3", class_="r")        # result titles live in <h3 class="r">
for elem in h3:
    elem = elem.contents[0]                 # the <a> tag inside the heading
    link = "https://www.google.com" + elem["href"]
    print(link)

I pulled only the links to sites that appear on the Google results page (as viewed in Chrome) inside <h3 class="r"> tags.
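In the markup current at the time of writing, those hrefs are often Google redirect links of the form /url?q=<target>&..., so the actual result URL can be recovered from the q parameter. Below is a minimal sketch of that step, assuming this link format (the helper name is my own):

from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Recover the real result URL from a Google redirect link such as
    '/url?q=https://example.com/&sa=U'; fall back to the raw href otherwise."""
    params = parse_qs(urlparse(href).query)
    return params.get("q", [href])[0]

# extract_target("/url?q=https://en.wikipedia.org/wiki/God&sa=U")
# -> "https://en.wikipedia.org/wiki/God"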

Well, great. Now let's try to collect links while walking through several pages of search results:

from bs4 import BeautifulSoup
import requests

query = input('What are you searching for?: ')
number = input('How many pages: ')
url = 'http://www.google.com/search?q='
page = requests.get(url + query)

for index in range(int(number)):
    soup = BeautifulSoup(page.text, "html.parser")
    next_page = soup.find("a", class_="fl")              # link to the next results page
    next_link = "https://www.google.com" + next_page["href"]
    h3 = soup.find_all("h3", class_="r")
    for elem in h3:
        elem = elem.contents[0]
        link = "https://www.google.com" + elem["href"]
        print(link)
    page = requests.get(next_link)                       # move on to the next page

The address of the next results page is stored in the <a class="fl"> tag.

Finally, let's try to pull information from one of the pages and turn it into a vocabulary for the future rubricator. We will collect the necessary material from good old Wikipedia:

from bs4 import BeautifulSoup
import requests

raw_text = ""
query = input('What are you searching for?: ')
url = 'http://www.google.com/search?q='
page = requests.get(url + query)

soup = BeautifulSoup(page.text, "html.parser")
h3 = soup.find_all("h3", class_="r")
for elem in h3:
    elem = elem.contents[0]
    elem = elem["href"]
    if "wikipedia" in elem:                   # take the first Wikipedia result
        link = "https://www.google.com" + elem
        break

page = requests.get(link)
soup = BeautifulSoup(page.text, "html.parser")
text = soup.find(id="mw-content-text")        # main article body on Wikipedia
p = text.find("p")
while p is not None:
    raw_text += p.get_text() + "\n"
    p = p.find_next("p")

vocab = raw_text.split()                      # the resulting word list

For the query “god,” a decent vocabulary of about 3,500 terms is obtained, which, admittedly, still has to be polished by hand: punctuation marks, links, stop words, and other garbage need to be removed.
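A rough first cleanup pass can be automated as well. Here is a minimal sketch that lowercases the tokens, strips punctuation, and drops a few stop words; the stop-word list is purely illustrative, and the regex assumes Latin-script text (for a Russian-language corpus the character class would have to be changed). It operates on the vocab list built in the previous snippet:

import re

# Illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "that"}

def clean_vocabulary(words):
    """Lowercase tokens, keep only letters, drop stop words and duplicates."""
    cleaned = set()
    for word in words:
        token = re.sub(r"[^A-Za-z]", "", word).lower()   # assumes Latin-script text
        if token and token not in STOP_WORDS:
            cleaned.add(token)
    return sorted(cleaned)

# vocabulary = clean_vocabulary(vocab)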

Conclusion


Summing up the work done, it should be noted that the vocabulary is, of course, rather raw, and adapting the parser to a specific resource takes time to study that resource's structure. This suggests a simple idea: a training set should either be built by hand or taken from ready-made databases.

On the other hand, with a carefully written parser (stripping unnecessary tags from the HTML markup is not hard) and a large number of classes, a certain degree of automation can give the rubricator the flexibility it needs.
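For the tag-stripping part, a minimal sketch along these lines would do; the list of tags to drop is an assumption and would be tuned for each specific resource:

from bs4 import BeautifulSoup

def page_to_text(html):
    """Strip service tags from an HTML page and return its plain text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "table", "sup", "noscript"]):
        tag.decompose()                      # remove the tag together with its contents
    return soup.get_text(separator="\n", strip=True)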

Links to tools used


BeautifulSoup: www.crummy.com/software/BeautifulSoup
Requests: docs.python-requests.org/en/latest

Source: https://habr.com/ru/post/272711/

