
GSA: Preparing Google Search Appliance in a Virtual Machine


For years, while reading about the personal search engine in the fun yellow box called Google, I occasionally googled for the GSA, the Google Search Appliance, together with "reverse engineering", "hack", "DIY", "disk dump", and so on. But I never found anything beyond official press releases and correspondence between the (happy?) owners and the support team.

Occasionally timid questions appeared on forums, like "how can I get root" or "how do I ssh into the GSA", but the answer was always the same: only the Google support team knows the passwords, and they will not tell anyone. Surprisingly, I have not seen a single attempt on the Internet to build a "hackintosh" on the Google engine, or to study the page ranking algorithm from live code.

The situation changed slightly in 2008 when, on the wave of virtualization euphoria, Google released VGSA: a free virtual machine for VMware with a license limited to 50 thousand documents. It did not arouse much enthusiasm on the Internet; in 2009 the project was shut down, and most of the VGSA links in Google started returning 404 (note: from Google itself). The 2008 release link can still be found quite easily; the link to the 2009 version survives only on a couple of Chinese sites.
Below is how I installed vgsa_20090210 on ESX 5.1 and what interesting things I found inside.

A couple of words about the Google Search Appliance

GSA is a pocket search engine that can index websites, arbitrary documents, and databases. It is positioned as a local Google for large companies that want their own specialized search, whether they expose it to the outside world or not. The box itself is a workhorse that indexes not only the URLs specified in the settings (http://, smb://) but also any data (Oracle, MySQL, etc.) that can be fed in via the API. Besides its own HTTP/SMB crawler, GSA ingests data through so-called open-source connectors written for various databases and file systems. They are freely available on Google Code and are managed through the Connector Manager.
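The article does not show the feed mechanism itself, but the production GSA accepts documents through an XML feed interface described in Google's public GSA feeds documentation (the "feedergate" listening on port 19900, POST to /xmlfeed). The datasource name and URL below are made up for illustration. A minimal sketch of building such a feed document:

```python
import xml.etree.ElementTree as ET

def build_feed(datasource, urls, feedtype="incremental"):
    """Build a minimal GSA-style XML feed (gsafeed format) for the given URLs."""
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(feed, "group")
    for url in urls:
        # action="add" asks the appliance to (re)crawl and index this URL
        ET.SubElement(group, "record", url=url, action="add",
                      mimetype="text/html")
    return ET.tostring(feed, encoding="unicode")

# Hypothetical datasource and URL, for illustration only
xml_feed = build_feed("my_connector", ["http://intranet.local/page1.html"])
print(xml_feed)
```

On a real appliance this document would be POSTed as multipart form data to the feedergate together with the datasource and feedtype fields; the connectors mentioned above do essentially this on the admin's behalf.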

Installation

The VGSA virtual machine is built on CentOS 5 Final. After unpacking the archive we get a standard set of vmx/vmdk files for VMware Player. Since the version is outdated, it would not start on ESXi 5.1 even via VMware Converter. So I created a new VM with basic settings for RedHat 5 32-bit, 2 GB of memory and a small disk, which was immediately removed and replaced with the vmdk from VGSA (attached as a parallel SCSI BusLogic device). Update: commenters report that it runs perfectly in VMware Workstation 8.

After the standard LILO prompt, the boot proceeded and the first alarming message about encryption appeared:



Then VGSA received addresses via DHCP and its standard screen appeared:



At the proposed URL there was a perfectly working Google engine, ready for configuration (with a license for 50 thousand documents):



As a test, I indexed the first 100 pages of Habr with standard Google tactics: four processes per domain/site. The crawler follows all internal links, while the indexing machinery simultaneously weeds out junk and duplicates. At the same time everything is ranked: links to and from each page are counted, and every page is assigned its own PR relative to the root (the site root is always PR10), and so on.
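The article never shows GSA's actual ranking code, but the link counting described above is in the spirit of classic PageRank. A toy illustration of that idea (the damping factor and iteration count are conventional textbook choices, not values taken from GSA):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            if outs:
                # each page passes its rank to its outgoing links, evenly
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # dangling page: spread its rank over everybody
                for target in pages:
                    new[target] += damping * rank[page] / len(pages)
        rank = new
    return rank

# The root "/" receives links from every other page, so it ranks highest
ranks = pagerank({"/": ["/a", "/b"], "/a": ["/"], "/b": ["/", "/a"]})
```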



Having built a collection of sites that interest me (I have not dared to index images yet), I quickly realized that a limit of 50 thousand pages is very little. Time to look under the hood...

Under the hood

I immediately stumbled over the modified LILO, which asks for a password on any attempt to pass a boot parameter:



Once again I take the distribution disk image and look at its structure. fdisk chokes on the unfamiliar GPT, so let's try parted:


 [root@server /]# parted /home/vgsa/vgsa-flat.vmdk
 GNU Parted 1.8.1
 Using /home/storage/azureus/vgsa-flat.vmdk
 Welcome to GNU Parted! Type 'help' to view a list of commands.
 (parted) unit b
 (parted) p
 Model: (file)
 Disk /home/storage/azureus/vgsa-flat.vmdk: 36507222015B
 Sector size (logical/physical): 512B/512B
 Partition Table: gpt

 Number  Start        End           Size          File system  Name           Flags
  1      17408B       2147484159B   2147466752B   ext3         /
  2      2147484160B  4294967807B   2147483648B   linux-swap   swap
  3      4294967808B  36507205119B  32212237312B  ext3         /export/hda3
 (parted)

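The 17408-byte offset is no accident: GPT reserves the first 34 sectors of 512 bytes (protective MBR, GPT header, and 128 partition entries), so the first partition begins at sector 34. A quick sanity check of the numbers from the parted output:

```python
SECTOR = 512

# Partition start offsets in bytes, as reported by parted
starts = {"/": 17408, "swap": 2147484160, "/export/hda3": 4294967808}

# GPT reserves LBA 0-33, so the first partition begins at sector 34
assert starts["/"] == 34 * SECTOR

for name, off in starts.items():
    # every partition must begin on a sector boundary
    assert off % SECTOR == 0
    print(f"{name}: starts at sector {off // SECTOR}")
```

The same offsets can be used to loop-mount the other partitions straight from the flat image, exactly as done for the root below.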

The first partition starts at byte 17408. Mount it:

 # mount -t ext3 -o loop,rw,offset=17408 /home/vgsa/vgsa-flat.vmdk /mnt/vgsa 


and we get the VGSA root filesystem, so far without the main partition /dev/hda3.
Let's look in lilo.conf for the password:

 # grep pass /mnt/vgsa/etc/lilo.conf password=cmBalx7 


Now we boot with init=/bin/bash, remount the root read-write (mount -o rw,remount /) and change the password.
While we're at it, we can also fix iptables. The system's main configuration file, /export/hda3/5.2.0/local/conf/google_config, contains the parameter ENT_LICENSE_MAX_PAGES_OVERALL, which controls the maximum number of indexed pages. I tried the first thing that came to mind: telinit 1, change ENT_LICENSE_MAX_PAGES_OVERALL to 50 million, then sync and reboot. Surprisingly, the system came up and showed the new limit...
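The manual edit above could be scripted roughly like this. The key name ENT_LICENSE_MAX_PAGES_OVERALL comes from the article; the exact key/value layout of google_config is an assumption, so the regex is kept permissive:

```python
import re

def raise_page_limit(text, new_limit=50_000_000):
    """Rewrite the ENT_LICENSE_MAX_PAGES_OVERALL line in a google_config dump.

    Assumes a 'KEY = value' line format; leaves all other lines untouched.
    """
    pattern = re.compile(r"^(ENT_LICENSE_MAX_PAGES_OVERALL\s*=\s*)\S+",
                         re.MULTILINE)
    return pattern.sub(rf"\g<1>{new_limit}", text)

# Toy config fragment, not the real file contents
config = "ENT_LICENSE_MAX_PAGES_OVERALL = 50000\nOTHER_KEY = 1\n"
print(raise_page_limit(config))
```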

Briefly, some interesting places I did not have enough time for:

In /export/hda3/5.2.0/local/google3/quality/, namely rankboost/indexing/rankboost_cdoc_attachment_pb.py, there is a very interesting fragment:

  self.link_count_ = 0
  self.offdom_link_count_ = 0
  self.paid_link_count_ = 0
  self.ppc_link_count_ = 0
  self.page_blog_score_ = 0
  self.page_wiki_score_ = 0
  self.page_forum_score_ = 0
  self.page_ppc_spam_score_ = 0
  self.has_link_count_ = 0
  self.has_offdom_link_count_ = 0
  self.has_paid_link_count_ = 0
  self.has_ppc_link_count_ = 0
  self.has_page_blog_score_ = 0
  self.has_page_wiki_score_ = 0
  self.has_page_forum_score_ = 0
  self.has_page_ppc_spam_score_ = 0


Many of these ranking factors are well known, but there is still something new here. I had suspected that a PPC advertising campaign could both help and hurt organic results; the fragment above shows that Google does track PPC-related signals per page. It only remains to guess what exactly "PPC spam" is.
There is a lot of interesting material in /export/hda3/5.2.0/spelling/. I have not figured out the format yet, but at first glance there are Google databases of synonyms and inflections in different languages. There is also a collection of stop words and a lot of very funny filters that sometimes border on the absurd:

 en.spelling.filter.dnc.utf8:# Prevent correcting 'aryan' to 'jewish' or 'arabic'. 


Instead of an epilogue

The overall impression of the internals is mixed. It is written largely in Python, apparently partly as plain scripts and partly as compiled code. The system looks more like a pile of duct-taped scripts than a finished, solid product. Nevertheless, it all works quite fast and, judging by the CPU load during active indexing of a large site, quite efficiently.

Now that the black box has become a truly open yellow box, a few thoughts on its possible uses:

  1. Obvious interest from an SEO perspective.
  2. Local site search: the system indexes a single site and serves as a powerful internal search engine (for example, embedded via an iframe).
  3. It would be interesting to extend the functionality, e.g. add a keyword filter to the admin panel, so that the spider collects only pages containing certain words or phrases.
  4. I am sure this can be made to run on real hardware, perhaps with a bit of filing.
  5. ... or in an OpenVZ container.
  6. ... think up your own.
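Idea 3 above, a crawl-time keyword filter, could look something like this as a standalone sketch (this is not the GSA admin API, just an illustration of the predicate the spider would apply to each fetched page):

```python
def keep_page(html, required_terms, match_all=False):
    """Decide whether a fetched page should be indexed, based on keywords.

    With match_all=True, every term must appear; otherwise any one suffices.
    """
    text = html.lower()
    hits = [term for term in required_terms if term.lower() in text]
    return len(hits) == len(required_terms) if match_all else bool(hits)

# Hypothetical page content, for illustration only
page = "<html><body>Annual report on search quality</body></html>"
keep_page(page, ["search", "quality"], match_all=True)  # True
```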

I would be glad to hear your thoughts on this.

Source: https://habr.com/ru/post/170801/

