UPD: phpBB is actually to blame: it treats bots as registered users and gives them read permissions. Thanks to khim and mikes.
Some years ago I read somewhere that Google was going to index the “dark side of the Internet”: all kinds of databases, closed libraries, and paid sites in general. That is, information you can only view after entering at least a username and password. By some estimates, this “dark” information makes up 90 to 98% of the Internet.
Back then I was delighted: it would be possible to read experts-exchange.com (I know about the End key) and similar sites that I used.
But recently I needed to set up an internal forum for an organization. The organization is fairly large and spread across the country. The goal was to give geographically distributed employees a simple way to communicate within the organization. The forum was meant for discussing internal information that competitors, to put it mildly, should not have access to.
What I did:
- Added a subdomain
- Installed and configured phpBB
- Closed all forums: an unauthorized user gets the message "There are no forums on this site"
- Added an extra field to the registration page with a question whose answer is known only to employees of this organization
- Notified employees by email only. The link was not published anywhere on the Internet.
However, a week later I noticed googlebot, yandexbot, and other lesser-known spiders in the logs. That did not bother me: there are plenty of services that show DNS statistics, and search engines could have found the forum through them.
However, a month later I noticed in the logs that Google was indexing the forum:
66.249.71.178 - - [time] "GET /robots.txt HTTP/1.1" 404 2152 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.178 - - [time] "GET / HTTP/1.1" 200 17743 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.178 - - [time] "GET /viewtopic.php?f=x5&p=y96 HTTP/1.1" 200 26238 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.178 - - [time] "GET /viewforum.php?f=x5 HTTP/1.1" 200 13482 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.177 - - [time] "GET /viewforum.php?f=x0 HTTP/1.1" 200 14550 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.178 - - [time] "GET /viewtopic.php?f=x5&p=y34 HTTP/1.1" 200 15503 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
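Entries like these can be filtered automatically to see which pages a crawler actually received. Below is a minimal Python sketch, assuming the log follows the usual combined format; the function names `bot_hits` and `served_paths` and the two sample lines are made up for illustration and are not part of the original setup.

```python
import re

# One combined-log-style entry; the bracketed timestamp is treated as opaque text.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]*)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def bot_hits(lines, bot_token="Googlebot"):
    """Return (path, status) pairs for requests whose User-Agent contains bot_token."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and bot_token.lower() in match.group("agent").lower():
            hits.append((match.group("path"), int(match.group("status"))))
    return hits


def served_paths(lines, bot_token="Googlebot"):
    """Paths answered with HTTP 200, i.e. content the bot was actually given."""
    return [path for path, status in bot_hits(lines, bot_token) if status == 200]


# Illustrative sample modeled on the entries above.
sample = [
    '66.249.71.178 - - [time] "GET /robots.txt HTTP/1.1" 404 2152 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.71.178 - - [time] "GET /viewtopic.php?f=x5&p=y96 HTTP/1.1" 200 26238 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

print(served_paths(sample))
```

A 200 response with a body of tens of kilobytes on viewtopic.php is the worrying part: the bot was served the topic page itself, not the "There are no forums" stub. Keep in mind that the User-Agent string can be spoofed, so this filter only shows who claims to be Googlebot.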
I was somewhat shocked. How? How did Google get access to the forum? By this time, the first two links were already showing up for the query “site:forum.of.site.com”.
I quickly added a robots.txt:
User-agent: Googlebot
Disallow: /
After some time, the bot reread robots.txt, but continued indexing. A week later, several dozen pages appeared in the Google cache.
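To rule out a mistake in the file itself, the rules can be checked locally. The sketch below uses Python's standard `urllib.robotparser`; it only shows how a standards-following parser reads these two lines, not how Google's crawler actually behaves. The helper name `is_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the post, verbatim.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if the given robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(is_allowed(ROBOTS_TXT, "Googlebot", "/viewtopic.php?f=x5&p=y96"))
print(is_allowed(ROBOTS_TXT, "YandexBot", "/viewforum.php?f=x5"))
```

The file is syntactically fine for Googlebot, but as written it names only Googlebot, so the other spiders seen in the logs are not covered by it. Also, robots.txt only asks crawlers not to fetch pages; it does not by itself remove pages that have already been indexed or cached.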
I started looking for ways to remove pages from the index and the cache.
Google recommends adding these lines to the HTML:
<meta name="robots" content="noarchive">
<meta name="googlebot" content="noarchive">
I did that immediately. Nevertheless, indexing continued, and the number of cached pages kept growing.
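A quick way to confirm that the tags really end up in the served pages is to parse the HTML and collect the robots directives. This is a small sketch using only Python's standard `html.parser`; the class `RobotsMetaFinder`, the function `robots_directives`, and the sample page are hypothetical and not part of phpBB.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the content of <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            values = {
                part.strip().lower()
                for part in attr.get("content", "").split(",")
                if part.strip()
            }
            self.directives.setdefault(name, set()).update(values)


def robots_directives(html):
    """Return a dict mapping meta name to the set of directives found in html."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Illustrative page head containing the two tags from the post.
page = (
    '<html><head>'
    '<meta name="robots" content="noarchive">'
    '<meta name="googlebot" content="noarchive">'
    '</head><body></body></html>'
)

print(robots_directives(page))
```

Note that `noarchive` only asks search engines not to show a cached copy. It does not ask them to keep the page out of the index; the directive for that is `noindex`, which is consistent with indexing having continued here.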
I kept searching and found the tool for submitting a web page removal request. The service is inconvenient because it lets you remove only one URL at a time and asks a lot of questions, but anyone can submit a request.
Fortunately, I found a way to remove the entire site: add it to my webmaster tools panel, verify ownership, and then remove it. Maybe the profession of SED (Search Engine Deoptimizer) will be in demand in the near future :)?
But the main question remains:
How did Google get access?
I have only one guess: one of the employees uses Google Desktop (its user-agent string points to this). Apparently Google Desktop sends cookies; in essence, it steals them. I do not think it transmits all the form data: that would be a scandal, and there are no POST requests from the bot.
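The cause named in the UPD (bots counted as registered users and given read permission) can be illustrated with a toy model. This is not phpBB's code: the `Visitor` class, the `BOT_TOKENS` list, and the `bots_as_registered` switch are all invented for this sketch, which only shows how classifying visitors by User-Agent can open a "closed" forum to crawlers.

```python
from dataclasses import dataclass

# Hypothetical list of substrings used to recognize crawlers by User-Agent.
BOT_TOKENS = ("googlebot", "yandex")


@dataclass(frozen=True)
class Visitor:
    user_agent: str
    logged_in: bool = False


def classify(visitor):
    """Put a visitor into one of three groups: registered, bot, or guest."""
    if visitor.logged_in:
        return "registered"
    agent = visitor.user_agent.lower()
    if any(token in agent for token in BOT_TOKENS):
        return "bot"
    return "guest"


def can_read(visitor, bots_as_registered):
    """Decide read access for a closed forum.

    bots_as_registered models the misconfiguration: when True, the bot group
    inherits the read permission of registered users.
    """
    group = classify(visitor)
    if group == "registered":
        return True
    if group == "bot":
        return bots_as_registered
    return False


guest = Visitor("Mozilla/5.0 (Windows NT 6.1)")
crawler = Visitor("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")

print(can_read(guest, bots_as_registered=True))
print(can_read(crawler, bots_as_registered=True))
print(can_read(crawler, bots_as_registered=False))
```

In this model an anonymous browser is refused while the crawler gets in without any cookie or password, which matches the logs: plain GET requests answered with 200 and no POST requests. The practical fix is to review what the bot group is allowed to read, not only what guests can see.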