
Big Brother is watching you, or the dark side of the Internet

UPD: phpBB is actually to blame: it treats bots as registered users and gives them read permissions. Thanks to khim and mikes.
Some years ago I read somewhere that Google was going to index the “dark side of the Internet”: all kinds of databases, closed libraries and paid sites in general. That is, information you can view only after entering at least a username and password. By some estimates, such “dark” information makes up 90 to 98% of the Internet.
At the time I was delighted: it would become possible to read experts-exchange.com (yes, I know about the End key) and similar sites that I used.

But recently I needed to set up an internal forum for an organization. The organization is quite large and spread across the country. The task was to give geographically distributed employees a simple way to communicate within the organization. The forum was meant for discussing internal information that competitors, to put it mildly, should not have access to.

What I did:
However, a week later I noticed googlebot, yandexbot, and other lesser-known spiders in the logs. That did not bother me: there are plenty of services that show DNS statistics, and search engines could have reached the forum through them.
A month later, though, I saw in the logs that Google was indexing the forum:
 66.249.71.178 - - [time] "GET /robots.txt HTTP/1.1" 404 2152 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
 66.249.71.178 - - [time] "GET / HTTP/1.1" 200 17743 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
 66.249.71.178 - - [time] "GET /viewtopic.php?f=x5&p=y96 HTTP/1.1" 200 26238 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
 66.249.71.178 - - [time] "GET /viewforum.php?f=x5 HTTP/1.1" 200 13482 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
 66.249.71.177 - - [time] "GET /viewforum.php?f=x0 HTTP/1.1" 200 14550 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
 66.249.71.178 - - [time] "GET /viewtopic.php?f=x5&p=y34 HTTP/1.1" 200 15503 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
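
Picking crawler requests out of a busy log by eye gets tedious. Below is a minimal sketch of a filter, assuming the usual Apache combined log format; the function name `crawler_hits` and the marker list `BOT_MARKERS` are hypothetical and not part of the original setup.

```python
import re

# Hypothetical list of substrings that identify crawlers; extend as needed.
BOT_MARKERS = ("googlebot", "yandex", "bingbot", "slurp")

# Combined log format:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]*)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_hits(lines):
    """Yield (ip, path, status) for requests whose user agent looks like a crawler."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        if any(marker in agent for marker in BOT_MARKERS):
            yield match.group("ip"), match.group("path"), int(match.group("status"))
```

The user-agent string is self-reported, so a filter like this only finds crawlers that identify themselves honestly.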


I was somewhat shocked. HOW? How did Google get access to the forum? By that time the first two links were already showing up for the query “site:forum.of.site.com”.
I quickly added a robots.txt:
	 User-agent: Googlebot
	 Disallow: /
	

After some time, the bot reread robots.txt, but continued indexing. A week later, several dozen pages appeared in the Google cache.
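
Keep in mind that robots.txt is only a request that well-behaved crawlers choose to honor; it does not restrict access by itself. As a rough sketch, the rules above can be checked with Python's standard `urllib.robotparser`. The helper name `is_allowed` is hypothetical, and `forum.of.site.com` is the placeholder host from this article.

```python
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt shown above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "http://forum.of.site.com/viewforum.php"
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", url)  # rule applies to Googlebot
yandex_ok = is_allowed(ROBOTS_TXT, "YandexBot", url)     # no rule matches this agent
```

With these rules only Googlebot is asked to stay away; the other spiders seen in the logs are still permitted unless a `User-agent: *` section is added.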
I started looking for a way to remove pages from the index and the cache.
Google recommends adding these lines to the HTML:
	 <meta name="robots" content="noarchive">
	 <meta name="googlebot" content="noarchive">
	

I did that immediately. Nevertheless, the indexing continued, and the number of pages in the cache kept growing.
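
Worth noting: `noarchive` only asks search engines not to keep a cached copy of a page. It does not ask them to skip indexing, which is what the `noindex` directive is for. Here is a minimal sketch, using Python's standard `html.parser`, of checking which directives a page actually carries. The names `RobotsMetaParser` and `robots_directives` are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        name = attributes.get("name", "").lower()
        if name in ("robots", "googlebot"):
            content = attributes.get("content", "")
            parts = {p.strip().lower() for p in content.split(",") if p.strip()}
            self.directives.setdefault(name, set()).update(parts)


def robots_directives(html):
    """Return a dict mapping meta name to the set of directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = (
    '<html><head>'
    '<meta name="robots" content="noarchive">'
    '<meta name="googlebot" content="noarchive">'
    '</head></html>'
)
found = robots_directives(page)
# True only if some tag asks crawlers not to index the page at all.
blocks_indexing = any("noindex" in values for values in found.values())
```

For the tags above, `blocks_indexing` comes out false, which is consistent with the indexing continuing.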

I kept searching and found the web page removal request tool. The service is inconvenient because it lets you remove only one URL at a time and asks a lot of questions, although anyone can submit a request.
Fortunately, I found a way to remove the entire site: add it to my webmaster tools dashboard, verify ownership, and then remove it. Maybe the profession of SED (Search Engine Deoptimizer) will soon be in demand :)?

But the main question remains:

How did Google get access?


I have only one guess: one of the employees uses Google Desktop (its user-agent string points to this). Apparently Google Desktop sends cookies, which essentially means it steals them. I do not think it transmits all form data: that would be a scandal, and there are no POST requests from the bot.
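
As the update at the top of the post notes, the forum software itself granted read access to requests that identified themselves as bots. The proper fix is to revoke that permission in the forum's settings. As a stopgap, self-identified crawlers can also be refused before the request reaches the forum. Here is a minimal sketch of that idea as WSGI middleware. It assumes a Python front end, which the original phpBB setup did not have; `BlockCrawlers`, `forum_app` and `BOT_MARKERS` are hypothetical names.

```python
# Hypothetical substrings identifying crawlers by their User-Agent header.
BOT_MARKERS = ("googlebot", "yandex", "bingbot")


class BlockCrawlers:
    """WSGI middleware that answers 403 to requests from self-identified crawlers."""

    def __init__(self, app, markers=BOT_MARKERS):
        self.app = app
        self.markers = tuple(marker.lower() for marker in markers)

    def __call__(self, environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(marker in agent for marker in self.markers):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else is passed through to the wrapped application.
        return self.app(environ, start_response)


def forum_app(environ, start_response):
    """Stand-in for the real forum application."""
    body = b"forum page\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = BlockCrawlers(forum_app)
```

This only stops crawlers that announce themselves. Content that must stay private should require a login for every page, whatever the user agent says.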


Source: https://habr.com/ru/post/47055/

