The Common Crawl organization has made a generous gift to developers and companies working in information retrieval and processing: Amazon S3 now hosts an open index of 5 billion web pages, complete with metadata, PageRank scores and the hyperlink graph.
If you have seen CCBot/1.0 in your web server logs, that is their crawler. The non-profit Common Crawl advocates freedom of information and has set itself the goal of building a public search index available to any developer or startup, in the expectation that this will give rise to a whole constellation of innovative web services.
The Common Crawl search cluster runs on Hadoop: the data is stored in HDFS and processed with MapReduce, after which the content is packed into ARC-format archives of roughly 100 MB each (the whole dataset is 40-50 TB). The files can either be downloaded or processed in place on EC2 with the same MapReduce. Access to the bucket requires the Amazon Requester-Pays flag, that is, it is open only to registered Amazon users (read more about Amazon Requester-Pays here). Downloading 40-50 TB over the external network will cost about $130 at current Amazon rates, while processing it with MapReduce inside EC2 incurs no transfer charge.
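For illustration (this sketch is not from the original announcement), fetching a single archive from a Requester-Pays bucket with the Python boto3 library looks roughly like the following; the bucket name and object key are made-up placeholders, and the real paths are given in the access instructions linked below:

    # Hedged sketch: download one Common Crawl ARC archive from a
    # Requester-Pays S3 bucket. The bucket and key below are hypothetical
    # placeholders; transfer charges are billed to your own Amazon account.
    import boto3

    s3 = boto3.client("s3")

    BUCKET = "commoncrawl-example"              # placeholder bucket name
    KEY = "crawl-001/segment-0/content.arc.gz"  # placeholder ARC archive key

    # List a few objects under a prefix; RequestPayer signals that we accept the charges.
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="crawl-001/",
                              MaxKeys=10, RequestPayer="requester")
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])

    # Download one ~100 MB ARC archive to local disk.
    s3.download_file(BUCKET, KEY, "content.arc.gz",
                     ExtraArgs={"RequestPayer": "requester"})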
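And here is a hedged sketch of the kind of MapReduce job one could run over the ARC archives inside EC2, written with the mrjob library (my choice, not something the project prescribes); it merely tallies pages per content type by looking at ARC v1 record header lines (URL, IP, date, content type, length) and assumes Hadoop decompresses the .gz input transparently:

    # Hedged sketch, not the project's own code: count pages per content type
    # using only the ARC v1 record header lines.
    from mrjob.job import MRJob

    class ContentTypeCount(MRJob):
        def mapper(self, _, line):
            parts = line.split()
            # An ARC v1 header line looks roughly like:
            #   http://example.com/ 1.2.3.4 20120101000000 text/html 12345
            if len(parts) == 5 and parts[0].startswith("http"):
                yield parts[3], 1

        def reducer(self, key, counts):
            yield key, sum(counts)

    if __name__ == "__main__":
        ContentTypeCount.run()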
The data is available with almost no restrictions: see the data access instructions and the terms of use. The only things forbidden are re-uploading the downloaded data elsewhere, selling access to it, or using it in any illegal way.
It is worth adding that the Common Crawl Foundation is headed by Gilad Elbaz, well known in narrow circles as the main developer of Google AdSense and the CEO of the startup Factual.