
10 tools for parsing information from websites, including competitors' prices, plus a legal assessment for Russia


Web scraping (parsing) tools are designed to extract and collect any publicly available information from websites. They are useful whenever you need to quickly obtain data from the Internet and save it in a structured form. Site parsing is a new way of entering data that requires no retyping or copy-and-paste.

This kind of software searches for information under user control or automatically, selects new or updated data, and stores it so that the user has quick access to it. For example, parsing can be used to collect information about products and their prices on Amazon. Below we look at typical use cases for web data extraction tools and review the ten best services that help collect information without writing any code. Parsing tools can be used for different purposes and in different scenarios; we cover the most common use cases that may be useful to you, and we also give a legal assessment of parsing in Russia.
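
To make the idea concrete, here is a minimal sketch in Python of what any of these tools does under the hood: download a page and pull a couple of fields out of its HTML. The URL and the CSS selectors are hypothetical, and the requests and BeautifulSoup libraries merely stand in for the point-and-click interfaces discussed below.

```python
# A minimal parser: download a page and pull out structured fields.
# The URL and CSS selectors are hypothetical -- on a real site you
# would adjust them to its actual markup.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/product/123"          # hypothetical product page
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one("h1.product-title")      # assumed selector
price = soup.select_one("span.price")            # assumed selector

print({
    "title": title.get_text(strip=True) if title else None,
    "price": price.get_text(strip=True) if price else None,
})
```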

1. Data collection for market research

Web data extraction services help track where a company or an entire industry is heading over the next six months, providing a powerful foundation for market research. Parsing software can pull data from a variety of analytics providers and market research firms and consolidate it in one place for reference and analysis.
2. Retrieving Contact Information

Parsing tools can be used to collect and organize data such as email addresses and contact details from various sites and social networks. This makes it possible to build convenient contact lists and all related information for business: data about customers, suppliers or manufacturers.
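
As a rough illustration, not tied to any particular service, harvesting contacts from a page can be as simple as scanning its text with a regular expression; the URL below is a placeholder and the pattern is deliberately simplified.

```python
# Scan a page's text for e-mail addresses with a simple regular
# expression. The URL is a placeholder; the pattern is simplified.
import re
import requests

page = requests.get("https://example.com/contacts", timeout=10).text
emails = set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", page))
print(sorted(emails))
```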

3. Downloading solutions from StackOverflow

With website parsing tools you can collect data from a large number of web resources (including StackOverflow) and build solutions for offline use and storage. This way you avoid depending on an active Internet connection: the data remains available whether or not you can get online.

4. Looking for a job or for employees

For an employer actively looking for candidates, or for a job seeker hunting for a specific position, parsing tools are likewise indispensable: with their help you can configure data selection using various filters and retrieve information efficiently, without routine manual searching.

5. Tracking prices in different stores

Such services are useful for people who shop online a lot, track product prices, and look for items in several stores at once.

The review below does not include the Russian parsing and price monitoring service XMLDATAFEED (xmldatafeed.com), developed in St. Petersburg and focused mainly on collecting prices for subsequent analysis. Its main task is to build a decision support system for pricing management based on open competitor data. One curious detail worth highlighting is its publication of parsing data in real time :)
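
To show what price monitoring boils down to in code, here is a small sketch that fetches the same product from several shops and prints the prices side by side. The store URLs and selectors are made up; a production system like the one described above would add scheduling, storage and matching of identical products.

```python
# Fetch the price of the same product from several shops and compare.
# Store URLs and selectors are hypothetical.
import requests
from bs4 import BeautifulSoup

STORES = {
    "shop-a": ("https://shop-a.example/item/42", "span.price"),
    "shop-b": ("https://shop-b.example/catalog/42", "div.product-price"),
}

def fetch_price(url, selector):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None

for store, (url, selector) in STORES.items():
    print(store, fetch_price(url, selector))
```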


Top 10 Web-Based Data Collection Tools:


Let's look at the ten best available parsing tools. Some of them are free, some offer a free trial for a limited period, and some have various paid plans.

1. Import.io

Import.io lets developers easily build their own datasets: you only need to import data from a particular web page and export it to CSV. You can extract thousands of web pages in minutes without writing a single line of code and build thousands of APIs to your requirements.


To collect the huge amounts of information a user needs, the service uses the latest technology, and at a low price. Along with the web tool, free applications for Windows, Mac OS X and Linux are available for creating data extractors and crawlers that download data and synchronize it with your online account.
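
The "export to CSV" step that such a tool automates looks roughly like this when written by hand; the rows here are dummy data standing in for whatever an extractor has collected.

```python
# Writing extracted rows to a CSV file -- the same result the tool's
# one-click export produces. The rows below are dummy data.
import csv

rows = [
    {"title": "Widget A", "price": "19.99", "url": "https://example.com/a"},
    {"title": "Widget B", "price": "24.50", "url": "https://example.com/b"},
]

with open("export.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "url"])
    writer.writeheader()
    writer.writerows(rows)
```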

2. Webhose.io

Webhose.io provides direct, real-time access to structured data obtained by parsing thousands of online sources. The parser can collect web data in more than 240 languages and save the results in various formats, including XML, JSON and RSS.


Webhose.io is a browser-based application that uses its own data parsing technology to process huge amounts of information from multiple sources through a single API. Webhose offers a free plan covering 1,000 requests per month and a $50 premium plan covering 5,000 requests per month.
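
Working with a hosted parsing API of this kind generally comes down to a single HTTP request with a token and a query, with JSON coming back. The endpoint and parameter names below are illustrative only and are not Webhose's documented API.

```python
# One request with a token and a query, JSON back. The endpoint and
# parameter names are illustrative, not Webhose's documented API.
import requests

API_TOKEN = "YOUR_TOKEN"                          # placeholder
resp = requests.get(
    "https://api.example-webdata.io/search",      # hypothetical endpoint
    params={"token": API_TOKEN, "q": "language:english", "format": "json"},
    timeout=30,
)
resp.raise_for_status()
for post in resp.json().get("posts", []):
    print(post.get("title"))
```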

3. Dexi.io (formerly CloudScrape)

CloudScrape can parse information from any website and, like Webhose, requires no additional applications to be downloaded. Its editor sets up crawlers and extracts data in real time. The collected data can be saved to cloud storage such as Google Drive or Box.net, or exported in CSV or JSON format.


CloudScrape also provides anonymous access to data, offering a set of proxy servers that help hide the user's identity. CloudScrape stores data on its servers for two weeks and then archives it. The service offers 20 hours of work for free, after which it costs $29 per month.

4. Scrapinghub

Scrapinghub is a cloud-based data parsing tool that helps you select and collect the data you need for any purpose. Scrapinghub uses Crawlera, a smart proxy rotator with mechanisms for bypassing bot protection, so the service can handle huge volumes of information and sites protected against robots.


Scrapinghub converts web pages into organized content. A team of specialists offers an individual approach to customers and promises to develop a solution for any unique case. The basic free package gives access to one crawler (processing up to 1 GB of data, then $9 per month); the premium package gives four parallel crawlers.
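
Scrapinghub comes from the team behind the open-source Scrapy framework, so a crawler of the kind you would deploy to such a cloud service can be sketched as a minimal Scrapy spider; the start URL and selectors here are placeholders.

```python
# A minimal Scrapy spider. The start URL and CSS selectors are placeholders.
import scrapy

class ProductSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com/catalog"]   # placeholder

    def parse(self, response):
        for item in response.css("div.product"):   # assumed selector
            yield {
                "title": item.css("h2::text").get(),
                "price": item.css(".price::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as spider.py, it can be tested locally with "scrapy runspider spider.py -o items.json" before being handed over to a cloud crawler.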

5. ParseHub

ParseHub can parse one site or many, with support for JavaScript, AJAX, sessions, cookies and redirects. The application uses machine learning to recognize even the most complex documents on the web and generates an output file in the format the user needs.


Beyond the web application, ParseHub also exists as a desktop program for Windows, Mac OS X and Linux. The program gives you five free trial projects. The $89 Premium plan includes 20 projects and 10,000 web pages per project.
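
ParseHub handles JavaScript-heavy pages for you; a hand-rolled analogue is to drive a headless browser and read the page only after the scripts have run. The sketch below assumes Chrome and the selenium package; the URL and selector are placeholders.

```python
# Drive a headless Chrome, let the page's JavaScript run, then read
# the rendered DOM. URL and selector are placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
try:
    driver.get("https://example.com/spa-catalog")  # JS-heavy page (placeholder)
    titles = driver.find_elements(By.CSS_SELECTOR, "div.product h2")
    print([el.text for el in titles])
finally:
    driver.quit()
```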

6. VisualScraper

VisualScraper is another tool for parsing large amounts of information from the web. VisualScraper extracts data from several web pages and synthesizes the results in real time. In addition, the data can be exported in CSV, XML, JSON and SQL formats.


A simple point-and-click interface makes it easy to collect and manage web data. VisualScraper offers plans starting at $49 per month for processing more than 100,000 pages. Like ParseHub, there is also a free Windows application, with additional paid features available.

7. Spinn3r

Spinn3r lets you parse data from blogs, news sites, RSS and Atom feeds, and social networks. Spinn3r has a constantly updated API that does 95 percent of the indexing work, including improved spam protection and enhanced data security.


Spinn3r indexes content much like Google and saves the extracted data in JSON files. The tool constantly scans the web and finds updates from a multitude of sources, so the user always has information that is current in real time. An administration console lets you manage the crawling process, and full-text search is available.
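
Spinn3r's feed crawling, boiled down to a single feed, looks like this with the feedparser library; the feed URL is a placeholder, and this is only an analogue, not Spinn3r's own API.

```python
# Read an RSS/Atom feed and print structured entries.
# The feed URL is a placeholder.
import feedparser

feed = feedparser.parse("https://example.com/blog/rss")
for entry in feed.entries[:10]:
    print(entry.get("published", "?"), entry.get("title"))
```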

8. 80legs

80legs is a powerful and flexible website parsing tool that can be tailored very precisely to the user's needs. The service handles astonishingly large amounts of data and offers instant retrieval. Among 80legs customers are giants such as MailChimp and PayPal.


The Datafiniti option lets you find data extremely fast: thanks to it, 80legs provides a highly efficient search network that selects the necessary data within seconds. The service offers a free package of 10,000 links per session, which can be upgraded to the Intro plan for $29 per month with 100,000 URLs per session.

9. Scraper

Scraper is a Chrome extension with limited data parsing functionality, but it is useful for online research and for exporting data to Google Spreadsheets. The tool suits both beginners and experts, who can easily copy data to the clipboard or into a spreadsheet using OAuth.


Scraper is a free tool that runs directly in the browser and automatically generates XPath expressions to identify the elements to extract. The service is quite simple: there is no full automation and no crawlers as in Import.io or Webhose, but that can be seen as an advantage for beginners, since it does not take long to configure before getting the desired result.
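
Doing by hand what the Scraper extension does for you means writing the XPath yourself, for example with lxml; the URL and the expression below are placeholders.

```python
# Extract data with a hand-written XPath expression using lxml.
# The URL and the expression are placeholders.
import requests
from lxml import html

page = requests.get("https://example.com/table-page", timeout=10)
tree = html.fromstring(page.content)
# e.g. the text of the second cell of every table row
cells = tree.xpath("//table//tr/td[2]/text()")
print(cells[:20])
```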

10. OutWit Hub

OutWit Hub is a Firefox add-on with dozens of data extraction features. The tool can automatically browse pages and store the extracted information in a format convenient for the user. OutWit Hub offers a simple interface for extracting small or large amounts of data as needed.


OutWit lets you scrape any web page right from the browser and even create automatic agents in the settings panel that extract data and save it in the required format. It is one of the simplest free web data collection tools and requires no special coding skills.

The most important question: is parsing legal?



Does an organization have the right to carry out automated collection of information that is publicly available on websites on the Internet (parsing)?

Under the legislation in force in the Russian Federation, everything that is not prohibited by law is permitted. Parsing is legal as long as it does not violate the prohibitions established by law; in other words, automated data collection must comply with current legislation. The legislation of the Russian Federation establishes the following restrictions related to the Internet:

1. Infringement of copyright and related rights is not allowed.
2. Unauthorized access to legally protected computer information is not permitted.
3. Collecting information that constitutes a trade secret by illegal means is not allowed.
4. Deliberately unfair exercise of civil rights (abuse of right) is not allowed.
5. Using civil rights in order to restrict competition is not allowed.

From the above prohibitions it follows that an organization has the right to carry out automated collection of information that is publicly available on websites on the Internet if the following conditions are met:

1. The information is in the public domain and is not protected by copyright or related rights.
2. The collection is performed by lawful means.
3. The automated collection does not disrupt the operation of the websites concerned.
4. The automated collection does not restrict competition.

Subject to these restrictions, parsing is legal.
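
A practical complement to these conditions (a sketch, not legal advice): before collecting anything, check the site's robots.txt and throttle your requests so the crawl cannot disrupt the site. The user agent string and URLs below are placeholders.

```python
# Check robots.txt and throttle requests before collecting anything.
# The user agent and URLs are placeholders; this is hygiene, not legal advice.
import time
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "my-research-bot/0.1"                 # placeholder
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    if not rp.can_fetch(USER_AGENT, url):
        print("skipped by robots.txt:", url)
        continue
    requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(2)  # simple throttling between requests
```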

P.S. On the legal issue we have prepared a separate article, which covers both Russian and foreign experience.

Which data extraction tool do you like best? What kind of data would you like to collect? Tell us in the comments about your experience with parsing and your view of the process...

Source: https://habr.com/ru/post/340038/

