
After reading a few posts about the fight against shameless copy-pasting, I was left with the feeling that something simple was missing, something that could cool off the bulk of the copy-pasters and the resources where they thrive.
The idea is this: if everything on the Internet revolves around search engines (top positions, traffic, SEO), then why shouldn't the search engines take on the protection of the rights of authors of original content?
Some of what you read here will sound like a repetition of thoughts voiced long ago, but I want to put together a complete scheme and figure out why nothing like it exists on the web yet.
So:
You cannot protect freely available information from being copied. But you can try to make such copying unprofitable.
Let's start from the fact that resources which publish content want to be clearly visible in search engines, since this brings extra traffic and, accordingly, money. This is what we can play on: excluding resources, or their individual pages, from the index if the republishing rules are violated.
Judging by the posts and comments I have read, this is far from new and is in fact already being done today, but only on the initiative of the rights holder. The standard practice is to file complaints with search engines and hosting providers, through special forms, against the sites that wronged you. But this practice is very inefficient, because it takes time for your complaint to be reviewed manually. The main problem is the lack of automation.
There are many reasons for that lack of automation, but the main one is that anyone can file a complaint against any site.
What to do?
Everyone is interested in high-quality, original, non-duplicated content: the author, the news resource, and the search engines (a link to the original in the search results). Advertising content is left out here, since its uncontrolled copying and distribution only plays into the advertiser's hands.
But of these three, only the search engines have the power to influence the situation. That is why a solution should be expected from them, and not from an individual news agency introducing protection for its own content.
Moreover, search engines are able to automate such a process. All that is needed for automation is to know exactly who the author of the content is. Obviously, the author is the one who had the content first. It follows that for players like Google and Yandex it would be enough, alongside the standard "add a website" form, to provide an "add original content" form. Whoever uses this form first is considered the author.
The application form is very simple (a rough sketch of such a request follows the list below):
- The content itself (not the HTML page with all its clutter, but the clean text of the publication plus its title, so as not to burden the system with unnecessary information)
- The URL (or several, if the content is paginated) at which the content will be available on the web. It is important that at the moment the form is submitted, this page has no external links pointing to it and has not yet appeared in the RSS feed. In other words, it is accessible from outside, but effectively at a secret address. This is not strictly required, but it is desirable that no one else has had a chance to see the content on the network and submit such an "application for authorship" before you. After submitting this form (and receiving confirmation that the search engine has recorded it in its database), the content page can be opened to the public and to robots (pushed to RSS and linked from other pages of the site).
Ideally, this request (adding the record to the database and indexing the page) should be processed in real time rather than queued (after all, it is not a whole-site submission), but even that is not strictly necessary: the presence of the first such request in the queue blocks later applications for the same content. And in either case, once such an application has been processed, whenever the spider later indexes a page with the same content it can already determine whether it is the original or a copy.
- Additional fields that could be offered to the author of the publication:
- The way in which use of this content on other sites is permitted: not at all; only with a link to the original; only a small excerpt with a link to the original; only the title with a link; unrestricted use.
- The ability to maintain lists of sites that get exceptions, and of what kind (a complete ban for competitors, complete freedom for partners, specific conditions for aggregators, and so on). This way you could easily allow posting both on Habr and in a personal blog, for example.
- A list of small competing local or industry-scale sites that should be checked during indexing (this is needed for targeted verification, since checking absolutely every resource on the network whenever new publications appear on it will, I think, never be possible for technical reasons)
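To make the shape of such an application concrete, here is a minimal sketch of what the request could look like. Everything in it is hypothetical: the endpoint, the field names and the policy values are illustrative assumptions, not any existing search engine API.

```python
import requests  # assumed available; any HTTP client would do

# Hypothetical endpoint: no search engine actually exposes this today.
CLAIM_ENDPOINT = "https://search-engine.example/api/claim-original-content"

application = {
    # Clean text of the publication plus its title, not the full HTML page.
    "title": "My article title",
    "text": "Plain text of the publication...",
    # URL(s) where the content will live; at submission time the page should
    # have no inbound links and must not yet appear in the RSS feed.
    "urls": ["https://example-news.example/articles/my-article"],
    # How other sites may reuse the content (one of the options listed above).
    "reuse_policy": "excerpt_with_link",  # none | link_required | excerpt_with_link | title_only | unrestricted
    # Per-site exceptions: stricter or looser rules for specific domains.
    "exceptions": {
        "habr.com": "unrestricted",        # full freedom for a partner
        "competitor.example": "none",      # complete ban for a competitor
    },
    # Small competing sites that should still be checked during indexing.
    "watchlist": ["local-competitor.example"],
}

response = requests.post(CLAIM_ENDPOINT, json=application, timeout=10)
response.raise_for_status()
# Only after the engine confirms that the claim is recorded should the page
# be opened to readers, to robots and to the RSS feed.
print(response.json())
```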
Obviously, if the author generates content frequently, the search engine should let them save these settings in the personal account of the author or news agency, to be reused in subsequent applications for publication authorship.
Ideally, so that the application does not have to be sent to every search engine, and the authorship database and per-author publication settings are not duplicated everywhere, such a service should exist independently of the search engines, with the latter consulting this common base when indexing and ranking pages with identical content.
Exactly such a scheme was suggested by flashvoid in one of the comments. But he proposed going from the service to the search engines:
And once a significant base of signed articles has accumulated, it will be possible to offer search engines a universal API so that they surface the original articles in search results.
Of course that is the more correct approach, but given what exists today, I believe the search engines should take the initiative, since they already have the leverage to affect the profitability of copy-pasting. It is unclear who would finance an independent international service, which smells of ending up as a paid service for authors, whereas for the search engines this scheme helps them improve the quality of their results. It would be enough for them to agree on a single standard for publisher account settings (much as happened with sitemaps, so that one settings file works for all search engines).
Sooner or later, every self-respecting search engine will have to implement such a scheme, otherwise it will lose to competitors that answer a query with a quality link to the original, while it answers with an artificially boosted page of stolen content that greets the reader with banners of every rainbow color and pop-up windows of every possible shape and size.
What can a search engine do when it indexes a page with stolen content? That depends on the search engine's strategy: anything from excluding the resource from the index (after a warning), excluding the specific page and lowering the resource's rating, to lowering the rating of that particular page so that it never appears above the original, even if it has more links pointing to it. All of this should be accompanied by clear messaging, so that resource owners understand what they did wrong and will not repeat it, instead of being dumbly surprised by a drop in their positions or the disappearance of their site from the index. Those messages can be delivered through the search engine's webmaster services, which every normal site tries to be connected to. It is precisely this strategy that will ultimately determine the reputation and quality of a search engine's results.
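To illustrate what such a strategy might look like inside the engine, here is a minimal sketch of an escalation policy. The thresholds, the notion of a "whole-site scraper" flag and the set of actions are my own assumptions for illustration, not anything a search engine has published.

```python
from enum import Enum

class Action(Enum):
    WARN = "warn via webmaster tools"
    DEMOTE_PAGE = "rank the copied page below the original"
    DEMOTE_SITE = "lower the rating of the whole resource"
    DEINDEX_SITE = "exclude the resource from the index"

def choose_action(prior_violations: int, whole_site_scraper: bool) -> Action:
    """Pick an escalating penalty; thresholds are illustrative assumptions."""
    if prior_violations == 0:
        return Action.WARN                 # first offence: explain, don't punish
    if whole_site_scraper and prior_violations >= 5:
        return Action.DEINDEX_SITE         # systematic scraping of many originals
    if prior_violations >= 3:
        return Action.DEMOTE_SITE
    return Action.DEMOTE_PAGE              # at minimum, never above the original

print(choose_action(prior_violations=1, whole_site_scraper=False))
```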
One more thing. Search engines would need to provide an API through which anyone can find out what restrictions the author has placed on the use of a given piece of content.
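A minimal sketch of how a publisher might consult such an API before republishing something questionable; the endpoint and the response fields are assumptions for illustration only, mirroring the policy values from the application sketch above.

```python
import requests

# Hypothetical lookup endpoint of the shared authorship base.
RESTRICTIONS_ENDPOINT = "https://search-engine.example/api/content-restrictions"

def get_restrictions(original_url: str) -> dict:
    """Ask the (hypothetical) authorship base how a given original may be reused."""
    resp = requests.get(RESTRICTIONS_ENDPOINT, params={"url": original_url}, timeout=10)
    resp.raise_for_status()
    return resp.json()  # e.g. {"reuse_policy": "excerpt_with_link", "exceptions": {...}}

# An editor checks before republishing a questionable piece:
rules = get_restrictions("https://example-news.example/articles/my-article")
if rules["reuse_policy"] == "none":
    print("Do not republish at all.")
elif rules["reuse_policy"] in ("excerpt_with_link", "title_only"):
    print("Publish only an excerpt or the title, with a link back to the original.")
```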
As a result
A site such as a news agency, for which its position in a search engine matters (traffic = profit, and for repeated violation of restrictions it can not only be downranked but banned), will not allow itself to publish someone else's content in violation of the restrictions set by the author (it is easy to check whether there is a link to the original, how large the republished portion is compared to the original, and so on); before any questionable publication it will run a check through the API and insure itself against unpleasant consequences.
With the blogosphere, discussion services and aggregators the situation is simpler: they simply should not rank above the original. And a search result where the original comes first and the discussion of it on a discussion service comes second is actually quite useful.
For most small sites nothing will change. They do not affect anything, so there is no point in bothering with them, except at the author's request, or when the author has included such a resource in the list of sites worth checking (a competing local or topical small site that would not be checked by default).
As for translations, I think the normal practice should be translating with the consent of the author of the original, followed by the translator filing an application for authorship of the translation.
In any case, an application will need to be filed for every publication, even with the "unrestricted use" option, so that no one else takes your material and files an application claiming its authorship with a ban on republishing.
In my opinion, such a scheme is badly needed, effective and, most importantly, feasible.
The pros and cons as I see them:
- A lower search ranking may not be critical for resources that build their own audience and do not depend on search traffic.
- What to do when genuinely similar texts are written about the same event and the system considers them duplicates? Or, conversely, a text can be altered (and checked through the API) until the system no longer sees the similarity, and then passed off as one's own (though a heavy rewrite starts to look like original work, and it is blatant copy-pasting we are fighting).
- When indexing, the search engine faces the resource-intensive task of comparing incoming content against the database to detect duplicates and violations of the republishing rules. This, it seems to me, is the reason such a scheme has not been implemented so far. But computing resources grow day by day, so it is only a matter of time. On the other hand, it would be silly, when indexing a new page on narod.ru without a single external link, to immediately check it against the entire database. It is enough to carefully check only the most actively publishing resources (those with a large audience that shape where the traffic goes) and only the most recent publications (again, the ones that still attract traffic). That is, checks of small sites can safely be postponed (or skipped altogether), as can checks against old publications whose relevance has long faded, while leaving the rights holder the option to request such a check: on seeing a copy of his ten-year-old article on his neighbour's portal, he sends a link to the site and the system automatically performs the check and adjusts the positions in the results (this is already feasible, since the volume of such requests will be tolerable). A rough sketch of such a duplicate comparison is given after this list. And once again I emphasize the main thing: everything is automated.
+ The search engines have all the levers, resources and technical capabilities (well, almost all) needed to implement such a scheme.
+ The solution is international and free. (In a recent post a solution was proposed at the level of state control, tied to legislation, or as a paid service.)
+ Full automation of processes that badly need it today.
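To show that the comparison mentioned in the pros-and-cons list above is not exotic, here is a minimal sketch of duplicate detection using word shingles and Jaccard similarity. The threshold and the in-memory stand-in for the registered-originals database are assumptions for illustration, not how any real search engine does it.

```python
import re

def shingles(text: str, size: int = 5) -> set:
    """Split text into overlapping word n-grams ('shingles')."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets (0.0 .. 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Toy in-memory stand-in for the registered-originals database.
registered = {
    "https://example-news.example/original": "full text of the registered original ...",
}

def find_original(candidate_text: str, threshold: float = 0.6):
    """Return the URL of a registered original that the candidate duplicates, if any."""
    for url, original_text in registered.items():
        if similarity(candidate_text, original_text) >= threshold:
            return url
    return None
```

In practice an engine would of course use something far more scalable (locality-sensitive hashing rather than pairwise comparison), but the point stands: determining "original or copy" against a registered base is a solvable, automatable task.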
Why do you think this scheme has not been implemented yet?
Ideally, I would like to hear the thoughts of representatives of Yandex or Google.