
How many scientific articles are on the Internet?

Professor Lee Giles of the College of Information Sciences and Technology at Pennsylvania State University has devoted a significant part of his career to developing search engines for scientific articles, so that the academic community has easy access to the literature.

Recently, the professor published a first-of-its-kind study that estimates the number of scientific articles available on the Internet. The paper, "The Number of Scholarly Documents on the Public Web", was published in the May issue of the journal PLoS ONE and cited in Nature.

The study covers only English-language documents and is based on the overlap between the two largest specialized search engines: Google Scholar and Microsoft Academic Search. Scientific documents here mean journal publications and conference papers, theses and dissertations, books, technical reports and working papers (preliminary versions of scientific articles).
Statistical estimates show that at least 114 million scientific documents in English are available on the Internet, of which about 100 million can be found through Google Scholar. At least 27 million documents (24%) are freely available.

The authors adapted the capture-recapture method, which is usually used in ecology to estimate the size of animal populations. In that method, a certain number of animals are captured, marked and released into the wild. A second sample is then captured in the same area. Scientists measure the share of marked animals in the second sample and use a simple formula to make an approximate estimate of the total population size.
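The "simple formula" is not spelled out in the article. A minimal sketch, assuming it is the classic Lincoln-Petersen estimator from ecology (total ≈ first sample × second sample / overlap), might look like the following. The function name and the numbers are illustrative only and are not taken from the paper; in the study's setting the two search engines would play the role of the two samples, and the documents found by both would be the "recaptured" ones. The authors' actual procedure may differ in its details.

```python
def lincoln_petersen(first_sample: int, second_sample: int, overlap: int) -> float:
    """Estimate total population size as first_sample * second_sample / overlap.

    first_sample  -- number of individuals marked in the first capture
    second_sample -- number of individuals caught in the second capture
    overlap       -- number of marked individuals found in the second capture
    """
    if overlap <= 0:
        raise ValueError("overlap must be positive to form an estimate")
    if overlap > min(first_sample, second_sample):
        raise ValueError("overlap cannot exceed either sample size")
    return first_sample * second_sample / overlap


# Hypothetical ecology example: 200 animals marked, 150 caught later,
# 30 of which carry a mark.
estimate = lincoln_petersen(200, 150, 30)
print(estimate)  # 1000.0
```

The smaller the overlap relative to the two samples, the larger the estimated total, which is why the method depends on measuring the overlap carefully.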

The research also has practical value for Giles as a developer. Back in 1997, he and his colleagues released CiteSeer, an open search engine for scientific documents, mainly from the field of computer science. The search engine analyzed the citations and references in documents in order to build a ranked index. It is considered the first automatic citation indexing system and the predecessor of tools such as Google Scholar and Microsoft Academic Search.

In 2008, a new version, CiteSeerX, was released, expanding the coverage to physics, economics, medicine and other scientific fields. Giles is trying to assess what infrastructure is needed to index documents in each field.

Giles emphasizes that 24% of all documents are freely available on the Web as direct links through Google Scholar (in computer science, the share of freely available documents is 50%). The professor also notes that open-access documents are cited more often and carry more weight.

Source: https://habr.com/ru/post/240049/
