I would like to share a review of scientific papers, which I co-authored, on assessing the quality of Wikipedia in different languages. I publish on this topic mainly in English and Polish, and I decided to share my knowledge and experience in this field with the Russian-speaking audience, choosing Habrahabr for this first article. I will be glad to hear comments and suggestions; perhaps someone will be interested in collaborating in this direction. In future articles I plan to cover the individual methods and algorithms for analyzing article quality in different languages in more detail. I also plan to post code samples (mainly in Python) that can be useful for extracting and analyzing data from Wikipedia.

Although Wikipedia is often criticized for poor quality, it remains one of the most popular knowledge bases in the world. This online encyclopedia currently ranks fifth among the most visited sites in the world (after Google, YouTube, Facebook, and Baidu). Its articles are created and edited in about 300 languages, and Wikipedia now has over 46 million articles covering a wide range of topics.
The number of Wikipedia articles grows every day. They can be created and edited even by anonymous users, and authors do not need to formally demonstrate their skills, education, or experience in a given area. Wikipedia has no central editorial board or group of reviewers who could comprehensively check all new and existing texts. For these and other reasons, people often criticize the very concept of Wikipedia, pointing in particular to the poor quality of its information.
Nevertheless, you can find valuable information on Wikipedia, depending on the language version and the subject. Almost every language version has an award system for the best articles; however, there are very few such articles (less than one percent). In some language versions other quality ratings can be assigned as well, yet the overwhelming majority of articles have no rating at all (in some languages, more than 99%).
Automatic assessment of the quality of Wikipedia articles
So, many Wikipedia articles have no quality rating, and each reader must analyze their content independently. Automatic quality assessment of Wikipedia articles is not a new topic in the scientific world. Most research papers concern the most developed language version, English, which already contains more than 5.5 million articles. I study different language versions of Wikipedia: English, Russian, Polish, Belarusian, Ukrainian, German, French, and others.
Since Wikipedia's inception, and as its popularity has grown, more and more scientific publications on this topic have appeared. One of the first studies showed that measuring the volume of content can help determine the degree of "maturity" of an article. Work in this direction shows that, in general, higher-quality articles are long, use references consistently, are edited by hundreds of authors, and have thousands of revisions (versions).
How does one arrive at such conclusions? Simply put: by comparing good and bad articles.

As mentioned earlier, almost every language version of Wikipedia has a system for assessing article quality. The best articles are marked in a special way: they receive a special "badge". In Russian Wikipedia such articles are called "Избранные статьи" (ИС); in English Wikipedia they are "Featured Articles". There is another badge for articles that fall just short of featured status: "Хорошие статьи" (ХС) in the Russian version, "Good Articles" in the English one. In some language versions there are further ratings for "weaker" articles. For example, Russian Wikipedia also has: High-quality (Добротная), Complete (Полная), Developed (Развитая), In development (В развитии), Stub (Заготовка). In the English version you can find more: A-class, B-class, C-class, Start, Stub. The example of the English and Russian versions alone shows that grading scales differ between languages. Moreover, not all language versions of Wikipedia have such a developed quality-assessment system. For example, the German Wikipedia, which contains more than 2 million articles, uses only two grades, analogues of Featured and Good Articles.
Therefore, in scientific papers these grades are often combined into two groups [1] [2] [3] [4] [5] [6] [7]:
- "Complete": Featured and Good articles,
- "Incomplete": all other grades.
Let's call this method "binary" (1 = Complete articles, 0 = Incomplete articles). Such a division naturally "blurs" the boundaries between individual classes, but it makes it possible to build and compare quality models for different language versions of Wikipedia.
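In code, this binary labeling can be sketched as follows. The grade names here are illustrative (English Wikipedia style); each language version uses its own labels:

```python
# Sketch: mapping per-language quality grades to the "binary" scheme.
# Grade names are illustrative; real dumps use each wiki's own labels.

COMPLETE_GRADES = {"FA", "GA"}  # Featured / Good articles

def binary_label(grade: str) -> int:
    """Return 1 for 'Complete' (Featured/Good), 0 for everything else."""
    return 1 if grade in COMPLETE_GRADES else 0

labels = [binary_label(g) for g in ["FA", "C", "GA", "Stub"]]
print(labels)  # [1, 0, 1, 0]
```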
Data mining
To build such models, you can use various algorithms, in particular from the field of data mining. In my work I often use one of the most common and effective algorithms, Random Forest [1] [2] [3] [4] [5] [6] [7]. There are even studies [4] that compare it with other algorithms (CART, SMO, Multilayer Perceptron, LMT, C4.5, C5.0, etc.). Random Forest can build models even when the independent variables are correlated with each other. Additionally, this algorithm can show which variables are more significant for determining article quality. If we need other information about the importance of variables, we can use other algorithms, including logistic regression [13].
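As an illustration of the idea (not the actual research pipeline), here is a minimal scikit-learn sketch: a Random Forest trained on entirely synthetic "article features", after which the feature importances are read off. All feature names and data below are invented:

```python
# Train a Random Forest on synthetic article features and inspect
# which features the model considers most important.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Hypothetical features: text length, references, images, sections
X = np.column_stack([
    rng.normal(3000, 1500, n),   # text length (characters)
    rng.poisson(10, n),          # number of references
    rng.poisson(3, n),           # number of images
    rng.poisson(6, n),           # number of sections
])
# Synthetic rule: long, well-referenced articles are "Complete" (1)
y = ((X[:, 0] > 3000) & (X[:, 1] > 10)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
names = ["length", "references", "images", "sections"]
for name, imp in sorted(zip(names, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

On this synthetic data, length and references dominate the importances, since the other two features are pure noise; on real Wikipedia data the ranking differs by language version, as described above.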
The results show that quality models differ between language versions of Wikipedia [1] [2] [3] [4]. For example, while in one language version one of the most important parameters is the number of references (sources), in another the number of images and the length of the text may matter more.
Quality is thus modeled as the probability of assigning an article to one of two groups, Complete or Incomplete. The conclusion is based on the analysis of various parameters: text length, number of references, images, sections, links to the article, number of facts [6], number of visits, number of revisions, and many others. There are also a number of linguistic parameters [5] [7], which depend on the language in question. In total, more than 300 parameters are currently used in research, depending on the language version of Wikipedia and the complexity of the model. Some parameters, such as references (sources), can be assessed in more depth [14]: not only counting them, but also evaluating how well-known and reliable the sources used in the Wikipedia article are.
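As an illustration, a few of the structural parameters mentioned above can be counted directly from an article's wikitext. This is a simplified sketch; real extraction pipelines handle nested markup, templates, and language-specific namespaces much more carefully:

```python
# Count a few simple structural parameters in raw wikitext.
import re

def count_parameters(wikitext: str) -> dict:
    return {
        "length": len(wikitext),
        "references": len(re.findall(r"<ref[ >]", wikitext)),
        "images": len(re.findall(r"\[\[(?:File|Image):", wikitext)),
        "sections": len(re.findall(r"(?m)^==+[^=].*?==+\s*$", wikitext)),
        "internal_links": len(re.findall(r"\[\[(?!File:|Image:)", wikitext)),
    }

sample = ("== History ==\n"
          "Text with a link [[Wikipedia]].<ref>Source</ref>\n"
          "[[File:Logo.png|thumb]]")
print(count_parameters(sample))
```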
Where can these parameters be obtained?
There are several sources: Wikipedia backups (dumps), the API service, special tools, and others [12].
Some parameters can be obtained simply by sending a request to the appropriate API; for others (especially linguistic ones) you need special libraries and parsers. Much of the time, however, is spent writing our own tools (I will cover this in separate articles).
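For example, a basic metadata query can be built against the standard MediaWiki API (`action=query`, `prop=info`). The JSON below is a trimmed, hardcoded stand-in for a real response, so this sketch runs without network access; the page id and length values are illustrative:

```python
# Build a MediaWiki API URL for basic page metadata and parse
# a sample response of the same shape the API returns.
import json
from urllib.parse import urlencode

def build_info_url(lang: str, title: str) -> str:
    params = {
        "action": "query",
        "prop": "info",
        "titles": title,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"

url = build_info_url("en", "Wikipedia")
print(url)

# In real use: data = json.load(urllib.request.urlopen(url))
sample_response = json.loads(
    '{"query": {"pages": {"5043734": {"pageid": 5043734, '
    '"title": "Wikipedia", "length": 180000}}}}'
)
page = next(iter(sample_response["query"]["pages"].values()))
print(page["title"], page["length"])
```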
Are there other ways to assess article quality besides the binary one?
Yes. Recent studies [8] [9] propose a method for evaluating articles on a scale from 0 to 100 (a continuous assessment). An article may thus receive, for example, a score of 45.78. The method has been tested on 55 language versions. The results are available on the WikiRank service, which lets you evaluate and compare the quality and popularity of a Wikipedia article across languages. The method is, of course, not perfect, but it works even for locally known topics [9].
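The actual scoring method in [8] [9] has its own feature set and weighting. Purely as an illustration of the idea of a continuous 0-100 scale, here is a minimal sketch that averages features normalized against assumed per-feature maxima; all numbers are invented:

```python
# Toy continuous quality score: average of capped, normalized features,
# scaled to the 0-100 range.

def continuous_score(features: dict, maxima: dict) -> float:
    """Average of features normalized against per-feature maxima, times 100."""
    ratios = [min(features[k] / maxima[k], 1.0) for k in maxima]
    return 100 * sum(ratios) / len(ratios)

maxima = {"length": 50000, "references": 100, "images": 20}
article = {"length": 25000, "references": 50, "images": 10}
print(round(continuous_score(article, maxima), 2))  # 50.0
```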

Are there ways to assess the quality of individual parts of a Wikipedia article, rather than the whole article?
Of course. For example, one of the important elements of an article is the so-called infobox. This is a separate frame (table), usually located at the top right of the article, that shows the most important facts about its subject. There is thus no need to search for this information in the text: just look at the infobox. The quality of infoboxes is assessed in separate studies [2] [11]. There are also projects, such as Infoboxes, which allow you to automatically compare infoboxes across language versions.
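As a rough illustration of how such a comparison might work, here is a simplified parser that pulls top-level parameter names out of an infobox and compares the sets between two language versions. It is only a sketch: it does not robustly handle nested templates inside parameter values, and the infobox contents are invented:

```python
# Extract top-level infobox parameter names from wikitext using a simple
# brace-counting scan, then compare parameter sets across language versions.

def extract_infobox_params(wikitext: str) -> set:
    start = wikitext.find("{{Infobox")
    if start == -1:
        return set()
    depth, i = 0, start
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            depth += 1; i += 2
        elif wikitext.startswith("}}", i):
            depth -= 1; i += 2
            if depth == 0:
                break
        else:
            i += 1
    body = wikitext[start:i]
    params = set()
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            params.add(line[1:].split("=", 1)[0].strip())
    return params

en = "{{Infobox person\n| name = Ada\n| born = 1815\n}}"
ru = "{{Infobox person\n| name = Ада\n}}"
# Parameters present in the English infobox but missing from the Russian one:
print(extract_infobox_params(en) - extract_infobox_params(ru))  # {'born'}
```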
What is all this for?
Wikipedia is used frequently, but users do not always check the quality of its information. The proposed methods can simplify this task: if an article is poor, a user who knows this will be more careful about relying on its materials when making decisions. The user can also see in which language the topic of interest is described best. Most importantly, modern techniques make it possible to transfer information between language versions, which means that weaker Wikipedia versions can be automatically enriched with high-quality information from other language versions [11]. This will also improve the quality of other semantic databases for which Wikipedia is the main source of information: first of all DBpedia, Wikidata, YAGO2, and others.

Source of illustrations: [8]

Literature
- [1] Lewoniewski, W., Węcel, K., & Abramowicz, W. (2016). Quality and Importance of Wikipedia Articles in Different Languages. In International Conference on Information and Software Technologies (pp. 613-624). Springer International Publishing. DOI: 10.1007/978-3-319-46254-7_50
- [2] Węcel, K., & Lewoniewski, W. (2015). Modelling the Quality of Attributes in Wikipedia Infoboxes. In International Conference on Business Information Systems (pp. 308-320). Springer International Publishing. DOI: 10.1007/978-3-319-26762-3_27
- [3] Lewoniewski, W., Węcel, K., & Abramowicz, W. (2015). Analiza porównawcza modeli jakości informacji w narodowych wersjach Wikipedii. Prace Naukowe Uniwersytetu Ekonomicznego w Katowicach, 133-154.
- [4] Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Analiza porównawcza modeli
- [5] Khairova, N., Lewoniewski, W., & Węcel, K. (2017). Estimating the Quality of Articles in Russian Wikipedia Using the Logical-Linguistic Model of Fact Extraction. In International Conference on Business Information Systems (pp. 28-40). Springer, Cham. DOI: 10.1007/978-3-319-59336-4_3
- [6] Lewoniewski, W., Khairova, N., Węcel, K., Stratiienko, N., & Abramowicz, W. (2017). Using Morphological and Semantic Features for the Quality Assessment of Russian Wikipedia. In International Conference on Information and Software Technologies (pp. 550-560). Springer, Cham. DOI: 10.1007/978-3-319-67642-5_46
- [7] Lewoniewski, W., Węcel, K., & Abramowicz, W. (2017). Determining Quality of Articles in Polish Wikipedia Based on Linguistic Features. DOI: 10.20944/preprints201801.0017.v1
- [8] Lewoniewski, W., Węcel, K., & Abramowicz, W. (2017). Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles. In Informatics (Vol. 4, No. 4, p. 43). Multidisciplinary Digital Publishing Institute. DOI: 10.3390/informatics4040043
- [9] Lewoniewski, W., & Węcel, K. (2017). Relative Quality Assessment of Wikipedia Articles in Different Languages. In International Conference on Business Information Systems (pp. 282-292). Springer, Cham. DOI: 10.1007/978-3-319-69023-0_24
- [10] Lewoniewski, W. (2017). Enrichment of Information in Multilingual Wikipedia Based on Quality Analysis. In International Conference on Business Information Systems (pp. 216-227). Springer, Cham. DOI: 10.1007/978-3-319-69023-0_19
- [11] Lewoniewski, W. (2017). Completeness and Reliability of Wikipedia Infoboxes in Various Languages. In International Conference on Business Information Systems (pp. 295-305). Springer, Cham. DOI: 10.1007/978-3-319-69023-0_25
- [12] Lewoniewski, W., Węcel, K. (2017). Cechy artykułów oraz metody ich ekstrakcji na potrzeby oceny jakości informacji w Wikipedii. Studia Oeconomica Posnaniensia 12/2017. DOI: 10.18559/SOEP.2017.12.7
- [13] Lamek, A., Lewoniewski, W. (2017). Zastosowanie regresji logistycznej w ocenie jakości informacji na przykładzie Wikipedii. Studia Oeconomica Posnaniensia 12/2017. DOI: 10.18559/SOEP.2017.12.3
- [14] Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Analysis of References across Wikipedia Languages. Information and Software Technologies. ICIST 2017. DOI: 10.1007/978-3-319-67642-5_47