
A game to improve the quality of Wikipedia

The beta version of WikiBest, an online game that is part of research into data quality on Wikipedia, was announced today. The game currently lets you compare data quality across 5 language versions of Wikipedia: Russian, Ukrainian, Belarusian, Polish, and English. More languages are planned for the near future.


Despite its popularity, Wikipedia is often criticized for the poor quality of its information. Researchers have proposed various approaches to automatically assessing the quality of articles in this free encyclopedia. However, many problems remain unsolved. For example, how can the quality of individual facts on the same topic be automatically evaluated or compared across language versions?

On Wikipedia, each article can exist in several language versions (sometimes more than 200). On the one hand, this makes information more accessible to individual language communities. On the other hand, it can make it hard to tell which version has the better information, since each version can be created and edited independently of the others. For example, readers and editors of the English article on Yekaterinburg need not know what the Russian Wikipedia says about the city, although one might expect the information in the latter to be of better quality (of course, this rule does not always hold ;)).
WikiBest was created so that, in the future, algorithms for automatically comparing data quality between language versions of articles can be built from the decisions of users (players), using machine learning and artificial intelligence. This can help identify the more complete, current, and reliable information with which other language versions of Wikipedia could be enriched.

Game address

The first short video lecture on how WikiBest works:



Key features


Currently, the minimum requirement for a player is basic knowledge of 4 languages (Russian, Ukrainian, Polish, English), enough to compare the contents of the cards (called "infoboxes" in English; put simply, data tables) of Wikipedia articles. Knowledge of Belarusian is also recommended: it makes it possible to compare quality across all 5 available language versions.

Registration is required to play. Once you receive the activation code by email, you can start the "struggle" for quality on Wikipedia!)

Cards appear in 5 (4) language versions for the same subject, which can be a city, a computer game, a university, a company, or another object. To make the data easier to compare, the card windows can be moved around. For each language version, you can mark four options describing the data it contains: best quality, best completeness, best relevance, best reliability.

Ideally, each option should be marked only once across the 5 (4) languages. That is, we must determine which version is the best in each of the four "nominations". However, there are exceptional cases where two language versions can both be the best. In that case the game asks the player to add a comment explaining why he or she thinks so.
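The marking rule above can be sketched as a small check. This is only an illustration, not the game's actual code: the function name `validate_marks`, the criterion labels, and the way marks are stored are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed labels for the four "nominations" described in the text.
CRITERIA = ("quality", "completeness", "relevance", "reliability")


def validate_marks(marks, comment=""):
    """Check one round of marks (hypothetical helper, not WikiBest code).

    marks maps a language code to the set of criteria the player
    awarded to that language version, e.g. {"ru": {"quality"}}.
    Returns a list of problems; an empty list means the round is valid.
    """
    # Collect, for every criterion, the languages marked as best.
    winners = defaultdict(list)
    for lang, awarded in marks.items():
        for criterion in awarded:
            winners[criterion].append(lang)

    problems = []
    for criterion in CRITERIA:
        count = len(winners[criterion])
        if count == 0:
            problems.append(f"no winner chosen for {criterion}")
        elif count > 1 and not comment.strip():
            # A tie is allowed only when the player explains it.
            problems.append(f"tie for {criterion} needs a comment")
    return problems
```

With this sketch, a round where every criterion has exactly one winner returns an empty list, and a tie passes only when a non-empty comment is supplied.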

To go to the next five (four) cards, click "Next" and repeat the steps described above.

Work done in the game earns experience, which leads to higher levels.

Because the research is conducted mainly by specialists in machine learning and data analysis, gamification is not this project's strong point ;) That is something we will have to learn. I would be glad to receive links to useful materials on the subject.

Generally speaking, the project is non-commercial. Any help is welcome)

Some theory


What is data quality? The question is not simple, and the scientific community has no single definition: it all depends on the context ;) To begin with, quality assessment is subjective and depends on the particular person, their knowledge and experience, and the need for the information at a given time. Simply put, data quality can be defined as fitness for use.

To assess data quality, one must also take into account its various characteristics, such as completeness, relevance, and accuracy.

In WikiBest, completeness means how broadly an object is described. That is, you need to look at which characteristics are entered in the card and whether all the basic parameters for the object are available to the reader. For example, for a city, some of the most important parameters would be population, area, mayor, and so on.
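One way to picture this notion of completeness is as the share of basic parameters that a card actually fills in. In the game the judgment is made by the player, so the sketch below is only an illustration: the function `completeness`, the list `CITY_PARAMETERS`, and the sample card values are invented for the example.

```python
# Hypothetical list of basic parameters for a city card.
CITY_PARAMETERS = ("population", "area", "mayor")


def completeness(infobox, expected=CITY_PARAMETERS):
    """Return the share of expected parameters that have a non-empty value."""
    if not expected:
        return 0.0
    filled = 0
    for name in expected:
        value = infobox.get(name)
        # Missing keys, None, and blank strings all count as "not filled".
        if value is not None and str(value).strip():
            filled += 1
    return filled / len(expected)


# Invented sample cards for the same city in two language versions.
card_a = {"population": "1000000", "area": "500", "mayor": "N. N."}
card_b = {"population": "1000000", "area": "", "founded": "1700"}

print(completeness(card_a))  # all three basic parameters are present
print(completeness(card_b))  # only one of the three is present
```

Extra parameters such as `founded` do not raise the score here, because the sketch only counts the parameters assumed to be basic for the object type.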

Relevance reflects how far the entered parameters of the object differ from the actual situation. For example, a card whose population figure is given as of 2018 is more relevant than a card where the same parameter dates from 2016.
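The 2018 versus 2016 comparison can be expressed as picking the version with the latest "as of" year. This is a minimal sketch under the assumption that each card's date has already been extracted into a year; the function name `most_relevant` and the data layout are hypothetical.

```python
def most_relevant(as_of_years):
    """Return the language codes whose value has the latest 'as of' year.

    as_of_years maps a language code to the year the value refers to,
    or None when the card gives no date.
    """
    # Cards without a date cannot be ranked, so they are left out.
    dated = {lang: year for lang, year in as_of_years.items() if year is not None}
    if not dated:
        return []
    latest = max(dated.values())
    # Several versions may share the latest year; return all of them, sorted.
    return sorted(lang for lang, year in dated.items() if year == latest)


print(most_relevant({"ru": 2018, "en": 2016, "be": None}))
```

Returning a list rather than a single code keeps ties visible, which matches the game's allowance for two versions being equally good.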

Reliability, in the context of the game, shows how well the information is supported by reliable sources, so that the reader can verify the value entered for a particular parameter.
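As a rough illustration, reliability could be approximated by the share of filled parameters that cite at least one source. This is an assumption made for the example, not how WikiBest scores cards (players decide that), and it does not judge whether a cited source is itself trustworthy. The function `reliability` and the `(value, sources)` layout are hypothetical.

```python
def reliability(infobox):
    """Return the share of filled parameters that cite at least one source.

    infobox maps a parameter name to a (value, sources) pair, where
    sources is a list of reference strings (possibly empty).
    """
    # Only parameters with a value can be supported by a source.
    filled = [sources for value, sources in infobox.values() if value]
    if not filled:
        return 0.0
    sourced = sum(1 for sources in filled if sources)
    return sourced / len(filled)


# Invented sample card: one sourced value, one unsourced, one empty.
card = {
    "population": ("1000000", ["statistics office report"]),
    "area": ("500", []),
    "mayor": ("", []),
}
print(reliability(card))
```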

Why exactly 5 languages?


As mentioned above, the game is part of scientific research in which I am directly involved. I am confident in my basic knowledge of these languages, so I can carry out research on the data obtained.

Belarusian is optional because of the size of the Belarusian Wikipedia. It currently has approx. 150 thousand articles. For comparison, the Ukrainian Wikipedia already contains more than 800 thousand, and the Russian one almost 1.5 million (source).

The main goal of the research is to enrich the less developed language sections of Wikipedia. In this sense, the Belarusian section has great potential: data from the other language sections under study can be transferred there. However, we already know that data quality depends on the topic and the language version, so the “candidate” for “copying” must be identified first (in fact, the data still has to be translated, but this is not a problem when using semantics).

Source: https://habr.com/ru/post/418713/

