
How Wikipedia works (part 2)

Hi, Habr!
My main job gave me a couple of days off, and apart from important personal matters I decided to devote them to continuing the series of posts about Wikipedia.

The first part of the series was received positively, so I will try to make the second one even more interesting for the local audience: today it is devoted to some technical aspects of the project.

As you know, Wikipedia is a volunteer project, and this principle is not abandoned even in such a matter as the technical support of its operation. In principle, any participant can, with fairly little effort, pick a simple bug and submit a patch for it, even without using Gerrit.

But a much more common form of participation is the development of additional tools, customizations and bots for Wikipedia.

Bots


In order to perform routine or large-scale tasks, participants often run bots or ask other, more technically savvy participants to do so. MediaWiki provides access to an API, and for accounts with the bot flag this access is even wider (for example, a bot can retrieve not 500 but 5,000 entries per API request). In addition, the bot flag allows other users to hide such edits from their view, which protects participants from thousands of minor edits that would otherwise clutter up the recent changes lists.
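To give a sense of what this looks like in practice, here is a minimal sketch of such an API request; the 500/5,000 limits are the default MediaWiki values, while the endpoint and the requests library are my own choice for illustration:

```python
import requests

API = "https://ru.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "list": "recentchanges",   # latest edits across the project
    "rclimit": 500,            # ordinary accounts get at most 500 results per request;
                               # accounts with the bot flag may ask for up to 5000
    "format": "json",
}

reply = requests.get(API, params=params, timeout=30).json()
for change in reply["query"]["recentchanges"]:
    print(change["title"])
```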

Pywikibot

The most popular of the existing frameworks is pywikibot, which already includes a large set of ready-made scripts: for example, deleting pages from a list, moving articles from one category to another, and much more. Around 100 people have taken part in the development of this framework, and using it is also extremely simple: even an ordinary Windows user can install Python, download the distribution, enter a login and password into the config and run one of the ready-made scripts.
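For illustration, here is a minimal sketch of what a pywikibot-based edit looks like; the page title, the replacement and the edit summary are made up, and the login itself is taken from the config mentioned above:

```python
import pywikibot

site = pywikibot.Site("ru", "wikipedia")          # which language section and project to work in
page = pywikibot.Page(site, "Example article")    # hypothetical page title

# a trivial edit: replace one template call with another in the wikitext
page.text = page.text.replace("{{Old template}}", "{{New template}}")
page.save(summary="Bot: replacing the template")  # saves under the account from the config
```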

Previously, a popular task for pywikibot was placing links between articles in different language sections: for example, someone creates an article about Rally in the Russian Wikipedia, knows that there is a similar article Rallying in the English Wikipedia, and links to it. Then a bot comes along and sees that the English article links to 20+ other language sections while none of them yet link to the new Russian article: so the bot adds a link to the new Russian article in each of those sections and updates the full list of such interwiki links in the Russian article.
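A simplified sketch of that old workflow, using the Rally/Rallying example from above; everything except the titles is illustrative, and real interwiki bots also resolved conflicts and synchronized all sections at once:

```python
import pywikibot

en_page = pywikibot.Page(pywikibot.Site("en", "wikipedia"), "Rallying")

# in the old scheme each article carried its interwiki links as plain wikitext,
# e.g. [[ru:Ралли]], [[de:Rallye]], ... at the end of the page
link_to_ru = "[[ru:Ралли]]"
if link_to_ru not in en_page.text:
    en_page.text += "\n" + link_to_ru
    en_page.save(summary="Bot: adding interwiki link to the new Russian article")
```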

As you can see, this work is not the most interesting for a human but is very voluminous, and it was done by dozens of bots, resulting in millions of edits. For example, my bot now has about 990+ thousand edits, 80 percent of which are exactly such interwiki edits. Not so long ago the Wikipedia engine was reworked, and such edits in each of the sections are no longer needed, but the number of routine tasks has not decreased.

But let's get back to pywikibot: the framework has two branches, the older compat branch and the newer core rewrite.

Bugs and feature requests are collected in a bug tracker shared with MediaWiki, and development now goes through Git / Gerrit, which has made it easier to attract new developers and to add and review patches. Previously the development went through SVN, but in the end, to unify resources with MediaWiki and widen the circle of developers, it was decided to move to Git / Gerrit: there is even a Habr post about the advantages of Git over SVN.

I will not describe the entire set of functions the framework offers; those who wish can walk around the repository and have a look. I will only say that it is being actively filled out, and the existing scripts require minimal setup to run in any language section.

AutoWikiBrowser

While the bot framework described above works from the console, AWB (AutoWikiBrowser) is a friendlier tool for the ordinary user.
(Screenshot of AWB)
AWB has a full-fledged interface and automatic updates, and it runs only on Windows (unofficially, also under Wine). Typical AWB tasks: replacing text with regular expressions and making other edits to a specific list of articles. AWB can recursively traverse Wikipedia categories, compare lists, pick out unique elements or intersections, and even process Wikipedia dumps. At the same time, there are restrictions for accounts without the administrator or bot flag: lists for such participants are limited to 25,000 lines. If you have a bot flag, the restrictions are removed entirely by loading a special plug-in. An important caveat: since AWB can potentially be used to make a lot of non-constructive, even vandal, edits very quickly, its use is technically limited to users approved by administrators: if a user name is not listed on the corresponding page, AWB will refuse to work.
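AWB itself is a GUI application written in C#, but the kind of list juggling and regex replacements it automates is easy to sketch in a few lines of Python; the file names and the pattern here are invented for illustration:

```python
import re

# two lists of article titles, e.g. exported from two categories
with open("category_a.txt", encoding="utf-8") as f:
    list_a = {line.strip() for line in f if line.strip()}
with open("category_b.txt", encoding="utf-8") as f:
    list_b = {line.strip() for line in f if line.strip()}

both = list_a & list_b      # articles present in both lists
only_a = list_a - list_b    # articles unique to the first list

def fix_heading(wikitext: str) -> str:
    # a typical regex replacement: normalising a section heading across many articles
    return re.sub(r"==\s*External Links\s*==", "== External links ==", wikitext)
```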

In the general case, you have to click the “Save” button manually for each edit; autosave is possible only if AWB is run from an account with a bot flag. Therefore AWB is hard to use for really large-scale tasks, but for small ones it is very convenient, as it lets you automate certain actions and quickly do what you want without having to ask participants with more advanced bots (see above). Personally, I often use AWB to compile lists and then quickly run pywikibot with the necessary task: pywikibot also has special page generators that can do all of this, but I find it easier and clearer to do it through a program with a GUI.
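As an example of that combination, here is a sketch of feeding an AWB-compiled list of titles to pywikibot through a page generator; the file name and the edit itself are hypothetical:

```python
import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site("ru", "wikipedia")
# titles.txt is a plain list of article titles, one per line, compiled in AWB
gen = pagegenerators.TextfilePageGenerator("titles.txt", site=site)

for page in gen:
    if "{{Old template}}" in page.text:
        page.text = page.text.replace("{{Old template}}", "{{New template}}")
        page.save(summary="Bot: replacing the template on a prepared list of pages")
```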

The AWB source code is open; the program is written in C# and supported by a limited number of developers. On startup the program checks for updates and installs them; the distribution is also published on SourceForge. In case of critical errors during operation, AWB generates a bug report and helps pass it on to the developers.

Other

There are bots written in Perl, .NET and Java, but they are more often supported by individual enthusiasts and are not widely used. I personally once ran wiki scripts in PHP, but the massive pywikibot support, the active bug tracker and the responsiveness of a large number of developers completely won me over to this bot, so I cannot tell you much about the other frameworks :)

Toolserver


The section above was devoted to scripts and bots that are mostly run from a participant's own computer or server. But besides this, it is possible to run scripts on the infrastructure of Wikimedia organizations: previously this was Toolserver, supported by the German Wikimedia chapter, which existed until June 30, 2014, when Labs was created to replace it. But first things first.
(Photo: Wikipedia server room at Kennisnet)
The history of Toolserver began in 2005, when Sun Microsystems donated a V40z server (2 * Opteron 848, 2.2 GHz, 8 GB RAM, 6 * 146 GB disks, 12 * 400 GB external RAID) for use at the Wikimania conference in Frankfurt. After the conference, one of the participants from the German section took it home and made a coffee table out of it; after some time it was decided to install it in Amsterdam at Kennisnet, where about fifty servers of the Wikimedia Foundation were already hosted.

After that, Toolserver started running various scripts and tools (edit counters, various article analyzers, file upload tools, etc.), and its capacity kept growing: at the time of closing, 17 servers were running, more than 3 million requests to Toolserver were registered per day, and traffic reached 40 MB/s. Each of the servers had from 24 to 64 GB of RAM, most of them ran Solaris 10 (gradually being migrated to Linux), and the total disk space was 8 TB.

What were the main advantages of Toolserver as a platform?

There were also disadvantages:

But let's get back to the main advantage, replication: without it, Toolserver would be no different from ordinary hosting where you can run your own processes. Thanks to replication, Toolserver always had up-to-date copies of the Wikipedia databases, so tools could work directly with the database instead of making huge API requests, processing outdated dumps, and so on.

An approximate replication scheme is shown in the picture below:


Tampa in the diagram is the Foundation's main database, located in the United States; the s1-s4 clusters each hold part of the databases: for example, s1 is the English Wikipedia, s2 holds some of the other large sections, and so on. The data from Tampa is replicated to the Toolserver databases in Amsterdam, and it is there that Toolserver users and their tools access it. Naturally, there was always some replication lag, and because different clusters were used there could be situations where the lag for English Wikipedia data was 1 minute while for the Russian Wikipedia it was 2-3 days. For example, on June 21 (shortly before the shutdown) the lag was at most 28 seconds.

This availability of up-to-date data was Toolserver's main advantage: it was possible to analyze almost in real time which files were unused, how many edits and actions any participant had made across all Foundation projects, and a lot of other information that Wikipedia does not provide directly.
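Roughly speaking, a tool on Toolserver could simply connect to the replicated database and ask such questions with plain SQL. Here is a sketch of what that could look like; the host name is hypothetical, the credentials came with the user's Toolserver account, and the table layout follows the MediaWiki schema of that time:

```python
import pymysql

conn = pymysql.connect(
    host="ruwiki-p.db.toolserver.org",   # hypothetical replica host name
    db="ruwiki_p",                       # replica of the Russian Wikipedia database
    read_default_file="~/.my.cnf",       # login/password issued to Toolserver users
)

with conn.cursor() as cur:
    # how many edits a given participant has made in this project
    cur.execute(
        "SELECT COUNT(*) FROM revision WHERE rev_user_text = %s",
        ("Example user",),
    )
    print(cur.fetchone()[0])
```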

Conclusion


Maintaining Toolserver was a heavy burden for the German chapter, and the system had a number of limitations, so from July 1 Toolserver was completely replaced by the new Labs project, which is fully supported by the Wikimedia Foundation itself. This is a big new project and I will write about it in the next post, but as a teaser I can publish the June Labs statistics :)

See you soon!

Source: https://habr.com/ru/post/230219/

