
How we launched a Habr for the humanities

“In the next two years, you should not try to invent something special of your own, but just be smart enough to combine what humanity has already created” (c) bobuk

A year ago, at an internal hackathon, our team from Rostov combined a visual text editor, Muravyev’s Typographer, and an anti-plagiarism service overnight. The result was a tool that helped quickly prepare a post and publish it to a blog.

For a while the tool lived as a side project; then we were given some resources, much like an internal startup. The result was a convenient collective media platform with no editorial staff.


Old man Gutenberg would be pleased
It lets people read interesting stories, such as the one about a 40-year-old man who raises sunken ships in the Barents Sea, and it lets authors who write on popular non-technical topics earn a little from their texts.

Let's look at what to consider when developing such a service, and what to choose to avoid ugly workarounds.

As usual, we have split the post into sections; each contains tips and direct quotes from a project participant, with a minimum of outside commentary.

Why we chose Medium out of 10 editors (JS)

Thanks to Muravyev for the typographer (Python)

How the anti-plagiarism system works (Python)

How authors can see what effect their texts have

How to support it (about working with sysadmins)

The story of the diver (on uPages)


But let's start from the beginning. When the side project began to grow into a high-load platform, it became clear that some things would have to be added and rewritten.

0. How we chose technology


We expect that by the end of the year, 8 thousand authors will use the platform, and the readership will be close to 300 thousand.

Ilya devhard, CTO of uPages: “When choosing technologies for the project, we were guided by two considerations: long-term prospects and resilience under high load.

We settled on the Node.js + MongoDB stack. On the client: no Angular, React, or Redux. As the framework we chose Express.js for its minimalism and because everything we needed was available either out of the box or with a few small add-ons.

We also used ESLint, a utility that helped a lot in bringing the code of several developers to a more or less uniform style without much arguing along the lines of "which is better: tabs or spaces." It is very useful in the early stages of development.

ESLint has a demo on its site, and the project can be downloaded from GitHub.

Docker containers serve as the runtime environment for the project's applications. They protect us from fears like “update a library and everything breaks” and, when needed, let us quickly get the required library versions or even several completely different builds (roughly speaking, stable and bleeding_edge builds).

We also had 3 Git repositories (one locally in the office, two in different data centers). And I knew that someday we would start writing a visual editor.”


1. If you decide to write your own WYSIWYG, we sympathize with you


Sergey, our developer: “The prototype from the hackathon was made for blogs and websites on uCoz. Naturally, our first thought was to take something from there. But as you know, part of uCoz is written in Perl, and we had chosen Node.js. This meant the editor would have to be either connected as a separate service or rewritten.

Having reviewed a dozen more options, we dropped them too, because:

  1. Either they were not “what you see is what you get” editors.
  2. Or out of the box they did not look modern, but like Word 2003.
  3. Or customizing them was too painful to be worth it.

You will find the list of editors we do not recommend in the first comment, but we will reveal one right now:


CKEditor is not your bro (at least for reason number 2)

The authors on the site will be very different people: professional journalists, copywriters, complete beginners, and those who tried blogging back in the harsh 2000s. I wanted an editor in which anyone could upload an article quickly and without fuss.

Searching further, we realized that as of spring 2016 the choice was clear: Medium Editor (ME), an open-source editor inspired by the popular blogging platform of the same name. Even at first glance it has a number of advantages.

ME has fairly complete and clear documentation, and the project is actively developed rather than abandoned. It also had the functions we needed out of the box (a toolbar and text-formatting tools, which is what the GitHub version offers) and customization options.


Editing tools appear only when needed. Photo and video insertion is hidden behind the plus sign.

The widgets that were not included in the package (“video”, “pictures”, and “divider”) I wrote myself, working from the design. Creating them did not take much time.

But there was a fly in the ointment. The first surprise was that ME overrides the standard events inside the editor (keyup, paste, etc.), replacing them with its own. To regain control, I had to dig into the ME code and add an exception.


Having found a vulnerability, we wrote to those who use ME in their projects and explained how to close it.

The vulnerability surfaced during the very first tests. We found that ME does not protect against dangerous links such as:

javascript:alert('xss http://www.ru') 

This can be solved in a basic way by detecting and breaking up javascript: sequences, which is what we did.

On the one hand, our approach is crude: if a dangerous link is found, we simply do not save the article. On the other hand, anyone who decides to run such an experiment is clearly not our user.”
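The check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code (which lives in the Node.js application); the function names and the inclusion of the vbscript: scheme are assumptions made for the example. It strips whitespace and control characters before looking at the scheme, because a link like "java script:" with a tab inside can still be executed by browsers.

```python
import re

# Whitespace and ASCII control characters, which browsers may ignore
# inside a URL scheme (for example a tab in the middle of "javascript:").
_IGNORED = re.compile(r"[\x00-\x20]+")

# Schemes treated as dangerous in this sketch (an assumption for illustration).
DANGEROUS_SCHEMES = ("javascript:", "vbscript:")


def is_dangerous_link(href: str) -> bool:
    """Return True if the link starts with a script-like scheme."""
    normalized = _IGNORED.sub("", href).lower()
    return normalized.startswith(DANGEROUS_SCHEMES)


def can_save_article(links: list[str]) -> bool:
    """The crude policy from the text: one bad link rejects the whole article."""
    return not any(is_dangerous_link(link) for link in links)
```

With this sketch, the example link from the text is rejected, while an ordinary http:// link passes.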


2. Guillemet quotes and a dash instead of a hyphen


Peter, our developer: “It is worth noting that natural-language text processing is not a trivial task, especially for Russian. It was obvious that we should not hold up development with attempts to build something of our own, since ready-made tools with the functionality we needed were available.

We decided to use the utilities already familiar to us from the hackathon. We moved them into a separate microservice packaged as a Docker container: a kind of Python wrapper with its own API, served through WSGI.



One of the tools was Muravyev’s Typographer. In my humble opinion, there is no better tool for correcting typography in the whole of Runet: an impressive set of rules and implementations in PHP and Python. The license is also very important.

To the great credit of @emuragev, it is released into the public domain, so we took the Python implementation and plugged it into our project. So far we have not changed anything, although we have both the opportunity and ideas for extending the rules.”
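To give an idea of what a Python wrapper served through WSGI might look like, here is a minimal sketch. It does not call the real typographer: the typograph function below is a hypothetical stand-in with two toy rules (guillemets and a dash instead of a spaced hyphen), and the request and response format (raw UTF-8 text in, a JSON object with a "result" key out) is an assumption for illustration.

```python
import json
import re


def typograph(text: str) -> str:
    """Hypothetical stand-in for the real typographer: two toy rules only."""
    # "quoted" -> guillemet quotes (written as Unicode escapes)
    text = re.sub(r'"([^"]*)"', "\u00ab\\1\u00bb", text)
    # a hyphen surrounded by spaces -> a dash (U+2014)
    return text.replace(" - ", " \u2014 ")


def app(environ, start_response):
    """WSGI entry point: POST raw UTF-8 text, receive JSON back."""
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed", [("Allow", "POST")])
        return [b""]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    text = environ["wsgi.input"].read(length).decode("utf-8")
    payload = json.dumps({"result": typograph(text)}, ensure_ascii=False)
    body = payload.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

For a local try-out, such an app can be served with the standard library's wsgiref.simple_server.make_server; inside a container one would more likely put a production WSGI server in front of it.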


3. How not to become an SEO content farm


To avoid being flooded by graphomaniacs and fans of reposting from their cozy little blogs, we introduced pre-moderation. The text of each publication is checked for uniqueness; both the author and we can run this check.



Peter, our developer: “The concept of the platform is entertaining stories and reviews without strict rules on topics and formats. But we also pay money for each text, so clearly it has to be unique.

In general, to implement an online uniqueness check you need Yandex.XML and, ideally, a database of text samples, so that you can first run the text against that database and only then query Y.XML.

But the number of requests from one domain to Y.XML is limited and depends directly on the site's TCI (thematic citation index). And what is the TCI of a web project that has not been released yet? None. Yet during development we constantly needed to send requests, parse the responses, and do something with the data.



We could, of course, have sent requests from some domain we control that hosts a site with a high TCI. But in the end we decided against that and took the ready-made Content-Watch (CW) API. The service is paid, but the team there offered us special terms.

Although reviews online are mixed, they seem to understand the problem very well and have built a service with good documentation, a minimal API, and some algorithms of their own in addition to Y.XML.

For us as users of the service, everything works very simply: we send the CW service a request with the text to be checked, and the answer comes back as JSON. It contains the degree of uniqueness of the text (which we show as a nice pie chart) and links to web pages where fragments of the text were found. For now the links are used only by the moderators who check articles before they are published on the main page.”
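The handling of such a response could look roughly like this. The sketch assumes a JSON object with a "percent" field for overall uniqueness and a "matches" list whose items have "url" and "percent" fields; these field names are hypothetical and chosen for illustration, so the real CW response format should be checked against its documentation.

```python
import json


def split_report(raw_json: str) -> tuple[dict, list[str]]:
    """Split a checker response into pie-chart data and moderator links."""
    report = json.loads(raw_json)
    unique = float(report["percent"])  # hypothetical field name
    # Data for the pie chart shown to the author.
    chart = {"unique": unique, "borrowed": round(100.0 - unique, 2)}
    # Links to matching pages, strongest match first (hypothetical fields).
    matches = sorted(
        report.get("matches", []),
        key=lambda m: float(m["percent"]),
        reverse=True,
    )
    return chart, [m["url"] for m in matches]
```

The first value is what an author would see; the second is the list that only moderators use.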


4. How to let authors analyze articles


To make things more interesting for authors, we decided to give everyone text analytics and payout tools. Income is generated as follows: next to the text there are two banners, an advertising one and a recommendation one. Each banner has a cost per click (set by the advertising system), and we give 80% of the cost of each click to the author.

Ilya, CTO of uPages: “An interesting task was to show authors how much they earned and when, and at the same time to show that not everyone reads the article, so authors need to learn to work on engagement within the text, not just on the teaser and the title.
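As a small worked example of the 80% rule, here is a sketch of the payout arithmetic. The function name and the rounding to two decimal places are assumptions; only the 80% share comes from the text. Decimal is used instead of float so that sums of money do not pick up binary rounding errors.

```python
from decimal import Decimal, ROUND_HALF_UP

AUTHOR_SHARE = Decimal("0.80")  # 80% of each click's cost goes to the author


def author_income(click_costs):
    """Sum the click costs and return the author's share, rounded to 0.01."""
    total = sum((Decimal(str(cost)) for cost in click_costs), Decimal("0"))
    return (total * AUTHOR_SHARE).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

For three clicks costing 1.50, 2.25, and 0.40, the total is 4.15 and the author receives 3.32.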

Therefore, we made a statistics module.



On the client side, we draw charts with Chart.js and display numbers and lists.

On the server side, we count some things ourselves: the number of likes, reads, and bookmarks, for example.

The rest of the data, on the banner clicks from which the author earns, we get through the Google Analytics API and from the Engageya recommendation service. The latter has no convenient API, but we managed to agree that once a day they send us reports with all the necessary information. This is how we show the clicks on the ads beside the article and the author's income.



With the Google API, requests are sent at regular intervals to stay within the limits.
Yes, the Google API is a pain. With so many different products, you have to read through a huge amount of documentation and try several approaches. At first we tried to use the AdSense Management API to get AdSense revenue data, but its reports do not provide enough information to break income down by source.

After a lot of googling and banging my head on the keyboard, salvation came from Analytics.
You link the Google AdSense and Analytics accounts, after which AdSense data becomes available when you query the Analytics API.”
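The idea of sending requests at regular intervals to stay within API limits can be sketched like this. The Pacer class is a hypothetical helper, not part of any Google client library, and the interval value is up to the caller; the clock and sleep functions are injectable only to make the sketch easy to test.

```python
import time


class Pacer:
    """Spaces successive calls at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call has been made yet

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval
```

A caller would invoke wait() right before each API request, so a burst of report queries is automatically spread out over time.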


5. How to support it



Ruslan pys, system administrator: “The task itself subtly hinted that this should be a cloud server. Besides, I really did not feel like monitoring motherboard temperatures and fan speeds, or watching the disks and replacing them without forgetting to record the new serial numbers.

We set the following requirements for cloud virtual machines:




Another important task was the software deployment model. Obviously, life on a development server and in production are two very different things. Fortunately, the developers chose containerization across the board from the start, for which, taking this opportunity, the entire system administration team thanks them.

We chose SaltStack as the container deployment and configuration management system, since we had already used it successfully on other projects.

Once the service was brought to combat readiness, a natural desire arose to hold exercises, that is, to fire at it from Yandex.Tank. In experiments with different ratios of requests to processor cores and different system and application software settings, we determined the capacity of a single node and how that capacity correlates with the virtual hardware configuration and OS settings. And then we launched on November 1st.”

P.S. Send your HTTP requests to our new service!

Source: https://habr.com/ru/post/314796/

