
How we are writing a web service for a billion users

Maxim Model, IT director of the BeSmart.net project, on building a global training service


Our team works on the BeSmart project. We currently have nine programmers, including the IT director, that is, me (there are also designers, marketers, and other specialists, more than 20 people in all). We work in Vitebsk, Belarus, a city known in Russia for the Slavic Bazaar festival.
BeSmart.net is a service for hosting training lectures in video, audio, and PDF formats, which we hope will eventually be watched all over the world. Our ambitions are large, but for now we will set them aside and describe the two goals we face as developers and how we are meeting them.


Our first goal is to build a high-load system that can withstand requests from millions of users downloading or buying lectures. We understand that we have taken on an extremely difficult task, so we want to do everything thoroughly.

The second goal is to protect content as much as possible. Of course, we understand that protecting information on the Internet is quite difficult today, if it is possible at all. Nevertheless, we intend to make unauthorized access to lectures hosted in our system as hard as possible and, where we can, to prevent it.

Now I will describe the technologies that we used to solve both problems.

Which programming language we chose

We develop the core of the system in C++, which, like any language, is not without flaws. The main one is long development time. C++ programs are not so much written as designed: first the architecture of the future application is worked out, and only then does implementation begin. This approach is not mandatory, of course, but it greatly simplifies maintaining and extending the project later.

C++ is an extremely powerful tool, and that power requires careful handling. A C++ programmer must be highly qualified so that the code is clear to those who will maintain the project.

The second drawback (which also feeds the first) is the lack of ready-made solutions, especially in web development. PHP, for example, has plenty of them, while C++ has practically none. Much has to be built from scratch, and BeSmart.net is 90% homegrown.

The work has taken more than two years. In March 2012 we began preparing the technical specification, which lasted until May. Then came development of the project core, which lasted until November 2012. In May 2013 we started developing our own file server. Only in October 2013 did the project reach commercial operation, when the site became fully usable.

What do we have in common with Facebook?

As you have gathered, we are well aware that writing C++ code is hard and slow. Now let us turn to the advantages for which we chose the language. First of all, it produces efficient compiled code, which is impossible or extremely difficult to achieve on other platforms. This matters a great deal for high-load projects.

The compiler (we use GCC) translates the source code into machine code, which the computer executes directly. As a result, user requests are processed faster. When thousands or even millions of people use the service at the same time, they get their information quickly, and we need less server hardware than we otherwise would.

Facebook was written in PHP, but later Zuckerberg needed to reduce the load on the servers: with millions of users, that yields significant savings. So Facebook wrote its own software, HipHop, which translates PHP into C++ and then compiles it into machine code. However, not every PHP script can be translated, so Facebook moved to C++ only partially.

As far as we know, VKontakte programmers still write code in PHP, but they do so deliberately. New VKontakte features are first written quickly in PHP and tested, and if they work, they are rewritten in a compiled programming language (a year ago Pavel Durov announced an in-house development, KPHP).

In our case, everything was written in C++ from the start, and that is a sign of a serious project. When we began work on BeSmart.net, we met with IT people from other projects, and one of them asked me: "Well, how are things? What are you working on?" I said: "We are going to build an application in C++." To which my friend replied: "Are you going to serve a billion people?" I said: "Yes, we are going to serve a billion."

Where we store user data

We do not yet have a billion lecturers and listeners (by the way, you can bring our goal closer by downloading a lecture on BeSmart.net). But we are already preparing to store large volumes of multimedia data. The storage must be secure, and that is our second big goal. For this we have both software and hardware solutions.

We rent servers in two data centers: one in Moscow and one in Hong Kong. We are now negotiating to lease servers in one more European country. Data is duplicated across the servers. Depending on where in the world a request comes from, content is sent from the server closest to the user.
Of course, we know that there are companies offering CDN services that do the same thing on an outsourced basis. But the point is that we do not use other companies' services: after all, we vouch for the safety of all data on the site.
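
The rule "serve content from the server closest to the user" can be sketched roughly as follows. This is only an illustration under stated assumptions: the article does not say how proximity is measured, so great-circle distance is assumed here, and the names DataCenter, distance_km and nearest_data_center, as well as the coordinates in the usage note, are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical description of a data center: a name and a location.
struct DataCenter {
    std::string name;
    double lat_deg;
    double lon_deg;
};

// Great-circle distance in kilometres (haversine formula).
double distance_km(double lat1, double lon1, double lat2, double lon2) {
    const double kPi = 3.14159265358979323846;
    const double kEarthRadiusKm = 6371.0;
    const double to_rad = kPi / 180.0;
    const double dlat = (lat2 - lat1) * to_rad;
    const double dlon = (lon2 - lon1) * to_rad;
    const double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
                     std::cos(lat1 * to_rad) * std::cos(lat2 * to_rad) *
                     std::sin(dlon / 2) * std::sin(dlon / 2);
    return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(a));
}

// Returns the data center closest to the user's approximate location.
// Assumption: proximity is geographic; a real system might measure latency.
const DataCenter& nearest_data_center(const std::vector<DataCenter>& centers,
                                      double user_lat, double user_lon) {
    assert(!centers.empty());
    const DataCenter* best = &centers.front();
    double best_km = distance_km(user_lat, user_lon, best->lat_deg, best->lon_deg);
    for (const DataCenter& dc : centers) {
        const double km = distance_km(user_lat, user_lon, dc.lat_deg, dc.lon_deg);
        if (km < best_km) {
            best_km = km;
            best = &dc;
        }
    }
    return *best;
}
```

With two entries, for example {"Moscow", 55.75, 37.62} and {"Hong Kong", 22.32, 114.17}, a request from Vitebsk would be routed to Moscow and a request from Tokyo to Hong Kong.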

We also wrote our own asynchronous file server (the software responsible for storage) in C++. Its distinguishing feature is that it serves content to users not from the servers themselves but from our local network.
To clarify: in addition to the servers running the software part of the BeSmart.net service, we have separate servers holding user data. They are joined in a local network that is not directly connected to the Internet. The file server accesses this local network when it delivers content to site users. In effect, it is one group of servers inside another, which protects the data better against break-ins.

How data is stored at BeSmart.net

image

DC - data center;
Cluster Web - a cluster of web servers that receive and process requests from users;
Cluster FS - a cluster of servers that provide access to the project's file resources (content download and upload);
GW - a gateway that forms a secure channel joining the internal networks of the different BeSmart.net data centers into a single network (used to build the distributed storage network and replicate data across it).

How we protect copyrighted content

The site accepts lectures from users in all popular video formats. They are stored in MP4, however, and viewers receive them over HLS (HTTP Live Streaming), a protocol developed by Apple. The video is not delivered as a single file that plays progressively from the cache, but as fragments of 10 seconds each. Some browsers keep these pieces in the cache while displaying the video and some do not, but either way the technology makes life harder for pirates.
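
To make the 10-second fragments concrete, here is a minimal sketch of how an HLS media playlist for one lecture could be generated. The tags are standard HLS playlist tags; the function name make_hls_playlist, the segment file names, and the choice of MPEG-TS segments are assumptions made for illustration, not details of the BeSmart.net server.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <string>

// Builds a video-on-demand HLS media playlist for a lecture that has been
// cut into fixed-length segments. Segment names ("segment0.ts", ...) are
// hypothetical; only the playlist tags come from the HLS format itself.
std::string make_hls_playlist(double total_seconds, int segment_seconds = 10) {
    assert(total_seconds > 0 && segment_seconds > 0);
    std::string out = "#EXTM3U\n#EXT-X-VERSION:3\n";
    out += "#EXT-X-TARGETDURATION:" + std::to_string(segment_seconds) + "\n";
    out += "#EXT-X-MEDIA-SEQUENCE:0\n";
    out += "#EXT-X-PLAYLIST-TYPE:VOD\n";
    for (int i = 0; i * segment_seconds < total_seconds; ++i) {
        // The last segment may be shorter than the nominal length.
        const double len = std::min<double>(
            segment_seconds, total_seconds - i * segment_seconds);
        char line[64];
        std::snprintf(line, sizeof line, "#EXTINF:%.3f,\n", len);
        out += line;
        out += "segment" + std::to_string(i) + ".ts\n";
    }
    out += "#EXT-X-ENDLIST\n";
    return out;
}
```

For a 25-second clip, make_hls_playlist(25.0) lists three segments: two of 10 seconds and a final one of 5 seconds. The player fetches the playlist first and then requests the segments one by one, so the complete file is never offered as a single download.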

We mark all content that users upload to the site with a watermark. A watermark will not prevent illegal distribution of a video, of course, but asserting your rights to it is, on the whole, far from pointless. At the very least, it will be easier to get our lectures removed from sites that operate within the law.

So that pirates cannot easily get rid of the watermark, we made it dynamic: the BeSmart.net logo slowly drifts around the picture, changing its trajectory. Processing happens while clips are being saved to the servers: each recording is split into audio and video tracks, the watermark is overlaid, and the tracks are assembled again. As a result, the watermark's trajectory in the frame is always different. We wrote our own software to separate the audio and video tracks and merge them back. To combine the watermark with the video stream frame by frame, we use FFmpeg.
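
One way to get a slowly drifting logo whose path differs from clip to clip is to generate the per-frame positions from a seed chosen per video. The sketch below shows that idea only; the article does not describe the actual algorithm, so the function watermark_path, the Point type, the bounce rule, and the speed range are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct Point {
    int x;
    int y;
};

// Hypothetical helper: top-left position of the logo for every frame.
// The logo drifts a few pixels per frame and bounces off the frame edges;
// a different seed gives a different start point, direction and speed.
std::vector<Point> watermark_path(int frame_w, int frame_h, int logo_w,
                                  int logo_h, int frame_count, unsigned seed) {
    assert(logo_w < frame_w && logo_h < frame_h && frame_count >= 0);
    const int max_x = frame_w - logo_w;
    const int max_y = frame_h - logo_h;
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> start_x(0, max_x);
    std::uniform_int_distribution<int> start_y(0, max_y);
    std::uniform_int_distribution<int> speed(1, 3);  // pixels per frame
    std::bernoulli_distribution flip(0.5);
    int x = start_x(rng);
    int y = start_y(rng);
    int dx = speed(rng) * (flip(rng) ? 1 : -1);
    int dy = speed(rng) * (flip(rng) ? 1 : -1);
    std::vector<Point> path;
    path.reserve(static_cast<std::size_t>(frame_count));
    for (int i = 0; i < frame_count; ++i) {
        path.push_back({x, y});
        x += dx;
        y += dy;
        // Bounce off the edges and pick a new speed to vary the trajectory.
        if (x < 0)     { x = 0;     dx = speed(rng); }
        if (x > max_x) { x = max_x; dx = -speed(rng); }
        if (y < 0)     { y = 0;     dy = speed(rng); }
        if (y > max_y) { y = max_y; dy = -speed(rng); }
    }
    return path;
}
```

The resulting coordinates could then be handed to whatever performs the overlay (FFmpeg, in the authors' case). Reusing a seed reproduces the same path, which would make it possible to re-render a clip identically.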

As a reminder, besides video and audio files you can also upload PDFs to the site. These files are watermarked too. On upload, we split each file into separate pages, and a program we wrote specially stamps a watermark on each page. We also deliberately lower the image quality, so that the text is readable but not suitable for printing.

What comes next

Every day brings new tasks that do not always serve our two main goals but improve the service in one way or another. Recently we built a so-called "exchange of business cards" for users who want to communicate with each other, an analogue of "friending" on social networks. We also added the ability to blog, leave comments, and rate lectures. Writing the code took two weeks.

Our project is young and therefore not without flaws, but we are working on them. Development is far from complete; we are only at the beginning. Nothing is impossible: there are only tasks and deadlines for completing them.

Source: https://habr.com/ru/post/238179/

