"Any technical change should answer the question 'why?'" - Odnoklassniki on Java and more
How does Odnoklassniki reconcile its use of sun.misc.Unsafe with heightened reliability requirements? Why did the Cacti monitoring system have to be modified there? How does work at OK intersect with academic research? And if the social network is called Odnoklassniki, does all of its Java code consist of a single class?
The answers to these and other questions are in this post. Ahead of Joker, where three OK employees will be speaking and a fourth sits on the program committee, we put questions to all four of them, and to a few others as well. Our questions were answered by:
Oleg Anastasyev, Lead Developer (member of the Joker 2016 program committee)
Andrey Pangin, Lead Developer (Joker 2016 speaker)
Vitaly Khudobakhshov, Lead Analyst (Joker 2016 speaker)
Oleg Anastasyev (Lead Developer)
- Does Odnoklassniki tend to adopt new things, or do you prefer not to rush ahead? For example, when a new major version of Java comes out, do you try to move your servers to it quickly, or do you carry on quietly with the old one?
- We start from the task: any technical change must answer the simple question “why?”. If it answers that question, we do it; if not, we don't.
So with Java versions it works out differently each time. We adopted Java 8 at an accelerated pace because it had lambdas. For the web part of the portal we use our own function-oriented framework, which meant writing a lot of anonymous classes. Java 8 fit our task perfectly: lambdas let us cut down the code, and the result was both more readable and faster.
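To illustrate the kind of reduction described above, here is a minimal sketch comparing an anonymous class with the equivalent lambda. The class and field names are hypothetical and are not taken from the Odnoklassniki framework.

```java
import java.util.function.Function;

// Hypothetical example: names are illustrative only.
class LambdaVsAnonymous {
    // A function object written as an anonymous class.
    static final Function<String, String> GREET_ANONYMOUS = new Function<String, String>() {
        @Override
        public String apply(String name) {
            return "Hello, " + name;
        }
    };

    // The same function written as a Java 8 lambda.
    static final Function<String, String> GREET_LAMBDA = name -> "Hello, " + name;

    public static void main(String[] args) {
        System.out.println(GREET_ANONYMOUS.apply("OK"));
        System.out.println(GREET_LAMBDA.apply("OK"));
    }
}
```

Both fields behave identically; the lambda simply removes the boilerplate around the one line that matters.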
But in the case of Java 9, starting from the task again, I don't yet see how it will improve our lives. Perhaps it will be faster, and then there will be a reason to spend time on the migration, but that will only become clear with the final release. Moving to modules would not give us benefits that justify the effort.
Moreover, in some respects Java 9 will make our lives harder, not easier, because of the move away from sun.misc.Unsafe, which we use. Unsafe makes it convenient to implement a lot of low-level code without leaving Java; without it, that code would have to be written in, say, C. Even if JNI were fast (and it is not), development would take much more effort.
In addition, since we are a giant high-load project, reliability is of course important to us. So before migrating we must be sure that everything already works stably and no worse than on the previous version. We therefore don't intend to install Java 9 on the day of general availability, although we will of course start testing it.
- Hang on, how does “reliability is important” square with using Unsafe in production?
- Of course, with Unsafe you can easily break a lot. But if you understand very well what you are doing, then you know what you can break, and you know whether that matters in your particular case.
For example, in certain cases the “write once, run anywhere” principle may break: you end up with code that does not run correctly everywhere. That matters for Java as a whole, but we have our own specifics. We run our code on well-defined servers. We are obviously not going to swap our server fleet for something completely different tomorrow. And our goal is to make the code run as efficiently as possible on exactly these servers.
And in this case the “write once, run anywhere” principle starts to get in the way rather than help: it keeps the programmer from using operating system features that would make the code significantly faster and more efficient. For example, it does not let you take a memory page directly from the OS. It does not let you advise the OS on how to cache a particular region of memory, whether to cache it at all, and for how long. Java has no such built-in facilities in principle, but with Unsafe this is easy to implement.
Oracle's motivation for moving away from Unsafe is clear: yes, it offers many ways to shoot yourself in the foot, and for many people that is exactly how it ends. But I would like to point out that there is also our case, in which dropping Unsafe is not “taking the axe away from a child so he doesn't cut his finger” but “taking a simple and useful working tool away from adults”. And the VarHandles coming to replace it help in only one of our use cases.
Generally speaking, though, what I would ideally like is not even Unsafe. I would like Java to have closer integration with the OS and with libraries written in other languages such as C and Go, the ability to write low-level code with manual memory management where needed, and even the ability to drop down to assembler and write code in it where speed is critical.
- As a member of the Joker program committee, you have already seen many of the talks. Are there any you particularly liked and can recommend?
- I really liked Philip Delgyado's talk “DBMS: Individual Tailoring and Fit to the Figure”, about how knowing the capabilities of your DBMS well lets you solve complex tasks quickly while avoiding extra complexity in the application architecture.
And, of course, I am biased, but Andrey Pangin's talk is very interesting. You should also definitely listen to Dmitry Bugaychenko from Odnoklassniki on streaming analysis of tens of millions of events per second. For a task like that, “just taking Spark” is not an option.
Andrey Pangin (Lead Developer)
- What will you talk about at Joker?
- I had several topics in mind, but the audience themselves chose performance myths. So I will talk about how Java is slow. Or isn't slow, whichever you prefer :)
In general, I'll share some performance peculiarities of the JVM and show how easy it is to go wrong when analyzing performance problems.
- A year ago, in “Without Slides”, you described how everything is technically arranged at OK. Has anything changed significantly since then, whether quantitatively or qualitatively?
- Of course. The number of servers, the storage volume and the traffic all grow constantly. Traffic alone has doubled over the past year.
We have launched several new stand-alone projects, in particular OK Live and OK Messages. Naturally, they required new technical solutions. A year ago we had no video streaming at all, and now live broadcasting is available to all users on any device.
We significantly reworked the messaging backend with mobile devices and mobile networks in mind, which required implementing our own server with a custom protocol.
We learned to “cut” video on the GPU. According to our measurements, video cards transcode common video formats 3 times faster than CPUs.
Another major technological step is the launch of our own “cloud”, so far in experimental mode. Previously, as a rule, one application ran on each physical machine. Now, deploying services in the “cloud” will allow us to use computing resources more efficiently. And developers will not have to wait for administrators to install and configure servers: typical tasks of deploying and scaling applications in production will be performed automatically.
There are many other technical ventures still at the research stage. When they get to launch, I'll certainly talk about them.
- Given the name Odnoklassniki, we can't resist asking: how many Java classes are there in your code?
- It is hard to say. The Odnoklassniki codebase includes more than 300 modules. I have never seen them all together. I have about a quarter of them checked out for my work, and that is about 50 thousand classes. The largest module has more than 8,000 classes.
- Class!
Vitaly Khudobakhshov (Lead Analyst)
- What exactly do you do at Odnoklassniki?
- I am a lead analyst. I have to do many different things, but for the most part my work involves analyzing large amounts of data using Spark / Scala or similar tools. I process data and build all kinds of models. Sometimes I have to invent algorithms and write implementations at every level, including serving data to users through high-load Java services, but mostly I develop mathematical models.
- In your “Hacker” article you mentioned situations where int addressing is not enough. How often do you run into such situations in practice at OK, with its data volumes?
- When processing big data, I have run out of int addressing several times. I can't yet call the problem common, but it will come up in practice more and more often. And int addressing is only part of the problem. For example, many people say that LinkedList is a bad data structure, but using a very large ArrayList is often impossible because of fragmentation or Promotion Failure, so the problem runs a bit deeper than people think. I actually used code similar to what I described in “Hacker” for large calculations, and I would say there is a real shortage of data structures with long addressing. In fact, if I find the time, I will write my own library.
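One common way around both the int index limit and the need for a single huge contiguous array is to split the storage into fixed-size chunks addressed by a long index. The following is a minimal sketch under that assumption; LongAddressedArray and its chunk size are hypothetical choices, not the code from the article or the library mentioned above.

```java
// Hypothetical sketch of a long-indexed array of longs built from smaller chunks.
class LongAddressedArray {
    private static final int CHUNK_BITS = 20;               // 2^20 elements per chunk (assumed)
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long[][] chunks;
    private final long size;

    LongAddressedArray(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("negative size: " + size);
        }
        this.size = size;
        // Round up so that a partially filled last chunk is counted.
        int chunkCount = (int) ((size + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new long[chunkCount][];
        long remaining = size;
        for (int i = 0; i < chunkCount; i++) {
            // Each chunk is a separate, moderately sized allocation.
            chunks[i] = new long[(int) Math.min(remaining, CHUNK_SIZE)];
            remaining -= CHUNK_SIZE;
        }
    }

    long size() {
        return size;
    }

    long get(long index) {
        checkIndex(index);
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    void set(long index, long value) {
        checkIndex(index);
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }

    public static void main(String[] args) {
        LongAddressedArray array = new LongAddressedArray(2_500_000L);
        array.set(2_400_000L, 42L);
        System.out.println(array.get(2_400_000L));
    }
}
```

Because no single allocation exceeds one chunk, the heap does not need one large contiguous free region, which is the fragmentation concern raised above.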
- Obviously there is a lot of data to analyze at OK. Setting quantity aside, does the data have its own unique qualitative specifics?
- In fact, volume is not the only characteristic of big data, even in the general case. And yes, of course we have our own specifics. The most obvious is the social graph, which is highly specific in itself. Moreover, when content is generated in such volume and is so varied in type (and language), it creates many different difficulties. Users are quite creative in their tricks and inventive in everything else they do; all of this together creates many difficulties and interesting tasks at different levels.
- What will you talk about at Joker?
- I will talk about a very popular topic: functional programming in the context of big data processing, with examples in Scala / Spark. I will explain what functional programming gives you in practice and why it has become popular right now. Much is known about the main features of OOP and where it applies: there are patterns, there is encapsulation / inheritance / polymorphism (many people think these are features unique to OOP), yet few can say straight away what the functional paradigm is really about. And, of course, all this is especially interesting in the context of the MapReduce model.
Dmitry Bugaychenko (engineer-analyst)
- What exactly do you do at OK?
- Initially I was invited to the company to work on a music recommendation system, which turned out to be more than interesting. We then built on that experience, introducing recommendation systems in other services (groups, video, and so on).
At some point we took on an object as complex as the feed, and here we had to fundamentally change our approach to building the system: large volumes, strict requirements on response time, heterogeneous content types, and so on. As a result, the feed's tasks became a powerful driver for our analytical infrastructure, whose development I have been working on myself.
Now my work rests on three components: developing the analytical infrastructure, experimenting with feed-building algorithms, and helping colleagues and other teams bring data analysis into their processes and products.
- You have a scientific background. Does it help in your work at Odnoklassniki?
- Yes, quite a lot. A very useful skill is searching for scientific articles and publications. When we try to solve any problem, we always begin by looking at what people have already done in that area. Often there are no industry publications on the topic, but there are many academic ones, and you can learn a lot of new things from them.
Of course, in order not only to find these publications but also to use them, you need to know the language in which the scientific community expresses itself. Academic language is quite different from industry language.
- Does work on a project as large as Odnoklassniki turn out to be closer to science than work on something smaller?
- Yes, and for three reasons at once. The first is that at a smaller scale people try to use ready-made solutions to minimize costs. Companies at the level of Odnoklassniki differ in that they don't just use ready-made solutions but often develop new ones, moving the whole field forward. And the investment that a company of OK's level puts into developing both technologies and processing algorithms is incomparable with what a small company can afford, at least in terms of computing resources.
The second reason is, of course, the data. Working at Odnoklassniki, we have access to very large amounts of statistical data, which is very hard to obtain in an academic environment.
The third reason is more technological: you can build an algorithm that works fine in a small company, but for it to work at OK's scale you also have to solve many new non-trivial problems, both technological and algorithmic. It is very interesting.
- While working at OK, do you (and your colleagues) continue scientific activity in parallel? Are scientific articles published based on the experience gained at OK?
- I teach at a university. In addition, we encourage the development of the Russian data science community: we organize contests and hackathons and publish in open access some anonymized datasets built from real data. So in that respect, yes, I continue.
As for scientific articles, some have indeed come out of OK: for example, we have a patented music recommendation system about which articles have been published. But we don't manage to produce them often; the most recent was a year and a half ago. We make up for this with talks at various conferences, both technology ones like Joker and academic ones: there is plenty to say about recommender systems and data analysis.
- On the subject of Joker: what will you talk about there?
- I'll talk about the system that Odnoklassniki uses to calculate the CTR of objects in our feed. I'll also cover various peculiarities of the standard storage systems used in the Java ecosystem, and an alternative approach to data processing: using streaming analysis instead of key-value storage.
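As a rough illustration of the general idea of computing CTR from a stream of events rather than reading and writing counters in a key-value store, here is a minimal single-process sketch. It is an assumption-laden toy: StreamingCtr, its method names, and the in-memory map are hypothetical and say nothing about how the actual OK system is built.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-object CTR updated incrementally as events arrive.
class StreamingCtr {
    // objectId -> {impressions, clicks}; kept in memory by the stream processor.
    private final Map<Long, long[]> counters = new HashMap<>();

    // Called for every "object was shown" event in the stream.
    void onImpression(long objectId) {
        counters.computeIfAbsent(objectId, id -> new long[2])[0]++;
    }

    // Called for every "object was clicked" event in the stream.
    void onClick(long objectId) {
        counters.computeIfAbsent(objectId, id -> new long[2])[1]++;
    }

    // CTR = clicks / impressions; 0.0 if the object has not been shown yet.
    double ctr(long objectId) {
        long[] c = counters.get(objectId);
        if (c == null || c[0] == 0) {
            return 0.0;
        }
        return (double) c[1] / c[0];
    }

    public static void main(String[] args) {
        StreamingCtr ctr = new StreamingCtr();
        for (int i = 0; i < 4; i++) {
            ctr.onImpression(1L);
        }
        ctr.onClick(1L);
        System.out.println(ctr.ctr(1L));
    }
}
```

A production system would additionally have to partition the stream across machines, expire old data and survive failures; none of that is shown here.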
Andrey Guba (Deputy Technical Director)
- What exactly falls within your remit?
- You could probably say that I am “in charge of operations”. Formally, I oversee system administration and information security. I also lead the API and platform teams. The platform team consists of the developers who handle some of the most difficult tasks. They write our own protocols through which our applications communicate and our own data storage systems; they recently launched a new one.
- I would like to know more about that launch: why did you need to write this system, and what are its features?
- We already had our own system in which we stored photos, music, and video. But as volumes grow (the video service in particular is developing very actively), the cost of storage becomes an ever more pressing issue. Besides, as the number of servers increases, ease of operation becomes a question too. We decided to redo the existing system with these two considerations in mind. We put together a list of what we wanted to achieve and wrote a new system to match it. It is now running in production.
Here is one telling metric, for example: its replication factor is 2.1, that is, we store all data in 2.1 copies. The previous system was distributed across three data centers with three copies, one in each. Now we store copies plus some checksums, and we can lose any one of the data centers entirely without losing data and while remaining operational.
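To put the two replication factors side by side, the arithmetic can be sketched as follows. The 100 units of logical data are a purely hypothetical figure chosen for illustration; only the factors 3 and 2.1 come from the interview.

```java
// Hypothetical sketch: raw storage needed for a given replication factor.
class ReplicationCost {
    // Raw capacity required to hold logicalSize units of data.
    static double rawStorage(double logicalSize, double replicationFactor) {
        return logicalSize * replicationFactor;
    }

    // Fraction of raw capacity saved when moving from oldFactor to newFactor.
    static double savingsFraction(double oldFactor, double newFactor) {
        return 1.0 - newFactor / oldFactor;
    }

    public static void main(String[] args) {
        double logical = 100.0; // assumed amount of user data, arbitrary units
        System.out.println("raw at factor 3.0: " + rawStorage(logical, 3.0));
        System.out.println("raw at factor 2.1: " + rawStorage(logical, 2.1));
        System.out.println(String.format("savings: %.0f%%", 100 * savingsFraction(3.0, 2.1)));
    }
}
```

In other words, going from three full copies to a factor of 2.1 reduces the raw capacity needed for the same data by roughly 30 percent.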
In addition, it is convenient to scale and operate. For example, replacing a disk is a very simple operation: the failed disk is simply pulled out, a new one is inserted, and everything starts up automatically and keeps working. The system can be expanded with any number of servers, with disks of different sizes and in different numbers.
It is a similar story with our cloud: with data storage we wanted to make things cheaper, and here we want to use the resources of our servers (processor, memory, disk space) more efficiently. There are many servers, and not all resources are fully used: one resource is heavily loaded in one place, another somewhere else. So we decided to launch a cloud that lets us use them more efficiently.
We looked at what was on the market, found that some of it was immature and some did not meet our requirements, and decided to write our own. It is now being rolled out: it runs at the alpha stage and we are conducting production experiments with it. We launch some of the services, look at the statistics and refine it. According to our plans, a significant part of production will run in the cloud next year.
- And in system administration, do you also have your own solutions?
- Yes, standard tools don't always fit there either, so in some cases we have to modify existing ones and in others write our own.
That is what happened, for example, with the monitoring and statistics systems. If you use the popular Cacti system on a few hundred servers, it works fine. But we have 8500 servers, and at that scale it does not work out of the box, so we have to modify it.
- Apart from your own tools, what is specific about administration at Odnoklassniki?
- Our goal is to provide fault tolerance for a gigantic project in a distributed environment. There are several data centers, and we have set ourselves the task of making sure that if we lose any one of them, everything keeps working. Accordingly, we must design services with this in mind, and launch and maintain them accordingly. Administrators must be well prepared and must have all the tools they need for everything to work both in a normal situation and when there is a problem, up to and including the failure of a data center.
An important part of the work is not technical but organizational: making sure that in a relatively small and distributed team everyone acts and exchanges information in an agreed way. Standard procedures must be developed, everyone must know them, and when new people join, they must also learn everything they need.
Christine Steinberg (Head of HR)
- How many Java developers are there in the company, in which cities are they located, and do people relocate between them?
- At the moment there are about 120 developers in the company. They are located in three cities: Moscow, St. Petersburg and Riga. Relocations do not happen very often, but from time to time someone moves to another office.
- Since Joker will be held in St. Petersburg, let us ask separately: which teams / processes are based in the St. Petersburg office?
- Almost all product development is there: mobile apps, video, music, and so on.
- What does taking part in Java conferences give you, and what feedback do you get?
- In the Java community, OK is well known for its Java development; brand research shows this too. Many people turn to our speakers for advice and catch them at the stand to clarify points about high-load systems. Conference attendees are well aware that no one in Russia has the kind of experience with Java that OK has, so after our speakers' talks a queue of people wanting to ask questions forms, several meters long.
- Some developers are prejudiced against OK. Would you like to say something about that?
- I would like to say that all people are different: some like bicycles, others like cars :) Developers who are familiar with what we do know that OK has very interesting technical tasks. There are plenty of challenges in solving the problems of high-load systems.
- Thank you! We look forward to seeing you all at Joker 2016. In the meantime, here are some earlier talks by speakers from Odnoklassniki: