ASC'18: Perseverance and regular training as a way to achieve a goal

Student supercomputer competitions are held annually in various points of the world, and aim to attract young talent in the field of high-performance computing in industry and science. This year, our team took part in the Asian competitions, and this article will discuss the experiences and impressions received at this event.

Tasks and the progress of the qualifying stage

This year, for the first time, there were no tasks to be solved using the equipment provided by the organizers: all the tasks had to be run on their hardware. Thanks to the persistence and perseverance of the professors, shortly before the start of the qualifying stage, our team had access to several nodes with NVIDIA P100 and P6000 video cards, which greatly helped us in the preparation. Tasks differed little from last year . They are described below.

Build a cluster configuration and describe why certain components were selected.
Measure cluster performance using Lynpack and HPCG. The difference from last year was only that last year Lynpak had to be optimized for the cluster provided by the organizers with Intel Xeon Phi processors, and in this for any available cluster.
Optimize Relion (software for image recognition from a cryo-electronic microscope) under a video card.
Build a neural network to respond to user search queries using the CNTK framework and MS MARCO dataset.

Linpack and HPCG. Due to the emergence of new nodes with video cards and Vadim, who was engaged only in performance tests, we have advanced significantly in the first and second assignments. Vadim was able to make as many test runs as needed to confidently fit the parameters to a specific system. Also, on the new nodes it became possible to regulate power consumption, which made it possible to select the cluster configuration taking into account changes in the frequency of the processor and the graphics chip. The emergence of new nodes was the biggest event for the team.

Relion. The code written by biomedical chemists did not differ in a well-thought-out architecture and contained hard-to-read files of several thousand lines of code. Synchronization was provided by the sleep() system call. There were dozens of gigabytes of input data, even more days off, one iteration took an average of forty minutes, and it was impossible to understand at once how to optimize all this. After two weeks of searching, my own memory allocator for a video card was written; Fourier transform and some other subprograms were transferred to video cards. Due to the complexity of the code and the limited time, the remaining optimizations were made after the preliminary stage.

CNTK. As usual, in the task of machine learning a basic configuration of the neural network was given, from which it is worth repelling. The framework and the network itself out of the box did not work. CNTK required a special version of OpenMP, in the utility for checking the result, the functions had incompatible types of parameters, and their number did not match. When everything finally started up, they began to deal with the network architecture. Alas, neural networks are still a weak point of our team, so we didn’t make any very complex changes. The percentage of discarded neurons was changed, the learning rate, the initial value, were experimented with using GRU instead of LSTM in the recurrent part.

Preparing for the final and finding a sponsor

So, cheers, we went to the final!

This time we immediately wrote to our last year's sponsor and began to prepare. Soon, two events happened: last year's sponsor refused us, and the university allocated funds covering part of the cost of the flight. Next was the generation of ideas where to find the balance. In the end, we volunteered to help the company Devexperts , engaged in the development of financial software for exchanges, brokers and investment companies. In the qualifying round, the team solves problems on the equipment that is available to it. Sometimes part of the problem is solved on the equipment provided by the organizers. In the final, everything is exactly the same except for one but ... this very cluster the team has yet to assemble!

Last year, none of the team members had any experience in setting up a cluster, which is why we have little time left for launching competitive tasks, so this year we conducted a series of trainings on the training cluster. At each training session, we created backup copies of the nodes, fully configured one node, then copied its image over the network to other nodes. As shown, this is the fastest and most painless way to configure a cluster from scratch, which does not require in-depth knowledge of low-level technologies. Several trainings were enough to completely debug and automate the process.

Final: Day One and Day Two

In the first two days of the competition, teams assemble and set up a cluster on which all applications will be subsequently launched. As a rule, the more video cards you have in your system, the more you will be able to get performance on most tasks, and the fewer nodes you can install in a rack due to a power limit (the power should not exceed 3 kW, otherwise the results of the task are not counted). However, there are applications in which video cards are not used in any way, and the presence of a large number of nodes is beneficial.

This year, the competition sponsors provided four NVIDIA V100s to each team. The first to find the right person and get the cherished boosters, we set about the installation. No one on the team (including the trainer) had ever had experience installing video cards on a server. After studying the instructions kindly provided by the organizers, we coped with the task, and we did not even have to completely pull out and disassemble the server, as most teams do (see video).

Next, it was necessary to install and configure the operating system. As a rule, students have the least confidence in this area, because it uses niche technologies, the knowledge of which is not useful in other areas, so we had to do some preliminary training to fully configure the system of five nodes from scratch on the training cluster using Clonzilla.

Configuration scripts were debugged on a virtual cluster using Vagrant, since only in it one can easily raise several identical virtual machines. Due to the low level of software that we use, Docker and other technologies based on Linux namespaces are not suitable for us. Configuration scripts simply do not work on them.

Armed with the experience gained in training, we deployed the operating system and the rest of the packages even faster than in training - the performance of the servers cannot be compared with our training cluster. One of the features of the competition is the lack of access to the Internet from the servers, so we downloaded the repository with the packages in advance and recorded them on two USB disks that we took with us.

After setting up the cluster, each team member began setting up and testing his application for the new system, and here we were in for an unpleasant surprise. The version of Lynpak, which remained from the last year of the competition, refused to work correctly on the new system. Installing different versions of CUDA, going through various options and kernel settings did not give the desired effect. As a result, we decided to launch the usual non-optimized version in order not to lose points for the task. (This is due to the new scoring system this year: even if your result is the best in speed or performance, but the output is incorrect, you will receive only half of the maximum possible points. The second half of the points are collected for the correctness of the output.)

To understand the essence of the problem, it is worth telling what Lynpak is. Linpack is used to measure the performance of supercomputers and compile a list of the TOP500 most powerful supercomputers in the world. The easiest way to take a high place in this list is to purchase a cluster with a large number of video cards (the number of processors is not so important, because 99% of the task is given to a video card). For each accelerator, there is an optimized version of Lynpack, whose code is usually closed. You can get a binary only if you have a supercomputer that can rank on the TOP500 list, or if you participate in a supercomputer competition. Despite this, the organizers of the competition did not provide the binary, the Russian branch of NVIDIA also refused to do so. There are no clusters with V100 in Russia that could be included in the TOP500 list, so searches for familiar colleagues were also unsuccessful. The fact that Linpack is not used anywhere except for performance testing, neither in science nor in technology, adds to the incomprehensibility of the situation. If you want to help the team and know how to get the coveted program, you are welcome in PM. Well, we, with the inherent spontaneity, noted this story in the final presentation, which could please the jury members.

Final: third day

Lynpak, HPCG, Relion and the secret application were waiting for us on the third day of the competition, and this day became the most difficult for the team. Having quickly dealt with Linpack (see previous section) and HPCG, we received work orders (input data) for a secret application. It turned out to be a program for calculating the molecular dynamics of siesta . The first disappointment was the fact that on the part of the tasks Siesta gave an error to the address (despite the fact that it was written in Fortran, in which such an error is not so easy to get), and it was not possible to debug it. Nevertheless, the remaining tasks earned and at the end of the day we successfully passed them.

In parallel with Siesta, we had to launch Relion in advance. All nodes without video cards were given to Siesta, and nodes with video cards were given to Relion, so the programs did not interfere with each other.

We have changed the Relion code a lot in the preliminary stage to make it work efficiently on video cards. Among other things, we have parallelized many functions, rewritten the memory allocator on the video card, transferred the most resource-intensive routines to the video card and added the ability to use nodes with and without video cards at the same time. This greatly accelerated the program, and it worked perfectly on university technology. However, at the competition we got video cards with a smaller memory size, which is why Relion crashed with an error. A deeper analysis of the error showed that the code will work only if it is rewritten under the new system. We had no time for this, and this was the second disappointment of the third day.

Final: fourth day

On the fourth day of the competition, CFL3D and MSMARCO remained, and this day passed much more calmly. Freed from the applications assigned to them, team members began to help each other. For CFL3D, which has a very complex input file format, Ruslan wrote a script that generates it. Since we had a lot of nodes compared with teams with a large number of video cards, we launched several tasks in parallel and after several starts of each task we were able to select the optimal parameters.

Running a previously prepared MSMARCO also did not cause serious problems. The preliminary data processing took a few hours, which is why there was no time left for a long study, but thanks to more powerful video cards, it was possible to complete it, albeit with a smaller number of epochs. We still have a model trained in more epochs from the qualifying stage (in the final, the input data changed, and there was no new file for verification), but according to the rules, a model that was trained during the final was needed, and we decided to turn in an honestly trained model. Despite the coordinated work and the absence of surprises, we used all the allotted time and finished late in the evening.

Final: Day Five

The next day, we were waiting for the presentation. In the evening of the fourth day, we inserted the results into a template prepared in advance and wrote a speech. The presentation was easy, we were never asked any interesting questions, but for some reason we were allowed to shoot only the speaker and slides.

A few hours later the award ceremony began. The sensations were mixed: on the one hand, we performed much better than last year, on the other - we could have performed even better if it were not for the annoying errors with the applications. As a result, despite the fact that our cluster did not differ by a large number of video cards, due to the greater number of nodes and perseverance, we went around the other teams in CFL3D, for which we were awarded a separate prize for the competition. In the overall standings, we took the eleventh place out of twenty teams that reached the final (and out of three hundred teams that participated in the preliminary stage). The overall champion, like last year, was Xinhua University. For our team, this was a victory over ourselves: we performed better than last time, gained invaluable experience, which we will use next year, and went around the others in one of our assignments.

Conclusions and general impressions

The cluster configuration, in which there are many more video cards than processors, is advantageous in most cases, but not universal. Nodes become smaller, and not every application can, in principle, work on a video card. Such applications include programs in Fortran, which, because of their venerable age, are not rewritten as a video card, and most often they do not even use all the processor cores. For such applications, the presence of a large number of nodes allows you to run more parallel tasks, and therefore more optimize applications.

The team may not know all the subtleties of installing operating systems and casting images, but this gap is easily replaced by training. Of course, participants will not know all the subtleties, but they will surely perform the installation point by point. Installation scripts are easily debugged on virtual machines.

During the competition, you can meet with the most wonderful open source software. Programs that are compiled by illegible scripts, programs that use library functions rewritten with errors, programs written in Fortran with C inserts, programs with hard-coded dependencies and compilation flags. I can not remember a single program that would be assembled the first time or produced a clear error during assembly. (Fresh example: the old OpenMPI version on new systems tries to connect the library with an empty name. The problem is reliably solved only by auto-replacement in the generated make-files.) The contest teaches not to be surprised at anything and to overcome the difficulties that arise. I want to believe that a person who has worked with such software will never create something like this in his life.

At the competition you can’t stop wondering about Chinese ingenuity. This year, under the place in which the competition is to be held, the Chinese redid the square-shaped conference room with cut corners. It brought racks with servers and a cooling system with a liquid outlet to the nearest bathroom (I am not an expert in this subject and I don’t know the exact name of the equipment). When they realized that the temperature in the hall did not fall below thirty degrees Celsius, they brought huge ice blocks in the basins. This situation, of course, did not change, but it provided the team with chilled drinks.

Thanks

Participation in the competition would be impossible without our sponsor - the company Devexperts ( http://devexperts.com/ ). The company assumed the cost of airfare to China.

Impressions of China

Some team members were in China for the first time, which made them a little culture shock. Says Anton.

Familiarity with the land of the rising sun began with the fact that some of us had their batteries taken off for charging, because they did not have a power mark. Apart from this plot, everything else was friendly. We were met by two volunteers with a sign of our university, and then taken by bus to the hotel. It is worth noting that after St. Petersburg, where the snow was about to melt, it was pretty hot in China (although, of course, the real people of Volgograd did not even feel this). Upon arrival, we were settled in rooms. After daily flights I decided to flop with fatigue on a huge mattress.

Hard was the mattress.
Depressed samurai.

At that time it was about ten in the morning, so an hour later we went to inspect the local university. To say that he is big - to say nothing. If we compare the territory of the local campus and St. Petersburg State University, then Nanchang University is five times more. We were shown the local canteen, where for the next five days we ate noodles and rice. For the most part, the first communication with ordinary Chinese, whose knowledge of English is not so good, began right here.

The dining room is designed in such a way that each window is a mini outlet where you can buy something. , , NFC , . , . , , . . , . , . , . , , . - , .

, . (, , .) , , .

, , , , , .

Source: https://habr.com/ru/post/417917/

All Articles