Kubernetes intensive: how the support team works

Slurm-3, a Kubernetes intensive, will be held on February 1-3. The announcement and program are here.

Today I will tell you a little about what goes on behind the scenes: how we help students get through the practice and what comes of it. Future participants will also learn what to expect from support.

I take paid courses myself 2-3 times a year, always choose the options with practice, and very rarely complete it. For me it is like ordering a one-kilogram steak in a restaurant: I eat as much as I can and leave the rest on the plate. But I would like those who come to Slurm to take in the whole course.

At the first Slurm, we took a relaxed view of the practice: we hand out the assignments, and the participants cope as best they can. That would have ended in disaster had there not been proactive, talented people in the audience: “I wrote to the chat about a problem 15 minutes ago, and I have already solved it myself and helped five others.”

Therefore, at the second Slurm, in addition to the three speakers, a dozen tech support staff worked with the students: system administrators from the Southbridge team.

Where do the problems with practice come from?

The “Do It Yourself” approach. We could have made a walkthrough: “copy the config, run the playbook, voila, your cluster is ready.” It would have been very fast, very simple, and very pointless. We took the hard way: to complete a task, you need to understand the topic and edit the configs and settings by hand.

Snowball. All topics and tasks build on each other. If you did not deploy a cluster on the first day, you will not be able to roll out an application on the second. The most important and most difficult topic was Ceph.

Rough moments and screw-ups

Ceph is a key and complex topic, and it is impossible to move on without it, so the mass stall on Ceph did a great deal of damage. Here the support staff gave it everything they had.

Error on the slide. We are all human, speakers included. There were mistakes on the slides, and each one meant that all 87 students would write to the chat that nothing worked for them.

Broadcast glitches. We bought a dedicated channel from the provider and kept a backup channel from MegaFon, but by the law of meanness it did not save us. On the first day of Slurm, a major backbone provider went down, the one that carried the channel to the Facecast service. We launched the broadcast on YouTube, but in the meantime the speakers and the in-person students had moved ahead, and the online students who fell behind made a scene, some even disconnecting from the classes. The next day, Facecast changed its provider connection scheme, but the system did not immediately work well for all users. And the whole wave of indignation fell on our support staff.

(The problem caused by the fallen provider was solved: we stopped the classes, waited until everything was fully working again, and repeated all the missed material. The lags of the second day simply had to be endured.)

So, a student asks for help.

Support has to choose a course of action:
- let the student work through the trouble on their own;
- find the student's mistake and explain it;
- do the practice stage for the student.

Some errors are almost impossible to spot: a wrong login, the letter I instead of l (capital i instead of lowercase L), and the like.

When a backlog builds up, a queue forms for support. It is impossible to help five people thoughtfully at once under time pressure.

And the time pressure was serious: several thousand messages a day flowed through the internal tech support chat. The support staff signed off at midnight and started work again at 6 in the morning (fortunately, the support staff, like the students, are scattered across different time zones).

Therefore, instead of a breakdown of the mistake, participants received the answer: “I have fixed everything, your cluster now works as it should, move on.” Yes, we botched “Do It Yourself”, but we managed to avoid the snowball.

Small simple joys

The support team collected questions from the chat and from a special form, sorted them, answered them, and passed the complex ones on to the speakers. As a result, no questions were left unanswered.

It turned out that online participants found it inconvenient to switch between the broadcast and the console, and we had no text file with the commands, only the presentation on the speaker’s laptop. So one of the support staff sitting in the hall typed up the commands from the slides and sent them to Telegram.

In general, behind the bright speakers stand a dozen hard workers, thanks to whom the overwhelming majority of participants reached the end of the practice. Fortunately, Southbridge is in the business of infrastructure support, so anyone on our team can help.

Slurm 3 will be better than Slurm 2

What was done spontaneously at Slurm-2 we are now systematizing and optimizing:
- we assign each support engineer their own group, so that the students know their support person by sight;
- we are writing a knowledge base of typical mistakes and solutions;
- we are preparing shortcuts for “If you didn’t cope with the practice, but want to move on”;
- we are preparing a participant’s memo with instructions on setting up the workplace and interacting with support.

Slurm-3: launching a Kubernetes cluster

Source: https://habr.com/ru/post/433922/
