The history of one bug

Just yesterday, I had to deal with one very subtle and specific bug. The bug turned out to be a feature that stumbled over another bug. During the study of the problem, I was forced to learn several features of Debian, to kill 4 hours of time and get a lot of experience.

In a slightly sleek look, I cite the chronology of events, I hope someone will be interested to see how system administrators work.

Prehistory

During the deployment of the bench for experiments from several identical servers, I wanted to be able to run the necessary versions of the application without manual work on updating the code on a bunch of hosts. It was decided to run the necessary programs with NFS-balls. Applications were internal use only, disposable, and written for a specific task. The ball was mounted to the / opt directory at startup and applications were launched from there using the rc.local script. Since it was about an experimental booth with very frequent code changes, it was lazy to play fair developer (packages, repository, updates, init.d scripts). Everything happened under Debian Squeeze.
')
The ball was written in / etc / fstab, the launch of the necessary tests - in rc.local. It would seem that everything is done.

... And then I stumbled upon the Mystic. Applications started out of five times, and the version of the "crooked application" was ticked off almost immediately - just as sometimes other executable files were not launched. And with / opt. From other directories worked fine. At the same time, you run rc.local with your hands - 100% is fine When loading - successful start time from five, or even less often.

In the beginning, I did not take this problem as serious, and tried to solve it with a limp. Since the problem manifested itself only for / opt I added the command ls -a1 / opt> / var / log / ls to rc.local. As expected, in / opt at the time of executing rc.local there were only two files - a dot and two dots. In other words, the NFS ball was not mounted. Sometimes. And sometimes it was mounted.

Obviously, the situation is “sometimes mounted, sometimes not” (despite the fact that after the login I saw a mounted ball always) means exactly one thing - a race condition between mounting the ball and running rc.local.

We start to decide

Hypothesis: Debian Squeeze (and the whole thing went on squeeze) switched to a new dependency system for init scripts. I registered in /etc/init.d/rc.local (the init script calling /etc/rc.local) a direct dependency on $ mountnfs (that is, it demanded that the script be run after mountnfs).

That helped. By a ridiculous coincidence, there are already two successive reboots. Then everything continued as it was - sometimes success, more often not.

I was surprised, because nfs-balls are mounted on S runlevel, and rc.local is launched on the second runlelvel (that is, later on S).

I added the ls -a1 / opt >> / var / log / ls sacramental line to the /etc/init.d/rc script - this is the script that executes all the other scripts from /etc/init.d. To be precise, it reads the /etc/rcS.d, /etc/rc2.d directories, etc., and, depending on the first letter (K or S), runs the scripts from there with the stop or start parameter. These directories themselves contain symlinks for scripts in / etc / init.d, called S10service. S (or K) points to start (kill), a number allows you to order the order in which scripts are run.

So, at the very end of /etc/init.d/rc I registered the next ls. Rebooted. I found in the log three sets of output ls (one was executed at shatdown, two at start are the levels 'S' and '2'). Surprisingly, the last two were empty. That is, at the end of the execution of the scripts for the transition to runlevel 2 balls still not mounted! In other words, the script dependency system, launch order, etc. does not play any role - at the time of the end of loading there are no balls

I removed hooliganism in /etc/init.d/rc and added a similar line to the last boot script called stop-bootlogd (it appears in /etc/init.d/rc2.d under the name S99stop-bootlogd, that is, it is launched last ).

Of course, after downloading, I got a clean directory. Bearing in mind that sometimes the directory was still mounted, and this is most likely a “race condition” (race condition), I added crazy 360 before ls.

After 6 minutes of tea drinking, I discovered that the catalog was still mounted at the time of the end of the "heavily braked" download. Then I began to reduce the time: 60, 45, 30, 20, 10, 5 seconds - the files were stable. Moreover, the files were even with sleep 1, but when it was deleted, it is most likely not. (I made 20 reloads in this place, but I received reliable results - 10 out of 10 sleep 1 gave a mounted directory, 8 out of 10 reloads without sleep gave an empty directory. That is, the race condition was very thin.

At some of the Ubuntu forums (yes, I spent 15 minutes searching for various keywords, mainly around the words nfs, rc.local, boot) this problem was discussed, and there also appeared a solution in the form of sleep (but already in the rc.local file itself).

Deeper and deeper

At this one could calm down, if it were not for the internal discontent with such a shamanic method of caulking incomprehensibility.

I remembered that when I added the $ mountnfs dependency, I opened the /etc/init.d/mountnfs.sh file (in order to copy the contents of the Provides line). And there was a description:

 # Short-Description:
 # Description: Network file systems are mounted by
 # /etc/network/if-up.d/mountnfs in the background
 # when interfaces are brought up;  this script waits
 # for them to be mounted before carrying on.

For me, there were many important discoveries in this short description:
1) NFS is mounted in the background, asynchronously, that is, in parallel with the main load. That is, race condition goes between the NFS mount process and the boot process.
2) This script should ensure that all NFS balls are mounted at the time of their execution (it was called closer to the end of the 'S' runlevel). And he does not.

Then I turned to the mount.nfs manual, where the following was written: the bg option does the mount in the background (as far as I can tell, this feature appeared in squeeze, because the lenny manual doesn’t mention anything like that) in synchronous mode.

The obvious solution is to add fg to the list of NFS mount options in / etc / fstab. Bingo? Reboot ... no. Did not work.

At that moment, I wanted to send everything, insert sleep 5 and not sweat, for the time went by midnight, and the petty problem suddenly grew into quite a research, almost scientific. But sports interest still persisted.

I looked in the mindmap (by this time I recognized the problem as a “problem” and switched from mindful notes to mindmap) to a description from mountnfs.sh and focused on the phrase “mounted by /etc/network/if-up.d/mountnfs ".

I got into this script.

There was a meager comment:

 # Using 'no! =' Instead of 'yes =' to make sure async nfs mounting is
 # the default even without a value in / etc / default / rcS

Do we have a file / etc / default / rcS? (In Debian, the script settings from init.d are stored in / etc / default, but it was somewhat unexpected that the configuration of the runlevel was not stored in the same file).

There were a few lines in this file, but nothing interesting but one:

 See the rcS (5) manual page.

We have a manual!

And there I stumble upon the option description:

 ASYNCMOUNTNFS
  Set this to 'no' to disable asynchronous mounting of network
  drives when the network interfaces are mounted
  it only once when the machine boot.  The default is 'yes'.  It
  root file system
  in NFS until ifup from ifupdown worked properly in such setup.

Bingo? Bingo! Bingo! Banzai We happily set the line “ASYNCMOUNTNFS = no” into the / etc / default / rcS file, rebooting ...

... and again we pick exactly the same problems.

I went back to the /etc/network/if-up.d/mountnfs file, where I quickly found the function that was executed if the ASYNCMOUNTNFS option is set to 'no':

 exit_unless_last_interface () {
     grep "^ [: space:] * auto" / etc / network / interfaces |  \
         sed -e 's / [\ t] * auto [\ t] * //; s / [\ t] * $ //; s / [\ t] / \ n / g' |  \
         while read i;  do
         if [`grep -c $ i / etc / network / run / ifstate` -eq" 0 "];  then
             msg = "if-up.d / mountnfs [$ IFACE]: waiting for interface $ i before doing NFS mounts"
             log_warning_msg "$ msg"
             # Can not pass this while variable
             mkdir / var / run / network / mountnfs_earlyexit 2> / dev / null
         fi
     done
     if [-d / var / run / network / mountnfs_earlyexit];  then
         rmdir / var / run / network / mountnfs_earlyexit 2> / dev / null
         exit 0
     fi
 }

Attention to the second line. Grep. And what are we getting hot? That's right, the line with 'auto'.

Stop, but the debian installer ... Yes, quite right, instead of the auto option (in the / etc / network / interfaces file) it uses the hot-plug option, which allows you to turn the interface on / off on the go ...

Decision

In other words, since the eth0 interface is not written to auto in the default installation, but is written to the hot-plug, the script does not expect this interface to be ready for mounting and we get the race condition not between the nfs mount process and rc.local, but between the network readiness interface and mount process. It is clear that the stubborn NFS will resend the bag again if it failed the first time. But he will not do it right away ... Taking into account that it was about optics - as long as the module detects SFP +, while it is initialized, while the physical link appears there ... In fact, you can see it with your eyes - the process is not very fast, and in some circumstances sleep 1 and not enough ...

So, we set the interface in / etc / network / interfaces in auto, we are overloaded, and, hurray, the ball appears on time. One more time, one more time and one more time. 12 of 12. Problem solved.

Replace rc.local, it works. On the clock half past one night, good guards closed the gate and I had to knock for about 10 minutes to let me go to the car.

But it was worth it.

Morality

Moral 1: Magic and God does not exist. If you are confronted with strange behavior, he always has a rational explanation, perhaps with irrational motivation (bug).

Moral 2: having found the true cause of the problem, you will be able to come up with a more reliable workaround than the one built on the model of a scientific question (it seems to work and it’s ok).

Moral 3: in the course of investigating the problem, you somehow learn in more detail how the next component of the system works. It is from such knowledge that truly profound knowledge in the profession develops. For myself, I learned exactly how NFS balls from / etc / fstab are mounted, learned that there is a file / etc / default / rcS, clarified the exact list of used runlevels in Debian (I knew that they were there, I knew what it was and how to use it, but like this, in order to precisely ottarabanit what lefts for which they go - no. Now - yes).

Moral 4: and you can also send bugreport and solve this problem by the most correct method - fix the problem. In this case, the better (and in fact) you describe the problem, the more likely it is to solve it. Agree that a statement of an error in a particular function, with an explanation of why and under what circumstances, sounds much better than the phrase “NFS is not mounted at the time of running rc.local”.

Source: https://habr.com/ru/post/113482/

All Articles