On September 14, GitLab support reported a critical problem at one of our customers: GitLab would work fine for a while, and then users would suddenly hit an error. When they tried to clone certain repositories over Git, they saw a cryptic message about a stale file: Stale file handle. The error persisted for a long time and blocked all work until a system administrator manually ran ls in the affected directory.
I had to study the internal mechanisms of Git and the NFS network file system. In the end we found a bug in the Linux NFS v4.0 client, Trond Myklebust wrote a kernel patch for it, and since October 26 that patch has been included in the mainline Linux kernel.
In this post I will describe how we investigated the problem, which directions we considered, and which tools we used to track down the bug. We were inspired by the excellent detective work in Oleg Dashevsky's post “How I have been hunting for a memory leak in Ruby for two weeks”.
This is also a great example of how debugging open source software is a team sport involving many people, companies, and countries. The GitLab motto “everyone can contribute” applies not only to GitLab itself but to other open source projects as well, such as the Linux kernel.
We used NFS on GitLab.com for many years, but eventually stopped using it to access repository data on the application machines, having moved all Git calls to Gitaly. We still support NFS for customers who manage their own GitLab installations, but we had never seen a problem like this customer's before.
The customer gave us some helpful hints:

1. The full error text: fatal: Couldn't read ./packed-refs: Stale file handle.
2. The error seemed to start after they manually ran a git gc command.
3. The error went away after an administrator ran the ls utility in the directory.
4. The error also went away when the git gc process terminated.

Clearly the first two points are related. When you push to a Git branch, Git creates a loose ref: a file, named after the branch, that points to a commit. For example, pushing to master creates a file named refs/heads/master in the repository:
$ cat refs/heads/master
2e33a554576d06d9e71bfd6814ee9ba3a7838963
The git gc command performs several tasks. Among other things, it collects these loose refs and packs them into a single file called packed-refs. This speeds things up a bit, because reading one large file is cheaper than reading many small ones. After running git gc, the packed-refs file might look something like this:
# pack-refs with: peeled fully-peeled sorted
564c3424d6f9175cf5f2d522e10d20d781511bf1 refs/heads/10-8-stable
edb037cbc85225261e8ede5455be4aad771ba3bb refs/heads/11-0-stable
94b9323033693af247128c8648023fe5b53e80f9 refs/heads/11-1-stable
2e33a554576d06d9e71bfd6814ee9ba3a7838963 refs/heads/master
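Git resolves a ref by checking the loose file first and falling back to packed-refs. To make that lookup order concrete, here is a minimal Ruby sketch; resolve_ref is a hypothetical helper written for illustration, not Git's actual implementation.

```ruby
# Hypothetical helper for illustration; not Git's actual code.
# Resolves a ref name to a SHA: loose file first, then packed-refs.
def resolve_ref(git_dir, name)
  loose = File.join(git_dir, name)
  return File.read(loose).strip if File.exist?(loose)

  packed = File.join(git_dir, 'packed-refs')
  return nil unless File.exist?(packed)

  File.foreach(packed) do |line|
    next if line.start_with?('#', '^') # skip the header and peeled-tag lines
    sha, ref = line.split(' ', 2)
    return sha if ref&.strip == name
  end
  nil
end
```

Because the loose file wins, a branch updated after the last git gc still resolves to its newest commit.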
How is the packed-refs file created? To find out, we ran strace git gc in a repository that had a loose ref. Here are the relevant lines:
28705 open("/tmp/libgit2/.git/packed-refs.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 3
28705 open(".git/packed-refs", O_RDONLY) = 3
28705 open("/tmp/libgit2/.git/packed-refs.new", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 4
28705 rename("/tmp/libgit2/.git/packed-refs.new", "/tmp/libgit2/.git/packed-refs") = 0
28705 unlink("/tmp/libgit2/.git/packed-refs.lock") = 0
The system calls showed that the git gc command:

1. Opens packed-refs.lock. This tells other processes that packed-refs is locked and must not be changed.
2. Reads the existing packed-refs.
3. Opens packed-refs.new and writes the packed refs into it.
4. Renames packed-refs.new to packed-refs.
5. Removes packed-refs.lock.

The key point here is step 4, the rename, where Git puts the new packed-refs file in place. Besides collecting loose refs, git gc also performs a far more resource-intensive task: it searches for and removes unused objects. In large repositories this can take more than an hour.
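The lock-write-rename sequence that git gc uses for packed-refs can be sketched in Ruby as follows; pack_refs is a hypothetical helper mirroring the strace output, with error handling omitted.

```ruby
require 'fileutils'

# Hypothetical sketch of the lock-write-rename pattern seen in the strace
# output; not git gc's actual implementation.
def pack_refs(dir, refs)
  lock_path = File.join(dir, 'packed-refs.lock')
  new_path  = File.join(dir, 'packed-refs.new')

  # O_CREAT|O_EXCL fails if another process already holds the lock.
  File.open(lock_path, File::RDWR | File::CREAT | File::EXCL) do
    File.open(new_path, File::RDWR | File::CREAT | File::EXCL) do |f|
      f.puts '# pack-refs with: peeled fully-peeled sorted'
      refs.sort.each { |name, sha| f.puts "#{sha} #{name}" }
    end
    # The rename is atomic: readers see either the old file or the new one,
    # never a half-written file.
    File.rename(new_path, File.join(dir, 'packed-refs'))
  end
ensure
  FileUtils.rm_f(lock_path)
end
```

It is exactly this rename over a file that may be open elsewhere that NFS handles differently from a local disk.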
That made us wonder: in a large repository, does git gc keep the packed-refs file open for the whole cleanup? We studied the strace logs, ran the lsof utility, and here is what we learned about the git gc process:
As you can see, the packed-refs file is closed at the very end, after the potentially long “Garbage collect objects” phase.
So the next question arose: how does NFS behave when the packed-refs file is open on one node while another node renames it?
“For scientific purposes,” we asked the customer to run an experiment on two different machines (Alice and Bob):
1) In a shared NFS volume, create two files, test1.txt and test2.txt, with different contents so they are easy to tell apart:
alice $ echo "1 - Old file" > /path/to/nfs/test1.txt
alice $ echo "2 - New file" > /path/to/nfs/test2.txt
2) On the Alice machine, open the test1.txt file and keep it open:

alice $ irb
irb(main):001:0> File.open('/path/to/nfs/test1.txt')
3) On the Alice machine, continuously display the contents of test1.txt:
alice $ while true; do cat test1.txt; done
4) Then, on the Bob machine, execute the command:
bob $ mv -f test2.txt test1.txt
The last step reproduces what git gc does with the packed-refs file when it overwrites the existing file.
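For comparison, on a local POSIX filesystem the semantics are unambiguous: a descriptor opened before the rename keeps reading the old inode, while a fresh open sees the new content. A small Ruby sketch of the same Alice-and-Bob scenario, run on local disk rather than NFS:

```ruby
require 'tmpdir'

Dir.mktmpdir do |dir|
  old_path = File.join(dir, 'test1.txt')
  new_path = File.join(dir, 'test2.txt')
  File.write(old_path, "1 - Old file\n")
  File.write(new_path, "2 - New file\n")

  f = File.open(old_path)        # open before the rename, like Alice's irb session
  File.rename(new_path, old_path) # like Bob's `mv -f test2.txt test1.txt`

  f.read              # => "1 - Old file\n" (the old inode, still open)
  File.read(old_path) # => "2 - New file\n" (a fresh open sees the new inode)
  f.close
end
```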
On the customer's machines, the result looked like this:
1 - Old file
1 - Old file
1 - Old file
cat: test1.txt: Stale file handle
There it is! We had reproduced the problem in a controlled way. But in the same experiment against a Linux NFS server the problem did not arise; after the rename, the new content was picked up as expected:
1 - Old file
1 - Old file
1 - Old file
2 - New file <--- RENAME HAPPENED
2 - New file
2 - New file
Where does this difference in behavior come from? It turned out the customer used Isilon NFS storage, which supports only NFS v4.0. When we switched our Linux NFS server connection to v4.0 with the vers=4.0 mount option in /etc/fstab, the test showed yet another result:
1 - Old file
1 - Old file
1 - Old file
1 - Old file <--- RENAME HAPPENED
1 - Old file
1 - Old file
Instead of a Stale file handle error, the Linux NFS v4.0 server showed stale content. It turns out this difference in behavior is allowed by the NFS specification. From RFC 3010:

“A file handle may or may not become stale or expire on a rename. However, server implementers are strongly encouraged to take steps so that file handles do not become stale or expire in this way.”
In other words, NFS servers may choose how to behave when a file is renamed, and the Isilon NFS server was within its rights to return Stale file handle in such cases. We assumed the root cause was the same in both setups, even though the symptoms differed. We suspected cache validation was involved, because running the ls utility in the directory cleared the error. Now we had a reproducible test case, and we turned to the experts: the Linux NFS maintainers.
Once we could reproduce the error step by step, I wrote to the Linux NFS maintainers describing what we had learned. For a week I corresponded with Bruce Fields, the maintainer of the Linux NFS server, who suggested examining the network traffic to pinpoint the bug. He thought the problem might be in the NFS server's delegation handling.
In a nutshell, NFS v4 introduced delegations to speed up file access. A server can delegate read or write access to a client so that the client does not have to keep asking the server whether another client has changed the file. Put simply, a write delegation is like lending someone your notebook and saying: “Write in it for now; I'll take it back when I need it.” The other person does not have to ask for the notebook every time they want to write something down; they have complete freedom until it is taken away. In NFS, asking for the notebook back is called recalling the delegation.
A bug in NFS delegation recall could explain the Stale file handle problem. Remember that in our experiment Alice opened test1.txt, and Bob then replaced it by renaming test2.txt over it. Perhaps the server failed to recall the delegation for test1.txt, leaving behind invalid state. To test this theory, we recorded the NFS traffic with the tcpdump utility and visualized it with Wireshark.
Wireshark is a great open source tool for analyzing network traffic, especially for exploring NFS in action. We recorded the trace using the following command on the NFS server:
tcpdump -s 0 -w /tmp/nfs.pcap port 2049
This command records all NFS traffic, which usually passes over TCP port 2049. Since our experiment succeeded with NFS v4.1 but failed with NFS v4.0, we could compare NFS behavior in the working and broken cases. In Wireshark we saw the following:
In this diagram, you can see that in step 1 Alice opens test1.txt and gets back an NFS file handle with stateid 0x3000. When Bob tries to rename the file, the NFS server tells him to retry by sending an NFS4ERR_DELAY message, while it recalls Alice's delegation via a CB_RECALL message (step 3). Alice returns her delegation (DELEGRETURN, step 4), and Bob sends the RENAME again (step 5). The RENAME completes in both cases, but Alice continues to read the file using the same handle.
The difference shows up in step 6. In NFS v4.0 (the stale-file case), Alice keeps using the same stateid. In NFS v4.1 (the working case), Alice performs additional LOOKUP and OPEN operations, so the server returns a different stateid. In v4.0 she never sends these extra messages. That explains why Alice sees stale content: she is using the old handle.
Why does Alice perform the extra LOOKUP in one case but not the other? The delegation recall had apparently succeeded, so some other problem must have remained, for example a missed cache-invalidation step. To rule out delegations entirely, we disabled them on the NFS server with this command:
echo 0 > /proc/sys/fs/leases-enable
We repeated the experiment, but the problem did not disappear. That convinced us the fault lay neither with the NFS server nor with delegations, and we decided to look at the NFS client in the kernel.
The first question the Linux NFS maintainers asked us was whether the problem still occurred on the latest kernel. It had shown up on CentOS 7.2 and Ubuntu 16.04 with kernels 3.10.0-862.11.6 and 4.4.0-130 respectively, but both kernels lagged behind the latest version, which at the time was 4.19-rc2.
We deployed a fresh Ubuntu 16.04 virtual machine on the Google Cloud Platform (GCP), cloned the latest Linux kernel, and set up a kernel development environment. We created a .config file with menuconfig and checked that:

1. The NFS driver was built as a kernel module (CONFIG_NFSD=m).
2. The right GCP options were set so the kernel would boot properly.

Much as geneticists use fast-breeding fruit flies to watch evolution in real time, the first item let us make quick changes to the NFS module without rebooting the whole kernel. The second guaranteed the kernel would start after installation. Fortunately, the default kernel parameters suited us.
We confirmed that the stale file problem was still present in the latest kernel, and we asked ourselves: where exactly does the error arise, and why does it not occur with NFS v4.1?
To answer these questions, we dug into the NFS source code. We did not have a kernel debugger, so we sprinkled two kinds of calls through the code:

1. pr_info() (see printk) to log messages.
2. dump_stack(), which prints the stack trace of the current function call.

For example, we started by instrumenting the nfs4_file_open() function in fs/nfs/nfs4file.c:
static int nfs4_file_open(struct inode *inode, struct file *filp)
{
        ...
        pr_info("nfs4_file_open start\n");
        dump_stack();
Of course, we could have enabled dprintk messages with Linux dynamic debugging or used rpcdebug, but we wanted to add our own messages so we could verify that our changes were actually taking effect.
After each change, we recompiled the module and reloaded it into the kernel with these commands:
make modules
sudo umount /mnt/nfs-test
sudo rmmod nfsv4
sudo rmmod nfs
sudo insmod fs/nfs/nfs.ko
sudo mount -a
With our instrumented NFS module loaded, we could repeat the experiments and get messages that helped us understand the NFS code. For example, you can see exactly what happens when an application calls open():
Sep 24 20:20:38 test-kernel kernel: [ 1145.233460] Call Trace:
Sep 24 20:20:38 test-kernel kernel: [ 1145.233462] dump_stack+0x8e/0xd5
Sep 24 20:20:38 test-kernel kernel: [ 1145.233480] nfs4_file_open+0x56/0x2a0 [nfsv4]
Sep 24 20:20:38 test-kernel kernel: [ 1145.233488] ? nfs42_clone_file_range+0x1c0/0x1c0 [nfsv4]
Sep 24 20:20:38 test-kernel kernel: [ 1145.233490] do_dentry_open+0x1f6/0x360
Sep 24 20:20:38 test-kernel kernel: [ 1145.233492] vfs_open+0x2f/0x40
Sep 24 20:20:38 test-kernel kernel: [ 1145.233493] path_openat+0x2e8/0x1690
Sep 24 20:20:38 test-kernel kernel: [ 1145.233496] ? mem_cgroup_try_charge+0x8b/0x190
Sep 24 20:20:38 test-kernel kernel: [ 1145.233497] do_filp_open+0x9b/0x110
Sep 24 20:20:38 test-kernel kernel: [ 1145.233499] ? __check_object_size+0xb8/0x1b0
Sep 24 20:20:38 test-kernel kernel: [ 1145.233501] ? __alloc_fd+0x46/0x170
Sep 24 20:20:38 test-kernel kernel: [ 1145.233503] do_sys_open+0x1ba/0x250
Sep 24 20:20:38 test-kernel kernel: [ 1145.233505] ? do_sys_open+0x1ba/0x250
Sep 24 20:20:38 test-kernel kernel: [ 1145.233507] __x64_sys_openat+0x20/0x30
Sep 24 20:20:38 test-kernel kernel: [ 1145.233508] do_syscall_64+0x65/0x130
What are do_dentry_open and vfs_open? Linux has a virtual filesystem (VFS), an abstraction layer that provides a common interface for all file systems. The VFS documentation says:

“VFS implements the open(2), stat(2), chmod(2), and similar system calls. VFS uses the pathname argument passed to them to search the directory entry cache (dentry cache, or dcache). This provides a very fast lookup mechanism that translates a pathname (filename) into a specific dentry. Dentries live in RAM and are never saved to disc: they exist only for performance.”
We noticed that the dentry cache checks mostly happen in fs/nfs/dir.c. We were especially interested in the nfs4_lookup_revalidate() function, and as an experiment we forced it to take the invalidation path early:
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8bfaa658b2c1..ad479bfeb669 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1159,6 +1159,7 @@ static int nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags)
 	trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
 	error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, fhandle, fattr, label);
 	trace_nfs_lookup_revalidate_exit(dir, dentry, flags, error);
+	goto out_bad;
 	if (error == -ESTALE || error == -ENOENT)
 		goto out_bad;
 	if (error)
And in this experiment, the stale file problem did not occur! We were finally on the trail.
To find out why the problem did not occur in NFS v4.1, we added pr_info() calls to every if block in this function. We experimented with NFS v4.0 and v4.1 and found a condition that is special to v4.1:
if (NFS_SB(dentry->d_sb)->caps & NFS_CAP_ATOMIC_OPEN_V1) {
        goto no_open;
}
What is NFS_CAP_ATOMIC_OPEN_V1? This kernel patch describes it as an NFS v4.1 feature, and the code in fs/nfs/nfs4proc.c confirmed that the capability is present in v4.1 but missing in v4.0:
static const struct nfs4_minor_version_ops nfs_v4_1_minor_ops = {
	.minor_version = 1,
	.init_caps = NFS_CAP_READDIRPLUS
		| NFS_CAP_ATOMIC_OPEN
		| NFS_CAP_POSIX_LOCK
		| NFS_CAP_STATEID_NFSV41
		| NFS_CAP_ATOMIC_OPEN_V1
So that is why the versions behaved differently: in v4.1, the goto no_open triggers more checks in nfs_lookup_revalidate(), while in v4.0 nfs4_lookup_revalidate() returns earlier. And how did we finally solve the problem?
I reported our findings on the NFS mailing list and proposed a primitive patch. A week later, Trond Myklebust sent a series of patches fixing the bug, and also found another related problem in NFS v4.1.
It turned out that the fix for the NFS v4.0 bug sat deeper in the codebase than we had thought. Trond summed it up well in the patch:

“We need to ensure that the inode and dentry are correctly revalidated when an already-open file is opened again. Currently we don't revalidate either in NFSv4.0, because the cached open file is reused. Fix this by only caching open files in the special cases of open recovery and delegation return.”
We verified that this fix solved the stale file problem, and filed bug reports with the Ubuntu and Red Hat teams.
We knew the changes would not reach a stable kernel any time soon, so we added a workaround for this problem in Gitaly. We experimented and verified that calling stat() on the packed-refs file makes the kernel revalidate the renamed file in the dentry cache. For simplicity, we implemented this in Gitaly for all file systems, not just NFS. The check runs only once, before Gitaly opens a repository; for the other files, stat() calls already happen elsewhere.
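In Ruby terms, the workaround amounts to something like the following sketch. The revalidate_packed_refs name is hypothetical, invented for illustration; the real change lives in the Gitaly codebase.

```ruby
# Hypothetical sketch of the Gitaly workaround: a stat() on packed-refs
# forces the kernel to revalidate the dentry before the repository is read.
def revalidate_packed_refs(repo_path)
  File.stat(File.join(repo_path, 'packed-refs'))
rescue Errno::ENOENT
  nil # a repository without packed refs is fine; nothing to revalidate
end
```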
A bug can hide in any corner of the software stack, and sometimes you need to look for it outside your application. If you have useful connections in the open source world, the job gets easier.
Many thanks to Trond Myklebust for fixing the problem, and to Bruce Fields for answering our questions and helping us understand NFS. It is for this responsiveness and professionalism that we value the open source community.
Source: https://habr.com/ru/post/432922/