
So how do you delete millions of files from one folder?


A spectacular dotting of the i's on the question of deleting files from an overflowed directory.

I read the article "Unusual hard disk overflow, or how to delete millions of files from one folder" and was very surprised. Can it really be that the standard Linux toolkit has no simple tools for working with overflowed directories, and that one must resort to low-level methods such as calling getdents() directly?

For those not aware of the problem, a brief description: if you accidentally create a huge number of files in a single directory without any hierarchy (say, 5 million or more files in one flat directory), then you will not be able to remove them quickly. Moreover, not all Linux utilities can do it at all: they either load the CPU and HDD heavily or consume enormous amounts of memory.
So I took the time, set up a test bench, and tried various tools: those suggested in the comments, those found in various articles, and some of my own.

Preparation


Since you hardly want to create an overflowed directory on the HDD of your work machine and then struggle to get rid of it, we will create a virtual file system in a separate file and mount it via a loop device. Fortunately, in Linux all of this is simple.

Create an empty 200GB file
 #!python
 f = open("sparse", "w")
 f.seek(1024 * 1024 * 1024 * 200)  # jump 200GB past the start of the empty file
 f.write("\0")                     # write a single byte; the skipped gap stays sparse
 f.close()

Many people recommend using the dd utility for this, e.g. dd if=/dev/zero of=disk-image bs=1M count=1M , but that works incomparably slower, and the result, as far as I can tell, is the same.
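(As an aside: truncate -s 200G sparse from GNU coreutils should produce the same sparse file in a single command, and so should dd of=sparse bs=1 count=0 seek=200G , since seeking past the end without writing any data keeps the file sparse.)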

Format the file as ext4 and mount it as a file system
 mkfs -t ext4 -q sparse   # TODO: less FS size, but change -N option
 sudo mount sparse /mnt
 mkdir /mnt/test_dir

Unfortunately, I only learned about the -N option of the mkfs.ext4 command after the experiments. It lets you raise the limit on the number of inodes on the FS without increasing the size of the image file. On the other hand, the default settings are closer to real-world conditions.
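(For example, something like mkfs -t ext4 -q -N 20000000 sparse should reserve about 20 million inodes up front; the exact count here is just an illustrative guess, pick it to fit your test.)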

Create a set of empty files (this will run for several hours)
 #!python
 for i in xrange(0, 13107300):  # Python 2; use range() under Python 3
     f = open("/mnt/test_dir/{0}_{0}_{0}_{0}".format(i), "w")
     f.close()
     if i % 10000 == 0:
         print i

By the way, while the first files were created quite quickly, subsequent ones were added more and more slowly: random pauses appeared, and the kernel's memory usage grew. So storing a large number of files in a flat directory is a bad idea in itself.

We check that all inodes on the FS are exhausted.
  $ df -i
  /dev/loop0 13107200 13107200 38517 100% /mnt

The size of the directory file itself is ~360MB
  $ ls -lh /mnt/
  drwxrwxr-x 2 seriy seriy 358M Nov 1 03:11 test_dir

Now we will try to delete this directory with all its contents in various ways.

Tests


After each test we flush the file system cache:
sudo sh -c 'sync && echo 1 > /proc/sys/vm/drop_caches'
so that it does not quickly eat up all the memory, and so that the deletion rates are compared under identical conditions.

Delete via rm -r


$ rm -r /mnt/test_dir/
Under strace it calls getdents() several times in a row (!!!), then calls unlinkat() many times, and so on in a loop. It took 30MB of RAM and did not grow.
Removes content successfully.
  iotop
  7664 be/4 seriy 72.70 M/s 0.00 B/s 0.00% 93.15% rm -r /mnt/test_dir/
  5919 be/0 root 80.77 M/s 16.48 M/s 0.00% 80.68% [loop0]

In other words, deleting overflowed directories with rm -r works perfectly well.

Delete via rm ./*


$ rm /mnt/test_dir/*
It spawns a child shell process, which grew to 600MB before I killed it with ^C. Nothing was deleted.
Obviously, the glob (the asterisk) is expanded by the shell itself: the names accumulate in memory and are passed to rm only after the entire directory has been read.
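Even if the expansion had finished, it would most likely have failed anyway: the resulting argument list for a single exec of rm would far exceed the kernel's ARG_MAX limit and produce the familiar "Argument list too long" error.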

Delete via find -exec


$ find /mnt/test_dir/ -type f -exec rm -v {} \;
Under strace, only getdents() is called. The find process grew to 600MB before I killed it with ^C. Nothing was deleted.
find works the same way as * in the shell: it first builds the complete file list in memory.

Delete via find -delete


$ find /mnt/test_dir/ -type f -delete
It grew to 600MB before I killed it with ^C. Nothing was deleted.
It behaves just like the previous command. And that is extremely surprising! This was the command I had initially pinned my hopes on.

Delete via ls -f and xargs


$ cd /mnt/test_dir/ ; ls -f . | xargs -n 100 rm
The -f option tells ls not to sort the file list.
Creates a hierarchy of processes:
  |  - ls     212KB
  |  - xargs  108KB
     |  - rm  130KB   # the rm pid changes constantly

Removes successfully.
  iotop (the readings jump around a lot)
  5919 be/0 root 5.87 M/s 6.28 M/s 0.00% 89.15% [loop0]

In this situation ls -f behaves much more sensibly than find and does not needlessly accumulate the file list in memory. ls without parameters (like find ) reads the complete file list into memory, obviously in order to sort it. But this method is bad in that it constantly re-spawns rm , which creates additional overhead.
This suggests yet another approach: redirect the output of ls -f to a file and then delete the directory's contents according to that list.
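A minimal sketch of that idea (assuming, as in our test, that the file names contain no whitespace):
cd /mnt/test_dir && ls -f . > /tmp/filelist && xargs rm < /tmp/filelist
rm will complain about the '.' and '..' entries that ls -f leaves in the list, but it will delete everything else.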

Delete via Perl readdir


$ perl -e 'chdir "/mnt/test_dir/" or die; opendir D, "."; while ($n = readdir D) { unlink $n }' (picked up here)
Under strace it calls getdents() once, then unlink() many times, and so on in a loop. It took 380KB of memory and did not grow.
Removes successfully.
  iotop
  7591 be/4 seriy 13.74 M/s 0.00 B/s 0.00% 98.95% perl -e chdi...
  5919 be/0 root 11.18 M/s 1438.88 K/s 0.00% 93.85% [loop0]

So it turns out that using readdir is perfectly viable after all?

Delete via a C program (readdir + unlink)


 //file: cleandir.c
 #include <dirent.h>
 #include <string.h>     /* strcmp (missing in the original listing) */
 #include <sys/types.h>
 #include <unistd.h>

 int main(int argc, char *argv[])
 {
     struct dirent *entry;
     DIR *dp;

     chdir("/mnt/test_dir");
     dp = opendir(".");
     /* read entries one batch at a time and unlink them immediately */
     while ((entry = readdir(dp)) != NULL) {
         if (strcmp(entry->d_name, ".") && strcmp(entry->d_name, "..")) {
             unlink(entry->d_name); // maybe unlinkat ?
         }
     }
     return 0;
 }

$ gcc -o cleandir cleandir.c
$ ./cleandir
Under strace it calls getdents() once, then unlink() many times, and so on in a loop. It took 128KB of memory and did not grow.
Removes successfully.
  iotop:
  7565 be/4 seriy 11.70 M/s 0.00 B/s 0.00% 98.88% ./cleandir
  5919 be/0 root 12.97 M/s 1079.23 K/s 0.00% 92.42% [loop0]

Once again we see that using readdir is perfectly fine, as long as you do not accumulate the results in memory but delete the files right away.
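(As for the unlinkat() question in the code comment: as far as I know, unlinkat(dirfd(dp), entry->d_name, 0) would let you drop the chdir() call by addressing entries relative to the open directory; the system-call pattern stays essentially the same.)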

Findings


- Deleting an overflowed directory with rm -r works fine: it reads the directory in chunks and unlinks as it goes, using a small, constant amount of memory.
- Shell globs ( rm * ) and find (whether with -exec or -delete ) first build the complete file list in memory, and on millions of files they never get to the deletion.
- ls -f | xargs rm also works, at the cost of constantly re-spawning rm .
- A small Perl or C loop that interleaves readdir() with unlink() deletes everything with minimal, constant memory usage.
- And do not store millions of files in one flat directory in the first place: even creating them becomes slower and slower.

PS: Unfortunately, I did not find any functions in Python for reading a directory iteratively, which surprises me greatly; os.listdir() and os.walk() read the entire directory at once. Even PHP has readdir .
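(A note that postdates the article: starting with Python 3.5, os.scandir() iterates a directory lazily, batching getdents() calls under the hood, so the equivalent of the Perl and C loops above can be written as the sketch below; the path is the test directory from this experiment.)

 #!python
 import os

 # Python 3.5+: os.scandir() yields entries lazily, fetching them in
 # batches via getdents(), so memory stays constant even for millions
 # of files. Our test directory contains only regular files.
 for entry in os.scandir("/mnt/test_dir"):
     os.unlink(entry.path)  # delete each file as soon as it is listed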

Source: https://habr.com/ru/post/157613/

