Interview with reiser4 developer Edward Shishkin

Because Edward is a busy man, this epic interview dragged on for an indefinite period. Nevertheless, the reiser4 developer found the time to answer the questions of the respected Habr and LOR communities. Read on to see what came of it.

- How is the effort to get reiser4 into the mainline kernel going?

I don’t see any technical obstacles to it: all the problems from the famous “list for inclusion” have been solved. All that remains is to clarify the relationship with VFS, and the corresponding article is not yet ready for publication.

In general, getting reiser4 into the Linux kernel is a low priority right now. Once it is merged, I would have to respond immediately to every change in the VFS and block layer, and I do not always have that opportunity. In the -mm branch, nobody demands this of me. If something breaks, Andrew Morton just sends a notification, and I fix it when I find the time.
Regarding the popular predictions that “reiser4 will not be included in the kernel and will die,” I want to say that I don’t understand the obsession with the “ticket to life” that inclusion in the mainline Linux kernel supposedly grants. Reiser4 is the result of 18 years of research in the field of data storage, and it is not tied to a specific operating system. It is a result that many scientists worked on. If it is not included in Linux, it will be included in another OS where our ideas seem interesting. Linux is not the only game in town...

- Does it make sense to run something like an advertising campaign for reiser4 to improve its image?

The best advertising campaign is to explain to people how it works. Everyone looks at its code, and nobody understands anything. So much for open source. How can it be explained? Only through articles published in authoritative publications. And, of course, Wikipedia is out of the question. Wikipedia is good for covering the work of Renaissance artists. A page about your project there risks turning into a latrine for assessors.

For my part, I'm going to publish a couple of articles. The first will be about the modular architecture itself; the second will be devoted entirely to the “dancing tree” module (the only existing plug-in of the “TREE” interface). It will be very interesting: the technology used in this module is among the most sophisticated. After that it would be nice to explain how the transaction manager and the other plugins work, although, if you want, you can work that out from the code (which is not the case with the tree).

- What do you personally think about BFS (CPU scheduler) and BFQ (I/O scheduler)?

To be honest, I have not followed schedulers for a long time: on my laptop I run little besides a text editor and a browser. I only remember that the appearance of BFS was preceded by a rather unpleasant story (one that is, incidentally, characteristic of the Linux kernel development model). Hans, on the other hand, used to be very interested in elevators and was constantly assigning someone to implement his various ideas. About ten years ago I also modified some elevator on his instructions. True, it did not work any better as a result. Maybe because I was not interested in elevators...

- Why do you think the rawer and less reliable ext4 file system was accepted into the kernel almost immediately?

Well, it is the logical continuation of ext3, the de facto standard Linux file system. It would have been surprising if it had been given a red light.

- How do you feel about the huge number of file systems in the kernel? Is it justified?

Of course it is not justified. To a large extent this zoo is encouraged by the outdated VFS concept, which treats a file system as an opaque monolithic module. There never used to be this many file systems. And by now only the lazy fail to see that many of them do the same thing. It is time to draw some conclusions. I have a number of suggestions for improving the situation (everything will be in the article).

- At which exhibitions and forums are you going to speak this year and next?

So far I have not been invited anywhere. I never take the initiative myself.

- What motivates you to develop reiser4? After all, there are many other FS.

There is no strong motivation. At first I wanted to finally finish transparent compression. Then, after its announcement in 2007, I kept working on Reiser4 for lack of anything else to do: I could not find a job for a long time. Now I remain interested in some aspects of the data storage science on which Reiser4 is based. The other local file systems do not interest me.

- Do you continue informally, "behind the scenes", to communicate with Hans? Does he take any part in the development?

There is no other way to communicate with him. He has moved away from computer science completely. He tries to help however he can, but full participation requires a computer, which he is not allowed to have. And since Hans cannot sit idle, he has surrounded himself with books and returned to his old passion, physics. He has found some inconsistencies in the special theory of relativity and is asking us to find a Russian scientist who would review his new article. He curses America, "the country of lawyers, where science counts for nothing." He remembers Russia warmly. He follows with interest the initiatives of Minister Andrei Fursenko, who in his opinion is trying to revive the former prestige of Soviet science and education. He believes there will be a place for foreigners in that project, and he says he is ready to move to Russia and finally learn Russian.

- What is your main job?

I work in the file system department of Red Hat.

- Do you have enough time to work on reiser4?

Relatively speaking, yes, but only for maintenance: I am usually aware of all the changes made in the VFS and block layer, and weekends are usually enough to adapt reiser4 to them. Development means programming new plug-ins. That requires full-time employment, no less. In other words, it is only possible for a salary, and nobody is going to pay for it.

- How many people are involved in support? Do you have a successor in case you suddenly have to give up supporting reiser4?

I'm alone for now. All the previous developers have left for paid jobs, and no new ones have appeared. Getting into this area is not easy: you have to sit all day poring over a monitor. In one's prime years there are usually other things to do. And once a person is over thirty, he wants a stable job and money. Where would I get those for him?

- Is it possible to license the reiser4 code for use in a proprietary OS?

I am far removed from such questions. You can ask Hans if you are really interested.

- Can reiser4 become the default filesystem in one of the next releases of RHEL?

That is a question for my manager rather than for me: I cannot discuss Red Hat's plans. I can only say that so far I have not proposed anything to anyone, and nobody has asked me for anything either.

- Do you plan to port reiser4 to FreeBSD? Perhaps you should consider creating a port using FUSE? What do you think about the policy for accepting changes into the kernel?

In general, porting as such has never interested me. But I have heard that FreeBSD is an operating system with academic roots (University of California, Berkeley). That means we would very likely find a common language with its developers. At the very least, they will not stare blankly at you when they hear the word “algorithm”. In Linux, the key concept is the patch. And there is a committee of certain people who decide (probably based on their own intuition, and to a large extent on the patch author's ability to “get along” with the kernel development team) whether to accept a patch into the kernel or not. I don’t really like this approach: I graduated from MSU, not MGIMO.

- What are the "pitfalls" that anyone who wants to try reiser4 in their daily work may encounter? How do you evaluate its stability?

A general comment: I do not recall anyone losing data on a reiser4 partition over the past four years while the hardware was working properly. Several people came to me with complaints about fsck. In the end, all of them got both their data and a working fsck.

The most annoying thing is that you may have to roll back to the previous kernel version after an upgrade (I do not test the patches for each new version very thoroughly). The next nuisance is the lack of a defragmentation utility. There is also an old, hard-to-reproduce bug still alive that produces "key inconsistency" reports. In any case, if you decide to get involved with reiser4, you will definitely need patience. If you run into problems, send a bug report to the mailing list, or directly to me (if you do not speak English). Do not expect me to solve them immediately: I only have time for reiser4 after work and on weekends. If I stop responding to emails, do not hesitate to remind me. And complaining on forums is the least efficient way to solve problems.

- Do you plan to create a defragmentation utility? For example, when reiser4 was used on a partition with torrents, a 700-megabyte file ended up in more than 11,000 fragments, and no amount of copying could bring that figure down to even several hundred. This also hurt performance.

Yes, it is planned. Having such a utility is important. The Reiser4 transaction manager uses a mixture of journaling and copy-on-write techniques, and the latter by itself already implies fragmentation. A single copy may not be enough to get rid of it, because free disk space can be fragmented too. In general, a defragmentation utility will improve the situation considerably in a few passes over the tree. External fragmentation can be fought; it is not a death sentence for a file system.

About torrents. About three years ago the fallocate(2) system call appeared in Linux, which is designed to prevent fragmentation in such cases. The application specifies in advance the offset and size of a region of the file, and the file system should allocate disk space for that region (as unfragmented as possible). However, reiser4 does not support this system call yet. Adding such support is easy, but in the near future I will most likely have no time for it.
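
As an illustration of the idea, here is a minimal Python sketch of how a download client might reserve space before any piece arrives. It assumes a Linux or other POSIX system where `os.posix_fallocate` is available; the helper names `preallocate` and `write_piece` and the file name are invented for this example and do not come from any torrent client.

```python
import os
import tempfile


def preallocate(path, size):
    """Reserve `size` bytes for `path` before any piece is written."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Ask for the whole range [0, size) at once, which gives the file
        # system a chance to choose a layout for the complete file.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


def write_piece(path, offset, data):
    """Write one downloaded piece at its final offset in the file."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, data, offset)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "download.bin")  # hypothetical file name
        preallocate(target, 1024 * 1024)
        write_piece(target, 512 * 1024, b"piece arriving out of order")
        print(os.path.getsize(target))
```

Whether the reserved space really ends up contiguous depends on the file system. Where native support is missing, the C library may fall back to writing zeros, which reserves the space but says nothing about its layout.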

- Are there problems with specific hardware when using reiser4?

I have not heard of any. It seems to be omnivorous.

- Will reiser4 support be implemented for grub2 by the reiser4 developers themselves?

I hope so. It is hard work, but it is bound to succeed. There is a patch for grub-0.97, and reiser4 support for grub2 can be built nicely on top of it. The drawback of the existing patch is that booting cannot go through stage1_5, because the corresponding binary turns out too large and does not fit into the 62 sectors allotted to it. And the inability to boot through stage1_5 means that grub has to be reinstalled every time a defragmenter has worked over your partition. With reiser4 support in grub2, everything should be done properly. My module for booting btrfs from multiple devices fit into 62 sectors. Why wouldn't reiser4 fit there?

- Will it be possible in the future to move plugins out to userspace? Are there any such plans? Is there a plan to create an infrastructure that could load plug-ins both in kernel space and in user space?

Moving individual plugins out to userspace makes no particular sense. How would they interact with each other? Each plugin provides some service and, in turn, requests services from other plugins. Imagine that plug-in X, running in the kernel, urgently needs some service, and plug-in Y, which provides it, runs in userspace. Nothing good will come of that, will it? Dynamic loading of plug-ins as kernel modules would be useful, but it is neither an interesting nor a pressing question. Fine, we will make them dynamically loadable...

- Is it worth thinking about writing a set of tests that will test the strength of the filesystem in various ways and show problems? For example, it could be a set of perl scripts that would conduct aggressive parallel writing, reading and deletion, would show the correspondence of the read data to the written data, and also check the structure of the file system for problematic places.

Many people dream of having such tests, so that after half an hour of running them one could confidently make another release. I can only say that everything here is very hard. Writing comprehensive tests that reveal problem areas in software is a very difficult task. Besides, the tester will mostly run into regressions in other kernel subsystems, and will either have to fix them or wait for someone else to fix them.
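
For reference, the simplest form of the script described in the question could look like the Python sketch below (Python rather than Perl, purely for illustration). All names and sizes are invented. It only covers parallel writing, reading back with comparison, and deletion; it does not inspect file system structure the way fsck does, and it is nowhere near the comprehensive test suite discussed in the answer.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_read_delete(directory, index, size=64 * 1024):
    """Write random data, read it back, delete the file; True if data matched."""
    path = os.path.join(directory, f"file-{index}")
    payload = os.urandom(size)
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        read_back = f.read()
    os.remove(path)
    return hashlib.sha256(read_back).digest() == hashlib.sha256(payload).digest()


def stress(directory, files=32, workers=8):
    """Run the cycle in parallel; return the number of files that mismatched."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda i: write_read_delete(directory, i),
                                range(files)))
    return results.count(False)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print("mismatches:", stress(scratch))
```

Note that a read immediately after a write is normally served from the page cache, so a clean run says little about what actually reached the disk. This is one small example of why such tests are harder to get right than they look.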

- How did zfs / btrfs affect reiser4?

Not at all. Reiser4 was partly influenced by the development of xfs (the “delayed allocation” technique). For the most part we used our own work.

- Are you directly involved in developing btrfs?

Partly, on my employer's instructions. I added btrfs support to grub-0.97 (our distributions do not work with grub2). I do not know what else I will be assigned. Possibly the trendy data deduplication feature.

- What is your opinion of the current state of btrfs, judging by your recent sensational correspondence with Mason?

Why “sensational”? It is a normal working situation. I was assigned to investigate whether btrfs is suitable for enterprise systems, and I found severe internal fragmentation in scenarios where the other file systems work flawlessly. So I started asking whether this is a bug or a feature. Half a year has passed, and I still have not heard anything intelligible about the btrfs algorithms. What opinion can I have? All I understood is that they wanted the “tail packing” feature of the reiserfs file system without having any idea how the latter's algorithms and data structures work. I can only say that in B-trees the notion of “tail packing” is completely meaningless. Moreover, an attempt to place variable-size items in such trees leads to unbounded internal fragmentation. Reiserfs does not use B-trees or their well-known modifications. It uses completely different algorithms (invented by Russian scientists, by the way); the history of Namesys began with them in the early 90s. And modifying them for top-down balancing, as the btrfs design requires, is not a trivial task, unlike with classic B-trees.

I very often hear that the chief btrfs engineer, Chris Mason, having once worked at Namesys, took away all the positive experience from there, like Duncan MacLeod. So far I see only the opposite. For some reason he economized on the keys (a btrfs key is 136 bits, a reiser4 key is 192 bits), yet his balancing sent terabytes of users' disk space (and RAM) down the drain. Additional key fields give the ability to group data and metadata in different ways. Is all of that really unnecessary? And top-down balancing is, in my opinion, a complete compromise: the squeeze phase of balancing, as well as data compression and encryption, can no longer be deferred in the way the “delayed allocation” technique allows. Besides, it seems to me that these guys will run into scalability problems, because no decent locking scheme can be organized on such a tree. I can only say that it is far more profitable to distribute the “work on the tree” among a large number of processes and let only some of them meet (going from the bottom up), rather than have them all burst into the tree from above through a common root.

In general, I do not know... I will, of course, help as much as I can, but here is the thing: if a project is built on unsuccessful ideas, it is hard to turn it into candy. By the way, the whole history of Namesys is one of continuous contact with academic institutions (Moscow State University, the Program Systems Institute of the Russian Academy of Sciences in Pereslavl-Zalessky). XFS, too, is an entire school at Silicon Graphics. And Btrfs is the story of what? A couple of low-level workshops? What else would you call events at which non-existent features are announced? I stopped believing in miracles long ago...

- How do you see the future of file systems? What functionality will they have?

A file system is a subsystem that manages the “disk space” resource, and all of its "features" should be aimed at managing that resource effectively. This means the future of file systems belongs to more advanced algorithms, that is, to those that do this job better. Meanwhile, new physical media and read-write technologies keep appearing, and some features are moving from userspace into the kernel (atomicity, transparent compression, encryption, etc.). Existing file systems become obsolete: it is cheaper to rewrite them than to adapt them to the innovations. File systems should be able to “welcome” such features instead of being rewritten from scratch every time. And for that they need the appropriate technical foundation.

An attempt to create such a foundation was made in reiser4: unlike its predecessors, it has a completely modular structure. In reiser4, all implementations of the file_operations, inode_operations and address_space_operations methods are just thin layers: dispatchers that decide which plugin (module) to hand control to. And each module implements some abstract class (interface) of a certain interface scheme, which reflects a certain concept of (meta)data storage.
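
The dispatcher idea can be sketched in a few lines of Python. This is only an illustration under invented names (`FilePlugin`, `PlainFilePlugin`, `CompressedFilePlugin`, `Inode`, `vfs_read`); the real reiser4 code is C inside the kernel and is organized differently.

```python
import zlib


class FilePlugin:
    """Hypothetical abstract class standing in for a FILE-interface plugin."""

    def read(self, inode, offset, length):
        raise NotImplementedError


class PlainFilePlugin(FilePlugin):
    """Stores the body as is."""

    def read(self, inode, offset, length):
        return inode.data[offset:offset + length]


class CompressedFilePlugin(FilePlugin):
    """Stores the body compressed and inflates it on read."""

    def read(self, inode, offset, length):
        return zlib.decompress(inode.data)[offset:offset + length]


class Inode:
    """Toy inode: raw stored bytes plus the plugin that understands them."""

    def __init__(self, data, file_plugin):
        self.data = data
        self.file_plugin = file_plugin


def vfs_read(inode, offset, length):
    """Thin dispatcher: no storage logic here, only a hand-off to the plugin."""
    return inode.file_plugin.read(inode, offset, length)


if __name__ == "__main__":
    plain = Inode(b"hello world", PlainFilePlugin())
    packed = Inode(zlib.compress(b"hello world"), CompressedFilePlugin())
    print(vfs_read(plain, 6, 5), vfs_read(packed, 6, 5))
```

The caller of `vfs_read` never learns how the bytes are stored, which is the point of keeping the entry points thin.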

I will try to explain in simple terms how it all works. Let's say you want to implement btrfs functionality (snapshots, etc.). As you know, that file system uses the “copy-on-write” transaction model, implemented on the basis of top-down balancing of the tree. This is its main difference from what Reiser4 currently offers. Therefore, we need to create a new “cow” plugin of the “TMGR” interface (transaction manager), as well as a new “multi-root-tree” plugin of the “TREE” interface for a storage tree with a family of roots (“history”) that is balanced from the top down. The latter must also be supplied with its own locking scheme. As for TMGR, it is an abstract class for managing objects that the following article calls “particles” (a concept dual to the “transcrash” primitive; the article is here).

If you look at the transaction managers of different file systems (at the moment there are three types of such managers), it is easy to notice some features they have in common. Namely, in the TMGR interface you can single out the following set of main methods:

enter_context ();
try_capture ();
exit_context ().

The first and the last are invoked, respectively, when a process enters and leaves the file system proper. The second is called in every place where data (pages or buffers) is modified. At present a single TMGR plugin works in reiser4; let's call it “jcow” (a symbiosis of the “journalling” and “copy-on-write” techniques, without keeping history). Its ->try_capture() method adds a block to a so-called “atom” (the special name for “particles” in reiser4). In our newly created “cow” plugin, this method will instead bud off a new root of the storage tree (in the btrfs code, the corresponding function is called btrfs_cow_block).
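
A rough Python rendering of this interface may help. The three method names come from the text; everything else (the class names, the set-based “atom”, and the policy of committing on every exit) is a simplifying assumption made for this sketch and is not how reiser4 actually behaves.

```python
from abc import ABC, abstractmethod


class TransactionManager(ABC):
    """Hypothetical Python rendering of the TMGR interface."""

    @abstractmethod
    def enter_context(self): ...

    @abstractmethod
    def try_capture(self, block): ...

    @abstractmethod
    def exit_context(self): ...


class JcowManager(TransactionManager):
    """Toy 'jcow': every captured block joins the current atom."""

    def __init__(self):
        self.atom = None
        self.committed = []

    def enter_context(self):
        if self.atom is None:
            self.atom = set()

    def try_capture(self, block):
        # The block is about to be modified, so it must belong to the atom.
        self.atom.add(block)

    def exit_context(self):
        # Toy policy (an assumption): commit the atom as one unit on exit.
        self.committed.append(frozenset(self.atom))
        self.atom = None


def modify_blocks(tmgr, blocks):
    """Shape of a file system entry point that dirties some blocks."""
    tmgr.enter_context()
    try:
        for block in blocks:
            tmgr.try_capture(block)
    finally:
        tmgr.exit_context()


if __name__ == "__main__":
    manager = JcowManager()
    modify_blocks(manager, [1, 2, 2, 3])
    print(manager.committed)
```

A “cow” plugin would keep the same three entry points and change only what `try_capture` does with the block.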

As a homework exercise, I suggest working out what the “atoms” will be in this case (that is, what must be sent to disk as a whole). For background, you can refer to the article by Ohad Rodeh, "B-trees, Shadowing, and Clones".

These new roots have to be stored somewhere and retrieved from there: if you want the “writable snapshots” feature, they must form a doubly indexed set. But that is not a problem. For example, btrfs uses a separate “root tree” for this purpose.

So we only need two new plugins to get the btrfs functionality. And really, why would we need anything else? Plug-ins of the FILE interface fetch items from the tree through the methods of the TREE interface and do not need to know how a particular tree is balanced. Plug-ins of the other interfaces (NODE, ITEM, etc.) also remain in business: why would we need to change the format of tree nodes in order to organize snapshots? Our “multi-rooted” tree will simply contain different internal nodes that refer to the same blocks.

I'm not saying that programming new plug-ins of the TREE and TMGR interfaces is a job for the lazy, but believe me, it is much easier than creating a file system and the extremely complicated fsck utility from scratch (fsck, by the way, is also modular for reiser4). The gain lies in the fact that the existing plug-ins of the other interfaces do not need to be taught to work with the new members of the family, which means there is code you do not have to write and debug, and its share tends toward 100% as you implement more and more functionality on a well-organized interface scheme.

In the same way, plug-ins can be used to implement logical volume management of the kind found in ZFS or btrfs. However, I must warn you: this would be a so-called layering violation. In Linux, volume management is handled by a separate subsystem (lvm), and an attempt to mix it with the file system may end badly: you will be asked to remove the functionality and not to do it again. There is an inexplicable policy of double standards here: some are allowed to mix layers (btrfs, for example), while in reiser4 this is not welcome. In any case, remembering the flurry of layering-violation accusations against reiser4, I would not risk the effort.

Details and other equally interesting applications of the modular architecture can be found in my article (it has not come out yet; it will be announced on the reiserfs-devel mailing list).

So I would describe the future of local file systems, in particular, as the “polishing” of such “internal” interfaces. In fact, if you look closely, you can see that they are not internal at all. It is like in algebra: if a linear space V splits into an (internal) direct sum of subspaces, you can go the opposite way and build a space isomorphic to V using the external direct sum construction. And since the interfaces are not internal, they become the property of all file systems. There are no problems with VFS here (more on this in the article). In general, I see many analogies between software systems (data storage systems above all) and such concepts of homological algebra as module, grading, filtration, etc., and they seem very useful to me.

And finally, about the "features". I am often asked how to write a plugin for reiser4. The counter-question, what the plugin is supposed to implement, often stumps the person asking. I don’t like the idea of mass-producing features for a file system with a modular architecture. This is a discipline, not a mass entertainment industry. Nobody mass-produces proofs of mathematical theorems...

I think that first there must be a useful idea from the field of information storage (snapshots, for example). I do not think there can be too many such ideas. Anyone with such an idea is welcome. We will think about how to express it in the language of attachments, add new interfaces to the general scheme if necessary, and write the appropriate plugins.


Source: https://habr.com/ru/post/108629/
