
Tagsistant: semantic file system

Hi. Habr already has an article about Tagsistant, but I found it confusing and incomplete. This is an attempt to present the topic differently: a brief extract from the English manual plus my own observations.

The Tagsistant project presents its creation, tagfs, as part of a general trend. The Internet is being moved onto semantic rails step by step, while file systems, the project's authors argue, remain stuck in obsolete principles: hierarchy, directories and all that.
And to some extent I agree with them. Imagine you have several hundred photos: some were taken in Cologne, some at sunset, some show girls, and some were taken in 2010. Now imagine you want to do the following: get a list of the photos taken at sunset in Cologne with your girlfriend, excluding those taken in 2010.
Someone may say that you can create directories, for example Koeln, sunset, girls and 2010, and fill them with symlinks to the files. Something like that would work, but does it give the flexibility and convenience needed for composing queries (even for the example above)?
You can also try EXIF tags. But the camera does not record whether there are girls in the photo, or any other criteria limited only by your imagination. And what if we are talking not about photos at all but about reports?
You can try writing your own tags into extended file attributes on ext4 with setfattr/getfattr. At least I saw that suggested in a discussion about tagging files; I have not tried it. But that is a half-measure too, even if it works.
Here is a real example to start with, based on my own needs. I have a folder (more than one, in fact) with a huge pile of assorted picture junk that was once saved to Downloads and later moved there. From this whole garbage heap I want a list of photos of the girls from a forum that were taken in Kyiv, show beer, and were taken before 2012. Together with them I want all the images of the forum admins that I have:
$ ls ~/tagsistant/store/forum/girls/beer/=Kyiv/time:/year/lt/2012/+/admin/@/



Consider what tagfs offers.


The first thing is tagging of files, directories, devices, and even pipes (!). The second is relations between tags, which can be includes, excludes, is_equivalent and requires. Files are stored in the service directory /archive, tags in /tags, and the mapping of files to tags in the directory /store. There are six directories in total:
alias/ archive/ relations/ stats/ store/ tags/
A file can carry as many tags as you like (within reason). The syntax for tagging a file is:
$ ln -s ~/foto1.jpg ~/tagsistant/store/koeln/wife/sunset/@

We have assigned the photo three mutually independent tags: "Cologne", "wife" and "sunset". From now on this photo will be selected by any one of these tags and by any combination of them.

Why ln -s, and why the @ sign at the end? First, why not? Why copy the whole file, spending extra space and time, if the file already exists and all we need is to link it to its tags?
Second, the @ symbol is a marker that ends a series of tags. In the path, tagsistant is the mount point of tagfs, and the store directory is what you use to associate files with tags and to access them by tag. Everything after it is the chain of tags assigned to the file. Now imagine that we have added ten more files with different tag sets: some carry only /wife/sunset, others only /koeln/wife, and so on. Now we can make different selections:
$ ls tagsistant/store/koeln/@/
Result: all photos taken in Cologne
$ ls tagsistant/store/koeln/wife/@/
Result: all photos of my wife taken in Cologne
$ ls tagsistant/store/koeln/wife/-/sunset/@/
Result: the same, excluding the "sunset" photos


Why the @ marker? Here is why:
$ fbi tagsistant/store/koeln/sunset/@/foto2.jpg
Here fbi is a console image viewer, koeln/sunset/@/ is the tag set together with the marker that completes it, and foto2.jpg is a specific file from the set of photos matching those tags.

How else would the file system figure out where the tags end and the file name begins?
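
The role of the marker can be illustrated with a few lines of shell. This is only a sketch of the idea, not Tagsistant's actual parser: the function name split_query is made up for the example, and the input is assumed to be the part of the path that follows store/.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tagsistant): split a query at the
# @ or @@ marker into the tag chain and the object that follows it.
split_query() {
    query=$1
    case $query in
        */@@/*) marker='@@'; tags=${query%%/@@/*}; object=${query#*/@@/} ;;
        */@/*)  marker='@';  tags=${query%%/@/*};  object=${query#*/@/} ;;
        *)      echo "incomplete query: no @ or @@ marker" >&2; return 1 ;;
    esac
    printf 'tags=%s marker=%s object=%s\n' "$tags" "$marker" "$object"
}

split_query 'koeln/sunset/@/foto2.jpg'
# prints: tags=koeln/sunset marker=@ object=foto2.jpg
```

Without the marker, a path such as koeln/sunset/foto2.jpg would be ambiguous: foto2.jpg could just as well be one more tag.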

Operators


More complex selections can be made using the +, - operators and braces. Examples:
$ ls ~/tagsistant/store/koeln/+/sunset/@/
Result: photos from Cologne plus photos of sunsets. This is not the intersection of the two tags (photos taken in Cologne at sunset) but the union of two separate selections (photos of Cologne at any time of day and photos of sunsets taken anywhere at all).

$ ls ~/tagsistant/store/koeln/-/sunset/-/wife/@/
Likewise: all photos of Cologne except those taken at sunset. The wife is left out as well; we only want the fishing photos. :)

(The union operator +/ works the same way. Each operator applies only to the single tag that follows it, so merging the selections for three tags takes two operators.)
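
The three ways of combining tags described above map onto ordinary set operations. The sketch below models them outside Tagsistant, with each tag represented by a sorted text file of file names; the file names and the helper names both, either and without are invented for the illustration, and the model covers only pairwise combinations.

```shell
#!/bin/sh
# Toy model of the query operators. Nothing here touches a real tagfs mount.
LC_ALL=C; export LC_ALL          # comm needs a consistent sort order
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

# One sorted list of (made-up) file names per tag.
printf '%s\n' a.jpg b.jpg c.jpg > "$dir/koeln"
printf '%s\n' b.jpg c.jpg d.jpg > "$dir/wife"
printf '%s\n' c.jpg e.jpg       > "$dir/sunset"

both()    { comm -12 "$1" "$2"; }   # koeln/wife/@/     intersection
either()  { sort -u "$1" "$2"; }    # koeln/+/sunset/@/ union
without() { comm -23 "$1" "$2"; }   # koeln/-/sunset/@/ difference

both   "$dir/koeln" "$dir/wife"      # b.jpg c.jpg
either "$dir/koeln" "$dir/sunset"    # a.jpg b.jpg c.jpg e.jpg

# koeln/-/sunset/-/wife/@/ : each minus removes one more tag.
without "$dir/koeln" "$dir/sunset" > "$dir/step1"
without "$dir/step1" "$dir/wife"     # a.jpg
```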
Braces are used to group tags. Imagine that you need a selection made of three sets of files. The first set is tagged both "starwars" and "image", the second "starwars" and "music", the third "starwars" and "video". With union operators this can be written as:
$ ls ~/tagsistant/store/starwars/image/+/starwars/music/+/starwars/video/@/

But better this way:
$ ls ~/tagsistant/store/starwars/{/image/music/video/}/@/


A more complex query might look like this:
$ ls ~/myfiles/store/{/starwars/startrek/}/{/images/music/video/}/@/

This returns all the pictures, music and videos related to the two different films. An equivalent query written without groups would look like this:
$ ls ~/tagsistant/store/starwars/image/+/starwars/music/+/starwars/video/+/startrek/image/+/startrek/music/+/startrek/video/@/

Groups cannot contain other groups (that is, they cannot be nested). Also make sure the braces are paired, and do not forget to close them.
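
To make the equivalence between a group and a chain of +/ operators concrete, here is a small sketch that expands one group by hand. The function expand_group is hypothetical and handles only the simplest case: one leading tag followed by a single group.

```shell
#!/bin/sh
# Hypothetical helper: expand "prefix/{/t1/t2/.../}" into the equivalent
# chain "prefix/t1/+/prefix/t2/+/...".
expand_group() {
    prefix=$1
    shift
    out=
    for tag in "$@"; do
        if [ -n "$out" ]; then
            out="$out/+/"
        fi
        out="$out$prefix/$tag"
    done
    printf '%s\n' "$out"
}

expand_group starwars image music video
# prints: starwars/image/+/starwars/music/+/starwars/video
```

The output matches the long form of the starwars query shown above, which is why the braced form is preferable once more than two or three tags are involved.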

Listing a file's tags and the ALL meta tag


Another feature you would naturally expect is listing all the tags associated with a file. To get the list, run cat on the file name with the suffix ".tags" appended, like this:
$ cat tagsistant/store/koeln/@/photo1.jpg.tags
koeln
wife
sunset
image

That works if we remember at least one tag assigned to the file. And if memory fails us completely? Then the global meta tag ALL/ comes to the rescue. The result is the same.
$ cat tagsistant/store/ALL/@/photo1.jpg.tags


The ALL meta tag gives a complete list of the files held in tagfs. It can be useful, for example, for batch processing of all files, because recursive traversal of the store folder does not work the way it does in conventional hierarchical file systems. It also helps in the case above, when you remember tagging a particular file but cannot recall any of its tags: to see them, go through the global meta tag.

Triple (composite) tags


It is time to leave flat tags behind and move on to namespaces and triple tags. The previous examples showed decent flexibility, but flat tags have their limits. I will not reinvent the wheel and will take an example from the manual: say I want to tag files by year. How would I do that? By creating tags like 2000, year_2000 and so on, one tag per year. That clutters the tag directory, and tag names may also collide.
The second level of tags in tagfs, aimed at structure and usability, is the three-element tag, which looks like this:
Namespace - Key - Value

The namespace describes what the key-value pairs inside it are about and can be quite general; for the example with years, the namespace could be time. The key would be year, and the specific number goes into the third element, the value.
In actual use a composite tag has one more element: the operator. The list of operators makes their role clear: eq (equals), inc (includes), gt (greater than), lt (less than). So the full scheme of a composite tag is:
namespace:/key/operator/value/

Note the colon after the namespace: it is mandatory and is exactly what marks the element as a namespace.
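
As a sketch of the scheme, the following helper assembles a composite tag from its four parts and rejects operators outside the list given above. The function triple_tag is made up for the illustration and does not come from Tagsistant.

```shell
#!/bin/sh
# Hypothetical helper: build "namespace:/key/operator/value/" and check
# the operator against the four listed in the text (eq, inc, gt, lt).
triple_tag() {
    if [ "$#" -ne 4 ]; then
        echo "usage: triple_tag NAMESPACE KEY OPERATOR VALUE" >&2
        return 2
    fi
    case $3 in
        eq|inc|gt|lt) ;;
        *) echo "unknown operator: $3" >&2; return 1 ;;
    esac
    printf '%s:/%s/%s/%s/\n' "$1" "$2" "$3" "$4"
}

triple_tag time year lt 2012
# prints: time:/year/lt/2012/
```

The result can be dropped into a query path between ordinary tags, as in the time:/year/lt/2012/ fragment of the very first example.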

This way we can also classify photos by year, month and so on without cluttering the tag directory. Tagging every photo in the Cologne directory as taken in Cologne in August 2010 looks like this:
$ ln -s ~/Koeln_fotos/*.jpg ~/tagsistant/store/photos/koeln/time:/year/eq/2010/time:/month/eq/August/@/

Some time later we can search for all photos taken in 2010 and find the Cologne photos among them.
$ ls ~/tagsistant/store/photos/time:/year/eq/2010/@/

The system also has the beginnings of automatic tagging based on file metadata, but I have not been able to test it properly so far, because I have no set of files with decent metadata (in most cases that means photos). The manual says that you can edit the configuration ini file to adjust the regular expressions that control which information is extracted from the metadata. It would be convenient if the system also added automatic tags based on file extension, for example tagging all jpeg, png and gif files as image, mp3 and flac as music, and so on. It is not yet clear whether such functionality is part of the project; perhaps you can write your own plugin for it.

Relations


There are only four of them: includes, excludes, is_equivalent and requires. The standard manual does not explain each of them in detail. It gives just one example:
$ mkdir ~/tagsistant/relations/TAG1/includes/TAG2/

Once this relation exists, any query for TAG1 returns the files tagged TAG1 as well as the files tagged TAG2. A real-world example: the images tag includes photos. Say that during a trip to London in 2014 we took some photos and also downloaded a number of pictures from the Internet. Some of those are comics from bash.org.ru and some are desktop wallpapers. At some point we want to view the photos from London (/London/photos/) for that period together with the wallpapers (/images/), without wasting time on the comics (/comics/). The query then looks something like this:
$ ls ~/tagsistant/store/London/images/time:/year/eq/2014/-/comics/@/


excludes is just as straightforward. Suppose set A includes set B, and set B excludes C. If we have three files, fileA (tag A), fileB (tag B) and fileC (tags B and C), then a query for tag A returns fileA and fileB, while fileC is dropped from the results. fileC is reachable only by querying the C tag directly.
The query /store/A/+/B/-/C/@/ would have the same effect. Relations let you set up long-lived rules and shorten your queries.

Sources other than the standard manual make it clear that the is_equivalent relation has the simplest and most obvious function: it makes one tag equivalent to another as far as the reasoner is concerned. The examples given were these: the beatles was made equivalent to the_beatles, and, on the argument that someone might dislike tags with underscores such as my_home, my_home was made equivalent to my\ home. Why anyone would want that is unclear. (This is just my opinion.)

The most visible effect of the requires relation is that it hides one tag inside another in the file system hierarchy. For example, if you run:
$ mkdir ~/tagsistant/relations/TAG1/requires/TAG2/
$ ln -s ~/somefile.txt ~/tagsistant/store/TAG1/@/

then from now on we can reach somefile.txt through the TAG1 tag, but TAG1 will not appear in the tag listing of the store directory: it is hidden inside TAG2/.
$ ls ~/tagsistant/store/
+/ -/ @/ @@/ ALL/ TAG2/ {/
$ ls ~/tagsistant/store/TAG2/
+/ -/ @/ @@/ ALL/ TAG1/ {/
$ ls ~/tagsistant/store/TAG1/@/
somefile.txt # the query goes through the tag we want even though it is not listed at this level of the hierarchy. Then again, hierarchy is rather beside the point here...
$ cat ~/tagsistant/store/ALL/@/somefile.txt.tags
TAG1 # that is, the file carries only one tag

If there is no requires relation between these tags, TAG1 sits at the top level of the store/ directory. The deeper ontological meaning of this relation has not reached me yet. The scanty descriptions do not spell it out.
UPD: in the project's ChangeLog I did find a mention of the new requires relation. It says roughly the following (retold from the English original):
Introduced the "requires" relation. If tag S requires tag M, then tag S is shown only when tag M is in the last position of the query, for example:
store/M/
store/P/Q/+/M/

But it will not be shown in:
store/P/
store/P/+/Q/

The purpose of this relation is to organize tags into a hierarchical structure so that the root directory does not get cluttered. To some extent it complements the namespace functionality.

Frankly, this does not make things clearer. At least, I have not observed the behavior described. Perhaps I am simply missing something.

Deduplication and other spokes in the wheel


Deduplication is a mechanism that keeps identical files from occupying two different inodes in the file system. This means that the trick of creating an empty temporary file as a flag will not work here, but it is not needed anyway: this is an auxiliary, specialized file system.
It looks like this:
$ touch ~/tagsistant/store/tag1/@/tempfile1
$ touch ~/tagsistant/store/tag2/@/tempfile2
$ touch ~/tagsistant/store/tag2/@/tempfile3

The result of these manipulations is a single file, tempfile1, carrying both tag1 and tag2. The attempts to create the other two files run into the content check: their content turns out to be identical to that of the first file (all three are equally empty), so the tags given to the last two files are assigned to the first one instead.
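
The content check can be pictured with a short sketch. The source does not say how Tagsistant compares files, so a byte-for-byte comparison with cmp stands in for it here; the helper first_with_same_content and the temporary files are invented for the example.

```shell
#!/bin/sh
# Illustration only: find an already stored file whose content is
# identical to a new candidate. cmp is a stand-in for whatever check
# Tagsistant performs internally.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

: > "$dir/tempfile1"                 # empty, like the result of touch
: > "$dir/tempfile2"                 # empty too, so a duplicate
printf 'hello\n' > "$dir/notes.txt"  # different content

first_with_same_content() {
    candidate=$1
    shift
    for stored in "$@"; do
        if cmp -s "$candidate" "$stored"; then
            printf '%s\n' "$stored"
            return 0
        fi
    done
    return 1    # no duplicate found: the candidate is genuinely new
}

first_with_same_content "$dir/tempfile2" "$dir/notes.txt" "$dir/tempfile1"
# prints the path of tempfile1: tempfile2 would be merged into it
```

In this model a new file that matches an existing one is never stored; only its tags are carried over, which is the behavior the touch example shows.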

Disabling the reasoner


Ending the tag chain of a query with the @ symbol "switches on" the reasoner mentioned above and makes it apply all of the query logic; the same symbol doubled, @@, "switches it off". This is useful in a few cases, among them certain file operations and viewing the set associated with a tag without relations getting involved. For example, if tag A includes tag B, a query for tag A returns both sets. If we run the same query with the reasoner switched off, we get only set A:
$ ls ~/tagsistant/store/A/@/
Afile1 Afile2 Bfile1
$ ls ~/tagsistant/store/A/@@/
Afile1 Afile2


Aliases


A frequently used tag chain can be packed into a short alias, marked with the = sign. The query chain behind the alias is substituted as is, which can lead to some surprises. Aliases are stored in the alias/ directory as files that contain the associated query string. Suppose an alias file named behemoth contains the string behemoth/file:/format/eq/AVI/. Later we substitute it into a more general query:
$ ls ~/tagsistant/store/=behemoth/time:/year/lt/2000/@/

The surprise mentioned above is this: if the alias contains the +/ operator, the whole part of the query that follows the alias applies only to its second half. By the way, this is not entirely clear either, because the manual said that an operator applies only to the one tag that follows it; perhaps the documentation was not updated after the latest change.
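
The "as is" substitution is easy to mimic in a few lines, which also makes the surprise with +/ visible. This is a sketch and not Tagsistant's code: expand_aliases is a made-up function, the alias directory is passed in explicitly, and the alias named seen with its contents is invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical model of alias expansion: every "=name" component of a
# query is replaced verbatim by the contents of the file "name".
expand_aliases() {
    alias_dir=$1
    query=$2
    old_ifs=$IFS
    IFS=/
    set -f                       # no globbing while splitting on "/"
    set -- $query
    set +f
    IFS=$old_ifs
    result=
    for part in "$@"; do
        [ -n "$part" ] || continue
        case $part in
            =?*)
                file=$alias_dir/${part#=}
                if [ ! -r "$file" ]; then
                    echo "no such alias: ${part#=}" >&2
                    return 1
                fi
                body=$(cat "$file")
                body=${body#/}
                part=${body%/}
                ;;
        esac
        result=$result$part/
    done
    printf '%s\n' "$result"
}

aliases=$(mktemp -d)
trap 'rm -rf "$aliases"' EXIT
printf '%s\n' 'behemoth/file:/format/eq/AVI/' > "$aliases/behemoth"
printf '%s\n' 'films/+/series/' > "$aliases/seen"   # invented alias

expand_aliases "$aliases" '=behemoth/time:/year/lt/2000/@/'
# prints: behemoth/file:/format/eq/AVI/time:/year/lt/2000/@/
expand_aliases "$aliases" '=seen/time:/year/lt/2000/@/'
# prints: films/+/series/time:/year/lt/2000/@/
```

In the second expansion the year condition ends up after the +/ and therefore next to series only, which matches the behavior described above for aliases that contain a union operator.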

Tag Merge


This is an important part as well, so I cannot skip it. To merge two tags into one, simply move the whole contents of one tag's directory into the other's, then delete the first tag.
$ mv store/merged_tag/@/* store/destination_tag/@/
$ rmdir tags/merged_tag

A no less important note: never delete a non-empty folder in the /store directory. The directory of each tag contains links to all the other tags (their directories), so by removing one tag's folder you will wipe out the entire repository. Anything you delete under /store must be deleted through a complete query. A query is complete when it contains a reasoner marker, @ or @@. In that case only the files returned by the tags listed in the query are deleted.
To remove a tag, go through the directory /tags, not /store. All files that carried the tag stay intact; they just lose that tag.

Building


Tagsistant is meant to be built on Linux or BSD and requires glib2, fuse, libdbi with the libdbd-sqlite3 and libdbd-mysql plugins, and libextractor. I do not run a desktop distribution, so I built half of the dependencies by hand. Tagsistant built with only the sqlite3 headers present (apparently either database backend is enough for it), but it prints some junk messages. Perhaps precisely because I built it without the MySQL headers, after startup and while working in the terminal it emits messages like "no tables in statement!". Redirecting standard output into the void with 1>/dev/null is enough to silence them; this has no visible effect on how it works.

Of course, someone may object: "why reinvent the wheel when you can organize a hierarchy of folders?" I believe that no folder hierarchy gives this kind of flexibility and convenience, letting you ask absolutely any query that comes to mind. Besides, from my point of view it is precisely the fuss with hierarchies, links and the like that is reinventing the wheel. EXIF tags, which may come to mind because of the examples with pictures, are hardly suitable for tagging mail archives or anything else that Tagsistant can tag. The system has room to grow, but it is already comfortable and stable. It deserves your attention.

Source: https://habr.com/ru/post/357630/

