Relevant connection: specific and universal attributes

We continue describing the properties of the relevant selection operation. The first part considered filtering the tuples of a rule register by the values of an input vector and then sorting the selected tuples by relevance. The emphasis was on how to compute (evaluate) the relevance of tuples correctly.

Here we dwell on the selection operation itself (there will not be a single formula!). In the general case, this operation (a join) can involve not only a vector and a table but also two tables. An operation on tables that checks whether an element belongs to a set is called a relevant connection. Next, we consider its features.

Specific and universal attributes

In the previous article, the data tables were conditionally divided into fact tables, which contain specific attribute values, and rule tables, which may contain set-values (universal values).

However, it is more correct to speak about the division not of tables, but of their attributes. In the same table, one attribute can contain specific values, and another - set-values (universal). In accordance with this, you can put forward the following thesis:

All attributes (relations) can be divided into two kinds: specific and universal.

Attributes of the specific kind are those whose values are interpreted as elements of sets. For example, the absence of a value of such an attribute means exactly an empty value (and not the universe).

Attributes of the universal kind are those whose values are interpreted as sets, even if the set consists of one element. The absence of a value of such an attribute usually means its universe (all values of the set).

The values of universal attributes take part in assessing the relevance of table tuples; these are the relevance attributes. It makes no sense to evaluate specific attributes, since all their values are equally relevant.

Example: variables whose value depends on parameter(s)

Consider a register that has attributes of both kinds. Let it be a table containing the values of the variables of some information system. Let's call it "Variable values". Each variable is characterized by its name (identifier) and value. Assume that the value may depend on the user of the system. In such a table, you can set a default value of a variable (for all users) and override it for specific users.

The register determinant consists of two attributes (dimensions), "Name" and "User". The root consists of a single resource, "Value" (plus one more system attribute, the relevance L).

The register contents might look like this (the weight of the "User" dimension is taken as 100; the tuples are numbered for convenience):

No	Name (0)	User (100)	Value	L
1	Mode		8	0
2	Mode	Ivanov	6	100
3	Date of Birth		01/01/1980	0
4	Date of Birth	Ivanov	02/15/1987	100
5	Date of Birth	Petrov	09.12.2008	100
6	Qualification	Petrov	Average	100

The value of the variable "Mode" is 8 for all users. The exception is Ivanov, for whom the mode is 6. The variable "Date of Birth" works the same way: by default, the system assumes January 1, 1980 as the date of birth for all users, but more accurate values are set for the users Ivanov and Petrov. There is also the variable "Qualification", whose value is set only for Petrov.

In this register, the attribute (dimension) "Name" is of the specific kind. Its value must always be given, and a universe value makes no sense for this attribute. There is also no sense in assessing the relevance of such an attribute, so such dimensions are assigned zero weight; this is what distinguishes specific dimensions from universal ones.

The dimension "User", on the contrary, is of the universal kind. The absence of a specific user value means the "All Users" universe. A reference to a specific user (Ivanov or Petrov) is actually interpreted as a set of one element (for example, {Ivanov}).
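The register and the two readings of its attributes can be sketched in Python as follows. This is only an illustration under assumptions that are not in the article: the names Row, USERS, REGISTER and user_set are hypothetical, None is chosen to stand for an unspecified value, and the universe of users is assumed to consist of just Ivanov and Petrov.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    no: int
    name: str            # specific attribute: always a single element
    user: Optional[str]  # universal attribute: None stands for the universe
    value: str
    relevance: int       # the system attribute L


# Assumed universe of users (hypothetical, for illustration only).
USERS = frozenset({"Ivanov", "Petrov"})

REGISTER = [
    Row(1, "Mode", None, "8", 0),
    Row(2, "Mode", "Ivanov", "6", 100),
    Row(3, "Date of Birth", None, "01/01/1980", 0),
    Row(4, "Date of Birth", "Ivanov", "02/15/1987", 100),
    Row(5, "Date of Birth", "Petrov", "09.12.2008", 100),
    Row(6, "Qualification", "Petrov", "Average", 100),
]


def user_set(row: Row) -> frozenset:
    """Interpret the universal "User" cell of a row as a set of users."""
    return USERS if row.user is None else frozenset({row.user})
```

Here user_set(REGISTER[0]) is the whole universe, while user_set(REGISTER[1]) is the one-element set containing Ivanov. The "Name" cell is never expanded this way, because it is read as an element, not as a set.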

The type of attribute affects the algorithm of the relevant selection (connection).

Unity and struggle of opposites

As noted at the beginning of the article, fetching data from a table involves two tables. One of them (the data source) holds the data to be extracted, and the other (the input table) sets the selection parameters (what exactly should be selected).

If the input table consists of one tuple, then the input data can be considered as a vector. This is the most common situation in practice. But in the general case there can be several tuples in the input table.

The relevant connection differs from the usual connection (on equality of values) in that it uses the operation "membership of an element in a set". That is, there must be a set (or sets) on one side and elements of a set on the other.

As a consequence, the kinds of the attributes of the tables participating in a relevant join are opposite. If the joined attribute is of the specific kind in one table, then in the other table this attribute must be of the universal kind, and vice versa. This is the antisymmetry property of attribute kinds in the relevant connection.

Let us explain. Suppose the data source table consists of one attribute of the specific kind. In this case we are dealing with an ordinary set of values, for example a set of different words. Then, in a relevant connection, the input must specify sets (of words). Such a set can be given by the value 'I', which is interpreted as all words ending in 'I'. You can also give two subsets, 'I' and 'a'; then all words of the source table ending in 'I' or 'a' should be in the selection.

The opposite situation: the source table consists of one universal attribute, for example one containing set-values of words. Such set-values can again be word endings: ('I', 'a', 'b', 'e', 'pb', ...). Then the input table must contain specific values, that is, words (a single word if the input is a vector). The select operation will return the endings that match this particular word.

In a relevant selection, the check is that the value of a specific attribute belongs to a universal value. Regardless of which table (input or source) contains which kind of attribute, the relevance condition is:

The value of a specific attribute belongs to the value of a universal attribute.
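The same condition serving both directions can be sketched in Python with the word-ending example. The function names and the sample words are hypothetical; an ending is assumed to be stored as a string, and None is assumed to stand for the universe (all words).

```python
def belongs(specific_value, universal_value):
    """Relevance condition: a specific word belongs to the set of words
    denoted by a universal value (an ending, or None for the universe)."""
    return universal_value is None or specific_value.endswith(universal_value)


def select_words(words, endings):
    """Source table is specific (words); the input is universal (endings)."""
    return [w for w in words if any(belongs(w, e) for e in endings)]


def select_endings(endings, word):
    """Source table is universal (endings); the input is specific (a word)."""
    return [e for e in endings if belongs(word, e)]


words = ["sofa", "taxi", "piano"]   # hypothetical sample data
endings = ["i", "a", "o"]

print(select_words(words, ["i", "a"]))   # words ending in 'i' or 'a'
print(select_endings(endings, "piano"))  # endings matching this word
```

Note that both selections call the same belongs predicate; only the side on which the specific value sits changes, which is the antisymmetry described above.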

We emphasize once again that concreteness or universality is an interpretation of attribute values. If an attribute is declared concrete, then all its values (even if they appear to be sets) become concrete. And in the operation of the relevant connection, they are on the side of the elements, not the sets.

Algorithm of the relevant connection

Let us return to the register "Variable values". Which selections from it make sense? The most obvious one is to retrieve the value of a variable for a given user. For example, to retrieve the value of the "Mode" variable for the user "Ivanov", the input table would be:

Name	User
Mode	Ivanov

In fact, this is just a vector (one tuple). Note that the kinds of the attributes of this table are opposite to those of the source table: here the attribute "Name" is universal, and the attribute "User" is specific.

The universality of the attribute "Name" means that its values are interpreted as sets. The name "Mode" here denotes a set of one element. An empty (unspecified) value of this attribute is interpreted as the universe. For example, the input vector

Name	User
∀	Ivanov

means a sample of all variables set for the user "Ivanov".

Selection of relevant values

The result of the connection (selection) must include the values of the specific attributes. That is, the determinant of the result must consist of specific values. This makes it possible to sort the selected tuples by relevance within each determinant value (key).

So, after selecting the relevant tuples of the table "Variable values" for the above vector {Name: ∀, User: "Ivanov"}, we get the same table, but with the user value filled in:

No	Name	User	Value	L
1	Mode	Ivanov	8	0
2	Mode	Ivanov	6	100
3	Date of Birth	Ivanov	01/01/1980	0
4	Date of Birth	Ivanov	02/15/1987	100

Keeping only the most relevant tuple for each value of the key "Name + User", we get the required result:

No	Name	User	Value	L
2	Mode	Ivanov	6	100
4	Date of Birth	Ivanov	02/15/1987	100

Consider a more complex example. Suppose you need to choose the values of variables for two users - Ivanov and Petrov. Then the input table will contain two vectors (tuples):

Name	User
∀	Ivanov
∀	Petrov

After performing the operation of the relevant connection (determining whether the elements belong to sets), we obtain an intermediate table:

No	Name	User	Value	L
1	Mode	Ivanov	8	0
2	Mode	Ivanov	6	100
3	Date of Birth	Ivanov	01/01/1980	0
4	Date of Birth	Ivanov	02/15/1987	100
1	Mode	Petrov	8	0
5	Date of Birth	Petrov	09.12.2008	100
6	Qualification	Petrov	Average	100

We leave the tuples with the highest relevance within the determinant (key):

No	Name	User	Value	L
2	Mode	Ivanov	6	100
4	Date of Birth	Ivanov	02/15/1987	100
1	Mode	Petrov	8	0
5	Date of Birth	Petrov	09.12.2008	100
6	Qualification	Petrov	Average	100

This table is the result of the relevant key selection.
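The two steps (membership check, then keeping the most relevant tuple per key) can be sketched in Python as below. This is one possible reading, not the author's implementation: the function name relevant_join, the tuple layout and the use of None for the universe (∀) are assumptions made for illustration.

```python
# Register rows: (No, Name, User, Value, L). None in "User" means the universe.
REGISTER = [
    (1, "Mode", None, "8", 0),
    (2, "Mode", "Ivanov", "6", 100),
    (3, "Date of Birth", None, "01/01/1980", 0),
    (4, "Date of Birth", "Ivanov", "02/15/1987", 100),
    (5, "Date of Birth", "Petrov", "09.12.2008", 100),
    (6, "Qualification", "Petrov", "Average", 100),
]


def relevant_join(register, input_rows):
    """input_rows holds (name, user) pairs: name is universal here
    (None means the universe), user is specific."""
    best = {}
    for in_name, in_user in input_rows:
        for no, name, user, value, rel in register:
            # The specific register name must belong to the input name set.
            name_ok = in_name is None or name == in_name
            # The specific input user must belong to the register user set.
            user_ok = user is None or user == in_user
            if not (name_ok and user_ok):
                continue
            key = (name, in_user)  # the key consists of specific values
            if key not in best or rel > best[key][4]:
                best[key] = (no, name, in_user, value, rel)
    return list(best.values())


result = relevant_join(REGISTER, [(None, "Ivanov"), (None, "Petrov")])
for row in result:
    print(row)
```

With the input vector ("Mode", "Ivanov") the same function returns only tuple 2, and with (None, "Ivanov") it returns tuples 2 and 4, as in the single-vector example earlier in this section.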

Soft selection conditions

Soft selection conditions are those where the attribute being queried is specific. Accordingly, the selection parameters are universal: they consist of set-values.

This situation usually arises when querying facts, from which suitable ones must be chosen by conditions with varying degrees of relevance. That is, the selection conditions are not rigid but desirable, or soft.

In a store, the buyer can ask for "red", and if there is none, then any. When buying tickets (booking seats), a certain date is desirable for the buyer, and if it is not available, then the next free one. And so on.

Soft selection conditions are also a relevant connection. The only peculiarity is that such a selection usually keeps all relevant tuples, not just the first one.

Let us demonstrate soft selection conditions on the table "Variable values". Suppose we need to select the variable "Qualification" for the user Petrov and, if it does not exist, any other variable. Then the query table (input table) will look like this:

Name	User	Li
∀	Petrov	0
Qualification	Petrov	5

Note that the antisymmetry rule for attribute kinds is observed. In the table "Variable values" the attribute "Name" is specific, so in the input table it is universal. Conversely, the attribute "User" is universal in the table of variables, so in the input table it must be specific (specified).

When querying a fact table (here the table "Variable values" plays the role of the fact table from which data is extracted), the relevance must be specified in the input table (if it has more than one tuple).

After the selection, we obtain a table sorted by the relevance of the input table and of the table of variables:

No	Name	User	Value	L	Li
6	Qualification	Petrov	Average	100	5
5	Date of Birth	Petrov	09.12.2008	100	0
1	Mode	Petrov	8	0	0
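A soft selection of this kind can be sketched in Python as follows. Several details are assumptions rather than statements from the article: the function name soft_select, the use of None for ∀, keeping one tuple per key "Name + User" with the highest L and the highest matching Li, and ordering the result by Li first and L second.

```python
# Register rows: (No, Name, User, Value, L). None in "User" means the universe.
REGISTER = [
    (1, "Mode", None, "8", 0),
    (2, "Mode", "Ivanov", "6", 100),
    (3, "Date of Birth", None, "01/01/1980", 0),
    (4, "Date of Birth", "Ivanov", "02/15/1987", 100),
    (5, "Date of Birth", "Petrov", "09.12.2008", 100),
    (6, "Qualification", "Petrov", "Average", 100),
]

# Input rows: (Name, User, Li). None in "Name" means the universe.
INPUT = [
    (None, "Petrov", 0),
    ("Qualification", "Petrov", 5),
]


def soft_select(register, input_rows):
    found = {}
    for in_name, in_user, li in input_rows:
        for no, name, user, value, rel in register:
            if in_name is not None and name != in_name:
                continue  # register name is outside the input name set
            if user is not None and user != in_user:
                continue  # input user is outside the register user set
            key = (name, in_user)
            candidate = (no, name, in_user, value, rel, li)
            old = found.get(key)
            # Per key, keep the candidate with the highest (L, Li).
            if old is None or (rel, li) > (old[4], old[5]):
                found[key] = candidate
    # All keys are kept; order by input relevance Li, then by L (assumed).
    return sorted(found.values(), key=lambda r: (r[5], r[4]), reverse=True)


for row in soft_select(REGISTER, INPUT):
    print(row)
```

Unlike a rigid selection, nothing is dropped when "Qualification" is found: the other variables of Petrov stay in the result, only lower in the ordering.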

Features of sampling values belonging to the interval

The first part considered relevant selections in which the sets were given by one interval boundary. There, the cardinality (and hence the relevance) of the intervals could be estimated in advance, before the selection, because the interval set-values were stored in the attribute being queried (a dimension of the rule register).

With a "soft selection" the situation is different: specific attribute values (facts) are selected, and the relevance interval is given at the input. For example, the departure dates of trains (facts) are known, and we need to choose the dates that belong to an interval specified by the user.

As a rule, the dates must not only be selected but also sorted by relevance, that is, by closeness to a given boundary. The degree of closeness to the boundary is inversely proportional to the cardinality of the interval formed by the boundary of the input interval and the boundary given by the selected value. Typically, the selected values set the right boundary ("To"), and the left boundary ("From") is set to the corresponding boundary of the input interval. The relevance of the resulting intervals can be estimated relative to the input interval.

Suppose we have the following set of departure dates (facts): (15, 17, 23, 25, 30). Then, choosing facts from this set in the interval [20, 25], we obtain a set of two intervals: ([20, 23], [20, 25]). In accordance with the general rule, the relevance of the smaller interval [20, 23] is higher.
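The departure-date example can be reproduced with a short Python sketch. The function name select_in_interval is hypothetical, and the interval length is used here as a stand-in for its cardinality; the sketch only orders the intervals and does not assign numeric relevance values, since the article gives no formula for them.

```python
def select_in_interval(facts, low, high):
    """Pick the facts inside [low, high] and pair each with the left
    boundary of the input interval, giving intervals (low, fact).
    Shorter intervals come first, since they are more relevant."""
    hits = [f for f in facts if low <= f <= high]
    intervals = [(low, f) for f in hits]
    return sorted(intervals, key=lambda iv: iv[1] - iv[0])


departures = [15, 17, 23, 25, 30]
print(select_in_interval(departures, 20, 25))
```

For the data above this yields the intervals (20, 23) and (20, 25), in that order.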

Conclusion

It is surprising that simple and intuitive things required so many letters for a relatively clear description.

The initial purpose of this work was to fix the mathematics used in evaluating the relevance of a selection from data containing universes, and also to show the need to understand and account for the kind of table attributes. Interpreting attributes as specific or universal makes it possible to define the abstract operation of the relevant connection.

What is all this for? Ultimately, to reduce the entropy of information systems. If we compare, for example, implementing some logic in program code versus in tables (registers) of rules, it is clear that the tables are much less entropic (besides, they are declarative, not imperative). The concepts of rule registers, universes, and relevant connections fit well into the relational data model. These articles only show how to prepare them correctly. Use them!

Source: https://habr.com/ru/post/324996/
