Elements, universes and registers of rules

"Duels are forbidden on Saturday, Sunday and the rest of the week."

The article will discuss some of the nuances of the data retrieval operation. This rather demanded operation in information systems is actually reduced to determining the belonging of values (elements) to sets. A table function containing set values is called a register of rules. If there are several sets to which an element belongs, the question of determining the most relevant of them arises. The first part of the paper is devoted to the evaluation of the relevance of data sampling.

Looking ahead, we point out that the main result (of many years of observation) is that in relational relations, one should take into account the kind of attributes — whether the values of the attribute of a relationship are concrete (elements) or abstract (sets). At the same time, in the data fetching operation, the attributes of the input table and the table to be accessed must be of different genders . More on this in the 2nd part .

And one more reservation. Where we had to choose between the simplicity (clarity) of the description and its rigor, the author tried to choose simplicity (although there were still plenty of words, including not always clear ones).

Data..

Data of information systems can be divided into two categories. The first includes data related to taking into account certain facts. An example is a frequently used database of books about libraries in textbooks. The main table of such a database usually contains information about the author of the book, its title, release date and, for example, about the genre. A feature of such tables (facts) is that the values of their attributes are always specific, do not contain sets. If any value is missing in the table, then it is exactly "zero" (missing).

Working with such fact tables often comes down to various samples of them. A sample is the filtering of the table data (the set of its tuples, records) based on the values of the input vector. The input vector to the fact table can look like this: {Author: "Ivanov"}. This means that you need to select all the books of the author Ivanov.

The absence of any attribute of the table in the input vector means that all values of this attribute are valid in the sample. For example, the sampling vector of all the books of Ivanov can be written like this: {Author: "Ivanov", Title:, Date Release:, Genre: ∀} - the result of the selection will not change. Here is the designation of the universe ("all values"). Obviously, in such a record, the input vector defines a certain set of values.

Thus, sampling from the fact table is reduced to determining whether the records of the table belong to the set defined by the input vector.

... and the rules

Along with the fact tables in information systems, others are also used - the rule tables. Such tables contain set elements. Most often such a set is a universe (attribute), but there are others.

As an example of the values of such a table can result in the end of words. In such a table, the ending value 'a' means the set of all words ending in 'a'. The ending 'eh' is the set of all words ending in 'eh', etc.

The vector to access the rule table describes the input situation and contains specific values. For example, for word endings, the input vector will consist of a specific word. And the result of the sampling is the sets into which the given word belongs.

Typically, the input vector corresponds to several records (tuples) of the table. Let's call such records relevant. Often, of all the relevant tuples, you must choose one, the most relevant. This means that you need to be able to evaluate the relevance of the tuples of the rule tables.

Rule tables usually contain much fewer records than fact tables, but they are no less important for building the logic of system functioning. Consider them in more detail.

Register definition rules

A register is a table function that connects certain input parameters (description of the situation) with output (result). According to the definition function $y = f (x)$ represents a three of objects:

lots of $x$ called the scope of the function;
lots of $y$ called the function value domain ;
$f$ - the rule that each element $x$ some element is matched $y$ .

Unlike tables, a register has a structure (and not just a set) of attributes. Domain $x$ Is a register determinant $y$ - its root . “ Determinant determines the root ” is the essence of the register. Attributes of a determinant are also called register dimensions , attributes of a root are resources .

A feature of rule registers is that the definition domain $x$ may contain set values.

Let us give an example of the simplest register of rules. For example, it is necessary to associate the days of the week with the category - working (P) or output (B). It is set that Saturdays and Sundays are holidays, and all other days are working days. Here is the scope $x$ is the set of days of the week, the range of values $y$ - many categories of the day. Need to build $f$ .

A regular table function can look like a set of pairs, connecting every day of the week with a certain category: {(Mon, P), (W, P), (Wed, P), (Th, P), (Fri, P), (Sat , B), (Sun, B)} .
This is quite a possible way to set a function. For this set of pairs you can get the category of the day, if the day of the week is set at the entrance.

But there are also disadvantages. A significant drawback is verbosity (cost). Natural is the desire to improve the efficiency of our description, reducing the number of pairs. A shorter expression of our function can contain only two entries: {(∀, P), ((Sat, Sun), B)} . Or in the form of a table:

Day of the week		Category
∀		Working
(Sat, Sun)		Output

The rules read like this: "By default, all days of the week are working days , with the exception of Saturdays and Sundays, which are holidays."

The table below represents the values of the register of rules - the correspondence between the elements of the domain of definition (determinant) and the domain of values (root) using universums and subsets. Typically, the values of a register determinant are unique, that is, they form a key . Matching rules are a set of value pairs <determinant - root>.

In our table, the determinant is the attribute "Day of the week", and the root attribute is the attribute "Category". This is a simple structure. In complex registers, the determinant (and root) can consist of several attributes.

Root extraction

So, there is a register (table function) containing universes. The purpose of the register is to return the root value for a given input vector. To increase the effectiveness of the rules, one has to pay with a certain complication of the value extraction algorithm. In general, the algorithm compares the values of the input vector against the values of the register determinant and leaves relevant. It then evaluates the relevance of each value and returns the most relevant.

Refer to the example above. Suppose the input is set to the day of the week "Tuesday". Then everything is simple. The extraction algorithm will select only one suitable rule - the Universum – Worker pair, since the Tuesday element belongs to the Universe set, but does not belong to the subset (Sat, Sun). It does not make sense to evaluate the relevance of one rule and the algorithm will return the value of the corresponding category (“Work”).
A more difficult situation, if the input is set Saturday or Sunday. In this case, both register rules are relevant. It is necessary to make a choice between them.

Assessment of relevance is reduced to assessing the uniqueness of this rule. Intuition suggests that the more specific (specialized) a rule is, the greater its relevance. Conversely, the most abstract rules have the least relevance.

The universe of a set means the space of all the elements of a set and is accordingly more abstract than any other subset. Therefore, in our example, the value of the determinant (Sat, Sun) is more relevant. Accordingly, the algorithm should return the value of the "Output" root.

Before proceeding to assess the relevance of the tuples of rules registers, we note their main properties.

Reduction

The utility of rule registers is based on the fact that the register determinant has the property of reduction . When calculating the value of the root of the register of rules for a given input vector is selected the most suitable of the available in the set. The space of input vectors is reduced to the space of values of the register determinant. Thus, a large variety of input parameter values is reduced to several output values.

The more in the rules of universums - the higher the reduction coefficient. Therefore, registers of rules in one form or another are necessarily present in any information system that claims to be effective.

Completeness and collisions

Two main problems of the registers of rules - the completeness of description and collisions.
Completeness means that for any value of the input vector in the register of rules there must be at least one relevant rule. To ensure completeness, it is recommended to set the most abstract rules first, then moving on to more specific ones.

Collisions mean that the same input vector corresponds to several rules of the same relevance. Sometimes this is permissible, but most often it creates problems. In simple (low-dimensional) determinants, collisions can be eliminated by specifying different powers (weights) for attribute universes. In more complex (multidimensional), additional assumptions are needed to resolve collisions.

Calculate the relevance of tuples

Multiplicative Relevance

As indicated above, the relevance of a rule is inversely proportional to its abstractness (generality). Denote by $N (m)$ power set $M$ . For discrete sets, the power is equal to the number of elements of the set. The power of the universe of set and the power of set are the same thing. The power of the "Days of the Week" set is 7, since there are seven days in a week.

Accordingly, the power of discrete subsets $N (m)$ also determined by the number of elements of the subset. The power of the subset (Sat, Sun) is 2.

The relative power of a subset is determined by the ratio of its power to the power of its universe:

$P (m) = N (m) / N (M) \ quad (1) \\$

The relative power of a subset is its abstractness (size). With this definition, the maximum abstraction is equal to 1. The abstractness of the set (Sat, Sun) = 2/7 = 0.286.

The definition of abstractness coincides with the definition of the probability of events. The greater the likelihood of an element in a subset - the higher the abstractness of this subset.

Relevance is the reciprocal of abstractness. Relevance is a concreteness, a rarity. In accordance with definition (1), we obtain the expression for relevance:

$R (m) = 1 / P (m) = N (M) / N (m) \ quad (2) \\$

Zero-power subsets cannot be evaluated (not relevant). Another consequence of formula (2) is that relevance cannot be null for non-empty sets.

Our table of rules with the filled relevance is:

Day of the week		Category	R
		Working	one
(Sat, Sun)		Output	7/2 = 3.5

In this table, the universum sign is omitted. It is usually assumed that the missing value in the attribute of a determinant means a universum (if more precisely, then only for universal attributes, but more on that later).

The algorithm for extracting the rule from such a table is simple: 1) select tuples according to the condition that the input vector belongs to a subset of the determinant of the tuple, 2) sort by decreasing relevance and 3) return the first tuple.

Multidimensional (composite) determinant

Let's complicate an example. For example, the category of the day depends not only on the day of the week, but also on the gender of the workers. Add a rule that women do not work on Friday.

The register of rules after adding the attribute of the determinant and the new rule takes the form:

Day of the week	Floor	Category	R
		Working	one
(Sat, Sun)		Output	7/2 = 3.5
Fri	F	Output	?

Calculate the relevance of the new rule. Obviously, it should be higher than the other register rules, since the area (space) of action of this rule is the smallest (the rule is specific).

To calculate the relevance of determinants from several attributes (dimensions), we again turn to the calculation of the cardinality of sets. The set of elements when combining two (independent) sets is determined by the Cartesian product of the elements of both sets. Consequently, the power of the universe of the combined set is calculated simply as the product of the powers of the universums of the original sets.

The set of "Sex" consists of two elements - (male, female). Accordingly, its power is equal to 2. Then the power of the determinant of our register (the universe of the determinant) will be equal to:

$N (M) = N ($ Day of the week $) \ cdot N ($ Floor $) = 7 \ cdot 2 = 14 \\$

The same goes for determinant values. The power of a tuple is defined as the product of the powers of its attributes. Then you can write an expression for the relevance of a tuple as a product of the relevance of its attributes:

$R (m) = R (m1) \ cdot R (m2) \ cdot R (m3) \ cdot ... \ quad (3) \\$

Where $R (m1)$ - relevance of the 1st attribute, etc. This formula clearly demonstrates the multiplicity of the entered relevance.

For a tuple (Fri, G), the relevance is:

$R (Fri, F) = R (Fri) \ cdot R (F) = 7/1 \ cdot 2/1 = 7 \ cdot 2 = 14 \\$

Day of the week	Floor	Category	R (Day of the week)	R (Paul)	R (Total)
		Working	one	one	one
(Sat, Sun)		Output	3.5	one	3.5
Fri	F	Output	7	2	14

The total relevance of a tuple is calculated as the product of the relevance of its attributes.

Additive relevance

Multiplicative relevance is not very convenient for practical use. For example, when adding a new attribute to a determinant (this is quite a usual operation for information systems), its values are not filled. Blank values correspond to the universe of a new attribute. The values of the multiplicative relevance of the universe of this attribute must be equal to 1 (for formula 3 to be satisfied). It is not comfortable.

Anyway, it is always easier to add than to multiply. Therefore, from the relevance of products (multiplicative relevance) we turn to the relevance of additions (additive relevance). To do this, it is enough to prologue the value of the multiplicative relevance. The base of the logarithm can be any, - we take the typical value for information systems 2.

So, the additive (logarithmic) relevance is defined as

$L = log_2 (R) \ quad (4) \\$

Substituting (2) into (1), we get:

$L (m) = log_2 (N (M) / N (m)) = log_2 (N (M)) - log_2 (N (m)) \ quad (4 ') \\$

For universals, additive relevance is always zero — this is convenient. It is also important that the value of additive relevance for one element of a discrete set is equal to the entropy of the set.

The rule table of our example with additive relevance is:

Day of the week	Floor	Category	L (Day of the week)	L (Paul)	L (total)
		Working	0	0	0
(Sat, Sun)		Output	$log_2 (7) -1$	0	$log_2 (7) -1$
Fri	F	Output	$log_2 (7)$	one	$log_2 (7) + 1$

The convenience of such relevance is also in the fact that when a determinant attribute is added, there is no need to add a new column for relevance - its values are still zero. Therefore, it suffices to leave one common additive relevance of the rules:

Day of the week	Floor	Category	L
		Working	0
(Sat, Sun)		Output	$log_2 (7) -1$
Fri	F	Output	$log_2 (7) + 1$

Relationship of relevance and entropy

The additive relevance introduced by us is related to informational entropy. Magnitude $log_2 (N (m))$ is the entropy of a subset $H (m)$ . Then the additive relevance of a subset is the difference between the entropy of the whole universe and a given subset. Hence it is clear that it is always greater than or equal to zero.

$L (m) = H (M) - H (m) \ quad (5) \\$

This definition does not depend on the dimension of the set. From formulas (3) and (4) we obtain a detailed expression for the logarithmic relevance of multidimensional sets:

$L (m) = L (m1) + L (m2) + L (m3) + ... \ quad (3 ') \\$

Substitute (5) in (3 '):

$L (m) = L (m1) + L (m2) + ... = H (M1) - H (m1) + H (M2) - H (m2) + ... \ quad (5 ') \ \$

By grouping the terms, we again get the original expression of relevance:

$L (m) = (H (M1) + H (M2) + ...) - (H (m1) + H (m2) + ...) = H (M) - H (m) \ quad ( 5) \\$

The entropy of universums is the same for all tuples. Therefore, to sort tuples by relevance, it is enough to know only the entropy of the tuple itself — for some types of sets this is useful.

Relevance of some sets

Weight (significance) of attributes

In the considered example, the power of the universals of the determinant does not depend on time. The number of days of the week is unlikely to change in the coming years. We can call such sets stationary. In practice, one often has to deal with sets whose power is either changing or unknown. In this case, the easiest way is to force the entropy (power) of the sets, which is also the weight (significance) of the values of such sets.

Consider the task of determining the rules for choosing the price of sales, depending on the counterparty and division. Suppose the case of rules looks like this:

	Counterparty	Subdivision	Price
one			BUT
2		P1	B
3	K1	P1	AT
four	K1		R
five	K2		R

The rules are read this way. In general, it is common for an organization to trade at price A. But in unit P1, they trade at price B. In the same department, special price list B is provided for counterpart K1. And the rest of the organization with this counterpart work at price G. In addition, the organization has a counterpart K2, which also work on the price list G.

A collision occurs when the K2 counterparty turns to unit P1. What price list do you need to work with him? Three rules are relevant - first, second and fifth. To select a price, it is necessary to calculate the relevance of all the rules ( hereinafter, additive relevance is always used ).

The values of the determinant attributes here are of two types - universes and specific. This is the most common case. As already explained above, the relevance of universums is always zero, and the relevance of an element is equal to the entropy of its universe.

The power of the set of counterparties and the set of units is not fixed - their number may vary with time. For such dynamic sets it is impossible to calculate in advance the values of relevance.

There are two ways to solve this problem.
A difficult way is to dynamically take into account the change in the power of sets. In the procedure of extracting the root of the register (finding the relevant value), it is necessary to transfer the entropy of the attributes of the determinant. The evaluation of the relevance of the tuples in this case will occur on the fly (at the time of the request).

The other approach is much simpler: when setting up the register of rules, one can forcibly set the power of universums (entropy, weight) of these sets. At the same time, in order to avoid collisions, the weights of different dimensions of the determinant must differ between themselves. Often the most significant measurements are placed to the left - then the order of measurements sets their weight.

Let us assume that the weight of the measurement (entropy) of the “Division” is 3, and the weight of the measurement of “Counterparty” is 10 (the numbers are arbitrary, but usually the counterparties are greater than the divisions). Calculate the relevance of price table tuples:

	Counterparty (10)	Division (3)	Price	L (K)	L (P)	L (total)
one			BUT	0	0	0
2		P1	B	0	3	3
3	K1	P1	AT	ten	3	13
four	K1		R	ten	0	ten
five	K2		R	ten	0	ten

For the input vector (K2, P1), the relevant tuples will be as follows:

	Counterparty	Subdivision	Price	L
one			BUT	0
2		P1	B	3
five	K2		R	ten

We see that price G has the greatest relevance.

Relevance of ordered sets. Intervals

Here ordered sets are those whose elements more-less comparison operations are applicable to. Typical examples of such sets are numbers and dates. Subsets of ordered sets can be defined by boundaries (intervals). Accordingly, to assess the relevance of the rules, one must be able to calculate the interval power.

In the simplest case of a one-dimensional interval, its power is determined by subtracting the values of the boundaries (if such an operation applies to them). For example, the power of the numeric interval [3, 11 [equals 11-3 = 8. We use the normal boundaries [C, Do [, for which the right boundary of the interval is not included in the set (this allows you to add intervals:
[3, 11 [+ [11, 15 [= [3, 15 [).

The power of the date interval depends on the type of unit of measurement. Can be counted in days, hours, minutes, seconds, etc. For example, the interval power [30.12.16, 02.01.17 [in days will be equal to the difference of dates - 3.

In order to calculate the interval relevance, it is necessary to know not only the power of a given interval, but also the power of the universe (intervals) relative to which relevance is estimated. This is necessary in order to obtain the overall relevance of the tuple, if the determinant includes other attributes of relevance.

Let us explain by example. Let the price, according to which the divisions of the organization trade, depends on the numbers of the month. A rule register might look like this:

	Subdivision	Period [From, To [	Price
one			BUT
2		[3, 11 [	B
3	P1		AT

It is described as follows. All divisions of the organization always sell at price A. The exception is the interval from the 3rd to the 10th. In this case the price B is used . And there is a separate rule for division P1 . It works on the price.
Question. What price list should the unit P1 of the 5th do?

To answer this question, one must know the entropy of the set of subdivisions and the entropy of the set of numbers of the month. In a month, on average, 30 days, rounded up to 32. Then the entropy of the month will be equal to $H (32) = log_2 (32) = 5$ . The entropy of the interval [3, 11 [is equal to $log_2 (11-3) = 3$ . The entropy of the divisions is set the same as in the previous example - 3. This corresponds to 8 divisions of the organization.

Now we can calculate the relevance of the determinant values:

	Subdivision	Period [From, To [	Price	L (Price)	L (Period)	L (total)
one			BUT	0	0	0
2		[3, 11 [	B	0	5-3 = 2	2
3	P1		AT	3	0	3

We see that for given entropy values, unit P1 always works according to the price list .
If the significance (entropy) of the units is reduced from 3 to 2, then a collision will arise in this register. Two valid rules will have the same relevance value - 2.

Open intervals

Often there are situations in which the subsets of intervals are given only by one boundary. , , . 4- — 100%- , 4- 16 — 50%-, 16 — . :

		%
0		100
four		50
sixteen		0

, . 70 , :

_	%	L
0	100	0
four	50	$log_2(70)-log_2(66)=0.08$
sixteen	0	$log_2(70)-log_2(54)=0.37$

: " >= _" . 5 , 50%- .

, , - "" . , , .

.

. () . ( ). .

, $n$ . , 32 . $n$ $32^n = 2^{5n}$ . $5n$ .

, ( ) $32^{n-1}$ . $5(n - 1)$ . — . , :

$L(1) = H(n) - H(n, 1) = 5n - 5(n - 1) = 5.\\$

$L(2) = H(n) - H(n, 2) = 5n - 5(n - 2) = 2 \cdot 5 .\\$

. . $k$ :

$L(k) = H(n) - H(n, k) = k \cdot Hc \quad(6)\\$

Here $Hc$ — . , . , .

-. . , . , — ( ). , .

one
- 1.1
- 1.2
2
- 2.1
- 2.2

, . , , , .

— ( — ), .

$L(e) = H(n) - H(e)\\$

, .

. , (, ) — $c$ . . $k$ :

$N(k) = 1 + c \cdot N(k+1) \quad(7)\\$

$N(k) = 1 + c \cdot (1 + c \cdot N(k+2)) = 1 + + ^2 \cdot N(k+2) \quad(7')\\$

Etc. (7) :

$L(1) = log_2(N(0)/N(1)) = log_2(1/N1 + c) \quad(8)\\$

, $1/N1$ $c$ . 1- :

$L(1) = log_2(c) \quad(8')\\$

(7') 2- :

$L(2) = log2(c^2) = 2 log2(c) \quad(9)\\$

$k$ - :

$L(k) = k log2(c) = k \cdot Hc \quad(10)\\$

Here $Hc = log_2(c)$ — . (6) (10), , . .

, .

1-

Let's pause. In practice, it is not often necessary to use the above formulas for calculating relevance. Usually enough approximations and understanding of the basic principle.

In the second part, the concept of the relevant sample will be generalized. There is in fact no clear division into fact and rule tables. But there is a division of the attributes of relations into concrete and universal ones.

Continued ...

Source: https://habr.com/ru/post/324710/

All Articles