📜 ⬆️ ⬇️

Data analysis. Approximate sets

Decided to create a series of posts about data analysis. I have been working in this (and as it turned out, quite interesting) field of computer science for several years. I bring to your attention the analysis of data from the point of view of the Theory of approximate sets.

What will be discussed?


The theory of approximate sets (rough sets) was developed [Zdzisław Pawlak, 1982] as a new mathematical approach to describe uncertainty, inaccuracy and uncertainty. It is based on the statement that we associate some information (data, knowledge) with each object of the universe. Objects characterized by the same information are indistinguishable (similar) from the point of view of the information about them. The indistinguishability relation generated in this way is the mathematical basis of the theory of approximate (coarse) sets.

The basis of the concept of the theory of approximate sets is the approximation of sets.

We now give the concept of approximation of approximate sets:

Example

')

Actually use


Approximate sets are used when working with data tables, which are also called attribute-value tables, or information systems, or decision tables (decision tables). The decision table (decision table) is a triple Τ = (U, C, D), where
U is a set of objects
C is a set of condition attributes,
D is the set of decision attributes.

Sample table

UCD
HeadacheTemperatureFlu
U 1Yesnormalnot
U 2YeshighYes
U 3Yesnormalnot
U 4Yesvery highnot
U 5nothighnot
U 6notvery highYes
U 7nothighYes
U 8notvery highYes


Table analysis

Sets:
U = {U 1 , U 2 , U 3 , U 4 , U 5 , U 6 , U 7 , U 8 }
C = {Headache, Temperature}
D = {Flu}

Possible attribute values:
V Headache = {yes, no}
V Temperature = {normal, high, very high}
V Flu = {yes, no}

The splitting of the set U in accordance with the values ​​of the attribute Headache has the form:

The splitting of the set U in accordance with the values ​​of the attribute Temperature has the form:

The partition of the set U in accordance with the values ​​of the attribute of the solution. Influenza is:

The data presented in this table, for example, U 5 and U 7 are inconsistent, and U 6 and U 8 are repeated.
U 5nothighnot
U 6notvery highYes
U 7nothighYes
U 8notvery highYes

Actually using approximate sets we can “extract” from inaccurate, contradictory data those that are “useful to us”.

What are we going to work on?


The following posts will demonstrate the practical implementation (in Python) of data analysis using this theory, including:

Source: https://habr.com/ru/post/137284/


All Articles