Association rule mining is a well-known data analysis method. Habr already has a publication that covers the history of the method and its general definitions. This article discusses how to adapt the association rule search algorithm to survey data, that is, data collected from respondents. The results of the algorithm are demonstrated on data from the European Social Survey (ESS).
Survey results are understood here as the database of respondents' answers to the questions of a study. The nature of the study (sociological, marketing, or any other) does not matter for the subsequent analysis. In addition to ESS, examples of various studies can be found on the websites of VTsIOM, Levada Center, TNS Russia and many other organizations.
Definitions and examples of association rules
The term transaction, as applied to surveys, means the set of answers to the survey questions received from one respondent. An example of a transaction in a survey: Gender = Female, Education = Higher, Marital status = Not married, ....
Quantitative variables in a survey are often presented as intervals: Age = 18-25, Volume of soft drinks consumed in the last 6 months = 5-10 liters.
Thus, all possible answers to the survey questions form a finite set I, and the survey data is a set of transactions D, whose size equals the number of respondents in the study.
Let A be some non-empty subset of I. The number of transactions from D that contain all elements of A is called the support of A and is denoted by supp(A).
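An association rule X -> Y is a pair of disjoint subsets X, Y of the set I. The two main characteristics of the rule X -> Y are its support and its confidence (reliability):
$$\mathrm{supp}(X \to Y) = \mathrm{supp}(X \cup Y), \qquad \mathrm{conf}(X \to Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}.$$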
An example of an association rule is Age = 55+ -> Social Status = Pensioner. Its support of 0.3, expressed as a share of all transactions, means that 30% of survey respondents are at least 55 years old and retired. The confidence of the rule, 0.7, shows that among respondents aged at least 55, 70% are retired.
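As a minimal sketch of these two definitions, the following Python snippet computes support and confidence on five invented transactions. The function names and the data are illustrative assumptions and do not come from the study; support is expressed as a share of all transactions.

```python
# Toy survey: each transaction is the set of answers given by one respondent.
# The data are invented purely for illustration.
transactions = [
    {"Age=55+", "Status=Pensioner", "Gender=Female"},
    {"Age=55+", "Status=Pensioner", "Gender=Male"},
    {"Age=55+", "Status=Employed", "Gender=Male"},
    {"Age=18-25", "Status=Student", "Gender=Female"},
    {"Age=26-54", "Status=Employed", "Gender=Female"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """supp(X u Y) / supp(X) for the rule X -> Y."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


X = {"Age=55+"}
Y = {"Status=Pensioner"}
rule_support = support(X | Y, transactions)        # 2 of the 5 respondents
rule_confidence = confidence(X, Y, transactions)   # 2 of the 3 respondents aged 55+
print(rule_support, rule_confidence)
```

In this toy data the rule holds for two of the five respondents, and for two of the three respondents aged 55+.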
A feature of surveys is that the transactions in a study are not necessarily equally probable. As a rule, each respondent in a study is assigned a weight. The Apriori algorithm in the implementation by Christian Borgelt (version 6.18) supports positive integer transaction weights, which define the multiplicity of each transaction.
The source files of this program are openly available. Small changes to its code make it possible to use positive real-valued transaction weights in the association rule search algorithm.
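The effect of real-valued weights can be illustrated with a small Python sketch. It is not Borgelt's implementation, nor the modification mentioned above; the function `weighted_support` and the weights are hypothetical and only show how a weighted share replaces a simple count.

```python
def weighted_support(itemset, transactions, weights):
    """Total weight of the transactions containing `itemset`,
    divided by the total weight of all transactions."""
    itemset = set(itemset)
    total = sum(weights)
    matched = sum(w for t, w in zip(transactions, weights) if itemset <= t)
    return matched / total


transactions = [
    {"Age=55+", "Status=Pensioner"},
    {"Age=55+", "Status=Employed"},
    {"Age=18-25", "Status=Student"},
]
weights = [1.5, 0.5, 2.0]  # hypothetical real-valued respondent weights

# Weighted share of respondents aged 55+: (1.5 + 0.5) / 4.0
share = weighted_support({"Age=55+"}, transactions, weights)

# With unit weights the function reduces to the ordinary (unweighted) share.
unweighted = weighted_support({"Age=55+"}, transactions, [1, 1, 1])
```

Integer weights correspond to repeating a transaction several times; real-valued weights generalize this multiplicity.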
ESS is a project that measures the attitudes, perceptions, opinions and behavior of the population in more than 30 European countries.
The methodology and the survey results are publicly available on the project website. The examples here use data from wave 6 (version 2.1).
Formulation of the problem
Let some group of countries C be selected for analysis, for example, C = {Denmark, Russia, France}.
On the basis of the ESS data, it is required to identify those attributes whose share in one of the countries of C significantly exceeds the share of the same attribute in the other countries of C.
Data preparation
A rule X -> Y should show that the attribute X is more characteristic of residents of country Y. Let the respondent weights of the ESS survey (the dweight variable) be as follows:
$$w_{ij}, \quad i = 1, \dots, n_j.$$
Here the index j identifies the country, the index i enumerates the respondents of the j-th country, and n_j denotes their number.
We need to normalize the weights so that the weights of the respondents of each country in C sum to 100, that is,
$$\tilde{w}_{ij} = 100 \cdot \frac{w_{ij}}{\sum_{k=1}^{n_j} w_{kj}}, \qquad \sum_{i=1}^{n_j} \tilde{w}_{ij} = 100.$$
With the weights defined this way, the support of the rule, supp(X -> Yj), equals the share (in percent) of the attribute X in the country Yj.
The support of the attribute X is equal to the sum of the shares of this attribute over all countries from C:
$$\mathrm{supp}(X) = \sum_{Y_j \in C} \mathrm{supp}(X \to Y_j).$$
Then the confidence of the rule X -> Yj is the ratio of the share of the attribute X in the country Yj to the sum of the shares of X over all countries of C.
The survey data were loaded into the R environment using the foreign package.
Respondents' answers to the questions common to all the countries were converted into a transaction database using the arules package.
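The data preparation described in this section can be sketched as follows. The original processing was done in R with the foreign and arules packages; the plain Python below is only an illustration, and the respondents, raw weights and the attribute name `Trust=Low` are invented.

```python
from collections import defaultdict

# Hypothetical respondents: (country, raw weight, set of answers).
respondents = [
    ("Denmark", 0.8, {"Trust=High"}),
    ("Denmark", 1.2, {"Trust=Low"}),
    ("Russia", 1.0, {"Trust=Low"}),
    ("Russia", 3.0, {"Trust=High"}),
    ("France", 2.0, {"Trust=Low"}),
    ("France", 2.0, {"Trust=Low"}),
]

# Sum of the raw weights within each country.
country_total = defaultdict(float)
for country, w, _ in respondents:
    country_total[country] += w

# Rescale so that the weights of every country sum to 100.
normalized = [
    (c, 100.0 * w / country_total[c], answers) for c, w, answers in respondents
]


def rule_support(item, country):
    """supp(X -> Y_j): weighted share (in %) of attribute `item` in `country`."""
    return sum(w for c, w, answers in normalized if c == country and item in answers)


countries = sorted(country_total)
shares = {c: rule_support("Trust=Low", c) for c in countries}
supp_x = sum(shares.values())                       # supp(X): sum of country shares
conf = {c: shares[c] / supp_x for c in countries}   # confidence of X -> Y_j
```

Because every country contributes a total weight of 100, the confidence compares the countries by the share of the attribute rather than by sample size.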
Finding and visualizing the solution of the problem
The following restrictions were used when searching for rules:
1) The right-hand side Y of the rule X -> Y contains only the country;
2) The left-hand side X consists of no more than 2 elements (statements);
3) The minimum allowable support and confidence of a rule are 3% and 2/3 (about 66.7%), respectively.
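The three restrictions above can be sketched as a brute-force search in Python. This is not the Apriori implementation used in the article; `find_rules`, the constants and the toy respondents are hypothetical, and the weights are assumed to be already normalized so that each country sums to 100.

```python
from itertools import combinations

MIN_SUPPORT = 3.0        # percent, on the scale where each country sums to 100
MIN_CONFIDENCE = 2 / 3
MAX_LHS_SIZE = 2         # the left-hand side holds at most two statements


def find_rules(respondents, countries):
    """respondents: list of (country, normalized weight, set of answers)."""
    # share[lhs][country] = weighted share of the itemset `lhs` in that country
    share = {}
    for country, weight, answers in respondents:
        for size in range(1, MAX_LHS_SIZE + 1):
            for lhs in combinations(sorted(answers), size):
                by_country = share.setdefault(lhs, dict.fromkeys(countries, 0.0))
                by_country[country] += weight

    rules = []
    for lhs, by_country in share.items():
        total = sum(by_country.values())  # supp(X) over all countries
        for country, supp in by_country.items():
            conf = supp / total
            if supp >= MIN_SUPPORT and conf >= MIN_CONFIDENCE:
                rules.append((lhs, country, supp, conf))
    return rules


# Invented data: two countries, weights already normalized to 100 per country.
respondents = [
    ("Denmark", 70.0, {"A", "B"}),
    ("Denmark", 30.0, {"B"}),
    ("France", 20.0, {"A", "C"}),
    ("France", 80.0, {"C"}),
]
rules = find_rules(respondents, ["Denmark", "France"])
for lhs, country, supp, conf in rules:
    print(lhs, "->", country, round(supp, 1), round(conf, 2))
```

Enumerating all one- and two-element left-hand sides is feasible only for small examples; a real survey would rely on Apriori-style pruning by minimum support.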
As is well known, the number of association rules found can be quite large. To display the solution conveniently, the Tableau Public service was used, which allows creating interactive data visualization dashboards.
The highlighted rule says that about 17% of people in France aged 15 and over:
- completely agree with the statement that for most people in the country, life is getting worse rather than better
and
- do not identify themselves with a person for whom it is important to be rich, to have a lot of money and expensive things.
The confidence of this rule is high: 84%. This means that the combined share of respondents in Denmark and Russia who gave exactly the same answers is more than 5 times smaller than 17% (since 0.84 / 0.16 > 5).
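The arithmetic behind this statement can be written out explicitly. The symbols s_F, s_D and s_R are introduced here only for illustration and denote the shares of X in France, Denmark and Russia; the figures 17% and 84% are the rounded values quoted above.

```latex
\[
\mathrm{conf}(X \to \text{France}) = \frac{s_F}{s_F + s_D + s_R} = 0.84
\quad\Longrightarrow\quad
\frac{s_D + s_R}{s_F + s_D + s_R} = 0.16,
\]
\[
\frac{s_F}{s_D + s_R} = \frac{0.84}{0.16} = 5.25 > 5,
\qquad
s_D + s_R \approx \frac{17\%}{5.25} \approx 3.2\%.
\]
```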
Computing the share of the left-hand side X of this rule for each country separately gives the following results
Readers who would like to explore the resulting rules on their own can follow this link.
The same approach applies to marketing surveys. For example, their results can be used to search for rules among groups of loyal consumers of different brands of the same type of product. In this case, the manufacturer of brand N learns the distinguishing features of its own customers and obtains information about the distinctive characteristics of consumers who prefer competitors' products.
Searching for association rules in a database of weighted transactions makes it possible to solve exploratory analysis tasks for data obtained from respondents. The method detects characteristic events based on point estimates (shares). The next part of the article will consider methods for the statistical evaluation of these results.