Statistical problems of identification of network structures
In complex networks, various filtering processes can single out important network structures that carry meaningful information about the network. Traditionally considered network structures include the maximum spanning tree, the planar maximally filtered graph, the threshold (market) graph, the maximal cliques and maximal independent sets of the threshold graph, and others.
Given the statistical nature of the source data, the problem of identifying network structures arises. The lecture presented here is devoted to the recent development of this topic within the theory of simultaneous testing of many statistical hypotheses (multiple decision statistical procedures, multiple test procedures). This approach makes it possible to develop methods for assessing the statistical uncertainty of network structures and to construct optimal and stable statistical identification procedures. It turns out that network structures built on the probability of sign coincidence are preferable to structures built on classical Pearson correlations. The lecture describes the application of these results to the analysis of stock markets.
The lecture was delivered at the Faculty of Computer Science, opened at HSE with the support of Yandex. The lecturer, Valery Kalyagin, is a Doctor of Sciences in Physics and Mathematics, tenured professor at the Higher School of Economics, and head of the Department of Applied Mathematics and Informatics and the Laboratory of Algorithms and Technologies for Analysis of Network Structures at HSE Nizhny Novgorod.
Under the cut is a full transcript of the lecture.

The problem of identifying network structures was completely new to us; we started working on it in 2012, when the laboratory was created. One of the tasks in this laboratory was connected specifically with certain network structures. We became interested in the problem in a different form than the one in which it was originally posed. What I am going to present is more the mathematical side than the applied one.
At first glance the problem is very simple, and anyone can understand it. There is a complete weighted graph, let us call it a network, with N nodes. What distinguishes this problem is that we associate a random variable with each node. Thus we have a random vector (a multidimensional, N-dimensional random vector). Then there are connections between these random variables. They are measured by some function (it has many names: similarity, association, dependence, ...) that characterizes the connection between two random variables. So far this is an abstract setup. A network structure, for us, is some graph that we can extract from this network, a subgraph of this network. Not an arbitrary one, but an interesting one, with some good, useful properties.
The identification problem is formulated as follows: we have a sample of length n from this multidimensional distribution, n observations of this random vector. From these observations we must identify the structure determined by the links between the random variables in the network. That is, there is some structure hidden from us, there are observations, and we should reconstruct this structure from the observations. A typical statistics problem. The only slightly unusual thing is the object: the object to be reconstructed is a graph. Naturally, in such a general setting (especially since we have not yet said what a structure is), this problem cannot be solved.
There are several examples, the so-called Gaussian graphical models. These are multidimensional normal distributions in which the relationship between nodes is given by the so-called partial correlations between two random variables. Partial correlations measure the so-called pure link between two variables, from which the influence of all the others is excluded. In mathematical language, these partial correlations are directly related to the elements of the inverse covariance matrix Σ⁻¹. The network structure here is the concentration graph. The concentration graph is constructed very simply: if the partial correlation between two nodes is zero, we do not draw an edge; if it is nonzero, we draw one. In fact, this corresponds to zero or nonzero entries of the inverse covariance matrix. The theory of Gaussian graphical models is itself a fairly advanced field, and there are many books on the subject. The conditional distributions that arise are very interesting.
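To make the construction concrete, here is a minimal sketch in Python (assuming numpy; the function names are illustrative, not from the lecture) of how partial correlations are read off from the inverse covariance matrix and how the concentration graph follows from them:

```python
import numpy as np

def partial_correlations(cov):
    """Partial correlations from the precision matrix K = Sigma^{-1}:
    rho_ij = -K_ij / sqrt(K_ii * K_jj)."""
    K = np.linalg.inv(cov)
    d = np.sqrt(np.diag(K))
    rho = -K / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

def concentration_graph(cov, tol=1e-10):
    """Edge (i, j) is present iff the partial correlation is nonzero
    (numerically: exceeds a small tolerance)."""
    rho = partial_correlations(cov)
    N = rho.shape[0]
    return [(i, j) for i in range(N) for j in range(i + 1, N)
            if abs(rho[i, j]) > tol]
```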
Somewhat subtly, one can see that if some of the links are absent, certain conditional independencies appear elsewhere (this is still very imprecise). Accordingly, the theory examines how the distributions are arranged when a concentration graph is specified.
The identification problem is what interests us here; that is what I want to talk about. You have observations of these random vectors and must identify the concentration graph on the basis of these observations. And here comes something new that, in fact, influenced all our subsequent work: you now have to test more than one hypothesis. Let me explain. We could select a single edge and test the hypothesis of whether the partial correlation there is zero or not. Such an approach would be wrong. We can formally apply it, but it is difficult to say what quality such a statistical procedure for identifying the concentration graph would have, and here is why: there are connections between these random variables, and these connections can affect the testing of each individual hypothesis. If we naively test the individual hypotheses without taking into account the existing dependencies between the random variables, we will obtain rather poor statistical procedures. Recently many procedures have been built to identify the concentration graph, ever more accurate and more adequate. The new point is that you have to test many hypotheses at once; that is the correct way to pose the problem. Not testing each edge individually, but many at once. Later I will show the formulation as we see it. For now, note that a priori the concentration graph can be any graph on N vertices; each such graph is one hypothesis, so there are as many hypotheses as there are graphs on N vertices. And from all of them we need to choose the one most adequate to the observations we have.
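To illustrate why multiplicity matters, here is a hedged sketch of the naive edge-by-edge test with a standard family-wise correction added (Fisher's z-transform plus Holm's step-down). This is a generic textbook device for the multiple testing problem, not the multiple decision procedure developed in the lecture:

```python
import numpy as np
from scipy import stats

def edgewise_pvalues(r, n):
    """Naive test of H_ij: rho_ij = 0 via Fisher's z-transform of a
    sample correlation r from n observations. (For partial correlations
    the degrees of freedom would need an adjustment.)"""
    z = np.arctanh(r) * np.sqrt(n - 3)
    return 2 * stats.norm.sf(np.abs(z))

def holm(pvals, alpha=0.05):
    """Holm step-down adjustment: controls the family-wise error over
    all m = N(N-1)/2 edge hypotheses simultaneously."""
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)
    m = len(pvals)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] > alpha / (m - rank):
            break
        reject[idx] = True
    return reject
```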
Another model is the so-called network model of the market: the assets on the market correspond to the nodes of the network, the relationship between assets is given by a weighted edge, and the weight of the edge reflects the similarity in the behavior of the assets. This network is a complete weighted graph, and the random variables here are some characteristics of these assets. I wrote prices, but prices themselves do not fit our model, because price dynamics are modeled as something like a martingale. That is a more complex construction, certainly not a sequence of independent observations: it is clear that the price at each new point in time depends on the previous one. Under some assumptions it is a random process with the martingale property. Returns and trading volumes, however, fit our model perfectly. Returns have the properties we need: we can consider the observations independent and identically distributed, which gives a standard sample. The literature uses various measures of the relationship between random variables, in our case between assets: Pearson correlations, partial correlations, sign correlations, Kendall's τ correlations, Kruskal correlations, Spearman correlations, and many others.
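A small sketch of the data side under these assumptions (numpy only; the sign measure shown is one common "coincidence of signs" variant, and the exact definition used in the lecturer's work may differ):

```python
import numpy as np

def log_returns(prices):
    """Log returns: the i.i.d.-style quantity used instead of raw prices.
    prices: array of shape (n+1, N), one column per asset."""
    return np.diff(np.log(np.asarray(prices, dtype=float)), axis=0)

def pearson_matrix(R):
    """Classical Pearson correlations between assets (columns of R)."""
    return np.corrcoef(R, rowvar=False)

def sign_correlation(x, y):
    """A sign-based measure: average agreement of the signs of the
    deviations from the mean, rescaled to [-1, 1]."""
    return np.mean(np.sign(x - x.mean()) * np.sign(y - y.mean()))
```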
Knowing the true values of these connections γ between the random variables, we get what we call the reference network: the core network, the foundation. It is the network we would obtain if we computed the true characteristics between the random variables and used them as edge weights. This network does not depend on any observations; it is, in effect, our model of the network.
The following structures have been, and remain, popular.
Maximum Spanning Tree (MST): in a complete graph there are many spanning trees, and among them one must choose the tree of maximum total weight (maximum total closeness); this is the maximum spanning tree. This topic originally appeared under the banner of "econophysics", when professional physicists engage in economics. Sometimes it turns out well, and sometimes badly.
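A minimal sketch of computing the MST of such a network, assuming networkx (which implements the standard spanning tree algorithms):

```python
import networkx as nx

def maximum_spanning_tree(weights):
    """weights: symmetric N x N matrix of pairwise similarities gamma_ij.
    Returns the spanning tree of maximum total weight."""
    N = len(weights)
    G = nx.Graph()
    for i in range(N):
        for j in range(i + 1, N):
            G.add_edge(i, j, weight=weights[i][j])
    return nx.maximum_spanning_tree(G)
```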
There is a very famous person, Eugene Stanley, a physicist. He launched and supported this line of work, and within its framework attempts were made to analyze network structures. Using the MST, purely empirical analyses of existing markets were made. It was noted that there are certain hierarchical structures and other interesting observations; this can be attributed to a kind of primitive data mining. We want to find out whether there is something on the market that we do not see directly: build this tree, see how it changes (there are changes at moments of crisis), and so on. That is, there is a whole line of activity around this maximum spanning tree.
Later the same group of people (there are several of them, including Mantegna) published an interesting article in a good journal: they proposed taking as the characteristic not a spanning tree but a richer structure.
The Planar Maximally Filtered Graph (PMFG) arises as follows. Recall Kruskal's algorithm for building a spanning tree; it is very simple: you order all the edges and add them to your construction as long as no cycle appears. If a cycle appears, you skip the edge, and so on. This is how the minimum spanning tree is built (the maximum one in our case; one just reverses the order). But the tree structure is poor. Why? A tree contains not a single clique, that is, no three vertices all connected to each other, although in the real market, say the Russian one, the maximum clique of the market graph consists of eight companies that are very closely linked. You know them all. These eight companies account for 97% of the total market volume. No other market has such concentration in its maximum clique. There are maximum cliques, that is, closely related assets, in the American, English, French, German, and Japanese markets, but they are not as important in terms of capitalization or trading volume.
Now we want to build not a spanning tree but a planar graph with maximum connections between these assets. It is done this way: the edges are ordered again, and we keep adding edges as long as the graph remains planar. If it ceases to be planar, we throw the edge away, and so on. That is what they proposed, and it was interesting work. Why only planar? A torus, a sphere with two handles: the same construction can be carried out on any surface of a given genus.
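A hedged sketch of this greedy construction, using networkx's planarity test; it follows the verbal description above rather than any reference implementation:

```python
import networkx as nx

def pmfg(weights):
    """Planar Maximally Filtered Graph: scan edges by decreasing weight
    and keep each edge only if the graph stays planar."""
    N = len(weights)
    edges = sorted(((weights[i][j], i, j)
                    for i in range(N) for j in range(i + 1, N)),
                   reverse=True)
    G = nx.Graph()
    G.add_nodes_from(range(N))
    for w, i, j in edges:
        G.add_edge(i, j, weight=w)
        is_planar, _ = nx.check_planarity(G)
        if not is_planar:
            G.remove_edge(i, j)
        if G.number_of_edges() == 3 * N - 6:  # edge bound of a planar graph
            break
    return G
```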
What is the interest of a surface of genus g? A graph constructed this way always has bounded clique size, and this seems to be one of its advantages; it was never really explained, it is simply a property of the construction. Pardalos and his students proposed a different construction, long known in graph theory: the so-called threshold method, or threshold (cut-off) graph. This construction is used in many graph-based data mining methods. We fix a certain threshold and remove all edges whose weight is less than or equal to this threshold; that is, we keep only the connections of significant strength, starting from some value of the proximity measure.
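The threshold construction itself is a one-liner; a minimal sketch, assuming a symmetric weight matrix:

```python
import numpy as np

def market_graph(weights, gamma0):
    """Threshold (market) graph: keep only the edges whose weight
    strictly exceeds the threshold gamma0."""
    W = np.asarray(weights)
    N = W.shape[0]
    return [(i, j) for i in range(N) for j in range(i + 1, N)
            if W[i, j] > gamma0]
```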
If we find the maximum clique in this graph for the Russian market, this is how it looks. But not only maximum cliques are of interest here; there are also maximal independent sets in this graph. It turned out (there was a small empirical work about this) that they can be useful if we want to build an optimal portfolio in the Markowitz model. Our goal is to keep a small number of assets in the portfolio. In principle, one can use the entire market and build an investment portfolio that is optimal from various points of view, but in the Markowitz model the quantity we want to control is quite definite: a utility involving risk and return. Of course, there are many assets on the market; how do we find those that allow us to build a good portfolio? It turned out that independent sets can be useful here.
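A sketch of extracting both structures from the market graph, assuming networkx; it uses the fact that independent sets of a graph are exactly the cliques of its complement (both problems are NP-hard, so this is practical only for modest N):

```python
import networkx as nx

def max_clique_and_independent_set(G):
    """Enumerate maximal cliques to find a maximum clique of G, and a
    maximum independent set as a maximum clique of the complement."""
    clique = max(nx.find_cliques(G), key=len)
    independent = max(nx.find_cliques(nx.complement(G)), key=len)
    return clique, independent
```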
In particular, the concentration graph is related to this kind of graph: the threshold graph is in effect a one-sided cut, while for the concentration graph a two-sided cut is needed. We test whether the partial correlations are zero or nonzero, and we can build confidence intervals; the resulting problem is, if not exactly the same, then similar.
A simple example, taken directly from that work on the Planar Maximally Filtered Graph, similar to ours. We carefully picked 10 assets, for which:
Here is the matrix of correlations computed empirically from the observations.
Here is the maximum spanning tree (here, the minimum one). We picked the example deliberately: one cluster and a second cluster are interconnected here (one could talk at length on this topic, but that is not our interest). This is how this structure turned out.
Here is the planar graph; it no longer has a nice, clear visualization, but we can compute with it.
As for the threshold graph, the threshold here is about 0.55. Here is the graph itself; it has its maximum clique: six nodes, six assets very closely interconnected among these ten. The maximal independent set is also drawn. There are many independent sets here, but the one drawn is the set of minimum weight. There can also be many maximum cliques, and one can likewise choose the one of maximum weight (but this is just an example).
The identification problem. Again, we have a set of observations, and we pose the problem as follows: there is a reference network with some structure, for example the maximum spanning tree or a market graph, and we want to reconstruct this structure from observations. What is interesting: from the start we can build many such networks, depending on which measure of proximity we take. At first it was unclear what the difference between them was; then a very interesting phenomenon emerged (I will show it to you). In a large part of the literature, people simply take the market, take some measure of connection, compute, and find characteristics without much thought. They try to interpret them or do some analysis of what it looks like and how it could be read from the point of view of the market. The most interesting part is this: as time goes on, the characteristics change, the connections change. What happens in times of crisis? It is probably not hard to guess that at moments of crisis the connections strengthen. That is, everything goes down together or, just before a crisis, everything goes up together. Such phenomena are observed, and people even try to build algorithms to predict crises.
What will interest us: how does one build an identification procedure at all? I will describe one primitive procedure that everyone uses. How do we measure the uncertainty, or quality, of such a procedure? Every procedure makes errors, so how can we evaluate this error? (Quality is simply the inverse of error.) And can we make some procedures optimal, and speak of the optimality of such procedures at all?
There is another important topic in statistics: for the market, the crucial point is whether we can build stable, or robust, procedures, that is, procedures insensitive to the distribution we are considering, because often we do not know the distribution. We can build a procedure that is optimal for a given distribution, but as soon as we change this distribution slightly (for example, because we do not actually know what it is and it differs from our assumption), this procedure gives a much worse result than some other, more primitive one. I will try to answer these questions as best I can.
What is the simplest procedure that comes to mind? We want to reconstruct a structure defined by some weights γ_ij of a weighted graph. How will we do it? Take some estimate γ̂_ij (gamma with a hat) of this characteristic from the set of observations. If γ is, for example, the Pearson correlation, this will be the sample Pearson correlation; if Kendall's, the sample Kendall correlation; likewise for sign correlations (the coincidence of signs). And we build a so-called sample network (before, we had the reference network); this is the sample network.
If we need a spanning tree of maximum weight, we assume it is built in this sample network: we build it here, and here we have it. If we need a planar maximally filtered graph, we run the same algorithm on the sample network and take the result as our identification, as the structure we have found. For the threshold graph, we simply take all the sample connections, evaluate them, delete those that are less than or equal to the given threshold γ₀ (the condition γ̂_ij ≤ γ₀), and obtain a sample market graph. These are, in fact, the objects everyone works with. If we had a different set of observations, all of this would clearly change. Here a serious task arises: how to assess the error, how to estimate the uncertainty of a particular structure, and how to assess the quality of the procedures we can propose for this identification.
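Putting the pieces together, a minimal sketch of this simplest identification procedure for the market graph (numpy only; the measure and the threshold are parameters, and the names are illustrative):

```python
import numpy as np

def identify_market_graph(X, gamma0, measure=None):
    """Simplest identification procedure: estimate gamma_ij from the
    sample X (n observations x N assets), then apply to the sample
    network the same construction that defines the reference structure."""
    if measure is None:
        measure = lambda X: np.corrcoef(X, rowvar=False)
    gamma_hat = measure(X)            # sample network weights
    N = gamma_hat.shape[0]
    return {(i, j) for i in range(N) for j in range(i + 1, N)
            if gamma_hat[i, j] > gamma0}
```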
Let us go through these tasks. Take the simplest model, the Gaussian multidimensional distribution. A measure of proximity is given between the random variables, so we have a standard statistical model: the observations X(t), t = 1, ..., n, are independent, identically distributed copies of the random vector. Let us specify the reference graph (the reference market graph) by its adjacency matrix: an entry is 0 if there is no edge and 1 if there is an edge; γ₀ is a given threshold; and we will consider all such adjacency matrices possible on N nodes.
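Under this Gaussian model, the error of the simple procedure can be estimated by simulation. A hedged sketch (the error measure used here, the fraction of misclassified edges, is one natural choice and not necessarily the loss function analyzed in the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_correlations(sigma):
    """Reference weights: correlations implied by the covariance Sigma."""
    d = np.sqrt(np.diag(sigma))
    return sigma / np.outer(d, d)

def identification_error(sigma, gamma0, n, trials=200):
    """Monte Carlo estimate: sample n observations from N(0, Sigma),
    identify the market graph from sample correlations, and count edge
    disagreements with the reference graph."""
    N = sigma.shape[0]
    iu = np.triu_indices(N, k=1)
    ref = (true_correlations(sigma) > gamma0)[iu]
    err = 0.0
    for _ in range(trials):
        X = rng.multivariate_normal(np.zeros(N), sigma, size=n)
        est = (np.corrcoef(X, rowvar=False) > gamma0)[iu]
        err += np.mean(ref != est)
    return err / trials
```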