The best machine learning talks from ICML 2018, as chosen by Yandex employees

From July 10 to July 15, ICML, one of the world's largest machine learning conferences, was held in Stockholm. This year it drew a record number of participants: 7,000. Among them were Yandex employees representing the CatBoost and DeepHD technologies, the Toloka, Zen and Translate services, self-driving cars and other products. Today they share with Habr readers a selection of the talks they remember best.

By the way, the photo above shows a joke by the organizers: word statistics from the messages of authors who submitted papers to the conference. More precisely, from the author responses to reviewer criticism that led to the paper being accepted. This is what the "perfect" response looks like from a machine learning point of view. But we digress; let's move on to the review of the five best talks (all the headings are clickable and lead to the speakers' papers).

1. Fixing a Broken ELBO

One of the most interesting talks at the conference on generative models and representation learning.
Background

Variational autoencoders (VAE) model the distribution of the observed data $p(x)$ with the following process:

1. First, a “hidden” variable $z$ is sampled from some prior distribution $p(z)$.
2. Then a neural network, the decoder, with parameters $\theta$ turns it into the parameters of the distribution $p(x | z)$.
3. Finally, the next object is sampled from the resulting distribution $p(x | z)$.
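The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the prior is taken to be a standard normal, `toy_decoder` is a hypothetical linear map followed by a sigmoid (a real decoder would be a trained neural network), and the weights are made up.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def toy_decoder(z, weights, biases):
    """Hypothetical linear 'decoder': maps z to the Bernoulli parameters of p(x | z)."""
    return [sigmoid(sum(w * zi for w, zi in zip(row, z)) + b)
            for row, b in zip(weights, biases)]

def sample_object(weights, biases, latent_dim, rng):
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # step 1: z ~ p(z) = N(0, I)
    probs = toy_decoder(z, weights, biases)                # step 2: parameters of p(x | z)
    x = [1 if rng.random() < p else 0 for p in probs]      # step 3: x ~ p(x | z)
    return z, probs, x

rng = random.Random(0)
weights = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.3]]  # made-up: 3 "pixels", 2 latent dimensions
biases = [0.0, -0.5, 0.2]
z, probs, x = sample_object(weights, biases, latent_dim=2, rng=rng)
```

Here `x` is a binary vector, the analogue of a binarized image with one Bernoulli variable per pixel.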

This formulation makes it possible to build very complex models and easily interpretable representations from simple building blocks. For example, if $p(z)$ is a two-dimensional normal distribution and $p(x | z)$ is a 784-dimensional Bernoulli vector, a VAE successfully learns the manifold of MNIST digits.

If, at each point of the resulting space $z$, we draw the expectation of the corresponding distribution $p(x | z)$, we see that the model has neatly laid out the manifold of digits on the plane.

And if, for each digit class of the test set, we draw the posterior distribution $p(z | x)$ in a separate color, we see that the latent variables learn very useful representations: it is much easier to train a classifier on such two-dimensional features, and it certainly does not need much labeled data.

The likelihood of the observed data $X$ in this model is written as:

$p(X) = \prod_{x \in X} p(x) = \prod_{x \in X} \int p(x | z) p(z) \, dz$

If $p(x | z)$ is defined by a deep neural network, there is no chance of computing this integral analytically, so direct maximization of the likelihood is impossible. Instead, so-called variational Bayesian inference is used:

- For each object of the training set, an auxiliary distribution $q(z | x)$ is introduced, described by parameters $\phi_x$. The job of $q$ is to approximate the posterior distribution $p(z | x)$ as closely as possible. In other words, it predicts which $z$ the observed object $x$ was generated from.

- A lower bound on the log-likelihood of the sample is written down:

$\log p(X) \geq \sum_{x \in X} \int q(z | x) \left( \log p(x | z) - \log \frac{q(z | x)}{p(z)} \right) dz$

This lower bound is called the ELBO (Evidence Lower Bound). It is exactly equal to the log-likelihood if $q(z | x) = p(z | x)$.

- The resulting bound is maximized with respect to $\phi$ and $\theta$ by stochastic gradient methods, and the stochastic gradient of the inner integral is obtained either with the reparametrization trick or with the log-derivative trick (also known as the score function estimator, also known as REINFORCE).
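To make the reparametrization trick mentioned in the last step concrete, here is a minimal one-dimensional sketch. It assumes a Gaussian $q$ with mean `mu` and standard deviation `sigma`, and a toy integrand $f(z) = z^2$ chosen only because its exact gradient is known; none of these names come from the paper.

```python
import random

def reparam_grad_mu(mu, sigma, f_prime, n_samples, rng):
    """Monte Carlo estimate of d/d(mu) E_{z ~ N(mu, sigma^2)}[f(z)].

    Reparametrization: z = mu + sigma * eps with eps ~ N(0, 1), so the
    randomness no longer depends on mu and the gradient moves inside
    the expectation: E[f'(mu + sigma * eps) * dz/dmu], with dz/dmu = 1.
    """
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        total += f_prime(z)
    return total / n_samples

rng = random.Random(0)
# Toy integrand f(z) = z^2: E[f(z)] = mu^2 + sigma^2, so the exact gradient is 2 * mu.
grad = reparam_grad_mu(mu=1.5, sigma=0.5, f_prime=lambda z: 2.0 * z,
                       n_samples=20000, rng=rng)
```

With these settings the estimate should land close to the exact value $2 \mu = 3$; in a real VAE the same idea is applied to the encoder parameters through automatic differentiation.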

Because storing and optimizing the parameters of $q(z | x)$ for every object in the sample is too expensive, an additional neural network, the encoder, is introduced: it takes $x$ as input and returns $\phi_x$. This approach is called amortized variational inference. It is amortized inference that made it possible to train complex hierarchical probabilistic models on big data.

A decoder that is too smart

When we restrict $p(x | z)$ to a simple family of distributions (for example, normal distributions with diagonal covariance matrices), $z$ learns useful representations of the data. With simple arithmetic in the space of $z$, you can change a person's hairstyle in a photo or blend melodies.

However, simple models in which all elements of the vector $x$ are independent given $z$ do not cope well with complex data such as text or high-resolution images. It is reasonable to try using complex “neural network” distributions as $p(x | z)$: autoregressive networks such as WaveNet or PixelCNN, normalizing flows such as RealNVP, nested VAEs, or even $\alpha$-GAN.

When trying to train a VAE with such decoders, researchers run into serious difficulties: the “too smart” decoder “eats up” all the entropy in the data by itself and stops looking at the latent variable altogether. $z$ and $x$ turn out to be independent, and no useful representations can be extracted from the latent variable. Various papers have proposed ways around this problem, but they mostly boil down either to simplifying the decoder or to assorted hacks during training. The authors of this paper offer the first fundamental explanation of the causes of this effect and of ways to combat it.

Main ideas of the paper

- Machinery from information bottleneck theory is applied to the problem of representation learning. The authors propose to look at the mutual information between $z$ and $x$. When the mutual information equals $H(x)$, the latent variable compresses the data losslessly: we get a perfect autoencoder. When the mutual information is zero, the latent variable is independent of the data: the decoder has “devoured” all the entropy. To learn useful representations, we want something in between: the latent variable should describe the important information about the object (for example, face shape, eye color, ...) and throw away the details (the colors of specific pixels, textures).

- It is proposed to look at the VAE from an alternative angle: let $q(z | x)$ now be the main part of the model, while $p(x | z)$ and $p(z)$ are approximations to $q(x | z)$ and to the marginal distribution $E_x [q(z | x)]$.

- For a fixed $q(z | x)$, lower and upper bounds on the mutual information are considered, which depend on $p(z)$ and $p(x | z)$:

$H - D \leq I(x, z) \leq R$

where $H$ is the entropy of the data, $D$ is the distortion (the loss incurred by compression, i.e. the expected value of $-\log p(x | z)$), and $R$ is the rate: the KL divergence between $q(z | x)$ and $p(z)$, averaged over the data. The upper bound is tight when $p(z)$ equals $E_x [q(z | x)]$.

Each family of models corresponds to a convex set of points in the $R$-$D$ plane; the shape of this set depends on the “balance of power” between the encoder and the decoder.

- It is shown that optimizing the ELBO is equivalent to minimizing $D + R$, i.e. to searching for the point on the boundary of this set where the tangent slope corresponds to $\beta = 1$. If the decoder is flexible enough, while $p(z)$ cannot approximate $E_x [q(z | x)]$ well, it does not pay for the model to put additional information into the latent variable: from the ELBO point of view, it is cheaper to learn it with the decoder.

- To build the most useful representations, the authors propose to optimize $D + \beta R$, where $\beta$ is a hyperparameter responsible for the balance between the encoder and the decoder.

- The authors showed experimentally that by varying $\beta$ one can learn representations with different levels of abstraction: from an autoencoder ($\beta = 0$), which compresses the data losslessly, to an “autodecoder” ($\beta = \infty$), which ignores the data completely. By tuning the value of $\beta$, one can get a “semantic encoder”, which encodes only the general shape of the object, or a “syntactic encoder”, which is able to restore fine details. All the experiments were carried out with a super-powerful autoregressive PixelCNN++ decoder, which nobody had managed to use in a VAE before.
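The objective $D + \beta R$ from the list above can be sketched for a single object as follows. This is a minimal illustration, not the authors' code: it assumes a Gaussian $q(z | x)$ with diagonal covariance and a standard normal $p(z)$, so that the rate has a closed form, and the function names and numbers are made up.

```python
import math

def gaussian_kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))

def beta_objective(distortion, mu, sigma, beta):
    """D + beta * R for one object; beta = 1 gives the negative ELBO."""
    rate = gaussian_kl_to_standard_normal(mu, sigma)
    return distortion + beta * rate

# distortion stands for -log p(x | z) evaluated on a sample z ~ q(z | x).
loss_elbo = beta_objective(distortion=2.0, mu=[1.0, 0.0], sigma=[1.0, 1.0], beta=1.0)
loss_ae = beta_objective(distortion=2.0, mu=[1.0, 0.0], sigma=[1.0, 1.0], beta=0.0)
```

With $\beta = 0$ only the reconstruction term is left (the autoencoder end of the range), while a large $\beta$ makes any information stored in $z$ expensive.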

2. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music

The authors set out to map music (as a sequence of symbols) into a suitable latent space while preserving its semantics. This makes it possible to continue a musical sequence and to automatically create smooth “transitions” between different fragments. They propose to use a VAE.

The main problem with a VAE here is that the decoder very quickly forgets the latent code. The authors propose to fight this with a hierarchical decoder. The output sequence is split into small pieces (in the paper, the music is split into bars). The latent code is fed to the input of the upper-level decoder, which returns an embedding for each bar. A low-level LSTM then predicts the notes inside the bar. Formally, this approach does not solve the forgetting problem, but in practice the resulting sequences match the original piece much better.

The encoder is a bidirectional LSTM. The final hidden states $\overrightarrow{h}_T$ and $\overleftarrow{h}_T$ are concatenated into $h_T$ and used to predict the parameters of the distribution of $z$:

$\mu = W_{h\mu} h_T + b_\mu$

$\sigma = \log \left( \exp \left( W_{h\sigma} h_T + b_\sigma \right) + 1 \right)$
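The two formulas above amount to two linear layers, the second followed by a softplus that keeps $\sigma$ positive. A minimal sketch in plain Python, with made-up weights and a made-up four-dimensional $h_T$ (the helper names are hypothetical):

```python
import math

def softplus(t):
    """log(exp(t) + 1), computed in a numerically stable way."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def affine(W, h, b):
    return [sum(w * hi for w, hi in zip(row, h)) + bi for row, bi in zip(W, b)]

def latent_parameters(h_T, W_mu, b_mu, W_sigma, b_sigma):
    mu = affine(W_mu, h_T, b_mu)
    sigma = [softplus(t) for t in affine(W_sigma, h_T, b_sigma)]
    return mu, sigma

h_T = [0.2, -1.0, 0.5, 3.0]  # stands for the concatenated forward and backward states
W_mu = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_sigma = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, -10.0]]
mu, sigma = latent_parameters(h_T, W_mu, [0.0, 0.0], W_sigma, [0.0, 0.0])
```

Even for a strongly negative pre-activation (the second component here) the softplus output stays strictly positive, which is why it is a convenient choice for a standard deviation.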

The overall network layout looks like this:

The Lakh MIDI Dataset, which contains about one and a half million MIDI files, is used for evaluation. The authors used only tracks in 4/4 time. On short sequences (predicting the next two bars), the quality of MusicVAE is comparable to that of a classic VAE. However, already when predicting 16 bars, the error rate of the classic VAE exceeds 27%, while for the hierarchical method it lies in the range of 5-11%. It is also shown that the hierarchical method gives smoother interpolation between fragments of music.

As a bonus, arithmetic can be done on the latent code vectors. For example, the authors built vectors for 5 musical characteristics (C diatonic membership, note density, average interval, 16th and 8th note syncopation). The vector of a characteristic can be added, with some weight, to sampled codes, and the sequence produced by the decoder will exhibit that characteristic to a greater or lesser degree, depending on the weight.
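The latent arithmetic can be illustrated with a small sketch. The text does not say how the characteristic vectors are built, so the construction here is an assumption: the direction is taken as the difference between the mean code of sequences that have the characteristic and the mean code of those that lack it. All codes below are made up.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def attribute_vector(codes_with, codes_without):
    """Assumed construction: mean code with the attribute minus mean code without it."""
    high, low = mean_vector(codes_with), mean_vector(codes_without)
    return [a - b for a, b in zip(high, low)]

def shift_code(z, direction, weight):
    """Move a latent code along the attribute direction before decoding."""
    return [zi + weight * di for zi, di in zip(z, direction)]

dense = [[1.0, 0.0], [3.0, 2.0]]   # made-up codes of sequences with high note density
sparse = [[0.0, 0.0], [0.0, 2.0]]  # made-up codes of sequences with low note density
direction = attribute_vector(dense, sparse)
z_shifted = shift_code([0.5, 0.5], direction, weight=0.5)
```

Decoding `z_shifted` instead of the original code would then be expected to give a sequence with more of the characteristic; a negative weight would push it the other way.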

3. Neural Autoregressive Flows

When training a VAE, it is important to choose the right form of $q(z | x)$, since the tightness of the lower bound on the likelihood depends on its expressiveness. A good family should have the following properties:

- It is flexible enough (ideally, it can approximate any distribution).
- A sample can be drawn from it quickly.
- The density of the resulting sample is easy to compute.
- The cost of these operations grows linearly with the dimension.

One approach to constructing such families is normalizing flows. They are based on the change of variables formula: if a random variable $\varepsilon$ is transformed with an invertible differentiable function $f$, the density of $f(\varepsilon)$ equals $p(\varepsilon) \left| \det J \right|^{-1}$, where $J$ is the Jacobian of $f$. By choosing a family of functions for which the determinant of the Jacobian can be computed efficiently, one can obtain complex distributions by transforming Gaussian noise.
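The change of variables formula is easy to check in one dimension. The sketch below uses the simplest possible flow, $f(\varepsilon) = a \varepsilon + b$ with standard normal $\varepsilon$ (an illustrative choice, not a model from the paper), for which the answer is known: the result is a normal distribution with mean $b$ and standard deviation $|a|$.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def flow_density(y, a, b):
    """Density of y = f(eps) = a * eps + b, eps ~ N(0, 1), via change of variables."""
    eps = (y - b) / a   # invert the transformation
    jacobian = a        # df/d(eps); a 1x1 "matrix" in this toy case
    return normal_pdf(eps, 0.0, 1.0) / abs(jacobian)

density = flow_density(y=2.0, a=3.0, b=1.0)
reference = normal_pdf(2.0, mean=1.0, std=3.0)
```

The two values agree up to floating-point error, which is exactly what the formula promises.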

In the previous approach (Inverse Autoregressive Flow, IAF), each element $\varepsilon_i$ of the vector $\varepsilon$ undergoes an affine transformation whose parameters are produced by an autoregressive neural network (for example, MADE or PixelCNN) that looks at the previous elements $\varepsilon_j$, $j < i$.

Due to the autoregressive network structure, the Jacobian of the resulting transformation is lower triangular, which means that its determinant can be easily calculated as the product of diagonal elements.
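The triangular structure can be verified numerically on a toy example. In the sketch below, `conditioner` is a hypothetical stand-in for the autoregressive network (a real one would be MADE or PixelCNN); it only has to depend on the preceding elements. The Jacobian is estimated with finite differences and compared with the sum of log-scales.

```python
import math

def conditioner(prefix):
    """Hypothetical stand-in for the autoregressive net: (mu_i, log sigma_i) from eps_{<i}."""
    s = sum(prefix)
    return math.tanh(s), 0.5 * math.sin(s)

def affine_autoregressive(eps):
    out, log_det = [], 0.0
    for i, e in enumerate(eps):
        mu, log_sigma = conditioner(eps[:i])
        out.append(mu + math.exp(log_sigma) * e)  # z_i = mu_i + sigma_i * eps_i
        log_det += log_sigma                      # diagonal entry dz_i/d(eps_i) = sigma_i
    return out, log_det

def numerical_jacobian(f, x, h=1e-6):
    base = f(x)
    columns = []
    for j in range(len(x)):
        bumped = list(x)
        bumped[j] += h
        columns.append([(b - a) / h for a, b in zip(base, f(bumped))])
    # J[i][j] = dz_i / d(eps_j)
    return [[columns[j][i] for j in range(len(x))] for i in range(len(x))]

eps = [0.3, -1.2, 0.7]
z, log_det = affine_autoregressive(eps)
J = numerical_jacobian(lambda v: affine_autoregressive(v)[0], eps)
diag_product = J[0][0] * J[1][1] * J[2][2]
```

The entries above the diagonal come out as zero, and `diag_product` matches `math.exp(log_det)` up to the finite-difference error, so the log-determinant costs only a sum over the elements.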

IAF is a powerful family, but because an affine transformation is used at every step, it cannot approximate multimodal distributions well.

Main ideas of the paper

- Instead of the affine transformation in IAF, use a miniature neural network with positive weights and strictly monotonic activations.

- The resulting normalizing flow is a universal approximator for distributions.

- Two architectures of the transforming network are proposed, and a mechanism for scaling to high dimensions using conditional batch normalization is described.

- Experiments show that NAF really can learn multimodal distributions.
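The first idea in the list above, a small network that is monotonic by construction, can be sketched as follows. This is a simplified illustration and not one of the architectures from the paper: it is a single hidden layer of sigmoids whose slopes are made positive with `exp` and whose output weights are made positive with a softmax. The parameter values are made up; in NAF they would be produced by the autoregressive network.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def monotone_transform(x, raw_slopes, biases, raw_weights):
    """Tiny one-hidden-layer network that is strictly increasing in x.

    Positive slopes come from exp(), positive mixing weights from a softmax,
    and the sigmoid activation is strictly monotonic.
    """
    slopes = [math.exp(r) for r in raw_slopes]
    norm = sum(math.exp(r) for r in raw_weights)
    weights = [math.exp(r) / norm for r in raw_weights]
    return sum(w * sigmoid(a * x + b) for w, a, b in zip(weights, slopes, biases))

params = dict(raw_slopes=[1.5, 1.5], biases=[-6.0, 6.0], raw_weights=[0.0, 0.0])
xs = [i / 10.0 for i in range(-40, 41)]
ys = [monotone_transform(x, **params) for x in xs]
```

With these parameters the curve has two separate steep regions with a plateau between them, so its derivative has two bumps. That is the kind of shape a single affine map cannot produce and the reason such transformers can represent multimodal densities.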

4. Noise2Noise: Learning Image Restoration without Clean Data

In one phrase: this is image restoration without GANs.

The classic approach to image restoration (obtaining a clean image from a noisy one) is to feed noisy images to the algorithm and train it to predict the clean image. But what if we have no clean images? That is the case for photos from telescopes, MRI scans and other similar tasks. The idea proposed in the paper is to train the network on noisy images only, using noisy images as targets as well. The one requirement is that the noise in the images is zero on average. The authors conclude that a network can learn to restore images by looking only at corrupted examples. On the datasets they consider, the results are not much worse, and sometimes even better, than the state of the art. In addition, the network runs and trains much faster.

The authors also tried working with video frames, but proper video restoration has not worked out yet: the restored image “jumps”.
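Why zero-mean noise is enough can be seen on a deliberately tiny example: the minimizer of a squared error is the mean of the targets, so noisy targets pull the estimate toward the clean value. The sketch below replaces the network with a single scalar and the image with a single made-up "pixel"; it illustrates the principle and is not the authors' setup.

```python
import random

def fit_constant(targets, steps=200, lr=0.1):
    """Gradient descent on the mean squared error between one scalar and the targets."""
    estimate = 0.0
    n = len(targets)
    for _ in range(steps):
        grad = sum(2.0 * (estimate - t) for t in targets) / n
        estimate -= lr * grad
    return estimate

rng = random.Random(0)
clean_value = 0.7                                    # the unknown clean "pixel"
noisy_targets = [clean_value + rng.gauss(0.0, 0.5)   # zero-mean noise on every target
                 for _ in range(5000)]
estimate = fit_constant(noisy_targets)
```

The estimate ends up close to the clean value even though no clean target was ever shown; with noise that is not zero-mean the same procedure would converge to a biased value.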

5. Approximate k-core graph decomposition

Although ICML is devoted to machine learning, it also features many good talks from related areas. For example, the section on distributed computing included an interesting algorithmic talk about graphs.

Suppose we want to solve some problem on a graph. The task is abstract enough to be described in terms of formal logic. For example, we are given a graph of user communications and want to detect anomalies in it (bot networks, fraud) or to identify communities of people.
For these and many other graph-related tasks in model training, extracting entities called $k$-cores works well.

Importantly, it is not only the approach to extracting these $k$-cores that is of interest, but also the ability to do it efficiently, in a distributed way and, if possible, on a stream. Implementations of exactly this kind were presented in the talk.

Definitions

Let $G = (V, E)$ be an undirected graph with $|V| = n$ vertices and $|E| = m$ edges, and let $H$ be a subgraph of $G$. For each vertex $v \in G$ we denote by $d(v)$ the degree of this vertex, and for $v \in H$ we denote by $d_H(v)$ the degree of $v$ in the subgraph $H$.

A $k$-core is a maximal subgraph $H \subseteq G$ such that $\forall v \in H$ we have $d_H(v) \geq k$; in other words, the degree of every vertex within $H$ is at least $k$. The largest $k$ for which the $k$-core is non-empty is called the degeneracy of the graph.

A vertex $v$ is said to have coreness number $C_G(v) = k$ if it belongs to the $k$-core but not to the $(k + 1)$-core.

A core labeling of a graph $G$ is the graph in which each vertex is labeled with its coreness number. Note that the core labeling is unique and defines a hierarchical decomposition of $G$.
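For a small graph, a core labeling can be computed exactly with the classic peeling procedure: repeatedly remove a vertex of minimum remaining degree and record the largest such degree seen so far. The sketch below is a plain sequential version for illustration, not the distributed algorithm from the talk; the graph is made up.

```python
def core_numbers(adjacency):
    """Coreness of every vertex by repeatedly peeling a vertex of minimum remaining degree."""
    degree = {v: len(neighbors) for v, neighbors in adjacency.items()}
    remaining = set(adjacency)
    coreness = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)  # simple O(n^2) selection; fine for a sketch
        k = max(k, degree[v])
        coreness[v] = k
        remaining.remove(v)
        for u in adjacency[v]:
            if u in remaining:
                degree[u] -= 1
    return coreness

# A triangle (a, b, c) with a pendant vertex d attached to c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
labels = core_numbers(graph)
```

Here the triangle forms the 2-core, so `a`, `b` and `c` get coreness 2, while the pendant vertex `d` belongs only to the 1-core.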

A $(1 - \varepsilon)$-approximate $k$-core is a subgraph $H$ of $G$ such that $\forall v \in H$: $d_H(v) \geq (1 - \varepsilon) k$. The figure below shows examples of a 3-core and a $2/3$-approximate 3-core.

The idea of the algorithm

The basic idea is to discard edges more aggressively in dense regions and less aggressively in sparse ones.

Brief description of the algorithm:

- sample the edges of the graph, keeping each edge with probability $p$ and discarding the rest;
- determine the coreness number of each vertex (that is, do core labeling) on the sampled graph, find the maximum $k$-core, and add all its vertices to the set $S$;
- delete from $G$ all edges whose both ends are in $S$;
- repeat, doubling $p$ each time, for as long as possible.
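The loop described by these steps can be sketched sequentially. This is a loose illustration, not the authors' algorithm: it assumes that $p$ is the probability with which an edge is kept in the sample (which is what a tiny starting value and repeated doubling suggest), it stops once $p$ exceeds 1, and it omits how coreness estimates are derived from the sampled graph. The function names and the example graph are made up.

```python
import random

def core_numbers(adjacency):
    """Exact coreness by peeling a vertex of minimum remaining degree."""
    degree = {v: len(nb) for v, nb in adjacency.items()}
    remaining, coreness, k = set(adjacency), {}, 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        coreness[v] = k
        remaining.remove(v)
        for u in adjacency[v]:
            if u in remaining:
                degree[u] -= 1
    return coreness

def sampled_core_rounds(edges, p0, rng):
    """Sample edges, core-label the sample, peel off the densest part, double p."""
    edges = set(edges)
    p, rounds = p0, []
    while edges and p <= 1.0:
        sample = [e for e in sorted(edges) if rng.random() < p]  # keep each edge w.p. p
        adjacency = {}
        for u, v in sample:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
        labels = core_numbers(adjacency)
        top = max(labels.values(), default=0)
        S = {v for v, c in labels.items() if c == top}           # vertices of the maximum core
        edges = {(u, v) for u, v in edges if not (u in S and v in S)}
        rounds.append((p, top, S))
        p *= 2.0
    return rounds

# A 6-clique (dense region) with a short path attached (sparse region).
clique = [(i, j) for i in range(6) for j in range(i + 1, 6)]
path = [(5, 6), (6, 7), (7, 8)]
rounds = sampled_core_rounds(clique + path, p0=0.25, rng=random.Random(0))
```

Each entry of `rounds` records the sampling probability, the largest coreness found in the sample, and the set $S$ removed in that round. Dense regions tend to be caught in early rounds, when few edges survive the sampling.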

Let us look at how this algorithm can be implemented in the MapReduce paradigm for parallel computing.

It is claimed that with the starting probability $p_0 = 12 \log(n) / (\varepsilon^2 n)$ the algorithm runs for at most $2 \log(n)$ iterations. In the first round, by discarding edges, we obtain a subgraph $H_0$. In the second round, core labeling of $H_0$ is done on each machine, and the vertices with the largest core number are saved in a set $S$. In the third round, the set $S$ is sent to all machines, which in parallel remove from $H_0$ the edges whose both ends are in $S$, producing $H_1$. Then the same is repeated with probability $p_1 = 2 p_0$, and so on, up to $2 \log(n)$ times.

Note

Let's return to community detection on graphs. Consider a graph of social connections: for example, one where vertices correspond to user pages on a social network and edges indicate that users have added each other as friends. For some subset of pages we also know where the user studies (city, university, faculty, specialty, group).

In this case, the hierarchy of $k$-cores can be interpreted as follows: the highest-order cores reflect the group in which the student studies, a rather narrow community but with a high density of connections. A core one order lower reflects the community of this user's specialty: there are fewer connections overall, but they are still fairly dense. Moving further along the hierarchy, we find the faculty community and then the university community.

Exploiting this idea in anti-spam analysis looks less obvious but is similar in essence. If we have an idea of which cores count as good, then features such as “the maximum order of a $k$-core the user belongs to”, “the density of the user's $k$-core”, and their ratios for neighboring entities may well help in detecting suspicious nodes and identifying botnet patterns.

Instead of an epilogue

Today we covered only five talks, but there were many more at ICML. If machine learning specialists find this format interesting, we could repeat it in the future.

Source: https://habr.com/ru/post/418421/
