
Machine learning - 3. Poisson random process: views and clicks

In the previous articles on the probabilistic description of site conversion, we treated the number of events (views and clicks) as a sample of a random variable, with no dependence on time. Now it is time to take the next step and bring time into the picture.


Random processes


A random process f(t) is, simplifying slightly, a time-dependent random variable. The set of values of f(t) over some time interval T is called a realization (or sample path) of the random process. For example, the number of page views per day is a discrete random process (a random sequence), for which both the argument (time) and the possible values of f(t) are discrete. Accordingly, a realization of such a process is the vector f(t_i). Two realizations of random processes are shown in the graph below (the calculations, like everything on my blog, were prepared in Mathcad Express, and you can take them here).

image
As examples we again use the data on views (the Views series, red dotted line) and clicks (Regs, multiplied by 100) for this blog on Habr in March 2015, which we discussed in detail here and here. In particular, in the second article we found that a few days after each article was published on Habr, the numbers of views and clicks settled at a roughly constant level of 100-200 views and 1-3 clicks per day. In other words, after a short non-stationary period the random process can be considered stationary (neglecting, of course, a weak downward trend and the dependence on the day of the week, which I hope to discuss in later articles when we get to detrending). The next chart shows the stationary “tail” of the realization of the number of views (end of March 2015).

image

As we have already seen, random processes can be classified by the nature of the argument and of the values (discrete or continuous). It is easy to see that four combinations are possible (for details, see Tikhonov's book Statistical Radio Engineering). An example of a discrete-valued process in continuous time is the number of site views (or clicks) accumulated since some moment (for example, since the article was published). The following graph shows a realization of such a process: the cumulative number of clicks over one day (March 22, 2015).

image

What can we say about the views/clicks model from this last chart? Evidently we have two sequences of random events (event A is a view, event B is a click on some link), which can occur at random moments. Accordingly, the random process b(t), whose realization is shown in the previous graph, is defined as the number of clicks from midnight on March 22 up to time t.

It is natural to assume that for any time interval (say, an hour or a day) the probability of an event (A or B) depends only on the length of the interval, not on where it starts (this property is called homogeneity of the random process in time). For example, if an article gets about 6 views per hour, then in a day it gets about 6 * 24 ≈ 140. Or, if there are on average 2 clicks per day, we can say that the average number of clicks per hour is 1/12. The histogram shows the spread of the daily number of views in the “tail”. The sample means are λ = 140 views and λ2 = 1.2 clicks (per day).

image
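The rate arithmetic above can be written out in a few lines. This is only a sketch: it assumes the flow is homogeneous in time, takes the sample means λ = 140 and λ2 = 1.2 from the text, and the function name expected_count is made up for the illustration.

```python
# Sample means quoted in the text (events per day).
VIEWS_PER_DAY = 140.0
CLICKS_PER_DAY = 1.2


def expected_count(rate_per_day: float, duration_days: float) -> float:
    """Expected number of events in an interval of the given length.

    Assumes a time-homogeneous flow: the expectation depends only on the
    length of the interval, not on where the interval starts.
    """
    return rate_per_day * duration_days


# Rescale the daily rates to one hour (1/24 of a day).
views_per_hour = expected_count(VIEWS_PER_DAY, 1 / 24)
clicks_per_hour = expected_count(CLICKS_PER_DAY, 1 / 24)

print(f"views per hour:  {views_per_hour:.2f}")
print(f"clicks per hour: {clicks_per_hour:.3f}")
```

With these numbers an hour brings a little under 6 views and 0.05 clicks on average.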

A few caveats are important here. First, this approach is purely probabilistic: we know in advance neither the exact number of clicks per day nor the moments at which they will occur. Second, we do still know something: about 140 people open the article per day, and 1-3 of them click on a certain place in it. Third, the data will of course contain a daily cycle (the article is read much more during the day than at night). For simplicity we neglect it for now, until we get to detrending. And fourth (attention!), we are not yet saying anything about what the probabilities of events A and B (views and clicks) are equal to.

Poisson event flow


As is known from probability theory, the model just formulated, called the Poisson random process or the Poisson flow of events, describes not only views and clicks but also a huge number of other real phenomena: telephone calls, equipment failures (there will be a separate article on that soon), maintenance requests, and so on. If we agree to take the sample function to be right-continuous, then, as the corresponding figure above shows, it is integer-valued and non-decreasing, growing only by integer jumps. Accordingly, the probability distribution of the number of clicks and views is discrete, as shown in the figure (for clicks, the distribution is drawn as red “sticks”):

image

In the same picture, the “bars” (which is easy to read but not quite correct, since the process is discrete) show the corresponding histogram of the number of clicks; we will discuss the meaning of the dotted curve a little later. The formula for the Poisson distribution is given on the graph (in the legend for the “sticks”). Thus (attention!), by computing the sample mean number of clicks λ2 = 1.2 (clicks per day) we get a forecasting tool, and by combining it with the conversion data (see the previous articles) we obtain an algorithm for calculating the number of visits needed to reach given click targets.
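To make the comparison of “sticks” and “bars” concrete, here is a small Python sketch (the blog's own calculations are in Mathcad, so this is a substitute, not the original worksheet). It assumes the standard Poisson formula P(k) = λ^k exp(-λ) / k! and imitates the flow by cutting a day into 1440 one-minute slots, each containing a click with probability λ2/1440. The slot count, the seed and the function names are choices made for the illustration.

```python
import math
import random
from collections import Counter

LAMBDA_CLICKS = 1.2  # sample mean from the text, clicks per day


def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def simulate_daily_count(lam, rng, slots=1440):
    """Count events in one day split into `slots` short intervals.

    Each interval independently contains an event with probability
    lam / slots, which approximates a Poisson flow when slots is large.
    """
    p = lam / slots
    return sum(1 for _ in range(slots) if rng.random() < p)


rng = random.Random(2015)  # fixed seed so that the run is repeatable
days = 1000
counts = Counter(simulate_daily_count(LAMBDA_CLICKS, rng) for _ in range(days))

# Theoretical "sticks" next to the simulated relative frequencies ("bars").
print(" k   Poisson  simulated")
for k in range(6):
    print(f"{k:2d}   {poisson_pmf(k, LAMBDA_CLICKS):.3f}    {counts[k] / days:.3f}")
```

The two columns should agree to within sampling noise. For example, the probability of a day without a single click is exp(-1.2), about 0.30.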

Another important quantity characterizing a Poisson process is the waiting time until the first event, τ. Obviously, τ is a random variable. It is known from probability theory that its distribution function is exponential and is given by F(t) = P(τ < t) = 1 - exp(-λt). Accordingly, the probability that a click (or view) has not yet occurred by time t is 1 - F(t) = exp(-λt). For clarity we plot this function on the graph, taking an hour (rather than a day) as the unit of time. The blue curve P(t) refers to views, and the red one to clicks.
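The same curves can be tabulated numerically. This sketch assumes the hourly rates are simply the daily sample means divided by 24 (that is, it ignores the daily cycle, as agreed above). The function names are invented for the example.

```python
import math

# Daily sample means from the text, converted to events per hour.
VIEWS_PER_HOUR = 140 / 24
CLICKS_PER_HOUR = 1.2 / 24


def prob_no_event_yet(t_hours, rate_per_hour):
    """1 - F(t) = exp(-lambda * t): no event has occurred by time t."""
    return math.exp(-rate_per_hour * t_hours)


def prob_first_event_before(t_hours, rate_per_hour):
    """F(t) = 1 - exp(-lambda * t): the first event occurred before t."""
    return 1.0 - prob_no_event_yet(t_hours, rate_per_hour)


print(" t, h   no view yet   no click yet")
for t in (0.25, 1, 6, 12, 24):
    print(f"{t:5.2f}   {prob_no_event_yet(t, VIEWS_PER_HOUR):11.4f}"
          f"   {prob_no_event_yet(t, CLICKS_PER_HOUR):12.4f}")
```

At t = 24 hours the value for clicks is exp(-1.2), about 0.30, which is the same number as the probability of zero clicks per day in the Poisson distribution.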



Accordingly, the probability density of the waiting time is the derivative of the distribution function:

p(t) = dF(t)/dt = λ exp(-λt)

We denote the density of the waiting-time distribution for views by the lowercase p(t) as a reminder of the connection between the distribution function and the probability density of a random variable.

As is easy to calculate, the mean waiting time until the first click is 1/λ2, and until the first view it is 1/λ. This gives a simple probabilistic meaning to the parameter λ: it is the average number of events per unit time, and its reciprocal is the mean waiting time until the first event.
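The 1/λ rule for the mean waiting time is easy to check by simulation. This is a sketch that assumes exponentially distributed waiting times with hourly rates obtained from the daily sample means. It uses random.expovariate from the Python standard library; the seed and sample size are arbitrary.

```python
import random

# Daily sample means from the text, converted to events per hour.
VIEWS_PER_HOUR = 140 / 24
CLICKS_PER_HOUR = 1.2 / 24


def mean_waiting_time(rate):
    """Analytic mean of an exponential waiting time with the given rate."""
    return 1.0 / rate


def simulated_mean_waiting_time(rate, n, rng):
    """Average of n simulated exponential waiting times."""
    return sum(rng.expovariate(rate) for _ in range(n)) / n


rng = random.Random(42)  # fixed seed so that the run is repeatable
for name, rate in (("views", VIEWS_PER_HOUR), ("clicks", CLICKS_PER_HOUR)):
    exact = mean_waiting_time(rate)
    approx = simulated_mean_waiting_time(rate, 100_000, rng)
    print(f"{name}: exact {exact:.2f} h, simulated {approx:.2f} h")
```

With the rates from the text, the mean wait is about 20 hours for a click and about 10 minutes for a view.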

References:
  1. Pytyev Yu.P., Shishmarev I.A. A Course in Probability Theory and Mathematical Statistics for Physicists. M.: MSU, 1983.
  2. Tikhonov V.I. Statistical Radio Engineering. M.: Soviet Radio, 1966.
  3. Kiryanov D.V., Kiryanova E.N. Computational Physics. M.: Polybuk Multimedia, 2005. §5. Random processes and fields.

Source: https://habr.com/ru/post/253755/

