
The book "Probabilistic programming in Python: Bayesian inference and algorithms"

Hi, Habrozhiteli! Bayesian methods scare many IT people off with their formulas, but nowadays it is impossible to get by without analyzing statistics and probabilities. Cameron Davidson-Pilon describes the Bayesian method from the point of view of a practicing programmer, working with the versatile PyMC library and with NumPy, SciPy and Matplotlib. Exploring the role of Bayesian inference in A/B testing, fraud detection and other pressing tasks, you will not only come to grips with this non-trivial topic easily, but also begin applying the knowledge to your own goals.

Excerpt: 4.3.3. Example: Sorting comments on Reddit


Whether you agree or not, everyone implicitly uses the law of large numbers, even in subconscious decision-making. Consider online product ratings: do you often trust an average five-star rating based on a single review? Two reviews? Three? You subconsciously understand that with so few reviews, the average rating is a poor reflection of how good or bad the product really is.


As a consequence, sorting products, and comparing them in general, is flawed. Many shoppers understand that sorting online search results by rating is not very objective, whether the results are books, videos or comments on the Internet. Often the first-place films or comments earn their top marks only thanks to a small number of enthusiastic fans, while genuinely good films or comments are hidden on later pages with supposedly imperfect ratings of around 4.8. What can be done about this?
Consider the popular site Reddit (I deliberately do not give a link to it, because Reddit is notoriously addictive, and I am afraid you would never return to my book). The site hosts many links to different stories and pictures, and the comments on those links are also very popular. Users of the site (usually called redditors) can vote for or against each comment (so-called upvotes and downvotes). By default, Reddit sorts the comments on a page from best to worst. How do you determine which comments are the best? Several metrics are commonly used; a short sketch comparing them in code follows the list below.

1. Popularity. A comment is considered good if many upvotes are cast for it. The problem with this model arises for a comment with hundreds of upvotes and thousands of downvotes. Although highly popular, such a comment is far too ambiguous to be considered "best."

2. Difference. You can use the difference between the number of upvotes and downvotes. This fixes the problem with the popularity metric, but fails to account for the temporal nature of comments. Comments can be posted many hours after the original link is published. This introduces a bias under which the top rating goes not to the best comments but to the oldest ones, which have had time to accumulate more upvotes than newer ones.

3. Time adjustment. Consider a method that divides the difference between upvotes and downvotes by the age of the comment, producing a rate, for example the difference per second or per minute. A counterexample comes to mind immediately: with the "per second" variant, a comment left one second ago with one upvote will beat a comment left 100 seconds ago with 99 upvotes. This problem can be avoided by considering only comments left at least t seconds ago. But how do you choose a good value of t? Does that mean all comments posted within the last t seconds are bad? This ends in comparing unstable quantities with stable ones (new comments with old ones).

4. Ratio. Rank comments by the ratio of upvotes to the total number of votes, up and down. This removes the temporality problem: recently posted comments with good ratings will rank as high as comments posted long ago, provided they have a relatively high upvote ratio. The problem with this method is that a comment with one upvote (ratio = 1.0) will beat a comment with 999 upvotes and one downvote (ratio = 0.999), even though the second comment is more likely to be the better one.
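To make these failure modes concrete, here is a minimal sketch (mine, not the book's; the comments and their vote counts are invented) that scores a few hypothetical comments with all four naive metrics:

# Hypothetical (name, upvotes, downvotes, age-in-seconds) records.
comments = [
    ("contested", 5000, 20000,  3600),  # popular but heavily downvoted
    ("old",        100,    20, 86400),  # had a day to accumulate votes
    ("fresh",        1,     0,     1),  # one second old, a single upvote
    ("strong",     999,     1,  7200),  # overwhelmingly upvoted
]

for name, up, down, age in comments:
    popularity = up                       # metric 1: raw upvotes
    difference = up - down                # metric 2: upvotes minus downvotes
    rate = difference / float(age)        # metric 3: difference per second
    ratio = up / float(up + down)         # metric 4: observed upvote ratio
    print("%-10s pop=%5d diff=%7d rate=%9.3f ratio=%.3f"
          % (name, popularity, difference, rate, ratio))

Each metric picks a different "best" comment: popularity crowns "contested" even though 80% of its votes are downvotes, rate crowns the one-second-old "fresh" (1.000 vs. 0.139 for "strong"), ratio also prefers "fresh" (1.000) to "strong" (0.999), and difference systematically favors comments that, like "old", have simply had more time to collect votes.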

I wrote "more likely" for a reason. It is possible that the first comment, with its single upvote, really is better than the second one with 999 upvotes. We hesitate to agree with this because we do not know what the 999 potential next votes on the first comment might have been. It could, for example, have gone on to receive another 999 upvotes and not a single downvote, and ended up better than the second, although such a scenario is not very likely.

What we really need is an estimate of the true upvote ratio. Note that this is not at all the same as the observed upvote ratio: the true upvote ratio is hidden; all we observe are the numbers of upvotes and downvotes (the true upvote ratio can be interpreted as the probability that a voter casts an upvote rather than a downvote). Thanks to the law of large numbers, we can safely say that for a comment with 999 upvotes and one downvote, the true upvote ratio is likely to be close to 1. On the other hand, we are far less sure what the true upvote ratio is for a comment with a single upvote. This looks like a Bayesian problem.
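A quick simulation (my sketch, not the book's; p_true and the seed are arbitrary) shows the law of large numbers at work: the observed upvote ratio approaches the hidden true one only as the number of votes grows.

import numpy as np

np.random.seed(42)  # arbitrary seed, for reproducibility
p_true = 0.9        # hypothetical hidden probability of an upvote

for n in (1, 10, 100, 10000):
    simulated_votes = np.random.rand(n) < p_true  # each voter upvotes with prob. p_true
    print("%5d votes: observed ratio %.3f" % (n, simulated_votes.mean()))

With a single vote the observed ratio can only be 0.0 or 1.0, whatever p_true is; only with thousands of votes does it settle near 0.9.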

One way to determine a prior distribution for the upvote ratio is to study the historical distribution of upvote ratios. This could be done by scraping Reddit comments and fitting a distribution to them. However, this method has several drawbacks.

1. Skewed data. The vast majority of comments have very few votes, so the ratios of many comments will sit near the extremes (see the "triangular" plot in the Kaggle dataset example in Fig. 4.4), and the distribution will be heavily skewed. One could consider only comments whose vote count exceeds some threshold, but this brings its own difficulties: there is a trade-off between the number of comments available and the accuracy gained from a higher threshold.

2. Biased data (containing a systematic error). Reddit consists of many sub-forums (subreddits). Two examples: r/aww with photos of cute animals, and r/politics. It is more than likely that commenting behavior differs radically between these two subreddits: in the first, visitors are likely to be charmed and friendly, producing far more upvotes, compared with the second, where comments are likely to be contentious.

In light of the above, I think it makes most sense to use a uniform prior.

Now we can compute the posterior distribution of the true upvote ratio. The top_pic_comments.py script scrapes the comments from the currently most popular Reddit image. In the code below, we scraped the Reddit comments for the image [3]: http://i.imgur.com/OYsHKlH.jpg.

from IPython.core.display import Image
# Adding a number to the end of the %run call
# fetches the i-th most popular image.
%run top_pic_comments.py 2

[Output]:

Title of submission:
Frozen mining truck
http://i.imgur.com/OYsHKlH.jpg

 """ Contents:       Votes:   NumPy  ""  ""    """ n_comments = len(contents) comments = np.random.randint(n_comments, size=4) print "  (    %d) \n -----------"%n_comments for i in comments: print '"' + contents[i] + '"' print " ""/"": ",votes[i,:] print 

[Output]:

Some comments (out of 77 total)
-----------
"Do these trucks remind anyone else of Sly Cooper?"
upvotes/downvotes:  [2 0]
"Dammit Elsa I told you not to drink and drive."
upvotes/downvotes:  [7 0]
"I've seen this picture before in a Duratray (the dump box supplier) brochure..."
upvotes/downvotes:  [2 0]
"Actually it does not look frozen just covered in a layer of wind packed snow."
upvotes/downvotes:  [120 18]

Given N votes and a true upvote ratio p, the number of upvotes looks like a binomial random variable with parameters p and N (this is because the true upvote ratio is equivalent to the probability of casting an upvote rather than a downvote, with N possible votes/trials). We create a function that performs Bayesian inference on p for the upvote/downvote pair of a particular comment.

import pymc as pm

def posterior_upvote_ratio(upvotes, downvotes, samples=20000):
    """
    Takes the number of upvotes and downvotes a particular comment
    received, and the number of posterior samples to return to the
    user. Assumes a uniform prior on the upvote ratio.
    """
    N = upvotes + downvotes
    upvote_ratio = pm.Uniform("upvote_ratio", 0, 1)
    observations = pm.Binomial("obs", N, upvote_ratio,
                               value=upvotes, observed=True)
    # Fit the model; do a MAP first, which is cheap and helps
    # the chain avoid a long burn-in period.
    map_ = pm.MAP([upvote_ratio, observations]).fit()
    mcmc = pm.MCMC([upvote_ratio, observations])
    mcmc.sample(samples, samples / 4)
    return mcmc.trace("upvote_ratio")[:]

The following are the resulting posterior distributions.

figsize(11., 8)
posteriors = []
colors = ["#348ABD", "#A60628", "#7A68A6", "#467821", "#CF4457"]
for i in range(len(comments)):
    j = comments[i]
    label = '(%d up:%d down)\n%s...' % (votes[j, 0], votes[j, 1],
                                        contents[j][:50])
    posteriors.append(posterior_upvote_ratio(votes[j, 0], votes[j, 1]))
    plt.hist(posteriors[i], bins=18, normed=True, alpha=.9,
             histtype="step", color=colors[i % 5], lw=3, label=label)
    plt.hist(posteriors[i], bins=18, normed=True, alpha=.2,
             histtype="stepfilled", color=colors[i], lw=3)

plt.legend(loc="upper left")
plt.xlim(0, 1)
plt.ylabel("Density")
plt.xlabel("Probability of an upvote")
plt.title("Posterior distributions of upvote ratios\nfor different comments");

 [Output]: [****************100%******************] 20000 of 20000 complete 

As Fig. 4.5 shows, some distributions are tightly "compressed" while others have relatively long "tails," expressing that we are not at all sure what the true upvote ratio is.

Fig. 4.5. Posterior distributions of upvote ratios for different comments

4.3.4. Sorting


So far we have ignored the main goal of our example: sorting comments from best to worst. Of course, distributions cannot be sorted; we can only sort scalar values. There are many ways to distill a distribution down to a scalar; for example, one can summarize a distribution by its expected (mean) value. But the mean is a poor choice here, because it ignores the uncertainty in the distribution.
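To see why, here is a minimal sketch (my own; the vote counts are hypothetical) of two posteriors with the same mean but very different uncertainty. Since the uniform prior is conjugate to the binomial likelihood, these posteriors are exact Beta distributions; the mean cannot distinguish them, while a lower percentile can.

from scipy import stats

# Two comments with the same observed ratio (0.5) but different vote
# counts, under a uniform (Beta(1, 1)) prior:
confident = stats.beta(1 + 49, 1 + 49)   # 49 up / 49 down
uncertain = stats.beta(1 + 1, 1 + 1)     # 1 up / 1 down

print("%.3f %.3f" % (confident.mean(), uncertain.mean()))      # both 0.500
print("%.3f %.3f" % (confident.ppf(0.05), uncertain.ppf(0.05)))  # 5th percentiles differ sharply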

I would recommend using the 95% least plausible value, defined as the value such that there is only a 5% probability that the true value of the parameter lies below it (compare this with the lower bound of a Bayesian credible interval). Next, we plot the posterior distributions with the 95% least plausible values marked (Fig. 4.6).

Fig. 4.6. Posterior distributions of upvote ratios for different comments; vertical lines mark the 95% least plausible values
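In terms of posterior samples, the 95% least plausible value is simply the 5th percentile. A one-function sketch of mine, equivalent to the sort-and-index trick in the plotting code below:

import numpy as np

def least_plausible_value(samples, level=0.95):
    # Value that the true parameter exceeds with posterior probability `level`.
    return np.percentile(samples, 100 * (1 - level))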

N = posteriors[0].shape[0]
lower_limits = []

for i in range(len(comments)):
    j = comments[i]
    label = '(%d up:%d down)\n%s...' % (votes[j, 0], votes[j, 1],
                                        contents[j][:50])
    plt.hist(posteriors[i], bins=20, normed=True, alpha=.9,
             histtype="step", color=colors[i], lw=3, label=label)
    plt.hist(posteriors[i], bins=20, normed=True, alpha=.2,
             histtype="stepfilled", color=colors[i], lw=3)
    v = np.sort(posteriors[i])[int(0.05 * N)]
    plt.vlines(v, 0, 10, color=colors[i], linestyles="--", linewidths=3)
    lower_limits.append(v)

plt.legend(loc="upper left")
plt.ylabel("Density")
plt.xlabel("Probability of an upvote")
plt.title("Posterior distributions of upvote ratios\nfor different comments");
order = np.argsort(-np.array(lower_limits))
print order, lower_limits

 [Output]: [3 1 2 0] [0.36980613417267094, 0.68407203257290061, 0.37551825562169117, 0.8177566237850703] 

According to our procedure, the best comments are those with the highest probability of achieving a high upvote ratio. Visually, they are the comments whose 95% least plausible value lies closest to 1. In Fig. 4.6 the 95% least plausible values are shown as vertical lines.

Why is sorting by this metric such a good idea? Ordering by the 95% least plausible value means being maximally cautious in declaring comments the best: even in the worst case, when we have badly overestimated the upvote ratio, the best comments are guaranteed to rise to the top. This ordering provides the following very natural properties (a closed-form cross-check follows the list).

1. Of two comments with the same observed upvote ratio, the comment with more votes is recognized as better (since we can be more confident that its high ratio is real).

2. Of two comments with the same total number of votes, the one with more upvotes is considered better.
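A closing side note of mine, not from the excerpt: because a Uniform(0, 1) prior is a Beta(1, 1) distribution and the likelihood is binomial, the posterior is exactly Beta(1 + upvotes, 1 + downvotes) by conjugacy. The 95% least plausible value can therefore be computed in closed form, with no MCMC, and it closely reproduces the sampled lower bounds printed above:

from scipy import stats

def exact_lpv(upvotes, downvotes, q=0.05):
    # Exact 5th percentile of the Beta(1 + up, 1 + down) posterior.
    return stats.beta.ppf(q, 1 + upvotes, 1 + downvotes)

# The four scraped comments above had (up, down) = (2, 0), (7, 0), (2, 0), (120, 18).
for up, down in [(2, 0), (7, 0), (2, 0), (120, 18)]:
    print("%3d %3d %.4f" % (up, down, exact_lpv(up, down)))

For (2, 0) this gives 0.05 ** (1.0 / 3) ≈ 0.368, close to the MCMC estimate of 0.3698 above; the same conjugacy is what makes fast, sampler-free sorting feasible.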

» More information about the book is available on the publisher's website
» Table of contents
» Excerpt

For Habrozhiteli, a 25% discount with this coupon: javascript

When you pay for the paper version of the book, an e-book is sent to your e-mail.

Source: https://habr.com/ru/post/456562/

