
Histograms and box-and-whisker plots in plain terms

In this post I want to describe two types of graphs for one-dimensional data, namely the histogram and the box-and-whisker plot (span chart).



Consider an arbitrary sample of real numbers $X = (x_1, \ldots, x_N)$. We denote its order statistics by $x_{[k]}$, so that $x_{[1]} \leq \ldots \leq x_{[k]} \leq \ldots \leq x_{[N]}$.
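
As a minimal illustration of this notation, the order statistics are simply the sorted sample. The data and the helper name `order_stat` below are made up for the example.

```python
# Made-up sample; any list of real numbers works.
sample = [3.2, 1.5, 4.8, 1.5, 2.9]

# The order statistics x_[1] <= ... <= x_[N] are the sorted sample.
order_stats = sorted(sample)

def order_stat(k):
    """Return x_[k], using the 1-based index from the text."""
    return order_stats[k - 1]
```

With this sample, `order_stat(1)` is the minimum and `order_stat(len(sample))` is the maximum.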

Histogram


You most likely know this type of chart from school or university. It looks roughly like the picture below.
Histogram example

First, remember that the values of the input sample lie along the x axis, while the y axis shows how many times each value occurs (let's call these counts). A histogram lets you condense the data set and make it more compact without losing its distinctive features.
Important characteristics of the histogram are the following:

Columns


In most cases, the histogram is defined on the interval $I = [\min(X) - \varepsilon_1;\ \max(X) + \varepsilon_2]$, where $X$ is the initial sample and $\varepsilon_1, \varepsilon_2$ are auxiliary constants that round the endpoints to the nearest “readable” numbers. They depend on the scale in each particular case and, as a rule, are tied to powers of ten at the scale of the initial data. If you are curious how such cutoffs can be chosen, see the link: R (pretty)
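
The rounding to “readable” endpoints can be sketched as follows. This is a deliberately simplified illustration, not the algorithm behind R's `pretty`: it assumes the step is the largest power of ten not exceeding the data range, and the function name `readable_range` is hypothetical.

```python
import math

def readable_range(values):
    """Widen [min, max] outward to multiples of a power-of-ten step."""
    lo, hi = min(values), max(values)
    # Fall back to a non-zero span when all values are equal.
    span = (hi - lo) or abs(lo) or 1.0
    # Assumption: the step is the largest power of ten <= span.
    step = 10 ** math.floor(math.log10(span))
    return math.floor(lo / step) * step, math.ceil(hi / step) * step

print(readable_range([1.27, 8.57, 3.3]))  # (1, 9)
print(readable_range([12, 187]))          # (0, 200)
```

Here the difference between the returned endpoints and the actual minimum and maximum plays the role of $\varepsilon_1$ and $\varepsilon_2$.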

Also, histograms usually divide the interval $I$ into subintervals of equal length. Choosing the number of subintervals is an art, although several formulas can be given:

Here $n$ is the number of columns, $N$ is the size of the original sample, $\hat{\sigma}$ is an estimate of the standard deviation, and $IQR = x_{[3N/4]} - x_{[N/4]}$ is the interquartile range, which is discussed further below.
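
The quantities $N$, $\hat{\sigma}$ and $IQR$ are the inputs of several widely used bin-count rules. The sketch below assumes three well-known ones (Sturges, Scott, Freedman-Diaconis); whether these are exactly the formulas intended here is an assumption, and the function names are illustrative. Note that `statistics.quantiles` interpolates between order statistics, so its IQR can differ slightly from $x_{[3N/4]} - x_{[N/4]}$.

```python
import math
import statistics

def sturges_bins(n_points):
    """Sturges' rule: n = ceil(log2(N)) + 1."""
    return math.ceil(math.log2(n_points)) + 1

def scott_width(values):
    """Scott's rule for the bin width: h = 3.5 * sigma / N^(1/3)."""
    return 3.5 * statistics.stdev(values) / len(values) ** (1 / 3)

def freedman_diaconis_width(values):
    """Freedman-Diaconis rule for the bin width: h = 2 * IQR / N^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)

def bins_from_width(values, width):
    """Convert a bin width into a number of equal-width columns."""
    return max(1, math.ceil((max(values) - min(values)) / width))

data = list(range(1, 101))  # made-up sample of size N = 100
n_sturges = sturges_bins(len(data))
n_scott = bins_from_width(data, scott_width(data))
n_fd = bins_from_width(data, freedman_diaconis_width(data))
```

The rules rarely agree exactly, which is one reason the choice remains a judgment call.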

You can also note a few rules of common sense:

Otherwise, if the number of columns is excessive and the sample is small, the histogram will resemble a barcode, as in the figure below.

Barcode

Y axis


Histograms come in absolute values, when the y axis shows the number of elements of the original sample in each interval, and in relative values, when the column heights are normalized to sum to one. In the latter case the histogram is an estimate of the distribution density, and from the graph's point of view only the scale changes.
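
A minimal sketch of both variants, assuming equal-width intervals on a known range `[lo, hi]` that contains every value. The function name `histogram_counts` and the data are made up for the example.

```python
def histogram_counts(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # A value equal to hi is placed in the last interval.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

values = [0.5, 1.2, 1.9, 2.1, 2.4, 3.7, 3.9, 4.0]
absolute = histogram_counts(values, 0.0, 4.0, 4)   # [1, 2, 2, 3]
relative = [c / len(values) for c in absolute]     # [0.125, 0.25, 0.25, 0.375]
```

The two lists differ only by the constant factor `len(values)`, which is why only the scale of the y axis changes.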

Since a normalized histogram is a density estimate, we can sum the columns cumulatively and get an estimate of the distribution function: $s_i = \sum_{j=1}^{i} n_j / N$, where $n_j$ is the number of sample elements in the $j$-th interval. The following two graphs are plotted from the same data: on the left is the non-normalized histogram, on the right are the accumulated values of the normalized histogram.
Absolute-value histogram (left); empirical distribution function (right)
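
The cumulative sums $s_i$ take one line to compute from the column counts. The counts below are made up for illustration.

```python
from itertools import accumulate

counts = [1, 2, 2, 3]        # n_j: made-up column counts
total = sum(counts)          # N: sample size

# s_i = (n_1 + ... + n_i) / N
cumulative = [s / total for s in accumulate(counts)]
print(cumulative)  # [0.125, 0.375, 0.625, 1.0]
```

The last value is always one, as expected of a distribution function estimate.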

Grouping data


So far we have considered a single characteristic that we simply want to look at. It is usually much more interesting to compare the behavior of the same characteristic across different subgroups. In that case the histogram looks as follows.

Side-by-side histogram for three groups

Here the width of each column shrinks in proportion to the number of groups, and the columns of different groups are slightly shifted relative to each other. As an alternative, you can use a translucent overlay, which looks like this for the same data.
Overlaid histogram for three groups
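
For either layout the groups have to be counted on the same intervals, otherwise the columns are not comparable. A rough sketch, assuming the common range is taken over all groups together; the function name `shared_bin_counts` and the data are hypothetical.

```python
def shared_bin_counts(groups, n_bins):
    """Count each group's values on one common set of equal-width intervals."""
    all_values = [v for vals in groups.values() for v in vals]
    lo, hi = min(all_values), max(all_values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    counts = {}
    for name, vals in groups.items():
        row = [0] * n_bins
        for v in vals:
            # The overall maximum is placed in the last interval.
            row[min(int((v - lo) / width), n_bins - 1)] += 1
        counts[name] = row
    return counts, width

groups = {"a": [0, 1, 2], "b": [2, 3, 4], "c": [4]}
counts, bin_width = shared_bin_counts(groups, n_bins=2)

# In the side-by-side layout each group gets an equal share of the interval.
bar_width = bin_width / len(groups)
```

For the overlaid variant the same `counts` are drawn at full `bin_width` with transparency instead of being shifted.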

In summary


To draw a histogram you need to define

To draw a histogram for each group, the following values must be stored:

Span Chart


The “box with whiskers” has no officially established name. I cannot bring myself to call it a “box with whiskers”, especially when there are several boxes, while “span chart” is a less common but more elegant name. Here is an example with three boxes; to the left of each box the corresponding raw data values are displayed (they are not part of the span chart). Note first that in a span chart the characteristic of interest is plotted along the y axis, while the x axis is arbitrary and carries the grouping variable.

Span chart example

To draw a box for one group, you need to know just three characteristics of the source data:

Sometimes the following additional ones are added to the “mandatory” set:

Thus, a box with whiskers looks like this in cross-section.

Box with whiskers in cross-section

Some points require clarification. The box, that is, the object between $Q_{25}$ and $Q_{75}$, is bounded by these values almost everywhere, but the whiskers can differ between tools, so if you really care about the numbers you need to check what is meant in each particular case. The most important parameter is the whisker length: we assume it is $1.5 \cdot IQR = 1.5\,(Q_{75} - Q_{25})$.
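
A minimal sketch of the numbers behind one box, under two assumptions that are common but not universal: quartiles are computed by linear interpolation (`method="inclusive"`), and each whisker ends at the most extreme data point within $1.5 \cdot IQR$ of the box. The function name `box_stats` and the data are made up.

```python
import statistics

def box_stats(values, whisker_coef=1.5):
    """Quartiles, whisker ends and points beyond the whiskers for one group."""
    data = sorted(values)
    q25, median, q75 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q75 - q25
    if whisker_coef == 0:
        # Convention used by some tools: zero means "whiskers span all the data".
        lo_fence, hi_fence = data[0], data[-1]
    else:
        lo_fence = q25 - whisker_coef * iqr
        hi_fence = q75 + whisker_coef * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    return {
        "q25": q25, "median": median, "q75": q75,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lo_fence or v > hi_fence],
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

For this sample the value 30 lies beyond the upper whisker and would be drawn as a separate point.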

The minimum and maximum marks are often omitted. Extreme points, that is, those lying beyond the whiskers, are either omitted as well or drawn as dots or asterisks. Depending on the structure of the data, drawing extreme values can significantly increase the amount of data needed to draw a span chart.

The magic number 1.5 appeared in Tukey's Exploratory Data Analysis (1977), and the reason for it is not very clear. Nothing has changed since then: many tools offer it as the default value but allow an arbitrary one, down to zero, in which case the whiskers cover the entire range from the minimum to the maximum of the original data.

There is a conjecture that 1.5 arose as follows. The total span between the whisker ends is $4 \cdot IQR$. For symmetric distributions $IQR/2$ coincides with the median absolute deviation (MAD), which in turn gives an estimate of the standard deviation with coefficient $1.48$, that is, $\hat{\sigma} \approx 1.48 \cdot MAD$. Hence $4 \cdot IQR \approx 8 \cdot MAD \approx 16/3 \cdot \hat{\sigma} \approx 6 \cdot \hat{\sigma}$, which gives the familiar three sigmas to the left and three sigmas to the right.
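
The constants in this argument can be checked numerically for the normal distribution. This is only a sanity check under the assumption of normality, using `statistics.NormalDist` from the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

# 75th percentile of the standard normal distribution (about 0.6745).
z75 = NormalDist().inv_cdf(0.75)

iqr_in_sigmas = 2 * z75       # IQR of a normal sample, in units of sigma
mad_in_sigmas = z75           # MAD = IQR / 2 for a symmetric distribution
sigma_per_mad = 1 / z75       # the coefficient quoted as 1.48

# Span between the whisker ends: 1.5*IQR + IQR + 1.5*IQR = 4*IQR.
whisker_span_in_sigmas = 4 * iqr_in_sigmas

print(round(sigma_per_mad, 2), round(whisker_span_in_sigmas, 2))
```

The span comes out at roughly 5.4 standard deviations, so “six sigmas” is a generous rounding.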
Sometimes the interval $[Q_{5}, Q_{95}]$ is suggested for the whisker ends. In this case there will obviously always be points outside the interval (if the sample has more than 20 points), so with this approach they are usually ignored.

In summary


To draw a “span chart” you need to define:

To draw a “box with whiskers” for one group, only 3 numbers are required.

Source: https://habr.com/ru/post/267123/

