Histograms and box plots explained simply

In this post I want to describe two types of charts for one-dimensional data, namely

histogram
box plot (box-and-whisker plot)

Consider an arbitrary sample of real numbers $X = (x_1, \ldots, x_N)$. We denote by $x_{[k]}$ its order statistics, so that $x_{[1]} \leq \ldots \leq x_{[k]} \leq \ldots \leq x_{[N]}$.

Histogram

You most likely know this type of chart from a school or university course; it looks roughly like the picture below.
Histogram example

First of all, remember that the values of the input sample are laid out along the x axis, while the y axis shows how many times each value occurs (let's call these counts). A histogram condenses the data set into a more compact form without losing its characteristic shape.
The important characteristics of a histogram are the following:

number of columns (also called bins or bars)
absolute or density scale on the y axis
how the data is grouped

Columns

In most cases, the histogram is defined on the segment $I = [\min(X) - \varepsilon_1; \max(X) + \varepsilon_2]$, where $X$ is the initial sample and $\varepsilon_1, \varepsilon_2$ are auxiliary constants that round the ends to the nearest “readable” numbers. These depend on the scale in each case and are, as a rule, divisors of ten at the scale of the initial data. If you are curious how such cutoffs can be chosen, see the R function pretty.
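The rounding of the segment ends can be sketched as follows. This is a simplified illustration and not the algorithm of R's pretty: the function names `nice_step` and `nice_bounds` and the `target_ticks` parameter are hypothetical, and the step is assumed to be 1, 2 or 5 times a power of ten.

```python
import math


def nice_step(span, target_ticks=5):
    """Pick a "readable" step: 1, 2 or 5 times a power of ten (assumption)."""
    raw = span / target_ticks
    power = 10 ** math.floor(math.log10(raw))
    for multiplier in (1, 2, 5):
        if multiplier * power >= raw:
            return multiplier * power
    return 10 * power


def nice_bounds(lo, hi, target_ticks=5):
    """Round [lo, hi] outward to multiples of a readable step."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    step = nice_step(hi - lo, target_ticks)
    return math.floor(lo / step) * step, math.ceil(hi / step) * step
```

For example, `nice_bounds(3.2, 97.5)` widens the segment to multiples of 20, so $\varepsilon_1 = 3.2$ and $\varepsilon_2 = 2.5$ in the notation above.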

Histograms also usually divide the segment $I$ into subsegments of equal length. Choosing the number of subsegments is something of an art, although several formulas can be given:

Sturges' rule (not the photographer): $n = 1 + \log_2 N$
Scott's rule: $h = 3.5 \cdot \hat{\sigma} \cdot N^{-1/3}$
Freedman-Diaconis rule: $h = 2 \cdot IQR \cdot N^{-1/3}$

Here $n$ is the number of columns, $h$ is the column width (the last two rules are usually stated as widths, so the number of columns is $n = \lceil (\max(X) - \min(X)) / h \rceil$), $N$ is the size of the original sample, $\hat{\sigma}$ is an estimate of the standard deviation, and $IQR = x_{[3N/4]} - x_{[N/4]}$ is the interquartile range, discussed further below.
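The three rules can be compared with a short sketch. It reads Scott's and the Freedman-Diaconis formulas as column widths (their usual form) and converts the width into a number of columns; rounding up and the helper names are assumptions made for this illustration.

```python
import math
import statistics


def order_quantile(sorted_xs, q):
    """Order statistic x_[ceil(q * N)] (1-based); rounding up is an assumption."""
    n = len(sorted_xs)
    k = min(n, max(1, math.ceil(q * n)))
    return sorted_xs[k - 1]


def bins_from_width(xs, h):
    """Number of equal-width columns of width h needed to cover [min, max]."""
    if h <= 0:
        return 1  # degenerate sample: all values (nearly) identical
    return max(1, math.ceil((max(xs) - min(xs)) / h))


def sturges_bins(xs):
    return 1 + math.ceil(math.log2(len(xs)))


def scott_bins(xs):
    h = 3.5 * statistics.stdev(xs) * len(xs) ** (-1 / 3)
    return bins_from_width(xs, h)


def freedman_diaconis_bins(xs):
    s = sorted(xs)
    iqr = order_quantile(s, 0.75) - order_quantile(s, 0.25)
    h = 2 * iqr * len(xs) ** (-1 / 3)
    return bins_from_width(xs, h)
```

For the sample 1, 2, ..., 100 this gives 8 columns by Sturges' rule and 5 columns by each of the other two.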

A few common-sense rules are also worth noting:

most columns should contain more than one source value
each column of the histogram needs at least one pixel of width, and in general a limit of “no more than 200 columns” is quite common

Otherwise, if there are too many columns and too few data points, the histogram will resemble a barcode, as in the figure below.

Barcode

Y axis

A histogram is absolute when the y axis shows the number of elements of the original sample that fall into each interval, and relative when the column heights are normalized to sum to one. In the relative case the histogram is an estimate of the distribution density (up to the constant column width); as far as the chart is concerned, only the scale changes.

Since a normalized histogram is a density estimate, we can accumulate the columns and obtain an estimate of the cumulative distribution function: $s_i = \sum_{j=1}^{i} n_j / N$, where $n_j$ is the number of elements in column $j$. The following two charts are plotted from the same data: on the left is the non-normalized histogram, on the right the accumulated values of the normalized histogram.
Absolute Values Histogram
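The accumulated values $s_i$ can be computed directly from the column counts. A minimal sketch, assuming the counts are already available as a list (the function name is hypothetical):

```python
from itertools import accumulate


def cumulative_fractions(counts):
    """s_i = (n_1 + ... + n_i) / N for column counts n_1..n_n."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]
```

The last value is always 1, since every element of the sample falls into some column.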

Grouping data

So far we have considered the case of a single characteristic that we just want to look at. It is usually much more interesting to compare the behavior of the same characteristic across different subgroups. In that case the histogram looks as follows.

Three-group side-by-side histogram

Here the width of each column shrinks in proportion to the number of groups, and the columns of different groups are shifted slightly relative to each other. As an alternative, you can use a translucent overlay, which looks like this for the same data.
Three-group overlap histogram
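The side-by-side layout described above can be sketched as a small geometry helper. It assumes that all groups share the same column boundaries and that the narrow bars fill the column without gaps; the function name is hypothetical.

```python
def dodged_bars(edges, n_groups):
    """For each column, return the (left, right) x-extent of each group's narrow bar."""
    bars = []
    for left, right in zip(edges, edges[1:]):
        width = (right - left) / n_groups  # each group gets an equal share of the column
        bars.append(
            [(left + g * width, left + (g + 1) * width) for g in range(n_groups)]
        )
    return bars
```

With the translucent overlay no such shifting is needed: every group is drawn at full column width and only the opacity changes.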

In summary

To draw a histogram you need to decide:

the number of columns
whether to normalize and accumulate the data
how to display different groups

To draw a histogram, the following values must be stored for each group:

$n + 1$ column boundary values, where the first value is the $x$-coordinate of the left border of the leftmost column and the last is the $x$-coordinate of the right border of the rightmost column
$n$ values: the number of elements that fall into each column
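The stored representation above ($n + 1$ boundaries and $n$ counts) can be produced with a short sketch. It assumes equal-width columns and puts a value equal to the right end into the last column; the function name and the optional `lo`/`hi` parameters are hypothetical.

```python
def histogram(xs, n_bins, lo=None, hi=None):
    """Return (edges, counts): n_bins + 1 boundaries and n_bins counts."""
    lo = min(xs) if lo is None else lo
    hi = max(xs) if hi is None else hi
    if hi <= lo:
        raise ValueError("need hi > lo to build columns")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins)] + [hi]
    counts = [0] * n_bins
    for x in xs:
        if lo <= x <= hi:
            # the right end of the segment belongs to the last column
            i = min(int((x - lo) / width), n_bins - 1)
            counts[i] += 1
    return edges, counts
```

Passing the same `lo`, `hi` and `n_bins` for every group keeps the column boundaries identical across groups, which is what makes the groups comparable.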

Box plot

The “box with whiskers” has no officially established name in Russian, and I cannot bring myself to call it that, especially when there are several boxes; “span chart” is a less common but more euphonious name, and in English the usual term is box plot. Here is an example with three boxes; the corresponding values of the initial data are shown to the left (they are not part of the box plot itself). First of all, note that in a box plot the characteristic being studied is plotted along the Y axis, while the X axis is arbitrary and holds the grouping variable.

Box plot example

To draw a box for one group, you need to know just three characteristics of the source data:

First quartile $Q_{25} = x_{[N/4]}$
Median $Q_{50} = x_{[N/2]}$
Third quartile $Q_{75} = x_{[3N/4]}$

Sometimes the following additional characteristics are added to the “mandatory” set:

Minimum $Min = x_{[1]}$
Maximum $Max = x_{[N]}$
5th percentile $Q_{5}$
95th percentile $Q_{95}$
Set of extreme values: $x < Q_{25} - 1.5 \cdot IQR$ or $x > Q_{75} + 1.5 \cdot IQR$
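The characteristics listed above can be collected in one pass over the sorted sample. This is a minimal sketch: quantiles are taken as order statistics with the index rounded up (an assumption, since $N/4$ need not be an integer), and the names `order_quantile` and `box_stats` are hypothetical.

```python
import math


def order_quantile(sorted_xs, q):
    """Order statistic x_[ceil(q * N)] (1-based); rounding up is an assumption."""
    n = len(sorted_xs)
    k = min(n, max(1, math.ceil(q * n)))
    return sorted_xs[k - 1]


def box_stats(xs, whisker=1.5):
    """Quartiles, median, min/max and the points outside the whisker * IQR fences."""
    s = sorted(xs)
    q25, q50, q75 = (order_quantile(s, q) for q in (0.25, 0.5, 0.75))
    iqr = q75 - q25
    low_fence = q25 - whisker * iqr
    high_fence = q75 + whisker * iqr
    return {
        "q25": q25,
        "median": q50,
        "q75": q75,
        "min": s[0],
        "max": s[-1],
        "extremes": [x for x in s if x < low_fence or x > high_fence],
    }
```

For the sample 1, 2, ..., 11, 100 the quartiles are 3, 6 and 9, the fences are -6 and 18, and the only extreme value is 100.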

Thus, a box with whiskers shown in cross-section looks like this.

Box with whiskers in cross-section

Some points require clarification. The box, that is, the object between $Q_{25}$ and $Q_{75}$, is bounded by these values almost everywhere, but the whiskers can differ, and if you really care about the numbers, you need to check what is meant in each individual case. The most important parameter is the whisker length: we assume it is $1.5 \cdot IQR = 1.5 (Q_{75} - Q_{25})$.

The minimum and maximum marks are often omitted. Extreme points, that is, those outside the whiskers, are either omitted as well or drawn as dots or asterisks. Depending on the structure of the data, drawing the extreme values can significantly increase the amount of data needed to draw a box plot.

The magic number $1.5$ appeared in Tukey's Exploratory Data Analysis (1977), and the reason for it is not entirely clear. Nothing has changed since then: many tools offer it as the default value but allow an arbitrary one, down to zero, in which case the whiskers cover the entire segment from the minimum to the maximum of the original data.

There is a conjecture that $1.5$ arose as follows. The total span between the whisker ends is $4 \cdot IQR$ (the box itself plus $1.5 \cdot IQR$ on each side). For symmetric distributions $IQR / 2$ coincides with the median absolute deviation (MAD), which in turn estimates the standard deviation with a coefficient of $1.48$, that is, $\hat{\sigma} \approx 1.48 \cdot MAD$. Hence $4 \cdot IQR = 8 \cdot MAD \approx 16/3 \cdot \hat{\sigma} \approx 6 \cdot \hat{\sigma}$, which gives, roughly, the familiar three sigmas to the left and three sigmas to the right.
Sometimes the interval $[Q_{5}, Q_{95}]$ is proposed for the whisker ends. In this case there will obviously always be points that fall outside the interval (if the sample has more than 20 points), so with this approach they are usually not drawn.
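The three-sigma argument can be checked numerically for a normal distribution. The sketch below assumes a standard normal population (mean 0, sigma 1) and uses only the Python standard library; the function name and dictionary keys are hypothetical.

```python
from statistics import NormalDist


def normal_whisker_check(whisker=1.5):
    """Express the whisker fences of a standard normal distribution in sigmas."""
    std_normal = NormalDist()            # mean 0, sigma 1
    q75 = std_normal.inv_cdf(0.75)       # third quartile; Q25 = -q75 by symmetry
    iqr = 2 * q75
    mad = q75                            # MAD = IQR / 2 for a symmetric distribution
    upper_fence = q75 + whisker * iqr    # Q75 + 1.5 * IQR
    return {
        "sigma_over_mad": 1 / mad,
        "span_in_sigmas": 2 * upper_fence,   # equals 4 * IQR when whisker = 1.5
        "upper_fence_in_sigmas": upper_fence,
        "fraction_outside": 2 * (1 - std_normal.cdf(upper_fence)),
    }
```

Under this assumption the coefficient comes out at about 1.48, the span at about 5.4 sigmas (whiskers at roughly 2.7 sigmas on each side), and about 0.7% of the points fall outside, so the “6 sigmas” in the argument is a rounded figure.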

In summary

To draw a box plot you need to decide:

how the data is grouped
the whisker length
whether to mark extreme values

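To draw a box with whiskers for one group, only three numbers are required: the two quartiles and the median, since under the $1.5 \cdot IQR$ convention the whisker length follows from them.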

Source: https://habr.com/ru/post/267123/
