Introduction
Suppose I ask you how many people should be in a room so that two of them have a birthday with a 50% probability of one day. What will be the answer? This is what is called the birthday paradox.
The paradox reads:
If there are 23 people in the room, then with a 50% probability two of them were born on the same day.
Some versions of the paradox make even stronger statements:
')
If the room has 70 people, then with a probability of 99%, two of them were born on the same day.
At first, this seemed surprising and counterintuitive to me. Let's find out why this is true. To simplify the task, we make the following assumptions:
- We assume that all people in the room were not born in a leap year. We make this assumption so that we do not have to analyze two different cases.
- There are no twins in the room. With a pair of twins, the solution will be trivial.
- We assume that people are born evenly and randomly. What does it mean? People are equally likely to be born on any day of the year. If this is formalized a little, then the probability of birth on any chosen day is equal to $ inline $ \ frac {1} {365} $ inline $ .
- People are born independently of each other. This means that the date of birth of any person does not affect the date of birth of another.
It is worth noting that these conditions are not necessarily observed in the real world. In particular, in the real world, people are not born with uniform chance.
This link has statistics for the days on which people are born. Although our model is not an accurate reflection of the real world, its simplicity makes it much easier to analyze the problem.
I must say that if there are 366 or more people in a room, then it is guaranteed to have two people with the same birthday. This follows from the Dirichlet principle (the “principle of pigeons and boxes”): if there are 366 people and 365 days, then at least two people must have the same birthday.
Imagine that there is one person in the room, then the probability of his common birthday with someone else is 0. Let
$ inline $ (A_n) $ inline $ will be the outcome in which among
$ inline $ n $ inline $ people do not have a single birthday. Let be
$ inline $ \ Pr [A_n] $ inline $ - the probability that among
$ inline $ n $ inline $ of people in the room, everyone has a birthday on their day. Similarly, let
$ inline $ \ overline {A_n} $ inline $ will be complement
$ inline $ A_n $ inline $ , i.e. an outcome in which among
$ inline $ n $ inline $ people two people have the same birthday.
$$ display $$ \ Pr [A_n] + \ Pr [\ overline {A_n}] = 1 $$ display $$
$$ display $$ \ Pr [A_1] = 1 \ text {because there is only one person in the room} $$ display $$
Suppose there are two people in the room, A and B. Without loss of generalization, just imagine that person A was born on January 1. For B and A to have different birthdays, you need B to be born on any day except January 1st. Person B will have 364 birthday options.
$$ display $$ \ Pr [A_2] = \ Pr [\ text {B is different from A}] = \ frac {364} {365} $$ display $$
Imagine that there are three people in the room, A, B and C. Suppose that person A was born on January 1, so that all of them were born on different days, person B only had 364 days, as in the previous example. Since A and B take two days, person C can only be born at 365 - 2 = 363 days.
$$ display $$ \ Pr [A_3] = \ Pr [\ text {B is different A and C is different from B, A}] = \ frac {364} {365} \ times \ frac {363} {365} $$ display $$
Something more interesting happens here: suppose the room already has
$ inline $ k - 1 $ inline $ people. When he enters the room
$ inline $ k $ inline $ the second person to
$ inline $ k $ inline $ people had different birthdays, two outcomes must be true
- Everything $ inline $ k - 1 $ inline $ people who entered the room before him should have different birthdays. What is the likelihood of this? $ inline $ \ Pr [A_ {k-1}] $ inline $ .
- $ inline $ k $ inline $ - that person needs to have a different birthday from all others $ inline $ k - 1 $ inline $ people in the room. What is the likelihood of this? $ inline $ \ frac {365 - (k-1)} {365} $ inline $ .
It should also be noted that the two outcomes presented above are independent of each other, because by assumption (4) made at the beginning of the post,
$ inline $ k $ inline $ The second person was born independently of all the others.
$$ display $$ \ begin {equation *} \ begin {split} \ Pr [A_k] & = \ Pr [k - 1 \ text {people in the room have different birthdays} \ textbf {AND} \\ & \ \ \ \ \ \ \ \ text {person k has a different birthday from the} k - 1 \ text {people}] \\ & = \ Pr [k - 1 \ text {people in the room have different birthdays}] \\ & \ \ \ \ \ times \ Pr [\ text {person k has a different birthday from the} k - 1 \ text {people}] \\ & = \ Pr [A_ {k-1}] \ times \ frac { 365 - (k-1)} {365} \ end {split} \ end {equation *} $$ display $$
Now we calculate the probabilities:
$$ display $$ \ begin {equation *} \ begin {split} \ Pr [A_1] & = 1 \\ \ Pr [A_2] & = \ Pr [A_1] \ times \ frac {364} {365} = \ frac {364} {365} \ approx 0.997 \\ \ Pr [A_3] & = \ Pr [A_2] \ times \ frac {363} {365} = \ frac {364} {365} \ times \ frac {363} {365} \ approx 0.991 \\ \ Pr [A_4] & = \ Pr [A_3] \ times \ frac {362} {365} \ approx 0.983 \\ & \ vdots \\ \ Pr [A_ {22}] & = \ Pr [A_ {21}] \ times \ frac {344} {365} \ approx 0.525 \\ \ Pr [A_ {23}] & = \ Pr [A_ {22}] \ times \ frac {343} {365 } \ approx 0.493 \\ \ end {split} \ end {equation *} $$ display $$
Since we got
$ inline $ \ Pr [A_ {23}] = 0.493 <0.5 $ inline $ , this will mean that the probability of a common birthday for two people is
$$ display $$ \ Pr [\ overline {A_ {23}}] = 1 - \ Pr [A_ {23}] = 1 - 0.493 = 0.507> 0.5 $$ display $$
With increasing
$ inline $ n $ inline $ probability tends to 1 and reaches it.
More difficult task
Suppose we want to generalize this problem to the case when there is
$ inline $ n $ inline $ people and
$ inline $ m $ inline $ possible birthdays. Having
$ inline $ n, m $ inline $ , can we determine the likelihood that two people will have a common birthday?
You can use the same system: we will have an outcome
$ inline $ A_n $ inline $ denoting all
$ inline $ n $ inline $ people born on different days. In the case of one person, nothing changes
$$ display $$ \ Pr [A_1] = 1 $$ display $$
In the case where there are two people, we designate them as A and B. In order for person B to be born on another day, person B must have a birthday among
$ inline $ m - 1 $ inline $ other options.
$$ display $$ \ Pr [A_2] = \ frac {m-1} {m} $$ display $$
In the case of three people, person B must have a birthday different from the birthday of A. Person C must have a day different from the birthdays of A and B.
$$ display $$ \ Pr [A_3] = \ frac {m-1} {m} \ times \ frac {m-2} {m} $$ display $$
Generally for any
$ inline $ n $ inline $ we can use the same recursive formula as in the previous section:
$$ display $$ \ Pr [A_n] = \ Pr [A_ {n-1}] \ times \ frac {m - (n-1)} {m} $$ display $$
Suppose if we want to find an expression in closed form for
$ inline $ \ Pr [A_n] $ inline $ , then we decompose the expression
$$ display $$ \ begin {equation *} \ begin {split} \ Pr [A_n] & = \ Pr [A_ {n-1}] \ times \ frac {m - (n-1)} {m} \ \ & = \ Pr [A_ {n-2}] \ times \ frac {m - (n-2)} {m} \ times \ frac {m - (n-1)} {m} \\ & = \ Pr [A_ {n-3}] \ times \ frac {m - (n-3)} {m} \ times \ frac {m - (n-2)} {m} \ times \ frac {m - (n -1)} {m} \\ & \ \ vdots \\ & = \ Pr [A_2] \ times \ frac {m-2} {m} \ times \ frac {m-3} {m} \ times \ cdots \ times \ frac {m - (n-3)} {m} \ times \ frac {m - (n-2)} {m} \ times \ frac {m - (n-1)} {m} \\ & = \ prod_ {i = 1} ^ {n-1} \ frac {mi} {m} = \ prod_ {i = 1} ^ {n-1} 1 - \ frac {i} {m} \\ & \ approx \ prod_ {i = 1} ^ {n-1} e ^ {\ frac {-i} {m}} \ text {using the identity} 1 - x \ approx e ^ {- x} \\ & = e ^ {- \ sum_ {i = 1} ^ {n-1} \ frac {i} {m}} \\ & = e ^ {\ frac {-n (n-1)} {2m}} \ approx e ^ {\ frac {-n ^ 2} {2m}} \ end {split} \ end {equation *} $$ display $$
An approximation of this result is
$ inline $ \ Pr [A_n] \ approx e ^ {\ frac {-n ^ 2} {2m}} $ inline $ . Although we can find more approximate boundaries, this gives us a sufficient approximation. The only identity we used in this approximation:
$$ display $$ 1 - x \ approx e ^ {- x} $$ display $$
In its abstract form, the birthday paradox has many uses in computer computing. In particular, it is used in hashing, which in itself has many uses. The conclusions made in this task are key for analyzing the probability of hashing two elements into one key. This problem can be further generalized to the probabilistic problem of balls and basins (balls and bins problem), which we will leave for another post.
Application
The birthday paradox is not just an artificial task or an interesting party trick, it has many real-world applications. Personally, I believe that the probabilistic analysis used in the evidence is useful in analyzing other problems in which randomization is involved. In particular, this relates to the development of cryptographic hash functions. We will see how the analysis above can be quite useful in creating hash functions with protection against attacks of “birthdays”.
First, let's define what a hash function is. Hash function
$ inline $ h: U \ rightarrow [0, m-1] $ inline $ Is a function that performs a mapping from a set
$ inline $ u $ inline $ in a number in the range
$ inline $ [0, m-1] $ inline $ . Function
$ inline $ h $ inline $ defined as
$ inline $ h (x) = x \ mod m $ inline $ - example hash function from
$ inline $ \ mathbb {N} \ rightarrow [0, m-1] $ inline $ . Hash functions have many uses, especially in combination with the popular hash table data structure. It is also used in cryptography, where a specific type of hash function called “cryptoraphic hash function” is used.
One of the many properties that a cryptoraphic hash function must have is collision resistance. It must be difficult to find two such
$ inline $ u_1, u_2 \ in U $ inline $ so that they form a collision, i.e.
$ inline $ h (u_1) = h (u_2) $ inline $ .
It is on the resistance to collisions that we will focus on. First, I will explain why it is a desirable property. To do this, we will first consider digital signatures.
In the past, people and organizations used signatures and seals to “sign” documents. Recently, there has been a transition to electronic or digital signatures. A digital signature must satisfy three basic properties.
- When signing a document, you need to be able to verify who signed the document.
- After signing the document, no one should be able to fake it.
- The person who has signed the document cannot subsequently refute the signing of the document.
A digital signature is a way of provably verifying the truth of a document or message. A digital signature ensures that the received message was created by a known sender and not changed.
Let's say we have a document
$ inline $ m $ inline $ . How do we sign it?
Each user / organization has a unique private key
$ inline $ p_k $ inline $ and public key
$ inline $ pub_k $ inline $ . They use the signing function to sign a document with their own private key. However, a digital signature can only sign a small number of documents. The scope of the signature function is small. One way to solve this problem is to create a smaller document that represents the original document, but in a much smaller size. Most often, a hash function is applied to this large document. A hash function is used to map it from a large space to a smaller one, and the result of such an operation is called a “fingerprint”. The signature uses the fingerprint and private key to create the signature. The procedure can be described as follows:
- Get the private key $ inline $ p_k $ inline $ .
- Hash a document $ inline $ m $ inline $ and get $ inline $ h (m) $ inline $ .
- Sign $ inline $ h (m) $ inline $ using the private key $ inline $ p_k $ inline $ .
- We send $ inline $ sign (h (m), p_k) $ inline $ and public key $ inline $ pub_k $ inline $ anyone who wishes to confirm the document.
Anyone who has
$ inline $ sign (h (m), p_k)) $ inline $ and public key, can check if the document is really
$ inline $ m $ inline $ signed by the right person.
Problem: suppose attacker Eva found two documents
$ inline $ m, m '$ inline $ where
$ inline $ m $ inline $ - this is the present contract, and
$ inline $ m '$ inline $ - a fraudulent document such that
$ inline $ h (m) = h (m ') $ inline $ . Suppose Eve knows in advance that Alice will only sign
$ inline $ m $ inline $ , but not
$ inline $ m '$ inline $ . Before signing, a cryptographic hash function is applied to the document
$ inline $ h $ inline $ . Eve goes to Alice and asks her to sign a document
$ inline $ m $ inline $ using the sequence described above. Now, Eve may claim that Alice signed a fraudulent document
$ inline $ m '$ inline $ , because
$ inline $ h (m) = h (m ') $ inline $ . Alice will not be able to prove in any way that she did not sign
$ inline $ m '$ inline $ .
In the example above
$ inline $ h $ inline $ turned out to be a collision-resistant function, so Eve was able to find two incoming data sets that hashed the same value. Eve was able to claim that Alice signed
$ inline $ m '$ inline $ although in fact she signed only
$ inline $ m $ inline $ . This underlines the importance of collision resistance and shows why digital signatures are vulnerable if the hash function is unstable.
Let be
$ inline $ h: U \ rightarrow [0, m-1] $ inline $ will be a hash function. Assume that the input data is randomly uniformly and independently hashed into any of the elements
$ inline $ m $ inline $ .
The birthday attack task is to find the smallest number of elements
$ inline $ n $ inline $ which can be hashed with
$ inline $ h $ inline $ so that we can find two elements
$ inline $ x, y \ in U $ inline $ at which
$ inline $ h (x) = h (y) $ inline $ .
When attacking "birthdays", the attacker will randomly pick up
$ inline $ x \ in U $ inline $ and keep couples
$ inline $ (x, h (x)) $ inline $ . An attacker will repeatedly select and save these pairs until he finds two values
$ inline $ x, y $ inline $ at which
$ inline $ h (x) = h (y) $ inline $ . We need to determine how many times the attacker needs to repeat this operation until he finds a collision.
To analyze the “birthdays” attack, you can use the same principles that we used for the birthday paradox. In the attack "birthdays"
$ inline $ m $ inline $ denotes the number of days in a year, and
$ inline $ u $ inline $ similar to people entering a room. People are hashed on their birthdays, which can be one of the meanings.
$ inline $ m $ inline $ . Suppose we need to find a collision with a probability of 99%. We need to know what is the smallest
$ inline $ n $ inline $ , in which the hash of two values ​​will be one birthday (in the world of hash functions, this means that two input data sets are hashed into the same value).
We previously showed that
$ inline $ \ Pr [A_n] \ approx e ^ {\ frac {-n ^ 2} {2m}} $ inline $
We want to ask
$ inline $ \ Pr [\ overline {A} _n] \ approx e ^ {\ frac {-n ^ 2} {2m}} = \ frac {99} {100} $ inline $ , i.e
$ inline $ \ Pr [A_n] \ approx e ^ {\ frac {-n ^ 2} {2m}} = \ frac {1} {100} $ inline $ .
$$ display $$ \ begin {equation *} \ begin {split} e ^ {\ frac {-n ^ 2} {2m}} & = \ frac {1} {100} \\ \ frac {n ^ 2} {2m} & = \ ln 100 \\ n ^ 2 & = 2 m \ ln 100 \\ n & = \ sqrt {2 m \ ln 100} \ end {split} \ end {equation *} $$ display $$
The analysis shown above tells us that to get collisions in hash functions with a size interval
$ inline $ m $ inline $ an attacker needs to hash uniformly and independently hash approximately
$ inline $ \ sqrt {2 m \ ln 100} = O (\ sqrt {m}) $ inline $ for an almost complete guarantee (99% probability) that two elements are hashed to the same value.
Suppose we want to get a collision with a probability of 50%; then we need
$ inline $ n = \ sqrt {2 m \ ln 2} $ inline $ . The important conclusion here is that in order to get a collision with a probability of more than 0.5, we have to hash the order
$ inline $ \ sqrt {m} $ inline $ elements. This is consistent with our previous analysis for
$ inline $ m = 365 $ inline $ , because
$ inline $ \ sqrt {2 \ ln 2 \ cdot 365} $ inline $ approximately equal to 23.
Additional Tasks
- Having $ inline $ n $ inline $ people, $ inline $ m $ inline $ days and a certain number $ inline $ k $ inline $ find the probability that exactly $ inline $ k $ inline $ people have the same birthday.
- Let's transform the above task a bit. Let's say we have $ inline $ m $ inline $ days and a certain number $ inline $ k $ inline $ . What will be the smallest value $ inline $ n $ inline $ at which no less $ inline $ k $ inline $ people will have the same birthday with a probability of at least 0.5? Can you generalize this task to any probability $ inline $ p> 0 $ inline $ ?
- Suppose we have 100 numbers from 1 to 100, as well as a machine guessing a random number from 1 to 100 in a uniform and random order. How many times will we need to use the machine as expected, so that the machine guesses all numbers from 1 to 100?
- Can you generalize this task to any $ inline $ n $ inline $ ?
- Suppose we have a hash function that randomly uniformly displays elements in a memory region. Let total areas $ inline $ m $ inline $ . How many items $ inline $ n $ inline $ do we need to add mathematical expectation to our data structure so that at least two elements are hashed in each region?