Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory.
Problem
Choose k entries from n numbers. Make sure each number is selected with the probability of k/n
Idea
- Choose k items and put them into the reservoir.
- For (k+1)th item, pick it with probability k/(k+1), and randomly replace a number in the reservoir.
- For (k+i)th item, pick it with probability k/(k+i), and randomly replace a number in the reservoir.
- Repeat until k+i reaches n.
Example
Sample size 1
Choose 1 from [1, 2, 3, 4] => initial reservoir: [1]
For 2:
- with probability 1/2, keep it (discard the old one aka 1)
- with probability 1 - 1/2 = 1/2, keep the old item (ignore the new one)
- For 1, the probability that it keeps staying in the reservoir is:
- P(1 was in the reservoir last time) * P(1 is not replaced by 2)
- = 1 * (1 - 1/2 * 1)
- = 1/2
For 3:
- For 1, the probability that it keeps staying in the reservoir is:
- P(1 was in the reservoir last time) * P(1 is not replaced by 3)
- P(1 was in the reservoir last time) * (1 - P(3 is selected and replaces 1))
- = 1/2 * (1 - 1/3 * 1)
- = 1/2 * 2/3
- = 1/3
Same for 4