
Reservoir Sampling and Bloom Filter notes

hajinny 2022. 10. 19. 12:33

Reservoir Sampling

Reservoir sampling is a sampling technique for a stream of elements. It guarantees that, after encountering m elements from the stream, each element remains in a sample of size s with probability s/m.

 

Reservoir sampling with a sample size of 1 ensures that any element from the stream is in the sample with probability 1/m (that is, after encountering m elements). When the m-th element arrives from the stream, we replace the sample with that element with probability 1/m; otherwise, we discard the m-th element. 
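The size-1 case can be sketched in a few lines of Python (function name is mine, not from the notes):

```python
import random

def reservoir_sample_one(stream):
    """Keep a uniform random sample of size 1 from a stream of unknown length."""
    sample = None
    for m, element in enumerate(stream, start=1):
        # Replace the current sample with the m-th element with probability 1/m.
        if random.random() < 1.0 / m:
            sample = element
    return sample
```

Note that the first element always enters the sample (probability 1/1 = 1), and later elements displace it ever more rarely.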

 

Reservoir sampling with a sample size of s ensures that any element has probability s/m of being in the sample. When the m-th element arrives from the stream, with probability s/m we take the m-th element into the sample while uniform-randomly taking out one element from the sample; otherwise, we discard the m-th element. 
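A minimal sketch of the size-s variant (this is the classic "Algorithm R" formulation; the function name is mine):

```python
import random

def reservoir_sample(stream, s):
    """Keep a uniform random sample of size s from a stream of unknown length."""
    sample = []
    for m, element in enumerate(stream, start=1):
        if m <= s:
            # Fill the reservoir with the first s elements.
            sample.append(element)
        elif random.random() < s / m:
            # Take the m-th element with probability s/m,
            # evicting a uniformly random resident of the sample.
            sample[random.randrange(s)] = element
    return sample
```

If the stream has fewer than s elements, the whole stream is returned.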

 

Proofs are not examinable, but they're pretty followable.
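For reference, the induction step for sample size s can be sketched like this (my own summary, not from the lecture):

```latex
% Inductive hypothesis: after m elements, each element seen so far is in
% the sample with probability s/m. Element m+1 is taken with probability
% s/(m+1). An element already in the sample stays unless the newcomer is
% taken AND that element is the one evicted (probability 1/s):
\Pr[\text{stay}]
  = \frac{s}{m}\left(1 - \frac{s}{m+1}\cdot\frac{1}{s}\right)
  = \frac{s}{m}\cdot\frac{m}{m+1}
  = \frac{s}{m+1}.
```

So after m+1 elements, every element (including the newcomer) is in the sample with probability s/(m+1), as required.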

 

Bloom Filter

A Bloom filter can quickly report whether a key is in the storage. Suppose we have a set S of keys in storage, with |S| = m. We set up k hash functions, each mapping to n buckets, and a bit array B of size n. For each key s in S, we set B[h_i(s)] = 1 for i = [0, k) (essentially, we hash each element of S with all k hash functions and set the corresponding bits of B to 1).

 

When an incoming query asks 'is key a in the storage S?', we simply check that B[h_i(a)] = 1 for all i = [0, k).
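Both the build step and the query can be sketched as follows (the salted-SHA-256 trick for deriving k hash functions is an illustrative choice of mine, not something the notes prescribe):

```python
import hashlib

class BloomFilter:
    """Bit array of size n queried through k hash functions."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _indices(self, key):
        # Derive k bucket indices by salting one hash function with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.n

    def add(self, key):
        # Set B[h_i(key)] = 1 for all i in [0, k).
        for idx in self._indices(key):
            self.bits[idx] = 1

    def might_contain(self, key):
        # True iff B[h_i(key)] = 1 for all i; never a false negative,
        # but false positives are possible.
        return all(self.bits[idx] for idx in self._indices(key))
```

Usage: call `add` for every key in S up front, then answer queries with `might_contain`.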

 

This guarantees that if a is really in S, then the Bloom filter will always report yes (so there are no false negatives, where the system falsely reports that S does not have a) - because each hash function deterministically maps a given element to the same index in B.

 

However, there are false positives: even if a isn't in S, the Bloom filter may sometimes say a is in S. This is because every bit B[h_i(a)] may have been set by some element of S (not necessarily the same element for each i). The probability of a false positive is approximately (1-e^(-km/n))^k. Note that too high a k results in over-saturation of 1's in the B vector, which increases false positives. So there's an optimal k. The higher n, the lower Pr[FP].
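For completeness, the standard optimum (with m = |S| keys hashed into n bits) works out to:

```latex
\Pr[\text{FP}] \approx \left(1 - e^{-km/n}\right)^{k},
\qquad
k^{*} = \frac{n}{m}\ln 2
\;\Rightarrow\;
\Pr[\text{FP}] \approx 2^{-k^{*}} \approx 0.6185^{\,n/m}.
```

So the false-positive rate drops exponentially in the number of bits per key, n/m, provided k is chosen near the optimum.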