Help me choose
2022/October 2022

Universal hashing and minhash notes

by hajinny 2022. 10. 18.

1. Universal hashing (W2.2)

We have numbers from a large set S, which we want to store, query, and remove. Our requirements are:

- time complexity of O(1)

- space complexity that's less than O(|S|)

 

Deterministic hash functions (maps S -> {0,n-1}, n<<|S|) satisfies the space complexity requirement, but suffer from collisions. It's easy to find a set of values that causes collision at the same index.

 

The problem here is that we determine a hash function upfront, and we have no idea what the incoming elements might be. To solve this, we want to simulate randomising the selection of hash function. In particular, we randomly choose a hash function from a family of hash functions such that whatever hash function h we pick, Pr[h(x) /= h(y)] <= 1/n for any pair of x and y that is x/=y.

 

One such family is universal hash family defined like this:

H := {h | 0<a<p, 0<=b<p, h = (ax+b) mod p mod n}

 

Selecting any hash function from this family H will make sure that the collision probability is contained in <= 1/n, hence the collision chain length is num(elements)/n. As long as num(elements) ~ n, the time complexity is O(1). 

2. Minhash (W2.3)

2.1 Relevant background

Given a document, we want to find a similar document in the world wide web (of >>>>millions of documents). What we are basically doing here is, given a document d, we are trying to get a document d' such that similarity between d and d' is above certain threshold. Naive comparison achieves this with O(n^2), but we can do this in O(n) time.

 

We define similarity as the number of k-shingles (k-consecutive words) in common - this is called jaccard similarity. Suppose we have documents 'rose is a' and 'rose is b'. Then the set of all the possible 2-shingles are: 'rose is', 'is a', 'is b', and their jaccard similarity is 1/3, as they share one 2-shingle in common. For a more compact representation, we can create a hash function that maps the possible shingles to a number - eg 'rose is'->0, 'is a'->1, 'is b'->2. Then, 'rose is a' can be represented as a bitvector 110, and 'rose is b' as 101. Computing their similarity is just computing the number of 1's in common in the same place divided by total number of shingles possible.

 

However, such similarity computation is expensive (remember, we haven't even touched O(n^2) comparison). To make the comparison fast, we use minhash as discussed below. Btw, although will not be in the exam, we use LSH to hash these minhash values where signatures of high similarity is put into the same bucket - hence become candidate similar pairs, which brings comparison to O(n) times. 

2.2 Now what is minhash?

Minhash is a bitvector signature such that similarity between two minhash indicates the similarity of the documents that produced them. Pr[minhash(a)=minhash(b)] = jaccardSimilarity(a,b).

 

We have two approaches.

2.2.1 Permutation approach

Permutation approach produces minhash as follows. Recall from the background that each document can have a bit-vector representation indicating what shingles it has. That representation is actually a table - for example, 'rose is a' is represented as 110 in a sense that:

Shingle ID Present? (in 'rose is a' document)
0 1
1 1
2 0

To create one signature of multiple documents, we permutate the shingle IDs, like shown below:

Permuated shingle ID Shingle ID Present? (in 'rose is a') Present? (in 'rose is b')
1 0 1 1
2 1 1 0
0 2 0 1

We go through rows in the increasing order of permuated shingle ID, such that the lowest permuated number for which the document has the shingle becomes their signature. 

 

We start with the row indicated by permutated shingle id of 0 - here, 'rose is a' document doesn't have the shingle, but 'rose is b' has. The signature assigned to 'rose is b' is therefore 0. Similarly, 'rose is a' document gets 1 as its signature.

 

We do this multiple times to create multiple signatures, and we can compute signature space similarity through this.

Bit vector similarity is actual jaccard similarity, and signature similarity computes similarity of minhash signatures

However, creating permutated shingle IDs is expensive. This is why we use the second approach to simulate permutation.

2.2.2 One-pass minhash signatures

We pick a hash function from universal hash family that maps each shingle ID to [0, |shingle IDs|-1], with preferrably zero collision (and universal hash family more or less ensures this, when you pick n = |shingle IDs|-1). Then, those values become your permutated Shingle ID's in the previous approach. Rest is the same.

 

'2022 > October 2022' 카테고리의 다른 글

Collaborative Filtering (user-based, item-based) notes  (0) 2022.10.23
Community detection notes  (0) 2022.10.22
Reservoir Sampling and Bloom Filter notes  (0) 2022.10.19