Thursday, October 24, 2019

Reverse search engine

A search engine takes a query and returns the matching documents from a set of documents. A reverse search engine takes a document and produces a query that matches that document better than the other documents in the set. Who does this? Well, everyone who uses Google, for example.

The user starts out with an (incomplete) idea of the document to be found, and tries to write a query that will make Google match this document without matching similar but incorrect documents. This is clearly not trivial, since some people seem to be very bad at it. But let's not dwell on how to do this with Google in particular; let's instead formulate an abstract problem.

Smallest unique subset

We start out with a set S of documents. We won't consider the structure of the documents, but just treat them as sets of words (a bag-of-words model). The elements need not be words, either: anything that can be tested for equality with another element will do.

Suppose we are dealing with a very simple search engine that we want to reverse. It takes a query Q, which is a set of words, and returns the documents that contain all the words in Q. If a document D contains all the words in Q, we say that D matches Q.
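With documents and queries both represented as Python sets, this whole search engine is a couple of lines (a sketch of the model above, not any real engine's API; the names are mine):

def matches(document, query):
    # D matches Q iff D contains every word of Q.
    return query <= document

def search(documents, query):
    # The entire "simple search engine": return all matching documents.
    return [d for d in documents if matches(d, query)]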

Now we are given a document D, and want to find a set of words B that D matches but that no other document in S matches. Furthermore, we want to return a minimal such B. There need not be such a B: B exists if and only if D is not a strict subset of another document in S. (If it is not, then B = D itself already works; if D is a strict subset of some document D', then every query that D matches is also matched by D'.)
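A small concrete example (my own, for illustration): let S = { {a,b,c}, {b,c,d}, {a,d} } and D = {a,b,c}. Every single word of D also occurs in another document, so no B of size 1 works, but B = {a,b} matches D and nothing else: {b,c,d} lacks a, and {a,d} lacks b.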

We should analyze the "Smallest Unique Subset" problem to see if we can hope to find a minimal B fast.

Analysis: hitting set on complements

A quick survey of the well-known NP-complete problems shows that "Smallest Unique Subset" is equivalent to the hitting set problem [1]. The hitting set problem is this: given a collection S of sets S_i of elements, all drawn from some universe U, find a smallest "hitting set" H such that H contains at least one element of every S_i. Hitting set is reducible to smallest unique subset in the following way:

Suppose we can solve smallest unique subset in polynomial time. Take a hitting set instance, and replace every S_i with its complement U \ S_i. Call the resulting collection of complements C. Now call smallest unique subset with C as the document set and with D = U as the document to isolate. A set B fails to match a complement U \ S_i exactly when B contains an element outside it, i.e. an element of S_i. So the answer B from smallest unique subset is exactly a smallest hitting set H for the S_i.
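For illustration (my own numbers): take U = {1,2,3,4} with S_1 = {1,2}, S_2 = {2,3}, S_3 = {3,4}. The complements are C_1 = {3,4}, C_2 = {1,4}, C_3 = {1,2}. Every single element of U lies inside some complement, but B = {2,3} is a subset of none of them, so it is a smallest unique subset for D = U. And indeed {2,3} is a smallest hitting set: 2 hits S_1 and S_2, and 3 hits S_3, while no single element hits all three sets.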

Conversely, smallest unique subset is reducible to hitting set: to solve SUS for a document D, call hitting set on the sets D \ D' for every other document D' in S, i.e. on the complements of the other documents after filtering out all elements that are not in D. B must contain at least one element that each D' lacks, which is exactly the hitting condition.

So, smallest unique subset is NP-complete (as a decision problem; finding a minimal B is NP-hard).

Greedy implementation

Exhaustive search for this problem can be done in O(2^|D|) time, times polynomial factors in |S| and |U|, so it can only be used for small, uninteresting instances. Since the problem is NP-complete, we can't hope to find an optimal solution that is always fast. One of the design goals "optimal", "always" or "fast" is going to have to go. With a greedy implementation, we forget about "optimal". The idea is very simple: pick the element of D that is contained in the fewest other documents in S, that is, the "rarest" element of D, and add it to B. The documents that do not contain it can no longer match B and are removed, and the selection is repeated on the survivors with a new rarest element, until no other document remains.

This assumes that there are no duplicate elements within the sets of S, which holds automatically when they are represented as actual sets.
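A minimal sketch of the greedy selection in Python (function and variable names are mine, not from any released code):

def greedy_sus(D, S):
    # Greedily build a small B that D matches but no other document in S matches.
    # D is a set of words; S is a list of sets of words, typically including D.
    candidates = [d for d in S if d is not D]
    remaining = set(D)
    B = set()
    while candidates:
        if not remaining:
            return None  # D is a subset of another document: no B exists
        # The "rarest" element: contained in the fewest surviving documents.
        x = min(remaining, key=lambda w: sum(1 for d in candidates if w in d))
        B.add(x)
        remaining.remove(x)
        # Documents that lack x can no longer match B; drop them.
        candidates = [d for d in candidates if x in d]
    return B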

Stress testing the greedy implementation

I want to target this towards something like a dataset of Wikipedia articles. Wikipedia has about 8 million articles. Most are quite short, something like 500 words long. Suppose English has 10,000 common words. We represent the words as integers. In a simple model, let's assume the words are uniformly distributed at random. With 100,000 documents in S, the filtration goes like this:

Remaining elements: 100000
Remaining elements: 4627
Remaining elements: 190
Remaining elements: 2
B: [3913, 1283, 1311, 7171]
time: 3.866 s

So it takes almost 4 seconds for a single document, even when we check only 100,000 documents. Each filtration reduces the candidate set about 20-fold, which makes sense: a document of 500 random words from a 10,000-word dictionary contains any given word with probability roughly 500/10,000 = 1/20.
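For reference, a reconstruction of the stress test (my sketch; the constants match the description above, but this is not the original script):

import random
import time

def random_doc(vocab=10_000, length=500):
    # About 500 uniformly random words from a 10,000-word vocabulary,
    # with words represented as integers.
    return {random.randrange(vocab) for _ in range(length)}

S = [random_doc() for _ in range(100_000)]
D = S[0]
t0 = time.perf_counter()
B = greedy_sus(D, S)  # the greedy sketch from above
print("B:", B)
print("time: %.3f s" % (time.perf_counter() - t0))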

If we want to find a SUS for each D in S, there is a way to speed up the computation a lot. We can store a dictionary that maps each word to the IDs of the documents that contain it. Given D, we can do the first selection just by looking up this dict and comparing the lengths of the values (which are lists of document IDs). The remaining selections can be done with the original algorithm. Since the first filtration is the one that scans the full document set, the time saving should be about a factor of 20.
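A sketch of that dictionary as an inverted index (again, the naming is my own):

from collections import defaultdict

def build_index(S):
    # Map each word to the IDs of the documents that contain it.
    index = defaultdict(list)
    for doc_id, doc in enumerate(S):
        for word in doc:
            index[word].append(doc_id)
    return index

def first_selection(D, doc_id, index):
    # The rarest word of D is the one with the shortest posting list.
    x = min(D, key=lambda w: len(index[w]))
    survivors = [i for i in index[x] if i != doc_id]
    return x, survivors

Building the index costs one pass over all documents, but that cost is paid once and then amortized over all |S| queries.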

References
[1] Wikipedia: Set cover problem / hitting set
