
Monday, 28 October 2019

TSP - Representation and basic operations

Given a set of points and their pairwise distances, we want to find the shortest path through all of them. This is known as the travelling salesman problem. It is NP-hard. Here I will consider the version where the path must go back to the original node, to make a cycle. It can be turned into the non-cyclic version, and vice versa.

Representation

We are given N points. Without loss of generality, they can be represented as the integers 0..N-1. How should the solution be represented? One option is to see the problem as selecting N edges from the complete graph on N vertices, under the constraint that the edges form a single cycle. This representation is good for a MIP approach, but not for more specialized TSP solving. I did write a MIP model for TSP, but with CBC it can only solve instances with about 7 nodes within 10 seconds; 10 nodes take over a minute, which makes MIP uninteresting as an approach.

A smarter representation comes from thinking of the solution as the nodes arranged in a certain order. We could use an array list for this, and perhaps manipulate slices efficiently, but a linked list just seems more natural.

Example TSP with solution
How to represent the linked list? A dict is often a natural choice in Python, and so it is here. The implicit representation becomes that of a directed path rather than an undirected one. Why choose a dict? Dicts are fast in Python, and the manipulations become simple to implement, as we shall see. The dict that holds the edges will be referred to as G, for graph.
The representation used: a dict G with N key-value pairs, where G[u] = v means the tour goes directly from u to v.
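As a concrete illustration (my own sketch, not code from the post), here is how such a G can be built from a visiting order, plus a tour-length helper; dist is an assumed distance function supplied by the user.

def make_tour(order):
    # successor dict: close the cycle by pairing the last node with the first
    return {u: v for u, v in zip(order, order[1:] + order[:1])}

def tour_length(G, dist):
    # total length of the directed cycle stored in G
    return sum(dist(u, v) for u, v in G.items())

G = make_tour([0, 3, 1, 4, 2])   # G[0] == 3, G[3] == 1, ..., G[2] == 0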

The Triple Switch

The triple switch is really neat. It involves three edges. Two edges are cut, thus separating a segment from the path. The gap is closed with an edge. A third edge is cut. The segment is spliced into the resulting gap by creating two new edges. This is one way of seeing it! In fact, the three cut edges define three distinct paths, and the way they are connected to each other is cycled.


The before & after of the triple switch.
The best thing is, it can be done with a single line in Python:

G[u0], G[u1], G[u2] = G[u1], G[u2], G[u0]
Note that this requires that u0, u1 and u2 are oriented along the tour as in the picture; otherwise the graph becomes disconnected.
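As a hedged sketch (not from the post) of how the one-liner fits into a local search step: compute the change in tour length first, and only apply the move when it helps. dist is the same assumed distance function as above.

def try_triple_switch(G, dist, u0, u1, u2):
    # u0, u1, u2 must appear in this cyclic order along the directed tour,
    # otherwise the move disconnects the cycle (see the note above).
    a, b, c = G[u0], G[u1], G[u2]
    delta = (dist(u0, b) + dist(u1, c) + dist(u2, a)
             - dist(u0, a) - dist(u1, b) - dist(u2, c))
    if delta < 0:                        # the tour gets shorter
        G[u0], G[u1], G[u2] = b, c, a    # the one-liner from above
    return delta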

The Cross Switch

The cross switch is meant to take two edges that cross each other and replace them with two edges that do not cross each other. This will make the total path shorter (can be shown by using the triangle inequality on a 4-corner convex polygon).

The three steps of the cross switch. 1) crossing edges switch heads. 2) change direction of one of the loops. 3) switch heads again. 

The cross switch can also be written quite neatly:

def cross_switch(G, u0, u1):
    # 1) the crossing edges switch heads (this splits the tour into two cycles)
    G[u0], G[u1] = G[u1], G[u0]
    # 2) reverse the direction of the cycle that contains u0
    u0 = reverse_cycle(G, u0)
    # 3) switch heads again, joining the two cycles back into one tour
    G[u0], G[u1] = G[u1], G[u0]

def reverse_cycle(G, u0):
    path = get_path(G, u0)
    # point every node at its predecessor instead of its successor
    for i, j in zip(path, [path[-1]] + path[:-1]):
        G[i] = j
    return path[0]

def get_path(G, i0=0):
    # follow successor pointers from i0 until the cycle closes
    i = i0
    path = []
    while True:
        path.append(i)
        i = G[i]
        if i == i0:
            return path
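A quick sanity check I like to run after these operations (my own addition, using the hypothetical make_tour helper from the sketch above): the dict should still describe a single cycle through all N nodes.

G = make_tour([0, 1, 2, 3, 4, 5])
cross_switch(G, 1, 4)
assert sorted(get_path(G, 0)) == list(range(6))   # still one cycle over all 6 nodes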

Ruin & Recreate

Ruin & Recreate is quite a popular principle for solving TSP in practice. The basic operation is: select some subset of the nodes, for example a cluster, and remove all their incident edges. That is the ruin step. In the recreate step, solve TSP for the ruined nodes only, under the constraint that they must link back into the existing tour. The principle was introduced in [1].
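A rough skeleton of the principle (my own sketch under the assumptions above, with deliberately simple choices: random node selection and cheapest insertion, which is not the procedure of [1]):

import random

def ruin_and_recreate(G, dist, k=5):
    """Ruin: remove k random nodes from the tour. Recreate: greedily re-insert
    them at the cheapest position. Assumes k is small compared to the number
    of nodes."""
    removed = random.sample(list(G), k)
    pred = {v: u for u, v in G.items()}          # predecessor of each node
    for u in removed:                            # splice u out of the cycle
        G[pred[u]] = G[u]
        pred[G[u]] = pred[u]
        del G[u], pred[u]
    for u in removed:                            # cheapest-insertion recreate
        w = min(G, key=lambda w: dist(w, u) + dist(u, G[w]) - dist(w, G[w]))
        G[w], G[u] = u, G[w]
    return G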

References

[1] Schrimpf, Gerhard, et al. "Record breaking optimization results using the ruin and recreate principle." Journal of Computational Physics 159.2 (2000): 139-171.

Thursday, 24 October 2019

Reverse search engine

A search engine takes a query, and returns a matching document from a set of documents. A reverse search engine takes a document and produces a query that will match the document better than the other documents in the set. Who does this? Well, everyone who uses Google, for example.

The user starts out with an (incomplete) idea of the document to be found, and tries to write a query that will make Google match this document without matching similar but incorrect documents. This is clearly not trivial, since some people seem to be very bad at it. But let's not dwell on how to do this with Google in particular; instead, let's set up an abstract problem.

Smallest unique subset

We start out with a set S of documents. Don't consider the structure of the documents; just treat them as sets of words (the bag-of-words model). The elements needn't be words, either: anything that can be compared for equality will do.

Suppose we are dealing with a very simple search engine that we want to reverse. Basically, it takes a query Q, which is a set of words, and returns the documents that contain all the words in Q. If D contains all the words in Q, then we say that D matches Q.

Now we are given a document D, and want to return a set of words B that matches D but does not match any other document in S. Furthermore, we want a minimal such B. Such a B need not exist: B exists if and only if D is not a subset of any other document in S.

We should analyze the "Smallest Unique Subset" problem to see if we can hope to find a minimal B fast.

Analysis: hitting set on complements

A quick survey of the well-known NP-complete problems shows that "Smallest Unique Subset" is equivalent to the hitting set problem [1]. The hitting set problem is as follows: given a collection S of sets of elements, find a smallest "hitting set" H such that for every set S_i in S, H contains at least one element of S_i. The elements all belong to some universe U. Hitting set is reducible to Smallest Unique Subset in the following way:

Suppose we can solve Smallest Unique Subset in polynomial time. Take a hitting set instance and replace each S_i with its complement within the universe U; call the collection of complements C. Now ask Smallest Unique Subset with C as the document set and D = U as the document. The answer B from Smallest Unique Subset is exactly a smallest hitting set H for the original instance: B is not contained in any complement U \ S_i, which means B shares at least one element with every S_i.
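A tiny concrete instance (mine, not from the post) of the reduction, using the greedy smallest_unique_set defined in the next section; an exact SUS solver would return exactly a minimum hitting set, while the greedy one returns some (not necessarily minimum) hitting set.

U = {1, 2, 3, 4}                    # universe of the hitting set instance
sets_to_hit = [{1, 2}, {2, 3}, {3, 4}]
C = [U - s for s in sets_to_hit]    # complements of the sets within U
H = smallest_unique_set(C, D=U)     # a query matching U but no complement
# H is not contained in any complement U - S_i, so it shares at least one
# element with every S_i. Here it has size 2, e.g. [2, 3].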

Conversely, Smallest Unique Subset is reducible to hitting set: to solve SUS, filter out all elements that are not in D, take the complements (within D) of the filtered sets, and ask for a smallest hitting set of those complements.

So, Smallest Unique Subset is NP-complete (in its decision form).

Greedy implementation

Exhaustive search for this problem can be done in O(2^|D|) time, times polynomial factors in |S| and |U|, so it can only be used for small, uninteresting instances. Since the problem is NP-complete, we can't hope to find a solution that is always both optimal and fast: one of the design goals "optimal", "always" or "fast" is going to have to go. With a greedy implementation, we forget about "optimal". The idea is very simple. We pick the element of D that appears in the fewest other sets in S, that is to say, the "rarest" element in D. The sets that do not contain this element can no longer match, so they are removed, and the selection is repeated with a new rarest element among the remaining sets.

from collections import Counter
from itertools import chain

def smallest_unique_set(S, D=None):
    """Greedy version"""
    if D is None:
        D = set(chain.from_iterable(S))
    else:
        # keep only the elements that occur in D
        S = [Si & D for Si in S]
    p = []
    while S:
        c = Counter(chain.from_iterable(S))
        if D - set(c):
            # some element of D occurs in no remaining set: it finishes the query
            r = next(iter(D - set(c)))
            p.append(r)
            break
        else:
            r, freq = c.most_common()[-1]   # rarest remaining element of D
            if freq >= len(S):
                raise ValueError("No unique set exists")
            p.append(r)
            S = [si for si in S if r in si]  # only these sets still match the query
    return p
Assumes that there are no duplicate elements in the sets of S, and that D itself is not among the documents in S.
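A tiny usage example (mine, not from the post): no single word of D is unique to it here, so the greedy routine has to combine two.

others = [{"cat", "hat", "mat"},
          {"dog", "sat", "log"}]
D = {"cat", "sat", "mat"}
print(smallest_unique_set(others, D))   # a 2-word query, e.g. ['sat', 'cat']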

Stress testing the greedy implementation

I want to target this towards something like a dataset of Wikipedia articles. Wikipedia has about 8 million articles, most of them quite short, something like 500 words long. Suppose English has 10,000 common words, which we represent as integers. In a simple model, let's assume the words are drawn uniformly at random. With 100,000 documents in S, the filtration goes like this:

Remaining elements: 100000
Remaining elements: 4627
Remaining elements: 190
Remaining elements: 2
B: [3913, 1283, 1311, 7171]
time: 3.866 s

So it takes almost 4 seconds for a single document, when we check only 100,000 documents. Each filtration reduces the size about 20-fold, which makes sense because we have documents with 500 random words from a dictionary of 10,000 words.

If we want to find a SUS for each D in S, there is a way to speed up the computations a lot. We can store a dictionary that maps each word to the set of IDs of the documents that contain it. Given D, we can do the first selection by just polling this dict and comparing the sizes of the value sets. The remaining selections can be done with the original algorithm. The time saving should be about a factor of 20.
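A hedged sketch (mine) of that speed-up: build the inverted index once, use it for the first "rarest word" selection, and fall back to the original routine for the rest.

def build_inverted_index(S):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, doc in enumerate(S):
        for word in doc:
            index.setdefault(word, set()).add(doc_id)
    return index

index = build_inverted_index(S)               # built once, reused for every D
first = min(D, key=lambda w: len(index[w]))   # rarest word of D (D itself always counts)

The documents that survive the first pick are exactly those listed in index[first]; the remaining picks can then run the original greedy routine on just those.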

References
[1] Wikipedia: Set cover problem / Hitting set.

Compression: Word bias in LZ78

Previous:
In the previous post, I implemented a simple (but powerful, and often used) compression algorithm in Python. Look at the last 20 matches it makes at the end of a 6.1MB file (big.txt, linked in [1]). I removed about 0.4MB at the end of big.txt, so that it ends at the end of "War and Peace".

nd to r
ecognize a
 motion w
e did n
ot feel;
 in the prese
nt case
 it is sim
ilarly n
ecessary to ren
ounce a
 freedom th
at does n
ot exist
, and to re
cognize a
dependence of
which we a
re not co
nscious.

To me, it seems wasteful that there are so many nonsense words in the phrase tree. The nonsense words appear when we start encoding in the middle of a word. My idea of language is that it consists of words with meaning, and if we preserve the words better in encoding, we should get longer matches and therefore better compression.

Restarting encoding at the beginning of words

I implemented an encoder and decoder that, after each emitted sequence, restarts reading from the last non-alphabetic character, as long as the matched sequence contains a non-alphabetic character. The phrase tree is still expanded from the longest possible match.

we did
did not fe
feel;
 in the pre
present ca
case it is s
similarly n
necessary to ren
renounce a
a freed
freedom tha
that does
does not exi
exist, an
and to recog
recognize a d
dependence of
of which w
we are not co
conscious.

The average match length is clearly longer; however, the compression rate is not improved by this. The obvious waste is that most of the text is "transmitted twice" because the sequences overlap. If we don't transmit the full matched sequence when emitting, but let the end of it be implied by the beginning of the next sequence, we can do much better:

original                            6331174
LZ78 (plaintext)                    4209692
LZW (plaintext)                     3357669
LZ78 + word backtrack               4580235
Word backtrack + implicit initial   3301512
unix compress                       2321803

So we are able to beat plaintext LZW by a small margin, but we are still far behind the binary-level compressor (unix compress), despite using a more data-specific model.

References
[1] norvig.com (big.txt)

Wednesday, 23 October 2019

Compression: LZ78

Compression fascinates me. The value added is obvious: we use less storage and transfer resources without losing any information. We pay for this with the time it takes to compress and decompress. The algorithms have two obvious performance metrics for a given input data: compression rate, and run-time. I am most interested in optimizing compression rate, though one shouldn't forget the importance of speed for the practicability.

No Free Lunch

No compression algorithm can make every possible input file smaller without losing information. We can call this the "No Free Lunch" theorem of compression. It is very easy to prove. A compression algorithm maps an input file to an encoded file. For lossless compression, this mapping must be invertible, so two input files can never map to the same encoded file. Therefore, the compressor maps the space of input files to itself in a one-to-one mapping, and the average size of the encoded files must be the same as the average size of the input files.

If the encoding is to be invertible for all binary files, we must map the space of input files to itself, and therefore the average file size remains the same after compression. 
For this reason, all compression must rely on identifying a subset of files that we are interested in compressing, and then writing an algorithm that maps these files to a smaller size. There is a trade-off between generality and performance at work here. In the extreme, if all we ever want is to store or transmit two particular files, we would only need 1 bit: either 0 or 1.

Another way to see it is that we think about the set of files that we want to target for compression, and think about whether the naive representation is redundant. By naive representation I mean plain text. Some examples:

  • If we want to compress source code files, we can observe that names of variables and functions are long strings that describe their purpose and identity. This is what we want for human readability, but it is not a necessary representation. The names could just be random, compact strings. Since variable names in source code are used at least twice (if they are both assigned and used), it is probably worth it to store a dictionary with full names and short names.
  • If we want to compress large matrices with a lot of zeroes, we should use a sparse matrix representation. The simplest representations store a list of (row, column, value) triples, so they give a positive compression rate if less than roughly a third of the matrix is non-zero; otherwise this representation makes the matrix file bigger! (A minimal sketch follows after this list.)
  • In video compression, we can save a lot of space if only small parts of the scene change between exposures. In that case, we only need to send a small diff image. If the camera is moving, this is not possible to the same degree. Same if there is a lot of noise. In that case, noise had a twofold negative impact on the video: it reduces the quality, and makes the file larger! If there are reflections or blinking lights in the image this also makes compression harder. In fact, a way to detect problems with security cameras is to see if their compression rate drops suddenly. 
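As a minimal illustration of the sparse-matrix bullet (my own sketch, not from the post), the coordinate-list representation stores only the non-zero entries:

def to_coo(matrix):
    # coordinate-list (COO) representation: keep only the non-zero entries
    return [(i, j, v) for i, row in enumerate(matrix)
                      for j, v in enumerate(row) if v != 0]

dense = [[0, 0, 5],
         [0, 0, 0],
         [7, 0, 0]]
sparse = to_coo(dense)   # [(0, 2, 5), (2, 0, 7)] -- 2 triples instead of 9 values

Each stored entry costs three numbers instead of one, which is where the rough one-third break-even point in the bullet comes from.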

Lempel Ziv 1978

I found a very good book which deals with compression of data strings. It's the PhD thesis of Ross N. Williams, Adaptive Data Compression (link leads to download) [1]. The first chapter of the book provides a thorough introduction to the field of data compression. Starting on page 52, Williams describes the LZ78 algorithm.

From page 54 of Williams's Adaptive Data Compression [1]
The idea is to incrementally build a tree of phrases. Every path from the root node represents a phrase that has been encountered in the data. Every edge holds a symbol. Phrases that have the same initial strings can reuse the same nodes. A node is identified by its ID, which we can assume is a natural number. The tree is transmitted by sending child->parent edges together with the corresponding symbol. The child always uses the lowest available number for its ID, so we only need to send the parent ID and the symbol. As explained on page 54 in [1], we don't even need to send the symbol, since it can be inferred to be the first symbol in the next phrase to be sent.

Python Implementation

The encoding is super easy. The phrase tree G can be represented with a dict of dicts where the keys of the top dict are sequence IDs, the keys of the sub dicts are symbols, and the values of the sub dicts are sequence IDs.

from collections import defaultdict

def encode(reader):
    G = defaultdict(dict)  # phrase tree
    seq = 0   # current sequence id (node in the phrase tree)
    top = 0   # highest id in use
    for a in reader.read():
        seq_old = seq
        try:
            seq = G[seq][a]
        except KeyError:
            yield seq, a
            top += 1
            G[seq][a] = top  # extend tree
            seq = 0          # restart at root
    if seq:
        yield seq_old, a
Note that we need seq_old in case the final string is not a complete match, so we emit a non-leaf. The decoding is almost as easy:

def decode(reader):
    S = {}  # child -> (parent, symbol) mapping in phrase tree
    top = 0
    reader = iter(reader.read())
    while True:
        try:
            seq = get_id(reader)
        except StopIteration:
            break
        a = next(reader)
        yield from climb_to_root(S, seq)
        yield a
        top += 1
        S[top] = (seq, a)

def get_id(reader):
    # read digits up to the first non-digit, which acts as a delimiter
    id_num = []
    while True:
        a = next(reader)
        if a not in "0123456789":
            return int("".join(id_num))
        else:
            id_num.append(a)

def climb_to_root(S, seq):
    # recover the phrase by walking child -> parent, then reverse it
    stack = []
    while seq:
        seq, a = S[seq]
        stack.append(a)
    for a in reversed(stack):
        yield a
Note that if we hit the end (StopIteration) we have no work half-done: the encoder only emits full matches.

Testing correctness is easy! Just throw in some large text files and check that each file is identical after decompression, e.g. with os.system("diff {} {}".format(fn, fn_reconstructed)). Warning: diff prints the differences to the screen if the files aren't identical, which can be a lot of output for large files; to minimize printing, add the flag '--brief'. Debugging is straightforward: edit a test file by trial and error down to the shortest possible file that triggers an error. Many bugs can be found with files that are only a handful of symbols long.
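A quieter alternative (my own habit, not what the post used) is Python's filecmp module, which just returns a boolean:

import filecmp

# shallow=False forces a byte-by-byte comparison rather than comparing os.stat() info
assert filecmp.cmp(fn, fn_reconstructed, shallow=False), "round-trip changed the file"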

The encoded output of this very simple algorithm is considerably longer than the original text. So in the beginning, while the phrase tree is still very small, the overhead of the compression is definitely larger than the space saved by abbreviating matching strings to their sequence ID.

Testing on a 1.1MB file (Nils Holgersson, a Swedish novel) gives (in bytes):

Before:  1146989
After:   1176260

Well, darn it: the "compressed" file is even bigger than the plain text original.

Optimization and Testing

It's really wasteful to write the IDs as plaintext integers. The IDs can be strings over any alphabet of characters. To begin with, we can use all letters, lower and upper case, and some punctuation. Just raiding the keyboard gives me 89 usable characters. Note that it is very important to only pick characters that are represented with one byte. The punctuation I used was:

".:,;!%&/()=@${[]}^~'*<>|-`\"

And the full alphabet:

"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:,;!%&/()=@${[]}^~'*<>|-`\"

With this new encoding the compressed file definitely looks more like mumbo-jumbo than before. Which is good! The more random the encoded file looks, the less redundancy is left in it. With this, the result is:

Before:  1146989
After:    830805

Yes! We compressed the file by about 28%. Is that good? The classic Unix tool compress manages to get the file down to:

original        1146989
this             830805
unix compress    417624

So the built-in tool gets the file down to about half the size of our result! What compression algorithm does compress use? A quick lookup tells us that it uses LZW, which happens to be a simple optimization of LZ78. The optimization is that the final symbol needn't be transmitted, but can be inferred from the next sequence ID. We can easily calculate how big that would make the file with our implementation:

original        1146989
this             830805
LZW opt          657987
unix compress    417624

So even with the LZW optimization, we're not close to the built-in program. It's possible that more optimizations are used, but a lot of the difference is explained by the fact that compress works on a pure binary level: its output is not even printable text.


[1] has the following to say about LZW:


References
[1] Williams, Ross N. Adaptive data compression. Vol. 110. Springer Science & Business Media, 2012.

Friday, 11 October 2019

Optimizing shortest_path_length in networkx

Summary

As of networkx version 2.4rc1 [1], the helper function
networkx.algorithms.shortest_path.unweighted._single_shortest_path_length
can be optimized.

Current implementation: add neighbours of visited nodes to "nextlevel" set. Later, iterate through "nextlevel" set to expand their unseen neighbours. Iterate until "nextlevel" is empty.

Suggested optimization 1: keep track of nodes to be expanded in a queue (collections.deque) that remains sorted by distance from source node. Do not add already visited nodes to any set.

Suggested optimization 2: terminate immediately if all nodes have been visited. This saves a lot of work for dense, strongly connected graphs.

Investigation and analysis show the effectiveness of the optimization. The measured speedup is at least a factor of 2; for dense, strongly connected graphs, the speedup can be up to O(n), where n is the number of nodes.

Code

Current implementation
def _single_shortest_path_length(adj, firstlevel, cutoff):
    """Yields (node, level) in a breadth first search

    Shortest Path Length helper function
    Parameters
    ----------
        adj : dict
            Adjacency dict or view
        firstlevel : dict
            starting nodes, e.g. {source: 1} or {target: 1}
        cutoff : int or float
            level at which we stop the process
    """
    seen = {}                  # level (number of hops) when seen in BFS
    level = 0                  # the current level
    nextlevel = firstlevel     # dict of nodes to check at next level
    while nextlevel and cutoff >= level:
        thislevel = nextlevel  # advance to next level
        nextlevel = {}         # and start a new list (fringe)
        for v in thislevel:
            if v not in seen:
                seen[v] = level            # set the level of vertex v
                nextlevel.update(adj[v])   # add neighbors of v
                yield (v, level)
        level += 1
    del seen


Suggested implementation
from collections import deque

def _single_shortest_path_length(adj, firstlevel, cutoff):
    """Yields (node, level) in a breadth first search

    Shortest Path Length helper function
    Parameters
    ----------
        adj : dict
            Adjacency dict or view
        firstlevel : dict
            starting nodes, e.g. {source: 1} or {target: 1}
        cutoff : int or float
            level at which we stop the process
    """
    visited = set(firstlevel)
    queue = deque(firstlevel.items())   # sorted by level, ascending
    for u, level in queue:
        yield u, level - 1              # level 1 means distance 0 to source nodes
    unvisited = len(adj) - 1
    while queue:
        node, level = queue.popleft()
        if level > cutoff:
            return
        for v in adj[node]:
            if v not in visited:
                visited.add(v)
                queue.append((v, level + 1))
                yield v, level
                unvisited -= 1
                if not unvisited:       # all nodes seen: terminate
                    return

Results

Both algorithms were run on a set of test graphs. This included random graphs of varying sizes. It also included a large tree, a complete graph, and a couple of other special structure graphs. Both versions received exactly the same random graphs. The output was verified to be identical. The benchmarks were re-run with the two algorithms in reversed order, to check that cached variables did not play a role. 

Random + structured graphs, single iteration all_pairs_shortest_paths
time / ms                    current         optimized           
n=1e1, m=1e1                    0.15              0.14             
n=1e1, m=3e1                    0.34              0.18            
n=1e2, m=1e2                    2.04              1.35           
n=1e3, m=1e3                   56.76             31.89         
n=1e3, m=1e5                27857.46            612.96        
caveman(100,10)              3373.64           1168.82       
windmill(10, 10)               32.08              4.09          
nary_tree(3, 1000)           1886.88            810.51       
complete(100)                 205.25              3.48           
mlp(10, 10)                    15.09              5.70          
amnesia(100)                  261.17             41.43         

Small random graphs, total all_pairs_shortest_paths, 1000 iterations
time / ms        current   optimized     
n=1e1, m=1e1       63.69       56.82    
n=1e1, m=3e1      213.32       87.55   
n=1e2, m=1e2     2035.69     1296.77  

Empirical conclusions
For all graphs that were tested, the optimized algorithm was faster. The speedup for a tree is a factor of 2, which is in the lower end. For the very smallest graph, the speedup is "only" on the scale of 10%. However, these functions execute so fast that function calls and other Python overhead probably plays a large role. For the largest dense strongly connected graph, the observed speedup is a factor of 50. It is argued below that the speedup for such graphs can be up to O(n).

Analysis

Optimization 1 (use queue, not sets):
The current implementation keeps a set "thislevel" of unexpanded nodes that it goes through. All neighbours of "thislevel" are added to another set "nextlevel", whether they have been expanded or not. In the graph below, that means that all the bottom nodes will get added to "nextlevel" n times.
The current implementation will add each bottom node to a set n times, and do a set lookup n times for each bottom node. The suggested implementation only does a set lookup n times for each bottom node.
In the suggested optimization, it is checked whether a node has been expanded before it is added to the queue. The graph above is the one referred to as "amnesia(n)" in the results table. The empirical speedup is about a factor of 6, independent of n.

Optimization 2 (early termination):
In the current implementation, we check all neighbours of the nodes in "thislevel" until no new unvisited nodes are found. In the suggested optimization, we count the number of unvisited nodes using just an integer counter. Whenever an unvisited node is found, we decrement. If the counter reaches zero, search can terminate. This can make a big difference for dense, strongly connected graphs.

In the extreme case, we have the complete graph, which requires checking (n-1)^2 edges in the current implementation, but only (n-1) edges in the suggested optimization. That is to say, an O(n) reduction of the required work. The measured speedup for the complete graph with 100 nodes is a factor of 60. For a large dense graph with 1000 nodes and approx. 100,000 random (directed) edges, the measured speedup is a factor of 50.

A risk case is if the graph changes during the search. This can cause incorrect results. This can also happen with the current implementation. However, the optimized algorithm will still terminate (unless new nodes and edges are added faster than the algorithm can process them), because it also terminates when the queue runs out.

Impact

The function _single_shortest_path_length is directly used by
  • single_source_shortest_path_length 
  • single_target_shortest_path_length
  • all_pairs_shortest_path_length
Indirectly, references were found to these functions in:
  • algorithms.distance_measures
  • algorithms.distance_regular
  • algorithms.efficiency_measures
  • algorithms.bipartite.centrality
  • algorithms.centrality.closeness
  • algorithms.connectivity.stoerwagner
  • generators.ego
It may be used by other functions within networkx, but it's hard to determine since functions can be passed as arguments by another name.

Details, discussion

Currently, the function accepts a dict "firstlevel", where the values are not used, only the keys. In all functions that call _single_shortest_path_length directly (which is just 2), "firstlevel" is either {source: 1} or {target: 1}. 

It may be a desirable use case to "seed" the shortest_path_length with different nodes predetermined to be at different levels. This is not supported by the current implementation, however. It is not supported by this optimization either.

Further suggestion

Either: make the argument "firstlevel" a set, so that shortest distance is counted from a number of source nodes. This use case is currently supported, although it is obfuscated by the fact that a dict is expected. So basically, just change the docstring. 

Or: replace "firstlevel" with a single "source" node. The function is, after all, called _single_shortest_path_length.

Update 2019-10-24

This pull request led to a merge, see the rest of the conversation here:
https://github.com/networkx/networkx/pull/3647

References

[1] Hagberg, Aric, et al. "NetworkX." URL http://networkx.github.io/index.html (2013).

Wednesday, 9 October 2019

Finding all maximal cliques Part II

Previous:

In the previous post, I presented an algorithm for finding all maximal cliques in a graph. A subgraph Q is a clique if all nodes in Q are connected to each other. We call Q a maximal clique if there is no node outside of Q that is connected to all nodes in Q. The maximal cliques are not to be confused with the "maximum clique", the maximal clique that is largest in some sense (e.g. number of nodes or sum of edge weights). 

The key idea of the algorithm was that we only need to consider sorted cliques, where the sorting key can be chosen arbitrarily. Sorting the nodes based on their degree, so that we first append lower degree nodes to cliques, should help us reduce the remaining pool of candidate nodes more quickly. The algorithm was indeed faster than networkx's implementation for large, random graphs. The speedup was about 30%. However, for graphs with more structure, particularly the complete graph, the algorithm is several times slower than the networkx implementation. To see why, we have to look at the method that networkx uses. 

It is objectively trivial to find all maximal cliques in the complete graph (there is just one, the set of all nodes), so it should be a trivial case for a maximal clique-finding algorithm. The networkx implementation is based on research, such as [1, 2, 3]. The common key idea is to use a pivot element. During the recursion, we always keep track of the current clique, and the remaining candidates, the nodes that are connected to all the nodes in the current clique. The pivot element can be chosen arbitrarily among the candidates, but we shall see that there is an optimal way to choose it. The pivot element P partitions the candidates into two groups: those that are neighbours of P, call those P-minus, and those that are not, call those P-plus. Note that P is not a neighbour of itself, so P is in P-plus.

Now, adding any subset of P-minus to the current clique will not form a maximal clique, because we can always add P to make it larger (since P is a neighbour of all nodes in P-minus). So, we don't need to branch on any element in P-minus. The optimal way to choose P is of course the one that produces the largest P-minus, since that prunes the most branches.

Let's examine what the pivot trick does to the complete graph, say K(10). In the beginning, all nodes are candidates. All candidates have 9 neighbours, so we pick a P at random. Now we loop through all non-neighbours of P, which happens to be just P! In the recursion, we feed the algorithm with the neighbours of P, which is K(9), the complete graph with 9 nodes. The same thing happens every time we recurse. In practice, the branching factor for the complete graph is just 1, and the running time is linear.
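For reference, here is a minimal sketch of the Bron-Kerbosch recursion with pivoting [1, 2, 3], roughly the scheme networkx builds on (my own condensed version, not the library code):

def bron_kerbosch_pivot(adj, Q=None, cand=None, done=None):
    """adj[u] is the set of neighbours of u, excluding u itself."""
    if Q is None:
        Q, cand, done = [], set(adj), set()
    if not cand and not done:
        yield list(Q)
        return
    # the pivot with the most neighbours among the candidates prunes the most
    pivot = max(cand | done, key=lambda u: len(cand & adj[u]))
    for u in list(cand - adj[pivot]):       # branch only on "P-plus"
        Q.append(u)
        yield from bron_kerbosch_pivot(adj, Q, cand & adj[u], done & adj[u])
        Q.pop()
        cand.remove(u)
        done.add(u)

On the complete graph, cand - adj[pivot] is always just {pivot}, so the branching factor is 1 and the recursion depth is n, matching the argument above.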

Can the pivot trick be used together with the node sorting? If there is a way, I haven't found it.

References

[1] Cazals, Frédéric, and Chinmay Karande. "A note on the problem of reporting maximal cliques." Theoretical Computer Science 407.1-3 (2008): 564-568.

[2] Tomita, Etsuji, Akira Tanaka, and Haruhisa Takahashi. "The worst-case time complexity for generating all maximal cliques and computational experiments." Theoretical Computer Science 363.1 (2006): 28-42.

[3] Bron, C., and Kerbosch, J. "Algorithm 457: finding all cliques of an undirected graph." Communications of the ACM 16.9 (1973): 575-577.

Tuesday, 8 October 2019

Finding all maximal cliques Part I

I want to find all maximal cliques in a given graph G. A subgraph Q is a clique if all nodes in Q are connected to each other. In a clique, everyone knows everyone. We call Q a maximal clique if there is no node outside of Q that is connected to all nodes in Q. Example:

The two maximal cliques are circled. Note that e.g. {1, 2} is a clique but not a maximal clique, because both 1 and 2 are connected to 3.
I read a paper [1] which is about finding triangles in a graph. A triangle is a clique with exactly 3 members. An interesting observation is made that we only need to consider sorted triplets (u, v, w), u <= v <= w, where the sorting key can be anything we like. In [1], the degree of the node in ascending order is used, which is smart because it leads to a smaller branching factor. It should be possible to apply the same idea to cliques of any size. 

I came up with an algorithm like:

def find_cliques_v001(G):
    adj = {u: {v for v in G[u] if v != u} for u in G}
    higher_degree_adj = get_higher_degree_adj(G)
    Q = []
    def get_cliques(cand, hdeg_adj_v):
        for u in cand & hdeg_adj_v:
            Q.append(u)
            cand_u = cand & adj[u]
            if not cand_u:
                yield Q[:]
            else:
                for clique in get_cliques(cand_u, higher_degree_adj[u]):
                    yield clique
            Q.pop()
    return get_cliques(set(G), set(G))


Term explanation:
adj: adj[u] is the set of nodes adjacent to u, excluding u itself.
higher_degree_adj: higher_degree_adj[u] is the set of nodes adjacent to u with a higher degree than u. If two adjacent nodes have the same degree, the tie is broken arbitrarily (but consistently).
Q: the current clique.
cand: all nodes that are connected to all nodes in Q. If cand is empty, we have found a maximal clique, and we return Q.

When we add a node u to the current clique, we recurse but iterate only over the higher-degree neighbours of u. Suppose u has a neighbour v with a lower degree that is also in cand. If there is a maximal clique that contains both u and v, we will find it when we recurse from v instead.
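The helper get_higher_degree_adj is not shown above. A possible reconstruction from the description (mine, with ties broken deterministically by node label instead of randomly, assuming the labels are comparable):

def get_higher_degree_adj(G):
    # total order on nodes: degree first, node label only to break ties
    key = {u: (len(G[u]), u) for u in G}
    return {u: {v for v in G[u] if v != u and key[v] > key[u]} for u in G}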

Results
Let's try this out on some random graphs, and compare it to networkx's builtin clique finder [2]:

time / ms       networkx        v001
n=1e1, m=1e1         0.1         0.1
n=1e2, m=1e2         0.3         0.6
n=1e2, m=1e3         1.5         1.4
n=1e3, m=1e3         2.8        30.7
n=1e3, m=1e4        13.3        12.7
n=1e3, m=1e5       634.4       373.9
n=1e3, m=2e5      9507.6      6909.8
Wow! For the largest graphs, it is significantly faster! For the smaller graphs, probably the overhead of building the higher_degree_adj is not worth it. 

However, most graphs in reality are not just "random" but they have some structure. Let's see how we do on these:

time / ms                     networkx               v001
n=1e1, m=1e1                       0.1                0.1
n=1e2, m=1e2                       0.4                0.6
n=1e2, m=1e3                       1.4                1.3
n=1e3, m=1e3                       2.7                4.8
n=1e3, m=1e4                      11.8               15.1
n=1e3, m=1e5                     613.3              350.7
n=1e3, m=2e5                    9242.6             6617.7
caveman(100,10)                   12.4               81.3
windmill(10, 10)                   0.9                7.1
nary_tree(3, 1000)                 3.3                5.5
complete(20)                       0.6              703.4
Whoops! Dismal scores on all the "special structure" graphs! It seems I cheered too soon. The complete graph in particular is trivial for networkx but very cumbersome for my algorithm, which is probably a clue to where the weakness lies.

See Part II for the continuation of this!

References
[1] Roughgarden, Tim. "CS167: Reading in Algorithms Counting Triangles." (2014).
[2] networkx.algorithms.clique.py (find_cliques)