
Monday 28 October 2019

TSP - Representation and basic operations

Given a set of points and their pairwise distances, we want to find the shortest path through all of them. This is known as the travelling salesman problem. It is NP-hard. Here I will consider the version where the path must go back to the original node, to make a cycle. It can be turned into the non-cyclic version, and vice versa.

Representation

We are given N points. Without loss of generality, they can be represented as the integers 0..N-1. How do we represent a solution? One option is to see the problem of finding the best tour as selecting N edges from the complete graph on N vertices, under the constraint that the selected edges must form a single cycle. This representation is good for a MIP approach, but not for more specialized TSP solving. I did write a MIP model for TSP, but with CBC it can only solve instances with about 7 nodes within 10 seconds. 10 nodes take over a minute, which makes MIP uninteresting as an approach.

A smarter representation comes from thinking of the solution as the nodes visited in a certain order. We could use an array list for this, and slices could perhaps be manipulated efficiently. But a linked list just seems more natural.

Example TSP with solution
How do we represent the linked list? A dict is often the natural choice in Python, and so it is here. The implicit representation becomes that of a directed cycle, rather than an undirected one. Why choose a dict? Dicts are fast in Python, and the manipulations become simple to implement, as we shall see. The dict that holds the edges will be referred to as G, for graph.
The used representation: a dict G with N key-value pairs. In this case, G[u] = v.
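
To make the representation concrete, here is a small sketch (not code from the post; the distance lookup dist[u][v] is an assumed helper):

import random

def random_tour(n):
    # G[u] = v means the directed tour goes from node u to node v
    order = list(range(n))
    random.shuffle(order)
    return {u: v for u, v in zip(order, order[1:] + order[:1])}

def tour_length(G, dist):
    # total length of the directed cycle, given a distance lookup dist[u][v]
    return sum(dist[u][v] for u, v in G.items())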

The Triple Switch

The triple switch is really neat. It involves three edges. Two edges are cut, thus separating a segment from the tour. The gap is closed with a new edge. A third edge is cut, and the segment is spliced into the resulting gap by creating two new edges. That is one way of seeing it! In fact, the three cut edges define three distinct paths, and the way they are connected to each other is cycled.


The before & after of the triple switch.
The best thing is, it can be done with a single line in Python:
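
Since the one-liner itself isn't reproduced here, a sketch of how it might look: a, c and e must appear in that order along the directed tour, and b = G[a], d = G[c], f = G[e] are their current successors.

def triple_switch(G, a, c, e):
    # cycle the successors of a, c and e; the three path segments between
    # them are reconnected in cycled order, and the tour stays one cycle
    G[a], G[c], G[e] = G[c], G[e], G[a]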

Note that this requires that the nodes are oriented as in the picture, otherwise the graph becomes disconnected.

The Cross Switch

The cross switch takes two edges that cross each other and replaces them with two edges that do not cross. This makes the total tour shorter (which can be shown by applying the triangle inequality to the convex quadrilateral formed by the four endpoints).

The three steps of the cross switch. 1) crossing edges switch heads. 2) change direction of one of the loops. 3) switch heads again. 

The cross switch can also be written quite neatly:
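
The post's own three-step version isn't reproduced here; the sketch below is the standard segment-reversal way of realizing the same move on the successor dict, assuming a comes before c along the directed tour.

def cross_switch(G, a, c):
    # replace edges (a, G[a]) and (c, G[c]) with (a, c) and (old G[a], old G[c]),
    # reversing the segment in between so the tour remains one directed cycle
    b, d = G[a], G[c]
    prev, node = b, G[b]
    while prev != c:                 # reverse successor pointers along b .. c
        nxt = G[node]
        G[node] = prev
        prev, node = node, nxt
    G[a] = c
    G[b] = d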


Ruin & Recreate

Ruin & Recreate is quite a popular principle for solving TSP in practice. The basic operation is: select some subset of the nodes, for example a cluster, and remove all their incident edges. That is the ruin step. In the recreate step, solve TSP with the ruined nodes only, under the constraint that they must link up to the existing path. It was described in [1].

References

[1] Schrimpf, Gerhard, et al. "Record breaking optimization results using the ruin and recreate principle." Journal of Computational Physics 159.2 (2000): 139-171.

Thursday 24 October 2019

Reverse search engine

A search engine takes a query, and returns a matching document from a set of documents. A reverse search engine takes a document and produces a query that will match the document better than the other documents in the set. Who does this? Well, everyone who uses Google, for example.

The user starts out with an (incomplete) idea of the document to be found, and tries to write a query that will make Google match this document without matching similar but incorrect documents. This is clearly not trivial, since some people seem to be very bad at it. But let's not dwell on how to do this with Google in particular; let's instead define an abstract problem.

Smallest unique subset

We start out with a set S of documents. Don't consider the structure of the documents; just treat them as sets of words (the bag-of-words model). They needn't be words either, just any elements that can be compared for equality.

Suppose we are dealing with a very simple search engine that we want to reverse. Basically, it takes a query Q, which is a set of words, and returns the documents that contain all the words in Q. If D contains all the words in Q, then we say that D matches Q.

Now we are given a document D, and want to find a set of words B that D matches, but that no other document in S matches. Furthermore, we want to return a smallest such B. Such a B need not exist: B exists if and only if D is not a subset of any other document in S.

We should analyze the "Smallest Unique Subset" problem to see if we can hope to find a minimal B fast.

Analysis: hitting set on complements

A quick survey of the well-known NP-complete problems shows that "Smallest Unique Subset" is essentially the hitting set problem [1]. The hitting set problem is: given a collection S of sets S_i of elements from some universe U, find a smallest "hitting set" H such that H contains at least one element of every S_i. Hitting set is reducible to Smallest Unique Subset in the following way:

Suppose we can solve Smallest Unique Subset in polynomial time. Take a hitting set instance, and replace every S_i with its complement within the universe U. Call the resulting collection of complements C. Now call Smallest Unique Subset with C as the document set and the universe U as the document D. The answer B fails to match a complement U \ S_i exactly when B contains an element of S_i, so the smallest such B is exactly a smallest hitting set H for the original instance.

Conversely, Smallest Unique Subset is reducible to Hitting Set: to solve it, call Hitting Set on the complements of the documents in S, after first filtering out all elements that are not in D.

So Smallest Unique Subset is NP-hard (and its decision version is NP-complete).

Greedy implementation

Exhaustive search for this problem can be done in time 2^|D| times polynomial factors of |S| and |U|, so it can only be used for small, uninteresting instances. Since the problem is NP-complete, we can't hope for a solution that is always both optimal and fast; one of the design goals "optimal", "always" and "fast" has to go. With a greedy implementation, we give up "optimal". The idea is very simple: pick the element of D that is contained in the fewest other documents in S, that is to say, the "rarest" element of D. The documents that do not contain it can no longer match, so they are removed from consideration, and the selection is repeated with a new rarest element until no other documents remain.

This assumes that there are no duplicate elements within the sets of S.
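
A minimal sketch of the greedy selection (the original code isn't reproduced here; the printout mirrors the filtration log shown below):

def smallest_unique_subset(D, S):
    # D is a set of words, S an iterable of the other documents (also sets)
    remaining = [doc for doc in S if doc != D]    # competitors still matching B
    B = set()
    while remaining:
        print("Remaining elements:", len(remaining))
        if not (D - B):
            raise ValueError("D is a subset of another document; no B exists")
        # the rarest element of D: contained in the fewest remaining competitors
        rarest = min(D - B, key=lambda w: sum(w in doc for doc in remaining))
        B.add(rarest)
        # documents missing the rarest element can no longer match B
        remaining = [doc for doc in remaining if rarest in doc]
    return B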

Stress testing the greedy implementation

I want to target this towards something like a dataset of Wikipedia articles. Wikipedia has about 8 million articles. Most are quite short, something like 500 words long. Suppose English has 10,000 common words. We represent the words with integers. In a simple model, let's assume that the words are uniformly distributed at random. With 100,000 documents in S, the filtration goes like this:

Remaining elements: 100000
Remaining elements: 4627
Remaining elements: 190
Remaining elements: 2
B: [3913, 1283, 1311, 7171]
time: 3.866 s

So it takes almost 4 seconds for a single document, when we check only 100,000 documents. Each filtration reduces the size about 20-fold, which makes sense because we have documents with 500 random words from a dictionary of 10,000 words.

If we want to find a SUS for each D in S, there is a way to speed up the computation a lot. We can store a dictionary that maps each word to the IDs of the documents that contain it. Given D, we can then do the first selection by just polling this dict and checking the lengths of its values. The remaining selections can be done with the original algorithm. The time saving should be about a factor of 20.
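
A sketch of that speed-up (an inverted index; a hypothetical helper, not code from the post):

from collections import defaultdict

def build_index(S):
    # map each word to the set of document ids that contain it
    index = defaultdict(set)
    for doc_id, doc in enumerate(S):
        for word in doc:
            index[word].add(doc_id)
    return index

# the first (most expensive) selection for a document D then becomes:
# rarest = min(D, key=lambda w: len(index[w]))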

References
[1] Wikipedia: Set cover / hitting set.

Compression: Word bias in LZ78

Previous:
In the previous post, I implemented a simple (but powerful, and often used) compression algorithm in Python. Look at the last 20 matches it makes at the end of a 6.1MB file (big.txt - link leads to download [1]). I removed about 0.4MB at the end of big.txt, so that it ends at the end of "War and Peace".

nd to r
ecognize a
 motion w
e did n
ot feel;
 in the prese
nt case
 it is sim
ilarly n
ecessary to ren
ounce a
 freedom th
at does n
ot exist
, and to re
cognize a
dependence of
which we a
re not co
nscious.

To me, it seems wasteful that there are so many nonsense words in the phrase tree. The nonsense words appear when we start encoding in the middle of a word. My idea of language is that it consists of words with meaning, and if we preserve the words better in encoding, we should get longer matches and therefore better compression.

Restarting encoding at the beginning of words

I implemented an encoder and decoder that, after each emitted sequence, restarts reading from the last non-alphabetic character, as long as the matched sequence contains a non-alphabetic character. The phrase tree is still expanded from the longest possible match.

we did
did not fe
feel;
 in the pre
present ca
case it is s
similarly n
necessary to ren
renounce a
a freed
freedom tha
that does
does not exi
exist, an
and to recog
recognize a d
dependence of
of which w
we are not co
conscious.

The average match length is clearly longer; however, the compression rate is not improved. The obvious waste is that most of the text is "transmitted twice", because consecutive sequences overlap. If we don't transmit the sequence corresponding to the full match when emitting, but let the end of it be implied by the beginning of the next sequence, we can do much better:

original                            6331174
LZ78 (plaintext)                    4209692
LZW (plaintext)                     3357669
LZ78 + word backtrack               4580235
Word backtrack + implicit initial   3301512
unix compress                       2321803

So we are able to beat plaintext LZW by a small margin, but we are still much worse than the binary representation, despite using a more data-specific model.

References
[1] norvig.com

Wednesday 23 October 2019

Compression: LZ78

Compression fascinates me. The value added is obvious: we use less storage and transfer resources without losing any information. We pay for this with the time it takes to compress and decompress. The algorithms have two obvious performance metrics for a given input: compression rate and run-time. I am most interested in optimizing the compression rate, though one shouldn't forget the importance of speed for practicality.

No Free Lunch

No compression algorithm can compress all possible input files without loss. We can call this the "No Free Lunch" theorem of compression. It is very easy to prove. A compression algorithm maps an input file to an encoded file. For lossless compression, this mapping must be invertible, so two input files can never map to the same encoded file. Therefore, the compressor maps the space of input files one-to-one onto itself, and the average size of the encoded files must equal the average size of the input files.

For this reason, all compression relies on identifying a subset of files that we are interested in compressing, and then writing an algorithm that maps these files to a smaller size. There is a trade-off between generality and performance at work here. In the extreme, if all we ever want is to store or transmit two different files, we would only need 1 bit: either 0 or 1.

Another way to see it is to think about the set of files that we want to target for compression, and ask whether the naive representation is redundant. By naive representation I mean plain text. Some examples:

  • If we want to compress source code files, we can observe that names of variables and functions are long strings that describe their purpose and identity. This is what we want for human readability, but it is not a necessary representation. The names could just be random, compact strings. Since variable names in source code are used at least twice (if they are both assigned and used), it is probably worth it to store a dictionary with full names and short names.
  • If we want to compress large matrices with a lot of zeroes, we should use a sparse matrix representation. The simplest representations store a list of (row, column, value) triplets, so they give a positive compression rate if less than about a third of the entries are non-zero. Otherwise, this representation makes the matrix file bigger!
  • In video compression, we can save a lot of space if only small parts of the scene change between exposures. In that case, we only need to send a small diff image. If the camera is moving, this is not possible to the same degree. The same goes if there is a lot of noise. In that case, noise has a twofold negative impact on the video: it reduces the quality and makes the file larger! Reflections and blinking lights in the image also make compression harder. In fact, one way to detect problems with security cameras is to check whether their compression rate drops suddenly.

Lempel Ziv 1978

I found a very good book which deals with compression of data strings. It's the PhD thesis of Ross N. Williams, Adaptive Data Compression (link leads to download) [1]. The first chapter of the book provides a thorough introduction to the field of data compression. Starting on page 52, Williams describes the LZ78 algorithm.

From page 54 of Williams's Adaptive Data Compression [1]
The idea is to incrementally build a tree of phrases. Every path from the root node represents a phrase that has been encountered in the data. Every edge holds a symbol. Phrases that have the same initial strings can reuse the same nodes. A node is identified by its ID, which we can assume is a natural number. The tree is transmitted by sending child->parent edges together with the corresponding symbol. The child always uses the lowest available number for its ID, so we only need to send the parent ID and the symbol. As explained on page 54 in [1], we don't even need to send the symbol, since it can be inferred to be the first symbol in the next phrase to be sent.

Python Implementation

The encoding is super easy. The phrase tree G can be represented with a dict of dicts where the keys of the top dict are sequence IDs, the keys of the sub dicts are symbols, and the values of the sub dicts are sequence IDs.
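
Since the code isn't reproduced here, a minimal encoder sketch using that representation (with the seq/seq_old bookkeeping mentioned below) might look like this:

def lz78_encode(symbols):
    G = {0: {}}                      # phrase tree: G[node_id][symbol] -> child id
    next_id = 1
    seq = seq_old = 0                # current and previous node in the phrase tree
    out = []                         # emitted (parent_id, symbol) pairs
    for symbol in symbols:
        seq_old = seq
        if symbol in G[seq]:
            seq = G[seq][symbol]         # keep extending the current match
        else:
            out.append((seq, symbol))    # emit longest match + the new symbol
            G[seq][symbol] = next_id     # grow the phrase tree
            G[next_id] = {}
            next_id += 1
            seq = 0                      # start matching again from the root
    if seq != 0:
        # the input ended inside an existing phrase: emit that non-leaf node,
        # encoded as (the node before the last symbol, the last symbol)
        out.append((seq_old, symbol))
    return out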

Note that we need seq_old for the case where the input ends in the middle of an existing phrase; we then emit a non-leaf node. The decoding is almost as easy:
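
A matching decoder sketch, reading (parent_id, symbol) pairs until the token stream runs out:

def lz78_decode(pairs):
    phrase = {0: []}                 # node id -> the symbols of that phrase
    next_id = 1
    out = []
    tokens = iter(pairs)
    while True:
        try:
            parent, symbol = next(tokens)
        except StopIteration:
            break                    # only complete pairs are ever emitted
        new = phrase[parent] + [symbol]
        out.extend(new)
        phrase[next_id] = new
        next_id += 1
    return out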

Note that if we hit the end of the token stream (StopIteration), there is no half-finished work to deal with: the encoder only ever emits complete pairs.

Testing correctness is easy: just throw in some large text files and check that each file is identical after decompression, e.g. os.system("diff {} {}".format(fn, fn_reconstructed)). (Warning: diff prints the differences to screen if the files aren't identical, which can be a lot of output for large files; add the flag '--brief' to minimize printing.) Debugging is straightforward: edit a test file by trial and error down to the shortest possible file that triggers an error. Many bugs can be found with files that are only a handful of symbols long.

The encoded output of this very simple algorithm looks like this:


The original is:


Which is considerably shorter. So in the beginning, while the phrase tree is still very small, the overhead of the compression is definitely larger than the space saved by abbreviating matching strings to their sequence ID.

Testing on a file (Nils Holgersson, a Swedish novel) which is 1.1MB gives (in bytes):

Before:  1146989
After:    1176260

Well darn it, the "compressed" file is actually larger than the original plain text.

Optimization and Testing

It's really wasteful to use integer IDs written out in base 10 plaintext. The IDs can be strings over a larger alphabet. To begin with, we can use all digits and letters, lower and upper case, plus some punctuation. Just raiding the keyboard gives me 89 usable characters. Note that it is very important to only pick characters that are represented with one byte. The punctuation I used was:

".:,;!%&/()=@${[]}^~'*<>|-`\"

And the full alphabet:

"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:,;!%&/()=@${[]}^~'*<>|-`\"

With this new encoding the compressed file looks like:

Which definitely looks more like mumbo-jumbo than before. Which is good! A more random-looking encoded file means less redundancy is left in the output, i.e. the encoding is doing more useful work. With this, the result is:

Before:  1146989
After:    830805

Yes! We compressed the file by about 25%. Is that good? The classic Unix compression tool, compress, manages to get the file down to:

original    1146989
this         830805
compress     417624

So the built-in tool's output is less than half the size of ours! What compression algorithm does compress use? A quick lookup tells us that it uses LZW, which happens to be a simple optimization of LZ78. The optimization is that the final symbol of each phrase needn't be transmitted, since it can be inferred from the next sequence ID. We can easily calculate how big the file would be with that optimization applied to our implementation:

original    1146989
this         830805
LZW opt      657987
compress     417624

So even with the LZW optimization, we're not close to the built-in program. It's possible that more optimizations are used, but a lot of the difference is explained by the fact that compress works on a pure binary level. The encoded file from compress looks like:


[1] has the following to say about LZW:


References
[1] Williams, Ross N. Adaptive data compression. Vol. 110. Springer Science & Business Media, 2012.

Friday 11 October 2019

Optimizing shortest_path_length in networkx

Summary

As of networkx version 2.4rc1 [1], the helper function
networkx.algorithms.shortest_paths.unweighted._single_shortest_path_length
can be optimized.

Current implementation: add neighbours of visited nodes to "nextlevel" set. Later, iterate through "nextlevel" set to expand their unseen neighbours. Iterate until "nextlevel" is empty.

Suggested optimization 1: keep track of nodes to be expanded in a queue (collections.deque) that remains sorted by distance from source node. Do not add already visited nodes to any set.

Suggested optimization 2: terminate immediately if all nodes have been visited. This saves a lot of work for dense, strongly connected graphs.

Investigation and analysis show the effectiveness of the optimization. The measured speedup is at least a factor of 2. For dense, strongly connected graphs, the speedup can be up to O(n), where n is the number of nodes.

Code

Current implementation


Suggested implementation
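
A rough sketch of the suggested version (the exact code that was merged is in the pull request linked below):

from collections import deque

def _single_shortest_path_length(adj, firstlevel, cutoff):
    # adj        : adjacency dict, adj[u] -> iterable of the neighbours of u
    # firstlevel : dict whose keys are the source node(s)
    # cutoff     : maximum distance to report (assumed numeric)
    seen = set(firstlevel)                     # optimization 1: no level sets
    queue = deque((v, 0) for v in firstlevel)  # queue stays sorted by distance
    unvisited = len(adj) - len(seen)           # optimization 2: counter
    while queue:
        u, level = queue.popleft()
        if level > cutoff:
            break
        yield u, level
        if unvisited == 0:
            continue      # every node is already discovered: just drain the queue
        for w in adj[u]:
            if w not in seen:                  # visited nodes never enter the queue
                seen.add(w)
                unvisited -= 1
                queue.append((w, level + 1))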

Results

Both algorithms were run on a set of test graphs. This included random graphs of varying sizes. It also included a large tree, a complete graph, and a couple of other special structure graphs. Both versions received exactly the same random graphs. The output was verified to be identical. The benchmarks were re-run with the two algorithms in reversed order, to check that cached variables did not play a role. 

Random + structured graphs, single iteration all_pairs_shortest_paths
time / ms                    current         optimized           
n=1e1, m=1e1                    0.15              0.14             
n=1e1, m=3e1                    0.34              0.18            
n=1e2, m=1e2                    2.04              1.35           
n=1e3, m=1e3                   56.76             31.89         
n=1e3, m=1e5                27857.46            612.96        
caveman(100,10)              3373.64           1168.82       
windmill(10, 10)               32.08              4.09          
nary_tree(3, 1000)           1886.88            810.51       
complete(100)                 205.25              3.48           
mlp(10, 10)                    15.09              5.70          
amnesia(100)                  261.17             41.43         

Small random graphs, total all_pairs_shortest_paths, 1000 iterations
time / ms        current   optimized     
n=1e1, m=1e1       63.69       56.82    
n=1e1, m=3e1      213.32       87.55   
n=1e2, m=1e2     2035.69     1296.77  

Empirical conclusions
For all graphs that were tested, the optimized algorithm was faster. The speedup for a tree is a factor of 2, which is at the lower end. For the very smallest graphs, the speedup is "only" on the scale of 10%. However, these functions execute so fast that function calls and other Python overhead probably play a large role. For the largest dense, strongly connected graph, the observed speedup is a factor of 50. It is argued below that the speedup for such graphs can be up to O(n).

Analysis

Optimization 1 (use queue, not sets):
The current implementation keeps a set "thislevel" of unexpanded nodes that it iterates through. All neighbours of "thislevel" are added to another set "nextlevel", whether they have already been visited or not. In the graph below, that means that every bottom node gets added to "nextlevel" n times.
The current implementation will add each bottom node to a set n times, and do a set lookup n times for each bottom node. The suggested implementation will only do a set lookup n times for each bottom node.
In the suggested optimization, it is checked whether a node has already been visited before it is added to the queue. The graph above is the one referred to as "amnesia(n)" in the results table. The empirical speedup is about a factor of 6, independent of n.

Optimization 2 (early termination):
In the current implementation, we check all neighbours of the nodes in "thislevel" until no new unvisited nodes are found. In the suggested optimization, we count the number of unvisited nodes using just an integer counter. Whenever an unvisited node is found, we decrement. If the counter reaches zero, search can terminate. This can make a big difference for dense, strongly connected graphs.

In the extreme case, we have the complete graph, which requires checking (n-1)^2 edges in the current implementation, but only (n-1) edges in the suggested optimization. That is to say, an O(n) reduction of the required work. The measured speedup for the complete graph with 100 nodes is a factor of 60. For a large dense graph with 1000 nodes and approx. 100,000 random (directed) edges, the measured speedup is a factor of 50.

A risk case is if the graph changes during the search. This can cause incorrect results. This can also happen with the current implementation. However, the optimized algorithm will still terminate (unless new nodes and edges are added faster than the algorithm can process them), because it also terminates when the queue runs out.

Impact

The function _single_shortest_path_length is directly used by
  • single_source_shortest_path_length 
  • single_target_shortest_path_length
  • all_pairs_shortest_path_length
Indirectly, references were found to these functions in:
  • algorithms.distance_measures
  • algorithms.distance_regular
  • algorithms.efficiency_measures
  • algorithms.bipartite.centrality
  • algorithms.centrality.closeness
  • algorithms.connectivity.stoerwagner
  • generators.ego
It may be used by other functions within networkx, but it's hard to determine since functions can be passed as arguments by another name.

Details, discussion

Currently, the function accepts a dict "firstlevel", where the values are not used, only the keys. In all functions that call _single_shortest_path_length directly (which is just 2), "firstlevel" is either {source: 1} or {target: 1}. 

It may be a desirable use case to "seed" the shortest_path_length with different nodes predetermined to be at different levels. This is not supported by the current implementation, however. It is not supported by this optimization either.

Further suggestion

Either: make the argument "firstlevel" a set, so that shortest distance is counted from a number of source nodes. This use case is currently supported, although it is obfuscated by the fact that a dict is expected. So basically, just change the docstring. 

Or: replace "firstlevel" with a single "source" node. The function is, after all, called _single_shortest_path_length.

Update 2019-10-24

This pull request led to a merge, see the rest of the conversation here:
https://github.com/networkx/networkx/pull/3647

References

[1] Hagberg, Aric, et al. "NetworkX." URL: http://networkx.github.io/ (2013).

Wednesday 9 October 2019

Finding all maximal cliques Part II

Previous:

In the previous post, I presented an algorithm for finding all maximal cliques in a graph. A subgraph Q is a clique if all nodes in Q are connected to each other. We call Q a maximal clique if there is no node outside of Q that is connected to all nodes in Q. The maximal cliques are not to be confused with the "maximum clique", the maximal clique that is largest in some sense (e.g. number of nodes or sum of edge weights). 

The key idea of the algorithm was that we only need to consider sorted cliques, where the sorting key can be chosen arbitrarily. Sorting the nodes based on their degree, so that we first append lower degree nodes to cliques, should help us reduce the remaining pool of candidate nodes more quickly. The algorithm was indeed faster than networkx's implementation for large, random graphs. The speedup was about 30%. However, for graphs with more structure, particularly the complete graph, the algorithm is several times slower than the networkx implementation. To see why, we have to look at the method that networkx uses. 

It is trivial to find all maximal cliques in the complete graph (there is just one: the set of all nodes), so it should be an easy case for a maximal-clique-finding algorithm. The networkx implementation is based on published research, such as [1, 2, 3]. The common key idea is to use a pivot element. During the recursion, we always keep track of the current clique and the remaining candidates: the nodes that are connected to all nodes in the current clique. The pivot element can be chosen arbitrarily among the candidates, but we shall see that there is an optimal way to choose it. The pivot element P partitions the candidates into two groups: those that are neighbours of P, call those P-minus, and those that are not, call those P-plus. Note that P is not a neighbour of itself, so P is in P-plus.

Now, adding any subset of P-minus to the current clique can never by itself produce a maximal clique, because we can always add P to make the clique larger (since P is a neighbour of all nodes in P-minus). So we don't need to branch on any element of P-minus; branching on the elements of P-plus is enough. The optimal way to choose P is of course the one that produces the largest P-minus, since that prunes the most branches.
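
To make the pivoting concrete, here is a sketch of the Bron-Kerbosch recursion with a pivot, roughly the method of [1, 2, 3] (the 'done' set, glossed over in the prose above, is what prevents the same clique from being reported more than once):

def find_cliques_pivot(adj):
    # adj[u] is the set of neighbours of u, excluding u itself
    def expand(Q, cand, done):
        if not cand:
            if not done:
                yield set(Q)              # nothing can extend Q: it is maximal
            return
        # pivot: the candidate with the most neighbours among the candidates,
        # so that the loop below (over "P-plus") is as short as possible
        pivot = max(cand, key=lambda u: len(cand & adj[u]))
        for u in list(cand - adj[pivot]):
            yield from expand(Q + [u], cand & adj[u], done & adj[u])
            cand.remove(u)
            done.add(u)

    yield from expand([], set(adj), set())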

Let's examine what the pivot trick does to the complete graph, say K(10). In the beginning, all nodes are candidates. All candidates have 9 neighbours, so we pick a P at random. Now we loop through all non-neighbours of P, which happens to be just P! In the recursion, we feed the algorithm with the neighbours of P, which is K(9), the complete graph with 9 nodes. The same thing happens every time we recurse. In practice, the branching factor for the complete graph is just 1, and the running time is linear.

Can the pivot trick be used together with the node sorting? If there is a way, I haven't found it.

References

[1] Cazals, Frédéric, and Chinmay Karande. "A note on the problem of reporting maximal cliques." Theoretical Computer Science 407.1-3 (2008): 564-568.

[2] Tomita, Etsuji, Akira Tanaka, and Haruhisa Takahashi. "The worst-case time complexity for generating all maximal cliques and computational experiments." Theoretical Computer Science 363.1 (2006): 28-42.

[3] Bron, C., and Kerbosch, J. "Algorithm 457: Finding all cliques of an undirected graph." Communications of the ACM 16.9 (1973): 575-577.

Tuesday 8 October 2019

Finding all maximal cliques Part I

I want to find all maximal cliques in a given graph G. A subgraph Q is a clique if all nodes in Q are connected to each other. In a clique, everyone knows everyone. We call Q a maximal clique if there is no node outside of Q that is connected to all nodes in Q. Example:

The two maximal cliques are circled. Note that e.g. {1, 2} is a clique but not a maximal clique, because both 1 and 2 are connected to 3.
I read a paper [1] which is about finding triangles in a graph. A triangle is a clique with exactly 3 members. An interesting observation is made that we only need to consider sorted triplets (u, v, w), u <= v <= w, where the sorting key can be anything we like. In [1], the degree of the node in ascending order is used, which is smart because it leads to a smaller branching factor. It should be possible to apply the same idea to cliques of any size. 

I came up with an algorithm like:
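
The original code isn't reproduced here, but a hypothetical reconstruction of the idea might look like this (the names match the term explanation below):

import random

def find_cliques_v001(adj):
    # total order on the nodes: primarily by degree, ties broken randomly
    order = {u: (len(adj[u]), random.random(), u) for u in adj}
    higher_degree_adj = {u: {v for v in adj[u] if order[v] > order[u]}
                         for u in adj}

    def expand(Q, cand, branch):
        if not cand:
            yield set(Q)                  # nothing can extend Q: it is maximal
            return
        for u in branch:                  # branch only on higher-degree candidates
            yield from expand(Q + [u],
                              cand & adj[u],
                              branch & higher_degree_adj[u])

    nodes = set(adj)
    yield from expand([], nodes, nodes)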



Term explanation:
adj: adj[u] is the set of nodes adjacent to u, excluding u itself.
higher_degree_adj: higher_degree_adj[u] is the set of nodes adjacent to u with a higher degree than u. If two adjacent nodes have the same degree, the tie is broken randomly.
Q: the current clique.
cand: all nodes that are connected to all nodes in Q. If cand is empty, we have found a maximal clique, and we return Q.

When we add a node u to the current clique, we recurse but iterate only over the higher-degree neighbours of u. Suppose u has a neighbour v with a lower degree that is also in cand. If there is a maximal clique that contains both u and v, we will find it when we recurse from v.

Results
Let's try this out on some random graphs, and compare it to networkx's builtin clique finder [2]:

time / ms       networkx        v001
n=1e1, m=1e1         0.1         0.1
n=1e2, m=1e2         0.3         0.6
n=1e2, m=1e3         1.5         1.4
n=1e3, m=1e3         2.8        30.7
n=1e3, m=1e4        13.3        12.7
n=1e3, m=1e5       634.4       373.9
n=1e3, m=2e5      9507.6      6909.8
Wow! For the largest graphs, it is significantly faster! For the smaller graphs, the overhead of building higher_degree_adj is probably not worth it.

However, most graphs in reality are not just "random" but they have some structure. Let's see how we do on these:

time / ms                     networkx               v001
n=1e1, m=1e1                       0.1                0.1
n=1e2, m=1e2                       0.4                0.6
n=1e2, m=1e3                       1.4                1.3
n=1e3, m=1e3                       2.7                4.8
n=1e3, m=1e4                      11.8               15.1
n=1e3, m=1e5                     613.3              350.7
n=1e3, m=2e5                    9242.6             6617.7
caveman(100,10)                   12.4               81.3
windmill(10, 10)                   0.9                7.1
nary_tree(3, 1000)                 3.3                5.5
complete(20)                       0.6              703.4
Whoops! Dismal scores on all the "special structure" graphs! It seems I cheered too soon. The complete graph in particular is trivial for networkx but very cumbersome for my algorithm, so that is probably a clue to where the weakness lies.

See Part II for the continuation of this!

References
[1] Roughgarden, Tim. "CS167: Reading in Algorithms Counting Triangles." (2014).
[2] networkx.algorithms.clique (find_cliques)

Wednesday 25 September 2019

sugarrush

I have released my PySAT extension as an open source package. It's called sugarrush. Get it with:

pip3 install sugarrush

It requires PySAT:

pip3 install python-sat

I wrote this package because I wanted automatic bookkeeping of variables with PySAT, and also to add extra constructs. The design goal is to make it possible to write constraints in a way that is easy to understand, and that looks neat.

See my blog posts that use sugarrush:



Monday 23 September 2019

MIP: Wood puzzle

I found a wooden puzzle in my mother's bookshelf. This is all there is to it:


All pieces can be moved freely. A single piece:


Each piece has one corner in each of the colors Red, Green, Blue, and Yellow. The goal is to place the pieces so that the colors of adjacent corners match. Solving this makes for a neat programming exercise.

Brute Force

How long would it take to just enumerate all solutions? There are 8 pieces and 8 locations, which gives us 8! = 40320 permutations. However, the puzzle can be rotated 180 degrees onto itself, so there are really only 20160 distinct permutations of the pieces. Each individual piece can also be turned in 4 different directions independently of the others, so for each permutation we have 4^8 = 2^16 = 65536 combinations of turns. In total we have 8!/2 * 2^16 ≈ 1.3 billion candidate configurations. So it takes about the same amount of time for the computer to solve it by brute force as for a person to solve it manually.

MIP

The problem is naturally modeled as a constraint problem. For MIP, one can use one binary variable for each combination of location, piece, and turn. In total, 8*8*4 = 256 variables. There are 22 "conflict edges" where colors from two different pieces must match in an adjacent corner.

The most numerous constraints, however, are the structural constraints. We must ensure that each piece gets exactly one location, and vice versa. Each piece must also get exactly one turn. These constraints are all of the same form: a sum of a set of binary variables must equal 1. These are also called symmetric constraints.

For each corner of a location, we must know its color, as a function of the decision variables. This is necessary so that we can express the color matching constraints in terms of the decision variables. For each corner of a location (dot) and each color, I create one auxiliary variable. In total, that becomes 8*4*4 = 128 dot-by-color variables.

Pattern Picking constraints

The dot-by-color variables are bound to the decision variables using pattern picking constraints. For each decision variable X and each dot-by-color variable Y, make the constraint:

If X dictates the same color for Y's dot as Y does, then:

X <= Y

Otherwise:

X + Y <= 1

How does this work? If X is 0, then Y can be either 0 or 1 without violating either constraint. However, if X is 1, then Y is forced to 1 or to 0, depending on which of the two constraints was added. Note that the same Y variables are shared by all decision variables, yet they are bound to all of them without creating infeasibility.
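
As a sketch of how these constraints could be generated (with PuLP and purely hypothetical names; the actual model code isn't shown here):

import pulp

def add_pattern_picking(prob, X, Y, dot_location, color_at):
    # X[loc, piece, turn] and Y[dot, color] are binary LpVariables,
    # dot_location[dot] gives the dot's location, and color_at(piece, turn, dot)
    # gives the color that placing (piece, turn) there puts on that dot
    for (loc, piece, turn), x in X.items():
        for (dot, color), y in Y.items():
            if dot_location[dot] != loc:
                continue                  # this X says nothing about this dot
            if color_at(piece, turn, dot) == color:
                prob += x <= y            # X = 1 forces Y = 1
            else:
                prob += x + y <= 1        # X = 1 forces Y = 0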

Extra constraints for efficiency

It is implicit from the constraints on the pieces that for a given location, there will be exactly one dot per color, and one color per dot. This is something we can tell the solver beforehand, as constraints. This makes the solution go almost 4 times faster.

Result


The solution takes about 8 seconds to find with CBC, so the problem is not trivial for the solver.

Conclusion

Most constraints exist to make the model internally self-consistent. These types of constraints arise from things that humans take for granted, such as a piece only being usable once. Uncovering these internal constraints during the modeling process takes a lot of time.

The pattern picking constraints are quite neat and abstract, and could possibly be packaged as part of a solver interface.

We can do some of the work for the solver by adding constraints that we can logically infer from the problem. This can be important for performance optimizing MIP models.

Saturday 17 August 2019

Playing with Word2Vec

In a lecture by Yann LeCun, he mentions that word2vec embeddings can be used to learn analogy properties. I want to see if any other natural language concepts are modeled by word2vec embeddings.

A word2vec embedding is a vector with on the order of 100 elements that represents a word. The embedding is the hidden layer of a network trained to predict, for each pair of words, the probability that they appear close to each other in a large corpus. An example of an analogy:

London is to England what Paris is to France.
When people have looked at word2vec embeddings, they have found that the following property holds approximately:
w2v(london) - w2v(england) = w2v(paris) - w2v(france)
This difference is also roughly the same for most capital-country pairs. That's quite interesting! Let's see if we can find some other interesting properties.

I use a dataset of 43,000 word vectors from Kaggle. Many of the words have a leading upper-case letter without being proper names. After converting everything to lower case and removing duplicates (keeping only the first to appear in the list), 38,000 words remain.

Relation
First, wouldn't it be great if this was true?
w2v(london) - w2v(england) = w2v(capital)
Looking at the 10 closest elements of london - england:
paris: 0.345
tokyo: 0.321
brussels: 0.315
mayfair: 0.314
uptown: 0.303
nairobi: 0.302
jakarta: 0.302
budapest: 0.299
amsterdam: 0.294
OK, that didn't work. It doesn't work to take england - london either. The similarity to "capital" is only 0.15.

Hypernym
A hypernym of a set of words is a word that describes the set as a whole. For example:
Cutlery is a hypernym of knife, fork, and spoon
Color is a hypernym of red, green, and blue
Can we shake out the hypernym with word vectors?
hypernym(words) = mean([w2v(word) for word in words])
hypernym(["red", "green", "blue"])
yellow: 0.459
purple: 0.440
orange: 0.436
pink: 0.433
brown: 0.418
purple: 0.418
white: 0.407
gray: 0.396
colored: 0.387
crimson: 0.380
maroon: 0.372
pink: 0.370
color: 0.355
Not too bad! Color shows up among the top answers.

w2v.hypernym(["mercedes", "jeep", "ford"])  
1 mercedes: 0.615
2 ford: 0.535
3 jeep: 0.514
4 car: 0.511
5 jeep: 0.505
6 sedan: 0.490
7 jaguar: 0.470
8 buick: 0.468
9 vehicle: 0.458
10 mustang: 0.448
Great! Car is the top answer except for the three included words. 

w2v.hypernym(["bed", "table", "chair"])                                                                                                                                           
1 bed: 0.499
2 chair: 0.474
3 table: 0.410
4 sofa: 0.384
5 couch: 0.378
6 divan: 0.359
7 beds: 0.349
8 chairs: 0.334
9 footstool: 0.333
10 daybed: 0.328
I wanted to see 'furniture' here, but no dice.

Reduce: 'country'
What if we subtract more than one vector from another? Can we peel back layers of meaning this way? Example:
The word 'country' has several meanings. It can be a music genre, it can be a synonym for 'nation', and it can refer to land outside of cities. If we remove the 'land' and 'nation' contexts, we should be left thinking of music.
For the implementation of this, I tried two approaches: simply subtracting away vectors, and also projecting away components. They both worked about as well. The word vector is normalized after each reduction, so the scale is relevant throughout.
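
A sketch of the projection variant (assuming w2v is a plain dict from word to numpy vector; the actual helper class used for these examples isn't shown):

import numpy as np

def reduce(w2v, word, contexts):
    v = w2v[word] / np.linalg.norm(w2v[word])
    for c in contexts:
        u = w2v[c] / np.linalg.norm(w2v[c])
        v = v - np.dot(v, u) * u          # project away the context direction
        v = v / np.linalg.norm(v)         # re-normalize after each reduction
    sims = {w: float(np.dot(v, x) / np.linalg.norm(x)) for w, x in w2v.items()}
    return sorted(sims.items(), key=lambda kv: -kv[1])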

w2v.reduce("country", [])                                                                                                                                                         
2 nation: 0.724
3 world: 0.598
4 globe: 0.514
5 america: 0.486
6 countries: 0.482
7 national: 0.478
8 abroad: 0.472
9 republic: 0.453
10 europe: 0.453

Without context, the strongest association of 'country' is with the sense of 'nation'.

w2v.reduce("country", ["nation"])                                                                                                                                                 
2 abroad: 0.445
3 countryside: 0.374
4 homeland: 0.363
5 border: 0.347
6 province: 0.347
7 villages: 0.346
8 rumanians: 0.336
9 europe: 0.335
10 thailand: 0.335

And with a heap of reductions we can indeed force it to think only of music!

w2v.reduce("country", ["land", "nation", "rural", "abroad", "province", "europe"])                                                                                                
2 genres: 0.211
3 carreer: 0.207
4 singer: 0.206
5 musician: 0.177
6 catatonia: 0.175
7 promoter: 0.173
8 roped: 0.169
9 duet: 0.166
10 crooning: 0.165
11 vocalist: 0.164

Reduce: 'house'

Can we do the same for 'house'? Also a music genre, with more than one meaning.

w2v.reduce("house", [])                                                                                                                                                           
2 senate: 0.702
3 bill: 0.541
4 appropriations: 0.520
5 congress: 0.511
6 commons: 0.487
7 congressional: 0.486
8 senators: 0.483
9 lawmakers: 0.473
10 conferees: 0.447

I make it a rule to always reduce away the top word.

w2v.reduce("house", ["senate"])                                                                                                                                                   
2 manor: 0.461
3 sanctuary: 0.414
4 lodge: 0.409
5 lounge: 0.400
6 gardens: 0.388
7 club: 0.388
8 palace: 0.381
9 gate: 0.377
10 factory: 0.375

... repeating this a number of times ...

w2v.reduce("house", ["senate", "manor", "factory", "lords", "society", "rowdy", "sanctuary"])                                                                                     
2 ways: 0.155
3 true: 0.133
4 diet: 0.132
5 networks: 0.122
6 inset: 0.120
7 r: 0.116
8 feat: 0.114
9 signature: 0.113
10 crestfallen: 0.110

Yes! At least we get 'feat' (short for featuring) and 'true' in there.

Alignment
Can the embeddings be used to sort things? Example:
Since the sun is bigger than a car, 'sun' should be closer to 'big' than 'car' is. 
This is admittedly quite a long shot for such a simple model, and it turns out to be very wrong:

w2v.alignment("big", ["sun", "planet", "car", "dog", "ant"])                                                                                   
1 car: 0.120
2 ant: 0.100
3 dog: 0.099
4 planet: 0.079
5 sun: 0.022

w2v.alignment("healthy", ["carrot", "burger", "wine", "juice", "cake"])                                                                                         
1 juice: 0.221
2 carrot: 0.161
3 cake: 0.084
4 burger: 0.054
5 wine: 0.025

Probably what we are seeing here is mostly how often people talk about these objects in the context of being "big" or "healthy".

Conclusion

Word embeddings can yield surprisingly relevant results with very simple models. However, it is important to remember that they are generated by a simple optimization that only looks at local co-occurrence, with no deeper model of meaning. This becomes apparent when trying to make them perform more abstract tasks.

Sunday 30 June 2019

Searching expressions

During a coffee break at work, the following brainteaser came up:

0 0 0 = 6
1 1 1 = 6
2 2 2 = 6
3 3 3 = 6
4 4 4 = 6
5 5 5 = 6
6 6 6 = 6
7 7 7 = 6
8 8 8 = 6
9 9 9 = 6
10 10 10 = 6
For each row, one is supposed to fill in the operators that make the expression true. One is not allowed to add more digits, of course. The allowed operators are the arithmetic operators, exponentiation, square root, and factorial. For example, we have:

(1 + 1 + 1)! = 6
3 * 3 - 3 = 6
We solved 0 through 10, but struggled with 11. We also wondered whether there were some interesting solutions for 0-10 that we had missed.

I have been intrigued by symbolic programming and automated theorem proving since I first read about them. This seemed like a simple enough problem to start exploring those subjects. The first obstacle, which proved to be the most difficult one in the end, was a conceptual block on my part. From proving mathematical theorems at university, I'm used to the feeling that at each step of a derivation there is an "infinite" number of possible moves, if only one is creative enough.

Tokens: x, y, ... (numbers)
Unary operator: U: x -> y
Merge operator: M: x, y -> z

Example
State: [x, y, z]
Applied operation: M ('add') on indices 0 and 1
Result: [M(x, y), z] = [x + y, z]
Searching through expressions
However, if we want to do a systematic search in the computer, we need to be able to enumerate the possible moves. In our case, we have restricted ourselves to a finite number of operators, and there is a finite number of tokens to apply them to. So the number of moves in each step is not only enumerable, but finite. My idea for looking for an answer here is to simply do exhaustive search, using BFS (breadth-first search). Breadth-first search should help us find one valid solution quickly (if we assume that there is at least one relatively simple solution). However, if we want to find all solutions, it doesn't really matter which search algorithm is used, as we will have to go through all states anyway.

The really important thing is termination conditions. I pruned all states which had at least one token which was not a real integer. I also pruned all states with an integer larger than 1000.

Implementation
The search can be done very conveniently using networkx. The states are represented as nodes (the key is a tuple of the tokens). The edges are directed, and each edge carries a representation of the operation that was applied. This helps verify correctness and display the result.
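
A sketch of how the search might be wired up with networkx (simplified: edge labels don't record operand indices, and the pruning limits are the ones described above):

import math
from collections import deque
import networkx as nx

def is_ok(x):
    # prune anything that is not a (near-)integer, or that is too large
    return x is not None and abs(x) <= 1000 and abs(x - round(x)) < 1e-9

UNARY = {
    "sqrt": lambda x: math.sqrt(x) if x >= 0 else None,
    "fact": lambda x: math.factorial(round(x)) if 0 <= x <= 10 else None,
}
MERGE = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b if b != 0 else None,
    "pow": lambda a, b: a ** b if abs(b) <= 10 and not (a == 0 and b < 0) else None,
}

def search(start):
    # BFS over states; a state is a sorted tuple of integers, e.g. (2, 2, 2)
    start = tuple(sorted(start))
    G = nx.DiGraph()
    G.add_node(start)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        successors = []
        for i, x in enumerate(state):                       # unary operations
            rest = state[:i] + state[i + 1:]
            for name, f in UNARY.items():
                y = f(x)
                if is_ok(y):
                    successors.append((tuple(sorted(rest + (round(y),))), name))
        for i in range(len(state)):                         # merge operations
            for j in range(len(state)):
                if i == j:
                    continue
                rest = tuple(v for k, v in enumerate(state) if k not in (i, j))
                for name, f in MERGE.items():
                    y = f(state[i], state[j])
                    if is_ok(y):
                        successors.append((tuple(sorted(rest + (round(y),))), name))
        for new_state, op in successors:
            if new_state not in G:
                queue.append(new_state)
            G.add_edge(state, new_state, op=op)
    return G

# e.g. one solution for the row "4 4 4 = 6":
G = search((4, 4, 4))
print(nx.shortest_path(G, (4, 4, 4), (6,)))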

Another question is how to return solutions. First, I tried to return all simple paths from the starting state to the goal state. For x=9, this gave over 17000 solutions, and clearly a lot of them were the same up to permutation. Instead, I decided to return one solution per immediate predecessor of the target state: for each immediate predecessor, I return the shortest path to it in the search graph. This brought the number of solutions down to a few dozen at most.

Results
The resulting successful computations can be plotted using graphviz via pydot. Installed through:

apt-get install graphviz
pip3 install pydot

Some examples. The _index suffixes are there because pydot doesn't accept custom labels.

The only solution found for 1.



One of two solutions found for 8.


One of 22 solutions found for 9.

The searcher found many solutions that I hadn't thought about, but most were not very interesting. I ran it for numbers up to 1000, and sadly there were not a lot of surprises. Going back to the problem we had in the beginning, it did not find a solution for 11. The workable numbers up to 1000 are typically very even numbers. The highest workable number I found was 729, which goes through 81:

Path from 729, which goes through [729, 729, 729] -> [27, 27, 27] -> 81 -> 9 -> 3 -> 6.

The highest workable prime found is 37, which goes through 36:




Conclusion
This was a very fun exercise and it yielded successful results quickly, even if the results were not so surprising. It should be noted that we're relying on numerical answers here, as opposed to an exact representation of numbers. This works well for integers, but wouldn't generalize to reals.

Thursday 27 June 2019

Convex Hull: correlated points

Previous:

In the previous post, I investigated the convex hull of random point clouds. I found that as we add more dimensions while keeping the number of points the same, the expected volume decreases sharply. One way to think of this is that in high dimensions, almost all points in a random point cloud will lie on the edge of the convex hull, very few will be "surrounded" by the other points. 

In this post, I want to see what happens if the points are not independent, but correlated with each other. Why might that change things? Well, strongly correlated systems have in a sense fewer degrees of freedom, so we expect them to behave like lower-dimensional problems. Like this:

Correlated points in R^2. Out of the 100 points, 90% are in the interior of the convex hull of the other points; these are marked in blue. The points on the edge of the convex hull are marked in red.

How to sample correlated points in R^N
Generate a square matrix A with uniform random elements, each with expectation 0. For each point in the point cloud, generate N Gaussian random numbers and apply A to them. The covariance matrix of the resulting points is AA^T.
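
A sketch of this sampling procedure (hypothetical function; the entry ranges of A are an arbitrary choice):

import numpy as np

def correlated_cloud(n_points, dim, seed=None):
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1, 1, size=(dim, dim))        # random mixing, zero mean
    Z = rng.standard_normal(size=(n_points, dim))  # i.i.d. Gaussian samples
    return Z @ A.T                                 # covariance becomes A A^T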

2 dimensions

Making 100 random trials gives an average inclusion rate (the fraction of points inside the convex hull of the others) of 91%, compared to 88% for independent point clouds. So the difference is not big, and not even statistically significant. The variance of the success count in a binomial distribution is n*p*(1-p), where n is the number of trials. With n=100 and p=0.9 in our case, this comes out to about 8, so the standard deviation is about 2.8 percentage points, pretty much the same as the difference between the two sets. We might want to apply a real test here, something that respects the binomial distribution. But that's silly; if we wanted something more accurate we should just run more trials.

10 dimensions

Doing the same in 10 dimensions gives an estimate of 5.6% for the independent point clouds, and 6.0% for the correlated point clouds. Once again, the difference is not significant.

Conclusion

When we observe high-dimensional systems, almost all observations will be "extreme" in some direction, even if the system is structured. 










Monday 24 June 2019

Sampling: simple stable Autoregressive

I want to try out the convex hull inclusion tester on data that is less random. One way to get such data is to sample a stable ARMA process. In Fourier space, an ARMA process can be expressed as a quotient of two polynomials:

H(z) = Q(z) / P(z)

P contains the poles, and Q contains the zeroes (nulls). In this post, I use only Q = 1.
The criterion for stability is that all roots of P lie inside the unit circle. To sample uniformly from the unit disc, we can draw an angle from a uniform distribution and an absolute value from a triangle distribution (since the density must increase linearly with the absolute value). The code:
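
A sketch of the sampling (using r = sqrt(U), which gives the same linearly increasing density as a triangle distribution; exactly how the coefficients are handed to statsmodels' ArmaProcess depends on its sign conventions and isn't shown here):

import numpy as np

def sample_stable_ar_poly(n_pairs, seed=None):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, np.pi, n_pairs)         # angles in the upper half plane
    r = np.sqrt(rng.uniform(0, 1, n_pairs))        # density grows linearly with r
    poles = r * np.exp(1j * theta)
    poles = np.concatenate([poles, poles.conj()])  # conjugate pairs -> real coefficients
    return np.poly(poles).real                     # coefficients of P with these roots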

This output format fits perfectly into the ArmaProcess object from statsmodels.

Random stable AR process from 5 pairs of poles.

Sunday 23 June 2019

Convex Hull

I want to find the convex hull of a set of points in R^N (a Euclidean space of arbitrary dimension). I am hoping that this will be an interesting programming exercise, and that it will provide some insight into the properties of higher dimensions.

Definition

The Convex Hull of a set S is the smallest convex set that contains all elements of S. So much for the mathematical definition. Intuitively, we can think of the convex hull as the shape of a rubber band that encloses all points in S. 

Example of the convex hull of a set of points. 

A Simple Test

When starting out, I thought about explicitly constructing the convex hull. This is indeed something we will get to later. By explicitly constructing the hull, I mean finding the bounding hyperplanes (facets) that constitute it. In the example above, that would mean finding the line equations of the red lines. However, it struck me that this is not necessary if all we want is to test whether some given point is inside the convex hull or not. There is a way to do it by solving a single linear programming problem. We use the following fact:

A vector u is in the convex hull of a set of points V if u can be expressed as a convex linear combination of the vectors in V. That is:

u = λ_1 v_1 + λ_2 v_2 + ... + λ_n v_n

such that:

λ_i >= 0 for all i, and λ_1 + λ_2 + ... + λ_n = 1.

This can be readily formulated as a linear program, where u and V are given, and the lambdas are variables. The code:
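
A sketch of that feasibility LP with scipy (not necessarily the original code):

import numpy as np
from scipy.optimize import linprog

def in_hull(u, V):
    # is there lambda >= 0 with sum(lambda) = 1 and V^T lambda = u?
    # V holds one point per row
    n = len(V)
    A_eq = np.vstack([V.T, np.ones((1, n))])   # coordinate rows + sum-to-one row
    b_eq = np.concatenate([u, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success

V = np.array([(0, 0), (1, 0), (0, 1)])
print(in_hull(np.array([0.25, 0.25]), V))   # True
print(in_hull(np.array([1.0, 1.0]), V))     # False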

This works beautifully in 2D:

Convex hull for the set of points (0, 0), (1, 0), (0, 1), marked with red dots. Sampled in the integers. 

Convex hull (red) of random points (green) sampled in the integers. 

Volume of Convex Hull using Monte Carlo

One question I had was: "suppose we take 10 random points sampled from the unit cube, what is the expected volume of their convex hull?". This requires a nested simulation. On the highest level, we need to sample a lot of 10-point sets and measure their volumes. And for each volume that we want to measure, we need to sample a lot of points to get an accurate Monte Carlo estimate of the volume. Some initial tests:

dim | volume: 10 points | volume: 100 points
2D  | 0.42              | 0.88
3D  | 0.14              | 0.69
4D  | 0.035             | 0.45

What can we say from this? Not much more than that the average coverage decreases as the number of dimensions increases.