
torsdag 28 februari 2019

Zen of Python: Sparse is better than dense

In the sampling mini project, and also in the 2D interval selection problem, I had to implement a few operations on intervals in one dimension and higher. Let's call intervals in an arbitrary number of dimensions boxes.
  • contains: given a box and a point, checks if the point is contained in the box. 
  • intersection: given two boxes, returns the overlap between them (if there is any). 
  • profile: given a box and a point specified in a subset S of the box's dimensions D, return the box's expansion in D\S (the remaining dimensions), if the point is contained in the box's projection onto S. We can see this as the intersection between the box and the point extended infinitely in D\S. 
  • volume: given a box, return the product of its size in all its dimensions.
  • corners: given a box, return all its corners as tuples. 
A simple Google search did not turn up a Python package that supports these relatively abstract operations, so I decided to clean up my code to make it a bit more general. It struck me that a good data structure for boxes might be a dict: the values are 2-tuples (or empty tuples for empty intervals) and the keys, representing the dimensions, can be any hashable. Even if I have so far worked with things that are naturally contained in Euclidean space, I can imagine getting use out of a sparse representation of intervals whenever dealing with a product space of ordered sets (see the sketch after the list below). Some bonus features:
  • Handle empty intervals
  • Projection onto an arbitrary dimension is trivial. Just a key access, like so: box[dim].
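Since no package turned up, here is a minimal sketch of what the dict representation could look like (names and the closed-interval convention are my own; the empty-tuple case is left out for brevity):

from itertools import product

def contains(box, point):
    # point maps each of the box's dimensions to a value
    return all(lo <= point[d] <= hi for d, (lo, hi) in box.items())

def intersection(a, b):
    # overlap of two boxes; a dimension missing from one box is unbounded there
    out = dict(a)
    for d, (lo, hi) in b.items():
        if d in out:
            lo, hi = max(out[d][0], lo), min(out[d][1], hi)
            if lo > hi:
                return None  # empty overlap
        out[d] = (lo, hi)
    return out

def profile(box, point):
    # the box's expansion in the remaining dimensions, if point is in the projection
    if not all(lo <= point[d] <= hi for d, (lo, hi) in box.items() if d in point):
        return None
    return {d: iv for d, iv in box.items() if d not in point}

def volume(box):
    result = 1
    for lo, hi in box.values():
        result *= hi - lo
    return result

def corners(box):
    # one corner per choice of low/high end in every dimension
    return list(product(*box.values()))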
Actually, this concept can be generalized even further. The only operations above that rely on the dimensions being ordered are volume and corners; the others we get as long as intersection is defined on each of the dimensions in use. The general concept is a product space, stored only as a collection of expansions in each dimension. 

tisdag 26 februari 2019

Sampling: mutating box trees

In the previous post, I sampled time series by defining transition probabilities using boxes of different sizes. We want to find the most complex time series using the least complex representation of the transitions. We need to start searching for better transition functions in the box representation.

First, we need to mutate the box representation. We can see the boxes as an unbalanced binary tree. Each node represents a box. Each box is either split or not split - i.e. each node has either 0 or 2 children. The root node is the box representing the whole space. I mutate the box tree by removing both children of a non-leaf box. I then insert as many leaves as were lost in the remove operation, so the number of leaves is kept the same (sketched in code below). This gives a very quickly changing box landscape:
Random mutations on the box space. All sub-boxes in a random box are deleted. Then, new leaves are inserted, keeping the number of boxes the same. 
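A sketch of the mutation; since the post does not specify how new leaves are inserted, this version half-splits random existing leaves along a random dimension until the count is restored:

import random

class Node:
    # box-tree node: a leaf is a box, an internal node has exactly 2 children
    def __init__(self, box):
        self.box = box          # tuple of (lo, hi) pairs, one per dimension
        self.children = []

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def leaves(root):
    return [n for n in walk(root) if not n.children]

def split(leaf):
    # split a leaf in half along a random dimension
    d = random.randrange(len(leaf.box))
    lo, hi = leaf.box[d]
    mid = (lo + hi) / 2
    for part in ((lo, mid), (mid, hi)):
        leaf.children.append(Node(leaf.box[:d] + (part,) + leaf.box[d + 1:]))

def mutate(root):
    # collapse a random internal node, then re-split random leaves
    # until the number of leaves is the same as before
    n_before = len(leaves(root))
    node = random.choice([n for n in walk(root) if n.children])
    node.children = []
    while len(leaves(root)) < n_before:
        split(random.choice(leaves(root)))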
Now we need a measure for the complexity of the box tree. I use area-wise entropy:
\[S_{boxes} = -\sum_i \frac{A_i}{A} \log_2\frac{A_i}{A}\]
where A_i is the area of box i and A is the total area. This is a measure of how much we can compress a point's location, if we know that it is in one of the boxes with equal probability. If we do a greedy search to maximize the entropy, we get:
Greedy search to maximize box entropy. Optimum is that all boxes have the same area. There are 64 boxes, and the total area is 1, so ideally they should all have area 2^(-6).
Greedy search to minimize entropy. Optimum is that one box covers the whole area and the others have 0 area. This cannot be represented by the box tree, so it produces a geometric series of boxes.

We also need a score for the resulting time series. Ideally, the series should cover a lot of the box space. I use the count-wise entropy:
\[S_{series} = -\sum_i \frac{C_i}{C} \log_2\frac{C_i}{C}\]
where C_i is the number of series points that fall in box i, and C is the total number of elements in the series. With this, we can do a greedy search to maximize S_series - S_boxes:
Greedy search to maximize S_series - S_boxes. Estimating S_series by sampling is not very reliable.
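Both entropies are the same computation with different weights (areas for S_boxes, visit counts for S_series); a small sketch:

from math import log2

def entropy(weights):
    # Shannon entropy (in bits) of the normalized weight vector
    total = sum(weights)
    ps = [w / total for w in weights if w > 0]
    return -sum(p * log2(p) for p in ps)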

The problem is that we cannot reliably find out the entropy of the series, because it relies on the stationary distribution. At this point, I realize that what we are doing actually defines a Markov chain. And for Markov chains, we can calculate the stationary distribution. So, that is a good next step in order to get a better measure of process complexity. 

lördag 23 februari 2019

Sampling: series

In an earlier post, we used the following design goal for a good sampling:

"generate a set of random points with a lot of features, using minimal coding effort"

An approach based on generating a random (non-complete) binary tree of boxes gave nice results:

From previous post
We can plot the boxes for clarity:
Fractally generated boxes.
Fractally generated boxes, with random sampled points.


Now I had the idea that this could be used to generate interesting time series. We can put the current value on the x-axis, and sample the next value from the profile distribution. Illustration:
Profile / conditional distribution of y(t) when y(t-1)=0.4. Notice that box density = conditional density. 
So let's generate some time series with this. We start at y(0) = 0, and then update recursively using the conditional distribution induced by the boxes. 
Sampling directly from the conditional distribution. 
Well, that is a bit disappointing. The generated time series just looks kind of random; it is quite stateless. We can try residual sampling instead. Then we change the update equation, from:
\[y(t+1) = b, \quad b \sim B(\cdot \mid y(t))\]
to:
\[y(t+1) = y(t) + \lambda b, \quad b \sim B(\cdot \mid y(t))\]
Here B is the box conditional distribution, and lambda is a small coefficient. This makes the sampled value act more like a derivative. Another way of seeing it, in terms of the direct sampling, is that it is as if we have forced all boxes to be around the y(t)=y(t-1) line. It also gives more interesting time series:
Residual sampling - we only make small updates. Time series has more "features". However, it only explores a small part of the state space. 
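The two update rules side by side, assuming a hypothetical helper sample_conditional(boxes, y) that draws from the profile distribution at x = y:

def direct_series(boxes, sample_conditional, n):
    # y(t+1) = b, with b ~ B(. | y(t))
    y, series = 0.0, [0.0]
    for _ in range(n):
        y = sample_conditional(boxes, y)
        series.append(y)
    return series

def residual_series(boxes, sample_conditional, n, lam=0.05):
    # y(t+1) = y(t) + lambda * b, with b ~ B(. | y(t))
    y, series = 0.0, [0.0]
    for _ in range(n):
        y = y + lam * sample_conditional(boxes, y)
        series.append(y)
    return series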
This method looks more promising, but clearly we want to get better at generating conditional distributions that will make the sampling explore a larger portion of the state space. 

fredag 22 februari 2019

SAT: 2D interval selection

Previous:
"Cut rectangular slices of pizza. The pizza is divided into cells. Each slice must contain at least L tomato cells (T) and L mushroom cells (M). Each slice must contain at most H cells [1]. Maximize the total number of cells in the cuts."

In the last post, we solved a 2D slicing problem by optimizing each row independently. We should expect to do better by actually selecting 2D intervals instead. A 2D interval is represented as:
(x_start, x_end, y_start, y_end)
To generalize the code to 2D intervals, we do not have to change much. This is thanks to the fact that we only dealt with intervals in a very abstract sense before: as elements with a size (for the score) and an overlapping relation (to forbid overlapping slices). We need only to change the way we generate the intervals, and the overlapping relation. The code:

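The embedded code is not shown here, but a minimal sketch of the two pieces that change could look like this (the at-least-L tomato/mushroom filter, which needs the pizza contents, would go in the innermost loop and is omitted):

from itertools import product

def overlaps(a, b):
    # two 2D intervals overlap iff their projections overlap in both dimensions
    (ax0, ax1, ay0, ay1), (bx0, bx1, by0, by1) = a, b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def gen_intervals(rows, cols, H):
    # all (x_start, x_end, y_start, y_end) rectangles with at most H cells
    for x0, y0 in product(range(rows), range(cols)):
        for x1 in range(x0, rows):
            for y1 in range(y0, cols):
                if (x1 - x0 + 1) * (y1 - y0 + 1) <= H:
                    yield (x0, x1, y0, y1)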
Now we need a solution strategy. Tests show that 2D interval selection gets tough (starts taking more than 10 seconds) for matrices larger than 20x20. However, 12x12 matrices can be solved quickly (in about 0.6 seconds). Solving 12x12 sections often gives a perfect score on the subproblem:
Optimization on 12x12 subproblem. Colored borders mark slices. 

Optimization on 12x12 subproblem. Colored borders mark slices.
In many cases, almost all slices are just rows or columns.
Now let's solve the medium instance [1]. The max score is 50000. The row-wise maximization gave us 48977. The top result on Google for "hashcode pizza" uses a greedy algorithm and gets 49567 on medium [2]. This time we get a score of 49980, missing only 20 cells. This clearly outperforms the other approaches. The runtime is about 10 minutes, without parallelization.

Conclusion
This result not only says something about the power of model-based approaches, but also about the problem itself. It is quite an easy problem, where we can get an almost perfect score by just solving independent subproblems.

[1] pizza.pdf
[2] sagishporer/hashcode-practice-pizza

onsdag 20 februari 2019

SAT: optimal interval selection

We are tasked with the following:

Cut rectangular slices of pizza. The pizza is divided into cells. Each slice must contain at least L tomato cells (T) and L mushroom cells (M). Each slice must contain at most H cells [1]. Maximize the total number of cells in the cuts. Example pizza:
TTTTT
TMMMT
TTTTT
L = 1, H = 6
Optimal solution:
TT | T | TT
TM | M | MT
TT | T | TT

I want to find (almost) optimal solutions using SAT. In this first version, I reduce it to a 1D problem, assuming that we only cut row by row. In other words, no slice covers more than one row.

Some problem analysis:
  • Without involving the SAT solver, we can calculate which slices are allowed in the solution.
  • Also without invoking the SAT solver, we can calculate which slices cannot overlap. 
  • Determining which cells are covered by which slices can also be done easily without modelling. 
  • With all this data known beforehand, we can reduce the slicing to something very abstract. We can see it as a selection problem: select some elements of a set with some mutual constraints and some objective. 
The key here is that the mutual constraints and the objective can be easily expressed using clauses. First we enumerate all feasible slices. Then, for each pair of slices (s_0, s_1), we add the following clause if they overlap:
\[(\neg s_0 \lor \neg s_1)\]
For each cell, c, we find the slices s_0, s_1,...s_k that overlap it. Then for each cell we add an indicator:
\[c \leftrightarrow (s_0 \lor s_1 \lor \dots \lor s_k)\]
Now we want to find the largest S such that adding
\[\sum_c c \geq S\]
is satisfiable. For this, we can use pysat's ITotalizer [2]. It is based on an incremental encoding that allows some work to be reused when trying different values of S [3]. I applied it in a binary search. The code:

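The embedded code is not shown; a sketch of the idea, with names of my own. The totalizer counts the false cell indicators, and adding the assumption -t.rhs[k] enforces that at most k of its input literals are true; we binary search on that bound:

from pysat.card import ITotalizer
from pysat.solvers import Glucose3

def maximize_covered(cnf, cells):
    # maximize the number of true literals in `cells`, subject to `cnf`
    tot = ITotalizer(lits=[-c for c in cells], ubound=len(cells) - 1,
                     top_id=cnf.nv)
    with Glucose3(bootstrap_with=cnf.clauses + tot.cnf.clauses) as solver:
        lo, hi = 0, len(cells)     # number of *false* cells lies in [lo, hi]
        model = None
        while lo < hi:
            mid = (lo + hi) // 2
            if solver.solve(assumptions=[-tot.rhs[mid]]):
                model = solver.get_model()
                hi = mid           # at most `mid` cells are false
            else:
                lo = mid + 1
        return len(cells) - lo, model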
For the instances medium.in and big.in in the input data [1], this algorithm can find the row-wise optimum in reasonable time. Sample output for medium.in (max score is 250):

Row time: 0.510
Row score: 250

Row time: 0.484
Row score: 242

Row time: 0.519
Row score: 247

Sample output big.in (max score is 1000):

Row time: 4.096
Row score: 880

Row time: 3.108
Row score: 904

Row time: 3.336
Row score: 881

For medium I get a total score of 48977, which is not particularly good, because it can be beaten by a greedy algorithm (49567) [4]. The max score is 50000 (250x200). Still, it is interesting that we get so close to the real optimum with such a silly constraint as row-wise maximization. 

[1] pizza.pdf
[2] PySAT: SAT technology in Python
[3] Martins, Ruben, et al. "Incremental cardinality constraints for MaxSAT." International Conference on Principles and Practice of Constraint Programming. Springer, Cham, 2014.
[4] sagishporer/hashcode-practice-pizza

SAT: parity board

We get the following challenge:

I take "clear" to mean that all tiles should be the same color, either black or white. One "move" is to click a tile, which changes the color of it, and also its 4-connected neighbors. Illustration:

Clicking a tile changes the color of it, and its 4-connected neighbors.
This can be solved with SAT. Some analysis of the problem:
  • Clicking a tile twice is the same as not clicking it at all. The only thing that matters is the parity of clicks. If we assume that we are looking for an optimal solution, we can assume that each tile is clicked 0 or 1 times. So, the clicks can be modeled as boolean variables. 
  • Parity is also the only thing that matters for clearing the board. Suppose we want to make all tiles white. Then, black tiles should switch color an odd number of times. White tiles should switch color an even number of times. 
  • If we can count the number of color switches each tile has done, we can add cardinality constraints on the click variables that cover each tile. 
  • We also have to add a cardinality constraint on the whole set of click variables. 
For small numbers like this, we can do parity as a disjunction of equality clauses. 


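The embedded model is not shown; a sketch of the two ingredients, using the disjunction operator over CNFs developed in the "SAT: disjunction operator" post further down (names are my own):

from pysat.card import CardEnc

def covering_clicks(i, j, n, var):
    # click variables that toggle tile (i, j): the tile itself + 4-connected neighbours
    cells = [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [var[a, b] for a, b in cells if 0 <= a < n and 0 <= b < n]

def parity_cnf(lits, odd, top_id):
    # true iff an odd (resp. even) number of lits are true:
    # a disjunction over the equality constraints of the right parity
    cnfs, top = [], top_id
    for k in range(1 if odd else 0, len(lits) + 1, 2):
        cnf = CardEnc.equals(lits=lits, bound=k, top_id=top)
        top = max(top, cnf.nv)   # keep auxiliary variables disjoint
        cnfs.append(cnf)
    return disjunction(cnfs)     # see the disjunction-operator post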
This produces the following solution:

0  0  1  0  0
0  0  1  0  0
1  0  0  1  1
0  1  0  1  1
0  0  0  1  0

(1 means click this tile, 0 means no click)

måndag 18 februari 2019

SAT: disjunction operator

Previous:
  1. getting started
  2. langford pairs
  3. how not to "negate"
I want to create an operator that turns the disjunction of a set of CNFs into one CNF. That is to say, for given CNFs c_1, ..., c_n, find g such that:

\[g \leftrightarrow (c_1 \lor c_2 \lor \dots \lor c_n)\]

I got no relevant hits on "disjoint" or "disjunction" in the pysat documentation. The implementation of the operator is not very complicated, however.

The idea is to do a Tseytin transformation. For each CNF in the input, we create a variable p that is equivalent to the satisfaction of the CNF. Call this variable an indicator variable. We have:

\[p \leftrightarrow (C_1 \land C_2 \land \dots \land C_m)\]

where C_1, ..., C_m are the clauses of the CNF.
This relationship can be turned into a CNF (standard AND-constraint):

\[(\neg p \lor C_1) \land (\neg p \lor C_2) \land \dots \land (\neg p \lor C_m)\]

So the method is: create indicators and indicator clauses for all input CNFs. We add all indicator clauses together into a joint CNF. Lastly, we need to add a clause with all indicators, since we need to make sure that at least one indicator is true. The code:


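The snippet itself is not embedded; a sketch of the construction. It assumes the input CNFs only share their main variables, so auxiliary variables do not clash across CNFs. Note that the indicator clauses above only encode the direction p implies CNF, but that is all the disjunction needs: the final clause merely forces some indicator, and hence some CNF, to hold.

from pysat.formula import CNF

def disjunction(cnfs):
    # returns a CNF that is equisatisfiable with (cnf_1 OR ... OR cnf_n)
    top = max(cnf.nv for cnf in cnfs)
    joint, indicators = CNF(), []
    for cnf in cnfs:
        top += 1
        p = top                          # fresh indicator for this CNF
        indicators.append(p)
        for clause in cnf.clauses:
            joint.append([-p] + clause)  # p -> clause
    joint.append(indicators)             # at least one indicator is true
    return joint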
Now we can test it. Let's see if we can make a CNF that is true if and only if the sum of the variables is even. The code:

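The test code is not embedded; a sketch, threading top_id so the auxiliary variables of the equality constraints stay disjoint:

from pysat.card import CardEnc

n = 10
X = list(range(1, n + 1))
cnfs, top = [], n
for k in range(0, n + 1, 2):             # even sums: 0, 2, ..., 10
    cnf = CardEnc.equals(lits=X, bound=k, top_id=top)
    top = max(top, cnf.nv)
    cnfs.append(cnf)
even_sum = disjunction(cnfs)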
I modified enumeration_test so that it outputs which variable sums are present in the satisfying assignments, and in the unsatisfying assignments, respectively. The output is as expected:

Satisfying assignments:
{0, 2, 4, 6, 8, 10}

Unsatisfying assignments:
{1, 3, 5, 7, 9}

In conclusion, we could do disjunction as expected. Note that auxiliary variables were present here (implicit from the "equals" clauses, which use sequential counters), as in the negation experiment. But in this case, it worked fine.

SAT: How "negate" doesn't work

Previous:
  1. getting started
  2. langford pairs
I want to explore the python-sat module. One thing that seems simple yet useful is to negate formulas. That is, to create a formula that has the opposite boolean value to a given formula. Let's start with a formula:

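The embedded formula is gone; judging by the description below, it was an at-most-1 bound over three variables, e.g.:

from pysat.card import CardEnc, EncType

X = [1, 2, 3]
bound_X = CardEnc.atmost(lits=X, bound=1, encoding=EncType.seqcounter)
print(bound_X.clauses)   # the sequential counter introduces variables 4 and 5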
This bound was created using a sequential counter explained in [1] (p.8) and defined in [2]. It introduces two auxiliary variables; 4 and 5. We can enumerate all assignments to X and see that the formula works as expected:

Satisfying assignments:
(0, 0, 0)
(0, 0, 1)
(0, 1, 0)
(1, 0, 0)

Unsatisfying assignments:
(0, 1, 1)
(1, 0, 1)
(1, 1, 0)
(1, 1, 1)

Now we can use pysat.formula.CNF.negate to negate this formula. Expected output:

Satisfying assignments:
(0, 1, 1)
(1, 0, 1)
(1, 1, 0)
(1, 1, 1)

Unsatisfying assignments:
(0, 0, 0)
(0, 0, 1)
(0, 1, 0)
(1, 0, 0)

But, using this:

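Presumably a call like:

bound_X_neg = bound_X.negate()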
We get:
Satisfying assignments:
(0, 0, 0)
(0, 0, 1)
(0, 1, 0)
(0, 1, 1)
(1, 0, 0)
(1, 0, 1)
(1, 1, 0)
(1, 1, 1)

Unsatisfying assignments:
[]

What happened here?

Explanation

The problem has to do with the auxiliary variables: in the original formula (bound_X), there are assignments where the main variables (X) fulfill the bound, but the assignment of the auxiliary variables (4, 5) makes the formula false. By definition, such an assignment makes the negated formula (bound_X_neg) true. Thus, the negated formula can be satisfied even when the main variables satisfy the original formula.

In conclusion, negating a formula's function with respect to a strict subset of the variables does not work in the general case.

P.S. We see the problem when we iterate over all assignments to X + the auxiliary variables (satisfying assignments for atmost(X, bound=1)):

Satisfying assignments:
X             auxiliary
(0, 0, 0,    0, 0)
(0, 0, 0,    0, 1)
(0, 0, 0,    1, 1)
(0, 0, 1,    0, 0)
(0, 1, 0,    0, 1)
(1, 0, 0,    1, 1)

Unsatisfying assignments:
X             auxiliary
(0, 0, 0,    1, 0)
(0, 0, 1,    0, 1)
(0, 0, 1,    1, 0)
(0, 0, 1,    1, 1)
(0, 1, 0,    0, 0)
(0, 1, 0,    1, 0)
(0, 1, 0,    1, 1)
(0, 1, 1,    0, 0)
(0, 1, 1,    0, 1)
(0, 1, 1,    1, 0)
(0, 1, 1,    1, 1)
(1, 0, 0,    0, 0)
(1, 0, 0,    0, 1)
(1, 0, 0,    1, 0)
(1, 0, 1,    0, 0)
(1, 0, 1,    0, 1)
(1, 0, 1,    1, 0)
(1, 0, 1,    1, 1)
(1, 1, 0,    0, 0)
(1, 1, 0,    0, 1)
(1, 1, 0,    1, 0)
(1, 1, 0,    1, 1)
(1, 1, 1,    0, 0)
(1, 1, 1,    0, 1)
(1, 1, 1,    1, 0)
(1, 1, 1,    1, 1)

[1] Knuth, Donald E. The Art of Computer Programming, Volume 4, Fascicle 6: Satisfiability. Addison-Wesley Professional, 2015.

[2] Sinz, Carsten. "Towards an optimal CNF encoding of boolean cardinality constraints." International conference on principles and practice of constraint programming. Springer, Berlin, Heidelberg, 2005.

onsdag 13 februari 2019

SAT: Langford Pairs

I set up a SAT to solve the Langford Pairs Problem. Definition:

Given a natural number n, order the 2n numbers [1, 1, 2, 2, ..., n, n] in such a way that the two instances of any number k have exactly k numbers between them in the sequence.

The solution for n=3:

3, 1, 2, 1, 3, 2

The solution to the SAT modelling of this is given on page 5 of Knuth's satisfiability. I implemented this as:

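My implementation is not shown here; a self-contained sketch of the encoding (not the SugarRush version) could look like this, where variable (k, i) means that the two copies of k sit at positions i and i + k + 1:

from pysat.card import CardEnc
from pysat.formula import CNF
from pysat.solvers import Glucose3

def langford(n):
    var, at_position, top = {}, {}, 0
    for k in range(1, n + 1):
        for i in range(2 * n - k - 1):     # both i and i + k + 1 must fit
            top += 1
            var[k, i] = top
            at_position.setdefault(i, []).append(top)
            at_position.setdefault(i + k + 1, []).append(top)
    groups = [[var[k, i] for i in range(2 * n - k - 1)]
              for k in range(1, n + 1)]    # each number placed exactly once
    groups += list(at_position.values())   # each position used exactly once
    cnf = CNF()
    for lits in groups:
        eq = CardEnc.equals(lits=lits, bound=1, top_id=top)
        top = max(top, eq.nv)
        cnf.extend(eq.clauses)
    with Glucose3(bootstrap_with=cnf.clauses) as solver:
        if not solver.solve():
            return None
        model = set(solver.get_model())
        seq = [0] * (2 * n)
        for (k, i), v in var.items():
            if v in model:
                seq[i] = seq[i + k + 1] = k
        return seq

print(langford(3))   # e.g. [3, 1, 2, 1, 3, 2]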
For this, I have wrapped pysat.solvers.Solver so that it keeps track of which variable names have been used (variable names are stored as integers). I also added the constraint symmetric, which requires that exactly one of the variables is True. For n=12 the solution is:

7, 4, 11, 5, 6, 9, 4, 10, 7, 5, 12, 6, 8, 3, 11, 9, 2, 3, 10, 2, 1, 8, 1, 12

I call the wrapper SugarRush. This is a reference to the Glucose SAT solver, which is the default [1][2]. It is also a reference to syntactic sugar.

[1] The Glucose SAT Solver
[2] Audemard, Gilles, and Laurent Simon. "GLUCOSE: a solver that predicts learnt clauses quality." SAT Competition (2009): 7-8.

tisdag 12 februari 2019

SAT: Getting started

I recently participated in Al Zimmerman's Programming Contest. Tomas Rokicki, the runner-up of this contest, published a writeup of his solution, which was a very interesting read. Specifically, it made me interested in learning more about SAT solvers.

My excitement was amplified by having heard Donald Knuth say "Satisfiability is something that turned out to be extremely important the more I looked at it".

By Rokicki's recommendation, I started reading Knuth's preprint about satisfiability. After working with pen and paper for a while, I felt that the material would open up more naturally if I could implement SAT problems and experiment at scale.

I installed python-sat. It comes with some good solvers. The documentation looks solid too. Installation procedure:

$ pip install six
$ pip install python-sat

Pysat works with integers as variable names, as far as I can see. Below is a naive way to work around that, to increase model readability:

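The snippet is not embedded; one naive approach is a defaultdict that hands out fresh integers on first use:

from collections import defaultdict
from itertools import count

ids = count(1)
var = defaultdict(lambda: next(ids))    # name -> fresh integer variable

clause = [-var["rain"], var["wet"]]     # encodes rain -> wet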

måndag 11 februari 2019

Labyrinth: Improving the solving

Previous:
  1. random-labyrinths
  2. labyrinth-complexity
  3. breadth-first-search
  4. monte-carlo-simulations
  5. adversarial-search-for-bfs
  6. improving-random-search
In the previous post we saw how, with the help of a slightly more advanced random search, we could find optimally hard labyrinths (hardness defined in labyrinth-complexity). That is, optimal hardness given that the labyrinth was solved by breadth-first search. What if we use a more advanced search algorithm?

A* (A-star)
A* is a heuristic search algorithm. That means it uses some rule of thumb (heuristic) to decide which node to expand next. The node to be expanded next is the node with the lowest heuristic value. For A*, this heuristic f(n) is split into two parts:

f(n) = g(n) + h(n)

Here, g(n) is the cost of going from the start node to n. In our case, we can just use the depth in the search tree. The goal heuristic h(n) is a measure of the cost of going from n to the end node. For this, we use the Manhattan distance between n and the end node. A* has some optimal efficiency properties, which need two assumptions to be fulfilled. First, the heuristic must satisfy a triangle inequality. Manhattan distance is just the L1 norm, which fulfills the triangle inequality. Second, the heuristic must never overestimate the cost from n to end. This also holds in our case, since the Manhattan distance between A and B is never longer than any possible route in the labyrinth between A and B. With this extra information given to the search, we cannot get the same hardness anymore:

Optimally hard labyrinth with A* as search algorithm.
The above labyrinth is a square with side 30. The start node is at (0, 0). The end node is at (15, 15). Proof of optimality: consider the top right node (29, 29), call it n. It has g(n) = 29 + 29 = 58 and h(n) = 14 + 14 = 28, so f(n) = 58 + 28 = 86. The end node must therefore be at a depth of at least 86 if (29, 29) is to be expanded before the search terminates. We also know that the length of the path is a sum of two odd numbers (one per dimension, by parity), so it must be even. The smallest even number that is at least 86 is 86. This gives us a shortest path of 87 nodes and a score of 87 - 1 + 2*813 = 1712.

Wall-counting A*
Now, we can give the search algorithm even more information to make it an even better adversary. Here, we let the algorithm store an internal representation of all missing edges (walls) that it has seen so far in the labyrinth. Its search heuristic now becomes the length of the shortest path in this internal representation graph. Call this graph the Obstruction graph. This allows it to effectively dodge dead ends, and makes life difficult for the disrupting search. 
The disruptor needs to lead the searcher on a long detour to force it to expand all nodes. This reduces the hardness, as fewer nodes are dead ends. 
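A sketch of the heuristic, assuming walls are stored as frozensets of grid-node pairs; the shortest path in the obstruction graph is found by BFS:

from collections import deque

def obstruction_heuristic(node, end, known_walls, n):
    # shortest-path length from node to end in the n x n grid,
    # minus the walls observed so far (the obstruction graph)
    frontier, seen = deque([(node, 0)]), {node}
    while frontier:
        (x, y), dist = frontier.popleft()
        if (x, y) == end:
            return dist
        for nxt in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if (0 <= nxt[0] < n and 0 <= nxt[1] < n and nxt not in seen
                    and frozenset(((x, y), nxt)) not in known_walls):
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return float('inf')   # end unreachable given the observed walls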
We see where the above is going:

A*-buster

For N=900 nodes, search is very slow. I need to optimize the Wall-counting A* to be able to run longer searches. But we are starting to see some really difficult labyrinths. 

I think 172 is optimal here. End should be at least (9+9) + (4 + 4) = 26 deep. 

PS: an obvious fix to Wall-counting A*
Plotting the expanding search tree (see below) made me realize: wall-counting A* expands a lot of nodes that cannot reach end without going through the rest of the existing search tree. Such a node should not be expanded, by a minimality argument. So, we should also add the edges in the search tree to the internal obstruction graph. This gives us:

Improved wall-counting A*. Does not expand nodes that are isolated by the search tree itself.
Expanding search tree for improved wall-counting A*.

söndag 10 februari 2019

Labyrinth: Improving random search

Previous:
  1. random-labyrinths
  2. labyrinth-complexity
  3. breadth-first-search
  4. monte-carlo-simulations
  5. adversarial-search-for-bfs
In this post, we will see some tricks for guiding the random search. We will still use the same pattern:

  1. Start with a random labyrinth
  2. For a number of iterations, repeat:
  3. Mutate the labyrinth in some way
  4. Measure the result with the hardness score
  5. Decide if we want to accept the change or discard it
Here, we will change only steps 3 (mutation) and 5 (selection). 

The mutation step for the original random search was very simple: delete a small number of edges and then randomly connect the labyrinth again. We make an observation: links deleted far away from the shortest path will probably not change the score a lot. 
Changing links far away from the shortest path is unlikely to change the score.
Links deleted near the current shortest path are more likely to help us go through interesting parts of the search space. This gives us the following mutation rule:

  1. Select random node which is on the shortest path. Call this the center node.
  2. Iterate through the edges in a small box (2-3 tiles away) around the center node. For each edge:
    1. If the edge is in the labyrinth, remove it with some probability.
    2. If the edge is not in the labyrinth, add it with some probability.
  3. Connect the labyrinth randomly again if it became unconnected. 
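A sketch of this mutation rule, with hypothetical labyrinth helpers:

import random

def mutate(labyrinth, shortest_path, radius=2, p=0.3):
    # toggle edges in a small box around a random node on the shortest path
    center = random.choice(shortest_path)
    for edge in labyrinth.edges_near(center, radius):   # hypothetical helper
        if random.random() < p:
            if edge in labyrinth.edges:
                labyrinth.remove_edge(edge)
            else:
                labyrinth.add_edge(edge)
    if not labyrinth.is_connected():
        labyrinth.connect_randomly()                    # hypothetical repair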
The selection step in the random search is also very simple: if the score of the mutated labyrinth is as good as or better than the current known best solution, we accept the change. This has two drawbacks:
  • The search easily gets stuck in a local maximum, without any mutation path to the global maximum. 
  • The problem is often that the shortest path is too long, since the nodes that are not on the shortest path are potentially worth double the points. But making the shortest path shorter often leaves some nodes unexpanded, and thus points are lost. So we want to encourage the search to make the shortest path shorter, even if it costs some points in the short term. 
With this, we can extend the selection to include three accepting cases:
  1. If the score is as good as or better than the current best known solution.
  2. If the score is at least no worse than the current best known solution minus a threshold (2 was used), then we accept temporarily with some probability (0.1 was used).
  3. If the shortest path is shorter than in the best solution and the score is no worse than best minus a threshold (10 was used), then we accept temporarily with some probability (0.1 was used). 
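The three cases as a sketch, with the thresholds and probabilities from above (what "temporarily" means is explained next):

import random

def accept(score, path_len, best_score, best_path_len):
    if score >= best_score:
        return True                         # case 1: permanent acceptance
    if score >= best_score - 2 and random.random() < 0.1:
        return True                         # case 2: temporary acceptance
    if (path_len < best_path_len and score >= best_score - 10
            and random.random() < 0.1):
        return True                         # case 3: reward shorter paths
    return False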

"Temporarily" is key here. It means that if we haven't had a new accepted solution within a number of iterations (100 was used), then we restart from the best current solution. This prevents the search from getting stuck if it accepts a very bad solution and cannot find an acceptable solution in the new neighbourhood. With this, we find optimal solutions to N=900 nodes, for the test cases:

Test 1. Optimal 1740 found in 700 iterations. 

Test 2. Optimal 1738 found in 53500 iterations. 

Test 3. Optimal 1739 found in 28900 iterations. 
We can also prove optimality for test cases 2 and 3. Test 1 optimality was proved in adversarial-search-for-bfs. Call the side length of the labyrinth n. Consider the top right node. For n=30, it will have depth at least 58 in BFS's search tree. The shortest "shortest path" that still expands the top right node must therefore be at least 58 edges long.

In test 2, start is at (0, 0) and end is at (10, 10). The number of up-down edges in a path between them must be even, and so must the number of left-right edges, so the total path length is even. The smallest even number larger than 58 is 60. So the shortest possible shortest path that still leads to a full expansion is 60 edges long, that is to say it contains 61 nodes. The maximal number of unexpanded nodes is then 30**2 - 61 = 839. The maximal score is then 61 - 1 + 839*2 = 1738.

For test 3, the shortest path can have any odd number as length. The smallest odd number larger than 58 is 59. This gives an optimal score of 60 - 1 + 840*2 = 1739. 

fredag 8 februari 2019

Zen of Python: Flat is better than nested

If this model is going to be useful, it needs to be able to read data in a realistic format. That includes at least being able to take in data on when students are available, as datetime strings. 

In the first model, I used a discretized time model, and the assignment variables were stored in a list of lists of equal length - one dimension for students, and one for time. This representation requires a lot of nested loops. Now, the challenge I set was to handle the increased complexity with a more readable model. For this, I changed the representation from a list of lists to a dict that maps student-time pairs to the assignment variables. 

This changes the way groups of variables are retrieved. For example:

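For example (names hypothetical, with stand-in integer variables):

from itertools import count

students = ["ada", "bo", "cy"]
times = ["mon 16:00", "mon 16:20"]
ids = count(1)
assign = {(s, t): next(ids) for s in students for t in times}

# all assignment variables for one student:
ada_vars = [v for (s, t), v in assign.items() if s == "ada"]
# all assignment variables for one time slot - no transposing needed:
slot_vars = [v for (s, t), v in assign.items() if t == "mon 16:00"]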
The upside is particularly clear when we want to sum over the non-primary dimension and no longer have to transpose. In this case, we also get away from retrieving by indexes.

In summary, I was happy with the outcome of this change. The flat representation both handles a new feature and makes for clearer operations. New results, using this data:

The lessons are 20 minutes.
Coloring: Dark blue is active. Light blue is available student. Red is teacher waiting time.

måndag 4 februari 2019

MIP: Private tutoring scheduling

A private tutor wants to schedule students. We have some necessary, and some desired properties of the solution:

  • All 10 students must get a time
  • Each student is only available at certain times
  • The teacher can teach at most 3 students at a time
  • The teacher wants to fit in the students in as few days as possible
  • The teacher wants to minimize the total on-site time, including time waiting between classes
This can be modeled as an integer program and solved efficiently with a MIP solver. I use CBC, which is open source. 
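A minimal sketch of the hard constraints in PuLP (which ships with CBC as its default solver). The data and names are assumptions, and the objective terms for day count and waiting time, which need auxiliary variables, are left out:

import random
import pulp

students, times = range(10), range(40)      # e.g. 40 discrete time slots
available = {(s, t): random.random() < 0.05
             for s in students for t in times}

prob = pulp.LpProblem("tutoring", pulp.LpMinimize)
x = pulp.LpVariable.dicts("assign", (students, times), cat="Binary")

for s in students:                          # every student gets exactly one slot
    prob += pulp.lpSum(x[s][t] for t in times) == 1
    for t in times:                         # ...and only when available
        if not available[s, t]:
            prob += x[s][t] == 0
for t in times:                             # at most 3 students at a time
    prob += pulp.lpSum(x[s][t] for s in students) <= 3

prob.solve(pulp.PULP_CBC_CMD(msg=False))    # may be infeasible for unlucky draws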

Assuming that each student is randomly available 5% of the time, we get:

Coloring: Dark blue is active. Light blue is available student. Red is teacher waiting time.

söndag 3 februari 2019

Labyrinth: Adversarial search for BFS

Previous:
  1. random-labyrinths
  2. labyrinth-complexity
  3. breadth-first-search
  4. monte-carlo-simulations
Now that the infrastructure for generating, visualizing, analyzing, and searching labyrinths is in place, we are ready for the big payoff: actually searching for hard labyrinths. It turns out that one can get very good results with a simple algorithm:


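The embedded code is not shown; the loop amounts to something like this, where hardness() is the score from labyrinth-complexity and the labyrinth helpers are hypothetical:

import random

def random_search(labyrinth, iterations):
    best, best_cost = labyrinth, hardness(labyrinth)
    for _ in range(iterations):
        candidate = best.copy()
        for edge in random.sample(list(candidate.edges), 4):
            candidate.remove_edge(edge)     # knock out 4 random edges
        candidate.connect_randomly()        # repair connectivity
        cost = hardness(candidate)
        if cost >= best_cost:               # ties accepted, to leave plateaus
            best, best_cost = candidate, cost
    return best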
The only thing being done here is to remove a random subset of 4 edges and then connect the labyrinth again. If the cost is higher, we keep the change. If the cost is lower, we revert. If the cost is equal to the best cost, we accept it, in order to get out of plateaus.

For N=100 nodes, the optimal hardness is found in less than 5000 iterations:
Optimally hard labyrinth for BFS. All nodes are expanded, and the solving path is as short as possible.
I cannot say definitively whether this is the optimal hardness. However, 180 is not possible for parity reasons. 
For N=900 nodes we still get close to optimum, but struggle in the final points. I guess we have to be a little smarter next time!
Optimum is 1739.
Optimum is about 1739.


Labyrinth: Monte Carlo simulations

Previous:

  1. random-labyrinths
  2. labyrinth-complexity
  3. breadth-first-search
Using Monte Carlo simulations, we can empirically verify some earlier conclusions about DFS and BFS search. 

Test 1
10x10 labyrinth, start = (0, 0), end = (9, 9). end may not be a leaf. Fixed random seed for labyrinth generation, variable random seed for the search. Results:

DFS, empirical figures, 1000 iterations
N=10, M=10
Path length avg: 21.0, std: 0.0
Search cost avg: 96.8, std: 44.6

BFS, empirical figures, 1000 iterations
N=10, M=10
Path length avg: 21.0, std: 0.0
Search cost avg: 98.0, std: 0.0

This confirms that BFS has small variability when the labyrinth is fixed, but DFS still has high variability. Both algorithms have the same cost on average (which also happens to be about the same as the number of nodes).

Test 2

10x10 labyrinth, start = (0, 0), end = (4, 4). end may not be a leaf. Variable random seed for labyrinth generation, variable random seed for the search. Results:

DFS, empirical figures, 1000 iterations
N=10, M=10
Path length avg: 15.0, std: 7.0
Search cost avg: 81.5, std: 56.3

BFS, empirical figures, 1000 iterations
N=10, M=10
Path length avg: 15.6, std: 7.5
Search cost avg: 64.4, std: 36.5

Here we see that BFS has a lower cost on average. This is reasonable, as end is typically at a lower depth when it is in the middle of the random labyrinth. 

To make it possible to do so many iterations, I had to optimize connect_labyrinth:


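The optimized code is not embedded, and I don't know the exact change; one standard way to make the connecting step near-linear is union-find over shuffled candidate edges:

import random

def connect_labyrinth(nodes, edges, candidate_edges):
    # add edges between different components until everything is connected;
    # union-find with path halving keeps this near-linear
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    candidates = list(candidate_edges)
    random.shuffle(candidates)
    for u, v in candidates:
        ru, rv = find(u), find(v)
        if ru != rv:              # u and v are in different components
            parent[ru] = rv
            edges.add((u, v))
    return edges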
A quick test gives that N=10000 nodes in the labyrinth takes 0.25 seconds and N=99225 (=315^2) nodes takes 7.94 seconds. With the old algorithm, connecting with N=10000 nodes took over 100 seconds. 

Labyrinth: Breadth-first search and plotting search trees

Previous:

  1. random-labyrinths
  2. labyrinth-complexity
In the previous post we investigated how a depth-first search fares in a labyrinth with N nodes. We showed that:
  • A good model for a search cost is to take the number of passed edges. That is equal to 1 times the number of edges in the found path, plus 2 times the remaining visited edges. 
  • Worst case cost is 2*N if the end node is a leaf node. If end is not a leaf node, disregard all nodes that would not be connected to start if we removed end. 
  • The average case cost is N-1, independently of the labyrinth (but assuming that end is a leaf node).
Let's move on to breadth-first search. BFS is predictable in that we know that if end has a depth of k, then all nodes with depth lower than k will be expanded before end. This allows us to make the following evil example:
A labyrinth where BFS will always get a maximal score.
So when end is on the opposite side from start, it is easy to "break" the BFS algorithm. When end is in another place, it is not obvious which labyrinth forces the greatest search cost for BFS. We will return to this problem later.

To help us see how different algorithms work, we can plot the search trees:

For a given labyrinth, BFS always gets about the same score.

DFS is sometimes really lucky...

And other times very unlucky!


DFS was implemented as:

BFS was implemented as:

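Neither snippet survived the export; a combined sketch of both, where the only difference is which end of the frontier is popped (random tie-breaking assumed):

import random
from collections import deque

def search(labyrinth, start, end, bfs):
    # labyrinth: dict mapping each node to its list of neighbours
    frontier, parent = deque([start]), {start: None}
    while frontier:
        node = frontier.popleft() if bfs else frontier.pop()
        if node == end:
            return parent         # the search tree, as a parent map
        neighbours = [v for v in labyrinth[node] if v not in parent]
        random.shuffle(neighbours)
        for nxt in neighbours:
            parent[nxt] = node
            frontier.append(nxt)
    return None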
The implementations of BFS and DFS are very similar. The underlying idea for both algorithms is similar: save nodes in a data structure and apply a decision for which node to expand next.

lördag 2 februari 2019

Labyrinth complexity

We want to make good labyrinths. A good labyrinth is maximally hard to solve, but has only a small number of nodes and edges. However, the effort it takes to solve the labyrinth depends on which algorithm is used. We would like a measure of hardness that is as independent of the algorithm as possible.
A solved labyrinth
One way to measure complexity is to look at the number of nodes that are expanded during search. We can postulate that in a perfect labyrinth, all nodes must be visited. This turns out to have a simple, but unsatisfying solution:
A labyrinth that forces you to visit all nodes, but is still "easy".
This is not what we would like to mean by a good labyrinth. Intuitively, in a hard labyrinth, solving is not a straight march to the goal; rather, one is forced to retrace one's steps a lot. There should be some penalty for backtracking. A measure that captures this is to count the number of passed edges. Now, if we encounter a dead end (leaf node), we also have to count the edges again when backtracking. 

Suppose that our search agent makes all decisions by coin flip. Then, the expected remaining steps, E, for a node, v, obeys the equation:
\[E(v) = 1 + mean(\{E(v_i)\}_i)\]
where \[\{v_i\}\] are the neighbours of v. That is, unless v is the end node, in which case \[E(v_{end}) = 0\].
Solving this is just a matter of solving a linear equation system. It can be formulated nicely in a constraint paradigm:
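A sketch of what this could look like in PuLP (the post does not name the modeling library; CBC is used elsewhere on this blog). The expectations are free variables tied together by equality constraints, so the LP presolver solves the whole linear system:

import pulp

def expected_steps(neighbours, end):
    # neighbours: dict mapping each node to its list of adjacent nodes
    prob = pulp.LpProblem("random_walk", pulp.LpMinimize)
    E = pulp.LpVariable.dicts("E", list(neighbours))
    for v, ns in neighbours.items():
        if v == end:
            prob += E[v] == 0
        else:
            # E(v) = 1 + mean of the neighbours' expectations
            prob += len(ns) * E[v] == len(ns) + pulp.lpSum(E[u] for u in ns)
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {v: E[v].value() for v in neighbours}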
Note that the use of a Linear Programming solver is overkill, but makes for an elegant formulation. The entire problem will be solved in the presolver. So what is the expected number of passed edges from start to end for a random labyrinth? The result for a 10x10 instance:
10x10 instance complexity computed for random walk.
Nodes are visited on average over 20 times before end is reached - bad!
The problem here is obvious: the random walk is just too stupid to take seriously. Each node is visited 20.58 times before we reach the end. Searching through the entire labyrinth with depth-first expands 2 times the number of nodes (proof later), so that should be an upper bound.

Let's assume then that our search agent does a depth-first search, remembering which subtrees have already been visited completely. 

Given a labyrinth then, how do we compute the expected number of edges that will be passed by the depth-first agent to solve it? This turns out to depend on whether the labyrinth is a tree or not. The labyrinths generated in [1] are trees. This follows from the fact that all the links we added were between non-connected components. It can be visualized by plotting paths to all nodes:
Paths to all nodes
Just the paths
If the labyrinth is a tree and the end node is a leaf node, the expected number of passed edges is always the number of nodes (N) minus 1. This can be shown by an inductive argument. For N=1 we start in the end node, so the cost is 0. Now, suppose the expected cost starting from a node v in a tree with N nodes is N - 1, and connect another tree T_new to v. The cost is now:

\[E(v) = \frac{1}{2}E(T_{original}) + \frac{1}{2}\left(E(T_{new}) + E(T_{original})\right) = N_{original} - 1 + \frac{1}{2}E(T_{new})\]

Exploring T_new completely and returning to v costs twice the number of nodes in the subtree, since every node is expanded and popped exactly once, and each of those passes an edge. So the expected contribution \[\frac{1}{2}E(T_{new})\] equals the number of nodes in T_new, and the sum adds up to the number of nodes in the full tree minus 1.

So, it turns out that all tree-shaped labyrinths of the same size are equally hard for a depth-first agent. This is bad news for those of us who would like to make hard mazes. But we can always try tricking another agent. 

This is also valid as an upper bound for non-tree shaped labyrinths. When a node we are expanding from has a neighbour which has already been visited, we can always pretend that that edge doesn't exist. This will make the search behave as a tree search. 

References