Garage of Code: Sampling: mutating box trees

Idea for series sampling

In the previous post, I sampled time series by defining transition probabilities using boxes of different sizes. We want to find the most complex time series using the least complex representation of the transitions. We need to start searching for better transition functions in the box representation.

First, we need to mutate the box representation. We can see the boxes as a non-balanced binary tree. Each node represents a box. Each box is either split or not split - i.e. each node has either 0 or 2 children. The root node is the box representing the whole space. I mutate the box tree by removing both children of a non-leaf box. I then insert as many leafs as was lost in the remove operation. So, the number of leafs are kept the same. This gives a very quickly changing box landscape:

Random mutations on the box space. All sub-boxes in a random box are deleted. Then, new leafs are inserted, keeping the number of boxes the same.

Now we need a measure for the complexity of the box tree. I use area-wise entropy:

$S = -\sum_{i \in boxes} \log_2(\frac{A_i}{A})\frac{A_i}{A}$

Where A_i is the area of the box and A is the total area. This is a measure of how much we can compress a point's location, if we know that it is in one of the boxes with equal probability. If we do a greedy search to maximize the entropy, we get:

Greedy search to maximize box entropy. Optimum is that all boxes have same area. There are 64 boxes, and the total area is 1, so ideally they should all have the area 2^(-6).

Greedy search to minimize entropy. Optimum is that one box covers the whole area and the others have 0 area. This cannot be represented by the box tree, so it produces a geometric series of boxes.

We need also a score for the resulting time series. Ideally, the series should cover a lot of the box space. I use the count-wise entropy:

$S_{series} = -\sum_{i \in boxes} \log_2(\frac{C_i}{C})\frac{C_i}{C}$

Where C_i is the count of the series for box i, and C is the total number of elements in the series. With this, we can do a greedy search to maximize S_series - S_boxes:

Greedy search to maximize S_series - S_boxes. Estimating S_series by sampling is not very reliable.

The problem is that we cannot reliably find out the entropy of the series, because it relies on the stationary distribution. At this point, I realize that what we are doing actually defines a Markov chain. And for Markov chains, we can calculate the stationary distribution. So, that is a good next step in order to get a better measure of process complexity.

Garage of Code

codecogs equations

tisdag 26 februari 2019

Sampling: mutating box trees

Inga kommentarer:

Skicka en kommentar