Greedy Finestructure optimisation



We have created a simple bash script that uses the existing finestructure commands to compute the MAP (maximum a posteriori) state estimate using greedy optimisation. This can be many times faster than performing full MCMC, and is suitable for very large datasets (it is probably your only option for 10000+ samples). At that scale ChromoPainter will itself have become a very significant cost (although note that ChromoPainter is a parallelizable step).

You only need:
Running can be as simple as:
  datafile.chunkcounts.out outputfile.xml
This uses the "tree building" step of finestructure by repeatedly:
  • Attempting MCMC moves, accepting only if they increase the posterior probability
  • Checking after a certain number of iterations whether any progress has been made
There is therefore a danger of getting stuck in a local optimum, and of failing to find a possible move that would increase the posterior. However, empirically the approach performs well enough for exploratory data analysis. Convergence is assessed simply by counting the number of populations (as it is unlikely that populations can be added and then removed again).
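The stopping rule described above can be sketched as a small bash loop. This is only a toy illustration, not the script itself: `run_step` is a hypothetical stand-in that pretends each greedy run raises the population count K by 3 until it reaches 10, and `max_runs` plays the role of the -m flag. No finestructure command is called.

```bash
#!/usr/bin/env bash
# Toy sketch: repeat a greedy step until the number of populations (K)
# stops changing, or until the run limit is reached.

max_runs=20   # plays the role of -m (assumed name, for illustration)

# Hypothetical stand-in for one greedy run: takes the current K and
# prints the K after the run. Here K grows by 3 and is capped at 10.
run_step() {
  local k=$1
  local next=$(( k + 3 ))
  if [ "$next" -gt 10 ]; then
    next=10
  fi
  echo "$next"
}

k=1
runs=0
converged=no
while [ "$runs" -lt "$max_runs" ]; do
  new_k=$(run_step "$k")
  runs=$(( runs + 1 ))
  # Convergence check: an unchanged population count means no progress.
  if [ "$new_k" -eq "$k" ]; then
    converged=yes
    break
  fi
  k=$new_k
done

echo "runs=$runs K=$k converged=$converged"
```

Note that a check of this kind cannot tell a true optimum from a run that simply made too few attempts to find an improving move, which is why the number of iterations per step matters.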

Here are the usage details:
Usage: ./ [-r] [-R] [-d] [-m value] [-x value] [-t value] [-a value] [-f value] datafile outputfile
Essentials: datafile and outputfile
Important flags are -m and -x
-m value: sets the number of repeated FineSTRUCTURE runs to perform before giving up. (default: 20)
-x value: sets the number of FineSTRUCTURE iterations to perform per step (finestructure -x flag). (default: 20000)
-t value: (finestructure -t flag). (default: t=100000000, i.e. effectively infinite. Careful, this may be slow)
-a value: finestructure flags to be passed to all runs, e.g. "-X -Y". Quotes essential! Usually not needed. (default: "")
-f value: set the location of the finestructure executable (default: finestructure)
-r: when set, temporary files are replaced. Without this you can run more iterations by changing -m and -x
-R: when set, the final tree file is deleted if present. Default is to not run.
-d: perform a dry run but don't actually do anything. Useful to see the FineSTRUCTURE arguments used in each step.
EXAMPLE: ./ -a "-X -Y\ -c 0.2" -m 4 -t 1000 -x 50000 test.chunkcounts.out testgreedy.xml
.. continued with: ./ -a "-X -Y\ -c 0.2" -m 10 -R -t 1000 -x 50000 test.chunkcounts.out testgreedy.xml
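
As a rough illustration of how an option set like the one above can be handled in bash, here is a minimal sketch using `getopts`, with a dry-run helper in the spirit of -d. It is not the actual script: the function names (`parse_args`, `run_cmd`), the variable names and the layout of the printed finestructure command line are all assumptions made for the example, so check finestructure's own help before relying on any argument order.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: parse the documented flags and support a dry run.

parse_args() {
  # Defaults taken from the usage text above.
  max_runs=20; iters=20000; tree_max=100000000
  extra=""; fs_exe="finestructure"
  replace_tmp=0; redo_tree=0; dry_run=0
  local opt
  OPTIND=1
  while getopts "rRdm:x:t:a:f:" opt; do
    case "$opt" in
      r) replace_tmp=1 ;;
      R) redo_tree=1 ;;
      d) dry_run=1 ;;
      m) max_runs=$OPTARG ;;
      x) iters=$OPTARG ;;
      t) tree_max=$OPTARG ;;
      a) extra=$OPTARG ;;      # quoted string of extra flags, e.g. "-X -Y"
      f) fs_exe=$OPTARG ;;
      *) return 1 ;;
    esac
  done
  shift $(( OPTIND - 1 ))
  # Exactly two positional arguments are expected: datafile and outputfile.
  [ "$#" -eq 2 ] || return 1
  datafile=$1
  outputfile=$2
}

# Print the command instead of running it when a dry run was requested.
run_cmd() {
  if [ "$dry_run" -eq 1 ]; then
    echo "DRY RUN: $*"
  else
    "$@"
  fi
}

parse_args -d -a "-X -Y" -m 4 -t 1000 -x 50000 test.chunkcounts.out testgreedy.xml
echo "m=$max_runs x=$iters t=$tree_max in=$datafile out=$outputfile"

# $extra is left unquoted on purpose so that "-X -Y" splits into two flags.
# The argument layout below is illustrative only.
run_cmd "$fs_exe" $extra -x "$iters" "$datafile" "$outputfile"
```

This also shows why the quotes around the -a value are essential: without them, `getopts` would take only the first word as the argument to -a and try to parse the rest as separate options.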

FineSTRUCTURE is in theory run until convergence; i.e. until successive greedy tree runs have the same tree.
You must therefore set "-x" large enough to find differences at each step.
With "-x" too small, early stopping is likely and a lower K will be found.
The tree is computed only once, at the end; intermediate trees are present but highly stochastic.
Set "-t" to some smaller value if you are worried you may find very many populations; you can always rerun the final step.
If you have problems, please let me know at: and I'll do my best to help.