PaintMyChromosomes.com
fineSTRUCTURE v2 & GLOBETROTTER

© 2012 Daniel Lawson.

9 Greedy finestructure

We provide a simple bash script that uses the existing finestructure commands to compute the MAP (maximum a posteriori) state estimate by greedy optimisation. This can be many times faster than performing full MCMC, and is suitable for very large datasets (it is probably your only option for 10000+ samples). At that scale, running ChromoPainter will itself have become a very significant cost.

To use greedy optimisation, you should:

  1. Run ChromoPainter to obtain the combined coancestry matrix using fs <filename>.cp <options> -combines2.
  2. Run finestructuregreedy.sh <filename>.chunkcounts.out outputfile.xml. This uses the "tree building" step of finestructure by repeatedly:
    • Attempting MCMC moves, accepting them only if they increase the posterior probability.
    • Checking after a set number of iterations whether any progress has been made.
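
The two steps above can be sketched as a small wrapper. This is only an illustration: `greedy_cmd` is a hypothetical helper, not part of fineSTRUCTURE, and the `test` prefix and `testgreedy.xml` output name are borrowed from the script's own example. It prints the step 2 command line rather than running it, and leaves step 1 as a comment because the ChromoPainter options depend on the dataset.

```shell
#!/usr/bin/env bash
# Sketch only: greedy_cmd is a hypothetical helper, not part of fineSTRUCTURE.
# It prints the step 2 command line for a given file prefix so that it can be
# inspected before being run.

# Step 1 (not run here): fs <filename>.cp <options> -combines2
# which provides <filename>.chunkcounts.out for step 2.

greedy_cmd() {
  local prefix="$1"          # e.g. "test" for test.chunkcounts.out
  local steps="${2:-50000}"  # value passed to -x (script default: 50000)
  local repeats="${3:-20}"   # value passed to -m (script default: 20)
  printf 'finestructuregreedy.sh -m %s -x %s %s.chunkcounts.out %sgreedy.xml\n' \
    "$repeats" "$steps" "$prefix" "$prefix"
}

greedy_cmd test
```

Adding the script's documented `-d` flag to the printed command performs a dry run, which shows the fineSTRUCTURE arguments for each step without doing any work.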

IMPORTANT NOTE: The -x option sets how many finestructure iterations are performed in each step, that is, between successive checks for progress. The default of 50000 may be moderately slow, but it does try hard to find a better state. If -x is too low, the algorithm will terminate prematurely.

There is a danger of getting stuck in a local optimum, and of failing to find a possible move that would increase the posterior. Empirically, however, the approach performs well enough for exploratory data analysis. Convergence is assessed simply by counting the number of populations (as it is unlikely that adding and then removing populations is possible).
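
The stopping rule described here can be illustrated with a toy model. This is not the code of finestructuregreedy.sh: `stop_run` is a hypothetical function, the population counts are made-up numbers, and the first argument merely plays the role of the `-m` limit on repeated runs. It treats an unchanged population count between successive runs as convergence and otherwise gives up once the run budget is spent.

```shell
#!/usr/bin/env bash
# Toy illustration of the stopping rule; NOT the code of finestructuregreedy.sh.
# Arguments: max_runs (plays the role of -m), followed by the population
# count K observed after each run (made-up numbers supplied by the caller).
stop_run() {
  local max_runs="$1"; shift
  local prev="" run=0 k
  for k in "$@"; do
    run=$((run + 1))
    # An unchanged K between successive runs is treated as convergence.
    if [ "$k" = "$prev" ]; then
      echo "converged at run $run (K=$k)"
      return 0
    fi
    # Give up once the run budget is exhausted.
    if [ "$run" -ge "$max_runs" ]; then
      echo "stopped at run $run without convergence (K=$k)"
      return 0
    fi
    prev="$k"
  done
  echo "ran out of input after $run runs"
}

stop_run 20 1 14 23 27 27   # prints: converged at run 5 (K=27)
stop_run 3 1 14 23 27 27    # prints: stopped at run 3 without convergence (K=23)
```

The second call shows why a small run budget can report a lower K: the search is cut off while the count is still changing.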


Listing 14: finestructuregreedy
> finestructuregreedy.sh
 
ERROR: Require .xml file name ending for outfile ("" is invalid) 
Usage: ../scripts/finestructuregreedy.sh: [-r] [-R] [-d] [-m value] [-x value] [-t value] [-a value] [-f value] datafile outputfile
Essentials: datafile and outputfile 
Important flags are -m and -x 
  -m value: sets the number of repeated FineSTRUCTURE runs to perform before giving in. (default: 20) 
  -x value: sets the number of FineSTRUCTURE iterations to perform per step (finestructure -x flag). (default: 50000) 
  -t value: (finestructure -t flag). (default: t=100000000, i.e. effectively infinite. careful, this may be slow) 
  -a value: finestructure flags to be passed to all runs, e.g. "-X -Y". Quotes essential! Usually not needed. (default: "")
  -f value: set the location of the finestructure executable (default: finestructure) 
  -r: when set, temporary files are replaced. without this you can run more iterations by changing -m and -x 
  -R: when set, the final tree file is deleted if present. Default is to not run. 
  -d: perform a dry run but don't actually do anything. Useful to see the fineSTRUCTURE arguments used in each step.
EXAMPLE: ../scripts/finestructuregreedy.sh -a "-X -Y\ -c 0.2" -m 4 -t 1000 -x 50000 test.chunkcounts.out testgreedy.xml 
.. continued with: ../scripts/finestructuregreedy.sh -a "-X -Y\ -c 0.2" -m 10 -R -t 1000 -x 50000 test.chunkcounts.out testgreedy.xml
 
FineSTRUCTURE is in theory run until convergence; i.e. until successive greedy tree runs have the same tree. 
You must therefore set "-x" large enough to find differences at each step. 
With "-x" too small, early stopping is likely and a lower K will be found. 
The tree is computed only once, at the end; intermediate trees are present but highly stochastic. 
Set "-t" to some smaller value if you are worried you may find very many populations; you can always rerun the final step.
 
You may ignore the two warnings: 
WARNING!  NOT TESTING ALL <c> COMBINATIONS! (max 1) 
WARNING! Cannot confirm data file is the same as the MCMC was run on! 
The first is generated by each iteration, the second by all but the initial run.
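
Since these two warnings recur on every iteration, it can help to filter them out when reading a long log. The following is a minimal sketch: `filter_known_warnings` is a hypothetical helper, and the demonstration input is made up apart from the two warning lines quoted above.

```shell
#!/usr/bin/env bash
# Sketch: drop the two warnings that the help text says may be ignored and
# keep every other line. filter_known_warnings is a hypothetical helper.
filter_known_warnings() {
  # -F matches fixed strings, -v inverts the match, -e gives each pattern.
  grep -v -F \
    -e 'NOT TESTING ALL <c> COMBINATIONS!' \
    -e 'Cannot confirm data file is the same as the MCMC was run on!'
}

# Demonstration on made-up input lines:
printf '%s\n' \
  'WARNING!  NOT TESTING ALL <c> COMBINATIONS! (max 1)' \
  'some other line' \
  'WARNING! Cannot confirm data file is the same as the MCMC was run on!' \
  | filter_known_warnings
```

To apply it to a real run, pipe the script's output through the function. Whether the warnings are written to standard output or standard error is not stated here, so redirecting with `2>&1` before the pipe is an assumption. Note that `grep -v` exits with a non-zero status when every input line was filtered out.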