PaintMyChromosomes.com
fineSTRUCTURE v2 & GLOBETROTTER

© 2012 Daniel Lawson.

8 Computational considerations

The ChromoPainter step has computational cost proportional to N²L, where N is the number of individuals and L is the number of SNPs. The work is parallelized automatically, so on a sufficiently large compute cluster the wall-clock time scales as NL. For guidance, L = 88K and N = 100 takes 3130 seconds (about 52 minutes) on a 2010 laptop using a single CPU. L = 88K and N = 500 takes 264000 seconds (about 3 days). L = 800K and N = 1000 (an HGDP-scale dataset) required a week on a moderate-scale cluster. L = 10M (sequence data) with N = 500 requires a similar amount of compute. N = 1500 is about as high as is reasonably manageable for sequence data; up to N = 3000 is manageable for SNP chip data.
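As a rough planning aid, the scaling described above (quadratic in N, linear in L) can be turned into a back-of-envelope estimate. The sketch below is not part of the fineSTRUCTURE distribution; the function names are hypothetical, and it assumes the single-CPU laptop benchmark quoted above (N = 100, L = 88K, 3130 seconds) is a fair reference point for your hardware.

```python
# Back-of-envelope ChromoPainter cost estimate (hypothetical helper, not
# part of the fineSTRUCTURE package). Assumes cost scales as N^2 * L and
# calibrates against the single-CPU benchmark quoted in the text.
REF_N = 100            # individuals in the reference benchmark
REF_L = 88_000         # SNPs in the reference benchmark
REF_SECONDS = 3130.0   # single-CPU run time of the reference benchmark


def estimate_cpu_seconds(n, l):
    """Total single-CPU seconds, extrapolated under N^2 * L scaling."""
    return REF_SECONDS * (n / REF_N) ** 2 * (l / REF_L)


def estimate_wall_seconds(n, l, n_jobs):
    """Wall-clock seconds if the work splits evenly over n_jobs processes."""
    return estimate_cpu_seconds(n, l) / n_jobs


if __name__ == "__main__":
    for n, l in [(100, 88_000), (500, 88_000), (1000, 800_000)]:
        hours = estimate_cpu_seconds(n, l) / 3600
        print(f"N={n}, L={l}: about {hours:.1f} CPU hours")
```

Treat the result as optimistic: the N = 500 timing quoted above (264000 seconds) is roughly 3.4 times larger than this simple extrapolation gives (78250 seconds), so real runs may take several times longer than the estimate.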

The main barrier to computation for sequence data is memory. The memory required by each parallel run is proportional to NL, which can reach several gigabytes and prevents easy parallelization. We are addressing this, but in the meantime you may need to customize the provided qsub script to request an appropriate amount of memory per chromosome.
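To decide how much memory to request per chromosome, a simple estimate based on the NL scaling above may help. This is only a sketch: the function name is hypothetical, and the `bytes_per_entry` constant is an assumption (ChromoPainter's actual per-entry footprint is not stated here), so calibrate it against a small test run before relying on the numbers.

```python
# Rough per-chromosome memory estimate under the N * L scaling described
# above. bytes_per_entry is an ASSUMED constant, not a documented value;
# measure it on a small run and adjust.
def estimate_memory_gb(n_individuals, n_snps, bytes_per_entry=1.0):
    """Estimated memory in gigabytes (1e9 bytes) for one parallel run."""
    return n_individuals * n_snps * bytes_per_entry / 1e9


def memory_per_chromosome_gb(n_individuals, snps_per_chromosome,
                             bytes_per_entry=1.0):
    """Map chromosome label -> estimated memory in gigabytes."""
    return {
        chrom: estimate_memory_gb(n_individuals, n_snps, bytes_per_entry)
        for chrom, n_snps in snps_per_chromosome.items()
    }


if __name__ == "__main__":
    # Illustrative SNP counts only; substitute the counts from your data.
    example = {"chr1": 800_000, "chr22": 150_000}
    for chrom, gb in memory_per_chromosome_gb(500, example).items():
        print(f"{chrom}: request at least {gb:.2f} GB")
```

Because the largest chromosomes dominate, it is usually enough to size the request for the chromosome with the most SNPs and reuse that value for the rest.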

If you are attempting to work with a dataset at or above this scale, we have methodology in development for it. PBWT painting (an approximate algorithm) is orders of magnitude faster, and we are also developing low-memory, efficient versions of the ChromoPainter algorithm. Contact us if you are interested in joining the development of these algorithms.

Running FineSTRUCTURE is also a problem at this scale. Its run time is independent of L, and it has been run successfully (taking approximately two weeks) for N = 2000. For larger runs we provide an optimization script (see scripts/finestructuregreedy.sh) that greedily searches for the maximum a posteriori state. This typically gets stuck in a local mode, but multiple independent runs find similar enough best states to be useful. Expect serious problems above N = 10000.
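The general pattern of greedy search with independent restarts can be illustrated on a toy problem. The sketch below is not the algorithm in scripts/finestructuregreedy.sh; the landscape, function names, and starting points are all invented for illustration. It shows only why a single greedy run can stop at a local mode and why comparing several runs helps.

```python
# Toy illustration of greedy search with independent restarts.
# This is NOT the finestructuregreedy.sh algorithm; everything here is a
# made-up example of the general idea.
def greedy_climb(score, start, neighbours):
    """Move to the best-scoring neighbour until none improves the score."""
    state = start
    while True:
        best = max(neighbours(state), key=score)
        if score(best) <= score(state):
            return state  # no neighbour is better: a (possibly local) mode
        state = best


def best_of_restarts(score, starts, neighbours):
    """Run one greedy climb per start; return the best mode and all modes."""
    modes = [greedy_climb(score, s, neighbours) for s in starts]
    return max(modes, key=score), modes


# A small one-dimensional landscape with three peaks (indices 2, 7, 11).
LANDSCAPE = [0, 2, 5, 3, 1, 4, 8, 9, 6, 2, 3, 7, 4]


def toy_score(i):
    return LANDSCAPE[i]


def toy_neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


if __name__ == "__main__":
    best, modes = best_of_restarts(toy_score, [0, 4, 12], toy_neighbours)
    print("modes found:", modes)
    print("best mode:", best, "with score", toy_score(best))
```

Each start ends at a different peak, and only one of them is the global maximum. In practice, comparing the best states from several independent runs gives a sense of how much the answer depends on the starting point.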