Complex example on simulated data
This section is under preparation: the commands are not yet guaranteed to represent the best possible solution to the problem. In particular, the recombination section is due to be streamlined.
Here is a walkthrough for the simulated data example found in the fineSTRUCTURE paper. There is NO help here about phasing, but multiple datafiles are dealt with, where both individuals and genetic regions are separated for processing on a parallel machine.
By following the example, you will establish a pipeline that can easily be modified for real datasets of arbitrary complexity.
Note that in practice (for large datasets) you will want to run ChromoPainter on a cluster, probably running Linux. This is simply because of the large volume of data to be processed. These instructions assume that you have access to such a computer and know how to run jobs in parallel. Without this, you can still make progress but it will be very slow, and also irritating to have to run the process 100 times using a GUI on Windows.
Note that this script splits up the data processing more than strictly necessary, painting each individual for each genetic block separately. You can trivially run all individuals simultaneously (the "-a 0 0" option of chromopainter) and, with a little more effort, you can run all chromosomes together (by constructing a genetic map that separates chromosomes with negative genetic distances). However, the opportunities for parallel processing of very large datasets are clearer with the pipeline as constructed here.
Also note: we use two notions of a "region": a single linked genetic region (like a chromosome; each is a phase input file here) and a bootstrap region, which is k haplotype chunks (or SNPs in the unlinked model). These should be identifiable from context, but contact me if you find it unclear.
Useful tools
If possible, you should use a recombination map, which can be converted into the correct format with convertrecfile.pl. However, recombination maps are currently unavailable for many species. If no map is available, a uniform map should be used, which you can generate with the perl script makeuniformrecfile.pl. Additionally, the neaverage.pl script computes the average value of the effective population size for this case. These are demonstrated in the example below; using a real recombination map is almost identical. Finally, chromopainter ignores individual names, but the finestructure pipeline can use them. You can put them back in using chromopainterindivrename.pl.
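If you do have a genetic map for your species, the conversion might look something like the following sketch. This is a hypothetical invocation: the file names are placeholders and the argument order is an assumption, so run the script without arguments to see its actual usage.
./convertrecfile.pl mydata.phase mymap.txt mydata.recombfile # hypothetical: phase file, input genetic map, output recombination file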
Overview of the pipeline
Steps 1 and 2b will be streamlined in the future. Note that the provided scripts assume an equal length of genome per alignment block, and that no genetic map is available.
1. Create a recombination map file for the selected SNPs, either as a uniform rate or by converting a real one if it exists for your species.
2. (a) Estimate the true recombination rate (actually, the effective population size) in a sample of the genomic blocks using ChromoPainter's built-in E-M procedure.
   (b) Combine these into a single population size by averaging.
3. Run ChromoPainter with this recombination rate.
4. Run ChromoCombine.
5. Run fineSTRUCTURE.
Simulated data
Download 100 5Mb genetic regions of data here. These files have had 20 individuals per population extracted, and have been converted into the PHASE format with an additional (first) line in the header containing "0 0". This tells ChromoPainter that all individuals are being painted with all other individuals.
unzip myres100regions.phase.zip # will extract 100 files to ./simdatanosingletons/
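For reference, the top of each extracted file should look something like the following sketch. The layout follows the standard ChromoPainter PHASE format as we understand it (the "0 0" line, the number of haplotypes, the number of SNPs, the "P" line of SNP positions, then one haplotype per line); all counts and positions below are illustrative, not taken from the real files.
0 0
200
5000
P 255 5403 9412 14785 ...
10010110...
01100010...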
You can also get a list of individual names (which are here very boring) at indivnamelist.txt, which will be useful for when you want to interpret your own data.
(You can also get some example recombination map files from here, but these are not compatible with the example below as they correspond to a very slightly different set of SNPs. We will provide these, and detailed help on generating them, at a later date).
Unlinked model
We will start with the unlinked model, since we can skip steps 1 and 2 in this case. It is good practice to compare the unlinked and linked models for your dataset as a check that nothing went wrong in the recombination map estimation.
ChromoPainter
ChromoPainter can be run with the following loop:
#!/bin/sh
numrepeats=10
numrecfiles=10
numinds=100
dir="unlinkedresparts"
mkdir -p $dir
for repeat in `seq 1 $numrepeats` ; do
  for recfile in `seq 1 $numrecfiles` ; do
    for ind in `seq 1 $numinds` ; do
      # the following can be parallelized however you please
      chromopainter -a $ind $ind -k 10000 \
        -g simdatanosingletons/myres-$repeat-$recfile.phase \
        -u -o $dir/unlinkedmyres-$repeat-$recfile-ind$ind
    done
  done
done
There is one parameter that needs your input here: "-k kval". This is the number of SNPs (for the unlinked model; chunks for the linked model) making up a bootstrap region, which should be large enough that regions are uncorrelated and hence behave approximately as random samples of the true variance. In practice, "k" should be several times the average LD distance, e.g. a factor of 100, giving approximately 100 random samples with which to estimate the local variance. It is important to select a value significantly larger than the minimum distance at which LD reaches zero, due to the way that we calculate the effective number of chunks "c". If you use a sufficiently large k, then all choices will give the same expected value of c (with a variance that increases with k), so if you are unsure, use one guess, increase it by 50% and try again. (You only need to test this on a small segment of your data, e.g. one phase file in the simulated data.) Here, each phase file does not contain enough data to give a good estimate, and therefore we will instead treat each whole 5Mb genomic region as a bootstrap region, using the "-C" flag of chromocombine below. Provided that your datafiles have the same number of SNPs in them, this approach will always give a "safe" estimate of c, but you need many data regions.

The above procedure generates 10000 fileroots, each consisting of 9 files! Each fileroot contains the .chunkcounts.out file that will eventually be needed by fineSTRUCTURE, as well as the regionchunkcounts.out and regionsquaredchunkcounts.out files required to calculate the effective number of chunks (the scaling constant "c"). It also gives other interesting files, such as the chunklengths and the mutationprobs. Each fileroot is for a single individual in a single 5Mb genetic region, and takes under a minute to generate. You can therefore run this whole process on a single PC in a couple of days, but using, say, 100 machines will make it significantly faster and will help you set things up to cope with your own data. You could instead have skipped the loop over individuals and used the "-a 0 0" option to run all individuals against all others; you would then obtain 100 fileroots, each taking 100 times longer to generate. Parallelizing over individuals is likely to be the most convenient approach for real datasets (which typically have ~20 chromosomes, and these can optionally be merged into a single file - see the FAQ "Can I combine data from different chromosomes into the same run?").
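As a concrete (but optional) way of dispatching these jobs in parallel on a single machine, you could write one chromopainter command per line to a file and hand it to GNU parallel. This is only a sketch, assuming GNU parallel is installed; any cluster queueing system works just as well:
# Generate one chromopainter command per line, mirroring the loop above
for repeat in `seq 1 10` ; do
  for recfile in `seq 1 10` ; do
    for ind in `seq 1 100` ; do
      echo "chromopainter -a $ind $ind -k 10000 -g simdatanosingletons/myres-$repeat-$recfile.phase -u -o unlinkedresparts/unlinkedmyres-$repeat-$recfile-ind$ind"
    done
  done
done > chromopainter.commands
# Run them, e.g. 8 at a time; adjust to your core count
parallel --jobs 8 < chromopainter.commands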
ChromoCombine
ChromoCombine can now be run. We'll also want chromopainterindivrename.pl, which allows us to rename the individuals in the ChromoPainter output to interpretable names. (Get the name list if you don't already have it):
wget http://www.maths.bris.ac.uk/~madjl/finestructure-old/indivnamelist.txt
The chromocombine command is:
chromocombine -C -o unlinkedmyres.100regions.unnamed -d unlinkedresparts
chromopainterindivrename.pl indivnamelist.txt \
unlinkedmyres.100regions.unnamed unlinkedmyres.100regions
This should not take very long, and could instead be performed on your desktop PC using the ChromoCombine GUI. It creates the file "unlinkedmyres.100regions.chunkcounts.out", which is the coancestry matrix that fineSTRUCTURE requires as input; it also contains the correct value of "c". The other files are updated correspondingly. (Because of the -C option, the original regionchunkcounts and regionsquaredchunkcounts files have been discarded and replaced by correct ones that treat each original file as a bootstrap region, so you could now run "chromocombine -o test unlinkedmyres.100regions" and get the same value of c without needing the "-C" flag.) We have (optionally) renamed the ChromoPainter individuals from "IND1-N" to the names given in "indivnamelist.txt" (which are here the same, but won't be for your data).
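The value of c is recorded in the header of the combined chunkcounts file; assuming the "#Cfactor" header convention of ChromoCombine output, you can inspect it with:
head -n 1 unlinkedmyres.100regions.chunkcounts.out # e.g. "#Cfactor 0.2..." (value illustrative)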
fineSTRUCTURE
fineSTRUCTURE now has everything it needs to run. The following would be a sensible way of generating an MCMC sample and population tree:
finestructure -x 100000 -y 100000 -z 100 \
unlinkedmyres.100regions.chunkcounts.out \
unlinkedmyres.100regions.mcmc.xml
finestructure -x 10000 -m T \
unlinkedmyres.100regions.chunkcounts.out \
unlinkedmyres.100regions.mcmc.xml \
unlinkedmyres.100regions.mcmc.tree.xml
This can of course be done in the fineSTRUCTURE GUI or on the command line. Whichever you choose, you should now be able to follow the tutorial/example in the fineSTRUCTURE manual with these data, and obtain very similar results to those obtained with the unlinked data provided with the example (which is for the same data but a slightly different subset).
Linkage model
Recombination map
The procedure is similar to the no-linkage model above, but first we must generate a recombination map file, and additionally estimate the effective population size. You will need the makeuniformrecfile.pl script, which generates a uniform recombination map for a phase file.
numrepeats=10
numrecfiles=10
for repeat in `seq 1 $numrepeats` ; do
  for recfile in `seq 1 $numrecfiles` ; do
    phasefile="simdatanosingletons/myres-$repeat-$recfile.phase"
    recombfile="simdatanosingletons/myres-$repeat-$recfile.recombfile"
    ./makeuniformrecfile.pl $phasefile $recombfile # this is fairly fast
  done
done
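For reference, each recombination file is plain text: a header row, then one row per SNP giving its position and the recombination rate per base pair from that SNP to the next. This matches what makeuniformrecfile.pl produces, as we understand it; the positions and rate below are illustrative.
start.pos recom.rate.perbp
255 1e-08
5403 1e-08
9412 1e-08
...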
Effective population size
We now have recombination maps for every phase file. Unfortunately we don't yet know the recombination rate parameter, or alternatively the effective population size. To estimate this we will perform EM inference on a subset of the data. We will perform 10 EM steps, which therefore takes roughly 10 times longer than a normal chromopainter run.
dir="linkedrespartsEM"
repeat=1
numrecfiles=10
for recfile in `seq 1 $numrecfiles` ;do
phasefile="simdatanosingletons/myres-$repeat-$recfile.phase"
recombfile="simdatanosingletons/myres-$repeat-$recfile.recombfile"
for ind in `seq 10 10 100` ;do #subset of the data
echo "IND $ind of $numinds, RECFILE $recfile of $numrecfiles"
chromopainter -a $ind $ind -i 10 -in -g $phasefile \
-r $recombfile -o $dir/linkedmyres-$repeat-$recfile-ind$ind
done
done
We can now estimate the effective population size using neaverage.pl. The script weights samples by their recombination distances, so we must tell it which recombination file was used to generate each EM output file.
# uses $dir, $repeat and $numrecfiles from the block above
nefile="linkedrespartsEM/myneest.txt"
nelistfile="linkedrespartsEM/myneest-list.txt"
rm -f $nelistfile # CAREFUL: we remove the file if it already exists
for recfile in `seq 1 $numrecfiles` ; do
  recombfile="simdatanosingletons/myres-$repeat-$recfile.recombfile"
  for ind in `seq 10 10 100` ; do # same subset of the data as above
    emfile="$dir/linkedmyres-$repeat-$recfile-ind$ind.EMprobs.out"
    echo "$recombfile $emfile" >> $nelistfile
  done
done
./neaverage.pl -o $nefile -l $nelistfile
globalparams=`cat $nefile` # $nefile contains the commands to tell ChromoPainter
# about both Ne and global mutation rate, e.g. "-n 10000 -M 0.01".
Now "globalparams" contains the effective population size, measured in units that correspond to the recombination maps we generated. (Note that a very accurate estimate of the effective population size is not needed; within 10% or worse will be fine.)ChromoPainter
We are now back where we started with the unlinked model. The difference is that instead of specifying that the files are unlinked, you tell chromopainter about a recombination map file. Additionally, the "-k kval" parameter should be correct at its default value of 100 (since a bootstrap region is now k chunks, i.e. nearly independent haplotype blocks, rather than SNPs). Finally, you need to tell it about the global parameters (Ne and mutation rate) that we calculated.
dir="linkedrespartsNeEst"
mkdir -p $dir
numrepeats=10
numrecfiles=10
for repeat in `seq 1 $numrepeats` ; do
for recfile in `seq 1 $numrecfiles` ;do
phasefile="simdatanosingletons/myres-$repeat-$recfile.phase"
recombfile="simdatanosingletons/myres-$repeat-$recfile.recombfile"
for ind in `seq 1 100` ;do
echo "IND $ind of $numinds, RECFILE $recfile, REPEAT $repeat"
outputfile="$dir/linkedmyres-$repeat-$recfile-ind$ind"
chromopainter -a $ind $ind $globalparams -g $phasefile \
-r $recombfile -o $outputfile
done
done
done
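Before combining, it is worth a quick sanity check that every painting job finished; we expect numrepeats x numrecfiles x numinds = 10000 chunkcounts files:
ls linkedrespartsNeEst/*.chunkcounts.out | wc -l # should print 10000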
ChromoCombine
Then you can run chromocombine similarly to before. It is often the case that the linked model doesn't need the "-C" flag when the unlinked model does: this is because the linked model knows (approximately) how many effectively independent chunks it has. As above, we'll rename the individuals.
chromocombine -o linkedmyres.100regions.unnamed -d linkedrespartsNeEst
chromopainterindivrename.pl indivnamelist.txt \
  linkedmyres.100regions.unnamed linkedmyres.100regions
fineSTRUCTURE
That's the hard part done; we have a valid fineSTRUCTURE input file. fineSTRUCTURE now runs as before on the new output files:
finestructure -x 100000 -y 100000 -z 100 \
linkedmyres.100regions.chunkcounts.out \
linkedmyres.100regions.mcmc.xml
finestructure -x 10000 -m T \
linkedmyres.100regions.chunkcounts.out \
linkedmyres.100regions.mcmc.xml \
linkedmyres.100regions.mcmc.tree.xml
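If you want to inspect these results now, the GUI can be pointed at all three files at once; this sketch simply reuses the finegui flags demonstrated in the "Further analysis" section below:
finegui -c linkedmyres.100regions.chunkcounts.out \
  -m linkedmyres.100regions.mcmc.xml \
  -t linkedmyres.100regions.mcmc.tree.xml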
Further analysis
Here we will look a bit further into how the fineSTRUCTURE linkage and no-linkage models differ, by running the algorithm with differing quantities of data. First we will run the linkage model with a varying number of regions. (These runs only take a few minutes.)
numrepeats=10
numrecfiles=10
filelist=""
for repeat in `seq 1 $numrepeats` ; do
  nr=`echo "$numrecfiles * $repeat" | bc`
  echo "Regions = $nr"
  filelist="$filelist linkedrespartsNeEst/linkedmyres-$repeat-*.chunkcounts.out"
  chromocombine -o linkedrespartsNeEst.${nr}regions $filelist
  finestructure -x 50000 -y 50000 -z 100 \
    linkedrespartsNeEst.${nr}regions.chunkcounts.out \
    linkedrespartsNeEst.${nr}regions.mcmc.xml
  finestructure -x 10000 -m T \
    linkedrespartsNeEst.${nr}regions.chunkcounts.out \
    linkedrespartsNeEst.${nr}regions.mcmc.xml \
    linkedrespartsNeEst.${nr}regions.mcmc.tree.xml
done
We can perform the same analysis for the no-linkage model:
numrepeats=10
numrecfiles=10
filelist=""
for repeat in `seq 1 $numrepeats` ; do
  nr=`echo "$numrecfiles * $repeat" | bc`
  echo "Regions = $nr"
  filelist="$filelist unlinkedresparts/unlinkedmyres-$repeat-*.chunkcounts.out"
  chromocombine -o unlinkedresparts.${nr}regions $filelist
  finestructure -x 50000 -y 50000 -z 100 \
    unlinkedresparts.${nr}regions.chunkcounts.out \
    unlinkedresparts.${nr}regions.mcmc.xml
  finestructure -x 10000 -m T \
    unlinkedresparts.${nr}regions.chunkcounts.out \
    unlinkedresparts.${nr}regions.mcmc.xml \
    unlinkedresparts.${nr}regions.mcmc.tree.xml
done
Now you can load these runs into the finestructure GUI more directly, for example:
nr="10"
finegui -c unlinkedresparts.${nr}regions.chunkcounts.out \
-m unlinkedresparts.${nr}regions.mcmc.xml \
-t unlinkedresparts.${nr}regions.mcmc.tree.xml
which will set the filenames for you, so you can quickly load all three files and view the MCMC clusterings. You should find that the unlinked model cannot get good splits until provided with around 50 regions, but that the linked model does well with even just 10. Given 80 regions, the linked model gets the 5 populations broadly correct, and at 100 regions it is perfect. The unlinked model with 100 regions cannot see the split in the 5th population (nearly 200 regions are needed to identify this split). Note that with correctly inferred recombination maps, the linkage model can do even better than in this example, which used a uniform recombination map.
Stochastic Optimization
For this dataset, optimization isn't needed - the MCMC is pretty quick. But here is how you'd do it in case you want to adapt it to your own data. (The value comes when there are many individuals, e.g. over 1000).
# Start from a single population, and do no MCMC:
finestructure -I 1 -x 0 -y 0 \
 linkedmyres.100regions.chunkcounts.out \
 linkedmyres.100regions.OPT0.xml
# Perform 2000 optimization steps, then build a tree:
finestructure -x 2000 -m T -t 1000 \
 linkedmyres.100regions.chunkcounts.out \
 linkedmyres.100regions.OPT0.xml \
 linkedmyres.100regions.tree.xml
# Extract the final (MAP) state for viewing:
finestructure -I 1 -x 0 -y 0 \
 linkedmyres.100regions.chunkcounts.out \
 linkedmyres.100regions.tree.xml \
 linkedmyres.100regions.mapstate.xml
You would then load this into the GUI using the mapstate file instead of the usual MCMC file:
finegui -c linkedmyres.100regions.chunkcounts.out \
-m linkedmyres.100regions.mapstate.xml \
-t linkedmyres.100regions.tree.xml
Note that I used the "-t" option here: this is because, for large datasets, you might want to create a cheap tree first to see whether you think the MAP state is OK. In that case use e.g. "-t 100" the first time around, as you can continue the MAP estimation, or the tree estimation, from this file. Always use as large a "-t" as you can for the final tree! There is currently no advice on how to set "-x"; I would suggest increasing it by, say, 10% from your first value and seeing if the MAP value changes.
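For example, continuing the tree estimation from a cheap tree with a larger "-t" might look like the following sketch; the flags mirror the optimization commands above, and the output name tree2.xml is just a placeholder:
finestructure -x 2000 -m T -t 100000 \
 linkedmyres.100regions.chunkcounts.out \
 linkedmyres.100regions.tree.xml \
 linkedmyres.100regions.tree2.xml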