PaintMyChromosomes.com
© 2012 Daniel Lawson.
4 Detailed help
4.1 Information on Input formats

See also the conversion scripts in Section 11.

##########################
HELP ON INPUT FORMATS.
This help, combined with looking at the example and using the provided scripts to convert your data, should be enough for most users to get started.

NOTE: You can specify multiple phase and recombination files, one for each chromosome (at least, they are assumed unlinked). Specify via:
    -phasefiles <list>.phase <of>.phase <files>.phase
with corresponding:
    -recombfiles <list>.rec <of>.rec <files>.rec

##########################
IDFILE FORMAT:
This specifies the names of the individuals in the data, as well as (optionally) which population they are from and whether they are included.
Format: N lines, one per individual, containing the following columns:
    <NAME> <POPULATION> <INCLUSION> <ignored extra info>
Where <NAME> and <POPULATION> are strings and <INCLUSION> is 1 to include an individual and 0 to exclude them. The second and third columns can be omitted (but the second must be present if the third is). Currently <POPULATION> is not used by this version of fs.

EXAMPLE IDFILE:
    Ind1 Pop1 1
    Ind2 Pop1 1
    Ind3 Pop2 0
    Ind4 Pop2 1
    Ind5 Pop2 1

##########################
CHROMOPAINTER'S v2 'PHASE' FORMAT:
This is heavily based on 'FastPhase' output.
* The first line contains the number of *haplotypes* (i.e. for diploids, 2 * the number of individuals).
* The second line contains the number of SNPs.
* The third line contains the letter P, followed by the basepair location of each SNP (space separated). These must match the recombination file. Within each chromosome, basepairs must be in order.
* Each additional line contains a haplotype, in the order specified in the IDFILE. Diploids have two contiguous rows. Each character (allowing no spaces!) represents a *biallelic* SNP. Accepted characters are 0,1,A,C,G,T, with NO missing values!

EXAMPLE PHASEFILE:
    10
    6
    P 100 200 300 400 500 600
    010101
    011101
    111101
    001101
    011000
    001100
    001001
    001011
    001001
    001111

##########################
CHROMOPAINTER'S RECOMBINATION FILE FORMAT:
Required only if running in linked mode. This specifies the distance between SNPs in 'recombination rate' units. There should be a header line followed by one line for each SNP in the haplotype infile. Each line should contain two columns, with the first column giving the basepair positions from the haplotype infile, in the same order. The second column should give the genetic distance per basepair between the SNP at the position in the first column of the same row and the SNP at the position in the first column of the subsequent row. The last row should have a '0' in the second column (though this is not required; the value is simply ignored by the program). Genetic distance should be given in Morgans, or at least the relevant output files assume this value is in Morgans.
If you are including genetic information from multiple chromosomes, put a '-9' (or any value < 0) next to the last basepair position of the preceding chromosome.

EXAMPLE RECOMBFILE:
    start.pos recom.rate.perbp
    100 0.01
    200 0.02
    300 -9
    400 0.02
    500 0.05
    600 0

##########################
See the dedicated ChromoPainter v1 manual for more details.
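The format rules above can be checked before running fs. The following is a minimal Python sketch, not part of fs itself: the function names (`read_idfile`, `read_phase`, `read_recomb`) are hypothetical, basepair positions are assumed to be integers, and the example files are assumed to be laid out one record per line as the format descriptions state.

```python
import io

ALLOWED = set("01ACGT")  # accepted SNP characters; no missing values


def read_idfile(handle):
    """Return (name, population, included) for each non-empty idfile line."""
    inds = []
    for line in handle:
        cols = line.split()
        if not cols:
            continue
        population = cols[1] if len(cols) > 1 else None
        included = cols[2] == "1" if len(cols) > 2 else True
        inds.append((cols[0], population, included))
    return inds


def read_phase(handle):
    """Return (positions, haplotypes) after checking the header counts."""
    lines = [ln.strip() for ln in handle if ln.strip()]
    nhaps, nsnps = int(lines[0]), int(lines[1])
    fields = lines[2].split()
    if fields[0] != "P":
        raise ValueError("third line must start with 'P'")
    positions = [int(p) for p in fields[1:]]  # assumption: integer basepairs
    haps = lines[3:]
    if len(positions) != nsnps or positions != sorted(positions):
        raise ValueError("positions must match the SNP count and be in order")
    if len(haps) != nhaps:
        raise ValueError("haplotype count does not match the first line")
    for hap in haps:
        if len(hap) != nsnps or set(hap) - ALLOWED:
            raise ValueError(f"bad haplotype row: {hap!r}")
    return positions, haps


def read_recomb(handle):
    """Return (position, rate) pairs, skipping the header line."""
    rows = [ln.split() for ln in handle if ln.strip()]
    return [(int(pos), float(rate)) for pos, rate in rows[1:]]


IDFILE = "Ind1 Pop1 1\nInd2 Pop1 1\nInd3 Pop2 0\nInd4 Pop2 1\nInd5 Pop2 1\n"
PHASE = (
    "10\n6\nP 100 200 300 400 500 600\n"
    "010101\n011101\n111101\n001101\n011000\n"
    "001100\n001001\n001011\n001001\n001111\n"
)
RECOMB = (
    "start.pos recom.rate.perbp\n"
    "100 0.01\n200 0.02\n300 -9\n400 0.02\n500 0.05\n600 0\n"
)

inds = read_idfile(io.StringIO(IDFILE))
positions, haps = read_phase(io.StringIO(PHASE))
recomb = read_recomb(io.StringIO(RECOMB))

ploidy = 2  # diploid: two contiguous haplotype rows per individual
if len(haps) != ploidy * len(inds):
    raise ValueError("phase file does not hold ploidy * N haplotypes")
if [pos for pos, _ in recomb] != positions:
    raise ValueError("recombination file positions differ from the phase file")
```

Note that the phase file is checked against every individual in the idfile, including those with <INCLUSION> set to 0, since the example phase file holds 10 haplotypes for 5 listed individuals.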
4.2 Help on how the computation is performed

***** Help on computational stages *****
The computation for finestructure is split into 4 main stages. These are breakpoints at which we can export computation to an HPC machine. Before and after each, automatic mode will do the work necessary to construct the next stage. This includes the construction of the command lines to be executed; the command lines themselves are all that is run externally.

pre-stage<x>: performed by fs. A -reset <x> command will result in this being redone.
post-stage<x>: performed by fs. A -reset <x> command will use the output of the post-stage<x-1> processing. This means, for example, that we can avoid needing to duplicate the chromopainter (stage2) runs in order to create a duplicated finestructure (stage3) run.
stage<x>: Either '-dos<x>', meaning that the previously generated commands are run internally (in parallel), or '-writes<x>', meaning that they are written to file to be performed externally in HPC mode.

DETAILS:

==== pre-stage0 ====
#### stage0 ####
Data conversion. Currently not implemented!
==== post-stage0 ====
Action -countdata : Ends stage0. Performs checks on the data and confirms that we have valid data.

==== pre-stage1 ====
Important note: stage1 is skipped when running in unlinked mode (no recombination file provided).
Action -makes1 : Make the stage1 commands.
#### stage1 #### Chromopainter parameter inference
Action -dos1 : Do the stage1 commands. This should only be done in single machine mode; use -writes1 in HPC mode.
Action -writes1 <optional filename> : Write the stage1 commands to file, which is only needed in HPC mode. In single machine mode use -dos1 instead.
==== post-stage1 ====
Action -combines1 : Ends stage1 by combining the output of the stage1 commands. This means estimating the parameters mu and Ne from the output of stage1.

==== pre-stage2 ====
Action -makes2 : Make the stage2 commands.
#### stage2 #### Chromopainter painting
Action -dos2 : Do the stage2 commands. This should only be done in single machine mode; use -writes2 in HPC mode.
Action -writes2 <optional filename> : Write the stage2 commands to file, which is only needed in HPC mode. In single machine mode use -dos2 instead.
==== post-stage2 ====
Action -combines2 : Ends stage2 by combining the output of the stage2 commands. This means estimating 'c' and creating the genome-wide chromopainter output for all individuals.

==== pre-stage3 ====
Action -makes3 : Make the stage3 commands.
#### stage3 #### FineSTRUCTURE MCMC inference
Action -dos3 : Do the stage3 commands. This should only be done in single machine mode; use -writes3 in HPC mode.
Action -writes3 <optional filename> : Write the stage3 commands to file, which is only needed in HPC mode. In single machine mode use -dos3 instead.
==== post-stage3 ====
Action -combines3 : Ends stage3 by checking the output of the stage3 commands.

==== pre-stage4 ====
Action -makes4 : Make the stage4 commands.
#### stage4 #### FineSTRUCTURE tree inference
Action -dos4 : Do the stage4 commands. This should only be done in single machine mode; use -writes4 in HPC mode.
Action -writes4 <optional filename> : Write the stage4 commands to file, which is only needed in HPC mode. In single machine mode use -dos4 instead.
==== post-stage4 ====
Action -combines4 : Ends stage4 by checking the output of the stage4 commands.

Not a command, but if -go gets here, we will provide the GUI command line for visualising and exploring the results.
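The make / run / combine pattern repeated for stages 1 to 4 can be summarised in a few lines. This Python sketch only enumerates the action names listed above in the order they are described; the function names are hypothetical, and it makes no claim about how fs schedules the actions internally or whether they can all be passed in a single invocation.

```python
def stage_actions(stage, hpc=False, linked=True):
    """List the action names for one stage, in the order described above."""
    if stage not in (1, 2, 3, 4):
        raise ValueError("only stages 1-4 have make/run/combine actions")
    if stage == 1 and not linked:
        return []  # stage1 is skipped in unlinked mode
    # -writes<x> exports the commands for HPC; -dos<x> runs them internally
    run = f"-writes{stage}" if hpc else f"-dos{stage}"
    return [f"-makes{stage}", run, f"-combines{stage}"]


def pipeline_actions(hpc=False, linked=True):
    """All actions from the end of stage0 to the end of stage4."""
    actions = ["-countdata"]  # post-stage0 data check
    for stage in (1, 2, 3, 4):
        actions += stage_actions(stage, hpc=hpc, linked=linked)
    return actions


if __name__ == "__main__":
    for action in pipeline_actions(hpc=True, linked=True):
        print(action)
```

In HPC mode the run step of each stage is a stopping point: the written command file is executed externally before the matching -combines<x> action is applied.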
4.3 Help on the output files created

FILES CREATED, in order of importance.

IMPORTANT FILES:
<projectname>.cp: The finestructure parameter file, containing the state of the pipeline.
<projectname>_<linked>.chunkcounts.out: Created by stage 2 combine: The final chromopainter painting matrix, giving the number of chunks donated to individuals in rows from individuals in columns. The first line contains the estimate of "c".
<projectname>_<linked>.chunklengths.out: Created by stage 2 combine: The final chromopainter painting matrix, giving the total recombination map distance donated to individuals in rows from individuals in columns.
<projectname>_<linked>.mcmc.xml: Created by stage 3 combine: The main MCMC file of the clustering performed by fineSTRUCTURE.
<projectname>_<linked>.tree.xml: Created by stage 4 combine: The main "tree" created from the best MCMC state by fineSTRUCTURE.
<projectname>: A folder containing all pipeline files.
<projectname>/commandfiles/commandfile<X>.txt: The commands to be run to complete stage X. (-hpc 1 mode only)

USEFUL FILES:
<projectname>/stage<X>: Folders containing all pipeline files for stage X.
<projectname>/cpbackup/<projectname>.cp<X>.bak: Backups of the parameter file, created after every action.
<projectname>/stage1/*_EM_linked_file<f>_ind<i>.EMprobs.out: Created by stage 1: The chromopainter parameter estimate files (indexed f=1..<num_phase_files>, in the order given) for the individuals in the order encountered in the idfile (omitting individuals specified as such).
<projectname>/stage2/*_mainrun_file<f>_ind<i>.*: Created by stage 2: All chromopainter files created with the same parameters for all individuals (indexed f=1..<num_phase_files>, in the order given) for the individuals in the order encountered in the idfile (omitting individuals specified as such). See the "fs cp" help for details.
<projectname>/stage3/*_linked_mcmc_run<r>.xml: Created by stage 3: All further MCMC runs beyond the first (r=1..nummcmcruns-1).
<projectname>/stage3/*_mcmc.mcmctraces.tab: Created by stage 3 combine: The mcmc samples from all runs in a single file.
<projectname>/stage4/*_linked_mcmc_run<r>.xml: Created by stage 4: All further trees beyond the first (r=1..nummcmcruns-1).

OTHER FILES:
<projectname>_<linked>.mutationprobs.out: Created by stage 2 combine: The final chromopainter painting matrix, giving the *expected number of SNPs donated with error* to individuals in rows from individuals in columns.
<projectname>_<linked>.regionchunkcounts.out: Created by stage 2 combine: An intermediate file for calculating "c". See fs cp help for details.
<projectname>_<linked>.chunklengths.out: Created by stage 2 combine: An intermediate file for calculating "c". See fs cp help for details.
<projectname>/stage3/*_linked_mcmc_run<r>_x<x>_y<y>_z<z>.xml: Created by stage 3 when MCMC fails convergence tests. This is a backup of where each MCMC run reached, and is used as a starting point for the next run.
<projectname>/stage<X>/*.log: Log files created by each stage: 1, 2, 2a (combining stage2 output across chromosomes), 3 and 4.
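To see which of the IMPORTANT FILES exist for a project, the naming patterns above can be expanded into concrete paths. This is a hedged Python sketch: the helper names are hypothetical, and it assumes that `<linked>` expands to the literal string "linked" or "unlinked" (the two values of the linkagemode parameter), which the listing above does not state explicitly.

```python
from pathlib import Path


def important_outputs(projectname, linkagemode="linked"):
    """Expand the IMPORTANT FILES name patterns for one project."""
    # Assumption: <linked> is replaced by the linkage mode name.
    root = f"{projectname}_{linkagemode}"
    return {
        "parameter file": Path(f"{projectname}.cp"),
        "chunkcounts": Path(f"{root}.chunkcounts.out"),
        "chunklengths": Path(f"{root}.chunklengths.out"),
        "mcmc": Path(f"{root}.mcmc.xml"),
        "tree": Path(f"{root}.tree.xml"),
    }


def missing_outputs(projectname, linkagemode="linked", base="."):
    """Return the labels of expected files not yet present under base."""
    expected = important_outputs(projectname, linkagemode)
    return [
        label
        for label, path in expected.items()
        if not (Path(base) / path).exists()
    ]


if __name__ == "__main__":
    # "example" is a placeholder project name, not a file shipped with fs.
    for label in missing_outputs("example"):
        print("not yet created:", label)
```

Because the stage 2, 3 and 4 combine steps create these files in order, the first missing label gives a rough indication of how far the pipeline has progressed.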
4.4 Help on specific parameters

Help on specific commands or parameters is obtained by invoking help with the name as the argument. See Section 4.6 for obtaining a list of all parameters.
4.5 Accessing FineSTRUCTURE, ChromoCombine and ChromoPainter directly

***** Help for tool mode *****
USAGE: fs [tool] [OPTIONS]
Using this interface you can access any of the advanced functionality of chromopainter and finestructure.
[tool] can be any of:
<projectname>.cp: automatic mode - creates and runs the commands below for you, organising what to do in a 'project'. This should be the default unless you know what you are doing.
cp: chromopainter mode (commands exactly as chromopainter). This can be used to perform more sophisticated analyses, such as admixture modelling via GLOBETROTTER.
combine: chromocombine mode (commands exactly as chromocombine)
fs: finestructure mode (commands exactly as finestructure)
Run "fs [tool] -h" to obtain more detailed help on cp, combine, or fs tools.
Run "fs -h" to get the automatic mode help.
4.6 List of all parameters

Help for Parameter validatedoutput : Derived. Whether we have validated output from each stage of the analysis (0-4)
Help for Parameter exec : Finestructure command line. Set this to be able to use a specific version of this software. (default: fs)
Help for Parameter hpc : THIS IS IMPORTANT FOR BIG DATASETS! Set hpc mode. 0: Commands are run 'inline' (see 'numthreads' to control how many CPUs to use). 1: Stop computation for an external batch process, creating a file containing commands to generate the results of each stage. 2: Run commands inline, but create the commands for reference. (default: 0)
Help for Parameter numthreads : Maximum parallel threads in 'hpc=0' mode. (default: 0, meaning all available CPUs)
Help for Parameter ploidy : Haplotypes per individual. =1 if haploid, 2 if diploid. (default: 2)
Help for Parameter linkagemode : unlinked/linked. Whether we use the linked model. (default: unlinked, or linked if recombination files are provided)
Help for Parameter indsperproc : Desired number of individuals per process (default: 0, meaning autocalculate: use 1 in HPC mode, ceiling(N/numthreads) otherwise). Try to choose it such that you get a sensible number of commands compared to the number of cores you have available.
Help for Parameter outputlogfiles : 1: Commands are written to file with redirection to log files. 0: no redirection. (default: 1)
Help for Parameter allowdep : Whether dependency resolution is allowed. 0=no, 1=yes. Main use is for pipelining. (default: 1)
Help for Parameter s12inputtype : What type of data input (currently only "phase" supported)
Help for Parameter idfile : IDfile location, containing the labels of each individual. REQUIRED, no default (unless -createids is used).
Help for Parameter s12args : Arguments to be passed to Chromopainter (default: empty)
Help for Parameter ninds : Derived. Number of individuals observed in the idfile
Help for Parameter nindsUsed : Derived. Number of individuals retained for processing from the idfile
Help for Parameter nsnps : Derived. Number of SNPs in total, over all files
Help for Parameter s1args : Arguments passed to stage1 (default: -in -iM --emfilesonly)
Help for Parameter s1emits : Number of EM iterations (chromopainter -i <n>, default: 10)
Help for Parameter s1minsnps : Minimum number of SNPs for EM estimation (for chromopainter -e, default: 10000)
Help for Parameter s1snpfrac : Fraction of genome to use for EM estimation. (default: 0.1)
Help for Parameter s1indfrac : Fraction of individuals to use for EM estimation. (default: 1.0)
Help for Parameter s1outputroot : Output file for stage 1 (default is autoconstructed from filename)
Help for Parameter Neinf : Derived. Inferred 'Effective population size' Ne (chromopainter -n).
Help for Parameter muinf : Derived. Inferred mutation rate mu (chromopainter -M)
Help for Parameter s2chunksperregion : Number of chunks in a "region" (-ve: use default of 100 for linked, nsnps/100 for unlinked)
Help for Parameter s2samples : Number of samples of the painting to obtain per recipient haplotype, for examining the details of the painting. (Populates <root>.samples.out; default 0. Warning: these files can get large)
Help for Parameter s2args : Additional arguments for stage 2 (default: none, "")
Help for Parameter s2outputroot : Output file name for stage 2 (default: autoconstructed).
Help for Parameter s2combineargs : Additional arguments for stage 2 combine (fs combine; default: none, "")
Help for Parameter cval : Derived. 'c' as inferred using chromopainter. This is only used for sanity checking. See s34args for setting it manually.
Help for Parameter cproot : The name of the final chromopainter output. (Default: <filename>, the project file name)
Help for Parameter cpchunkcounts : The finestructure input file, derived name of the chunkcounts file from cproot.
Help for Parameter fsroot : The name of the finestructure output (Default: <filename>, the project file name).
Help for Parameter s34args : Additional arguments to both finestructure mcmc and tree steps. Add "-c <val>" to manually override 'c'.
Help for Parameter s3iters : Number of TOTAL iterations to use for MCMC. By default we assign half to burnin and half to sampling. (default: 100000)
Help for Parameter s3iterssample : Number of iterations to use for MCMC (default: -ve, meaning derive from s3iters)
Help for Parameter s3itersburnin : Number of iterations to use for MCMC burnin (default: -ve, meaning derive from s3iters)
Help for Parameter numskip : Number of mcmc iterations per retained sample (default: -ve, meaning derive from maxretained)
Help for Parameter maxretained : Maximum number of samples to retain when numskip is -ve. (default: 500)
Help for Parameter nummcmcruns : Number of *independent* mcmc runs. (default: 2)
Help for Parameter fsmcmcoutput : Filename to use for mcmc output (default: autogenerated)
Help for Parameter mcmcGR : Derived. Gelman-Rubin diagnostics obtained from combining MCMC runs, for log-posterior, K, log-beta, delta, f respectively
Help for Parameter threshGR : Threshold for the Gelman-Rubin statistic to allow moving on to the tree building stage. We always move on if threshGR<0. (Default: 1.3)
Help for Parameter s4args : Extra arguments to the tree building step. (default: none, "")
Help for Parameter s4iters : Number of maximization steps when finding the best state from which the tree is built. (default: 100000)
Help for Parameter fstreeoutput : Filename to use for finestructure tree output. (default: autogenerated)
Help for Parameter phasefiles : Comma or space separated list of all 'phase' files containing the (phased) SNP details for each haplotype. Required. Must be sorted alphanumerically to ensure chromosomes are correctly ordered. So don't use *.phase, use file{1..22}.phase. Override this with upper case -PHASEFILES.
Help for Parameter recombfiles : Comma or space separated list of all recombination map files containing the recombination distance between SNPs. If provided, a linked analysis is performed. Otherwise an 'unlinked' analysis is performed. Note that linkage is very important for dense markers!
Help for Parameter nsnpsvec : Derived. Comma separated list of the number of SNPs in each phase file.
Help for Parameter s1outputrootvec : Derived. Comma separated list of the stage 1 output file names.
Help for Parameter s2outputrootvec : Derived. Comma separated list of the stage 2 output file names.
Help for Parameter fsmcmcoutputvec : Derived. Comma separated list of the stage 3 output file names.
Help for Parameter old_fsmcmcoutputvec : Derived. Comma separated list of the stage 3 output file names, if we need to continue a too-short MCMC run.
Help for Parameter fstreeoutputvec : Derived. Comma separated list of the stage 4 output file names.
Help for Parameter stage : Derived. Don't mess with this! The internal measure of which stage of processing we've reached. Change it via -reset or -duplicate.
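Several parameters in the list above default to a negative or zero value meaning "derive it from another parameter". The Python sketch below illustrates how such defaults might resolve. It is not the fs implementation: the function names are hypothetical, the rule numskip = ceiling(sampling iterations / maxretained) is an assumption consistent with the descriptions of numskip and maxretained, and treating only hpc=1 as "HPC mode" for indsperproc is likewise an assumption.

```python
import math
import os


def mcmc_schedule(s3iters=100000, s3itersburnin=-1, s3iterssample=-1,
                  numskip=-1, maxretained=500):
    """Resolve negative ('derive me') MCMC settings into concrete values."""
    # Stated default: half of s3iters to burnin and half to sampling.
    burnin = s3itersburnin if s3itersburnin >= 0 else s3iters // 2
    sample = s3iterssample if s3iterssample >= 0 else s3iters // 2
    if numskip < 0:
        # Assumption: thin so that at most maxretained samples are kept.
        numskip = max(1, math.ceil(sample / maxretained))
    return {"burnin": burnin, "sample": sample, "numskip": numskip}


def inds_per_proc(n_inds, numthreads=0, hpc=0, indsperproc=0):
    """Resolve indsperproc=0 following the rule in its help text."""
    if indsperproc > 0:
        return indsperproc
    if hpc == 1:  # assumption: only hpc=1 counts as HPC mode here
        return 1
    if numthreads <= 0:  # numthreads=0 means all available CPUs
        numthreads = os.cpu_count() or 1
    return math.ceil(n_inds / numthreads)


if __name__ == "__main__":
    print(mcmc_schedule())
    print(inds_per_proc(100, numthreads=8))
```

With the listed defaults (s3iters 100000, maxretained 500) this sketch gives 50000 burnin iterations, 50000 sampling iterations and one retained sample every 100 iterations.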