Phasing your data
Summary of this document
- It is absolutely essential that you simultaneously phase all samples that you will be analysing.
- Unsystematic phasing errors are probably not a problem.
- You are free to use your choice of phasing algorithm.
Why is phasing necessary?
ChromoPainter takes haplotype data as input. The majority of current genotyping technologies (e.g. SNP arrays and Illumina sequencing) give information on which alleles are found at a particular locus. However, they do not tell you which chromosome copy each allele sits on, e.g. whether an allele comes from the maternally or the paternally inherited copy of each chromosome in human autosomal data. While technologies are becoming available that allow phase to be determined experimentally, they are currently expensive and not feasible for most population genetic applications. Phasing is not a problem for haploid organisms or for the X/Y chromosomes in males.
What do phasing algorithms do?
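As a concrete illustration of the ambiguity that phasing has to resolve, the Python sketch below lists every pair of haplotypes consistent with one individual's unphased genotype. The 0/1/2 genotype coding (count of the alternate allele at each site) and the function name `compatible_haplotype_pairs` are assumptions made for this example and are not part of ChromoPainter or of any phasing program.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Return every unordered haplotype pair consistent with an unphased genotype.

    genotype: sequence of alternate-allele counts per site (0, 1 or 2).
    This coding is an assumption made for illustration.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    # Try every assignment of the alternate allele to one of the two copies
    # at each heterozygous site.
    for choice in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are fixed: 0 -> allele 0 on both copies, 2 -> allele 1.
        hap_a = [g // 2 for g in genotype]
        hap_b = list(hap_a)
        for site, allele in zip(het_sites, choice):
            hap_a[site] = allele
            hap_b[site] = 1 - allele
        # frozenset makes the pair unordered, so (a, b) and (b, a) count once.
        pairs.add(frozenset((tuple(hap_a), tuple(hap_b))))
    return pairs


# Two heterozygous sites (positions 0 and 2) give two candidate pairs.
pairs = compatible_haplotype_pairs([1, 2, 1, 0])
print(len(pairs))  # 2
```

With k heterozygous sites there are 2^(k-1) candidate pairs, so the genotype alone cannot identify the true haplotypes once an individual is heterozygous at more than one site.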
There are multiple statistical algorithms that infer phase for diploid data by identifying sets of alleles that are found together in multiple individuals. The accuracy of these algorithms is greatly improved by the availability of related individuals, which can provide considerable information on which alleles are transmitted together as a single unit. For example, phasing switch errors are an order of magnitude less frequent for individuals in parent-offspring trios than for unrelated individuals. The presence of "reference panels" of individuals phased using trio information or other methods can additionally improve the accuracy of phasing of unrelated members of the sample. For a given dataset, the different algorithms give broadly similar performance but with some trade-off between speed and accuracy, so that for example fastPHASE is faster but generally less accurate than PHASE.
How do phasing errors affect the performance of ChromoPainter and fineSTRUCTURE?
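Phasing errors are commonly counted as switch errors: positions, moving along the heterozygous sites of an individual, at which the inferred haplotype swaps from following one true chromosome copy to following the other. The sketch below is a minimal Python illustration of that count, assuming the true phase is known (for example from simulated or trio-phased data). The function name `switch_error_rate` and the 0/1 allele coding are hypothetical and do not come from ChromoPainter or any phasing package.

```python
def switch_error_rate(true_hap, inferred_hap, het_sites):
    """Fraction of consecutive heterozygous-site pairs where the inferred phase flips.

    true_hap, inferred_hap: 0/1 alleles along one chromosome copy of the
        same individual (hypothetical coding for this sketch).
    het_sites: indices of that individual's heterozygous sites, in order.
    """
    # At a heterozygous site the inferred copy either agrees with the true
    # copy or carries the allele of the other copy.
    agrees = [true_hap[i] == inferred_hap[i] for i in het_sites]
    if len(agrees) < 2:
        return 0.0
    # A switch is a change from "agrees" to "disagrees" (or back) between
    # neighbouring heterozygous sites.
    switches = sum(a != b for a, b in zip(agrees, agrees[1:]))
    return switches / (len(agrees) - 1)


true_hap = [0, 1, 1, 0, 1, 0]
inferred = [0, 1, 0, 1, 0, 0]  # follows the other copy at sites 2 to 4
print(switch_error_rate(true_hap, inferred, range(6)))  # 0.4 (2 switches / 5 intervals)
```

Each switch cuts a stretch that was inherited as a single unit into two pieces. An inferred haplotype that is the exact complement of the true one at every heterozygous site scores zero, because which copy is labelled first does not matter.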
Phasing errors break up haplotypes that were in fact inherited as a single unit. This can result in ChromoPainter inferring more than one donor for stretches that would otherwise have a single nearest neighbor. It also potentially makes haplotypes less likely to be used as donors when painting other individuals. A small number of randomly distributed errors is unlikely to be problematic, and we show for example in Lawson et al. 2012 that we achieve good performance with HGDP data despite inevitable phasing errors. However, phasing errors can be more problematic if they are systematic and make haplotypes from particular individuals look more like each other than like those of other members of the sample. This can potentially lead to fineSTRUCTURE grouping together individuals who are phased similarly although they are not especially related.
How do we recommend phasing should be performed?
To avoid bias and to increase power when applying our methods, it is absolutely essential that you simultaneously phase all samples that you will be analysing with ChromoPainter and fineSTRUCTURE. If you wish to use previously phased samples as "donors" or "populations" in our models, in addition to new samples, you do not necessarily have to re-phase all such data, which can be computationally time-consuming, especially if the previously phased samples vastly outnumber the new samples. If you do not wish to re-phase such samples, we strongly recommend that you phase your new samples while fixing the previously phased samples as a "reference panel" in the phasing algorithm. If you will be using several previously phased datasets that were generated by different research groups or with different phasing strategies, we recommend that you fix at most one of them as a "reference" and re-phase all remaining datasets together with any new samples you will be jointly analysing. Our methods rely heavily on haplotype patterns shared among individuals, so differential phasing among populations or datasets will very likely introduce biases and/or reduce power; we have observed the latter in practice.
Some available phasing algorithms
There are a variety of freely available programs for phasing genotype data, such as BEAGLE, fastPHASE, IMPUTE2, MACH, and SHAPEIT. For smaller datasets, there is also PHASE. Each has its own attributes, which have been summarized elsewhere (see the papers referenced by these programs). Our limited understanding is that PHASE is perhaps the most accurate for small datasets, especially when you have no "reference panel" information to assist with phasing. However, PHASE can only handle datasets of limited size and cannot currently cope with, e.g., full-genome human data. For such large-scale datasets, our understanding is that phasing accuracy is broadly similar among the other programs listed above and in practice appears to be "good enough" for the purposes of interrogating population structure and colonization history using our methods. We therefore recommend using any of the above programs to phase your data and following the protocols that their authors recommend to maximize phasing accuracy, for example including so-called "reference panels" with known phase, such as those from 1000 Genomes or HapMap.
IMPUTE2 is a convenient choice of phasing software since it is the one we used. We provide conversion scripts for BEAGLE, SHAPEIT/IMPUTE2 and fastPHASE.