2019-06-11_GBS_EE

Resource for Genotype-by-Sequencing in Ecology and Evolution workshop

View the Project on GitHub

Filtering your SNPs

Developed by: Alana Alexander

Background

Congratulations! Over the last day you’ve learned how to use the command line and STACKS to generate a vcf file (and other file formats) containing variable SNPs derived from GBS/RADseq data. By tweaking some of the parameters of the STACKS pipeline (particularly m - the minimum read depth; M - the number of mismatches between alleles; and n - the number of mismatches between loci in the catalog), we can make it less likely that we’ve included problematic loci (e.g. paralogous loci smushed into one “super stack”). Other RADseq assembly pipelines like ipyrad and dDocent also have similar parameters that can be tweaked.

We probably need to do a little more filtering

But…despite this, it is likely that a few problematic loci are going to sneak through into our dataset. These might include some paralogous loci, which will be more heterozygous than expected because of fixed differences between the different ‘true’ loci that have been merged into one ‘erroneous’ locus. It might also include loci that have null alleles, that is, where the restriction enzyme site for one allele has been eroded by a mutation. This can make individuals look homozoygous at that site (and that locus to look more homozygous than you might expect in general), even though at least some of the individuals are truly heterozygous. The impact of these problematic loci will depend on what you are using your data for, but null alleles can be particularly problematic for parentage analysis, because if an offspring inherits one of the null alleles there will be no data at that locus to suggest that it and the parent it inherited the null allele from are related. It can also be quite problematic for estimating diversity.

Filtering out loci that are out of HWE

Fortunately, we can screen for loci that are out of Hardy-Weinberg Equilibrium (HWE) i.e. loci that appear to be too heterozygous/homozygous based on allele frequencies in the population. One problem we can face, however, is that a lot of the programs (e.g. KGD, VCFtools) that filter out loci not in HWE assume we’ve only got a single population in our dataset. Why is this a problem? Well, null alleles and paralogy are not the only reasons you can be out of HWE. One big reason is population structure! That’s right! If we have multiple populations present in our data set, and we filter on HWE, then we might actually be tipping out the SNPs that are the most informative about population structure!

Solution? Only filter out loci that are out of HWE across multiple populations

The solution is splitting out our vcf file into populations and only getting rid of loci that are out of HWE within multiple populations. That is because if some kind of artifact like null alleles or paralogous loci leads to a locus being out of HWE, then it should be out of HWE within multiple populations. By looking at a per-population level, we won’t be throwing out loci that are out of HWE across the entire data set due to population structure. To do this, we are going to use a set of bash and R scripts (GBS_SNP_filter) to split out the populations and then filter SNPs on HWE and the few other things mentioned below.

Other things we can filter on using GBS_SNP_filter

Enough background already, can we just get to it?

Let’s head over here to start playing around with GBS_SNP_filter


Jump back to Stacks exercise II
Jump back to main workshop schedule