Filtering your SNPs

Developed by: Alana Alexander

Other filtering parameters

There are a boatload of additional filters and options you could apply to your data set, and just as many differing opinions on whether and why you should filter your SNP set any further! Some filters you might consider (credit to the creators of ipyrad for the references):

Overall sequencing depth (i.e. chucking out loci covered by too few reads). This might reduce the likelihood of calling the wrong genotype at a locus because of low depth (e.g. it is hard to call a heterozygote when you only have one read at a locus!), but also see ‘Do I even need to filter at all?’ below.
To reference genome or not to reference genome. For some folks, this will be an easy one to answer: if you don’t have a reference genome for your organism, you can’t make use of one! However, if you do have a reference genome available, evidence suggests that using it will help your analyses (see: Bioinformatic processing of RAD‐seq data dramatically impacts downstream population genetic inference).
A MAF (minor allele frequency) filter. Recent research suggests that for Structure/PCA analyses, filtering out singleton SNPs increases the ability to detect population structure, but that more stringent MAF thresholds might actually reduce it (Minor allele frequency thresholds strongly affect population structure inference with genomic data sets). For coalescent analyses (e.g. CubSFS, dadi, fastsimcoal2), avoid filtering on MAF: the low-frequency variants in a population are hugely informative about the demographic processes affecting that population!
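To make the depth and singleton filters a bit more concrete, here is a minimal Python sketch that applies both to a few made-up VCF lines. Everything in it is an assumption for illustration only: the toy data, the function names (`site_summary`, `keep_site`), and the thresholds (mean depth of at least 5, minor allele count of at least 2). It also assumes biallelic sites with `GT` and `DP` present in the FORMAT column. For real data you would use a dedicated tool rather than this.

```python
import re

# Toy, made-up VCF data lines: four samples, FORMAT is GT:DP.
VCF_LINES = [
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/0:12\t0/1:9\t1/1:15\t0/0:11",
    "chr1\t250\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0/0:10\t0/1:14\t0/0:9\t0/0:13",
    "chr1\t400\t.\tG\tA\t.\tPASS\t.\tGT:DP\t0/1:2\t0/1:1\t0/0:3\t./.:0",
]


def site_summary(vcf_line):
    """Return (position, mean depth of called genotypes, minor allele count)."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    gt_index, dp_index = keys.index("GT"), keys.index("DP")
    allele_counts = {"0": 0, "1": 0}  # biallelic sites only (assumption)
    depths = []
    for sample in fields[9:]:
        values = sample.split(":")
        alleles = re.split(r"[/|]", values[gt_index])
        if "." in alleles:
            continue  # skip missing genotypes entirely
        for allele in alleles:
            if allele in allele_counts:
                allele_counts[allele] += 1
        depth = values[dp_index] if dp_index < len(values) else "."
        depths.append(0 if depth == "." else int(depth))
    mean_depth = sum(depths) / len(depths) if depths else 0.0
    return fields[1], mean_depth, min(allele_counts.values())


def keep_site(vcf_line, min_mean_depth=5, min_minor_allele_count=2):
    """Keep a site only if it passes both illustrative thresholds."""
    _, mean_depth, minor_allele_count = site_summary(vcf_line)
    return (mean_depth >= min_mean_depth
            and minor_allele_count >= min_minor_allele_count)


kept_positions = [line.split("\t")[1] for line in VCF_LINES if keep_site(line)]
print(kept_positions)
```

In this toy data the site at position 250 is a singleton (the minor allele appears once) and the site at position 400 has a mean depth of 2, so only position 100 survives. A minor allele count of 2 is used here instead of a frequency threshold because it removes singletons without depending on sample size.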

If you are interested in any of these further filtering steps, I’d highly encourage you to check out this tutorial from the dDocent folks on using VCFtools, which walks through several of them. Any filtering steps that aren’t covered in the tutorial are covered in the extensive VCFtools documentation.
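As a rough idea of what a VCFtools call for the depth and singleton filters might look like, the sketch below assembles a command line in Python. The file names `raw.vcf` and `filtered`, the helper `vcftools_command`, and the threshold values are all placeholders assumed for illustration; they are not taken from the dDocent tutorial. The flags used (`--vcf`, `--minDP`, `--mac`, `--recode`, `--recode-INFO-all`, `--out`) are standard VCFtools options, but check the VCFtools documentation for how they behave with your version and data.

```python
import shlex


def vcftools_command(vcf_path, out_prefix, min_genotype_depth=None, min_mac=None):
    """Build (but do not run) a vcftools command as a list of arguments."""
    command = ["vcftools", "--vcf", vcf_path]
    if min_genotype_depth is not None:
        # Genotypes with depth below this value are treated as missing.
        command += ["--minDP", str(min_genotype_depth)]
    if min_mac is not None:
        # Keep only sites whose minor allele count is at least this value.
        command += ["--mac", str(min_mac)]
    # Write a new VCF, keeping the INFO fields from the input.
    command += ["--recode", "--recode-INFO-all", "--out", out_prefix]
    return command


# Placeholder file names and thresholds, assumed for illustration.
command = vcftools_command("raw.vcf", "filtered", min_genotype_depth=5, min_mac=2)
print(shlex.join(command))
```

To actually run it you could pass `command` to `subprocess.run(command, check=True)` on a machine where VCFtools is installed. Because `--minDP` turns low-depth genotypes into missing data, it is usually worth following up with a missing-data filter, which the tutorial above discusses.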

Do I even need to filter at all?

Instead of filtering at the genotype level, alternatives include pipelines that calculate statistics for low-depth data at the population level rather than the individual level (e.g. ANGSD; see Genotype‐free estimation of allele frequencies reduces bias and improves demographic inference from RADSeq data), or pipelines that account for the biases of low depth in individual-level calculations (e.g. KGD).
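To give a feel for what "population level rather than individual level" can mean, here is a toy Python sketch that estimates an allele frequency directly from genotype likelihoods with a simple expectation-maximisation loop, without ever calling individual genotypes. This is only an illustration of the general idea under the assumptions of a biallelic site and Hardy-Weinberg proportions; it is not ANGSD's implementation, and the function name `em_allele_frequency`, the starting value, and the example likelihoods are all made up.

```python
def em_allele_frequency(genotype_likelihoods, start=0.2, max_iter=200, tol=1e-8):
    """Estimate the alternate allele frequency at one biallelic site.

    genotype_likelihoods: one (L0, L1, L2) tuple per individual, giving the
    likelihood of the reads if the individual carries 0, 1 or 2 alternate alleles.
    """
    freq = start
    n_chromosomes = 2 * len(genotype_likelihoods)
    for _ in range(max_iter):
        # Keep the frequency strictly inside (0, 1) so the priors stay positive.
        freq = min(max(freq, 1e-6), 1 - 1e-6)
        expected_alt = 0.0
        for likelihoods in genotype_likelihoods:
            # Hardy-Weinberg prior on the genotype given the current frequency.
            prior = ((1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2)
            weights = [l * p for l, p in zip(likelihoods, prior)]
            total = sum(weights)
            # Expected number of alternate alleles carried by this individual.
            expected_alt += (weights[1] + 2 * weights[2]) / total
        new_freq = expected_alt / n_chromosomes
        if abs(new_freq - freq) < tol:
            return new_freq
        freq = new_freq
    return freq


# Made-up likelihoods for eight individuals with a single read each:
# six saw a reference read, two saw an alternate read (1% error assumed).
single_read_data = [(0.99, 0.5, 0.01)] * 6 + [(0.01, 0.5, 0.99)] * 2
estimate = em_allele_frequency(single_read_data)
print(round(estimate, 3))
```

With one read per individual, no genotype here could be called with any confidence (a single read is just as compatible with a heterozygote), yet the pooled information still gives a sensible frequency estimate for the population.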
