Resource for Genotype-by-Sequencing in Ecology and Evolution workshop
Developed by: Alana Alexander
GBS_SNP_filter is a github repository. To get it on to the computer we will use git clone
.
First, make sure you are in your working directory: /nesi/nobackup/nesi02659/users/<yourusername>
. To clone the folder as it is on github, use:
git clone https://github.com/laninsky/GBS_SNP_filter.git
List the contents of your directory. You should now see a folder called GBS_SNP_filter
- this folder has all the scripts from the GBS_SNP_filter
github repository.
GBS_SNP_filter requires some dependencies (other programs it needs to run), namely VCFtools, PLINK, and R. Luckily, a working version of VCFtools and R are installed on Mahuika already! We load them using the module load
command:
module load VCFtools
module load R
Remember, when you are working on other projects on Mahuika/NeSI, you can see what other modules are available to support your analyses by module avail
, but VCFtools, PLINK and R are all we need for GBS_SNP_filter.
The PLINK version on Mahuika has some conflicts with the other programs we are using, so I’ve downloaded a ‘clean’ version. When we recently ran another program that also doesn’t have a module available, structure
, we provided an absolute path so Mahuika knew where it was located. Here, we are going to show you a slightly different way to get Mahuika to find a locally installed program. To do this, we need to run a little bit of code to add the location where PLINK is to our $PATH. Your $PATH is a list of all the places the computer looks for programs to run. You can have a look at your path by running:
echo $PATH
When you run any piece of software, the computer looks into all the places mentioned in your $PATH. Even the ls
command that you just ran is somewhere on your computer. You can find it by typing:
which ls
The plink
executable is located in the source_data
folder, so we are adding this location on to the beginning of our $PATH:
export PATH=/nesi/nobackup/nesi02659/source_data:$PATH
You will now be able to access plink
Note: If you log out of ga-vl01
and log back again, you’ll need to re-load these programs following the instructions above.
vcf file: First up, we need the vcf file we want to filter! For this, we will use the populations.snps.vcf
file we created for the stickleback data in our 2nd STACKS exercise.
Let’s copy that into the GBS_SNP_filter folder (e.g. if you are located in the working
folder, ‘above’ GBS_SNP_filter:
cp ../working/denovo/stacks/populations.snps.vcf GBS_SNP_filter
Have a look at the vcf
file using less
and try to understand its structure. You can use this to help. VCF is by far the most common way to store variant calls in population genetics, so take a bit of time to familiarise yourself with it.
popmap.txt file: The next thing we need is a popmap.txt
file with one individual per line, with the sample name separated from the population code by whitespace (i.e. either a space or a tab):
sample_name population_code
Luckily we already created a popmap file recently! Copy it into the GBS_SNP_filter folder and name it popmap.txt (you can use mv
or cp
to rename it).
No matter what directory you are currently located in, it is now time to get into the GBS_SNP_filter
folder.
cd GBS_SNP_filter
GBS_SNP_filter.txt: We have now put our vcf
and our popmap
file into the GBS_SNP_filter folder. The final file we need is GBS_SNP_filter.txt
. This eight line long file contains the options that we want to use to filter our SNPs. The GBS_SNP_filter repository has a lot of background on some of the options, but briefly the lines should be as follows:
Let’s start off with the following info in our GBS_SNP_filter.txt file. After our first run we can play around with this to see how it affects our results (and see whether we can break GBS_SNP_filter by choosing some crazy values while we are at it!):
populations.snps.vcf
0.85
0.9
0.05
0.5
3
#CHROM
We can create this file by nano GBS_SNP_filter.txt
(don’t forget the blank line on line 8!).
Alright, we are pretty much ready to run GBS_SNP_filter.txt! Our final checklist:
git clone
to clone the GBS_SNP_filter github repositoryAfter doing all that, we are ready to run the filter by:
bash GBS_SNP_filter.sh
The script is going to go through the following steps:
If you want more detail that’s going on in each step, please check out the detailed workflow.
Note: If you get the following message, it is safe to ignore:
Warning message:
Missing column names filled in: 'X8' [8]
Although GBS_SNP_filter will print out some messages to screen, the real detail of how it is going is written out to a log file in the GBS_SNP_filter folder. To monitor the log while GBS_SNP_filter is running, we need to log in to Mahuika using a different terminal window, and then get to the bit of Mahuika we are running the analyses on by:
ssh -Y ga-vl01
After heading to our GBS_SNP_filter folder:
cd /nesi/nobackup/nesi02659/users/<yourusername>/GBS_SNP_filter
We can peek into our log file by:
less populations.snps.log
If we type Shift+F
after opening the file using less
, the log file will update when GBS_SNP_filter adds more info.
Alternatively, we can also monitor our job by using the top
command. First, we have to get out of less
. To do this we need to do Ctrl+c
to interrupt less
waiting for data, and then q
to exit less
. To use top, we then simply type:
top
Can you see your user name and a process using a good whack of %CPU? If so, cool!
Once GBS_SNP_filter completes, a large number of files will be present. The final output vcf file will have “.ld.vcf” as a suffix, and the rest of the files are described here. The log file will detail how many SNPs were filtered out at each step.
After GBS_SNP_filter completes, have a go tweaking the parameters and rerun it to see how it affects the number of SNPs retained. Keep line 1 (populations.snps.vcf), line 7 (#CHROM), and line 8 (blank line) the same, but play around with the other filters. How does this affect the number of SNPs left at the end? Are there any combinations that cause GBS_SNP_filter to fail? Why do you think that might be? The log can get a bit unwieldy after multiple runs, so it might be a good idea to note down the number of SNPs for each parameter combination on your desktop.
Some known ways to make GBS_SNP_filter fail:
The README for GBS_SNP_filter provides additional detail on this set of scripts, and the repository also has further utility scripts for calculating multi-locus heterozygosity for your individuals, and observed/expected heterozygosity for your populations.
In addition, there are yet more potential filters you can use on your data, and other useful programs for further filtering of data that you should check out here.
Jump back to filtering SNPs intro
Jump back to main workshop schedule