Shell Scripting Basics

Once you move beyond basic shell commands, and the process becomes increasingly complicated, you are going to want the commands into a shell script. There are many good tutorials online to get you started. The Software Carpentry website includes some good tips on their Unix Shell lesson (chapters 5 and 6). As well, Data Carpentry has a good introduction on their lesson: Introduction to the Command Line for Genomics. For the lesson here, we will take you through some basics to get you started.

Simple script

In a text editor, add the following:

#!/usr/bin/env bash

# a first script

echo 'Hello World'

The first line is called the Shebang, which tells the script where the BASH interpreter is. It is possible to run without this line, but your script may not work in all environments (see Here and Here for more details than you would ever want to know). The next line that starts with the # is a comment, and is ignored by BASH, but can help tell the reader what is going on in the script. The last line just has the command to output “Hello World” to the screen.

In order to run any new shell script, you have to make it executable:

chmod a+x

and now it should run. You just have to do this the first time you run it.

A little less simple: running other programs

The following script fastq2fasta.sh will convert one of our example fastq files to a fasta format

#!/usr/bin/env bash

# running another program with the shell: seqtk 

# This command will convert the fastq file to a fasta file

seqtk seq -A example_data/e4485061.924a.4f09.9d7e.bfefc780d1a8_11_L001_R1_001.fastq > example_data/sample1.fasta

Repeating the same command: Looping

The above script works fine if we want to just do one file (but then you would hardly need a script for one command for one file), but if we want to do the same command on multiple files, we can set up a for loop:

Shell script: fastq2fasta_loop.sh

#!/usr/bin/env bash

# running another program with the shell: seqtk 

# This command will convert multiple fastq files to the fasta format

cd example_data

for fq in *.fastq
do
	seqtk seq -A $fq > $fq\.fasta

done

The \ in the command is called an escape. That is so the program doesn’t confuse what variable you are calling.

Playing with variables

Okay, now we can loop through multiple files and run the same command. The only thing with the above script is that now all of the fasta files end in “.fastq.fasta”, which doesn’t look so nice. We can adjust the variable that the for loop is iterating through to adjust the output name:

fastq2fasta_loop_rename.sh

#!/usr/bin/env bash

# running another program with the shell: seqtk 

# This command will convert multiple fastq files to the fasta format

cd example_data

for fq in *.fastq
do
	sampleName=$(echo $fq | cut -f 1 -d '.')
	echo $sampleName
	seqtk seq -A $fq > $sampleName\.fasta

done

Looping from a file

If we don’t want to run all the files in the folder, we can make a list and run the loop from there. One way to do this is a While loop:

fastq2fasta_while_loop_rename.sh

#!/usr/bin/env bash

# running another program with the shell: seqtk 

# This command will convert multiple fastq files to the fasta format

cd example_data

cat fqlist | while read line
do
	sampleName=$(echo $line | cut -f 1 -d '.')
	echo $sampleName
	seqtk seq -A $line > $sampleName\.fasta

done

We can also take more complicated things from a file, for example a table with old and new names, and use these with our variable slicing to give more sensible names to our files:

fastq2fasta_loop_newName.sh

#!/usr/bin/env bash

# running another program with the shell: seqtk 

# This command will convert multiple fastq files to the fasta format

cd example_data

cat fqTable | while read line
do
	currentName=$(echo "$line" | cut -f 1)
	newName=$(echo "$line" | cut -f 2)
	echo $currentName
	echo $newName
	seqtk seq -A $currentName > $newName\.fasta

	# can also use to just change names:
	#mv $currentName $newName

done

Are you jazzed about for loops and want even more speed?

We’ve got a small guide on how to parallelize these loops

hackyhour

Source for information on fortnightly bioinformatics help session