
hackyhour

Source for information on fortnightly bioinformatics help session

Visualising Alignments and Assemblies

From SAM files to SeqMonk

The SAM/BAM file format

The Sequence Alignment/Map (SAM) format is the standard format for storing the output of mapping next-generation sequencing reads to a genome, transcriptome, or other reference. Because SAM files can be very large, they are most often stored in a compressed binary version called BAM (Binary Alignment/Map). These files store a lot of information about each read and how it relates to the reference.

This link has a good basic description

Details of all standard and optional fields are found in this PDF

A SAM file has two main components. The first is the header section, which can include information about the file itself (‘@HD’, e.g. format version and sorting order of reads), the reference sequences (‘@SQ’, sequence name and length) and the sample, or Read Group (‘@RG’, e.g. ID, barcode, platform). SAM files vary a lot, and not all of them will have every header line. Only a few fields are required within each header line type (for example, ‘@SQ’ needs the sequence name and length). See the PDF link for the minimal requirements.
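As a rough illustration of the header layout described above, here is a minimal Python sketch that groups header lines by record type. The function name `parse_sam_header` and the example header values (sequence name, length, read group ID, barcode) are invented for illustration. In practice you would normally read headers with a dedicated tool or library.

```python
def parse_sam_header(lines):
    """Group SAM header lines by record type (e.g. 'HD', 'SQ', 'RG')."""
    header = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("@"):
            break  # the header ends at the first alignment line
        fields = line.split("\t")
        record_type = fields[0][1:]  # drop the leading '@'
        if record_type == "CO":
            # Comment lines are free text, not TAG:VALUE pairs.
            entry = {"text": "\t".join(fields[1:])}
        else:
            # Split on the first colon only, since values may contain colons.
            entry = dict(field.split(":", 1) for field in fields[1:])
        header.setdefault(record_type, []).append(entry)
    return header


# Invented example header: one line each of @HD, @SQ and @RG.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@RG\tID:sample1\tPL:ILLUMINA\tBC:ACGT",
]

header = parse_sam_header(example_header)
print(header["HD"][0]["SO"])                      # sorting order
print(header["SQ"][0]["SN"], header["SQ"][0]["LN"])  # reference name and length
print(header["RG"][0]["ID"])                      # read group ID
```

Each record type maps to a list because a file can contain many ‘@SQ’ or ‘@RG’ lines.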

The most useful information in a SAM file is in the alignment section, which has 11 mandatory fields and many optional fields (you can even add your own). This section is essentially a giant tab-delimited table, with each field as a column and each aligned read as a row. As an example, here are the first six columns:

QNAME	FLAG	RNAME	POS	MAPQ	CIGAR
query (read) name	bitwise flag	reference name	1-based start on reference	mapping quality	alignment operations (matches, indels, clipping)
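To make the six columns listed above more concrete, the following Python sketch splits one alignment line into those fields, decodes the FLAG bits, and breaks the CIGAR string into operations. The names `Alignment`, `parse_alignment`, `decode_flag` and `parse_cigar` are hypothetical helpers, and the example read is invented. The flag bit meanings follow the SAM specification.

```python
import re
from typing import NamedTuple


class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost mapping position
    mapq: int
    cigar: str


# Bit values and meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_alignment(line):
    """Return the first six mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return Alignment(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])


def decode_flag(flag):
    """List the names of all bits that are set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# Invented example read with all 11 mandatory fields.
example_line = "read1\t99\tchr1\t100\t60\t3M1I2M\t=\t300\t206\tACGTAC\tFFFFFF"

aln = parse_alignment(example_line)
print(aln)
print(decode_flag(aln.flag))
print(parse_cigar(aln.cigar))
```

A FLAG of 99 is the sum 1 + 2 + 32 + 64, so this read is paired, in a proper pair, has a mate on the reverse strand, and is the first read of its pair.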

Manipulating SAM/BAM files

Most of your interaction with SAM/BAM files will involve sorting them, filtering them, or extracting information from them based on one or several of these fields. There are several tools available for these operations. Here are links and usage examples for two of the most popular:

samtools

samtools usage examples

picard tools

picard tools usage examples
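To show what filtering on these fields means in practice, here is a small Python sketch that keeps only mapped reads at or above a mapping quality threshold. It works on uncompressed SAM text only, and the function name `filter_alignments`, the threshold of 30 and the example reads are assumptions made for illustration. For real data the tools above are the better choice; the `-q` and `-F` options of `samtools view` apply the same kind of filter.

```python
def filter_alignments(lines, min_mapq=30, exclude_flags=0x4):
    """Yield header lines plus alignments that pass the MAPQ and FLAG filters.

    exclude_flags=0x4 drops unmapped reads (the 'unmapped' FLAG bit).
    """
    for line in lines:
        if line.startswith("@"):
            yield line  # keep header lines unchanged
            continue
        fields = line.split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & exclude_flags:
            continue  # at least one excluded bit is set
        if mapq < min_mapq:
            continue  # mapping quality too low
        yield line


# Invented example: one header line and three reads.
sam_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "readA\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tFFFF",  # mapped, high MAPQ
    "readB\t0\tchr1\t20\t5\t4M\t*\t0\t0\tACGT\tFFFF",   # mapped, low MAPQ
    "readC\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",        # unmapped
]

kept = list(filter_alignments(sam_lines))
for line in kept:
    print(line.split("\t")[0])
```

Because the function is a generator, it processes one line at a time and does not need to hold the whole file in memory.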

SeqMonk - Visualising Alignments

SeqMonk is a tool for visualising and analysing high-throughput sequence data. It can import genome files, mapping files (BAM), annotation files (GFF, GTF), and other data formats. In this session we provide just a brief overview of its capabilities.
