One of the most important tools in bioinformatics is BLAST, or Basic Local Alignment Search Tool. It is used to find and compare nucleotide or amino acid sequences, and assess their similarity. Sequences that are very similar to each other are assumed to be homologous. We will talk about the concept of homology later, but in short, homology means similarity due to common evolutionary descent. In terms of genes, common descent can apply to two genes in two different organisms, in which case both genes have evolved from a single gene present in the last common ancestor of the species (orthology). Alternatively, it can apply to two genes in the same organism, which arose from a duplication of a single ancestral gene (paralogy).
When annotating genes, it is essential to be able to identify homologous genes. For example, after identifying an unknown gene, comparing it to homologous genes from a well-studied model organism might give you a first clue about its function. Conversely, the first step in annotating a gene is often to locate it with the help of a homologous gene from another organism. Similarly, when studying a gene family, a group of homologous (paralogous) and often functionally similar genes, finding all members of that group is key. In all of these cases, BLAST is the right tool for the job.
BLAST works by aligning segments of a query sequence (the sequence to search for) to one or many sequences to search against, identifying regions of high sequence similarity using a scoring matrix. The details of the algorithm can be found here. BLAST is implemented in many, if not all, genome project websites, including FlyBase, WormBase and the Ant Portal of the Hymenoptera Genome Database. As previously, we will use the latter to demonstrate how to locate the Cytoplasmic Ribosomal Protein RpL30 gene in the Red harvester ant:
(1). The two most important choices regarding BLAST concern the program and the database used. “Program” specifies one of five basic types of BLAST, depending on the nature of the query and target sequences:
blastn — use a nucleotide query to search a nucleotide database
blastp — use a protein query to search a protein database
blastx — use a translated nucleotide query to search a protein database
tblastn — use a protein query to search a translated nucleotide database
tblastx — use a translated nucleotide query to search a translated nucleotide database
Translated means that nucleotide sequences are translated in all six reading frames (three on each strand) before being compared. In this case, we have chosen the protein (amino acid) sequence of our target gene RpL30 from the fruit fly Drosophila melanogaster as the query sequence. We want to see whether this gene has already been annotated in the Red harvester ant Pogonomyrmex barbatus and is therefore found among the Official Gene Set (OGS), a collection of automatically annotated and manually curated protein sequences representing all protein-coding genes. With both our query and target sequences in protein format, we need to select “blastp” as the program from the pull-down menu.
The “Database” menu offers both genome assemblies and gene sets from several ant species. “PbarOGSv1.2proteins” represents the most current OGS version of Pogonomyrmex barbatus.
(2). Paste the query sequence in Fasta format into this field. The complete set of Drosophila melanogaster Cytoplasmic Ribosomal Protein (CRP) genes is available here.
(3). By default, BLAST does not consider highly repetitive (low-complexity) regions. However, some coding sequences contain repetitive sequence (e.g., see Dmel_RpLP1), in which case it might be beneficial to deactivate the low-complexity filter.
(4). Click “Search” to run BLAST.
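The choice of program and the meaning of “translated” described in step (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how BLAST itself is implemented: the function names are our own, the standard genetic code is assumed, and incomplete or ambiguous codons are simply shown as "X".

```python
# Standard genetic code, with codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Program to use for a given (query type, database type) pair.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, db_type, translate_both=False):
    """Return the BLAST program name for the given sequence types.

    For nucleotide against nucleotide, translate_both=True selects tblastx
    (compare translations) instead of blastn (compare nucleotides directly).
    """
    if (query_type, db_type) == ("nucleotide", "nucleotide"):
        return "tblastx" if translate_both else "blastn"
    return PROGRAMS[(query_type, db_type)]


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to their translations."""
    frames = {}
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frames[strand * (offset + 1)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    print(choose_program("protein", "protein"))
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

The negative frame labels correspond to translations of the reverse strand, which is why a protein query can match a gene on either strand when tblastn is used.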
(1). The first section of the results page shows a graphical representation of the length and similarity of the sequences found by BLAST, the so-called hits. In our case, we have found seven hits, but only one that covers the full length of the query (the thick black bar). It is highlighted in purple, indicating high overall similarity to the query as measured by the alignment score. All other hits are fragmentary and of low similarity, most likely representing spurious results.
(2). The second section lists all hits with their sequence identifiers, scores and e-values. The e-value (expect value) is the number of hits with an equal or better score that would be expected by chance alone, taking the size of the database into account (a large database is more likely to produce a chance hit). For small values, it is approximately the probability that the similarity between query and target sequence is due to chance rather than homology. Values very close to zero (far below one) are usually interpreted as indicating true homology. Here, this applies only to the first hit, with an e-value of 10^-49.
(3). The following section gives more detailed information about individual hits, in descending order of their score. Details include the proportions of identical (“Identities”) and chemically similar (“Positives”) positions between query and target, as well as an alignment between the two sequences. The high number of matching positions indicates that the two sequences very likely evolved from a common ancestor and are therefore homologous. In contrast, the weak similarity between the query and the next best hit (4), as shown by the high e-value and incomplete alignment, is probably a result of chance.
The result of this BLAST analysis is therefore that the Official Gene Set of P. barbatus contains one gene, PB2287, which is homologous to RpL30 in D. melanogaster. To acquire the full protein sequence of this gene, click on any of the highlighted gene identifiers.
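To make the numbers on the results page more concrete, the following sketch shows how an e-value relates to the bit score and the database size, and how “Identities” and “Positives” can be counted from an alignment. It rests on several assumptions: the e-value uses the simplified relation E = m × n × 2^(−S′) with raw sequence lengths (BLAST applies corrected, effective lengths), and SIMILAR_PAIRS is a small illustrative set of residue pairs standing in for the positive-scoring entries of a real substitution matrix. All function names are our own.

```python
import math


def expected_chance_hits(bit_score, query_length, database_length):
    """Simplified e-value: E = m * n * 2**(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_hit_probability(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Illustrative stand-in for matrix entries with a positive score
# (an assumption for this sketch, not a complete substitution matrix).
SIMILAR_PAIRS = {
    frozenset(pair)
    for pair in ("ST", "DE", "KR", "IL", "IV", "LM", "FY", "EQ")
}


def identities_and_positives(aligned_query, aligned_subject):
    """Count identical and similar columns in a gapped pairwise alignment.

    Returns (identities, positives, alignment_length). Identical columns
    also count as positives; columns containing a gap ('-') count as neither.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identities = positives = 0
    for q, s in zip(aligned_query.upper(), aligned_subject.upper()):
        if "-" in (q, s):
            continue
        if q == s:
            identities += 1
            positives += 1
        elif frozenset((q, s)) in SIMILAR_PAIRS:
            positives += 1
    return identities, positives, len(aligned_query)


if __name__ == "__main__":
    # The same score is less convincing in a database twice the size.
    print(expected_chance_hits(50, 100, 1_000_000))
    print(expected_chance_hits(50, 100, 2_000_000))
    print(identities_and_positives("MKV-LS", "MRVALT"))
```

For very small e-values, chance_hit_probability returns almost the same number as the e-value itself, which is why the two are often used interchangeably when judging strong hits.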
Tips and further resources
— Regions of low similarity can break up BLAST hits, which are then displayed separately as sub-hits even though they are part of the same contiguous sequence. This is especially true when blasting a protein-coding gene against genomic sequence. Interrupted by introns, each exon will typically be represented by its own sub-hit. For example, try to blast our target gene above against the P. barbatus genome assembly (“Pbar1.0scaffolds.fa”) using tblastn. When comparing the sequence coordinates in the alignment of the best hit, you will notice that there is a gap of over 600 nucleotides in the genomic target sequence separating the two exons (see below).
— Some websites link their BLAST results to a genome browser. For example, you can examine the genomic context of our target gene by clicking on “See complete hit in GBrowse” (see below). Alternatively, you could search for the gene identifier PB22887 in GBrowse or use the scaffold identifier and coordinates as shown by BLAST. Note that some genes lie on the reverse strand (indicated by a negative “Frame”), in which case the coordinates have to be read backwards.
— Protein or translated nucleotide sequences usually give more robust results and should be used when possible. With only four character states compared to 20, nucleotide sequences are more likely to match by chance and provide less resolution.
— The NCBI BLAST website allows blasting against most of NCBI’s databases, including GenBank, a comprehensive collection of sequence data from thousands of organisms. This can be used to identify unknown sequences, sometimes even to the source organism.
— BLAST is also available as a command-line application (BLAST+) and can be called through BioPerl and Biopython. As such, it can be run and its output parsed as part of a scripted pipeline to handle larger data volumes. The official BLAST+ user manual is available here, and a quick guide to the most common commands here.
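As a rough illustration of the scripted approach, the sketch below builds a BLAST+ command line and parses the tabular output (-outfmt 6) into records. It also covers two of the tips above: it reports the strand of each sub-hit and measures the gaps between neighbouring sub-hits on the target sequence. The file names, identifiers and coordinates in the example are hypothetical, the e-value cutoff is an arbitrary choice, and the helper names are our own. The gap calculation assumes that all sub-hits lie on the same target sequence and strand.

```python
import csv
import io
from collections import namedtuple

# Default columns of BLAST+ tabular output (-outfmt 6).
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()
Hit = namedtuple("Hit", FIELDS)


def build_blast_command(program, query_file, database, out_file, evalue="1e-5"):
    """Return an argument list suitable for subprocess.run (paths are examples)."""
    return [program, "-query", query_file, "-db", database,
            "-outfmt", "6", "-evalue", evalue, "-out", out_file]


def parse_tabular(text):
    """Parse tab-separated BLAST output into Hit records, skipping comments."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hits.append(Hit(row[0], row[1], float(row[2]), *map(int, row[3:10]),
                        float(row[10]), float(row[11])))
    return hits


def subject_interval(hit):
    """Return (start, end, strand); sstart > send indicates the reverse strand."""
    start, end = sorted((hit.sstart, hit.send))
    strand = "+" if hit.sstart <= hit.send else "-"
    return start, end, strand


def gaps_between_subhits(hits):
    """Lengths of the target regions separating neighbouring sub-hits."""
    intervals = sorted(subject_interval(h)[:2] for h in hits)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(intervals, intervals[1:])]


if __name__ == "__main__":
    # Hypothetical output for a protein query with two exon-like sub-hits.
    example = (
        "query1\tscaffold_x\t90.0\t40\t4\t0\t1\t40\t1000\t1119\t1e-20\t80.0\n"
        "query1\tscaffold_x\t85.0\t60\t9\t0\t41\t100\t1800\t1979\t1e-30\t110.0\n"
    )
    hits = parse_tabular(example)
    print(build_blast_command("tblastn", "query.fa", "genome_db", "hits.tsv"))
    print([subject_interval(h) for h in hits])
    print(gaps_between_subhits(hits))
```

In a real pipeline, the command list would be passed to subprocess.run and the resulting output file read back in before parsing. A large gap between two sub-hits of the same query is then a candidate intron.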