Introduction to genetics

Genes and genomes

All living organisms depend on genetic information to exist. It serves as a blueprint and instruction manual for the molecular building blocks that form an organism and keep it functional. Passed on from generation to generation, it contains the program required to build an organism during its development, and it constitutes the substrate on which evolution acts, giving rise to new lineages of organisms over time.

The entirety of an organism’s genetic information is called its genome. The question of what constitutes a gene is more difficult to answer. Generally speaking, a gene is a region in the genome that gives rise to a functional product. Typically, it comprises both regulatory and transcribed sequence, two topics we will cover in more detail below. Many, though not all, genes contain information on how to build specific proteins, large biomolecules that govern nearly every process in an organism. Unless noted otherwise, this tutorial refers to protein-coding genes.

Further reading: What is a gene?


DNA and gene transcription

Genetic information is encoded in deoxyribonucleic acid or DNA. DNA molecules are made up of long chains of simpler units called nucleotides. Each nucleotide consists of a sugar and a phosphate group, which together form the backbone of the chain, and one of the four nucleobases adenine (A), cytosine (C), guanine (G) or thymine (T). The order or sequence of these bases along the backbone is what stores genetic information.

In cells, DNA is arranged as a double-stranded, antiparallel helix (see figure below). In organisms whose cells have a nucleus, long stretches of DNA containing up to hundreds of millions of nucleotides are organized in tightly packed structures called chromosomes, which reside within the nucleus of each cell. Each base on one strand is always paired with a specific base on the other: adenine with thymine, and cytosine with guanine. This provides a built-in mechanism for copying and replicating DNA. For instance, before a cell divides, the two strands separate, and each serves as a template along which free nucleotides complementary to its bases are assembled into a new partner strand. Each daughter cell is thus provided with a full set of double-stranded DNA in the form of chromosomes.
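The pairing rule can be written down in a few lines of Python. This is only an illustrative sketch: the function names and the example sequence are made up for this tutorial, and strands are assumed to be plain strings over the letters A, C, G and T, read in the 5′ to 3′ direction.

```python
# Pairing rules from the text: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base partner of each position in the strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'->3' direction.

    Because the two strands are antiparallel, the partner strand runs in
    the opposite direction, hence the reversal.
    """
    return complement(strand)[::-1]


example = "ATGCGT"  # made-up sequence for illustration
print(complement(example))          # TACGCA
print(reverse_complement(example))  # ACGCAT
```

Applying `reverse_complement` twice returns the original sequence, which mirrors the fact that either strand is sufficient to rebuild the other.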

To use the genetic information stored in a gene, a working copy of its sequence is made first. This process is called transcription. To initiate it, a complex of proteins consisting of RNA polymerase and various transcription factors binds to the DNA. Relying on the principle of complementarity mentioned above, the polymerase then assembles a copy of the coding or sense strand, using the transcribed or antisense strand as a template. The copy is made of ribonucleic acid or RNA, a type of molecule similar to DNA but less stable, typically shorter and single-stranded, and containing the base uracil (U) in place of thymine (T). In the case of protein-coding genes, the copy is called messenger RNA, or mRNA for short, because it carries the information from its site of storage, the nucleus, to the cytoplasm, where it is used to generate a gene product.
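As a rough sketch of what transcription does at the sequence level, the following Python function builds an mRNA string from a template strand. The function name and the example are hypothetical. The sketch assumes the template is given in the 5′ to 3′ direction and uses the fact that RNA carries uracil (U) where DNA carries thymine (T); it ignores everything the polymerase and transcription factors actually do to find the right start and end points.

```python
# Template base -> RNA base that pairs with it (U replaces T in RNA).
DNA_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe(template_strand: str) -> str:
    """Return the mRNA (5'->3') copied from a template strand (given 5'->3').

    The RNA is complementary and antiparallel to the template, so we pair
    each base and then reverse the result.
    """
    return template_strand.upper().translate(DNA_TO_RNA)[::-1]


coding = "ATGCGT"    # made-up coding (sense) strand
template = "ACGCAT"  # its reverse complement, the template (antisense) strand

mrna = transcribe(template)
print(mrna)  # AUGCGU
```

The resulting mRNA reads like the coding strand with every T replaced by U, which is why the coding strand, not the template, is the one usually shown in sequence databases.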


Gene translation and the genetic code

After arriving in the cytoplasm, the messenger RNA attaches to a ribosome, a large molecular machine that carries out translation. During this process, the genetic information stored in a gene and carried by the mRNA molecule is translated into a protein. As mentioned above, the information on how to make a protein is encoded in the order of the four bases. The code, shown in the table below, is nearly universal across all forms of life on Earth: each triplet of bases, or codon, stands for one of the 20 amino acids that form the building blocks of proteins, or for a stop signal. For each codon a ribosome reads, it adds the corresponding amino acid to a growing chain (also called a polypeptide). Since there are 64 codons (4³) but only 20 amino acids, the code is highly redundant, meaning that most amino acids are encoded by multiple codons. In addition, four codons have a special function: ATG (AUG in the mRNA) indicates the start of the coding sequence as well as encoding the amino acid methionine, which is therefore the first amino acid of a newly made protein, while three so-called stop codons mark its end and terminate the growth of the amino acid chain. The sequence and chemical nature of the amino acids determine how the protein folds into a three-dimensional structure, which in turn determines its function in the cell. The whole process of generating a gene product, including transcription and translation, is called gene expression. It is subject to fine-tuned regulation at many levels, adjusting the amount of gene product to what the cell needs at any given time.
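The reading of codons can be mimicked with a small lookup table. The sketch below is a simplification and not how any particular software package does it: the names are made up, the table is the standard code written compactly (bases in the order T, C, A, G, with `*` marking stop codons), and translation is assumed to begin at the first ATG found in the sequence.

```python
from itertools import product

# Standard genetic code, one letter per codon, in the order TTT, TTC, TTA, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str) -> str:
    """Translate from the first ATG up to (not including) the first stop codon.

    Accepts DNA or RNA letters; U is treated as T. Codons containing
    unexpected letters are reported as 'X'.
    """
    seq = sequence.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))               # 64
print(translate("CCATGGCTAAATAGGG"))  # MAK
```

In the example, the ribosome-like loop skips the leading `CC`, reads ATG (M), GCT (A) and AAA (K), and stops at TAG.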


Figure: DNA structure


Codon table (standard code)

TTT F – Phenylalanine | TCT S – Serine | TAT Y – Tyrosine | TGT C – Cysteine
TTC F – Phenylalanine | TCC S – Serine | TAC Y – Tyrosine | TGC C – Cysteine
TTA L – Leucine | TCA S – Serine | TAA STOP | TGA STOP
TTG L – Leucine | TCG S – Serine | TAG STOP | TGG W – Tryptophan
CTT L – Leucine | CCT P – Proline | CAT H – Histidine | CGT R – Arginine
CTC L – Leucine | CCC P – Proline | CAC H – Histidine | CGC R – Arginine
CTA L – Leucine | CCA P – Proline | CAA Q – Glutamine | CGA R – Arginine
CTG L – Leucine | CCG P – Proline | CAG Q – Glutamine | CGG R – Arginine
ATT I – Isoleucine | ACT T – Threonine | AAT N – Asparagine | AGT S – Serine
ATC I – Isoleucine | ACC T – Threonine | AAC N – Asparagine | AGC S – Serine
ATA I – Isoleucine | ACA T – Threonine | AAA K – Lysine | AGA R – Arginine
ATG M – Methionine* | ACG T – Threonine | AAG K – Lysine | AGG R – Arginine
GTT V – Valine | GCT A – Alanine | GAT D – Aspartic acid | GGT G – Glycine
GTC V – Valine | GCC A – Alanine | GAC D – Aspartic acid | GGC G – Glycine
GTA V – Valine | GCA A – Alanine | GAA E – Glutamic acid | GGA G – Glycine
GTG V – Valine | GCG A – Alanine | GAG E – Glutamic acid | GGG G – Glycine

* ATG also serves as the start codon.
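To make the redundancy of the code concrete, we can count how many codons stand for each amino acid. The snippet below is a small sketch with made-up variable names; it rebuilds the standard table from a compact string (bases ordered T, C, A, G, `*` for stop) rather than reading it from the table above.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per one-letter amino acid code ('*' = stop).
codons_per_amino_acid = Counter(CODON_TABLE.values())

print(len(BASES) ** 3)                 # 64 possible triplets
print(len(codons_per_amino_acid) - 1)  # 20 amino acids (stop excluded)
print(codons_per_amino_acid["L"])      # 6 codons for leucine
print(codons_per_amino_acid["M"])      # 1 codon for methionine
print(codons_per_amino_acid["*"])      # 3 stop codons
```

Only methionine and tryptophan are encoded by a single codon; leucine, serine and arginine have six each.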


Gene structure

Proteins are encoded by the coding sequence (CDS) or body of a gene. However, a functional gene contains additional sequence that is transcribed but not translated, or not transcribed at all. Parts of a gene that are included in its mRNA but do not code for protein are called untranslated regions, or UTRs for short. They are located on either side of the coding sequence: the 5′ UTR in front of the start codon, and the 3′ UTR after the stop codon (green and red in the figure below, respectively). In almost all organisms except bacteria, the coding sequence is further split into shorter fragments called exons, which are separated by non-coding sequences called introns. This arrangement allows organisms to be more flexible in producing proteins, for instance by omitting certain exons from the mature mRNA. However, it requires the removal of the introns, which are initially transcribed but later cut out of the transcript in a process called splicing (see also below, under Alternative transcripts). Finally, the initiation of transcription and the regulation of gene activity require promoters and other regulatory elements. Although these elements are considered part of the gene, they are often located far from its transcribed region, and they do not appear in the mRNA because they are not transcribed themselves.
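The relationship between exons, introns, UTRs and the coding sequence can be illustrated with a toy gene. Everything in the sketch below is invented for illustration: the sequence, the exon coordinates (0-based, end-exclusive) and the function names. It also assumes, as a simplification, that the coding sequence starts at the first ATG of the spliced transcript and ends at the first stop codon in the same reading frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(pre_mrna, exons):
    """Join the exon intervals and drop everything between them (the introns)."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


def split_mrna(mrna):
    """Split a mature mRNA into (5' UTR, CDS, 3' UTR)."""
    start = mrna.find("ATG")
    if start == -1:
        raise ValueError("no start codon found")
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3  # the stop codon is counted as part of the CDS
            return mrna[:start], mrna[start:end], mrna[end:]
    raise ValueError("no in-frame stop codon found")


# Toy primary transcript, written in DNA letters like the codon table.
PRE_MRNA = (
    "GGC"           # 5' UTR
    "ATGGCT"        # coding part of exon 1
    "GTAAGTTTTCAG"  # intron
    "AAATAG"        # coding part of exon 2, ending in a stop codon
    "CCT"           # 3' UTR
)
EXONS = [(0, 9), (21, 30)]

mrna = splice(PRE_MRNA, EXONS)
utr5, cds, utr3 = split_mrna(mrna)
print(mrna)             # GGCATGGCTAAATAGCCT
print(utr5, cds, utr3)  # GGC ATGGCTAAATAG CCT
```

Note that in this toy model each exon carries both untranslated and coding sequence, so the intron lies inside the coding sequence and has to be removed before the reading frame makes sense.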




Alternative transcripts

Genes can give rise to more than one type of transcript, which is important to remember when annotating genes on the basis of experimentally obtained transcript data. One of the most important mechanisms by which a gene can produce multiple transcripts is alternative splicing. In this process, particular exons are included in or excluded from the mature mRNA. Because alternative splicing can produce different proteins from the same gene, it is an important means of increasing protein diversity and regulating gene expression.
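A toy model shows how quickly exon skipping multiplies the number of possible transcripts. The function below is a hypothetical sketch: it assumes that the first and last exon are always kept and that every internal exon can be skipped independently. Real genes are far more constrained, and only some of these combinations are actually produced.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """List every transcript that keeps the first and last exon and any
    subset of the internal exons, preserving exon order."""
    if len(exons) < 2:
        raise ValueError("need at least a first and a last exon")
    first, *internal, last = exons
    isoforms = []
    # Go from 'keep all internal exons' down to 'skip all of them'.
    for n_kept in range(len(internal), -1, -1):
        for kept in combinations(internal, n_kept):
            isoforms.append([first, *kept, last])
    return isoforms


for isoform in exon_skipping_isoforms(["E1", "E2", "E3", "E4"]):
    print("-".join(isoform))
# E1-E2-E3-E4
# E1-E2-E4
# E1-E3-E4
# E1-E4
```

With n internal exons this model yields 2^n candidate transcripts, which is one reason transcript data from a single gene can look so varied during annotation.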

Variants of a protein derived from the same gene are called isoforms. Highly similar proteins can also originate from recently duplicated and thus highly similar genes (a process covered in more detail elsewhere in this tutorial), and from alleles. Alleles are variants of the same gene that differ slightly in sequence and may result in different observable phenotypic traits, like eye color. There are typically several alleles of each gene present in a population, and most complex organisms carry two alleles of each gene (which may be identical), one on each of their two sets of chromosomes.


Next chapter: Genome sequencing