6.6: Expressed Sequence Tags - Biology

Only a very small percentage (1.2% in humans) of the DNA in vertebrate genomes encodes proteins (the "proteome"), for two reasons: the exons of most genes are separated by much longer introns, and between our genes lie vast amounts of DNA, much of which appears to regulate the expression of our genes but is not transcribed and translated into a protein product. So even when the complete sequence of a genome is known, it is often difficult to spot particular genes (open reading frames, or ORFs).

One approach to solving the problem is to examine a transcriptome of the organism. Most commonly this is defined as all the messenger RNA (mRNA) molecules transcribed from the genome. It is "a" transcriptome, not "the" transcriptome, because which genes are transcribed in a cell depends on the kind of cell (e.g., liver cell vs. lymphocyte) and what the cell is doing at that time, e.g.,

  • getting ready to divide by mitosis;
  • responding to the arrival of a hormone or cytokine;
  • getting ready to secrete a protein product.

Expressed Sequence Tags (ESTs)

ESTs are short (200–500 nucleotides) DNA sequences that can be used to identify a gene that is being expressed in a cell at a particular time.

The Procedure:

  • Isolate the messenger RNA (mRNA) from a particular tissue (e.g., liver)
  • Treat it with reverse transcriptase. Reverse transcriptase is a DNA polymerase that uses RNA as its template. Thus it is able to make genetic information flow in the reverse (RNA -> DNA) of its normal direction (DNA -> RNA).
  • This produces complementary DNA (cDNA). Note that cDNA differs from the normal gene in lacking the intron sequences.
  • Sequence 200–500 nucleotides at both the 5′ and 3′ ends of each cDNA.
  • Examine the database of the organism's genome to find a matching sequence.
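
The steps above can be walked through on a toy example. The sketch below is only an illustration under stated assumptions: the gene, intron, exon coordinates, and flanking genome are all invented, the tag length is 8 rather than 200–500 nucleotides, and the genome lookup is a plain exact-substring search standing in for a real database search. It also assumes each tag lies within a single exon; a tag spanning an exon junction would not match the genome exactly.

```python
# Toy walk-through of the EST procedure. All sequences are invented.

def splice(gene, exons):
    """Join exon slices (0-based, end-exclusive) and write the result as RNA."""
    return "".join(gene[start:end] for start, end in exons).replace("T", "U")

def to_cdna(mrna):
    """Sense-strand cDNA: same sequence as the mRNA, in DNA letters (no introns)."""
    return mrna.replace("U", "T")

def end_tags(cdna, tag_len):
    """Read tag_len bases from the 5' end and from the 3' end of a cDNA."""
    return cdna[:tag_len], cdna[-tag_len:]

exon1, intron, exon2 = "ATGGCCATTGTA", "GTAAGTCCCTTTCAG", "CGCTGAAAGTGA"
gene = exon1 + intron + exon2
genome = "TTTT" + gene + "GGGG"           # hypothetical genomic context

mrna = splice(gene, [(0, 12), (27, 39)])  # exon coordinates within `gene`
cdna = to_cdna(mrna)
tag5, tag3 = end_tags(cdna, 8)            # real ESTs are 200-500 nt; 8 keeps the toy readable

hit5 = genome.find(tag5)                  # 0-based position, or -1 if absent
hit3 = genome.find(tag3)
full_length_hit = genome.find(cdna)       # -1: the intron interrupts the match
```

In this toy case both end tags are found in the genome while the full-length cDNA is not, which mirrors the point that cDNA lacks the intron sequences present in the gene.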

Genome-Wide Analysis of Gene Expression

Karine G. Le Roch, Elizabeth A. Winzeler, in Encyclopedia of Biological Chemistry, 2004

Serial Methods for Global Analysis of Gene Expression

Serial analysis of gene expression (SAGE) is a technique for collecting genome-wide expression data in which short sequence tags (7–9 bases in length) are isolated from near the 3′ end of transcripts using biochemical methods (Figure 1). The short tags (usually 10,000–20,000 per condition) are concatenated and then sequenced. Finally, each SAGE tag is mapped back to the gene from which it originated by using full-genome sequence data. By counting the number of SAGE tags for a gene under several different conditions, one can estimate how the gene's expression is changing. This technique does not depend on constructing a microarray, can be initiated with full genome sequence information, and is very quantitative, especially for highly expressed genes. However, it is relatively expensive and much more time consuming than microarray analysis. Other serial methods include sequencing libraries of expressed sequence tags (ESTs).

Figure 1. Serial analysis of gene expression. RNA from the condition of interest is first converted to double-stranded cDNA using reverse transcriptase and DNA polymerase enzymes (A). The double-stranded cDNA is then cleaved with a restriction enzyme that cuts every few hundred bases in the genome of interest (usually NlaIII, whose site is here represented by solid circles). A short piece of DNA is then ligated at the cut site of the first restriction enzyme (B), generating a recognition site for a type IIS restriction enzyme (usually BsmFI, whose recognition site is shown as a solid rectangle), which cuts a defined distance from its recognition site, creating a 15 bp SAGE tag. After restriction with the type IIS enzyme (C), the tags are biochemically linked to one another and finally sequenced (D). The number of SAGE tags mapping to a gene is then tabulated to determine the gene's expression level.
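
The final tabulation step can be sketched in a few lines. This is a minimal illustration, not the published pipeline: the tag sequences, the tag-to-gene lookup table, and the gene names are hypothetical, and tags that do not map to any gene are simply skipped.

```python
from collections import Counter

def count_tags_per_gene(tags, tag_to_gene):
    """Tabulate how many sequenced tags map to each gene; unmapped tags are ignored."""
    counts = Counter()
    for tag in tags:
        gene = tag_to_gene.get(tag)
        if gene is not None:
            counts[gene] += 1
    return counts

def relative_abundance(counts, gene):
    """Fraction of all mapped tags that belong to one gene."""
    total = sum(counts.values())
    return counts[gene] / total if total else 0.0

# Hypothetical lookup table built from genome sequence (tag -> gene of origin).
tag_to_gene = {"AAACCCGGGT": "geneA", "TTTGGGCCCA": "geneB"}

# Hypothetical tag lists from two conditions; "ACACACACAC" maps to nothing.
control = ["AAACCCGGGT"] * 3 + ["TTTGGGCCCA"] + ["ACACACACAC"]
treated = ["AAACCCGGGT"] + ["TTTGGGCCCA"] * 4

control_counts = count_tags_per_gene(control, tag_to_gene)
treated_counts = count_tags_per_gene(treated, tag_to_gene)
```

Comparing `relative_abundance(control_counts, "geneA")` with the same call on `treated_counts` gives the kind of between-condition estimate described above. Dividing by the total compensates for the two libraries containing different numbers of tags.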

  1. Restriction Fragment Length Polymorphism (RFLP)
  2. Amplified Fragment Length Polymorphism (AFLP)
  3. Random Amplified Polymorphic DNA (RAPD)
  4. Cleaved Amplified Polymorphic Sequences (CAPS)
  5. Simple Sequence Repeat (SSR) Length Polymorphism
  6. Single Strand Conformational Polymorphism (SSCP)
  7. Heteroduplex Analysis (HA)
  8. Single Nucleotide Polymorphism (SNP)
  9. Expressed Sequence Tags (EST)
  10. Sequence Tagged Sites (STS)

Restriction Fragment Length Polymorphisms (RFLPs):

RFLPs refer to variations found within a species in the length of DNA fragments generated by a specific restriction endonuclease. RFLPs were the first type of DNA marker developed to distinguish individuals at the DNA level. The RFLP technique was developed before the discovery of the polymerase chain reaction (PCR).

The advantages, disadvantages and uses of this technique are presented below:

The RFLP technique has several advantages. It is a comparatively cheap and simple technique of DNA marker analysis. It does not require special instrumentation. The majority of RFLP markers are co-dominant and highly locus specific. They are powerful tools for comparative and synteny mapping.

It is useful in developing other markers such as CAPS and INDEL. Several samples can be screened simultaneously by this technique using different probes. RFLP genotypes for single copy or low copy number genes can be easily scored and interpreted.

The technique also has disadvantages. Developing sets of RFLP probes and markers is labour intensive. The technique requires a large amount of high-quality DNA. The multiplex ratio is low, typically one per gel. The genotyping throughput is low. It involves the use of radioactive chemicals. RFLP fingerprints for multi-gene families are often complex and difficult to score. RFLP probes cannot be shared between laboratories.

RFLPs have several uses. They can be used to determine paternity. In criminal cases, they can be used to determine the source of a DNA sample. They can be used to determine the disease status of an individual. They are useful in gene mapping, germplasm characterization and marker-assisted selection. They are also useful in detecting a pathogen in plants even if it is in a latent stage.

Amplified Fragment Length Polymorphism (AFLP):

AFLPs are differences in restriction fragment lengths caused by SNPs or INDELs that create or abolish restriction endonuclease recognition sites. AFLP assays are performed by selectively amplifying a pool of restriction fragments using PCR. The AFLP technique was originally known as selective restriction fragment amplification.

The AFLP technique provides a very high multiplex ratio and genotyping throughput. AFLP markers are highly reproducible across laboratories. No marker development work is needed; however, AFLP primer screening is often necessary to identify optimal primer specificities and combinations.

No special instrumentation is needed for performing AFLP assays; however, special instrumentation is needed for co-dominant scoring.

Start-up costs are moderately low. AFLP assays can be performed using very small DNA samples (typically 0.2 to 2.5 µg per individual). The technology can be applied to virtually any organism with minimal initial development.

The maximum polymorphic information content for any bi-allelic marker is 0.5. High quality DNA is needed to ensure complete restriction enzyme digestion. DNA quality may or may not be a weakness depending on the species. Rapid methods for isolating DNA may not produce sufficiently clean template DNA for AFLP analysis.
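
The 0.5 ceiling can be checked numerically. The sketch below assumes that "polymorphic information content" is used here in the sense of expected heterozygosity (gene diversity), 1 − Σp², which is the definition under which a bi-allelic marker peaks at 0.5. For comparison it also computes the stricter PIC of Botstein et al., whose bi-allelic maximum is 0.375. Function names are illustrative.

```python
def gene_diversity(freqs):
    """Expected heterozygosity: 1 - sum of squared allele frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)

def botstein_pic(freqs):
    """Botstein-style PIC: gene diversity minus the 2 * p_i^2 * p_j^2 cross terms."""
    n = len(freqs)
    cross = sum(
        2.0 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
    return gene_diversity(freqs) - cross

# Scan bi-allelic frequencies p = 0.00, 0.01, ..., 1.00 for the maximum.
biallelic_max = max(gene_diversity([p / 100, 1 - p / 100]) for p in range(101))

# A multi-allelic marker (e.g., an SSR with four equally common alleles) can exceed it.
four_allele = gene_diversity([0.25, 0.25, 0.25, 0.25])
```

The maximum falls at equal allele frequencies (p = 0.5), which is why dominant or bi-allelic marker systems are less informative per locus than multi-allelic ones.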

Proprietary technology is needed to score heterozygotes and ++ homozygotes. Otherwise, AFLPs must be dominantly scored. Dominance may or may not be a weakness depending on the application.

The homology of a restriction fragment cannot be unequivocally ascertained across genotypes or mapping populations. Developing locus specific markers from individual fragments can be difficult and does not seem to be widely done.

The switch to non-radioactive assays has not been rapid. Chemiluminescent AFLP fingerprinting methods have been developed and seem to work well.

The fingerprints produced by fluorescent AFLP assay methods are often difficult to interpret and score and thus do not seem to be widely used. AFLP markers often densely cluster in centromeric regions in species with large genomes, e.g., barley (Hordeum vulgare L.) and sunflower (Helianthus annuus L.).

This technique has been widely used in the construction of genetic maps containing high densities of DNA markers. In plant breeding and genetics, AFLP markers are used in varietal identification, germplasm characterization, gene tagging and marker-assisted selection.

Random Amplified Polymorphic DNA (RAPDs):

RAPD refers to polymorphism found within a species in DNA fragments randomly amplified by PCR. RAPDs are PCR-based DNA markers. RAPD marker assays are performed using a single DNA primer of arbitrary sequence.

RAPD primers are universal and therefore readily available. They provide moderately high genotyping throughput. The technique is a simple PCR assay (no blotting and no radioactivity). It does not require special equipment; only PCR is needed. The start-up cost is low.

RAPD marker assays can be performed using very small DNA samples (5 to 25 ng per sample). RAPD primers are universal and can be commercially purchased. RAPD markers can be easily shared between laboratories. Locus-specific, co-dominant PCR-based markers can be developed from RAPD markers. It provides more polymorphism than RFLPs.

The detection of polymorphism is limited. The maximum polymorphic information content for any bi-allelic marker is 0.5. This technique only detects dominant markers. The reproducibility of RAPD assays across laboratories is often low. The homology of fragments across genotypes cannot be ascertained without mapping. It is not applicable in marker-assisted breeding programmes.

This technique can be used in various ways such as for varietal identification, DNA fingerprinting, gene tagging and construction of linkage maps. It can also be used to study phylogenetic relationship among species and sub-species and assessment of variability in breeding populations.

Cleaved Amplified Polymorphic Sequences (CAPS):

CAPS polymorphisms are differences in restriction fragment lengths caused by SNPs or INDELs that create or abolish restriction endonuclease recognition sites in PCR amplicons produced by locus-specific oligonucleotide primers.

CAPS assays are performed by digesting locus-specific PCR amplicons with one or more restriction enzymes and separating the digested DNA on agarose or polyacrylamide gels.
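
A CAPS assay can be imitated in silico by cutting two alleles of the same amplicon and comparing fragment lengths. The sketch below is illustrative only: the amplicon sequences are invented, EcoRI (recognition site GAATTC, cut after the first G) is used as an example enzyme, and only one strand is modelled.

```python
def digest(amplicon, site, cut_offset):
    """Cut at every occurrence of `site`; `cut_offset` is the number of
    site bases left of the cut (1 for EcoRI, G^AATTC)."""
    fragments = []
    start = 0
    pos = amplicon.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(amplicon[start:cut])
        start = cut
        pos = amplicon.find(site, pos + 1)
    fragments.append(amplicon[start:])
    return fragments

def band_sizes(alleles, site, cut_offset):
    """Fragment lengths that would appear on a gel for one individual."""
    sizes = set()
    for allele in alleles:
        sizes.update(len(f) for f in digest(allele, site, cut_offset))
    return sorted(sizes)

# Hypothetical amplicon alleles: a single SNP (A -> G) abolishes the EcoRI site in allele_b.
allele_a = "ACGTACGTAC" + "GAATTC" + "TTGACCTGAA"
allele_b = "ACGTACGTAC" + "GAGTTC" + "TTGACCTGAA"

hom_a = band_sizes([allele_a, allele_a], "GAATTC", 1)
hom_b = band_sizes([allele_b, allele_b], "GAATTC", 1)
het = band_sizes([allele_a, allele_b], "GAATTC", 1)
```

The heterozygote shows both the cut and the uncut bands, which is what makes a marker of this kind co-dominant.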

CAPS analysis is versatile and can be combined with single strand conformational polymorphism (SSCP), sequence-characterized amplified region (SCAR), or random amplified polymorphic DNA (RAPD) analysis to increase the chance of finding a DNA polymorphism.

Michaels and Amasino (1998) proposed a variant of the CAPS method called dCAPS based on SNPs.

The genotyping throughput is moderately high. It is a simple PCR assay. Markers are developed from the DNA sequences of previously mapped RFLP markers. Most CAPS markers are co-dominant and locus specific. No special equipment is needed to perform manual CAPS marker assays.

CAPS marker assays can be performed using semi-automated methods, e.g., fluorescent assays on a DNA sequencer (e.g., ABI377). Start-up costs are low for manual assay methods. CAPS assays can be performed using very small DNA samples (typically 50 to 100 ng per individual). Most CAPS genotypes are easily scored and interpreted. CAPS markers are easily shared between laboratories.

Typically, a battery of restriction enzymes must be tested to find polymorphisms. Although CAPS markers still have great utility and should not be overlooked, other methods have emerged as tools for screening locus-specific DNA fragments for polymorphisms, e.g., SNP assays. The development of easily scored and interpreted assays may be difficult for some genes, especially those belonging to multi-gene families.

This is a straightforward way to develop PCR-based markers from the DNA sequences of previously mapped RFLP markers. It is a simple method that builds on the investment of an RFLP map and eliminates the need for DNA blotting.

Simple Sequence Repeats (SSRs):

Simple sequence repeats (SSRs) or microsatellites are tandemly repeated mono-, di-, tri-, tetra-, penta-, and hexanucleotide motifs. SSR length polymorphisms are caused by differences in the number of repeats. SSR loci are individually amplified by PCR using pairs of oligonucleotide primers specific to unique DNA sequences flanking the SSR sequence.
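
A simple way to see what counts as an SSR is to scan a sequence for short motifs repeated in tandem. The following is a minimal sketch using a regular expression with a backreference; the thresholds (motifs of 1 to 6 bases, at least 4 repeats) and the test sequences are assumptions chosen for illustration, not a standard.

```python
import re

def find_ssrs(seq, min_repeats=4, max_motif=6):
    """Return (start, motif, repeat_count) for each tandem repeat found.
    The lazy motif group makes the shortest repeating unit win."""
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits

# Hypothetical locus carrying a (CA)6 and an (AAT)4 repeat.
locus = "GGC" + "CA" * 6 + "TTG" + "AAT" * 4 + "CG"
ssrs = find_ssrs(locus)

# Two hypothetical alleles that differ only in the number of CA repeats.
allele_short = "GGC" + "CA" * 6 + "TTG"
allele_long = "GGC" + "CA" * 9 + "TTG"
length_difference = len(allele_long) - len(allele_short)
```

Because the flanking sequence is identical in both alleles, primers placed there would amplify products whose lengths differ only by the repeat-count difference, which is the length polymorphism an SSR assay scores.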

Jeffreys (1985) showed that some restriction fragment length polymorphisms are caused by VNTRs. The name “mini satellite” was coined because of the similarity of VNTRs to larger satellite DNA repeats.

The genotyping throughput is high. This is a simple PCR assay. Many SSR markers are multi-allelic and highly polymorphic. SSR markers can be multiplexed, either functionally by pooling independent PCR products or by true multiplex PCR. Semi-automated SSR genotyping methods have been developed. Most SSRs are co-dominant and locus specific.

No special equipment is needed for performing SSR assays; however, special equipment is needed for some assay methods, e.g., semi-automated fluorescent assays performed on a DNA sequencer. Start-up costs are low for manual assay methods (once the markers are developed). SSR assays can be performed using very small DNA samples (100 ng per individual). SSR markers are easily shared between laboratories.

The development of SSRs is labor intensive. SSR marker development costs are very high. SSR markers are taxa specific. Start-up costs are high for automated SSR assay methods. Developing PCR multiplexes is difficult and expensive. Some markers may not multiplex.

SSR markers are used for mapping of genes in eukaryotes.

Single Strand Conformational Polymorphisms (SSCPs):

SSCPs refer to DNA polymorphisms produced by differential folding of single-stranded DNA harboring mutations. The conformation of the folded DNA molecule is produced by intra-molecular interactions and is thus a function of the DNA sequence.

SSCP marker assays are performed using heat-denatured DNA on non-denaturing DNA sequencing gels. Special gels (e.g., mutation detection enhancement gels) have been developed to enhance the discovery of single-strand conformational polymorphisms caused by INDELs, SNPs, or SSRs.

It is a simple PCR assay. Many SSCP markers are multi-allelic and highly polymorphic. Most SSCPs are co-dominant and locus specific. No special equipment is needed. Start-up costs are low. SSCP marker assays can be performed using very small DNA samples (typically 10 to 50 ng per individual).

SSCP markers are easily shared between laboratories. SSCP gels can be silver stained (no radioactivity). The complexity of PCR products can be assessed and individual fragments can be isolated and sequenced.

The development of SSCP markers is labor intensive. SSCP marker development costs can be high. SSCP marker analysis cannot be automated.

SSCPs have been widely used in human genetics to screen disease genes for DNA polymorphisms. Although SSCP analysis does not uncover every DNA sequence polymorphism, the methodology is straightforward and a significant number of polymorphisms can be discovered. SSCP analysis can be a powerful tool for assessing the complexity of PCR products.

Heteroduplex Analysis (HA):

Heteroduplex analysis refers to the detection of DNA polymorphisms by separating homoduplex from heteroduplex DNA using non-denaturing gel electrophoresis or partially denaturing high performance liquid chromatography (HPLC).

Single-base mismatches between genotypes produce heteroduplexes; thus, the presence of heteroduplexes signals the presence of DNA polymorphisms. Heteroduplex analyses can be rapidly and efficiently performed on numerous genotypes before specific alleles are sequenced, thereby greatly reducing sequencing costs in SNP discovery and SNP marker development.

It is a powerful method for SNP discovery. Automated HA can be performed using HPLC. Most heteroduplex markers are co-dominant and locus specific. HA can be performed using very small DNA samples (typically 10 to 50 ng per individual). HA markers are easily shared between laboratories.

The technique requires special equipment. One protocol may not be sufficient for heteroduplex analyses of different targets via HPLC.

Heteroduplex analysis has been mostly used in human genetics to screen disease genes for DNA polymorphism. In plant breeding, it is used for detection of pathogens which are in latent stage and thus useful in selection of disease free plants. It is also useful in the discovery of single nucleotide polymorphism.

Single Nucleotide Polymorphism (SNP):

Variations found at a single nucleotide position are known as single nucleotide polymorphisms (SNPs). Such variation results from substitution, deletion or insertion. This type of polymorphism usually has two alleles, and such loci are also called biallelic loci. SNPs are the most common class of DNA polymorphism. They are found both in natural lines and after induced mutagenesis. The main features of SNP markers are given below.

1. SNP markers are highly polymorphic and mostly biallelic.

2. The genotyping throughput is very high.

3. SNP markers are locus specific.

4. Such variation results from substitution, deletion or insertion.

5. SNP markers are an excellent long-term investment.

6. SNP markers can be used to pinpoint functional polymorphism.

7. This technique requires small amount of DNA.
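
Single-position differences of this kind are easy to list once two sequences have been aligned. The sketch below assumes the alignment has already been done and that gaps are written as "-"; the two sequences are invented, and real SNP calling works from many reads with quality scores rather than from one pair of strings.

```python
def find_single_base_variants(ref, alt):
    """Compare two pre-aligned, equal-length sequences position by position.
    Returns (position, ref_base, alt_base, kind) with 0-based positions."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    for pos, (r, a) in enumerate(zip(ref, alt)):
        if r == a:
            continue
        if a == "-":
            kind = "deletion"
        elif r == "-":
            kind = "insertion"
        else:
            kind = "substitution"
        variants.append((pos, r, a, kind))
    return variants

# Hypothetical aligned alleles from two lines.
reference = "ACGTTAGC"
variant = "ACGCTA-C"
variants = find_single_base_variants(reference, variant)
```

Each reported site has two observed states (reference and variant), which is the sense in which most SNPs are biallelic.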

SNP markers are useful in gene mapping. SNPs help in detection of mutations at molecular level. SNP markers are useful in positional cloning of a mutant locus. SNP markers are useful in detection of disease causing genes.

Most SNPs are biallelic and less informative than SSRs. Multiplexing is not possible for all loci. Some SNP assay techniques are costly. Development of SNP markers is labour intensive. About three times as many SNPs as SSR markers are required to prepare genetic maps.

SNPs are useful in preparing genetic maps. They have been used in preparing human genetic maps. In plant breeding, SNPs have been used to a lesser extent.

Expressed Sequence Tags (EST):

Expressed Sequence Tags (ESTs) are small pieces of DNA whose location and sequence on the chromosome are known. The term Expressed Sequence Tags (ESTs) was first used by Venter and his colleagues in 1991. The main features of EST markers are given below.

1. ESTs are short DNA sequences (200-500 nucleotide long).

2. They are a type of sequence tagged sites (STS).

3. ESTs consist of exons only.

It is a rapid and inexpensive technique for locating a gene. ESTs are useful in discovering new genes related to genetic diseases. They can be used to study tissue-specific gene expression.

ESTs lack primer specificity. The technique is time consuming and labour intensive. Its precision is lower than that of other techniques. It is difficult to obtain large (> 6 kb) transcripts. Multiplexing is not possible for all loci.

ESTs are commonly used to map genes of known function. They are also used for phylogenetic studies and generating DNA arrays.

Sequence Tagged Sites (STS):

In genomics, a sequence tagged site (STS) is a short DNA sequence that has a single copy in a genome and whose location and base sequence are known. Main features of STS markers are given below.

1. STSs are short DNA sequences (200-500 nucleotide long).

2. STSs occur only once in the genome.

3. STS are detected by PCR in the presence of all other genomic sequences.

4. STSs are derived from cDNAs.
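
The "occurs only once in the genome" criterion can be expressed as a small check. This sketch counts exact matches on both strands of a toy genome; the genome and candidate sequences are invented and far shorter than a real 200–500 nucleotide STS, and exact string matching is a stand-in for the PCR-based detection described above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def count_occurrences(genome, seq):
    """Count exact matches of seq in genome, allowing overlapping matches."""
    count = 0
    pos = genome.find(seq)
    while pos != -1:
        count += 1
        pos = genome.find(seq, pos + 1)
    return count

def is_single_copy(genome, seq):
    """True if seq is found exactly once when both strands are considered."""
    total = count_occurrences(genome, seq)
    rc = reverse_complement(seq)
    if rc != seq:  # avoid double-counting palindromic sequences
        total += count_occurrences(genome, rc)
    return total == 1

toy_genome = "ATGCGTACGTTAGCATGCGTAGGCTA"      # hypothetical
unique_candidate = "TTAGC"                      # present once
repeated_candidate = "ATGCGTA"                  # present twice

unique_ok = is_single_copy(toy_genome, unique_candidate)
repeated_ok = is_single_copy(toy_genome, repeated_candidate)
```

A candidate that fails this check would give ambiguous positions on a physical map, which is why single-copy status is part of the STS definition.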

STSs are useful in physical mapping of genes. This technique permits sharing of data across laboratories. It is more rapid and more specific than DNA hybridization techniques. It has a high degree of accuracy. It can be automated.

Development of STSs is a difficult task. It is a time-consuming and labour-intensive technique. It requires high technical skill.

STS is the most powerful physical mapping technique. It can be used to identify any locus on the chromosome. STSs are used as standard markers to find out gene in any region of the genome. It is used for constructing detailed maps of large genomes.


Past studies have pointed to a role for HPS in the control of seed lustre in soybean cultivars [2, 5]. Now, we have conducted an extensive study of Hps copy number polymorphisms in a range of soybean lines and related legume species. The structure of the Hps gene was investigated by isolating and characterizing clones from the genomic region. The results have led us to propose a model to account for variation of seed lustre controlled by Hps.

From the analysis of DNA blot hybridizations of various soybean cultivars, lines, and related species, we can conclude that Hps copy number polymorphisms are common in soybean. The Hps locus appears to have evolved and diversified in soybean (Glycine max) in comparison to its wild ancestor (Glycine soja). Hybridization patterns show that the Hps sequence itself is also specific to these two species, a result that is supported by searches of DNA and protein sequences in GenBank (not shown). HPS shows similarities to so-called bi-modular proteins containing plant lipid transfer protein (LTP) domains [5]. The plant LTPs constitute a large group of related proteins derived from the prolamin super-family. Our results show that HPS has diverged substantially from other LTPs and that there are no close counterparts in other species.

All Glycine max lines that were tested contained multiple copies of the Hps gene, but there were large differences in the number of copies of Hps depending on the cultivar examined. We observed a good correlation between the apparent Hps copy number, as judged by hybridization intensity on DNA blots, and seed lustre. This is especially true for dull- and shiny-seeded phenotypes and for intermediates between these types. This relationship was not apparent for bloom phenotypes, an exception that has been noted in past studies that correlated the occurrence of HPS protein to seed lustre [2, 5]. Two bloom phenotypes analyzed, Clark B1 and Sooty, produced contrasting patterns of Hps hybridization. This can be accounted for by tracing the pedigree of Clark B1. The cv Clark is a dull phenotype with a high-copy Hps RFLP pattern, whereas Sooty is a bloom phenotype with a low-copy RFLP pattern. Clark B1 is an isoline derived from a cross between Clark and Sooty, with Clark as the recurrent parent. This indicates that the bloom phenotype (B1) is controlled by genes that are independent of B and Hps.

Multiple copies of Hps could be detected in a number of different soybean cultivars and lines by conventional DNA blot hybridizations. Multiple genomic copies of Hps were also detected by real-time PCR analysis (not shown). Copy number estimates from real-time PCR analysis were more variable and always exceeded estimates determined by conventional hybridizations. Each type of analysis, real-time PCR and conventional hybridization, was performed many times and, overall, we have greater confidence in the results from conventional hybridizations. By this method, the soybean cv Harosoy 63 was estimated to possess 27 ± 5 copies of Hps per haploid genome.

Variation in Hps copy number among different soybean lines could also be detected by comparative genomic hybridization (CGH) to cDNA microarrays. Although substantial differences in Hps copy number were detected by CGH, quantifying the number of Hps copies in a particular genome was not possible since hybridization intensities were not calibrated. Nonetheless, we have shown that CGH may be used to search for copy number polymorphisms in plant genomes. It is a potentially powerful application of microarrays that may be under-appreciated. For example, genomic DNA from plant lines that differ in a particular trait of interest could be screened using microarrays to identify genes that show differences in copy number. These genes could be tested as candidates for the trait of interest.

From our analysis of Hps gene structure, at least three pieces of evidence suggest that most of the Hps copies share a high degree of sequence identity. First, the hybridization patterns produced upon digestion of genomic DNA with a variety of enzymes indicate that restriction enzyme sites have been conserved in most of the gene copies. Secondly, analysis of Hps genomic clones indicates that independent clones with nearly identical sequences correspond to separate copies of Hps genes. Finally, expressed sequence tags encoding Hps transcripts do not show a high degree of sequence polymorphism [14, 15]. Thus, it appears that most copies of the Hps gene have not diverged in sequence. This indicates that duplication and expansion of this gene cluster has been a recent event, or that sequence identity is maintained by frequent recombination events occurring within the cluster. Naturally, it would be desirable to clone a contiguous region of genomic DNA encompassing the entire tandem array of Hps genes. We attempted to do this by screening bacterial artificial chromosome (BAC) libraries but were unsuccessful. It is known that tandem arrays may be intractable to cloning and propagation [16], perhaps explaining this result.

In the cross between soybean lines OX281 and Mukden, Hps copy number polymorphisms cosegregated with seed lustre phenotype B and associated genetic markers. This result was expected because past studies have shown that B cosegregates with the presence of HPS protein on the seed surface, and with a DNA marker derived from the Hps cDNA sequence [2]. The multiple copies of Hps that are present in OX281 segregate in a Mendelian fashion, indicating that they occur at a single genetic locus and are not distributed throughout the genome. The analysis and assembly of Hps genomic clones substantiates the inheritance results, since the clones could be aligned to produce a reiterated array of Hps genes. All of the evidence therefore points to a tandem array of Hps genes occurring in a structural configuration arising from gene amplification.

Gene amplification occurs when multiple identical copies of a DNA sequence are duplicated within the genome. It may be an adaptive mechanism that results from selective pressure on the genome, as illustrated by drug, insecticide, or herbicide resistance observed in cell lines or in populations [17–19]. Amplification typically leads to a tandem array of reiterated units, such as that observed for genes encoding rRNA, snRNA, and histones [17]. Unlike genes that undergo duplication and divergence [20], individual units within a tandem array are under constraint and maintain a high degree of sequence identity. Structural genes occurring in tandem arrays that are stable over generations, such as rDNAs, are considered a mechanism to accommodate cellular demand for large amounts of identical gene product. Genetic components embedded within tandem arrays that act to stabilize or promote gene amplification have been proposed, such as AT-rich tracts, autonomously replicating sequence (ARS) elements, and matrix attachment regions (MAR) [17]. These cis-acting elements have even been used to modulate gene copy number and expression levels of heterologous genes in transformed cells [21].

Thus, the features of the Hps locus appear to be consistent with characteristics associated with other amplified genes, from plants and animals. Plant genomes are known to have many large gene families and duplicated genes occurring in tandem arrays are also fairly common [22]. One of the largest tandem arrays characterized in plants corresponds to a gene cluster of 22 copies encoding alpha zeins in Zea mays [23], but there are few other examples of extensive arrays of nearly identical structural genes at one locus. The Hps locus is also exceptional because of the allelic variation in copy number of this gene cluster among different soybean cultivars and lines. Although it is not clear whether all copies of Hps are functionally expressed and transcribed, previous work has shown that transcripts encoding Hps are far more abundant in the endocarp of soybean lines with many genomic copies of Hps than in lines with few copies [5].

The results from this study together with past work [2, 5] can be integrated into a model, whereby Hps genomic copy number operates as a genetic rheostat to control transcriptional and translational flux and the resulting quantity of HPS protein synthesized by the endocarp. Variation in HPS protein levels expressed in the endocarp could then account for the variable pattern of attachment of this tissue to the seed surface, and the resulting seed lustre phenotypes. Alternative explanations cannot be excluded, but the evidence so far tends to favour this gene amplification-based hypothesis. What kind of selection pressure could cause this to occur? The size, shape, colour, and general appearance of the seed are traits that are under intense selective pressure for crop plants, especially so for legumes. Even today certain markets may favour dull- or shiny-seeded soybeans for particular uses, so it is not unreasonable to suppose that selection for various lustre phenotypes has accompanied the development and expansion of this crop since its domestication some 3,000 years ago [24].

We thank L. Hood and D. Smoller for helpful discussions; A.J. Ross, S. O'Brien, and A. Cziko for bee collections; S. O'Brien for bee brain dissections; D. Toma for RNA extraction; M. Rebeiz for assistance with PERL programming; A. Cziko for assistance in microarray fabrication; and R. Hoskins, S. Clough, and members of the Robinson lab for reviewing the manuscript. Special thanks to H.A. Lewin, Director of the Keck Center, for excellent advice throughout the project and his tireless and creative efforts to facilitate genomics research on this campus. This research was supported by an NSF Postdoctoral Fellowship in Bioinformatics (C.W.W.) and grants from the University of Illinois Critical Research Initiatives Program and the Burroughs Wellcome Trust (G.E.R.).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.


Analogous to the proteins that bind endogenous fluorophores, proteins that bind exogenous fluorophores have also been developed. The best characterized of these are fluorogen-activating proteins, single-chain antibodies that bind a nonfluorescent molecule and stabilize it in a fluorescent state (Szent-Gyorgyi et al., 2008; Bruchez, 2015). These are commercially available with both cell-permeant and cell-impermeant ligands, enabling discrimination of intracellular fusions from extracellular fusions. A related labeling scheme is that of FlAsH/ReAsH, in which a six–amino acid tetracysteine motif recognizes arsenic-containing dyes (Griffin et al., 1998; Gaietta et al., 2002). By themselves, the dyes are not fluorescent, but they become fluorescent when bound to the tetracysteine tag. Nonspecific binding to cysteines can lead to background fluorescence and is suppressed by washing with sulfhydryl-containing compounds. Although not widely used, this labeling scheme is noteworthy because the tag is very small. Spectral properties for these tags are given in Table 1.

An alternative way to fluorescently label a protein of interest is by covalently coupling a dye molecule to it. Although this has long been done in vitro using amine- or sulfhydryl-reactive dyes, more recently, self-labeling tag sequences have been used for this. These tags covalently react with a small-molecule substrate containing a fluorophore (Table 2). The most widely used tags are the SNAP(f), CLIP(f), and Halo tags (Keppler et al., 2003; Gautier et al., 2008; Los et al., 2008; Sun et al., 2011). The SNAP and CLIP tags are variants of O6-alkylguanine-DNA alkyltransferase that react with benzylguanine and benzylcytosine derivatives, respectively (Figure 1). The Halo tag is derived from haloalkane dehalogenase and reacts with alkylhalides. A similar but less widely used tag is the TMP tag, which uses an engineered Escherichia coli dihydrofolate reductase to react with trimethoprim-fused fluorophores (Miller et al., 2005; Chen et al., 2012). In these systems, the reactive group that covalently binds to the tag is independent of the attached fluorophore, allowing a wide variety of fluorophores (and other molecules, such as affinity tags) to be attached. This chemical versatility allows changing the label on the protein by simply changing the substrate and enables experiments that would be difficult to carry out with other tags, such as two-color pulse-chase labeling by first incubating with one substrate and then with a second, or distinguishing intracellular from extracellular protein by labeling with cell-permeant and cell-impermeant substrates. The major drawback to these proteins is the added complexity of using an external substrate that is itself fluorescent and may require washing to reduce background, although there is a version of the TMP tag that reacts with nonfluorescent substrates to produce fluorescent adducts (Jing and Cornish, 2013). In addition, newly synthesized protein is fluorescently labeled only if substrate is available, which makes these methods less useful for long time-lapse imaging experiments.

TABLE 2: Other genetically encoded tagging strategies.

Modular tags for protein and RNA sequences that are discussed in the text are listed here. For more information, see the text.


Selection is predicted to affect the genome in a locus-specific way while also affecting flanking neutral variation, a phenomenon known as genetic hitchhiking. This prediction enables the use of polymorphic markers in noncoding regions to detect the footprints of selection. However, the strength of the selective footprint on a locus depends on the distance from the selected site and decays with time due to recombination. Using polymorphic markers closely linked to coding regions of the genome should therefore increase the probability of detecting the footprints of selection, because more gene-containing regions are covered. Highly polymorphic microsatellites in the untranslated regions of expressed sequence tags (ESTs) are a potentially useful source of gene-associated polymorphisms that has thus far not been utilized for genome screens in natural populations. In this study, we searched for the genetic signatures of divergent selection by screening 95 genomic and EST-derived mini- and microsatellites in eight natural populations of Atlantic salmon, Salmo salar L., from different spatial scales inhabiting contrasting natural environments (salt-, brackish, and freshwater habitats). Altogether, we identified nine EST-associated microsatellites that exhibited highly significant deviations from neutral expectations using different statistical methods at various spatial scales and showed similar trends in separate population samples from different environments (salt-, brackish, and freshwater habitats) and sea areas (Barents vs. White Sea). We consider these ESTs the best candidate loci affected by divergent selection, and hence they serve as promising genes associated with adaptive divergence in Atlantic salmon. Our results demonstrate that EST-linked microsatellite genome scans provide an efficient strategy for discovering functional polymorphisms, especially in nonmodel organisms.
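
The abstract above does not say which statistics were used to detect deviations from neutrality, so the following is only an illustrative sketch of the general idea behind an outlier genome scan, not the study's method. It assumes allele frequencies per population are already known, computes a simple G_ST-style differentiation value per locus, and flags loci above a cutoff. The locus names, frequencies, and the 0.5 cutoff are all hypothetical; a real scan would derive the cutoff from a neutral model or simulations.

```python
from typing import Dict, List


def expected_heterozygosity(freqs: List[float]) -> float:
    """Expected heterozygosity, 1 - sum(p_i^2), for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def locus_fst(pop_freqs: List[List[float]]) -> float:
    """G_ST-style differentiation, (H_T - H_S) / H_T, with populations weighted equally.

    pop_freqs holds one list of allele frequencies per population,
    with alleles in the same order in every population.
    """
    n_pops = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    # H_S: mean expected heterozygosity within populations.
    h_s = sum(expected_heterozygosity(f) for f in pop_freqs) / n_pops
    # H_T: expected heterozygosity of the averaged allele frequencies.
    mean_freqs = [sum(f[i] for f in pop_freqs) / n_pops for i in range(n_alleles)]
    h_t = expected_heterozygosity(mean_freqs)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def flag_outliers(fst_by_locus: Dict[str, float], threshold: float) -> List[str]:
    """Return the names of loci whose differentiation exceeds the threshold."""
    return sorted(name for name, fst in fst_by_locus.items() if fst > threshold)


# Hypothetical two-allele loci scored in two populations (made-up numbers).
example_loci = {
    "EST_A": [[0.5, 0.5], [0.5, 0.5]],  # identical frequencies
    "EST_B": [[1.0, 0.0], [0.0, 1.0]],  # fixed for different alleles
    "EST_C": [[0.6, 0.4], [0.4, 0.6]],  # mild divergence
}

fst_values = {name: locus_fst(freqs) for name, freqs in example_loci.items()}
candidates = flag_outliers(fst_values, threshold=0.5)  # hypothetical cutoff
```

Here only the locus fixed for different alleles in the two populations exceeds the cutoff. Loci flagged this way are candidates for divergent selection, not proof of it, which is why the abstract emphasizes agreement across several statistical methods and independent population samples.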


