32: Personal Genomes, Synthetic Genomes, Computing in C vs. Si - Biology


Synthetic biology, metaphors and responsibility

Metaphors are not just decorative rhetorical devices that make speech pretty. They are fundamental tools for thinking about the world and acting on the world. The language we use to make a better world matters: words matter, and metaphors matter. Words have consequences - ethical, social and legal ones, as well as political and economic ones. They need to be used ‘responsibly’. They also need to be studied carefully – this is what we want to do through this editorial and the related thematic collection. In the context of synthetic biology, natural and social scientists have become increasingly interested in metaphors, a wave of interest that we want to exploit and amplify. We want to build on emerging articles and books on synthetic biology, metaphors of life and the ethical and moral implications of such metaphors. This editorial provides a brief introduction to synthetic biology and responsible innovation, as well as a comprehensive review of literature on the social, cultural and ethical impacts of metaphor use in genomics and synthetic biology. Our aim is to stimulate an interdisciplinary and international discussion on the impact that metaphors can have on science, policy and publics in the context of synthetic biology.


Polyploidy provides new genetic raw material for evolutionary diversification, as gene duplication can lead to the evolution of new gene functions and regulatory networks 1 . Nevertheless, whole-genome duplication (WGD) is a relatively rare occurrence in animals when compared to fungi and plants 2 . Two rounds of ancient WGD occurred in the last common ancestor of the vertebrates, with additional rounds in some teleost fish lineages 2,3,4 . Fixation of these WGD events (i.e., ‘polyploidization’) is considered a major force in shaping the evolutionary success of vertebrate lineages, by facilitating fundamental changes in physiology and morphology, leading to the origin of new adaptations 5,6 . Among the invertebrates, horseshoe crabs 7,8,9 , spiders, and scorpions 10 represent the only sexually reproducing lineages that are known to have undergone WGD (Fig. 1a).

Fig. 1. a Schematic diagram illustrating the current knowledge of whole-genome duplication (WGD) in animals. ‘?R’ denotes unknown rounds of whole-genome duplication. b Pictures of the horseshoe crabs C. rotundicauda and T. tridentatus. c Repeat content for the two horseshoe crab genomes, C. rotundicauda and T. tridentatus: pie charts illustrating repeat content as a proportion of total genomic content, repeat content present in genic versus intergenic regions, and repeat landscape plots illustrating transposable element activity in each horseshoe crab genome. Source data for these figures can be found in Supplementary Data 8.

Horseshoe crabs are considered to be ‘living fossils’. The oldest actual fossils of horseshoe crabs date to the Ordovician period 450 million years ago (Mya) 11 , and remarkably, extant species have remained relatively unchanged morphologically since this extremely ancient date. However, despite their long history, there are only four extant species of horseshoe crabs worldwide: the Atlantic horseshoe crab (Limulus polyphemus) from the Atlantic East Coast of North America, and the mangrove horseshoe crab (Carcinoscorpius rotundicauda), the Indo-Pacific horseshoe crab (Tachypleus gigas), and the tri-spine horseshoe crab (Tachypleus tridentatus), from South and East Asia 12 . All extant horseshoe crabs are estimated to have diverged from a common ancestor that existed 135 Mya 13 , and they share an ancestral WGD 9 . A high-quality genome assembly was recently announced as a genomic resource for T. tridentatus 14,15 , leaving an exciting research opportunity to analyse the genomes of other horseshoe crab species to understand how WGD events reshape the genome and rewire genetic regulatory networks in invertebrates.

In the present study, we provide the first high-quality genome of the mangrove horseshoe crab (C. rotundicauda), and a resequenced genome of the tri-spine horseshoe crab (T. tridentatus). Importantly, we present evidence for the number of rounds of WGD that have occurred in these genomes, and investigate whether these represent a shared event with spiders. We also examine the evolutionary fate of genes and microRNAs at both the individual and population level in these genomes. Collectively, this study highlights the evolutionary consequences of a unique invertebrate WGD, while at the same time providing detailed genetic insights of utility for diverse genomic, biomedical, and conservation applications.


The rise of genomics and its impact on human health

Established in 1990, the Human Genome Project was one of the most expensive and collaborative ventures ever undertaken in science. Ten years since its completion, it has continued to provide a wealth of novel information, the implications of which are not yet fully understood [8]. The open-access nature of the project has stimulated scientists, as well as scientific companies, to develop better sequencing tools and accompanying analytical software. The ensuing innovations have helped to mark down the price of whole genome sequencing over the years, from nearly $3 billion at its inception to under $3,000, making it accessible to researchers from different biomedical disciplines [14].

Sequencing tools will play an important role in the development of personalized medicine. Some sequencing technologies are already used in clinics to test genetic conditions, diagnose complex diseases, or screen patient samples for rare variants. These tests allow health professionals to accurately diagnose a disease and prescribe appropriate medication specific to the patient [15, 16]. With the recent support of NIH grants in the US, neonatal sequencing is being explored to probe rare and complex disorders of newborn babies [17, 18]. There are technologies in development that allow non-invasive ways of sequencing a genome of an unborn child [19]. Personalized genome sequencing will transform the future of the healthcare landscape. However, the rise in the number of sequenced genomes is creating new problems. In particular, the way the genome analysis software works is through comparison of the obtained sequences with a reference. Because the human genome is different between different individuals, what is the reference sequence? What is the threshold to distinguish common from rare DNA variants?

Amid all these interesting implications of genome sequencing, the debate concerning the correct use of scientific terminology remains. Specifically, the terms “mutation” and “polymorphism”, and also “point mutation” versus “SNP”, can be independently used to describe the same event, namely a difference in the sequence as compared with a reference. From a strictly grammatical and etymological point of view, a mutation is an event (of mutating) and a polymorphism is a condition or quality (of being polymorphic), but these terms by extension quickly came to mean the resulting event or condition itself. In principle, a point DNA variant can be labeled as a mutation or SNP. Since no clear rules are available, the software tools currently used for genome sequencing make no assignment and label the difference simply as a DNA variant, blurring the distinction between the two categories.

“Mutation” and “polymorphism”: earlier definitions

The uniform and unequivocal description of sequence variants in human DNA and protein sequences (mutations, polymorphisms) was initiated by two papers published in 1993 [20, 21]. In this context, any rare change in the nucleotide sequence, usually but not always with a disease-causing attribute, is termed a “mutation” [22]. This change in the nucleotide sequence may or may not cause phenotypic changes. Mutations can be inherited from parents (germline mutations) or acquired over the life of an individual (somatic mutations), the latter being the principal driver of human diseases like cancer. Germline mutations occur in the gametes. Since the offspring is initially derived from the fusion of an egg and a sperm, germline mutations of parents may also be found in each nucleated cell of their progeny. Mutations usually arise from unrepaired DNA damage, replication errors, or mobile genetic elements. There are several major classes of DNA mutations. A point mutation occurs when a single nucleotide is added, deleted or substituted. Along with point mutations, the whole structure of a chromosome can be altered, with chromosomal regions being flipped, deleted, duplicated, or translocated [23]. Another kind of DNA mutation is defined as “copy number variation”. In this case, the expression of a gene is amplified (or reduced) through increased (decreased) copy number of a locus allele [24, 25].

A variation in the DNA sequence that occurs in a population with a frequency of 1 % or higher is termed a polymorphism [26]. The higher incidence in the population suggests that a polymorphism is naturally occurring, with either a neutral or beneficial effect. Polymorphisms can also be of one or more nucleotide changes, just like mutations. The SNP exemplifies the commonest polymorphism, thought to arise every 1,000 base pairs in the human genome, and is usually found in areas flanking protein-coding genes [27] – regions now recognized as critical for microRNA binding and regulation of gene/protein expression [28]. However, SNPs can also occur in coding sequences, introns, or in intergenic regions [27]. SNPs are used as genetic signatures in populations to study the predisposition to certain traits, including diseases [29].

The anatomy of the problem

In the era of advanced DNA sequencing tools and personal genomics, these earlier definitions of mutation and polymorphism are antiquated. Before multiple parallel sequencing was developed, it was impossible to sequence the genome of the same patient multiple times. For this reason, it was necessary at that time to use a reference sequence derived from the assembly of multiple genomes. In the preparation of the consensus sequence, an arbitrary threshold of 1 % was established to distinguish common (polymorphism) from rare (mutation) variants [26].

The 1 % or higher frequency associated with a polymorphism is an arbitrary number [30] recommended by scientists prior to the era of Next Gen Sequencing. The threshold being arbitrary, redefining the population itself may affect the classification, with rare variants becoming polymorphisms or polymorphisms becoming rare variants according to the population analyzed. For decades, the use of this frequency to develop population models was preferred to the use of sequencing tools, which at that time were error-prone and labor-intensive. With the advent of new sequencing technologies and the subsequent sequencing of individuals, a very different picture of population dynamics has begun to emerge. Mutations that were thought to be rare in a population have been found to exceed the frequency threshold set at 1 % [31]. Even more surprising, there is a lack of association of some of these rare mutations with human diseases. When comparing populations separated by geographic and physical barriers, a disease-causing mutation in one population is found to be harmless in another, and vice versa [32].
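The population dependence of the 1 % rule can be made concrete with a small sketch. The Python below is illustrative only: the function name, the population labels, and the allele counts are hypothetical and not taken from any dataset; only the 1 % cut-off comes from the text.

```python
from typing import Dict, Tuple

POLYMORPHISM_THRESHOLD = 0.01  # the conventional 1 % cut-off described in the text


def classify_by_frequency(allele_count: int, chromosomes_sampled: int,
                          threshold: float = POLYMORPHISM_THRESHOLD) -> str:
    """Label a variant using only its allele frequency in one population."""
    if chromosomes_sampled <= 0:
        raise ValueError("chromosomes_sampled must be positive")
    frequency = allele_count / chromosomes_sampled
    return "polymorphism" if frequency >= threshold else "rare variant"


# Hypothetical counts (variant alleles, chromosomes sampled) for the SAME
# variant observed in two different populations.
observations: Dict[str, Tuple[int, int]] = {
    "population_A": (3, 2000),    # 0.15 % of sampled chromosomes
    "population_B": (120, 2000),  # 6 % of sampled chromosomes
}

labels = {name: classify_by_frequency(*counts)
          for name, counts in observations.items()}

for name, label in labels.items():
    print(name, "->", label)
```

The same variant receives two different labels purely because the sampled population changed, which is the weakness of a frequency-only definition that the paragraph describes.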

For instance, sickle-cell anemia is caused by a nucleotide change (SNP rs334) in a gene coding for the beta chain of the hemoglobin protein [33]. In fact, rs334 is classified as a SNP, since its minor allele frequency in the population is >1 %. The disease manifests in people who have two copies of the mutated gene (rs334(TT) genotype). Sickle cell anemia is usually rare (<1 %) in the populations of developed nations [34]. However, the heterozygous form of the gene (rs334(AT) genotype) is persistent in populations of Africa, India, and other developing nations, where malaria is endemic [33]. In these geographic locations, heterozygote carriers of rs334 have a survival advantage against the malaria pathogen, and therefore this beneficial mutation is passed through the offspring to succeeding generations [35–37]. Here, a rare variant, which in one population (developed nations) causes a severe disease in homozygosis, can persist in another population to confer a survival advantage as a polymorphism in heterozygosis [38]. Such exceptions are increasing and show the need to redefine the terms mutation and polymorphism. The distinction between mutation and polymorphism on the basis of their disease-causing capacity is further complicated. Although thought to be naturally occurring, recent research into SNPs has shown that they can be associated with diseases like diabetes and cancers. At least 40 SNPs have been shown to associate with type-2 diabetes alone [39]. In short, it is not possible to classify the functional role of variations according to frequency in the population or their capability to cause a disease.

Context of personal genomics

This debate on “mutation” and “polymorphism” needs urgent evaluation in the era of Next Gen Sequencing and precision medicine. Multiple international collaborative projects like ENCODE (Encyclopedia of DNA elements) and HapMap (Haplotype Map) have ensued to map all the genes, genetic variation, and regulatory elements of the genome, to find associations with human biology, personal traits, and diseases [40].

In this climate, commercial companies like Illumina and Roche are developing advanced and robust platforms that cater to the needs of both small and large research facilities. The increasing competition among these companies has resulted in many different technologies, which are now available to facilitate new insights into genomics [11]. Similarly, advanced genomic tools and analytical software have been developed that can function independently of the particular platform. Researchers using tools like CLC genomics, Next Gene and Geno Matrix can access and download sequencing datasets for their own streamlined research. The primary goal of such research is to look for subtle, complex, and dynamic sequence variations. The lack of consistent definitions and a uniform scientific language can hamper this upcoming field, where genomic platforms may formulate incorrect hypotheses and researchers may misinterpret data based on earlier definitions.

The problem is particularly important in the case of precision medicine and personalized treatments. For example, one of the main reasons to sequence the genome of a cancer is to identify unique genetic features of cancer cells, which may then be targeted with a personalized treatment [41]. Accordingly, it is necessary to classify the somatic mutations of the cancer cells and use such knowledge to exploit therapeutically all the differences between cancer and noncancerous cells. Therefore, in order to be treated with a targeted agent, a cancer patient needs to express the target originated by the specific mutation occurring in cancer cells. However, should a difference be misclassified, it becomes possible for a polymorphism (present in all the cells of the patient) to be taken as a somatic mutation. The result could be a toxic effect, since the targeted treatment will impact both cancer and noncancerous cells carrying the same genetic variant. This problem can be prevented if both the germline and the somatic cancer genomes are sequenced in the same patient.

Another important reason underlying the need for such a distinction is that a disease may originate with two subsequent mutations according to the two-hit hypothesis [42]. Within a population, a germline mutation (first hit) may predispose a subset of patients to a second, somatic, mutation whose effects will create the diseased phenotype [43]. In this context, in order to identify populations at risk it would be extremely helpful to distinguish between somatic and germline mutations. For example, multiple meningiomas occur in <10 % of meningioma patients. A first germline mutation in the SMARCB1 gene will predispose to meningioma, but this will occur only when a somatic mutation in the NF2 gene intervenes [44]. In the absence of a clear distinction between somatic and germline variants this kind of pathogenic discovery may be impossible.

This approach is now supported by a recent study. Jones et al. evaluated 815 tumor-normal paired samples coming from 15 different tumor types [45] using Next Gen Sequencing. Library preparation was performed with two methods, whole exome preparation and targeted amplification, for 111 genes. Analyses were then conducted either as if only the cancer tissue was sequenced (reference human genome assembly GRCh37-lite) or taking as reference the germline DNA of the same patient. With the first analysis, the authors reported a very high rate of false-positive variants (31 % and 65 % in exome and targeted libraries, respectively). Furthermore, they identified germline mutations in 3 % of the cancers, even though they came from a cohort without family history (sporadic cancer). Now that the new sequencing technologies have dramatically reduced the cost of sequencing, precision medicine and personal genomics require that the reference of the DNA sequencing project should be obtained from the germline DNA of the same patient.
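A minimal sketch of the matched-normal comparison described above, assuming variants have already been called for the tumor and for the germline (normal) sample of the same patient and can be compared by position and alleles. The `Variant` type, the function name, and all coordinates are hypothetical; this is not the method of Jones et al., and real pipelines must also handle sequencing error, coverage, and tumor purity, which are ignored here.

```python
from typing import NamedTuple, Set, Tuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def split_somatic_germline(
    tumor: Set[Variant], normal: Set[Variant]
) -> Tuple[Set[Variant], Set[Variant]]:
    """Separate tumor variants into tumor-only (somatic) and shared (germline)."""
    somatic = tumor - normal    # absent from the patient's own germline
    germline = tumor & normal   # present in every cell of the patient
    return somatic, germline


# Hypothetical calls, both made against a generic reference genome.
normal_calls = {
    Variant("chr1", 1000, "A", "G"),
    Variant("chr2", 5000, "C", "T"),
}
tumor_calls = normal_calls | {Variant("chr7", 42000, "G", "A")}

somatic, germline = split_somatic_germline(tumor_calls, normal_calls)
print("somatic:", sorted(somatic))
print("germline:", sorted(germline))
```

With only the tumor sequenced, all three calls would look like candidate cancer mutations; subtracting the patient's own germline leaves the single tumor-specific variant.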

Ongoing debate and HGVS (Human Genome Variation Society) recommendations

The ongoing debate among scientists to resolve the nomenclature mutation and polymorphism is a step in the right direction. The HGVS, an alliance of 600 members from 34 countries, incorporates discussion and recommendations to establish consensus definitions and descriptions of generic terms that are accepted worldwide. Since the early 1990s, the HGVS has been instrumental in its push to standardize the mutation nomenclature. The recommendations of the HGVS have been based on extensive discussions among scientists over the years.

The papers published on this topic over the last 20 years show that HGVS was visionary in recommending new changes and extensions based on discoveries of relatively complex variants. In 2002, several researchers tried to address this nomenclature problem and the challenges of making more inclusive definitions. A special article by Condit et al. found that mutation had become increasingly negative in connotation since its use in the biological sciences, but particularly over the course of the 20th century [22]. This negativity of the term became entrenched with radiation experiments and the use of atomic weapons during the Second World War, and later with science fiction books and movies. The paper suggested that a better term like “variation” or “alteration” might be useful, but its inconsistent usage in the scientific world makes it problematic.

More recently, additional papers have highlighted the urgency of a “consensus” guiding the selection of the sequencing methods (data collection) and reporting. These studies point out that the accurate classification of pathogenic variants requires a standardized approach and the building of data repositories including all these data [46]. In this context, Richards et al., on behalf of the American College of Medical Genetics and Genomics (ACMG), have noted that the terms “mutation” and “polymorphism” often lead to confusion because of incorrect assumptions of pathogenic and benign effects, respectively. Thus, they recommended that both terms be replaced by the term “variant” with the following modifiers: (i) pathogenic, (ii) likely pathogenic, (iii) uncertain significance, (iv) likely benign, or (v) benign [47].

Personal genomics is critical to advancing our ability to treat and preemptively diagnose genetic diseases. However, despite the promise of personalized medicine, it remains held back, in large part, by some significant computational problems. These span everything from storage to compute to code, all of which were on the table at the National Center for Supercomputing Applications’ (NCSA) Private Sector Program Annual Meeting.

During the event, Dr. Victor Jongeneel, Senior Research Scientist at NCSA and the Institute for Genomic Biology at the University of Illinois, detailed some of the bottlenecks and potential solutions that keep expectations for personal genomics grounded.

In the case of personal genomics, the problem is not the scientific understanding of the genome itself; it’s how to reconstruct, compare and make sense of the massive data from sequencers. He claims that the disruptive part of this technology as a whole is rooted in our ability to actually acquire the data. According to Jongeneel, the amount of DNA sequence data generated last year was more than what had been generated over the entire history of sequencing before that.

Personal genomics is anything but a reality right now, Jongeneel says. He notes that the new services that offer to sequence your genome for a few hundred dollars are far from a complete service. These simply take DNA from a saliva kit, probe for a certain number of positions in the genome that are known to be variable, and then try to deduce personal characteristics from that information. He claims that this is not personal genomics because in such a case, all you’re examining are known differences between individuals in the population, not your own genome. Besides, a genuine look at one’s personal genome is far more computationally intensive and would cost far more than a measly few hundred dollars.

To realize true personal genomics, all differences between individuals need to be analyzed. Jongeneel explained that we are moving toward this more comprehensive genomic sampling via well-funded projects like the 1000 Genomes Initiative, which aims to allow the generation of all necessary data for $1000. He says this soon will be possible but again the computational bottlenecks are the main limitation.

Jongeneel cites three of the main technology vendors that are providing next-generation sequencing and says that while their approaches differ, on average, for a sequenced genome they’re running for 8 days for 200 gigabases’ worth of information. This translates into well over one terabyte per human genome.

In the case of human genomes, sequences are the result of several hundred million (or even a billion) reads, a number that depends on the technology vendor. From there, researchers need to determine where the reads come from in the genome relative to common reference genomes. This “simple” alignment process, whereby the individual genome is compared via alignment with the reference genome, is incredibly demanding computationally, as is the next step, where one must interpret this alignment to document individual differences and to make sure there is consistency.
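To give a feel for what placing a read on a reference involves, here is a toy seed-and-verify sketch. It is an assumption-laden illustration, not any vendor's or production aligner's method: the function names, the tiny reference string, the seed length `k`, and the mismatch limit are all hypothetical, and it handles substitutions only (no insertions or deletions).

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read: str, reference: str, index: Dict[str, List[int]],
               k: int, max_mismatches: int = 2) -> Optional[Tuple[int, List[int]]]:
    """Return (start position, mismatching read offsets) of the best placement,
    or None if no placement has at most max_mismatches substitutions."""
    best: Optional[Tuple[int, List[int]]] = None
    for offset in range(len(read) - k + 1):       # use every k-mer of the read as a seed
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset                  # implied start of the whole read
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = [j for j, (a, b) in enumerate(zip(read, window)) if a != b]
            if len(mismatches) <= max_mismatches and (
                    best is None or len(mismatches) < len(best[1])):
                best = (start, mismatches)
    return best


reference = "ACGTTGCATGCCGATAGCTAGGCTA"           # hypothetical 25-base reference
k = 4
index = build_kmer_index(reference, k)

# A read copied from positions 5..16 with one substitution (an individual difference).
read = reference[5:11] + "T" + reference[12:17]
placement = align_read(read, reference, index, k)
print(placement)
```

Even this toy version does work proportional to the number of seeds times the number of index hits per read; repeating that for hundreds of millions of reads against a multi-gigabase reference is what makes the real step so expensive.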

Jongeneel says that this alignment step typically takes several days just to process a single sample as it is aligned to the reference genome. To further complicate the process, we all have pieces of DNA that aren’t necessarily found in the DNA of others. While these are small differences, he says they can make a very big difference. Analysis of these unique pieces requires a complete piecing together of individual reads to allow researchers to see what the larger structure of the genome might look like. And it gets even more demanding.

Rebuilding genomes requires the construction of highly complex graphs, which itself is a strain on computational resources. This is even more demanding when one must disambiguate the graph to make sense of it in terms of an actual genome sequence. After all, the pieces of sequence rolling off the machines are on the order of 75 to 100 nucleotides long, and you’re trying to reconstitute genomes that are millions or billions of nucleotides long. This is the scientific equivalent of fitting a cell-sized piece into a massive tabletop puzzle.

More concretely than the puzzle image, consider this: Jongeneel says that if you wanted to reconstruct an entire genome from this kind of information, you’re talking about the construction of a graph that would likely have over 3 billion nodes and in excess of 10 billion edges. This is, of course, assuming there are no errors in your data which, he apologizes, there probably are. For an algorithm running on a medium-sized cluster, doing the assembly properly takes several weeks for each genome.
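The talk does not name the graph formulation. One common choice, and the one used by de Bruijn graph assemblers such as ABySS, can be sketched in a few lines. The code below is a toy under stated assumptions: error-free reads, a tiny hypothetical sequence, hypothetical function names, and a walk that simply stops at the first branch instead of disambiguating it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Nodes are (k-1)-mers; every distinct k-mer in the reads adds one edge
    from its first k-1 bases to its last k-1 bases."""
    edges: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in edges.items()}


def walk_unambiguous(graph: Dict[str, List[str]], start: str) -> str:
    """Extend a contig while the current node has exactly one successor.
    Stops at a branch, a dead end, or a node already visited."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in seen:
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Three overlapping, error-free reads from a hypothetical 10-base sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

Here every node has a single successor, so the walk recovers the sequence directly. Repeats and sequencing errors create branches, and resolving those branches across billions of nodes is the disambiguation cost described above.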

Jongeneel says that this is the kind of bottleneck that prevents some interesting genomic projects from taking off. For instance, there is currently an effort to sequence the entire range of DNA for several hundred common vertebrates. However, storing that information and spending several weeks for each individual species makes that out of reach, for now at least. He says that there is hope on the horizon, but it is going to take a rethinking of code and computing.

He says that the problem lies, in large part, in the software itself. His team ran a test on the widely-used genome assembler ABySS, which has broad appeal since it uses MPI and can leverage a much-needed cluster environment. They undertook assembly for a modest-sized genome of a yeast and noted that it was clear, based on wall clock and memory requirements, that this was not a scalable code.

He says this hints at a much deeper problem: many of those developing genomics software aren’t professional developers. Even though they integrate some complex algorithmic ideas, the code they write “isn’t up to the standards of the HPC community.”

He commented on this further, saying that what is needed most is a highly parallel genome assembler. He pointed to some progress in the arena from a group at Iowa State but says that unfortunately, “their software is not in the public domain so it isn’t available, we can’t test it and it’s not in the community.”

A representative from Microsoft in the audience asked Jongeneel what the solution to this problem might be: simply more parallel programmers, better tools or languages for developing this software, or some other new type of scalable solution. Jongeneel responded that most of the code being produced is research grade, and that the technology moves so quickly that it renders “new” code obsolete in very little time. He says that commercial attempts have failed for the same reason: as soon as they’ve produced a viable, scalable solution, they’ve been left behind by the swift movement toward new solutions.

Jongeneel said that if you think about personal genomics, if we even wanted to move toward the goal of one million people, we’re going to hit the exabyte range in no time. He feels that, in addition, these datasets need to be analyzed using workflows with multiple complex steps, and thus we require a fundamental rethinking of compute architectures that can enable this kind of research.
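The exabyte figure follows from the per-genome estimate quoted earlier in the talk. A back-of-the-envelope check, assuming decimal units and taking "well over one terabyte per human genome" as a lower bound of exactly 1 TB (the function and constant names are illustrative):

```python
TERABYTE = 10**12  # bytes, decimal (SI) units assumed
EXABYTE = 10**18   # bytes


def total_storage_bytes(genomes: int, bytes_per_genome: int = TERABYTE) -> int:
    """Raw storage needed if every genome takes bytes_per_genome bytes."""
    return genomes * bytes_per_genome


total = total_storage_bytes(1_000_000)
print(total / EXABYTE, "exabytes")
```

One million genomes at one terabyte each is already one exabyte, before counting intermediate files from alignment and assembly.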

That aside, he says one side question is what we should do with the massive amount of raw data that is valuable for future research (and sometimes legally sticky to dispose of now anyway). With raw data in such vast volume, he says, the extraction of ‘relevant’ information is the problem. Jongeneel notes that data analytics and pattern discovery on large numbers of genomes will be required to produce meaningful results.


Here we present what is currently the oldest near-complete HIV genome, from 1966 in Kinshasa, DRC. This DRC66 sample is 10 y older than the previously earliest characterized full genome, an 01A1G strain that was isolated from blood in 1976, also in DRC, but which underwent cell culture passages before sequencing (38). There are only nine other HIV-1 genomes available from the prediscovery phase of AIDS (1978 to 1982), all subtype B from the United States (25). The oldest HIV-1 genomic fragments are derived from plasma and FFPE samples from 1959 and 1960, again both from Kinshasa, DRC (11, 12). While these provided undisputable evidence of the presence and major diversification of HIV-1 group M two decades before its discovery, the short sequences that were recovered do not allow complete characterization of the HIV-1 strain involved and contain only a fraction of the phylogenetic information that is present in complete genomes.

To achieve sequence coverage across the DRC66 archival genome, labor-intensive amplification of overlapping short fragments between 54 nt and 106 nt in a highly sensitive jackhammer PCR procedure proved necessary. In comparison, none of the >65 million reads of an Illumina MiSeq run without prior amplification on the same sample contained HIV-1 sequence data. The latter approach had provided a full genome at 3,000× coverage of an influenza A H1N1 strain in an FFPE sample from 1918, however (24). Perhaps the difference in success resulted from different storage conditions in a humid tropical versus a temperate region, as evidenced by the majority of our reads being derived from environmental organisms that could have invaded the sample during preparation or storage, or, more likely, from a comparatively low viral titer in the FFPE lymph node specimen.

Globally, more HIV-1 group M cases are caused by strains that belong to the subtype C clade than any other clade, largely because southern Africa holds the highest HIV-1 burden and subtype C predominates there (39). Estimated to have originated in southeastern DRC, phylodynamic analyses indicated subtype C strains have spread from there to southern Africa via connections between mining cities (13). At the LANL HIV sequence database, currently about 19% of HIV-1 sequences from DRC are classified as subtype C (mostly documented from partial gene sequences). The DRC66 sequence represents a sister lineage to the subtype C clade, and quite divergent: we estimate it shared a common ancestor with subtype C some 20 y before the time of the common ancestor of conventional subtype C. Parts of gag and pol from three recently described intersubtype recombinant genomes from Kinshasa and Mbuji-Mayi sampled in 2008 (17), and part of a partial pol sequence sampled in Sweden in 2000 (40), appear to be the only reported contemporary sequences that also belong to this lineage in part of their genomes, although we cannot be certain we did not miss any short sequence stretches of, e.g., complex recombinant forms that would also cluster with this clade. Villabona-Arenas et al. (17) and Rodgers et al. (19) describe additional so-called divergent C lineages sampled between 1997 and 2012 in DRC that are monophyletic with conventional C with respect to the DRC66 lineage, yet form distinct sister lineages to subtype C. Similarly, for most other HIV-1 subtypes, more divergent lineages can be found in DRC (in particular Kinshasa) and other central African countries than in other regions where the more restricted within-subtype diversity arose in a relatively short time after founder events. The DRC66 genome provides a unique insight into the subtype C-like diversity that would have been present in DRC in the 1960s. 
The fact that particular residues of the translated integrase protein of DRC66 are known to induce resistance to integrase inhibitor drugs, which were obviously developed long after DRC66 was sampled, highlights that the natural 1960s diversity already harbored some genetic basis for anti-HIV therapy failure.

We further investigated whether the phylogenetic information in the suite of HIV-1 genomes sampled across the past decades, almost all after the discovery of HIV-1, reliably captures HIV-1’s evolutionary rates over the longer time frame that includes HIV-1’s long prediscovery phase in humans. Few calibration points from direct biological observations are typically available to test such conclusions for real-world analyses, especially for such a medically important pathogen. Crucially, such ancient DNA calibration points can lead to dramatic changes in evolutionary histories once thought to be definitively established. For example, recently reported hepatitis B virus sequences from the Bronze Age and Neolithic suggested a 100-fold slower evolutionary rate for this double-stranded DNA virus than previously thought (41–43), and such data are prompting updates to evolutionary clock models to better accommodate time-dependent rate variation (10). Because it is impossible to completely rule out such biases without complete genomic information from an early evolutionary time point, we believe it is important to attempt to recover such information from surviving HIV-1 specimens.

Reassuringly, in the context of HIV-1 group M, we do not observe that an “ancient” HIV-1 genome significantly changes evolutionary inferences based on phylogenies built from more-recent genomes. Indeed, there is remarkably little difference in key estimates—including the overall age of the pandemic lineage of HIV—when this sequence is included in phylogenomic analyses. Given that it is more than 50 y older than currently circulating HIV-1 strains, this sequence provides direct evidence for the reliability of dating estimates over the last half-century of HIV-1 circulation. This stands in contrast to the disconnection between short-term rates observed in SIVs and the rates at which SIV strains evolve when averaged across centuries or millennia of evolution in natural populations of different primate species, where molecular clock dating theory has difficulties accommodating the rate differences (6).

Interestingly, our analysis highlights an often-overlooked source of uncertainty in evolutionary divergence dating based on any sample of genomes. The suite of HIV-1 genomes sampled from patients and available in public databases is inevitably a very limited subsample of the true diversity of HIV-1 group M. To investigate the degree of variation such an unavoidable sampling process induces, we subsampled the available GenBank sample of nonintersubtype recombinant HIV-1 group M genomes from Africa, only retaining a small set of genome samples before 1990 in each sample. While credible intervals of all dating and rate estimates overlapped substantially, the overall variation between subsamples was larger than that induced in each subsample when DRC66 was either included or excluded. Besides variation in the underlying evolutionary models used in different studies, usage of different HIV-1 genome dataset samples could also explain why our HIV-1 group M TMRCA estimates are somewhat older here than previously reported: 1920 (95% HPD 1909 to 1930) (13), 1930 (1911 to 1945) (44), 1932 (1905 to 1954) (15), 1920 (1902 to 1939) (14), and 1908 (1884 to 1924) (11). Across our five investigated subsamples, HIV-1 group M TMRCA confidence intervals ranged from 1881 to 1918. We did not further explore the sensitivity of TMRCA estimates to various evolutionary model specifications, though it has been shown for example that the choice of coalescent tree prior may influence TMRCA estimates of HIV-1 for Bayesian inferences (11, 45). While a skygrid coalescent model should be appropriate (46), a recent study that was also based on complete HIV-1 genomes but that used a combination of an exponential and logistic growth model as tree prior (47) estimated 1915 to 1925 as the HIV-1 group M TMRCA. 
Taken together, while most estimates of the origin of the pandemic lineage of HIV-1 indeed converge to around the turn of the 20th century, phylogenetic uncertainty, evolutionary model specifications, and natural variation among samples of HIV-1’s genomic diversity prevent narrowing down the age estimate to less than a few decades.
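To make the spread among the previously reported estimates concrete, the following minimal sketch takes the five point estimates and intervals quoted above and computes the range of years that all five intervals share, together with the gap between the youngest and oldest point estimates. The function names are illustrative, and a plain intersection of intervals is only a rough summary under the assumption that the intervals are comparable; it is not a formal combined estimate.

```python
# Previously reported HIV-1 group M TMRCA estimates as quoted in the text:
# (citation number, point estimate, lower bound, upper bound), calendar years.
REPORTED = [
    (13, 1920, 1909, 1930),
    (44, 1930, 1911, 1945),
    (15, 1932, 1905, 1954),
    (14, 1920, 1902, 1939),
    (11, 1908, 1884, 1924),
]


def common_overlap(intervals):
    """Return the (lower, upper) range shared by all intervals, or None."""
    lower = max(lo for lo, _ in intervals)
    upper = min(hi for _, hi in intervals)
    return (lower, upper) if lower <= upper else None


def point_estimate_spread(estimates):
    """Return the gap in years between the oldest and youngest estimate."""
    return max(estimates) - min(estimates)


intervals = [(lo, hi) for _, _, lo, hi in REPORTED]
shared = common_overlap(intervals)
spread = point_estimate_spread([point for _, point, _, _ in REPORTED])

print("years shared by all intervals:", shared)
print("spread of point estimates (years):", spread)
```

All five intervals share the years 1911 to 1924, while the point estimates differ by 24 years, which is in line with the statement that the age estimate cannot be narrowed to less than a few decades.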

In conclusion, using a highly sensitive amplification protocol for degraded archival samples, we here present the oldest HIV-1 near-complete genome available to date. While we are careful not to extrapolate to other pathogen–host systems and much deeper time scales evident in SIV, our study indicates that evolutionary rates calibrated from HIV-1 group M sequences sampled across the decades after its discovery can be used reliably to infer the timing of events that occurred during the prediscovery era. We note that in addition to evolutionary model specifications, the inherent stochasticity associated with a sample of the true viral diversity in nature inevitably introduces uncertainty to phylogenetic dating estimates, which is addressable by purposely subsampling datasets.
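The kind of purposeful subsampling mentioned above could be sketched as follows. This is a hypothetical illustration, not the procedure used in the study: the record layout, the function name `subsample`, the subsample sizes, and the placeholder records are all assumptions. The only elements taken from the text are the 1990 cutoff, the idea of retaining a small set of early genomes in each subsample, and the use of five subsamples.

```python
import random


def subsample(records, n_early, n_late, cutoff_year=1990, seed=0):
    """Draw one subsample with at most n_early genomes sampled before
    cutoff_year and at most n_late genomes sampled in or after it."""
    rng = random.Random(seed)  # seeded so each subsample is reproducible
    early = [r for r in records if r["year"] < cutoff_year]
    late = [r for r in records if r["year"] >= cutoff_year]
    picked = rng.sample(early, min(n_early, len(early)))
    picked += rng.sample(late, min(n_late, len(late)))
    return sorted(picked, key=lambda r: r["id"])


# Placeholder records (hypothetical IDs and years, not real GenBank entries).
records = [{"id": f"g{i:03d}", "year": 1980 + i % 40} for i in range(200)]

# Five subsamples, each built with a different seed.
subsamples = [subsample(records, n_early=5, n_late=40, seed=s) for s in range(5)]

for s, sub in enumerate(subsamples):
    n_early = sum(r["year"] < 1990 for r in sub)
    print(f"subsample {s}: {len(sub)} genomes, {n_early} sampled before 1990")
```

Each subsample would then be analyzed separately, once with and once without the archival genome, so that variation between subsamples can be compared with the shift caused by including that genome.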


We thank R. Schlapbach and L. Poveda from Zürich Functional Genomics Center (ZFGC) for sequencing support; B. Maier and members from ScopeM for electron microscopy support; S. Nath from the Joint Genome Institute (JGI) for DNA synthesis and sequencing support; F. Rudolf for assistance with yeast marker design; H. Christen for conception of computational algorithms; and Samuel I. Miller, Markus Aebi, and Uwe Sauer for critical comments. This work received institutional support from Community Science Program (CSP) DNA Synthesis Award Grants JGI CSP-1593 (to M.C. and B.C.) and CSP-2840 (to M.C. and B.C.) from the US Department of Energy Joint Genome Institute, Swiss Federal Institute of Technology (ETH) Zürich ETH Research Grant ETH-08 16-1 (to B.C.), and Swiss National Science Foundation Grant 31003A_166476 (to B.C.). The work conducted by the US Department of Energy Joint Genome Institute, a Department of Energy Office of Science User Facility, is supported by Office of Science of the US Department of Energy Contract DE-AC02-05CH11231.

Ethics declarations

Competing interests

Gad Getz receives research funds from IBM and Pharmacyclics and is an inventor on patent applications related to MuTect, ABSOLUTE, MutSig, MSMuTect, MSMutSig and POLYSOLVER. Hikmat Al-Ahmadie is consultant for AstraZeneca and Bristol-Myers Squibb. Samuel Aparicio is a founder and shareholder of Contextual Genomics. Pratiti Bandopadhayay receives grant funding from Novartis for an unrelated project. Rameen Beroukhim owns equity in Ampressa Therapeutics. Andrew Biankin receives grant funding from Celgene, AstraZeneca and is a consultant for or on advisory boards of AstraZeneca, Celgene, Elstar Therapeutics, Clovis Oncology and Roche. Ewan Birney is a consultant for Oxford Nanopore, Dovetail and GSK. Marcus Bosenberg is a consultant for Eli Lilly. Atul Butte is a cofounder of and consultant for Personalis, NuMedii, a consultant for Samsung, Geisinger Health, Mango Tree Corporation, Regenstrief Institute and in the recent past a consultant for 10x Genomics and Helix, a shareholder in Personalis, a minor shareholder in Apple, Twitter, Facebook, Google, Microsoft, Sarepta, 10x Genomics, Amazon, Biogen, CVS, Illumina, Snap and Sutro and has received honoraria and travel reimbursement for invited talks from Genentech, Roche, Pfizer, Optum, AbbVie and many academic institutions and health systems. Carlos Caldas has served on the Scientific Advisory Board of Illumina. Lorraine Chantrill acted on an advisory board for AMGEN Australia in the past 2 years. Andrew D. Cherniack receives research funding from Bayer. Helen Davies is an inventor on a number of patent applications that encompass the use of mutational signatures. Francisco De La Vega was employed at Annai Systems during part of the project. Ronny Drapkin serves on the scientific advisory board of Repare Therapeutics and Siamab Therapeutics. 
Rosalind Eeles has received an honorarium for the GU-ASCO meeting in San Francisco in January 2016 as a speaker, an honorarium and support from Janssen for the RMH FR meeting in November 2017 as a speaker (title: genetics and prostate cancer), an honorarium for a University of Chicago invited talk in May 2018 as a speaker and an educational honorarium paid by Bayer & Ipsen to attend GU Connect ‘Treatment sequencing for mCRPC patients within the changing landscape of mHSPC’ at a venue at ESMO, Barcelona, on 28 September 2019. Paul Flicek is a member of the scientific advisory boards of Fabric Genomics and Eagle Genomics. Ronald Ghossein is a consultant for Veracyte. Dominik Glodzik is an inventor on a number of patent applications that encompass the use of mutational signatures. Eoghan Harrington is a full-time employee of Oxford Nanopore Technologies and is a stock holder. Yann Joly is responsible for the Data Access Compliance Office (DACO) of ICGC 2009-2018. Sissel Juul is a full-time employee of Oxford Nanopore Technologies and is a stock holder. Vincent Khoo has received personal fees and non-financial support from Accuray, Astellas, Bayer, Boston Scientific and Janssen. Stian Knappskog is a coprincipal investigator on a clinical trial that receives research funding from AstraZeneca and Pfizer. Ignaty Leshchiner is a consultant for PACT Pharma. Carlos López-Otín has ownership interest (including stock and patents) in DREAMgenics. Matthew Meyerson is a scientific advisory board chair of, and consultant for, OrigiMed, has obtained research funding from Bayer and Ono Pharma and receives patent royalties from LabCorp. Serena Nik-Zainal is an inventor on a number of patent applications that encompass the use of mutational signatures. Nathan Pennell has done consulting work with Merck, AstraZeneca, Eli Lilly and Bristol-Myers Squibb. Xose S. Puente has ownership interest (including stock and patents) in DREAMgenics. Benjamin J. 
Raphael is a consultant for and has ownership interest (including stock and patents) in Medley Genomics. Jorge Reis-Filho is a consultant for Goldman Sachs and REPARE Therapeutics, member of the scientific advisory board of Volition RX and Paige.AI and an ad hoc member of the scientific advisory board of Ventana Medical Systems, Roche Tissue Diagnostics, InVicro, Roche, Genentech and Novartis. Lewis R. Roberts has received grant support from ARIAD Pharmaceuticals, Bayer, BTG International, Exact Sciences, Gilead Sciences, Glycotest, RedHill Biopharma, Target PharmaSolutions and Wako Diagnostics and has provided advisory services to Bayer, Exact Sciences, Gilead Sciences, GRAIL, QED Therapeutics and TAVEC Pharmaceuticals. Richard A. Scolyer has received fees for professional services from Merck Sharp & Dohme, GlaxoSmithKline Australia, Bristol-Myers Squibb, Dermpedia, Novartis Pharmaceuticals Australia, Myriad, NeraCare GmbH and Amgen. Tal Shmaya is employed at Annai Systems. Reiner Siebert has received speaker honoraria from Roche and AstraZeneca. Sabina Signoretti is a consultant for Bristol-Myers Squibb, AstraZeneca, Merck, AACR and NCI and has received funding from Bristol-Myers Squibb, AstraZeneca, Exelixis and royalties from Biogenex. Jared Simpson has received research funding and travel support from Oxford Nanopore Technologies. Anil K. Sood is a consultant for Merck and Kiyatec, has received research funding from M-Trap and is a shareholder in BioPath. Simon Tavaré is on the scientific advisory board of Ipsen and a consultant for Kallyope. John F. Thompson has received honoraria and travel support for attending advisory board meetings of GlaxoSmithKline and Provectus and has received honoraria for participation in advisory boards for MSD Australia and BMS Australia. Daniel Turner is a full-time employee of Oxford Nanopore Technologies and is a stock holder. 
Naveen Vasudev has received speaker honoraria and/or consultancy fees from Bristol-Myers Squibb, Pfizer, EUSA pharma, MSD and Novartis. Jeremiah A. Wala is a consultant for Nference. Daniel J. Weisenberger is a consultant for Zymo Research. Dai-Ying Wu is employed at Annai Systems. Cheng-Zhong Zhang is a cofounder and equity holder of Pillar Biosciences, a for-profit company that specializes in the development of targeted sequencing assays. The other authors declare no competing interests.
