analyzed in an experiment (e.g., the expression of each of 22,000 mRNA transcripts) exceeds the number of samples (e.g., 50 patients with a particular type of lymphoma), there is potential for finding patterns in the data simply by chance. The more features analyzed and the fewer the number of samples, the more likely such a phenomenon is to be encountered. For this reason, the use of nominal p-values to estimate the statistical significance of an observation is generally discouraged. Rather, some approach to correcting for multiple hypothesis testing is in order. (In the present example, 22,000 hypotheses are effectively being tested.) In the absence of such penalization, the significance of observations is likely to be grossly overestimated. Indeed, such misinterpretations of data were at the root of many of the early uses of gene expression profiling data in biomedical research.
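To make the scale of this problem concrete, the short Python sketch below simulates 22,000 null tests (no true signal anywhere) and contrasts a nominal p < 0.05 cutoff with the standard Benjamini-Hochberg false-discovery-rate correction. The setup and names are illustrative assumptions, not taken from the chapter.

    import numpy as np

    rng = np.random.default_rng(0)

    # 22,000 transcripts tested with NO true differential expression:
    # under the null hypothesis, p-values are uniform on [0, 1].
    n_tests = 22_000
    p_values = rng.uniform(0.0, 1.0, size=n_tests)

    # Nominal threshold: ~5% of null tests look "significant" by chance alone.
    print("nominal p < 0.05:", int((p_values < 0.05).sum()))   # ~1,100 false hits

    # Benjamini-Hochberg: find the largest k with p_(k) <= (k / n) * q,
    # which controls the false discovery rate at level q.
    q = 0.05
    sorted_p = np.sort(p_values)
    ranks = np.arange(1, n_tests + 1)
    passing = np.nonzero(sorted_p <= ranks / n_tests * q)[0]
    n_discoveries = int(passing[-1]) + 1 if passing.size else 0
    print("BH discoveries at FDR 0.05:", n_discoveries)        # almost always 0

On truly null data the nominal cutoff flags on the order of a thousand transcripts, whereas the corrected procedure flags essentially none; this is the penalization the text calls for.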
Related to the challenges with high-dimensional data just described, special consideration must be given to pattern-matching algorithms. With the availability of high-dimensional gene expression profiling data in the late 1990s came a flood of computational innovation from computer scientists looking to find biologically meaningful patterns amid biologic data. With that early wave of computational analysis came the realization that with often limited numbers of samples (compared with the number of features analyzed) comes the possibility of "overfitting" a computational model to a particular dataset—that is, defining a pattern (e.g., a spectrum of genes that are differentially expressed) that is correlated with a phenotype of interest (e.g., survival) in an initial dataset but then does not predict accurately when applied to an independent dataset. This failure to reproduce initial findings was variously attributed to technical defects in the genomic data itself, insufficiently complex algorithms, and the possibility that perhaps the most important features were not being analyzed in the first place (e.g., noncoding RNAs). In fact, however, nearly all of the early failures of pattern recognition algorithms to validate when applied to new datasets were attributable to overfitting of the models to an initial, small dataset. The solution to this problem is to ensure that discovery datasets are sufficiently large to avoid overfitting and to insist that, before any clinical or biologic claims are made, the model be tested on completely independent samples.
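This overfitting failure mode is easy to reproduce. In the illustrative Python sketch below (all sizes and names are assumptions made for the demonstration), a simple classifier is "discovered" on random expression data with far more features than samples: it looks excellent on the data used to build it and performs at chance on an independent dataset.

    import numpy as np

    rng = np.random.default_rng(1)

    # 50 "patients" x 22,000 "transcripts", with labels assigned at random:
    # there is no real signal, so honest accuracy should be ~50%.
    n_samples, n_features = 50, 22_000
    X = rng.normal(size=(n_samples, n_features))
    y = rng.integers(0, 2, size=n_samples)

    def fit_centroid_classifier(X_train, y_train, n_genes=20):
        """Select the n_genes with the largest class-mean difference on the
        training data, then classify new samples by nearest class centroid."""
        diff = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
        genes = np.argsort(np.abs(diff))[-n_genes:]
        c0 = X_train[y_train == 0][:, genes].mean(axis=0)
        c1 = X_train[y_train == 1][:, genes].mean(axis=0)
        def predict(X_new):
            d0 = ((X_new[:, genes] - c0) ** 2).sum(axis=1)
            d1 = ((X_new[:, genes] - c1) ** 2).sum(axis=1)
            return (d1 < d0).astype(int)
        return predict

    predict = fit_centroid_classifier(X, y)
    print("training accuracy:", (predict(X) == y).mean())   # deceptively high

    # A completely independent dataset from the same (null) distribution:
    X_new = rng.normal(size=(n_samples, n_features))
    y_new = rng.integers(0, 2, size=n_samples)
    print("independent accuracy:", (predict(X_new) == y_new).mean())  # ~0.5

Because the informative-looking genes were chosen using the same samples that then score the model, the training estimate is biased; only the independent dataset reveals that the "pattern" was noise.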
NEXT-GENERATION SEQUENCING TECHNOLOGY
Beginning around 2006, a number of new approaches to DNA sequencing burst onto the scene. These technical advances have transformed the field of genomics and will likely equally transform the diagnostics field in the years to come. A number of novel sequencing approaches have been commercialized, and their details are beyond the scope of this chapter. However, they differ fundamentally from traditional Sanger sequencing, which has been the mainstay for the past several decades. First, and most well recognized, is the dramatically lower cost of current sequencing methods compared with Sanger sequencing. Costs have dropped by nearly 1 million-fold compared with the sequencing of the first human genome. This drop in cost has transformed genome sequencing from the work of an entire community over the course of more than a decade (the initial sequencing of the human genome took 15 years and approximately $3 billion) to a routine experiment that, as of 2016, can be done for hundreds of samples by major sequencing centers in the course of a week. These exponential cost reductions have come about not through dramatic drops in reagent costs but rather through dramatic increases in data output. A single lane on a modern sequencer generates vastly more data than a lane of conventional sequencing. This is relevant because to realize the lower costs of contemporary sequencing, large-scale projects must be undertaken. That is, devoting a single lane of sequencing to the sequencing of a plasmid, for example, is more expensive with current technologies than with traditional Sanger sequencing; the cost savings are realized only when large data outputs are required (e.g., the sequencing of entire genomes or of isolated genes across large numbers of patients).
When executed and analyzed properly, next-generation sequencing technologies can yield nearly perfect fidelity of sequence. At the same time, the error rates for any given sequencing read can be as high as 1%, depending on the sequencing platform. How can these two statements both be correct? Although a 1% error rate (99% accurate) may seem low, when taken in the context of sequencing all 3 billion bases of the human genome, it would in principle result in 30 million errors! Thankfully, this is not the case, because most sequencing errors are idiosyncratic—that is, they are not a function of a particular DNA sequence. The consequence of this is that by simply resequencing the same region multiple times and taking the consensus read, such idiosyncratic errors are lost; it is highly unlikely for them to occur over and over again at the same spot.
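A quick simulation shows why consensus works. The Python sketch below assumes a toy reference and a uniform 1% per-base error rate; it is illustrative and not a model of any specific platform.

    import numpy as np

    rng = np.random.default_rng(2)

    # Encode bases A, C, G, T as the integers 0-3 for easy counting.
    n_positions = 100_000
    true_seq = rng.integers(0, 4, size=n_positions)

    def sequence_once(seq, error_rate=0.01):
        """One simulated read: each base is replaced by a random *different*
        base with probability error_rate (an idiosyncratic error that does
        not depend on the sequence itself, as described in the text)."""
        errors = rng.random(seq.size) < error_rate
        shift = rng.integers(1, 4, size=seq.size)   # one of the 3 wrong bases
        return np.where(errors, (seq + shift) % 4, seq)

    one_read = sequence_once(true_seq)
    print("single-read error rate:", (one_read != true_seq).mean())    # ~0.01

    # 30x coverage: per-position majority vote across 30 independent reads.
    reads = np.stack([sequence_once(true_seq) for _ in range(30)])
    counts = np.stack([(reads == b).sum(axis=0) for b in range(4)])
    consensus = counts.argmax(axis=0)
    print("30x consensus error rate:", (consensus != true_seq).mean())  # ~0.0

A wrong consensus call would require many independent reads to make the same error at the same position, so the residual error rate falls from about 10^-2 per read to essentially zero at 30x.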
For normal, diploid genomes, sequencing is typically done at least 30-fold over (referred to as 30× coverage). The consensus obtained by observing a given nucleotide 30 times is generally sufficient for rendering the correct read of that nucleotide. However, things get more complicated when dealing with (1) tumors containing gene copy number alterations (e.g., aneuploidy or regions of gene deletion or amplification) or (2) admixture of normal cells within the tumor sample. To compensate for the copy number variation and normal cell contamination seen in most samples, typical cancer genome sequencing projects aim for a depth of coverage of at least 100×. Sequencing for diagnostic purposes may require even greater depth of coverage. In addition, the analysis of samples containing only rare tumor cells (e.g., 10%) would require ultradeep sequencing; otherwise, tumor-specific mutations would likely be missed (false negatives). Importantly, the frequency of cancer-associated mutations in studies performed using traditional Sanger sequencing methods may have been underestimated because of the lack of power to detect mutations in tumors with significant normal cell contamination. Whereas Sanger sequencing delivers the average allele observed in a sample, next-generation sequencing methods deliver a distribution of observed alleles, allowing mutant alleles to be identified even if they represent a minority population.
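The arithmetic behind these depth recommendations can be sketched with a simple binomial model. Assume, purely for illustration, a sample that is 10% tumor cells carrying a heterozygous somatic mutation, giving an expected variant allele fraction of about 5%, and suppose the variant is called only when at least 4 supporting reads are observed; both numbers are assumptions of this sketch, not thresholds stated in the chapter.

    from math import comb

    def prob_at_least(k, n, p):
        """P(X >= k) for X ~ Binomial(n, p): the chance of observing at
        least k mutant-supporting reads among n reads at a position."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    vaf = 0.05       # expected variant allele fraction (10% tumor, heterozygous)
    min_reads = 4    # illustrative calling threshold

    for depth in (30, 100, 1000):
        power = prob_at_least(min_reads, depth, vaf)
        print(f"{depth:>4}x: P(>= {min_reads} mutant reads) = {power:.3f}")

Under these assumptions the mutation would be detected only about 6% of the time at 30×, roughly three-quarters of the time at 100×, and essentially always at 1000× (ultradeep), which is why rare-tumor-cell samples demand such depth.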
Future of Sequencing Technologies

No one could have predicted the dramatic advances that have come to DNA sequencing technologies over the past decade. Costs have dropped dramatically, and it is predicted that costs will continue to drop, although less precipitously. The cost of whole-genome sequencing was estimated to be $1000 in 2016; however, this cost can be achieved only when sequencing many samples from large cohorts of patients, not on an individual sample-by-sample basis. Also, in addition to sequence generation, there is a high cost associated with sequence analysis. The cost of storing and analyzing genome sequence data may exceed the cost of generating the data in the first place, and a detailed analysis is far from straightforward. Nevertheless, it is likely that, over the decade ahead, genome sequencing will become a routine component of both clinical research and routine clinical care.
DNA-LEVEL CHARACTERIZATION

Somatic Versus Germline Events

It is important to recognize the fundamental difference between germline variants and somatic variants in genome sequences. Germline variants are present in all cells of the body (with the exception of rare mosaicism), and these variants can contribute to the risk of future disease. Germline variants can be common (i.e., seen in ≥5% of the human population), or they can be rare (in principle, unique to a single individual). Each individual also carries de novo variants that are present in neither of the individual's parents' genomes.
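In practice this distinction is operationalized by sequencing a tumor sample alongside a matched normal sample (e.g., skin or saliva) from the same patient. The toy Python sketch below uses made-up variant tuples purely to illustrate the set logic.

    # Hypothetical variant calls as (chromosome, position, alt allele) tuples;
    # the coordinates and alleles are invented for illustration.
    tumor_variants = {("chr9", 5_000_000, "T"), ("chr1", 1_234_567, "A")}
    normal_variants = {("chr1", 1_234_567, "A")}

    # Variants shared with the normal sample were inherited (germline);
    # variants seen only in the tumor were acquired (somatic).
    germline = tumor_variants & normal_variants
    somatic = tumor_variants - normal_variants

    print("germline:", germline)   # {('chr1', 1234567, 'A')}
    print("somatic:", somatic)     # {('chr9', 5000000, 'T')}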
It has been demonstrated recently by genetic analyses of large populations that aging individuals without evidence of hematologic disease acquire mutations over time in genes that are associated with leukemia. This observation has been named clonal hematopoiesis of indeterminate potential. This indicates that expansion of particular clones occurs that is associated with an increased risk of myeloid and lymphoid

