analyzed in an experiment (e.g., the expression of each of 22,000 mRNA transcripts) exceeds the number of samples (e.g., 50 patients with a particular type of lymphoma), there is potential for finding patterns in the data simply by chance. The more features analyzed and the fewer the number of samples, the more likely such a phenomenon is to be encountered. For this reason, the use of nominal p-values to estimate the statistical significance of an observation is generally discouraged. Rather, some approach to correcting for multiple hypothesis testing is in order. (In the present example, 22,000 hypotheses are effectively being tested.) In the absence of such penalization, the significance of observations is likely to be grossly overestimated. Indeed, such misinterpretations of data were at the root of many of the early uses of gene expression profiling data in biomedical research.
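As an illustration of the multiple-testing correction described above, the following sketch applies the Benjamini-Hochberg false discovery rate procedure to simulated per-transcript p-values. The simulated data, function name, and 5% FDR threshold are illustrative assumptions, not part of any analysis described in this chapter.

```python
# Illustrative sketch: correcting per-gene p-values for multiple testing using
# the Benjamini-Hochberg false discovery rate (FDR) procedure. The 22,000
# p-values are simulated (uniform noise) purely for demonstration.
import numpy as np

rng = np.random.default_rng(0)
p_values = rng.uniform(size=22_000)        # one nominal p-value per transcript

def benjamini_hochberg(p, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    m = len(p)
    order = np.argsort(p)                  # indices that sort p ascending
    ranked = p[order]
    # largest rank k with p_(k) <= (k / m) * alpha
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()     # 0-based index of that largest rank
        rejected[order[: k + 1]] = True
    return rejected

significant = benjamini_hochberg(p_values)
print(f"nominal p < 0.05: {(p_values < 0.05).sum()} transcripts")
print(f"FDR < 0.05:       {significant.sum()} transcripts")
# With pure noise, roughly 5% (about 1,100) of transcripts pass the nominal
# threshold by chance, whereas essentially none survive the correction.
```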
Related to the challenges with high-dimensional data described above, special consideration must be given to the design of pattern-matching algorithms. With the availability of high-dimensional gene expression profiling data in the late 1990s came a flood of computational innovation from computer scientists looking to find biologically meaningful patterns amid biologic data. With that early wave of computational analysis came the realization that with often limited numbers of samples (compared with the number of features analyzed) comes the possibility of “overfitting” a computational model to a particular dataset; that is, defining a pattern (e.g., a spectrum of genes that are differentially expressed) that is correlated with a phenotype of interest (e.g., survival) in an initial dataset but then does not predict accurately when applied to an independent dataset. This failure to reproduce initial findings was variously attributed to technical defects in the genomic data itself, insufficiently complex algorithms, and the possibility that perhaps the most important features were not being analyzed in the first place (e.g., noncoding RNAs). In fact, however, nearly all of the early failures of pattern recognition algorithms to validate when applied to new datasets were attributable to overfitting of the models to an initial, small dataset. The solution to this problem is to ensure that discovery datasets are sufficiently large to avoid overfitting and to insist that, before any clinical or biologic claims are made, the model be tested on completely independent samples.
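The following sketch, built entirely on simulated noise, illustrates the overfitting phenomenon described above: a gene "signature" selected on a small discovery set appears to classify that set almost perfectly yet performs no better than chance on an independent set. The sample sizes, signature-selection rule, and helper functions are hypothetical choices made purely for illustration.

```python
# Illustrative sketch of overfitting with far more features than samples: a gene
# "signature" selected on a small discovery set of pure noise classifies that set
# well, yet performs at chance on an independent set. All data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 22_000, 50

def simulate(n):
    expr = rng.normal(size=(n, n_genes))       # expression matrix (samples x genes)
    labels = rng.integers(0, 2, size=n)        # phenotype of interest (e.g., survival)
    return expr, labels

def fit_signature(expr, labels, k=20):
    """Pick the k genes whose class means differ most in this dataset."""
    diff = expr[labels == 1].mean(axis=0) - expr[labels == 0].mean(axis=0)
    genes = np.argsort(np.abs(diff))[-k:]
    return genes, np.sign(diff[genes])

def predict(expr, genes, weights):
    scores = expr[:, genes] @ weights
    return (scores > np.median(scores)).astype(int)

disc_x, disc_y = simulate(n_samples)           # discovery dataset
indep_x, indep_y = simulate(n_samples)         # independent validation dataset

genes, weights = fit_signature(disc_x, disc_y)
print("accuracy on discovery set:  ", (predict(disc_x, genes, weights) == disc_y).mean())
print("accuracy on independent set:", (predict(indep_x, genes, weights) == indep_y).mean())
# Typical result: near-perfect on the discovery set, ~0.5 (chance) on the
# independent set; the "signature" was fit to noise.
```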
NEXT-GENERATION SEQUENCING TECHNOLOGY

Beginning around 2006, a number of new approaches to DNA sequencing burst onto the scene. These technical advances have transformed the field of genomics and will likely equally transform the diagnostics field in the years to come. A number of novel sequencing approaches have been commercialized, and their details are beyond the scope of this chapter. However, they differ fundamentally from traditional Sanger sequencing, which has been the mainstay for the past several decades. First, and most well recognized, is the dramatically lower cost of current sequencing methods compared with Sanger sequencing. Costs have dropped by nearly 1 million-fold compared with the sequencing of the first human genome. This drop in cost has transformed genome sequencing from the work of an entire community over the course of more than a decade (the initial sequencing of the human genome took 15 years and approximately $3 billion) to a routine experiment that can be done for hundreds of samples by major sequencing centers in the course of a week in 2016. These exponential cost reductions have come about not through dramatic drops in reagent costs but rather through dramatic increases in data output. A single lane on a modern sequencer generates vastly more data than a lane of conventional sequencing. This is relevant because to realize the lower costs of contemporary sequencing, large-scale projects must be undertaken. That is, devoting a single lane of sequencing to the sequencing of a plasmid, for example, is more expensive with current technologies than with traditional Sanger sequencing; the cost savings are realized only when large data outputs are required (e.g., the sequencing of entire genomes or of isolated genes across large numbers of patients).

When executed and analyzed properly, next-generation sequencing technologies can yield nearly perfect fidelity of sequence. At the same time, the error rates for any given sequencing read can be as high as 1%, depending on the sequencing platform. How can these two statements both be correct? Although a 1% error rate (99% accurate) may seem low, when taken in the context of sequencing all 3 billion bases of the human genome, that would in principle result in 30 million errors! Thankfully, this is not the case, because most sequencing errors are idiosyncratic; that is, they are not a function of a particular DNA sequence. The consequence of this is that by simply resequencing the same region multiple times and taking the consensus read, such idiosyncratic errors are lost; it is highly unlikely for them to occur over and over again at the same spot.
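A rough, back-of-the-envelope calculation under a simple binomial model makes the consensus argument concrete: if errors strike reads independently, the chance that a majority of overlapping reads carry the same wrong base at the same position is vanishingly small. The depth, error rate, and majority rule used below are illustrative assumptions.

```python
# Back-of-the-envelope check of the consensus argument above: if per-base read
# errors occur independently at ~1% and an error can be any of 3 alternative
# bases, the chance that a majority of reads at one position show the same
# wrong base is vanishingly small. The numbers here are illustrative assumptions.
from math import comb

def prob_consensus_error(depth=30, error_rate=0.01):
    """P(more than half of `depth` reads show one particular wrong base)."""
    p = error_rate / 3                     # error rate toward one specific wrong base
    need = depth // 2 + 1                  # majority threshold
    return sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
               for k in range(need, depth + 1))

per_site = prob_consensus_error()
genome_wide = per_site * 3e9               # expected such sites across ~3 billion bases
print(f"per-site probability of a consensus error: {per_site:.2e}")
print(f"expected consensus errors genome-wide:     {genome_wide:.2e}")
# Compare these figures with the ~30 million raw single-read errors noted above.
```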
For normal, diploid genomes, sequencing is typically done at least 30-fold over (referred to as 30× coverage). The consensus obtained by observing a given nucleotide 30 times is generally sufficient for rendering the correct read of that nucleotide. However, things get more complicated when dealing with (1) tumors containing gene copy number alterations (e.g., aneuploidy or regions of gene deletion or amplification) or (2) admixture of normal cells within the tumor sample. To compensate for the copy number variation and normal cell contamination seen in most samples, typical cancer genome sequencing projects aim for a depth of coverage of at least 100×. Sequencing for diagnostic purposes may require even greater depth of coverage. In addition, the analysis of samples containing only rare tumor cells (e.g., 10%) would require ultradeep sequencing; otherwise, any tumor-specific mutations would likely be missed (false negatives). Importantly, the frequency of cancer-associated mutations in studies performed using traditional Sanger sequencing methods may have been underestimated because of the lack of power to detect mutations in tumors with significant normal cell contamination. Whereas Sanger sequencing delivers the average allele observed in a sample, next-generation sequencing methods deliver a distribution of observed alleles, allowing for mutant alleles to be identified even if they represent a minority population.
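To make the depth-of-coverage argument concrete, the sketch below uses a simple binomial model to estimate the chance of observing a heterozygous somatic mutation in a sample containing only 10% tumor cells at several sequencing depths. The three-read detection threshold and the binomial model itself are simplifying assumptions for illustration, not a method described in the text.

```python
# Illustrative sketch: a heterozygous somatic mutation in a sample that is only
# 10% tumor appears on roughly 5% of reads, so whether it is seen at all depends
# heavily on depth of coverage. The 3-supporting-read threshold is an assumption.
from math import comb

def prob_detect(depth, allele_fraction, min_reads=3):
    """P(at least `min_reads` reads carry the variant) under a binomial model."""
    p_below = sum(comb(depth, k) * allele_fraction**k *
                  (1 - allele_fraction)**(depth - k)
                  for k in range(min_reads))
    return 1 - p_below

tumor_fraction = 0.10                # 10% tumor cells in the sample
vaf = tumor_fraction * 0.5           # heterozygous mutation -> ~5% variant allele fraction

for depth in (30, 100, 500, 1000):
    print(f"{depth:>5}x coverage: P(detect) ~ {prob_detect(depth, vaf):.2f}")
# Detection is unreliable at 30x, improves at 100x, and approaches certainty
# only with ultradeep (several hundred-fold) coverage.
```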
Future of Sequencing Technologies

No one could have predicted the dramatic advances that have come to DNA sequencing technologies over the past decade. Costs have dropped dramatically, and it is predicted that costs will continue to drop, although less precipitously. The cost of whole-genome sequencing was estimated to be $1000 in 2016; however, this cost can be achieved only when sequencing many samples from large cohorts of patients, not on an individual sample-by-sample basis. Also, in addition to sequence generation, there is a high cost associated with sequence analysis. The cost of storing and analyzing genome sequence data may exceed the cost of generating the data in the first place, and a detailed analysis is far from straightforward. Nevertheless, it is likely that, over the decade ahead, genome sequencing will become a routine component of both clinical research and clinical care.

DNA-LEVEL CHARACTERIZATION

Somatic Versus Germline Events

It is important to recognize the fundamental difference between germline variants and somatic variants in genome sequences. Germline variants are present in all cells of the body (with the exception of rare mosaicism), and these variants can contribute to the risk of future disease. Germline variants can be common (i.e., seen in ≥5% of the human population), or they can be rare (in principle, unique to a single individual). Each individual also carries de novo variants that are present in neither of the individual’s parents’ genomes. It has been demonstrated recently by genetic analyses of large populations that aging individuals without evidence of hematologic disease acquire mutations over time in genes that are associated with leukemia. This observation has been named clonal hematopoiesis of indeterminate potential. This indicates that expansion of particular clones occurs that is associated with an increased risk of myeloid and lymphoid