Page 1345 - Clinical Immunology_ Principles and Practice ( PDFDrive )
1306 PART ELEVEN Diagnostic Immunology
an informatics pipeline for each clinical application. New assay development requires validation conforming to Clinical Laboratory Improvement Amendments of 1988 (CLIA 1988) regulations, and laboratories must have an ongoing process to monitor data quality and ensure result accuracy. The US Food and Drug Administration (FDA) has released a draft guidance on standards for clinical NGS-based diagnostic tests (http://www.fda.gov/ucm/groups/fdagov-public/@fdagov-meddev-gen/documents/document/ucm509838.pdf).

Sample and Laboratory Process Management
Each diagnostic laboratory must deal with the generic operational problems of sample accession, tracking, and reporting. Clinical-grade laboratory information management systems (LIMS) are required to handle all these processes with associated regulatory compliance. DNA diagnostic laboratories have several unique problems and requirements that deserve comment. Automated data acquisition is an important component of DNA sequencing and genotyping, and it requires personnel specialized in information science and systems administration. Advanced statistical models are employed at many steps in the processes of base calling (primary analysis), alignment to the reference genome, and identification of positions that are different from the reference (together called secondary analysis). Once the raw data from arrays and NGS are produced, bioinformaticians develop, manage, and operate analysis pipelines that synthesize the results into forms comprehensible to the laboratory staff tasked with reporting the results. Bioinformaticians maintain or develop analysis information management systems (AIMS), which are also used to collect and monitor performance metrics and quality control. Specialized software is used to perform these functions and to report the metrics needed for quality control. The number of patient-specific data records and the complexity of relationships in family-based testing make it essentially impossible for manual processes to achieve the required reliability. Because of the broad intended use of genomic testing, it is increasingly important to collect patient phenotype data, which are needed for variant filtering and prioritization (see Tertiary analysis below).

Primary Data Analysis: Genotyping and Base Calling
Genotyping in the case of microarrays and base calling in the case of sequencers are platform specific, and the required software is supplied by instrument vendors. Microarray data, whether from array comparative genomic hybridization or SNP chip platforms, use signal hybridization intensity to estimate DNA copy number. Copy number calls are based on multiple adjacent assay positions with respect to the genome map (i.e., the identification of clinically important CNVs is always supported by many independent data points and on-chip assay replicates). The resolution and reliability of the CNV calls depend on the total number of positions on the array and their "responsiveness" to differences in DNA copy number. Laboratories using these methods must assess data quality with robust statistical procedures, specifying in advance the minimum size and composition of called CNVs.
Sequencing data, especially when considering exomes and whole genomes, present much more challenging problems in bioinformatics. Base calling from the instrument raw data (primary analysis) typically takes place on local computers dedicated to the sequencer, but cloud-based methods can be used. Base calling generates sequence "reads" with their base quality scores. Some of the important measures of data quality at the primary analysis step are base quality score, number of reads per sample, length of reads, and fall-off of base quality with read position.

Secondary Data Analysis: Demultiplexing, Alignment, and Variant Calling
The next steps in next-generation sequence data analysis involve aligning reads to a reference sequence and generating variant calls. In many high-throughput applications, patient DNA samples are tagged with index sequences during preparation for sequencing (library construction). Molecular indexing or multiplexing allows pooling of the samples on the instrument and then sorting them out after sequencing. Demultiplexing of sequence reads is another step that is subject to quality monitoring.
After demultiplexing, the reads are mapped and aligned to the reference genome. Alignment of short-read sequences to the reference genome involves systematically matching read fragments to their correct location in the genome. The most widely used tools exploit the Burrows-Wheeler transform to carry out this process efficiently and precisely (bio-bwa.sourceforge.net/). Typically, only uniquely mapping reads are passed to the later steps of sequence analysis. This makes it difficult to analyze some segments of the genome that are important in health and disease. Some elements in the genome are composed of nearly identical sequences, most often arranged in tandem on adjacent segments of chromosomes. Human leukocyte antigen (HLA) presents particular challenges: (i) certain HLA alleles may not be represented in the reference genomes; (ii) reads may align to more than one location in HLA, leading to discard of the read or to misalignment and false-positive variation; and (iii) identical reads may have origins in distinct haplotypes that cannot be easily recognized with short-read sequences.
Another important issue is that the reliability of variant calling differs between classes of variation. Small insertion and deletion variants (indels) are clinically important because they often lead to frameshift and premature termination of proteins, but calling indels and automatically applying consistent indel nomenclature are more difficult than the corresponding tasks for single nucleotide variants (SNVs). The Genome Analysis Toolkit (http://www.broadinstitute.org/gatk/) is the most widely used software for variant calling.
Targeted resequencing and whole exome sequencing (WES) focus on protein-coding elements in the genome. Because of the complex and highly variable exon–intron structure of genes, there is considerable technical difficulty in using exon sequence data to call structural variants and CNVs. WGS, in contrast, surveys all the intron and intergenic sequence. New methods of PCR-free library construction enable the read count depth to be used as an accurate surrogate for copy number (32). In addition, gaps in aligned reads (called "split reads") can be used to recognize deletions and other structural variants, including duplications and inversions. Although challenges still remain, it is possible that WGS data combined with standardized algorithms could allow a single test to be used for almost all classes of pathogenic alleles.
Variants are saved in a standardized file format, the genomic variant call format (gVCF). This format contains information not only about the positions that contain a nonreference genotype call but also about the quality of each site that is called with the reference homozygous base. This is important because it allows multiple samples to be aggregated (e.g., to analyze mother, father, and their offspring jointly). Format standardization allows exchange and aggregation of data among laboratories around the world. Data aggregation is now widely recognized as a key step in the development of molecular diagnostics, not only to reduce errors in variant calling but also to enable sophisticated approaches to the problem of genotype-phenotype relationships in rare genetic diseases.
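
The primary-analysis quality measures named earlier (base quality score and fall-off of base quality with read position) can be illustrated with a minimal Python sketch. It assumes quality strings in the common Phred+33 encoding, where each character encodes a score Q and the estimated error probability is 10^(-Q/10). The function names are hypothetical and are not taken from any vendor software.

```python
from statistics import mean

# Assumption: qualities use the Phred+33 ("Sanger") character encoding.
PHRED_OFFSET = 33


def phred_scores(quality_string):
    """Decode one read's quality string into integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


def mean_quality_by_position(quality_strings):
    """Mean Phred score at each read position across all reads.

    Reads may differ in length; each position is averaged only over
    the reads that are long enough to cover it.
    """
    longest = max((len(q) for q in quality_strings), default=0)
    per_position = [[] for _ in range(longest)]
    for qs in quality_strings:
        for pos, q in enumerate(phred_scores(qs)):
            per_position[pos].append(q)
    return [mean(column) for column in per_position]


if __name__ == "__main__":
    # Two toy reads; the second one degrades toward its 3' end.
    toy_reads = ["IIII", "II5#"]
    print(mean_quality_by_position(toy_reads))
```

A declining profile across positions is the "fall-off" that a laboratory would track per run; the acceptance threshold itself is a laboratory-specific decision and is not implied by this sketch.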

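
Demultiplexing, as described in the secondary-analysis section, amounts to sorting pooled reads back to their samples by index sequence. The following Python sketch shows one simple way to do this, assuming each read arrives with its observed index sequence and that a small number of index mismatches is tolerated. The sample names, the `max_mismatches` parameter, and the "undetermined" bin are illustrative assumptions, not the behavior of any particular instrument software.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, sample_by_index, max_mismatches=1):
    """Assign (index_seq, read_seq) pairs to samples.

    A read is assigned only if exactly one known index lies within
    max_mismatches of the observed index; otherwise it goes to the
    'undetermined' bin, whose size is a useful quality metric.
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        hits = [
            sample
            for known, sample in sample_by_index.items()
            if len(known) == len(index_seq)
            and hamming(known, index_seq) <= max_mismatches
        ]
        key = hits[0] if len(hits) == 1 else "undetermined"
        bins[key].append(read_seq)
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical index-to-sample sheet and pooled reads.
    sheet = {"ACGTAC": "patient_A", "TTGCAA": "patient_B"}
    pooled = [
        ("ACGTAC", "AAA"),  # exact match to patient_A
        ("ACGTAA", "CCC"),  # one mismatch from patient_A's index
        ("TTGCAA", "GGG"),  # exact match to patient_B
        ("GGGGGG", "TTT"),  # matches no index closely enough
    ]
    print(demultiplex(pooled, sheet))
```

Requiring a unique hit is a conservative design choice: an ambiguous index is set aside rather than risking assignment of one patient's read to another patient.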

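
The idea that read count depth can serve as a surrogate for copy number can be sketched as follows. This is a deliberately simplified illustration, assuming per-bin read depths from a single sample, a diploid baseline, and the sample's median depth as the two-copy reference. Production CNV callers additionally correct for GC content, mappability, and noise, none of which is modeled here. All names are hypothetical.

```python
from statistics import median


def estimate_copy_number(bin_depths, baseline_ploidy=2):
    """Scale each bin's depth so the median bin equals baseline_ploidy."""
    baseline = median(bin_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return [baseline_ploidy * depth / baseline for depth in bin_depths]


def call_cnv_bins(bin_depths, baseline_ploidy=2):
    """Return (bin_index, integer_copy_number) for bins off baseline."""
    calls = []
    estimates = estimate_copy_number(bin_depths, baseline_ploidy)
    for i, copy_number in enumerate(estimates):
        state = round(copy_number)
        if state != baseline_ploidy:
            calls.append((i, state))
    return calls


if __name__ == "__main__":
    # Toy depths: bins 3 and 4 look like a one-copy deletion,
    # bin 6 like a three-copy duplication.
    depths = [30, 31, 29, 15, 14, 30, 45]
    print(call_cnv_bins(depths))
```

As with array-based calls, a clinical pipeline would require support from several adjacent bins before reporting a CNV rather than acting on a single bin.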