Course objective

Basic principles of NGS
Basic biological applications
Basics in data processing
Statistical and informatics theories in data analysis
Advantages and limitations
Assumptions of different methodologies
Biological interpretation of the results

Course overview (http://compbio.iupui.edu/group/6/pages/ngs course G788 I590)


What is NGS technology?
Platform overview
SOLiD (Life Technologies)
Pacific Biosciences
Ion Torrent (Life Technologies)
Biological applications
Basic concepts and challenges

What is NGS technology?

One can sequence hundreds of millions of short sequences (35-100 bp) in a single run
Illumina/Solexa GAII/HiSeq 2000
Life Technologies/Applied Biosystems
Ion Torrent
Roche/454 FLX, Titanium
Key words for NGS
Short reads
35, 50, 75, and 100 bp (Solexa and SOLiD)
Ultra-high throughput
1 to 1.5 billion reads (Solexa and SOLiD)
2-4 million reads (454)
Platform overview
1 “flow cell” = 8 “lanes”
1 lane = ~10-30 million “reads”, of which 5-20 million are “mappable reads”
Single-end (SE) or paired-end (PE)

	1 lane: $800 - $2000

Cluster generation by bridge amplification
Illumina HiSeq
	1 billion clusters
	30x coverage of two human genomes in a single run
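
The coverage figure above follows from simple arithmetic: mean depth is the total number of sequenced bases divided by the genome size. A minimal sketch in Python, assuming 2 x 100 bp paired-end reads and a 3.1 Gb human genome (neither number is stated on the slide; both are illustrative assumptions):

```python
def mean_coverage(n_clusters, read_length_bp, genome_size_bp, paired_end=True):
    """Expected mean depth = total sequenced bases / genome size."""
    reads_per_cluster = 2 if paired_end else 1
    total_bases = n_clusters * reads_per_cluster * read_length_bp
    return total_bases / genome_size_bp


# Assumptions (not from the slide): 100 bp paired-end reads, 3.1 Gb genome.
HUMAN_GENOME_BP = 3.1e9
READ_LENGTH_BP = 100

# 1 billion clusters per run, as quoted for the HiSeq.
run_depth = mean_coverage(1_000_000_000, READ_LENGTH_BP, HUMAN_GENOME_BP)
depth_per_genome = run_depth / 2  # the run is shared by two genomes

print(f"Depth if one genome is sequenced: {run_depth:.1f}x")
print(f"Depth per genome for two genomes: {depth_per_genome:.1f}x")
```

Under these assumptions the run yields a little over 30x per genome. Not every cluster produces a mappable read, so the usable depth is likely somewhat lower.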

SOLiD: Sequence-by-ligation
Amplification: emulsion PCR
Base detection: a mixture of labeled oligonucleotides queries the input strand with ligase
Color space vs. base space
Each base is interrogated twice
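
Color space can be illustrated with a small sketch. In two-base encoding, each color describes a pair of adjacent bases, so every base contributes to two colors. The code below assumes the common convention of four colors numbered 0-3, computed as the XOR of a 2-bit base code (A=0, C=1, G=2, T=3); the function names are hypothetical.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_color_space(seq):
    """Return (first base, list of colors), one color per adjacent base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_color_space(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    bases = [first_base]
    for color in colors:
        previous_bits = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous_bits ^ color])
    return "".join(bases)


reference = "ATGGCA"
variant = "ATGACA"  # single-base substitution at the fourth position

ref_first, ref_colors = encode_color_space(reference)
var_first, var_colors = encode_color_space(variant)
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]

print("reference colors:", ref_colors)
print("variant colors:  ", var_colors)
print("changed color positions:", changed)
print("decoded reference:", decode_color_space(ref_first, ref_colors))
```

A true single-base change alters two adjacent colors, whereas a lone changed color is more likely a measurement error. This is the practical benefit of interrogating each base twice.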
454 Technology - pyrosequencing
Pacific Biosciences
Single Molecule Real-Time (SMRT) technology
Long reads, short run times, high quality
1000-1200 bp reads (5% reach 3-5 kb); fast, with low cost per run
True single molecule sequencing
No amplification
Ion Torrent Personal Genome Machine
Principle: when a nucleotide is incorporated into a strand of DNA by polymerase, a hydrogen ion is released.
If the nucleotide is a match, a hydrogen ion is released and the resulting change in the pH of the solution is detected.
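
The detection principle can be sketched as a noise-free simulation. Nucleotides are flowed over the chip one at a time; a flow gives a signal only when the nucleotide is incorporated, and a run of identical bases releases proportionally more hydrogen ions in a single flow. The flow order "TACG" and the function names are assumptions made for illustration, and the code models the growing strand directly rather than the template.

```python
def simulate_flowgram(synthesized_seq, flow_order="TACG", max_flows=400):
    """Return a list of (nucleotide, signal) pairs, one per flow.

    The signal is the number of bases incorporated in that flow, which
    stands in for the size of the pH change (idealized, no noise).
    """
    flows = []
    position = 0
    flow_index = 0
    while position < len(synthesized_seq) and flow_index < max_flows:
        nucleotide = flow_order[flow_index % len(flow_order)]
        incorporated = 0
        # A homopolymer run is incorporated entirely within one flow.
        while position < len(synthesized_seq) and synthesized_seq[position] == nucleotide:
            incorporated += 1
            position += 1
        flows.append((nucleotide, incorporated))
        flow_index += 1
    return flows


def call_bases(flows):
    """Convert per-flow signals back into a base sequence."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = simulate_flowgram("TAAGC")
for nucleotide, signal in flows:
    print(nucleotide, signal)
print("called sequence:", call_bases(flows))
```

In real data the signal is an analog measurement, so distinguishing a run of, say, 5 from 6 identical bases is harder than this idealized version suggests.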

Biological applications

What can we do with NGS data?

Sequence DNA
De novo sequencing
Reference-based re-sequencing
SNP, CNV, Indels
Identify “who is there?”

Sequence RNA
RNA-seq (transcriptome-wide sequencing)
novel ncRNAs

Study protein-DNA/RNA interaction
ChIP-seq (for TF, Pol II binding)
CLIP-seq (for RNA binding proteins)

DNA methylation
Histone modification (ChIP-seq)
Nucleosome positioning
Chromosome looping

DNA sequencing

Whole genome sequencing
disease genomes
Cancer: high rate of abnormalities, sometimes 10K mutations per cancer
Few are “driver” mutations
The rest are “passenger” mutations
Question: can we identify the “driver” changes in tumor genomes that drive cancer progression?
Therapy selection based on genomics?
sequence mutations
structural variations

Region of chromosome
Exome (Agilent: 50Mb)
Selected genes/gene families
Whole genes
Capture array
Capture in solution
Pooled samples
Goal: identify genetic variations within a cohort of samples

Copy number variation
Array CGH (Comparative Genomic Hybridization)

Structural variations
The genome contains many structural variations
Insertions, deletions, inversions, tandem duplications, translocations, and more complex rearrangements
Can be detected with FISH
Most are difficult to detect with arrays


Patient-specific biomarkers?
Digital readout (counts)
Higher dynamic range
Can find novel transcripts
Splice variants
Different 5’, 3’ ends
Mutations: recombinations
Can find genetic variations on transcripts
Allele specific expression if coverage is very deep
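
Why allele-specific expression needs very deep coverage can be seen with a small calculation. The sketch below assumes a heterozygous site and tests the null hypothesis that both alleles are expressed equally (50:50) with an exact two-sided binomial test; the read counts and the function name are hypothetical.

```python
from math import comb


def balanced_expression_p_value(ref_reads, alt_reads):
    """Two-sided exact binomial p-value for H0: alleles expressed 50:50."""
    n = ref_reads + alt_reads
    extreme = max(ref_reads, alt_reads)
    # P(X >= extreme) under Binomial(n, 0.5); doubled because the
    # distribution is symmetric, then capped at 1.
    upper_tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Hypothetical counts: the same 60:40 allelic ratio at two depths.
p_shallow = balanced_expression_p_value(12, 8)
p_deep = balanced_expression_p_value(1200, 800)

print(f"20 reads at the site:   p = {p_shallow:.3g}")
print(f"2000 reads at the site: p = {p_deep:.3g}")
```

The same allelic imbalance is indistinguishable from sampling noise at 20 reads but clearly significant at 2000 reads.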
Goal: Identify genome-wide binding patterns of a POI (Protein of interest)
NGS enables breakthrough in genetic study

1. 1000 Genomes Project
Three pilot projects
low-coverage whole-genome sequencing of 179 individuals from four populations
high-coverage sequencing of two mother-father-child trios
Exon-targeted sequencing of 697 individuals from seven populations
Lessons from the 1000 Genomes Project
Locations, allele frequency, and local haplotypes for
15 million SNPs
1 million short insertion/deletions
20 thousand structural variations
Each person carries 250-300 loss-of-function variants in annotated genes, and 50-100 variants implicated in inherited disorders
International Cancer Genome Consortium

Basic concepts

Paired ends/Mate pairs

Pool samples into a single lane of a flow cell
Add a short “index” to tag library
6-base oligos
Currently 12 unique tags to generate 96 samples/run
Add in the P2 adaptor
Up to 16 barcodes (will upgrade to 96)
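
After a pooled lane is sequenced, reads are sorted back to their samples by the index. A minimal sketch of this demultiplexing step follows; the index sequences, sample names, reads, and the choice to tolerate one mismatch are all illustrative assumptions, not a description of any vendor's software.

```python
from collections import defaultdict

# Hypothetical 6-base indexes and sample names.
INDEX_TO_SAMPLE = {
    "ATCACG": "sample_1",
    "CGATGT": "sample_2",
    "TTAGGC": "sample_3",
}


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("index reads must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, index_to_sample, max_mismatches=1):
    """Return the sample whose index matches, or None if absent or ambiguous."""
    hits = [
        sample
        for index, sample in index_to_sample.items()
        if hamming(index_read, index) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads, index_to_sample, max_mismatches=1):
    """Group (index_read, insert_sequence) pairs by sample."""
    bins = defaultdict(list)
    for index_read, insert_sequence in reads:
        sample = assign_sample(index_read, index_to_sample, max_mismatches)
        bins[sample or "undetermined"].append(insert_sequence)
    return dict(bins)


reads = [
    ("ATCACG", "GGCTTAACCGT"),  # exact match to sample_1
    ("CGATGA", "TTGACCATGCA"),  # one mismatch to sample_2
    ("GGGGGG", "ACACACACACA"),  # matches no index
]
bins = demultiplex(reads, INDEX_TO_SAMPLE)
for sample, sequences in bins.items():
    print(sample, sequences)
```

Tolerating a mismatch is only safe when every pair of indexes differs at several positions, which is one reason index sets are designed rather than chosen at random.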


Bioinformatics is the Bottleneck

Cloud computing

Clouds are a large pool of easily usable and accessible virtualized resources
Best Practices
Analyze the details of the project to take full advantage of the cloud
Create a backup strategy before you launch an instance
Only launch an instance when you are ready to start working and shut down the instance when work is complete
Access an instance using a secure connection

