The images in this post come from the lecture video: Next-Generation Sequencing Data Analysis, Lecture 1 (Overview)
Course objective
**Basic principles of NGS**
Basic biological applications
Basics in data processing
Statistical and informatics theories in data analysis
Advantages and limitations
Assumptions of different methodologies
Biological interpretation of the results
Course overview (http://compbio.iupui.edu/group/6/pages/ngs course G788 I590)
Outline
What is NGS technology?
Platform overview
Illumina
SOLiD (Life Technologies)
454 (Roche)
Helicos
Pacific Biosciences
Ion Torrent (Life Technologies)
Biological applications
Basic concepts and challenges
What is NGS technology?
One can sequence hundreds of millions of short sequences (35-100 bp) in a single run
Illumina/Solexa GAII/HiSeq 2000
Life Technologies/Applied Biosystems
SOLiD
Ion Torrent
Roche/454 FLX, Titanium
Key words for NGS
Sequencing
Short reads
35, 50, 75, and 100 bp (Solexa and SOLiD)
400 bp (454)
Ultra-high throughput
1 to 1.5 billion reads (Solexa and SOLiD)
2-4 million reads (454)
Platform overview
Illumina
1 “flow cell” = 8 “lanes”
1 lane = ~10-30 million “reads”, of which 5-20 million are “mappable reads”
Read lengths: 36, 50, 75, 100 bp
Single-end (SE) or paired-end (PE)
1 lane: $800 - $2000
multiplexing
==Sequencing-by-synthesis==
==Cluster generation by bridge amplification==
Illumina HiSeq
1 billion clusters
30x coverage of two human genomes in a single run
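The coverage figure above can be sanity-checked with the usual back-of-the-envelope estimate: average coverage = (number of reads × read length) / genome size. The sketch below is illustrative only. The 2 × 100 bp paired-end read length and the ~3.1 Gb human genome size are assumptions not stated in the lecture, and `estimate_coverage` is a hypothetical helper name.

```python
def estimate_coverage(n_clusters, read_length_bp, genome_size_bp, paired_end=True):
    """Rough average depth: total sequenced bases divided by genome size."""
    reads_per_cluster = 2 if paired_end else 1  # paired-end reads each cluster twice
    total_bases = n_clusters * reads_per_cluster * read_length_bp
    return total_bases / genome_size_bp


HUMAN_GENOME_BP = 3.1e9   # assumption: approximate human genome size
CLUSTERS_PER_RUN = 1e9    # from the notes: 1 billion clusters per run
READ_LENGTH_BP = 100      # assumption: 2 x 100 bp paired-end reads

total = estimate_coverage(CLUSTERS_PER_RUN, READ_LENGTH_BP, HUMAN_GENOME_BP)
per_genome = total / 2    # the run is shared between two genomes
print(f"~{total:.0f}x for one genome, ~{per_genome:.0f}x each for two genomes")
```

Under these assumptions the estimate lands close to 30x per genome. It ignores duplicates, filtered clusters, and unmappable reads, so real usable coverage would be lower.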
SOLiD: Sequencing-by-ligation
Amplification: emulsion PCR
Base detection:
a mixture of labeled oligonucleotides queries the input strand, and ligase joins the matching one
Color space vs. base space
Each base is interrogated twice
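To make the color-space idea concrete, here is a minimal sketch of two-base encoding: each color describes a pair of adjacent bases, so every interior base contributes to two colors, which is what "interrogated twice" refers to. The specific mapping used here (A=0, C=1, G=2, T=3, color = XOR of the two codes) is the commonly described SOLiD scheme and is an assumption, not something taken from the lecture. The function names are hypothetical.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_color_space(seq):
    """Return the first base plus one color per pair of adjacent bases."""
    colors = [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_color_space(first_base, colors):
    """Rebuild the bases by walking the colors from the known first base."""
    bases = [first_base]
    code = BASE_TO_CODE[first_base]
    for color in colors:
        code ^= color                  # next base = previous base XOR color
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


first, colors = encode_color_space("ATGGCA")
print(first, colors)                       # A [3, 1, 0, 3, 1]
print(decode_color_space(first, colors))   # ATGGCA
```

Because decoding is cumulative, a single miscalled color alters every base after it, whereas a real single-base change alters exactly two adjacent colors. That difference is what lets color space separate sequencing errors from true variants.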
454 Technology - pyrosequencing
Pacific Biosciences
Single Molecule Real-Time (SMRT) technology
Long reads, short run times, high quality
1000-1200 bp reads (5% reach 3-5 kb); fast, with low cost per run
Helicos
True single molecule sequencing
No amplification
Ion Torrent Personal Genome Machine
Principle: when polymerase incorporates a nucleotide into a strand of DNA, a hydrogen ion is released.
Nucleotides are supplied one type at a time; if the supplied nucleotide matches the template, a hydrogen ion is released and the resulting change in the pH of the solution is detected.
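The detection principle can be illustrated with a toy simulation. One nucleotide type is supplied per flow; an incorporation produces a signal, and a run of identical bases produces a proportionally larger one. The flow order `TACG`, the noise-free integer signal, and the function names are assumptions made for illustration. For simplicity the function takes the strand being synthesized rather than the template.

```python
def simulate_flows(synthesized_seq, flow_order="TACG", max_cycles=50):
    """Return (nucleotide, signal) per flow; signal = bases incorporated."""
    flows = []
    pos = 0
    for i in range(max_cycles * len(flow_order)):
        if pos >= len(synthesized_seq):
            break                      # whole strand has been synthesized
        nuc = flow_order[i % len(flow_order)]
        count = 0
        # A homopolymer run is incorporated within a single flow.
        while pos < len(synthesized_seq) and synthesized_seq[pos] == nuc:
            count += 1
            pos += 1
        flows.append((nuc, count))     # count == 0 means no pH change
    return flows


def call_bases(flows):
    """Turn idealized flow signals back into a base sequence."""
    return "".join(nuc * signal for nuc, signal in flows)


flows = simulate_flows("TTAGC")
print(flows)
print(call_bases(flows))
```

In real data the signal is analog and noisy, so telling a run of, say, 6 identical bases from 7 is harder than this idealized version suggests.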
Biological applications
What can we do with NGS data?
Sequence DNA
De novo sequencing
Reference-based re-sequencing
SNP, CNV, Indels
Metagenomics
Identify “who is there?”
Sequence RNA
RNA-seq (transcriptome-wide sequencing)
miRNA-seq
novel ncRNAs
Study protein-DNA/RNA interaction
ChIP-seq (for TF, Pol II binding)
CLIP-seq (for RNA binding proteins)
Epigenetics
DNA methylation
Histone modification (ChIP-seq)
Nucleosome positioning
Chromosome looping
DNA sequencing
Whole genome sequencing
disease genomes
Cancer: high rate of abnormalities, sometimes 10K mutations per cancer
Few are “driver” mutations
The rest are “passenger” mutations
Question: Can we identify the “driver” changes in tumor genomes that drive cancer progression?
Therapy selection based on genomics?
sequence mutations
structural variations
Targeted
Region of chromosome
Exome (Agilent: 50Mb)
Selected genes/gene families
Exons
Whole genes
Technology:
Capture array
Capture in solution
PCR
Pooled samples
Goal: identify genetic variations within a cohort of samples
Copy number variation
Technology
Microarrays
Array CGH (Comparative Genomic Hybridization)
Structural variations
The genome contains many structural variations
Insertions, deletions, inversions, tandem duplications, translocations, and more complex rearrangements
Can be detected with FISH
Most are difficult to detect with arrays
Examples
Patient-specific biomarkers?
Leary et al. Science Translational Medicine. 2010
RNA-seq
Wang et al. Nat. Rev. Genetics 2009
Advantages
Digital readout (counts)
Higher dynamic range
Can find novel transcripts
Splice variants
Different 5’, 3’ ends
Mutations: recombinations
Can find genetic variations on transcripts
Allele specific expression if coverage is very deep
ChIP-seq
Goal: Identify genome-wide binding patterns of a POI (Protein of interest)
Park. Nat. Rev. Genetics 2009
Genome-wide mapping of in vivo protein-DNA interactions. Science 2007
High-resolution profiling of histone methylations in human genome. Cell, 2007
Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nature Methods, 2007
NGS enables breakthrough in genetic study
1. 1000 Genomes Project
Three pilot projects
low-coverage whole-genome sequencing of 179 individuals from four populations
high-coverage sequencing of two mother-father-child trios
Exon-targeted sequencing of 697 individuals from seven populations
Lessons from the 1000 Genomes Project
Locations, allele frequency, and local haplotypes for
15 million SNPs
1 million short insertion/deletions
20 thousand structural variations
Each person carries 250-300 loss-of-function variants in annotated genes, and 50-100 variants implicated in inherited disorders
International Cancer Genome Consortium
modENCODE
Basic concepts
Paired ends/Mate pairs
Multiplexing/Barcoding
Pool samples into a single lane of a flow cell
Add a short “index” to tag library
Illumina
6-base oligos
Currently 12 unique tags per lane, allowing 96 samples per run (12 tags x 8 lanes)
SOLiD
Add in the P2 adaptor
Up to 16 barcodes (will upgrade to 96)
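Barcoding can be sketched as a simple lookup: every read carries a short index identifying its library, and after sequencing the pooled reads are binned by that index. In the sketch below the 6-base index sequences, the sample names, and the input layout (each read supplied as an `(index, sequence)` pair) are all made up for illustration. It also requires an exact index match, so it does not tolerate sequencing errors in the index.

```python
from collections import defaultdict

# Hypothetical index-to-sample sheet for one lane.
SAMPLE_BY_INDEX = {
    "ATCACG": "sample_1",
    "CGATGT": "sample_2",
    "TTAGGC": "sample_3",
}


def demultiplex(reads, sample_by_index):
    """Group read sequences by sample, using an exact match on the index."""
    bins = defaultdict(list)
    for index, sequence in reads:
        sample = sample_by_index.get(index, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


pooled_reads = [
    ("ATCACG", "GGCTAACGT"),
    ("TTAGGC", "ACGTTGCAA"),
    ("ATCACG", "TTGACCGTA"),
    ("NNNNNN", "CCGATAGGT"),   # unreadable index ends up in "undetermined"
]

for sample, seqs in demultiplex(pooled_reads, SAMPLE_BY_INDEX).items():
    print(sample, len(seqs))
```

Keeping an explicit "undetermined" bin makes it easy to see how many reads were lost to unreadable or unexpected indexes.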
Challenges
Bioinformatics is the Bottleneck
Cloud computing
Clouds are a large pool of easily usable and accessible virtualized resources
Best Practices
Analyze the details of the project to take full advantage of the cloud
Create a backup strategy before you launch an instance
Only launch an instance when you are ready to start working and shut down the instance when work is complete
Access an instance using a secure connection
Reposted from: https://blog.csdn.net/weixin_42953727/article/details/102488365