16th December 2015
Genomics 3.0: Big Data in
Precision Medicine
Asoke K Talukder, Ph.D
InterpretOmics, Bangalore, India
BDA 2015, 16th December 2015, Hyderabad, India
17th December 2009
Part III – Big Data Genomics
Multi Scale Big Data
Multi Omics Big Data
Big ‘OMICS’
(High-throughput) Data Domains
• DNA-Seq
• ChIP-Seq
• RNA-Seq
• Systems Biology
• Meta Analysis
• Population Genetics
• GWAS
• Microarray
• Exome-Seq
• Repli-Seq
• Small RNA-Seq
• Biological Networks
• Proteomics
• Metagenomics
Model
• Create a virtual (or physical) entity that has the same physical
appearance as the original entity at a reduced scale
(space)
• Use Physical Science to create sensors that can sense
and quantify the input to the system causing the Perturbation
• Use Physical Science to measure the output of the
Perturbed model entity
• Use Mathematical (or Statistical) science to simulate
the function and behaviour of the original entity in reduced
space and reduced time under perturbation
Dimensions of Big Data: The 7 Vs of Genomic Big Data
• Volume is defined in terms of the physical volume of the data that needs to be online, such as gigabytes (10^9 bytes), terabytes (10^12), petabytes (10^15), exabytes (10^18), or even beyond.
• Velocity is about the data-retrieval time or the time taken to service a request. Velocity is
also measured through the rate of change of the data volume.
• Variety relates to heterogeneous types of data like text, structured, unstructured, video, audio
etcetera.
• Veracity is another dimension to measure data reliability - the ability of an organization to trust the data
and be able to confidently use it to make crucial decisions.
• Vexing covers the effectiveness of the algorithm. The algorithm needs to be designed to ensure that
data processing time is close to linear and the algorithm does not have any bias; irrespective of the
volume of the data, the algorithm is able to process the data in reasonable time.
• Variability is the scale of data. Data in biology is multi-scale, ranging from sub-atomic ions at
picometers, macro-molecules, cells, tissues and finally to a population [9] at thousands of kilometers.
• Value is the final actionable insight or the functional knowledge. The same
mutation in a gene may have a different effect depending on the population or the
environmental factors.
Types of Genomic Big Data
1. Patient Data (n = 1)
2. Perishable (n = 1)
3. Persistent (n = N)
4. Phenotypic (n = N)
5. Clinical (n = N)
6. Biological/Molecular (n = N)
Applications of Next-Generation Sequencing
Frederick Sanger
• Frederick Sanger was born in Rendcomb, a small
village in Gloucestershire on August 13, 1918. He
completed his Ph.D. in 1943 on lysine metabolism and a
more practical problem concerning the nitrogen of
potatoes. Sanger's first triumph was to determine the
complete amino acid sequence of the two polypeptide
chains of insulin in 1955. It was this achievement that
earned him his first Nobel prize in Chemistry in 1958. By
1967 he had determined the nucleotide sequence of the
5S ribosomal RNA from Escherichia coli, a small RNA
about 120 nucleotides long. He then turned to DNA and,
by 1977, had developed the “dideoxy” method for
sequencing DNA molecules, also known as the Sanger
method. This has been of key importance in such
projects as the Human Genome Project and earned him
his second Nobel prize in Chemistry in 1980.
Sample generation and cluster generation
200,000 clusters per tile
62.5 million reads per lane
100 bp reads -> 12.5 Gb per lane
Prepare DNA or cDNA fragments
Ligate adapters
Attach single molecules to surface
Amplify to form clusters
Random array of clusters (100 μm)
Base Calling
Consecutive cycles
The identity of each base of a cluster is read from the stacked sequence images
Sequence
Dideoxynucleotide Sequencing
Decoding the Book of Life– milestone for Quantitative Biology
A Milestone for Humanity – the Human genome
Human Genome Completed, June 26th, 2000
Francis Collins, Bill Clinton, J. Craig Venter
3 billion base pair => 6 G letters &
1 letter => 1 byte
The whole genome can be recorded in just 10 CD-ROMs!
In 2003, Human genome
sequence was deciphered!
• The genome is the complete set of genetic material (DNA) of a living thing, including all of its genes.
• In 2003, the human genome sequencing was completed.
• The human genome contains about 3 billion base pairs.
• The number of genes is estimated to be between 20,000 and
25,000.
• The difference between the genome of human and that of
chimpanzee is only 1.23%!
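The storage estimate on this slide can be checked with a few lines of arithmetic. The sketch below takes the 6 G letters and one byte per letter from the slide; the 700 MB CD-ROM capacity and the 2-bit packing of A/C/G/T are assumptions added for comparison.

```python
import math

BASE_PAIRS = 3_000_000_000      # about 3 billion base pairs (from the slide)
LETTERS = 2 * BASE_PAIRS        # 6 G letters (from the slide)
CD_BYTES = 700 * 10**6          # assumed CD-ROM capacity, not from the slide

def cds_needed(n_letters, bits_per_letter=8):
    """Number of CD-ROMs needed to store n_letters at the given encoding."""
    total_bytes = math.ceil(n_letters * bits_per_letter / 8)
    return math.ceil(total_bytes / CD_BYTES)

print(cds_needed(LETTERS))      # one byte per letter, as on the slide
print(cds_needed(LETTERS, 2))   # 2 bits per letter (A, C, G, T only)
```

Under these assumptions the one-byte encoding lands close to the slide's figure of 10 CD-ROMs, and a 2-bit encoding needs roughly a quarter of that.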
Illumina Genome Analyzer (GA)
• The Genome Analyzer sequences clustered template DNA using a robust four-color DNA Sequencing-By-Synthesis (SBS) technology that employs reversible terminators with removable fluorescence. This approach provides a high degree of sequencing accuracy even through homopolymeric regions.
NGS (Next Generation Sequencing)
Technology
How is Microarray Manufactured?
• Affymetrix GeneChip
• silicon chip
• oligonucleotide probes lithographically synthesized on the
array
• cRNA is used instead of cDNA
How Does Microarray Work?
Part IV – Biological Databases
Molecular Biology Databases …
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db,
ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, Genbank, GeneCards,
Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,
Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us,
MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-lycBase,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB,
PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase,
SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-
MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE,
VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc .................. !!!!
NCBI (National Center for Biotechnology
Information)
• over 30 databases including
GenBank, PubMed, OMIM, and
GEO
• Access all NCBI resources via
Entrez
(www.ncbi.nlm.nih.gov/Entrez/)
Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
Protein Data Bank (PDB)
ENTREZ: A DISCOVERY SYSTEM
[Figure: Entrez links Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, and 3-D Structure. Link types shown: word weight, VAST, BLAST, phylogeny, hard link, BLink, domains, and neighbors (related sequences, related structures).]
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
PRECISE RESULTS
MLH1[Gene Name] AND Human[Organism]
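A fielded query such as the one above can also be submitted programmatically through the NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL and performs no network call; the choice of `db="gene"` and `retmax=20` is an assumption for illustration, and `build_esearch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, db="gene", retmax=20):
    """Return an ESearch URL for a fielded Entrez query (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)

url = build_esearch_url("MLH1[Gene Name] AND Human[Organism]")
print(url)
```

The field tags in square brackets are what make the result precise: the search is restricted to records whose gene name is MLH1 and whose organism is human, rather than any record mentioning either word.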
UMLS Knowledge Source Server (UMLSKS) Home Page
Unified Medical Language System
From top links or buttons
Search 3 Knowledge Sources
From sidebar
Downloads
Documentation
Resources
“Biologic Function” hierarchy
• Biologic Function: 360
• Pathologic Function: 9983
• Physiologic Function: 691
• Disease or Syndrome: 67716
• Cell or Molecular Dysfunction: 1276
• Experimental Model of Disease: 72
• Organism Function: 1528
• Organ or Tissue Function: 2912
• Cell Function: 4417
• Molecular Function: 13442
• Mental or Behavioral Dysfunction: 5691
• Neoplastic Process: 19436
• Mental Process: 1224
• Genetic Function: 1340
Part V – Algorithms
Algorithms
• An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
• First you must identify exactly what the problem is!
• A problem describes a class of computational tasks. A problem instance is one particular input from that class
• In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible)
• Unlike commercial software, which is data intensive, algorithms are science and mathematics intensive
Schematic representation of our implementation of the de Bruijn graph
Zerbino D. R., Birney E. Genome Res.;2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
Example of Tour Bus error correction
Zerbino D. R., Birney E. Genome Res.;2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
Breadcrumb algorithm
Zerbino D. R., Birney E. Genome Res.;2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
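The de Bruijn graph referenced in these figures can be illustrated with the textbook construction: every k-mer in a read becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is a minimal sketch, not the implementation described by Zerbino and Birney; the reads and the value of k are made up for illustration.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # edge: prefix -> suffix
    return dict(graph)

graph = de_bruijn(["ACCGT", "CCGTG"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Edges that appear more than once reflect overlap between reads; an assembler walks such a graph to spell out contigs, and error-correction steps prune paths that are supported by few reads.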
• Overview of Human Disease
– classifications, Inheritance, mechanisms (cause)
• Databases
– OMIM (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
– Gene Clinics (http://www.geneclinics.org/)
– Mutation database (http://mutdb.org/)
– Oncomine (http://www.oncomine.org/)
– Cancer Genome project (http://www.sanger.ac.uk/genetics/CGP/)
• Analysis of genes for molecular functions,
biological processes and pathways
– PANTHER (Protein ANalysis THrough Evolutionary Relationships) (http://www.pantherdb.org/)
– Protein Interaction network (http://string.embl.de/)
Why Statistics?
• Are results statistically significant?
• Many random processes are involved in biological processes
• Many processes appear to be random but in reality are non-random
• Many chances and uncertainties are involved in biological data collection
• Statistical modeling of biological phenomena can help to understand patterns in life
Deductive and Inductive Science
Ref: Sylvia Wassertheil-Smoller, Biostatistics and Epidemiology, Springer, 2003
Law of Gravitation,
Newton's Law of Motion
$E = mc^2$
Biological Phenomenon
Simulation
Clinical Trial
Why Statistics?
Purpose of statistics is to draw inferences from samples of data to
the population from which these samples came
Or
Abstract an entity with average behavior where the behavior of the
constituent parts cannot be measured
Challenges in Computing
• Nature is a Tweaker
• Computers are efficient in discovering identity but
not similarity
• Biology needs similarity & not identity
• All Biology problems are different & unique
• Huge data generated by Next Generation
Sequencers with many errors
• Eliminate Noise from Information
• Minimize False Positive and False Negative
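The contrast between identity and similarity can be made concrete. An equality test answers only yes or no, whereas an edit distance reports how many substitutions, insertions, or deletions separate two sequences. The sketch below uses the standard Levenshtein dynamic program; the two example sequences are chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

read, ref = "TACCGT", "TGCCGT"
print(read == ref)               # identity: a single yes/no answer
print(edit_distance(read, ref))  # similarity: number of edits apart
```

The identity test is a linear scan, while the similarity computation is quadratic in the sequence lengths, which is one reason similarity search dominates the cost of many sequence-analysis pipelines.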
Most Biology Solutions are NP-Hard
• No polynomial-time algorithm is known for NP-hard problems (NP stands for nondeterministic polynomial time), so if the data volume increases by x, the cost of an exact solution grows much faster than x
• Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time
• You may not know when you have an optimal solution, if you use a heuristic
• It is almost impossible to arrive at an exact solution; however, for many of these problems a candidate solution, once obtained, can be checked quickly
• Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But how good an approximation does the solution need to be?
NGS: Experiment with an Open Mind
• The process (Wet Lab)
• Take DNA/RNA/cDNA/miRNA etc
• Break into tiny pieces
• Amplify them
• Read them as sequence of bases
• The process (Dry Lab)
• Analyze the data
• Extract information from data
• NGS Experiments are unbiased
• NGS can help discover many unknown patterns in the genome/gene or cell
Next Generation Sequence Data
• FASTQ (Illumina)
• SFF (454)
• CCS (PacBio)
• ...
• Microarray
Read layouts and properties:
• Single-end sequences
• Paired-end or mate-paired sequences (insert size, library size)
• Source material: DNA/RNA/miRNA, cDNA/mRNA
• Overlapped reads
• Random order and orientation
• Long reads or short reads
• Fixed-length or variable-length reads
• Circular consensus reads
• Hundreds to billions of bases
Paired-end/Mate-pair Data
Paul Medvedev, Monica Stanciu & Michael Brudno, Computational methods for discovering structural variation with next-generation sequencing,
Nature Methods Supplement| Vol.6 No.11s | November 2009
Roche 454 NGS Data (.sff)
FNA File content
>contig00001 length=439 numreads=17
CcTcGGCGACGCACTCCgTCTTTtCAGTCAAAGGTCGAGGCAGTtGAGGTTACCCCACCC
GTCCATCCGCCTTCGGCGGCTGTCCACCCTCCCCTCAAGGGGGAGGGGAACGCCCCGCCA
GGAACCCCGCCAATGACCGACGCCCCGACCGTTCTTtCCCCcACCGCCGAAGCCCCGGTC
GAAGGCCTGCCGTCGGGTTTCGGCGAAGGCATCGCCGGCAAGGCCGCATTTCTCATCGCC
QUAL File content
>contig00001 length=439 numreads=17
64 35 64 34 64 64 64 64 64 64 64 64 64 64 64 64 64 23 64 64 64 64 64 11 64 64 64 64 64 64 64 64 64 64
64 64 58 64 64 58 64 64 64 64 25 64 64 64 64 64 64 64 64 64 64 61 64 64 64 49
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 53 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 61 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 16 64 64 64 64 18 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 61
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
The Phred quality score Q is defined as a property that is logarithmically related to the base-calling error probability P:
$Q = -10 \log_{10} P$
or
$P = 10^{-Q/10}$
• If Phred assigns a quality score of 30 to a base, the chances that this base is called
incorrectly are 1 in 1000. The most commonly used method is to count the bases with a quality
score of 20 and above. The high accuracy of Phred quality scores makes them an ideal tool to
assess the quality of sequences.
• Because ASCII codes below 0x20 (32) are unprintable in the 1-character representation, 33 or 64 is added to the Q value,
depending on the vendor
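The relationship between Q, P, and the one-character encoding can be sketched in a few lines. The function names are hypothetical; the offset of 33 is used as the default here, and 64 can be passed for data that uses the other convention.

```python
import math

def phred_to_prob(q):
    """Error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Phred score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(qual_string, offset=33):
    """Convert a one-character-per-base quality string to integer Q values."""
    return [ord(ch) - offset for ch in qual_string]

print(phred_to_prob(30))        # chance that a Q30 base is wrong
print(decode_quality("II5"))    # Q values under the +33 convention
```

Decoding with the wrong offset shifts every score by 31, so the offset has to be established before any quality-based filtering.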
Data in FASTQ/FASTA Format
• FASTQ
@HWI-EAS107_1_4_1_113_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_213_501
ATTATAAATTGAAGCTTATACAAAAAACTCGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_313_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_413_501
TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_513_501
TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
• FASTA
>HWI-EAS107_1_4_1_113_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
>HWI-EAS107_1_4_1_213_501
ATTATAAATTGAAGCTTATACAAAAAACTCGAA
>HWI-EAS107_1_4_1_313_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
>HWI-EAS107_1_4_1_413_501
TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
>HWI-EAS107_1_4_1_513_501
TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
• For paired-end sequences you have two files, with names ending in _1 and _2 to indicate End_1 and End_2
• Within the files you have matching record ids: @HWI-EAS107_1_4_1_113_501/1 indicates the sequence of End_1, and @HWI-EAS107_1_4_1_113_501/2 indicates the sequence of End_2
• Paired-end reads are oriented INWARD
• Mate-pair reads are oriented OUTWARD
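A FASTQ record occupies four lines (id, sequence, separator, qualities), and dropping the last two gives the FASTA form. The sketch below is a minimal reader under that four-line assumption; it does not handle multi-line sequences, and `read_fastq` and `to_fasta` are hypothetical names.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()                 # '+' separator line
        qual = handle.readline().rstrip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], seq, qual

def to_fasta(records):
    """Render records as FASTA text, discarding the qualities."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq, _ in records)

example = (
    "@HWI-EAS107_1_4_1_113_501\n"
    "CATTATAAATTGAAGCTTATACAAAAAACTCGA\n"
    "+\n" + "I" * 33 + "\n"
)
records = list(read_fastq(io.StringIO(example)))
print(to_fasta(records), end="")
```

For paired-end data the same reader can be run over the _1 and _2 files in lockstep, checking that the ids match apart from the /1 and /2 suffixes.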
Error Due to Physics
Beginning
(bad quality data)
Middle
(good quality data)
End
(bad quality data)
Source: Wikipedia
Base-calling Error
(Errors occur at rates of 1 to 5 per 100 nucleotides)
No errors:     ACCGT  CGTGC   TTAC  TACCGT
Substitution:  ACCGT  CGTGC   TTAC  TGCCGT
Insertion:     ACCGT  CAGTGC  TTAC  TACCGT
Deletion:      ACCGT  CGTGC   TTAC  TACGT
Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
TTACCGTGC (Consensus)
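Once the fragments are laid out against each other, a consensus can be called column by column. The sketch below uses a simple majority vote that ignores gap characters; this voting rule is an assumption for illustration, and real consensus callers also weigh base qualities.

```python
from collections import Counter

def consensus(rows, gap="-"):
    """Majority vote per column, ignoring gap characters."""
    width = max(len(r) for r in rows)
    out = []
    for col in range(width):
        votes = Counter(r[col] for r in rows if col < len(r) and r[col] != gap)
        if votes:                       # skip columns covered by no fragment
            out.append(votes.most_common(1)[0][0])
    return "".join(out)

layout = ["--ACCGT--",
          "----CGTGC",
          "TTAC-----",
          "-TACCGT--"]
print(consensus(layout))
```

With a substitution error in one fragment, the vote in that column is still carried by the error-free fragments, which is why coverage matters for accuracy.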
Adaptors & Contamination
• Illumina Adaptors:
1) P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
2) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
3) AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
4) CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
5) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
6) CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
• In a paired read, contamination in one end results in filtering of both ends
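The pair-level filtering rule can be sketched as follows. Matching on the first 12 bases of an adaptor is an assumption made to keep the example short; the two adaptor sequences are taken from the list above, and the reads are made up.

```python
ADAPTORS = [
    "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",
    "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
]

def is_contaminated(seq, adaptors=ADAPTORS, min_len=12):
    """True if the read contains the first min_len bases of any adaptor."""
    return any(a[:min_len] in seq for a in adaptors)

def filter_pairs(pairs):
    """Keep a pair only when neither end is contaminated."""
    return [(r1, r2) for r1, r2 in pairs
            if not (is_contaminated(r1) or is_contaminated(r2))]

pairs = [
    ("ACGTACGTACGTTTGACC", "GGCATTACGGATTACA"),        # clean pair
    ("TTTTGATCGGAAGAGCGGTT", "ACGTTGCAACGT"),          # End_1 carries adaptor 1
]
kept = filter_pairs(pairs)
print(len(kept))
```

Dropping both ends keeps the _1 and _2 files in step, so that record n in one file still pairs with record n in the other.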
Genome/DNA Data
Run1:
Lane  No of Reads   Size (bytes)
1     41,179,668    10,285,393,108
2     43,252,726    10,455,103,434
3     42,951,004    10,381,539,992
4     43,580,180    10,534,360,126
6     42,071,130    10,171,701,008
7     43,084,416    10,414,795,392
8     42,891,196    10,369,596,648
Run1: Total # of Paired-End Reads: 272,535,758; 29,901,032,000 Nucleotides
Run2:
Lane  No of Reads   Size (bytes)
1     42,773,842    10,703,924,228
2     44,809,016    10,772,709,314
3     44,898,528    10,790,934,680
4     44,099,962    10,598,532,600
6     44,731,270    10,746,564,462
7     44,162,428    10,607,946,662
8     43,689,238    10,492,962,600
Run2: Total # of Paired-End Reads: 282,273,960; 30,916,428,400 Nucleotides
Run3: Mate Pair data with Read size 35 Nucleotide Library Size 5K (Insert size 5470 NT)
Lane  Size (bytes)
1     6,535,068,410
2     6,512,213,186
3     6,497,931,646
4     6,417,130,928
6     6,396,631,302
7     6,392,634,380
8     6,240,332,704
Run3: Total # of Mate-paired Reads: 841,326,748; 30,287,762,928 Nucleotides
RNA-Seq Data for a Marine Animal
Tissue Name # Reads # Bases Size (bytes)
Brain 73,224,886 4,393,493,160 14,378,439,860
Heart 71,954,940 4,317,296,400 14,129,178,812
Liver 68,992,472 4,139,548,320 13,547,005,500
miRNA Data
Sample  Bases Received  Bases Processed  Sequences  Data Size
========================================================
S1 27,951,043 27,951,043 1,114,585 70.5 MB
S2 24,768,291 24,768,291 1,043,462 64.5 MB
S3 41,569,143 41,569,143 1,685,096 106.5 MB
S4 34,037,239 34,037,239 1,433,791 89.2 MB
S5 24,963,089 24,963,089 1,033,362 61.6 MB
S6 34,846,223 34,846,223 1,439,337 96.5 MB
S7 74,262,271 74,262,271 2,309,712 164.6 MB
Read Size varying from 18 to 36 in FASTA format
Typical Biological Data Volume
(Illumina sequencing platform based)
Complexities in NGS Data
• Large files – Microsoft Windows often fails to even open the file
• Variable-length reads – allocating memory is always a computational
challenge
• Computers are good at Identity discovery but Biology needs Similarity
discovery
• Categorical data – cannot take differences between two objects
• Data are error prone – Quality of data is always a challenge
• Proprietary formats (e.g., SFF, XSQ, CEL) and differing encodings (0 base, 33 base, 64 base)
• Needs supercomputing power with terabytes of memory and petabytes of
storage
• Most Biology problems are NP-Hard – algorithms fail to scale with large data
volume
• Many Open Source tools for NGS data are poorly documented, and are not
maintained, supported, or easy to change
NGS Data Challenges
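[Figure: challenges in assembling NGS reads. Read errors: substitution, insertion, and deletion (example fragments TACCGT, TGCCGT, TCCGT, TCCCGT, ACCCGT, ACCGT). Lack of coverage: regions of the target that no fragment covers. Repeats: copies of a repeat X separate segments A, B, C, D of the target, and the assembled sequence places the segments in a different order.]
Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology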
Unknown Orientation & Order
CACGT
ACGT
TGCA
ACTACG
GTACT
ACTGA
CTGA
CACGT--------
-ACGT--------
-ACGT--------
--CGTAGT-----
-----AGTAC---
--------ACTGA
---------CTGA
CACGTAGTACTGA
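Because a fragment may come from either strand, an assembler has to consider each read together with its reverse complement. A minimal sketch of that operation, applied to two of the fragments listed above:

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

for frag in ["ACTACG", "GTACT"]:
    print(frag, "->", reverse_complement(frag))
```

The reverse complements of ACTACG and GTACT are the forms in which those fragments appear in the layout, which is how they come to overlap their neighbours.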
Discovering Biomedical Knowledge
Data
Information
Knowledge
Literature/
Molecular Data
Clinical/Bedside Data Medical
Knowledge
Target Data
Preprocessed
Data
Transformed
Data
Patterns
iOmics
Clinical/Drug Data
Data Information Knowledge
Zoltán N. Oltvai and Albert-László Barabási, Life’s Complexity Pyramid, Science Vol 298, 25 October 2002
Wet Lab experiment &
High-throughput data
Open-domain widely
used Algorithms &
Tools
Custom Tools and
Open-domain
Databases
Problem Specific
Algorithms, Analysis,
and Databases
Data
Information
Knowledge
Related
Information
Systems Biology – Hypothesis Agnostic System/Genome Wide Study
[Figure: workflow linking Scientist / Biologist, Hypothesis, Experiment/Sample, NGS / Sequencer, Big Data, ETL, Data Science (Molecular Biology / Genetics, Computer Science / Algorithms, Bioinformatics, Statistics), Meta Analysis / Network Biology, Biomedical Databases, Literature, and Publish / Translational Biomedicine.]
Data Sciences
• Data Science is about learning from data, in order to gain useful
predictions and insights
• Separating signal from noise presents many computational and
inferential challenges, which are approached from a perspective at the
interface of computer science and statistics
• Data munging/scraping/sampling/cleaning in order to get an
informative, manageable data set
• Data storage and management in order to be able to access data -
especially big data - quickly and reliably during subsequent analysis
• Exploratory data analysis to generate hypotheses and intuition about
the data
• Prediction based on statistical tools such as regression, classification,
and clustering
• Communication of results through visualization, stories, and
interpretable summaries.
Data Simulator (Synthetic Data)
• Take a Reference genome (e.g., hg19 or mm10 or some other genome)
• Create a VCF (Variant Call Format) file with synthetic mutations
• Or, take known mutations in VCF format from COSMIC or the 1000 Genomes Project
• Apply (inject) the mutations from VCF file into the reference genome
• This will create a genome (single strand) with known mutations
• Inject random errors (sequencer errors)
• Define the depth or coverage
• Create fixed length single-end or paired-end reads
• A FASTQ file will be generated with known coverage and known
mutations
• Single strand RNA-Seq, DNA-Seq, or ChIP-Seq data
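The simulation steps above can be sketched on a toy scale. This is a minimal illustration under several assumptions: the reference is a short random string rather than hg19 or mm10, mutations are given as a position-to-base dictionary rather than a VCF file, only substitution errors are injected, and reads are returned in memory instead of being written to FASTQ. All function names are hypothetical.

```python
import random

def apply_snvs(reference, snvs):
    """Inject substitutions; snvs maps 0-based position -> alternate base."""
    genome = list(reference)
    for pos, alt in snvs.items():
        genome[pos] = alt
    return "".join(genome)

def simulate_reads(genome, read_len, coverage, error_rate, rng):
    """Sample fixed-length single-end reads at the requested mean coverage."""
    n_reads = (coverage * len(genome)) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, b in enumerate(bases):
            if rng.random() < error_rate:          # simulated sequencer error
                bases[i] = rng.choice([x for x in "ACGT" if x != b])
        reads.append((start, "".join(bases)))
    return reads

rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(200))
alt = "A" if reference[50] != "A" else "C"
mutated = apply_snvs(reference, {50: alt})
reads = simulate_reads(mutated, read_len=25, coverage=10, error_rate=0.01, rng=rng)
print(len(reads))
```

Because the injected mutations and read origins are known, the output of any aligner or variant caller run on such data can be scored exactly.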
Data Scientists' Skills
Ref: Wikipedia
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an
approach/philosophy for data analysis that employs a
variety of techniques (mostly graphical) to
1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.
Truth is in the Data
• Real Human miRNA Data
• Nucleotide Patterns
– Mono, Di, Poly statistics
– Motif Statistics
• Quality of Nucleotides
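Mono- and di-nucleotide statistics of the kind listed on this slide reduce to counting k-mers with k = 1 and k = 2. A minimal sketch, using two made-up reads:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["TACCGT", "ACCGT"]
print(kmer_counts(reads, 1))   # mononucleotide composition
print(kmer_counts(reads, 2))   # dinucleotide composition
```

Comparing these counts against what a uniform composition would give is a quick exploratory check for bias or contamination in a data set.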
The Sequencing & Assembly Process
[Figure: Metagenomics/multiple genomes: target microbial genomes, random genome fragmentation, then genome assembly using overlaps.]
The Jigsaw Puzzle
Source: Unknown
Phases in Assembly
• Understand the data
– Data inventory
– Single End, Paired End, Mate Paired etc
– Sequence structure (Read size, Format)
– Quality of the data
– Patterns within the data
• Clean up the data
– Remove (Filter/Trim) vector/adaptor contaminated data
– Remove data of bad quality
– Remove data that might cause chimeric error
• Genome or Transcriptome in Ref-Assembly
• Contigs in De novo Assembly
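The clean-up phase can be illustrated with a simple quality trimmer. The sketch below cuts low-quality bases from the 3' end and drops reads that become too short; the thresholds (Q20, minimum length 4) and the +33 quality offset are assumptions chosen for the example, not values from the slides.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut the read after the last base whose quality is at least min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def clean(reads, min_len=4, **kwargs):
    """Trim every read and keep only those still at least min_len long."""
    kept = []
    for seq, qual in reads:
        seq, qual = trim_3prime(seq, qual, **kwargs)
        if len(seq) >= min_len:
            kept.append((seq, qual))
    return kept

reads = [("ACGTACGT", "IIIIII%%"),   # last two bases are low quality
         ("ACGT", "%%%%")]           # entirely low quality
print(clean(reads))
```

Adaptor filtering and removal of reads likely to cause chimeric errors would run alongside this step before any assembly is attempted.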
Genome Reference Assembly
• Seed Based Algorithm
– Indexes either the genome or the reads in a data structure
– All k-long words (k-mers) of one sequence are indexed in a table with an entry for every possible k-mer
– Seeds (exact or nearly exact substring matches between the read and the genome) are used to rapidly isolate the potential locations where the read could match, and then a sensitive, full alignment phase, often with the Smith–Waterman algorithm, is run at those locations
Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
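The seed stage can be sketched with a k-mer hash index over the genome. Each exact k-mer hit votes for the genome offset at which the read would start, and the best-voted offsets are the candidates handed to the full alignment phase. This is an illustration of the idea rather than any particular aligner; the genome, read, and k are made up.

```python
from collections import defaultdict

def build_index(genome, k):
    """Table from every k-mer in the genome to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def candidate_positions(read, index, k):
    """Genome offsets where the read could start, ranked by seed votes."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], []):
            if pos - j >= 0:
                votes[pos - j] += 1
    return sorted(votes, key=lambda p: (-votes[p], p))

genome = "TTACCGTGCAATTGCA"
index = build_index(genome, k=4)
print(candidate_positions("ACCGTGGAA", index, k=4))   # read with one mismatch
```

The read still finds its location despite the mismatch because several of its k-mers are error-free; a shorter k tolerates more errors at the price of more spurious candidates.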
MAQ
• MAQ (Mapping and Assembly with Qualities) is a reference assembler that supports short fixed-length reads of up to 63 bases
• MAQ was designed for Illumina 1G Genetic Analyzer data, with functions to handle ABI SOLiD data.
• MAQ aligns reads to reference sequences and then calls the consensus. For single-end reads, MAQ is able to find all hits with up to 2 or 3 mismatches.
• For paired-end reads, MAQ finds all paired hits with one of the two reads containing up to 1 mismatch.
• At the assembling stage, MAQ calls the consensus based on a statistical model. It calls the base which maximizes the posterior probability and calculates a Phred quality at each position along the consensus. Heterozygotes are also called in this process.
Ref: http://maq.sourceforge.net/
Faster Genome Ref-assembly Algorithm
• BWT (Burrows–Wheeler Transform)
• In the BWT index, only a fraction of the pointers must be precomputed and saved, while the rest are reconstructed on demand
• Bowtie and BWA utilize heuristic algorithms to search for non-exact matches in the BWT-based index, if exact matches cannot be located
Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
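The idea behind a BWT-based index can be shown with a naive construction and an exact-match backward search. This sketch is not how Bowtie or BWA are implemented: it builds the transform by sorting all rotations and recounts characters on every step, where a real index would use sampled, precomputed tables, and it does not attempt inexact matching.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (quadratic; for illustration)."""
    text += "$"                       # sentinel, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_text, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    first_col = sorted(bwt_text)
    # C[c]: number of characters in the text strictly smaller than c
    C = {c: first_col.index(c) for c in set(bwt_text)}

    def occ(c, i):                    # occurrences of c in bwt_text[:i]
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)
    for c in reversed(pattern):       # extend the match one base to the left
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

genome = "TTACCGTGCAATTGCA"
index = bwt(genome)
print(index)
print(count_occurrences(index, "TGCA"))
```

Each step of the search costs the same regardless of genome size once the counts are precomputed, which is what makes this family of indexes fast for short-read alignment.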
Alignment – Bowtie
(SAM – Sequence Alignment/Map)
HWUSI-EAS705_9146:3:24:828:1109/1 0 chr1_length_4160774 1374500 255 100M * 0 0 TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCA %%%%%%%%%%%%%%%41213:/=;555440323113=44;;>1=;?=1>>=53>;?A/>8=?;===A;?A5AA9A4?B?AAAB@BA>AAA<ABAB@@A@< XA:i:0 MD:Z:100 NM:i:0
HWUSI-EAS705_9146:3:98:1103:366/1 0 chr1_length_4160774 1374501 255 100M * 0 0 CTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCAT 444454313355544455544433244445661493/3;;565=;491=;5;54==3=;;>;5;;;95>><:==53=?2>??=>A;A=A?A>?AB>AA>A XA:i:0 MD:Z:100 NM:i:0
HWUSI-EAS705_9146:3:20:433:1834/1 0 chr1_length_4160774 1374502 255 100M * 0 0 TTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCATC BAA<AB=?A30@A?AAA>?9=B<=>5;8;=>?4:=919;3555/554533;35;55555;5554%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% XA:i:0 MD:Z:100 NM:i:0
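Each SAM alignment line has eleven mandatory tab-separated fields followed by optional TAG:TYPE:VALUE entries. The sketch below splits one line into named fields; the record used is a shortened, made-up example in the same shape as the Bowtie output above, and `SamRecord` and `parse_sam_line` are hypothetical names.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags")

def parse_sam_line(line):
    """Parse one tab-separated SAM alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:                    # optional TAG:TYPE:VALUE fields
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)

# Hypothetical 4-base record, for illustration only
line = "read1\t0\tchr1\t1374500\t255\t4M\t*\t0\t0\tACGT\tIIII\tXA:i:0\tMD:Z:4\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, rec.tags["NM"])
```

In the records shown on the slide, a CIGAR of 100M with NM:i:0 means all 100 bases align to the reference with no mismatches.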
Alignment in Genome Viewer
SNP, Micro-InDel, & Point Mutation
• The greatest computational challenge for the Variation Analysis (SNP/InDel) task lies in judging the likelihood that a position is a heterozygous or homozygous variant, given the error rates of the various platforms, the probability of bad mappings, and the amount of support or coverage
• Therefore, most of the tools include a detailed data preparation step in which they filter, realign and often re-score reads, followed by a nucleotide or heterozygosity calling step done under a Bayesian framework
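A Bayesian genotype call at a single position can be sketched with a binomial model. The sketch assumes a diploid site with one reference and one alternate allele, a single per-base error rate, and uniform priors over the three genotypes; these are simplifying assumptions for illustration and not the model of any specific tool.

```python
import math

def genotype_posteriors(n_ref, n_alt, error=0.01, priors=None):
    """Posterior over genotypes RR, RA, AA given ref/alt base counts at one site."""
    priors = priors or {"RR": 1 / 3, "RA": 1 / 3, "AA": 1 / 3}
    # Probability that a single read shows the alternate base, per genotype
    p_alt = {"RR": error, "RA": 0.5, "AA": 1 - error}
    joint = {}
    for g, p in p_alt.items():
        # The binomial coefficient is the same for all genotypes and cancels out
        log_lik = n_alt * math.log(p) + n_ref * math.log(1 - p)
        joint[g] = priors[g] * math.exp(log_lik)
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

post = genotype_posteriors(n_ref=9, n_alt=11)
print(max(post, key=post.get))
```

With low coverage the posteriors stay close to each other, which is why the amount of support at a position matters as much as the error rate.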
Lack of Coverage
• Coverage at a position i of a target is defined as
the number of fragments that cover this position. If
coverage is zero or low, there is not enough
information in the fragment set to reconstruct the
target completely
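The definition of coverage translates directly into code. The sketch below represents each fragment as a (start, length) pair on the target, which is an assumed representation, and reports the positions where coverage is zero.

```python
def coverage(target_len, fragments):
    """Per-position coverage for fragments given as (start, length) pairs."""
    diff = [0] * (target_len + 1)
    for start, length in fragments:
        diff[start] += 1                               # fragment begins here
        diff[min(start + length, target_len)] -= 1     # and ends before here
    cov, depth = [], 0
    for d in diff[:target_len]:
        depth += d
        cov.append(depth)
    return cov

def uncovered(cov):
    """Positions of the target that no fragment covers."""
    return [i for i, c in enumerate(cov) if c == 0]

cov = coverage(10, [(0, 4), (2, 3), (7, 3)])
print(cov)
print(uncovered(cov))
```

Any run of uncovered positions splits the reconstruction into separate contigs, since no fragment bridges the gap.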
[Figure: fragments aligned against the target, with two regions marked "No Coverage".]
End of Part III, IV & V
InterpretOmics
Office: Shezan Lavelle, 5th Floor,
#15 Walton Road, Bengaluru 560001
Lab: #329, 7th Main, HAL 2nd Stage,
Indiranagar, Bengaluru 560008
Phone: +91(80)46623800