
16th December 2015

Genomics 3.0: Big Data in Precision Medicine

Asoke K Talukder, Ph.D

InterpretOmics, Bangalore, India

BDA 2015, 16th December 2015, Hyderabad, India


Part III – Big Data Genomics


Multi Scale Big Data


Multi Omics Big Data


Big ‘OMICS’ (High-throughput) Data Domains

• DNA-Seq
• ChIP-Seq
• RNA-Seq
• Systems Biology
• Meta Analysis
• Population Genetics
• GWAS
• Microarray
• Exome-Seq
• Repli-Seq
• Small RNA-Seq
• Biological Networks
• Proteomics
• Metagenomics

Model

• Create a virtual (or physical) entity that has the same physical appearance as the original entity at a reduced scale (space)
• Use physical science to create sensors that can sense and quantify the input that perturbs the system
• Use physical science to measure the output of the perturbed model entity
• Use mathematical (or statistical) science to simulate the function and behaviour of the original entity, under perturbation, in reduced space and reduced time

Dimensions of Big Data: The 7 Vs of Genomic Big Data

• Volume is the physical volume of the data that needs to be online, measured in gigabytes ($10^{9}$ bytes), terabytes ($10^{12}$), petabytes ($10^{15}$), exabytes ($10^{18}$), or beyond.
• Velocity is about the data-retrieval time, or the time taken to service a request. Velocity is also measured through the rate of change of the data volume.
• Variety relates to heterogeneous types of data such as text, structured, unstructured, video, audio, etc.
• Veracity measures data reliability: the ability of an organization to trust the data and confidently use it to make crucial decisions.
• Vexing covers the effectiveness of the algorithm. The algorithm needs to be designed so that processing time is close to linear in the data volume and the algorithm has no bias; irrespective of the volume of the data, it must process the data in reasonable time.
• Variability is the scale of data. Data in biology is multi-scale, ranging from sub-atomic ions at picometers through macro-molecules, cells, and tissues to a population [9] spread over thousands of kilometers.
• Value is the final actionable insight or functional knowledge. The same mutation in a gene may have a different effect depending on the population or the environmental factors.

Types of Genomic Big Data

1. Patient Data (n = 1)

2. Perishable (n = 1)

3. Persistent (n = N)

4. Phenotypic (n = N)

5. Clinical (n = N)

6. Biological/Molecular (n = N)


Applications of Next-Generation Sequencing


Frederick Sanger

• Frederick Sanger was born in Rendcomb, a small village in Gloucestershire, on August 13, 1918. He completed his Ph.D. in 1943 on lysine metabolism and a more practical problem concerning the nitrogen of potatoes. Sanger's first triumph was to determine the complete amino acid sequence of the two polypeptide chains of insulin in 1955. This achievement earned him his first Nobel Prize in Chemistry in 1958. By 1967 he had determined the nucleotide sequence of the 5S ribosomal RNA from Escherichia coli, a small RNA about 120 nucleotides long. He then turned to DNA and, by 1977, had developed the “dideoxy” method for sequencing DNA molecules, also known as the Sanger method. This method has been of key importance in projects such as the Human Genome Project and earned him his second Nobel Prize in Chemistry in 1980.

Sample generation and cluster generation

• Prepare DNA or cDNA fragments
• Ligate adapters
• Attach single molecules to surface
• Amplify to form clusters
• Random array of clusters (100 μm)
• 200,000 clusters per tile
• 62.5 million reads per lane
• 100 bp reads -> 12.5 Gb per lane

Base Calling

• Images are taken over consecutive cycles
• The identity of each base of a cluster is read from the stacked sequence of images


Dideoxynucleotide Sequencing


Decoding the Book of Life: a milestone for Quantitative Biology

A Milestone for Humanity: the Human Genome

Human genome working draft announced, June 26th, 2000 (Francis Collins, Bill Clinton, J. Craig Venter)

• 3 billion base pairs => 6 G letters
• 1 letter => 1 byte, so about 6 GB uncompressed
• The whole genome can be recorded on just 10 CD-ROMs!
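The storage estimate on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 650 MB CD-ROM capacity is an assumption (the slide does not state one), and the 2-bit packing shown at the end is an added comparison, not something the slide claims.

```python
import math

BASE_PAIRS = 3_000_000_000          # from the slide: about 3 billion base pairs
LETTERS = 2 * BASE_PAIRS            # from the slide: 6 G letters
CD_ROM_BYTES = 650 * 10**6          # ASSUMPTION: 650 MB per CD-ROM


def cd_roms_needed(n_bytes: int, cd_bytes: int = CD_ROM_BYTES) -> int:
    """Number of whole CD-ROMs needed to hold n_bytes."""
    return math.ceil(n_bytes / cd_bytes)


# One byte per letter, as on the slide.
bytes_one_per_letter = LETTERS * 1

# For comparison (not on the slide): A, C, G, T fit in 2 bits each.
bytes_two_bit_packed = LETTERS * 2 // 8

if __name__ == "__main__":
    print(bytes_one_per_letter, cd_roms_needed(bytes_one_per_letter))
    print(bytes_two_bit_packed, cd_roms_needed(bytes_two_bit_packed))
```

Under these assumptions the one-byte-per-letter encoding needs 10 discs, which agrees with the slide's figure.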

In 2003, the human genome sequence was deciphered!

• The genome is the complete set of genetic material (DNA), including all genes, of a living thing.
• In 2003, the human genome sequencing was completed.
• The human genome contains about 3 billion base pairs.
• The number of genes is estimated to be between 20,000 and 25,000.
• The difference between the human genome and the chimpanzee genome is only 1.23%!

Illumina Genome Analyzer (GA)

• The Genome Analyzer sequences clustered template DNA using a robust four-color DNA Sequencing-By-Synthesis (SBS) technology that employs reversible terminators with removable fluorescence. This approach provides a high degree of sequencing accuracy even through homopolymeric regions.


NGS (Next Generation Sequencing) Technology

How is Microarray Manufactured?

• Affymetrix GeneChip
• Silicon chip
• Oligonucleotide probes lithographically synthesized on the array
• cRNA is used instead of cDNA

How Does Microarray Work?


Part IV – Biological Databases

Molecular Biology Databases …

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-lycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.

NCBI (National Center for Biotechnology Information)

• Over 30 databases, including GenBank, PubMed, OMIM, and GEO
• Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)

Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)


Protein Data Bank (PDB)


ENTREZ: A DISCOVERY SYSTEM

[Figure: Entrez data domains (Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure) connected by hard links and by neighbor relationships. Labels on the links: Word weight, VAST, BLAST, Phylogeny, BLink, Domains; Neighbors: Related Sequences, Related Structures.]

• Pre-computed and pre-compiled data.
• A potential “gold mine” of undiscovered relationships.
• Used less than expected.

PRECISE RESULTS

MLH1[Gene Name] AND Human[Organism]
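A fielded query like the one above can also be sent to NCBI programmatically through the E-utilities `esearch` endpoint. The sketch below only builds the request URL and does not contact NCBI; the choice of the `gene` database and the `retmax` value are assumptions made for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "gene", retmax: int = 20) -> str:
    """Return an esearch URL for a fielded Entrez query.

    db and retmax are illustrative defaults, not values from the slide.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_esearch_url("MLH1[Gene Name] AND Human[Organism]"))
```

Field tags such as `[Gene Name]` and `[Organism]` restrict each term to one indexed field, which is what makes the result set precise.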

UMLS Knowledge Source Server (UMLSKS) Home Page

Unified Medical Language System

• From top links or buttons: Search 3 Knowledge Sources
• From sidebar: Downloads, Documentation, Resources

“Biologic Function” hierarchy (the number after each semantic type is the count shown on the slide)

• Biologic Function (360)
  – Physiologic Function (691)
    · Organism Function (1528)
      · Mental Process (1224)
    · Organ or Tissue Function (2912)
    · Cell Function (4417)
    · Molecular Function (13442)
      · Genetic Function (1340)
  – Pathologic Function (9983)
    · Disease or Syndrome (67716)
      · Mental or Behavioral Dysfunction (5691)
      · Neoplastic Process (19436)
    · Cell or Molecular Dysfunction (1276)
    · Experimental Model of Disease (72)

Part V – Algorithms

Algorithms

• An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
• First you must identify exactly what the problem is!
• A problem describes a class of computational tasks. A problem instance is one particular input for that problem
• In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible)
• Unlike commercial software, which is data intensive, algorithms are science and mathematics intensive

Schematic representation of our implementation of the de Bruijn graph

Zerbino D. R., Birney E. Genome Res. 2008;18:821-829

©2008 by Cold Spring Harbor Laboratory Press

[Figure: Example of Tour Bus error correction]

[Figure: Breadcrumb algorithm]

http://genome.cshlp.org/content/18/5/821/T1.large.jpg

• Overview of Human Disease
– Classifications, inheritance, mechanisms (cause)
• Databases
– OMIM (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
– Gene Clinics (http://www.geneclinics.org/)
– Mutation database (http://mutdb.org/)
– Oncomine (http://www.oncomine.org/)
– Cancer Genome Project (http://www.sanger.ac.uk/genetics/CGP/)
• Analysis of genes for molecular functions, biological processes and pathways
– PANTHER (Protein ANalysis THrough Evolutionary Relationships) (http://www.pantherdb.org/)
– Protein interaction network (http://string.embl.de/)

Why Statistics?

• Are results statistically significant?
• Many random processes are involved in biological processes
• Many processes appear to be random but in reality are non-random
• Many chances and uncertainties are involved in biological data collection
• Statistical modeling of biological phenomena can help to understand patterns in life

Deductive and Inductive Science

[Figure labels: Law of Gravitation, Newton's Laws of Motion, E = mc²; Biological Phenomenon, Simulation, Clinical Trial]

Ref: Sylvia Wassertheil-Smoller, Biostatistics and Epidemiology, Springer, 2003

Why Statistics?

The purpose of statistics is to draw inferences from samples of data about the population from which these samples came

Or

To abstract an entity with average behavior where the behavior of the constituent parts cannot be measured

Challenges in Computing

• Nature is a Tweaker
• Computers are efficient at discovering identity but not similarity
• Biology needs similarity and not identity
• All Biology problems are different and unique
• Huge data generated by Next Generation Sequencers, with many errors
• Eliminate Noise from Information
• Minimize False Positives and False Negatives

Most Biology Problems are NP-Hard

• For NP-hard problems no polynomial-time exact algorithm is known: if the data volume increases by a factor of x, the cost of an exact solution grows much faster than x (NP stands for nondeterministic polynomial time)
• Getting exact solutions may not be possible for some problems on some inputs without spending a great deal of time
• If you use a heuristic, you may not know when you have reached an optimal solution
• Exact solutions are almost impossible to compute for large inputs; however, for problems in NP, a proposed solution can be checked efficiently once it is obtained
• Sometimes exact solutions are not necessary and approximate solutions suffice. But how good does the approximation need to be?

NGS: Experiment with an Open Mind

• The process (Wet Lab)

• Take DNA/RNA/cDNA/miRNA etc

• Break into tiny pieces

• Amplify them

• Read them as sequence of bases

• The process (Dry Lab)

• Analyze the data

• Extract information from data

• NGS Experiments are unbiased

• NGS can help discover many unknown patterns in the genome/gene or cell


Next Generation Sequence Data

• FASTQ (Illumina)
• SFF (454)
• CCS (PacBio)
• ...
• Microarray

[Figure: read layouts for DNA/RNA/miRNA and cDNA/mRNA libraries. Labels: single-end sequences; paired-end or mate-paired sequences with insert size and library size; overlapped reads; random order and orientation; long reads and short reads; fixed-length and variable-length reads; circular consensus reads; hundreds to billions of bases.]

Paired-end/Mate-pair Data

Paul Medvedev, Monica Stanciu & Michael Brudno, Computational methods for discovering structural variation with next-generation sequencing, Nature Methods Supplement, Vol. 6 No. 11s, November 2009

Roche 454 NGS Data (.sff)

FNA file content:
>contig00001 length=439 numreads=17
CcTcGGCGACGCACTCCgTCTTTtCAGTCAAAGGTCGAGGCAGTtGAGGTTACCCCACCC
GTCCATCCGCCTTCGGCGGCTGTCCACCCTCCCCTCAAGGGGGAGGGGAACGCCCCGCCA
GGAACCCCGCCAATGACCGACGCCCCGACCGTTCTTtCCCCcACCGCCGAAGCCCCGGTC
GAAGGCCTGCCGTCGGGTTTCGGCGAAGGCATCGCCGGCAAGGCCGCATTTCTCATCGCC

QUAL file content (one quality value per base, 60 per row):
>contig00001 length=439 numreads=17
64 35 64 34 64 64 64 64 64 64 64 64 64 64 64 64 64 23 64 64 64 64 64 11 64 64 64 64 64 64 64 64 64 64 64 64 58 64 64 58 64 64 64 64 25 64 64 64 64 64 64 64 64 64 64 61 64 64 64 49
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 53 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 61 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 16 64 64 64 64 18 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 61
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64

Phred quality score

• The Phred quality score Q is logarithmically related to the base-calling error probability P:

$Q = -10 \log_{10} P$, or equivalently $P = 10^{-Q/10}$

• If Phred assigns a quality score of 30 to a base, the chance that this base is called incorrectly is 1 in 1000. The most commonly used method is to count the bases with a quality score of 20 and above (an error probability of at most 1 in 100). The high accuracy of Phred quality scores makes them an ideal tool to assess the quality of sequences.
• In the one-character representation, ASCII codes below 32 are unprintable, so an offset of 33 or 64 is added to the Q value, depending on the vendor.
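The relationship between Q, P, and the one-character encoding can be sketched in a few lines. The function names below are illustrative and do not come from any particular toolkit.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Turn a quality string into Phred scores by subtracting the ASCII offset."""
    return [ord(ch) - offset for ch in qual]


def encode_quality(scores: list[int], offset: int = 33) -> str:
    """Turn Phred scores into printable characters by adding the ASCII offset."""
    return "".join(chr(q + offset) for q in scores)


if __name__ == "__main__":
    print(phred_to_error_prob(30))      # about 0.001, i.e. 1 in 1000
    print(decode_quality("I", 33))      # 'I' is ASCII 73, so Q = 40
    print(decode_quality("h", 64))      # 'h' is ASCII 104, so Q = 40
```

The same score therefore appears as a different character under the two offsets, which is why a file's offset has to be known before its qualities are interpreted.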


Data in FASTQ/FASTA Format

• FASTQ
@HWI-EAS107_1_4_1_113_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_213_501
ATTATAAATTGAAGCTTATACAAAAAACTCGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_313_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_413_501
TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_513_501
TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

• FASTA
>HWI-EAS107_1_4_1_113_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
>HWI-EAS107_1_4_1_213_501
ATTATAAATTGAAGCTTATACAAAAAACTCGAA
>HWI-EAS107_1_4_1_313_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
>HWI-EAS107_1_4_1_413_501
TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
>HWI-EAS107_1_4_1_513_501
TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA

• For paired-end sequences you have two files, with names ending in _1 and _2 to indicate End_1 and End_2
• Within the files, matching records share an id: @HWI-EAS107_1_4_1_113_501/1 indicates the sequence of End_1, and @HWI-EAS107_1_4_1_113_501/2 indicates the sequence of End_2
• Paired-end reads face INWARD
• Mate-pair reads face OUTWARD
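As a minimal sketch of how the four-line FASTQ record layout and the /1, /2 pairing convention can be read, the parser below assumes unwrapped records (exactly four lines each); the helper names are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text: @id, sequence, +, quality."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], seq, qual)


def split_pair_id(read_id: str) -> tuple[str, str]:
    """Split 'name/1' into ('name', '1'); the end is '' if there is no suffix."""
    name, sep, end = read_id.rpartition("/")
    return (name, end) if sep else (read_id, "")


if __name__ == "__main__":
    seq = "CATTATAAATTGAAGCTTATACAAAAAACTCGA"
    text = ["@HWI-EAS107_1_4_1_113_501/1", seq, "+", "I" * len(seq)]
    for rec in parse_fastq(text):
        print(split_pair_id(rec.read_id), len(rec.sequence))
```

Records from the _1 and _2 files can then be matched on the name returned by `split_pair_id`.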


Error Due to Physics

[Figure: base quality along a read. Beginning: bad quality data; middle: good quality data; end: bad quality data. Source: Wikipedia]

Base-calling Error

(Errors occur at rates of 1 to 5 errors every 100 nucleotides)

Error-free   Substitution   Insertion   Deletion
ACCGT        ACCGT          ACCGT       ACCGT
CGTGC        CGTGC          CAGTGC      CGTGC
TTAC         TTAC           TTAC        TTAC
TACCGT       TGCCGT         TACCGT      TACGT

Layout of the error-free fragments and the resulting consensus:

--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
TTACCGTGC (Consensus)

Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology
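A consensus can be computed from a layout like this one by taking a majority vote in each column. The sketch below is a simplified illustration and not the method of any particular assembler: it ignores base qualities, and ties go to the base seen first in the column.

```python
from collections import Counter


def consensus(rows: list[str], gap: str = "-") -> str:
    """Column-wise majority vote over gap-padded, aligned fragments."""
    width = max(len(row) for row in rows)
    called = []
    for col in range(width):
        bases = [row[col] for row in rows if col < len(row) and row[col] != gap]
        if bases:  # skip columns that no fragment covers
            called.append(Counter(bases).most_common(1)[0][0])
    return "".join(called)


if __name__ == "__main__":
    layout = [
        "--ACCGT--",
        "----CGTGC",
        "TTAC-----",
        "-TACCGT--",
    ]
    print(consensus(layout))  # TTACCGTGC
```

With enough coverage, a single substitution in one fragment is outvoted by the other fragments covering the same column, which is why depth matters for error correction.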


Adaptors & Contamination

• Illumina Adaptors:
1) P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
2) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
3) AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
4) CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
5) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
6) CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
• In a paired read, contamination in one end results in both ends being filtered out
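The pair-level filtering rule can be sketched as follows. This is a simplified illustration that uses exact substring matching against two of the adaptors listed above; the 12-base minimum match length is an assumed threshold, and real trimming tools also tolerate mismatches.

```python
ADAPTORS = [
    "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",
    "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
]


def is_contaminated(read: str, adaptors=ADAPTORS, min_match: int = 12) -> bool:
    """True if the read contains any min_match-long stretch of an adaptor.

    min_match=12 is an ASSUMED threshold chosen for illustration.
    """
    for adaptor in adaptors:
        for i in range(len(adaptor) - min_match + 1):
            if adaptor[i:i + min_match] in read:
                return True
    return False


def filter_pairs(pairs):
    """Keep a pair only if neither end is contaminated."""
    return [(r1, r2) for r1, r2 in pairs
            if not is_contaminated(r1) and not is_contaminated(r2)]


if __name__ == "__main__":
    clean = "ACGTACGTACGTACGTACGTACGTACGT"
    dirty = "ACGTACGTACGT" + ADAPTORS[1][:15]
    print(len(filter_pairs([(clean, clean), (clean, dirty)])))
```

Dropping the clean mate together with the contaminated one keeps the _1 and _2 files in step, at the cost of losing some usable data.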


Genome/DNA Data

Run1: Total # of Paired-End Reads: 272,535,758; 29,901,032,000 Nucleotides

Lane   No of Reads   Size (bytes)
1      41,179,668    10,285,393,108
2      43,252,726    10,455,103,434
3      42,951,004    10,381,539,992
4      43,580,180    10,534,360,126
6      42,071,130    10,171,701,008
7      43,084,416    10,414,795,392
8      42,891,196    10,369,596,648

Run2: Total # of Paired-End Reads: 282,273,960; 30,916,428,400 Nucleotides

Lane   No of Reads   Size (bytes)
1      42,773,842    10,703,924,228
2      44,809,016    10,772,709,314
3      44,898,528    10,790,934,680
4      44,099,962    10,598,532,600
6      44,731,270    10,746,564,462
7      44,162,428    10,607,946,662
8      43,689,238    10,492,962,600

Run3: Total # of Mate-paired Reads: 841,326,748; 30,287,762,928 Nucleotides
Run3: Mate-pair data with read size 35 nucleotides, library size 5K (insert size 5470 NT)

Lane   Size (bytes)
1      6,535,068,410
2      6,512,213,186
3      6,497,931,646
4      6,417,130,928
6      6,396,631,302
7      6,392,634,380
8      6,240,332,704

RNA-Seq Data for a Marine Animal

Tissue Name   # Reads      # Bases         Size (bytes)
Brain         73,224,886   4,393,493,160   14,378,439,860
Heart         71,954,940   4,317,296,400   14,129,178,812
Liver         68,992,472   4,139,548,320   13,547,005,500

miRNA Data

Sample Name   No. of Bases Received   No. of Bases Processed   No. of Sequences   Size of Data
S1            27,951,043              27,951,043               1,114,585          70.5 MB
S2            24,768,291              24,768,291               1,043,462          64.5 MB
S3            41,569,143              41,569,143               1,685,096          106.5 MB
S4            34,037,239              34,037,239               1,433,791          89.2 MB
S5            24,963,089              24,963,089               1,033,362          61.6 MB
S6            34,846,223              34,846,223               1,439,337          96.5 MB
S7            74,262,271              74,262,271               2,309,712          164.6 MB

Read size varies from 18 to 36 nucleotides; data in FASTA format

Typical Biological Data Volume (Illumina sequencing platform based)

Complexities in NGS Data

• Large files: Microsoft Windows often fails to even open the file
• Variable-length reads: allocating memory is always a computational challenge
• Computers are good at identity discovery, but Biology needs similarity discovery
• Categorical data: one cannot take the difference between two objects
• Data are error prone: quality of data is always a challenge
• Proprietary formats and differing conventions (e.g., SFF, XSQ, CEL; offsets of 0, 33, or 64)
• Needs supercomputing power with terabytes of memory and petabytes of storage
• Most Biology problems are NP-Hard: algorithms fail to scale with large data volume
• Many Open Source tools for NGS data are poorly documented and are not maintained, supported, or easy to change

NGS Data Challenges

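[Figure: assembly challenges.
• Read errors: substitution, insertion, deletion (example fragments TACCGT, TGCCGT, TCCGT, TCCCGT, ACCCGT, ACCGT)
• No coverage: stretches of the target that no fragment covers
• Repeats: a target containing copies of a repeat X between regions A, B, C, D, which can be assembled with the regions in a different order]

Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology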

Unknown Orientation & Order

CACGT

ACGT

TGCA

ACTACG

GTACT

ACTGA

CTGA

CACGT--------

-ACGT--------

-ACGT--------

--CGTAGT-----

-----AGTAC---

--------ACTGA

---------CTGA

CACGTAGTACTGA
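Because the strand a fragment came from is unknown, an assembler has to consider each fragment and its reverse complement. The sketch below shows the operation on two fragments from this slide; `locate` is a hypothetical helper that uses exact matching only.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate(fragment: str, target: str):
    """Return (position, strand) of an exact match, trying both orientations.

    Returns None if neither orientation occurs in the target.
    """
    pos = target.find(fragment)
    if pos != -1:
        return pos, "+"
    pos = target.find(reverse_complement(fragment))
    if pos != -1:
        return pos, "-"
    return None


if __name__ == "__main__":
    target = "CACGTAGTACTGA"
    print(reverse_complement("ACTACG"))   # CGTAGT
    print(reverse_complement("GTACT"))    # AGTAC
    print(locate("ACTACG", target))       # (2, '-')
```

Fragments such as ACGT equal their own reverse complement, so their orientation cannot be decided from the sequence alone.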


Discovering Biomedical Knowledge

[Figure: knowledge discovery pipeline from Data through Information to Knowledge. Inputs: literature and molecular data, clinical/bedside data, clinical/drug data. Stages: target data, preprocessed data, transformed data, patterns, medical knowledge. Platform label: iOmics.]

Data Information Knowledge

• Data: wet lab experiments and high-throughput data
• Information: open-domain, widely used algorithms and tools
• Related Information: custom tools and open-domain databases
• Knowledge: problem-specific algorithms, analysis, and databases

Zoltán N. Oltvai and Albert-László Barabási, Life’s Complexity Pyramid, Science Vol 298, 25 October 2002

Systems Biology: Hypothesis-Agnostic System/Genome-Wide Study

[Figure: workflow from Scientist/Biologist and Experiment/Sample through NGS/Sequencer, Big Data, and ETL to Data Science (Molecular Biology/Genetics, Computer Science/Algorithms, Bioinformatics, Statistics), then Meta Analysis/Network Biology using Biomedical Databases and Literature, leading to Hypothesis and Publish/Translational Biomedicine.]

Data Sciences

• Data Science is about learning from data in order to gain useful predictions and insights
• Separating signal from noise presents many computational and inferential challenges, which are approached from a perspective at the interface of computer science and statistics
• Data munging/scraping/sampling/cleaning in order to get an informative, manageable data set
• Data storage and management in order to be able to access data, especially big data, quickly and reliably during subsequent analysis
• Exploratory data analysis to generate hypotheses and intuition about the data
• Prediction based on statistical tools such as regression, classification, and clustering
• Communication of results through visualization, stories, and interpretable summaries

Data Simulator (Synthetic Data)

• Take a reference genome (e.g., hg19, mm10, or some other genome)
• Create a VCF (Variant Call Format) file with synthetic mutations
• Or take known mutations in VCF format from COSMIC or the 1000 Genomes Project
• Apply (inject) the mutations from the VCF file into the reference genome
• This creates a genome (single strand) with known mutations
• Inject random errors (sequencer errors)
• Define the depth or coverage
• Create fixed-length single-end or paired-end reads
• A FASTQ file is generated with known coverage and known mutations
• The result can serve as single-strand RNA-Seq, DNA-Seq, or ChIP-Seq data
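The steps above can be sketched for the simplest case: single-nucleotide substitutions and single-end reads. This is a toy illustration under stated assumptions, not the simulator used for the tutorial. Mutations are passed as a dictionary of 0-based positions rather than parsed from a real VCF file (VCF positions are 1-based), and every base gets the same placeholder quality 'I'.

```python
import math
import random


def apply_snvs(reference: str, snvs: dict[int, str]) -> str:
    """Inject substitutions; snvs maps 0-based position -> alternate base."""
    genome = list(reference)
    for pos, alt in snvs.items():
        genome[pos] = alt
    return "".join(genome)


def simulate_reads(genome: str, read_len: int, coverage: float,
                   error_rate: float, seed: int = 0) -> list[str]:
    """Return FASTQ records of fixed-length single-end reads.

    The number of reads is chosen so that reads * read_len is about
    coverage * genome length. Errors are random substitutions only.
    """
    rng = random.Random(seed)
    n_reads = math.ceil(coverage * len(genome) / read_len)
    records = []
    for i in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for j, base in enumerate(bases):
            if rng.random() < error_rate:  # simulated sequencer error
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
        # Hypothetical read id that records the true origin of the read.
        records.append(f"@sim_{i}_pos{start}\n{''.join(bases)}\n+\n{'I' * read_len}")
    return records


if __name__ == "__main__":
    mutated = apply_snvs("ACGTACGTACGTACGTACGT", {2: "T"})
    for record in simulate_reads(mutated, read_len=8, coverage=2, error_rate=0.01):
        print(record)
```

Because the injected mutations and the origin of each read are known, the output gives a ground truth against which an aligner or variant caller can be scored.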


Data Scientists' Skills

Ref: Wikipedia


Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.

Truth is in the Data

• Real Human miRNA Data
• Nucleotide Patterns
– Mono, Di, Poly statistics
– Motif Statistics
• Quality of Nucleotides
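Mono- and di-nucleotide statistics of the kind listed on this slide reduce to counting k-mers with k = 1 and k = 2. The sketch below uses made-up example sequences, not the miRNA data referred to on the slide.

```python
from collections import Counter


def kmer_counts(sequences, k: int) -> Counter:
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_frequencies(sequences, k: int) -> dict[str, float]:
    """Relative frequency of each k-mer (counts divided by the total)."""
    counts = kmer_counts(sequences, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


if __name__ == "__main__":
    reads = ["TGAGGTAGTAGG", "TGAGGTAGGAGG"]   # hypothetical short reads
    print(kmer_counts(reads, 1).most_common())
    print(kmer_counts(reads, 2).most_common(3))
```

A strong skew in these counts relative to expectation is often the first visible sign of adaptor contamination or an over-represented motif.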


The Sequencing & Assembly Process

[Figure: target microbial genomes (metagenomics/multiple genomes) undergo random fragmentation; the genomes are then assembled using overlaps.]

The Jigsaw Puzzle

Source: Unknown


Phases in Assembly

• Understand the data
– Data inventory
– Single End, Paired End, Mate Paired, etc.
– Sequence structure (read size, format)
– Quality of the data
– Patterns within the data
• Clean up the data
– Remove (filter/trim) vector/adaptor contaminated data
– Remove data of bad quality
– Remove data that might cause chimeric error
• Assemble
– Genome or transcriptome in reference assembly
– Contigs in de novo assembly

Genome Reference Assembly

• Seed Based Algorithm
– Indexes either the genome or the reads in a data structure
– All k-long words (k-mers) of one sequence are indexed in a table with an entry for every possible k-mer
– Seeds (exact or nearly exact substring matches between the read and the genome) are used to rapidly isolate the potential locations where the read could match; a sensitive full-alignment phase, often with the Smith–Waterman algorithm, is then run at those locations

Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
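The seed-and-verify idea can be sketched with a plain dictionary as the k-mer index. Two simplifications are assumed here for brevity: the index is a hash table over the k-mers that actually occur, rather than a table with an entry for every possible k-mer, and the verification step counts mismatches (Hamming distance) instead of running Smith–Waterman, so insertions and deletions are not handled.

```python
from collections import Counter, defaultdict


def build_kmer_index(genome: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read: str, genome: str, index, k: int, max_mismatches: int = 2):
    """Return (start, mismatches) of the best verified location, or None."""
    # Seeding: each exact k-mer hit votes for the read start it implies.
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(genome) - len(read):
                votes[start] += 1
    # Verification: simple mismatch count at each candidate start.
    best = None
    for start, _ in votes.most_common():
        window = genome[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTACGATCGGATCCTAGGCA"   # hypothetical toy genome
    index = build_kmer_index(genome, k=4)
    read = genome[8:14] + "T" + genome[15:20]      # one substitution
    print(map_read(read, genome, index, k=4))
```

Only the few locations that receive seed votes are verified, which is what makes the approach much faster than aligning the read against every position.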


MAQ

• MAQ (Mapping and Assembly with Qualities) is a reference assembler that supports short fixed-length reads of up to 63 bases
• MAQ was designed for Illumina 1G Genetic Analyzer data, with functions to handle ABI SOLiD data
• MAQ aligns reads to reference sequences and then calls the consensus. For single-end reads, MAQ is able to find all hits with up to 2 or 3 mismatches
• For paired-end reads, MAQ finds all paired hits with one of the two reads containing up to 1 mismatch
• At the assembly stage, MAQ calls the consensus based on a statistical model. It calls the base that maximizes the posterior probability and calculates a Phred quality at each position along the consensus. Heterozygotes are also called in this process

Ref: http://maq.sourceforge.net/

Faster Genome Ref-assembly Algorithm

• BWT (Burrows–Wheeler Transform)
• In the BWT index, only a fraction of the pointers must be precomputed and saved, while the rest are reconstructed on demand
• Bowtie and BWA use heuristic algorithms to search for non-exact matches in the BWT-based index if exact matches cannot be located

Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
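To make the BWT idea concrete, the sketch below builds the transform by sorting rotations and counts exact occurrences of a pattern with backward search. It is a teaching sketch only: it recomputes occurrence counts on every step and stores no sampled pointers, so it shows the search logic but not the memory savings or the inexact-match heuristics of Bowtie or BWA.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for short texts)."""
    text += sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact matches of pattern using backward search on the BWT."""
    # first[c] = index of the first row (in sorted order) starting with c.
    first = {}
    for i, ch in enumerate(sorted(bwt_text)):
        first.setdefault(ch, i)

    def occ(ch: str, i: int) -> int:
        # Occurrences of ch in bwt_text[:i]; recomputed naively each time.
        return bwt_text.count(ch, 0, i)

    lo, hi = 0, len(bwt_text)          # current range of matching rows
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                            # annb$aa
    print(count_occurrences(transformed, "ana"))  # 2
```

The pattern is consumed from its last character to its first, and each step narrows a range of rows, so the search time depends on the pattern length rather than on the genome length.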


Alignment: Bowtie (SAM, Sequence Alignment/Map)

Each record is one line of tab-separated fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL, then optional tags.

HWUSI-EAS705_9146:3:24:828:1109/1 0 chr1_length_4160774 1374500 255 100M * 0 0 TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCA %%%%%%%%%%%%%%%41213:/=;555440323113=44;;>1=;?=1>>=53>;?A/>8=?;===A;?A5AA9A4?B?AAAB@BA>AAA<ABAB@@A@< XA:i:0 MD:Z:100 NM:i:0

HWUSI-EAS705_9146:3:98:1103:366/1 0 chr1_length_4160774 1374501 255 100M * 0 0 CTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCAT 444454313355544455544433244445661493/3;;565=;491=;5;54==3=;;>;5;;;95>><:==53=?2>??=>A;A=A?A>?AB>AA>A XA:i:0 MD:Z:100 NM:i:0

HWUSI-EAS705_9146:3:20:433:1834/1 0 chr1_length_4160774 1374502 255 100M * 0 0 TTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCATC BAA<AB=?A30@A?AAA>?9=B<=>5;8;=>?4:=919;3555/554533;35;5555;5554%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% XA:i:0 MD:Z:100 NM:i:0
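A SAM alignment line has eleven mandatory tab-separated fields followed by optional TAG:TYPE:VALUE entries. The sketch below parses one such line; the example record is a shortened, hypothetical 10-base read modelled on the records above rather than a copy of one of them.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int        # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for entry in fields[11:]:
        tag, value_type, value = entry.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)


if __name__ == "__main__":
    # Hypothetical shortened record for illustration.
    line = "\t".join(["read1", "0", "chr1_length_4160774", "1374500", "255",
                      "10M", "*", "0", "0", "TCTTCGCCTT", "%%%%%41213",
                      "XA:i:0", "MD:Z:10", "NM:i:0"])
    record = parse_sam_line(line)
    print(record.rname, record.pos, record.cigar, record.tags)
```

In the records on this slide, CIGAR 100M with NM:i:0 and MD:Z:100 means all 100 bases align to the reference without mismatches or gaps.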


Alignment in Genome Viewer


SNP, Micro-InDel, & Point Mutation

• The greatest computational challenge in variation analysis (SNP/InDel) lies in judging the likelihood that a position is a heterozygous or homozygous variant, given the error rates of the various platforms, the probability of bad mappings, and the amount of support or coverage
• Therefore, most tools include a detailed data preparation step in which they filter, realign, and often re-score reads, followed by a nucleotide or heterozygosity calling step done under a Bayesian framework
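A Bayesian genotype call at one position can be illustrated with a deliberately simple model. Everything numeric below is an assumption for illustration: a single per-base error rate, a binomial likelihood over reference and alternate read counts, and made-up prior probabilities. Real callers also use per-base qualities and mapping qualities.

```python
from math import comb

# ASSUMED priors for illustration only.
DEFAULT_PRIORS = {"hom_ref": 0.998, "het": 0.001, "hom_alt": 0.001}


def genotype_posteriors(n_ref: int, n_alt: int, error_rate: float = 0.01,
                        priors: dict = DEFAULT_PRIORS) -> dict:
    """Posterior probability of each genotype given read counts at one site."""
    # Probability that a single read shows the alternate base, per genotype.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1 - error_rate}
    n = n_ref + n_alt
    unnormalized = {
        g: priors[g] * comb(n, n_alt) * p_alt[g] ** n_alt * (1 - p_alt[g]) ** n_ref
        for g in priors
    }
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}


def call_genotype(n_ref: int, n_alt: int, **kwargs) -> str:
    """Return the genotype with the highest posterior probability."""
    posteriors = genotype_posteriors(n_ref, n_alt, **kwargs)
    return max(posteriors, key=posteriors.get)


if __name__ == "__main__":
    print(call_genotype(10, 10))   # het
    print(call_genotype(20, 0))    # hom_ref
    print(call_genotype(19, 1))    # hom_ref: one alternate read looks like an error
```

The last example shows why coverage and error rate matter: a single alternate read among twenty is better explained by a sequencing error than by a heterozygous site.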


Lack of Coverage

• Coverage at a position i of a target is defined as the number of fragments that cover this position. If coverage is zero or low, there is not enough information in the fragment set to reconstruct the target completely
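This definition translates directly into code. The sketch below assumes fragments are given as hypothetical (start, length) pairs with 0-based starts on the target; it builds the per-position coverage and reports stretches that fall below a threshold.

```python
def coverage_profile(target_len: int, fragments) -> list[int]:
    """Coverage at each position, from (start, length) fragment placements."""
    diff = [0] * (target_len + 1)          # difference array
    for start, length in fragments:
        begin = max(start, 0)
        end = min(start + length, target_len)
        if begin < end:
            diff[begin] += 1
            diff[end] -= 1
    coverage, running = [], 0
    for delta in diff[:-1]:
        running += delta
        coverage.append(running)
    return coverage


def uncovered_regions(coverage: list[int], min_cov: int = 1):
    """Half-open (start, end) intervals where coverage is below min_cov."""
    regions, start = [], None
    for i, depth in enumerate(coverage):
        if depth < min_cov and start is None:
            start = i
        elif depth >= min_cov and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions


if __name__ == "__main__":
    cov = coverage_profile(10, [(0, 4), (2, 4), (8, 2)])
    print(cov)                      # [1, 1, 2, 2, 1, 1, 0, 0, 1, 1]
    print(uncovered_regions(cov))   # [(6, 8)]
```

Raising `min_cov` flags low-coverage stretches as well as gaps, which is useful because a consensus built from one or two fragments is unreliable.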

[Figure: fragments laid out under the target; stretches of the target with no fragments are marked "No Coverage".]

End of Part III, IV & V

InterpretOmics
Office: Shezan Lavelle, 5th Floor, #15 Walton Road, Bengaluru 560001
Lab: #329, 7th Main, HAL 2nd Stage, Indiranagar, Bengaluru 560008
Phone: +91(80)46623800