
Computational Systems Biology: Deep Learning in the Life Sciences


6.802 / 20.390 / 20.490 / HST.506 / 6.874 · Area II TQE (AI)

David Gifford, Lecture 1

February 4, 2019

http://mit6874.github.io

Your guides

Saber Liu (geliu@mit.edu)


Sid Jain (sj1@mit.edu)

Konstantin Krismer (krismer@mit.edu)

mit6874.github.io · 6.874staff@mit.edu

You should have received the Google Cloud coupon URL in your email.

Recitations (this week): Thursday 4-5pm, 36-155; Friday 4-5pm, 36-155

Office hours are after recitation at 5pm in the same room

(PS1 help and advice)

Approximately 8% of deep learning publications are in bioinformatics

Welcome to a new approach to life sciences research

• Enabled by the convergence of three things:
  • Inexpensive, high-quality collection of large datasets (sequencing, imaging, etc.)

  • New machine learning methods (including ensemble methods)

  • High-performance Graphics Processing Unit (GPU) machine learning implementations

• Result is completely transformative

Your background
• Calculus, Linear Algebra
• Probability, Programming
• Introductory Biology

Alternative MIT subjects

• 6.047/6.878 Computational Biology: Genomes, Networks, Evolution

• 6.S897/HST.956: Machine Learning for Healthcare (2:30pm, 4-270)
• 8.592 Statistical Physics in Biology
• 7.09 Quantitative and Computational Biology
• 7.32 Systems Biology
• 7.33 Evolutionary Biology: Concepts, Models and Computation
• 7.57 Quantitative Biology for Graduate Students
• 18.417 Introduction to Computational Molecular Biology
• 20.482 Foundations of Algorithms and Computational Techniques in Systems Biology

Machine Learning is the ability to improve on a task with more training data

• Task T to be performed: Classification, Regression, Transcription, Translation, Structured Output, Anomaly Detection, Synthesis, Imputation, Denoising

• Measured by Performance Measure P
• Trained on Experience E (Training Data)
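The T/P/E framing can be made concrete with a toy regression problem (hypothetical data, not from the course): the task T is predicting y from x, the performance measure P is held-out mean squared error, and the experience E is the number of training points.

```python
import numpy as np

def heldout_mse(n_train, seed=0):
    """Fit least-squares regression on n_train noisy samples of
    y = 2x + 1 and report mean squared error on a fixed test set."""
    rng = np.random.default_rng(seed)
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = 2 * x_tr + 1 + rng.normal(0, 0.5, n_train)
    # Closed-form least squares on the [x, 1] design matrix
    A = np.stack([x_tr, np.ones(n_train)], axis=1)
    w, b = np.linalg.lstsq(A, y_tr, rcond=None)[0]
    test_rng = np.random.default_rng(123)      # fixed held-out set
    x_te = test_rng.uniform(0, 1, 1000)
    y_te = 2 * x_te + 1 + test_rng.normal(0, 0.5, 1000)
    return float(np.mean((w * x_te + b - y_te) ** 2))
```

With more experience E (larger n_train), the held-out error P approaches the irreducible noise floor (0.25 here): the model improves on the task with more training data.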

https://arxiv.org/abs/1710.10196
Trained on 30,000 images from CelebA-HQ

Synthetic Celebrities

This subject is the red pill

Welcome

L1 Feb. 5: Machine learning in the computational life sciences
L2 Feb. 7: Neural networks and TensorFlow
R1 Feb. 7: Machine Learning Overview and PS1
L3 Feb. 12: Convolutional and recurrent neural networks

Problem Set: Softmax MNIST (PS1)

PS1: TensorFlow Warm Up

Regulatory Elements / ML models and interpretation

L4 Feb. 14: Protein-DNA interactions
R2 Feb. 14: Neural Networks and TensorFlow

Feb. 19: (Holiday - President's Day)
L5 Feb. 21: Models of Protein-DNA Interaction
R3 Feb. 21: Motifs and models
L6 Feb. 26: Model interpretation (gradient methods, black box)

Problem Set: Regulatory Grammar

PS2: Genomic regulatory codes

The Expressed Genome / Dimensionality reduction

L7 Feb. 28: The expressed genome and RNA splicing
R4 Feb. 28: Model interpretation
L8 Mar. 5: PCA, dimensionality reduction (t-SNE), autoencoders
L9 Mar. 7: scRNA-seq and cell labeling
R5 Mar. 7: Compressed state representations

Problem Set: scRNA-seq t-SNE

PS3: Parametric t-SNE

Gene Regulation / Model selection and uncertainty

L10 Mar. 12: Modeling gene expression and regulation
L11 Mar. 14: Model uncertainty, significance, hypothesis testing
R6 Mar. 14: Model selection and L1/L2 regularization
L12 Mar. 19: Chromatin accessibility and marks
L13 Mar. 21: Predicting chromatin accessibility
R7 Mar. 21: Chromatin accessibility

Problem Set: CTCF Binding from DNase-seq

PS4: Chromatin Accessibility

Genotype -> Phenotype, Therapeutics

L14 Apr. 2: Discovering and predicting genome interactions
L15 Apr. 4: eQTL prediction and variant prioritization
R8 Apr. 4: Lead SNPs to causal SNPs; haplotype structure
L16 Apr. 9: Imaging and genotype to phenotype
L17 Apr. 11: Generative models: optimization, VAEs, GANs
R9 Apr. 11: Generative models
L18 Apr. 18: Deep Learning for eQTLs
L19 Apr. 23: Therapeutic Design

L20 Apr. 25: Exam Review
L21 Apr. 30: Exam

Problem Set: Generative models for medical records

PS5: Generative Models

Sample 1: discharge instructions: please contact your primary care physician or return to the emergency room if [*omitted*] develop any constipation. [*omitted*] should be had stop transferred to [*omitted*] with dr. [*omitted*] or started on a limit your medications. *[*omitted*] see fult dr. [*omitted*] office and stop in a 1mg tablet to trofever great to your pain in postions, storale. [*omitted*] will be taking a cardiac catheterization and take any anti-inflammatory medicines diagness or any other concerning symptoms.

Your programming environment

Your computing resource

Your grade is based on 5 problem sets, an exam, and a final project

• Five Problem Sets (40%)
  • Individual contribution
  • Done using Google Cloud, Jupyter Notebook

• In-class exam (1.5 hours), one sheet of notes (30%)
• Final Project (30%)
  • Done individually or in teams (6.874 by permission)

  • Substantial question

Amgen could not reproduce the findings of 47/53 (89%) landmark preclinical cancer papers

http://www.nature.com/nature/journal/v483/n7391/pdf/483531a.pdf

Direct and conceptual replication is important

• Direct replication is defined as attempting to reproduce a previously observed result with a procedure that provides no a priori reason to expect a different outcome

• Conceptual replication uses a different methodology (such as a different experimental technique or a different model of a disease) to test the same hypothesis; tries to avoid confounders

https://elifesciences.org/content/6/e23383

Reproducibility Project: Cancer Biology Registered Report/Replication Study Structure

• A Registered Report details the experimental designs and protocols that will be used for the replications, and experiments cannot begin until this report has been peer reviewed and accepted for publication.

• The results of the experiments are then published as a Replication Study, irrespective of outcome but subject to peer review to check that the experimental designs and protocols were followed.

https://elifesciences.org/content/6/e23383

Claim precision is key to science

• "We have discovered the regulatory elements"
• "We have predicted the regulatory elements"

• "The variant causes a difference in gene expression"

• "The variant is associated with a difference in gene expression"

Interventions enable causal statements

• Observation-only data can be influenced by confounders

• A confounder is an unobserved variable that explains an observed effect

• Interventions on a variable allow for the detection of its direct and indirect effects
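A minimal simulation (hypothetical variables, not from the lecture) illustrates the point: let an unobserved confounder z drive both a treatment x and an outcome y, with no causal arrow from x to y. Observational data show a strong x-y association, while intervening on x (setting it independently of z) makes the association vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                      # unobserved confounder

# Observational regime: z drives both x and y; x has no effect on y
x_obs = z + rng.normal(scale=0.5, size=n)
y_obs = z + rng.normal(scale=0.5, size=n)
obs_corr = np.corrcoef(x_obs, y_obs)[0, 1]  # strong spurious association

# Interventional regime: do(x) -- set x independently of z
x_int = rng.normal(size=n)
y_int = z + rng.normal(scale=0.5, size=n)
int_corr = np.corrcoef(x_int, y_int)[0, 1]  # near zero: no causal effect
```

The observational correlation is large even though x never affects y; only the intervention reveals that the effect is zero.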

ML resolves Protein-DNA binding events

• Who - what protein(s) are binding?
• Where - where are they binding?
• Why - what chromatin state and sequence motif causes their binding?

• When - what differential binding is observed in different cell states or genotypes?

• How - are accessory factors or modifications of the factor involved?

How can we establish ground truth?

• Replicate experiments should have consistent observations

• Independent tests for the same hypothesis (different antibody, different assay)

• Statistical test against a null hypothesis - what is the probability of seeing the reads at random? We need a null model for this test.
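One common choice (a sketch, not necessarily the model used in the course) is a Poisson null: if reads land uniformly at random, the count in a fixed window is Poisson with mean lambda set by the genome-wide background rate, and the p-value for observing k or more reads is the Poisson upper tail.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): the probability of seeing k or
    more reads in a window under a uniform-background null model."""
    # Sum the pmf over 0..k-1 and take the complement.
    p_below = sum(math.exp(-lam) * lam**i / math.factorial(i)
                  for i in range(k))
    return 1.0 - p_below

# A window with 20 reads where the background predicts 2 is
# extremely unlikely under the null, so we would call it a peak.
p = poisson_upper_tail(20, 2.0)
```

Real peak callers refine this (local background rates, multiple-testing correction), but the basic test has this shape.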

Problem Set 1 structure (diagram): inputs x (tf.placeholder, shape [None, 784]) and labels y (tf.placeholder, shape [None, 10]) feed tf.matmul(x, W) + b, with W (tf.Variable, shape [784, 10]) and b (tf.Variable, shape [10]); tf.nn.softmax yields class probabilities, which a loss function compares against y to drive the optimizer.
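In NumPy terms (a sketch of the same computation outside TensorFlow, with made-up inputs and labels), the PS1 graph is a softmax regression forward pass plus a cross-entropy loss:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, subtracting the max for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.random((5, 784))              # batch of 5 flattened MNIST images
W = rng.uniform(-1, 1, (784, 10))     # weights, shaped as in the diagram
b = np.zeros(10)                      # bias

probs = softmax(x @ W + b)            # [5, 10] class probabilities
y = np.eye(10)[[3, 1, 4, 1, 5]]       # one-hot labels (arbitrary example)
loss = -np.mean(np.sum(y * np.log(probs + 1e-12), axis=1))
```

The optimizer's job in PS1 is then to adjust W and b to push this loss down.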

Programming model

Big idea: Express a numeric computation as a graph.

Graph nodes are operations which have any number of inputs and outputs

Graph edges are tensors which flow between nodes

Programming model: NN feedforward

Variables are 0-ary stateful nodes which output their current value. (State is retained across multiple executions of a graph.)

(parameters, gradient stores, eligibility traces, …)

Programming model: NN feedforward

Placeholders are 0-ary nodes whose value is fed in at execution time.

(inputs, variable learning rates, …)

Programming model: NN feedforward

Mathematical operations:MatMul: Multiply two matrix values.Add: Add elementwise (with broadcasting).ReLU: Activate with elementwise rectified linear function.

In code, please!
1. Create model weights, including initialization
   a. W ~ Uniform(-1, 1); b = 0
2. Create input placeholder x
   a. m × 784 input matrix
3. Create computation graph

import tensorflow as tf

b = tf.Variable(tf.zeros((100,)))
W = tf.Variable(tf.random_uniform((784, 100), -1, 1))
x = tf.placeholder(tf.float32, (None, 784))
h_i = tf.nn.relu(tf.matmul(x, W) + b)


How do we run it?

So far we have defined a graph.

We can deploy this graph with a session: a binding to a particular execution context (e.g. CPU, GPU)

Getting output: sess.run(fetches, feeds)

Fetches: List of graph nodes. Return the outputs of these nodes.

Feeds: Dictionary mapping from graph nodes to concrete values. Specifies the value of each graph node given in the dictionary.

import numpy as np
import tensorflow as tf

b = tf.Variable(tf.zeros((100,)))
W = tf.Variable(tf.random_uniform((784, 100), -1, 1))
x = tf.placeholder(tf.float32, (None, 784))
h_i = tf.nn.relu(tf.matmul(x, W) + b)


sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(h_i, {x: np.random.random((64, 784))})
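What that fetch computes, in plain NumPy (a sketch of the math only; TensorFlow's actual kernels and initialization differ):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, (784, 100))    # stands in for the tf.Variable W
b = np.zeros(100)                     # stands in for the tf.Variable b
x = rng.random((64, 784))             # the batch fed for placeholder x

h_i = np.maximum(x @ W + b, 0.0)      # relu(matmul(x, W) + b)
```

The session returns this [64, 100] array as the value of the fetched node h_i.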

Basic flow

1. Build a graph
   a. Graph contains parameter specifications, model architecture, optimization process, …
   b. Somewhere between 5 and 5000 lines

2. Initialize a session
3. Fetch and feed data with Session.run
   a. Compilation, optimization, etc. happens at this step; you probably won't notice

This subject is the red pill