What are the challenges for Data Science?
Magnus Rattray, Director, University of Manchester Data Science Institute
Professor of Computational & Systems Biology, Faculty of Biology, Medicine & Health
University of Manchester
www.datascience.manchester.ac.uk
The Large Synoptic Survey Telescope:
• 3.2 Gpixel camera
• 2000 exposures per night
• 20 TB per night
• 10-year survey: 100 PB data
In its first month of operation, LSST will survey more of the Universe than all previous telescopes
Astronomy
Particle physics
Large Hadron Collider (ATLAS experiment)
• 1 billion proton-proton collisions every second
• Nominal output rate of detector: 68 TB/s
• Actual output rate to disk: 1.5 GB/s (reduced via fast identification of “interesting” events)
• Data rate of up to 100 TB per day, for up to 6 months per year, for 10-15 years: 200 PB
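The headline volumes on the astronomy and particle physics slides follow from simple rate arithmetic. The sketch below is a rough check only. It assumes decimal units (1 PB = 1000 TB), an LSST that observes every night of the year, and 30-day months for the LHC; the function name is illustrative and does not come from the slides.

```python
def total_petabytes(tb_per_day: float, days_per_year: float, years: float) -> float:
    """Total volume in PB for a steady daily rate (1 PB = 1000 TB)."""
    return tb_per_day * days_per_year * years / 1000.0

# LSST: 20 TB per night over a 10-year survey (assumes 365 observing nights a year).
lsst_pb = total_petabytes(20, 365, 10)

# ATLAS: up to 100 TB per day, up to 6 months (~180 days) a year, for 10-15 years.
atlas_low_pb = total_petabytes(100, 180, 10)
atlas_high_pb = total_petabytes(100, 180, 15)

# Online event selection: 68 TB/s off the detector versus 1.5 GB/s written to disk.
reduction = (68 * 1000) / 1.5  # both rates expressed in GB/s

print(f"LSST raw exposures: about {lsst_pb:.0f} PB")
print(f"ATLAS: about {atlas_low_pb:.0f}-{atlas_high_pb:.0f} PB")
print(f"Detector-to-disk reduction factor: about {reduction:,.0f}x")
```

Under these assumptions the quoted 200 PB for ATLAS sits inside the computed range, and the LSST figure comes out at the same order of magnitude as the quoted 100 PB.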
Commute-flow is a brand new geodemographic classification of commuting flows for England and Wales, based on origin-destination data from the 2011 Census, that has been used to analyse the spatial dynamics of commuting. An interactive toolkit is at www.commute-flow.net
26 million travel-to-work flows recorded in the 2011 Census for England and Wales
Hincks, S., Kingston, R., Webb, B. and Wong, C. (in press) A New Geodemographic Classification of Commuting Flows for England and Wales. International Journal of Geographical Information Science.
A new two-tier geodemographic typology of commuting patterns with 9 super-groups and a total of 40 groups. Each includes a pen portrait with an interactive flow map and radial chart.
Geography
Mental health
Sport
Swimming pool
Volleyball
1. Raw GPS data
2. Detection of geolocations visited
3. Geolocations visited
4. Identification of places visited
5. Places visited
6. Type of places and activities recognition
7. Out-of-home activities
Difrancesco et al. Out-of-home activity recognition from GPS data in schizophrenic patients. IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS 2016).
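Steps 1 to 3 of the pipeline above (raw GPS data to geolocations visited) can be illustrated with a generic stay-point heuristic: a place counts as visited when consecutive fixes stay within a small radius for long enough. This is a swapped-in textbook technique, not necessarily the method used by Difrancesco et al.; the 100 m radius, the 10-minute minimum stay and all names are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Fix = Tuple[float, float, float]  # (timestamp in seconds, latitude, longitude)

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def stay_points(fixes: List[Fix], max_dist_m: float = 100.0, min_stay_s: float = 600.0):
    """Return (lat, lon, arrival, departure) for each detected stay."""
    stays = []
    i, n = 0, len(fixes)
    while i < n:
        j = i + 1
        # Extend the window while fixes remain within max_dist_m of the anchor fix i.
        while j < n and haversine_m(fixes[i][1], fixes[i][2],
                                    fixes[j][1], fixes[j][2]) <= max_dist_m:
            j += 1
        if fixes[j - 1][0] - fixes[i][0] >= min_stay_s:
            cluster = fixes[i:j]
            lat = sum(f[1] for f in cluster) / len(cluster)
            lon = sum(f[2] for f in cluster) / len(cluster)
            stays.append((lat, lon, fixes[i][0], fixes[j - 1][0]))
            i = j  # continue after the stay
        else:
            i += 1  # anchor was a transit point
    return stays

# Synthetic track: 15 minutes at one place, 5 minutes moving north, 11 minutes at another.
track = [(60 * k, 53.4668, -2.2339) for k in range(16)]
track += [(960 + 60 * k, 53.4668 + 0.005 * (k + 1), -2.2339) for k in range(5)]
track += [(1260 + 60 * k, 53.4968, -2.2339) for k in range(12)]

stays = stay_points(track)
for lat, lon, arrive, leave in stays:
    print(f"stay at ({lat:.4f}, {lon:.4f}) from t={arrive}s to t={leave}s")
```

Steps 4 to 7 (naming places and recognising activities) would need extra information, such as a map of known places, which this sketch does not attempt.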
Respiratory health
Research is increasingly data-driven
Bottom-up modelling:
• Define model of system from assumed microscopic principles
• Develop a tractable approximation to “solve” the model
• Explore system properties for various parameter settings (e.g. growth rates, stationary properties, phase transitions)
• Test/refine/revise the model given experimental data
Data-driven modelling:
• Identify system variables that can be measured: the data
• Fit a generative or predictive statistical model to the data
• Make inferences, learn hidden variables, score models
Increasingly we are connecting these approaches – allowing for strong “mechanistic” prior knowledge within data-driven models
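One simple reading of the last point is that the mechanistic assumption fixes the functional form and the data fix its parameters. The sketch below assumes exponential growth, dN/dt = r N, as the bottom-up model, so that log N is linear in time, and fits the growth rate r to noisy synthetic counts by least squares. The model choice, the noise level and all names are illustrative assumptions, not taken from the slides.

```python
import math
import random

def fit_growth_rate(times, counts):
    """Least-squares fit of log(count) = log(n0) + r * t; returns (n0, r)."""
    logs = [math.log(c) for c in counts]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    r = sxy / sxx                        # slope: growth rate
    n0 = math.exp(y_mean - r * t_mean)   # intercept: initial population
    return n0, r

# Synthetic "experiment": true n0 = 100, r = 0.3 per hour, 5% multiplicative noise.
rng = random.Random(0)
times = [float(t) for t in range(10)]
counts = [100.0 * math.exp(0.3 * t) * math.exp(rng.gauss(0.0, 0.05)) for t in times]

n0_hat, r_hat = fit_growth_rate(times, counts)
print(f"estimated n0 = {n0_hat:.1f}, r = {r_hat:.3f} per hour")
```

A purely data-driven alternative would fit a flexible curve with no growth-rate parameter; the mechanistic form is what makes the fitted r interpretable.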
Challenges for Data Science
• Big data – scalability
• Complex data – modelling & inference
• Messy data – probability & statistics
• Human data – privacy, ethics, interaction
• Accessible data – openness, reproducibility
“Data handling is now the bottleneck. It costs more to analyze a genome than to sequence a genome.” David Haussler
High-throughput DNA sequencing
Example: Genomics
Genomics: big data
@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^
@SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
GATTTGTATGAAAGTATACAACTAAAACTGCAGGTGGATCAGAGTAAGTC
+SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
hhhhgfhhcghghggfcffdhfehhhhcehdchhdhahehffffde`
@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT
+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe
200 GB data for 60x coverage over human genome; 20 PB for 100K genomes
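The reads on this slide are in FASTQ format: four lines per read, giving a header, the base sequence, a "+" separator line and one quality character per base. The sketch below parses such records and decodes the qualities. It assumes the older Illumina Phred+64 encoding, which the character range on the slide suggests but the slide does not state; the `Read` type and `parse_fastq` function are illustrative names.

```python
from typing import Iterator, List, NamedTuple

class Read(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]

def parse_fastq(text: str, offset: int = 64) -> Iterator[Read]:
    """Yield one Read per four-line FASTQ record (offset 64 is an assumption)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4:
        raise ValueError("FASTQ input must contain four lines per record")
    for k in range(0, len(lines), 4):
        header, seq, plus, qual = lines[k:k + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {k + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Each quality character encodes a Phred score as ord(char) - offset.
        yield Read(header[1:], seq, [ord(c) - offset for c in qual])

# The third read from the slide.
record = """@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT
+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe
"""

reads = list(parse_fastq(record))
for read in reads:
    mean_q = sum(read.qualities) / len(read.qualities)
    print(f"{read.name}: {len(read.sequence)} bases, mean quality {mean_q:.1f}")
```

Every base carries its own quality character, so the stored text is roughly twice the number of sequenced bases before headers and compression, which is part of why raw sequencing data grow so quickly.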
Roy et al. Science 2010
RNA-Seq: Transcriptomics
Bis-Seq, ChIP-Seq: Epigenomics
DNA-Seq: Genomics
HiC, ChIA-PET: Interactomics
Genomics: complex data
• DNA sequencing is an incredibly disruptive technology
• Genomics is not just about genomes! Many ‘omics layers
Lister, Pelizzola et al. Nature 2009
Genomics: messy data
• 111 reference “epigenomes”
• 2804 high-throughput sequencing datasets
• 1.5 x 10^11 mapped sequence reads
• >10^13 sequenced DNA bases (>1000 genomes)
Every new ‘omic layer is as big as a genome
Genomic & Precision medicine
Precision diagnosis & precision treatment
Prognostics & Theranostics
Informing prevention
New models of care at disease boundaries
Driving rapid innovation & adoption
Role of multi-omics
Linking ‘big’ data
Re-aligning incentives for commissioning – driven by science, research
Genomics – human data
“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England
• Life-course complexity indicates multiple (sub-)diseases
– Usually starts young
– May progress, remit or relapse over life
• Inconsistent gene-environment interactions indicate multiple (sub-)diseases
– Variable effects of genetic polymorphisms, e.g. CD14
– Variable treatment-setting interactions
Example: Asthmas Stretch Genomics
C allele associated
T allele associated
No association
CD14 Endotoxin Receptor
Simpson A et al. Endotoxin exposure, CD14, and allergic disease: an interaction between genes and the environment. Am J Respir Crit Care Med. 2006;174(4):386-92.
50-60% heritability in twin studies but <2% phenotype explained by current genomics
Slides from Iain Buchan
• Progression of allergy: Eczema → Asthma → Rhinitis
• Inferred from population summary →
• Assumed causal link between eczema – asthma & rhinitis
• Clinical response: target children with eczema to reduce progression to asthma
Received Wisdom: Atopic March
Spergel & Paller, 2003
World Allergy Organization, 2014
Ecologic Fallacy Revealed
Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014;21;11(10):e1001748.
MRC STELAR consortium working at scale across MAAS and ALSPAC cohorts
Model-based machine learning allowing for transitions between skin, lung and nasal allergies over time
Better Targets for ‘Omics
Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014;21;11(10):e1001748.
Disambiguate disease profiles to move toward causal modelling and efficient identification of mechanisms
Data Type | Large-scale Structural Changes | Balanced Translocations | Distant Consanguinity | Uniparental Disomy | Novel / Known Coding Variants | Novel / Known Non-coding Variants
Targeted gene sequencing | ✗ | ✗ | ✗ | ✗ | ✓ | ✗
SNP+ arrays | ✗✓ | ✗ | ✓ | ✓ | ✗ | ✗
Array CGH* | ✗✓ | ✗ | ✗ | ✗ | ✗ | ✗
Exome | ✗✓ | ✗ | ✗✓ | ✗✓ | ✓ | ✗
Whole Genome | ✗✓ | ✓ | ✓ | ✓ | ✓ | ✓
+ Single Nucleotide Polymorphism
* Comparative Genomic Hybridisation
[Figure: sequencing approaches compared on a logarithmic axis from 10,000 to 10,000,000,000 and a second axis from 0 to 2.5. Labels: Genotyping; Panels (<10m bases, subset of exons); Exome (10m bases, exons only); Whole genome (3.3bn bases, both exons and introns)]
“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England
Towards genomic medicine
Genomics – accessible data?
• Sequencing 100,000 genomes from patients with cancer and rare diseases
• £24m data infrastructure award from MRC
• Genomics England Clinical Interpretation Partnerships (GeCIPs) to enhance value of data
• Sequencing facility at the Sanger Centre
• 30 PB data in a data centre on a military base
• Researchers (GeCIP members) will not be allowed to download raw data files
• Restricted access to data and compute through secure virtual desktop (Inuvika)
• Analysis has to move to the data
But how do we move this to a global scale? How do we analyse across many datasets?
100K Genomes Project
Next Genomic Revolution: Scaling down to single cells
Microfluidics sequencing/cytometry
DNA/RNA
Protein
Fluidigm C1
Single-cell data
• Existing genomic methods average over a cell population of ~10^7 cells
• Single-cell methods uncover hidden structure:
– Diverse sub-populations of immune cells
– Clonal structure within tumours
– Rare circulating tumour cells from blood
– Asynchronous cellular dynamics
– Each cell is now a high-dimensional data point
Clustering single cell protein data
Amir et al. Nature Biotech. 2013
Uncovering clonal evolution in tumours
[Figure: life history of the tumor over time (t0, t1, t2, t3, tsample), starting from normal cells; the poly-clonal tumor at sampling, showing tissue volume at time of sampling with clones at 20%, 15%, 25% and 40%; and the clonal evolution tree over genotypes 0, A, AB, ABD and ABC]
Florian Markowetz, CRUK Cambridge – from his blog “Scientific B-sides”
Approach
Targeted:
• Basic CNA to verify CTC status
• Target 1-20 genes
• Use WBCs as –ve controls
Genome Wide:
• Copy number alteration (CNA)
• WES - comprehensive analysis
• Use WBCs as –ve controls
6 SCLC patients chosen with ≥4 single isolated CTCs and CTC pools
CNA data from 6,682 cancer-related protein-coding genes
TP53
* Pool of 10 CTCs
Circulating tumour cells (CTC) profiling
Expanded study ongoing, 2000 CTCs from 30 patients
CTC enrichment via CellSearch
CTC isolation via DepArray
Caroline Dive and Ged Brady, CRUK Manchester Institute
Modelling challenge: confounding variation
Stegle et al. Nature Reviews Genetics 2014
Single cell data
Last year
Single-cell RNA-Seq: 10^3 cells per experiment, 10^7 sequence reads per cell, 10^4 features extracted per cell
CyTOF protein quantification: 10^3 cells per second, 10^6 per experiment, 30-50 features per cell
This year
Single-cell RNA-Seq: 10^6 cells per experiment, 10^8 reads per cell, >10^5 features per cell
Single cell multi-omics
?
What are the pinch points?
• Data volume: cost and transfer speed
• Data analysis: scalable algorithms
• Data quality: batch effects, missing data, missing metadata, concept drift
• Data integration: multi-modal modelling
• Reproducible and robust research
Data volume
• Move algorithms to the data
– Put compute close to local data
– Commercial cloud (e.g. BaseSpace, Cytobank)
– Bespoke secure cloud (e.g. 100K Genomes Project)
• Issues to consider
– Will your algorithms give the same results?
– Will the analysis be reproducible in the future?
– How to integrate across resources?
Data analysis
• Scaling up algorithms, e.g. deep learning libraries integrating CPU/GPU architectures
• Fast approximate methods
• Online/streaming data processing
• Avoid solving compute-intensive intermediate tasks: e.g. avoid genomic alignment prior to counting sub-sequence matches (k-mers)
• Mixed precision numerics
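The k-mer point in the list above can be made concrete. Instead of aligning each read to a reference, index every length-k substring of the target sequences once, then count how many k-mers from each read hit each target by dictionary lookup. This is a minimal sketch of the general idea, not any particular tool's algorithm; the toy sequences, k = 4 and the function names are illustrative assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set

def kmers(sequence: str, k: int) -> Iterable[str]:
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def build_index(targets: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of target names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, sequence in targets.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(name)
    return index

def count_matches(reads: Iterable[str], index: Dict[str, Set[str]], k: int) -> Counter:
    """Count k-mer hits per target using lookups only (no alignment step)."""
    counts: Counter = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            for name in index.get(kmer, ()):
                counts[name] += 1
    return counts

targets = {"geneA": "ACGTACGTTT", "geneB": "GGGCCCGGGA"}  # toy reference sequences
reads = ["CGTACG", "GGCCCG", "TTTTTT"]                    # toy reads
index = build_index(targets, k=4)
counts = count_matches(reads, index, k=4)
print(dict(counts))
```

Each lookup costs constant time on average, so the work grows linearly with the number of bases read, which is the property that makes the shortcut attractive at scale.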
Methods for Machine Learning are no longer simply assessed on predictive accuracy
Data analysis
Data quality
Big collected data are typically not designed for a single research question (or any research question)
We need methods to deal with:
Confounders, batch effects, missing data, missing metadata, concept drift, outliers…
(while remaining scalable)
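As one small illustration of handling a batch effect, the sketch below removes a purely additive per-batch offset from a single measured feature by centring each batch on the overall mean. It is deliberately naive. It assumes the offset is additive and that batch is not confounded with the effect of interest; if all cases were processed in one batch and all controls in another, this would erase the signal as well. Names and values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Hashable, List, Sequence

def center_by_batch(values: Sequence[float], batches: Sequence[Hashable]) -> List[float]:
    """Subtract each batch's mean and add back the overall mean."""
    overall = mean(values)
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_mean = {batch: mean(vals) for batch, vals in by_batch.items()}
    return [value - batch_mean[batch] + overall for value, batch in zip(values, batches)]

# Two batches measuring the same three samples, with batch "b" shifted upwards by 10.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["a", "a", "a", "b", "b", "b"]
corrected = center_by_batch(values, batches)
print(corrected)
```

The pass is a single linear scan per feature, so it stays scalable, but it addresses only one item on the list above; missing data, missing metadata and concept drift need their own treatment.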
Robust and reproducible research
Publish data, code, workflows, version numbers, containers…
Results should not depend strongly on arbitrary modelling choices: “shake the model” (Chris Holmes)
“Hypothesis selection” leads to upward significance bias
• Try to break your models
• Use robust models
• Use bootstrapping
Keep track of all hypotheses you have considered
• Store your working history – notebook science
• Publish negative results
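The bootstrapping suggestion above can be sketched in a few lines: resample the data with replacement many times, recompute the statistic each time, and read an interval off the spread of the results. This is a plain percentile bootstrap under the assumption of independent observations; the sample values, the number of resamples and the function name are illustrative.

```python
import random
from statistics import mean
from typing import Callable, Sequence, Tuple

def bootstrap_ci(sample: Sequence[float],
                 statistic: Callable[[Sequence[float]], float] = mean,
                 n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)  # fixed seed so the analysis is reproducible
    n = len(sample)
    stats = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

sample = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.0, 3.0, 2.6, 2.2]
low, high = bootstrap_ci(sample)
print(f"mean = {mean(sample):.2f}, 95% bootstrap CI = ({low:.2f}, {high:.2f})")
```

Recording the seed alongside the code and data is a small instance of the notebook-science habit described above: the same interval can be regenerated later.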
Robust and reproducible research
• Build reproducibility into your routine – don’t wait until after your paper is accepted
• Don’t feature here:
Conclusion
• Research is increasingly data-driven across all fields – Data Science is now ubiquitous
• New challenges come from the scale, complexity and nature of data:
Big data – scalable algorithms and architectures
Complex data – better models: bottom up and top down
Messy data – statistical thinking is essential
Human data – ethical dimensions are of key importance
Accessible data – a valuable common resource