What are the challenges for Data Science?

What are the challenges for Data Science? Magnus Rattray Director, University of Manchester Data Science Institute Professor of Computational & Systems Biology Faculty of Biology, Medicine & Health University of Manchester www.datascience.manchester.ac.uk

Transcript of What are the challenges for Data Science?

Page 1: What are the challenges for Data Science?...Hincks, S., Kingston, R., Webb, B. and Wong, C. (in press) A New Geodemographic Classification of Commuting Flows for England and Wales.

What are the challenges for Data Science?

Magnus Rattray
Director, University of Manchester Data Science Institute

Professor of Computational & Systems Biology
Faculty of Biology, Medicine & Health

University of Manchester

www.datascience.manchester.ac.uk

Page 2:

Astronomy

The Large Synoptic Survey Telescope:
• 3.2 Gpixel camera
• 2000 exposures per night
• 20 TB per night
• 10-year survey: 100 PB of data

In its first month of operation, LSST will survey more of the Universe than all previous telescopes

Page 3:

Particle physics

Large Hadron Collider (ATLAS experiment)
• 1 billion proton-proton collisions every second
• Nominal output rate of detector: 68 TB/s
• Actual output rate to disk: 1.5 GB/s (reduced via fast identification of “interesting” events)
• Data rate of up to 100 TB per day, for up to 6 months per year, for 10-15 years: 200 PB
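The figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes) and treats "6 months" as 182 days, neither of which is stated on the slide.

```python
# Back-of-envelope check of the LHC data-rate figures quoted above.
# Assumptions (not from the slide): decimal units, 6 months ~ 182 days.
GB = 10**9
TB = 10**12
PB = 10**15

def reduction_factor(nominal_bytes_per_s, stored_bytes_per_s):
    """How much the fast event selection shrinks the detector output."""
    return nominal_bytes_per_s / stored_bytes_per_s

def total_volume_pb(tb_per_day, days_per_year, years):
    """Total stored volume in PB for a given daily rate and running time."""
    return tb_per_day * TB * days_per_year * years / PB

factor = reduction_factor(68 * TB, 1.5 * GB)   # roughly 45,000-fold
low = total_volume_pb(100, 182, 10)            # 10 years of running
high = total_volume_pb(100, 182, 15)           # 15 years of running

print(f"reduction factor: {factor:,.0f}x")
print(f"total volume: {low:.0f} to {high:.0f} PB")
```

Under these assumptions the quoted 200 PB falls inside the computed range, which suggests the slide's total is an upper-end estimate rounded to one significant figure.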

Page 4:

Geography

Commute-flow is a brand new geodemographic classification of commuting flows for England and Wales based on origin-destination data from the 2011 Census that has been used to analyse the spatial dynamics of commuting. An interactive toolkit is available at www.commute-flow.net

26 million travel-to-work flows recorded in the 2011 Census for England and Wales

A new two-tier geodemographic typology of commuting patterns with 9 super-groups and a total of 40 groups. Each includes a pen portrait with an interactive flow map and radial chart.

Hincks, S., Kingston, R., Webb, B. and Wong, C. (in press) A New Geodemographic Classification of Commuting Flows for England and Wales. International Journal of Geographical Information Science.

Page 5:

Mental health

Sport

Swimming pool

Volleyball

1. Raw GPS data
2. Detection of geolocation visited
3. Geolocations visited
4. Identification of places visited
5. Places visited
6. Type of places and activities recognition
7. Out-of-home activities

Difrancesco et al. Out-of-home activity recognition from GPS data in schizophrenic patients. IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS 2016).
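Step 2 of the pipeline above (going from raw GPS fixes to geolocations visited) can be illustrated with a simple stay-point heuristic. This is not the method of Difrancesco et al.; it is a generic sketch, and the `Fix` type, the 100 m radius and the 10-minute minimum stay are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass

@dataclass
class Fix:
    t: float    # seconds since the start of the recording
    lat: float  # degrees
    lon: float  # degrees

def haversine_m(a, b):
    """Great-circle distance between two fixes in metres."""
    r = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dphi = p2 - p1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def stay_points(fixes, max_dist_m=100.0, min_stay_s=600.0):
    """Return (lat, lon, arrive, leave) for each run of fixes that stays
    within max_dist_m of its first fix for at least min_stay_s."""
    stays = []
    i, n = 0, len(fixes)
    while i < n:
        j = i + 1
        while j < n and haversine_m(fixes[i], fixes[j]) <= max_dist_m:
            j += 1
        group = fixes[i:j]  # consecutive fixes close to fixes[i]
        if group[-1].t - group[0].t >= min_stay_s:
            lat = sum(f.lat for f in group) / len(group)
            lon = sum(f.lon for f in group) / len(group)
            stays.append((lat, lon, group[0].t, group[-1].t))
            i = j
        else:
            i += 1
    return stays

# Hypothetical track: 15 minutes in one spot, then moving north.
track = [
    Fix(0, 53.4668, -2.2339),
    Fix(300, 53.4668, -2.2340),
    Fix(600, 53.4669, -2.2339),
    Fix(900, 53.4668, -2.2338),
    Fix(1200, 53.4800, -2.2339),
    Fix(1500, 53.4900, -2.2339),
]
stays = stay_points(track)
print(stays)
```

The later steps (identifying which place a stay point corresponds to, and recognising the activity) need external data such as a points-of-interest database, so they are left out of this sketch.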

Page 6:

Respiratory health

Page 7:

Research is increasingly data-driven

Bottom-up modelling:
• Define model of system from assumed microscopic principles
• Develop a tractable approximation to “solve” the model
• Explore system properties for various parameter settings (e.g. growth rates, stationary properties, phase transitions)
• Test/refine/revise the model given experimental data

Data-driven modelling:
• Identify system variables that can be measured: the data
• Fit a generative or predictive statistical model to the data
• Make inferences, learn hidden variables, score models

Increasingly we are connecting these approaches – allowing for strong “mechanistic” prior knowledge within data-driven models
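A toy example of connecting the two approaches: a mechanistic assumption (growth proportional to population size, dx/dt = r·x, so x(t) = x0·exp(r·t)) fixes the form of the model, and the data are used only to estimate its parameters. The model choice, function name and data below are illustrative assumptions, not taken from the talk.

```python
import math

def fit_exponential_growth(times, counts):
    """Least-squares fit of log(count) = log(x0) + r * t.

    The mechanistic prior is the exponential form itself; the data
    only determine the initial size x0 and the growth rate r.
    """
    ys = [math.log(c) for c in counts]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, ys))
    r = sxy / sxx
    x0 = math.exp(y_mean - r * t_mean)
    return x0, r

# Hypothetical noise-free measurements generated with x0 = 100, r = 0.5.
times = [0, 1, 2, 3, 4]
counts = [100 * math.exp(0.5 * t) for t in times]
x0, r = fit_exponential_growth(times, counts)
print(f"x0 = {x0:.2f}, r = {r:.3f}")
```

With real, noisy measurements the same structure applies, but one would normally also report uncertainty on r and check whether the assumed growth law is adequate.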

Page 8:

Challenges for Data Science

• Big data – scalability
• Complex data – modelling & inference
• Messy data – probability & statistics
• Human data – privacy, ethics, interaction
• Accessible data – openness, reproducibility

Page 9:

Example: Genomics

High-throughput DNA sequencing

“Data handling is now the bottleneck. It costs more to analyze a genome than to sequence a genome.” David Haussler

Page 10:

Genomics: big data

@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^…
@SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
GATTTGTATGAAAGTATACAACTAAAACTGCAGGTGGATCAGAGTAAGTC
+SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
hhhhgfhhcghghggfcffdhfehhhhcehdchhdhahehffffde`…
@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT
+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe

200 GB data for 60x coverage over human genome
20 PB for 100K genomes
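The reads on this slide are in FASTQ format: four lines per read (an "@" header, the bases, a "+" separator line, and one quality character per base). A minimal parser might look like the sketch below. The quality offset of 64 is an assumption, suggested by characters such as "h" in the example and typical of older Illumina pipelines; many files use an offset of 33 instead. `parse_fastq` and `FastqRecord` are names invented for this example.

```python
from typing import Iterable, Iterator, List, NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]

def parse_fastq(lines: Iterable[str], phred_offset: int = 64) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry.

    phred_offset=64 is an assumption about the encoding, not something
    the file itself declares.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], seq, [ord(c) - phred_offset for c in qual])

# The third read from the slide, which survived extraction intact.
example = [
    "@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50",
    "TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT",
    "+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50",
    "dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe",
]
record = next(parse_fastq(example))
print(record.name, len(record.sequence), max(record.qualities))

# Storage arithmetic from the slide: 200 GB per genome, 100,000 genomes.
total_pb = 200 * 100_000 / 1_000_000  # GB -> PB in decimal units
print(total_pb, "PB")
```

Because the parser is a generator, it can process a file far larger than memory one record at a time, which matters at the volumes quoted above.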

Page 11:

Genomics: complex data
• DNA sequencing is an incredibly disruptive technology
• Genomics is not just about genomes! Many ‘omics layers

DNA-Seq: Genomics
RNA-Seq: Transcriptomics
Bis-Seq, ChIP-Seq: Epigenomics
HiC, ChIA-PET: Interactomics

Roy et al. Science 2010

Page 12:

Genomics: messy data

Lister, Pelizzola et al. Nature 2009

Page 13:

• 111 reference “epigenomes”
• 2804 high-throughput sequencing datasets
• 1.5 x 10^11 mapped sequence reads
• >10^13 sequenced DNA bases (>1000 genomes)

Every new ‘omic layer is as big as a genome

Page 14:

Genomics – human data

Genomic & Precision medicine

Precision diagnosis & precision treatment

Prognostics & Theranostics

Informing prevention

New models of care at disease boundaries

Driving rapid innovation & adoption

Role of multi-omics

Linking ‘big’ data

Re-aligning incentives for commiss’ng – driven by science, research

“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England

Page 15:

Example: Asthmas Stretch Genomics

• Life-course complexity indicates multiple (sub-)diseases
– Usually starts young
– May progress, remit or relapse over life

• Inconsistent gene-environment interactions indicate multiple (sub-)diseases
– Variable effects of genetic polymorphisms, e.g. CD14
– Variable treatment-setting interactions

[Figure: CD14 Endotoxin Receptor; panels labelled "C allele associated", "T allele associated" and "No association"]

Simpson A et al. Endotoxin exposure, CD14, and allergic disease: an interaction between genes and the environment. Am J Respir Crit Care Med. 2006;174(4):386-92.

50-60% heritability in twin studies but <2% phenotype explained by current genomics

Slides from Iain Buchan

Page 16:

Received Wisdom: Atopic March

• Progression of allergy: Eczema → Asthma → Rhinitis
• Inferred from population summary
• Assumed causal link between eczema – asthma & rhinitis
• Clinical response: target children with eczema to reduce progression to asthma

Spergel & Paller, 2003

World Allergy Organization, 2014

Page 17:

Ecologic Fallacy Revealed

Model-based machine learning allowing for transitions between skin, lung and nasal allergies over time

MRC STELAR consortium working at scale across MAAS and ALSPAC cohorts

Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014;21;11(10):e1001748.

Page 18:

Better Targets for ‘Omics

Disambiguate disease profiles to move toward causal modelling and efficient identification of mechanisms

Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014;21;11(10):e1001748.

Page 19:

Towards genomic medicine

Data Type | Large-scale Structural Changes | Balanced Translocations | Distant Consanguinity | Uniparental Disomy | Novel / Known Coding Variants | Novel / Known Non-coding Variants
Targeted gene sequencing | ✗ | ✗ | ✗ | ✗ | ✓ | ✗
SNP+ arrays | ✗✓ | ✗ | ✓ | ✓ | ✗ | ✗
Array CGH* | ✗✓ | ✗ | ✗ | ✗ | ✗ | ✗
Exome | ✗✓ | ✗ | ✗✓ | ✗✓ | ✓ | ✗
Whole Genome | ✗✓ | ✓ | ✓ | ✓ | ✓ | ✓

+ Single Nucleotide Polymorphism
* Comparative Genomic Hybridisation

[Chart: log-scale axis from 10,000 to 10,000,000,000, with the following labels]
Genotyping
Panels: <10m bases, subset of exons
Exome: 10m bases, exons only
Whole genome: 3.3bn bases, both exons and introns

“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England

Page 20:

Genomics – accessible data?

• Sequencing 100,000 genomes from patients with cancer and rare diseases
• £24m data infrastructure award from MRC
• Genomics England Clinical Interpretation Partnerships (GeCIPs) to enhance value of data

Page 21:

100K Genomes Project

• Sequencing facility at the Sanger Centre
• 30 PB data in a data centre on a military base
• Researchers (GeCIP members) will not be allowed to download raw data files
• Restricted access to data and compute through secure virtual desktop (Inuvika)
• Analysis has to move to the data

But how do we move this to a global scale? How do we analyse across many datasets?

Page 22:

Next Genomic Revolution: Scaling down to single cells

Microfluidics sequencing/cytometry

DNA/RNA

Protein

Fluidigm C1

Page 23:

Single-cell data

• Existing genomic methods average over a cell population of ~10^7 cells
• Single-cell methods uncover hidden structure:
– Diverse sub-populations of immune cells
– Clonal structure within tumours
– Rare circulating tumour cells from blood
– Asynchronous cellular dynamics
– Each cell is now a high-dimensional data point
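A tiny numerical illustration of why population averages hide structure. The expression values are invented: six cells from two sub-populations, one with the gene essentially off and one with it strongly on.

```python
from statistics import mean

# Hypothetical expression of a single gene in six cells (arbitrary units).
cells = [0.1, 0.2, 0.0, 9.8, 10.1, 10.0]

bulk = mean(cells)                       # what a bulk assay would report
low = [x for x in cells if x < bulk]     # "off" sub-population
high = [x for x in cells if x >= bulk]   # "on" sub-population
closest_gap = min(abs(x - bulk) for x in cells)

print(f"bulk average: {bulk:.2f}")
print(f"{len(low)} cells below it, {len(high)} above it")
print(f"no cell is within {closest_gap:.2f} units of the average")
```

The bulk value describes none of the individual cells, which is the sense in which single-cell measurements uncover structure that averaging removes.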

Page 24:

Clustering single cell protein data

Amir et al. Nature Biotech. 2013

Page 25:

Uncovering clonal evolution in tumours

[Figure: life history of the tumor over time (t0, t1, t2, t3, tsample), from normal cells to a poly-clonal tumor at sampling. Tissue volume at time of sampling is divided between clones, with genotype labels A, AB, ABD and ABC and proportions labelled 20%, 15%, 25% and 40%, shown alongside the corresponding clonal evolution tree.]

Florian Markowetz, CRUK Cambridge – from his blog “Scientific B-sides”

Page 26:

Circulating tumour cells (CTC) profiling

CTC enrichment via CellSearch
CTC isolation via DepArray

Approach

Targeted:
• Basic CNA to verify CTC status
• Target 1-20 genes
• Use WBCs as –ve controls

Genome Wide:
• Copy number alteration (CNA)
• WES - comprehensive analysis
• Use WBCs as –ve controls

6 SCLC patients chosen with ≥4 single isolated CTCs and CTC pools
CNA data from 6,682 cancer-related protein-coding genes

[Figure: CNA profiles with TP53 marked; * = pool of 10 CTCs]

Expanded study ongoing, 2000 CTCs from 30 patients

Caroline Dive and Ged Brady, CRUK Manchester Institute

Page 27:

Modelling challenge: confounding variation

Stegle et al. Nature Reviews Genetics 2014

Page 28:

Single cell data

Last year

Single-cell RNA-Seq
10^3 cells per experiment
10^7 sequence reads per cell
10^4 features extracted per cell

CyTOF protein quantification
10^3 cells per second
10^6 per experiment
30-50 features per cell

This year

Single-cell RNA-Seq
10^6 cells per experiment
10^8 reads per cell
>10^5 features per cell

Single cell multi-omics

?

Page 29:

What are the pinch points?

• Data volume: cost and transfer speed
• Data analysis: scalable algorithms
• Data quality: batch effects, missing data, missing metadata, concept drift
• Data integration: multi-modal modelling
• Reproducible and robust research

Page 30:

Data volume

• Move algorithms to the data
– Put compute close to local data
– Commercial cloud (e.g. BaseSpace, Cytobank)
– Bespoke secure cloud (e.g. 100K genomes project)

• Issues to consider
– Will your algorithms give same results?
– Will the analysis be reproducible in the future?
– How to integrate across resources?

Page 31:

Data analysis

• Scaling up algorithms, e.g. Deep learning libraries integrating CPU/GPU architectures
• Fast approximate methods
• Online/streaming data processing
• Avoid solving compute-intensive intermediate tasks: e.g. avoid genomic alignment prior to counting sub-sequence matches (k-mers)
• Mixed precision numerics
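The k-mer point in the list above can be made concrete: sub-sequences of length k can be counted directly from the reads in a single streaming pass, with no alignment to a reference genome. The sketch below is a plain-Python illustration; production tools use far more compact data structures, and the function name and toy reads are made up for the example.

```python
from collections import Counter
from typing import Iterable

def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every length-k substring across a stream of reads.

    `reads` can be any iterable (for example a generator over a file),
    so only the counts, not the reads, are held in memory.
    """
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy input: two short reads.
counts = count_kmers(["ACGTACGT", "CGTA"], k=4)
print(counts.most_common())
```

Reads shorter than k contribute nothing, because the inner range is empty. Memory use grows with the number of distinct k-mers, which is where the "fast approximate methods" bullet (for example probabilistic counting) would come in.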

Page 32:

Data analysis

Methods for Machine Learning no longer simply assessed on predictive accuracy

Page 33:

Data quality

Big collected data are typically not designed for a single research question (or any research question)

We need methods to deal with:

Confounders, batch effects, missing data, missing metadata, concept drift, outliers…

(while remaining scalable)
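As one concrete case from the list above, a batch effect that shifts every measurement in a batch by a constant can be reduced by centring each batch on its own mean. This is a deliberately naive sketch with invented data and a hypothetical function name. It is only appropriate when batches are not confounded with the biological condition of interest; otherwise centring removes the signal along with the artefact.

```python
from collections import defaultdict
from statistics import mean

def center_by_batch(values, batches):
    """Subtract each batch's mean from the values measured in that batch."""
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_means = {b: mean(vs) for b, vs in by_batch.items()}
    return [v - batch_means[b] for v, b in zip(values, batches)]

# Hypothetical measurements: batch "b" reads about 10 units higher than "a".
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["a", "a", "a", "b", "b", "b"]
corrected = center_by_batch(values, batches)
print(corrected)
```

The correction is a single pass over the data plus one mean per batch, so it stays cheap at scale; handling confounders, missing data and drift generally needs an explicit statistical model rather than a fix of this kind.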

Page 34:

Page 35:

Robust and reproducible research

Publish data, code, workflows, version numbers, containers…

Results should not depend strongly on arbitrary modelling choices: “shake the model” (Chris Holmes)

“Hypothesis selection” leads to upward significance bias
• Try to break your models
• Use robust models
• Use bootstrapping

Keep track of all hypotheses you have considered
• Store your working history – notebook science
• Publish negative results
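The "use bootstrapping" advice above can be sketched as a percentile bootstrap: resample the data with replacement many times, recompute the statistic each time, and read an interval off the resulting distribution. The function name, the defaults and the data are illustrative assumptions; the fixed seed is there so the analysis can be rerun exactly.

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` at level 1 - alpha."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(data)
    stats = sorted(
        stat([rng.choice(data) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples))
    return stats[lo_idx], stats[hi_idx]

# Hypothetical measurements.
data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.2, 3.0, 2.6, 2.3]
ci_low, ci_high = bootstrap_ci(data)
print(f"mean = {mean(data):.2f}, 95% interval = ({ci_low:.2f}, {ci_high:.2f})")
```

Changing the seed or the number of resamples and checking that the interval barely moves is a small instance of "shaking the model".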

Page 36:

Robust and reproducible research
• Build reproducibility into your routine – don’t wait until after your paper is accepted
• Don’t feature here:

Page 37:

Conclusion

• Research is increasingly data-driven across all fields – Data Science is now ubiquitous

• New challenges come from the scale, complexity and nature of data:
Big data – scalable algorithms and architectures
Complex data – better models: bottom up and top down
Messy data – statistical thinking is essential
Human data – ethical dimensions are of key importance
Accessible data – a valuable common resource