Toward Enabling Reproducibility for Data-Intensive Research using the Whole Tale Platform

Victoria Stodden
Associate Professor, School of Information Sciences

University of Illinois at Urbana-Champaign

ParCo Symposium
Reproducibility in Data-Intensive Computing

Prague, CZ
September 10, 2019

Agenda

1. Infrastructure Contributions: The Whole Tale Project

2. Extending Whole Tale to Enable “Tales at Scale”

3. Infrastructure Challenges

Parsing Reproducibility

● Empirical Reproducibility:
○ traditional empirical experiments, e.g. at the bench/lab

● Statistical Reproducibility:
○ statistical methodology used permits generalizability of data inferences

● Computational Reproducibility:
○ transparency of computational steps that produce scientific findings

V. Stodden. (2013). Resolving Irreproducibility in Empirical and Computational Research. IMS Bulletin.

Whole Tale: Merging Science & Cyberinfrastructure Pathways

Whole Tale Collaboration (PI Team)

● U Illinois (NCSA) Bertram Ludäscher, Victoria Stodden, Matt Turk
○ overall lead (co-operative agreement)
○ reproducibility; provenance; open source software development; outreach

● U Chicago (Globus) Kyle Chard
○ data transfer & storage; compute; infrastructure

● UC Santa Barbara (NCEAS) Matt Jones
○ (meta-)data publishing; provenance; repositories

● U Texas, Austin (TACC) Niall Gaffney
○ compute; HTC; “big tale”; Science Gateways

● U Notre Dame (CRC) Jarek Nabrzyski
○ UX design; UI design

Simplifying Computational Reproducibility in Whole Tale

● Researchers can easily package and share tales:
○ Data, Code, and Compute Environment
■ including narrative and workflow information, including inputs, outputs, and intermediates
○ to re-create the computational results from a scientific study
○ achieving computational reproducibility
○ thus “setting the default to reproducible.”

● Also empowers users to verify and extend results with different data, methods, and environments.

V. Stodden, D. H. Bailey, J. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the Default to Reproducible: Reproducibility in Computational and Experimental Mathematics, ICERM Workshop 2013.

Whole Tale: What’s in a name…

A Double Entendre:
○ Whole tale: captures the end-to-end scientific discovery story, including computational aspects
○ Long tail: includes all computational research, e.g. bespoke or small-scale research

Addresses Problems scientists face:
○ Reproducibility (and reuse) challenges in computational & data-enabled research (e.g. data + code access, dependency hell, …)

Whole Tale Approach:
○ directly respond to community needs and requirements

The Need for a Platform for Reproducible Research

● Enable researchers to (easily) manage the complete conduct of a computational experiment and permit its exposure as a publishable “Tale”

● Address the two trends simultaneously:
○ improved transparency will allow researchers to run much more ambitious computational experiments.
○ and better computational experiment infrastructure will allow researchers to be more transparent.

D. Donoho and V. Stodden. (2015). Reproducible Research in the Mathematical Sciences. The Princeton Companion to Applied Mathematics, Ed. N. J. Higham.

So what is Whole Tale?

● A web-based, open source platform for reproducible research for the creation, publication, and execution of tales: executable research objects that capture data, code, and details of the computing environment used to produce research findings

● Driven by Community Engagement:
○ Working groups, internships, collaborations, etc.

● Enhances Education & Training:
○ Training for reproducibility; use of Whole Tale in the classroom

WT Software Development

● Open-Source Development Model
○ across 5 collaborative sites
● All source is open:
○ https://github.com/whole-tale/
● Developers are expected to follow the:
○ Developer's guide
● Open communication via:
○ weekly calls with public meeting notes
● Software releases follow a:
○ Development plan

Development

Workshops & Working Groups

What exactly is (in) a Tale?

● Tale = executable research object, i.e.
○ data (references)
○ + code (computational methods)
○ + narrative (traditional science story)
○ + compute environment (e.g. RStudio, Jupyter)

● Captured in a standards-based tale format complete with metadata
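
The four components listed above can be pictured as a small record that serializes to a manifest. The sketch below is illustrative only: the class name `Tale`, its field names, and the JSON layout are assumptions made for this example and are not the actual Whole Tale format, and the dataset identifier is a placeholder.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List
import json


@dataclass
class Tale:
    """Hypothetical, simplified stand-in for a tale (not the real WT format)."""
    title: str
    data_refs: List[str]    # references to external data, not copies of it
    code_paths: List[str]   # computational methods
    narrative: str          # file holding the traditional science story
    environment: str        # e.g. "RStudio" or "Jupyter"
    metadata: Dict[str, str] = field(default_factory=dict)

    def to_manifest(self) -> str:
        """Serialize the four components plus metadata as sorted JSON."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


tale = Tale(
    title="Example tale",
    data_refs=["doi:10.0000/example-dataset"],  # placeholder identifier
    code_paths=["analysis.ipynb"],
    narrative="README.md",
    environment="Jupyter",
    metadata={"author": "A. Researcher"},
)
manifest = json.loads(tale.to_manifest())
print(sorted(manifest))
```

Keeping data as references rather than embedded copies mirrors the "data (references)" item in the list above.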

Code/Narrative

Compute environment

Data

Browse Existing Tales…

…Compose New Tales…

…Run & Interact with Tales

…Use Tale Metadata

…Integrate Data Repos with Whole Tale!

● Enables turnkey exploratory data analysis on existing published datasets

● DataONE and Dataverse networks cover >90 major research repositories!

Accelerating Reproducible Open Science

[Figure labels: Input Data; Research Question; Analysis; Output; Data; Narrative; Published Article; Verify/Reproduce/Re-use; Accelerate]

Whole Tale Platform Overview

Research & Quantitative Computational Environments

External Data Sources

Code + Narrative

● Authenticate using your institutional identity
● Access commonly-used computational environments
● Easily customize your environment
● Reference and access externally registered data

● Create or upload your data and code
● Add metadata (including provenance information)
● Submit code, data, and environment to archival repository
● Get a persistent identifier
● Share for verification and re-use

Publish Tale

Create tale
Analyze data

Whose problems are we addressing?

● Researchers, scientists, others may be
○ creators of tales, e.g. share your findings in a tale
○ reviewers of articles can review tales, e.g. reproduce new scientific claims
○ (re-)users of tales, e.g. build upon progress of others

Extending WT to Data-Intensive Research

● Motivating scenario: The Renaissance Simulations Laboratory (RSL) provides access to over 70 TB of raw data and derived data products. RSL exposes data available on systems at the San Diego Supercomputer Center via Jupyter web-based interactive environments.

● Relevant features:

1. the RS data is large, impractical to transfer, and requires large-scale resources to analyze.
2. the research community leverages Jupyter interactive environments for both exploratory and primary analytical work, with some analysis requiring batch compute resources.
3. the community is interested in sharing resulting research artifacts (e.g., code, derived data) for both re-execution and re-use.

Tale frontend and HPC workloads on WT deployment cluster: Users can launch local HPC jobs using standard system calls.

Tale frontend on single HPC compute node: The Tale frontend (Jupyter/RStudio notebooks) runs on compute nodes in an HPC cluster and launches independent HPC jobs from there using standard system calls.
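
As a rough illustration of "launching jobs using standard system calls" from a notebook, the sketch below starts a child process and collects its exit code and output. The helper name `launch_job` is hypothetical, and the workload is a trivial stand-in; a real tale would invoke a simulation or analysis executable instead.

```python
import subprocess
import sys


def launch_job(argv, timeout_s=60):
    """Run a command as a child process; return (exit code, stripped stdout)."""
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s
    )
    return result.returncode, result.stdout.strip()


# Stand-in workload: the current Python interpreter evaluating 6 * 7.
code, out = launch_job([sys.executable, "-c", "print(6 * 7)"])
print(code, out)
```

Because the job runs on the same node as the frontend, a long or heavy job competes with the notebook for resources, which is one reason the later configurations hand work to a queuing system.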

Tale frontend on HPC compute node with local LRM (cluster queuing system) access: Allows submission of HPC jobs to the queuing system of the cluster.

Tale frontend on HPC compute nodes with MPI: launch the Tale frontend as an MPI job. The cluster LRM (queuing system) allocates the number of nodes requested at the submission of the Tale frontend job and sets the appropriate MPI environment. The Tale frontend would run on the lead node allocated to the MPI job by the LRM and would launch MPI subjobs on the nodes allocated to the MPI job.
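
A minimal sketch of what submission to the cluster queuing system might look like from a Tale frontend, assuming the LRM is Slurm (the slides do not name a specific LRM). The function names, the job name, and the `./analyze --input data.h5` command are hypothetical; the `#SBATCH` directives and `srun` are standard Slurm usage.

```python
import shutil
import subprocess


def build_batch_script(job_name, nodes, tasks_per_node, walltime, command):
    """Return a Slurm batch script as text (Slurm is an assumption here)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={walltime}",
        f"srun {command}",  # srun starts the tasks on the allocated nodes
    ]
    return "\n".join(lines) + "\n"


def submit(script_text):
    """Pipe the script to sbatch; return None when sbatch is not installed."""
    if shutil.which("sbatch") is None:
        return None
    done = subprocess.run(
        ["sbatch"], input=script_text, capture_output=True, text=True
    )
    return done.stdout.strip()


script = build_batch_script(
    "tale-subjob", 4, 16, "00:30:00", "./analyze --input data.h5"
)
print(script)
```

Separating script construction from submission keeps the generated script inspectable, so it could be stored with the tale as part of its provenance.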

Tale frontend on WT cluster with remote LRM access: Tale frontends run alongside WT services, but HPC jobs can be submitted to remote clusters via the middleware.

Decoupled Tale frontend with remote LRM access: Tale frontends run on various resources, and HPC jobs can run on any resources supported by the middleware. Users could bypass the limitations of the default resources provided by the WT infrastructure; for example, a user with cloud access could request that a Tale be run on cloud resources under the user’s account.

Challenges to Extending WT

● The need to maintain responsiveness of Tale frontends

● Dependence on Middleware: Scalability and Longevity

● Managing HPC network restrictions

○ Tale frontends, however, require incoming network connections in order to expose their user interface. Consequently, a general solution involving Tale frontends on compute nodes requires some form of proxying of connections from the Whole Tale cluster to HPC cluster compute nodes. Restrictions on incoming network connections are likely to be a result of local security policies, and therefore proxying, even if authenticated, may be seen as an unwelcome circumvention of such policies.

● Containerization and HPC workloads, e.g. a dependence on specific hardware, which can affect the ability of the code to be re-run if that hardware becomes unavailable

● Data access and quasi-locality: If Tale frontends and/or HPC workloads run on HPC resources on which copies of data are already available, the WT implementation would be inefficient, since each file would be transferred once to WT resources and once for each Tale frontend instance that accesses the file
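
The transfer cost described in the last bullet can be made concrete with a little arithmetic: each file moves once to WT resources and then once per Tale frontend instance, so a file read by n frontends is moved 1 + n times. The function names and the example sizes below are illustrative assumptions.

```python
def transfers_via_wt(n_files, n_frontends):
    """Each file moves once to WT resources, then once per frontend instance."""
    return n_files * (1 + n_frontends)


def bytes_moved(file_sizes, n_frontends):
    """Total bytes moved when every frontend instance reads every file."""
    return sum(file_sizes) * (1 + n_frontends)


GB = 10**9
print(transfers_via_wt(n_files=100, n_frontends=3))        # 400
print(bytes_moved([2 * GB, 5 * GB], n_frontends=3) // GB)  # 28
```

If the frontends could instead read the copies already present on the HPC resource, none of these transfers would be needed.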

Conclusion

Whole Tale offers potential for enabling reproducibility for data-intensive computing, but is not without challenges requiring innovation in the software architecture and infrastructure implementation.

However, reproducibility is now recognized as a pressing issue of which computational infrastructure is one key part.

Infrastructure supporting transparency and reproducibility will be used not out of hygiene or as a best practice, but because it enables increasingly ambitious computational research.