Toward Enabling Reproducibility for Data-Intensive Research using the Whole Tale Platform
Victoria Stodden, Associate Professor, School of Information Sciences
University of Illinois at Urbana-Champaign
ParCo Symposium: Reproducibility in Data-Intensive Computing
Prague, CZ, September 10, 2019
Agenda
1. Infrastructure Contributions: The Whole Tale Project
2. Extending Whole Tale to Enable “Tales at Scale”
3. Infrastructure Challenges
Parsing Reproducibility
● Empirical Reproducibility:
  ○ traditional empirical experiments, e.g. at the bench/lab
● Statistical Reproducibility:
  ○ statistical methodology used permits generalizability of data inferences
● Computational Reproducibility:
  ○ transparency of computational steps that produce scientific findings
V. Stodden. (2013). Resolving Irreproducibility in Empirical and Computational Research. IMS Bulletin.
Whole Tale: Merging Science & Cyberinfrastructure Pathways
Whole Tale Collaboration (PI Team)
● U Illinois (NCSA): Bertram Ludäscher, Victoria Stodden, Matt Turk
  ○ overall lead (co-operative agreement)
  ○ reproducibility; provenance; open source software development; outreach
● U Chicago (Globus): Kyle Chard
  ○ data transfer & storage; compute; infrastructure
● UC Santa Barbara (NCEAS): Matt Jones
  ○ (meta-)data publishing; provenance; repositories
● U Texas, Austin (TACC): Niall Gaffney
  ○ compute; HTC; “big tale”; Science Gateways
● U Notre Dame (CRC): Jarek Nabrzyski
  ○ UX design; UI design
Simplifying Computational Reproducibility in Whole Tale
● Researchers can easily package and share tales:
  ○ Data, Code, and Compute Environment
    ■ including narrative and workflow information, including inputs, outputs, and intermediates
  ○ to re-create the computational results from a scientific study
  ○ achieving computational reproducibility
  ○ thus “setting the default to reproducible.”
● Also empowers users to verify and extend results with different data, methods, and environments.
V. Stodden, D. H. Bailey, J. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the Default to Reproducible: Reproducibility in Computational and Experimental Mathematics, ICERM Workshop 2013.
Whole Tale: What’s in a name…
A Double Entendre:
  ○ Whole tale: captures the end-to-end scientific discovery story, including computational aspects
  ○ Long tail: includes all computational research, e.g. bespoke or small scale research
Addresses Problems scientists face:
  ○ Reproducibility (and reuse) challenges in computational & data-enabled research (e.g. data + code access, dependency hell, …)
Whole Tale Approach:
  ○ directly respond to community needs and requirements
The Need for a Platform for Reproducible Research
● Enable researchers to (easily) manage the complete conduct of a computational experiment and permit its exposure as a publishable “Tale”
● Address the two trends simultaneously:
  ○ improved transparency so researchers can run much more ambitious computational experiments.
  ○ and better computational experiment infrastructure will allow researchers to be more transparent.
D. Donoho and V. Stodden. (2015). Reproducible Research in the Mathematical Sciences. The Princeton Companion to Applied Mathematics, Ed. N. J. Higham.
So what is Whole Tale?
● A web-based, open source platform for reproducible research for the creation, publication, and execution of tales: executable research objects that capture data, code, and details of the computing environment used to produce research findings
● Driven by Community Engagement:
  ○ Working groups, internships, collaborations, etc.
● Enhances Education & Training:
  ○ Training for reproducibility; use of Whole Tale in the classroom
WT Software Development
● Open-Source Development Model
  ○ across 5 collaborative sites
● All source is open:
  ○ https://github.com/whole-tale/
● Developers are expected to follow the:
  ○ Developer's guide
● Open communication via:
  ○ weekly calls with public meeting notes
● Software releases follow a:
  ○ Development plan
Development
Workshops & Working Groups
What exactly is (in) a Tale?
● Tale = executable research object, i.e.
  ○ data (references)
  ○ + code (computational methods)
  ○ + narrative (traditional science story)
  ○ + compute environment (e.g. RStudio, Jupyter)
● Captured in a standards-based tale format complete with metadata
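The composition of a tale listed above can be pictured as a small record type. The following is a minimal sketch only: the `Tale` class, its field names, and the JSON layout are hypothetical illustrations, not the actual Whole Tale format or API.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Tale:
    """Hypothetical model of a tale; NOT the actual Whole Tale format."""

    title: str
    data_refs: list = field(default_factory=list)   # references to data, not copies
    code_files: list = field(default_factory=list)  # computational methods
    narrative: str = ""                             # the traditional science story
    environment: str = ""                           # e.g. "RStudio" or "Jupyter"
    metadata: dict = field(default_factory=dict)    # descriptive metadata

    def missing_parts(self):
        """Return the names of the four tale components that are still empty."""
        parts = {
            "data": self.data_refs,
            "code": self.code_files,
            "narrative": self.narrative,
            "environment": self.environment,
        }
        return [name for name, value in parts.items() if not value]

    def to_json(self):
        """Serialize the tale, including its metadata, to a JSON string."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


tale = Tale(
    title="Example tale",
    data_refs=["doi:10.0000/placeholder"],  # placeholder identifier, not a real DOI
    code_files=["analysis.ipynb"],
    narrative="README.md",
    environment="Jupyter",
    metadata={"authors": ["A. Researcher"]},
)
print(tale.missing_parts())  # an empty list means all four components are present
```

Keeping data as references rather than copies mirrors the "data (references)" bullet: the record points at registered datasets instead of embedding them.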
Code/Narrative
Compute environment
Data
Browse Existing Tales…
…Compose New Tales…
…Run & Interact with Tales…
…Use Tale Metadata…
…Integrate Data Repos with Whole Tale!
● Enables turnkey exploratory data analysis on existing published datasets
● DataONE and Dataverse networks cover >90 major research repositories!
Accelerating Reproducible Open Science
[Figure labels: Input Data; Research Question; Analysis; Output; Data; Narrative; Published Article; Verify/Reproduce/Re-use; Accelerate]
Whole Tale Platform Overview
Research & Quantitative Computational Environments
External Data Sources
Code + Narrative
● Authenticate using your institutional identity
● Access commonly-used computational environments
● Easily customize your environment
● Reference and access externally registered data
● Create or upload your data and code
● Add metadata (including provenance information)
● Submit code, data, and environment to archival repository
● Get a persistent identifier
● Share for verification and re-use
Publish Tale
Create tale
Analyze data
Whose problems are we addressing?
● Researchers, scientists, others may be
  ○ creators of tales, e.g. share your findings in a tale
  ○ reviewers of articles can review tales, e.g. reproduce new scientific claims
  ○ (re-)users of tales, e.g. build upon progress of others
Extending WT to Data-Intensive Research
● Motivating scenario: The Renaissance Simulations Laboratory (RSL) provides access to over 70 TB of raw data and derived data products. RSL exposes data available on systems at the San Diego Supercomputer Center via Jupyter web-based interactive environments.
● Relevant features:
  1. the RS data is large, impractical to transfer, and requires large-scale resources to analyze.
  2. the research community leverages Jupyter interactive environments for both exploratory and primary analytical work, with some analysis requiring batch compute resources.
  3. the community is interested in sharing resulting research artifacts (e.g., code, derived data) for both re-execution and re-use.
Extending WT to Data-Intensive Research
Tale frontend and HPC workloads on WT deployment cluster: Users can launch local HPC jobs using standard system calls.
Tale frontend on single HPC compute node: the Tale frontend (Jupyter/RStudio notebooks) runs on a compute node in an HPC cluster and launches independent HPC jobs using standard system calls.
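Launching a job "using standard system calls" can be sketched from inside a notebook as an ordinary child process. This is an illustration under stated assumptions, not Whole Tale code: `launch_job` is a hypothetical helper, and a tiny Python process stands in for a real HPC executable.

```python
import subprocess
import sys


def launch_job(command, timeout_s=60):
    """Run a command as a child process and return (exit code, stdout).

    Hypothetical helper: it wraps the standard process-spawning call that a
    notebook running on the same node as the workload could use.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str rather than bytes
        timeout=timeout_s,    # do not let a hung job block the frontend forever
        check=False,          # report failures via the exit code, not an exception
    )
    return completed.returncode, completed.stdout


# Stand-in workload: a small Python process instead of a real HPC binary.
code, out = launch_job([sys.executable, "-c", "print(sum(range(10)))"])
print(code, out.strip())
```

The timeout reflects a concern raised later in the deck: a frontend that blocks on a long-running job is no longer responsive, so long workloads are better handed to a queuing system.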
Extending WT to Data-Intensive Research
Tale frontend on HPC compute node with local LRM (cluster queuing system) access: Allows submission of HPC jobs to the queuing system of the cluster.
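Submission to a cluster queuing system usually means handing the LRM a batch script. The sketch below assumes Slurm as the LRM, which the slides do not specify; `render_batch_script`, the job name, and the `./analyze` command are hypothetical. It only builds the script text and does not submit anything.

```python
def render_batch_script(job_name, nodes, walltime, command):
    """Build the text of a batch script for an assumed Slurm LRM.

    The #SBATCH lines use standard Slurm options (job name, node count,
    wall-clock limit); every value passed in here is illustrative.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={walltime}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: 4 nodes for one hour running a placeholder executable.
script = render_batch_script("tale-analysis", 4, "01:00:00", "srun ./analyze")
print(script)
```

On a Slurm cluster the resulting file would typically be handed to `sbatch`; other queuing systems use different directives, so a real integration would need one renderer per LRM or a middleware layer that hides the difference.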
Tale frontend on HPC compute nodes with MPI: launch the Tale frontend as an MPI job. The cluster LRM (queuing system) allocates the number of nodes requested at the submission of the Tale frontend job and sets the appropriate MPI environment. The Tale frontend would run on the lead node allocated to the MPI job by the LRM and would launch MPI subjobs on the nodes allocated to the MPI job.
Extending WT to Data-Intensive Research
Tale frontend on WT cluster with remote LRM access: Tale frontends run alongside WT services, but HPC jobs can be submitted to remote clusters via the middleware.
Decoupled Tale frontend with LRM Remote Access: Tale frontends run on various resources and HPC jobs can run on any resources supported by the middleware. Users could bypass the limitations present in the default resources provided by the WT infrastructure e.g. a user with cloud access could request that a Tale be run on cloud resources under the user’s account.
Challenges to Extending WT
● The need to maintain responsiveness of Tale frontends
● Dependence on Middleware: Scalability and Longevity
● Managing HPC network restrictions
  ○ Tale frontends, however, require incoming network connections in order to expose their user interface. Consequently, a general solution involving Tale frontends on compute nodes requires some form of proxying of connections from the Whole Tale cluster to HPC cluster compute nodes. Restrictions on incoming network connections are likely to be a result of local security policies, and therefore proxying, even if authenticated, may be seen as an unwelcome circumvention of such policies.
● Containerization and HPC workloads, e.g. a dependence on specific hardware, which can prevent the code from being re-run if that hardware becomes unavailable
● Data access and quasi-locality: if Tale frontends and/or HPC workloads run on HPC resources on which copies of data are already available, the WT implementation would be inefficient, since each file would be transferred once to WT resources and once for each Tale frontend instance that accesses the file
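The transfer overhead described in the data access bullet can be made concrete with a little arithmetic: one transfer to WT resources plus one per Tale frontend instance that accesses the file. The function names and the file size below are hypothetical, chosen only to illustrate the count.

```python
TB = 10**12  # bytes in a terabyte (decimal convention)


def transfers_per_file(frontend_instances):
    """Number of times one file is moved: once to WT, once per frontend."""
    if frontend_instances < 1:
        raise ValueError("at least one frontend instance must access the file")
    return 1 + frontend_instances


def bytes_moved(file_size_bytes, frontend_instances):
    """Total bytes transferred for one file under the scheme described above."""
    return file_size_bytes * transfers_per_file(frontend_instances)


# Hypothetical case: a 2 TB file accessed by 3 Tale frontend instances.
total = bytes_moved(2 * TB, 3)
print(total / TB)  # 4 transfers of 2 TB each, i.e. 8.0 TB moved
```

When a copy of the data already sits on the HPC resource, all of these transfers are avoidable, which is the inefficiency the bullet points to.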
Conclusion
Whole Tale offers potential for enabling reproducibility for Data-Intensive computing, but is not without challenges requiring innovation in the software architecture and infrastructure implementation.
However, reproducibility is now recognized as a pressing issue of which computational infrastructure is one key part.
Infrastructure supporting transparency and reproducibility will be used not out of hygiene or as a best practice, but because it enables increasingly ambitious computational research.