Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science
![Page 1: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/1.jpg)
05/02/2023 1
JUPYTER ASCENDING: A PRACTICAL HAND GUIDE TO GALACTIC SCALE, REPRODUCIBLE DATA SCIENCE
John Fonner, PhD
University of Texas at Austin
April 5th, 2016
![Page 2: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/2.jpg)
Photos, Tweets, and hate mail all welcome! Email: [email protected] Twitter: @johnfonner
![Page 3: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/3.jpg)
SCIENCE AS A SECOND THOUGHT
1. Formulate a theory
2. Gather data
3. Learn about data storage
4. Learn about data movement protocols
5. Lose data
6. Check out of rehab
7. Learn about backup and replication
8. Gather data
9. Learn about versioning
10. Start preliminary analysis
11. Buy a newer laptop
12. Buy more memory
13. Buy a desktop with more memory
14. Buy a bigger monitor & GPUs “for work”
15. Google “250GB Excel Spreadsheet”
16. Learn about batch processing
17. Learn about batch schedulers
18. Learn about patience
19. Learn more about data storage
20. Learn about distributed systems
21. Go back through notes to remember the science question
22. Learn R & Python
23. Learn Linux admin
24. Finish preliminary analysis
25. Grow a ponytail
26. Write a paper
27. Learn about data publishing
28. Learn about reproducibility
29. Plot the death of your advisor/dept. head
30. Apply for grants & research allocations on public systems
31. Wait to apply next time
32. Finish analyzing data
33. Reformulate your theory
34. Goto 1
![Page 4: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/4.jpg)
SCIENTIFIC REPRODUCIBILITY
![Page 5: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/5.jpg)
SOME ASSEMBLY REQUIRED…
![Page 6: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/6.jpg)
![Page 7: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/7.jpg)
SCIENTISTS, WITH FEW EXCEPTIONS, ARE NOT TRAINED PROGRAMMERS
• Research is hard
• Coding is hard
• Research code is well designed, documented, leverages design patterns, highly reusable, portable, and usually open source.
![Page 8: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/8.jpg)
ACCESSIBILITY >= CAPABILITY
• For scientific reproducibility, the impact of your work will be more about accessibility than capability
• Domain grad students, not sys admins, are the early adopters
• Where can we focus effort to create community around capability?
![Page 9: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/9.jpg)
What has changed the least about the computation you do over the last 10 years?
What do we ask domain researchers to learn to use our tools and data?
• Memory/CPU/Disk
• Operating System
• Applications
• Interface
![Page 10: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/10.jpg)
![Page 11: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/11.jpg)
![Page 12: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/12.jpg)
![Page 13: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/13.jpg)
Decoupling the technology “stack”

“Reproducers”
• Web Browser
• GUIs
• Windows / Mac OS Support
• Sample Data and Sample Workflows

“Producers”
• Linux CLI
• Hadoop / GPFS / Lustre
• Clusters / Clouds / Containers
• Dockerfile / Makefile / Ansible
![Page 14: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/14.jpg)
BACKEND INFRASTRUCTURE: SYSTEMS
• Categorize systems as either Storage or Execution
• Describe and support relevant protocols, directories, schedulers, and quotas
• Each system includes the credentials to log into the system (SSH Keys, X509, username/password)
• Register everything with a JSON document
http://agaveapi.co/documentation/tutorials/system-management-tutorial/
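As a rough illustration of the "register everything with a JSON document" idea, the sketch below assembles a description of a storage system in Python. The field names (`id`, `type`, `storage`, `auth`, and so on), the host name, and the helper `describe_storage_system` are hypothetical placeholders chosen for this example, not the schema Agave expects; the tutorial linked above is the authority on the real format.

```python
import json


def describe_storage_system(system_id, host, root_dir, username):
    """Build a dict describing a storage system (hypothetical field names)."""
    return {
        "id": system_id,
        "type": "STORAGE",          # the other category on the slide is Execution
        "storage": {
            "host": host,
            "protocol": "SFTP",     # assumed protocol, for illustration only
            "rootDir": root_dir,
            "auth": {
                "type": "SSHKEYS",  # the slide lists SSH keys, X509, password
                "username": username,
                # Real credentials would be supplied here; never commit them.
            },
        },
    }


system = describe_storage_system(
    "example-storage", "data.example.org", "/home/jdoe", "jdoe"
)
system_json = json.dumps(system, indent=2)
print(system_json)
```

Keeping the whole description in one document means it can be versioned and shared like any other file.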
![Page 15: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/15.jpg)
BACKEND INFRASTRUCTURE: APPS
An “App” is a versioned instance of a software package on a specific Execution System
App assets are bundled into a directory and stored on a Storage System
Apps can be private, shared with individual users, or made public
Public apps are compressed, assigned a checksum, and stored in a protected space
http://agaveapi.co/documentation/tutorials/app-management-tutorial/
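The "compressed, assigned a checksum" step can be pictured with a small Python sketch. It assumes a gzip-compressed tar archive and SHA-256, neither of which the slides specify; the directory name `my-app-1.0` and the helpers `bundle_app` and `sha256_of` are made up for the example.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def bundle_app(app_dir, archive_path):
    """Compress an app directory into a gzip-compressed tar archive."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(app_dir), arcname=Path(app_dir).name)
    return Path(archive_path)


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    app_dir = Path(tmp) / "my-app-1.0"  # hypothetical app bundle
    app_dir.mkdir()
    (app_dir / "wrapper.sh").write_text("#!/bin/bash\necho hello\n")

    archive = bundle_app(app_dir, Path(tmp) / "my-app-1.0.tgz")
    checksum = sha256_of(archive)
    print(archive.name, checksum)
```

Anyone who later downloads the archive can recompute the digest and compare it with the recorded one to confirm the app has not changed.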
![Page 16: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/16.jpg)
BACKEND INFRASTRUCTURE: JOBS
A “Job” is an execution of an App with a specific set of input files and parameters
All jobs are given an ID, all inputs and parameters are preserved, output is also tracked
Jobs can be shared with others
http://agaveapi.co/documentation/tutorials/job-management-tutorial/
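One way to picture the App and Job definitions above is as immutable records. The sketch below is a toy data model under that assumption, not Agave's actual schema: the class and field names, the app `wordcount`, and the system `example-cluster` are all hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class App:
    """A versioned software package tied to one execution system."""
    name: str
    version: str
    execution_system: str


@dataclass(frozen=True)
class Job:
    """One execution of an App with fixed inputs and parameters."""
    app: App
    inputs: dict
    parameters: dict
    # Every job gets its own ID so it can be looked up and shared later.
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self):
        """Serialize the full record, including the nested App."""
        return json.dumps(asdict(self), sort_keys=True)


app = App("wordcount", "1.0", "example-cluster")
job = Job(app, inputs={"text": "data/book.txt"}, parameters={"min_length": 3})
record = job.to_json()
print(record)
```

Because the record keeps the app version, the system, the inputs, and the parameters together, it holds what someone would need to repeat the run.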
![Page 17: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/17.jpg)
![Page 18: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/18.jpg)
![Page 19: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/19.jpg)
![Page 20: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/20.jpg)
![Page 21: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/21.jpg)
![Page 22: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/22.jpg)
DEVELOPER COMMAND-LINE TOOLS
• https://bitbucket.org/agaveapi/cli
• Requires bash and Python’s json.tool
• Uses caching for authentication
• Parses JSON responses to condense output
As a Linux user, this is home-sweet-home
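To give a feel for what "parses JSON responses to condense output" can mean, here is a minimal Python sketch. The response shape, the field names, and the `condense` helper are invented for the example and are not taken from the Agave CLI.

```python
import json

# A made-up response body; a real service's fields will differ.
raw_response = """
{
  "status": "success",
  "result": [
    {"id": "job-001", "name": "demo", "status": "FINISHED", "owner": "jdoe"},
    {"id": "job-002", "name": "demo-2", "status": "RUNNING", "owner": "jdoe"}
  ]
}
"""


def condense(response_text, keys=("id", "status")):
    """Reduce each result entry to one line holding only the chosen keys."""
    payload = json.loads(response_text)
    return [" ".join(str(item[k]) for k in keys) for item in payload["result"]]


lines = condense(raw_response)
print("\n".join(lines))
```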
![Page 23: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/23.jpg)
WHAT ABOUT JUPYTER?
Bleeding edge research will never be on a webpage
Data exploration “outside the app” also needs to be captured
An infrastructure for responsible computing at scale inevitably must support responsible data exploration
Jupyter has broad OS support, domain adoption, domain libraries, and a more interactive UI
![Page 24: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/24.jpg)
AGAVEPY
• github.com/TACC/agavepy
• Pythonic wrapper for all Agave endpoints
• pip install agavepy
• Developers actively “dogfooding” the module
• (Obviously) usable within Jupyter
• Has had greater uptake by users (not just developers)
![Page 25: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/25.jpg)
AGAVE-AWARE JUPYTERHUB
• Going one step further: give users a notebook
• jupyter.public.tenants.prod.agaveapi.co/
• (Free) account creation here: public.tenants.prod.agaveapi.co/create_account
• Beta implementation at the moment
  • Data purges during updates
  • Limited capacity on the current VM
• All notebooks run inside Docker containers
![Page 26: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/26.jpg)
WHAT’S NEXT?
• Full-featured developer portal
• Open-source reference implementation of an Angular JavaScript portal built on Agave
• Additional Jupyter notebook examples
• Production-grade support for a hosted JupyterHub
![Page 27: Jupyter Ascending: a practical hand guide to galactic scale, reproducible data science](https://reader036.fdocuments.in/reader036/viewer/2022062306/58e8d2551a28abb3398b5883/html5/thumbnails/27.jpg)
THANKS! QUESTIONS?
Email: [email protected]
Twitter: @johnfonner
TACC: www.tacc.utexas.edu
Agave: www.agaveapi.co
AgavePy: github.com/TACC/agavepy