eResearch NZ 2020
12-14 February 2020 | Dunedin Centre
Abstract booklet
Programme
Abstracts
*Click on the session title to read the abstract*
9:00
10:00 - 12:30
10:00
10:30 Keynote 1 - Rosie Hicks
11:30
12:30 - 13:30
Breakout Session 1
Session 1 A Session 1 B Session 1 C Session 1 D Session 1 E
Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)
13:30 Megan Guidry - Training: It's better
together
David Fellinger - Building a Federated
Research Collaborative
Miles Benton - Assessing the potential of
autonomous AI devices for portable real-time DNA
sequencers and deployable sensors
13:50 Ngoni Faya - Genomics Aotearoa Training Callum Walley - Engineering HPC: What’s
going on?
Ann Mc Cartney - Utilising Oxford Nanopore Data
for the Genome Assembly of Endemic New
Zealand Species
14:10 Murray Cadzow - Carpentries at Otago Marko Laban - Cloud-native technologies
in eResearch - benefits and challenges
Elizabeth Permina - Mice, organoids and single
cells: computational methods for cancer treatment
14:30 Christina Hall - Hybrid Training: a scalable
model for delivering hands-on training to
dispersed learners
Ryan Chard - Automating the Research
Data Lifecycle with Globus Automate
Eliatan Niktab - Network-based Nonparametric
Tests to Identify Genetic Modifiers of Rare
Diseases
14:50 Lightning talks: Riku Takei -
Internationalisation of The Carpentries –
Lessons learnt on the way / Matt Bixley -
Reproducible Posters: an Otago Theme
Lightning talks: Wallace Chase - Why so
slow? Molasses biased data transfers… /
Jun Huh - Learning How To Learn
Lightning talks: Alessandra Santana - Per-
sample pathway analysis tool for DNA methylation
data / Joseph Guhlin - I’m a Big Metal Fan: Big
Data at the Lowest Level
15:10 - 15:30
Birds-of-a-Feather (BoF) Sessions
Session 2 A Session 2 B Session 2 C Session 2 D Session 2 E
Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)
15:30 Megan Guidry - Building and Supporting a
New Zealand Digital Literacy Training
Community / Sara King - A Common
Thread: Creating community, working
together and enriching research
Laura Armstrong - Identifying, connecting
and citing research with persistent
identifiers.
Blair Bethwaite - Research Cloud NZ Workshop: Chris Scott - First steps in machine learning with NeSI (part 2)
Workshop: Gabriel Noaje - NVIDIA Accelerated
Computing Workshop
17:30 - 18:30
Conference Welcome Address
Lunch
Afternoon Tea
Registration Open
Session: Opening Ceremony and Keynotes
Workshop: Chris Scott - First steps in machine learning with NeSI (part 1)
Keynote 2 - Micaela Parker
Workshop: Gabriel Noaje - NVIDIA Accelerated
Computing Workshop
Wednesday 12 February
13:30 - 15:10
15:30 - 17:30
Welcome Function
Dunedin Centre
End of Day One
Thursday 13th February
9:00
9:30 - 10:30
9:30
10:30-11:00
Breakout Session 2
Session 3 A Session 3 B Session 3 C Session 3 D Session 3 E
Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)
11:00 Alexander Ritchie - Humanities Data
Untied - An Untapped Resource or just an
Untidy Office?
Wolfgang Hayek - Singularity containers
on HPC
11:20 Alan McCulloch - Data Pipelines and
Prisms
Lahiru Ariyasinghe - Challenges and
opportunities in timely and efficient delivery
of IT for eResearch projects.
Workshop: Daniel O'Byrne - The Basics of
Cloud Computing
11:40 Shona Mackie - Climate Data and
Computing Needs are Hotting Up!
Paula Andrea Martinez - Towards FAIR
principles for research software
12:00 Shiobhan Smith - Uniting equipment and
research publications: bigger than Ben
Hur?
Chris Hines - Strudel2: Increasing
accessibility of HPC Infrastructure
Rudiger Brauning - GBSathon: Benchmarking
reproducibility of Genotyping-By-Sequencing
analysis workflows through comparison with
SNP chip and pedigree data
Thomas Nicholson - Using comparative RNASeq
to identify small non-coding RNAs in bacterial
clades
Alana Alexander - Akoranga from research
consultation with Māori on sequencing the
genome of a taonga species
Matt Bixley - Naive Prediction of Cancer
Outcomes using Machine Learning
12:20 - 13:30
Breakout Session 3
Session 4 A Session 4 B Session 4 C Session 4 D Session 4 E
Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)
13:30 Lisa Thomasen - Influencing Data Culture
to Optimise Data Utilisation
Jun Huh - User journey-driven product
management
13:50 Brian Flaherty - Where Data Lives: NeSI,
taonga and growing repository services.
Vladimir Mencl - Enabling authentication at global
scale: an update on REANNZ services
14:10 Andrea Goethals - Digital Preservation
New Zealand
Dinindu Senanayake - HPC for life sciences:
handling the challenges posed by a domain that
relies on big-data
14:30 Sander Zwanenburg - Otago’s Network for
Engagement And Research: Mapping
Academic Expertise and Connections
Chris Hines - The Undies-Mate Un-Debate
14:50 Lightning talks: Lana Alsabbagh - Use of
the National Library’s Web and Twitter
Collections for Research / Jess Howie -
Support for Research Data Management in
university libraries – How far have we
come?
Nick Jones - Advancing New Zealand’s
computational research capabilities and
skills
Jeff Zais - Worldwide Trends in Computer
Architectures for Data Science
Andrew Lonie - Progress in the
Australian BioCommons
April Neoh - Beyond super
Lightning talks: Wallace Chase - Frozen data /
Adam Bartonicek - Why overfitting is bad for
science: Lessons from psychology
15:10 - 15:30
15:30 - 17:30 Birds-of-a-Feather (BoF) Sessions
15:30 Jonny Flutey - Micro-credentials and
Research Skills Development
Jana Makar - Growing the eResearch workforce in an inclusive way
Anton Angelo - All Research Questions Are
Ethical Questions
Workshop: Blair Bethwaite - Containers in HPC Tutorial (part 2)
19:00 - 22:00
11:00-12:20
Morning Tea
Registration Open
Session: Plenary/Keynote
Lunch
Birds of a Feather: Brian Flaherty - Building a
national/regional data transfer platform: Globus BoF
Workshop: Blair Bethwaite - Containers in HPC Tutorial (part 1)
Keynote 3 - Richard Dean
Afternoon Tea
End of day 2
Conference Dinner
Larnach Castle
This event is included in full and student registrations; however, tickets are limited, so your attendance must be confirmed before the conference begins.
13:30 - 15:10
Friday 14 February
9:00
9:30 - 10:30
9:30
10:30 - 11:00
Breakout Session 4
Session 5 A Session 5 B Session 5 C Session 5 D Session 5 E
Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)
11:00 Richard Sinnott - Applied Deep Learning
for Diverse Research Communities
Stephanie Guichard - Data-intensive
approaches to finding and predicting
research outcomes for New Zealand health
research
Wallace Chase - How does REANNZ work?
11:20 Justin Baker - eResearch Collaboration
Projects – supporting CSIRO’s digital
science and research
Jonny Williams - Earth system modelling in
New Zealand – turning big data into big
science
Alexander Pletzer - Enhancing eResearch
productivity with NeSI's consultancy service
11:40 Jo Lane - Scientific supercomputing:
Teaching practical skills for credit
Nancy Lin - Data Analytic Transformation
Journey with Jupyter
Cheng-Hao Cai - Building Machine Learning
Systems on Microsoft Azure Cloud Machines
12:00 Matt Plummer - Running Rāpoi: Rebooting
Research Computing & Support at VUW
Carina Kemp - Building an International
FAIR Infrastructure for ‘Uniting’ Research
Data
Dan Sun - Big Internet Pipe and Cloud Saved My
Storage in Crisis
12:20
12:30 - 13:30
Birds-of-a-Feather (BoF) Sessions
Session 6 A Session 6 B Session 6 C Session 6 D Session 6 E
Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)
13:30 Nooriyah Lohani - Research Software
Engineering Community update and next
steps in New Zealand
Joep De Ligt - Scalable Workflows and
Reproducible Data Analysis (for Genomics)
Carina Kemp - Data movement challenges to
research productivity - examples and responses
End of day 3
Registration Open
Session: Plenary/Keynote
Morning Tea
Conference wrap-up
Workshop: Shiobhan Smith - United in data
management: Is it time for a national research data
management framework?
Alexis Tindall - Humanities, Arts and Social
Sciences: What have we learned, where are we
going?
13:30 - 15:30
Lunch
11:00 - 12:20
Keynote 4 - Amber Budden
*Programme subject to change*
Akoranga from research consultation with Māori on sequencing the genome of a taonga species
Alana Alexander1 and Benjamin Iwikau Te Aika2
1. Department of Anatomy, University of Otago; 2. Genomics Aotearoa
As uri (descendants) of Tangaroa (or Tāne-Mahuta in the pūrākau of some hapū), Hector’s
and Māui dolphins (Cephalorhynchus hectori) are taonga (treasured). However,
anthropogenic activities, particularly fishing (through fisheries bycatch), have led to
restricted/fragmented distributions and significant reductions in genetic diversity in both
subspecies. A worrying additional trend is deaths due to the parasitic disease
toxoplasmosis, potentially exacerbated by decreased genetic diversity. Hologenomics – a
new paradigm where genomes of a host and its co-existing microbes (microbiome) are
simultaneously investigated for novel insights into host health, population sizes, and
connectivity – could therefore be an important tool to address susceptibility to
toxoplasmosis and other diseases, as well as population sizes through time, potential
divergence, and past patterns of interchange between the Hector’s and Māui dolphin.
However, in order to be effective partners to Te Tiriti o Waitangi, particularly in maintaining
Māori rangatiratanga over resources and taonga, it is important that research consultation
with mana whenua from the areas where Hector’s and Māui samples originate is
undertaken. This is particularly important given the taonga status of Hector’s and Māui
dolphins, as well as potential concerns about the rendering of ‘biological whakapapa’ into
digital form during this project. Here, we outline our consultation procedures, the general
feedback based on this consultation, our lessons learned from the process, and what we
would do better/differently next time. We hope that presenting our experiences –
particularly where there was room for improvement by us – will help other researchers to
communicate more effectively with mana whenua in order to benefit Māori, the
researchers, and their rangahau (research).
ABOUT THE AUTHOR(S)
Alana Alexander: Alana’s research utilises the ‘time-traveling’ ability of population genomics
and phylogenomics by combining genomics, advanced computational tools, and
behavioural, ecological, and biogeographic data to make inferences about the processes
leading to patterns of genetic diversity within and among populations. These inferences
range from global spatial and deep temporal scales (e.g. the worldwide impact of climate
fluctuations on global sperm whale populations over the last 125,000 years), to regional
spatial scales across time scales relevant to local adaptation (e.g. the evolution of MHC
immune genes in Hector’s and Māui dolphin populations), to finer spatial and temporal
scales (e.g. the movement of a chickadee hybrid zone in Missouri by just a few kilometres
over three decades). Overall, she considers herself a molecular ecologist/evolutionary
biologist who focuses on the interplay between pattern and process in genomic data. As a
Māori scientist (Ngāpuhi, Te Hikutu) she also maintains a strong interest in ensuring that her
research can be used to support kaitiakitanga and rangatiratanga of resources within the
rohe of iwi, hapū and papatipu rūnaka.
Benjamin Iwikau Te Aika: Ngati Mutunga, Te Ati Awa, Kati Wairaki, Kati Mamoe, Waitaha.
Ben is a specialist in multiple areas, including Māori economic development in
environmental advocacy, toi Māori (Māori art), whakairo (carving), and tā moko. Currently,
he is the Vision Mātauranga Coordinator at Genomics Aotearoa where he coordinates Māori
consultation and outreach, identifies potential research collaborations with Māori
communities, and supports Genomics Aotearoa’s projects and researchers. Ben aims to
facilitate engagement to identify levels of acknowledgement and degree of control and
provide proper recognition to the interests of Māori. Ben works with researchers and with
Māori at multiple levels in the community to improve confidence, capacity and capability for
engagement. Ben draws on knowledge in Mātauranga Māori, and also the research
guidelines Te Ara Tika, Te Mata Ira and He Tangata Kei Tua. He also works on projects to
improve genomics research relevance to Māori. One initiative has enhanced kaitiaki
practices for a Māori landowner group in their management of native species - a great
example of commerce, science and kaitiakitanga in the hands of flax roots Māori. Ben is
passionate about his tamariki, hunting, whakapapa and whenua.
Ignite – Lightning Talk: Use of the National Library’s Web and Twitter Collections for Research
Lana Alsabbagh
National Library of New Zealand
The National Library of New Zealand has performed a “whole-of-domain” harvest since
2008, acquiring publicly available web content from the New Zealand .nz, .net, .org and
.com domains. The National Library’s Web Archiving team has also undertaken a number of
web harvests related to significant events in recent history such as the 2017 General
Election, including tweets, related data, and images. The Whole-of-Domain collection and
the Twitter harvests are both presently inaccessible to researchers. The Library’s goal is to
improve usage of this data by providing researchers with tools and services that would
enable computational access to this data.
In partnership with Library staff, the Digital Research Coordinator planned and carried out
interviews and a survey with a select group of scholars involved in digital humanities to help
the Library understand the tools and services researchers need to make full use of these
digital collections. This lightning talk will discuss the findings and ideas for further research.
ABOUT THE AUTHOR(S)
- Lana Alsabbagh
- Lana is the Digital Research Coordinator at the National Library of New Zealand. She
is currently researching ways to facilitate stakeholder engagement with the Library
web archival collection.
All Research Data Questions are Ethical Questions
Anton Angelo
University of Canterbury
The data lying behind research is becoming steadily more transparent, and research is
becoming more about using huge existing datasets than just generating data to answer a
specific question.
This will be a facilitated discussion amongst all the attendees about current concerns:
- Can data ever not be biased on race or gender?
- Is anonymity a lost cause?
- How much data do I have to give away? (And whose was it, anyway?)
- Data datasheets – is more bureaucracy the answer?
- Is the researcher to blame for bias, or the training set?
ABOUT THE AUTHOR(S)
- Anton Angelo
- Anton Angelo is Research Data Coordinator at the University of Canterbury Library.
He has an active interest in all aspects of research data: storage, publishing, ethics,
licensing and review. He is a certified Data and Library Carpentry instructor.
Engaging researchers with eResearch: CoRE bid support
Laura Armstrong
Centre for eResearch, University of Auckland
‘Build it and they will come’ doesn’t seem to work when it comes to researchers engaging with eresearch. Institutions invest in infrastructure and platforms but for a portion of our communities this doesn’t deliver better, faster research or more connected researchers because they are not engaging with eresearch – aren’t aware of it, can’t access it, don’t know how it applies to them, struggle to use it or don’t feel it meets their needs.
We assume that others who work to engage researchers in eresearch grapple, as we do, with what engagement means in this context (beyond marketing and promotion) and with how to identify and address barriers. How can we connect researchers with eresearch so they truly engage with it: access it, use it, shape it, innovate with it and achieve amazing things with it?
Many universities offer services and programmes to engage researchers in eresearch at various scales and career stages and across diverse communities. This presentation offers our model and experience of engaging researchers in eresearch to support bids for the 2019 TEC call for Centres of Research Excellence (CoREs). Part of a strategic and coordinated approach led by our Office of Research Strategy and Integrity (ORSI), this engagement has deepened our relationship with many senior PIs and research administrators, led to uptake of eresearch services across several research groups, and created connections that have resulted in non-CoRE funded research that relies on eresearch services and expertise.
ABOUT THE AUTHOR(S)
Laura Armstrong
Laura Armstrong is a Senior eResearch Engagement Specialist at the Centre for eResearch,
University of Auckland working to engage researchers in eresearch, and deliver research
data management services and researcher enablement projects.
Identifying, connecting and citing research with persistent identifiers. BoF Session
Natasha Simons, Anton Angelo, Shiobhan Smith and Laura Armstrong
Australian Research Data Commons, Brisbane, Australia
University of Canterbury
University of Otago
Centre for eResearch, University of Auckland
Increasingly, the research community, including funders and publishers, is recognising the
power of ‘connected up’ research to facilitate reuse, reproducibility and transparency of
research. Persistent identifiers (PIDs) are critical enablers for identifying and linking related
research objects including datasets, people, grants, concepts, places, projects and
publications. PID systems:
● Provide social and technical infrastructure to identify and cite a research output over time
● Enable machine readability and exchange
● Collect and make available metadata that can provide further context and connections
● Facilitate the linkage and discovery of research outputs, objects, related people and things
● Provide key tools for tracking the impact of research and researchers
Join this BoF to learn about recent developments in PID services and infrastructure with a
particular focus on DOI (research data), ORCID (people and organisations), RAID (research
activities and projects), IGSN (physical samples and specimens) and ROR (research
organisations).
Find out how to maximise the return on your investment in PIDs through participation in
national and global initiatives such as the NZ DOI consortium, Scholix and the Project FREYA
PID Graph, which uses PIDs to offer researchers and research institutions a richer, more
connected experience.
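As a small illustration of the machine readability mentioned above, an ORCID iD ends in a check character computed with the ISO 7064 MOD 11-2 scheme, so software can catch most mistyped iDs before looking anything up. The sketch below is not part of the original abstract; the function names `orcid_check_char` and `is_valid_orcid` are hypothetical, and it checks only the format and check character, not whether the iD is actually registered.

```python
def orcid_check_char(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the shape and check character of an ORCID iD (hypothetical helper)."""
    # Accept either the bare iD or the full https://orcid.org/ URL form.
    compact = orcid.replace("https://orcid.org/", "").replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_char(compact[:15]) == compact[15].upper()


if __name__ == "__main__":
    # An iD that appears in the author biographies of this booklet.
    print(is_valid_orcid("https://orcid.org/0000-0003-0635-1998"))
```

A check like this only guards against transcription errors; confirming who an iD belongs to still requires the ORCID registry itself.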
AUDIENCE
This BoF will be of interest to those designing, implementing, maintaining and supporting
PID services including eresearch professionals, repository managers, developers and
librarians. Participants should come along prepared to exchange knowledge, share
experiences and contribute to discussions about optimising the ‘power of PIDs’.
The session will kick off with brief lightning talks presented by those working at the cutting
edge of global developments in PID services and infrastructure. Following facilitated Q&A,
participants will be encouraged to contribute to an open discussion to share experiences,
explore ideas and ask questions.
OUTCOMES
Participants will leave the BoF with a fresh perspective on the opportunities PIDs can offer
researchers and research organisations. We envisage that many participants will be
prompted to explore in greater depth the ideas raised during the session as they might apply to
their organisation. The BoF will also offer participants the opportunity to establish or
strengthen connections with the broader PID community in New Zealand, Australia and
internationally.
ABOUT THE AUTHOR(S)
Natasha Simons
Natasha Simons is Associate Director, Skilled Workforce, for the Australian Research Data
Commons (formerly ANDS, RDS and Nectar). With a background in libraries, IT and
eResearch, Natasha has a history of developing policy, technical infrastructure (with a focus
on persistent identifiers) and skills to support research. She works with a variety of people
and groups to improve data management skills, platforms, policies and practices. Based at
The University of Queensland, Brisbane, Australia, Natasha is the co-chair of the Research
Data Alliance Interest Group on Data Policy Standardisation and Implementation, Deputy
Chair of the Australian ORCID Advisory Group and co-chair of the DataCite Community
Engagement Steering Group.
https://orcid.org/0000-0003-0635-1998
Anton Angelo
Anton Angelo is a data librarian working at the University of Canterbury. He managed
Canterbury’s effort to be among the first NZ universities to join the NZ DOI consortium and
to adopt the NZ ORCID Hub, verifying over 80% of Canterbury’s scholars’ affiliations. He also
manages the UC Research Repository, the Canterbury Institutional Repository, and has been
very active in supporting Open Access. He has two cats and three chickens.
https://orcid.org/0000-0002-2265-1299
Shiobhan Smith
Shiobhan has over 10 years’ experience working in Libraries and Museums. Prior to being
appointed as the University of Otago Library’s Research Support Unit Manager, Shiobhan
was Subject Librarian to a number of Humanities departments including Sociology,
Anthropology, Geography, and Theology. As Subject Librarian to the Centre for
Sustainability, Shiobhan was involved in the development of the Otago Data Management
Planning tool and has an interest in Research Data Management. Shiobhan also has
knowledge and skills in Digital Humanities, Bibliometrics, and Information Literacy.
https://orcid.org/0000-0003-1738-9836
Laura Armstrong
Laura Armstrong is a Senior eResearch Engagement Specialist at the Centre for eResearch,
University of Auckland working to engage researchers in eresearch, and deliver research
data management services and researcher enablement projects.
GBSathon: Benchmarking reproducibility of Genotyping-By-Sequencing analysis workflows through comparison with SNP chip and pedigree data
Rachael Louise Ashby (1), Rudiger Brauning (1), Hayley Baird (1), Ruy Jauregui (2), Monica
Vallender (1), Aurelie Laugraud (3), Charles Hefer (3), Abdul Baten (2), Paul Maclean (2),
Rayna Anderson (1), Roger Moraga (2), Siva Ganesh (2), Tracey van Stijn (1), Jeanne Jacobs
(3), Ken Dodds (1), John McEwan (1), Shannon Clarke (1) and Andrew Griffiths (2)
(1) AgResearch, Invermay Agricultural Centre, Private Bag 50034, Mosgiel 9053, New
Zealand
(2) AgResearch, Grasslands Research Centre, Private Bag 11008, Palmerston North 4442,
New Zealand
(3) AgResearch, Lincoln Research Centre, Private Bag 4749, Christchurch 8140, New Zealand
The advent of reduced representation genotyping-by-sequencing (GBS) provides a cost-
effective high-throughput genotyping platform to many ‘orphan’ species. This enables
downstream analyses including genomic selection, parentage assignment, conservation
genetics, population genetics and genome wide association studies. There are many
different workflows available for deriving SNPs from GBS data. Key aspects of any
bioinformatic workflow include accuracy, reproducibility and reliability. Few independent
studies benchmark multiple workflows to biological ‘gold standards’, such as pedigree or
SNP chip data, to assess these key aspects. Here, we benchmark open source SNP-calling
workflows for GBS data to assess their accuracy and reproducibility. To do this, we
generated GBS data for a cohort of 333 sheep. These have also been genotyped using a 50k
or 600k SNP chip. Furthermore, the cohort comprised 125 parent-offspring trios and all
individuals had multigenerational pedigree data. The SNPs called from the GBS workflows
were compared back to the gold standards to assess the accuracy, reproducibility and
reliability of SNP callers. Focusing on the bigger picture, we derived genomic relationship
matrices (GRMs) from all methods to compare the accuracy of the SNPs called for
downstream biological applications including relationship estimates among parents and
progeny.
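The abstract does not say which metrics were used, so the following is only a minimal sketch of the two kinds of comparison it describes: agreement between GBS calls and SNP chip calls, and consistency of a parent-offspring trio with Mendelian inheritance. It assumes genotypes are coded as 0, 1 or 2 copies of the alternate allele, with `None` for a missing call; the function names are hypothetical and do not come from any of the benchmarked workflows.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 alternate alleles; None = missing call


def concordance(gbs: Sequence[Genotype], chip: Sequence[Genotype]) -> float:
    """Fraction of SNPs called in both datasets where the genotypes agree."""
    pairs = [(g, c) for g, c in zip(gbs, chip) if g is not None and c is not None]
    if not pairs:
        raise ValueError("no SNPs called in both datasets")
    return sum(g == c for g, c in pairs) / len(pairs)


def mendelian_consistent(sire: int, dam: int, child: int) -> bool:
    """True if the child's genotype can arise from one allele per parent."""
    def gametes(genotype: int) -> set:
        # Alternate-allele counts (0 or 1) that a parent can transmit.
        return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]
    return any(s + d == child for s in gametes(sire) for d in gametes(dam))


if __name__ == "__main__":
    gbs_calls = [0, 1, 2, None]
    chip_calls = [0, 1, 1, 2]
    print(concordance(gbs_calls, chip_calls))   # agreement over shared calls
    print(mendelian_consistent(0, 0, 1))        # child allele absent in both parents
```

Low-depth GBS data tends to under-call heterozygotes, so a real comparison would usually account for read depth rather than treat every call as certain; this sketch ignores that.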
ABOUT THE AUTHOR(S)
Rachael Louise Ashby
Rachael Ashby is a postdoctoral researcher with the Bioinformatics team at AgResearch and Genomics Aotearoa. Her research focusses on the use of next generation sequencing for applications including genome assembly and genotyping-by-sequencing for genomic management of highly diverse species.
eResearch Collaboration Projects – supporting CSIRO’s digital science and research.
John Zic1, Justin Baker2
1 CSIRO, Sydney, Australia
2 CSIRO, Clayton, Australia
Background
CSIRO is Australia’s largest research agency and is a recognised leader in a diverse set of
science domains: Agricultural Sciences, Environment/Ecology, Plant and Animal Sciences,
Geosciences, Chemistry and Materials Science. CSIRO also manages research infrastructure
like the Australia Telescope National Facility (ATNF), the Marine Research Vessel RV
Investigator and the Pawsey Supercomputing Centre.
For many years in Australia, and also worldwide [2], research and science have undergone
transformational changes with the introduction of new instruments and advanced facilities
with matching increases in storage and computing capabilities. Individual researchers were
taking a bespoke approach to matching these technologies and capabilities to the way that
research and science were carried out. Wider adoption of new practices required social
change (in the practice of science and research) and these changes remained fragmented
and tailored to specific sciences or even projects. Organisations, by and large, varied
enormously in their support of these new practices.
As far back as 2007 [1], CSIRO eResearch practitioners advocated that science and research
practices within CSIRO adapt to deal with these challenges. Much like the rest of the world,
practices matured over the years: in CSIRO’s health and biosecurity, oceanographic and
atmospheric research, radio astronomy, agriculture and food as well as geological and
other earth sciences.
However, a significant shift occurred in 2018, with a formal recognition by the CSIRO Board
of the need to support the new “digital” science and research at an organisational level.
CSIRO developed strategic digital transformation initiatives, including CSIRO’s Managed
Data Ecosystem (MDE), Missions and the Digital Academy [4].
The aim of the MDE is to connect current and new platforms in a seamless way and improve
interoperability between datasets so users will be able to easily find and work on multiple
datasets. It will provide a set of tools and approaches enabling CSIRO and partners to
improve our collaboration, mining and analysis of data.
CSIRO Missions are major scientific and collaborative research programs aimed at making
significant breakthroughs in one of six major challenges facing Australia. They include
resilient and valuable environments, food security and quality, health and well-being, future
industries, sustainable energy and resources, and regional security.
CSIRO's Digital Academy is focused on investing in the digital capability of our staff and
involves a rethink in planning for a digitally driven research environment. It provides a
learning opportunity for our staff, helping define the digital talent, skills and new ways of
working. The Academy will help attract and retain new digital talent within the Australian
innovation system, develop new digital skills and mindsets in Australia’s scientists and
facilitate digital talent accessibility and collaboration across Australia’s innovation system.
Existing Support for “Digital” Science through “eResearch” initiatives
CSIRO Scientific Computing Services group has been providing a dedicated eResearch service
since 2011 [3]. This service is delivered through “eResearch Collaboration Projects” (eRCPs),
which now deliver specialist capabilities that include Machine Learning, Data Analytics,
Scientific Visualisation, Workflow Management and Science Data Handling into research and
science projects.
The eRCP process is run as a competitive grant process and continues to be very successful.
In the latest cycle, forty Scientific Computing Services specialists successfully completed and
delivered over sixty eRCP outcomes from a total of eighty submissions. The underlying
capabilities are delivered by members from each of the teams in the Scientific Computing
Services group: Technical Solutions; Data Analytics and Visualisation; Research Software
Engineering; and Modelling and Dataflow. The eRCP process also provides a mechanism to
promote and introduce new tools and frameworks for consumption to CSIRO’s research
community, e.g. Jupyter and R/Shiny.
The submission and approval process has also been significantly streamlined since its
introduction in 2011. The new eRCP portal includes semi-automated mechanisms for
seeking endorsement from research unit directors, along with automated directory name
lookup and links to related research projects. A post-project survey is used to elicit feedback
from individual researchers at the end of each cycle.
Specialists from the Scientific Computing program are then assigned to work on one or more
approved eRCPs. Over the six-month cycle, the resource allocation is around 0.2 FTE, with
each staff member allocated 3 eRCP projects per cycle. Importantly, eRCPs are provided to
CSIRO researchers and scientists at no additional charge.
The eRCP has been enormously successful over the years, with demand outstripping
capability to allocate staff to the projects. The program has demonstrated a range of useful
outcomes including, for example, an augmented reality tool for analysing
bushfire plumes over Tasmania; a dashboard to interrogate cotton crop physiological
measurements; and an online platform to monitor algal blooms for multiple water bodies.
Scientific Computing specialists also provide dedicated support to CSIRO researchers, based
around the same set of core capabilities, via entirely separate funding models known as
“pan deployments” and secondments. In both cases, CSIRO projects fund the specialists’
time at larger allocations, often extending over 12 months or more. In a sense, this acts like a
contractor service for Business Units, providing them with highly specialised skills but without
the need to recruit new staff of their own.
Future Plans
CSIRO Scientific Computing will respond to the major initiatives (MDE, Digital Academy and
Missions) as follows:
• MDE
o Redirect Scientific Computing expertise currently working on eRCPs and pan
deployments to MDE related activities. In the first instance, these specialists
will apply their skills and domain knowledge to one of several nominated
pilots, helping design and build foundational components of the MDE.
o Over time, it is anticipated that those same specialists will contribute to the
ongoing development and enhancement of additional MDE components in
line with its progressive organisational rollout.
• Digital Academy
o Develop/adapt training content as appropriate for the Digital Academy. For
example, making use of existing Software Carpentry material for HPC usage,
but customising appropriate aspects for our own computing environment.
o Deliver training content to CSIRO staff. This has already proven very
successful in the machine learning area, with hundreds of staff attending
sessions, and will no doubt continue to grow over time.
• Missions
o Scientific Computing will continue to provide CSIRO researchers with the
eResearch support they need in response to the significant scientific
challenges that the Missions are tackling.
REFERENCES
1. J. A. Taylor, J. Zic, and J. Morrissey, “Building CSIRO e-Research Capabilities,” in
eResearch Australasia, 2008.
2. T. Hey, S. Tansley, and K. Tolle, “The Fourth Paradigm: Data-Intensive Scientific
Discovery,” Microsoft Research, 2009.
3. S. Moskwa, “The Accelerated Computing Initiative,” in eResearch Australasia, 2012.
4. CSIRO Chief Executive's Report 2018-19: https://www.csiro.au/en/About/Our-impact/Reporting-our-impact/Annual-reports/18-19-annual-report/part-1/chief-executive-report
ABOUT THE AUTHOR(S)
- Dr John Zic is the Executive Manager of CSIRO’s Science Computing Services
- Mr Justin Baker is Leader of the Scientific Computing Data Analytics and Visualisation
Team.
Why overfitting is bad for science: Lessons from psychology
Adam Bartonicek, Dr. Narun Pornpattananangkul, Associate Professor Tamlin Conner
University of Otago, Department of Psychology
Correspondence to: [email protected]
Many published research findings in psychology cannot be replicated. Even formerly “well-established” effects such as power-posing and implicit priming have failed to replicate. The crisis is not limited to psychology – replication issues abound across numerous fields, including neuroscience and biomedical sciences (e.g. Button et al., 2013; Ioannidis, 2005). The main causes of the replication crisis are thought to be inadequate statistical literacy and questionable research practices, such as p-hacking and “HARKing” (Hypothesizing After Results are Known). However, there may also be a less well-appreciated contributor to the replication crisis – overfitting (Yarkoni & Westfall, 2017).
Overfitting occurs when an overly complex model provides a good fit to the data it was trained on, but fails to accurately predict new samples. The goal of the classical statistical frameworks used in psychology, such as OLS and maximum likelihood methods, is to provide inference by finding the best fit to the data at hand. As such, these methods are liable to overfitting, especially when used alongside automatic variable selection methods such as forward, backward, and stepwise regression. Conversely, the goal of more recent statistical and machine learning methods is to maximize prediction accuracy in new samples and guard against overfitting directly.
Psychologists and other scientific researchers may therefore benefit from incorporating newer statistical and machine learning methods into their research in order to improve its replicability. To this end, more user-friendly open-source machine learning software packages are now being developed, such as the recent R package PredPsych and the machine learning module for JASP. The proliferation of convenient digital tools for machine learning may lead to more replicable and reliable research, in psychology and in experimental science in general.
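As a minimal illustration of the overfitting described in this abstract (a sketch on simulated data, not the authors' analysis), the Python snippet below compares an ordinary least squares line with a "memorising" nearest-neighbour model. The linear data-generating process, the noise level, the sample size and all function names are assumptions made purely for illustration. The memorising model reproduces its training data perfectly, yet it should predict a fresh sample from the same process worse than the simple line does.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def nearest_neighbour(train_x, train_y, x):
    """Overly flexible model: return the outcome of the closest training point."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


def mse(predictions, ys):
    """Mean squared prediction error."""
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)


def simulate(n=200, noise_sd=1.0, seed=0):
    """Fit both models on one sample and score them on a fresh sample."""
    rng = random.Random(seed)

    def sample():
        # Assumed (hypothetical) truth: y = 2 + 0.5*x plus Gaussian noise.
        xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
        ys = [2.0 + 0.5 * x + rng.gauss(0.0, noise_sd) for x in xs]
        return xs, ys

    train_x, train_y = sample()
    test_x, test_y = sample()
    a, b = fit_line(train_x, train_y)

    results = {}
    for name, xs, ys in (("train", train_x, train_y), ("test", test_x, test_y)):
        results["line_" + name] = mse([a + b * x for x in xs], ys)
        results["memorise_" + name] = mse(
            [nearest_neighbour(train_x, train_y, x) for x in xs], ys
        )
    return results


results = simulate()
for name, value in sorted(results.items()):
    print(f"{name}: {value:.3f}")
```

The gap between `memorise_train` and `memorise_test` is the signature of overfitting. Scoring models on data they were not fitted to, as done here with a held-out sample, is the basic safeguard that prediction-focused methods build in.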
ABOUT THE AUTHOR(S)
Adam Bartonicek is a PhD student at the Department of Psychology, University of Otago. His main interests are well-being and using new statistical learning methods for high-dimensional inference.
Dr. Narun Pornpattananangkul is a lecturer at the Department of Psychology, University of Otago. His main research interests include using big data in fMRI to study changes in reward-processing in mood disorders.
Associate Professor Tamlin Conner is a lecturer at the Department of Psychology, University of Otago. Her main research interests include the impact of health behaviours on well-being and using mobile technology for daily experience sampling.
REFERENCES
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 0696–0701. https://doi.org/10.1371/journal.pmed.0020124
Yarkoni, T., & Westfall, J. (2017). Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6), 1100–1122. https://doi.org/10.1177/1745691617693393
Assessing the potential of autonomous AI devices for portable real-time DNA sequencers and deployable sensors
Authors name(s): Miles Benton, Joep de Ligt, Donia Macartney-Coxson, Richard Dean
Organisation: Institute of Environmental Science and Research (ESR), Wellington, NZ
Authors Email(s): [email protected], [email protected], donia.macartney-
[email protected], [email protected]
The current ‘climate’ is full of buzz words, ranging from AI (artificial intelligence) and deep
learning, through to cloud computing and the ‘Internet of Things’. For consumers, and even
research specialists, this can all be a little overwhelming. At ESR we are endeavouring to
provide our staff, clients, and hopefully the wider community, with some insight into the
technologies behind this jargon. One such project involves evaluating the deployment of
low-cost portable devices into the field to collect real-time data.
This talk will highlight our experiences with the Nvidia Jetson family of small embedded
computing platforms. The Jetson ecosystem includes small form-factor modules with GPU-
accelerated parallel processing, making them ideal low-power, high-performance portable
devices which have the capability to perform advanced operations in remote locations.
Our aim is to create a cost-effective and truly portable real-time DNA sequencing device
which can be easily taken into the ‘field’ with results reported in real-time as the sequencer
runs. This will incorporate the Nanopore MinION DNA sequencer alongside a cheap single
board computer (Nvidia Jetson based) powered by off-the-shelf rechargeable batteries. The
Nvidia powered technology will allow real-time base calling of DNA, thus making direct
detection/identification in the field a real possibility.
Additionally, we envisage a totally modular device not just limited to DNA sequencing.
Backwards compatibility with such ecosystems as Raspberry Pi and Arduino means that a
wide range of sensors can be attached (e.g. temperature, humidity, water flow, cameras)
which can report back in real-time. This, alongside the ability to run off portable (even solar
powered) batteries, makes for an extremely versatile base unit.
Ultimately the whole package is extremely cost effective, with potential use cases across a
multitude of research, primary sector and industry fields. Additionally, the affordable, easy
to source components provide exciting opportunities for such endeavours as community
outreach and education.
ABOUT THE AUTHOR
- Name: Dr Miles Benton
- Bio: Dr Benton is a Senior Bioinformatics Scientist within the Human Genomics group
at ESR with extensive experience in computational genomics and bioinformatics. He
recently completed a post doc at Queensland University of Technology (Brisbane,
Australia) working on the development of methods to deal with ever expanding
genomic data sets and their access and interpretation back to the people that matter
(i.e. clients, clinicians, researchers, public, etc). Part of his role at ESR has been
implementing bioinformatics workflows in both research and clinical settings. He is
also developing machine learning/AI technology on portable Nvidia modules for field
deployment in various areas. Dr Benton was recently appointed to the Genomics
Aotearoa Bioinformatics Leadership Team, where he is responsible for overseeing
bioinformatics support for human health projects. He is also heavily involved in the
Data Carpentries as an instructor and facilitator, as well as a mentor on ESR's data
science accelerator programme. He is deeply committed to making data science and
its tools accessible, with the belief that everyone should be able to 'play' with and
interpret their data.
Research Cloud NZ
Blair Bethwaite
NeSI, University of Auckland
Research Clouds are community or private Infrastructure-as-a-Service computing
capabilities tailored for research users, services, and workloads. As IaaS offerings, these capabilities
can cater to a massive range of use cases, providing researcher-defined infrastructure with
close integration to other institutional IT services. As the international Open Infrastructure
(née OpenStack) community has matured and stabilised in recent years, we are seeing more
and more scientific cloud deployments popping up – over 50 research organisations were
represented at the last Scientific SIG BoF during the Berlin OpenStack Summit and close to
100 people attended the Cloud Infrastructure in HPC BoF at SC19.
In Australia the Nectar cloud programme, which built one of the first national Research
Clouds over 7 years ago, is continuing to be supported through the merged and rebranded
ARDC (Australian Research Data Commons). New Zealand already operates one private
Node of the Nectar cloud thanks to University of Auckland. Could there be more, and is
there a case for testing broader sector access?
This BoF aims to bring together research infrastructure specialists from across the country
to gather interest and workshop models and technical architectures for a national research
cloud capability.
ABOUT THE AUTHOR(S)
Blair Bethwaite
Blair has worked in distributed computing for over a decade; both in research and for
research; for institutional and national projects; from applications, through grid & cloud
middleware, to full HPC & cloud systems design, implementation, and operations. Previously
over the ditch at Monash University, Blair most recently led Monash’s use of OpenStack to
underpin research computing. Originally from Christchurch, in mid-2018 Blair returned to
take up the opportunity of becoming NeSI's Solutions Manager, focusing back up the
technology stack closer to the user. Blair's role within NeSI covers Application Support and
Collaboration & Integration.
Containers in HPC Tutorial
Blair Bethwaite, Mark Gray
NeSI, Pawsey
[email protected], [email protected]
NB: please refer to https://nesi.github.io/ernz20-containers/ for setup instructions/options prior to attending this tutorial.
Half-day hands-on event hosted on NeSI and run in collaboration with the Pawsey Supercomputing Centre. This event will build on material and experience from the same tutorial recently run at Supercomputing ’19.
No longer an experimental topic, containers are here to stay in HPC. They offer software portability, improved collaboration, and data reproducibility. A variety of tools (e.g. Docker, Shifter, Singularity, Podman) exist for users who want to incorporate containers into their workflows, but oftentimes they may not know where to start.
This tutorial will cover the basics of creating and using containers in an HPC environment. We will make use of hands-on demonstrations from a range of disciplines to highlight how containers can be used in scientific workflows. These examples will draw from Bioinformatics, Machine Learning, Computational Fluid Dynamics and other areas.
Through this discussion, attendees will learn how to run GPU- and MPI-enabled applications with containers. We will also show how containers can be used to improve performance in Python workflows and I/O-intensive jobs.
Lastly, we will discuss best practices for container management and administration. These practices include how to incorporate good software engineering principles, such as the use of revision control and continuous integration tools.
Naive Prediction of Cancer Outcomes using Machine Learning
Matt Bixley1, Mik Black1
1 Department of Biochemistry, University of Otago, New Zealand
Prediction of 5-year cancer outcomes from histology images has been undertaken using
machine learning (ML), artificial intelligence (AI) and deep learning techniques, by multiple
international research groups, with success for a number of different cancers (e.g., breast
and colorectal). A key outcome in this approach is the easy translation of technology to
allow pathologists to access the applications in their workflow. An extension to the idea of
outcome prediction is to use histology image data to estimate genomic characteristics of a
tumour, such as those often derived from gene expression data – examples include
molecular subtype, proliferation rate, oncogenic pathway activation, and genomic
instability.
Typically, the training process involves the hand delineation of hundreds, if not thousands, of slides to
identify regions of interest and remove aberrations to improve accuracy. While some
automation has been attempted, here we present a naive approach to estimate the
accuracy with minimal human intervention. Currently the work has been applied to stomach
cancer slides from The Cancer Genome Atlas (TCGA), using both patient outcome data, and
genomic data on the molecular characteristics of the tumour.
ABOUT THE AUTHOR(S)
Matt is a Carpentries Instructor and Teaching/Research Fellow at the University of Otago.
His research background extends from laboratory and field work through quantitative
genetics and bioinformatics. Matt’s current research is on the use of Machine Learning tools
to predict cancer outcomes.
Mik received a BSc(Hons) in statistics from the University of Canterbury, and a MSc
(mathematical statistics) and PhD (statistics) from Purdue University. After completing his
PhD in 2002, Mik returned to New Zealand to work as a lecturer in the Department of
Statistics at the University of Auckland. An ongoing involvement in a number of Dunedin-based
collaborative genomics projects resulted in a move to the University of Otago in 2006,
where he now leads a research group focused on the development and application of
statistical methods for the analysis of data from genomics experiments, with a particular
emphasis on human disease. Mik has also been heavily involved in major initiatives
designed to put in place sustainable national research infrastructure for NZ: Genomics
Aotearoa and NZ Genomics Limited for genomics, digital literacy training via The
Carpentries, and NeSI (New Zealand eScience Infrastructure) for high performance
computing and eResearch.
Reproducible Posters an Otago Theme
Matt Bixley
Department of Biochemistry, University of Otago, New Zealand
The endpoint of most research is a publication, be that a journal article (hopefully in
Nature), a conference presentation or a lab meeting. The premise of Reproducible Research
is that not only do we now present a summary of our findings, but we also make available
the details, code and (where possible) the data that led to those findings. Various tools
exist to assist us in sharing our work and documenting our workflows. One extremely
popular tool for this is R Markdown, which provides the ability to write, document and
publish in a single workflow.
Here we present postOTAGO, an R package for creating posters with an Otago theme, that is
readily transferable to other organisations.
ABOUT THE AUTHOR(S)
Matt is a Carpentries Instructor and Teaching/Research Fellow at the University of Otago.
His research background extends from laboratory and field work through quantitative
genetics and bioinformatics. Matt’s current research focus is on the use of machine learning
tools to predict cancer outcomes.
DataONE: Supporting Data Discovery and Access through Social and
Technical Infrastructure
Amber E Budden
National Center for Ecological Analysis and Synthesis, University of California Santa Barbara
DataONE
[email protected]
Addressing grand challenge questions requires exploration at broad spatial, geographic and temporal scales, facilitated through easy access to distributed, heterogeneous data. DataONE is an interoperable, federated network of data repositories providing open, persistent, robust, and secure access to well-described and easily discovered data about life and the environment. Over the last ten years of development, both technical and social capacity building has been critical in creating an infrastructure that meets the current and future needs of the community. Informed by working group research, community engagement, and usability evaluation, DataONE has developed a comprehensive search and discovery platform exposing over 1.2M data files; tools and services that support research reproducibility, transparency and credit; and data management training and resources to enhance data literacy. Through these and aligned activities, DataONE has improved interoperability across a broad coalition of data repositories and enhanced data practices across a diverse community of researchers, data managers, and data librarians.
DataONE is a community-governed network built in partnership with existing data repositories supporting distinct and diverse communities. As DataONE continues to grow from a funded project into a sustained program, this networked, user-driven approach continues to inform infrastructure development, feature design and prioritization, maximizing the value and impact of research data in an increasingly complex, diversified data discovery and use landscape.
ABOUT THE AUTHOR(S)
- Amber E Budden, PhD BSc
Amber Budden is the Director of Learning and Outreach at the National Center for Ecological Analysis and Synthesis where she leads the NCEAS Learning Hub and short course activities. She is an open science facilitator, community manager and data literacy trainer and serves as a co-lead on several projects, including DataONE, a community-networked infrastructure supporting Earth and environmental scientists in their data management, preservation, search and discovery needs. An advocate for open and transparent science, Amber previously conducted research on article publication practices before working in the open data landscape. In her current roles, Amber supports the community in using open science infrastructure and leads training and outreach activities focused on best practices for data management.
Amber has a PhD in behavioral ecology and has conducted postdoctoral research on avian sexual selection and life-histories at the University of California Berkeley in addition to bibliometrics research at NCEAS. Amber has held teaching positions at the University of Toronto and York University in Canada and she has worked in outreach and publications within the non-profit sector. She is currently a principal investigator on several cyberinfrastructure awards including DataONE, the Arctic Data Center and the Permafrost Discovery Gateway; is Chair of the ESIP Data Stewardship Committee; is a member of the Make Data Count team; is an Advisory Board member for the Center for Scientific Collaboration and Community Engagement; and was a board member of the National Postdoctoral Association.
Amber holds a PhD in Behavioral Ecology from the University of Wales, a Joint Honors BSc in Psychology and Zoology from the University of Bristol and qualification in youth and community work.
Building and Supporting a New Zealand Digital Literacy Training Community
Megan Guidry, Ngoni Faya, Murray Cadzow, and Fabiana Kubke
NeSI, Genomics Aotearoa, University of Otago, and University of Auckland
[email protected], [email protected], [email protected], and
The delivery of digital literacy training has been a focus nationally for several years. An
increase in data availability across fields has led to a capability gap that can only be filled by
providing researchers and support staff with relevant training opportunities that encourage
and incentivise continual learning.
One-off events (such as Carpentries workshops, ResBaz events, etc.) are a great start, yet
attendees can often feel stuck or discouraged once they leave a supportive workshop
environment, a topic touched on at the 2019 eResearch NZ BoF ‘What Happens on Monday’.
It is difficult for researchers and support professionals to discern what skills are needed to
increase their research capability.
Attendees of this BoF will work together to co-create the first iteration of a skills roadmap
for eResearchers in NZ. This will involve an interactive discussion of efforts of support and
development of local communities, identification of knowledge and skills delivered in
training, and exploration into how credentialing could be applied. Understanding, as best as
possible, the skill building ‘stepping stones’ that lead eResearchers to increased capability
will provide a foundation for local communities in NZ to provide digital literacy training
opportunities that are more coordinated, cooperative, and nationally effective at raising
eResearch capability.
ABOUT THE AUTHOR(S)
Megan Guidry is the Regional Coordinator for the Carpentries in New Zealand and also
works as the training coordinator for the New Zealand eScience Infrastructure (NeSI). Her
main priority is raising the eResearch capability in New Zealand through training delivery
and community building.
Ngoni Faya is Genomics Aotearoa’s Training Coordinator, tasked with supporting and
building capacity and capability in bioinformatics for New Zealand. Based at the University
of Otago, he will be working with Genomics Aotearoa partners across New Zealand to
develop resources and technologies that provide international level training for genomics
and bioinformatics.
Murray Cadzow is a Teaching Fellow and Postdoctoral Fellow at the University of Otago. He
is both a Carpentries instructor and instructor trainer. His teaching focus is on delivering
digital literacy training to researchers, and the development and support of the local
Carpentries community at Otago. His research involves the use of large datasets to
investigate the genetic basis of gout in Māori and Polynesian populations.
Fabiana Kubke is a Neuroscientist at the University of Auckland and is a strong advocate for
developing and supporting the training and development of digital literacy for students and
researchers.
Carpentries at Otago
Murray Cadzow, Matt Bixley and Mik Black
University of Otago
{murray.cadzow, matt.bixley, mik.black}@otago.ac.nz
As part of a strategic initiative from the Division of Health Sciences at the University of
Otago, a project was established to increase researchers’ use of “big data” in research
projects. The first steps taken in beginning to build this capability were to ramp up both the
delivery of Software and Data Carpentry workshops, and the training of local instructors in
The Carpentries pedagogy. As part of this initiative, Murray and Matt have been delivering
and facilitating Carpentries workshops across the multiple University of Otago campuses
(Dunedin, Christchurch, Wellington), developing additional training materials and lessons,
and supporting other groups in the use of Carpentries pedagogy for non-Carpentries
workshops. In this talk we will discuss some of the impacts this initiative has had on
delivering Carpentries workshops, and on the Carpentries community at Otago.
ABOUT THE AUTHOR(S)
Murray Cadzow is a Teaching Fellow and Postdoctoral Fellow at the University of Otago. He
is both a Carpentries instructor and instructor trainer. His teaching focus is on delivering
digital literacy training to researchers, and the development and support of the local
Carpentries community at Otago. His research involves the use of large datasets to
investigate the genetic basis of Gout in Māori and Polynesian populations.
Matt Bixley is a Carpentries Instructor and Teaching/Research Fellow at the University of
Otago. His research background extends from Lab and Field work through Quantitative
Genetics and Bioinformatics. Current research is in the use of Machine Learning tools to
predict cancer outcomes.
Mik received a BSc(Hons) in statistics from the University of Canterbury, and a MSc
(mathematical statistics) and PhD (statistics) from Purdue University. After completing his
PhD in 2002, Mik returned to New Zealand to work as a lecturer in the Department of
Statistics at the University of Auckland. An ongoing involvement in a number of Dunedin-
based collaborative genomics projects resulted in a move to the University of Otago in 2006,
where he now leads a research group focused on the development and application of
statistical methods for the analysis of data from genomics experiments, with a particular
emphasis on human disease. Mik has also been heavily involved in major initiatives
designed to put in place sustainable national research infrastructure for NZ: Genomics
Aotearoa and NZ Genomics Limited for genomics, digital literacy training via The
Carpentries, and NeSI (New Zealand eScience Infrastructure) for high performance
computing and eResearch.
Building Machine Learning Systems on Microsoft Azure Cloud Virtual Machines – A Joint Report of Two Projects
Cheng-Hao Cai
School of Computer Science
University of Auckland
E-mail: [email protected]
We complete two machine learning projects using Microsoft Azure Virtual Machines. The first project is automatic voice recognition, which is the use of machine learning techniques to convert human speech to text. We build Gaussian mixture models, hidden Markov models and deep neural networks on the Azure VM, then use 100 hours of voice data to train the models. We find that better machine learning models and more training data can lead to increased accuracy of voice recognition, while background noise can reduce the recognition accuracy. The second project is automated program repair. In this project, machine learning models such as support vector machines and random forests are used to learn the semantics of programs. The training of such machine learning models, model checking processes and constraint solving processes are completed using the Azure VM. As both the model checking and machine learning techniques require considerable computational resources, we suggest using these techniques with Azure cloud computing services.
ABOUT THE AUTHOR
Chenghao Cai is a PhD student at the School of Computer Science, University of Auckland, whose study has been financially supported by the China Scholarship Council (CSC). His PhD work has made substantial contributions to the field of automated software engineering, especially in the area of machine learning approaches to formal design model repair. Chenghao is in the finishing stage of his PhD study. His thesis consists of eleven chapters, whose content is supported by internationally peer-reviewed publications. Chenghao has published ten research papers to date, among which is a 52-page manuscript published in the Automated Software Engineering (ASE) journal. ASE is a top-quality and prestigious international journal in the field of Software Engineering, which has an A-tier ranking by the Computing Research and Education Association of Australasia (CORE). Furthermore, Chenghao received the Microsoft Asia Cloud Research Software Fellowship (CRSF) Award in June 2019.
Additional Authors: Jing Sun, Gill Dobbie
School of Computer Science, University of Auckland
Automating the Research Data Lifecycle with Globus Automate
Ryan Chard, Kyle Chard, Ian Foster
Argonne National Laboratory and University of Chicago
[email protected], [email protected], [email protected]
Research data can traverse a multitude of compute and storage devices from their
collection, through analysis, dissemination, and archival storage. The scientific data lifecycle
often requires acting on data spanning geographical locations and timescales, from near-
real time quality control, to human-oriented curation, through to long-term cataloguing and
archival. Further, almost any step of this lifecycle can require the use of specialized
hardware or computing resources resident in one or more administrative domains.
Combined with ever-growing data rates and volumes, these challenges necessitate new
technologies to aid researchers in reliably, and simply, offloading distributed data
management and analysis tasks.
To address these needs we have developed Globus Automate--a distributed research
automation platform designed to empower scientists to create, deploy, and apply data-
oriented pipelines. Globus Automate can reliably automate the entire research data
lifecycle, governing data from its generation at various instruments, through analysis, to
dissemination and archival, while weaving fine-grained access control throughout the
pipeline to securely interoperate with services across administrative domains. Globus
Automate enables users to offload the management of data and abstract the challenges
associated with distributed analysis and storage pipelines.
Globus Automate fills an important, yet previously unmet need in science by enabling the
composition of data management services into distributed data management pipelines.
Using any of the provided Globus services, such as Transfer, Search, and Auth, as well as any
custom service that exposes an Automate API, users can construct rich data pipelines to
perform various tasks. Further, users can leverage funcX--a distributed function-as-a-service
platform--in Automate flows to perform remote computation on almost any resource to
which the user has access.
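The idea of composing services into a pipeline can be illustrated with a minimal Python sketch. It does not use the Globus Automate, Globus Transfer, or funcX APIs. The step names (`transfer`, `analyse`, `publish`), the `run_flow` helper, and the state dictionary are all hypothetical stand-ins chosen for illustration only.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Step = Callable[[State], State]


def transfer(state: State) -> State:
    # Hypothetical stand-in for moving data from an instrument to a compute site.
    return {**state, "location": "compute-site"}


def analyse(state: State) -> State:
    # Hypothetical stand-in for a remotely executed analysis function.
    values = state["raw"]
    return {**state, "mean": sum(values) / len(values)}


def publish(state: State) -> State:
    # Hypothetical stand-in for registering results in a data portal.
    return {**state, "published": True}


def run_flow(steps: List[Step], state: State) -> State:
    # Run each step in order, passing the accumulated state along.
    for step in steps:
        state = step(state)
    return state


result = run_flow([transfer, analyse, publish], {"raw": [1.0, 2.0, 3.0]})
print(result["mean"], result["published"])  # prints: 2.0 True
```

In this sketch each step only sees the state produced by the previous one, which is one simple way to picture how independent services might be chained; a real flow would also need authentication, error handling, and retries.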
Fig 1. An overview of the Globus Automate pipeline used to analyse and publish data from
the Advanced Photon Source’s (APS) 8ID beamline. This flow captures data at the APS (1-2),
transmits it to the leadership computing facility for analysis (3-4), and publishes the data (6)
into a data portal for visualization and consumption.
In this talk we will present Globus Automate and describe use cases from initial pilot
deployments. We will describe how funcX and Globus Automate make it possible to easily
and seamlessly exploit a wide range of computational resources to automate the research
data lifecycle, such as is depicted in Fig 1, from performing preprocessing and quality
control tasks locally through to outsourcing large-scale analyses to leadership computing
facilities.
ABOUT THE AUTHORS
Ryan Chard is an Assistant Computer Scientist at Argonne National Laboratory, having joined
in 2016, when he was awarded a Maria Goeppert Mayer Fellowship. His research focuses on
the development of cyberinfrastructure to enable scientific research. He is particularly
interested in automation platforms and performing on-demand scientific analysis at scale.
He has a Ph.D. in Computer Science from Victoria University of Wellington, New Zealand and
a Master of Science from the same university. His research interests include high
performance computing, scientific computing, cloud computing, cloud economics, and
network inference.
Kyle Chard is a Research Assistant Professor at the University of Chicago and a researcher at
Argonne National Laboratory. He received his Ph.D. in Computer Science from Victoria
University of Wellington, New Zealand. His research interests include data-intensive
computing, cloud computing, and economic resource allocation.
Ian Foster is an Argonne Senior Scientist and Distinguished Fellow and the Arthur Holly
Compton Distinguished Service Professor of Computer Science. Ian received a BSc (Hons I)
degree from the University of Canterbury, New Zealand, and a PhD from Imperial College,
United Kingdom, both in computer science. His research deals with distributed, parallel, and
data-intensive computing technologies, and innovative applications of those technologies to
scientific problems in such domains as climate change and biomedicine. Methods and
software developed under his leadership underpin many large national and international
cyberinfrastructures. Ian is a fellow of the American Association for the Advancement of
Science, the Association for Computing Machinery, and the British Computer Society. His
awards include the Global Information Infrastructure (GII) Next Generation award, the
British Computer Society's Lovelace Medal, R&D Magazine's Innovator of the Year, and an
honorary doctorate from the University of Canterbury, New Zealand. He was a co-founder of
Univa UD, Inc., a company established to deliver grid and cloud computing solutions.
What language are you speaking?!
Wallace Chase
REANNZ [email protected]
IT people. Students. Administrators. Scientists. Librarians. Corporate suits. Cultures. Nations. Collaborative research involves folks from across the whole spectrum. How do we communicate with each other? How do we speak the same language when often our lived experiences are so different? Join us for a discussion about how we can more effectively communicate with each other to solve complex multi-disciplinary problems.
ABOUT THE AUTHOR(S)
As Technical Engagement Manager at REANNZ, Wallace assists the community to utilize the full potential of the global Research and Education network ecosystem. To this end Wallace leverages his 15 years of experience in higher education IT operations and supporting research infrastructure.
How does REANNZ work?
Wallace Chase
REANNZ [email protected]
Moving data around is key to successful collaborations and REANNZ is here to help! Come for a discussion around how REANNZ moves your data around. Come learn about the services and tools available to the REANNZ community!
ABOUT THE AUTHOR(S)
As Technical Engagement Manager at REANNZ, Wallace assists the community to utilize the full potential of the global Research and Education network ecosystem. To this end Wallace leverages his 15 years of experience in higher education IT operations and supporting research infrastructure.
Frozen data
Wallace Chase
REANNZ [email protected]
How does data from polar experiments get to the warm labs of researchers around the world? Come join Wallace Chase of REANNZ to learn how the data gets thawed out and moved around the world. Hear about how REANNZ and the international NREN community are currently working to increase the ability to move ever-increasing big data from the ice to the world.
ABOUT THE AUTHOR(S)
As Technical Engagement Manager at REANNZ, Wallace assists the community to utilize the full potential of the global Research and Education network ecosystem. To this end Wallace leverages his 15 years of experience in higher education IT operations and supporting research infrastructure.
Why so slow? Molasses biased data transfers…
Wallace Chase REANNZ
Why is my data moving so slowly? Come hear the top reasons why your data transfer is so very slow and how you can speed it up…
ABOUT THE AUTHOR(S)
As Technical Engagement Manager at REANNZ, Wallace assists the community to utilize the full potential of the global Research and Education network ecosystem. To this end Wallace leverages his 15 years of experience in higher education IT operations and supporting research infrastructure.
Scalable Workflows and Reproducible Data Analysis (for Genomics)
Francesco Strozzi, Roel Janssen, Ricardo Wurmus, Michael R. Crusoe, George Githinji, Paolo
Di Tommaso, Dominique Belhachemi, Steffen Möller, Geert Smant, Joep de Ligt & Pjotr Prins
ESR Institute of Environmental Science and Research
Biological, clinical, and pharmacological research now often involves analyses of genomes,
transcriptomes, proteomes, and interactomes, within and between individuals and across
species. Due to large volumes, the analysis and integration of data generated by such high-
throughput technologies have become computationally intensive, and analysis can no longer
happen on a typical desktop computer.
This group of authors came together to describe and execute the same analysis using a
number of workflow systems, and to show how these follow different approaches to tackling
execution and reproducibility issues. A book chapter [1] on these topics showcases how any
researcher can create a reusable and reproducible bioinformatics pipeline that can be
deployed and run anywhere. This includes how to create a scalable, reusable, and shareable
workflow using four different workflow engines: the Common Workflow Language (CWL),
Guix Workflow Language (GWL), Snakemake, and Nextflow.
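One idea these workflow engines share is that a pipeline is described as tasks with explicit dependencies, from which an execution order can be derived. The sketch below illustrates only that idea in plain Python. It is not CWL, GWL, Snakemake, or Nextflow syntax, and the task names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical three-step pipeline: each task maps to the tasks it depends on.
dependencies = {
    "trim_reads": set(),
    "align_reads": {"trim_reads"},
    "call_variants": {"align_reads"},
}


def execution_order(deps):
    # static_order() yields each task only after all of its dependencies.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # prints: ['trim_reads', 'align_reads', 'call_variants']
```

Declaring dependencies rather than hand-ordering commands is what lets an engine rerun only the affected steps or schedule independent steps in parallel; the real systems add software environments, containers, and cluster execution on top of this.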
We would like to present the different components discussed in this chapter and discuss how
these can foster stronger and more efficient collaboration across Aotearoa. It should be
noted that while the examples are from a genomics background, these principles apply to all
data-based research projects that require reproducible and scalable workflows.
1. Strozzi, F. et al. Scalable Workflows and Reproducible Data Analysis for Genomics. in Evolutionary Genomics: Statistical and Computational Methods (ed. Anisimova, M.) 723–745 (Springer New York, 2019). doi:10.1007/978-1-4939-9074-0_24 https://link.springer.com/protocol/10.1007/978-1-4939-9074-0_24
ABOUT THE AUTHOR(S)
- Joep de Ligt, PhD
Dr. Joep de Ligt is the Bioinformatics Lead at ESR. Prior to this role in New Zealand he was
involved in genomics research and bioinformatics education and community building in the
Netherlands. For a full overview of his scientific publications, please see his Google Scholar page:
https://scholar.google.com/citations?user=z2edTLkAAAAJ&hl=en
Accelerating Data Science
Richard Dean
Institute of Environmental Science and Research
The ability to influence decision making by extracting knowledge from data is key to success
in organisations across New Zealand. However, high demand for data scientists means that
many organisations who want to expand their data analytics capability experience
difficulties in recruiting suitably skilled candidates. Richard will present an alternative
approach focussed on the upskilling, retraining and empowering of existing employees
through what is termed a ‘data science accelerator’. He will discuss his experience as Public
Health England’s first graduate from the UK government’s data science accelerator
programme, how that led to working on some cool projects in the UK and why he’s now just
as fired up to bring the concept over to New Zealand. He will provide some insights into
how the data science initiative is settling in at ESR and how New Zealand could become
more ‘united in data’.
ABOUT THE AUTHOR(S)
Richard is a Data Scientist at ESR, a crown research institute that deals with nitty gritty real
world problems affecting human communities covering everything from forensic science to
human health, biowaste, microplastics and the environment. Before joining ESR, he worked
as a Senior Data Scientist for Public Health England, an executive agency of the UK’s
Department of Health.
In his current role, he works across the whole organisation on projects that gain insight from
big data sets. He is also responsible for driving forward ESR’s data science initiative which
involves training staff through data carpentries and pushing the boundaries through an
engineering, robotics, innovation, coding and automation club – Erica for short.
Richard was the first member of staff from PHE to graduate from the UK government digital
service ‘data science accelerator’ programme. In 2019, he brought the scheme to New
Zealand through an internal accelerator programme within ESR. A second cohort is currently
being planned and will run from February – May 2020.
He has a BSc in Information Systems Management from Durham University and wrote an
MSc thesis on public health data interoperability standards while working in Durham.
He moved to New Zealand in November 2017 with his Kiwi wife and is trying his best to raise
two crazy kids – one born in the UK and one born in NZ.
Richard’s claim to fame is that he is one of New Zealand’s most successful mini golf coaches,
having convinced his wife to travel to Kosovo for the 2016 World Adventure Golf Masters,
where she won a bronze medal - New Zealand’s first ever medal in international match play.
Challenges and opportunities in timely and efficient delivery of IT for eResearch projects
David Eyers
University of Otago
Lahiru Ariyasinghe
University of Otago
There are many instances of “big” eResearch projects that have been very well supported by
initiatives both within the University of Otago, and across New Zealand overall. Often at the
other end of the spectrum of project scale, initiatives such as The Carpentries have
supported widespread capability lift in terms of researchers adopting computing
technology. However, there are many eResearch projects that face tricky computational and
data processing problems and are likely to lead to great opportunities, but for which it is
difficult to assess and prioritise the potential impact that the project might have relative to
others.
From the perspective of the eResearch Advisory Group at the University of Otago, and from
collaborations across the University, we have seen many eResearch projects face types of
barriers to their completion that would have been difficult to predict ahead of time. Some
of the types of challenges encountered have included:
• access to funding—e.g., where a potential cost emerges within a project, that did not
fit into the scope of research grants;
• types of funding—capital expenditure versus operating expenditure in terms of
research computing, e.g., DIY clusters versus use of the cloud;
• sustainability—e.g., considering how to support projects after their headline grant
funding has finished;
• tracking issues that need resolution across multiple different teams—e.g., across
departmental and central IT, researchers, NRENs, etc.;
• prioritisation and opportunity costs—e.g. the mechanisms that can support
escalation of issues in an efficient manner;
• management of the expectations of researchers and professional staff involved in
research projects;
Genomics Aotearoa Training
1.Ngoni Faya & 2.Dinindu Senanayake. 1.Genomics Aotearoa, New Zealand. 2.NeSI, New Zealand
[email protected] & [email protected]
A decrease in sequencing cost has seen a large amount of sequence data being generated in the last few years, leading to a paradigm shift from sequencing data generation to data analysis. Despite the ease of data generation, the same cannot be said for data analysis, mainly because few researchers have the bioinformatics skills necessary to analyze these datasets. Moreover, most data analysis tools are developed for use with the Linux command line and require use of high-performance computers, so there is a need for hands-on data analysis training. Empowering researchers through hands-on training courses is key to improving knowledge and understanding of bioinformatics approaches, thereby easing the skills shortage. Genomics Aotearoa (GA) is a collaborative platform established to ensure that New Zealand is internationally participating and leading in the fields of genomics and bioinformatics. One of GA’s projects that is critical to genomics research is bioinformatics capability, which provides the bioinformatics tools and strategies needed to analyze information. The bioinformatics capability project aims to address the increasing local demand for data analysis methods as well as training. The concept is: develop material/pipelines that can be accessed by everyone and travel to offer hands-on bioinformatics workshops. Post-doctoral researchers with strong bioinformatics backgrounds have been brought on board to develop open-source and reusable data analysis material and pipelines to benefit the genomics research community. At this stage, development of introductory, intermediate and advanced bioinformatics training material for genomics researchers is underway. Together with our partner, NeSI, coordination and delivery of data science and bioinformatics training workshops has already begun around the country, with just over 250 researchers trained.
Based on the expressions of interest, we managed a supply/demand ratio of ~66%, with factors such as instructor and room availability contributing significantly to lowering this figure. In 2020, we are developing strategies to improve this ratio. NeSI platforms and its virtual machines were instrumental in hosting the training workshops, which allow trainees to expand their skill set from introductory to advanced levels with a focus on how to use HPCs for their research.
Authors
Ngoni Faya
Ngoni is Genomics Aotearoa’s Training Coordinator, tasked with supporting and building capacity and capability in bioinformatics for New Zealand. Based at the University of Otago, he is working with Genomics Aotearoa partners across New Zealand to develop resources and technologies that provide international level training for genomics and bioinformatics. The aim is to give genetics researchers the training opportunities they need to analyse their own data sets, as well as facilitating the NeSI computing platforms and infrastructure required in their projects.
Dinindu Senanayake
Dini is an Applications Support Specialist at NeSI with a particular interest in the use of high-performance computing for Computational Biology and Bioinformatics. He joined NeSI following a decade of research experience gained in the fields of Cancer Genetics, Chemical Genetics, Immunology and Bioinformatics.
• estimation of projects’ timelines, and likely response times for resolving IT issues
that are encountered.
There are many changes on the eResearch horizon that could have a significant impact
on the above challenges, including:
• use of cloud computing as a mechanism for performing eResearch off campus,
centred on the researcher;
• DevOps tooling for reproducible research within projects, e.g., to avoid bit-rot, and
to help researchers take a central role in the software they develop and/or use;
• including Research Software Engineers (RSEs) within the University staff, and thus
being able to decouple ongoing support from project funding;
• increased capability for delegated authorisation and security management;
• shifting from batch to interactive or streaming data processing;
• machine learning computation gravitating to the devices with cutting-edge
performance (often able to be sourced for free) being installed on researchers’
computers, rather than being centrally sited and/or managed;
• providing delegated authorisation to support self-service facilities for researchers;
• changing roles of supporting organisational units such as university libraries.
The University of Otago is currently restructuring its offerings in terms of eResearch support
within the University’s Information Technology Services. Beyond reviewing various problem
cases, this talk will view the above topics through the lens of what is likely to be practical in
the near future at the University of Otago.
ABOUT THE AUTHOR(S)
- David Eyers is an academic in the Department of Computer Science at the University
of Otago in Dunedin, New Zealand. He previously worked as a senior research
associate at the University of Cambridge, from where he was awarded his PhD. One
of his primary research focus areas is in distributed systems, particularly regarding
communication efficiency, energy monitoring, storage and security management
technologies such as decentralised information flow control. He aims to help
develop power-efficient, reliable, highly scalable and secure architectures for cloud
computing. Beyond research relevant to HPC and cloud computing, he has broad
interests in eResearch topics that include reproducibility, and the distributed
management of research data and metadata.
- Lahiru Ariyasinghe is a postdoctoral researcher in the Department of Computer
Science at the University of Otago in Dunedin, New Zealand, from where he was
awarded his PhD. Lahiru is an expert in building computational pipelines, and using
emerging DevOps approaches to develop reproducible research platforms. Lahiru
has first-hand experience of innovative eResearch projects, including some situations
in which unexpected barriers have been encountered, and creative solutions have
overcome those barriers.
Building a Federated Research Collaborative
David Fellinger
iRODS Consortium
The concept of countrywide and worldwide research collaboratives is relatively new. Several
decades ago it was common for a department head to have multiple vertical file cabinets
with paper folders housing the work of researchers and students in his or her department.
Access to and subsequent citation of this work were generally based on the department head’s
knowledge of the works. As digital storage technologies became less expensive and
relatively ubiquitous, the vertical files turned into disk storage systems reflecting the work of
each university department. The works were still filed and maintained by standard file
system references such as creation date, name, and access controls. In many cases, card
catalogs or spreadsheets were used to further describe the titles. The introduction of
Ethernet in the late 1970s largely changed the manner in which research works were
conserved. The deployment of campus-wide data networks enabled universities to establish
and maintain central data repositories. Storage could become a service of the university
where individual colleges or departments no longer had to maintain their own archival
systems. The era of the digital research collaborative was born. In many cases, this
transition took years, and even today, some university departments retain internal storage.
Locating a specific work based upon anything other than title was a challenge and that
problem grew with the number of works that were archived.
The Advent of Storage Management Technology
A digital file system is really just a means for storing and maintaining data like a set of
shelves is a means to hold books. What is actually required is a way to relate descriptive
data to files indicating the contents of a file. This has been understood for libraries
containing shelves of books for thousands of years, dating back to 2000 BC [1]. In
the United States, the Defense Advanced Research Projects Agency (DARPA) funded a
program called the Storage Resource Broker (SRB) in 1995 and 1996 and the first
middleware to identify works based on content and user defined metadata was written. In
2006 the DICE group, a group of research institutions in the US, created the
Integrated Rule-Oriented Data System (iRODS), expanding on the concepts of SRB, and in 2013 the iRODS
Consortium was formed as a user-supported community devoted to the long-term
continuation of this open source middleware. This project, which was first launched 25 years
ago, has spawned software that is being used to manage data archives worldwide. The
iRODS software can completely virtualize entire file system infrastructures so that storage
purchased from any vendor at any time can be made to appear as one effective file system.
Researchers no longer have to be concerned with the location of data but just the contents.
Data discovery is one of the primary features of iRODS. A researcher can specify search
terms retained in an index that allows other researchers to discover that research work. The
process of building an index does not necessarily require human intervention. Metadata can
be automatically extracted from files at rest or while being ingested to enable
discoverability. In fact, complete workflow automation can be realized with iRODS. Data can
be automatically ingested from numerous sensors and routed, based on content and
policies, to specific compute platforms for analysis. The subsequent data products can then
be distributed based on policy. Data products can be published according to policies
associated with the collections under management. All of this functionality can be audited in
real time to precisely track the operation of a data center.
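The idea of metadata-driven discovery described above can be illustrated with a minimal sketch in plain Python. It does not use the iRODS API: the `MetadataCatalog` class, the logical paths, the resource names and the attribute names are all hypothetical, and the sketch only shows how attribute-value pairs indexed at ingest time let a researcher find files by content rather than by storage location.

```python
from collections import defaultdict


class MetadataCatalog:
    """Toy catalogue (hypothetical, not the iRODS API) indexing files by metadata."""

    def __init__(self):
        self._index = defaultdict(set)   # (attribute, value) -> set of logical paths
        self._location = {}              # logical path -> physical storage resource

    def ingest(self, logical_path, resource, metadata):
        """Register a file and index its metadata automatically at ingest time."""
        self._location[logical_path] = resource
        for attribute, value in metadata.items():
            self._index[(attribute, value)].add(logical_path)

    def find(self, **criteria):
        """Return the sorted logical paths whose metadata matches every criterion."""
        matches = [self._index.get(item, set()) for item in criteria.items()]
        if not matches:
            return []
        return sorted(set.intersection(*matches))

    def locate(self, logical_path):
        """Resolve a logical path to the storage resource that holds it."""
        return self._location[logical_path]


# Hypothetical files held on storage from two different vendors.
catalog = MetadataCatalog()
catalog.ingest("/zone/home/alice/run1.fastq", "disk_vendor_a",
               {"species": "kakapo", "instrument": "nanopore"})
catalog.ingest("/zone/home/bob/run2.fastq", "tape_vendor_b",
               {"species": "kakapo", "instrument": "illumina"})
catalog.ingest("/zone/home/bob/run3.fastq", "disk_vendor_a",
               {"species": "tuatara", "instrument": "nanopore"})

# Discovery is by content; the physical location is resolved afterwards.
hits = catalog.find(species="kakapo", instrument="nanopore")
```

The caller never names a storage system when searching; `locate` is only needed once a file of interest has been found, which mirrors the separation between data contents and data location that the abstract describes.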
Bandwidth Availability Enables Global Collaboration
The deployment of 100 Gbps Ethernet wide area networks across many universities
launched a new era of research data communication. Initially all data operations were
relegated to one campus or entity simply due to the limitations of communication
technology. While it was possible to transfer files by way of File Transfer Protocol (FTP)
technologies, it was not easily possible to create indices that spanned federated collections
allowing data to be discovered or easily accessed. The secure federation capabilities of
iRODS have changed the way that we think of data locality. One of the key focuses of iRODS
development has been to enable federated collaboration. When the administrators of two
iRODS sites share a set of keys, the two sites, with permissions, can appear as one. The
researcher or administrator can assign access controls for local and WAN access. A user in a
remote zone can easily discover data through access to user defined metadata. A file
transfer can then be enabled with the iRODS servers brokering a direct transfer to the
requesting client. A researcher can even share data with non-iRODS users by issuing a secure
ticket for a specific file or files.
A New Era of Data Sharing is Underway
Large scale iRODS deployments span the world and have enabled collaborations of multi-
national scientists and researchers. In the US the iPlant Collaborative was formed in 2008
with funding from the National Science Foundation. Data management was based on iRODS
from the start of the project and it initially served the plant science communities primarily in
the US. From its inception, iPlant quickly grew into a mature organization providing
powerful resources and offering scientific and technical support services to researchers
nationally and internationally. In 2015, iPlant was rebranded to CyVerse to emphasize an
expanded mission to serve all life sciences [2]. Today CyVerse serves over 47,000 users with
5,690 participating academic institutions and 2,438 non-academic institutions. A major
feature of the collaborative is the Discovery Environment (DE) which allows researchers to
quickly find files of interest relating to their life science discipline. The primary site is in
Tucson, Arizona, with a mirror at the Texas Advanced Computing Center in Austin, Texas. Both data
management and workflow control are enabled by the use of iRODS.
In Europe the EUDAT Collaborative Data Infrastructure (CDI) was formed to host the data of
over 50 universities and research institutions in the European Union. The infrastructure is
managed under iRODS and the data covers over 30 scientific disciplines from atmospheric
research to physics, hydro-meteorology, genomics, and ecology. As with CyVerse, a major
feature of EUDAT is data discovery across the entire geography of the EU. The goal is to
provide both data access and re-use for near term needs as well as data preservation to
build a long term archive [3].
In the Netherlands, SURF has built a data management framework based on iRODS.
Countrywide data from several universities is stored at their data site. Besides the service of
offering data storage and management, they also offer data processing and analysis as well
as compute services. All of the data at the site is moved to various platforms and tiers using
iRODS [4]. SURF is a member of the iRODS community, as are several universities in the
Netherlands.
In Sweden, the Swedish National Infrastructure for Computing (SNIC) is a national research
infrastructure that makes available large-scale high performance computing resources,
storage capacity, and advanced user support for Swedish researchers. This service is
managed under iRODS control [5]. It uses the Swedish University Network
(SUNET), which links the infrastructure at the KTH Royal Institute of Technology to other
universities in Sweden with a 100 Gbps link to facilitate data movement [6].
These are just a few of the iRODS deployments in both the academic and research sectors.
The use of iRODS and its discovery capabilities accelerates scientific research allowing
researchers to quickly find relevant materials while building on them. The power of iRODS to
manage data based on collection policies cannot be overstated as data sets grow and
automation becomes a requirement. Many universities, libraries, museums, and
companies worldwide have chosen iRODS as a technology that allows the “future proofing” of data
collections independent of the evolution of storage. These institutions have realized that
their data policy decisions can be maintained by iRODS at any scale regardless of the change
of data storage or networking technologies over time.
About the author
Dave Fellinger is a Data Management Technologist and Storage Scientist with the iRODS
Consortium. He has over three decades of engineering experience including film systems,
video processing devices, ASIC design and development, GaAs semiconductor manufacture,
RAID and storage systems, and file systems. As Chief Scientist of DataDirect Networks, Inc.
he focused on building an intellectual property portfolio and presenting the technology of
the company at conferences with a storage focus worldwide.
In his role at the iRODS Consortium, Dave is working with users in research sites and high
performance computer centers to confirm that a broad range of use cases can be fully
addressed by the iRODS feature set. He helped to launch the iRODS Consortium and was a
member of the founding board.
He attended Carnegie-Mellon University and holds patents in diverse areas of technology.
References
1. The history of the card catalog is available from https://www.vox.com/culture/2017/4/21/15357984/card-catalog-library-of-congress-history (accessed 2 November 2019)
2. The history of CyVerse is available from https://www.cyverse.org/about (accessed 9 October 2019)
3. Information regarding EUDAT is available from https://www.eudat.eu/eudat-cdi (accessed 8 October 2019)
4. Information regarding SURF is available from https://www.surf.nl/en/research-ict (accessed 9 October 2019)
5. Information regarding SNIC is available from https://www.snic.se/ (accessed 9 October 2019)
6. Information regarding SUNET is available from https://www.sunet.se/about-sunet/ (accessed 9 October 2019)
Where Data Lives: NeSI, taonga and growing repository services.
Brian Flaherty
NeSI
Data transfer and data sharing have been a part of NeSI's service catalogue for several years, but the service priority has always been in support of compute-intensive research (and related training & consultation). The launch of a national data transfer platform and a new relationship with Genomics Aotearoa have given NeSI the opportunity to re-evaluate its data service offering. A key project in the Genomics Aotearoa Workplan is bioinformatics capability (Project 1811), which encompasses the development of a national genomics data repository including bespoke processes for Māori management of indigenous data, which is actively populated across all New Zealand genomics research activities. GA's functional requirements for this repository include securely storing, preserving and providing mediated access to genomics data for the longer term. It is also intended that the repository be interactive and usable by a large number of NZ researchers. NeSI's early response has been to focus on implementing the base-level infrastructure requirements for the repository while beginning to investigate platform options and prototype permissions workflows. This presentation will provide an update on progress to date, including storage and access management through Globus, and introduce topics from data publishing and discovery services (from simple metadata to interrogation of genome summaries), to indigenous data governance requirements, and longevity/persistence.
ABOUT THE AUTHOR(S)
- Brian Flaherty
- Brian is Product Manager, Data at NeSI. He has a background in digital libraries,
digital scholarship, research infrastructure & support and discovery services.
Building a national/regional data transfer platform: Globus BoF
Kyle Chard
University of Chicago, Chicago, Illinois, USA
Argonne National Laboratory, Chicago, Illinois, USA
Brian Flaherty
NeSI, Auckland, New Zealand
The Globus research data management service (https://globus.org), operated by the University of Chicago for the global research community, supports the rapid, reliable, and secure transfer and sharing of research data within and among institutions. With more than 140,000 registered users and 10,000 active endpoints worldwide, it is essential science infrastructure for many institutions and in many fields. This BoF will provide an opportunity for attendees to learn about the latest Globus features, hear from operators of research infrastructure and research networks about how Globus is used in research, and explore the options for high-speed trans-Tasman data transfer. The BoF will be organized as brief presentations followed by a facilitated conversation. Kyle Chard will introduce Globus and highlight new features, including access to file sharing, cloud storage, protected data management, and data automation. Brian Flaherty will describe how NeSI, in collaboration with REANNZ and Globus, is building a national data transfer platform. Bring your questions and ideas on how the national platform can be improved/enhanced.
ABOUT THE AUTHOR(S)
- Kyle Chard is a Research Assistant Professor in the Department of Computer Science at the University of Chicago and a researcher at Argonne National Laboratory. He received his Ph.D. in Computer Science from Victoria University of Wellington, New Zealand in 2011. He co-leads the Globus Labs research group which focuses on a broad range of research problems in data-intensive computing and research data management. He currently leads projects related to parallel programming in Python, scientific reproducibility, and elastic and cost-aware use of cloud infrastructure.
- Brian Flaherty is Product Manager, Data at New Zealand eScience Infrastructure. He has a background in digital libraries, scholarly communication, open repositories, data management, research information management and discovery services.
Micro-credentials and Research Skills Development
Jonathan Flutey
Victoria University of Wellington
The New Zealand Qualifications Authority (NZQA) has recently formalised a new
micro-credential policy and piloted a series of small skills-based courses that align to the NZQA
credit framework.
While this is not yet gaining traction in universities, Tertiary Education Organisations (TEOs)
with a strong skills-based focus are finding new ways of rewarding, and recognising, learning
through this new policy and framework.
The policy’s focus is not only on TEOs. NZQA has formalised a process for non-TEOs
(professional groups, accreditation boards, communities) to benchmark their skills-based
learning programmes for micro-credential equivalency:
https://www.nzqa.govt.nz/providers-partners/approval-accreditation-and-registration/micro-credentials/equivalency/
This birds of a feather session opens up the discussion, and possibilities, of nationally
recognised credentials and development pathways for RSEs and research support staff, with
a particular focus on skills-based and professional practice assessment. Is this something our
communities want? Let’s discuss!
ABOUT THE AUTHOR(S)
- Jonny Flutey
- https://orcid.org/0000-0002-2210-755X
Digital Preservation New Zealand
Andrea Goethals
National Library of New Zealand
Since 2004, the New Zealand government has invested approximately $50 million in the
digital preservation programmes of the National Library of New Zealand (NLNZ) and
Archives New Zealand. These funds have been used to develop the repository infrastructure
and staff expertise needed to operate, manage and support sustainable long-term digital
programmes to care for the nation’s cultural heritage and government records in digital
form.
Recognising that data with long-term value, and therefore in need of digital preservation, is
being produced by many individuals and organisations across NZ, the NLNZ began a project
five years ago to explore a national approach to digital preservation. The idea is that NLNZ’s
digital preservation programme could be expanded to provide a digital preservation service
for other NZ organisations creating digital content with ‘high value’, i.e. that will contribute
to economic, social or cultural outcomes, now or in the future.
The NLNZ’s research has included surveys of targeted populations to understand for NZ the
value of data being created; the policies, strategies, practices and systems in place to
manage and maintain access to it; and the appetite to use a NZ digital preservation service.
CIO/CTOs of NZ state sector organisations were surveyed because of their responsibility to
maintain access to their organisation’s digital material, and eResearchers were surveyed
because they generate digital material of value. The surveys were first conducted in 2015,
and then repeated in 2019 to understand the extent to which digital preservation
practice and needs in NZ had changed or remained the same.
This presentation will share what has been learned through this research, and the
eResearch conference attendees will be invited to provide feedback on a potential national-
level digital preservation service.
ABOUT THE AUTHOR(S)
- Andrea Goethals
- Andrea manages the digital preservation team at the National Library of New
Zealand. She has primary responsibility for the overall day-to-day operations of the
National Digital Heritage Archive and contributes to the strategic direction of the
Library’s digital preservation programme. She champions digital preservation issues
and collaborates closely with others at the Library and around the world to advance
digital preservation standards and practices.
I'm a Big Metal Fan:
Big Data at the Lowest Level
Joseph Guhlin
Genomics Aotearoa @ University of Otago
How can a 1 Gbp genome and fewer than 200 samples produce 10 Tb of data? How do we
work with such massive datasets?
Genomics is benefitting from an accelerated increase in data. As we work with more
samples and larger genomes, data increases linearly. With machine learning
algorithms, data can increase exponentially. We need to change how we think about
processing data and performing analyses. At the analytical level, researchers should
understand how to reduce problems into the smallest solvable problem set. By attacking
small solvable problems, a large dataset becomes a series of computations which is easily
parallelizable. Map/Reduce is a technique from data science used to address this specific
problem. This technique benefits workflows, high-performance computing, and
programming.
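The decomposition described above can be sketched in a few lines of Python. This is an illustrative example rather than the speaker's own code: the reads, the chunk size and the helper names (`split_into_chunks`, `count_kmers`, `merge_counts`) are hypothetical. Each chunk is processed independently in the map step, and the partial results are combined pairwise in the reduce step.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(reads, chunk_size):
    """Split the list of reads into independent, fixed-size chunks."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def count_kmers(chunk, k=3):
    """Map step: count k-mers within one chunk, independently of all others."""
    counts = Counter()
    for read in chunk:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Hypothetical toy reads standing in for a large sequencing dataset.
reads = ["ACGTAC", "GTACGT", "ACGACG", "TTACGT"]
chunks = split_into_chunks(reads, chunk_size=2)

partial_counts = map(count_kmers, chunks)    # each call depends on one chunk only
total_counts = reduce(merge_counts, partial_counts, Counter())
```

Because `count_kmers` shares no state between chunks, the built-in `map` could be swapped for a parallel equivalent (for example a process pool or a cluster job array) without changing the reduce step.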
Other problems can arise from large datasets. Common bioinformatics software does not
scale to large genomes. Throwing hardware at the problem is the most common solution,
but there are alternatives such as memory mapping files.
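As a rough illustration of memory mapping, the sketch below uses Python's standard `mmap` module to scan a file in fixed-size windows, so that only the pages being read need to be resident in memory. The file contents, the window size and the `count_base` helper are assumptions made for this example, not part of the talk.

```python
import mmap
import os
import tempfile


def count_base(path, base, window=1 << 20):
    """Count occurrences of a single base without reading the whole file at once."""
    total = 0
    with open(path, "rb") as handle:
        # Map the file read-only; the operating system pages data in on demand.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for start in range(0, len(mapped), window):
                total += mapped[start:start + window].count(base)
    return total


# Write a small stand-in "genome" to a temporary file (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False, suffix=".seq") as tmp:
    tmp.write(b"ACGT" * 1000 + b"GGGG")
    genome_path = tmp.name

g_count = count_base(genome_path, b"G", window=512)
os.remove(genome_path)
```

Counting a single byte is safe across window boundaries; a multi-base pattern would need overlapping windows so that matches spanning two windows are not missed.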
Finally, there are processor instructions known as single instruction, multiple data (SIMD). These
allow running a single computation over multiple data points simultaneously. Experience in
a systems programming language is not a pre-requisite for this. Both Python and R have
tools to work with SIMD and GPU instruction sets.
In this lightning talk, I plan to share my story of working with large datasets, how I try to
address problems, and my failings and successes. Working with datasets of a size
unthinkable a decade ago requires a shift in thinking, both at the analysis level and at
the level of those writing the tools and libraries.
ABOUT THE AUTHOR(S)
Joseph Guhlin, PhD in Plant and Microbial Sciences. Has been working with Unix (FreeBSD,
originally) and Perl since age 12. Has expanded programming skills in Clojure, a Lisp dialect
that runs on the JVM, and Rust, a systems-level programming language gaining traction as
an alternative to C++. Interests include programming, genomics, big data sets, and machine
learning applications.
Training: It's better together
Megan Guidry
NeSI
New Zealand eScience Infrastructure (NeSI) provides expertise and capability to researchers
conducting computation and data intensive research in New Zealand. Within the training
sector, our core purpose is to raise the computational capability of New Zealand research
and, in turn, shrink the existing eResearch skills gap. To do this, however, we rely heavily on
healthy partnerships with various organizations throughout the country.
In this presentation, we will discuss our training efforts so far (both in terms of delivering
training, but also cultivating the New Zealand training community) and reflect on the scale
of the opportunity/challenge that we face. Ideally, this talk will be a conversation starter on
how we, as a community of busy and passionate eResearch enthusiasts, can continue to
improve processes and share knowledge more freely.
Ultimately, training needs to be useful and relevant to those who need it. NeSI strives to be
agile in its approach to training delivery and this presentation will conclude by noting what
we are doing today to ensure our efforts are increasingly measurable, scalable, and
community-focused.
ABOUT THE AUTHOR(S)
Megan Guidry is the training coordinator for New Zealand eScience Infrastructure (NeSI) and
is also the Regional Coordinator for the Carpentries in New Zealand. Her main priority is
raising the eResearch capability in New Zealand through training delivery and community
building.
Hybrid Training: a scalable model for delivering
hands-on training to dispersed learners
Christina Hall (1), Andrew Lonie (1), Jeff Christiansen (1)
(1) Australian BioCommons, Australia
Australia has a diverse array of government agencies, universities and research institutes
undertaking bioscience research. Biologists and bioinformaticians are distributed widely
throughout the country, and are sometimes isolated in small research groups or remote
locations. A novel training delivery methodology was developed to service the urgent needs
for bioinformatics skills in a cost and time efficient way. The method, which combines an
expert Lead Trainer delivering a presentation online in conjunction with a hands-on
interactive practical session at multiple venues supported by trained local Facilitators, is
ideally suited to the delivery of simultaneous training workshops around Australia. Referred
to as the ‘hybrid training model’, the scalable method combines the advantages of webinar
presentations with some valuable components of in-person group training.
Australian BioCommons’ hybrid training events regularly cater for more than 100
participants at up to 9 venues. Each participant brings their own laptop to a venue hosted
by one or more local Facilitators who are responsible for the local logistics including room
bookings and WIFI connections. Critically, Facilitators are themselves trained in the
workshop materials ahead of time. Presentations from the Lead Trainer are viewed on a
communal screen, with each participant simultaneously completing guided hands-on
activities. Live camera feeds from each venue help participants to feel they are part of a
larger community, and allow the Lead Trainer to observe room dynamics in real time. An
online shared ‘Discussion Board’ is active during the session, available for participants to
interact across venues, asking questions about their own specific challenges or interests.
Peers and experts alike join the discussion and answer technical questions. The 3-4 hour
events are structured to enable successful completion of exercises and to ensure nobody is
left behind or rushed through tasks. Recordings of the training events, presentations,
tutorials and Discussion Boards are made available for perpetual reference after the event
has concluded.
Engagement with skilled Facilitators is key to the success of each training event, and the
availability of training at particular locations is dependent on identifying a willing volunteer.
Group size is also determined by Facilitator availability, with an approximate ratio of 1:10
Facilitators to participants strongly encouraged. Experienced and active trainers and
researchers themselves, the Facilitators have been a valuable source of feedback on the
development of the model, as well as being integral local event organisers and workshop
helpers. They have been enthusiastic in their support of the hybrid training model and its
ability to supplement their own local training programs. The delivery method allows
regional universities with only a few participants to have direct access to training expertise
on the same footing as larger universities.
Online evaluation surveys show that close to 100% of all participants think ‘this was a useful
workshop that enhanced my knowledge and skills’, and that ‘the format of the exercises and
activities enhanced participants’ learning and increased their level of skills’.
The hybrid method of training delivery provides an efficient way to reach many venues
simultaneously, and is easily extensible to new sites. The events are particularly valued by
regional locations that may not otherwise have access to the depth and breadth of expertise
offered by national events. This methodology fosters the development of a community of
people interested in bioinformatics training and can help to elevate the profile of local
Facilitators and domain experts who participate. The recording of each event’s
presentations, cameras and links to materials allows for continued use by participants who
trained on the familiar environment of their own laptop. By posting these resources online,
the content is also suitable for self-guided use by the public.
The hybrid training methodology is an important feature of the Australian BioCommons
training program. Its ability to efficiently enable training of dispersed learners is compelling.
The potential to extend this format to incorporate a larger multi-national audience with
shared geographic challenges is currently being investigated.
ABOUT THE AUTHOR(S)
- Name: Dr Christina Hall
- Bio: Christina is the Training and Communications Manager of the Australian
BioCommons. In developing and implementing a national program of bioinformatics
training events and resources, Christina builds on similar previous roles for
Melbourne Bioinformatics and the EMBL Australia Bioinformatics Resource. Her
research career in plant pathology was interspersed with several science
communications roles, including museum public program management. Christina’s
professional motivation is to enhance scientific progress through supporting
Australian biologists to do their best science.
Singularity containers on HPC
W. Hayek (1), B. Bethwaite (2), B. Roberts (3)
(1) NeSI/NIWA, (2) NeSI/University of Auckland, (3) NeSI/Manaaki Whenua - Landcare Research
Containerisation is a form of virtualisation that has become very popular in the world of IT services and cloud computing, offering straightforward portability and deployment of software and services without having to install a complex set of dependencies. It has recently become available on High Performance Computing (HPC) systems through the popular Singularity software package. Singularity is a containerisation tool that is particularly suitable for HPC and scientific applications, featuring immutable software to support reproducibility of scientific results, as well as integration with HPC file systems, MPI, and more. This talk will outline the basic concepts of containerisation and discuss a recent NeSI consultancy project where a web server and database were containerised to process data on the Mahuika HPC. The project is now easily portable and can scale out to many cores, enabling very significant speed-ups.
ABOUT THE AUTHOR(S)
- Wolfgang Hayek is a research software engineer at NeSI and NIWA, and group
manager of NIWA’s scientific programming group, with many years of experience in
scientific computing and HPC.
- Blair Bethwaite is solutions manager at NeSI; he has strong expertise in HPC, cloud
computing, cloud architectures, and scientific computing.
- Ben Roberts is an application support specialist at NeSI and has many years of
experience in scientific computing and HPC.
Data… so what’s the problem?
Rosie Hicks, CEO Australian Research Data Commons
eResearch Australasia 2019
Digital data, tools and methods are changing everything, including the way research occurs and the societal challenges that we can address. All areas of research are becoming ever more dependent on data and eResearch. eResearch is now simply research. We care about data to enable discovery, to speed up research, to generate new knowledge that will make a difference. Ultimately, we undertake research to have an impact on people’s lives. A discussion of the challenges and opportunities now includes terms such as sensitive data, access to government data, open data, FAIR data, and trusted data, along with buzzwords such as Artificial Intelligence, Machine Learning and Cloud. Our first problem is a lack of shared understanding of these concepts. What does ‘sensitive data’ mean for health research or for defence technologies? This talk will examine a number of these terms to identify specific challenges that are preventing us from achieving the full impact of our research.
Strudel2: Increasing accessibility of HPC Infrastructure
Chris Hines
The Monash eResearch Centre supports a large number of communities with highly varied
computing skills. As such, one of our flagship offerings for a number of years has been simple,
easy access to desktops (i.e. VNC servers) running on the same HPC hardware that supports
our large users. This facilitates growing our users' computing skills from simple visualisation
tasks to large batch processing tasks over the course of their research careers.
Strudel2 is the next generation of this access tool. By carefully melding various
standard technologies, we've produced a framework capable of supporting not just our
traditional desktop embedded within the HPC environment but also many other tools,
for example Jupyter Notebooks and RStudio Server. We've also blended in federated single
sign-on authentication and, for Federation members that support it, multifactor
authentication.
ABOUT THE AUTHOR(S)
Chris has been kicking around the Australian eResearch sector for more years than he cares
to admit to anyone, let alone himself. Chris exhibits the typical arrogance of most people
with a background in physics, assuming all your problems can be solved easily if you simply
approximated your cows as spheres to simplify the maths. An itinerant sys-admin,
programmer and HPC consultant, he uses his skills wherever the needs of Monash University
research require.
The Undies-mate Un-debate
Chris Hines and Ai-Lin Soo
Monash eResearch Centre
Machine learning is far from error free, but if it's good enough to improve
outcomes, perhaps we should deploy it anyway? The ethical implications of how and
when we use technology, and specifically approaches such as machine learning and
data mining where the inherent biases of the system are not always obvious, are
something we all should engage with. We'll take topics from the audience, vote on
the most interesting and kick off the verbal equivalent of a WWE wrestling match.
Attendees should be prepared to "step into the ring" and voice their own
opinions. This is supposed to be a fun look at a serious topic.
ABOUT THE AUTHOR(S)
Chris Hines: Chris has been kicking around the Australian eResearch sector for more years than he cares to admit to anyone, let alone himself. Chris exhibits the typical arrogance of most people with a background in physics, assuming all your problems can be solved easily if you simply approximated your cows as spheres to simplify the maths. An itinerant sys-admin, programmer and HPC consultant, he uses his skills wherever the needs of Monash University research require.
Ai-Lin Soo: With a background in Commerce and Biomedical Science, I did not have eResearch on the radar until I stumbled across the Monash eResearch Centre. With a year under the eResearch belt (and hopefully many more to come), I'm enjoying my time as Project Officer and still
contemplating what the 'e' in eResearch stands for. I have an interest in improving health outcomes for society and when not contemplating all of the above, I spend my time watching TV crime dramas.
Support for Research Data Management in university libraries – How far have we come?
Jess Howie – Research Support Librarian
University of Waikato
Advances in computing and technology have triggered a tidal wave of data on a global scale.
In this rapidly changing environment, data has become an output in its own right and steps
need to be taken in order to ensure that data is appropriately managed, stored and
preserved. Ideal Research Data Management practices help to realise the potential of data
to enrich knowledge through re-use and re-analysis, as well as provide mechanisms for
validation and enhance reproducibility. Librarians are well-placed to support researchers to
manage their data optimally. Not only are they well-versed in metadata and findability, they
also have an important role to play as advocates and balancing voices in a discussion which
is politically, ethically and culturally charged.
This lightning talk will summarise research which explored the development of research
support services in New Zealand University Libraries via survey responses from all eight
Universities. The survey questions for the Research Data Management section were
repeated from a multi-country survey carried out in 2012. The sharing of data from the original
survey enabled some longitudinal analysis over this time period. Respondents were asked to
provide details on the level of services offered, partnerships with other units, job titles, staff
time, barriers to service development and skills gaps.
Among the findings were a strong indication of growth in services, alongside a reduction in
perceived barriers and an increase in staff capacity. The results point to a strong future for
Research Data Management support in libraries but also provide some warning as to areas
that require more development and a greater level of collaboration.
ABOUT THE AUTHOR(S)
- Name - Jess Howie
- Bio - I am a Researcher Support Librarian at the University of Waikato. My areas of interest
include Research Data Management, scholarly communication, Open Access and
research impact.
User journey-driven product management
Jun Huh
NeSI
NeSI was facing challenges around user onboarding. We built a journey map for NeSI
researchers to gain a better understanding of the extent of the problem and to focus on where
the biggest issue was. As an organisation, we are striving to be more metric-driven, and we are
using this user journey as a reference for team members to see things from researchers’
perspective.
Jun will share the process NeSI went through, along with the user journey and the service
blueprint that maps the journey to internal processes, and how looking at the numbers in the
context of the user journey helped us identify problem areas.
The process led us to achieve improvements in the account setup process, and has given
us a useful reference point to understand what to focus on next.
ABOUT THE AUTHOR(S)
- Jun Huh, Innovation and Growth at NeSI
- Jun comes from a start-up background with a focus on providing genuine value to
users and steering organisations to be more user-driven.
Learning How To Learn
Jun Huh
NeSI
This lightning talk presents some learning concepts that could be useful for researchers
wanting to learn a new skill or a new tool, and trainers who want to create effective
training programmes.
Jun will explain some learning-related concepts, including but not limited to:
• The mastery curve
• Chunking
• Categorising what to understand vs memorise vs practice
ABOUT THE AUTHOR(S)
- Jun Huh, Innovation and Growth at NeSI
- Jun comes from a start-up background with a focus on providing genuine value to
users and steering organisations to be more user-driven.
Frameworks for growth and innovation
Jun Huh
NeSI
What does it mean to grow as an organisation or a community? How do we innovate?
As an organisation we have to be adaptive to changes and new requirements. We have
touched upon many different frameworks to frame our thoughts and processes. We wish to
share some of these frameworks and talk about how they have helped the decision-making
processes in NeSI.
• Cynefin: How different levels of complexity in a problem push you toward
different approaches.
• Wardley map: Mapping different maturity levels of products and their directions.
Roles of pioneers vs town planners.
• User journey / service blueprint: Helping us see things from the users’ perspectives.
• Market segmentation / user personas: Building archetypes to help us communicate
using the right voice.
• Growth strategies: Managed growth vs growth hacking.
• Understanding user needs: What users say, what users do, and what users should
do.
ABOUT THE AUTHOR(S)
- Jun Huh, Innovation and Growth at NeSI
- Jun comes from a start-up background with a focus on providing genuine value to
users and steering organisations to be more user-driven.
Building an International FAIR infrastructure for Research Data
Guido Aben, AARNet, Australia
Kuba Moscicki, CERN, Switzerland
Carina Kemp, AARNet, Australia
Over the past ~6 years, a budding community of NREN and discipline operators of
synch&share stores has popped up. These operators typically run one of [ownCloud, seafile,
NextCloud, PowerFolder]. Judging by site surveys presented at consecutive synch&share-focused
CS3 [1] conferences, their services have all become runaway successes – it’s not
unusual for these stores to be in the PB range and to serve tens of thousands of real
researchers and their real research data. The next wave of open science policy, however,
tells us that data shouldn’t be locked inside a single vault – instead it needs to be
interlinked, citable, free to move; in short, FAIR. The CS3 community have always been
working towards enabling interlinking of the data between stores at the identity and
metadata levels. An open protocol was developed to announce, accept and propagate
shared volumes from one installation to another. This protocol is called OpenCloudMesh [2]
and is by now supported by most synch&share software vendors. So, we have the installed
base, the incentive to interlink, and the technology to interlink. We just haven’t taken actual
linking beyond proof of concept yet; not in an operational, sustainable way in any case.
A proposal: interlink synch&share stores into a new pan-European data eInfrastructure
We were informed in late 2018 that the EC had put out a call for the development of
innovative science cloud eInfrastructures, called InfraEOSC-02 [3].
[1] http://cs3.infn.it/info.html#
[2] https://www.geant.org/Services/Storage_and_clouds/Pages/OpenCloudMesh.aspx
[3] http://ec.europa.eu/research/participants/data/ref/h2020/wp/2018-2020/main/h2020-wp1820-infrastructures_en.pdf
This call matched surprisingly well with our intents. A few guidelines from the call may
illustrate this. Imagine you have interlinked sets of synch&share nodes, and that data can be
freely requested and mounted between them. Now think how well you’re placed to answer
these challenges from the call:
==========================
Highlights
• innovative models of collaboration that genuinely include incentive mechanisms for a user oriented open science approach
• develop innovative services that address relevant aspects of the research data cycle (from inception to publication, curation, preservation and reuse),
• allowing implementation of new scientific data-related developments and intelligent linking and discovering of all research artefacts
• foster interdisciplinary research, serving a wider remit of research needs, as well as new users like industry and the public sector.
==========================
A consortium has now been formed to deliver this project and is made up of ~10
eInfrastructure providers (NRENs, landmark instruments etc.), most in Europe, but AARNet is
also a contributor through their Cloudstor Services. The project shall be delivered not from a
blank slate, but rather building on an existing set of services already operated and in
widespread use among end users at the participant sites. This proposal does not focus on
development of software for a new infrastructure; rather, it is about systems integrating
existing components to deliver added value to the existing and active participants of the CS3
and GEANT communities.
The resultant eInfrastructure will be established by interfederating existing stores into a fabric
of “federated sites” based on federation mechanisms, operational routines and trust.
Federative best practices learned from eduGAIN and eduroam will be adopted and applied.
This presentation will present the building blocks for the project, the conditions to consider
to make this a success, and the proposed milestones, and will invite additional international
collaborators.
A Common Thread: Creating community, working together and enriching research
Sara King, AARNet
Natasha Simons, ARDC
Paula Andrea Martinez, National Imaging Facility (NIF), The University of Queensland (UQ)
Carina Kemp, AARNet
This Birds of a Feather session is for colleagues from the eResearch training sector to share experiences and knowledge in community building practices.
It will be an interactive session, starting with a panel (20 min) of eResearch professionals speaking about their roles in creating communities – describing the barriers as well as the successes – and future plans, opportunities, and maybe even a few ‘dream big’ moments!
Using the ‘Open Space’ rule allowing participants to move between groups, in the second part (30 min) of the session participants will select from proposed topics for smaller group discussions to take a direct approach to building community.
The goal of this BoF is to discuss and share ideas on how to nourish and grow a Community of Practice and to (not-so-sneakily) actively create or continue to evolve collegial relationships both within New Zealand and internationally.
This will be an occasion for participants from a broad range of areas, such as librarians, ITS staff, data stewards and others from the eResearch community, to connect and contribute, foster new collaborations and create opportunities to develop personally and professionally.
The session will have a short (10 min) debrief, and all discussions will be documented for future action and sharing. Anyone interested in staying in contact is welcome to join the online ENRICH Community of Practice, currently hosted on Slack.
Organisations on the potential panellist list:
AARNet
ARDC
New Zealand eScience Infrastructure (NeSI)
University of Queensland
References
Community of Practice, available from https://en.wikipedia.org/wiki/Community_of_practice, accessed 21 November 2019.
Open Space Technology, available from https://openspaceworld.org/wp2/what-is/, accessed 21 November 2019.
ENRICH Community of Practice: https://enrichcop.slack.com
ABOUT THE AUTHOR(S)
- Dr Sara King is an eResearch Analyst with Australia’s academic and research network provider, AARNet. She has extensive experience in researcher engagement and training, with expertise in research data and technologies in the Humanities and Social Science (HASS) research areas.
- Natasha Simons is Associate Director, Skilled Workforce, for the Australian Research Data Commons (formerly ANDS, RDS and Nectar). With a background in libraries, IT and eResearch, Natasha has a history of developing policy, technical infrastructure (with a focus on persistent identifiers) and skills to support research.
- Dr Carina Kemp is the Director of eResearch for AARNet responsible for making the network work the best it can for research in Australia. She works with the Australian and international research community to find and implement tools that sit above the network to make technology and data research ready. Previous to joining AARNet, Dr Kemp was the Chief Information Officer at Geoscience Australia.
- Dr Paula Andrea Martinez is the National Training Coordinator for the Characterisation Community in Australia. Her interests are on research methods development and now outreach and advocacy in data and software best practices.
Data-intensive approaches to finding and predicting research outcomes for New Zealand health research
Stephanie Guichard, Stacy Konkiel
Digital Science
How can we use data science to measure research outcomes at scale? Can quantitative data be used at all to understand research’s “impact” in its truest sense? In this presentation, we will share how, by asking the right questions, using the right data, and understanding data science’s strengths and limitations, New Zealanders can measure their success towards achieving health research outcomes, and even forecast future success.
Using the New Zealand Health Ministry’s “New Zealand Health Research Strategy 2017-2027” report as a case study, we will first show how thoughtful strategic planning makes it possible for data scientists to answer pressing questions like, “How can we track the implementation of research into health policy?” and “How can we produce the best research that supports the well-being of all New Zealanders?”
Next, we will discuss how linked bibliometric and altmetric data sources can help analysts better understand if and how New Zealand health research has achieved strategic priorities. Using unique data from Dimensions Analytics, a linked research intelligence database, and Altmetric Explorer, which provides data for understanding the broader impacts of research, we will use large scale visualization and statistical approaches to understand the current state of New Zealand health research with regard to desired outcomes; predict future trends based on past funding and publishing activity; and offer suggestions for ways to improve the likelihood of achieving desired research outcomes in the future. Among the outcomes studied will be international and industry collaboration trends, the translation of research into innovation and public policy, and public engagement with health research.
Finally, we will offer a frank discussion on the benefits and limitations of quantitative data in measuring desired outcomes like community collaborations and whether research is improving health outcomes for Māori and disabled peoples.
In some cases, leading engagement indicators like altmetrics can be used as a rough proxy for success, and are complementary to traditional program evaluation approaches. We will explore several instances where altmetric and bibliometric data succeed and where they fail.
References
Ministry of Business, Innovation and Employment and Ministry of Health. 2017. New Zealand Health Research Strategy 2017-2027. Wellington: Ministry of Business, Innovation and Employment and Ministry of Health. Retrieved from: https://www.health.govt.nz/publication/new-zealand-health-research-strategy-2017-2027
ABOUT THE AUTHOR(S)
Stephanie Guichard is a Regional Sales Manager for Digital Science in the Asia-Pacific region.
Previously, Stephanie worked with teams at Nature Publishing Group and Palgrave
Macmillan (Macmillan Science & Education). Stephanie graduated with a triple major in
Literature, Philosophy and History, specializing in medieval French literature and history. A
native of New York and Singapore, Stephanie now lives in Melbourne.
Stacy Konkiel is the Director of Research Relations at Digital Science. Stacy’s research
interests include incentives systems in academia and informetrics, and she has written and
presented widely about altmetrics, Open Science, and library services. Stacy was a co-
founder of the HuMetricsHSS initiative and is a Metrics Toolkit co-founder and Editorial
Board member. Previously, Stacy worked with teams at Impactstory, Indiana University &
PLOS. You can learn more about Stacy at stacykonkiel.org.
Cloud-native technologies in eResearch - benefits and challenges
Marko Laban
NeSI
The commercial world of public cloud has been rapidly evolving in the last 10 years, democratizing access to software engineering tools and technologies that provide an easy way for small teams to design, build and operate large distributed fault-tolerant applications at Google/Amazon scale. “Cloud native” is a multi-disciplinary approach/methodology that applies selected architectural patterns, software development processes and freely available open source libraries/frameworks to build distributed software applications designed to fully utilize the advantages of the modern cloud-computing model. In this paper, we aim to make a case for wider adoption of cloud native technologies in eResearch and discuss challenges on that path.
- Method: Comparative analysis, applying a mature methodology from commercial space in a new space (eResearch)
- Conclusion: “Researchers and the eResearch sector as a whole could reap significant benefits from adopting Cloud-native more widely”
- References:
o Cloud-native applications https://ieeexplore.ieee.org/document/8125550
o Understanding Cloud-native applications after 10 years of cloud computing https://www.researchgate.net/publication/312045183_Understanding_Cloud-native_Applications_after_10_Years_of_Cloud_Computing_-_A_Systematic_Mapping_Study
o Understanding the e-Research ecosystem in New Zealand: https://www.nesi.org.nz/sites/default/files/media/eScienceFuturesWorkshop-ReflectionReport.pdf
o Cloud-native infrastructure https://www.oreilly.com/library/view/cloud-native-infrastructure/9781491984291/ch01.html
ABOUT THE AUTHOR(S)
- Marko Laban
- Bio: Marko has more than 20 years of technical and product management experience in various software industry areas including Cloud, Enterprise Software, 3D/Manufacturing, Bioinformatics and start-ups. In the past he was involved with companies of different sizes from large ones like SAP and Cisco to early stage
start-ups. Today, Marko is working as a Software Product Engineering Lead at NeSI.
Scientific supercomputing: Teaching practical skills for credit
Joseph Lane
University of Waikato
While theoretical modelling and simulation are increasingly used in research, postgraduate
students are typically expected to learn these skills outside of the formal credit-bearing
papers that make up their degrees.
The University of Waikato recently undertook a complete redesign of its postgraduate Science
qualifications, including the redevelopment of all of the underlying papers. As part of the
review process, focus groups were held with both current and former postgraduate students.
One of the key themes that emerged through these focus groups was a desire to “learn
through doing”, with more focus on skills development rather than fact recollection. In
response to that feedback, a new postgraduate paper, SCIEN511 – Scientific Supercomputing,
was established, which provides a practical introduction to undertaking scientific research on
a supercomputer. The paper assumes no prior computational experience and is intended for
science students from a broad range of disciplines.
In this presentation, I will outline my experience in developing and teaching SCIEN511 –
Scientific Supercomputing, reflecting on the successes and challenges of running the paper
for the first time. A close collaboration with the NeSI team ensured a great outcome for the
enrolled students.
ABOUT THE AUTHOR(S)
Associate Professor Jo Lane is a computational chemist at the University of Waikato and is
currently Deputy Dean for the School of Science. Jo obtained his BSc(Hons) and PhD from the
University of Otago.
Data analytic transformation journey with Jupyter
Nancy Lin
NeSI
This short presentation shares key takeaways from how NeSI's internal data analytics shifted from
off-the-shelf software to open source. It demonstrates the journey of building a business reporting
system with Python and JupyterLab, and what Python will enable us to do in the future.
ABOUT THE AUTHOR(S)
- Nancy Lin
- NeSI Data Analyst
BoF: Next Steps for the Research Software Engineering Community in New Zealand
Nooriyah P. Lohani
New Zealand eScience Infrastructure
The term Research Software Engineer (RSE), originally coined by the UK RSE association (https://rse.ac.uk), describes a growing number of people in academia who combine expertise in programming with an intricate understanding of research. Although this combination of skills is extremely valuable, these people lack a formal place in the traditional academic system. Inspired by the success of the RSE Association in the UK, we continue to work towards establishing an Australasian Chapter of the RSE Association (https://rse-aunz.github.io/). Together with international bodies and support from national organisations such as AeRO, NeSI, CAUDIT, the Australian Research Data Commons (ARDC) and other research institutions, we aim to campaign for the recognition and adoption of the RSE role within the research ecosystem. Appropriate recognition, reward and career opportunities for RSEs are also needed. We would like to discuss the shortcomings and what worked in the year of events to allow RSEs to meet, exchange knowledge and collaborate. This BoF will cover the community building activities that have occurred, identify future plans and activities for the coming year in New Zealand and Australia, and discuss the next steps that were identified at the pre-conference workshop. If you are an academic or researcher who codes; a professional software engineer working in the research space; a systems administrator who maintains research systems; or someone who is passionate about quality research software, please join us at this event to actively work on how we can grow this community and advocate for others. Together, we can build a thriving community that benefits research software engineers, and ultimately contributes to a more efficient and sustainable research ecosystem.
ABOUT THE AUTHOR(S)
- Nooriyah Lohani
- I am a Bioinformatician by training and after working for a few years in a commercial
and academic realm, I am now a research communities advisor at NeSI passionate
about understanding research needs in the eScience sector. I am also Co-chair of the
RSE Australia New Zealand steering committee.
Progress in the Australian BioCommons
Andrew Lonie, University of Melbourne, Melbourne, Australia, [email protected]
As for many other research disciplines, rapid advances in digital technologies and methods
are proving transformational in life sciences. Internationally, major life sciences
infrastructure initiatives are increasingly defining global scale data infrastructures; in
particular, the US-based National Institutes of Health through their Data Commons program,
and the EU-based ELIXIR program and EBI, are building data infrastructures that are, in
many ways, equivalents of the global data-focussed infrastructures driving advances in
astronomy and physics - infrastructures like Hubble, LIGO, and the LHC. And, like astronomy
and physics, it is clear that world-class life sciences research in Australia and New Zealand
will increasingly depend on digital methods and data resources, and communities, that are
globally sourced and supported. Therefore, sponsored by Bioplatforms Australia, we have
developed a research infrastructure program called the Australian BioCommons that
strongly engages the research community, international infrastructure initiatives, and
national digital resource providers, recognising that Australia must understand, participate
in and contribute to global life science-enabling endeavours as a first class partner, and
presenting this as a clear vision of implementable requirements to national providers.
Method
As part of the challenge of building towards the BioCommons, a pathfinder project was established late in 2018. BioPlatforms Australia, the Australian Research Data Commons and AARNet committed $2.5M to the project, which was then extended by Pawsey and NCI donating time and facility resources. The pathfinder project demonstrates a concerted national infrastructure effort to better characterise future life science solutions.
Results
The pathfinder project required the identification of unknowns confronting the planning and development of the BioCommons proper. The five that we identified relate to key long term challenges:
• Human Genomes: the fundamental requirement for sensitive human data access and sharing
• Interoperability with global data: using paediatric cancer genomics as an exemplar
• Non-model Genome Assembly & Annotation: using genomics programs aimed at characterising Australian flora and fauna as exemplars
• Improvements to researchers’ ability to analyse their data: using phylogenetics as an exemplar, and addressing the technical challenges of integrating instruments and AARNet’s CloudStor as a data mover with the Galaxy research workflow system
• A Pathfinder Cloud: investigating approaches to compute and storage access that align with life science research practice and to understand the scale required of various classes of compute and storage infrastructure
Conclusion
Experience with the Implementation Studies confirms several design challenges that life science infrastructures confront, including:
• Data growth rates exceed constant cost technology performance growth rates
• The compute demand is different to established HPC practice and culture
• Australian data will continue to exist, but must be interpreted in a global context where data is increasingly too big to move/copy
• Infrastructure compatibility will be vital as more of the analysis and methodology software will be imported or globally developed
• A ‘cloud native’ paradigm is dominating peer life science infrastructure investment
Acknowledgments
We acknowledge BioPlatforms Australia as the primary sponsor of the Australian BioCommons, and numerous partners including the Australian Research Data Commons, AARNet, NCI and Pawsey supercomputing centres, and the Australian Access Federation. Bioplatforms Australia is an Australian National Collaborative Research Infrastructure Scheme program.
ABOUT THE AUTHOR(S)
A/Prof Andrew Lonie
Director, Australian BioCommons, University of Melbourne
Andrew Lonie is Director of the Australian BioCommons (http://biocommons.org.au), Director of the EMBL Australia Bioinformatics Resource (EMBL-ABR: http://embl-abr.org.au), and an associate professor at the Faculty of Medicine, Dentistry and Health Sciences at the University of Melbourne, where he established and then coordinated the MSc (Bioinformatics) for many years. Andrew directs a group of bioinformaticians and e-research specialists within the BioCommons to collaborate with and support life sciences researchers in a variety of research infrastructure projects across Australia.
Climate Data and Computing are Hotting Up!
Shona Mackie, Annika Seppälä, Inga J Smith
University of Otago
[email protected], [email protected], [email protected]
Understanding our climate and how it will change in future is a topic of increasingly urgent
research, with heightened levels of public and political support worldwide. Climate models,
however, are necessarily big. In theory, they represent all physical processes from the top of
the atmosphere to the bottom of the ocean, over land, water and sea ice, in 3-dimensional
grid cells with a resolution of 1 degree or finer. The interaction and evolution of these
processes is modelled with a temporal resolution of more than a single timestep per hour,
and typically we need to run for at least 100 years. Furthermore, uncertainties and internal
variability in the climate system mean that we run an ensemble rather than a single model
run. The structure of climate models means they can usually be parallelized to a point, but
they are not generally suitable for the distributed computing solutions that can be
implemented in other fields. As well as being expensive to run, climate models produce a lot
of data (PB scale). The idea is to capture the state of the whole world in a 3-dimensional
mesh with a temporal resolution fine enough to see how it changes, and a spatial resolution
fine enough to examine any physical process anywhere on Earth that might impact on
climate. For example, one model component (atmosphere, ocean etc) can be made of 1.2
million grid points. Saving just one parameter daily for 100 years = 44 billion data points. 30
parameters from 6 ensemble simulations amounts to 8 trillion data points, just from one
model component. These data have to be accessible so that we can do processing and
monitoring of climate model runs while they are underway, and need to be securely
archived in a way that makes them accessible for long term use, and shareable with
collaborators both present and future, here in Aotearoa and abroad. Running a climate
model is just the beginning of climate research; analysis of the data requires tools capable of
accessing and handling large data volumes that are generally stored on remote servers,
sometimes overseas, at a speed that makes interrogation and analysis practical.
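The data-point counts quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the figures for grid points, years, parameters and ensemble size come from the abstract, while the 365-day year (leap days ignored) is an assumption of this sketch.

```python
# Back-of-envelope check of the data-point counts quoted in the abstract.
GRID_POINTS = 1_200_000   # grid points in one model component (from the text)
DAYS_PER_YEAR = 365       # assumption: leap days ignored
YEARS = 100
PARAMETERS = 30
ENSEMBLE_MEMBERS = 6


def daily_points(grid_points: int, years: int) -> int:
    """Data points produced by saving one parameter once per day."""
    return grid_points * DAYS_PER_YEAR * years


one_parameter = daily_points(GRID_POINTS, YEARS)
full_ensemble = one_parameter * PARAMETERS * ENSEMBLE_MEMBERS

print(f"one parameter, daily, 100 years: {one_parameter / 1e9:.1f} billion points")
print(f"30 parameters x 6 simulations:   {full_ensemble / 1e12:.2f} trillion points")
```

Under these assumptions the two totals round to the 44 billion and 8 trillion data points stated in the text, for a single model component.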
Climate modelling is one of the most computationally hungry fields of research and New
Zealand has recently joined the list of the relatively few countries with the resources and
infrastructure to do it. Growing this field of research in New Zealand will need development
of resources and expertise to manage those resources. Events like eResearch 2020 are an
important way for information to be shared with network architects and data managers to
ensure that the systems and infrastructure are in place to support the next generation of
climate researchers.
ABOUT THE AUTHOR(S)
- Shona Mackie
- Shona Mackie is a climate modeller at University of Otago, developing the New
Zealand Earth System Model to include new physics processes, and carrying out
sensitivity studies using the current version of the model to better understand
uncertainties inherent in our climate projections.
- Annika Seppälä
- Annika Seppälä is a senior lecturer at Otago University Physics department. Her
research uses computational simulations together with large space based Earth
observation datasets to investigate solar influence on the atmosphere and climate
from global to regional scales.
- Inga Smith
- Inga Smith is a senior lecturer in the Department of Physics, University of Otago. Her
research interests are in sea ice physics and climate change, particularly the
influence of fresh water on sea ice formation.
Advancing New Zealand’s computational research capabilities and skills
Nick Jones
New Zealand eScience Infrastructure
Supporting research today and tomorrow requires an inclusive partnership with New Zealand researchers, communities, and Te Ao Māori, underpinned by a specialised and powerful technology ecosystem. Over the last year, New Zealand eScience Infrastructure (NeSI) has advanced New Zealand’s computational research capabilities and skills through training programmes, consultancy in research software engineering, strategic partnerships, and recruitment of NeSI team members with stronger knowledge of and connection into research communities. This session will look at how those activities, combined with recent enhancements to NeSI’s existing platforms and services, are enabling New Zealand’s science sector to compete and excel globally.
ABOUT THE AUTHOR
Nick Jones is NeSI’s founding Director, having established and led NeSI alongside a team of colleagues and peers since inception in mid-2011. Nick is responsible for NeSI’s strategic directions and performance overall, bringing together a talented and diverse array of people, and their institutions and interests. Nick has over 20 years’ experience in innovating in advanced information/computing technology in sectors including education, science and research. Nick established the eResearch NZ conference series in 2010 to support the sector coming together in the spirit of community to share experiences and explore directions in an area so critical to our future prosperity as a nation.
Growing the eResearch workforce in an inclusive way
Jana Makar 1 , Loretta Davis 2
New Zealand eScience Infrastructure (NeSI) 1 , Australasian eResearch Organisations (AeRO) 2
[email protected] 1 , [email protected] 2
Diverse teams have been shown to increase a company’s ROI and make decisions 2X faster with half the meetings, and corporations that hire women into the C-suite often see a 15% increase in profitability.
In this eResearch NZ Birds-of-a-Feather (BoF) session, we would like to build on the momentum and conversations started at a previous eResearch Australasian BoF and discuss the opportunities around better coordinating, supporting, and expanding diversity efforts within Australasia’s eResearch and HPC sectors.
This BoF’s ideal outcomes would include:
● a greater understanding of existing efforts and support structures in place for encouraging diversity and gender equity in Australasia’s HPC and eResearch sectors
● new and/or stronger relationships amongst Australasian diversity and gender equity advocates
● a short-term action plan for 2020-21 to explore ways to better connect, coordinate, and leverage existing diversity and inclusion efforts
Discussion notes and feedback gathered in this session will be collated and shared with the broader eResearch community, as part of a newly formed working group’s efforts to support the development of a Women in HPC (WHPC) community in Australasia. For more information on the Australasian WHPC Working Group, visit the AeRO eResearch Chat’s Diversity & Inclusion space.
ABOUT THE AUTHOR(S)
Jana Makar
Based at the University of Auckland, Jana coordinates a variety of engagement initiatives and external communications to raise the profile of NeSI’s activities, impacts, and collaborations. Prior to joining NeSI, Jana worked as a communications consultant for multiple organisations in Canada’s technology, academic, and startup sectors.
Loretta Davis
Loretta is a seasoned IT professional with 25+ years experience in the eResearch, commercial and government sectors in Australia, Africa and the USA. When not working part time for AeRO, Loretta consults as a Solutions Specialist to a number of private clients.
Towards FAIR principles for research software
Paula Andrea Martinez, National Imaging Facility (NIF), The University of Queensland (UQ)
[email protected]
The FAIR Guiding Principles, published in 2016, aim to improve the findability, accessibility, interoperability and reusability of digital research objects for both humans and machines. The FAIR principles are also directly relevant to research software. In this position paper “Towards FAIR principles for research software”, we summarised and developed a basis for community discussion. First, we discussed what makes software different from data concerning the application of the FAIR principles, and which desired characteristics of research software go beyond FAIR. Then, we presented an analysis of where the existing principles can directly apply to software, and where they need to be adapted or reinterpreted. Our next step after the position paper is to prompt for community-agreed identifiers for FAIR research software.
- Acknowledgments
To all the authors of Towards FAIR principles for research software https://doi.org/10.3233/DS-190026, and the numerous people who contributed to the discussions around FAIR research software at different occasions preceding the work on this paper.
- References
Lamprecht, Anna-Lena, et al. (2019) Towards FAIR principles for research software. Data Science. https://doi.org/10.3233/DS-190026
ABOUT THE AUTHOR(S)
- Dr Paula Andrea Martinez has led the National Training Program for the Characterisation Community in Australia since 2019. She works for the National Imaging Facility (NIF). Last year she worked at ELIXIR Europe, coordinating the Bioinformatics and Data Science training program in Belgium, and collaborated with multiple ELIXIR nodes in the development of Software best practices. Her career, spanning Sweden, Australia and Belgium, has nurtured her experience in Bioinformatics and Research Software development for complex and data-intensive science. She started her career in Computer Science, later became interested in research methods development, and now focuses on outreach and advocacy in data and software best practices.
Utilising Oxford Nanopore Data for the Genome Assembly of Endemic New Zealand Species
Ann Mc Cartney1, Chen Wu3, Ross Crowhurst3, Joseph Guhlin2, Chris Smith1, David Chagne5,
Thomas Buckley1.
1 Systematics Team, Manaaki Whenua, 231, Morrin Road, Saint Johns Auckland.
2 Department of Biochemistry, School of Biomedical Sciences, University of Otago,
Cumberland Street, Dunedin.
3 Plant and Food Research,120 Mt Albert Road, Sandringham, Auckland.
4 Commerce A, Room 113, University of Auckland, Symonds Street, Auckland.
5 Plant and Food Research, Batchelar Road, Fitzherbert, Palmerston North.
As part of Genomics Aotearoa, a high-quality genomes project has been established to
generate pipelines for the assembly of species across New Zealand. These pipelines are
specifically targeted at key species that are on the verge of extinction, treasured by Māori,
key players in the primary production industry, a significant threat to biosecurity within New
Zealand or have complex genomes, i.e. are abnormally large, have higher ploidy levels, are
highly repetitive or heterozygous. These species have been sequenced using a variety of
NGS platforms, namely Illumina, Oxford Nanopore (ONT), PacBio, Chromium 10X and Hi-C.
Here, genome assembly strategies under development on NeSI will be outlined specifically
using ONT and Illumina datatypes in order to highlight the impact of read depth and
coverage on genome assembly quality. This study deals with the optimisation of genome
assembly construction for projects with a limited budget or those that are confined to
certain locations/sequencing platforms. It also addresses optimal assembly strategies for
species with unique genome architectures. To investigate this, five species with
unique genome characteristics were selected, namely: Hericium novae-zealandiae, a small
diploid fungus; Clitarchus hookeri, a species containing a large, repetitive and highly
heterozygous genome; Knightia excelsa, a plant species with a medium-sized, non-repetitive
genome with low heterozygosity; and kiwifruit, another plant species with a smaller and
more repetitive genome structure. A focus will also be placed on the importance of
appropriate data management, transfer and sharing when working with taonga species.
ABOUT THE AUTHOR(S)
Name: Ann Mc Cartney
Bio: Ann Mc Cartney is a Genomics Aotearoa postdoctoral fellow at Manaaki Whenua -
Landcare Research. After completing her Genetics and Cell Biology degree at DCU in 2012,
where she graduated top of her class, she won IRCSET funding to complete her PhD in the
Bioinformatics and Molecular Evolution under the supervision of Dr. Mary O’Connell (now at
Nottingham University) in 2018. During her PhD, Ann identified and characterised fusion
genes with a specific focus on primate genomes, producing a thesis entitled "Novel gene
genesis by gene fusion: a network based approach". Since moving to Auckland, New Zealand
in 2018, she has worked as a Postdoctoral Researcher for Genomics Aotearoa’s High
Quality Genomes Project under Associate Professor Thomas Buckley, creating protocols for
the sequencing and assembly of endemic New Zealand species including stick insects such
as Clitarchus hookeri, fungi from the Hericium, Venturia and Pithomyces clades, birds such
as the Hihi and Kakapo, and honeysuckle trees such as the rewarewa.
Data Pipelines and Prisms
Alan McCulloch
AgResearch (NZ)
We describe a data-prism data processing and analysis metaphor, contrast this with the
data-pipeline metaphor and topology and describe several use cases.
The pipeline processing metaphor is popular for two main reasons: firstly, end-to-end
(longitudinal) processing integrity and performance is usually uppermost in the minds of
analysts and software designers; secondly, mature implementations of well-understood
formal pipeline-topology abstractions such as directed acyclic graphs are readily available,
as are well-understood end-to-end-oriented quality-control processes and metrics.
However, collections of input files associated with some large-scale datasets have important
side-to-side (latitudinal) structural features, processing and quality control metrics that are
not so well represented by the longitudinal pipeline metaphor and topology. For example,
while processing of a set of samples from a sequencing machine may conclude with perfect
end-to-end integrity per data-file, unsupervised machine learning (for example clustering)
applied latitudinally to a low-entropy precis of all of the input, intermediate or final data-
files may identify data features of interest such as outliers, relevant to quality control.
Another example of latitudinal processing and structure is in the use of multiple reference
frames for sample annotation, rather than a single reference, so that a single stream of
processing is refracted into multiple streams, with each stream searching a different
reference database, and/or using alternate search parameters. Technical steps such as job-
scheduling and intermediate and output file-disposition for such “short, wide” (as opposed
to “long, narrow”) processing streams can be awkward when using a pipeline metaphor. For
example pipeline-oriented scripting usually stores and indexes input, intermediate and
output files “non-semantically”, via hard-mapping the output from each pipeline-stage to a
different file-system folder for that stage, which does not work well if each folder receives
input from multiple threads of processing of the same data (for example, file-name
collisions will result).
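The file-name collision described above, and one way a semantic naming scheme might avoid it, can be sketched as follows. This is a minimal illustration rather than the author's implementation: the folder layout, the `stage_path` and `semantic_path` helpers, and the sample, reference and parameter names are all hypothetical.

```python
from pathlib import PurePosixPath


def stage_path(stage: str, sample: str) -> PurePosixPath:
    """Pipeline-style layout: one folder per stage, file named after the sample."""
    return PurePosixPath(stage) / f"{sample}.out"


def semantic_path(stage: str, sample: str, reference: str, params: str) -> PurePosixPath:
    """Prism-style layout: the path records which stream produced the file."""
    return PurePosixPath(stage) / reference / params / f"{sample}.out"


# One sample refracted into four streams: two references x two parameter sets.
streams = [(ref, par) for ref in ("refA", "refB") for par in ("strict", "relaxed")]

# Collect the distinct output paths each layout would produce.
by_stage = {stage_path("annotate", "sample01") for _ in streams}
by_meaning = {semantic_path("annotate", "sample01", ref, par) for ref, par in streams}

print(len(by_stage))    # number of distinct paths with per-stage folders
print(len(by_meaning))  # number of distinct paths with semantic storage
```

With per-stage folders all four streams resolve to the same path, so later outputs would overwrite earlier ones; encoding the reference and parameter set in the path keeps one file per stream.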
We describe some data-prism use cases, and a number of simple techniques we have found
useful in implementing a data-prism metaphor, such as semantic file storage and indexing, a
high level API for tasks such as random sampling and processing large numbers of input files
and parameter sets, low-entropy data representation approaches to support a high level
latitudinal view of the data, and the use of a meta-scheduler and command-line mark-up for
easier refraction of single into multiple streams of processing (and to try to reduce the
impedance mismatch between the shell command-line that many users know and love, and
the cluster job-submission systems known and loved by systems administrators).
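A latitudinal quality-control check of the kind described in this abstract might look like the sketch below. It rests on several assumptions and is not the author's method: each file is reduced to a single summary number (a stand-in for a low-entropy precis), and a median-absolute-deviation rule is used in place of the clustering mentioned in the abstract, because it needs only the standard library. The file names and values are made up for the example.

```python
from statistics import median


def flag_outliers(summaries: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return names whose summary value lies far from the median across all files.

    Uses the modified z-score 0.6745 * |x - median| / MAD.
    """
    values = list(summaries.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # At least half the files share the median value, so the rule
        # cannot rank deviations; flag nothing.
        return []
    return [
        name
        for name, value in summaries.items()
        if 0.6745 * abs(value - centre) / mad > threshold
    ]


# Hypothetical per-file summaries, e.g. mean read length for each sample file.
per_file = {
    "s01.fastq": 148.0,
    "s02.fastq": 150.0,
    "s03.fastq": 149.0,
    "s04.fastq": 151.0,
    "s05.fastq": 92.0,  # may still pass its own end-to-end integrity checks
}

print(flag_outliers(per_file))
```

The point of the sketch is that the comparison runs across files rather than along one file's processing path, which is the side-to-side view the data-prism metaphor is meant to support.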
ABOUT THE AUTHOR
Alan McCulloch is a Bioinformatics Software Engineer working at AgResearch’s Invermay
campus, mainly supporting genetic and genomic databases and high-throughput
computational pipelines.
Sharing across the ditch: infrastructure for social science research data in Australia and New Zealand
Steven McEachern
Australian Data Archive
Australian National University
Marina McGale
Australian Data Archive
Australian National University
Martin von Randow
COMPASS Research Centre
University of Auckland
Janet McDougall
Australian Data Archive
Australian National University
The social science research community has a long tradition of working collaboratively to
study comparative political, social and economic problems in an international context.
Australian and New Zealand researchers have made regular, long term contributions to
international research programs such as the International Social Survey Program [1], World
Values Survey [2] and the Comparative Study of Electoral Systems [3], and the results of this
work are disseminated internationally.
These international collaborations highlight significant opportunities for the establishment
of shared resources and infrastructure to support such programs. Social science data
archives have been established in many countries to support the efforts associated with
such programs. Particularly in the European Union, these now represent the major EU-funded
social science infrastructures under the ESFRI program, such as the Consortium of European
Social Science Data Archives (CESSDA) [4], and associated survey research programs of the
European Social Survey (ESS) [5] and the Survey of Health and Retirement in Europe (SHARE)
[6].
Our project
Cross-national collaborations such as CESSDA and the World Values Survey represent an
opportunity for Australian and New Zealand researchers and research infrastructures in the
social sciences, but one that has not yet been realised.
This paper therefore provides an overview of the collaborations over the past 10 years
between the Australian Data Archive and the COMPASS Research Centre. This collaboration
began with a joint effort between ADA and COMPASS to establish the New Zealand Social
Science Data Service (NZSSDS), using the NESSTAR access system in 2007-8. This collection
has now been maintained for 10 years, including migration to a Figshare repository around 5
years ago by the Centre for eResearch at the University of Auckland.
Over the 10 years since, the ANU Centre for Social Research and Methods (home of ADA)
and COMPASS have contributed to the ISSP program and provided support for the national
Election Study in each country. This paper therefore presents an overview of the
development of social science infrastructure over this period, and the collaboration
between Australia and New Zealand over that time and into the future.
Methods
To further this collaboration, over the past 12 months, ADA and COMPASS have been
working to preserve and update the NZSSDS collection of datasets as a hosted collection
within the Australian Data Archive. The collection – re-establishing the New Zealand Social
Science Data Service – is now managed and maintained through the COMPASS team in New
Zealand, but housed at ADA on the National Computational Infrastructure.
The re-establishment of NZSSDS has involved three separate streams of activity – each of
which is critical to the preservation and dissemination of research data. These streams
included:
1. Technical update: Migration of the existing data from the previous service
(established at Figshare) to the ADA Dataverse environment
2. Policy: updating preservation and access policies for the COMPASS data to meet New
Zealand, Australian and international regulatory and ethical requirements
3. Curation: Enabling web-based curation processes for data and metadata using
common procedures and international social science metadata standards (the Data
Documentation Initiative)
Results
The presentation will provide details of the progress of this new collection, a discussion of
the benefits and drawbacks to cross-national data management through the project, and
lessons for managing cross-national collaborations and shared infrastructure in the social
sciences more generally.
References
[1] International Social Survey Program (ISSP): http://issp.org/
[2] World Values Survey (WVS): http://www.worldvaluessurvey.org/
[3] Comparative Study of Electoral Systems (CSES): https://cses.org/
[4] Consortium of European Social Science Data Archives (CESSDA): https://www.cessda.eu/
[5] European Social Survey (ESS): http://www.europeansocialsurvey.org/
[6] Survey of Health and Retirement in Europe (SHARE): http://www.share-project.org/home0.html
ABOUT THE AUTHOR(S)
Steven McEachern is Director, Marina McGale is Web Services Coordinator, and Janet
McDougall is senior archivist at the Australian Data Archive. Martin von Randow is Data
Manager and Analyst at the COMPASS Research Centre.
Enabling authentication at global scale: an update on REANNZ services
Vladimir Mencl
REANNZ
REANNZ operates two well-known services for the New Zealand R&E community: Tuakiri,
the Identity and Access Management Federation, and eduroam, roaming infrastructure for
seamless network access. This presentation will give an update on new developments on
these services.
Tuakiri has recently completed the connection to eduGAIN, the global "federation of
federations" - and eduGAIN connectivity is now available to the NZ R&E community. This
allows users from NZ institutions to log into services from other federations via eduGAIN -
and in a similar vein, overseas users can log into NZ-based services connected to eduGAIN.
Tuakiri is rolling out eduGAIN with an opt-in process, and this talk will cover the steps an
Identity Provider or a Service Provider must take to join eduGAIN.
For eduroam, the global community has made several new very useful services available.
With eduroam CAT, the Configuration Assistant Tool, it is now easy to onboard users for
eduroam - either pointing them directly to the CAT website to install the connection profile,
or rolling out the connection profile across a fleet of devices through centralised
management infrastructure. In both cases, this results in a consistent (and more secure)
connection profile deployment. And CAT 2.0, released this year, makes it even easier for
institutions to register with CAT and create the end-user connection profiles.
While larger institutions are well capable of running the infrastructure required for eduroam
themselves, smaller institutions often find this task challenging. For these, as an alternative
to running the IdP infrastructure, the Managed IdP offering might be the right fit: a fully
hosted and managed eduroam IdP, with an interface for managing user accounts and for
deploying these accounts on user devices.
ABOUT THE AUTHOR(S)
Dr. Vladimir Mencl has been part of the New Zealand R&E community since 2006 and has
been involved in identity and access management projects since the early days of the
BeSTGRID project. When the Tuakiri project moved to REANNZ, Vlad joined REANNZ where
he is part of the Systems team as a Senior Software Engineer.
Beyond Super
April Neoh
Account Executive HPC/AI & Big Data Storage
Hewlett Packard Enterprise
E-mail: [email protected]
Discover “Beyond Super” supercomputing as HPE and Cray now team together as one company. Listen to how HPE and Cray are combining unique IP to deliver the capabilities you will need in the exascale era, with systems that perform like a supercomputer and run like a cloud.
Together, we are redefining the supercomputing industry with the solutions and services for a new era of converged modeling, simulation, analytics, and AI.
ABOUT THE AUTHOR
April has been in the IT industry for over 25 years at IBM, Cray and now HPE, working closely with technical and research customers to build high performance computing solutions using bleeding-edge HPC technologies. She has worked with government agencies and universities to build collaborative research partnerships and implement TOP500-sized HPC systems in Australia and New Zealand, with the goal of exploiting technologies to deliver scientific outcomes.
Using comparative RNASeq to identify small non-coding RNAs in bacterial clades
Thomas Nicholson1,2 and Paul Gardner1
1Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9054, New Zealand, 2Genetics Otago, University of Otago, New Zealand.
Small non-coding RNAs are involved in the regulation of a wide range of cell processes. A number of tools try to identify these RNAs using a range of methods. However, annotation of functional elements has been hindered by the difficulty of predicting non-coding RNAs from sequence alone and by transcriptional noise, which makes RNASeq data unreliable (Jose et al. 2019). While these methods do predict RNAs, it can be hard to determine whether a signal in RNASeq data comes from a real RNA or from noise. To deal with this problem we are using a comparative approach, taking RNASeq data from multiple genomes within a clade (Lindgreen et al. 2014). We have designed a pipeline that identifies peaks in intergenic regions of RNASeq data that may be functional RNAs, and uses genome alignments to check whether there are conserved regions of expression, which would indicate that the observed transcription is of functional RNAs. By using a comparative approach we aim to improve the signal-to-noise ratio in our results and produce a better list of candidate small non-coding RNAs.
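As an illustration only, the comparative step described above might be sketched as follows in Python. This is not the authors' pipeline: the function names `call_peaks` and `conserved_peaks`, the depth and length thresholds, and the assumption that each genome's intergenic coverage has already been projected onto shared genome-alignment columns are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# Half-open interval [start, end) in shared alignment columns (an assumption).
Interval = Tuple[int, int]


def call_peaks(coverage: List[int], min_depth: int, min_len: int) -> List[Interval]:
    """Return runs of consecutive columns whose read depth is >= min_depth.

    Assumes min_depth >= 1, so the trailing sentinel 0 always closes a run.
    """
    peaks: List[Interval] = []
    start = None
    for pos, depth in enumerate(coverage + [0]):
        if depth >= min_depth and start is None:
            start = pos  # a run of expressed columns begins
        elif depth < min_depth and start is not None:
            if pos - start >= min_len:  # discard very short runs as likely noise
                peaks.append((start, pos))
            start = None
    return peaks


def conserved_peaks(
    peaks_by_genome: Dict[str, List[Interval]], min_genomes: int
) -> List[Interval]:
    """Return the columns covered by a peak in at least min_genomes genomes.

    Peaks within one genome are assumed not to overlap, so the running
    count below equals the number of genomes expressed at that column.
    """
    events = []
    for peaks in peaks_by_genome.values():
        for start, end in peaks:
            events.append((start, 1))   # a peak opens
            events.append((end, -1))    # a peak closes
    events.sort()  # closes sort before opens at the same column

    conserved: List[Interval] = []
    depth, open_at = 0, None
    for pos, delta in events:
        depth += delta
        if depth >= min_genomes and open_at is None:
            open_at = pos
        elif depth < min_genomes and open_at is not None:
            conserved.append((open_at, pos))
            open_at = None
    return conserved
```

A candidate is kept only when expression recurs at the same aligned columns in several genomes, which is the intuition behind using conservation to separate functional transcripts from noise. A real analysis would also need to handle strand, alignment gaps and normalisation of read depth, none of which is modelled here.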
References
1. Jose, B. R., Gardner, P. P., & Barquist, L. (2019). Transcriptional noise and exaptation as sources for bacterial sRNAs. Biochemical Society Transactions, 47(2), 527-539.
2. Lindgreen, S., Umu, S. U., Lai, A. S. W., Eldai, H., Liu, W., McGimpsey, S., ... & Poole, A. M. (2014). Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling. PLoS computational biology, 10(10), e1003907.
ABOUT THE AUTHOR(S)
Tom is a PhD student in the Department of Biochemistry at the University of Otago. His research focusses on the bioinformatic analysis of small non-coding RNAs in prokaryotes.
Network-based Nonparametric Tests to Identify Genetic Modifiers of Rare Diseases
Eliatan Niktab ([email protected])1, Stephen Sturley ([email protected])2,
Ingrid Winship ([email protected]) 3-4, Mark Walterfang
([email protected])3-4, Paul Atkinson ([email protected])1, Andrew
Munkacsi ([email protected])1
1School of Biological Sciences, Victoria University of Wellington, Wellington, New Zealand
2Department of Biology, Barnard College at Columbia University, New York, New York, USA
3Melbourne Neuropsychiatry Centre, Royal Melbourne Hospital, Melbourne, Australia
4Medicine, Dentistry and Health Sciences, Melbourne University, Melbourne, Australia
Genome and exome sequencing have been highly successful in identifying disease-causing gene mutations and variants in genome-wide association studies (GWAS). However, there has been little success in deducing the pertinent genomic variants that significantly modify disease progression and fully account for phenotype. One reason is that current GWAS use narrow-sense heritability estimates and do not assess epistasis and other components of broad-sense heritability [1-2]. Here we report an investigation of genetic variants that modify the causal gene of a monogenic disease and ultimately regulate its onset and progression in individuals. Niemann-Pick type C (NP-C) disease, a rare monogenic Mendelian disease, is one of more than 6,000 Mendelian diseases for which there is no cure. Most NP-C patients with the NPC1 gene mutation are diagnosed as late infants and die before or during adolescence, yet the survival of some to adulthood provides a testbed for elucidating genes that alleviate the primary mutation. Therefore, we collected whole-genome sequences of pediatric and adult-onset patients. We then developed a pipeline that integrates mathematical models of genetic polymorphisms, augmented Bayesian biological networks, clinical records, and semantic ontologies of GWAS data. The tool that we developed analyzes sequencing data, identifies genome-wide interactions, and has scripts that control for confounding factors using heterogeneous data harmonization and modularity-based clustering. Our approach mitigates the statistical challenge of sample sizes inherent to current GWAS methodology. There are a large number of modifying genes that appear to function in epistatic networks of disease-modifying variants whose genetic effects together explain the heritability of NP-C in its various manifestations.
1- Kim, S. et al. Genes with high network connectivity are enriched for disease heritability. Am. J. Hum. Genet. (2019).
2- Escala-Garcia, M. et al. A network analysis to identify mediators of germline-driven differences in breast cancer prognosis. Nat. Commun. (2020).
Keywords: system biology, network, genome, GWAS, Bayesian
Acknowledgments: Ara Parseghian Medical Research Fund
ABOUT THE AUTHOR(S)
Eliatan Niktab
I’m a Ph.D. candidate at Victoria University and a member of Quantitative Methods,
Machine Learning, and Functional Genomics group at Genomics England Clinical
Interpretation Partnership (GeCIP). I’m trained to investigate diverse, complex and multi-
feature data including human genomics, proteomics, and metabolomics by developing
mathematical models and statistical analyses. My Ph.D. dissertation utilizes such models for
investigating the rare neurodegenerative Niemann-Pick type C (NPC) disease. I’m mostly
practiced in genome-scale algorithm design, using deep neural networks for genetic variant
discovery, Bayesian modeling, and GPU-accelerated software development.
Stephen Sturley, Ph.D.
Professor Sturley’s group uses a multidisciplinary approach that integrates genetics,
biochemistry and cell biology. He specializes in applying yeast as a model organism to
understand human lipid metabolism. His lab has focused, with particular success, on the
cholesterol, sphingolipid and fatty acid homeostasis underlying lipotoxicity, with
particular reference to neurodegeneration, obesity, and diabetes. Their use
of yeast to identify genetic modifiers of recessive disorders such as Niemann-Pick Type C
(NPC) resulted in the identification of histone deacetylase inhibitors as a candidate therapy,
which was further tested in murine models of NPC disease and is now in clinical
trials in human patients.
Mark Walterfang, M.D., Ph.D. FRANZCP
Professor Walterfang has significant experience in clinical neuropsychiatry and general adult
psychiatry with expertise in managing comorbid psychiatric and neurological disorders,
neurodegenerative disorders, neurometabolic disorders, and atypical dementias. Clinical
experience is the basis of his success in research, starting with 13 published papers from his
Ph.D. that have been cited more than 500 times. He has since published over 130 papers in
major psychiatric, metabolic, neurological and neuroimaging journals.
Ingrid Winship, MBChB, MD, Ph.D., FRACP, FACD
The focus of her research is the relationship between genotype and phenotype with
particular emphasis on human diseases. In the last two decades, her research has helped to
frame the questions, define the phenotypes, and, via statistical analyses, associate
genotype with phenotype. At the other end of the translational pipeline, her research has
translated the discoveries and knowledge into clinical protocols and policies, which has
changed the way patients are treated in medical practice via new drug targets and
biomarkers to monitor the onset and progression of the disease.
Paul Atkinson, Ph.D.
Professor Atkinson is a cell biologist who has long investigated ER-related events in
specification of membrane protein synthesis and transport. His studies have included ER,
Golgi and plasma membrane specific glycosylation structure determined by multi-
dimensional NMR. Specific membrane glycoprotein studies utilised model systems
including membrane maturing viruses. More recently he has utilized yeast gene knockout
libraries to investigate epistatic network contribution to phenotype in ER related events.
Andrew Munkacsi, Ph.D.
The Munkacsi lab studies the genetics, cell biology, and biochemistry of intracellular lipid
transport to identify novel targets to treat the defective transport of cholesterol and
sphingolipids underlying human disease. His group uses a unique combination of unbiased,
high-throughput systems biology approaches in yeast genomic screens based on synthetic
lethality, gene expression, protein localization, and protein-protein interactions, as well as
exome and genome sequencing of human patients. Dr. Munkacsi successfully used these
genome-wide yeast screens to identify unsuspected and precise targets to treat
neurodegenerative diseases such as Alzheimer’s disease and Niemann-Pick Type C (NPC)
disease, of which a subset have progressed to clinical trials.
NVIDIA Accelerated Computing Workshop
Dr. Gabriel Noaje
Senior Solutions Architect
E-mail: [email protected]
This workshop will provide an overview of the accelerated computing solutions that NVIDIA offers for HPC, DL and ML. From the GPU architecture to the CUDA-X library collection and developer tools, attendees will be introduced to the NVIDIA Platform for developers. In addition, dedicated frameworks for Deep Learning, such as DeepStream and Clara, and for Machine Learning, such as RAPIDS, will also be introduced.
13:30-13:45 Opening by Dennis Ang, NVIDIA APAC South Director
13:45-14:15 Convergence of HPC + AI
14:15-15:10 Accelerated platform overview and updates (system
architecture, CUDA, CUDA-X, development tools)
15:10-15:30 Afternoon tea
15:30-16:00 Deep Learning Tools and SDKs – DeepStream, Clara, etc.
16:00-16:20 NVIDIA GPU Cloud (Containerization, Transfer Learning Toolkit,
TensorRT, TensorRT Inferencing Server)
16:20-16:40 Convergence of HPC + Data Science (RAPIDS framework)
16:40-17:00 Q&A Session
ABOUT THE INSTRUCTOR
Gabriel Noaje has more than 10 years of experience in accelerator technologies and parallel computing. Having worked in both enterprise and public sector roles, Gabriel has a deep understanding of users’ requirements in terms of manycore architecture. Prior to joining NVIDIA, he was a Senior Solutions Architect with SGI and HPE, where he developed solutions for HPC and Deep Learning customers in APAC. Previously, he was a Senior Computational Scientist at the A*STAR Computational Resource Centre in Singapore (A*CRC), supporting users with deploying their applications on GPUs and large HPC systems.
Basics of Cloud Computing Workshop
Daniel O’Byrne
Catalyst Cloud
E-mail: [email protected]
An introduction to cloud computing and how you can take advantage of cloud technologies for your research. The workshop aims to show you the advantages of using cloud computing over legacy platforms and will guide you through setting up your first instance on the Catalyst Cloud.
Day: Thursday 13 February Time: 11.00am – 12.30pm
Academic Data Science: from individuals to institutions
Micaela Parker
Academic Data Science Alliance
Although data-driven research is already accelerating scientific discovery, substantial systemic challenges still exist in academia that impact both individual researchers and institutional decision-making. These challenges need to be overcome for academia to fully realize the promise of the new data era. Toward that end, working in partnership with one another and with the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation, three universities (the University of California Berkeley, New York University, and the University of Washington) have been attempting to create supportive environments for researchers using and developing data-intensive practices. Established in 2013, and known as the Moore-Sloan Data Science Environments (MSDSE), this collaboration was structured through a set of working groups on cross-cutting topics viewed as critical to advancing data science in academia: career paths and incentives, software development, education, reproducibility and open science, reflexive and reflective ethnography, and the role of physical space in collaboration. As the MSDSE grants approach sunset in 2020, the Academic Data Science Alliance (ADSA) has formed to expand the community and build a network across the U.S. and beyond to share lessons from the MSDSEs and from other higher education institutes experimenting with the integration of data science in academia. This talk will cover some of the efforts and activities of the MSDSEs and ADSA, emphasizing best practices and lessons learned that have emerged from six years of collaborative institutional experimentation, from cross-domain workshops and project incubators to the challenges of creating (and filling) new staff data scientist positions outside of any one particular lab or discipline.
ABOUT THE AUTHOR(S)
Micaela Parker is Executive Director of the Academic Data Science Alliance (ADSA). ADSA is a
domain-agnostic organization that supports university researchers in their efforts to
collaborate around data-intensive tools, methods, and responsible applications. By building
networks of data science practitioners and thought leaders, ADSA enables better sharing of
knowledge, ideas, and lessons learned.
Before launching ADSA, Dr. Parker served as Program Coordinator for the Moore-Sloan Data
Science Environments and Executive Director for the University of Washington’s eScience
Institute. In this role, she handled operations, developed research and training programs,
participated in strategic planning and fiscal oversight, and worked directly with university
and industry partners and funders.
Prior to 2014, Dr. Parker was a senior research scientist in UW’s School of Oceanography,
where she also earned her PhD. She has been involved in many large, interdisciplinary
projects bridging oceanography and genomics. Coming from a data-rich domain, she
appreciates the new data-driven world for all its benefits and challenges. She now enjoys
facilitating collaborations to help researchers navigate this fourth scientific paradigm.
Mice, organoids and single cells: computational methods for cancer treatment
Elizabeth Permina1*, Tom Brew1, Mik Black1, 2, Parry Guilford1,2, 3
1. Centre for Translational Cancer Research, University of Otago
2. Department of Biochemistry, University of Otago, Dunedin Otago, New Zealand
3. Pacific Edge Ltd, New Zealand
Hereditary Diffuse Gastric Cancer (HDGC) affects hundreds of people in New Zealand, many from
Māori families. An inherited mutation in the E-cadherin gene (CDH1) is a strong driver of HDGC, affecting
individuals as young as 15 years old. A promising way of combatting HDGC involves finding a synthetic
lethal (SL) partner to the HDGC-defining gene, CDH1. Synthetic lethality is defined as a specific
relationship between two genes where a loss of one is tolerated by the cell but the loss of both is lethal.
An innovative way of mixing computational approaches with experimental data offers a method of
identifying a range of prospective drug targets and treatments. Generation of mouse gastric organoids
(simplified versions of a mouse stomach produced from mouse stem cells with a micro-anatomy that is
close to that of a real stomach) with and without CDH1 loss provides a realistic model for HDGC, and
single-cell RNA sequencing (scRNA-seq) then provides whole-transcriptome data for these organoid
samples. Here I will present an analysis of the organoid scRNA-seq data, utilizing linkage to existing SL
gene and pathway data (including siRNA studies done previously in our lab) as well as integration of
publicly accessible data sets derived from patient tumours.
Elizabeth Permina is a Postdoctoral Research Fellow working on the HRC funded research programme
“Reducing the burden of gastric cancer in New Zealand”, based in the Centre for Translational Cancer
Research at the University of Otago, Dunedin.
Tom Brew is a PhD student in Biochemistry, with research focused on developing novel approaches to
treating hereditary diffuse gastric cancer.
Associate Professor Mik Black is a Principal Investigator in the Centre for Translational Cancer Research at the University of Otago, and the bioinformatics lead for Genomics Aotearoa, a national initiative for developing genomics and bioinformatics capability in New Zealand. His research focuses on the development and application of statistical and bioinformatics methodology to problems in human health, with a particular focus on cancer.
Professor Parry Guilford is a Principal Investigator in the Centre for Translational Cancer Research at the University of Otago, whose research focuses on the role played by the gene E-cadherin in the development and progression of hereditary diffuse gastric cancer. He is also a co-founder and Chief
Scientific Officer of Pacific Edge Ltd.
Enhancing eResearch productivity with NeSI's consultancy service
Alexander Pletzer1, Chris Scott2, Wolfgang Hayek1 and Georgina Rae2
NeSI (NIWA1, University of Auckland2)
Many research areas, from bioinformatics and genomics to materials science, fundamental physics, earthquake simulation and weather/climate prediction, are increasingly dependent on the availability of powerful computing platforms and deep software stacks. Unfortunately, scientific software too often runs at sub-optimal performance, sometimes reaching only a few percent of the peak performance of the supercomputer. Small changes in code implementation details, the choice of compilers and libraries, and adjustments to the runtime environment have sometimes been shown to have a significant impact on code performance. By walking through some examples, we show how researchers were able to leverage NeSI’s free consultancy service to squeeze more performance out of their applications, sometimes reducing the execution time severalfold, a win-win solution that benefits science and saves core hours.
ABOUT THE AUTHOR(S)
- Alex Pletzer is a research software engineer for NeSI at NIWA. Originally a physicist,
Alex drifted towards high performance computing during a career that spans research in plasma
physics, working for a private company in Colorado and supporting users at a
university in Pennsylvania.
- Chris Scott is a research software engineer for NeSI at the University of Auckland.
Currently lead of the computational science team, Chris has a background in
molecular dynamics, Monte Carlo methods, finite element analysis, visualisation and
parallel computing.
- Wolfgang Hayek is a research software engineer for NeSI at NIWA and scientific
programming group lead at NIWA. Wolfgang has expertise in radiative transfer
modelling, fluid dynamics, data analysis and high performance computing.
- Georgina Rae is engagement manager and, until recently, was team lead for the
computational science team. Georgina has experience in food and plant research
and has worked in the world of intellectual property and commercialisation.
Running Rāpoi: Rebooting Research Computing & Support at VUW
Jonny Flutey, Andre Geldenhuis, Wes Harrell, Matt Plummer
Victoria University of Wellington
[email protected]; [email protected]; [email protected];
Over the past year, a team newly based in the Centre for Academic Development has taken
responsibility for consolidating and reconfiguring high performance computing resources at
Victoria University of Wellington. Accompanying this technical reboot have been refreshed
approaches to support, training, community building, rebranding and data gathering.
This paper will outline the processes and practices adopted in this holistic approach to
research support and researcher development, ranging from wrangling computing nodes
into a shipping container, to onboarding non-typical users to an HPC environment. It will
give an overview of the tools used (Slurm, Slack, GitHub, MkDocs, Ganglia, AWS Cost Calculator),
training and events offered (Carpentries workshops, ResBaz, community catch-ups), and
approaches undertaken (including developing a capability tied to available resources,
reporting metrics, alignment with NeSI environments, and engagement with internal and
external stakeholders).
https://vuw-research-computing.github.io/raapoi-docs/
https://resbaz.github.io/resbaz2019/wellington/
https://medium.com/the-data-nudge/key-takeaways-from-research-bazaar-wellington-2019-937c57c8699
ABOUT THE AUTHOR(S)
Jonny Flutey is Digital Learning and Research Manager, Andre Geldenhuis and Wes Harrell
are Research Software Engineers, and Matt Plummer is a Digital Research Consultant, all
based in the Centre for Academic Development at Victoria University of Wellington.
Per-sample pathway analysis tool for DNA methylation data
Santana, A.F.1,2, Benton, M.C.1, Macartney-Coxson, D.1, Black, M.A.2
1Human Genomics, Institute of Environmental Science and Research (ESR), Porirua, New Zealand
2Department of Biochemistry, University of Otago, Dunedin, New Zealand
[email protected], [email protected], donia.macartney-[email protected], [email protected]
Pathway enrichment analysis plays an important role in the understanding of biological processes and diseases. Such analyses interrogate defined sets of differentially methylated CpG sites and/or differentially expressed genes, evaluating whether changes to members of biological pathways are occurring by chance or not, indicating the potential biological relevance of these functional groupings. In the DNA methylation context, traditional paired or case:control designs are statistically tested to investigate which methylation sites are significantly altered across samples. However, due to the implementation and design of such tests, per-sample analyses are infeasible. Moreover, unlike gene expression, methylation data can skew pathway analysis results if not properly processed, as CpG probes are generally unevenly distributed across genes. For instance, there are genes which contain as few as one probe, while others have hundreds of probes mapping to them. Consequently, genes with a large number of probes have a higher probability of exhibiting differential methylation, and hence reported perturbed pathways are likely to include false positive results. We propose a novel pathway analysis tool for DNA methylation data, which enriches gene sets and analyses pathway disruption on a per-sample basis, and reduces the bias from CpG-to-gene mapping. This is a flexible approach that can apply different categorisation techniques to methylation signals (beta values), creating CpG sets. These are then converted into gene sets after adjustment using one of the multi-mapping bias methods developed. The final output is then enriched for pathway membership. To assess statistical significance, a resampling step is performed to evaluate whether enriched pathways are robust or if they could have been randomly obtained. This tool is under development and will be made available as an R package.
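To make the workflow above more concrete, here is a minimal sketch in Python (the tool itself is planned as an R package, and none of this code comes from it). Everything named here is an assumption for illustration: the functions `flagged_genes` and `empirical_p`, the beta-value cutoff, and in particular the bias adjustment, which simply flags a gene by the fraction of its probes that are altered rather than by the raw count. The abstract does not say which categorisation or multi-mapping bias methods the authors use.

```python
import random
from typing import Dict, List, Set


def flagged_genes(
    betas: Dict[str, float],
    probe_to_gene: Dict[str, str],
    beta_cutoff: float = 0.8,   # hypothetical categorisation threshold
    min_fraction: float = 0.5,  # hypothetical per-gene adjustment
) -> Set[str]:
    """Flag genes for ONE sample, using the fraction of altered probes.

    Working with a fraction instead of a count means a gene with hundreds
    of probes is not flagged just because a few of them pass the cutoff.
    """
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for probe, beta in betas.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe does not map to any gene
        totals[gene] = totals.get(gene, 0) + 1
        if beta > beta_cutoff:
            hits[gene] = hits.get(gene, 0) + 1
    return {g for g, n in totals.items() if hits.get(g, 0) / n >= min_fraction}


def empirical_p(
    flagged: Set[str],
    pathway: Set[str],
    background: List[str],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Resampling p-value for a pathway in one sample.

    Compares the number of flagged genes in the pathway with the number
    found in random gene sets of the same size drawn from the background.
    """
    rng = random.Random(seed)
    members = [g for g in background if g in pathway]
    observed = sum(g in flagged for g in members)
    as_extreme = 0
    for _ in range(n_resamples):
        draw = rng.sample(background, len(members))
        if sum(g in flagged for g in draw) >= observed:
            as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (as_extreme + 1) / (n_resamples + 1)
```

Calling `flagged_genes` once per sample and then `empirical_p` once per pathway gives a per-sample table of pathway p-values, which would still need correction for multiple testing before interpretation.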
ABOUT THE AUTHOR(S)
Alessandra Santana is a PhD student in Bioinformatics at the Institute of Environmental Science and Research (ESR) and University of Otago. In her collaborative project, she develops computational and visualisation tools to investigate obesity and its related metabolic disorders and the tractability of epigenetic information as disease markers, particularly focusing on DNA methylation. Her interests include machine learning, clustering techniques, DNA methylation, microRNAs and genetics.
Humanities Data Untied - An Untapped Resource or just an Untidy Office?
Alexander Ritchie
University of Otago Library
This presentation will reflect on aspects of the impact made by data science and digital tools within the Library and the Humanities at Otago University. It will seek to untie some of the tightly bound threads of 'data' metaphors in a humanities context, and touch on how librarians and libraries are helping to order the 'untidy office' of data scholarship nationally and internationally. It will do this through three provocations:
• do the Humanities actually have 'data',
• what should and do Humanists do with the data if we do indeed have it, and finally
• what does it mean to be 'united' in data in the context of a Humanities-in-crisis, indigenous data sovereignty, and continual under-funding of the GLAM and cultural sector.
It will conclude in musing about the relationship between data, capta, information, knowledge, and wisdom, and what narrative and metaphor might offer eResearch in Aotearoa.
ABOUT THE AUTHOR(S)
Alexander Ritchie currently works as a librarian in the humanities at the University of Otago Library, having previously worked as a librarian in sciences, with Otago Polytech, and at Te Uare Taoka o Hākena | Hocken Collections. Recently, he has collaborated with colleagues in the UO Library and School of Arts to support the University's Divisional Digital Humanities Hub | Te Pōkapu Matihiko. This work has involved staffing open hours, running seminars, developing resources, hosting workshops, and advocating for the value of the digital in the humanities. He is yet to master the art of writing about himself in the third person.
First steps in machine learning with NeSI
Chris Scott1, Kameron Christopher2, Alexander Pletzer1, Wolfgang Hayek1, Nooriyah Lohani1
NeSI1, NIWA2
This is a hands-on, beginner-level workshop on machine learning with NeSI. We will focus on image recognition as an example, but this workshop should also be useful to those who wish to build their confidence with machine learning tools such as Keras and TensorFlow. We will begin with a broad introduction to some of the machine learning techniques that will be applied during the workshop, such as convolutional neural networks. Then we will create and train a machine learning model to count objects in images using Keras and TensorFlow within a Python notebook environment. The workshop will then focus on tweaking the model to improve performance. Attendees will have the opportunity to provide feedback to the group, to learn from each other’s experiences and to discuss any pitfalls that were encountered, such as overfitting. This workshop is aimed at building your confidence in applying machine learning tools and techniques and at preparing you for taking the next steps in using machine learning for your research.
ABOUT THE AUTHOR(S)
- Chris Scott is a Research Software Engineer for NeSI at The University of Auckland
- Kameron Christopher is Chief Scientist – HPC and Data Science at NIWA
- Alex Pletzer is a Research Software Engineer for NeSI at NIWA
- Wolfgang Hayek is a Research Software Engineer for NeSI at NIWA
- Nooriyah Lohani is Research Communities Advisor for NeSI at The University of
Auckland
HPC for life sciences: handling the challenges posed by a domain that relies on big data
Dinindu Senanayake
New Zealand eScience Infrastructure (NeSI)
The advancement of sequencing technologies, proteomics, microscopy (high-throughput,
high-content), etc. and their decreasing cost are responsible for creating an avalanche of
data across multiple sub-domains that fall under life sciences. This data deluge demands an
interdisciplinary approach to face the associated challenges such as data storage, parallel
and high-performance computing solutions for data analysis, scalability, security and data
integration. The ability to deliver solutions to these needs will result in converting highly
granular, unstructured data into real scientific insights, which will accelerate the advances
being made in precision medical treatment based on an individual’s genetic makeup,
developing drugs with minimum side effects, species conservation programmes, etc.
New Zealand eScience Infrastructure (NeSI) is focused on delivering the tools required by
our researchers, who might need a “huge” amount of memory to assemble a large genome,
need to simulate in parallel the Newtonian equations of motion of biochemical molecules
such as proteins and nucleic acids, need storage for ever-increasing volumes of data
(from day-to-day to “Sensitive”), or need efficient methods for end-to-end data
transfers. NeSI’s partnership with Genomics Aotearoa has also been instrumental in
introducing training tools such as virtual machines, and an extensive number of workshops
hosted on these machines, which are helping beginner-level
bioinformaticians/computational biologists to acquire advanced skills within a short period,
to be used in their search to understand the rules of life.
ABOUT THE AUTHOR(S)
Dinindu Senanayake
- An Applications Support Specialist at NeSI with a particular interest in Bioinformatics
and Computational Biology. Joined NeSI following half a decade of research
experience gained in the field of Cancer Genetics, Chemical Genetics and
Bioinformatics
Applied Deep Learning for Diverse Research Communities
Prof. Richard O. Sinnott
University of Melbourne
The Melbourne eResearch Group (www.eresearch.unimelb.edu.au) are involved in a multitude of projects, many of which are focused on big data and data analytics. Many research challenges stand to benefit from artificial intelligence, and especially from the application of deep learning and convolutional neural networks (CNNs). This talk will provide an overview of a portfolio of projects that have benefited from recent advances in the deep learning domain. These include case studies related to:
• pedestrian/crowd counting for the City of Melbourne;
• (early) fruit counting on trees (for fruit growers to estimate yield);
• tree volume canopy estimation (for fruit growers to estimate the amount of spraying needed);
• truck and trailer classification for VicRoads;
• feral cat classification for ecology researchers working in rural Victoria;
• plant and flower classification for commercial agricultural companies, and
• encroachment of vegetation on powerlines for a range of utility companies.
The talk will cover a brief background to deep learning and CNNs and focus on the results that are now possible, with specific focus on projects requiring image detection and classification. Demonstrations of the results of the case studies will be provided.
ABOUT THE AUTHOR(S)
Professor Richard O. Sinnott is the Director of eResearch at the University of Melbourne and Chair of Applied Computing Systems. In these roles he is responsible for all aspects of eResearch (research-oriented IT development) at the University. He has been lead software engineer/architect on an extensive portfolio of national and international projects, with specific focus on those research domains requiring finer-grained access control (security). He has over 400 peer reviewed publications across a range of applied computing research areas.
United in data management: Is it time for a national research data management framework?
Shiobhan Smith and Laura Armstrong
University of Otago and University of Auckland
In 2015 the CONZUL Research Data Management Framework Report for Universities New
Zealand made the following observation:
“While the concept of research data management has not changed, the environment in
which research is conducted has. Researchers are now able to generate extremely large
volumes of data over very short periods of time, and analyse complex systems where
previously a reductive approach was required. The impact of technology on modern
research has led to a situation where our ability to manage research data has been
overtaken by our ability to generate it, a situation which has created a separation in the
scholarly record. Where once research data were available for peer reviewed
communication, whether in formal publication, collaborative agreements or between
individuals, data are now stored on volatile media in inaccessible locations and without any
contextual semantics or clear lines of ownership, provenance or purpose. Researchers are
unable to, or see little value in structuring their data more effectively and institutions are
unsure how to encourage this.
There is a significant risk that these data, this evidence of the scholarly record, will be lost;
rendering the publications, communications and discourse they generate un-defensible and,
in an academic context, useless. This risk of loss is borne of two circumstances. First,
technology’s inability to store and preserve digital objects for long periods; disks degrade or
fail and data bit-streams corrupt. Second, an absence in the research process of essential
data structure activities so that data may be found, understood, shared and attributed in
line with community conventions in data sharing and validation. Together these two
circumstances encapsulate the need for RDM.
The current state in New Zealand is a fragmented approach to provision that trails other
parts of the world; most notably the UK and EU, the US and Australia. ” 1 (p7)
Fast forward 5 years and we believe this fragmented approach persists. One possible reason
for this is that there is no nationally recognised framework or guideline that sets clear
expectations on how research data needs to be managed by New Zealand research
institutions. Universities are including aspects of research data management in policies such
as researcher codes of conduct and open access guidelines but is this enough to ensure that
New Zealand’s research data is treated as a valued asset and is FAIR2?
This workshop is a follow-on from initial conversations at URONZ2019 and a session at
Figshare Fest NZ 2019, and is aimed at those with an interest in, or who support, research
data management. Working in small groups, participants will brainstorm the concept of a
national research data framework: who should be responsible, what should be included, how
it should be monitored, and the risks versus benefits of adopting this collaborative
approach.
References:
1 Wilkinson, J Max, Flaherty, Brian, Hearne, Shari, Lynch, Helen, Lamond, Heather, Dewson,
Natalie, … Amos, Howard. (2016, February 2). Research Data Management Framework
Report (Version Public). Zenodo. http://doi.org/10.5281/zenodo.1193195
2 Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for
scientific data management and stewardship. Sci Data 3, 160018 (2016).
doi:10.1038/sdata.2016.18
ABOUT THE AUTHOR(S)
Shiobhan Smith
Shiobhan has over 10 years’ experience working in Libraries and Museums. Prior to being
appointed as the University of Otago Library’s Research Support Unit Manager, Shiobhan
was Subject Librarian to a number of Humanities departments including Sociology,
Anthropology, Geography, and Theology. As Subject Librarian to the Centre for
Sustainability, Shiobhan was involved in the development of the Otago Data Management
Planning tool and has an interest in Research Data Management. Shiobhan also has
knowledge and skills in Digital Humanities, Bibliometrics, and Information Literacy.
Laura Armstrong
Laura Armstrong is a Senior eResearch Engagement Specialist at the Centre for eResearch,
University of Auckland, working to engage researchers in eResearch and to deliver research
data management services and researcher enablement projects.
Uniting equipment and research publications: bigger than Ben Hur?
Shiobhan Smith and Fiona Glasgow
University of Otago
The issue is simple enough: create a list of publications that are the result of using a specific
piece of equipment. But how do you do this? Does your institution have mechanisms that
track data outputs from their inception in equipment, like flow cytometric analysers, to a
completed and published article? How many researchers may work on that data in
between? Would they cite that equipment in their outputs?
OMNI (Otago Micro and Nanoscale Imaging unit) administrators must report on the
scholarly outputs produced as a result of using their equipment. To date this process has
been difficult and manual, requiring hours of work gathering lists of publications from the
research outputs database, contacting researchers, and relying on self-reporting. As
publishers do not systematically ask for equipment metadata as part of the publishing
process, there is no easy way to query publication databases. In 2018, OMNI manager
Charlene Gell met with the library Research Support Unit (RSU) to investigate a more
streamlined and sustainable solution. What the RSU discovered is that the issue is simple
but the solution is complex.
In this presentation RSU members Shiobhan Smith and Fiona Glasgow will use the OMNI
project as a case study to discuss persistent identifiers, the data lifecycle, data
management, research management systems, and citation culture. They will break down
the problem, present possible solutions, and seek feedback. After all, equipment and
publications are united by data. But is maintaining that connection, as the data moves
through the research lifecycle, simply bigger than Ben Hur?
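To make the linking problem concrete, here is a minimal sketch of how persistent identifiers could connect equipment to publications through the datasets in between. It is an assumption-laden illustration and not the OMNI or RSU solution: the class names, field names, identifiers and placeholder DOIs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    dataset_id: str    # hypothetical persistent identifier for the dataset
    equipment_id: str  # hypothetical identifier of the instrument that produced it


@dataclass
class Publication:
    doi: str                                         # placeholder DOI, not a real one
    dataset_ids: list = field(default_factory=list)  # datasets the article cites


def publications_for_equipment(equipment_id, datasets, publications):
    """Return the DOIs of publications citing any dataset produced on the instrument."""
    produced = {d.dataset_id for d in datasets if d.equipment_id == equipment_id}
    return sorted(p.doi for p in publications if produced & set(p.dataset_ids))


datasets = [
    Dataset("dataset-001", "equipment-A"),
    Dataset("dataset-002", "equipment-B"),
    Dataset("dataset-003", "equipment-A"),
]
publications = [
    Publication("10.0000/example.1", ["dataset-001"]),
    Publication("10.0000/example.2", ["dataset-002"]),
    Publication("10.0000/example.3", ["dataset-002", "dataset-003"]),
]

equipment_a_outputs = publications_for_equipment("equipment-A", datasets, publications)
print(equipment_a_outputs)
```

The query itself is trivial once the links exist. The hard part described in the abstract is capturing the equipment identifier at data creation and keeping it attached as the data passes between researchers and into a published article.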
ABOUT THE AUTHOR(S)
Shiobhan Smith
Shiobhan has over 10 years’ experience working in Libraries and Museums. Prior to being
appointed as the University of Otago Library’s Research Support Unit Manager, Shiobhan
was Subject Librarian to a number of Humanities departments including Sociology,
Anthropology, Geography, and Theology. As Subject Librarian to the Centre for
Sustainability, Shiobhan was involved in the development of the Otago Data Management
Planning tool and has an interest in Research Data Management. Shiobhan also has
knowledge and skills in Digital Humanities, Bibliometrics, and Information Literacy.
Fiona Glasgow
Fiona is an information management enthusiast who has worked in both libraries and museums for
the past five years. After finishing an Honours degree in English, Fiona began the Masters of
Information Studies, which she completed in July 2017. Her research topic was focused on
digitising Māori collections in museums.
Data movement challenges to research productivity - examples and responses
Dr Frankie Stevens1, Dr Carina Kemp1, Dr Andrew Lonie2, Dr Steve Manos2
1 AARNet, 2 Australian Biocommons
The ability to reliably and repeatedly move data from A to B is becoming the key
underpinning capability of modern research. This has been driven by changes in how
research happens: national and international-scale collaborations around data are growing
in size and number; researchers are using more diverse computational infrastructure that
is more geographically distributed; there’s an increasing use of international reference data
collections; and there is an upscale of instruments and the volumes of data they produce.
This BoF aims to explore this challenge, and seeks to delve into some fundamental
questions, such as:
● How much research impact does data movement have?
● What examples of approaches, tools and collaborations exist where things have
worked well?
● What is the taxonomy to describe the challenges and our responses within data
movement? (A common language that describes the multi-faceted nature of this
challenge is needed. Data movement software, data movement scheduling, data
placement and data security, are all related areas that begin to describe responses to
the challenge.)
● What could an optimal ‘data movement ecosystem’ look like? If we treated data
movement as a first class citizen of the research computing world, akin to HPC, data
storage, or cloud, what sort of training, expertise, resources, help and approaches
could we envisage?
This BoF will probe these questions across various disciplines, to determine the levels of
support and tooling required to address them, and to build partnerships between research
disciplines and eInfrastructure providers. The BoF will feature real-world examples of data
movement challenges being experienced by life science researchers in Australia, but aims to
capture the broad array of issues from other disciplines, and provide information on the
solutions available today. The BoF will make use of online polling software to enable all
delegates to participate in real time, and permit broad engagement on the topic.
ABOUT THE AUTHOR(S)
- Frankie Stevens is AARNet’s Research Engagement Strategist. Previously, Frankie has
held roles with the Australian Research Data Commons, the NSW state body for
eResearch, the Research Data Storage Infrastructure (RDSI) Project and was
eResearch Programme Manager at the University of Sydney. Frankie has 20 years'
experience in the Higher Education Sector, with a background in Molecular Biology,
having worked in both the Australian and overseas university sectors.
- Carina Kemp is AARNet’s eResearch Director. Carina has worked across government,
industry and research. She is passionate about enabling and connecting innovative
people, innovation, team empowerment, bridging the gap between IT and everyone
else and engagement with stakeholders at all levels. Carina has a background in
Geosciences.
- Andrew Lonie is the Director of the Australian Biocommons. Andrew has a
background in molecular biology and computer science, and was appointed Head of
the Victorian Life Sciences Computation Initiative’s Life Sciences Computation Centre
in 2010 to create a multi-disciplinary centre of expertise in life sciences offering best
practice analyses, training and education. He was subsequently appointed Director
of the VLSCI in 2015, which underpinned the formation of Melbourne Bioinformatics
in 2017. Andrew is also the Director of Melbourne Bioinformatics and EMBL-ABR.
- Steven Manos is Associate Director of Cyberinfrastructure in the Australian
Biocommons. Previously, Steven was at the University of Melbourne as the Director of
Research Platform services. Steven has a background in Physics.
Big Internet Pipe and Cloud Saved My Storage in Crisis
Author’s name: Dan Sun
Organisation: AgResearch
The current storage solutions in AgResearch are all based on Network Attached Storage
(NAS) technologies. They were simple, quick and cost-effective to deploy. In some instances,
it was even easy to scale up their capacity. However, individual fileservers have become
data silos and we regularly suffer from their limitations. This talk is based on an incident
caused by one of those limitations. It also covers how we recovered from it quickly by utilising
the Cloud, and our thoughts on our future storage platform.
Over one weekend in early October 2019, an unexpected amount of data was placed on one of
our user-accessible fileservers, pushing its utilisation over 85%. Consequently, its
performance started to degrade. Unfortunately, no other storage in the same physical
location had enough spare capacity to take this additional load.
We decided to remove some large datasets which had not been accessed by users for over 2
years to reclaim capacity quickly. At the same time, we had to maintain the same data
protection level (two separate copies of the same data stored in two different locations).
To achieve this objective, we uploaded a copy of these datasets’ offsite replicas to Microsoft
Azure Blob storage before removing the original copy from the server. Additionally, we
configured the Cloud storage to automatically migrate data from the Cool tier to the Archive
tier after the data had been in the cloud for 7 days. This significantly reduces the cost of storing
data in the Cloud for the long term, although we acknowledge the additional cost and time
for retrieving such data if that’s required. We deem the probability of such an operation low;
it would only be necessary in a disaster recovery scenario.
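For readers unfamiliar with tiering rules, the sketch below builds a policy document in the general shape used by Azure Storage lifecycle management and prints it as JSON. It is an illustration only: the abstract does not say how the rule was actually written, the rule name is hypothetical, and the condition used here counts days since the blob was last modified, which for an uploaded and untouched blob is assumed to match "days in the cloud".

```python
import json


def archive_after_days(days, rule_name="cool-to-archive"):
    """Build a lifecycle policy that moves block blobs to the Archive tier
    once they are more than `days` days past their last modification."""
    return {
        "rules": [
            {
                "enabled": True,
                "name": rule_name,  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {"blobTypes": ["blockBlob"]},
                    "actions": {
                        "baseBlob": {
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": days
                            }
                        }
                    },
                },
            }
        ]
    }


policy = archive_after_days(7)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

A rule like this trades cheaper long-term storage for slower and more expensive retrieval, which is the risk the abstract flags for disaster recovery.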
We were extremely pleased by the performance of REANNZ’s network when we were
uploading data to Microsoft Azure’s instance in Australia. We were able to upload 2TB of
data in just over 37 minutes, which translates to roughly 7 Gbps on average. The speed
of our WAN is 10 Gbps. It took us another 2 hours to remove the dataset on the fileserver
where we were running out of capacity. Overall, it took us just under 3 hours to stabilise
this fileserver and we think it was a fairly good outcome. After the initial crisis was over, we
uploaded a further 6TB of data to the Cloud to reclaim capacity from the same fileserver. We
plan to use the same approach whenever we encounter similar issues in the short term until
we are able to replace our current generation storage solutions.
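The quoted transfer rate can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10**12 bytes) and exactly 37 minutes, so the true average for "just over 37 minutes" would be slightly lower; the function name is made up for this example.

```python
def average_gbps(terabytes, minutes):
    """Average rate in gigabits per second, using decimal units (1 TB = 10**12 bytes)."""
    bits = terabytes * 10**12 * 8
    seconds = minutes * 60
    return bits / seconds / 10**9


rate = average_gbps(2, 37)
link_share = rate / 10  # fraction of the 10 Gbps WAN link
print(f"{rate:.1f} Gbps, {link_share:.0%} of the link")
```

Under these assumptions the upload averaged about 7.2 Gbps, or roughly 72% of the 10 Gbps WAN link.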
Almost all of our storage solutions will reach their end of life in the next 12 to 24 months,
and we are currently planning a new generation storage platform to replace them. From all
the lessons we have learned to date, we think a scale-out storage solution is much more fit
for purpose than NASs or fileservers. Based on our use of the Cloud, we are starting to see
the value of Object stores, although we won’t be getting rid of our unstructured data stores
(filesystems) any time soon. It is our ambition to integrate both through some smart software.
We also think data replication is more practical and appropriate than the traditional
backup/restore model for the volume of data we have to keep. Lastly, the possibility to
replicate data to the Cloud is attractive, particularly the low-cost archival storage, but its high
retrieval overhead (both time and cost) is a risk that needs to be further investigated and mitigated.
ABOUT THE AUTHOR(S)
Dan is currently working for AgResearch as an HPC consultant and maintains a smallish Linux
cluster and storage. He is passionate about helping researchers to do science by using
advanced technologies. When he is not firefighting at work, he enjoys having barista-made
coffee, fancy burgers and donuts with his collaborators and friends.
Internationalisation of The Carpentries – Lessons learnt on the way
Riku Takei
University of Otago
At the end of 2018, there was growing interest in the internationalisation of the Carpentries
teaching material for non-English-speaking countries. As part of this community-driven
initiative, I became involved in the translation of the Software Carpentry materials
into Japanese. In this lightning talk, I would like to share my experience of organising,
managing, and collaborating with people living in Japan, using Git and GitHub.
ABOUT THE AUTHOR(S)
Riku Takei
I have been involved with The Carpentries since I began my MSc at the University of Otago;
first as a learner, then a helper, and finally as an instructor. I became an official Carpentries
instructor in 2018, and since then, I have been involved in the translation of The Carpentries
material into Japanese.
Influencing Data Culture to Optimise Data Utilisation
Lisa Thomasen
Fonterra Co-Operative Group Ltd
At Fonterra’s Research & Development Centre we have a range of data which describes both our dairy products and manufacturing processes. We want to preserve this data for the future and increase our opportunities for applications of analytics. To be successful this requires a significant shift in the data culture. This talk will outline the approaches we are taking to change the data culture in our research teams. We have started this process by conducting a data usage survey which allowed us to define our biggest data challenges and their position in the data life cycle. We have now built a team of data stewards to help us map out and execute solutions to the data challenges we’re facing to allow us to get optimal value out of our research data now and many years into the future. This talk will cover the work we have done with our data stewards over the past year and the next steps we have planned to achieve our data management vision. This includes our work to implement unique identifiers for all samples and our proposed metadata database.
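As a loose illustration of pairing unique sample identifiers with a metadata store, the sketch below assigns each sample a random UUID and records its metadata in a dictionary. Nothing here reflects Fonterra's actual identifier scheme or database: the function name, the metadata fields and the in-memory registry are assumptions made for this example.

```python
import uuid


def register_sample(registry, product, process_step):
    """Assign a new unique identifier to a sample and store its metadata."""
    sample_id = str(uuid.uuid4())  # random identifier, unique for practical purposes
    registry[sample_id] = {
        "product": product,            # hypothetical metadata field
        "process_step": process_step,  # hypothetical metadata field
    }
    return sample_id


registry = {}
first = register_sample(registry, "example product", "example process step")
second = register_sample(registry, "example product", "example process step")

# Two samples with identical metadata still receive distinct identifiers.
print(first != second, len(registry))
```

Separating the identifier from the descriptive fields means a sample can always be referenced unambiguously, even when its metadata is later corrected or extended.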
ABOUT THE AUTHOR(S)
Lisa Thomasen has been working as a research statistician for the Fonterra Research &
Development Centre in Palmerston North for four years. Throughout this time, she has
dedicated a lot of resource towards data management and statistics training.
Humanities, Arts and Social Sciences: What have we learned, where are we going?
Alexis Tindall, Australian Research Data Commons
Ian Duncan, Australian Research Data Commons
The Humanities, Arts and Social Sciences (HASS) community covers an extraordinary breadth of research activities. In this Birds of a Feather (BoF) session, we explore Australian developments to support the communities under this wide umbrella. These communities can be divergent in approaches and objectives, but remain united in data and united in approaches to research support.
The Australian Government is poised to make a long-awaited investment to support the humanities, arts and social sciences community under the National Collaborative Research Infrastructure project. In preparation for this investment, they have commissioned the Australian Research Data Commons (ARDC) to map the data landscape relevant to the HASS community, relevant concurrent initiatives, and the role of existing research infrastructure in supporting those communities. While the data and research support landscape for HASS is rich, fragmented and diverse, the real challenge is in capturing the landscape of research communities and responding to their needs.
But what about the research communities? Under the umbrella of ‘HASS’ we cluster research as diverse as Urban Environments and Design, with Law, Classics and the Philosophy of Religion. How do we plan for a research infrastructure investment that can support such breadth of activity, diversity of approaches to data and sources, differing research ambitions? A HASS Research Data Commons has been identified as a model that could aid data-enabled HASS research. A Commons can be planned with flexibility and interoperability, to allow focussed support to benefit specific research communities in initial implementation, that can be extended to support related activities across new and emerging communities.
This 60-minute BoF seeks to share approaches to supporting HASS research communities, and learn from related regional initiatives. The BoF will open with a presentation on the Australian HASS data and data-enabled research landscape as captured in this project, and a sketch of our proposed responses. Discussion after the presentation will focus on challenges and opportunities in this area.
Questions for discussion include:
• How do we usefully characterise HASS-relevant data collections? How do we determine significance and support access in a responsible and sustainable way? How do we balance openness, access and the rights of communities contributing to those datasets?
• How do we accommodate commonalities and differences across HASS research communities? How do we strengthen opportunities for those conducting data-driven HASS research when they might be atypical of their field?
• How ready are HASS communities? How can we ensure they’re ready to make the most of new digital opportunities to amplify, enhance and supercharge their research?
ABOUT THE AUTHOR(S)
Alexis Tindall is a Senior Research Data Specialist at the Australian Research Data Commons, with a particular interest in supporting and enabling humanities, arts and social sciences research. She has extensive project management experience in diverse environments. Before joining the eResearch community, she worked in natural history and social history museums, and is passionate about digitisation and improving digital access to the nation’s treasured collections.
Ian Duncan is Director, eResearch Infrastructure & Services, at the Australian Research Data Commons and has many years’ experience in defining, developing, and running infrastructure to support high-impact research. Ian was previously Director of the NCRIS Research Data Storage project and has a keen interest in working with evolving communities in making more data more available to more people and looking at how to effectively incorporate citizen science, industry and government partners into our sector.
Engineering HPC: What’s going on?
Callum Walley
NeSI
Engineering researchers are met with many unique challenges when scaling their research
on high performance computers.
This presentation will discuss the current state of NeSI’s support for engineers: the
developments, the challenges, and what can be expected in the future.
ABOUT THE AUTHOR
Callum Walley is part of the New Zealand eScience Infrastructure (NeSI) applications support
team, their main goal being to develop HPC capability and engagement within the
Engineering community.
Earth system modelling in New Zealand – turning big data in big science
Jonny Williams, Erik Behrens, Olaf Morgenstern, Mike Williams
NIWA, Wellington, NZ
After several years of development, the first results and papers showcasing the output from New Zealand’s earth system modelling community are now available. This represents a large body of behind-the-scenes work from multiple NIWA and NeSI staff, not to mention our international collaborators in the Unified Model partnership. This is all very well, but how are we going about turning approximately 0.5PB of raw model output into science which enables New Zealanders to ‘anticipate, adapt, manage risk, and thrive in a changing climate’? This is the mission statement of the Deep South National Science Challenge, through which this work is funded.
We are simulating three greenhouse gas emissions scenarios representative of an unknown future. From the model output, we can estimate how the world will warm. However, earth system models enable us to do a lot more than this. We can also examine changes to chemical processes in the atmosphere, biogeochemical processes in the ocean, as well as changes to the terrestrial biosphere.
I will discuss the theory and practice of turning this raw data into useful science in an HPC context. I will also present some early findings from our model, which differs from its parent model – the UKESM – in its ability to simulate the ocean circulation around Aotearoa New Zealand at high, ‘eddy permitting’ resolution.
ABOUT THE AUTHOR(S)
Jonny Williams moved to New Zealand in 2015 after a postdoc in physical geography at Bristol University studying extreme warm paleoclimates of the Cretaceous and Jurassic periods. Before this he worked in private practice as a junior consultant at Eunomia Research and Consulting and as a climate scientist at the UK Met Office. Jonny has a PhD in molecular electronics from the University of Bath and a degree in physics from Imperial College London.
Erik Behrens leads the ocean modelling project to further improve NZESM in the second phase of the Deep South. He has a PhD and degree in physical oceanography from the Christian-Albrechts University of Kiel, Germany. His main interest is to understand how oceans around New Zealand and around Antarctica change due to climate change.
Olaf Morgenstern is leading climate modelling at NIWA and for the Deep South National Science Challenge. Prior to joining NIWA he worked for Cambridge University in the UK and the Max-Planck-Institute for Meteorology in Hamburg, Germany. His main research interest is in the linkages between physical climate change and atmospheric composition. He holds a PhD in meteorology from ETH Zurich, Switzerland, and a physics degree from Freiburg University, Germany.
Mike Williams has been the director of the Deep South National Science Challenge since 2016. He obtained his PhD in polar oceanography from the University of Tasmania in 1999 and was an assistant professor at the Niels Bohr Institute for Physics in Copenhagen, Denmark for three years. He joined NIWA in 2001 and has had various roles, including leading the climate observations programme and Antarctic research programmes.
Worldwide Trends in Computer Architectures for Data Science
Jeff Zais
NeSI
High performance computing architectures continue to evolve along several dimensions.
These changes are driven by the demand for more complex simulations and the ability to
create, handle, and analyse ever growing volumes of data.
This paper will focus on the state of the art in computer architectures designed to serve
large academic research communities in countries around the world. Prominent examples
will include NCI (Australia), LRZ (Germany), and SciNet (Canada).
Besides these examples that are in place, trends in technology will be summarized, to show
what can reasonably be expected in the next five years. This will include expected advances
in many of the key areas of computer architecture, including processors, memory,
networking, and storage. Particular emphasis will be placed on the rapidly evolving area of
storage technology.
ABOUT THE AUTHOR(S)
Jeff Zais recently joined NeSI and NIWA as the Senior High Performance Computing
Architect and Science Advisor. His academic background includes a B.S. degree from the
University of Wisconsin, and M.S. and Ph.D. degrees from Stanford University in Aerospace
Engineering. Professional experience includes technical and management roles at Ford
Aerospace, Cray Research, IBM, and Lenovo, focused on application performance and
system architecture.
Otago’s Network for Engagement And Research: Mapping Academic
Expertise and Connections
Sander Zwanenburg
University of Otago
Academic expertise is complex, dynamic, and often encoded in jargon (Auriol et al. 2013).
Academics move across institutions and increasingly change the topics of their research
(Zeng et al. 2019). They apply their expertise in writing that is often specific to a specialised
community.
This makes academic expertise hard to find and to understand. For example, it is difficult for
prospective postgraduate students of the University of Otago to find the right supervisor.
Likewise, organisations that require expertise for their R&D may not be able to identify
available experts. Even within universities, their schools and departments, an understanding
of expertise and its applications in collaboration is very limited, complicating the
management of expertise and the facilitation of its application.
Currently, to find or understand expertise, one might rely on social networks or digital
facilities such as search engines, websites, and academic databases. All of these carry
important shortcomings in the search for experts. For example, asking people in one’s social
network can be time-consuming and ineffective since individuals’ awareness of expertise in
their network quickly fades or becomes outdated when going beyond an intimate inner ring
of contacts (Hill and Dunbar 2003). Search engines are optimized for finding relevant pages
and documents, not experts (Dudek et al. 2007).
How can we map academic expertise, to make it easier to find and understand?
Our answer is a local and practical one. NEAR, the Network for Engagement And Research, is
an information system under development that aims to help its users find and understand
academic expertise in the University of Otago. Its proof of concept has been developed in
the Otago Business School, with data from and about academic staff. The vision is to build a
data warehouse around academics’ expertise and its social context, and to communicate
that data visually and interactively through a web application. This can be rolled out to all
other schools and divisions in the university.
Essentially, NEAR collects data, integrates and interprets it, and communicates this
information to its users. The data collected is a combination of user-inputted data and
existing data from other systems. Initially, NEAR only collected basic phonebook-type data
from Active Directory, another institutional system, and asked academics to enter details
about their Fields of Research, research methods, Sustainable Development Goals,
collaborations, and the Fields of Research of their collaborations. The data input from
academic colleagues came with challenges. Initially this relied on an online survey that
quickly became so complicated that Qualtrics, the survey provider, had to change its
technical specifications. We later developed a custom-made profile system, based on a
LAMP stack, where people could log in, and fill out their details. One difficulty that remained
was that this required a push, not just one-off but continued over time, to get this data
actually collected. The data collection emphasis therefore shifted to other systems that contained
data on people’s expertise and that were maintained and updated elsewhere: the Research
Output Database, a Research Management Information System, the Media Expertise
Database, but also external databases like Elsevier’s Scopus and Clarivate’s Web of Science.
This shift meant that we started the development of data harvesting and integration
protocols. These were all developed in-house in R and consisted of working with APIs and
processing their output. A current challenge is to infer expertise from the available
evidence. This evidence comes from different types of publications, grants, and
self-reports, which are linked to different classifications of research fields. These
classifications overlap to different extents, and it is possible that not all fields of
expertise are homogeneously reflected in such evidence. Possibly, semantic fingerprints can
be applied to enhance accuracy and
reduce reliance on particular classifications.
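The inference challenge described above can be illustrated with a minimal sketch. It is not NEAR's algorithm, and it is written in Python rather than the R used in-house. It assumes that each piece of evidence has already been mapped to a single field label, and that each evidence type carries a fixed weight. The records, the weights, and the function name `infer_expertise` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical evidence records: (person, evidence type, field of research).
# Names and field labels are illustrative, not NEAR's actual data.
EVIDENCE = [
    ("alice", "publication", "Information Systems"),
    ("alice", "publication", "Information Systems"),
    ("alice", "grant", "Data Science"),
    ("alice", "self_report", "Information Systems"),
    ("bob", "self_report", "Marketing"),
    ("bob", "publication", "Data Science"),
]

# Assumed weight per evidence type; a real system would need to calibrate these.
SOURCE_WEIGHTS = {"publication": 1.0, "grant": 2.0, "self_report": 0.5}


def infer_expertise(evidence, weights):
    """Return {person: {field: share of that person's weighted evidence}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for person, source, field in evidence:
        totals[person][field] += weights[source]

    profiles = {}
    for person, fields in totals.items():
        overall = sum(fields.values())
        # Normalise so that each person's shares sum to 1.
        profiles[person] = {f: score / overall for f, score in fields.items()}
    return profiles


profiles = infer_expertise(EVIDENCE, SOURCE_WEIGHTS)
for person, fields in sorted(profiles.items()):
    print(person, {f: round(share, 2) for f, share in fields.items()})
```

Normalising per person makes profiles comparable between prolific and less prolific staff, but it also hides the absolute amount of evidence, so a production system would probably keep both figures.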
We have communicated our information about expertise and its social context through an
interactive web visualisation, as shown in the figure below. The visualisation is written with
an R package called visNetwork, which is then embedded in a Shiny app. It consists of a
network graph, where each node represents a staff member (colour coded for department)
or an external party (red), and each link represents an active collaboration. One can zoom
in, and hover over these elements to view their details. They are also searchable by name,
department, field of research (and their corresponding disciplines and sub-disciplines),
research method, and sustainable development goal through drop-down lists. A search
highlights the applicable nodes and edges. Highlighted staff members can be emailed with
the click of a button, making it easy to bring together people with similar research
expertise or interests.
Screenshot of NEAR’s network visualisation
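The network structure and search behaviour described above can be sketched as plain data plus a filter. This is an illustrative Python sketch, not the visNetwork and Shiny code behind NEAR. The staff names, departments, and the `highlight` function are hypothetical. It also assumes that an edge counts as "applicable" only when both of its endpoints match the search.

```python
# Hypothetical nodes: staff members carry a department and fields of research;
# an external party has no department (NEAR colours those nodes red).
NODES = {
    "alice": {"department": "Information Science", "fields": {"Information Systems"}},
    "bob": {"department": "Marketing", "fields": {"Consumer Behaviour"}},
    "carol": {"department": "Economics", "fields": {"Information Systems", "Econometrics"}},
    "acme": {"department": None, "fields": set()},
}

# Each edge represents an active collaboration between two nodes.
EDGES = [("alice", "bob"), ("alice", "carol"), ("bob", "acme")]


def highlight(nodes, edges, field):
    """Return the nodes matching a field of research and the edges among them."""
    hit_nodes = {name for name, attrs in nodes.items() if field in attrs["fields"]}
    # Assumption: only edges with both endpoints highlighted are shown as applicable.
    hit_edges = [(a, b) for a, b in edges if a in hit_nodes and b in hit_nodes]
    return hit_nodes, hit_edges


hit_nodes, hit_edges = highlight(NODES, EDGES, "Information Systems")
print(sorted(hit_nodes), hit_edges)
```

The same filter could be keyed on department, research method, or sustainable development goal by swapping the attribute that is tested.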
In the next stage of the project, we will further develop the data integration schemes,
enhance our algorithm to infer expertise based on this data, and update the interactive
visualisation to reflect these inferences. This visualisation should help users not only find
fitting experts, but also understand how these experts sit in a dynamic, social
context. For example, given a field of expertise, do the experts form a close-knit group or
are they scattered around the university? Deeper insights like these can allow for potent
outcomes, such as an email to a strategically positioned expert.
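One way to make the close-knit versus scattered question concrete is to measure the density of the collaboration links among the experts in a field. The Python sketch below is only an illustration of that idea. The abstract does not say which measure NEAR will use, and the function name `cohesion` and the sample data are hypothetical.

```python
from itertools import combinations


def cohesion(experts, edges):
    """Share of possible expert pairs that are linked by a collaboration (0.0 to 1.0)."""
    experts = set(experts)
    possible = len(experts) * (len(experts) - 1) // 2
    if possible == 0:
        return 0.0  # fewer than two experts: no pairs to compare
    linked = {frozenset(edge) for edge in edges}  # treat collaborations as undirected
    actual = sum(
        1 for pair in combinations(sorted(experts), 2) if frozenset(pair) in linked
    )
    return actual / possible


# Hypothetical collaboration links between staff members a to e.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]

close_knit = cohesion({"a", "b", "c"}, EDGES)  # every pair collaborates
scattered = cohesion({"a", "d", "e"}, EDGES)   # only d and e collaborate
print(close_knit, round(scattered, 2))
```

A value near 1 suggests a close-knit group, and a value near 0 suggests experts who are scattered across the university with few direct links.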
We believe that our approach has the potential to augment popular search engines in an
important yet local way. Current search engines are optimized for web pages and online
documents (e.g. Google), scholarly output (e.g. Google Scholar, Web of Science, Scopus), or
geographic information (e.g. Google Maps, Yelp). NEAR can offer deeper insights about
expertise of individuals by combining institutional and public data. It has the potential to
allow its users not only to find the most fitting expert, but also to understand the structure
and dynamics of particular areas of expertise. Hopefully, in the future, this will help bridge
the demand and supply of expertise, and identify opportunities to leverage more fully what
people have developed over many years.
Acknowledgements
I thank Brian Spisak for his co-leadership in this project and Caitlin Owen and Lahiru
Ariyasinghe for their development support. Further, there are many internal organisations
that have contributed to the initiative, including the Otago Business School, the Research
Support Unit of the Library, Information Technology Services, and Research & Enterprise.
Thank you.
References
Auriol, L., Misu, M., and Freeman, R. A. 2013. "Careers of Doctorate Holders."
Dudek, D., Mastora, A., and Landoni, M. 2007. "Is Google the Answer? A Study into Usability
of Search Engines," Library Review (56:3), pp. 224-233.
Hill, R. A., and Dunbar, R. I. 2003. "Social Network Size in Humans," Human Nature (14:1),
pp. 53-72.
Zeng, A., Shen, Z., Zhou, J., Fan, Y., Di, Z., Wang, Y., Stanley, H. E., and Havlin, S. 2019.
"Increasing Trend of Scientists to Switch between Topics," Nature Communications
(10:1), pp. 1-11.
ABOUT THE AUTHOR
Sander Zwanenburg is a Lecturer within the Department of Information Science. He
obtained Bachelor and Master of Science degrees from the University of Groningen, The
Netherlands, and a PhD degree in Management Information Systems from The University of
Hong Kong. Sander’s research interests lie in the fields of the psychology of IT use, the
development of metrics, and networks of knowledge. He has published in various
Information Systems venues such as the Australasian Journal of Information Systems,
Communications of the Association for Information Systems, and the proceedings of the
International Conference on Information Systems.