eResearch NZ 2020 · Journey with Jupyter Cheng-Hao Cai - Building Machine Learning Systems on...

eResearch NZ 2020

12-14 February 2020 | Dunedin Centre

Abstract booklet

Programme

Abstracts

*Click on the session title to read the abstract*


9:00

10:00 - 12:30

10:00

10:30 Keynote 1 - Rosie Hicks

11:30

12:30 - 13:30

Breakout Session 1

Session 1 A Session 1 B Session 1 C Session 1 D Session 1 E

Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)

13:30 Megan Guidry - Training: It's better together

David Fellinger - Building a Federated Research Collaborative

Miles Benton - Assessing the potential of autonomous AI devices for portable real-time DNA sequencers and deployable sensors

13:50 Ngoni Faya - Genomics Aotearoa Training

Callum Walley - Engineering HPC: What’s going on?

Ann Mc Cartney - Utilising Oxford Nanopore Data for the Genome Assembly of Endemic New Zealand Species

14:10 Murray Cadzow - Carpentries at Otago

Marko Laban - Cloud-native technologies in eResearch - benefits and challenges

Elizabeth Permina - Mice, organoids and single cells: computational methods for cancer treatment

14:30 Christina Hall - Hybrid Training: a scalable model for delivering hands-on training to dispersed learners

Ryan Chard - Automating the Research Data Lifecycle with Globus Automate

Eliatan Niktab - Network-based Nonparametric Tests to Identify Genetic Modifiers of Rare Diseases

14:50 Lightning talks: Riku Takei - Internationalisation of The Carpentries – Lessons learnt on the way / Matt Bixley - Reproducible Posters: an Otago Theme

Lightning talks: Wallace Chase - Why so slow? Molasses biased data transfers… / Jun Huh - Learning How To Learn

Lightning talks: Alessandra Santana - Per-sample pathway analysis tool for DNA methylation data / Joseph Guhlin - I’m a Big Metal Fan: Big Data at the Lowest Level

15:10 - 15:30

Birds-of-a-Feather (BoF) Sessions

Session 2 A Session 2 B Session 2 C Session 2 D Session 2 E

Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)

15:30 Megan Guidry - Building and Supporting a New Zealand Digital Literacy Training Community / Sara King - A Common Thread: Creating community, working together and enriching research

Laura Armstrong - Identifying, connecting and citing research with persistent identifiers.

Blair Bethwaite - Research Cloud NZ

Workshop: Chris Scott - First steps in machine learning with NeSI (part 2)

Workshop: Gabriel Noaje - NVIDIA Accelerated Computing Workshop

17:30 - 18:30

Conference Welcome Address

Lunch

Afternoon Tea

Registration Open

Session: Opening Ceremony and Keynotes

Workshop: Chris Scott - First steps in machine learning with NeSI (part 1)

Keynote 2 - Micaela Parker

Workshop: Gabriel Noaje - NVIDIA Accelerated Computing Workshop

Wednesday 12 February

13:30 - 15:10

15:30 - 17:30

Welcome Function

Dunedin Centre

End of Day One


Thursday 13th February

9:00

9:30 - 10:30

9:30

10:30-11:00

Breakout Session 2

Session 3 A Session 3 B Session 3 C Session 3 D Session 3 E

Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)

11:00 Alexander Ritchie - Humanities Data Untied - An Untapped Resource or just an Untidy Office?

Wolfgang Hayek - Singularity containers on HPC

11:20 Alan McCulloch - Data Pipelines and Prisms

Lahiru Ariyasinghe - Challenges and opportunities in timely and efficient delivery of IT for eResearch projects.

Workshop: Daniel O'Byrne - The Basics of Cloud Computing

11:40 Shona Mackie - Climate Data and Computing Needs are Hotting Up!

Paula Andrea Martinez - Towards FAIR principles for research software

12:00 Shiobhan Smith - Uniting equipment and research publications: bigger than Ben Hur?

Chris Hines - Strudel2: Increasing accessibility of HPC Infrastructure

Rudiger Brauning - GBSathon: Benchmarking reproducibility of Genotyping-By-Sequencing analysis workflows through comparison with SNP chip and pedigree data

Thomas Nicholson - Using comparative RNASeq to identify small non-coding RNAs in bacterial clades

Alana Alexander - Akoranga from research consultation with Māori on sequencing the genome of a taonga species

Matt Bixley - Naive Prediction of Cancer Outcomes using Machine Learning

12:20 - 13:30

Breakout Session 3

Session 4 A Session 4 B Session 4 C Session 4 D Session 4 E

Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)

13:30 Lisa Thomasen - Influencing Data Culture to Optimise Data Utilisation

Jun Huh - User journey-driven product management

13:50 Brian Flaherty - Where Data Lives: NeSI, taonga and growing repository services.

Vladimir Mencl - Enabling authentication at global scale: an update on REANNZ services

14:10 Andrea Goethals - Digital Preservation New Zealand

Dinindu Senanayake - HPC for life sciences: handling the challenges posed by a domain that relies on big-data

14:30 Sander Zwanenburg - Otago’s Network for Engagement And Research: Mapping Academic Expertise and Connections

Chris Hines - The Undies-Mate Un-Debate

14:50 Lightning talks: Lana Alsabbagh - Use of the National Library’s Web and Twitter Collections for Research / Jess Howie - Support for Research Data Management in university libraries – How far have we come?

Nick Jones - Advancing New Zealand’s computational research capabilities and skills

Jeff Zais - Worldwide Trends in Computer Architectures for Data Science

Andrew Lonie - Progress in the Australian BioCommons

April Neoh - Beyond super

Lightning talks: Wallace Chase - Frozen data / Adam Bartonicek - Why overfitting is bad for science: Lessons from psychology

15:10 - 15:30

15:30 - 17:30 Birds-of-a-Feather (BoF) Sessions

15:30 Jonny Flutey - Micro-credentials and Research Skills Development

Jana Makar - Growing the eResearch workforce in an inclusive way

Anton Angelo - All Research Questions Are Ethical Questions

Workshop: Blair Bethwaite - Containers in HPC Tutorial (part 2)

19:00 - 22:00

11:00-12:20

Morning Tea

Registration Open

Session: Plenary/Keynote

Lunch

Birds of a Feather: Brian Flaherty - Building a national/regional data transfer platform: Globus BoF

Workshop: Blair Bethwaite - Containers in HPC Tutorial (part 1)

Keynote 3 - Richard Dean

Afternoon Tea

End of day 2

Conference Dinner

Larnach Castle

This event is included in full and student registrations; however, tickets are limited, so your attendance must be confirmed before the conference begins.

13:30 - 15:10


Friday 14 February

9:00

9:30 - 10:30

9:30

10:30 - 11:00

Breakout Session 4

Session 5 A Session 5 B Session 5 C Session 5 D Session 5 E

Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)

11:00 Richard Sinnott - Applied Deep Learning for Diverse Research Communities

Stephanie Guichard - Data-intensive approaches to finding and predicting research outcomes for New Zealand health research

Wallace Chase - How does REANNZ work?

11:20 Justin Baker - eResearch Collaboration Projects – supporting CSIRO’s digital science and research

Jonny Williams - Earth system modelling in New Zealand – turning big data in big science

Alexander Pletzer - Enhancing eResearch productivity with NeSI's consultancy service

11:40 Jo Lane - Scientific supercomputing: Teaching practical skills for credit

Nancy Lin - Data Analytic Transformation Journey with Jupyter

Cheng-Hao Cai - Building Machine Learning Systems on Microsoft Azure Cloud Machines

12:00 Matt Plummer - Running Rāpoi: Rebooting Research Computing & Support at VUW

Carina Kemp - Building an International FAIR Infrastructure for ‘Uniting’ Research Data

Dan Sun - Big Internet Pipe and Cloud Saved My Storage in Crisis

12:20

12:30 - 13:30

Birds-of-a-Feather (BoF) Sessions

Session 6 A Session 6 B Session 6 C Session 6 D Session 6 E

Glenroy Auditorium (Plenary) Conference Room 1 Conference Room 2 Chester Lounge (Breakout Room) The Terrace (Boardroom)

13:30 Nooriyah Lohani - Research Software Engineering Community update and next steps in New Zealand

Joep De Ligt - Scalable Workflows and Reproducible Data Analysis (for Genomics)

Carina Kemp - Data movement challenges to research productivity - examples and responses

End of day 3

Registration Open

Session: Plenary/Keynote

Morning Tea

Conference wrap-up

Workshop: Shiobhan Smith - United in data management: Is it time for a national research data management framework?

Alexis Tindall - Humanities, Arts and Social Sciences: What have we learned, where are we going?

13:30 - 15:30

Lunch

11:00 - 12:20

Keynote 4 - Amber Budden

*Programme subject to change*

Akoranga from research consultation with Māori on sequencing the genome of a taonga species

Alana Alexander1 and Benjamin Iwikau Te Aika2

1. Department of Anatomy, University of Otago; 2. Genomics Aotearoa

[email protected], [email protected]

As uri (descendants) of Tangaroa (or Tāne-Mahuta in the pūrākau of some hapū), Hector’s and Māui dolphins (Cephalorhynchus hectori) are taonga (treasured). However, anthropogenic activities, particularly fishing (through fisheries bycatch), have led to restricted/fragmented distributions and significant reductions in genetic diversity in both subspecies. A worrying additional trend is deaths due to the parasitic disease toxoplasmosis, potentially exacerbated by decreased genetic diversity. Hologenomics – a new paradigm where genomes of a host and its co-existing microbes (microbiome) are simultaneously investigated for novel insights into host health, population sizes, and connectivity – could therefore be an important tool to address susceptibility to toxoplasmosis and other diseases, as well as population sizes through time, potential divergence, and past patterns of interchange between Hector’s and Māui dolphins. However, in order to be effective partners to Te Tiriti o Waitangi – particularly in maintaining Māori rangatiratanga over resources and taonga – it is important that research consultation with mana whenua from the areas where Hector’s and Māui samples originate is undertaken. This is particularly important given the taonga status of Hector’s and Māui dolphins, as well as potential concerns about the rendering of ‘biological whakapapa’ into digital form during this project. Here, we outline our consultation procedures, the general feedback based on this consultation, our lessons learned from the process, and what we would do better/differently next time. We hope that presenting our experiences – particularly where there was room for improvement by us – will help other researchers to communicate more effectively with mana whenua in order to benefit Māori, the researchers, and their rangahau (research).

ABOUT THE AUTHOR(S)

Alana Alexander: Alana’s research utilises the ‘time-traveling’ ability of population genomics and phylogenomics by combining genomics, advanced computational tools, and behavioural, ecological, and biogeographic data to make inferences about the processes leading to patterns of genetic diversity within and among populations. These inferences range from global spatial and deep temporal scales (e.g. the worldwide impact of climate fluctuations on global sperm whale populations over the last 125,000 years), to regional spatial scales across time scales relevant to local adaptation (e.g. the evolution of MHC immune genes in Hector’s and Māui dolphin populations), to finer spatial and temporal scales (e.g. the movement of a chickadee hybrid zone in Missouri by just a few kilometres over three decades). Overall, she considers herself a molecular ecologist/evolutionary biologist who focuses on the interplay between pattern and process in genomic data. As a Māori scientist (Ngāpuhi, Te Hikutu) she also maintains a strong interest in ensuring that her research can be used to support kaitiakitanga and rangatiratanga of resources within the rohe of iwi, hapū and papatipu rūnaka.

Benjamin Iwikau Te Aika: Ngati Mutunga, Te Ati Awa, Kati Wairaki, Kati Mamoe, Waitaha. Ben is a specialist in multiple areas, including Māori economic development in environmental advocacy, toi Māori (Māori art), whakairo (carving), and tā moko. Currently, he is the Vision Mātauranga Coordinator at Genomics Aotearoa where he coordinates Māori consultation and outreach, identifies potential research collaborations with Māori communities, and supports Genomics Aotearoa’s projects and researchers. Ben aims to facilitate engagement to identify levels of acknowledgement and degree of control and provide proper recognition to the interests of Māori. Ben works with researchers and with Māori at multiple levels in the community to improve confidence, capacity and capability for engagement. Ben draws on knowledge in Mātauranga Māori, and also the research guidelines Te Ara Tika, Te Mata Ira and He Tangata Kei Tua. He also works on projects to improve genomics research relevance to Māori. One initiative has enhanced kaitiaki practices for a Māori landowner group in their management of native species - a great example of commerce, science and kaitiakitanga in the hands of flax roots Māori. Ben is passionate about his tamariki, hunting, whakapapa and whenua.


Ignite – Lightning Talk: Use of the National Library’s Web and Twitter Collections for Research

Lana Alsabbagh

National Library of New Zealand

[email protected]

The National Library of New Zealand has performed a “whole-of-domain” harvest since 2008, acquiring publicly available web content from the New Zealand .nz, .net, .org and .com domains. The National Library’s Web Archiving team has also undertaken a number of web harvests related to significant events in recent history such as the 2017 General Election, including tweets, related data, and images. The Whole-of-Domain collection and the Twitter harvests are both presently inaccessible to researchers. The Library’s goal is to improve usage of this data by providing researchers with tools and services that would enable computational access to this data.

In partnership with Library staff, the Digital Research Coordinator planned and carried out interviews and a survey with a select group of scholars involved in digital humanities to help the Library understand the tools and services researchers need to make full use of these digital collections. This lightning talk will discuss the findings and ideas for further research.

ABOUT THE AUTHOR(S)

- Lana Alsabbagh

- Lana is the Digital Research Coordinator at the National Library of New Zealand. She is currently researching ways to facilitate stakeholder engagement with the Library web archival collection.


All Research Data Questions are Ethical Questions

Anton Angelo

University of Canterbury

[email protected]

The data lying behind research is becoming steadily more transparent, and research is becoming more about using huge existing datasets than just generating data to answer a specific question.

This will be a facilitated discussion amongst all the attendees about current concerns:

- Can data ever not be biased on race or gender?

- Is anonymity a lost cause?

- How much data do I have to give away? (And whose was it, anyway?)

- Data datasheets – is more bureaucracy the answer?

- Is the researcher to blame for bias, or the training set?

ABOUT THE AUTHOR(S)

- Anton Angelo

- Anton Angelo is Research Data Coordinator at the University of Canterbury Library. He has an active interest in all aspects of research data: storage, publishing, ethics, licensing and review. He is a certified Data and Library Carpentry instructor.


Engaging researchers with eResearch: CoRE bid support

Laura Armstrong

Centre for eResearch, University of Auckland

[email protected]

‘Build it and they will come’ doesn’t seem to work when it comes to researchers engaging with eresearch. Institutions invest in infrastructure and platforms but for a portion of our communities this doesn’t deliver better, faster research or more connected researchers because they are not engaging with eresearch – aren’t aware of it, can’t access it, don’t know how it applies to them, struggle to use it or don’t feel it meets their needs.

We assume those involved with engaging researchers in eresearch grapple, as we do, with what engagement means in this context – beyond marketing and promotion – and how to identify and address barriers. How can we connect researchers with eresearch so they truly engage with it – access it, use it, shape it, innovate with it and achieve amazing things with it?

Many universities offer services and programmes to engage researchers in eresearch at various scales, career stages and across diverse communities. This presentation offers our model and experience of engaging researchers in eresearch to support the 2019 TEC call for Centre of Research Excellence (CoRE) bids. Part of a strategic and coordinated approach led by our Office of Research Strategy and Integrity (ORSI), this engagement has deepened our relationship with many senior PIs and research administrators, led to uptake of eresearch services across several research groups, and created connections that have resulted in non-CoRE funded research that relies on eresearch services and expertise.

ABOUT THE AUTHOR(S)

Laura Armstrong

Laura Armstrong is a Senior eResearch Engagement Specialist at the Centre for eResearch, University of Auckland working to engage researchers in eresearch, and deliver research data management services and researcher enablement projects.


Identifying, connecting and citing research with persistent identifiers. BoF Session

Natasha Simons, Anton Angelo, Shiobhan Smith and Laura Armstrong

Australian Research Data Commons, Brisbane, Australia, [email protected]

University of Canterbury, [email protected]

University of Otago, [email protected],

Centre for eResearch, University of Auckland [email protected]

Increasingly, the research community, including funders and publishers, is recognising the power of ‘connected up’ research to facilitate reuse, reproducibility and transparency of research. Persistent identifiers (PIDs) are critical enablers for identifying and linking related research objects including datasets, people, grants, concepts, places, projects and publications. PID systems:

● Provide social and technical infrastructure to identify and cite a research output over time

● Enable machine readability and exchange

● Collect and make available metadata that can provide further context and connections

● Facilitate the linkage and discovery of research outputs, objects, related people and things

● Provide key tools for tracking the impact of research and researchers

Join this BoF to learn about recent developments in PID services and infrastructure with a particular focus on DOI (research data), ORCID (people and organisations), RAID (research activities and projects), IGSN (physical samples and specimens) and ROR (research organisations).

Find out how to maximise the return on your investment in PIDs through participation in national and global initiatives such as the NZ DOI consortium, Scholix and the Project FREYA PID Graph, which uses PIDs to offer researchers and research institutions a richer, more connected experience.
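As one small illustration of what machine readability means for a PID, an ORCID iD ends in a check character computed from its first 15 digits with the ISO 7064 MOD 11-2 algorithm, so software can catch most mistyped iDs before looking anything up. The sketch below is illustrative only; the function names are hypothetical and it is not part of any service discussed in this BoF.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length and final check character; hyphens are ignored."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


print(is_valid_orcid("0000-0002-1825-0097"))
```

A check like this only confirms that an iD is well formed; whether it belongs to a real researcher still has to be confirmed against the ORCID registry.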

AUDIENCE

This BoF will be of interest to those designing, implementing, maintaining and supporting PID services including eresearch professionals, repository managers, developers and librarians. Participants should come along prepared to exchange knowledge, share experiences and contribute to discussions about optimising the ‘power of PIDs’.


The session will kick off with brief lightning talks presented by those working at the cutting edge of global developments in PID services and infrastructure. Following facilitated Q&A, participants will be encouraged to contribute to an open discussion to share experiences, explore ideas and ask questions.

OUTCOMES

Participants will leave the BoF with a fresh perspective on the opportunities PIDs can offer researchers and research organisations. We envisage that many participants will be prompted to explore in greater depth the ideas raised during the session as they might apply to their organisation. The BoF will also offer participants the opportunity to establish or strengthen connections with the broader PID community in New Zealand, Australia and internationally.

ABOUT THE AUTHOR(S)

Natasha Simons

Natasha Simons is Associate Director, Skilled Workforce, for the Australian Research Data Commons (formerly ANDS, RDS and Nectar). With a background in libraries, IT and eResearch, Natasha has a history of developing policy, technical infrastructure (with a focus on persistent identifiers) and skills to support research. She works with a variety of people and groups to improve data management skills, platforms, policies and practices. Based at The University of Queensland, Brisbane, Australia, Natasha is the co-chair of the Research Data Alliance Interest Group on Data Policy Standardisation and Implementation, Deputy Chair of the Australian ORCID Advisory Group and co-chair of the DataCite Community Engagement Steering Group.

https://orcid.org/0000-0003-0635-1998

Anton Angelo

Anton Angelo is a data librarian working at the University of Canterbury. He managed Canterbury’s effort to be among the first NZ universities in the NZ DOI consortium and to adopt the NZ ORCID Hub, verifying over 80% of Canterbury’s scholars’ affiliations. He also manages the UC Research Repository, the Canterbury Institutional Repository, and has been very active in supporting Open Access. He has two cats and three chickens.

https://orcid.org/0000-0002-2265-1299

Shiobhan Smith

Shiobhan has over 10 years’ experience working in Libraries and Museums. Prior to being appointed as the University of Otago Library’s Research Support Unit Manager, Shiobhan was Subject Librarian to a number of Humanities departments including Sociology, Anthropology, Geography, and Theology. As Subject Librarian to the Centre for Sustainability, Shiobhan was involved in the development of the Otago Data Management Planning tool and has an interest in Research Data Management. Shiobhan also has knowledge and skills in Digital Humanities, Bibliometrics, and Information Literacy.

https://orcid.org/0000-0003-1738-9836

Laura Armstrong

Laura Armstrong is a Senior eResearch Engagement Specialist at the Centre for eResearch, University of Auckland working to engage researchers in eresearch, and deliver research data management services and researcher enablement projects.


GBSathon: Benchmarking reproducibility of Genotyping-By-Sequencing analysis workflows through comparison with SNP chip and pedigree data

Rachael Louise Ashby (1), Rudiger Brauning (1), Hayley Baird (1), Ruy Jauregui (2), Monica Vallender (1), Aurelie Laugraud (3), Charles Hefer (3), Abdul Baten (2), Paul Maclean (2), Rayna Anderson (1), Roger Moraga (2), Siva Ganesh (2), Tracey van Stijn (1), Jeanne Jacobs (3), Ken Dodds (1), John McEwan (1), Shannon Clarke (1) and Andrew Griffiths (2)

(1) AgResearch, Invermay Agricultural Centre, Private Bag 50034, Mosgiel 9053, New Zealand

(2) AgResearch, Grasslands Research Centre, Private Bag 11008, Palmerston North 4442, New Zealand

(3) AgResearch, Lincoln Research Centre, Private Bag 4749, Christchurch 8140, New Zealand

The advent of reduced representation genotyping-by-sequencing (GBS) provides a cost-effective high-throughput genotyping platform to many ‘orphan’ species. This enables downstream analyses including genomic selection, parentage assignment, conservation genetics, population genetics and genome wide association studies. There are many different workflows available for deriving SNPs from GBS data. Key aspects of any bioinformatic workflow include accuracy, reproducibility and reliability. Few independent studies benchmark multiple workflows to biological ‘gold standards’, such as pedigree or SNP chip data, to assess these key aspects. Here, we benchmark open source SNP-calling workflows for GBS data to assess their accuracy and reproducibility. To do this, we generated GBS data for a cohort of 333 sheep. These have also been genotyped using a 50k or 600k SNP chip. Furthermore, the cohort comprised 125 parent-offspring trios and all individuals had multigenerational pedigree data. The SNPs called from the GBS workflows were compared back to the gold standards to assess the accuracy, reproducibility and reliability of SNP callers. Focusing on the bigger picture, we derived genomic relationship matrices (GRMs) from all methods to compare the accuracy of the SNPs called for downstream biological applications including relationship estimates among parents and progeny.
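To make the comparison against the two gold standards concrete, the sketch below shows one simple way such checks could be expressed: genotype concordance between a GBS call set and SNP chip calls, and a Mendelian consistency test for a parent-offspring trio. It assumes biallelic SNPs coded as alternate-allele dosage (0, 1 or 2, with None for a missing call); the function names and toy data are hypothetical and this is not the authors' workflow.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # alternate-allele dosage 0, 1 or 2; None = missing call


def concordance(gbs: Sequence[Genotype], chip: Sequence[Genotype]) -> float:
    """Fraction of identical calls among sites called by both methods."""
    if len(gbs) != len(chip):
        raise ValueError("genotype vectors must cover the same sites")
    compared = [(g, c) for g, c in zip(gbs, chip) if g is not None and c is not None]
    if not compared:
        raise ValueError("no sites were called by both methods")
    return sum(g == c for g, c in compared) / len(compared)


def mendelian_consistent(sire: int, dam: int, offspring: int) -> bool:
    """True if the offspring dosage can arise from one allele of each parent."""
    # Alternate-allele counts each parent genotype can transmit.
    transmissible = {0: {0}, 1: {0, 1}, 2: {1}}
    return any(s + d == offspring
               for s in transmissible[sire]
               for d in transmissible[dam])


# Toy data: six sites for one animal, with one missing call per method.
gbs_calls = [0, 1, 2, None, 1, 0]
chip_calls = [0, 1, 1, 2, 1, None]
print(concordance(gbs_calls, chip_calls))
print(mendelian_consistent(0, 2, 1))
```

Restricting the comparison to sites called by both methods keeps missing data from being counted as disagreement, which matters for GBS because call rates are typically lower than on a chip.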


ABOUT THE AUTHOR(S)

Rachael Louise Ashby

Rachael Ashby is a postdoctoral researcher with the Bioinformatics team at AgResearch and Genomics Aotearoa. Her research focusses on the use of next generation sequencing for applications including genome assembly and genotyping-by-sequencing for genomic management of highly diverse species.


eResearch Collaboration Projects – supporting CSIRO’s digital science and research.

John Zic1, Justin Baker2

1 CSIRO, Sydney, Australia, [email protected]

2 CSIRO, Clayton, Australia, [email protected]

Background

CSIRO is Australia’s largest research agency and is a recognised leader in a diverse set of science domains: Agricultural Sciences, Environment/Ecology, Plant and Animal Sciences, Geosciences, Chemistry and Materials Science. CSIRO also manages research infrastructure like the Australia Telescope National Facility (ATNF), the Marine Research Vessel RV Investigator and the Pawsey Supercomputing Centre.

For many years in Australia, and also worldwide [2], research and science have undergone transformational changes with the introduction of new instruments and advanced facilities with matching increases in storage and computing capabilities. Individual researchers were taking a bespoke approach to matching these technologies and capabilities to the way that research and science were carried out. Wider adoption of new practices required social change (in the practice of science and research) and these changes remained fragmented and tailored to specific sciences or even projects. Organisations, by and large, varied enormously in their support of these new practices.

As far back as 2007 [1], CSIRO eResearch practitioners advocated that science and research practices within CSIRO adapt to deal with these challenges. Much like the rest of the world, practices matured over the years: in CSIRO’s health and biosecurity, oceanographic and atmospheric research, radio astronomy, agriculture and food as well as geological and other earth sciences.

However, a significant shift occurred in 2018, with a formal recognition by the CSIRO Board of the need to support the new “digital” science and research at an organisational level. CSIRO developed strategic digital transformation initiatives, including CSIRO’s Managed Data Ecosystem (MDE), Missions and the Digital Academy [4].

The aim of the MDE is to connect current and new platforms in a seamless way and improve interoperability between datasets so users will be able to easily find and work on multiple datasets. It will provide a set of tools and approaches enabling CSIRO and partners to improve our collaboration, mining and analysis of data.


CSIRO Missions are major scientific and collaborative research programs aimed at making

significant breakthroughs in one of six major challenges facing Australia. They include the

resilient and valuable environments, food security and quality, health and well-being, future

industries, sustainable energy and resources, and regional security.

CSIRO's Digital Academy is focused on investing in the digital capability of our staff and

involves a rethink in planning for a digitally driven research environment. It provides a

learning opportunity for our staff, helping define the digital talent, skills and new ways of

working. The Academy will help attract and retain new digital talent within the Australian

innovation system, develop new digital skills and mindsets in Australia’s scientists and

facilitate digital talent accessibility and collaboration across Australia’s innovation system.

Existing Support for “Digital” Science through “eResearch” initiatives

CSIRO Scientific Computing Services group has been providing a dedicated eResearch service

since 2011 [3]. This service is delivered through “eResearch Collaboration Projects” (eRCPs),

which now deliver specialist capabilities including Machine Learning, Data Analytics,

Scientific Visualisation, Workflow Management and Science Data Handling into research and

science projects.

The eRCP process is run as a competitive grant process and continues to be very successful.

In the latest cycle, forty Scientific Computing Services specialists successfully completed and

delivered over sixty eRCP outcomes from a total of eighty submissions. The underlying

capabilities are delivered by members of each of the teams in the Scientific Computing

Services group: Technical Solutions; Data Analytics and Visualisation; Research Software

Engineering; and Modelling and Dataflow. The eRCP process also provides a mechanism to

promote and introduce new tools and frameworks to CSIRO’s research community, e.g.

Jupyter and R/Shiny.

The submission and approval process has also been significantly streamlined since its

introduction in 2011. The new eRCP portal includes semi-automated mechanisms for

seeking endorsement from research unit directors, along with automated directory name

lookup and links to related research projects. A post-project survey is used to elicit feedback

from individual researchers at the end of each cycle.

Specialists from the Scientific Computing program are then assigned to work on one or more

approved eRCPs. Over the six-month cycle, the resource allocation is around 0.2 FTE, with

each staff member allocated 3 eRCP projects per cycle. Importantly, eRCPs are provided to

CSIRO researchers and scientists at no additional charge.

The eRCP has been enormously successful over the years, with demand outstripping

capability to allocate staff to the projects. The program has demonstrated a range of useful

outcomes including, for example, an augmented reality tool for analysing

bushfire plumes over Tasmania; a dashboard to interrogate cotton crop physiological

measurements; and an online platform to monitor algal blooms for multiple water bodies.

Scientific Computing specialists also provide dedicated support to CSIRO researchers, based

around the same set of core capabilities, via an entirely separate funding model known as

“pan deployments” as well as secondments. In both cases, CSIRO projects fund the specialists’

time at larger allocations, often extending over 12 months or more. In a sense, this acts like a

contractor service for Business Units, providing them with highly specialised skills but without

the need to recruit new staff of their own.

Future Plans

CSIRO Scientific Computing will respond to the major initiatives (MDE, Digital Academy and

Missions) as follows:

• MDE

o Redirect Scientific Computing expertise currently working on eRCPs and pan

deployments to MDE related activities. In the first instance, these specialists

will apply their skills and domain knowledge to one of several nominated

pilots, helping design and build foundational components of the MDE.

o Over time, it is anticipated that those same specialists will contribute to the

ongoing development and enhancement of additional MDE components in

line with its progressive organisational rollout.

• Digital Academy

o Develop/adapt training content as appropriate for the Digital Academy. For

example, making use of existing Software Carpentry material for HPC usage,

but customising appropriate aspects for our own computing environment.

o Deliver training content to CSIRO staff. This has already proven very

successful in the machine learning area – with hundreds of staff attending

sessions - and will no doubt continue to grow over time.

• Missions

o Scientific Computing will continue to provide CSIRO researchers with the

eResearch support they need to respond to the significant scientific

challenges that the Missions tackle.

REFERENCES

1. J. A. Taylor, J. Zic, and J. Morrissey, “Building CSIRO e-Research Capabilities,” in eResearch Australasia, 2008.

2. T. Hey, S. Tansley, and K. Tolle, “The Fourth Paradigm: Data-Intensive Scientific Discovery,” Microsoft Research, 2009.

3. S. Moskwa, “The Accelerated Computing Initiative,” in eResearch Australasia, 2012.

4. CSIRO Chief Executive's Report 2018-19: https://www.csiro.au/en/About/Our-impact/Reporting-our-impact/Annual-reports/18-19-annual-report/part-1/chief-executive-report

ABOUT THE AUTHOR(S)

- Dr John Zic is the Executive Manager of CSIRO’s Scientific Computing Services.

- Mr Justin Baker is Leader of the Scientific Computing Data Analytics and Visualisation

Team.

Why overfitting is bad for science: Lessons from psychology

Adam Bartonicek, Dr. Narun Pornpattananangkul, Associate Professor Tamlin Conner

University of Otago, Department of Psychology

Correspondence to: [email protected]

Many published research findings in psychology cannot be replicated. Even formerly “well-established” effects such as power-posing and implicit priming have failed to replicate. The crisis is not limited to psychology: replication issues abound across numerous fields, including neuroscience and the biomedical sciences (e.g. Button et al., 2013; Ioannidis, 2005). The main causes of the replication crisis are thought to be inadequate statistical literacy and questionable research practices, such as p-hacking and “HARKing” (Hypothesizing After the Results are Known). However, there may also be a less well-appreciated contributor to the replication crisis: overfitting (Yarkoni & Westfall, 2017). Overfitting occurs when an overly complex model provides a good fit to the data it was trained on, but fails to accurately predict new samples. The goal of the classical statistical frameworks used in psychology, such as OLS and maximum likelihood methods, is to provide inference by finding the best fit to the data at hand. As such, these methods are liable to overfitting, especially when used alongside automatic variable selection methods such as forward, backward, and stepwise regression. Conversely, the goal of more recent statistical and machine learning methods is to maximize prediction accuracy in new samples and guard against overfitting directly. Psychologists and other scientific researchers may therefore benefit from incorporating newer statistical and machine learning methods into their research in order to improve its replicability. To this end, more user-friendly open-source machine learning software packages are now being developed, such as the recent R package PredPsych and the machine learning module for JASP. The proliferation of convenient digital tools for machine learning may lead to more replicable and reliable research, in psychology and in experimental science in general.
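The definition of overfitting given above can be illustrated with a small simulation. The following is a minimal sketch and not the authors' analysis: the data-generating model (y = 2x plus Gaussian noise), the sample sizes, and the helper names `make_data`, `fit_line`, `fit_memoriser` and `mse` are all assumptions made for illustration. It compares a straight-line OLS fit with a deliberately over-flexible model (a 1-nearest-neighbour "memoriser") on the sample used for fitting and on a fresh sample that stands in for a replication.

```python
import random


def make_data(n, rng):
    """Simulate n points from y = 2x + Gaussian noise (an assumed toy model)."""
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.5)
        points.append((x, y))
    return points


def fit_line(train):
    """Ordinary least squares with one predictor: the simple model."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memoriser(train):
    """1-nearest-neighbour: an overly complex model that memorises the sample."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def mse(model, data):
    """Mean squared prediction error of a fitted model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)        # fixed seed so the run is repeatable
train = make_data(200, rng)   # the sample the models are fitted to
test = make_data(2000, rng)   # a fresh sample standing in for a replication

line = fit_line(train)
memo = fit_memoriser(train)

results = {
    "line": {"train": mse(line, train), "test": mse(line, test)},
    "memoriser": {"train": mse(memo, train), "test": mse(memo, test)},
}

for name, errors in results.items():
    print(f"{name:10s} train MSE = {errors['train']:.3f}   test MSE = {errors['test']:.3f}")
```

The memoriser reproduces every training point, so its error on the fitting sample is zero, yet on the fresh sample it should do worse than the straight line because it has fitted the noise as well as the signal. Judging models by held-out error rather than in-sample fit is the basic safeguard that prediction-focused methods build in.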

ABOUT THE AUTHOR(S)

Adam Bartonicek is a PhD student at the Department of Psychology, University of Otago. His main interests are well-being and using new statistical learning methods for high-dimensional inference.

Dr. Narun Pornpattananangkul is a lecturer at the Department of Psychology, University of Otago. His main research interests include using big data in fMRI to study changes in reward-processing in mood disorders.

Associate Professor Tamlin Conner is a lecturer at the Department of Psychology, University of Otago. Her main research interests include the impact of health behaviours on well-being and using mobile technology for daily experience sampling.

REFERENCES

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 0696–0701. https://doi.org/10.1371/journal.pmed.0020124

Yarkoni, T., & Westfall, J. (2017). Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6), 1100–1122. https://doi.org/10.1177/1745691617693393

Assessing the potential of autonomous AI devices for portable real-time DNA sequencers and deployable sensors

Authors name(s): Miles Benton, Joep de Ligt, Donia Macartney-Coxson, Richard Dean

Organisation: Institute of Environmental Science and Research (ESR), Wellington, NZ

Authors Email(s): [email protected], [email protected], donia.macartney-

[email protected], [email protected]

The current ‘climate’ is full of buzz words, ranging from AI (artificial intelligence) and deep

learning, through to cloud computing and the ‘Internet of Things’. For consumers, and even

research specialists, this can all be a little overwhelming. At ESR we are endeavouring to

provide our staff, clients, and hopefully the wider community, with some insight into the

technologies behind this jargon. One such project involves evaluating the deployment of

low-cost portable devices into the field to collect real-time data.

This talk will highlight our experiences with the Nvidia Jetson family of small embedded

computing platforms. The Jetson ecosystem includes small form-factor modules with GPU-

accelerated parallel processing, making them ideal low-power, high-performance portable

devices which have the capability to perform advanced operations in remote locations.

Our aim is to create a cost-effective and truly portable real-time DNA sequencing device

which can be easily taken into the ‘field’ with results reported in real time as the sequencer

runs. This will incorporate the Nanopore MinION DNA sequencer alongside a cheap single

board computer (Nvidia Jetson based) powered by off-the-shelf rechargeable batteries. The

Nvidia powered technology will allow real-time base calling of DNA, thus making direct

detection/identification in the field a real possibility.

Additionally, we envisage a totally modular device not just limited to DNA sequencing.

Backwards compatibility with such ecosystems as Raspberry Pi and Arduino means that a

wide range of sensors can be attached (e.g. temperature, humidity, water flow, cameras)

which can report back in real-time. This, alongside the ability to run off portable (even solar

powered) batteries, makes for an extremely versatile base unit.

Ultimately the whole package is extremely cost-effective, with potential use cases across a

multitude of research, primary sector and industry fields. Additionally, the affordable and

easy-to-source components provide exciting opportunities for such endeavours as community

outreach and education.

ABOUT THE AUTHOR

- Name: Dr Miles Benton

- Bio: Dr Benton is a Senior Bioinformatics Scientist within the Human Genomics group

at ESR with extensive experience in computational genomics and bioinformatics. He

recently completed a post doc at Queensland University of Technology (Brisbane,

Australia) working on the development of methods to deal with ever expanding

genomic data sets and their access and interpretation back to the people that matter

(i.e. clients, clinicians, researchers, public, etc). Part of his role at ESR has been

implementing bioinformatics workflows in both research and clinical settings. He is

also developing machine learning/AI technology on portable Nvidia modules for field

deployment in various areas. Dr Benton was recently appointed to the Genomics

Aotearoa Bioinformatics Leadership Team, where he is responsible for overseeing

bioinformatics support for human health projects. He is also heavily involved in the

Data Carpentries as an instructor and facilitator, as well as a mentor on ESR's data

science accelerator programme. He is deeply committed to making data science and

its tools accessible, with the belief that everyone should be able to 'play' with and

interpret their data.

Research Cloud NZ

Blair Bethwaite

NeSI, University of Auckland

[email protected]

Research Clouds are community or private Infrastructure-as-a-Service computing

capabilities tailored for research users, services, and workloads. As IaaS offerings, these capabilities

can cater to a massive range of use-cases, providing researcher-defined infrastructure with

close integration to other institutional IT services. As the international Open Infrastructure

(née OpenStack) community has matured and stabilised in recent years we are seeing more

and more scientific cloud deployments popping up – over 50 research organisations were

represented at the last Scientific SIG BoF during the Berlin OpenStack Summit and close to

100 people attended the Cloud Infrastructure in HPC BoF at SC19.

In Australia the Nectar cloud programme, which built one of the first national Research

Clouds over 7 years ago, is continuing to be supported through the merged and rebranded

ARDC (Australian Research Data Commons). New Zealand already operates one private

Node of the Nectar cloud thanks to the University of Auckland. Could there be more, and is

there a case for testing broader sector access?

This BoF aims to bring together research infrastructure specialists from across the country

to gather interest and workshop models and technical architectures for a national research

cloud capability.

ABOUT THE AUTHOR(S)

Blair Bethwaite

Blair has worked in distributed computing for over a decade; both in research and for

research; for institutional and national projects; from applications, through grid & cloud

middleware, to full HPC & cloud systems design, implementation, and operations. Previously

over the ditch at Monash University, Blair most recently led Monash’s use of OpenStack to

underpin research computing. Originally from Christchurch, in mid-2018 Blair returned to

take up the opportunity of becoming NeSI's Solutions Manager, focusing back up the

technology stack closer to the user. Blair's role within NeSI covers Application Support and

Collaboration & Integration.

Containers in HPC Tutorial

Blair Bethwaite, Mark Gray

NeSI, Pawsey

[email protected], [email protected]

NB: please refer to https://nesi.github.io/ernz20-containers/ for setup instructions/options prior to attending this tutorial.

Half-day hands-on event hosted on NeSI and run in collaboration with the Pawsey Supercomputing Centre. This event will build on material and experience from the same tutorial recently run at Supercomputing’19.

No longer an experimental topic, containers are here to stay in HPC. They offer software portability, improved collaboration, and data reproducibility. A variety of tools (e.g. Docker, Shifter, Singularity, Podman) exist for users who want to incorporate containers into their workflows, but oftentimes they may not know where to start.

This tutorial will cover the basics of creating and using containers in an HPC environment. We will make use of hands-on demonstrations from a range of disciplines to highlight how containers can be used in scientific workflows. These examples will draw from Bioinformatics, Machine Learning, Computational Fluid Dynamics and other areas.

Through this discussion, attendees will learn how to run GPU- and MPI-enabled applications with containers. We will also show how containers can be used to improve performance in Python workflows and I/O-intensive jobs.

Lastly, we will discuss best practices for container management and administration. These practices include how to incorporate good software engineering principles, such as the use of revision control and continuous integration tools.

Naive Prediction of Cancer Outcomes using Machine Learning

Matt Bixley¹, Mik Black¹

¹Department of Biochemistry, University of Otago, New Zealand

[email protected]

[email protected]

Prediction of 5-year cancer outcomes from histology images has been undertaken using

machine learning (ML), artificial intelligence (AI) and deep learning techniques, by multiple

international research groups, with success for a number of different cancers (e.g., breast

and colorectal). A key outcome in this approach is the easy translation of technology to

allow pathologists to access the applications in their workflow. An extension to the idea of

outcome prediction is to use histology image data to estimate genomic characteristics of a

tumour, such as those often derived from gene expression data – examples include

molecular subtype, proliferation rate, oncogenic pathway activation, and genomic

instability.

Typically the training process involves the hand delineation of hundreds, if not thousands, of slides to

identify regions of interest and remove aberrations to improve accuracy. While some

automation has been attempted, here we present a naive approach to estimate the

accuracy achievable with minimal human intervention. Currently the work has been applied to stomach

cancer slides from The Cancer Genome Atlas (TCGA), using both patient outcome data, and

genomic data on the molecular characteristics of the tumour.

ABOUT THE AUTHOR(S)

Matt is a Carpentries Instructor and Teaching/Research Fellow at the University of Otago.

His research background extends from laboratory and field work through quantitative

genetics and bioinformatics. Matt’s current research is on the use of Machine Learning tools

to predict cancer outcomes.

Mik received a BSc(Hons) in statistics from the University of Canterbury, and an MSc

(mathematical statistics) and PhD (statistics) from Purdue University. After completing his

PhD in 2002, Mik returned to New Zealand to work as a lecturer in the Department of

Statistics at the University of Auckland. An ongoing involvement in a number of Dunedin-

based collaborative genomics projects resulted in a move to the University of Otago in 2006,

where he now leads a research group focused on the development and application of

statistical methods for the analysis of data from genomics experiments, with a particular

emphasis on human disease. Mik has also been heavily involved in major initiatives

designed to put in place sustainable national research infrastructure for NZ: Genomics

Aotearoa and NZ Genomics Limited for genomics, digital literacy training via The

Carpentries, and NeSI (New Zealand eScience Infrastructure) for high performance

computing and eResearch.

Reproducible Posters an Otago Theme

Matt Bixley

Department of Biochemistry, University of Otago, New Zealand

[email protected]

The endpoint of most research is a publication, be that a journal article (hopefully in

Nature), a conference presentation or a lab meeting. The premise of Reproducible Research

is that not only do we now present a summary of our findings, but we also make available

the details, code and (where possible) the data that led to those findings. Various tools

exist to assist us in sharing our work and documenting our workflows. One extremely

popular tool for this is R Markdown, which provides the ability to write, document and

publish in a single workflow.

Here we present postOTAGO, an R package for creating posters with an Otago theme that is

readily transferable to other organisations.

ABOUT THE AUTHOR(S)

Matt is a Carpentries Instructor and Teaching/Research Fellow at the University of Otago.

His research background extends from laboratory and field work through quantitative

genetics and bioinformatics. Matt’s current research focus is on the use of machine learning

tools to predict cancer outcomes.

DataONE: Supporting Data Discovery and Access through Social and

Technical Infrastructure

Amber E Budden

National Center for Ecological Analysis and Synthesis, University of California Santa Barbara

DataONE

[email protected]

Addressing grand challenge questions requires exploration at broad spatial, geographic and temporal scales, facilitated through easy access to distributed, heterogeneous data. DataONE is an interoperable, federated network of data repositories providing open, persistent, robust, and secure access to well-described and easily discovered data about life and the environment. Over the last ten years of development, both technical and social capacity building has been critical in creating an infrastructure that meets the current and future needs of the community. Informed by working group research, community engagement, and usability evaluation, DataONE has developed a comprehensive search and discovery platform exposing over 1.2M data files; tools and services that support research reproducibility, transparency and credit; and data management training and resources to enhance data literacy. Through these and aligned activities, DataONE has improved interoperability across a broad coalition of data repositories and enhanced data practices across a diverse community of researchers, data managers, and data librarians.

DataONE is a community-governed network built in partnership with existing data repositories supporting distinct and diverse communities. As DataONE continues to grow from a funded project into a sustained program, this networked, user-driven approach continues to inform infrastructure development, feature design and prioritization, maximizing the value and impact of research data in an increasingly complex, diversified data discovery and use landscape.

ABOUT THE AUTHOR(S)

- Amber E Budden, PhD BSc

Amber Budden is the Director of Learning and Outreach at the National Center for Ecological Analysis and Synthesis, where she leads the NCEAS Learning Hub and short course activities. She is an open science facilitator, community manager and data literacy trainer and serves as a co-lead on several projects, including DataONE, a community-networked infrastructure supporting Earth and environmental scientists in their data management, preservation, search and discovery needs. An advocate for open and transparent science, Amber previously conducted research on article publication practices before working in the open data landscape. In her current roles, Amber supports the community in using open science infrastructure and leads training and outreach activities focused on best practices for data management.

Amber has a PhD in behavioral ecology and has conducted postdoctoral research on avian sexual selection and life-histories at the University of California Berkeley, in addition to bibliometrics research at NCEAS. Amber has held teaching positions at the University of Toronto and York University in Canada and has worked in outreach and publications within the non-profit sector. She is currently a principal investigator on several cyberinfrastructure awards, including DataONE, the Arctic Data Center and the Permafrost Discovery Gateway; Chair of the ESIP Data Stewardship Committee; a member of the Make Data Count team; and an Advisory Board member for the Center for Scientific Collaboration and Community Engagement. She was previously a board member of the National Postdoctoral Association.

Amber holds a PhD in Behavioral Ecology from the University of Wales, a Joint Honors BSc in Psychology and Zoology from the University of Bristol and qualification in youth and community work.

Building and Supporting a New Zealand Digital Literacy Training Community

Megan Guidry, Ngoni Faya, Murray Cadzow, and Fabiana Kubke

NeSI, Genomics Aotearoa, University of Otago, and University of Auckland

[email protected], [email protected], [email protected], and

[email protected]

The delivery of digital literacy training has been a focus nationally for several years. An

increase in data availability across fields has led to a capability gap that can only be filled by

providing researchers and support staff with relevant training opportunities that encourage

and incentivise continual learning.

One-off events (such as Carpentries workshops, ResBaz events, etc.) are a great start, yet

attendees can often feel stuck or discouraged once they leave a supportive workshop

environment, a topic touched on at the 2019 eResearch NZ BoF ‘What Happens on Monday’.

It is difficult for researchers and support professionals to discern what skills are needed to

increase their research capability.

Attendees of this BoF will work together to co-create the first iteration of a skills roadmap

for eResearchers in NZ. This will involve an interactive discussion of efforts to support and

develop local communities, identification of knowledge and skills delivered in

training, and exploration into how credentialing could be applied. Understanding, as best as

possible, the skill building ‘stepping stones’ that lead eResearchers to increased capability

will provide a foundation for local communities in NZ to provide digital literacy training

opportunities that are more coordinated, cooperative, and nationally effective at raising

eResearch capability.

ABOUT THE AUTHOR(S)

Megan Guidry is the Regional Coordinator for the Carpentries in New Zealand and also

works as the training coordinator for the New Zealand eScience Infrastructure (NeSI). Her

main priority is raising the eResearch capability in New Zealand through training delivery

and community building.

Ngoni Faya is Genomics Aotearoa’s Training Coordinator, tasked with supporting and

building capacity and capability in bioinformatics for New Zealand. Based at the University

of Otago, he will be working with Genomics Aotearoa partners across New Zealand to

develop resources and technologies that provide international level training for genomics

and bioinformatics.

Murray Cadzow is a Teaching Fellow and Postdoctoral Fellow at the University of Otago. He

is both a Carpentries instructor and instructor trainer. His teaching focus is on delivering

digital literacy training to researchers, and the development and support of the local

Carpentries community at Otago. His research involves the use of large datasets to

investigate the genetic basis of Gout in Māori and Polynesian populations.

Fabiana Kubke is a Neuroscientist at the University of Auckland and is a strong advocate for

developing and supporting the training and development of digital literacy for students and

researchers.

Carpentries at Otago

Murray Cadzow, Matt Bixley and Mik Black

University of Otago

{murray.cadzow, matt.bixley, mik.black}@otago.ac.nz

As part of a strategic initiative from the Division of Health Sciences at the University of

Otago, a project was established to increase researchers’ use of “big data” in research

projects. The first steps taken in beginning to build this capability were to ramp up both the

delivery of Software and Data Carpentry workshops, and the training of local instructors in

The Carpentries pedagogy. As part of this initiative, Murray and Matt have been delivering

and facilitating Carpentries workshops across the multiple University of Otago campuses

(Dunedin, Christchurch, Wellington), developing additional training materials and lessons,

and supporting other groups in the use of Carpentries pedagogy for non-Carpentries

workshops. In this talk we will discuss some of the impacts this initiative has had on

delivering Carpentries workshops, and on the Carpentries community at Otago.

ABOUT THE AUTHOR(S)

Murray Cadzow is a Teaching Fellow and Postdoctoral Fellow at the University of Otago. He

is both a Carpentries instructor and instructor trainer. His teaching focus is on delivering

digital literacy training to researchers, and the development and support of the local

Carpentries community at Otago. His research involves the use of large datasets to

investigate the genetic basis of Gout in Māori and Polynesian populations.

Matt Bixley is a Carpentries Instructor and Teaching/Research Fellow at the University of

Otago. His research background extends from Lab and Field work through Quantitative

Genetics and Bioinformatics. Current research is in the use of Machine Learning tools to

predict cancer outcomes.

Mik received a BSc(Hons) in statistics from the University of Canterbury, and an MSc

(mathematical statistics) and PhD (statistics) from Purdue University. After completing his

PhD in 2002, Mik returned to New Zealand to work as a lecturer in the Department of

Statistics at the University of Auckland. An ongoing involvement in a number of Dunedin-

based collaborative genomics projects resulted in a move to the University of Otago in 2006,

where he now leads a research group focused on the development and application of

statistical methods for the analysis of data from genomics experiments, with a particular

emphasis on human disease. Mik has also been heavily involved in major initiatives

designed to put in place sustainable national research infrastructure for NZ: Genomics

Aotearoa and NZ Genomics Limited for genomics, digital literacy training via The

Carpentries, and NeSI (New Zealand eScience Infrastructure) for high performance

computing and eResearch.

Building Machine Learning Systems on Microsoft Azure Cloud Virtual

Machines –

A Joint Report of Two Projects

Cheng-Hao Cai

School of Computer Science

University of Auckland

E-mail: [email protected]

We complete two machine learning projects using Microsoft Azure Virtual Machines (VMs). The first project is automatic voice recognition, which is the use of machine learning techniques to convert human speech to text. We build Gaussian mixture models, hidden Markov models and deep neural networks on the Azure VM, then use 100 hours of voice data to train the models. We find that better machine learning models and more training data can lead to increased accuracy of voice recognition, while background noise can reduce the recognition accuracy. The second project is automated program repair. In this project, machine learning models such as support vector machines and random forests are used to learn the semantics of programs. The training of these models, the model checking processes and the constraint solving processes are all carried out on the Azure VM. As both the model checking and machine learning techniques require considerable computational resources, we suggest using these techniques with Azure cloud computing services.

ABOUT THE AUTHOR

Chenghao Cai is a PhD student at the School of Computer Science, University of Auckland, whose study has been financially supported by the China Scholarship Council (CSC). His PhD work has made substantial contributions to the field of automated software engineering, especially in the area of machine learning approaches to formal design model repair. Chenghao is in the final stage of his PhD study. His thesis consists of eleven chapters, the content of which is supported by internationally peer-reviewed publications. Chenghao has published ten research papers to date, among them a 52-page manuscript published in the Automated Software Engineering (ASE) journal. ASE is a prestigious international journal in the field of Software Engineering, with an A-tier ranking from the Computing Research and Education Association of Australasia (CORE). Furthermore, Chenghao received the Microsoft Asia Cloud Research Software Fellowship (CRSF) Award in June 2019.

Additional Authors: Jing Sun, Gill Dobbie

School of Computer Science, University of Auckland

Automating the Research Data Lifecycle with Globus Automate

Ryan Chard, Kyle Chard, Ian Foster

Argonne National Laboratory and University of Chicago

[email protected], [email protected], [email protected]

Research data can traverse a multitude of compute and storage devices from their

collection, through analysis, dissemination, and archival storage. The scientific data lifecycle

often requires acting on data spanning geographical locations and timescales, from near-

real time quality control, to human-oriented curation, through to long-term cataloguing and

archival. Further, almost any step of this lifecycle can require the use of specialized

hardware or computing resources resident in one or more administrative domains.

Combined with ever-growing data rates and volumes, these challenges necessitate new

technologies to aid researchers in reliably, and simply, offloading distributed data

management and analysis tasks.

To address these needs we have developed Globus Automate--a distributed research

automation platform designed to empower scientists to create, deploy, and apply data-

oriented pipelines. Globus Automate can reliably automate the entire research data

lifecycle, governing data from its generation at various instruments, through analysis, to

dissemination and archival, while weaving fine-grained access control throughout the

pipeline to securely interoperate with services across administrative domains. Globus

Automate enables users to offload the management of data and abstract the challenges

associated with distributed analysis and storage pipelines.

Globus Automate fills an important, yet previously unmet need in science by enabling the

composition of data management services into distributed data management pipelines.

Using any of the provided Globus services, such as Transfer, Search, and Auth, as well as any

custom service that exposes an Automate API, users can construct rich data pipelines to

perform various tasks. Further, users can leverage funcX--a distributed function as a service

platform--in Automate flows to perform remote computation on almost any resource to

which the user has access.
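
The composition idea described above can be illustrated with a minimal Python sketch. This is not the Globus Automate API: the Flow class, the three step functions and the dataset name are hypothetical stand-ins, chosen only to show how independent services might be chained into one pipeline that hands its state from step to step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, object]
Step = Callable[[State], State]


@dataclass
class Flow:
    """Hypothetical flow: an ordered list of steps sharing one state dict."""
    steps: List[Step] = field(default_factory=list)

    def then(self, step: Step) -> "Flow":
        # Append a step and return the flow so calls can be chained.
        self.steps.append(step)
        return self

    def run(self, state: State) -> State:
        for step in self.steps:
            # Each step works on a copy, so the caller's input is not mutated.
            state = step(dict(state))
        return state


def transfer(state: State) -> State:
    # Stand-in for a data transfer service (assumed destination path).
    state["location"] = "compute:/scratch/" + str(state["dataset"])
    return state


def analyse(state: State) -> State:
    # Stand-in for a remote analysis function.
    state["result"] = "analysed " + str(state["location"])
    return state


def publish(state: State) -> State:
    # Stand-in for publication into a data portal.
    state["published"] = True
    return state


flow = Flow().then(transfer).then(analyse).then(publish)
final = flow.run({"dataset": "scan_001"})
print(final)
```

Keeping each step a plain function of the shared state is one possible design choice; it makes steps easy to reorder or replace without touching the others.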

Fig 1. An overview of the Globus Automate pipeline used to analyse and publish data from

the Advanced Photon Source’s (APS) 8ID beamline. This flow captures data at the APS (1-2),

transmits it to the leadership computing facility for analysis (3-4), and publishes the data (6)

into a data portal for visualization and consumption.

In this talk we will present Globus Automate and describe use cases from initial pilot

deployments. We will describe how funcX and Globus Automate make it possible to easily

and seamlessly exploit a wide range of computational resources to automate the research

data lifecycle, as depicted in Fig 1, from performing preprocessing and quality

control tasks locally through to outsourcing large-scale analyses to leadership computing

facilities.

ABOUT THE AUTHORS

Ryan Chard is an Assistant Computer Scientist at Argonne National Laboratory, having joined

in 2016, when he was awarded a Maria Goeppert Mayer Fellowship. His research focuses on

the development of cyberinfrastructure to enable scientific research. He is particularly

interested in automation platforms and performing on-demand scientific analysis at scale.

He has a Ph.D. in Computer Science from Victoria University of Wellington, New Zealand and

a Master of Science from the same university. His research interests include high

performance computing, scientific computing, cloud computing, cloud economics, and

network inference.

Kyle Chard is a Research Assistant Professor at the University of Chicago and a researcher at

Argonne National Laboratory. He received his Ph.D. in Computer Science from Victoria

University of Wellington, New Zealand. His research interests include data-intensive

computing, cloud computing, and economic resource allocation.

Ian Foster is an Argonne Senior Scientist and Distinguished Fellow and the Arthur Holly

Compton Distinguished Service Professor of Computer Science. Ian received a BSc (Hons I)

degree from the University of Canterbury, New Zealand, and a PhD from Imperial College,

United Kingdom, both in computer science. His research deals with distributed, parallel, and

data-intensive computing technologies, and innovative applications of those technologies to

scientific problems in such domains as climate change and biomedicine. Methods and

software developed under his leadership underpin many large national and international

cyberinfrastructures. Ian is a fellow of the American Association for the Advancement of

Science, the Association for Computing Machinery, and the British Computer Society. His

awards include the Global Information Infrastructure (GII) Next Generation award, the

British Computer Society's Lovelace Medal, R&D Magazine's Innovator of the Year, and an

honorary doctorate from the University of Canterbury, New Zealand. He was a co-founder of

Univa UD, Inc., a company established to deliver grid and cloud computing solutions.

What language are you speaking?!

Wallace Chase

REANNZ

[email protected]

IT people. Students. Administrators. Scientists. Librarians. Corporate suits. Cultures. Nations. Collaborative research involves folks from across the whole spectrum. How do we communicate with each other? How do we speak the same language when often our lived experiences are so different? Join us for a discussion about how we can more effectively communicate with each other to solve complex multi-disciplinary problems.

ABOUT THE AUTHOR(S)

As Technical Engagement Manager at REANNZ, Wallace assists the community to utilize the full potential of the global Research and Education network ecosystem. To this end Wallace leverages his 15 years of experience in higher education IT operations and supporting research infrastructure.

How does REANNZ work?

Wallace Chase

REANNZ

[email protected]

Moving data around is key to successful collaborations and REANNZ is here to help! Come for a discussion around how REANNZ moves your data around. Come learn about the services and tools available to the REANNZ community!

ABOUT THE AUTHOR(S)

As Technical Engagement Manager at REANNZ, Wallace assists the community to utilize the full potential of the global Research and Education network ecosystem. To this end Wallace leverages his 15 years of experience in higher education IT operations and supporting research infrastructure.

Frozen data

Wallace Chase

REANNZ

[email protected]

How does data from polar experiments get to the warm labs of researchers around the world? Come join Wallace Chase of REANNZ to learn how the data gets thawed out and moved around the world. Hear about how REANNZ and the international NREN community are currently working to increase the ability to move ever-increasing volumes of big data from the ice to the world.

ABOUT THE AUTHOR(S)

As Technical Engagement Manager at REANNZ, Wallace assists the community to utilize the full potential of the global Research and Education network ecosystem. To this end Wallace leverages his 15 years of experience in higher education IT operations and supporting research infrastructure.

Why so slow? Molasses biased data transfers…

Wallace Chase

REANNZ

[email protected]

Why is my data moving so slowly? Come hear the top reasons why your data transfer is so very slow and how you can speed it up…

ABOUT THE AUTHOR(S)

As Technical Engagement Manager at REANNZ, Wallace assists the community to utilize the full potential of the global Research and Education network ecosystem. To this end Wallace leverages his 15 years of experience in higher education IT operations and supporting research infrastructure.

Scalable Workflows and Reproducible Data Analysis (for Genomics)

Francesco Strozzi, Roel Janssen, Ricardo Wurmus, Michael R. Crusoe, George Githinji, Paolo

Di Tommaso, Dominique Belhachemi, Steffen Möller, Geert Smant, Joep de Ligt & Pjotr Prins

ESR Institute of Environmental Science and Research

[email protected]

Biological, clinical, and pharmacological research now often involves analyses of genomes,

transcriptomes, proteomes, and interactomes, within and between individuals and across

species. Due to large volumes, the analysis and integration of data generated by such high-

throughput technologies have become computationally intensive, and analysis can no longer

happen on a typical desktop computer.

This group of authors came together to describe and execute the same analysis using a

number of workflow systems, and to show how these follow different approaches to tackle execution

and reproducibility issues. A book chapter [1] on these topics showcases how any

researcher can create a reusable and reproducible bioinformatics pipeline that can be

deployed and run anywhere. This includes how to create a scalable, reusable, and shareable

workflow using four different workflow engines: the Common Workflow Language (CWL),

Guix Workflow Language (GWL), Snakemake, and Nextflow.
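
What the four engines have in common is that a pipeline is declared as steps plus the dependencies between them, and the engine derives the execution order. The following Python sketch illustrates only that shared idea. It is not CWL, GWL, Snakemake or Nextflow syntax, the step names are hypothetical, and it assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
dependencies = {
    "quality_control": set(),
    "align_reads": {"quality_control"},
    "call_variants": {"align_reads"},
    "report": {"call_variants", "quality_control"},
}

# A workflow engine would resolve an order like this before scheduling jobs.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because the order is derived from the declared dependencies rather than hand-written, adding or removing a step does not require re-sequencing the rest of the pipeline.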

We would like to present the different components discussed in this chapter and discuss how

these can foster stronger and more efficient collaboration across Aotearoa. It should be

noted that while the examples are from a genomics background these principles apply to all

data-based research projects that require reproducible and scalable workflows.

1. Strozzi, F. et al. Scalable Workflows and Reproducible Data Analysis for Genomics. in Evolutionary Genomics: Statistical and Computational Methods (ed. Anisimova, M.) 723–745 (Springer New York, 2019). doi:10.1007/978-1-4939-9074-0_24 https://link.springer.com/protocol/10.1007/978-1-4939-9074-0_24

ABOUT THE AUTHOR(S)

- Joep de Ligt, PhD

Dr. Joep de Ligt is the Bioinformatics Lead at ESR. Prior to this role in New Zealand he was

involved in genomics research and bioinformatics education and community building in the

Netherlands. For a full overview of his scientific publications, please see his Google Scholar page:

https://scholar.google.com/citations?user=z2edTLkAAAAJ&hl=en

Accelerating Data Science

Richard Dean

Institute of Environmental Science and Research

[email protected]

The ability to influence decision making by extracting knowledge from data is key to success

in organisations across New Zealand. However, high demand for data scientists means that

many organisations who want to expand their data analytics capability experience

difficulties in recruiting suitably skilled candidates. Richard will present an alternative

approach focussed on the upskilling, retraining and empowering of existing employees

through what is termed a ‘data science accelerator’. He will discuss his experience as Public

Health England’s first graduate from the UK government’s data science accelerator

programme, how that led to working on some cool projects in the UK and why he’s now just

as fired up to bring the concept over to New Zealand. He will provide some insights from

how the data science initiative is settling in at ESR and how New Zealand could become

more ‘united in data’.

ABOUT THE AUTHOR(S)

Richard is a Data Scientist at ESR, a crown research institute that deals with nitty gritty real

world problems affecting human communities covering everything from forensic science to

human health, biowaste, microplastics and the environment. Before joining ESR, he worked

as a Senior Data Scientist for Public Health England, an executive agency of the UK’s

Department of Health.

In his current role, he works across the whole organisation on projects that gain insight from

big data sets. He is also responsible for driving forward ESR’s data science initiative which

involves training staff through data carpentries and pushing the boundaries through an

engineering, robotics, innovation, coding and automation club – Erica for short.

Richard was the first member of staff from PHE to graduate from the UK government digital

service ‘data science accelerator’ programme. In 2019, he brought the scheme to New

Zealand through an internal accelerator programme within ESR. A second cohort is currently

being planned and will run from February – May 2020.

He has a BSc in Information Systems Management from Durham University and wrote an

MSc thesis on public health data interoperability standards while working in Durham.

He moved to New Zealand in November 2017 with his Kiwi wife and is trying his best to raise

two crazy kids – one born in the UK and one born in NZ.

Richard’s claim to fame is that he is one of New Zealand’s most successful mini golf coaches,

having convinced his wife to travel to Kosovo for the 2016 World Adventure Golf Masters,

where she won a bronze medal - New Zealand’s first ever medal in international match play.

Challenges and opportunities in timely and efficient delivery of IT for eResearch projects

David Eyers

University of Otago

[email protected]

Lahiru Ariyasinghe

University of Otago

[email protected]

There are many instances of “big” eResearch projects that have been very well supported by

initiatives both within the University of Otago, and across New Zealand overall. Often at the

other end of the spectrum of project scale, initiatives such as The Carpentries have

supported widespread capability lift in terms of researchers adopting computing

technology. However, there are many eResearch projects that face tricky computational and

data processing problems and are likely to lead to great opportunities, but for which it is

difficult to assess and prioritise the potential impact that the project might have, relative to

others.

From the perspective of the eResearch Advisory Group at the University of Otago, and from

collaborations across the University, we have seen many eResearch projects face types of

barriers to their completion that would have been difficult to predict ahead of time. Some

of the types of challenges encountered have included:

• access to funding—e.g., where a potential cost emerges within a project, that did not

fit into the scope of research grants;

• types of funding—capital expenditure versus operating expenditure in terms of

research computing, e.g., DIY clusters versus use of the cloud;

• sustainability—e.g., considering how to support projects after their headline grant

funding has finished;

• tracking issues that need resolution across multiple different teams—e.g., across

departmental and central IT, researchers, NRENs, etc.;

• prioritisation and opportunity costs—e.g. the mechanisms that can support

escalation of issues in an efficient manner;

• management of the expectations of researchers and professional staff involved in

research projects;

Genomics Aotearoa Training

1. Ngoni Faya & 2. Dinindu Senanayake

1. Genomics Aotearoa, New Zealand. 2. NeSI, New Zealand

[email protected] & [email protected]

A decrease in sequencing cost has seen a large amount of sequence data being generated in the last few years, leading to a paradigm shift from sequencing data generation to data analysis. Despite the ease of data generation, the same cannot be said for data analysis, mainly because few researchers have the bioinformatics skills necessary to analyze these datasets. Moreover, most data analysis tools are developed for use with the Linux command line and require the use of high-performance computers, so there is a need for hands-on data analysis training. Empowering researchers through hands-on training courses is the key to improving knowledge and understanding of bioinformatics approaches, thereby easing the skills shortage.

Genomics Aotearoa (GA) is a collaborative platform established to ensure that New Zealand is internationally participating and leading in the fields of genomics and bioinformatics. One of GA’s projects that is critical to genomics research is bioinformatics capability, which provides the bioinformatics tools and strategies needed to analyze information. The bioinformatics capability project aims to address the increasing local demand for data analysis methods as well as training. The concept is: develop material and pipelines that can be accessed by everyone, and travel to offer hands-on bioinformatics workshops. Post-doctoral researchers with strong bioinformatics backgrounds have been brought on board to develop open-source and reusable data analysis material and pipelines to benefit the genomics research community. At this stage, development of introductory, intermediate and advanced bioinformatics training material for genomics researchers is underway.

Together with our partner, NeSI, coordination and delivery of data science and bioinformatics training workshops has already begun around the country, with just over 250 researchers trained. Based on the expressions of interest, we managed a supply/demand ratio of ~66%, with factors such as instructor and room availability contributing significantly to lowering this figure. In 2020, we are coming up with strategies to improve our supply/demand ratio. NeSI platforms and its virtual machines were instrumental in hosting the training workshops, which allow trainees to expand their skill set from introductory to advanced levels with a focus on how to use HPCs for their research.

Authors

Ngoni Faya

Ngoni is Genomics Aotearoa’s Training Coordinator, tasked with supporting and building capacity and capability in bioinformatics for New Zealand. Based at the University of Otago, he is working with Genomics Aotearoa partners across New Zealand to develop resources and technologies that provide international level training for genomics and bioinformatics. The aim is to give genetics researchers the training opportunities they need to analyse their own data sets, as well as facilitating the NeSI computing platforms and infrastructure required in their projects.

Dinindu Senanayake

Dini is an Applications Support Specialist at NeSI with a particular interest in the use of high-performance computing for Computational Biology and Bioinformatics. He joined NeSI following a decade of research experience gained in the fields of Cancer Genetics, Chemical Genetics, Immunology and Bioinformatics.

• estimation of projects’ timelines, and likely response times for resolving IT issues

that are encountered.

There are many changes on the eResearch horizon that could have a significant impact

on the above challenges, including:

• use of cloud computing as a mechanism for performing eResearch off campus,

centred on the researcher;

• DevOps tooling for reproducible research within projects, e.g., to avoid bit-rot, and

to help researchers take a central role in the software they develop and/or use;

• including Research Software Engineers (RSEs) within the University staff, and thus

being able to decouple ongoing support from project funding;

• increased capability for delegated authorisation and security management;

• shifting from batch to interactive or streaming data processing;

• machine learning computation gravitating to the devices with cutting-edge

performance (often able to be sourced for free) being installed on researchers’

computers, rather than being centrally sited and/or managed;

• providing delegated authorisation to support self-service facilities for researchers;

• changing roles of supporting organisational units such as university libraries.

The University of Otago is currently restructuring its offerings in terms of eResearch support

within the University’s Information Technology Services. Beyond reviewing various problem

cases, this talk will view the above topics through the lens of what is likely to be practical in

the near future at the University of Otago.

ABOUT THE AUTHOR(S)

- David Eyers is an academic in the Department of Computer Science at the University

of Otago in Dunedin, New Zealand. He previously worked as a senior research

associate at the University of Cambridge, from where he was awarded his PhD. One

of his primary research focus areas is in distributed systems, particularly regarding

communication efficiency, energy monitoring, storage and security management

technologies such as decentralised information flow control. He aims to help

develop power-efficient, reliable, highly scalable and secure architectures for cloud

computing. Beyond research relevant to HPC and cloud computing, he has broad

interests in eResearch topics that include reproducibility, and the distributed

management of research data and metadata.

- Lahiru Ariyasinghe is a postdoctoral researcher in the Department of Computer

Science at the University of Otago in Dunedin, New Zealand, from where he was

awarded his PhD. Lahiru is an expert in building computational pipelines, and using

emerging DevOps approaches to develop reproducible research platforms. Lahiru

has first-hand experience of innovative eResearch projects, including some situations

in which unexpected barriers have been encountered, and creative solutions have

overcome those barriers.

Building a Federated Research Collaborative

David Fellinger

iRODS Consortium

[email protected]

The concept of countrywide and worldwide research collaboratives is relatively new. Several

decades ago it was common for a department head to have multiple vertical file cabinets

with paper folders housing the work of researchers and students in his or her department.

Access to and subsequent citation of this work were generally based on the department head’s

knowledge of the works. As digital storage technologies became less expensive and

relatively ubiquitous the vertical files turned into disk storage systems reflecting the work of

each university department. The works were still filed and maintained by standard file

system references such as creation date, name, and access controls. In many cases, card

catalogs or spreadsheets were used to further describe the titles. The introduction of

Ethernet in the late 1970s largely changed the manner in which research works were

conserved. The deployment of campus-wide data networks enabled universities to establish

and maintain central data repositories. Storage could become a service of the university

where individual colleges or departments no longer had to maintain their own archival

systems. The era of the digital research collaborative was born. In many cases, this

transition took years, and even today, some university departments retain internal storage.

Locating a specific work based upon anything other than title was a challenge and that

problem grew with the number of works that were archived.

The Advent of Storage Management Technology

A digital file system is really just a means for storing and maintaining data like a set of

shelves is a means to hold books. What is actually required is a way to relate descriptive

data to files indicating the contents of a file. This was largely understood for libraries

containing shelves of books starting thousands of years ago dating back to 2000 BC [1]. In

the United States, the Defense Advanced Research Projects Agency (DARPA) funded a

program called the Storage Resource Broker (SRB) in 1995 and 1996 and the first

middleware to identify works based on content and user defined metadata was written. In

2006 the DICE group, a group of research institutions in the US, created the Integrated Rule-

Oriented Data System (iRODS), expanding on the concepts of SRB, and in 2013 the iRODS

Consortium was formed as a user supported community devoted to the long term

continuation of this open source middleware. This project, which was first launched 25 years

ago, has spawned software that is being used to manage data archives worldwide. The

iRODS software can completely virtualize entire file system infrastructures so that storage

purchased from any vendor at any time can be made to appear as one effective file system.

Researchers no longer have to be concerned with the location of data but just the contents.

Data discovery is one of the primary features of iRODS. A researcher can specify search

terms retained in an index that allows other researchers to discover that research work. The

process of building an index does not necessarily require human intervention. Metadata can

be automatically extracted from files at rest or while being ingested to enable

discoverability. In fact, complete workflow automation can be realized with iRODS. Data can

be automatically ingested from numerous sensors and routed, based on content and

policies, to specific compute platforms for analysis. The subsequent data products can then

be distributed based on policy. Data products can be published according to policies

associated with the collections under management. All of this functionality can be audited in

real time to precisely track the operation of a data center.
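The ingest-and-route flow described above can be sketched in plain Python. This is an illustration only, not iRODS rule syntax or any iRODS API: the `DataObject` class, the policy table, and the destination names are all hypothetical, chosen to show metadata being extracted automatically, a policy picking a compute target, and each decision being recorded for audit.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class DataObject:
    """A hypothetical ingested file plus its extracted metadata."""
    path: str
    size_bytes: int
    metadata: dict = field(default_factory=dict)


def extract_metadata(obj: DataObject) -> DataObject:
    """Derive simple metadata from the file itself, without human input."""
    suffix = PurePosixPath(obj.path).suffix.lower()
    obj.metadata["format"] = suffix.lstrip(".") or "unknown"
    obj.metadata["large"] = obj.size_bytes >= 1_000_000_000
    return obj


# Hypothetical policies: (predicate on metadata, destination), in priority order.
POLICIES = [
    (lambda m: m["format"] == "fastq" and m["large"], "hpc-cluster"),
    (lambda m: m["format"] == "fastq", "workstation-pool"),
    (lambda m: m["format"] == "csv", "analytics-node"),
]


def route(obj: DataObject, audit_log: list) -> str:
    """Return the first matching destination and record the decision."""
    obj = extract_metadata(obj)
    destination = "archive"  # default when no policy matches
    for predicate, target in POLICIES:
        if predicate(obj.metadata):
            destination = target
            break
    audit_log.append((obj.path, destination))
    return destination


audit = []
print(route(DataObject("/ingest/run42.fastq", 5_000_000_000), audit))
print(route(DataObject("/ingest/readings.csv", 12_000), audit))
print(route(DataObject("/ingest/notes.txt", 300), audit))
```

Keeping the policies in a table separate from the routing function mirrors the idea that policy, rather than code, decides where data goes.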

Bandwidth Availability Enables Global Collaboration

The deployment of 100 Gbps Ethernet wide area networks across many universities

launched a new era of research data communication. Initially all data operations were

relegated to one campus or entity simply due to the limitations of communication

technology. While it was possible to transfer files by way of File Transfer Protocol (FTP)

technologies, it was not easily possible to create indices that spanned federated collections,

allowing data to be discovered or easily accessed. The secure federation capabilities of

iRODS have changed the way that we think of data locality. One of the key focuses of iRODS

development has been to enable federated collaboration. When the administrators of two

iRODS sites share a set of keys, the two sites, with permissions, can appear as one. The

researcher or administrator can assign access controls for local and WAN access. A user in a

remote zone can easily discover data through access to user defined metadata. A file

transfer can then be enabled with the iRODS servers brokering a direct transfer to the

requesting client. A researcher can even share data with non-iRODS users by issuing a secure

ticket for a specific file or files.

A New Era of Data Sharing is Underway

Large scale iRODS deployments span the world and have enabled collaborations of multi-

national scientists and researchers. In the US the iPlant Collaborative was formed in 2008

with funding from the National Science Foundation. Data management was based on iRODS

from the start of the project and it initially served the plant science communities primarily in

the US. From its inception, iPlant quickly grew into a mature organization providing

powerful resources and offering scientific and technical support services to researchers

nationally and internationally. In 2015, iPlant was rebranded to CyVerse to emphasize an

expanded mission to serve all life sciences [2]. Today CyVerse serves over 47,000 users with

5,690 participating academic institutions and 2,438 non-academic institutions. A major

feature of the collaborative is the Discovery Environment (DE) which allows researchers to

quickly find files of interest relating to their life science discipline. The primary site is in

Tucson, Arizona, with a mirror at the Texas Advanced Computing Center in Austin, Texas. Both data

management and workflow control are enabled by the use of iRODS.

In Europe the EUDAT Collaborative Data Infrastructure (CDI) was formed to host the data of

over 50 universities and research institutions in the European Union. The infrastructure is

managed under iRODS and the data covers over 30 scientific disciplines from atmospheric

research to physics, hydro-meteorology, genomics, and ecology. As with CyVerse, a major

feature of EUDAT is data discovery across the entire geography of the EU. The goal is to

provide both data access and re-use for near term needs as well as data preservation to

build a long term archive [3].

In the Netherlands, SURF has built a data management framework based on iRODS.

Countrywide data from several universities is stored at their data site. Besides the service of

offering data storage and management, they also offer data processing and analysis as well

as compute services. All of the data at the site is moved to various platforms and tiers using

iRODS [4]. SURF is a member of the iRODS community, as are several universities in the

Netherlands.

In Sweden, the Swedish National Infrastructure for Computing (SNIC) is a national research

infrastructure that makes available large scale high performance computing resources,

storage capacity, and advanced user support, for Swedish researchers. This service is

managed under iRODS control [5]. This service uses the Swedish University Network

(SUNET) which links the infrastructure at the KTH Royal Institute of Technology to other

universities in Sweden with a 100Gbps link to facilitate data movement [6].

These are just a few of the iRODS deployments in both the academic and research sectors.

The use of iRODS and its discovery capabilities accelerates scientific research allowing

researchers to quickly find relevant materials while building on them. The power of iRODS to

manage data based on collection policies cannot be overstated as data sets grow and

automation becomes a requirement. Many worldwide universities, libraries, museums, and

companies have chosen iRODS as a technology that allows the “future proofing” of data

collections independent of the evolution of storage. These institutions have realized that

their data policy decisions can be maintained by iRODS at any scale regardless of the change

of data storage or networking technologies over time.

About the author

Dave Fellinger is a Data Management Technologist and Storage Scientist with the iRODS

Consortium. He has over three decades of engineering experience including film systems,

video processing devices, ASIC design and development, GaAs semiconductor manufacture,

RAID and storage systems, and file systems. As Chief Scientist of DataDirect Networks, Inc.

he focused on building an intellectual property portfolio and presenting the technology of

the company at conferences with a storage focus worldwide.

In his role at the iRODS Consortium, Dave is working with users in research sites and high

performance computer centers to confirm that a broad range of use cases can be fully

addressed by the iRODS feature set. He helped to launch the iRODS Consortium and was a

member of the founding board.

He attended Carnegie-Mellon University and holds patents in diverse areas of technology.

References

1. The history of the card catalog is available from: https://www.vox.com/culture/2017/4/21/15357984/card-catalog-library-of-congress-history (accessed 2 November 2019)

2. The history of CyVerse is available from: https://www.cyverse.org/about (accessed 9 October 2019)

3. Information regarding EUDAT is available from: https://www.eudat.eu/eudat-cdi (accessed 8 October 2019)

4. Information regarding SURF is available from: https://www.surf.nl/en/research-ict (accessed 9 October 2019)

5. Information regarding SNIC is available from: https://www.snic.se/ (accessed 9 October 2019)

6. Information regarding SUNET is available from: https://www.sunet.se/about-sunet/ (accessed 9 October 2019)

Where Data Lives: NeSI, taonga and growing repository services.

Brian Flaherty

NeSI

[email protected]

Data transfer and data sharing have been a part of NeSI's service catalogue for several years, but the service priority has always been in support of compute-intensive research (and related training & consultation). The launching of a national data transfer platform and a new relationship with Genomics Aotearoa have provided NeSI the opportunity to re-evaluate its data service offering. A key project in the Genomics Aotearoa Workplan is bioinformatics capability (Project 1811), which encompasses the development of a national genomics data repository including bespoke processes for Māori management of indigenous data, which is actively populated across all New Zealand genomics research activities. GA's functional requirements for this repository include securely storing, preserving and providing mediated access to genomics data for the longer term. It is also intended that the repository be interactive and usable by a large number of NZ researchers. NeSI's early response has been to focus on implementing the base-level infrastructure requirements for the repository while beginning to investigate platform options and prototype permissions workflows. This presentation will provide an update on progress to date, including storage and access management through Globus, and introduce topics from data publishing and discovery services (from simple metadata to interrogation of genome summaries), to indigenous data governance requirements, and longevity/persistence.

ABOUT THE AUTHOR(S)

- Brian Flaherty

- Brian is Product Manager, Data at NeSI. He has a background in digital libraries,

digital scholarship, research infrastructure & support and discovery services.

Building a national/regional data transfer platform: Globus BoF

Kyle Chard

University of Chicago, Chicago, Illinois, USA

Argonne National Laboratory, Chicago, Illinois, USA

[email protected]

Brian Flaherty

NeSI, Auckland, New Zealand

[email protected]

The Globus research data management service (https://globus.org), operated by the University of Chicago for the global research community, supports the rapid, reliable, and secure transfer and sharing of research data within and among institutions. With more than 140,000 registered users and 10,000 active endpoints worldwide, it is essential science infrastructure for many institutions and in many fields. This BoF will provide an opportunity for attendees to learn about the latest Globus features, hear from operators of research infrastructure and research networks about how Globus is used in research, and explore the options for high-speed trans-Tasman data transfer. The BoF will be organized as brief presentations followed by a facilitated conversation. Kyle Chard will introduce Globus and highlight new features, including access to file sharing, cloud storage, protected data management, and data automation. Brian Flaherty will describe how NeSI, in collaboration with REANNZ and Globus, is building a national data transfer platform. Bring your questions and ideas on how the national platform can be improved or enhanced.

ABOUT THE AUTHOR(S)

- Kyle Chard is a Research Assistant Professor in the Department of Computer Science at the University of Chicago and a researcher at Argonne National Laboratory. He received his Ph.D. in Computer Science from Victoria University of Wellington, New Zealand in 2011. He co-leads the Globus Labs research group which focuses on a broad range of research problems in data-intensive computing and research data management. He currently leads projects related to parallel programming in Python, scientific reproducibility, and elastic and cost-aware use of cloud infrastructure.

- Brian Flaherty is Product Manager, Data at New Zealand eScience Infrastructure. He has a background in digital libraries, scholarly communication, open repositories, data management, research information management and discovery services.

Micro-credentials and Research Skills Development

Jonathan Flutey

Victoria University of Wellington

[email protected]

The New Zealand Qualifications Authority (NZQA) has recently formalised a new micro-credential

policy and piloted a series of small skills-based courses that align to the NZQA

credit framework.

While this is not yet gaining traction in Universities, Tertiary Education Organisations (TEOs)

with a strong skills-based focus are finding new ways of rewarding, and recognising, learning

through this new policy and framework.

The policy’s focus is not only on TEOs. NZQA have formalised a process for non-TEOs

(professional groups, accreditation boards, communities) to benchmark their skills-based

learning programmes for micro-credential equivalency:

https://www.nzqa.govt.nz/providers-partners/approval-accreditation-and-registration/micro-credentials/equivalency/

This birds of a feather session opens up the discussion, and possibilities, of nationally

recognised credentials and development pathways for RSEs and research support staff, with

a particular focus on skills-based and professional practice assessment. Is this something our

communities want? Let’s discuss!

ABOUT THE AUTHOR(S)

- Jonny Flutey

- https://orcid.org/0000-0002-2210-755X

Digital Preservation New Zealand

Andrea Goethals

National Library of New Zealand

[email protected]

Since 2004, the New Zealand government has invested approximately $50 million in the

digital preservation programmes of the National Library of New Zealand (NLNZ) and

Archives New Zealand. These funds have been used to develop the repository infrastructure

and staff expertise needed to operate, manage and support sustainable long-term digital

programmes to care for the nation’s cultural heritage and government records in digital

form.

Recognising that data with long-term value, and therefore in need of digital preservation, is

being produced by many individuals and organisations across NZ, the NLNZ began a project

five years ago to explore a national approach to digital preservation. The idea is that NLNZ’s

digital preservation programme could be expanded to provide a digital preservation service

for other NZ organisations creating digital content with ‘high value’, i.e. that will contribute

to economic, social or cultural outcomes, now or in the future.

The NLNZ’s research has included surveys of targeted populations to understand for NZ the

value of data being created; the policies, strategies, practices and systems in place to

manage and maintain access to it; and the appetite to use a NZ digital preservation service.

CIO/CTOs of NZ state sector organisations were surveyed because of their responsibility to

maintain access to their organisation’s digital material, and eResearchers were surveyed

because they generate digital material of value. The surveys were first conducted in 2015,

and then repeated in 2019 to understand the extent to which digital preservation

practice and needs in NZ had changed or remained the same.

This presentation will share what has been learned through this research, and the

eResearch conference attendees will be invited to provide feedback on a potential national-

level digital preservation service.

ABOUT THE AUTHOR(S)

- Andrea Goethals

- Andrea manages the digital preservation team at the National Library of New

Zealand. She has primary responsibility for the overall day-to-day operations of the

National Digital Heritage Archive and contributes to the strategic direction of the

Library’s digital preservation programme. She champions digital preservation issues

and collaborates closely with others at the Library and around the world to advance

digital preservation standards and practices.

I’m a Big Metal Fan:

Big Data at the Lowest Level

Joseph Guhlin

Genomics Aotearoa @ University of Otago

[email protected]

How can a 1 Gbp genome and fewer than 200 samples produce 10Tb of data? How do we

work with such massive datasets?

Genomics is benefitting from an accelerated increase in data. As we work with more

samples and larger genomes, data increases linearly. Working with machine learning

algorithms, data can increase exponentially. We need to change how we think about

processing data and performing analyses. At the analytical level, researchers should

understand how to reduce problems into the smallest solvable problem set. By attacking

small solvable problems, a large dataset becomes a series of computations that is easily

parallelizable. Map/Reduce is a technique from data science used to address this specific

problem. This technique benefits workflows, high-performance computing, and

programming.
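As a minimal sketch of the Map/Reduce idea, assuming k-mer counting as the small solvable problem (the example task, the function names and the toy reads are illustrative, not taken from the talk): each read is counted independently in the map step, and the partial counts are merged in the reduce step.

```python
from collections import Counter
from functools import reduce


def count_kmers(sequence: str, k: int = 3) -> Counter:
    """Map step: count every k-mer in one independent chunk."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial results into one."""
    return left + right


# Toy reads standing in for a large dataset split into chunks.
reads = ["ACGTAC", "GTACGT", "ACGACG"]

partial_counts = map(count_kmers, reads)
total = reduce(merge_counts, partial_counts, Counter())
print(total["ACG"])
```

Because `count_kmers` touches only its own chunk, the built-in `map` could be swapped for a parallel one such as `multiprocessing.Pool.map` without changing the reduce step.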

Other problems can arise from large datasets. Common bioinformatics software does not

scale to large genomes. Throwing hardware at the problem is the most common solution,

but there are alternatives such as memory mapping files.
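Memory mapping can be illustrated with Python's standard `mmap` module. The sketch below is an assumption about how one might apply it, not the speaker's code: it scans a file for a byte through a memory map, so the operating system pages data in on demand instead of the program loading the whole file. The file contents and the `count_byte` helper are hypothetical.

```python
import mmap
import os
import tempfile


def count_byte(path: str, needle: bytes) -> int:
    """Count occurrences of one byte via a read-only memory map."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            position = view.find(needle)
            while position != -1:
                count += 1
                position = view.find(needle, position + 1)
            return count


# Small stand-in for a large sequence file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".fa") as tmp:
    tmp.write(b">seq1\nACGTGGCA\n>seq2\nGGGTTT\n")
    demo_path = tmp.name

print(count_byte(demo_path, b"G"))
os.remove(demo_path)
```

The same function works unchanged on a file far larger than available RAM, which is the point of the technique. Note that `mmap` cannot map an empty file.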

Finally, processors offer single instruction, multiple data (SIMD) instructions, exposed through intrinsics. These

allow running a single computation over multiple data points simultaneously. Experience in

a systems programming language is not a pre-requisite for this. Both Python and R have

tools to work with SIMD and GPU instruction sets.

In this lightning talk, I plan to share my story of working with large datasets, how I try to

address problems, and my failings and successes. Working with datasets with a size

unthinkable a decade ago requires a shift in thinking, both at the analysis level and at

the level of those writing the tools and libraries.

ABOUT THE AUTHOR(S)

Joseph Guhlin, PhD in Plant and Microbial Sciences. Has been working with Unix (FreeBSD,

originally) and Perl since age 12. Has expanded programming skills in Clojure, a Lisp dialect

that runs on the JVM, and Rust, a systems-level programming language gaining traction as

an alternative to C++. Interests include programming, genomics, big data sets, and machine

learning applications.

Training: It’s better together

Megan Guidry

NeSI

[email protected]

New Zealand eScience Infrastructure (NeSI) provides expertise and capability to researchers

conducting computation and data intensive research in New Zealand. Within the training

sector, our core purpose is to raise the computational capability of New Zealand research

and, in turn, shrink the existing eResearch skills gap. To do this, however, we rely heavily on

healthy partnerships with various organizations throughout the country.

In this presentation, we will discuss our training efforts so far (both delivering

training and cultivating the New Zealand training community) and reflect on the scale

of the opportunity/challenge that we face. Ideally, this talk will be a conversation starter on

how we, as a community of busy and passionate eResearch enthusiasts, can continue to

improve processes and share knowledge more freely.

Ultimately, training needs to be useful and relevant to those who need it. NeSI strives to be

agile in its approach to training delivery and this presentation will conclude by noting what

we are doing today to ensure our efforts are increasingly measurable, scalable, and

community-focused.

ABOUT THE AUTHOR(S)

Megan Guidry is the training coordinator for New Zealand eScience Infrastructure (NeSI) and

is also the Regional Coordinator for the Carpentries in New Zealand. Her main priority is

raising the eResearch capability in New Zealand through training delivery and community

building.

Hybrid Training: a scalable model for delivering

hands-on training to dispersed learners

Christina Hall1, Andrew Lonie1, Jeff Christiansen1

(1) Australian BioCommons, Australia

[email protected]

Australia has a diverse array of government agencies, universities and research institutes

undertaking bioscience research. Biologists and bioinformaticians are distributed widely

throughout the country, and are sometimes isolated in small research groups or remote

locations. A novel training delivery methodology was developed to meet the urgent need

for bioinformatics skills in a cost- and time-efficient way. The method, which combines an

expert Lead Trainer delivering a presentation online in conjunction with a hands-on

interactive practical session at multiple venues supported by trained local Facilitators, is

ideally suited to the delivery of simultaneous training workshops around Australia. Referred

to as the ‘hybrid training model’, the scalable method combines the advantages of webinar

presentations with some valuable components of in-person group training.

Australian BioCommons’ hybrid training events regularly cater for more than 100

participants at up to 9 venues. Each participant brings their own laptop to a venue hosted

by one or more local Facilitators who are responsible for the local logistics including room

bookings and Wi-Fi connections. Critically, Facilitators are themselves trained in the

workshop materials ahead of time. Presentations from the Lead Trainer are viewed on a

communal screen, with each participant simultaneously completing guided hands-on

activities. Live camera feeds from each venue help participants to feel they are part of a

larger community, and allow the Lead Trainer to observe room dynamics in real time. An

online shared ‘Discussion Board’ is active during the session, available for participants to

interact across venues, asking questions about their own specific challenges or interests.

Peers and experts alike join the discussion and answer technical questions. The 3-4 hour

events are structured to enable successful completion of exercises and to ensure nobody is

left behind or rushed through tasks. Recordings of the training events, presentations,

tutorials and Discussion Boards are made available for perpetual reference after the event

has concluded.

Engagement with skilled Facilitators is key to the success of each training event, and the

availability of training at particular locations is dependent on identifying a willing volunteer.

Group size is also determined by Facilitator availability, with an approximate ratio of 1:10

Facilitators to participants strongly encouraged. Experienced and active trainers and

researchers themselves, the Facilitators have been a valuable source of feedback on the

development of the model, as well as being integral local event organisers and workshop

helpers. They have been enthusiastic in their support of the hybrid training model and its

ability to supplement their own local training programs. The delivery method allows

regional universities with only a few participants to have direct access to training expertise

on the same footing as larger universities.

Online evaluation surveys show that close to 100% of all participants think ‘this was a useful

workshop that enhanced my knowledge and skills’, and that ‘the format of the exercises and

activities enhanced participants’ learning and increased their level of skills’.

The hybrid method of training delivery provides an efficient way to reach many venues

simultaneously, and is easily extensible to new sites. The events are particularly valued by

regional locations that may not otherwise have access to the depth and breadth of expertise

offered by national events. This methodology fosters the development of a community of

people interested in bioinformatics training and can help to elevate the profile of local

Facilitators and domain experts who participate. The recording of each event’s

presentations, cameras and links to materials allows for continued use by participants who

trained on the familiar environment of their own laptop. By posting these resources online,

the content is also suitable for self-guided use by the public.

The hybrid training methodology is an important feature of the Australian BioCommons

training program. Its ability to efficiently enable training of dispersed learners is compelling.

The potential to extend this format to incorporate a larger multi-national audience with

shared geographic challenges is currently being investigated.

ABOUT THE AUTHOR(S)

- Name: Dr Christina Hall

- Bio: Christina is the Training and Communications Manager of the Australian

BioCommons. In developing and implementing a national program of bioinformatics

training events and resources, Christina builds on similar previous roles for

Melbourne Bioinformatics and the EMBL Australia Bioinformatics Resource. Her

research career in plant pathology was interspersed with several science

communications roles, including museum public program management. Christina’s

professional motivation is to enhance scientific progress through supporting

Australian biologists to do their best science.

Singularity containers on HPC

W. Hayek1, B. Bethwaite2, B. Roberts3

NeSI/NIWA1, NeSI/University of Auckland2, NeSI/Manaaki Whenua - Landcare Research3

[email protected]

Containerisation is a form of virtualisation that has become very popular in the world of IT services and cloud computing, offering straightforward portability and deployment of software and services without having to install a complex set of dependencies. It has recently become available on High Performance Computing (HPC) systems through the popular Singularity software package. Singularity is a containerisation tool that is particularly suitable for HPC and scientific applications, featuring immutable software to support reproducibility of scientific results, as well as integration with HPC file systems, MPI, and more. This talk will outline the basic concepts of containerisation and discuss a recent NeSI consultancy project where a web server and database were containerised to process data on the Mahuika HPC. The project is now easily portable and can scale out to many cores, enabling very significant speed-ups.

ABOUT THE AUTHOR(S)

- Wolfgang Hayek is a research software engineer at NeSI and NIWA, and group

manager of NIWA’s scientific programming group, with many years of experience in

scientific computing and HPC.

- Blair Bethwaite is solutions manager at NeSI; he has strong expertise in HPC, cloud

computing, cloud architectures, and scientific computing.

- Ben Roberts is an application support specialist at NeSI and has many years of

experience in scientific computing and HPC.

Data… so what’s the problem?

Rosie Hicks, CEO Australian Research Data Commons

eResearch Australasia 2019

Digital data, tools and methods are changing everything, including the way research occurs and the societal challenges that we can address. All areas of research are becoming ever more dependent on data and eResearch. eResearch is now simply research. We care about data to enable discovery, to speed up research, to generate new knowledge that will make a difference. Ultimately, we undertake research to have an impact on people’s lives. A discussion of the challenges and opportunities now includes terms such as sensitive data, access to government data, open data, FAIR data, and trusted data, along with buzzwords such as Artificial Intelligence, Machine Learning and Cloud. Our first problem is a lack of shared understanding of these concepts. What does ‘sensitive data’ mean for health research or for defence technologies? This talk will examine a number of these terms to identify specific challenges that are

preventing us from achieving the full impact of our research.

Strudel2: Increasing accessibility of HPC Infrastructure

Chris Hines

The Monash eResearch Centre supports a large number of communities with highly varied

computing skills. As such, one of our flagship offerings for a number of years has been simple,

easy access to desktops (i.e. VNC servers) running on the same HPC hardware that supports

our large users. This facilitates growing our users’ computing skills from simple visualisation

tasks to large batch processing tasks over the course of their research career.

Strudel2 is the next generation of tool supporting access. By carefully melding various

standard technologies we've produced a framework capable of supporting not just our

traditional desktop embedded within the HPC environment but also many other tools,

for example Jupyter Notebooks and RStudio Server. We've also blended in federated single

sign-on authentication and, for Federation members that support it, multifactor

authentication.

ABOUT THE AUTHOR(S)

Chris has been kicking around the Australian eResearch sector for more years than he cares

to admit to anyone, let alone himself. Chris exhibits the typical arrogance of most people

with a background in physics, assuming all your problems can be solved easily if you simply

approximated your cows as spheres to simplify the maths. An itinerant sys-admin,

programmer and HPC consultant, he uses his skills wherever the needs of Monash University

research require.

The Undies-mate Un-debate

Chris Hines and Ai-Lin Soo

Monash eResearch Centre

[email protected] [email protected]

Machine learning is far from error free, but if it's good enough to improve

outcomes, perhaps we should deploy anyway? The ethical implications of how and

when we use technology, and specifically approaches such as machine learning and

data mining where the inherent biases of the system are not always obvious, are

something we all should engage with. We'll take topics from the audience, vote on

the most interesting and kick off the verbal equivalent of a WWE wrestling match.

Attendees should be prepared to "step into the ring" and voice their own

opinions. This is supposed to be a fun look at a serious topic.

ABOUT THE AUTHOR(S)

Chris Hines: Chris has been kicking around the Australian eResearch sector for more years than he cares to admit to anyone, let alone himself. Chris exhibits the typical arrogance of most people with a background in physics, assuming all your problems can be solved easily if you simply approximated your cows as spheres to simplify the maths. An itinerant sys-admin, programmer and HPC consultant, he uses his skills wherever the needs of Monash University research require.

Ai-Lin Soo: With a background in Commerce and Biomedical Science, eResearch was not on my radar until I stumbled across the Monash eResearch Centre. With a year under the eResearch belt (and hopefully many more to come), I'm enjoying my time as Project Officer and still

68

Page 69: eResearch NZ 2020 · Journey with Jupyter Cheng-Hao Cai - Building Machine Learning Systems on microsoft Azure Cloud Machines. 12:00. Matt Plummer - Running Rāpoi: Rebooting Research

contemplating what the 'e' in eResearch stands for. I have an interest in improving health outcomes for society and when not contemplating all of the above, I spend my time watching TV crime dramas.


Support for Research Data Management in university libraries – How far have we come?

Jess Howie – Research Support Librarian

University of Waikato

[email protected]

Advances in computing and technology have triggered a tidal wave of data on a global scale.

In this rapidly changing environment, data has become an output in its own right and steps

need to be taken in order to ensure that data is appropriately managed, stored and

preserved. Ideal Research Data Management practices help to realise the potential of data

to enrich knowledge through re-use and re-analysis, as well as provide mechanisms for

validation and enhance reproducibility. Librarians are well-placed to support researchers to

manage their data optimally. Not only are they well-versed in metadata and findability, they

also have an important role to play as advocates and balancing voices in a discussion which

is politically, ethically and culturally charged.

This lightning talk will summarise research which explored the development of research

support services in New Zealand University Libraries via survey responses from all eight

Universities. The survey questions for the Research Data Management section were

repeated from a multi-country survey carried out in 2012. The sharing of data from the original

survey enabled some longitudinal analysis over this time period. Respondents were asked to

provide details on the level of services offered, partnerships with other units, job titles, staff

time, barriers to service development and skills gaps.

Among the findings was a strong indication of growth in services, alongside a reduction in

perceived barriers and an increase in staff capacity. The results point to a strong future for

Research Data Management support in libraries but also provide some warning as to areas

that require more development and a greater level of collaboration.

ABOUT THE AUTHOR(S)

- Name - Jess Howie

- Bio - I am Researcher Support Librarian at University of Waikato. My areas of interest

include Research Data Management, scholarly communication, Open Access and

research impact.


User journey-driven product management

Jun Huh

NeSI

[email protected]

NeSI was facing challenges around user onboarding. We built a journey map for NeSI researchers to gain a better understanding of the extent of the problem and to focus on where the biggest issue was. As an organisation, we are striving to be more metric driven, and we use this user journey as a reference for team members to see things from researchers’ perspective.

Jun will share the process NeSI went through, along with the user journey and the service blueprint that maps the journey to internal processes, and how looking at the numbers in the context of the user journey helped us identify problem areas.

The process led us to achieve improvements in the account setup process, and has given us a useful reference point to understand what to focus on next.

ABOUT THE AUTHOR(S)

- Jun Huh, Innovation and Growth at NeSI

- Jun comes from a start-up background with focus around providing genuine value to

the users and steering organisations to be more user driven.


Learning How To Learn

Jun Huh

NeSI

[email protected]

This lightning talk presents some learning concepts that could be useful for researchers

wanting to learn a new skill or a new tool, and trainers who want to create effective

training programmes.

Jun will explain some learning-related concepts, including but not limited to:

• The mastery curve

• Chunking

• Categorising what to understand vs memorise vs practice

ABOUT THE AUTHOR(S)

- Jun Huh, Innovation and Growth at NeSI

- Jun comes from a start-up background with focus around providing genuine value to

the users and steering organisations to be more user driven.


Frameworks for growth and innovation

Jun Huh

NeSI

[email protected]

What does it mean to grow as an organisation or a community? How do we innovate?

As an organisation we have to be adaptive to changes and new requirements. We have

touched upon many different frameworks to frame our thoughts and processes. We wish to

share some of these frameworks and talk about how they have helped the decision-making

processes in NeSI.

• Cynefin: How different levels of complexity in a problem push you toward

different approaches.

• Wardley map: Mapping different maturity levels of products and their directions.

Roles of pioneers vs town planners.

• User journey / service blueprint: Helping us see things from the users' perspectives.

• Market segmentation / user personas: Building archetypes to help us communicate

using the right voice.

• Growth strategies: Managed growth vs growth hacking.

• Understanding user needs: What users say, what users do, and what users should

do.

ABOUT THE AUTHOR(S)

- Jun Huh, Innovation and Growth at NeSI

- Jun comes from a start-up background with focus around providing genuine value to

the users and steering organisations to be more user driven.


Building an International FAIR infrastructure for Research Data

Guido Aben, AARNet, Australia [email protected]

Kuba Moscicki, CERN, Switzerland [email protected]

Carina Kemp, AARNet, Australia [email protected]

Over the past ~6 years, a budding community of NREN and discipline operators of synch&share stores has popped up. These operators typically run one of [ownCloud, seafile, NextCloud, PowerFolder]. Judging by site surveys presented at consecutive synch&share-focused CS3¹ conferences, their services have all become runaway successes – it’s not unusual for these stores to be in the PB range and to serve tens of thousands of real researchers and their real research data. The next wave of open science policy, however, tells us that data shouldn’t be locked inside a single vault – instead it needs to be interlinked, citable, free to move; in short, FAIR. The CS3 community have always been working towards enabling interlinking of the data between stores at the identity and metadata levels. An open protocol was developed to announce, accept and propagate shared volumes from one installation to another. This protocol is called OpenCloudMesh² and is by now supported by most synch&share software vendors. So, we have the installed base, the incentive to interlink, and the technology to interlink. We just haven’t taken actual linking beyond proof of concept yet; not in an operational, sustainable way in any case.

A proposal: interlink synch&share stores into a new pan-European data eInfrastructure

We were informed in late 2018 that the EC had put out a call for the development of innovative science cloud eInfrastructures, called InfraEOSC-02³.

1 http://cs3.infn.it/info.html#

2 https://www.geant.org/Services/Storage_and_clouds/Pages/OpenCloudMesh.aspx

3 http://ec.europa.eu/research/participants/data/ref/h2020/wp/2018-2020/main/h2020-wp1820-infrastructures_en.pdf

This call matched surprisingly well with our intents. A few guidelines from the call may illustrate this. Imagine you have interlinked sets of synch&share nodes, and that data can be freely requested and mounted between them. Now think how well you’re placed to answer these challenges from the call:

==========================

Highlights

• innovative models of collaboration that genuinely include incentive mechanisms for a user oriented open science approach

• develop innovative services that address relevant aspects of the research data cycle (from inception to publication, curation, preservation and reuse),

• allowing implementation of new scientific data-related developments and intelligent linking and discovering of all research artefacts

• foster interdisciplinary research, serving a wider remit of research needs, as well as new users like industry and the public sector.

==========================

A consortium has now been formed to deliver this project and is made up of ~10

eInfrastructure providers (NRENs, landmark instruments etc.), most in Europe, but AARNet is

also a contributor through their CloudStor Services. The project shall be delivered not from a

blank slate, but rather building on an existing set of services already operated and in

widespread use among end users at the participant sites. This proposal does not focus on

development of software for a new infrastructure; rather, it is about systems integrating

existing components to deliver added value to the existing and active participants of the CS3

and GEANT communities.

The resultant eInfrastructure will be established by interfederating existing stores into a fabric

of “federated sites” based on federation mechanisms, operational routines and trust.

Federative best practices learned from EduGAIN and EduRoam will be adopted and applied.

This presentation will present the building blocks for the project, the conditions to consider

to make this a success and the proposed milestones and invite additional international

collaborators.


A Common Thread: Creating community, working together and enriching research

Sara King, AARNet, [email protected]

Natasha Simons, ARDC, [email protected]

Paula Andrea Martinez, National Imaging Facility (NIF), The University of Queensland (UQ)

[email protected]

Carina Kemp, AARNet, [email protected]

This Birds of a Feather session is for colleagues from the eResearch training sector to share experiences and knowledge in community building practices.

It will be an interactive session, starting with a panel (20 min) of eResearch professionals speaking about their roles in creating communities – describing the barriers as well as the successes – and future plans, opportunities, and maybe even a few ‘dream big’ moments!

Using the ‘Open Space’ rule allowing participants to move between groups, in the second part (30 min) of the session participants will select from proposed topics for smaller group discussions to take a direct approach to building community.

The goal of this BoF is to discuss and share ideas on how to nourish and grow a Community of Practice and to (not-so-sneakily) actively create or continue to evolve collegial relationships both within New Zealand and internationally.

This will be an occasion for participants from a broad range of areas, such as librarians, ITS staff, data stewards and others from the eResearch community, to connect and contribute, foster new collaborations and create opportunities to develop personally and professionally.

The session will have a short (10 min) debrief, and all discussions will be documented for future action and sharing. Anyone interested in staying in contact is welcome to join the online ENRICH Community of Practice, currently hosted on Slack.

Organisations on the potential panellist list:

AARNet

ARDC


New Zealand eScience Infrastructure (NeSI)

University of Queensland

References

Community of Practice, available from https://en.wikipedia.org/wiki/Community_of_practice, accessed 21 November 2019.

Open Space Technology, available from https://openspaceworld.org/wp2/what-is/, accessed 21 November 2019.

ENRICH Community of Practice: https://enrichcop.slack.com

ABOUT THE AUTHOR(S)

- Dr Sara King is an eResearch Analyst with Australia’s academic and research network provider, AARNet. She has extensive experience in researcher engagement and training, with expertise in research data and technologies in the Humanities and Social Science (HASS) research areas.

- Natasha Simons is Associate Director, Skilled Workforce, for the Australian Research Data Commons (formerly ANDS, RDS and Nectar). With a background in libraries, IT and eResearch, Natasha has a history of developing policy, technical infrastructure (with a focus on persistent identifiers) and skills to support research.

- Dr Carina Kemp is the Director of eResearch for AARNet responsible for making the network work the best it can for research in Australia. She works with the Australian and international research community to find and implement tools that sit above the network to make technology and data research ready. Previous to joining AARNet, Dr Kemp was the Chief Information Officer at Geoscience Australia.

- Dr Paula Andrea Martinez is the National Training Coordinator for the Characterisation Community in Australia. Her interests are on research methods development and now outreach and advocacy in data and software best practices.


Data-intensive approaches to finding and predicting research outcomes for New Zealand health research

Stephanie Guichard1, Stacy Konkiel2

Digital Science [email protected], [email protected]

How can we use data science to measure research outcomes at scale? Can quantitative data be used at all to understand research’s “impact” in its truest sense? In this presentation, we will share how—by asking the right questions, using the right data, and understanding data science’s strengths and limitations—New Zealanders can measure their success towards achieving health research outcomes, and even forecast future success.

Using the New Zealand Health Ministry’s “New Zealand Health Research Strategy 2017-2027” report as a case study, we will first show how thoughtful strategic planning makes it possible for data scientists to answer pressing questions like, “How can we track the implementation of research into health policy?” and “How can we produce the best research that supports the well-being of all New Zealanders?”

Next, we will discuss how linked bibliometric and altmetric data sources can help analysts better understand if and how New Zealand health research has achieved strategic priorities. Using unique data from Dimensions Analytics, a linked research intelligence database, and Altmetric Explorer, which provides data for understanding the broader impacts of research, we will use large scale visualization and statistical approaches to understand the current state of New Zealand health research with regard to desired outcomes; predict future trends based on past funding and publishing activity; and offer suggestions for ways to improve the likelihood of achieving desired research outcomes in the future. Among the outcomes studied will be international and industry collaboration trends, the translation of research into innovation and public policy, and public engagement with health research.

Finally, we will offer a frank discussion on the benefits and limitations of quantitative data in measuring desired outcomes like community collaborations and whether research is improving health outcomes for Māori and disabled peoples. In some cases, leading engagement indicators like altmetrics can be used as a rough proxy for success, and are complementary to traditional program evaluation approaches. We will explore several instances where altmetric and bibliometric data succeed and where they fail.

References

Ministry of Business, Innovation and Employment and Ministry of Health. 2017. New Zealand Health Research Strategy 2017-2027. Wellington: Ministry of Business, Innovation and Employment and Ministry of Health. Retrieved from: https://www.health.govt.nz/publication/new-zealand-health-research-strategy-2017-2027

ABOUT THE AUTHOR(S)

Stephanie Guichard is a Regional Sales Manager for Digital Science in the Asia-Pacific region.

Previously, Stephanie worked with teams at Nature Publishing Group and Palgrave

Macmillan (Macmillan Science & Education). Stephanie graduated with a triple major in

Literature, Philosophy and History, specializing in medieval French literature and history. A

native of New York and Singapore, Stephanie now lives in Melbourne.

Stacy Konkiel is the Director of Research Relations at Digital Science. Stacy’s research interests include incentives systems in academia and informetrics, and she has written and presented widely about altmetrics, Open Science, and library services. Stacy was a co-founder of the HuMetricsHSS initiative and is a Metrics Toolkit co-founder and Editorial Board member. Previously, Stacy worked with teams at Impactstory, Indiana University & PLOS. You can learn more about Stacy at stacykonkiel.org.


Cloud-native technologies in eResearch - benefits and challenges

Marko Laban

NeSI

[email protected]

The commercial world of public cloud has been rapidly evolving in the last 10 years, democratizing access to software engineering tools and technologies that provide an easy way for small teams to design, build and operate large distributed fault-tolerant applications at Google/Amazon scale.

“Cloud native” is a multi-disciplinary approach/methodology that applies selected architectural patterns, software development processes and freely available open source libraries/frameworks to build distributed software applications designed to fully utilize the advantages of the modern cloud-computing model.

In this paper, we aim to make a case for wider adoption of cloud native technologies in eResearch and discuss challenges on that path.

- Method: Comparative analysis, applying a mature methodology from commercial space in a new space (eResearch)

- Conclusion: “Researchers and the eResearch sector as a whole could reap significant benefits from adopting Cloud-native more widely”

- References:

o Cloud-native applications https://ieeexplore.ieee.org/document/8125550

o Understanding Cloud-native applications after 10 years of cloud computing https://www.researchgate.net/publication/312045183_Understanding_Cloud-native_Applications_after_10_Years_of_Cloud_Computing_-_A_Systematic_Mapping_Study

o Understanding the e-Research ecosystem in New Zealand: https://www.nesi.org.nz/sites/default/files/media/eScienceFuturesWorkshop-ReflectionReport.pdf

o Cloud-native infrastructure https://www.oreilly.com/library/view/cloud-native-infrastructure/9781491984291/ch01.html

ABOUT THE AUTHOR(S)

- Marko Laban

- Bio: Marko has more than 20 years of technical and product management experience in various software industry areas including Cloud, Enterprise Software, 3D/Manufacturing, Bioinformatics and start-ups. In the past he was involved with companies of different sizes from large ones like SAP and Cisco to early stage start-ups. Today, Marko is working as a Software Product Engineering Lead at NeSI.


Scientific supercomputing: Teaching practical skills for credit

Joseph Lane

University of Waikato

[email protected]

While theoretical modelling and simulation are increasingly used in research, postgraduate

students are typically expected to learn these skills outside of the formal credit-bearing

papers that make up their degrees.

The University of Waikato recently undertook a complete redesign of its postgraduate Science

qualifications, including the redevelopment of all of the underlying papers. As part of the

review process, focus groups were held with both current and former postgraduate students.

One of the key themes that emerged through these focus groups was a desire to “learn

through doing”, with more focus on skills development rather than fact recollection. In

response to that feedback, a new postgraduate paper, SCIEN511 – Scientific Supercomputing,

was established, which provides a practical introduction to undertaking scientific research on

a supercomputer. The paper assumes no prior computational experience and is intended for

science students from a broad range of disciplines.

In this presentation, I will outline my experience in developing and teaching SCIEN511 –

Scientific Supercomputing, reflecting on the successes and challenges of running the paper

for the first time. A close collaboration with the NeSI team ensured a great outcome for the

enrolled students.

ABOUT THE AUTHOR(S)

Associate Professor Jo Lane is a computational chemist at the University of Waikato and is

currently Deputy Dean for the School of Science. Jo obtained his BSc(Hons) and PhD from the

University of Otago.


Data analytic transformation journey with Jupyter

Nancy Lin

NeSI

[email protected]

This short presentation shares key takeaways from the shift of NeSI's internal data analytics from off-the-shelf software to open source. It demonstrates the journey of building a business reporting system with Python and JupyterLab, and what Python enables us to do in the future.

ABOUT THE AUTHOR(S)

- Nancy Lin

- NeSI Data Analyst


BoF: Next Steps for the Research Software Engineering Community in New Zealand

Nooriyah P. Lohani

New Zealand eScience Infrastructure

[email protected]

The term Research Software Engineer (RSE), originally coined by the UK RSE association (https://rse.ac.uk), describes a growing number of people in academia who combine expertise in programming with an intricate understanding of research. Although this combination of skills is extremely valuable, these people lack a formal place in the traditional academic system. Inspired by the success of the RSE Association in the UK, we continue to work towards establishing an Australasian Chapter of the RSE Association (https://rse-aunz.github.io/). Together with international bodies and support from national organisations such as AeRO, NeSI, CAUDIT, the Australian Research Data Commons (ARDC) and other research institutions, we aim to campaign for the recognition and adoption of the RSE role within the research ecosystem. Appropriate recognition, reward and career opportunities for RSEs are also needed. We would like to discuss the shortcomings and what worked in the year of events to allow RSEs to meet, exchange knowledge and collaborate. This BoF will cover the community building activities that have occurred, identify future plans and activities for the coming year in New Zealand and Australia, and discuss the next steps that were identified at the pre-conference workshop. If you are an academic or researcher who codes; a professional software engineer working in the research space; a systems administrator who maintains research systems; or someone who is passionate about quality research software, please join us at this event to actively work on how we can grow this community and advocate for others. Together, we can build a thriving community that benefits research software engineers, and ultimately contributes to a more efficient and sustainable research ecosystem.

ABOUT THE AUTHOR(S)

- Nooriyah Lohani

- I am a Bioinformatician by training and after working for a few years in a commercial

and academic realm, I am now a research communities advisor at NeSI passionate

about understanding research needs in the eScience sector. I am also Co-chair of the

RSE Australia New Zealand steering committee.


Progress in the Australian BioCommons

Andrew Lonie, University of Melbourne, Melbourne, Australia, [email protected]

As for many other research disciplines, rapid advances in digital technologies and methods

are proving transformational in life sciences. Internationally, major life sciences

infrastructure initiatives are increasingly defining global scale data infrastructures; in

particular, the US-based National Institutes of Health through their Data Commons program,

and the EU-based ELIXIR program and EBI, are building data infrastructures that are, in

many ways, equivalents of the global data-focussed infrastructures driving advances in

astronomy and physics - infrastructures like Hubble, LIGO, and the LHC. And, like astronomy

and physics, it is clear that world-class life sciences research in Australia and New Zealand

will increasingly depend on digital methods and data resources, and communities, that are

globally sourced and supported. Therefore, sponsored by Bioplatforms Australia, we have

developed a research infrastructure program called the Australian BioCommons that

strongly engages the research community, international infrastructure initiatives, and

national digital resource providers, recognising that Australia must understand, participate

in and contribute to global life science-enabling endeavours as a first class partner, and

presenting this as a clear vision of implementable requirements to national providers.

Method

As part of the challenge of building towards the BioCommons, a pathfinder project was established late in 2018. BioPlatforms Australia, the Australian Research Data Commons and AARNet committed $2.5M to the project, which was then extended by Pawsey and NCI donating time and facility resources. The pathfinder project demonstrates a concerted national infrastructure effort to better characterise future life science solutions.

Results

The pathfinder project required the identification of unknowns confronting the planning and development of the BioCommons proper. The five that we identified relate to key long term challenges:

• Human Genomes: the fundamental requirement for sensitive human data access and sharing

• Interoperability with global data: using paediatric cancer genomics as an exemplar

• Non-model Genome Assembly & Annotation: using genomics programs (aimed at characterising Australian flora and fauna) as exemplars

• Improvements to researchers’ ability to analyse their data: using phylogenetics as an exemplar, and addressing the technical challenges of integrating instruments and AARNet’s CloudStor as a data mover with the Galaxy research workflow system

• A Pathfinder Cloud: investigating approaches to compute and storage access that align with life science research practice and to understand the scale required of various classes of compute and storage infrastructure

Conclusion

Experience with the Implementation Studies confirms several design challenges that life science infrastructures confront, including:

• Data growth rates exceed constant cost technology performance growth rates

• The compute demand is different to established HPC practice and culture

• Australian data will continue to exist, but must be interpreted in a global context where data is increasingly too big to move/copy.

• Infrastructure compatibility will be vital as more of the analysis and methodology software will be imported or globally developed

• A ‘cloud native’ paradigm is dominating peer life science infrastructure investment

Acknowledgments

We acknowledge BioPlatforms Australia as the primary sponsor of the Australian BioCommons, and numerous partners including the Australian Research Data Commons, AARNet, NCI and Pawsey supercomputing centres, and the Australian Access Federation. Bioplatforms Australia is an Australian National Collaborative Research Infrastructure Scheme program.

ABOUT THE AUTHOR(S)

A/Prof Andrew Lonie

Director, Australian BioCommons, University of Melbourne

Andrew Lonie is Director of the Australian BioCommons (http://biocommons.org.au), Director of the EMBL Australia Bioinformatics Resource (EMBL-ABR: http://embl-abr.org.au), and an associate professor at the Faculty of Medicine, Dentistry and Health Sciences at the University of Melbourne, where he established and then coordinated the MSc (Bioinformatics) for many years. Andrew directs a group of bioinformaticians and e-research specialists within the BioCommons to collaborate with and support life sciences researchers in a variety of research infrastructure projects across Australia.


Climate Data and Computing are Hotting Up!

Shona Mackie, Annika Seppälä, Inga J Smith

University of Otago

[email protected], [email protected], [email protected]

Understanding our climate and how it will change in future is a topic of increasingly urgent

research, with heightened levels of public and political support worldwide. Climate models,

however, are necessarily big. In theory, they represent all physical processes from the top of

the atmosphere to the bottom of the ocean, over land, water and sea ice, in 3-dimensional

grid cells with a resolution of 1 degree or finer. The interaction and evolution of these

processes is modelled with a temporal resolution of more than a single timestep per hour,

and typically we need to run for at least 100 years. Furthermore, uncertainties and internal

variability in the climate system mean that we run an ensemble rather than a single model

run. The structure of climate models means they can usually be parallelized to a point, but

they are not generally suitable for the distributed computing solutions that can be

implemented in other fields. As well as being expensive to run, climate models produce a lot

of data (PB scale). The idea is to capture the state of the whole world in a 3-dimensional

mesh with a temporal resolution fine enough to see how it changes, and a spatial resolution

fine enough to examine any physical process anywhere on Earth that might impact on

climate. For example, one model component (atmosphere, ocean etc) can be made of 1.2

million grid points. Saving just one parameter daily for 100 years = 44 billion data points. 30

parameters from 6 ensemble simulations amounts to 8 trillion data points, just from one

model component. These data have to be accessible so that we can do processing and

monitoring of climate model runs while they are underway, and need to be securely

archived in a way that makes them accessible for long term use, and shareable with

collaborators both present and future, here in Aotearoa and abroad. Running a climate

model is just the beginning of climate research; analysis of the data requires tools capable of

accessing and handling large data volumes that are generally stored on remote servers,

sometimes overseas, at a speed that makes interrogation and analysis practical.
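
The data-volume figures quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the grid size, output frequency, parameter count and ensemble size come from the abstract, while the 365-day year and the 4 bytes per value (uncompressed 32-bit floats) are assumptions added here to turn counts into an approximate storage size.

```python
# Back-of-envelope data volume for one climate model component.
GRID_POINTS = 1_200_000   # grid points in one component (from the abstract)
DAYS_PER_YEAR = 365       # assumption: leap days ignored
YEARS = 100
PARAMETERS = 30
ENSEMBLE_MEMBERS = 6
BYTES_PER_VALUE = 4       # assumption: uncompressed 32-bit floats


def data_points(parameters: int = 1, members: int = 1) -> int:
    """Number of values written when saving daily output for YEARS years."""
    return GRID_POINTS * DAYS_PER_YEAR * YEARS * parameters * members


one_parameter = data_points()
full_ensemble = data_points(PARAMETERS, ENSEMBLE_MEMBERS)
terabytes = full_ensemble * BYTES_PER_VALUE / 1e12

print(f"one parameter: {one_parameter / 1e9:.1f} billion values")
print(f"30 parameters x 6 members: {full_ensemble / 1e12:.1f} trillion values")
print(f"roughly {terabytes:.1f} TB at {BYTES_PER_VALUE} bytes per value")
```

Under these assumptions a single component already needs tens of terabytes before any higher-frequency output or additional components are considered.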

Climate modelling is one of the most computationally hungry fields of research and New

Zealand has recently joined the list of the relatively few countries with the resources and

infrastructure to do it. Growing this field of research in New Zealand will need development

of resources and expertise to manage those resources. Events like eResearch 2020 are an


important way for information to be shared with network architects and data managers to

ensure that the systems and infrastructure are in place to support the next generation of

climate researchers.

ABOUT THE AUTHOR(S)

- Shona Mackie

- Shona Mackie is a climate modeller at the University of Otago, developing the New

Zealand Earth System Model to include new physics processes, and carrying out

sensitivity studies using the current version of the model to better understand

uncertainties inherent in our climate projections.

- Annika Seppälä

- Annika Seppälä is a senior lecturer at Otago University Physics department. Her

research uses computational simulations together with large space based Earth

observation datasets to investigate solar influence on the atmosphere and climate

from global to regional scales.

- Inga Smith

- Inga Smith is a senior lecturer in the Department of Physics, University of Otago. Her

research interests are in sea ice physics and climate change, particularly the

influence of fresh water on sea ice formation.


Advancing New Zealand’s computational research capabilities and skills

Nick Jones

New Zealand eScience Infrastructure

[email protected]

Supporting research today and tomorrow requires an inclusive partnership with New Zealand researchers, communities, and Te Ao Māori, underpinned by a specialised and powerful technology ecosystem. Over the last year, New Zealand eScience Infrastructure (NeSI) has advanced New Zealand’s computational research capabilities and skills through training programmes, consultancy in research software engineering, strategic partnerships, and recruitment of NeSI team members with stronger knowledge of and connection into research communities. This session will look at how those activities, combined with recent enhancements to NeSI’s existing platforms and services, are enabling New Zealand’s science sector to compete and excel globally.

ABOUT THE AUTHOR

Nick Jones is NeSI’s founding Director, having established and led NeSI alongside a team of colleagues and peers since inception in mid-2011. Nick is responsible for NeSI’s strategic directions and performance overall, bringing together a talented and diverse array of people, and their institutions and interests. Nick has over 20 years’ experience in innovating in advanced information/computing technology in sectors including education, science and research. Nick established the eResearch NZ conference series in 2010 to support the sector coming together in the spirit of community to share experiences and explore directions in an area so critical to our future prosperity as a nation.


Growing the eResearch workforce in an inclusive way

Jana Makar 1 , Loretta Davis 2

New Zealand eScience Infrastructure (NeSI) 1 , Australasian eResearch Organisations (AeRO) 2

[email protected] 1 , [email protected] 2

Diverse teams have been shown to increase a company’s ROI and to make decisions 2X faster with half the meetings, and corporations that hire women into the C-suite often see a 15% increase in profitability.

In this eResearch NZ Birds-of-a-Feather (BoF) session, we would like to build on the momentum and conversations started at a previous eResearch Australasia BoF and discuss the opportunities around better coordinating, supporting, and expanding diversity efforts within Australasia’s eResearch and HPC sectors.

This BoF’s ideal outcomes would include:

● a greater understanding of existing efforts and support structures in place for encouraging diversity and gender equity in Australasia’s HPC and eResearch sectors
● new and/or stronger relationships amongst Australasian diversity and gender equity advocates
● a short-term action plan for 2020-21 to explore ways to better connect, coordinate, and leverage existing diversity and inclusion efforts

Discussion notes and feedback gathered in this session will be collated and shared with the broader eResearch community, as part of a newly formed working group’s efforts to support the development of a Women in HPC (WHPC) community in Australasia. For more information on the Australasian WHPC Working Group, visit the AeRO eResearch Chat’s Diversity & Inclusion space.

ABOUT THE AUTHOR(S)

Jana Makar

Based at the University of Auckland, Jana coordinates a variety of engagement initiatives and external communications to raise the profile of NeSI’s activities, impacts, and collaborations. Prior to joining NeSI, Jana worked as a communications consultant for multiple organisations in Canada’s technology, academic, and startup sectors.

Loretta Davis

Loretta is a seasoned IT professional with 25+ years’ experience in the eResearch, commercial and government sectors in Australia, Africa and the USA. When not working part time for AeRO, Loretta consults as a Solutions Specialist to a number of private clients.


Towards FAIR principles for research software

Paula Andrea Martinez, National Imaging Facility (NIF), The University of Queensland (UQ)

[email protected]

The FAIR Guiding Principles, published in 2016, aim to improve the findability, accessibility, interoperability and reusability of digital research objects for both humans and machines. The FAIR principles are also directly relevant to research software. In the position paper “Towards FAIR principles for research software”, we summarised and developed a basis for community discussion. First, we discussed what makes software different from data concerning the application of the FAIR principles, and which desired characteristics of research software go beyond FAIR. Then, we presented an analysis of where the existing principles can directly apply to software, and where they need to be adapted or reinterpreted. Our next step after the position paper is to prompt for community-agreed identifiers for FAIR research software.

Acknowledgments

To all the authors of Towards FAIR principles for research software https://doi.org/10.3233/DS-190026, and the numerous people who contributed to the discussions around FAIR research software at different occasions preceding the work on this paper.

References

Lamprecht, Anna-Lena, et al. (2019) Towards FAIR principles for research software. Data Science. https://doi.org/10.3233/DS-190026

ABOUT THE AUTHOR(S)

- Dr Paula Andrea Martinez has led the National Training Program for the Characterisation Community in Australia since 2019. She works for the National Imaging Facility (NIF). Last year she worked at ELIXIR Europe coordinating the Bioinformatics and Data Science training program in Belgium and collaborated with multiple ELIXIR nodes in the development of Software best practices. Her career, spanning Sweden, Australia and Belgium, nurtured her experience in Bioinformatics and Research Software development for complex and data-intensive science. She started her career in Computer Science, later became interested in research methods development, and now focuses on outreach and advocacy in data and software best practices.


Utilising Oxford Nanopore Data for the Genome Assembly of Endemic New Zealand Species

Ann Mc Cartney1, Chen Wu3, Ross Crowhurst3, Joseph Guhlin2, Chris Smith1, David Chagne5,

Thomas Buckley1.

1 Systematics Team, Manaaki Whenua, 231, Morrin Road, Saint Johns Auckland.

2 Department of Biochemistry, School of Biomedical Sciences, University of Otago,

Cumberland Street, Dunedin.

3 Plant and Food Research,120 Mt Albert Road, Sandringham, Auckland.

4 Commerce A, Room 113, University of Auckland, Symonds Street, Auckland.

5 Plant and Food Research, Batchelar Road, Fitzherbert, Palmerston North.

As part of Genomics Aotearoa, a high-quality genomes project has been established to

generate pipelines for the assembly of species across New Zealand. These pipelines are

specifically targeted at key species that are on the verge of extinction, treasured by Māori,

key players in the primary production industry, a significant threat to biosecurity within New

Zealand or have complex genomes, i.e. are abnormally large, have higher ploidy levels, are

highly repetitive or heterozygous. These species have been sequenced using a variety of

NGS platforms, namely Illumina, Oxford Nanopore (ONT), PacBio, Chromium 10X and Hi-C.

Here, genome assembly strategies under development on NeSI will be outlined specifically

using ONT and Illumina datatypes in order to highlight the impact of read depth and

coverage on genome assembly quality. This study deals with the optimisation of genome

assembly construction for projects with a limited budget or those that are confined to

certain locations/sequencing platforms. It also addresses optimal assembly strategies for

species with unique genome architectures. In order to investigate this, five species with

unique genome characteristics were selected, namely: Hericium novae-zealandiae, a small

diploid fungus; Clitarchus hookeri, a species containing a large, repetitive and highly

heterozygous genome; Knightia excelsa, a plant species with a medium-sized, non-repetitive

genome with low heterozygosity; and kiwifruit, another plant species with a smaller and

more repetitive genome structure. A focus will also be placed on the importance of

appropriate data management, transfer and sharing when working with taonga species.


ABOUT THE AUTHOR(S)

Name: Ann Mc Cartney

Bio: Ann Mc Cartney is a Genomics Aotearoa postdoctoral fellow at Manaaki Whenua -

Landcare Research. After completing her Genetics and Cell Biology degree at DCU in 2012,

where she graduated top of her class, she won IRCSET funding to complete her PhD in

Bioinformatics and Molecular Evolution under the supervision of Dr. Mary O’Connell (now at

Nottingham University) in 2018. During her PhD, Ann identified and characterised fusion

genes with a specific focus on primate genomes, producing a thesis entitled "Novel gene

genesis by gene fusion: a network based approach". Since moving to Auckland, New Zealand

in 2018, she has worked as a Postdoctoral Researcher for Genomics Aotearoa’s High

Quality Genomes Project under Associate Professor Thomas Buckley creating protocols for

the sequencing and assembly of endemic New Zealand species including stick insects such

as Clitarchus hookeri, fungi from the Hericium, Venturia and Pithomyces clades, birds such

as the Hihi and Kakapo and honeysuckle trees such as the rewarewa.


Data Pipelines and Prisms

Alan McCulloch

AgResearch (NZ)

[email protected]

We describe a data-prism data processing and analysis metaphor, contrast this with the

data-pipeline metaphor and topology and describe several use cases.

The pipeline processing metaphor is popular for two main reasons: firstly, end-to-end

(longitudinal) processing integrity and performance is usually uppermost in the minds of

analysts and software designers; secondly, mature implementations of well-understood

formal pipeline-topology abstractions such as directed acyclic graphs are readily available,

as are well-understood end-to-end-oriented quality-control processes and metrics.

However, collections of input files associated with some large-scale datasets have important

side-to-side (latitudinal) structural features, processing and quality control metrics that are

not so well represented by the longitudinal pipeline metaphor and topology. For example,

while processing of a set of samples from a sequencing machine may conclude with perfect

end-to-end integrity per data-file, unsupervised machine learning (for example clustering)

applied latitudinally to a low-entropy precis of all of the input, intermediate or final data-

files may identify data features of interest such as outliers, relevant to quality control.

Another example of latitudinal processing and structure is in the use of multiple reference

frames for sample annotation, rather than a single reference, so that a single stream of

processing is refracted into multiple streams, with each stream searching a different

reference database, and/or using alternate search parameters. Technical steps such as job-

scheduling and intermediate and output file-disposition for such “short, wide” (as opposed

to “long, narrow”) processing streams can be awkward when using a pipeline metaphor. For

example, pipeline-oriented scripting usually stores and indexes input, intermediate and

output files “non-semantically”, via hard-mapping the output from each pipeline-stage to a

different file-system folder for that stage, which does not work well if each folder receives

input from multiple threads of processing of the same data (for example, file-name

collisions will result).
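
The file-name collision problem described above, and one way a "semantic" naming scheme might avoid it, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the path layout, the reference database names and the use of a short parameter digest are all hypothetical choices made for this example.

```python
import hashlib
from itertools import product
from pathlib import PurePosixPath


def semantic_path(root, sample, stage, reference, params):
    """Build an output path that encodes what the file contains,
    not just which pipeline stage wrote it (hypothetical layout)."""
    # A short digest of the sorted parameters keeps names unique but compact.
    param_text = ",".join(f"{key}={value}" for key, value in sorted(params.items()))
    digest = hashlib.sha1(param_text.encode()).hexdigest()[:8]
    return PurePosixPath(root) / sample / f"{stage}.{reference}.{digest}.out"


def refract(root, sample, stage, references, param_sets):
    """Refract one sample into one stream per (reference, parameter set)."""
    return [
        semantic_path(root, sample, stage, reference, params)
        for reference, params in product(references, param_sets)
    ]


# Hypothetical example: one sample searched against two reference
# databases with two alternative parameter sets gives four streams.
paths = refract(
    "/data/run42",
    "sampleA",
    "search",
    references=["refdb1", "refdb2"],
    param_sets=[{"evalue": 1e-5}, {"evalue": 1e-10}],
)
for path in paths:
    print(path)
```

Because the reference and parameter set are part of each name, the four streams can share a folder without overwriting one another, which a one-folder-per-stage layout with fixed file names would not allow.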

We describe some data-prism use cases, and a number of simple techniques we have found

useful in implementing a data-prism metaphor, such as semantic file storage and indexing, a

high level API for tasks such as random sampling and processing large numbers of input files


and parameter sets, low-entropy data representation approaches to support a high level

latitudinal view of the data, and the use of a meta-scheduler and command-line mark-up for

easier refraction of single into multiple streams of processing (and to try to reduce the

impedance mismatch between the shell command-line that many users know and love, and

the cluster job-submission systems known and loved by systems administrators).
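
As an illustration of a latitudinal quality-control check over a low-entropy summary of many files, the sketch below flags files whose summary statistic sits far from the rest. It deliberately swaps in a simpler technique than the clustering mentioned in the abstract: a robust z-score based on the median absolute deviation (MAD). The summary chosen (value count and mean), the threshold of 3.5 and the sample data are assumptions made for the example.

```python
from statistics import median


def precis(values):
    """Reduce one file's values to a low-entropy summary: (count, mean)."""
    return (len(values), sum(values) / len(values))


def mad_outliers(summaries, column, threshold=3.5):
    """Flag files whose summary statistic is far from the median across files."""
    xs = [summary[column] for summary in summaries.values()]
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    if mad == 0:
        return []  # no spread across files, so nothing can be flagged
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [
        name
        for name, summary in summaries.items()
        if 0.6745 * abs(summary[column] - med) / mad > threshold
    ]


# Hypothetical per-sample measurements (for example, read lengths).
files = {
    "s1": [100, 104],
    "s2": [98, 100],
    "s3": [101, 101],
    "s4": [97, 99],
    "s5": [40, 44],
}
summaries = {name: precis(values) for name, values in files.items()}
flagged = mad_outliers(summaries, column=1)
print(flagged)
```

Each file may pass its own end-to-end checks and still stand out when compared side to side with its neighbours, which is the kind of feature the latitudinal view is meant to expose.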

ABOUT THE AUTHOR

Alan McCulloch is a Bioinformatics Software Engineer working at AgResearch’s Invermay

campus, mainly supporting genetic and genomic databases and high-throughput

computational pipelines.


Sharing across the ditch: infrastructure for social science research data in Australia and New Zealand

Steven McEachern

Australian Data Archive

Australian National University

[email protected]

Marina McGale

Australian Data Archive

Australian National University

[email protected]

Martin von Randow

COMPASS Research Centre

University of Auckland

[email protected]

Janet McDougall

Australian Data Archive

Australian National University

[email protected]

The social science research community has a long tradition of working collaboratively to

study comparative political, social and economic problems in an international context.

Australian and New Zealand researchers have made regular, long term contributions to

international research programs such as the International Social Survey Program [1], World

Values Survey [2] and the Comparative Study of Electoral Systems [3], and the results of this

work are disseminated internationally.

These international collaborations highlight significant opportunities for the establishment

of shared resources and infrastructure to support such programs. Social science data

archives have been established in many countries to support the efforts associated with

programs. Particularly in the European Union, these now represent the major EU-funded

social science infrastructures under the ESFRI program, such as the Consortium of European

Social Science Data Archives (CESSDA) [4], and associated survey research programs of the


European Social Survey (ESS) [5] and the Survey of Health and Retirement in Europe (SHARE)

[6].

Our project

Cross-national collaborations such as CESSDA and the World Values Survey represent an

opportunity for Australian and New Zealand researchers and research infrastructures in the

social sciences, but one that has not yet been realised.

This paper therefore provides an overview of the collaborations over the past 10 years

between the Australian Data Archive and the COMPASS Research Centre. This collaboration

began with a joint effort between ADA and COMPASS to establish the New Zealand Social

Science Data Service (NZSSDS), using the NESSTAR access system in 2007-8. This collection

has now been maintained for 10 years, including migration to a Figshare repository around 5

years ago by the Centre for eResearch at the University of Auckland.

Over the 10 years since, the ANU Centre for Social Research and Methods (home of ADA)

and COMPASS have contributed to the ISSP program and provided support for the national

Election Study in each country. This paper therefore presents an overview of the

development of social science infrastructure over this period, and the collaboration

between Australia and New Zealand over that time and into the future.

Methods

To further this collaboration, over the past 12 months, ADA and COMPASS have been

working to preserve and update the NZSSDS collection of datasets as a hosted collection

within the Australian Data Archive. The collection – re-establishing the New Zealand Social

Science Data Service – is now managed and maintained through the COMPASS team in New

Zealand, but housed at ADA on the National Computational Infrastructure.

The re-establishment of NZSSDS has involved three separate streams of activity – each of

which is critical to the preservation and dissemination of research data. These streams

included:

1. Technical update: migration of the existing data from the previous service

(established at Figshare) to the ADA Dataverse environment

2. Policy: updating preservation and access policies for the COMPASS data to meet New

Zealand, Australian and international regulatory and ethical requirements


3. Curation: enabling web-based curation processes for data and metadata using

common procedures and international social science metadata standards (the Data

Documentation Initiative)

Results

The presentation will provide details of the progress of this new collection, a discussion of

the benefits and drawbacks to cross-national data management through the project, and

lessons for managing cross-national collaborations and shared infrastructure in the social

sciences more generally.

References

[1] International Social Survey Program (ISSP): http://issp.org/

[2] World Values Survey (WVS): http://www.worldvaluessurvey.org/

[3] Comparative Study of Electoral Systems (CSES): https://cses.org/

[4] Consortium of European Social Science Data Archives (CESSDA): https://www.cessda.eu/

[5] European Social Survey (ESS): http://www.europeansocialsurvey.org/

[6] Survey of Health and Retirement in Europe (SHARE): http://www.share-project.org/home0.html

ABOUT THE AUTHOR(S)

Steven McEachern is Director, Marina McGale is Web Services Coordinator, and Janet

McDougall is senior archivist at the Australian Data Archive. Martin von Randow is Data

Manager and Analyst at the COMPASS Research Centre.


Enabling authentication at global scale: an update on REANNZ services

Vladimir Mencl

REANNZ

[email protected]

REANNZ operates two well-known services for the New Zealand R&E community: Tuakiri,

the Identity and Access Management Federation, and eduroam, roaming infrastructure for

seamless network access. This presentation will give an update on new developments in

these services.

Tuakiri has recently completed the connection to eduGAIN, the global "federation of

federations" - and eduGAIN connectivity is now available to the NZ R&E community. This

allows users from NZ institutions to log into services from other federations via eduGAIN -

and in a similar vein, overseas users can log into NZ-based services connected to eduGAIN.

Tuakiri is rolling out eduGAIN with an opt-in process, and this talk will cover the steps an

Identity Provider or a Service Provider must take to join eduGAIN.

For eduroam, the global community has made several new very useful services available.

With eduroam CAT, the Configuration Assistant Tool, it is now easy to onboard users for

eduroam - either pointing them directly to the CAT website to install the connection profile,

or rolling out the connection profile across a fleet of devices through centralised

management infrastructure. In both cases, this results in a consistent (and more secure)

connection profile deployment. And CAT 2.0, released this year, makes it even easier for

institutions to register with CAT and create the end-user connection profiles.

While larger institutions are well capable of running the infrastructure required for eduroam

themselves, smaller institutions often find this task challenging. For these, as an alternative

to running the IdP infrastructure, the Managed IdP offering might be the right fit: a fully

hosted and managed eduroam IdP, with an interface for managing user accounts and for

deploying these accounts on user devices.

ABOUT THE AUTHOR(S)

Dr. Vladimir Mencl has been part of the New Zealand R&E community since 2006 and has

been involved in identity and access management projects since the early days of the

BeSTGRID project. When the Tuakiri project moved to REANNZ, Vlad joined REANNZ where

he is part of the Systems team as a Senior Software Engineer.


Beyond Super

April Neoh

Account Executive HPC/AI & Big Data Storage

Hewlett Packard Enterprise

E-mail: [email protected]

Discover “Beyond Super” supercomputing as HPE and Cray now team together as one company. Listen to how HPE and Cray are combining unique IP to deliver the capabilities you will need in the exascale era with systems that perform like a supercomputer and run like a cloud.

Together, we are redefining the supercomputing industry with the solutions and services for a new era of converged modeling, simulation, analytics, and AI.

ABOUT THE AUTHOR

April has been in the IT industry for over 25 years at IBM, Cray and now HPE, with her time spent mainly working closely with Technical and Research customers to build high-performance computing solutions using bleeding-edge HPC technologies. She has worked with Government Agencies and Universities in building collaborative research partnerships and implementing TOP500-sized HPC systems in Australia and New Zealand, with the goal of exploiting technologies to deliver a scientific outcome.


Using comparative RNASeq to identify small non-coding RNAs in bacterial clades

Thomas Nicholson1,2 and Paul Gardner1

1Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9054, New Zealand, 2Genetics Otago, University of Otago, New Zealand.

[email protected]

Small non-coding RNAs are involved in the regulation of a wide range of cell processes. A number of tools try to identify these RNAs using a range of methods; however, the difficulty of predicting non-coding RNAs from sequence alone, together with transcriptional noise that makes RNASeq data unreliable, has hindered annotation of functional elements (Jose et al. 2019). While these methods manage to predict RNAs, it can be hard to determine whether a signal in RNASeq data is the result of a real RNA or of noise. To deal with this problem we are using a comparative approach, taking RNASeq data from multiple genomes within a clade (Lindgreen et al. 2014). We have designed a pipeline that identifies peaks in intergenic regions of RNASeq data that may be functional RNAs and uses genome alignments to check whether there are conserved regions of expression, which would indicate that the observed transcription is of functional RNAs. By using a comparative approach we aim to improve the signal-to-noise ratio in our results and produce a better list of candidate small non-coding RNAs.
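
The two steps the abstract describes, calling expression peaks in intergenic regions and then keeping only those supported across genomes, can be sketched roughly as below. This is a simplified illustration and not the authors' pipeline: the function names, depth and length thresholds, and toy coverage data are hypothetical, and the sketch assumes peak coordinates have already been projected onto a shared genome-alignment coordinate system.

```python
def intergenic_regions(genes, genome_length):
    """Return half-open (start, end) intervals not covered by any gene."""
    regions, cursor = [], 0
    for start, end in sorted(genes):
        if start > cursor:
            regions.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_length:
        regions.append((cursor, genome_length))
    return regions


def call_peaks(coverage, regions, min_depth=10, min_length=3):
    """Find runs of positions inside the regions with depth >= min_depth."""
    peaks = []
    for start, end in regions:
        run_start = None
        # Iterate one past the end so a run touching the boundary is closed.
        for pos in range(start, end + 1):
            covered = pos < end and coverage[pos] >= min_depth
            if covered and run_start is None:
                run_start = pos
            elif not covered and run_start is not None:
                if pos - run_start >= min_length:
                    peaks.append((run_start, pos))
                run_start = None
    return peaks


def conserved_peaks(peaks_by_genome, min_genomes=2):
    """Keep peaks overlapping a peak in at least min_genomes genomes
    (the genome the peak came from counts as one)."""
    kept = []
    for genome, peaks in peaks_by_genome.items():
        for start, end in peaks:
            support = sum(
                any(s < end and start < e for s, e in other)
                for other in peaks_by_genome.values()
            )
            if support >= min_genomes:
                kept.append((genome, start, end))
    return kept


# Toy data: two genes flank one intergenic region in each of three genomes.
genes = [(0, 5), (15, 20)]
coverage = {
    "a": [50] * 5 + [0, 0, 12, 15, 20, 11, 0, 0, 0, 0] + [40] * 5,
    "b": [30] * 5 + [0, 0, 0, 14, 13, 12, 0, 0, 0, 0] + [30] * 5,
    "c": [30] * 5 + [0, 0, 0, 0, 0, 0, 0, 25, 25, 25] + [30] * 5,
}
regions = intergenic_regions(genes, genome_length=20)
peaks_by_genome = {name: call_peaks(cov, regions) for name, cov in coverage.items()}
conserved = conserved_peaks(peaks_by_genome)
print(conserved)
```

In this toy case the peaks in genomes a and b overlap and are retained, while the isolated peak in genome c is treated as likely noise and dropped.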

References

1. Jose, B. R., Gardner, P. P., & Barquist, L. (2019). Transcriptional noise and exaptation as sources for bacterial sRNAs. Biochemical Society Transactions, 47(2), 527-539.

2. Lindgreen, S., Umu, S. U., Lai, A. S. W., Eldai, H., Liu, W., McGimpsey, S., ... & Poole, A. M. (2014). Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling. PLoS computational biology, 10(10), e1003907.

ABOUT THE AUTHOR(S)

Tom is a PhD student in the Department of Biochemistry at the University of Otago. His research focusses on the bioinformatic analysis of small non-coding RNAs in prokaryotes.


Network-based Nonparametric Tests to Identify Genetic Modifiers of Rare Diseases

Eliatan Niktab ([email protected])1, Stephen Sturley ([email protected])2,

Ingrid Winship ([email protected]) 3-4, Mark Walterfang

([email protected])3-4, Paul Atkinson ([email protected])1, Andrew

Munkacsi ([email protected])1

1School of Biological Sciences, Victoria University of Wellington, Wellington, New Zealand

2Department of Biology, Barnard College at Columbia University, New York, New York, USA

3Melbourne Neuropsychiatry Centre, Royal Melbourne Hospital, Melbourne, Australia

4Medicine, Dentistry and Health Sciences, Melbourne University, Melbourne, Australia

Genome and exome sequencing have been highly successful in identifying disease-causing gene mutations and variants in genome-wide association studies (GWAS). However, there has been little success in deducing the pertinent genomic variants that significantly modify disease progression and fully account for phenotype. One reason is that current GWAS utilize narrow-sense heritability estimates and do not include assessment of epistasis and other components of broad-sense heritability [1-2]. Here we report an investigation of genetic variants that modify the causal gene of a monogenic disease and ultimately regulate its onset and progression in individuals. Niemann-Pick type C (NP-C) disease, a rare monogenic Mendelian disease, is one of more than 6,000 Mendelian diseases for which there is no cure. Most NP-C patients with the NPC1 gene mutation are diagnosed as late infants and die before or during adolescence, yet the survival of some to adulthood provides a testbed for elucidating genes that alleviate the primary mutation. Therefore, we collected whole-genome sequences of pediatric and adult-onset patients. We then developed a pipeline that integrates mathematical models of genetic polymorphisms, augmented Bayesian biological networks, clinical records, and semantic ontologies of GWAS data. The tool that we developed analyzes sequencing data, identifies genome-wide interactions, and has scripts that control for confounding factors using heterogeneous data harmonization and modularity-based clustering. Our approach mitigates the statistical challenge of sample sizes inherent to current GWAS methodology. There are a large number of modifying genes that appear to function in epistatic networks of disease-modifying variants whose genetic effects together explain the heritability of NP-C in its various manifestations.
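
For readers less familiar with the distinction drawn above between narrow-sense and broad-sense heritability, the standard quantitative-genetics decomposition is summarised below. It is textbook background rather than part of the authors' method; the symbols are the conventional ones and do not appear in the abstract.

```latex
% Phenotypic variance split into genetic and environmental parts
V_P = V_G + V_E
% Genetic variance split into additive, dominance and epistatic (interaction) parts
V_G = V_A + V_D + V_I
% Narrow-sense heritability uses only the additive component
h^2 = \frac{V_A}{V_P}
% Broad-sense heritability uses all genetic components, including epistasis
H^2 = \frac{V_G}{V_P} = \frac{V_A + V_D + V_I}{V_P}
```

Since $V_D$ and $V_I$ are non-negative, $h^2 \le H^2$, and any contribution from epistatic modifier networks sits in the gap between the two.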


1- Kim, S. et al. Genes with high network connectivity are enriched for disease heritability. Am. J. Hum. Genet. (2019).

2- Escala-Garcia, M. et al. A network analysis to identify mediators of germline-driven differences in breast cancer prognosis. Nat. Commun. (2020).

Keywords: system biology, network, genome, GWAS, Bayesian

Acknowledgments: Ara Parseghian Medical Research Fund

ABOUT THE AUTHOR(S)

Eliatan Niktab

I’m a Ph.D. candidate at Victoria University and a member of the Quantitative Methods,

Machine Learning, and Functional Genomics group at Genomics England Clinical

Interpretation Partnership (GeCIP). I’m trained to investigate diverse, complex and multi-

feature data including human genomics, proteomics, and metabolomics by developing

mathematical models and statistical analyses. My Ph.D. dissertation utilizes such models for

investigating the rare neurodegenerative Niemann-Pick type C (NPC) disease. I’m mostly

practiced in genome-scale algorithm design, using deep neural networks for genetic variant

discovery, Baysian modeling, and GPU-accelerated software development.

Stephen Sturley, Ph.D.

Professor Sturley’s group uses a multidisciplinary approach that integrates genetics, biochemistry and cell biology. He specializes in applying yeast as a model organism to understand human lipid metabolism. His lab has placed particular emphasis on, and had particular success with, cholesterol, sphingolipid and fatty acid homeostasis underlying lipotoxicity, with particular reference to neurodegeneration, obesity, and diabetes. Their use of yeast to identify genetic modifiers of recessive disorders such as Niemann-Pick Type C (NPC) resulted in the identification of histone deacetylase inhibitors as a candidate therapy, which was further tested in murine models of NPC disease and is now in clinical trial in human patients.

Mark Walterfang, M.D., Ph.D. FRANZCP

Professor Walterfang has significant experience in clinical neuropsychiatry and general adult

psychiatry with expertise in managing comorbid psychiatric and neurological disorders,

neurodegenerative disorders, neurometabolic disorders, and atypical dementias. Clinical

experience is the basis of his success in research, starting with 13 published papers from his

Ph.D. that have been cited more than 500 times. He has since published over 130 papers in

major psychiatric, metabolic, neurological and neuroimaging journals.

Ingrid Winship, MBChB, MD, Ph.D., FRACP, FACD

The focus of her research is the relationship between genotype and phenotype with

particular emphasis on human diseases. In the last two decades, her research has helped to

frame the questions, define the phenotypes, and, via statistical analyses, associate

genotype and phenotype. At the other end of the translational pipeline, her research has

translated the discoveries and knowledge into clinical protocols and policies, which has

changed the way patients are treated in medical practice via new drug targets and

biomarkers to monitor the onset and progression of the disease.

Paul Atkinson, Ph.D.

Professor Atkinson is a cell biologist who has long investigated ER-related events in the specification of membrane protein synthesis and transport. His studies have included ER, Golgi and plasma membrane specific glycosylation structure determined by multi-dimensional NMR. Specific membrane glycoprotein studies utilised model systems including membrane maturing viruses. More recently he has utilized yeast gene knockout libraries to investigate epistatic network contribution to phenotype in ER-related events.

Andrew Munkacsi, Ph.D.

The Munkacsi lab studies the genetics, cell biology, and biochemistry of intracellular lipid transport to identify novel targets to treat the defective transport of cholesterol and sphingolipids underlying human disease. His group uses a unique combination of unbiased, high-throughput systems biology approaches in yeast genomic screens based on synthetic lethality, gene expression, protein localization, and protein-protein interactions, as well as exome and genome sequencing of human patients. Dr. Munkacsi successfully used these genome-wide yeast screens to identify unsuspected and precise targets to treat neurodegenerative diseases such as Alzheimer’s disease and Niemann-Pick Type C (NPC) disease, of which a subset have progressed to clinical trials.

NVIDIA Accelerated Computing Workshop

Dr. Gabriel Noaje

Senior Solutions Architect

E-mail: [email protected]

This workshop will provide an overview of the accelerated computing solutions that NVIDIA offers for HPC, DL and ML. From the GPU architecture to the CUDA-X libraries collection and developer tools, attendees will be introduced to the NVIDIA platform for developers. In addition, dedicated frameworks for Deep Learning, such as DeepStream and Clara, and for Machine Learning, such as RAPIDS, will also be introduced.

13:30-13:45 Opening by Dennis Ang, NVIDIA APAC South Director

13:45-14:15 Convergence of HPC + AI

14:15-15:10 Accelerated platform overview and updates (system architecture, CUDA, CUDA-X, development tools)

15:10-15:30 Afternoon tea

15:30-16:00 Deep Learning Tools and SDKs – DeepStream, Clara, etc.

16:00-16:20 NVIDIA GPU Cloud (Containerization, Transfer Learning Toolkit, TensorRT, TensorRT Inferencing Server)

16:20-16:40 Convergence of HPC + Data Science (RAPIDS framework)

16:40-17:00 Q&A Session

ABOUT THE INSTRUCTOR

Gabriel Noaje has more than 10 years of experience in accelerator technologies and parallel computing. Having worked in both enterprise and public sector roles, Gabriel has a deep understanding of users’ requirements in terms of manycore architecture. Prior to joining NVIDIA, he was a Senior Solutions Architect with SGI and HPE, where he developed solutions for HPC and Deep Learning customers in APAC. Previously, he was a Senior Computational Scientist at A*STAR Computational Resource Centre in Singapore (A*CRC), supporting users with deploying their applications on GPUs and large HPC systems.

Basics of Cloud Computing Workshop

Daniel O’Byrne

Catalyst Cloud

E-mail: [email protected]

An introduction to cloud computing and how you can take advantage of cloud technologies for your research. The workshop aims to show you the advantages of using cloud computing over legacy platforms and will guide you through setting up your first instance on the Catalyst Cloud.

Day: Thursday 13 February Time: 11.00am – 12.30pm

Academic Data Science: from individuals to institutions

Micaela Parker

Academic Data Science Alliance

[email protected]

Although data-driven research is already accelerating scientific discovery, substantial systemic challenges still exist in academia that impact both individual researchers and institutional decision-making. These challenges need to be overcome for academia to fully realize the promise of the new data era. Toward that end, working in partnership with one another and with the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation, three universities (the University of California Berkeley, New York University, and the University of Washington) have been attempting to create supportive environments for researchers using and developing data-intensive practices. Established in 2013, and known as the Moore-Sloan Data Science Environments (MSDSE), this collaboration was structured through a set of working groups on cross-cutting topics viewed as critical to advancing data science in academia: career paths and incentives, software development, education, reproducibility and open science, reflexive and reflective ethnography, and the role of physical space in collaboration. As the MSDSE grants approach sunset in 2020, the Academic Data Science Alliance (ADSA) has formed to expand the community and build a network across the U.S. and beyond to share lessons from the MSDSEs and from other higher education institutes experimenting with the integration of data science in academia. This talk will cover some of the efforts and activities of the MSDSEs and ADSA, emphasizing best practices and lessons learned that have emerged from six years of collaborative institutional experimentation, from cross-domain workshops and project incubators to the challenges of creating (and filling) new staff data scientist positions outside of any one particular lab or discipline.

ABOUT THE AUTHOR(S)

Micaela Parker is Executive Director of the Academic Data Science Alliance (ADSA). ADSA is a

domain-agnostic organization that supports university researchers in their efforts to

collaborate around data-intensive tools, methods, and responsible applications. By building

networks of data science practitioners and thought leaders, ADSA enables better sharing of

knowledge, ideas, and lessons learned.

Before launching ADSA, Dr. Parker served as Program Coordinator for the Moore-Sloan Data

Science Environments and Executive Director for the University of Washington’s eScience

Institute. In this role, she handled operations, developed research and training programs,

participated in strategic planning and fiscal oversight, and worked directly with university

and industry partners and funders.

Prior to 2014, Dr. Parker was a senior research scientist in UW’s School of Oceanography,

where she also earned her PhD. She has been involved in many large, interdisciplinary

projects bridging oceanography and genomics. Coming from a data-rich domain, she

appreciates the new data-driven world for all its benefits and challenges. She now enjoys

facilitating collaborations to help researchers navigate this fourth scientific paradigm.

Mice, organoids and single cells: computational methods for cancer treatment

Elizabeth Permina1*, Tom Brew1, Mik Black1,2, Parry Guilford1,2,3

1. Centre for Translational Cancer Research, University of Otago

2. Department of Biochemistry, University of Otago, Dunedin, Otago, New Zealand

3. Pacific Edge Ltd, New Zealand

* [email protected]

Hereditary Diffuse Gastric Cancer (HDGC) affects hundreds of people in New Zealand, many from Māori families. An inherited mutation in the E-cadherin gene (CDH1) is a strong driver of HDGC, affecting individuals as young as 15 years old. A promising way of combatting HDGC involves finding a synthetic lethal (SL) partner to the HDGC-defining gene, CDH1. Synthetic lethality is defined as a specific relationship between two genes where the loss of one is tolerated by the cell but the loss of both is lethal. An innovative way of mixing computational approaches with experimental data offers a method of identifying a range of prospective drug targets and treatments. Generation of mouse gastric organoids (simplified versions of a mouse stomach produced from mouse stem cells with a micro-anatomy that is close to that of a real stomach) with and without CDH1 loss provides a realistic model for HDGC, and single-cell RNA sequencing (scRNA-seq) then provides whole-transcriptome data for these organoid samples. Here I will present an analysis of the organoid scRNA-seq data, utilizing linkage to existing SL gene and pathway data (including siRNA studies done previously in our lab) as well as integration of publicly accessible data sets derived from patient tumours.
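
The definition of synthetic lethality given above can be made concrete with a small sketch. This is only an illustration of the definition, not the analysis presented in the talk: the viability table is toy data, and `GENE_X`, `GENE_Y` and `GENE_Z` are hypothetical placeholder names rather than real candidate partners of CDH1.

```python
def synthetic_lethal_partners(target, viability):
    """Return genes whose loss is tolerated alone but lethal together with `target`.

    `viability` maps a frozenset of lost genes to True (cells survive)
    or False (cells die).
    """
    partners = []
    if not viability.get(frozenset([target]), False):
        return partners  # losing the target alone is already lethal
    for lost, alive in viability.items():
        # Only consider single-gene losses (other than the target) that are tolerated.
        if len(lost) == 1 and alive and target not in lost:
            (gene,) = lost
            both = frozenset([target, gene])
            # Synthetic lethal only if the double loss was observed to be lethal.
            if viability.get(both) is False:
                partners.append(gene)
    return sorted(partners)


# Toy viability table (hypothetical data for illustration only).
toy_viability = {
    frozenset(["CDH1"]): True,
    frozenset(["GENE_X"]): True,
    frozenset(["GENE_Y"]): True,
    frozenset(["GENE_Z"]): False,   # lethal on its own, so not an SL partner
    frozenset(["CDH1", "GENE_X"]): False,
    frozenset(["CDH1", "GENE_Y"]): True,
    frozenset(["CDH1", "GENE_Z"]): False,
}

print(synthetic_lethal_partners("CDH1", toy_viability))
```

In this toy table only `GENE_X` satisfies the definition: its loss is tolerated alone, and lethal only in combination with the loss of CDH1.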

Elizabeth Permina is a Postdoctoral Research Fellow working on the HRC-funded research programme “Reducing the burden of gastric cancer in New Zealand”, based in the Centre for Translational Cancer Research at the University of Otago, Dunedin.

Tom Brew is a PhD student in Biochemistry, with research focused on developing novel approaches to

treating hereditary diffuse gastric cancer.

Associate Professor Mik Black is a Principal Investigator in the Centre for Translational Cancer Research at the University of Otago, and the bioinformatics lead for Genomics Aotearoa, a national initiative for developing genomics and bioinformatics capability in New Zealand. His research focuses on the development and application of statistical and bioinformatics methodology to problems in human health, with a particular focus on cancer.

Professor Parry Guilford is a Principal Investigator in the Centre for Translational Cancer Research at the University of Otago, whose research focuses on the role played by the gene E-cadherin in the development and progression of hereditary diffuse gastric cancer. He is also a co-founder and Chief Scientific Officer of Pacific Edge Ltd.

Enhancing eResearch productivity with NeSI's consultancy service

Alexander Pletzer1, Chris Scott2, Wolfgang Hayek1 and Georgina Rae2

NeSI (NIWA1, University of Auckland2)

[email protected]

Many research areas, from bioinformatics and genomics to materials science, fundamental physics, earthquake simulation and weather/climate prediction, are increasingly dependent on the availability of powerful computing platforms and deep software stacks. Unfortunately, scientific software too often runs at sub-optimal performance, sometimes reaching only a few percent of the peak performance of the supercomputer. Small changes in code implementation details, the choice of compilers and libraries, and adjustments to the runtime environment have been shown to sometimes have a significant impact on code performance. By walking through some examples, we show how researchers were able to leverage NeSI’s free consultancy service to squeeze more performance out of their applications, sometimes reducing the execution time severalfold: a win-win solution that benefits science and saves core hours.
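
As a minimal illustration of how implementation details alone can change run time, the sketch below times three equivalent ways of computing the same sum with Python's standard `timeit` module. It is not one of the case studies from the talk; the function names and the problem size are assumptions chosen for the example, and the timings will vary from machine to machine.

```python
import timeit


def sum_squares_loop(n):
    """Explicit loop: the most literal implementation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_builtin(n):
    """Same computation, delegating the accumulation to the built-in sum()."""
    return sum(i * i for i in range(n))


def sum_squares_closed_form(n):
    """Algorithmic change: sum_{i=0}^{n-1} i^2 = (n-1) n (2n-1) / 6."""
    return (n - 1) * n * (2 * n - 1) // 6


if __name__ == "__main__":
    n = 100_000
    for fn in (sum_squares_loop, sum_squares_builtin, sum_squares_closed_form):
        # Time 20 repeated calls of each variant on the same input.
        seconds = timeit.timeit(lambda: fn(n), number=20)
        print(f"{fn.__name__:26s} {seconds:.4f} s for 20 calls")
```

All three functions return the same value, so any difference in the reported times comes purely from how the computation is expressed.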

ABOUT THE AUTHOR(S)

- Alex Pletzer is a research software engineer for NeSI at NIWA. Originally a physicist, Alex drifted towards high performance computing during a career that spans research in plasma physics, working for a private company in Colorado and supporting users at a university in Pennsylvania.

- Chris Scott is a research software engineer for NeSI at the University of Auckland. Currently lead of the computational science team, Chris has a background in molecular dynamics, Monte Carlo methods, finite element analysis, visualisation and parallel computing.

- Wolfgang Hayek is a research software engineer for NeSI at NIWA and scientific programming group lead at NIWA. Wolfgang has expertise in radiative transfer modelling, fluid dynamics, data analysis and high performance computing.

- Georgina Rae is engagement manager and, until recently, was team lead for the computational science team. Georgina has experience in food and plant research and has worked in the world of intellectual property and commercialisation.

Running Rāpoi: Rebooting Research Computing & Support at VUW

Jonny Flutey, Andre Geldenhuis, Wes Harrell, Matt Plummer

Victoria University of Wellington

[email protected]; [email protected]; [email protected];

[email protected]

Over the past year, a team newly based in the Centre for Academic Development has taken responsibility for consolidating and reconfiguring high performance computing resources at Victoria University of Wellington. Accompanying this technical reboot have been refreshed approaches to support, training, community building, rebranding and data gathering.

This paper will outline the processes and practices adopted in this holistic approach to research support and researcher development, ranging from wrangling computing nodes into a shipping container to onboarding non-typical users to an HPC environment. It will give an overview of the tools utilised (Slurm, Slack, GitHub, MkDocs, Ganglia, AWS Cost Calculator), training and events offered (Carpentries workshops, ResBaz, community catch-ups), and approaches undertaken (including developing a capability tied to available resources, reporting metrics, alignment with NeSI environments, and engagement with internal and external stakeholders).

https://vuw-research-computing.github.io/raapoi-docs/

https://resbaz.github.io/resbaz2019/wellington/

https://medium.com/the-data-nudge/key-takeaways-from-research-bazaar-wellington-2019-937c57c8699

ABOUT THE AUTHOR(S)

Jonny Flutey is Digital Learning and Research Manager, Andre Geldenhuis and Wes Harrell

are Research Software Engineers, and Matt Plummer is a Digital Research Consultant, all

based in the Centre for Academic Development at Victoria University of Wellington.

Per-sample pathway analysis tool for DNA methylation data

Santana, A.F.1,2, Benton, M.C.1, Macartney-Coxson, D.1, Black, M.A.2

1 Human Genomics, Institute of Environmental Science and Research (ESR), Porirua, New Zealand

2 Department of Biochemistry, University of Otago, Dunedin, New Zealand

[email protected], [email protected], donia.macartney-[email protected], [email protected]

Pathway enrichment analysis plays an important role in the understanding of biological processes and diseases. Such analyses interrogate defined sets of differentially methylated CpG sites and/or differentially expressed genes, evaluating whether changes to members of biological pathways are occurring by chance or not, indicating the potential biological relevance of these functional groupings. In the DNA methylation context, traditional paired or case:control designs are statistically tested to investigate which methylation sites are significantly altered across samples. However, due to the implementation and design of such tests, per-sample analyses are infeasible. Moreover, unlike gene expression, methylation data can skew pathway analysis results if not properly processed, as CpG probes are generally unevenly distributed across genes. For instance, some genes contain as few as one probe, while others have hundreds of probes mapping to them. Consequently, genes with a large number of probes have a higher probability of exhibiting differential methylation, and hence reported perturbed pathways are likely to include false positive results. We propose a novel pathway analysis tool for DNA methylation data, which enriches gene sets and analyses pathway disruption on a per-sample basis, and reduces the bias from CpG-to-gene mapping. This is a flexible approach that can apply different categorisation techniques to methylation signals (beta values), creating CpG sets. These are then converted into gene sets after adjustment using one of the multi-mapping bias methods developed. The final output is then enriched for pathway membership. To assess statistical significance, a resampling step is performed to evaluate whether enriched pathways are robust or if they could have been randomly obtained. This tool is under development and will be made available as an R package.
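
The general shape of such a per-sample workflow can be sketched as follows. This is not the authors' tool (which is described as an R package under development) and does not reproduce their categorisation or multi-mapping adjustment methods. The threshold of 0.8, the "fraction of probes" rule, the function names and the toy probe and gene identifiers are all assumptions made for illustration.

```python
import random


def flagged_genes(betas, probe_to_gene, threshold=0.8, min_fraction=0.5):
    """Per-sample gene set: genes where at least `min_fraction` of probes reach `threshold`.

    Requiring a fraction of probes, rather than any single probe, is one
    simple (assumed) way to stop probe-rich genes being flagged more often.
    """
    total, hits = {}, {}
    for probe, beta in betas.items():
        gene = probe_to_gene[probe]
        total[gene] = total.get(gene, 0) + 1
        if beta >= threshold:
            hits[gene] = hits.get(gene, 0) + 1
    return {g for g in total if hits.get(g, 0) / total[g] >= min_fraction}


def resampled_p_value(gene_set, pathway, universe, n_resamples=2000, seed=0):
    """Empirical p-value: how often a random gene set of equal size
    overlaps the pathway at least as much as the observed gene set."""
    rng = random.Random(seed)
    observed = len(gene_set & pathway)
    candidates = sorted(universe)
    at_least = 0
    for _ in range(n_resamples):
        sample = set(rng.sample(candidates, len(gene_set)))
        if len(sample & pathway) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_resamples + 1)


# Toy data for one sample (hypothetical probes, genes and pathway).
probe_to_gene = {"p1": "G1", "p2": "G1", "p3": "G1", "p4": "G1",
                 "p5": "G2", "p6": "G3", "p7": "G4"}
betas = {"p1": 0.9, "p2": 0.1, "p3": 0.2, "p4": 0.1,
         "p5": 0.95, "p6": 0.85, "p7": 0.3}
pathway = {"G2", "G3"}

genes = flagged_genes(betas, probe_to_gene)
p = resampled_p_value(genes, pathway, set(probe_to_gene.values()))
print(sorted(genes), round(p, 3))
```

In the toy sample, `G1` has four probes but only one of them is highly methylated, so it is not flagged, whereas the single-probe genes `G2` and `G3` are. With a gene universe this small the resampled p-value is not meaningful; it is shown only to indicate where the resampling step sits in the workflow.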
ABOUT THE AUTHOR(S)

Alessandra Santana is a PhD student in Bioinformatics at the Institute of Environmental Science and Research (ESR) and University of Otago. In her collaborative project, she develops computational and visualisation tools to investigate obesity and its related metabolic disorders and the tractability of epigenetic information as disease markers, particularly focusing on DNA methylation. Her interests include machine learning, clustering techniques, DNA methylation, microRNAs and genetics.

Humanities Data Untied - An Untapped Resource or just an Untidy Office?

Alexander Ritchie

University of Otago Library

[email protected]

This presentation will reflect on aspects of the impact made by data science and digital tools within the Library and the Humanities at Otago University. It will seek to untie some of the tightly bound threads of 'data' metaphors in a humanities context, and touch on how librarians and libraries are helping to order the 'untidy office' of data scholarship nationally and internationally. It will do this through three provocations:

• do the Humanities actually have 'data',

• what should and do Humanists do with the data if we do indeed have it, and finally

• what does it mean to be 'united' in data in the context of a Humanities-in-crisis, indigenous data sovereignty, and continual under-funding of the GLAM and cultural sector.

It will conclude by musing about the relationship between data, capta, information, knowledge, and wisdom, and what narrative and metaphor might offer eResearch in Aotearoa.

ABOUT THE AUTHOR(S)

alexander ritchie currently works as a librarian in the humanities at the University of Otago Library, having previously worked as a librarian in sciences, with Otago Polytech, and at Te Uare Taoka o Hākena | Hocken Collections. Recently, he has collaborated with colleagues in the UO Library and School of Arts to support the University's Divisional Digital Humanities Hub | Te Pōkapu Matihiko. This work has involved staffing open hours, running seminars, developing resources, hosting workshops, and advocating for the value of the digital in the humanities. He is yet to master the art of writing about himself in the third person.

First steps in machine learning with NeSI

Chris Scott1, Kameron Christopher2, Alexander Pletzer1, Wolfgang Hayek1, Nooriyah Lohani1

NeSI1, NIWA2

[email protected]

This is a hands-on, beginner-level workshop on machine learning with NeSI. We will focus on image recognition as an example, but this workshop should also be useful to those who wish to build their confidence with machine learning tools such as Keras and TensorFlow. We will begin with a broad introduction to some of the machine learning techniques that will be applied during the workshop, such as convolutional neural networks. Then we will create and train a machine learning model to count objects in images using Keras and TensorFlow within a Python notebook environment. The workshop will then focus on tweaking the model to improve performance. Attendees will have the opportunity to provide feedback to the group, to learn from each other’s experiences and to discuss any pitfalls that were encountered, such as overfitting. This workshop is aimed at building your confidence in applying machine learning tools and techniques and preparing you to take the next steps in using machine learning for your research.
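
To give a feel for the sliding-window operation at the heart of a convolutional layer, here is a minimal pure-Python sketch. It is not workshop material and uses neither Keras nor TensorFlow; the function name, the 4x4 toy image and the 2x2 kernel are assumptions chosen for illustration, and a real layer would learn its kernel values and add a bias and an activation function.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the element-wise products.

    This is the cross-correlation that deep learning frameworks call a
    "convolution"; the output shrinks by (kernel size - 1) in each dimension.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A toy 4x4 "image" containing one bright 2x2 object.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# A 2x2 kernel that responds most strongly where all four pixels are bright.
kernel = [
    [1, 1],
    [1, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# prints:
# [1, 2, 1]
# [2, 4, 2]
# [1, 2, 1]
```

The response peaks at the centre of the feature map, where the kernel fully overlaps the object; localised responses of this kind are what later layers of a counting model can build on.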

ABOUT THE AUTHOR(S)

- Chris Scott is a Research Software Engineer for NeSI at The University of Auckland

- Kameron Christopher is Chief Scientist – HPC and Data Science at NIWA

- Alex Pletzer is a Research Software Engineer for NeSI at NIWA

- Wolfgang Hayek is a Research Software Engineer for NeSI at NIWA

- Nooriyah Lohani is Research Communities Advisor for NeSI at The University of Auckland

HPC for life sciences: handling the challenges posed by a domain that relies on big data

Dinindu Senanayake

New Zealand eScience Infrastructure (NeSI)

[email protected]

The advancement of sequencing technologies, proteomics, microscopy (high-throughput, high-content), etc., together with decreasing costs, is creating an avalanche of data across multiple sub-domains that fall under the life sciences. This data deluge demands an interdisciplinary approach to face the associated challenges, such as data storage, parallel and high-performance computing solutions for data analysis, scalability, security and data integration. The ability to deliver solutions to these needs will result in converting highly granular, unstructured data into real scientific insights, which will accelerate the advances being made in precision medical treatment based on an individual’s genetic makeup, developing drugs with minimum side effects, species conservation programmes, etc.

New Zealand eScience Infrastructure (NeSI) is focused on delivering the tools required by our researchers, who might need a “huge” amount of memory to assemble a large genome, simulate in parallel the Newtonian equations of motion in biochemical molecules like proteins and nucleic acids, meet the ever-increasing requirement for data storage (from day-to-day to “Sensitive”), or deploy efficient methods for end-to-end data transfers. Also, NeSI’s partnership with Genomics Aotearoa has been instrumental in introducing training tools such as virtual machines and an extensive number of workshops hosted on these machines, which are proving to assist beginner-level bioinformaticians/computational biologists to acquire advanced skills within a short period, to be used in their search to understand the rules of life.

ABOUT THE AUTHOR(S)

Dinindu Senanayake

- An Applications Support Specialist at NeSI with a particular interest in Bioinformatics

and Computational Biology. Joined NeSI following half a decade of research

experience gained in the field of Cancer Genetics, Chemical Genetics and

Bioinformatics

Applied Deep Learning for Diverse Research Communities

Prof. Richard O. Sinnott

University of Melbourne

The Melbourne eResearch Group (www.eresearch.unimelb.edu.au) are involved in a multitude of projects, many of which are focused on big data and data analytics. Many research challenges have much to gain from artificial intelligence, and especially from the application of deep learning and convolutional neural networks (CNNs). This talk will provide an overview of a portfolio of projects that have benefited from recent advances in the deep learning domain. These include case studies related to:

• pedestrian/crowd counting for the City of Melbourne;

• (early) fruit counting on trees (for fruit growers to estimate yield);

• tree volume canopy estimation (for fruit growers to estimate the amount of spraying needed);

• truck and trailer classification for VicRoads;

• feral cat classification for ecology researchers working in rural Victoria;

• plant and flower classification for commercial agricultural companies, and

• encroachment of vegetation on powerlines for a range of utility companies.

The talk will cover a brief background to deep learning and CNNs and focus on the results that are now possible, with specific focus on projects requiring image detection and classification. Demonstrations of the results of the case studies will be provided.

ABOUT THE AUTHOR(S)

Professor Richard O. Sinnott is the Director of eResearch at the University of Melbourne and Chair of Applied Computing Systems. In these roles he is responsible for all aspects of eResearch (research-oriented IT development) at the University. He has been lead software engineer/architect on an extensive portfolio of national and international projects, with specific focus on those research domains requiring finer-grained access control (security). He has over 400 peer reviewed publications across a range of applied computing research areas.

United in data management: Is it time for a national research data management framework?

Shiobhan Smith and Laura Armstrong

University of Otago and University of Auckland

[email protected] and [email protected]

In 2015 the CONZUL Research Data Management Framework Report for Universities New

Zealand made the following observation:

“While the concept of research data management has not changed, the environment in

which research is conducted has. Researchers are now able to generate extremely large

volumes of data over very short periods of time, and analyse complex systems where

previously a reductive approach was required. The impact of technology on modern

research has led to a situation where our ability to manage research data has been

overtaken by our ability to generate it, a situation which has created a separation in the

scholarly record. Where once research data were available for peer reviewed

communication, whether in formal publication, collaborative agreements or between

individuals, data are now stored on volatile media in inaccessible locations and without any

contextual semantics or clear lines of ownership, provenance or purpose. Researchers are

unable to, or see little value in structuring their data more effectively and institutions are

unsure how to encourage this.

There is a significant risk that these data, this evidence of the scholarly record, will be lost;

rendering the publications, communications and discourse they generate un-defensible and,

in an academic context, useless. This risk of loss is borne of two circumstances. First,

technology’s inability to store and preserve digital objects for long periods; disks degrade or

fail and data bit-streams corrupt. Second, an absence in the research process of essential

data structure activities so that data may be found, understood, shared and attributed in

line with community conventions in data sharing and validation. Together these two

circumstances encapsulate the need for RDM.

The current state in New Zealand is a fragmented approach to provision that trails other

parts of the world; most notably the UK and EU, the US and Australia. ” 1 (p7)

Fast forward 5 years and we believe this fragmented approach persists. One possible reason

for this is that there is no nationally recognised framework or guideline that sets clear

expectations on how research data needs to be managed by New Zealand research

institutions. Universities are including aspects of research data management in policies such

as researcher codes of conduct and open access guidelines but is this enough to ensure that

New Zealand’s research data is treated as a valued asset and is FAIR2?

This workshop is a follow-on from initial conversations at URONZ2019 and a session at Figshare Fest NZ 2019, and is aimed at those with an interest in, or who support, research data management. Working in small groups, participants will brainstorm the concept of a national research data framework: who should be responsible, what should be included, how it should be monitored, and the risks versus benefits of adopting this collaborative approach.

References:

1 Wilkinson, J Max, Flaherty, Brian, Hearne, Shari, Lynch, Helen, Lamond, Heather, Dewson,

Natalie, … Amos, Howard. (2016, February 2). Research Data Management Framework

Report (Version Public). Zenodo. http://doi.org/10.5281/zenodo.1193195

2 Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for

scientific data management and stewardship. Sci Data 3, 160018 (2016).

doi:10.1038/sdata.2016.18

ABOUT THE AUTHOR(S)

Shiobhan Smith

Shiobhan has over 10 years’ experience working in Libraries and Museums. Prior to being

appointed as the University of Otago Library’s Research Support Unit Manager, Shiobhan

was Subject Librarian to a number of Humanities departments including Sociology,

Anthropology, Geography, and Theology. As Subject Librarian to the Centre for

Sustainability, Shiobhan was involved in the development of the Otago Data Management

Planning tool and has an interest in Research Data Management. Shiobhan also has

knowledge and skills in Digital Humanities, Bibliometrics, and Information Literacy.

Laura Armstrong

Laura Armstrong is a Senior eResearch Engagement Specialist at the Centre for eResearch,

University of Auckland, working to engage researchers in eResearch and deliver research

data management services and researcher enablement projects.

Uniting equipment and research publications: bigger than Ben Hur?

Shiobhan Smith and Fiona Glasgow

University of Otago

[email protected] and [email protected]

The issue is simple enough: create a list of publications that are the result of using a specific

piece of equipment. But how do you do this? Does your institution have mechanisms that

track data outputs from their inception in equipment, like flow cytometric analysers, to a

completed and published article? How many researchers may work on that data in-

between? Would they cite that equipment in their outputs?

OMNI (Otago Micro and Nanoscale Imaging unit) administrators must report on the

scholarly outputs produced as a result of using their equipment. To date this process has

been difficult and manual, requiring hours of work gathering lists of publications from the

research outputs database, contacting researchers, and relying on self-reporting. As

publishers do not systematically ask for equipment metadata as part of the publishing

process, there is no easy way to query publication databases. In 2018, OMNI manager

Charlene Gell met with the library Research Support Unit (RSU) to investigate a more

streamlined and sustainable solution. What the RSU discovered is that the issue is simple

but the solution complex.

In this presentation RSU members Shiobhan Smith and Fiona Glasgow will use the OMNI

project as a case study to discuss persistent identifiers, the data lifecycle, data

management, research management systems, and citation culture. They will break down

the problem, present possible solutions, and seek feedback. After all, equipment and

publications are united by data. But is maintaining that connection, as the data moves

through the research lifecycle, simply bigger than Ben Hur?

ABOUT THE AUTHOR(S)

Shiobhan Smith

Shiobhan has over 10 years’ experience working in Libraries and Museums. Prior to being

appointed as the University of Otago Library’s Research Support Unit Manager, Shiobhan

was Subject Librarian to a number of Humanities departments including Sociology,

Anthropology, Geography, and Theology. As Subject Librarian to the Centre for

Sustainability, Shiobhan was involved in the development of the Otago Data Management

Planning tool and has an interest in Research Data Management. Shiobhan also has

knowledge and skills in Digital Humanities, Bibliometrics, and Information Literacy.

Fiona Glasgow

Fiona is an information management enthusiast who has worked in both libraries and museums for

the past five years. After finishing an Honours degree in English, Fiona began the Masters of

Information Studies, which she completed in July 2017. Her research topic was focused on

digitising Māori collections in museums.

Data movement challenges to research productivity - examples and responses

Dr Frankie Stevens1, Dr Carina Kemp1, Dr Andrew Lonie2, Dr Steve Manos2

1 AARNet, 2 Australian Biocommons

[email protected], [email protected],

[email protected], [email protected]

The ability to reliably and repeatedly move data from A to B is becoming the key

underpinning capability of modern research. This has been driven by changes in how

research happens: national and international-scale collaborations around data are growing

and more numerous; researchers are using more diverse computational infrastructure that

is more geographically distributed; there’s an increasing use of international reference data

collections; and there is an upscale of instruments and the volumes of data they produce.

This BoF aims to explore this challenge, and seeks to delve into some fundamental

questions, such as:

● How much research impact does data movement have?

● What examples of approaches, tools and collaborations exist where things have

worked well?

● What is the taxonomy to describe the challenges and our responses within data

movement? (A common language that describes the multi-faceted nature of this

challenge is needed. Data movement software, data movement scheduling, data

placement and data security, are all related areas that begin to describe responses to

the challenge.)

● What could an optimal ‘data movement ecosystem’ look like? If we treated data

movement as a first class citizen of the research computing world, akin to HPC, data

storage, or cloud, what sort of training, expertise, resources, help and approaches

could we envisage?

This BoF will probe these questions across various disciplines, to determine the levels of

support and tooling required to do so, and to build partnerships between research

disciplines and eInfrastructure providers. The BoF will feature real world examples of data

movement challenges being experienced by life science researchers in Australia, but aims to

capture the broad array of issues from other disciplines, and provide information on the

solutions available today. The BoF will make use of online polling software to enable all

delegates to participate in real time, and permit broad engagement on the topic.

ABOUT THE AUTHOR(S)

- Frankie Stevens is AARNet’s Research Engagement Strategist. Previously, Frankie has

held roles with the Australian Research Data Commons, the NSW state body for

eResearch, the Research Data Storage Infrastructure (RDSI) Project and was

eResearch Programme Manager at the University of Sydney. Frankie has 20 years'

experience in the Higher Education Sector, with a background in Molecular Biology,

having worked in both the Australian and overseas university sectors.

- Carina Kemp is AARNet’s eResearch Director. Carina has worked across government,

industry and research. She is passionate about enabling and connecting innovative

people, innovation, team empowerment, bridging the gap between IT and everyone

else and engagement with stakeholders at all levels. Carina has a background in

Geosciences.

- Andrew Lonie is the Director of the Australian Biocommons. Andrew has a

background in molecular biology and computer science, and was appointed Head of

the Victorian Life Sciences Computation Initiative’s Life Sciences Computation Centre

in 2010 to create a multi-disciplinary centre of expertise in life sciences offering best

practice analyses, training and education. He was subsequently appointed Director

of the VLSCI in 2015, which underpinned the formation of Melbourne Bioinformatics

in 2017. Andrew is also the Director of Melbourne Bioinformatics and EMBL-ABR.

- Steven Manos is Associate Director of Cyberinfrastructure in the Australian

Biocommons. Previously, Steven was at the University of Melbourne as the Director of

Research Platform services. Steven has a background in Physics.

Big Internet Pipe and Cloud Saved My Storage in Crisis

Authors name: Dan Sun

Organisation: AgResearch

Authors Email: [email protected]

The current storage solutions in AgResearch are all based on Network Attached Storage

(NAS) technologies. They were simple, quick and cost-effective to deploy. In some instances, it

was even easy to scale up their capacity. However, individual fileservers have become

data silos and we regularly suffer from their limitations. This talk is based on an incident

caused by one of those limitations. It also covers how we recovered from it quickly by utilising

the Cloud, and our thoughts on our future storage platform.

Over one weekend in early October 2019, an unexpected amount of data was placed on one of

our user-accessible fileservers, pushing its utilisation over 85%. Consequently, its

performance started to degrade. Unfortunately, there was no other storage which had

enough spare capacity to offload this additional load in the same physical location.

We decided to remove some large datasets which had not been accessed by users for over 2

years to reclaim capacity quickly. At the same time, we had to maintain the same data

protection level (two separated copies of the same data stored in two different locations).

To achieve this objective, we uploaded a copy of such datasets’ offsite replicas to Microsoft

Azure Blob storage before removing the original copy from the server. Additionally, we also

configured the Cloud storage to automatically migrate data from the Cool tier to the Archive

tier after the data has been in the Cloud for 7 days. This significantly reduces the cost of storing

data in the Cloud for the long term, although we acknowledge the additional cost and time

for retrieving such data if that is required. We deem the probability of such an operation low;

it would only be necessary in a disaster recovery scenario.
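
As a rough illustration of the tiering rule described above, Azure Blob storage accepts lifecycle management policies written as JSON documents. The sketch below builds such a document in Python. The rule name is hypothetical, and the choice to apply the rule to block blobs based on days since last modification is an assumption; the abstract does not record the exact configuration that was used.

```python
import json


def build_archive_policy(days_before_archive: int, rule_name: str = "cool-to-archive") -> dict:
    """Build a lifecycle policy document that moves block blobs to the Archive tier.

    rule_name is a hypothetical label chosen for this sketch.
    """
    return {
        "rules": [
            {
                "enabled": True,
                "name": rule_name,
                "type": "Lifecycle",
                "definition": {
                    # Assumption: the rule targets ordinary block blobs.
                    "filters": {"blobTypes": ["blockBlob"]},
                    "actions": {
                        "baseBlob": {
                            # Archive once the blob is older than the threshold.
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": days_before_archive
                            }
                        }
                    },
                },
            }
        ]
    }


policy = build_archive_policy(7)
print(json.dumps(policy, indent=2))
```

The printed JSON could then be supplied to the storage account through whichever management interface is in use; that step is left out here because it depends on the local setup.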

We were extremely pleased by the performance of REANNZ’s network when we were

uploading data to Microsoft Azure’s instance in Australia. We were able to upload 2 TB of

data in just over 37 minutes, which translates to roughly 7 Gbps on average. The speed

of our WAN is 10 Gbps. It took us another 2 hours to remove the dataset from the fileserver

where we were running out of capacity. Overall, it took us just under 3 hours to stabilise

this fileserver and we think it was a fairly good outcome. After the initial crisis was over, we

uploaded a further 6 TB of data to the Cloud to reclaim capacity from the same fileserver. We

plan to use the same approach whenever we encounter similar issues in the short term until

we are able to replace our current generation storage solutions.
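
The quoted transfer rate can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and uses exactly 37 minutes; since the upload took "just over" 37 minutes, the true average would be marginally lower.

```python
def average_throughput_gbps(terabytes: float, minutes: float) -> float:
    """Average transfer rate in gigabits per second, using decimal units."""
    bits = terabytes * 10**12 * 8   # bytes -> bits
    seconds = minutes * 60
    return bits / seconds / 10**9   # bits per second -> Gbps


rate = average_throughput_gbps(2, 37)
link_share = rate / 10              # fraction of the 10 Gbps WAN link
print(f"{rate:.1f} Gbps, {link_share:.0%} of the link")
```

Under these assumptions the upload averaged a little over 7 Gbps, or about 72% of the 10 Gbps WAN link.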

Almost all of our storage solutions will reach their end of life in the next 12 to 24 months,

and we are currently planning a new-generation storage platform to replace them. From all the

lessons we have learned to date, we think a scale-out storage solution is a much better fit for

our purposes than NASs or fileservers. Based on our use of the Cloud, we are starting to see the value

of object stores, although we won’t be getting rid of our unstructured data stores (filesystems)

any time soon. It is our ambition to integrate the two through smart software. We also think

data replication is more practical and appropriate than the traditional backup/restore model

for the volume of data we have to keep. Lastly, the possibility to replicate data to

the Cloud is attractive, particularly the low-cost archival storage, but its high retrieval

overhead (both time and cost) is a risk that needs to be further investigated and mitigated.

ABOUT THE AUTHOR(S)

Dan is currently working for AgResearch as an HPC consultant and maintains a smallish Linux

cluster and storage. He is passionate about helping researchers to do science by using

advanced technologies. When he is not firefighting at work, he enjoys having barista-made

coffee, fancy burgers and donuts with his collaborators and friends.

Internationalisation of The Carpentries – Lessons learnt on the way

Riku Takei

University of Otago

[email protected]

At the end of 2018, there was growing interest in the internationalisation of the Carpentries

teaching material for non-English language speaking countries. As part of this community-

driven initiative, I became involved in the translation of the Software Carpentry materials

into Japanese. In this lightning talk, I would like to share my experience on organising,

managing, and collaborating with people living in Japan, using Git and GitHub.

ABOUT THE AUTHOR(S)

Riku Takei

I have been involved with The Carpentries since I began my MSc at the University of Otago;

first as a learner, then a helper, and finally as an instructor. I became an official Carpentries

instructor in 2018, and since then, I have been involved in the translation of The Carpentries

material into Japanese.

Influencing Data Culture to Optimise Data Utilisation

Lisa Thomasen

Fonterra Co-Operative Group Ltd

[email protected]

At Fonterra’s Research & Development Centre we have a range of data which describes both our dairy products and manufacturing processes. We want to preserve this data for the future and increase our opportunities for applications of analytics. To be successful this requires a significant shift in the data culture. This talk will outline the approaches we are taking to change the data culture in our research teams. We have started this process by conducting a data usage survey which allowed us to define our biggest data challenges and their position in the data life cycle. We have now built a team of data stewards to help us map out and execute solutions to the data challenges we’re facing to allow us to get optimal value out of our research data now and many years into the future. This talk will cover the work we have done with our data stewards over the past year and the next steps we have planned to achieve our data management vision. This includes our work to implement unique identifiers for all samples and our proposed metadata database.
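
To make the idea of unique sample identifiers and a metadata database concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not describe the identifier scheme, the database technology, or the schema, so the use of random UUIDs, the SQLite table, the column names, and the example values are all hypothetical.

```python
import sqlite3
import uuid


def create_store() -> sqlite3.Connection:
    """Create an in-memory metadata store with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE samples (
            sample_id    TEXT PRIMARY KEY,  -- globally unique identifier
            product      TEXT NOT NULL,
            process_step TEXT NOT NULL,
            collected_on TEXT NOT NULL      -- ISO 8601 date
        )
        """
    )
    return conn


def register_sample(conn: sqlite3.Connection, product: str,
                    process_step: str, collected_on: str) -> str:
    """Assign a new identifier to a sample and record its metadata."""
    sample_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO samples VALUES (?, ?, ?, ?)",
        (sample_id, product, process_step, collected_on),
    )
    conn.commit()
    return sample_id


# Invented example values, for illustration only.
conn = create_store()
first = register_sample(conn, "milk powder", "drying", "2019-11-04")
second = register_sample(conn, "milk powder", "evaporation", "2019-11-04")
rows = conn.execute(
    "SELECT sample_id FROM samples WHERE product = ?", ("milk powder",)
).fetchall()
print(len(rows), first != second)
```

Making the identifier the primary key means the database itself rejects duplicates, so every dataset that references a sample can be traced back to exactly one metadata record.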

ABOUT THE AUTHOR(S)

Lisa Thomasen has been working as a research statistician for the Fonterra Research &

Development Centre in Palmerston North for four years. Throughout this time, she has

dedicated a lot of resource towards data management and statistics training.

Humanities, Arts and Social Sciences: What have we learned, where are we going?

Alexis Tindall, Australian Research Data Commons

[email protected]

Ian Duncan, Australian Research Data Commons

[email protected]

The Humanities, Arts and Social Sciences (HASS) community covers an extraordinary breadth of research activities. In this Birds of a Feather (BoF) session, we explore Australian developments to support the communities under this wide umbrella. These communities can be divergent in approaches and objectives, but remain united in data and united in approaches to research support.

The Australian Government is poised to make a long-awaited investment to support the humanities, arts and social sciences community under the National Collaborative Research Infrastructure project. In preparation for this investment, they have commissioned the Australian Research Data Commons (ARDC) to map the data landscape relevant to the HASS community, relevant concurrent initiatives, and the role of existing research infrastructure in supporting those communities. While the data and research support landscape for HASS is rich, fragmented and diverse, the real challenge is in capturing the landscape of research communities and responding to their needs.

But what about the research communities? Under the umbrella of ‘HASS’ we cluster research as diverse as Urban Environments and Design, with Law, Classics and the Philosophy of Religion. How do we plan for a research infrastructure investment that can support such breadth of activity, diversity of approaches to data and sources, differing research ambitions? A HASS Research Data Commons has been identified as a model that could aid data-enabled HASS research. A Commons can be planned with flexibility and interoperability, to allow focussed support to benefit specific research communities in initial implementation, that can be extended to support related activities across new and emerging communities.

This 60 minute BoF seeks to share approaches to supporting HASS research communities, and learn from related regional initiatives. The BoF will open with a presentation on the Australian HASS data and data-enabled research landscape as captured in this project, and a sketch of our proposed responses. Discussion after the presentation will focus on challenges and opportunities in this area.

Questions for discussion include:

• How do we usefully characterise HASS-relevant data collections? How do we determine significance and support access in a responsible and sustainable way? How do we balance openness, access and the rights of communities contributing to those datasets?

• How do we accommodate commonalities and differences across HASS research communities? How do we strengthen opportunities for those conducting data-driven HASS research when they might be atypical of their field?

• How ready are HASS communities? How can we ensure they’re ready to make the most of new digital opportunities to amplify, enhance and supercharge their research?

ABOUT THE AUTHOR(S)

Alexis Tindall is a Senior Research Data Specialist at the Australian Research Data Commons, with a particular interest in supporting and enabling humanities, arts and social sciences research. She has extensive project management experience in diverse environments. Before joining the eResearch community, she worked in natural history and social history museums, and is passionate about digitisation and improving digital access to the nation’s treasured collections.

Ian Duncan is Director, eResearch Infrastructure & Services, at the Australian Research Data Commons and has many years experience in defining, developing, and running infrastructure to support high-impact research. Ian was previously Director of the NCRIS Research Data Storage project and has a keen interest in working with evolving communities in making more data more available to more people and looking at how to effectively incorporate citizen science, industry and government partners into our sector.

Engineering HPC: What’s going on?

Callum Walley

NeSI

[email protected]

Engineering Researchers are met with many unique challenges when scaling their research

on high performance computers.

In this presentation the current state of NeSI’s support for Engineers will be discussed, the

developments, challenges and what can be expected in the future.

ABOUT THE AUTHOR

Callum Walley is part of New Zealand eScience Infrastructure (NeSI) applications support

team, their main goal being to develop HPC capability and engagement within the

Engineering community.

Earth system modelling in New Zealand – turning big data in big science

Jonny Williams, Erik Behrens, Olaf Morgenstern, Mike Williams

NIWA, Wellington, NZ

[email protected]

After several years of development, the first results and papers showcasing the output from New Zealand’s earth system modelling community are now available. This represents a large body of behind-the-scenes work from multiple NIWA and NeSI staff, not to mention our international collaborators in the Unified Model partnership.

This is all very well, but how are we going about turning approximately 0.5 PB of raw model output into science which enables New Zealanders to ‘anticipate, adapt, manage risk, and thrive in a changing climate’? This is the mission statement of the Deep South National Science Challenge, through which this work is funded.

We are simulating three greenhouse gas emissions scenarios representative of an unknown future. From the model output, we can estimate how the world will warm. However, earth system models enable us to do a lot more than this. We can also examine changes to chemical processes in the atmosphere, biogeochemical processes in the ocean, as well as changes to the terrestrial biosphere.

I will discuss the theory and practice of turning this raw data into useful science in an HPC context. I will also present some early findings from our model, which differs from its parent model – the UKESM – in its ability to simulate the ocean circulation around Aotearoa New Zealand at high, ‘eddy permitting’ resolution.

ABOUT THE AUTHOR(S)

Jonny Williams moved to New Zealand in 2015 after a postdoc in physical geography at Bristol University studying extreme warm paleoclimates of the Cretaceous and Jurassic periods. Before this he worked in private practice as a junior consultant at Eunomia Research and Consulting and as a climate scientist at the UK Met Office. Jonny has a PhD in molecular electronics from the University of Bath and a degree in physics from Imperial College London.

Erik Behrens leads the ocean modelling project, to further improve NZESM, in the second phase of the Deep South. He has a PhD and degree in physical oceanography from the Christian-Albrechts University of Kiel, Germany. His main interest is to understand how oceans around New Zealand and around Antarctica change due to climate change.

Olaf Morgenstern is leading climate modelling at NIWA and for the Deep South National Science Challenge. Prior to joining NIWA he worked for Cambridge University in the UK and the Max-Planck-Institute for Meteorology in Hamburg, Germany. His main research interest is in the linkages between physical climate change and atmospheric composition. He holds a PhD in meteorology from ETH Zurich, Switzerland, and a physics degree from Freiburg University, Germany.

Mike Williams has been the director of the Deep South National Science Challenge since 2016. He obtained his PhD in polar oceanography from the University of Tasmania in 1999 and was an assistant professor at the Niels Bohr Institute for Physics in Copenhagen, Denmark for three. He joined NIWA in 2001 and has had various roles, including leading the climate observations programme and Antarctic research programmes.

Worldwide Trends in Computer Architectures for Data Science

Jeff Zais

NeSI

[email protected]

High performance computing architectures continue to evolve along several dimensions.

These changes are driven by the demand for more complex simulations and the ability to

create, handle, and analyse ever growing volumes of data.

This paper will focus on the state of the art in computer architectures designed to serve

large academic research communities in countries around the world. Prominent examples

will include NCI (Australia), LRZ (Germany), and SciNet (Canada).

Besides these examples that are in place, trends in technology will be summarized, to show

what can reasonably be expected in the next five years. This will include expected advances

in many of the key areas of computer architecture, including processors, memory,

networking, and storage. Particular emphasis will be placed on the rapidly evolving area of

storage technology.

ABOUT THE AUTHOR(S)

Jeff Zais recently joined NeSI and NIWA as the Senior High Performance Computing

Architect and Science Advisor. His academic background includes a B.S. degree from the

University of Wisconsin, and M.S. and Ph.D. degrees from Stanford University in Aerospace

Engineering. Professional experience includes technical and management roles at Ford

Aerospace, Cray Research, IBM, and Lenovo, focused on application performance and

system architecture.

Otago’s Network for Engagement And Research: Mapping Academic

Expertise and Connections

Sander Zwanenburg

University of Otago

[email protected]

Academic expertise is complex, dynamic, and often encoded in jargon (Auriol et al. 2013).

Academics move across institutions and increasingly change the topics of their research

(Zeng et al. 2019). They apply their expertise in writing that is often specific to a specialised

community.

This makes academic expertise hard to find and to understand. For example, it is difficult for

prospective postgraduate students of the University of Otago to find the right supervisor.

Likewise, organisations that require expertise for their R&D may not be able to identify

available experts. Even within universities, their schools and departments, an understanding

of expertise and its applications in collaboration is very limited, complicating the

management of expertise and the facilitation of its application.

Currently, to find or understand expertise, one might rely on social networks or digital

facilities such as search engines, websites, and academic databases. All of these carry

important shortcomings in the search for experts. For example, asking people in one’s social

network can be time-consuming and ineffective since individuals’ awareness of expertise in

their network quickly fades or becomes outdated when going beyond an intimate inner ring

of contacts (Hill and Dunbar 2003). Search engines are optimized for finding relevant pages

and documents, not experts (Dudek et al. 2007).

How can we map academic expertise, to make it easier to find and understand?

Our answer is a local and practical one. NEAR, the Network for Engagement And Research, is

an information system under development that aims to help its users find and understand

academic expertise in the University of Otago. Its proof of concept has been developed in

the Otago Business School, with data from and about academic staff. The vision is to build a

data warehouse around academics’ expertise and its social context, and to communicate

that data visually and interactively through a web application. This can be rolled out to all

other schools and divisions in the university.

Essentially, NEAR collects data, integrates and interprets it, and communicates this

information to its users. The data collected is a combination of user-inputted data and

existing data from other systems. Initially, NEAR only collected basic phonebook-type data

from Active Directory, another institutional system, and asked academics to put in detail

around their Fields of Research, research methods, Sustainable Development Goals,

collaborations, and the Fields of Research of their collaborations. The data input from

academic colleagues came with challenges. Initially this relied on an online survey that

quickly became so complicated that Qualtrics, the survey provider, had to change its

technical specifications. We later developed a custom-made profile system, based on a

LAMP stack, where people could log in, and fill out their details. One difficulty that remained

was that this required a push, not just one-off but continued over time, to get this data

actually collected. The data collection emphasis shifted to other systems that contained

data on people’s expertise, and that was maintained and updated elsewhere: the Research

Output Database, a Research Management Information System, the Media Expertise

Database, but also external databases like Elsevier’s Scopus and Clarivate’s Web of Science.

This shift meant that we started the development of data harvesting and integration

protocols. These were all developed in-house in R. They consisted of working with APIs and

the resulting output. A current challenge is to infer expertise based on the available

evidence. This evidence is based on data on different types of publications, grants, and

self-reports, which are linked to different classifications of research fields. These will overlap to

different extents and it is possible that not all fields of expertise are homogeneously reflected

in such evidence. Possibly, semantic fingerprints can be applied to enhance accuracy and

reduce reliance on particular classifications.

We have communicated our information about expertise and its social context through an

interactive web visualisation, as shown in the figure below. The visualisation is written with

an R package called visNetwork, which is then embedded in a Shiny app. It consists of a

network graph, where each node represents a staff member (colour coded for department)

or an external party (red), and each link represents an active collaboration. One can zoom

in, and hover over these elements to view their details. They are also searchable by name,

department, field of research (and their corresponding disciplines and sub-disciplines),

research method, and sustainable development goal through drop-down lists. A search

highlights the applicable nodes and edges. Highlighted staff members can be emailed with

the click of a button, making it easy to bring together people with similar research

expertise or interests.

Screenshot of NEAR’s network visualisation
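The search-and-highlight behaviour described above can be sketched in a few lines. This is a Python illustration, not the R/visNetwork code behind NEAR. The `Node` type, the sample people, and the rule that an edge is highlighted when it touches at least one matching node are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str
    department: str                  # drives colour coding; "external" for outside parties
    fields: frozenset = frozenset()  # fields of research attached to this person


def highlight(nodes, edges, field_of_research):
    """Return the ids of matching nodes and the edges touching any of them."""
    hit_ids = {n.id for n in nodes if field_of_research in n.fields}
    hit_edges = [(a, b) for a, b in edges if a in hit_ids or b in hit_ids]
    return hit_ids, hit_edges


# Made-up network: three staff members and one external party.
nodes = [
    Node("alice", "Information Science", frozenset({"Information Systems"})),
    Node("bob", "Marketing", frozenset({"Consumer Behaviour"})),
    Node("carol", "external", frozenset({"Information Systems"})),
    Node("dave", "Marketing", frozenset({"Consumer Behaviour"})),
]
edges = [("alice", "bob"), ("bob", "dave"), ("alice", "carol")]

hit_ids, hit_edges = highlight(nodes, edges, "Information Systems")
```

Here the search selects `alice` and `carol` and leaves the `bob`-`dave` collaboration unhighlighted, since neither of its endpoints matches.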

In the next stage of the project, we will further develop the data integration schemes,

enhance our algorithm to infer expertise based on this data, and update the interactive

visualisation to reflect these inferences. This visualisation should not only help users find

fitting experts, but also help them understand how these experts sit in a dynamic social

context. For example, given a field of expertise, do the experts form a close-knit group or

are they scattered around the university? Insights like these can enable targeted

actions, such as emailing a strategically positioned expert.
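One simple way to put a number on the "close-knit or scattered" question is the density of collaborations among the experts in a field: the share of possible expert pairs that actually collaborate. The Python sketch below is only an assumption about how such a measure could be computed. The function name and the sample data are hypothetical, and the abstract does not say that NEAR uses this measure.

```python
from itertools import combinations


def group_density(experts, collaborations):
    """Share of possible expert pairs that actually collaborate (0.0 to 1.0)."""
    experts = set(experts)
    if len(experts) < 2:
        return 0.0  # density is undefined for fewer than two experts; report 0.0
    links = {frozenset(pair) for pair in collaborations}
    pairs = list(combinations(sorted(experts), 2))
    linked = sum(1 for pair in pairs if frozenset(pair) in links)
    return linked / len(pairs)


# Made-up collaboration links between four people.
collaborations = [("ann", "ben"), ("ben", "cat"), ("ann", "cat"), ("cat", "dan")]

close_knit = group_density({"ann", "ben", "cat"}, collaborations)  # every pair is linked
scattered = group_density({"ann", "ben", "dan"}, collaborations)   # only ann-ben is linked
```

A value near 1 suggests a tightly connected group, while a value near 0 suggests experts who rarely work together.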

We believe that our approach has the potential to augment popular search engines in an

important yet local way. Current search engines are optimised for web pages and online

documents (e.g. Google), scholarly output (e.g. Google Scholar, Web of Science, Scopus), or

geographic information (e.g. Google Maps, Yelp). NEAR can offer deeper insights into the

expertise of individuals by combining institutional and public data. It has the potential to

allow its users not only to find the most fitting expert, but also to understand the structure

and dynamics of particular areas of expertise. Hopefully, in the future, this will help match

the demand for and supply of expertise, and identify opportunities to make fuller use of the

expertise people have developed over many years.

Acknowledgements

I thank Brian Spisak for co-leading this project and Caitlin Owen and Lahiru

Ariyasinghe for their development support. Many internal organisations have also

contributed to the initiative, including the Otago Business School, the Research

Support Unit of the Library, Information Technology Services, and Research & Enterprise.

Thank you.

References

Auriol, L., Misu, M., and Freeman, R. A. 2013. "Careers of Doctorate Holders."

Dudek, D., Mastora, A., and Landoni, M. 2007. "Is Google the Answer? A Study into Usability

of Search Engines," Library Review (56:3), pp. 224-233.

Hill, R. A., and Dunbar, R. I. 2003. "Social Network Size in Humans," Human Nature (14:1),

pp. 53-72.

Zeng, A., Shen, Z., Zhou, J., Fan, Y., Di, Z., Wang, Y., Stanley, H. E., and Havlin, S. 2019.

"Increasing Trend of Scientists to Switch between Topics," Nature Communications

(10:1), pp. 1-11.

ABOUT THE AUTHOR

Sander Zwanenburg is a Lecturer within the Department of Information Science. He

obtained Bachelor and Master of Science degrees from the University of Groningen, The

Netherlands, and a PhD degree in Management Information Systems from The University of

Hong Kong. Sander’s research interests lie in the fields of the psychology of IT use, the

development of metrics, and networks of knowledge. He has published in various

Information Systems venues such as the Australasian Journal of Information Systems,

Communications of the Association for Information Systems, and the proceedings of the

International Conference on Information Systems.
