HKU Data Curation MLIM7350 Class 7


Transcript of HKU Data Curation MLIM7350 Class 7

Page 1: HKU Data Curation MLIM7350 Class 7

Class 7…giant balancing 'if I have seen further it is by standing on the shoulders of giants'.

Scott Edmunds, HKU Data Curation MLIM7350

Page 2: HKU Data Curation MLIM7350 Class 7

Communicating in-class

• Chat channel: http://backchannelchat.com/chat/dw131
• Feel free to ask questions, requests to speed up/slow down

Also feel free to email: [email protected]

Page 3: HKU Data Curation MLIM7350 Class 7

About me:

• Scott Edmunds
• Molecular biology, sci editing & comms
• Scientific journal & (big) data publishing
• Reproducibility & open science
• Open Data Hong Kong & Citizen Science

Journal, data-platform and database for large-scale biological data

www.gigasciencejournal.com

Page 4: HKU Data Curation MLIM7350 Class 7

About me:

Page 5: HKU Data Curation MLIM7350 Class 7

About my employer:

• Formerly Beijing Genomics Institute
• Founded in 1999 (1% of HGP)
• China’s 1st citizen managed not-for-profit research institute funded by commercial sequencing-as-a-service (BGI Tech)
• Now largest genomic organization in the world
• HQ in Shenzhen, international data production in BGI HK (Tai Po)

Page 6: HKU Data Curation MLIM7350 Class 7

Open Data Hong Kong

ExCom member for Open Science

Open Science Working Group

Page 7: HKU Data Curation MLIM7350 Class 7
Page 8: HKU Data Curation MLIM7350 Class 7

WHY CURATE DATA?

RECAP

Page 9: HKU Data Curation MLIM7350 Class 7

WHY SHARE DATA?

Page 10: HKU Data Curation MLIM7350 Class 7

WHY SHARE DATA?

https://okfn.org/

Page 11: HKU Data Curation MLIM7350 Class 7

WHAT EXACTLY IS “OPEN DATA"?

Page 12: HKU Data Curation MLIM7350 Class 7

What is open data (公开数据 )?

http://opendefinition.org/od/2.0/en/

Page 13: HKU Data Curation MLIM7350 Class 7

OKFN: 8 types of open data

http://science.okfn.org/

Page 14: HKU Data Curation MLIM7350 Class 7
Page 15: HKU Data Curation MLIM7350 Class 7

Research Data ≈ Government Data

Canada's Action Plan on Open Government 2014-16

http://open.canada.ca/en/content/canadas-action-plan-open-government-2014-16

Page 16: HKU Data Curation MLIM7350 Class 7

Research Data policies growing globally

http://ec.europa.eu/research/openscience/index.cfm?section=monitor&pg=researchdata#1

Page 17: HKU Data Curation MLIM7350 Class 7

https://data.gov.hk

HK has “Public Sector Information"

Page 18: HKU Data Curation MLIM7350 Class 7

Why Licensing is Important for:

http://dx.doi.org/10.1186/1756-0500-5-494

Placing restrictions on the reuse of scientific information, particularly data, slows down the pace of research. Furthermore, legal requirements for attribution ingrained in licenses such as CC-BY can prohibit future research across large collections of content – as commonly happens in data mining.

Therefore, to eliminate legal impediments to integration and re-use of data, such as this stacking of attribution requirements in large collections of data, and to help enable long-term interoperability an appropriate license or waiver specific to data should be applied.

Page 19: HKU Data Curation MLIM7350 Class 7

Panton Principles

http://pantonprinciples.org/

=CC0 better than CC-BY for datasets to prevent “attribution stacking”

Page 20: HKU Data Curation MLIM7350 Class 7

Levels of openness: 5★’s of open data

http://5stardata.info

Page 21: HKU Data Curation MLIM7350 Class 7

Levels of openness: 5★’s of open data

http://5stardata.info

★ - make your stuff available on the Web (whatever format) under an open license

★★ - make it available as structured data (e.g., Excel instead of image scan of a table)

★★★ - make it available in a non-proprietary open format (e.g., CSV instead of Excel)

★★★★ - use URIs to denote things, so that people can point at your stuff

★★★★★ - link your data to other data to provide context
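
The five levels above are cumulative: each star assumes the ones below it. A minimal Python sketch of that reading follows. The `Dataset` fields and the `star_rating` function are hypothetical names invented for illustration, not part of the 5-star scheme itself.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical checklist for one dataset, one flag per star level."""
    on_web_open_license: bool   # 1 star: on the Web under an open license
    structured: bool            # 2 stars: structured data (e.g. Excel, not a scan)
    open_format: bool           # 3 stars: non-proprietary format (e.g. CSV)
    uses_uris: bool             # 4 stars: URIs denote things
    linked_to_other_data: bool  # 5 stars: linked to other data for context


def star_rating(ds: Dataset) -> int:
    """Count stars cumulatively, stopping at the first unmet level."""
    criteria = [
        ds.on_web_open_license,
        ds.structured,
        ds.open_format,
        ds.uses_uris,
        ds.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# Hypothetical examples: an openly licensed CSV without URIs, and an
# openly licensed image scan of a table.
csv_release = Dataset(True, True, True, False, False)
scanned_table = Dataset(True, False, False, False, False)
print(star_rating(csv_release), star_rating(scanned_table))
```

Under this reading, a dataset with no open license scores zero stars however well structured it is, because the first level is never met.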

Page 22: HKU Data Curation MLIM7350 Class 7

Levels of openness: 5★’s of open data

Exercise: What star rating is this data?

Example: Hong Kong: Dengue Mosquito Breeding Habitats
http://www.fehd.gov.hk/english/safefood/dengue_fever/images/montlyOvitrap_2003-2016.pdf
http://www.fehd.gov.hk/english/safefood/dengue_fever/

Static PDFs, images, not on data.gov.hk, no licensing information = ?

Page 23: HKU Data Curation MLIM7350 Class 7

Levels of openness: 5★’s of open data

http://5stardata.info

Exercise: What star rating is this data?

1. HK FEHD: Distribution of the number of live pigs sold at different auction prices on the day https://data.gov.hk/en-data/dataset/hk-fehd-fehdsh-daily-auction

2. Singapore: Dengue Mosquito Breeding Habitats https://data.gov.sg/dataset/dengue-mosquito-breeding-habitats

3. Linked Drug-Drug Interactions (LIDDI) https://datahub.io/dataset/linked-drug-drug-interactions-liddi

Page 24: HKU Data Curation MLIM7350 Class 7

Why closed data sucks?

https://commons.wikimedia.org/wiki/File:Inner_door_in_forbidden_city.jpg

Page 25: HKU Data Curation MLIM7350 Class 7

Hong Kong Edition

https://data.gov.hk

Gov't spend on open data platform = $1.2M

Gov't spend on 20 rubbish apps = $20M

https://www.hongkongfp.com/2015/09/14/public-finance-concern-group-raps-10-rubbish-govt-apps-one-has-only-10-downloads/

Why closed data sucks?

Page 26: HKU Data Curation MLIM7350 Class 7

What the Gov't builds for $20M What open data can build for free

http://gazetteer.hk/

Hong Kong Edition Why closed data sucks?

Page 27: HKU Data Curation MLIM7350 Class 7

Open Data as a revenue stream...

Hong Kong Edition Why closed data sucks?

Page 28: HKU Data Curation MLIM7350 Class 7

Open Data as a revenue stream means can't share conservation data...

Why closed data kills spoonbills?

Page 29: HKU Data Curation MLIM7350 Class 7

Climate change, global hunger, pollution, cancer, disease outbreaks…

http://www.nature.com/news/data-sharing-make-outbreak-research-open-access-1.16966

Why closed data kills people?

Page 30: HKU Data Curation MLIM7350 Class 7

Open Data as a revenue stream means can't share cancer data...

https://www.change.org/p/mark-c-capone-ceo-of-myriad-genetics-myriad-genetics-give-us-our-damn-brca-data

Why closed data kills women?

Page 31: HKU Data Curation MLIM7350 Class 7

Open Data as a revenue (publishing) stream means nobody is sharing ethnic Chinese control data to enable pharmacogenomics to work on Chinese populations...

Why closed data kills Chinese populations?

Page 32: HKU Data Curation MLIM7350 Class 7

THE REPRODUCIBILITY CRISIS

Page 33: HKU Data Curation MLIM7350 Class 7

How research is disseminated

[Timeline figure: 1665, 1812, 1869]

Page 34: HKU Data Curation MLIM7350 Class 7

Consequences of 351 year old incentive systems…

Buckheit & Donoho: Scholarly articles are merely advertisement of scholarship. The actual scholarly artifacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.

Page 35: HKU Data Curation MLIM7350 Class 7

The consequences: growing replication gap

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8)

Out of 18 microarray papers, results from 10 could not be reproduced

Page 36: HKU Data Curation MLIM7350 Class 7

1. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.1001747

The challenge: reproducibility

Page 38: HKU Data Curation MLIM7350 Class 7

Growing Issue: increasing number of retractions

>15X increase in last decade

Strong correlation of “retraction index” with higher impact factor

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

Page 39: HKU Data Curation MLIM7350 Class 7

Growing Issue: increasing number of retractions

>15X increase in last decade

Strong correlation of “retraction index” with higher impact factor

At the current rate of increase, by 2045 as many papers will be retracted as published!

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

Page 40: HKU Data Curation MLIM7350 Class 7

Problem: growing replication gap

1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

More retractions: >15X increase in last decade

At the current rate of increase, by 2045 as many papers will be retracted as published

Insufficient methods

Page 41: HKU Data Curation MLIM7350 Class 7

The Cost of Scientific Retractions?

A: $400,000 per paper

https://elifesciences.org/content/3/e02956

Page 42: HKU Data Curation MLIM7350 Class 7

Only policy that counts…IMPACT FACTOR

Impact Factor

Page 43: HKU Data Curation MLIM7350 Class 7

What is the journal Impact Factor (jIF)?

• Citation Index concept first developed by Eugene Garfield in 1955 (Science)

• Formed Institute of Scientific Information (ISI) in 1960

• Science Citation Index (SCI) launched in 1963.

• Web version (Web of Science) launched in 1997.

• ISI purchased by Thomson-Reuters in 1992.

• Sold as part of their Intellectual Property & Science portfolio in July 2016 for $3.55B USD to private equity funds.

https://commons.wikimedia.org/wiki/File:Eugene_Garfield_HD2007_Richard_J._Bolte_Sr._Award.TIF

Page 44: HKU Data Curation MLIM7350 Class 7

How do you calculate the jIF?

1. Count the total number of citations from the two years before the IF release year.

2. Count total number of papers published in the two years before IF release year

3. Divide number of citations by number of papers

2015 IF = (# citations for 2013-2014) / (# of papers in 2013-2014)

[Timeline: 2013, 2014, 2015]
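
The three steps above amount to a single division. The Python sketch below works through it with made-up numbers. The function name `journal_impact_factor` and the citation and paper counts are assumptions for illustration only, not figures for any real journal.

```python
def journal_impact_factor(citations_to_window: int, papers_in_window: int) -> float:
    """Divide citations to the two-year window by papers published in it."""
    if papers_in_window <= 0:
        raise ValueError("need at least one paper in the two-year window")
    return citations_to_window / papers_in_window


# Hypothetical journal: 1200 citations to its 2013-2014 papers,
# 400 papers published in 2013-2014.
jif_2015 = journal_impact_factor(citations_to_window=1200, papers_in_window=400)
print(f"2015 IF = {jif_2015:.1f}")

# The result is sensitive to what counts as a "paper": with the same
# citations but only 300 items counted in the denominator, the value rises.
jif_fewer_items = journal_impact_factor(citations_to_window=1200, papers_in_window=300)
print(f"2015 IF with a smaller denominator = {jif_fewer_items:.1f}")
```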

Page 45: HKU Data Curation MLIM7350 Class 7

1. Count the total number of citations from the two years before the IF release year.

2. Count total number of papers published in the two years before IF release year

3. Divide number of citations by number of papers

2015 IF = (# citations for 2013-2014) / (# of papers in 2013-2014)

[Timeline: 2013, 2014, 2015]

TWO PROBLEMS

Page 46: HKU Data Curation MLIM7350 Class 7

1. Count the total number of citations from the two years before the IF release year.

2. Count total number of papers published in the two years before IF release year

3. Divide number of citations by number of papers

2015 IF = (# citations for 2013-2014) / (# of papers in 2013-2014)

[Timeline: 2013, 2014, 2015]

TWO PROBLEMS

1. Rewards/incentivizes short term citations only

Page 47: HKU Data Curation MLIM7350 Class 7

[Timeline: 2013, 2014, 2015]

TWO PROBLEMS

1. Rewards/incentivizes short term citations only

Impact factor driven science =

Page 48: HKU Data Curation MLIM7350 Class 7

[Parody news-site figure: JIFBAIT Network, GWAS]

JIFBAIT NEWS

Arsenic Life forms, will they take over the planet?

By Melba Ketchum, PhD

Which Overhyped, Unreproducible Experiment Are You?

Want rapid citations for 2 years only? Carry out this quiz.

You got: STAP Cells

Of course dipping cells in coffee will make them pluripotent. Even if the research gets discredited, it’ll still get 100’s of citations in two years.

Page 49: HKU Data Curation MLIM7350 Class 7

1. Count the total number of citations from the two years before the IF release year.

2. Count total number of papers published in the two years before IF release year

3. Divide number of citations by number of papers

2015 IF = (# citations for 2013-2014) / (# of papers in 2013-2014)

[Timeline: 2013, 2014, 2015]

TWO PROBLEMS

2. How do you count denominator? Negotiated.

Page 50: HKU Data Curation MLIM7350 Class 7

Impact Factor correlates with:

https://quantixed.wordpress.com/2016/01/05/the-great-curve-ii-citation-distributions-and-reverse-engineering-the-jif/

Unreproducible Data. Missing papers.

Page 51: HKU Data Curation MLIM7350 Class 7

Impact Factor correlates with:

http://bjoern.brembs.net/2016/01/even-without-retractions-top-journals-publish-the-least-reliable-science/

Less reliable science.

Page 52: HKU Data Curation MLIM7350 Class 7

Impact Factor correlates with:

http://iai.asm.org/content/79/10/3855.full

Retraction Index

http://iai.asm.org/content/79/10/3855.full

Page 53: HKU Data Curation MLIM7350 Class 7

Growing # of journals addressing this

http://dx.doi.org/10.1371/journal.pmed.1001607

Page 54: HKU Data Curation MLIM7350 Class 7

QUANTIFYING REPRODUCIBILITY

Page 55: HKU Data Curation MLIM7350 Class 7

                   Data: Same      Data: Different
Code: Same         Reproducible    Replicable
Code: Different    Robust          Generalisable

https://figshare.com/articles/Publishing_a_reproducible_paper/4720996
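
The same/different grid on this slide can be read as a two-key lookup. The Python sketch below encodes it that way. The function name `classify_result` and its boolean parameters are hypothetical and chosen only for illustration.

```python
def classify_result(same_data: bool, same_code: bool) -> str:
    """Look up the term for a repeated analysis from the data/code grid."""
    grid = {
        (True, True): "Reproducible",     # same data, same code
        (False, True): "Replicable",      # different data, same code
        (True, False): "Robust",          # same data, different code
        (False, False): "Generalisable",  # different data, different code
    }
    return grid[(same_data, same_code)]


# Re-running the authors' own code on a newly collected dataset:
print(classify_result(same_data=False, same_code=True))
```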

Page 56: HKU Data Curation MLIM7350 Class 7

http://reproducibility.cs.arizona.edu/

Arizona Repeatability in Computer Science Experiment

• 2015 study examining the extent to which Computer Systems researchers share their research artifacts (code)
• NSF policies on sharing code since 2005
• Examined 613 papers from ACM conferences & journals
• Attempted to locate source code that backed up results
• If found, tried to build the code.

Page 57: HKU Data Curation MLIM7350 Class 7

http://reproducibility.cs.arizona.edu/

Arizona Repeatability in Computer Science Experiment

• Manual curation/look for code that backed up results
• If missing, emailed authors
• Chased if no reply
• If found, tried to build the code
• Resolve issues
• Survey results

Page 58: HKU Data Curation MLIM7350 Class 7

http://reproducibility.cs.arizona.edu/

613 papers tested

123 successful reproductions (20%)

Arizona Repeatability in Computer Science Experiment

Page 59: HKU Data Curation MLIM7350 Class 7

Questions? | 15 minute break

Page 60: HKU Data Curation MLIM7350 Class 7

The Hong Kong context

http://web.archive.org/web/20131127073400/http://openaccess.hk/about.html

Page 61: HKU Data Curation MLIM7350 Class 7

Asia’s Academic City?

8 Universities, many ranked top 50 worldwide

100K students (UG/PG/FT/PT)

1 major research funder (UGC/RGC)

Grant budget = $17.5 BN HKD/yr ($2.3BN USD)

UGC Policy: “Realization of making Hong Kong Asia's world city is only possible if it is based upon the platform of a very strong education and higher education sector.”

http://www.ugc.edu.hk/eng/ugc/policy/policy.htm

Page 62: HKU Data Curation MLIM7350 Class 7

Asia’s Academic City?

8 Universities, many ranked top 50 worldwide

100K students (UG/PG/FT/PT)

1 major research funder (UGC/RGC)

Grant budget = $17.5 BN HKD/yr ($2.3BN USD)

UGC Policy: “Realization of making Hong Kong Asia's world city is only possible if it is based upon the platform of a very strong education and higher education sector.”

http://www.ugc.edu.hk/eng/ugc/policy/policy.htm

Page 63: HKU Data Curation MLIM7350 Class 7

Data: WorldBank

R&D spending in HK amongst lowest in Developed World

Page 64: HKU Data Curation MLIM7350 Class 7

Hong Kong’s focus…“The plot earmarked for expansion of Hong Kong Science Park might now be used to build apartment blocks instead. Is the government backing down on its commitment to project Hong Kong as a major technology hub?” http://bit.ly/1TxCRj3

Page 65: HKU Data Curation MLIM7350 Class 7

“The plot earmarked for expansion of Hong Kong Science Park might now be used to build apartment blocks instead. Is the government backing down on its commitment to project Hong Kong as a major technology hub?” http://bit.ly/1TxCRj3

Hong Kong’s focus…

Page 66: HKU Data Curation MLIM7350 Class 7

https://osf.io/cgpzb/

Open Science (Open Access & Open Data) survey of Hong Kong

Any comments?

Page 67: HKU Data Curation MLIM7350 Class 7

Science & Technology players in HK

Political forum: Legislative Council (LegCo)
Policy makers: Government; Advisory Committee on Innovation and Technology; Innovation and Technology Bureau (ITB); Innovation and Technology Commission (ITC)
Financing: Government; EB; Private Sector; ITC -> ITF; Innov. & Tech. Venture Fund; RGC; UGC
Operators: Universities; Public Technology Support Organizations; Private Sector; R&D Centres; ASTRI
Facilitators: HKPC; HKTDC; HKSTPC; Cyberport; HKIB
Commercialization Agents: Business Enterprises; New High Tech Ventures; Multinational Corporations

Researched policy, collected case studies, FOI, interviewed many key players (funders, libraries, administrators…)

Page 68: HKU Data Curation MLIM7350 Class 7

HK: good with some parts of open…

http://hub.hku.hk/

Page 69: HKU Data Curation MLIM7350 Class 7

http://index.okfn.org/

HK: bad with the rest…

Page 70: HKU Data Curation MLIM7350 Class 7

https://data.gov.hk

HK: bad with the rest…

Page 71: HKU Data Curation MLIM7350 Class 7

Signatories to Berlin OA Declaration

Page 72: HKU Data Curation MLIM7350 Class 7

OA Policies in Hong Kong

Page 73: HKU Data Curation MLIM7350 Class 7

Hidden at the back of RGC guidelines

http://www.ugc.edu.hk/eng/doc/rgc/form/srfdp_sr2.pdf

Page 74: HKU Data Curation MLIM7350 Class 7

IR: infrastructure is (mostly) there

http://www.julac.org/?page_id=79

Page 75: HKU Data Curation MLIM7350 Class 7

IR: infrastructure is (mostly) there

http://repositories.webometrics.info/en/Asia/Hong%20Kong

Page 76: HKU Data Curation MLIM7350 Class 7

IR: infrastructure is (mostly) there

Page 77: HKU Data Curation MLIM7350 Class 7

No policies, Mo’ problems

Page 78: HKU Data Curation MLIM7350 Class 7

Q: How much is spent on Open/Closed Access in HK?

A: Nobody has any idea!

https://lists.okfn.org/pipermail/open-access/2014-May/001888.html

Page 79: HKU Data Curation MLIM7350 Class 7

In China publication + JIF = money = fraud

Attempts to “game the peer-review system on an industrial scale”

Companies offering authorship of papers made to order by “paper mills” [1]

Common ghostwriting of medical papers by pharma [2]

Guaranteed publication in JIF journal, often using fake referees, ID theft, etc.

1. http://www.scientificamerican.com/article/for-sale-your-name-here-in-a-prestigious-science-journal/
2. http://www.grassley.senate.gov/sites/default/files/about/upload/Senator-Grassley-Report.pdf

Page 80: HKU Data Curation MLIM7350 Class 7

1. http://dx.doi.org/10.1087/20110203
2. http://blog.thegrandlocus.com/2014/10/a-flurry-of-copycats-on-pubmed
3. http://www.scientificamerican.com/article/for-sale-your-name-here-in-a-prestigious-science-journal/

What is the cost of the jIF?

Buy / Sell:

JIF 2 = $10,000 USD
JIF 5 = $20,000 USD
C/N/S = $30,000 USD
JIF 10 = $1,500 USD

Page 81: HKU Data Curation MLIM7350 Class 7

1. http://www.scmp.com/comment/insight-opinion/article/1758662/china-must-restructure-its-academic-incentives-curb-research

Created by skewed incentive systems in China…

“While we are rightly proud of Hong Kong’s highly regarded and ranked universities system, we are not immune to the same pressures. While funders in Europe have moved away from using citation based metrics such as JIF in their research assessments, the Hong Kong University Grants Committee states in their Research Assessment Exercise guidelines that they may informally use it.”

Page 82: HKU Data Curation MLIM7350 Class 7

1. http://www.scmp.com/comment/insight-opinion/article/1758662/china-must-restructure-its-academic-incentives-curb-research

And this is now happening in Hong Kong too!

Buy:

JIF 2 = $8,000 USD
JIF 5 = $15,000 USD

Page 83: HKU Data Curation MLIM7350 Class 7

How to fight back: Sign DORA.

http://www.ascb.org/dora/

Page 84: HKU Data Curation MLIM7350 Class 7

Political forum: Legislative Council (LegCo)
Policy makers: Government; Advisory Committee on Innovation and Technology; Innovation and Technology Bureau (ITB); Innovation and Technology Commission (ITC)
Financing: Government; EB; Private Sector; ITC -> ITF; Innov. & Tech. Venture Fund; RGC; UGC
Operators: Universities; Public Technology Support Organizations; Private Sector; R&D Centres; ASTRI
Facilitators: HKPC; HKTDC; HKSTPC; Cyberport; HKIB
Commercialization Agents: Business Enterprises; New High Tech Ventures; Multinational Corporations

Who needs to provide leadership?

What new infrastructure do we need?

Science & Technology players in HK

Page 85: HKU Data Curation MLIM7350 Class 7

Who needs to provide leadership?
RGC/UGC & new ITB

What new infrastructure do we need?
New “HK Data Service”, stewardship & platforms

Science & Technology players in HK

Political forum: Legislative Council (LegCo)
Policy makers: Government; Advisory Committee on Innovation and Technology; Innovation and Technology Bureau (ITB); Innovation and Technology Commission (ITC)
Financing: Government; EB; Private Sector; ITC -> ITF; Innov. & Tech. Venture Fund; RGC; UGC
Operators: Universities; Public Technology Support Organizations; Private Sector; R&D Centres; ASTRI
Data Curators & Stewards (Libraries, OGCIO, Data Studio@SP)
Facilitators: HKPC; HKTDC; HKSTPC; Cyberport; HKIB
Data Disseminators (HARNET, data.gov.hk, "HK Data Service")
Commercialization Agents: Business Enterprises; New High Tech Ventures; Multinational Corporations
Downstream Users (Researchers, Innovators, Citizens)
Academic/commercial cloud

Page 86: HKU Data Curation MLIM7350 Class 7

If Government doesn’t act, Universities need to lead way

http://hub.hku.hk/advanced-search?location=crisdataset

Page 87: HKU Data Curation MLIM7350 Class 7

If Government doesn’t act, Universities need to lead way

http://www.rss.hku.hk/integrity/research-data-records-management

Page 88: HKU Data Curation MLIM7350 Class 7

First CRIS in HK, built upon ScholarsHub

http://hub.hku.hk/advanced-search?location=crisdataset

Page 89: HKU Data Curation MLIM7350 Class 7

First CRIS in HK, built upon ScholarsHub

http://lib.hku.hk/researchdata/rpg.htm

“Beginning with the September 2017 intake, all HKU research postgraduate (rpg) students have responsibility for 1) using a data management plan (DMP), where applicable, to describe the use of data in preparation for, or in the generation of their theses, and 2) depositing, where applicable, a dataset in the HKU Scholars Hub.”

Page 90: HKU Data Curation MLIM7350 Class 7

First CRIS in HK, built upon ScholarsHub

http://hub.hku.hk/advanced-search?location=crisdataset

Page 91: HKU Data Curation MLIM7350 Class 7

First CRIS in HK, built upon ScholarsHub

http://hub.hku.hk/advanced-search?location=crisdataset

Page 92: HKU Data Curation MLIM7350 Class 7

First CRIS in HK, built upon ScholarsHub

http://hub.hku.hk/advanced-search?location=crisdataset

CC-BY NC by default

Page 93: HKU Data Curation MLIM7350 Class 7

First CRIS in HK, built upon ScholarsHub

http://hub.hku.hk/advanced-search?location=crisdataset

Licensing T&Cs

Page 94: HKU Data Curation MLIM7350 Class 7

HK CRIS: Further reading/resources

https://youtu.be/focv1z3lpPI

RPg Students -- Instructions for Data:http://lib.hku.hk/researchdata/rpg.htm

Depositor's User Guide: http://lib.hku.hk/researchdata/deposit_page.htm

Seminar slides from HKU Libraryhttp://www.rss.hku.hk/integrity/rcr/rcr-info/seminars

See also ReShare video guide:

Page 95: HKU Data Curation MLIM7350 Class 7

The cost to Hong Kong of not doing this?

HK UGC grant budget = $17.5 billion HKD/yr (4% of Gov spending)

• Taking lowest reported reproducibility rates (11%) = >$15 billion wasted [1]
• Estimated loss of citation impact from not being OA = 50% ($8.75B?) [2]
• How much is the HK taxpayer losing through missing out on potential collaborations, wider engagement & unrepeatable work?

1. http://www.nature.com/nature/journal/v483/n7391/full/483531a.html
2. http://www.ecs.soton.ac.uk/~harnad/Temp/research-australia.doc
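
The back-of-envelope figures on this slide can be reproduced with two multiplications. The Python sketch below does so. The function and constant names are hypothetical, and applying each rate to the whole annual budget is the slide's own simplifying assumption rather than an established method.

```python
UGC_BUDGET_HKD = 17.5e9        # annual grant budget quoted on the slide
LOWEST_REPRODUCIBILITY = 0.11  # lowest reported reproducibility rate
OA_CITATION_LOSS = 0.50        # estimated citation impact lost by not being OA


def wasted_on_unreproducible(budget: float, reproducibility: float) -> float:
    """Share of the budget spent on work that cannot be reproduced."""
    return budget * (1 - reproducibility)


def lost_citation_impact(budget: float, loss_fraction: float) -> float:
    """Budget-equivalent value of the citation impact that is lost."""
    return budget * loss_fraction


wasted = wasted_on_unreproducible(UGC_BUDGET_HKD, LOWEST_REPRODUCIBILITY)
lost = lost_citation_impact(UGC_BUDGET_HKD, OA_CITATION_LOSS)
print(f"Unreproducible work: about {wasted / 1e9:.1f} billion HKD per year")
print(f"Lost citation impact: about {lost / 1e9:.2f} billion HKD per year")
```

At an 11% reproducibility rate the first figure comes to roughly 15.6 billion HKD, consistent with the ">$15 billion" on the slide.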

Page 96: HKU Data Curation MLIM7350 Class 7

https://osf.io/cgpzb/

Open Science (Open Access & Open Data) survey of Hong Kong

Reading/Reflection for next class

Thoughts and ideas why Hong Kong is lagging behind US/EU?

Any ideas what we need to do to move forward?

Any feedback on the survey?

Page 97: HKU Data Curation MLIM7350 Class 7

QUANTIFYING REPRODUCIBILITY IN HK

Page 98: HKU Data Curation MLIM7350 Class 7

HKU Repeatability in HK Research Experiment

• HKU policy on data sharing from 2015
• PLOS policy mandating sharing of supporting data from March 1, 2014
• HKU has published 267 PLOS ONE papers 2014-date
• Can we quantify reproducibility in a sample of these?
• Easy exercise in literature curation
• 2016 HKU PLOS publications = 49 papers

http://hub.hku.hk/simple-search?query=&location=publication&sort_by=bi_sort_2_sort&order=asc&rpp=25&filter_field_1=journal&filter_type_1=equals&filter_value_1=plos+one&filter_field_2=dateIssued&filter_type_2=equals&filter_value_2=[2014+TO+2017]&filter_field_3=dctype&filter_type_3=equals&filter_value_3=article&etal=0&filtername=dateIssued&filterquery=2016&filtertype=equals

Page 99: HKU Data Curation MLIM7350 Class 7

HKU Repeatability in HK Research Experiment

• Everyone assigned 5 2016 HKU PLOS papers
• Quickly scan paper looking for supporting data
• If no data, ignore
• If uses data, is it all associated with the paper?
• If external data, is it available from URL or accession?
• If “data available on request”, are they contactable?
• Don’t spend more than 5 mins per article
• Add data into googledoc, and we’ll go through results & feedback next class

Homework/Case study: literature curation exercise

Page 101: HKU Data Curation MLIM7350 Class 7

HKU Repeatability in HK Research Experiment

Example 1.

Is there data presented in the paper? – Yes

Is there external data, and if so what is the link/accession? – No

Is all the data in the paper available? – No

Comments - Has questionnaire, but not data as says “minimal anonymized dataset will be made available upon request”

Enter data here: https://docs.google.com/spreadsheets/d/15BszEhUodygyu4eGckR2b5p153nyeYmB3Uh4U23HX-o/edit?usp=sharing
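
The questions in Example 1 map onto one spreadsheet row per paper. The Python sketch below shows one possible way to record and tally such rows. The `CurationRecord` fields, the status labels and the `paper_id` value are hypothetical and do not describe the layout of the shared spreadsheet.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CurationRecord:
    """One row of the literature curation exercise (hypothetical layout)."""
    paper_id: str
    has_data: bool                # Is there data presented in the paper?
    external_link: Optional[str]  # Link/accession for external data, if any
    all_data_available: bool      # Is all the data in the paper available?
    comments: str = ""


def availability_status(rec: CurationRecord) -> str:
    """Reduce a record to a coarse status that can be counted."""
    if not rec.has_data:
        return "no data (ignore)"
    if rec.all_data_available:
        return "all data available"
    return "data not fully available"


def tally(records: List[CurationRecord]) -> Counter:
    """Count how many papers fall into each status."""
    return Counter(availability_status(r) for r in records)


# Example 1 from the slide, with a placeholder identifier.
example_1 = CurationRecord(
    paper_id="example-1",
    has_data=True,
    external_link=None,
    all_data_available=False,
    comments="Has questionnaire, but dataset only available upon request",
)
print(tally([example_1]))
```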

Page 103: HKU Data Curation MLIM7350 Class 7

Final Project

• For the final project for this course, you can choose from 3 assignment options.
• The assignment is due on the 15th May and it is worth 40% of your grade.
• Time will be set aside for presenting a provisional draft of this during the final class on the 24th April.

Page 104: HKU Data Curation MLIM7350 Class 7

Final Project: Option 1

Write an Annotated Bibliography about data curation practices in an academic discipline of your choosing.

• Choose a discipline (sciences, social sciences, & humanities) OR choose the topic of “open data.”
• Summarize data practices in your chosen discipline or topic. (5-7 sentences)
• Find 7-10 sources that relate that discipline or topic to data creation, management, and/or curation.
• Provide a citation for the source in APA style.
• Write a short annotation that summarizes the content of the source. You may include quotes from the source sparingly, but the annotations should be mostly, if not entirely, in your own words. (3-5 sentences)
• Explain the relevance of the source with relation to the data practices of your chosen discipline or topic. (1-2 sentences)
• Find a few example public datasets to demonstrate the above points. Cite the data in the relevant places in the Bibliography according to the Data Citation Principles.
• Refer to this guide for more information about annotated bibliographies: http://sites.umuc.edu/library/libhow/bibliography_tutorial.cfm. Your annotation should be in the “Descriptive” style.

Page 105: HKU Data Curation MLIM7350 Class 7

Final Project: Option 2

Using a relevant dataset (this can either be from the literature curation exercise, a BYO dataset, or one given to you), write a report that includes a description of the dataset, a Data Management Plan, and a guidelines document for the researcher(s).

• Describe the dataset, explaining the form of the data and the academic discipline in which it was created. This paragraph should provide context for the rest of the report. (3-5 sentences)
• 1-2 page Data Management Plan (DMP) following the guidelines from HKU or a granting body such as NSF.
• 1 page guidelines document that could be presented to the researcher(s) that provides guidelines for their data (extant and forthcoming):
  – Preservation
  – Appraisal
  – Documentation
• For the DMP and the guidelines document, you can extrapolate from your dataset to imagine additional details about the research practices that created the dataset and will create more data in the future.
• Look for suitable data repositories that can host this data (institutional, general purpose, or subject specific), and if there is one relevant then publish the data if you have permission, and correctly cite the data in the relevant places in your report.

Page 106: HKU Data Curation MLIM7350 Class 7

Final Project: Option 3

Prepare a 30 minute data curation workshop that you could teach to researchers that would provide them the necessary details to understand why data curation is relevant to them and best practices they should follow.

• Slide deck that introduces data curation for a researcher audience. (No more than 40 slides.)
• Presenter outline that describes the important points for each slide.
• Topics that might be addressed in your workshop: the value of data management, writing a data management plan, data repository options. You can assume your audience is researchers at HKU.
• Make sure all of the content is copyright free, and share the final material openly (e.g. figshare, scholarhub, OER commons, etc.), and with sufficient metadata to make it discoverable.

Page 107: HKU Data Curation MLIM7350 Class 7

Looking ahead…

• Next class on Monday 27th March we’ll go from open to FAIR data
• We’ll also go through the reflection & curation case studies
  – Bring ideas & feedback, and we’ll look at the data
• Final project due 10th May
  – Need to present preliminary version on 26th April to get feedback before completion