HKU Data Curation MLIM7350 Class 7
Class 7… giant balancing
'If I have seen further it is by standing on the shoulders of giants.'
Scott Edmunds, HKU Data Curation MLIM7350
Communicating in-class
• Chat channel: http://backchannelchat.com/chat/dw131
• Feel free to ask questions or request that I speed up/slow down
Also feel free to email: [email protected]
About me:
• Scott Edmunds
• Molecular biology, sci editing & comms
• Scientific journal & (big) data publishing
• Reproducibility & open science
• Open Data Hong Kong & Citizen Science
Journal, data-platform and database for large-scale biological data
www.gigasciencejournal.com
About my employer:
• Formerly Beijing Genomics Institute
• Founded in 1999 (1% of HGP)
• China’s 1st citizen-managed not-for-profit research institute, funded by commercial sequencing-as-a-service (BGI Tech)
• Now largest genomic organization in the world
• HQ in Shenzhen, international data production in BGI HK (Tai Po)
Open Data Hong Kong
ExCom member for Open Science
Open Science Working Group
WHY CURATE DATA?
RECAP
WHY SHARE DATA?
WHAT EXACTLY IS “OPEN DATA”?
What is open data (公开数据 )?
http://opendefinition.org/od/2.0/en/
Research Data ≈ Government Data
Canada's Action Plan on Open Government 2014-16
http://open.canada.ca/en/content/canadas-action-plan-open-government-2014-16
Research Data policies growing globally
http://ec.europa.eu/research/openscience/index.cfm?section=monitor&pg=researchdata#1
Why Licensing is Important for:
http://dx.doi.org/10.1186/1756-0500-5-494
Placing restrictions on the reuse of scientific information, particularly data, slows down the pace of research. Furthermore, legal requirements for attribution ingrained in licenses such as CC-BY can prohibit future research across large collections of content – as commonly happens in data mining.
Therefore, to eliminate legal impediments to integration and re-use of data, such as this stacking of attribution requirements in large collections of data, and to help enable long-term interoperability an appropriate license or waiver specific to data should be applied.
Panton Principles
http://pantonprinciples.org/
= CC0 is better than CC-BY for datasets, to prevent “attribution stacking”
Levels of openness: 5★’s of open data
http://5stardata.info
★ - make your stuff available on the Web (whatever format) under an open license
★★ - make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ - make it available in a non-proprietary open format (e.g., CSV as well as Excel)
★★★★ - use URIs to denote things, so that people can point at your stuff
★★★★★ - link your data to other data to provide context
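The star levels above can be read as a cumulative checklist. The following is a minimal sketch of that reading, assuming each star requires all the stars below it; the class and field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetDescription:
    """Hypothetical checklist for one dataset (field names are illustrative)."""
    on_web_open_license: bool   # 1 star: on the Web under an open license
    structured: bool            # 2 stars: structured data, not a scan or image
    open_format: bool           # 3 stars: non-proprietary open format (e.g. CSV)
    uses_uris: bool             # 4 stars: URIs denote things
    linked: bool                # 5 stars: linked to other data for context


def star_rating(d: DatasetDescription) -> int:
    """Count stars, stopping at the first unmet level (assumed cumulative)."""
    criteria = [
        d.on_web_open_license,
        d.structured,
        d.open_format,
        d.uses_uris,
        d.linked,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# A spreadsheet published under an open license, but only as Excel:
excel_only = DatasetDescription(True, True, False, False, False)
print(star_rating(excel_only))  # 2
```

Under this reading, a dataset with no open license scores zero regardless of its format, which is worth keeping in mind for the exercises that follow.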
Levels of openness: 5★’s of open data
Exercise: What star rating is this data?
Example: Hong Kong: Dengue Mosquito Breeding Habitats
http://www.fehd.gov.hk/english/safefood/dengue_fever/images/montlyOvitrap_2003-2016.pdf
http://www.fehd.gov.hk/english/safefood/dengue_fever/
Static PDFs, images, not on data.gov.hk, no licensing information = ?
Levels of openness: 5★’s of open data
http://5stardata.info
Exercise: What star rating is this data?
1. HK FEHD: Distribution of the number of live pigs sold at different auction prices on the day https://data.gov.hk/en-data/dataset/hk-fehd-fehdsh-daily-auction
2. Singapore: Dengue Mosquito Breeding Habitats https://data.gov.sg/dataset/dengue-mosquito-breeding-habitats
3. Linked Drug-Drug Interactions (LIDDI) https://datahub.io/dataset/linked-drug-drug-interactions-liddi
Why closed data sucks?
https://commons.wikimedia.org/wiki/File:Inner_door_in_forbidden_city.jpg
Hong Kong Edition
https://data.gov.hk
Gov't spend on open data platform = $1.2M
Gov't spend on 20 rubbish apps = $20M
https://www.hongkongfp.com/2015/09/14/public-finance-concern-group-raps-10-rubbish-govt-apps-one-has-only-10-downloads/
Why closed data sucks?
What the Gov't builds for $20M
What open data can build for free
http://gazetteer.hk/
Hong Kong Edition Why closed data sucks?
Open Data as a revenue stream...
Hong Kong Edition Why closed data sucks?
Open Data as a revenue stream means can't share conservation data...
Why closed data kills spoonbills?
Climate change, global hunger, pollution, cancer, disease outbreaks…
http://www.nature.com/news/data-sharing-make-outbreak-research-open-access-1.16966
Why closed data kills people?
Open Data as a revenue stream means can't share cancer data...
https://www.change.org/p/mark-c-capone-ceo-of-myriad-genetics-myriad-genetics-give-us-our-damn-brca-data
Why closed data kills women?
Open Data as a revenue (publishing) stream means nobody is sharing ethnic Chinese control data to enable pharmacogenomics to work on Chinese populations...
Why closed data kills Chinese populations?
THE REPRODUCIBILITY CRISIS
How research is disseminated
[Timeline: 1665, 1812, 1869]
Consequences of 351 year old incentive systems…
Buckheit & Donoho: Scholarly articles are merely advertisement of scholarship. The actual scholarly artifacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.
The consequences: growing replication gap
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results from 10 could not be reproduced
1. http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.1001747
The challenge: reproducibility
Replication rates as low as 11%
http://www.nature.com/nature/journal/v483/n7391/full/483531a.html
https://osf.io/e81xl/wiki/home/
Growing Issue: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with higher impact factor
At the current rate of increase, by 2045 as many papers will be retracted as published!
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
Problem: growing replication gap
1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
More retractions: >15X increase in last decade
At the current rate of increase, by 2045 as many papers will be retracted as published
Insufficient methods
The Cost of Scientific Retractions?
A: $400,000 per paper
https://elifesciences.org/content/3/e02956
Only policy that counts…IMPACT FACTOR
Impact Factor
What is the journal Impact Factor (jIF)?
• Citation Index concept first developed by Eugene Garfield in 1955 (Science)
• Formed Institute for Scientific Information (ISI) in 1960
• Science Citation Index (SCI) launched in 1963.
• ISI purchased by Thomson (later Thomson Reuters) in 1992.
• Web version (Web of Science) launched in 1997.
• Sold as part of Thomson Reuters' Intellectual Property & Science portfolio in July 2016 for $3.55B USD to private equity funds.
https://commons.wikimedia.org/wiki/File:Eugene_Garfield_HD2007_Richard_J._Bolte_Sr._Award.TIF
How do you calculate the jIF?
1. Count the citations received in the IF year to papers published in the two preceding years.
2. Count the total number of papers published in those two years.
3. Divide the number of citations by the number of papers.
2015 IF = (# citations in 2015 to papers from 2013-2014) / (# of papers published in 2013-2014)
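As a worked example of the division above, here is a minimal sketch in Python. The function name and the journal figures are hypothetical; real impact factors also depend on which items are counted as "papers", which this sketch ignores.

```python
def journal_impact_factor(citations_to_window: int, papers_in_window: int) -> float:
    """Citations to the two-year window divided by papers published in it."""
    if papers_in_window <= 0:
        raise ValueError("the journal must have published at least one paper")
    return citations_to_window / papers_in_window


# Hypothetical journal: 300 papers published in 2013-2014,
# which together attracted 1,200 citations counted for the 2015 IF.
if_2015 = journal_impact_factor(citations_to_window=1200, papers_in_window=300)
print(if_2015)  # 4.0
```

Note that the result is a mean over a two-year window, so a handful of very highly cited papers can dominate it.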
TWO PROBLEMS
1. Rewards/incentivizes short-term citations only
Impact factor driven science =
JIFBAIT NEWS
Arsenic Life forms, will they take over the planet?
By Melba Ketchum, PhD
Which Overhyped, Unreproducible Experiment Are You?
Want rapid citations for 2 years only? Carry out this quiz.
You got: STAP Cells
Of course dipping cells in coffee will make them pluripotent. Even if the research gets discredited, it’ll still get 100’s of citations in two years.
TWO PROBLEMS
2. How do you count the denominator? It is negotiated.
Impact Factor correlates with:
https://quantixed.wordpress.com/2016/01/05/the-great-curve-ii-citation-distributions-and-reverse-engineering-the-jif/
Unreproducible Data. Missing papers.
Impact Factor correlates with:
http://bjoern.brembs.net/2016/01/even-without-retractions-top-journals-publish-the-least-reliable-science/
Less reliable science.
Impact Factor correlates with:
http://iai.asm.org/content/79/10/3855.full
Retraction Index
Growing # of journals addressing this
http://dx.doi.org/10.1371/journal.pmed.1001607
QUANTIFYING REPRODUCIBILITY
                   Data: Same      Data: Different
Code: Same         Reproducible    Replicable
Code: Different    Robust          Generalisable
https://figshare.com/articles/Publishing_a_reproducible_paper/4720996
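The same-or-different grid for data and code can be expressed as a simple lookup. This is a minimal sketch; the function and variable names are hypothetical, and the four terms are taken from the grid above.

```python
# Keys are (same_data, same_code) pairs; values are the terms from the grid.
TERMS = {
    (True, True): "Reproducible",     # same data, same code
    (False, True): "Replicable",      # different data, same code
    (True, False): "Robust",          # same data, different code
    (False, False): "Generalisable",  # different data, different code
}


def classify_rerun(same_data: bool, same_code: bool) -> str:
    """Name the property demonstrated when a result holds under a re-run."""
    return TERMS[(same_data, same_code)]


print(classify_rerun(same_data=True, same_code=True))    # Reproducible
print(classify_rerun(same_data=False, same_code=False))  # Generalisable
```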
http://reproducibility.cs.arizona.edu/
Arizona Repeatability in Computer Science Experiment
• 2015 study examining the extent to which Computer Systems researchers share their research artifacts (code)
• NSF policies on sharing code since 2005
• Examined 613 papers from ACM conferences & journals
• Attempted to locate source code that backed up results
• If found, tried to build the code.
http://reproducibility.cs.arizona.edu/
Arizona Repeatability in Computer Science Experiment
• Manual curation/look for code that backed up results
• If missing, emailed authors
• Chased if no reply
• If found, tried to build the code
• Resolve issues
• Survey results
http://reproducibility.cs.arizona.edu/
613 papers tested
123 successful reproductions (20%)
Arizona Repeatability in Computer Science Experiment
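The headline percentage follows directly from the two counts on the slide. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def success_rate(successes: int, attempts: int) -> float:
    """Percentage of attempts that succeeded, rounded to one decimal place."""
    return round(100 * successes / attempts, 1)


# Counts as reported on the slide: 123 successful builds out of 613 papers.
print(success_rate(123, 613))  # 20.1
```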
Questions? | 15 minute break
The Hong Kong context
http://web.archive.org/web/20131127073400/http://openaccess.hk/about.html
Asia’s Academic City?
8 Universities, many ranked top 50 worldwide
100K students (UG/PG/FT/PT)
1 major research funder (UGC/RGC)
Grant budget = $17.5 BN HKD/yr ($2.3BN USD)
UGC Policy: “Realization of making Hong Kong Asia's world city is only possible if it is based upon the platform of a very strong education and higher education sector.”
http://www.ugc.edu.hk/eng/ugc/policy/policy.htm
Data: WorldBank
R&D spending in HK amongst lowest in Developed World
Hong Kong’s focus…
“The plot earmarked for expansion of Hong Kong Science Park might now be used to build apartment blocks instead. Is the government backing down on its commitment to project Hong Kong as a major technology hub?” http://bit.ly/1TxCRj3
https://osf.io/cgpzb/
Open Science (Open Access & Open Data) survey of Hong Kong
Any comments?
Science & Technology players in HK
Political forum: Legislative Council (LegCo)
Policy makers: Government; Advisory Committee on Innovation and Technology; Innovation and Technology Bureau (ITB); Innovation and Technology Commission (ITC)
Financing: Government; EB; Private Sector; ITC -> ITF; Innov. & Tech. Venture Fund; RGC; UGC
Operators: Universities; Public Technology Support Organizations; Private Sector; R&D Centres; ASTRI
Facilitators: HKPC; HKTDC; HKSTPC; Cyberport; HKIB
Commercialization Agents: Business Enterprises; New High Tech Ventures; Multinational Corporations
Researched policy, collected case studies, FOI, interviewed many key players (funders, libraries, administrators…)
Signatories to Berlin OA Declaration
OA Policies in Hong Kong
Hidden at the back of RGC guidelines
http://www.ugc.edu.hk/eng/doc/rgc/form/srfdp_sr2.pdf
IR: infrastructure is (mostly) there
http://www.julac.org/?page_id=79
IR: infrastructure is (mostly) there
http://repositories.webometrics.info/en/Asia/Hong%20Kong
IR: infrastructure is (mostly) there
No policies, Mo’ problems
Q: How much is spent on Open/Closed Access in HK?
A: Nobody has any idea!
https://lists.okfn.org/pipermail/open-access/2014-May/001888.html
In China publication + JIF = money = fraud
Attempts to “game the peer-review system on an industrial scale”
1. http://www.scientificamerican.com/article/for-sale-your-name-here-in-a-prestigious-science-journal/
2. http://www.grassley.senate.gov/sites/default/files/about/upload/Senator-Grassley-Report.pdf
Companies offering authorship of papers made to order by “paper mills” (1)
Common ghostwriting of medical papers by pharma (2)
Guaranteed publication in JIF journal, often using fake referees, ID theft, etc.
1. http://dx.doi.org/10.1087/20110203
2. http://blog.thegrandlocus.com/2014/10/a-flurry-of-copycats-on-pubmed
3. http://www.scientificamerican.com/article/for-sale-your-name-here-in-a-prestigious-science-journal/
What is the cost of the jIF?
JIF 2 = $10,000 USD
JIF 5 = $20,000 USD
Buy / Sell
C/N/S = $30,000 USD
JIF 10 = $1,500 USD
1. http://www.scmp.com/comment/insight-opinion/article/1758662/china-must-restructure-its-academic-incentives-curb-research
Created by skewed incentive systems in China…
“While we are rightly proud of Hong Kong’s highly regarded and ranked universities system, we are not immune to the same pressures. While funders in Europe have moved away from using citation based metrics such as JIF in their research assessments, the Hong Kong University Grants Committee states in their Research Assessment Exercise guidelines that they may informally use it.”
1. http://www.scmp.com/comment/insight-opinion/article/1758662/china-must-restructure-its-academic-incentives-curb-research
And this is now happening in Hong Kong too!
JIF 2 = $8,000 USD
JIF 5 = $15,000 USD
Buy
Who needs to provide leadership?
What new infrastructure do we need?
Science & Technology players in HK
Who needs to provide leadership?
RGC/UGC & new ITB
What new infrastructure do we need?
New “HK Data Service”, stewardship & platforms
Science & Technology players in HK
Political forum: Legislative Council (LegCo)
Policy makers: Government; Advisory Committee on Innovation and Technology; Innovation and Technology Bureau (ITB); Innovation and Technology Commission (ITC)
Financing: Government; EB; Private Sector; ITC -> ITF; Innov. & Tech. Venture Fund; RGC; UGC
Operators: Universities; Public Technology Support Organizations; Private Sector; R&D Centres; ASTRI
Data Curators & Stewards: Libraries, OGCIO, Data Studio@SP
Facilitators: HKPC; HKTDC; HKSTPC; Cyberport; HKIB
Data Disseminators: HARNET, data.gov.hk, "HK Data Service"
Commercialization Agents: Business Enterprises; New High Tech Ventures; Multinational Corporations
Downstream Users: Researchers, Innovators, Citizens
Academic/commercial cloud
If Government doesn’t act, Universities need to lead way
http://hub.hku.hk/advanced-search?location=crisdataset
If Government doesn’t act, Universities need to lead way
http://www.rss.hku.hk/integrity/research-data-records-management
First CRIS in HK, built upon ScholarsHub
http://hub.hku.hk/advanced-search?location=crisdataset
First CRIS in HK, built upon ScholarsHub
http://lib.hku.hk/researchdata/rpg.htm
“Beginning with the September 2017 intake, all HKU research postgraduate (rpg) students have responsibility for 1) using a data management plan (DMP), where applicable, to describe the use of data in preparation for, or in the generation of their theses, and 2) depositing, where applicable, a dataset in the HKU Scholars Hub.”
First CRIS in HK, built upon ScholarsHub
http://hub.hku.hk/advanced-search?location=crisdataset
CC-BY NC by default
First CRIS in HK, built upon ScholarsHub
http://hub.hku.hk/advanced-search?location=crisdataset
Licensing T&Cs
HK CRIS: Further reading/resources
RPg Students -- Instructions for Data: http://lib.hku.hk/researchdata/rpg.htm
Depositor's User Guide: http://lib.hku.hk/researchdata/deposit_page.htm
Seminar slides from HKU Library: http://www.rss.hku.hk/integrity/rcr/rcr-info/seminars
See also ReShare video guide: https://youtu.be/focv1z3lpPI
The cost to Hong Kong of not doing this?
HK UGC grant budget = $17.5 billion HKD/yr (4% of Gov spending)
• Taking the lowest reported reproducibility rate (11%) = >$15 billion wasted (1)
• Estimated loss of citation impact from not being OA = 50% ($8.75B?) (2)
• How much is the HK taxpayer losing through missing out on potential collaborations, wider engagement & unrepeatable work?
1. http://www.nature.com/nature/journal/v483/n7391/full/483531a.html
2. http://www.ecs.soton.ac.uk/~harnad/Temp/research-australia.doc
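The two headline figures on this slide are simple multiples of the grant budget. A minimal sketch of the arithmetic, assuming (as the slide does) that the rates apply uniformly to the whole budget; the constant and function names are hypothetical.

```python
UGC_BUDGET_HKD_BN = 17.5        # annual grant budget quoted on the slide
LOWEST_REPRO_RATE = 0.11        # lowest reported reproducibility rate
OA_CITATION_LOSS = 0.50         # estimated citation impact lost by not being OA


def non_reproducible_spend(budget_bn: float, repro_rate: float) -> float:
    """Budget share funding work that would not reproduce at this rate."""
    return budget_bn * (1 - repro_rate)


def oa_impact_loss(budget_bn: float, loss_fraction: float) -> float:
    """Budget-equivalent of the estimated lost citation impact."""
    return budget_bn * loss_fraction


wasted = non_reproducible_spend(UGC_BUDGET_HKD_BN, LOWEST_REPRO_RATE)
lost = oa_impact_loss(UGC_BUDGET_HKD_BN, OA_CITATION_LOSS)
print(f"Non-reproducible spend: about {wasted:.1f} billion HKD")  # about 15.6
print(f"OA impact loss: {lost:.2f} billion HKD")                  # 8.75
```

These are deliberately worst-case, back-of-envelope numbers: the 11% rate comes from one preclinical study and may not transfer to other fields.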
https://osf.io/cgpzb/
Open Science (Open Access & Open Data) survey of Hong Kong
Reading/Reflection for next class
Thoughts and ideas why Hong Kong is lagging behind US/EU?
Any ideas what we need to do to move forward?
Any feedback on the survey?
QUANTIFYING REPRODUCIBILITY IN HK
HKU Repeatability in HK Research Experiment
• HKU policy on data sharing from 2015
• PLOS policy mandating sharing of supporting data since March 1, 2014
• HKU has published 267 PLOS ONE papers 2014-date
• Can we quantify reproducibility in a sample of these?
• Easy exercise in literature curation
• 2016 HKU PLOS publications = 49 papers
http://hub.hku.hk/simple-search?query=&location=publication&sort_by=bi_sort_2_sort&order=asc&rpp=25&filter_field_1=journal&filter_type_1=equals&filter_value_1=plos+one&filter_field_2=dateIssued&filter_type_2=equals&filter_value_2=[2014+TO+2017]&filter_field_3=dctype&filter_type_3=equals&filter_value_3=article&etal=0&filtername=dateIssued&filterquery=2016&filtertype=equals
HKU Repeatability in HK Research Experiment
• Everyone is assigned five 2016 HKU PLOS papers
• Quickly scan paper looking for supporting data
• If no data, ignore
• If uses data, is it all associated with the paper?
• If external data, is it available from URL or accession?
• If “data available on request”, are they contactable?
• Don’t spend more than 5 mins per article
• Add data into googledoc, and we’ll go through results & feedback next class
Homework/Case study: literature curation exercise
HKU Repeatability in HK Research Experiment
Example 1.
https://docs.google.com/spreadsheets/d/15BszEhUodygyu4eGckR2b5p153nyeYmB3Uh4U23HX-o/edit?usp=sharing
HKU Repeatability in HK Research Experiment
Example 1.
Is there data presented in the paper? – Yes
Is there external data, and if so what is the link/accession? – No
Is all the data in the paper available? – No
Comments - Has questionnaire, but not data as says “minimal anonymized dataset will be made available upon request”
Enter data here: https://docs.google.com/spreadsheets/d/15BszEhUodygyu4eGckR2b5p153nyeYmB3Uh4U23HX-o/edit?usp=sharing
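For anyone who prefers to collect their answers in a script before pasting them into the spreadsheet, the questions in the example can be sketched as a small record type. The class and field names below are hypothetical and do not reflect the actual column headings in the shared sheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CurationRecord:
    """One row of the literature curation exercise (illustrative fields)."""
    paper: str                          # title, DOI or other identifier
    has_data: bool                      # is there data presented in the paper?
    external_data_link: Optional[str]   # URL or accession, None if no external data
    all_data_available: bool            # is all the data in the paper available?
    comments: str = ""


def needs_author_contact(record: CurationRecord) -> bool:
    """Flag papers that use data but do not make all of it available."""
    return record.has_data and not record.all_data_available


# Example 1 from the slide, with a placeholder identifier.
example_1 = CurationRecord(
    paper="Example 1",
    has_data=True,
    external_data_link=None,
    all_data_available=False,
    comments="Has questionnaire, but data only available upon request",
)
print(needs_author_contact(example_1))  # True
```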
HKU Repeatability in HK Research Experiment
Example 1.
OPTIONAL: If data are missing, do the authors respond if contacted?
Enter data here: https://docs.google.com/spreadsheets/d/15BszEhUodygyu4eGckR2b5p153nyeYmB3Uh4U23HX-o/edit?usp=sharing
Final Project
• For the final project for this course, you can choose from 3 assignment options.
• The assignment is due on the 15th May and it is worth 40% of your grade.
• Time will be set aside for presenting a provisional draft of this during the final class on the 24th April.
Final Project: Option 1
Write an Annotated Bibliography about data curation practices in an academic discipline of your choosing.
• Choose a discipline (sciences, social sciences, & humanities) OR choose the topic of “open data.”
• Summarize data practices in your chosen discipline or topic. (5-7 sentences)
• Find 7-10 sources that relate that discipline or topic to data creation, management, and/or curation.
• Provide a citation for the source in APA style.
• Write a short annotation that summarizes the content of the source. You may include quotes from the source sparingly, but the annotations should be mostly, if not entirely, in your own words. (3-5 sentences)
• Explain the relevance of the source with relation to the data practices of your chosen discipline or topic. (1-2 sentences)
• Find a few example public datasets to demonstrate the above points. Cite the data in the relevant places in the Bibliography according to the Data Citation Principles.
• Refer to this guide for more information about annotated bibliographies: http://sites.umuc.edu/library/libhow/bibliography_tutorial.cfm. Your annotation should be in the “Descriptive” style.
Final Project: Option 2
Using a relevant dataset (this can either be from the literature curation exercise, a BYO dataset, or one given to you), write a report that includes a description of the dataset, a Data Management Plan, and a guidelines document for the researcher(s).
• Description of the dataset that explains the form of the data and the academic discipline in which it was created. This paragraph should provide context for the documents that follow. (3-5 sentences)
• 1-2 page Data Management Plan following the guidelines from HKU or a granting body such as NSF.
• 1 page guidelines document that could be presented to the researcher(s) that provides guidelines for their data (extant and forthcoming):
  – Preservation
  – Appraisal
  – Documentation
• For the DMP and the guidelines document, you can extrapolate from your dataset to imagine additional details about the research practices that created the dataset and will create more data in the future.
• Look for suitable data repositories that can host this data (institutional, general purpose, or subject specific), and if there is a relevant one then publish the data if you have permission, and correctly cite the data in the relevant places in your report.
Final Project: Option 3
Prepare a 30 minute data curation workshop that you could teach to researchers that would provide them the necessary details to understand why data curation is relevant to them and best practices they should follow.
• Slide deck that introduces data curation for a researcher audience. (No more than 40 slides.)
• Presenter outline that describes the important points for each slide.
• Topics that might be addressed in your workshop: the value of data management, writing a data management plan, data repository options. You can assume your audience is researchers at HKU.
• Make sure all of the content is copyright free, and share the final material openly (e.g. figshare, scholarhub, OER commons, etc.), and with sufficient metadata to make it discoverable.
Looking ahead…
• Next class on Monday 27th March we’ll go from open to FAIR data
• We’ll also go through the reflection & curation case studies
  – Bring ideas & feedback, and we’ll look at the data
• Final project due 10th May
  – Need to present preliminary version on 26th April to get feedback before completion