Charleston Conference 2016

From Maslow’s Hierarchy to Knowledgegraphs: Experiments in Big and Small Data at Elsevier. Anita de Waard, VP Research Data Management, Elsevier

Transcript of Charleston Conference 2016

Page 1: Charleston Conference 2016

From Maslow’s Hierarchy to Knowledgegraphs: Experiments in Big and Small Data at Elsevier

Anita de Waard, VP Research Data Management, Elsevier

Charleston Conference, November 4, 2016

Page 2: Charleston Conference 2016

Big Data vs. Small Data: What Will I Be Talking About?

Data Type     Small                             Big
User          UX                                User analytics
Performance   Pure                              SciVal
Research      Research Data Management (RDM)    HPC systems (HEP, astronomy, etc.)
Text          Text mining                       KnowledgeGraphs
Health        Medical systems                   Precision Medicine

Legend: Elsevier does; I will talk about

Page 3: Charleston Conference 2016

When You Leave Your Institution, What Happens To Your Data?

Answer options:

• Stays at institution

• Take it with me

• Don’t know

• Data is lost

• Other

Source: Bauer, B. (Bruno) et al. (2015), ‘Forschende und ihre Daten. Ergebnisse einer österreichweiten Befragung’ [Researchers and their data: results of an Austria-wide survey] (eBook, in German), E-infrastructures Austria, https://phaidra.univie.ac.at/detail_object/o:407736

Page 4: Charleston Conference 2016

When we talk about data, we really talk about the following:

Machine & environment settings

Raw data

Processed data

Scripts & analyses

Protocols, methods, algorithms

Accessibility

Reproducibility

Reusability

Discoverability

Note: images for illustrative purpose only

Page 5: Charleston Conference 2016

https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data

A Maslow Hierarchy for Research Data:

Page 6: Charleston Conference 2016

Preserve Process: Hivebench (http://www.hivebench.com)

Page 7: Charleston Conference 2016

Linked to published papers – or not

Linked to Github – or not

Versioning and provenance

Preserve Data: Mendeley Data (https://data.mendeley.com/)

Page 8: Charleston Conference 2016

Share and Comprehend: SoftwareX (http://www.journals.elsevier.com/softwarex/)

Page 9: Charleston Conference 2016

Access: Linking papers to data: www.Scholix.org

• ICSU/WDS/RDA Publishing Data Service Working group

• Creating linked-data model for exposing DOI to DOI links outside publisher’s firewall

• Merged with National Data Service pilot with the same goal

• Collaboration between CrossRef, DataCite, Europe PubMed Central, ANDS, Thomson Reuters, Elsevier, OpenAIRE

Objective: move from a plethora of (mostly) bilateral arrangements between the different players to a one-for-all cross-referencing service for articles and data
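As a rough illustration of the objective on this slide, here is a minimal Python sketch of a shared DOI-to-DOI link hub. It is not the Scholix schema or API: the `Link` fields, the `LinkHub` class, the provider labels and the example DOIs are all hypothetical and exist only to show the idea of one common service replacing pairwise exchanges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One article-to-dataset link (hypothetical fields, not the Scholix schema)."""
    source_doi: str   # e.g. the article
    target_doi: str   # e.g. the dataset
    provider: str     # who asserted the link


class LinkHub:
    """A single shared service that every party deposits to and queries."""

    def __init__(self):
        self._links = set()

    def deposit(self, link):
        self._links.add(link)  # set membership drops exact duplicates

    def links_for(self, doi):
        """Return all links that mention the DOI, in a stable order."""
        hits = [l for l in self._links if doi in (l.source_doi, l.target_doi)]
        return sorted(hits, key=lambda l: (l.source_doi, l.target_doi, l.provider))


def bilateral_agreements(n_parties):
    """Pairwise arrangements needed if every party deals with every other."""
    return n_parties * (n_parties - 1) // 2


# Made-up DOIs and provider names, for illustration only.
hub = LinkHub()
hub.deposit(Link("10.1234/example-article", "10.1234/example-dataset", "publisher"))
hub.deposit(Link("10.1234/example-article", "10.1234/example-dataset", "repository"))
print(len(hub.links_for("10.1234/example-dataset")))
print(bilateral_agreements(7))
```

In the worst case, the seven organisations named on this slide would need 21 pairwise arrangements to exchange links directly, whereas a shared hub needs only one connection per organisation.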

Page 10: Charleston Conference 2016

Discover: Data Search (http://datasearch.elsevier.com)

DataSearch.Elsevier.com

1. Across repositories

2. (Deep) indexing of data, so not just metadata

3. Data preview

Page 11: Charleston Conference 2016

https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data

A Maslow Hierarchy for Research Data:

Data at Risk

Reproducibility Papers

Page 12: Charleston Conference 2016

GOAL: IDENTIFY ENTITIES AND RELATIONSHIPS ACROSS THE ENTIRE ELSEVIER CORPUS IN SCIENCE DIRECT

TEXT MINING + ENTITY IDENTIFICATION, USING OUR TAXONOMIES (EMMET, COMPENDEX, AND OTHERS)

UNSUPERVISED, SCALABLE AND BUILT WITH OFF-THE-SHELF TECHNOLOGIES

COLLABORATION WITH UNIVERSITY COLLEGE LONDON AND UMASS AMHERST [1]

TOWARDS AN ELSEVIER KNOWLEDGE GRAPH

14M articles from Science Direct

3.3M triples

475M triples

49M triples

p x r matrix

p x k, k x r latent factor matrices

~102 triples

920K concepts from EMMeT

[1] Riedel, S., L. Yao, A. McCallum, and B. M. Marlin (2013). “Relation extraction with matrix factorization and universal schemas”, http://www.aclweb.org/anthology/N13-1008
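The "p x r matrix" and "p x k, k x r latent factor matrices" labels on this slide point to the matrix factorization idea in [1]: entity pairs form the rows, relation patterns form the columns, and two small factor matrices are fitted so that their product predicts cells that were never observed. The following is a minimal pure-Python sketch of that idea only. It uses a simple logistic loss and toy data, and is not claimed to match the training objective in [1] or the Elsevier pipeline; the function names, hyperparameters and the 5 x 3 example matrix are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def factorize(known, p, r, k=2, epochs=500, lr=0.1, reg=0.01, seed=0):
    """Fit a p x k matrix A and a k x r matrix B to the known cells.

    known maps (pair_index, relation_index) to 0 or 1; cells missing
    from the dict are treated as unobserved and never trained on.
    """
    rng = random.Random(seed)
    A = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(p)]
    B = [[rng.uniform(-0.1, 0.1) for _ in range(r)] for _ in range(k)]
    cells = list(known.items())
    for _ in range(epochs):
        rng.shuffle(cells)
        for (i, j), y in cells:
            pred = sigmoid(sum(A[i][f] * B[f][j] for f in range(k)))
            err = y - pred
            for f in range(k):
                a, b = A[i][f], B[f][j]
                # Gradient step on the logistic loss with L2 regularisation.
                A[i][f] += lr * (err * b - reg * a)
                B[f][j] += lr * (err * a - reg * b)
    return A, B


def score(A, B, i, j):
    """Predicted probability that relation j holds for entity pair i."""
    return sigmoid(sum(A[i][f] * B[f][j] for f in range(len(B))))


# Toy data (hypothetical): 5 entity pairs x 3 relation patterns.
KNOWN = {
    (0, 0): 1, (0, 1): 1, (0, 2): 0,
    (1, 0): 1, (1, 1): 1, (1, 2): 0,
    (2, 0): 1,            (2, 2): 0,   # cell (2, 1) is unobserved
    (3, 1): 0, (3, 2): 1,              # cell (3, 0) is unobserved
    (4, 0): 0, (4, 1): 0, (4, 2): 1,
}
A, B = factorize(KNOWN, p=5, r=3)
print("held-out cell (2, 1):", round(score(A, B, 2, 1), 3))
print("held-out cell (3, 0):", round(score(A, B, 3, 0), 3))
```

In this toy setup, pair 2 behaves like pairs 0 and 1 on the observed columns, so the fitted factors should score the unobserved cell (2, 1) as likely, while (3, 0) should come out as unlikely.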

Page 13: Charleston Conference 2016

SAMPLE OUTPUT:

Deduplication/normalization: downsampled from 49M entity-resolved triples:

glaucoma developed many years after chronic inflammation of uveal tract
glaucoma develop following chronic inflammation of uveal tract
glaucoma can appear soon in family history of glaucoma
glaucoma can appear soon in age over 40
glaucoma the risk of functional visual field loss
glaucoma contributing causes of functional visual field loss
glaucoma contributed to functional visual field loss
glaucoma is considered the second leading cause of functional visual field loss
glaucoma remains the second leading cause of functional visual field loss
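To make the deduplication step more concrete, here is a small Python sketch that groups triples by entity pair and collapses exact repeats. It is an assumption about what such a step could look like, not the method used for the Elsevier corpus. The split of each sample line into subject, predicate and object is my reading of the output, the repeated triple was added to show deduplication, and `group_by_pair` is a hypothetical helper.

```python
from collections import defaultdict

# A few sample triples, split by hand into (subject, predicate, object).
# The last entry is a deliberate repeat, added for illustration.
TRIPLES = [
    ("glaucoma", "developed many years after", "chronic inflammation of uveal tract"),
    ("glaucoma", "develop following", "chronic inflammation of uveal tract"),
    ("glaucoma", "contributed to", "functional visual field loss"),
    ("glaucoma", "remains the second leading cause of", "functional visual field loss"),
    ("Glaucoma", "contributed to", "functional visual field loss "),
]


def group_by_pair(triples):
    """Map each (subject, object) pair to its distinct predicates, sorted."""
    grouped = defaultdict(set)
    for subj, pred, obj in triples:
        # Light normalisation: trim whitespace and ignore case.
        key = (subj.strip().lower(), obj.strip().lower())
        grouped[key].add(pred.strip().lower())
    return {pair: sorted(preds) for pair, preds in grouped.items()}


GROUPED = group_by_pair(TRIPLES)
for pair, preds in GROUPED.items():
    print(pair, preds)
```

Each resulting entity pair with its set of predicates corresponds to one row of an entity-pair by relation matrix of the kind used for factorization.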

Page 14: Charleston Conference 2016

Knowledge Graphs for the Life Sciences:

Bradley Allen, DC Conference, Oct 2016, http://www.slideshare.net/bpa777/dc2016-keynote-20161013-67164305/15

Page 15: Charleston Conference 2016

Trends driving Digital Health & Precision Medicine: need for health data with consent

4500 tests for gene disorders available (2013: 3200, +20% CAGR)

$1245 cost to sequence full genome (10/2014: $5730)

$199 cost of 23andMe test

25 million biomed articles referenced on PubMed

30 days → 1 hour (manual to machine learning): time needed to develop one prediction model at Elsevier

1.2 million new biomed articles p.a.

76% of US hospitals use at least a basic EMR

130 million patient data sets at large insurer; 21 m complete for last 2 years; 7 m with clinical and lab data. NB: 6 m (no clin, lab) in Germany; 6.5 million in Catalonia

105 mm ECG: high ECG quality, heart rate, respiratory, body temp, activity, body position, water tight, induction charged, Bluetooth, continuous data feed

PatientsLikeMe has 400,000+ members, 31 million data points covering 2,500+ conditions, donating data

1. genetic testing

2. information explosion

3. patient data

4. biosensors - IoT in health

5. machine learning

6. patient empowerment

Page 16: Charleston Conference 2016

The Elsevier Medical Graph is a deep predictive model that relates attributes of over 2000 medical conditions to phenotypes of patients at potential risk of re-admission.

Probability of occurrence within next five years. 2,083 ICD10 conditions. Based on 6 year longitudinal history of 6 million German patients.

Page 17: Charleston Conference 2016

Big Data vs. Small Data: What Did I Talk About?

Data Type     Small                             Big
User          UX                                User analytics
Performance   Pure                              SciVal
Research      Research Data Management (RDM)    HPC systems (HEP, astronomy, etc.)
Text          Text mining                       KnowledgeGraphs
Health        Medical systems                   Precision Medicine

Legend: Elsevier does; I discussed!

Page 18: Charleston Conference 2016

Thank you!

Anita de Waard, VP Research Data Collaborations, Elsevier RDM Services, Jericho, VT