Is one enough? Data warehousing for biomedical research
![Page 1: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/1.jpg)
Is one enough? Data warehousing for biomedical research
Gregory Landrum¹, Matthias Wrobel², Nicholas Clare²
¹ KNIME.com AG
² Novartis Institutes for BioMedical Research, Basel
2016 Basel Life Sciences Week, 21 September 2016
![Page 2: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/2.jpg)
Overview
§ Motivation: why is this both important and hard?
§ Three data warehouse case studies
§ Is one enough?
![Page 3: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/3.jpg)
Challenges for real world data management and analysis
§ Lots of heterogeneous data from multiple sources, both internal and external
§ Source data are frequently messy and unstructured
§ Constant flow of new data into the system
§ Diverse stakeholders and users
§ Highly diverse and complex questions to ask of the data
§ Serious performance requirements
![Page 4: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/4.jpg)
Storing and managing real-world data
§ Warehouse vs mart vs federation vs “data lake” vs linked data/triple store vs …
§ Many, many different approaches, technologies, and architectures.
§ Most are applicable in some scenarios but there is no silver bullet.
Stonebraker, Michael, and Uğur Çetintemel. “‘One size fits all’: an idea whose time has come and gone.” Proceedings of the 21st International Conference on Data Engineering (ICDE 2005). IEEE, 2005. https://cs.brown.edu/~ugur/fits_all.pdf
![Page 5: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/5.jpg)
Storing the data isn’t the end of the story
You probably want to be able to get the data back out.
![Page 6: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/6.jpg)
Storing the data isn’t the end of the story
You probably want to be able to get the data back out.
Image: “Extracting insights from a data lake” (http://flickr.com/photos/35703177@N00/1063555182)
Jokes aside, allowing the data to be queried and retrieved efficiently is essential
![Page 7: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/7.jpg)
Nature of the data
Shape of the data generated for a project
§ Hit finding: 10⁶ rows, 1-2 columns
§ Hit-to-lead: 10³ rows, 5-10 columns
§ Lead optimization: 10² rows, 10² columns
§ Clinic: 1 row, 10⁴ columns
“omics” data (can appear at multiple stages) is different still
![Page 8: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/8.jpg)
Query/report type 1:
Specialized search
Many rows
Few columns
![Page 9: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/9.jpg)
Query/report type 2:
Basic search
Few rows
Many columns
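The two query/report shapes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `results` table, its columns, the compound and assay names, and the helper functions are hypothetical and are not taken from any of the warehouses discussed in the talk.

```python
import sqlite3

# Hypothetical toy data: one row per (compound, assay) measurement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (compound_id TEXT, assay TEXT, value REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("C1", "assay_a", 5.1), ("C1", "assay_b", 7.2), ("C1", "assay_c", 6.0),
        ("C2", "assay_a", 4.3),
        ("C3", "assay_a", 8.8),
        ("C4", "assay_a", 6.5),
    ],
)


def specialized_search(assay):
    """Type 1: many rows, few columns (every compound measured in one assay)."""
    cur = conn.execute(
        "SELECT compound_id, value FROM results "
        "WHERE assay = ? ORDER BY compound_id",
        (assay,),
    )
    return cur.fetchall()


def basic_search(compound_id):
    """Type 2: few rows, many columns (everything known about one compound)."""
    cur = conn.execute(
        "SELECT assay, value FROM results "
        "WHERE compound_id = ? ORDER BY assay",
        (compound_id,),
    )
    # Pivot the per-assay rows into a single wide record.
    record = {"compound_id": compound_id}
    record.update({assay: value for assay, value in cur})
    return record


many_rows = specialized_search("assay_a")  # one short row per compound
wide_row = basic_search("C1")              # one record, one key per assay
```

The same stored rows serve both shapes; what differs is whether the query filters on the assay or on the compound, and whether the result is pivoted.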
![Page 10: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/10.jpg)
It’s more than just standard queries and reports
§ We also want to enable data scientists (informaticians)
§ They will generally want to ask more complex and varied questions
§ They will likely want to retrieve larger quantities of data
§ It would be great to help them with their 80% problem
![Page 11: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/11.jpg)
The 80% problem
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
![Page 12: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/12.jpg)
Real-world case studies
§ Avalon:
  • Productive, maintained, and in active use for 15 years.
§ MAGMA:
  • Productive, maintained, and in active use for >5 years.
§ Entity Warehouse (EW):
  • In active development
![Page 13: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/13.jpg)
Avalon
§ One table/view per “fact_type” (maps roughly to assay)
  • Typical table has about 10 columns
  • Big table has about 100 columns
§ One row per measurement
  • 10s of rows for short-lived assays
  • Typically hundreds to thousands of rows
  • More than a million rows for HTS
§ ~30K tables/views
§ Additional tables defining structure of the fact tables
§ Little metadata
§ Tightly coupled to a UI
![Page 14: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/14.jpg)
MAGMA
§ Intended to be “the” warehouse
§ Similar type of schema as ChEMBL
§ Results stored in a tall and skinny table
§ Columns for all primitive data types (string, float, int, etc.)
§ ~2 billion rows
§ Tables with metadata
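A "tall and skinny" results table with one column per primitive data type might look like the following minimal sketch. The table names, column names, and helper functions are hypothetical and chosen only for illustration; this is not the actual MAGMA schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Metadata table: what each result type means and which value column it uses.
CREATE TABLE result_type (
    result_type_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE,
    value_kind     TEXT NOT NULL CHECK (value_kind IN ('string', 'float', 'int'))
);
-- Tall and skinny: one row per measurement, one column per primitive type.
CREATE TABLE result (
    result_id      INTEGER PRIMARY KEY,
    compound_id    TEXT NOT NULL,
    result_type_id INTEGER NOT NULL REFERENCES result_type(result_type_id),
    string_value   TEXT,
    float_value    REAL,
    int_value      INTEGER
);
""")
conn.executemany(
    "INSERT INTO result_type (name, value_kind) VALUES (?, ?)",
    [("IC50_uM", "float"), ("replicates", "int"), ("comment", "string")],
)

# Whitelist mapping from value kind to the column that stores it.
COLUMN_FOR_KIND = {
    "string": "string_value",
    "float": "float_value",
    "int": "int_value",
}


def add_result(compound_id, type_name, value):
    """Store one measurement in the column that matches its declared kind."""
    type_id, kind = conn.execute(
        "SELECT result_type_id, value_kind FROM result_type WHERE name = ?",
        (type_name,),
    ).fetchone()
    column = COLUMN_FOR_KIND[kind]  # taken from the whitelist, never from user input
    conn.execute(
        f"INSERT INTO result (compound_id, result_type_id, {column}) "
        "VALUES (?, ?, ?)",
        (compound_id, type_id, value),
    )


def results_for(compound_id):
    """Return (result type name, value) pairs in insertion order."""
    cur = conn.execute(
        "SELECT t.name, COALESCE(r.string_value, r.float_value, r.int_value) "
        "FROM result AS r JOIN result_type AS t USING (result_type_id) "
        "WHERE r.compound_id = ? ORDER BY r.result_id",
        (compound_id,),
    )
    return cur.fetchall()


add_result("C1", "IC50_uM", 0.25)
add_result("C1", "replicates", 3)
add_result("C1", "comment", "clean curve")
c1_results = results_for("C1")
```

The design trades query convenience for flexibility: adding a new assay or result type means adding rows and metadata, not new tables, but every read has to consult the metadata to interpret a value.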
![Page 15: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/15.jpg)
Entity Warehouse
§ Designed to accommodate both internal and external data
§ Central concepts are the entity and entity-entity linkage
§ “Assays” stored as entities with info about their result types and associated metadata
§ One table per result type (e.g. Activity-Concentration, Activity-Percent)
  • About a dozen result types
§ One row per measurement. Current size:
  • 10s of millions of rows for Activity-Concentration
  • ~100 million rows for Activity-Percent
§ Links/drilldown to original data/systems.
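As a rough illustration of the layout described above, the sketch below uses a shared entity table (with assays stored as entities) and one table per result type. All table names, column names, units, and the `example.org` drilldown URLs are assumptions invented for this example, not the real Entity Warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Compounds and assays are both rows in one entity table.
CREATE TABLE entity (
    entity_id   INTEGER PRIMARY KEY,
    entity_type TEXT NOT NULL,
    name        TEXT NOT NULL,
    source_url  TEXT              -- hypothetical drilldown link to the source system
);
-- One table per result type, one row per measurement.
CREATE TABLE activity_concentration (
    compound_id INTEGER NOT NULL REFERENCES entity(entity_id),
    assay_id    INTEGER NOT NULL REFERENCES entity(entity_id),
    value       REAL NOT NULL,
    unit        TEXT NOT NULL
);
CREATE TABLE activity_percent (
    compound_id INTEGER NOT NULL REFERENCES entity(entity_id),
    assay_id    INTEGER NOT NULL REFERENCES entity(entity_id),
    value       REAL NOT NULL
);
""")
conn.executemany(
    "INSERT INTO entity VALUES (?, ?, ?, ?)",
    [
        (1, "Compound", "C1", None),
        (2, "Assay", "kinase_ic50", "https://example.org/assays/kinase_ic50"),
        (3, "Assay", "cell_viability", "https://example.org/assays/cell_viability"),
    ],
)
conn.execute("INSERT INTO activity_concentration VALUES (1, 2, 0.25, 'uM')")
conn.execute("INSERT INTO activity_percent VALUES (1, 3, 85.0)")


def concentration_results(compound_name):
    """All Activity-Concentration rows for a compound, with the assay's drilldown link."""
    cur = conn.execute(
        """
        SELECT a.name, r.value, r.unit, a.source_url
        FROM activity_concentration AS r
        JOIN entity AS c ON c.entity_id = r.compound_id
        JOIN entity AS a ON a.entity_id = r.assay_id
        WHERE c.entity_type = 'Compound' AND c.name = ?
        ORDER BY a.name
        """,
        (compound_name,),
    )
    return cur.fetchall()


c1_concentrations = concentration_results("C1")
```

Compared with one table per assay or a single tall table, a small fixed set of result-type tables keeps each table homogeneous (same columns, same meaning) while the number of tables stays manageable.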
![Page 16: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/16.jpg)
Entity Warehouse: the entity
§ Used to represent the business objects (scientific or otherwise) of interest
  • Compounds
  • Samples
  • “Assays”
  • Proteins
  • Projects
  • People
  • Assay results
  • Documents
  • etc…
§ Model for entity type stored in a central location
§ Entities can be linked and grouped
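One way to picture a centrally stored entity-type model with linking and grouping is the in-memory sketch below. The registry contents, property names, link types, and class names are all hypothetical; the slides do not describe how the real system implements these ideas.

```python
from dataclasses import dataclass

# Hypothetical central registry: entity type -> required property names.
ENTITY_MODELS = {
    "Compound": {"name"},
    "Sample": {"name", "batch"},
    "Assay": {"name", "result_type"},
}


@dataclass(frozen=True)
class Entity:
    entity_type: str
    entity_id: str
    properties: tuple  # sorted (key, value) pairs, so the entity is hashable


def make_entity(entity_type, entity_id, **properties):
    """Create an entity after checking it against the central model."""
    required = ENTITY_MODELS[entity_type]
    missing = required - set(properties)
    if missing:
        raise ValueError(f"{entity_type} {entity_id} is missing {sorted(missing)}")
    return Entity(entity_type, entity_id, tuple(sorted(properties.items())))


class EntityGraph:
    """Stores typed links between entities and named groups of entities."""

    def __init__(self):
        self._links = set()   # (source_id, link_type, target_id)
        self._groups = {}     # group name -> set of entity ids

    def link(self, source, link_type, target):
        self._links.add((source.entity_id, link_type, target.entity_id))

    def linked_from(self, source, link_type):
        return sorted(
            target for src, kind, target in self._links
            if src == source.entity_id and kind == link_type
        )

    def group(self, group_name, *entities):
        ids = self._groups.setdefault(group_name, set())
        ids.update(entity.entity_id for entity in entities)

    def members(self, group_name):
        return sorted(self._groups.get(group_name, ()))


compound = make_entity("Compound", "C1", name="example compound")
sample_a = make_entity("Sample", "S1", name="vial 1", batch="B1")
sample_b = make_entity("Sample", "S2", name="vial 2", batch="B2")

graph = EntityGraph()
graph.link(compound, "has_sample", sample_a)
graph.link(compound, "has_sample", sample_b)
graph.group("project_x", compound, sample_a)

samples_of_c1 = graph.linked_from(compound, "has_sample")
project_members = graph.members("project_x")
```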
![Page 17: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/17.jpg)
Example entity: the small molecule concept
![Page 18: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/18.jpg)
![Page 19: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/19.jpg)
Example Composite Field Type: Activity Concentration
![Page 20: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/20.jpg)
Loading the data¹
§ The Entity Warehouse is only one part of a large, multi-year data integration project.
§ The majority of the thought and effort has gone into how to properly integrate heterogeneous internal and external data sources
§ Conversion to entities, link resolution, some normalization
§ Preservation of links to original data systems
§ Strong focus on performance/timeliness of the load
§ Once the data are loaded: make them broadly accessible (helping with that 80% rule for data scientists)
¹ The 80% rule affects us too.
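The load steps listed above (conversion, link resolution, some normalization, and preservation of links to the source system) can be sketched as a single function. The alias table, unit table, field names, and source-system names below are invented for illustration and are assumptions, not details of the actual load process.

```python
# Hypothetical lookup tables used during the load.
ALIASES = {"cmpd-001": "C1", "CPD1": "C1", "cmpd-002": "C2"}  # source id -> entity id
UNIT_TO_MICROMOLAR = {"nM": 1e-3, "uM": 1.0, "mM": 1e3}       # conversion factors


def load_record(raw, source_system):
    """Convert one raw source record into a normalized warehouse record.

    Returns None when the compound identifier cannot be resolved, so the
    caller can route the record to curation instead of loading it.
    """
    # Link resolution: map the source-specific identifier to one entity id.
    compound_id = ALIASES.get(raw["compound"])
    if compound_id is None:
        return None

    # Normalization: express every concentration in micromolar.
    factor = UNIT_TO_MICROMOLAR[raw["unit"]]

    return {
        "compound_id": compound_id,
        "value_uM": raw["value"] * factor,
        # Preserve the link back to the original data system.
        "source_system": source_system,
        "source_key": raw["key"],
    }


raw_records = [
    ({"key": "r-17", "compound": "cmpd-001", "value": 250.0, "unit": "nM"}, "assay_db_1"),
    ({"key": "x-42", "compound": "CPD1", "value": 0.5, "unit": "uM"}, "assay_db_2"),
    ({"key": "x-43", "compound": "unknown-9", "value": 1.0, "unit": "uM"}, "assay_db_2"),
]

loaded = []
unresolved = []
for raw, system in raw_records:
    record = load_record(raw, system)
    if record is None:
        unresolved.append(raw["key"])
    else:
        loaded.append(record)
```

Two records from different systems end up attached to the same entity and in the same unit, while each keeps enough provenance to drill back down to where it came from.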
![Page 21: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/21.jpg)
CDF architecture
[Architecture diagram. Layers as labeled: Consolidation Layer, Source Layer, Integration Layer, Access Layer (visualization/reporting tools and user interfaces). Labeled components: Update Services; Entity Services; Entity Warehouse; Search Indexes; Custom Datamart …; Entities, Assays, Facts, Workflow; Registration systems; Assay metadata systems; Assay data systems; Logistics systems; Curation Framework; Entity Instance Reference; Entity & Property Definitions; Fact Instance Reference.]
![Page 22: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/22.jpg)
One size really doesn’t fit all
§ Just as there is no perfect database technology for all situations, we don't think that there's a perfect research data warehouse for all use cases.
§ The Entity Warehouse will contain most of the data and meet 90% of the needs,¹ but there will still end up being multiple “warehouses”
§ We will encourage and support the building and use of data marts by data scientists and will make it easy to keep them up to date
§ The warehouse(s) is/are just one piece of the full data ecosystem
https://pixabay.com/en/dyke-road-hamburg-port-homes-41832/
¹ At least we hope so. When it comes to enabling broad usage of the various types of 'omics data we'll need to see
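A data mart that is derived from the warehouse and kept up to date could be as simple as the sketch below: a wide, per-compound table rebuilt from tall warehouse rows. The table names, assay names, and the rebuild-from-scratch refresh strategy are assumptions made for illustration; the slides do not say how marts will actually be built or refreshed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the warehouse: one row per measurement.
conn.execute(
    "CREATE TABLE warehouse_result (compound_id TEXT, assay TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_result VALUES (?, ?, ?)",
    [
        ("C1", "assay_a", 4.0),
        ("C1", "assay_a", 6.0),
        ("C1", "assay_b", 7.0),
        ("C2", "assay_a", 3.0),
    ],
)

# The mart is wide: one row per compound, one averaged column per assay.
MART_SQL = """
CREATE TABLE project_mart AS
SELECT compound_id,
       AVG(CASE WHEN assay = 'assay_a' THEN value END) AS assay_a,
       AVG(CASE WHEN assay = 'assay_b' THEN value END) AS assay_b
FROM warehouse_result
GROUP BY compound_id
"""


def refresh_mart():
    """Rebuild the mart from the current warehouse contents."""
    conn.execute("DROP TABLE IF EXISTS project_mart")
    conn.execute(MART_SQL)


def read_mart():
    cur = conn.execute(
        "SELECT compound_id, assay_a, assay_b FROM project_mart ORDER BY compound_id"
    )
    return cur.fetchall()


refresh_mart()
before = read_mart()

# New data arrives in the warehouse; a refresh brings the mart up to date.
conn.execute("INSERT INTO warehouse_result VALUES ('C2', 'assay_b', 9.0)")
refresh_mart()
after = read_mart()
```

A full rebuild is the simplest way to stay consistent with the warehouse; at the data volumes mentioned earlier in the talk, an incremental refresh would likely be needed instead.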
![Page 23: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/23.jpg)
One size really doesn’t fit all
https://pixabay.com/en/dyke-road-hamburg-port-homes-41832/
§ Maybe this is actually a hopeful message from the point of view of a possible standardized warehouse
§ If there’s only one warehouse, it’s probably going to be *mine*
§ If I’m using more than one “warehouse”, then I’m much more willing to talk about using something standardized for one of them
![Page 24: Is one enough? Data warehousing for biomedical research](https://reader031.fdocuments.in/reader031/viewer/2022030312/58ee3fe51a28ab3f618b45e9/html5/thumbnails/24.jpg)
Acknowledgements
Past and present members of the Avalon, MAGMA, and CDF teams:
Bernd Rohde, Joe Ringgenberg, Mathias Asp, Andre Zelenkovas, Ryan Muller, Sandra Mueller, Artem Mitrokhin, Recca Chatterjee, Nabil Hachem, Andreas Koeller, Mark Schreiber, Barry Frishberg, Thomas Mueller, Alberto Gobbi, Peter Ertl, Paul Selzer, Werner Braun, and many more …
Past and present members of NIBR NX leadership:
Remy Evard, Steve Cleaver, Ken Robbins, Patrick Warren, Andy Palmer