Building a Data Warehouse in Biologics Research
Dr. Alex Kohn (Roche Diagnostics GmbH)
Dr. Bernhard Schirm (Quattro Research GmbH)
The Roche Group
Key Facts at a Glance
• Founded 1896 in Basel, Switzerland
• Founding families still hold majority stake
• Employing 80,000 people
• Leadership in pharmaceuticals: leading supplier of medicines for cancer and a market leader in virology
• Leadership in in vitro diagnostics
• Focus on Personalised Healthcare
Current Treatment is the Same for Most Patients
Outcomes can vary widely
Patients + Disease + Treatment = Different outcomes
Different treatment outcomes affect patients’ safety, survival and quality of life
Personalized Healthcare (PHC)
Tailors treatment to the patient
• Molecular diagnostic testing can stratify patients according to their specific genetic makeup and/or the nature of their disease or condition
• This approach improves drug safety, may increase patient survival, and may improve quality of life
The Roche Penzberg Site
One of the largest biotech centers in Europe
Divisions: Pharma & Diagnostics
Established: 1972 (Boehringer Mannheim GmbH)
1998 acquired by Roche
Employees: 4825* / FTE (Diagnostics 62%; Pharma 38%)
Area: ~350,000 m² = ~86 acres
Investments 2003 to 2011: ~€1.84 bn
* Headcount of December 2011
Pharma Research & Early Development (pRED)
Centre of excellence for therapeutic proteins
Oncology, Inflammatory & autoimmune diseases, Metabolic diseases, Central nervous system diseases, Virology
Biologics
Size and Complexity Make Them Different
• Biopharmaceuticals are at least 100-fold larger than traditional chemical products
• Produced by living cells
• Modified during expression, incubation in bioreactor, purification and storage
• Presence of impurities (host cell proteins, DNA, endotoxins, degradation products and aggregates)
• The process is the product
Aspirin (< 200 daltons): chemical pharmaceutical
Erythropoietin (EPO) (~30,000 daltons): biopharmaceutical
Building a Data Warehouse in Biologics Research
Project Objectives
Establish a Data Warehouse as a Data Consolidation and Integration platform for all data within Biologics Research stored in relational repositories
Provide an ad-hoc query, reporting, and analysis toolset that gives users immediate access to information in the Data Warehouse.
IT-Systems in the Biologics Research Process Chain
Labs: Lab 1, Lab 2, ..., Lab n
Process chain: Lab Data Acquisition, Registration & Tracking, Data Analysis
[Figure: data analysis example]
Actual by Predicted Plot: DNA clearance, actual vs. predicted (P=0.2616, RSq=0.84, RMSE=241.81)
Scaled Estimates: Load pH, Load Mass, W1 pH, W1 Cond, W2 pH, EL Cond, Flow rate and their interaction terms (continuous factors centered by mean, scaled by range/2)
Contour Profiler: Load Mass vs. Load pH; responses Yield, Purity, ProteinA, DNA clearance, HCP
Data Warehouse Components
Source Systems: ELN, LIMS, Projects, Screening, Inventory
Data Capturing Source Systems
Fermentation
Cell Line
Vector Variant
Protein Variant
B-Cell (Hybridoma)
Specimen
Animal
Fusion Sort
Purified & Characterized Protein
System                 Description
Key Manager            Object and relationship management (proprietary)
Labware                Immunization management (Animals, Specimen)
TheraPS                Workflow and request tracking (proprietary)
E-Workbook / BioBook   IDBS Electronic Lab Notebook
Sample Management      Sample management including analytical data (proprietary)
Material Management    Material management (proprietary)
PI                     OSIsoft: Online monitoring of fermentation processes
Data Warehouse Components
Source Systems: ELN, LIMS, Projects, Screening, Inventory
ETL
Data Warehouse
Data Marts
Extract, Transform, Load
ETL
• Environment: Mostly Oracle Systems
• Tools used
o Oracle Warehouse Builder
o Oracle Workflow
o SQL, PL/SQL
o For PI integration
− PI SDK, C++, Java
− Integrated into Oracle Warehouse Builder
o APEX
− Master Data Management
− Monitoring
Oracle Warehouse Builder
OWB
• Graphical design of ETL routines
• Modeling of process flows
• Provides documentation of ETL
• Versions used
o OWB 11gR2
o Oracle Workflow 2.6.4 to automate process flows
• Part of every 11gR2 database installation
• Only Standard Edition features are used
• OMBPLUS
o Deployment of mappings and process flows
Change Data Capturing
CDC
• Load only changed data
o No updates, only inserts
• Non-invasive
o E.g. materialized views, views with timestamps
o May need to load too much data
• Invasive
o Triggers, redo log transport
o Smaller delta, faster
o Problem: could not change some systems due to license and support issues
→ We use only the non-invasive approach
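The non-invasive approach can be sketched as a timestamp-based delta load: remember the newest modification timestamp already loaded, read only newer rows from a source view, and append them. This is a minimal sketch, not the project's ETL; SQLite stands in for the Oracle systems, and the names `sample`, `src_sample_v`, `dwh_sample` and `load_delta` are hypothetical.

```python
import sqlite3

def load_delta(src, dwh):
    """Append rows changed since the last load; never update in place."""
    # High-water mark: newest source timestamp already in the warehouse.
    (last_ts,) = dwh.execute(
        "SELECT COALESCE(MAX(modified_at), '') FROM dwh_sample"
    ).fetchone()
    # Non-invasive read: only a SELECT against a view exposing a timestamp.
    rows = src.execute(
        "SELECT sample_id, status, modified_at FROM src_sample_v "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_ts,),
    ).fetchall()
    # Insert-only: a changed sample becomes a new version row.
    dwh.executemany(
        "INSERT INTO dwh_sample (sample_id, status, modified_at) VALUES (?, ?, ?)",
        rows,
    )
    dwh.commit()
    return len(rows)

# Hypothetical source system and warehouse, both in memory.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sample (sample_id TEXT, status TEXT, modified_at TEXT)")
src.execute("CREATE VIEW src_sample_v AS SELECT * FROM sample")
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE dwh_sample (sample_id TEXT, status TEXT, modified_at TEXT)")

src.execute("INSERT INTO sample VALUES ('S1', 'registered', '2012-01-01T10:00:00')")
first = load_delta(src, dwh)    # loads the new sample
src.execute("UPDATE sample SET status = 'purified', modified_at = '2012-01-02T09:00:00'")
second = load_delta(src, dwh)   # loads the changed row as a new version
third = load_delta(src, dwh)    # nothing changed since the last load
```

A strict timestamp comparison can miss rows that share the high-water-mark timestamp; widening the window avoids that at the cost of reloading some unchanged rows, which is the "may need to load too much data" trade-off.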
Data Warehouse Components
Source Systems: ELN, LIMS, Projects, Screening, Inventory
ETL
Master Data Management
Data Warehouse
Data Marts
Reporting
Ad-hoc Queries
Data Mining, OLAP
Master Data Management
MDM
• Definition
o Reference terms consolidated throughout an organization
o Non-transactional data
o E.g.: Physical units, projects, parameters
• In reality
o Use of controlled vocabulary not enforced everywhere
• Role in data warehouse
o Essential for data linking and comparison
• How to solve?
o Get rid of Excel MDM
o Build MDM curation tools
o User buy-in
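A curation tool for master data could, at its simplest, map raw terms from source systems onto reference terms and flag anything unknown for review. The sketch below assumes a tiny unit vocabulary; `REFERENCE_UNITS`, `SYNONYMS` and `curate_unit` are hypothetical names, not part of the project.

```python
# Hypothetical master data: reference terms for physical units plus known synonyms.
REFERENCE_UNITS = {"mg/mL", "nM", "°C"}
SYNONYMS = {
    "mg/ml": "mg/mL",
    "mg per ml": "mg/mL",
    "nanomolar": "nM",
    "deg c": "°C",
}

def curate_unit(raw):
    """Return the reference term for a raw unit string, or None if it needs curation."""
    term = raw.strip()
    if term in REFERENCE_UNITS:
        return term
    # Synonyms are matched case-insensitively; unknown terms return None.
    return SYNONYMS.get(term.lower())

raw_units = ["mg/ml", "nM", " nanomolar ", "ug/l"]
curated = {u: curate_unit(u) for u in raw_units}
needs_review = [u for u, term in curated.items() if term is None]
```

Terms left in `needs_review` are the ones a curator would have to resolve before the data can be linked and compared across systems.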
Challenges
1. Dimensionality
2. Genealogy
3. Heterogeneous user requirements
Dimensional Modeling
• Star model for business warehouses
• Simple fact tables
• Large amount of data
o E.g. Walmart ~1000TB
Fact: Transaction
Dimensions: Product, Region, Time, Customer, Product Group
OLAP Cubes
Aggregate facts along dimensions
[Figure: OLAP cube over Transaction facts]
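For a business warehouse, the star model and the roll-up along dimensions can be sketched in a few lines. This is an illustrative example only, with SQLite in place of a real warehouse; all table names, column names and values are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, product_group TEXT);
CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one row per transaction, foreign keys to the dimensions.
CREATE TABLE sales_transaction (
    product_id INTEGER REFERENCES product(product_id),
    region_id  INTEGER REFERENCES region(region_id),
    sale_date  TEXT,
    amount     REAL
);
INSERT INTO product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO region  VALUES (1, 'North'), (2, 'South');
INSERT INTO sales_transaction VALUES
    (1, 1, '2012-01-05', 10.0),
    (2, 1, '2012-01-06', 20.0),
    (3, 2, '2012-01-06',  5.0),
    (1, 2, '2012-02-01', 15.0);
""")

# Roll the facts up along two dimensions: product group and region.
cube = db.execute("""
    SELECT p.product_group, r.name, SUM(t.amount)
    FROM sales_transaction t
    JOIN product p ON p.product_id = t.product_id
    JOIN region  r ON r.region_id  = t.region_id
    GROUP BY p.product_group, r.name
    ORDER BY p.product_group, r.name
""").fetchall()
```

Summing `amount` is meaningful for any combination of dimensions here, which is what makes pre-aggregated cubes a natural fit for sales data.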
Dimensions in Biologics Research?
• High dimensionality
• Multiple facts having hierarchies
o E.g. IC50 has Hill-Slope & R²
• Warehouse size is relatively small (~1 TB) compared to a finance warehouse
Example facts and dimensions: Study, B-Cell, Binding, Device, Process, Time, Fermentation Phase, Experiment, Absorbance, Technique, Target, Species, Assay, Activity, IC50, Batch, Project
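The IC50 example can be made concrete: an IC50 comes out of a curve fit together with its Hill slope and R², so the values form one composite fact, not independent measures. The sketch below assumes a four-parameter logistic model, a common choice that the talk does not specify; the class name, parameters and measurements are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DoseResponseFit:
    """One composite fact: the fit parameters are only meaningful together."""
    bottom: float
    top: float
    ic50: float
    hill_slope: float

    def predict(self, conc):
        # Four-parameter logistic curve (decreasing response for hill_slope > 0).
        ratio = (conc / self.ic50) ** self.hill_slope
        return self.bottom + (self.top - self.bottom) / (1.0 + ratio)

def r_squared(fit, concs, responses):
    """Coefficient of determination of the fit on the measured points."""
    mean = sum(responses) / len(responses)
    ss_tot = sum((y - mean) ** 2 for y in responses)
    ss_res = sum((y - fit.predict(c)) ** 2 for c, y in zip(concs, responses))
    return 1.0 - ss_res / ss_tot

# Hypothetical fit and made-up measurements, for illustration only.
fit = DoseResponseFit(bottom=0.0, top=100.0, ic50=10.0, hill_slope=1.0)
concs = [1.0, 10.0, 100.0]
responses = [90.0, 52.0, 10.0]
quality = r_squared(fit, concs, responses)
```

Storing `ic50`, `hill_slope` and the R² of the same fit side by side is one way to model a fact that carries its own hierarchy.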
Aggregation of Data
• Summary data is a key part of data warehouses
o Pre-aggregation to optimize queries
• Straightforward for business data
o Aggregate sales numbers (mean, sum)
• Difficult for Biologics Research data
o Aggregation is only valid under a specific set of dimensions
− E.g.: Average IC50 only for the dimensions protocol, target, concentration
o No aggregation possible / desired
− Time-dependent data like cell growth
− Qualifying data like sequences
→ Aggregation of Biologics Research data is the exception
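The IC50 case can be sketched as a guarded aggregation: values are averaged only within groups that agree on every required dimension, and never across them. The records and field names below are hypothetical; the arithmetic mean is used simply because the slide speaks of an average.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical IC50 results (nM); field names are illustrative.
results = [
    {"protocol": "P1", "target": "T1", "concentration": 5.0, "ic50": 10.0},
    {"protocol": "P1", "target": "T1", "concentration": 5.0, "ic50": 14.0},
    {"protocol": "P2", "target": "T1", "concentration": 5.0, "ic50": 40.0},
]

REQUIRED_DIMENSIONS = ("protocol", "target", "concentration")

def average_ic50(rows):
    """Average IC50 only within groups that agree on every required dimension."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in REQUIRED_DIMENSIONS)
        groups[key].append(row["ic50"])
    return {key: mean(values) for key, values in groups.items()}

averages = average_ic50(results)
```

The result from protocol P2 stays in its own group, so it is not mixed into the P1 average even though the target is the same.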
The BI Approach
• OBIEE
o OLAP Cubes
o Admin Tool: Physical / Business / Presentation Layer
o Answers: Ad-hoc query and analysis
• OBIEE in the domain of Biologics Research
o Many dimensions / conditions → OK
o The majority of the data can’t be aggregated because the aggregate would have no scientific meaning
→ OLAP approach not applicable
Genealogy
• In finance, we usually have a clear reference point
o Product
o Customer
• In Biologics Research reference points are contained in a complex genealogy
o Data consolidation depends on scientific need
o Different levels on which data should be viewed
− Antibody
− Hybridoma
→ OLAP / hierarchy approach not applicable due to the curse of aggregation
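Viewing data at different genealogy levels can be sketched as a walk up a parent map, with results grouped under the chosen ancestor and not aggregated. The entity ids, levels and measurements below are hypothetical and only loosely follow the antibody and hybridoma levels named above.

```python
# Hypothetical genealogy: child -> parent (ids and levels are illustrative).
PARENT = {"A1": "H1", "A2": "H1", "A3": "H2", "H1": "AN1", "H2": "AN1"}
LEVEL = {
    "A1": "antibody", "A2": "antibody", "A3": "antibody",
    "H1": "hybridoma", "H2": "hybridoma", "AN1": "animal",
}

def ancestor_at(entity, level):
    """Walk up the genealogy until an entity of the requested level is found."""
    current = entity
    while current is not None:
        if LEVEL[current] == level:
            return current
        current = PARENT.get(current)
    return None

def view_at(measurements, level):
    """Group measurements under their ancestor at the given level, without aggregating."""
    grouped = {}
    for entity, value in measurements:
        root = ancestor_at(entity, level)
        grouped.setdefault(root, []).append((entity, value))
    return grouped

measurements = [("A1", "binds"), ("A2", "no binding"), ("A3", "binds")]
by_hybridoma = view_at(measurements, "hybridoma")
by_antibody = view_at(measurements, "antibody")
```

The same measurements appear under two hybridomas or three antibodies depending on the requested level, which is why a fixed hierarchy with pre-computed roll-ups does not fit.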
Heterogeneous User Requirements
• Initial approach: Use one system for all use cases
o e.g. OBIEE
• Not applicable due to heterogeneous user requirements
o Consolidation needs vary depending on the level on which the data should be viewed
• We ended up using multiple tools for accessing the data marts
o Tibco Spotfire
o Oracle APEX
o MS Excel
o InSilico
Online & Offline Fermentation Analysis in Spotfire
Genealogy Browser in APEX
Conclusion
• Master data management is a key success factor
• Biologics Research is different from business warehouses
o Dimensionality, aggregation and genealogy
• Most tools on the market are for business warehouses
• Query & Reporting
o No one-size-fits-all solution
o Tailored solutions for each user domain
o Agile development approach → enabled quick user buy-in
We Innovate Healthcare