Posted on 03-Mar-2018
Structuring for Data Integrity Success: Concepts and Cases
Kristen Anton, MS
Bioinformatics
Geisel School of Medicine at Dartmouth
and UNC at Chapel Hill
January 13, 2017
From Information Design, Nathan Shedroff
• Don’t know what the data formats will be
• Data stored on a variety of platforms
• Data needs to be available when we want it
• Data are distributed – not in one place
• Data cannot be centralized
• Data are protected
Do know we have to be able to use some or all of the data at any time, from one single access point.
The problem with research data
Science data challenges
• Big Data
• Emphasis on capture
• Standards
• Access
• Meaningful use
• Archive
• Grant dollars
• Patient empowerment and interactivity
Secrets to success
1. Do not design a study without contribution of computational and analytic experts (figure)
2. Plan adequate resources
3. There is no magic bullet technical solution
4. Don’t skip the engineering
5. Develop realistic timelines collaboratively
6. Remember: the goal is a high quality data set
7. Test your systems
8. Iterate
Informatics & Data Management
• Needs analysis
• Process engineering
• Functional specification
• Technical specification
• System development & implementation
• Training, documentation
• Infrastructure (hardware, network)
• Data validation
• Process refinement (feedback to developers)
• Data cleaning
• Data manipulation
• Data reporting
• Data sharing, transformation
• Data integration
• Data analysis/interface
Informatics for Complex Biomedical Research
The goal: To develop a high-quality computerized system designed to give you a high-quality data set
• Efficient, accurate, secure, validated data capture
• Safe data storage and archive (make sure you can restore from backup: practice!)
• Change control and audit (data elements, process and data)
• Code repository (especially important for longitudinal studies)
• Ability to retrieve data in such a way that all study data for an individual are attributable to that subject
• Low-cost maintainability of the system
• Speedy startup, minimal cost
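The “make sure you can restore from backup: practice!” point above can be rehearsed with a periodic restore drill. A minimal sketch in Python, with illustrative file paths; a real drill would invoke your actual backup tool rather than a plain file copy:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Checksum a file so the restored copy can be compared to the live one."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def restore_drill(live_file, backup_file):
    """Restore backup_file into a scratch directory and verify it matches live_file."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / Path(backup_file).name
        shutil.copy2(backup_file, restored)  # stand-in for the real restore step
        return sha256_of(restored) == sha256_of(live_file)
```

Running this on a schedule (and alarming when it returns `False`) turns “we have backups” into “we have restores”.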
First (critical) step: Process Engineering
• Read and understand the protocol
• Describe the protocol diagrammatically (flow charts, use-case diagrams, etc.)
• Review the documentation and adjust (scientific, operations and tech staff) – iterate!
• Include documentation within study manuals
Data Entry: Lots of shapes and sizes
• Paper – centralized double entry, then post to database
• Paper – scan & verify, then post to database
• Paper with entry at source of data collection
• Electronic data collection, stand-alone systems deployed to data collection site; data transfer
• Web and mobile device-based electronic data collection (automatic post to database)
• Automated data extraction and capture (e.g., EHR, PHR)
Data Entry: General Principles
• Authentication and authorization
• Design with measures to make identification clear
• Good security practice (no sharing passwords, regular changing of passwords, physical & logical security)
• Audit trail
• Date/time stamp (make sure systems’ date/time is correct; account for different time zones)
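The audit-trail and date/time-stamp principles above can be sketched as a single change record; the field names and digest scheme are illustrative, not a prescribed format. Timestamps are written in UTC so entries from sites in different time zones remain comparable:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(user_id, form_id, field, old_value, new_value):
    """Build one audit-trail entry for a single field change."""
    entry = {
        "user": user_id,   # the authenticated individual, never a shared login
        "form": form_id,
        "field": field,
        "old": old_value,
        "new": new_value,
        # UTC timestamp: avoids ambiguity across time zones and DST changes
        "ts_utc": datetime.now(timezone.utc).isoformat(),
    }
    # A digest over the entry makes later tampering detectable.
    entry["digest"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    return entry

rec = audit_record("jsmith", "CRF-03", "systolic_bp", "120", "130")
```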
Web and app-based Data Entry: Dynamic interfaces enforce the protocol
• Use bar codes, pick-lists, radio buttons, check boxes and skip-patterns
• Minimal use of default values and free text
• Information can be pre-loaded to facilitate data linking (i.e., specimens to subject)
• Consistent use of standardized semantics
• Validation at entry, submit, and posting to database
• Reduce data entry burden/ capture at source
• Use standardized header to facilitate subject identification
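The pick-list and skip-pattern ideas above can be sketched as field-level validation run at entry and again at submit. The smoking-status fields below are hypothetical examples, not items from any of the studies described:

```python
# Minimal sketch: a pick-list (no free text) plus one skip pattern —
# if smoking_status is "never", pack_years must be left blank.
SMOKING_CHOICES = {"never", "former", "current"}

def validate_entry(record):
    """Return a list of validation errors; empty means the record passes."""
    errors = []
    status = record.get("smoking_status")
    if status not in SMOKING_CHOICES:  # pick-list enforcement
        errors.append("smoking_status must be one of " + ", ".join(sorted(SMOKING_CHOICES)))
    pack_years = record.get("pack_years")
    if status == "never" and pack_years not in (None, ""):
        errors.append("pack_years must be blank when smoking_status is 'never'")
    if status in {"former", "current"}:
        try:
            if float(pack_years) < 0:
                errors.append("pack_years must be non-negative")
        except (TypeError, ValueError):
            errors.append("pack_years must be a number for smokers")
    return errors
```

Running the same checks again when posting to the central database catches anything that slipped past the interface.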
Security: Physical and Logical
• Physical
  – Servers locked, locked and locked
  – “Good Practice”
• Topology
  – Firewall
  – Separate servers
  – Systematic virus protection
• Authorized access to data
  – Regulation of access by login (individual signature)
  – Multiple levels of access privilege
  – Time-out
• Disaster recovery
  – Audit function within database stores all changes to data with metadata
  – Nightly back-up
  – Monthly archive with off-site storage
• Design
  – Data: identified, de-identified, anonymous
  – Encrypted transactions (data moves between interfaces and database)
Data Security: A Challenging High Priority
Create a dataset that includes the least possible identification.
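One common way to act on this principle is to drop direct identifiers and replace the medical record number with a keyed pseudonym before building the analysis dataset. The sketch below uses an HMAC for that linkage; the field names and key handling are illustrative assumptions, not the actual de-identification procedure of any study described here:

```python
import hashlib
import hmac

# Placeholder key: in practice it is held only by an honest broker, so the
# pseudonym cannot be reversed from the released dataset alone.
LINKAGE_KEY = b"kept-by-honest-broker-only"
DIRECT_IDENTIFIERS = {"name", "mrn", "phone", "email"}  # illustrative list

def deidentify(record):
    """Strip direct identifiers and add a stable, non-reversible subject_id."""
    pseudonym = hmac.new(
        LINKAGE_KEY, record["mrn"].encode(), hashlib.sha256
    ).hexdigest()[:16]
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    clean["subject_id"] = pseudonym
    return clean
```

Because the HMAC is deterministic, the same subject always maps to the same `subject_id`, so longitudinal records still link up after de-identification.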
Standards & ‘Gold’ Standards
• FDA guidelines for computer systems in clinical trials: http://www.fda.gov/RegulatoryInformation/Guidances/ucm126402.htm
• HIPAA guidelines regarding security of information
• IRB requirements with regard to ability to identify individuals in the data set
• Industry semantic standards and ontologies
• Metadata standards
• Transmission standards FHIR, HL7 (support automated extraction)
Data Management for Complex Biomedical Research
The goal: To safely, efficiently and accurately collect, store, validate and retrieve study data
• Process validation
• Participate in system validation
• Data validation measures within central database
• Quality control checks at source of capture
• Ensuring ready access to data for analytical staff
• Data reporting to support operation of study as well as interim analysis
• Data sharing (with potential data transformation or mapping) - becoming increasingly important
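The “data validation measures within central database” item above often takes the form of batch range and completeness checks whose output becomes a query report sent back to the collecting site. A minimal sketch, with illustrative field names and limits:

```python
# Hypothetical range checks (field -> inclusive low/high limits).
RANGE_CHECKS = {
    "age": (0, 110),
    "weight_kg": (2, 300),
}

def qc_report(rows):
    """Scan rows and return (row_index, field, problem) queries for site review."""
    queries = []
    for i, row in enumerate(rows):
        for field, (lo, hi) in RANGE_CHECKS.items():
            value = row.get(field)
            if value is None:
                queries.append((i, field, "missing"))
            elif not lo <= value <= hi:
                queries.append((i, field, f"out of range [{lo}, {hi}]: {value}"))
    return queries
```

Run nightly, such a report supports both day-to-day study operations and the data-cleaning pass that precedes interim analysis.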
Challenge: BIG DATA
1. Volume
   • Multiple generators: humans, machines, networks, social media
2. Variety
   • Structured data, e.g. databases, spreadsheets
   • Unstructured data, e.g. emails, photos, videos, PDFs, path reports
3. Velocity
   • Fast, sometimes continuous capture
   • Real-time access and use desired
4. Veracity
   • Biases, noise, outliers, abnormalities
   • “Dirty” data
   • How long is data valid, and how long should it be stored?
5. Value
   • Is the data being analyzed meaningful to the problem?
Preparing to handle big data
• Invest in capturing and maintaining data in well-annotated, accessible, structured data repositories
– Based on rigorous data/information architectures
• Computer Scientists, Statisticians/Data Scientists, Domain Experts (Scientists) must systematize the analysis of massive data
– Significant efficiencies may be achieved by thinking of data analysis and data access together rather than thinking of them as serial operations.
– We need new statistical methods and algorithms optimized for this type of environment
• Develop computing infrastructures for sharing and analyzing highly distributed, heterogeneous data
– Requires coordination (international, cross-agency)
– Requires a software architecture
• Sustainability in both the data and the software infrastructures is critical
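The “well-annotated, accessible, structured data repositories” advice above implies that every deposited dataset carries a machine-readable metadata record (provenance, schema, units) so it can be found and reused without asking its creator. A minimal sketch; the keys and file names are illustrative, and a production system would adopt an established metadata schema rather than this ad hoc one:

```python
import json

def annotate(dataset_path, title, variables, source, version="1.0"):
    """Produce a JSON metadata 'sidecar' describing one deposited dataset."""
    meta = {
        "path": dataset_path,
        "title": title,
        "variables": variables,  # name -> {"type": ..., "units": ...}
        "source": source,
        "version": version,
    }
    return json.dumps(meta, indent=2)

# Hypothetical example deposit.
doc = annotate(
    "repo/stool_extract.csv",
    "Stool biospecimen extract",
    {
        "subject_id": {"type": "string", "units": None},
        "collection_date": {"type": "date", "units": None},
    },
    source="biospecimen registry",
)
```

Stored next to the data file, the sidecar is what makes the repository searchable and the analysis reproducible years later.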
The fun part: Bioinformatics in action
Internet cohort of 14,000 IBD patients
• Baseline and 6-month surveys on disease activity and treatment
• Modules collect information on a variety of patient-reported outcomes, diet, sleep, etc. – very flexible to incorporate new questions
• Pilot-tested biospecimen collection
• 25 abstracts and more than a dozen papers
• Two PCORI grants based on Partners
What has this research shown?
… also Kids & Teens
• Coordinator-supported IBD registry
• Information and specimens
• Specimen analysis bringing in “big data”, e.g. genotyping
• Sub-population linked to CCFA Partners
7 IBD centers of excellence; 5,000 IBD patients
Developed baseline & follow-up surveys (in person and online)
Web site home page serves as portal to Registry, biospecimen network, forms, documents, SOP’s, project descriptions
Data capture:
• Updated web-based data collection tools to enroll subjects and enter clinical information from patients and charts
• Baseline and follow-up survey
Online follow-up to reduce coordinator burden:
1,266 online follow-up questionnaires completed
Sharing and analyzing data:
Annotated case report form helps users understand data set contents and formats
Mars, Palm trees & IBD?
SHARE biospecimen registry
Dynamic biospecimen registry developed using Apache Software Foundation open source software, collaboration with NASA Jet Propulsion Laboratory
• Basic metadata describing SHARE biospecimens defined and implemented within this system
• Ongoing effort to connect all sites electronically, for dynamic data retrieval
• Data extract brings information to network until electronic connection is established
• Blood, tissue and stool data now available
• Status page continuously updated to reflect progress on development
SHARE biospecimen registry: scalable
Opportunity: Building a full data science “knowledge environment” for SHARE IBD research
• Document and search protocols, data and metadata standards, cohort descriptions, outcomes
• Reproducible analyses supported by “pipelines” e.g. RNASeq, Secretome, Mass spec
• Persistent archive of data
• Searchable environment
Data Science Knowledge Environment:
Early Detection Research Network (EDRN) is an initiative of the National Cancer Institute created to bring together dozens of research institutions to help accelerate the translation of biomarker information into clinical applications for diagnosing cancer in its earliest stages …
… supported by a virtual, national, integrated Knowledge System.
SHARE biospecimen registry
In September 2016, our systems were named one of ten technical advances improving cancer research by TechRepublic.
Others on the list include IBM Watson, Google DeepMind and CRISPR.
Our work is the only technology that addresses enterprise architecture for managing and sharing big data for analytics and visualization.
http://www.techrepublic.com/article/10-ways-tech-is-improving-cancer-research/
Chemoprevention study: Prevention of skin cancer by antioxidants
• Subjects (9000) with clinical evidence of chronic arsenic exposure will be recruited from two ongoing cohort studies
• Two study sites in Araihazar and Matlab; Coordinating center at ICDDR, B in Dhaka and University of Chicago; Mailman School of Public Health Columbia University, data center at Dartmouth
• Goal: to test effectiveness of vitamin E and selenium in preventing development of skin cancer
Data issues
• Design processes, forms and systems
• Collect: Family history, Risk factors, Health & outcomes, bio-specimens/related data
• Screening, recruitment, pill distribution, data collection, bio-specimen tracking, data management, data integration and analysis
• Challenges: intermittent power, low bandwidth internet access, low-tech facilities (no land-line phone, no fax, no cooling, no computers at sites), translation of information into and back from Bengali – etc.
Colon Cancer Family Registries
Chemoprevention of Arsenic Induced Skin Cancer
Visual Media Influences on Adolescent Smoking
Vit D/Calcium Polyp Prevention Study
PCORI: IBD Patient Powered Research Network
Methotrexate Response in Treatment of UC
IBD research
We Care
Improving the quality of life of patients is not just a job for us – we are committed to facilitating better disease screening, treatment, cure and the basic science that leads to knowledge.
Great Team
Department of Biomedical Data Science
Dave Aman Judy Harjes Rob Rheaume Scott Gerlach Suzie Rovell-Rixx John Gilman Susan Gallagher
Jane Hebb Ted Bush Steve Pyle Judi Forman Laurie Johnson Maureen Colbert
Center for Gastrointestinal Biology and Disease
David Seligson Ginny Sharpless Wenli Chen Van Nguyen