Relational Databases and Statistical Processing
Andrew Westlake
Survey & Statistical Computing
+44 20 8374 4723 [email protected]
15-Mar-03 Statistical Databases and RDBMS © S&SC 2
Introduction
• Good Statistical Analysis needs Good Data
  – Many things contribute to data quality
    • Quality of structures and processes for data collection is related to quality of data
• Good Ideas from Computer Science
  – Process design methodologies
  – Database methodologies
  – Data and process modelling systems
• Statisticians can benefit from CS concepts
  – Particularly for design of complex structures and continuous processes
Data Structure
• Statistics generally based on a single rectangular dataset
  – Much statistical data is conceived and stored in this form
• Database professionals often use Relational Database systems
  – Formal model, based on a set of related rectangular datasets
  – Allows the representation of more complex structures, and easier maintenance and manipulation of the data
• RDB approach valuable for some statistical situations
  – Review RDB concepts and functionality, plus SQL
  – Statistical Examples
    • Pakistan Fertility Survey
    • Processing and analysis system for HIV and AIDS reporting for England and Wales
The Relational Model
• A logical specification of the content and behaviour of a database management system, including
• The types of structure that can be present in a database
• The properties of elements that can be stored in these structures
• The operations that can be performed on these structures and their behaviour
• Facilities that must be present in the database management system
• The general nature of the interactions between the database and its users and administrators
– Codd’s Rules specify properties that a Relational DataBase Management System (RDBMS) must possess
• SQL
  – A standard language with which to interact with an RDBMS
Relational Database System Structure
Layers (diagram):
  User Processes: Applications & Tools, include functionality, semantics appropriate to application
  Data Modelling: Expressing the real task in terms of the conceptual model
  SQL: Standard interface to an RDBMS, syntax and embedding
  Relational Model: Specification of functionality, behaviour and scope
  Storage & Access Methods: Implementation issue, affects performance
Components (diagram):
  Database: Data, Rules
  User Processes (via SQL): Schema Design, Data Entry, Transaction Processing, MIS Reports
  DataBase Management System (DBMS): Insertion, Deletion, Updates; Consistency; Sorting, Grouping; Aggregation; Retrieval, Selection; Transformation; Data Dictionary
RDBMS Strengths
• Relational Model
  – Precise, formal mathematical specification of structure and behaviour
  – Appropriate for information about Entities, and Relationships between them
• Data Modelling
  – Useful tools for understanding data structures and flows
• SQL
  – International Standard (SQL2), widely implemented
• Current Implementations
  – Widely available, well supported, good implementations, integration with other products, add-on market for tools
Components of the Relational Model
• Domain
  – a set of possible (atomic) values
• Attribute
  – defined over a domain, has a name, cf. variable
• Tuple
  – a set of values, one associated with each attribute, cf. cases or records. NULL values supported
• Relation
  – defined over a set of attributes, has a name, consists of a set of tuples
• Relational DataBase
  – a set of relations
Relations
• Rectangular Data tables, in which columns are variables, rows are cases.
• Tables can be linked by the values of common variables.
Relation Cancer_Registration

Domain     Registration ID  Gender  Age in years         Weight in kg, 1dp  SEG  Age in years      ICD
Attribute  ID               Sex     Age_at_Registration  Weight             SEG  Age_at_Diagnosis  Diagnosis
Tuples     4951             2       58                   54.5               3    53                194
           4952             1       68                   75.8               2    60                285
           4953             2       73                   62.3               5    70                162
           4954             2       52                   48.7               2    45                501
           4955             1       49                   95.2               4    47                162
           4956             1       68                   112.7              3    61                196
           4957             1       87                   84.2               4    85                203
           4958             2       74                   69.4               4    70                162
           4959             1       69                   53.0               1    69                186
           4960             2       92                   83.0               5    92                162
           4961             1       45                   67.4               5    41                503

Relation Lung_Diseases

Domain     ICD        String
Attribute  Diagnosis  Disease_Name
Tuples     162        Cancer of the Lung
           501        Asbestosis
           503        Mesothelioma
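As a rough illustration of linking tables by the values of a common variable, the sketch below loads the two relations from this slide into an in-memory SQLite database and matches them on Diagnosis. The choice of SQLite (through Python's standard sqlite3 module) and the column types are assumptions made for the example; the slides do not name a particular RDBMS.

```python
import sqlite3

# In-memory database; SQLite is used only because it ships with Python.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Lung_Diseases (
        Diagnosis    INTEGER PRIMARY KEY,  -- domain: ICD code
        Disease_Name TEXT NOT NULL         -- domain: string
    );
    CREATE TABLE Cancer_Registration (
        ID                  INTEGER PRIMARY KEY,
        Sex                 INTEGER,
        Age_at_Registration INTEGER,
        Weight              REAL,          -- kg, one decimal place
        SEG                 INTEGER,
        Age_at_Diagnosis    INTEGER,
        Diagnosis           INTEGER        -- ICD code, the common variable
    );
""")
conn.executemany(
    "INSERT INTO Lung_Diseases VALUES (?, ?)",
    [(162, "Cancer of the Lung"), (501, "Asbestosis"), (503, "Mesothelioma")],
)
# Tuples copied from the slide.
conn.executemany(
    "INSERT INTO Cancer_Registration VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (4951, 2, 58, 54.5, 3, 53, 194),
        (4952, 1, 68, 75.8, 2, 60, 285),
        (4953, 2, 73, 62.3, 5, 70, 162),
        (4954, 2, 52, 48.7, 2, 45, 501),
        (4955, 1, 49, 95.2, 4, 47, 162),
        (4956, 1, 68, 112.7, 3, 61, 196),
        (4957, 1, 87, 84.2, 4, 85, 203),
        (4958, 2, 74, 69.4, 4, 70, 162),
        (4959, 1, 69, 53.0, 1, 69, 186),
        (4960, 2, 92, 83.0, 5, 92, 162),
        (4961, 1, 45, 67.4, 5, 41, 503),
    ],
)

# Link the two relations on the shared Diagnosis values.
linked = conn.execute("""
    SELECT r.ID, r.Diagnosis, d.Disease_Name
    FROM Cancer_Registration AS r
    JOIN Lung_Diseases AS d ON r.Diagnosis = d.Diagnosis
    ORDER BY r.ID
""").fetchall()
for row in linked:
    print(row)
```

Only registrations whose Diagnosis code appears in Lung_Diseases are returned; the other registrations have no matching tuple and drop out of this particular link.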
SQL
– International Standard, actively revised
– Widely available in good RDBMS software
– Text language, used by programs and people
  • Stored or constructed at run-time
– Designed to support tools which are independent of the DBMS
  • cf. Client-Server architecture
– User and Programmer skills portable across products and sites
– Has sections for
  • Defining database structure (DDL)
  • Manipulating database content (DML)
  • Ensuring database integrity
  • Managing database security
SQL Data Manipulation
SELECT {DISTINCT} <expression list>
FROM <table list>
WHERE <condition>
GROUP BY <column list>
HAVING <group condition>
ORDER BY <column list>
• All manipulation (retrieval) of data uses the single Select statement, which has various components corresponding to different relational operations.
• The result of a SELECT statement is a relational table, which is displayed (by default) or can be stored or processed in another statement.
• Retrieval is usually done through a Query interface, which generates the SQL.
Data Manipulation - Join
• The Join operation selects the combinations of rows from two or more tables that meet a matching condition
  SELECT R1.*, R2.* FROM R1, R2 WHERE A = X
• All matching combinations are retained
  – E.g. can relate household information to all members via household number

Join A=X
A  B   C   R1.D  X  R2.D  E
1  b1  c1  d1    1  d3    e1
2  b2  c2  d2    2  d5    e2
3  b3  c3  d3    3  d3    e3
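The join on this slide can be reproduced with a small sketch. The contents of R1 and R2 are read back from the result table above; the column types and the use of SQLite via Python's sqlite3 module are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1 (A INTEGER, B TEXT, C TEXT, D TEXT);
    CREATE TABLE R2 (X INTEGER, D TEXT, E TEXT);
""")
conn.executemany(
    "INSERT INTO R1 VALUES (?, ?, ?, ?)",
    [(1, "b1", "c1", "d1"), (2, "b2", "c2", "d2"), (3, "b3", "c3", "d3")],
)
conn.executemany(
    "INSERT INTO R2 VALUES (?, ?, ?)",
    [(1, "d3", "e1"), (2, "d5", "e2"), (3, "d3", "e3")],
)

# The statement from the slide, with an ORDER BY added so the output is stable.
joined = conn.execute(
    "SELECT R1.*, R2.* FROM R1, R2 WHERE A = X ORDER BY A"
).fetchall()
for row in joined:
    print(row)
```

Each output row carries the four R1 columns followed by the three R2 columns, so both D columns appear, as in the slide's result table.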
Data Manipulation - Aggregation
• Aggregation functions can be used in the variable list
  SELECT COUNT(DIAGNOSIS), COUNT(DISTINCT DIAGNOSIS), AVG(WEIGHT)
  FROM CANCER_REGISTRATION
– This produces one record for the whole table containing three values: the number of records with non-null values for Diagnosis, the number of distinct Diagnosis values, and the average Weight.
– The available functions are COUNT, SUM, AVG, MIN, MAX.
– Aggregation functions can be applied to expressions.
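A minimal sketch of whole-table aggregation, using the ID, Weight and Diagnosis values from the Cancer_Registration relation shown earlier (the other columns are left out for brevity). Note that standard SQL, and SQLite as used here, spells the averaging function AVG.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cancer_Registration "
    "(ID INTEGER PRIMARY KEY, Weight REAL, Diagnosis INTEGER)"
)
# (ID, Weight, Diagnosis) taken from the Relations slide.
conn.executemany(
    "INSERT INTO Cancer_Registration VALUES (?, ?, ?)",
    [
        (4951, 54.5, 194), (4952, 75.8, 285), (4953, 62.3, 162),
        (4954, 48.7, 501), (4955, 95.2, 162), (4956, 112.7, 196),
        (4957, 84.2, 203), (4958, 69.4, 162), (4959, 53.0, 186),
        (4960, 83.0, 162), (4961, 67.4, 503),
    ],
)

# One record for the whole table, containing three aggregate values.
n_diag, n_distinct, mean_weight = conn.execute("""
    SELECT COUNT(Diagnosis), COUNT(DISTINCT Diagnosis), AVG(Weight)
    FROM Cancer_Registration
""").fetchone()
print(n_diag, n_distinct, round(mean_weight, 2))
```

COUNT(Diagnosis) ignores NULLs, so on data with missing diagnoses it would be smaller than COUNT(*).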
Data Manipulation - Grouping
• The GROUP BY clause is used to perform aggregation within subgroups, producing a record for each group.
  SELECT DIAGNOSIS, COUNT(DIAGNOSIS), AVG(WEIGHT)
  FROM CANCER_REGISTRATION
  GROUP BY DIAGNOSIS
  ORDER BY DIAGNOSIS
– This produces a record for each diagnosis containing the number of registrations and their average weight as well as the diagnosis code.
– It is sorted into diagnosis code order, using the ORDER BY clause.
– The Group By clause can contain multiple fields (for cross-tabulation).
– The expression list can contain variables used in the Group By clause, and aggregate functions for any variable.
– A WHERE clause selects the records to be aggregated.
– A HAVING clause selects the groups to be returned.
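The difference between WHERE (filters records before aggregation) and HAVING (filters groups after aggregation) may be easier to see in a runnable sketch. It uses the ID, Sex, Weight and Diagnosis values from the Cancer_Registration relation shown earlier; the particular conditions (Sex = 2, more than one registration per group) are arbitrary choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cancer_Registration "
    "(ID INTEGER PRIMARY KEY, Sex INTEGER, Weight REAL, Diagnosis INTEGER)"
)
# (ID, Sex, Weight, Diagnosis) taken from the Relations slide.
conn.executemany(
    "INSERT INTO Cancer_Registration VALUES (?, ?, ?, ?)",
    [
        (4951, 2, 54.5, 194), (4952, 1, 75.8, 285), (4953, 2, 62.3, 162),
        (4954, 2, 48.7, 501), (4955, 1, 95.2, 162), (4956, 1, 112.7, 196),
        (4957, 1, 84.2, 203), (4958, 2, 69.4, 162), (4959, 1, 53.0, 186),
        (4960, 2, 83.0, 162), (4961, 1, 67.4, 503),
    ],
)

# One record per diagnosis code, sorted by code.
by_diag = conn.execute("""
    SELECT Diagnosis, COUNT(Diagnosis), AVG(Weight)
    FROM Cancer_Registration
    GROUP BY Diagnosis
    ORDER BY Diagnosis
""").fetchall()

# WHERE restricts the records aggregated; HAVING restricts the groups returned.
filtered = conn.execute("""
    SELECT Diagnosis, COUNT(*), AVG(Weight)
    FROM Cancer_Registration
    WHERE Sex = 2
    GROUP BY Diagnosis
    HAVING COUNT(*) > 1
    ORDER BY Diagnosis
""").fetchall()

print(by_diag)
print(filtered)
```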
Views
• Stored query definition
  – Can be evaluated on demand
  – Used to present data in the form needed by the user (a person or a process)
• Dynamic evaluation
  – Ensures that the viewed information is up to date
  – May be inefficient if the information does not change
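A small sketch of dynamic evaluation: the view below is only a stored query, so a registration added after the view was defined shows up the next time the view is read. The view name Diagnosis_Counts and the three starting rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cancer_Registration (ID INTEGER PRIMARY KEY, Diagnosis INTEGER)"
)
conn.executemany(
    "INSERT INTO Cancer_Registration VALUES (?, ?)",
    [(4953, 162), (4954, 501), (4955, 162)],
)

# The view stores the query definition, not its result.
conn.execute("""
    CREATE VIEW Diagnosis_Counts AS
    SELECT Diagnosis, COUNT(*) AS N
    FROM Cancer_Registration
    GROUP BY Diagnosis
""")

query = "SELECT Diagnosis, N FROM Diagnosis_Counts ORDER BY Diagnosis"
before = conn.execute(query).fetchall()

# A new registration arrives; the view is re-evaluated on the next read.
conn.execute("INSERT INTO Cancer_Registration VALUES (4958, 162)")
after = conn.execute(query).fetchall()

print(before)
print(after)
```

The cost of this freshness is that the grouping is recomputed on every read, which is the inefficiency the slide mentions for data that does not change.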
Database Design
• Choose physical tables and fields
  – Capture essential characteristics of real-world system
  – Support intended uses of the information (directly or through views)
  – Operate efficiently
• Identify important links between tables
  – Provide appropriate Keys
  – Choose Indexes and other implementation details
• Retain flexibility for future needs
• Various methodologies for design process
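To make keys and indexes concrete, here is a hedged sketch of a two-table design in which household members are linked to households through a composite key. The table and column names (Household, Member, Cluster, H_No, Line_No, Age, Province) and the data are hypothetical. In SQLite, foreign keys are only enforced after `PRAGMA foreign_keys = ON`; other systems enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE Household (
        Cluster  INTEGER,
        H_No     INTEGER,
        Province TEXT,
        PRIMARY KEY (Cluster, H_No)
    );
    CREATE TABLE Member (
        Cluster INTEGER,
        H_No    INTEGER,
        Line_No INTEGER,
        Age     INTEGER,
        PRIMARY KEY (Cluster, H_No, Line_No),
        -- the link between the tables: every member belongs to a household
        FOREIGN KEY (Cluster, H_No) REFERENCES Household (Cluster, H_No)
    );
    -- an index is an implementation detail: it speeds lookups by Age
    CREATE INDEX Member_Age ON Member (Age);
""")
conn.execute("INSERT INTO Household VALUES (1, 10, 'Punjab')")
conn.execute("INSERT INTO Member VALUES (1, 10, 1, 34)")

# A member pointing at a household that does not exist is rejected.
try:
    conn.execute("INSERT INTO Member VALUES (1, 99, 1, 20)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

member_count = conn.execute("SELECT COUNT(*) FROM Member").fetchone()[0]
print(orphan_rejected, member_count)
```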
RDB Summary
• Relational databases are ubiquitous, and are useful for large-scale data collections
• Some manipulation and aggregation operations can be done more easily than in statistical packages
• Relational model is a useful way of thinking about data structures
  – May suggest different data structure
• No replacement for statistical packages for statistical analysis
• Flexible manipulation of more complex data structures
• Useful development tools for processes
• Interface to Views for Statistical Analysis
  – ODBC or similar
  – Extract micro- or macro-data for statistical analysis
Databases for Statistics
• Simple situations
  – Use statistical or survey packages
• Studies with Complex Structure
  – Use RDB for flexible manipulation
  – May store static result tables
• Ongoing Data Capture Processes
  – Use RDB for dynamic data entry, updating, matching and checking
  – Choose data structure to optimise data processes and quality, not ease of statistical processing
  – May take static snapshots for regular statistical analysis
• Statistical Dissemination – another presentation
• Statistical Metadata – yet another presentation
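PFFPS – Main structure
(Table diagram; PK = primary key, FK = foreign key, I = index.)

PFFPSHID
  Keys: CLUSTER (PK, FK1, I1); H_NO (PK)
  Fields: RECORD_TYP, PROVINCE, AREA, DISTRICT, VISIT, F_DAY, F_MON, F_YEA, INTERVIEW, SUPERVISOR, DEO, RESULT, Q16T, Q16M, Q16F, Q17, Q18, Q19, Q20, Q21, Q22, Q23_1, Q23_2, Q23_3, Q23_4, Q23_5, Q23_6, Q23_7, Q23_8, Q24, Q25_1, Q25_2, Q25_3, Q25_4, Q26, Q27

PFFPSHHM
  Keys: CLUST_HHM (PK, FK1, I1); H_NO_HHM (PK, FK1, I1); Q01 (PK)
  Fields: RECORD_TYP, PROV_HHM, AREA_HHM, Q03, Q04, Q05, Q06, Q07, Q08, Q09, Q10, Q11, Q12, Q13, Q14, Q15

PFFPSWID
  Keys: CLUST_HHM (PK, FK1); H_NO_HHM (PK, FK1); Q01 (PK, FK1)
  Indexed fields: PROV_WID (I10), AREA_WID (I1), LINE_WID (I9), DIST_WID (I3), F_DAY_WID (I4), F_MON_WID (I5), F_YEA_WID (I6), VISIT_WID (I11)
  Fields: RECORD_TYP, WOM_RES, Q101H, Q101M, Q102, Q103, Q104, Q105M, Q105Y, Q106, Q107, Q108L, Q108C, Q109, Q110, Q111, Q112, Q113, Q114, Q115, Q201, Q202, Q203S, Q203D, Q204, Q205S, Q205D, Q206, Q207S, Q207D, Q208, Q210, Q218, Q219, Q220, Q221, Q222, Q223, Q224, Q225_1, Q225_2, Q225_3, Q225_4, Q225_5, Q225_6, Q225_7, Q226, Q227, Interview CMC, Birth CMC

Weights
  Keys: Cluster (PK)
  Fields: Weight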
Data Tables for PFFPS
Weights:  Cluster (PK)
PFFPSHID: CLUSTER (PK, FK1, I1); H_NO (PK)
PFFPSHHM: CLUST_HHM (PK, FK1, I1); H_NO_HHM (PK, FK1, I1); Q01 (PK)
PFFPSWID: CLUST_WID (PK, FK1); H_NO_WID (PK, FK1); L_NO_WID (PK, FK1)
PFFPSWBH: CLUST_WBH (PK, FK1); H_NO_WBH (PK, FK1); L_NO_WBH (PK, FK1); LINE_WBH (PK)
PFFPSWPB: CLUST_WPB (PK, FK1); H_NO_WPB (PK, FK1); L_NO_WPB (PK, FK1); LINE_WPB (PK, FK1)
PFFPSWC1: CLUST_WC1 (PK, FK1); H_NO_WC1 (PK, FK1); L_NO_WC1 (PK, FK1); Q301 (PK)
PFFPSWC2: CLUST_WC2 (PK, FK1); H_NO_WC2 (PK, FK1); L_NO_WC2 (PK, FK1)
PFFPSWMF: CLUST_WMF (PK, FK1); H_NO_WMF (PK, FK1); L_NO_WMF (PK, FK1)
PFFPSWSE: CLUST_WSE (PK, FK1); H_NO_WSE (PK, FK1); L_NO_WSE (PK, FK1)
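One plausible use of the separate Weights table, sketched under assumptions: each woman record is linked to its cluster's weight through the cluster number, and the weights are summed to give a weighted count. The table and key names follow the slide, but the data values are invented, and the slides do not say that the weights were applied in exactly this way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Weights (
        Cluster INTEGER PRIMARY KEY,
        Weight  REAL
    );
    CREATE TABLE PFFPSWID (
        CLUST_WID INTEGER,
        H_NO_WID  INTEGER,
        L_NO_WID  INTEGER,
        PRIMARY KEY (CLUST_WID, H_NO_WID, L_NO_WID)
    );
""")
# Invented example data: two clusters, three interviewed women.
conn.executemany("INSERT INTO Weights VALUES (?, ?)", [(1, 1.5), (2, 0.5)])
conn.executemany(
    "INSERT INTO PFFPSWID VALUES (?, ?, ?)",
    [(1, 1, 2), (1, 2, 3), (2, 1, 2)],
)

# Unweighted and weighted counts of women, linking on the cluster number.
unweighted, weighted = conn.execute("""
    SELECT COUNT(*), SUM(w.Weight)
    FROM PFFPSWID AS p
    JOIN Weights AS w ON p.CLUST_WID = w.Cluster
""").fetchone()
print(unweighted, weighted)
```

Keeping the weight in one row per cluster, rather than repeating it on every record, means a revised weight needs to be changed in only one place.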
PFFPS Meta Data Structure
Record:     Record (PK)
Dictionary: Field (PK); Record (FK1, I1)
Category:   Field (PK, FK1); Code (PK, I1)
Range:      Field (PK, FK1); Lower (PK)
Skip:       Field (PK, FK1); Code (PK, I1)
HIV and AIDS reporting section, PHLS
• Redesign of processing and reporting system
• Old system (1994) based on files processed with Epi-Info and Aspect database
• Requirements
  – Simplify management of data
  – Improve consistency
  – Simplify regular reporting
  – Enable ad hoc reporting
Old System
• Input Notifications
  – HIV Tests (standard form)
  – AIDS Diagnosis (standard)
  – ONS Death Reports
• Some notifications on disc, most on paper
• Data Storage
  – Separate HIV and AIDS files
  – Deaths Added
• Reporting
  – Quarterly standard reports
  – Ad Hoc inquiries (e.g. PQs)
  – Occasional in-depth analysis
• Processes
  – Continuous Data Entry from paper
  – Regular Linking for reporting extractions
  – Occasional Matching within and between files
• Problems
  – Duplicates and omissions
  – Name matching
  – Repeated linking
  – Inflexible for reporting (e.g. progression from HIV to AIDS)
Old Structure
Diagram labels: HIV Forms; AIDS Forms; Death Reports on disc; HIV Notification File; AIDS Notification File; Quarterly Extracts
Initial Thoughts
• Notification forms appropriate for reporting, but not for primary information store
• Information is about Patients
  – Notifications are sources of information
  – Record information about events and states
• Do matching once, at initial entry
• Unified database structure
• Additional components for metadata, auditing, contact information
• Extracts to separate statistical reporting system
Full system
Patient Data
Notification Processing
External Organisation
Metadata System
Auditing System
Reporting Facilities
Overall Structure
Main Patient Tables
Patient:        Patient ID (PK); Soundex (I11); Init (I5); DOB (I2); Sex (I10); Occ (I7); PC Out (I9); PC In (I8); DHA (I1); LA (I6); Ethnic Gp (I3); Exp (I4); Audit ID (FK1, U2, U1)
HIV Infection:  Patient ID (PK, FK2); HivNo (U3); Hospital ID (FK3); Audit ID (FK1, U2, U1)
AIDS Diagnosis: Patient ID (PK, FK2); AidsNo (U1); Hospital ID (FK3); Audit ID (FK1, U3, U2)
Death Record:   Patient ID (PK, FK2); AidsNo (FK3, U1); HivNo (FK4, U5); ONS ID (U6); Death ID (U4); Date Death (I1); Audit ID (FK1, U3, U2)
AIDS Indicator: Patient ID (PK, FK1, I1); Indicator Type Code (PK); Indicator Code (PK); Audit ID (FK2, U2, U1)
Patients and Notifications
Diagram labels:
  Groups: Source Notifications; Optional Components; Repeating Components
  Patient; HIV Infection; AIDS Diagnosis; Death Record
  HIV Notification (green form); AIDS Notification (blue form); ONS Death Notification; External Files
  Alternative IDs (Soundex, DoB); Previous Status (Date); Blood Donation; Blood Recipient; Follow Up Study (Vicky); Action (Type, Date, Outcome); External ID (Context, Institution); Residence (PostCode, Date); AIDS Indicator
Full HAP Data Structure

Death Record:        Patient ID (PK, FK2); AidsNo (FK3, U1); HivNo (FK4, U5); ONS ID (U6); Death ID (U4); Audit ID (FK1, U3, U2)
Patient:             Patient ID (PK); Audit ID (FK1, U2, U1)
AIDS Diagnosis:      Patient ID (PK, FK2); AidsNo (U1); Hospital ID (FK3); Audit ID (FK1, U3, U2)
HIV Infection:       Patient ID (PK, FK2); HivNo (U3); Hospital ID (FK3); Audit ID (FK1, U2, U1)
External Patient ID: Patient ID (PK, FK3, I3); Organisation ID (PK, FK2, I1); ID in Institution (PK, I2); Audit ID (FK1, U1)
Alternative ID:      Patient ID (PK, FK2, I1); Identifier Type Code (PK, I3); Identifier Value (PK, I2); Audit ID (FK1, U2, U1)
Action:              Action ID (PK); Patient ID (FK2, I4, I2); Audit ID (FK1, U1, U2)
Contact:             Organisation ID (PK, FK3, I2); Person ID (PK, FK2, I1); Audit ID (FK1, U2, U1)
Person:              Person ID (PK); Audit ID (FK1, U2, U1)
Organisation:        Organisation ID (PK); Director (FK2, I2); Primary Contact (FK3, I3); Audit ID (FK1, U2, U1)
AIDS Indicator:      Patient ID (PK, FK1, I1); Indicator Type Code (PK); Indicator Code (PK); Audit ID (FK2, U2, U1)
Notification:        Notification ID (PK); Patient ID (FK3); Batch ID (FK2, I2, I1); Notification No (U3); Organisation ID (FK4, I6); Audit ID (FK1, U1, U2)
Audit Block:         Audit ID (PK); Session ID (FK1, I1); Creation Session ID (FK2, I2)
Session:             Session ID (PK); User Code (FK1, I1)
User:                User Code (PK); Session ID (FK1, I1)
Note:                Note ID (PK); Patient ID (FK2, I2, I1); Audit ID (FK1, U2, U1)
Relationship labels: Person; works at
Data Entry
Diagram labels:
  Notification
  New Notification: Enter common Patient identification from a New Notification; View Matches with existing Patients; Initiate entry of a new Patient or updating an existing one
  External Data Sources
  New Patient: Create a new Patient record
  Transfer: Transfer information from New Notification
  HIV Notification: Information about the Patient and HIV Infection taken from a green HIV form
  AIDS Notification: Information about the Patient and AIDS Diagnosis taken from a blue AIDS form
  Death Record: Add or Update Death Information for a Patient
  Tables: Patient; HIV Infection; AIDS Diagnosis; Death Record
• Get Patient details from source
• Look for matching existing patients
  – May need expert to decide
  – Create new Patient if needed
• Open appropriate form for source to complete transfer of Patient and details
• Forms know how to handle duplicates (generally take earliest)
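The matching step at entry might look something like the sketch below. It is a simplification under stated assumptions: candidates are found by exact agreement on a precomputed Soundex code and date of birth, a single candidate is accepted automatically, several candidates are referred for expert review, and no candidate leads to a new Patient record. The function names, the decision rule and the sample data are hypothetical and are not taken from the actual system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Patient (
        Patient_ID INTEGER PRIMARY KEY,
        Soundex    TEXT,   -- precomputed name code (hypothetical values below)
        DOB        TEXT,   -- ISO date string
        Sex        INTEGER
    )
""")
conn.execute(
    "INSERT INTO Patient (Soundex, DOB, Sex) VALUES ('W234', '1960-05-17', 1)"
)


def find_matches(conn, soundex, dob):
    """Return IDs of existing patients agreeing on Soundex code and DOB."""
    rows = conn.execute(
        "SELECT Patient_ID FROM Patient "
        "WHERE Soundex = ? AND DOB = ? ORDER BY Patient_ID",
        (soundex, dob),
    ).fetchall()
    return [row[0] for row in rows]


def match_or_create(conn, soundex, dob, sex):
    """Return (patient_id, outcome) for the details on a new notification."""
    matches = find_matches(conn, soundex, dob)
    if len(matches) == 1:
        return matches[0], "matched"
    if matches:
        # More than one candidate: leave the decision to an expert.
        return None, "needs expert review"
    cur = conn.execute(
        "INSERT INTO Patient (Soundex, DOB, Sex) VALUES (?, ?, ?)",
        (soundex, dob, sex),
    )
    return cur.lastrowid, "created"


first = match_or_create(conn, "W234", "1960-05-17", 1)
second = match_or_create(conn, "S530", "1971-09-02", 2)
print(first, second)
```

Doing this once, when the notification is entered, is what removes the repeated linking that the old file-based system needed before each report.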
Extraction
• Separate Extraction Database template
  – Linked to main database
  – Set of associated Excel spreadsheet templates
• New copy each quarter
  – Populate by Extraction of information through Views
  – Gives a Snapshot of Patient status
• Set of aggregation queries
  – Import results into spreadsheets as source
• Reorganise with Pivot tables
  – Create publication versions
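The extraction idea can be sketched as follows: a view in the main database presents current patient status, a copy of the view's output is stored in a separate database as the quarterly snapshot, and the aggregation query runs on that copy. The simplified tables, the view name Patient_Status, the status rule and the data are assumptions for illustration; the second in-memory SQLite database stands in for the separate extraction database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A second in-memory database plays the role of the extraction database.
conn.execute("ATTACH DATABASE ':memory:' AS snap")
conn.executescript("""
    CREATE TABLE Patient (Patient_ID INTEGER PRIMARY KEY, Sex INTEGER);
    CREATE TABLE AIDS_Diagnosis (Patient_ID INTEGER PRIMARY KEY, Diag_Date TEXT);
    -- View giving each patient's current status (hypothetical rule).
    CREATE VIEW Patient_Status AS
        SELECT p.Patient_ID, p.Sex,
               CASE WHEN a.Patient_ID IS NULL THEN 'HIV' ELSE 'AIDS' END AS Status
        FROM Patient AS p
        LEFT JOIN AIDS_Diagnosis AS a ON a.Patient_ID = p.Patient_ID;
""")
conn.executemany("INSERT INTO Patient VALUES (?, ?)", [(1, 1), (2, 2), (3, 1)])
conn.execute("INSERT INTO AIDS_Diagnosis VALUES (1, '2002-11-04')")

# Populate the snapshot by extracting through the view.
conn.execute(
    "CREATE TABLE snap.Status_Snapshot AS SELECT * FROM main.Patient_Status"
)

# The main database keeps changing after the snapshot is taken.
conn.execute("INSERT INTO AIDS_Diagnosis VALUES (2, '2003-02-20')")

counts = "SELECT Status, COUNT(*) FROM {} GROUP BY Status ORDER BY Status"
snapshot_counts = conn.execute(counts.format("snap.Status_Snapshot")).fetchall()
live_counts = conn.execute(counts.format("main.Patient_Status")).fetchall()
print(snapshot_counts)
print(live_counts)
```

The snapshot figures stay fixed for the quarter's reports, while the live view continues to reflect new notifications.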
Conclusions
• Good analysis needs good data
  – Need right structure for efficient processing
• RDB provides flexible structure and manipulation
  – Useful for complex situations
  – Can produce the views needed for statistical analysis
  – Quick to implement at a basic structural level
• Good framework for implementing production processes
  – Important to get the process right for good quality data
  – Not cheap to implement production processes
• Can easily include associated information
  – Metadata, other supporting information
The End
• Extended versions of presentations available from
  – www.sasc.co.uk
  – Follow links to MSc in Statistical Computing