1
RTI International is a trade name of Research Triangle Institute
3040 Cornwallis Road ■ P.O. Box 12194 ■ Research Triangle Park, North Carolina, USA 27709
Phone 919-990-8397 ■ e-mail ggrubbs@rti.org ■ Fax 919-541-6178
Database Architecture and Design Workshop
George Grubbs
May 17, 2005
2
Field Director’s Guide to “Database Design Appreciation”
3
Why should you care?
Expectation setting
To become familiar with data modeling and the database design process, terminology and concepts.
To understand what goes on when a survey is being developed or changed.
To better communicate with database designers and programmers when developing or modifying a survey instrument.
To appreciate the overall database design process and its value.
To get better study outcomes and smoother system development efforts.
4
Workshop Objectives
Answer the questions:
What is a (relational) database?
How does it relate to a survey?
What is database design?
How do you design a database?
What about data warehousing?
5
Workshop Schedule
Schedule
Some basics: database concepts & terminology
Database design process
Data modeling – part 1
Break
Data modeling – part 2
Data warehouse fundamentals
Introduction to the Case Study
Performing the Case Study
6
First things first: “What is a database?”
It is a structured collection of related data.
Office terminology: file, record, field.
Database terminology (logical): entity, entity instance, attribute.
Database terminology (physical): table, row, column.
Example: Survey database using “logical terminology”: Questionnaire, Respondent, and Response entities. A Response entity instance = “John Smith’s response to question 10”. The Response entity’s attribute might be called “answer”. And the data value for “John Smith’s response to question 10” might be “true”.
What is the process for getting data into an (electronic) database from a source document? That’s next.
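The logical-to-physical mapping above can be sketched in a few lines. This is only an illustration: SQLite (through Python's built-in sqlite3 module) stands in for the DBMS, and the column names respondent_name and question_number are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the DBMS.
conn = sqlite3.connect(":memory:")

# Entity -> table, attribute -> column.
conn.execute(
    "CREATE TABLE Response ("
    " respondent_name TEXT,"     # hypothetical column
    " question_number INTEGER,"  # hypothetical column
    " answer TEXT)"              # the "answer" attribute from the slide
)

# Entity instance -> row: John Smith's response to question 10.
conn.execute(
    "INSERT INTO Response VALUES (?, ?, ?)", ("John Smith", 10, "true")
)

# The data value is what sits in one column of one row.
row = conn.execute(
    "SELECT answer FROM Response"
    " WHERE respondent_name = ? AND question_number = ?",
    ("John Smith", 10),
).fetchone()
print(row[0])
```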
7
Source document to electronic form and into the database
Survey_ID: “123”. Q1: “Y” Q2: “N”
Some kind of programming language: Visual Basic, “C”, EntryPoint, etc. Uses SQL (Structured Query Language) to interface with the DBMS.
DBMS = MS Access, SQL Server 2000, Oracle, DB2.
OS = Operating System = Windows, Linux, Unix.
Hard drive
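As a rough sketch of that path, the snippet below plays the role of the "programming language" layer: it takes the keyed-in values from the slide and hands them to the DBMS through SQL. Python and SQLite are stand-ins chosen for illustration, and the flat Survey table and the store_document helper are hypothetical.

```python
import sqlite3

# The values keyed in from the source document (from the slide).
source_document = {"Survey_ID": "123", "Q1": "Y", "Q2": "N"}

# Assumption: in-memory SQLite stands in for MS Access, SQL Server, etc.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Survey (survey_id TEXT PRIMARY KEY, q1 TEXT, q2 TEXT)"
)  # hypothetical flat table, one column per question

def store_document(connection, doc):
    """Send one source document to the DBMS with a parameterized INSERT."""
    connection.execute(
        "INSERT INTO Survey (survey_id, q1, q2) VALUES (?, ?, ?)",
        (doc["Survey_ID"], doc["Q1"], doc["Q2"]),
    )
    connection.commit()

store_document(conn, source_document)
stored = conn.execute("SELECT survey_id, q1, q2 FROM Survey").fetchall()
print(stored)
```

The DBMS, not the program, decides how the row ends up on the hard drive; the program only speaks SQL.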
8
Where is the data?
Here it is!
9
Disk array with 18 TB capacity
1 TB = 1,000,000,000,000 bytes (decimal) or 1 TB = 1,099,511,627,776 bytes (binary)
1 KB = 1024 bytes
1 MB = 1 KB x 1 KB = 1,048,576
1 GB = 1 KB x 1 MB = 1,073,741,824
1 TB = 1 KB x 1 GB = 1,099,511,627,776
1 PB = 1 petabyte = 1 KB x 1 TB
1 EB = 1 exabyte = 1 KB x 1 PB
1 ZB = 1 zettabyte = 1 KB x 1 EB
1 YB = 1 yottabyte = 1 KB x 1 ZB
Example: Survey with 400 questions, where each response averages 100 characters (40,000 bytes per survey at one byte per character).
1 GB = 26,844 surveys
1 TB = 27,487,791 surveys
1 PB = 28,147,497,671 surveys
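The survey counts above follow from simple division. A minimal check, assuming one byte per character and rounding to the nearest whole survey (the function name is illustrative):

```python
KB = 1024
GB = KB ** 3
TB = KB ** 4
PB = KB ** 5

QUESTIONS_PER_SURVEY = 400
CHARS_PER_RESPONSE = 100
BYTES_PER_CHAR = 1  # assumption: single-byte characters

bytes_per_survey = QUESTIONS_PER_SURVEY * CHARS_PER_RESPONSE * BYTES_PER_CHAR

def surveys_that_fit(capacity_bytes):
    """Number of surveys that fit, rounded to the nearest whole survey."""
    return round(capacity_bytes / bytes_per_survey)

for label, capacity in (("1 GB", GB), ("1 TB", TB), ("1 PB", PB)):
    print(label, "=", f"{surveys_that_fit(capacity):,}", "surveys")
```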
10
Database design
First things first: Requirements
[Diagram labels: Output and End Users; Questionnaire, Interview instrument; Interviewers; process]
11
You might get some requirements from considering the interviewees
The Population
12
Data modeling is the heart of db design
1. Construct a logical data model
Entity-Relationship Diagram (ERD)
Key-Based Data Model (KBDM)
Fully-Attributed Data Model (FADM)
2. Construct a physical data model
Physical Data Model (PDM)
Make data model improvements
13
Main goals in database design
Minimize redundant data (ideally, each data value should be in only one place in the database).
Reflect the business rules of the application domain (data quality).
Construct a clear and understandable data model that is well-documented (used to “communicate”).
Benefits: data quality, structural integrity, data consistency, performance, understand requirements.
14
Logical Data Model - ERD
The ERD is very simple: it only considers entities and their relationships.
An entity models something in the “real world”, that is, something in our “application domain” (the “survey domain” in our case). For example, a “person” would be an entity.
Let’s look at some entity examples, then deal with relationships.
15
Example entities with a few attributes
Questionnaire
  Type A, Eff Date: 2/13/04, …
  Type B, Eff Date: 2/13/03, …
  Type B, Eff Date: 4/1/05, …
Respondent
  Joyce E. Smith, Female, Lives in North Carolina, Age 42, …
Question
  1. What state are you from?
  2. What is your age?
  1. Plan to re-visit?
Interviewer
  John W. Romano, Male, 5 years of experience, …
  Carrie Jones, Female, 1 year of experience, …
Response
  True, False, True, True, …
So how are these entities related? Let’s see.
16
Entity relationships
[Diagram: the Questionnaire, Respondent, Interviewer, Response, and Question entities, connected by lines carrying these relationship labels: is completed by, completes, has, has, is part of, is answer to, makes, is made by, interviews, is interviewed by]
17
Cardinality
Cardinality is the occurrence relationship between two entities: the number of times one entity instance can occur for each instance of a related entity.
[Diagram: the Questionnaire, Respondent, Interviewer, Response, and Question entities, with four relationships labeled 1 to N and one labeled N to M]
18
KBDM: Key-Based Data Model
Primary Keys (PK)
Respondent
  respondent_id (PK)
Interviewer
  interviewer_id (PK)
Questionnaire
  questionnaire_id (PK)
Question
  question_id (PK)
Response
  response_id (PK)
A primary key value uniquely identifies a row in a table.
The lines are used to indicate types of relationships and cardinality.
19
KBDM: Key-Based Data Model
Primary Keys and Foreign Keys (FK)
Interviewer
  interviewer_id (PK)
Questionnaire
  questionnaire_id (PK)
Question
  questionnaire_id (PK) (FK)
  question_id (PK)
Response
  respondent_id (PK) (FK)
  question_id (PK) (FK)
  questionnaire_id (PK) (FK)
Resp_Intrvr_Assoc
  respondent_id (PK) (FK)
  interviewer_id (PK) (FK)
Respondent
  respondent_id (PK)
  questionnaire_id (FK)
Foreign keys are used to establish relationships between tables.
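A sketch of how these keys might be declared, using SQLite through Python as a stand-in DBMS. The column types are assumptions, Interviewer and Resp_Intrvr_Assoc are left out for brevity, and try_insert is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled

conn.executescript("""
CREATE TABLE Questionnaire (
    questionnaire_id INTEGER PRIMARY KEY
);
CREATE TABLE Question (
    questionnaire_id INTEGER NOT NULL
        REFERENCES Questionnaire (questionnaire_id),
    question_id      INTEGER NOT NULL,
    PRIMARY KEY (questionnaire_id, question_id)      -- composite PK
);
CREATE TABLE Respondent (
    respondent_id    TEXT PRIMARY KEY,
    questionnaire_id INTEGER REFERENCES Questionnaire (questionnaire_id)
);
CREATE TABLE Response (
    respondent_id    TEXT    NOT NULL REFERENCES Respondent (respondent_id),
    questionnaire_id INTEGER NOT NULL,
    question_id      INTEGER NOT NULL,
    PRIMARY KEY (respondent_id, questionnaire_id, question_id),
    FOREIGN KEY (questionnaire_id, question_id)
        REFERENCES Question (questionnaire_id, question_id)
);
""")

conn.execute("INSERT INTO Questionnaire VALUES (1)")
conn.execute("INSERT INTO Question VALUES (1, 1)")
conn.execute("INSERT INTO Respondent VALUES ('SP12', 1)")
conn.execute("INSERT INTO Response VALUES ('SP12', 1, 1)")

def try_insert(sql):
    """Run an INSERT and report whether the DBMS accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

# Same primary key value twice: the PK no longer identifies one row.
duplicate_key = try_insert("INSERT INTO Question VALUES (1, 1)")
# A response to a question that does not exist: the FK has no target.
orphan_response = try_insert("INSERT INTO Response VALUES ('SP12', 1, 99)")
print(duplicate_key)
print(orphan_response)
```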
20
FADM: Fully-Attributed Data Model
Add Attributes (and Normalize)
Questionnaire
  questionnaire_id (PK)
  type_code
  effective_date
Question
  question_id (PK)
  questionnaire_id (FK)
  question_text
Response
  respondent_id (FK)
  question_id (FK)
  questionnaire_id (FK)
  response
Respondent
  respondent_id (PK)
  last_name
  questionnaire_id (FK)
Interviewer
  interviewer_id (PK)
  last_name
Resp_Intrvr_Assoc
  respondent_id (FK)
  interviewer_id (FK)
  notes
Only a few attributes shown.
21
A word about “normalization”
In practice, to normalize a database design usually means to put it in third normal form (3NF).
There are quite a few normal forms: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF and even others.
The goal of normalization is primarily to minimize data redundancy. A fully normalized database can, however, be inefficient because queries become more complex; therefore, once actual performance is known, a database design is often selectively de-normalized to improve it.
22
Normalization examples
Version 1 (repeating fields)
Respondent
  respondent_id (PK)
  interviewer_1_name
  interviewer_2_name
Version 2
Respondent
  respondent_id (PK)
Respondent_Intvwr
  respondent_id (PK) (FK)
  interviewer_name (PK)
Version 3
Respondent
  respondent_id (PK)
Respondent_Intvwr
  respondent_id (PK) (FK)
  interviewer_id (PK) (FK)
Interviewer
  interviewer_id (PK)
  interviewer_name
What’s wrong with having repeating fields?
What if you need to have more than 2 interviewers?
What if an interviewer’s name changes?
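The two "what if" questions can be tried against the last design (separate Interviewer table plus an association table). This is a sketch only: SQLite stands in for the DBMS, and the sample rows, including the new name "Jones-Lee", are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Respondent (respondent_id TEXT PRIMARY KEY);
CREATE TABLE Interviewer (
    interviewer_id   TEXT PRIMARY KEY,
    interviewer_name TEXT NOT NULL
);
CREATE TABLE Respondent_Intvwr (
    respondent_id  TEXT NOT NULL REFERENCES Respondent (respondent_id),
    interviewer_id TEXT NOT NULL REFERENCES Interviewer (interviewer_id),
    PRIMARY KEY (respondent_id, interviewer_id)
);
INSERT INTO Respondent VALUES ('SP12'), ('WI65');
INSERT INTO Interviewer VALUES ('I1', 'Romano'), ('I2', 'Jones'), ('I3', 'White');
-- More than 2 interviewers for SP12: just more rows, no new columns.
INSERT INTO Respondent_Intvwr VALUES
    ('SP12', 'I1'), ('SP12', 'I2'), ('SP12', 'I3'), ('WI65', 'I2');
""")

# An interviewer's name changes: one row is updated, in one place.
cursor = conn.execute(
    "UPDATE Interviewer SET interviewer_name = ? WHERE interviewer_id = ?",
    ("Jones-Lee", "I2"),  # hypothetical new name
)
rows_changed = cursor.rowcount

# Every respondent linked to I2 sees the new name through the join.
names_for_wi65 = conn.execute(
    "SELECT i.interviewer_name FROM Respondent_Intvwr ri"
    " JOIN Interviewer i ON i.interviewer_id = ri.interviewer_id"
    " WHERE ri.respondent_id = 'WI65'"
).fetchall()
sp12_count = conn.execute(
    "SELECT COUNT(*) FROM Respondent_Intvwr WHERE respondent_id = 'SP12'"
).fetchone()[0]
print(rows_changed, names_for_wi65, sp12_count)
```

With the name stored in the association table instead, the same change would have to touch every row that mentions that interviewer.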
23
A word about “referential integrity”
Respondent
  respondent_id (PK)
  last_name
  questionnaire_id (FK)
Questionnaire
  questionnaire_id (PK)
  type_code
  effective_date
Would it make sense to have someone in the “Respondent” table with a “questionnaire_id” that did not “point to” a questionnaire in the “Questionnaire” table?
NULL = no value stored (not the same as zero or an empty string)
Parent: Questionnaire
  1  Type A  2/13/04
  3  Type B  4/10/05
Child: Respondent
  SP12  Smith  3
  WI13  Jones  2
  WI65  Phang  (NULL)
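The parent and child rows above can be replayed against a DBMS that enforces referential integrity. A minimal sketch, assuming SQLite as the stand-in DBMS and a nullable questionnaire_id column; add_respondent is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
conn.executescript("""
CREATE TABLE Questionnaire (                 -- parent
    questionnaire_id INTEGER PRIMARY KEY,
    type_code        TEXT NOT NULL,
    effective_date   TEXT NOT NULL
);
CREATE TABLE Respondent (                    -- child
    respondent_id    TEXT PRIMARY KEY,
    last_name        TEXT,
    questionnaire_id INTEGER REFERENCES Questionnaire (questionnaire_id)
);
INSERT INTO Questionnaire VALUES (1, 'A', '2/13/04'), (3, 'B', '4/10/05');
""")

def add_respondent(respondent_id, last_name, questionnaire_id):
    """Return True if the DBMS accepts the row, False if it rejects it."""
    try:
        conn.execute(
            "INSERT INTO Respondent VALUES (?, ?, ?)",
            (respondent_id, last_name, questionnaire_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    "SP12": add_respondent("SP12", "Smith", 3),     # parent row 3 exists
    "WI13": add_respondent("WI13", "Jones", 2),     # no questionnaire 2
    "WI65": add_respondent("WI65", "Phang", None),  # NULL: points nowhere
}
print(results)
```

In this sketch the NULL row is accepted because the column allows NULL; declaring it NOT NULL would force every respondent to have a questionnaire.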
24
The Logical FADM in ERwin
Questionnaire
  questionnaire_id
  type_code, effective_date, description
Respondent
  respondent_id
  last_name (IE1.1), first_name (IE1.2), middle_initial (IE1.3), home_state_code, age, sex_code, marital_status_code, travel_group_size, questionnaire_id (FK)
Question
  questionnaire_id (FK), question_id
  question_label, question_text
Interviewer
  interviewer_id
  last_name, first_name, middle_initial, sex_code, years_experience
Response
  respondent_id (FK), questionnaire_id (FK), question_id (FK)
  response
Resp_Intrvr_Assoc
  respondent_id (FK), interviewer_id (FK)
  interview_notes
25
The Physical FADM in ERwin
Target DBMS: MS Access 2000
Questionnaire
  questionnaire_id: Long Integer
  type_code: Text(1), effective_date: Date/Time, description: Memo
Respondent
  respondent_id: Text(8)
  last_name: Text(20) (IE1.1), first_name: Text(15) (IE1.2), middle_initial: Text(1) (IE1.3), home_state_code: Text(2), age: Integer, sex_code: Text(1), marital_status_code: Text(1), travel_group_size: Integer, questionnaire_id: Text(1) (FK)
Question
  questionnaire_id: Text(1) (FK), question_id: Long Integer
  question_label: Text(6), question_text: Text(40)
Interviewer
  interviewer_id: Text(5)
  last_name: Text(20), first_name: Text(15), middle_initial: Text(1), sex_code: Text(1), years_experience: Integer
Response
  respondent_id: Text(8) (FK), questionnaire_id: Text(1) (FK), question_id: Long Integer (FK)
  response: Text(1)
Resp_Intrvr_Assoc
  respondent_id: Text(8) (FK), interviewer_id: Text(5) (FK)
  interview_notes: Memo
26
Example database
Refer to PDM handout for column names – not all columns shown here.
Questionnaire
  1  A  2/13/04
  2  B  2/13/03
  3  B  4/10/05
Question
  1  1  1  Are you a native of this state?
  3  1  1  Are you from this state?
  3  2  2  First time visitor?
  2  6  6  Did you spend over $500?
Respondent
  SP12  Smith  Joyce  E  F  NC  42  …  3
  WI13  Jones  Ed        M  GA  23  …  1
  WI65  Phang  Li     A  F  CA  31  …  1
Interviewer
  I1  Romano  John  W  M  5
  I2  Jones   Kim   J  F  1
  I3  White   Jim   L  M  8
Resp_Intrvr_Assoc
  SP12  I1  Suspicious of motives.
  WI65  I2  Eager for the interview.
  SP12  I3  Receptive after review.
Response
  SP12  3  1  T
  WI13  1  6  F
  SP12  3  2  F
  WI65  1  2  T
27
Intro to Structured Query Language - SQL
Create a table
CREATE TABLE Questionnaire (
    questionnaire_id long PRIMARY KEY NOT NULL,
    type_code text(1) NOT NULL,
    effective_date datetime NOT NULL,
    description memo
)
Insert data into a table
INSERT INTO Questionnaire (questionnaire_id, type_code, effective_date, description)
VALUES (1, 'A', #5/17/05#, 'Miami visitor survey')
28
SQL continued
Update a column value
UPDATE Questionnaire SET type_code = 'B' WHERE questionnaire_id = 1
Retrieve data from a database
29
Retrieving information from the database
List the respondents from North Carolina along with their age.
SELECT first_name, middle_initial, last_name, age
FROM Respondent
WHERE home_state_code = 'NC';
What are the questions for the Type B (4/10/05) questionnaire?
SELECT question_text
FROM Questionnaire, Question
WHERE Questionnaire.questionnaire_id = Question.questionnaire_id
AND type_code = 'B'
AND effective_date = #4/10/05#
ORDER BY question_label
Equating table keys, e.g. “questionnaire_id”, in the WHERE clause is called a “join”.
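The join can be run against the example rows from the earlier "Example database" slide. A sketch under stated assumptions: SQLite stands in for MS Access, dates are stored as ISO text because SQLite has no date type, and the explicit JOIN ... ON form is used as an equivalent way to write the same key comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Questionnaire (
    questionnaire_id INTEGER PRIMARY KEY,
    type_code        TEXT,
    effective_date   TEXT              -- assumption: ISO text dates
);
CREATE TABLE Question (
    questionnaire_id INTEGER,
    question_id      INTEGER,
    question_label   TEXT,
    question_text    TEXT,
    PRIMARY KEY (questionnaire_id, question_id)
);
INSERT INTO Questionnaire VALUES
    (1, 'A', '2004-02-13'), (2, 'B', '2003-02-13'), (3, 'B', '2005-04-10');
INSERT INTO Question VALUES
    (1, 1, '1', 'Are you a native of this state?'),
    (3, 1, '1', 'Are you from this state?'),
    (3, 2, '2', 'First time visitor?'),
    (2, 6, '6', 'Did you spend over $500?');
""")

# Same question as the slide: questions on the Type B (4/10/05) questionnaire.
questions = [
    row[0]
    for row in conn.execute(
        """
        SELECT q.question_text
        FROM Questionnaire AS qn
        JOIN Question AS q ON q.questionnaire_id = qn.questionnaire_id
        WHERE qn.type_code = 'B' AND qn.effective_date = '2005-04-10'
        ORDER BY q.question_label
        """
    )
]
print(questions)
```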
30
More SQL
What are the questions and responses for Joyce Smith and what is her home state?
(Notice the use of “t1”, “t2” and “t3” – that is just a shorthand way of referring to table names.)
SELECT t1.question_label, t1.question_text, t3.response, t2.home_state_code
FROM Question t1, Respondent t2, Response t3
WHERE t1.questionnaire_id = t3.questionnaire_id
AND t2.respondent_id = t3.respondent_id
AND t2.last_name = 'Smith'
AND t2.first_name = 'Joyce'
ORDER BY t1.question_label;
What’s wrong with this query?
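One way to investigate is to run the query on a few example rows. In the sketch below (SQLite standing in for the DBMS, with Joyce Smith's rows from the example database), the query as written pairs every question on the questionnaire with every one of her responses. Adding a join on question_id is one likely fix; it is offered here as an assumption about the intended answer, not as the workshop's official one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Question (questionnaire_id INTEGER, question_id INTEGER,
                       question_label TEXT, question_text TEXT);
CREATE TABLE Respondent (respondent_id TEXT, last_name TEXT,
                         first_name TEXT, home_state_code TEXT);
CREATE TABLE Response (respondent_id TEXT, questionnaire_id INTEGER,
                       question_id INTEGER, response TEXT);
INSERT INTO Question VALUES
    (3, 1, '1', 'Are you from this state?'),
    (3, 2, '2', 'First time visitor?');
INSERT INTO Respondent VALUES ('SP12', 'Smith', 'Joyce', 'NC');
INSERT INTO Response VALUES ('SP12', 3, 1, 'T'), ('SP12', 3, 2, 'F');
""")

# The slide's query, with a slot for one optional extra join condition.
QUERY = """
SELECT t1.question_label, t1.question_text, t3.response, t2.home_state_code
FROM Question t1, Respondent t2, Response t3
WHERE t1.questionnaire_id = t3.questionnaire_id
  {extra}
  AND t2.respondent_id = t3.respondent_id
  AND t2.last_name = 'Smith'
  AND t2.first_name = 'Joyce'
ORDER BY t1.question_label
"""

as_written = conn.execute(QUERY.format(extra="")).fetchall()
corrected = conn.execute(
    QUERY.format(extra="AND t1.question_id = t3.question_id")
).fetchall()

print(len(as_written), "rows as written")  # each question x each response
print(len(corrected), "rows with the question_id join")
```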
31
I left out …
Indexing: alternate keys, inversion entries.
Column value constraints; NULL, NOT NULL.
Views.
Triggers and Stored Procedures.
Document the design.
Change management – version control.
Keep the design and the database in sync.
32
Data Warehouse
“What is a data warehouse?”
A database.
Contains detailed and summary data.
Normally, it is not an online, transactional database.
Usually contains data integrated from several sources.
Supports business intelligence (BI) applications, online analytical processing (OLAP), and data mining.
33
• FACT – measured value. Examples: “interview time” and “practice time”.
• DIMENSION – descriptive attribute. Examples: “age range” and “gender”.
Dimensional data models look like this.
Dimensional data model:
Star design.
Snowflake design.
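A star design puts one fact table in the middle, with a foreign key to each dimension table around it. The sketch below uses the slide's "interview time" fact and "age range" and "gender" dimensions; the table names, column names, and sample rows are hypothetical, and SQLite stands in for the warehouse DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE Age_Range_Dim (age_range_id INTEGER PRIMARY KEY, age_range TEXT);
CREATE TABLE Gender_Dim    (gender_id    INTEGER PRIMARY KEY, gender    TEXT);

-- Fact table: measured values plus one key per dimension (the star's center).
CREATE TABLE Interview_Fact (
    age_range_id           INTEGER REFERENCES Age_Range_Dim (age_range_id),
    gender_id              INTEGER REFERENCES Gender_Dim (gender_id),
    interview_time_minutes INTEGER
);

INSERT INTO Age_Range_Dim VALUES (1, '18-34'), (2, '35-54');
INSERT INTO Gender_Dim VALUES (1, 'F'), (2, 'M');
INSERT INTO Interview_Fact VALUES (1, 1, 20), (1, 1, 30), (2, 1, 40), (2, 2, 20);
""")

# A typical warehouse question: average interview time per age range.
avg_time_by_age_range = conn.execute("""
    SELECT a.age_range, AVG(f.interview_time_minutes)
    FROM Interview_Fact f
    JOIN Age_Range_Dim a ON a.age_range_id = f.age_range_id
    GROUP BY a.age_range
    ORDER BY a.age_range
""").fetchall()
print(avg_time_by_age_range)
```

A snowflake design would go one step further and normalize the dimension tables themselves into sub-tables.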
34
The Data Warehousing, Data Mining, and BI – OLAP Process.
Clean-Extract-Transform-Load
Data Warehouse Guy, Data Mining Guy, BI - OLAP Gal
35
Data Warehousing Technologies
DSS: Decision Support
OLAP: On-line analytical processing
Data Mining: Knowledge discovery. “Maybe I’ll discover a real nugget and win the Nobel Prize!”
Slice and dice data “cubes”
36
Example of a data warehouse
Florida State Education Data Warehouse
[Diagram: Inquiry and Answer; Data Dictionary; Search Engine; Security Provisions and Access Authority; Data Warehouse (Oracle), loaded through Clean, Extract, Transform, Load]
Data sources: Other databases
[Source databases shown: Statewide Course Number; State Stu Financial Aid; Fed Fam Ed Loan Prog; Fl Stu Asst Grts; WDEF; WDIS; FETPIP; WDIS Support; Post-Secondary Ed Coord; FL Bright Future; Pre K-12 Course Code Dir; Disabilities Opport Scholar; Opportunity Scholarship; Eval & Reporting; Assess & Evaluation; Facilities; Support Pre K-12; Sch Trans Mgmt; GED; Staff Pre K-12; Student Pre K-12; Annual Financial Report; Teacher Cert; Budget; Cost (SCARS) Aggregate; FTE Funding; Talented 20; DCC Student; Annual Personnel Report; DCC Staff; SUS Student; DCC Finance; SUS Staff; SUS Finance]
37
Case study: “Miami Area Visitor Survey”
38
Tasks
1. Design a database to support the web-based online survey.
2. Extend the database design to contain the data from the online web survey.
3. Refine the database design to respond to typical queries.
39
Task 1: Database to support online survey
Tables containing data to populate drop-down lists.
What are these?
State codes, cities, zip codes, countries.
List of reasons to visit Miami.
List of leisure activities.
40
Lookup Tables
State_LU
  state_code: Text(2)
  state_name: Text(60)
City_LU
  state_code: Text(2), city_name_id: Integer
  city_name: Text(40)
Zip_LU
  state_code: Text(2), city_name_id: Integer, zip_code: Text(5)
Country_LU
  country_code: Text(3)
  country_name: Text(50)
Visit_Reason_LU
  visit_reason_id: Integer
  visit_reason_desc: Text(40)
Leisure_Activity_LU
  leisure_activity_id: Integer
  leisure_activity_desc: Text(40)
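A lookup table feeds a drop-down list with a plain SELECT. The sketch below builds two of the lookup tables in SQLite (a stand-in for MS Access) and fetches the city list for a chosen state; the sample rows and the city_options helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE State_LU (
    state_code TEXT PRIMARY KEY,
    state_name TEXT NOT NULL
);
CREATE TABLE City_LU (
    state_code   TEXT    NOT NULL REFERENCES State_LU (state_code),
    city_name_id INTEGER NOT NULL,
    city_name    TEXT    NOT NULL,
    PRIMARY KEY (state_code, city_name_id)
);
INSERT INTO State_LU VALUES ('NC', 'North Carolina'), ('GA', 'Georgia');
INSERT INTO City_LU VALUES
    ('NC', 1, 'Raleigh'), ('NC', 2, 'Charlotte'), ('GA', 1, 'Atlanta');
""")

def city_options(state_code):
    """Entries for the city drop-down once a state has been picked."""
    rows = conn.execute(
        "SELECT city_name FROM City_LU WHERE state_code = ? ORDER BY city_name",
        (state_code,),
    )
    return [name for (name,) in rows]

print(city_options("NC"))
```

Because the list lives in a table, adding a city is an INSERT rather than a change to the web page.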
41
Task 2: Database to contain the data
Tables with columns for keys and for fields to contain what the respondents enter.
Include a place for “Other” inputs. Make the design flexible to accommodate changes. Normalize the design.
Q1: State, City, Zip, Country. Ex: “NC”, “Charlotte”, “28212”, “USA”.
Q2: Reasons for visiting Miami: First, Second, Third reasons. Pick from lists, plus “other” text.
Q3: Leisure activities: Multiple – pick from list, plus “other” text.
Q4: Time spent on trip. Ex: “2”, “days”; “5”, “hours”.
Q5: Number of nights away from home on trip. Ex: “0”, “1”, “6”.
Q6: Number of total visits to Miami in 2 years: Ex: “1”, “4”.
Q7: Plan to return to Miami? Ex: “Yes”, “No”. Reason.
Q8: Respondent Gender: Ex: “Male”, “Female”.
Survey date.
42
Survey Data Tables
Survey_Response
  survey_response_id: AutoNumber
  survey_date: Date/Time, state_code: Text(2) (FK), city_name: Text(40) (FK), zip_code: Text(5) (FK), country_code: Text(3) (FK), gender_code: Text(1), time_spent_miami_value: Integer, time_spent_miami_units_code: Text(1), nbr_nights_miami: Integer, ttl_night_away_home: Integer, ttl_visits_miami_2_yrs: Integer, plan_return_miami_ind: Text(1)
Visit_Reason_Other
  survey_response_id: Long Integer (FK)
  visit_reason_other_desc: Text(100)
Visit_Reason_LU
  visit_reason_id: Integer
  visit_reason_desc: Text(40)
Visit_Reason
  survey_response_id: Long Integer (FK), visit_reason_rank: Byte
  visit_reason_id: Integer (FK)
Leisure_Activity
  survey_response_id: Long Integer (FK), leisure_activity_id: Integer (FK)
Leisure_Activity_Other
  survey_response_id: Long Integer (FK)
  leisure_activity_other_desc: Text(100)
Leisure_Activity_LU
  leisure_activity_id: Integer
  leisure_activity_desc: Text(40)
Country_LU
  country_code: Text(3)
  country_name: Text(50)
Zip_LU
  state_code: Text(2) (FK), city_name: Text(40) (FK), zip_code: Text(5)
What’s missing?
A place to store “reason for returning or not returning to Miami”.
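To see how the ranked "reasons for visiting" (Q2) land in these tables, here is a small sketch. SQLite stands in for MS Access, only the visit-reason tables are built, the reason descriptions are sample data, and save_response is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Visit_Reason_LU (
    visit_reason_id   INTEGER PRIMARY KEY,
    visit_reason_desc TEXT
);
CREATE TABLE Survey_Response (
    survey_response_id INTEGER PRIMARY KEY AUTOINCREMENT,
    survey_date        TEXT
);
CREATE TABLE Visit_Reason (
    survey_response_id INTEGER REFERENCES Survey_Response (survey_response_id),
    visit_reason_rank  INTEGER,          -- 1 = first reason, 2 = second, ...
    visit_reason_id    INTEGER REFERENCES Visit_Reason_LU (visit_reason_id),
    PRIMARY KEY (survey_response_id, visit_reason_rank)
);
CREATE TABLE Visit_Reason_Other (
    survey_response_id INTEGER PRIMARY KEY
        REFERENCES Survey_Response (survey_response_id),
    visit_reason_other_desc TEXT
);
INSERT INTO Visit_Reason_LU VALUES
    (1, 'Vacation'), (2, 'Business'), (3, 'Visiting family');  -- sample data
""")

def save_response(survey_date, ranked_reason_ids, other_reason=None):
    """Store one survey: a parent row, one row per ranked reason, optional text."""
    cur = conn.execute(
        "INSERT INTO Survey_Response (survey_date) VALUES (?)", (survey_date,)
    )
    response_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO Visit_Reason VALUES (?, ?, ?)",
        [
            (response_id, rank, reason_id)
            for rank, reason_id in enumerate(ranked_reason_ids, start=1)
        ],
    )
    if other_reason:
        conn.execute(
            "INSERT INTO Visit_Reason_Other VALUES (?, ?)",
            (response_id, other_reason),
        )
    return response_id

rid = save_response("2005-05-17", [3, 1], other_reason="Boat show")
ranked = conn.execute(
    "SELECT vr.visit_reason_rank, lu.visit_reason_desc"
    " FROM Visit_Reason vr"
    " JOIN Visit_Reason_LU lu ON lu.visit_reason_id = vr.visit_reason_id"
    " WHERE vr.survey_response_id = ?"
    " ORDER BY vr.visit_reason_rank",
    (rid,),
).fetchall()
print(ranked)
```

Storing one row per reason, rather than first/second/third columns, means a fourth reason needs no design change.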
43
Task 3: Refine the database design
Understand the types of queries needed.
Think about a data warehousing approach.
Create some tables to hold aggregate and summary data.
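A summary table can be as simple as the saved result of a GROUP BY. The sketch below is one possible shape, not a prescribed answer to the task: SQLite stands in for the DBMS, the State_Visit_Summary table is hypothetical, and the response rows are sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Survey_Response (
    survey_response_id     INTEGER PRIMARY KEY,
    state_code             TEXT,
    ttl_visits_miami_2_yrs INTEGER
);
INSERT INTO Survey_Response VALUES
    (1, 'NC', 1), (2, 'NC', 4), (3, 'GA', 2), (4, 'CA', 1), (5, 'GA', 6);

-- Aggregate once, store the result, and let reports read the small table.
CREATE TABLE State_Visit_Summary AS
SELECT state_code,
       COUNT(*)                    AS response_count,
       SUM(ttl_visits_miami_2_yrs) AS total_visits
FROM Survey_Response
GROUP BY state_code;
""")

summary = conn.execute(
    "SELECT state_code, response_count, total_visits"
    " FROM State_Visit_Summary ORDER BY state_code"
).fetchall()
print(summary)
```

The trade-off is the usual de-normalization one: the summary is fast to query but has to be rebuilt or updated when new responses arrive.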
44
Will database technology come to this?
Credit: http://www.cartoonstock.com/directory/d/databases.asp
45
Thanks
I hope you enjoyed the workshop.
For any follow-up questions, e-mail me at ggrubbs@rti.org.
Presentation available at: www.rti.org/ifdtc.
I forgot a minor detail, the “Final Exam”!