Introduction to Data Handling: A Fast Hour

Post on 15-Jan-2016



Introduction to Data Handling

A Fast Hour

Review of data types: scalar, ordinal, nominal

Decisions regarding encoding data: turning information into analyzable data; dealing with missing data

The structure of experimental data: getting things into 2-dimensional (or a few-dimensional) tables

Deciding on which software to use: Excel; spreadsheet-style analysis packages; scripted analysis

Review of Data Types


Scalar: continuous, discrete

Ordinal

Nominal

Scalar Data

Continuous Data: real numbers used to measure magnitude; unbounded in at least one direction. Ex: average Dilantin level

(3.1+4.4)/2 = 3.75

Discrete Data: data that can take on only a countable set of values (e.g., whole numbers); unbounded in at least one direction. Ex: average number of fingers

(5+4)/2 ≠ 4.5, but is ‘in between’ 4 and 5

Scalar Data

Truly continuous data are theoretical – you don’t run into them in the real world

Because of limitations of measurement (e.g., significant figures), scalar data are actually discrete

In most real-life applications, discrete data can be handled as if continuous. Just beware of the ‘2.3 kids’ problem

Ordinal Data

Data whose attributes are ordered but for which the numerical differences between adjacent attributes are not necessarily interpreted as equal

Bounded: the scale has some upper and lower limit

Classic Example: Glasgow Coma Scale

A GCS of 4 intuitively ranks lower than a GCS of 5

The difference between GCS 14 and GCS 15 is not the same as the difference between GCS 3 and GCS 4

GCS of 4 + GCS of 5 ≠ GCS of 9

Nominal Data

May have an assigned numerical value for analytical reasons, but there is no numerical underpinning for the variables

Example: Race. African American = 1, Hispanic = 2, Asian = 3; 1 + 2 ≠ 3

Turning information into analyzable data

Turning information into analyzable data

Discrete data are usually easy: age, vital signs, one-dimensional measures (e.g., Hgb, time-to-relapse)

Ordinal and nominal data get tricky. If you’re only going to do descriptive statistics, it doesn’t matter much. If you’re going to model (e.g., do regression), it gets involved

Real Life Example from the Camp Survey

Question 3. On a usual camp day, the person on site with the highest level of health care training is a:

Physician
Registered nurse
Licensed practical nurse
Licensed paramedic
Licensed EMT
Licensed first responder
First aid provider

Real Life Example from the Camp Survey

What type of variable would you use?

Real Life Example from the Camp Survey One choice:

A continuous variable

On a usual camp day, how many years of training has the senior-most caregiver completed?

Var_Years

Real Life Example from the Camp Survey Another more likely choice:

An ordinal variable

Physician = 1

RN = 2

LPN = 3

Paramedic = 4

EMT = 5

First responder = 6

First aid = 7

Var_Caregiver

Real Life Example from the Camp Survey A Third Choice

Seven nominal ‘dummy variables’

Var_MD = 1 or 0 (yes or no)
Var_RN = 1 or 0
Var_LPN = 1 or 0
Var_Para = 1 or 0
Var_EMT = 1 or 0
Var_Respond = 1 or 0
Var_FirstAid = 1 or 0
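The expansion above can be sketched in a few lines of code (Python here, purely for illustration). The variable names come from the slide; the helper function itself is invented.

```python
# Expand the single caregiver code (1-7) into seven 0/1 dummy variables.
# Variable names follow the slide; the helper itself is hypothetical.

LEVELS = ["Var_MD", "Var_RN", "Var_LPN", "Var_Para",
          "Var_EMT", "Var_Respond", "Var_FirstAid"]

def to_dummies(var_caregiver):
    """Return a dict of seven 0/1 dummies for a caregiver code 1-7."""
    if not 1 <= var_caregiver <= 7:
        raise ValueError("Var_Caregiver must be between 1 and 7")
    return {name: int(i == var_caregiver - 1)
            for i, name in enumerate(LEVELS)}

# A physician (code 1) sets Var_MD = 1 and everything else to 0:
print(to_dummies(1))
```

Exactly one dummy equals 1 for any valid code, which is what makes the encoding unambiguous.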

Real Life Example from the Camp Survey Who cares?

Var_Caregiver   Var_MD  Var_RN  Var_LPN  Var_Para  Var_EMT  Var_Respond  Var_FirstAid
1               1       0       0        0         0        0            0
2               0       1       0        0         0        0            0
3               0       0       1        0         0        0            0
4               0       0       0        1         0        0            0
5               0       0       0        0         1        0            0
6               0       0       0        0         0        1            0
7               0       0       0        0         0        0            1

7 Dummy Variables

Var_Years: real numbers

A Basic Modeling Problem

Is there a relationship between the level of on-site caregiver training and the number of deaths per year at camp?

Deaths = f (Caregiver Level)

[Figure: plot of Number of Deaths vs. Var_Caregiver (1 to 7)]

Deaths = b1x1 + b0

where x1 = Var_Caregiver (1-7), b1 = a coefficient, b0 = the y-intercept

[Figure: plot of Number of Deaths vs. the dummy variables, Var_MD through Var_First_Aid]

Deaths = b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + b7x7+ b0

where x1 = Var_MD, x2 = Var_RN, etc.; b1-b7 = coefficients for each x; b0 = the y-intercept
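To make the contrast concrete, here is a small Python sketch of the regression design rows the two encodings produce for the same observation. The helper names are invented, and no actual model fitting is done.

```python
# Design rows for the two competing models of Deaths = f(Caregiver Level).
# ordinal_row: Deaths = b1*x1 + b0, so each row is [x1, 1]
#   (the trailing 1 multiplies the intercept b0).
# dummy_row: Deaths = b1*x1 + ... + b7*x7 + b0, so each row is seven
#   0/1 indicators plus the intercept term.

def ordinal_row(code):
    return [code, 1]

def dummy_row(code):
    return [int(i == code - 1) for i in range(7)] + [1]

print(ordinal_row(5))  # EMT encoded as the single ordinal predictor
print(dummy_row(5))    # EMT encoded as one of seven dummies
```

The ordinal row forces the fitted effect to change by the same amount b1 between adjacent levels; the dummy rows let every provider level carry its own coefficient.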

[Figure: side-by-side comparison of the two models: Number of Deaths vs. Var_Caregiver (1 to 7) under the ordinal model, and Number of Deaths vs. each dummy variable (Var_MD, Var_RN, Var_LPN, Var_Para, Var_EMT, Var_Respond, Var_First_Aid) under the dummy-variable model]

Ordinal encoding (Var_Caregiver)
Pros: Easy to compute. Easy to understand.
Cons: Forces a ‘continuous’ structure onto Var_Caregiver that may not really exist.

Dummy-variable encoding
Pros: Agrees more closely with experimental results. Doesn’t impose any relationship between different provider levels.
Cons: Less easy to understand. ‘Discards’ the knowledge that some caregivers have more training than others.

Decisions regarding how to encode data

Stick as close to the ‘raw measurement’ as you can

Stick close to original measurement

When you call an ambulance in an emergency, how long does it take for the ambulance to get to your camp?

< 5 minutes (Time = 1)
5-10 minutes (Time = 2)
10-15 minutes (Time = 3)
15-20 minutes (Time = 4)
> 20 minutes (Time = 5)
Don’t know (Time = 6)

Good, bad, indifferent?

Stick close to original measurement

Do you know how long it would take an ambulance to respond to a call from your camp? (y/n)

If so, how many minutes? (some discrete #)

Decisions regarding how to encode data

Stick as close to the ‘raw measurement’ as you can
Abstraction seems useful, but distances you from what you were originally looking at
Keep continuous data continuous if at all possible
Likewise, preserve ordinal and nominal data
Later on, you can ‘digest’ the raw data into categories, etc., as necessary.

Decisions regarding how to encode data Remember:

Data can always be made more general during analysis. They cannot be made more specific.
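The one-way nature of that rule can be shown in a few lines. This Python sketch collapses raw minutes into the survey’s five time bins; the cut points mirror the earlier answer choices, and the function name is invented. Note there is no way to write the reverse function.

```python
# Raw minutes can always be 'digested' into the survey's categories
# later; the categories cannot be turned back into minutes.

def minutes_to_category(minutes):
    """Map a raw response time in minutes onto the 1-5 survey bins."""
    if minutes < 5:
        return 1        # < 5 minutes
    if minutes < 10:
        return 2        # 5-10 minutes
    if minutes < 15:
        return 3        # 10-15 minutes
    if minutes < 20:
        return 4        # 15-20 minutes
    return 5            # > 20 minutes

raw = [3, 7, 12, 25]
print([minutes_to_category(m) for m in raw])  # -> [1, 2, 3, 5]
```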

Decisions regarding how to encode data

Stick as close to the ‘raw measurement’ as you can

Avoid bundling more than one idea into a single variable
Ex: <5, 5-10, 10-15, 15-20, >20, ‘Don’t Know’ (a time measurement and a knowledge question packed into one variable)

Decisions regarding how to encode data

Stick as close to the ‘raw measurement’ as you can

Avoid bundling more than one idea into a single variable

Use a specific plan for missing data!

Missing Data

Blank data cells are ambiguous Data not provided/collected? Data erroneously omitted? Data provided but nonsensical?

Note: Many statistical packages will ignore an entire ‘observation’ if a data point is missing!!!

Missing Data

Pick something (other than nothing) to denote a missing data point ‘.’ or ‘Null’ are commonly used
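A minimal Python sketch of such a plan, assuming ‘.’ as the sentinel and an invented hemoglobin column, shows how an explicit marker keeps missingness countable instead of ambiguous:

```python
# Use an explicit sentinel ('.') for missing values and exclude it
# deliberately, rather than leaving ambiguous blank cells.
# The column values are illustrative only.

MISSING = "."

hgb = ["13.2", ".", "9.8", "11.5", "."]

# Convert to numbers, keeping track of how many points were missing:
values = [float(v) for v in hgb if v != MISSING]
n_missing = hgb.count(MISSING)

print(values)     # [13.2, 9.8, 11.5]
print(n_missing)  # 2
```

Because the sentinel is explicit, "not collected" can never be confused with "erroneously omitted" or a stray blank cell.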

The Structure of Data

Statistical analysis is based on the idea of ‘observations’
An observation often is a patient (and all of the data you collect about that patient)
Really it is just an experimental ‘unit’ or ‘trial,’ such as one summer camp or one hospital day

Any analysis of many observations requires that you establish a ‘structure’ for your observations

The Structure of Data

You’ll need to think about the ‘shape’ of your experimental data early in your study, preferably during planning

Fortunately, very many data sets can be structured into tabular form
For better or worse, Excel is used really often

The Structure of Data

Obs #   Last Name   Systolic BP   Diastolic BP
1       Fawcett     114           54
2       Smith       93            42
3       Jackson     78            49
4       Ladd        58            38

(Columns are the fields; rows are the observations.)

Don’t confuse a 2-dimensional data table with 2-dimensional data!

Ultimately, every observation is a mathematical ‘vector’ that completely describes that event in an n-dimensional space

[Figure: the four observations (Fawcett, Smith, Jackson, Ladd) plotted as points in a space with SBP and DBP axes]

Your data have as many dimensions as they have data fields!
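The point can be sketched in code (Python here, for illustration; names and values mirror the earlier blood-pressure table): each observation is just a vector whose length equals the number of fields.

```python
# Each observation is a point (vector) in a space with one axis per
# data field; two fields here means a 2-dimensional point.

observations = {
    "Fawcett": (114, 54),   # (Systolic BP, Diastolic BP)
    "Smith":   (93, 42),
    "Jackson": (78, 49),
    "Ladd":    (58, 38),
}

# The dimensionality is the number of data fields, not the table shape:
print(len(observations["Fawcett"]))  # -> 2
```

Add a third field (say, heart rate) and every observation becomes a point in a 3-dimensional space, even though the table on the page is still flat.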

(Unavoidable) Shortcomings of Tabular Data

Large number of fields or observations: difficult to ‘look’ at all of the data

Troubles with repeated measures

Handling Repeated Measure Data in a Tabular Data Structure

[Table: wide-format repeated-measures layout with one row per patient and 16 columns: Patient ID, then Weight, BUN, and Urine for each of Days 1 through 5]

Obs #   Last Name   Hospital Day   Systolic BP
1       Fawcett     1              84
2       Fawcett     2              72
3       Fawcett     3              84
4       Smith       1              94
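The reshaping from the wide layout to ‘patient-day’ observations can be sketched briefly in Python. The field names and values are invented to match the slide’s example; a real study would loop over all five days and all patients.

```python
# Reshape one wide row per patient (Weight/BUN/Urine x Day) into long
# 'patient-day' observations. Field names and values are illustrative.

wide = {
    "PatientID": 101,
    "Weight Day 1": 70.1, "BUN Day 1": 14, "Urine Day 1": 1.2,
    "Weight Day 2": 69.8, "BUN Day 2": 15, "Urine Day 2": 1.0,
}

long_rows = []
for day in (1, 2):
    long_rows.append({
        "PatientID": wide["PatientID"],
        "Day": day,
        "Weight": wide[f"Weight Day {day}"],
        "BUN": wide[f"BUN Day {day}"],
        "Urine": wide[f"Urine Day {day}"],
    })

print(long_rows[1])  # the Day-2 observation
```

Each patient-day becomes its own observation, so days with no data simply produce no row instead of a string of empty cells.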

Handling Repeated Measures in Tabular Data Structures

• The ‘Day in the Life’ strategy
• A patient day becomes the observation
• Can be a more compact way of saving data

[Diagram: a relational database linking a Demographic Data table, Daily Data tables (one for each of 7 study days), a Bacterial Isolate Data table, and an Outcome Data table]

Using Relational Databases for More Complex Data

Deciding Which Software to Use: Some useful ground rules

1. Use software with all of the tools you need

2. Don’t make things unnecessarily complicated

3. Know in advance what your statistical collaborators are going to use, and how they like the data to appear

Deciding Which Software to Use Data-entry Level Tools

Input methods other than just entering fields in an Excel spreadsheet: a ‘forms’-type page; interfaces with other data types (Scantron, analytical instruments)

Deciding Which Software to Use Data-entry Level Tools

Entry error control: double entry; restricted data fields that must fit a particular format or be rejected

Merging data sets: doing this by hand is fine for 15 patients, but not for 1,500

Deciding Which Software to Use Data Manipulation Needs

Do your data need some post-collection modification prior to analysis?

Transformation (e.g., log-transforming to achieve normal statistical distributions)
Relabeling missing data fields
Text or numerical string modification (e.g., changing all dates to MM/DD/YYYY)
Internal data consistency checks (e.g., is the number of ICU days < the number of hospital days?)
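A consistency check of that kind takes only a few lines; here is a Python sketch with invented records (a real check would run over the full data set and report the offending IDs for correction).

```python
# Internal consistency check of the kind described above:
# ICU days can never exceed total hospital days.
# Records are invented for illustration.

records = [
    {"id": 1, "icu_days": 3, "hosp_days": 10},
    {"id": 2, "icu_days": 8, "hosp_days": 5},   # inconsistent!
]

bad = [r["id"] for r in records if r["icu_days"] > r["hosp_days"]]
print(bad)  # -> [2]
```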

Deciding Which Software to Use What Analyses are You Going to Perform?

Easy in Excel:
Summary statistics (frequencies, means, etc.)
Simple x-by-y regressions

Not easy in Excel:
Contingency tables (and χ²)
ANOVA

Best handled in dedicated stats packages or elsewhere:
Multivariate modeling
Logistic modeling
Nonlinear modeling

Deciding Which Software to Use Output Needs

Tabular data that can be dumped into a word processor: text files; cut-and-paste

Graph preparation and dumping: cut-and-paste; specialized output formats (.tif, .jpg, .svg, MS metafiles); colors (RGB v. CMYK)

Deciding Which Software to Use

Other needs you might not have thought about but that are really important:
Interim ‘noodling’-type analysis
Needing to repeat the analysis on multiple data sets, or to ‘update’ the analysis if new data become available

Deciding Which Software to Use

Spreadsheets Excel

Spreadsheet and ‘Pull-Down’ Stats Packages SPSS, Prism (Graphpad), JMP

Database Managers Access, Foxpro

Scripted Statistical Languages SAS, R, MatLab

(Listed in order of increasing level of organization and increasing front-end time.)

Handling Your Data in Excel

Few up-front requirements: load your data and you’re ready to go
Many simple stats can be done as ‘one-off’ analyses

VERY inflexible: you pay for your choice later on in debugging, rerunning analysis, editing the data set, etc.

Using Spreadsheet/‘Pulldown’ Stats Packages

The most power most users will ever need
Slightly more up-front time: forced data structures are like eating oatmeal
Most have integrated graphics utilities
Some unusual applications are tough to manage (e.g., nonlinear analysis)

Using Scripted Statistical Packages

When you anticipate running relatively complicated analyses on a series of data sets

When you can design the analysis plan without having all of the data available

When you must document exactly how you did your analysis and be able to exactly duplicate it at will
Which is arguably every time (!)

An example in R:

g <- read.csv("expdata2.csv", header=TRUE)  # read the data set into a data frame
gmat <- as.matrix(g)                        # convert it to a matrix
gmati <- gmat * -1                          # flip the sign of every value
heatplot(gmati, Colv=NA)                    # heatplot() comes from an add-on package (e.g., Bioconductor’s made4), not base R

Back-End Utilities

Graphical Output
Excel has horrible graphics that can be spotted a mile away in journals
Most stats packages will do better

Consider ‘post-processing’ in dedicated graphics software, e.g., Adobe Illustrator

Research is a Data Business, Use the Tools at Your Disposal

Data Input System → Dedicated Database Manager → Statistical Package(s) → Graphing System → Graphics Polishing for Publication

Other Very Important Resources

Google: almost everything you need to know, and most of it’s pretty accurate

Java applets: many stats applications can be found online that will run on any machine

Open-source code is on its way: R, Linux

CSCAR: sometimes more helpful than others.

Who Will Not Be Helpful

MCIT

Questions?

People and their Software

Sue Stern: JMP (the SAS ‘pull-down’ package). Repeated-measures analysis of clinical data.

Bonnie Singal: SAS. Pretty much any clinical statistical research question.

Matt Trowbridge: Stats and GIS packages. Merging complex data sets.

John Younger: SAS, Prism, R. Kinetics; logistic and nonlinear models of complex behaviors.