P20 Seminar, November 12, 2009. Statistical Collaboration. Part 1: Working with Statisticians from Start to Finish. Part 2: Essentials of Data Management

Statistical Collaboration

Part 1: Working with Statisticians from Start to Finish

Part 2: Essentials of Data Management

Objectives

Participants will learn about:
- the process of consulting and collaborating with a statistician
- general principles of database setup, data entry, verification, cleaning and storage

Part 1: Working with Statisticians from Start to Finish

Kay Savik, MS

Collaboration

“Collaboration implies that statistician and researcher want to learn and exchange information. This exchange should be mutually beneficial.”

Gerald van Belle

Types of Consulting

- Cross-sectional: statistical advice for data already collected or analyzed
- Longitudinal: a long-term relationship between statistician and researcher

First Meeting

- Intent of study
- Source of data
- Sampling unit
- Randomization
- Model of effects
- Type of study
- Type of data

First Meeting

- What is the research question?
- What level of statistical knowledge does the researcher have?
- What are the data and what form are they in?
- What are the conventions in this specific area of study?

The Conversation

To prevent a Type III error (the right answer to the wrong question!):
- Clarify research aims
- Appropriate design
- Measurement
- Data management
- Analysis

Analysis Choice

Sir David Cox:
“Begin with very simple methods and, if possible, end with simple methods”

Rindskopf’s Rules of Statistical Consulting:
“Sometimes the ‘best’ or ‘right’ statistical procedure is not the best for a particular situation.”

Which Statistical Package?

There is no single “perfect” software package for any procedure

All standard packages have been tested and are reliable

“Specialized” procedures are found in several packages

Collaborate Rather than Consult

- Collaboration is a communal activity
- Decide who is responsible for what at the first meeting
- Politely and quickly leave a collaboration where any party seems misguided or unethical
- Decide on questions of authorship at the first meeting

Part 2: Essentials of Data Management (DM)

Olga Gurvich, MA

Data Management

- Essential part of any research
- Interactive and collaborative venture of both investigator and statistician
- Requires a system that is well defined in advance and consistency in its implementation

Data Management Stages

- Database setup
- Raw data collection [who, what, when, how]
- Raw data entry, verification and cleaning
- Data storage
- [Data re-structuring for statistical analyses]
- [Data analysis]
- Data archiving

Database Setup - Software

Choice mainly depends on
- Amount of data to be collected
- Complexity of data structure
- Type of data
- Export/import capabilities to/from
- Planned statistical analyses and software

Software:
- try avoiding Excel
- SPSS, ACCESS, EpiInfo, output of survey software, plain text (ASCII)

Database Setup – Structure

Participants => rows; variables => columns

Logical Record: one row contains all data for a single study participant

Multiple Record: multiple rows per single participant

Relational: multiple data files that can be merged
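
The three layouts above can be illustrated with a small sketch. The participants, variable names (`sbp_v1`, `age`) and the `merge_on_id` helper are hypothetical, and plain Python lists of dicts stand in for database tables.

```python
# Hypothetical data: two participants, systolic blood pressure at two visits.

# Logical record: one row holds all data for a participant.
logical = [
    {"id": 1, "sbp_v1": 120, "sbp_v2": 118},
    {"id": 2, "sbp_v1": 135, "sbp_v2": 130},
]

# Multiple record: one row per participant per visit.
multiple = [
    {"id": row["id"], "visit": visit, "sbp": row[f"sbp_v{visit}"]}
    for row in logical
    for visit in (1, 2)
]

# Relational: separate files (here, lists) linked by the shared ID.
demographics = [{"id": 1, "age": 54}, {"id": 2, "age": 61}]


def merge_on_id(left, right):
    """Merge two tables on 'id', keeping only rows whose ID is in both."""
    lookup = {row["id"]: row for row in right}
    return [{**row, **lookup[row["id"]]} for row in left if row["id"] in lookup]


merged = merge_on_id(multiple, demographics)
```

The unique ID is what makes the relational merge possible, which is one reason it is set up before any data are entered.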

Database Setup - General

- Give the DB a short, meaningful and “dated” name
- A DB given to a statistician for cleaning and analyses should include
  - ONLY collected raw data;
  - NO graphs, comments, titles, summaries, hidden rows, split-spreadsheets, multiple spreadsheets, imposed “special” formats or highlighting

Database Setup - Variables

- Set unique numeric ID(s) in the first column(s)
- Identify types of variables, measurement units and type of recording [auto/manual]
- Carefully choose variables’ format and length
- Date format: MM/DD/YYYY; if parts are missing, create three separate variables
- Time format: dd hh:mm:ss or similar
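
As a rough illustration of the date rule, the sketch below parses complete MM/DD/YYYY dates and otherwise works from three separate variables. The function names are hypothetical, and representing a missing part as `None` is an assumption of this example.

```python
from datetime import date, datetime


def parse_full_date(text):
    """Parse a complete MM/DD/YYYY string; raises ValueError if it is not valid."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def combine_parts(month, day, year):
    """Return a date when all three parts are present, else None.

    A missing part is represented by None (an assumption of this sketch).
    """
    if None in (month, day, year):
        return None
    return date(year, month, day)
```

Keeping month, day and year apart means a record with an unknown day still retains its month and year instead of being lost as an invalid date.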

Database Setup - Variables

- Create a separate variable for every separate piece of information
- Give unique, short [6-8 char], meaningful names
- No special characters [!, %, $, spaces]
- Do not start with a number
- Consider other restrictions of specific software [e.g., lower/upper case letters]
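
These naming rules can be checked mechanically before data entry starts. The sketch below is one possible check, not a standard tool: `check_names` is a hypothetical helper, the 8-character limit follows the slide, and allowing underscores and treating names as case-insensitive are assumptions.

```python
import re

# Letters, digits and underscores only; starts with a letter; at most 8 characters.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,7}")


def check_names(names):
    """Return a dict mapping each problematic name to the reason it was flagged."""
    problems = {}
    seen = set()
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems[name] = "invalid characters, leading digit, or too long"
        elif name.lower() in seen:
            problems[name] = "duplicate (ignoring case)"
        seen.add(name.lower())
    return problems


# Hypothetical variable list
flagged = check_names(["age", "sbp_v1", "1stvisit", "wt %", "Age"])
```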

Database Setup - Coding

- Assign short and meaningful codes, consistent for same-response variables
- Use numeric coding (if possible); do not combine numeric and character codes within a numeric variable
- Address missing values
- Avoid using “N/A”, “?”, etc. entirely
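
A minimal sketch of how stray character codes in a numeric variable might be detected. The `smoker` column and its codes are made up, and treating a blank cell as the missing-value convention is an assumption; a project may instead reserve a numeric code.

```python
def non_numeric_entries(values):
    """Return the entries of a numeric variable that cannot be read as numbers.

    Blank strings are treated as missing and are not reported.
    Note: float() also accepts text such as 'nan' and 'inf', so a stricter
    check may be needed for some data.
    """
    bad = []
    for value in values:
        text = str(value).strip()
        if text == "":
            continue
        try:
            float(text)
        except ValueError:
            bad.append(value)
    return bad


# Hypothetical column: 1 = yes, 2 = no, entered with stray character codes.
smoker = ["1", "2", "N/A", "", "2", "?", "1"]
flagged_codes = non_numeric_entries(smoker)
```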

Database Setup – Codebook/Data Dictionary

A written handbook with information on study data:

- Study title, PI name, date of last update, DB name and location
- Number of observations, number of variables
- Study variables and their attributes [name, label, location (ASCII), coding (values), format, measurement units]
- Other [formulae, weights, scoring documentation, etc.]
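
One way to keep a codebook usable is to store the variable attributes in a structured form that can be compared against the database. The sketch below assumes a simple dict layout; every variable name, label and code in it is illustrative, and `compare_with_codebook` is a hypothetical helper.

```python
# Hypothetical codebook entries; all names, labels and codes are illustrative.
codebook = {
    "id": {"label": "Participant ID", "format": "numeric",
           "units": None, "values": None},
    "sex": {"label": "Sex", "format": "numeric",
            "units": None, "values": {1: "female", 2: "male"}},
    "weight": {"label": "Body weight", "format": "numeric",
               "units": "kg", "values": None},
}


def compare_with_codebook(columns, codebook):
    """Return (undocumented, absent).

    undocumented: data columns that have no codebook entry
    absent: codebook variables that are not in the data
    """
    undocumented = [c for c in columns if c not in codebook]
    absent = [v for v in codebook if v not in columns]
    return undocumented, absent


result = compare_with_codebook(["id", "sex", "height"], codebook)
```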

Data Entry, Verification and Cleaning

Ultimate aim is a fully documented, backed-up archive of verified, validated and ready-for-use data

Data Entry

“Do it promptly, completely and consistently”

- Preferably one trained data entry person [unless double entry]
- Unique ID(s)
- All the data must be entered in its “raw” form directly from the original records - NO hand calculations
- Frequent back-up
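
Where double entry is used, the two entries can be compared field by field. This is a sketch under simplifying assumptions: both entries contain the same IDs and the same fields, and the helper name and sample records are hypothetical.

```python
def compare_entries(first, second, id_field="id"):
    """Compare two independent entries of the same records.

    Returns a list of (id, field, first_value, second_value) discrepancies.
    Assumes both entries contain the same IDs and the same fields.
    """
    second_by_id = {row[id_field]: row for row in second}
    discrepancies = []
    for row in first:
        other = second_by_id[row[id_field]]
        for field, value in row.items():
            if value != other[field]:
                discrepancies.append((row[id_field], field, value, other[field]))
    return discrepancies


# Hypothetical double entry of two paper forms.
entry_a = [{"id": 1, "age": 54, "sbp": 120}, {"id": 2, "age": 61, "sbp": 135}]
entry_b = [{"id": 1, "age": 54, "sbp": 120}, {"id": 2, "age": 16, "sbp": 135}]
found = compare_entries(entry_a, entry_b)
```

Each discrepancy is then resolved against the original record rather than by guessing which entry is right.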

Data Verification and Cleaning

- Optimally done by a statistician or DM professional in close collaboration with the investigator
- Includes (but is not limited to) general and logic checks to detect errors and outliers, and verification of data completeness (subjects and variables)
- Audit trail/log book for a complete record of changes made
- Following all necessary corrections, ONE FINAL CLEAN DB is created
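
The checks and the audit trail described above might look like the following sketch. The records, the plausible age range and the field names are hypothetical; corrections are applied to a copy so that the raw data stay untouched.

```python
import copy
from datetime import date


def range_check(rows, field, low, high):
    """Return IDs whose value for `field` lies outside [low, high]."""
    return [r["id"] for r in rows
            if r[field] is not None and not (low <= r[field] <= high)]


def logic_check(rows):
    """Return IDs where the discharge date precedes the admission date."""
    return [r["id"] for r in rows if r["discharge"] < r["admit"]]


def correct(rows, log, row_id, field, new_value, reason):
    """Apply one correction and record old value, new value and reason."""
    for r in rows:
        if r["id"] == row_id:
            log.append({"id": row_id, "field": field, "old": r[field],
                        "new": new_value, "reason": reason})
            r[field] = new_value


# Hypothetical raw records
raw = [
    {"id": 1, "age": 54, "admit": date(2009, 3, 1), "discharge": date(2009, 3, 5)},
    {"id": 2, "age": 540, "admit": date(2009, 4, 10), "discharge": date(2009, 4, 2)},
]
clean = copy.deepcopy(raw)  # the raw DB is never edited
audit_log = []

out_of_range = range_check(clean, "age", 18, 110)
illogical = logic_check(clean)
correct(clean, audit_log, 2, "age", 54, "typing error confirmed against paper form")
```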

Data Storage

Stored on a password-protected server are

1. ONE INITIAL RAW DB

2. ONE FINAL CLEAN DB

3. CODEBOOK

4. Audit trail or log book [if used]

Frequent BACK-UPs are performed

All previous DB versions EXCEPT the initial raw one are destroyed

Data Re-Structuring

- If not foreseen in advance, may be needed for certain analyses
- Usually can be done in statistical packages
- Keep a record of any re-structuring
- Use a “version-” or “date-numbering” system
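
A small sketch of one re-structuring step, turning multiple-record data into one row per participant, together with a date-numbered file name and a log entry. The helper names, the `bp_wide` and `bp_long` names and the file name pattern are all hypothetical.

```python
from datetime import date


def long_to_wide(rows, id_field, time_field, value_field):
    """Turn one-row-per-visit records into one row per participant."""
    wide = {}
    for r in rows:
        record = wide.setdefault(r[id_field], {id_field: r[id_field]})
        record[f"{value_field}_v{r[time_field]}"] = r[value_field]
    return list(wide.values())


def dated_name(stem, when):
    """Build a date-numbered file name of the form stem_YYYYMMDD.csv."""
    return f"{stem}_{when:%Y%m%d}.csv"


# Hypothetical multiple-record data
long_rows = [
    {"id": 1, "visit": 1, "sbp": 120}, {"id": 1, "visit": 2, "sbp": 118},
    {"id": 2, "visit": 1, "sbp": 135}, {"id": 2, "visit": 2, "sbp": 130},
]
wide_rows = long_to_wide(long_rows, "id", "visit", "sbp")
name = dated_name("bp_wide", date(2009, 11, 12))

# Record of the re-structuring, kept alongside the data.
restructure_log = [f"{name}: long to wide on 'visit', from bp_long"]
```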

Data Archiving

At the end of a project, the data, codebook, log-book and programs [syntax] must be archived

The archive serves as permanent storage and gives access to all project-related information

Keep a copy of the archive and a detailed report of the archive’s structure