Applications e



General Principles of Data Analysis

Choice of an appropriate statistical technique

A complex issue

Somewhat arbitrary

  Real-life data often contain mixtures of different types of data

  Two statisticians may select different methods depending upon what assumptions they are willing to take into account

Extraneous factors

  Availability of software and its limitations

  Availability of time and financial resources



Warnings

Figures allow us to calculate them

Applying different techniques and obtaining different results does not mean that something is wrong

Looking for an answer to the same question by using several methods may lead to a better understanding

Obtaining a negative result may be as informative as getting a positive one

Obtaining no answer by using one technique does not mean that there is no answer at all

Etc.



The choice of a statistical technique depends essentially upon

Characteristics of the analysis question;

Characteristics of the data;

Characteristics of the sampling design.

Characteristics of the Analysis Question

Whether there is a distinction between independent and dependent variables or not

Whether the nature of the research problem requires:

  Description, exploration, estimation, or

  Testing of a hypothesis or model

Whether the focus of research is on 'variables' or 'objects'



Characteristics of the Data

Types of data sets

  Individuals - variables data sets

  Proximities data sets

    Variable - Variable Proximities

    Individual - Individual Proximities

Types of Variables

  Continuous or Quantitative Variables

  Discrete or Qualitative Variables

Variable types by measurement level

  Nominal-scale variables

  Ordinal-scale variables

  Interval-scale variables

  Ratio-scale variables



Techniques for problems with a distinction between independent and dependent variables

No. of variables         Measurement level                                 Analysis method
Dependent  Independent   Dependent                Independent
One        One           Nominal                  Nominal                  Non-parametric tests, Chi-square
One        One           Nominal (dichotomous)    Nominal                  Multiple Classification Analysis
One        One           Nominal                  Nominal (dichotomous)    Wilcoxon's two-sample test, Chi-square, Kolmogorov-Smirnov test
One        One           Interval-scale           Nominal (dichotomous)    t-test, Analysis of Variance
One        One           Interval-scale           Interval-scale           Regression Analysis
One        One           Interval-scale           Nominal                  Analysis of Variance
One        More          Nominal                  Interval-scale           Discriminant Analysis
One        More          Interval-scale           Nominal                  Analysis of Variance, Multiple Regression Analysis, Multiple Classification Analysis
One        More          Interval-scale           Dummy                    Analysis of Variance, Multiple Regression Analysis, Multiple Classification Analysis
One        More          Interval-scale           Interval-scale           Multiple Regression Analysis
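The table can be read as a simple lookup from the number of independent variables and the two measurement levels to a list of candidate techniques. The sketch below encodes it that way in Python; the dictionary name, the function name and the lower-case key spellings are illustrative choices, not part of IDAMS, and the rows are taken from the table as printed.

```python
# Lookup sketch of the technique-selection table.
# Key: (number of independent variables, dependent level, independent level).
# All rows have exactly one dependent variable, so that column is omitted.
TECHNIQUES = {
    ("one", "nominal", "nominal"): ["Non-parametric tests", "Chi-square"],
    ("one", "nominal (dichotomous)", "nominal"): ["Multiple Classification Analysis"],
    ("one", "nominal", "nominal (dichotomous)"): [
        "Wilcoxon's two-sample test", "Chi-square", "Kolmogorov-Smirnov test"],
    ("one", "interval-scale", "nominal (dichotomous)"): ["t-test", "Analysis of Variance"],
    ("one", "interval-scale", "interval-scale"): ["Regression Analysis"],
    ("one", "interval-scale", "nominal"): ["Analysis of Variance"],
    ("more", "nominal", "interval-scale"): ["Discriminant Analysis"],
    ("more", "interval-scale", "nominal"): [
        "Analysis of Variance", "Multiple Regression Analysis",
        "Multiple Classification Analysis"],
    ("more", "interval-scale", "dummy"): [
        "Analysis of Variance", "Multiple Regression Analysis",
        "Multiple Classification Analysis"],
    ("more", "interval-scale", "interval-scale"): ["Multiple Regression Analysis"],
}


def suggest_techniques(n_independent, dependent_level, independent_level):
    """Return the candidate techniques for one table row, or [] if no row matches."""
    key = (n_independent.lower(), dependent_level.lower(), independent_level.lower())
    return TECHNIQUES.get(key, [])


print(suggest_techniques("More", "Nominal", "Interval-scale"))
```

An empty result only means that the combination is not listed in this table, not that no suitable technique exists.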



Usual way of statistical problem solving

Formulate the question using the terms and logic of the specific field of the problem (science management, pedagogy, economics, etc.)

Reformulate the question using statistical terms and logic

Find appropriate statistical model(s) and technique(s)

Use the selected model(s) and technique(s)

Give a statistical interpretation to the results obtained

Reformulate the interpretation in terms of the original field of application



Scientific products by country

Question in research management

Research groups have multiple outputs comprising publications, patents, experimental materials, etc. What are the differences, if any, in the performance of the research groups of selected countries?

Statistical question

Can we construct a reasonable productivity index, using the following measures of the scientific output?

  Articles in country

  Articles abroad

  Original research reports

  Patents

  Algorithms and designs

  Experimental material

Can we find a significant difference by country in the productivity index?



Statistical model and technique

Partial order scoring for constructing the index of research output

Analysis of variance for testing the hypothesis concerning the significance of the difference

Use of the selected model and technique



RUN POSCOR
FILES
PRINT = POSCOR.LST
DICTIN = R2R3RU.DIC
DATAIN = R2RU.DAT
DICTOUT = POSCOR.DIC
DATAOUT = POSCOR.DAT
SETUP
POSCOR SCORES OF RU OUTPUTS
BADDATA=MD1 -
IDVAR=V2 -
TRANSVARS=(V1)
POSCOR ORDER=DESR -
ANAME= OUTPUT
VARS=(V116,V118,V122,V126,V128,V130)

RUN ONEWAY
FILES
PRINT = ONEWAY1.LST
DICTIN = POSCOR.DIC
DATAIN = POSCOR.DAT
SETUP
ANALYSIS OF VARIANCE OF RU OUTPUT
BADDATA=MD1 -
PRINT=CDICT DEPVARS=(V8) CONVARS=(R1)
RECODE
R1=RECODE V15 (40)=1, (360)=2, (410)=3, (638)=4, (844)=5, (868)=6
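The idea behind partial order scoring can be illustrated outside IDAMS. A case "dominates" another when it is at least as high on every output measure and higher on at least one; a score is then derived from how many cases a unit dominates and how many dominate it. The Python sketch below uses one simple convention chosen for illustration (share of comparable cases that are dominated, on a 0-100 scale); it is an assumption and does not reproduce the formulas or the ORDER options of the POSCOR program. The function names and the data are hypothetical.

```python
def dominates(a, b):
    """True if profile a is >= b on every measure and > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def partial_order_scores(profiles):
    """Illustrative 0-100 score per case.

    Score = 100 * (cases dominated) / (cases dominated + cases dominating).
    A case that is comparable with no other case gets the neutral value 50.
    This convention is an assumption, not the POSCOR formula.
    """
    scores = []
    for i, a in enumerate(profiles):
        others = [b for j, b in enumerate(profiles) if j != i]
        below = sum(dominates(a, b) for b in others)
        above = sum(dominates(b, a) for b in others)
        comparable = below + above
        scores.append(100.0 * below / comparable if comparable else 50.0)
    return scores


# Hypothetical output profiles (three output counts) for four research units.
units = [(5, 1, 2), (3, 0, 1), (6, 2, 2), (1, 3, 0)]
print(partial_order_scores(units))
```

The last unit is high on one measure and low on the others, so it is comparable with no other unit; this is the situation in which a partial order differs from a simple sum of outputs.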



Use of the selected model and technique (results)

Code    N   Weight-sum     %     Mean   S.D.(estim.)   Sum of X     %   Sum of X-square
1     334          334  22.9   37.731         35.794   1.26E+04  16.8          9.02E+05
2     239          239  16.4   45.213         35.778   1.08E+04  14.4          7.93E+05
3     200          200  13.7   77.585         27.336   1.55E+04  20.7          1.35E+06
4     225          225  15.4   52.547          35.43   1.18E+04  15.7          9.02E+05
5     233          233    16     36.7         33.266   8.55E+03  11.4          5.71E+05
6     229          229  15.7   69.074         36.255   1.58E+04  21.1          1.39E+06

Total sum of squares            2048467
For 6 groups, Eta               0.4018943
For 6 groups, Etasq             0.161519
For 6 groups, Eta(adj)          0.3982909
For 6 groups, Etasq(adj)        0.1586357
Between means sum of squares    330866.5
Within groups sum of squares    1717601
F(5,1454)                       56.018
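The summary statistics can be recomputed from the reported sums of squares and group sizes. The short Python check below does this; the variable names are illustrative, and the adjusted eta squared uses a common definition (one minus the ratio of the within-groups mean square to the total mean square), which is assumed here because it reproduces the reported value.

```python
import math

# Values reported in the ONEWAY output.
ss_between = 330866.5
ss_within = 1717601.0
ss_total = 2048467.0
group_sizes = [334, 239, 200, 225, 233, 229]

k = len(group_sizes)          # number of groups (countries)
n = sum(group_sizes)          # total number of cases
df_between = k - 1
df_within = n - k

# F ratio: between-groups mean square over within-groups mean square.
f_value = (ss_between / df_between) / (ss_within / df_within)

# Eta squared: share of the total sum of squares lying between group means.
eta_sq = ss_between / ss_total

# Adjusted eta squared (assumed definition, see the text).
eta_sq_adj = 1 - (ss_within / df_within) / (ss_total / (n - 1))

print(df_between, df_within)
print(round(f_value, 3), round(eta_sq, 4), round(math.sqrt(eta_sq), 4))
print(round(eta_sq_adj, 4), round(math.sqrt(eta_sq_adj), 4))
```

The recomputed values agree with the listing to about three decimal places; small differences in later digits are expected because the printed sums of squares are rounded.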




Statistical interpretation

The value F(5,1454) = 56.018 shows that there is a highly significant difference by country in the constructed performance index. We also see a medium-strength differentiation between the countries: Eta(adj) = 0.398.

The mean values show the level of each country.

Interpretation for research management

There are two countries with a low, two with a medium and two with a high productivity index.

Source

P.S. Nagpaul: Guide to Advanced Data Analysis using IDAMS Software



Performance, motivation and creativity of schoolchildren

Question in psychology - pedagogy

Intellectual performance, motivation and creativity of school children can be measured by using several indicators. Some of them are produced by the children themselves (e.g. IQ tests), others are based on the evaluation given by their teachers (e.g. average grade). What are the perceivable dimensions, if any, behind these indicators?

Statistical question

In the set of the listed indicators, are there any groups within which statistical inter-correlation and between which statistical independence can be detected?

Indicators (C: produced by the children, T: evaluation given by teachers)

  T Average grade

  C IQ

  C Creativity test

  C Creative attitude

  T Creative behaviour

  C Achievement motivation

  T Motivated behaviour

  T Motivation index




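Statistical model and technique

Pearsonian correlation between the measured indicators

Multidimensional scaling, cluster analysis

Use of the selected model and technique

Executing PEARSON, MDSCAL, CLUSFIND in IDAMS

MDSCAL result

[Figure: MDSCAL configuration of the indicators; legend: Teachers, Children]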




CLUSFIND result

[Figure: CLUSFIND dendrogram. Leaves in order: C IQ, C Creativity test, C Creative attitude, C Achievement motivation, T Average grade, T Creative behaviour, T Motivated behaviour, T Motivation index. Levels shown: 0.75, 0.71, 0.40, 0.45, 0.27, 0.13, 0.02]
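The chain "correlations first, then grouping of the indicators" can be sketched in plain Python. The example below computes Pearson correlations and then merges indicators by average linkage on those correlations. It is only an illustration under stated assumptions: the data are synthetic (four indicators driven by one latent factor, four by another), not the study data, and average linkage is a stand-in technique, not necessarily the algorithm used by CLUSFIND.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_linkage(corr, n_clusters):
    """Merge the two clusters with the highest mean correlation
    until only n_clusters remain; return the clusters as index lists."""
    clusters = [[i] for i in range(len(corr))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = sum(corr[i][j] for i in clusters[a] for j in clusters[b])
                sim /= len(clusters[a]) * len(clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Synthetic data (NOT the study data): indicators 0-3 follow one latent
# factor, indicators 4-7 follow a second, independent one.
rng = random.Random(0)
n_cases = 200
latent_c = [rng.gauss(0, 1) for _ in range(n_cases)]
latent_t = [rng.gauss(0, 1) for _ in range(n_cases)]
indicators = (
    [[v + rng.gauss(0, 0.3) for v in latent_c] for _ in range(4)]
    + [[v + rng.gauss(0, 0.3) for v in latent_t] for _ in range(4)]
)

corr = [[pearson(x, y) for y in indicators] for x in indicators]
groups = average_linkage(corr, n_clusters=2)
print(groups)
```

With data built this way the two recovered groups coincide with the two latent factors, which mirrors the kind of separation the slides report between the child-produced and teacher-given indicators.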




Statistical interpretation

Multidimensional scaling shows a clear separation of the indicators produced by children and by teachers

Cluster analysis supports the finding of the separation of variables coming from teachers and children

Pedagogical/psychological interpretation

Just one aspect: ratings given by teachers to children are nearly the same, independently of the evaluated ability, attitude or behaviour dimension

Source

M. Hunya: Multidimensional statistical techniques in pedagogical studies

Data

A. Deak, B. Kozeki: Study into the effect of motivation and creativity factors on the performance of school children



Prediction of river flow values

Question in hydrology

We have water level data on four rivers in North Africa (more than 40 years). Can the water flow level be predicted on the basis of data from the past? If so, with what precision? What if the average flow level is considered instead of the individual ones?

Statistical question

Can the river flow values be predicted by using a set of values from the preceding period?

How does the prediction change if the 6 month average flow is used?




Statistical model and technique

Autoregression model (with a lag of 12 to 36) applied to the river flow time series

Transformation of the original data into a time series of moving averages (interval length = 6)

Use of the selected model and technique

Time Series Analysis option from the IDAMS interactive facilities

Lag         Original series   Moving average series
12 months   R**2 = 0.32       R**2 = 0.92
24 months   R**2 = 0.35       R**2 = 0.93
36 months   R**2 = 0.36
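The two data-preparation steps named here, the moving average transformation and the pairing of each value with its preceding values, can be sketched in plain Python. The function names and the flow figures are hypothetical; fitting the autoregression itself (a linear regression of each target on its lagged values) is not shown.

```python
def moving_average(series, window=6):
    """Return the means of all consecutive windows of the given length."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    total = sum(series[:window])
    out = [total / window]
    for i in range(window, len(series)):
        # Slide the window: add the new value, drop the oldest one.
        total += series[i] - series[i - window]
        out.append(total / window)
    return out


def lagged_rows(series, n_lags):
    """Pair each value with the n_lags values that precede it.

    Each row is (list of predictors, target); an autoregression model
    fits the target as a linear function of the predictors.
    """
    return [(series[t - n_lags:t], series[t]) for t in range(n_lags, len(series))]


# Hypothetical monthly flow values (not the UNESCO data).
flows = [12, 30, 55, 41, 20, 10, 8, 25, 60, 44, 18, 9]
smoothed = moving_average(flows, window=6)
rows = lagged_rows(smoothed, n_lags=3)
print(len(smoothed), len(rows))
```

Neighbouring 6 month averages share five of their six values, so the smoothed series changes slowly from one step to the next; this is consistent with the much higher R**2 reported for the moving average series.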


[Figure: Original series]

[Figure: Moving average series]


Statistical interpretation

Autoregression shows that individual values can be predicted (unbiased R**2 = 0.32 - 0.36 for 12 to 36 months) with moderate or average precision; high peak values are very poorly reproduced.

In the case of a 6 month moving average, the prediction is nearly perfect (unbiased R**2 = 0.92 for 12 months).

Hydrological interpretation

Although the pattern of changes can be reproduced fairly well, even three years of data from the past are not enough at all to predict the height of peak flows.

But if we consider 6 month averages, they can be predicted with almost full precision.

Data

UNESCO, Water Science Division



Business

Question concerning company management

What are the factors that influence the economic performance of a company? Economic performance is measured by the return on capital employed.

Statistical question

Can the return on capital be predicted by using a set of economic and production indicators from those characterizing the company?

How does the prediction change if we are looking for a subset of best predictors?

Statistical model and technique

Multiple linear regression

Stepwise regression




Use of the selected model and technique

Running REGRESSN

Results

The full regression model explains 70% of the adjusted variance of the dependent variable. Its standard error is about one half of the mean, and the value of the determinant of the correlation matrix is 0.79478E-05. There are 8 variables (out of 12) with high covariance ratio values.

The stepwise regression model selects 3 variables for explaining 80% of the variance. No multicollinearity (0.77647). Standard error of the estimate of the dependent variable = 0.06135, which is quite low: high reliability of estimation.
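The determinant of the predictors' correlation matrix is used here as a multicollinearity indicator: it equals 1 when the predictors are uncorrelated and approaches 0 when some of them are nearly linear combinations of others. The Python sketch below illustrates this with two small made-up correlation matrices; the function name and the matrices are hypothetical and are not the data behind the reported values.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        # Choose the row with the largest entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap changes the sign of the determinant
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return det


# Hypothetical correlation matrices of three predictors.
nearly_collinear = [[1.0, 0.99, 0.98], [0.99, 1.0, 0.99], [0.98, 0.99, 1.0]]
weakly_related = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.3], [0.1, 0.3, 1.0]]

print(determinant(nearly_collinear))
print(determinant(weakly_related))
```

Read this way, a value of the order of 1E-05 for the full model signals strong multicollinearity, while a value well away from zero signals its absence; the figure 0.77647 quoted for the stepwise model is presumably the same statistic, although the slide does not say so explicitly.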


Statistical interpretation

Full regression model: the reliability of prediction is poor. Strong multicollinearity is shown. Variables which contribute to multicollinearity can be identified.

The stepwise regression model: 3 variables for explaining 80% of the variance. No multicollinearity. High reliability of estimation.

Interpretation for management

Although the full indicator set can give a nice prediction, it cannot be suggested for real use because of the poor prediction reliability.

But if we consider 3 carefully selected indicators, we can get a fair prediction.

Source

P.S. Nagpaul, India



Education

Question concerning measurement of knowledge level

Tests are used very often in education for checking the level of knowledge in one subject or another. Long tests with many questions can meet the reliability requirement relatively easily. The question is whether we can make a short, interactive, adaptive test from a long test, preserving at least nearly the original reliability.

Statistical question

Can we give a good estimate of the original test value by using a tree structure based prediction?

Statistical model and technique

Regression tree




Use of the selected model and technique

Running SEARCH

Results

Starting from a standardized test (for checking a specific verbal aptitude) containing 20 questions, a regression tree with 3-4 questions was obtained. The regression tree contains 10 final subgroups (leaves) with estimates for the original test value ranging from 6.4 to 59.2. The explained variance is 90.4%.
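A regression tree of this kind is grown by repeatedly choosing the question whose right/wrong split best separates examinees with high and low total scores. The Python sketch below shows one such step, assuming the usual criterion of maximizing the explained share of the sum of squares; whether SEARCH uses exactly this criterion is not stated in the slides. The function names and the data are hypothetical.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(answers, scores):
    """Pick the question whose right/wrong split explains the most variance.

    answers: one list of 0/1 item results per examinee
    scores:  total test score per examinee
    Returns (question index, explained share of the total sum of squares).
    """
    total = sum_sq(scores)
    if total == 0:
        return None, 0.0
    best_q, best_share = None, 0.0
    for q in range(len(answers[0])):
        wrong = [s for a, s in zip(answers, scores) if a[q] == 0]
        right = [s for a, s in zip(answers, scores) if a[q] == 1]
        if not wrong or not right:
            continue  # this question does not split the group
        share = 1 - (sum_sq(wrong) + sum_sq(right)) / total
        if share > best_share:
            best_q, best_share = q, share
    return best_q, best_share


# Hypothetical results of six examinees on three questions.
answers = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
scores = [40, 45, 10, 15, 50, 5]
question, share = best_split(answers, scores)
print(question, round(share, 3))
```

Applying the same step again inside each subgroup yields the tree; in an adaptive test, the answer to one question then decides which question is asked next, and each leaf supplies the estimate of the full test value.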


Statistical interpretation

A very good estimate can be given for the original test value by using the obtained regression tree.

Interpretation for test designers

Using the tree structure, a computer-assisted test can be constructed which is much shorter, without losing the power of the original test.

Source

M. Hunya: Finding optimal interactive test structures (1982)
