Contingency (frequency) tables
Dependence of two qualitative variables
Examples of problems
• Is the survival of a person sent to a cholera-affected area dependent on whether that person was vaccinated against cholera?
• Is there any connection between hair colour and sex?
• Are parasite species distributed independently?
Contingency table
                        FACTOR 2
                        Category 1   Category 2   Category 3   Sum
FACTOR 1   Category 1   f11          f12          f13          R1
           Category 2   f21          f22          f23          R2
           Sum          C1           C2           C3           n
Dependence of survival on vaccination

                   Survived in tropics
                   yes    no     sum
Vaccinated   yes   100    10     110
             no    100    110    210
             sum   200    120    320

Mutual dependence of two species

                       Species 1
                       present   absent   sum
Species 2    present   100       200      300
             absent    200       1000     1200
             sum       300       1200     1500
Relationship between two categorical variables in a table
• when one of the variables is manipulated
• when one of the variables is probably the cause and the other one the consequence (response), but the study is based on non-manipulative observations
• and finally, when the possible causality is unclear
Basic rules from theory of probability
• The probability of the joint occurrence of two independent events is $P_{ij} = P_i \cdot P_j$
• Example: Half of the members of a population are male ($P_{\text{male}} = 0.5$) and a tenth of all individuals are albino ($P_{\text{albino}} = 0.1$). If albinos are equally common in both sexes (i.e. albinism and sex are independent events), then the probability that a randomly chosen individual is an albino male is $P_{\text{male}} \cdot P_{\text{albino}} = 0.5 \cdot 0.1 = 0.05$
Basic rules from theory of probability
• The expected number of successes $E(a)$ in $n$ experiments, where the probability of a success is $P_a$, is
• $E(a) = P_a \cdot n$
• Example: If the probability that a mutation occurs is 0.02, then among 100 randomly chosen individuals we expect 2 individuals with this mutation
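Both rules can be checked with a few lines of Python. This is only an illustrative sketch that reuses the numbers from the two examples above; the function names are hypothetical.

```python
def joint_probability(p_a, p_b):
    """Probability that two independent events occur together."""
    return p_a * p_b


def expected_successes(p_success, n_trials):
    """Expected number of successes in n_trials independent trials."""
    return p_success * n_trials


# Albino male: P_male = 0.5, P_albino = 0.1
print(joint_probability(0.5, 0.1))    # 0.05
# Mutation with probability 0.02 among 100 individuals
print(expected_successes(0.02, 100))  # 2.0
```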
How do we compute $\chi^2$?
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i} = \sum \frac{(O - E)^2}{E}$$
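As a minimal sketch, the statistic (the sum over all categories of the squared difference between the observed and the expected frequency, divided by the expected frequency) could be computed as follows. The function name `chi_square` and the coin-toss data are assumptions made for illustration; they do not come from the lecture.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 heads and 40 tails against a 50:50 expectation.
print(chi_square([60, 40], [50, 50]))  # 4.0
```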
How do we obtain the expected values?
H0 states that the events are independent, so we use the rule for the probability of the joint occurrence of two independent events.
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_{ij} - \hat{f}_{ij})^2}{\hat{f}_{ij}}$$
Calculation of expected values
                        FACTOR 2
                        Category 1   Category 2   Category 3   Sum
FACTOR 1   Category 1   f11          f12          f13          R1
           Category 2   f21          f22          f23          R2
           Sum          C1           C2           C3           n
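Using the marginal sums:
$$P_{i\cdot} = \frac{R_i}{n}, \qquad P_{\cdot j} = \frac{C_j}{n}, \qquad P_{ij} = P_{i\cdot} \, P_{\cdot j}$$
$$E(f_{ij}) = P_{ij} \cdot n = \frac{R_i}{n} \cdot \frac{C_j}{n} \cdot n = \frac{R_i \, C_j}{n}$$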
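The calculation from marginal sums can be sketched in Python, here applied to the vaccination table from the beginning of the lecture. The function names are hypothetical, and the layout of the table as a list of rows is an assumption of this sketch.

```python
def expected_table(table):
    """Expected frequencies under independence: E(f_ij) = R_i * C_j / n."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]


def chi_square_table(table):
    """Sum of (f_ij - E_ij)^2 / E_ij over all cells of the table."""
    expected = expected_table(table)
    return sum(
        (f - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for f, e in zip(obs_row, exp_row)
    )


# Rows: vaccinated yes / no; columns: survived yes / no.
vaccination = [[100, 10], [100, 110]]
print(expected_table(vaccination))              # [[68.75, 41.25], [131.25, 78.75]]
print(round(chi_square_table(vaccination), 2))  # 57.72
```

The expected table has the same row and column sums as the observed one, which is a quick check that the marginal sums were used correctly.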
How many cell frequencies do I need to know in order to know the complete table (given fixed marginal frequencies)?
$df = (c - 1) \cdot (r - 1)$
where $c$ is the number of columns and $r$ is the number of rows.
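A small sketch of why a 2 x 2 table has one degree of freedom: once the marginal sums are fixed, a single cell determines all the others. The helper names are hypothetical; the numbers are those of the vaccination table.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df = (c - 1) * (r - 1)."""
    return (n_cols - 1) * (n_rows - 1)


def complete_2x2(f11, row_sums, col_sums):
    """Fill in a 2 x 2 table from one cell and the fixed marginal sums."""
    r1, r2 = row_sums
    c1, _c2 = col_sums
    f12 = r1 - f11
    f21 = c1 - f11
    f22 = r2 - f21
    return [[f11, f12], [f21, f22]]


print(degrees_of_freedom(2, 2))                   # 1
print(complete_2x2(100, (110, 210), (200, 120)))  # [[100, 10], [100, 110]]
```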
Critical value at the 5% level of significance for df = 3.
What we usually write in our paper
This area is 0.029, so we write $\chi^2 = 8.99$, df = 3, P = 0.029
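The tail area can be reproduced without a statistics package. The sketch below uses a standard closed-form expression for the upper tail of the chi-square distribution with integer degrees of freedom; this is a choice made for the example, not necessarily how the value in the lecture was obtained, and the function name `chi2_sf` is hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail area P(X >= x) of a chi-square distribution, integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: start from df = 1 (erfc) and add one term per two extra df.
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total


print(round(chi2_sf(8.99, 3), 3))  # 0.029
```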
Even here Yates' correction is sometimes used (when expected frequencies are extremely low):
$$\chi^2 = \sum_{i=1}^{k} \frac{(|f_i - \hat{f}_i| - 0.5)^2}{\hat{f}_i}$$
It gives better protection against Type I error, but a weaker test.
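To see the effect of the correction, the sketch below compares the uncorrected statistic with the corrected one, in which 0.5 is subtracted from each absolute difference between observed and expected frequency before squaring. The data and function names are hypothetical.

```python
def chi_square(observed, expected):
    """Uncorrected statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_yates(observed, expected):
    """Corrected statistic: sum of (|O - E| - 0.5)^2 / E."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical small sample: 8 and 2 observed where 5 and 5 are expected.
print(chi_square([8, 2], [5, 5]))        # 3.6
print(chi_square_yates([8, 2], [5, 5]))  # 2.5
```

The corrected value is smaller, so the null hypothesis is rejected less readily, which matches the trade-off between Type I error protection and power.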
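Another test criterion, also with a $\chi^2$ distribution
$$G = 2\left[\sum_i \sum_j f_{ij} \ln f_{ij} - \sum_i R_i \ln R_i - \sum_j C_j \ln C_j + n \ln n\right]$$
$$G = 4.60517\left[\sum_i \sum_j f_{ij} \log f_{ij} - \sum_i R_i \log R_i - \sum_j C_j \log C_j + n \log n\right]$$
the so-called likelihood ratio (LR) $\chi^2$; in the second form $\log$ is the base-10 logarithm ($4.60517 = 2 \ln 10$).
Similar results
"Normal" $\chi^2 = 8.99$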
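A minimal sketch of the G statistic, written directly from the sums of f ln f over the cells, the row sums, the column sums and n. The function name is hypothetical, and the vaccination table is reused only as convenient input; it is not the data behind the value 8.99.

```python
import math


def g_statistic(table, log=math.log, factor=2.0):
    """G = factor * [sum f ln f - sum R ln R - sum C ln C + n ln n]."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)

    def x_log_x(values):
        # A zero frequency contributes nothing (x ln x -> 0 as x -> 0).
        return sum(v * log(v) for v in values if v > 0)

    cells = [f for row in table for f in row]
    return factor * (
        x_log_x(cells) - x_log_x(row_sums) - x_log_x(col_sums) + x_log_x([n])
    )


vaccination = [[100, 10], [100, 110]]
g_natural = g_statistic(vaccination)
# Same statistic with base-10 logarithms and the factor 4.60517.
g_base10 = g_statistic(vaccination, log=math.log10, factor=4.60517)
print(round(g_natural, 2), round(g_base10, 2))
```

Both calls should agree up to rounding of the constant, because 4.60517 is 2 ln 10.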
2 by 2 tables
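                        Character 1
                        present     absent      sum
Character 2   present   a           b           m = a + b
              absent    c           d           n = c + d
              sum       r = a + c   s = b + d   N = a + b + c + d

$$\chi^2 = \frac{N (ad - bc)^2}{m \, n \, r \, s} = \frac{n \, (f_{11} f_{22} - f_{12} f_{21})^2}{R_1 R_2 C_1 C_2}$$
(in the second form, $n$ is the total number of observations, as in the general table)
Note that for a table that exactly fits the null hypothesis,
ad = bc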
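The shortcut for a 2 x 2 table, N(ad - bc)^2 divided by the product of the four marginal sums, can be sketched as follows. The function name is hypothetical; the first call uses the vaccination table from earlier in the lecture, the second a made-up table with ad = bc.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table: N * (ad - bc)^2 / (m * n * r * s)."""
    m, n = a + b, c + d  # row sums
    r, s = a + c, b + d  # column sums
    total = a + b + c + d
    return total * (a * d - b * c) ** 2 / (m * n * r * s)


print(round(chi_square_2x2(100, 10, 100, 110), 2))  # 57.72
# Hypothetical table with ad = bc, i.e. exactly matching independence.
print(chi_square_2x2(10, 20, 30, 60))               # 0.0
```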
Statistical and causal dependence
• Causal dependence can be demonstrated only by a manipulative experiment

                   Survived in tropics
                   yes    no     sum
Vaccinated   yes   100    10     110
             no    100    110    210
             sum   200    120    320

For a "correct" experiment, everyone has to be vaccinated, but half of them get just a placebo (compare what is feasible with what statistics demands).
Basic rules for the experimenter
• Every treatment has to have its control
• The control differs from the treatment only in the effect that I want to demonstrate (this is often very difficult to achieve)
• I have to have independent replications
Advantages of experiments and of observational studies
• Causality can be demonstrated by an experiment
• The range of experimental manipulations is usually limited
• Almost every experimental intervention has side effects, which are sometimes unpredictable
Fisher's exact test
How large is the probability that I get this table, or one that deviates even more, with the given marginal frequencies (provided that the null hypothesis is true; computed by combinatorics)?
It is used for 2 x 2 tables when the numbers of observations are low.
If I have the table

         +     -     marg.
+        5     7     12
-        4     20    24
marg.    9     27    36

then Fisher's test directly computes the probability of this table and of all tables that are more extreme (from the point of view of H0), i.e.

         +     -     marg.
+        6     6     12
-        3     21    24
marg.    9     27    36

         +     -     marg.
+        7     5     12
-        2     22    24
marg.    9     27    36

         +     -     marg.
+        8     4     12
-        1     23    24
marg.    9     27    36

         +     -     marg.
+        9     3     12
-        0     24    24
marg.    9     27    36

The sum of all these probabilities is the achieved level of significance for a one-tailed test (that is why statistical software also prints 2*p).
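The one-tailed probability for the example can be sketched with binomial coefficients: each table with the fixed marginal sums 12, 24, 9 and 27 has a hypergeometric probability, and the probabilities of the observed table and of the more extreme ones are added up. The function names are hypothetical; `math.comb` requires Python 3.8 or newer.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of a 2 x 2 table with fixed marginal sums."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_one_tailed(a, b, c, d):
    """Sum the probabilities of the observed table and of all tables that
    are more extreme in the direction of a larger upper-left cell."""
    total = 0.0
    while b >= 0 and c >= 0:
        total += table_probability(a, b, c, d)
        # Move one observation so that all marginal sums stay the same.
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return total


p = fisher_one_tailed(5, 7, 4, 20)
print(round(p, 4))  # 0.1117
```

The loop visits exactly the five tables with upper-left cells 5, 6, 7, 8 and 9.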
Let us compare two tables:
                       Species 1
                       present   absent   sum
Species 2    present   100       200      300
             absent    200       1000     1200
             sum       300       1200     1500

                       Species 1
                       present   absent   sum
Species 2    present   10        20       30
             absent    20        100      120
             sum       30        120      150

$\chi^2$ and the power of the test grow with the number of observations, even though both tables are very probably samples from the same population.
Measures of association strength in a 2 x 2 table, independent of sample size
$Y = \frac{ad}{bc} = \frac{f_{11} f_{22}}{f_{21} f_{12}}$; disadvantage: asymmetric, from 0 for negative association, through 1 for independence, to +infinity for positive association
A second measure ranges from -1, through 0 for independence, to +1; -1 and +1 mean the maximal possible association for the given values of the marginal frequencies
$V = \frac{f_{11} f_{22} - f_{12} f_{21}}{\sqrt{R_1 R_2 C_1 C_2}}$, so that $V^2 = \chi^2 / n$; ranges from -1, through 0 for independence, to +1; -1 and +1 mean the maximal possible association for any values of the marginal frequencies
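A short sketch comparing the two species tables from the previous section: the ratio ad/bc and V (often called the phi coefficient) are identical for both tables, while chi-square, recovered here as n times V squared, differs by a factor of ten. The function names are hypothetical.

```python
import math


def cross_product_ratio(a, b, c, d):
    """Y = ad / bc (undefined when b or c is zero)."""
    return (a * d) / (b * c)


def v_coefficient(a, b, c, d):
    """V = (ad - bc) / sqrt(R1 * R2 * C1 * C2)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


tables = {"large": (100, 200, 200, 1000), "small": (10, 20, 20, 100)}
for name, cells in tables.items():
    n = sum(cells)
    v = v_coefficient(*cells)
    # Chi-square of a 2 x 2 table equals n * V^2.
    print(name, cross_product_ratio(*cells), round(v, 3), round(n * v ** 2, 2))
# large 2.5 0.167 41.67
# small 2.5 0.167 4.17
```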
Multidimensional frequency tables
Nowadays generalized linear models are used in these cases.
[Figure: three-dimensional frequency table of Species A (present / absent) by Species B (present / absent) by years]