Post on 05-Jan-2016
An Index of Data Size to Extract Decomposable Structures in LAD
Hirotaka Ono
Mutsunori Yagiura
Toshihide Ibaraki
(Kyoto University)
Overview
1. Overview of LAD
2. Decomposability
- Importance & motivation
3. An index of decomposability
- #data vectors needed to extract reliable decomposable structures
- Based on probabilistic analyses
4. Numerical experiments
5. Conclusion
Logical Analysis of Data (LAD)
Input: $T, F \subseteq \{0, 1\}^n$
Output: discriminant function $f$ with
  $f(x) = 1$ for $x \in T$
  $f(x) = 0$ for $x \in F$
T: positive examples (the phenomenon occurs)
F: negative examples (the phenomenon does not occur)
f(x): a logical explanation of the phenomenon
For a phenomenon
Example: influenza (1=Yes, 0=No)
     x1     x2        x3     x4      x5
     Fever  Headache  Cough  Snivel  Stomachache
T    1      1         0      1       1
     1      0         1      1       1
     1      1         1      1       0
F    1      0         0      1       1
     1      1         0      0       0
     0      1         0      1       1
T: Set of patients having influenza
F: Set of patients having common cold
An example of discriminant functions: $f(x) = x_1 x_2 x_4 \vee x_1 x_3 x_4$
Discriminant function f (x) represents knowledge “influenza”.
One kind of knowledge acquisition
Guideline to find a discriminant function
• Simplicity
• Explain the structure of the phenomenon
Decomposability
     x1  x2  x3  x4  x5  h(x[S1])
T    1   1   0   1   1   1
     1   0   1   1   1   1
     1   1   1   1   0   1
F    1   0   0   1   1   0
     1   1   0   0   0   1
     0   1   0   1   1   1
$S_0 = \{1, 4, 5\}$, $S_1 = \{2, 3\}$
$h(x[S_1]) = x_2 \vee x_3$
$f(x) = x_1 x_2 x_4 \vee x_1 x_3 x_4 = x_1 x_4\, h(x[S_1])$: decomposable!
$f$ is decomposable $\Leftrightarrow$ $f(x) = g(x[S_0], h(x[S_1]))$
$(T, F)$ is decomposable $\Leftrightarrow$ there exists a decomposable discriminant function $f$
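The decomposition of the influenza example can be checked mechanically. The following is a minimal sketch, assuming the six vectors and the index sets S0 = {1, 4, 5}, S1 = {2, 3} from the slide; the Python encoding and the helper names (`project`, `f_decomposed`, `is_discriminant`) are illustrative assumptions, not part of the original work.

```python
from itertools import product

# Data from the influenza example; columns are x1..x5.
T = [(1, 1, 0, 1, 1), (1, 0, 1, 1, 1), (1, 1, 1, 1, 0)]  # positive examples
F = [(1, 0, 0, 1, 1), (1, 1, 0, 0, 0), (0, 1, 0, 1, 1)]  # negative examples

S0 = (1, 4, 5)  # attributes passed directly to g (1-based indices)
S1 = (2, 3)     # attributes summarized by the sub-concept h


def project(x, S):
    """Return x[S], the sub-vector of x on the 1-based index set S."""
    return tuple(x[i - 1] for i in S)


def f(x):
    """Discriminant function from the slide: x1 x2 x4 OR x1 x3 x4."""
    x1, x2, x3, x4, _x5 = x
    return int((x1 and x2 and x4) or (x1 and x3 and x4))


def h(x_s1):
    """Sub-concept on S1: x2 OR x3."""
    x2, x3 = x_s1
    return int(x2 or x3)


def g(x_s0, h_value):
    """Outer function on x[S0] and h: x1 AND x4 AND h (x5 is unused)."""
    x1, x4, _x5 = x_s0
    return int(x1 and x4 and h_value)


def f_decomposed(x):
    """f written in the decomposed form g(x[S0], h(x[S1]))."""
    return g(project(x, S0), h(project(x, S1)))


def is_discriminant(func):
    """True if func is 1 on every vector of T and 0 on every vector of F."""
    return all(func(x) == 1 for x in T) and all(func(x) == 0 for x in F)


# The two forms agree on all 2^5 vectors, not only on the given data.
same_everywhere = all(f(x) == f_decomposed(x) for x in product((0, 1), repeat=5))
print(is_discriminant(f), is_discriminant(f_decomposed), same_everywhere)
```

Both forms separate T from F, and they coincide on every vector of {0, 1}^5, which is what the identity f(x) = x1 x4 h(x[S1]) asserts.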
Example: concept of “square”
       x1  x2  x3  x4
i      1   1   1   0
ii     1   1   1   1
iii    0   1   1   0
iv     1   0   0   1
v      1   1   0   1
x1: the lengths of all edges are equal
x2: the number of vertices is 4
x3: contains a right angle
x4: the area is over 100
T: i, ii
F: iii, iv, v
Example: concept of “square”
Square
- the lengths of all edges are equal
- the number of vertices is 4
- contains a right angle

Square
- contains a right angle
- rhombus
  - the lengths of all edges are equal
  - the number of vertices is 4
Hierarchical structures and decomposable structures
Without decomposition: the Concept $f(x)$ is defined directly on all attributes.
With decomposition: the Concept $f(x) = g(x[S_0], h(x[S_1]))$ is defined on the attributes in $S_0$ and on the Sub-Concept $h(x[S_1])$, which is itself defined on the attributes in $S_1$.
Previous research on decomposability
• Finding basic decomposable functions (e.g., $g(x[S_0], h(x[S_1]))$) for given $(T, F)$ and attribute sets
  • $g(x[S_0], h(x[S_1]))$ case: polynomial time [Boros, et al. 1994]
• Finding other classes (positive, Horn, and their mixtures) of decomposable functions for $(T, F)$ and attribute sets [Makino, et al. 1995]
• Finding a (positive) decomposable function $g(x[S_0], h(x[S_1]))$ for given $(T, F)$ ($(S_0, S_1)$ is not given)
  • NP-hard
  • proposing a heuristic algorithm [Ono, et al. 1999]
The number of data and decomposable structures
• Case 1: The size of given data is small.
  – Advantage: Less computational time is needed to find a decomposable structure.
  – Disadvantage: Decomposable structures easily exist in the data (because there are fewer constraints), so most decomposable structures are deceptive.
• Case 2: The size of given data is large.
  – Advantage: Deceptive decomposable structures will not be found.
  – Disadvantage: More computational time is needed.
How many data vectors should be prepared
to extract real decomposable structures?
Index of decomposability
$(T, F)$ is decomposable $\Leftrightarrow$ the conflict graph of $(T, F)$ is bipartite
Overview of our approach
Assume that $(T, F)$ is a set of $l$ vectors chosen at random from $\{0, 1\}^n$.
1. Compute the probability of an edge to appear in the conflict graph
2. Regard the conflict graph as a random graph
Investigate the probability of the conflict graph to be non-bipartite
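Conflict graph
     S0      S1
T    1 0 0   1 1
     0 1 0   1 0
     1 0 0   1 0
F    1 0 0   0 1
     0 1 0   1 1
     0 1 0   0 1
Vertices of the conflict graph: the assignments to $x[S_1]$, i.e. 00, 01, 11, 10

Suppose $h(11) = 1$. Then $h(01) = 0$, hence $h(10) = 1$, hence $h(11) = 0$: a contradiction, so no $h(x[S_1])$ exists for this $(T, F)$.
$(T, F)$ is decomposable $\Leftrightarrow$ the conflict graph of $(T, F)$ is bipartite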
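The bipartiteness test can be sketched in a few lines. The code below is a minimal illustration, assuming that the first three rows of the slide's example belong to T and the last three to F, and that S0 is the first three columns and S1 the last two; the function names `conflict_edges` and `is_bipartite` are hypothetical. Two S1-assignments a and b are joined when some y has one of ya, yb in T and the other in F, and bipartiteness is checked by breadth-first 2-coloring.

```python
from collections import deque
from itertools import product


def conflict_edges(T, F, s0_len):
    """Edges {a, b} such that some y has (y, a) in T and (y, b) in F."""
    edges = set()
    for t in T:
        for u in F:
            if t[:s0_len] == u[:s0_len]:  # same assignment y on S0
                edges.add(frozenset((t[s0_len:], u[s0_len:])))
    return edges


def is_bipartite(vertices, edges):
    """2-color the graph by BFS; return False if an odd cycle is found."""
    adj = {v: set() for v in vertices}
    for e in edges:
        if len(e) == 1:  # self-loop: the same vector is in both T and F
            return False
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    color = {}
    for start in vertices:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in color:
                    color[w] = 1 - color[v]
                    queue.append(w)
                elif color[w] == color[v]:
                    return False
    return True


# Example from the slide (assumed split: first three rows in T, last three in F).
T = [(1, 0, 0, 1, 1), (0, 1, 0, 1, 0), (1, 0, 0, 1, 0)]
F = [(1, 0, 0, 0, 1), (0, 1, 0, 1, 1), (0, 1, 0, 0, 1)]

vertices = list(product((0, 1), repeat=2))  # assignments to x[S1]
edges = conflict_edges(T, F, s0_len=3)
print(sorted(tuple(sorted(e)) for e in edges), is_bipartite(vertices, edges))
```

Under these assumptions the edges form a triangle on 01, 10 and 11, so the graph is not bipartite, which matches the contradiction derived for h on the slide.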
Probability of an edge to appear in conflict graph
Here $ya$ denotes the vector with $x[S_0] = y$ and $x[S_1] = a$.
A pair of vectors $(ya, yb)$ is called linked if
  $ya \in T$ and $yb \in F$, or
  $ya \in F$ and $yb \in T$.
Edge $e = (a, b)$ appears in the conflict graph $\Leftrightarrow$ there exists a linked pair $(ya, yb)$.
Define a random variable $X_e$ by
  $X_e = \sum_{y \in \{0,1\}^{S_0}} X_{ey}$,
where
  $X_{ey} = 1$ if $(ya, yb)$ is linked, and $X_{ey} = 0$ otherwise.
$X_e \ge 1$ $\Leftrightarrow$ edge $e$ appears in the conflict graph.
We want to compute $\Pr(X_e \ge 1) = \Pr\bigl(\sum_{y \in \{0,1\}^{S_0}} X_{ey} \ge 1\bigr)$.
Assumptions
• Generation of (T, F)
- $|T| + |F| = l$ vectors are randomly sampled from $\{0, 1\}^n$ without replacement.
- A sampled vector is in $T$ with probability $p$, and in $F$ with probability $q = 1 - p$.
• $M = 2^n$
• $m = 2^{|S_0|}$
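How to compute $\Pr(X_e \ge 1) = \Pr\bigl(\sum_{y \in \{0,1\}^{S_0}} X_{ey} \ge 1\bigr)$
$\Pr(X_{ey} = 1)$ is easier to compute.
$X_{ey} = 1$ $(e = (a, b))$ $\Leftrightarrow$
  1. Both of $ya$ and $yb$ are chosen in $T \cup F$.
  2. They have different values (i.e., 0 and 1).
$\Pr(X_{ey} = 1) = 2pq \cdot \frac{l(l-1)}{M(M-1)}$
  (the factor $\frac{l(l-1)}{M(M-1)}$ comes from 1., the factor $2pq$ from 2.)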
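The single-pair probability, 2pq l(l-1)/(M(M-1)), can be sanity-checked by simulation. The sketch below is an assumption-laden illustration: it samples l of M vectors without replacement, labels each sampled vector as T with probability p independently, and counts how often two fixed vectors (standing in for ya and yb) are both sampled with different labels. The function names and the parameter values are hypothetical.

```python
import random


def pr_xey(l, M, p):
    """Closed form from the slide: 2pq * l(l-1) / (M(M-1))."""
    q = 1 - p
    return 2 * p * q * l * (l - 1) / (M * (M - 1))


def simulate_pr_xey(l, M, p, trials, seed=0):
    """Monte Carlo estimate of Pr(X_ey = 1) under the sampling assumptions."""
    rng = random.Random(seed)
    ya, yb = 0, 1  # two fixed distinct vectors, encoded as integers
    hits = 0
    for _ in range(trials):
        sampled = rng.sample(range(M), l)  # l vectors, without replacement
        in_T = {x: rng.random() < p for x in sampled}  # True means "in T"
        if ya in in_T and yb in in_T and in_T[ya] != in_T[yb]:
            hits += 1
    return hits / trials


exact = pr_xey(l=6, M=16, p=0.5)
estimate = simulate_pr_xey(l=6, M=16, p=0.5, trials=20000)
print(exact, estimate)
```

With these small illustrative parameters the closed form evaluates to 0.0625, and the estimate should land close to it up to sampling noise.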
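Upper and lower bounds on $\Pr(X_e \ge 1)$
$\Pr(X_{ey} = 1) = 2pq \cdot \frac{l(l-1)}{M(M-1)}$
Upper Bound: By Markov’s inequality and linearity of expectation,
  $\Pr(X_e \ge 1) \le \mathrm{Ex}(X_e) = \mathrm{Ex}\bigl(\sum_{y \in \{0,1\}^{S_0}} X_{ey}\bigr) = \sum_{y \in \{0,1\}^{S_0}} \mathrm{Ex}(X_{ey}) = m\,\mathrm{Ex}(X_{ey}) = m \Pr(X_{ey} = 1) = 2pq\,m\,\frac{l(l-1)}{M(M-1)}$
Lower Bound: By the principle of inclusion and exclusion,
  $\Pr(X_e \ge 1) \ge \sum_{y \in \{0,1\}^{S_0}} \Pr(X_{ey} = 1) - \sum_{\{y, y'\} \subseteq \{0,1\}^{S_0},\ y \ne y'} \Pr(X_{ey} = X_{ey'} = 1)$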
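Approximation of $\Pr(X_e \ge 1)$
When the upper bound is small, the inclusion-exclusion correction is of second order and the two bounds nearly coincide, so
  $\Pr(X_e \ge 1) \approx 2pq\,m\,\frac{l(l-1)}{M(M-1)} \approx 2pq\,m\,\frac{l^2}{M^2}$
holds.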
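To see why the approximation is reasonable, the correction term of the inclusion-exclusion lower bound can be written out. This is a sketch under the stated sampling assumptions (labels are assigned independently to each sampled vector); it uses the fact that for $y \ne y'$ and $a \ne b$ the four vectors $ya$, $yb$, $y'a$, $y'b$ are distinct, and the sum runs over unordered pairs $\{y, y'\}$.

```latex
\begin{align*}
\Pr(X_{ey} = 1 \wedge X_{ey'} = 1)
  &= (2pq)^2 \,\frac{l(l-1)(l-2)(l-3)}{M(M-1)(M-2)(M-3)}
  \qquad (y \ne y'),\\
\sum_{\{y, y'\},\ y \ne y'} \Pr(X_{ey} = 1 \wedge X_{ey'} = 1)
  &= \binom{m}{2} (2pq)^2 \,\frac{l(l-1)(l-2)(l-3)}{M(M-1)(M-2)(M-3)}
  \approx \frac{1}{2} \left( 2pq\, m\, \frac{l^2}{M^2} \right)^{2}.
\end{align*}
```

The correction is therefore roughly half the square of the upper bound, so it is negligible whenever the upper bound itself is small.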
Random graph
Random graph $G(N, r)$
- $N$: the number of vertices
- Each edge $e = (u, v)$ appears in $G(N, r)$ with probability $r$ independently
In our analysis, $r$ is assumed to be the probability of an edge to appear in the conflict graph.
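Probability of a random graph to be non-bipartite
$Y_{\mathrm{odd}}$: Random variable representing the number of odd cycles in $G(N, r)$
$\Pr(Y_{\mathrm{odd}} \ge 1)$: Probability that $G(N, r)$ is not bipartite
By Markov’s inequality,
  $\Pr(Y_{\mathrm{odd}} \ge 1) \le \mathrm{Ex}(Y_{\mathrm{odd}}) = \sum_{k = 3,\ k:\text{odd}}^{N} \frac{(N)_k}{2k}\, r^k$
where $(N)_k = N(N-1)\cdots(N-k+1)$ is the number of sequences of $k$ vertices.
Since $(N)_k \le N^k$,
  $\mathrm{Ex}(Y_{\mathrm{odd}}) = \sum_{k = 3,\ k:\text{odd}}^{N} \frac{(N)_k}{2k}\, r^k \le \sum_{k = 3,\ k:\text{odd}}^{N} \frac{(Nr)^k}{2k} \le \sum_{k = 3,\ k:\text{odd}}^{\infty} \frac{z^k}{2k} = \frac{1}{2}\left(\frac{1}{2} \ln \frac{1+z}{1-z} - z\right) = U(z)$
  ($z = Nr$, $0 \le z < 1$; the closed form follows from the Taylor series of $\ln(1 \pm z)$)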
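The closed-form bound U(z) is easy to evaluate numerically. The sketch below assumes the form U(z) = (1/2)((1/2) ln((1+z)/(1-z)) - z) with z = Nr, and compares it with the exact expectation of the number of odd cycles for a small illustrative N; the function names are hypothetical.

```python
import math


def U(z):
    """Upper bound on Ex(Y_odd) for z = N*r with 0 <= z < 1."""
    if not 0 <= z < 1:
        raise ValueError("U(z) is defined for 0 <= z < 1")
    return 0.25 * math.log((1 + z) / (1 - z)) - 0.5 * z


def expected_odd_cycles(N, r):
    """Exact Ex(Y_odd): sum over odd k >= 3 of N(N-1)...(N-k+1) r^k / (2k)."""
    total = 0.0
    # term holds N(N-1)...(N-k+1) * r^k, built up factor by factor
    # so that neither the falling factorial nor r^k overflows or underflows alone.
    term = (N * r) * ((N - 1) * r)  # value for k = 2
    for k in range(3, N + 1):
        term *= (N - k + 1) * r
        if k % 2 == 1:
            total += term / (2 * k)
    return total


print(U(0.9950), U(0.9951))
print(expected_odd_cycles(16, 0.9 / 16), U(0.9))
```

Evaluating U just below and just above 0.9950 shows where the bound crosses 1, which is the source of the constant used in the threshold statement.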
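When does $\mathrm{Ex}(Y_{\mathrm{odd}}) < 1$ hold?
Upper bound:
  $Nr \le 0.9950 \Rightarrow \mathrm{Ex}(Y_{\mathrm{odd}}) < 1$
Lower bound when $Nr \ge 1$:
  $\mathrm{Ex}(Y_{\mathrm{odd}}) = \sum_{k = 3,\ k:\text{odd}}^{N} \frac{(N)_k}{2k}\, r^k \ge \sum_{k = 3,\ k:\text{odd}}^{N^{0.5 - \varepsilon}} \frac{c\,(Nr)^k}{2k} \ge \frac{c}{4} \ln N^{0.5 - \varepsilon} - O(1)$
  ($c \in [0, 1)$ and $\varepsilon \in (0, 0.5)$ are constants)
  $\mathrm{Ex}(Y_{\mathrm{odd}}) \to \infty$ as $N \to \infty$ if $Nr \ge 1$
For sufficiently large $N$, $Nr \ge 1 \Rightarrow \mathrm{Ex}(Y_{\mathrm{odd}}) > 1$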
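Our index
Assumptions
- probabilities $p$ and $q$ are given by $p : q = |T| : |F|$
- conflict graph is a random graph
Notation: $M = 2^n$, $m = 2^{|S_0|}$, $l = |T| + |F|$, $N = 2^{|S_1|}$ and $r = \Pr(X_e \ge 1)$ $(|S_0| + |S_1| = n)$
Probability of an edge to appear in conflict graph:
  $\Pr(X_e \ge 1) \approx 2pq\,m\,\frac{l^2}{M^2}$
Threshold for a random graph to be bipartite or not:
  $Nr \le 1$
Combining the two:
  $Nr = 2^{|S_1|} \Pr(X_e \ge 1) \approx \frac{M}{m} \cdot 2pq\,m\,\frac{l^2}{M^2} = \frac{pq\,l^2}{2^{n-1}} \le 1$
  $\Rightarrow\ l \le \sqrt{2^{n-1}/(pq)}$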
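Our index
Threshold on $|T| + |F|$: $\sqrt{2^{n-1}/(pq)}$, where $p : q = |T| : |F|$ and $p + q = 1$
• If $|T| + |F| < \sqrt{2^{n-1}/(pq)}$, $(T, F)$ tends to have many deceptive decomposable structures.
• If $|T| + |F| > \sqrt{2^{n-1}/(pq)}$, $(T, F)$ tends to have no deceptive decomposable structure.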
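As a worked example, the index sqrt(2^(n-1)/(pq)) can be evaluated for the parameter settings that appear in the experiments. This is a minimal sketch; the function names are hypothetical, and expressing the index as a sampling ratio relative to all 2^n vectors is an assumption about how the plots are scaled.

```python
import math


def decomposability_index(n, p):
    """Threshold on |T| + |F|: sqrt(2^(n-1) / (p*q)) with q = 1 - p."""
    q = 1 - p
    return math.sqrt(2 ** (n - 1) / (p * q))


def index_as_sampling_ratio(n, p):
    """Index as a percentage of all 2^n vectors (assumed plot scaling)."""
    return 100 * decomposability_index(n, p) / 2 ** n


# (n, p) settings taken from the experiment slides; q = 1 - p.
for n, p in [(10, 0.5), (10, 0.9), (20, 0.5), (11, 0.730)]:
    idx = decomposability_index(n, p)
    print(f"n={n}, p={p}: index={idx:.1f}, ratio={index_as_sampling_ratio(n, p):.2f}%")
```

For n = 10 and p = q = 0.5 the index is about 45 vectors, roughly 4.4% of the 1024 possible vectors; a biased p : q raises the index because pq becomes smaller.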
Numerical Experiments
1. Prepare non-decomposable randomly generated functions and construct 10 $(T, F)$ for each data size ($|T| + |F|$)
2. Check their decomposability
Randomly generated data
- Target functions are not decomposable
- Dimensions of data are $n = 10, 20$
- Two types of data: $p$ and $q$ are biased and not biased
[Figure: Randomly generated data, $n = 10$, $(p, q) = (0.5, 0.5)$. x-axis: Sampling ratio (%); y-axis: Ratio of decomposable $(T, F)$s (%); our index is marked.]
[Figure: Randomly generated data, two panels: $n = 10$, $(p, q) = (0.9, 0.1)$ and $n = 20$, $(p, q) = (0.5, 0.5)$. x-axis: Sampling ratio (%); y-axis: Ratio of decomposable $(T, F)$s (%); our index is marked.]
Real-world data
- Breast Cancer in Wisconsin (a.k.a. BCW)
- Already binarized
- The dimension is $n = 11$
- Comparison with randomly generated data with the same $n$, $p$ and $q$
[Figure: BCW and randomly generated data, $n = 11$, $(p, q) = (0.730, 0.270)$, two panels: BCW and Randomly generated data. x-axis: Sampling ratio (%); y-axis: Ratio of decomposable $(T, F)$s (%); our index is marked.]
Discussion and conclusion
Proposed index: threshold on $|T| + |F|$ given by $\sqrt{2^{n-1}/(pq)}$, where $p : q = |T| : |F|$ and $p + q = 1$
An index to extract reliable decomposable structures
Computational experiments on random & real-world data
- proposed index is a good estimate
- $|S_0| = 1$ or $|S_1| = 2$: threshold behavior is not clear
Future work
- Analyses on sharpness of the threshold behavior: to know sufficient $|T| + |F|$ to extract reliable decomposable structures
- Apply a similar approach to other classes of Boolean functions
[Figure: #decomposable structures plotted against $|T| + |F|$, with two marked points labeled “proposed index” and “we want to estimate”.]