
A Novel Framework for Detecting Maximally Banded Matrices in Binary Data

Faris Alqadah¹*, Raj Bhatnagar¹ and Anil Jegga²

¹ Department of Computer Science, University of Cincinnati, Cincinnati, OH 45221, USA

² Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA

Received 3 May 2010; revised 21 July 2010; accepted 28 July 2010
DOI: 10.1002/sam.10089

Published online 7 September 2010 in Wiley Online Library (wileyonlinelibrary.com).

Abstract: Binary data occurs often in real-world applications ranging from social networks to bioinformatics. As such, extracting patterns from binary data has been a fundamental task of data mining. Recently, the utility of banded structures in binary matrices has been pointed out for applications such as paleontology, bioinformatics, and social networking. A binary matrix has a banded structure if both the rows and columns can be permuted so that the 1s exhibit a staircase pattern down the rows, along the leading diagonal. Natural interpretations of banded structures include overlapping communities in social networks, patterns of species occurring in spatially correlated sites, and overlapping roles of genes in various diseases. In this paper, we show the correspondence between formal concept analysis and banded structure; as a direct result of this correspondence a novel framework for discovering banded structures is presented. Utilizing the framework, the MMBS algorithm (mine maximally banded submatrices) is developed. The current state-of-the-art algorithm, MBS, only allows for the discovery of a single band and assumes a fixed-column permutation. On the other hand, MMBS facilitates the discovery of multiple bands that may be overlapping or segmented. Our experimental results, presented here, clearly indicate the advantage of MMBS over MBS with both synthetic and real datasets. © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 431–445, 2010

Keywords: banded matrices; seriation; matrix re-ordering; formal concept analysis; biclustering; coclustering

1. INTRODUCTION

Binary matrices occur frequently in many real-world applications such as market-basket data [1], bioinformatics [2], paleontology, ecology [3], and information retrieval [4]. As a result, extracting patterns and clusters in 0–1 data is an important task and has been an active field of study in data mining, resulting in the development of association rule mining [1], sequence mining, and biclustering [5] algorithms. In this paper, we study the banded structure of binary matrices; a binary matrix is said to be fully banded if both the rows and columns can be permuted such that the 1s exhibit a staircase pattern of overlapping rows along the leading diagonal (Fig. 1).

The idea of banded matrices has its origins in numerical analysis [6]; however, the concept has been studied recently in the data mining community [7,8]. From the data mining perspective, banded structures have myriad applications and natural interpretations. Consider, for example, a binary matrix containing documents as the rows and keywords as the columns, where the set of documents revolves around the single theme of ‘clustering.’ Early documents on clustering, from around the 1960s, may contain terms like ‘k-means,’ while documents in the 1970s may contain both ‘k-means’ and ‘expectation maximization,’ and eventually documents in recent years will contain terms like ‘subspace clustering,’ ‘biclustering,’ and ‘curse of dimensionality.’ While recent documents may contain the terms ‘k-means’ and ‘expectation maximization,’ we do not expect documents from the 1960s or 1970s to contain the terms ‘bicluster’ or ‘subspace cluster.’ Thus, an evolution of concepts can be seen in the documents via the terms that occur in the documents, and this pattern will resemble a band in the data matrix. As another example, consider a dataset containing genes and pathways; many situations exist where pathways A and B may start C; B and C may start D; and C and D may start E. The whole pathway can be initiated by starting with A and B. If this pattern is encoded in a context, it will not look like a maximal rectangle, but more like a parallelogram, or staircase pattern of 1s. Other natural interpretations of a banded structure include overlapping communities in social networks, overlapping roles of genes in various diseases, and patterns of species occurring in spatially correlated sites [3,7,8].

*Correspondence to: Faris Alqadah ([email protected])

       A  B  C  D  E
  1    1  1  1  0  0
  2    0  1  1  0  0
  3    0  0  1  0  0
  4    0  0  1  1  0
  5    0  0  0  1  1

Fig. 1 A fully banded matrix.

This paper illustrates the correspondence between formal concept analysis (FCA) and banded structures in binary matrices. As far as we know, no other work has pointed out and studied the relation between the two seemingly disjoint problems. FCA has served as the implicit theoretical basis of fundamental data mining tasks such as association rules [9], itemset mining [9], and biclustering [10]. FCA specifies concepts (biclusters) of the binary matrices as maximal rectangles of 1s under suitable permutations of the rows and columns. While this model of biclustering does not consider staircase patterns of 1s to be clusters, we show in this paper that FCA theory can be utilized as a starting point to discover such patterns. Employing this correspondence, a novel framework for mining banded matrices in data is proposed. Moreover, the novel MMBS algorithm (mine maximally banded submatrices) is developed in the context of the framework. Our algorithm allows for the discovery of multiple, possibly overlapping or segmented, maximally banded submatrices from the original data matrix. A matrix with segmented bands contains several band-blocks within the same dataset. This differs from the previously proposed MBS [8] algorithm which only allows for the discovery of a single band and fixes the column permutations of the data matrix before executing the algorithm. Thus the main contributions of this paper are:

1. Establishing correspondence between banded structures and FCA.

2. Introducing a novel framework based on the above correspondence for defining and enumerating banded structures.

3. Developing the MMBS algorithm to uncover multiple, possibly overlapping, banded submatrices.

4. Empirical results verifying the advantage of MMBS over previous approaches.

The rest of the paper is organized as follows: Section 2 discusses related work; Section 3 formally defines the banded matrix problem; Section 4 conveys the relationship between FCA and banded structures; and the next two sections present the novel framework and the MMBS algorithm, followed by experimental results on synthetic and real data.

2. RELATED WORK

The properties of banded matrices and how they relate to data analysis were first studied by Garriga et al. [8]. The authors addressed the minimum banded augmentation (MBA) problem and the maximum banded submatrix (MBS) problem. The MBA problem is: given a binary matrix K, find the minimum number of 0s that need to be changed into 1s so that K becomes fully banded. The MBS problem is: given K and an integer n, find the maximum submatrix K′ of K such that K′ is banded after n flips. The authors assume a fixed-column permutation in the proposed solutions for both problems. While this is not a very realistic assumption in many real-world scenarios, heuristic methods are proposed to determine a suitable fixed-column permutation. The solution for the MBS problem is built upon the fixed-column algorithm utilized for MBA, and therefore also assumes fixed-column permutations; moreover, only a single maximum submatrix is produced. Our work does not make any a priori assumptions about the permutations, and moreover discovers multiple banded submatrices in the data. In Ref. [11], the author illustrated that spectral ordering yields the optimal solution when attempting to minimize the number of 0s between the 1s when solving the consecutive 1s problem. This result gives theoretical justification to the spectral ordering utilized in Ref. [8] when fixing the column permutations in order to discover the banded structure. However, once again, we wish to do away with the fixed-column permutation approach.

In Ref. [7], the ecological concept of nestedness in binary data was introduced. A dataset is nested if for all pairs of rows, one row is either a superset or a subset of the other. It was shown in Ref. [8], however, that bandedness is a generalization of this ecological concept. In Refs [12,13], the authors establish a hierarchy between different classes of binary matrices. They consider banded matrices, zero partitionable matrices, and nested matrices. A binary matrix is zero partitionable if its rows and columns can be permuted so that every 0 can be labeled by R or C, with every position to the right of an R being a 0 labeled R and every position below a C being a 0 labeled C. It was shown that every banded matrix is zero partitionable, and the authors characterized how to determine whether a zero partitionable matrix contains a banded structure. In the numerical analysis field [6,14,15], work has focused on minimizing the distance of nonzero entries from the main diagonal of the matrix (bandwidth). This problem differs from the problem we address, as we attempt to discover several submatrices that have an approximate banded structure, as opposed to minimizing the bandwidth of the entire matrix.

Recently, Liiv [16] presented a comprehensive review and historical overview of seriation and matrix re-ordering methods. That review presents a plethora of application fields in which matrix re-ordering has played a constructive role; these include archeology, cartography, psychology, ecology, bioinformatics, information visualization, and operations research. Moreover, the author points out an indirect link between biclustering and matrix re-ordering, while suggesting that we ‘head toward collaboration and consolidation with diclique decomposition and formal concept analysis.’ In this work, we have independently developed a framework of matrix re-ordering and seriation utilizing FCA.

3. PROBLEM DEFINITION

Consider a binary matrix $K$, with row labels in the set $G$ and column labels in the set $M$. In the documents–keywords example given earlier, the elements of $G$ would be documents and those of $M$ would be the keywords. Without loss of generality the elements of $G$ and $M$ can be mapped to consecutive integers $\{1, 2, \ldots, |G|\}$ and $\{1, 2, \ldots, |M|\}$. Moreover, $K$ may be represented by a context (as per the theory of formal concept analysis [17]) $K = (G, M, I)$, where $I$ is a relation such that $K(g, m) = 1$ if $gIm$ and $K(g, m) = 0$ otherwise. We denote the $i$th row of $K$ by $K_i$ and the $j$th column by $K^j$. Given a permutation $\pi$ of $G$ and a permutation $\tau$ of $M$, $K^{\pi}_{\tau}$ is the permutation of rows and columns according to $\pi$ and $\tau$. We will use $g_{\pi_i}$ and $m_{\tau_j}$ to denote the $i$th row and $j$th column, respectively, under permutations $\pi$ and $\tau$.

DEFINITION 1: A binary matrix $K = (G, M, I)$ is fully banded if there exists a permutation $\pi$ of $G$ and a permutation $\tau$ of $M$ such that (1) for every row $i$ in $K^{\pi}_{\tau}$ the entries with 1s occur in consecutive column indices $\{m_i, m_i + 1, \ldots, \hat{m}_i\}$ and (2) the starting and ending indices for 1s in successive rows ($i$ and $i + 1$) satisfy the conditions $m_i \le m_{i+1}$ and $\hat{m}_i \le \hat{m}_{i+1}$. An illustration of this is given in Fig. 1.
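The two conditions of Definition 1 can be checked directly once a candidate row and column order has been fixed. The following is a minimal sketch under that assumption: it tests a matrix exactly as given (identity permutations) and does not search for $\pi$ and $\tau$. The function names and the choice to reject all-zero rows are illustrative assumptions, not part of the original paper.

```python
from typing import List, Optional, Tuple


def row_span(row: List[int]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the 1s in a row if they are consecutive, else None."""
    ones = [j for j, v in enumerate(row) if v == 1]
    if not ones:
        return None  # assumption of this sketch: all-zero rows are rejected
    start, end = ones[0], ones[-1]
    if end - start + 1 != len(ones):
        return None  # a 0 sits between two 1s, so condition (1) fails
    return start, end


def is_fully_banded_as_given(matrix: List[List[int]]) -> bool:
    """Check conditions (1) and (2) of Definition 1 for the given row/column order."""
    spans = [row_span(r) for r in matrix]
    if any(s is None for s in spans):
        return False
    # Condition (2): start and end indices are both non-decreasing down the rows.
    return all(s[0] <= t[0] and s[1] <= t[1] for s, t in zip(spans, spans[1:]))


# The matrix of Fig. 1, rows 1..5 and columns A..E.
FIG1 = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
```

Calling `is_fully_banded_as_given(FIG1)` accepts the ordering shown in Fig. 1, while the same rows in reverse order are rejected because the start indices then decrease.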

Testing a given matrix to find whether it is banded can be accomplished in polynomial time [8]. In real datasets, a fully banded structure may not exist due to noise or irrelevant dimensions; however, we are still interested in discovering (i) approximately banded matrices and (ii) maximally banded submatrices of the dataset. A maximally banded submatrix of a matrix, intuitively, is one in which no more rows from the original matrix can be added while still preserving the bandedness of the selected submatrix. The quality of a banded structure can be measured in terms of noise, where noise is the minimum number of 0s or 1s that must be flipped in order to achieve a fully banded matrix. Given an approximately banded matrix $K^{\pi}_{\tau}$, let $e(K^{\pi}_{\tau})$ denote the noise or error of the banded structure in $K^{\pi}_{\tau}$. We say that $K^{\pi}_{\tau}$ is $\varepsilon$-banded if $e(K^{\pi}_{\tau}) \le \varepsilon$.

PROBLEM 1: Given binary matrix $K$ and noise threshold $\varepsilon$, find all submatrices $K'$ of $K$ that are $\varepsilon$-banded and maximal.

Unveiling all maximal ε-banded submatrices is useful in datasets that contain banded structures in certain subsets of dimensions along with possibly independent or segmented bands; segmented bands are the situations in which cells around multiple, nonprincipal diagonals are populated by 1s. Moreover, relaxing the requirements from full bandedness to ε-bandedness allows for clear band structures to be identified despite the presence of noise. In Ref. [11], the consecutive 1s problem was shown to be NP-hard, and hence we expect Problem 1 to be hard as the problem can be reduced to the consecutive 1s problem [8].

4. BANDEDNESS AND FCA

Biclustering, itemset mining, and association rules [9,10,17–21] have all utilized FCA as an implicit theoretical basis. Specifically, under the FCA model, biclusters are viewed as maximal rectangles of 1s in the matrix under a suitable permutation of the rows and columns. Such maximal rectangles encode tight correlations between the sets of rows and columns that they encompass; however, banded patterns clearly cannot be detected under this model, and thus these types of patterns may represent a limitation of this traditional biclustering scheme. On the other hand, we show in this section that FCA theory can in fact be utilized as a building block to discover maximally ε-banded submatrices.

4.1. Formal Concept Analysis

The row and column vectors of $K$ can be interpreted as sets of indices, that is, a row $K_i$ is a set of column indices that appear in the row. Given this interpretation, FCA [17] defines operators over sets of row and column vectors. Given $K = (G, M, I)$ and $A \subseteq G$, we define $A' = \{m \in M \mid gIm \text{ for all } g \in A\}$, the columns of $M$ that are related to all the rows of $A$. Also, for $B \subseteq M$, we have $B' = \{g \in G \mid gIm \text{ for all } m \in B\}$. A formal concept or bicluster of $K$ is then a pair $(A, B)$ such that $A' = B$ and $B' = A$. In FCA, elements of $G$ and $M$ are referred to as objects and attributes, respectively. Noticeably, the concepts of $K$ can be represented by maximal rectangles of 1s under suitable permutations of the rows and columns of $K$ (Fig. 2). Given concepts $(A_1, B_1)$ and $(A_2, B_2)$, we may order them as $(A_1, B_1) \le (A_2, B_2)$ provided that $A_1 \subseteq A_2$ (equivalently $B_2 \subseteq B_1$). Bicluster $C_1$ is an upper neighbor of $C_2$ if $C_2 \le C_1$, $C_1 \ne C_2$, and there does not exist a concept $C_3$, distinct from $C_1$ and $C_2$, such that $C_2 \le C_3 \le C_1$. We denote this by $C_2 \prec C_1$. The set of all concepts ordered in this way is denoted by $\mathfrak{B}(G, M, I)$ and is called the concept lattice of $K$; it contains all the concepts of $K$ arranged in a lattice structure. The basic theorem of FCA states that the bicluster lattice $\mathfrak{B}(G, M, I)$ is a complete lattice.

       A  B  C  D  E
  1    1  1  1  0  0
  2    0  1  1  0  0
  3    0  0  1  0  0
  4    0  0  1  1  0
  5    0  0  0  1  1

(a) Concepts as maximal rectangles

  Attributes       Objects
  A,B,C,D,E        (none)
  A,B,C            1
  B,C              1,2
  C                1,2,3,4
  C,D              4
  D                4,5
  D,E              5
  (none)           1,2,3,4,5

(b) Concept lattice of sample context (nodes only; the edges of the lattice diagram are not reproduced)

Fig. 2 Concepts and maximal rectangles.

Given any concept $C = (A, B)$ of a context $K$, the set of rows (columns) specified by $A$ ($B$) clearly satisfies both conditions of a banded matrix (every element of a bicluster is a 1). Not only do concepts constitute banded submatrices, but by considering sequences of concepts ordered by the hierarchical order, larger banded structures unveil themselves; thus any concept $C$ can be utilized as an initial building block to detect maximally banded submatrices. Intuitively, any fully banded matrix can be splintered exactly into maximal rectangles of 1s, as shown in Fig. 2. Each of these rectangles corresponds exactly to a concept. Thus, a fully banded submatrix may be constructed by combining a suitably selected sequence of concepts $C_1, \ldots, C_n$. Formally, given a fully banded matrix $K^{\pi}_{\tau}$, for any row $g \in K^{\pi}_{\tau}$, let $\Phi(g)$ be a mapping from $g$ to the set of concepts that contain $g$ as a row, other than the null-concept element:

$$\Phi(g) = \{(A, B) \mid \{g\} \subseteq A \wedge B \neq \emptyset \wedge A' = B \wedge B' = A\}.$$

The objects and attributes of any concept $C \in \Phi(g)$ can always be ordered according to $\pi$ and $\tau$ due to the fact that a concept only contains 1s. Let $F(K^{\pi}_{\tau})$ be the union of all $\Phi(g)$ for any $g \in G$, that is,

$$F(K^{\pi}_{\tau}) = \bigcup_{g \in G} \Phi(g).$$

Then $F(K^{\pi}_{\tau})$ can be ordered to make an $n$-tuple of concepts $\{C_1, \ldots, C_n\}$ having a total ordering $\{<_{\pi_1,\tau_1}, \ldots, <_{\pi_n,\tau_n}\}$ as determined from the lattice structure. Thus we may define a lexicographical order $<_{\pi,\tau}$ on $C_1 \times C_2 \times \cdots \times C_n$. Therefore, considering the concepts in $F(K^{\pi}_{\tau})$, in order, we may completely specify the permutations $\pi$ and $\tau$; hence $F(K^{\pi}_{\tau})$ constitutes a sequence of concepts that completely determines the banded structure of $K$.
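To make the derivation operators and the union of concepts over all rows concrete, the sketch below computes them for the context of Fig. 1. It is a brute-force illustration that closes every subset of objects, so it is only suitable for tiny contexts; the names `up`, `down`, `all_concepts`, `phi`, and `union_of_phi` are assumptions introduced here and do not come from the paper.

```python
from itertools import combinations

# Context of Fig. 1: each object (row) maps to its set of attributes (columns).
CONTEXT = {
    1: {"A", "B", "C"},
    2: {"B", "C"},
    3: {"C"},
    4: {"C", "D"},
    5: {"D", "E"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def up(objects):
    """A' : the attributes shared by every object in `objects`."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return frozenset(result)


def down(attrs):
    """B' : the objects that have every attribute in `attrs`."""
    return frozenset(g for g, row in CONTEXT.items() if set(attrs) <= row)


def all_concepts():
    """Brute force: close every subset of objects and collect the (A, B) pairs."""
    found = set()
    objs = list(CONTEXT)
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            intent = up(subset)
            found.add((down(intent), intent))
    return found


def phi(g):
    """Concepts whose object set contains g and whose attribute set is non-empty."""
    return {(A, B) for (A, B) in all_concepts() if g in A and B}


def union_of_phi():
    """Union of phi(g) over all objects g."""
    out = set()
    for g in CONTEXT:
        out |= phi(g)
    return out
```

For this context `all_concepts()` yields the eight nodes of the lattice in Fig. 2, and `union_of_phi()` yields the six concepts that remain once the two null-element concepts are excluded.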

PROPOSITION 1: Given a context $K$, if permutations $\pi$ and $\tau$ exist such that $K^{\pi}_{\tau}$ is fully banded then there exists a sequence of biclusters $C_1 = (A_1, B_1), \ldots, C_n = (A_n, B_n)$ s.t.

$$\pi = \{A_1, A_2 \setminus A_1, \ldots, A_n \setminus A_{n-1}\},$$

$$\tau = \{B_1 \setminus B_2, \ldots, B_{n-1} \setminus B_n, B_n\},$$

where $A_2 \setminus A_1$ is the set difference.

The proof of Proposition 1 is straightforward: $C_1, \ldots, C_n$ can always be constructed by considering the concepts of $F(K^{\pi}_{\tau})$ in order, while set differences are taken to eliminate duplicate rows and columns. Proposition 1 establishes a clear correspondence between fully banded matrices and FCA; specifically, if a matrix has a fully banded structure then there always exists a sequence of biclusters that stipulates its $\pi$ and $\tau$.

EXAMPLE 1: Consider the sample context in Fig. 1. The permutations are $\pi = \{1, 2, 3, 4, 5\}$ and $\tau = \{A, B, C, D, E\}$; therefore, lexicographically, $1 < 2 < 3 < 4 < 5$ and $A < B < C < D < E$. The table below illustrates $F(K^{\pi}_{\tau})$ and the resulting lexicographical ordering.

  $g$                     $\Phi(g)$
  1                       $\{(1, ABC), (12, BC), (1234, C)\}$
  2                       $\{(12, BC), (1234, C)\}$
  3                       $\{(1234, C)\}$
  4                       $\{(4, CD), (45, D)\}$
  5                       $\{(5, DE), (45, D)\}$
  $F(K^{\pi}_{\tau})$     $\{(1, ABC) < (12, BC) < (1234, C) < (4, CD) < (45, D) < (5, DE)\}$

$\pi$ and $\tau$ can be constructed from $F(K^{\pi}_{\tau})$ as

$$\pi = \{1, 12 \setminus 1, \ldots, 5 \setminus 45\} = \{1, 2, 3, 4, 5\},$$

$$\tau = \{ABC \setminus BC, \ldots, D \setminus DE, DE\} = \{A, B, C, D, E\}.$$
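The construction of $\pi$ and $\tau$ in Proposition 1 can be replayed mechanically on the ordered sequence from Example 1. The sketch below is one possible reading of the set-difference formulas; the function name is hypothetical, and breaking ties inside a block by sorting is an arbitrary choice made only for this illustration.

```python
def permutations_from_sequence(concepts):
    """Build the row order pi and column order tau from an ordered concept sequence.

    pi  = A1, A2 - A1, ..., An - A(n-1)
    tau = B1 - B2, ..., B(n-1) - Bn, Bn
    Elements that were already placed are skipped, which removes duplicates.
    """
    pi, tau = [], []
    prev_rows = set()
    for rows, _ in concepts:
        block = set(rows) - prev_rows
        pi.extend(sorted(r for r in block if r not in pi))
        prev_rows = set(rows)
    for (_, cols), (_, next_cols) in zip(concepts, concepts[1:]):
        block = set(cols) - set(next_cols)
        tau.extend(sorted(c for c in block if c not in tau))
    # The attribute set of the last concept closes the column order.
    tau.extend(sorted(c for c in concepts[-1][1] if c not in tau))
    return pi, tau


# Ordered sequence of concepts from Example 1, written as (objects, attributes).
SEQUENCE = [
    ({1}, {"A", "B", "C"}),
    ({1, 2}, {"B", "C"}),
    ({1, 2, 3, 4}, {"C"}),
    ({4}, {"C", "D"}),
    ({4, 5}, {"D"}),
    ({5}, {"D", "E"}),
]
```

Applied to `SEQUENCE`, the function returns the row order 1, 2, 3, 4, 5 and the column order A, B, C, D, E, matching the permutations stated in the example.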

In the next section, we show that a banded structure can be grown as a path of concepts in the concept lattice. In addition, we derive an expression for the upper bound of the error when constructing such banded submatrices.


4.2. Banded Submatrices and Paths in the Lattice

The concept lattice may be viewed as an undirected graph $\mathcal{G} = (V, E)$. The set of concepts corresponds to the set of vertices $V$, and the set of edges $E$ consists of edges that connect pairs of concepts, upper neighbors to lower neighbors. For a pair of neighbor concepts we can say: $(C_1, C_2) \in E \leftrightarrow C_1 \prec C_2 \vee C_2 \prec C_1$. Let $P = C_1, C_2, \ldots, C_n$ be any path in the lattice $\mathcal{G}$; then by definition, for every edge $(C_i, C_{i+1}) \in P$, $A_i \subseteq A_{i+1}$ and $B_{i+1} \subseteq B_i$ if $C_i \prec C_{i+1}$ (dually $A_{i+1} \subseteq A_i$ and $B_i \subseteq B_{i+1}$ if $C_i \succ C_{i+1}$). Due to the duality of upper and lower neighbors, we restrict the discussion to the case of upper neighbors only without any loss of generality. A banded submatrix can be constructed from any edge of the lattice by initiating the matrix to be exactly $C_i$; by definition every position in this matrix is filled with a 1. Next we place the rows of $A_{i+1} \setminus A_i$ beneath the rows of $A_i$ and shift the columns, where $A_i$ and $A_{i+1} \setminus A_i$ differ, to the left. By the definitions of the hierarchical order and concepts, the 0s are located precisely at rows $A_{i+1} \setminus A_i$ and columns $B_i \setminus B_{i+1}$. Consequently, any edge in $\mathcal{G}$ can be converted into a fully banded submatrix.

For example, consider the edge $((1, ABC), (12, BC))$ (Fig. 3). The two concepts differ by column $\{A\}$ and row $\{2\}$; thus $\{A\}$ is placed at the leftmost position and $\{2\}$ at the bottom-most position. At this point the positions of rows $\{1\}$, $\{2\}$ and column $\{A\}$ are fixed and cannot be altered, while the positions of columns $\{B\}$ and $\{C\}$ are interchangeable. Thus we obtain the banded submatrix with permutations $\pi = \{1, 2\}$ and $\tau = \{A, C, B\}$ or $\tau = \{A, B, C\}$. Next, the path is further expanded by adding the edge $((12, BC), (1234, C))$, resulting in shifting column $\{C\}$ to the right and rows $\{3\}$ and $\{4\}$ to the bottom. This fixes the positions of all the columns but leaves the positions of rows $\{3\}$ and $\{4\}$ interchangeable.

In general, the procedure for augmenting a path $P$ with an edge $(C_n, C_{n+1})$ while maintaining the banded structure is described by the AddToPath procedure displayed below. Notice that the symmetric difference set operator is used, as opposed to regular set difference, due to the fact that upper or lower neighbors may be added to the path. The procedure determines if the new concept $C_{n+1}$ is an upper or lower neighbor of $C_n$ and if the new rows or columns of $C_{n+1}$ are not already contained in $P$ (lines 4 and 10). If this is the case then the new rows are added to the bottom of the band (line 5), and the differing columns are moved to the leftmost positions within the interchangeable block that they occur in (lines 6–9). This can be easily accomplished if every path maintains integers $x$ and $y$ indicating the position of fixed blocks of columns or rows. Lines 10–15 implement the exact same procedure; however, the rows and columns are reversed because a lower neighbor is added to the path as opposed to an upper neighbor.

Fig. 3 Constructing banded matrices from paths in $\mathcal{G}$.

The AddToPath procedure illustrates how a path in $\mathcal{G}$ can be mapped to a submatrix of $K$ that is at least partially banded; hence we refer to a submatrix $K^{\pi}_{\tau}$ and a path $P$ interchangeably. Due to the fact that each individual edge in $P$ is guaranteed to produce a banded structure, the only possible errors when adding a concept $C_{n+1} \succ C_n$ to $P$ occur if a newly added row $a \in A_{n+1} \triangle A_n$ contains a column index $b$ such that $\mathrm{pos}(b) < x$, where $x$ indicates the position of the fixed block of columns.
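The following is a simplified sketch in the spirit of the AddToPath description, not a reproduction of the authors' listing. It assumes that `x` and `y` count the leading columns and rows that no longer belong to the concept at the end of the path, that the caller only passes genuine lattice neighbors (this is not verified), and that the error added by a step is the count of 1s that fall in the fixed block. The class name `BandPath` and its attributes are hypothetical.

```python
class BandPath:
    """Row and column orders for a path of concepts, with fixed-block sizes x and y."""

    def __init__(self, context, start):
        rows, cols = start
        self.context = context      # object -> set of attributes
        self.rows = sorted(rows)    # current row order
        self.cols = sorted(cols)    # current column order
        self.x = 0                  # assumed meaning: number of fixed leading columns
        self.y = 0                  # assumed meaning: number of fixed leading rows
        self.last = (set(rows), set(cols))
        self.error = 0

    def add(self, concept):
        """Append a neighboring concept and return the error the step adds."""
        new_rows, new_cols = set(concept[0]), set(concept[1])
        old_rows, old_cols = self.last
        if old_rows < new_rows:     # upper neighbor: the object set grows
            fixed_cols = set(self.cols[:self.x])
            fresh = sorted(new_rows - set(self.rows))
            added = sum(len(self.context[a] & fixed_cols) for a in fresh)
            self.rows.extend(fresh)  # new rows go to the bottom of the band
            self.x = self._fix(self.cols, self.x, old_cols - new_cols)
        elif old_cols < new_cols:   # lower neighbor: the attribute set grows
            fixed_rows = self.rows[:self.y]
            fresh = sorted(new_cols - set(self.cols))
            added = sum(1 for b in fresh for g in fixed_rows if b in self.context[g])
            self.cols.extend(fresh)  # new columns go to the right of the band
            self.y = self._fix(self.rows, self.y, old_rows - new_rows)
        else:
            raise ValueError("concept is not comparable with the end of the path")
        self.last = (new_rows, new_cols)
        self.error += added
        return added

    @staticmethod
    def _fix(order, k, differing):
        """Move `differing` items to the front of the free block order[k:], keeping order."""
        free = order[k:]
        moved = [item for item in free if item in differing]
        order[k:] = moved + [item for item in free if item not in differing]
        return k + len(moved)
```

Feeding the path (1, ABC), (12, BC), (1234, C), (4, CD), (45, D), (5, DE) of the Fig. 1 context through `add` leaves the rows in the order 1 to 5 and the columns in the order A to E with an accumulated error of 0. If row 5 additionally contained a 1 in column A, the step that adds (45, D) would report one error, because A already lies in the fixed block of columns.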

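PROPOSITION 2: Let $P^{n-1} = C_1, \ldots, C_n$ be a path, not containing a null-element concept, and constructed by repeated calls to procedure AddToPath with associated permutations $\pi$, $\tau$ and integers $x$ and $y$. Given concept $C_{n+1}$ s.t. $C_{n+1}$ is augmented to $P$ via AddToPath, then

$$e(P^{n}) \le \begin{cases} 0 & \text{if } n \le 1 \\ e(P^{n-1}) + \sum_{a \in \mathcal{A}} |a' \cap \hat{B}| & \text{if } C_{n+1} \succ C_n \\ e(P^{n-1}) + \sum_{b \in \mathcal{B}} |b' \cap \hat{A}| & \text{if } C_{n+1} \prec C_n, \end{cases}$$

where $e$ is the error measure, $\mathcal{A} = A_{n+1} \triangle A_n$, $\mathcal{B} = B_{n+1} \triangle B_n$, $\hat{A} = \bigcup_{j=1}^{y} a_{\pi_j}$ and $\hat{B} = \bigcup_{j=1}^{x} b_{\tau_j}$.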


PROOF 1: We prove this by induction on $n$.

Base case, $n \le 1$: Only one edge $(C_1, C_2)$ exists in the path, and without loss of generality we assume $C_1 \prec C_2$. In this case

$$\pi = \{A_1, A_2 \triangle A_1\} = \{a_{\pi_1}, \ldots, a_{\pi_{|A_1|}}, a_{\pi_{|A_1|+1}}, \ldots, a_{\pi_{|A_1|+|A_2 \triangle A_1|}}\}$$

and

$$\tau = \{B_1 \triangle B_2, B_2\} = \{b_{\tau_1}, \ldots, b_{\tau_{|B_1 \setminus B_2|}}, b_{\tau_{|B_1 \setminus B_2|+1}}, \ldots, b_{\tau_{|B_1 \setminus B_2|+|B_2|}}\}.$$

By definition of a concept, rows $a_{\pi_1}, \ldots, a_{\pi_{|A_1|}}$ contain 1s in all the columns $b_{\tau_1}, \ldots, b_{\tau_{|B_1|}}$. By definition of the hierarchical order and set difference, the remaining rows $a_{\pi_{|A_1|+1}}, \ldots, a_{\pi_{|A_1|+|A_2 \triangle A_1|}}$ contain zeros at precisely $b_{\tau_1}, \ldots, b_{\tau_{|B_1 \setminus B_2|}}$ and ones at $b_{\tau_{|B_1 \setminus B_2|+1}}, \ldots, b_{\tau_{|B_1 \setminus B_2|+|B_2|}}$. Thus by definition of banded matrices $P$ is fully banded and $e(P^{1}) = 0$.

Inductive step: Assume the hypothesis is true for paths up to length $n$. After adding concept $C_{n+1}$ the procedure only rearranges column indices that are nonfixed and thus interchangeable (lines 7–9), therefore not affecting the error. Moreover, every newly added row $a \in A_{n+1} \triangle A_n$ contains 1s in exactly columns $B_{n+1}$, which is a proper subset of $B_n$ by the definition of the hierarchical order. Thus $B_{n+1} \triangle B_n = B_n \setminus B_{n+1}$ and all columns $b_{\tau_{x+1}}, \ldots, b_{\tau_{|\tau|}}$ contain 1s which cause no errors. Hence by definition of banded matrices, only 1s occurring in a newly added row $a \in A_{n+1} \setminus A_n$ and a column $b_{\tau_1}, \ldots, b_{\tau_x}$ can cause additional error as a result of adding $C_{n+1}$. Writing $\mathcal{A} = A_{n+1} \triangle A_n$ for the new rows and $\hat{B} = \{b_{\tau_1}, \ldots, b_{\tau_x}\}$ for the fixed columns, these 1s can be precisely counted as $\sum_{a \in \mathcal{A}} |a' \cap \hat{B}|$, so by the inductive hypothesis

$$e(P^{n+1}) \le e(P^{n}) + \sum_{a \in \mathcal{A}} |a' \cap \hat{B}|. \tag{1}$$

5. FCA FRAMEWORK FOR DETERMINING BANDED STRUCTURES

Aggregating Propositions 1 and 2 and the AddToPath procedure, we can fully specify a framework for determining banded structures in binary matrices. Proposition 1 stipulates that any banded submatrix of $K$ can be enumerated as a sequence of concepts of $K$. Moreover, invoking AddToPath repeatedly provides a mechanism to construct a band from a path in the concept lattice, and Proposition 2 is utilized to compute an upper bound on the error. Consider the graph $\mathcal{G}$ of a concept lattice $\mathfrak{B}(G, M, I)$; then any edge $(C_i, C_j)$ can be weighted by the amount of error that would be introduced if $C_j$ is added to the path $P = C_1, \ldots, C_i$. Thus, identifying maximally $\varepsilon$-banded submatrices is equivalent to identifying all maximal paths in $\mathcal{G}$ with total weight less than $\varepsilon$. This problem differs from the all pairs shortest path problem due to the fact that the edge weights are clearly nonconstant; to the contrary, the edge weights are a function of all previously added edges along the current path $P$. The framework consists of two basic steps: (i) identify initial concepts $C$ as starting points and (ii) navigate the concept lattice starting at $C$ to construct a path $P = C, C_1, \ldots, C_n$ s.t. $e(P) \le \varepsilon$ via repeated calls to the AddToPath procedure.
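Step (ii) amounts to a path search in which the cost of an edge depends on the whole path built so far. The sketch below illustrates that idea with a generic depth-first enumeration; it is not the MMBS algorithm. The adjacency dictionary, the `step_cost` callback, and the use of the bound `cost <= epsilon` are assumptions made for the illustration.

```python
def maximal_paths(neighbors, start, epsilon, step_cost):
    """Enumerate maximal simple paths from `start` whose total cost stays <= epsilon.

    neighbors : dict mapping a node to an iterable of adjacent nodes
    step_cost : callable (path, candidate) -> cost of appending `candidate`,
                which may depend on every node already on the path
    """
    results = []

    def extend(path, cost):
        grew = False
        for nxt in neighbors.get(path[-1], ()):
            if nxt in path:
                continue  # keep paths simple
            extra = step_cost(path, nxt)
            if cost + extra <= epsilon:
                grew = True
                extend(path + [nxt], cost + extra)
        if not grew:
            # No neighbor fits within the budget, so the path is maximal.
            results.append((path, cost))

    extend([start], 0)
    return results
```

In the setting of the framework, the nodes would be concepts, `neighbors` the upper and lower neighbor relation of the lattice, and `step_cost` the error increment from Proposition 2. The search is exponential in the worst case, which is one reason a dedicated scheme for choosing starting concepts and neighbors is needed.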

The framework provides a high-level road map for identifying banded submatrices. On the other hand, specific algorithmic schemes must be developed for determining the initial candidate concepts and searching the concept lattice for suitable neighboring concepts to augment a current path. We present one such scheme in the form of the MMBS algorithm in the following section. First, however, we further investigate the attributes of the FCA-based framework.


5.1. Segmented Bands and Framework Bias

Proposition 2 stipulates the error bound only for paths that do not contain null-element concepts. This stipulation implicitly biases the FCA framework toward mining banded submatrices in which the error is dominated by flips of 1s to 0s as opposed to 0s to 1s. However, this stipulation also explicitly allows the framework to unveil segmented bands, which we consider to be an advantageous attribute.

       A  B  C  D  E  F  G  H
  1    1  1  1  0  0  0  0  0
  2    0  1  1  0  0  0  0  0
  3    0  0  1  0  0  0  0  0
  4    0  0  1  1  0  0  0  0
  5    0  0  0  1  1  0  0  0
  6    0  0  0  0  0  1  1  0
  7    0  0  0  0  0  0  1  1
  8    0  0  0  0  0  0  0  1

(a) Segmented bands in the data set

  Attributes           Objects
  A,B,C,D,E,F,G,H      (none)
  A,B,C                1
  B,C                  1,2
  C                    1,2,3,4
  C,D                  4
  D                    4,5
  D,E                  5
  F,G                  6
  G                    6,7
  G,H                  7
  H                    7,8
  (none)               1,2,3,4,5,6,7,8

(b) Segmented bands in the concept lattice (nodes only; the edges and colors of the lattice diagram are not reproduced)

Fig. 4 Segmented bands.

Let us illustrate with an example. Consider the dataset and its respective concept lattice illustrated in Fig. 4. Two segmented bands (yellow and red) can be clearly seen in both the matrix and the concept lattice. The only paths that connect concepts in the yellow band to concepts in the red band must pass through the null-element concepts. Due to this fact, utilizing the FCA framework, two separate segmented bands will be uncovered. On the other hand, another possible interpretation of the banded structure of the data is that only a single band is present, and this can be produced by simply flipping the 0 located in the (6, E) cell. In the context of the FCA framework, this can only be achieved by allowing the null-concepts to be added to the paths. Unfortunately, in this case, Eq. (1) fails to identify the error. Hence, if we allow null-concepts to act as a bridge connecting disjoint paths without actually applying the AddToPath procedure (as it would not be defined) then a slightly more general version of Eq. (1) is required:

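$$
e(P_n) \le
\begin{cases}
0 & \text{if } n \le 1 \\
e(P_{n-1}) + \sum_{a \in A} |a' \cap B| + H(|B_{n+1} \setminus B|) & \text{if } C_{n+1} \succ C_n \\
e(P_{n-1}) + \sum_{b \in B} |b' \cap A| + H(|A_{n+1} \setminus A|) & \text{if } C_{n+1} \prec C_n,
\end{cases}
\tag{2}
$$

where

$$
H(x) =
\begin{cases}
0 & \text{if } x = 0 \\
1 & \text{if } x \ge 1,
\end{cases}
$$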

and A_{n+1} and B_{n+1} denote the object and attribute sets, respectively, of the concept C_{n+1} that is being added to the path. Typically, |B_{n+1} \ B| = 0, because for any given concept C_n = (A_n, B_n) and any of its upper neighbors C_{n+1} = (A_{n+1}, B_{n+1}) we have B_{n+1} ⊂ B_n. However, when paths are allowed to traverse the null-element, there may exist b ∈ B_{n+1} with b ∉ B, which constitutes an error. Once again consider Fig. 4. Traversing the path of the yellow band, we start with ({1}, {A, B, C}) and end at ({5}, {D, E}). Following this, the bottom null-element concept is traversed without applying AddToPath, followed by the concept ({6}, {F, G}). At this point, since AddToPath was not invoked, we have B = {A, B, C, D, E}, yet B_{n+1} = {F, G}. Hence |B_{n+1} \ B| = 2, and an additional error term is added to Eq. (2), representing a flip of the zero at cell (6, E).
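The extra term in the generalized bound can be checked by hand on the Fig. 4 example. The following is a minimal Python sketch of that single term only; the function names `H` and `bridge_penalty` are illustrative and do not appear in the original formulation.

```python
def H(x):
    """Step function from Eq. (2): 0 when x == 0, 1 when x >= 1."""
    if x < 0:
        raise ValueError("H is only defined for non-negative counts")
    return 0 if x == 0 else 1


def bridge_penalty(path_attrs, new_attrs):
    """Extra term H(|B_{n+1} minus B|): charged once if the concept being
    added brings attributes that are not yet on the path."""
    return H(len(set(new_attrs) - set(path_attrs)))


# Fig. 4: the first band has accumulated B = {A, B, C, D, E}; after the
# null-element bridge the next concept is ({6}, {F, G}).
B = {"A", "B", "C", "D", "E"}
B_next = {"F", "G"}
print(len(B_next - B))            # 2
print(bridge_penalty(B, B_next))  # 1
```

Although two attributes are new, the step function charges a single unit, which matches the single flip at cell (6, E) discussed above.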


6. MMBS ALGORITHM

In this section, we present the MMBS algorithm for mining maximally banded submatrices within the FCA framework. The algorithm consists of three major steps: (i) computing B(G, M, I) and G, (ii) searching the paths of B(G, M, I), and (iii) determining the top banded submatrices to output to the user.

6.1. Compute B(G, M, I)

Computing the concept lattice of a context has been widely studied, and numerous algorithms exist to accomplish this task [17,20,22–24]. These algorithms can be either incremental or nonincremental. The incremental algorithms [22,23,25] compute the lattice one concept at a time by determining the upper and lower neighbors of any given concept. Thus, with these algorithms the task of computing B(G, M, I) can be embedded directly into the mining process. On the other hand, nonincremental algorithms such as those described in Refs. [23,24] do not attain the full and correct lattice structure until the algorithm terminates, which implies that they cannot be directly embedded into MMBS; however, the computation time of the CHARM-L algorithm [24] is significantly lower than that of all other approaches. Thus, in our implementation of MMBS, we utilized CHARM-L to compute B(G, M, I). Note, however, that any of the algorithms mentioned above will suffice, and in all subsequent descriptions of MMBS we assume that B(G, M, I), and therefore G, is readily available.
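For a context as small as the one in Fig. 4, the concepts can be enumerated by brute force, which is a convenient sanity check on the output of a dedicated lattice algorithm. The sketch below is not CHARM-L or any of the cited algorithms: it simply closes every subset of rows, which is exponential and only viable for toy inputs. All names (`CONTEXT`, `common_cols`, `rows_with`, `concepts`) are hypothetical.

```python
from itertools import combinations

# Context of Fig. 4(a): row -> set of columns holding a 1.
CONTEXT = {
    1: {"A", "B", "C"},
    2: {"B", "C"},
    3: {"C"},
    4: {"C", "D"},
    5: {"D", "E"},
    6: {"F", "G"},
    7: {"G", "H"},
    8: {"H"},
}
ALL_COLS = set().union(*CONTEXT.values())


def common_cols(rows):
    """Columns shared by every row in `rows` (all columns if `rows` is empty)."""
    cols = set(ALL_COLS)
    for r in rows:
        cols &= CONTEXT[r]
    return cols


def rows_with(cols):
    """Rows that contain every column in `cols`."""
    return {r for r, cs in CONTEXT.items() if cols <= cs}


def concepts():
    """Enumerate all formal concepts (rows, cols) by closing each row subset."""
    found = set()
    rows = sorted(CONTEXT)
    for k in range(len(rows) + 1):
        for subset in combinations(rows, k):
            cols = common_cols(subset)
            found.add((frozenset(rows_with(cols)), frozenset(cols)))
    return found


lattice = concepts()
print(len(lattice))  # 12
```

The twelve concepts are the ten labeled nodes of Fig. 4(b) plus the top and bottom null-element concepts.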

6.2. Search Space

MMBS conducts a depth-first search with backtracking in order to identify maximal ε-banded paths. In the worst case the search space is exponential in the size of the lattice, due to the number of potential paths. Fortunately, the error associated with a path in the lattice grows monotonically, allowing the search to prune entire branches of the space whenever an edge that results in error greater than ε is encountered. Despite such pruning, searching all paths remains intractable. We therefore adopt two heuristics to make the problem approachable. First, let U be the set of upper neighbors of the bottom element in B(G, M, I); then each path is rooted at a concept C ∈ U. In essence, this imposes a slight restriction on the possible row orderings, since every band will begin with the rows contained in U. However, it is precisely these rows that contain the largest sets of column indices, therefore allowing the greatest freedom to swap these indices as more concepts are added to the path. Next, we set each vertex to remember the minimum weight edge encountered throughout the search. Due to the monotonicity of the error measure, nodes that have been previously visited are added to a new path if and only if the newly computed error is less than or equal to the minimum weight edge encountered previously in the search.
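The depth-first search with monotone pruning and the per-vertex memory described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' C++ implementation: the error is supplied as a callback `path_error` (standing in for the bound computed alongside AddToPath), `neighbors` is an assumed adjacency interface, and a vertex is assumed never to appear twice on one path. The toy graph and its additive error are invented purely to exercise the code.

```python
def search(start, neighbors, path_error, eps):
    """Collect maximal paths from `start` whose error stays within `eps`.

    neighbors(c)     -> iterable of vertices adjacent to c (assumed interface)
    path_error(path) -> error of the whole path, assumed monotone: extending
                        a path never lowers its error
    """
    best_seen = {}   # vertex -> smallest error with which it has been reached
    maximal = []

    def extend(path):
        expanded = False
        for c in neighbors(path[-1]):
            if c in path:                       # assumption: no vertex twice
                continue
            err = path_error(path + [c])
            if err > eps:                       # monotone error: prune branch
                continue
            if err > best_seen.get(c, float("inf")):
                continue                        # c was reached more cheaply before
            best_seen[c] = err
            expanded = True
            extend(path + [c])                  # new list, so `path` survives: backtracking
        if not expanded:
            maximal.append((path, path_error(path)))

    extend([start])
    return maximal


# Toy DAG with an additive stand-in error (hypothetical data).
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
COST = {"a": 0, "b": 1, "c": 3, "d": 1}


def toy_error(path):
    return sum(COST[v] for v in path)


print(search("a", GRAPH.__getitem__, toy_error, eps=2))
# [(['a', 'b', 'd'], 2)]
print(search("a", GRAPH.__getitem__, toy_error, eps=4))
# [(['a', 'b', 'd'], 2), (['a', 'c'], 3)]
```

In the second call, vertex d is not appended to the path through c because it was already reached with a smaller error through b; this is the effect of the second heuristic.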

6.3. Determining Top Bands

Identifying all maximal ε-banded paths in the lattice is not very useful to the user, as there may exist an explosive number of such paths. In order to determine the most interesting banded submatrices, we allow several parameters to be set in addition to ε. First, the minRows and minCols parameters determine the minimum number of rows and columns any band should contain, and the parameter w specifies the relative weight that should be assigned to the error in a banded structure. Next, maxOvlp specifies the maximum overlap allowed between any two bands. Computing the overlap between any two bands consists of computing the Jaccard coefficient of both the rows and the columns and weighting each. Given a path P = C1, . . . , Cn, let r(P) denote the rows of P and c(P) the columns; then the overlap of two paths P_1 and P_2 is

$$
\mathrm{ovlp}(P_1, P_2) = d \cdot J(r(P_1), r(P_2)) + (1 - d) \cdot J(c(P_1), c(P_2)),
$$

where 0 ≤ d ≤ 1 and J(A, B) = |A ∩ B| / |A ∪ B|. Finally, a quality measure q is needed to rank candidate ε-banded paths that satisfy all other user parameters. At an intuitive level, the quality of a band should be monotonic in the size of the band and penalized for errors. Following this intuition, we utilized the simple quality measure

$$
q(P) = |r(P)| \cdot |c(P)| - w \cdot e(P). \tag{3}
$$
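The overlap and quality measures are simple enough to state directly in code. The sketch below assumes each band is represented by its row set and column set; returning 0 for the Jaccard coefficient of two empty sets is a convention chosen here, not something specified in the text, and the two example bands are made up.

```python
def jaccard(a, b):
    """J(A, B) = |A & B| / |A | B|; defined as 0.0 for two empty sets (assumption)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ovlp(rows1, cols1, rows2, cols2, d=0.5):
    """Weighted overlap of two bands, with 0 <= d <= 1 weighting rows vs. columns."""
    return d * jaccard(rows1, rows2) + (1 - d) * jaccard(cols1, cols2)


def quality(rows, cols, error, w=1.0):
    """Eq. (3): band area penalized by the weighted error."""
    return len(rows) * len(cols) - w * error


# Two hypothetical bands sharing rows {3, 4} and column C.
r1, c1 = {1, 2, 3, 4}, {"A", "B", "C"}
r2, c2 = {3, 4, 5, 6}, {"C", "D"}
print(round(ovlp(r1, c1, r2, c2), 4))  # 0.2917  (0.5 * 2/6 + 0.5 * 1/4)
print(quality(r1, c1, error=2))        # 10.0    (4 * 3 - 1.0 * 2)
```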

6.4. Pseudocode and Complexity

Pseudocode of MMBS and the search procedure appears as Algorithm 3 and procedure Search. We focus on the Search procedure, as most of the work is conducted there. The first for loop on line four iterates through all possible concepts that the current path P can expand to. If the error of augmenting the path with candidate concept C is less than ε and no more than any previous edge weight, then the best edge weight of C is updated and the path is expanded to include C (utilizing AddToPath). The search delves deeper in order to attempt to maximize the path (line 10), while the copy on line 11 implements backtracking. If the expand flag is never set to true, then the current path P cannot be further expanded and the algorithm must evaluate whether P meets all the user-defined parameters. If P meets the minRows and minCols parameters, then it is compared to all previously mined bands to determine its degree of overlap (line 15). If the overlap between P and every other band Pi does not exceed maxOvlp, then P is added to R. On the other hand, if the overlap of P and some Pi does exceed the threshold, then the higher quality band is kept.
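The bookkeeping for the result set R might look as follows. This is a hedged sketch: `update_results` is a hypothetical name, the overlap and quality functions are passed in as callables, and the policy for a candidate that clashes with several stored bands (it must beat all of them to be admitted) is an assumption, since the description only covers the pairwise case.

```python
def update_results(R, candidate, ovlp, quality, max_ovlp):
    """Add `candidate` to R unless it overlaps too much with a better band.

    Assumed tie-breaking policy: when the candidate clashes with several
    stored bands it replaces them only if it has higher quality than each.
    """
    clashes = [p for p in R if ovlp(candidate, p) > max_ovlp]
    if not clashes:
        R.append(candidate)
    elif all(quality(candidate) > quality(p) for p in clashes):
        for p in clashes:
            R.remove(p)
        R.append(candidate)
    return R


# Toy bands as sets of cell ids; Jaccard overlap and size as stand-ins
# for ovlp and q (illustrative only).
def cell_jaccard(a, b):
    return len(a & b) / len(a | b)


b1 = frozenset({1, 2, 3})
b2 = frozenset({2, 3, 4, 5})
b3 = frozenset({9, 10})

R = []
for band in (b1, b2, b3, b1):
    update_results(R, band, cell_jaccard, len, max_ovlp=0.1)
print(R == [b2, b3])  # True
```

Here b2 displaces b1 (overlap 0.4, higher quality), b3 is disjoint and is added, and the second attempt to insert b1 is rejected.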

The three basic operations of MMBS are set difference, set intersection, and swap operations. Let X = max |Ai △ Aj| for all Ai ∈ Ci, Aj ∈ Cj s.t. Cj ≻ Ci. Also let Y = max |Bi △ Bj| for all Bi ∈ Ci, Bj ∈ Cj s.t. Cj ≺ Ci. Augmenting a concept to a path utilizing the AddToPath procedure involves two set differences and at most O(X) or O(Y) swaps, while computing the error on line 6 involves at most O(X) or O(Y) set intersections. Clearly, AddToPath is invoked at most as many times as the error is computed; however, the number of times this occurs is difficult to analyze, as it depends on several factors such as ε and the actual structure of the lattice. Nevertheless, the error will never be computed more than O(|E|) times. Lastly, the search procedure is invoked |U| times; thus the total cost of MMBS is

$$
O(|U| \times |E| \times \max\{X, Y\}). \tag{4}
$$

The memory footprint of MMBS is minimal: only a single path needs to be maintained in memory throughout the search, while G can be maintained on disk or in main memory. Moreover, our experimental results indicate that G can comfortably be maintained in main memory even for moderately large matrices (4000 × 4000), as 0–1 datasets tend to be sparse.

6.4.1. Speeding up the algorithm

The dominant term in Eq. (4) is clearly |E|; however, if |U| is large, the computational cost is also quite high. Unfortunately, many of the biclusters in U tend to be very similar, and the Search procedure often does not yield any new bands but still consumes computation time. Thus, in order to speed up the algorithm, we introduce a preprocessing step to eliminate similar concepts and reduce the number of times the Search procedure is invoked. Specifically, ovlp(C1, C2) is computed for all biclusters C1, C2 ∈ U with C1 ≠ C2; if ovlp(C1, C2) > maxOvlp, then only the larger concept is kept. In all performance tests, this preprocessing step accelerated the computation dramatically (see next section) while producing results very comparable to those of the original MMBS algorithm.
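One way to realize this preprocessing step is a greedy pass over the seed concepts from largest to smallest. The sketch below is an assumption-laden illustration: `prune_seeds` is a hypothetical name, each concept is a (rows, cols) pair, "larger" is taken to mean more cells (rows times columns), and the greedy order is one possible reading of the pairwise rule in the text.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ovlp(c1, c2, d=0.5):
    """Weighted row/column Jaccard overlap of two (rows, cols) concepts."""
    return d * jaccard(c1[0], c2[0]) + (1 - d) * jaccard(c1[1], c2[1])


def prune_seeds(seeds, max_ovlp, d=0.5):
    """Keep a seed only if it does not overlap an already kept, larger seed
    by more than max_ovlp."""
    def size(c):
        rows, cols = c
        return len(rows) * len(cols)   # assumed meaning of "larger"

    kept = []
    for c in sorted(seeds, key=size, reverse=True):
        if all(ovlp(c, k, d) <= max_ovlp for k in kept):
            kept.append(c)
    return kept


# Hypothetical seed concepts.
seeds = [
    (frozenset({1, 2}), frozenset({"A", "B", "C"})),     # 6 cells
    (frozenset({1, 2, 3, 4}), frozenset({"A", "B"})),    # 8 cells
    (frozenset({7, 8}), frozenset({"G"})),               # 2 cells
]
kept = prune_seeds(seeds, max_ovlp=0.1)
print(len(kept))  # 2
```

The 6-cell seed overlaps the 8-cell seed by 0.5 * 2/4 + 0.5 * 2/3, which is well above 0.1, so only the larger one and the disjoint 2-cell seed survive.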

7. EXPERIMENTAL RESULTS

In this section, we illustrate the efficacy of MMBS experimentally. The performance of MMBS is compared with that of the MBS algorithm proposed in Ref. [8]. We show that MMBS consistently outperforms MBS in three different ways:

1. MMBS uncovers several banded structures as opposed to the single band mined by MBS.

2. MMBS consistently uncovers higher quality bands.

3. MMBS exhibits more scalable computation time.

These results are exhibited with both synthetic and real-world datasets. MMBS was implemented in C++ utilizing the STL data structures and is available upon request. All experiments were conducted on a 2.7 GHz AMD Athlon 64 X2 CPU with 5.8 GB of RAM, running Ubuntu Linux. Additionally, MMBS_Fast was implemented, in which the preprocessing step described above was applied. Source code for MBS was kindly provided by the authors of Ref. [8].

7.1. Synthetic Datasets

Several synthetic datasets were created to test the capability of MMBS. Three different methods were utilized to create three different classes of datasets. First, single band datasets were created utilizing the method described in Ref. [8]. Briefly, for a given number of rows (|G|), columns (|M|), and width parameter wi, a fully banded matrix is generated by means of a random walk starting at the (0, 0) coordinate. Initially all cells in the matrix are set to 0, and the walk chooses to step either down or to the right with equal probability. On a step to the right, the wi cells above and below the current position are all set to 1. Noise can be added to the band by flipping the original values to 0 or 1 with probabilities p and q. Next, multiple band matrices were created by first producing several single band matrices K1, . . . , Kn, followed by 'concatenating' the single bands into a single 'switch matrix' as:

$$
\begin{pmatrix}
\cdots & \cdots & 0 & K_1 \\
\cdots & \cdots & K_2 & 0 \\
0 & \iddots & 0 & \vdots \\
K_n & 0 & \cdots & \vdots
\end{pmatrix}.
$$

Finally, random binary matrices with a set sparsity level were also created.
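The generation procedure can be sketched in a few functions. This is one reading of the description, not the generator of Ref. [8]: the current cell itself is painted together with the `width` cells above and below it, the column is painted before the walk moves right, and noise is interpreted as flipping each 1 to 0 with probability p and each 0 to 1 with probability q. All function names are hypothetical.

```python
import random


def banded_matrix(n_rows, n_cols, width, rng):
    """Random-walk band: start at (0, 0), step down or right with equal odds.
    On a right step, paint rows r-width..r+width of the current column."""
    m = [[0] * n_cols for _ in range(n_rows)]
    r = c = 0
    while r < n_rows and c < n_cols:
        if rng.random() < 0.5:
            r += 1                                   # step down
        else:
            lo, hi = max(0, r - width), min(n_rows - 1, r + width)
            for i in range(lo, hi + 1):
                m[i][c] = 1
            c += 1                                   # step right
    return m


def add_noise(m, p, q, rng):
    """Assumed noise model: 1 -> 0 with probability p, 0 -> 1 with probability q."""
    noisy = []
    for row in m:
        new_row = []
        for v in row:
            flip = rng.random() < (p if v == 1 else q)
            new_row.append(1 - v if flip else v)
        noisy.append(new_row)
    return noisy


def switch_matrix(blocks):
    """Place K1..Kn along the anti-diagonal: K1 top-right, Kn bottom-left."""
    total_rows = sum(len(k) for k in blocks)
    total_cols = sum(len(k[0]) for k in blocks)
    out = [[0] * total_cols for _ in range(total_rows)]
    row_off, col_end = 0, total_cols
    for k in blocks:
        h, w = len(k), len(k[0])
        for i in range(h):
            out[row_off + i][col_end - w:col_end] = k[i]
        row_off += h
        col_end -= w
    return out


rng = random.Random(0)                # fixed seed for repeatability
k1 = banded_matrix(5, 6, 1, rng)
k2 = banded_matrix(4, 3, 1, rng)
s = switch_matrix([k1, k2])
print(len(s), len(s[0]))  # 9 9
```

Because the walk's row index never decreases, the painted intervals move monotonically down the columns, so each block is banded by construction before noise is applied.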

Three sets of experiments were conducted on each class of synthetic data. All experiments were conducted with w = 1, maxOvlp = 0.1, minRows = minCols = 5, and ε = 99 (i.e., at most 99 misplaced 1s or 0s were allowed). The MBS_BD algorithm allows for bidirectional flipping of both 0s to 1s and 1s to 0s, while MBS_SD only allows for flipping 1s to 0s. Several initial orderings are possible with the MBS algorithms, including Hamiltonian ordering with distance measures and spectral ordering with similarity measures; all orderings were utilized in the experiments, and the best results are reported here for comparison.

Figure 5 shows the results, and as can be seen MMBS and MMBS_Fast consistently discover higher quality bands. In the single band case, we see that MMBS is more resilient to noise and is able to discover a larger portion of the hidden band; this is a direct consequence of not fixing the column permutations. MBS_BD and MBS_SD discover bands of higher quality on two of the multiband class datasets; however, upon closer inspection we see that this is due to a failure of these algorithms to recognize that two segmented bands exist in the data (as can clearly be seen from Fig. 6b). The MBS algorithms attempt to lump all the rows into one band, resulting in a larger band than any of the individual segmented bands, but they do not complete the job and permute the matrix into a single band. In other words, the band produced by MBS loses the very natural real-world interpretation of two segmented bands. On the other hand, the MMBS algorithms were able to discover the two segments almost completely, both of which had very similar quality. Moreover, in the largest of the multiple banded datasets, the MMBS algorithm discovered both segmented bands even though the quality of only one of them exceeds the quality of the single band mined by MBS. Finally, the random binary matrices represent the real-world scenario of not knowing whether a banded structure exists in the data. Once again the MMBS algorithms outperform MBS and discover larger bands under the same error limit.

7.2. Real-world Datasets

Four real-world datasets from differing domains were utilized to examine the true utility of MMBS. The first two datasets, Gene PhenoTypes and Genes Drugs, came from the bioinformatics domain and were provided by the Cincinnati Children’s Hospital Medical Center. These data are available upon request. The rows of each matrix correspond to genes, while the columns correspond to phenotypes and drugs, respectively. The final two datasets were obtained from the large NewsGroups dataset [26], with rows corresponding to usenet documents and columns to words extracted from the subject lines. Mideast Religion contains documents found in both the Mideast and Religion groups, while All PC contains usenet documents from all of the PC groups. All algorithms were executed with the same parameters discussed above, and the results appear in Fig. 7. Once again, we notice that the MMBS algorithms not only consistently produce better quality bands, but also discover several banded structures in the data. Figure 8 illustrates the data matrices and the resulting bands discovered by MMBS. Investigating Figs 8e and 8a reveals the advantage of mining multiple banded submatrices as opposed to a single band. In Fig. 8e the phenotypes all correspond to abnormal nervous system phenomena; on the other hand, in Fig. 8a the first five phenotypes are abnormalities in the eyes and eyelids, while the last phenotypes indicate abnormalities in the ears. In both cases, the banded structure uncovers the overlapping


Dataset name         Size       p     Planted bands  Algorithm  Quality (top ranked)  Bands mined
SynBand100_001       100 × 100  0.01  1              MMBS       3590                  6
                                                     MMBS_Fast  3406                  4
                                                     MBS_BD     2507                  1
                                                     MBS_SD     438                   1
SynBand100_005       100 × 100  0.05  1              MMBS       2278                  9
                                                     MMBS_Fast  1503                  8
                                                     MBS_BD     1050                  1
                                                     MBS_SD     1201                  1
SynBand500_001       500 × 500  0.01  1              MMBS       8918                  7
                                                     MMBS_Fast  8261                  6
                                                     MBS_BD     2822                  1
                                                     MBS_SD     2145                  1
SynMultiBand100_001  100 × 100  0.01  2              MMBS       3367                  2
                                                     MMBS_Fast  3367                  2
                                                     MBS        4101                  1
                                                     MBS_SD     4045                  1
SynMultiBand200_001  100 × 100  0.05  2              MMBS       4054                  2
                                                     MMBS_Fast  3933                  2
                                                     MBS_BD     3910                  1
                                                     MBS_SD     3736                  1
SynMultiBand500_001  500 × 500  0.01  2              MMBS       28242                 8
                                                     MMBS_Fast  21346                 5
                                                     MBS_BD     17498                 1
                                                     MBS_SD     430                   1
SynRandom100_005     100 × 100  0.05  unknown        MMBS       3311                  17
                                                     MMBS_Fast  3220                  14
                                                     MBS_BD     2801                  1
                                                     MBS_SD     1949                  1
SynRandom500_001     500 × 500  0.01  unknown        MMBS       18635                 73
                                                     MMBS_Fast  16163                 64
                                                     MBS_BD     16771                 1
                                                     MBS_SD     5229                  1

Fig. 5 Experimental results on synthetic data.

(a) Synthetic single band with noise (SynBand500_001)

(b) Synthetic two segmented bands (SynMultiBand200_001)

(c) One of the segmented bands found by MMBS (SynMultiBand200_001)

(d) Another segmented band found by MMBS (SynMultiBand200_001)

Fig. 6 Synthetic data matrices and bands mined by MMBS.


Dataset                      Size         Sparsity  Algorithm  Quality (top ranked)  Bands mined
Genes_Phenotypes             1910 × 3965  0.008     MMBS       6665                  56
                                                    MMBS_Fast  6665                  43
                                                    MBS_BD     5204                  1
                                                    MBS_SD     3578                  1
Genes_Drugs                  1608 × 49    0.042     MMBS       6423                  18
                                                    MMBS_Fast  6423                  13
                                                    MBS_BD     5346                  1
                                                    MBS_SD     3047                  1
NewsGroups_Mideast_Religion  2000 × 890   0.003     MMBS       72906                 42
                                                    MMBS_Fast  61410                 31
                                                    MBS_BD     59781                 1
                                                    MBS_SD     58713                 1
NewsGroups_AllPC             5000 × 2805  0.0001    MMBS       93368                 5
                                                    MMBS_Fast  93368                 5
                                                    MBS_BD     89106                 1
                                                    MBS_SD     74125                 1

Fig. 7 Experimental results on real-world datasets.

(a) Genes_Phenotypes. Phenotype labels: early eyelid opening; eyelids open at birth; abnormal timing of postnatal eyelid opening; abnormal eyelid morphology; abnormal homeostasis; abnormal ear physiology; abnormal hearing physiology; abnormal brainstem auditory evoked potential; deafness.

(b) Gene_Drugs

(c) NewsGroups_Mideast_Religion

(d) NewsGroups_AllPC

(e) Genes_Phenotypes. Phenotype labels: abnormal enteric neuron morphology; abnormal enteric nervous system morphology; abnormal autonomic nervous system morphology; abnormal nervous system morphology; abnormal immune system physiology; abnormal induced morbidity/mortality.

Fig. 8 Real-world datasets and mined submatrix bands.


Fig. 9 Natural interpretation of band as overlapping communities.

roles of genes in different phenotypes, but the fact that they are two separate bands emphasizes the correlation between the phenotypes within each submatrix.

Utilizing the band mined from the Mideast Religion dataset, we were able to construct a graph that successfully illustrates different themes discussed among usenet users (Fig. 9). Specifically, collections of keywords and documents were constructed by following the permutations of the band until a row(s) or column(s) was encountered that was longer than average. All documents up to this point were placed into a collection (blue circles), while the longer-than-average row(s) or column(s) were placed in the yellow boxes with the bold font. For ease of interpretation we have only included keywords in the graph. At an intuitive level we can see that within each collection (circles) the keywords are highly correlated around a similar theme. At the same time, keywords across collections are quite distinct; however, keywords in the boxes represent overlapping themes that link the two distinct collections of documents and keywords. Thus the banded structure carries with it a natural interpretation of the distinct discussion threads in the newsgroup, along with possible links between these distinct threads.

[Four plots of CPU time (seconds, logarithmic axis) versus epsilon (0 to 100) for MMBS_fast, MMBS, and MBS.]

(a) NewsGroups_Mideast_Religion

(b) NewsGroups_AllPC

(c) Genes_Phenotypes

(d) Gene_Drugs

Fig. 10 Performance Study.

7.3. Performance Tests

Performance tests to measure the practical computational cost of MMBS, MMBS_Fast, and MBS were conducted on the real-world datasets. Each algorithm was executed ten times while the ε parameter was varied, and the average CPU times are reported in Fig. 10. We made use of the CHARM-L algorithm [24] to compute the concept lattice of each dataset, and the cost of this computation is included in the results. Noticeably, the cost of MMBS_Fast is significantly lower than that of the other algorithms in three of the four datasets. In all three of these cases MMBS_Fast outperforms MMBS by an order of magnitude and MBS by two orders of magnitude. The MBS algorithm proved not to be very sensitive to the ε parameter, resulting in more efficient performance on smaller datasets such as the Gene Drugs dataset. On the other hand, both MMBS and MMBS_Fast are very sensitive to ε. As ε increases, fewer pruning steps occur and the search procedure tends toward the worst-case cost of exploring all edges of the lattice. Despite this fact, both MMBS algorithms scaled much better to the three larger datasets than MBS, as even the original MMBS outperformed MBS by at least an order of magnitude in each case.

8. CONCLUSION

In this work, we explored the connection between FCA and banded structures in binary data. It was shown that banded submatrices of a dataset correspond to paths in the concept lattice of that dataset. This correspondence formed the basis of the MMBS algorithm, which discovers maximally ε-banded submatrices by exploring paths in the concept lattice. Experiments on synthetic and real-world datasets indicated three main advantages of MMBS over previous algorithms. First, multiple banded structures with natural interpretations are uncovered in the data as opposed to a single structure. Additionally, MMBS consistently mined higher quality bands, while the performance study illustrated that the computational cost scaled better to larger datasets.

Future work will focus on more efficient methods of searching the concept lattice for banded structures through stronger bounding criteria on the error and more effective heuristics. In addition, different quantitative measures of bandedness should be developed to correspond to specific application needs. Incorporating such measures into the lattice search should yield more interesting and relevant banded structures with respect to the specific application.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, In SIGMOD ’93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, New York, ACM, 1993, 207–216.

[2] I. Shmulevich and W. Zhang, Binary analysis and optimization-based normalization of gene expression data, Bioinformatics, 18 (2002), 555–565.

[3] M. F. Kai Puolamaki and H. Mannila, Seriation in paleontological data using Markov chain Monte Carlo methods, PLoS Comput Biol 2 (2006), e6.

[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[5] F. Alqadah and R. Bhatnagar, Detecting significant distinguishing sets among bi-clusters, In CIKM ’08: Proceeding of the 17th ACM Conference on Information and Knowledge Management, New York, ACM, 2008, 1455–1456.

[6] R. Rosen, Matrix bandwidth minimization, In Proceedings of the 1968 23rd ACM National Conference, New York, ACM, 1968, 585–595.

[7] H. Mannila and E. Terzi, Nestedness and segmented nestedness, In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, ACM, 2007, 480–489.

[8] G. C. Garriga, E. Junttila, and H. Mannila, Banded structure in binary matrices, In KDD ’08: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, ACM, 2008, 292–300.

[9] M. J. Zaki and M. Ogihara, Theoretical foundations of association rules, 3rd SIGMOD’98 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Seattle, Washington, 1998.

[10] S. C. Madeira and A. L. Oliveira, Biclustering algorithms for biological data analysis: a survey, IEEE/ACM Trans Comput Biol Bioinform, 1(1) (2004), 24–45.

[11] N. Vuokko, Consecutive ones property and spectral ordering, In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29–May 1, 2010, Columbus, Ohio, 2010.

[12] I. jen Lin, M. K. Sen, and D. B. West, Classes of interval digraphs and 0,1-matrices, In 28th S.E. Conf. Comb. Graph. Th. and Congressus Numerantium 125 (1997), 201–209.

[13] M. Sen and B. K. Sanyal, Indifference digraphs: a generalization of indifference graphs and semiorders, SIAM J Discret Math, 7(2) (1994), 157–165.

[14] E. Cuthill and J. McKee, Reducing the bandwidth of sparse symmetric matrices, In Proceedings of the 1969 24th National Conference, New York, ACM, 1969, 157–172.

[15] C. Aykanat and A. Pinar, Permuting sparse rectangular matrices into block-diagonal form, SIAM J Sci Comput, 25 (2004), 1860–1879.

[16] I. Liiv, Seriation and matrix reordering methods: an historical overview, Stat Anal Data Mining, 3 (2010), 70–91.

[17] B. Ganter and R. Wille, Formal Concept Analysis: Mathematical Foundations, Berlin, Springer-Verlag, 1999.

[18] H. Bian and R. Bhatnagar, A levelwise search algorithm for interesting subspace clusters, In ICDM ’05: Proceedings of the Fifth IEEE International Conference on Data Mining, IEEE Computer Society, 2005, 573–576.


[19] K. S. G. Liu and J. Li, Efficient Mining of Large Maximal Bicliques, Dawak, 2006, 437–448.

[20] H. Bian and R. Bhatnagar, An Algorithm for Well Structured Subspace Clusters in SDM, 2005.

[21] F. Alqadah and R. Bhatnagar, Discovering substantial distinctions among incremental bi-clusters, In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29–May 1, 2010, Columbus, Ohio, 2009, 197–208.

[22] C. Lindig and G. Datensystene, Fast concept analysis, In Working with Conceptual Structures – Contributions to International Conference on Conceptual Structures, August 14–18, 2000, Darmstadt, Germany, 2000, 152–161.

[23] S. O. Kuznetsov and S. A. Obiedkov, Algorithms for the construction of concept lattices and their diagram graphs, In PKDD ’01: Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, London, UK, Springer-Verlag, 2001, 289–300.

[24] M. J. Zaki and C.-J. Hsiao, Efficient algorithms for mining closed itemsets and their lattice structure, IEEE Trans Knowl Data Eng, 17(4) (2005), 462–478.

[25] P. Becker, J. Hereth, and G. Stumme, Toscanaj - an open source tool for qualitative data analysis, In Advances in Formal Concept Analysis for Knowledge Discovery in Databases, Proceedings Workshop FCAKDD of the 15th European Conference on Artificial Intelligence (ECAI 2002), V. Duquenne, B. Ganter, M. Liquiere, E. M. Nguifo, and G. Stumme, eds. Lyon, France, 2002.

[26] A. Asuncion and D. Newman, UCI Machine Learning Repository, 2007.
