
Complex & Intelligent Systems (2021) 7:589–619
https://doi.org/10.1007/s40747-020-00226-4

SURVEY AND STATE OF THE ART

Comparative evaluation of pattern mining techniques: an empirical study

Anindita Borah1 · Bhabesh Nath1

Received: 9 December 2019 / Accepted: 27 October 2020 / Published online: 11 November 2020
© The Author(s) 2020

Abstract
Pattern mining has emerged as a compelling field of data mining over the years. The literature has bestowed ample endeavors in this field of research, ranging from frequent pattern mining to rare pattern mining. A precise and impartial analysis of the existing pattern mining techniques has therefore become essential to widen the scope of data analysis using the notion of pattern mining. This paper is therefore an attempt to provide a comparative scrutiny of the fundamental algorithms in the field of pattern mining through performance analysis based on several decisive parameters. The paper provides a structural classification of the widely referenced techniques into four pattern mining categories: frequent, maximal frequent, closed frequent and rare. It provides an analytical comparison of these techniques based on computational time and memory consumption using benchmark real and synthetic data sets. The results illustrate that tree-based approaches perform exceptionally well over level-wise approaches on dense data sets for all the categories. For sparse data sets, however, level-wise approaches performed better than the former. This study has been carried out with the aim of analyzing the pros and cons of the well-known pattern mining techniques under different categories. Through this empirical study, an endeavor has been made to enable researchers to identify fruitful and promising research directions in one of the most remarkable areas of research, pattern mining.

Keywords Association rule mining · Frequent itemsets · Pattern mining · Performance analysis

Introduction

The enormous quantity of data generated by organizations emphasizes the discovery of valuable and significant information, which led to the emergence of the field of data mining. Data mining has established itself as an inspiring area of database research whose prime concern is to extract hidden and meaningful information from databases. An imperative area of data mining research is pattern mining, which aims to identify momentous patterns and correlations existing within a database. Since its inception, a considerable amount of research has been carried out in the field of pattern mining, targeting different kinds of patterns as well as the issues and challenges faced during their extraction [10,11,15]. After establishing itself as a compelling and fruitful research area for over a decade, pattern mining demands an overview and re-examination of the various techniques developed, letting researchers identify their pros and cons, in order to establish it as a cornerstone approach in the field of data mining.

✉ Anindita Borah
[email protected]

Bhabesh Nath
[email protected]

1 Department of Computer Science and Engineering, Tezpur University, Napaam, Tezpur, Sonitpur, Assam 784028, India

Pattern mining is the initial phase of association rule mining, which extracts significant patterns and further generates meaningful association rules from those patterns. Pattern mining techniques generate different types of patterns depending upon the requirements of the user. These patterns and rules serve different applications, ranging from fraud detection, medical diagnosis and web mining to customer transaction analysis. A wide range of pattern mining techniques is available in the literature, covering different aspects of data mining applications. A careful study of all these techniques is therefore necessary to have a complete grasp of the area of pattern mining.

Even though a large number of theoretical as well as empirical reviews in the field of pattern mining can be found in the literature, no initiative has been taken to carry out a comparative evaluation of the techniques generating different categories of patterns or itemsets using experimental validation. An empirical attempt can be found in [46], which analyzes only the closed frequent itemset generation algorithms. A few other attempts illustrate the behavior of only frequent itemset generation algorithms through experimental evaluation [20,51]. However, no initiative has been taken to perform an analytical study on the maximal frequent itemset and rare itemset generation techniques. This study offers an unbiased empirical study of the fundamental techniques in the area of pattern mining, generating different categories of itemsets. To the best of our knowledge, no attempt has yet been made to perform an empirical study and experimental evaluation of pattern mining techniques developed under various constraints using different thresholds and data sets.

Pattern mining algorithms basically attempt to identify four main categories of itemsets: frequent, maximal frequent, closed frequent and rare. This paper attempts to provide a comparative analysis of all the mainstream algorithms generating the different categories of itemsets. The main purpose of this study is to let researchers in the field of pattern mining identify the pros and cons of techniques generating all the main categories of patterns. To the best of our knowledge, it is the first attempt at a comparative study of the different categories of pattern mining techniques together. The primary aim of this study is to perform a detailed analysis of the existing techniques widely referenced in the literature, under the following four themes: (1) frequent itemset generation, (2) closed frequent itemset generation, (3) maximal frequent itemset generation and (4) rare itemset generation. Through this work, an attempt has been made to improve on the existing studies in the following ways:

– evaluation of a large number (19) of widely referenced pattern mining techniques using different thresholds;

– analysis of the techniques through experimental evaluation on several real and synthetic benchmark data sets [19];

– comparison of the techniques through graphical analysis under different scenarios.

The remaining paper is organized as follows: Sect. 2 illustrates the algorithms considered for this experimental study. Section 3 elicits and discusses the experimental results obtained for different categories of algorithms. Finally, Sect. 6 summarizes and discusses the findings of this empirical study.

Algorithms considered for the study

This section illustrates the algorithms that have been considered for this study. Several mainstream algorithms in the field of pattern mining have been taken into account for comparison and evaluation using different thresholds and data sets [19]. Nineteen algorithms encompassing different pattern mining issues have been considered for this study. The algorithms have been chosen from studies numerously referenced in the literature [4,6,18,21–24,26,29,32–34,39,43,44,49,50]. They can be distinguished depending upon the type of itemsets generated during the itemset generation phase.

Frequent itemset generation

The notion of pattern mining was introduced considering the usefulness and applicability of the frequent patterns or itemsets present in a database [8]. Transaction databases usually contain a huge number of distinct single items whose combinations further tend to generate an enormous quantity of itemsets [31]. Devising scalable techniques to handle such a large number of itemsets is itself a challenging task; thus, existing pattern mining techniques employ several strategies to deal with this issue. For this study, five pioneering and most widely accepted frequent pattern mining techniques have been considered, which are described as follows:

(a) Apriori: To handle the vast quantity of itemsets produced and to generate only the fruitful frequent itemsets, [4] introduced a downward closure property called Apriori. According to this property, an itemset can be considered frequent only if all its proper subsets are frequent. Their Apriori algorithm adopts a candidate generation approach in which the frequent 1-itemsets are initially generated by scanning the database, then proceeding to the generation of candidate 2-itemsets from these frequent 1-itemsets. The same process of frequent itemset and candidate generation continues until no further frequent n-itemsets can be generated for some particular length n. In spite of being the pioneering and one of the simplest techniques, its level-wise search and the huge number of candidate itemsets produced add to the complexity of the algorithm. The performance of the algorithm is thus affected in terms of execution time and memory usage due to the continuous generation of candidate itemsets.
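The level-wise candidate generation and downward-closure pruning described above can be sketched as follows. This is a minimal illustration of the principle, not the authors' implementation; supports are absolute counts over a list of transaction sets.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining (sketch of the Apriori idea)."""
    # Pass 1: count single items to obtain the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_sup}
    result = set(frequent)
    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into k-candidates, then prune any
        # candidate having an infrequent (k-1)-subset (downward closure).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # One database scan per level to count candidate supports.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_sup}
        result |= frequent
        k += 1
    return result
```

Each pass requires a full database scan, which is exactly the cost the tree-based approaches below try to avoid.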

(b) FP-Growth: Multiple scans of the database and the generation of an enormous number of candidate itemsets through pattern matching escalate the computational cost of the Apriori algorithm. [24] in their algorithm managed to overcome the shortcomings faced by the level-wise pattern mining algorithms. Their well-known algorithm, called Frequent Pattern Growth (FP-Growth), employs a trie-based tree data structure to reduce the number of database scans and avoid the generation of candidate itemsets. The algorithm uses a divide-and-conquer strategy and generates a compressed tree representation of the original database using the frequent items found during the initial scan of the database. Pattern mining is then performed on the tree representation of the database by generating conditional pattern bases. Employing such a strategy enables the frequent itemsets to be mined using only two scans of the database and also avoids the extravagant steps of pattern matching and candidate generation. This greatly minimizes the computational complexity of the algorithm, making it one of the most widely accepted pattern mining techniques. Despite its eminence and numerous advantages, FP-Growth still suffers from certain drawbacks, the most crucial being the limitation of memory. It fails miserably when a huge number of frequent itemsets is generated for large data sets, so that the tree can no longer be accommodated in main memory.
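The two-scan tree construction can be sketched as follows. This is only an illustration of the compression idea; the published algorithm additionally maintains a header table of node-links, which the mining phase uses to build conditional pattern bases.

```python
class FPNode:
    """One node of the prefix tree: an item, its count, parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_sup):
    # Scan 1: count item frequencies and keep only the frequent items.
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    # Scan 2: insert each transaction's frequent items, ordered by
    # descending frequency, so that common prefixes share tree nodes.
    root = FPNode(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root
```

Because all transactions share the most frequent items as prefixes, a dense data set collapses into a tree far smaller than the database itself.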

(c) AprioriTID: The AprioriTID algorithm [4] was developed as an extension of the original Apriori algorithm; the former does not use the database for support counting of candidate itemsets after the initial pass. Instead, the algorithm uses an encoding of the candidate itemsets generated in the previous pass. The size of this encoding tends to shrink after each pass and becomes much smaller than the original database, which reduces the mining intricacy to an extent. However, in the initial passes Apriori performs better than AprioriTID.

(d) ECLAT: [49] developed the Apriori-like Equivalence Class Transformation (ECLAT) algorithm, which employs a depth-first search strategy for the generation of frequent itemsets. It is based on the property of set intersection and is capable of sequential as well as parallel execution. Unlike Apriori and FP-Growth, ECLAT operates on a vertical data format and generates the TID set for each item. The intersection of the TID sets of the current n-itemsets is used to find the TID sets of the next (n+1)-itemsets. The algorithm avoids rescanning the database for support counting of the (n+1)-itemsets, as the TID sets of the n-itemsets contain the entire information necessary for calculating their support values. Similar to Apriori, ECLAT also suffers from the drawback of generating too many candidate itemsets.
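The TID-set intersection idea can be sketched in a few lines. The following is an illustrative depth-first recursion, not the published implementation: each item maps to the set of transaction IDs containing it, and an itemset's support is the size of the intersection of its members' TID sets.

```python
def eclat(tidsets, min_sup, prefix=frozenset(), out=None):
    """Depth-first frequent itemset mining over a vertical layout.

    tidsets: dict mapping item -> set of transaction IDs containing it.
    Returns a dict mapping each frequent itemset to its support.
    """
    if out is None:
        out = {}
    items = sorted(tidsets)
    for i, item in enumerate(items):
        tids = tidsets[item]
        if len(tids) < min_sup:
            continue
        itemset = prefix | {item}
        out[itemset] = len(tids)
        # Extend only with later items; intersect TID sets instead of
        # rescanning the database for support counts.
        suffix = {j: tids & tidsets[j] for j in items[i + 1:]}
        eclat(suffix, min_sup, itemset, out)
    return out
```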

(e) H-Mine: In addition to memory limitations, FP-Growth suffers from performance bottlenecks on sparse data. Experimental results available in the literature illustrate that the performance of FP-Growth tends to deteriorate with increasing sparseness of data. The number of frequent items decreases with growing sparseness, reducing the probability of node sharing among frequent items in the tree. Hence, the resulting tree becomes very large, affecting the performance of the algorithm in terms of execution time and main memory. [34] managed to overcome this demerit of FP-Growth by introducing their technique, called H-Mine, which makes use of queues instead of a tree data structure. H-Mine stores the items of the transactions in separate queues and links transactions having the same first item using hyper-links. It performs appreciably well on sparse data sets and is more efficient than Apriori and FP-Growth in terms of space usage and execution time. However, FP-Growth still overpowers H-Mine on dense data sets.

Numerous other frequent pattern mining techniques have been developed encompassing different issues associated with frequent pattern mining [2,30,47]. The issue of candidate generation and multiple database scans was resolved by the FP-Growth algorithm. To further improve FP-Growth, several variants were proposed, one of the most efficient being the LP-Tree algorithm [35]. LP-Tree managed to reduce the tree traversal time with the help of additional data structures. The LP-Tree algorithm was further enhanced by [38], which uses a top-down approach to reduce the number of level searches and identify the frequent itemsets. Incremental mining is an important issue in the field of pattern mining, as most techniques fail to efficiently handle incremental data sets. FP-Growth managed to reduce the number of database scans to two but still could not handle the issue of incremental mining effectively. [41] modified the FP-Growth algorithm to generate the frequent itemsets in a single database scan for efficient incremental and interactive mining. Their algorithm, called CP-Tree, performs branch readjustment to obtain the FP-Tree structure. It cannot, however, maintain the FP-Tree structure at all times, and the resultant FP-Tree is generated only after a user-defined criterion is satisfied. [37] enhanced the CP-Tree algorithm to maintain the FP-Tree structure at all times and employed two hash tables to minimize the number of branch comparisons.

Closed frequent itemset generation

An important issue to deal with during frequent pattern mining is the generation of an inexpediently large number of frequent itemsets. Research on pattern mining identified a significant solution to this issue in the form of closed frequent itemsets (CFIs), which possess the same potential as the large set of frequent itemsets generated. The CFIs, despite having a smaller magnitude than the larger set of frequent itemsets, are capable of identifying the explicit frequencies of all the itemsets. The most widely referenced closed frequent itemset mining techniques considered for this study are described as follows:

(a) Apriori Close: For the purpose of generating the closed frequent itemsets, Apriori Close [32] initially generates a set of generators based on the closure properties of itemsets, employing a level-wise search procedure. The support counts of all the generators are then obtained to remove the unwanted generators. The closures of all the desirable generators are computed to obtain the set of CFIs. Apriori Close performs exceptionally well on dense data sets but suffers from performance degradation when dealing with data sets with long patterns at low support thresholds.

(b) FPClose: [22] proposed another variant of FP-Growth, called FPClose, to mine the closed frequent itemsets. The algorithm makes use of a tree data structure named the Closed Frequent Itemset Tree (CFI-Tree) to store the CFIs obtained from the set of frequent itemsets. Whenever a new frequent itemset is generated, it is compared with the existing closed frequent itemsets in the CFI-Tree. A frequent itemset is considered closed provided it has no superset in the CFI-Tree with the same support count. FPClose performs efficiently due to its array-based implementation and compact tree structure. The drawback, however, is the additional time spent by the algorithm in checking the closedness of frequent itemsets.
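The closedness test just described amounts to asking whether a proper superset with identical support exists. Over a fully support-annotated collection of frequent itemsets, the check can be written as follows; this is a brute-force illustration of the definition, not the CFI-Tree implementation.

```python
def closed_itemsets(frequent):
    """Keep only itemsets that have no proper superset of equal support.

    frequent: dict mapping frozenset itemsets to their support counts.
    The surviving (closed) itemsets still determine the exact support
    of every frequent itemset.
    """
    return {s: c for s, c in frequent.items()
            if not any(s < t and c == ct for t, ct in frequent.items())}
```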

(c) LCM: To generate the frequent closed itemsets, [44] introduced the Linear time Closed itemset Miner (LCM) algorithm, which employs a parent–child relationship between the frequent closed itemsets. The algorithm builds a set enumeration tree for the frequent itemsets, and the closures of these itemsets are obtained by traversing the enumeration tree. The algorithm obtains the frequent closed itemsets in time linear in the number of closed itemsets.

(d) CHARM: CHARM [50] introduced a novel data structure called the Itemset Tree to examine the itemset space as well as the transaction space. The hybrid search method employed by the algorithm allows faster generation of CFIs by avoiding search at several levels of the itemset tree. The algorithm further uses a hash-based technique to remove itemsets that do not satisfy the closure constraints. For efficiency in memory usage and faster frequency calculations, CHARM employs another data structure called the diffset. It is, however, not very efficient for data sets with long patterns.

(e) CLOSET: The CLOSET [33] algorithm was introduced with the purpose of developing a scalable technique capable of handling larger data sets. It extends the pattern growth concept adopted previously by the FP-Growth algorithm. To mine the CFIs, the algorithm generates a compressed tree representation of the database, similar to the FP-Tree. Furthermore, it uses a partition-based projection technique to lessen the search space as well as to achieve scalable pattern mining. CLOSET is faster and more efficient than other CFI generation techniques over dense data sets; for sparse data sets, however, it still needs improvement.

Scalability and reduction of the search space have always been crucial issues while mining closed frequent itemsets. [36] developed an efficient technique that identifies the closed frequent itemsets using a tree structure. The algorithm is highly scalable and shows appreciable performance in terms of execution time. [28] attempted to identify the closed frequent itemsets using a vertical data structure. Their proposed vertical data structure reduces the storage space to a great extent and can therefore effectively handle large data sets. Another issue to deal with in closed frequent pattern mining is the generation of closed frequent patterns from data streams. [25] developed an algorithm to handle the concept drift problem and identify the closed frequent itemsets from data streams.

Maximal frequent itemset generation

Extracting the entire set of frequent itemsets is a computationally challenging as well as extravagant phase of pattern mining. Studies in the literature illustrate that if a particular itemset is frequent, then its subsets must also be frequent. This notion emphasizes that extracting only the maximal frequent itemsets (MFIs) is sufficient, rather than the complete set of frequent itemsets, thus minimizing the huge number of frequent itemsets generated. A frequent itemset qualifies as a maximal frequent itemset only if none of its supersets is frequent. A brief discussion of the algorithms considered for this empirical study is given below:
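The maximality condition just stated can be expressed directly as a filter over an already-mined frequent collection. This naive check is for illustration only; the algorithms below avoid materializing the full frequent set in the first place.

```python
def maximal_itemsets(frequent):
    """A frequent itemset is maximal iff no proper superset is also frequent.

    frequent: a collection of frozensets; returns the maximal ones.
    """
    return [s for s in frequent if not any(s < t for t in frequent)]
```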

(a) Max-Miner: In order to generate long patterns, or more specifically the maximal frequent itemsets, [6] developed an extension of the Apriori algorithm called Max-Miner. The algorithm reduces the search space by removing candidate itemsets having an infrequent subset. Max-Miner avoids expensive bottom-up traversal of the search space and employs a look-ahead search for faster identification of MFIs. However, it expends multiple database scans while searching for the long frequent patterns.

(b) MAFIA: MAFIA [18] employs a depth-first search approach to generate the maximal frequent itemsets. It obtains the frequency counts of the itemsets from a vertical bitmap representation, in which each column records the occurrences of one item. A bitwise AND operation is applied to the bit vectors of the individual items to obtain the bit vector for the entire itemset. The algorithm prunes subsets and supersets based on their frequency counts. MAFIA is able to generate all the supersets of MFIs but does not perform well on data sets that have a long average pattern length.
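The vertical bitmap counting idea can be sketched with Python integers acting as bit vectors. This illustrates only the counting principle, not MAFIA's pruning or its compressed bitmap implementation.

```python
def item_bitmaps(transactions, items):
    """Bit t of an item's bitmap is set iff transaction t contains the item."""
    bm = {i: 0 for i in items}
    for t, txn in enumerate(transactions):
        for i in txn:
            if i in bm:
                bm[i] |= 1 << t
    return bm

def bitmap_support(itemset, bm):
    """Support of an itemset = popcount of the AND of its items' bitmaps."""
    acc = None
    for i in itemset:
        acc = bm[i] if acc is None else acc & bm[i]
    return bin(acc).count("1")
```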

(c) GenMax: GenMax, developed by [21], operates on a vertical representation of the data set. It adopts the mechanism of progressive focusing to identify the set of MFIs. Based on this technique, the algorithm maintains a list of local maximal frequent itemsets (LMFIs). Whenever a new frequent itemset is produced, instead of comparing it with all the previously generated MFIs, GenMax compares it with the list of local maximal frequent itemsets. It performs well on data sets having a long average transaction length but demands performance improvement on data sets having a long average pattern length.

(d) FPMax: FPMax [23] was developed as an extension of the FP-Growth algorithm to find the MFIs. It initially constructs a Maximal Frequent Itemset Tree (MFI-Tree) to store the MFIs. Only those frequent itemsets that are not subsets of itemsets already present in the tree are inserted into the MFI-Tree. The algorithm generates the supersets of frequent itemsets and removes the non-maximal ones. FPMax is highly scalable and works well on data sets having a short average transaction length. However, it spends a lot of time in the construction of the MFI-Tree.
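The MFI-Tree insertion test amounts to a subset check against the already-stored maximal itemsets. Naively, and without the tree structure that makes the check fast, the logic looks like this (illustration only):

```python
def insert_if_maximal(mfis, candidate):
    """Insert candidate unless it is a subset of a stored itemset; drop any
    stored itemsets it strictly contains (naive stand-in for the MFI-Tree)."""
    if any(candidate <= m for m in mfis):
        return mfis  # candidate is subsumed, hence not maximal
    return [m for m in mfis if not (m < candidate)] + [candidate]
```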

To reduce the complexity of mining maximal frequent itemsets and to minimize the cost of long execution times and heavy memory usage, several techniques have been developed. [45] developed an algorithm that uses an N-list structure to compress the data set and mine the top-rank-k maximal frequent patterns. They also proposed a pruning technique to reduce the search space. One of the challenges in the field of maximal frequent pattern mining is the generation of maximal frequent patterns from data streams. [48] developed a novel technique to identify the maximal frequent patterns over data streams by imposing weight constraints. The algorithm performs a single scan of the database, thus minimizing the overhead of extracting the patterns from data streams. [27] developed a sliding-window-based technique to mine the maximal frequent patterns from data streams. The algorithm employs a strategy to avoid meaningless pattern generation by pruning unnecessary operations.

Rare itemset generation

The paradigm of frequent pattern mining has always treated rare patterns and itemsets as being of least importance and has thus demanded their removal or elimination during the itemset generation phase. However, the extraction of rare patterns or itemsets has recently gained much importance considering its manifold applications in several domains [10]. A substantial quantity of research has already been performed in the area of rare pattern mining, contributing numerous techniques that look for these previously unwanted rare itemsets [9,12–14,16,17]. The rare pattern mining techniques considered for comparative analysis are described below.

(a) MS Apriori: The first endeavor towards rare pattern mining was taken by [29], who attempted to retain some rare items along with the frequent itemsets during the phase of itemset generation. They argued that using a single support threshold alone might not allow the fruitful rare items to be retained. This led them to use a separate support threshold for each item, called its Minimum Item Support (MIS) value, which is obtained by specifying an additional parameter β. With the usage of individual MIS values for the items, the algorithm satisfies different support requirements of the users. However, the introduction of an additional user-defined parameter increases the overhead of the algorithm.
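The per-item threshold assignment can be sketched as below, assuming the commonly cited formulation MIS(i) = max(β · sup(i), LS) with a global least support LS. The function and parameter names here are illustrative, not taken from the original paper.

```python
def mis_values(item_support, beta, least_sup):
    """Assign each item a Minimum Item Support: a fraction beta of its own
    support, floored at the global least support least_sup (MS-Apriori-style
    sketch; rarer items receive proportionally lower thresholds)."""
    return {item: max(beta * s, least_sup) for item, s in item_support.items()}
```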

(b) Apriori Inverse: Apriori Inverse, proposed by [26], uses an inverse downward closure property to obtain a special set of itemsets called sporadic itemsets. The support of a sporadic itemset must be below a maximum support threshold but above a minimum support threshold. Despite its capability of finding the sporadic itemsets faster than Apriori, Apriori Inverse fails to find the complete set of rare items.
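The sporadic condition is simply a two-sided support band. Given support-annotated candidate itemsets, it reads as follows (an illustrative sketch of the selection criterion, not the algorithm's level-wise search):

```python
def sporadic_itemsets(candidates, min_sup, max_sup):
    """Keep itemsets whose support lies below max_sup (so they are not
    frequent) but at or above min_sup (so they are not mere noise)."""
    return {s: c for s, c in candidates.items() if min_sup <= c < max_sup}
```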

(c) Apriori Rare: [40] introduced a modification of the Apriori algorithm called Apriori Rare, with a view to generating both the frequent as well as the rare itemsets. However, instead of generating the entire set of rare items, the algorithm is capable of generating only a special class of itemsets called Minimal Rare Itemsets (MRIs).

(d) ARIMA: [39] developed Another Rare Itemset Mining Algorithm (ARIMA), targeting the complete set of rare items. ARIMA employs an Apriori-like level-wise approach and works as a combination of two algorithms for the generation of rare itemsets. ARIMA uses Apriori Rare to find the MRIs that were initially removed by the Apriori algorithm. The algorithm then proceeds to find the Minimal Rare Generators (MRGs) from the Frequent Generators (FGs) using another algorithm called MRG-Exp. With the MRIs as input, ARIMA generates the complete set of rare itemsets, those having zero as well as non-zero support. ARIMA, however, fails to be time efficient, as it spends a lot of time generating the MRIs and MRGs.

(e) RP-Tree: Keeping in view the significance of FP-Growth-based approaches, [43] developed the first tree-based algorithm for the extraction of rare itemsets, called the Rare Pattern Tree (RP-Tree). During its initial phase, the algorithm identifies the rare items based on their support counts. The next phase is tree construction, considering only those transactions that contain at least one rare item. The pattern mining operation is the same as that of FP-Growth, generating conditional pattern bases and conditional trees. In spite of its efficiency, RP-Tree is unable to generate the complete set of rare itemsets. It targets only a special class of rare itemsets, called rare-item itemsets, wherein the entire itemset is composed of only rare items.
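RP-Tree's first two phases amount to selecting the rare items and keeping only the transactions that contain at least one of them. A minimal sketch, assuming rare items are those whose support falls in a band [min_rare, min_freq) (the two-threshold band here is an illustrative assumption):

```python
def rp_tree_transactions(transactions, min_freq, min_rare):
    """Identify rare items and filter the database down to the transactions
    containing at least one rare item (RP-Tree's pre-filtering idea)."""
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    rare = {i for i, c in counts.items() if min_rare <= c < min_freq}
    return rare, [t for t in transactions if t & rare]
```

The subsequent tree construction and conditional-pattern-base mining then proceed as in FP-Growth, but over this much smaller filtered database.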


To resolve the issue of different support requirements of the user, the RP-Tree algorithm was later extended by [7] using a multiple-support framework, assigning each item its individual support value. In order to minimize the number of searches, [42] proposed the Rarity algorithm, which selects the longest transaction within the database and performs a top-down search for the rare itemsets, thus avoiding the lower layers of the search space containing frequent itemsets. However, the mining intricacy of the algorithm is increased by the extraction of a large number of candidates and the level-wise search. AfRIM [1] uses the same top-down strategy as the Rarity algorithm for finding the rare itemsets. AfRIM initially searches for the itemset that contains all the items present in the database. During candidate generation, all possible combinations of rare itemset pairs in the previous level are examined to find the common itemset subsets between them. The efficiency of the algorithm is highly affected by the costly steps of candidate generation and pruning. A major issue with rare pattern mining techniques is inefficiency in handling incremental data sets. Several techniques have been proposed in the literature taking into account the issue of incremental rare pattern generation. Borah and Nath [13] developed an incremental rare pattern mining technique for earthquake trend analysis and anticipation. The technique could only handle the case of transaction insertion; therefore, they further extended their approach in [12] to efficiently generate the rare patterns upon insertion as well as deletion of transactions. A queue-based rare pattern mining approach was proposed in [9] to handle the issue of generating rare patterns from sparse data sets.

Table 1 shows a comparative analysis of the different categories of algorithms considered for this study. The comparison is based on the strategy or mechanism used, the number of database scans performed, and the type of itemset generated by each algorithm, as well as their merits and demerits.

Performance analysis

An extensive comparative analysis has been carried out on nineteen fundamental algorithms using three real and three synthetic data sets. Publicly available Java implementations of certain standard algorithms, namely Apriori, FP-Growth, FPClose, CHARM, GenMax and FPMax, have been used for this study. The rest of the algorithms have been implemented in Java on a 64-bit machine with 4 GB RAM.

Several benchmark real-life and synthetic data sets have been used for experimentation. The selected data sets are regarded as benchmarks because they are widely used by the pattern mining techniques in the literature, particularly the techniques considered in this study. Table 2 lists the characteristics of the data sets used for evaluating the performance of the pattern mining algorithms. The data sets were obtained from the UCI Machine Learning Repository [19].

This section discusses the performance of the selected pattern mining algorithms on the benchmark real and synthetic data sets. For simplicity and clarity, the figures depicting the corresponding results are divided into four categories. The first category contains five algorithms generating only frequent itemsets, namely Apriori, FP-Growth, Apriori TID, ECLAT and H-Mine. The second category includes Apriori Close, FPClose, LCM, CHARM and CLOSET, which generate closed frequent itemsets. Four algorithms generating maximal frequent itemsets, namely Max-Miner, MAFIA, GenMax and FPMax, constitute the third category. The fourth category, on the other hand, includes MSApriori, Apriori Inverse, Apriori Rare, ARIMA and RP-Tree, which generate rare itemsets.
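The four itemset families behind this categorization can be made concrete with a small brute-force sketch (an illustration, not taken from the paper): on a toy transaction database it enumerates every itemset with its support and derives the frequent, closed frequent, maximal frequent and rare families from the counts.

```python
from itertools import combinations

# Toy illustration of the four itemset families; the transactions and the
# threshold below are invented for demonstration only.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]
items = sorted(set().union(*transactions))
minsup = 3  # absolute minimum support count

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Enumerate every non-empty itemset together with its support.
all_itemsets = {
    frozenset(c): support(set(c))
    for r in range(1, len(items) + 1)
    for c in combinations(items, r)
}

frequent = {s: n for s, n in all_itemsets.items() if n >= minsup}
# Closed frequent: no proper superset has the same support.
closed = {s: n for s, n in frequent.items()
          if not any(s < t and m == n for t, m in frequent.items())}
# Maximal frequent: no proper superset is frequent.
maximal = {s: n for s, n in frequent.items()
           if not any(s < t for t in frequent)}
# Rare: occurs in the data but fails the minimum support threshold.
rare = {s: n for s, n in all_itemsets.items() if 0 < n < minsup}

print(len(frequent), len(closed), len(maximal), len(rare))  # 6 6 3 9
```

The counts reflect the containment maximal ⊆ closed ⊆ frequent: the maximal and closed families are condensed representations of the frequent family, which is why they are typically far fewer in number.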

Zoo data set

This section illustrates the performance evaluation of the considered algorithms on the Zoo data set. The algorithms under the four categories have been compared based on their execution time, memory usage and the number of itemsets generated.

Execution time

The execution time invested by the considered algorithms is illustrated through graphical analysis in Fig. 1. Figure 1a–d depicts the execution time of the four categories of algorithms separately, while Fig. 1e presents the same for all the algorithms together.

(a) First Category: In the first category of algorithms, FP-Growth outperforms all the others in terms of execution time. This is expected, since Zoo is a dense data set and tree based approaches perform best on dense data. The Zoo data set contains many frequent items, so the FP-Tree achieves good compression through the overlapping of common items. Moreover, with fewer database scans, the execution time of FP-Growth is greatly reduced. H-Mine manages to perform better than the rest of the algorithms owing to its use of a queue data structure for storing the frequent itemsets. It nevertheless fails to outperform FP-Growth, the data set being dense, where tree based approaches have proven to perform best. ECLAT occupies a spot between the level-wise algorithms Apriori TID and Apriori and the pattern growth approaches FP-Growth and H-Mine. Apriori and Apriori TID pay in execution time for their huge candidate generation and multiple database scans. Apriori TID spends slightly more execution time than Apriori because it stores the item ids along


Table 1 Comparison between pattern mining algorithms

Algorithm | Strategy | Database scans | Itemset generated | Merit | Demerit
Apriori [4] | Levelwise search | Multiple | Frequent | Generates more frequent itemsets | Huge memory consumption
FP-Growth [24] | Tree based | Two | Frequent | Less database scans | FP-Tree for large data may not fit in memory
Apriori TID [3] | Levelwise search | Multiple | Frequent | Database is not used for support counting of candidates | Higher execution time
ECLAT [49] | Levelwise search | Multiple | Frequent | Fast support counting | Huge memory consumption
H-Mine [34] | Queue based | Two | Frequent | Memory efficient | Not suitable for dense data sets
Apriori Close [32] | Levelwise search | Multiple | Closed frequent | Generates more closed frequent itemsets | Costly when mining long patterns
FPClose [22] | Tree based | Two | Closed frequent | Less execution time | FP-Tree for large data may not fit in memory
LCM [44] | Levelwise search | Multiple | Closed frequent | Less execution time | Does not work well at lower supports
CHARM [50] | Tree based | Two | Closed frequent | Quickly identifies the closed frequent itemsets | Does not work well at higher supports
CLOSET [33] | Tree based | Two | Closed frequent | Avoids redundant computation | Does not work well at lower supports
Max-Miner [6] | Levelwise search | Multiple | Maximal frequent | Quickly identifies long frequent itemsets | More memory consumption
MAFIA [18] | Levelwise search | Multiple | Maximal frequent | Efficient for databases having long itemsets | Performs expensive bit vector operations
GenMax [21] | Levelwise search | Multiple | Maximal frequent | Fast support counting | Performs expensive bit vector operations
FPMax [23] | Tree based | Two | Maximal frequent | Less execution time | Not suitable for large data sets
MSApriori [29] | Levelwise search | Multiple | Frequent itemsets having rare items | Generates more rare items | Determination of parameter β is cumbersome
Apriori Inverse [26] | Levelwise search | Multiple | Rare | Less execution time | Assignment of two thresholds
Apriori Rare [40] | Levelwise search | Multiple | Rare | Less execution time | Generates only the minimal rare itemsets
ARIMA [39] | Levelwise search | Multiple | Frequent and rare | Generates the complete set of rare items | Huge memory consumption
RP-Tree [43] | Tree based | Two | Rare | Less execution time | Cannot generate the complete set of rare items


Table 2 Data sets used

Data set | Number of transactions | Number of items | Average transaction size | Type
Zoo | 101 | 36 | 17 | Dense
Lymph | 148 | 68 | 18 | Dense
Mushroom | 8124 | 119 | 23 | Dense
Retail | 88,162 | 16,470 | 10 | Sparse
T10I4D100K | 100,000 | 1000 | 10 | Sparse
T40I10D100K | 100,000 | 1000 | 40 | Sparse
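The Table 2 statistics are straightforward to recompute from the raw files. The sketch below assumes the usual market-basket text format of these benchmarks (each line one transaction of space-separated item ids); the paper itself does not state the file layout.

```python
# Derive Table 2 style statistics from transactions given as lines of
# space-separated item ids (the common market-basket text format).
def dataset_stats(lines):
    transactions = [set(line.split()) for line in lines if line.strip()]
    distinct_items = set().union(*transactions)
    avg_len = sum(len(t) for t in transactions) / len(transactions)
    return len(transactions), len(distinct_items), avg_len

# Tiny invented example: 4 transactions over 4 distinct items.
demo = ["1 2 5", "2 3", "1 2 3 5", "2 5"]
print(dataset_stats(demo))  # (4, 4, 2.75)
```

A low average transaction size relative to the number of distinct items (as in Retail) signals a sparse data set, which is exactly the property driving the level-wise versus tree based split observed in the results.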

with the frequent itemsets. The execution time of all the algorithms tends to increase as the minimum support value decreases, owing to the generation of a larger number of frequent itemsets.

(b) Second Category: The performance of the second category of algorithms is as follows. FPClose performs better than CLOSET for all minimum support values. Both FPClose and CLOSET recursively construct FP-Trees for the dense Zoo data set. The difference in their execution times lies in the fact that FPClose retains only a part of the extracted CFIs in the recursively generated CFI-trees, and its subsumption checking cost is also much lower than that of CLOSET. LCM outperforms CHARM at all minimum support values owing to its duplicate detection ability and several other optimization strategies. CHARM nevertheless manages to perform better than Apriori Close.

(c) Third Category: Among the third category of algorithms generating maximal frequent itemsets, the most effective one is FPMax, owing to its advantage of projection tree construction. Moreover, the Zoo data set has a short average transaction length (ATL) of 17, and the average pattern length (APL) of the generated MFIs is comparable to the ATL. FPMax generates a small FP-Tree from this data set, effectively extracting the MFIs from the resulting MFI-Tree. GenMax and MAFIA, on the other hand, perform expensive bit vector operations and set intersections. Thus, FPMax shows better performance than GenMax and MAFIA on such data sets. Only a minimal difference can be seen between the execution times of GenMax and MAFIA. However, both algorithms manage to perform better than Max-Miner.

(d) Fourth Category: ARIMA proves to be the most expensive algorithm in the fourth category, which generates rare itemsets. This is because it spends a lot of time executing Apriori Rare to generate the MRIs and MRG-Exp to extract the MRGs. RP-Tree, on the other hand, spends the least execution time among all the algorithms, as it generates only a subset of the rare itemsets. Fixing the maximum support (maxsup) value at 60%, it has been observed that Apriori Inverse incurs an execution time only slightly higher than

RP-Tree. Using a β value of 0.1, MSApriori performs better than ARIMA but proves to be slightly more expensive than Apriori Rare.
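For context, the β parameter mentioned above comes from the multiple minimum support scheme of MSApriori [29]: each item receives its own minimum item support (MIS) that scales with the item's frequency but never drops below a user-given floor. The sketch below follows the published description; the variable names are ours.

```python
# MIS assignment in MSApriori: the per-item threshold scales with the
# item's (relative) frequency f but is floored at the lowest support LS.
# beta in [0, 1] controls how closely MIS tracks f; names are ours.
def mis(freq, beta, ls):
    m = beta * freq
    return m if m > ls else ls

# With beta = 0.1 (the value used in this study), a frequent item keeps a
# proportionally high threshold, while an infrequent one falls back to LS,
# which is what lets rare but meaningful itemsets survive the pruning.
print(mis(0.60, 0.1, 0.02))  # scales with frequency
print(mis(0.05, 0.1, 0.02))  # floored at LS
```

Choosing β is the cumbersome part noted in Table 1: β near 1 degenerates to many tiny per-item thresholds, while β near 0 collapses the scheme to a single minimum support LS.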

Memory usage

The memory usage of the different categories of algorithms is depicted in this section. Figure 2a–d shows a graphical analysis of the four categories, while Fig. 2e compares all the algorithms in terms of their memory usage.

(a) First Category: Among the algorithms belonging to the first category, FP-Growth proved to be the best in terms of memory usage. With so many frequent items in the data set, the generated FP-tree remains very small owing to the sharing of common items. The next best in this category is the H-Mine algorithm. Employing queue data structures during frequent itemset generation gives H-Mine an added advantage, as queues consume less memory. ECLAT shows an average memory consumption, only slightly higher than FP-Growth and H-Mine. Apriori TID is the most expensive in this category, followed by Apriori, owing to the storage of the candidate itemsets generated during each phase of itemset generation.

(b) Second Category: Apriori Close consumed the maximum amount of memory among the algorithms in this category. The reason behind this overhead is that Apriori Close stores each extracted frequent minimal generator (FMG) in main memory until it obtains the entire set; the closure computation for each generated FMG therefore proves expensive. CLOSET turns out to be the best algorithm in terms of memory consumption, as it does not retain the set of FMGs in main memory. It instead performs the closure computation of the FMGs during support counting, for which it needs to store the closed itemsets (CIs) with supports lower than the minimum support value in main memory. Since the number of CIs in sparse data sets is very large, CLOSET does not perform well on sparse data sets.


Fig. 1 Execution time on zoo data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Execution time of algorithms


Fig. 2 Memory usage on zoo data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Memory usage of algorithms


LCM, on the other hand, maintained a steady memory consumption at all support values, as it avoids storing the previously extracted FCIs in main memory. Both CHARM and FPClose proved to be a little expensive in terms of memory consumption at low support values. This is because both algorithms store the previously generated FCIs in main memory. Even though FPClose retains only a portion of the extracted FCIs in the CFI tree, storing multiple CFI trees still requires an ample amount of main memory.

(c) Third Category: FPMax outperformed the rest of the algorithms in this category. The FP-Tree generated for this data set is very small owing to the short ATL, which gives FPMax an advantage over the others. GenMax consumes slightly more memory, as it needs to retain the LMFIs for comparison with the generated MFIs. Storing the bit vectors for each generated itemset makes MAFIA a bit more expensive than the previous two algorithms. Max-Miner, on the other hand, consumed the maximum amount of memory among the algorithms.

(d) Fourth Category: In this category, the algorithm consuming the maximum amount of memory is ARIMA, owing to the storage of the MRIs and MRGs obtained from Apriori Rare and MRG-Exp. The next most expensive algorithm in terms of memory consumption is MSApriori: storing the candidate itemsets at each phase of itemset generation demands a larger amount of main memory. Apriori Rare consumed less memory than ARIMA and MSApriori, as it retains only the MRIs. RP-Tree is undoubtedly the best algorithm in this regard, since it builds a small FP-Tree on this data set and stores only the rare-item itemsets. With a maximum support threshold of 60%, and storing only the sporadic itemsets, Apriori Inverse consumed a marginal amount of main memory, only slightly higher than RP-Tree.

Itemsets generated

Figure 3 illustrates the different types of itemsets generated from the Zoo data set. The itemsets generated in the greatest number are the frequent itemsets: Zoo, being a highly dense data set, yields the maximum number of frequent itemsets. The next most numerous are the rare itemsets, followed by the rare-item itemsets. The number of CFIs generated has been found to be considerably higher than that of the MFIs. MRIs and sporadic itemsets make up only a small portion of the data set.
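Throughout the Zoo results above, FP-Growth's advantage on dense data is attributed to FP-Tree compression. A minimal insertion sketch (an illustration under simplified assumptions, not the paper's implementation) shows the effect: transactions whose frequency-ordered item lists share a prefix reuse the same tree path, so many item occurrences are stored in few nodes.

```python
# Minimal FP-Tree insertion: shared prefixes of frequency-ordered
# transactions map onto shared paths, which is the source of the
# compression FP-Growth enjoys on dense data sets.
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def insert(root, transaction):
    node = root
    for item in transaction:  # items assumed already sorted by frequency
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def node_count(node):
    """Total number of item nodes strictly below this node."""
    return sum(1 + node_count(c) for c in node.children.values())

root = Node(None)
for t in [["f", "c", "a", "m"], ["f", "c", "a", "b"], ["f", "b"]]:
    insert(root, t)
print(node_count(root))  # 6 nodes hold 10 item occurrences
```

On a sparse data set the prefixes rarely overlap, the tree grows almost one path per transaction, and this compression argument reverses, which is exactly the behaviour reported later for Retail.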

Lymph

The performance evaluation of the algorithms under the different criteria on the Lymph data set is illustrated in this section.

Fig. 3 Itemsets generated in zoo data set

Execution time

This section contains the performance analysis of the algorithms in terms of execution time. Figure 4a–d compares the four categories of algorithms separately, while Fig. 4e examines all the algorithms together through graphical analysis.

(a) First Category: The performance of the first category of algorithms in terms of execution time is as follows. The most expensive algorithm in this category has been found to be Apriori TID, with a marginally higher execution time than Apriori. The idea of generating candidates and storing the identifiers of items escalates the execution time of Apriori TID, making it the most extravagant algorithm. ECLAT performed better than the previous two algorithms and managed a moderate showing. FP-Growth has again shown the best performance among all the algorithms, Lymph being a dense data set, followed by H-Mine.

(b) Second Category: FPClose surpassed the rest of the algorithms in this category, with only a nominal difference from CLOSET. The reduced subsumption checking cost and the idea of storing only a fraction of the CFIs enabled FPClose to be the best among the others. LCM displayed a consistent performance, and CHARM managed a slot just below it. It, however, performed better than Apriori Close.

(c) Third Category: In this category, FPMax outperformed the other algorithms. For the same reasons discussed previously, an appreciable performance can be observed from FPMax. GenMax and MAFIA have slightly higher execution times due to bit vector operations and set intersections. Max-Miner again fails to show a convincing performance, as it spends a lot of time searching for the long MFIs.


Fig. 4 Execution time on lymph data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Execution time of algorithms


Fig. 5 Memory usage on lymph data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Memory usage of algorithms


(d) Fourth Category: In this category, RP-Tree has been found to spend less execution time than the other algorithms. Reducing the database size and generating only the rare-item itemsets enabled RP-Tree to expend only a minor execution time. Apriori Inverse followed RP-Tree, generating only the sporadic itemsets below the maximum support threshold. ARIMA, generating the complete set of rare itemsets, has been the most expensive, followed by MSApriori and Apriori Rare. Apriori Rare, generating the MRIs, managed to display a satisfying performance in terms of execution time.

Memory usage

This section elicits the memory usage of the algorithms. The memory efficiency of each category of algorithms is shown in Fig. 5a–d. Figure 5e, on the other hand, examines the memory efficiency of all the algorithms together.

(a) First Category: The algorithm that consumed the lowest amount of memory in this category is FP-Growth. Like the previous data set, Lymph is also dense, which leads to the generation of compact, small FP-Trees and hence less usage of main memory. H-Mine consumed a bit more memory than FP-Growth, as its performance on dense data sets is not as appreciable as on sparse data sets. The level-wise approaches Apriori TID, Apriori and ECLAT expended more memory than the previous two approaches.

(b) Second Category: CLOSET dominated the rest of the algorithms in this category for reasons similar to those discussed for the Zoo data set. LCM maintained its consistency of memory consumption on the Lymph data set as well. CHARM and FPClose have again been found to consume more memory at low support values owing to the retention of the FCIs at every pass. Nonetheless, their memory consumption is still lower than that of Apriori Close, which stores the FMGs in main memory before generating the entire FMG set.

(c) Third Category: The memory consumption of the FPMax algorithm is the lowest among the MFI generating algorithms, owing to the construction of small FP-Trees from a comparatively dense data set. The storage of the MFIs in main memory by MAFIA and GenMax increases the memory overhead for these two algorithms, with only a minimal difference between the two. Max-Miner again consumed the highest amount of memory among all.

(d) Fourth Category: Despite the fact that ARIMA is the only algorithm generating the complete set of rare itemsets, it consumed the highest amount of memory, particularly due to the storage of the MRIs and FGs in main memory. MSApriori also consumed a lot of memory, owing to the retention of the rare items along with the frequent ones during

Fig. 6 Itemsets generated in lymph data set

itemset generation. RP-Tree has shown the best performance among all on the Lymph data set as well, followed by Apriori Inverse with a maximum support threshold of 60%. Apriori Rare, generating the MRIs, performed better than ARIMA and MSApriori but failed to surpass RP-Tree and Apriori Inverse.

Itemsets generated

Figure 6 illustrates the number of the different itemsets generated for the Lymph data set. The number of itemsets of each type generated in the Lymph data set is smaller than in the Zoo data set. The rare itemsets generated from this data set considerably outnumber the frequent itemsets. The closed frequent itemsets are the next in number, followed by the MRIs. On the contrary, only a limited number of maximal frequent and sporadic itemsets have been obtained.

Mushroom

This section elicits the performance of the algorithms in terms of execution time, memory usage and number of itemsets generated on the Mushroom data set.

Execution time

The execution time invested by the different categories of algorithms considered in the study is illustrated in this section. Figure 7a–e presents a graphical analysis and comparison of the execution time expended by the algorithms.

(a) First Category: The same scenario as in the previous two data sets is repeated here. FP-Growth still proved to


be the best algorithm for frequent itemset mining among the others. H-Mine had a slightly higher execution time than FP-Growth, since Mushroom is a comparatively denser data set. Apriori TID and Apriori used the highest execution time, as expected, due to multiple database scans and candidate generation. ECLAT, on the other hand, managed to perform better than these two algorithms.

(b) Second Category: In this category of algorithms, FPClose surpassed the rest at all minimum support values, particularly at low supports. Despite showing an appreciable performance, CLOSET failed to perform better than FPClose. Both algorithms construct projected FP-Trees for the dense Mushroom data set. FPClose stores a portion of the FCIs in the generated CFI trees, while CLOSET stores all the FCIs in a global prefix tree, which makes the difference in their performance. LCM outperformed CHARM owing to its efficiency in duplicate detection. Apriori Close consumed the highest amount of execution time, as predicted.

(c) Third Category: The third category of algorithms presented performances similar to those obtained on the earlier data sets. FPMax performed exceptionally well at all minimum support values. The Mushroom data set has a long average pattern length and a short average transaction length, and the performance of FPMax is appreciable on data sets having a short ATL and a long APL. GenMax managed a decent execution time, slightly higher than FPMax. MAFIA and Max-Miner continued to be the expensive ones in this category.

(d) Fourth Category: The performance analysis of the fourth category of algorithms illustrates that RP-Tree expends the lowest execution time among the others, owing to the generation of only the rare-item itemsets. With a maxsup of 60%, Apriori Inverse is the next best algorithm in this category. Apriori Rare utilized a decent amount of execution time for the generation of the MRIs. With the β value set to 0.1, MSApriori completed its execution before ARIMA. ARIMA, generating the complete set of rare itemsets, proved to be the costliest.

Memory usage

The memory efficiency of the algorithms and their comparative scrutiny based on memory usage is demonstrated through the graphical analysis in Fig. 8a–e.

(a) First Category: The same outstanding performance of FP-Growth on dense data sets can be seen on the Mushroom data set as well, owing to the smaller size of the FP-Tree. H-Mine has been found to consume a little more memory than FP-Growth. The highest memory consumption was by Apriori TID, followed by Apriori.

(b) Second Category: For the same reason discussed for the previous data sets, CLOSET managed to consume the lowest amount of memory among the second category of algorithms. CHARM and LCM have been consistent in their memory consumption, as on the earlier data sets. FPClose, on the contrary, has shown a disappointing performance due to its tendency to retain the FCIs at every pass. Apriori Close, as expected, consumed the highest amount of memory among all.

(c) Third Category: The size of the FP-Tree generated by FPMax for Mushroom is very small due to the density of the data set. As a result, the memory consumption of this algorithm is quite low compared to the other algorithms of the same category, the highest being that of Max-Miner. Due to the storage of the MFIs in main memory, MAFIA and GenMax consumed more memory than FPMax.

(d) Fourth Category: The RP-Tree algorithm consumed the lowest amount of memory in this category, which is expected given its retention of only the rare-item itemsets. RP-Tree is followed by the Apriori Inverse algorithm, which consumed a reasonable amount of memory upon setting the maxsup value to 60%. The highest amount of memory is consumed by ARIMA, the only algorithm generating the complete set of rare itemsets. Apriori Rare and MSApriori curtailed the amount of memory consumed to some extent compared to ARIMA.

Itemsets generated

The number of itemsets generated under the given threshold for the Mushroom data set is quite high, the highest being that of the rare itemsets. The frequent itemsets generated are slightly fewer in number than the rare itemsets. The numbers of closed frequent and maximal frequent itemsets generated are very small, almost negligible in comparison to the rare and frequent itemsets. Figure 9 graphically shows the number of different types of itemsets generated from the Mushroom data set.

Retail

A detailed analysis of the performance of the algorithms on the Retail data set is provided in this section.

Execution time

The execution time analysis of the algorithms is depicted in this section through the graphical analysis in Fig. 10a–e.

(a) First Category: FP-Growth expends the highest execution time on the Retail data set. This is expected, as it is a sparse data set containing a smaller number of frequent items. A higher amount of execution time


Fig. 7 Execution time on mushroom data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Execution time of algorithms


Fig. 8 Memory usage on mushroom data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Memory usage of algorithms


Fig. 9 Itemsets generated in mushroom data set

is invested in the FP-Tree construction, and the resulting tree is big and bushy, as the amount of node sharing is minimal. H-Mine outperformed all the algorithms due to its efficient use of the queue data structure. Level-wise approaches like Apriori, Apriori TID and ECLAT performed better than FP-Growth, as fewer frequent patterns and candidates are generated.

(b) Second Category: The array based technique employed by FPClose proves costlier at lower supports, as a huge number of itemsets are generated. CLOSET performed better than FPClose owing to its use of the top-down projection tree, which requires the traversal of only the global FP-Tree. The large number of generated candidates increases the closure computation cost and the execution time of Apriori Close, making it the most expensive algorithm in this category. LCM outperforms all the algorithms in this category, and the performance gap becomes more evident as the support value decreases.

(c) Third Category: Retail is a data set with a short ATL, and the MFIs generated are also very short. The FP-Tree generated by FPMax in this case is very big and bushy. At high support values, there are only a few MFIs. FPMax, however, expends a lot of time in FP-Tree construction, and despite that, only a few MFIs are generated from the FP-Tree. The short ATL of Retail, however, did not affect the bit vector operations performed by MAFIA and GenMax, due to which they performed better than FPMax.

(d) Fourth Category: ARIMA proved to be the costliest among all for this data set as well. MSApriori managed to perform better than ARIMA, as in the previous cases. RP-Tree spent a higher execution time on this data set than on the other data sets, owing to the construction of costly FP-

Trees. Apriori Inverse and Apriori Rare outperformed the others in this category.

Memory usage

The performance evaluation of the algorithms in terms of memory usage is provided in this section. Figure 11a–e provides a comparative analysis of the algorithms based on their memory efficiency.

(a) First Category: The highest amount of memory in this category is consumed by the FP-Growth algorithm. As Retail is a sparse data set, the amount of node sharing is low, due to which a large number of nodes are generated in the FP-Tree. Storing these nodes in main memory proved costly for FP-Growth in terms of memory usage. However, the numbers of frequent patterns, and hence of candidates generated, are small, which benefited Apriori TID and Apriori. H-Mine is the clear winner in this case, as queue data structures consume less memory.

(b) Second Category: The performance of the FPClose algorithm has been found to be the worst among the algorithms in this category. The retention of the recursively built FP-Trees and CFI trees resulted in a high amount of memory consumption. CHARM unexpectedly consumed the lowest amount of memory. The memory consumption of Apriori Close is due to the retention of the FMGs, obtained using off-line closure computation, in main memory. Memory consumption remained constant for LCM, while CLOSET consumed slightly more memory than Apriori Close.

(c) Third Category: The storage of the big and bushy FP-Trees generated by FPMax resulted in a high memory consumption by the algorithm. Max-Miner consumed slightly less memory than FPMax. GenMax maintained its consistency in memory consumption and proved to be the best algorithm in this category. The superiority of GenMax comes from the fact that it maintains only the local MFIs for comparison rather than the entire set of MFIs. MAFIA is the next best algorithm in terms of memory consumption in this category.

(d) Fourth Category: ARIMA still consumed the highest amount of memory among the rare itemset generation algorithms, owing to the retention of both frequent and rare itemsets, followed by MSApriori. RP-Tree proved to be a little costlier this time compared to its memory consumption on the dense data sets; the reason is the same as for the other pattern mining algorithms. Apriori Rare consumed less memory this time than RP-Tree, followed by Apriori Inverse.


Fig. 10 Execution time on retail data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Execution time of algorithms


Fig. 11 Memory usage on retail data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Memory usage of algorithms


Fig. 12 Itemsets generated in retail data set

Itemsets generated

Retail, being a sparse data set, has a smaller number of frequent itemsets. Thus, as expected, the number of rare itemsets generated is the highest for this data set. Minimal rare itemsets are the next in number, followed by closed frequent itemsets. Only a few maximal frequent itemsets and sporadic itemsets are generated for this data set.

T10I4D100K

This section contains the performance evaluation of the algorithms on the sparse data set T10I4D100K, based upon the execution time invested, the amount of memory used and the number of itemsets generated.

Execution time

The execution time spent by the algorithms and the respective comparative analysis are shown in this section with the help of graphical analysis in Fig. 13a–e.

(a) First Category: The same trend has been observed for this category of algorithms as in the previous case. H-Mine maintained its effective performance on this sparse data set. FP-Growth still proved to be the most expensive among all. AprioriTID and Apriori were the next in row in terms of performance. ECLAT demonstrated consistent performance for this data set as well.

(b) Second Category: AprioriClose was the most expensive algorithm under this category, unlike on the Retail data set. This is because AprioriClose compares the support of each k-candidate itemset generated with its (k-1)-subsets to find out whether it is an FMG, which increases its computational overhead. FPClose surprisingly performed better than CLOSET due to its use of an array-based implementation that avoids repeated FP-Tree traversal during the construction of header tables for new entries. Between CHARM and LCM, the former highly outperforms LCM, specifically at higher support values.

(c) Third Category: The performance of MAFIA and GenMax continued to be consistent for this data set as well. FPMax once again failed to impress in terms of performance, with Max-Miner spending slightly less execution time than FPMax.

(d) Fourth Category: ARIMA has been found to pay in terms of execution time due to the retention of frequent as well as rare itemsets. MS-Apriori invested slightly less execution time than ARIMA. RP-Tree spent more execution time for this data set as compared to dense data sets due to the generation of FP-Trees, but displayed feasible execution time as it generates only the rare-item itemsets. Apriori Inverse and Apriori Rare demonstrated similar performances, with a slight advantage to the former.
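The FMG test that burdens AprioriClose in (b) above — comparing a k-itemset's support against that of its (k-1)-subsets — can be sketched as a simple check. This is illustrative only, assuming a hypothetical precomputed `supports` table rather than AprioriClose's actual code:

```python
from itertools import combinations

def is_generator(itemset, supports):
    """A k-itemset is a (minimal) generator only if its support differs
    from the support of every one of its (k-1)-subsets; `supports`
    maps frozensets to support counts (assumed precomputed)."""
    s = supports[frozenset(itemset)]
    return all(
        supports[frozenset(sub)] != s
        for sub in combinations(itemset, len(itemset) - 1)
    )

supports = {
    frozenset("a"): 5,
    frozenset("b"): 4,
    frozenset("ab"): 4,  # same support as {b}: {a, b} is not a generator
}
print(is_generator("ab", supports))  # False
```

Because this check runs for every candidate at every level, its cost grows with the number of candidates, which explains the overhead observed for AprioriClose on this data set.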

Memory usage

This section illustrates the memory analysis of the algorithms considered in the study from Fig. 14a to e.

(a) First Category: H-Mine maintained its appreciable performance for this data set as well. ECLAT was slightly more expensive than H-Mine. FP-Growth, as expected, proved to be extravagant for this sparse data set. AprioriTID and Apriori were the second and third most expensive algorithms in terms of memory consumption under this category.

(b) Second Category: For this category of algorithms, the scenario is completely opposite to that on the Retail data set. AprioriClose in this case consumed the highest amount of memory rather than FPClose. This is because AprioriClose attempts to store each extracted FMG in main memory prior to generating the whole set of FMGs. On the other hand, the high memory consumption of FPClose comes from the fact that it retains the recursively built FP-Trees and CFI trees in main memory, which are quite large in the case of sparse data sets. CHARM largely outperformed the rest of the algorithms under this category.

(c) Third Category: FPMax is the most extravagant algorithm under this category as it needs to store all the FP-Trees and MFI trees generated in main memory. Max-Miner stores every candidate itemset generated, due to which its memory consumption considerably increases. GenMax outperformed the rest of the algorithms due to the retention of only the set of local MFIs.

(d) Fourth Category: ARIMA stores the frequent itemsets and MRGs, which increases its memory overhead.


Fig. 13 Execution time on T10I4D100K data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Execution time of algorithms


Fig. 14 Memory usage on T10I4D100K data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Memory usage of algorithms


Fig. 15 Itemsets generated in T10I4D100K data set

Retaining the FP-Trees in memory proved to be expensive for RP-Tree. The performance of Apriori Rare and Apriori Inverse remained constant for this data set.
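ECLAT, noted in the first category above, owes its memory profile to a vertical layout: each item maps to the set of transaction ids (a tidset) that contain it, and the support of any itemset is the size of the intersection of its members' tidsets. A minimal, illustrative sketch of that representation:

```python
from functools import reduce

def vertical(transactions):
    """Build the vertical layout: item -> set of transaction ids."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def support(itemset, tidsets):
    """Support of an itemset = size of the intersection of the
    tidsets of its member items."""
    return len(reduce(set.intersection, (tidsets[i] for i in itemset)))

tidsets = vertical([["a", "b"], ["a", "c"], ["a", "b", "c"]])
print(support(["a", "b"], tidsets))  # 2
print(support(["b", "c"], tidsets))  # 1
```

On sparse data the tidsets stay short, which is one reason vertical approaches remain competitive where FP-Growth's tree grows large.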

Itemsets generated

Figure 15 illustrates the different types of itemsets generated in the T10I4D100K data set. As can be observed in the figure, the highest number of itemsets generated for this data set are the rare itemsets. The frequent itemsets generated are also very high in number, but slightly fewer than the rare itemsets. The next in number are the minimal rare itemsets, followed by a few sporadic, closed frequent and maximal frequent itemsets.

T40I10D100K

An analysis of the algorithms based on their performance on the data set T40I10D100K is given in this section.

Execution time

The amount of time invested by the algorithms in their execution is illustrated in this section through a graphical analysis in Fig. 16a–e.

(a) First Category: For this data set, AprioriTID largely outperforms FP-Growth, as the generation of FP-Trees by FP-Growth expends a lot of time, specifically for sparse data sets. Apriori and ECLAT performed marginally better than AprioriTID, with H-Mine being the most substantial algorithm under this category.

(b) Second Category: AprioriClose spent the highest amount of execution time for reasons similar to the previous data set. The execution time of CLOSET is affected by its tendency to preserve all the extracted FCIs in the same tree. FPClose under such circumstances performed marginally better than CLOSET. The performances of LCM and CHARM have been found to be similar, with a slight advantage to CHARM.

(c) Third Category: Among the third category of algorithms, the most efficient one is again the GenMax algorithm, since it compares each newly extracted MFI with only the set of local MFIs, unlike the other levelwise algorithms that perform the comparison with all previously extracted MFIs. FPMax in this case is highly affected by the recursive generation of FP-Trees and MFI trees. Max-Miner performed slightly better than FPMax, followed by MAFIA.

(d) Fourth Category: ARIMA, due to the generation of MRIs and MRGs, proved to be expensive even for this data set. MS-Apriori too spent a considerable amount of time in determining and assigning MIS values to the items. Apriori Rare, due to the generation of only MRIs, and Apriori Inverse, due to the generation of only sporadic itemsets for a Maxsup value of 60%, performed consistently better than the other rare itemset generation algorithms. RP-Tree however proved to be a little expensive due to the generation of large and bushy FP-Trees.
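The sporadic-itemset criterion that keeps Apriori Inverse cheap in (d) can be stated compactly. The sketch below is a hedged reading of the Koh and Rountree definition, with illustrative parameter names: support must fall below the maximum-support fraction yet reach a small absolute floor that filters noise.

```python
def is_sporadic(count, n_transactions, maxsup=0.6, minabssup=2):
    """An itemset is a candidate sporadic itemset when its support
    count lies below the `maxsup` fraction of transactions (60% here,
    matching the Maxsup value used in the experiments above) but at or
    above the absolute noise floor `minabssup`."""
    return minabssup <= count < maxsup * n_transactions

print(is_sporadic(3, 10))  # True: 2 <= 3 < 6
print(is_sporadic(7, 10))  # False: above maxsup
print(is_sporadic(1, 10))  # False: below the noise floor
```

Because only itemsets under the maxsup ceiling are ever expanded, the search space stays small, which matches the low execution times reported for Apriori Inverse.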

Memory usage

The efficiency of the algorithms in terms of memory usage is depicted in this section with the help of a graphical analysis from Fig. 17a to e.

(a) First Category: The scenario of memory usage for this category is similar to the previous data set. FP-Growth has again been found to be affected by the sparseness of the data set, where it is obliged to generate large and bushy FP-Trees. The attempt to accommodate the huge number of candidate itemsets generated proved to be costly for both AprioriTID and Apriori. However, their performance with respect to FP-Growth has been much better as compared to dense data sets. ECLAT and H-Mine performed exceptionally well, with a slight advantage to H-Mine, which outperformed all the algorithms.

(b) Second Category: CHARM proved to be the most memory efficient among the algorithms. Memory consumption of CLOSET and FPClose tends to increase at lower minsup values, when the number of FCIs increases. AprioriClose consumed the highest amount of memory in this category. The memory requirement of LCM, however, remained mostly constant.

(c) Third Category: The amount of memory expended by FPMax was considerably higher for this data set. Max-Miner too consumed a high amount of memory, with only a minimal difference to FPMax. MAFIA and


Fig. 16 Execution time on T40I10D100K data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Execution time of algorithms


Fig. 17 Memory usage on T40I10D100K data set. a Frequent itemset generation. b Closed frequent itemset generation. c Maximal frequent itemset generation. d Rare itemset generation. e Memory usage of algorithms


Fig. 18 Itemsets generated in T40I10D100K data set

GenMax continued to perform well for this data set as well.

(d) Fourth Category: Apriori Rare and Apriori Inverse were the most memory efficient in this category of algorithms. ARIMA consumed the highest amount of memory, followed by MSApriori. RP-Tree, due to the retention of only rare-item itemsets in the FP-Tree, performed better than ARIMA and MSApriori.

Itemsets generated

An illustration of the different types of itemsets generated in this data set is shown in Fig. 18. The scenario is similar to the previous data set, where the number of rare itemsets is quite large compared to other types of itemsets. A marginal number of frequent itemsets has been generated, followed by the minimal rare itemsets. On the contrary, only a few sporadic, closed frequent and maximal frequent itemsets have been produced.

Discussion

We performed the first three sets of experiments on dense real data sets: Zoo, Lymph and Mushroom. The results obtained were similar for all three dense data sets. For the first category of algorithms, generating frequent itemsets, the performance of FP-Growth has been found to be the best among all. The efficiency of the algorithm in terms of execution time and memory usage is quite impressive in the case of dense data sets. However, for very large data sets, the algorithm may run out of memory. In such cases, an existing scalable variant of the algorithm can be employed. Thus, for extracting frequent itemsets from data sets with a dense distribution of data, FP-Growth can be preferred. For the second

category, exploiting FPClose for generating closed frequent itemsets can be a good option considering its exceptional performance on dense data sets. Even though FPClose outperformed LCM by a slight difference, it is worth mentioning that the execution time and memory usage of LCM remained constant throughout, which is quite appreciable. Among the algorithms in the third category, FPMax performed well due to the long ATL and short APL of the three dense data sets. However, FPMax fails to show appreciable performance for data sets having long ATL and short APL as well as long ATL and long APL. This is because when the length of the transactions is very high, FPMax spends a huge amount of time in the construction of the MFI-Tree, and the resulting tree will also be large and bushy. In such cases, MAFIA and GenMax are expected to outperform FPMax, since the cost of bit-vector operations and set intersections will be less. For the fourth category of algorithms, RP-Tree is the best option if the users are interested only in the rare-item itemsets, as it fails to generate the complete set of rare itemsets. For generating the rare itemsets, ARIMA, even though expensive, is preferred as it is the only algorithm capable of generating frequent itemsets as well as the complete set of rare itemsets.
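RP-Tree's restriction to rare-item itemsets, noted above, comes from a pre-filtering step: only transactions containing at least one rare item contribute to the tree. An illustrative sketch of that filter, not the original implementation:

```python
from collections import Counter

def rare_item_transactions(transactions, minsup):
    """Keep only transactions containing at least one rare item, i.e.
    an item whose support count is below `minsup`; a tree built from
    exactly these transactions can yield only rare-item itemsets."""
    counts = Counter(item for t in transactions for item in t)
    rare = {item for item, c in counts.items() if c < minsup}
    return [t for t in transactions if rare & set(t)], rare

kept, rare = rare_item_transactions(
    [["a", "b"], ["a", "c"], ["a", "b"]], minsup=2
)
print(rare)  # {'c'}
print(kept)  # [['a', 'c']]
```

Rare itemsets composed entirely of frequent items never survive this filter, which is precisely why RP-Tree misses part of the complete rare itemset collection.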

The next three sets of experiments were conducted on one real and two synthetic sparse data sets: Retail, T10I4D100K and T40I10D100K. H-Mine performed exceptionally well among the algorithms in the first category for all three sparse data sets. H-Mine overcomes the poor performance of FP-Growth on sparse data sets. For the second category of algorithms, the performances of LCM and CHARM were very close to each other. LCM proved to be the most efficient one for the Retail data set, while CHARM was better than the others in the case of the two synthetic data sets. Among the algorithms in the third category, FPMax failed to perform as in the case of dense data sets. This is because Retail and the other two synthetic data sets have short ATL and a relatively short APL. FPMax spends a lot of time in FP-Tree construction even though only a small number of MFIs are generated from the sparse data sets. The bit-vector operations performed by GenMax, however, remain unaffected by the short ATL of these sparse data sets, due to which it performed the best among all the algorithms. In the case of the rare itemset generation algorithms, Apriori Rare, generating only MRIs, and Apriori Inverse, generating only sporadic itemsets, invested the lowest amount of time. RP-Tree failed to impress in the case of sparse data sets, despite generating only a subset of rare itemsets, due to the generation of big and bushy FP-Trees.
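The GenMax advantage described above rests on a simple subsumption test: a newly found frequent itemset is maximal only if no already-known MFI contains it. The sketch below shows the test itself, with `known_mfis` standing in for GenMax's local MFI list (illustrative, not the original implementation):

```python
def is_subsumed(candidate, known_mfis):
    """A frequent itemset is not maximal if it is contained in an
    already-known maximal frequent itemset. GenMax's efficiency comes
    from keeping `known_mfis` small (only locally relevant MFIs)
    rather than scanning the full global MFI set."""
    c = frozenset(candidate)
    return any(c <= frozenset(mfi) for mfi in known_mfis)

mfis = [{"a", "b", "c"}]
print(is_subsumed({"a", "b"}, mfis))  # True: contained in {a, b, c}
print(is_subsumed({"a", "d"}, mfis))  # False: a new maximal candidate
```

Since the test is run for every candidate, shrinking the comparison set from all MFIs to a local subset directly reduces both time and memory, matching the consistent results reported for GenMax.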

The results obtained for all the data sets in the four different categories have been summarized in Table 3. The table presents the best performing algorithm in terms of execution time and memory usage in each category of pattern mining techniques.

From Table 3, it can be observed that among the frequent pattern mining techniques, FP-Growth proved to be the best


Table 3 Performance comparison of all categories of pattern mining techniques

| Data set | Frequent itemset (execution time / memory usage) | Closed frequent itemset (execution time / memory usage) | Maximal frequent itemset (execution time / memory usage) | Rare itemset (execution time / memory usage) |
|---|---|---|---|---|
| Zoo | FP-Growth / FP-Growth | FPClose / CLOSET | FPMax / FPMax | RP-Tree / RP-Tree |
| Lymph | FP-Growth / FP-Growth | FPClose / CLOSET | FPMax / FPMax | RP-Tree / RP-Tree |
| Mushroom | FP-Growth / FP-Growth | FPClose / CLOSET | FPMax / FPMax | RP-Tree / RP-Tree |
| Retail | H-Mine / H-Mine | LCM / CHARM | GenMax / GenMax | Apriori Rare / Apriori Rare |
| T10I4D100K | H-Mine / H-Mine | CHARM / CHARM | GenMax / GenMax | Apriori Rare / Apriori Rare |
| T40I10D100K | H-Mine / H-Mine | CHARM / CHARM | GenMax / GenMax | Apriori Rare / Apriori Rare |

algorithm in terms of execution time as well as memory usage for all the dense data sets: Zoo, Lymph and Mushroom. The performance of FP-Growth in the case of sparse data sets is not very convincing. Its performance on sparse data sets can be enhanced by adopting some mechanism to make the FP-Tree structure more compact, or by using another supporting data structure for storing the patterns. For the sparse data sets Retail, T10I4D100K and T40I10D100K, the performance of H-Mine is the best among all the frequent pattern mining techniques. To improve its performance in the case of dense data sets, it is beneficial to swap its queue data structure for an FP-Tree, since the FP-Tree's compression by common prefix path sharing, followed by mining on the compressed structure, will be more efficient compared to the queue data structure. Among the closed frequent pattern mining techniques, FPClose emerged as the best algorithm in terms of execution time, while the performance of CLOSET was identified to be the best in terms of memory usage for the dense data sets. The performance of FPClose in terms of memory usage can be improved if it avoids retaining the FCIs at every pass. For the sparse data sets, CHARM was the clear winner in terms of both execution time and memory usage. Only in the case of the Retail data set was the performance of LCM found to be slightly better than CHARM. Among the maximal frequent pattern mining techniques, FPMax was the best algorithm in terms of execution time and memory usage for all the dense data sets, while GenMax emerged as the best algorithm for all the sparse data sets. RP-Tree proved to be the best algorithm among the rare pattern mining techniques for dense data sets, and Apriori Rare demonstrated the best performance for sparse data sets in terms of both execution time and memory usage. The performance of RP-Tree can be improved for sparse data sets by employing additional data structures for pattern storage, or by compressing the size of the tree structure used. Apriori Rare, on the other hand, needs to employ data structures that can guarantee limited storage space for the patterns generated in order to handle dense data sets efficiently.

Threats to validity

In order to gauge the quality of this work, it is necessary to consider the threats to the validity of the study [5]. This section discusses those threats. The following threats to the validity of the study need to be considered:

(a) Threshold range: The results of the study and the performance of the algorithms are completely dependent on the values of the support thresholds used. The range of values chosen for the support thresholds is user defined, and a change in values might affect the final results of the study.

(b) Data set reliability: The data sets used in this study are the ones widely used by the techniques considered


for experimentation. The results obtained and the behavior of the algorithms might vary from one data set to another depending on the data set characteristics. It is possible that the findings may vary for some data sets not used in the study.

(c) Limited data sample size: Due to hardware and software limitations, the data sets considered are not very large. Larger data sets could undermine the validity of the results obtained in this study.

(d) Lack of good descriptive statistics: The results shown in this study are basically the average of observed results or the average of multiple runs. However, measures of variation of the observed results, such as their standard deviation, min-max range or a box plot, are not presented, as the study is primarily based on experimental evaluation.

(e) Lack of discussion on code instrumentation: The publicly available source code or implementations used in the experiments may hide specific tweaks or instrumentations that favor certain instances or algorithms, thus influencing the observed results.

(f) Lack of evaluations for instances of growing size and complexity: The algorithms considered for evaluation in this study may have been designed to handle data sets that vary in size and complexity. The evaluation of the algorithms should have been done across a breadth of problem instances, varying in both size and complexity, to provide an assessment of their limits in handling different instances.

Conclusion

Through this paper, we attempted to provide a structural and analytical comparative scrutiny of some of the widely referenced pattern mining techniques, to guide researchers in making the most appropriate selection for data analysis using pattern mining. The prime focus of this comparative study is to enable researchers to gain an insight into the performance and respective pros and cons of the fundamental pattern mining techniques through a detailed theoretical and empirical analysis.

Initially, we provided a structural classification of the pattern mining techniques in four different categories based upon the type of itemsets generated, and a theoretical comparison of all these algorithms specifying their merits and demerits. To gauge the performance of the concerned algorithms, we presented an empirical analysis based on several decisive parameters. We performed experiments using three real and three synthetic benchmark data sets. Both dense and sparse data sets have been considered for analyzing the behavior of the algorithms under different data characteristics. The performance is analyzed based upon the execution

time invested, the amount of memory consumed and the number of itemsets generated by the algorithms.

None of the algorithms, however, can be termed the best in its category for all data characteristics under different conditions. The selection of an algorithm in a particular category depends upon the characteristics of the data and the requirements of the user. There are certain challenging issues faced by the algorithms in each category that affect their performance. Among the frequent pattern mining techniques, Apriori, AprioriTID and ECLAT suffer from the drawback of huge execution time and memory consumption due to their approach of levelwise search in finding the frequent itemsets. H-Mine, despite being less expensive in terms of execution time and memory consumption, is only suitable for sparse data sets. FP-Growth is the winner among all when it comes to dense data sets, but for large data sets the FP-Tree structure might not be accommodated in main memory. For the closed frequent pattern mining techniques, the variant of FP-Growth, FPClose, suffers from the same drawback. AprioriClose, on the other hand, proves to be costly while mining long patterns. LCM and CLOSET do not work well at lower support thresholds, while CHARM fails to show good performance at higher support thresholds. The maximal frequent pattern mining technique Max-Miner suffers from the same drawback of high execution time and memory consumption as other levelwise search approaches. FPMax is not suitable for large data sets, as the generated FP-Tree may not fit into main memory. MAFIA and GenMax perform expensive bit-vector operations, due to which they are not suitable for databases having short itemsets. Among the rare pattern mining techniques, MSApriori expends huge execution time and memory and also requires the identification of an extra parameter β to identify the rare itemsets. Apriori Inverse and Apriori Rare can generate only the minimal rare itemsets. ARIMA, despite generating the complete set of rare itemsets, proves to be costly in terms of execution time and memory usage. RP-Tree, being a tree based approach, performed better than all the other algorithms, but cannot generate the complete set of rare itemsets.

The experimental study indicates that the performance of the algorithms is greatly affected by the characteristics of the data sets used. An in-depth study of data set characteristics can be an important aspect of research in the field of pattern mining. To better understand the behavior of the algorithms, a structural characterization of the data sets needs to be done beyond the number of transactions/items, the average transaction size or the density. It has further been observed that average transaction length and average pattern length are other factors that have an important impact on algorithm performance. In this regard, an important research direction would be to perform a detailed analysis of how and why these factors affect the performance of the algorithms. Finding effective and costless metrics to analyze the performance of algorithms can be another future perspective to work on.

Conflict of interest None of the authors have any competing interests.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Adda M, Wu L, Feng Y (2007) Rare itemset mining. In: MachineLearning and Applications, 2007. ICMLA 2007. Sixth Interna-tional Conference on, IEEE, pp 73–80

2. Aggarwal A, Toshniwal D (2019) Frequent pattern mining on timeand location aware air quality data. IEEE Access 7:98,921–98,933

3. Agrawal R, Srikant R (1995) Mining sequential patterns. In: DataEngineering, 1995. Proceedings of the Eleventh International Con-ference on, IEEE, pp 3–14

4. Agrawal R, Srikant R, et al (1994) Fast algorithms for mining asso-ciation rules. In: Proc. 20th int. conf. very large data bases, VLDB,vol 1215, pp 487–499

5. Ali S, Briand LC, Hemmati H, Panesar-Walawege RK (2009) Asystematic review of the application and empirical investigationof search-based test case generation. IEEE Trans Software Eng36(6):742–762

6. Bayardo RJ Jr (1998) Efficiently mining long patterns fromdatabases. ACM Sigmod Record 27(2):85–93

7. Bhatt U, Patel P (2015) A novel approach for finding rare itemsbased on multiple minimum support framework. Proc Comput Sci57:1088–1095

8. Borah A, Nath B (2017a) Mining patterns from data streams: anoverview. In: 2017 international conference on I-SMAC (IoT insocial, mobile, analytics and cloud)(I-SMAC), IEEE, pp 371–376

9. Borah A, Nath B (2017b) Mining rare patterns using hyper-linkeddata structure. In: International conference on pattern recognitionand machine intelligence, Springer, pp 467–472

10. BorahA,NathB (2017c)Rare association rulemining: a systematicreview. Int J Knowl Eng Data Min 4(3–4):204–258

11. Borah A, Nath B (2018a) Fp-tree and its variants: towards solvingthe pattern mining challenges. In: Proceedings of first internationalconference on smart system, innovations and computing, Springer,pp 535–543

12. Borah A, Nath B (2018b) Identifying risk factors for adverse dis-eases using dynamic rare association rulemining. Expert Syst Appl113:233–263

13. Borah A, Nath B (2018c) Rare association rule mining from incre-mental databases. In: Pattern analysis and applications, pp 1–22

14. Borah A, Nath B (2019a) Incremental rare pattern based approachfor identifying outliers in medical data. Appl Soft Comput85(105):824

15. Borah A, Nath B (2019b) Performance analysis of tree-basedapproaches for pattern mining. In: Computational intelligence indata mining, Springer, pp 435–448

16. Borah A, Nath B (2019c) Rare pattern mining: challenges andfuture perspectives. Complex & Intelligent Systems 5(1):1–23

17. Borah A, Nath B (2019d) Tree based frequent and rare pattern min-ing techniques: a comprehensive structural and empirical analysis.SN Appl Sci 1(9):972

18. Burdick D, Calimlim M, Gehrke J (2001) Mafia: A maximalfrequent itemset algorithm for transactional databases. In: DataEngineering, 2001. Proceedings. 17th International Conference on,IEEE, pp 443–452

19. Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

20. Goethals B (2003) Survey on frequent pattern mining. Univ ofHelsinki 19:840–852

21. Gouda K, Zaki MJ (2005) Genmax: an efficient algorithm for min-ing maximal frequent itemsets. Data Min Knowl Disc 11(3):223–242

22. Grahne G, Zhu J (2003a) Efficiently using prefix-trees in miningfrequent itemsets. In: FIMI, vol 90

23. Grahne G, Zhu J (2003b) High performance mining of maximalfrequent itemsets. In: 6th International workshop on high perfor-mance data mining

24. Han J, Pei J, Yin Y,MaoR (2004)Mining frequent patterns withoutcandidate generation: a frequent-pattern tree approach. Data MinKnowl Disc 8(1):53–87

25. Han M, Ding J, Li J (2017) Tdmcs: an efficient method for miningclosed frequent patterns over data streams based on time decaymodel. Int Arab J Inf Technol 14(6):851–860

26. Koh YS, Rountree N (2005) Finding sporadic rules using apriori-inverse. In: Advances in knowledge discovery and data mining,Springer, pp 97–106

27. Lee G, Yun U, Ryu KH (2014) Sliding window based weightedmaximal frequent pattern mining over data streams. Expert SystAppl 41(2):694–708

28. Li Y, Xu J, Yuan YH, Chen L (2017) A new closed frequent itemsetmining algorithm based on gpu and improved vertical structure.Concurr Comput Pract Expe 29(6):e3904

29. Liu B, HsuW,MaY (1999)Mining association rules with multipleminimum supports. In: Proceedings of the fifth ACM SIGKDDinternational conference onKnowledge discovery and datamining,ACM, pp 337–341

30. Min F, Zhang ZH, Zhai WJ, Shen RP (2020) Frequent pattern dis-covery with tri-partition alphabets. Inf Sci 507:715–732

31. Nam H, Yun U, Yoon E, Lin JCW (2020) Efficient approach forincremental weighted erasable pattern mining with list structure.Expert Syst Appl 143(113):087

32. Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discoveringfrequent closed itemsets for association rules. In: International con-ference on database theory, Springer, pp 398–416

33. Pei J, Han J, Mao R et al (2000) Closet: an efficient algorithmfor mining frequent closed itemsets. ACM SIGMOD workshop onresearch issues in data mining and knowledge discovery 4:21–30

34. Pei J, Han J, Lu H, Nishio S, Tang S, Yang D (2001) H-mine:Hyper-structure mining of frequent patterns in large databases. In:Data Mining, 2001. ICDM 2001, Proceedings IEEE InternationalConference on, IEEE, pp 441–448

35. Pyun G, Yun U, Ryu KH (2014) Efficient frequent pattern mining based on linear prefix tree. Knowl-Based Syst 55:125–139

36. Rodríguez-González AY, Lezama F, Iglesias-Alvarez CA, Martínez-Trinidad JF, Carrasco-Ochoa JA, de Cote EM (2018) Closed frequent similar pattern mining: reducing the number of frequent similar patterns without information loss. Expert Syst Appl 96:271–283




37. Shahbazi N, Soltani R, Gryz J, An A (2016) Building FP-tree on the fly: single-pass frequent itemset mining. In: Machine learning and data mining in pattern recognition, Springer, pp 387–400

38. Sinthuja M, Puviarasan N, Aruna P (2019) Mining frequent itemsets using proposed top-down approach based on linear prefix tree (TD-LP-Growth). In: International conference on computer networks and communication technologies, Springer, pp 23–32

39. Szathmary L, Napoli A, Valtchev P (2007) Towards rare itemset mining. In: Proceedings of the 19th IEEE international conference on tools with artificial intelligence (ICTAI 2007), IEEE, vol 1, pp 305–312

40. Szathmary L, Valtchev P, Napoli A (2010) Generating rare association rules using the minimal rare itemsets family. Int J Softw Inf (IJSI) 4(3):219–238

41. Tanbeer SK, Ahmed CF, Jeong BS, Lee YK (2009) Efficient single-pass frequent pattern mining using a prefix-tree. Inf Sci 179(5):559–583

42. Troiano L, Scibelli G, Birtolo C (2009) A fast algorithm for mining rare itemsets. In: 2009 Ninth international conference on intelligent systems design and applications, IEEE, pp 1149–1155

43. Tsang S, Koh YS, Dobbie G (2011) RP-Tree: rare pattern tree mining. In: Data warehousing and knowledge discovery, Springer, pp 277–288

44. Uno T, Asai T, Uchida Y, Arimura H (2003) LCM: an efficient algorithm for enumerating frequent closed item sets. In: FIMI, vol 90

45. Vo B, Pham S, Le T, Deng ZH (2017) A novel approach for mining maximal frequent patterns. Expert Syst Appl 73:178–186

46. Yahia SB, Hamrouni T, Nguifo EM (2006) Frequent closed itemset based algorithms: a thorough structural and analytical survey. ACM SIGKDD Explor Newsl 8(1):93–104

47. Yan D, Qu W, Guo G, Wang X (2020) PrefixFPM: a parallel framework for general-purpose frequent pattern mining. In: 2020 IEEE 36th international conference on data engineering (ICDE), IEEE, pp 1938–1941

48. Yun U, Lee G, Ryu KH (2014) Mining maximal frequent patterns by considering weight conditions over data streams. Knowl-Based Syst 55:49–65

49. Zaki MJ (1997) Fast mining of sequential patterns in very large databases. University of Rochester Computer Science Department, New York

50. Zaki MJ, Hsiao CJ (2002) CHARM: an efficient algorithm for closed itemset mining. In: SDM, SIAM, vol 2, pp 457–473

51. Zheng Z, Kohavi R, Mason L (2001) Real world performance of association rule algorithms. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 401–406

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
