Research Article Semantic Clustering of Search...

Research ArticleSemantic Clustering of Search Engine Results

Sara Saad Soliman1 Maged F El-Sayed2 and Yasser F Hassan1

1Department of Mathematics amp Computer Science Faculty of Science Alexandria University Alexandria 21511 Egypt2Department of Information Systems amp Computers Faculty of Commerce Alexandria University Alexandria 26516 Egypt

Correspondence should be addressed to Sara Saad Soliman sarasedhomyahoocom

Received 13 August 2015 Revised 11 December 2015 Accepted 15 December 2015

Academic Editor Chun-Wei Tsai

Copyright copy 2015 Sara Saad Soliman et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

This paper presents a novel approach for search engine results clustering that relies on the semantics of the retrieved documentsrather than the terms in those documents The proposed approach takes into consideration both lexical and semantics similaritiesamong documents and applies activation spreading technique in order to generate semantically meaningful clustersThis approachallows documents that are semantically similar to be clustered together rather than clustering documents based on similar termsA prototype is implemented and several experiments are conducted to test the prospered solution The result of the experimentconfirmed that the proposed solution achieves remarkable results in terms of precision

1 Introduction

Search engines are the main tool for searching and retrievinginformation from theWebWhen the user query is submittedto traditional search engines they return a list of resultssorted in a way that depends on the search engine algorithmHowever while traditional search engines are useful for well-articulated search queries they do not perform very wellwhen it comes to ambiguous queries which have more thanone meaning The result of an ambiguous query is typicallylarge and diverse making it hard for the typical user toanalyze and comprehend Comprehending such result mightrequire the user to analyze a large result that normallycontains irrelevant documents to reach for information ofinterest [1] Such searches are known as ldquolow precisionsearchesrdquo [2]

One way of helping users to find quickly what theyare looking for is to group the search results by topicsor categories The process of grouping documents is calledclustering where grouping is applied to a set of documentsso that documents belonging to the same cluster are similarand documents belonging to different clusters are dissimilarSearch results clustering can be defined as the process of auto-matically grouping results of the search into objective groups[2] Systems that perform Web search results clustering alsoknown as clustering engines have become popular in recent

years [3] Several commercial clustering engines have beenlaunched recently the most popular one among them is theVivisimo engine [4] Vivisimo won the ldquobest meta-searchengine awardrdquo assigned by Search Engine httpwatchcomfrom 2001 to 2003 [3]

The main contribution of this work is introducing a newsolution for clustering search engine result Unlikemost othersearch engine result clustering solutions our solution doesnot just rely on the specific terms in the retrieved documentsto compute similarities among documents and to performclustering accordingly Instead proposed solution performssimilarity comparisons and clustering based on the semanticsof the retrieved documents This is similar to what a humanwould do if asked to cluster a group of documents Thiscontributes largely to the quality of the resulting clusters asmeasured by the precision measure [5]

This paper is organized as follows Section 2 discussesrelated work Section 3 outlines the overall architecture of theproposedmethodology Section 4 gives proposed experimen-tal results Finally in Section 5 we conclude and describe ourvision for the future work

2 Related Work

Frequent ItemsetHierarchical Clustering (FIHC) [6] is a clus-tering technique of document which proposes the concept

Hindawi Publishing Corporatione Scientific World JournalVolume 2015 Article ID 931258 9 pageshttpdxdoiorg1011552015931258

2 The Scientific World Journal

of the frequent item sets used in data miningThe idea of thistechnique is that documents which share a set of words thatappear frequently are related and this is used to cluster doc-uments This technique improves the scalability by reducingthe dimensions by storing only the frequencies of the frequentarticles which occur in a certain minimum fraction of thedocuments in vectors of document TermRank [7] is a vari-ation of the PageRank algorithm that counts term frequencynot only by classic metrics of TF and TF times IDF but also byterm-to-term associations From eachWeb page the blocks inwhich the search keyword appears are retrieved Suffix TreeClustering (STC) [8] is a postretrieval document browsingtechnique (ie used in Grouper [9]) STC is an incrementaland linear time clustering algorithm that is based on identify-ing the phrases that are common to groups of documents andbuilding a suffix tree structure Semantic HierarchicalOnline Clustering (SHOC) [8] algorithm uses suffix arraysto extract frequent phrases and singular value decomposition(SVD) techniques to discover the cluster content Lingo [10]combines common phrase discovery and latent semanticindexing techniques to group search results into meaningfulgroups Lingo can create semantic descriptions by applyingthe cosine similarity equation and computing the similaritybetween frequent phrases and abstract concepts The systempresented in [11] consists of two separate phases The firstphase called ldquoIndexingrdquo builds an index to enable searchingThe second phase called ldquoRetrievalrdquo allows users to submitqueries and then uses the index to retrieve relevant docu-mentsThe result is clustered by using a SuffixTree Clusteringalgorithm [8] and the user is presented with the clusteringresults

ScatterGather [12] divides the data collection into a smallnumber of clusters the user selected clusters of interestand the system reclustered the indicated subcollection ofdocuments dynamically Vivisimo [4 13] is possibly the mostpopular commercial clustering search engine Vivisimo callssearch engines such as Yahoo and Google to extract relevantinformation (titles URLs and short descriptions) from theresult retrieved It groups documents in the retrieved resultbased on summarized informationTheVivisimo search clus-tering engine was sold to Yippy Inc in 2010 Grouper [9] usessnippets obtained by the search engines It is an interface forthe results of the Husky Search meta-search engine Grouperuses the Suffix Tree Clustering (STC) algorithm to clustertogether documents that have great common subphrasesCarrot2 [14] is a clustering search engine solution that usessearch results from various search engines including YahooGoogle and MSN It uses five different clustering algorithms(STC FussyAnts Lingo HAOG-STC and Rough k-means)where Lingo Algorithm is the default clustering algorithmusedThe output is a flat folder structure overlapping foldersare revealed when the user places themouse over a documenttitle The system presented in [15] is a meta-search clusteringengine called the Search Clustering System (SCS) whichorganizes the results returned by conventional Web searchengines into a cluster hierarchy The hierarchy is producedby the Cluster Hierarchy Construction Algorithm (CHCA)Unlike most other clustering algorithms CHCA operates onnominal data its input is a set of binary vectors representing

Web documents Document representations are based eitheron snippets or on the full contents of the retrieved pages

All of the above clustering engines except [15] use snippetsthat probably contain terms that are part of the querykeywords Snippets are not necessarily good representativeof the whole document contents which affects the quality ofthe clusters Proposed solution uses whole documents ratherthan titles and short snippets to ensure proper extraction ofthe semantics of the retrieved documents While all of theabove clustering engines have been mostly performed with-out explicit use of lexical semantics proposed work takes intoconsideration both lexical and semantics similarities Thisenables proposed system to provide better clustering quality

3 Overall Architecture ofthe Proposed Methodology

The framework of the proposed solution is shown in Figure 1The system receives the userrsquos query (119902) which is expressed interms of keywords The system performs the following steps

(i) Submitting the query to a search engine and receivingthe result

(ii) Preprocessing documents from the results andextracting features from each document

(iii) Enriching document features using ontology andconstructing semantic network to model the docu-ment

(iv) Applying spreading activation algorithm on the con-structed semantic network

(v) Computing the dissimilarity matrix among docu-ments using themost significant features representingthe retrieved documents as highlighted by spreadingactivation

(vi) Applying clustering algorithm on the similaritymatrix to obtain the clusters

Now we will describe in detail each of the steps ofproposed solution using a simple example to elaborate thesemantic clustering ability of our proposed solution whichsets it apart from term-based clustering solutions

Four simple documents are used 1198631 1198632 1198633 and 1198634Each of these four documents has only two terms (withfrequency one for each of them) After preprocessing thedocuments and extracting the features we got the following

1198631 = Apple 1Headphone 1

1198632 = Apple 1 diet 1

1198633 = Orange 1 diet 1

1198634 = Samsung galaxy 1Bluetooth 1

(1)

The similarity matrix between the four documents isformed using extracted terms in the same way as in thetraditional clustering approaches as shown in (lowast)

The Scientific World Journal 3

Query interface

Search engine

Query

Preprocessing andfeatures extraction

Search results

Semantic modeling

Ontology

Spreading activation

Similarity computation

Results clustering

Figure 1 System overview

Similarity Matrix among the Four Documents Consider

1198631 1198632 1198633 1198634

1198631

1198632

1198633

1198634

(

10

05

00

00

05

10

05

00

00

05

10

00

00

00

00

10

)

(lowast)

If we cluster the four documents using the computed similarlyvalues in (lowast) we get the following clusters

1198621 = 119863111986321198633

1198622 = 1198634

(2)

When we examine the term-based similarity values as shownin (lowast) we found that it is not acceptable from the perspectiveof the human being because of the following reasons

(i) While 1198631 and 1198632 are two documents that arereflecting two totally different domains the computedsimilarity value is 50

(ii) The similarity value between 1198631 and 1198634 is 0 whileboth documents are related to mobile phones

(iii) The similarity value between 1198632 and 1198633 is only 50while both documents are related to health and diet

Then the resulting clusters1198621 and1198622have very lowprecision0 for this simple example

Nowwewill discuss how proposed solution computes thesimilarity and clusters for the same four documents

31 Preprocessing and Feature Extraction The proposedsolution first extracts terms from each retrieved documentthrough tokenization and then removes stop words (eg aon be as ) from the token set Multiword phrases weretaken into consideration Next lemmatized terms are used

D1 D2

Apple1

Headphone1

Apple1

Diet1

Orange1

Diet1

Samsunggalaxy

1Bluetooth

1D3 D4

Figure 2 Initializing the graphs for the four documents

based on the WordNet [16] Finally the proposed solutioninitializes graph 119881 using the extracted features Each featurebecomes a node in 119881 and is annotated with the frequency ofthe term in 119889

119894 Figure 2 shows the initialization of the four

graph nodes representing the four documents in our runningexample

32 Feature Enrichment Using the ontology proposed solu-tion augments 119881 with concepts and relationships fromontology which are related to terms in 119881 This processenriches 119881 both lexically and semantically The concepts thatare added to 119881 are assigned frequency of zero Unlike manyother semantic systems that rely mainly on WordNet oursystem uses ontology that contains not only lexical terms andrelationships but also other semantic terms and relationshipsOur ontology can be depicted in the form of an enrichedgraph that could be considered as a semantic representationof the retrieved document In this graph terms with similarmeaning represent a concept Figures 3 4 5 and 6 show theenriched graphs

33 Spreading Activation The graph that we have con-structed in the last steps not only models the retrieveddocument but also contains additional lexically and seman-tically relevant concepts and relationships These addi-tional concepts and relationships help in linking features of


D1

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1

Apple 1

Fruit 0

Plant0

is-a

is-a

is-a

is-a

is-a

is-afor-ahas-a

Figure 3 Enriched graph for1198631

D2

Nutrient 0

Diet1

Company 0

Apple 1

Fruit 0

Plant0

has-a

is-a

is-a

is-a

has-a


the retrieved document The enriched graph does not yetcontain enough semantics to perform high quality clusteringbecause of two reasons

(i) Concepts and relationships that are added to thegraph are mainly related to the terms in the originaldocument and not filtered to the specific semanticsof the document Hence they could be contained ina graph representation of another document that hassimilar terms even if this document has differentsemantics

(ii) The weight of the added concepts is initially set tozero which does not reflect their relative importanceof those concepts to the semantics of the document

Proposed solution resolves the two issues above through twosteps

(a) Selecting of concepts and relationships in 119881 that issemantically relevant to the context of the documentThis is done through applying a shortest path algo-rithm [17] shown in Algorithm 1

(b) Adjusting the weights of concepts in 119881 to betterintegrate new information that is added to 119881 Thiswill affect the similarity computation in later step We

D3

Nutrient 0

Diet1

Orange1

Fruit 0

Plant0

has-a

is-a

is-a has-a


D4

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1Samsung galaxy

1

Mobile phone 0

is-ais-a

is-a

is-ais-a

for-ahas-a


perform that through a spreading activation process[18]

These two steps allow proposed solution to performsemantic clustering rather than term-based clustering Thisleads to better clustering result as will be shown in theexperiments section

Proposed system computes the shortest path using theFloyd-Warshall algorithm [17] The pseudocode of shortestpath procedure is shown in Algorithm 1 Figures 7 8 9 and10 show the nodes and relationships in shortest path for everygraph

After determining nodes and relationships in the shortestpath our system applies a spreading activation algorithm [18]that operates on nodes in the shortest path We consider thefrequencies on the graph nodes as initial activation values forthe spreading activation process

The main idea of spreading activation is to activate nodesand to propagate this activating from one node to othernodes while incrementing frequencies The pseudocode ofthe spreading activation procedure that proposed solutionuses is shown in Algorithm 2 The algorithm consists of thefollowing steps

(i) Initial nodes to be activated are placed in a priorityqueue


let dist be a |119881| times |119881| array of minimum distances initialized toinfin (infinity)for each vertex Vdist[V][V] larr 0

for each edge (119906 V)dist[119906][V] larr 119908(119906 V) the weight of the edge (119906 V)

for 119896 from 1 to |119881|for 119894 from 1 to |119881|for 119895 from 1 to |119881|if dist[119894][119895] gt dist[119894][119896] + dist[119896][119895]dist[119894][119895] larr dist[119894][119896] + dist[119896][119895]

end if

Algorithm 1 Shortest path

List SpreadingActivation (VertexPirorityQueue input)List outputActivationFunction activationFunwhile (inputisNotEmpty())

currVertex = inputRemoveMax()activation = activationFun (currVertex)currVerexVisited = truefor (every edge eOrig(e) == currVertex)

destVertex = egetDestination()deltaInput = activation lowast egetweigth()destVertexActivation += deltaInput

outputinsertVertex (currVertex)return output

Algorithm 2 Spreading activation

D1

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1

Apple 1

Figure 7 Shortest path for1198631

(ii) The current node spreads its activation value to itsneighbors Considering the source node as 119894 and thetarget node as 119895 spreading to the neighbors occursaccording to (3) where 119868 denotes input and119874 denotesoutput

119868119895 (119905 + 1) = 119868119895 (

119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)

D2

Nutrient 0

Diet1

Apple

1


D3

Nutrient

0

Diet1

Orange1


(iii) The contribution of 119894 is added to the current inputvalue of node 119895 Thus the algorithm rewards thosenodes which are reached through different paths byadding the contributions of all its neighbors Thiscontribution is obtained by multiplying the outputvalue of node 119894 (119874

119894(119905)) by the weight of the edge 119908

119894119895

(iv) The output of a node is given by the function 119874119894(119905)

The value 119908119894119895corresponds to the numerical weight of the

relationship obtained from the ontology proposed solutionuses in running example the weight is equal to 1 At the endthe result list contains the nodes which represent the result ofthe spreading activation process

Figures 11 12 13 and 14 show the execution of thespreading activation algorithm for the four examples


D4

Eaccessories0

Electronics0


1

Mobile phone 0


D1

Company 0 + 1 = 1

Brand 0

Headphone1

Apple 1

Product0

Electronics0 + 1 = 1

Eaccessories0 + 1 = 1

1 lowast 1

1 lowast 1

1 lowast 1

Figure 11 Spreading activation for1198631

D2

Nutrient 0 + 1 + 1 = 2

Diet1

Apple 1

1 lowast 1

1 lowast 1


The final frequency values of nodes in the four graphsafter spreading activation are shown in Figures 15 16 17 and18

34 Similarity Computation After applying the shortest pathalgorithm and the activation spreading algorithm the con-cepts with their frequencies in each semantic network graphare extracted to be used in the similarity comparison betweenevery two documents

D3

Nutrient 0 + 1 + 1 = 2

Diet1

Orange1

1 lowast 1

1 lowast 1


D4


Electronics0 + 1 + 1 = 2


1

Mobile phone 0 + 1 = 1

1 lowast 1

1 lowast 1

1 lowast 1 1 lowast 1


Proposed solution uses the cosine similarity function[19] to check the similarity between the extracted conceptsrepresenting two documents The cosine similarity functionis shown in

sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)

radicsum119905isin119904119908119904 (119905)2sdot radicsum119905isin119904119908119889 (119905)2

(4)

where 119904 119889 are the two documents119908119904(119905) is the weight of term

119905 in the 119904 document and119908119889(119905) is the weight of term 119905 in the 119889

documentEquation (lowastlowast) shows the calculated similarity between

every two documents based on the features and frequenciesobtained from our solution

Similarity Matrix as Computed by Proposed Solution Con-sider

1198631 1198632 1198633 1198634

1198631

1198632

1198633

1198634

(

10

018

00

064

018

10

083

00

00

083

10

00

064

00

00

10

)

(lowastlowast)


D1

Company 1

Brand 0

Headphone1

Apple 1

Product0

Electronics1

Eaccessories1

Figure 15 Frequencies after spreading activation for1198631

D2

Nutrient 2

Diet1

Apple1


D3

Nutrient 2

Diet1

Orange1


The similarity values computed by proposed solutionshown in (lowastlowast) reflect better results according to the seman-tics of the documents In particular consider the following

(i) 1198631 and1198632 similarity is 18 in our solution instead of50 in term-based solutions

(ii) 1198631 and 1198634 similarity is 64 in our solution insteadof 0 in term-based solutions

(iii) 1198632 and1198633 similarity is 83 in our solution instead of50 in term-based solutions

D4

Eaccessories1

Electronics2


1

Mobile phone 1


35 Results Clustering Proposed solution uses an agglomer-ative hierarchical clustering [20] which is a bottom-up clus-tering method The Euclidean distance that was used is thesimilarity measure In initialization each document is con-sidered as cluster Similar documents are merged into clusteruntil a termination condition is satisfied This conditioncould be reaching a certain number of clusters (119896) Theagglomerative hierarchical clustering algorithm is shown inAlgorithm 3

If proposed solution in this case clusters the four doc-uments using the computed similarly values in (lowastlowast) thesolution gets the following clusters

1198621 = 1198891 1198894

1198622 = 1198892 1198893

(5)

This clustering result has very high precision 100 for thissimple example and has major improvement over the resultas shown above when traditional clustering has been per-formed

4 Experimental Results

A prototype was built for testing proposed solution usingJava programming language The prototype performs searchengine result preprocessing feature extraction andmodelingontology enrichment spreading activation and similaritycomputation Protege [21] is used for building the ontologywhich has been used in the experiments We use Jena [22]as an API programmatic environment for querying RDFand OWL based data models that uses the SPARQL querylanguage [23 24]The agglomerative clustering algorithmwasimplemented using R software [25]

To compute the quality of the result we use the preci-sion measure which represents the percentage of positivepredictions by the system that is correct as shown in (6) Weuse human clustering as the reference for the correctness ofthe result clustering where three different people conducted


(1) Initialization cluster(11) Each object be a cluster(12) Creating similarity matrix(2) Clustering(21) Finding a pair of the most similar clusters and merging(22) Computing the distances between new cluster and others(23) Pruning and updating the similarity matrix(24) If the terminal condition is satisfied then output else repeating (21) to (23)(3) Clustered output

Algorithm 3 Clustering algorithm

Table 1 The first experiment values

Query PrecisionApple 95Paris 90Jaguar 90Hollywood 95Red Hot Chili Peppers 95Mac 85Snow Leopard 90Lion 80Tiger 85Mouse 95

the manual clustering and the results obtained from themwere validated

119875 (119862119895) =

10038161003816100381610038161003816119862119905

119895

10038161003816100381610038161003816

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

119875 =

sum119862119895isin119862119875 (119862119895)

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

sum119862119895

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

(6)

We have used ten different queries for testing namelyldquoApplerdquo ldquoParisrdquo ldquoJaguarrdquo ldquoHollywoodrdquo ldquoRed Hot Chili Pep-persrdquo ldquoMacrdquo ldquoSnow Leopardrdquo ldquoLionrdquo ldquoTigerrdquo and ldquoMouserdquoThe first 5 queries were also used for testing in [11] thus weuse them for comparison purposes

We ran two experiments the first one measures theprecision of the clustering resulting from our solution whenapplying all phases of the solution The second experimenttests the system without applying the spreading activationstep to determine the significance of spreading activation Inour experiments we limit documents to be clustered from theresult to 20 documents for each query and we set numbersof clusters to 5 We use httpswwwgooglecom as thesearch engine of choice for retrieving results related to ourexperiments

Table 1 shows precision values for the resulting clusters infirst experiment

The results reported in [11] for the first 5 queries are 575for the query ldquoApplerdquo 85 for the query ldquoParisrdquo 76 for

Table 2 The second experiment values


the query ldquoJaguarrdquo 86 for the query ldquoRed Hot ChiliPeppersrdquo and 865 for the query ldquoHollywoodrdquo

In the second experiment (running the system with-out spreading activation) the resulting precision values areshown in Table 2

Comparing these values to the values obtained in experi-ment 1 we conclude that activation spreading has contributedto large extent to the high precision results of our solution

As explained earlier the spreading activation algorithmstep gives the proposed solution the ability to performsimilarity comparison and to cluster the document on thesemantic level rather than on the syntax level which setsthe proposed solution apart from most other solutions Thisallows the proposed solution to function in a way that issimilar to a large extent to what a human will do if asked tocluster the documents

5 Conclusion

Searching the Web is a task that consumes too much timeand effort especially for ambiguity queries which have manymeanings WebPages clustering could help in reaching therequired documents that the user is searching for In thispaper a novel approach has been introduced for search resultsclustering that are based on the semantics of the retrieveddocuments rather than the syntax of the terms in those docu-mentsThis means that documents that are semantically sim-ilar are clustered together rather than clustering together doc-uments that just contain similar termsThe proposed solution


has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006

[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999

[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009

[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web

search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013

[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003

[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007

[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998

[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999

[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005

[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006

[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf

[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results

clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004

[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine

resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005

[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-

Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach

for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004

[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005

[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002

[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu

[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet

[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query

[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl

[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Distributed Sensor Networks


Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications


Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


ArtificialNeural Systems

Advances in


RoboticsJournal of



Computational Intelligence and Neuroscience

Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in



of the frequent item sets used in data miningThe idea of thistechnique is that documents which share a set of words thatappear frequently are related and this is used to cluster doc-uments This technique improves the scalability by reducingthe dimensions by storing only the frequencies of the frequentarticles which occur in a certain minimum fraction of thedocuments in vectors of document TermRank [7] is a vari-ation of the PageRank algorithm that counts term frequencynot only by classic metrics of TF and TF times IDF but also byterm-to-term associations From eachWeb page the blocks inwhich the search keyword appears are retrieved Suffix TreeClustering (STC) [8] is a postretrieval document browsingtechnique (ie used in Grouper [9]) STC is an incrementaland linear time clustering algorithm that is based on identify-ing the phrases that are common to groups of documents andbuilding a suffix tree structure Semantic HierarchicalOnline Clustering (SHOC) [8] algorithm uses suffix arraysto extract frequent phrases and singular value decomposition(SVD) techniques to discover the cluster content Lingo [10]combines common phrase discovery and latent semanticindexing techniques to group search results into meaningfulgroups Lingo can create semantic descriptions by applyingthe cosine similarity equation and computing the similaritybetween frequent phrases and abstract concepts The systempresented in [11] consists of two separate phases The firstphase called ldquoIndexingrdquo builds an index to enable searchingThe second phase called ldquoRetrievalrdquo allows users to submitqueries and then uses the index to retrieve relevant docu-mentsThe result is clustered by using a SuffixTree Clusteringalgorithm [8] and the user is presented with the clusteringresults

ScatterGather [12] divides the data collection into a smallnumber of clusters the user selected clusters of interestand the system reclustered the indicated subcollection ofdocuments dynamically Vivisimo [4 13] is possibly the mostpopular commercial clustering search engine Vivisimo callssearch engines such as Yahoo and Google to extract relevantinformation (titles URLs and short descriptions) from theresult retrieved It groups documents in the retrieved resultbased on summarized informationTheVivisimo search clus-tering engine was sold to Yippy Inc in 2010 Grouper [9] usessnippets obtained by the search engines It is an interface forthe results of the Husky Search meta-search engine Grouperuses the Suffix Tree Clustering (STC) algorithm to clustertogether documents that have great common subphrasesCarrot2 [14] is a clustering search engine solution that usessearch results from various search engines including YahooGoogle and MSN It uses five different clustering algorithms(STC FussyAnts Lingo HAOG-STC and Rough k-means)where Lingo Algorithm is the default clustering algorithmusedThe output is a flat folder structure overlapping foldersare revealed when the user places themouse over a documenttitle The system presented in [15] is a meta-search clusteringengine called the Search Clustering System (SCS) whichorganizes the results returned by conventional Web searchengines into a cluster hierarchy The hierarchy is producedby the Cluster Hierarchy Construction Algorithm (CHCA)Unlike most other clustering algorithms CHCA operates onnominal data its input is a set of binary vectors representing

Web documents Document representations are based eitheron snippets or on the full contents of the retrieved pages

All of the above clustering engines except [15] use snippetsthat probably contain terms that are part of the querykeywords Snippets are not necessarily good representativeof the whole document contents which affects the quality ofthe clusters Proposed solution uses whole documents ratherthan titles and short snippets to ensure proper extraction ofthe semantics of the retrieved documents While all of theabove clustering engines have been mostly performed with-out explicit use of lexical semantics proposed work takes intoconsideration both lexical and semantics similarities Thisenables proposed system to provide better clustering quality

3 Overall Architecture ofthe Proposed Methodology

The framework of the proposed solution is shown in Figure 1The system receives the userrsquos query (119902) which is expressed interms of keywords The system performs the following steps

(i) Submitting the query to a search engine and receivingthe result

(ii) Preprocessing documents from the results andextracting features from each document

(iii) Enriching document features using ontology andconstructing semantic network to model the docu-ment

(iv) Applying spreading activation algorithm on the con-structed semantic network

(v) Computing the dissimilarity matrix among docu-ments using themost significant features representingthe retrieved documents as highlighted by spreadingactivation

(vi) Applying clustering algorithm on the similaritymatrix to obtain the clusters

Now we will describe in detail each of the steps ofproposed solution using a simple example to elaborate thesemantic clustering ability of our proposed solution whichsets it apart from term-based clustering solutions

Four simple documents are used 1198631 1198632 1198633 and 1198634Each of these four documents has only two terms (withfrequency one for each of them) After preprocessing thedocuments and extracting the features we got the following

1198631 = Apple 1Headphone 1

1198632 = Apple 1 diet 1

1198633 = Orange 1 diet 1

1198634 = Samsung galaxy 1Bluetooth 1

(1)

The similarity matrix between the four documents isformed using extracted terms in the same way as in thetraditional clustering approaches as shown in (lowast)


Query interface

Search engine

Query


Search results

Semantic modeling

Ontology



Results clustering



1198631 1198632 1198633 1198634

1198631

1198632

1198633

1198634

(

10

05

00

00

05

10

05

00

00

05

10

00

00

00

00

10

)

(lowast)


1198621 = 119863111986321198633

1198622 = 1198634

(2)








D1 D2

Apple1

Headphone1

Apple1

Diet1

Orange1

Diet1

Samsunggalaxy

1Bluetooth

1D3 D4








D1

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1

Apple 1

Fruit 0

Plant0

is-a

is-a

is-a

is-a

is-a

is-afor-ahas-a


D2

Nutrient 0

Diet1

Company 0

Apple 1

Fruit 0

Plant0

has-a

is-a

is-a

is-a

has-a








D3

Nutrient 0

Diet1

Orange1

Fruit 0

Plant0

has-a

is-a

is-a has-a


D4

Product0

Eaccessories0

Electronics0

Brand 0

Company 0


1

Mobile phone 0

is-ais-a

is-a

is-ais-a

for-ahas-a












end if







D1

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1

Apple 1



119868119895 (119905 + 1) = 119868119895 (

119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)

D2

Nutrient 0

Diet1

Apple

1


D3

Nutrient

0

Diet1

Orange1




119894119895






D4

Eaccessories0

Electronics0


1

Mobile phone 0


D1

Company 0 + 1 = 1

Brand 0

Headphone1

Apple 1

Product0



1 lowast 1

1 lowast 1

1 lowast 1


D2

Nutrient 0 + 1 + 1 = 2

Diet1

Apple 1

1 lowast 1

1 lowast 1




D3

Nutrient 0 + 1 + 1 = 2

Diet1

Orange1

1 lowast 1

1 lowast 1


D4




1


1 lowast 1

1 lowast 1




sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)


(4)






1198631 1198632 1198633 1198634

1198631

1198632

1198633

1198634

(

10

018

00

064

018

10

083

00

00

083

10

00

064

00

00

10

)

(lowastlowast)


D1

Company 1

Brand 0

Headphone1

Apple 1

Product0

Electronics1

Eaccessories1


D2

Nutrient 2

Diet1

Apple1


D3

Nutrient 2

Diet1

Orange1






D4

Eaccessories1

Electronics2


1

Mobile phone 1




1198621 = 1198891 1198894

1198622 = 1198892 1198893

(5)











119875 (119862119895) =

10038161003816100381610038161003816119862119905

119895

10038161003816100381610038161003816

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

119875 =

sum119862119895isin119862119875 (119862119895)

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

sum119862119895

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

(6)











5 Conclusion






References


































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




Query interface

Search engine

Query


Search results

Semantic modeling

Ontology



Results clustering



1198631 1198632 1198633 1198634

1198631

1198632

1198633

1198634

(

10

05

00

00

05

10

05

00

00

05

10

00

00

00

00

10

)

(lowast)


1198621 = 119863111986321198633

1198622 = 1198634

(2)








D1 D2

Apple1

Headphone1

Apple1

Diet1

Orange1

Diet1

Samsunggalaxy

1Bluetooth

1D3 D4








D1

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1

Apple 1

Fruit 0

Plant0

is-a

is-a

is-a

is-a

is-a

is-afor-ahas-a


D2

Nutrient 0

Diet1

Company 0

Apple 1

Fruit 0

Plant0

has-a

is-a

is-a

is-a

has-a








D3

Nutrient 0

Diet1

Orange1

Fruit 0

Plant0

has-a

is-a

is-a has-a


D4

Product0

Eaccessories0

Electronics0

Brand 0

Company 0


1

Mobile phone 0

is-ais-a

is-a

is-ais-a

for-ahas-a












end if







D1

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1

Apple 1



119868119895 (119905 + 1) = 119868119895 (

119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)

D2

Nutrient 0

Diet1

Apple

1


D3

Nutrient

0

Diet1

Orange1




119894119895






D4

Eaccessories0

Electronics0


1

Mobile phone 0


D1

Company 0 + 1 = 1

Brand 0

Headphone1

Apple 1

Product0



1 lowast 1

1 lowast 1

1 lowast 1


D2

Nutrient 0 + 1 + 1 = 2

Diet1

Apple 1

1 lowast 1

1 lowast 1




D3

Nutrient 0 + 1 + 1 = 2

Diet1

Orange1

1 lowast 1

1 lowast 1


D4




1


1 lowast 1

1 lowast 1




sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)


(4)






1198631 1198632 1198633 1198634

1198631

1198632

1198633

1198634

(

10

018

00

064

018

10

083

00

00

083

10

00

064

00

00

10

)

(lowastlowast)


D1

Company 1

Brand 0

Headphone1

Apple 1

Product0

Electronics1

Eaccessories1


D2

Nutrient 2

Diet1

Apple1


D3

Nutrient 2

Diet1

Orange1






D4

Eaccessories1

Electronics2


1

Mobile phone 1




1198621 = 1198891 1198894

1198622 = 1198892 1198893

(5)











119875 (119862119895) =

10038161003816100381610038161003816119862119905

119895

10038161003816100381610038161003816

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

119875 =

sum119862119895isin119862119875 (119862119895)

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

sum119862119895

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

(6)











5 Conclusion






References


































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




D1

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1

Apple 1

Fruit 0

Plant0

is-a

is-a

is-a

is-a

is-a

is-afor-ahas-a


D2

Nutrient 0

Diet1

Company 0

Apple 1

Fruit 0

Plant0

has-a

is-a

is-a

is-a

has-a








D3

Nutrient 0

Diet1

Orange1

Fruit 0

Plant0

has-a

is-a

is-a has-a


D4

Product0

Eaccessories0

Electronics0

Brand 0

Company 0


1

Mobile phone 0

is-ais-a

is-a

is-ais-a

for-ahas-a












end if







D1

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1

Apple 1



119868119895 (119905 + 1) = 119868119895 (

119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)

D2

Nutrient 0

Diet1

Apple

1


D3

Nutrient

0

Diet1

Orange1




119894119895






D4

Eaccessories0

Electronics0


1

Mobile phone 0


D1

Company 0 + 1 = 1

Brand 0

Headphone1

Apple 1

Product0



1 lowast 1

1 lowast 1

1 lowast 1


D2

Nutrient 0 + 1 + 1 = 2

Diet1

Apple 1

1 lowast 1

1 lowast 1




D3

Nutrient 0 + 1 + 1 = 2

Diet1

Orange1

1 lowast 1

1 lowast 1


D4




1


1 lowast 1

1 lowast 1




sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)


(4)






1198631 1198632 1198633 1198634

1198631

1198632

1198633

1198634

(

10

018

00

064

018

10

083

00

00

083

10

00

064

00

00

10

)

(lowastlowast)


D1

Company 1

Brand 0

Headphone1

Apple 1

Product0

Electronics1

Eaccessories1


D2

Nutrient 2

Diet1

Apple1


D3

Nutrient 2

Diet1

Orange1






D4

Eaccessories1

Electronics2


1

Mobile phone 1




1198621 = 1198891 1198894

1198622 = 1198892 1198893

(5)











119875 (119862119895) =

10038161003816100381610038161003816119862119905

119895

10038161003816100381610038161003816

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

119875 =

sum119862119895isin119862119875 (119862119895)

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

sum119862119895

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

(6)











5 Conclusion






References


































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in







end if







D1

Product0

Eaccessories0

Electronics0

Brand 0

Company 0

Headphone1

Apple 1



119868119895 (119905 + 1) = 119868119895 (

119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)

D2

Nutrient 0

Diet1

Apple

1


D3

Nutrient

0

Diet1

Orange1




119894119895






D4

Eaccessories0

Electronics0


1

Mobile phone 0


D1

Company 0 + 1 = 1

Brand 0

Headphone1

Apple 1

Product0



1 lowast 1

1 lowast 1

1 lowast 1


D2

Nutrient 0 + 1 + 1 = 2

Diet1

Apple 1

1 lowast 1

1 lowast 1




D3

Nutrient 0 + 1 + 1 = 2

Diet1

Orange1

1 lowast 1

1 lowast 1


D4




1


1 lowast 1

1 lowast 1




sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)


(4)






1198631 1198632 1198633 1198634

1198631

1198632

1198633

1198634

(

10

018

00

064

018

10

083

00

00

083

10

00

064

00

00

10

)

(lowastlowast)


D1

Company 1

Brand 0

Headphone1

Apple 1

Product0

Electronics1

Eaccessories1


D2

Nutrient 2

Diet1

Apple1


D3

Nutrient 2

Diet1

Orange1






D4

Eaccessories1

Electronics2


1

Mobile phone 1




1198621 = 1198891 1198894

1198622 = 1198892 1198893

(5)











119875 (119862119895) =

10038161003816100381610038161003816119862119905

119895

10038161003816100381610038161003816

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

119875 =

sum119862119895isin119862119875 (119862119895)

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

sum119862119895

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

(6)











5 Conclusion






References


































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




D4

Eaccessories0

Electronics0


1

Mobile phone 0


D1

Company 0 + 1 = 1

Brand 0

Headphone1

Apple 1

Product0



1 lowast 1

1 lowast 1

1 lowast 1


D2

Nutrient 0 + 1 + 1 = 2

Diet1

Apple 1

1 lowast 1

1 lowast 1




D3

Nutrient 0 + 1 + 1 = 2

Diet1

Orange1

1 lowast 1

1 lowast 1


D4




1


1 lowast 1

1 lowast 1




sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)


(4)






1198631 1198632 1198633 1198634

1198631

1198632

1198633

1198634

(

10

018

00

064

018

10

083

00

00

083

10

00

064

00

00

10

)

(lowastlowast)


D1

Company 1

Brand 0

Headphone1

Apple 1

Product0

Electronics1

Eaccessories1


D2

Nutrient 2

Diet1

Apple1


D3

Nutrient 2

Diet1

Orange1






D4

Eaccessories1

Electronics2


1

Mobile phone 1




1198621 = 1198891 1198894

1198622 = 1198892 1198893

(5)











119875 (119862119895) =

10038161003816100381610038161003816119862119905

119895

10038161003816100381610038161003816

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

119875 =

sum119862119895isin119862119875 (119862119895)

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

sum119862119895

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

(6)











5 Conclusion






References


































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




D1

Company 1

Brand 0

Headphone1

Apple 1

Product0

Electronics1

Eaccessories1


D2

Nutrient 2

Diet1

Apple1


D3

Nutrient 2

Diet1

Orange1






D4

Eaccessories1

Electronics2


1

Mobile phone 1




1198621 = 1198891 1198894

1198622 = 1198892 1198893

(5)











119875 (119862119895) =

10038161003816100381610038161003816119862119905

119895

10038161003816100381610038161003816

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

119875 =

sum119862119895isin119862119875 (119862119895)

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

sum119862119895

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

(6)











5 Conclusion






References


































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in









119875 (119862119895) =

10038161003816100381610038161003816119862119905

119895

10038161003816100381610038161003816

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

119875 =

sum119862119895isin119862119875 (119862119895)

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

sum119862119895

10038161003816100381610038161003816119862119895

10038161003816100381610038161003816

(6)











5 Conclusion






References


































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in







References


































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in










Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in



Research Article Semantic Clustering of Search...

Documents

Transcript of Research Article Semantic Clustering of Search...