Research Article Semantic Clustering of Search...
Transcript of Research Article Semantic Clustering of Search...
Research ArticleSemantic Clustering of Search Engine Results
Sara Saad Soliman1 Maged F El-Sayed2 and Yasser F Hassan1
1Department of Mathematics amp Computer Science Faculty of Science Alexandria University Alexandria 21511 Egypt2Department of Information Systems amp Computers Faculty of Commerce Alexandria University Alexandria 26516 Egypt
Correspondence should be addressed to Sara Saad Soliman sarasedhomyahoocom
Received 13 August 2015 Revised 11 December 2015 Accepted 15 December 2015
Academic Editor Chun-Wei Tsai
Copyright copy 2015 Sara Saad Soliman et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited
This paper presents a novel approach for search engine results clustering that relies on the semantics of the retrieved documentsrather than the terms in those documents The proposed approach takes into consideration both lexical and semantics similaritiesamong documents and applies activation spreading technique in order to generate semantically meaningful clustersThis approachallows documents that are semantically similar to be clustered together rather than clustering documents based on similar termsA prototype is implemented and several experiments are conducted to test the prospered solution The result of the experimentconfirmed that the proposed solution achieves remarkable results in terms of precision
1 Introduction
Search engines are the main tool for searching and retrievinginformation from theWebWhen the user query is submittedto traditional search engines they return a list of resultssorted in a way that depends on the search engine algorithmHowever while traditional search engines are useful for well-articulated search queries they do not perform very wellwhen it comes to ambiguous queries which have more thanone meaning The result of an ambiguous query is typicallylarge and diverse making it hard for the typical user toanalyze and comprehend Comprehending such result mightrequire the user to analyze a large result that normallycontains irrelevant documents to reach for information ofinterest [1] Such searches are known as ldquolow precisionsearchesrdquo [2]
One way of helping users to find quickly what theyare looking for is to group the search results by topicsor categories The process of grouping documents is calledclustering where grouping is applied to a set of documentsso that documents belonging to the same cluster are similarand documents belonging to different clusters are dissimilarSearch results clustering can be defined as the process of auto-matically grouping results of the search into objective groups[2] Systems that perform Web search results clustering alsoknown as clustering engines have become popular in recent
years [3] Several commercial clustering engines have beenlaunched recently the most popular one among them is theVivisimo engine [4] Vivisimo won the ldquobest meta-searchengine awardrdquo assigned by Search Engine httpwatchcomfrom 2001 to 2003 [3]
The main contribution of this work is introducing a newsolution for clustering search engine result Unlikemost othersearch engine result clustering solutions our solution doesnot just rely on the specific terms in the retrieved documentsto compute similarities among documents and to performclustering accordingly Instead proposed solution performssimilarity comparisons and clustering based on the semanticsof the retrieved documents This is similar to what a humanwould do if asked to cluster a group of documents Thiscontributes largely to the quality of the resulting clusters asmeasured by the precision measure [5]
This paper is organized as follows Section 2 discussesrelated work Section 3 outlines the overall architecture of theproposedmethodology Section 4 gives proposed experimen-tal results Finally in Section 5 we conclude and describe ourvision for the future work
2 Related Work
Frequent ItemsetHierarchical Clustering (FIHC) [6] is a clus-tering technique of document which proposes the concept
Hindawi Publishing Corporatione Scientific World JournalVolume 2015 Article ID 931258 9 pageshttpdxdoiorg1011552015931258
2 The Scientific World Journal
of the frequent item sets used in data miningThe idea of thistechnique is that documents which share a set of words thatappear frequently are related and this is used to cluster doc-uments This technique improves the scalability by reducingthe dimensions by storing only the frequencies of the frequentarticles which occur in a certain minimum fraction of thedocuments in vectors of document TermRank [7] is a vari-ation of the PageRank algorithm that counts term frequencynot only by classic metrics of TF and TF times IDF but also byterm-to-term associations From eachWeb page the blocks inwhich the search keyword appears are retrieved Suffix TreeClustering (STC) [8] is a postretrieval document browsingtechnique (ie used in Grouper [9]) STC is an incrementaland linear time clustering algorithm that is based on identify-ing the phrases that are common to groups of documents andbuilding a suffix tree structure Semantic HierarchicalOnline Clustering (SHOC) [8] algorithm uses suffix arraysto extract frequent phrases and singular value decomposition(SVD) techniques to discover the cluster content Lingo [10]combines common phrase discovery and latent semanticindexing techniques to group search results into meaningfulgroups Lingo can create semantic descriptions by applyingthe cosine similarity equation and computing the similaritybetween frequent phrases and abstract concepts The systempresented in [11] consists of two separate phases The firstphase called ldquoIndexingrdquo builds an index to enable searchingThe second phase called ldquoRetrievalrdquo allows users to submitqueries and then uses the index to retrieve relevant docu-mentsThe result is clustered by using a SuffixTree Clusteringalgorithm [8] and the user is presented with the clusteringresults
ScatterGather [12] divides the data collection into a smallnumber of clusters the user selected clusters of interestand the system reclustered the indicated subcollection ofdocuments dynamically Vivisimo [4 13] is possibly the mostpopular commercial clustering search engine Vivisimo callssearch engines such as Yahoo and Google to extract relevantinformation (titles URLs and short descriptions) from theresult retrieved It groups documents in the retrieved resultbased on summarized informationTheVivisimo search clus-tering engine was sold to Yippy Inc in 2010 Grouper [9] usessnippets obtained by the search engines It is an interface forthe results of the Husky Search meta-search engine Grouperuses the Suffix Tree Clustering (STC) algorithm to clustertogether documents that have great common subphrasesCarrot2 [14] is a clustering search engine solution that usessearch results from various search engines including YahooGoogle and MSN It uses five different clustering algorithms(STC FussyAnts Lingo HAOG-STC and Rough k-means)where Lingo Algorithm is the default clustering algorithmusedThe output is a flat folder structure overlapping foldersare revealed when the user places themouse over a documenttitle The system presented in [15] is a meta-search clusteringengine called the Search Clustering System (SCS) whichorganizes the results returned by conventional Web searchengines into a cluster hierarchy The hierarchy is producedby the Cluster Hierarchy Construction Algorithm (CHCA)Unlike most other clustering algorithms CHCA operates onnominal data its input is a set of binary vectors representing
Web documents Document representations are based eitheron snippets or on the full contents of the retrieved pages
All of the above clustering engines except [15] use snippetsthat probably contain terms that are part of the querykeywords Snippets are not necessarily good representativeof the whole document contents which affects the quality ofthe clusters Proposed solution uses whole documents ratherthan titles and short snippets to ensure proper extraction ofthe semantics of the retrieved documents While all of theabove clustering engines have been mostly performed with-out explicit use of lexical semantics proposed work takes intoconsideration both lexical and semantics similarities Thisenables proposed system to provide better clustering quality
3 Overall Architecture ofthe Proposed Methodology
The framework of the proposed solution is shown in Figure 1The system receives the userrsquos query (119902) which is expressed interms of keywords The system performs the following steps
(i) Submitting the query to a search engine and receivingthe result
(ii) Preprocessing documents from the results andextracting features from each document
(iii) Enriching document features using ontology andconstructing semantic network to model the docu-ment
(iv) Applying spreading activation algorithm on the con-structed semantic network
(v) Computing the dissimilarity matrix among docu-ments using themost significant features representingthe retrieved documents as highlighted by spreadingactivation
(vi) Applying clustering algorithm on the similaritymatrix to obtain the clusters
Now we will describe in detail each of the steps ofproposed solution using a simple example to elaborate thesemantic clustering ability of our proposed solution whichsets it apart from term-based clustering solutions
Four simple documents are used 1198631 1198632 1198633 and 1198634Each of these four documents has only two terms (withfrequency one for each of them) After preprocessing thedocuments and extracting the features we got the following
1198631 = Apple 1Headphone 1
1198632 = Apple 1 diet 1
1198633 = Orange 1 diet 1
1198634 = Samsung galaxy 1Bluetooth 1
(1)
The similarity matrix between the four documents isformed using extracted terms in the same way as in thetraditional clustering approaches as shown in (lowast)
The Scientific World Journal 3
Query interface
Search engine
Query
Preprocessing andfeatures extraction
Search results
Semantic modeling
Ontology
Spreading activation
Similarity computation
Results clustering
Figure 1 System overview
Similarity Matrix among the Four Documents Consider
1198631 1198632 1198633 1198634
1198631
1198632
1198633
1198634
(
10
05
00
00
05
10
05
00
00
05
10
00
00
00
00
10
)
(lowast)
If we cluster the four documents using the computed similarlyvalues in (lowast) we get the following clusters
1198621 = 119863111986321198633
1198622 = 1198634
(2)
When we examine the term-based similarity values as shownin (lowast) we found that it is not acceptable from the perspectiveof the human being because of the following reasons
(i) While 1198631 and 1198632 are two documents that arereflecting two totally different domains the computedsimilarity value is 50
(ii) The similarity value between 1198631 and 1198634 is 0 whileboth documents are related to mobile phones
(iii) The similarity value between 1198632 and 1198633 is only 50while both documents are related to health and diet
Then the resulting clusters1198621 and1198622have very lowprecision0 for this simple example
Nowwewill discuss how proposed solution computes thesimilarity and clusters for the same four documents
31 Preprocessing and Feature Extraction The proposedsolution first extracts terms from each retrieved documentthrough tokenization and then removes stop words (eg aon be as ) from the token set Multiword phrases weretaken into consideration Next lemmatized terms are used
D1 D2
Apple1
Headphone1
Apple1
Diet1
Orange1
Diet1
Samsunggalaxy
1Bluetooth
1D3 D4
Figure 2 Initializing the graphs for the four documents
based on the WordNet [16] Finally the proposed solutioninitializes graph 119881 using the extracted features Each featurebecomes a node in 119881 and is annotated with the frequency ofthe term in 119889
119894 Figure 2 shows the initialization of the four
graph nodes representing the four documents in our runningexample
32 Feature Enrichment Using the ontology proposed solu-tion augments 119881 with concepts and relationships fromontology which are related to terms in 119881 This processenriches 119881 both lexically and semantically The concepts thatare added to 119881 are assigned frequency of zero Unlike manyother semantic systems that rely mainly on WordNet oursystem uses ontology that contains not only lexical terms andrelationships but also other semantic terms and relationshipsOur ontology can be depicted in the form of an enrichedgraph that could be considered as a semantic representationof the retrieved document In this graph terms with similarmeaning represent a concept Figures 3 4 5 and 6 show theenriched graphs
33 Spreading Activation The graph that we have con-structed in the last steps not only models the retrieveddocument but also contains additional lexically and seman-tically relevant concepts and relationships These addi-tional concepts and relationships help in linking features of
4 The Scientific World Journal
D1
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1
Apple 1
Fruit 0
Plant0
is-a
is-a
is-a
is-a
is-a
is-afor-ahas-a
Figure 3 Enriched graph for1198631
D2
Nutrient 0
Diet1
Company 0
Apple 1
Fruit 0
Plant0
has-a
is-a
is-a
is-a
has-a
Figure 4 Enriched graph for1198632
the retrieved document The enriched graph does not yetcontain enough semantics to perform high quality clusteringbecause of two reasons
(i) Concepts and relationships that are added to thegraph are mainly related to the terms in the originaldocument and not filtered to the specific semanticsof the document Hence they could be contained ina graph representation of another document that hassimilar terms even if this document has differentsemantics
(ii) The weight of the added concepts is initially set tozero which does not reflect their relative importanceof those concepts to the semantics of the document
Proposed solution resolves the two issues above through twosteps
(a) Selecting of concepts and relationships in 119881 that issemantically relevant to the context of the documentThis is done through applying a shortest path algo-rithm [17] shown in Algorithm 1
(b) Adjusting the weights of concepts in 119881 to betterintegrate new information that is added to 119881 Thiswill affect the similarity computation in later step We
D3
Nutrient 0
Diet1
Orange1
Fruit 0
Plant0
has-a
is-a
is-a has-a
Figure 5 Enriched graph for1198633
D4
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1Samsung galaxy
1
Mobile phone 0
is-ais-a
is-a
is-ais-a
for-ahas-a
Figure 6 Enriched graph for1198634
perform that through a spreading activation process[18]
These two steps allow proposed solution to performsemantic clustering rather than term-based clustering Thisleads to better clustering result as will be shown in theexperiments section
Proposed system computes the shortest path using theFloyd-Warshall algorithm [17] The pseudocode of shortestpath procedure is shown in Algorithm 1 Figures 7 8 9 and10 show the nodes and relationships in shortest path for everygraph
After determining nodes and relationships in the shortestpath our system applies a spreading activation algorithm [18]that operates on nodes in the shortest path We consider thefrequencies on the graph nodes as initial activation values forthe spreading activation process
The main idea of spreading activation is to activate nodesand to propagate this activating from one node to othernodes while incrementing frequencies The pseudocode ofthe spreading activation procedure that proposed solutionuses is shown in Algorithm 2 The algorithm consists of thefollowing steps
(i) Initial nodes to be activated are placed in a priorityqueue
The Scientific World Journal 5
let dist be a |119881| times |119881| array of minimum distances initialized toinfin (infinity)for each vertex Vdist[V][V] larr 0
for each edge (119906 V)dist[119906][V] larr 119908(119906 V) the weight of the edge (119906 V)
for 119896 from 1 to |119881|for 119894 from 1 to |119881|for 119895 from 1 to |119881|if dist[119894][119895] gt dist[119894][119896] + dist[119896][119895]dist[119894][119895] larr dist[119894][119896] + dist[119896][119895]
end if
Algorithm 1 Shortest path
List SpreadingActivation (VertexPirorityQueue input)List outputActivationFunction activationFunwhile (inputisNotEmpty())
currVertex = inputRemoveMax()activation = activationFun (currVertex)currVerexVisited = truefor (every edge eOrig(e) == currVertex)
destVertex = egetDestination()deltaInput = activation lowast egetweigth()destVertexActivation += deltaInput
outputinsertVertex (currVertex)return output
Algorithm 2 Spreading activation
D1
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1
Apple 1
Figure 7 Shortest path for1198631
(ii) The current node spreads its activation value to itsneighbors Considering the source node as 119894 and thetarget node as 119895 spreading to the neighbors occursaccording to (3) where 119868 denotes input and119874 denotesoutput
119868119895 (119905 + 1) = 119868119895 (
119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)
D2
Nutrient 0
Diet1
Apple
1
Figure 8 Shortest path for1198632
D3
Nutrient
0
Diet1
Orange1
Figure 9 Shortest path for1198633
(iii) The contribution of 119894 is added to the current inputvalue of node 119895 Thus the algorithm rewards thosenodes which are reached through different paths byadding the contributions of all its neighbors Thiscontribution is obtained by multiplying the outputvalue of node 119894 (119874
119894(119905)) by the weight of the edge 119908
119894119895
(iv) The output of a node is given by the function 119874119894(119905)
The value 119908119894119895corresponds to the numerical weight of the
relationship obtained from the ontology proposed solutionuses in running example the weight is equal to 1 At the endthe result list contains the nodes which represent the result ofthe spreading activation process
Figures 11 12 13 and 14 show the execution of thespreading activation algorithm for the four examples
6 The Scientific World Journal
D4
Eaccessories0
Electronics0
Headphone1Samsung galaxy
1
Mobile phone 0
Figure 10 Shortest path for1198634
D1
Company 0 + 1 = 1
Brand 0
Headphone1
Apple 1
Product0
Electronics0 + 1 = 1
Eaccessories0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1
Figure 11 Spreading activation for1198631
D2
Nutrient 0 + 1 + 1 = 2
Diet1
Apple 1
1 lowast 1
1 lowast 1
Figure 12 Spreading activation for1198632
The final frequency values of nodes in the four graphsafter spreading activation are shown in Figures 15 16 17 and18
34 Similarity Computation After applying the shortest pathalgorithm and the activation spreading algorithm the con-cepts with their frequencies in each semantic network graphare extracted to be used in the similarity comparison betweenevery two documents
D3
Nutrient 0 + 1 + 1 = 2
Diet1
Orange1
1 lowast 1
1 lowast 1
Figure 13 Spreading activation for1198633
D4
Eaccessories0 + 1 = 1
Electronics0 + 1 + 1 = 2
Headphone1Samsung galaxy
1
Mobile phone 0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1 1 lowast 1
Figure 14 Spreading activation for1198634
Proposed solution uses the cosine similarity function[19] to check the similarity between the extracted conceptsrepresenting two documents The cosine similarity functionis shown in
sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)
radicsum119905isin119904119908119904 (119905)2sdot radicsum119905isin119904119908119889 (119905)2
(4)
where 119904 119889 are the two documents119908119904(119905) is the weight of term
119905 in the 119904 document and119908119889(119905) is the weight of term 119905 in the 119889
documentEquation (lowastlowast) shows the calculated similarity between
every two documents based on the features and frequenciesobtained from our solution
Similarity Matrix as Computed by Proposed Solution Con-sider
1198631 1198632 1198633 1198634
1198631
1198632
1198633
1198634
(
10
018
00
064
018
10
083
00
00
083
10
00
064
00
00
10
)
(lowastlowast)
The Scientific World Journal 7
D1
Company 1
Brand 0
Headphone1
Apple 1
Product0
Electronics1
Eaccessories1
Figure 15 Frequencies after spreading activation for1198631
D2
Nutrient 2
Diet1
Apple1
Figure 16 Frequencies after spreading activation for1198632
D3
Nutrient 2
Diet1
Orange1
Figure 17 Frequencies after spreading activation for1198633
The similarity values computed by proposed solutionshown in (lowastlowast) reflect better results according to the seman-tics of the documents In particular consider the following
(i) 1198631 and1198632 similarity is 18 in our solution instead of50 in term-based solutions
(ii) 1198631 and 1198634 similarity is 64 in our solution insteadof 0 in term-based solutions
(iii) 1198632 and1198633 similarity is 83 in our solution instead of50 in term-based solutions
D4
Eaccessories1
Electronics2
Headphone1Samsung galaxy
1
Mobile phone 1
Figure 18 Frequencies after spreading activation for1198634
35 Results Clustering Proposed solution uses an agglomer-ative hierarchical clustering [20] which is a bottom-up clus-tering method The Euclidean distance that was used is thesimilarity measure In initialization each document is con-sidered as cluster Similar documents are merged into clusteruntil a termination condition is satisfied This conditioncould be reaching a certain number of clusters (119896) Theagglomerative hierarchical clustering algorithm is shown inAlgorithm 3
If proposed solution in this case clusters the four doc-uments using the computed similarly values in (lowastlowast) thesolution gets the following clusters
1198621 = 1198891 1198894
1198622 = 1198892 1198893
(5)
This clustering result has very high precision 100 for thissimple example and has major improvement over the resultas shown above when traditional clustering has been per-formed
4 Experimental Results
A prototype was built for testing proposed solution usingJava programming language The prototype performs searchengine result preprocessing feature extraction andmodelingontology enrichment spreading activation and similaritycomputation Protege [21] is used for building the ontologywhich has been used in the experiments We use Jena [22]as an API programmatic environment for querying RDFand OWL based data models that uses the SPARQL querylanguage [23 24]The agglomerative clustering algorithmwasimplemented using R software [25]
To compute the quality of the result we use the preci-sion measure which represents the percentage of positivepredictions by the system that is correct as shown in (6) Weuse human clustering as the reference for the correctness ofthe result clustering where three different people conducted
8 The Scientific World Journal
(1) Initialization cluster(11) Each object be a cluster(12) Creating similarity matrix(2) Clustering(21) Finding a pair of the most similar clusters and merging(22) Computing the distances between new cluster and others(23) Pruning and updating the similarity matrix(24) If the terminal condition is satisfied then output else repeating (21) to (23)(3) Clustered output
Algorithm 3 Clustering algorithm
Table 1 The first experiment values
Query PrecisionApple 95Paris 90Jaguar 90Hollywood 95Red Hot Chili Peppers 95Mac 85Snow Leopard 90Lion 80Tiger 85Mouse 95
the manual clustering and the results obtained from themwere validated
119875 (119862119895) =
10038161003816100381610038161003816119862119905
119895
10038161003816100381610038161003816
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
119875 =
sum119862119895isin119862119875 (119862119895)
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
sum119862119895
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
(6)
We have used ten different queries for testing namelyldquoApplerdquo ldquoParisrdquo ldquoJaguarrdquo ldquoHollywoodrdquo ldquoRed Hot Chili Pep-persrdquo ldquoMacrdquo ldquoSnow Leopardrdquo ldquoLionrdquo ldquoTigerrdquo and ldquoMouserdquoThe first 5 queries were also used for testing in [11] thus weuse them for comparison purposes
We ran two experiments the first one measures theprecision of the clustering resulting from our solution whenapplying all phases of the solution The second experimenttests the system without applying the spreading activationstep to determine the significance of spreading activation Inour experiments we limit documents to be clustered from theresult to 20 documents for each query and we set numbersof clusters to 5 We use httpswwwgooglecom as thesearch engine of choice for retrieving results related to ourexperiments
Table 1 shows precision values for the resulting clusters infirst experiment
The results reported in [11] for the first 5 queries are 575for the query ldquoApplerdquo 85 for the query ldquoParisrdquo 76 for
Table 2 The second experiment values
Query PrecisionApple 75Paris 80Jaguar 60Hollywood 40Red Hot Chili Peppers 75Mac 75Snow Leopard 80Lion 70Tiger 60Mouse 85
the query ldquoJaguarrdquo 86 for the query ldquoRed Hot ChiliPeppersrdquo and 865 for the query ldquoHollywoodrdquo
In the second experiment (running the system with-out spreading activation) the resulting precision values areshown in Table 2
Comparing these values to the values obtained in experi-ment 1 we conclude that activation spreading has contributedto large extent to the high precision results of our solution
As explained earlier the spreading activation algorithmstep gives the proposed solution the ability to performsimilarity comparison and to cluster the document on thesemantic level rather than on the syntax level which setsthe proposed solution apart from most other solutions Thisallows the proposed solution to function in a way that issimilar to a large extent to what a human will do if asked tocluster the documents
5 Conclusion
Searching the Web is a task that consumes too much timeand effort especially for ambiguity queries which have manymeanings WebPages clustering could help in reaching therequired documents that the user is searching for In thispaper a novel approach has been introduced for search resultsclustering that are based on the semantics of the retrieveddocuments rather than the syntax of the terms in those docu-mentsThis means that documents that are semantically sim-ilar are clustered together rather than clustering together doc-uments that just contain similar termsThe proposed solution
The Scientific World Journal 9
has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006
[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999
[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009
[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web
search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013
[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003
[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007
[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998
[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999
[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005
[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006
[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf
[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results
clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004
[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine
resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005
[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-
Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach
for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004
[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005
[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002
[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu
[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet
[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query
[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl
[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
2 The Scientific World Journal
of the frequent item sets used in data miningThe idea of thistechnique is that documents which share a set of words thatappear frequently are related and this is used to cluster doc-uments This technique improves the scalability by reducingthe dimensions by storing only the frequencies of the frequentarticles which occur in a certain minimum fraction of thedocuments in vectors of document TermRank [7] is a vari-ation of the PageRank algorithm that counts term frequencynot only by classic metrics of TF and TF times IDF but also byterm-to-term associations From eachWeb page the blocks inwhich the search keyword appears are retrieved Suffix TreeClustering (STC) [8] is a postretrieval document browsingtechnique (ie used in Grouper [9]) STC is an incrementaland linear time clustering algorithm that is based on identify-ing the phrases that are common to groups of documents andbuilding a suffix tree structure Semantic HierarchicalOnline Clustering (SHOC) [8] algorithm uses suffix arraysto extract frequent phrases and singular value decomposition(SVD) techniques to discover the cluster content Lingo [10]combines common phrase discovery and latent semanticindexing techniques to group search results into meaningfulgroups Lingo can create semantic descriptions by applyingthe cosine similarity equation and computing the similaritybetween frequent phrases and abstract concepts The systempresented in [11] consists of two separate phases The firstphase called ldquoIndexingrdquo builds an index to enable searchingThe second phase called ldquoRetrievalrdquo allows users to submitqueries and then uses the index to retrieve relevant docu-mentsThe result is clustered by using a SuffixTree Clusteringalgorithm [8] and the user is presented with the clusteringresults
ScatterGather [12] divides the data collection into a smallnumber of clusters the user selected clusters of interestand the system reclustered the indicated subcollection ofdocuments dynamically Vivisimo [4 13] is possibly the mostpopular commercial clustering search engine Vivisimo callssearch engines such as Yahoo and Google to extract relevantinformation (titles URLs and short descriptions) from theresult retrieved It groups documents in the retrieved resultbased on summarized informationTheVivisimo search clus-tering engine was sold to Yippy Inc in 2010 Grouper [9] usessnippets obtained by the search engines It is an interface forthe results of the Husky Search meta-search engine Grouperuses the Suffix Tree Clustering (STC) algorithm to clustertogether documents that have great common subphrasesCarrot2 [14] is a clustering search engine solution that usessearch results from various search engines including YahooGoogle and MSN It uses five different clustering algorithms(STC FussyAnts Lingo HAOG-STC and Rough k-means)where Lingo Algorithm is the default clustering algorithmusedThe output is a flat folder structure overlapping foldersare revealed when the user places themouse over a documenttitle The system presented in [15] is a meta-search clusteringengine called the Search Clustering System (SCS) whichorganizes the results returned by conventional Web searchengines into a cluster hierarchy The hierarchy is producedby the Cluster Hierarchy Construction Algorithm (CHCA)Unlike most other clustering algorithms CHCA operates onnominal data its input is a set of binary vectors representing
Web documents Document representations are based eitheron snippets or on the full contents of the retrieved pages
All of the above clustering engines except [15] use snippetsthat probably contain terms that are part of the querykeywords Snippets are not necessarily good representativeof the whole document contents which affects the quality ofthe clusters Proposed solution uses whole documents ratherthan titles and short snippets to ensure proper extraction ofthe semantics of the retrieved documents While all of theabove clustering engines have been mostly performed with-out explicit use of lexical semantics proposed work takes intoconsideration both lexical and semantics similarities Thisenables proposed system to provide better clustering quality
3 Overall Architecture ofthe Proposed Methodology
The framework of the proposed solution is shown in Figure 1The system receives the userrsquos query (119902) which is expressed interms of keywords The system performs the following steps
(i) Submitting the query to a search engine and receivingthe result
(ii) Preprocessing documents from the results andextracting features from each document
(iii) Enriching document features using ontology andconstructing semantic network to model the docu-ment
(iv) Applying spreading activation algorithm on the con-structed semantic network
(v) Computing the dissimilarity matrix among docu-ments using themost significant features representingthe retrieved documents as highlighted by spreadingactivation
(vi) Applying clustering algorithm on the similaritymatrix to obtain the clusters
Now we will describe in detail each of the steps ofproposed solution using a simple example to elaborate thesemantic clustering ability of our proposed solution whichsets it apart from term-based clustering solutions
Four simple documents are used 1198631 1198632 1198633 and 1198634Each of these four documents has only two terms (withfrequency one for each of them) After preprocessing thedocuments and extracting the features we got the following
1198631 = Apple 1Headphone 1
1198632 = Apple 1 diet 1
1198633 = Orange 1 diet 1
1198634 = Samsung galaxy 1Bluetooth 1
(1)
The similarity matrix between the four documents isformed using extracted terms in the same way as in thetraditional clustering approaches as shown in (lowast)
The Scientific World Journal 3
Query interface
Search engine
Query
Preprocessing andfeatures extraction
Search results
Semantic modeling
Ontology
Spreading activation
Similarity computation
Results clustering
Figure 1 System overview
Similarity Matrix among the Four Documents Consider
1198631 1198632 1198633 1198634
1198631
1198632
1198633
1198634
(
10
05
00
00
05
10
05
00
00
05
10
00
00
00
00
10
)
(lowast)
If we cluster the four documents using the computed similarlyvalues in (lowast) we get the following clusters
1198621 = 119863111986321198633
1198622 = 1198634
(2)
When we examine the term-based similarity values as shownin (lowast) we found that it is not acceptable from the perspectiveof the human being because of the following reasons
(i) While 1198631 and 1198632 are two documents that arereflecting two totally different domains the computedsimilarity value is 50
(ii) The similarity value between 1198631 and 1198634 is 0 whileboth documents are related to mobile phones
(iii) The similarity value between 1198632 and 1198633 is only 50while both documents are related to health and diet
Then the resulting clusters1198621 and1198622have very lowprecision0 for this simple example
Nowwewill discuss how proposed solution computes thesimilarity and clusters for the same four documents
31 Preprocessing and Feature Extraction The proposedsolution first extracts terms from each retrieved documentthrough tokenization and then removes stop words (eg aon be as ) from the token set Multiword phrases weretaken into consideration Next lemmatized terms are used
D1 D2
Apple1
Headphone1
Apple1
Diet1
Orange1
Diet1
Samsunggalaxy
1Bluetooth
1D3 D4
Figure 2 Initializing the graphs for the four documents
based on the WordNet [16] Finally the proposed solutioninitializes graph 119881 using the extracted features Each featurebecomes a node in 119881 and is annotated with the frequency ofthe term in 119889
119894 Figure 2 shows the initialization of the four
graph nodes representing the four documents in our runningexample
32 Feature Enrichment Using the ontology proposed solu-tion augments 119881 with concepts and relationships fromontology which are related to terms in 119881 This processenriches 119881 both lexically and semantically The concepts thatare added to 119881 are assigned frequency of zero Unlike manyother semantic systems that rely mainly on WordNet oursystem uses ontology that contains not only lexical terms andrelationships but also other semantic terms and relationshipsOur ontology can be depicted in the form of an enrichedgraph that could be considered as a semantic representationof the retrieved document In this graph terms with similarmeaning represent a concept Figures 3 4 5 and 6 show theenriched graphs
33 Spreading Activation The graph that we have con-structed in the last steps not only models the retrieveddocument but also contains additional lexically and seman-tically relevant concepts and relationships These addi-tional concepts and relationships help in linking features of
4 The Scientific World Journal
D1
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1
Apple 1
Fruit 0
Plant0
is-a
is-a
is-a
is-a
is-a
is-afor-ahas-a
Figure 3 Enriched graph for1198631
D2
Nutrient 0
Diet1
Company 0
Apple 1
Fruit 0
Plant0
has-a
is-a
is-a
is-a
has-a
Figure 4 Enriched graph for1198632
the retrieved document The enriched graph does not yetcontain enough semantics to perform high quality clusteringbecause of two reasons
(i) Concepts and relationships that are added to thegraph are mainly related to the terms in the originaldocument and not filtered to the specific semanticsof the document Hence they could be contained ina graph representation of another document that hassimilar terms even if this document has differentsemantics
(ii) The weight of the added concepts is initially set tozero which does not reflect their relative importanceof those concepts to the semantics of the document
Proposed solution resolves the two issues above through twosteps
(a) Selecting of concepts and relationships in 119881 that issemantically relevant to the context of the documentThis is done through applying a shortest path algo-rithm [17] shown in Algorithm 1
(b) Adjusting the weights of concepts in 119881 to betterintegrate new information that is added to 119881 Thiswill affect the similarity computation in later step We
D3
Nutrient 0
Diet1
Orange1
Fruit 0
Plant0
has-a
is-a
is-a has-a
Figure 5 Enriched graph for1198633
D4
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1Samsung galaxy
1
Mobile phone 0
is-ais-a
is-a
is-ais-a
for-ahas-a
Figure 6 Enriched graph for1198634
perform that through a spreading activation process[18]
These two steps allow proposed solution to performsemantic clustering rather than term-based clustering Thisleads to better clustering result as will be shown in theexperiments section
Proposed system computes the shortest path using theFloyd-Warshall algorithm [17] The pseudocode of shortestpath procedure is shown in Algorithm 1 Figures 7 8 9 and10 show the nodes and relationships in shortest path for everygraph
After determining nodes and relationships in the shortestpath our system applies a spreading activation algorithm [18]that operates on nodes in the shortest path We consider thefrequencies on the graph nodes as initial activation values forthe spreading activation process
The main idea of spreading activation is to activate nodesand to propagate this activating from one node to othernodes while incrementing frequencies The pseudocode ofthe spreading activation procedure that proposed solutionuses is shown in Algorithm 2 The algorithm consists of thefollowing steps
(i) Initial nodes to be activated are placed in a priorityqueue
The Scientific World Journal 5
let dist be a |119881| times |119881| array of minimum distances initialized toinfin (infinity)for each vertex Vdist[V][V] larr 0
for each edge (119906 V)dist[119906][V] larr 119908(119906 V) the weight of the edge (119906 V)
for 119896 from 1 to |119881|for 119894 from 1 to |119881|for 119895 from 1 to |119881|if dist[119894][119895] gt dist[119894][119896] + dist[119896][119895]dist[119894][119895] larr dist[119894][119896] + dist[119896][119895]
end if
Algorithm 1 Shortest path
List SpreadingActivation (VertexPirorityQueue input)List outputActivationFunction activationFunwhile (inputisNotEmpty())
currVertex = inputRemoveMax()activation = activationFun (currVertex)currVerexVisited = truefor (every edge eOrig(e) == currVertex)
destVertex = egetDestination()deltaInput = activation lowast egetweigth()destVertexActivation += deltaInput
outputinsertVertex (currVertex)return output
Algorithm 2 Spreading activation
D1
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1
Apple 1
Figure 7 Shortest path for1198631
(ii) The current node spreads its activation value to itsneighbors Considering the source node as 119894 and thetarget node as 119895 spreading to the neighbors occursaccording to (3) where 119868 denotes input and119874 denotesoutput
119868119895 (119905 + 1) = 119868119895 (
119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)
D2
Nutrient 0
Diet1
Apple
1
Figure 8 Shortest path for1198632
D3
Nutrient
0
Diet1
Orange1
Figure 9 Shortest path for1198633
(iii) The contribution of 119894 is added to the current inputvalue of node 119895 Thus the algorithm rewards thosenodes which are reached through different paths byadding the contributions of all its neighbors Thiscontribution is obtained by multiplying the outputvalue of node 119894 (119874
119894(119905)) by the weight of the edge 119908
119894119895
(iv) The output of a node is given by the function 119874119894(119905)
The value 119908119894119895corresponds to the numerical weight of the
relationship obtained from the ontology proposed solutionuses in running example the weight is equal to 1 At the endthe result list contains the nodes which represent the result ofthe spreading activation process
Figures 11 12 13 and 14 show the execution of thespreading activation algorithm for the four examples
6 The Scientific World Journal
D4
Eaccessories0
Electronics0
Headphone1Samsung galaxy
1
Mobile phone 0
Figure 10 Shortest path for1198634
D1
Company 0 + 1 = 1
Brand 0
Headphone1
Apple 1
Product0
Electronics0 + 1 = 1
Eaccessories0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1
Figure 11 Spreading activation for1198631
D2
Nutrient 0 + 1 + 1 = 2
Diet1
Apple 1
1 lowast 1
1 lowast 1
Figure 12 Spreading activation for1198632
The final frequency values of nodes in the four graphsafter spreading activation are shown in Figures 15 16 17 and18
34 Similarity Computation After applying the shortest pathalgorithm and the activation spreading algorithm the con-cepts with their frequencies in each semantic network graphare extracted to be used in the similarity comparison betweenevery two documents
D3
Nutrient 0 + 1 + 1 = 2
Diet1
Orange1
1 lowast 1
1 lowast 1
Figure 13 Spreading activation for1198633
D4
Eaccessories0 + 1 = 1
Electronics0 + 1 + 1 = 2
Headphone1Samsung galaxy
1
Mobile phone 0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1 1 lowast 1
Figure 14 Spreading activation for1198634
Proposed solution uses the cosine similarity function[19] to check the similarity between the extracted conceptsrepresenting two documents The cosine similarity functionis shown in
sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)
radicsum119905isin119904119908119904 (119905)2sdot radicsum119905isin119904119908119889 (119905)2
(4)
where 119904 119889 are the two documents119908119904(119905) is the weight of term
119905 in the 119904 document and119908119889(119905) is the weight of term 119905 in the 119889
documentEquation (lowastlowast) shows the calculated similarity between
every two documents based on the features and frequenciesobtained from our solution
Similarity Matrix as Computed by Proposed Solution Con-sider
1198631 1198632 1198633 1198634
1198631
1198632
1198633
1198634
(
10
018
00
064
018
10
083
00
00
083
10
00
064
00
00
10
)
(lowastlowast)
The Scientific World Journal 7
D1
Company 1
Brand 0
Headphone1
Apple 1
Product0
Electronics1
Eaccessories1
Figure 15 Frequencies after spreading activation for1198631
D2
Nutrient 2
Diet1
Apple1
Figure 16 Frequencies after spreading activation for1198632
D3
Nutrient 2
Diet1
Orange1
Figure 17 Frequencies after spreading activation for1198633
The similarity values computed by proposed solutionshown in (lowastlowast) reflect better results according to the seman-tics of the documents In particular consider the following
(i) 1198631 and1198632 similarity is 18 in our solution instead of50 in term-based solutions
(ii) 1198631 and 1198634 similarity is 64 in our solution insteadof 0 in term-based solutions
(iii) 1198632 and1198633 similarity is 83 in our solution instead of50 in term-based solutions
D4
Eaccessories1
Electronics2
Headphone1Samsung galaxy
1
Mobile phone 1
Figure 18 Frequencies after spreading activation for1198634
35 Results Clustering Proposed solution uses an agglomer-ative hierarchical clustering [20] which is a bottom-up clus-tering method The Euclidean distance that was used is thesimilarity measure In initialization each document is con-sidered as cluster Similar documents are merged into clusteruntil a termination condition is satisfied This conditioncould be reaching a certain number of clusters (119896) Theagglomerative hierarchical clustering algorithm is shown inAlgorithm 3
If proposed solution in this case clusters the four doc-uments using the computed similarly values in (lowastlowast) thesolution gets the following clusters
1198621 = 1198891 1198894
1198622 = 1198892 1198893
(5)
This clustering result has very high precision 100 for thissimple example and has major improvement over the resultas shown above when traditional clustering has been per-formed
4 Experimental Results
A prototype was built for testing proposed solution usingJava programming language The prototype performs searchengine result preprocessing feature extraction andmodelingontology enrichment spreading activation and similaritycomputation Protege [21] is used for building the ontologywhich has been used in the experiments We use Jena [22]as an API programmatic environment for querying RDFand OWL based data models that uses the SPARQL querylanguage [23 24]The agglomerative clustering algorithmwasimplemented using R software [25]
To compute the quality of the result we use the preci-sion measure which represents the percentage of positivepredictions by the system that is correct as shown in (6) Weuse human clustering as the reference for the correctness ofthe result clustering where three different people conducted
8 The Scientific World Journal
(1) Initialization cluster(11) Each object be a cluster(12) Creating similarity matrix(2) Clustering(21) Finding a pair of the most similar clusters and merging(22) Computing the distances between new cluster and others(23) Pruning and updating the similarity matrix(24) If the terminal condition is satisfied then output else repeating (21) to (23)(3) Clustered output
Algorithm 3 Clustering algorithm
Table 1 The first experiment values
Query PrecisionApple 95Paris 90Jaguar 90Hollywood 95Red Hot Chili Peppers 95Mac 85Snow Leopard 90Lion 80Tiger 85Mouse 95
the manual clustering and the results obtained from themwere validated
119875 (119862119895) =
10038161003816100381610038161003816119862119905
119895
10038161003816100381610038161003816
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
119875 =
sum119862119895isin119862119875 (119862119895)
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
sum119862119895
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
(6)
We have used ten different queries for testing namelyldquoApplerdquo ldquoParisrdquo ldquoJaguarrdquo ldquoHollywoodrdquo ldquoRed Hot Chili Pep-persrdquo ldquoMacrdquo ldquoSnow Leopardrdquo ldquoLionrdquo ldquoTigerrdquo and ldquoMouserdquoThe first 5 queries were also used for testing in [11] thus weuse them for comparison purposes
We ran two experiments the first one measures theprecision of the clustering resulting from our solution whenapplying all phases of the solution The second experimenttests the system without applying the spreading activationstep to determine the significance of spreading activation Inour experiments we limit documents to be clustered from theresult to 20 documents for each query and we set numbersof clusters to 5 We use httpswwwgooglecom as thesearch engine of choice for retrieving results related to ourexperiments
Table 1 shows precision values for the resulting clusters infirst experiment
The results reported in [11] for the first 5 queries are 575for the query ldquoApplerdquo 85 for the query ldquoParisrdquo 76 for
Table 2 The second experiment values
Query PrecisionApple 75Paris 80Jaguar 60Hollywood 40Red Hot Chili Peppers 75Mac 75Snow Leopard 80Lion 70Tiger 60Mouse 85
the query ldquoJaguarrdquo 86 for the query ldquoRed Hot ChiliPeppersrdquo and 865 for the query ldquoHollywoodrdquo
In the second experiment (running the system with-out spreading activation) the resulting precision values areshown in Table 2
Comparing these values to the values obtained in experi-ment 1 we conclude that activation spreading has contributedto large extent to the high precision results of our solution
As explained earlier the spreading activation algorithmstep gives the proposed solution the ability to performsimilarity comparison and to cluster the document on thesemantic level rather than on the syntax level which setsthe proposed solution apart from most other solutions Thisallows the proposed solution to function in a way that issimilar to a large extent to what a human will do if asked tocluster the documents
5 Conclusion
Searching the Web is a task that consumes too much timeand effort especially for ambiguity queries which have manymeanings WebPages clustering could help in reaching therequired documents that the user is searching for In thispaper a novel approach has been introduced for search resultsclustering that are based on the semantics of the retrieveddocuments rather than the syntax of the terms in those docu-mentsThis means that documents that are semantically sim-ilar are clustered together rather than clustering together doc-uments that just contain similar termsThe proposed solution
The Scientific World Journal 9
has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006
[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999
[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009
[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web
search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013
[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003
[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007
[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998
[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999
[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005
[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006
[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf
[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results
clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004
[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine
resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005
[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-
Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach
for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004
[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005
[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002
[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu
[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet
[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query
[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl
[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 3
Query interface
Search engine
Query
Preprocessing andfeatures extraction
Search results
Semantic modeling
Ontology
Spreading activation
Similarity computation
Results clustering
Figure 1 System overview
Similarity Matrix among the Four Documents Consider
1198631 1198632 1198633 1198634
1198631
1198632
1198633
1198634
(
10
05
00
00
05
10
05
00
00
05
10
00
00
00
00
10
)
(lowast)
If we cluster the four documents using the computed similarlyvalues in (lowast) we get the following clusters
1198621 = 119863111986321198633
1198622 = 1198634
(2)
When we examine the term-based similarity values as shownin (lowast) we found that it is not acceptable from the perspectiveof the human being because of the following reasons
(i) While 1198631 and 1198632 are two documents that arereflecting two totally different domains the computedsimilarity value is 50
(ii) The similarity value between 1198631 and 1198634 is 0 whileboth documents are related to mobile phones
(iii) The similarity value between 1198632 and 1198633 is only 50while both documents are related to health and diet
Then the resulting clusters1198621 and1198622have very lowprecision0 for this simple example
Nowwewill discuss how proposed solution computes thesimilarity and clusters for the same four documents
31 Preprocessing and Feature Extraction The proposedsolution first extracts terms from each retrieved documentthrough tokenization and then removes stop words (eg aon be as ) from the token set Multiword phrases weretaken into consideration Next lemmatized terms are used
D1 D2
Apple1
Headphone1
Apple1
Diet1
Orange1
Diet1
Samsunggalaxy
1Bluetooth
1D3 D4
Figure 2 Initializing the graphs for the four documents
based on the WordNet [16] Finally the proposed solutioninitializes graph 119881 using the extracted features Each featurebecomes a node in 119881 and is annotated with the frequency ofthe term in 119889
119894 Figure 2 shows the initialization of the four
graph nodes representing the four documents in our runningexample
32 Feature Enrichment Using the ontology proposed solu-tion augments 119881 with concepts and relationships fromontology which are related to terms in 119881 This processenriches 119881 both lexically and semantically The concepts thatare added to 119881 are assigned frequency of zero Unlike manyother semantic systems that rely mainly on WordNet oursystem uses ontology that contains not only lexical terms andrelationships but also other semantic terms and relationshipsOur ontology can be depicted in the form of an enrichedgraph that could be considered as a semantic representationof the retrieved document In this graph terms with similarmeaning represent a concept Figures 3 4 5 and 6 show theenriched graphs
33 Spreading Activation The graph that we have con-structed in the last steps not only models the retrieveddocument but also contains additional lexically and seman-tically relevant concepts and relationships These addi-tional concepts and relationships help in linking features of
4 The Scientific World Journal
D1
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1
Apple 1
Fruit 0
Plant0
is-a
is-a
is-a
is-a
is-a
is-afor-ahas-a
Figure 3 Enriched graph for1198631
D2
Nutrient 0
Diet1
Company 0
Apple 1
Fruit 0
Plant0
has-a
is-a
is-a
is-a
has-a
Figure 4 Enriched graph for1198632
the retrieved document The enriched graph does not yetcontain enough semantics to perform high quality clusteringbecause of two reasons
(i) Concepts and relationships that are added to thegraph are mainly related to the terms in the originaldocument and not filtered to the specific semanticsof the document Hence they could be contained ina graph representation of another document that hassimilar terms even if this document has differentsemantics
(ii) The weight of the added concepts is initially set tozero which does not reflect their relative importanceof those concepts to the semantics of the document
Proposed solution resolves the two issues above through twosteps
(a) Selecting of concepts and relationships in 119881 that issemantically relevant to the context of the documentThis is done through applying a shortest path algo-rithm [17] shown in Algorithm 1
(b) Adjusting the weights of concepts in 119881 to betterintegrate new information that is added to 119881 Thiswill affect the similarity computation in later step We
D3
Nutrient 0
Diet1
Orange1
Fruit 0
Plant0
has-a
is-a
is-a has-a
Figure 5 Enriched graph for1198633
D4
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1Samsung galaxy
1
Mobile phone 0
is-ais-a
is-a
is-ais-a
for-ahas-a
Figure 6 Enriched graph for1198634
perform that through a spreading activation process[18]
These two steps allow proposed solution to performsemantic clustering rather than term-based clustering Thisleads to better clustering result as will be shown in theexperiments section
Proposed system computes the shortest path using theFloyd-Warshall algorithm [17] The pseudocode of shortestpath procedure is shown in Algorithm 1 Figures 7 8 9 and10 show the nodes and relationships in shortest path for everygraph
After determining nodes and relationships in the shortestpath our system applies a spreading activation algorithm [18]that operates on nodes in the shortest path We consider thefrequencies on the graph nodes as initial activation values forthe spreading activation process
The main idea of spreading activation is to activate nodesand to propagate this activating from one node to othernodes while incrementing frequencies The pseudocode ofthe spreading activation procedure that proposed solutionuses is shown in Algorithm 2 The algorithm consists of thefollowing steps
(i) Initial nodes to be activated are placed in a priorityqueue
The Scientific World Journal 5
let dist be a |119881| times |119881| array of minimum distances initialized toinfin (infinity)for each vertex Vdist[V][V] larr 0
for each edge (119906 V)dist[119906][V] larr 119908(119906 V) the weight of the edge (119906 V)
for 119896 from 1 to |119881|for 119894 from 1 to |119881|for 119895 from 1 to |119881|if dist[119894][119895] gt dist[119894][119896] + dist[119896][119895]dist[119894][119895] larr dist[119894][119896] + dist[119896][119895]
end if
Algorithm 1 Shortest path
List SpreadingActivation (VertexPirorityQueue input)List outputActivationFunction activationFunwhile (inputisNotEmpty())
currVertex = inputRemoveMax()activation = activationFun (currVertex)currVerexVisited = truefor (every edge eOrig(e) == currVertex)
destVertex = egetDestination()deltaInput = activation lowast egetweigth()destVertexActivation += deltaInput
outputinsertVertex (currVertex)return output
Algorithm 2 Spreading activation
D1
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1
Apple 1
Figure 7 Shortest path for1198631
(ii) The current node spreads its activation value to itsneighbors Considering the source node as 119894 and thetarget node as 119895 spreading to the neighbors occursaccording to (3) where 119868 denotes input and119874 denotesoutput
119868119895 (119905 + 1) = 119868119895 (
119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)
D2
Nutrient 0
Diet1
Apple
1
Figure 8 Shortest path for1198632
D3
Nutrient
0
Diet1
Orange1
Figure 9 Shortest path for1198633
(iii) The contribution of 119894 is added to the current inputvalue of node 119895 Thus the algorithm rewards thosenodes which are reached through different paths byadding the contributions of all its neighbors Thiscontribution is obtained by multiplying the outputvalue of node 119894 (119874
119894(119905)) by the weight of the edge 119908
119894119895
(iv) The output of a node is given by the function 119874119894(119905)
The value 119908119894119895corresponds to the numerical weight of the
relationship obtained from the ontology proposed solutionuses in running example the weight is equal to 1 At the endthe result list contains the nodes which represent the result ofthe spreading activation process
Figures 11 12 13 and 14 show the execution of thespreading activation algorithm for the four examples
6 The Scientific World Journal
D4
Eaccessories0
Electronics0
Headphone1Samsung galaxy
1
Mobile phone 0
Figure 10 Shortest path for1198634
D1
Company 0 + 1 = 1
Brand 0
Headphone1
Apple 1
Product0
Electronics0 + 1 = 1
Eaccessories0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1
Figure 11 Spreading activation for1198631
D2
Nutrient 0 + 1 + 1 = 2
Diet1
Apple 1
1 lowast 1
1 lowast 1
Figure 12 Spreading activation for1198632
The final frequency values of nodes in the four graphsafter spreading activation are shown in Figures 15 16 17 and18
34 Similarity Computation After applying the shortest pathalgorithm and the activation spreading algorithm the con-cepts with their frequencies in each semantic network graphare extracted to be used in the similarity comparison betweenevery two documents
D3
Nutrient 0 + 1 + 1 = 2
Diet1
Orange1
1 lowast 1
1 lowast 1
Figure 13 Spreading activation for1198633
D4
Eaccessories0 + 1 = 1
Electronics0 + 1 + 1 = 2
Headphone1Samsung galaxy
1
Mobile phone 0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1 1 lowast 1
Figure 14 Spreading activation for1198634
Proposed solution uses the cosine similarity function[19] to check the similarity between the extracted conceptsrepresenting two documents The cosine similarity functionis shown in
sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)
radicsum119905isin119904119908119904 (119905)2sdot radicsum119905isin119904119908119889 (119905)2
(4)
where 119904 119889 are the two documents119908119904(119905) is the weight of term
119905 in the 119904 document and119908119889(119905) is the weight of term 119905 in the 119889
documentEquation (lowastlowast) shows the calculated similarity between
every two documents based on the features and frequenciesobtained from our solution
Similarity Matrix as Computed by Proposed Solution Con-sider
1198631 1198632 1198633 1198634
1198631
1198632
1198633
1198634
(
10
018
00
064
018
10
083
00
00
083
10
00
064
00
00
10
)
(lowastlowast)
The Scientific World Journal 7
D1
Company 1
Brand 0
Headphone1
Apple 1
Product0
Electronics1
Eaccessories1
Figure 15 Frequencies after spreading activation for1198631
D2
Nutrient 2
Diet1
Apple1
Figure 16 Frequencies after spreading activation for1198632
D3
Nutrient 2
Diet1
Orange1
Figure 17 Frequencies after spreading activation for1198633
The similarity values computed by proposed solutionshown in (lowastlowast) reflect better results according to the seman-tics of the documents In particular consider the following
(i) 1198631 and1198632 similarity is 18 in our solution instead of50 in term-based solutions
(ii) 1198631 and 1198634 similarity is 64 in our solution insteadof 0 in term-based solutions
(iii) 1198632 and1198633 similarity is 83 in our solution instead of50 in term-based solutions
D4
Eaccessories1
Electronics2
Headphone1Samsung galaxy
1
Mobile phone 1
Figure 18 Frequencies after spreading activation for1198634
35 Results Clustering Proposed solution uses an agglomer-ative hierarchical clustering [20] which is a bottom-up clus-tering method The Euclidean distance that was used is thesimilarity measure In initialization each document is con-sidered as cluster Similar documents are merged into clusteruntil a termination condition is satisfied This conditioncould be reaching a certain number of clusters (119896) Theagglomerative hierarchical clustering algorithm is shown inAlgorithm 3
If proposed solution in this case clusters the four doc-uments using the computed similarly values in (lowastlowast) thesolution gets the following clusters
1198621 = 1198891 1198894
1198622 = 1198892 1198893
(5)
This clustering result has very high precision 100 for thissimple example and has major improvement over the resultas shown above when traditional clustering has been per-formed
4 Experimental Results
A prototype was built for testing proposed solution usingJava programming language The prototype performs searchengine result preprocessing feature extraction andmodelingontology enrichment spreading activation and similaritycomputation Protege [21] is used for building the ontologywhich has been used in the experiments We use Jena [22]as an API programmatic environment for querying RDFand OWL based data models that uses the SPARQL querylanguage [23 24]The agglomerative clustering algorithmwasimplemented using R software [25]
To compute the quality of the result we use the preci-sion measure which represents the percentage of positivepredictions by the system that is correct as shown in (6) Weuse human clustering as the reference for the correctness ofthe result clustering where three different people conducted
8 The Scientific World Journal
(1) Initialization cluster(11) Each object be a cluster(12) Creating similarity matrix(2) Clustering(21) Finding a pair of the most similar clusters and merging(22) Computing the distances between new cluster and others(23) Pruning and updating the similarity matrix(24) If the terminal condition is satisfied then output else repeating (21) to (23)(3) Clustered output
Algorithm 3 Clustering algorithm
Table 1 The first experiment values
Query PrecisionApple 95Paris 90Jaguar 90Hollywood 95Red Hot Chili Peppers 95Mac 85Snow Leopard 90Lion 80Tiger 85Mouse 95
the manual clustering and the results obtained from themwere validated
119875 (119862119895) =
10038161003816100381610038161003816119862119905
119895
10038161003816100381610038161003816
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
119875 =
sum119862119895isin119862119875 (119862119895)
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
sum119862119895
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
(6)
We have used ten different queries for testing namelyldquoApplerdquo ldquoParisrdquo ldquoJaguarrdquo ldquoHollywoodrdquo ldquoRed Hot Chili Pep-persrdquo ldquoMacrdquo ldquoSnow Leopardrdquo ldquoLionrdquo ldquoTigerrdquo and ldquoMouserdquoThe first 5 queries were also used for testing in [11] thus weuse them for comparison purposes
We ran two experiments the first one measures theprecision of the clustering resulting from our solution whenapplying all phases of the solution The second experimenttests the system without applying the spreading activationstep to determine the significance of spreading activation Inour experiments we limit documents to be clustered from theresult to 20 documents for each query and we set numbersof clusters to 5 We use httpswwwgooglecom as thesearch engine of choice for retrieving results related to ourexperiments
Table 1 shows precision values for the resulting clusters infirst experiment
The results reported in [11] for the first 5 queries are 575for the query ldquoApplerdquo 85 for the query ldquoParisrdquo 76 for
Table 2 The second experiment values
Query PrecisionApple 75Paris 80Jaguar 60Hollywood 40Red Hot Chili Peppers 75Mac 75Snow Leopard 80Lion 70Tiger 60Mouse 85
the query ldquoJaguarrdquo 86 for the query ldquoRed Hot ChiliPeppersrdquo and 865 for the query ldquoHollywoodrdquo
In the second experiment (running the system with-out spreading activation) the resulting precision values areshown in Table 2
Comparing these values to the values obtained in experi-ment 1 we conclude that activation spreading has contributedto large extent to the high precision results of our solution
As explained earlier the spreading activation algorithmstep gives the proposed solution the ability to performsimilarity comparison and to cluster the document on thesemantic level rather than on the syntax level which setsthe proposed solution apart from most other solutions Thisallows the proposed solution to function in a way that issimilar to a large extent to what a human will do if asked tocluster the documents
5 Conclusion
Searching the Web is a task that consumes too much timeand effort especially for ambiguity queries which have manymeanings WebPages clustering could help in reaching therequired documents that the user is searching for In thispaper a novel approach has been introduced for search resultsclustering that are based on the semantics of the retrieveddocuments rather than the syntax of the terms in those docu-mentsThis means that documents that are semantically sim-ilar are clustered together rather than clustering together doc-uments that just contain similar termsThe proposed solution
The Scientific World Journal 9
has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006
[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999
[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009
[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web
search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013
[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003
[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007
[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998
[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999
[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005
[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006
[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf
[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results
clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004
[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine
resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005
[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-
Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach
for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004
[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005
[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002
[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu
[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet
[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query
[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl
[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
4 The Scientific World Journal
D1
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1
Apple 1
Fruit 0
Plant0
is-a
is-a
is-a
is-a
is-a
is-afor-ahas-a
Figure 3 Enriched graph for1198631
D2
Nutrient 0
Diet1
Company 0
Apple 1
Fruit 0
Plant0
has-a
is-a
is-a
is-a
has-a
Figure 4 Enriched graph for1198632
the retrieved document The enriched graph does not yetcontain enough semantics to perform high quality clusteringbecause of two reasons
(i) Concepts and relationships that are added to thegraph are mainly related to the terms in the originaldocument and not filtered to the specific semanticsof the document Hence they could be contained ina graph representation of another document that hassimilar terms even if this document has differentsemantics
(ii) The weight of the added concepts is initially set tozero which does not reflect their relative importanceof those concepts to the semantics of the document
Proposed solution resolves the two issues above through twosteps
(a) Selecting of concepts and relationships in 119881 that issemantically relevant to the context of the documentThis is done through applying a shortest path algo-rithm [17] shown in Algorithm 1
(b) Adjusting the weights of concepts in 119881 to betterintegrate new information that is added to 119881 Thiswill affect the similarity computation in later step We
D3
Nutrient 0
Diet1
Orange1
Fruit 0
Plant0
has-a
is-a
is-a has-a
Figure 5 Enriched graph for1198633
D4
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1Samsung galaxy
1
Mobile phone 0
is-ais-a
is-a
is-ais-a
for-ahas-a
Figure 6 Enriched graph for1198634
perform that through a spreading activation process[18]
These two steps allow proposed solution to performsemantic clustering rather than term-based clustering Thisleads to better clustering result as will be shown in theexperiments section
Proposed system computes the shortest path using theFloyd-Warshall algorithm [17] The pseudocode of shortestpath procedure is shown in Algorithm 1 Figures 7 8 9 and10 show the nodes and relationships in shortest path for everygraph
After determining nodes and relationships in the shortestpath our system applies a spreading activation algorithm [18]that operates on nodes in the shortest path We consider thefrequencies on the graph nodes as initial activation values forthe spreading activation process
The main idea of spreading activation is to activate nodesand to propagate this activating from one node to othernodes while incrementing frequencies The pseudocode ofthe spreading activation procedure that proposed solutionuses is shown in Algorithm 2 The algorithm consists of thefollowing steps
(i) Initial nodes to be activated are placed in a priorityqueue
The Scientific World Journal 5
let dist be a |119881| times |119881| array of minimum distances initialized toinfin (infinity)for each vertex Vdist[V][V] larr 0
for each edge (119906 V)dist[119906][V] larr 119908(119906 V) the weight of the edge (119906 V)
for 119896 from 1 to |119881|for 119894 from 1 to |119881|for 119895 from 1 to |119881|if dist[119894][119895] gt dist[119894][119896] + dist[119896][119895]dist[119894][119895] larr dist[119894][119896] + dist[119896][119895]
end if
Algorithm 1 Shortest path
List SpreadingActivation (VertexPirorityQueue input)List outputActivationFunction activationFunwhile (inputisNotEmpty())
currVertex = inputRemoveMax()activation = activationFun (currVertex)currVerexVisited = truefor (every edge eOrig(e) == currVertex)
destVertex = egetDestination()deltaInput = activation lowast egetweigth()destVertexActivation += deltaInput
outputinsertVertex (currVertex)return output
Algorithm 2 Spreading activation
D1
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1
Apple 1
Figure 7 Shortest path for1198631
(ii) The current node spreads its activation value to itsneighbors Considering the source node as 119894 and thetarget node as 119895 spreading to the neighbors occursaccording to (3) where 119868 denotes input and119874 denotesoutput
119868119895 (119905 + 1) = 119868119895 (
119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)
D2
Nutrient 0
Diet1
Apple
1
Figure 8 Shortest path for1198632
D3
Nutrient
0
Diet1
Orange1
Figure 9 Shortest path for1198633
(iii) The contribution of 119894 is added to the current inputvalue of node 119895 Thus the algorithm rewards thosenodes which are reached through different paths byadding the contributions of all its neighbors Thiscontribution is obtained by multiplying the outputvalue of node 119894 (119874
119894(119905)) by the weight of the edge 119908
119894119895
(iv) The output of a node is given by the function 119874119894(119905)
The value 119908119894119895corresponds to the numerical weight of the
relationship obtained from the ontology proposed solutionuses in running example the weight is equal to 1 At the endthe result list contains the nodes which represent the result ofthe spreading activation process
Figures 11 12 13 and 14 show the execution of thespreading activation algorithm for the four examples
6 The Scientific World Journal
D4
Eaccessories0
Electronics0
Headphone1Samsung galaxy
1
Mobile phone 0
Figure 10 Shortest path for1198634
D1
Company 0 + 1 = 1
Brand 0
Headphone1
Apple 1
Product0
Electronics0 + 1 = 1
Eaccessories0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1
Figure 11 Spreading activation for1198631
D2
Nutrient 0 + 1 + 1 = 2
Diet1
Apple 1
1 lowast 1
1 lowast 1
Figure 12 Spreading activation for1198632
The final frequency values of nodes in the four graphsafter spreading activation are shown in Figures 15 16 17 and18
34 Similarity Computation After applying the shortest pathalgorithm and the activation spreading algorithm the con-cepts with their frequencies in each semantic network graphare extracted to be used in the similarity comparison betweenevery two documents
D3
Nutrient 0 + 1 + 1 = 2
Diet1
Orange1
1 lowast 1
1 lowast 1
Figure 13 Spreading activation for1198633
D4
Eaccessories0 + 1 = 1
Electronics0 + 1 + 1 = 2
Headphone1Samsung galaxy
1
Mobile phone 0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1 1 lowast 1
Figure 14 Spreading activation for1198634
Proposed solution uses the cosine similarity function[19] to check the similarity between the extracted conceptsrepresenting two documents The cosine similarity functionis shown in
sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)
radicsum119905isin119904119908119904 (119905)2sdot radicsum119905isin119904119908119889 (119905)2
(4)
where 119904 119889 are the two documents119908119904(119905) is the weight of term
119905 in the 119904 document and119908119889(119905) is the weight of term 119905 in the 119889
documentEquation (lowastlowast) shows the calculated similarity between
every two documents based on the features and frequenciesobtained from our solution
Similarity Matrix as Computed by Proposed Solution Con-sider
1198631 1198632 1198633 1198634
1198631
1198632
1198633
1198634
(
10
018
00
064
018
10
083
00
00
083
10
00
064
00
00
10
)
(lowastlowast)
The Scientific World Journal 7
D1
Company 1
Brand 0
Headphone1
Apple 1
Product0
Electronics1
Eaccessories1
Figure 15 Frequencies after spreading activation for1198631
D2
Nutrient 2
Diet1
Apple1
Figure 16 Frequencies after spreading activation for1198632
D3
Nutrient 2
Diet1
Orange1
Figure 17 Frequencies after spreading activation for1198633
The similarity values computed by proposed solutionshown in (lowastlowast) reflect better results according to the seman-tics of the documents In particular consider the following
(i) 1198631 and1198632 similarity is 18 in our solution instead of50 in term-based solutions
(ii) 1198631 and 1198634 similarity is 64 in our solution insteadof 0 in term-based solutions
(iii) 1198632 and1198633 similarity is 83 in our solution instead of50 in term-based solutions
D4
Eaccessories1
Electronics2
Headphone1Samsung galaxy
1
Mobile phone 1
Figure 18 Frequencies after spreading activation for1198634
35 Results Clustering Proposed solution uses an agglomer-ative hierarchical clustering [20] which is a bottom-up clus-tering method The Euclidean distance that was used is thesimilarity measure In initialization each document is con-sidered as cluster Similar documents are merged into clusteruntil a termination condition is satisfied This conditioncould be reaching a certain number of clusters (119896) Theagglomerative hierarchical clustering algorithm is shown inAlgorithm 3
If proposed solution in this case clusters the four doc-uments using the computed similarly values in (lowastlowast) thesolution gets the following clusters
1198621 = 1198891 1198894
1198622 = 1198892 1198893
(5)
This clustering result has very high precision 100 for thissimple example and has major improvement over the resultas shown above when traditional clustering has been per-formed
4 Experimental Results
A prototype was built for testing proposed solution usingJava programming language The prototype performs searchengine result preprocessing feature extraction andmodelingontology enrichment spreading activation and similaritycomputation Protege [21] is used for building the ontologywhich has been used in the experiments We use Jena [22]as an API programmatic environment for querying RDFand OWL based data models that uses the SPARQL querylanguage [23 24]The agglomerative clustering algorithmwasimplemented using R software [25]
To compute the quality of the result we use the preci-sion measure which represents the percentage of positivepredictions by the system that is correct as shown in (6) Weuse human clustering as the reference for the correctness ofthe result clustering where three different people conducted
8 The Scientific World Journal
(1) Initialization cluster(11) Each object be a cluster(12) Creating similarity matrix(2) Clustering(21) Finding a pair of the most similar clusters and merging(22) Computing the distances between new cluster and others(23) Pruning and updating the similarity matrix(24) If the terminal condition is satisfied then output else repeating (21) to (23)(3) Clustered output
Algorithm 3 Clustering algorithm
Table 1 The first experiment values
Query PrecisionApple 95Paris 90Jaguar 90Hollywood 95Red Hot Chili Peppers 95Mac 85Snow Leopard 90Lion 80Tiger 85Mouse 95
the manual clustering and the results obtained from themwere validated
119875 (119862119895) =
10038161003816100381610038161003816119862119905
119895
10038161003816100381610038161003816
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
119875 =
sum119862119895isin119862119875 (119862119895)
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
sum119862119895
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
(6)
We have used ten different queries for testing namelyldquoApplerdquo ldquoParisrdquo ldquoJaguarrdquo ldquoHollywoodrdquo ldquoRed Hot Chili Pep-persrdquo ldquoMacrdquo ldquoSnow Leopardrdquo ldquoLionrdquo ldquoTigerrdquo and ldquoMouserdquoThe first 5 queries were also used for testing in [11] thus weuse them for comparison purposes
We ran two experiments the first one measures theprecision of the clustering resulting from our solution whenapplying all phases of the solution The second experimenttests the system without applying the spreading activationstep to determine the significance of spreading activation Inour experiments we limit documents to be clustered from theresult to 20 documents for each query and we set numbersof clusters to 5 We use httpswwwgooglecom as thesearch engine of choice for retrieving results related to ourexperiments
Table 1 shows precision values for the resulting clusters infirst experiment
The results reported in [11] for the first 5 queries are 575for the query ldquoApplerdquo 85 for the query ldquoParisrdquo 76 for
Table 2 The second experiment values
Query PrecisionApple 75Paris 80Jaguar 60Hollywood 40Red Hot Chili Peppers 75Mac 75Snow Leopard 80Lion 70Tiger 60Mouse 85
the query ldquoJaguarrdquo 86 for the query ldquoRed Hot ChiliPeppersrdquo and 865 for the query ldquoHollywoodrdquo
In the second experiment (running the system with-out spreading activation) the resulting precision values areshown in Table 2
Comparing these values to the values obtained in experi-ment 1 we conclude that activation spreading has contributedto large extent to the high precision results of our solution
As explained earlier the spreading activation algorithmstep gives the proposed solution the ability to performsimilarity comparison and to cluster the document on thesemantic level rather than on the syntax level which setsthe proposed solution apart from most other solutions Thisallows the proposed solution to function in a way that issimilar to a large extent to what a human will do if asked tocluster the documents
5 Conclusion
Searching the Web is a task that consumes too much timeand effort especially for ambiguity queries which have manymeanings WebPages clustering could help in reaching therequired documents that the user is searching for In thispaper a novel approach has been introduced for search resultsclustering that are based on the semantics of the retrieveddocuments rather than the syntax of the terms in those docu-mentsThis means that documents that are semantically sim-ilar are clustered together rather than clustering together doc-uments that just contain similar termsThe proposed solution
The Scientific World Journal 9
has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006
[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999
[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009
[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web
search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013
[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003
[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007
[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998
[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999
[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005
[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006
[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf
[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results
clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004
[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine
resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005
[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-
Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach
for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004
[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005
[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002
[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu
[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet
[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query
[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl
[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 5
let dist be a |119881| times |119881| array of minimum distances initialized toinfin (infinity)for each vertex Vdist[V][V] larr 0
for each edge (119906 V)dist[119906][V] larr 119908(119906 V) the weight of the edge (119906 V)
for 119896 from 1 to |119881|for 119894 from 1 to |119881|for 119895 from 1 to |119881|if dist[119894][119895] gt dist[119894][119896] + dist[119896][119895]dist[119894][119895] larr dist[119894][119896] + dist[119896][119895]
end if
Algorithm 1 Shortest path
List SpreadingActivation (VertexPirorityQueue input)List outputActivationFunction activationFunwhile (inputisNotEmpty())
currVertex = inputRemoveMax()activation = activationFun (currVertex)currVerexVisited = truefor (every edge eOrig(e) == currVertex)
destVertex = egetDestination()deltaInput = activation lowast egetweigth()destVertexActivation += deltaInput
outputinsertVertex (currVertex)return output
Algorithm 2 Spreading activation
D1
Product0
Eaccessories0
Electronics0
Brand 0
Company 0
Headphone1
Apple 1
Figure 7 Shortest path for1198631
(ii) The current node spreads its activation value to itsneighbors Considering the source node as 119894 and thetarget node as 119895 spreading to the neighbors occursaccording to (3) where 119868 denotes input and119874 denotesoutput
119868119895 (119905 + 1) = 119868119895 (
119905 + 1) + 119874119894 (119905) lowast 119908119894119895 (3)
D2
Nutrient 0
Diet1
Apple
1
Figure 8 Shortest path for1198632
D3
Nutrient
0
Diet1
Orange1
Figure 9 Shortest path for1198633
(iii) The contribution of 119894 is added to the current inputvalue of node 119895 Thus the algorithm rewards thosenodes which are reached through different paths byadding the contributions of all its neighbors Thiscontribution is obtained by multiplying the outputvalue of node 119894 (119874
119894(119905)) by the weight of the edge 119908
119894119895
(iv) The output of a node is given by the function 119874119894(119905)
The value 119908119894119895corresponds to the numerical weight of the
relationship obtained from the ontology proposed solutionuses in running example the weight is equal to 1 At the endthe result list contains the nodes which represent the result ofthe spreading activation process
Figures 11 12 13 and 14 show the execution of thespreading activation algorithm for the four examples
6 The Scientific World Journal
D4
Eaccessories0
Electronics0
Headphone1Samsung galaxy
1
Mobile phone 0
Figure 10 Shortest path for1198634
D1
Company 0 + 1 = 1
Brand 0
Headphone1
Apple 1
Product0
Electronics0 + 1 = 1
Eaccessories0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1
Figure 11 Spreading activation for1198631
D2
Nutrient 0 + 1 + 1 = 2
Diet1
Apple 1
1 lowast 1
1 lowast 1
Figure 12 Spreading activation for1198632
The final frequency values of nodes in the four graphsafter spreading activation are shown in Figures 15 16 17 and18
34 Similarity Computation After applying the shortest pathalgorithm and the activation spreading algorithm the con-cepts with their frequencies in each semantic network graphare extracted to be used in the similarity comparison betweenevery two documents
D3
Nutrient 0 + 1 + 1 = 2
Diet1
Orange1
1 lowast 1
1 lowast 1
Figure 13 Spreading activation for1198633
D4
Eaccessories0 + 1 = 1
Electronics0 + 1 + 1 = 2
Headphone1Samsung galaxy
1
Mobile phone 0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1 1 lowast 1
Figure 14 Spreading activation for1198634
Proposed solution uses the cosine similarity function[19] to check the similarity between the extracted conceptsrepresenting two documents The cosine similarity functionis shown in
sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)
radicsum119905isin119904119908119904 (119905)2sdot radicsum119905isin119904119908119889 (119905)2
(4)
where 119904 119889 are the two documents119908119904(119905) is the weight of term
119905 in the 119904 document and119908119889(119905) is the weight of term 119905 in the 119889
documentEquation (lowastlowast) shows the calculated similarity between
every two documents based on the features and frequenciesobtained from our solution
Similarity Matrix as Computed by Proposed Solution Con-sider
1198631 1198632 1198633 1198634
1198631
1198632
1198633
1198634
(
10
018
00
064
018
10
083
00
00
083
10
00
064
00
00
10
)
(lowastlowast)
The Scientific World Journal 7
D1
Company 1
Brand 0
Headphone1
Apple 1
Product0
Electronics1
Eaccessories1
Figure 15 Frequencies after spreading activation for1198631
D2
Nutrient 2
Diet1
Apple1
Figure 16 Frequencies after spreading activation for1198632
D3
Nutrient 2
Diet1
Orange1
Figure 17 Frequencies after spreading activation for1198633
The similarity values computed by proposed solutionshown in (lowastlowast) reflect better results according to the seman-tics of the documents In particular consider the following
(i) 1198631 and1198632 similarity is 18 in our solution instead of50 in term-based solutions
(ii) 1198631 and 1198634 similarity is 64 in our solution insteadof 0 in term-based solutions
(iii) 1198632 and1198633 similarity is 83 in our solution instead of50 in term-based solutions
D4
Eaccessories1
Electronics2
Headphone1Samsung galaxy
1
Mobile phone 1
Figure 18 Frequencies after spreading activation for1198634
35 Results Clustering Proposed solution uses an agglomer-ative hierarchical clustering [20] which is a bottom-up clus-tering method The Euclidean distance that was used is thesimilarity measure In initialization each document is con-sidered as cluster Similar documents are merged into clusteruntil a termination condition is satisfied This conditioncould be reaching a certain number of clusters (119896) Theagglomerative hierarchical clustering algorithm is shown inAlgorithm 3
If proposed solution in this case clusters the four doc-uments using the computed similarly values in (lowastlowast) thesolution gets the following clusters
1198621 = 1198891 1198894
1198622 = 1198892 1198893
(5)
This clustering result has very high precision 100 for thissimple example and has major improvement over the resultas shown above when traditional clustering has been per-formed
4 Experimental Results
A prototype was built for testing proposed solution usingJava programming language The prototype performs searchengine result preprocessing feature extraction andmodelingontology enrichment spreading activation and similaritycomputation Protege [21] is used for building the ontologywhich has been used in the experiments We use Jena [22]as an API programmatic environment for querying RDFand OWL based data models that uses the SPARQL querylanguage [23 24]The agglomerative clustering algorithmwasimplemented using R software [25]
To compute the quality of the result we use the preci-sion measure which represents the percentage of positivepredictions by the system that is correct as shown in (6) Weuse human clustering as the reference for the correctness ofthe result clustering where three different people conducted
8 The Scientific World Journal
(1) Initialization cluster(11) Each object be a cluster(12) Creating similarity matrix(2) Clustering(21) Finding a pair of the most similar clusters and merging(22) Computing the distances between new cluster and others(23) Pruning and updating the similarity matrix(24) If the terminal condition is satisfied then output else repeating (21) to (23)(3) Clustered output
Algorithm 3 Clustering algorithm
Table 1 The first experiment values
Query PrecisionApple 95Paris 90Jaguar 90Hollywood 95Red Hot Chili Peppers 95Mac 85Snow Leopard 90Lion 80Tiger 85Mouse 95
the manual clustering and the results obtained from themwere validated
119875 (119862119895) =
10038161003816100381610038161003816119862119905
119895
10038161003816100381610038161003816
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
119875 =
sum119862119895isin119862119875 (119862119895)
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
sum119862119895
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
(6)
We have used ten different queries for testing namelyldquoApplerdquo ldquoParisrdquo ldquoJaguarrdquo ldquoHollywoodrdquo ldquoRed Hot Chili Pep-persrdquo ldquoMacrdquo ldquoSnow Leopardrdquo ldquoLionrdquo ldquoTigerrdquo and ldquoMouserdquoThe first 5 queries were also used for testing in [11] thus weuse them for comparison purposes
We ran two experiments the first one measures theprecision of the clustering resulting from our solution whenapplying all phases of the solution The second experimenttests the system without applying the spreading activationstep to determine the significance of spreading activation Inour experiments we limit documents to be clustered from theresult to 20 documents for each query and we set numbersof clusters to 5 We use httpswwwgooglecom as thesearch engine of choice for retrieving results related to ourexperiments
Table 1 shows precision values for the resulting clusters infirst experiment
The results reported in [11] for the first 5 queries are 575for the query ldquoApplerdquo 85 for the query ldquoParisrdquo 76 for
Table 2 The second experiment values
Query PrecisionApple 75Paris 80Jaguar 60Hollywood 40Red Hot Chili Peppers 75Mac 75Snow Leopard 80Lion 70Tiger 60Mouse 85
the query ldquoJaguarrdquo 86 for the query ldquoRed Hot ChiliPeppersrdquo and 865 for the query ldquoHollywoodrdquo
In the second experiment (running the system with-out spreading activation) the resulting precision values areshown in Table 2
Comparing these values to the values obtained in experi-ment 1 we conclude that activation spreading has contributedto large extent to the high precision results of our solution
As explained earlier the spreading activation algorithmstep gives the proposed solution the ability to performsimilarity comparison and to cluster the document on thesemantic level rather than on the syntax level which setsthe proposed solution apart from most other solutions Thisallows the proposed solution to function in a way that issimilar to a large extent to what a human will do if asked tocluster the documents
5 Conclusion
Searching the Web is a task that consumes too much timeand effort especially for ambiguity queries which have manymeanings WebPages clustering could help in reaching therequired documents that the user is searching for In thispaper a novel approach has been introduced for search resultsclustering that are based on the semantics of the retrieveddocuments rather than the syntax of the terms in those docu-mentsThis means that documents that are semantically sim-ilar are clustered together rather than clustering together doc-uments that just contain similar termsThe proposed solution
The Scientific World Journal 9
has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006
[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999
[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009
[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web
search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013
[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003
[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007
[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998
[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999
[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005
[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006
[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf
[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results
clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004
[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine
resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005
[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-
Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach
for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004
[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005
[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002
[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu
[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet
[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query
[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl
[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
6 The Scientific World Journal
D4
Eaccessories0
Electronics0
Headphone1Samsung galaxy
1
Mobile phone 0
Figure 10 Shortest path for1198634
D1
Company 0 + 1 = 1
Brand 0
Headphone1
Apple 1
Product0
Electronics0 + 1 = 1
Eaccessories0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1
Figure 11 Spreading activation for1198631
D2
Nutrient 0 + 1 + 1 = 2
Diet1
Apple 1
1 lowast 1
1 lowast 1
Figure 12 Spreading activation for1198632
The final frequency values of nodes in the four graphsafter spreading activation are shown in Figures 15 16 17 and18
34 Similarity Computation After applying the shortest pathalgorithm and the activation spreading algorithm the con-cepts with their frequencies in each semantic network graphare extracted to be used in the similarity comparison betweenevery two documents
D3
Nutrient 0 + 1 + 1 = 2
Diet1
Orange1
1 lowast 1
1 lowast 1
Figure 13 Spreading activation for1198633
D4
Eaccessories0 + 1 = 1
Electronics0 + 1 + 1 = 2
Headphone1Samsung galaxy
1
Mobile phone 0 + 1 = 1
1 lowast 1
1 lowast 1
1 lowast 1 1 lowast 1
Figure 14 Spreading activation for1198634
Proposed solution uses the cosine similarity function[19] to check the similarity between the extracted conceptsrepresenting two documents The cosine similarity functionis shown in
sim (119904 119889) =sum119905isin(119904119889)119908119904 (119905) sdot 119908119889 (119905)
radicsum119905isin119904119908119904 (119905)2sdot radicsum119905isin119904119908119889 (119905)2
(4)
where 119904 119889 are the two documents119908119904(119905) is the weight of term
119905 in the 119904 document and119908119889(119905) is the weight of term 119905 in the 119889
documentEquation (lowastlowast) shows the calculated similarity between
every two documents based on the features and frequenciesobtained from our solution
Similarity Matrix as Computed by Proposed Solution Con-sider
1198631 1198632 1198633 1198634
1198631
1198632
1198633
1198634
(
10
018
00
064
018
10
083
00
00
083
10
00
064
00
00
10
)
(lowastlowast)
The Scientific World Journal 7
D1
Company 1
Brand 0
Headphone1
Apple 1
Product0
Electronics1
Eaccessories1
Figure 15 Frequencies after spreading activation for1198631
D2
Nutrient 2
Diet1
Apple1
Figure 16 Frequencies after spreading activation for1198632
D3
Nutrient 2
Diet1
Orange1
Figure 17 Frequencies after spreading activation for1198633
The similarity values computed by proposed solutionshown in (lowastlowast) reflect better results according to the seman-tics of the documents In particular consider the following
(i) 1198631 and1198632 similarity is 18 in our solution instead of50 in term-based solutions
(ii) 1198631 and 1198634 similarity is 64 in our solution insteadof 0 in term-based solutions
(iii) 1198632 and1198633 similarity is 83 in our solution instead of50 in term-based solutions
D4
Eaccessories1
Electronics2
Headphone1Samsung galaxy
1
Mobile phone 1
Figure 18 Frequencies after spreading activation for1198634
35 Results Clustering Proposed solution uses an agglomer-ative hierarchical clustering [20] which is a bottom-up clus-tering method The Euclidean distance that was used is thesimilarity measure In initialization each document is con-sidered as cluster Similar documents are merged into clusteruntil a termination condition is satisfied This conditioncould be reaching a certain number of clusters (119896) Theagglomerative hierarchical clustering algorithm is shown inAlgorithm 3
If proposed solution in this case clusters the four doc-uments using the computed similarly values in (lowastlowast) thesolution gets the following clusters
1198621 = 1198891 1198894
1198622 = 1198892 1198893
(5)
This clustering result has very high precision 100 for thissimple example and has major improvement over the resultas shown above when traditional clustering has been per-formed
4 Experimental Results
A prototype was built for testing proposed solution usingJava programming language The prototype performs searchengine result preprocessing feature extraction andmodelingontology enrichment spreading activation and similaritycomputation Protege [21] is used for building the ontologywhich has been used in the experiments We use Jena [22]as an API programmatic environment for querying RDFand OWL based data models that uses the SPARQL querylanguage [23 24]The agglomerative clustering algorithmwasimplemented using R software [25]
To compute the quality of the result we use the preci-sion measure which represents the percentage of positivepredictions by the system that is correct as shown in (6) Weuse human clustering as the reference for the correctness ofthe result clustering where three different people conducted
8 The Scientific World Journal
(1) Initialization cluster(11) Each object be a cluster(12) Creating similarity matrix(2) Clustering(21) Finding a pair of the most similar clusters and merging(22) Computing the distances between new cluster and others(23) Pruning and updating the similarity matrix(24) If the terminal condition is satisfied then output else repeating (21) to (23)(3) Clustered output
Algorithm 3 Clustering algorithm
Table 1 The first experiment values
Query PrecisionApple 95Paris 90Jaguar 90Hollywood 95Red Hot Chili Peppers 95Mac 85Snow Leopard 90Lion 80Tiger 85Mouse 95
the manual clustering and the results obtained from themwere validated
119875 (119862119895) =
10038161003816100381610038161003816119862119905
119895
10038161003816100381610038161003816
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
119875 =
sum119862119895isin119862119875 (119862119895)
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
sum119862119895
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
(6)
We have used ten different queries for testing namelyldquoApplerdquo ldquoParisrdquo ldquoJaguarrdquo ldquoHollywoodrdquo ldquoRed Hot Chili Pep-persrdquo ldquoMacrdquo ldquoSnow Leopardrdquo ldquoLionrdquo ldquoTigerrdquo and ldquoMouserdquoThe first 5 queries were also used for testing in [11] thus weuse them for comparison purposes
We ran two experiments the first one measures theprecision of the clustering resulting from our solution whenapplying all phases of the solution The second experimenttests the system without applying the spreading activationstep to determine the significance of spreading activation Inour experiments we limit documents to be clustered from theresult to 20 documents for each query and we set numbersof clusters to 5 We use httpswwwgooglecom as thesearch engine of choice for retrieving results related to ourexperiments
Table 1 shows precision values for the resulting clusters infirst experiment
The results reported in [11] for the first 5 queries are 575for the query ldquoApplerdquo 85 for the query ldquoParisrdquo 76 for
Table 2 The second experiment values
Query PrecisionApple 75Paris 80Jaguar 60Hollywood 40Red Hot Chili Peppers 75Mac 75Snow Leopard 80Lion 70Tiger 60Mouse 85
the query ldquoJaguarrdquo 86 for the query ldquoRed Hot ChiliPeppersrdquo and 865 for the query ldquoHollywoodrdquo
In the second experiment (running the system with-out spreading activation) the resulting precision values areshown in Table 2
Comparing these values to the values obtained in experi-ment 1 we conclude that activation spreading has contributedto large extent to the high precision results of our solution
As explained earlier the spreading activation algorithmstep gives the proposed solution the ability to performsimilarity comparison and to cluster the document on thesemantic level rather than on the syntax level which setsthe proposed solution apart from most other solutions Thisallows the proposed solution to function in a way that issimilar to a large extent to what a human will do if asked tocluster the documents
5 Conclusion
Searching the Web is a task that consumes too much timeand effort especially for ambiguity queries which have manymeanings WebPages clustering could help in reaching therequired documents that the user is searching for In thispaper a novel approach has been introduced for search resultsclustering that are based on the semantics of the retrieveddocuments rather than the syntax of the terms in those docu-mentsThis means that documents that are semantically sim-ilar are clustered together rather than clustering together doc-uments that just contain similar termsThe proposed solution
The Scientific World Journal 9
has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006
[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999
[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009
[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web
search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013
[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003
[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007
[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998
[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999
[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005
[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006
[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf
[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results
clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004
[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine
resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005
[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-
Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach
for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004
[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005
[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002
[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu
[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet
[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query
[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl
[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 7
D1
Company 1
Brand 0
Headphone1
Apple 1
Product0
Electronics1
Eaccessories1
Figure 15 Frequencies after spreading activation for1198631
D2
Nutrient 2
Diet1
Apple1
Figure 16 Frequencies after spreading activation for1198632
D3
Nutrient 2
Diet1
Orange1
Figure 17 Frequencies after spreading activation for1198633
The similarity values computed by proposed solutionshown in (lowastlowast) reflect better results according to the seman-tics of the documents In particular consider the following
(i) 1198631 and1198632 similarity is 18 in our solution instead of50 in term-based solutions
(ii) 1198631 and 1198634 similarity is 64 in our solution insteadof 0 in term-based solutions
(iii) 1198632 and1198633 similarity is 83 in our solution instead of50 in term-based solutions
D4
Eaccessories1
Electronics2
Headphone1Samsung galaxy
1
Mobile phone 1
Figure 18 Frequencies after spreading activation for1198634
35 Results Clustering Proposed solution uses an agglomer-ative hierarchical clustering [20] which is a bottom-up clus-tering method The Euclidean distance that was used is thesimilarity measure In initialization each document is con-sidered as cluster Similar documents are merged into clusteruntil a termination condition is satisfied This conditioncould be reaching a certain number of clusters (119896) Theagglomerative hierarchical clustering algorithm is shown inAlgorithm 3
If proposed solution in this case clusters the four doc-uments using the computed similarly values in (lowastlowast) thesolution gets the following clusters
1198621 = 1198891 1198894
1198622 = 1198892 1198893
(5)
This clustering result has very high precision 100 for thissimple example and has major improvement over the resultas shown above when traditional clustering has been per-formed
4 Experimental Results
A prototype was built for testing proposed solution usingJava programming language The prototype performs searchengine result preprocessing feature extraction andmodelingontology enrichment spreading activation and similaritycomputation Protege [21] is used for building the ontologywhich has been used in the experiments We use Jena [22]as an API programmatic environment for querying RDFand OWL based data models that uses the SPARQL querylanguage [23 24]The agglomerative clustering algorithmwasimplemented using R software [25]
To compute the quality of the result we use the preci-sion measure which represents the percentage of positivepredictions by the system that is correct as shown in (6) Weuse human clustering as the reference for the correctness ofthe result clustering where three different people conducted
8 The Scientific World Journal
(1) Initialization cluster(11) Each object be a cluster(12) Creating similarity matrix(2) Clustering(21) Finding a pair of the most similar clusters and merging(22) Computing the distances between new cluster and others(23) Pruning and updating the similarity matrix(24) If the terminal condition is satisfied then output else repeating (21) to (23)(3) Clustered output
Algorithm 3 Clustering algorithm
Table 1 The first experiment values
Query PrecisionApple 95Paris 90Jaguar 90Hollywood 95Red Hot Chili Peppers 95Mac 85Snow Leopard 90Lion 80Tiger 85Mouse 95
the manual clustering and the results obtained from themwere validated
119875 (119862119895) =
10038161003816100381610038161003816119862119905
119895
10038161003816100381610038161003816
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
119875 =
sum119862119895isin119862119875 (119862119895)
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
sum119862119895
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
(6)
We have used ten different queries for testing namelyldquoApplerdquo ldquoParisrdquo ldquoJaguarrdquo ldquoHollywoodrdquo ldquoRed Hot Chili Pep-persrdquo ldquoMacrdquo ldquoSnow Leopardrdquo ldquoLionrdquo ldquoTigerrdquo and ldquoMouserdquoThe first 5 queries were also used for testing in [11] thus weuse them for comparison purposes
We ran two experiments the first one measures theprecision of the clustering resulting from our solution whenapplying all phases of the solution The second experimenttests the system without applying the spreading activationstep to determine the significance of spreading activation Inour experiments we limit documents to be clustered from theresult to 20 documents for each query and we set numbersof clusters to 5 We use httpswwwgooglecom as thesearch engine of choice for retrieving results related to ourexperiments
Table 1 shows precision values for the resulting clusters infirst experiment
The results reported in [11] for the first 5 queries are 575for the query ldquoApplerdquo 85 for the query ldquoParisrdquo 76 for
Table 2 The second experiment values
Query PrecisionApple 75Paris 80Jaguar 60Hollywood 40Red Hot Chili Peppers 75Mac 75Snow Leopard 80Lion 70Tiger 60Mouse 85
the query ldquoJaguarrdquo 86 for the query ldquoRed Hot ChiliPeppersrdquo and 865 for the query ldquoHollywoodrdquo
In the second experiment (running the system with-out spreading activation) the resulting precision values areshown in Table 2
Comparing these values to the values obtained in experi-ment 1 we conclude that activation spreading has contributedto large extent to the high precision results of our solution
As explained earlier the spreading activation algorithmstep gives the proposed solution the ability to performsimilarity comparison and to cluster the document on thesemantic level rather than on the syntax level which setsthe proposed solution apart from most other solutions Thisallows the proposed solution to function in a way that issimilar to a large extent to what a human will do if asked tocluster the documents
5 Conclusion
Searching the Web is a task that consumes too much timeand effort especially for ambiguity queries which have manymeanings WebPages clustering could help in reaching therequired documents that the user is searching for In thispaper a novel approach has been introduced for search resultsclustering that are based on the semantics of the retrieveddocuments rather than the syntax of the terms in those docu-mentsThis means that documents that are semantically sim-ilar are clustered together rather than clustering together doc-uments that just contain similar termsThe proposed solution
The Scientific World Journal 9
has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006
[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999
[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009
[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web
search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013
[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003
[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007
[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998
[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999
[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005
[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006
[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf
[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results
clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004
[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine
resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005
[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-
Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach
for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004
[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005
[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002
[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu
[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet
[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query
[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl
[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
8 The Scientific World Journal
(1) Initialization cluster(11) Each object be a cluster(12) Creating similarity matrix(2) Clustering(21) Finding a pair of the most similar clusters and merging(22) Computing the distances between new cluster and others(23) Pruning and updating the similarity matrix(24) If the terminal condition is satisfied then output else repeating (21) to (23)(3) Clustered output
Algorithm 3 Clustering algorithm
Table 1 The first experiment values
Query PrecisionApple 95Paris 90Jaguar 90Hollywood 95Red Hot Chili Peppers 95Mac 85Snow Leopard 90Lion 80Tiger 85Mouse 95
the manual clustering and the results obtained from themwere validated
119875 (119862119895) =
10038161003816100381610038161003816119862119905
119895
10038161003816100381610038161003816
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
119875 =
sum119862119895isin119862119875 (119862119895)
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
sum119862119895
10038161003816100381610038161003816119862119895
10038161003816100381610038161003816
(6)
We have used ten different queries for testing namelyldquoApplerdquo ldquoParisrdquo ldquoJaguarrdquo ldquoHollywoodrdquo ldquoRed Hot Chili Pep-persrdquo ldquoMacrdquo ldquoSnow Leopardrdquo ldquoLionrdquo ldquoTigerrdquo and ldquoMouserdquoThe first 5 queries were also used for testing in [11] thus weuse them for comparison purposes
We ran two experiments the first one measures theprecision of the clustering resulting from our solution whenapplying all phases of the solution The second experimenttests the system without applying the spreading activationstep to determine the significance of spreading activation Inour experiments we limit documents to be clustered from theresult to 20 documents for each query and we set numbersof clusters to 5 We use httpswwwgooglecom as thesearch engine of choice for retrieving results related to ourexperiments
Table 1 shows precision values for the resulting clusters infirst experiment
The results reported in [11] for the first 5 queries are 575for the query ldquoApplerdquo 85 for the query ldquoParisrdquo 76 for
Table 2 The second experiment values
Query PrecisionApple 75Paris 80Jaguar 60Hollywood 40Red Hot Chili Peppers 75Mac 75Snow Leopard 80Lion 70Tiger 60Mouse 85
the query ldquoJaguarrdquo 86 for the query ldquoRed Hot ChiliPeppersrdquo and 865 for the query ldquoHollywoodrdquo
In the second experiment (running the system with-out spreading activation) the resulting precision values areshown in Table 2
Comparing these values to the values obtained in experi-ment 1 we conclude that activation spreading has contributedto large extent to the high precision results of our solution
As explained earlier the spreading activation algorithmstep gives the proposed solution the ability to performsimilarity comparison and to cluster the document on thesemantic level rather than on the syntax level which setsthe proposed solution apart from most other solutions Thisallows the proposed solution to function in a way that issimilar to a large extent to what a human will do if asked tocluster the documents
5 Conclusion
Searching the Web is a task that consumes too much timeand effort especially for ambiguity queries which have manymeanings WebPages clustering could help in reaching therequired documents that the user is searching for In thispaper a novel approach has been introduced for search resultsclustering that are based on the semantics of the retrieveddocuments rather than the syntax of the terms in those docu-mentsThis means that documents that are semantically sim-ilar are clustered together rather than clustering together doc-uments that just contain similar termsThe proposed solution
The Scientific World Journal 9
has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006
[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999
[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009
[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web
search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013
[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003
[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007
[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998
[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999
[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005
[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006
[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf
[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results
clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004
[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine
resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005
[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-
Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach
for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004
[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005
[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002
[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu
[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet
[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query
[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl
[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 9
has been implemented and tested Our experiments showremarkable accuracy level for our solution Our future workis to examine the effect of using more constraints in thespreading activation step scaling the solution to support largenumber of retrieved search engine results and improving theontology used to support more queries and domains
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R Campos G Dias and C Nunes ldquoWISE hierarchical softclustering of web page search results based on web contentmining techniquesrdquo in Proceedings of the IEEEWICACMInternational Conference on Web Intelligence (WI rsquo06) pp 301ndash304 Hong Kong December 2006
[2] O E ZamirClusteringWebDocuments a Phrase-BasedMethodfor Grouping Search Engine Results University of Washington1999
[3] C Carpineto S Osinski G Romano and D Weiss ldquoA surveyof web clustering enginesrdquoACMComputing Surveys vol 41 no3 article 17 2009
[4] Vivisimo httpvivisimocom[5] A Di Marco and R Navigli ldquoClustering and diversifying web
search results with graph-based word sense inductionrdquo Com-putational Linguistics vol 39 no 3 pp 709ndash754 2013
[6] B C Fung K Wang and M Ester ldquoHierarchical documentclustering using frequent itemsetsrdquo in Proceedings of the 2003SIAM International Conference on Data Mining (SDM rsquo03) pp59ndash70 2003
[7] F Gelgi H Davulcu and S Vadrevu ldquoTerm ranking for cluster-ing web search resultsrdquo in Proceedings of the 10th InternationalWorkshop on the Web and Databases (WebDB rsquo07) BeijingChina June 2007
[8] O Zamir and O Etzioni ldquoWeb document clustering a feasibil-ity demonstrationrdquo in Proceedings of the 21st Annual Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval pp 46ndash54 Melbourne Australia August1998
[9] O Zamir and O Etzioni ldquoGrouper a dynamic clustering inter-face toweb search resultsrdquoComputer Networks vol 31 no 11ndash16pp 1361ndash1374 1999
[10] S Osinski and D Weiss ldquoA concept-driven algorithm forclustering search resultsrdquo IEEE Intelligent Systems vol 20 no3 pp 48ndash54 2005
[11] H O Borch Clustering on-line clustering of web search results[MS thesis] Norwegian University of Science and Technology2006
[12] A Popescul and L H Ungar ldquoAutomatic labeling of documentclustersrdquo In press httpciteseeristpsueduviewdocdownloaddoi=101133141amprep=rep1amptype=pdf
[13] Clusty httpclustycom[14] S Osinski J Stefanowski and D Weiss ldquoLingo search results
clustering algorithmbased on singular value decompositionrdquo inIntelligent Information Processing andWebMining pp 359ndash368Springer 2004
[15] A Schenker M Last and A Kandel ldquoDesign and implemen-tation of a web mining system for organizing search engine
resultsrdquo International Journal of Intelligent Systems vol 20 no6 pp 607ndash625 2005
[16] C FellbaumWordNet Wiley 1998[17] K Rosen Discrete Mathematics and Its Applications McGraw-
Hill Science 7th edition 2011[18] C Rocha D Schwabe andM P de Aragao ldquoA hybrid approach
for searching in the semantic webrdquo in Proceedings of the 13thInternationalWorldWideWebConference (WWWrsquo04) pp 374ndash383 New York NY USA May 2004
[19] S C Gates W Teiken and K-S F Cheng ldquoTaxonomies by thenumbers building high-performance taxonomiesrdquo in Proceed-ings of the 14th ACM International Conference on Informationand Knowledge Management (CIKM rsquo05) pp 568ndash577 ACMBremen Germany November 2005
[20] Y Zhao and G Karypis ldquoEvaluation of hierarchical clusteringalgorithms for document datasetsrdquo in Proceedings of the 11thInternational Conference on Information and Knowledge Man-agement (CIKM rsquo02) pp 515ndash524 November 2002
[21] Protege ldquoThe Protege Ontology Editorrdquo 2011 httpprotegestanfordedu
[22] JenaASemanticWeb Framework for Java Jena 2012 httpjenasourceforgenet
[23] E Prudrsquohommeaux and A Seaborne SPARQL Query Languagefor RDF 2011 httpwwww3orgTRrdf-sparql-query
[24] W3C Web Ontology Language (OWL) httpwwww3org2001swowl
[25] J Fox and R Andersen Using the R Statistical ComputingEnvironment to Teach Social Statistics Courses Department ofSociology McMaster University 2005
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014