

JOURNAL OF COMPUTING, VOLUME 3, ISSUE 3, MARCH 2011, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/ 

WWW.JOURNALOFCOMPUTING.ORG

Automatic construction of ontology by exploiting web using Google API and JSON

Kalyan Netti 

Abstract— Much of the data available on the web is unstructured, and constructing an ontology for an unexplored domain is a difficult task. Automatic generation of ontology from unstructured data is a very important part of the semantic web. In this paper we present a methodology to automatically construct an ontology from information extracted from the web for a given keyword. This ontology represents a taxonomy of classes for the specified keyword's domain and enables the user to choose the most significant sites that can be found on the Web. The suggested approach performs generation and renewal of the ontology automatically whenever a search is completed. A key resource in our work is the Google AJAX Search API for extracting information, and JSON is used to parse the output for the construction of the ontology. The obtained classification, a hierarchically structured list of the most representative web sites for each ontology class, is a great help for finding and accessing the desired web resources.

Index Terms— Semantic Web, Ontology, Google AJAX API, JSON, RDF, OWL, Information Extraction, Knowledge Base.

——————————    ——————————

1 INTRODUCTION

The web hosts millions of information pieces and is still growing at a rapid pace. No single human can have an overview of all web pages and the information they provide; thus, the trend towards a machine-"understandable" web has been proposed in the semantic web initiative [1]. If machines can read, understand (disambiguate) and aggregate information pieces from many different sources, the human can consume the desired information much faster. In different knowledge domains the approaches for retrieving and extracting knowledge can vary. This requires more human effort in configuring the extraction system. This paper suggests a mechanism which minimizes the human effort and automatically builds a structured knowledge base from the information that is available on the web.

For achieving machine interoperability a structured way of representing information is required, and ontologies [2] (machine-processable representations that contain the semantic information of a domain) can be very useful. The most widely quoted definition of "ontology" was given by Tom Gruber in 1993, who defines ontology as [3]: "An explicit specification of a conceptualization." Ontologies have proved their usefulness in different application scenarios, such as intelligent information integration, knowledge-based systems, and natural language processing.

The role of ontologies is to capture domain knowledge in a generic way and provide a commonly agreed upon understanding of a domain. The common vocabulary of an ontology, defining the meaning of terms and relations, is usually organized in a taxonomy. An ontology usually contains modelling primitives such as concepts, relations between concepts, and axioms. Ontologies have shown to be the right answer to the structuring and modelling problems arising in Knowledge Management. They allow transferring and processing information effectively in a distributed environment (like multi-agent systems). Therefore, the building of an ontology that represents a specified domain is a critical process that has to be carried out carefully. However, manual ontology building is a difficult task that requires extended knowledge of the domain (an expert), and in most cases the result could be incomplete or inaccurate.

In order to ease the ontology construction process, automatic methodologies can be used after extracting structured knowledge, like concepts and semantic relations, from unstructured resources that cover the main domain's topics [4]. The solution proposed in this paper is to use the information available on the Web to create the ontology. This method has the advantage that the ontology is built automatically and fully represents the actual state of the art of a domain (based on the web pages that cover a specific topic). The methodology used in this paper is to extract information from the Web to build an ontology for a given domain, and the most representative web sites for each ontology concept will be retrieved.

2 RELATED WORK

To improve machine interpretability of web content there have been some proposals to make web components structured, like using XML [5] notation to represent concepts and hierarchies. SHOE [6], "Simple HTML Ontology Extensions", is a small extension to HTML which allows web page authors to annotate their web documents and include tags with semantic information. However, none of these techniques has been widely used and, in addition, the amount of purely visual presentation information, which makes the search for useful data difficult, has increased. The search for information and the machine processing of data are not easy because the web is

———————————————— 

Kalyan Netti is with the National Geophysical Research Institute, Hyderabad, under the Council of Scientific and Industrial Research, India.



unstructured and there is no standard way of representing information. The failure to extract structured information from the web is mainly due to web designers who design websites in their own way, information about visual presentation, and the usage of a large number of different file types like .doc, .ps, .ppt, .pdf, etc. This becomes a serious drawback when one tries to create structured information representations like ontologies from unstructured ones. Though several tools are available to ease web search, as given below, we will see why they do not satisfy the user requirement:

1. Search engines like Google [7], Yahoo [8] and Bing [9]

do a great job of indexing web sites, but the way they obtain results is quite limited: they simply check the presence or absence of a keyword on each web page. The list of results is sorted using an effective rating mechanism according to each website's relevance.

2. API search tools and directory services like the Google API Search Tool [10][11], Yahoo Developer Network API [12] and Yahoo Directory services [13] provide options like structuring the list of websites into different categories of search. Many projects have been exploring this path, like Guided Google [14] and Google API Proximity Search (GAPS) developed by Staggernation.com [15], but in many cases the result set is quite reduced and sometimes the information is outdated.

These tools can be useful when one knows exactly what to search for and the domain where it belongs, but in most cases the amount of returned results makes it difficult to obtain the desired information. However, we will see in this paper how these difficulties are addressed by constructing ontologies. To achieve this, the powerful Google AJAX Search API is used in this paper, which has access to the vast index of more than two billion web pages and information in the Google cache. Google Web APIs support the same search syntax as the Google.com site. In short, the Google AJAX Search APIs serve as an enhanced conduit to several of Google's most popular hosted services. This paper also uses JSON [16] for parsing the streamed response obtained from a search. It additionally gives an insight into parsing the streamed response using JSON through a web interface built with JSP, thus providing structured output for the construction of the ontology; not much information about JSON implementation using JSP is available on the web.

3 METHODOLOGY

3.1 Architecture

This section describes the methodology used to discover and select class names, along with URLs, for construction of the final ontology. The mechanism, as shown in Figure 1, involves analysing a large number of websites in order to find related concepts for a domain by studying the keyword's neighbourhood. The output is processed using JSON to select the most appropriate ones. The selected concepts are used as classes in constructing the final ontology; OWL is used as the language to construct ontologies. For each concept which is used as a class in the ontology, a URL from where the concept was extracted is associated. The process is repeated recursively using combinations of the new concepts in order to build an appropriate hierarchy.
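To make the recursive expansion concrete, the loop described above can be sketched as follows. This is only a sketch: the `search()` method is a stub standing in for the Google AJAX Search call, and every keyword and candidate in `INDEX` is invented purely to illustrate how the recursion combines a candidate with its parent keyword and stops when a combined keyword returns no results.

```java
import java.util.*;

public class OntologyBuilder {
    // Canned stand-in for the Google AJAX search step: maps a query to
    // candidate concepts found in its result snippets. All entries here
    // are hypothetical, chosen only to illustrate the recursion.
    static final Map<String, List<String>> INDEX = Map.of(
            "Network", List.of("LAN", "Internet"),
            "LAN Network", List.of("FDDI"),
            "FDDI LAN Network", List.of(),
            "Internet Network", List.of());

    public static List<String> search(String keyword) {
        return INDEX.getOrDefault(keyword, List.of());
    }

    // Expands "candidate + keyword" recursively; recursion stops when a
    // combined keyword returns no results, as described in the text.
    static void build(String keyword, Map<String, List<String>> ontology) {
        List<String> candidates = search(keyword);
        ontology.put(keyword, candidates);
        for (String c : candidates) {
            build(c + " " + keyword, ontology);
        }
    }

    public static Map<String, List<String>> run(String root) {
        Map<String, List<String>> ontology = new LinkedHashMap<>();
        build(root, ontology);
        return ontology;
    }
}
```

Replacing `search()` with a real web call (and recording the source URL of each candidate) yields the class-plus-URL pairs that the OWL generation step consumes.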


Figure 1: Main Architecture. Flow: Keyword Input → Google Ajax API → Parsing using JSON → Class Selection for ontology → Class and URL entries; Class Name + Keyword is fed back recursively into the Google Ajax API, producing the Ontology (OWL).

The following sections explain the process in detail.

3.2 Google Ajax Search API

The architecture of how Google Web Services interact with user applications is shown in Figure 2. The Google server is responsible for processing users' search queries. Programmers develop applications in a language of their choice (Java, C, Perl, PHP, .NET, etc.) and connect to the remote Google Web APIs service. Communication is performed via the Google AJAX Search API. The AJAX Search API allows you to easily integrate some very powerful and diverse Google-based search mechanisms or "controls" onto a web page with relatively minimal coding. These include: [10]




a) Web Search: This is a traditional search input field where, when a query is entered, a series of text search results appear on the page.

b) Local Search: With Local Search, a Google Map is mashed together with a search input field and the search results are based on a specific location.

c) Video Search: The AJAX Video Search provides the ability to offer compelling video search along with accompanying video-based search results.

Once connected, the application will be able to issue search requests to Google's index of more than two billion web pages and receive results as structured data, access information in the Google cache, and check the spelling of words. Google Web APIs support the same search syntax as the Google.com site. In short, the Google AJAX APIs serve as an enhanced conduit to several of Google's most popular hosted services. The hosted services such as Google Search or Google Maps can be accessed directly, but with the AJAX APIs comes the ability to integrate these hosted services into anyone's custom web pages. The way the AJAX APIs work is by allowing any web page hosted on the Internet to access Google search (or feed) data through JavaScript code.

Figure 2: Google AJAX Search API Architecture

The core JavaScript code that fetches the search or feed data can be as simple as Search.execute() or Feed.load(). As the request is made to Google's worldwide servers, a response of either Search data or prepared AJAX Feed data is streamed back to the web page in either JSON (JavaScript Object Notation) or XML format. Parsing of this data can either be done manually or automatically by using one of the provided UI controls that are built upon the lower-level AJAX APIs.

3.3 JSON

JSON (an acronym for JavaScript Object Notation) [16], as shown in Figure 3, is a lightweight text-based open standard designed for human-readable data interchange. It is derived from the JavaScript programming language for representing simple data structures and associative arrays, called objects. Despite its relationship to JavaScript, it is language-independent, with parsers available for virtually every programming language.

In JSON the string data structures take the forms shown in Figure 3. JSON Schema [17] is a specification for a JSON-based format for defining the structure of JSON data. JSON Schema provides a contract for what JSON data is required for a given application and how it can be modified, much like what XML Schema provides for XML. JSON Schema is intended to provide validation, documentation, and interaction control of JSON data. JSON Schema is based on concepts from XML Schema, RELAX NG, and Kwalify, but is intended to be JSON-based, so that JSON data in the form of a schema can be used to validate JSON data, the same serialization/deserialization tools can be used for the schema and data, and it can be self-descriptive.
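As a small illustration (not reproduced from the paper's Figure 3, and written in the later array-valued `required` syntax rather than the draft-03 form current in 2011), a schema constraining the kind of search-result object handled here might look like:

```json
{
  "type": "object",
  "properties": {
    "url":     { "type": "string" },
    "title":   { "type": "string" },
    "content": { "type": "string" }
  },
  "required": ["url"]
}
```

An instance such as {"url": "http://www.example.org", "title": "Example"} validates against it, while an object missing "url" does not.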

Figure 3: JSON Schema

Apart from certain limitations inherent to textual data formats, which also apply to XML and YAML, JSON is primarily used for communicating data over the Internet. The proposed system in this paper uses JSON extensively to parse the response sent by the Google API. Parsing the response using JSON enables the system to store the results in an array and obtain the required URLs, the total result count, the content of each website, etc. JSON parsing allows the program to do an exhaustive analysis of the response data to select appropriate candidate words.
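The kind of field access that JSON parsing gives the system can be illustrated with a deliberately simplified, library-free sketch. The real implementation uses a JSON parser (shown in Section 4); this stand-in only scans for a flat `"key":"value"` pair and assumes no escaped quotes or whitespace, so it is an illustration of what the parsing step yields, not of how it is done.

```java
public class JsonFieldSketch {
    // Pulls the value of a flat "key":"value" pair out of a JSON string.
    // Assumes the field is a plain string with no escaped quotes and no
    // whitespace around the colon; a real JSON parser handles the rest.
    public static String field(String json, String key) {
        String marker = "\"" + key + "\":\"";
        int start = json.indexOf(marker);
        if (start < 0) return null;          // field absent
        start += marker.length();
        int end = json.indexOf('"', start);  // closing quote of the value
        return json.substring(start, end);
    }
}
```

For example, `field(response, "visibleUrl")` applied to a Google-style result object returns the bare hostname that the class-selection step later associates with a candidate class.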

3.4 Ontology Representation

Ontology is nothing but a specification of a conceptualization [3]. A conceptualization is an abstract, simplified view of the world that we wish to represent for some purpose [3]. Every knowledge base, knowledge-based system, or knowledge-level agent is committed to some conceptualization, explicitly or implicitly. It involves explicit representation of knowledge about some topic. Ontologies can enhance the functioning of the Web in many ways. They can be used in a simple fashion to improve the accuracy of Web searches: the general search program can look for only those pages that refer to a precise concept instead of all the ones using ambiguous keywords. Problem-solving methods, domain-independent applications, and software agents use knowledge bases built from ontologies as data. Ontologies can be domain specific as well. Domain-specific ontologies describe a specific domain by defining different terms applicable to that domain and specifying


the interrelationship between them. Several domain ontologies can also be merged for a more general representation. The work in this paper mainly involves development of a domain-specific ontology described via one of the most common ontology languages, OWL [18][21], which is a vocabulary extension of RDF [19]. Development of an ontology has several utilities: it helps in understanding an area of knowledge better, enables multiple machines to share the knowledge, and allows the knowledge to be used in various applications. The Resource Description Framework is nothing but a language to represent the ontology of an area of knowledge. Ontology provides greater machine interpretability where processing of information is essential, and that processing makes it easy to find relations between classes and subclasses. Moreover, OWL is supported by many ontology visualizers and editors like Protégé, which allows domain experts to build knowledge-based systems by creating and modifying reusable ontologies and problem-solving methods. In this paper we use Protégé 3.4.1 [20], which enables the user to explore, understand, analyse or even modify the resulting ontology. The final hierarchy, which is the output of the process described under subsection 3.1, is presented in a refined way. The ontology for the Network keyword as visualized in Protégé is shown in Figure 4.

Figure 4: Network Ontology in Protégé 3.4.1

Jambalaya [21] is a Protégé plug-in that uses SHriMP to visualize Protégé-Frames and Protégé-OWL ontologies. The ontology with URLs for the Network keyword, when visualized in Jambalaya, is shown in Figure 5; Jambalaya enables the user to graphically visualize the hierarchy. The Jambalaya extension is available with the Protégé 3.4.1 installer.

Figure 5: Jambalaya tab for the Network Ontology in Protégé 3.4.1

The instances can be searched in Jambalaya using a search tool.

4 DESIGN AND IMPLEMENTATION

The detailed architecture, as shown in Figure 1, involves the following sequence of steps. The entire mechanism is implemented in Java and JSP.

1. The whole process starts by entering a keyword in the HTML interface, for which the ontology is constructed; here Network was chosen as the keyword.

2. After entering the keyword, the program uses the Google AJAX Search API to obtain information about the keyword and the websites that contain it. The following code is used to connect to Google:

query = URLEncoder.encode(query, "UTF-8");
URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?start=0&rsz=large&v=1.0&q=" + query);
// opening connection
URLConnection connection = url.openConnection();

3. The result after searching is shown in Figure 6 (here the keyword used is Network). The response data is in the form of an array which contains useful information about the matching websites, such as title, URL, content, etc. The response data is given below; filters like restricting the number of returned results (here restricted to 100) are used in the program.
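The request URL from step 2 and the result-count restriction can be sketched together. This is illustrative only (the Google AJAX Search endpoint has since been retired, and the class and method names below are ours): with rsz=large the endpoint returned eight results per page, so a cap such as 100 results is enforced by iterating the start parameter in the calling loop.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RequestUrlSketch {
    // Builds one page request against the (retired) AJAX Search endpoint.
    // The caller pages through results by stepping start = 0, 8, 16, ...
    // until the desired cap (e.g. 100) is reached or no results remain.
    public static String requestUrl(String keyword, int start) {
        String q = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        return "http://ajax.googleapis.com/ajax/services/search/web"
                + "?start=" + start + "&rsz=large&v=1.0&q=" + q;
    }
}
```

For the combined keyword "LAN Network" used later in the recursion, the encoder turns the space into a plus sign, so the page at offset 8 is requested as `...&q=LAN+Network`.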


Figure 6: Array output of the result after searching Google using the Google AJAX Search API with the keyword "network"

{"responseData": {"results": [
  {"GsearchResultClass": "GwebSearch",
   "unescapedUrl": "http://www.webopedia.com/TERM/N/network.html",
   "url": "http://www.webopedia.com/TERM/N/network.html",
   "visibleUrl": "www.webopedia.com",
   "cacheUrl": "http://www.google.com/search?q\u003dcache:HGyueh94nIkJ:www.webopedia.com",
   "title": "What is \u003cb\u003enetwork\u003c/b\u003e? - A Word Definition From the Webopedia Computer \u003cb\u003e...\u003c/b\u003e",
   "titleNoFormatting": "What is network? - A Word Definition From the Webopedia Computer ...",
   "content": "This page describes the term \u003cb\u003enetwork\u003c/b\u003e and lists other pages on the Web where you can find additional information."},

4. The response data needs to be parsed to select the appropriate class and the associated URL for constructing the ontology. JSON is used to parse the response data. After parsing, the program is able to separate the websites returned for the keyword and catch the similar or relevant words associated with the main keyword (here Network), which is explained in the next point. The code snippet used to parse the response data using JSON is as follows:

// Get the JSON response
String line;
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
while ((line = reader.readLine()) != null) {
    builder.append(line);
    out.println("\n\n" + line);
}
String response1 = builder.toString();
JSONObject json = new JSONObject(response1); // build the JSON object (implied by the original listing)

// Parsing of response data
out.println("Total results = " + json.getJSONObject("responseData").getJSONObject("cursor").getString("estimatedResultCount") + "<br>");
out.println("<br>");
JSONArray ja = json.getJSONObject("responseData").getJSONArray("results");
out.println("Results:<br>");
for (int i = 0; i < ja.length(); i++) {
    out.println("<br>");
    JSONObject j = ja.getJSONObject(i);
    out.println("Title:" + j.getString("titleNoFormatting"));
    out.println("<br>");
    out.println("URL:" + j.getString("url"));
    out.println("<br>");
    out.println("Content:" + j.getString("content"));
}

5. URLs for a class are chosen on the basis of the number of occurrences of the keyword in the content of the webpage (the "content" is extracted from the response data using JSON as shown above). After the URLs for the class have been selected, candidate words, or subclasses, for constructing the hierarchy of the main keyword are selected according to their association with the main keyword. The program chooses words which are relevant to the main keyword, i.e. excluding prepositions, determiners, etc., having a minimum size of more than two characters, and represented in standard ASCII.

6. For each candidate word selected, a detailed analysis is conducted. The detailed analysis includes checking the number of occurrences of the word in the content of the website. After this analysis an appropriate candidate is selected on the basis of the depth (total number of occurrences) of the word.

7. After the candidate word is selected, a new keyword is formed by joining the candidate word and the initial keyword (e.g. LAN Network) and the entire process is repeated recursively. Each repeated pass has its own selection of candidates based on the constraints mentioned above. The recursive process stops when no search results are found for a word.

8. The final result is the hierarchy of classes and subclasses, which is the ontology. Each candidate word is chosen as a class name and the URLs from which it was selected are stored. The websites associated with each class are the most appropriate ones.
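The candidate-word constraints in steps 5 and 6 can be sketched as a small filter-and-rank routine. This is a sketch under stated assumptions: the paper only says that prepositions and determiners are excluded, so the tiny stop-word set below is illustrative, and the ranking simply takes the word with the greatest depth (occurrence count) in a content string.

```java
import java.util.*;

public class CandidateSelector {
    // Illustrative stand-in for the excluded prepositions/determiners;
    // the paper does not publish its actual exclusion list.
    static final Set<String> STOP = Set.of("the", "and", "for", "with", "from", "that");

    // Constraints from steps 5-6: more than two characters, not a
    // stop word, and represented in standard ASCII.
    public static boolean eligible(String w) {
        return w.length() > 2 && !STOP.contains(w) && w.matches("\\p{ASCII}+");
    }

    // Picks the eligible word with the greatest "depth" (occurrence count)
    // in the content extracted from the JSON response.
    public static String best(String content) {
        Map<String, Integer> depth = new HashMap<>();
        for (String w : content.toLowerCase().split("[^a-z]+")) {
            if (eligible(w)) depth.merge(w, 1, Integer::sum);
        }
        return depth.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```

The winning word is then joined with the current keyword (step 7) to form the next query in the recursion.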

5 EVALUATION AND RESULTS 

As an example, "Network" is chosen as the initial keyword. As mentioned in Section 4, different constraints are set, like the minimum size of the candidate word (more than two characters), the maximum number of appearances in the webpage, etc. With these constraints the search is performed, and the resulting class hierarchy when visualized in Protégé is shown in Figure 7. As an example, take the candidate keyword broadcast; the combined keyword


"broadcast network" returns information from different websites. Candidate words like cable, satellite and terrestrial (different types of broadcasting [23]) are chosen according to the exhaustive analysis mentioned above. The main intention in this paper is to build a hierarchy of classes using OWL. OWL enables the user to find inter-class relations automatically, like intersections, inclusions or equalities [22]; in this case the resulting OWL shows that the Internet class includes social network and web, and social network in turn includes facebook and twitter. The OWL file is visualized using the OWL editor Protégé as shown in Figure 7.

Figure 7: Network hierarchy as shown in Protégé

The mechanism also stores appropriate URLs along with the class names, allowing the user to access the most appropriate, or rather the most representative, websites for the keyword. As an example of URLs stored with the classes, the URLs of the subclass mail of web and social network are shown in Figure 8. The Jambalaya plugin of Protégé shows a complete graphical view of the entire ontology in different formats. Example views like the nested tree map, class and individual tree map, and class tree are shown in Figures 9, 10 and 11. The OWL file generated can be updated by repeating the whole process at certain intervals for the same main keyword; URLs are updated accordingly. A small part of the OWL file generated by the current mechanism is given below:

<owl:Class rdf:ID="Facebook">
  <rdfs:subClassOf rdf:resource="#Social_Network"/>
</owl:Class>
<owl:Class rdf:ID="FDDI">
  <rdfs:subClassOf rdf:resource="#LAN"/>
</owl:Class>
<LAN rdf:about="http://computer.howstuffworks.com"/>
<Inter_net rdf:about="http://computer.howstuffworks.com_"/>
<CableCARD rdf:about="http://en.wikipedia.org/wiki/CableCARD"/>
<Computer rdf:about="http://en.wikipedia.org/wiki/Computer_network_"/>

Figure 8: URLs associated with the Facebook class shown in Protégé

Figure 9: "Nested Tree Map" as shown under the Jambalaya plugin in Protégé


Figure 10: "Class and Individual Tree Map" as shown under Jambalaya in Protégé

Figure 11: "Class Tree" as shown under Jambalaya in Protégé

6 CONCLUSIONS & FUTURE WORK 

Many authors have been working on ontology learning and construction from different kinds of structured information sources like databases, knowledge bases or dictionaries [24, 25]; some authors are putting their effort into processing natural language texts [26]. Most ontologies are constructed on the basis of an explored domain or on structured information like databases. However, taking into consideration the amount of resources easily available on the Internet, the automatic creation of ontology from unstructured documents like web pages is interesting and important. In this paper a methodology is chosen to automatically construct and update the ontology based on unstructured data from the web using a low-cost approach. The low-cost approach for automatic construction of ontology uses publicly available search engines like Google, through the Google AJAX Search API, and JSON to parse the data. The entire mechanism is implemented using Java and JSP. The hierarchically structured list obtained by parsing the response data and choosing appropriate candidate classes consists of the most representative websites for each ontology class. This structured list is a great help for users to find and access the desired web resources. The most suggested future work would be to do a more exhaustive analysis of the relevant websites using the initial keyword and to design a more sophisticated analysis algorithm to choose between the candidate words.

ACKNOWLEDGMENT 

This research has been supported and funded by CSIR, India under the Empower Scheme, grant no: OLP-2104-28.

REFERENCES 

[1] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific American, 284(5):28-37, 2001.

[2] D. Fensel, Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce, Springer Verlag, 2001.

[3] T.R. Gruber, "A Translation Approach to Portable Ontology Specification," Knowledge Acquisition, 1993.

[4] David Urbansky, Automatic Construction of a Semantic, Domain-Independent Knowledge Base.

[5] Extensible Markup Language (XML), W3C. http://www.w3.org/XML/

[6] Simple HTML Ontology Extensions. http://www.cs.umd.edu/projects/plus/SHOE/

[7] Google. http://google.com

[8] Yahoo. http://yahoo.com

[9] Bing. http://bing.com

[10] Google AJAX Search API. http://code.google.com/apis/ajaxsearch/

[11] Google Groups (web-apis). http://groups.google.com/groups?group=google

[12] Yahoo Developer Network. http://developer.yahoo.com/

[13] Yahoo Directory Services. http://dir.yahoo.com/computers_and_internet/internet/directory_services/

[14] Ding Choon Hoong and Rajkumar Buyya, "Guided Google: A Meta Search Engine and its Implementation using the Google Distributed Web Services."

[15] Google API Proximity Search (GAPS). http://www.staggernation.com/gaps/readme.html

[16] JSON. http://www.json.org

[17] JSON Schema. http://json-schema.org

[18] http://www.linkeddatatools.com/introducing-rdf-part-2

[19] http://www.linkeddatatools.com/introducing-rdf-part-2

[20] Protégé 3.4.1. http://protege.stanford.edu/

[21] Jambalaya. http://www.thechiselgroup.org/jambalaya

[22] OWL API. http://sourceforge.net/projects/owlapi/

[23] http://img.shopping.com/cctool/WhatsIs/1/1399_20943.epi.html

[24] Tim Finin and Zareen Syed, "Creating and Exploiting a Web of Semantic Data using Wikitology."

[25] D. Mukhopadhyay, A. Banik, S. Mukherjee, "A Technique for Automatic Construction of Ontology from Existing Database to Facilitate Semantic Web."

[26] M.Y. Dahab, H.A. Hassan, A. Rafae, "TextOntoEx: Automatic Ontology Construction from Natural English Text."

Kalyan Netti was born in Andhra Pradesh, India. He obtained his Master of Technology (M.Tech) in Computer Science and Engineering, with specialization in Database Management Systems, from JNTU, Andhra Pradesh, India, in 2004. He is currently pursuing a Ph.D. in Computer Science and Engineering in data-mining-related areas. Kalyan Netti is interested in the following areas: semantic web technologies, ontologies, data interoperability, web mining, semantic heterogeneity, relational database systems, temporal databases and temporal data modeling.