An Alternate Downloading Methodology of Webpages

Anirban Kundu1, Alok Ranjan Pal1, Tanay Sarkar1, Moutan Banerjee1, Subhendu Mandal1, Rana Dattagupta2 and Debajyoti Mukhopadhyay3

1 Netaji Subhash Engineering College (West Bengal University of Technology), West Bengal-700 152, India

{anik76in, chhaandasik, tanay.sarkar, moutanbanerjee, subhendu.mndl}@gmail.com

2 Jadavpur University, West Bengal-700 032, India
[email protected]

3 Calcutta Business School, Diamond Harbour Road, Bishnupur, West Bengal-743 503, India
[email protected]

Abstract

We propose an advanced method for downloading Webpages from the internet. In this technique, the whole system is considered as a bundle of crawlers that are created dynamically at execution time. The number of crawlers used depends on the requirement for downloading Webpages. The software module which interacts with the WWW to search for one or more Webpages is known as a crawler. The number of crawlers is derived from the hierarchy structure of the Web server from which the data is to be downloaded. A Webpage downloader is an important tool for retrieving Web documents from the internet and thereby facilitating knowledge gathering by a Web user. Such downloaders are very popular in the 'Information Technology' field. All kinds of public data, accessible throughout the world without any authentication, can be retrieved at any time from any geographic location using this downloading methodology. Typically, a downloading technique is used to accumulate Webpages of different domains within a single computer machine one at a time. Our aim in this paper is therefore to present an advanced technique for downloading many related Webpages with minimum effort and time using a Hierarchical Downloader consisting of several dynamic crawlers.

Keywords - Multi-downloading, Hierarchical downloading.

1 Introduction

In recent years, it has become important to perform downloading operations efficiently in terms of information retrieval [1] due to the enormous growth of the World Wide Web (WWW). The world at present generates about 1 to 2 exabytes of unique information each year, which translates to about 250 megabytes for every man, woman and child on earth (an exabyte is a billion gigabytes). The World Wide Web Worm (WWWW) was one of the first Web search engines, and was basically a store of a huge volume of information [2]. With the advent of the WWW, users now try to propagate information to a much wider audience more quickly via some medium of communication. In this information and technology era, anybody who wishes to gather information can find a lot of data related to the topic through the WWW from any location in the world using a Web browser [3]. A Web browser helps people reach the desired information with ease, almost instantaneously, over the internet. In a practical scenario, however, a typical Web browser invokes Webpages one at a time. To download more than one Webpage and collect the overall information on a particular topic, one has to check all the links available on a Webpage [4]. For example, if the user wishes to read a tutorial on a specific subject, all the hyperlinks have to be checked on a trial-and-error basis. The relevant information is specified within a Webpage in terms of its URLs [5]: a typical Webpage contains a set of URLs that possibly lead to the sought information. It therefore takes a long time to retrieve complete information using the available methods.


This paper is organized as follows: Section 2 briefly reviews Related Work. Our Approach is described in Section 3. In Section 4, Experimental Results are shown. Finally, Section 5 presents the conclusion of the work.

2 Related Work

A huge volume of research work has been carried out in the Web Technology field over the past few years [6]. Serial crawling was proposed initially to download Web documents from the WWW. In this technique, the Webpages are fetched and downloaded one after another, so the available bandwidth of the Internet connection is not fully utilized.

In serial crawling, the bandwidth utilized is calculated by dividing the 'Size of the Webpage to be downloaded' by the 'Time taken to download it'. Later, the parallel crawling technique came into the picture. Here, all the Webpages are downloaded at a time in a parallel fashion to reduce the downloading time as well as to utilize the bandwidth of the Internet connection better than serial crawling does [7-9].

In parallel crawling, the bandwidth utilized is calculated by dividing the 'Total size of the Webpages to be downloaded' by the 'Time taken to download all the Webpages'.
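Written out as formulas (the symbols below are introduced here only for compactness and do not appear in the paper):

\[
B_{\mathrm{serial}} = \frac{S_{\mathrm{page}}}{T_{\mathrm{page}}},
\qquad
B_{\mathrm{parallel}} = \frac{\sum_{i=1}^{k} S_{i}}{T_{\mathrm{batch}}}
\]

where S_page is the size of a single Webpage, T_page the time taken to download it, S_i the size of the i-th Webpage in a batch of k pages, and T_batch the time taken to download the whole batch.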

From the above discussion, it can be observed that, from a theoretical point of view, parallel crawling is considerably better than serial crawling with reference to resource utilization and time complexity [10]. In a practical sense, however, parallel crawling also has some limitations. The major limitation is that the number of crawlers has to be fixed before starting the download operation. In some cases, it is not feasible to calculate the number of crawlers in advance. Bandwidth utilization is therefore not 100% in all scenarios, since the number of crawlers is predefined [11].

To overcome this problem, our crawling technique is introduced in this paper for client-side operation. With our approach, 100% bandwidth utilization is always achieved. In the best case, the time complexity is better than parallel crawling, and in the worst case both are the same. In addition, a user need not know all the details and need not follow all the links (URLs) on each Webpage, so our approach further saves time for the user. Duplicate Webpages are not downloaded by our approach.

3 Our Approach

In this section, the detailed view of the Hierarchical Downloader is shown using algorithms and snapshots. Before going into the details, the following definitions should be kept in mind.

Definition 1: Seed queue - A storage space with the property of a queue (First-Come-First-Serve) is defined as the Seed queue. The URLs which enter the queue first are retrieved first by the crawler module to download the corresponding Webpages from the WWW.

Definition 2: Seed url - The url which is stored within the seed queue as the initial seed for beginning the downloading process from the WWW is defined as the Seed url.

Definition 3: Life Cycle of Dynamic crawler - The Life Cycle of a Dynamic crawler refers to the time period between its creation and destruction for a particular job to be executed.

Definition 4: Hierarchical Downloader - The Hierarchical Downloader is a set of dynamically created crawlers whose number of instances depends on the number of seed urls. One crawler can download only one Webpage, using its assigned URL from the seed queue, in its life time. After downloading the Webpages, these crawlers are destroyed immediately.

Definition 5: Depth - The level of searching through the WWW for downloading the required Webpages in a hierarchical fashion.

This paper is based on multi-downloading of Webpages from the WWW as an extended form of parallel crawling. Threaded programming has been used to maintain a reasonable number of crawlers at runtime. In this client-based approach, the crawlers are considered dynamic in nature, so a crawler is destroyed automatically after completion of its execution. Algorithm 1 describes the life time of a dynamic crawler: the number of URLs in the seed queue is checked in order to produce the same number of crawlers in real time, and these crawlers download the corresponding Webpages from the WWW at a time. This concept of life time is utilized throughout the work; each crawler downloads a particular Webpage.

Figure 1. Hierarchical Downloading Snapshot

In our approach, the Webpages are downloaded from the WWW using Algorithm 2. Already visited Webpages are discarded, whereas new Webpages are saved within the storage of the computer machine itself. These Webpages are further parsed using the 'Parser' tool for the next level of downloading of Webpages.


The software module which extracts all the URLs from a downloaded Webpage is known as the Parser. Our 'Parser' module concentrates only on the external links of the Webpages and not on the internal links. An internal link is a link to a different section of the same Webpage, whereas an external link is a link to a different Webpage. The depth level of searching / downloading is predefined and mandatory for stopping the downloading procedure; it is set to a threshold value. After parsing, the extracted URLs are checked against the URL log entries within the system. Newly found / unique URLs are stored within the seed queue for further downloading at the next level of the search, whereas already visited URLs are discarded, as discussed in Algorithm 2. Algorithm 2 calls Algorithm 1 to create several dynamic crawlers for the next level of downloading whenever required.

In the 1st phase, the user has to submit the value of the depth level while downloading through the software interface. Figure 1 shows Hierarchical downloading. There is a field named 'Number of Dynamic crawlers' on the graphical user interface (GUI) of the package. This field shows the number of active crawlers at any depth level of the search, so it is dynamic in nature. At every level of the search, this value may change depending on the number of URLs within the seed queue at runtime.

Figure 2. URL extraction from a downloaded Webpage using Parser

In the 2nd phase of our approach, the downloaded Webpages are parsed using the 'href' tags of the HTML markup within the Webpages, as shown in Figure 2. The crawler and Parser modules work in tandem, which dramatically reduces the response time for data-intensive operations on large databases typically associated with decision support systems. To handle the Parser module, the Fast Lexical Analyser Generator (FLEX) is used, running in the background. It is basically a tool for generating programs that perform pattern-matching on text. Only valid URLs are stored for the next level of downloading after extraction; all other URLs are discarded by the software. Here, valid URLs means the URLs which are required for downloading either the full Web site or a tutorial or data sets, etc., as per the requirement of the user / client. No irrelevant URLs are stored within the URL log file. To point out the related URLs, the previously visited URLs are required. Consider a user who wishes to download the "C Tutorial" Webpages from "http://xyz.com/c tutorial/index.html". In this case, all the valid URLs should contain "xyz.com/c tutorial/" as a part of the URL, followed by some other characters. If some extracted URLs do not contain "xyz.com/c tutorial/", these URLs are irrelevant for downloading this particular set of "C Tutorial" Webpages.
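The paper's Parser is generated with FLEX; purely as an illustration of the same idea, the href extraction and prefix filtering could be sketched in Python as follows (the class name, function names and the required_prefix parameter are assumptions of this sketch, not the authors' code):

import html.parser
from urllib.parse import urljoin

class HrefExtractor(html.parser.HTMLParser):
    """Collects the values of href attributes found in anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_valid_urls(html_text, base_url, required_prefix):
    # required_prefix plays the role of "xyz.com/c tutorial/" in the paper's example;
    # only URLs containing it are treated as relevant for the current download job.
    parser = HrefExtractor()
    parser.feed(html_text)
    absolute = [urljoin(base_url, link) for link in parser.links]
    return [url for url in absolute if required_prefix in url]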

Figure 3. An instance of URL database created through our approach

In the 3rd phase of our approach, the newly extracted URLs are checked against the existing URL database. Unvisited / unique URLs are saved, as depicted in Figure 3, using an MS-Access database. The following snippet shows how the check is performed within the program.

If IsNull(DatabaseLookUp("URL", "SQL Query")) Then
    Add the record
Else
    Discard the URL
End If
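For illustration only, an equivalent check could be written against an SQLite log instead of the MS-Access database used by the authors; the table name url_log and the function names below are hypothetical:

import sqlite3

conn = sqlite3.connect("url_log.db")
conn.execute("CREATE TABLE IF NOT EXISTS url_log (url TEXT PRIMARY KEY)")

def is_new_url(conn, url):
    """True if the URL is not yet in the log, mirroring the IsNull test above."""
    row = conn.execute("SELECT 1 FROM url_log WHERE url = ?", (url,)).fetchone()
    return row is None

def record_if_new(conn, url):
    if is_new_url(conn, url):
        conn.execute("INSERT INTO url_log (url) VALUES (?)", (url,))  # add the record
        conn.commit()
        return True
    return False                                                      # discard the URL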

In the 4th (final) phase of our approach, the downloaded Webpages are saved within the computer, as shown in Figure 4 and Figure 5 respectively.

Algorithm 1: Life Time of Dynamic crawler
Input: A set of seed urls
Output: Downloaded Webpages
Step 1: Check the number of URLs within the seed queue
Step 2: Generate 'N' crawlers at runtime based on the number of URLs in the seed queue
Step 3: Assign each seed (URL) to a specific crawler {c_j <- s_i}
Step 4: Search for the Webpages in the Web using the specific crawlers
Step 5: Download the Webpages which are found through searching
Step 6: Kill the crawlers after successful download of the Webpages
Step 7: Stop
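Since the paper states that threaded programming maintains the crawlers at runtime, the life cycle above can be sketched minimally in Python; the function names and the use of the standard library below are assumptions of this sketch, not the authors' implementation:

import threading
import urllib.request

def crawl_one(url, results):
    """One dynamic crawler: fetch exactly one Webpage, then the thread ends (is 'killed')."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            results[url] = response.read()
    except OSError:
        results[url] = None  # failed download; the URL could be retried or dropped

def run_dynamic_crawlers(seed_queue):
    """Algorithm 1: spawn one crawler per seed URL, download in parallel, then destroy them."""
    results = {}
    crawlers = [threading.Thread(target=crawl_one, args=(url, results))
                for url in seed_queue]   # Steps 2-3: N crawlers, one per seed URL
    for c in crawlers:
        c.start()                        # Steps 4-5: search and download in parallel
    for c in crawlers:
        c.join()                         # Step 6: each crawler ends after its single download
    return results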

395

Page 4: [IEEE 2008 Seventh Mexican International Conference on Artificial Intelligence (MICAI) - Mexico, Mexico (2008.10.27-2008.10.31)] 2008 Seventh Mexican International Conference on Artificial

Figure 4. An instance of Webpage saving


Figure 5. Webpage Repository after downloading from WWW

Figure 6 shows the internal structure of Hierarchical downloading. The Hierarchical technique means a technique with a hierarchical view. A number of crawlers are utilized based on the requirement of the level of the hierarchy for this purpose. The 'number of crawlers' is calculated using the directory structure of the concerned Web directory. In this paper, an advanced technique for downloading many related Webpages with minimum effort and time, using a Hierarchical Downloader consisting of several real-time crawlers, is depicted in Algorithm 2. The search limit for downloading Webpages depends on the 'Depth' value. Until the 'Depth' value is reached, our method downloads all the related Webpages as per the schedule. All the required Webpages of each level are downloaded at a time in a parallel manner using dynamically created crawlers, utilizing all the available bandwidth of the Internet connection dedicated to the particular client machine.

Algorithm 2: Hierarchical Downloading
Input: A set of seed urls within the seed queue and the Depth level of searching
Output: Storage of downloaded Webpages
Step 1: Initialize i with 0
Step 2: Continue the loop until i > Depth
Step 3: Call Algorithm 1
Step 4: Check whether the URLs of the downloaded Webpages are already visited or not
Step 5: Discard already visited Webpages
Step 6: Save new Webpages
Step 7: Extract the hyperlinks of all saved Webpages using the Parser
Step 8: Check whether the extracted URLs are already visited or not
Step 9: Save those extracted hyperlinks (URLs) which are still not visited within the seed queue as well as in storage
Step 10: Increment i by 1
Step 11: Stop
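Tying the earlier sketches together (run_dynamic_crawlers, extract_valid_urls and record_if_new are the hypothetical helpers sketched above, not the authors' code), the depth-bounded loop of Algorithm 2 could look roughly like this:

def hierarchical_download(seed_urls, depth, base_prefix, conn, storage):
    """Sketch of Algorithm 2: repeat the spawn/download/parse cycle until the Depth level is reached."""
    seed_queue = list(seed_urls)
    for level in range(depth + 1):                # Steps 1, 2, 10: loop over the depth levels
        if not seed_queue:
            break
        pages = run_dynamic_crawlers(seed_queue)  # Step 3: Algorithm 1 for the current level
        next_queue = []
        for url, content in pages.items():
            if content is None:
                continue
            storage[url] = content                # Steps 5-6: keep the newly downloaded pages
            text = content.decode("utf-8", "ignore")
            for link in extract_valid_urls(text, url, base_prefix):
                if record_if_new(conn, link):     # Steps 8-9: only unvisited URLs go forward
                    next_queue.append(link)
        seed_queue = next_queue                   # seeds for the next level of the hierarchy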

Figure 6. Internal Architecture of Hierarchical Downloading
(Components labelled in the figure: Seed Queue with its number of URLs S := {S1, S2, ..., Sn}; Crawler Creation Unit; Set of Crawlers C := {C1, C2, ..., Cn}; Search for Web-pages and Download Web-pages blocks connected to the WWW; Visited Web-page and Visited URL checks with Discard branches; Parser; Storage; Crawler Destruction Unit. Note: 'n' seed URLs are selected at a time by 'n' dynamically created crawlers, where n = 1, 2, 3, ....)

Question: Why would any parallel approach be slower than our Hierarchical methodology?

Answer: In a typical parallel approach, the number of crawlers is fixed before downloading the documents from the WWW. So, a user has two options for downloading all the Webpages.


1st option: The user has to approximate the number of crawlers before the download operation takes place. This approximation is not always correct. Sometimes the number of crawlers is greater than the number of available URLs, and in some cases it is the other way around. So, the practical usage is not 100% effective.

2nd option: The user has to supply all the URLs as seeds in order to fix the number of crawlers accurately. But it is pretty difficult to know all the URLs at the starting point (e.g., in the case of a tutorial download).

To overcome these two loop holes (the 1st and 2nd options), our Hierarchical methodology offers a solution. In our approach, the user need not know all the details of the URLs to be downloaded, since the number of URLs is calculated dynamically at runtime. After downloading the Webpage (using the initial seed url), the Parser module extracts all the URLs from the Webpage. These parsed URLs are then submitted to the seed queue for downloading further Webpages. The number of crawlers is decided by the number of URLs within the seed queue at that time instance, so there is no approximation in deciding how many dynamic crawlers to create. After downloading the specific Webpages, these crawlers are killed / destroyed at that particular level. The 'number of crawlers' for the next level therefore depends again on the number of parsed URLs for that level. So, in our methodology, the total number of crawlers is not fixed but dynamic in nature, and all the URLs can be downloaded in a parallel manner depending on the available bandwidth of the system. No process is queued if the bandwidth permits. So, our approach is faster than any typical parallel approach.

Example: Consider that there are 'N' crawlers in a parallel downloader and let 'P' be the number of Webpages to be downloaded.
Case 1: If (P < N), then the 'P' Webpages can be downloaded in parallel.
Case 2: If (P == N), then the 'P' Webpages can be downloaded in parallel.
Case 3: If (P > N), then 'N' Webpages can be downloaded in parallel and the remaining 'P-N' Webpages have to be downloaded in the next iterations.

From this example, it can be seen that our approach is better in the case of P > N, since 'P' crawlers would be created dynamically to download all the pages in a parallel fashion. That is, our method is faster than other typical parallel methods.
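As a toy calculation (not taken from the paper, and assuming the bandwidth permits all the pages of a level to be fetched simultaneously), the number of download rounds in the two schemes can be compared as follows:

import math

def rounds_fixed_parallel(p_pages, n_crawlers):
    # Fixed pool of N crawlers: remaining pages wait for the next iteration (Case 3 above).
    return math.ceil(p_pages / n_crawlers)

def rounds_dynamic(p_pages):
    # One crawler per pending URL: everything at a level goes out in a single round.
    return 1 if p_pages > 0 else 0

print(rounds_fixed_parallel(50, 20))  # 3 rounds with 20 fixed crawlers
print(rounds_dynamic(50))             # 1 round with dynamically created crawlers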

4 Experimental Results

In this section, experimental results are shown based on the following configuration of the computer system: Processor - Pentium 4 (P4); Processor speed - 2.8 GHz; Primary memory (RAM) - 512 MB; Hard disk - 80 GB; Internet connection speed - 512 kbps, 1:1 RF line.

Table 1. Time required for downloading 100 URLs using Hierarchical Downloader

Sl. Website Number of downloaded Time needed for downloadingNo. Name URLs / Webpages URLs (hh:mm:ss:ms)

1 www.freshersworld.com 100 0:1:29:982 www.indiagsm.com 100 0:0:34:793 www.cocacola.com 100 0:0:16:394 www.yahoomail.com 100 0:0:39:905 www.gmail.com 100 0:0:32:836 www.pepsi.com 100 0:0:46:807 www.anandabazar.com 100 0:0:49:118 www.telegraph.com 100 0:0:30:079 www.times.com 100 0:0:18:66

10 www.bollyextreme.com 100 0:0:38:26

Table 2. Comparison between DAP & Our Approach

Sl. No.   Set of downloaded Webpages                                   Time taken by DAP (sec.)   Time taken by our approach (sec.)
1         http://capexindia.com/anirban/computer skills.htm            3                          1
2         http://capexindia.com/anirban/computer skills.htm            5                          2
          http://capexindia.com/anirban/current working status.htm
3         http://capexindia.com/anirban/computer skills.htm            7                          2
          http://capexindia.com/anirban/current working status.htm
          http://capexindia.com/anirban/index.htm
4         http://capexindia.com/anirban/computer skills.htm            9                          3
          http://capexindia.com/anirban/current working status.htm
          http://capexindia.com/anirban/index.htm
          http://capexindia.com/anirban/contact.htm
5         http://capexindia.com/anirban/computer skills.htm            11                         5
          http://capexindia.com/anirban/current working status.htm
          http://capexindia.com/anirban/index.htm
          http://capexindia.com/anirban/contact.htm
          http://capexindia.com/anirban/personal profile.htm


Table 1 lists the time required to download 100 URLs through our crawling methodology as a sample study. It shows a better result than Download Accelerator Plus (DAP). In Table 2, the comparison between DAP and our approach is shown. In this part of the experiment, one or more Webpages were downloaded all at a time as concurrent processes.
DAP version used - 8.5.5.3
Maximum simultaneous downloads possible by DAP - 20

In this paper, our approach is compared with DAP because it is available free of cost (trial version) on the Internet and is also very popular software among internet users. If somebody wishes to download Web documents using DAP, it is possible only for a maximum of 20 documents, since our DAP version supports a maximum of 20 simultaneous downloads. In the case of our approach, however, there is no upper limit on parallel connections: theoretically, the number of connections is infinite, and practically, the number of connections depends on the available bandwidth of the Internet connection. In Table 3, the time taken by our method in the Single, Parallel and Hierarchical modes of crawling is shown. From the experiment, it is established that the result is better in the case of the Hierarchical mode of crawling, since the number of crawlers is not predefined.


Table 3. Comparative Study on Time taken by Hierarchical Downloader in Single, Parallel & Hierarchical mode of crawling

Note: Number of Parallel crawlers = 2 for this experimentation.

Website Name        Total Number of Webpages downloaded   Time taken by Single crawler   Time taken by Parallel crawler   Time taken by Hierarchical crawler
freshersworld.com   4994                                  9 hr. 30 min.                  4 hr. 45 min.                    58 min.
theatrelinks.com    469                                   44 min.                        22 min.                          10 min.
indiagsm.com        34                                    3 min.                         1 min. 30 sec.                   30 sec.
rediff.com          163                                   3 min.                         1 min. 30 sec.                   20 sec.
w3.org              2333                                  6 hr.                          3 hr.                            37 min.
indiafm.com         7087                                  7 hr.                          3 hr. 30 min.                    53 min.
nokia-asia.com      193                                   2 hr.                          1 hr.                            17 min.
amazon.com          349                                   2 hr. 58 min.                  1 hr. 29 min.                    19 min.

5 Conclusion

In this paper, an advanced method for downloading Webpages has been proposed. In a typical downloading software package, Single crawling or Parallel crawling is used to download the Webpages of the selected URLs. In our proposal, an enhanced methodology is discussed to minimize the time requirement while crawling through the WWW using Hierarchical crawling. The main advantage of this type of crawling system is its ability to generate and kill any number of dynamic crawlers at runtime, depending on the number of URLs present within the seed queue at any depth level of the concerned Web hierarchy. After downloading its specific Webpage, the respective crawler is destroyed automatically. At any time instance, the maximum number of Webpages available for Hierarchical downloading depends on the allowable bandwidth of the system.

References

[1] Hongfei Yan, Jianyong Wang, Xiaoming Li, Lin Gu, "Architectural design and evaluation of an efficient Web-crawling system," The Journal of Systems and Software, 60(3): 185-193, 2002

[2] Soumen Chakrabarti, Byron E. Dom, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, Jon Kleinberg, "Mining the Web's Link Structure," IEEE Computer, 32(8): 60-67, August 1999

[3] Sergey Brin, Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998

[4] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, "Searching the Web," ACM Transactions on Internet Technology, Volume 1, Issue 1, August 2001

[5] Debajyoti Mukhopadhyay, Sajal Mukherjee, Soumya Ghosh, Saheli Kar, Young-Chon Kim, "Architecture of A Scalable Dynamic Parallel WebCrawler with High Speed Downloadable Capability for a Web Search Engine," The 6th International Workshop MSPT 2006 Proceedings, Youngil Publication, Republic of Korea, November 2006, pp. 103-108

[6] M. Burner, "Crawling towards eternity: Building an archive of the world wide web," Web Techniques, 2(5), 1997

[7] Marc Najork, Allan Heydon, "High-performance web crawling," In J. Abello, P. Pardalos, M. Resende, editors, Handbook of Massive Data Sets, Kluwer Academic Publishers, Inc., 2001

[8] Marc Najork, Janet L. Wiener, "Breadth-first search crawling yields high-quality pages," In Proc. of the 10th International World Wide Web Conference, Hong Kong, China, 2001

[9] Martijn Koster, "The Robot Exclusion Standard," http://www.robotstxt.org/

[10] Paolo Boldi, Bruno Codenotti, Massimo Santini, Sebastiano Vigna, "UbiCrawler: A scalable fully distributed web crawler," In Proc. AusWeb02, The Eighth Australian World Wide Web Conference, 2002

[11] Junghoo Cho, Hector Garcia-Molina, "Parallel crawlers," In Proc. of the 11th International World Wide Web Conference, 2002
