Dispatching Java agents to user for data extraction from third party web sites Alex Roque F.I.U....

Page 1: Dispatching Java agents to user for data extraction from third party web sites Alex Roque F.I.U. HPDRC.

Dispatching Java agents to user for data extraction from third party web sites

Alex Roque

F.I.U. HPDRC

Page 2:

Introduction

Since the WWW has grown exponentially, data retrieval has become an intensive research topic.

However, mechanisms and tools that give users more power over the data on the web have not grown in parallel with the increase in data.

For example, no tools exist that allow the user to extract data from an HTML context and use it in an external application.

Page 3:

A tool created to allow a more coherent and wider set of automatic data extraction was the Data Extractor system, which treats any web site as a data source.

Data Extractor has two kinds of implementation: a standalone server solution, and a set of functionality that can be embedded in applications to provide them with data from the Internet.

Page 4:

Data Extractor Inefficiencies

Performance in multi-client conditions

Network performance issues

Legal issues

Installing an exclusive local server for clients is possible; however, it is expensive. Our alternative is MDRA: Mobile Data Retrieval Agents.

Page 5:

MDRA Composition and Delivery

The mobile agents server contains a wrapper portal and a knowledgebase.

Functionality is as follows:

1) Users connect to the wrapper portal and request a wrapper

2) In response, a package to extract data is constructed and sent to the client

3) Data extraction takes place on the client
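The three steps above can be sketched in Java as follows. This is a minimal illustration, not the MDRA code: the class names `WrapperPortal`, `AgentPackage`, and `FlowDemo` are hypothetical, a wrapper is reduced to a function from page text to extracted values, and the page text is passed in directly instead of being fetched over the network.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical package sent to the client: a named wrapper ready to run. */
class AgentPackage {
    final String wrapperName;
    private final Function<String, List<String>> wrapper;

    AgentPackage(String wrapperName, Function<String, List<String>> wrapper) {
        this.wrapperName = wrapperName;
        this.wrapper = wrapper;
    }

    /** Step 3: extraction runs on the client, on page text the client obtained. */
    List<String> runOnClient(String pageText) {
        return wrapper.apply(pageText);
    }
}

/** Hypothetical wrapper portal covering steps 1 and 2. */
class WrapperPortal {
    private final Map<String, Function<String, List<String>>> wrappers = new HashMap<>();

    void register(String name, Function<String, List<String>> wrapper) {
        wrappers.put(name, wrapper);
    }

    /** Step 1 is the request; step 2 is building the package that is returned. */
    AgentPackage request(String name) {
        Function<String, List<String>> wrapper = wrappers.get(name);
        if (wrapper == null) {
            throw new IllegalArgumentException("Unknown wrapper: " + name);
        }
        return new AgentPackage(name, wrapper);
    }
}

class FlowDemo {
    /** Toy wrapper used for illustration: keeps the non-empty lines of a page. */
    static List<String> nonEmptyLines(String text) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.trim().isEmpty()) {
                out.add(line.trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        WrapperPortal portal = new WrapperPortal();
        portal.register("lines", FlowDemo::nonEmptyLines);
        AgentPackage pkg = portal.request("lines");
        System.out.println(pkg.runOnClient("a\n\n b \n")); // prints [a, b]
    }
}
```

The point of the split is that the portal only hands out code; the work of contacting the third-party site happens in `runOnClient`, on the user's machine.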

Page 6:

Wrapper portal: Lists and packages wrappers, authenticates users, and allows them to change and save their queries (references to wrappers).

Knowledgebase: Contains information about available wrappers, their parameters and status.

Wrappers can be thought of as lightweight programs which use a predefined OO library to “strip” desired information.
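To make the idea of a lightweight wrapper on top of a shared library concrete, here is a small sketch. The names `ExtractionLib` and `TableCellWrapper` are invented for illustration, and the regular-expression approach is an assumption; the slides do not say how the real library locates data, and production HTML handling would normally use a proper parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the predefined extraction library. */
class ExtractionLib {
    /** Returns the trimmed text of capture group 1 for every match of regex in html. */
    static List<String> capture(String html, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                           .matcher(html);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }
}

/** Hypothetical wrapper: it only knows which pattern fits one page layout. */
class TableCellWrapper {
    List<String> extract(String html) {
        // Strip the contents of every table cell, ignoring cell attributes.
        return ExtractionLib.capture(html, "<td[^>]*>(.*?)</td>");
    }

    public static void main(String[] args) {
        String page = "<table><tr><td>Miami</td><TD class=\"zip\"> 33199 </TD></tr></table>";
        System.out.println(new TableCellWrapper().extract(page)); // prints [Miami, 33199]
    }
}
```

Because the wrapper holds only the site-specific pattern, it stays small, while the reusable matching logic lives in the library.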

Page 7:

MDRA Architecture

Mobile wrapper controller: Responsible for controlling the behavior of wrappers and the flow of data.

Wrappers: The same as the ones used in Data Extractor; the processes that strip data from a web site.

Data Extraction Library: Contains functionality essential for extraction and network operations. Compact; can be cached if no update is required.

Outer packaging: Interface for uniting numerous wrappers and controllers.
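The controller's role of running wrappers and routing their data could look roughly like the sketch below. The `SiteWrapper` interface and the `MobileWrapperController` class are assumptions made for illustration; the slides do not give the real interfaces. The choice to record an empty result when a wrapper fails is also an assumption, shown as one possible way of "controlling behavior".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical wrapper contract; the actual MDRA interface is not shown in the slides. */
interface SiteWrapper {
    String name();
    List<String> extract(String pageText) throws Exception;
}

/** Hypothetical controller: runs each wrapper and collects its output by name. */
class MobileWrapperController {
    private final List<SiteWrapper> wrappers = new ArrayList<>();

    void add(SiteWrapper wrapper) {
        wrappers.add(wrapper);
    }

    /** A failing wrapper yields an empty result instead of aborting the whole run. */
    Map<String, List<String>> runAll(String pageText) {
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (SiteWrapper wrapper : wrappers) {
            try {
                results.put(wrapper.name(), wrapper.extract(pageText));
            } catch (Exception e) {
                results.put(wrapper.name(), new ArrayList<String>());
            }
        }
        return results;
    }
}
```

An outer packaging layer, as described above, would then hold several such controllers behind one interface.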

Page 8:

How does execution take place?

1) Query formulation

2) Agent construction and delivery

3) Agent execution

4) Data delivery

Page 9:

Query Formulation

The user connects to the wrapper portal, the wrappers are listed, and the user selects the desired wrapper(s) and configures execution parameters.

This configuration can be saved for future reference.
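A saved query is essentially a wrapper reference plus execution parameters, so it can be stored as simple key/value text. The sketch below uses the standard `java.util.Properties` class for this; the `SavedQuery` class, the `wrapper` key, and the `param.` prefix are hypothetical choices, not the format used by the actual portal.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical saved query: a wrapper reference plus execution parameters. */
class SavedQuery {
    final Properties settings = new Properties();

    SavedQuery(String wrapperName) {
        settings.setProperty("wrapper", wrapperName);
    }

    /** Adds one execution parameter and returns this query for chaining. */
    SavedQuery with(String key, String value) {
        settings.setProperty("param." + key, value);
        return this;
    }

    /** Serializes the query to text that the portal could store. */
    String save() {
        StringWriter out = new StringWriter();
        try {
            settings.store(out, "saved query");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Restores a query from text produced by save(). */
    static SavedQuery load(String text) {
        Properties loaded = new Properties();
        try {
            loaded.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String wrapperName = loaded.getProperty("wrapper");
        if (wrapperName == null) {
            throw new IllegalArgumentException("Saved query has no wrapper reference");
        }
        SavedQuery query = new SavedQuery(wrapperName);
        query.settings.putAll(loaded);
        return query;
    }
}
```

For example, `new SavedQuery("stock-quotes").with("symbol", "IBM").save()` produces text that `SavedQuery.load` turns back into an equivalent query; the wrapper name and parameter here are made up.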

Page 10:

Agent construction and delivery

The wrapper portal begins packaging, including the outer packaging module, wrapper parameter information, wrapper controller, wrapper, and Data Extraction Library.

Components that change frequently are packaged separately from the ones that do not (aids caching).

Compression or digital signing takes place.
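The packaging idea, with stable and frequently changing components kept in separate compressed archives, can be sketched with the standard `java.util.zip` classes. Everything specific here is an assumption: the `AgentPackager` class, the use of ZIP archives held in memory, and the representation of components as strings are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical packager: one compressed archive per group of components. */
class AgentPackager {
    /** Compresses named components (name -> content) into an in-memory ZIP archive. */
    static byte[] pack(Map<String, String> components) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> component : components.entrySet()) {
                zip.putNextEntry(new ZipEntry(component.getKey()));
                zip.write(component.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Lists the entry names of an archive, as a client might after download. */
    static List<String> entryNames(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

In this sketch the portal would call `pack` twice: once for the library and controller, which rarely change and can stay in the client's cache, and once for the wrapper and its parameters, which are rebuilt per request.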

Page 11:

Agent execution

Once delivered to the client, wrappers interact with WWW sites and extract the desired data.

Data is passed to the outer packaging controller, where it can be used in applications or stored in various media.

Page 12:

Data Delivery

Data retrieved may be transferred to other applications programmatically, stored in various media (Excel, XML, text), or stored in databases.

May be used for statistical data collection.
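As an illustration of the delivery step, the sketch below writes extracted records as XML and as tab-separated text. The `DataExporter` class, the `<rows>/<row>/<field>` element names, and the record layout (one map of column name to value per row) are assumptions; the slides do not specify the output schema.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical exporter for extracted records. */
class DataExporter {
    /** Escapes characters that are special in XML text and attribute values. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Emits one <row> per record and one <field name="..."> per column. */
    static String toXml(List<Map<String, String>> rows) {
        StringBuilder sb = new StringBuilder("<rows>");
        for (Map<String, String> row : rows) {
            sb.append("<row>");
            for (Map.Entry<String, String> field : row.entrySet()) {
                sb.append("<field name=\"").append(escapeXml(field.getKey())).append("\">")
                  .append(escapeXml(field.getValue())).append("</field>");
            }
            sb.append("</row>");
        }
        return sb.append("</rows>").toString();
    }

    /** Emits tab-separated values, one line per record. */
    static String toText(List<Map<String, String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (Map<String, String> row : rows) {
            sb.append(String.join("\t", row.values())).append("\n");
        }
        return sb.toString();
    }
}
```

Tab-separated text of this kind can be opened directly in a spreadsheet, which is one simple route to the Excel output mentioned above.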

Page 13:

Source Code Implementation

Because the system needs a high degree of portability, the Java language was used for the implementation.

The previous Data Extractor was written in Java, so in order to reuse modules, Java was used again.

Speed performance issues were addressed [7].

Page 14:

MDRA Framework

MDRA is delivered to clients as a Java applet.

Applets provide portability, which allows clients on different platforms to participate in this data retrieval.

Since framework code and libraries do not change often, browsers that cache Java applets will keep the parts that do not change.

Page 15:

Security

Applets must be digitally signed in order for them to access the system and network resources needed for the retrieval.

Proxy servers may be set up on the host the applet was downloaded from in order to give applets the ability to download third-party web sites. However, this option is prone to bottleneck congestion.
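Signed applets rely on JAR signing, which in practice is done with the JDK's `jarsigner` tool rather than by hand. The sketch below shows only the underlying sign and verify primitive from `java.security`, to illustrate why a tampered package is rejected. The `PackageSigner` class and the choice of RSA with SHA-256 are assumptions, not details from the slides.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

/** Hypothetical helper showing the sign/verify step behind signed code. */
class PackageSigner {
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Signs the package bytes with the publisher's private key. */
    static byte[] sign(byte[] data, PrivateKey key) {
        try {
            Signature signature = Signature.getInstance("SHA256withRSA");
            signature.initSign(key);
            signature.update(data);
            return signature.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true only if the bytes match what the publisher signed. */
    static boolean verify(byte[] data, byte[] signatureBytes, PublicKey key) {
        try {
            Signature signature = Signature.getInstance("SHA256withRSA");
            signature.initVerify(key);
            signature.update(data);
            return signature.verify(signatureBytes);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair keys = newKeyPair();
        byte[] agent = "agent package bytes".getBytes(StandardCharsets.UTF_8);
        byte[] sig = sign(agent, keys.getPrivate());
        System.out.println(verify(agent, sig, keys.getPublic()));
    }
}
```

Verification with the public key succeeds for the original bytes and fails if even one byte of the package changes, which is what lets the client grant a signed agent the extra permissions it needs.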

Page 16:

Conclusion

MDRA “lease” data extraction services to users; the retrieved data can be exported to other applications.

This distributed approach takes the load off the centralized server architecture.

Future research includes different MDRA implementations (standalone, embedded on the client side) and tuning of agent performance.