1
Querying Web-Sources within a Data Federation
Lynn Wu (1), Aykut Firat (2), Tarik Alatovic (3), Stuart Madnick (1)
(1) MIT Sloan School of Management
(2) Northeastern University
(3) INSEAD
International Conference on Information Systems (ICIS)
December 11, 2006
2
Motivating Scenario
You want: The current stock quotes of all companies listed on the Stock Exchange that are in the biotechnology industry.
And you want to see each of the stock quotes in all the major currencies.
3
Good News
All of the necessary information is available (and for free) on the Web …
• Listing of companies in an industry
• Stock price for any company
• Conversion between any two currencies
So what's the problem?
4
Process – Part 1
Web sites are not like relational (SQL) databases.
Must go step-by-step: first find all the biotech companies.
5
Process – Part 2
Then must find the stock price of each, one-by-one.

Biotechnology                     Ticker
Acadia Pharmaceuticals Inc.       ACAD
Accentia Biopharmaceuticals, I    ABPI
Achillion Pharmaceuticals, Inc    ACHN
Acorda Therapeutics, Inc.         ACOR
Adherex Technologies Inc.         ADH
Advanced Cell Technology Inc.     ACTC.OB
Advanced Life Sciences Holding    ADLS
Advaxis Inc.                      ADXS.OB
Adventrx Pharmaceuticals Inc.     ANX
Alfacell Corp.                    ACEL
Alnylam Pharmaceuticals Inc.      ALNY
……

237 Biotech firms
6
Process – Part 3
Then must convert the stock price of each, one-by-one.

Before conversion:

Ticker    Price($)
ACAD      8.96
ABPI      3.55
ACHN      17.52
ACOR      17.13
ADH       0.31
ACTC.OB   0.76
ADLS      2.43
ADXS.OB   0.14
ANX       2.42
ACEL      1.62
ALNY      23.81

After conversion:

Ticker    Price($)  EURO      JPY
ACAD      8.96      6.8096    792.2289
ABPI      3.55      2.698     313.8853
ACHN      17.52     13.3152   1549.09
ACOR      17.13     13.0188   1514.607
ADH       0.31      0.2356    27.4097
ACTC.OB   0.76      0.5776    67.19798
ADLS      2.43      1.8468    214.8567
ADXS.OB   0.14      0.1064    12.37858
ANX       2.42      1.8392    213.9725
ACEL      1.62      1.2312    143.2378
ALNY      23.81     18.0956   2105.242
7
General Scenario
Users often have to browse through many websites and collect and process a lot of information manually.
Wouldn't it be great if you could get all the stock quotes in the biotech industry using one query?
select ticker, price from yahooF where ticker IN (select companyticker from companytable where industry='Biotechnology')
8
Why is this so difficult?
Websites have various capability restrictions.
• Web sites do not accept general queries (e.g., SQL).
• Assuming they somehow accepted general queries, there are still problems. For example:
    select price from yahooF
  This is not answerable, as Yahoo! Finance requires at least one ticker at a time to get the stock quote.
    select exchanged, expressed, rate, date from olsen where expressed='USD' and date='12/10/06'
  This is not answerable either: must specify both currencies.
9
Existing Solutions
• Commercial databases can incorporate heterogeneous data sources through the use of wrappers:
   - However, there is no general-purpose wrapper that can query the entire Web.
   - Need to construct one wrapper per website.
   - This is our focus: how can these be improved?
• Other options:
   - Using highly expressive context-free grammars to express the capability restrictions.
   - Has not been used widely in commercial systems due to their complexity.
10
How does a Federated database system handle the problem?
Example: IBM DB2
[Figure: The Federation Engine (IBM DB2) receives a query (Select ..from s1,s2,s3 …) and talks to the wrappers through the wrapper request-reply protocol. The wrappers for S1, S2, and S3 each contain their own Capability Handler and Data Extraction component and access the web sources S1-website, S2-website, and S3-website.]
For web sites (S1, S2, S3), each wrapper must be custom crafted.
11
Research Contribution
• Offer a complete, practical, and scalable solution to easily incorporate websites into a data federation.
• Abstract wrapper components into separate reasoning engines.
   - Capability reasoning engine for query planning and execution
   - Data extraction engine
12
Our Solution
[Figure, first panel: Two-Layered Architecture (current IBM solution). The Federation Engine (IBM DB2) receives a query (Select ..from s1,s2,s3 …) and uses the wrapper request-reply protocol to reach separate wrappers for S1, S2, and S3, each with its own Capability Handler and Data Extraction component, which access S1-website, S2-website, and S3-website.]
[Figure, second panel: Three-Layered Architecture (with capability declaration). The Federation Engine (IBM DB2) receives the same query and uses the wrapper request-reply protocol to reach a single wrapper. Inside it, a Capability Engine performs query planning with capability declaration, driven by the Capability Record Declarations (CR for S1, CR for S2, CR for S3). Below it, a Data Extraction Engine, driven by the Data Extraction Spec Files (DE for S1, DE for S2, DE for S3), accesses S1-website, S2-website, and S3-website.]
13
Adding a web source is simple.
• Define the data extraction rules.
• Define the capability record.
No procedural coding involved at all!
14
Data Extraction: Cameleon Engine
• Extracts data from web pages using declarative specifications that extract specific fields within a website.
• Can answer rudimentary queries involving only a single website.
[Figure: Example data extraction rules for Yahoo! Finance, with callouts marking the input parameter and the regular expression that identifies the region and extracts the price.]
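The idea of a declarative extraction rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spec layout, the URL template, the sample HTML, and the `extract` function are all hypothetical and are not Cameleon's actual spec syntax, which the slide shows only as a figure.

```python
import re

# Hypothetical spec (NOT Cameleon syntax): a URL template that takes the
# input parameter, plus one regular expression per attribute to extract.
SPEC = {
    "url": "https://finance.example.com/q?s={ticker}",  # placeholder URL
    "attributes": {"Price": r"Last Trade:\s*<b>([\d.]+)</b>"},
}

def extract(spec, page_html):
    """Apply each attribute's pattern to the page and return one row."""
    row = {}
    for name, pattern in spec["attributes"].items():
        match = re.search(pattern, page_html)
        row[name] = float(match.group(1)) if match else None
    return row

# Made-up page fragment standing in for a fetched quote page.
sample_page = "<td>Last Trade: <b>8.96</b></td>"
row = extract(SPEC, sample_page)
print(row)
```

The point of the sketch is that adding a source means writing a new spec, while `extract` itself stays unchanged.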
15
Cameleon Studio tool enables quick creation and testing of the data extraction rules.
16
Capability Record
• For Yahoo! Finance, we have two attributes of interest: Ticker and Price.
• Cameleon extracts the data and forms a table.

Capability records:

relation('YahooF',
    [['Ticker', string, bound(1)],
     ['Price', number, free]],
    ['=']).

relation(olsen,
    [['Exchanged', string, bound(1)],
     ['Expressed', string, bound(1)],
     ['Rate', number, free],
     ['Date', string, bound(1)]],
    ['=']).

relation('companytable',
    [['Industry', string, bound(1)],
     ['CompanyTicker', string, free]],
    ['=']).

Reading the YahooF record:
• Ticker, bound(1): must provide one (and only one) Ticker at a time (some sites allow up to 50 Tickers at a time).
• Price, free: Price is the value returned.
• ['=']: can only use the equality (=) operator.
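A capability record can be read as a checklist of attributes that a query fragment must bind before the source can answer it. The following is a minimal sketch of that check, assuming a hypothetical Python encoding of the three records above; the `Attribute` class and `missing_bindings` function are illustrative names, not part of the system described in the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribute:
    name: str
    dtype: str    # "string" or "number"
    binding: str  # "bound": caller must supply a value; "free": returned by the source

# The three capability records, transcribed into a hypothetical Python form.
CAPABILITIES = {
    "yahoof": [Attribute("Ticker", "string", "bound"),
               Attribute("Price", "number", "free")],
    "olsen": [Attribute("Exchanged", "string", "bound"),
              Attribute("Expressed", "string", "bound"),
              Attribute("Rate", "number", "free"),
              Attribute("Date", "string", "bound")],
    "companytable": [Attribute("Industry", "string", "bound"),
                     Attribute("CompanyTicker", "string", "free")],
}

def missing_bindings(relation, equality_bound):
    """Return the bound attributes that the query fragment fails to supply."""
    supplied = {name.lower() for name in equality_bound}
    return [a.name for a in CAPABILITIES[relation.lower()]
            if a.binding == "bound" and a.name.lower() not in supplied]

print(missing_bindings("yahooF", []))                     # select price from yahooF
print(missing_bindings("olsen", ["expressed", "date"]))   # olsen query without 'exchanged'
print(missing_bindings("companytable", ["industry"]))     # answerable as written
```

An empty result means the fragment is directly answerable; a non-empty one names the attributes the planner still has to bind from another fragment.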
17
IBM DB2
• Uses wrappers to access non-relational data sources.
• DB2 first decomposes the original query into query fragments and then sends them to wrappers.
• The wrapper sends the result back to DB2, which then assembles the final results.
[Figure: DB2 XML Wrapper (Adapted from IBM).]
18
Request-Reply-Compensate Protocol
Request-Reply-Compensate protocol example

Query fragment:
    select price * 1.3 from YahooF where ticker in ('GE', 'IBM', 'MSFT');

Request:
    HXP: Price
    Table: YahooF
    Predicates: ticker in ('GE', 'IBM', 'MSFT')

Wrapper plan 1:
    HXP: Price
    Table: YahooF
    Predicate: ticker = 'GE'

Wrapper plan 2:
    HXP: Price
    Table: YahooF
    Predicate: ticker = 'IBM'

Wrapper plan 3:
    HXP: Price
    Table: YahooF
    Predicate: ticker = 'MSFT'
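The example above can be sketched in Python as follows. This is an assumption-laden illustration, not DB2's wrapper API: `plan_request`, `execute`, and the quote values are hypothetical. It assumes the wrapper handles only the plain Price expression with one equality predicate per plan, and that the multiplication by 1.3 is compensated outside the wrapper.

```python
def plan_request(table, head_expr, column, in_values):
    """Split an IN predicate into one wrapper plan per value."""
    return [{"hxp": head_expr, "table": table, "predicate": (column, "=", value)}
            for value in in_values]

def execute(plans, fetch, compensate):
    """Run each plan against the source, then apply the compensation step."""
    results = []
    for plan in plans:
        _column, _op, value = plan["predicate"]
        results.append((value, compensate(fetch(value))))
    return results

# Made-up prices standing in for live quotes.
fake_quotes = {"GE": 10.0, "IBM": 20.0, "MSFT": 30.0}

plans = plan_request("YahooF", "Price", "ticker", ["GE", "IBM", "MSFT"])
rows = execute(plans, fake_quotes.__getitem__, lambda price: round(price * 1.3, 2))
print(rows)
```

Keeping the arithmetic in `compensate` mirrors the split shown in the request, where the head expression sent to the wrapper is Price rather than price * 1.3.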
19
Query Planning
• Now we have a capability record defined.
• Add a secondary mini query planner that is designed specifically to work with capability records.
   - Can answer queries involving multiple web sources.
• Specify a query execution order of query fragments.
   - Independent query fragments are executed first.
   - Followed by dependent query fragments that can use the prior results.
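Ordering fragments so that independent ones run first is a topological sort over the dependency graph. A minimal sketch using Python's standard `graphlib` module (Python 3.9+) is shown below; the fragment names and the `execution_order` helper are hypothetical, and the slides do not say which algorithm the planner actually uses.

```python
from graphlib import TopologicalSorter

def execution_order(dependencies):
    """dependencies maps each fragment to the set of fragments whose results it needs."""
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical fragment names: the YahooF fragment needs tickers from companytable.
deps = {
    "companytable_fragment": set(),
    "yahoof_fragment": {"companytable_fragment"},
}
order = execution_order(deps)
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if two fragments depend on each other, which would correspond to a query that cannot be ordered this way.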
20
Our Solution: Example 1
Find all the stock quotes of biotech companies.

SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER IN (SELECT COMPANYTICKER FROM COMPANYTABLE WHERE INDUSTRY = 'BIOTECHNOLOGY' AND COMPANYTICKER < 'AD')

Independent query fragment:
SELECT COMPANYTICKER, INDUSTRY FROM COMPANYTABLE WHERE INDUSTRY = 'BIOTECHNOLOGY' AND COMPANYTICKER < 'AD'

Depends on the previous query fragment:
SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER = [<unbound kind>]
21
Example Query
Independent query fragment:
SELECT COMPANYTICKER, INDUSTRY FROM COMPANYTABLE WHERE INDUSTRY = 'BIOTECHNOLOGY' AND COMPANYTICKER < 'AD'

COMPANYTICKER   INDUSTRY
------------------------------
ACAD            Biotechnology
ACAM            Biotechnology
ACOR            Biotechnology
ACEL            Biotechnology

Depends on the previous query fragment:
SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER = [<unbound kind>]

The unbound ticker is filled in once per row of the first result:
SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = 'ACAD'
SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = 'ACAM'
SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = 'ACOR'
SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = 'ACEL'

TICKER   PRICE
---------------
ACAD     14.90
ACAM     6.51
ACOR     5.10
ACEL     3.18
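The two-step execution above amounts to running the independent fragment once and then issuing the dependent fragment once per returned ticker. A minimal sketch follows, with in-memory stubs in place of the two web sources; the function names are hypothetical and the stub rows are copied from the example slides.

```python
# Stub rows standing in for the company listing site.
COMPANY_TABLE = [
    ("ACAD", "Biotechnology"), ("ACAM", "Biotechnology"),
    ("ACOR", "Biotechnology"), ("ACEL", "Biotechnology"),
    ("ADH", "Biotechnology"),
]
# Stub quotes standing in for the quote site.
QUOTES = {"ACAD": 14.90, "ACAM": 6.51, "ACOR": 5.10, "ACEL": 3.18, "ADH": 0.31}

def independent_fragment(industry, ticker_below):
    """COMPANYTABLE fragment: needs no input from any other fragment."""
    return [ticker for ticker, ind in COMPANY_TABLE
            if ind == industry and ticker < ticker_below]

def dependent_fragment(ticker):
    """YAHOOF fragment: one call per ticker, as the capability record requires."""
    return (ticker, QUOTES[ticker])

tickers = independent_fragment("Biotechnology", "AD")
result = [dependent_fragment(t) for t in tickers]
print(result)
```

ADH is filtered out by the `< 'AD'` condition, so only four quote lookups are issued.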
22
Example 2
Now you want the stock price in all major currencies.

select yahooF.ticker, yahooF.price * exchange.rate, exchange.currency from
  (select ticker, price from yahooF where ticker IN (select companyticker from companytable where industry='biotechnology')) as yahooF,
  (select currency, olsen.rate from
     (select currency from currency_map where currency <> 'USD') currency_map,
     (select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06') olsen
   where currency_map.currency = olsen.exchanged and currency_map.currency <> 'USD') as exchange
23
Example 2
Get all the exchange rates against the USD on Dec 10 2006
Query fragment 1
Query fragment 2
select olsen.rate from (select currency from currency_map where currency <> 'USD') currency_map, (select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06') olsen where currency_map.currency = olsen.exchanged and currency_map.currency <> 'USD'
24
Query fragment 2:
select olsen.rate from (select currency from currency_map where currency <> 'USD') currency_map, (select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06') olsen where currency_map.currency = olsen.exchanged and currency_map.currency <> 'USD'

The olsen subquery inside it:
(select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06') olsen

Capability record:
relation(olsen, [['Exchanged', string, bound(1)], ['Expressed', string, bound(1)], ['Rate', number, free], ['Date', string, bound(1)]], ['=']).

The subquery binds Expressed and Date but not Exchanged, which the capability record marks as bound(1), so the fragment is rewritten to take Exchanged from currency_map.

Modified Query fragment 2:
(select exchanged, 'USD', rate, '12/10/06' from olsen where expressed='USD' and date='12/10/06' and exchanged in (select currency from currency_map where currency <> 'USD'))
25
Currency   rate
----------------
AUD        1.46
CAD        1.32
HKD        7.72
JPY        113.00

TICKER   PRICE
---------------
ACAD     14.90
ACAM     6.51
ACOR     5.10

TICKER   PRICE($)   PRICExRATE   CURRENCY
------------------------------------------
ACAD     14.90      21.754       AUD
ACAD     14.90      19.668       CAD
ACAD     14.90      115.028      HKD
ACAD     14.90      1683.7       JPY
ACAM     6.51       9.505        AUD
ACAM     6.51       8.593        CAD
ACAM     6.51       50.257       HKD
ACAM     6.51       735.63       JPY
ACOR     5.10       7.446        AUD
ACOR     5.10       6.732        CAD
ACOR     5.10       39.372       HKD
ACOR     5.10       576.3        JPY

select ticker, price * exchange.rate, exchange.currency
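The final step pairs every quote with every exchange rate and multiplies. A minimal sketch of that cross join, using a subset of the figures from the tables above (the `convert_quotes` name is hypothetical, and rounding to three decimals is an assumption made to match the displayed values):

```python
def convert_quotes(quotes, rates):
    """Cross join of (ticker, price) rows with (currency, rate) rows."""
    return [(ticker, price, round(price * rate, 3), currency)
            for ticker, price in quotes
            for currency, rate in rates]

quotes = [("ACAD", 14.90), ("ACAM", 6.51)]
rates = [("AUD", 1.46), ("CAD", 1.32), ("HKD", 7.72)]

rows = convert_quotes(quotes, rates)
for row in rows:
    print(row)
```

With 2 quotes and 3 rates the result has 6 rows; in the full example, 3 quotes and 4 rates give the 12 rows shown in the table.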
26
Conclusion
• Three-layered architecture for querying web sources.
• Instead of burying capability handling in each wrapper, we created a generic capability handler.
• Using this capability handler, adding a web source to a federated database is as simple as declaring the extraction rules and capability record for the source.
• This was implemented and successfully tested.
• This makes millions of semi-structured web sites into useful "databases."