Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam,...
-
Upload
austin-long -
Category
Documents
-
view
215 -
download
1
Transcript of Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam,...
![Page 1: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/1.jpg)
Data Mining on Symbolic Knowledge Extracted
from the Web
R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery
Carnegie Mellon University and J.Stefan Institute
![Page 2: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/2.jpg)
The Web
KB
Data Mining
Gather info.
Wrappers, Info. Extraction
![Page 3: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/3.jpg)
Gathering info. from the Web• Wrappers - extract from highly structured,
automatically generated Web pages– eg., movie theaters and restaurants from
Web-based entertainment guides combined with a map system
• Information Extraction - extract from free-form, unstructured data– hand-constructed extraction rules– learned rules:
• to identify the name of a person given home page• to identify relations suggested by hyperlniks
![Page 4: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/4.jpg)
Data sources
• corporations around the world• propositional and relational facts on 4312
companies collected by crawling the Web• data sources:
– Hoovers Online Web resource - selected info. about companies including Web site URL
– the first 50 Web pages from the company site - extracting predefined features
![Page 5: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/5.jpg)
The Web
KBWrapping from corp. info.
Extracting from corp. Web sites
Data Mining
Abstracting features
New knowledge
![Page 6: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/6.jpg)
Features describing a company• Extracted: types of activity, companies it points
to/mentions, officers, sector predicted from the text using Naïve Bayes, locations derived using Naïve Bayes and autoslog-based rules, country derived from URL
• Wrapped: listed sector and sub-sector, type, address including city and state, competitors, subsidiaries, products, officers, auditors, revenue, net-income, net-profit, employees
• Abstracted: companies in the same city/state, sharing officers, linking/mentioning the same companies, reciprocally linking/mentioning/compeeting,
discretized revenue, net-income, net profit, employees
![Page 7: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/7.jpg)
Data Mining Algorithms
Finding regularities:• Association rules - Apriori algorithm
– features first mapped to Boolean features, companies represented with sparse vectors
– Y :- X (support: P(X,Y), confidence:P(Y|X))
Describing a target concept:• Propositional rules - C5.0 decision tree/rules
– set-valued features (locations, officers,…) and relational features (same-city, mentions,…) excluded
• Relational rules - FOIL– Y :- X (covered positive, covered negative)
![Page 8: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/8.jpg)
Apriori experiments• 2658 rules found using default
support and confidence (10%, 80%) and all the features (about 26 000)
• the most frequent rules pointed out the need for data cleaning (errors in extraction)
• 254 rules after removing rules containing wrongly extracted features
![Page 9: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/9.jpg)
• Companies having documentation on their sites, that are located in USA or provide technical assistance, are involved in selling.– activity=sell :- locations=united-states, links-to=adobe-
systems-incorporate (10.8%, 93%)
– activity=sell :- activity=technical-assistance, links-to=adobe-systems-incorporate (11.9%, 91.1%)
• Most companies located in Japan either sell or perform research, while the companies located in USA either sell or supply.– activity=sell :- locations=japan (14.5%,90.8%)
– activity=research :- locations=japan (13.2%, 82.2%)
Regularities found I
![Page 10: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/10.jpg)
• Companies mentioning software on their Web pages are mostly located in USA.– locations=united-states :- activity=supply, activity=expertise,
mentions=software (5.3%, 64%)
• Companies in our dataset are stable in their finances.– revenue-1997=high :- revenue-1996=high, revenue-1995=high,
revenue-1994=high, revenue-1993=high (5%, 99.5%)
Regularities found II
![Page 11: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/11.jpg)
Using support 1%, confidence 10% and 4 features: url-country, sector, competitor, auditors (mapped to 3532 Boolean features)
• The most frequent auditors for our data: Price-Waterhouse Coopers LLP (13.4%), Ernst & Young LLP (11%), Arthur Andersen LLP (10.2%)
Regularities found III
![Page 12: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/12.jpg)
• Companies in computer-software-&-services have Price Waterhouse Coopers (20.9%) or Ernst & Young (14.3%) as their auditor.
• Companies in diversified-services have Price-Waterhouse Coopers (15.7%) or Arthur Andersen (13.9%) as their auditor.
• Companies in drugs have Ernst & Young (26.8%) as their auditor.
Regularities found IV
![Page 13: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/13.jpg)
• About half of the companies that compete with microsoft-corporation are in computer-software-&-services and about a quarter of companies that are in computer-software-&-services compete with microsoft-corporation.– competitor=microsoft-corporation (2.1%, 54.9%)
– competitor=microsoft-corporation :- hoovers-sector=computer-software-&-services (2.1%, 25.7%)
• Competitors as predictors for computer-software-&-services companies: Computer Sciences Corporation, Associates International Inc., SAP Aktiengesellschaft
Regularities found V
![Page 14: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/14.jpg)
• Telecommunications also have some outstanding competitors: Bellsouth Corporation, MCI Worldcom Inc., Bell Atlantic Corporation, Lucent Technologies inc..
• Most companies competing with Conagra inc., KMart Corporation and BP Amoco p.l.c. are in food-beverage-&-tobacco, retail and energy, respectively.
Regularities found VI
![Page 15: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/15.jpg)
Learning target concepts• Target concepts were selected:
hoovers-sector, hoovers-type, auditors, competitors, share-officers, country, and state.
• Decision trees for learning propositional rules
• Foil for learning first order rules: for unary relations, for binary relations
![Page 16: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/16.jpg)
Propositional rules found I
Depending on the city the company is located in, different features are used to predict the hoovers-sector:– for Atlanta, computer companies have a higher revenue than
diversified services companies (same for Chicago).
– for Houston, depending on the Naive Bayes classification (based on the company web-pages), we predict either Manufacturing, Computer Software & Services, or Energy.
– for Dallas, most Health companies are non-profit and thus have a lower income than leisure companies.
![Page 17: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/17.jpg)
Propositional rules found II
When the city is excluded from the feature set(improvements of Web-based sector classifier):
– Telecommunications has more employees than Energy (employees can help weed out incorrect classifications in the coarse-sector prediction for Energy.
– Where the Naive Bayes classifier predicts communications-services, income can be used to distinguish between Media (lower-income) and Telecommunications (high).
– Where the Naive Bayes classifier predicts investment-services, employees can be used to distinguish between Financial Services(lower) and Banking (high).
![Page 18: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/18.jpg)
First order rules found IPredicting hoovers-sector:• Companies headquartered somewhere other than
Fremont competing with ``Computer Associates International'' are in the computer software & services sector (51 pos.,0 neg.).
• Companies headquartered in New York, that are not in natural-gas-industry nor technology-sector, are in the media industry (8 pos., 0 neg.).– media(A) :- hq-city(A,new-york), sector(A,B),
B<>natural-gas-industry, coarse-sector(A,C), C<>technology-sector, competitor(?,A), performs-activity(A,?), not(products(A,?)), not(locations(A,?)).
![Page 19: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/19.jpg)
First order rules found IIPredicting auditors:• Companies headquartered in Madrid having listed
historical financial information use Arthur Andersen as their auditor (4 pos, 0 neg.).– arthur-andersen(A) :- hq-city(A,madrid), net_profit(A,?,?).
Predicting binary competitor from: hq-city, url-country, links-to, hoovers-sector
• Two companies in the same sector are competitors (11407 pos., 0 neg.).– competitor(A,B) :- A <> B, hoovers-sector(A,C), hoovers-
sector(B,C).
![Page 20: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/20.jpg)
Disucssion • Data cleaning needed - additional source of
noise from imperfect feature extractors– very frequent regularities checked manually for
obvious extractor errors
• Feature selection needed, especially for relational learning– learn simple, unary target relation to suggest
useful features for learning binary relation
• Interaction observed between symbolic features and Naïve Bayesian classifier prediction (decision tree improved Naïve Bayes prediction using symbolic features).
![Page 21: Data Mining on Symbolic Knowledge Extracted from the Web R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery Carnegie Mellon University and J.Stefan Institute.](https://reader036.fdocuments.in/reader036/viewer/2022062516/56649e695503460f94b66a6a/html5/thumbnails/21.jpg)
Future work• Information from wrapped Web sites can be
used to train extractors.• Extractors extended to use text and
symbolic features (look at Web pages and use background knowledge).
• Use data mining to identify potential errors in extractors.
• Combine unsupervised search for associations with rule learning in an incremental, iterative approach supporting data cleaning and feature selection.