1TIM

Ta Nha Linh

13 March 2009

Harvesting useful information on researchers' home pages

Ta Nha Linh

Supervisor: Asst. Prof. Min-Yen Kan

Outline

• Motivation

• Challenges

• Researchers Information COllector (RICO)

• Contributions

• Future Works

Motivation

• Databases dedicated to scientific publications: CiteSeer, Google Scholar, ACM Portal, SpringerLink

• What about the authors of those publications?

• Existing databases are publication-centric.

Page 5: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

5TIM

Ta Nha Linh

13 March 2009

Motivation

• Researcher-centric databases?

– Singapore Researchers Database: researchers must sign up and enter their own data, under restricted conditions; covers Singapore only

– Resilience Alliance Researchers Database: manual submission by researchers; covers ecological and social sciences

– Some other similar databases: manually updated, specific to a certain organization

Page 6: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

6TIM

Ta Nha Linh

13 March 2009

• Goal: an automated system that builds a researcher database covering multiple disciplines

• Input: Researchers’ home pages.

– Basic information

– Contact information

– Educational history

– Publications

Page 7: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

7TIM

Ta Nha Linh

13 March 2009

Outline

• Motivation

• Challenges

• Researchers Information COllector (RICO)

• Contributions

• Future Works

Page 8: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

8TIM

Ta Nha Linh

13 March 2009

Challenges

• Different layouts

– Templates

– Personal pages

• Different content

– Pages introducing researchers

– CV-like

– Personal pages

• Different content structures

– Tables / lists

– Natural language text

Page 9: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

9TIM

Ta Nha Linh

13 March 2009

Page 10: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

10TIM

Ta Nha Linh

13 March 2009

Page 11: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

11TIM

Ta Nha Linh

13 March 2009

Page 12: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

12TIM

Ta Nha Linh

13 March 2009

Challenges

• Different data presentations

– hangli at microsoft dot com

– cs.duke.edu, junyang

– [email protected]

– erafalin(at)cs.tufts.edu

– <Image src=’email.jpg’/>

– Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk

– wmt then the at-sign then uci dot edu
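To make the variety above concrete, here is a minimal sketch of how a few of these obfuscations could be rewritten into a plain address. It is an illustration only, not part of RICO as presented: the function name `normalize_email` and the rule list are assumptions, and the rules cover only the textual idioms shown on this slide. Reversed-order and image-based addresses are deliberately left unresolved.

```python
import re

# Hypothetical rewrite rules for the "@" sign, tried in order (most specific first).
AT_PATTERNS = [
    r"\s*-\s*replace all this by at symbol\s*-\s*",
    r"\s+then the at-sign then\s+",
    r"\s*\(at\)\s*",
    r"\s+at\s+",
]
DOT_PATTERN = r"\s+dot\s+"
PLAIN_EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"


def normalize_email(text):
    """Return a plain address for a known obfuscation idiom, else None."""
    s = text.strip()
    for pattern in AT_PATTERNS:
        # Replace only the first occurrence, then stop trying other idioms.
        s, replaced = re.subn(pattern, "@", s, count=1, flags=re.IGNORECASE)
        if replaced:
            break
    s = re.sub(DOT_PATTERN, ".", s, flags=re.IGNORECASE)
    return s if re.fullmatch(PLAIN_EMAIL, s) else None
```

Each new idiom needs its own rule, which is one reason hand-written patterns alone do not scale to arbitrary home pages.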

Page 13: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

13TIM

Ta Nha Linh

13 March 2009

Outline

• Motivation

• Challenges

• Researchers Information COllector (RICO)

• Contributions

• Future Works

Page 14: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

14TIM

Ta Nha Linh

13 March 2009

Researchers Information COllector (RICO)

• Field Identification

• Home page Identification

• Post Processing

Page 15: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

15TIM

Ta Nha Linh

13 March 2009

RICO - Architecture

Home page Identification → Field Identification → Post-Processing

Page 16: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

16TIM

Ta Nha Linh

13 March 2009

Researchers Information COllector (RICO)

• Field Identification

• Home page Identification

• Post Processing

Page 17: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

17TIM

Ta Nha Linh

13 March 2009

Field Identification - Purpose

• To map data in the page contents to the corresponding fields in a predefined set of desired information.

• The current set includes:

Name – Position – Affiliation

Address – Phone – Fax – Email

BS year – BS major – BS university

MS year – MS major – MS university

PhD year – PhD major – PhD university

Research Interest – Publications

Page 18: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

18TIM

Ta Nha Linh

13 March 2009

Field Identification - Related works

• Tang et al. (2007, 2008) – ArnetMiner

– Preprocessing: tokenize text into 5 categories

– Tagging of tokens using a Conditional Random Field (CRF)

– F1 = 83.37% (~1,000 researchers)

– Set of features used:

+ Content features (word, morphological, image features)

+ Pattern features (positive word, special token, researcher name features)

+ Term features (term, dictionary features)

Page 19: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

19TIM

Ta Nha Linh

13 March 2009

Field Identification - Related works

• Tang et al (2007), (2008) – ArnetMiner

– Takes the researcher’s name as input. This is important information to exploit when parsing other fields. TIM does not have this input.

– Based only on the text of the page. Stylistic information can also be of use.

Page 20: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

20TIM

Ta Nha Linh

13 March 2009

Field Identification - Methodology

• Input: a researcher home page

• CRF is the learning model

• Features used

– Global features

– Lexicon features

– Context features

– Dictionary features

– Stylistic features

Page 21: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

21TIM

Ta Nha Linh

13 March 2009

Field Identification - Methodology

• Global features: apply to the current token

– Morphological features

– Initials

– Number

– Punctuation

• Lexicon features: apply to the current token

– Positive words for certain annotation fields: Position, Affiliation, Address, Phone, Fax, Email
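The token-level features listed above could be computed along the following lines. This is a rough sketch rather than RICO's actual feature extractor: the function name, the feature names and the positive-word lists are all hypothetical, since the slides name the fields but not the words.

```python
import string

# Hypothetical positive-word lists; the slides do not give the actual words.
POSITIVE_WORDS = {
    "position": {"professor", "lecturer", "director"},
    "affiliation": {"university", "department", "institute"},
}


def token_features(token):
    """Compute global and lexicon features for a single token."""
    features = {
        # Morphological cues based on capitalization
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        # An initial such as "J."
        "is_initial": len(token) == 2 and token[0].isupper() and token[1] == ".",
        "is_number": token.isdigit(),
        "is_punctuation": bool(token) and all(ch in string.punctuation for ch in token),
    }
    lowered = token.lower()
    for field, words in POSITIVE_WORDS.items():
        features["positive_" + field] = lowered in words
    return features
```

Each feature value would then become one column of the CRF input line for that token.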

Page 22: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

22TIM

Ta Nha Linh

13 March 2009

• Context features: apply to the whole line

– Name context

– Address context

– Phone context: 'phone', 'tel', 'mobile'

– Fax context: 'fax', 'facsimile'

– Email context: 'email', 'e-mail'

– Bachelor (BS) context: appearance of 'B.S', 'BS' or 'Bachelor'

– Master (MS) context: appearance of 'M.S', 'MS' or 'Master'

– Ph.D (PhD) context: appearance of 'Ph.D', 'Doctorate' or 'Doctor(ate) of Philosophy'

– Research-interest context: multiple-line property

– Publication context: multiple-line property

– Degree: helps to differentiate BS/MS/PhD information correctly when it is presented in prose style or on the same line.
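As an illustration of line-level context features, the sketch below checks a line for the trigger words quoted above. The keywords come from the slide; the function name, the regular expressions and the choice of case sensitivity are assumptions made for this example.

```python
import re

# Contact keywords are matched case-insensitively; degree abbreviations are
# matched case-sensitively to avoid hits inside ordinary lowercase words.
CONTEXT_PATTERNS = {
    "phone": re.compile(r"\b(?:phone|tel|mobile)\b", re.IGNORECASE),
    "fax": re.compile(r"\b(?:fax|facsimile)\b", re.IGNORECASE),
    "email": re.compile(r"\b(?:email|e-mail)\b", re.IGNORECASE),
    "bs": re.compile(r"\b(?:B\.S\b|BS\b|Bachelor)"),
    "ms": re.compile(r"\b(?:M\.S\b|MS\b|Master)"),
    "phd": re.compile(r"Ph\.D|Doctorate|Doctor(?:ate)? of Philosophy"),
}


def line_context_features(line):
    """Return one boolean per context; every token on the line shares them."""
    return {name: bool(p.search(line)) for name, p in CONTEXT_PATTERNS.items()}
```

Because the features are computed per line, a line such as "Tel: ..." marks all of its tokens with the phone context, which is what lets the model label the digits that follow.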

Page 23: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

23TIM

Ta Nha Linh

13 March 2009

• Dictionaries

– ParsCit dictionary: detects male names, female names, popular last names, month names, place names and publisher names; each is a single feature

– Major dictionary: helps to identify researchers' majors in their educational history; may also help with Research Interests

– Research dictionary: entries classified into high/mid/low confidence

– Universities dictionary: names of most universities, according to the Open Directory
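A dictionary lookup of this kind might look like the sketch below. The entries, the function name and the substring-matching strategy are all hypothetical and chosen only to show how a confidence level and a university flag could be turned into features.

```python
# Hypothetical dictionary entries, for illustration only.
RESEARCH_TERMS = {
    "machine learning": "high",
    "retrieval": "mid",
    "systems": "low",
}
UNIVERSITIES = {"national university of singapore", "duke university"}
CONFIDENCE_ORDER = ["high", "mid", "low"]


def dictionary_features(line):
    """Look up research terms and university names in one line of text."""
    text = " ".join(line.lower().split())  # lowercase and collapse whitespace
    levels = [level for term, level in RESEARCH_TERMS.items() if term in text]
    # Keep the strongest confidence level found on the line, if any.
    best = min(levels, key=CONFIDENCE_ORDER.index) if levels else None
    return {
        "research_confidence": best,
        "has_university": any(name in text for name in UNIVERSITIES),
    }
```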

Page 24: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

24TIM

Ta Nha Linh

13 March 2009

Field Identification - Methodology

• Stylistic features

– List feature

– Table features

– Section feature: based on HTML tags such as <div>, <p>, <title>, header tags, list elements and tables
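One possible way to derive such stylistic features is to track the open HTML tags while walking the page, as in this sketch built on Python's standard `html.parser`. The class name, the feature names and the exact tag sets are assumptions, not RICO's implementation.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol", "li"}
TABLE_TAGS = {"table", "tr", "td", "th"}
SECTION_TAGS = {"div", "p", "title", "h1", "h2", "h3", "h4", "h5", "h6",
                "ul", "ol", "li", "table"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class StyleTagger(HTMLParser):
    """Record each text chunk together with features of its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.chunks = []  # (text, features) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        open_tags = set(self.stack)
        section = next((t for t in reversed(self.stack) if t in SECTION_TAGS), None)
        self.chunks.append((text, {
            "in_list": bool(open_tags & LIST_TAGS),
            "in_table": bool(open_tags & TABLE_TAGS),
            "section_tag": section,  # innermost enclosing section-like tag
        }))


def style_features(html):
    tagger = StyleTagger()
    tagger.feed(html)
    tagger.close()
    return tagger.chunks
```

For example, `style_features("<ul><li>Paper A</li></ul>")` yields one chunk flagged as being inside a list, with `li` as its section tag.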

Page 25: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

25TIM

Ta Nha Linh

13 March 2009

Field Identification - Performance

Data set of 40 home pages, cross-validation

Overall: Precision 70.66% – Recall 62.73% – F1 64.87%

Class               Precision   Recall    F1
name                75.66%      51.34%    61.17%
phone               53.38%      89.25%    66.80%
fax                 47.73%      72.41%    57.53%
email               79.31%      70.77%    74.80%
address             78.90%      74.57%    76.67%
affiliation         30.27%      59.47%    40.12%
position            79.46%      64.49%    71.20%
research-interest   48.48%      36.04%    41.34%
publications        71.05%      43.27%    53.79%
bs-major            88.89%      78.05%    83.12%
bs-uni              68.67%      57.00%    62.30%
bs-year             90.00%      72.00%    80.00%
ms-major            71.43%      32.26%    44.44%
ms-uni              52.94%      52.94%    52.94%
ms-year             77.78%      56.00%    65.12%
phd-major           83.33%      73.17%    77.92%
phd-uni             74.56%      72.03%    73.28%
phd-year            100.00%     74.07%    85.11%
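For readers checking the figures, each per-class F1 is the harmonic mean of that class's precision and recall. The overall F1 of 64.87 also matches an unweighted (macro) average of the 18 per-class F1 values; that is an observation from the numbers, not something the slides state. A small sketch, with function names chosen for this example:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


# Per-class F1 values copied from the table above, in percent.
PER_CLASS_F1 = [
    61.17, 66.80, 57.53, 74.80, 76.67, 40.12, 71.20, 41.34, 53.79,
    83.12, 62.30, 80.00, 44.44, 52.94, 65.12, 77.92, 73.28, 85.11,
]

name_f1 = round(f1_score(75.66, 51.34), 2)        # 'name' row
overall_f1 = round(macro_average(PER_CLASS_F1), 2)
```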

Page 26: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

26TIM

Ta Nha Linh

13 March 2009

Field Identification - Discussion

• The data fields to be annotated are similar to those of ArnetMiner.

– Extra: Name, Research Areas, Publications

– Missing: Image

• Use of stylistic features is minimal

Page 27: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

27TIM

Ta Nha Linh

13 March 2009

Field Identification - Discussion

• The F1 value is significantly lower than ArnetMiner’s (64.87% vs. 83.37%)

– ArnetMiner has the researcher name as input, and uses features referring to the researcher name to identify other fields. RICO has no prior knowledge about the page to be parsed.

+ Heuristic to improve confidence in ‘Name’

+ Make use of the affiliation name input

– Identifying ‘Research Interest’ and ‘Publications’ is challenging.

+ Improve ‘Publications’

Page 28: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

28TIM

Ta Nha Linh

13 March 2009

Researchers Information COllector (RICO)

• Field Identification

• Home page Identification

• Post Processing

Page 29: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

29TIM

Ta Nha Linh

13 March 2009

Home page Identification - Purpose

• Add-on component

• To complete automation of the system

Page 30: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

30TIM

Ta Nha Linh

13 March 2009

Home page Identification – Related works

• Ahoy!

– Input: Researcher name and (optional) institution name

– “Home page”: allocated page, classified by URL patterns

• RICO

– Input: Institution name

– “Home page”: allocated page with biographical information, classified by contents

Page 31: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

31TIM

Ta Nha Linh

13 March 2009

Home page Identification – Methodology

• Collect a list of university domains

• Use Yahoo! BOSS to search for professors in the institutions

• For each valid web page, fetch the page and scan it for words indicating ‘phone’, ‘mail’ and ‘professor’.

• Classify by the number of keyword occurrences.

• Pages classified as home pages are passed to the Field Identification component.
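The keyword-counting step could be sketched as follows. The three keyword groups come from the slide, but the exact patterns, the function names and the threshold of three occurrences are assumptions made for illustration; the slides do not state the actual decision rule.

```python
import re

# Patterns for words indicating 'phone', 'mail' and 'professor'.
KEYWORD_PATTERNS = {
    "phone": re.compile(r"\b(?:phone|tel)\b", re.IGNORECASE),
    "mail": re.compile(r"\b(?:e-?)?mail\b", re.IGNORECASE),
    "professor": re.compile(r"\bprof(?:essor)?\b", re.IGNORECASE),
}
MIN_TOTAL = 3  # hypothetical threshold


def keyword_counts(page_text):
    """Count occurrences of each keyword group in the page text."""
    return {name: len(p.findall(page_text)) for name, p in KEYWORD_PATTERNS.items()}


def looks_like_home_page(page_text, min_total=MIN_TOTAL):
    """Classify a fetched page by the total number of keyword occurrences."""
    return sum(keyword_counts(page_text).values()) >= min_total
```

A content-based check like this is what distinguishes the approach from classifying by URL pattern alone.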

Page 32: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

32TIM

Ta Nha Linh

13 March 2009

Home page Identification – Discussion

• The query used cannot retrieve all relevant pages. It is tuned for the majority case: professors in institutions.

– Target researchers in research organizations.

• Drawback: the result set from Yahoo! BOSS may contain duplicate pages, or sub-pages of a researcher’s home page, which are treated as 2 different records.

– Needs high confidence in overall system performance. But researcher names are not unique.

– Ideally, duplicates would be eliminated by analyzing URLs. But domain hierarchies differ within a department, between departments, and between institutions.

Page 33: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

33TIM

Ta Nha Linh

13 March 2009

Researchers Information COllector (RICO)

• Field Identification

• Home page Identification

• Post Processing

Page 34: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

34TIM

Ta Nha Linh

13 March 2009

Post-processing - Purpose

• Input: CRF++ output file from Field Identification.

• Group neighboring tokens identified with the same annotation tag

• Deduplication

• Store into database (current size ~ 170,000 researchers)
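The grouping step can be illustrated with a short sketch. It assumes the usual CRF++ test output layout, in which each line holds whitespace-separated columns with the token first and the predicted tag last, and blank lines separate sequences. The function name and the sample tags are hypothetical.

```python
from itertools import groupby


def group_tagged_tokens(crf_output):
    """Merge runs of neighboring tokens that carry the same predicted tag."""
    rows = []
    for line in crf_output.splitlines():
        columns = line.split()
        # A blank line marks a sequence boundary and must break any run.
        rows.append((columns[0], columns[-1]) if columns else None)

    spans = []
    for tag, run in groupby(rows, key=lambda row: row[1] if row else None):
        if tag is None:
            continue  # skip sequence boundaries
        spans.append((tag, " ".join(token for token, _ in run)))
    return spans
```

The resulting (tag, text) spans are the units that deduplication and database storage would then operate on.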

Page 35: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

35TIM

Ta Nha Linh

13 March 2009

Outline

• Motivation

• Challenges

• Researchers Information COllector (RICO)

• Contributions

• Future Works

Page 36: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

36TIM

Ta Nha Linh

13 March 2009

Contributions

• Produced an automated system for fetching researchers’ information from the World Wide Web.

• Introduced a number of features for machine-learning-based Field Identification.

Page 37: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

37TIM

Ta Nha Linh

13 March 2009

Outline

• Motivation

• Challenges

• Researchers Information COllector (RICO)

• Contributions

• Future Works

Page 38: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

38TIM

Ta Nha Linh

13 March 2009

Future improvements

• Field Identification

– Introduce more features, especially stylistic features

– Strengthen features targeting the Name, Research Interest and Publications tags

– Cater for the <image> tag

– Handle pages that use HTML frames

– Follow links on the page if necessary

• Home page Identification

– Improve heuristics

• Post-processing

– Refine the output from Field Identification

• A new front-end component that lets users query the database

Page 39: Ta Nha Linh 1TIM13 March 2009 Harvesting useful information on researchers' home pages Ta Nha Linh Supervisor: Asst. Prof. Min-Yen Kan.

39TIM

Ta Nha Linh

13 March 2009

THANK YOU!

Questions?