Ta Nha Linh

13 March 2009

Harvesting useful information on researchers' home pages

Ta Nha Linh

Supervisor: Asst. Prof. Min-Yen Kan

Outline

• Motivation

• Challenges

• Researchers Information COllector (RICO)

• Contributions

• Future Works

Motivation

• Existing databases are dedicated to scientific publications: CiteSeer, Google Scholar, ACM Portal, SpringerLink

• They are publication-centric.

• What about the authors of those publications?

• Researcher-centric databases?

– Singapore Researchers Database: researchers must sign up and enter their own data, restricted conditions, Singapore only

– Resilience Alliance Researchers Database: manual submission by researchers, ecological and social sciences only

– Some other similar databases: manual updates, specific to a certain organization

• Goal: an automated system that builds a researcher database covering multiple disciplines

• Input: researchers’ home pages

• Information to extract:

– Basic information

– Contact information

– Educational history

– Publications

Challenges

• Different layouts

– Templates

– Personal pages

• Different content

– Pages introducing researchers

– CV-like

– Personal pages

• Different content structures

– Tables / lists

– Natural language text

• Different data presentations, e.g. of e-mail addresses:

– hangli at microsoft dot com

– cs.duke.edu, junyang

– ASJMZheng@ntu.edu.sg

– erafalin(at)cs.tufts.edu

– <Image src=’email.jpg’/>

– Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk

– wmt then the at-sign then uci dot edu
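
To make this variety concrete, the following is a minimal Python sketch that normalizes a few of the textual obfuscations shown above. It is an illustration only and not RICO's method: the function name `normalize_email` and the substitution rules are assumptions made for this example. Image-based addresses and free-form instructions stay out of reach of such simple rules, which is why the function returns `None` for them.

```python
import re

# Hypothetical substitution rules covering a few of the patterns above.
_AT = re.compile(r"\s*(?:\(at\)|\[at\]|\bat\b)\s*", re.IGNORECASE)
_DOT = re.compile(r"\s*(?:\(dot\)|\[dot\]|\bdot\b)\s*", re.IGNORECASE)
_EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")


def normalize_email(text):
    """Return a plain address if the simple rules recover one, else None."""
    # Replace the first "at" marker with "@", then every "dot" marker with ".".
    candidate = _AT.sub("@", text.strip(), count=1)
    candidate = _DOT.sub(".", candidate)
    # Accept the result only if it now has the shape of an e-mail address.
    return candidate if _EMAIL.match(candidate) else None
```

Under these rules, "hangli at microsoft dot com" and "erafalin(at)cs.tufts.edu" are recovered, while "wmt then the at-sign then uci dot edu" is rejected and would need a more specific rule.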

Researchers Information COllector (RICO)

• Field Identification

• Home page Identification

• Post Processing

RICO - Architecture

Home page Identification → Field Identification → Post-Processing

Field Identification - Purpose

• To map data in the page content to the corresponding fields in a predefined set of desired information.

• Current set includes:

– Name, Position, Affiliation

– Address, Phone, Fax, Email

– BS year, BS major, BS university

– MS year, MS major, MS university

– PhD year, PhD major, PhD university

– Research Interest, Publications

Field Identification - Related works

• Tang et al. (2007), (2008) – ArnetMiner

– Preprocessing: tokenize text into 5 categories

– Tagging of tokens using a Conditional Random Field (CRF)

– F1 = 83.37% (~1,000 researchers)

– Set of features used:

+ Content features (word, morphological, image features)

+ Pattern features (positive word, special token, researcher name features)

+ Term features (term, dictionary features)

– Takes the researcher’s name as input. This is an important piece of information to make use of when parsing other fields, and it differs from TIM.

– Based only on the text of the page. Stylistic information could also be of use.

Field Identification - Methodology

• Input: a researcher home page

• CRF is the learning model

• Features used

– Global features

– Lexicon features

– Context features

– Dictionary features

– Stylistic features

• Global features: apply to the current token

– Morphological features

– Initials

– Number

– Punctuation

• Lexicon features: apply to the current token

– Positive words for certain annotation fields: Position, Affiliation, Address, Phone, Fax, Email
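
As a rough illustration of token-level features, here is a minimal Python sketch that builds a feature dictionary for a single token. The function name, the feature names, and above all the contents of `POSITIVE_WORDS` are placeholders invented for this example; the slides do not give the actual lexicon or the exact feature definitions.

```python
import string

# Placeholder lexicon: the actual positive-word lists are not given in the slides.
POSITIVE_WORDS = {
    "position": {"professor", "lecturer"},
    "affiliation": {"university", "department"},
}


def token_features(token):
    """Build a feature dict for one token (global + lexicon features)."""
    lowered = token.lower()
    features = {
        # Morphological shape of the token
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isalpha() and token.isupper(),
        # Initials such as "J." in a name
        "is_initial": len(token) == 2 and token[0].isupper() and token[1] == ".",
        "is_number": token.isdigit(),
        "is_punctuation": bool(token) and all(c in string.punctuation for c in token),
    }
    # One boolean lexicon feature per annotation field
    for field, words in POSITIVE_WORDS.items():
        features["lex_" + field] = lowered in words
    return features
```

A dictionary like this would then be serialized into one column per feature for the CRF training file.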

• Context features: apply to the whole line

– Name context

– Address context

– Phone context: 'phone', 'tel', 'mobile'

– Fax context: 'fax', 'facsimile'

– Email context: 'email', 'e-mail'

– Bachelor (BS) context: appearance of 'B.S' or 'BS' or 'Bachelor'

– Master (MS) context: appearance of 'M.S' or 'MS' or 'Master'

– Ph.D (PhD) context: appearance of 'Ph.D' or 'Doctorate' or 'Doctor(ate) of Philosophy'

– Research-interest context: multiple-line property

– Publication context: multiple-line property

– Degree context: helps to differentiate BS/MS/PhD information correctly when it is presented in prose style or on the same line
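
The keyword-based contexts can be sketched as a set of line-level pattern checks. The keywords below come from the list above; the case handling, the word-boundary matching, and the names `CONTEXT_PATTERNS` and `context_features` are assumptions made for this sketch. Contexts whose triggers are not spelled out in the slides (name, address, research interest, publication) are left out.

```python
import re

# Keywords are taken from the list above; case handling and word-boundary
# matching are assumptions made for this sketch.
CONTEXT_PATTERNS = {
    "phone": re.compile(r"\b(?:phone|tel|mobile)\b", re.IGNORECASE),
    "fax": re.compile(r"\b(?:fax|facsimile)\b", re.IGNORECASE),
    "email": re.compile(r"\be-?mail\b", re.IGNORECASE),
    "bs": re.compile(r"\b(?:B\.?S\b|Bachelor)"),
    "ms": re.compile(r"\b(?:M\.?S\b|Master)"),
    "phd": re.compile(
        r"\b(?:Ph\.?D\b|Doctor(?:ate)?\s+of\s+Philosophy|Doctorate\b)"
    ),
}


def context_features(line):
    """Return the set of contexts whose keywords appear in the line."""
    return {name for name, pattern in CONTEXT_PATTERNS.items() if pattern.search(line)}
```

Because the check runs on the whole line, every token on that line would receive the same context features, which matches the "apply to the whole line" description.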

• Dictionaries

– ParsCit dictionary: detects male names, female names, popular last names, month names, place names and publisher names; each is a single feature

– Major dictionary: helps to identify researchers' majors in their educational history; may also help with Research Interests

– Research dictionary: entries classified into high/mid/low confidence

– Universities dictionary: names of most universities, according to the Open Directory

• Stylistic features

– List feature

– Table features

– Section feature: based on HTML tags like <div>, <p>, <title>, header tags, list elements, table
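
One possible way to derive such features is to track which tags are open around each piece of text. The sketch below uses Python's standard `html.parser` module; the class name `StyleTagger`, the tag groupings, and the feature names are assumptions for illustration, not RICO's actual implementation.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol", "li"}
TABLE_TAGS = {"table", "tr", "td", "th"}
SECTION_TAGS = {"div", "p", "title", "h1", "h2", "h3", "h4", "h5", "h6"}


class StyleTagger(HTMLParser):
    """Collect (text, features) pairs; features describe the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.chunks = []      # (text, feature dict) per non-empty text node

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags in between.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        sections = [t for t in self.open_tags if t in SECTION_TAGS]
        self.chunks.append((text, {
            "in_list": any(t in LIST_TAGS for t in self.open_tags),
            "in_table": any(t in TABLE_TAGS for t in self.open_tags),
            "section": sections[-1] if sections else None,
        }))
```

Feeding a page with `StyleTagger().feed(html)` fills `chunks`, and each token in a chunk can then inherit that chunk's stylistic features.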

Field Identification - Performance

Data set of 40 home pages, cross validation

Overall: Precision 70.66%, Recall 62.73%, F1 64.87%

Class              Precision  Recall   F1
name               75.66%     51.34%   61.17%
phone              53.38%     89.25%   66.80%
fax                47.73%     72.41%   57.53%
email              79.31%     70.77%   74.80%
address            78.90%     74.57%   76.67%
affiliation        30.27%     59.47%   40.12%
position           79.46%     64.49%   71.20%
research-interest  48.48%     36.04%   41.34%
publications       71.05%     43.27%   53.79%
bs-major           88.89%     78.05%   83.12%
bs-uni             68.67%     57.00%   62.30%
bs-year            90.00%     72.00%   80.00%
ms-major           71.43%     32.26%   44.44%
ms-uni             52.94%     52.94%   52.94%
ms-year            77.78%     56.00%   65.12%
phd-major          83.33%     73.17%   77.92%
phd-uni            74.56%     72.03%   73.28%
phd-year           100.00%    74.07%   85.11%
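
For reference, F1 is the harmonic mean of precision and recall. The short Python check below recomputes the 'name' row from its precision and recall, and averages the per-class F1 values from the table. That the overall figures are unweighted (macro) averages over the 18 classes is an inference from the numbers, not something the slides state.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class F1 values copied from the table above.
PER_CLASS_F1 = [
    61.17, 66.80, 57.53, 74.80, 76.67, 40.12, 71.20, 41.34, 53.79,
    83.12, 62.30, 80.00, 44.44, 52.94, 65.12, 77.92, 73.28, 85.11,
]

name_f1 = f1_score(75.66, 51.34)                   # 'name' row
macro_f1 = sum(PER_CLASS_F1) / len(PER_CLASS_F1)   # unweighted mean over classes
```

Rounded to two decimals, `name_f1` agrees with the 61.17 in the table and `macro_f1` agrees with the reported overall F1 of 64.87.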

Field Identification - Discussion

• The data fields to be annotated are similar to ArnetMiner's.

– Extra: Name, Research Areas, Publications

– Missing: Image

• Use of stylistic features is minimal

• The F1 value is considerably lower than ArnetMiner’s

– ArnetMiner takes the researcher name as input and uses features referring to the researcher name to identify other fields. RICO has no prior knowledge about the page to be parsed.

+ Heuristic to improve confidence of ‘Name’

+ Make use of the affiliation name input

– Identifying ‘Research Interest’ and ‘Publications’ is challenging.

+ Improve ‘Publications’

Home page Identification - Purpose

• Add-on component

• Completes the automation of the system

Home page Identification – Related works

• Ahoy!

– Input: Researcher name and (optional) institution name

– “Home page”: allocated page, classified by URL patterns

• RICO

– Input: Institution name

– “Home page”: allocated page with biographical information, classified by contents

Home page Identification – Methodology

• Collect a list of university domains

• Use Yahoo! BOSS to search for professors in the institutions

• For each valid web page, fetch the page and scan it for words indicating ‘phone’, ‘mail’ and ‘professor’.

• Classify by the number of appearances of the keywords.

• Pages classified as home pages are passed to the Field Identification component.
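
The keyword-counting step could look roughly like the Python sketch below. The three keywords come from the steps above; the substring matching, the function names, and the decision rule with its `min_total` threshold are hypothetical, since the slides do not state how the counts are turned into a decision. Fetching pages and querying the search service are left out.

```python
import re

KEYWORDS = ("phone", "mail", "professor")  # keywords named in the steps above


def keyword_counts(page_text):
    """Count case-insensitive occurrences of each keyword in the page text."""
    lowered = page_text.lower()
    return {word: len(re.findall(re.escape(word), lowered)) for word in KEYWORDS}


def looks_like_home_page(page_text, min_total=3):
    """Hypothetical rule: every keyword appears and the total reaches min_total."""
    counts = keyword_counts(page_text)
    return all(counts.values()) and sum(counts.values()) >= min_total
```

Matching on substrings means ‘mail’ also fires on "email" and "e-mail", which seems to be the intent of scanning for words that indicate a mail address.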

Home page Identification – Discussion

• The query used cannot retrieve all relevant pages. It is tuned for the majority case: professors in institutions.

– Target researchers in research organizations.

• Drawback: the result set from Yahoo! BOSS may contain duplicate pages, or sub-pages of a researcher’s home page, which are treated as two different records.

– Needs high confidence in overall system performance, but researcher names are not unique.

– Ideally, duplicates would be eliminated by analyzing URLs, but domain hierarchies differ within a department, between departments, and between institutions.

Post-processing - Purpose

• Input: CRF++ output file from Field Identification.

• Group neighboring tokens that carry the same annotation tag

• Deduplication

• Store the results in the database (current size ~ 170,000 researchers)
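
The grouping step can be sketched with `itertools.groupby`. The sketch assumes one token per line, with the token in the first column and the predicted tag in the last column, and blank lines between sequences. The function name `group_tokens` and the tag labels in the usage note are hypothetical.

```python
from itertools import groupby


def group_tokens(lines):
    """Merge runs of adjacent tokens that share a predicted tag.

    Assumes one token per line, with the token in the first column and the
    predicted tag in the last column; blank lines end a sequence.
    """
    rows = []
    for line in lines:
        columns = line.split()
        # A blank line is kept as a (None, None) separator so runs do not
        # merge across sequence boundaries.
        rows.append((columns[0], columns[-1]) if columns else (None, None))
    groups = []
    for tag, run in groupby(rows, key=lambda row: row[1]):
        if tag is None:
            continue
        groups.append((tag, " ".join(token for token, _ in run)))
    return groups
```

For example, two consecutive lines tagged `name` with the tokens "Jane" and "Doe" would be merged into the single value "Jane Doe" before deduplication and storage.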

Contribution

• Produced an automated system for fetching researchers’ information from the World Wide Web.

• Introduced a number of features for machine-learned Field Identification.

Future improvements

• Field Identification

– Introduce more features, especially stylistic features

– Strengthen features targeting the Name, Research Interest and Publications tags

– Cater for the <image> tag

– Handle pages that use HTML frames

– Follow links on the page if necessary

• Home page Identification

– Improve heuristics

• Post-processing

– Refine the output from Field Identification

• A new front-end component that lets users query the database

THANK YOU!

Questions?