Ta Nha Linh

13 March 2009

Harvesting useful information on researchers' home pages

Ta Nha Linh

Supervisor: Asst. Prof. Min-Yen Kan

Outline

• Motivation

• Challenges

• Researchers Information COllector (RICO)

• Contributions

• Future Work

Motivation

• Databases dedicated to scientific publications: CiteSeer, Google Scholar, ACM Portal, SpringerLink

• What about the authors of those publications?

• Existing databases are publication-centric.

Motivation

• Researcher-centric databases?

– Singapore Researchers Database: researchers must sign up and enter their own data; restricted conditions; Singapore only

– Resilience Alliance Researchers Database: manual submission by researchers; ecological and social sciences only

– Some other similar databases: updated manually; specific to a certain organization

• Goal: An automated system that builds a researcher database covering multiple disciplines

• Input: Researchers’ home pages.

– Basic information

– Contact information

– Educational history

– Publications

Challenges

• Different layouts

– Templates

– Personal pages

• Different content

– Pages introducing researchers

– CV-like

– Personal pages

• Different content structures

– Tables / lists

– Natural language text

Challenges

• Different data presentations, e.g. email addresses

– hangli at microsoft dot com

– cs.duke.edu, junyang

– ASJMZheng@ntu.edu.sg

– erafalin(at)cs.tufts.edu

– <Image src='email.jpg'/>

– Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk

– wmt then the at-sign then uci dot edu
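To illustrate why these presentations are hard to handle uniformly, here is a minimal sketch of a rule-based normalizer for a few of the spellings listed on this slide. It is not RICO's method (the slides describe a learned CRF tagger); the function name `normalize_email` and the substitution rules are assumptions made for illustration only.

```python
import re

# Hypothetical substitution rules covering a few of the spellings on the slide.
_AT = re.compile(r"\s*(?:\(at\)|\[at\]|\bat\b|then the at-sign then)\s*", re.IGNORECASE)
_DOT = re.compile(r"\s+dot\s+", re.IGNORECASE)
_EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")


def normalize_email(text):
    """Return a plain address, or None if these rules cannot recover one."""
    candidate = _DOT.sub(".", text.strip())
    # Replace only the first "at"-like marker with the @ symbol.
    candidate = _AT.sub("@", candidate, count=1)
    return candidate if _EMAIL.match(candidate) else None


for raw in ["hangli at microsoft dot com",
            "erafalin(at)cs.tufts.edu",
            "<Image src='email.jpg'/>"]:
    print(raw, "->", normalize_email(raw))
```

Fixed rules like these recover the simpler variants but return `None` for the image-based address, the reordered "cs.duke.edu, junyang" form and the free-text "replace all this by at symbol" form, which is one reason a hand-written rule set does not scale across home pages.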

Researchers Information COllector (RICO)

• Field Identification

• Home page Identification

• Post Processing

RICO - Architecture

Home page Identification → Field Identification → Post-Processing

Field Identification - Purpose

• To map data in the page content to the corresponding fields in a predefined set of desired information.

• Current set includes:

– Name, Position, Affiliation

– Address, Phone, Fax, Email

– BS year, BS major, BS university

– MS year, MS major, MS university

– PhD year, PhD major, PhD university

– Research Interest, Publications

Field Identification - Related works

• Tang et al. (2007, 2008) – ArnetMiner

– Preprocessing: tokenize text into 5 categories

– Tagging of tokens using a Conditional Random Field (CRF)

– F1 = 83.37% (~1,000 researchers)

– Set of features used:

+ Content features (word, morphological, image features)

+ Pattern features (positive word, special token, researcher name features)

+ Term features (term, dictionary features)

Field Identification - Related works

• Tang et al (2007), (2008) – ArnetMiner

– Takes the researcher’s name as input. This is important information that can be exploited when parsing other fields. Different from TIM.

– Based only on the text of the page. Stylistic information could also be of use.

Field Identification - Methodology

• Input: a researcher home page

• CRF is the learning model

• Features used

– Global features

– Lexicon features

– Context features

– Dictionary features

– Stylistic features

• Global features: apply to the current token

– Morphological features

– Initials

– Number

– Punctuation

• Lexicon features: apply to the current token

– Positive words for certain annotation fields: Position, Affiliation, Address, Phone, Fax, Email
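As a rough illustration of token-level features of this kind, the sketch below computes a few morphological, initial, number, punctuation and positive-word indicators for one token. The feature names, the exact tests and the `POSITIVE_WORDS` lists are hypothetical; the slides do not give RICO's actual definitions or lexicons.

```python
import string

# Hypothetical positive-word lists; the slides do not give the real lexicons.
POSITIVE_WORDS = {
    "position": {"professor", "lecturer", "fellow"},
    "affiliation": {"university", "department", "institute"},
}


def token_features(token):
    """Build a feature dict for a single token."""
    features = {
        # Morphological features
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        # Initials such as "M."
        "is_initial": len(token) == 2 and token[0].isupper() and token[1] == ".",
        "is_number": token.isdigit(),
        "is_punctuation": bool(token) and all(ch in string.punctuation for ch in token),
    }
    # Lexicon features: one indicator per annotation field.
    lowered = token.lower()
    for field, words in POSITIVE_WORDS.items():
        features["positive_" + field] = lowered in words
    return features


print(token_features("Professor"))
```

A dictionary of this shape would then be serialized into the column format the CRF toolkit expects, one token per line.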

Field Identification - Methodology

• Context features: apply to the whole line

– Name context

– Address context

– Phone context: 'phone', 'tel', 'mobile'

– Fax context: 'fax', 'facsimile'

– Email context: 'email', 'e-mail'

– Bachelor (BS) context: appearance of 'B.S' or 'BS' or 'Bachelor'

– Master (MS) context: appearance of 'M.S' or 'MS' or 'Master'

– Ph.D (PhD) context: appearance of 'Ph.D' or 'Doctorate' or 'Doctor(ate) of Philosophy'

– Research-interest context: multiple-line property

– Publication context: multiple-line property

– Degree: helps to differentiate BS/MS/PhD information correctly when it is presented in prose style or on the same line.
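A minimal sketch of how line-level context features like these could be computed. The keyword lists are copied from the slide; the whole-word matching rule, the dictionary layout and the name `line_context` are assumptions, and the Name, Address, Research-interest and Publication contexts are left out because the slide does not say how they are detected.

```python
import re

# Keyword lists copied from the slide; the matching rules are an assumption.
CONTEXT_KEYWORDS = {
    "phone": ["phone", "tel", "mobile"],
    "fax": ["fax", "facsimile"],
    "email": ["email", "e-mail"],
    "bs": ["b.s", "bs", "bachelor"],
    "ms": ["m.s", "ms", "master"],
    "phd": ["ph.d", "doctorate", "doctor of philosophy"],
}


def _compile(words):
    # Match whole keywords only, so "ms" does not fire inside "systems".
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"(?<![a-z])(?:" + alternatives + r")(?![a-z])", re.IGNORECASE)


CONTEXT_PATTERNS = {name: _compile(words) for name, words in CONTEXT_KEYWORDS.items()}


def line_context(line):
    """Return the set of context labels whose keywords appear in the line."""
    return {name for name, pattern in CONTEXT_PATTERNS.items() if pattern.search(line)}


# Made-up sample lines, for illustration only.
print(line_context("Tel: 555-0100  Fax: 555-0101"))
print(line_context("Ph.D in Computer Science, 1998; M.S, 1994"))
```

Every token on a line would share the labels returned for that line. When several degree contexts fire on the same line, as in the second sample, a separate per-token degree feature is needed to tell them apart, which is the role of the Degree feature listed above.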

Field Identification - Methodology

• Dictionaries

– ParsCit dictionary: detects male names, female names, popular last names, month names, place names and publisher names; each is a single feature

– Major dictionary: helps to identify researchers' majors in their educational history; may also help with Research Interests

– Research dictionary: entries classified into high/mid/low confidence

– Universities dictionary: names of most universities, according to the Open Directory

• Stylistic features

– List feature

– Table features

– Section feature: based on HTML tags like <div>, <p>, <title>, header tags, list elements, table
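One way to derive stylistic features of this kind is to track which HTML tags enclose each piece of text. The sketch below uses Python's standard `html.parser`; the tag sets, the feature names and the `StyleTagger` class are assumptions for illustration, not RICO's implementation.

```python
from html.parser import HTMLParser

# Tags treated as stylistic signals; this exact grouping is an assumption.
LIST_TAGS = {"ul", "ol", "li"}
TABLE_TAGS = {"table", "tr", "td", "th"}
SECTION_TAGS = {"div", "p", "title", "h1", "h2", "h3", "h4", "h5", "h6"}


class StyleTagger(HTMLParser):
    """Collect (text, features) pairs; features reflect the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        enclosing = set(self.open_tags)
        section = next((t for t in reversed(self.open_tags) if t in SECTION_TAGS), None)
        self.chunks.append((text, {
            "in_list": bool(enclosing & LIST_TAGS),
            "in_table": bool(enclosing & TABLE_TAGS),
            "section": section,
        }))


def stylistic_features(html):
    tagger = StyleTagger()
    tagger.feed(html)
    tagger.close()
    return tagger.chunks


# Made-up page fragment, for illustration only.
page = ("<html><head><title>Jane Doe</title></head><body><h2>Publications</h2>"
        "<ul><li>Paper A</li></ul><table><tr><td>Phone</td></tr></table></body></html>")
for text, feats in stylistic_features(page):
    print(text, feats)
```

Each text chunk's features would be copied onto its tokens, so the tagger can learn, for example, that list items under a "Publications" header tend to be publication entries.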

Field Identification - Performance

Data set of 40 home pages, cross validation

Overall: Precision 70.66% – Recall 62.73% – F1 64.87%

Classes            Precision  Recall   F1
name               75.66%     51.34%   61.17%
phone              53.38%     89.25%   66.80%
fax                47.73%     72.41%   57.53%
email              79.31%     70.77%   74.80%
address            78.90%     74.57%   76.67%
affiliation        30.27%     59.47%   40.12%
position           79.46%     64.49%   71.20%
research-interest  48.48%     36.04%   41.34%
publications       71.05%     43.27%   53.79%
bs-major           88.89%     78.05%   83.12%
bs-uni             68.67%     57.00%   62.30%
bs-year            90.00%     72.00%   80.00%
ms-major           71.43%     32.26%   44.44%
ms-uni             52.94%     52.94%   52.94%
ms-year            77.78%     56.00%   65.12%
phd-major          83.33%     73.17%   77.92%
phd-uni            74.56%     72.03%   73.28%
phd-year           100.00%    74.07%   85.11%
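The short check below recomputes the figures on this slide. Per-class F1 is taken as the harmonic mean of precision and recall, and the overall row is compared with the unweighted (macro) average over the 18 classes. The slides do not say how the overall figures were obtained, so the macro-average reading is an inference from the numbers rather than a stated fact; the names `RESULTS`, `f1` and `macro_average` are illustrative.

```python
# (precision %, recall %, F1 %) per class, copied from the slide.
RESULTS = {
    "name": (75.66, 51.34, 61.17),
    "phone": (53.38, 89.25, 66.80),
    "fax": (47.73, 72.41, 57.53),
    "email": (79.31, 70.77, 74.80),
    "address": (78.90, 74.57, 76.67),
    "affiliation": (30.27, 59.47, 40.12),
    "position": (79.46, 64.49, 71.20),
    "research-interest": (48.48, 36.04, 41.34),
    "publications": (71.05, 43.27, 53.79),
    "bs-major": (88.89, 78.05, 83.12),
    "bs-uni": (68.67, 57.00, 62.30),
    "bs-year": (90.00, 72.00, 80.00),
    "ms-major": (71.43, 32.26, 44.44),
    "ms-uni": (52.94, 52.94, 52.94),
    "ms-year": (77.78, 56.00, 65.12),
    "phd-major": (83.33, 73.17, 77.92),
    "phd-uni": (74.56, 72.03, 73.28),
    "phd-year": (100.00, 74.07, 85.11),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(column):
    """Unweighted mean of one column (0 = P, 1 = R, 2 = F1) over all classes."""
    return sum(row[column] for row in RESULTS.values()) / len(RESULTS)


for label, (p, r, reported) in RESULTS.items():
    # P and R are themselves rounded, so allow a small tolerance.
    assert abs(f1(p, r) - reported) < 0.02, label

print([round(macro_average(c), 2) for c in range(3)])  # [70.66, 62.73, 64.87]
```

The macro averages reproduce the overall row exactly, so rare classes such as ms-major weigh as much as frequent ones in the headline figures. Note that the overall F1 (64.87) is the mean of the per-class F1 values, not the harmonic mean of the overall precision and recall.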

Field Identification - Discussion

• Data fields to be annotated are similar to those of ArnetMiner.

– Extra: Name, Research Areas, Publications

– Missing: Image

• Use of stylistic features is minimal

Field Identification - Discussion

• F1 value is significantly lower than ArnetMiner’s (64.87% vs. 83.37%)

– ArnetMiner takes the researcher name as input and uses features referring to the researcher name to identify other fields. RICO has no prior knowledge about the page to be parsed.

+ Heuristic to improve confidence of ‘Name’

+ Make use of the affiliation name input

– Identifying ‘Research Interest’ and ‘Publications’ is challenging.

+ Improve ‘Publications’

Home page Identification - Purpose

• Add-on component

• To complete the automation of the system

Home page Identification – Related works

• Ahoy!

– Input: researcher name and (optional) institution name

– “Home page”: allocated page, classified by URL patterns

• RICO

– Input: institution name

– “Home page”: allocated page with biographical information, classified by content

Home page Identification – Methodology

• Collect a list of university domains

• Use Yahoo! BOSS to search for professors in the institutions

• For each valid web page, fetch the page and scan for words indicating ‘phone’, ‘mail’ and ‘professor’.

• Classify by the number of keyword occurrences.

• Home pages are passed to the Field Identification component.
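The keyword-counting step could look roughly like the sketch below. The three keyword groups come from the slide; the regular expressions, the `min_distinct` threshold and the function names are assumptions, since the slides do not give the actual decision rule. Fetching pages and querying the search API are left out.

```python
import re

# Indicator words named on the slide; variants and threshold are assumptions.
KEYWORDS = {
    "phone": re.compile(r"\b(?:phone|tel)\b", re.IGNORECASE),
    "mail": re.compile(r"\be-?mail\b", re.IGNORECASE),
    "professor": re.compile(r"\bprof(?:essor)?\b", re.IGNORECASE),
}


def keyword_counts(page_text):
    """Count occurrences of each keyword group in the page text."""
    return {name: len(pattern.findall(page_text)) for name, pattern in KEYWORDS.items()}


def looks_like_home_page(page_text, min_distinct=2):
    """Accept a page when at least min_distinct keyword groups occur."""
    counts = keyword_counts(page_text)
    return sum(1 for n in counts.values() if n > 0) >= min_distinct


# Made-up page texts, for illustration only.
print(looks_like_home_page("Jane Doe, Associate Professor. Phone: 555-0100. Email: jane@example.edu"))
print(looks_like_home_page("Campus parking map and opening hours"))
```

Requiring several distinct keyword groups rather than a raw total is one possible design choice; it avoids accepting a page that merely repeats a single word such as "professor" many times.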

Home page Identification – Discussion

• The query used cannot retrieve all relevant pages. It is tuned for the majority case: professors in institutions.

– Target researchers in research organizations.

• Drawback: the result set from Yahoo! BOSS may contain duplicate pages, or sub-pages of a researcher’s home page, which are treated as 2 different records.

– Merging such records needs high confidence in overall system performance, and researcher names are not unique.

– Ideally, duplicates would be eliminated by analyzing URLs, but domain hierarchies differ within a department, between departments, and between institutions.

Post-processing - Purpose

• Input: CRF++ output file from Field Identification.

• Group neighboring tokens that carry the same annotation tag

• Deduplication

• Store into database (current size ~170,000 researchers)
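A minimal sketch of the grouping and deduplication steps. It assumes the usual CRF++ test output layout: one token per line, whitespace-separated columns with the token first and the predicted tag last, and blank lines between sequences. The function names, the sample tags and the exact deduplication rule are illustrative, not taken from the slides.

```python
from itertools import groupby


def group_tagged_tokens(crf_output):
    """Merge runs of adjacent tokens that share a predicted tag."""
    rows = []
    for line in crf_output.splitlines():
        columns = line.split()
        # A blank line becomes a (None, None) separator so runs never span it.
        rows.append((columns[0], columns[-1]) if columns else (None, None))

    spans = []
    for tag, run in groupby(rows, key=lambda row: row[1]):
        if tag is not None:
            spans.append((tag, " ".join(token for token, _ in run)))
    return spans


def deduplicate(spans):
    """Drop repeated (tag, text) pairs, keeping first-seen order."""
    return list(dict.fromkeys(spans))


# Made-up tagger output: token, one feature column, predicted tag.
sample = "\n".join([
    "Jane\tCAP\tname",
    "Doe\tCAP\tname",
    "Professor\tCAP\tposition",
    "",
    "Jane\tCAP\tname",
    "Doe\tCAP\tname",
])
print(deduplicate(group_tagged_tokens(sample)))
# [('name', 'Jane Doe'), ('position', 'Professor')]
```

Exact-match deduplication like this only removes identical field values within one page; merging records for the same researcher across pages is the harder problem raised in the Home page Identification discussion.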

Contributions

• Produced an automated system for fetching researchers’ information from the World Wide Web.

• Introduced a number of features for machine-learning-based Field Identification.

Future improvements

• Field Identification

– Introduce more features, especially stylistic features

– Strengthen features targeting the Name, Research Interest and Publications tags

– Cater for the <img> tag

– Handle pages that use HTML frames

– Follow links on the page if necessary

• Home page Identification

– Improve heuristics

• Post-processing

– Refine output from Field Identification

• A new front-end component that lets users query the database

THANK YOU!

Questions?
