Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow?


Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow?

Hassan Alam, Fuad Rahman and Yuliya Tarnikova

Human Computer Interaction Group
BCL Technologies Inc., Santa Clara, CA 95050
www.bcltechnologies.com
[email protected]

Overview of the talk

Web document re-authoring
HTML data structure and segmentation
Merging and the "mess"
Semantic Relatedness of Textual Segments
Spoken Language User Interface Toolkit (SLUI TK)
How do we do it?
Some applications
Conclusion and future work

Web Page Data Structure

Merging R Us

While merging two segments, the only information available to the merging algorithm is the proximity map and broad content classification.

It is not uncommon for totally unrelated content to meet these tests, causing the merging algorithm to fail.
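To make that failure mode concrete, here is a minimal sketch (not BCL's actual merging code; the segment fields and distance threshold are assumptions) of a merge test that relies only on proximity and broad content class:

```python
# Minimal sketch of a proximity-plus-class merge test (illustrative only;
# the real algorithm, field names and threshold are not from the talk).
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    content_class: str   # broad classification, e.g. "text", "nav", "ad"
    x: float             # top-left corner of the segment's bounding box
    y: float

def should_merge(a: Segment, b: Segment, max_distance: float = 50.0) -> bool:
    """Merge if the segments are close on the page and share a broad class."""
    distance = ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5
    return distance <= max_distance and a.content_class == b.content_class

# Two unrelated paragraphs that happen to sit next to each other pass the test,
# because the test never looks at what the segments actually say:
weather = Segment("Local weather: sunny, 72F.", "text", 10, 300)
stocks  = Segment("Markets closed higher today.", "text", 10, 340)
print(should_merge(weather, stocks))  # True, even though the content is unrelated
```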

eMerging Questions?

How do we determine if two separate web document segments contain related information?

What is the definition of 'relatedness'?

If other segments are geometrically embedded within closely related segments, can we determine if this segment is also related to the surrounding segments?

When a hyperlink is followed and a new page is accessed, how do we know which exact segment within that new page is directly related to the link we just followed?

Natural Language Processing

Syntax
Semantics
Context
Anaphora
Tokenizing
Theme

Our Answer

Lexical Chains

Lexical Chains

A lexical chain is a sequence of related words in a narrative. It can be composed of adjacent words or sentences or can cover elements from the complete narrative.

Cohesion is a way of connecting different parts of a text into a single theme; a lexical chain is a list of semantically related words, constructed through co-reference, ellipsis and conjunction.

Lexical chaining aims to identify the relationship between words that tend to co-occur in the same lexical context.
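As an illustrative sketch only (not the authors' implementation), lexical chains can be approximated by greedily grouping words whose WordNet noun senses lie close together; the similarity threshold and the greedy policy below are assumptions:

```python
# A minimal sketch of lexical chain construction (not the authors' exact
# algorithm): a word joins a chain when WordNet places it close to an
# existing member. The noun-only restriction and the 0.2 threshold are
# illustrative assumptions. Requires nltk plus the 'wordnet' corpus
# (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def related(w1, w2, threshold=0.2):
    """True if any noun-sense pair of the two words is close in WordNet."""
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            sim = s1.path_similarity(s2)
            if sim is not None and sim >= threshold:
                return True
    return False

def build_chains(words):
    """Greedily assign each word to the first chain holding a related word."""
    chains = []
    for word in words:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)
                break
        else:
            chains.append([word])
    return chains

print(build_chains(["doctor", "nurse", "hospital", "car", "engine", "treatment"]))
```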

Lexical Chains

Coreference: the grammatical relation between two words that have a common referent.
– Example: "You said you would come." In this sentence, both occurrences of 'you' have the same referent.

Ellipsis: omission or suppression of parts of words or sentences.
– Example: 'the virtues I admire' for 'the virtues which I admire'.

Conjecture: reasoning that involves the formation of conclusions from incomplete evidence.
– Example: Scientists supposed that large dinosaurs lived in swamps.

[Diagram: SLUI Toolkit overview, connecting input sentences, dialogs, programs, the SLUI Toolkit (C++ / Java), the programmer, the end user, and action code & VSP.]

What is SLUI TK?

SLUI is a set of tools that allows programmers to rapidly develop applications with Natural Language Processing functionality.

[Diagram: setup flow – input setup information → expand SFT → debug SFT → deploy program]

Steps For Programmer:
1. Insert the domain-specific lexicon
2. Enter sample sentences
3. Enter Variable Sentence Parameter values
4. Enter an action code for each sample sentence
5. Expand the SFT
6. Debug the results in the Semantic Frame Table (SFT)
7. Direct user input to the SLUI
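The transcript does not preserve the toolkit's file formats or APIs; the following hypothetical sketch shows what the programmer-supplied setup data from steps 1–5 (lexicon, sample sentences, Variable Sentence Parameter values and action codes, then SFT expansion) might look like:

```python
# Hypothetical setup data for an SLUI-style application (names and structure
# are illustrative, not the actual SLUI TK format).
domain_lexicon = {"LDL", "cancer", "cholesterol", "diet"}

# Each entry pairs a sample sentence (with a Variable Sentence Parameter slot)
# with the action code the program should run when a matching sentence arrives.
sample_entries = [
    {
        "sentence": "Can you suggest a book on <TOPIC>?",
        "vsp_values": {"TOPIC": ["lowering LDL", "cancer risks"]},
        "action_code": "SUGGEST_RESOURCE",
    },
    {
        "sentence": "Give details on <TOPIC>.",
        "vsp_values": {"TOPIC": ["lowering LDL by 50 points"]},
        "action_code": "GIVE_DETAILS",
    },
]

def expand_sft(entries):
    """Expand each sample sentence by substituting every VSP value (step 5)."""
    table = []
    for entry in entries:
        for slot, values in entry["vsp_values"].items():
            for value in values:
                table.append({
                    "sentence": entry["sentence"].replace(f"<{slot}>", value),
                    "action_code": entry["action_code"],
                })
    return table

for row in expand_sft(sample_entries):
    print(row)
```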

[Diagram: runtime flow – the SLUI Toolkit analyzes sentences; the end user runs the SLUI-enabled program; SLUI analyzes user input and handles errors; SLUI maps user input to actions and returns an action code; the program executes tasks.]

SLUI TK: Steps for the Programmer to Follow while Setting up the Toolkit

[Diagram: SLUI TK processing pipeline – Speech Recognition Inputter, Sentence Tokenizer, Query Classifier, Auto Spell Correct, Syntax Recognizer, Parser, Anaphora Resolution, Translator, Frame Generator, Frame Handler, Action Handler, generated Semantic Frame Table (SFT), Dialog Manager Class, Ontology Class, Grammar; a 'not implemented' marker appears next to the Spell Checker and Ontology components.]
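Only the stage names of this pipeline survive in the transcript; as a hedged sketch, the flow from a recognized sentence to an action code can be pictured as a chain of stages (every implementation below is a stub invented for illustration):

```python
# Stub pipeline mirroring the stage names in the SLUI architecture diagram.
# Every stage here is a placeholder; the real components are not public.
def tokenize(sentence):        return sentence.rstrip("?.!").split()
def classify_query(tokens):    return "YN_Question" if tokens and tokens[0].lower() == "can" else "Statement"
def spell_correct(tokens):     return tokens                      # AutoSpellCorrect stub
def parse(tokens):             return {"predicate": tokens[2] if len(tokens) > 2 else None, "tokens": tokens}
def resolve_anaphora(parse_d): return parse_d                     # AnaphoraResolution stub
def generate_frame(parse_d, qtype):
    return {"sentence_type": qtype, "predicate": parse_d["predicate"]}
def map_to_action(frame):
    return "SUGGEST_RESOURCE" if frame["predicate"] == "suggest" else "UNKNOWN_ACTION"

def slui_pipeline(sentence):
    tokens = spell_correct(tokenize(sentence))
    qtype = classify_query(tokens)
    frame = generate_frame(resolve_anaphora(parse(tokens)), qtype)
    return map_to_action(frame)                                   # ActionHandler returns an action code

print(slui_pipeline("Can you suggest some internet sites or books?"))  # SUGGEST_RESOURCE
```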



SLUI TK

An innovative way to assist programmers with no linguistic knowledge in developing programs that can understand, process, and act upon spoken Natural Language (NL) input.

[Semantic Frame Table example: the query from the 'Our Frame' slide below is decomposed into frames. Columns: Sentence Type, Predicate, Object (Arg 2), Subject (Arg 1), Action, Object (Arg 3), Mod 1 (Head), Mod 1 (Comp), Mod 2 (Head), Mod 3 (Comp). The frames recoverable from the transcript are: YN_Question / suggest / internet site or book; include / information / information on cancer risks; give / detail / detail on lowering LDL by 50 points; lower / LDL / lower by 50 points.]

Our Frame

Can you suggest some internet sites or books that give details on lowering the LDL by 50 points without including information on cancer risks?
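To make the frame decomposition concrete, here is a hypothetical rendering of the query as frame objects whose fields mirror the SFT columns; the cell values are a best-effort reading of the garbled table above, not an exact reproduction:

```python
# Hypothetical frame objects for the example query, loosely following the SFT
# columns (illustrative field names; not the SLUI TK's internal representation).
frames = [
    {"sentence_type": "YN_Question", "predicate": "suggest",
     "object_arg2": "internet site or book"},
    {"predicate": "give", "object_arg2": "detail",
     "mod": "on lowering LDL by 50 points"},
    {"predicate": "lower", "object_arg2": "LDL", "mod": "by 50 points"},
    # the query asks to *exclude* this information ("without including ...")
    {"predicate": "include", "object_arg2": "information",
     "mod": "on cancer risks"},
]

for frame in frames:
    print(frame)
```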

BCL Database

Sentences collected from email messages received between June 2000 and May 2001.
Deleted attachments, HTML and other tags, header files, and senders' information.
Also deleted were salutations and greetings.
Total of 34,640 lines and 170,000 words.
We constantly update our corpus with new emails from our customers.
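The talk lists the clean-up steps but not the tooling; a minimal sketch of that kind of preprocessing (the regexes, salutation list and sample message are assumptions, not BCL's actual scripts) could look like this:

```python
# Minimal sketch of email corpus clean-up: strip HTML tags, header lines,
# and salutations/greetings, keeping only the body sentences.
import re

SALUTATIONS = re.compile(r"^(hi|hello|dear|regards|best|thanks)\b", re.IGNORECASE)
HEADER_LINE = re.compile(r"^(from|to|cc|subject|date):", re.IGNORECASE)
HTML_TAG = re.compile(r"<[^>]+>")

def clean_email(raw: str) -> list[str]:
    lines = []
    for line in raw.splitlines():
        line = HTML_TAG.sub("", line).strip()
        if not line or HEADER_LINE.match(line) or SALUTATIONS.match(line):
            continue
        lines.append(line)
    return lines

sample = """From: customer@example.com
Subject: question
Hi there,
Can you suggest a book on <b>lowering LDL</b>?
Thanks"""
print(clean_email(sample))
```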

Our Lexical Chains

Relatedness Factor

An Application: Web Page Re-authoring

Segment Scores
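The figures behind the 'Relatedness Factor' and 'Segment Scores' slides did not survive as text; as a stand-in illustration only (not the formula from the talk), a relatedness score between two segments could be derived from WordNet similarity between their content words:

```python
# Stand-in segment relatedness score (NOT the relatedness factor from the
# talk): average, over words in one segment, of the best WordNet noun-sense
# path similarity to any word in the other segment. Requires nltk with the
# 'wordnet' corpus downloaded.
from nltk.corpus import wordnet as wn

def word_sim(w1, w2):
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            best = max(best, s1.path_similarity(s2) or 0.0)
    return best

def segment_relatedness(words_a, words_b):
    if not words_a or not words_b:
        return 0.0
    scores = [max(word_sim(w, v) for v in words_b) for w in words_a]
    return sum(scores) / len(scores)

# Compare a medically themed pair of segments with a mixed pair:
print(segment_relatedness(["doctor", "nurse"], ["hospital", "surgery"]))
print(segment_relatedness(["doctor", "nurse"], ["car", "engine"]))
```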

Example Output

Future Work

Currently, only a single main theme can be handled per document. In future work we will address a more generic solution that can handle documents with multiple themes.

Integration of this NLP method into building commercial summarizers, and into aiding existing web page summarization techniques based on structural analysis alone, is already well underway.

Determining the flow of web information between different web pages as the browser loads new pages by following hyperlinks.

Aiding geometric web parsers in determining the correct logical layout by complementing geometric information with linguistic coherence.

Conclusions

A novel approach to determining semantic relationships among segments of web documents using lexical chain computation.

Two related papers in ICDAR 2003:
– One will explore the application of lexical chains in building a commercial summarizer capable of summarizing any document.
– The other will concentrate on a hybrid approach to web page summarization, combining structural and NLP techniques.