Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow?...
Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow?
Hassan Alam, Fuad Rahman and Yuliya Tarnikova
Human Computer Interaction Group
BCL Technologies Inc., Santa Clara, CA 95050
www.bcltechnologies.com
[email protected]
Overview of the talk
Web document re-authoring
HTML data structure and segmentation
Merging and the “mess”
Semantic Relatedness of Textual Segments
Spoken Language User Interface Toolkit (SLUITK)
How do we do it?
Some applications
Conclusion and future work
Merging R Us
While merging two segments, the only information available to the merging algorithm is the proximity map and broad content classification.
Totally unrelated content can often pass these tests, which causes the merging algorithm to fail.
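A minimal sketch of such a merge test may help show the failure mode. It assumes each segment carries a vertical extent and a broad content class; the `Segment` type, the `max_gap` threshold and the class labels are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    top: int            # y coordinate of the upper edge, in pixels
    bottom: int         # y coordinate of the lower edge, in pixels
    content_class: str  # broad class, e.g. "text", "image", "form"
    text: str = ""


def vertical_gap(a: Segment, b: Segment) -> int:
    """Vertical distance between two segments (0 if they overlap)."""
    upper, lower = (a, b) if a.top <= b.top else (b, a)
    return max(0, lower.top - upper.bottom)


def should_merge(a: Segment, b: Segment, max_gap: int = 20) -> bool:
    """Merge when the segments are close and share a content class.

    Only geometry and class are consulted, never the words themselves.
    """
    return a.content_class == b.content_class and vertical_gap(a, b) <= max_gap


news = Segment(0, 100, "text", "Stock markets rallied today")
ad = Segment(110, 150, "text", "Cheap flights to Hawaii")
footer = Segment(300, 400, "text", "Contact us")

merge_unrelated = should_merge(news, ad)  # passes the test although the topics differ
merge_distant = should_merge(news, footer)
```

Both `news` and `ad` are text blocks ten pixels apart, so the test accepts the merge even though their content is unrelated. This is the gap that a linguistic check is meant to close.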
eMerging Questions?
How do we determine if two separate web document segments contain related information?
What is the definition of 'relatedness'?
If other segments are geometrically embedded within closely related segments, can we determine if this segment is also related to the surrounding segments?
When a hyperlink is followed and a new page is accessed, how do we know which exact segment within that new page is directly related to the link we just followed?
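To make the last question concrete, here is a small sketch that picks the segment of a newly loaded page whose words overlap most with the text of the followed link. Plain word overlap (Jaccard similarity) is used here as a simple stand-in, not as the method of the talk; the function names and the sample segments are made up for illustration.

```python
import re


def words(text):
    """Lowercased set of alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_segment(link_text, segments):
    """Return (index, score) of the segment most similar to the link text.

    Assumes `segments` is a non-empty list of strings.
    """
    link_words = words(link_text)
    scores = [jaccard(link_words, words(s)) for s in segments]
    best = max(range(len(segments)), key=scores.__getitem__)
    return best, scores[best]


segments = [
    "Sign in or create an account",
    "Lowering LDL cholesterol with diet and exercise",
    "Copyright and privacy policy",
]
index, score = best_segment("lowering cholesterol", segments)
```

Exact word overlap misses related words that are not identical ("cholesterol" and "LDL", for instance), which is one reason to look at semantic relations between words rather than surface matches.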
Natural Language Processing
Syntax
Semantics
Context
Anaphora
Tokenizing
Theme
Lexical Chains
A lexical chain is a sequence of related words in a narrative. It can be composed of adjacent words or sentences or can cover elements from the complete narrative.
Cohesion is a way of connecting different parts of a text into a single theme. A lexical chain is a list of semantically related words, constructed by the use of coreference, ellipsis and conjunction.
This aims to identify the relationship between words that tend to co-occur in the same lexical context.
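A toy sketch of chain building, under stated assumptions: the `RELATED` table below is a hand-made stand-in for a real lexical resource such as a thesaurus, and the greedy strategy (attach each word to the first chain that contains a related word) is one simple choice, not necessarily the one used in the talk.

```python
# Hypothetical relatedness table; a real system would consult a thesaurus.
RELATED = {
    frozenset(pair)
    for pair in [
        ("car", "vehicle"),
        ("car", "engine"),
        ("vehicle", "engine"),
        ("bank", "money"),
        ("money", "loan"),
    ]
}


def related(a, b):
    """Two words are related if identical or listed as a pair in RELATED."""
    return a == b or frozenset((a, b)) in RELATED


def build_chains(tokens):
    """Greedily group tokens into lexical chains, in order of appearance."""
    chains = []
    for tok in tokens:
        for chain in chains:
            if any(related(tok, member) for member in chain):
                chain.append(tok)
                break
        else:
            # No existing chain contains a related word: start a new chain.
            chains.append([tok])
    return chains


tokens = ["car", "bank", "engine", "money", "vehicle", "loan"]
chains = build_chains(tokens)
# chains == [["car", "engine", "vehicle"], ["bank", "money", "loan"]]
```

The words of a chain need not be adjacent: here the two chains interleave across the token sequence, which matches the idea that a chain can cover elements from the complete narrative.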
Lexical Chains
Coreference: The grammatical relation between two words that have a common referent
– Example: You said you would come. In the given sentence, both occurrences of 'you' have the same referent.
Ellipsis: Omission or suppression of parts of words or sentences
– Example: 'the virtues I admire', for 'the virtues which I admire'
Conjecture: Reasoning that involves the formation of conclusions from incomplete evidence
– Example: Scientists supposed that large dinosaurs lived in swamps
What is SLUI TK?
SLUI is a set of tools that allows programmers to rapidly develop applications with Natural Language Processing functionality.
[Figure: SLUI Toolkit overview. Labels: Programmer, End User, Input Sentences, Dialogs, SLUI Toolkit, C++, Java, Action Code & VSP, Program]
SLUI TK: Steps for the Programmer to Follow while Setting up the Toolkit
Stages: Input Setup Information, Expand SFT, Debug SFT, Deploy Program
Steps For Programmer
1. Insert the domain-specific lexicon
2. Enter sample sentences
3. Enter Variable Sentence Parameter values
4. Enter an action code for each sample sentence
5. Expand SFT
6. Debug the results in the Semantic Frame Table (SFT)
7. Direct user input to the SLUI
[Figure: flow diagram. Boxes: SLUI Toolkit Analyzes Sentences; End User Runs the SLUI Enabled Program; SLUI Analyzes User Input and handles errors; SLUI Maps User Input to Actions and Returns Action Code; Program Executes Tasks]
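The idea of pairing sample sentences with action codes and then mapping user input to an action code can be sketched roughly as follows. This is not the SLUI TK API: regular-expression templates stand in for the toolkit's sentence analysis, and the sample sentences, the `{name}` parameter slot and the action codes are all hypothetical.

```python
import re

# Hypothetical sample sentences, each with a variable parameter slot
# and the action code the program should receive.
SAMPLES = [
    ("open the file {name}", "ACT_OPEN"),
    ("delete the file {name}", "ACT_DELETE"),
]


def compile_sample(template):
    """Turn a sample sentence into a regex; {param} becomes a named group."""
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        # Even positions are literal text, odd positions are parameter names.
        pattern += re.escape(part) if i % 2 == 0 else f"(?P<{part}>\\S+)"
    return re.compile(pattern + r"$", re.IGNORECASE)


TABLE = [(compile_sample(template), code) for template, code in SAMPLES]


def map_to_action(user_input):
    """Return (action_code, parameters), or (None, {}) when nothing matches."""
    text = user_input.strip()
    for pattern, code in TABLE:
        match = pattern.match(text)
        if match:
            return code, match.groupdict()
    return None, {}


result = map_to_action("Open the file report.txt")
unknown = map_to_action("hello there")
```

The calling program only sees the action code and the extracted parameter values, which is the division of labour the steps describe: the programmer supplies examples and codes, and the language layer does the matching.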
[Figure: SLUI TK architecture. Components: Speech Recognition Inputter, Sentence Tokenizer, Query Classifier, AutoSpellCorrect, Syntax Recognizer, Parser, Anaphora Resolution, Translator, Frame Generator, Frame Handler, Action Handler, Generated Semantic Frame Table (SFT), Dialog Manager, Class Ontology, Class Grammar, Spell Checker, Ontology. Legend: Not implemented]
SLUI TK
An innovative way to assist programmers with no linguistic knowledge in developing programs that can understand, process, and act upon spoken Natural Language (NL) input
Our Frame
Input sentence: Can you suggest some internet sites or books that give details on lowering the LDL by 50 points without including information on cancer risks?
Sentence Type | Predicate | Object (Arg 2) | Subject (Arg 1) | Action | Object (Arg 3) | Mod 1 (Head) | Mod 1 (Comp) | Mod 2 (Head) | Mod 3 (Comp)
YN_Question | suggest | internet site or book | bcl-computers | ? | ---- | internet site or book | that give details .... risk | ---- | ----
---- | include | information | ---- | ? | ---- | information | on cancer risks | ---- | ----
---- | give | detail | ---- | ? | ---- | detail | on lower LDL by 50 points | ---- | ----
---- | lower | LDL | ---- | ? | ---- | lower | by 50 points | ---- | ----
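One way to hold such frames in a program is a small record type with one field per slot. The sketch below assumes the slide's columns read as Sentence Type, Predicate, Object (Arg 2), Subject (Arg 1), Mod 1 (Head) and Mod 1 (Comp); the class and field names are illustrative, the remaining columns are left out, and empty slots ("----") are stored as `None`.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SemanticFrame:
    predicate: str
    sentence_type: Optional[str] = None
    subject: Optional[str] = None    # Arg 1
    obj: Optional[str] = None        # Arg 2
    mod1_head: Optional[str] = None
    mod1_comp: Optional[str] = None


def filled_slots(frame):
    """Return only the slots that carry a value."""
    return {name: value for name, value in asdict(frame).items() if value is not None}


# The four frames for the example sentence, as read from the slide.
frames = [
    SemanticFrame(
        predicate="suggest",
        sentence_type="YN_Question",
        subject="bcl-computers",
        obj="internet site or book",
        mod1_head="internet site or book",
        mod1_comp="that give details .... risk",
    ),
    SemanticFrame(predicate="include", obj="information",
                  mod1_head="information", mod1_comp="on cancer risks"),
    SemanticFrame(predicate="give", obj="detail",
                  mod1_head="detail", mod1_comp="on lower LDL by 50 points"),
    SemanticFrame(predicate="lower", obj="LDL",
                  mod1_head="lower", mod1_comp="by 50 points"),
]

predicates = [frame.predicate for frame in frames]
last_frame_slots = filled_slots(frames[3])
```

A single question thus decomposes into one main frame plus one frame per embedded clause, each with its own predicate.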
BCL Database
Sentences collected from email messages received between June 2000 and May 2001
Deleted attachments, HTML and other tags, header files, and senders’ information
Also deleted were salutations and greetings
Total of 34,640 lines and 170,000 words
We constantly update our corpus with new emails from our customers
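A rough sketch of this kind of cleanup, assuming one message per string. The header names, the greeting word list and the four-word limit for a greeting line are guesses chosen for illustration, not the rules actually used to build the corpus.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
HEADER_RE = re.compile(r"^(from|to|cc|subject|date):", re.IGNORECASE)
GREETING_WORDS = ("hi", "hello", "dear", "regards", "sincerely", "thanks")


def is_greeting(line):
    """Treat a short line that starts with a greeting word as a salutation."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return bool(tokens) and tokens[0] in GREETING_WORDS and len(tokens) <= 4


def clean_email(raw):
    """Strip tags, header lines, salutations and empty lines; return kept lines."""
    kept = []
    for line in raw.splitlines():
        line = TAG_RE.sub("", line).strip()
        if not line or HEADER_RE.match(line) or is_greeting(line):
            continue
        kept.append(line)
    return kept


raw = """From: customer@example.com
Subject: PDF conversion
<html><body>
Dear support team,
<p>Can you convert <b>scanned</b> PDF files to Word?</p>
Regards,
</body></html>"""

sentences = clean_email(raw)
word_count = sum(len(line.split()) for line in sentences)
```

Counting the kept lines and their words is how corpus totals of the kind quoted above would be obtained.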
Future Work
Only a single main theme can currently be handled per document. In the future we will work on a more generic solution that can handle documents with multiple themes.
Integration of this NLP method in building commercial summarizers and in aiding existing web page summarization techniques based on structural analysis alone is already well underway.
Determining the flow of web information between different web pages as the browser loads up new pages following hyperlinks.
Aiding geometric web parsers in determining the correct logical layout by complementing geometric information with linguistic coherence.
Conclusions
A novel approach of determining semantic relationship among segments of web documents using lexical chain computation.
Two related papers in ICDAR 2003
– One will explore the application of lexical chains in building a commercial summarizer capable of summarizing any document
– The other will concentrate on a hybrid approach to web page summarization, combining structural and NLP techniques.