Semantics hidden within co-occurrence patterns


description

Slides of the IEEE CS Society talk delivered at Yahoo India, Nov 20 2009.

Transcript of Semantics hidden within co-occurrence patterns

Page 1: Semantics hidden within co-occurrence patterns

Co-occurrence and Meaning

Co-occurrence graphs

Interpretation of Co-citations

Topical Anchors

References

Semantics hidden within co-occurrence patterns

A bottom-up approach to the Semantic Web?

Srinath Srinivasa

IIIT Bangalore

IEEE Computer Society talk. Nov 20 2009. © Srinath Srinivasa, IIIT-Bangalore

Page 2: Semantics hidden within co-occurrence patterns


Outline

1 Co-occurrence and Meaning

2 Co-occurrence graphs

3 Interpretation of Co-citations

4 Topical Anchors

Page 4: Semantics hidden within co-occurrence patterns

Conventional WebIR and co-occurrence

Lexical feature extraction: Bag-of-words model

Document vectorization

Implicit assumption of independence of dimensions

Vector space reduction and spectral analyses for identifying hidden semantics (e.g., LSA, SVD, clustering)

In human languages, lexical terms are not independent of one another; moreover, important semantic structures are inherent in the way terms co-occur.
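The independence assumption can be made concrete with a minimal bag-of-words sketch. The tokenizer (lowercase, whitespace split) and the sample sentences are assumptions chosen for illustration, not part of the talk. Because every term is its own dimension, two related words such as "car" and "automobile" contribute nothing to each other's similarity.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Illustrative documents (hypothetical, not from the talk).
d1 = bag_of_words("the car runs on diesel")
d2 = bag_of_words("the automobile runs on diesel")

# Only the four shared surface terms count towards the similarity;
# "car" and "automobile" are treated as orthogonal dimensions.
similarity = cosine(d1, d2)
unrelated = cosine(bag_of_words("car"), bag_of_words("automobile"))
```

Here `similarity` comes only from the shared words, and `unrelated` is zero even though the two terms are near-synonyms.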

Page 5: Semantics hidden within co-occurrence patterns

Motivational Problems

Some motivational problems to show limitations of purely lexical approaches to IR:

The topical anchor problem

“If ever a player has overshadowed Sachin Tendulkar for sheer class of batsmanship, it is V V S Laxman. After a record 353-run fourth-wicket partnership in the 2004 Sydney Test when Laxman hit 30 fours in his 178 to Tendulkar’s 33 in his unbeaten 241, the master put the artistry of V V S in perspective.”

What is the best topic of this paragraph: Sachin Tendulkar, V V S Laxman, Sydney, Australia, Cricket, Test Match?

Page 6: Semantics hidden within co-occurrence patterns

Motivational Problems

The semantic attributes problem

Given that a user has searched for the term “Malmo”, which of the following keywords can be termed “attributes” that enhance the meaning represented by Malmo:

Driving

History

Mileage

Weather

Symptoms

Elephant

LaTeX beamer

Infringement

Page 7: Semantics hidden within co-occurrence patterns

Motivational Problems

The topical marker problem

The US Federal Aviation Regulations Sec 380.12 states that:

The charter operator may not cancel a charter for any reason (including insufficient participation), except for circumstances that make it physically impossible to perform the charter trip, less than 10 days before the scheduled date of departure of the outbound trip.

If the charter operator cancels 10 or more days before the scheduled date of departure, the operator must so notify each participant in writing within 7 days after the cancellation but in any event not less than 10 days before the scheduled departure date of the outbound trip. If a charter is canceled less than 10 days before scheduled departure (i.e., for circumstances that make it physically impossible to perform the charter trip), the operator must get the message to each participant as soon as possible.

If a user who has booked a ticket with a charter operator finds out that her flight has been cancelled suddenly without notice and wants to confront the operator, what should she search for: charter operator, FAR, cancellation, scheduled trip, Sec 380, operator, notification, ...

Page 9: Semantics hidden within co-occurrence patterns

Motivational Problems

The theme problem:

Article 1

A US Airways Airbus A320 suffered a bird hit immediately after take-off from New York’s La Guardia airport and was forced to land on the Hudson river. All 150 passengers and 5 crew members are reported to be safe with only minor injuries.

Article 2

La Guardia International Airport is conveniently located to serve the citizens of both New York and New Jersey. Airlines that operate from here include: US Airways, Delta, Continental and Virgin. Its MRO routinely serves a number of aircraft types including Boeing 73x series and the Airbus A320 and A330 series. La Guardia is easily reachable from New Jersey through the Lincoln tunnel that runs under the Hudson river.

Article 3

Pilot Steve Bolle of a light aircraft in northern Australia was forced to make an emergency landing in water after suffering engine trouble on take-off. He landed the Piper Chieftain plane in shallow waters after realising he would not make it back to the airport. Mr Bolle and his five passengers were able to wade safely to shore.

Which of the articles above are similar to one another? (Sources: Wikipedia and BBC)

Page 13: Semantics hidden within co-occurrence patterns

Co-occurrence and Meaning

Hebbian learning

Co-occurrence plays a central role in the Hebbian theory of the semantic organization of the human mind, which states that synaptic plasticity between neurons is determined by repeated and persistent stimulation of the pre- and post-synaptic cells [2].

This is also summarized as: “Cells that fire together, wire together.”

Co-occurrence and the language instinct

Language structures such as pluralization are often learnt by analyzing co-occurrence patterns. An interesting example is the “wug” test (cf. [5]):

That is a pig; these are pigs. That is a dog; these are dogs. That is a cat; these are cats. That is a wug; these are ___.

The use of co-occurrence is even more apparent in this example, which leads to confusion (even if only for a moment):

The plural of radius is radii; the plural of thesis is theses; the plural of bus is buses. The plural of lotus is lotii? lotes? lotuses?

Page 15: Semantics hidden within co-occurrence patterns

Co-occurrence and meaning

Meaning is usage

The analytic philosophy worldview that “meaning is usage” [1] can be explained by representing usage as co-occurrence analysis.

Consider the following paragraphs:

Every day, I go to work in my pqer. My pqer runs on diesel and gives one of the best mileage for pqers in its category. My pqer can seat five people and is a good candidate for pqer-pooling.

On December 26 2004, a massive earthquake measuring 9.1 jolted Java. This earthquake triggered a huge tsunami that has been the deadliest in history. We have developed an applet to simulate the path taken by the tsunami. You can run this applet in any browser that has Java enabled.

In the first paragraph, the meaning of the word “pqer” and in the second paragraph, the word-sense of the term “Java” are both resolved by looking at other terms that co-occur with them.
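As a rough illustration of how co-occurring terms can resolve the sense of “Java”, the sketch below scores each candidate sense by how many of its cue words appear in the same sentence. The cue sets in `SENSE_CUES` and the function name `guess_sense` are hypothetical; a real system would learn such associations from corpus co-occurrence rather than hard-code them.

```python
# Hypothetical cue terms per sense, hand-picked for illustration only.
SENSE_CUES = {
    "island": {"earthquake", "tsunami", "indonesia", "volcano"},
    "language": {"applet", "browser", "enabled", "compiler"},
}

def guess_sense(sentence, cues=SENSE_CUES):
    """Pick the sense whose cue terms co-occur most often in the sentence."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    scores = {sense: len(words & terms) for sense, terms in cues.items()}
    # On a tie, max() returns the first sense in dictionary order.
    return max(scores, key=scores.get)

first = guess_sense("A massive earthquake measuring 9.1 jolted Java.")
second = guess_sense("You can run this applet in any browser that has Java enabled.")
```

With these cue sets, `first` resolves to the island sense and `second` to the programming-language sense, using only the surrounding terms.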

Page 20: Semantics hidden within co-occurrence patterns

Capturing co-occurrence

We are given a document corpus that is represented as a set of “contexts”: $C = \{C_1, C_2, \ldots, C_n\}$

Depending on the specific problem, a context may take various forms like: sentence, paragraph, document, etc.

Two entities $e_i$ and $e_j$ are said to co-occur (denoted as $e_i \odot e_j$) if there is some context $C_k$ such that $e_i, e_j \in C_k$

The support for a co-occurring pair $e_i \odot e_j$ is the probability of finding this co-occurrence in any given context in the corpus. In other words, the support is the joint probability $P(e_i, e_j)$

Note that co-occurrence is an n-ary relation. But for purposes of simplicity, we focus on pairwise co-occurrences and derive higher order semantics when required.
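The definition of support above can be sketched in a few lines. This is a minimal illustration that assumes each context is already reduced to a set of entity strings and estimates the joint probability as the fraction of contexts containing both entities. The function name `pairwise_support` and the sample contexts are made up for the example.

```python
from collections import Counter
from itertools import combinations

def pairwise_support(contexts):
    """Map each unordered entity pair to the fraction of contexts containing both."""
    counts = Counter()
    for context in contexts:
        # Each unordered pair is counted at most once per context.
        for a, b in combinations(sorted(set(context)), 2):
            counts[frozenset((a, b))] += 1
    n = len(contexts)
    return {pair: c / n for pair, c in counts.items()}

# Illustrative contexts (e.g., one set of entities per sentence).
contexts = [
    {"tendulkar", "laxman", "sydney"},
    {"tendulkar", "laxman"},
    {"sydney", "australia"},
    {"tendulkar", "cricket"},
]
support = pairwise_support(contexts)
```

Pairs that never share a context simply do not appear in the result, which keeps the representation sparse.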

Page 23: Semantics hidden within co-occurrence patterns

Co-occurrence graphs

Co-occurrence graph

A co-occurrence graph is a weighted, undirected graph $G = (E, \odot, w)$, where $E$ is a set of “entities”, $\odot \subseteq E \times E$ is a set of co-occurrences, and $w : \odot \to \mathbb{R}$ indicates support for the co-occurrence

Co-occurrence versus n-partite graphs

Semantic co-occurrence graphs

A semantic co-occurrence graph is a co-occurrence graph that is augmented with a concept hierarchy. A concept hierarchy is defined by one or more partial orders of the form $\sqsubseteq \; \subseteq E \times E$, representing relationships like is-a and is-in, that are reflexive, anti-symmetric and transitive.
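One possible in-memory shape for a semantic co-occurrence graph is sketched below, under stated assumptions: undirected edges are keyed by unordered pairs with their support as the weight, and the concept hierarchy is stored as direct parent links whose reflexive-transitive closure gives the partial order. The names `edges`, `parent_of` and `is_below`, and all sample entities, are hypothetical.

```python
def is_below(child, ancestor, parent_of):
    """True if child ⊑ ancestor in the reflexive-transitive closure of parent_of."""
    seen = set()
    stack = [child]
    while stack:
        node = stack.pop()
        if node == ancestor:
            return True  # covers the reflexive case child == ancestor
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parent_of.get(node, ()))
    return False

# Undirected weighted edges: key is an unordered pair, value is the support w.
edges = {
    frozenset(("tendulkar", "laxman")): 0.5,
    frozenset(("sydney", "test match")): 0.25,
}

# Direct is-a / is-in links (illustrative only).
parent_of = {
    "tendulkar": ["cricketer"],
    "laxman": ["cricketer"],
    "cricketer": ["sportsperson"],
    "sydney": ["australia"],
}
```

Using `frozenset` keys makes the lookup order-independent, which matches the undirected nature of co-occurrence.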

Page 26: Semantics hidden within co-occurrence patterns

Co-occurrence graph

Example:

Concept hierarchy construction

1 Start with a base ontology

2 Use co-occurrence patterns to guess conceptual relationships across terms

3 Use the concept hierarchy to identify deeper co-occurrence patterns

4 Repeat from step 2 in a semi-automated fashion until the algorithm stabilizes

Page 28: Semantics hidden within co-occurrence patterns

Co-occurrence graphs

Characteristics of co-occurrence graphs

Triadic closure (highly clustered)

Disconnected components or a single component of very small diameter

The co-occurrence graph of all noun phrases in Wikipedia has a diameter of 4

Co-occurrence support for entity pairs follows a power law
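Two of the characteristics listed above, clustering and diameter, can be measured directly on an adjacency-set representation. The following is a small sketch on a toy graph, not the Wikipedia graph mentioned in the talk. It assumes an undirected graph stored as a dict of neighbour sets, and `diameter` assumes a single connected component.

```python
from collections import deque
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbours that are themselves adjacent."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(nbrs, 2))
    closed = sum(1 for a, b in pairs if b in adj[a])
    return closed / len(pairs)

def eccentricity(adj, source):
    """Longest shortest-path distance from source, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def diameter(adj):
    """Largest eccentricity over all nodes (assumes one connected component)."""
    return max(eccentricity(adj, v) for v in adj)

# Toy undirected graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

A high average of `local_clustering` over all nodes is what triadic closure looks like numerically; the all-pairs BFS used by `diameter` is only practical for small graphs.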

Page 30: Semantics hidden within co-occurrence patterns

Co-citation

Co-citation and bibliographic coupling are important metrics in several datasets like scientific literature, web pages, wikis, tagging systems like delicious, etc.

Co-citation of a pair of documents corresponds to the co-occurrence of these references (e.g., URLs) in a context

Pair-wise co-citation graphs have the same properties as co-occurrence graphs
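Treating each citing document as a context, co-citation counting is the same computation as co-occurrence counting. The sketch below assumes the link data is available as a dict from citing page to the list of pages it references; the name `cocitation_counts` and the sample pages are illustrative.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(outlinks):
    """Count, for each unordered pair of targets, how many sources cite both."""
    counts = Counter()
    for source, targets in outlinks.items():
        # Deduplicate targets so a page citing A twice still counts once.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts

# Illustrative link data: citing page -> cited pages.
outlinks = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "B"],
    "p3": ["B", "C"],
}
counts = cocitation_counts(outlinks)
```

Pairs are stored as sorted tuples, so `("A", "B")` is the only key used for that pair.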

Page 31: Semantics hidden within co-occurrence patterns

Co-citation Patterns

Hyperlink distance across pairs of highly co-cited pages [8]

[Figure: Hyperlink distance across pairs of highly co-cited Web pages. x-axis: k (1 to 7, kmax, >kmax); y-axis: F]

[Figure: Hyperlink distance across pairs of highly co-cited Wikipedia pages. x-axis: k (1 to 7, kmax, >kmax); y-axis: F]

Page 32: Semantics hidden within co-occurrence patterns

Co-citation Patterns

Hyperlink distance across pairs of highly co-cited pages

Endorsement of a citation

Page A endorses the content of page B

Users reading page A traverse this link and find page B useful too

Users create their own pages citing both A and B

If A has several outgoing links, and only some pairs of outlinks are co-cited, then co-citation can be seen as an endorsement of the citation

Topical aggregation

Document A represents content about a “higher-level” topic in terms of is-a or is-in relationships, and links to (hence co-cites) several pages on “lower-level” topics

Pages on the “lower-level” topics usually cite back the page on the “higher-level” topic, hence giving a citation distance of 2 among themselves

Nepotistic co-citations

Another major source of co-citation (primarily on web pages) is “nepotistic links” in the form of navigational tabs like Home, Departments, Contact Us, etc.
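The topical-aggregation pattern can be checked with a shortest-path computation over the hyperlink graph. The sketch below assumes distance is measured along directed links and uses a toy link structure; `hyperlink_distance` and the page names are hypothetical and are not taken from the cited study.

```python
from collections import deque

def hyperlink_distance(links, src, dst):
    """Length of the shortest directed link path from src to dst, or None."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        page = queue.popleft()
        if page == dst:
            return dist[page]
        for nxt in links.get(page, ()):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return None

# Toy example: a higher-level page links to two lower-level pages,
# and each lower-level page cites back the higher-level page.
links = {
    "cricket": ["tendulkar", "laxman"],
    "tendulkar": ["cricket"],
    "laxman": ["cricket"],
}
```

In this toy graph the two co-cited lower-level pages reach each other only through the higher-level page, which gives the distance of 2 described above.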

Page 35: Semantics hidden within co-occurrence patterns

Co-citation graph of a web crawl

Pairs of pages with at least 100 non-nepotistic co-citations

Page 36: Semantics hidden within co-occurrence patterns

Co-citation graph of a web crawl

The co-citation graph depicts pairs of pages with at least 100 non-nepotistic co-citations

In addition to being made up of disconnected components, the graph also shows various recurring structural motifs like:

Star

Clique

Clique chain

Dumb-bell

Interpretations for the above motifs along with examples are explained in Mutalikdesai and Srinivasa (2009) [4]

Page 37: Semantics hidden within co-occurrence patterns

Endorsed hyperlink graph (EHG)

On the web, a co-citation usually implies a citation. Hence the EHG is essentially a directed version of the co-citation graph. Some EHG components are depicted below:

EHG clique chain
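One way to read this construction, as a sketch under stated assumptions: keep a hyperlink u -> v only when the pair (u, v) is itself co-cited at least `min_cocitations` times. That rule, the function name and the toy data are assumptions for illustration; the precise definition is in [4].

```python
def endorsed_hyperlink_graph(outlinks, counts, min_cocitations=1):
    """Return the set of directed links (u, v) whose endpoints are co-cited.

    outlinks: dict page -> list of pages it links to.
    counts:   dict (a, b) with a < b -> number of pages citing both a and b.
    """
    edges = set()
    for u, targets in outlinks.items():
        for v in targets:
            if u == v:
                continue
            pair = (u, v) if u < v else (v, u)
            if counts.get(pair, 0) >= min_cocitations:
                edges.add((u, v))
    return edges


# Hypothetical data: A and B link to each other and are co-cited by U1 and U2.
outlinks = {"A": ["B", "C"], "B": ["A"], "U1": ["A", "B"], "U2": ["A", "B"]}
counts = {("A", "B"): 2}
ehg = endorsed_hyperlink_graph(outlinks, counts)
```

In this toy case the link A -> C is dropped because A and C are never co-cited, while the mutual links between A and B both survive.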

Page 38: Semantics hidden within co-occurrence patterns

Endorsed citation graph (ECG) for scientific literature
ECG of citation info obtained from CiteSeer

Page 39: Semantics hidden within co-occurrence patterns

Endorsed citation graph

The ECG over scientific literature data (from CiteSeer) shows a similar componentization of the graph, except that the ECG has one giant component

Citations in scientific literature have some subtle differences from hyperlink citations

Scientific literature citations always point into the past

Scientific literature citations very rarely (if at all) form cyclic structures

The ECG consists mostly of weakly connected directed graph components, while the EHG may contain strongly connected components

Page 40: Semantics hidden within co-occurrence patterns

ERank
Importance of a page within an EHG

ERank is an authority score of a page within an EHG (ECG) component

It depicts the reachability of the page within the component

ERank scores in a component were shown to be uncorrelated with the PageRank scores of the pages of that component

Page 41: Semantics hidden within co-occurrence patterns

EndorSeer

A Firefox plugin for augmented browsing of CiteSeer

Currently shows the endorsed citations among the list of citations of any paper

Currently underway: showing the ECG component and ECG neighbourhood of a paper

Page 43: Semantics hidden within co-occurrence patterns

Topical Anchors [6, 7]
Motivation

Example: “Will my oral insulin drugs, along with my hypertension and high blood glucose, have any side effects on the health of my pancreas?”

Can a machine detect diabetes as the context?

Another example: A document containing the terms Andy Roddick, Roger Federer and Rafael Nadal.

How likely is it that the word Tennis will be mentioned (semantically) when discussing these players?

Page 45: Semantics hidden within co-occurrence patterns

Co-occurrence context

Given a set of query terms, the co-occurrence context is defined as the subgraph formed by the query terms and the set of terms that co-occur with at least one of the query terms

Conjecture: The topical anchor of a set of terms is a highly authoritative term that lies within the co-occurrence context of the query terms
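The definition of the co-occurrence context translates directly into a subgraph extraction. The sketch below assumes the co-occurrence graph is stored as a symmetric dictionary of weighted neighbours; the representation, the function name and the edge weights are all illustrative.

```python
def cooccurrence_context(graph, query_terms):
    """Subgraph induced by the query terms and their direct neighbours.

    graph: dict term -> dict of co-occurring term -> co-occurrence weight.
    """
    nodes = {t for t in query_terms if t in graph}
    for t in list(nodes):
        nodes.update(graph[t])
    # Keep only edges whose two endpoints are both inside the context.
    return {u: {v: w for v, w in graph.get(u, {}).items() if v in nodes}
            for u in nodes}


# Toy graph with made-up weights.
graph = {
    "Roger Federer": {"Tennis": 5, "Rafael Nadal": 3, "Switzerland": 2},
    "Rafael Nadal": {"Tennis": 4, "Roger Federer": 3, "Spain": 2},
    "Tennis": {"Roger Federer": 5, "Rafael Nadal": 4, "Wimbledon": 1},
    "Switzerland": {"Roger Federer": 2, "Alps": 1},
    "Spain": {"Rafael Nadal": 2},
    "Wimbledon": {"Tennis": 1},
    "Alps": {"Switzerland": 1},
}
context = cooccurrence_context(graph, ["Roger Federer", "Rafael Nadal"])
```

Here "Wimbledon" and "Alps" fall outside the context because neither co-occurs directly with a query term.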

Page 46: Semantics hidden within co-occurrence patterns

Online Page Importance Computation (OPIC)

Each node $i$ in the context is initialised with a cash $c_i$.

A node $a$ is picked at random and its cash $c_a$ is added to its history $h_a$.

Then $c_a$ is distributed amongst all its neighbours in proportion to the edge weights.

This process is iterated until the ratios of the $h_i$ become nearly constant.

The node with the largest $h_i$ is chosen as the most central node.

Unfortunately, OPIC was seen to be unsuitable for determining topical anchors, since it tends to find nodes that are central to the entire graph
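The steps above can be sketched as follows. Two simplifications are assumptions of this sketch rather than part of the slide: the loop runs for a fixed number of steps instead of testing whether the history ratios have stabilised, and every node is assumed to have at least one neighbour and no self-loop. Resetting a node's cash to zero after it is distributed follows the usual formulation of OPIC.

```python
import random


def opic(graph, steps=20000, seed=0):
    """Return each node's share of the accumulated cash history.

    graph: dict node -> dict neighbour -> positive edge weight.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    cash = {n: 1.0 / len(nodes) for n in nodes}
    history = {n: 0.0 for n in nodes}
    for _ in range(steps):
        a = rng.choice(nodes)
        if not graph[a]:
            continue  # nothing to distribute to; skipped in this sketch
        c, cash[a] = cash[a], 0.0
        history[a] += c
        total = sum(graph[a].values())
        # Share the cash among neighbours in proportion to edge weights.
        for nbr, w in graph[a].items():
            cash[nbr] += c * w / total
    norm = sum(history.values())
    return {n: h / norm for n, h in history.items()}


# Toy star graph: the hub should come out as the most central node.
star = {"hub": {"l1": 1, "l2": 1, "l3": 1, "l4": 1},
        "l1": {"hub": 1}, "l2": {"hub": 1}, "l3": {"hub": 1}, "l4": {"hub": 1}}
importance = opic(star)
most_central = max(importance, key=importance.get)
```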

Page 47: Semantics hidden within co-occurrence patterns

Cash Leaking Random Walk (CLRW)

Co-occurrence graphs have extremely small diameters (4-5).

Roger Federer to feral child in two hops.

Football becomes most central to Roger Federer and Rafael Nadal instead of Tennis.

Solution: Cash Leakage
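The slides name cash leakage as the fix without spelling out the rule, so the following is only a guess at its shape. It assumes that only the query terms start with cash and that a fixed fraction `leak` of the cash is discarded every time a node passes its cash on; both are assumptions of this sketch, not the published CLRW definition in [6].

```python
import random


def leaking_walk(graph, query_terms, leak=0.3, eps=1e-9, seed=0):
    """Accumulate cash histories while a fraction of cash leaks at every hop.

    graph: dict node -> dict neighbour -> positive edge weight.
    Runs until the cash still in circulation is at most eps.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    cash = {n: 0.0 for n in nodes}
    for q in query_terms:
        cash[q] = 1.0
    history = {n: 0.0 for n in nodes}
    while sum(cash.values()) > eps:
        a = rng.choice(nodes)
        c, cash[a] = cash[a], 0.0
        if c == 0.0:
            continue
        history[a] += c
        total = sum(graph[a].values())
        # Only (1 - leak) of the cash survives the hop to the neighbours.
        for nbr, w in graph[a].items():
            cash[nbr] += (1.0 - leak) * c * w / total
    return history


# Toy path q - n1 - n2 - n3: history should fade with distance from q.
path = {"q": {"n1": 1}, "n1": {"q": 1, "n2": 1},
        "n2": {"n1": 1, "n3": 1}, "n3": {"n2": 1}}
history = leaking_walk(path, ["q"])
```

Because cash decays with every hop, nodes far from the query terms accumulate little history even when they are hubs of the whole graph.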

Page 48: Semantics hidden within co-occurrence patterns

Bias and History Vectors

There is a hidden bias between query words in the way centrality is computed.

Bias due to difference in neighbourhood sizes

Example: Jim Carrey, Hugh Grant, Rajkumar

Bias due to polysemy

Example: Java, Beans, Kaffe

Page 49: Semantics hidden within co-occurrence patterns

Bias examples

Query Terms | Topical Anchors
Java, Beans, Kaffe | Programming language, Indonesia, Food
United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau
Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number
MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay
Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree
Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson

Table: Examples with irrelevant topical anchors

Page 50: Semantics hidden within co-occurrence patterns

Solution to the topic bias problem

Labelled cash.

Vector models of CLRW

Cash from each query term $q_i$ is given a “colour” $c_i$. The cash history at any node is hence a vector of the form $(v_1, v_2, \ldots, v_n)$ showing the cash flow history for each of the colours. The vector is then normalized as:

$$ v'_i = \frac{v_i}{v}, \quad \text{where } v = \max_i v_i \text{ and } v'_i \in [0, 1] $$

Page 51: Semantics hidden within co-occurrence patterns

Projection

The line joining $\vec{0}_n$ to $\vec{1}_n$ represents points where all query terms have contributed equally to the cash history. This is called the baseline

Hence, for any given node, its projection onto the baseline represents the importance of the node in being a topical anchor

Page 52: Semantics hidden within co-occurrence patterns

Euclidean Distance

The Euclidean metric computes the L2 distance between the normalized cash history vector of a candidate node and $\vec{1}_n$

Favours uniformity in the cash history distribution over the overall magnitude of the cash history

Page 53: Semantics hidden within co-occurrence patterns

Cosine Similarity

Computes the cosine of the angle between a given node’s normalized cash history vector and $\vec{1}_n$

Another metric for factoring in both uniformity in cash distribution and magnitude
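The three vector-based measures (projection onto the baseline, Euclidean distance and cosine similarity) can be compared on small examples. The sketch below takes a node's cash history as a plain list with one entry per query-term colour; the function names and the sample vectors are illustrative, and the normalization divides by the largest component of the node's own vector, which is one reading of the normalization formula.

```python
import math


def normalize(history):
    """Divide every component by the largest one, so values lie in [0, 1]."""
    top = max(history)
    return [v / top for v in history] if top > 0 else list(history)


def projection_score(v):
    """Length of the projection of v onto the baseline through 0_n and 1_n."""
    return sum(v) / math.sqrt(len(v))


def euclidean_distance(v):
    """L2 distance from v to the all-ones vector 1_n (smaller is better)."""
    return math.sqrt(sum((1.0 - x) ** 2 for x in v))


def cosine_similarity(v):
    """Cosine of the angle between v and 1_n; depends only on v's direction."""
    norm = math.sqrt(sum(x * x for x in v))
    return sum(v) / (norm * math.sqrt(len(v))) if norm > 0 else 0.0


# Made-up histories: one node fed equally by all colours, one dominated by a
# single colour (as might happen with a polysemous query term).
balanced = normalize([4.0, 4.0, 4.0])
skewed = normalize([9.0, 1.0, 0.5])
```

A node fed evenly by all query terms sits on the baseline, so its Euclidean distance to $\vec{1}_n$ is zero and its cosine is one, whereas the skewed node scores worse on all three measures.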

Page 54: Semantics hidden within co-occurrence patterns

Example results

Query Terms | Projection | Euclidean | Cosine
United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau | Currency, Bank, France | Currency, Bank, France
Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number | Mathematics, Mathematician, Euler | Mathematics, Mathematician, Probability distribution
MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay | University, College, Technology | University, College, Science
Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree | Plant, Tree, Species | Plant, Tree, Species
Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson | Mathematics, Probability, Expected Value | Mathematics, Probability, Statistics

Page 55: Semantics hidden within co-occurrence patterns

User evaluation

Experimental Setup:

86 volunteer users were given a set of queries and asked to provide topical labels for these queries, ranked according to their perceived importance

66 volunteers answered 100 questions, while the rest answered 30 random questions chosen from the 100 questions

User responses were charted for consistency in results (chart shown below)

Page 56: Semantics hidden within co-occurrence patterns

User evaluation
CLRW against tf-idf and OPIC

Page 57: Semantics hidden within co-occurrence patterns

Comparison
Comparison with the Automatic Topic Labeling (ATL) algorithm [3]

Caveats: Comparison is with the Euclidean algorithm. ATL requires document contexts where the topical anchor is present (unlike CLRW, which searches the co-occurrence graph built over a corpus)

Page 58: Semantics hidden within co-occurrence patterns

Future Work
Several open questions...

Topical markers, semantic siblings

Co-occurrence semantics when coupled with concept hierarchies

Automatic detection of semantic relations based on co-occurrence

Automatic attribute identification

Thank You!

Page 59: Semantics hidden within co-occurrence patterns

References

[1] A. Biletzki and A. Matar. Ludwig Wittgenstein (second revision). Stanford Encyclopedia of Philosophy, May 2009.

[2] W. Gerstner and W. M. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.

[3] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 490–499, New York, NY, USA, 2007. ACM.

[4] M. R. Mutalikdesai and S. Srinivasa. Co-citations as endorsements of citations. Submitted for publication, 2009.

[5] S. Pinker. The Language Instinct. Harper Perennial Modern Classics, 2007.

[6] A. R. Rachakonda and S. Srinivasa. Finding the topical anchors of a context using lexical cooccurrence data. In Proceedings of ACM Conference on Information and Knowledge Management (CIKM), 2009.

[7] A. R. Rachakonda and S. Srinivasa. Vector-based ranking techniques for identifying the topical anchors of a context. In Proceedings of the 15th International Conference on Management of Data (COMAD), 2009.

[8] S. Reddy, S. Srinivasa, and M. R. Mutalikdesai. Measures of "ignorance" on the web. In Proceedings of the International Conference on Management of Data (COMAD), Dec 2006.
