NaBIC 2012 presentation


Transcript of NaBIC 2012 presentation

Page 1: NaBIC 2012 presentation

What's going on out there right now? A beehive based machine to give snapshot of the ongoing stories on the Web

Štefan Sabo and Pavol Návrat

[email protected], [email protected]

Page 2: NaBIC 2012 presentation

General overview

• A method for extracting story-related keywords from news articles is proposed.

• Multiple agents inspired by honey bees foraging for food are used.

• Connections between articles are explored one keyword at a time.

• The most promising keywords, those that provide links between articles, are propagated; uninteresting keywords are discarded.

Page 3: NaBIC 2012 presentation

Outline of presentation

• Motivation

• Method overview

• Results

• Summary

• Future work

Page 4: NaBIC 2012 presentation

Motivation

• News stories are often represented by terms that identify the story by providing an easily recognizable label for it.

• These keywords are interesting for navigation in the space of news stories.

• It is difficult to predict in advance which articles will develop into stories over time and which keywords will represent them.

• A dynamic system is needed to follow new articles and account for changes in the old ones.

• A corpus of all the articles is unavailable.

Page 5: NaBIC 2012 presentation

Method overview

• The most representative keywords are chosen by comparing the relevance of multiple articles to a given keyword.

• If two articles are both relevant to a keyword, a link is established between them.

• Keywords that provide links between the most articles are selected as the most interesting.

• Comparing every pair of articles with regard to every keyword would be impractical.

• To keep the number of comparisons manageable, the comparison is performed by a swarm of agents inspired by honey bees.
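The exhaustive comparison that the swarm is meant to avoid can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes relevance is a simple yes/no (each article is a set of keywords it is relevant to), and the function name `count_links` and the toy articles are made up for the example.

```python
from collections import Counter
from itertools import combinations


def count_links(articles):
    """Count, for every keyword, how many article pairs it links.

    `articles` maps an article id to the set of keywords the article is
    relevant to (binary relevance is an assumption of this sketch).
    """
    links = Counter()
    # Visit every unordered pair of articles: this is the quadratic
    # comparison the slide calls impractical for a live news stream.
    for (_, kw_a), (_, kw_b) in combinations(articles.items(), 2):
        # A keyword links the pair if both articles are relevant to it.
        for keyword in kw_a & kw_b:
            links[keyword] += 1
    return links


# Toy data, invented for illustration only.
articles = {
    "a1": {"Syria", "Aleppo", "attack"},
    "a2": {"Syria", "attack"},
    "a3": {"Syria", "Egypt"},
    "a4": {"Apple", "Samsung", "trial"},
    "a5": {"Apple", "Samsung"},
}
links = count_links(articles)
ranked = links.most_common()  # keywords ordered by number of links
```

With n articles the loop visits n(n-1)/2 pairs, each checked against all shared keywords, which is why a sampling strategy such as the agent swarm is attractive.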

Page 6: NaBIC 2012 presentation

Method overview - agents

• Every agent carries a single keyword at a time and can independently perform one of 3 actions:

  o foraging – comparing articles

  o dancing – propagating its current keyword

  o observing – selecting a new keyword

• Based on the keyword quality, an agent may decide to propagate an interesting keyword through dancing or select a new keyword through observation.

• This mechanism focuses the swarm on the most interesting keywords for currently visited articles.
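The forage, dance, observe cycle can be sketched roughly as below. The slides do not give the quality measure or the decision rule, so everything specific here is an assumption: quality is taken to be the fraction of randomly sampled article pairs in which both articles contain the keyword, an agent dances with probability equal to that quality, and a non-dancing agent adopts a keyword advertised by a dancer (or a random keyword if nobody dances). The names `keyword_quality` and `run_swarm` are hypothetical.

```python
import random
from collections import Counter


def keyword_quality(keyword, articles, rng, samples=5):
    """Foraging: share of sampled article pairs that the keyword links."""
    hits = 0
    for _ in range(samples):
        a, b = rng.sample(articles, 2)  # two distinct articles
        if keyword in a and keyword in b:
            hits += 1
    return hits / samples


def run_swarm(articles, n_agents=20, steps=50, seed=0):
    """Return how many agents carry each keyword after `steps` rounds."""
    rng = random.Random(seed)
    vocabulary = sorted(set().union(*articles))
    carried = [rng.choice(vocabulary) for _ in range(n_agents)]
    for _ in range(steps):
        # Foraging: every agent evaluates its keyword, then decides
        # whether to dance (assumed rule: probability = quality).
        dancing = [
            rng.random() < keyword_quality(kw, articles, rng)
            for kw in carried
        ]
        advertised = [kw for kw, d in zip(carried, dancing) if d]
        # Observing: non-dancers pick up an advertised keyword, or a
        # random one from the vocabulary when nobody is dancing.
        for i, is_dancing in enumerate(dancing):
            if not is_dancing:
                carried[i] = rng.choice(advertised or vocabulary)
    return Counter(carried)


# Toy articles (sets of keywords), invented for illustration only.
toy_articles = [
    {"Syria", "Aleppo"},
    {"Syria", "Egypt"},
    {"Syria", "Apple"},
    {"Syria", "trial"},
]
result = run_swarm(toy_articles)
```

In this toy input only "Syria" appears in more than one article, so it is the only keyword that can ever be danced; the swarm is therefore expected to concentrate on it, which mirrors the focusing effect described above.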

Page 7: NaBIC 2012 presentation

Results

• News articles from the Reuters web page were checked daily over a period of 9 days.

• 298 unique keywords were identified.

• On average, 287 articles were assigned a keyword every day.

• An increased prevalence of proper nouns among the top keywords can be noted.

Page 8: NaBIC 2012 presentation

Results – best keywords

keyword     n(k)     n(k)/N     keyword     n(k)     n(k)/N

Syria       177.30   6.87 %     court       49.90    1.93 %
Egypt        98.10   3.80 %     ECB         49.85    1.93 %
Apple        92.65   3.59 %     attack      49.41    1.91 %
Afghan       78.23   3.03 %     Colorado    41.79    1.62 %
Euro         75.50   2.92 %     trial       28.90    1.12 %
shooting     56.32   2.18 %     Libor       27.75    1.07 %
Samsung      55.71   2.16 %     murder      26.38    1.02 %
China        55.30   2.14 %     Aleppo      25.31    0.98 %

Page 9: NaBIC 2012 presentation

Results – development over time

[Chart: daily values from 4.8. to 12.8. for the keywords Syria, Euro, Apple, Egypt, Afghan, shooting, China and Colorado; vertical axis from 0 to 120.]

Page 10: NaBIC 2012 presentation

Summary

• The proposed approach uses agents inspired by honey bees foraging for food to extract story-related keywords from a set of news articles.

• Articles are compared and their proximity is evaluated multiple times with regard to various keywords.

• To reduce the number of performed comparisons, agents use the mechanisms of propagation and observation to select the best keywords and discard those less desirable.

• The dynamic nature of the process enables agents to react to new articles, as well as to changes in the old ones, without the need for an article corpus or machine learning.

Page 11: NaBIC 2012 presentation

Future work

• Multi-level hierarchical grouping of keywords based on their generality.

• Visualization of stories.