Freebase: Wikipedia Mining 20080416

Wikipedia Mining Spring Freebase User Group meeting 2008-04-16 / zenkat

Transcript of Freebase: Wikipedia Mining 20080416

Wikipedia Mining
Spring Freebase User Group meeting
2008-04-16 / zenkat

2

Why Mine Wikipedia?

• How can we automatically extract the unstructured content from Wikipedia …

• … to create a structured database of information …

• … that can be leveraged by users in applications and data loads

3

A Remarkable Source of Information

2.15 M articles as of April 2008
Doubling every 12 - 18 months

4

Problem is …

• Wikipedia is written by humans, for humans.

- Great if you need to look up a fact, or learn about something

• But you can’t …

- Ask questions: “What movies by George Lucas has Harrison Ford starred in?”

- Search effectively: “Find me all companies that build personal computers.”

- Build applications: “Let’s make a social app that ranks consumer goods listed in Wikipedia.”

5

From unstructured …

6

… to structured

7

Searching for Structure: Topics

Articles define a topic

8

Searching for Structure: Types

Categories & Lists provide type

9

Searching for Structure: Types

Categories & Lists provide type

10

Searching for Structure: Properties

Templates & Infoboxes give properties

11

Searching for Structure: Properties

What are the highest buildings in the world?

{
  "query" : [
    {
      "type" : "/architecture/structure",
      "name" : null,
      "height_meters" : null,
      "sort" : "-height_meters",
      "limit" : 10
    }
  ]
}

12

Searching for Structure: Properties

What are all the countries that speak English?

{
  "query" : [
    {
      "type" : "/location/country",
      "name" : null,
      "official_language" : "English",
      "limit" : 100
    }
  ]
}
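The two queries above share one shape: a type, a `name` slot left as `null` to be filled in, optional property constraints, and a limit. A minimal Python sketch of building that envelope follows. `build_query` is a hypothetical helper written for illustration, not part of any Freebase client, and the property names are simply the ones shown on the slides.

```python
import json

def build_query(type_id, constraints=None, limit=100, sort=None):
    """Build a query envelope shaped like the ones on the slides.

    Hypothetical helper for illustration; None values serialize to JSON
    null, which is how the slides mark a slot to be filled in.
    """
    clause = {"type": type_id, "name": None}
    clause.update(constraints or {})
    if sort is not None:
        clause["sort"] = sort
    clause["limit"] = limit
    return {"query": [clause]}

# The two example queries from the slides.
tallest = build_query(
    "/architecture/structure",
    {"height_meters": None},
    limit=10,
    sort="-height_meters",
)
english = build_query("/location/country", {"official_language": "English"})

print(json.dumps(english, indent=2))
```

Building the envelope as a dictionary and serializing it with `json.dumps` avoids the missing or trailing commas that are easy to introduce when writing the JSON by hand.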

13

A Treasure Trove Waiting To Be Opened

• 2,150,000 articles (i.e., topics)

• 7,100,000 category refs (i.e., typings)
- Found within 280,000 categories

• 42,000,000 template values (i.e., properties)
- Found within 10,000 templates and 56,000 template keys

• All growing at ~2% every two weeks

• Available information doubles every year!
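As a rough check on the growth figures, the doubling time implied by a steady rate can be computed directly. The sketch below assumes the ~2% per two weeks compounds evenly, which the slides do not state; `doubling_time_weeks` is a name chosen for illustration.

```python
import math

def doubling_time_weeks(rate_per_period, period_weeks):
    """Weeks needed to double at a constant compound growth rate."""
    periods = math.log(2) / math.log(1 + rate_per_period)
    return periods * period_weeks

# ~2% growth every two weeks, as quoted on the slide.
weeks = doubling_time_weeks(0.02, 2)
months = weeks / 52 * 12
print(round(weeks), "weeks, about", round(months), "months")
```

Under that assumption the doubling time comes out at roughly 70 weeks, or about 16 months, which falls inside the "12 - 18 months" range quoted earlier in the deck; "every year" reads as a rounder version of the same estimate.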

14

Topic Population From Wikipedia

Topic Name

Blurb

Wikipedia Attribution

Wikipedia Link

Image

15

Fresh Topic

16

Similar, but different …

• Many pages in Wikipedia are not topics
- Disambiguation pages, lists, categories, images, docs, talk …

• Only store a 1200-character blurb
- We’re not Wikipedia, after all

• Don’t need to add “(suffix)” to names
- “Python (genus)” vs “Python (programming language)”
- Freebase types disambiguate, so names do not need a suffix

• Cities should be specified without state suffix
- “San Francisco” vs “San Francisco, California”
- Cleanup in progress, some exceptions remain

• “Exclusionist” vs “Inclusionist”
- Exclusionists appear to be winning in Wikilandia
- Freebase is inherently more inclusionist
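The naming and blurb rules above can be sketched in a few lines of Python. This is an illustrative guess at the cleanup, not the loader's actual code: `topic_name` and `blurb` are hypothetical names, and the `is_city` flag stands in for whatever type information the real process uses to decide when a state suffix applies.

```python
import re

# A trailing parenthetical such as " (genus)" or " (programming language)".
_PAREN_SUFFIX = re.compile(r"\s*\([^()]*\)\s*$")

def topic_name(article_title, is_city=False):
    """Derive a topic name from a Wikipedia article title (naive sketch)."""
    name = _PAREN_SUFFIX.sub("", article_title)
    if is_city and "," in name:
        # Drop the state suffix: "San Francisco, California" -> "San Francisco".
        name = name.rsplit(",", 1)[0]
    return name.strip()

def blurb(article_text, max_chars=1200):
    """Keep only the first 1200 characters, per the slide."""
    return article_text[:max_chars]

print(topic_name("Python (genus)"))
print(topic_name("Python (programming language)"))
print(topic_name("San Francisco, California", is_city=True))
```

Both Python articles reduce to the same name, which is the point of the slide: the topic's type, not its name, tells them apart. The comma rule is applied only to cities because many non-city titles contain commas legitimately.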

17

You Can’t Read The Same Wikipedia Twice

Every 2 weeks …

- 65,000 new pages
- 30,000 new topics
- 80,000 new aliases
- 10,000 merges
- 8,000 deletes
- 5,000 name changes
- 1,000 page ID changes
- 1,000 splits

… change in Wikipedia

18

Keeping track of changes …

• Store reference information within Freebase
- Page_ids, article titles and redirects
- Page_id (WPID) is stored in /wikipedia/en_id
- Article titles and redirects are stored in /wikipedia/en
- “mwcl_wikipedia_en”, “mw_infobot” user

• None of these IDs are stable in wiki-land …

19

Determining actions by comparing keys

• Because we are more inclusionist than Wikipedia, we usually do not delete topics.

• Topic renames only occur on “untouched” topics.

• Merges occur automatically on “untouched” topics
- Otherwise, flagged for review in “pipeline”

case          action
new topic     create a new topic
name change   add new name as en key; if "untouched", rename the topic
id change     change the en_id to the new value
merge         move the en key to the new topic; if "untouched", merge the topics
split         create new topic, move en key from old topic to new topic
delete        keep topic, but delete en_id and en keys from topic
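The case/action table can be read as a small dispatch function. The sketch below only encodes the table and the "untouched" rule from the bullets; how each case is detected by comparing keys is not described on the slide and is left out. `actions_for` and the step strings are illustrative, not the updater's real interface.

```python
# Base actions per case, taken from the table on the slide.
BASE_ACTIONS = {
    "new topic": ["create a new topic"],
    "name change": ["add new name as en key"],
    "id change": ["change the en_id to the new value"],
    "merge": ["move the en key to the new topic"],
    "split": ["create new topic", "move en key from old topic to new topic"],
    "delete": ["keep topic", "delete en_id and en keys from topic"],
}

def actions_for(case, untouched):
    """Return the ordered steps for a detected Wikipedia change.

    `untouched` means no user has edited the topic since it was loaded.
    """
    steps = list(BASE_ACTIONS[case])
    if case == "name change" and untouched:
        steps.append("rename the topic")
    elif case == "merge":
        # Touched topics are not merged automatically.
        steps.append("merge the topics" if untouched else "flag for review in pipeline")
    return steps

print(actions_for("merge", untouched=False))
```

Keeping the base actions in a table and layering the "untouched" rule on top mirrors the slide: key bookkeeping always happens, while destructive changes to the topic itself only happen when no user edits would be lost.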

20

Map Template Fields To Properties

21

Map Template Fields To Properties

{{infobox Aircraft
 |subtemplate={{Infobox Boeing Aircraft}}
 |name = Boeing 777
 |manufacturer = [[Boeing Commercial Airplanes]]
 |first flight = [[June 12]] [[1994]]
 |introduction = [[June 7]] [[1995]] with [[United]]
 |primary user = [[Singapore Airlines]]
 |more users = [[Air France-KLM]]
 |produced = 1993 - Present
 |number built = 723 as of March 2008
 |unit cost = US$187.5-253 million
}}

MediaWiki

Template

Rendering

22

Map Template Fields To Properties

{{infobox Aircraft
 |subtemplate={{Infobox Boeing Aircraft}}
 |name = Boeing 777
 |manufacturer = [[Boeing Commercial Airplanes]]
 |first flight = [[June 12]] [[1994]]
 |introduction = [[June 7]] [[1995]] with [[United]]
 |primary user = [[Singapore Airlines]]
 |more users = [[Air France-KLM]]
 |produced = 1993 - Present
 |number built = 723 as of March 2008
 |unit cost = US$187.5-253 million
}}

MediaWiki

Template

Rendering

“manufacturer” --> /aviation/aircraft_model/manufacturer
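A minimal sketch of the mapping step, assuming the infobox has one `|key = value` field per line and no nested templates spanning lines. Real wikitext is messier and would need a proper parser. Only the “manufacturer” mapping comes from the slide; `parse_infobox`, `to_properties` and the lookup table layout are hypothetical.

```python
import re

# (template name, template key) -> Freebase property.
# Only this one mapping appears on the slide.
FIELD_TO_PROPERTY = {
    ("infobox aircraft", "manufacturer"): "/aviation/aircraft_model/manufacturer",
}

# Captures the target of a [[wiki link]], ignoring any "|label" part.
LINK_TARGET = re.compile(r"\[\[([^\]|]+)")

def parse_infobox(wikitext):
    """Return (template_name, {key: raw_value}) for a one-field-per-line infobox."""
    lines = wikitext.strip().splitlines()
    template = lines[0].lstrip("{").strip().lower()
    fields = {}
    for line in lines[1:]:
        line = line.strip()
        if line.startswith("|") and "=" in line:
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = value.strip()
    return template, fields

def to_properties(wikitext):
    """Map known template keys to properties; values are link targets if present."""
    template, fields = parse_infobox(wikitext)
    result = {}
    for key, value in fields.items():
        prop = FIELD_TO_PROPERTY.get((template, key))
        if prop is not None:
            result[prop] = LINK_TARGET.findall(value) or [value]
    return result

SAMPLE = """{{infobox Aircraft
 |name = Boeing 777
 |manufacturer = [[Boeing Commercial Airplanes]]
 |number built = 723 as of March 2008
}}"""

print(to_properties(SAMPLE))
```

Keying the lookup on both template name and field name reflects the counts given earlier in the deck: the same key can appear in many templates, so a key alone is not enough to pick a property.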

23

Just the Starting Point …

• Extracted to date from Wikipedia:

- 2,365,000 topics
- 2,895,000 typings
- 5,638,000 properties

• A complement to user-entered data
- User data always takes precedence, won’t be overwritten

• Processes are being automated to keep in sync

24

Thanks!

Tristan Buckner, /user/tristan

Colin Evans, /user/colin

Al Marks, /user/al

Topic updater, Image loader

WEX

Category mapper, Template mapper

WEX