Ontology-based Extraction of Information from the Internet

Jan Korst, Philips Research

Joint work with Michael Verschoor, Nick de Jong, and Gijs Geleijnse

Philips Research, Jan Korst, 26 November 2004

• Context

• Ontologies

• Searching for enumerations / tables in web pages

• Case Study: Searching for famous persons on the web

• Concluding remarks

Recommender system (diagram labels):

• ontologies and metadata
• matching and reasoning
• preferences, personal history, and calendar
• electronic program guide, cultural agenda
• recommendations for TV shows, exhibitions in museums, theatre shows, etc.

An ontology is a “specification of a conceptualization”. [Tom Gruber]

In other words: a formal description of the concepts and their relationships in a certain domain.

Example: music domain

concepts: composers, songs, albums, performers, …
relationships: …

Semantic web languages such as RDF(S) and OWL are useful for specifying ontologies for a given knowledge domain.

An ontology O is defined by a 4-tuple (C, I, P, T), where:

• C is a set of classes c, e.g. composer, song, album, performer, …

• I = { I(c) | c ∈ C }, with I(c) the set of instances of class c

• P is a set of properties p(c, c’) for some c, c’ ∈ C, e.g. is_composer_of (composer, song) and is_contained_in (song, album)

• T = { T(p) | p ∈ P }, with T(p) ⊆ { (s, p, o) | s ∈ I(c), o ∈ I(c’) } for each p ∈ P the set of true statements (triples).
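
As a minimal sketch, the 4-tuple can be held in a small Python data structure. The class name `Ontology`, the method `add_triple`, and the example instances are illustrative assumptions and do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Illustrative container for the 4-tuple (C, I, P, T)."""
    classes: set                                # C: class names
    instances: dict                             # I: class name -> set of instances
    properties: dict                            # P: property name -> (c, c')
    triples: set = field(default_factory=set)   # T: true statements (s, p, o)

    def add_triple(self, s, p, o):
        """Add (s, p, o) only if s is in I(c) and o is in I(c') for p(c, c')."""
        c, c_prime = self.properties[p]
        if s not in self.instances[c] or o not in self.instances[c_prime]:
            raise ValueError(f"{(s, p, o)} does not fit {p}({c}, {c_prime})")
        self.triples.add((s, p, o))


# Hypothetical music-domain example.
music = Ontology(
    classes={"composer", "song", "album"},
    instances={
        "composer": {"Wolfgang Amadeus Mozart"},
        "song": {"Eine kleine Nachtmusik"},
        "album": set(),
    },
    properties={
        "is_composer_of": ("composer", "song"),
        "is_contained_in": ("song", "album"),
    },
)
music.add_triple("Wolfgang Amadeus Mozart", "is_composer_of", "Eine kleine Nachtmusik")
```

Checking each triple against the property signature keeps T consistent with I and P, which is the constraint stated in the definition.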

Problem statement

For a partially given ontology O’ = (C, I’, P, T’) of a given knowledge domain, with I’ ⊆ I and T’ ⊆ T, extend I’ to I’’ and T’ to T’’ so that they approximate I and T as well as possible.

In other words: how can we populate databases?

Research questions:

- Can this be automated?
- Can we do this by extracting information from the web?

Quality of Approximation

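For each class c, we define precision and recall as follows:

$$\mathrm{precision}(c) = \frac{|I(c) \cap I''(c)|}{|I''(c)|}$$

$$\mathrm{recall}(c) = \frac{|I(c) \cap I''(c)|}{|I(c)|}$$

For each property p, precision and recall are defined likewise.

[Figure: diagram of the sets I’(c) and I’’(c)]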
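
A minimal sketch of these two measures, assuming the standard set-based definitions: precision is the share of extracted instances that are correct, and recall is the share of correct instances that were extracted. The function names and the example sets are illustrative.

```python
def precision(found, truth):
    """Fraction of found instances that are true instances."""
    return len(found & truth) / len(found) if found else 0.0


def recall(found, truth):
    """Fraction of true instances that were found."""
    return len(found & truth) / len(truth) if truth else 0.0


# Hypothetical example: `truth` stands in for I(c), `found` for I''(c).
truth = {"france", "germany", "italy", "spain"}
found = {"france", "germany", "italy", "team", "usa"}

print(precision(found, truth))  # 3 of 5 found terms are correct
print(recall(found, truth))     # 3 of 4 true instances were found
```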

Searching for enumerations on the web

basic idea: words in an enumeration tend to be of the same class.

Given a small subset of instances of a given class, we want to automatically extend this subset: more-of-the-same.

algorithm:

- select web pages in which a given sequence or given subset of instances occurs, using Google.

- scan these pages for enumerations in which one or more of the given instances occur.

- extract the other terms that are in these enumerations.

A similar approach has been applied to a corpus of documents in molecular biology [Nenadić, Spasić & Ananiadou, 2002].
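
The scanning and extraction steps can be sketched as follows, assuming the page text has already been retrieved. The notion of "enumeration" used here (a sentence split on commas, "and", and "or" into short items) is a simplifying assumption, and `enumeration_terms`, `min_items`, and `max_words` are hypothetical names.

```python
import re
from collections import Counter


def enumeration_terms(text, seeds, min_items=3, max_words=2):
    """Count terms that co-occur with seed instances in simple enumerations."""
    seeds = {s.lower() for s in seeds}
    counts = Counter()
    for sentence in re.split(r"[.;:!?\n]", text.lower()):
        # Split the sentence into candidate enumeration items.
        items = [item.strip() for item in re.split(r",|\band\b|\bor\b", sentence)]
        # Keep only short items; long fragments are unlikely to be instances.
        items = [item for item in items if item and len(item.split()) <= max_words]
        # Accept the enumeration only if it contains at least one seed.
        if len(items) >= min_items and seeds & set(items):
            counts.update(item for item in items if item not in seeds)
    return counts


page = "Works by Bach, Vivaldi, Mozart, Haydn and Beethoven. Tickets are cheap."
print(enumeration_terms(page, {"bach", "vivaldi", "mozart"}))
```

Unlike the example output on the following slides, this sketch leaves the seed terms out of the counts.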

General structure of the algorithm

• Preselection of relevant web pages
• Extraction of instances/…
• Filter to remove false positives

"bach vivaldi mozart" 611 --> [63] bach[154], mozart[46],


haydn[17], beethoven[14], ensembles[9], handel[9], chopin[7],


schubert[5], bizet[4], j[4], albinoni[3], brahms[3], s[3], sanz[3],

tartini[3], 2[2], chaconne[2], corelligeminiani[2], gershwin[2],


http[2], inteacutegrale[2], minor[2], paganini[2], ravel[2],


stravinsky[2], tchaikovsky[2], teleman[2], telemann[2], albeniz[1],

bellini[1], benda[1], berlioz[1], bloch[1], boccherini[1], boellman[1],

boieldieu[1], bruch[1], caccini[1], caldera[1], corelli[1], diabelli[1],

dowland[1], giuliani[1], grieg[1], homekcrrcom[1], jsbach[1],


milano[1], ortiz[1], pergolesi[1], prokofiev[1], purcell[1],


schumann[1], smetana[1], title[1], torelli[1], vieuxtemps[1]

Examples (2)

"france germany england italy" 246 --> [54] france[322],

germany[259], brazil[257], italy[239], argentina[223],

england[218], spain[215], holland[212], yugoslavia[140],

croatia[133], denmark[129], norway[122], chile[91], belgium[88],

nigeria[83], romania[83], mexico[66], bulgaria[59], colombia[54],

scotland[34], austria[33], cameroon[30], team[25], usa[22],

sth[18], states[16], morocco[13], ar[12], netherlands[12],

saudi[11], africa[10], bahamas[10], paraguay[10], czech[8],

jamaica[8], scandinavia[8], canada[7], japan[7], acquitane[4],

australia[4], bali[4], caribbean[4], china[4], czechoslovakia[4],

luxembourg[4], poland[4], us[4], flanders[2], acadeacutemiques[1],

asn[1], cortona[1], europe[1], korea[1], park[1]

Examples (3)

poincare hilbert brouwer 1110 --> [90] brouwer[20], hilbert[20],


deligne[18], gregory[18], mandelbrot[18], taylor[18], turing[18],


poisson[17], banach[16], kolmogorov[16], wiener[16], goldbach[15],


cohen[13], hausdorff[13], jacobi[13], kronecker[13], torricelli[13],


riemann[12], dedekind[11], frege[11], artin[10], babbage[10], barrow[10],

boole[10], bourgain[10], eukleidõs[10], euler[10], fraenkel[10],

heaviside[10], legendre[10], möbius[10], shannon[10], tchebychev[10],

borel[9], fibonacci[9], fisher[9], grothendieck[9], aryabhata[8], birkhoff[8],

bolyai[8], cayley[8], church[8],

descartes[8], hypatie[8], markov[8], minkowski[8], bolzano[7], cramer[7],


painlevÕ[7], cantor[6], morgan[6], puthagoras[6], gauss[5], haldane[5],

hauptman[5], irons[5], lejeune[5], schwartz[5], lie[4], bayes[3],

poincareacute[3], poincarÕ[3], biography[2], brahmagupta[2], carnap[2],

goumldel[2], gödel[2], …

Hypernym-based filtering

Patterns that indicate hypernym relations are distinguished:

“h such as i1, i2, …, in” and

“i1, i2, …, in and other h” [Hearst, 1992]

In these patterns, h is the plural of the intended class.
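
A minimal sketch of matching the two patterns with regular expressions. It assumes that instances are capitalised terms, which is a simplification made here and not a claim from the talk; `hearst_instances` is a hypothetical name.

```python
import re

# Assumption: an instance is one or more capitalised words.
ITEM = r"[A-Z][\w-]*(?: [A-Z][\w-]*)*"
LIST = rf"{ITEM}(?:, {ITEM})*(?:,? (?:and|or) {ITEM})?"
SPLIT = re.compile(r",?\s+(?:and|or)\s+|,\s*")


def hearst_instances(text, plural_class):
    """Return instances found via 'h such as ...' and '... and other h'."""
    h = re.escape(plural_class)
    patterns = [
        rf"{h} such as ({LIST})",
        rf"({LIST}),? and other {h}",
    ]
    found = set()
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            found.update(part for part in SPLIT.split(match.group(1)) if part)
    return found


text = ("He visited countries such as France, Germany and Italy. "
        "Spain, Portugal and other countries joined later.")
print(sorted(hearst_instances(text, "countries")))
```

Candidate instances from the enumeration search that never appear in such a pattern for the intended class could then be treated as likely false positives.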

Geographic Data

Extract all countries:

Input set                 Precision   Recall
France, China, Germany    0.89        0.99
Georgia, Ghana, Latvia    0.84        0.99
Kiribati, Monaco, Togo    0.79        0.99

Find out which countries have a border in common.

Case Study: Finding Famous Persons on the Web

Objective: generate a long list of famous persons, by searching the web.

- A famous person is a person that gets enough hits when being Googled.

- We restrict ourselves to persons that have already died.

Definition of number of hits

Using only the last name is not specific enough.
e.g. Bach, Smith

Even the full name might not be specific enough.
e.g. Theo van Gogh

In addition, some persons score better with their middle name, others without.
e.g. Johann Sebastian Bach vs. Johann Bach
     Antonio Vivaldi vs. Antonio Lucio Vivaldi

Still others are best known by their initials only.
e.g. HG Wells, DH Lawrence

Definition of number of hits

We use the number of hits that are found with the query:

“<last name> (<year of birth> - <year of death>)”
e.g. “Bach (1685 – 1750)”

By not using the full name, we combine different variants.
e.g. Johann Sebastian Bach and JS Bach

For kings, queens, popes, etc., the Latin ordinal number is used as the last name. This combines the variants in different languages.
e.g. Charles V, Carlos V, Karel V
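
A minimal sketch of building such a query string. Taking the last whitespace-separated word as the last name is an assumption made for illustration; it covers both the surname case and the ordinal-number case, but not multi-word surnames such as van Gogh. `hit_query` is a hypothetical name.

```python
def hit_query(name, year_of_birth, year_of_death):
    """Build the exact-phrase query '<last name> (<birth> - <death>)'."""
    last_name = name.split()[-1]  # assumption: last word is the last name
    return f'"{last_name} ({year_of_birth} - {year_of_death})"'


print(hit_query("Johann Sebastian Bach", 1685, 1750))
print(hit_query("JS Bach", 1685, 1750))
print(hit_query("Charles V", 1500, 1558))
print(hit_query("Karel V", 1500, 1558))
```

Different name variants of the same person map to the same query, which is the effect described above.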

Basic idea

We use potential time intervals

“(<year of birth> - <year of death>)”

as starting point to search for persons.

Issue exact queries to Google of the following form:

allintitle: “(y1 – y2)”

where y1 ∈ [1000..1999] and y2 − y1 ∈ [20..110], and analyse the summaries Google returns.

Look for the six words that precede “(y1 – y2)” and analyse these words.
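
A minimal sketch of the two steps: enumerating the interval queries and picking the words that precede an interval in a returned summary. Retrieving the summaries is left out; `interval_queries` and `preceding_words` are hypothetical names.

```python
import re


def interval_queries(first=1000, last=1999, min_age=20, max_age=110):
    """Yield one allintitle query per candidate (y1, y2) interval."""
    for y1 in range(first, last + 1):
        for age in range(min_age, max_age + 1):
            yield f'allintitle: "({y1} - {y1 + age})"'


def preceding_words(summary, y1, y2, n=6):
    """Return, per occurrence of '(y1 - y2)', up to n words that precede it."""
    pattern = re.compile(r"\(\s*%d\s*[-–]\s*%d\s*\)" % (y1, y2))
    results = []
    for match in pattern.finditer(summary):
        words = summary[:match.start()].split()
        results.append(words[-n:])
    return results


print(sum(1 for _ in interval_queries()))  # 1000 birth years x 91 ages
print(preceding_words(
    "Classical DVD: Handel, George Frederic (1685 - 1759) German", 1685, 1759))
```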

Google batch processing

To process the Google queries, we use a program that allows batch processing (Nick de Jong). The program allows parallel execution of multiple queries.

[Figure: file with queries → file with results]
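
The batch program itself is not described in the talk. As a rough sketch of the idea only, queries can be run in parallel with a thread pool; the `fetch` callable that performs one query is a placeholder supplied by the caller, and `run_batch` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(queries, fetch, max_workers=8):
    """Run fetch(query) for all queries in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(zip(queries, pool.map(fetch, queries)))


# Stand-in for a real search call: here it just returns the query length.
results = run_batch(['"(1685 - 1750)"', '"(1756 - 1791)"'], fetch=len, max_workers=2)
print(results)
```

In a file-based setup, `queries` would be read line by line from the query file and `results` written to the results file.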

Main problem: how to separate person names from other names.

Person name      Other name
Art Blakey       Art Deco
West Mae         West Virginia
Raul Delcroix    Real Decreto
HP Lovecraft     HP Inkjet
Koye Somefun     Have SomeFun

Potential approaches:
- filter out non-persons by using a list of stop words.
- filter out non-persons by using an exhaustive list of first names.
- carry out further tests (“X was born in”).

We only used a list of 500 stop words, including: Album, Anniversary, Archive, Articles, Biographie, Biography, Births, Boats, Burials, Catalog, Census, …
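
A minimal sketch of the stop-word filter. Only the eleven stop words quoted above are used; the full list of 500 is not given in the talk, and `looks_like_person` is a hypothetical name.

```python
# The stop words quoted on the slide; the complete list is not available.
STOP_WORDS = {
    "album", "anniversary", "archive", "articles", "biographie", "biography",
    "births", "boats", "burials", "catalog", "census",
}


def looks_like_person(candidate, stop_words=STOP_WORDS):
    """Reject a candidate name if any of its words is a stop word."""
    return not any(word.strip(".,").lower() in stop_words
                   for word in candidate.split())


print(looks_like_person("Art Blakey"))      # no stop word
print(looks_like_person("Census Records"))  # contains "Census"
```

A name such as "Art Deco" would still pass this filter unless one of its words is on the stop-word list, which is why the other approaches are listed as potential refinements.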

Additional Problem:

a single person can be presented in various ways

Vasilij Kandinskij
Wassily Kandinsky
Vasily Kandinsky
Vassily Kandinsky
Kandinsky, Wassily
Kandinsky Wassily

Johann Sebastian Bach
JS Bach
Johann Sebastian
Sebastian Bach
Bach, Johann Sebastian

Example of the word sequences that are found:

[allintitle: "(1769 - 1852)" -genealogy -genealogie] 111

Rose-Philippine Duchesne (
Rose-Philippine Duchesne (
Wellesley, 1st Duke of Wellington (
Home Study Service Rose Philippine Duchesne
Arthur, 1st Duke of Wellington (
The Duke of Wellington (
Wellesley, 1st Duke of Wellington (
Arthur Wellesley, Duke of Wellington. (
Wellesley, first Duke of Wellington (
People > Duke of Wellington (
> Pobl > Dug Wellington (
medal depicting Duke of Wellington (
Arthur Wellesley Wellington (
Wellesley, 1st Duke of Wellington (
John Landseer (
Wellington, Arthur Wellesley,Duke of,
Learning Library: WELLINGTON, DUKE OF (

Another Example:

George Frederick Handel (
GEORGE F. HANDEL (
X. George Frederick Handel. (
Handel, George Frideric (
George Frederic Handel,...
George Frederic Handel (
CD:Composers - H: Handel, George Frederic (
German/British Classical DVD: Handel, George Frederic (
German/British, George Frederic Handel (
... George Frideric HANDEL (
Georg Frideric Handel |
from Alibris George Frideric Handel (
New Window. George Frideric Handel (
up artist Handel, George F. (
Giulio Cesare. by GF Handel (
piece by HANDEL, Georg Friedrich (

1. first reduce capitals:

If a word consists of capitals only, then convert all but the first letter to lower case.
e.g. HANDEL → Handel

Unless the word contains a hyphen.
e.g. SAINT-SAENS → Saint-Saens

Unless the word represents a Latin ordinal number.
e.g. Louis XIV → Louis XIV

Unless the word starts with ‘MC’.
e.g. MCCULLOCH → McCulloch

Unless the word is an abbreviation (initials).
e.g. DE KNUTH → DE Knuth
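
The capital-reduction rules can be sketched as below. How initials are recognised is not stated in the talk; treating all-capital words of at most two letters as initials is an assumption made here, as are the names `reduce_capitals`, `reduce_name`, and `max_initials`.

```python
import re

# Strict Roman numeral pattern, used for names such as "Louis XIV".
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def reduce_capitals(word, max_initials=2):
    """Lower-case an all-capitals word, subject to the listed exceptions."""
    if not word.isupper():
        return word                       # mixed case: leave untouched
    letters = "".join(ch for ch in word if ch.isalpha())
    if ROMAN.fullmatch(letters):
        return word                       # Latin ordinal number, e.g. XIV
    if len(letters) <= max_initials:
        return word                       # assumed to be initials, e.g. DE, GF
    if "-" in word:
        return "-".join(part.capitalize() for part in word.split("-"))
    if word.startswith("MC"):
        return "Mc" + word[2:].capitalize()
    return word.capitalize()


def reduce_name(text):
    """Apply reduce_capitals to every word of a candidate string."""
    return " ".join(reduce_capitals(word) for word in text.split())


print(reduce_name("GEORGE F. HANDEL"))
print(reduce_name("DE KNUTH"))
```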

After step 1:

George Frederick Handel (
George F. Handel (
X. George Frederick Handel. (
Handel, George Frideric (
George Frederic Handel,...
George Frederic Handel (
CD:Composers - H: Handel, George Frederic (
German/British, Classical Dvd: Handel, George Frederic (
German/British, George Frederic Handel (
... George Frideric Handel (
Georg Frideric Handel |
from Alibris George Frideric Handel (
New Window. George Frideric Handel (
up artist Handel, George F. (
Giulio Cesare. by GF Handel (
piece by Handel, Georg Friedrich (

2. delete prefixes and suffixes:

Delete parts that cannot be part of the name.

First delete the suffix.

Next, scan through the words from back to front, until, e.g., a colon or period is encountered.
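
A minimal sketch of this step. The exact set of delimiters is not given in the talk; the choices below (trailing `(`, `|`, `,`, `.` as suffix characters; a colon, a slash, or a period after anything longer than a one-letter initial as prefix boundary) are assumptions chosen to fit the Handel example, and `strip_affixes` is a hypothetical name.

```python
import re

INITIAL = re.compile(r"[A-Z]\.")  # a one-letter initial such as "F."


def strip_affixes(snippet):
    """Remove a trailing suffix, then keep only words after the last boundary."""
    words = snippet.split()
    # Suffix: drop trailing punctuation that cannot belong to a name.
    while words:
        last = words[-1]
        if INITIAL.fullmatch(last):
            break                          # keep the period of an initial
        cleaned = last.rstrip("(|,.")
        if cleaned == last:
            break
        if cleaned:
            words[-1] = cleaned
        else:
            words.pop()
    # Prefix: scan from back to front until a boundary word is met.
    kept = []
    for word in reversed(words):
        ends_sentence = word.endswith(".") and not INITIAL.fullmatch(word)
        if ":" in word or "/" in word or ends_sentence:
            break
        kept.append(word)
    return " ".join(reversed(kept))


print(strip_affixes("CD:Composers - H: Handel, George Frederic ("))
print(strip_affixes("New Window. George Frideric Handel ("))
```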

After step 2:

George Frederick Handel
George F. Handel
X. George Frederick Handel
Handel, George Frideric
George Frederic Handel
George Frederic Handel
Handel, George Frederic
Handel, George Frederic
George Frederic Handel
George Frideric Handel
Georg Frideric Handel
from Alibris George Frideric Handel
George Frideric Handel
up artist Handel, George F.
by GF Handel
piece by Handel, Georg Friedrich

3. correct inversions:

If two words remain, where the first ends with a comma, then reverse.
e.g. West, Mae → Mae West

If three words remain, where the first ends with a comma, then move the first word to the end.
e.g. Handel, George Frederick → George Frederick Handel

If three words remain, where the second ends with a comma, then move the first two words to the end.
e.g. Van Gogh, Vincent → Vincent van Gogh

Problem: not all inverted names contain commas.
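
The three inversion rules can be sketched as follows. Lower-casing a leading particle such as "Van" follows the van Gogh example; the particle list itself and the name `correct_inversion` are assumptions.

```python
# Assumed list of surname particles that are lower-cased after a given name.
PARTICLES = {"van", "von", "de", "da", "del", "della"}


def correct_inversion(name):
    """Turn 'Surname, Given' into 'Given Surname' for two- and three-word names."""
    words = name.split()
    if len(words) in (2, 3) and words[0].endswith(","):
        surname, given = [words[0].rstrip(",")], words[1:]
    elif len(words) == 3 and words[1].endswith(","):
        surname, given = [words[0], words[1].rstrip(",")], words[2:]
    else:
        return name                        # no recognisable inversion
    if len(surname) > 1 and surname[0].lower() in PARTICLES:
        surname[0] = surname[0].lower()
    return " ".join(given + surname)


print(correct_inversion("West, Mae"))
print(correct_inversion("Handel, George Frederick"))
print(correct_inversion("Van Gogh, Vincent"))
```

Strings with more than three words, such as "up artist Handel, George F.", are left unchanged by these rules.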

After step 3:

George Frederick Handel
George F. Handel
X. George Frederick Handel
George Frideric Handel
George Frederic Handel
George Frederic Handel
George Frederic Handel
George Frederic Handel
George Frederic Handel
George Frideric Handel
Georg Frideric Handel
from Alibris George Frideric Handel
George Frideric Handel
up artist Handel, George F.
by GF Handel
piece by Handel, Georg Friedrich

4. save two- and three-word names

Scan the list of strings and store those consisting of two or three words, provided that they do not contain stop words.

In addition, count how often they are found.

Name form                  Count
George Frederic Handel     5
George Frideric Handel     2
George F. Handel           1
George Frederick Handel    1
Georg Frideric Handel      1
by GF Handel               1

For each last-name/years combination, the form that was found most often is used.
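
A minimal sketch of counting name forms and selecting the most frequent one per last-name/years combination. The input format (pairs of a cleaned name and a year interval), the use of the last word as last name, and the name `best_forms` are assumptions for illustration.

```python
from collections import Counter, defaultdict


def best_forms(candidates, stop_words=frozenset()):
    """Map (last name, years) to the most frequent two- or three-word form."""
    counts = defaultdict(Counter)
    for name, years in candidates:
        words = name.split()
        if len(words) not in (2, 3):
            continue                       # only two- and three-word names
        if any(word.lower() in stop_words for word in words):
            continue                       # discard strings with stop words
        counts[(words[-1], years)][name] += 1
    return {key: forms.most_common(1)[0] for key, forms in counts.items()}


years = (1685, 1759)
candidates = [
    ("George Frederic Handel", years),
    ("George Frideric Handel", years),
    ("George Frederic Handel", years),
    ("X. George Frederick Handel", years),   # four words: skipped
    ("George Frederic Handel", years),
    ("by GF Handel", years),
]
print(best_forms(candidates))
```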

Unexpected Observations

- Franz-Eugen Schlachter (1859 – 1911) has 64,500 hits, but all from the same server!

It concerns an on-line bible, where each bible page is implemented as a separate web page, with Franz-Eugen Schlachter in the title.

We can use the similar pages information that Google gives, to filter these out.

- Koop Juliana (1948 - 1980) has 8,200 hits. “Koop Juliana” results in considerably fewer hits than “Juliana (1948 – 1980)”. That can be an indication that the first name is not correct.

Number of Persons Found

1000 – 1099: 40
1100 – 1199: 42
1200 – 1299: 79
1300 – 1399: 106
1400 – 1499: 357
1500 – 1599: 1050
1600 – 1699: 2258
1700 – 1799: 7239
1800 – 1899: 28637
1900 – 1999: 12101

Total: 51909

Top 16 born between 1500 and 1599

1 William Shakespeare (1564 - 1616) 51300
2 Rene Descartes (1596 - 1650) 33400
3 Galileo Galilei (1564 - 1642) 27300
4 Francis Bacon (1561 - 1626) 25200
5 John Dowland (1563 - 1626) 25000
6 Orlandus Lassus (1532 - 1594) 23200
7 Johannes Kepler (1571 - 1630) 22700
8 Thomas Hobbes (1588 - 1679) 15400
9 Frescobaldi Girolamo (1583 - 1643) 11900
10 Claudio Monteverdi (1567 - 1643) 11600
11 Peter Paul Rubens (1577 - 1640) 11400
12 Tycho Brahe (1546 - 1601) 11000
13 Michel de Montaigne (1533 - 1592) 10700
14 John Calvin (1509 - 1564) 9990
15 Elizabeth I (1558 - 1603) 7520
16 Andrea Palladio (1508 - 1580) 7140
17 Gibbons Orlando (1508 – 1580) 7030
18 Nicolas Poussin (1594 - 1665) 6790

Top 16 born between 1600 and 1699

1 Johann Sebastian Bach (1685 - 1750) 86600
2 Antonio Vivaldi (1678 - 1741) 39700
3 Henry Purcell (1659 - 1695) 37600
4 Georg Philipp Telemann (1681 - 1767) 35700
5 Georg Friedrich Haendel (1685 - 1759) 33600
6 Voltaire (1694 - 1778) 32800
7 Isaac Newton (1642 - 1727) 31700
8 Domenico Scarlatti (1685 - 1757) 28300
9 Arcangelo Corelli (1653 - 1713) 27300
10 Francois Couperin (1668 - 1733) 27100
11 Jean-Philippe Rameau (1683 - 1764) 26700
12 Alessandro Scarlatti (1660 - 1725) 25600
13 Tomaso Albinoni (1671 - 1751) 25000
14 Jean-Baptiste Lully (1632 - 1687) 24900
15 Giuseppe Tartini (1692 - 1770) 23800
16 de la Barca (1600 - 1681) 23000
17 John Locke (1632 - 1704) 22800
18 Blaise Pascal (1623 - 1662) 22700

Top 16 born between 1700 and 1799

1 Wolfgang Amadeus Mozart (1756 - 1791) 79000
2 Ludwig van Beethoven (1770 - 1827) 69400
3 Franz Schubert (1797 - 1828) 62300
4 Napoleon Bonaparte (1769 - 1821) 61500
5 Joseph Haydn (1732 - 1809) 50300
6 Johann Wolfgang Goethe (1749 - 1832) 45800
7 Immanuel Kant (1724 - 1804) 35800
8 Gioacchino Rossini (1792 - 1868) 34300
9 Benjamin Franklin (1706 - 1790) 28600
10 Washington Irving (1783 - 1859) 26900
11 Luigi Boccherini (1743 - 1805) 25100
12 Luigi Cherubini (1760 - 1842) 24100
13 William Blake (1757 - 1827) 22000
14 Arthur Schopenhauer (1788 - 1860) 21900
15 Thomas Jefferson (1743 - 1826) 20100
16 Jean-Jacques Rousseau (1712 - 1778) 19400
17 Boyce William (1711 - 1779) 17400
18 Heinrich Heine (1797 - 1856) 15900

Top 16 born between 1800 and 1899

1 Charles Darwin (1809 - 1882) 73400
2 Albert Einstein (1879 - 1955) 70500
3 Johannes Brahms (1833 - 1897) 60600
4 James Joyce (1882 - 1941) 59300
5 Peter Iljitsch Tschaikowsky (1840 - 1893) 47600
6 Robert Schumann (1810 - 1856) 45300
7 Frederic Chopin (1810 - 1849) 41200
8 Giuseppe Verdi (1813 - 1901) 41100
9 Claude Debussy (1862 - 1918) 39400
10 Winston Churchill (1874 - 1965) 39300
11 Franz Liszt (1811 - 1886) 38500
12 Richard Wagner (1813 - 1883) 38300
13 Richard Strauss (1864 - 1949) 37800
14 Antonin Dvorak (1841 - 1904) 35700
15 Maurice Ravel (1875 - 1937) 35300
16 Gustav Mahler (1860 - 1911) 34300

Top 16 born between 1900 and 1999

Rank | 16 Nov. 2004 | 29 Nov. 2004
1 | Ronald Reagan (1911 - 2004) 44800 | Yasser Arafat (1929 - 2004) 84200
2 | Benjamin Britten (1913 - 1976) 31700 | Ronald Reagan (1911 - 2004) 46600
3 | John Peel (1939 - 2004) 27400 | Benjamin Britten (1913 - 1976) 32000
4 | Samuel Barber (1910 - 1981) 26600 | Samuel Barber (1910 - 1981) 26300
5 | John Fitzgerald Kennedy (1917 - 1963) 24100 | John Peel (1939 - 2004) 21700
6 | Robertson Davies (1913 - 1995) 18900 | Robertson Davies (1913 - 1995) 18800
7 | Yasser Arafat (1929 - 2004) 16600 | John F. Kennedy (1917 - 1963) 17300
8 | Peter Ustinov (1921 - 2004) 16500 | Peter Ustinov (1921 - 2004) 16700
9 | Kurt Cobain (1967 - 1994) 14800 | Kurt Cobain (1967 - 1994) 14400
10 | Salvador Dali (1904 - 1989) 14600 | Salvador Dali (1904 - 1989) 14000
11 | Christopher Reeve (1952 - 2004) 13900 | Jon Lee (1968 - 2002) 13900
12 | Jon Lee (1968 - 2002) 13900 | Marlon Brando (1924 - 2004) 11200
13 | Marlon Brando (1924 - 2004) 11200 | Christopher Reeve (1952 - 2004) 10800
14 | Van Gogh (1957 - 2004) 10900 | Jean-Paul Sartre (1905 - 1980) 9790
15 | Albert Camus (1913 - 1960) 9730 | Chostakovitch Dimitri (1906 - 1975) 9640
16 | Jean-Paul Sartre (1905 - 1980) 9630 | Albert Camus (1913 - 1960) 9180
17 | Ted Hughes (1930 - 1998) 8970 | Van Gogh (1957 - 2004) 9050
18 | Jim Morrison (1943 - 1971) 8930 | Steve Reich (1965 - 1995) 8370

Top 16 born between 1000 and 1999

1 Johann Sebastian Bach (1685 - 1750) 86600
2 Wolfgang Amadeus Mozart (1756 - 1791) 79000
3 Charles Darwin (1809 - 1882) 73400
4 Albert Einstein (1879 - 1955) 70500
5 Ludwig van Beethoven (1770 - 1827) 69400
6 Franz Schubert (1797 - 1828) 62300
7 Napoleon Bonaparte (1769 - 1821) 61500
8 Johannes Brahms (1833 - 1897) 60600
9 James Joyce (1882 - 1941) 59300
10 Leonardo da Vinci (1452 - 1519) 53400
11 William Shakespeare (1564 - 1616) 51300
12 Joseph Haydn (1732 - 1809) 50300
13 Peter Iljitsch Tschaikowsky (1840 - 1893) 47600
14 Johann Wolfgang Goethe (1749 - 1832) 45800
15 Robert Schumann (1810 - 1856) 45300
16 Ronald Reagan (1911 - 2004) 44800

Testing recall

Herinneringen in Steen: 195 persons, recall: 0.77

150 found: James Baldwin, Olaf Palme, Simone Signoret, Henry Moore, Carel Willink, Joan Miro, Theolonius Monk, Georges Brassens, John Lennon, Jean-Paul Sartre, Simone de Beauvoir, Mae West, Kurt Gödel, Elvis Presley, Maria Callas, Charlie Chaplin, Benjamin Britten, Paul Robeson, Mao Zedong, Agatha Christie, Lotte Lehmann, Robert Stolz, Edward Kennedy, Pablo Picasso, Pablo Casals, Maurits Cornelis Escher, Ezra Pound, Jim Morrison, Louis Armstrong, Igor Stravinsky, Jimi Hendrix, Barnett Newman, Charles de Gaule, Judy Garland, Dwight David Eisenhower, Ho Tsji Minh, Martin Luther King, Robert Kennedy, Erneste Guevara, John William Coltrane,…

45 not found: Louis Paul Boon, Adriaan Roland Holst, Stijn Streuvels, Ernest Claes, Johannes XXIII, Dag Hammarskjöld, William Christopher Handy, Lucien Guitry, Antony Fokker, Pieter Jelles Troelstra, Paul van Ostaijen, Hugo Verriest,…

Testing recall

Het Kunst Boek: the first 200 (dead) persons, recall: 0.84

167 found: Jaques-Laurent Agasse, Josef Albers, Allesandro Algardi, Washington Allston, Jacopo Amigoni, Fra Angelico, Antonello da Messina, Alexander Archipenko, Giuseppe Arcimboldo, Hendrick Avercamp, Francis Bacon, Giacomo Balla, Fra Bartolommeo, Jean-Michel Basquiat, Jacopo Bassano, Pompeo Batoni, Willi Baumeister, Frederic Bazille, Domenico Beccafumi, Max Beckmann, Gentille Bellini, Giovanni Bellini, Hans Bellmer, Gianlorenzo Bernini, Josef Beuys, Albert Bierstadt,…

33 not found: Andrea del Sarto, Sofonisba Anguissola, Jean Arp, John James Audubon, Hans Baldung, Andre Beauneveu, Bernardo Bellotto, George Bellows,…

Testing recall

The Science Book: the 156 (dead) persons, recall: 0.70

109 found: Leon Battista Alberti, Nicolas Copernicus, Andreas Vesalius, Conrad Gesner, Tycho Brahe, William Gilbert, Johannes Kepler, Galileo Galilei, John Napier, William Harvey, Blaise Pascal, Pierre de Fermat, Christiaan Huygens, James Clerk Maxwell, Robert Boyle, Nicolaus Steno, Giovanni Domenico Cassini, Isaac Newton, Edmond Halley, Carolus Linnaeus, Lazzaro Spallanzani, Johan Heinrich Lambert, Joseph Priestley, Antoine Laurent Lavoisier, William Herschel, Henry Cavendish, James Hutton, Edward Jenner, Pierre-Simon Laplace, Georges Cuvier, Thomas Robert Malthus, Alexander von Humboldt, Allesandro Volta, Thomas Young,...

47 not found: Fibonacci, Piero della Francesca, Jeremiah Horrocks, Antoni van Leeuwenhoek, Rudolph Jacob Camerarius, George Hadley, Carl Wilhelm Scheele, James Hall, Joseph von Frauenhofer, William Smith,…

Testing precision

Counting false positives:

Positions          Precision
4900 – 4999        0.90
9900 – 9999        0.88
14900 – 14999      0.96
19900 – 19999      0.97

Povijest Jugoslavije (1918 - 1991)
Oeuvre Poetique (1925 - 1965)
Alabama Wills (1808 – 1870)
Black Tennesseans (1900 - 1930)
Nippon Porcelain (1891 - 1921)
Personal Favorites (1977 - 1998)
Wheeling Glass (1829 - 1939)
Political Impact (1770 - 1814)
Movie Set (1959 - 1980)
Transatlantic Dialogues (1775 - 1815)
Sailing Navy (1775 - 1854)
Home Children (1869 - 1930)
Peace Pilgrim (1908 - 1981)
Briton Riviere (1840 - 1920)
La Regle (1917 - 1947)
Farm Tractors (1890 - 1960)
Western Warfare (1775 - 1882)
Le Peintre (1877 - 1968)
Exakta Cameras (1933 - 1978)
Offene Briefe (1945 - 1968)
Portraitmatilde Muti (1862 - 1943)
Nature Morte (1946 - 1993)
Dessins Inconnus (1901 - 1954)
Jacques Lacan-Seminaires (1952 - 1980)
Legendary Parties (1922 - 1972)
Memory Joggers (1940 - 1989)
Klondike Ho (1897 - 1997)
Events From (1907 - 1977)

estimated precision for first 5000: 0.90

Some observations

- Composers dominate the top for some centuries.

- Recently deceased persons have relatively high scores.

- Person names only consisting of one word, such as pseudonyms Voltaire, Caravaggio, and Nadar are not yet found.

- Likewise, names consisting of four or more words are not yet found, such as Joost van den Vondel.

- Also, persons that died as teenagers are not found, such as Jeanne d’Arc and Anne Frank.

- More advanced approximate pattern matching is required to better cluster the name variations of one person and potential errors in years.

Concluding remarks

- Enumeration search offers an interesting approach to find more-of-the-same, since it is generally applicable.

- The famous-persons case study indicates that non-trivial results can already be obtained with simple techniques.

- Further research: extend the case study to also include information on nationality, profession, etc. of persons. Automatically search for biographic data.

- Other intended application domains: music and medical domain.

Fun Section

Election of ‘De Grootste Nederlander’: Vincent van Gogh

Fun Section

Persons who were born and died in the same years:

Sir Christopher Wren (1632 – 1723)
Anthony van Leeuwenhoek (1632 – 1723)

Leo Tolstoy (1828 - 1910)
Henri Dunant (1828 - 1910)

Edouard Manet (1832 - 1883)
Gustave Dore (1832 - 1883)

JRR Tolkien (1892 – 1973)
Pearl Buck (1892 – 1973)

Miles Davis (1926 – 1991)
Klaus Kinski (1926 – 1991)